Jump to : Download | Abstract | Contact | BibTex reference | EndNote reference |


P. Natarajan et al. BBN VISER TRECVID 2011 Multimedia Event Detection System. In NIST TRECVID Workshop, Gaithersburg, MD, December 2011.

Download [help]

Download paper: Adobe portable document (pdf)

Copyright notice:This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


We describe the Raytheon BBN (BBN) VISER system that is designed to detect events of interest in multimedia data. We also present a comprehensive analysis of the different modules of that system in the context of the MED 2011 task. The VISER system incorporates a large set of low-level features that capture appearance, color, motion, audio, and audio-visual cooccurrence patterns in videos. For the low-level features, we rigorously analyzed several coding and pooling strategies, and also used state-of-the-art spatio-temporal pooling strategies to model relationships between different features. The system also uses high-level (i.e., semantic) visual information obtained from detecting scene, object, and action concepts. Furthermore, the VISER system exploits multimodal information by analyzing available spoken and videotext content using BBN's state-of-the-art Byblos automatic speech recognition (ASR) and video text recognition systems. These diverse streams of information are combined into a single, fixed dimensional vector for each video. We explored two different combination strategies: early fusion and late fusion. Early fusion was implemented through a fast kernel-based fusion framework and late fusion was performed using both Bayesian model combination (BAYCOM) as well as an innovative a weighted-average framework. Consistent with the previous MED'10 evaluation, low-level visual features exhibit strong performance and form the basis of our system. However, high-level information from speech, video-text, and object detection provide consistent and significant performance improvements. Overall, BBN's VISER system exhibited the best performance among all the submitted systems with an average ANDC score of 0.46 across the 10 MED'11 test events when the threshold was optimized for the NDC score, and <30% missed detection rate when the threshold was optimized to minimize missed detections at 6% false alarm rate

BibTex Reference

   Author = {al, P. Natarajan et},
   Title = {BBN VISER TRECVID 2011 Multimedia Event Detection System},
   BookTitle = {NIST TRECVID Workshop},
   Address = {Gaithersburg, MD},
   Month = {December},
   Year = {2011}

EndNote Reference [help]

Get EndNote Reference (.ref)


For problems or questions regarding this web site contact The Web Master.

This document was translated automatically from BibTEX by bib2html (Copyright 2003 © Eric Marchand, INRIA, Vista Project).