Jump to : Download | Abstract | Contact | BibTex reference | EndNote reference |


Yu-Gang Jiang, Xiaohong Zeng, Guangnan Ye, Subhabrata Bhattacharya, Dan Ellis, Mubarak Shah, Shih-Fu Chang. Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching. In NIST TRECVID Workshop, Gaithersburg, MD, November 2010.

Download [help]

Download paper: Adobe portable document (pdf)

Copyright notice:This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


TRECVID Multimedia Event Detection offers an interesting but very challenging task in detecting highlevel complex events (Figure 1) in user-generated videos. In this paper, we will present an overview and comparative analysis of our results, which achieved top performance among all 45 submissions in TRECVID 2010. Our aim is to answer the following questions. What kind of feature is more effective for multimedia event detection? Are features from different feature modalities (e.g., audio and visual) complementary for event detection? Can we benefit from generic concept detection of background scenes, human actions, and audio concepts? Are sequence matching and event-specific object detectors critical? Our findings indicate that spatial-temporal feature is very effective for event detection, and it’s also very complementary to other features such as static SIFT and audio features. As a result, our baseline run combining these three features already achieves very impressive results, with a mean minimal normalized cost (MNC) of 0.586. Incorporating the generic concept detectors using a graph diffusion algorithm provides marginal gains (mean MNC 0.579). Sequence matching with Earth Mover’s Distance (EMD) further improves the results (mean MNC 0.565). The event-specific detector (“batter”), however, didn’t prove useful from our current re-ranking tests. We conclude that it is important to combine strong complementary features from multiple modalities for multimedia event detection, and cross-frame matching is helpful in coping with temporal order variation. Leveraging contextual concept detectors and foreground activities remains a very attractive direction requiring further research


Yu-Gang Jiang
Guangnan Ye
Subhabrata Bhattacharya
Shih-Fu Chang

BibTex Reference

   Author = {Jiang, Yu-Gang and Zeng, Xiaohong and Ye, Guangnan and Bhattacharya, Subhabrata and Ellis, Dan and Shah, Mubarak and Chang, Shih-Fu},
   Title = {Columbia-UCF TRECVID2010 Multimedia Event Detection: Combining Multiple Modalities, Contextual Concepts, and Temporal Matching},
   BookTitle = {NIST TRECVID Workshop},
   Address = {Gaithersburg, MD},
   Month = {November},
   Year = {2010}

EndNote Reference [help]

Get EndNote Reference (.ref)


For problems or questions regarding this web site contact The Web Master.

This document was translated automatically from BibTEX by bib2html (Copyright 2003 © Eric Marchand, INRIA, Vista Project).