

Guangnan Ye, I-Hong Jhuo, Dong Liu, Yu-Gang Jiang, D.T. Lee, Shih-Fu Chang. Joint Audio-Visual Bi-Modal Codewords for Video Event Detection. In ACM International Conference on Multimedia Retrieval (ICMR), 2012.


Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.


Abstract

Joint audio-visual patterns often exist in videos and provide strong multi-modal cues for detecting multimedia events. However, conventional methods generally fuse the visual and audio information only at a superficial level, without adequately exploring deep intrinsic joint patterns. In this paper, we propose a joint audio-visual bi-modal representation, called bi-modal words. We first build a bipartite graph to model the relations between the quantized words extracted from the visual and audio modalities. The bipartite graph is then partitioned to construct the bi-modal words, which reveal the joint patterns across modalities. Finally, different pooling strategies are employed to re-quantize the visual and audio words into the bi-modal words and form bi-modal Bag-of-Words representations that are fed to subsequent multimedia event classifiers. We experimentally show that the proposed multi-modal feature achieves statistically significant performance gains over methods using individual visual and audio features alone and over alternative multi-modal fusion methods. Moreover, we find that average pooling is the most suitable strategy for bi-modal feature generation.
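The three steps named in the abstract (bipartite graph, partitioning, pooling) can be illustrated with a small sketch. This is not the paper's implementation, and several choices here are assumptions: edge weights are taken to be co-occurrence counts across videos, the partitioning is a Dhillon-style spectral co-clustering restricted to two groups, and all function names and the toy histograms are hypothetical.

```python
import numpy as np

def bipartite_weights(visual_bow, audio_bow):
    # Edge weight between visual word i and audio word j, assumed here to be
    # their co-occurrence across videos (rows are videos, columns are words).
    return visual_bow.T @ audio_bow

def spectral_bipartition(weights):
    # Two-way spectral co-clustering of a bipartite graph (assumed technique).
    # Requires every word to have at least one non-zero edge.
    d_vis = weights.sum(axis=1)
    d_aud = weights.sum(axis=0)
    scaled = weights / np.sqrt(np.outer(d_vis, d_aud))
    u, _, vt = np.linalg.svd(scaled, full_matrices=False)
    # The second singular vector pair carries the two-way split.
    z_vis = u[:, 1] / np.sqrt(d_vis)
    z_aud = vt[1, :] / np.sqrt(d_aud)
    return (z_vis > 0).astype(int), (z_aud > 0).astype(int)

def pool_bimodal(visual_bow, audio_bow, vis_labels, aud_labels,
                 n_groups=2, mode="average"):
    # Re-quantize both histograms into one bin per bi-modal word by pooling
    # the counts of all member words, using average or max pooling.
    joint = np.hstack([visual_bow, audio_bow])
    labels = np.concatenate([vis_labels, aud_labels])
    reduce = np.mean if mode == "average" else np.max
    columns = [reduce(joint[:, labels == k], axis=1) for k in range(n_groups)]
    return np.stack(columns, axis=1)

# Toy data (hypothetical): 4 videos, 4 visual words, 3 audio words.
visual_bow = np.array([[5, 4, 1, 0],
                       [4, 5, 0, 1],
                       [1, 0, 5, 4],
                       [0, 1, 4, 5]])
audio_bow = np.array([[6, 1, 0],
                      [5, 0, 1],
                      [1, 5, 4],
                      [0, 4, 5]])

weights = bipartite_weights(visual_bow, audio_bow)
vis_labels, aud_labels = spectral_bipartition(weights)
features = pool_bimodal(visual_bow, audio_bow, vis_labels, aud_labels)
```

Each row of `features` is a bi-modal Bag-of-Words vector for one video, which could then be passed to an event classifier. Which group receives label 0 or 1 depends on the sign convention of the SVD, so only the grouping itself is meaningful.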


Contact

Guangnan Ye
Dong Liu
Yu-Gang Jiang
Shih-Fu Chang

BibTex Reference

   Author = {Ye, Guangnan and Jhuo, I-Hong and Liu, Dong and Jiang, Yu-Gang and Lee, D.T. and Chang, Shih-Fu},
   Title = {Joint Audio-Visual Bi-Modal Codewords for Video Event Detection},
   BookTitle = {ACM International Conference on Multimedia Retrieval (ICMR)},
   Year = {2012}


