
wjiang:phdthesis

Wei Jiang. Advanced Techniques for Semantic Concept Detection in General Videos. PhD thesis, Graduate School of Arts and Sciences, Columbia University, 2010.

Download

Download paper: Adobe Portable Document Format (PDF)

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Note on this paper

Advisor: Prof. Chang

Abstract

The automatic analysis and indexing of multimedia content in general domains are important for a variety of multimedia applications. This thesis investigates the problem of semantic concept detection in general videos, focusing on two advanced directions: multi-concept learning and multi-modality learning. Semantic concept detection refers to the task of assigning an input video sequence one or more labels indicating the presence of one or more semantic concepts in the sequence. Much of the prior work deals with the problem in an isolated manner: a binary classifier is constructed using feature vectors from the visual modality alone to classify whether or not a video contains a specific concept. However, multimedia videos comprise information from multiple modalities (both visual and audio). Each modality carries information that complements the other, and their simultaneous processing can uncover relationships that are unavailable when the modalities are considered separately. In addition, real-world semantic concepts do not occur in isolation, and context information is useful for enhancing the detection of individual concepts.

This thesis explores multi-concept learning and multi-modality learning to improve semantic concept detection in general videos, i.e., videos with general content, captured under uncontrolled conditions. For multi-concept learning, we propose two methods, built on the frameworks of two-layer Context-Based Concept Fusion (CBCF) and single-layer multi-label classification, respectively. The first method represents the inter-conceptual relationships by a Conditional Random Field (CRF). The inputs of the CRF are initial detection probabilities from independent concept detectors; through inference over the concept relations in the CRF, we obtain updated concept detection probabilities as outputs. To avoid the difficulty of designing compatibility potentials in the CRF, a discriminative cost function aimed at class separation is minimized directly. We further extend this method to study an interesting "20 questions problem" for semantic concept detection, where user interaction is incorporated to annotate a small number of key concepts for each data sample, which are then used to improve detection of the remaining concepts. To this end, an active CBCF approach is proposed that chooses the most informative concepts for the user to label.

The second multi-concept learning method does not explicitly model concept relations but optimizes multi-label discrimination for all concepts over all training data through a single-layer joint boosting algorithm. By sharing "good" kernels among different concepts, the accuracy of individual detectors can be improved; by jointly learning common detectors across different classes, the number of kernels and the computational complexity required for detecting individual concepts can be reduced.
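The abstract does not reproduce the CBCF formulation; as an illustrative sketch only (our notation, not necessarily the thesis's), a pairwise CRF of the kind described can be written as

P(\mathbf{y} \mid \mathbf{p}) = \frac{1}{Z(\mathbf{p})} \exp\Big( \sum_{i} \phi_i(y_i; p_i) + \sum_{(i,j)} \psi_{ij}(y_i, y_j) \Big)

where y_i \in \{0,1\} is the presence label of concept i, p_i is the initial probability produced by the i-th independent detector (the CRF's input), \phi_i ties each label to its detector output, and \psi_{ij} encodes pairwise concept compatibility. Inference yields the updated marginals P(y_i = 1 \mid \mathbf{p}); rather than hand-designing the \psi_{ij}, the thesis learns the potentials by directly minimizing a discriminative cost aimed at class separation.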
For multi-modality learning, we develop methods with two strategies: global fusion of features or models from multiple modalities, and construction of a local audio-visual atomic representation that enforces moderate-level audio-visual synchronization. Two algorithms are developed for global multi-modality fusion: the late-fusion audio-visual boosted CRF and the early-fusion audio-visual joint boosting. The first is an extension of the two-layer CBCF multi-concept learning approach above, where the inputs of the CRF include independent concept detection probabilities obtained from visual and audio features individually. The second is an extension of the single-layer multi-label classification approach above, where both visual-based kernels and audio-based kernels are shared by multiple concepts through the joint boosting multi-label concept detector. These two methods naturally combine multi-modality learning and multi-concept learning, exerting the power of both to enhance semantic concept detection.

To analyze moderate-level audio-visual synchronization in general videos, we propose a local audio-visual atomic representation, the Audio-Visual Atom (AVA). We track visually consistent regions in the video sequence to generate visual atoms and, at the same time, locate onsets in the audio soundtrack to generate audio atoms. Visual atoms and audio atoms are then combined to form AVAs, on top of which joint audio-visual codebooks are constructed. These codebooks capture co-occurring audio-visual patterns that are representative of individual concepts and can accordingly improve concept detection.

The contributions of this thesis can be summarized as follows. (1) An in-depth study of jointly detecting multiple concepts in general domains, where concept relationships are hard to compute. (2) The first system to explore the "20 questions" problem for semantic concept detection, by incorporating users' interactions and taking into account joint detection of multiple concepts. (3) An in-depth investigation of combining audio and visual information to enhance detection of generic concepts. (4) The first system to explore a localized joint audio-visual atomic representation for concept detection, under challenging conditions in general domains.
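To make the AVA construction above concrete, the following is a minimal Python sketch, not the thesis implementation: librosa is assumed for audio onset detection, the visual atoms (short region tracks with time spans and feature vectors) are assumed to come from an external region tracker, and form_avas with its windowing parameter is hypothetical.

   # Sketch of Audio-Visual Atom (AVA) formation: pair audio onsets with
   # temporally overlapping visual region tracks. Illustrative only.
   import librosa

   def audio_atoms(wav_path, sr=22050):
       # Locate onset times (in seconds) in the soundtrack.
       y, sr = librosa.load(wav_path, sr=sr)
       return librosa.onset.onset_detect(y=y, sr=sr, units="time")

   def form_avas(visual_atoms, onset_times, window=0.5):
       # visual_atoms: dicts with "start"/"end" times and "features".
       # Each onset falling within a track's padded span yields one AVA.
       avas = []
       for va in visual_atoms:
           for t in onset_times:
               if va["start"] - window <= t <= va["end"] + window:
                   avas.append({"visual": va["features"], "onset_time": t})
       return avas

Joint audio-visual codebooks would then be built by clustering the paired audio and visual features of the resulting AVAs (e.g., with k-means), per the abstract.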

Contact

Wei Jiang

BibTex Reference

@PhdThesis{wjiang:phdthesis,
   Author = {Jiang, Wei},
   Title = {Advanced Techniques for Semantic Concept Detection in General Videos},
   School = {Graduate School of Arts and Sciences, Columbia University},
   Year = {2010}
}

EndNote Reference

Get EndNote Reference (.ref)

 

