Columbia University
Electrical Engineering Department

Current Projects
Semantic Concept Detection with Multiple Modality Learning


Introduction

Semantic concept detection in multimedia data such as images and videos has become increasingly critical for organizing, browsing, and retrieving multimedia assets. Traditional approaches focus mainly on the visual aspect, i.e., extracting visual features to train various concept detectors. However, multimedia data are generally associated with many types of information, e.g., textual descriptions such as ASR transcripts, metadata such as authorship, visual features such as color and texture, and audio features such as MFCCs. Besides the visual modality, the audio and textual modalities are also indispensable for effective multimedia classification. Our goal is to develop multi-modal learning algorithms that exploit the complementary strengths of different modalities for better multimedia concept detection.
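
As a concrete illustration of combining modalities, the minimal sketch below builds a joint audio-visual descriptor from a decoded RGB frame and a mono audio waveform. The color-histogram size, the MFCC settings, and the simple early-fusion concatenation are illustrative assumptions, not the features used in the projects described below.

    import numpy as np
    import librosa

    def visual_feature(frame, bins=8):
        """Per-channel color histogram of an RGB frame, L1-normalized."""
        hists = [np.histogram(frame[..., c], bins=bins, range=(0, 255))[0]
                 for c in range(3)]
        h = np.concatenate(hists).astype(float)
        return h / (h.sum() + 1e-8)

    def audio_feature(y, sr):
        """Clip-level MFCC statistics (mean and standard deviation)."""
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
        return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

    def multimodal_feature(frame, y, sr):
        """Simple early fusion: concatenate the visual and audio descriptors."""
        return np.concatenate([visual_feature(frame), audio_feature(y, sr)])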


Development of a Large-scale Benchmark Consumer Video Set

We have developed Kodak's consumer video benchmark data set, which includes (1) a significant number of videos from actual users, (2) a rich lexicon that accommodates consumers' needs, and (3) annotation of a subset of concepts over the entire data set. To the best of our knowledge, this is the first systematic work in the consumer domain aimed at defining a large lexicon, constructing a large benchmark data set, and annotating videos in a rigorous fashion. This effort provides a sound foundation for developing and evaluating large-scale, learning-based semantic indexing and annotation techniques in the consumer domain.

The Kodak consumer video data set is available here.


Large-scale Multimodal Semantic Concept Detection for Consumer Videos

Building on this benchmark, we present a systematic study of automatic classification of consumer videos. Our goals are to assess the state of the art of multimedia analytics (including both audio and visual analysis) for consumer video classification and to discover new research opportunities. We investigated several statistical approaches built upon global/local visual features, audio features, and audio-visual combinations. Three multi-modal fusion frameworks (ensemble fusion, context fusion, and joint boosting) were also evaluated. Experimental results show that visual and audio models perform best on different sets of concepts, and both contribute significantly to multimodal fusion by expanding the classifier pool for context fusion and the feature bases for feature sharing. The fused multimodal models significantly reduce detection errors compared to single-modality models, resulting in a promising accuracy of 83% over diverse concepts. To the best of our knowledge, this is the first systematic investigation of multimodal classification using a large-scale ontology and a realistic video corpus.
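
The sketch below illustrates the simplest of the three fusion frameworks, ensemble (late) fusion, assuming precomputed per-modality feature matrices and binary concept labels. The RBF-kernel SVMs and the fixed fusion weight w_visual are illustrative assumptions, not the configuration used in the reported experiments.

    from sklearn.svm import SVC

    def train_ensemble_fusion(X_visual, X_audio, y, w_visual=0.5):
        """Train one SVM per modality and fuse their posterior scores."""
        clf_v = SVC(kernel="rbf", probability=True).fit(X_visual, y)
        clf_a = SVC(kernel="rbf", probability=True).fit(X_audio, y)

        def predict(Xv, Xa):
            p_v = clf_v.predict_proba(Xv)[:, 1]   # P(concept present | visual), binary labels assumed
            p_a = clf_a.predict_proba(Xa)[:, 1]   # P(concept present | audio)
            return w_visual * p_v + (1.0 - w_visual) * p_a   # weighted-average fusion

        return predict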


Joint Feature Subspace and SVM Learning in the Semi-supervised Setting

One difficulty for multi-modal fusion is the curse of dimensionality: the concatenated multi-modal feature vector has very high dimensionality, while labeled training data are relatively scarce because manual annotation is expensive. Semi-supervised learning leverages the large amount of unlabeled data to develop effective classifiers, and feature subspace learning finds optimal feature subspaces for representing data and aiding classification. We present a novel algorithm, Locality Preserving Semi-supervised Support Vector Machines (LPSSVM), that jointly learns an optimal feature subspace and a large-margin SVM classifier. Over both labeled and unlabeled data, the learned feature subspace maintains the smoothness of local neighborhoods while remaining discriminative for classification; simultaneously, an SVM classifier is optimized in this subspace to have a large margin. The resulting classifier can readily handle unseen test data. Additionally, we show that LPSSVM can be used in a Reproducing Kernel Hilbert Space for nonlinear classification. We evaluate the proposed algorithm on the challenging Kodak consumer video data set for semantic concept detection with high-dimensional audio-visual features, and the promising results clearly confirm the effectiveness of our method.
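
The sketch below is not the joint LPSSVM optimization; it is a simplified two-stage stand-in that first learns a locality-preserving projection from labeled and unlabeled data together and then trains a standard SVM in that subspace, just to make the two ingredients concrete. The neighborhood size, heat-kernel bandwidth, and regularizer are illustrative assumptions.

    import numpy as np
    from scipy.linalg import eigh
    from scipy.spatial.distance import cdist
    from sklearn.svm import SVC

    def lpp_projection(X, n_components=20, k=5, t=1.0, reg=1e-4):
        """Locality-preserving projection learned from labeled + unlabeled data."""
        n = X.shape[0]
        d2 = cdist(X, X, "sqeuclidean")
        W = np.exp(-d2 / t)                       # heat-kernel affinities
        nn = np.argsort(d2, axis=1)[:, 1:k + 1]   # k nearest neighbors (skip self)
        mask = np.zeros((n, n), dtype=bool)
        mask[np.repeat(np.arange(n), k), nn.ravel()] = True
        W = np.where(mask | mask.T, W, 0.0)       # sparse, symmetric neighborhood graph
        D = np.diag(W.sum(axis=1))
        L = D - W                                 # graph Laplacian
        A = X.T @ L @ X + reg * np.eye(X.shape[1])
        B = X.T @ D @ X + reg * np.eye(X.shape[1])
        _, vecs = eigh(A, B)                      # generalized eigenproblem A v = w B v
        return vecs[:, :n_components]             # columns are projection directions

    # Usage: labeled (X_l, y_l) and unlabeled X_u jointly shape the subspace,
    # then a standard SVM is trained on the projected labeled data.
    # P = lpp_projection(np.vstack([X_l, X_u]))
    # clf = SVC(kernel="rbf").fit(X_l @ P, y_l)
    # scores = clf.decision_function(X_test @ P)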


Current Work

I am currently developing a multi-scale visual-audio synchronization system. Through spatio-temporal analysis of visual appearance and time-frequency analysis of the audio signal, we generate and synchronize visual and audio features at multiple scales to support multimedia concept detection.
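
A rough sketch of the audio side of this idea follows: MFCCs are computed at several analysis-window lengths and pooled to the video frame rate, so each video frame carries audio features at multiple temporal scales. The window lengths, the 30 fps frame rate, and the function name are assumptions for illustration, not the actual system.

    import numpy as np
    import librosa

    def multiscale_audio_features(audio_path, fps=30.0, scales=(0.025, 0.1, 0.4)):
        """MFCCs at several window lengths, aligned to one vector per video frame."""
        y, sr = librosa.load(audio_path, sr=None)
        hop = int(sr / fps)                          # one audio hop per video frame
        n_frames = int(len(y) / sr * fps)
        feats = []
        for win in scales:                           # analysis-window length in seconds
            mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                        n_fft=int(sr * win), hop_length=hop)
            feats.append(mfcc[:, :n_frames])         # truncate to the video timeline
        return np.vstack(feats).T                    # shape: (n_frames, 13 * len(scales))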


Publications

  1. Wei Jiang, Shih-Fu Chang, Tony Jebara, Alexander C. Loui, "Semantic concept classification by joint semi-supervised learning of feature subspaces and support vector machines", ECCV, Marseille, France, pp. 270-283, 2008. PDF
  2. Alexander C. Loui, Jiebo Luo, Shih-Fu Chang, Dan Ellis, Wei Jiang, Lyndon Kennedy, Keansub Lee, Akira Yanagawa, "Kodak's consumer video benchmark data set: concept definition and annotation", MIR, pp. 245-254, 2007. PDF
  3. Shih-Fu Chang, Dan Ellis, Wei Jiang, Keansub Lee, Akira Yanagawa, Alexander C. Loui, Jiebo Luo, "Large-scale multimodal semantic concept detection for consumer video", MIR, pp. 255-264, 2007. PDF
  4. Akira Yanagawa, Alexander C. Loui, Jiebo Luo, Shih-Fu Chang, Dan Ellis, Wei Jiang, Lyndon Kennedy, Keansub Lee, "Kodak consumer video benchmark data set: concept definition and annotation", Columbia University ADVENT Technical Report #222-2008-8, September 2008. PDF