Current Projects
Semantic Concept Detection with Multiple Modality Learning
Introduction
Semantic concept detection in multimedia data like images and videos has become an increasingly
critical issue for organizing, browsing, and retrieving multimedia assets. Traditional approaches
mainly focus on visual aspects, i.e., extracting visual features to train various concept detectors.
Multimedia data are generally associated with many types of information, e.g., textual description like
the ASR script, metadata like the authorship, visual features like color and texture, and audio features
like MFCCs. Besides the visual modality, others such as audio and text are also indispensable
for effective multimedia classification. Our goal is to develop multi-modal learning algorithms that can
exploit the advantages of different modalities for better multimedia concept detection.
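As a minimal illustration of exploiting multiple modalities, the simplest strategy is early fusion: concatenate per-modality feature vectors into one representation before training a single detector. The sketch below uses synthetic random features as stand-ins for real MFCC and visual descriptors; all names and dimensions are illustrative, not taken from our system.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-ins: 13-dim "MFCC" audio features and 64-dim "visual" features,
# shifted by the (binary) concept label so the two classes are separable.
n = 200
y = rng.integers(0, 2, size=n)                      # concept present / absent
audio = rng.normal(loc=y[:, None] * 0.8, size=(n, 13))
visual = rng.normal(loc=y[:, None] * 0.5, size=(n, 64))

# Early fusion: concatenate modality features into one vector per clip
fused = np.concatenate([audio, visual], axis=1)     # shape (n, 77)

# Any off-the-shelf classifier can then be trained on the fused representation
clf = LinearSVC(dual=False).fit(fused, y)
```

Early fusion is simple but inflates dimensionality, which is exactly the difficulty addressed by the subspace-learning work described further below.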
Development of a Large-scale Benchmark Consumer Video Set
We have developed Kodak's benchmark consumer data set, which includes (1) a significant number of videos from actual users,
(2) a rich lexicon that accommodates consumers' needs, and (3) the annotation of a subset of concepts over the
entire data set. To the best of our knowledge, this is the first systematic work in the consumer domain aimed at
the definition of a large lexicon, construction of a large benchmark data set, and annotation of videos in a
rigorous fashion. Such an effort will have a significant impact by providing a sound foundation for developing
and evaluating large-scale learning based semantic indexing/annotation techniques in the consumer domain.
Kodak's consumer video data set is available here.
Large-scale Multimodal Semantic Concept Detection for Consumer Videos
In addition, we present a systematic study of automatic classification of consumer videos using the above
benchmark data set. Our goals are to assess the state of the art of multimedia analytics (including both
audio and visual analysis) in consumer video classification and to discover new research opportunities.
We investigated several statistical approaches built upon global/local visual features, audio features,
and audio-visual combinations. Three multi-modal fusion frameworks (ensemble, context fusion, and joint
boosting) are also evaluated. Experimental results show that visual and audio models perform best for
different sets of concepts. Both provide significant contributions to multimodal fusion, via expansion
of the classifier pool for context fusion and the feature bases for feature sharing. The fused multimodal
models are shown to significantly reduce the detection errors (compared to single modality models),
resulting in a promising accuracy of 83% over diverse concepts. To the best of our knowledge, this is
the first work on systematic investigation of multimodal classification using a large-scale ontology
and realistic video corpus.
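The ensemble-style fusion evaluated above can be sketched as late fusion: train one detector per modality, then combine their confidence scores. The sketch below uses equal weights and synthetic features purely for illustration; the fusion frameworks studied in the paper learn the combination rather than fixing it.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Synthetic per-modality features for a binary concept
n = 300
y = rng.integers(0, 2, size=n)
audio = rng.normal(loc=y[:, None] * 0.6, size=(n, 13))
visual = rng.normal(loc=y[:, None] * 0.6, size=(n, 32))

# One detector per modality
clf_a = SVC(kernel="rbf").fit(audio, y)
clf_v = SVC(kernel="rbf").fit(visual, y)

# Late fusion: average the per-modality confidence (decision-function) scores
score = 0.5 * clf_a.decision_function(audio) + 0.5 * clf_v.decision_function(visual)
fused_pred = (score > 0).astype(int)
accuracy = (fused_pred == y).mean()
```

Because each modality detects different concepts well, averaging their scores lets the stronger modality dominate per example, which is one intuition behind the error reductions reported above.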
Joint Feature Subspace and SVM Learning in the Semi-supervised Setting
One difficulty for multi-modal fusion is the curse of dimensionality: the dimensionality of the
concatenated multi-modal feature vector is very high, while labeled training data are relatively scarce due to expensive manual annotation.
Semi-supervised learning leverages the large amount of unlabeled data in developing effective classifiers. Feature subspace learning finds
optimal feature subspaces for representing data and aiding classification. In this work, we present a
novel algorithm, Locality Preserving Semi-supervised Support Vector Machines (LPSSVM), that jointly learns
an optimal feature subspace and a large-margin SVM classifier. Over both labeled and unlabeled
data, a feature subspace is learned that maintains the smoothness of local neighborhoods while
remaining discriminative for classification. Simultaneously, an SVM classifier with a large margin is
optimized in the learned feature subspace. The resulting classifier can be readily used to handle
unseen test data. Additionally, we show that the LPSSVM algorithm can be used in a Reproducing Kernel
Hilbert Space for nonlinear classification. We evaluate the proposed algorithm on the challenging Kodak
consumer video data set for semantic concept detection with high-dimensional audio-visual features.
Promising results are obtained that clearly confirm the effectiveness of our method.
Current Work
I am currently developing a multi-scale visual-audio synchronization system. Through spatio-temporal
analysis of visual appearance and time-frequency analysis of audio signals, we generate and synchronize
visual and audio features at different scales to aid multimedia concept detection.
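As a toy illustration of the multi-scale idea, frame-level features can be pooled over windows of several temporal scales before alignment. This is a generic temporal-pooling sketch, not the actual synchronization system; `multiscale_pool` and its parameters are hypothetical.

```python
import numpy as np

def multiscale_pool(frames, scales=(1, 2, 4)):
    """Average frame-level features over non-overlapping windows at several
    temporal scales, yielding one feature sequence per scale."""
    out = []
    for s in scales:
        T = (len(frames) // s) * s                     # drop any ragged tail
        pooled = frames[:T].reshape(-1, s, frames.shape[1]).mean(axis=1)
        out.append(pooled)
    return out

# 16 frames of 8-dim features (e.g., per-frame visual or audio descriptors)
frames = np.random.default_rng(3).normal(size=(16, 8))
pools = multiscale_pool(frames)
print([p.shape for p in pools])  # [(16, 8), (8, 8), (4, 8)]
```

Visual and audio features pooled at matching scales can then be paired window-by-window for cross-modal analysis.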
Publications
- Wei Jiang, Shih-Fu Chang, Tony Jebara, Alexander C. Loui, "Semantic concept classification
by joint semi-supervised learning of feature subspaces and support vector machines", ECCV,
Marseille, France, pp. 270-283, 2008. PDF
- Alexander C. Loui, Jiebo Luo, Shih-Fu Chang, Dan Ellis, Wei Jiang, Lyndon Kennedy, Keansub Lee, Akira Yanagawa,
"Kodak's consumer video benchmark data set: Concept definition and annotation", MIR, pp. 245-254, 2007.
PDF
- Shih-Fu Chang, Dan Ellis, Wei Jiang, Keansub Lee, Akira Yanagawa, Alexander C. Loui, Jiebo Luo,
"Large-scale multimodal semantic concept detection for consumer video", MIR, pp. 255-264, 2007.
PDF
- Akira Yanagawa, Alexander C. Loui, Jiebo Luo, Shih-Fu Chang, Dan Ellis, Wei Jiang, Lyndon Kennedy, Keansub Lee,
"Kodak consumer video benchmark data set: Concept definition and annotation", Columbia University ADVENT Technical Report
#222-2008-8, September 2008.
PDF