Jump to: Download | Note | Abstract | Contact | BibTex reference | EndNote reference

xie05thesis

Lexing Xie. Unsupervised Pattern Discovery for Multimedia Sequences. PhD thesis, Graduate School of Arts and Sciences, Columbia University, August 2005.

Download

Download paper: Adobe Portable Document Format (PDF)

Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Note on this paper

Advisor: Shih-Fu Chang

Abstract

This thesis investigates the problem of discovering patterns from multimedia sequences. The problem is of interest because capturing and storing large amounts of multimedia data has become commonplace, yet our ability to process, interpret, and use these rich corpora has notably lagged behind. Patterns are the recurrent and statistically consistent units in a data collection; their recurrence and consistency provide a useful basis for organizing large corpora. Unsupervised pattern discovery is important because it allows adaptation to diverse media collections without extensive annotation. Moreover, the patterns should be meaningful, since meanings are what we humans perceive from multimedia. The goal of this thesis is to devise a general framework for finding multi-modal temporal patterns in a collection of multimedia sequences, using the self-similarity in both the appearance and the temporal progression of the content. We address three sub-problems: learning temporal pattern models, associating meanings with patterns, and finding patterns across multiple modalities.

We propose novel models for the discovery of multimedia temporal patterns. We construct dynamic graphical models that capture the multi-level dependency between the audio-visual observations and the events, propose a stochastic search scheme for finding the optimal model size and topology, and develop unsupervised feature grouping for selecting relevant descriptors for temporal streams. We present novel approaches for automatically explaining and evaluating the patterns in multimedia streams. These approaches link the computational representations of the patterns, such as those acquired by a dynamic graphical model, with words in the video stream; the link between the audio-visual patterns and the metadata is established by statistical association. We also develop solutions for finding patterns that reside across multiple modalities. This is realized with a layered dynamic mixture model, in which intra-modality temporal dependency and inter-modality asynchrony are addressed in different parts of the model structure.

With unsupervised pattern discovery, we are able to discover from baseball and soccer programs the common semantic states, {play} and {break}, with accuracies comparable to their supervised counterparts. On a large broadcast news corpus, we find that multimedia patterns correspond well to news topics that have salient audio-visual cues. These findings demonstrate the potential of our framework for mining multi-level temporal patterns from multimodal streams, its adaptability to new content domains, and its extensibility to other applications such as event detection and information retrieval.
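As a toy illustration of the statistical-association idea mentioned in the abstract (linking discovered pattern labels to words by their co-occurrence), the sketch below ranks words by pointwise mutual information with each pattern label. This is a minimal, hypothetical example, not the thesis's actual implementation; the data, function name, and choice of PMI as the association score are all assumptions made for illustration.

```python
from collections import Counter
from math import log2

def pmi_association(segments):
    """Score (label, word) pairs by pointwise mutual information.

    `segments` is a list of (pattern_label, words) pairs, one per
    video segment: the label assigned by an unsupervised model, and
    the words observed in that segment's transcript or metadata.
    Returns a dict mapping (label, word) -> PMI in bits.
    """
    n = len(segments)
    label_count = Counter(label for label, _ in segments)
    word_count = Counter()
    pair_count = Counter()
    for label, words in segments:
        for w in set(words):  # count presence per segment, not frequency
            word_count[w] += 1
            pair_count[(label, w)] += 1
    scores = {}
    for (label, w), c in pair_count.items():
        p_joint = c / n
        p_label = label_count[label] / n
        p_word = word_count[w] / n
        scores[(label, w)] = log2(p_joint / (p_label * p_word))
    return scores

# Hypothetical toy data: two discovered states over six segments.
segments = [
    ("A", ["pitch", "strike"]),
    ("A", ["pitch", "crowd"]),
    ("A", ["pitch"]),
    ("B", ["replay", "crowd"]),
    ("B", ["replay"]),
    ("B", ["crowd"]),
]
scores = pmi_association(segments)
# "pitch" associates strongly with state A, "replay" with state B,
# while "crowd" occurs in both states and scores near or below zero.
```

A positive PMI means the word occurs with that pattern label more often than chance, which is the intuition behind using association statistics to attach words, and hence meanings, to unsupervised patterns.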

Contact

Lexing Xie

BibTex Reference

@PhdThesis{xie05thesis,
   Author = {Xie, Lexing},
   Title = {Unsupervised Pattern Discovery for Multimedia Sequences},
   School = {Graduate School of Arts and Sciences, Columbia University},
   Month = {August},
   Year = {2005}
}

EndNote Reference

Get EndNote Reference (.ref)

 


This document was translated automatically from BibTeX by bib2html (Copyright 2003 © Eric Marchand, INRIA, Vista Project).