
Video Mining and Spatial-Temporal Pattern Discovery

Organizing multimedia content with as few labeled examples as possible is a problem of both theoretical and practical interest. This work is concerned with unsupervised learning of temporal structures, i.e., simultaneously finding a statistical description for similar, repetitive segments and locating them in the original sequences. Examples of interesting structures include large camera motion followed by audience cheering in sports highlights, or suspicious human motion co-occurring with sound picked up by a surveillance setup. We approach the problem from two directions: (1) discovery of video structure by unsupervised learning, where our current solution uses dynamic graphical models with automatic adaptation of the model size and the feature set; and (2) association of meanings with the discovered structures using metadata streams, where our current approach analyzes co-occurrence between the identified structures and the speech transcript and refines the co-occurrence statistics with machine translation techniques. Future investigations will focus on multimodal fusion, scalability across semantic levels, and applications to multimedia retrieval.

Part I

Unsupervised Discovery of Video Structure with Statistical Temporal Models

This part of the work presents a computational framework for modeling recurrent temporal events in diverse domains [icme03, VideoMining03], and algorithms for automatically grouping content descriptors for the relevant set of events [icip03].

Part II

Finding Meaningful Video Structure in News with Associated Text

This part is concerned with automatically associating semantic meanings with the large set of discovered temporal structures [icip04].
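
As a rough illustration of the co-occurrence analysis between structure labels and transcript words, the following minimal sketch counts how often each label and word share a segment and scores a pair with pointwise mutual information (PMI). The input format, the function names `cooccurrence_counts` and `pmi`, and the toy data are assumptions made for this example. PMI is used only as a familiar scoring choice; this page does not state which statistic the authors use, and the machine-translation refinement is not shown.

```python
from collections import Counter
import math


def cooccurrence_counts(segments):
    """Count how often each (structure label, word) pair shares a segment.

    `segments` is a list of (label, words) pairs (a hypothetical format):
    `label` is the structure label assigned to a video segment and `words`
    is the list of transcript tokens overlapping that segment.
    """
    pair_counts = Counter()
    label_counts = Counter()
    word_counts = Counter()
    for label, words in segments:
        label_counts[label] += 1
        for word in set(words):  # count each word at most once per segment
            word_counts[word] += 1
            pair_counts[(label, word)] += 1
    return pair_counts, label_counts, word_counts


def pmi(label, word, pair_counts, label_counts, word_counts, n_segments):
    """Pointwise mutual information (log base 2) between a label and a word."""
    joint = pair_counts[(label, word)] / n_segments
    if joint == 0:
        return float("-inf")  # the pair never co-occurs
    p_label = label_counts[label] / n_segments
    p_word = word_counts[word] / n_segments
    return math.log2(joint / (p_label * p_word))


# Toy data: two discovered structure labels and made-up transcript words.
segments = [
    ("A", ["president", "said"]),
    ("A", ["president", "congress"]),
    ("B", ["weather", "rain"]),
    ("B", ["rain", "said"]),
]
pairs, labels, words = cooccurrence_counts(segments)
score = pmi("A", "president", pairs, labels, words, len(segments))
print(score)  # 1.0: "president" appears only in segments labeled "A"
```

A positive score suggests the word appears with the label more often than chance would predict; a score near zero suggests no association.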

Part III

Layered Dynamic Mixture Model for Multimodal Pattern Discovery across Asynchronous Streams

This part is concerned with inferring frequent patterns from the joint statistics of a set of streams with different information rates, e.g., audio, video, and text.
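
To make the asynchrony concrete, here is a minimal sketch of one simple way to bring streams with different rates onto a shared timeline: grouping timestamped observations into fixed-length windows. This is only an illustrative preprocessing step, not the layered dynamic mixture model itself. The function name `bucket_streams`, the window length, and the toy streams are all assumptions made for this example.

```python
from collections import defaultdict


def bucket_streams(streams, window):
    """Group timestamped observations from several streams into shared windows.

    `streams` maps a stream name to a list of (time_in_seconds, value) pairs.
    Returns {window_index: {stream_name: [values, ...]}}.
    """
    buckets = defaultdict(lambda: defaultdict(list))
    for name, observations in streams.items():
        for t, value in observations:
            buckets[int(t // window)][name].append(value)
    # Convert the nested defaultdicts to plain dicts, ordered by window index.
    return {index: dict(per_stream) for index, per_stream in sorted(buckets.items())}


# Toy data: video features arrive every 0.5 s, transcript words arrive sparsely.
streams = {
    "video": [(0.0, "v0"), (0.5, "v1"), (1.0, "v2"), (1.5, "v3")],
    "text": [(0.7, "goal"), (1.9, "replay")],
}
aligned = bucket_streams(streams, window=1.0)
print(aligned[0])  # {'video': ['v0', 'v1'], 'text': ['goal']}
```

Each window then holds a variable number of observations per stream, which is the kind of uneven input a multimodal model has to handle.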

Prospective extension

Multi-stream Temporal Event Mining in AV Sensor Surveillance System

This extension addresses the generalized pattern mining problem in unedited, distributed multi-sensor systems.


Structure Parsing for Sports Videos Using Hidden Markov Models

The unsupervised learning framework in Part I has been evaluated on various sports videos, where the results coincide with the domain insights obtained from supervised learning techniques [icassp02, prletter04].
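
As a minimal sketch of how a hidden Markov model parses a sequence into segments, the example below runs Viterbi decoding on a small discrete HMM. The state names ("play", "break"), the observation symbols, and all probabilities are invented for illustration and are not taken from the cited work. In the unsupervised setting described here, the parameters would be estimated from data rather than set by hand.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a discrete HMM."""

    def log(p):
        # Work in log space to avoid underflow on long sequences.
        return math.log(p) if p > 0 else float("-inf")

    first = observations[0]
    scores = [{s: log(start_p[s]) + log(emit_p[s][first]) for s in states}]
    backptr = [{}]
    for obs in observations[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in states:
            # Best predecessor of state s at this time step.
            best_prev = max(states, key=lambda r: prev[r] + log(trans_p[r][s]))
            current[s] = (
                prev[best_prev] + log(trans_p[best_prev][s]) + log(emit_p[s][obs])
            )
            pointers[s] = best_prev
        scores.append(current)
        backptr.append(pointers)

    # Trace back from the best final state.
    state = max(states, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(backptr[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical two-state model: "play" tends to emit high motion, "break" low motion.
states = ["play", "break"]
start_p = {"play": 0.5, "break": 0.5}
trans_p = {
    "play": {"play": 0.8, "break": 0.2},
    "break": {"play": 0.2, "break": 0.8},
}
emit_p = {
    "play": {"high": 0.9, "low": 0.1},
    "break": {"high": 0.2, "low": 0.8},
}
observations = ["high", "high", "low", "low", "low", "high"]
path = viterbi(observations, states, start_p, trans_p, emit_p)
print(path)  # ['play', 'play', 'break', 'break', 'break', 'play']
```

Runs of identical states in the decoded path correspond to segments, so the path above splits the sequence into three segments.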

Publications and Reports
See the list of publications on the publications page, and a set of overview slides here.

Lexing Xie
Shih-Fu Chang

Last update: October 6, 2004