

Winston Hsu. An Information-Theoretic Framework towards Large-Scale Video Structuring, Threading, and Retrieval. PhD Thesis, Graduate School of Arts and Sciences, Columbia University, 2007.


Copyright notice: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

Note on this paper

Advisor: Prof. Chang


Abstract

Video and image retrieval has been an active and challenging research area due to the explosive growth of online video data, personal video recordings, digital photos, and broadcast news videos. To manage and use such enormous multimedia resources effectively, users need to be able to access, search, or browse video content at the semantic level. Current solutions rely primarily on text features and do not utilize rich multimodal cues. Work that does explore multimodal features often uses manually selected features and/or ad hoc models, and thus lacks scalability to general applications. To fully exploit the potential of integrating multimodal features and to ensure the generality of solutions, this thesis presents a novel, rigorous framework and new statistical methods for video structuring, threading, and search in large-scale video databases.

We focus on several fundamental problems in video indexing and retrieval: (1) How to select and fuse a large number of heterogeneous multimodal features from image, speech, audio, and text? (2) How to automatically discover and model mid-level features for multimedia content? (3) How to model similarity between multimodal documents such as news videos or multimedia web documents? (4) How to exploit unsupervised methods in video search to boost performance in an automatic fashion?

To address these problems, our main contributions include the following. First, we extend the Maximum Entropy model to fuse diverse perceptual features from multiple levels and modalities, and demonstrate significant performance improvement in broadcast news video segmentation. Second, we propose an information-theoretic approach to automatically construct mid-level representations. It is the first work to remove the dependency on manual and labor-intensive processes in developing mid-level feature representations from low-level features. Third, we introduce new multimodal representations based on visual duplicates, cue word clusters, high-level concepts, etc. to compute similarity between multimedia documents. Using these new similarity metrics, we demonstrate significant gain in multilingual cross-domain topic tracking.

Lastly, to improve automatic image and video search performance, we propose two new methods for reranking initial video search results that were obtained from text keywords only. At the image/video level, we apply the information bottleneck principle to discover image clusters in the initial search results, and then rerank the images based on cluster-level relevance scores and the occurrence frequency of images. This method is efficient and generic, and is applicable to reranking initial search results produced by other search approaches, such as content-based image search or semantic concept-based search. At the multimedia document level, building on the multimodal document similarities, we propose a random walk framework for reranking the initial text-based video search results. Significant performance improvement is demonstrated in comparison with text-based reranking methods. In addition, we have studied the application and optimal parameter settings of the power method in solving the multimodal random walk problems.

All of our experiments are conducted using large-scale, diverse video data such as the TRECVID benchmark data set, which includes more than 160 hours of broadcast videos from multiple international channels.
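The random walk reranking idea mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the formulation used in the thesis, which the abstract does not give. It assumes a generic random walk with restart: pairwise document similarities are row-normalized into transition probabilities, the initial text-search scores serve as the restart prior, and the stationary scores are found with the power method. The function name `rerank_by_random_walk`, the parameter `alpha`, and the toy inputs are all hypothetical.

```python
def rerank_by_random_walk(similarity, text_scores, alpha=0.8, tol=1e-10, max_iter=1000):
    """Rerank documents by the stationary scores of a random walk with restart.

    similarity  -- square list of lists of nonnegative pairwise similarities (assumed input)
    text_scores -- nonnegative initial text-search scores, used as the restart prior
    alpha       -- assumed probability of following a similarity link instead of restarting
    """
    n = len(text_scores)

    # Row-normalize similarities into transition probabilities.
    # A document with no neighbors jumps uniformly to any document.
    transition = []
    for row in similarity:
        total = sum(row)
        transition.append([v / total for v in row] if total > 0 else [1.0 / n] * n)

    # Normalize the text scores into a probability distribution.
    prior_total = sum(text_scores)
    prior = [s / prior_total for s in text_scores]

    # Power method: repeatedly apply the update until the scores stop changing.
    scores = prior[:]
    for _ in range(max_iter):
        new_scores = [
            alpha * sum(scores[i] * transition[i][j] for i in range(n))
            + (1 - alpha) * prior[j]
            for j in range(n)
        ]
        delta = sum(abs(a - b) for a, b in zip(new_scores, scores))
        scores = new_scores
        if delta < tol:
            break

    ranking = sorted(range(n), key=lambda j: scores[j], reverse=True)
    return ranking, scores


# Toy example: document 1 has a weak text score but is very similar to document 0.
similarity = [
    [0.0, 0.9, 0.1],
    [0.9, 0.0, 0.1],
    [0.1, 0.1, 0.0],
]
text_scores = [0.5, 0.1, 0.4]
ranking, scores = rerank_by_random_walk(similarity, text_scores)
print(ranking)  # [0, 1, 2]: document 1 moves ahead of document 2
```

In this toy case the text-only order would be 0, 2, 1. The walk promotes document 1 because it is strongly linked to the top-scoring document. The value of `alpha` controls how much the similarity links outweigh the original text scores, and it is the kind of parameter whose setting the abstract says was studied.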


Contact

Winston Hsu

BibTex Reference

   Author = {Hsu, Winston},
   Title = {An Information-Theoretic Framework towards Large-Scale Video Structuring, Threading, and Retrieval},
   School = {Graduate School of Arts and Sciences, Columbia University},
   Year = {2007}


