Segmentation, Structure Detection and Summarization of Multimedia Sequences

Hari Sundaram


This thesis investigates the problem of efficiently summarizing audio-visual sequences. The problem is important because consumers now have access to vast amounts of multimedia content that can be viewed on a range of devices.

The goal of this thesis is to provide an adaptive framework for automatically generating a short multimedia clip as a summary, given a longer multimedia segment as input. In our framework, the solution to the summarization problem is predicated on the solution to three sub-problems: segmentation, structure detection, and audio-visual condensation of the data.

In the segmentation problem, we focus on determining computable scenes. These are segments of audio-visual data that are consistent with respect to certain low-level properties and that preserve the syntax of the original video. This work does not address the semantics of the segments, since that is not a well-posed problem. Our approach contains three novel ideas: (a) an analysis of how the rules of film production affect the data, (b) a finite, causal memory model for segmenting audio and video, and (c) top-down structural grouping rules that keep the segmentation consistent with human perception. These scenes form the input to our condensation algorithm.
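The finite, causal memory model can be illustrated with a minimal sketch: a fixed-size buffer holds features of recently seen shots, and a scene boundary is declared when a new shot's coherence with that memory falls below a threshold. The scalar shot features, the similarity measure, and the threshold here are all illustrative assumptions, not the thesis's actual model.

```python
# Hypothetical sketch of a finite, causal memory model for scene
# segmentation. Each shot is reduced to a scalar feature; a new scene
# begins when a shot is insufficiently coherent with the shots held
# in a bounded memory of the recent past.
from collections import deque

def coherence(shot, memory):
    """Mean similarity of a shot feature to the shots in memory."""
    if not memory:
        return 1.0
    sims = [1.0 - min(1.0, abs(shot - m)) for m in memory]
    return sum(sims) / len(sims)

def segment(shots, memory_size=3, threshold=0.5):
    """Return indices where a new computable scene begins."""
    memory = deque(maxlen=memory_size)   # finite, causal: past shots only
    boundaries = [0]
    for i, shot in enumerate(shots):
        if memory and coherence(shot, memory) < threshold:
            boundaries.append(i)
            memory.clear()               # start a fresh scene memory
        memory.append(shot)
    return boundaries
```

For example, `segment([0.1, 0.15, 0.12, 0.9, 0.95])` places a boundary at the abrupt feature change before the fourth shot.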

For the problem of detecting structure, we propose a novel framework that analyzes the topology of the sequence. We limit our scope to discrete, temporal structures whose deterministic generative mechanisms are known a priori. We show two general approaches to solving the problem and present robust algorithms for detecting two specific visual structures: the dialog and the regular anchor.
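As a toy illustration of a deterministic temporal structure, a dialog appears as an alternation between two repeated shot clusters (A B A B ...). The sketch below assumes shots have already been clustered into labels; the scan for maximal alternating runs and the minimum run length are illustrative assumptions, not the thesis's algorithm.

```python
# Hypothetical sketch of dialog detection over a sequence of shot
# cluster labels: find maximal runs where each shot repeats the shot
# two positions earlier, i.e. an A-B-A-B alternation.
def find_dialogs(labels, min_len=4):
    """Return (start, end) index pairs of maximal alternating runs."""
    dialogs = []
    i = 0
    while i + 1 < len(labels):
        a, b = labels[i], labels[i + 1]
        if a == b:                       # no alternation starts here
            i += 1
            continue
        j = i + 2
        while j < len(labels) and labels[j] == labels[j - 2]:
            j += 1                       # extend the A-B-A-B pattern
        if j - i >= min_len:
            dialogs.append((i, j - 1))
        i = j - 1
    return dialogs
```

For instance, `find_dialogs(['A','B','A','B','A','C','C'])` reports the alternating span covering the first five shots.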

We propose a novel entity-utility framework for the problem of condensing audio-visual segments. The idea is that a multimedia sequence can be thought of as comprising entities, a subset of which will satisfy the user's information needs. We associate a utility with these entities and formulate the problem of preserving the entities required by the user as a convex utility maximization problem with constraints. The framework adapts to changing device and other resource conditions. Other original contributions include: (a) the idea that comprehension of a shot is related to its visual complexity, (b) the idea that preserving visual syntax is necessary for generating coherent multimedia summaries, (c) auditory analysis that uses discourse structure, and (d) novel multimedia synchronization requirements.
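The flavor of constrained utility maximization can be sketched with a deliberately simple instance: allocate a time budget across entities to maximize a sum of concave per-entity utilities u_i(t) = w_i * sqrt(t), for which the optimum has a closed form with t_i proportional to w_i squared. The utility shape and the weights are illustrative assumptions, not the utility functions used in the thesis.

```python
# Hypothetical sketch of entity-utility time allocation: maximize
# sum(w_i * sqrt(t_i)) subject to sum(t_i) = budget. By the Lagrange
# condition w_i / (2 * sqrt(t_i)) = lambda, the optimal t_i is
# budget * w_i**2 / sum(w_j**2).
import math

def allocate(weights, budget):
    """Optimal durations for sum(w_i * sqrt(t_i)) under a time budget."""
    total = sum(w * w for w in weights)
    return [budget * w * w / total for w in weights]

def utility(weights, times):
    """Total utility of an allocation."""
    return sum(w * math.sqrt(t) for w, t in zip(weights, times))
```

With weights `[1.0, 2.0]` and a 10-second budget, the allocation is `[2.0, 8.0]`, which beats an even 5/5 split in total utility; a resource change (a smaller budget) simply rescales the allocation, which is the adaptability the framework aims for.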

We conducted user studies using the multimedia summary clips generated by the system. These studies indicate that the summaries are perceived as coherent at condensation rates as high as 90%, and that the measured improvements over competing algorithms were statistically significant.