Projects and On-Line Demos
The Utility-based MPEG-4 Video Transcoding project is part of the joint project "Universal Multimedia Access" with the Electronics and Telecommunications Research Institute (ETRI), Korea. Our special interest lies in delivering media content through various network channels while matching the diversity of devices and user interests. In this work, we employ a combined FD-CD transcoding scheme and propose a utility-based transcoding descriptor, which describes the optimal transcoding operators in terms of a utility function. The descriptor provides the set of transcoding operators that meet the bit rate constraint imposed by the network or terminal and, when available, the utility ranking of that set. For on-demand streaming applications, the description can be generated in advance for each video stream stored on a server; for live video streaming, it can be generated in real time using a prediction-based approach.
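As an illustration of how such a descriptor might be consumed, here is a minimal sketch in Python; the operator names, bit rates, and utility values are invented for the example and are not from the project.

```python
# Hypothetical utility-based transcoding descriptor: for each candidate
# transcoding operator we store its expected bit rate (kbps) and utility
# (e.g., a perceived-quality score). Given a bandwidth constraint, we keep
# the feasible operators and rank them by utility, best first.

def select_operators(descriptor, max_bitrate_kbps):
    """Return the operators that fit the bit rate budget, best utility first."""
    feasible = [op for op in descriptor if op["bitrate"] <= max_bitrate_kbps]
    return sorted(feasible, key=lambda op: op["utility"], reverse=True)

descriptor = [
    {"name": "drop_0_frames_full_coeff",  "bitrate": 512, "utility": 1.00},
    {"name": "drop_B_frames",             "bitrate": 300, "utility": 0.82},
    {"name": "drop_B_frames_half_coeff",  "bitrate": 180, "utility": 0.61},
    {"name": "drop_BP_frames_half_coeff", "bitrate": 96,  "utility": 0.40},
]

# Terminal/network budget of 320 kbps rules out the full-quality operator.
best = select_operators(descriptor, max_bitrate_kbps=320)
```

For on-demand content the `descriptor` list would be precomputed per stream; for live streaming its entries would be predicted on the fly.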
Handling packet loss or delay in mobile and/or Internet environments is a challenging problem for multimedia transmission. Using a connection-oriented protocol such as TCP may introduce intolerable retransmission delay; using a datagram-oriented protocol such as UDP may leave the presentation incomplete when packets are lost. In this project, we propose a new method that uses our self-authentication-and-recovery images (SARI) for error detection and concealment in the UDP environment. SARI uses a unique watermarking technique for image feature extraction and embedding. For wireless video, the lost information in a SARI image can be approximately recovered from the embedded watermark, which includes content-based authentication information as well as recovery information. Images or video frames are watermarked a priori, so no additional mechanism is needed in the networking or encoding process. Because recovery is not based on adjacent blocks, the proposed method can recover corrupted areas even when the loss occurs over large or highly varying areas. Our experiments show the advantages of such a technique in both transmission-time savings and its broad application potential.
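The idea that recovery does not depend on adjacent blocks can be sketched as follows. This toy model (scalar "blocks", rounding as the coarse copy, a fixed partner mapping) illustrates only the principle, not the actual SARI watermark embedding.

```python
# Toy sketch: each block carries a coarse (rounded) copy of a distant
# "partner" block. If a contiguous run of blocks is lost, their partners
# elsewhere in the image still carry coarse versions to rebuild from,
# so recovery does not rely on the (also lost) neighboring blocks.

def embed(blocks):
    """Attach to block i the coarse copy of its distant partner block."""
    n = len(blocks)
    coarse = [round(b) for b in blocks]
    return [{"data": b, "carries": coarse[(i + n // 2) % n]}
            for i, b in enumerate(blocks)]

def recover(stream, lost):
    """Rebuild each lost block from the coarse copy its partner carries."""
    n = len(stream)
    return [stream[(i + n // 2) % n]["carries"] if i in lost
            else stream[i]["data"] for i in range(n)]

stream = embed([10.4, 20.6, 30.1, 40.9])   # invented "block" values
restored = recover(stream, lost={0, 1})    # a large contiguous loss
```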
Sports video poses many interesting and challenging problems for video indexing and filtering. It provides rich structures and events that have strong correlation with low-level multimedia features. It provides an interesting domain for testing new frameworks and tools. Certainly it also has a high level of practical value due to the popularity of the content.
Sports videos are highly structured and "compressible" (familiar with the regretful feeling after watching a 4-hour baseball game?). Long sports video programs can be significantly reduced in time to include only important events or highlights. In this project, we are developing high-level structure analysis and content filtering techniques for sports video. Our strategy is to find a systematic methodology for combining effective generic computational approaches with the domain knowledge available in specific domains. Three major activities are currently underway.
(Description and Demo: http://www.ctr.columbia.edu/video-summary)
Video structure discovery aims at understanding and constructing the spatio-temporal structures of video programs at both the syntactic and semantic levels. Such structures enable the development of useful tools and interfaces for video access, such as tables of contents, personalized browsers, and highlight skimming. Examples of our work in this area include joint audio-visual scene segmentation using psychological perceptual models, real-time detection of video shots and special transitional effects, and detection of high-level semantic events (such as dialogs and repeating sports events). Among these, Project KIA focuses on the problem of automatic video summarization via fully automatic analysis of audio and video data.
Text recognition in video is very useful for various application scenarios. In this project, we focus on caption text detection and recognition in sports video. The aim is to extract structural and semantic information from sports video using optical character recognition techniques. The project attempts to solve the following problems: localization of the caption area, analysis of the caption area's layout, and recognition of the words and digits. Besides traditional character detection and recognition techniques for images, we reinforce recognition performance using sequence-analysis approaches and a domain knowledge model. We also use compressed-domain techniques to make the text detection system more efficient. Our system, implemented in software on a regular PC, achieves very high accuracy (92% recognition, 99% detection) at real-time speed on sports video caption boxes.
The algorithms that have been developed can be applied in many applications. For example, score caption area detection can be used to condense sports videos into caption box sequences, which can be compressed and transmitted to pagers, cell phones, or other low-bandwidth devices.
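One simple form of the sequence analysis mentioned above is temporal voting: because a caption box persists across many frames, sporadic per-frame OCR misreads can be outvoted by the other frames. The sketch below is illustrative, not the project's actual algorithm.

```python
# Majority-vote each character position across the OCR result strings
# obtained from consecutive frames of the same caption.
from collections import Counter

def vote_caption(frame_readings):
    """Combine per-frame OCR strings by per-position majority vote."""
    length = max(len(r) for r in frame_readings)
    padded = [r.ljust(length) for r in frame_readings]
    return "".join(
        Counter(chars).most_common(1)[0][0]   # most frequent char wins
        for chars in zip(*padded)
    ).rstrip()

# Two frames contain misreads ("0" for "O", "8" for "3"); voting fixes both.
readings = ["SCORE 3-1", "SC0RE 3-1", "SCORE 3-1", "SCORE 8-1", "SCORE 3-1"]
caption = vote_caption(readings)
```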
AMOS is a video object segmentation and retrieval system. In this framework, a video object (e.g. person, car) is modeled and tracked as a set of regions with corresponding visual features and spatio-temporal relations. The region-based model also provides an effective base for similarity retrieval of video objects.
AMOS effectively combines user input and automatic region segmentation to define and track video objects at a semantic level. First, the user roughly outlines the contour of an object in the starting frame, which is used to create a video object with underlying homogeneous regions. This process is based on a region segmentation method that uses color and edge features and a region aggregation method that classifies regions into foreground and background. Then, the object and its homogeneous regions are tracked through successive frames. This process uses affine motion models to project regions from frame to frame and color-based region growing to determine the final projected regions. Users can stop the segmentation at any time to correct the contours of video objects. Extensive experiments have demonstrated excellent performance; most tracking errors are caused by uncovered regions and can be corrected with a few user inputs.
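The affine projection step can be sketched as follows; the model parameters here are illustrative, and a real tracker estimates them from the motion between frames before region growing refines the result.

```python
# Project a region's points into the next frame with a 2x3 affine model
# [[a11, a12, tx], [a21, a22, ty]], i.e. x' = a11*x + a12*y + tx, etc.

def affine_project(points, model):
    """Apply an affine motion model to a list of (x, y) points."""
    (a11, a12, tx), (a21, a22, ty) = model
    return [(a11 * x + a12 * y + tx, a21 * x + a22 * y + ty)
            for x, y in points]

# Illustrative case: pure translation by (+2, -1) pixels between frames.
region = [(10, 10), (12, 10), (11, 12)]
model = ((1.0, 0.0, 2.0), (0.0, 1.0, -1.0))
projected = affine_project(region, model)
```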
AMOS also extracts salient regions within video objects, which users can interactively create and manipulate. Visual features and spatio-temporal relations are computed for video objects and salient regions and stored in a database for similarity matching. The features include motion trajectory, dominant color, texture, shape, and time descriptors. Currently three types of relations among the regions of a video object are supported: orientation spatial (angle between two regions), topological spatial (contains, does not contain, or inside), and directional temporal (starts before, at the same time, or after). Users can also enter textual annotations for the objects. AMOS accepts queries in the form of sketches or examples and returns similar video objects based on different features and relations. The query process of finding candidate video objects uses a filtering scheme together with a joining scheme. The first step is to find a list of candidate regions from the database for each query region based on visual features. Then the region lists are joined to obtain candidate objects, and each object's total distance to the query is computed by matching the spatio-temporal relations.
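A minimal sketch of the filter-and-join idea, under an assumed toy data layout (one scalar feature per region); the real system matches multi-dimensional visual features and then verifies spatio-temporal relations.

```python
# Step 1 (filter): for each query region, retrieve database regions within
# a feature-distance threshold. Step 2 (join): keep only objects that
# contribute a region to every candidate list, and score each surviving
# object by summing its best per-region distances.
from functools import reduce

def filter_and_join(query_regions, db_regions, threshold):
    lists = [[r for r in db_regions
              if abs(r["feature"] - q["feature"]) <= threshold]
             for q in query_regions]
    common = reduce(set.intersection,
                    ({r["obj"] for r in lst} for lst in lists))
    scores = {}
    for obj in common:
        scores[obj] = sum(
            min(abs(r["feature"] - q["feature"])
                for r in lst if r["obj"] == obj)
            for q, lst in zip(query_regions, lists))
    return sorted(scores.items(), key=lambda kv: kv[1])  # best first

# Object "B" lacks a region near feature 0.8, so the join drops it.
result = filter_and_join(
    [{"feature": 0.2}, {"feature": 0.8}],
    [{"obj": "A", "feature": 0.25}, {"obj": "A", "feature": 0.75},
     {"obj": "B", "feature": 0.3},  {"obj": "B", "feature": 0.1}],
    threshold=0.2)
```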
VideoQ expands the traditional search methods (e.g., keywords and subject navigation) with a novel search technique that allows users to search video objects based on a rich set of visual features and spatio-temporal relationships. Our objective is to investigate the full potential of visual cues in object-oriented content-based video search. Some of the unique features of VideoQ are described below.
VideoQ currently supports a large database of digital videos. Individual videos are automatically segmented into separate shots. Currently, over 2000 shots are stored. Each shot is compressed and stored in three layers to meet different bandwidth requirements.
In addition to query by sketch, the user can browse the video shots or search video by text. The video shots are cataloged into a subject taxonomy, which the user can easily navigate. Each video shot has also been manually annotated so the user can perform simple text search of keywords.
(Description and Demo: http://www.ctr.columbia.edu/~shahram/research_intro.htm)
Echocardiography is a standard video-based technique used in cardiology for diagnosis. An echocardiogram study includes multimedia data such as video of the heart, images of blood flow, EKG graphs, and heart sounds. Our goal is to develop multimedia analysis and indexing tools that enable doctors, residents, and students to exploit such medical multimedia resources for educational and clinical purposes.
Our multi-phase goals include the following.
(Description and Demo: http://www.ctr.columbia.edu/sari)
We have developed a unique system for image authentication and recovery. We explored unique invariant properties in lossy compression such as JPEG, Motion JPEG, and MPEG. Content-based invisible watermarks are designed based on the invariant properties of such lossy compression and embedded into reconstructable coefficients. We proved such watermarks can be used to distinguish acceptable manipulations on digital images and malicious attacks (such as cropping and replacement). Applications include medical, insurance, law enforcement, and news, in which some manipulations are required while malicious manipulations must be rejected.
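The kind of invariant such a scheme can rely on is that relationships between coefficients quantized with the same step survive lossy compression. Below is a toy scalar demonstration of that property, not the actual SARI construction.

```python
# JPEG-style uniform quantization divides a DCT coefficient by a
# quantization step and rounds. If two same-frequency coefficients from
# two blocks are quantized with the same step, their ordering is
# preserved (a larger value can become equal, but never smaller), so a
# watermark built on such pair relationships survives recompression.

def quantize(coeff, step):
    """Uniformly quantize a DCT coefficient with the given step."""
    return round(coeff / step)

a, b = 37.0, 21.0  # illustrative same-frequency coefficients, a >= b
preserved = all(quantize(a, step) >= quantize(b, step)
                for step in (2, 4, 8, 16, 32))
```

At coarse steps the two quantized values may coincide, which is why such schemes test for ordering rather than exact differences.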
(Description and Demo: http://www.ctr.columbia.edu/imka)
The IMKA project is developing a multimedia knowledge representation framework, MediaNet, and methods for constructing and using multimedia knowledge to improve systems for retrieval, navigation and synthesis of multimedia. The IMKA approach is based on integrating both symbolic and perceptual representations of knowledge. The word IMKA stands for Intelligent Multimedia Knowledge Application.
(Description and Demos: http://www.ctr.columbia.edu/mpeg-7)
The MPEG-7 Project at Columbia University focuses on developing description schemes for image, video, multimedia, and collection content as contributions to the MPEG-7 standard.
We develop a novel interactive system for learning visual object filters, in which models are defined by a user according to his interests via a multiple-level object definition hierarchy. The system facilitates cooperation between user and system, in which the computer performs automatic image region segmentation while the user manually labels and maps segmented regions to various nodes in the object definition hierarchy. As the user provides examples from images or video, Visual Object Detectors are constructed automatically using a variety of machine learning techniques.
The detectors are then used in a multiple-stage process that incorporates non-visual information for automatic filtering (e.g., by agents). We introduce the concept of Recurrent Visual Semantics and show how it can be exploited in our framework. Our approach has been applied to filtering in the context of baseball videos and the Kosovo crisis. Visual detectors have also been tested in detecting handshake images from news sources.
A new conceptual framework for classifying visual information (image, video, etc.) attributes. The framework, which draws on research from several fields related to image/video indexing (Cognitive Psychology, Information Sciences, Content-Based Retrieval, etc.), classifies visual attributes (and relationships) into 10 levels, distinguishing between Syntax (form) and Semantics (meaning).
We are studying the way people move their eyes as they observe images from different visual categories (e.g., images with handshakes, crowds, landscapes, etc.). Although extensive research on eye movements has been performed, this is the first study that considers differences across image categories. This information is useful for understanding the human visual process and for building automatic classification systems.
An interactive framework for organizing personal digital image collections. We introduce the detection of bracketing (two or more photographs of the same subject made using different aperture/speed settings) and a novel clustering algorithm (based on Ward's algorithm) that exploits image sequence information in each roll of film. Images within a roll of film (and across rolls) are clustered hierarchically and presented to the user, who modifies the clusters to organize his collection.
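A much-simplified stand-in for the sequence-based clustering idea: split a roll wherever consecutive frames are sufficiently dissimilar. The actual system uses a Ward's-based hierarchical algorithm; the scalar features and threshold below are invented for illustration.

```python
# Frames in a roll of film form a sequence, so event boundaries tend to
# coincide with large jumps in visual dissimilarity between neighbors.
# Here each frame is reduced to one invented scalar feature.

def split_roll(frame_features, threshold):
    """Group consecutive frame indices, splitting at large feature jumps."""
    clusters, current = [], [0]
    for i in range(1, len(frame_features)):
        if abs(frame_features[i] - frame_features[i - 1]) > threshold:
            clusters.append(current)
            current = []
        current.append(i)
    clusters.append(current)
    return clusters

feats = [0.10, 0.12, 0.11, 0.90, 0.88, 0.35]  # illustrative values
clusters = split_roll(feats, threshold=0.3)
```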
Human activity in videos provides direct information about the video content. Summaries of people's appearances in the spatial and temporal dimensions can help users quickly understand the interactions among people. Example questions answered include: who appears in the video, in what order, and who has face-to-face discussions with whom. We are developing algorithms and systems for real-time detection, accurate tracking, and effective summarization of human faces in the video compressed domain. The real-time detection component uses MPEG compressed data to detect face objects and refines the results by tracking the movement of faces. Various motion models in Kalman filters have been studied for tracking. In specific domains (e.g., interview videos), high-level transition models are also explored to model the probabilistic transition patterns among speakers. The same detection-tracking-transition paradigm can be applied to other domains (e.g., sports, presentations) for understanding the high-level content of videos.
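As a sketch of the constant-velocity motion model used in such trackers, here is a fixed-gain (alpha-beta) filter, a simplified steady-state form of a Kalman filter; the gains and measurements are illustrative, not the project's.

```python
# 1-D alpha-beta tracker: state is (position, velocity). Each step
# predicts with the constant-velocity model, then corrects position and
# velocity with fixed gains applied to the measurement residual.

def track(measurements, alpha=0.85, beta=0.1):
    """Return smoothed position estimates for a sequence of measurements."""
    x, v = measurements[0], 0.0
    estimates = [x]
    for z in measurements[1:]:
        x_pred = x + v                 # predict: constant velocity
        resid = z - x_pred             # innovation (measurement residual)
        x = x_pred + alpha * resid     # correct position
        v = v + beta * resid           # correct velocity
        estimates.append(x)
    return estimates

# A face moving right at 2 px/frame, measured without noise; the
# estimate converges onto the true trajectory.
track_x = track([2.0 * t for t in range(20)])
```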
We are investigating innovative techniques that use integrated multimedia features for automatic image classification. This is a collaborative project with the Natural Language Processing group at Columbia. On the image side, we are developing a new approach, called OF*IIF (Object Frequency * Inverse Image Frequency), to automatically extract discriminative objects and their distribution from single images or classes of images. The OF*IIF feature vector has proved effective compared with other state-of-the-art image classifiers, and it achieves a significant performance gain when combined with the popular text-based approach, TF*IDF.
We are currently developing an integration framework using Bayesian networks to combine text-based and visual-based feature vectors for classifying images into different categories such as indoor/outdoor, people, handshake, etc. The demo shows automatic classification using combined text-visual features.
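By direct analogy with TF*IDF, an object-frequency weighting might look like the following sketch; the exact OF*IIF formulation (e.g., the log form used here) is not reproduced from the project, and the object labels are invented.

```python
# Weight each detected object label in an image by how often it occurs
# there (object frequency) and how rare it is across the database
# (inverse image frequency), in the spirit of TF*IDF.
import math

def of_iif(image_objects, corpus):
    """image_objects: object labels detected in one image.
       corpus: one such label list per database image."""
    n = len(corpus)
    weights = {}
    for obj in set(image_objects):
        of = image_objects.count(obj) / len(image_objects)
        df = sum(1 for img in corpus if obj in img)   # images containing obj
        weights[obj] = of * math.log(n / df)
    return weights

# "player" appears twice here and in no other image, so it outweighs
# "grass", which is common across the database.
corpus = [["grass", "player", "player"], ["grass", "crowd"], ["studio"]]
weights = of_iif(corpus[0], corpus)
```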
(Demo and Plug-ins: http://www.ctr.columbia.edu/webclip)
WebClip is a prototype for editing/browsing compressed video over the World Wide Web. It uses a general system architecture to store, retrieve, and edit MPEG-1 or MPEG-2 compressed video over the network. WebClip is a Web application built on the Compressed Video Editing, Parsing, and Search (CVEPS) technology developed at Columbia University.
The unique features of WebClip include compressed-domain video editing, content-based video retrieval, multi-resolution access, and a distributed network architecture. The compressed-domain approach has great synergy with the network editing environment, in which compressed video sources are retrieved and edited to produce new compressed video content.
WebClip includes several major components. The video content aggregator collects video sources online from distributed sites. The video content analyzer includes programs for automatic extraction of visual features from MPEG videos in the compressed domain. Video features and extracted icon streams are stored online in the server database. The editing engine and the search engine include programs for rendering special effects and processing visual queries issued by users.
WebSEEk is a content-based image and video catalog and search tool for the World Wide Web. WebSEEk collects images and videos using several autonomous Web agents that automatically analyze, index, and assign them to subject classes. The system is novel in that it uses text and visual information synergistically to catalog and search for images and videos. The complete system provides several powerful functions: searching with image content-based techniques, query modification using content-based relevance feedback, automated collection of visual information, compact presentation of images and videos in query results, image and video subject search and navigation, text-based searching, and search-result list manipulations such as intersection, subtraction, and concatenation. At present, the system has catalogued over 650,000 images and 10,000 videos from the Web.
New algorithms are being developed for automatic categorization of new unconstrained images/video to semantic-level subject classes in the image taxonomy. A working image taxonomy has been constructed in a semi-automatic way in the current prototype of WebSEEk. The categorization algorithms explore optimal integration of visual features (such as color, texture, spatial layout) and text features (such as associated html tags, captions, and articles).
SaFe is a general system for spatial and feature image search. It provides a framework for searching for and comparing images by the spatial arrangement of regions or objects. In a SaFe query, objects or regions are assigned by the user. These are given properties of spatial location, size and visual features, such as color. The SaFe system finds the images that best match the query. SaFe uses fully automatic tools for region/feature extraction and indexing. SaFe also resolves spatial relationships, which allows the user to position objects relative to each other in a query.
Example queries include "find images including a blue region on top and a wide green open region in the bottom (looking for images with blue sky and open grass field)," and "use this spatial pattern of red, white, blue colors to find images containing American Flags."
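The first example query can be caricatured in a few lines; real SaFe matching scores spatial location, size, and visual features of user-placed regions rather than applying a hard rule, so this is only an illustration with invented region data.

```python
# Toy "blue sky over green field" test: an image matches if some blue
# region lies above (smaller y in image coordinates) some green region.

def blue_over_green(regions):
    """True if any blue region is above any green region."""
    blues = [r["y"] for r in regions if r["color"] == "blue"]
    greens = [r["y"] for r in regions if r["color"] == "green"]
    return any(b < g for b in blues for g in greens)

# Illustrative extracted regions for two images.
sky_field = [{"color": "blue", "y": 0.2}, {"color": "green", "y": 0.8}]
indoor = [{"color": "red", "y": 0.5}, {"color": "green", "y": 0.3}]
```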
Search engines are very powerful resources for finding information on the rapidly expanding World Wide Web. Finding the desired search engines and learning how to use them, however, can be very time consuming. Systems that integrate such search tools, called meta-search engines, enable users to access information across the world in a transparent and efficient manner. The recent emergence of visual information search engines on the Web is leading to the same efficiency problem. We have developed MetaSEEk, a content-based meta-search engine for finding images on the Web based on their visual information. MetaSEEk is designed to intelligently select and interface with multiple on-line image search engines by ranking their performance for different classes of user queries. User feedback is also integrated into the ranking refinement. Comparison of MetaSEEk with a baseline meta-search engine (which does not use the past performance of the different search engines in recommending target search engines for future queries) shows promising results in improving the efficiency of user queries.
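The performance-based selection idea can be sketched as follows; the scoring scheme (a running average per query class, with a neutral prior for unseen engines) is an assumption for illustration, not MetaSEEk's actual method, and the engine names are invented.

```python
# Keep, per (query class, engine), a running average of relevance
# feedback, and rank candidate engines for a new query by that average.

class EngineRanker:
    def __init__(self):
        self.scores = {}  # (query_class, engine) -> [total, count]

    def feedback(self, query_class, engine, relevance):
        """Record user feedback (0.0-1.0) for one engine's results."""
        total, count = self.scores.get((query_class, engine), [0.0, 0])
        self.scores[(query_class, engine)] = [total + relevance, count + 1]

    def rank(self, query_class, engines):
        """Order engines by average past relevance; unknown = neutral 0.5."""
        def avg(engine):
            total, count = self.scores.get((query_class, engine), [0.0, 0])
            return total / count if count else 0.5
        return sorted(engines, key=avg, reverse=True)

ranker = EngineRanker()
ranker.feedback("flags", "EngineA", 0.9)   # did well on flag queries
ranker.feedback("flags", "EngineB", 0.2)   # did poorly
ranking = ranker.rank("flags", ["EngineB", "EngineA", "NewEngine"])
```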