Finding Images/Video from Large Distributed Information Sources
Columbia's Content-Based Visual Query Project

http://www.ee.columbia.edu/~sfchang/vis-project



Problem Statement - An Application Driven Problem

How do we find a photograph from a large archive which contains thousands or millions of pictures? How does a video journalist find a specific clip from the myriad of video tapes, ranging from historical to contemporary, from sports to humanities? How do people organize and search the content of personal video tapes of family events, travel scenes, or social gatherings?

The era of "the information explosion" has brought about the wide dissemination and use of visual information, particularly, digital images and video, which we are also seeing in combination with text, audio, and graphics. The development of tools and systems that enhance image functionalities, such as searching and authoring, is critical to the effective use of visual information in the new media applications.

The current research and development of image and video search tools is driven by practical applications. We are seeing the establishment of large digital image and video archives, such as the Corbis catalog, which includes the Bettmann Archive; the Picture Exchange, which is a joint venture between Kodak and Sprint; and many digital video libraries in various domains (e.g., environment, politics, arts), such as the on-line CNN news sources.

The systems for the search and retrieval of images and video from these archives require the development of efficient and effective visual query tools.


State of the Art

The use of comprehensive textual annotations provides one method for image and video search and retrieval. Today, text-based search techniques are the most direct, accurate, and efficient methods for finding "unconstrained" images and video. Text annotation is obtained by manual effort or from transcripts, captions, embedded text, or hyperlinked documents. In these systems, keyword and full-text searching may be enhanced by natural language processing techniques, which offer great potential for categorizing and matching images.

The searching of images by their visual content complements the text-based approaches. Very often, textual information is not sufficient. Visual features of the images and video also provide a description of their content. By exploring the synergy between textual and visual features, these image search systems can be further improved.

Many content-based image search systems have been developed for various applications. There has been substantial progress in developing powerful tools which allow users to specify image queries by giving examples, drawing sketches, selecting visual features (e.g., color and texture), and arranging spatial structure of features. Much success has been achieved, particularly in specific domains, such as remote sensing and medical applications.

New challenges remain in applying the above content-based image search tools to meet real user needs. Our experience indicates that use of the image search systems varies greatly. Users may want to find the most similar images, find an appropriate class of images, browse the image collection quickly, and so on. One unique aspect of image search systems is the active role played by users. By modeling the users and learning from them in the search process, we can better adapt to the users' subjectivity. In this way, we can adjust the search system to the fact that the perception of the image content varies between individuals, or over time.


Our Approaches

Our approach to solving the above challenge is based on the following strategies.

Create a visual feature library by automatic image/video analysis

Although today's computer vision systems cannot recognize high-level objects in unconstrained images, we are finding that low-level visual features can be used to partially characterize image content. These features also provide a potential basis for abstraction of the image semantic content. The extraction of local region features (such as color, texture, face, contour, motion) and their spatial/temporal relationships is being achieved with success (see VideoQ and VisualSEEk below). We argue that the automated segmentation of images/video objects does not need to accurately identify real world objects contained in the imagery. Our goal is to extract the "salient" visual features and index them with efficient data structures for fast and powerful querying. Semi-automated region extraction processes and use of domain knowledge may further improve the extraction process.
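As an illustration of what a low-level feature and its comparison might look like, the following is a minimal sketch of a quantized color histogram compared by histogram intersection. It is not taken from the project's systems; the function names, the bin count, and the toy pixel lists are assumptions made for this example.

```python
from collections import Counter

def color_histogram(pixels, bins_per_channel=4):
    """Quantize RGB pixels (0-255 per channel) into a normalized histogram.

    Assumes a non-empty pixel list and a bin count that divides 256.
    """
    step = 256 // bins_per_channel
    counts = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    total = len(pixels)
    return {bin_: n / total for bin_, n in counts.items()}

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]: sum of bin-wise minima of two normalized histograms."""
    return sum(min(v, h2.get(k, 0.0)) for k, v in h1.items())

# Hypothetical "regions" given as short lists of RGB pixels.
sky = [(30, 90, 220)] * 6 + [(250, 250, 250)] * 2
sea = [(20, 80, 200)] * 7 + [(10, 120, 60)] * 1
grass = [(40, 160, 50)] * 8

sky_vs_sea = histogram_intersection(color_histogram(sky), color_histogram(sea))
sky_vs_grass = histogram_intersection(color_histogram(sky), color_histogram(grass))
print(sky_vs_sea, sky_vs_grass)
```

The two mostly blue regions share most of their histogram mass, while the blue and green regions share none. A feature like this says nothing about which real-world object a region depicts, which is consistent with the argument above that salient features can be indexed without object recognition.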

Explore the synergy between compression and functionalities

It is impossible to anticipate the users' needs completely at the feature extraction and indexing stage. Ideally, images and video would be represented (and compressed) in a way that is amenable to dynamic feature extraction. Today's compression standards (such as JPEG, MPEG-1, and MPEG-2) are not suited to this need. These standards were designed to reduce bandwidth and increase subjective quality; the potential functionalities of the images were not considered, although many interesting analysis and manipulation tasks can still be achieved in today's compression formats (see WebClip below). However, recent trends in compression, such as MPEG-4 and object-based video, have shown interest and promise in this direction. The goal is to develop a system in which the video objects are extracted, then encoded, transmitted, manipulated, and indexed flexibly with efficient adaptation to users' preferences and system conditions.

Learn from user and domain knowledge

To break the barrier of decoding semantic content in images, user interaction and domain knowledge are needed. Such systems learn from the users' input how the low-level visual features should be used to match images at the semantic level. For example, the system may model the cases in which low-level feature search tools are successful in finding the images with the desired semantic content. We have developed a unique concept called Semantic Visual Templates (ref below), which uses an active learning system to find a small subset of graphic icons representing a semantic concept (e.g., sunsets or high jumpers). In an information environment that includes distributed, federated search engines, a meta-search system may monitor users' preferences for different feature models and search tools and then recommend search options that help users improve search efficiency (see MetaSEEk below).

Integrate visual and other multimedia features

Exploring the association of visual features with other multimedia features, such as text, speech, and audio, provides another potentially fruitful direction. Our experience indicates that it is more difficult to characterize the visual content of still images than that of video. Video often has text transcripts and audio that may also be analyzed, indexed, and searched. Also, images on the World Wide Web typically have text associated with them. In this domain, the use of all potential multimedia features enhances image retrieval performance (see WebSEEk below).


Prototype Systems and On-Line Demos


WebSEEk - An Image Search and Cataloging System on the WWW

(Demo: http://www.ctr.columbia.edu/webseek)

WebSEEk is a content-based image and video catalog and search tool for the World Wide Web. WebSEEk collects images and videos using several autonomous Web agents, which automatically analyze, index, and assign the images and videos to subject classes. The system is novel in that it uses text and visual information synergistically to catalog and search images and videos. The complete system offers several powerful functionalities: searching using image content-based techniques, query modification using content-based relevance feedback, automated collection of visual information, compact presentation of images and videos for displaying query results, image and video subject search and navigation, text-based searching, and manipulation of search result lists, such as intersection, subtraction, and concatenation. At present, the system has catalogued over 650,000 images and 10,000 videos from the Web.
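The result list manipulations mentioned above (intersection, subtraction, and concatenation) can be sketched as simple operations on ranked lists. This is only an illustration: the function names and image identifiers are hypothetical, and both the rank-preserving behavior and the removal of duplicates during concatenation are assumptions of this example rather than documented WebSEEk behavior.

```python
def intersect(a, b):
    """Items of ranked list a that also appear in b, in a's rank order."""
    keep = set(b)
    return [x for x in a if x in keep]

def subtract(a, b):
    """Items of ranked list a that do not appear in b, in a's rank order."""
    drop = set(b)
    return [x for x in a if x not in drop]

def concatenate(a, b):
    """List a followed by list b, keeping only the first occurrence of each item."""
    return list(dict.fromkeys(list(a) + list(b)))

# Hypothetical result lists from a text query and a color query.
text_hits = ["img_07", "img_02", "img_19", "img_04"]
color_hits = ["img_19", "img_33", "img_07"]

print(intersect(text_hits, color_hits))    # matches of both queries
print(subtract(text_hits, color_hits))     # text matches only
print(concatenate(text_hits, color_hits))  # merged list without repeats
```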

New algorithms are being developed for automatic mapping of new unconstrained images/video to semantic-level subject classes in the image taxonomy. A working image taxonomy has been constructed in a semi-automatic way in the current prototype of WebSEEk. The mapping algorithms explore visual features (such as color, texture, spatial layout, video object features), text features (such as associated HTML documents, transcript, caption), and intelligent clustering techniques in the feature space.


VideoQ - An Automatic Object-Oriented Content-Based Video Search System

(Demo: http://www.ctr.columbia.edu/VideoQ)

VideoQ expands the traditional search methods (e.g., keywords and subject navigation) with a novel search technique that allows users to search video based on a rich set of visual features and spatio-temporal relationships. Our objective is to investigate the full potential of visual cues in object-oriented content-based video search. Some of the unique features of VideoQ include:

VideoQ currently supports a large database of digital videos. Individual videos are automatically segmented into separate shots. Currently, over 2000 shots are stored. Each shot is compressed and stored in three layers to meet different bandwidth requirements.

In addition to query by sketch, the user can browse the video shots or search video by text. The video shots are cataloged into a subject taxonomy, which the user can easily navigate. Each video shot has also been manually annotated so the user can perform simple text search of keywords.
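The text above does not say how videos are segmented into shots. One common technique, shown here purely as a hedged sketch and not as VideoQ's actual method, is to compare intensity histograms of consecutive frames and declare a shot boundary when the difference exceeds a threshold. The frame representation (flat lists of gray values), the bin count, and the threshold are all assumptions of this example.

```python
def frame_histogram(frame, bins=8):
    """Normalized gray-level histogram of a frame given as a flat list of 0-255 values."""
    hist = [0] * bins
    for v in frame:
        hist[v * bins // 256] += 1
    n = len(frame)
    return [c / n for c in hist]

def detect_shot_boundaries(frames, threshold=0.5):
    """Return the indices of frames that start a new shot.

    The difference between consecutive frames is half the L1 distance
    between their histograms, which lies in [0, 1].
    """
    boundaries = []
    prev = None
    for i, frame in enumerate(frames):
        hist = frame_histogram(frame)
        if prev is not None:
            diff = sum(abs(a - b) for a, b in zip(hist, prev)) / 2
            if diff > threshold:
                boundaries.append(i)
        prev = hist
    return boundaries

# Toy "video": two dark frames, two bright frames, one dark frame.
dark = [10, 20, 15, 12]
bright = [240, 250, 245, 235]
frames = [dark, dark, bright, bright, dark]

boundaries = detect_shot_boundaries(frames)
print(boundaries)
```

A fixed threshold on hard cuts is the simplest variant; gradual transitions such as fades and dissolves would need a different test.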


WebClip - A Distributed System for Editing and Browsing Compressed Video Over the WWW

(Demo and plugins: http://www.ctr.columbia.edu/webclip)

WebClip is a prototype for editing/browsing compressed video over the World Wide Web. It uses a general system architecture to store, retrieve, and edit MPEG-1 or MPEG-2 compressed video over the network. WebClip is a Web application built on the Compressed Video Editing, Parsing, and Search (CVEPS) technology developed at Columbia University.

The unique features of WebClip include compressed-domain video editing, content-based video retrieval, multi-resolution access, and a distributed network architecture. The compressed-domain approach has great synergy with the network editing environment, in which compressed video sources are retrieved and edited to produce new compressed video content.

WebClip is developed based on the following observations. First, today's computing environment has grown from traditional localized systems to distributed computing over networks. The concept of network computing has emerged as network and computer technologies prevail and start to merge. Second, a significant amount of visual content (images and video) has been produced and stored online at an increasing rate. However, the development of technologies for accessing and manipulating visual content is lacking or falling behind. Third, most video material will be stored in some compressed form. The synergy between compression and manipulation needs to be fully explored.

WebClip includes several major components. The video content aggregator collects video sources online from distributed sites. The video content analyzer includes programs for automatic extraction of visual features from MPEG videos in the compressed domain. Video features and extracted icon streams are stored online in the server database. The editing engine and the search engine include programs for rendering special effects and processing visual queries issued by users.


SaFe/VisualSEEk - Automatic Joint Spatial/Feature Based Image Search System

(Demo: http://disney.ctr.columbia.edu/SaFe, http://www.ctr.columbia.edu/VisualSEEk)

SaFe is a general system for spatial and feature image search. It provides a framework for searching for and comparing images by the spatial arrangement of regions or objects. In a SaFe query, objects or regions are assigned by the user. These are given properties of spatial location, size and visual features, such as color. The SaFe system finds the images that best match the query. SaFe uses fully automatic tools for region/feature extraction and indexing. SaFe also resolves spatial relationships, which allows the user to position objects relative to each other in a query.

Example queries include "find images including a blue region on top and a wide green open region in the bottom (looking for images with blue sky and open grass field)," and "use this spatial pattern of red, white, blue colors to find images containing American Flags."
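A query of this kind can be sketched as a small set of regions, each with a color, a position, and a size, scored against the regions extracted from each image. The code below is a simplified illustration under stated assumptions, not the SaFe matching algorithm: the `Region` fields, the equal weighting of color, position, and size, and the toy database are all hypothetical, and relative spatial relationships between regions are not modeled.

```python
import math
from dataclasses import dataclass

@dataclass
class Region:
    color: tuple  # mean RGB, 0-255 per channel
    cx: float     # centroid x, normalized to 0-1
    cy: float     # centroid y, normalized (0 = top, 1 = bottom)
    area: float   # fraction of the image covered

MAX_COLOR_DIST = math.dist((0, 0, 0), (255, 255, 255))

def region_distance(q, r):
    """Sum of normalized color, position, and size differences (equal weights assumed)."""
    color_d = math.dist(q.color, r.color) / MAX_COLOR_DIST
    pos_d = math.dist((q.cx, q.cy), (r.cx, r.cy)) / math.sqrt(2)
    size_d = abs(q.area - r.area)
    return color_d + pos_d + size_d

def image_distance(query, image_regions):
    """For each query region, take the best-matching image region; sum the distances."""
    return sum(min(region_distance(q, r) for r in image_regions) for q in query)

def rank_images(query, database):
    """Image names sorted from best to worst match."""
    return sorted(database, key=lambda name: image_distance(query, database[name]))

# "Blue region on top, wide green region at the bottom."
query = [Region((40, 110, 220), 0.5, 0.2, 0.4), Region((40, 160, 40), 0.5, 0.8, 0.5)]
database = {
    "sunset": [Region((240, 120, 40), 0.5, 0.3, 0.6), Region((30, 20, 20), 0.5, 0.8, 0.4)],
    "meadow": [Region((60, 120, 230), 0.5, 0.25, 0.5), Region((50, 170, 60), 0.5, 0.75, 0.5)],
}

ranking = rank_images(query, database)
print(ranking)
```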


MetaSEEk - A Content-based Meta Search Engine for Finding Images on the WWW

(Demo: http://www.ctr.columbia.edu/MetaSEEk)

Search engines are very powerful resources for finding information on the rapidly expanding World Wide Web. Finding the desired search engines and learning how to use them, however, can be very time consuming. Systems that integrate such search tools, called meta-search engines, enable users to access information across the world in a transparent and efficient manner. The recent emergence of visual information retrieval search engines on the Web is leading to the same efficiency problem. We have developed MetaSEEk, a content-based meta-search engine for finding images on the Web based on their visual information. MetaSEEk is designed to intelligently select and interface with multiple on-line image search engines by ranking their performance for different classes of user queries. User feedback is also integrated in the ranking refinement. A comparison of MetaSEEk with a baseline meta-search engine (which does not use the past performance of the different search engines in recommending target search engines for future queries) shows promising results in improving the efficiency of user queries.
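The idea of ranking search engines by their past performance for each class of query, refined by user feedback, can be sketched as follows. This is a minimal illustration and not MetaSEEk's implementation: the `EngineRanker` class, the +1/-1 scoring rule, the query class labels, and the engine names are assumptions of this example.

```python
from collections import defaultdict

class EngineRanker:
    """Keeps a feedback score per (query class, engine) pair and recommends engines."""

    def __init__(self, engines):
        self.engines = list(engines)
        # scores[query_class][engine] -> accumulated feedback, starting at 0.0
        self.scores = defaultdict(lambda: defaultdict(float))

    def record_feedback(self, query_class, engine, liked):
        """Raise the engine's score for this query class if the user liked its results."""
        self.scores[query_class][engine] += 1.0 if liked else -1.0

    def recommend(self, query_class, k=2):
        """Top-k engines for the query class; ties keep the original engine order."""
        scores = self.scores[query_class]
        return sorted(self.engines, key=lambda e: scores[e], reverse=True)[:k]

ranker = EngineRanker(["engine_a", "engine_b", "engine_c"])
ranker.record_feedback("sunset", "engine_b", liked=True)
ranker.record_feedback("sunset", "engine_b", liked=True)
ranker.record_feedback("sunset", "engine_a", liked=False)
ranker.record_feedback("sunset", "engine_c", liked=True)

print(ranker.recommend("sunset"))    # engines with the best feedback for this class
print(ranker.recommend("portrait"))  # no history yet: falls back to the default order
```

A baseline meta-search engine of the kind described above would correspond to always returning the default order, without consulting the stored scores.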


Publications


Acknowledgements

This project has been supported by the Information and Data Management Program of the National Science Foundation (CAREER-IRI-9501266), NSF STIMULATE Program (IRI-96-19124), IBM under a UPP Faculty Development Award, HP, Intel, NEC Research Institute, and partners of Columbia's ADVENT project.