June 1, 2013, Columbia Alumni Reunion Day Intellectual Talks
[slide]
Wondrous technologies, like multimedia object recognition, augmented reality, and personalized news, are being added to mobile and social platforms at an explosive pace. How do they work and what are the remaining technical challenges? This talk will focus on such issues and demonstrate some of the prototype systems.
Oct/Dec 2012, ECCV and NIPS, joint work with Junfeng He, Sanjiv Kumar, Wei Liu, and Jun Wang
[slide, 7.5MB]
Finding nearest neighbor data in high-dimensional spaces is a common yet challenging task in many applications, such as computer vision, image
retrieval, and large graph construction. A scalable indexing scheme is
needed to achieve satisfactory accuracy, search complexity, and storage
cost. However, the well-known curse of
dimensionality makes existing solutions impractical. Recent advances in
locality sensitive hashing show promise by hashing high-dimensional
features into a small number of bits while preserving proximity in the
original feature space. It has attracted great attention due to its
simplicity in implementation (projection only), constant search
time, low storage cost, and applicability to diverse features and
metrics. In this talk, I will first survey a few recent methods that
extend basic hashing methods to incorporate labeled information through supervised and semi-supervised hashing, employ hyperplane hashing for finding nearest points to subspaces (e.g., planes), and demonstrate the practical utility of compact hashing methods in solving several challenging problems of large-scale mobile visual search - low
bandwidth, limited processing power on mobile devices, and the need to
search large databases on servers. Finally, we study the fundamental
questions of high-dimensional search - how is nearest neighbor search
performance affected by data size, dimension, and sparsity; can we
predict the performance of hashing methods over a data set before its
implementation?
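As a rough illustration of the "projection only" idea mentioned in this abstract, the sketch below hashes vectors into compact binary codes using random hyperplane projections and compares codes by Hamming distance. It is a generic locality sensitive hashing sketch, not the supervised, semi-supervised, or hyperplane hashing methods surveyed in the talk; the function names, the 16-bit code length, and the sample vectors are all assumptions made for illustration.

```python
import random


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random Gaussian directions (hypothetical helper)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def hash_code(vector, hyperplanes):
    """Map a vector to a tuple of bits: one sign bit per random projection."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(code_a, code_b):
    """Count the bit positions in which two codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


# Illustrative data only: a query, a slightly perturbed copy, and its negation.
planes = make_hyperplanes(num_bits=16, dim=4)
query = [1.0, 0.5, -0.2, 0.3]
nearby = [1.1, 0.4, -0.1, 0.3]
opposite = [-x for x in query]

print("query vs nearby:  ", hamming(hash_code(query, planes), hash_code(nearby, planes)))
print("query vs opposite:", hamming(hash_code(query, planes), hash_code(opposite, planes)))
```

Vectors separated by a small angle tend to agree on most bits, so candidate neighbors can be found by comparing short codes instead of the original high-dimensional features.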
July 2012
[slide, 6MB]
Smartphone cameras provide new ways of sensing the real-world environment. The augmented capability
can be used to find information about the surrounding scenes or objects through visual matching over
large data sources at the remote servers. Recent examples, such as the Google Glass project,
offer interesting promise for such functionalities. However, visual search on mobile devices
presents new technical challenges, such as limited power, bandwidth, and image quality. In this talk,
I will describe solutions that address such challenges, and demonstrate a large mobile product search
system capable of searching one million product images in near real time. The system leverages recent
advances in visual feature matching and compact hash-based indexing, which are well suited to the large-scale
mobile visual search scenario. I will review principles and optimization techniques for designing compact
hash codes, a popular choice for solving general large-scale nearest neighbor search problems.
Additionally, to explore the power of the human in the loop, I will present another system, called
Active Query Sensing, which aims at a more intuitive mobile visual search experience. It uses visual analysis
to discover the best view angle and guide the user to capture the best queries for location recognition.
UCLA Institute for Pure and Applied Mathematics, Workshop on Large Scale Multimedia Search, Jan. 2012
[slide, 8.6MB]
Graphs are a popular data representation for capturing relations among samples, such as images and documents.
Many successful graph-based techniques, such as Regularized Laplacian and Random Walk, have been used
for multimedia applications in retrieval and classification. In this talk, I will review a few novel
graph-based techniques designed specifically to handle the new challenges associated with large-scale
noisy multimedia data encountered on the Web. I will review (1) label diagnosis and spectral filtering
techniques for removing unreliable labels, (2) anchor graph methods for scaling up graph-based techniques
to gigantic data sets, and (3) multi-edge graphs that capture heterogeneous similarities among multimedia data.
Applications in Web multimedia retrieval and novel systems searching images with Brain Machine Interfaces
will be presented.
Keynote Speech, ACM SIGMM Technical Achievement Award, Nov. 2011, ACMMM, Scottsdale, AZ
[slide, 5MB]
In the past two decades, we have witnessed burgeoning research on content-based multimedia information
retrieval, covering a wide range of topics such as feature extraction, content matching, structure parsing,
semantic annotation, multimodal analysis, 3D content retrieval, and user-in-the-loop interaction.
Recently, exciting solutions are emerging in several practical contexts such as mobile media search,
augmented reality, and Web-scale copy detection. However, many fundamental problems remain open,
including but not limited to large-scale semantic annotation, multimedia ontological organization,
and human-machine interaction for searching complex events. In this talk, I will discuss lessons learned
from our past research, drawing from successes and failures in developing and deploying a few image/video
search systems in different domains, and then share views about promising future directions.
Distinguished Lecture, ECE Department, Boston University, Oct. 2010
[slide, 12MB]
Visual Search applications, like iPhone’s SnapTell, Google Goggles, or Ricoh’s HotPaper, have attracted great attention in recent years. Users are able to take a picture of a location, product, or document with their camera phones and trigger a search that pulls up information associated with the image. In this talk, I will
explain the technologies behind these emerging applications, discuss fundamental technical challenges, and survey the state of the art in related fields. I will discuss topics such as fast visual matching over millions of images or more, image retrieval with relevance feedback and semi-supervised learning, enhanced visual search using hundreds of visual recognition models, and hybrid search systems combining computer vision and brain machine interfaces. I will show demos of research
prototypes and discuss several large-scale evaluation initiatives.
NSF Hybrid Neuro-Computer Vision Systems Workshop, April 19-20, 2010.
[slide]
The human vision system is able to recognize a wide range of targets under challenging conditions, but has limited throughput. Machine vision and automatic content analytics can process images at high speed, but suffer from inadequate recognition accuracy for general target classes. In this talk, we present a new paradigm to explore and combine the strengths of both systems. A single-trial EEG-based brain-computer interface (BCI) subsystem is used to detect objects of interest of arbitrary classes from an initial subset of images. The EEG detection outcomes are used as noisy labels for a graph-based semi-supervised learning subsystem, which refines and propagates the labels to retrieve relevant images from a much larger pool. The combined strategy is unique in its generality, robustness, and high throughput. It has great potential for advancing the state of the art in media retrieval applications. We will discuss the performance gains of the proposed hybrid system with multiple and diverse image classes over several data sets, including those commonly used in object recognition (Caltech 101) and remote sensing images. (Joint work with J. Wang, E. Pohlmeyer, B. Hanna, Y.-G. Jiang, and P. Sajda.)
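The step of propagating noisy labels over a graph can be sketched in a few lines. The code below is a minimal, generic label propagation loop over a small similarity graph, offered as an assumption-laden illustration rather than the system described in the talk; the chain-shaped graph, the `alpha` mixing weight, and the function name `propagate_labels` are all hypothetical.

```python
def propagate_labels(weights, initial, alpha=0.8, iterations=50):
    """Spread initial label scores over a weighted graph.

    weights: square list of lists with non-negative edge weights.
    initial: one score per node (+1 / -1 for labeled nodes, 0 for unlabeled).
    alpha:   how much each node trusts its neighbors versus its own initial label.
    """
    n = len(weights)
    # Row-normalize so each node averages over its neighbors.
    normalized = []
    for row in weights:
        total = sum(row)
        normalized.append([w / total if total else 0.0 for w in row])

    scores = list(initial)
    for _ in range(iterations):
        scores = [
            alpha * sum(normalized[i][j] * scores[j] for j in range(n))
            + (1 - alpha) * initial[i]
            for i in range(n)
        ]
    return scores


# Illustrative chain of five images: 0 - 1 - 2 - 3 - 4.
chain = [
    [0, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
]
# Assumed noisy labels: image 0 flagged as a target, image 4 as a non-target.
noisy_labels = [1.0, 0.0, 0.0, 0.0, -1.0]

for node, score in enumerate(propagate_labels(chain, noisy_labels)):
    print(node, round(score, 3))
```

Unlabeled images inherit scores from their neighbors, so images can then be ranked by the propagated score even though only a few of them carried labels initially.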
Keynote Speech at International Conference on Multimedia and Expo (ICME), New York, June-July 2009.
[slide, 10MB]
Finding information from large image/video sources poses a challenging yet exciting problem. Two well-known barriers preventing successful solutions to date are
the semantic gap and the user gap. The former refers to the difficulty in inferring high-level semantic labels from low-level pixel data. The latter reflects users' frustration in expressing their information needs for visual content using existing systems. In this talk, I will present recent research trends aiming at solving both problems. First, I will survey various approaches (expert driven, linguistics based, and user driven) taken in defining large-scale multimedia concept ontologies, covering thousands of visual concepts from different categories (scene, object, event, people, etc.). I will then discuss the state of the art and open issues of automatic concept categorization, which, if successful, promises to fill the semantic gap. I will emphasize issues of learning from a large pool of data with limited or imperfect labels (e.g., data from the Internet). Second, to address the user gap, I will present efforts in developing intuitive interactive tools for video search. I will demonstrate a novel paradigm with unique emphasis on helping users make sense of visual data throughout the search process, from initial query formulation to real-time navigation of the concept space and result sets simultaneously.
Keynote Speech at International Workshop on Image Analysis for Multimedia Interactive Services, Santorini, Greece, June 2007
[slide, 6MB]
With the prevalent success of Internet search, researchers are facing new
opportunities and challenges – developing next-generation multimedia
search technologies that may reach a performance level similar to that of
text search. Despite the grand scale of the challenges, some promising directions
have emerged recently. In this talk, I will focus on two exciting
areas – semantic annotation and multimedia document ranking. For the
former, we are witnessing significant developments in large-scale image/video
collections for benchmarking, multimedia lexicons, and a large number of
semantic classifiers. For example, we have developed classifiers for 374
semantic concepts with encouraging performance using more than 160 hours
of digital videos in TRECVID 2006. The collective power of a large number
of semantic models offers great potential – I will share recent results
of this approach in video searching, topic threading, and temporal pattern
mining. For the second area, I will present recent efforts to model video
retrieval as a document ranking problem similar to that used for page ranking
of web search. I will discuss how the semantic models and visual duplicate
information may be used to approximate the information required in constructing
document link graphs. I will conclude the talk with discussions of open
issues in this dynamic and exciting area.
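To make the document-ranking analogy concrete, the sketch below runs a standard PageRank-style power iteration over a small link graph in which videos are connected when they share near-duplicate content. This is a textbook formulation used for illustration, not the ranking method presented in the talk; the graph, the damping factor of 0.85, and the function name `rank_documents` are assumptions.

```python
def rank_documents(links, damping=0.85, iterations=100):
    """PageRank-style scores for a link graph.

    links: dict mapping each document to the list of documents it links to.
           Every linked document must also appear as a key.
    """
    docs = sorted(links)
    n = len(docs)
    rank = {doc: 1.0 / n for doc in docs}
    for _ in range(iterations):
        new_rank = {doc: (1.0 - damping) / n for doc in docs}
        for doc in docs:
            # A document with no outgoing links spreads its score evenly.
            targets = links[doc] or docs
            share = damping * rank[doc] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical near-duplicate links among four video stories (symmetric).
duplicate_links = {
    "video_a": ["video_b", "video_c", "video_d"],
    "video_b": ["video_a", "video_c"],
    "video_c": ["video_a", "video_b"],
    "video_d": ["video_a"],
}

scores = rank_documents(duplicate_links)
for doc in sorted(scores, key=scores.get, reverse=True):
    print(doc, round(scores[doc], 3))
```

In this toy graph, the video that shares duplicates with the most other videos receives the highest score, which mirrors how heavily linked pages rise in web search rankings.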
Keynote Speech at IEEE Multimedia Signal Processing Workshop (MMSP), Shanghai, Oct. 2005
Keynote Speech at Conference on Computer Vision and Graphic Image Processing (CVGIP), Taipei, 2005
[slide, 3MB]
With the significant progress made in information analysis in text, audio,
and video and the recent availability of large-scale benchmarking events,
new opportunities emerge in developing and testing novel frameworks for
integrating multi-modal information to solve many challenging problems,
such as automatic annotation, story segmentation, multi-modal retrieval,
and topic clustering. In this talk, I discuss the opportunities, state
of the art, and open research issues in using multi-modal integration in
video indexing. In addition, I discuss applications of some of the techniques
to another class of problems – video mining, namely, automatic discovery
of meaningful patterns in videos without expert domain knowledge or manual
supervision. Case studies showing promising performance will be described,
primarily in the broadcast news video domain.
Invited paper at ICASSP, March 2005, Philadelphia. [slide]
Joint paper with R. Manmatha and Tat-Seng Chua.
I gave an overview of recent approaches and progress in fusing multi-modal
features for solving different problems in video indexing, such as story
segmentation, topic detection and tracking, multi-modal retrieval, and automatic
annotation. I reviewed the use of common mathematical models such as maximum
entropy, probabilistic clustering, query-dependent retrieval, and cross-media
relevance models. In addition, I discussed the new issues arising from the
unique problems in dealing with heterogeneous, high-dimensional multimedia
features.
[slide] [project web site]
Overview of our new project aiming at blind detection of digital tampering of photographs, without using any watermark or digital signature. Our approach combines statistical modeling of natural image signals, camera filters and transfer functions, and 3D scene-lighting-reflectance modeling. We have developed algorithms and software for detecting block-level splicing and computer graphics generated images. We have also constructed benchmark data sets that can be used by researchers for evaluating performance.
Keynote Speech at International Conference on Image and Video Retrieval, Dublin, Ireland, July 2004. [slide]
In this talk, I advocated a new research direction of unsupervised mining
of patterns from large collections of videos. Patterns are recurrent entities
that exhibit consistent spatio-temporal structures and attributes. In particular,
I focused on continuous, stochastic temporal video patterns that may correspond
to useful semantics such as play/break events in sports, production patterns
in news broadcast, recurrent human activities in surveillance video, or
news topical threads across multiple news channels. I presented our recent
results in using Hierarchical HMM to discover the temporal patterns at multiple
levels from sports and news video. I also presented several fusion methods
using the news transcripts (through ASR or closed captions) to automatically
discover the meanings of the discovered temporal patterns (tokens).
Keynote Speech at International Conference on Image Analysis and Processing (ICIAP), Mantova, Italy, Sept. 2003. [slide]
In this talk, I used ubiquitous media access scenarios to advocate exploration of content analysis techniques for new applications such as adaptive video presentation, streaming, and transcoding. I presented the need and some solutions for real-time event detection and unsupervised pattern discovery in new domains. I also described our recent results in real-time content-based utility function prediction for automatically selecting the optimal MPEG-4 transcoding options.
I presented our recent work on unsupervised discovery of temporal patterns in video sequences. Specifically, we used Hierarchical HMM to model the multi-level events in video. We investigated the difficult issues of parameter estimation, model adaptation, and feature selection. Experiments on soccer and baseball videos showed promising results, discovering the play/break structures automatically with an accuracy level comparable to that of supervised approaches.
This talk includes an overview of our research in video indexing, structure discovery, skimming, and their applications. Specific topics include
I presented the utility-based optimization framework to model the relationships between different types of resources, utilities, and adaptation operations. The framework is analogous to the conventional rate-distortion model, in which rate (resource), distortion (utility), and coding parameters (adaptation operators) are considered. I described how the framework can be used to help derive optimal solutions for video skimming and transcoding operator selection.
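One simple reading of this framework is a constrained selection problem: among candidate adaptation operators, pick the one with the highest utility whose resource cost fits the available budget. The sketch below shows only that reading; the operator names, resource figures, and utility values are invented for illustration and do not come from the presentation.

```python
from typing import NamedTuple, Optional, Sequence


class Operator(NamedTuple):
    """A candidate adaptation operation (all values here are hypothetical)."""
    name: str
    resource: float  # e.g., bit rate in kbps needed after adaptation
    utility: float   # e.g., predicted quality on a 0-1 scale


def select_operator(candidates: Sequence[Operator], budget: float) -> Optional[Operator]:
    """Return the highest-utility operator that fits the resource budget."""
    feasible = [op for op in candidates if op.resource <= budget]
    if not feasible:
        return None
    return max(feasible, key=lambda op: op.utility)


candidates = [
    Operator("drop_b_frames", resource=300.0, utility=0.6),
    Operator("requantize", resource=450.0, utility=0.8),
    Operator("no_adaptation", resource=900.0, utility=1.0),
]

print(select_operator(candidates, budget=500.0))
```

The analogy with rate-distortion optimization is direct: the budget plays the role of rate, utility replaces (negative) distortion, and the operator choice replaces the coding parameters.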
In this panel presentation, I attempted to provide a list of criteria for assessing the potential impact of any automatic content-based analysis technique. I tried to justify the adopted criteria, and applied them to some application systems we are developing in the medical, sports, and film domains.
The paper I wrote for the vision column of IEEE Multimedia Magazine, "The Holy Grail of Content-Based Media Analysis," may be a good companion reading to the above presentation.
We proposed a new paradigm called content-adaptive video streaming. The focus is to explore the synergy among content analysis, video coding, and transmission. We presented a real-time system that dynamically changes the resource allocation (e.g., bit rate) within a live video streaming session according to the "importance" of each video segment. Segment importance is defined based on high-level events detectable in specific domains, such as pitching and scoring in sports.
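A minimal sketch of importance-driven allocation follows, assuming the importance of each segment in a short window is already known and that bit rate is split proportionally above a guaranteed floor. The proportional rule, the floor, and the function name `allocate_bit_rates` are assumptions for illustration; the presentation does not specify the allocation rule used by the real-time system.

```python
def allocate_bit_rates(importances, total_kbps, floor_kbps=0.0):
    """Split a bit-rate budget across segments in proportion to importance.

    importances: non-negative importance score per segment.
    total_kbps:  total budget for the window of segments.
    floor_kbps:  minimum rate guaranteed to every segment.
    """
    count = len(importances)
    if count == 0:
        return []
    remaining = total_kbps - floor_kbps * count
    if remaining < 0:
        raise ValueError("budget is too small to give every segment the floor rate")
    total_importance = sum(importances)
    if total_importance == 0:
        # No segment stands out, so share the budget evenly.
        return [total_kbps / count] * count
    return [
        floor_kbps + remaining * weight / total_importance
        for weight in importances
    ]


# Hypothetical window: a routine segment followed by a scoring event.
print(allocate_bit_rates([1, 3], total_kbps=400, floor_kbps=50))
```

A live system would have to estimate importance on the fly and re-run the allocation as new segments arrive; this sketch covers only the allocation step for a single window.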