January 2016, University of Amsterdam
Thanks to the massive growth of big multimedia data, we are celebrating the rapid advances in various domains including mobile consumers (photo tagging, mobile visual search), science (galaxy image classification), and security (surveillance video). When replicating such success to other domains, challenges arise - how to handle label scarcity, data imprecision, and the prohibitive complexity of machine learning models. In this talk, I will first quickly share some emerging ideas in addressing such issues. Then I will move on to explore new frontiers in multimedia information extraction, especially those closely related to online human communication. Investigation of such problems is particularly timely given the Web today has become substantially dominated by visual content and visual communication. Specifically, we study how machines can infer emotions and sentiments of users through the visual content they share online. Additionally, can machines automatically discover knowledge and its structures (such as arguments and roles in real-world events) from multimedia sensory data? Outcomes of such studies will greatly contribute to the foundation for intelligent human-machine interaction and autonomous knowledge extraction from big data.
January 2015, Distinguished Seminar Lecture, Department of Electrical Engineering, UCLA
Finding nearest neighbors among millions or billions of samples in high-dimensional spaces is a common yet challenging task in many large-scale applications, such as image retrieval, graph construction, and stereo vision. A scalable indexing scheme is needed to achieve constant or sublinear complexity with satisfactory accuracy, search time, and storage cost. The well-known curse of dimensionality makes existing solutions impractical. Recent advances in locality sensitive hashing and compact code learning show promise by hashing high-dimensional features into a small number of binary bits while preserving proximity in the original feature space. In this talk, I will survey open issues and recent advances in this area with a focus on (1) supervised and semi-supervised hash code learning that takes advantage of additional labeled information, (2) circulant hashing structures that borrow fast transform ideas originating in signal processing, and (3) the idea of sparse embedding onto anchors for handling large-scale graph hashing. We will show the utility of these techniques in several practical applications such as real-time mobile visual search, large-scale image retrieval, and accelerated learning of filters in deep convolutional neural networks. (Joint work with Sanjiv Kumar, Wei Liu, Jun Wang, Junfeng He, and Felix Yu)
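As a rough illustration of the basic idea of hashing features into a few bits while preserving proximity, the sketch below uses plain random-hyperplane hashing (sign of a random projection). It is not one of the learned, circulant, or anchor-based schemes covered in the talk; the function names, the 16-bit code length, and the toy vectors are all assumptions made for this example.

```python
import random

def make_hyperplanes(num_bits, dim, seed=0):
    """Draw one random Gaussian hyperplane normal per output bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]

def hash_code(vector, hyperplanes):
    """Map a real-valued vector to a tuple of bits (one sign test per hyperplane)."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )

def hamming(code_a, code_b):
    """Number of bit positions in which two codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))

# Hypothetical toy data: a query, a rescaled copy, and its mirror image.
planes = make_hyperplanes(num_bits=16, dim=4)
query = [1.0, 0.5, -0.2, 0.3]
scaled = [2.0 * x for x in query]   # same direction as the query
opposite = [-x for x in query]      # opposite direction

print(hamming(hash_code(query, planes), hash_code(scaled, planes)))
print(hamming(hash_code(query, planes), hash_code(opposite, planes)))
```

Vectors pointing in the same direction receive identical codes, while the mirrored vector differs in every bit, so Hamming distance between codes tracks angular distance between the original vectors.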
September 2014, Invited Talk, Workshop on Parts and Attributes, European Conference on Computer Vision (ECCV)
Attributes and parts are intuitive representations for real-world objects and have been shown to be effective in recent research on object recognition. An analogous framework has been used in the multimedia community using "concepts" for describing high-level complex events such as "birthday party" or "changing a vehicle tire." Concepts involve objects, scenes, actions, activities, and other syntactic elements usually seen in video events. In this talk, I will address several fundamental issues encountered when developing a concept-based event framework: how to determine the basic concepts needed by humans when annotating video events; how to use Web mining to automatically discover a large concept pool for event representation; how to handle the weak supervision problem when concept labels are assigned to long video clips without precise timing; and finally how the concept classifier pool can be used to help retrieve novel events that have not been seen before (namely the zero-shot retrieval problem).
September 2014, Invited Talk, Workshop on Storytelling with Images and Videos (VisStory), European Conference on Computer Vision (ECCV)
As visual content such as images and video becomes pervasive on the Web and various forms of social media, there is growing interest in understanding how visual content influences outcomes of social communication online. Visual content is generally considered important in attracting user interest and eliciting responses in today's social media platforms. In particular, content conveying strong emotions is often used to make messages viral. In this talk, I describe a fully automatic system (AICER) we recently developed. AICER can recognize visual attributes appearing in the picture content, predict the likely viewer responses, and suggest plausible comments to assist users in social communication. Our approach emphasizes a mid-level concept representation, in which the intended affects of the image publisher are characterized by a large pool of visual concepts (termed PACs) detected directly from image content instead of textual metadata. Evoked viewer affects are represented by concepts (termed VACs) mined from online comments, and statistical methods are used to model the correlations between these two types of concepts. [online demo]
June 24, 2013, Keynote Talk, Workshop on Big Data Computer Vision, IEEE CVPR
A picture is worth one thousand words, but what words should be used to describe the sentiments and emotions conveyed in the increasingly popular social multimedia? I will present a principled approach that combines sound structures from psychology with folksonomy information extracted from social multimedia to develop a visual sentiment ontology consisting of about 1200 classes. I will also present SentiBank, a library of machine learning classifiers trained using this ontology. Visualization tools supporting intuitive exploration of the rich visual sentiment space will be discussed. The ontology, dataset, and classifiers will be made available.
June 1, 2013, Columbia Alumni Reunion Day Intellectual Talks
Wondrous technologies, like multimedia object recognition, augmented reality, and personalized news, are being added to mobile and social platforms at an explosive pace. How do they work and what are the remaining technical challenges? This talk will focus on such issues and demonstrate some of the prototype systems.
Oct/Dec 2012, ECCV and NIPS, joint work with Junfeng He, Sanjiv Kumar, Wei Liu, and Jun Wang
Finding nearest neighbor data in high-dimensional spaces is a common yet challenging task in many applications, such as computer vision, image retrieval, and large graph construction. A scalable indexing scheme is needed to achieve satisfactory accuracy, search complexity, and storage cost. However, the well-known curse of dimensionality makes existing solutions impractical. Recent advances in locality sensitive hashing show promise by hashing high-dimensional features into a small number of bits while preserving proximity in the original feature space. This approach has attracted great attention due to its simplicity of implementation (projection only), constant search time, low storage cost, and applicability to diverse features and metrics. In this talk, I will first survey a few recent methods that extend basic hashing methods to incorporate labeled information through supervised and semi-supervised hashing and that employ hyperplane hashing for finding nearest points to subspaces (e.g., planes). I will then demonstrate the practical utility of compact hashing methods in solving several challenging problems of large-scale mobile visual search: low bandwidth, limited processing power on mobile devices, and the need to search large databases on servers. Finally, we study the fundamental questions of high-dimensional search: how is nearest neighbor search performance affected by data size, dimension, and sparsity, and can we predict the performance of hashing methods over a data set before implementation?
Smartphone cameras provide new ways of sensing the real-world environment. The augmented capability can be used to find information about the surrounding scenes or objects through visual matching over large data sources at remote servers. Recent examples, such as the Google Glass project, offer interesting promise for such functionalities. However, visual search on mobile devices presents new technical challenges, such as limited power, bandwidth, and image quality. In this talk, I will describe solutions addressing such challenges, and demonstrate a large mobile product search system capable of searching one million product images in near real time. The system leverages recent advances in visual feature matching and compact hash-based indexing, which are well suited to the large mobile visual search scenario. I will review principles and optimization techniques for designing compact hash codes, a popular choice for solving general large-scale nearest neighbor search problems. Additionally, to explore the human-in-the-loop power, I will present another system, called Active Query Sensing, which aims at a more intuitive mobile visual search experience. It uses visual analysis to discover the best view angle and guide the user to capture the best queries for location recognition.
UCLA Institute for Pure and Applied Mathematics, Workshop on Large Scale Multimedia Search, Jan. 2012
Graphs are a popular data representation capturing relations among samples, such as images and documents. Many successful graph-based techniques, such as Regularized Laplacian and Random Walk, have been used for multimedia applications in retrieval and classification. In this talk, I will review a few novel graph-based techniques designed specifically to handle the new challenges associated with large-scale noisy multimedia data encountered on the Web. I will review (1) label diagnosis and spectral filtering techniques for removing unreliable labels, (2) anchor graph methods for scaling up graph-based techniques to gigantic data sets, and (3) multi-edge graphs that capture heterogeneous similarities among multimedia data. Applications in Web multimedia retrieval and novel systems searching images with Brain Machine Interfaces will be presented.
Keynote Speech, ACM SIGMM Technical Achievement Award, Nov. 2011, ACMMM, Scottsdale, AZ
In the past two decades, we have witnessed burgeoning research on content-based multimedia information retrieval, covering a wide range of topics such as feature extraction, content matching, structure parsing, semantic annotation, multimodal analysis, 3D content retrieval, and user-in-the-loop interaction. Recently, exciting solutions have been emerging in several practical contexts such as mobile media search, augmented reality, and Web-scale copy detection. However, many fundamental problems remain open, including but not limited to large-scale semantic annotation, multimedia ontological organization, and human-machine interaction for searching complex events. In this talk, I will discuss lessons learned from our past research, drawing from successes and failures in developing and deploying a few image/video search systems in different domains, and then share views about promising future directions.
Distinguished Lecture, ECE Department, Boston University, Oct. 2010
Visual search applications, like iPhone’s SnapTell, Google Goggles, or Ricoh’s HotPaper, have attracted great attention in recent years. Users are able to take a picture of a location, product, or document with their camera phones and trigger a search that pulls up information associated with the image. In this talk, I will explain the technologies behind these emerging applications, discuss fundamental technical challenges, and survey the state of the art in related fields. I will discuss topics such as fast visual matching over millions of images or more, image retrieval with relevance feedback and semi-supervised learning, enhanced visual search using hundreds of visual recognition models, and hybrid search systems combining computer vision and brain machine interfaces. I will show demos of research prototypes and discuss several large-scale evaluation initiatives.
NSF Hybrid Neuro-Computer Vision Systems Workshop, April 19-20, 2010.
The human vision system is able to recognize a wide range of targets under challenging conditions, but has limited throughput. Machine vision and automatic content analytics can process images at high speed, but suffer from inadequate recognition accuracy for general target classes. In this talk, we present a new paradigm to explore and combine the strengths of both systems. A single-trial EEG-based brain machine interface (BCI) subsystem is used to detect objects of interest of arbitrary classes from an initial subset of images. The EEG detection outcomes are used as noisy labels for a graph-based semi-supervised learning subsystem, which refines and propagates the labels to retrieve relevant images from a much larger pool. The combined strategy is unique in its generality, robustness, and high throughput. It has great potential for advancing the state of the art in media retrieval applications. We will discuss the performance gains of the proposed hybrid system with multiple and diverse image classes over several data sets, including those commonly used in object recognition (Caltech 101) and remote sensing images. (joint work with J. Wang, E. Pohlmeyer, B. Hanna, Y.-G. Jiang, and P. Sajda)
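The label propagation step described above can be illustrated with a generic graph-based scheme. The sketch below is not the system from the talk: it assumes 1-D toy features, a Gaussian similarity graph, and the standard update scores = alpha * P * scores + (1 - alpha) * labels, with all names and parameter values chosen only for illustration.

```python
import math

def propagate_labels(points, noisy_labels, sigma=0.5, alpha=0.9, iterations=50):
    """Spread a few (possibly noisy) labels over a Gaussian similarity graph."""
    n = len(points)
    # Pairwise Gaussian similarities, with no self-loops.
    weights = [
        [0.0 if i == j else math.exp(-((points[i] - points[j]) ** 2) / (2 * sigma ** 2))
         for j in range(n)]
        for i in range(n)
    ]
    # Row-normalize so that each row is a probability distribution over neighbors.
    transition = [[w / sum(row) for w in row] for row in weights]
    scores = list(noisy_labels)
    for _ in range(iterations):
        scores = [
            alpha * sum(p * s for p, s in zip(transition[i], scores))
            + (1 - alpha) * noisy_labels[i]
            for i in range(n)
        ]
    return scores

# Hypothetical toy pool: two clusters of images described by a single feature.
# Only the first image was flagged as "interesting" by the (assumed) EEG detector.
features = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
eeg_labels = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
scores = propagate_labels(features, eeg_labels)
ranking = sorted(range(len(features)), key=lambda i: -scores[i])
print(ranking)
```

Images in the same cluster as the flagged one end up with higher scores than those in the distant cluster, which is the sense in which a handful of detections can retrieve relevant items from a larger pool.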
Keynote Speech at International Conference on Multimedia and Exhibition (ICME), New York, June-July 2009.
Finding information from large image/video sources poses a challenging yet exciting problem. Two well-known barriers preventing successful solutions to date are the semantic gap and the user gap. The former refers to the difficulty in inferring high-level semantic labels from low-level pixel data. The latter reflects users’ frustration in expressing their information needs for visual content using existing systems. In this talk, I will present recent research trends aiming at solving both problems. First, I will survey various approaches (expert driven, linguistics based, and user driven) taken in defining large-scale multimedia concept ontologies, covering thousands of visual concepts from different categories (scene, object, event, people, etc.). I will then discuss the state of the art and open issues of automatic concept categorization, which, if successful, promises to fill the semantic gap. I will emphasize issues of learning from a large pool of data with limited or imperfect labels (e.g., data from the Internet). Second, to address the user gap, I will present efforts in developing intuitive interactive tools for video search. I will demonstrate a novel paradigm with unique emphasis on helping users to make sense of visual data throughout the search process, from initial query formulation to real-time navigation of the concept space and result sets simultaneously.
Keynote Speech at International Workshop on Image Analysis for Multimedia Interactive Services, Santorini, Greece, June 2007
With the prevalent success of Internet search, researchers are facing new opportunities and challenges – developing next-generation multimedia search technologies that may reach a performance level similar to that of text search. Despite the grand scale of the challenges, some promising ground has been gained recently. In this talk, I will focus on two exciting areas – semantic annotation and multimedia document ranking. For the former, we are witnessing significant developments in large-scale image/video collections for benchmarking, multimedia lexicons, and a large number of semantic classifiers. For example, we have developed classifiers for 374 semantic concepts with encouraging performance using more than 160 hours of digital videos in TRECVID 2006. The collective power of a large number of semantic models offers great potential – I will share recent results of this approach in video searching, topic threading, and temporal pattern mining. For the second area, I will present recent efforts to model video retrieval as a document ranking problem similar to that used for page ranking in web search. I will discuss how the semantic models and visual duplicate information may be used to approximate the information required in constructing document link graphs. I will conclude the talk with discussions of open issues in this dynamic and exciting area.
Keynote Speech at IEEE Multimedia Signal Processing Workshop (MMSP), Shanghai, Oct. 2005
Keynote Speech at Conference on Computer Vision and Graphic Image Processing (CVGIP), Taipei, 2005 [slide 3MB]
With the significant progress made in information analysis in text, audio, and video and the recent availability of large-scale benchmarking events, new opportunities emerge in developing and testing novel frameworks for integrating multi-modal information to solve many challenging problems, such as automatic annotation, story segmentation, multi-modal retrieval, and topic clustering. In this talk, I discuss the opportunities, state of the art, and open research issues in using multi-modal integration in video indexing. In addition, I discuss applications of some of the techniques to another class of problems – video mining, namely, automatic discovery of meaningful patterns in videos without expert domain knowledge or manual supervision. Case studies showing promising performance will be described, primarily in the broadcast news video domain.
Invited paper at ICASSP, March 2005, Philadelphia. [slide]
joint paper with R. Manmatha and Tat-Seng Chua.
I gave an overview of recent approaches and progress in fusing multi-modal features for solving different problems in video indexing, such as story segmentation, topic detection and tracking, multi-modal retrieval, and automatic annotation. I reviewed the use of common mathematical models such as maximum entropy, probabilistic clustering, query-dependent retrieval, and cross-media relevance models. In addition, I discussed the new issues arising from the unique problems in dealing with heterogeneous, high-dimensional multimedia features.
[slide] [project web site]
Overview of our new project aiming at blind detection of digital tampering of photographs, without using any watermark or digital signature. Our approach combines statistical modeling of natural image signals, camera filters and transfer functions, and 3D scene-lighting-reflectance modeling. We have developed algorithms and software for detecting block-level splicing and computer graphics generated images. We have also constructed benchmark data sets that can be used by researchers for evaluating performance.
keynote speech at International Conference on Image and Video Retrieval, Dublin, Ireland, July 2004. [slide]
In this talk, I advocated a new research direction of unsupervised mining of patterns from large collections of videos. Patterns are recurrent entities that exhibit consistent spatio-temporal structures and attributes. In particular, I focused on continuous, stochastic temporal video patterns that may correspond to useful semantics such as play/break events in sports, production patterns in news broadcasts, recurrent human activities in surveillance video, or news topical threads across multiple news channels. I presented our recent results in using Hierarchical HMM to discover the temporal patterns at multiple levels from sports and news video. I also presented several fusion methods using the news transcripts (through ASR or closed captions) to automatically discover the meanings of the discovered temporal patterns (tokens).
keynote speech at International Conference on Image Analysis and Processing (ICIAP), Mantova, Italy, Sept. 2003. [slide]
In this talk, I used ubiquitous media access scenarios to advocate exploration of content analysis techniques for new applications such as adaptive video presentation, streaming, and transcoding. I presented the need for, and some solutions to, real-time event detection and unsupervised pattern discovery in new domains. I also described our recent results in real-time content-based utility function prediction for automatically selecting the optimal MPEG-4 transcoding options.
I presented our recent work on unsupervised discovery of temporal patterns in video sequences. Specifically, we use Hierarchical HMM to model the multi-level events in video. We investigated the difficult issues of parameter estimation, model adaptation, and feature selection. Experiments on soccer and baseball videos showed promising results, discovering the play/break structures automatically with an accuracy level comparable to that of supervised approaches.
This talk includes an overview of our research in video indexing, structure discovery, skimming, and their applications. Specific topics include
I presented the utility-based optimization framework to model the relationships between different types of resources, utilities, and adaptation operations. The framework is analogous to the conventional rate-distortion model, in which rate (resource), distortion (utility), and coding parameters (adaptation operators) are considered. I described how the framework can be used to help derive optimal solutions for video skimming and transcoding operator selection.
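One simple reading of the resource/utility/adaptation view above is a constrained selection: among candidate adaptation operators, pick the one with the highest utility whose resource cost fits the budget. The sketch below shows only that reading; the operator names, costs, and utility values are invented for illustration and are not taken from the talk.

```python
def select_operator(options, resource_limit):
    """Return the highest-utility option whose resource cost fits the limit."""
    feasible = [opt for opt in options if opt["resource"] <= resource_limit]
    if not feasible:
        return None  # no adaptation operator satisfies the constraint
    return max(feasible, key=lambda opt: opt["utility"])

# Hypothetical adaptation operators: resource is bit rate in kbps,
# utility is an assumed quality score in [0, 1].
options = [
    {"name": "drop_frames", "resource": 200, "utility": 0.55},
    {"name": "requantize", "resource": 350, "utility": 0.70},
    {"name": "full_quality", "resource": 800, "utility": 0.95},
]

best = select_operator(options, resource_limit=400)
print(best["name"])
```

As in a rate-distortion trade-off, tightening the resource limit moves the choice toward cheaper operators with lower utility.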
In this panel presentation, I attempted to provide a list of criteria for assessing potential impact of any automatic content-based analysis technique. I tried to justify the adopted criteria, and applied them to some applications systems we are developing in medical, sports, and film domains.
The paper I wrote for the vision column of IEEE Multimedia Magazine, "The Holy Grail of Content-Based Media Analysis," is a good companion reading for the above presentation.
We proposed a new paradigm called content-adaptive video streaming. The focus is to explore the synergy between content analysis, video coding, and transmission. We presented a real-time system that dynamically changes the resource allocation (e.g., bit rate) within a live video streaming session, according to the "importance" of each video segment. The segment importance is defined based on high-level events detectable in specific domains, such as pitching and scoring in sports.
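To make the idea of importance-driven resource allocation concrete, the sketch below splits a bit-rate budget across segments in proportion to an importance score. The proportional rule, the function name, and the numbers are assumptions for illustration; the allocation policy of the actual system is not described here.

```python
def allocate_bit_rates(importances, total_kbps):
    """Split a total bit-rate budget across segments in proportion to importance."""
    total_importance = sum(importances)
    if total_importance <= 0:
        raise ValueError("at least one segment must have positive importance")
    return [total_kbps * weight / total_importance for weight in importances]

# Hypothetical session: the middle segment contains a detected event
# (e.g., a pitch) and is rated three times as important as its neighbors.
rates = allocate_bit_rates([1, 3, 1], total_kbps=500)
print(rates)
```

A real streaming system would also need per-segment minimum rates and buffer constraints, which this sketch leaves out.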
(abstract and bio)