

Open-Vocabulary Object Detection Using Captions.  Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, Shih-Fu Chang  In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),   Virtual, June, 2021 (Oral presentation)   [pdf]

Despite the remarkable accuracy of deep neural networks in object detection, they are costly to train and scale due to supervision requirements. Particularly, learning more object categories typically requires proportionally more bounding box annotations. Weakly supervised and zero-shot learning techniques have been explored to scale object detectors to more categories with less supervision, but they have not been as successful and widely adopted as supervised models. In this paper, we put forth a novel formulation of the object detection problem, namely open-vocabulary object detection, which is more general, more practical, and more effective than weakly supervised and zero-shot approaches. We propose a new method to train object detectors using bounding box annotations for a limited set of object categories, as well as image-caption pairs that cover a larger variety of objects at a significantly lower cost. We show that the proposed method can detect and localize objects for which no bounding box annotation is provided during training, at a significantly higher accuracy than zero-shot approaches. Meanwhile, objects with bounding box annotation can be detected almost as accurately as supervised methods, which is significantly better than weakly supervised baselines. Accordingly, we establish a new state of the art for scalable object detection.


VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs.  Xudong Lin, Gedas Bertasius, Jue Wang, Shih-Fu Chang, Devi Parikh, Lorenzo Torresani  In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),   Virtual, June, 2021   [pdf]

We present VX2TEXT, a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio. In order to leverage transformer networks, which have been shown to be effective at modeling language, each modality is first converted into a set of language embeddings by a learnable tokenizer. This allows our approach to perform multimodal fusion in the language space, thus eliminating the need for ad-hoc cross-modal fusion modules. To address the non-differentiability of tokenization on continuous inputs (e.g., video or audio), we utilize a relaxation scheme that enables end-to-end training. Furthermore, unlike prior encoder-only models, our network includes an autoregressive decoder to generate open-ended text from the multimodal embeddings fused by the language encoder. This renders our approach fully generative and makes it directly applicable to different "video+x to text" problems without the need to design specialized network heads for each task. The proposed framework is not only conceptually simple but also remarkably effective: experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks--captioning, question answering and audio-visual scene-aware dialog.


Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos.  Sijie Song, Xudong Lin, Jiaying Liu, Zongming Guo, Shih-Fu Chang  In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),   Virtual, June, 2021   [pdf]

In this paper, we address the problem of referring expression comprehension in videos, which is challenging due to complex expression and scene dynamics. Unlike previous methods which solve the problem in multiple stages (i.e., tracking, proposal-based matching), we tackle the problem from a novel perspective, co-grounding, with an elegant one-stage framework. We enhance the single-frame grounding accuracy by semantic attention learning and improve the cross-frame grounding consistency with co-grounding feature learning. Semantic attention learning explicitly parses referring cues in different attributes to reduce the ambiguity in the complex expression. Co-grounding feature learning boosts visual feature representations by integrating temporal correlation to reduce the ambiguity caused by scene dynamics. Experiment results demonstrate the superiority of our framework on the video grounding datasets VID and LiOTB in generating accurate and stable results across frames. Our model is also applicable to referring expression comprehension in images, illustrated by the improved performance on the RefCOCO dataset. Our project is available at https://sijiesong.github.io/co-grounding.


Discovering Image Manipulation History by Pairwise Relation and Forensics Tools.  Xu Zhang, Zhaohui H. Sun, Svebor Karaman, and Shih-Fu Chang  In IEEE Journal of Selected Topics in Signal Processing   14, no. 5 (2020): 1012-1023.  [pdf]

Given a potentially manipulated probe image, provenance analysis aims to find all images derived from the probe (offspring) and all images from which the probe is derived (ancestors) in a large dataset (provenance filtering), and to reconstruct the manipulation history with the retrieved images (provenance graph building). In this paper, we address two major challenges in provenance analysis: retrieving the source images of the small regions that are spliced into the probe image, and detecting source images within the search results. For the former challenge, we propose to detect spliced regions by pairwise image comparison and to use only local features extracted from the spliced region to perform an additional search. This removes the influence of the background and greatly improves the recall. For the latter, we propose to learn a pairwise ancestor-offspring detector and use it jointly with a holistic image manipulation detector to identify the source image. The proposed provenance analysis system has performed remarkably in evaluations using comprehensive provenance datasets. It is the winning solution for the NIST Media Forensics Challenge (MFC) in 2018, 2019, and 2020. In MFC 2019, our provenance results achieved a 12% improvement in filtering and a 20% gain in oracle provenance graph building over the alternative methods. On the real-world Reddit dataset, the edge overlap between our reconstructed provenance graphs and the ground-truth graphs is 5 times better than that of the state-of-the-art system.


Learning Visual Commonsense for Robust Scene Graph Generation.  Alireza Zareian, Haoxuan You, Zhecan Wang, Shih-Fu Chang  In European Conference on Computer Vision (ECCV)   Online, August, 2020  [pdf]

Scene graph generation models understand the scene through object and predicate recognition, but are prone to mistakes due to the challenges of perception in the wild. Perception errors often lead to nonsensical compositions in the output scene graph, which do not follow real-world rules and patterns, and can be corrected using commonsense knowledge. We propose the first method to acquire visual commonsense such as affordance and intuitive physics automatically from data, and use that to enhance scene graph generation. To this end, we extend transformers to incorporate the structure of scene graphs, and train our Global-Local Attention Transformer on a scene graph corpus. Once trained, our commonsense model can be applied on any perception model and correct its obvious mistakes, resulting in a more commonsensical scene graph. We show the proposed model learns commonsense better than any alternative, and improves the accuracy of any scene graph generation model. Nevertheless, strong disproportions in real-world datasets could bias commonsense to miscorrect already confident perceptions. We address this problem by devising a fusion module that compares predictions made by the perception and commonsense models, and the confidence of each, to make a hybrid decision. Our full model learns commonsense and knows when to use it, which is shown effective through experiments, resulting in a new state of the art.


Bridging Knowledge Graphs to Generate Scene Graphs.  Alireza Zareian, Svebor Karaman, Shih-Fu Chang  In European Conference on Computer Vision (ECCV)   Online, August, 2020  [pdf]

Scene graphs are powerful representations that parse images into their abstract semantic elements, i.e., objects and their interactions, which facilitates visual comprehension and explainable reasoning. On the other hand, commonsense knowledge graphs are rich repositories that encode how the world is structured, and how general concepts interact. In this paper, we present a unified formulation of these two constructs, where a scene graph is seen as an image-conditioned instantiation of a commonsense knowledge graph. Based on this new perspective, we re-formulate scene graph generation as the inference of a bridge between the scene and commonsense graphs, where each entity or predicate instance in the scene graph has to be linked to its corresponding entity or predicate class in the commonsense graph. To this end, we propose a novel graph-based neural network that iteratively propagates information between the two graphs, as well as within each of them, while gradually refining their bridge in each iteration. Our Graph Bridging Network, GB-Net, successively infers edges and nodes, allowing it to simultaneously exploit and refine the rich, heterogeneous structure of the interconnected scene and commonsense graphs. Through extensive experimentation, we showcase the superior accuracy of GB-Net compared to the most recent methods, resulting in a new state of the art. We publicly release the source code of our method.


Context-Gated Convolution.  Xudong Lin, Lin Ma, Wei Liu, Shih-Fu Chang  In European Conference on Computer Vision (ECCV)   Online, August, 2020  [pdf]

As the basic building block of Convolutional Neural Networks (CNNs), the convolutional layer is designed to extract local patterns and inherently lacks the ability to model global context. Many recent efforts have been devoted to complementing CNNs with global modeling ability, especially by a family of works on global feature interaction. In these works, the global context information is incorporated into local features before they are fed into convolutional layers. However, neuroscience research reveals that the neurons' ability to modify their functions dynamically according to context is essential for perceptual tasks, an ability that has been overlooked in most CNNs. Motivated by this, we propose a novel Context-Gated Convolution (CGC) to explicitly modify the weights of convolutional layers adaptively under the guidance of global context. As such, being aware of the global context, the modulated convolution kernel of our proposed CGC can better extract representative local patterns and compose discriminative features. Moreover, our proposed CGC is lightweight, applicable to modern CNN architectures, and consistently improves the performance of CNNs in extensive experiments on image classification, action recognition, and machine translation.
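The central mechanism described above — deriving a gate from pooled global context and using it to modulate the convolution kernel before applying it — can be illustrated with a minimal NumPy sketch. The 1-D setting, single output channel, and linear-plus-sigmoid gating network below are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def context_gated_conv1d(x, kernel, w_gate, b_gate):
    """Toy 1-D context-gated convolution (forward pass only).

    x:       (length, channels) input signal
    kernel:  (k, channels) convolution kernel for one output channel
    w_gate, b_gate: parameters of a tiny linear layer mapping the
        global context vector to one gate value per kernel weight.
    """
    # Global context: average the input over the spatial dimension.
    context = x.mean(axis=0)                                   # (channels,)
    # Map context to a multiplicative gate in (0, 1), one per weight.
    gate = 1.0 / (1.0 + np.exp(-(w_gate @ context + b_gate)))  # (k*channels,)
    modulated = kernel * gate.reshape(kernel.shape)
    # Plain 'valid' convolution with the context-modulated kernel.
    k = kernel.shape[0]
    return np.array([(x[i:i + k] * modulated).sum()
                     for i in range(x.shape[0] - k + 1)])
```

A full CGC module would modulate the kernels of every output channel and train the gating parameters end-to-end with the network; this sketch shows only the gating path.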


Cross-media Structured Common Space for Multimedia Event Extraction.  Manling Li*, Alireza Zareian*, Qi Zeng, Spencer Whitehead, Di Lu, Heng Ji and Shih-Fu Chang (*equal contributions)  In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL2020)   Seattle, WA   July, 2020  (Oral presentation) [pdf] [Project Page]

We introduce a new task, MultiMedia Event Extraction (M2E2), which aims to extract events and their arguments from multimedia documents. We develop the first benchmark and collect a dataset of 245 multimedia news articles with extensively annotated events and arguments. We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information from textual and visual data into a common embedding space. The structures are aligned across modalities by employing a weakly supervised training strategy, which enables exploiting available resources without explicit cross-media annotation. Compared to unimodal state-of-the-art methods, our approach achieves 4.0% and 9.8% absolute F-score gains on text event argument role labeling and visual event extraction. Compared to state-of-the-art multimedia unstructured representations, we achieve 8.3% and 5.0% absolute F-score gains on multimedia event extraction and argument role labeling, respectively. By utilizing images, we extract 21.4% more event mentions than traditional text-only methods.


GAIA: A Fine-grained Multimedia Knowledge Extraction System.  Manling Li*, Alireza Zareian*, Ying Lin, Xiaoman Pan, Spencer Whitehead, Brian Chen, Bo Wu, Heng Ji, Shih-Fu Chang, Clare Voss, Daniel Napierski and Marjorie Freedman (*equal contributions)  In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL2020) Demo Track   Seattle, WA   July, 2020 [pdf]   [Project Page] (Best Demo Paper Award)

We present the first comprehensive, open source multimedia knowledge extraction system that takes a massive stream of unstructured, heterogeneous multimedia data from various sources and languages as input, and creates a coherent, structured knowledge base, indexing entities, relations, and events, following a rich, fine-grained ontology. Our system, GAIA, enables seamless search of complex graph queries, and retrieves multimedia evidence including text, images and videos. GAIA achieves top performance at the recent NIST TAC SM-KBP2019 evaluation. The system is publicly available at GitHub and DockerHub, with complete documentation.


Weakly Supervised Visual Semantic Parsing.  Alireza Zareian, Svebor Karaman, and Shih-Fu Chang  In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)   Seattle, WA   June, 2020   (Oral presentation)  [pdf]  [code]

Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images, enabling deep understanding of visual content, with many applications such as visual reasoning and image retrieval. Nevertheless, existing SGG methods require millions of manually annotated bounding boxes for training, and are computationally inefficient, as they exhaustively process all pairs of object proposals to detect predicates. In this paper, we address those two limitations by first proposing a generalized formulation of SGG, namely Visual Semantic Parsing, which disentangles entity and predicate recognition, and enables sub-quadratic performance. Then we propose the Visual Semantic Parsing Network, VSPNet, based on a dynamic, attention-based, bipartite message passing framework that jointly infers graph nodes and edges through an iterative process. Additionally, we propose the first graph-based weakly supervised learning framework, based on a novel graph alignment algorithm, which enables training without bounding box annotations. Through extensive experiments, we show that VSPNet outperforms weakly supervised baselines significantly and approaches fully supervised performance, while being several times faster. We publicly release the source code of our method.


General Partial Label Learning via Dual Bipartite Graph Autoencoder.  Brian Chen, Bo Wu, Alireza Zareian, Hanwang Zhang, and Shih-Fu Chang  In The Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI)   New York, NY   February, 2020    [pdf]  

We formulate a practical yet challenging problem: General Partial Label Learning (GPLL). Compared to the traditional Partial Label Learning (PLL) problem, GPLL relaxes the supervision assumption from instance-level --- a label set partially labels an instance --- to group-level: 1) a label set partially labels a group of instances, where the within-group instance-label link annotations are missing, and 2) cross-group links are allowed --- instances in a group may be partially linked to the label set from another group. Such ambiguous group-level supervision is more practical in real-world scenarios, as additional annotation on the instance level is no longer required, e.g., face-naming in videos, where the group consists of faces in a frame, labeled by a name set in the corresponding caption. In this paper, we propose a novel graph convolutional network (GCN) called Dual Bipartite Graph Autoencoder (DB-GAE) to tackle the label ambiguity challenge of GPLL. First, we exploit the cross-group correlations to represent the instance groups as dual bipartite graphs: within-group and cross-group, which reciprocally complement each other to resolve the linking ambiguities. Second, we design a GCN autoencoder to encode and decode them, where the decodings are considered as the refined results. It is worth noting that DB-GAE is self-supervised and transductive, as it only uses the group-level supervision without a separate offline training stage. Extensive experiments on two real-world datasets demonstrate that DB-GAE significantly outperforms the best baseline, by an absolute 0.159 in F1-score and 24.8% in accuracy. We further offer analysis on various levels of label ambiguity.


Unsupervised Embedding Learning via Invariant and Spreading Instance Feature.  Mang Ye, Xu Zhang, Pong C. Yuen, and Shih-Fu Chang  In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)   Long Beach, California   June, 2019    [pdf] [code]  

This paper studies the unsupervised embedding learning problem, which requires an effective similarity measurement between samples in a low-dimensional embedding space. Motivated by the positive-concentrated and negative-separated properties observed in category-wise supervised learning, we propose to utilize instance-wise supervision to approximate these properties, aiming to learn data-augmentation-invariant and instance-spread-out features. To achieve this goal, we propose a novel instance-based softmax embedding method, which directly optimizes the ‘real’ instance features on top of the softmax function. It achieves significantly faster learning speed and higher accuracy than all existing methods. The proposed method performs well for both seen and unseen testing categories with cosine similarity. It also achieves competitive performance, even without a pre-trained network, on samples from fine-grained categories.
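The instance-wise softmax idea — treating each training instance as its own class and classifying an augmented view against all instance features by cosine similarity — can be sketched as follows (the temperature value and function name are illustrative, not the paper's exact formulation):

```python
import numpy as np

def instance_softmax_prob(aug_feat, instance_feats, idx, temperature=0.1):
    """Probability that an augmented view is recognized as instance `idx`.

    aug_feat:       (d,) L2-normalized feature of the augmented sample
    instance_feats: (n, d) L2-normalized features of all n instances
    """
    sims = instance_feats @ aug_feat          # cosine similarities, (n,)
    logits = sims / temperature
    logits -= logits.max()                    # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs[idx]
```

Training would maximize this probability for augmented views of instance `idx` (augmentation invariance), which implicitly pushes the features of different instances apart (the spread-out property).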


DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition.  Zheng Shou, Xudong Lin, Yannis Kalantidis, Laura Sevilla-Lara, Marcus Rohrbach, Shih-Fu Chang, and Zhicheng Yan  In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)   Long Beach, California   June, 2019    [pdf]  

Motion has been shown to be useful for video understanding, where motion is typically represented by optical flow. However, computing flow from video frames is very time-consuming. Recent works directly leverage the motion vectors and residuals readily available in the compressed video to represent motion at no cost. While this avoids flow computation, it also hurts accuracy, since the motion vector is noisy and has substantially reduced resolution, which makes it a less discriminative motion representation. To remedy these issues, we propose a lightweight generator network, which reduces noise in motion vectors and captures fine motion details, achieving a more Discriminative Motion Cue (DMC) representation. Since optical flow is a more accurate motion representation, we train the DMC generator to approximate flow using a reconstruction loss and a generative adversarial loss, jointly with the downstream action classification task. Extensive evaluations on three action recognition benchmarks (HMDB-51, UCF-101, and a subset of Kinetics) confirm the effectiveness of our method. Our full system, consisting of the generator and the classifier, is coined DMC-Net; it obtains accuracy close to that of using optical flow while running two orders of magnitude faster at inference time.


Multi-level Multimodal Common Semantic Space for Image-Phrase Grounding.  Hassan Akbari, Svebor Karaman, Surabhi Bhargava, Brian Chen, Carl Vondrick, and Shih-Fu Chang  In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)   Long Beach, California   June, 2019    [pdf] [code]  

We address the problem of phrase grounding by learning a multi-level common semantic space shared by the textual and visual modalities. We exploit multiple levels of feature maps of a Deep Convolutional Neural Network, as well as contextualized word and sentence embeddings extracted from a character-based language model. Following dedicated non-linear mappings for visual features at each level, word, and sentence embeddings, we obtain multiple instantiations of our common semantic space, in which comparisons between any target text and the visual content are performed with cosine similarity. We guide the model by a multi-level multimodal attention mechanism which outputs attended visual features at each level. The best level is chosen to be compared with the text content to maximize the pertinence scores of ground-truth image-sentence pairs. Experiments conducted on three publicly available datasets show significant performance gains (20%-60% relative) over the state-of-the-art in phrase localization and set a new performance record on those datasets. We provide a detailed ablation study to show the contribution of each element of our approach and release our code on GitHub.


Unsupervised Rank-Preserving Hashing for Large-Scale Image Retrieval.  Svebor Karaman, Xudong Lin, Xuefeng Hu, and Shih-Fu Chang  In International Conference on Multimedia Retrieval (ICMR)   Ottawa, Canada   June, 2019    [pdf]  

We propose an unsupervised hashing method, exploiting a shallow neural network, which aims to produce binary codes that preserve the ranking induced by an original real-valued representation. This is motivated by the emergence of small-world graph-based approximate search methods that rely on local neighborhood ranking. We formalize the training process in an intuitive way by considering each training sample as a query and aiming to obtain, using the hash codes, the same ranking of a random subset of the training set as the ranking obtained using the original features. We also explore the use of a decoder to obtain an approximate reconstruction of the original features. At test time, we retrieve the most promising database samples using only the hash codes and perform re-ranking using the reconstructed features, thus allowing the complete elimination of the original real-valued features and the associated high memory cost. Experiments conducted on publicly available large-scale datasets show that our method consistently outperforms all compared state-of-the-art unsupervised hashing methods and that the reconstruction procedure can effectively boost the search accuracy with a minimal constant additional cost.
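The rank-preservation goal can be made concrete with a small sketch that compares, for one query, the database ranking induced by Hamming distance on the hash codes with the ranking induced by Euclidean distance on the original features (the function name and the exact agreement measure are illustrative assumptions, not the paper's training loss):

```python
import numpy as np

def ranking_agreement(query_feat, query_code, feats, codes):
    """Fraction of rank positions where the Hamming-distance ranking
    of the database matches the original real-valued-feature ranking."""
    # Ranking by distance in the original feature space.
    true_rank = np.argsort(np.linalg.norm(feats - query_feat, axis=1),
                           kind="stable")
    # Ranking by Hamming distance between binary codes.
    hamming = (codes != query_code).sum(axis=1)
    hash_rank = np.argsort(hamming, kind="stable")
    return float((true_rank == hash_rank).mean())
```

Conceptually, training pushes this agreement toward 1 for random subsets of the training set, with each training sample acting in turn as the query.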


Low-shot Learning via Covariance-Preserving Adversarial Augmentation Networks.  Hang Gao, Zheng Shou, Alireza Zareian, Hanwang Zhang, and Shih-Fu Chang  In Advances in Neural Information Processing Systems (NIPS)   Montreal, Canada   December, 2018    [pdf] [arxiv]  

Deep neural networks suffer from over-fitting and catastrophic forgetting when trained with small data. One natural remedy for this problem is data augmentation, which has been recently shown to be effective. However, previous works either assume that intra-class variances can always be generalized to new classes, or employ naive generation methods to hallucinate finite examples without modeling their latent distributions. In this work, we propose Covariance-Preserving Adversarial Augmentation Networks to overcome existing limits of low-shot learning. Specifically, a novel Generative Adversarial Network is designed to model the latent distribution of each novel class given its related base counterparts. Since direct estimation on novel classes can be inductively biased, we explicitly preserve covariance information as the "variability" of base examples during the generation process. Empirical results show that our model can generate realistic yet diverse examples, leading to substantial improvements on the ImageNet benchmark over the state of the art.


AutoLoc: Weakly-supervised Temporal Action Localization in Untrimmed Videos.  Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, Shih-Fu Chang  In European Conference on Computer Vision (ECCV)   Munich, Germany   September, 2018    [arxiv]  

Temporal Action Localization (TAL) in untrimmed video is important for many applications, but it is very expensive to annotate segment-level ground truth (action class and temporal boundary). This has raised interest in addressing TAL with weak supervision, namely when only video-level annotations are available during training. However, the state-of-the-art weakly-supervised TAL methods only focus on generating a good Class Activation Sequence (CAS) over time and conduct simple thresholding on the CAS to localize actions. In this paper, we first develop a novel weakly-supervised TAL framework called AutoLoc to directly predict the temporal boundary of each action instance. We propose a novel Outer-Inner-Contrastive (OIC) loss to automatically discover the segment-level supervision needed for training such a boundary predictor. Our method achieves dramatically improved performance: at an IoU threshold of 0.5, our method improves mAP on THUMOS'14 from 13.7% to 21.2% and mAP on ActivityNet from 7.4% to 27.3%. It is also very encouraging to see that our weakly-supervised method achieves results comparable to some fully-supervised methods.
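The Outer-Inner-Contrastive intuition — a good action boundary encloses high class activation inside the segment and low activation in the area just outside it — can be sketched on a 1-D class activation sequence. The inflation ratio and the plain averaging below are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def oic_loss(cas, start, end, inflation=0.25):
    """Outer-Inner-Contrastive loss for one candidate segment.

    cas:        (T,) class activation sequence for one action class
    start, end: inner segment boundaries (inclusive, exclusive)
    The outer area extends the segment by `inflation` * length on each
    side; lower loss = high inner and low surrounding activation.
    """
    length = end - start
    pad = max(1, int(round(inflation * length)))
    lo = max(0, start - pad)
    hi = min(len(cas), end + pad)
    inner = cas[start:end].mean()
    outer_vals = np.concatenate([cas[lo:start], cas[end:hi]])
    outer = outer_vals.mean() if outer_vals.size else 0.0
    return outer - inner
```

Minimizing this loss over candidate boundaries favors segments whose inner activation clearly exceeds the activation just outside them.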


Online Detection of Action Start in Untrimmed, Streaming Videos.  Zheng Shou, Junting Pan, Jonathan Chan, Kazuyuki Miyazawa, Hassan Mansour, Anthony Vetro, Xavier Giro-i-Nieto, Shih-Fu Chang  In European Conference on Computer Vision (ECCV)   Munich, Germany   September, 2018   [arxiv]

We aim to tackle a novel task in action detection - Online Detection of Action Start (ODAS) in untrimmed, streaming videos. The goal of ODAS is to detect the start of an action instance, with high categorization accuracy and low detection latency. ODAS is important in many applications such as early alert generation to allow timely security or emergency response. We propose three novel methods to specifically address the challenges in training ODAS models: (1) hard negative samples generation based on Generative Adversarial Network (GAN) to distinguish ambiguous background, (2) explicitly modeling the temporal consistency between data around action start and data succeeding action start, and (3) adaptive sampling strategy to handle the scarcity of training data. We conduct extensive experiments using THUMOS'14 and ActivityNet. We show that our proposed methods lead to significant performance gains and improve the state-of-the-art methods. An ablation study confirms the effectiveness of each proposed method.


PatternNet: Visual Pattern Mining with Deep Neural Network.  Hongzhi Li, Joseph G. Ellis, Lei Zhang, and Shih-Fu Chang  In International Conference on Multimedia Retrieval (ICMR)   Yokohama, Japan   June, 2018   [arxiv]

Visual patterns represent the discernible regularity in the visual world. They capture the essential nature of visual objects or scenes. Understanding and modeling visual patterns is a fundamental problem in visual recognition that has wide-ranging applications. In this paper, we study the problem of visual pattern mining and propose a novel deep neural network architecture called PatternNet for discovering patterns that are both discriminative and representative. The proposed PatternNet leverages the filters in the last convolutional layer of a convolutional neural network to find locally consistent visual patches, and by combining these filters we can effectively discover unique visual patterns. In addition, PatternNet can discover visual patterns efficiently without performing expensive image patch sampling, and this advantage provides an order-of-magnitude speedup compared to most other approaches. We evaluate the proposed PatternNet subjectively, by showing randomly selected visual patterns discovered by our method, and quantitatively, by performing image classification with the identified visual patterns and comparing our performance with the current state-of-the-art. We also directly evaluate the quality of the discovered visual patterns by leveraging the identified patterns as proposed objects in an image and comparing with other relevant methods. Our proposed network and procedure, PatternNet, is able to outperform competing methods for the tasks described.


Zero-Shot Visual Recognition using Semantics-Preserving Adversarial Embedding Network.  Long Chen, Hanwang Zhang, Jun Xiao, Wei Liu, Shih-Fu Chang  In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)   Salt Lake City, USA   June, 2018    [arxiv]

We propose a novel framework called Semantics-Preserving Adversarial Embedding Network (SP-AEN) for zero-shot visual recognition (ZSL), where test images and their classes are both unseen during training. SP-AEN aims to tackle the inherent problem --- semantic loss --- in the prevailing family of embedding-based ZSL, where some semantics would be discarded during training if they are non-discriminative for training classes, but informative for test classes. Specifically, SP-AEN prevents the semantic loss by introducing an independent visual-to-semantic space embedder which disentangles the semantic space into two subspaces for the two arguably conflicting objectives: classification and reconstruction. Through adversarial learning of the two subspaces, SP-AEN can transfer the semantics from the reconstructive subspace to the discriminative one, accomplishing the improved zero-shot recognition of unseen classes. Compared to prior works, SP-AEN can not only improve classification but also generate photo-realistic images, demonstrating the effectiveness of semantic preservation. On four benchmarks: CUB, AWA, SUN and aPY, SP-AEN considerably outperforms other state-of-the-art methods by absolute 12.2%, 9.3%, 4.0%, and 3.6% in harmonic mean values.


Grounding Referring Expressions in Images by Variational Context.  Hanwang Zhang, Yulei Niu, Shih-Fu Chang  In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)   Salt Lake City, USA   June, 2018   [arxiv]

We focus on grounding (i.e., localizing or linking) referring expressions in images, e.g., "largest elephant standing behind baby elephant". This is a general yet challenging vision-language task, since it requires not only the localization of objects, but also the multimodal comprehension of context --- visual attributes (e.g., "largest", "baby") and relationships (e.g., "behind") that help to distinguish the referent from other objects, especially those of the same category. Due to the exponential complexity involved in modeling the context associated with multiple image regions, existing work oversimplifies this task to pairwise region modeling by multiple instance learning. In this paper, we propose a variational Bayesian method, called Variational Context, to solve the problem of complex context modeling in referring expression grounding. Our model exploits the reciprocal relation between the referent and context, i.e., either of them influences the estimation of the posterior distribution of the other, and thereby the search space of context can be greatly reduced, resulting in better localization of the referent. We develop a novel cue-specific language-vision embedding network that learns this reciprocity model end-to-end. We also extend the model to the unsupervised setting, where no annotation for the referent is available. Extensive experiments on various benchmarks show consistent improvement over state-of-the-art methods in both supervised and unsupervised settings.


Skip RNN: Learning to Skip State Updates in Recurrent Neural Networks.  Campos, Victor and Jou, Brendan and Giro-i-Nieto, Xavier and Torres, Jordi and Chang, Shih-Fu  In International Conference on Learning Representations   Vancouver, Canada   April, 2018    [arxiv] [pdf] [code]  

Recurrent Neural Networks (RNNs) continue to show outstanding performance in sequence modeling tasks. However, training RNNs on long sequences often faces challenges such as slow inference, vanishing gradients, and difficulty capturing long-term dependencies. In backpropagation through time settings, these issues are tightly coupled with the large, sequential computational graph resulting from unfolding the RNN in time. We introduce the Skip RNN model, which extends existing RNN models by learning to skip state updates, shortening the effective size of the computational graph. This model can also be encouraged to perform fewer state updates through a budget constraint. We evaluate the proposed model on various tasks and show how it can reduce the number of required RNN updates while preserving, and sometimes even improving, the performance of baseline RNN models.
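The copy-or-update mechanism can be sketched in a few lines. The sketch below uses a plain tanh RNN cell and takes the binary gates as given (in the paper they are learned end-to-end), so it only illustrates how skipped steps shorten the effective computational graph; all names are illustrative.

```python
import numpy as np

def tanh_cell(h, x, W, U, b):
    """A plain tanh RNN cell, standing in for any recurrent cell."""
    return np.tanh(W @ h + U @ x + b)

def skip_rnn_rollout(xs, h0, gates, W, U, b):
    """Unroll a Skip RNN: gates[t] == 1 runs the cell, gates[t] == 0 copies
    the previous state, adding no cell computation at that step."""
    h, updates = h0, 0
    for x, u in zip(xs, gates):
        if u:
            h = tanh_cell(h, x, W, U, b)
            updates += 1
    return h, updates
```

With gates such as [1, 0, 1, 0, 1, 0], only half of the cell evaluations are performed while the state is still propagated through every step.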


PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN.  Hanwang Zhang, Zawlin Kyaw, Jinyang Yu, Shih-Fu Chang  In International Conference on Computer Vision. ICCV 2017.   Venice, Italy   October, 2017   [pdf]   

We aim to tackle a novel vision task called Weakly Supervised Visual Relation Detection (WSVRD) to detect ``subject-predicate-object'' relations in an image with object relation groundtruths available only at the image level. This is motivated by the fact that it is extremely expensive to label the combinatorial relations between objects at the instance level. Compared to the extensively studied problem of Weakly Supervised Object Detection (WSOD), WSVRD is more challenging as it needs to examine a large set of region pairs, which is computationally prohibitive and more likely to get stuck in poor local optima, such as solutions involving wrong spatial context. To this end, we present a Parallel, Pairwise Region-based, Fully Convolutional Network (PPR-FCN) for WSVRD. It uses a parallel FCN architecture that simultaneously performs pair selection and classification of single regions and region pairs for object and relation detection, while sharing almost all computation over the entire image. In particular, we propose a novel position-role-sensitive score map with pairwise RoI pooling to efficiently capture the crucial context associated with a pair of objects. We demonstrate the superiority of PPR-FCN over all baselines in solving the WSVRD challenge through extensive experiments on two visual relation benchmarks.


Learning Spread-out Local Feature Descriptors.  Xu Zhang, Felix X. Yu, Sanjiv Kumar, Shih-Fu Chang  In International Conference on Computer Vision. ICCV 2017   Venice, Italy   October, 2017    [pdf][code]  

We develop a novel deep learning based method for learning robust local image descriptors, by using a simple, yet powerful regularization technique that can be used to significantly improve both the pairwise and triplet losses. The idea is that in order to fully utilize the expressive power of the descriptor space, good local feature descriptors should be sufficiently "spread-out" in the space. In this work, we assume non-matching features follow a uniform distribution and propose a regularization term to maximize the spread of the feature descriptor based on the theoretical property of the uniform distribution. We show that the proposed regularization with triplet loss outperforms existing Euclidean distance based descriptor learning techniques by a large margin. As an extension, the proposed regularization technique can also be used to improve image-level deep feature embedding.
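The moment-matching idea behind the regularizer can be sketched directly: for descriptors drawn uniformly on the unit sphere in R^d, the dot product of a random non-matching pair has mean 0 and second moment 1/d, so deviations from those two moments over a sampled batch can be penalized. The function name and exact penalty form below are an illustrative sketch, not necessarily the paper's definition.

```python
import numpy as np

def spread_out_penalty(desc_a, desc_b):
    """Penalty over a batch of L2-normalized non-matching descriptor pairs.
    Uniform points on the unit sphere in R^d give pairwise dot products
    with mean 0 and second moment 1/d; penalize deviation from both."""
    d = desc_a.shape[1]
    dots = np.sum(desc_a * desc_b, axis=1)  # cosine similarities of the pairs
    m1, m2 = dots.mean(), (dots ** 2).mean()
    return m1 ** 2 + max(0.0, m2 - 1.0 / d)
```

A well spread-out batch yields a penalty near zero, while a collapsed embedding (all descriptors identical) is heavily penalized; adding such a term to a pairwise or triplet loss pushes non-matching descriptors apart.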


Visual Translation Embedding Network for Visual Relation Detection.  Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, Tat-Seng Chua  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)   Honolulu, Hawaii   July, 2017   [arxiv][codes][demo]  

Visual relations, such as “person ride bike” and “bike next to car”, offer a comprehensive scene understanding of an image, and have already shown their great utility in connecting computer vision and natural language. However, due to the challenging combinatorial complexity of modeling subject-predicate-object relation triplets, very little work has been done to localize and predict visual relations. Inspired by the recent advances in relational representation learning of knowledge bases and convolutional object detection networks, we propose a Visual Translation Embedding network (VTransE) for visual relation detection. VTransE places objects in a low-dimensional relation space where a relation can be modeled as a simple vector translation, i.e., subject + predicate ≈ object. We propose a novel feature extraction layer that enables object-relation knowledge transfer in a fully-convolutional fashion and supports training and inference in a single forward/backward pass. To the best of our knowledge, VTransE is the first end-to-end relation detection network. We demonstrate the effectiveness of VTransE over other state-of-the-art methods on two large-scale datasets: Visual Relationship and Visual Genome. Note that even though VTransE is a purely visual model, it is competitive with prior multi-modal models that use language priors.
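The translation principle itself is easy to state in code: in the learned relation space, the best predicate for a (subject, object) pair is the one whose embedding best completes subject + predicate ≈ object. The sketch below assumes the features have already been projected into that space; it illustrates only the scoring rule, not the paper's full network.

```python
import numpy as np

def predict_predicate(s, o, predicate_embs):
    """Rank predicates by the residual ||s + p - o||; smaller is better.
    s, o: subject/object vectors in the relation space; predicate_embs: (K, d)."""
    residuals = np.linalg.norm(s[None, :] + predicate_embs - o[None, :], axis=1)
    return int(np.argmin(residuals))
```

At inference, the same residual can score every (subject, predicate, object) triplet for relation detection.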


Learning Discriminative and Transformation Covariant Local Feature Detectors.  Xu Zhang, Felix X. Yu, Svebor Karaman, Shih-Fu Chang  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)   Honolulu, Hawaii   July, 2017    [pdf][code]  

Robust covariant local feature detectors are important for detecting local features that are (1) discriminative of the image content and (2) repeatably detectable at consistent locations when the image undergoes diverse transformations. Such detectors are critical for applications such as image search and scene reconstruction. Many learning-based local feature detectors address one of these two problems while overlooking the other. In this work, we propose a novel learning-based method to simultaneously address both issues. Specifically, we extend the previously proposed covariant constraint by defining the concepts of “standard patch” and “canonical feature” and leverage these to train a novel robust covariant detector. We show that the introduction of these concepts greatly simplifies the learning stage of the covariant detector, and also makes the detector much more robust. Extensive experiments show that our method outperforms previous handcrafted and learning-based detectors by large margins in terms of repeatability.


CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos.  Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), oral session   Honolulu, Hawaii   July, 2017    [pdf] [arxiv] [code] [project]  

Temporal action localization is an important yet challenging problem. Given a long, untrimmed video consisting of multiple action instances and complex background contents, we need not only to recognize their action categories, but also to localize the start time and end time of each instance. Many state-of-the-art systems use segment-level classifiers to select and rank proposal segments of pre-determined boundaries. However, a desirable model should move beyond the segment level and make dense predictions at a fine granularity in time to determine precise temporal boundaries. To this end, we design a novel Convolutional-De-Convolutional (CDC) network that places CDC filters on top of 3D ConvNets, which have been shown to be effective for abstracting action semantics but reduce the temporal length of the input data. The proposed CDC filter performs the required temporal upsampling and spatial downsampling operations simultaneously to predict actions at the frame-level granularity. It is unique in jointly modeling action semantics in space-time and fine-grained temporal dynamics. We train the CDC network in an end-to-end manner efficiently. Our model not only achieves superior performance in detecting actions in every frame, but also significantly boosts the precision of localizing temporal boundaries. Finally, the CDC network demonstrates a very high efficiency with the ability to process 500 frames per second on a single GPU server. Source code and trained models are available online at https://bitbucket.org/columbiadvmm/cdc.


Deep Image Set Hashing.  Jie Feng, Svebor Karaman, Shih-Fu Chang  In IEEE Winter Conference on Applications of Computer Vision   Santa Rosa, California, USA   March, 2017    [pdf]  


Learning-based hashing is often used in large-scale image retrieval because it provides a compact representation of each sample and allows two samples to be compared efficiently via Hamming distance. However, most hashing methods encode each image separately and discard the knowledge that multiple images in the same set represent the same object or person. We investigate the set hashing problem by combining both set representation and hashing in a single deep neural network. An image set is first passed to a CNN module to extract image features; these features are then aggregated using two types of set features to capture both set-specific and database-wide distribution information. The computed set feature is fed into a multilayer perceptron to learn a compact binary embedding trained with a triplet loss. We extensively evaluate our approach on multiple image datasets and show highly competitive performance compared to state-of-the-art methods.


Deep Cross Residual Learning for Multitask Visual Recognition.  Brendan Jou and Shih-Fu Chang  In ACM Multimedia   Amsterdam, The Netherlands   October, 2016   [pdf] [arXiv] [code]  

Residual learning has recently surfaced as an effective means of constructing very deep neural networks for object recognition. However, current incarnations of residual networks do not allow for the modeling and integration of complex relations between closely coupled recognition tasks or across domains. Such problems are often encountered in multimedia applications involving large-scale content recognition. We propose a novel extension of residual learning for deep networks that enables intuitive learning across multiple related tasks using cross-connections called cross-residuals. These cross-residual connections can be viewed as a form of in-network regularization and enable greater network generalization. We show how cross-residual learning (CRL) can be integrated in multitask networks to jointly train and detect visual concepts across several tasks. We present a single multitask cross-residual network with >40% fewer parameters that is able to achieve competitive, or even better, detection performance on a visual sentiment concept detection problem normally requiring multiple specialized single-task networks. The resulting multitask cross-residual network also achieves better detection performance, by about 10.4%, over a standard multitask residual network without cross-residuals, even with only a small amount of cross-task weighting.


Event Specific Multimodal Pattern Mining for Knowledge Base Construction.  Hongzhi Li, Joseph G. Ellis, Heng Ji, Shih-Fu Chang  In Proceedings of the 24th ACM international conference on Multimedia   Amsterdam, The Netherlands   October, 2016    [pdf]  

Despite the impressive progress in image recognition, current approaches assume a predefined set of classes is known (e.g., ImageNet). Such fixed-vocabulary recognition tools fall short when data from new domains with unknown classes are encountered. Automatic discovery of visual patterns and object classes has so far been limited to low-level primitive or mid-level patterns that lack semantic meaning. In this paper, we develop a novel multi-modal framework, in which deep learning is used to learn representations for both images and texts that can be combined to discover multimodal patterns. Such multimodal patterns can be automatically "named" to describe the semantic concepts unique to each event. The named concepts can then be used to build expanded schemas for each event. We experiment with a large unconstrained corpus of weakly-supervised image-caption pairs related to high-level events such as "attack" and "demonstration" and demonstrate the superior performance of the proposed method in terms of the number of automatically discovered named concepts, their semantic coherence, and relevance to high-level events of interest.


Tamp: A Library for Compact Deep Neural Networks with Structured Matrices.  Bingchen Gong, Brendan Jou, Felix X. Yu, Shih-Fu Chang  In ACM Multimedia, Open Source Software Competition   Amsterdam, The Netherlands   October, 2016    [code] [pdf]  

We introduce Tamp, an open-source C++ library for reducing the space and time costs of deep neural network models. In particular, Tamp implements several recent works which use structured matrices to replace unstructured matrices which are often bottlenecks in neural networks. Tamp is also designed to serve as a unified development platform with several supported optimization back-ends and abstracted data types. This paper introduces the design and API and also demonstrates the effectiveness with experiments on public datasets.



Placing Broadcast News Videos in their Social Media Context using Hashtags.  Joseph G. Ellis, Svebor Karaman, Hongzhi Li, Hong Bin Shim, Shih-Fu Chang  In ACM international conference on Multimedia (ACM MM)   Amsterdam, Netherlands.   October, 2016    [pdf] [video]  
With the growth of social media platforms in recent years, social media is now a major source of information and news for many people around the world. In particular, the rise of hashtags has helped to build communities of discussion around particular news, topics, opinions, and ideologies. However, television news programs still provide value and are used by a vast majority of the population to obtain their news, yet these videos are not easily linked to broader discussion on social media. We have built a novel pipeline that allows television news to be placed in its relevant social media context by leveraging hashtags. In this paper, we present a method for automatically collecting television news and social media content (Twitter) and discovering the hashtags that are relevant for a TV news video. Our algorithms incorporate both the visual and text information within social media and television content, and we show that by leveraging both modalities we can improve performance over single-modality approaches.


Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs.  Zheng Shou, Dongang Wang, and Shih-Fu Chang  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)   Las Vegas, Nevada, USA   June, 2016    [pdf] [code]  

We address temporal action localization in untrimmed long videos. This is important because videos in real applications are usually unconstrained and contain multiple action instances plus video content of background scenes or other activities. To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes the learned classification network to localize each action instance. We propose a novel loss function for the localization network to explicitly consider temporal overlap and achieve high temporal localization accuracy. In the end, only the proposal network and the localization network are used during prediction. On two large-scale benchmarks, our approach achieves significantly superior performance compared with other state-of-the-art systems: mAP increases from 1.7% to 7.4% on MEXaction2 and from 15.0% to 19.0% on THUMOS 2014.


Interactive Segmentation on RGBD Images via Cue Selection.  Jie Feng, Brian Price, Scott Cohen, Shih-Fu Chang  In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)   Las Vegas, NV   June, 2016   [pdf][video]  
Interactive image segmentation is an important problem in computer vision with many applications including image editing, object recognition and image retrieval. Most existing interactive segmentation methods only operate on color images, and until recently, very few works had been proposed to leverage depth information from low-cost sensors to improve interactive segmentation. While these methods achieve better results than color-based methods, they are still limited to either using depth as an additional color channel or simply combining depth with color in a linear way. We propose a novel interactive segmentation algorithm which can incorporate multiple feature cues like color, depth, and normals in a unified graph cut framework to leverage these cues more effectively. A key contribution of our method is that it automatically selects a single cue to be used at each pixel, based on the intuition that only one cue is necessary to determine the segmentation label locally. This is achieved by optimizing over both segmentation labels and cue labels, using terms designed to decide where both the segmentation and cue labels should change. Our algorithm thus produces not only the segmentation mask but also a cue label map that indicates where each cue contributes to the final result. Extensive experiments on five large-scale RGBD datasets show that our proposed algorithm performs significantly better than both color-based and other RGBD-based algorithms in reducing the amount of user input as well as increasing segmentation accuracy.


SentiCart: Cartography & Geo-contextualization for Multilingual Visual Sentiment.  Brendan Jou*, Margaret Yuying Qian*, Shih-Fu Chang  In ACM International Conference on Multimedia Retrieval (ICMR)   New York, NY   June, 2016   [pdf] [demo] [video]  

Where in the world are pictures of cute animals or ancient architecture most shared from? And are they equally sentimentally perceived across different languages? We demonstrate a series of visualization tools, that we collectively call SentiCart, for answering such questions and navigating the landscape of how sentiment-biased images are shared around the world in multiple languages. We present visualizations using a large-scale, self-gathered geodata corpus of >1.54M geo-references coming from over 235 countries mined from >15K visual concepts over 12 languages. We also highlight several compelling data-driven findings about multilingual visual sentiment in geo-social interactions.


3D Shape Retrieval using a Single Depth Image from Low-cost Sensors.  Jie Feng, Yan Wang, Shih-Fu Chang  In IEEE Winter Conference on Applications of Computer Vision (WACV) 2016   Lake Placid, NY, USA   March, 2016    [pdf]  
Content-based 3D shape retrieval is an important problem in computer vision. Traditional retrieval interfaces require a 2D sketch or a manually designed 3D model as the query, which is difficult to specify and thus not practical in real applications. With the recent advance in low-cost 3D sensors such as Microsoft Kinect and Intel Realsense, capturing depth images that carry 3D information is fairly simple, making shape retrieval more practical and user-friendly. In this paper, we study the problem of cross-domain 3D shape retrieval using a single depth image from low-cost sensors as the query to search for similar human designed CAD models. We propose a novel method using an ensemble of autoencoders in which each autoencoder is trained to learn a compressed representation of depth views synthesized from each database object. By viewing each autoencoder as a probabilistic model, a likelihood score can be derived as a similarity measure. 


An exploration of parameter redundancy in deep networks with circulant projections.  Yu Cheng*, Felix X. Yu*, Rogerio Feris, Sanjiv Kumar, Alok Choudhary, Shih-Fu Chang  In International Conference on Computer Vision (ICCV)     December, 2015    [arXiv]  

We explore the redundancy of parameters in deep neural networks by replacing the conventional linear projection in fully-connected layers with a circulant projection. The circulant structure substantially reduces the memory footprint and enables the use of the Fast Fourier Transform to speed up the computation. Considering a fully-connected neural network layer with d input nodes and d output nodes, this method improves the time complexity from O(d^2) to O(d log d) and the space complexity from O(d^2) to O(d). The space savings are particularly important for modern deep convolutional neural network architectures, where fully-connected layers typically contain more than 90% of the network parameters. We further show that the gradient computation and optimization of the circulant projections can be performed very efficiently. Our experiments on three standard datasets show that the proposed approach achieves this significant gain in storage and efficiency with minimal increase in error rate compared to neural networks with unstructured projections.
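The O(d log d) claim follows from the fact that multiplying by a circulant matrix is a circular convolution, which the FFT diagonalizes. A minimal numpy sketch for real inputs (the paper additionally composes the projection with a random sign-flipping diagonal, omitted here):

```python
import numpy as np

def circulant_project(r, x):
    """Compute C(r) @ x in O(d log d) time, where C(r) is the circulant
    matrix whose first column is r, via FFT-based circular convolution."""
    return np.real(np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)))
```

Only the defining vector r is stored, which is where the O(d) space complexity comes from.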


Fast Orthogonal Projection Based on Kronecker Product.  Xu Zhang, Felix X. Yu, Ruiqi Guo, Sanjiv Kumar, Shengjin Wang and Shih-Fu Chang  In ICCV 2015   Santiago de Chile   December, 2015    [pdf][code]  

We propose a family of structured matrices to speed up orthogonal projections for high-dimensional data commonly seen in computer vision applications. In this, a structured matrix is formed by the Kronecker product of a series of smaller orthogonal matrices. This achieves O(d log d) computational complexity and O(log d) space complexity for d-dimensional data, a drastic improvement over the standard unstructured projections whose computation and space complexities are both O(d^2).
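The complexity gain comes from never materializing the full d x d matrix: a Kronecker product of small orthogonal factors can be applied axis-by-axis to a reshaped input. The sketch below is an illustrative implementation of this standard trick, not the paper's code; factor sizes and names are assumptions.

```python
import numpy as np

def kron_apply(mats, x):
    """Apply (mats[0] ⊗ mats[1] ⊗ ...) @ x without forming the big matrix.
    Reshape x into a tensor and contract each small factor along its own axis."""
    dims = [A.shape[0] for A in mats]
    X = x.reshape(dims)
    for i, A in enumerate(mats):
        # contract A with axis i of X, then put the result axis back in place
        X = np.moveaxis(np.tensordot(A, X, axes=([1], [i])), 0, i)
    return X.reshape(-1)
```

Since a Kronecker product of orthogonal matrices is itself orthogonal, the projection stays orthogonal while, with 2x2 factors, the time cost is O(d log d) and the storage is O(log d).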


Visual Affect Around the World: A Large-scale Multilingual Visual Sentiment Ontology.  Brendan Jou*, Tao Chen*, Nikolaos Pappas*, Miriam Redi*, Mercan Topkara*, and Shih-Fu Chang  In ACM Multimedia   Brisbane, Australia   October, 2015    [pdf]  

Every culture and language is unique. Our work expressly focuses on the uniqueness of culture and language in relation to human affect, specifically sentiment and emotion semantics, and how they manifest in social multimedia. We develop sets of sentiment- and emotion-polarized visual concepts by adapting semantic structures called adjective-noun pairs, originally introduced by Borth et al. (2013), but in a multilingual context. We propose a new language-dependent method for automatic discovery of these adjective-noun constructs. We show how this pipeline can be applied on a social multimedia platform for the creation of a large-scale multilingual visual sentiment concept ontology (MVSO). Unlike the flat structure in Borth et al. (2013), our unified ontology is organized hierarchically by multilingual clusters of visually detectable nouns and subclusters of emotionally biased versions of these nouns. In addition, we present an image-based prediction task to show how generalizable language-specific models are in a multilingual context. A new, publicly available dataset of > 15.6K sentiment-biased visual concepts across 12 languages with language-specific detector banks, > 7.36M images and their metadata is also released.


EventNet: A Large Scale Structured Concept Library for Complex Event Detection in Video.  Guangnan Ye*, Yitong Li*, Hongliang Xu, Dong Liu, Shih-Fu Chang  In ACM Multimedia (ACM MM)   Brisbane, Australia   October, 2015    [pdf]  
Event-specific concepts are the semantic concepts specifically designed for the events of interest, which can be used as a mid-level representation of complex events in videos. Existing methods only focus on defining event-specific concepts for a small number of pre-defined events, but cannot handle novel unseen events. This motivates us to build a large scale event-specific concept library that covers as many real-world events and their concepts as possible. Specifically, we choose WikiHow, an online forum containing a large number of how-to articles on human daily life events. We perform a coarse-to-fine event discovery process and discover 500 events from WikiHow articles. Then we use each event name as a query to search YouTube and discover event-specific concepts from the tags of returned videos. After an automatic filtering process, we end up with around 95,321 videos and 4,490 concepts. We train a Convolutional Neural Network (CNN) model on the 95,321 videos over the 500 events, and use the model to extract deep learning features from video content. With the learned deep learning features, we train 4,490 binary SVM classifiers as the event-specific concept library. The concepts and events are further organized in a hierarchical structure defined by WikiHow, and the resultant concept library is called EventNet. Finally, the EventNet concept library is used to generate concept-based representations of event videos. To the best of our knowledge, EventNet represents the first video event ontology that organizes events and their concepts into a semantic structure. It offers great potential for event retrieval and browsing.


Large Video Event Ontology Browsing, Search and Tagging (EventNet Demo).  Hongliang Xu, Guangnan Ye, Yitong Li, Dong Liu, Shih-Fu Chang  In ACM Multimedia (ACM MM)   Brisbane, Australia   October, 2015   [pdf]  

EventNet is the largest video event ontology existent today, consisting of 500 events and 4,490 event-specific concepts systematically discovered from crowdsourced forums such as WikiHow. Such sources offer rich information about events happening in everyday lives. Additionally, it includes automatic detection models for the constituent events and concepts using deep learning with around 95K training videos from YouTube. In this demo, we present several novel functions of EventNet: 1) interactive ontology browsing, 2) semantic event search, and 3) tagging of user-uploaded videos via open web interfaces. The system is the first to allow users to explore rich hierarchical structures among video events, relations between concepts and events, and automatic detection of events and concepts embedded in user-uploaded videos in a live fashion.


CamSwarm: Instantaneous Smartphone Camera Arrays for Collaborative Photography.  Yan Wang, Jue Wang, Shih-Fu Chang  In arXiv preprint     August, 2015   [arXiv] [video]  

Camera arrays (CamArrays) are widely used in commercial filming projects for achieving special visual effects such as bullet time effect, but are very expensive to set up. We propose CamSwarm, a low-cost and lightweight alternative to professional CamArrays for consumer applications. It allows the construction of a collaborative photography platform from multiple mobile devices anywhere and anytime, enabling new capturing and editing experiences that a single camera cannot provide. Our system allows easy team formation; uses real-time visualization and feedback to guide camera positioning; provides a mechanism for synchronized capturing; and finally allows the user to efficiently browse and edit the captured imagery. Our user study suggests that CamSwarm is easy to use; the provided real-time guidance is helpful; and the full system achieves high quality results promising for non-professional use.


New Insights into Laplacian Similarity Search.  Xiao-Ming Wu, Zhenguo Li, and Shih-Fu Chang  In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)   Boston, USA   June, 2015   [pdf] [supplement] [abstract]  

Graph-based computer vision applications rely critically on similarity metrics which compute the pairwise similarity between any pair of vertices on graphs. This paper investigates the fundamental design of commonly used similarity metrics, and provides new insights to guide their use in practice. In particular, we introduce a family of similarity metrics in the form of $(L+\alpha\Lambda)^{-1}$, where $L$ is the graph Laplacian, $\Lambda$ is a positive diagonal matrix acting as a regularizer, and $\alpha$ is a positive balancing factor. Such metrics respect graph topology when $\alpha$ is small, and reproduce well-known metrics such as hitting times and the pseudo-inverse of the graph Laplacian with different regularizers $\Lambda$.

This paper is the first to analyze the important impact of selecting $\Lambda$ in retrieving the local cluster from a seed. We find that different $\Lambda$ can lead to surprisingly complementary behaviors: $\Lambda = D$ (degree matrix) can reliably extract the cluster of a query if it is sparser than surrounding clusters, while $\Lambda = I$ (identity matrix) is preferred if it is denser than surrounding clusters. Since in practice there is no reliable way to determine the local density in order to select the right model, we propose a new design of $\Lambda$ that automatically adapts to the local density. Experiments on image retrieval verify our theoretical arguments and confirm the benefit of the proposed metric. We expect the insights of our theory to provide guidelines for more applications in computer vision and other domains.
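In matrix form the family is straightforward to instantiate. The sketch below builds the metric from a symmetric affinity matrix W and exposes the two regularizer choices discussed above; it is a toy implementation for illustration, not the paper's code.

```python
import numpy as np

def laplacian_similarity(W, alpha, regularizer="degree"):
    """Similarity matrix (L + alpha * Lambda)^{-1} for affinity matrix W.
    regularizer="degree" uses Lambda = D; "identity" uses Lambda = I."""
    D = np.diag(W.sum(axis=1))
    L = D - W  # combinatorial graph Laplacian
    Lam = D if regularizer == "degree" else np.eye(len(W))
    return np.linalg.inv(L + alpha * Lam)
```

For retrieval from a seed vertex i, the i-th column of the returned matrix ranks all other vertices by similarity to the seed.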


Attributes and Categories for Generic Instance Search from One Example.  Ran Tao, Arnold WM Smeulders, Shih-Fu Chang  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)   Boston   June, 2015    [pdf]  

This paper aims for generic instance search from one example where the instance can be an arbitrary 3D object like shoes, not just near-planar and one-sided instances like buildings and logos. Firstly, we evaluate state-of-the-art instance search methods on this problem. We observe that what works for buildings loses its generality on shoes. Secondly, we propose to use automatically learned category-specific attributes to address the large appearance variations present in generic instance search. On the problem of searching among instances from the same category as the query, the category-specific attributes outperform existing approaches by a large margin. On a shoe dataset containing 6624 shoe images recorded from all viewing angles, we improve the performance from 36.73 to 56.56 using category-specific attributes. Thirdly, we extend our methods to search objects without restricting to the specifically known category. We show the combination of category-level information and the category-specific attributes is superior to combining category-level information with low-level features such as Fisher vector.


Regrasping and Unfolding of Garments Using Predictive Thin Shell Modeling.  Yinxiao Li, Danfei Xu, Yonghao Yue, Yan Wang, Shih-Fu Chang, Eitan Grinspun, and Peter K. Allen  In IEEE International Conference on Robotics and Automation (ICRA)     May, 2015    [video] [pdf]  

Deformable objects such as garments are highly unstructured, making them difficult to recognize and manipulate. In this paper, we propose a novel method to teach a two-arm robot to efficiently track the states of a garment from an unknown state to a known state by iterative regrasping. The problem is formulated as a constrained weighted evaluation metric for evaluating the two desired grasping points during regrasping, which can also be used as a convergence criterion. The result is then adopted as an estimate to initialize a regrasp, which is in turn considered as a new state for evaluation. The process stops when the predicted thin shell conclusively agrees with the reconstruction. We show experimental results for regrasping a number of different garments including sweaters, knitwear, pants, and leggings.


Discrete Graph Hashing.  Wei Liu, Cun Mu, Sanjiv Kumar and Shih-Fu Chang  In Neural Information Processing Systems (NIPS)   Montreal, Canada   December, 2014   [pdf] [supplementary material]  

Hashing has emerged as a popular technique for fast nearest neighbor search in gigantic databases. In particular, learning based hashing has received considerable attention due to its appealing storage and search efficiency. However, the performance of most unsupervised learning based hashing methods deteriorates rapidly as the hash code length increases. We argue that the degraded performance is due to inferior optimization procedures used to achieve discrete binary codes. This paper presents a graph-based unsupervised hashing model to preserve the neighborhood structure of massive data in a discrete code space. We cast the graph hashing problem into a discrete optimization framework which directly learns the binary codes. A tractable alternating maximization algorithm is then proposed to explicitly deal with the discrete constraints, yielding high-quality codes that capture local neighborhoods well. Extensive experiments performed on four large datasets with up to one million samples show that our discrete optimization based graph hashing method obtains superior search accuracy over state-of-the-art unsupervised hashing methods, especially for longer codes.


Why We Watch the News: A Dataset for Exploring Sentiment in Broadcast Video News.  Joseph G. Ellis, Brendan Jou, Shih-Fu Chang  In ACM International Conference on Multimodal Interaction   Istanbul, Turkey   November, 2014   [pdf] [poster] [web]

We present a multimodal sentiment study performed on a novel collection of videos mined from broadcast and cable television news programs. To the best of our knowledge, this is the first dataset released for studying sentiment in the domain of broadcast video news. We describe our algorithm for the processing and creation of person-specific segments from news video, yielding 929 sentence-length videos, which are annotated via Amazon Mechanical Turk. The spoken transcript and the video content itself are each annotated for their expression of positive, negative or neutral sentiment.

Based on these gathered user annotations, we demonstrate for news video the importance of taking into account multimodal information for sentiment prediction, and in particular, challenge previous text-based approaches that rely solely on available transcripts. We show that as much as 21.54% of the sentiment annotations for transcripts differ from their respective sentiment annotations when the video clip itself is presented. We present audio and visual classification baselines over a three-way sentiment prediction of positive, negative and neutral, as well as the influence of person-dependent versus person-independent classification on performance. Finally, we release the News Rover Sentiment dataset to the greater research community.


Scalable Visual Instance Mining with Threads of Features.  Wei Zhang, Hongzhi Li, Chong-Wah Ngo, Shih-Fu Chang  In ACM Multimedia (ACM MM)   Orlando, USA   November, 2014   [pdf]  

We address the problem of visual instance mining, which is to extract frequently appearing visual instances automatically from a multimedia collection. We propose a scalable mining method by exploiting Thread of Features (ToF). Specifically, ToF, a compact representation that links consistent features across images, is extracted to reduce noise, discover patterns, and speed up processing. Various instances, especially small ones, can be discovered by exploiting correlated ToFs. Our approach is significantly more effective than other methods in mining small instances. At the same time, it is also more efficient, requiring far fewer hash tables. We compare with several state-of-the-art methods on two fully annotated datasets, MQA and Oxford, showing large performance gains in mining (especially small) visual instances. We also run our method on another Flickr dataset with one million images for a scalability test. Two applications, instance search and multimedia summarization, are developed from the novel perspective of instance mining, showing the great potential of our method in multimedia analysis.


Predicting Viewer Perceived Emotions in Animated GIFs.  Brendan Jou, Subhabrata Bhattacharya, Shih-Fu Chang  In ACM Multimedia   Orlando, FL USA   November, 2014    [pdf]  

Animated GIFs are everywhere on the Web. Our work focuses on the computational prediction of emotions perceived by viewers after they are shown animated GIF images. We evaluate our results on a dataset of over 3,800 animated GIFs gathered from MIT's GIFGIF platform, each with scores for 17 discrete emotions aggregated from over 2.5M user annotations -- the first computational evaluation of its kind for content-based prediction on animated GIFs to our knowledge. In addition, we advocate a conceptual paradigm in emotion prediction showing that delineating distinct types of emotion is important and that it is useful to be concrete about the emotion target. One of our objectives is to systematically compare different types of content features for emotion prediction, including low-level, aesthetics, semantic and face features. We also formulate a multi-task regression problem to evaluate whether viewer perceived emotion prediction can benefit from jointly learning across emotion classes compared to disjoint, independent learning.


Modeling Attributes from Category-Attribute Proportions.  Felix X. Yu, Liangliang Cao, Michele Merler, Noel Codella, Tao Chen, John R. Smith, Shih-Fu Chang  In ACM Multimedia   Orlando, USA   November, 2014    [PDF]  

Attribute-based representation has been widely used in visual recognition and retrieval due to its interpretability and cross-category generalization properties. However, classic attribute learning requires manually labeling attributes on the images, which is very expensive and not scalable. In this paper, we propose to model attributes from category-attribute proportions. The proposed framework can model attributes without attribute labels on the images. Specifically, given a multi-class image dataset with N categories, we model an attribute based on an N-dimensional category-attribute proportion vector, where each element of the vector characterizes the proportion of images in the corresponding category having the attribute. The attribute learning can be formulated as a learning from label proportions (LLP) problem. Our method is based on a newly proposed machine learning algorithm called ∝SVM. Finding the category-attribute proportions is much easier than manually labeling images, but it is still not a trivial task. We further propose to estimate the proportions from multiple modalities such as human commonsense knowledge, NLP tools, and other domain knowledge. The value of the proposed approach is demonstrated by various applications including modeling animal attributes, visual sentiment attributes, and scene attributes.



Object-Based Visual Sentiment Concept Analysis and Application.  Tao Chen, Felix X. Yu, Jiawei Chen, Yin Cui, Yan-Ying Chen and Shih-Fu Chang  In Proceedings of the 22nd ACM International Conference on Multimedia   Orlando, FL, USA   November, 2014    [pdf]  

This paper studies the problem of modeling object-based visual concepts such as "crazy car" and "shy dog" with the goal of extracting emotion related information from social multimedia content. We focus on detecting such adjective-noun pairs because of their strong co-occurrence relation with image tags about emotions. This problem is very challenging due to the highly subjective nature of adjectives like "crazy" and "shy" and the ambiguity associated with the annotations. However, associating adjectives with concrete physical nouns makes the combined visual concepts more detectable and tractable. We propose a hierarchical system to handle the concept classification in an object specific manner and decompose the hard problem into object localization and sentiment related concept modeling. In order to resolve the ambiguity of concepts, we propose a novel classification approach that models concept similarity, leveraging an online commonsense knowledge base. The proposed framework also allows us to interpret the classifiers by discovering discriminative features. Comparisons between our method and several baselines show great improvement in classification performance. We further demonstrate the power of the proposed system with a few novel applications such as sentiment-aware music slide shows of personal albums.


Real-time Pose Estimation of Deformable Objects Using a Volumetric Approach.  Yinxiao Li*, Yan Wang*, Michael Case, Shih-Fu Chang, Peter K. Allen  In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)   Chicago, IL   September, 2014    [pdf] [video] * indicates equal contribution.  

Pose estimation of deformable objects is a fundamental and challenging problem in robotics. We present a novel solution to this problem by first reconstructing a 3D model of the object from a low-cost depth sensor such as Kinect, and then searching a database of simulated models in different poses to predict the pose. Given noisy depth images from 360-degree views of the target object acquired from the Kinect sensor, we reconstruct a smooth 3D model of the object using depth image segmentation and volumetric fusion. Then with an efficient feature extraction and matching scheme, we search the database, which contains a large number of deformable objects in different poses, to obtain the most similar model, whose pose is then adopted as the prediction. Extensive experiments demonstrate better accuracy and orders of magnitude speed-up compared to our previous work. An additional benefit of our method is that it produces a high-quality mesh model and camera pose, which is necessary for other tasks such as regrasping and object manipulation.


From Low-Cost Depth Sensors to CAD: Cross-Domain 3D Shape Retrieval via Regression Tree Fields.  Yan Wang, Jie Feng, Zhixiang Wu, Jun Wang, Shih-Fu Chang  In European Conference on Computer Vision (ECCV)   Zurich, Switzerland   September, 2014   [pdf] [video]  

The recent advances of low-cost and mobile depth sensors dramatically extend the potential of 3D shape retrieval and analysis. While the traditional research of 3D retrieval mainly focused on searching by a rough 2D sketch or with a high-quality CAD model, we tackle a novel and challenging problem of cross-domain 3D shape retrieval, in which users can use 3D scans from low-cost depth sensors like Kinect as queries to search CAD models in the database. To cope with the imperfection of user-captured models such as model noise and occlusion, we propose a cross-domain shape retrieval framework, which minimizes the potential function of a Conditional Random Field to efficiently generate the retrieval scores. In particular, the potential function consists of two critical components: one unary potential term provides robust cross-domain partial matching and the other pairwise potential term embeds spatial structures to alleviate the instability from model noise. Both potential components are efficiently estimated using random forests with 3D local features, forming a Regression Tree Field framework. We conduct extensive experiments on two recently released user-captured 3D shape datasets and compare with several state-of-the-art approaches on the cross-domain shape retrieval task. The experimental results demonstrate that our proposed method outperforms the competing methods with a significant performance gain.


Discriminative Indexing for Probabilistic Image Patch Priors.  Yan Wang, Sunghyun Cho, Jue Wang, Shih-Fu Chang  In European Conference on Computer Vision (ECCV)   Zurich, Switzerland   September, 2014    [pdf]  

Newly emerged probabilistic image patch priors, such as Expected Patch Log-Likelihood (EPLL), have shown excellent performance on image restoration tasks, especially deconvolution, due to their rich expressiveness. However, their applicability is limited by the heavy computation involved in the associated optimization process. Inspired by recent advances in using regression trees to index priors defined on a Conditional Random Field, we propose a novel discriminative indexing approach on patch-based priors to expedite the optimization process. Specifically, we propose an efficient tree indexing structure for EPLL, and overcome its training tractability challenges in high-dimensional spaces by utilizing special structures of the prior. Experimental results show that our approach accelerates state-of-the-art EPLL-based deconvolution methods by up to 40 times, with very little quality compromise.


Recognizing Complex Events in Videos by Learning Key Static-Dynamic Evidences.  Kuan-Ting Lai, Dong Liu, Ming-Syan Chen and Shih-Fu Chang.  In European Conference on Computer Vision.   Zurich, Switzerland.   September, 2014   [pdf]  

Complex events in videos consist of various human interactions with different objects in diverse environments. As a consequence, the evidences needed to recognize events may occur in short time periods with variable lengths and may happen anywhere in a video. This fact prevents conventional machine learning algorithms from effectively recognizing the events. We propose a novel method that can automatically identify the key evidences in videos for detecting complex events. Both static instances (objects) and dynamic instances (actions) are considered by sampling frames and temporal segments respectively. To compare the characteristic power of heterogeneous instances, we embed static and dynamic instances into a multiple instance learning framework via instance similarity measures, and cast the problem as an Evidence Selective Ranking (ESR) process. We impose an L-1 norm to select key evidences while using the Infinite Push Loss Function to enforce positive videos to have higher detection scores than negative videos. Experiments on large-scale video datasets show that our method can improve detection accuracy while providing the unique capability of discovering key evidences of each complex event.


Circulant Binary Embedding.  Felix X. Yu, Sanjiv Kumar, Yunchao Gong, Shih-Fu Chang  In International Conference on Machine Learning (ICML) 2014   Beijing, China   June, 2014   [PDF] [Code] [Slides]  

Binary embedding of high-dimensional data requires long codes to preserve the discriminative power of the input space. Traditional binary coding methods often suffer from very high computation and storage costs in such a scenario. To address this problem, we propose Circulant Binary Embedding (CBE) which generates binary codes by projecting the data with a circulant matrix. The circulant structure enables the use of the Fast Fourier Transform to speed up the computation. Compared to methods that use unstructured matrices, the proposed method improves the time complexity from $\mathcal{O}(d^2)$ to $\mathcal{O}(d\log d)$, and the space complexity from $\mathcal{O}(d^2)$ to $\mathcal{O}(d)$, where $d$ is the input dimensionality. We also propose a novel time-frequency alternating optimization to learn data-dependent circulant projections, which alternately minimizes the objective in the original and Fourier domains. We show by extensive experiments that the proposed approach gives much better performance than state-of-the-art approaches for a fixed time budget, and provides much faster computation with no performance degradation for a fixed number of bits.
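As a minimal illustration of the circulant idea, the sketch below computes the binary code with an explicit O(d^2) circulant projection in plain Python; the function name and toy vectors are hypothetical, and the paper's speed-up comes from replacing this inner loop with an FFT.

```python
def circulant_binary_embedding(x, r):
    """Sketch of CBE: project x with the circulant matrix built from r,
    then binarize by sign. Naive O(d^2); the paper uses the FFT for O(d log d)."""
    d = len(x)
    bits = []
    for i in range(d):
        # Row i of circ(r) is r circularly shifted by i positions.
        proj = sum(r[(j - i) % d] * x[j] for j in range(d))
        bits.append(1 if proj >= 0 else 0)
    return bits

# Toy example: a 4-dimensional input and a learned (here: made-up) parameter vector r.
code = circulant_binary_embedding([0.5, -1.2, 0.3, 2.0], [0.7, -0.1, 0.4, -0.9])
```

Storing only the d-dimensional vector r (instead of a full d×d projection matrix) is what yields the O(d) space complexity quoted above.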


Hash-SVM: Scalable Kernel Machines for Large-Scale Visual Classification.  Yadong Mu, Gang Hua, Wei Fan, Shih-Fu Chang  In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)   Columbus, Ohio   June, 2014    [PDF]  

This paper presents a novel algorithm which uses compact hash bits to greatly improve the efficiency of non-linear kernel SVMs in very large scale visual classification problems. Our key idea is to represent each sample with compact hash bits, over which an inner product is defined to serve as the surrogate of the original nonlinear kernel. The problem of solving the nonlinear SVM can then be transformed into solving a linear SVM over the hash bits. The proposed Hash-SVM enjoys dramatic storage cost reduction owing to the compact binary representation, as well as (sub-)linear training complexity via the linear SVM. As a critical component of Hash-SVM, we propose a novel hashing scheme for arbitrary non-linear kernels via random subspace projection in a reproducing kernel Hilbert space. Our comprehensive analysis reveals a well-behaved theoretical bound on the deviation between the proposed hashing-based kernel approximation and the original kernel function. We also derive requirements on the hash bits for achieving a satisfactory accuracy level. Several experiments on large-scale visual classification benchmarks are conducted, including one with over 1 million images. The results show that Hash-SVM greatly reduces the computational complexity (more than ten times faster in many cases) while keeping comparable accuracy.
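The surrogate-kernel idea can be sketched in a few lines: once samples are represented as hash bits, the nonlinear kernel is replaced by a normalized inner product over the signed bits. The function and toy codes below are a hypothetical illustration of that substitution, not the paper's actual hashing scheme.

```python
def hash_kernel(h1, h2):
    """Surrogate kernel: normalized inner product of signed hash bits.
    Identical codes give 1.0, complementary codes give -1.0."""
    assert len(h1) == len(h2)
    s1 = [2 * b - 1 for b in h1]  # map {0,1} -> {-1,+1}
    s2 = [2 * b - 1 for b in h2]
    return sum(a * b for a, b in zip(s1, s2)) / len(h1)

k = hash_kernel([1, 0, 1, 1], [1, 1, 1, 0])  # codes agree on 2 of 4 bits
```

Because this surrogate is a plain inner product, a linear SVM trained on the signed bit vectors implicitly uses it, which is what turns the nonlinear problem into a linear one.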


Video Event Detection by Inferring Temporal Instance Labels.  Kuan-Ting Lai, Felix X. Yu, Ming-Syan Chen, Shih-Fu Chang  In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)   Columbus, OH   June, 2014   [pdf]

We address a well-known challenge related to video indexing: given annotations of complex events (such as wedding, rock climbing, etc.) occurring in videos, how to identify the precise video segments in long programs that instantiate event evidences. Solutions to this problem will enable temporally aligned video content summarization, search, and targeted ad placement. In this work, we propose a novel instance-based video event detection model based on a new learning algorithm called Proportional SVM. It considers each video as a bag of instances, which may be extracted at multiple temporal granularities (key frames and segments of varying lengths) and represented by various features (SIFT or motion boundary histogram). It uses a large-margin formulation which treats the instance labels as hidden latent variables, and simultaneously infers the instance labels as well as the instance-level classification model. Our method assumes positive videos have a large proportion of positive instances while negative videos have a small one. Extensive experiments on large-scale video event datasets demonstrate significant performance gains. Our method is also able to discover the optimal temporal granularity for detecting each complex event.
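The bag-level assumption above can be sketched simply: score every instance (frame or segment) with the instance-level model, then relate the video to the proportion of positively scored instances. Everything below (names, threshold, toy scores) is a hypothetical illustration of that assumption, not the latent-variable training itself.

```python
def video_positive_proportion(instance_scores, threshold=0.0):
    """Fraction of a video's instances that the instance-level model scores
    positive. Positive videos are assumed to have a large proportion,
    negative videos a small one."""
    positives = sum(1 for s in instance_scores if s > threshold)
    return positives / len(instance_scores)

# Toy video with four instances: two score above the decision threshold.
p = video_positive_proportion([0.9, -0.2, 0.4, -0.8])
```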


Locally Linear Hashing for Extracting Non-linear Manifolds.  Go Irie, Zhenguo Li, Xiao-Ming Wu, Shih-Fu Chang  In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)   Columbus, OH   June, 2014   [pdf]  

In this paper, we propose a hashing method aiming at reconstructing the locally linear structures of data manifolds in the binary Hamming space, which can be captured by locality-sensitive sparse coding. We cast the problem as a joint minimization of reconstruction error and quantization loss and show that a local optimum can be obtained efficiently via alternating optimization. Our results improve previous state-of-the-art methods by typically 28-74% in semantic retrieval performance, and 627% on the Yale face data.


Minimally Needed Evidence for Complex Event Recognition in Unconstrained Videos.  Subhabrata Bhattacharya, Felix X. Yu, Shih-Fu Chang  In ACM International Conference on Multimedia Retrieval   Glasgow, UK   April, 2014   [Pdf] [slides]  
This paper addresses a fundamental question -- how do humans recognize complex events in videos? Normally, humans view videos in a sequential manner. We hypothesize that humans can make high-level inferences, such as whether an event is present in a video, by looking at a very small number of frames, not necessarily in a linear order. We attempt to verify this cognitive capability of humans and to discover the Minimally Needed Evidence (MNE) for each event. To this end, we introduce an online game-based event quiz facilitating the selection of the minimal evidence required by humans to judge the presence or absence of a complex event in an open source video. Each video is divided into a set of temporally coherent microshots (1.5 secs in length) which are revealed only on player request. The player's task is to identify the positive and negative occurrences of the given target event with a minimal number of requests to reveal evidence. Incentives are given to players for correct identification with the minimal number of requests.
Our extensive human study using the game quiz validates our hypothesis - 55% of videos need only one microshot for correct human judgment and events of varying complexity require different amounts of evidence for human judgment. In addition, the proposed notion of MNE enables us to select discriminative features, drastically improving speed and accuracy of a video retrieval system.


Predicting Viewer Affective Comments Based on Image Content in Social Media.  Yan-Ying Chen, Tao Chen, Winston H. Hsu, Hong-Yuan Mark Liao, Shih-Fu Chang  In ACM International Conference on Multimedia Retrieval   Glasgow, United Kingdom   April, 2014    [pdf]  


Visual sentiment analysis is getting increasing attention because of the rapidly growing amount of images in online social interactions and emerging applications such as online propaganda and advertisement. This paper focuses on predicting what viewer affect concepts will be triggered when the image is perceived by the viewers. For example, given an image tagged with "yummy food," the viewers are likely to comment "delicious" and "hungry," which we refer to as viewer affect concepts (VAC) in this paper. We propose an automatic content based approach to predict VACs by first detecting sentiment related visual concepts expressed by the image publisher in the image content and then applying statistical correlations between such publisher affects and the VACs. We demonstrate the novel use of the proposed models in several real-world applications - recommending images to invoke certain target affects among viewers, increasing the accuracy of predicting VACs by 20.1%, and finally developing a comment robot tool that may suggest plausible, content-specific and desirable comments when a new image is shown.


Event-Driven Semantic Concept Discovery by Exploiting Weakly Tagged Internet Images.  Jiawei Chen, Yin Cui, Guangnan Ye, Dong Liu, Shih-Fu Chang.  In ACM International Conference on Multimedia Retrieval.   Glasgow, UK.   April, 2014    [pdf]   
In this project, we aim to develop automatic methods to discover semantic concepts characterizing complex video events by (1) first extracting key terms in event definitions, (2) crawling web images with the extracted terms, and (3) discovering the frequent tags used to describe the web images. The final discovered tags are used to construct the event-specific concepts. The approach is fully automatic and scalable, as it does not need human annotation and can readily utilize the large amount of Web images to train the concept detectors. We use TRECVID Multimedia Event Detection (MED) 2013 as the video test set and crawl 400K Flickr images to automatically discover 2,000 visual concepts and train their classifiers. We show the large concept classifier pool can be used to achieve significant performance gains in supervised event detection, as well as zero-shot retrieval without needing any video training examples. It outperforms other concept pools like Classemes and ImageNet by a large margin (228%) in zero-shot event retrieval. Subjective evaluation by humans also confirms the discovered concepts are much more intuitive than those of other concept discovery methods.
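Step (3), discovering frequent tags, can be sketched as simple frequency ranking over the tag lists of the crawled images. The function name and toy tags below are hypothetical; the real pipeline operates on 400K Flickr images.

```python
from collections import Counter

def discover_event_concepts(image_tag_lists, top_k=2):
    """Rank tags by how often they describe the crawled web images;
    the most frequent tags become candidate event-specific concepts."""
    counts = Counter(tag for tags in image_tag_lists for tag in tags)
    return [tag for tag, _ in counts.most_common(top_k)]

# Toy crawl result for a hypothetical "dog show" event query.
concepts = discover_event_concepts(
    [["dog", "park"], ["dog", "ball"], ["dog", "grass"], ["cat", "park"]]
)
```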


Large Scale Video Hashing via Structure Learning.  Guangnan Ye, Dong Liu, Jun Wang, Shih-Fu Chang  In IEEE International Conference on Computer Vision   Sydney, Australia   December, 2013   [pdf]

In this paper, we develop a novel hashing method that aims at preserving both the visual content similarity and the temporal information of videos. We formulate a regularized optimization objective that exploits visual descriptors common to a semantic class and "pushes" successive video frames into the same hash bins. We show that the objective can be efficiently solved by an Accelerated Proximal Gradient (APG) method, and that significant performance gains over the state of the art can be achieved.


News Rover: Exploring Topical Structures and Serendipity in Heterogeneous Multimedia News.  Hongzhi Li*, Brendan Jou*, Joseph G. Ellis*, Daniel Morozoff*, and Shih-Fu Chang  In ACM Multimedia   Barcelona, Spain   October, 2013    [pdf] [project] [demo video]  

News stories are rarely understood in isolation. Every story is driven by key entities that give the story its context. Persons, places, times, and several surrounding topics can often succinctly represent a news event, but are only useful if they can be both identified and linked together. We introduce a novel architecture called News Rover for re-bundling broadcast video news, online articles, and Twitter content. The system utilizes these many multimodal sources to link and organize content by topics, events, persons and time. We present two intuitive interfaces for navigating content by topics and their related news events, as well as for serendipitously learning about a news topic. These two interfaces trade off between user-controlled and serendipitous exploration of news while retaining the story context. The novelty of our work lies in linking multi-source, multimodal news content to extracted entities and topical structures for contextual understanding, visualized in intuitive active and passive interfaces.


SentiBank: Large-Scale Ontology and Classifiers for Detecting Sentiment and Emotions in Visual Content.  Damian Borth, Tao Chen, Rong-Rong Ji and Shih-Fu Chang  In ACM Multimedia,   Barcelona, Spain   October, 2013   [project]  [pdf] [video1] [video2]  

We demonstrate a novel system which combines sound structures from psychology and the folksonomy extracted from social multimedia to develop a large visual sentiment ontology consisting of 1,200 concepts and their corresponding automatic classifiers called SentiBank. Each concept, defined as an Adjective Noun Pair (ANP), is made of an adjective strongly indicating emotions and a noun corresponding to objects or scenes that have a reasonable prospect of automatic detection. We demonstrate novel applications made possible by SentiBank including live sentiment prediction of phototweets and visualization of visual content in a rich semantic space organized by emotion categories. In addition, two novel browsers are developed, implementing ideas of the wheel of emotion and the tree map respectively.


Large-scale Visual Sentiment Ontology and Detectors Using Adjective Noun Pairs.  Damian Borth, Rongrong Ji, Tao Chen, Thomas Breuel and Shih-Fu Chang  In ACM Multimedia, Brave New Idea Program.   Barcelona, Spain   October, 2013   [project] [pdf] [video1] [video2]  

To address the challenge of sentiment analysis from visual content, we propose a novel approach based on understanding of the visual concepts that are strongly related to sentiments. Our key contribution is two-fold: first, we present a method built upon psychological theories and web mining to automatically construct a large-scale Visual Sentiment Ontology (VSO) consisting of more than 3,000 Adjective Noun Pairs (ANP). Second, we propose SentiBank, a novel visual concept detector library that can be used to detect the presence of 1,200 ANPs in an image. Experiments on detecting sentiment of image tweets demonstrate significant improvement in detection accuracy when comparing the proposed SentiBank based predictors with the text-based approaches. The effort also leads to a large publicly available resource consisting of a visual sentiment ontology, a large detector library, and the training/testing benchmark for visual sentiment analysis.


Structured Exploration of Who, What, When, and Where in Heterogeneous Multimedia News Sources.  Brendan Jou* and Hongzhi Li* and Joseph G. Ellis* and Daniel Morozoff and Shih-Fu Chang  In ACM Multimedia   Barcelona, Spain   October, 2013   [pdf] [project] [demo video]  

We present a fully automatic system, from raw data gathering to navigation, over heterogeneous news sources, including over 18k hours of broadcast video news, 3.58M online articles, and 430M public Twitter messages. Our system addresses the challenge of extracting "who," "what," "when," and "where" from a truly multimodal perspective, leveraging audiovisual information in broadcast news and that embedded in articles, as well as textual cues in both closed captions and the raw document content of articles and social media. Performing this analysis over time, we are able to extract and study the trends of topics in the news and detect interesting peaks in news coverage over the life of a topic. We visualize these peaks in trending news topics using automatically extracted keywords and iconic images, and introduce a novel multimodal algorithm for naming speakers in the news. We also present several intuitive navigation interfaces for interacting with these complex topic structures over different news sources.


Towards a Comprehensive Computational Model for Aesthetic Assessment of Videos.  Subhabrata Bhattacharya, Behnaz Nojavanasghari, Tao Chen, Dong Liu, Shih-Fu Chang, Mubarak Shah  In ACM International Conference on Multimedia (ACM MM)   Barcelona, Spain   October, 2013  



In this paper we propose a novel aesthetic model emphasizing psycho-visual statistics extracted at multiple levels, in contrast to earlier approaches that rely only on descriptors suited for image recognition or based on photographic principles. At the lowest level, we determine dark-channel, sharpness and eye-sensitivity statistics over rectangular cells within a frame. At the next level, we extract SentiBank features (1,200 pre-trained visual classifiers) on a given frame, which invoke specific sentiments such as "colorful clouds" and "smiling face", and collect the classifier responses as frame-level statistics. At the topmost level, we extract trajectories from video shots. Using viewers' fixation priors, the trajectories are labeled as foreground or background/camera motion, on which statistics are computed. Additionally, spatio-temporal local binary patterns are computed to capture texture variations in a given shot. Classifiers are trained on individual feature representations independently. On thorough evaluation of 9 different types of features, we select the best features from each level -- dark channel, affect and camera motion statistics. Next, the corresponding classifier scores are integrated in a sophisticated low-rank fusion framework to improve the final prediction scores. Our approach demonstrates strong correlation with human prediction on 1,000 broadcast quality videos released by NHK as an aesthetic evaluation dataset.


Designing category-level attributes for discriminative visual recognition.  Felix X. Yu; Liangliang Cao; Rogerio S. Feris; John R. Smith; Shih-Fu Chang  In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)   Portland, OR,   June, 2013    [pdf][supplementary material]  

Attribute-based representation has shown great promise for visual recognition but usually requires human design effort. In this paper, we propose a novel formulation to automatically design discriminative category-level attributes, which are distinct from prior works focusing on sample-level attributes. The designed attributes can be used for tasks of cross-category knowledge transfer and zero-shot learning, achieving superior performance on the well-known Animals with Attributes (AwA) dataset and the large-scale ILSVRC2010 dataset (1.2M images).


Robust Object Co-Detection.  Xin Guo, Dong Liu, Brendan Jou, Mojun Zhu, Anni Cai, Shih-Fu Chang  In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)   Portland, OR,   June, 2013   [pdf]  

Object co-detection aims at simultaneous detection of objects of the same category from a pool of related images by exploiting consistent visual patterns present in candidate objects in the images. In this paper, we propose a novel robust approach to dramatically enhance co-detection by extracting a shared low-rank representation of the object instances in multiple feature spaces. The low-rank approach enables effective removal of noisy and outlier samples and can be used to detect the target objects by spectral clustering.


A Bayesian Approach to Multimodal Visual Dictionary Learning.  Go Irie, Dong Liu, Zhenguo Li, Shih-Fu Chang  In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)   Portland, OR,   June, 2013   [pdf] [supplementary material]  

In this paper, we propose a Bayesian co-clustering approach to multimodal visual dictionary learning. Most existing visual dictionary learning methods rely on image descriptors alone. However, Web images are often associated with text data which may carry substantial information regarding image semantics. Our method jointly estimates the underlying distributions of the continuous image descriptors as well as the relationship between such distributions and the textual words through a unified Bayesian inference. 


Sample-Specific Late Fusion for Visual Category Recognition.  Dong Liu, Kuan-Ting Lai, Guangnan Ye, Ming-Syan Chen, Shih-Fu Chang  In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)   Portland, OR,   June, 2013   [pdf]  

In this paper, we propose a sample-specific late fusion method to optimally determine the fusion weight for each sample. The problem is cast as an information propagation process that propagates the fusion weights learned on labeled samples to unlabeled samples, while enforcing that positive samples have higher fused scores than negative ones. We formulate the problem as an $L_\infty$ norm constrained optimization problem and apply the Alternating Direction Method of Multipliers (ADMM) for optimization. Extensive experimental results on various visual categorization tasks show that the proposed method consistently and significantly outperforms state-of-the-art late fusion methods. To the best of our knowledge, this is the first method supporting sample-specific fusion weight learning.


Hash Bit Selection: a Unified Solution for Selection Problems in Hashing.  Xianglong Liu, Junfeng He, Bo Lang, Shih-Fu Chang  In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)   Portland, OR,   June, 2013   [pdf]  

Hashing-based methods have recently shown promise for large-scale nearest neighbor search. However, good designs involve difficult decisions among many unknowns: data features, hashing algorithms, parameter settings, kernels, etc. In this paper, we provide a unified solution, hash bit selection, i.e., selecting the most informative hash bits from a pool of candidates that may have been generated under the various conditions mentioned above. We represent the candidate bit pool as a vertex- and edge-weighted graph with the pooled bits as vertices. We then formulate bit selection as quadratic programming over the graph and solve it efficiently by replicator dynamics. Extensive experiments show that our bit selection approach achieves superior performance over both naive selection methods and state-of-the-art methods in each scenario, usually with significant relative accuracy gains of 10% to 50%.
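The quadratic program over the bit graph can be optimized with a simple multiplicative update. Below is a minimal sketch, not the paper's implementation, of replicator dynamics maximizing $x^\top A x$ over the probability simplex; the 3-bit affinity matrix is an invented toy in which bits 0 and 1 reinforce each other and bit 2 is an outlier:

```python
import numpy as np

def replicator_dynamics(A, n_iter=200, tol=1e-9):
    """Maximize x^T A x over the simplex via replicator dynamics.
    A: nonnegative symmetric affinity matrix over candidate hash bits."""
    n = A.shape[0]
    x = np.full(n, 1.0 / n)           # uniform start on the simplex
    for _ in range(n_iter):
        Ax = A @ x
        x_new = x * Ax / (x @ Ax)     # multiplicative update; stays on simplex
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x                          # large entries = selected bits

# toy affinity: two mutually consistent bits, one outlier bit
A = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.1],
              [0.1, 0.1, 1.0]])
w = replicator_dynamics(A)            # mass concentrates on bits 0 and 1
```

The update never leaves the simplex and monotonically increases the objective for nonnegative symmetric `A`, which is what makes it attractive for this selection problem.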


Label Propagation from ImageNet to 3D Point Clouds.  Yan Wang, Rongrong Ji, Shih-Fu Chang  In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR)   Portland, OR,   June, 2013    [pdf]  

Despite the growing popularity of 3D data, 3D point cloud labeling remains an open problem due to the difficulty of acquiring sufficient labels. In this paper, we overcome this challenge by utilizing massive existing 2D semantically labeled datasets, such as ImageNet and LabelMe, together with a novel "cross-domain" label propagation framework. Our method consists of two novel components: Exemplar-SVM-based label propagation, which solves the cross-domain issue, and a graphical-model-based contextual refinement process. The entire solution requires no training data from the target 3D scenes and scales well to large applications. It achieves significantly higher efficiency and comparable accuracy when no 3D training data is used, and a major accuracy gain when the target training data is incorporated.


$\propto$SVM for learning with label proportions.  Felix X. Yu, Dong Liu, Sanjiv Kumar, Tony Jebara, Shih-Fu Chang.  In International Conference on Machine Learning (ICML)   Atlanta, GA   June, 2013   [PDF][Supp][arXiv][Slides]  

We study the problem of learning with label proportions in which the training data is provided in groups and only the proportion of each class in each group is known. This learning setting has broad applications in data privacy, political science, healthcare, marketing and computer vision. 

We propose a new method called proportion-SVM, or $\propto$SVM (pSVM), which explicitly models the latent unknown instance labels together with the known group label proportions in a large-margin framework. Unlike existing works, our approach avoids making restrictive assumptions about the data. The $\propto$SVM model leads to a non-convex integer programming problem. To solve it efficiently, we propose two algorithms: one based on simple alternating optimization and the other based on a convex relaxation. Extensive experiments on standard datasets show that $\propto$SVM outperforms the state of the art, especially for larger group sizes.
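The alternating-optimization idea can be sketched as follows, using a least-squares linear classifier as a hypothetical stand-in for the paper's large-margin solver; relabeling each bag's top-scoring instances as positive is a simplification of the actual label-update step, which also allows proportion slack:

```python
import numpy as np

def fit_linear(X, y):
    """Least-squares linear classifier (stand-in for the SVM solver)."""
    Xa = np.hstack([X, np.ones((len(X), 1))])
    w, *_ = np.linalg.lstsq(Xa, y.astype(float), rcond=None)
    return w

def decision(X, w):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

def alter_psvm(bags, proportions, n_rounds=10):
    """Alternate: (1) labels fixed -> fit the classifier; (2) classifier
    fixed -> relabel each bag so its positive fraction matches the known
    proportion, making the top-scoring instances positive."""
    X = np.vstack(bags)
    # init: label the first round(p*|bag|) instances of each bag positive
    y = np.concatenate([np.where(np.arange(len(b)) < round(p * len(b)), 1, -1)
                        for b, p in zip(bags, proportions)])
    for _ in range(n_rounds):
        w = fit_linear(X, y)
        s, parts, i = decision(X, w), [], 0
        for b, p in zip(bags, proportions):
            k = round(p * len(b))                 # positives per bag
            lab = np.full(len(b), -1)
            lab[np.argsort(-s[i:i + len(b)])[:k]] = 1
            parts.append(lab)
            i += len(b)
        y = np.concatenate(parts)
    return fit_linear(X, y)

# toy data: two Gaussian clusters, three bags with known proportions
rng = np.random.default_rng(0)
pos = rng.normal(2.0, 1.0, (15, 2))
neg = rng.normal(-2.0, 1.0, (15, 2))
bags = [np.vstack([pos[:5], neg[:5]]),        # 50% positive
        np.vstack([pos[5:12], neg[5:8]]),     # 70% positive
        np.vstack([pos[12:], neg[8:]])]       # 30% positive
w = alter_psvm(bags, [0.5, 0.7, 0.3])
```

Note the classifier never sees an instance label directly, only the bag proportions, yet the alternation recovers a consistent labeling on separable data.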


Learning with Partially Absorbing Random Walks.  Xiao-Ming Wu, Zhenguo Li, Anthony Man-Cho So, John Wright, Shih-Fu Chang  In Proceedings of Annual Conference on Neural Information Processing Systems (NIPS)   Lake Tahoe, NV, USA   December, 2012    [pdf] [Supplement] [Poster]  

We propose a novel stochastic process that, with probability $\alpha_i$, is absorbed at the current state $i$, and with probability $1-\alpha_i$ follows a random edge out of it. We analyze its properties and show its potential for exploring graph structures. We prove that under proper absorption rates, a random walk starting from a set $\mathcal{S}$ of low conductance will be mostly absorbed in $\mathcal{S}$. Moreover, the absorption probabilities vary slowly inside $\mathcal{S}$ while dropping sharply outside it, thus implementing the desirable cluster assumption for graph-based learning. Remarkably, the partially absorbing process unifies many popular models arising in a variety of contexts, provides new insights into them, and makes it possible to transfer findings from one paradigm to another. Simulation results demonstrate its promising applications in retrieval and classification.
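On a finite graph, the absorption probabilities of such a process admit a closed form. A minimal sketch, assuming per-node absorption weights $\lambda_i$ with $\alpha_i = \lambda_i/(\lambda_i + d_i)$, so that the absorption matrix is $A = (\Lambda + L)^{-1}\Lambda$ with $L$ the graph Laplacian; the 4-node graph below is an invented toy with two loosely connected pairs:

```python
import numpy as np

def absorption_probabilities(W, lam):
    """Absorption probabilities of a partially absorbing random walk.
    W: symmetric nonnegative affinity matrix; lam: absorption weights
    lambda_i, giving alpha_i = lambda_i / (lambda_i + d_i).
    Returns A with A[i, j] = P(walk from i is absorbed at j)."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                 # graph Laplacian
    Lam = np.diag(lam)
    return np.linalg.solve(Lam + L, Lam)

# two tight pairs joined by weak edges: absorption stays within each pair
W = np.array([[0, 1, 0.01, 0],
              [1, 0, 0, 0.01],
              [0.01, 0, 0, 1],
              [0, 0.01, 1, 0]], float)
A = absorption_probabilities(W, lam=np.full(4, 0.2))
```

Each row of `A` sums to 1 (the walk is absorbed somewhere with probability 1), and a walk started in one pair is absorbed mostly within that pair, illustrating the cluster behavior described above.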


Hybrid Social Media Network.  Liu, Dong and Ye, Guangnan and Chen, Ching-Ting and Yan, Shuicheng and Chang, Shih-Fu  In Proceeding of ACM International Conference on Multimedia (ACM MM)     October, 2012   [pdf][slides]  

We develop a hybrid social media network that integrates diverse types of entities and relations in an optimally selective way, adapted to each query. Rather than averaging weights over multiple edges or treating diverse relation types identically during inference, we propose an optimal relation selection method based on several sparsity principles. The model not only enables effective information diffusion but also answers the question of which nodes and relations play the most important roles in diffusing information for each query and task. Results on several applications (image tagging, targeted multimedia ads, text-to-image illustration) over real social multimedia data sets confirm clear performance gains.


Accelerated Large Scale Optimization by Concomitant Hashing.  Mu, Yadong and Wright, John and Chang, Shih-Fu  In European Conference on Computer Vision (ECCV)     October, 2012   [pdf]  

In this paper, we propose a brand-new application of hashing techniques: accelerating the common bottleneck operation, min/max inner product, encountered in many large-scale optimization algorithms used in applications such as active-learning SVMs, sparse coding, and other computer vision tasks. Our technique is based on unique properties of order statistics computed from statistically correlated random vectors.


Scene Aligned Pooling for Complex Video Recognition.  Cao, Liangliang and Mu, Yadong and Natsev Apostol and Chang, Shih-Fu and Hua, Gang and Smith, John R.  In European Conference on Computer Vision (ECCV)     October, 2012   [pdf]  

We develop a new visual representation, called scene aligned pooling, for complex video event recognition. The key idea is to align each frame in a video with component scenes, which tend to capture unique characteristics of an event and can be discovered automatically. Feature pooling and event classification are then performed over the aligned scenes. Results on TRECVID MED and the Human Motion Database (HMDB) demonstrate significant performance gains.


Robust and Scalable Graph-Based Semisupervised Learning.  Liu, Wei and Wang, Jun and Chang, Shih-Fu  In Proceedings of the IEEE, vol 100, no 9     September, 2012   [pdf]  

Graph-based semi-supervised learning (GSSL) provides a promising paradigm for modeling the manifold structures that may exist in massive data sources in high-dimensional spaces. This paper provides a survey of several classical GSSL methods and a few promising methods in handling challenging issues often encountered in web-scale applications, including how to handle noisy labels and gigantic data sets.


Weak Attributes for Large-Scale Image Retrieval.  Felix X. Yu, Rongrong Ji, Ming-Hen Tsai, Guangnan Ye, Shih-Fu Chang  In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)   Providence, Rhode Island   June, 2012    

We developed a novel method for retrieving images over a large pool of weak attributes (> 6000), which can be imported from external sources and do not require manual annotations. We used a semi-supervised learning approach to map semantic attribute-based queries to weak attributes and determine the optimal retrieval weights.


Robust Visual Domain Adaptation with Low-Rank Reconstruction.  I-Hong Jhuo, Dong Liu, D.T. Lee, Shih-Fu Chang  In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)   Providence, Rhode Island   June, 2012   [pdf][Project Page]  

We present a low-rank reconstruction method to address the domain distribution disparity when adapting image classification models across domains. Besides clear performance improvements, our method has the unique capability of uncovering outlier points that cannot be adapted.


Robust Late Fusion with Rank Minimization.  Guangnan Ye, Dong Liu, I-Hong Jhuo, Shih-Fu Chang  In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)   Providence, Rhode Island   June, 2012   [pdf] [Project Page][Poster][Supplementary Material]  

We propose a rank minimization method to fuse the predicted confidence scores of multiple models, each of which is obtained based on a certain kind of feature. 


Segmentation Using Superpixels: A Bipartite Graph Partitioning Approach.  Zhenguo Li, Xiao-Ming Wu, Shih-Fu Chang  In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)   Providence, Rhode Island   June, 2012   [pdf] [project page] [poster][software]  

We propose a novel bipartite formulation to integrate superpixels extracted by different models at different scales, along with a highly efficient linear-time spectral solver, demonstrating significantly better image segmentation results.


Exploiting Web Images for Event Recognition in Consumer Videos: A Multiple Source Domain Adaptation Approach.  Lixin Duan, Dong Xu, Shih-Fu Chang  In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)   Providence, Rhode Island   June, 2012   [pdf]  

We developed a new multiple source domain adaptation method called Domain Selection Machine (DSM) for event recognition in consumer videos by leveraging a large number of loosely labeled web images from different sources such as Flickr. A performance gain of 46% is observed.


Mobile Product Search with Bag of Hash Bits and Boundary Reranking.  Junfeng He, Jinyuan Feng, Xianglong Liu, Tao Cheng, Tai-Hsu Lin, Hyunjin Chung, Shih-Fu Chang  In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)   Providence, Rhode Island   June, 2012   [pdf][ poster ]  

In this work, we propose a novel mobile visual search system based on "Bag of Hash Bits" (BoHB), in which each local feature is encoded into a small number of hash bits instead of being quantized to visual words, and the whole image is represented as a bag of hash bits. We also incorporate a boundary feature in the reranking step to describe object shapes, complementing the local features that are usually used to characterize local details.


Spherical Hashing.  Jae-Pil Heo, YoungWoon Lee, Junfeng He, Shih-Fu Chang, Sung-eui Yoon  In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)   Providence, Rhode Island   June, 2012   [pdf]  

In this paper, we propose a novel hashing function based on hyperspheres, Spherical Hashing, which maps spatially coherent data points into the same binary code more effectively than hyperplane-based hashing functions. The excellent results confirm the unique merits of using hyperspheres to encode proximity regions in high-dimensional spaces.
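The encoding step can be sketched in a few lines: bit $k$ records whether a point falls inside hypersphere $k$. The centers and radii below are random placeholders for illustration; in the actual method they are learned to balance and decorrelate the bits:

```python
import numpy as np

def spherical_hash(X, centers, radii):
    """Binary codes from hyperspheres: bit k is 1 iff the point lies
    inside sphere k (distance to center below radius). A minimal sketch
    of the encoding step only; center/radius learning is omitted."""
    # pairwise distances, shape (n_points, n_spheres)
    D = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return (D < radii[None, :]).astype(np.uint8)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                  # 5 points in 8 dimensions
centers = rng.normal(size=(4, 8))            # 4 spheres -> 4-bit codes
radii = np.full(4, 3.0)
codes = spherical_hash(X, centers, radii)    # shape (5, 4), values in {0, 1}
```

Because each bit is bounded by a closed region rather than a half-space, nearby points are more likely to share all their bits, which is the intuition behind the spatial coherence claim.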


Supervised Hashing with Kernels.  Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, Shih-Fu Chang  In IEEE International Conference on Computer Vision and Pattern Recognition (CVPR)   Providence, Rhode Island   June, 2012   (oral presentation) [pdf][ slide ][ code ]  


We showed that supervised information, such as semantic label consistency and metric neighborhoods between pairs of points, can be utilized to design robust, compact hash codes. Our supervised hashing technique is unique in using the code inner product rather than Hamming distance and in incorporating kernel-based models into the optimization process.



Joint Audio-Visual Bi-Modal Codewords for Video Event Detection.  Guangnan Ye, I-Hong Jhuo, Dong Liu, Yu-Gang Jiang, D.T. Lee, Shih-Fu Chang  In ACM International Conference on Multimedia Retrieval (ICMR)   Hong Kong   June, 2012   [pdf][Poster]  

We develop a joint audio-visual bi-modal representation  to discover strong audio-visual joint patterns  in videos for detecting multimedia events.


Compact Hashing for Mixed Image-Keyword Query over Multi-Label Images.  Xianglong Liu, Yadong Mu, Bo Lang, Shih-Fu Chang  In ACM International Conference on Multimedia Retrieval (ICMR)   Hong Kong   June, 2012   (Oral presentation) [pdf][slides]    

This paper addresses the new problem of query-adaptive hashing. Our work is distinct from others in two unique respects: 1) mixed image-keyword queries, where each query comprises an exemplar image and several descriptive keywords, and 2) each data point is associated with multiple labels. We propose a boosting-style algorithm for effectively learning compact hash codes under these settings.


Compact Hyperplane Hashing with Bilinear Functions.  Wei Liu, Jun Wang, Yadong Mu, Sanjiv Kumar, Shih-Fu Chang  In International Conference on Machine Learning (ICML)   Edinburgh, Scotland   June, 2012   [pdf][ poster ][ slide ]  

Hyperplane hashing aims at rapidly searching nearest points to a hyperplane, and has shown major impact in scaling up active learning with SVMs. We developed a projection technique based on a novel bilinear form, with a proven higher collision probability than all prior results. We further developed methods to learn the bilinear functions directly from the data.


On the Difficulty of Nearest Neighbor Search.  Junfeng He, Sanjiv Kumar, Shih-Fu Chang  In International Conference on Machine Learning (ICML)   Edinburgh, Scotland   June, 2012   [pdf][supplementary material][poster]  

How difficult is (approximate) nearest neighbor search in a given data set? Which data properties affect the difficulty of nearest neighbor search, and how? This paper introduces the first concrete measure, called Relative Contrast, that can be used to evaluate the influence of several crucial data characteristics, such as dimensionality, sparsity, and database size, simultaneously in arbitrary normed metric spaces.
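An empirical version of such a measure can be sketched as the ratio of a query's mean distance to the database over its nearest-neighbor distance, averaged over queries (the Gaussian data below is an invented toy; the exact definition in the paper differs in details). As dimensionality grows, distances concentrate, the ratio approaches 1, and search gets harder:

```python
import numpy as np

def relative_contrast(X, queries):
    """Empirical relative-contrast sketch: mean distance to the database
    divided by the nearest-neighbor distance, averaged over queries.
    Values near 1 indicate a hard nearest-neighbor search problem."""
    ratios = []
    for q in queries:
        d = np.linalg.norm(X - q, axis=1)
        ratios.append(d.mean() / d.min())
    return float(np.mean(ratios))

rng = np.random.default_rng(0)
low_d = rng.normal(size=(1000, 5))       # low-dimensional: easy search
high_d = rng.normal(size=(1000, 500))    # high-dimensional: distances concentrate
rc_low = relative_contrast(low_d, rng.normal(size=(10, 5)))
rc_high = relative_contrast(high_d, rng.normal(size=(10, 500)))
```

On this toy data the low-dimensional ratio is several times larger than the high-dimensional one, matching the dimensionality effect the paper quantifies.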


Active Query Sensing for mobile location search.  Felix X. Yu, Rongrong Ji, Shih-Fu Chang  In Proceeding of ACM International Conference on Multimedia (ACM MM), full paper     November, 2011   [pdf][poster][slides][project page]  

While much exciting progress is being made in mobile visual search, one important question has been left unexplored in current systems: when the first query fails to find the right target (up to 50% likelihood), how should the user form his/her search strategy in the subsequent interaction? We propose a novel Active Query Sensing system to suggest the best way of sensing the surrounding scene when forming the second query for location search. This work may open up an exciting new direction for developing interactive mobile media applications through innovative exploitation of active sensing and query formulation.