Vision + Language

  1. Grounding Referring Expressions in Images by Variational Context. Hanwang Zhang, Yulei Niu, Shih-Fu Chang. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, USA. June, 2018. [pdf]
    We focus on grounding (i.e., localizing or linking) referring expressions in images, e.g., "largest elephant standing behind baby elephant". This is a general yet challenging vision-language task since it requires not only the localization of objects but also the multimodal comprehension of context: visual attributes (e.g., "largest", "baby") and relationships (e.g., "behind") that help to distinguish the referent from other objects, especially those of the same category. Due to the exponential complexity involved in modeling the context associated with multiple image regions, existing work oversimplifies this task to pairwise region modeling by multiple instance learning. In this paper, we propose a variational Bayesian method, called Variational Context, to solve the problem of complex context modeling in referring expression grounding. Our model exploits the reciprocal relation between the referent and context, i.e., either of them influences the estimation of the posterior distribution of the other, and thereby the search space of context can be greatly reduced, resulting in better localization of the referent. We develop a novel cue-specific language-vision embedding network that learns this reciprocity model end-to-end. We also extend the model to the unsupervised setting where no annotation for the referent is available. Extensive experiments on various benchmarks show consistent improvement over state-of-the-art methods in both supervised and unsupervised settings.
  2. PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN. Hanwang Zhang, Zawlin Kyaw, Jinyang Yu, Shih-Fu Chang. In International Conference on Computer Vision (ICCV), Venice, Italy. October, 2017. [pdf]
    We aim to tackle a novel vision task called Weakly Supervised Visual Relation Detection (WSVRD) to detect "subject-predicate-object" relations in an image with object relation groundtruths available only at the image level. This is motivated by the fact that it is extremely expensive to label the combinatorial relations between objects at the instance level. Compared to the extensively studied problem of Weakly Supervised Object Detection (WSOD), WSVRD is more challenging as it needs to examine a large set of region pairs, which is computationally prohibitive and more likely to get stuck in a locally optimal solution such as those involving wrong spatial context. To this end, we present a Parallel, Pairwise Region-based, Fully Convolutional Network (PPR-FCN) for WSVRD. It uses a parallel FCN architecture that simultaneously performs pair selection and classification of single regions and region pairs for object and relation detection, while sharing almost all computation over the entire image. In particular, we propose a novel position-role-sensitive score map with pairwise RoI pooling to efficiently capture the crucial context associated with a pair of objects. We demonstrate the superiority of PPR-FCN over all baselines in solving the WSVRD challenge through extensive experiments on two visual relation benchmarks.
  3. Visual Translation Embedding Network for Visual Relation Detection. Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, Tat-Seng Chua. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, Hawaii. July, 2017. [pdf]
    Visual relations, such as “person ride bike” and “bike next to car”, offer a comprehensive scene understanding of an image, and have already shown their great utility in connecting computer vision and natural language. However, due to the challenging combinatorial complexity of modeling subject-predicate-object relation triplets, very little work has been done to localize and predict visual relations. Inspired by the recent advances in relational representation learning of knowledge bases and convolutional object detection networks, we propose a Visual Translation Embedding network (VTransE) for visual relation detection. VTransE places objects in a low-dimensional relation space where a relation can be modeled as a simple vector translation, i.e., subject + predicate ≈ object. We propose a novel feature extraction layer that enables object-relation knowledge transfer in a fully-convolutional fashion that supports training and inference in a single forward/backward pass. To the best of our knowledge, VTransE is the first end-to-end relation detection network. We demonstrate the effectiveness of VTransE over other state-of-the-art methods on two large-scale datasets: Visual Relationship and Visual Genome. Note that even though VTransE is a purely visual model, it is also competitive with prior work on multi-modal models with language priors.
  4. Event Specific Multimodal Pattern Mining for Knowledge Base Construction. Hongzhi Li, Joseph G. Ellis, Heng Ji, Shih-Fu Chang. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands. October, 2016. [pdf]
    Despite the impressive progress in image recognition, current approaches assume that a predefined set of classes is known (e.g., ImageNet). Such recognition tools with fixed vocabularies do not meet the needs that arise when data from new domains with unknown classes are encountered. Automatic discovery of visual patterns and object classes is so far still limited to low-level primitive or mid-level patterns that lack semantic meaning. In this paper, we develop a novel multi-modal framework, in which deep learning is used to learn representations for both images and texts that can be combined to discover multimodal patterns. Such multimodal patterns can be automatically "named" to describe the semantic concepts unique to each event. The named concepts can then be used to build expanded schemas for each event. We experiment with a large unconstrained corpus of weakly-supervised image-caption pairs related to high-level events such as "attack" and "demonstration" and demonstrate the superior performance of the proposed method in terms of the number of automatically discovered named concepts, their semantic coherence, and relevance to high-level events of interest.
  5. Placing Broadcast News Videos in their Social Media Context using Hashtags. Joseph G. Ellis, Svebor Karaman, Hongzhi Li, Hong Bin Shim, Shih-Fu Chang. In ACM International Conference on Multimedia (ACM MM), Amsterdam, The Netherlands. October, 2016. [pdf]
    With the growth of social media platforms in recent years, social media is now a major source of information and news for many people around the world. In particular, the rise of hashtags has helped to build communities of discussion around particular news, topics, opinions, and ideologies. Television news programs still provide value and are used by a vast majority of the population to obtain their news, but these videos are not easily linked to the broader discussion on social media. We have built a novel pipeline that allows television news to be placed in its relevant social media context by leveraging hashtags. In this paper, we present a method for automatically collecting television news and social media content (Twitter) and discovering the hashtags that are relevant for a TV news video. Our algorithms incorporate both the visual and text information within social media and television content, and we show that by leveraging both modalities we can improve performance over single-modality approaches.
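
The translation idea in item 3 (VTransE), subject + predicate ≈ object, can be illustrated with a toy sketch. This is not the authors' network: the 2-D vectors below are hand-picked for illustration rather than learned from image features, and the function names (`translate`, `translation_distance`, `rank_predicates`) are hypothetical. It only shows how a predicate could be scored by how closely it translates the subject embedding onto the object embedding.

```python
import math


def translate(subject, predicate):
    """Shift the subject embedding by the predicate vector."""
    return [s + p for s, p in zip(subject, predicate)]


def translation_distance(subject, predicate, obj):
    """Euclidean distance between (subject + predicate) and the object."""
    predicted = translate(subject, predicate)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(predicted, obj)))


def rank_predicates(subject, obj, predicates):
    """Return (name, distance) pairs, best-fitting predicate first."""
    scored = [
        (name, translation_distance(subject, vec, obj))
        for name, vec in predicates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


# Hand-picked toy embeddings (assumptions, not learned values).
person = [0.0, 1.0]
bike = [1.0, 0.0]
predicates = {
    "ride": [1.0, -1.0],    # chosen so that person + ride == bike
    "next to": [0.5, 0.5],
}

ranking = rank_predicates(person, bike, predicates)
for name, distance in ranking:
    print(f"{name}: {distance:.3f}")
```

In this toy setup "ride" ranks first because it maps `person` exactly onto `bike`. In the paper the embeddings and predicate vectors are learned end-to-end from visual features, which this sketch does not attempt to reproduce.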