

PPR-FCN: Weakly Supervised Visual Relation Detection via Parallel Pairwise R-FCN.  Hanwang Zhang, Zawlin Kyaw, Jinyang Yu, Shih-Fu Chang  In International Conference on Computer Vision. ICCV 2017.   Venice, Italy   October, 2017   [pdf]   

We aim to tackle a novel vision task called Weakly Supervised Visual Relation Detection (WSVRD): detecting "subject-predicate-object" relations in an image with object relation ground truths available only at the image level. This is motivated by the fact that it is extremely expensive to label the combinatorial relations between objects at the instance level. Compared to the extensively studied Weakly Supervised Object Detection (WSOD) problem, WSVRD is more challenging because it must examine a large set of region pairs, which is computationally prohibitive and more likely to get stuck in a poor local optimum, such as one involving the wrong spatial context. To this end, we present a Parallel, Pairwise Region-based Fully Convolutional Network (PPR-FCN) for WSVRD. It uses a parallel FCN architecture that simultaneously performs pair selection and classification of single regions and region pairs for object and relation detection, while sharing almost all computation over the entire image. In particular, we propose a novel position-role-sensitive score map with pairwise RoI pooling to efficiently capture the crucial context associated with a pair of objects. We demonstrate the superiority of PPR-FCN over all baselines on the WSVRD challenge through extensive experiments on two visual relation benchmarks.
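The position-sensitive pooling that the pairwise RoI pooling above extends can be illustrated with a small sketch. This is not the paper's implementation: the function names, the k=3 grid, and the reduction to a single pooled score per role are illustrative assumptions, and the per-class and per-role channel groups of the actual position-role-sensitive score maps are collapsed to one map stack per role for brevity.

```python
import numpy as np

def position_sensitive_roi_pool(score_maps, roi, k=3):
    """Position-sensitive RoI pooling in the spirit of R-FCN.

    score_maps: (k*k, H, W) array -- one score map per spatial bin
                (per-class channel groups are omitted for brevity).
    roi: (x0, y0, x1, y1) region in feature-map coordinates.
    Returns a single score averaged over the k*k bins.
    """
    x0, y0, x1, y1 = roi
    bin_w = (x1 - x0) / k
    bin_h = (y1 - y0) / k
    scores = []
    for i in range(k):          # bin row
        for j in range(k):      # bin column
            y_lo = int(y0 + i * bin_h)
            x_lo = int(x0 + j * bin_w)
            ys = slice(y_lo, max(int(y0 + (i + 1) * bin_h), y_lo + 1))
            xs = slice(x_lo, max(int(x0 + (j + 1) * bin_w), x_lo + 1))
            # each bin pools only from "its own" map: position sensitivity
            scores.append(score_maps[i * k + j, ys, xs].mean())
    return float(np.mean(scores))

def pairwise_roi_pool(subj_maps, obj_maps, subj_roi, obj_roi, k=3):
    """A pairwise variant: pool each role from role-specific map stacks
    and combine the two pooled scores for the (subject, object) pair."""
    return (position_sensitive_roi_pool(subj_maps, subj_roi, k)
            + position_sensitive_roi_pool(obj_maps, obj_roi, k))
```

Because every RoI pools from the same shared score maps, almost no per-region computation is needed, which is what makes examining many region pairs tractable.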


Learning Spread-out Local Feature Descriptors.  Xu Zhang, Felix X. Yu, Sanjiv Kumar, Shih-Fu Chang  In International Conference on Computer Vision. ICCV 2017   Venice, Italy   October, 2017    [pdf][code]  

We develop a novel deep-learning-based method for learning robust local image descriptors, using a simple yet powerful regularization technique that can significantly improve both the pairwise and triplet losses. The idea is that, in order to fully utilize the expressive power of the descriptor space, good local feature descriptors should be sufficiently "spread out" in that space. In this work, we assume non-matching features follow a uniform distribution and propose a regularization term that maximizes the spread of the feature descriptors based on the theoretical properties of the uniform distribution. We show that the proposed regularization with triplet loss outperforms existing Euclidean-distance-based descriptor learning techniques by a large margin. As an extension, the proposed regularization technique can also be used to improve image-level deep feature embedding.
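The "spread-out" idea can be sketched numerically. For unit vectors drawn uniformly on the d-dimensional sphere, the inner product of a non-matching pair has mean 0 and second moment 1/d; a regularizer can push a batch of non-matching pairs toward those statistics. The function below is an illustrative sketch of such a term, not the paper's exact formulation (the name `spread_out_reg` and the batch layout are assumptions):

```python
import numpy as np

def spread_out_reg(f1, f2):
    """Spread-out regularizer over non-matching descriptor pairs.

    f1, f2: (N, d) arrays of L2-normalized descriptors, where row i of f1
    and row i of f2 form a NON-matching pair.
    Penalizes deviation from the statistics of uniformly spread vectors:
    inner products with mean 0 and second moment at most 1/d.
    """
    d = f1.shape[1]
    dots = np.sum(f1 * f2, axis=1)    # inner products of the pairs
    m1 = dots.mean()                  # should approach 0
    m2 = (dots ** 2).mean()           # should approach 1/d
    return m1 ** 2 + max(0.0, m2 - 1.0 / d)
```

Perfectly "spread" pairs (e.g. mutually orthogonal descriptors) incur zero penalty, while collapsed descriptors that all point the same way are penalized heavily.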


Visual Translation Embedding Network for Visual Relation Detection.  Hanwang Zhang, Zawlin Kyaw, Shih-Fu Chang, Tat-Seng Chua  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)   Honolulu, Hawaii   July, 2017   [arxiv][codes][demo]  

Visual relations, such as “person ride bike” and “bike next to car”, offer a comprehensive scene understanding of an image, and have already shown their great utility in connecting computer vision and natural language. However, due to the challenging combinatorial complexity of modeling subject-predicate-object relation triplets, very little work has been done to localize and predict visual relations. Inspired by the recent advances in relational representation learning of knowledge bases and convolutional object detection networks, we propose a Visual Translation Embedding network (VTransE) for visual relation detection. VTransE places objects in a low-dimensional relation space where a relation can be modeled as a simple vector translation, i.e., subject + predicate ≈ object. We propose a novel feature extraction layer that enables object-relation knowledge transfer in a fully-convolutional fashion and supports training and inference in a single forward/backward pass. To the best of our knowledge, VTransE is the first end-to-end relation detection network. We demonstrate the effectiveness of VTransE over other state-of-the-art methods on two large-scale datasets: Visual Relationship and Visual Genome. Note that even though VTransE is a purely visual model, it is competitive with prior multi-modal models that use language priors.
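The translation-embedding scoring rule (subject + predicate ≈ object) can be sketched directly: the best predicate is the learned translation vector closest to the difference between the projected object and subject features. The sketch below is illustrative only; the function name and the plain Euclidean nearest-neighbor decision are assumptions, and the learned projection from visual features into the relation space is omitted.

```python
import numpy as np

def predict_predicate(subj_feat, obj_feat, predicate_vectors):
    """Translation-embedding scoring: subject + predicate ~= object,
    so the best predicate p minimizes ||(object - subject) - t_p||.

    subj_feat, obj_feat: (d,) features projected into the relation space.
    predicate_vectors: (P, d) one learned translation vector per predicate.
    Returns the index of the best-scoring predicate.
    """
    diff = obj_feat - subj_feat                      # the required "translation"
    dists = np.linalg.norm(predicate_vectors - diff, axis=1)
    return int(np.argmin(dists))
```

Modeling predicates as translations keeps the relation space low-dimensional: P predicates need only P vectors rather than P pairwise classifiers over subject-object feature combinations.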


Learning Discriminative and Transformation Covariant Local Feature Detectors.  Xu Zhang, Felix X. Yu, Svebor Karaman, Shih-Fu Chang  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)   Honolulu, Hawaii   July, 2017    [pdf][code]  

Robust covariant local feature detectors are important for detecting local features that are (1) discriminative of the image content and (2) repeatably detectable at consistent locations when the image undergoes diverse transformations. Such detectors are critical for applications such as image search and scene reconstruction. Many learning-based local feature detectors address one of these two problems while overlooking the other. In this work, we propose a novel learning-based method to simultaneously address both issues. Specifically, we extend the previously proposed covariant constraint by defining the concepts of “standard patch” and “canonical feature” and leverage these to train a novel robust covariant detector. We show that the introduction of these concepts greatly simplifies the learning stage of the covariant detector, and also makes the detector much more robust. Extensive experiments show that our method outperforms previous handcrafted and learning-based detectors by large margins in terms of repeatability.


CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos.  Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, and Shih-Fu Chang  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), oral session   Honolulu, Hawaii   July, 2017    [pdf] [arxiv] [code] [project]  

Temporal action localization is an important yet challenging problem. Given a long, untrimmed video consisting of multiple action instances and complex background contents, we need not only to recognize the action categories but also to localize the start and end time of each instance. Many state-of-the-art systems use segment-level classifiers to select and rank proposal segments of pre-determined boundaries. However, a desirable model should move beyond the segment level and make dense predictions at a fine granularity in time to determine precise temporal boundaries. To this end, we design a novel Convolutional-De-Convolutional (CDC) network that places CDC filters on top of 3D ConvNets, which have been shown to be effective for abstracting action semantics but reduce the temporal length of the input data. The proposed CDC filter performs the required temporal upsampling and spatial downsampling operations simultaneously to predict actions at the frame-level granularity. It is unique in jointly modeling action semantics in space-time and fine-grained temporal dynamics. We train the CDC network in an end-to-end manner efficiently. Our model not only achieves superior performance in detecting actions in every frame, but also significantly boosts the precision of localizing temporal boundaries. Finally, the CDC network demonstrates a very high efficiency with the ability to process 500 frames per second on a single GPU server. Source code and trained models are available online at https://bitbucket.org/columbiadvmm/cdc.
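The key shape transformation of a CDC filter, upsampling time while downsampling space in a single step, can be sketched as follows. This is a heavily simplified illustration, not the paper's convolutional implementation: spatial downsampling is reduced to global average pooling, temporal upsampling to 2x is done with two separate weight matrices (so the two output frames per input step are not mere copies), and all names are assumptions.

```python
import numpy as np

def cdc_layer(x, w_a, w_b):
    """Simplified CDC-style operation: 2x temporal upsampling combined
    with spatial downsampling to 1x1 in one step.

    x: (C, L, H, W) feature volume from a 3D ConvNet.
    w_a, w_b: (C, C) weights producing the two output timesteps
              generated from each input timestep.
    Returns: (C, 2L) per-timestep features.
    """
    C, L, H, W = x.shape
    pooled = x.mean(axis=(2, 3))     # spatial downsampling -> (C, L)
    out = np.empty((C, 2 * L))
    out[:, 0::2] = w_a @ pooled      # first upsampled frame per input step
    out[:, 1::2] = w_b @ pooled      # second upsampled frame per input step
    return out
```

Stacking such layers restores a 3D ConvNet's temporally downsampled features (e.g. L/8 steps) back to one prediction per input frame, which is what enables frame-level boundary localization.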


Deep Image Set Hashing.  Jie Feng, Svebor Karaman, Shih-Fu Chang  In IEEE Winter Conference on Applications of Computer Vision   Santa Rosa, California, USA   March, 2017    [pdf]  


Learning-based hashing is often used in large-scale image retrieval because it provides a compact representation of each sample and allows two samples to be compared efficiently via the Hamming distance. However, most hashing methods encode each image separately and discard the knowledge that multiple images in the same set represent the same object or person. We investigate the set hashing problem by combining both set representation and hashing in a single deep neural network. An image set is first passed to a CNN module to extract image features; these features are then aggregated using two types of set features to capture both set-specific and database-wide distribution information. The computed set feature is then fed into a multilayer perceptron to learn a compact binary embedding trained with a triplet loss. We extensively evaluate our approach on multiple image datasets and show highly competitive performance compared to state-of-the-art methods.
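The triplet loss used to train the set embedding can be sketched in its standard form: pull the anchor toward a positive (same identity) embedding and push it away from a negative one by at least a margin. The function name and the margin value are illustrative assumptions, and the set-aggregation and binarization steps that precede the loss are omitted.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Standard triplet loss on (set-level) embeddings: the anchor should
    be closer to the positive than to the negative by at least `margin`.

    anchor, positive, negative: (d,) embedding vectors.
    """
    d_pos = np.sum((anchor - positive) ** 2)   # squared distance to positive
    d_neg = np.sum((anchor - negative) ** 2)   # squared distance to negative
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the negative is sufficiently far away, so training effort concentrates on triplets that still violate the margin.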