
An exploration of parameter redundancy in deep networks with circulant projections.  Yu Cheng*, Felix X. Yu*, Rogerio Feris, Sanjiv Kumar, Alok Choudhary, Shih-Fu Chang  In International Conference on Computer Vision (ICCV)     December, 2015    [arXiv]  

We explore the redundancy of parameters in deep neural networks by replacing the conventional linear projection in fully-connected layers with a circulant projection. The circulant structure substantially reduces the memory footprint and enables the use of the Fast Fourier Transform (FFT) to speed up the computation. For a fully-connected neural network layer with d input nodes and d output nodes, this method improves the time complexity from O(d^2) to O(d log d) and the space complexity from O(d^2) to O(d). The space savings are particularly important for modern deep convolutional neural network architectures, where fully-connected layers typically contain more than 90% of the network parameters. We further show that gradient computation and optimization of the circulant projections can be performed very efficiently. Our experiments on three standard datasets show that the proposed approach achieves these significant gains in storage and efficiency with minimal increase in error rate compared to neural networks with unstructured projections.
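As a minimal sketch (ours, not the paper's code), a circulant projection can be applied with NumPy's FFT: the first column `c` fully determines the matrix, which is what gives the O(d) storage and O(d log d) multiplication described above.

```python
import numpy as np

def circulant_matvec(c, x):
    """Multiply the circulant matrix whose first column is c by x.
    Uses circular convolution via the FFT: O(d log d) instead of the
    O(d^2) cost of a dense matrix-vector product."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# Sanity check against an explicitly formed circulant matrix.
d = 8
c = np.random.randn(d)
x = np.random.randn(d)
C = np.column_stack([np.roll(c, j) for j in range(d)])  # dense circulant
assert np.allclose(C @ x, circulant_matvec(c, x))
```

Note that only the d-vector `c` is ever stored; the dense matrix `C` is built here purely to verify the FFT-based product.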


Fast Orthogonal Projection Based on Kronecker Product.  Xu Zhang, Felix X. Yu, Ruiqi Guo, Sanjiv Kumar, Shengjin Wang and Shih-Fu Chang  In ICCV 2015   Santiago de Chile   December, 2015    [pdf][code]  

We propose a family of structured matrices to speed up orthogonal projections for high-dimensional data commonly seen in computer vision applications. In this family, a structured matrix is formed as the Kronecker product of a series of smaller orthogonal matrices. This achieves O(d log d) computational complexity and O(log d) space complexity for d-dimensional data, a drastic improvement over standard unstructured projections, whose computation and space complexities are both O(d^2).
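A hedged sketch (our illustration, not the released code) of applying a Kronecker-structured projection without materializing the full d x d matrix: each small factor acts along one axis of the reshaped input, which is where the O(d log d) time and O(log d) parameter storage come from when the factors have fixed small size.

```python
import numpy as np

def kron_apply(mats, x):
    """Compute (A_1 kron A_2 kron ... kron A_k) @ x without forming the
    full Kronecker product: reshape x into a tensor and apply each small
    factor along its own axis."""
    sizes = [A.shape[0] for A in mats]
    y = x.reshape(sizes)
    for i, A in enumerate(mats):
        # Contract A with axis i of y, then move the result axis back to i.
        y = np.moveaxis(np.tensordot(A, y, axes=([1], [i])), 0, i)
    return y.reshape(-1)

# Sanity check against the explicitly formed Kronecker product.
A = np.array([[0., 1.], [1., 0.]])
B = np.eye(2)
x = np.arange(4.0)
assert np.allclose(kron_apply([A, B], x), np.kron(A, B) @ x)
```

If every factor is orthogonal, the Kronecker product is orthogonal as well, so the projection preserves vector norms.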


Visual Affect Around the World: A Large-scale Multilingual Visual Sentiment Ontology.  Brendan Jou*, Tao Chen*, Nikolaos Pappas*, Miriam Redi*, Mercan Topkara*, and Shih-Fu Chang  In ACM Multimedia   Brisbane, Australia   October, 2015    [pdf]  

Every culture and language is unique. Our work expressly focuses on the uniqueness of culture and language in relation to human affect, specifically sentiment and emotion semantics, and how they manifest in social multimedia. We develop sets of sentiment- and emotion-polarized visual concepts by adapting semantic structures called adjective-noun pairs, originally introduced by Borth et al. (2013), but in a multilingual context. We propose a new language-dependent method for automatic discovery of these adjective-noun constructs. We show how this pipeline can be applied on a social multimedia platform for the creation of a large-scale multilingual visual sentiment concept ontology (MVSO). Unlike the flat structure in Borth et al. (2013), our unified ontology is organized hierarchically by multilingual clusters of visually detectable nouns and subclusters of emotionally biased versions of these nouns. In addition, we present an image-based prediction task to show how generalizable language-specific models are in a multilingual context. A new, publicly available dataset of > 15.6K sentiment-biased visual concepts across 12 languages with language-specific detector banks, > 7.36M images and their metadata is also released.


EventNet: A Large Scale Structured Concept Library for Complex Event Detection in Video.  Guangnan Ye*, Yitong Li*, Hongliang Xu, Dong Liu, Shih-Fu Chang  In ACM Multimedia (ACM MM)   Brisbane, Australia   October, 2015    [pdf]  
Event-specific concepts are semantic concepts specifically designed for the events of interest, which can be used as a mid-level representation of complex events in videos. Existing methods only focus on defining event-specific concepts for a small number of pre-defined events, and cannot handle novel unseen events. This motivates us to build a large-scale event-specific concept library that covers as many real-world events and their concepts as possible. Specifically, we choose WikiHow, an online forum containing a large number of how-to articles on human daily life events. We perform a coarse-to-fine event discovery process and discover 500 events from WikiHow articles. We then use each event name as a query to search YouTube and discover event-specific concepts from the tags of the returned videos. After automatic filtering, we end up with around 95,321 videos and 4,490 concepts. We train a Convolutional Neural Network (CNN) model on the 95,321 videos over the 500 events, and use the model to extract deep features from video content. With these learned features, we train 4,490 binary SVM classifiers as the event-specific concept library. The concepts and events are further organized in a hierarchical structure defined by WikiHow, and the resulting concept library is called EventNet. Finally, the EventNet concept library is used to generate concept-based representations of event videos. To the best of our knowledge, EventNet represents the first video event ontology that organizes events and their concepts into a semantic structure. It offers great potential for event retrieval and browsing.
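The representation step above can be sketched as follows. This is a toy illustration with random stand-in data, not the paper's pipeline: the paper trains 4,490 binary SVMs on CNN features from ~95K videos, whereas here a dependency-free regularized least-squares linear classifier stands in for each SVM.

```python
import numpy as np

rng = np.random.default_rng(0)
n_videos, feat_dim, n_concepts = 200, 64, 5   # paper scale: ~95K videos, 4,490 concepts
X = rng.normal(size=(n_videos, feat_dim))     # stand-in for CNN features
Y = rng.integers(0, 2, size=(n_videos, n_concepts)) * 2 - 1  # +/-1 concept labels

# One linear classifier per event-specific concept, fit jointly via ridge
# regression (a stand-in for the paper's per-concept binary SVMs).
lam = 1.0
W = np.linalg.solve(X.T @ X + lam * np.eye(feat_dim), X.T @ Y)

# Concept-based representation of each video: its scores under all concept models.
concept_repr = X @ W   # shape (n_videos, n_concepts)
```

Each video is thus re-described by its responses to the concept bank, which is the mid-level representation used for event retrieval.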


Large Video Event Ontology Browsing, Search and Tagging (EventNet Demo).  Hongliang Xu, Guangnan Ye, Yitong Li, Dong Liu, Shih-Fu Chang  In ACM Multimedia (ACM MM)   Brisbane, Australia   October, 2015   [pdf]  

EventNet is the largest video event ontology in existence today, consisting of 500 events and 4,490 event-specific concepts systematically discovered from crowdsourced forums such as WikiHow. Such sources offer rich information about events happening in everyday life. Additionally, it includes automatic detection models for the constituent events and concepts, built using deep learning with around 95K training videos from YouTube. In this demo, we present several novel functions of EventNet: 1) interactive ontology browsing, 2) semantic event search, and 3) tagging of user-uploaded videos via open web interfaces. The system is the first to allow users to explore rich hierarchical structures among video events, relations between concepts and events, and automatic detection of events and concepts embedded in user-uploaded videos in a live fashion.


CamSwarm: Instantaneous Smartphone Camera Arrays for Collaborative Photography.  Yan Wang, Jue Wang, Shih-Fu Chang  In arXiv preprint     August, 2015   [arXiv] [video]  

Camera arrays (CamArrays) are widely used in commercial filming projects for achieving special visual effects such as the bullet-time effect, but are very expensive to set up. We propose CamSwarm, a low-cost and lightweight alternative to professional CamArrays for consumer applications. It allows the construction of a collaborative photography platform from multiple mobile devices anywhere and anytime, enabling new capturing and editing experiences that a single camera cannot provide. Our system allows easy team formation; uses real-time visualization and feedback to guide camera positioning; provides a mechanism for synchronized capturing; and finally allows the user to efficiently browse and edit the captured imagery. Our user study suggests that CamSwarm is easy to use; the provided real-time guidance is helpful; and the full system achieves high-quality results promising for non-professional use.


New Insights into Laplacian Similarity Search.  Xiao-Ming Wu, Zhenguo Li, and Shih-Fu Chang  In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)   Boston, USA   June, 2015   [pdf] [supplement] [abstract]  

Graph-based computer vision applications rely critically on similarity metrics which compute the pairwise similarity between any pair of vertices on graphs. This paper investigates the fundamental design of commonly used similarity metrics, and provides new insights to guide their use in practice. In particular, we introduce a family of similarity metrics in the form of $(L+\alpha\Lambda)^{-1}$, where $L$ is the graph Laplacian, $\Lambda$ is a positive diagonal matrix acting as a regularizer, and $\alpha$ is a positive balancing factor. Such metrics respect graph topology when $\alpha$ is small, and reproduce well-known metrics such as hitting times and the pseudo-inverse of the graph Laplacian with different regularizers $\Lambda$.

This paper is the first to analyze the important impact of selecting $\Lambda$ in retrieving the local cluster from a seed. We find that different $\Lambda$ can lead to surprisingly complementary behaviors: $\Lambda = D$ (degree matrix) can reliably extract the cluster of a query if it is sparser than surrounding clusters, while $\Lambda = I$ (identity matrix) is preferred if it is denser than surrounding clusters. Since in practice there is no reliable way to determine the local density in order to select the right model, we propose a new design of $\Lambda$ that automatically adapts to the local density. Experiments on image retrieval verify our theoretical arguments and confirm the benefit of the proposed metric. We expect the insights of our theory to provide guidelines for more applications in computer vision and other domains.
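For concreteness, here is a small sketch of the metric family $(L+\alpha\Lambda)^{-1}$ and its two regularizers on a hypothetical toy graph of our own choosing (two triangles joined by a bridge edge), not data from the paper.

```python
import numpy as np

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

D = np.diag(W.sum(axis=1))     # degree matrix
L = D - W                      # graph Laplacian
alpha = 0.01                   # small alpha, so the metric respects topology

def similarity(Lam):
    """The metric family (L + alpha * Lambda)^{-1} from the paper."""
    return np.linalg.inv(L + alpha * Lam)

S_deg = similarity(D)          # Lambda = D
S_id  = similarity(np.eye(6))  # Lambda = I

# Under either regularizer, node 0 is more similar to node 1 (same
# triangle) than to node 4 (the other triangle).
```

On this symmetric toy graph both choices of $\Lambda$ behave alike; the paper's point is that they diverge when the query's cluster is sparser or denser than its surroundings.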


Attributes and Categories for Generic Instance Search from One Example.  Ran Tao, Arnold WM Smeulders, Shih-Fu Chang  In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)   Boston   June, 2015    [pdf]  

This paper aims for generic instance search from one example, where the instance can be an arbitrary 3D object like a shoe, not just a near-planar and one-sided instance like a building or logo. Firstly, we evaluate state-of-the-art instance search methods on this problem. We observe that what works for buildings loses its generality on shoes. Secondly, we propose to use automatically learned category-specific attributes to address the large appearance variations present in generic instance search. On the problem of searching among instances from the same category as the query, the category-specific attributes outperform existing approaches by a large margin. On a shoe dataset containing 6624 shoe images recorded from all viewing angles, we improve the performance from 36.73 to 56.56 using category-specific attributes. Thirdly, we extend our method to search for objects without restricting the search to a specifically known category. We show that combining category-level information with the category-specific attributes is superior to combining category-level information with low-level features such as the Fisher vector.


Regrasping and Unfolding of Garments Using Predictive Thin Shell Modeling.  Yinxiao Li, Danfei Xu, Yonghao Yue, Yan Wang, Shih-Fu Chang, Eitan Grinspun, and Peter K. Allen  In IEEE International Conference on Robotics and Automation (ICRA)     May, 2015    [video] [pdf]  

Deformable objects such as garments are highly unstructured, making them difficult to recognize and manipulate. In this paper, we propose a novel method to teach a two-arm robot to efficiently track the states of a garment from an unknown state to a known state by iterative regrasping. The problem is formulated as a constrained weighted evaluation metric for the two desired grasping points during regrasping, which can also serve as a convergence criterion. The result is then adopted as an estimate to initialize a regrasp, which is in turn treated as a new state for evaluation. The process stops when the predicted thin-shell model conclusively agrees with the reconstruction. We show experimental results for regrasping a number of different garments, including sweaters, knitwear, pants, and leggings.