DVMM-publications


Deep Cross Residual Learning for Multitask Visual Recognition. Brendan Jou and Shih-Fu Chang. In ACM Multimedia, Amsterdam, The Netherlands, October 2016. [pdf] [arXiv] [code]

Residual learning has recently surfaced as an effective means of constructing very deep neural networks for object recognition. However, current incarnations of residual networks do not allow for the modeling and integration of complex relations between closely coupled recognition tasks or across domains. Such problems are often encountered in multimedia applications involving large-scale content recognition. We propose a novel extension of residual learning for deep networks that enables intuitive learning across multiple related tasks using cross-connections called cross-residuals. These cross-residual connections can be viewed as a form of in-network regularization and enable greater network generalization. We show how cross-residual learning (CRL) can be integrated in multitask networks to jointly train and detect visual concepts across several tasks. We present a single multitask cross-residual network with >40% fewer parameters that is able to achieve competitive, or even better, detection performance on a visual sentiment concept detection problem normally requiring multiple specialized single-task networks. The resulting multitask cross-residual network also achieves about 10.4% better detection performance than a standard multitask residual network without cross-residuals, even with a small amount of cross-task weighting.

Event Specific Multimodal Pattern Mining for Knowledge Base Construction. Hongzhi Li, Joseph G. Ellis, Heng Ji, Shih-Fu Chang. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, October 2016. [pdf]

Despite the impressive progress in image recognition, current approaches assume that a predefined set of classes is known (e.g., ImageNet). Such fixed-vocabulary recognition tools fall short when data from new domains with unknown classes are encountered. Automatic discovery of visual patterns and object classes is so far still limited to low-level primitives or mid-level patterns that lack semantic meaning. In this paper, we develop a novel multimodal framework in which deep learning is used to learn representations for both images and texts that can be combined to discover multimodal patterns. Such multimodal patterns can be automatically "named" to describe the semantic concepts unique to each event. The named concepts can then be used to build expanded schemas for each event. We experiment with a large unconstrained corpus of weakly supervised image-caption pairs related to high-level events such as "attack" and "demonstration" and demonstrate the superior performance of the proposed method in terms of the number of automatically discovered named concepts, their semantic coherence, and their relevance to high-level events of interest.

Tamp: A Library for Compact Deep Neural Networks with Structured Matrices. Bingchen Gong, Brendan Jou, Felix X. Yu, Shih-Fu Chang. In ACM Multimedia, Open Source Software Competition, Amsterdam, The Netherlands, October 2016. [code] [pdf]

We introduce Tamp, an open-source C++ library for reducing the space and time costs of deep neural network models. In particular, Tamp implements several recent works that use structured matrices to replace the unstructured matrices that are often bottlenecks in neural networks. Tamp is also designed to serve as a unified development platform with several supported optimization back-ends and abstracted data types. This paper introduces the library's design and API and demonstrates its effectiveness with experiments on public datasets.
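
As a rough illustration of why structured matrices can save space and time, the sketch below uses a circulant matrix, one well-known structured family. It is an assumption that this family is representative; the abstract does not list which structures Tamp implements, and none of the function names below belong to Tamp's C++ API. An n-by-n circulant matrix is defined by a single length-n vector, and its matrix-vector product is a circular convolution that an FFT can evaluate in O(n log n) rather than O(n^2).

```python
import cmath


def fft(values):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(values):
    """Inverse FFT computed via the conjugation identity."""
    n = len(values)
    conj = fft([v.conjugate() for v in values])
    return [v.conjugate() / n for v in conj]


def circulant_matvec(first_column, x):
    """Multiply the circulant matrix defined by first_column with x.

    Only the n entries of first_column are stored, never the n*n matrix.
    """
    spectrum = [a * b for a, b in zip(fft(first_column), fft(x))]
    return [v.real for v in ifft(spectrum)]


def dense_circulant(first_column):
    """Materialize the full matrix, for comparison only."""
    n = len(first_column)
    return [[first_column[(i - j) % n] for j in range(n)] for i in range(n)]


def dense_matvec(matrix, x):
    """Ordinary O(n^2) matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]
```

Calling `circulant_matvec(c, x)` should agree with `dense_matvec(dense_circulant(c), x)` up to floating-point error, while storing n parameters instead of n^2.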

Placing Broadcast News Videos in their Social Media Context using Hashtags. Joseph G. Ellis, Svebor Karaman, Hongzhi Li, Hong Bin Shim, Shih-Fu Chang. In ACM International Conference on Multimedia (ACM MM), Amsterdam, Netherlands, October 2016. [pdf] [video]

With the growth of social media platforms in recent years, social media is now a major source of information and news for many people around the world. In particular, the rise of hashtags has helped to build communities of discussion around particular news, topics, opinions, and ideologies. Television news programs still provide value and are used by a vast majority of the population to obtain their news, but these videos are not easily linked to the broader discussion on social media. We have built a novel pipeline that allows television news to be placed in its relevant social media context by leveraging hashtags. In this paper, we present a method for automatically collecting television news and social media content (Twitter) and discovering the hashtags that are relevant for a TV news video. Our algorithms incorporate both the visual and text information within social media and television content, and we show that leveraging both modalities improves performance over single-modality approaches.

Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs. Zheng Shou, Dongang Wang, and Shih-Fu Chang. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, USA, June 2016. [pdf] [code]

We address temporal action localization in untrimmed long videos. This is important because videos in real applications are usually unconstrained and contain multiple action instances plus video content of background scenes or other activities. To address this challenging issue, we exploit the effectiveness of deep networks in temporal action localization via three segment-based 3D ConvNets: (1) a proposal network identifies candidate segments in a long video that may contain actions; (2) a classification network learns a one-vs-all action classification model to serve as initialization for the localization network; and (3) a localization network fine-tunes the learned classification network to localize each action instance. We propose a novel loss function for the localization network to explicitly consider temporal overlap and achieve high temporal localization accuracy. In the end, only the proposal network and the localization network are used during prediction. On two large-scale benchmarks, our approach achieves significantly better performance than other state-of-the-art systems: mAP increases from 1.7% to 7.4% on MEXaction2 and from 15.0% to 19.0% on THUMOS 2014.
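
The "temporal overlap" mentioned above is commonly quantified as temporal intersection-over-union (IoU) between a candidate segment and a ground-truth action instance. The sketch below shows only that standard measure; it is not the paper's loss function, and the function name and the (start, end) tuple convention are assumptions made for illustration.

```python
def temporal_iou(segment_a, segment_b):
    """Temporal IoU of two (start, end) segments given in seconds or frames."""
    start_a, end_a = segment_a
    start_b, end_b = segment_b
    # Length of the shared interval, clamped at zero when segments are disjoint.
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    # Total covered length, counting the shared part once.
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0
```

For example, segments (0, 10) and (5, 15) share 5 units out of a combined 15, giving an IoU of one third.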

Interactive Segmentation on RGBD Images via Cue Selection. Jie Feng, Brian Price, Scott Cohen, Shih-Fu Chang. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, June 2016. [pdf] [video]

Interactive image segmentation is an important problem in computer vision with many applications, including image editing, object recognition, and image retrieval. Most existing interactive segmentation methods operate only on color images. Until recently, very few works have been proposed to leverage depth information from low-cost sensors to improve interactive segmentation. While these methods achieve better results than color-based methods, they are still limited to either using depth as an additional color channel or simply combining depth with color in a linear way. We propose a novel interactive segmentation algorithm that incorporates multiple feature cues such as color, depth, and normals in a unified graph cut framework to leverage these cues more effectively. A key contribution of our method is that it automatically selects a single cue to be used at each pixel, based on the intuition that only one cue is necessary to determine the segmentation label locally. This is achieved by optimizing over both segmentation labels and cue labels, using terms designed to decide where both the segmentation and label cues should change. Our algorithm thus produces not only the segmentation mask but also a cue label map that indicates where each cue contributes to the final result. Extensive experiments on five large-scale RGBD datasets show that our proposed algorithm performs significantly better than other color-based and RGBD-based algorithms, both in reducing the amount of user input and in increasing segmentation accuracy.

SentiCart: Cartography & Geo-contextualization for Multilingual Visual Sentiment. Brendan Jou*, Margaret Yuying Qian*, Shih-Fu Chang. In ACM International Conference on Multimedia Retrieval (ICMR), New York, NY, June 2016. [pdf] [demo] [video]

Where in the world are pictures of cute animals or ancient architecture most shared from? And are they perceived with the same sentiment across different languages? We demonstrate a series of visualization tools that we collectively call SentiCart for answering such questions and navigating the landscape of how sentiment-biased images are shared around the world in multiple languages. We present visualizations using a large-scale, self-gathered geodata corpus of >1.54M geo-references coming from over 235 countries mined from >15K visual concepts over 12 languages. We also highlight several compelling data-driven findings about multilingual visual sentiment in geo-social interactions.

3D Shape Retrieval using a Single Depth Image from Low-cost Sensors. Jie Feng, Yan Wang, Shih-Fu Chang. In IEEE Winter Conference on Applications of Computer Vision (WACV) 2016, Lake Placid, NY, USA, March 2016. [pdf]

Content-based 3D shape retrieval is an important problem in computer vision. Traditional retrieval interfaces require a 2D sketch or a manually designed 3D model as the query, which is difficult to specify and thus not practical in real applications. With recent advances in low-cost 3D sensors such as Microsoft Kinect and Intel RealSense, capturing depth images that carry 3D information is fairly simple, making shape retrieval more practical and user-friendly. In this paper, we study the problem of cross-domain 3D shape retrieval using a single depth image from low-cost sensors as the query to search for similar human-designed CAD models. We propose a novel method using an ensemble of autoencoders in which each autoencoder is trained to learn a compressed representation of depth views synthesized from each database object. By viewing each autoencoder as a probabilistic model, a likelihood score can be derived as a similarity measure.
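
A toy sketch of the retrieval idea, under loudly stated assumptions: each database object gets its own model, a query is scored by how well each model reconstructs it, and objects are ranked by that score. Here the "autoencoder" is a hypothetical one-unit linear model with tied weights and no training step, and the score assumes an isotropic Gaussian noise model so that log-likelihood reduces to negative squared reconstruction error up to a constant. None of this is taken from the paper beyond the ensemble-and-likelihood idea in the abstract.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


class TinyLinearAutoencoder:
    """Hypothetical one-unit linear autoencoder with tied weights."""

    def __init__(self, direction):
        # In a real system these weights would be learned from depth views.
        self.weights = normalize(direction)

    def reconstruct(self, x):
        code = sum(w * v for w, v in zip(self.weights, x))  # encode
        return [code * w for w in self.weights]  # decode


def log_likelihood_score(model, x, sigma=1.0):
    """Gaussian log-likelihood up to an additive constant (assumed noise model)."""
    recon = model.reconstruct(x)
    squared_error = sum((a - b) ** 2 for a, b in zip(x, recon))
    return -squared_error / (2.0 * sigma ** 2)


def rank_objects(models, query):
    """Return object names ordered from most to least likely match."""
    scores = {name: log_likelihood_score(m, query) for name, m in models.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

With one model per object, `rank_objects(models, query)` places first the object whose model reconstructs the query with the least error.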