Domain Adaptive Semantic Diffusion for Large Scale Context-Based Video Annotation
Summary
Figure 1: Illustration of context-based video annotation. (a) Top 5 video shots of the concept *desert* according to annotation scores from an existing pre-trained detector, in which semantic context was not considered. (b) Shot list refined by semantic diffusion. The subgraph on top shows two concepts highly correlated with *desert*; line width indicates graph edge weight. (c) Subgraph and shot list refined by the proposed domain adaptive semantic diffusion (DASD). The graph adaptation process in DASD refines the concept relationships, which in turn further improves annotation accuracy.
Approach
The semantic graph is characterized by the relationships between concepts, i.e., the affinity matrix W. We estimate the concept relationships using the TRECVID 2005 development set X and its corresponding label matrix Y, where y_ij = 1 denotes the presence of concept c_i in sample x_j and y_ij = 0 otherwise. The pairwise concept affinity matrix W is computed as the Pearson product-moment correlation of the row vectors of Y.
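As a concrete illustration, this affinity estimation amounts to row-wise correlation of the binary label matrix. Below is a minimal Python/NumPy sketch; the clipping of negative correlations to zero (to keep W a valid non-negative affinity) and the zeroed diagonal are assumptions of this sketch, not details stated above.

```python
import numpy as np

def concept_affinity(Y):
    """Pairwise concept affinities from a binary label matrix Y (m concepts x n samples).

    W[i, j] is the Pearson product-moment correlation between the label
    vectors (rows) of concepts c_i and c_j.
    """
    W = np.corrcoef(Y)            # m x m correlation of row vectors
    np.fill_diagonal(W, 0.0)      # assumption: no self-loops in the semantic graph
    return np.clip(W, 0.0, None)  # assumption: drop negative correlations

# Toy example: 3 concepts labeled over 6 samples
Y = np.array([[1, 1, 0, 0, 1, 0],
              [1, 1, 0, 0, 0, 0],
              [0, 0, 1, 1, 0, 1]], dtype=float)
print(concept_affinity(Y))
```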
Let g(c_i) denote the vector of annotation scores of concept c_i over a test set. Intuitively, the score vectors g(c_i) and g(c_j) should be consistent with the affinity W_ij between concepts c_i and c_j; in other words, strongly correlated concepts should receive similar annotation scores. Motivated by this semantic consistency, we formulate the problem as a graph diffusion process and define a cost function on the semantic graph:
$$\varepsilon(g, W) \;=\; \frac{1}{2}\sum_{i,j=1}^{m} W_{ij}\left\| \frac{g(c_i)}{\sqrt{d_i}} - \frac{g(c_j)}{\sqrt{d_j}} \right\|^2, \qquad d_i = \sum_{j=1}^{m} W_{ij},$$

where m is the number of concepts and d_i is the degree of node c_i.
This cost function evaluates the smoothness of the function g over the semantic graph: reducing its value makes the annotation results more consistent with the concept relationships. Specifically, as shown in the following updates, our objective is to reduce ε by updating g and W iteratively. Modifying g makes the scores more consistent with the concept affinities, while refining W adapts the semantic context to the target domain:
$$g_{t+1} = g_t - \eta_g \left.\frac{\partial \varepsilon}{\partial g}\right|_{g_t}, \qquad W_{t+1} = W_t - \eta_W \left.\frac{\partial \varepsilon}{\partial W}\right|_{W_t},$$

where η_g and η_W are the step sizes of the two gradient-descent updates.
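For concreteness, the sketch below realizes one alternating update in Python/NumPy under the smoothness cost above. The fixed step sizes eta_g and eta_w, the clipping of W to non-negative values, and holding the degrees d fixed when differentiating with respect to W are all simplifying assumptions of this sketch; the exact update rules are given in the papers listed below.

```python
import numpy as np

def dasd_step(g, W, eta_g=0.1, eta_w=0.01):
    """One alternating update of scores g (m x n) and affinities W (m x m)."""
    d = W.sum(axis=1) + 1e-12          # node degrees d_i
    S = W / np.sqrt(np.outer(d, d))    # normalized affinity D^-1/2 W D^-1/2

    # Score update: grad_g(eps) = (I - S) g, so g moves toward its diffused version S g.
    g_new = g - eta_g * (g - S @ g)

    # Graph adaptation: d(eps)/dW_ij = 1/2 * ||g_i/sqrt(d_i) - g_j/sqrt(d_j)||^2
    # (treating the degrees d as fixed -- an approximation made in this sketch).
    gn = g / np.sqrt(d)[:, None]
    sqnorm = (gn * gn).sum(axis=1)
    dist2 = sqnorm[:, None] + sqnorm[None, :] - 2.0 * gn @ gn.T
    W_new = np.clip(W - 0.5 * eta_w * dist2, 0.0, None)  # keep affinities non-negative
    np.fill_diagonal(W_new, 0.0)
    return g_new, W_new
```

Here g holds the m x n matrix of detector scores over the test shots. Iterating dasd_step until ε stops decreasing corresponds to the diffusion process, and setting eta_w = 0 recovers SD without graph adaptation.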
Experiments and Results
We evaluate the approach on the TRECVID 2005-2007 video data sets, which were used in the annual TRECVID benchmark evaluations organized by NIST and contain 340 hours of video in total. The videos are partitioned into shots, and one or more representative keyframes are extracted from each shot. The 2005 and 2006 videos are broadcast news from various TV programs in English, Chinese and Arabic, while the 2007 data set consists mainly of documentary videos in Dutch.
We use the publicly available VIREO-374 detectors as the baseline, which include SVM models for 374 LSCOM concepts and have achieved top performance in TRECVID evaluations. A semantic graph with 374 nodes is built using manual labels on the TRECVID 2005 development set. For TRECVID 2005, we adopt the development set, partitioned into training, validation and test subsets, as our target database and report performance on 39 concepts. For TRECVID 2006 and 2007, we report performance on the 20 officially evaluated concepts over each year's test set.
Table 1 shows the results achieved by the VIREO-374 baseline, semantic diffusion (SD, without graph adaptation), and DASD. With SD, the performance gain ranges from 11.8% to 15.6%. DASD further boosts performance on the TRECVID 2006 and 2007 data sets; there is no improvement from DASD on TRECVID 2005, since the graph was built from manual labels on that same data set. These results confirm the effectiveness of formulating video annotation refinement as a graph diffusion process. Figure 2 demonstrates the adaptation process on a fraction of the semantic graph, and Figure 3 gives per-concept performance for the 20 evaluated concepts in TRECVID 2006; our approach consistently improves all of them.
DASD is highly efficient: its complexity is O(mn), where m is the number of concepts and n is the number of test video shots/frames. On the TRECVID 2006 data set, which contains 79,484 video shots, DASD finishes in just 165 seconds; in other words, running DASD over all 374 concepts takes only about 2 milliseconds per video shot.
Table 1: Overall performance gain (relative improvement) on the TRECVID 2005-2007 data sets. SD: semantic diffusion; DASD: domain adaptive semantic diffusion.
Figure 2: A fraction of the semantic graph. The animation shows the adaptation of concept affinities from broadcast news videos to documentary videos.
Figure 3: Per-concept performance before and after semantic diffusion on the TRECVID 2006 test set. Consistent improvements are observed for all of the 20 semantic concepts.
References

Yu-Gang Jiang, Qi Dai, Jun Wang, Chong-Wah Ngo, Xiangyang Xue, Shih-Fu Chang. Fast Semantic Diffusion for Large Scale Context-Based Image and Video Annotation. IEEE Transactions on Image Processing, 2012. [pdf]

Yu-Gang Jiang, Jun Wang, Shih-Fu Chang, Chong-Wah Ngo. Domain Adaptive Semantic Diffusion for Large Scale Context-Based Video Annotation. In IEEE International Conference on Computer Vision (ICCV), Kyoto, Japan, September 2009. [pdf]