Concept-Based Video Search with Flickr Context Similarity
Figure 1: Concept-based video search framework. The semantic concept detectors serve as an intermediate layer to bridge the semantic gap. The proposed Flickr context similarity is used (in the knowledge source layer) to select suitable concept detectors for each textual user query given on-the-fly.
However, due to the lack of manually labeled training samples and the
limitation of computational resources, the number of available concept
detectors to date remains in the scale of hundreds, which is much
smaller compared to the size of human vocabulary. Therefore, one open
issue underlying this search methodology is the selection of appropriate
detectors for the queries, especially when direct matching of words
fails. For example, given the query 'find video shots of something
burning with flames visible', explosion_fire and smoke
are probably suitable detectors. Unlike most existing work, in
which semantic reasoning techniques based on the WordNet ontology were
used for detector selection, here we explore context information
associated with Flickr images for better query-detector similarity
estimation. This measurement, named Flickr context similarity (FCS), is
grounded on the co-occurrence statistics of two words in the context of
images (e.g., tags, titles, and descriptions), and is therefore able to
capture word co-occurrence in image context rather than in a textual corpus. This
advantage of FCS enables a more appropriate selection of detectors for
searching image and video data. For example, the two words 'bridge' and
'stadium' have high semantic relatedness in WordNet, since both of
them are very close to a common ancestor, 'construction', in the
WordNet hierarchy. However, when a user issues the query 'find shots of
a bridge', 'stadium' is obviously not a helpful detector since
it rarely co-occurs with 'bridge' in images/videos. For the
same query, our proposed FCS is able to suggest a more suitable detector.
Flickr Context Similarity
Given two words x and y, we query Flickr images (via its API) with each word separately and with both words jointly (x AND y). Let f(x) denote the total number of Flickr images containing word x in their context (e.g., title, tags, and descriptions). We first compute the distance between the two words following the definition of the normalized Google distance (Cilibrasi and Vitanyi, IEEE TKDE 2007):
$$\mathrm{NGD}(x, y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x, y)}{\log N - \min\{\log f(x), \log f(y)\}}$$
where f(x,y) is the number of images whose context contains both x and y, and N is the total number of images on Flickr (roughly estimated at 3.5 billion at the time of our experiments). Note that we search against all kinds of Flickr image context, not just the tags (see Figure 2).
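The normalized Google distance of Cilibrasi and Vitanyi can be computed directly from the three counts and N. The following is a minimal sketch rather than the code used in the experiments: the function name `ngd` and the handling of zero counts (returning an infinite distance) are assumptions made for illustration.

```python
import math


def ngd(fx: float, fy: float, fxy: float, n: float) -> float:
    """Normalized Google distance computed from raw image counts.

    fx, fy: number of images whose context contains x (resp. y)
    fxy:    number of images whose context contains both words
    n:      total number of images in the collection
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Assumption of this sketch: words that never occur, or never
        # co-occur, are treated as infinitely distant.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    denominator = math.log(n) - min(log_fx, log_fy)
    if denominator == 0:
        # Both words appear in every image, so they always co-occur.
        return 0.0
    return (max(log_fx, log_fy) - log_fxy) / denominator
```

Two words that always appear together get distance 0, and the distance grows as the joint count f(x,y) shrinks relative to the individual counts.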
The word distance is then converted to a similarity value by a Gaussian kernel:
where ρ is a kernel width parameter.
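A short sketch of the distance-to-similarity conversion follows. The exact kernel expression is not reproduced in the text here, so the form exp(-d/ρ) used below is an assumption; a squared-distance variant, exp(-d²/ρ), would be an equally plausible reading of "Gaussian kernel". The function name `fcs` is likewise chosen for illustration.

```python
import math


def fcs(distance: float, rho: float) -> float:
    """Map a non-negative word distance to a similarity in [0, 1].

    Assumed kernel form: exp(-distance / rho). The kernel width rho
    controls how quickly similarity decays as the distance grows.
    """
    if rho <= 0:
        raise ValueError("rho must be positive")
    return math.exp(-distance / rho)
```

With this form, a distance of 0 maps to similarity 1, and an infinite distance (words that never co-occur) maps to similarity 0. A larger ρ flattens the curve, so that moderately distant words still receive non-negligible similarity.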
Figure 2: Left. Rich context information associated with a Flickr image. Right. The total number of images returned using keyword-based search in Flickr image context.
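Counts such as those in Figure 2 could be obtained through the Flickr REST API. The sketch below assumes the `flickr.photos.search` method, whose `text` parameter performs a free-text search over title, description, and tags, and whose JSON response reports the number of matches in a `total` field. The helper names, the placeholder API key, and the choice of HTTPS endpoint are assumptions of this sketch; it only builds the request URL and parses a response body, and leaves the HTTP call itself to the caller.

```python
import json
from urllib.parse import urlencode

FLICKR_REST_ENDPOINT = "https://api.flickr.com/services/rest/"


def build_count_url(text: str, api_key: str) -> str:
    """Build a search URL whose response reports the match count."""
    params = {
        "method": "flickr.photos.search",
        "api_key": api_key,
        "text": text,          # free-text search over image context
        "per_page": 1,         # only the total is needed, not the photos
        "format": "json",
        "nojsoncallback": 1,   # ask for plain JSON instead of JSONP
    }
    return FLICKR_REST_ENDPOINT + "?" + urlencode(params)


def parse_total(response_body: str) -> int:
    """Extract the number of matching images from a JSON response."""
    payload = json.loads(response_body)
    if payload.get("stat") != "ok":
        raise RuntimeError("Flickr API error: %s" % payload.get("message"))
    # The total may arrive as a string, so convert it explicitly.
    return int(payload["photos"]["total"])
```

Calling `build_count_url` once for x, once for y, and once for the joint query yields the three requests needed for f(x), f(y), and f(x,y). How the joint query string should be phrased is left to the caller.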
We evaluate FCS using 100+ queries on the TRECVID video data sets. Table 1 shows the selected detectors for a few example queries, where our FCS is able to select more suitable detectors. For example, FCS selects 'Railroad' for the query term 'train', while WUP and NGD select 'Vehicle' and 'Car' respectively. Obviously 'Railroad' is a better detector for searching 'train' since they frequently co-occur with each other.
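One simple way to turn a query-detector similarity into a selection rule is to rank all available detectors by their similarity to the query word and keep the top few. The sketch below illustrates this under stated assumptions: `select_detectors` and `toy_similarity` are hypothetical names, and the scores in `TOY_SCORES` are made-up illustrative values, not measurements from the experiments.

```python
from typing import Callable, Iterable, List, Tuple


def select_detectors(
    query_word: str,
    detectors: Iterable[str],
    similarity: Callable[[str, str], float],
    top_k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the top_k detectors most similar to the query word."""
    scored = [(name, similarity(query_word, name)) for name in detectors]
    # Highest similarity first; ties keep the input order (stable sort).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up similarity scores, for illustration only.
TOY_SCORES = {
    ("train", "Railroad"): 0.8,
    ("train", "Vehicle"): 0.5,
    ("train", "Car"): 0.4,
}


def toy_similarity(query_word: str, detector: str) -> float:
    """Look up a toy score; unknown pairs get similarity 0."""
    return TOY_SCORES.get((query_word, detector), 0.0)
```

For instance, `select_detectors("train", ["Vehicle", "Railroad", "Car"], toy_similarity)` ranks 'Railroad' first under these toy scores. Any similarity function with the same signature, such as one built on Flickr counts, could be passed in place of `toy_similarity`.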
Figure 3 further compares FCS with other query-detector measurements based on WordNet and Web documents (NGD computed based on results from Yahoo web search API). FCS consistently shows better performance in terms of mean average precision over the queries evaluated in TRECVID 2005-2008.
Table 1: Detector selection using various query-detector similarity measurements, including WUP based on WordNet hierarchy, NGD based on textual Web documents (via Yahoo search API), and our FCS based on Flickr context. The detectors are selected according to the query words shown in bold ('goal', 'flames', 'scenes', and 'train' respectively from each of the queries).
Figure 3: Search performance comparison using various query-detector similarity measurements. Resnik, JCN, WUP, and Lesk are all based on the WordNet lexicon/hierarchy. Results are shown in terms of mean average precision over the queries used in the official TRECVID evaluations 2005-2008 (TV05-08).
Yu-Gang Jiang, Chong-Wah Ngo, Shih-Fu Chang, Semantic Context Transfer across Heterogeneous Sources for Domain Adaptive Video Search, ACM Multimedia (ACM MM), Beijing, China, October 2009.
Last updated: Oct. 30th, 2009.