Semantic concept detection is an active research topic as it can provide
semantic filters and aid in automatic search of image and video databases. The
annual NIST TRECVID video retrieval benchmarking event has greatly contributed
to this area by providing benchmark datasets and performing system evaluation.
As acquiring ground-truth labels for semantic concepts is time-consuming, only
10-20 concepts were selected for evaluation each year in the TRECVID event. This
is insufficient for general video retrieval tasks, for which most researchers
believe that hundreds or thousands of concepts would be more appropriate. In
light of this, several efforts have developed and released annotation data for
hundreds of concepts, such as
LSCOM.
Although the annotations are publicly available, building detectors for hundreds
of concepts is complicated and time-consuming. To stimulate innovation of new
techniques and reduce the effort in replicating similar methods, several
groups have developed and released large-scale concept detectors,
including
Mediamill-101,
Columbia374, and
VIREO374. Mediamill-101 includes 101 detectors for the TRECVID 2005/2006
datasets, together with ground-truth labels, features, and detection scores.
Columbia374 and VIREO374 released detectors for a larger set of 374 semantic
concepts selected from the LSCOM ontology. Columbia374 employed a simple and
efficient baseline method using three types of global features. VIREO374
adopted a similar framework, but with an emphasis on the use of local keypoint
features.
While keypoint features describe the local structures in an image and do not
contain any color information, global features are statistics about the overall
distribution of color, texture, or edge information in an image. Hence, we
expect these two types of features to be complementary for semantic concept
detection, which requires either global color information (e.g., for concepts
water, desert), or local structure information
(e.g., for US-flag, car), or both (e.g., for mountain). It
is interesting not only to compare the performance of various features, but also
to see whether their combination further improves the performance. As
Columbia374 and VIREO374 work on the same set of concepts, we unify the
output formats and fuse the detection scores of both detector sets.
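As a rough illustration of such score-level fusion, the sketch below averages two detectors' per-concept scores after min-max normalization. This averaging scheme and the sample numbers are placeholders for illustration only; the actual fusion method used for the released scores is the one described in our technical report.

```python
# Illustrative score-level fusion for two detector sets that share the
# same concept vocabulary (e.g., Columbia374 and VIREO374). The scores
# are first rescaled to a common range, then averaged per shot.

def min_max_normalize(scores):
    """Rescale a list of detection scores to [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]

def fuse(scores_a, scores_b, weight=0.5):
    """Weighted average of two normalized score lists (same shot order)."""
    a = min_max_normalize(scores_a)
    b = min_max_normalize(scores_b)
    return [weight * x + (1 - weight) * y for x, y in zip(a, b)]

# Hypothetical scores for one concept over four shots from each detector
# set; the second list mimics raw SVM outputs on a different scale.
columbia_scores = [0.9, 0.2, 0.5, 0.7]
vireo_scores = [12.0, 3.0, 8.0, 10.0]
print(fuse(columbia_scores, vireo_scores))
```

Normalizing before averaging matters because the two detector sets may output scores on very different scales, which would otherwise let one set dominate the fused ranking.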
With the goal of stimulating innovation in concept detection and providing
better large-scale concept detectors for video search, we are releasing the
fused detection scores on TRECVID datasets to the multimedia community.
Details about the fusion method, performance comparisons, and data format can be
found in
our technical report.
note09: in addition to the detection scores generated from the old models trained on TRECVID 2005 development data,
the 2009 release also includes detection scores for the 20 concepts announced in TRECVID 2009,
generated from new models trained on TRECVID 2009 development data using a similar method.
note10: the 2010 release is based on models re-trained on the TRECVID 2010 development
set, using the basic fusion method described in our 2008 technical report. It uses multiple bag-of-visual-words local features
computed from various spatial partitions, and also incorporates
the DASD algorithm
to exploit concept relationships (context) for improved detection. For general usage of these new scores, please cite
the 2008 technical report:
Y.-G. Jiang, A. Yanagawa, S.-F. Chang, C.-W. Ngo,
"CU-VIREO374: Fusing Columbia374 and VIREO374 for Large Scale Semantic Concept Detection",
Columbia University ADVENT Technical Report #223-2008-1, Aug. 2008.
[pdf & bibtex]
When citing the contextual diffusion algorithm DASD itself, please use the following ICCV '09 paper:
Y.-G. Jiang, J. Wang, S.-F. Chang, C.-W. Ngo,
"Domain Adaptive Semantic Diffusion for Large Scale Context-Based Video Annotation",
International Conference on Computer Vision (ICCV), Kyoto, Japan, September 2009.
[pdf & bibtex]
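For readers unfamiliar with the bag-of-visual-words features with spatial partitions mentioned in note10, the sketch below shows the general idea: split the image into a grid, build a visual-word histogram per cell, and concatenate the cell histograms. The codebook size and grid layout here are illustrative placeholders, not the configuration used for the released scores.

```python
# Illustrative bag-of-visual-words histogram with a spatial grid.
# word_ids[i] is the codeword index of keypoint i located at
# positions[i] = (x, y); the image is width x height pixels.

def bovw_histogram(word_ids, positions, width, height,
                   vocab_size=4, grid=(2, 2)):
    """Concatenate per-cell visual-word histograms over a gx-by-gy grid."""
    gx, gy = grid
    hist = [0] * (gx * gy * vocab_size)
    for w, (x, y) in zip(word_ids, positions):
        cx = min(int(x * gx / width), gx - 1)   # grid column of keypoint
        cy = min(int(y * gy / height), gy - 1)  # grid row of keypoint
        cell = cy * gx + cx
        hist[cell * vocab_size + w] += 1
    return hist

# Two keypoints: word 1 in the top-left cell, word 3 in the bottom-right.
print(bovw_histogram([1, 3], [(10, 10), (90, 90)], 100, 100))
```

The spatial partitions let the representation retain coarse layout information (e.g., sky words concentrated in upper cells) that a single global histogram would discard.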
For problems or questions regarding this download site, please contact
Yu-Gang Jiang. Last updated: Aug
10, 2010