|
[ This page is under construction!! ]
PART I:
REAL-WORLD MULTIMODALITY LEARNING MACHINES
We are investigating the theory, algorithm, and system issues involved in
constructing automatic cognitive learning machines. With the recent success
of machine learning algorithms, many traditional ideas about system design
can be re-examined through machine learning approaches. Under a machine
learning infrastructure, system designers no longer assign rules; instead,
they design algorithms that allow systems to learn to solve problems by
themselves. For instance, this approach has significantly increased the
number of concepts that machines can understand. Previously, researchers
worked for decades to model a few concepts (e.g., faces, cars, and people).
With machine learning approaches, however, the number of concept detectors
has grown into the hundreds, with accuracy competitive with or better than
prior methods. To increase machines' problem-solving capability, we propose
to create a new kind of learning system. These learning machines
automatically capture data from multimodal sensing sources (audio, visual,
text, etc.), perform recognition, and then learn to infer more complicated
concepts and knowledge. The learning system can also draw on existing
knowledge bases, e.g., electronic dictionaries, to extend and connect
concepts.
(1) Autonomous Learning (with Xiaodan Song and Ming-Ting Sun)
(2) Smart Semantic Video Camera (with Victor Sutan and Jason Cardillo)
(3) Imperfect and Continuous Learning (with Xiaodan Song, Panda Navneet and Gang Wu)
(4) Multimedia Semantic Analysis
Multimedia Semantic Concept Analysis
My research objectives are:
(1) Object Detection - Robust and accurate detection, location, and counting of specific objects;
(2) Object Recognition - Determine the specific instance of an object class (e.g., person);
(3) Event Understanding - Form inferences from occurrences or recurrences of activity;
(4) Multi-Modal Fusion - Combine multiple sources to maximize the salient information that can be extracted from the video;
(5) Video Query by Example - Retrieve information through database inquiry using a sequence of video or content descriptors;
(6) Video Summary - Methods to reduce information representation, and scenario-based activity summarization;
(7) Multi-Modal Video Mining - Automatically discover trends, patterns, and associations in video;
(8) Object Tracking - Determine the path of a known object within a video sequence;
(9) Motion Analysis - Quantify the movement of objects or phenomena in a video sequence; and
(10) Kinematics Analysis - Identify an object or phenomenon by its motion.
Our objectives were to address the challenging problem of fully automatic
indexing and retrieval of unstructured video content; to engage the research
and industry community in establishing a benchmark for video content
retrieval; to participate in that benchmark and leverage it to advance video
content retrieval technology; and to establish IBM Research as a premier
thought leader in multimedia indexing and semantic understanding.
Our effort resulted in the following accomplishments. We helped form the
TREC video retrieval benchmark and its tasks, and we have participated in
the benchmark since its establishment in 2001. We took the leadership role
in establishing the "concept detection" task within the TREC video retrieval
benchmark: IBM proposed the idea to NIST in Nov. 2001 and followed through
by leading the effort to design the benchmark and test methodology and to
choose the concepts for detection. We also took the leadership role in
establishing the "MPEG-7 concept/transcript/shot exchange" task of the
TREC-2002 benchmark, with the goal of accelerating technological advancement
by allowing different participants to focus on different aspects of
multimedia indexing problems. In 2003, we initiated and organized a
collaborative video annotation forum, in which we worked jointly with
colleagues in 23 groups to build ground-truth labels on 62 hours of video.
Nearly 500K labels (after hierarchical propagation) were annotated on 45K
shots. These ground-truth labels have been widely used for video semantic
concept training and system evaluation.
(Collaborators: Belle L. Tseng, Milind Naphade, Apostol Natsev, John R. Smith)
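
The annotation counts above are reported "after hierarchical propagation". This page does not describe how that propagation worked, so the following is only a minimal sketch under one common assumption: when a shot is labeled with a concept, it also receives every ancestor of that concept in the concept hierarchy. The hierarchy, shot names, and function names below are hypothetical and are not taken from the annotation forum's actual lexicon or tools.

```python
# Hypothetical concept hierarchy (child -> parent), for illustration only.
PARENT = {
    "car": "vehicle",
    "vehicle": "object",
    "face": "person",
    "person": "object",
}


def ancestors(concept, parent=PARENT):
    """Return all ancestors of a concept, nearest first."""
    result = []
    while concept in parent:
        concept = parent[concept]
        result.append(concept)
    return result


def propagate(shot_labels, parent=PARENT):
    """Add every ancestor of each annotated concept to the shot's label set."""
    propagated = {}
    for shot, labels in shot_labels.items():
        expanded = set(labels)
        for label in labels:
            expanded.update(ancestors(label, parent))
        propagated[shot] = expanded
    return propagated


# Assumed example annotations: three manual labels on two shots.
annotations = {"shot_001": {"car"}, "shot_002": {"face", "car"}}
result = propagate(annotations)
total = sum(len(labels) for labels in result.values())
print(result)
print("labels after propagation:", total)
```

Under this assumption, a small number of manual labels expands into a larger label set, which is one way a label total can exceed the number of annotations entered by hand.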
Last Updated:
01/24/2006
|