|
|
| Speaker |
Professor Marek Domanski, Poznan University, Poland |
| Title |
Spatio-temporal scalability in DCT-based hybrid video
coders |
| Abstract |
|
This talk presents a generic multi-loop coder structure suitable
for mixed spatial and temporal scalability combined with fine
granular SNR scalability. The structure is applicable to various
hybrid video coders such as MPEG-2, H.263 and AVC (JVT/H.264).
The idea of mixed spatial and temporal scalability, i.e.
spatio-temporal scalability, is central to the proposal. Its
application improves scalable coding efficiency, i.e. it decreases
the scalability overhead. The coder consists of independently
motion-compensated sub-coders that produce bitstreams corresponding
to individual levels of spatio-temporal resolution. The
bitrate can be smoothly matched to the particular channel bandwidth
by use of data partitioning, which is associated with drift errors
in the decoder. Accumulation and propagation of these errors can
be bounded by a suitable group-of-pictures structure.
|
| Speaker |
Masaki Miura, Fujitsu, Japan |
| Title |
Network and Video Application |
| Abstract |
|
I'd like to talk about the business of our department and my
research topic at the Friday seminar.
First, I'm going to introduce our business. As I mailed you before,
the key word of our business is "Network and Video Application".
The three bases are video transmission, video compression and
video processing. Our products have been developed based on the
key words. I'm going to introduce them and show you a demo. Then
I'm going to talk about my research topic at DVMM, "Event
Detection and Event-based Bursty Transmission". It aims to
realize a large scale video surveillance system. The concept has
an analogy with a DNA microarray. Based on a lot of events (for
example, detected by IP-700s), a server learns patterns of events,
finds hidden information among them, predicts the future state,
assigns proper bandwidth and controls traffic from video encoders.
Finally, if I have some spare time, I'd like to show you some
pictures of Japan and a Japanese character table (especially for
Shahram and Alex).
|
| Speaker |
Ryoma Oami, NEC, Japan |
| Title |
My Research in NEC |
| Abstract |
|
In this seminar I'll talk about my work at NEC, especially three
topics: DCT-based lossless video coding, watermark robustness
calculation against attacks, and low bit-rate multiple object
coding. The first topic is
lossless video coding based on the DCT. It is compatible with
ordinary DCT-based lossy coding, such as MPEG-2. The next topic
is a measure for quantitatively comparing the robustness of different
watermarking algorithms, which has been difficult so far. The final
topic includes a background coding algorithm with adaptive resolution
control and quantization control based on the distance to the
object boundaries, and a bit allocation method for multiple objects
based on the prediction of their area and complexity variations.
I hope these topics are interesting to you.
|
| Speaker |
Yong Wang, PhD student, DVMM group |
| Title |
Content-based Utility Function Prediction |
| Abstract |
|
In this talk I will give a brief introduction to my recent
work on content-based utility function prediction, which is part
of the UMA project. In our application scenario of online media
transcoding, the utility function (UF) describes the relationship
between the resource limitation (generally the network bandwidth)
and the utility value (generally a video quality measure such as
PSNR), and is defined using a set of transcoding operators. The UF
is an efficient way to guide transcoding in the compressed
domain. The motivation for utility function prediction comes from
the desire for instant, real-time media processing. I will explain
some of the aspects involved, such as the generation of the UF,
video content feature and UF feature extraction, unsupervised UF
clustering, and content-feature-based prediction. This is ongoing
work, and some preliminary results will be presented.
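As a rough illustration of the utility value mentioned in this abstract, PSNR between an original and a transcoded frame can be computed as sketched below. The function name `psnr`, the 8-bit peak value of 255 and the flat pixel lists are assumptions made for this example; they are not taken from the UMA project.

```python
import math


def psnr(original, degraded, peak=255.0):
    """Return the PSNR in dB between two equal-length pixel sequences.

    `peak` is the maximum possible pixel value (255 assumes 8-bit samples).
    """
    if len(original) != len(degraded) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the two frames.
    mse = sum((a - b) ** 2 for a, b in zip(original, degraded)) / len(original)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)
```

Evaluating such a measure at several target bandwidths gives the sample points of one possible utility curve.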
|
| Speaker |
Anthony Vetro, MERL - Mitsubishi Electric Research
Labs, Murray Hill, NJ |
| Title |
MPEG-2 to MPEG-4 Transcoding with Reduced Resolution |
| Abstract |
|
Recent advances in signal processing combined with an increase
in network capacity are paving the way for users to enjoy services
wherever they go and on a host of multimedia capable devices.
Each of these terminals may support a variety of different formats.
Furthermore, the networks they are connected to are often characterized
by different network conditions, and the terminals themselves
vary in display capabilities, processing power and memory capacity.
Given such a dynamic environment, it becomes necessary to consider
methods of adapting the content accordingly.
This talk focuses on the general problem of reduced-resolution
transcoding, and more specifically on the conversion between MPEG-2
and MPEG-4. This technology enables broadcast-quality video streams
to be transmitted, decoded and displayed on low-cost mobile devices.
Technical topics include:
- analysis of drift errors when transcoding to a lower spatial
resolution
- presentation of various architectures to overcome sources of
drift
- macroblock-level conversions, e.g., MV mapping, texture down-sampling
- rate control and bit allocation issues
- evaluation of complexity and quality
A live demo will also be shown.
|
| Biography |
|
Anthony Vetro is with Mitsubishi
Electric Research Labs in Murray Hill, NJ, where he is currently a
Senior Principal
Member of the Technical Staff. He received the PhD degree in Electrical
Engineering from Polytechnic University
in Brooklyn, NY, and his main research interests are in the areas of
video coding and transmission, with emphasis on content scaling and
rate allocation. He has published a number of papers in these areas
and has been an active
participant in MPEG standards for several years, where he is now serving
as Editor for MPEG-21 Part 7, Digital Item
Adaptation. |
| Speaker |
Danny Hong, PhD student, MMSP group |
| Title |
Flavor: A Language for Media Representation |
| Abstract |
|
Flavor has been created as a language for describing coded multimedia
bitstreams in a formal way so that the bitstream parsing/generating
code can be automatically produced. For this, Flavor comes with
a software
tool that translates Flavor description into C++ or Java code.
Since Version 5.0, the Flavor translator has been enhanced (the
enhanced translator is called XFlavor) so that XML features are
supported. XFlavor has the capability to transform Flavor description
into XML schema and it can also produce code for generating XML
documents corresponding to the bitstreams described by Flavor.
As a part of XFlavor, a compression tool for converting the XML
representation back into the original bitstream format is provided
as well.
In summary Flavor simplifies and speeds up the development of
software that processes coded multimedia information by providing
the necessary code for parsing and generating bitstreams. XFlavor
takes an alternate approach. Rather than providing the code for
accessing bitstreams, the bitstreams are transformed into XML
documents for easier processing.
First, I'll give a detailed overview of the Flavor language and
its translator, and then I'll talk about XFlavor.
|
| Speaker |
Dr. Aleksandra Mojsilovic, IBM T.J. Watson Research
Center |
| Title |
Semantic Based Image Modelling and Retrieval |
| Abstract |
|
This talk will cover some of our recent work in image semantic
modeling and retrieval.
In order to design more satisfying image navigation systems,
we need tools to construct a "semantic bridge" between a user
and an image database. We have recently developed a novel image indexing
scheme and query language, which allow the user to introduce a cognitive
dimension to the search. At an abstract level, this approach consists
of:
1) learning the "natural language" that humans speak
to communicate their semantic experience of images,
2) understanding the relationships between this language and objective
measurable image attributes, and,
3) developing the corresponding feature extraction schemes.
We have conducted several subjective experiments in which we
asked human subjects to group images, and then explain verbally
why they did so. The results of this study indicated that a part
of the abstraction involved in image interpretation is often driven
by semantic categories, which can be broken into more tangible
semantic entities, i.e. objective semantic indicators. By analyzing
our experimental data, we have identified some candidate semantic
categories (e.g. portraits, people, crowds, cityscapes, landscapes)
and their underlying semantic indicators (e.g. skin, sky,
water, object). These experiments also helped us derive
important low-level image descriptors, accounting for our perception
of these indicators. We have then used these findings to develop
an image feature extraction and indexing scheme. In particular,
our feature set has been carefully designed to match the way humans
communicate image meaning. This led us to the development of a
"semantic-friendly" query language for browsing and
searching diverse collections of images. We have implemented our
algorithm in two Internet search engines, ISee (photographic images)
and ILive (medical images). ISee incorporates an image robot,
the proposed indexing scheme, and a web browser to search and browse
the Internet using visual attributes. ILive is a search engine
for medical applications that uses the proposed semantic based
image features to perform automatic categorization of medical
images into different imaging modalities.
|
| Biography |
|
Aleksandra (Saska) Mojsilovic was
born in Belgrade, Yugoslavia, in 1968. She received her BSEE, MSEE
and Ph. D. degrees from the Department of Electrical Engineering,
University of Belgrade, Belgrade, Yugoslavia,
in 1992, 1994 and 1997, respectively. From 1994 to 1998 she was a
member of academic staff at the University of Belgrade, Department
of Electrical Engineering. From 1998 to 2000 she was with Bell Laboratories,
Lucent Technologies, Murray Hill, New Jersey. Aleksandra Mojsilovic
is currently a research staff member at the IBM T. J. Watson Research
Center, Hawthorne, New York. Her main research interests include computer
vision, image processing, multimedia, multidimensional signal processing,
medical imaging and human perception. In 2001, she
received the Young Author Best Paper Award from the IEEE Signal Processing
Society for her paper on image retrieval with Jelena Kovacevic, Jianying
Hu, Robert Safranek and Kicha Ganapathy. Dr. Mojsilovic is a member
of the IEEE Signal Processing Society and currently serves as an Associate
Editor for the IEEE Transactions on
Image Processing. |
| Speaker |
Rob Turetsky, PhD student, LabROSA |
| Title |
Bridging the Gap: Aligning Songs with Transcriptions for Musical
Structure Discovery
|
| Abstract |
|
Musical structure is present at many levels of abstraction, from
genre, song, movement and phrase, down to the raw signal. The
ability to automatically extract musical structure at these levels
would have many applications, including the semantic indexing
of personal digital music libraries, audio browsing/skimming and
the creation of high-level musical models for use in algorithms
such as pitch extraction. Most work on the general problem of
analysis of musical signals operates either on raw audio data
or some type of transcription (MIDI, Humdrum, etc). While transcriptions
allow musicologists to locate recurring patterns in phrases,
transitions and key/scale, the vast majority of music does not
exist in transcribed form. By focusing on the raw audio data of
a musical performance (CD, .mp3, etc), engineers have access to
signal-level characteristics, but automatic transcription of
real-world audio is an extremely difficult, if not impossible,
problem to solve.
We present a simple approach to bridging this gap by aligning
available MIDI transcriptions with their corresponding performance.
This will allow the creation of more advanced musical models with
access to both note- and signal-level data. We are currently using
these alignments at the signal level to create a "ground
truth" database of labeled segments of recorded audio for
automatic transcription. At higher levels, we envision training
these models, for example, to recognize certain features of a
genre (such as repetition, ABA structure). We can then use the
global model to predict musical behavior at the local level. Also
described are ongoing efforts in musical structure discovery at
specific levels of abstraction, including a system being developed
to retrieve alternate performances and cover versions of a song.
|
| Speaker |
Winston Hsu, PhD student, DVMM group |
| Title |
Two approaches toward news story segmentation
|
| Abstract |
|
This talk presents our investigation of news story segmentation.
Two approaches under different constraints are proposed. The first
guarantees real-time processing with fair performance
on resource-limited devices, such as set-top boxes; the other
is an ongoing statistical model fusing mid-level "perceptual"
features (except closed captions).
The first method is based on anchorperson detection, where a skin-tone
detector is applied to mark prospective face regions. Unsupervised
clustering, measured with color histograms on regions of interest
extended from the face regions, is employed to distill the actual anchor
shots. Speaker identification and music/speech discrimination
are then applied to smooth the clustering results.
The second approach uses a weighted, exponentially linear
family of off-the-shelf features to model story boundary
probability, estimated by measuring the Kullback-Leibler
divergence from empirical news video corpora. Moreover, rather
than depending on heuristic rules or a specific inference graph,
a feature induction procedure, which incrementally includes the most
salient features, is applied to the "perceptual" feature
candidates. A dynamic programming approach is then invoked to
locate story segments. We also hope to expand this probabilistic
framework to incorporate further semantic features (e.g. cue words
surrounding story boundaries) to tackle unsolved segmentation problems,
such as multiple stories within an anchor shot.
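For readers unfamiliar with the divergence measure named in this abstract, the sketch below computes the Kullback-Leibler divergence between two discrete distributions. It only illustrates the measure itself; the function name and the list-based representation are assumptions for this example, and the speaker's estimation procedure is not reproduced here.

```python
import math


def kl_divergence(p, q):
    """Return D(p || q) in nats for two discrete distributions.

    `p` and `q` are sequences of probabilities over the same outcomes.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # the term 0 * log(0 / q) is taken to be 0
        if qi == 0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total
```

The divergence is zero when the two distributions match and grows as the model distribution `q` departs from the empirical distribution `p`.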
|
| Speaker |
Ana Belen Benitez, PhD student, DVMM group |
| Title |
Organization and Browsing of Annotated Images Using Multiresolution
Knowledge Networks
|
| Abstract |
|
This talk will present novel methods for organizing and browsing
annotated images based on multiresolution networks representing
knowledge about the images. At the highest resolution, images
are organized by discovering perceptual knowledge (e.g., image
clusters and visual relations), semantic knowledge (e.g., word
senses and semantic relations), and statistical interrelations
among these. This process draws on the integrated processing
of both images and annotations and the use of the electronic dictionary
WordNet. Knowledge networks at lower resolutions are constructed
by clustering similar concepts together. Users can then browse
the annotated images by navigating the multiresolution knowledge
networks. The visualization of the knowledge
networks exploits ideas from fish-eye views for concept display
using example text and images, and from spring modeling for network
drawing. Experiments have shown the trade-off between knowledge
completeness and consistency/conciseness as the number of concepts
increases, and have justified some of the proposed browsing decisions.
|
| Speaker |
Lexing Xie, PhD student, DVMM group |
| Title |
Structure Discovery from Video Using Hierarchical Hidden Markov
Models
|
| Abstract |
|
This talk will present the problem of structure discovery, and
our approach for unsupervised structure discovery from video using
hierarchical hidden Markov models. Structure elements in a time
sequence are repetitive segments that bear syntactic characteristics
and are often interpretable in a semantic sense. Structure is a term
broad enough for many domains, such as audio-video streams, speech
or music, genome data, system logs and so on. Here, however, we shall
restrict our attention to video structure with stochastic properties.
In addition to the many prior works that have been successful in
learning descriptions of structure from a supervised data
pool and detecting structure using these learned descriptions,
a few attempts have been made, in other domains as well as in video,
to learn the description and locate the structure at the same
time. Our approach selects a hierarchical hidden Markov model based
on several domain assumptions. We will discuss the learning algorithm
and its performance, as well as interesting open issues that arise in
interpreting the results, model selection and feature selection.
|
| Speaker |
Kamal Hasan Basri, Associate Director, Columbia Video
Network |
| Title |
Utilizing Akamai's caching technology and CVN's visual content
for superior course delivery in online lectures
|
| Abstract |
|
What does it take to run a successful distance learning program?
Basically, there are five elements that are needed: 1) Good Bandwidth
2) Good Content 3) Good Customer Service 4) Good Staff and 5)
Good Marketing/Business Plan. This talk will only focus on the
first two "goods", the bandwidth and the visual content.
The advent of ISPs in providing bandwidth for large applications
such as video streaming has enhanced the quality of the presentation
of the content provider. This talk will introduce Columbia Video
Network, the distance learning office of Columbia University's
School of Engineering and Applied Sciences, which has utilized
Akamai Technologies' caching infrastructure in delivering its video
streaming lectures to its students located all over the world.
The delivery of the lectures and related issues will be presented.
Performance and feedback are also measured, and the visual systems
of the lectures are discussed at length, including how the technology
and content established CVN as one of the best distance learning
programs in the nation.
|
| Speaker |
Dr. Deepak Turaga, Philips Research |
| Title |
Motion-compensated wavelet video coding for scalable
and robust wireless video delivery |
| Abstract |
|
The advantages of using wavelets for image compression have been
well recognised, especially the high coding efficiency in conjunction
with the inherent scalability they provide. Recent advances in
temporal filtering techniques have allowed the extension of these
schemes into the video coding arena. Wavelet video coding schemes
can provide flexible spatial, temporal, SNR and complexity scalability
with fine granularity over a large range of bit-rates, while maintaining
a very high coding efficiency. The inherent prioritization of
data in this framework, as well as the availability of mature
spatio-temporal wavelet filtering techniques, leads to added robustness
and considerably improved error concealment properties. Wavelet
video bitstreams may also be easily tailored into multiple descriptions
to improve robustness when used alongside wireless path-diversity
and multiple antenna wireless systems. In this talk we highlight
some of these properties of the wavelet based video coding
schemes, specifically targeted towards overcoming the constraints
imposed by wireless networks on video transmission. We will discuss
the general wavelet video framework, and then introduce some of
our research, mainly our design of a new unconstrained temporal
filtering framework. We will also show some video examples to
illustrate the ideas and show some comparisons with state-of-the-art
non-scalable solutions, i.e. H.26L. Finally, we will indicate
some of our ongoing and future research directions including other
temporal decomposition techniques, multiple description coding,
error concealment, frame-rate upconversion etc.
|
| Biography |
|
| Deepak S. Turaga received the B.Tech.
degree in Electrical engineering in 1997 from the Indian Institute
of Technology, Bombay and his M.S. and Ph.D. in Electrical and Computer
engineering in 1999 and 2001 from Carnegie Mellon University, Pittsburgh,
Pennsylvania. He is currently with Philips Research USA as part of
the Wireless Communications and Networking Group, where he is a Senior
Member of Research Staff. He was with the Semantic Video Processing
group at Intel, Portland during the summer of 1999. His research interests
include multimedia signal and image processing and computer vision
applications. His current research focus is on scalable video coding
for wireless networks and network-adaptive video encoder optimizations.
He is also interested in enhancing the error resilience of video transmitted
over wireless networks, in particular, using multiple description
coding, cross-protocol layer optimization and model-based error concealment
schemes. |
| Speaker |
Trista Pei-Chun Chen, Carnegie Mellon University |
| Title |
Rate Shaping for Error Resilient Video Streaming |
| Abstract |
|
Due to the rapid growth of wireless communications, video over
wireless networks has gained a lot of attention. Challenges such
as the time-varying error rate and fluctuating bandwidth create
the need for error resilient video transport. Joint source-channel
coding (JSCC) techniques are often applied to achieve error resilient
video transport with online coding. However, JSCC techniques are
limited by only providing end-to-end optimization at the time
of encoding and are not suitable for streaming precoded video.
The encoded bitstream may not be optimal for transmission along
a different path or along the same path at a later time. I will
present rate shaping as a better solution in this talk. Rate shaping
provides the optimal solution in current network
conditions for each link along the path of transmission.
Given feedback from the network layer, dynamic rate
shaping (DRS) was proposed to shape, i.e., to reduce,
the bit rate of a single-layered pre source-coded video,
in order to meet the real-time bandwidth constraint. Conventional
DRS did not consider shaping for the parity bits in addition to
the source-coding bits. In this talk, I first extend rate shaping
for streaming precoded video that is both pre source- and channel-
coded, which I call baseline rate shaping (BRS). While
BRS operates on a coarse level, fine-grained rate shaping
(FGRS) is proposed to allow for bandwidth adaptation in
fine granularities. In addition to FGRS, to account for the fact that the
decoder may perform error concealment (EC) if any video data is
lost during transmission, I propose a rate-shaping scheme
that is aware of the EC method used at the decoder, which I call
error concealment aware rate shaping (ECARS).
|
| Biography |
|
| Trista Pei-chun Chen received the
B.S. degree and the M.S. degree from National Tsing Hua University,
Hsinchu, Taiwan, in 1997 and 1999, respectively. Since August 1999,
she has been working towards her Ph.D. degree in Electrical and Computer
Engineering at Carnegie Mellon University, Pittsburgh, Pennsylvania.
During the summer of 2000, she was with HP Cambridge Research Laboratory,
Cambridge, Massachusetts, conducting research in image retrieval for
massive databases. During the summer of 2001, she was with Pittsburgh
Sony Design Center, Pittsburgh, Pennsylvania, designing circuits for
Video Watermarking (VWM). Her research interests are in the areas
of networked video, watermark/data hiding, image processing, and biometric
signal processing. She is a student member of the IEEE. |
| Speaker |
Dr. Beth Logan, HP Cambridge Research Lab |
| Title |
Content-Based Music Analysis |
| Abstract |
|
The advent of MP3 is changing the world of music distribution.
We are moving toward a future in which all the world's music will
be ubiquitously available. This raises new issues in music retrieval
which necessitate the development of techniques for content-based
music analysis. In this talk, we describe several of our efforts
in this
direction: content-based music summarization and music similarity.
Both techniques form statistical models of the spectral features
of each song. Our summarization technique then automatically chooses
a representative phrase for each song using the segmentation provided
by its model. Our similarity technique compares the models for
each pair of songs using the Earth Mover's Distance (Rubner1998)
to form a distance matrix. Both techniques show great promise.
We present subjective and objective results on a non-trivial database
(and hopefully a couple of demonstrations).
|
| Biography |
|
Beth Logan received the BSc. and
B.E. degrees from the University of Queensland, Australia, in 1990
and 1991 respectively. She received the PhD in engineering from the
University of Cambridge, United Kingdom, in 1998, completing a dissertation
on speech enhancement. Since 1998, she has been a research scientist
at HP Labs' (formerly Digital's and
Compaq's) Cambridge Research Laboratory in Cambridge, Massachusetts,
U.S.A. Her work here has focused on scalable organization of digital
content, primarily looking at indexing and modeling of speech and
music. |
| Speaker |
Dr. Milind R. Naphade, IBM T. J. Watson Research Center |
| Title |
Concepts, Context and Structure: Learning Multimedia
Semantics |
| Abstract |
|
Multimedia content is an essential part of information technology.
However, the difficulty in filtering, searching, and summarizing
video has so far hindered the effective utilization of video databases.
Users want to filter and query video by high-level (semantic)
concepts, while automatic algorithms can extract only low-level
features (e.g., color, texture, shape, amount of motion). Bridging
this gap is thus the most challenging problem in video (multimedia)
indexing, retrieval, and filtering.
In this talk I will present recent and ongoing research for mapping
low-level features to high-level semantics, where the semantics
relates to the concepts, context and structure in multimedia content.
I will argue that supervision leads to semantics. This framework
consists of representation of concepts using probabilistic multimedia
objects "multijects" and representation of contextual
constraints of such objects called the "multinet". To
extract semantics from video, I approach the problem of video
understanding as a pattern recognition problem and develop a probabilistic
framework for modeling the concepts and the context of video content.
I use state-of-the-art learning techniques to model contents.
For modeling inter-conceptual context I use a factor graph framework
for non-causal inference. Using the TREC Video Benchmark corpus
I present detection results for a large number of semantic concepts.
For modeling structure I propose unsupervised modeling of the
temporal evolution with graphical models.
I will then argue that it is important to reduce the amount of
"dumb" supervision. Instead, I propose the use of several
machine-learning techniques for active learning, unsupervised
pattern discovery and multiple instance learning to alleviate
the burden of labeling large data sets through "smart"
supervision.
|
| Biography |
|
| Milind Ramesh Naphade received
his B.E. degree in Instrumentation and Control Engineering from the
University of Pune, India in July 1995, ranking first among the university
students in his discipline. He received his M.S. and Ph.D. degrees
in Electrical Engineering from the University of Illinois at Urbana-Champaign
in 1998 and 2001 respectively. He was a Computational Sciences and
Engineering Fellow and a member of the Image Formation and Processing
group at the Beckman Institute for Advanced Science and Technology
from August 1996 to March 2001. In 2001 he joined the Pervasive Media
Management Group at the IBM T. J. Watson Research Center in Hawthorne,
NY, where he is a research staff member. He has worked with the Center for
Development of Advanced Computing (C-DAC) India, the Kodak Research
Laboratories of the Eastman Kodak Company and with the Microcomputer
Research Laboratories at Intel Corporation. He is a member of IEEE
and the honor society of Phi Kappa Phi. He is among the early proponents
of statistical modeling of semantics of multimedia content and context
with more than 40 journal articles and conference publications, several
book chapters and 6 patents (filed or granted). His research interests
include application of learning techniques for audio-visual signal
processing and analysis for the purpose of multimedia understanding,
content-based indexing, retrieval and mining. |
| Speaker |
Marios Athineos, PhD Student, LabROSA |
| Title |
Sound texture modelling with linear prediction in both time
and frequency domains
|
| Abstract |
|
Sound textures - for instance, a crackling fire, running water,
or applause - constitute a large and largely neglected class of
audio signals. Whereas tonal sounds have been effectively and
flexibly modelled with sinusoids, aperiodic energy is usually
modelled as white noise filtered to match the approximate spectrum
of the original over 10-30 ms windows, which fails to provide
a perceptually satisfying reproduction of many real-world noisy
sound textures.
In this talk we argue that this failure is due to the loss of
short-term temporal structure, and we introduce a second modelling
stage in which the time envelope of the residual from conventional
linear predictive modelling is itself modelled with linear prediction
in the spectral domain. This cascade time- and frequency-domain
linear prediction
(CTFLP) leads to noise-excited resyntheses that have high perceptual
fidelity.
To support our claims we introduce a novel quantitative error
analysis of the difference between original and resynthesis by
measuring the mean proportional error within time-frequency cells
across a range of timescales. This analysis confirms our expectation
that CTFLP is better able to model the short-term energy variations
present in certain kinds of sound textures, and provides insight
into which kinds of sound are best suited to this kind of modelling.
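The conventional first stage referred to in this abstract, time-domain linear prediction, can be sketched with the autocorrelation method and the Levinson-Durbin recursion, as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, the signal is a plain list of floats, and the second, frequency-domain stage of CTFLP is not implemented.

```python
def autocorrelation(x, max_lag):
    """Return the autocorrelation r[0..max_lag] of the sequence x."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the linear-prediction normal equations from autocorrelation r.

    Returns (a, err): a[0] == 1.0 and the prediction residual is
    e[n] = sum_k a[k] * x[n - k]; err is the final prediction error energy.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err == 0:
            break  # signal is already perfectly predicted
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def residual(x, a):
    """Filter x with the prediction-error filter a to obtain the residual."""
    order = len(a) - 1
    return [sum(a[k] * x[n - k] for k in range(order + 1) if n - k >= 0)
            for n in range(len(x))]
```

In the cascade described in the abstract, the time envelope of this residual is what the second stage goes on to model with linear prediction in the spectral domain.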
|
| Speaker |
Shahram Ebadollahi, PhD Student, DVMM |
| Title |
Indexing of Echocardiogram Videos Using View Recognition
|
| Abstract |
|
Echocardiography is a common diagnostic modality to assess the
structure and the function of the heart. Indexing echocardiogram
videos at different levels of structure is essential for providing
efficient access to their content for browsing and retrieval.
In this talk I present our approach for parsing the content of
the echocardiograms into their constituent views using their spatio-temporal
structure. We pose the problem as one of 3D object aspect recognition
and use a viewer-centered, model-based approach to solve it.
The spatial configuration of the heart chambers in each distinct
view is used as the distinguishing feature of the views and is
modeled by a Markov Random Field model. A bounded state duration
HMM is used to capture the temporal transition of the sequence
of the views and their durations. Results of the application of
the method to several echocardiogram videos will be presented.
|
| Speaker |
Dr. Ching-Yung Lin, IBM T.J. Watson Research Center |
| Title |
A Systematic Approach to Multimedia Understanding |
| Abstract |
|
Modern advancements in information technology have enabled pervasive
uses of digital multimedia data. Accompanying the explosive
growth in the generation, storage, distribution and consumption
of multimedia data are emerging requirements for indexing content,
building standard exchange formats and ensuring a trustworthy
framework between users. While feature-based indexing techniques
have satisfied some of these requirements, a need for understanding
the semantic meaning of multimedia data is foreseen and is currently
driving the research paradigm to a new level. Although advances in speech/text/face
recognition have been observed in recent applications, a generic
framework which recognizes thousands of visual objects and acoustic
information has not been seen in the literature. In this talk,
I will introduce our current effort in developing frameworks for
generic audio-visual object recognition and video structure understanding.
I will also show our experimental results and compare them in
the context of TREC Video Retrieval Benchmarking 2002. In addition,
I will demonstrate our effort in building public system tools
for multimedia understanding, summarization, and indexing, such
as VideoAnnEx, VideoAL, VideoEd, and VideoSue.
|
| Biography |
|
Ching-Yung Lin received the B.S.
and M.S. degrees from National Taiwan University in 1991 and 1993,
respectively, and his Ph.D. degree from Columbia University in 2000,
all in Electrical Engineering.
Since 2000, he has been a Research Staff Member in IBM T. J. Watson
Research Center, New York. His current research interests include
multimedia understanding and multimedia security. Dr. Lin's team performed
best in the NIST TREC video semantic retrieval benchmarking in 2001 and the
video concept detection benchmarking in 2002. He designed the first
successful multimedia content authentication system and the first
public watermarking system surviving print-and-scan process. Dr. Lin
is the technical program chair of ITRE 2003 and will serve as a guest
editor of the Proceedings of IEEE -- special issue on Digital Rights
Management, April 2004. He organized special sessions in ITCC 2001
and ICIP 2003, and will give a tutorial lecture on multimedia security
in ICME 2003. He is the recipient of the Lung-Teng Thesis Award and an
Outstanding Paper Award in CVGIP.
Dr. Lin is the author/co-author of 50 journal and conference papers
and four public software tools. He holds three US patents and seven
pending patents in the fields of multimedia security and multimedia
semantic analysis. |
| Speaker |
Jelena Tesic, PhD Student, University of California,
Santa Barbara |
| Title |
Efficient Query Processing in Relevance Feedback |
| Abstract |
|
This talk introduces the problem of repetitive nearest neighbor
search in relevance feedback. An efficient search scheme is proposed
for high dimensional feature spaces. Relevance feedback learning
is a popular strategy used in content based image and video retrieval
to support high level concept queries. Our work addresses those
scenarios in which a similarity or distance matrix is updated
during each iteration of the relevance feedback search and a new
set of nearest neighbors is computed. This repetitive nearest
neighbor computation in high dimensional feature spaces is expensive,
particularly when the number of items in the data set is large.
In this context, we suggest a search algorithm that supports relevance
feedback for the general quadratic distance metric. The scheme
exploits correlation between two consecutive nearest neighbor
sets thus significantly reducing the overall search complexity.
Detailed experimental results are provided using a 60-dimensional
texture feature dataset. If time permits, I will talk about my
recent work on (1) application of data mining to large image datasets,
particularly aerial imagery, and (2) dimensionality reduction
of Gabor texture descriptors.
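The repeated computation described in this abstract can be illustrated with a minimal sketch. It assumes a brute-force search and a hypothetical weight matrix `W` that changes between feedback iterations; it is not the proposed search scheme, and only shows the quadratic distance and the overlap between consecutive neighbor sets that such a scheme would exploit. The names `quad_dist`, `knn`, and `overlap` are illustrative.

```python
import heapq


def quad_dist(x, q, W):
    """General quadratic distance (x - q)^T W (x - q)."""
    d = [a - b for a, b in zip(x, q)]
    n = len(d)
    return sum(d[i] * W[i][j] * d[j] for i in range(n) for j in range(n))


def knn(data, q, W, k):
    """Brute-force k nearest neighbors of q under W; returns item indices.

    Ties are broken by the lower index.
    """
    scored = ((quad_dist(x, q, W), i) for i, x in enumerate(data))
    return [i for _, i in heapq.nsmallest(k, scored)]


def overlap(prev, curr):
    """Number of items shared by two consecutive neighbor sets."""
    return len(set(prev) & set(curr))
```

Each feedback iteration updates `W` and calls `knn` again; when `overlap` between iterations is high, most of that work is redundant, which is the cost the talk aims to reduce.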
|
| Biography |
|
Jelena Tesic is a doctoral candidate
in the department of Electrical and Computer Engineering at the University
of California, Santa Barbara. She works with Prof. B. S. Manjunath
at the Vision
Research Lab. She received her B.Sc. in Electrical Engineering (1998)
from the University of Belgrade, Serbia, and her M.S. in Electrical
and Computer Engineering (1999) from the University of California,
Santa Barbara. Her current research focuses on managing large multimedia
datasets.
|
| Speaker |
Adam Berenzweig, PhD Student, LabROSA |
| Title |
Semantic Anchor Space for Music
|
| Abstract |
|
I will talk about a method of mapping music into a semantic space
that can be used for similarity measurement, classification, and
music information retrieval. The value along each dimension of
this "anchor space" is computed as the output from a
classifier which is trained to measure a particular semantic feature,
for example music genre. In anchor space, distributions that represent
objects such as artists or songs are modeled with Gaussian Mixture
Models, and these distributions can be compared using an approximation
to the Kullback-Leibler divergence. Evaluation is one of the most
problematic aspects of this research, and several evaluation methods
using various sources of human similarity judgments are explored.
An artist classification experiment using the models will also
be presented, with promising results. Finally, a music similarity
browsing application will be demonstrated, with a novel interface
that makes use of the fact that anchor space dimensions are meaningful
to users.
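One common way to approximate the Kullback-Leibler divergence between two Gaussian mixtures is Monte Carlo sampling. The sketch below is a one-dimensional illustration under that assumption; the abstract does not say which approximation is used, and real anchor-space models would be multi-dimensional. Each mixture is a hypothetical list of `(weight, mean, std)` tuples.

```python
import math
import random


def gmm_pdf(x, gmm):
    """Density of a 1-D Gaussian mixture given as (weight, mean, std) tuples."""
    return sum(
        w * math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
        for w, mu, sd in gmm
    )


def gmm_sample(gmm, rng):
    """Draw one sample: pick a component by weight, then sample from it."""
    r = rng.random()
    acc = 0.0
    for w, mu, sd in gmm:
        acc += w
        if r <= acc:
            return rng.gauss(mu, sd)
    # Guard against rounding when the weights sum to slightly less than 1.
    _, mu, sd = gmm[-1]
    return rng.gauss(mu, sd)


def kl_monte_carlo(p, q, n=5000, seed=0):
    """Estimate KL(p || q) as the average of log p(x)/q(x) over x drawn from p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = gmm_sample(p, rng)
        total += math.log(gmm_pdf(x, p) / gmm_pdf(x, q))
    return total / n
```

The estimate is asymmetric, so a similarity measure would typically combine `kl_monte_carlo(p, q)` and `kl_monte_carlo(q, p)`.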
|
| Speaker |
Dr. Giridharan Iyengar, IBM T.J. Watson, Yorktown Heights,
NY. |
| Title |
Information Fusion from Multiple Modalities for Multimedia
Mining Applications |
| Abstract |
|
In this talk, I will describe some of our work in joint processing
of audio, visual and textual information
for a variety of multimedia applications. Specifically, I will
focus on three broad themes: Semantic concept
detection in Multimedia content, Information Retrieval in Multimedia
content and Detection of Synchrony
in Multimedia events. In each of these cases, I will illustrate
the promise and challenges of information fusion.
The examples presented in this talk form an integral part of the
IBM system at Video TREC 2002, a benchmark organized by NIST.
|
| Biography |
|
Dr. Giridharan Iyengar has been a Research Staff Member in the Audio-Visual
Speech Technologies Group at the IBM TJ Watson Research Center since
1999. He received his BTech in Electrical Engineering from the Indian
Institute of Technology, Mumbai in 1990. After working as an Engineer
in Larsen and Toubro, India for one year, he obtained his Master's
degree in Electrical Engineering from the University of Ottawa, Canada.
He then did his doctoral work at the MIT Media Laboratory, where he
worked on video retrieval and indexing. Giri is a member of the IBM
team that participates in the TREC video track organized by NIST. For
the past two years, he has been the project leader of the Multimedia
Mining project at IBM Research. He has authored over 35 papers, filed
9 patents (4 currently issued) and has participated in program committees
of conferences and reviewed for journals in multimedia, image processing
and computer vision and image understanding. His research interests
include multimodal signal processing, video indexing and retrieval,
speech processing and information retrieval. |
| Speaker |
Professor Nasir Memon, Polytechnic University, New
York |
| Title |
Delta Compression of File Collections |
| Abstract |
|
Delta compression techniques have been suggested for efficient
representation of an updated version of a file with respect to
an earlier version. In this talk we will first review the problem
of Delta compression and the various approaches that have been
used for arriving at good solutions. We then argue that delta
compression can be useful for a broader set of applications. We
propose a cluster based delta compression technique which can
effectively compress a collection of related files by performing
pair-wise delta compression. The problem of finding an optimal
delta encoding for a collection of files by taking pair-wise deltas
can be reduced to the problem of computing a branching of maximal
weight in a weighted directed graph. Given the quadratic complexity
of finding such an optimal branching, we employ a clustering technique
that reduces the collection into small subgroups of related files,
and then compress each subgroup by computing an optimal branching.
To demonstrate the efficacy of our approach, we present experimental
results with large collections of web pages. Our experiments show
that cluster-based delta compression of these collections provides
significant improvements in total compression ratio as compared
to individually compressing each file.
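The idea of pair-wise deltas over a collection can be sketched as follows. This is a simplified illustration, not the method of the talk: it uses zlib with a preset dictionary as a stand-in for a real delta compressor, and it lets each file pick its best reference only among earlier files, which yields a valid (acyclic) branching but not necessarily the maximum-weight one. The function names are hypothetical.

```python
import zlib


def plain_size(data):
    """Compressed size of a file on its own."""
    return len(zlib.compress(data, 9))


def delta_size(reference, target):
    """Compressed size of target when reference is used as a preset dictionary.

    This approximates the size of a delta of target with respect to reference.
    """
    comp = zlib.compressobj(level=9, zdict=reference)
    return len(comp.compress(target) + comp.flush())


def choose_references(files):
    """For each file, pick the earlier file that gives the smallest delta.

    Returns a list of (reference_index_or_None, encoded_size). Restricting
    references to earlier files keeps the reference graph free of cycles.
    """
    plan = []
    for i, target in enumerate(files):
        best_ref, best = None, plain_size(target)
        for j in range(i):
            size = delta_size(files[j], target)
            if size < best:
                best_ref, best = j, size
        plan.append((best_ref, best))
    return plan
```

Clustering, as described in the abstract, would limit the inner loop to files in the same subgroup instead of all earlier files.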
|
| Biography |
|
Nasir Memon is an Associate Professor in the computer science department
at Polytechnic University, New York. Prof. Memon's research interests
include Data Compression, Computer and Network Security and Multimedia
Communication and Computing. He has published more than 100 articles
in journals and conference proceedings and holds two patents in image
compression. He has been the principal investigator on several funded
research projects sponsored by NSF as well as industry. He was a visiting
faculty member at Hewlett-Packard Research Labs during the academic year
1997-98. He is currently an associate editor for IEEE Transactions
on Image Processing, the ACM Multimedia Systems Journal and the Journal
of Electronic Imaging. |
| Speaker |
Patricia Scanlon, LabROSA |
| Title |
Using Mutual Information to Design Class-Specific Phone Recognizers
|
| Abstract |
|
Information concerning the identity of subword units such as phones
cannot easily be pinpointed because it is broadly distributed
in time and frequency. In this talk I will show how we have used
Mutual Information as a measure of the usefulness of individual
time-frequency cells for various speech classification tasks,
using the hand-annotations of the TIMIT database as our ground
truth. Since different broad phonetic classes such as vowels and
stops have such different temporal characteristics, we examine
mutual information separately for each class, revealing structure
that was not uncovered in earlier work on this subject; further
structure is revealed by aligning the time-frequency displays
of each phone at the center of their hand-marked segments, rather
than averaging across all possible alignments within each segment.
Based on these results, we evaluate a range of vowel classifiers
over the TIMIT test set and I will show that selecting input features
according to the mutual information criterion can provide a significant
increase in classification accuracy.
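The mutual information between a discretized feature and a class label can be estimated from co-occurrence counts. The sketch below assumes the value in a time-frequency cell has already been quantized into discrete bins; the function name and inputs are illustrative and not taken from the talk.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits from paired discrete observations.

    xs: quantized feature values (for example, one time-frequency cell).
    ys: class labels (for example, phone identities) for the same tokens.
    """
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        pxy = count / n
        mi += pxy * math.log2(pxy / ((px[x] / n) * (py[y] / n)))
    return mi
```

Computing this value for every cell gives a map of where class information is concentrated; cells with the highest values are candidates for classifier input features.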
|
| Speaker |
Ana Belen Benitez, PhD Student, DVMM |
| Title |
Image Classification Using Multimedia Knowledge Networks
|
| Abstract |
|
We present novel methods for classifying images based on knowledge
discovered from annotated images. The novelty of this work lies
in the automatic class discovery and the classifier combination
using the extracted knowledge. The knowledge includes image clusters,
word-senses and relationships between them. Concepts (image clusters
and word-senses) that are similar statistically can be merged.
Our knowledge classifier is constructed by training a meta-classifier
to predict the presence of each concept in images. A Bayesian
network is then learned with the meta-classifiers as nodes and
the knowledge's structure as initial topology. A new image is
first classified using the meta-classifiers, and the labels refined
using the Bayesian network. Another important contribution of
this work is the analysis of the role of visual and text descriptors
in image classification. As text or text-visual descriptors perform
the best, we propose to use the latter treating text descriptions
as missing data for images without annotations.
|
| Speaker |
Dr. Li Zhang
Ph.D. Operations Research (1997), Columbia
University
Research Staff Member, Systems Analysis and Optimization
IBM Research, Thomas J. Watson Research Center, Yorktown Heights,
NY
|
| Title |
Workload Service Requirements Analysis: A Queueing
Network Optimization Approach |
| Abstract |
|
We study important performance issues at high volume commercial
Web sites based on a general multi-class queueing network approach.
In a typical Web service environment, it is relatively easy to
collect
server throughput and utilization data. It is often possible to
collect user response time data in a controlled environment, or
with certain instrumentation on the client machines. However,
the actual service time of a job (excluding the queueing time),
can be very difficult to obtain. The answers to many important
performance related questions depend crucially on the service
times of different classes of jobs. We present a general approach
to infer the per-class service times at different servers from
the server throughput, utilization and
the per-class response time measurements. The per-class service
times are solutions to an optimization problem with queueing-theoretic
formulas in the objective and constraints.
We further study the impact of the variance of service times on
the variance of response times, noting that these results can
be used to obtain estimations of the per-class service time variances
from the per-class response time measurements. We present a few
case studies to demonstrate the power of our approach.
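A much reduced version of this inference can be sketched with the utilization law alone, U = λ1·s1 + λ2·s2, fitted by least squares over several measurement intervals. This assumes a single server with two job classes and ignores the response-time measurements and queueing-theoretic constraints used in the talk's formulation; the function name is hypothetical.

```python
def infer_service_times(rates, utils):
    """Least-squares fit of per-class service times (s1, s2).

    rates: list of (lam1, lam2) per-class throughputs, one pair per interval.
    utils: list of measured server utilizations for the same intervals.
    Solves the 2x2 normal equations for U = lam1 * s1 + lam2 * s2.
    """
    a11 = sum(l1 * l1 for l1, _ in rates)
    a12 = sum(l1 * l2 for l1, l2 in rates)
    a22 = sum(l2 * l2 for _, l2 in rates)
    b1 = sum(l1 * u for (l1, _), u in zip(rates, utils))
    b2 = sum(l2 * u for (_, l2), u in zip(rates, utils))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        # The class mix never changes, so the two classes cannot be separated.
        raise ValueError("throughput measurements are not linearly independent")
    s1 = (b1 * a22 - a12 * b2) / det
    s2 = (a11 * b2 - a12 * b1) / det
    return s1, s2
```

The fit only works when the class mix varies across intervals; otherwise the per-class contributions to utilization cannot be distinguished.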
|
| Biography |
|
Dr. Zhang graduated from the IEOR
department, Columbia University in 1997, after receiving degrees from
Purdue and Beijing University. His present work comprises the study
of performance modeling and analysis techniques in the Web environment.
He is also interested in resource allocation and optimal control schemes
in Web farms. His other favorite research area is network measurement
and time synchronization algorithms. |
| Speaker |
Dr. Christos Dimitrakopoulos
Ph.D. Materials Science (1993), Columbia
University
Research Staff Member, Silicon Technology Dept.
IBM Research, Thomas J. Watson Research Center, Yorktown Heights,
NY
|
| Title |
Performance and Stability of Organic Thin-Film Transistors
Based on Pentacene Channels |
| Abstract |
|
In this talk I will briefly cover the fabrication, characterization
and optimization of performance of organic thin-film field-effect
transistors (OTFTs) based on pentacene channels deposited either
by vacuum sublimation or from a pentacene precursor solution.
The effects of morphology, purity, doping, device configuration,
modification of surfaces by self-assembled monolayers (SAMs),
and environment will also be discussed. Studies of pentacene device
stability under current stress and various environmental conditions
will be presented. The fabrication of OTFTs comprising single-grain
pentacene channels will be presented together with device characteristics
measured from room temperature down to 4 K.
|
| Biography |
|
Dr. Dimitrakopoulos' present work comprises the study of ultra-low
k dielectric materials. Previously he worked with organic semiconductor
materials and devices. He is the author/co-author of 10 patents, several
more pending patent applications and approximately 30 papers in this
field. In 2000 he received an IBM Outstanding Innovation Award for
"High Performance Organic Transistors on Plastic". He recently
joined the ranks of Master Inventors at IBM Research, and at the 2002
IEDM meeting he was a co-recipient of the Paul Rappaport Award from
the IEEE Electron Devices Society. |
| Speaker |
Dr. Jack O. Chu
Ph.D. Chemistry (1984), Columbia University
Research Staff Member, Electronic Materials & Structures
IBM Research, Thomas J. Watson Research Center, Yorktown Heights,
NY
|
| Title |
SiGe-On-Insulator (SGOI) Substrates for Device Applications |
| Abstract |
|
The utilization of the strained Si/SiGe material system with
enhanced transport properties to replace bulk Si in state-of-the-art
VLSI technologies is very promising to provide high performance
CMOS in the sub-100nm regime. Moreover, the "synergistic"
combination of strained Si/SiGe devices with silicon-on-insulator
has attracted increasing interest because of its great potential
to realize high performance MOSFETs and MODFETs for low power
and high speed operations. However, there is no straightforward
integration scheme for fabricating such high performance devices
on an insulator for reducing parasitic junction capacitance and
for low power operations. This presentation will describe a recently
patented technique (US Pat. No. 6,524,935 issued Feb. 25, 2003)
for the fabrication of various "high mobility" Si/SiGe
heterostructures on SiGe-On-Insulator (SGOI) substrates which
were generated by a wafer bonding and hydrogen-induced layer transfer
process. The key advantage of this approach is that the completed
SGOI structure is fully relaxed (>90%), in comparison to Toshiba's
proposed SGOI fabrication technique, which relies on a Ge enrichment
process (by oxidation) and is limited to strain
relaxation of up to 50%. For device applications, both n-type and
p-type modulation-doped SiGe heterostructures have been fabricated
on SGOI substrates which yielded high electron mobility in the
range of 1500-2000 cm²/Vs and enhanced hole mobility of about
500 cm²/Vs at room temperature, respectively. Similarly, CMOS
devices utilizing a tensilely strained silicon channel have also
been fabricated on SGOI which yielded electron mobility enhancement
of greater than 50%, and enhanced hole mobility of about 20% for
nFET and pFET devices, respectively.
|
| Biography |
|
Dr. Chu is a Research Staff Member in the Electronic Materials and
Device Group, and has been involved in the development and application
of the UHV-CVD technique to fabricate various types of metastable silicon
alloys (SiGe, SiC, SiGe:P, SiGe:B) and structures with applications
to high performance bipolar and field effect transistors. In particular,
high quality SiGe heterostructures have been fabricated setting world
records in the areas of bipolar device performance as well as in modulation
doped FET devices. His current efforts are on the development of high
speed and low-powered CMOS logic technologies based upon strained
Si and SiGe device heterostructures. He has authored and coauthored
over 130 publications in the microelectronics field and holds over
25 related patents. He received an IBM Research Division Award for
his work on understanding silylene gas phase dynamics, and is a recipient
of an IBM Outstanding Technical Achievement Award for high mobility
electron and hole transport in SiGe structures. |
| Speaker |
Dr. John R. Smith
Ph.D. Electrical Engineering (1993), Columbia
University
Manager, Pervasive Media Management Group
IBM T. J. Watson Research Center
|
| Title |
MPEG-21 Multimedia Framework |
| Abstract |
|
MPEG-21 is an emerging standard that specifies a framework for
transactions of multimedia content. MPEG-21 is built around a
fundamental unit of transaction called a "digital item,"
which is a packaging of media resources, metadata, rights expressions,
identifiers, and processing methods. Examples of digital items
include packages of movies and related video out-take clips,
musical recordings with graphics and liner notes, photo albums,
and so on. MPEG-21 is not designed around any particular business
model; rather, it allows Users, who can be any participants in the value
network, to seamlessly exchange digital items across networks
and devices. In this talk, we discuss the goals of MPEG-21 and
report on the latest developments with respect to MPEG-21 Digital
Item Declaration, MPEG-21 Rights Expression Language and Rights
Data Dictionary, and MPEG-21 Digital Item Adaptation.
|
| Biography |
|
John R. Smith is Manager of the
Pervasive Media Management Group at IBM T. J. Watson Research Center,
where he leads a research team developing systems and methods for
multimedia semantic indexing and retrieval. He is currently Chair
of the ISO MPEG Multimedia Description Schemes (MDS) group and serves
as co-Project Editor for several parts of the MPEG-7 standard. Dr.
Smith received his M. Phil and Ph.D. degrees in Electrical Engineering
from Columbia University in 1994 and 1997, respectively, and is currently
serving as IBM Research Campus Relationship Manager for Columbia University. |

Last updated: February 10, 2003.
|