Abstracts - ADVENT Seminars

 

Speaker Professor Marek Domanski, Poznan University, Poland
Title Spatio-temporal scalability in DCT-based hybrid video coders
Abstract  

 

This talk presents a generic multi-loop coder structure suitable for mixed spatial and temporal scalability combined with fine granular SNR scalability. The structure is suitable for various variants of hybrid video coders such as MPEG-2, H.263 and AVC (JVT/H.264). The idea of mixed spatial and temporal scalability, i.e. spatio-temporal scalability, is essential to the proposal. Its application improves scalable coding efficiency, i.e. decreases the scalability overhead. The coder consists of independently motion-compensated sub-coders that produce bitstreams corresponding to individual levels of spatio-temporal resolution. The bitrate can be smoothly matched to the particular channel bandwidth by use of data partitioning, which is associated with drift errors in the decoder. Accumulation and propagation of these errors can be bounded by use of a proper structure of groups of pictures.


Speaker Masaki Miura, Fujitsu, Japan
Title Network and Video Application
Abstract  

 

I'd like to talk about the business of our department and my research topic at the Friday seminar.
First, I'm going to introduce our business. As I mailed you before, the key phrase of our business is "Network and Video Application". Its three bases are video transmission, video compression and video processing. Our products have been developed around these key words. I'm going to introduce them and show you a demo. Then I'm going to talk about my research topic at DVMM, "Event Detection and Event-based Bursty Transmission". It aims to realize a large-scale video surveillance system. The concept is analogous to a DNA microarray. Based on a large number of events (for example, detected by IP-700s), a server learns patterns of events, finds hidden information among them, predicts the future state, assigns proper bandwidth and controls traffic from the video encoders. Finally, if I have some spare time, I'd like to show you some pictures of Japan and a Japanese character table (especially for Shahram and Alex).



Speaker Ryoma Oami, NEC, Japan
Title My Research in NEC
Abstract  

 

I'll talk about my work at NEC in the seminar, especially three topics: DCT-based lossless video coding, watermark robustness calculation against attacks, and low bit-rate multiple object coding. The first topic is lossless video coding based on the DCT. It is compatible with ordinary DCT-based lossy coding, such as MPEG-2. The next topic is a measure for quantitatively comparing the robustness of different watermarking algorithms, which has been difficult so far. The final topic includes a background coding algorithm with adaptive resolution control and quantization control based on the distance to the object boundaries, and a bit allocation method for multiple objects based on the prediction of their area and complexity variations. I hope these topics are interesting to you.



Speaker Yong Wang, PhD student, DVMM group
Title Content-based Utility Function Prediction
Abstract  

 

In this talk I will give a brief introduction to my recent work on content-based
utility function prediction, which is part of the UMA project. In our application scenario
of online media transcoding, the utility function (UF) is used to describe the relationship
between the resource limitation (generally the network bandwidth) and the utility value
(generally a video quality measure such as PSNR), and is defined using a set of
transcoding operators. The UF is an efficient way to guide transcoding in the compressed
domain. The motivation for utility function prediction comes from the desire for instant,
real-time media processing. I will explain some of the aspects involved, such as the
generation of the UF, video content feature and UF feature extraction, unsupervised UF
clustering, and content-feature-based prediction. This is ongoing work, and some
preliminary results will be presented.
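
As a rough illustration of the relationship described above, the sketch below computes PSNR for a few transcoded versions of the same content and pairs each with its bitrate, giving sample points of a utility function. The function names `psnr` and `utility_curve`, the flat pixel lists, and the bitrate values are assumptions made for this example; they are not taken from the UMA project or from the speaker's method.

```python
import math


def psnr(reference, degraded, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(degraded) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the reference and the degraded signal.
    mse = sum((r - d) ** 2 for r, d in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak * peak / mse)


def utility_curve(reference, versions):
    """Turn hypothetical (bitrate_kbps, decoded_pixels) pairs into
    (bitrate_kbps, PSNR) points sorted by bitrate."""
    return sorted((rate, psnr(reference, pixels)) for rate, pixels in versions)


if __name__ == "__main__":
    original = [10, 20, 30, 40]
    versions = [(500, [10, 20, 30, 40]), (100, [12, 18, 33, 37])]
    for rate, quality in utility_curve(original, versions):
        print(rate, quality)
```

Each point is one (resource, utility) sample; a prediction scheme of the kind described in the abstract would estimate such a curve from content features rather than by measuring every version.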


Speaker Anthony Vetro, MERL - Mitsubishi Electric Research Labs, Murray Hill, NJ
Title MPEG-2 to MPEG-4 Transcoding with Reduced Resolution
Abstract  

 

Recent advances in signal processing combined with an increase in network capacity are paving the way for users to enjoy services wherever they go and on a host of multimedia capable devices. Each of these terminals may support a variety of different formats. Furthermore, the networks they are connected to are often characterized by different network conditions, and the terminals themselves vary in display capabilities, processing power and memory capacity. Given such a dynamic environment, it becomes necessary to consider methods of adapting the content accordingly.

This talk focuses on the general problem of reduced-resolution transcoding, and more specifically on the conversion between MPEG-2 and MPEG-4. This technology enables broadcast-quality video streams to be transmitted, decoded and displayed on low-cost mobile devices.

Technical topics include:
- analysis of drift errors when transcoding to a lower spatial resolution
- presentation of various architectures to overcome sources of drift
- macroblock-level conversions, e.g., MV mapping, texture down-sampling
- rate control and bit allocation issues
- evaluation of complexity and quality

A live demo will also be shown.

 

Biography  
Anthony Vetro is with Mitsubishi Electric Research Labs in Murray Hill, NJ, where he is currently a Senior Principal Member of the Technical Staff. He received the PhD degree in Electrical Engineering from Polytechnic University in Brooklyn, NY, and his main research interests are in the areas of video coding and transmission, with emphasis on content scaling and rate allocation. He has published a number of papers in these areas and has been an active participant in MPEG standards for several years, where he is now serving as Editor for MPEG-21 Part 7, Digital Item Adaptation.

Speaker Danny Hong, PhD student, MMSP group
Title Flavor: A Language for Media Representation
Abstract  

 

Flavor has been created as a language for describing coded multimedia bitstreams in a formal way so that the bitstream parsing/generating code can be automatically produced. For this, Flavor comes with a software tool that translates a Flavor description into C++ or Java code. Since Version 5.0, the Flavor translator has been enhanced (the enhanced translator is called XFlavor) so that XML features are supported. XFlavor can transform a Flavor description into an XML schema, and it can also produce code for generating XML documents corresponding to the bitstreams described by Flavor. As a part of XFlavor, a compression tool for converting the XML representation back into the original bitstream format is provided as well.

In summary, Flavor simplifies and speeds up the development of software that processes coded multimedia information by providing the necessary code for parsing and generating bitstreams. XFlavor takes an alternate approach. Rather than providing the code for accessing bitstreams, it transforms the bitstreams into XML documents for easier processing.

First, I'll give a detailed overview of the Flavor language and its translator, and then I'll talk about XFlavor.


Speaker Dr. Aleksandra Mojsilovic, IBM T.J. Watson Research Center
Title Semantic Based Image Modelling and Retrieval
Abstract  

This talk will cover some of our recent work in image semantic modeling and retrieval.

In order to design more satisfying image navigation systems, we need tools to construct a "semantic bridge" between a user and an image database. We have recently developed a novel image indexing scheme and query language, which allow the user to introduce a cognitive dimension to the search. At an abstract level, this approach consists of:

1) learning the "natural language" that humans speak to communicate their semantic experience of images,
2) understanding the relationships between this language and objective measurable image attributes, and,
3) developing the corresponding feature extraction schemes.

We have conducted several subjective experiments in which we asked human subjects to group images, and then explain verbally why they did so. The results of this study indicated that a part of the abstraction involved in image interpretation is often driven by semantic categories, which can be broken into more tangible semantic entities, i.e. objective semantic indicators. By analyzing our experimental data, we have identified some candidate semantic categories (e.g. portraits, people, crowds, cityscapes, landscapes) and their underlying semantic indicators (e.g. skin, sky, water, object). These experiments also helped us derive important low-level image descriptors, accounting for our perception of these indicators. We have then used these findings to develop an image feature extraction and indexing scheme. In particular, our feature set has been carefully designed to match the way humans communicate image meaning. This led us to the development of a "semantic-friendly" query language for browsing and searching diverse collections of images. We have implemented our algorithm in two Internet search engines, ISee (photographic images) and ILive (medical images). ISee incorporates an image robot, the proposed indexing scheme, and a web browser, to search and browse the Internet using visual attributes. ILive is a search engine for medical applications that uses the proposed semantic-based image features to perform automatic categorization of medical images into different imaging modalities.

 

Biography  
Aleksandra (Saska) Mojsilovic was born in Belgrade, Yugoslavia, in 1968. She received her BSEE, MSEE and Ph.D. degrees from the Department of Electrical Engineering, University of Belgrade, Belgrade, Yugoslavia, in 1992, 1994 and 1997, respectively. From 1994 to 1998 she was a member of the academic staff at the University of Belgrade, Department of Electrical Engineering. From 1998 to 2000 she was with Bell Laboratories, Lucent Technologies, Murray Hill, New Jersey. Aleksandra Mojsilovic is currently a research staff member at the IBM T. J. Watson Research Center, Hawthorne, New York. Her main research interests include computer vision, image processing, multimedia, multidimensional signal processing, medical imaging and human perception. In 2001, she received the Young Author Best Paper Award from the IEEE Signal Processing Society for her paper on image retrieval with Jelena Kovacevic, Jianying Hu, Robert Safranek and Kicha Ganapathy. Dr. Mojsilovic is a member of the IEEE Signal Processing Society and currently serves as an Associate Editor for the IEEE Transactions on Image Processing.

 


Speaker Rob Turetsky, PhD student, LabROSA
Title Bridging the Gap: Aligning Songs with Transcriptions for Musical Structure Discovery

Abstract  

 

Musical structure is present at many levels of abstraction, from genre, song, movement and phrase, down to the raw signal. The ability to automatically extract musical structure at these levels would have many applications, including the semantic indexing of personal digital music libraries, audio browsing/skimming and the creation of high-level musical models for use in algorithms such as pitch extraction. Most work on the general problem of analysis of musical signals operates either on raw audio data or some type of transcription (MIDI, Humdrum, etc). While transcriptions allow musicologists to locate recurring patterns in phrases, transitions and key/scale, the vast majority of music does not exist in transcribed form. By focusing on the raw audio data of a musical performance (CD, .mp3, etc), engineers have access to signal-level characteristics, but automatic transcription of real-world audio is an extremely difficult, if not impossible, problem to solve.

We present a simple approach to bridging this gap by aligning available MIDI transcriptions with their corresponding performance. This will allow the creation of more advanced musical models with access to both note- and signal-level data. We are currently using these alignments at the signal level to create a "ground truth" database of labeled segments of recorded audio for automatic transcription. At higher levels, we envision training these models, for example, to recognize certain features of a genre (such as repetition, ABA structure). We can then use the global model to predict musical behavior at the local level. Also described are ongoing efforts in musical structure discovery at specific levels of abstraction, including a system being developed to retrieve alternate performances and cover versions of a song.


Speaker Winston Hsu, PhD student, DVMM group
Title Two approaches toward news story segmentation

Abstract  

This talk presents our investigation of news story segmentation. Two approaches under different constraints are proposed. The first is designed to guarantee real-time processing with fair performance on resource-limited devices, such as set-top boxes; the other is an ongoing statistical model fusing mid-level "perceptual" features (except closed captions).

The first method is based on anchorperson detection, where a skin-tone detector is applied to mark prospective face regions. Unsupervised clustering, measured with color histograms on regions of interest extended from the face regions, is employed to distill actual anchor shots. Speaker identification and music/speech discrimination are then applied to smooth the clustering results.

The second approach utilizes a weighted, exponentially linear family of off-the-shelf features to account for story boundary probability; the model is estimated by measuring the Kullback-Leibler divergence from empirical news video corpora. Rather than depending on heuristic rules or a specific inference graph, a feature induction procedure, which incrementally includes the most salient features, is applied to the "perceptual" feature candidates. A dynamic programming approach is then invoked to locate story segments. Moreover, we hope to expand this probabilistic framework to adopt further semantic features (e.g. cue words surrounding story boundaries) to tackle unsolved segmentation problems, such as multiple stories within an anchor shot.
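
For readers unfamiliar with the measure mentioned above, the following minimal sketch computes the Kullback-Leibler divergence between two discrete distributions, for instance an empirical boundary/non-boundary distribution and a model's estimate of it. The function name `kl_divergence` and the example numbers are assumptions for illustration only; the sketch does not reproduce the estimation or feature induction procedure of the talk.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats for two discrete
    distributions given as equal-length sequences of probabilities."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of outcomes")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0:
            continue  # terms with p_i = 0 contribute nothing
        if qi == 0:
            return math.inf  # q assigns zero mass where p does not
        total += pi * math.log(pi / qi)
    return total


if __name__ == "__main__":
    empirical = [0.9, 0.1]   # hypothetical: non-boundary vs. boundary frequency
    model = [0.8, 0.2]       # hypothetical model estimate
    print(kl_divergence(empirical, model))
```

A smaller value means the model distribution is closer to the empirical one, which is the sense in which a divergence can guide the choice of features or weights.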

 


Speaker Ana Belen Benitez, PhD student, DVMM group
Title Organization and Browsing of Annotated Images Using Multiresolution Knowledge Networks

Abstract  

 

This talk will present novel methods for organizing and browsing annotated images based on multiresolution networks representing knowledge about the images. At the highest resolution, images are organized by discovering perceptual knowledge (e.g., image clusters and visual relations), semantic knowledge (e.g., word senses and semantic relations), and statistical interrelations among these. This process relies on the integrated processing of both images and annotations and the use of the electronic dictionary WordNet. Knowledge networks at lower resolutions are constructed by clustering similar concepts together. Users can then browse the annotated images by navigating the multiresolution knowledge networks. The visualization of the knowledge networks exploits ideas from fish-eye views for concept display using example text and images, and from spring modeling for network drawing. Experiments have shown the trade-off between knowledge completeness and consistency/conciseness with an increasing number of concepts and have justified some of the proposed browsing decisions.

 


Speaker Lexing Xie, PhD student, DVMM group
Title Structure Discovery from Video Using Hierarchical Hidden Markov Models

Abstract  

 

This talk will present the problem of structure discovery, and our approach for unsupervised structure discovery from video using hierarchical hidden Markov models. Structure elements in a time sequence are repetitive segments that bear syntactic characteristics, and are often interpretable in a semantic sense. Structure is a term broad enough for many domains, such as audio-video streams, speech or music, genome data, system logs and so on. But here we shall restrict our attention to video structure with stochastic properties. In addition to many prior works that have been successful with learning the descriptions of structure from a supervised data pool, and detecting structure using these learned descriptions, a few attempts have been made in other domains as well as in video to learn the description and to locate the structure at the same time. Our approach selects a hierarchical hidden Markov model based on several domain assumptions. We will discuss the learning algorithm and its performance, as well as interesting open issues that arise in interpreting the results, model selection and feature selection.

 


Speaker Kamal Hasan Basri, Associate Director, Columbia Video Network
Title Utilizing Akamai's caching technology and CVN's visual content for superior course delivery in online lectures

Abstract  

 

What does it take to run a successful distance learning program? Basically, five elements are needed: 1) Good Bandwidth 2) Good Content 3) Good Customer Services 4) Good staff and 5) Good Marketing/Business Plan. This talk will focus only on the first two "goods", the bandwidth and the visual content. The advent of ISPs providing bandwidth for large applications such as video streaming has enhanced the quality of content providers' presentations. This talk will introduce Columbia Video Network, the distance learning office of Columbia University's School of Engineering and Applied Sciences, which has utilized Akamai Technologies' caching infrastructure to deliver its video streaming lectures to students located all over the world. The delivery of the lectures and related issues will be presented. The performance and feedback are also measured, and the visual systems of the lectures are elaborated at length, including how the technology and content established CVN as one of the best distance learning programs in the nation.

 


Speaker Dr. Deepak Turaga, Philips Research
Title Motion-compensated wavelet video coding for scalable and robust wireless video delivery
Abstract  

The advantages of using wavelets for image compression have been well recognised, especially the high coding efficiency in conjunction with the inherent scalability they provide. Recent advances in temporal filtering techniques have allowed the extension of these schemes into the video coding arena. Wavelet video coding schemes can provide flexible spatial, temporal, SNR and complexity scalability with fine granularity over a large range of bit-rates, while maintaining a very high coding efficiency. The inherent prioritization of data in this framework, as well as the availability of mature spatio-temporal wavelet filtering techniques, leads to added robustness and considerably improved error concealment properties. Wavelet video bitstreams may also be easily tailored into multiple descriptions to improve robustness when used alongside wireless path-diversity and multiple antenna wireless systems. In this talk we highlight some of these properties of the wavelet based video coding schemes, specifically targeted towards overcoming the constraints imposed by wireless networks on video transmission. We will discuss the general wavelet video framework, and then introduce some of our research, mainly our design of a new unconstrained temporal filtering framework. We will also show some video examples to illustrate the ideas and show some comparisons with state-of-the-art non-scalable solutions, i.e. H.26L. Finally, we will indicate some of our ongoing and future research directions including other temporal decomposition techniques, multiple description coding, error concealment, frame-rate upconversion etc.

Biography  
Deepak S. Turaga received the B.Tech. degree in Electrical engineering in 1997 from the Indian Institute of Technology, Bombay and his M.S. and Ph.D. in Electrical and Computer engineering in 1999 and 2001 from Carnegie Mellon University, Pittsburgh, Pennsylvania. He is currently with Philips Research USA as part of the Wireless Communications and Networking Group, where he is a Senior Member of Research Staff. He was with the Semantic Video Processing group at Intel, Portland during the summer of 1999. His research interests include multimedia signal and image processing and computer vision applications. His current research focus is on scalable video coding for wireless networks and network-adaptive video encoder optimizations. He is also interested in enhancing the error resilience of video transmitted over wireless networks, in particular, using multiple description coding, cross-protocol layer optimization and model-based error concealment schemes.

 


Speaker Trista Pei-Chun Chen, Carnegie Mellon University
Title Rate Shaping for Error Resilient Video Streaming
Abstract  

Due to the rapid growth of wireless communications, video over wireless networks has gained a lot of attention. Challenges such as the time-varying error rate and fluctuating bandwidth create the need for error resilient video transport. Joint source-channel coding (JSCC) techniques are often applied to achieve error resilient video transport with online coding. However, JSCC techniques are limited by only providing end-to-end optimization at the time of encoding and are not suitable for streaming precoded video. The encoded bitstream may not be optimal for transmission along a different path or along the same path at a later time. I will present “rate shaping” as a better solution in this talk. Rate shaping provides the optimal solution in current network conditions for each link along the path of transmission.

Given the feedback from the network layer, “dynamic rate shaping (DRS)” was proposed to “shape”, i.e., to reduce, the bit rate of a single-layered pre source-coded video, in order to meet the real-time bandwidth constraint. Conventional DRS did not consider shaping for the parity bits in addition to the source-coding bits. In this talk, I first extend rate shaping for streaming precoded video that is both pre source- and channel- coded, which I call “baseline rate shaping (BRS)”. While BRS operates on a coarse level, “fine-grained rate shaping (FGRS)” is proposed to allow for bandwidth adaptation in fine granularities. In addition to FGRS, to consider that the decoder may perform error concealment (EC) if any video data is lost during the transmission, I propose a rate-shaping scheme that is aware of the EC method used at the decoder, which is called “error concealment aware rate shaping (ECARS)”.

Biography  
Trista Pei-chun Chen received the B.S. degree and the M.S. degree from National Tsing Hua University, Hsinchu, Taiwan, in 1997 and 1999, respectively. Since August 1999, she has been working towards her Ph.D. degree in Electrical and Computer Engineering at Carnegie Mellon University, Pittsburgh, Pennsylvania. During the summer of 2000, she was with HP Cambridge Research Laboratory, Cambridge, Massachusetts, conducting research in image retrieval for massive databases. During the summer of 2001, she was with Pittsburgh Sony Design Center, Pittsburgh, Pennsylvania, designing circuits for Video Watermarking (VWM). Her research interests are in the areas of networked video, watermark/data hiding, image processing, and biometric signal processing. She is a student member of the IEEE.

Speaker Dr. Beth Logan, HP Cambridge Research Lab
Title Content-Based Music Analysis
Abstract  

The advent of MP3 is changing the world of music distribution. We are moving toward a future in which all the world's music will be ubiquitously available. This raises new issues in music retrieval which necessitate the development of techniques for content-based music analysis. In this talk, we describe several of our efforts in this direction: content-based music summarization and music similarity. Both techniques form statistical models of the spectral features of each song. Our summarization technique then automatically chooses a representative phrase for each song using the segmentation provided by its model. Our similarity technique compares the models for each pair of songs using the Earth Mover's Distance (Rubner1998) to form a distance matrix. Both techniques show great promise. We present subjective and objective results on a non-trivial database (and hopefully a couple of demonstrations).

Biography  
Beth Logan received the BSc. and B.E. degrees from the University of Queensland, Australia, in 1990 and 1991 respectively. She received the PhD in engineering from the University of Cambridge, United Kingdom, in 1998, completing a dissertation on speech enhancement. Since 1998, she has been a research scientist at HP Labs' (formerly Digital's and Compaq's) Cambridge Research Laboratory in Cambridge, Massachusetts, U.S.A. Her work there has focused on scalable organization of digital content, primarily looking at indexing and modeling of speech and music.

Speaker Dr. Milind R. Naphade, IBM T. J. Watson Research Center
Title Concepts, Context and Structure: Learning Multimedia Semantics
Abstract  

Multimedia content is an essential part of information technology. However, the difficulty in filtering, searching, and summarizing video has so far hindered the effective utilization of video databases. Users want to filter and query video by high-level (semantic) concepts, while automatic algorithms can extract only low-level features (e.g., color, texture, shape, amount of motion). Bridging this gap is thus the most challenging problem in video (multimedia) indexing, retrieval, and filtering.

In this talk I will present recent and ongoing research for mapping low-level features to high-level semantics, where the semantics relates to the concepts, context and structure in multimedia content. I will argue that supervision leads to semantics. This framework consists of representation of concepts using probabilistic multimedia objects "multijects" and representation of contextual constraints of such objects called the "multinet". To extract semantics from video, I approach the problem of video understanding as a pattern recognition problem and develop a probabilistic framework for modeling the concepts and the context of video content. I use state of the art learning techniques to model contents. For modeling inter-conceptual context I use a factor graph framework for non-causal inference. Using the TREC Video Benchmark corpus I present detection results for a large number of semantic concepts. For modeling structure I propose unsupervised modeling of the temporal evolution with graphical models.

I will then argue that it is important to reduce the amount of "dumb" supervision. Instead, I propose the use of several machine-learning techniques for active learning, unsupervised pattern discovery and multiple instance learning to alleviate the burden of labeling large data sets through "smart" supervision.

Biography  
Milind Ramesh Naphade received his B.E. degree in Instrumentation and Control Engineering from the University of Pune, India in July 1995, ranking first among the university students in his discipline. He received his M.S. and Ph.D. degrees in Electrical Engineering from the University of Illinois at Urbana-Champaign in 1998 and 2001 respectively. He was a Computational Sciences and Engineering Fellow and a member of the Image Formation and Processing group at the Beckman Institute for Advanced Science and Technology from August 1996 to March 2001. In 2001 he joined the Pervasive Media Management Group at the IBM T. J. Watson Research Center in Hawthorne, NY, where he is a research staff member. He has worked with the Center for Development of Advanced Computing (C-DAC) India, the Kodak Research Laboratories of the Eastman Kodak Company and with the Microcomputer Research Laboratories at Intel Corporation. He is a member of IEEE and the honor society of Phi Kappa Phi. He is among the early proponents of statistical modeling of semantics of multimedia content and context with more than 40 journal articles and conference publications, several book chapters and 6 patents (filed or granted). His research interests include application of learning techniques for audio-visual signal processing and analysis for the purpose of multimedia understanding, content-based indexing, retrieval and mining.

Speaker Marios Athineos, PhD Student, LabROSA
Title Sound texture modelling with linear prediction in both time and frequency domains

Abstract  

 

Sound textures - for instance, a crackling fire, running water, or applause - constitute a large and largely neglected class of audio signals. Whereas tonal sounds have been effectively and flexibly modelled with sinusoids, aperiodic energy is usually modelled as white noise filtered to match the approximate spectrum of the original over 10-30 ms windows, which fails to provide a perceptually satisfying reproduction of many real-world noisy sound textures.

In this talk we argue that this failure is due to the loss of short-term temporal structure, and we introduce a second modelling stage in which the time envelope of the residual from conventional linear predictive modelling is itself modelled with linear prediction in the spectral domain. This cascade time- and frequency-domain linear prediction (CTFLP) leads to noise-excited resyntheses that have high perceptual fidelity.

To support our claims we introduce a novel quantitative error analysis of the difference between original and resynthesis by measuring the mean proportional error within time-frequency cells across a range of timescales. This analysis confirms our expectation that CTFLP is better able to model the short-term energy variations present in certain kinds of sound textures, and provides insight into which kinds of sound are best suited to this kind of modelling.


Speaker Shahram Ebadollahi, PhD Student, DVMM
Title Indexing of Echocardiogram Videos Using View Recognition

Abstract  

 

Echocardiography is a common diagnostic modality to assess the structure and the function of the heart. Indexing echocardiogram videos at different levels of structure is essential for providing efficient access to their content for browsing and retrieval.
In this talk I present our approach for parsing the content of the echocardiograms into their constituent views using their spatio-temporal structure. We pose the problem as one of 3D object aspect recognition and use a viewer-centered, model-based approach for solving it. The spatial configuration of the heart chambers in each distinct view is used as the distinguishing feature of the views and is modeled by a Markov Random Field model. A bounded state duration HMM is used to capture the temporal transitions in the sequence of views and their durations. Results of applying the method to several echocardiogram videos will be presented.


Speaker Dr. Ching-Yung Lin, IBM T.J. Watson Research Center
Title A Systematic Approach to Multimedia Understanding
Abstract  


Modern advancements in information technology have enabled pervasive uses of digital multimedia data. Accompanying the explosive growth in the generation, storage, distribution and consumption of multimedia data are emerging requirements for indexing content, building standard exchange formats and ensuring a trustworthy framework between users. While feature-based indexing techniques have satisfied some of these requirements, a need for understanding the semantic meaning of multimedia data is foreseen and is currently driving the research paradigm to a new level. Although advances in speech/text/face recognition have been observed in recent applications, a generic framework which recognizes thousands of visual objects and acoustic information has not been seen in the literature. In this talk, I will introduce our current effort in developing frameworks for generic audio-visual object recognition and video structure understanding. I will also show our experimental results and compare them in the context of TREC Video Retrieval Benchmarking 2002. In addition, I will demonstrate our effort in building public system tools for multimedia understanding, summarization, and indexing such as VideoAnnEx, VideoAL, VideoEd, VideoSue, etc.

Biography  
Ching-Yung Lin received the B.S. and M.S. degrees from National Taiwan University in 1991 and 1993, respectively, and his Ph.D. degree from Columbia University in 2000, all in Electrical Engineering.

Since 2000, he has been a Research Staff Member at the IBM T. J. Watson Research Center, New York. His current research interests include multimedia understanding and multimedia security. Dr. Lin's team performed best in the NIST TREC video semantic retrieval benchmarking in 2001 and the video concept detection benchmarking in 2002. He designed the first successful multimedia content authentication system and the first public watermarking system to survive the print-and-scan process. Dr. Lin is the technical program chair of ITRE 2003 and will serve as a guest editor of the Proceedings of the IEEE special issue on Digital Rights Management, April 2004. He organized special sessions at ITCC 2001 and ICIP 2003, and will give a tutorial lecture on multimedia security at ICME 2003. He is the recipient of the Lung-Teng Thesis Award and an Outstanding Paper Award in CVGIP.

Dr. Lin is the author/co-author of 50 journal and conference papers and four public software tools. He holds three US patents and seven pending patents in the fields of multimedia security and multimedia semantic analysis.

Speaker Jelena Tesic, PhD Student, University of California, Santa Barbara
Title Efficient Query Processing in Relevance Feedback
Abstract  


This talk introduces the problem of repetitive nearest neighbor search in relevance feedback. An efficient search scheme is proposed for high-dimensional feature spaces. Relevance feedback learning is a popular strategy used in content-based image and video retrieval to support high-level concept queries. Our work addresses those scenarios in which a similarity or distance matrix is updated during each iteration of the relevance feedback search and a new set of nearest neighbors is computed. This repetitive nearest neighbor computation in high-dimensional feature spaces is expensive, particularly when the number of items in the data set is large. In this context, we suggest a search algorithm that supports relevance feedback for the general quadratic distance metric. The scheme exploits the correlation between two consecutive nearest neighbor sets, thus significantly reducing the overall search complexity. Detailed experimental results are provided using a 60-dimensional texture feature dataset. If time permits, I will talk about my recent work on (1) application of data mining to large image datasets, particularly aerial imagery, and (2) dimensionality reduction of Gabor texture descriptors.
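As a point of reference for the cost being reduced, the sketch below runs a brute-force k-nearest-neighbor search under a general quadratic distance d(x, q) = (x - q)^T W (x - q), with W replaced between two feedback iterations. It is not the proposed scheme; the function names, the toy data and both weight matrices are illustrative assumptions.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def quadratic_distance(x: Vector, q: Vector, w: Matrix) -> float:
    """Return (x - q)^T W (x - q) for a symmetric positive semidefinite W."""
    diff = [xi - qi for xi, qi in zip(x, q)]
    return sum(
        diff[i] * w[i][j] * diff[j]
        for i in range(len(diff))
        for j in range(len(diff))
    )


def k_nearest(data: Sequence[Vector], q: Vector, w: Matrix, k: int) -> List[int]:
    """Brute-force search: indices of the k items closest to q under W."""
    order = sorted(range(len(data)), key=lambda i: quadratic_distance(data[i], q, w))
    return order[:k]


# Toy 2-D "feature vectors" (hypothetical data, not a texture dataset).
data = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
query = (0.9, 0.1)

# Iteration 1: identity weights, i.e. plain squared Euclidean distance.
w1 = [[1.0, 0.0], [0.0, 1.0]]
# Iteration 2: feedback has (hypothetically) emphasized the first dimension.
w2 = [[10.0, 0.0], [0.0, 0.1]]

first = k_nearest(data, query, w1, k=2)
second = k_nearest(data, query, w2, k=2)
overlap = set(first) & set(second)  # neighbors shared by both iterations
```

Each iteration here rescans every item. The overlap between `first` and `second` is the kind of correlation between consecutive neighbor sets that the abstract describes exploiting.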

Biography  
Jelena Tesic is a doctoral candidate in the Department of Electrical and Computer Engineering at the University of California, Santa Barbara. She works with Prof. B. S. Manjunath at the Vision Research Lab. She received her B.Sc. in Electrical Engineering (1998) from the University of Belgrade, Serbia, and her M.S. in Electrical and Computer Engineering (1999) from the University of California, Santa Barbara. Her current research focuses on managing large multimedia datasets.

Speaker Adam Berenzweig, PhD Student, LabROSA
Title Semantic Anchor Space for Music

Abstract  

 

I will talk about a method of mapping music into a semantic space that can be used for similarity measurement, classification, and music information retrieval. The value along each dimension of this "anchor space" is computed as the output from a classifier which is trained to measure a particular semantic feature, for example music genre. In anchor space, distributions that represent objects such as artists or songs are modeled with Gaussian Mixture Models, and these distributions can be compared using an approximation to the Kullback-Leibler divergence. Evaluation is one of the most problematic aspects of this research, and several evaluation methods using various sources of human similarity judgments are explored. An artist classification experiment using the models will also be presented, with promising results. Finally, a music similarity browsing application will be demonstrated, with a novel interface that makes use of the fact that anchor space dimensions are meaningful to users.
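One common way to approximate the Kullback-Leibler divergence between two Gaussian mixtures, which has no closed form, is Monte Carlo sampling. The sketch below uses 1-D mixtures for brevity. The mixture parameters, the function names and the choice of a sampling estimator are assumptions for illustration; the talk does not state which approximation is used.

```python
import math
import random
from typing import List, Tuple

# A 1-D Gaussian mixture as a list of (weight, mean, standard deviation).
Mixture = List[Tuple[float, float, float]]


def gmm_pdf(x: float, gmm: Mixture) -> float:
    """Density of the mixture at x."""
    return sum(
        w * math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))
        for w, mu, sd in gmm
    )


def gmm_sample(gmm: Mixture, rng: random.Random) -> float:
    """Draw one sample: pick a component by weight, then sample from it."""
    weights = [w for w, _, _ in gmm]
    _, mu, sd = rng.choices(gmm, weights=weights, k=1)[0]
    return rng.gauss(mu, sd)


def kl_monte_carlo(p: Mixture, q: Mixture, n: int = 20000, seed: int = 0) -> float:
    """Estimate KL(p || q) as the mean of log p(x) - log q(x) over x drawn from p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = gmm_sample(p, rng)
        total += math.log(gmm_pdf(x, p)) - math.log(gmm_pdf(x, q))
    return total / n


# Hypothetical models of two artists along a single anchor dimension.
artist_a: Mixture = [(0.6, 0.0, 1.0), (0.4, 3.0, 0.5)]
artist_b: Mixture = [(0.5, 0.5, 1.0), (0.5, 4.0, 1.0)]

d_ab = kl_monte_carlo(artist_a, artist_b)
d_aa = kl_monte_carlo(artist_a, artist_a)
```

The divergence is asymmetric, so a similarity measure built on it would typically combine both directions, for example by summing `kl_monte_carlo(a, b)` and `kl_monte_carlo(b, a)`.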


Speaker Dr. Giridharan Iyengar, IBM T.J. Watson, Yorktown Heights, NY.
Title Information Fusion from Multiple Modalities for Multimedia Mining Applications
Abstract  


In this talk, I will describe some of our work in joint processing of audio, visual and textual information for a variety of multimedia applications. Specifically, I will focus on three broad themes: semantic concept detection in multimedia content, information retrieval in multimedia content, and detection of synchrony in multimedia events. In each of these cases, I will illustrate the promise and challenges of information fusion. The examples presented in this talk form an integral part of the IBM system at Video TREC 2002, a benchmark organized by NIST.

Biography  

Dr. Giridharan Iyengar has been a Research Staff Member in the Audio-Visual Speech Technologies Group at the IBM T. J. Watson Research Center since 1999. He received his BTech in Electrical Engineering from the Indian Institute of Technology, Mumbai in 1990. After working as an engineer at Larsen and Toubro, India for one year, he obtained his Master's degree in Electrical Engineering from the University of Ottawa, Canada. He then did his doctoral work at the MIT Media Laboratory, where he worked on video retrieval and indexing. Giri is a member of the IBM team that participates in the TREC video track organized by NIST. For the past two years, he has been the project leader of the Multimedia Mining project at IBM Research. He has authored over 35 papers, filed 9 patents (4 currently issued), served on conference program committees, and reviewed for journals in multimedia, image processing, and computer vision and image understanding. His research interests include multimodal signal processing, video indexing and retrieval, speech processing and information retrieval.

Speaker Professor Nasir Memon, Polytechnic University, New York
Title Delta Compression of File Collections
Abstract  



Delta compression techniques have been suggested for efficient representation of an updated version of a file with respect to an earlier version. In this talk we will first review the problem of delta compression and the various approaches that have been used for arriving at good solutions. We then argue that delta compression can be useful for a broader set of applications. We propose a cluster-based delta compression technique which can effectively compress a collection of related files by performing pair-wise delta compression. The problem of finding an optimal delta encoding for a collection of files by taking pair-wise deltas can be reduced to the problem of computing a branching of maximal weight in a weighted directed graph. Given the quadratic complexity of finding such an optimal branching, we employ a clustering technique that reduces the collection into small subgroups of related files, and then compress each subgroup by computing an optimal branching. To demonstrate the efficacy of our approach, we present experimental results with large collections of web pages. Our experiments show that cluster-based delta compression of these collections provides significant improvements in total compression ratio compared to compressing each file individually.
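The edge weights of such a directed graph can be illustrated with a small sketch: the weight of an edge from a reference file to a target file is the number of bytes saved by compressing the target relative to the reference rather than on its own. The code below uses zlib's preset-dictionary feature as a stand-in for a real delta coder, on synthetic data. The file contents and function names are assumptions. The branching computation itself (for example with an optimum-branching algorithm such as Edmonds') and the clustering step are not shown.

```python
import random
import zlib
from typing import Dict, Tuple


def standalone_size(data: bytes) -> int:
    """Compressed size of a file on its own."""
    return len(zlib.compress(data, 9))


def delta_size(target: bytes, reference: bytes) -> int:
    """Compressed size of target when the compressor is primed with reference.

    zlib's preset dictionary is used as a simple proxy for a delta coder;
    only the last 32 KiB of the reference can be exploited this way.
    """
    compressor = zlib.compressobj(level=9, zdict=reference)
    return len(compressor.compress(target) + compressor.flush())


# Synthetic "files": two versions of one page and one unrelated page.
rng = random.Random(0)
page_v1 = bytes(rng.getrandbits(8) for _ in range(4000))
page_v2 = page_v1[:2000] + b"<p>new paragraph</p>" + page_v1[2000:]
unrelated = bytes(rng.getrandbits(8) for _ in range(4000))
files = {"page_v1": page_v1, "page_v2": page_v2, "unrelated": unrelated}

# Edge weight (reference -> target) = bytes saved by delta-coding the target.
weights: Dict[Tuple[str, str], int] = {
    (ref, tgt): standalone_size(files[tgt]) - delta_size(files[tgt], files[ref])
    for ref in files
    for tgt in files
    if ref != tgt
}
```

Building this table needs one delta per ordered pair of files, so its cost grows quadratically with the collection size; restricting it to small clusters of related files keeps that cost manageable.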

Biography  

Nasir Memon is an Associate Professor in the computer science department at Polytechnic University, New York. Prof. Memon's research interests include Data Compression, Computer and Network Security and Multimedia Communication and Computing. He has published more than 100 articles in journals and conference proceedings and holds two patents in image compression. He has been the principal investigator on several funded research projects sponsored by NSF as well as industry. He was a visiting faculty member at Hewlett-Packard Research Labs during the academic year 1997-98. He is currently an associate editor for IEEE Transactions on Image Processing, the ACM Multimedia Systems Journal and the Journal of Electronic Imaging.

Speaker Patricia Scanlon, LabROSA
Title Using Mutual Information to Design Class-Specific Phone Recognizers

Abstract  


Information concerning the identity of subword units such as phones cannot easily be pinpointed because it is broadly distributed in time and frequency. In this talk I will show how we have used Mutual Information as a measure of the usefulness of individual time-frequency cells for various speech classification tasks, using the hand-annotations of the TIMIT database as our ground truth. Since different broad phonetic classes such as vowels and stops have such different temporal characteristics, we examine mutual information separately for each class, revealing structure that was not uncovered in earlier work on this subject. Further structure is revealed by aligning the time-frequency displays of each phone at the center of its hand-marked segment, rather than averaging across all possible alignments within each segment. Based on these results, we evaluate a range of vowel classifiers over the TIMIT test set, and I will show that selecting input features according to the mutual information criterion can provide a significant increase in classification accuracy.
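The measure itself can be sketched for a single time-frequency cell whose value has been quantized into a few discrete levels. The code below computes the empirical mutual information between a cell's quantized value and the class label; the two-level quantization, the labels and the data are hypothetical and are not drawn from TIMIT.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def mutual_information(features: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Empirical mutual information I(F; C) in bits from paired observations."""
    n = len(features)
    joint = Counter(zip(features, labels))
    f_counts = Counter(features)
    c_counts = Counter(labels)
    mi = 0.0
    for (f, c), n_fc in joint.items():
        # p(f, c) * log2(p(f, c) / (p(f) * p(c))), written with raw counts.
        mi += (n_fc / n) * math.log2(n_fc * n / (f_counts[f] * c_counts[c]))
    return mi


# Hypothetical labels and two quantized time-frequency cells.
labels = ["vowel", "vowel", "stop", "stop"]
informative_cell = ["high", "high", "low", "low"]    # tracks the class exactly
uninformative_cell = ["high", "low", "high", "low"]  # independent of the class

mi_informative = mutual_information(informative_cell, labels)
mi_uninformative = mutual_information(uninformative_cell, labels)
```

Ranking cells by this score and keeping the highest-scoring ones is one simple way to select input features under a mutual information criterion.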


Speaker Ana Belen Benitez, PhD Student, DVMM
Title Image Classification Using Multimedia Knowledge Networks

Abstract  


We present novel methods for classifying images based on knowledge discovered from annotated images. The novelty of this work lies in the automatic class discovery and the classifier combination using the extracted knowledge. The knowledge includes image clusters, word-senses and relationships between them. Concepts (image clusters and word-senses) that are statistically similar can be merged. Our knowledge classifier is constructed by training a meta-classifier to predict the presence of each concept in images. A Bayesian network is then learned with the meta-classifiers as nodes and the knowledge's structure as initial topology. A new image is first classified using the meta-classifiers, and the labels are then refined using the Bayesian network. Another important contribution of this work is the analysis of the role of visual and text descriptors in image classification. As text or text-visual descriptors perform the best, we propose to use the latter, treating text descriptions as missing data for images without annotations.


Speaker

Dr. Li Zhang

Ph.D. Operations Research (1997), Columbia University
Research Staff Member, Systems Analysis and Optimization
IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY

Title Workload Service Requirements Analysis: A Queueing Network Optimization Approach
Abstract  


We study important performance issues at high volume commercial Web sites based on a general multi-class queueing network approach. In a typical Web service environment, it is relatively easy to collect server throughput and utilization data. It is often possible to collect user response time data in a controlled environment, or with certain instrumentation on the client machines. However, the actual service time of a job (excluding the queueing time) can be very difficult to obtain. The answers to many important performance related questions depend crucially on the service times of different classes of jobs. We present a general approach to infer the per-class service times at different servers from the server throughput, utilization and the per-class response time measurements. The per-class service times are solutions to an optimization problem with queueing-theoretic formulas in the objective and constraints. We further study the impact of the variance of service times on the variance of response times, noting that these results can be used to obtain estimates of the per-class service time variances from the per-class response time measurements. We present a few case studies to demonstrate the power of our approach.
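A much simplified version of the inference problem can be sketched with the utilization law alone: for a single server with two job classes, U = x1*s1 + x2*s2, where x1 and x2 are per-class throughputs and s1 and s2 the unknown service times. Given several measurement intervals with different workload mixes, s1 and s2 can be fitted by least squares. This is an illustrative assumption, not the formulation of the talk, which also uses response-time measurements; the numbers below are synthetic.

```python
from typing import List, Tuple


def infer_service_times(
    throughputs: List[Tuple[float, float]], utilizations: List[float]
) -> Tuple[float, float]:
    """Least-squares fit of s1, s2 in U = x1*s1 + x2*s2 (utilization law)."""
    # Normal equations for the two-unknown linear least-squares problem.
    a11 = sum(x1 * x1 for x1, _ in throughputs)
    a12 = sum(x1 * x2 for x1, x2 in throughputs)
    a22 = sum(x2 * x2 for _, x2 in throughputs)
    b1 = sum(x1 * u for (x1, _), u in zip(throughputs, utilizations))
    b2 = sum(x2 * u for (_, x2), u in zip(throughputs, utilizations))
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("workload mixes are too similar to separate the classes")
    s1 = (b1 * a22 - a12 * b2) / det
    s2 = (a11 * b2 - a12 * b1) / det
    return s1, s2


# Synthetic measurements: per-class throughput (jobs/s) in three intervals,
# generated from hypothetical true service times of 10 ms and 40 ms.
measured_throughputs = [(20.0, 5.0), (10.0, 10.0), (30.0, 2.0)]
measured_utilizations = [0.40, 0.50, 0.38]

s1, s2 = infer_service_times(measured_throughputs, measured_utilizations)
```

With noisy measurements the fit would no longer be exact, and a fuller formulation would add non-negativity constraints and the response-time relations mentioned in the abstract.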

Biography  
Dr. Zhang graduated from the IEOR department, Columbia University in 1997, after receiving degrees from Purdue and Beijing University. His present work comprises the study of performance modeling and analysis techniques in the Web environment. He is also interested in resource allocation and optimal control schemes in Web farms. His other favorite research area is network measurement and time synchronization algorithms.

Speaker

Dr. Christos Dimitrakopoulos

Ph.D. Materials Science (1993), Columbia University
Research Staff Member, Silicon Technology Dept.
IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY

Title Performance and Stability of Organic Thin-Film Transistors Based on Pentacene Channels
Abstract  


In this talk I will briefly cover the fabrication, characterization and optimization of performance of organic thin-film field-effect transistors (OTFTs) based on pentacene channels deposited either by vacuum sublimation or from a pentacene precursor solution. The effects of morphology, purity, doping, device configuration, modification of surfaces by self-assembled monolayers (SAMs), and environment will also be discussed. Studies of pentacene device stability under current stress and various environmental conditions will be presented. The fabrication of OTFTs comprising single-grain pentacene channels will be presented together with device characteristics measured from room temperature down to 4 K.

Biography  

Dr. Dimitrakopoulos’ present work comprises the study of ultra-low-k dielectric materials. Previously he worked with organic semiconductor materials and devices. He is the author/co-author of 10 patents, several more pending patent applications and approximately 30 papers in this field. In 2000 he received an IBM Outstanding Innovation Award for "High Performance Organic Transistors on Plastic". He recently joined the ranks of Master Inventors at IBM Research, and at the 2002 IEDM meeting he was the co-recipient of the Paul Rappaport Award from the IEEE Electron Devices Society.

Speaker

Dr. Jack O. Chu


Ph.D. Chemistry (1984), Columbia University
Research Staff Member, Electronic Materials & Structures
IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY

Title SiGe-On-Insulator (SGOI) Substrates for Device Applications
Abstract  

Replacing bulk Si in state-of-the-art VLSI technologies with the strained Si/SiGe material system, which has enhanced transport properties, is a very promising route to high performance CMOS in the sub-100nm regime. Moreover, the "synergistic" combination of strained Si/SiGe devices with silicon-on-insulator has attracted increasing interest because of its great potential to realize high performance MOSFETs and MODFETs for low power and high speed operation. However, there is no straightforward integration scheme for fabricating such high performance devices on an insulator to reduce parasitic junction capacitance and enable low power operation. This presentation will describe a recently patented technique (US Pat. No. 6,524,935 issued Feb. 25, 2003) for the fabrication of various "high mobility" Si/SiGe heterostructures on SiGe-On-Insulator (SGOI) substrates which were generated by a wafer bonding and hydrogen-induced layer transfer process. The key advantage of this approach is that the completed SGOI structure is fully relaxed (>90%), whereas Toshiba's proposed SGOI fabrication technique, which relies on a Ge enrichment process (by oxidation), is limited to strain relaxation of up to 50%. For device applications, both n-type and p-type modulation-doped SiGe heterostructures have been fabricated on SGOI substrates, yielding high electron mobility in the range of 1500-2000 cm²/Vs and enhanced hole mobility of about 500 cm²/Vs at room temperature, respectively. Similarly, CMOS devices utilizing a tensilely strained silicon channel have also been fabricated on SGOI, yielding an electron mobility enhancement of greater than 50% and a hole mobility enhancement of about 20% for nFET and pFET devices, respectively.

Biography  

Dr. Chu is a Research Staff Member in the Electronic Materials and Device Group, and has been involved in the development and application of the UHV-CVD technique to fabricate various types of metastable silicon alloys (SiGe, SiC, SiGe:P, SiGe:B) and structures with applications to high performance bipolar and field effect transistors. In particular, high quality SiGe heterostructures have been fabricated, setting world records in bipolar device performance as well as in modulation-doped FET devices. His current efforts are on the development of high speed and low power CMOS logic technologies based upon strained Si and SiGe device heterostructures. He has authored and coauthored over 130 publications in the microelectronics field and holds over 25 related patents. He received an IBM Research Division Award for his work on understanding silylene gas phase dynamics, and is a recipient of an IBM Outstanding Technical Achievement Award for high mobility electron and hole transport in SiGe structures.

Speaker

Dr. John R. Smith

Ph.D. Electrical Engineering (1993), Columbia University
Manager, Pervasive Media Management Group
IBM T. J. Watson Research Center

Title MPEG-21 Multimedia Framework
Abstract  


MPEG-21 is an emerging standard that specifies a framework for transactions of multimedia content. MPEG-21 is built around a fundamental unit of transaction called a "digital item," which is a packaging of media resources, metadata, rights expressions, identifiers, and processing methods. Examples of digital items include packages of movies and related video out-take clips, musical recordings with graphics and liner notes, photo albums, and so on. MPEG-21 is not designed around any particular business model, but rather allows Users, who can be any participants in the value network, to seamlessly exchange digital items across networks and devices. In this talk, we discuss the goals of MPEG-21 and report on the latest developments with respect to MPEG-21 Digital Item Declaration, MPEG-21 Rights Expression Language and Rights Data Dictionary, and MPEG-21 Digital Item Adaptation.

Biography  
John R. Smith is Manager of the Pervasive Media Management Group at IBM T. J. Watson Research Center, where he leads a research team developing systems and methods for multimedia semantic indexing and retrieval. He is currently Chair of the ISO MPEG Multimedia Description Schemes (MDS) group and serves as co-Project Editor for several parts of the MPEG-7 standard. Dr. Smith received his M. Phil and Ph.D. degrees in Electrical Engineering from Columbia University in 1994 and 1997, respectively, and is currently serving as IBM Research Campus Relationship Manager for Columbia University.

 


 

 

 


Last updated: February 10, 2003.