The goal of our work is to explore the frontier created by the advent of digital media – and particularly digital images and video – in computers, computer networks, and computer-like devices. These systems became capable of handling such real-time, bandwidth-intensive data in the early 1990s, and have totally changed the landscape of both the consumer electronics and the professional media production worlds. In 1990, digital video was limited to early adopters in professional studios, and had a price tag in the $100,000 range. Today, handheld digital video cameras that connect directly to computers can be bought in any electronics store worldwide for a few hundred dollars. What’s even more amazing, digital video cameras are now standard equipment in practically all high-end cellular phones. The effects of these technological developments are many, and they deeply affect both our personal lives and society as a whole.
Here we provide a brief overview of our own research activities in this field. Our work is diverse; it ranges from video compression and streaming, to programming languages and tools for codec developers, all the way to the analysis of the mathematical fundamentals of media representation. For example, in addition to answering obvious questions such as how to better transport compressed digital video over packet-based networks such as the Internet, we also explored new ways of structuring and delivering audio-visual content that were made possible by the new underlying computing and networking infrastructure. Through our work in the development of the MPEG-4 international standard we established the technical foundation that allowed such new content structures to be created. Our later work in collaboration with industry used this foundation to create a new model for music and video distribution based on MP4 files: collections of audio, video, graphics, and interactive components that were delivered as a whole package and as a single unit to the end user. These are to be contrasted with MP3 files, which contain only the audio part. This effort provided a convincing vision for the future of content distribution online, where media and software co-exist in a synergistic way to deliver the artistic (or other) message.
Such an integrative approach is an important component of our research philosophy, and our interest in tool building, or platform building, has had an important influence on our work. Our typical research methodology is to explore problem spaces via the design and implementation of prototype systems, identify major conceptual hurdles in the design and/or implementation, and – whenever possible – attack them using theoretical analysis and optimization techniques.
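Our past work falls in five major thematic units, all within the broad area of multimedia signal processing: video transmission and streaming, MPEG-4 and object-based multimedia, the Flavor language, content segmentation, and Complexity Distortion Theory.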
We have also started exploring two new avenues of research: music signal processing and DSP hardware.
Through the development of the first complete videoconferencing system to operate using regular workstations over Ethernet LANs in 1992-1993 (Xphone) we identified the need for rate adaptation of bandwidth-intensive video in best-effort networks. This directly led to the concept of “rate shaping.” We fully characterized optimal rate shaping (1994-1995), as well as a related problem called data partitioning, and found fast algorithms that exhibit nearly optimal behavior (within 0.5 dB). One of the key properties – invariance of the algorithm to the accounting for accumulated error – was investigated theoretically only recently (2003), and shown to relate to the spectral properties of the error signal. We tied rate shaping to network-friendly transmission over the Internet, through the use of TCP flow control but without its error control features (1996-1997). Through the implementation of a real-time streaming system we demonstrated that this combination can be an effective mechanism for video streaming that ensures fair competition with other types of network traffic (HTTP, FTP, etc.).
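To make the rate shaping idea concrete, the following is a minimal sketch of one generic way to pose it: each block of a coded picture offers a few candidate operating points (rate, distortion), and we pick one point per block so that the total rate fits a budget while distortion stays low. The sketch uses a textbook Lagrangian search with bisection on the multiplier; it is not the algorithm characterized in our papers. The names `shape_rate` and `pick`, and the sample numbers in the usage note, are assumptions made purely for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (rate in bits, distortion); hypothetical units


def pick(points: List[Point], lam: float) -> Point:
    """Return the operating point minimizing distortion + lam * rate."""
    return min(points, key=lambda p: p[1] + lam * p[0])


def total_rate(choice: List[Point]) -> float:
    return sum(rate for rate, _ in choice)


def shape_rate(blocks: List[List[Point]], budget: float,
               iters: int = 50) -> List[Point]:
    """Choose one point per block so that the total rate is <= budget.

    Generic Lagrangian relaxation: a larger multiplier penalizes rate more,
    so we bisect on it and keep the last feasible selection found.
    """
    choice = [pick(b, 0.0) for b in blocks]
    if total_rate(choice) <= budget:
        return choice  # the minimum-distortion selection already fits

    lo, hi = 0.0, 1.0
    while total_rate([pick(b, hi) for b in blocks]) > budget:
        hi *= 2.0
        if hi > 1e12:
            raise ValueError("budget is below the minimum achievable rate")

    best = [pick(b, hi) for b in blocks]
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        cand = [pick(b, mid) for b in blocks]
        if total_rate(cand) <= budget:
            hi, best = mid, cand   # feasible: try a smaller penalty
        else:
            lo = mid               # infeasible: penalize rate more
    return best
```

For example, with two blocks offering `[(10, 1.0), (6, 2.0), (2, 5.0)]` and `[(8, 0.5), (4, 1.5), (1, 6.0)]` and a budget of 12 bits, the sketch selects `(6, 2.0)` and `(4, 1.5)`. A Lagrangian search only reaches points on the convex hull of the rate-distortion trade-off, so in general it is near-optimal rather than optimal.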
We participated in the development of the specification of DAVIC, an international consortium developing video-on-demand (VoD) standards, and hosted the first VoD system interoperability event at Columbia (1997). In addition to a fully operational MPEG-2 video server, we also ran one of the first implementations of the DSM-CC protocol (part of the MPEG-2 and DAVIC specifications). The close relationship between efficient video transmission and multicasting also motivated us early on to do the first detailed mathematical analysis of multicast address management and associated protocol designs for the Internet (1994).
Our experience with VoD led us to extensive work on scheduling the transmission of streamed data, with emphasis on object-based multimedia presentations (a central theme of MPEG-4, see below). We applied operations research results in the analysis of optimal scheduling algorithms in the presence of bandwidth or delay constraints, by exploiting the similarity of the problem to scheduling jobs on a single machine (1998-1999). We also designed incremental algorithms, which can determine the schedulability of a presentation when an object is added or deleted (an important capability for online editing). We also explored the use of an early-tardy penalization framework to derive both optimal algorithms and fast heuristics (2001). This latter model was also shown to be applicable to client-side scheduling, where constraints may exist due to the limited resources of an inexpensive client device. Our interest in transmission-related issues also resulted in alternative formulations of the problem of dependent quantization (as a multiple choice knapsack problem, 2000), as well as optimal buffered compression for shape coding (a feature unique to MPEG-4 video coding, 2000).
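The single-machine analogy can be illustrated with a small sketch. Assume, for illustration only, that every object is available for transmission at time zero, that the channel has a constant bandwidth, and that each object must arrive completely by its deadline. Under these assumptions the channel plays the role of the machine, and sending objects in order of increasing deadline (the classical earliest-due-date rule) minimizes the maximum lateness. The function names and the sample presentation are hypothetical, and this is not the incremental algorithm mentioned above.

```python
from typing import List, Tuple

MediaObject = Tuple[float, float]  # (size in bits, deadline in seconds)


def max_lateness(objects: List[MediaObject], bandwidth_bps: float) -> float:
    """Largest (completion time - deadline) when sending in deadline order."""
    clock = 0.0
    worst = float("-inf")
    for size, deadline in sorted(objects, key=lambda o: o[1]):
        clock += size / bandwidth_bps      # transmission time on the channel
        worst = max(worst, clock - deadline)
    return worst


def schedulable(objects: List[MediaObject], bandwidth_bps: float) -> bool:
    """True if every object can arrive by its deadline."""
    return not objects or max_lateness(objects, bandwidth_bps) <= 0.0


if __name__ == "__main__":
    presentation = [(1000, 1.0), (4000, 5.0), (2000, 2.5)]
    print(schedulable(presentation, 1000))  # the 2000-bit object arrives late
    print(schedulable(presentation, 2000))  # doubling the bandwidth fixes it
```

Re-running the check after adding or deleting an object is the naive way to answer the online editing question; an incremental algorithm avoids recomputing the whole schedule.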
Our experience in video streaming systems also led to our development of the so-called Mobile Broadcast Video (MBV) system, in collaboration with CAMSAT (1998-2000). CAMSAT, a company started by the founding CEO of CNN, Reese Schonfeld, wanted to duplicate the impact of satellite TV on network news journalism by creating a new portable system for live video reporting from anywhere in the world. The system would utilize low-earth orbit satellite systems (such as Teledesic’s) and a low-power system on the ground. We were assigned the design of the prototype system, with the exception of the satellite antenna. We used a two-tiered approach where a camera unit transmits real-time MPEG-2 video via wireless LAN (IEEE 802.11b) to a laptop, which then re-transmits the data to a satellite. We started using 2 Mbps wireless adapters, and soon moved to 11 Mbps as they became available late in 1999. Our final prototype system included a highly portable camera unit that fits in a backpack, a laptop as an intermediate unit, and a regular PC as the headquarters’ unit, with all the associated software developed in-house.
Following Columbia’s tradition of involvement in the MPEG standardization activity, we became part of the development of the MPEG-4 specification from the very beginning (1995). Our objective was to pursue techniques that would allow multiple types of media – images, video, graphics, audio, text, animations – to be defined and co-exist in the same environment, with complete support for both streaming over a computer network and user interaction. We subsequently worked extensively on problems related to MPEG-4 and object-based multimedia presentations in general, and contributed at several levels: from its architectural definition all the way to individual details. Important contributions included the concept of separation of scene description and object data using a parametric – rather than programmatic – model. This model is now part of MPEG-4, although VRML was used as the basis for the scene description (a choice we consider unfortunate as we feel it has been an impediment to the proliferation of this part of the standard). Additional contributions include the design of the server-side user interaction architecture, the design of the MP4 file format (based on Apple’s QuickTime), and the construction of MPEG-4’s framework for Intellectual Property Management and Protection (IPMP). In May 1998 we in fact hosted the first meeting of the IPMP group, where the IPMP architecture was defined. In October 1998 we did one of the first two complete demonstrations of streaming MPEG-4 content. We transmitted via satellite from Sunnyvale, CA, to Atlantic City, NJ, where an MPEG meeting was held. Working with Lockheed Martin and Xbind, which provided satellite facilities and networking software, respectively, we provided all other system components: MPEG-4 server software, player software (using parts of the MPEG-4 reference implementation), and original content.
In 2000-2001 we led a start-up effort to commercialize MPEG-4 technology for the distribution of music as collections of audio, video, graphics, and interactive components that were delivered as a whole package and as a single unit to the end user (through Flavor Software Inc.). These are to be contrasted with MP3 files, which contain only the audio part. The company developed an MPEG-4 player (both standalone and a plug-in for Winamp) as well as content creation tools, but subsequently froze its operations, awaiting a better regulatory and business environment. [An important consideration was the availability of illegally free audio content initially through Napster and then through many more peer-to-peer network services. To this day all commercial music download and/or streaming services have failed, and only Apple has recently made a credible new entry in the field.] The technology used was largely based on results developed at Columbia. Among the innovations introduced (with a patent pending) was the capability to load the graphical user interface of the player (called a “skin”) from the content file itself (Apple implemented this in QuickTime as “Media Skins”).
The design and development of Flavor is another example of how real problems encountered in building systems spur new ideas: while developing MPEG-2 code for implementing rate shapers, it became apparent that the way the standard was specified (pseudo C-like with extensive use of text) was primitive. Other disciplines, such as circuit design and network software, have long relied on software tools not only to automate mundane tasks but also to generate optimized solutions. Unfortunately compression – and, more generally, media representation – has managed to be software-free even after 50 years of evolution.
Our initial solution was a Perl script that would read simple descriptions of Huffman tables and automatically generate C code for parsing them. Later on, these simple descriptions became an entire language, called “Formal Language for Audio-Visual Object Representation,” or Flavor. Flavor has been created as a language for describing coded multimedia bitstreams in a formal way so that the code for reading and writing bitstreams can be automatically generated. It is an extension of C++ and Java in which the type system incorporates bitstream representation semantics. This allows one to describe in a single place both the in-memory representation of data and their bitstream-level (compressed) representation. Flavor also comes with a translator that automatically generates standard C++ or Java code from the Flavor source code so that direct access to coded multimedia information by application developers can be achieved with essentially zero programming. Flavor has been adopted as the syntactic description language of the MPEG-4 Systems specification as well as the MPEG-4 Structured Audio specification. It has gone through many enhancements and is currently in its fifth major release. It now features support for XML, as well as the fastest Huffman decoder software generator. The software has been made into an open source project, runs on Windows and all flavors of UNIX, and is freely available at http://flavor.sourceforge.net. We were recently (2003) awarded a 3-year NSF ITR grant to extend this work, through the Next Generation Software Program. Flavor also won the ACM Multimedia 2004 Open Source Software Competition.
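The idea of deriving a parser from a simple table description can be sketched as follows. This is not Flavor syntax and not the output of the Flavor translator: instead of generating source code, the sketch builds a decoder at run time from a table that maps variable-length codewords to symbols. The table format, the function name `build_decoder`, and the sample code are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List


def build_decoder(table: Dict[str, str]) -> Callable[[str], List[str]]:
    """Build a decoder from a {codeword: symbol} description.

    Codewords are strings of '0' and '1'. The table must be prefix-free,
    otherwise decoding would be ambiguous.
    """
    codes = sorted(table)
    # In sorted order, a codeword that prefixes any other codeword also
    # prefixes its immediate successor, so adjacent pairs suffice.
    for a, b in zip(codes, codes[1:]):
        if b.startswith(a):
            raise ValueError(f"codeword {a!r} is a prefix of {b!r}")

    def decode(bits: str) -> List[str]:
        symbols: List[str] = []
        current = ""
        for bit in bits:
            current += bit
            if current in table:           # a complete codeword was read
                symbols.append(table[current])
                current = ""
        if current:
            raise ValueError("bitstream ends inside a codeword")
        return symbols

    return decode


if __name__ == "__main__":
    decode = build_decoder({"0": "A", "10": "B", "11": "C"})
    print(decode("010110"))  # ['A', 'B', 'C', 'A']
```

A code generator would emit the equivalent lookup logic as source code for the target language, which is what allows the table description to remain the single point of definition.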
Our research has, for the most part, stayed away from content analysis as it requires a completely different methodology and set of tools. We pursued, however, two particular cases that have practical interest and can generate useful results: a) using content segmentation to help encoders do a better job, and b) helping users do manual segmentation by minimizing the number of mouse clicks that they have to make (semi-automatic segmentation).
In early work that we did at Bell Laboratories in 1993 and 1994, we used elliptical models to robustly detect the position of the head in head-and-shoulder video sequences typical in videoconferencing scenes. By detecting the general location of the head, we were able to instruct the encoder to spend more bits in this area, and fewer bits on the background. This way, the bits were spent in perceptually important parts of the image. This is very important in very low bit rate applications, where the total bandwidth may be severely limited (below 64 Kbps). The rate control schemes we developed were called buffer rate and size modulation, as by modulating the operating buffer output rate and size we could “trick” the encoder into spending more or fewer bits, as desired. Equilibrium (or “balance”) equations ensured that the total average bit rate was not exceeded.
These techniques were further refined at Columbia in 1997, where we extended them to three dimensions, i.e., including the temporal one. In addition to controlling the bits spent in an area, we could also control the number of times per second it was coded in the bitstream. By locating the position of the eyes and mouth, we were able to spend more bits but fewer frames per second on the eyes (that need clarity for eye contact) and fewer bits but more frames per second on the mouth (that needs frequent update to properly correlate with uttered speech). An important property of our algorithms is that they do not require any change to the decoder, and thus are fully compatible with all modern video coding schemes.
The second project in segmentation involved the minimization of the number of mouse clicks that a user needs to make to segment a given image or video. We designed a novel user interface concept called “interactive rubberband” (1998-1999), in essence a dynamically controlled rectangle within which the computer performs contour searching. The user controls the size of the rectangle, observing the solution that the computer offers on the screen. The rectangle is grown for as long as the computer tracks the right curve; when it starts failing, the user clicks to stop the current rectangle and starts a new one. In the temporal dimension (for video), the user segments two temporally distant frames and the computer will track the curves in both directions, selecting between the offered solutions (interpolation). When the computer fails, the user can intervene at the problematic frame and segment it by hand. The computer can then continue in the two newly created temporal subsegments. We have built a demonstration version of this system which was later enhanced at HP. It is now one of the easiest, if not the easiest, systems to work with. [Unfortunately the software is not available for public download as HP has not allowed its public release.]
Media representation, or the way we represent information in digital form, is squarely based on the concepts of Information Theory (IT), established by Shannon: entropy for lossless representation, and rate-distortion for lossy representation (e.g., compressed audio or video signals). IT assumes that the encoder and decoder are black boxes and, using probability theory and stochastic processes, establishes lower bounds on achievable representation/transmission rates. An interesting question is what happens if we acknowledge that, in most cases, the black box is in fact a computer or a Turing machine. Following this route, Kolmogorov and Chaitin defined the Complexity of a string ‘x’ as the length of the smallest program that, when run on a Turing machine, outputs ‘x’. It has been shown that, under certain conditions, complexity is asymptotically equivalent to entropy. We extended this concept to include distortion, thus addressing lossy representations. We defined the Complexity Distortion function and proved (1997-2001) that, despite their different natures, Complexity Distortion and Rate Distortion predict asymptotically the same results, under stationary and ergodic assumptions. This closes the circle of representation models, from probabilistic models of information proposed by Shannon in Information and Rate Distortion theories, to deterministic algorithmic models, proposed by Kolmogorov in Complexity Theory and our extension to lossy source coding, or what we call Complexity Distortion Theory.
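In symbols, the quantities discussed above can be sketched as follows. The first two lines are the standard textbook definition of Kolmogorov complexity and its asymptotic link to the entropy rate $H$; the third is Shannon's rate-distortion function. The last two lines are only one plausible way to write a complexity distortion quantity and its asymptotic equivalence, given here as an assumption for illustration; the notation ($U$ for a universal Turing machine, $|p|$ for program length, $d$ for a per-letter distortion measure, $C_D$ for the complexity distortion quantity) is ours and may differ from the published definitions.

```latex
\begin{align*}
K(x) &= \min \{\, |p| : U(p) = x \,\}
  && \text{(Kolmogorov complexity)} \\
\lim_{n \to \infty} \tfrac{1}{n}\, \mathbb{E}\!\left[ K(X^n) \right] &= H
  && \text{(under suitable conditions on the source)} \\
R(D) &= \min_{p(\hat{x} \mid x)\,:\; \mathbb{E}[d(X, \hat{X})] \le D} I(X; \hat{X})
  && \text{(rate-distortion function)} \\
C_D(x^n) &= \min \{\, |p| : U(p) = y^n,\ d(x^n, y^n) \le D \,\}
  && \text{(assumed form)} \\
\lim_{n \to \infty} \tfrac{1}{n}\, C_D(X^n) &= R(D)
  && \text{(stationary, ergodic sources)}
\end{align*}
```

The last line is the sense in which the probabilistic and the algorithmic views predict the same asymptotic behavior.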
We have done further practical investigations in algorithmic representation of information by using Genetic Programming (GP) to encode images. GP uses genetics-inspired mutations and crossovers on evolving tree-structured “programs” that are refined based on their performance; it thus acts as an interesting search algorithm in the space of all possible programs based on the particular language. Initial results are promising, and have shown that the process depends critically on the effectiveness of the language used. These concepts have very interesting links to the new field of “bio-EE,” or the application of signal processing and information-theoretic concepts to the understanding of the operation of the genetic code.
We have started exploring new avenues of research in music signal processing, and particularly algorithms and tools for production, computer-based sound synthesis, and sound reinforcement applications. This work has been motivated by our own direct involvement in this field as technology users, and is still at an early stage of development. We have built a state-of-the-art music production environment including Digidesign ProTools HD, TASCAM Gigastudio, and a Yamaha 02R96 digital mixer, together with software-based tools such as Propellerhead Reason, ReBirth, and ReCycle. A new course on Music Signal Processing will be offered in Spring 2004 at Columbia, in preparation for further work in this area.
Our involvement in DSP hardware has also just started, and relates both to our interest in music signal processing and to the design of novel video coding engines addressing low-power operation. In the past, all of our work was implemented on general-purpose microprocessors; we thus encountered significant limitations in terms of computational speed, but also could not address (from a prototype implementation point of view) new platforms where power consumption is critical (PDAs, cellular phones, etc.). PC-based prototypes in these areas fail to capture all the complexities of the underlying system, and cannot benefit from custom hardware features that are available in DSP processors or custom-designed ASICs. We are currently collaborating with the Integrated Systems group at Columbia to offer a DSP Hardware course as part of the digital design series of courses. We are also collaborating with the University of Patras, Greece, in the prototype implementation of a novel video codec chip with ultra-low power consumption.