The MPEG-4 System and Description Languages:
A Way Ahead in Audio Visual Information Representation
Signal Processing: Image Communication, Special Issue on MPEG-4, 1997 (to appear)
Introduction
The fundamental goal of standardization has always been to provide universal interoperability. An adjunct to that has also been to consider the cost of implementation, so a secondary goal has been to provide universal affordability. In many cases, notably the case of audiovisual data coding, this has meant that a very specific method has been chosen for standardization, because the lowest cost has been achieved with a fixed-function solution. The penalty has been a limit to flexibility, both in the context of being application-specific, and in extendibility to future applications and operating environments.
This situation is currently changing, under the influence of rapid technological advances and declining price points. Most notably, the clock speeds of CPUs and DSP chips are increasing very quickly, with a consequent increase in computational power. As a result, it is becoming cost effective to solve traditional problems in programmable systems. Indeed, the future is in software. Changes are also occurring in display technology, though at a slower pace, and there is a trend toward higher resolution, progressive scan, high refresh-rate displays. On the audio side, spatial audio is becoming important for entertainment applications such as games.
Such developments provide a whole new degree of freedom for designing audiovisual communication and storage systems. Because such systems are software-based, there is a less compelling need to standardize a specific algorithm, so a set of algorithms covering a range of applications can be considered. The set can also be extended in the future. The changes in display technology mean there can be a decoupling between the coded representation of the data and the presentation of the data. It is interesting to note that notwithstanding the advanced technology deployed in the MPEG-1/2 standards, the input and output data are still the same analog TV format invented almost 60 years ago! The important point to note is that the data structure of the coded data has been forced in the past to be that of the presentation format. The MPEG-1/2 syntax is composed explicitly of coded frames, but these data structure elements are of little interest as representations of real-world objects.
Standards are composed of "normative requirements", and in the case of the MPEG-1/2 standards, it should be noted that most of the normative requirements are expressed in the syntax and semantics of the standardized bitstream. The elements not expressed in the syntax concern the (fixed) data structure and the (fixed) coding algorithm. It is of course tempting to propose that all normative requirements be expressed in the syntax, and then to define the syntax (with semantics) as a programmable communication language, not a rigid specification.
MPEG-4 takes advantage of these underlying developments. It provides a coded representation of real-world audiovisual objects, as opposed to presentation-based images of those objects. It provides a truly generic language for the communication of audiovisual objects. This then establishes a very flexible environment that can be customized for specific applications and that can be adapted in the future to take advantage of new developments in coding technology.
With all the normative requirements expressed in the bitstream, with an object-based data structure, and with a software-based implementation, a user-driven, fully interactive environment is possible. Imagine that the user can access an audiovisual "scene" that is three-dimensional (both for audio and video), that has a spatial extent and spatial resolution far higher than the presentation device used to access it, and that is composed of audiovisual objects, animated in real time. Further imagine that some objects are being generated by the encoder, but that other objects may have been generated in the past and have been downloaded, or are being generated locally. Interactivity can now take two forms: first, the ability to move the presentation window around this scene and zoom in and out of it, and second, the ability to interact with the audiovisual objects themselves. This is the environment that MSDL (MPEG-4 Systems and Description Languages) enables.
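The decoupling between an object-based scene and the window used to present it can be illustrated with a small sketch. It is not MSDL syntax and does not come from the MPEG-4 specification: the names AVObject, Window, pan, zoom and visible are hypothetical, the coordinates are invented, and the scene is flattened to two dimensions for brevity.

```python
from dataclasses import dataclass


@dataclass
class AVObject:
    """Hypothetical audiovisual object placed in scene coordinates."""
    name: str
    x: float
    y: float
    width: float
    height: float
    source: str  # "encoder", "downloaded" or "local" (illustrative labels)


@dataclass
class Window:
    """Hypothetical presentation window, independent of the scene extent."""
    x: float
    y: float
    width: float
    height: float

    def pan(self, dx: float, dy: float) -> "Window":
        # Move the window across the scene without changing its size.
        return Window(self.x + dx, self.y + dy, self.width, self.height)

    def zoom(self, factor: float) -> "Window":
        # factor > 1 zooms in: the window covers less of the scene,
        # and its centre stays where it was.
        cx = self.x + self.width / 2
        cy = self.y + self.height / 2
        w = self.width / factor
        h = self.height / factor
        return Window(cx - w / 2, cy - h / 2, w, h)


def visible(scene, window):
    """Names of the objects whose bounding box overlaps the window."""
    return [
        o.name
        for o in scene
        if o.x < window.x + window.width
        and o.x + o.width > window.x
        and o.y < window.y + window.height
        and o.y + o.height > window.y
    ]


# Invented example scene: much larger than the 720x576 window viewing it.
scene = [
    AVObject("background", 0, 0, 4000, 2000, "encoder"),
    AVObject("speaker", 1800, 800, 300, 600, "encoder"),
    AVObject("logo", 3600, 100, 200, 100, "downloaded"),
    AVObject("cursor", 100, 1800, 50, 50, "local"),
]

window = Window(1600, 700, 720, 576)
print(visible(scene, window))
print(visible(scene, window.pan(1700, -650)))
print(visible(scene, window.zoom(2)))
```

Only the window changes between the three calls; the scene description is untouched. The second form of interactivity, acting on the objects themselves, would instead modify entries of the scene list.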