Dan Ellis: Research
One of the characteristics of research into sound analysis is the large
amount of data involved. Sound is in itself a fairly voluminous data stream,
and it is the subtlety with which information is represented within it that
makes the whole research area so rich. In addition, techniques of statistical
classification, and the procedures of empirical evaluation, often call for
large collections of sound examples.
Confronted with such large volumes of data, it becomes critical to have
efficient and convenient tools for inspecting and investigating the data.
Particularly in the early stages of investigating a new research idea, it
is valuable to be able to inspect the data both at the input
and output of algorithms to get a 'feeling' for what is happening,
to diagnose unexpected results, and to identify new opportunities for analysis.
We use many different kinds of representation in our work on sound analysis.
In addition to the basic waveform, there are time-frequency representations
of the spectrogram family, scalar or vector features resulting from analysis
algorithms, and discrete labels for particular time ranges generated by
classifiers. Ideally, we want to be able to visualize each of these data
sets in the most convenient form, and to be able to make direct comparisons
between data sets corresponding to the same underlying sound, even when
their formats may be very different. Each new question may require a new
form or configuration of the display elements, so the tool needs to be very
flexible and easy to extend for new datatypes.
Although any attempt to enumerate all the possible dimensions of every
dataset we may ever wish to work with is bound to be incomplete, certain
aspects seem universal. At the top
level, there is the dimension of soundfile within a corpus: the sound visualization
should probably be invoked from a kind of database manager that allows browsing
among the different examples in a database, and that keeps track of the
correspondence between the base waveform files and their analyses in the
various other representations (which may occur in separate files, or as
records in a single large archive file).
Within the sound display, the main organizing dimension is the time axis,
since all sounds have a finite, usually explicit, duration. As in the sketch
above, this can be used most successfully when various data displays share
a common left-to-right time axis, and are displayed in a stacked, synchronized
pattern. Other dimensions that may apply to multiple representations are
the channel within a sound (e.g. for stereo, or for the 16-channel meeting
recordings) and frequency (e.g. a separate plot of vertical 'slices' through
a spectrogram, where frequency becomes the x-axis, with the vertical axis
showing the energy in each band).
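As a rough illustration of this stacked, synchronized layout, here is a
minimal sketch using matplotlib's shared-axis subplots (numpy and matplotlib
assumed; the waveform and the scalar feature track are synthetic stand-ins
for real data):

    import numpy as np
    import matplotlib.pyplot as plt

    sr = 8000
    t = np.arange(2 * sr) / sr
    x = np.sin(2 * np.pi * 440 * t) * np.exp(-t)   # synthetic stand-in waveform

    # Three stacked panels locked to a single left-to-right time axis
    fig, (ax_wave, ax_spec, ax_feat) = plt.subplots(3, 1, sharex=True)

    ax_wave.plot(t, x)
    ax_wave.set_ylabel('waveform')

    ax_spec.specgram(x, NFFT=256, Fs=sr)           # time-frequency panel
    ax_spec.set_ylabel('freq / Hz')

    frame_t = np.arange(0.0, 2.0, 0.01)
    energy = np.interp(frame_t, t, x ** 2)         # stand-in scalar feature
    ax_feat.plot(frame_t, energy)
    ax_feat.set_ylabel('energy')
    ax_feat.set_xlabel('time / s')

    plt.show()

Zooming or panning any one panel then moves all three together, which is
exactly the kind of direct comparison between representations described above.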
The different data formats that it will be necessary to support include:
- Basic waveform display, which can get tricky for very long files displayed
at a very compressed scale (see the envelope sketch after this list).
- Basic spectrogram display, i.e. a time-frequency display in grayscale
or pseudocolor. This can be calculated on-the-fly to avoid having to store
it in a separate file. Interactive adjustment of the time-frequency tradeoff
factor, and of the colormap range and scaling, should be provided (see the
spectrogram sketch after this list).
- Other 1-D datasets with uniform time sampling, such as the output of
an energy level detector. A variety of file formats should be supported,
depending on what is in use. In certain cases, plotting several 1-D functions
on a single set of axes may be desirable, as for instance with posterior
probabilities, which are often dominated by a single value.
- Other 2-D datasets read from file, but displayed like the spectrogram.
- 3-D datasets (such as the within-channel short-time autocorrelation, a
function of time, frequency channel, and autocorrelation lag) are hard to
handle. Sometimes they are shown as animations; alternatively, we could pick
out the frame corresponding to the current 'window center' time (see the
slice sketch after this list).
- Discrete label tags, for instance, time-aligned word hypotheses coming
out of a speech recognizer. It is very valuable to have these aligned with
the underlying features, even though the representations are so different
(see the label sketch after this list).
... and doubtless many others.
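For the long-waveform problem above, one common trick is min/max envelope
decimation: each horizontal pixel displays the extremes of all the samples it
covers, so brief transients survive the compression. A minimal sketch (numpy
and matplotlib assumed; the signal is synthetic):

    import numpy as np
    import matplotlib.pyplot as plt

    sr = 16000
    x = np.random.randn(60 * sr) * np.exp(-np.linspace(0, 5, 60 * sr))

    pixels = 800                              # horizontal display resolution
    hop = len(x) // pixels
    frames = x[:pixels * hop].reshape(pixels, hop)
    lo, hi = frames.min(axis=1), frames.max(axis=1)

    t = np.arange(pixels) * hop / sr
    plt.fill_between(t, lo, hi)               # one vertical bar per pixel column
    plt.xlabel('time / s')
    plt.ylabel('amplitude')
    plt.show()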
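For the on-the-fly spectrogram, the time-frequency tradeoff comes down to the
analysis window length. The sketch below (numpy and matplotlib assumed; the
chirp is synthetic) computes a short-time Fourier transform directly, so the
window length could just as well be bound to an interactive control:

    import numpy as np
    import matplotlib.pyplot as plt

    def stft_mag_db(x, nfft, hop):
        # Magnitude STFT in dB; nfft sets the time-frequency tradeoff
        win = np.hanning(nfft)
        n_frames = 1 + (len(x) - nfft) // hop
        frames = np.stack([x[i * hop:i * hop + nfft] * win
                           for i in range(n_frames)])
        mag = np.abs(np.fft.rfft(frames, axis=1))
        return 20 * np.log10(mag + 1e-10)

    sr = 8000
    t = np.arange(sr) / sr
    x = np.sin(2 * np.pi * (500 + 400 * t) * t)    # synthetic chirp

    fig, axes = plt.subplots(1, 2, sharey=True)
    for ax, nfft in zip(axes, (64, 512)):          # short vs. long window
        S = stft_mag_db(x, nfft, hop=32)
        ax.imshow(S.T, origin='lower', aspect='auto',
                  extent=[0, len(x) / sr, 0, sr / 2])
        ax.set_title('NFFT=%d' % nfft)
        ax.set_xlabel('time / s')
    axes[0].set_ylabel('freq / Hz')
    plt.show()

Recomputing after a change of nfft is cheap enough to hang off a slider, and
the colormap limits could be exposed for interactive scaling in the same way.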
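For the 3-D case, here is the slice idea in miniature: pick the analysis
frame nearest the current window-center time and show it as a 2-D image (the
array and frame rate here are invented stand-ins for a real analysis output):

    import numpy as np
    import matplotlib.pyplot as plt

    frame_rate = 100.0                      # analysis frames per second (assumed)
    data = np.random.rand(500, 64, 128)     # (time, freq channel, autocorr lag)

    t_center = 2.73                         # current window-center time, seconds
    frame = data[int(round(t_center * frame_rate))]

    plt.imshow(frame, origin='lower', aspect='auto')
    plt.xlabel('autocorrelation lag')
    plt.ylabel('frequency channel')
    plt.title('t = %.2f s' % t_center)
    plt.show()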
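And for the discrete label tags, a minimal sketch of drawing time-aligned
words in a strip that could share its time axis with the feature panels (the
word list is made up for illustration):

    import matplotlib.pyplot as plt

    labels = [(0.00, 0.35, 'the'), (0.35, 0.80, 'quick'), (0.80, 1.40, 'fox')]

    fig, ax = plt.subplots(figsize=(6, 1))
    for start, end, word in labels:
        ax.axvspan(start, end, alpha=0.2)   # shaded extent of each word
        ax.axvline(end)                     # boundary tick
        ax.text((start + end) / 2, 0.5, word, ha='center', va='center')
    ax.set_xlim(0, 1.5)
    ax.set_yticks([])
    ax.set_xlabel('time / s')
    plt.show()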
The development of an in-house visualization solution would include:
- Evaluation of existing tools. Even though we may need to have something
that has been developed in-house to allow customization and extension for
new projects, there may well be existing pieces that we can incorporate
or extend. Some related links are included below.
- Evaluation of development environments. Considerations include cross-platform
portability, ease/rapidity of development and modification, accessibility
for sound data, and compatibility with existing tools or libraries. Candidate
solutions include Java, Tcl/Tk, and Python. The multi-level approach, where
extensions to a scripting language like Tcl or Python are written and compiled
in C or C++, then loaded as new additions to the scripting language, seems
particularly successful.
- Development of infrastructure/framework: Database/corpus manager, basic
display frame, and base classes for time-oriented, stacking display panels
(see the panel-class sketch after this list).
- Development of data display widgets within the framework for different
data types and displays: waveform, spectrogram, 1-D, 2-D, etc.
- Development of tools for customization, e.g. setting up new default
layouts, automatically opening particular files, or associating derived
file types with their parents, etc.
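To make the framework idea concrete, here is a hypothetical sketch of the
panel base class and display frame (every class and method name is invented
for illustration; numpy and matplotlib assumed):

    import numpy as np
    import matplotlib.pyplot as plt

    class TimePanel:
        # Base class: one horizontal strip on the shared time axis
        def draw(self, ax, t0, t1):
            raise NotImplementedError

    class WavePanel(TimePanel):
        def __init__(self, x, sr):
            self.x, self.sr = x, sr
        def draw(self, ax, t0, t1):
            i0, i1 = int(t0 * self.sr), int(t1 * self.sr)
            ax.plot(np.arange(i0, i1) / self.sr, self.x[i0:i1])
            ax.set_ylabel('wave')

    class SpectrogramPanel(TimePanel):
        def __init__(self, x, sr):
            self.x, self.sr = x, sr
        def draw(self, ax, t0, t1):
            i0, i1 = int(t0 * self.sr), int(t1 * self.sr)
            ax.specgram(self.x[i0:i1], NFFT=256, Fs=self.sr,
                        xextent=(t0, t1))
            ax.set_ylabel('freq')

    class DisplayFrame:
        # Stacks panels and draws them all over a common time range
        def __init__(self, panels):
            self.panels = panels
        def show(self, t0, t1):
            fig, axes = plt.subplots(len(self.panels), 1,
                                     sharex=True, squeeze=False)
            for panel, ax in zip(self.panels, axes[:, 0]):
                panel.draw(ax, t0, t1)
            axes[-1, 0].set_xlabel('time / s')
            plt.show()

    sr = 8000
    x = np.random.randn(4 * sr)
    DisplayFrame([WavePanel(x, sr), SpectrogramPanel(x, sr)]).show(1.0, 3.0)

A new datatype would then only need its own TimePanel subclass; the frame,
the shared time axis, and the stacking behavior come for free.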
Some relevant, related ideas include:
- Snack, a set of extensions
for handling and displaying sound within Tcl/Tk, and Wavesurfer,
a sound editor based on Snack, by the same author, Kåre Sjölander.
- Transcriber, a tool for creating and browsing time-aligned text transcriptions
of audio recordings, which gets its sound functionality from Snack. We are
using Transcriber in the Meeting Recorder project.
- I saw a very interesting talk by Lloyd Watts that included an animated
sound viewer with a zoomed-in section, showing two levels of detail at once.
There are a few static screenshots on this page from Lloyd's web site, but
you really had to see it in motion.
- Xwaves (and its label viewer, xlabel) are probably the best-known programs
used in the research community. They are expensive, however, and I'm not
even sure of their status since Entropic
was taken over by Microsoft. (This
page gives you some idea of the look of the program; unfortunately,
it's in German).
- One of several earlier attempts of my own to build a sound visualizer of
this kind shows the various stages of representation in a running speech
recognizer. Related projects include pfview.
Last updated: $Date: $
Dan Ellis <firstname.lastname@example.org>