Distorted speech



Introduction

Speech often arrives at the ears of the listener in a rather different form from that in which it was produced. The effects of reverberation, communication channel restrictions and failures, and the presence of other sound sources all contribute to the degraded signal. Listeners possess strategies for handling many types of distortion, and uncovering these strategies is important for designing robust systems for computational hearing.

The range of possible distortions is quite wide, and many forms of signal modification have been the subject of psychoacoustic investigation since Fletcher's work in the early part of the 20th century (see Fletcher, 1953; Allen, 1994). Speech signals have been subjected to such distortions as:

The current demonstration allows the user to produce some of these forms of distortion.

The demonstration

To start the demonstration, type 'distortion' at the MATLAB prompt. A window like the one above will appear. Choose the load option in the file menu (1) to select a sound file (some speech and music examples are provided with the distribution, but any .au or .snd format files can be loaded). A spectrogram of the sound appears in the top panel (4). As distortions are applied, this display is updated to show the distorted sound spectrogram. To hear the signal, click anywhere in the spectrogram.

Shortly after the spectrogram appears, a series of waveforms will be displayed in the lower panel (3). These correspond to (downsampled) Hilbert envelopes of the outputs of a bank of auditory gammatone filters. The centre frequencies of the filters are shown on the left side of the display. It is advisable to limit the number of channels chosen to 10 or below unless you are dealing with short (< 1 second) signals. On some platforms, memory and compute speed become an issue with larger numbers of channels.

[ASIDE: It is important to note that the filter bandwidths are NOT adjusted to ensure equal coverage of all spectral regions. Bandwidths are set to those defined in Glasberg & Moore (1990) and represent estimates of the effective frequency-dependent resolution of the auditory periphery. Future versions may allow proper treatment of this issue, but the redundancy of the speech signal ensures that even an 8-channel gammatone analysis gives a perfectly adequate 'clean' baseline for subsequent distortions.]

The spectrogram display has a linear-in-Hz y-axis, whereas the filter centre frequencies (CF) are arrayed on an ERB-rate scale. The latter is approximately logarithmic.
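The relationship between the two scales can be sketched as follows. The ERB and ERB-rate formulae are those published in Glasberg & Moore (1990); the frequency range (100 to 4000 Hz), the choice of 8 channels and all function names are assumptions made for illustration, and the demonstration's own channel spacing may differ.

```python
import math

def hz_to_erb_rate(f_hz):
    """ERB-rate value at f_hz (Glasberg & Moore, 1990)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_rate_to_hz(erb_rate):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (erb_rate / 21.4) - 1.0) * 1000.0 / 4.37

def erb_bandwidth(f_hz):
    """Equivalent rectangular bandwidth in Hz at centre frequency f_hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def centre_frequencies(low_hz, high_hz, n_channels):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    lo = hz_to_erb_rate(low_hz)
    hi = hz_to_erb_rate(high_hz)
    if n_channels == 1:
        return [erb_rate_to_hz((lo + hi) / 2.0)]
    step = (hi - lo) / (n_channels - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_channels)]

# Hypothetical settings: 8 channels between 100 Hz and 4 kHz.
for cf in centre_frequencies(100.0, 4000.0, 8):
    print(f"CF {cf:7.1f} Hz   ERB {erb_bandwidth(cf):6.1f} Hz")
```

Equal steps on the ERB-rate scale give steps in Hz that widen with frequency, which is why the channel display looks compressed at the top when compared with the linear spectrogram axis.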

You can listen to individual bands by shift-clicking on their waveform (on platforms other than the Mac, this might involve using the right button). Clicking on an individual waveform selects or deselects that band. A deselected band does not contribute to the overall output. By selecting and deselecting bands, you can explore various forms of spectral filtering. Alternatively, use the presets menu (2) to speed up the selection process.
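The idea of filtering by band selection can be sketched as a boolean mask over the channels, with the output formed by summing only the selected bands. The function names and the way the band edges are specified are hypothetical, and the contents of the presets menu are not reproduced here.

```python
def select_bands(centre_freqs, low_hz=None, high_hz=None, invert=False):
    """Return one boolean per channel: True means the band contributes.

    low_hz / high_hz bound the pass region (None = unbounded);
    invert=True turns a bandpass selection into a bandstop one.
    """
    mask = []
    for cf in centre_freqs:
        above = low_hz is None or cf >= low_hz
        below = high_hz is None or cf <= high_hz
        inside = above and below
        mask.append(inside != invert)
    return mask

def mix_selected(band_signals, mask):
    """Sum the selected band waveforms sample by sample."""
    n_samples = len(band_signals[0])
    out = [0.0] * n_samples
    for signal, keep in zip(band_signals, mask):
        if keep:
            for i, sample in enumerate(signal):
                out[i] += sample
    return out

# Hypothetical centre frequencies for a 4-channel analysis.
cfs = [100.0, 500.0, 1500.0, 4000.0]
lowpass = select_bands(cfs, high_hz=1000.0)
highpass = select_bands(cfs, low_hz=1000.0)
bandstop = select_bands(cfs, low_hz=400.0, high_hz=2000.0, invert=True)
```

Lowpass, highpass, bandpass and bandstop conditions then differ only in which channels are left selected.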

Other distortions are controlled by the popup menus on the right of the display.

Popup menu (6) allows the envelope in each band to be replaced by a noise waveform, or by a constant (set to the mean of the envelope). Popup menu (7) allows the carrier to be a noise signal or a tone. The noise signal results from passing a wide-band noise through each gammatone filter. The tone frequency is set to channel CF.
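Two of these options can be sketched by treating each band as the product of an envelope and a carrier. This is a minimal illustration only: it assumes the envelope has already been extracted at the full sample rate, all function names are hypothetical, and the noise options are omitted because they depend on the gammatone filterbank.

```python
import math

def constant_envelope(envelope):
    """Replace the envelope by a constant equal to its mean."""
    mean = sum(envelope) / len(envelope)
    return [mean] * len(envelope)

def tone_carrier(cf_hz, fs_hz, n_samples):
    """Sinusoidal carrier at the channel centre frequency."""
    return [math.sin(2.0 * math.pi * cf_hz * i / fs_hz)
            for i in range(n_samples)]

def resynthesise_band(envelope, carrier):
    """Band waveform = envelope multiplied by carrier, sample by sample."""
    return [e * c for e, c in zip(envelope, carrier)]

# Hypothetical example: a rising envelope imposed on a 1 kHz tone at fs = 8 kHz.
fs = 8000.0
env = [i / 7.0 for i in range(8)]
band = resynthesise_band(env, tone_carrier(1000.0, fs, len(env)))

# The same carrier with the envelope flattened to its mean.
flat = resynthesise_band(constant_envelope(env),
                         tone_carrier(1000.0, fs, len(env)))
```

Keeping the envelope and swapping the carrier preserves the slow amplitude fluctuations in the band while discarding its fine structure; flattening the envelope does the reverse.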

Popup menu (8) specifies a maximum time shift (in ms) to be applied to each channel. The actual time shift applied is a pseudorandom delay bounded by this figure.
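A bounded pseudorandom delay per channel might be sketched as below. The uniform distribution, the fixed seed, and the choice to pad with zeros and truncate to the original length are all assumptions; the demonstration may draw and apply its delays differently. Function names are hypothetical.

```python
import random

def random_delays(n_channels, max_shift_ms, fs_hz, seed=0):
    """Draw one delay (in samples) per channel, each between 0 and the maximum."""
    rng = random.Random(seed)  # seeded so the same delays can be reproduced
    max_samples = int(round(max_shift_ms * fs_hz / 1000.0))
    return [rng.randint(0, max_samples) for _ in range(n_channels)]

def delay_signal(signal, delay_samples):
    """Delay a waveform by prepending zeros, keeping the original length."""
    padded = [0.0] * delay_samples + list(signal)
    return padded[:len(signal)]

# Hypothetical settings: 8 channels, shifts of up to 50 ms at fs = 8 kHz.
delays = random_delays(8, 50.0, 8000.0)
shifted = delay_signal([1.0, 2.0, 3.0, 4.0], 2)
```

Because each channel receives its own delay, the across-frequency alignment of the envelopes is disrupted while the content within each band is left intact.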

In all cases, both displays are updated to show the effect of the distortion.

Distortions can be combined in arbitrary ways. Envelope and carrier modifications in each band are applied independently, and the resulting waveform in each band is then temporally distorted.

Plenty of space exists to add further distortions!

Things to investigate

Since it is easy to 'hear out' utterances once you know what they are, it is advisable to load different utterances frequently, and to start with the most challenging conditions, gradually introducing more information until the utterance can be readily identified.

  1. Basic spectral filtering. Listen to the effect of lowpass, highpass, bandpass and bandstop filtering. The latter condition has been investigated recently by Lippmann (1996) for consonant identification.
  2. Single bands. Listen first to the lowest frequency band, then to the highest, then to one in the middle frequencies. Warren et al (1995) recently measured word identification performance for extremely narrow bands as a function of their CF, and found much better performance at 1500 Hz than at 300 or 6000 Hz. However ...
  3. Two bands. ... Warren et al found supra-additive performance when the low and high frequency bands are presented together.
  4. Carrier distortion. Listen to the effect of replacing carriers by noise and by tones. Examine the spectrogram to see how information relating to both voicing and formant structure is disrupted. The noise carrier condition is related to a recent discovery by Shannon et al (1995), although that involved a smaller number of wider bands.
  5. Envelope distortion.
  6. Temporal distortion. Starting with the longest delays (1 second!), gradually reduce the amount of temporal distortion until the utterance can be identified. This signal modification is related to the temporal modulation filtering of Drullman (1995).
  7. It is instructive to compare the effects of identical distortions on signals other than speech. Some music examples are provided in the standard distribution.

References

  1. Allen (1994). IEEE Trans. Speech & Audio Proc., 2(4), 567-577.
  2. Cooke (1996). Proc. ESCA Workshop on Auditory Basis for Speech Perception, Keele.
  3. Drullman (1995). JASA, 97(1), 585-592.
  4. Fletcher (1953). Speech & Hearing in Communication. Van Nostrand.
  5. Glasberg & Moore (1990). Hearing Research, 47, 103-138.
  6. Lippmann (1996). IEEE Trans. Speech & Audio Proc., 4(1), 66-69.
  7. Shannon et al (1995). Science, 270, 303-304.
  8. Warren et al (1995). Perc. & Psychophys., 57(2), 175-182.

Further reading


Credits etc

Produced by: Martin Cooke

Release date: June 22 1998

Permissions: This demonstration may be used and modified freely by anyone. It may be distributed in unmodified form.