modulation-spectrum audio coding in MATLAB

page overview:

The modulation spectrum is emerging as a novel sound representation that has found applications in both ASR and, more recently, audio coding. During my visit to IDIAP, Prof. Hermansky introduced me to the topic by pointing me to a paper by M. Vinton and L. Atlas on modulation-spectrum audio coding [Vint01]. I found the paper very interesting, so I decided to make available a full MATLAB implementation of all the transforms involved. I believe this is a good starting point for people who want to experiment further with this new topic. Note that no entropy coding or masking thresholds are implemented, as the purpose of this page is to explore the modulation spectrum representation itself.


The core m-files provided are:

There are many other miscellaneous files that need not be mentioned here. You can download all the necessary m-files here:

The sample used, neneh32.wav, is from Neneh Cherry's song Manchild, from the album Raw Like Sushi.


1 - sanity:
The first example is a sanity check demonstrating perfect reconstruction when we alter neither the modulation coefficients nor the phase. The m-file is ex1_sanity.m, and the following figure displays the spectrograms of the original, the reconstruction, and their difference (the error). Notice the distortion at the edges of the spectrum.

Perfect reconstruction

original       reconstructed       error
neneh32.wav    sanity_decode.wav   sanity_error.wav

(Note that the error is amplitude-normalized because otherwise it is hard to hear)
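
As a rough illustration of what the sanity check verifies, the sketch below runs a two-stage analysis and synthesis on a synthetic signal. It is not the code in ex1_sanity.m. The plain FFT on non-overlapping frames standing in for the base transform, the FFT across frames standing in for the modulation transform, the 32 kHz sample rate, and the test signal are all assumptions made for illustration.

```matlab
% Two-stage (modulation spectrum) analysis/synthesis sketch.
% Assumed: 32 kHz sample rate, FFT on non-overlapping frames as the base
% transform, FFT across frames as the modulation transform.
fs = 32000;                        % sample rate in Hz (assumed)
N  = 640;                          % 20 ms base frame at 32 kHz
M  = 50;                           % 50 frames = 1 s modulation block
t  = (0:N*M-1)' / fs;
x  = (1 + 0.5*sin(2*pi*4*t)) .* sin(2*pi*1000*t);   % AM test tone

% Stage 1: base transform, one column per 20 ms frame
X   = fft(reshape(x, N, M));       % N x M complex spectrum
mag = abs(X);
ph  = angle(X);

% Stage 2: modulation transform along time, for every frequency bin
Mod = fft(mag, [], 2);             % N x M modulation coefficients

% Synthesis: invert stage 2, reattach the phase, invert stage 1
magr = real(ifft(Mod, [], 2));
xr   = real(ifft(magr .* exp(1i*ph)));
xr   = xr(:);

err = max(abs(x - xr));
fprintf('max reconstruction error: %g\n', err);
```

With nothing altered between analysis and synthesis, the error should sit at the level of floating-point rounding. A lapped base transform with overlapping windows would behave differently near the signal boundaries.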

2 - phasesub:
The second example examines the effect of substituting the phase with random noise above some threshold frequency. The m-file is ex2_phasesub.m, and the following figure displays the spectrograms of the original, the reconstruction, and the error for substitution above 4 kHz.

Phase substitution over 4kHz

The following table brackets the phase substitution cutoff frequency. Notice how the reconstructed signal sounds almost identical to the original for cutoff frequencies of 4 kHz and up. A cutoff of zero corresponds to substituting all the phase information with noise.

cutoff (kHz)   original      reconstructed           error
0              neneh32.wav   phasesub_0_decode.wav   phasesub_0_error.wav
1                            phasesub_1_decode.wav   phasesub_1_error.wav
2                            phasesub_2_decode.wav   phasesub_2_error.wav
4                            phasesub_4_decode.wav   phasesub_4_error.wav
6                            phasesub_6_decode.wav   phasesub_6_error.wav
8                            phasesub_8_decode.wav   phasesub_8_error.wav
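
The phase substitution described in this example can be sketched as follows. This is a minimal illustration, not the code in ex2_phasesub.m. The FFT base transform on non-overlapping frames, the 32 kHz sample rate, the uniform random phase, and the two-tone test signal are assumptions.

```matlab
% Phase substitution sketch: keep the magnitudes, replace the phase of all
% bins above a cutoff frequency with uniform random phase.
% Assumed: 32 kHz sample rate, FFT on non-overlapping 20 ms frames.
fs = 32000; N = 640; M = 50;
fc = 4000;                              % cutoff frequency in Hz
t  = (0:N*M-1)' / fs;
x  = sin(2*pi*1000*t) + 0.5*sin(2*pi*6000*t);   % test signal

X   = fft(reshape(x, N, M));
K   = N/2 + 1;                          % one-sided bins, DC to Nyquist
f   = (0:K-1)' * fs / N;                % bin frequencies in Hz
mag = abs(X(1:K, :));
ph  = angle(X(1:K, :));

% Bins to randomize. DC and Nyquist are excluded because they must stay
% real for the output signal to be real.
hi = (f > fc) & (f < fs/2);
ph(hi, :) = 2*pi*rand(nnz(hi), M) - pi;

% Rebuild the conjugate-symmetric full spectrum and invert
Y = mag .* exp(1i*ph);
Y = [Y; conj(Y(K-1:-1:2, :))];
y = real(ifft(Y));
y = y(:);
```

Setting fc to 0 randomizes every bin except DC and Nyquist, which corresponds to the first row of the table.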

3 - modsub:
The third example examines the effect of zeroing out the coefficients at higher modulation frequencies while keeping the phase unaltered. The m-file is ex3_modsub.m, and the following figures display the spectrograms of the original, the reconstruction, and the error for a 20 ms base transform analysis window, a 1 s modulation spectrum analysis window, and zeroing of 50% and, as an extreme, 95% of the modulation coefficients.

50% Modulation frequency truncation

95% Modulation frequency truncation

The following table contains the sounds that correspond to the two cases above, as well as a bracketing of the modulation frequency truncation. In the first row we keep only 5% of the coefficients and in the last we keep 75%. The base transform and modulation spectrum analysis windows are 20 ms and 1 s, respectively.

fraction kept   original      reconstructed            error
0.05            neneh32.wav   modsub_0.05_decode.wav   modsub_0.05_error.wav
0.1                           modsub_0.1_decode.wav    modsub_0.1_error.wav
0.25                          modsub_0.25_decode.wav   modsub_0.25_error.wav
0.5                           modsub_0.5_decode.wav    modsub_0.5_error.wav
0.75                          modsub_0.75_decode.wav   modsub_0.75_error.wav
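
The truncation in this example can be sketched as below. It is a minimal illustration under stated assumptions, not the code in ex3_modsub.m. The FFT stands in for both transforms, frames do not overlap, the sample rate is taken to be 32 kHz, and clipping negative magnitudes to zero after truncation is a choice made here for the sketch.

```matlab
% Modulation frequency truncation sketch: zero the high modulation
% frequencies of each bin's magnitude trajectory, keep the phase.
% Assumed: 32 kHz sample rate, FFT for both transforms, no frame overlap.
fs = 32000; N = 640; M = 50;       % 20 ms frames, 1 s modulation window
keep = 0.5;                        % fraction of modulation range to keep
t = (0:N*M-1)' / fs;
x = (1 + 0.5*sin(2*pi*4*t)) .* sin(2*pi*1000*t);   % AM test tone

X   = fft(reshape(x, N, M));
mag = abs(X);
ph  = angle(X);

% Modulation transform along time
Mod = fft(mag, [], 2);

% Index distance from DC for each coefficient (covers the mirrored half,
% so the mask is symmetric and the inverse stays real)
k     = 0:M-1;
mfreq = min(k, M - k);
mask  = mfreq <= floor(keep * M / 2);
Mod(:, ~mask) = 0;

% Back to magnitudes; clip small negative values caused by truncation
magr = max(real(ifft(Mod, [], 2)), 0);
y    = real(ifft(magr .* exp(1i*ph)));
y    = y(:);
fprintf('kept %d of %d modulation coefficients per bin\n', nnz(mask), M);
```

With 20 ms non-overlapping frames the magnitude trajectory is sampled at 50 Hz, so a 50-frame window resolves modulation frequencies in 1 Hz steps up to 25 Hz.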

4 - multires:
The performance of the coder depends greatly on the choice of the two analysis windows. What is interesting, though, is that the analysis window of the modulation spectrum transform need not be fixed across frequency bins. For example, if we need to time-localize the distortion introduced by the truncation, we would choose a short modulation window, which would most probably be appropriate for high-frequency bins. At the same time, low-frequency bins could take advantage of longer modulation windows. This multi-resolution functionality is implemented and demonstrated in the m-file ex4_multires.m. A comparison of 50% truncation between the simple and multi-resolution approaches follows:

approach           original      reconstructed                error
simple             neneh32.wav   modsub_0.5_decode.wav        modsub_0.5_error.wav
multi-resolution                 multires_500-50_decode.wav   multires_500-50_error.wav

Note that a different multi-resolution approach using a Hierarchical Lapped Transform (HLT) was taken in [Thom03].
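
One way to picture a frequency-dependent modulation window is the sketch below, which truncates low-frequency bins over a long window and high-frequency bins over a short one. It is not the code in ex4_multires.m. The two-band split at 4 kHz, the window lengths of 50 and 5 frames, the FFT for both transforms, and the 32 kHz sample rate are all assumptions chosen for illustration.

```matlab
% Multi-resolution truncation sketch: a long modulation window for low
% frequency bins and a short one for high frequency bins.
% Assumed: 32 kHz sample rate, FFT for both transforms, no frame overlap.
fs = 32000; N = 640; M = 50;       % 20 ms frames, 1 s of signal
keep   = 0.5;                      % fraction of modulation range to keep
fsplit = 4000;                     % band split in Hz (assumed)
blockL = [50, 5];                  % window per band in frames: 1 s, 100 ms
t = (0:N*M-1)' / fs;
x = (1 + 0.5*sin(2*pi*4*t)) .* (sin(2*pi*1000*t) + 0.5*sin(2*pi*6000*t));

X   = fft(reshape(x, N, M));
mag = abs(X);
ph  = angle(X);

% Frequency of every FFT bin, mirrored half included, so that symmetric
% bins are processed identically and the output stays real
kbin  = (0:N-1)';
f     = min(kbin, N - kbin) * fs / N;
bands = {find(f <= fsplit), find(f > fsplit)};

magr = zeros(N, M);
for b = 1:2
    L    = blockL(b);
    m    = 0:L-1;
    mask = min(m, L - m) <= floor(keep * L / 2);   % symmetric low-pass mask
    for s = 1:L:M                                  % one modulation block at a time
        cols = s:(s + L - 1);
        C = fft(mag(bands{b}, cols), [], 2);
        C(:, ~mask) = 0;
        magr(bands{b}, cols) = real(ifft(C, [], 2));
    end
end

magr = max(magr, 0);               % clip negative magnitudes (a choice made here)
y    = real(ifft(magr .* exp(1i*ph)));
y    = y(:);
```

In this sketch the truncation error in the high band cannot spread beyond its own 100 ms block, which is the time localization the text refers to.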

5 - logmodsub:
A very interesting idea is to take the logarithm of the magnitude of the base transform before extracting the modulation spectrum (and, of course, the exponential on reconstruction). Now when we zero out the high modulation-frequency coefficients (the same way we did in example 3 - modsub) we get much better-sounding results. The m-file is ex5_logmodsub.m, and the following figures and sounds can be compared directly with those of example 3, which does not use the logarithm.

50% Modulation frequency truncation with log

95% Modulation frequency truncation with log

The following table contains the sound examples when we take the logarithm and can be directly compared to the corresponding ones in example 3 - modsub.

fraction kept   original      reconstructed               error
0.05            neneh32.wav   logmodsub_0.05_decode.wav   logmodsub_0.05_error.wav
0.1                           logmodsub_0.1_decode.wav    logmodsub_0.1_error.wav
0.25                          logmodsub_0.25_decode.wav   logmodsub_0.25_error.wav
0.5                           logmodsub_0.5_decode.wav    logmodsub_0.5_error.wav
0.75                          logmodsub_0.75_decode.wav   logmodsub_0.75_error.wav
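
The log-domain variant can be sketched as below. Again this is an illustration under assumptions rather than the code in ex5_logmodsub.m. The FFT stands in for both transforms, the sample rate is taken to be 32 kHz, and the small offset eps added before the logarithm (to avoid log of zero) is a choice made here.

```matlab
% Log-magnitude truncation sketch: take the log of the magnitudes before
% the modulation transform and the exponential after its inverse.
% Assumed: 32 kHz sample rate, FFT for both transforms, no frame overlap.
fs = 32000; N = 640; M = 50;       % 20 ms frames, 1 s modulation window
keep = 0.5;                        % fraction of modulation range to keep
t = (0:N*M-1)' / fs;
x = (1 + 0.5*sin(2*pi*4*t)) .* sin(2*pi*1000*t);   % AM test tone

X   = fft(reshape(x, N, M));
mag = abs(X);
ph  = angle(X);

% Log compression, then the modulation transform along time
lm = log(mag + eps);               % eps guards against log(0)
C  = fft(lm, [], 2);

% Same symmetric low-pass mask as in the linear-magnitude case
k    = 0:M-1;
mask = min(k, M - k) <= floor(keep * M / 2);
C(:, ~mask) = 0;

% Inverse modulation transform, then undo the logarithm
magr = exp(real(ifft(C, [], 2)));
y    = real(ifft(magr .* exp(1i*ph)));
y    = y(:);
```

One property visible in the sketch is that the exponential always returns a positive magnitude, so no clipping of negative values is needed after truncation.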

6 - powNmodsub:
It's coming right up ...


[Vint01] M. Vinton and L. Atlas, "A Scalable and Progressive Audio Codec," in Proceedings of the 2001 IEEE ICASSP, 2001

[Thom03] J. Thompson and L. Atlas, "A Non-uniform Modulation Transform for Audio Coding with Increased Time Resolution," in Proceedings of the 2003 IEEE ICASSP, 2003