E6820 Assignment 7

Reading assignment

Paper: “A tutorial on MPEG/Audio compression,” D. Pan.

Summary:

This paper (unsurprisingly) covers the basic ideas behind the MPEG audio standard. As discussed in class, the basic idea of MPEG audio is to use knowledge of human perception to perform lossy compression intelligently. There appear to have been two main objectives: (1) create a compression scheme that works for any kind of audio, not just speech, and (2) create a compression scheme that alters the original audio imperceptibly.

This is all accomplished basically through a filter bank. In an ideal world, each filter in the filter bank would cover one critical band, but equal-sized bands are used because they make the calculations much, much easier to perform. Each subband is then processed to take advantage of auditory masking.

The basic steps of the encoder's psychoacoustic analysis are (1) time-align audio data, (2) convert audio to a frequency domain representation, (3) process spectral values into groupings related to critical-band widths, (4) separate spectral values into tonal and non-tonal components, (5) apply a spreading function, (6) set a lower bound for threshold values, (7) find the masking threshold for each subband, and (8) calculate the signal-to-mask ratio. At this point, you have everything you need to quantize each band and package the bands up.
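To make the last step concrete, here is a minimal sketch of how signal-to-mask ratios (SMR) can drive quantization, under stated assumptions. It follows the greedy idea Pan describes for Layers I and II: repeatedly give a bit to the subband with the lowest mask-to-noise ratio (MNR = SNR - SMR). The function names, the example levels, and the 6.02 dB-per-bit rule of thumb are illustrative assumptions, not values from the standard, which uses lookup tables.

```python
DB_PER_BIT = 6.02  # assumption: rule-of-thumb SNR gain per quantizer bit


def signal_to_mask_ratios(signal_db, mask_db):
    """SMR per subband: signal level minus masking threshold, both in dB."""
    return [s - m for s, m in zip(signal_db, mask_db)]


def allocate_bits(smr_db, total_bits, max_bits_per_band=15):
    """Greedy allocation: each bit goes to the subband with the lowest MNR."""
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        open_bands = [b for b in range(len(bits)) if bits[b] < max_bits_per_band]
        if not open_bands:
            break  # every subband is already at its cap
        # MNR = SNR - SMR, with SNR approximated as DB_PER_BIT per bit.
        worst = min(open_bands, key=lambda b: DB_PER_BIT * bits[b] - smr_db[b])
        bits[worst] += 1
    return bits


# Made-up levels for three subbands (hypothetical, for illustration only).
smr = signal_to_mask_ratios([60.0, 40.0, 20.0], [30.0, 35.0, 30.0])
print(smr)                               # [30.0, 5.0, -10.0]
print(allocate_bits(smr, total_bits=6))  # [5, 1, 0]
```

The third subband gets no bits because its signal already sits below the masking threshold, which is the point of the perceptual model: bits go only where quantization noise would be audible.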

There are three layers of MPEG audio. The first layer is the simplest and fastest to compute. The second layer is a bit more high-fidelity with bigger frames and a slightly more intelligent encoding. MP3, the third layer, does a couple more intelligent things in the pre-processing phase and then basically goes all-out to get the most compressed encoding possible. The compression tricks include nonuniform quantization and entropy coding of data values.

Thoughts:

I wonder who the MPEG team's "expert listeners" were who were used to test the standard. Were they audio engineers? Audiophiles? I also wonder what kind of equipment they used to present the audio. A bad pair of headphones can mask a large number of sins.

During the analysis phase, psychoacoustic model 1 describes each sub-band in terms of tonal and nontonal components. Pan points out that "[i]n effect, model 1 converts nontonal components into a form of tonal component." I wonder if this means that model 1 would work better on music than on something nontonal like rain drops or waves crashing.

Practical assignment

I made my quantization decisions on the basis of the way the reconstructed audio sounded. I found that the SNR could sometimes be misleading: an SNR that was noticeable with one kind of distortion was unnoticeable with another kind of distortion.

(a) Quantizing filter coefficients (code)

I found that 8 bits for the filter coefficients maintained the integrity of the sound (-51.9124 SNR), but that at 7 bits (-35.4075 SNR) there were weird blip noises. At 6 bits (-31.0454 SNR), there were these really weird gloopy noises. At 5 bits, I got really LOUD added noise. I yanked my headset off the first time I listened to the reconstruction!
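A minimal sketch of this kind of experiment, assuming a plain uniform quantizer over a fixed range. The coefficient values and the [-2, 2] range below are made up for illustration, and the function names are hypothetical; the assignment's actual code may differ. Note also that `snr_db` uses the usual signal-over-error convention, so larger is better, whereas the figures quoted above are negative and so may follow the opposite sign convention.

```python
import math


def quantize_uniform(values, bits, lo, hi):
    """Round each value to the nearest of 2**bits evenly spaced levels in [lo, hi]."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    out = []
    for v in values:
        clipped = min(max(v, lo), hi)         # values outside the range saturate
        index = round((clipped - lo) / step)  # exact ties go to the even index
        out.append(lo + index * step)
    return out


def snr_db(reference, test):
    """SNR in dB: reference power divided by the power of the error."""
    signal = sum(r * r for r in reference)
    noise = sum((r - t) ** 2 for r, t in zip(reference, test))
    if noise == 0:
        return math.inf
    return 10 * math.log10(signal / noise)


# Hypothetical filter coefficients, not taken from the assignment data.
coeffs = [1.0, -1.3, 0.8, -0.2]
for bits in (8, 7, 6, 5):
    q = quantize_uniform(coeffs, bits, -2.0, 2.0)
    print(bits, round(snr_db(coeffs, q), 1))
```

Each bit removed doubles the step size, so the error roughly doubles too. A recursive filter can turn that small coefficient error into a large output error, which may be why the reconstruction degrades so abruptly at low bit depths.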

(b) LSPs (code)

This time, 7 bits sounded good (-48.7503 SNR) and 6 bits (-38.3282 SNR) sounded messy. I calculated a bit rate of 6.4796e+003 bits/second for 8-bit coefficients and 5.2335e+003 bits/second for 7-bit LSPs. That's some compression, but not much.

(c) Excitation (code)

This is pretty dense - I apologize in advance.

Reducing the range
I found that I could quite comfortably reduce the range from [-18,22] to [-5,10] and still have a good result (-54.6811 SNR). That's quite a range reduction. However, when I reduced to [-2, 5], the reconstruction was noticeably different from the original (-43.7851 SNR).

Quantizing
4 bits (-49.0658 SNR) was fine, but 3 bits (-43.0696 SNR) didn't sound good.

Multi-Pulse Excitation
This one was really hard because the change was really gradual. At 64 pulses per frame (-54.9354 SNR), I couldn't hear a difference between it and the original. Even at 16 pulses per frame (-45.0017 SNR), the sound was still remarkably good, though fuzzy. In the end, I went with 45 pulses per frame (-51.5301 SNR).
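One simple way to get a fixed number of pulses per frame is sketched below: keep the largest-magnitude excitation samples in each frame and zero the rest. This is an assumption about the selection rule. Multi-pulse coders often choose pulses by analysis-by-synthesis instead, and the assignment's code may do something different. The function names and sample values are hypothetical.

```python
def multipulse_frame(frame, num_pulses):
    """Keep the num_pulses largest-magnitude samples of a frame; zero the rest."""
    if num_pulses <= 0:
        return [0.0] * len(frame)
    # Indices sorted by descending magnitude; the stable sort keeps earlier
    # samples first when magnitudes tie.
    order = sorted(range(len(frame)), key=lambda i: -abs(frame[i]))
    keep = set(order[:num_pulses])
    return [frame[i] if i in keep else 0.0 for i in range(len(frame))]


def multipulse(excitation, frame_len, num_pulses):
    """Apply multipulse_frame to consecutive non-overlapping frames."""
    out = []
    for start in range(0, len(excitation), frame_len):
        out.extend(multipulse_frame(excitation[start:start + frame_len], num_pulses))
    return out


# Made-up excitation samples, for illustration only.
excitation = [0.1, -3.0, 0.5, 2.0, -0.2, 0.05, 1.0, -0.4]
print(multipulse(excitation, frame_len=8, num_pulses=3))
# [0.0, -3.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0]
```

Because the discarded samples are the smallest ones, lowering the pulse count removes energy gradually, which is consistent with the slow, hard-to-pin-down loss of quality described above.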

As far as bit-rate is concerned, I calculated 3.1983e+004 bits/second for the excitation quantized to 4 bits. I calculated a bit rate of 1.1215e+004 bits/second for the 4-bit quantized pulses and 1.9626e+004 bits/second for the quantized times (7-bit time values for a 2^7 = 128 sample window). So, the combined bit rate for the pulses and times was 3.0840e+004 bits/second. That's hardly any savings at all.
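The arithmetic behind these numbers can be sketched as follows. Two things here are assumptions rather than facts from the write-up: a sample rate of 8000 Hz (chosen because 31983 / 4 is close to 8000) and non-overlapping 128-sample frames. The results therefore land near, but not exactly on, the figures quoted above.

```python
SAMPLE_RATE = 8000  # Hz, assumed
FRAME_LEN = 128     # samples per frame, matching the 7-bit pulse times


def sample_bit_rate(bits_per_sample, sample_rate):
    """Bit rate when every excitation sample is quantized and sent."""
    return bits_per_sample * sample_rate


def multipulse_bit_rate(pulses, amplitude_bits, time_bits, frames_per_second):
    """Return (amplitude rate, time rate, total) in bits/second."""
    amplitude_rate = pulses * amplitude_bits * frames_per_second
    time_rate = pulses * time_bits * frames_per_second
    return amplitude_rate, time_rate, amplitude_rate + time_rate


def break_even_pulses(frame_len, amplitude_bits, time_bits):
    """Pulses per frame at which multi-pulse costs as much as sending every sample."""
    return frame_len * amplitude_bits / (amplitude_bits + time_bits)


frames_per_second = SAMPLE_RATE / FRAME_LEN                # 62.5
print(sample_bit_rate(4, SAMPLE_RATE))                     # 32000
print(multipulse_bit_rate(45, 4, 7, frames_per_second))    # (11250.0, 19687.5, 30937.5)
print(break_even_pulses(FRAME_LEN, 4, 7))                  # about 46.5
```

Under these assumptions the break-even point is about 46.5 pulses per frame, so 45 pulses sits just below it. Each pulse costs 11 bits (4 for amplitude, 7 for position) against 4 bits for a plain sample, which accounts for the small savings.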

(d) Buzz-hiss (code)

I found that the buzz-hiss algorithm produced a muffled-sounding reconstruction (-35.4634 SNR). It was like the guy was far away or in a sound-deadening room. His voice just wasn't as rich.

Comparing bit rates, buzz-hiss won hands down at 436.1264 bits/second, roughly 70 times lower (nearly two orders of magnitude) than quantizing the excitation or using multi-pulse excitation. Of course, in some ways it's not a completely fair comparison. I picked 45 pulses per frame because it gave me a high-fidelity reconstruction in the multi-pulse scheme. The buzz-hiss scheme definitely produced a noticeably different sound.

Project

Work on the project can be found on my project page here.

Christine Smit