E6820 Assignment 7
Reading assignment
Paper:
“A tutorial on MPEG/Audio compression,” D. Pan.
Summary:
This paper (unsurprisingly) covers the basic ideas behind the MPEG
audio standard. As discussed in class, the basic idea of MPEG audio is
to use knowledge of human perception to perform lossy compression
intelligently. There appear to have been two main objectives: (1)
create a compression scheme that works for any kind of audio, not just
speech, and (2) create a compression scheme that alters the original
audio imperceptibly.
This is all accomplished basically through a filter bank. In an
ideal world, each filter in the filter bank would cover one critical
band, but equal-sized bands are used because they make the calculations
much, much easier to perform. Each subband is then processed to
take advantage of auditory masking.
The basic steps of the encoder's psychoacoustic analysis are (1) time-align audio data, (2) convert
audio to a frequency domain representation, (3) process spectral values
into groupings related to critical-band widths, (4) separate spectral
values into tonal and non-tonal components, (5) apply a spreading
function, (6) set a lower bound for threshold values, (7) find the
masking threshold for each subband, and (8) calculate the
signal-to-mask ratio. At this point, you have everything you need
to quantize each band and package them up.
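To make that last step concrete, here is a rough sketch of how the signal-to-mask ratios can drive quantization. It repeatedly gives one more bit to the subband with the lowest mask-to-noise ratio (MNR = SNR - SMR), which is the general idea Pan describes for bit allocation. The 6.02 dB-per-bit approximation for SNR, the function name, the bit budget, and the example SMR values are all assumptions for illustration; the standard uses lookup tables instead.

```python
def allocate_bits(smr_db, total_bits, max_bits=15):
    """Greedy bit allocation across subbands (illustrative only).

    smr_db     -- signal-to-mask ratio of each subband, in dB
    total_bits -- hypothetical budget: how many single bits to hand out
    max_bits   -- assumed cap on bits per subband
    """
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        candidates = [i for i in range(len(bits)) if bits[i] < max_bits]
        if not candidates:
            break
        # MNR = SNR - SMR, with SNR approximated as 6.02 dB per bit (assumption).
        worst = min(candidates, key=lambda i: 6.02 * bits[i] - smr_db[i])
        bits[worst] += 1
    return bits


# Made-up SMR values for four subbands.
print(allocate_bits([20.0, 5.0, -3.0, 12.0], 8))  # → [4, 1, 0, 3]
```

The subband with a negative SMR gets no bits at all: its signal is already below the masking threshold, so nothing audible is lost by dropping it.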
There are three layers of MPEG audio. The first layer is the simplest
and fastest to compute. The second layer offers somewhat higher
fidelity, with bigger frames and a slightly more intelligent
encoding. MP3, the third layer, does a couple more intelligent
things in the pre-processing phase and then basically goes all-out to
get the most compressed encoding possible. The compression
tricks include nonuniform quantization and entropy coding of data
values.
Thoughts:
I wonder who the MPEG team's "expert listeners" were who were used to
test the standard. Were they audio engineers? Audiophiles?
I also wonder what kind of equipment they used to present the
audio. A bad pair of headphones can mask a large number of sins.
During the analysis phase, psychoacoustic model 1 describes each sub-band
in terms of tonal and nontonal components. Pan points out that "[i]n
effect, model 1 converts nontonal components into a form of tonal
component." I wonder if this means that an encoder using model 1 would
work better on music than on something nontonal like raindrops or waves
crashing.
Practical assignment
I made my quantization decisions on the basis of how the
reconstructed audio sounded, because I found that the SNR could
sometimes be misleading. An SNR that went with clearly audible
distortion under one scheme could go with inaudible distortion under
another.
(a) Quantizing filter coefficients (code)
I found that 8 bits for the filter coefficients maintained the integrity of the sound (-51.9124 SNR), but that at 7 bits (-35.4075 SNR) there were weird blip noises. At 6 bits (-31.0454 SNR), there were these really weird gloopy noises. At 5 bits, I got really LOUD added noise. I yanked my headset off the first time I listened to the reconstruction!
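For reference, a minimal sketch of the kind of experiment described here: uniformly quantize a set of values to a given number of bits, then measure the error. This is not the assignment code. The function names, the [-2, 2] range, and the example coefficients are assumptions. It also uses the conventional signal-over-noise definition of SNR, which is positive for a good reconstruction and so may not match the sign convention of the figures quoted above.

```python
import math


def quantize(values, n_bits, lo, hi):
    """Uniformly quantize each value to one of 2**n_bits levels spanning [lo, hi]."""
    step = (hi - lo) / (2 ** n_bits - 1)
    out = []
    for v in values:
        clipped = min(max(v, lo), hi)          # values outside the range saturate
        index = round((clipped - lo) / step)   # nearest quantization level
        out.append(lo + index * step)
    return out


def snr_db(original, reconstructed):
    """Conventional SNR in dB: signal energy over error energy."""
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    if noise == 0:
        return math.inf
    return 10 * math.log10(signal / noise)


# Hypothetical filter coefficients, quantized over an assumed range of [-2, 2].
coeffs = [1.0, -1.8, 1.2, -0.4]
for n_bits in (8, 6, 4):
    print(n_bits, round(snr_db(coeffs, quantize(coeffs, n_bits, -2.0, 2.0)), 1))
```

A number like this only measures coefficient error. It says nothing directly about how the resynthesized speech sounds, which is one reason to trust listening over SNR.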
(b) LSPs (code)
This time, 7 bits sounded good (-48.7503 SNR) and 6 bits
(-38.3282 SNR) sounded messy. I calculated a bit rate of
6.4796e+003 bits/second for 8-bit coefficients and 5.2335e+003
bits/second for 7-bit LSPs. That's some compression, but not much.
(c) Excitation (code)
This is pretty dense - I apologize in advance.
Reducing the range
I found that I could quite comfortably reduce the range from [-18,22] to [-5,10] and still have a good result (-54.6811 SNR). That's quite a range reduction. However, when I reduced to [-2, 5], the reconstruction was noticeably different from the original (-43.7851 SNR).
Quantizing
4 bits (-49.0658 SNR) was fine, but 3 bits (-43.0696 SNR) didn't sound good.
Multi-Pulse Excitation
This one was really hard because the change was really gradual. At 64 pulses per frame (-54.9354 SNR), I couldn't hear a difference between it and the original. Even at 16 pulses per frame (-45.0017 SNR), the sound was still remarkably good, though fuzzy. In the end, I went with 45 pulses per frame (-51.5301 SNR).
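The assignment code isn't reproduced here, so the following is only a guess at the simplest form of the idea: in each excitation frame, keep the k largest-magnitude samples as (time, amplitude) pulses and zero everything else. Real multi-pulse coders typically choose pulses by analysis-by-synthesis instead. The function names and the toy frame are hypothetical.

```python
def top_k_pulses(frame, k):
    """Keep the k largest-magnitude samples of a frame.

    Returns (times, amplitudes), both ordered by time index.
    """
    by_magnitude = sorted(range(len(frame)), key=lambda i: abs(frame[i]), reverse=True)
    times = sorted(by_magnitude[:k])
    return times, [frame[t] for t in times]


def rebuild_frame(times, amplitudes, frame_len):
    """Reconstruct a frame that is zero everywhere except at the pulse times."""
    out = [0.0] * frame_len
    for t, a in zip(times, amplitudes):
        out[t] = a
    return out


frame = [0.1, -3.0, 0.5, 2.0, -0.2]
times, amps = top_k_pulses(frame, 2)
print(times, amps)                            # → [1, 3] [-3.0, 2.0]
print(rebuild_frame(times, amps, len(frame)))  # → [0.0, -3.0, 0.0, 2.0, 0.0]
```

Each pulse costs an amplitude plus a time index, which is why the pulse count matters so much for the bit rate.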
As far as bit-rate is concerned, I calculated 31983 bits/second
for the excitation quantized to 4 bits. I calculated a bit rate
of 11215 bits/second for the 4-bit quantized pulses and
19626 bits/second for the quantized times (7-bit time values for
a 2^7 = 128 sample window). So, the combined bit rate for the
pulses and times was 30840 bits/second. That's hardly any
savings at all.
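The arithmetic behind these figures can be sketched as follows. The 128-sample frame, 45 pulses, and the 4-bit and 7-bit widths come from the text above. The 8000 Hz sample rate and non-overlapping frames are assumptions, so the results land close to, but not exactly on, the reported numbers.

```python
SAMPLE_RATE = 8000    # Hz (assumption; not stated in the write-up)
FRAME_LEN = 128       # samples per frame, from the 2^7 window
FRAMES_PER_SEC = SAMPLE_RATE / FRAME_LEN   # 62.5, assuming non-overlapping frames


def bits_per_second(values_per_frame, bits_per_value):
    """Bit rate for sending a fixed number of fixed-width values every frame."""
    return values_per_frame * bits_per_value * FRAMES_PER_SEC


full_excitation = bits_per_second(FRAME_LEN, 4)   # every sample at 4 bits
pulse_amplitudes = bits_per_second(45, 4)         # 45 pulses at 4 bits each
pulse_times = bits_per_second(45, 7)              # 45 time indices at 7 bits each

print(full_excitation)                 # → 32000.0
print(pulse_amplitudes + pulse_times)  # → 30937.5
```

Under these assumptions each pulse costs 4 + 7 = 11 bits against 4 bits per plain sample, so multi-pulse only breaks even below 128 * 4 / 11, or about 46.5 pulses per frame. At 45 pulses it sits just under that line, which explains the tiny saving.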
(d) Buzz-hiss (code)
I found that the buzz-hiss algorithm produced
a muffled-sounding reconstruction (-35.4634 SNR). It was like the
guy was far away or in a sound-deadening room. His voice just
wasn't as rich.
Comparing bit-rates, buzz-hiss won hands down at 436.1264 bits/second,
roughly 70 times lower than quantizing the excitation or using
multi-pulse excitation. Of course, in some ways it's not a completely
fair comparison. I picked 45 pulses per frame because it gave me a
high-fidelity reconstruction in the multi-pulse scheme. The buzz-hiss
scheme definitely produced a noticeably different sound.
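For context, a buzz-hiss excitation is so cheap because each frame is described by little more than a pitch value: voiced frames get a pulse train ("buzz") and unvoiced frames get noise ("hiss"). The sketch below illustrates that general idea only, not the assignment code. The function name, the convention that a pitch period of 0 means unvoiced, and the gain parameter are all assumptions, and it restarts the pulse train at every frame instead of carrying the phase across frames.

```python
import random


def buzz_hiss_frame(frame_len, pitch_period, gain, rng):
    """Synthesize one excitation frame.

    pitch_period -- spacing of pulses in samples; 0 marks an unvoiced frame (assumed convention)
    gain         -- pulse height, or noise scale for unvoiced frames
    rng          -- a random.Random instance, passed in so runs are repeatable
    """
    if pitch_period > 0:
        # "Buzz": one pulse every pitch_period samples, zeros in between.
        return [gain if n % pitch_period == 0 else 0.0 for n in range(frame_len)]
    # "Hiss": Gaussian white noise.
    return [gain * rng.gauss(0.0, 1.0) for _ in range(frame_len)]


rng = random.Random(0)
print(buzz_hiss_frame(8, 4, 1.0, rng))  # → [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(len(buzz_hiss_frame(8, 0, 1.0, rng)))  # → 8
```

An excitation this regular throws away the fine detail of the real residual, which fits with the reconstruction sounding less rich.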
Project
Work on the project can be found on my project page here.
Christine Smit