E6820 Assignment 8
Reading assignment
Paper:
“A tutorial on MPEG/Audio compression,” D. Pan.
Summary:
This paper discusses combining monaural and binaural techniques for
locating a sound in an anechoic environment. Azimuth is determined
entirely from the binaural calculation. Elevation is determined
by combining the monaural and binaural estimates. The authors
argue that the monaural elevation estimate is useful in the median
plane, where binaural estimates are inaccurate due to the symmetry of
the ears.
The binaural estimation looks at the ratio of the right-ear and
left-ear Fourier transforms. Dividing the right-ear transform by the
left-ear transform leaves you with the ratio of the head-related
transfer functions (HRTFs), because the source signal's Fourier
transform cancels. This leaves the authors with a characteristic that
depends on the HRTFs, and therefore on azimuth and elevation, but not
on the source signal.
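The cancellation can be written out in a few lines. This is only a sketch: the symbols are my own assumptions rather than the paper's notation, with S for the source spectrum, H_R and H_L for the right-ear and left-ear HRTFs, and theta and phi for azimuth and elevation. It also assumes a noise-free anechoic signal and frequencies where S is nonzero.

```latex
\begin{align*}
X_R(\omega) &= H_R(\omega, \theta, \phi)\, S(\omega), \\
X_L(\omega) &= H_L(\omega, \theta, \phi)\, S(\omega), \\
\frac{X_R(\omega)}{X_L(\omega)}
  &= \frac{H_R(\omega, \theta, \phi)}{H_L(\omega, \theta, \phi)}
  \qquad \text{for } S(\omega) \neq 0 .
\end{align*}
```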
The monaural estimate cannot perform the cancellation trick, so it
relies on knowing the source signal in addition to the HRTF.
The exact elevation-dependent spectral variations are somewhat
complicated, but essentially involve the second derivative of the
monaural Fourier transform expressed in dB.
Thoughts:
The results are interesting. Sustained white Gaussian noise had the
lowest average elevation error. Short bursts of sound, like a plucked
guitar string, had the highest average error. Given that many, many
important noises are short bursts, it seems as though we would
have evolved to deal with them better than can be explained by this
model. Or maybe not. Maybe azimuth was more important to
us, as humans, than elevation. Since we don't fly or swim, we
just needed to know which direction to run in on the horizontal.
Practical assignment
(a) Here be the scatter plot generated with this Matlab code:

(b) Rather than posting a gazillion plots with various window sizes, I've consolidated them into this movie. I started with a frame size of 0.25 seconds and worked my way up to 2.0 seconds in 0.25-second increments.
Clearly, a smaller frame size gives
you finer time resolution. However, it also seems to be less
accurate: there are a whole lot more extreme outliers. An
outlier with a 600-sample correlation match corresponds to a
0.0375-second delay between mikes. If the speaker and the two mikes
form a line and we use the speed of sound at sea level, then 0.0375
seconds corresponds to mikes that are approximately 13 meters apart!
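The arithmetic behind that 13-meter figure can be checked in a few lines. This is a sketch in Python rather than the Matlab used for the assignment. The 600-sample lag and the 0.0375-second delay come from the text; the speed of sound (340 m/s) is an assumed round value for sea level, and the sampling rate is not stated anywhere but falls out of the other two numbers.

```python
# Values taken from the text above.
lag_samples = 600
delay_s = 0.0375

# The sampling rate is not stated; it is implied by the two figures above.
sample_rate_hz = lag_samples / delay_s

# Assumption: roughly 340 m/s for the speed of sound at sea level.
speed_of_sound_m_s = 340.0

# If the speaker and both mikes lie on one line, the extra path length
# travelled to the far mike equals the mike spacing.
distance_m = delay_s * speed_of_sound_m_s

print(f"implied sampling rate: {sample_rate_hz:.0f} Hz")
print(f"implied mike spacing:  {distance_m:.2f} m")
```

Under these assumptions the implied sampling rate is about 16 kHz and the spacing comes out just under 13 meters, which is clearly too large for a meeting room.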
As the frame size decreases, you get a sort of cross-pattern emerging,
where one set of mikes calculates a (correct) short correlation match
and the other set of mikes calculates something huge. As the frame
size goes up, the time resolution goes down, but accuracy seems to go
up. There are really no outliers when the frame size is about 1.5
seconds. Of course, the main speaker can change quite rapidly in
a meeting, so a 1.5-second frame may not be a good idea, depending on
the application.
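For readers who want to experiment with the frame-size trade-off, here is a minimal sketch of frame-by-frame delay estimation by time-domain cross-correlation. It is written in plain Python and is not the Matlab code the plots were generated with. The function names (`best_lag`, `framewise_lags`), the non-overlapping framing, and the synthetic white-noise test signal are all assumptions made for illustration.

```python
import random


def best_lag(ref, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `ref`.

    A positive result means `other` is a delayed copy of `ref`.
    Both sequences must have the same length.
    """
    n = len(ref)
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Correlate ref[i] with other[i + lag] over the overlapping samples.
        score = 0.0
        for i in range(max(0, -lag), min(n, n - lag)):
            score += ref[i] * other[i + lag]
        if score > best_score:
            best, best_score = lag, score
    return best


def framewise_lags(ref, other, frame_len, max_lag):
    """Estimate one lag per non-overlapping frame of `frame_len` samples."""
    lags = []
    for start in range(0, len(ref) - frame_len + 1, frame_len):
        stop = start + frame_len
        lags.append(best_lag(ref[start:stop], other[start:stop], max_lag))
    return lags


# Synthetic check: white Gaussian noise and a copy delayed by 5 samples.
rng = random.Random(0)
source = [rng.gauss(0.0, 1.0) for _ in range(400)]
true_delay = 5
delayed = [0.0] * true_delay + source[:-true_delay]

lags = framewise_lags(source, delayed, frame_len=200, max_lag=20)
print(lags)
```

Shrinking `frame_len` gives more estimates per second but fewer samples behind each correlation peak, which is one plausible reason short frames produce more outliers on real recordings.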
Project
Work on the project can be found on my project page here.
Christine Smit