E6820 Assignment 8

Reading assignment

Paper: “A tutorial on MPEG/Audio compression,” D. Pan.

Summary:

This paper discusses combining monaural and binaural techniques for locating a sound in an anechoic environment. Azimuth is determined entirely from the binaural calculation. Elevation is determined by combining the monaural and binaural estimates. The authors argue that the monaural elevation estimate is useful in the median plane, where binaural estimates are inaccurate because the two ears are placed symmetrically.

The binaural estimate looks at the ratio of the right-ear and left-ear Fourier transforms. Dividing the right-ear transform by the left-ear transform leaves you with the ratio of the head-related transfer functions (HRTFs), because the source signal's Fourier transform cancels. This leaves the authors with a characteristic that is a function of the HRTFs, and therefore of azimuth and elevation, but not of the source signal.
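The cancellation can be written out in a few lines. The symbols here are my own notation, not necessarily the paper's: S(f) is the source spectrum, H_L and H_R are the left-ear and right-ear HRTFs for azimuth \theta and elevation \phi, and X_L and X_R are the spectra observed at the ears. This assumes the noise-free, anechoic case described above.

```latex
\begin{align*}
X_L(f) &= H_L(f;\theta,\phi)\,S(f), \\
X_R(f) &= H_R(f;\theta,\phi)\,S(f), \\
\frac{X_R(f)}{X_L(f)} &= \frac{H_R(f;\theta,\phi)}{H_L(f;\theta,\phi)}
\qquad \text{wherever } S(f) \neq 0 \text{ and } H_L(f;\theta,\phi) \neq 0 .
\end{align*}
```

The ratio only makes sense at frequencies where the source actually has energy, which may be part of why the choice of source signal still matters in practice.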

The monaural estimate cannot perform the cancellation trick, so it relies on knowing the source signal in addition to the HRTF. The exact elevation-dependent spectral variations are somewhat complicated, but essentially involve the second derivative of the monaural Fourier transform expressed in dB.

Thoughts:

The results are interesting. Sustained white Gaussian noise had the lowest average elevation error. Short bursts of sound, like a plucked guitar string, had the highest average error. Given that many, many important noises are short bursts, it seems as though we would have evolved to deal with them better than can be explained by this model. Or maybe not. Maybe azimuth was more important to us, as humans, than elevation. Since we don't fly or swim, we just needed to know which direction to run in on the horizontal.

Practical assignment

(a) Here be the scatter plot, generated with this Matlab code.

(b) Rather than posting a gazillion plots with various window sizes, I've consolidated them into this movie. I started with a frame size of 0.25 seconds and worked my way up to 2.0 seconds in 0.25-second increments.

Clearly, a smaller frame size gives you finer time resolution. However, it also seems to be less accurate: there are a whole lot more extreme outliers. An outlier with a 600-sample correlation match corresponds to a 0.0375-second delay between mikes (600 samples at a 16 kHz sample rate). If the speaker and the two mikes form a line and we use the speed of sound at sea level, then 0.0375 seconds corresponds to mikes that are approximately 13 meters apart!
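The arithmetic behind that 13-meter figure can be checked with a short sketch. This is in Python rather than Matlab, and the constants are assumptions: the 16 kHz sample rate is inferred from 600 samples matching 0.0375 seconds, and 343 m/s is a commonly quoted speed of sound in air. The function names are made up for illustration.

```python
# Assumed values: 16 kHz is inferred from 600 samples <-> 0.0375 s;
# 343 m/s is a common figure for the speed of sound in air near 20 C.
SAMPLE_RATE_HZ = 16000
SPEED_OF_SOUND_M_PER_S = 343.0


def lag_to_delay_seconds(lag_samples, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert a correlation lag in samples to a delay in seconds."""
    return lag_samples / sample_rate_hz


def delay_to_path_difference_m(delay_s, speed=SPEED_OF_SOUND_M_PER_S):
    """Convert a delay to the extra distance the sound must have traveled."""
    return delay_s * speed


delay = lag_to_delay_seconds(600)
distance = delay_to_path_difference_m(delay)
print(f"{delay:.4f} s -> {distance:.1f} m")
```

With these assumed constants the path difference comes out at about 12.9 m, which is where "approximately 13 meters" comes from. That is far larger than any plausible mike spacing in a meeting room, so a 600-sample match is almost certainly a spurious correlation peak.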

As the frame size decreases, you get a sort of cross-pattern emerging, where one set of mikes calculates a (correct) short correlation match and one set of mikes calculates something huge. As the frame size goes up, the time resolution goes down, but accuracy seems to go up. There are really no outliers when the frame size is about 1.5 seconds. Of course, the main speaker can change quite rapidly in a meeting, so a 1.5-second frame may not be a good idea, depending on the application.
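For readers who want to play with the frame-size trade-off, here is a minimal sketch of a per-frame delay estimate by brute-force cross-correlation. It is not the Matlab code used for the plots. It is a standard-library Python illustration on synthetic data, and the names best_lag and frame_lags, the frame length, and the lag range are all made up for the example.

```python
import random


def best_lag(x, y, max_lag):
    """Return the lag in samples at which y best matches x.

    A positive result means y looks like a delayed copy of x.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(len(x)):
            j = i + lag
            if 0 <= j < len(y):
                score += x[i] * y[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def frame_lags(x, y, frame_len, max_lag):
    """Estimate one lag per non-overlapping frame of frame_len samples."""
    n = min(len(x), len(y))
    return [
        best_lag(x[s:s + frame_len], y[s:s + frame_len], max_lag)
        for s in range(0, n - frame_len + 1, frame_len)
    ]


# Synthetic test signal: white Gaussian noise, and a copy delayed by 5 samples.
random.seed(0)
true_delay = 5
x = [random.gauss(0.0, 1.0) for _ in range(400)]
y = [0.0] * true_delay + x[:-true_delay]

lags = frame_lags(x, y, frame_len=100, max_lag=20)
print(lags)
```

On clean synthetic noise like this, every frame should recover the 5-sample delay. Real meeting recordings are much less forgiving: a short frame contains fewer samples to average over, so a spurious peak at a large lag is more likely to win, which is consistent with the outliers described above.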

Project

Work on the project can be found on my project page here.

Christine Smit