E6820 Assignment 9

Reading assignment

Paper: “Speech Recognition by Machines and Humans”, R. Lippmann, Speech Communication, 22, 1-15, 1997.

Summary:

This paper summarized the best results in speech recognition for humans and machines in various domains. The paper is 10 years old and I assume that algorithms have improved since then, but I have to say that the algorithms covered did not do well against humans. Often, humans were an order of magnitude more accurate than the computer algorithms performing the same task. Humans were also pretty impervious to noise. The computer algorithms, on the other hand, took huge hits when there was noise, even though they had been tuned for the particular kinds of noise they were going to have to deal with.

Lippmann started with the simplest and easiest tasks, namely recognizing read digits and read letters. Both tasks clearly had limited vocabulary sizes. Lippmann then moved on to the North American Business News Set, which was still read, but had a large vocabulary. Finally, the paper covered the most difficult tasks of all: spontaneous speech transcription and word spotting.

Thoughts:

To me, the interesting thing about speech is just how un-computer-like it is. Any designer who came up with a communication system resembling speech would be held up to ridicule. Just consider the variation in pronunciation that is acceptable: "I say [təmeːIto], you [təmato]..." I would hazard a guess that if someone of foreign origin told you that he was adding a [təmIto] or even a [təmʌto] to his pasta sauce, you might not even notice how badly he'd mangled that second vowel. Yet, despite its imprecision, humans do a really good job of decoding speech.

I thought that the section on labelling in the paper was interesting, too. All data sets are, of course, labelled by humans. It almost seems comical that all the human accuracy tests were run by (1) having humans label speech and then (2) having other humans label the speech. The first humans represented the "truth" and the second humans, the "test." Researchers then proceeded to say something about the accuracy of human transcription on the basis of the difference between the truth and test data even though they were both generated by humans. So, I thought it was interesting that Lippmann pointed out that the "truth" humans do actually have a leg up on the "test" humans. The "truth" humans are highly motivated and get access to the best recorded data. In the case of spoken text, they also get access to the original text the speakers were asked to say.
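As a rough sketch of how the "truth" versus "test" comparison might be scored, here is a small word error rate calculation in Python. I am assuming that word error rate (the word-level edit distance divided by the number of words in the "truth" transcript) is the metric. The function name and the sample sentences are made up for illustration and do not come from the paper.

```python
def word_error_rate(truth, test):
    """Word-level edit distance between two transcripts, divided by
    the number of words in the "truth" transcript."""
    ref = truth.split()
    hyp = test.split()
    if not ref:
        raise ValueError("the truth transcript must contain at least one word")

    # prev[j] holds the edit distance between the words of ref seen so far
    # (before the current one) and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn the first i ref words into nothing
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: ref word missing from hyp
                curr[j - 1] + 1,     # insertion: extra word in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts: the "test" listener dropped one word out of six.
truth_label = "add a tomato to the sauce"
test_label = "add tomato to the sauce"
print(word_error_rate(truth_label, test_label))
```

Note that the score is only as trustworthy as the "truth" transcript: any mistake made by the "truth" humans gets counted as an error by the "test" humans.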


Practical assignment

Nothing to write up this week.

Project

Work on the project can be found on my project page here.


Christine Smit