E6820 Assignment 9
Reading assignment
Paper:
“Speech Recognition by Machines and Humans”, R. Lippmann,
Speech Communication, 22, 1-15, 1997.
Summary:
This paper summarizes the best speech recognition results for
humans and machines in various domains. The paper is 10 years old and
I assume that algorithms have improved since then, but I have to say
that the algorithms covered did not do well against humans. Often,
humans made an order of magnitude fewer errors than the computer
algorithms performing the same task. Humans were also largely
impervious to noise. The computer algorithms, on the other hand, took
huge hits when there was noise, even though they had been tuned for
the particular kinds of noise they were going to have to deal with.
Lippmann started with the simplest tasks, namely recognizing read
digits and read letters. Both tasks clearly had limited vocabulary
sizes. Lippmann then moved on to the North American Business News
set, which was still read speech but had a large vocabulary.
Finally, the paper covered the most difficult tasks of all:
spontaneous speech transcription and word spotting.
Thoughts:
To me, the interesting thing about speech is just how
un-computer-like it is. Any designer who came up with a
communication system resembling speech would be held up to ridicule.
Just consider the variation in pronunciation that is acceptable: "I say [təmeːIto], you [təmato]..."
I would hazard a guess that if someone of foreign origin told
you that he was adding a [təmIto] or even a [təmʌto] to his
pasta sauce, you might not even notice how badly he'd mangled
that second vowel. Yet, despite its imprecision, humans do a
really good job of decoding speech.
I thought that the section on labelling in the paper was interesting, too.
All data sets are, of course, labelled by humans.
It almost seems comical that all the human accuracy tests
were run by (1) having humans label speech and then (2) having other
humans label the speech. The first humans represented the
"truth" and the second humans, the "test." Researchers
then proceeded to say something about the accuracy of human transcription
on the basis of the difference between the truth and test data even
though they were both generated by humans. So I appreciated
that Lippmann pointed out that the "truth" humans do
actually have a leg up on the "test" humans. The "truth"
humans are highly motivated and get access to the best recorded data.
In the case of spoken text, they also get access to the
original text the speakers were asked to say.
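The truth-versus-test comparison described above can be made concrete with a small sketch. It assumes the difference between the two transcripts is scored as a word error rate, a common metric in speech recognition: the number of word substitutions, deletions, and insertions needed to turn the "truth" transcript into the "test" transcript, divided by the number of "truth" words. The function name `word_error_rate` and the sample sentences are my own illustrations and are not taken from the paper.

```python
def word_error_rate(truth, test):
    """Word-level edit distance between two transcripts, divided by the truth length."""
    ref = truth.split()   # words from the "truth" labellers
    hyp = test.split()    # words from the "test" labellers (or a recognizer)
    if not ref:
        raise ValueError("truth transcript must contain at least one word")

    # prev[j] is the edit distance between the truth words seen so far
    # and the first j test words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i truth words into zero test words costs i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four "truth" words.
print(word_error_rate("one two three four", "one too three four"))  # 0.25
```

Note that the score is only as good as the "truth" transcript: any mistakes the first group of humans made are counted against the second group.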
Practical assignment
Nothing to write up this week.
Project
Work on the project can be found on my project page.
Christine Smit