5. Part 3: Evaluate different front ends (Required)

In this part, you will evaluate different portions of the front end you wrote in Part 1 to see how much each technique affects speech recognition performance. In addition, you will do runs on different test sets to see the difference in difficulties of various testing conditions (speaker-dependent, speaker-independent, etc.).

You will need to be in the directory ~/e6884/lab1/; all of the scripts will run the version of the program DcdDTW in the current directory. If you haven't done this already, type
smk DcdDTW
to compile this program with your front end code. This program is a (primitive) speech recognizer, also known as a decoder. We supply it with a training example (or template) for each word that we would potentially like to recognize. To recognize a new utterance, it uses dynamic time warping (DTW) to find the closest training example and returns the word class of that training example.

When using DTW to compare an utterance with a training example, both waveforms are processed with the given signal processing modules, and DTW is performed on the resulting feature vectors. Thus, the quality of the signal processing greatly affects decoding accuracy; the more salient the extracted features, the better the performance should be. Given a set of templates, a list of signal processing modules to apply, and a set of waveforms to recognize, DcdDTW loops through each test utterance in turn and uses DTW to select the word it thinks was spoken. At the end of the run, it outputs the overall error rate.
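To make the decoding procedure concrete, here is a rough sketch of DTW-based nearest-template classification and error-rate computation. This is not the actual DcdDTW source; the type names, the Euclidean frame distance, and the data layout are assumptions for illustration.

#include <algorithm>
#include <cmath>
#include <limits>
#include <string>
#include <utility>
#include <vector>

typedef std::vector<double> FeatVec;   // one frame of front-end features
typedef std::vector<FeatVec> FeatSeq;  // one utterance, frame by frame

// Euclidean distance between two frames (assumes equal dimensionality).
double frameDist(const FeatVec& a, const FeatVec& b) {
    double d = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        double diff = a[i] - b[i];
        d += diff * diff;
    }
    return std::sqrt(d);
}

// Standard DTW alignment cost between two feature sequences, allowing
// horizontal, vertical, and diagonal moves.
double dtwCost(const FeatSeq& x, const FeatSeq& y) {
    const size_t n = x.size(), m = y.size();
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<std::vector<double> > D(n + 1, std::vector<double>(m + 1, INF));
    D[0][0] = 0.0;
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j)
            D[i][j] = frameDist(x[i - 1], y[j - 1]) +
                      std::min(std::min(D[i - 1][j], D[i][j - 1]), D[i - 1][j - 1]);
    return D[n][m];
}

struct Template { std::string word; FeatSeq feats; };

// Decode one utterance: return the word of the closest template under DTW.
std::string decode(const FeatSeq& utt, const std::vector<Template>& templates) {
    double best = std::numeric_limits<double>::infinity();
    std::string bestWord;
    for (size_t t = 0; t < templates.size(); ++t) {
        double cost = dtwCost(utt, templates[t].feats);
        if (cost < best) { best = cost; bestWord = templates[t].word; }
    }
    return bestWord;
}

// Error rate (in percent) over a test set of (features, reference word) pairs.
double errorRate(const std::vector<std::pair<FeatSeq, std::string> >& tests,
                 const std::vector<Template>& templates) {
    int errors = 0;
    for (size_t i = 0; i < tests.size(); ++i)
        if (decode(tests[i].first, templates) != tests[i].second) ++errors;
    return 100.0 * errors / tests.size();
}

Both the test utterance and each template are assumed here to have already been passed through the same front end, so the DTW cost is computed over whatever feature vectors the selected signal processing modules produce.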

The test set used in this part consists of 11 utterances (one of each digit) from each of 10 speakers. While this test set is not large enough to reveal statistically significant differences between some of the contrast conditions, we kept it small so that the runs are quick; it should still be large enough to give you the basic idea. For each test speaker, templates from a single training speaker are used; in the first set of experiments, the training and test speaker in each run are the same. Each script performs ten different runs of DcdDTW (using different training and test speakers) and averages the results.

First, let's see how much each processing step in the front end matters. Run each of the following scripts:

script             description
lab1p3win.sh       windowing alone
lab1p3fft.sh       windowing + FFT
lab1p3mel.sh       windowing + FFT + mel-bin (w/o log)
lab1p3mellog.sh    windowing + FFT + mel-bin (w/ log)
lab1p3dct.sh       windowing + FFT + mel-bin + DCT
lab1p3noham.sh     windowing + FFT + mel-bin + DCT (w/o Hamming)

Remember to keep track of the results, since you will need to fill in these numbers in lab1.txt. The first two scripts are quite slow (since the output features are of high dimension), so be patient. In case you are wondering, the accuracy of running DTW on raw (time-domain) waveforms for this test set is 89.1% (accuracy, not error rate). You can run this yourself if you figure out how, but be warned that this run took about 12 hours.
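As a reminder of what two of the contrast conditions above toggle, here is a rough sketch (again, not the lab's actual front-end code) of the Hamming windowing step and of the DCT taken over the mel-bin energies with or without the log. The function names, the unnormalized DCT-II, and the small flooring constant inside the log are assumptions for illustration.

#include <cmath>
#include <vector>

const double kPi = 3.14159265358979323846;

// Hamming window: w[n] = 0.54 - 0.46 cos(2*pi*n/(N-1)), applied to each frame
// before the FFT. lab1p3noham.sh corresponds to skipping this step
// (i.e., using a rectangular window).
std::vector<double> hammingWindow(const std::vector<double>& frame) {
    const size_t N = frame.size();
    std::vector<double> out(N);
    for (size_t n = 0; n < N; ++n)
        out[n] = frame[n] * (0.54 - 0.46 * std::cos(2.0 * kPi * n / (N - 1)));
    return out;
}

// DCT of the (optionally log-compressed) mel-bin energies. The log step
// (present in lab1p3mellog.sh but not lab1p3mel.sh) and the DCT step (added
// in lab1p3dct.sh) are shown together here for brevity; the output is a
// vector of cepstral coefficients.
std::vector<double> melToCepstra(const std::vector<double>& melEnergy,
                                 size_t numCoeffs, bool useLog) {
    const size_t M = melEnergy.size();
    std::vector<double> cep(numCoeffs, 0.0);
    for (size_t k = 0; k < numCoeffs; ++k)
        for (size_t m = 0; m < M; ++m) {
            // Small floor avoids log(0) for empty mel bins.
            double e = useLog ? std::log(melEnergy[m] + 1e-10) : melEnergy[m];
            cep[k] += e * std::cos(kPi * k * (m + 0.5) / M);
        }
    return cep;
}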

Now, let's see what happens if we relax the constraint that for each test speaker, we use DTW templates from the same speaker (i.e., we no longer do speaker-dependent recognition). Run each of the following scripts (all use the full MFCC front end):

script          relation between test and template speaker
lab1p3sd.sh     same
lab1p3dgd.sh    same gender and part of US
lab1p3gd.sh     same gender
lab1p3si.sh     none (i.e., speaker-independent recognition)