In this part, you will evaluate different portions of the front end you wrote in Part 1 to see how much each technique affects speech recognition performance. In addition, you will do runs on different test sets to see the difference in difficulties of various testing conditions (speaker-dependent, speaker-independent, etc.).
You will need to be in the directory ~/e6884/lab1/; all of the scripts will run the version of the program DcdDTW in the current directory. If you haven't done this already, type
smk DcdDTW
When using DTW to compare an utterance with a training example, both waveforms are processed using the given signal processing modules, and DTW is performed on the resulting feature vectors. Thus, the quality of the signal processing will greatly affect the accuracy of decoding; the more salient the features that are extracted, the better the performance should be. Given a set of templates, a list of signal processing modules to apply, and a set of waveforms to recognize, the program DcdDTW loops through each test utterance in turn and performs DTW to select which word it thinks it is. At the end of the run, it outputs the overall error rate.
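To make the decoding loop concrete, here is a minimal sketch of template-based recognition with classic DTW. The function and variable names (dtw_distance, decode, templates) are illustrative and not the actual internals of DcdDTW; assume each utterance has already been converted to a sequence of feature vectors by the front end.

```python
import numpy as np

def dtw_distance(a, b):
    """DTW distance between two feature-vector sequences (T1 x D and T2 x D)."""
    T1, T2 = len(a), len(b)
    cost = np.full((T1 + 1, T2 + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[T1, T2]

def decode(utterance, templates):
    """Pick the word whose template is closest to the utterance under DTW."""
    return min(templates, key=lambda w: dtw_distance(utterance, templates[w]))
```

A decoder like DcdDTW runs decode on every test utterance and counts how often the chosen word differs from the reference, which yields the overall error rate it reports.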
The test set used in this part consists of 11 utterances (one of each digit) from each of 10 speakers. While this test set is not large enough to reveal statistically significant differences between some of the contrast conditions, we kept it small so that the runs are quick; it should still be large enough to give you the basic idea. For each test speaker, templates from a single training speaker are used. In the first set of experiments, the training and test speaker in each run are the same. Each script performs ten different runs of DcdDTW (using different training and test speakers) and averages the results.
First, let's see how much each processing step in the front end matters. Run each of the following scripts:
    script            description
    ---------------   -----------------------------------------------
    lab1p3win.sh      windowing alone
    lab1p3fft.sh      windowing + FFT
    lab1p3mel.sh      windowing + FFT + mel binning (w/o log)
    lab1p3mellog.sh   windowing + FFT + mel binning (w/ log)
    lab1p3dct.sh      windowing + FFT + mel binning + DCT
    lab1p3noham.sh    windowing + FFT + mel binning + DCT (w/o Hamming)
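For reference, the stages these scripts toggle can be sketched for a single frame as follows. This is a simplified illustration with assumed parameters (sampling rate, filter count, number of cepstra); the lab's actual front-end modules may use different settings and filter shapes.

```python
import numpy as np

def mel(hz):
    """Convert Hz to the mel scale."""
    return 1127.0 * np.log(1.0 + hz / 700.0)

def mel_filterbank(n_filt, n_fft, sr):
    """Triangular filters spaced evenly on the mel scale (simplified)."""
    pts = np.linspace(mel(0.0), mel(sr / 2.0), n_filt + 2)
    hz = 700.0 * (np.exp(pts / 1127.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz / sr).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, sr=16000, n_filt=24, n_ceps=13, use_hamming=True):
    """Windowing -> FFT -> mel binning -> log -> DCT, as in the table above."""
    n = len(frame)
    if use_hamming:
        frame = frame * np.hamming(n)                 # windowing
    power = np.abs(np.fft.rfft(frame)) ** 2           # FFT -> power spectrum
    melspec = mel_filterbank(n_filt, n, sr) @ power   # mel binning
    logmel = np.log(melspec + 1e-10)                  # log compression
    m = np.arange(n_filt)                             # DCT-II -> cepstra
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), m + 0.5) / n_filt)
    return dct @ logmel
```

Dropping a stage (e.g. passing use_hamming=False, or stopping before the log or the DCT) corresponds to the contrast conditions the scripts evaluate.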
Now, let's see what happens if we relax the constraint that for each test speaker, we use DTW templates from the same speaker (i.e., we no longer do speaker-dependent recognition). Run each of the following scripts (all use the full MFCC front end):