2. Part 1: Implement the Viterbi algorithm and Gaussian likelihood evaluation

In this part, you will be implementing the interesting parts of a simple HMM decoder, i.e., the program that computes the most likely word sequence given an utterance. We have pre-trained the transition and observation probabilities of an HMM on data consisting of isolated digits, and this is the model you will be decoding with.

2.1. Background

As discussed in the slides for Lecture 5 (10/6), we can use an HMM to model each word in a vocabulary. For the lab, we use the same HMM topology and parameterization as in the slides. That is, for each word, we compute how many phonemes its canonical pronunciation has, and use three times that many states plus a final state. We line up the states linearly, and each (non-final) state has an arc to itself and to the next state. We place output distributions on arcs (as in the slides) rather than states (as in the readings), and each output distribution is modeled using a GMM. For each state, the GMM on each of its outgoing arcs is taken to be the same GMM.

For example, consider the word TWO whose canonical pronunciation can be written as the two phonemes T UW. Its HMM contains 2*3=6 states plus a final state. Intuitively, the GMM attached to the outgoing arcs of the first state can be thought of as modeling the feature vectors associated with the first third of the phoneme T, the second state's GMM models the second third of a T, etc. The topology of the HMM can be thought of as accommodating one or more feature vectors representing the first third of a T, followed by one or more feature vectors representing the second third of a T, etc.

Now, we can use word HMM's to perform isolated word recognition in the same way that recognition is performed with DTW. Instead of a training example (or template) for each word, we have a word HMM. Instead of computing an ad hoc distance between each word and the test example, we compute the probability that each word HMM assigns the test example, and select the word that assigns the highest probability. We can compute the total probability an HMM assigns to a sequence of feature vectors efficiently using dynamic programming.

However, this approach doesn't scale well, and we can do better using an alternate approach. Instead of scoring each word HMM separately, we can build one big HMM consisting of each word HMM glued together in parallel, and keep track of which word each part of the big HMM belongs to. Then, we can use the Viterbi algorithm on this big HMM, and do backtracing to recover the highest scoring path. By seeing which word HMM the best path belongs to, we know what the best word is. In theory, doing dynamic programming on this big HMM takes about the same amount of time as the corresponding computation on all of the individual word HMM's, but this approach lends itself better to pruning, which can vastly accelerate computation. Furthermore, this approach can easily be extended beyond isolated word recognition.

For example, consider the case of wanting to recognize continuous digit strings rather than single digits. We can do this by taking the big HMM we would use for single digits, and simply adding an arc from the final state back to the start state. The HMM can now accept digit strings consisting of multiple digits, and we can use the Viterbi algorithm to find the best word sequence in the same way as we did for isolated digits. In this case, the best path may loop through the machine several times, producing several words of output. We use the “one big HMM” framework in this lab.

2.2. The Decoder

For this part of the lab, we supply much of a decoder for you, and you have to fill in a couple parts. Here's an outline of what the decoder does:

You have to implement the part with the (*); we do everything else. Now, as part of the Viterbi algorithm, you need to compute a total probability for each arc at each frame, consisting of the static arc probability multiplied by the probability of the associated GMM generating the feature vector at that frame. We have thoughtfully provided a function that computes this for you: get_arc_log_prob(). However, this function calls the function LprObservR() that evaluates the likelihood that a Gaussian assigns to a feature vector, which you must also fill in. We take care of handling the static arc probabilities and shunting around the acoustic feature vectors to where they need to go, so you don't need to deal with this.

2.3. The Big HMM Graph

We store the big HMM in a structure of type GraphType. We assume states are numbered starting from 1. To find out the number of states in a graph graph, you can call graph.get_state_count(). We assume that HMM's have a single “start” state, which can be found by calling graph.get_start(). To find the outgoing arcs for a state, you can call graph.get_out_arcs(). (There is no easy way to find the incoming arcs for a state, but you don't need to do this for the lab.) Finally, you can call graph.get_arc_log_prob() to find the total (log) probability for an arc at a frame, i.e., the product of the static arc probability (usually denoted aij) and the observation probability (usually denoted bij(t) or some such). See the documentation in Lab2_DP.C for more details.

Arcs in the HMM are of type ArcType. To find the destination state of an arc arc, call arc.get_dest_state(). Under the covers, each arc also has a transition probability index that lets us look up the static arc probability, an index specifying which GMM is attached to it, and possibly a word label. We attach a word label to the final arc in each word HMM, so that when we do backtracing, we can figure out which word HMM's we pass through. However, you don't need to directly access any of these other fields for the lab.

Aside: in addition to having a start state (the state that all legal paths much start in), an HMM can also have final states, states that legal paths must end in. In our implementation, a graph can have multiple final states, and each final state can have a “final probability” that is multiplied in when computing the probability of a path ending there. However, you don't have to worry about this, because we supply all of the code dealing with final states.

2.4. The Dynamic Programming Chart

When doing dynamic programming as in any of the three main HMM tasks (likelihood computation, Viterbi, Forward-Backward), you need to fill in a matrix of values, e.g., forward probabilities or backtrace pointers or the such. This matrix of values is sometimes referred to as a dynamic programming chart, and the tuple of values you need to compute at each location in the chart is sometimes called a cell of the chart. Appropriately, we define a structure named ChartCell that can store the values you might possibly need in a cell, and allocate a matrix of cells for you in a variable named dpChart for you to fill in for the Viterbi computation.

For Viterbi, you should fill in the member forwLogProbM (this is a slight misnomer, since a forward probability is different than a Viterbi probability) and the backtrace pointer backPtrM for each cell in the chart.

One thing to note is that instead of storing probabilities or likelihoods directly, we will be storing log probabilities (base e). This is because if we store probabilities directly, we may cause numerical underflow. For more information, read Section 9.12 (p. 153) in Holmes!!! We're not kidding: if you don't understand the concepts in section 9.12 before starting, you're going to be in a world of pain!! (This can be found on the web site.) For example, we would initialize forwLogProbM for the start state at frame 0 to the logprob value 0, since ln 1 = 0. If we want to set a value to correspond to the probability 0, we would set it to the logprob -(infinity), or to the constant zeroLogProbS that we have provided which is pretty close. Hint: you may be tempted to convert log probabilities into regular probabilities to make things clearer in your mind, but resist the temptation! The reason we use log probabilities is because converting to a regular probability may result in an underflow.

2.5. What You're Supposed to Do

To prepare for the exercise, create the relevant subdirectory and copy over the needed files:
mkdir -p ~/e6884/lab2/
cd ~/e6884/lab2/
cp ~stanchen/e6884/lab2/Lab2_AM.C .
cp ~stanchen/e6884/lab2/Lab2_DP.C .
cp ~stanchen/e6884/lab2/.mk_chain .
Your job in this part is to fill in the sections between the markers BEGIN_LAB and END_LAB in two different functions: the function ViterbiLab2G in Lab2_DP.C implementing the Viterbi algorithm, and the function LprObservR in Lab2_AM.C implementing Gaussian likelihood evaluation. Read each of these files to see what input and output structures need to be accessed. In this lab, output distributions will be modeled with single Gaussians, not mixtures of Gaussians. In addition, we use diagonal-covariance Gaussians, so we need not store a full covariance matrix for each Gaussian, but only the covariances along the diagonal. Hint: remember that variances are sigma2, not sigma!

2.6. Compiling and testing

Your code will be compiled into the program DcdLab2. To compile this program with your code, type
smk DcdLab2
in the directory containing your source files (which must also contain the file .mk_chain).

To run this decoder on some utterances representing isolated digits, run
(This script can be found in ~stanchen/pub/exec/, which should be on your path from Lab 0.) This shell script starts the executable DcdLab2 located in the current directory with the appropriate flags. To start up DcdLab2 in the debugger, add the flag -debug to the above line.

This script takes some HMM's and GMM's that we have trained for you, and runs the decoder using your Viterbi implementation and Gaussian evaluator on ten utterances, each containing a single digit. The big HMM, or decoding graph, used in this run consists of an (optional) silence HMM followed by each digit HMM in parallel followed by another (optional) silence HMM.

The “correct” output of this script can be found in the file p1a.out in ~stanchen/e6884/lab2/. It's OK if your output doesn't match exactly, but it should be very, very close. In particular, look to see if the logprob for each utterance matches (i.e., the basic Viterbi algorithm and Gaussian eval are correct) and the output word sequence matches (i.e., the back pointers are correct).

To help with debugging, we have provided the file lab2p1a.debug which contains portions of the DP chart for the first utterance when running lab2p1a.sh. You should try to match these forwLogProbM values. Another hint: for frame 0 in the first utterance, the Gaussian associated with the outgoing arcs of state 1 should return a log prob of 20.273.

The instructions in lab2.txt will ask you to run the script lab2p1b.sh, which does the same thing as lab2p1a.sh except on a different test set.