4. Part 3: Implement Witten-Bell smoothing

Witten-Bell smoothing is this smoothing algorithm that was invented by some dude named Moffat, but dudes named Witten and Bell have generally gotten credit for it. It is significant in the field of text compression and is relatively easy to implement, and that's good enough for us.

Here's a rough motivation for this smoothing algorithm: One of the central problems in smoothing is how to estimate the probability of n-grams with zero count. For example, let's say we're building a bigram model and the bigram wi-1 wi has zero count, so PMLE(wi | wi-1) = 0. According to the Good-Turing estimate, the total mass of counts belonging to things with zero count in a distribution is the number of things with exactly one count. In other words, the probability mass assigned to the backoff distribution should be around N1(wi-1) / ch(wi-1), where N1(wi-1) is the number of words w' following wi-1 exactly once in the training data (i.e., the number of bigrams wi-1 w' with exactly one count). This suggests the following smoothing algorithm

where lambda is set to some value so that this probability distribution sums to 1, and Pbackoff(wi) is some unigram distribution that we can backoff to.

However, N1(wi-1) is kind of a finicky value; e.g., it can be zero even for distributions with lots of counts. Thus, we replace it with N1+(wi-1), the number of words following wi-1 at least once (rather than exactly once), and we fiddle with some of the other terms. Long story short, we get

For the backoff distribution, we can use an analogous equation:

The term ch(epsilon) is the 0-gram history count defined earlier, and N1+(epsilon) is the number of different words with at least one count. For the backoff distribution for the unigram model, we use the uniform distribution Punif(wi) = 1 / |V|. Trigram models are defined analogously.

If a particular distribution has no history counts, then just use the backoff distribution directly. For example, if when computing PWB(wi | wi-1) you find that the history count ch(wi-1) is zero, then just take PWB(wi | wi-1) = PWB(wi). Intuitively, if a history h has no counts, the MLE distribution PMLE(w | h) is not meaningful and should be ignored.

Your job in this part is to fill in the function get_prob_witten_bell(). This function should return the value PWB(wi | wi-2 wi-1) given a trigram wi-2 wi-1 wi. You will be provided with all of the relevant counts (which you computed for Part 1). Again, this routine corresponds to step (B) in the pseudocode listed in Section 2.1.

Your code will again be compiled into the program EvalLMLab3. To compile this program with your code, type
smk EvalLMLab3
To run this program with the same training and test set as before, run
lab3p3a.sh
The “correct” output can be found in the file p3a.out in ~stanchen/e6884/lab3/. Again, you should be able to match the values in this output just about exactly. This script is set up to print the smoothed probability you compute for each trigram in the test set. The file p3a.out also contains the results of some intermediate computations that may help you with debugging, but which you do not need to replicate.

The instructions in lab3.txt will ask you to run the script lab3p3b.sh, which does the same thing as lab3p3a.sh except on a different test set.