The goal of this assignment is for you, the student, to implement basic algorithms for n-gram language modeling. This lab will involve counting n-grams and doing basic n-gram smoothing. For this lab, we will be working with Switchboard data. The Switchboard corpus is a collection of recordings of telephone conversations; participants were told to have a discussion on one of seventy topics (e.g., pollution, gun control).
The lab consists of the following parts, all of which are required:
Part 1: Implement n-gram counting --- Given some text, collect the counts of all n-grams needed to build a trigram language model.
Part 2: Implement “+delta” smoothing --- Write code to compute LM probabilities for a trigram model smoothed with “+delta” smoothing.
Part 3: Implement Witten-Bell smoothing --- Write code to compute LM probabilities for a trigram model smoothed with Witten-Bell smoothing.
Part 4: Evaluate various n-gram models on the task of N-best list rescoring --- See how n-gram order and smoothing affect word-error rate (WER) when doing N-best list rescoring for Switchboard.
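To give a feel for what Part 1 involves, here is one possible Python sketch of trigram-count collection. The `<s>`/`</s>` boundary tokens and the tuple-keyed dictionary layout are assumptions of this sketch, not requirements of the lab:

```python
from collections import defaultdict

def count_ngrams(sentences, n=3):
    """Collect counts of all 1- through n-grams for an n-gram LM.

    `sentences` is an iterable of token lists.  Each sentence is padded
    with (n-1) start tokens `<s>` and one end token `</s>` so that every
    word has a full history; note this means `<s>` itself gets counted
    as a unigram, a common but not universal convention.
    """
    counts = defaultdict(int)
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
        for order in range(1, n + 1):
            for i in range(len(padded) - order + 1):
                counts[tuple(padded[i:i + order])] += 1
    return counts
```

Storing every order in one dictionary keyed by tuples keeps the later smoothing code simple, since both c(history) and c(history, word) come from the same table.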
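The "+delta" smoothing of Part 2 adds a pseudo-count delta to every event, giving P(w | h) = (c(h, w) + delta) / (c(h) + delta * |V|). A minimal sketch, assuming counts live in a tuple-keyed dictionary as above (the default delta and function signature are illustrative):

```python
def plus_delta_prob(counts, history, word, vocab_size, delta=0.5):
    """P(word | history) under +delta smoothing:

        (c(history, word) + delta) / (c(history) + delta * |V|)

    `history` is a tuple of context words; `vocab_size` is |V|.
    """
    num = counts.get(tuple(history) + (word,), 0) + delta
    den = counts.get(tuple(history), 0) + delta * vocab_size
    return num / den
```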
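Witten-Bell smoothing (Part 3) interpolates the higher-order estimate with a lower-order one, giving the backoff distribution a weight proportional to the number of distinct words observed after the history. One possible recursive sketch; backing off to a uniform 1/|V| distribution at the unigram level is an assumption of this sketch:

```python
def witten_bell_prob(counts, history, word, vocab_size):
    """Recursive Witten-Bell estimate:

        P(w | h) = (c(h, w) + N1+(h) * P(w | h')) / (c(h) + N1+(h))

    where N1+(h) is the number of distinct words seen after history h,
    and h' is h with its oldest word dropped.
    """
    if not history:
        # Base case: smooth the unigram counts against a uniform distribution.
        total = sum(c for ng, c in counts.items() if len(ng) == 1)
        distinct = len({ng for ng in counts if len(ng) == 1})
        uniform = 1.0 / vocab_size
        return (counts.get((word,), 0) + distinct * uniform) / (total + distinct)
    h = tuple(history)
    c_h = counts.get(h, 0)
    # Number of distinct word types following this history.
    n1plus = len({ng[-1] for ng in counts
                  if len(ng) == len(h) + 1 and ng[:-1] == h})
    lower = witten_bell_prob(counts, h[1:], word, vocab_size)
    if c_h + n1plus == 0:
        return lower  # unseen history: fall back entirely to the lower order
    return (counts.get(h + (word,), 0) + n1plus * lower) / (c_h + n1plus)
```

A useful sanity check on any smoothing implementation is that the probabilities over the whole vocabulary sum to one for a fixed history.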
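For Part 4, N-best list rescoring combines each hypothesis's acoustic score with a language-model score and keeps the highest-scoring hypothesis. The schematic below assumes higher scores are better and that the LM weight and word-insertion penalty are tunable; the exact combination used in the lab may differ:

```python
def rescore_nbest(nbest, lm_logprob, lm_weight=10.0, word_penalty=0.0):
    """Pick the best hypothesis from an N-best list.

    `nbest` is a list of (acoustic_score, word_list) pairs and
    `lm_logprob` maps a word list to its LM log-probability.  Each
    hypothesis is scored as

        acoustic + lm_weight * lm_logprob(words) + word_penalty * len(words)

    and the word list of the top-scoring hypothesis is returned.
    """
    def total(hyp):
        acoustic, words = hyp
        return acoustic + lm_weight * lm_logprob(words) + word_penalty * len(words)
    return max(nbest, key=total)[1]
```

Sweeping the LM weight (and n-gram order and smoothing method, per the lab) and measuring WER on the rescored output is how the models are compared in this part.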
All of the files needed for the lab can be found in the directory ~stanchen/e6884/lab3/. Before starting the lab, please read the file lab3.txt; it contains all of the questions you will have to answer while doing the lab. Questions about the lab can be posted on Courseworks (https://courseworks.columbia.edu/); a discussion topic will be created for each lab. Note: The hyperlinks in this document are enclosed in square brackets; you need an online version of this document to find out where they point to.