1. Overview

The goal of this assignment is for you to implement basic algorithms for n-gram language modeling: counting n-grams and applying basic n-gram smoothing. For this lab, we will be working with Switchboard data. The Switchboard corpus is a collection of recorded telephone conversations in which participants were asked to discuss one of seventy topics (e.g., pollution, gun control).
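As a rough illustration of what counting and smoothing involve, here is a minimal Python sketch. It is not the lab's code. The function names (count_ngrams, add_delta_prob), the toy sentences, the <s> and </s> boundary markers, and the choice of add-delta smoothing are all assumptions made for illustration; the lab itself may call for a different smoothing method.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count n-grams over tokenized sentences.

    Each sentence is padded with n-1 start markers and one end marker
    (an assumed convention, not necessarily the one used in the lab).
    """
    counts = Counter()
    for tokens in sentences:
        padded = ["<s>"] * (n - 1) + list(tokens) + ["</s>"]
        for i in range(len(padded) - n + 1):
            counts[tuple(padded[i:i + n])] += 1
    return counts


def add_delta_prob(word, history, ngram_counts, history_counts, vocab_size, delta=1.0):
    """Add-delta estimate of P(word | history).

    (count(history, word) + delta) / (count(history) + delta * vocab_size)
    """
    numerator = ngram_counts[history + (word,)] + delta
    denominator = history_counts[history] + delta * vocab_size
    return numerator / denominator


# Toy data standing in for Switchboard transcripts (hypothetical).
sentences = [
    ["we", "discuss", "pollution"],
    ["we", "discuss", "gun", "control"],
]

bigrams = count_ngrams(sentences, 2)

# Total count of each history, obtained by summing over the words that follow it.
history_counts = Counter()
for ngram, c in bigrams.items():
    history_counts[ngram[:-1]] += c

# Vocabulary of predictable tokens (includes </s>, excludes <s>).
vocab = {ngram[-1] for ngram in bigrams}

p = add_delta_prob("pollution", ("discuss",), bigrams, history_counts, len(vocab))
print(bigrams[("we", "discuss")], p)
```

With delta set to 1 this reduces to add-one smoothing. Every word in the vocabulary, seen or unseen after a given history, receives a nonzero probability, and the probabilities for each history still sum to one.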

The lab consists of the following parts, all of which are required:

All of the files needed for the lab can be found in the directory ~stanchen/e6884/lab3/. Before starting the lab, please read the file lab3.txt; it contains all of the questions you will have to answer while doing the lab. Questions about the lab can be posted on Courseworks (https://courseworks.columbia.edu/); a discussion topic will be created for each lab. Note: The hyperlinks in this document are enclosed in square brackets; you need an online version of this document to see where they point.