EE4830 Spring 2009

HW #6

Problems 1-3 are worth 100 points plus 15 bonus points, due Monday, May 5th, 11:59pm.
Problems 4.1-4.2 are worth 30 bonus points if you hand them in by May 7th, 11:59pm.

1) Decision surface for two rectangles with Gaussian and nearest neighbor classifiers (30pts)
We observe four data points for each of two pattern classes, y=1 and y=2, in 2D (x1, x2):

C1:  (0, -0.5), (0, 1.5), (1, -0.5), (1, 1.5)
C2:  (3.5, 5), (3.5, 3), (4.5, 3), (4.5, 5)

1.1) (10pts) Assume that both classes have Gaussian probability density functions and equal priors, i.e., P(y=1) = P(y=2) = 1/2.
Compute the means and covariance matrices of the two class-conditional distributions p(x|y=1) and p(x|y=2).
Obtain the equation of the Bayes decision boundary between class 1 and class 2, and sketch it on the x1-x2 plane.
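For 1.1, the sample statistics can be checked numerically in MATLAB; a minimal sketch (note cov(X,1) is the maximum-likelihood estimate, dividing by N rather than N-1):

X1 = [0 -0.5; 0 1.5; 1 -0.5; 1 1.5];    % class 1 samples, one point per row
X2 = [3.5 5; 3.5 3; 4.5 3; 4.5 5];      % class 2 samples
mu1 = mean(X1, 1);  S1 = cov(X1, 1);    % mean and ML covariance for p(x|y=1)
mu2 = mean(X2, 1);  S2 = cov(X2, 1);    % mean and ML covariance for p(x|y=2)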

1.2) (10pts) Does the decision boundary change if P(y=1) = 0.8, or if P(y=1) = 0.1? If yes, sketch or plot the respective new boundaries.
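(Hint: for Gaussian class densities the Bayes classifier compares the discriminants g_i(x) = ln p(x|y=i) + ln P(y=i), i = 1, 2, so changing the priors only adds the constant ln P(y=1) - ln P(y=2) to the comparison.)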

1.3) (10pts) Assume we use the nearest-neighbor classifier on these eight points. Draw the Voronoi diagram on the x1-x2 plane, and find the corresponding decision boundary.
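MATLAB's built-in voronoi function can be used to check your hand-drawn diagram; a minimal sketch, reusing X1 and X2 from the sketch in 1.1:

X = [X1; X2];               % all eight points
voronoi(X(:,1), X(:,2));    % plots the Voronoi cells of the eight points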

2) Recognizing hand-written digits (40 pts + 15 pt bonus)
Download a copy of the MNIST dataset here. Download the example scripts one by one as listed below.

Read, understand, and run the classification example hw6_example.m; it involves four implemented steps, with two more steps left to you:

%% 1. load the data %%%%%%%%%%%%

%% 2. Specify the training and testing data involved in this task
% .. we're working with four digits [2 4 6 8] only

%% 3. classification with minimum distance classifiers
% this needs mdist_learn.m and mdist_classify.m for learning and applying the minimum distance classifier
% .. this step should finish in a second or two and yield an error rate ~9%

%% 4. 1NN classification with Euclidean metric %%%%%%%%%%
% this needs NN_euclidean.m for performing 1-NN classification
% .. this step should finish in about 2.5 minutes and yield an error rate ~1.5%
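For reference, the minimum distance classifier of step 3 is the nearest-class-mean rule. A minimal sketch of the idea (the variable names are illustrative and need not match mdist_learn.m / mdist_classify.m; train/test hold one sample per row, ytrain holds the training labels):

% learning: one mean vector per class
classes = unique(ytrain);
nc = numel(classes);
means = zeros(nc, size(train, 2));
for c = 1:nc
    means(c, :) = mean(train(ytrain == classes(c), :), 1);
end
% classification: assign each test sample to the class with the closest mean
ypred = zeros(size(test, 1), 1);
for i = 1:size(test, 1)
    d = sum((means - repmat(test(i, :), nc, 1)).^2, 2);   % squared Euclidean distances
    [dmin, j] = min(d);
    ypred(i) = classes(j);
end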

Can we do better in classifying digits than (a) the simple minimum distance classifier or (b) the nearest-neighbor classifier?

2.1) (10 pts) Change the 1-nearest-neighbor algorithm into k-nearest-neighbor with the L1 norm (defined here). Run the classifier for k=2 to k=10, and report the classification error rates and the k value with the lowest error rate.
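A minimal sketch of k-NN under the L1 metric (the function name knn_L1 and its interface are made up for illustration; save it as knn_L1.m, with train/test holding one sample per row):

function ypred = knn_L1(train, ytrain, test, k)
% k-nearest-neighbor classification with the L1 (city-block) distance
ntest = size(test, 1);
ypred = zeros(ntest, 1);
for i = 1:ntest
    d = sum(abs(train - repmat(test(i, :), size(train, 1), 1)), 2);  % L1 distances
    [ds, idx] = sort(d);                 % ascending distances
    ypred(i) = mode(ytrain(idx(1:k)));   % majority vote among the k nearest
end

Note that mode breaks voting ties by picking the smallest label; you may prefer a different tie-breaking rule for even k.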

2.2) (10 pts) Find out which digits are misclassified by 1-NN and k-NN (use the optimal k value from 2.1) -- display their images and include them in your writeup, and also list the k-neighbor votes for these misclassified images. Are the errors reasonable?
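To display a misclassified digit, something like the following works if each row of the data matrix is a vectorized 28x28 MNIST image (the 28x28 shape, the row layout, and the variable names here are assumptions about the data format):

img = reshape(test(i, :), 28, 28);    % i indexes one misclassified test sample
imshow(img', []);                     % transpose if the storage is column-major
title(sprintf('true: %d, predicted: %d', ytest(i), ypred(i)));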

2.3) (10 pts) Apply the PCA/KL transform to the data vectors to reduce the dimensionality of the input space, followed by k-NN (use the optimal k value from 2.1). How many PCA dimensions are optimal in terms of classification error rate? Report the lowest classification error rate and the number of principal components you use (try d = 5:5:50).
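A minimal PCA sketch (train/test as in the earlier sketches; d is the number of components kept):

mu = mean(train, 1);
Xc = train - repmat(mu, size(train, 1), 1);                   % center the training data
[U, S, V] = svd(Xc, 'econ');                                  % columns of V are the principal directions
train_d = Xc * V(:, 1:d);                                     % top-d projection of the training data
test_d = (test - repmat(mu, size(test, 1), 1)) * V(:, 1:d);   % project test data with the same basis

Then run k-NN on train_d / test_d for each d in 5:5:50 and record the error rates.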

2.4) (10 pts) Discuss other possible ways to improve performance. List at least two approaches from what has been covered in class, in books, or in reference papers and websites. For each proposed approach, briefly describe why it may be useful for this task, especially what extra information is explicitly or implicitly used.
e.g., consider a perceptron or neural network (with netlab), or an SVM (with libSVM)

2.5) (15 bonus points) Implement one of your proposed solutions in 2.4 and see if it improves the result. Report how you choose the parameters in your implementation and compare its performance with k-NN.

2.6) 5 more bonus points for the lowest classification error rate obtained among all submissions of question 2.5.

3)  Source codes for letters and words (30pts)

In this problem we look at a language written with half the English alphabet. The designated 13 letters appear with the following probabilities.

Letter       a       b       c       e       g       i       l       m       o       r       s       t       u
Probability  0.1000  0.0600  0.0400  0.1200  0.0800  0.0800  0.0700  0.0500  0.1000  0.0600  0.0900  0.0800  0.0700

3.1) (5pts) Compute the entropy of the language, assuming the letters in any string are i.i.d. (see the sketch after 3.4).
3.2) (10 pts) Construct a binary Huffman code for these symbols, compute its average code-length.
3.3) (5 pts) Find the arithmetic code for the word "image". 
3.4) (10 pts) The arithmetic code for a five-letter word is 0.37700896. Decode this word.
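For 3.1, the i.i.d. entropy is H = -sum_i p_i log2 p_i; a one-line MATLAB sketch using the table above:

p = [0.10 0.06 0.04 0.12 0.08 0.08 0.07 0.05 0.10 0.06 0.09 0.08 0.07];
H = -sum(p .* log2(p));   % entropy in bits per letter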


Problem 4, Image reconstruction (due May 7, 30 bonus points)
4.1) [15 pts.] Solve problem 5.28 of Gonzalez and Woods' textbook.
4.2) [15 pts.] Solve problem 5.30 of Gonzalez and Woods' textbook.


Prepared by Lexing Xie < xlx at ee dot columbia dot edu >, 2009-04-20