ELEN E6880 - Summer 2003


Class Project

Due Date:  The day before the scheduled final - Submit deliverables via e-mail

Fifteen percent of the grade for the class will depend on a project.  We strongly encourage students to work in groups of 2 or 3.

The goal of this project is to provide significant experience employing a number of different learning algorithms over a real-world, complex dataset.  The challenge in this project is to choose the most effective preprocessing of the data into feature vectors for classification, and then to choose an effective algorithm applied in an effective way to this data.

Data Description (Updated 03/17/2002)

We consider a distributed, client-server computer system with an application running on a server (e.g., a web server or database server), and users accessing the application from their client workstations (e.g., browsing the web, or working with remote databases).

Our goal is to recover a sequence of a user's actions, or high-level commands, also called transactions, from a sequence of low-level commands, called Remote Procedure Calls (RPCs), which are recorded in the server's log files.

There are various reasons why we may want to do this.  For example, we may want to model user behavior in order to predict future actions (e.g., buyer vs. non-buyer on a web page) and adapt the web site's behavior to the particular user's interests, attract potential buyers, or simply have a more realistic user-behavior model in order to better test an application or manage its performance.
Unfortunately, it is hard to obtain information about a user's actions directly, since this requires additional instrumentation of the client side (i.e., modifying a web browser and/or asking a user to click a button before and after each high-level ''action'', which is clearly not realistic).

Luckily for us, there is a practically unlimited amount of historical data on the server, containing the sequences of RPCs executed there. Every execution of a user transaction leads to a sequence of RPC calls.  Hence, although we do not have information on transactions, we have many sequences of RPC calls.
However, the sequence of RPC calls corresponding to a specific transaction is not deterministic, since it depends on the current state of both client and server (i.e., whether certain databases are already open or not), on the parameters of the transaction (e.g., how many records must be deleted from a database), and on other sources of uncertainty.
Thus, we wish to learn a classifier that would be able to recognize a transaction type from a given sequence of RPCs that were invoked when executing this transaction.

We obtained a set of data labeled by an application expert, which gives us instances of RPC sequences with the corresponding transaction types.
 

Data format (updated 03/17/2002)

There are 59 different RPC types, denoted by numeric identifiers 1, 2, 3, ..., 59.
There are 32 different transaction types, each identified by a unique numeric label ranging from 1 to 32.
The data is stored in an ASCII file.   It is arranged in three columns, and satisfies the following constraints:
  1. the value in the first column is the unique transaction number; the value in the second column is the RPC identifier; the value in the third column is the transaction label.
  2. a sequence of successive rows having the same transaction number corresponds to the sequence of RPCs invoked during a specific transaction of the type given by the label.
  3. RPCs belonging to the same transaction (see 2. above) are listed in the order in which they were invoked.
  4. the sequence of RPCs belonging to an individual transaction is listed in a contiguous range of rows; that is, the data for two different transactions is not intermixed.
  5. the number of RPCs within an individual transaction varies, even across transactions of the same type.
  6. there is no time information in the file; the only time-related information you have is within individual transactions, where the sequence of RPCs is invoked in order starting from the top one.
  7. the data contains instances of all 32 different transaction types.
  8. the data contains instances of all 59 RPC types.
For example, consider the following small portion of the file:

4 49 20
4 44 20
4 42 20
4 41 20
4 22 20
4 9 20
5 57 27
5 57 27
5 55 27
5 16 27
5 26 27
5 46 27
5 57 27
6 57 27
6 57 27
6 55 27
6 16 27
6 26 27
6 46 27
6 57 27

The data must be interpreted as follows:

  1. the top 6 lines correspond to an instance of a transaction of type 20 (identified as the 4th transaction in the file by the number 4 in the first column), the next 7 lines correspond to a transaction of type 27 (the 5th transaction in the file), and the last 7 lines correspond to another transaction of type 27 (the 6th transaction in the file).
  2. the first transaction (of type 20) consisted of the following sequence of RPC calls: 49 44 42 41 22 9, in this order.
  3. the second transaction (of type 27) consisted of the following sequence of RPC calls: 57 57 55 16 26 46 57, in this order.
  4. the third transaction (also of type 27) consists of the following sequence of RPC calls: 57 57 55 16 26 46 57, in this order, and is identical to the previous one.
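The grouping rules above can be sketched in code. This is a minimal parsing sketch, not part of the assignment requirements; the function name and the assumption that the file path is passed in by the caller are illustrative:

```python
from itertools import groupby

def read_transactions(path):
    """Parse the three-column file into (label, rpc_sequence) pairs.

    Rows are grouped by the transaction number in the first column,
    since two consecutive transactions may share the same type label
    (e.g., transactions 5 and 6 in the example above are both type 27).
    """
    with open(path) as f:
        rows = [tuple(map(int, line.split())) for line in f if line.strip()]
    transactions = []
    for txn_id, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        label = group[0][2]            # transaction type from the third column
        rpcs = [r[1] for r in group]   # RPC identifiers, in invocation order
        transactions.append((label, rpcs))
    return transactions
```

Applied to the sample fragment above, this yields three transactions: one of type 20 with six RPCs, and two identical transactions of type 27 with seven RPCs each.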

Problem Statement

 Problem nr 1: for everyone

Assume that the data is segmented into sequences of RPCs corresponding to transactions.
You wish to construct a classifier that assigns a transaction label to each segmented sequence of RPCs.

15 points for this part (note: with this part alone you can achieve a score of 105% for the entire class).
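To make the task concrete, here is one possible (and deliberately naive) baseline, not a prescribed approach: represent each transaction by a 59-dimensional bag-of-RPCs count vector, which discards order, and classify with a nearest-centroid rule. All function names here are illustrative:

```python
from collections import defaultdict

NUM_RPC_TYPES = 59

def bag_of_rpcs(rpcs):
    """Map an RPC sequence to a 59-dimensional count vector (order discarded)."""
    vec = [0] * NUM_RPC_TYPES
    for r in rpcs:
        vec[r - 1] += 1   # RPC identifiers run from 1 to 59
    return vec

def train_centroids(transactions):
    """Average the count vectors of each transaction type (nearest-centroid baseline)."""
    sums = defaultdict(lambda: [0.0] * NUM_RPC_TYPES)
    counts = defaultdict(int)
    for label, rpcs in transactions:
        for i, v in enumerate(bag_of_rpcs(rpcs)):
            sums[label][i] += v
        counts[label] += 1
    return {lbl: [s / counts[lbl] for s in vec] for lbl, vec in sums.items()}

def classify(centroids, rpcs):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    x = bag_of_rpcs(rpcs)
    return min(centroids,
               key=lambda lbl: sum((a - b) ** 2 for a, b in zip(x, centroids[lbl])))
```

Part of your job is precisely to decide whether order-free features like these suffice, or whether sequence information (e.g., bigrams of consecutive RPCs) is needed.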

Problem nr 2: for all who want to do extra work for extra points

Prerequisite: the first part must have been satisfactorily completed; we will not accept the additional work if it appears that you rushed through the first part!
Assume that the data is not segmented into sequences of RPC calls, but that sequences of RPC calls corresponding to different transactions are not intermixed, just as in the training set.
You wish to construct an algorithm that automatically segments the sequence of RPC calls into transactions, and labels them with the classifier of Problem nr 1.

10 points for this part (note: since this is extra work, which you might want to do if you feel your grade needs a boost, we will grade this part more strictly than problem nr 1: if you decide to address problem nr 2, do it seriously; there are no bonus points for a bad piece of work).
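One possible (purely illustrative) way to organize the segmentation is dynamic programming over candidate boundary positions, scoring each candidate segment with the Problem nr 1 classifier. The sketch below assumes you supply a `score_fn` built from your own classifier; that name, and the `max_len` cap on segment length, are assumptions, not part of the assignment:

```python
def segment_sequence(rpcs, score_fn, max_len=50):
    """Split an unsegmented RPC stream into labeled transactions.

    score_fn(segment) must return (score, label), where higher scores mean
    the segment looks more like one complete transaction.  best[i] holds
    the best total score over any segmentation of the prefix rpcs[:i].
    """
    n = len(rpcs)
    best = [0.0] + [float("-inf")] * n
    back = [None] * (n + 1)   # (start_index, label) of the last segment
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            score, label = score_fn(rpcs[j:i])
            if best[j] + score > best[i]:
                best[i] = best[j] + score
                back[i] = (j, label)
    # Recover the segmentation by walking the back-pointers from the end.
    segments, i = [], n
    while i > 0:
        j, label = back[i]
        segments.append((label, rpcs[j:i]))
        i = j
    return segments[::-1]
```

This is only one family of approaches; sliding windows, HMM-style models, or boundary classifiers are equally legitimate directions.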
 

Additional Bonus

You will be working on a real, hard problem. The purpose of the project is for you to learn how to deal with data.   However, should a team successfully solve the problem in an innovative fashion, we will help said team to turn their report into a paper, and to submit the paper to an appropriate, prestigious conference.


Project Objectives and Requirements (aka: what you are supposed to do)

The execution of the assignment should follow a repeated hypothesis-test cycle.
  1. You hypothesize some method(s) for feature extraction and some choice of induction method(s) and method(s) of evaluation.
  2. You write some code for the data preprocessing, and possibly for the execution and evaluation.
  3. You run classification learning and testing experiments.
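Step 3 is usually organized as cross-validation, so that each hypothesis is evaluated on data it was not trained on. A minimal generic sketch, where `train_fn` and `predict_fn` are placeholders for whatever learner you hypothesized in step 1:

```python
import random

def cross_validate(transactions, train_fn, predict_fn, k=5, seed=0):
    """Estimate accuracy by k-fold cross-validation over labeled transactions.

    train_fn(training_set) returns a model;
    predict_fn(model, rpcs) returns a predicted transaction label.
    """
    data = transactions[:]
    random.Random(seed).shuffle(data)          # fixed seed: reproducible folds
    folds = [data[i::k] for i in range(k)]
    correct = total = 0
    for i in range(k):
        test = folds[i]
        train = [t for j in range(k) if j != i for t in folds[j]]
        model = train_fn(train)
        for label, rpcs in test:
            correct += (predict_fn(model, rpcs) == label)
            total += 1
    return correct / total
```

Recording the seed and fold count in your journal makes each experiment repeatable, which is exactly what the journal entries below are meant to capture.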

First Deliverable: a journal of your progress, in soft copy (simple text, LaTeX/PS/PDF, or Microsoft Word only).  This deliverable is meant to show that you worked at the problem; it should tell me the story of your struggle to produce the second deliverable.

As each of these steps is performed you should keep a journal of your progress.  At a minimum this journal should specify:
  1. The high-level objectives with the experimental design (e.g. reduce number of features, by removing unrelated features, and by hand creating linear combinations of related features.)
  2. A description of the specific approach taken.  (E.g. the method used to determine if a feature is irrelevant.)
  3. The code and parameter settings used in the experiment.
  4. The results of the experiment.
  5. The conclusions you have drawn from the experiment.  (E.g. the performance is slightly worse than the last experiment, but we don’t know if this effect is from the removal of features or the hand-combining of the features.)
The conclusions of one test or set of tests will lead to the hypotheses that drive the next test.

Teams of two or three students are encouraged to work together so that each team will be able to perform a number of these iterations, and so that each team will have multiple sources of input for choosing effective strategies.  The journals are intended to be relatively informal; the writing of the journal itself should not significantly impact the time it takes to complete each iteration of the cycle above.  Documenting parts (3) and (4) above may require that separate source code files, configuration files, and results files be included with the report.  For your convenience we will directly accept these data files, configuration files, and results files from each run in rather unprocessed form.  The goal is to allow you to quickly document that the experiment was carried out, without requiring you to massage all these pieces into a report.  When possible it would be preferable if you could simply cut and paste relevant data into the report or an appendix to the report.  Otherwise the report must at least mention the external data files by name and describe what they represent.

Second Deliverable:  a report of your conclusions, in soft copy (simple text, LaTeX/PS/PDF, or Microsoft Word only). THIS IS THE IMPORTANT DELIVERABLE.

At the end of the project you will write a report detailing:
  1. the approaches your team attempted on the problem
    1. description of feature construction - how did you come up with your features?
    2. description of feature selection - why did you select the features you selected?
    3. description of classifier selection - why did you select the classifiers you selected?
  2. your analysis of the approaches taken
    1. how did you set up the experiments?
    2. what were the results of the experiments?
    3. what are your conclusions regarding the features and the classifiers, given the experimental results?
  3. your analysis of the domain itself as a consequence of your overall work
Important points
  1. You should come up with a variety of features - and you should motivate your choices
  2. You should select a good variety of classifiers from the off-the-shelf available ones
  3. You should compare different classifiers as well as different variants of the same classifier, wherever appropriate (for example, if you use neural networks, you should see what happens when you vary the number of hidden nodes), or use model-selection techniques (for example, instead of trying different numbers of hidden nodes by hand, use model selection to choose the number of hidden nodes)
  4. You should clearly describe the experimental settings
  5. If you feel adventurous, you could even build your own classifier
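Point 3's model selection can be as simple as choosing the hyperparameter value (e.g., the number of hidden nodes) that maximizes held-out validation accuracy. A generic sketch; the function names and the split into `train_fn`/`eval_fn` are illustrative assumptions:

```python
def select_model(candidates, train_fn, eval_fn):
    """Pick the hyperparameter value with the best validation score.

    candidates: iterable of hyperparameter values (e.g., hidden-node counts);
    train_fn(h) returns a model trained with hyperparameter h;
    eval_fn(model) returns that model's accuracy on a held-out validation set.
    """
    best_h, best_model, best_score = None, None, float("-inf")
    for h in candidates:
        model = train_fn(h)
        score = eval_fn(model)
        if score > best_score:
            best_h, best_model, best_score = h, model, score
    return best_h, best_model, best_score
```

Whatever selection scheme you use, report the candidate grid and the validation protocol, not just the winning setting.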
Unlike the journal this report is intended to be a polished piece of writing.  You are free to refer back to your journal for details in this report, but the report should stand as a summary of the work done by your team, containing your results and your conclusions.


Resources

Data

    This is the dataset.  By downloading the dataset you accept its terms of usage.

Software

The goal of the project is NOT to write your own software package that implements learning algorithms.  There is a wealth of such packages available for free over the network.  We have identified 2 groups of such products: one for Matlab hardcore users, and one for those who really like coffee, er... Java!

Papers

These are papers that you might find interesting.  Most of them are available either through the web, the IEEE digital library, or the ACM digital library.  Should you be unable to obtain some of them, please contact Vittorio.

Last Updated 03/25/2002