ELEN E6880 - Summer 2003
Class Project
Due Date: The day before the scheduled final - Submit deliverables
via e-mail
Fifteen percent of the grade for the class will depend on a project.
We strongly encourage students to work in groups of 2 or 3.
The goal of this project is to provide significant experience employing
a number of different learning algorithms over a real-world, complex dataset.
The challenge in this project is to choose the most effective preprocessing
of the data into feature vectors for classification, and then to choose an
effective algorithm applied in an effective way to this data.
In this file you will find: a data description, the problem statement, project objectives and requirements, and resources.
Data Description (Updated 03/17/2002)
We consider a distributed, client-server computer system with an application
running on a server (e.g., web server or database server), and users accessing
the application from their client workstations (e.g., browsing the web,
or working with remote databases).
Our goal is to recover a sequence of user actions, or high-level
commands, also called transactions, from a sequence of low-level
commands, called Remote Procedure Calls (RPCs), which are recorded
in the server's log files.
There are various reasons why we may want to do this. For example,
we may want to model user behavior in order to predict future actions
(e.g., buyer vs. non-buyer on a web page) and adapt the web site's behavior
to the particular user's interests, attract potential buyers, or simply build
a more realistic user-behavior model in order to better test an application
or manage its performance.
Unfortunately, it is hard to obtain information about a user's actions directly,
since this requires additional instrumentation of the client side (i.e.,
modifying a web browser and/or asking the user to click a button before and after
each high-level "action", which is clearly not realistic).
Luckily for us, there is a practically unlimited amount of historical data
on the server, containing the sequence of RPCs executed on the server.
Every user transaction execution leads to a sequence of RPC calls. Hence,
although we do not have information on transactions, we have a lot of sequences
of RPC calls.
However, a sequence of RPC calls corresponding to a specific transaction is
not deterministic, since it depends on the current state of both client and
server (i.e., whether certain databases are already opened or not), on the
parameters of the transaction (e.g., how many records must be deleted
from a database), and on other sources of uncertainty.
Thus, we wish to learn a classifier that would be able to recognize a
transaction type from a given sequence of RPCs that were invoked when executing
this transaction.
We obtained a set of data labeled by an application expert, which gives
us instances of RPC sequences with the corresponding transaction types.
Data format (updated 03/17/2002)
There are 59 different RPC types, denoted by numeric identifiers: 1, 2, 3, ..., 59.
There are 32 different transaction types, identified by a unique numeric
label ranging from 1 to 32.
The data is stored in an ASCII file. It is arranged in three columns,
and satisfies the following constraints:
- the value in the first column is the unique transaction number, the
value in the second column is the RPC identifier, and the value in the third
column is the transaction label.
- a sequence of successive rows having the same transaction number corresponds
to the sequence of RPCs invoked during a specific transaction of the type
described by the label.
- RPCs belonging to the same transaction (i.e., sharing the same transaction
number) are listed in the order in which they were invoked.
- the sequence of RPCs belonging to an individual transaction is listed
in a contiguous range of rows; that is, the data for two different transactions
are not intermixed.
- the number of RPCs within an individual transaction varies, even across
transactions of the same type.
- there is no time information in the file; the only time-related information
you have is within individual transactions, where the sequence of RPCs is
invoked in order starting from the top one.
- the data contains instances of all the 32 different transaction types
- the data contains instances of all the 59 RPC types.
For example, consider the following small portion of the file:
4 49 20
4 44 20
4 42 20
4 41 20
4 22 20
4 9 20
5 57 27
5 57 27
5 55 27
5 16 27
5 26 27
5 46 27
5 57 27
6 57 27
6 57 27
6 55 27
6 16 27
6 26 27
6 46 27
6 57 27
The data must be interpreted as follows:
- the top 6 lines correspond to an instance of transaction of type 20
(identified as the 4th transaction in the file by the number 4 in the first
column), the next 7 lines correspond to a transaction of type 27 ( the 5th
transaction in the file), and the last 7 lines correspond to another
transaction of type 27 (the 6th transaction in the file)
- the first transaction (of type 20) consisted of the following
sequence of RPC calls: 49 44 42 41 22 9, in this order.
the second transaction (of type 27) consists of the following sequence
of RPC calls: 57 57 55 16 26 46 57, in this order
the third transaction (also of type 27) consists of the following sequence
of RPC calls: 57 57 55 16 26 46 57, in this order, and is identical
to the previous one
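As a sketch, the three-column format above can be parsed into labeled RPC sequences in a few lines of Python. The function name and the assumption of whitespace-separated integer columns are ours for illustration, not part of the assignment:

```python
from itertools import groupby

def parse_transactions(lines):
    """Parse rows of 'txn_number rpc_id label' into (label, rpc_sequence) pairs.

    Rows sharing the same transaction number (first column) form one
    transaction; RPCs appear in invocation order. Whitespace-separated
    integer columns are assumed.
    """
    rows = [tuple(int(tok) for tok in ln.split()) for ln in lines if ln.strip()]
    transactions = []
    for txn_id, group in groupby(rows, key=lambda r: r[0]):
        group = list(group)
        label = group[0][2]           # third column: transaction label
        rpcs = [r[1] for r in group]  # second column: RPC ids, in order
        transactions.append((label, rpcs))
    return transactions
```

Applied to the sample rows above, this yields one transaction of type 20 followed by two (identical) transactions of type 27.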
Problem Statement
Problem nr 1: for everyone
Assume that the data is segmented into sequences of RPCs corresponding
to transactions
You wish to construct a classifier that assigns a transaction label to
each segmented sequence of RPCs.
15 points for this part (note: with this part alone you can achieve
a score of 105% for the entire class).
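One simple baseline representation for Problem 1 (our illustration, not a required or necessarily the most effective choice) is a bag-of-RPCs histogram, which turns a variable-length sequence into a fixed-length feature vector for any off-the-shelf classifier:

```python
def rpc_histogram(rpcs, n_rpc_types=59):
    """Bag-of-RPCs feature vector: counts of each RPC type, ignoring order.

    This discards ordering information, which n-gram or HMM-based
    features would retain; it is only one possible preprocessing.
    """
    vec = [0] * n_rpc_types
    for r in rpcs:
        vec[r - 1] += 1  # RPC identifiers are 1-based
    return vec
```

Part of your job is to decide whether such order-free features suffice, or whether sequence-aware features do better.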
Problem nr 2: for all who want to do extra work for extra points
Prerequisite: the first part should have been satisfactorily
completed: we will not accept the additional work if it appears that you
rushed through the first part!
Assume that the data is not segmented into sequences of RPC calls. Assume
that sequences of RPC calls corresponding to different transactions are not
intermixed, just like in the training set.
You wish to construct an algorithm that automatically segments the sequence
of RPC calls into transactions, and labels them with the classifier of Problem
nr 1.
10 points for this part (note: since this is extra work,
which you might want to do if you feel your grades need a boost, we will grade
this part more strictly than problem nr 1: if you decide to address problem
nr 2, do it seriously: there are no bonus points for a bad piece of work).
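To make Problem 2 concrete, here is one possible (hypothetical) formulation: if the Problem 1 classifier can assign a score to any candidate segment, the unsegmented RPC stream can be cut optimally by dynamic programming. The `score` callback interface and the `max_len` cap are our assumptions for illustration:

```python
def segment(rpcs, score, max_len=50):
    """Dynamic-programming segmentation of an unsegmented RPC stream.

    score(seq) should return the (log-)score of the best transaction
    label for a candidate segment, e.g. from a Problem 1 classifier.
    Returns cut indices so that rpcs[c[i]:c[i+1]] are the segments.
    """
    n = len(rpcs)
    best = [float("-inf")] * (n + 1)
    best[0] = 0.0
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            s = best[i] + score(rpcs[i:j])
            if s > best[j]:
                best[j], back[j] = s, i
    # Recover cut points by walking back-pointers from the end.
    cuts = [n]
    while cuts[-1] > 0:
        cuts.append(back[cuts[-1]])
    return cuts[::-1]
```

For example, with a toy score that rewards only length-3 segments, a 6-element stream is cut into two segments of three.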
Additional Bonus
You will be working on a real, hard problem.
The purpose of the project is for you to learn how to deal with data.
However, should a team successfully solve the problem in an innovative fashion,
we will help said team to turn their report into a paper, and to submit the
paper to an appropriate, prestigious conference.
Project
Objectives and Requirements (aka: what you are supposed to do)
The execution of the assignment should follow a repeated hypothesis-test
cycle.
- You hypothesize some method(s) for feature extraction and some choice
of induction method(s) and method(s) of evaluation.
- You write some code for the data preprocessing, and possibly for the
execution and evaluation.
- You run classification learning and testing experiments.
First Deliverable: a journal of your progress, in soft copy (simple
text, LaTeX/PS/PDF, or Microsoft Word only). This
deliverable is meant to show that you worked at the problem - it should tell
me the story of your struggle to produce the second deliverable.
As each of these steps is performed you should keep a journal of your
progress. At a minimum this journal should specify:
- The high-level objectives of the experimental design (e.g. reduce the
number of features, by removing unrelated features and by hand-creating
linear combinations of related features.)
- A description of the specific approach taken. (E.g. the method
used to determine if a feature is irrelevant.)
- The code and parameter settings used in the experiment.
- The results of the experiment.
- The conclusions you have drawn from the experiment. (E.g. the
performance is slightly worse than the last experiment, but we don’t know
if this effect is from the removal of features or the hand-combining of the
features.)
The conclusions of one test or set of tests will lead to the hypotheses that
drive the next test.
Teams of two or three students are encouraged to work together so that
each team will be able to perform a number of these iterations, and so that
each team will have multiple sources of input for choosing effective strategies.
The journals are intended to be relatively informal; the writing of the journal
itself should not significantly impact the time it takes to complete each
iteration of the cycle above. Documenting the code, parameter settings,
and results above may require that separate source code files, configuration
files, and results files be included with the report. For your convenience we will directly
accept these data files, configuration files, and results files from each
run in rather unprocessed form. The goal is to allow you to quickly
document that the experiment was carried out, without requiring you
to massage all these pieces into a report. When possible it would be
preferable if you can simply cut and paste relevant data into the report
or an appendix to the report. Otherwise the report must at least mention
the external data files by name and describe what they represent.
Second Deliverable: a report of your conclusions, in soft copy
(simple text, LaTeX/PS/PDF, or Microsoft Word only). THIS IS THE IMPORTANT DELIVERABLE.
At the end of the project you will write a report detailing
- the approaches your team attempted on the problem
- description of feature construction - why did you come up with your
features
- description of feature selection - why did you select the features
you selected
- description of classifier selection - why did you select the
classifiers you selected
- your analysis of the approaches taken
- how did you set up the experiments?
- what were the results of the experiments?
- what are your conclusions regarding the features and the classifiers
given the experimental results?
- your analysis of the domain itself as a consequence of your overall
work
Important points
- You should come up with a variety of features - and you should motivate
your choices
- You should select a good variety of classifiers from the off-the-shelf
available ones
- You should compare different classifiers as well as different variants
of the same classifier, wherever appropriate (for example, if you use Neural
Networks, you should see what happens when you vary the number of hidden
nodes), or use model-selection techniques (for example, instead of hand-picking
different numbers of hidden nodes, use model-selection techniques to select
the number of hidden nodes)
- You should clearly describe the experimental settings
- If you feel adventurous, you could even build your own classifier
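The model-selection suggestion above (e.g. choosing the number of hidden nodes) can be sketched as plain k-fold cross-validation. The `fit_score` callback interface below is our assumption for illustration; in practice it would wrap whatever package you choose:

```python
import random

def kfold_select(data, candidates, fit_score, k=5, seed=0):
    """Model selection by k-fold cross-validation.

    For each candidate setting (e.g. a number of hidden nodes),
    fit_score(setting, train, test) should train on `train` and return
    accuracy on `test`. Returns the candidate with the best mean
    held-out score.
    """
    data = list(data)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]

    def mean_cv(setting):
        scores = []
        for i in range(k):
            test = folds[i]
            train = [x for j, f in enumerate(folds) if j != i for x in f]
            scores.append(fit_score(setting, train, test))
        return sum(scores) / k

    return max(candidates, key=mean_cv)
```

This selects the setting on held-out folds rather than on training accuracy, which is the point of the suggestion above.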
Unlike the journal this report is intended to be a polished piece of writing.
You are free to refer back to your journal for details in this report, but
the report should stand as a summary of the work done by your team, containing
your results and your conclusions.
Resources
Data
This is the dataset. By downloading the dataset
you accept the following terms of usage:
- the dataset will be used only for the project of the ELEN E6880 course,
and will be deleted afterwards.
- the dataset cannot be distributed, in its present form or in any modified
form, to anybody who is not currently enrolled in the ELEN E6880 course.
Software
The goal of the project is NOT to write your own software package that implements
learning algorithms. There is a wealth of such packages available for
free over the network. We have identified two groups of such products: one
for Matlab hardcore users, and one for those who really like coffee, er...
Java!
- For Matlab users
NOTE: if you decide to use Matlab, you cannot just use Netlab, because
it only supports neural networks.
- For Java users
- Weka
(follow the "software" link at the top of the page)
Papers
These are papers that you might find interesting. Most of them are
available either through the web or through the IEEE digital library or the
ACM digital library. Should you be unable to obtain some of them, please
contact Vittorio.
These are archives of papers
Last Updated 03/25/2002