SA-CLR 2.2 Instructions

Input Files

The main input file is the data file: a tab-delimited matrix of expression values. Each column represents an array, and each row represents a gene. The first row begins with a tab, and then the array labels. The first column contains the probe labels (starting on the second row). There can also be probe files that contain lists of probes to select transcription factors, partners or targets. These are simply text files with one probe label per line. Finally, there is an optional phenotype file. This is a tab-delimited file. The first row is not used (can contain column headers). The first column has array labels and the second column has the value 0 or 1, indicating the phenotype. See testDataFor2.txt for an example data file, testTFsFor2.txt for an example probe file and testPhen.txt for an example phenotype file.
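
As a minimal sketch of the three file layouts described above, the following Python snippet writes a tiny data matrix, a probe file and a phenotype file, then reads the matrix back. The function names, file names, probe labels and values are all made up for illustration; they are not part of SA-CLR and not the contents of the bundled test files.

```python
import os
import tempfile


def write_data_file(path, array_labels, rows):
    """Write a tab-delimited matrix whose header row starts with a tab."""
    with open(path, "w") as handle:
        handle.write("\t" + "\t".join(array_labels) + "\n")
        for probe, values in rows.items():
            handle.write(probe + "\t" + "\t".join(str(v) for v in values) + "\n")


def read_data_file(path):
    """Return (array_labels, {probe_label: [values]}) from a data file."""
    with open(path) as handle:
        header = handle.readline().rstrip("\n").split("\t")
        array_labels = header[1:]  # the first header cell is empty
        matrix = {}
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) > 1:
                matrix[fields[0]] = [float(v) for v in fields[1:]]
    return array_labels, matrix


def write_probe_file(path, probes):
    """Write one probe label per line."""
    with open(path, "w") as handle:
        handle.write("\n".join(probes) + "\n")


def write_phenotype_file(path, phenotypes):
    """Write array label and 0/1 phenotype; the first row is an unused header."""
    with open(path, "w") as handle:
        handle.write("array\tphenotype\n")
        for label, value in phenotypes.items():
            handle.write(f"{label}\t{value}\n")


def round_trip_example():
    """Write the three hypothetical files to a temp directory and re-read the matrix."""
    arrays = ["array1", "array2", "array3"]
    rows = {"probeA": [1.5, 2.0, 0.25], "probeB": [0.0, 3.5, 4.0]}
    with tempfile.TemporaryDirectory() as folder:
        data_path = os.path.join(folder, "exampleData.txt")
        write_data_file(data_path, arrays, rows)
        write_probe_file(os.path.join(folder, "exampleTFs.txt"), ["probeA"])
        write_phenotype_file(
            os.path.join(folder, "examplePhen.txt"),
            {"array1": 0, "array2": 1, "array3": 1},
        )
        return read_data_file(data_path)
```

Calling `round_trip_example()` returns the array labels and the parsed matrix, which is a quick way to check that a hand-built data file follows the expected layout.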

Optionally, an annotation file can also be present. This is used to map probe sets to genes. Affymetrix annotation files available from their website (in CSV format) will work automatically. Otherwise, the file should be a comma-delimited file with at least one column labeled Probe Set ID and another column labeled Gene Symbol. If an annotation file is provided, the results will include both probe set IDs and gene symbols.
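
The probe-to-gene mapping described above can be sketched in Python as follows. This is only an illustration of the expected column layout, not the parser used by SA-CLR. The function name, the sample rows and the choice to skip lines starting with `#` (an assumption about comment headers in some vendor files) are all hypothetical.

```python
import csv
import io

REQUIRED_COLUMNS = ("Probe Set ID", "Gene Symbol")

# Made-up sample content; real annotation files have many more columns.
EXAMPLE_ANNOTATIONS = (
    "Probe Set ID,Gene Symbol,Notes\n"
    '"probe_1","GENE_A","made up"\n'
    '"probe_2","GENE_B","made up"\n'
)


def load_annotations(handle):
    """Map probe set IDs to gene symbols from a comma-delimited file."""
    # Assumption: leading comment lines, if any, start with '#'.
    reader = csv.DictReader(line for line in handle if not line.startswith("#"))
    columns = reader.fieldnames or []
    missing = [name for name in REQUIRED_COLUMNS if name not in columns]
    if missing:
        raise ValueError("annotation file lacks column(s): " + ", ".join(missing))
    return {row["Probe Set ID"]: row["Gene Symbol"] for row in reader}
```

For example, `load_annotations(io.StringIO(EXAMPLE_ANNOTATIONS))` yields a dictionary with one entry per probe set, and a file without either required column raises a `ValueError`.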

These files are all specified in a configuration file. This file consists of key=value pairs, one per line. Lines beginning with a # are ignored. The following configuration parameters are supported:

`data`	Specifies the data file.
`tfs`	Specifies the probe file for the transcription factors, or `all` to indicate that all probes are TFs. If missing, `all` is assumed.
`targets`	Specifies the probe file for the targets, or `all` to indicate that all probes are targets, or `phenotype` to indicate that the phenotype should be used as the target. If missing, `all` is assumed.
`partners`	Specifies the probe file for the partners, or `all` to indicate that all probes could be partners. If missing, `all` is assumed.
`phenotypes`	Specifies the phenotype file (only required if the phenotype is used as the target).
`features`	Specifies the chip feature file (for now, only the format used by the DREAM5 challenge is accepted; the file must have a DeletedGenes column and an OverexpressedGenes column).
`time_series`	Time series file for gene perturbation filter.
`annotations`	Specifies the annotation file (optional).
`bins`	The number of bins to use in the spline entropy estimation.
`spline_order`	Spline order to use in the spline function.
`topscores`	The number of top scores to retain at the end of the analysis (defaults to 100,000 if missing).
`corr_filter`	Whether to filter out all pairs with negative correlation (`true` or `false`).
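
To make the key=value format concrete, here is a minimal Python sketch that parses such a configuration and applies the defaults listed above. It is not the parser used by SA-CLR: the function name, the whitespace handling and the validation checks are assumptions, and the example configuration is illustrative rather than the contents of the bundled testConf.txt.

```python
# Defaults taken from the parameter descriptions above.
DEFAULTS = {"tfs": "all", "targets": "all", "partners": "all", "topscores": "100000"}

# Illustrative configuration text (hypothetical, not the bundled testConf.txt).
EXAMPLE_CONFIG = """\
# Toy configuration
data=testDataFor2.txt
tfs=testTFsFor2.txt
bins=4
"""


def parse_config(text):
    """Parse key=value lines, ignoring blank lines and lines starting with '#'."""
    config = dict(DEFAULTS)
    for raw in text.splitlines():
        line = raw.strip()  # assumption: surrounding whitespace is not significant
        if not line or line.startswith("#"):
            continue
        key, separator, value = line.partition("=")
        if not separator:
            raise ValueError(f"not a key=value line: {raw!r}")
        config[key.strip()] = value.strip()
    if "data" not in config:
        raise ValueError("the data parameter is missing")
    if config["targets"] == "phenotype" and "phenotypes" not in config:
        raise ValueError("targets=phenotype needs a phenotypes file")
    return config
```

With the example text, `parse_config(EXAMPLE_CONFIG)` keeps the explicit `data`, `tfs` and `bins` values and fills in `all` for `targets` and `partners`.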

The idea is that these configuration files can be prepared for various data sets and kept around for easy tracking and reproducibility of previous work.

Running Steps

The software can be run directly with the command line:

java -jar saclr-2.2.jar edu.columbia.ee.saclr.Saclr [OPTIONS] COMMAND

The above [OPTIONS] are the following:

-h Displays some help information and exits.

-c <FILE> Specifies the configuration file (required).

-p <SEED> Specifies a permutation seed. The targets only will be permuted with this seed. A value of zero means to not permute at all.

-v Specifies verbose mode, which will produce much more information as the software runs.

Only the -c parameter is required, although -v is recommended. The COMMAND parameter is required and is one of the following:

`2RUN`	Runs the 2D mutual information processing step.
`3RUN`	Runs the 3D mutual information processing step, storing synergy for each tf-target pair.
`3RUN3D`	Runs the 3D mutual information processing step, storing 3-way MI for each tf-target-partner triplet. (This can require a lot of storage.)
`CLR`	Merges 2D scores with CLR background correction.
`SACLR`	Merges 3D scores given by `3RUN` with SACLR background correction.
`SACLR3D`	Merges 3D scores given by `3RUN3D` with SACLR background correction.
`GPFILTER`	Uses the deletion and overexpression information in the feature file to refine the final scores. Time series data is also required.

The commands must be run in sequence to produce the final result, for example (2RUN -> CLR), (2RUN -> 3RUN -> SACLR), or (2RUN -> 3RUN3D -> SACLR3D).
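
The command sequences above can be sketched as a small Python helper that assembles the argument lists without running anything. The helper names and the `PIPELINES` table are hypothetical. The jar name and class name are copied from the command line shown earlier, and GPFILTER is left out because its place in the sequence is not spelled out here.

```python
# Command sequences as listed in the instructions, keyed by the final step.
PIPELINES = {
    "CLR": ["2RUN", "CLR"],
    "SACLR": ["2RUN", "3RUN", "SACLR"],
    "SACLR3D": ["2RUN", "3RUN3D", "SACLR3D"],
}


def build_command(conf_file, command, seed=0, verbose=True, jar="saclr-2.2.jar"):
    """Return the argument list for one SA-CLR invocation (nothing is executed)."""
    argv = ["java", "-jar", jar, "edu.columbia.ee.saclr.Saclr"]
    argv += ["-c", conf_file]   # configuration file (required)
    argv += ["-p", str(seed)]   # permutation seed; 0 means no permutation
    if verbose:
        argv.append("-v")       # verbose mode (recommended)
    argv.append(command)
    return argv


def pipeline_commands(final_step, conf_file, seed=0):
    """Return the ordered argument lists needed to reach the given final step."""
    if final_step not in PIPELINES:
        raise ValueError(f"unknown final step: {final_step}")
    return [build_command(conf_file, step, seed) for step in PIPELINES[final_step]]
```

Each returned list could be passed to `subprocess.run` on a single machine; on a cluster the intermediate files would still need to be merged between steps.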

There is also a Unix script available called sgiRun.sh that simplifies the process of running the software on a Sun Grid Engine cluster. Before using it, remember to modify the path to the java binary in the last line of the script, as well as the path to the working directory. It should be run as follows:

qsub sgiRun.sh CONF_FILE COMMAND PERM

The three parameters are the configuration file, the command (2RUN, 3RUN, etc.) and the permutation seed (use 0 if no permutation is desired). The script sgiRun.sh itself can be edited to adjust the number of jobs used in processing. Because the jobs are spread over many nodes, the results from 2RUN and 3RUN are split into many files. There are two scripts that can be used to merge these files, called cat2.sh and cat3.sh, respectively.

The final step (CLR, SACLR, SACLR3D, GPFILTER) does not need to be run on the cluster, and in fact should only be run on one computer. For this, the qsub can be dropped from the command-line and the script run directly:

./sgiRun.sh CONF_FILE SACLR PERM

Bin Size

The only tuning parameter that the user need be concerned with is the number of bins used in the spline-based MI estimator. Typically a value of 6 or 7 is ideal for data sets with 100 or more arrays. In the toy example below, 4 bins are used because the data set is small.

Simple Example

Here is a small example using the included shell scripts on Unix. The two test data files (testDataFor2.txt and testTFsFor2.txt) are random data, except for a fairly weak interaction between TF 1 and target 3 (with partner gene 2). Also included is a configuration file, testConf.txt. Here is how to run the three steps on this toy data set on the cluster:

The 2D MI calculation, which results in the intermediate files spline2.data and corr.data:
```
qsub sgiRun.sh testConf.txt 2RUN 0
```
Once complete, the following is run:
```
./cat2.sh
```
The 3D MI calculation, which in this case is run on a single computer because the data set is small, and results in the intermediate files spline3.data and partner.data:
```
qsub sgiRun.sh testConf.txt 3RUN 0
```
Once complete, the following is run:
```
./cat3.sh
```
The final results phase. The intermediate files from the first two steps must be present in the directory from which this step is run:
```
./sgiRun.sh testConf.txt SACLR 0
```
The results (up to the top 100,000) will be in the file zScores.txt. These are the top Z-score-corrected M+S results. Also included will be the top raw M+S results in rawScores.txt and the top synergy scores in synergies.txt.
