testDataFor2.txt for an example data file, testTFsFor2.txt for an example probe file, and testPhen.txt for an example phenotype file.
Optionally, an annotation file can also be present. This is used to map probe sets to genes. Affymetrix annotation files available from their website (in CSV format) will work automatically. Otherwise, the file should be a comma-delimited file with at least one column labeled Probe Set ID and another column labeled Gene Symbol. If an annotation file is made available, then results will include both probe set IDs and gene symbols.
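As a minimal sketch of the custom annotation format described above, the following Python snippet reads a comma-delimited file and builds a probe-set-to-gene map. The function name `load_annotations` and the sample rows are hypothetical, and this is not necessarily how the software itself parses the file; in particular, it does not handle any extra header material that a vendor file may contain.

```python
import csv
import io

def load_annotations(handle):
    """Map each 'Probe Set ID' to its 'Gene Symbol' from an open CSV file."""
    reader = csv.DictReader(handle)
    required = {"Probe Set ID", "Gene Symbol"}
    missing = required - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"annotation file lacks columns: {sorted(missing)}")
    return {row["Probe Set ID"]: row["Gene Symbol"] for row in reader}

# Hypothetical two-row annotation file; extra columns are simply ignored.
sample = io.StringIO(
    "Probe Set ID,Gene Symbol,Notes\n"
    "probe_1,GENE_A,made-up row\n"
    "probe_2,GENE_B,made-up row\n"
)
annotations = load_annotations(sample)
print(annotations)
```

To read a real file instead, open it with `open(path, newline="")` and pass the handle to the function.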
These files are all specified in a configuration file. This file consists of key=value pairs, one per line. Lines beginning with a # are ignored. The following configuration parameters are supported:
data | Specifies the data file. |
tfs | Specifies the probe file for the transcription factors, or all to indicate that all probes are TFs. If missing, all is assumed. |
targets | Specifies the probe file for the targets, or all to indicate that all probes are targets, or phenotype to indicate that the phenotype should be used as the target. If missing, all is assumed. |
partners | Specifies the probe file for the partners, or all to indicate that all probes could be partners. If missing, all is assumed. |
phenotypes | Specifies the phenotype file (only required if the phenotype is used as the target). |
features | Specifies the chip feature file. For now, only the format used by the DREAM5 challenge is accepted; the file must have a DeletedGenes column and an OverexpressedGenes column. |
time_series | Time series file for gene perturbation filter. |
annotations | Specifies the annotation file (optional). |
bins | The number of bins to use in the spline entropy estimation. |
spline_order | Spline order to use in the spline function. |
topscores | The number of top scores to retain at the end of the analysis (defaults to 100,000 if missing). |
corr_filter | Whether to filter out all pairs with negative correlation (true or false). |
The idea is that these configuration files can be prepared for various data sets and kept around for easy tracking and reproducibility of previous work.
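To make the file format concrete, here is a small Python sketch that parses key=value lines, skips # comments, and applies the defaults listed in the table above. The function name `parse_config`, the trimming of whitespace, and the values given for bins and spline_order are assumptions for illustration only; the software's own parser may behave differently.

```python
# Defaults taken from the parameter table: tfs, targets and partners fall
# back to "all", and topscores falls back to 100,000.
DEFAULTS = {"tfs": "all", "targets": "all", "partners": "all", "topscores": "100000"}

def parse_config(text):
    """Parse key=value lines, skipping blank lines and lines starting with '#'."""
    config = dict(DEFAULTS)
    for line in text.splitlines():
        line = line.strip()  # assumption: surrounding whitespace is not significant
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        config[key.strip()] = value.strip()
    return config

# Hypothetical configuration; the bins and spline_order values are made up.
example = """\
# toy data set
data=testDataFor2.txt
tfs=testTFsFor2.txt
targets=all
bins=10
spline_order=3
"""
config = parse_config(example)
print(config)
```

All values are kept as strings here; converting bins or topscores to integers is left to the caller.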
java -jar saclr-2.2.jar edu.columbia.ee.saclr.Saclr [OPTIONS] COMMAND
The [OPTIONS] above are the following:
-h | Displays some help information and exits. |
-c <FILE> | Specifies the configuration file (required). |
-p <SEED> | Specifies a permutation seed. Only the targets will be permuted with this seed. A value of zero means no permutation. |
-v | Specifies verbose mode, which will produce much more information as the software runs. |
Of the options, only -c is required, although -v is recommended. The COMMAND parameter is also required and is one of the following:
2RUN | Runs the 2D mutual information processing step. |
3RUN | Runs the 3D mutual information processing step, storing synergy for each tf-target pair. |
3RUN3D | Runs the 3D mutual information processing step, storing 3-way MI for each tf-target-partner triplet. (This can require a lot of storage.) |
CLR | Merges the 2D scores using CLR background correction. |
SACLR | Merges the 3D scores produced by 3RUN using SACLR background correction. |
SACLR3D | Merges the 3D scores produced by 3RUN3D using SACLR background correction. |
GPFILTER | Uses the deletion and overexpression information in the feature file to refine the final scores. Time series data is also required. |
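For readers unfamiliar with the term, the following Python sketch illustrates the general idea of CLR (context likelihood of relatedness) background correction as it is commonly described: each mutual information value is converted to a z-score against the background of each of the two genes involved, negative z-scores are floored at zero, and the two are combined. This is an illustration under stated assumptions (a symmetric MI matrix, population standard deviation, made-up values), not a description of how this software computes its CLR or SACLR scores.

```python
import math
from statistics import mean, pstdev

def clr_scores(mi):
    """Apply a CLR-style background correction to a symmetric MI matrix."""
    n = len(mi)
    # Background statistics of each gene's MI values against all genes.
    mu = [mean(row) for row in mi]
    sd = [pstdev(row) for row in mi]
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Z-score of MI(i, j) against each gene's background, floored at 0.
            zi = max(0.0, (mi[i][j] - mu[i]) / sd[i]) if sd[i] > 0 else 0.0
            zj = max(0.0, (mi[i][j] - mu[j]) / sd[j]) if sd[j] > 0 else 0.0
            scores[i][j] = math.sqrt(zi * zi + zj * zj)
    return scores

# Made-up MI values for three genes; genes 0 and 2 share the strongest signal.
mi = [
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.2],
    [0.9, 0.2, 0.0],
]
scores = clr_scores(mi)
for row in scores:
    print(["%.3f" % value for value in row])
```

The point of the correction is that a pair stands out only if its MI is high relative to what both genes typically show, rather than high in absolute terms.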
The commands must be run in sequence to produce the final result, for example (2RUN -> CLR), (2RUN -> 3RUN -> SACLR), or (2RUN -> 3RUN3D -> SACLR3D).
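The three sequences above can be written down as data, which makes it harder to run a merge step before the step that feeds it. The Python sketch below only builds the command lines and does not execute anything; the names `PIPELINES` and `build_commands` are hypothetical, and GPFILTER is left out because its position in the sequence is not stated here.

```python
# Step order for each final merge command, as listed in the text.
PIPELINES = {
    "CLR": ["2RUN", "CLR"],
    "SACLR": ["2RUN", "3RUN", "SACLR"],
    "SACLR3D": ["2RUN", "3RUN3D", "SACLR3D"],
}

def build_commands(final_step, conf_file, seed=0, verbose=True):
    """Return one argument list per step, in the order the steps must run."""
    base = [
        "java", "-jar", "saclr-2.2.jar", "edu.columbia.ee.saclr.Saclr",
        "-c", conf_file,
        "-p", str(seed),  # a seed of 0 means no permutation
    ]
    if verbose:
        base.append("-v")
    return [base + [step] for step in PIPELINES[final_step]]

commands = build_commands("SACLR", "testConf.txt")
for argv in commands:
    print(" ".join(argv))
```

Each argument list could be handed to `subprocess.run(argv, check=True)` so that a failed step stops the sequence, assuming the jar is in the working directory.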
There is also a Unix script available called sgiRun.sh that simplifies the process of running the software on a Sun Grid Engine cluster. Before using it, remember to modify the path to the java binary in the last line of the script, and the path to the working directory. It should be run as follows:
qsub sgiRun.sh CONF_FILE COMMAND PERM
The three parameters are the configuration file, the command (2RUN, 3RUN, etc.) and the permutation seed (use 0 if no permutation is desired). The script sgiRun.sh itself can be edited to adjust the number of jobs used in processing. Because the jobs are spread over many nodes, the results from 2RUN and 3RUN are split into many files. Two scripts, cat2.sh and cat3.sh, merge the 2RUN and 3RUN files, respectively.
The final step (CLR, SACLR, SACLR3D, GPFILTER) does not need to be run on the cluster, and in fact should only be run on one computer. For this, qsub can be dropped from the command line and the script run directly:
./sgiRun.sh CONF_FILE SACLR PERM
The example files (testDataFor2.txt and testTFsFor2.txt) are random data, except for a fairly weak interaction between TF 1 and target 3 (with partner gene 2). Also included is a configuration file, testConf.txt. Here is how to run the three steps on this toy data set on the cluster:
spline2.data, corr.data:
qsub sgiRun.sh testConf.txt 2RUN 0
Once complete, the following is run:
./cat2.sh
spline3.data, partner.data:
qsub sgiRun.sh testConf.txt 3RUN 0
Once complete, the following is run:
./cat3.sh
./sgiRun.sh testConf.txt SACLR 0
The results (up to the top 100,000) will be in the file zScores.txt. These are the top Z-score-corrected M+S results. Also included will be the top raw M+S results in rawScores.txt and the top synergy scores in synergies.txt.