How to use SpectroFish

SpectroFish is easy to use. You only need to know a little about how spectrograms work.

Spectrograms attempt to bypass (though with only partial success) the time/frequency tradeoff present in any application of Fourier techniques. In the spectrogram technique one uses a window of size NFFT that is much smaller than the length of the data sequence. The Fast Fourier Transform (FFT) of the data in this window is computed, giving a sequence of length NFFT (equal to the window size). The window is then advanced by some number of data points and the FFT is computed again. Each new FFT sequence is placed vertically, adjacent to the previous one, resulting in a two-dimensional matrix that gives both time and frequency information about the sequence. An example is shown below.

Figure 1 - A simple spectrogram computed from the sequence shown below. This spectrogram was generated by taking a 256-point window of the sequence, computing the FFT, and advancing the window 16 points between FFT calculations. The spectrogram shows that the frequencies in the sequence get lower and lower as the sequence progresses.
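The sliding-window procedure described above can be sketched in a few lines of Python. This is only an illustration, not the SpectroFish code: the function names, the test signal, and the window settings (64 points, advanced 16 at a time) are assumptions chosen to keep the example small, and a plain DFT sum stands in for the FFT (it gives the same values, only more slowly).

```python
import cmath
import math

def dft_magnitudes(window):
    """Magnitude of the DFT of one window (naive O(n^2) sum; an FFT gives the same values)."""
    n = len(window)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(window)))
        for k in range(n)
    ]

def spectrogram(data, nfft, step):
    """Return one column of nfft magnitudes per window position."""
    return [
        dft_magnitudes(data[start:start + nfft])
        for start in range(0, len(data) - nfft + 1, step)
    ]

def peak_bin(column):
    """Index of the strongest bin among the positive frequencies 1 .. nfft/2."""
    half = len(column) // 2
    return max(range(1, half + 1), key=lambda k: column[k])

# Hypothetical test signal: a tone whose frequency falls steadily
# from 0.30 cycles/point to roughly 0.10 cycles/point.
signal = [math.cos(2 * math.pi * (0.30 * t - 0.0004 * t * t)) for t in range(256)]

columns = spectrogram(signal, nfft=64, step=16)
print([peak_bin(col) for col in columns])  # strongest frequency bin in each window
```

Each entry of columns corresponds to one vertical strip of the spectrogram. For a signal like the one in Figure 1, the strongest bin moves toward lower frequencies from the first strip to the last.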

The SpectroFish program computes the spectrograms for DNA in essentially the same way. First the string is converted into 'binary indicator strings', one for each base (e.g. 'AATCTAGA' gives '11000101' as the binary indicator sequence for 'A'). A spectrogram is computed for each indicator sequence. These spectrograms are then combined using linear algebra to give an optimal coloration to each base so that all four spectrograms (one for each base) can be viewed at once. In other words, the color contains real information. Note that this is not the case in the spectrogram shown above. In Figure 1 the colors are pseudocolor: they merely represent intensity and convey no further information. Please see the references for the complete details on how to make DNA color spectrograms.
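The conversion to binary indicator sequences is simple enough to show directly. The short Python sketch below is illustrative only; the function name indicator_sequences is an assumption and not part of SpectroFish.

```python
def indicator_sequences(dna, bases="ATCG"):
    """Map each base to a 0/1 list marking where that base occurs in dna."""
    dna = dna.upper()
    return {base: [1 if ch == base else 0 for ch in dna] for base in bases}

seqs = indicator_sequences("AATCTAGA")
for base, bits in seqs.items():
    print(base, "".join(str(b) for b in bits))
```

For 'AATCTAGA' the line printed for 'A' is 11000101, matching the example in the text. At every position exactly one of the four sequences holds a 1.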

The color spectrogram uses the following color assignments:

A - blue, T - red, C - green, G - gray.

Since there are four bases but only three color parameters (RGB), the color for each base is chosen so that it is maximally distant from the other base colors. Of course, any color viewed in the spectrogram is NOT unique, as four bases are mapped onto three color parameters.
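To see why a displayed color cannot be unique, consider the rough Python sketch below. It assumes the simplest possible combination: each base's spectral magnitude scales that base's color, and the four results are added. The RGB values (in particular mid-gray for G) and the function name mix_color are assumptions made for illustration; the exact weighting SpectroFish uses is given in the references.

```python
# Assumed RGB values for the base colors listed above (each component in 0..1).
BASE_COLORS = {
    "A": (0.0, 0.0, 1.0),  # blue
    "T": (1.0, 0.0, 0.0),  # red
    "C": (0.0, 1.0, 0.0),  # green
    "G": (0.5, 0.5, 0.5),  # gray (mid-gray is an assumption)
}

def mix_color(magnitudes):
    """Combine per-base magnitudes for one spectrogram cell into one RGB color."""
    rgb = [0.0, 0.0, 0.0]
    for base, magnitude in magnitudes.items():
        for i, component in enumerate(BASE_COLORS[base]):
            rgb[i] += magnitude * component
    return tuple(min(1.0, value) for value in rgb)  # clamp to the displayable range

# Two different base mixtures that land on the same displayed color.
print(mix_color({"A": 0.5, "T": 0.5, "C": 0.5, "G": 0.0}))
print(mix_color({"A": 0.0, "T": 0.0, "C": 0.0, "G": 1.0}))
```

Under these assumptions, equal parts of A, T and C produce the same gray as G alone, which is the kind of ambiguity that mapping four bases onto three color parameters makes unavoidable.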

Using the Interface

The SpectroFish tool allows one to create basic spectrograms for any DNA sequence and then navigate through them by zooming in and out of the spectrogram. The frequency content of the DNA can be fully explored and the corresponding spectrograms can be saved. Additionally, the strings that create the spectrograms can be viewed and saved as well. Thus one can use the SpectroFish tool to 'fish' out interesting DNA subsequences based on interesting spectral properties. It is quite probable that many new properties of DNA sequences will be identified this way. The SpectroFish interface is shown below in its simplest use.

To get started quickly, open a sample FNA file using the "SpectroFish -> Open FNA File" menu item. This is the only way to open FNA files; they are NOT opened from the File menu. The File menu is reserved for MATLAB figure operations such as saving and printing and, aside from printing, is not useful here. Please note that SpectroFish handles only one FNA file at a time. Attempts to load multiple files will result in spurious errors.

Two parameters matter more than any others when computing DNA color spectrograms: FFT Size and Overlap.
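The Python sketch below shows how these two parameters determine how many FFT columns a spectrogram contains. It assumes that Overlap is the number of points shared by consecutive windows, so that the window advances by FFT Size minus Overlap each time; SpectroFish may define Overlap differently, and the function name window_starts is hypothetical.

```python
def window_starts(seq_len, nfft, overlap):
    """Start index of every full window, assuming overlap counts shared points."""
    if not 0 <= overlap < nfft:
        raise ValueError("overlap must be at least 0 and smaller than nfft")
    step = nfft - overlap  # points the window advances between FFTs
    return list(range(0, seq_len - nfft + 1, step))

# The Figure 1 settings (256-point window advanced 16 points, i.e. an overlap
# of 240 points) applied to a hypothetical 1024-point sequence.
starts = window_starts(seq_len=1024, nfft=256, overlap=240)
print(len(starts), starts[:3], starts[-1])
```

A larger FFT Size gives finer frequency resolution but blurs events in time, while a larger Overlap gives more columns and therefore a smoother picture along the time axis at the cost of more computation.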

Edit Texts explained:

Push Buttons explained:

Menus explained:

SpectroFish Menu:

Configuration Menu:

PMUSIC - 

Finally, an attempt at more meaningful thresholding is implemented using pseudospectrum techniques. In particular, the PMUSIC algorithm, which is based on a Gaussian noise model, employs the subspace technique of MUSIC to separate the frequencies that are noise from those that are not. The technique takes two parameters. The first is the number of expected sinusoids in the signal (enter twice the expected number, as the negative frequencies are also counted), and the second is a threshold (based on low eigenvalues) that separates noise from signal. If in doubt, reasonable starting parameters are "10 4". These values give a black screen if a completely random DNA sequence is used for the spectrogram.

Figure 4 - The PMUSIC algorithm run on the same sample file as the other figures. The PMUSIC parameters used to generate the image are "10 2".