Last updated: 05/03/2005. New source code available for version 0.95beta.
SpectroFish – a tool for visualizing periodicity and relative nucleotide content in DNA.
Summary:
SpectroFish is a genomic DNA sequence visualization tool. It shows various periodicities in DNA by means of color spectrograms. The unique visualization provided by SpectroFish clearly shows both global and local information found in genomes. It is written in MATLAB and can be run on UNIX, Linux and Windows platforms.
Contact: sussillo@ee.columbia.edu
Introduction: SpectroFish provides visualization of DNA sequences via color spectrograms, which are a novel visualization tool as described in [1], [2], [3]. These spectrograms give a simultaneous view of the local frequency content throughout the sequence for all frequencies, as well as the local relative nucleotide content indicated by the color of the spectrogram. They are helpful not only for the identification of genes and other regions of known biological significance, but also for the discovery of yet unknown regions of potential significance, characterized by distinct visual patterns in the spectrogram that are not easily detectable by character string analysis. SpectroFish derives both its power and uniqueness from the fact that it works in the frequency domain.
SpectroFish works by highlighting strong frequencies in DNA sequences. For example, it readily reveals the well-known 3-base periodicity [4] of protein-coding regions and the 10.4-base DNA helical repeat [5]. It also shows repeat regions of any complexity, from telomere DNA to satellite regions, and the repeat unit length can be determined by visual inspection. SpectroFish uses color to encode relative nucleotide content, thus revealing global properties such as AT or GC content. The distinct G-banding of chromosomes and CpG islands are easily visualized with SpectroFish.
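To make the idea of a "strong frequency" concrete, here is a minimal sketch in Python (standard library only). It assumes the common approach of splitting a sequence into four binary indicator sequences, one per nucleotide, and measuring the squared DFT magnitude at a chosen period. The function names and the toy sequence are hypothetical, and this is not necessarily how SpectroFish itself computes its spectrograms.

```python
import cmath

def indicator(seq, base):
    """Binary indicator sequence: 1.0 where seq has `base`, else 0.0."""
    return [1.0 if ch == base else 0.0 for ch in seq]

def power_at_period(signal, period):
    """Normalized squared DFT magnitude of a mean-removed signal at one period."""
    n = len(signal)
    mean = sum(signal) / n
    acc = 0j
    for pos, x in enumerate(signal):
        acc += (x - mean) * cmath.exp(-2j * cmath.pi * pos / period)
    return abs(acc) ** 2 / n

def sequence_power(seq, period):
    """Sum the per-nucleotide powers so that every base contributes."""
    return sum(power_at_period(indicator(seq, b), period) for b in "ACGT")

# Toy input (hypothetical): a perfect 3-base repeat, 150 bases long.
toy = "ATG" * 50
for p in (3, 5, 10):
    print(p, round(sequence_power(toy, p), 6))
```

For this toy repeat the power is concentrated at period 3 and is essentially zero at periods 5 and 10. Real coding regions are far noisier, so the period-3 peak there is a statistical tendency rather than an exact line.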
Perhaps most interesting are those features of genomes revealed by SpectroFish that are not well understood. Some examples include a mutated repeat string that dominates the right arm of C. elegans chromosome III (Figure 7), the preponderance of the nucleotides A and T at lower frequencies throughout the entire genome of E. coli, and the many embedded periodicities found in the vast majority of intergenic DNA in human chromosome 22 [1].
The key to the visualization power in SpectroFish lies in the tool’s ability to provide arbitrary frequency and sequence resolution for any DNA sequence. Frequency content can be distinguished from the 2-base to the 300-base periodicity. Spectrograms have been created for sequences as small as 100bp or as large as 34Mbp (human chromosome 22) [3]. Further, SpectroFish allows one to zoom in and out in both frequency and character dimensions, while remembering previous spectrograms, thus providing an ability to discover exactly those frequency characteristics of any region of interest. Control over the sequence resolution allows one to visualize meaningful features over an entire chromosome or a single gene. The following examples demonstrate this control over sequence and frequency in addition to the various spectral features of selected DNA sequences.
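The relationship between window length, frequency and periodicity can be sketched as follows. The example assumes that each spectrogram column comes from a discrete Fourier transform over a window of N bases, so that bin k corresponds to a period of N/k bases. The helper names are hypothetical and the window length of 600 is chosen only for illustration.

```python
def bin_periods(window_len):
    """Periods (in bases) of DFT bins k = 1 .. window_len // 2."""
    return [window_len / k for k in range(1, window_len // 2 + 1)]

def period_spacing(window_len, period):
    """Gap between `period` and the next shorter period one bin further on."""
    k = window_len / period  # bin index, possibly fractional
    return window_len / k - window_len / (k + 1)

periods = bin_periods(600)
print(periods[0], periods[1], periods[-1])  # longest, second longest, shortest
print(period_spacing(600, 3))               # fine spacing near period 3
print(period_spacing(600, 100))             # coarse spacing near period 100
```

Under this assumption the shortest resolvable period is always 2 bases, long periods require long windows, and the spacing between neighbouring periods grows roughly as period squared divided by N. This is one reason a trade-off between frequency resolution and sequence resolution arises when zooming.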

Figure 1 - Three nearly identical genes are shown from C. elegans chromosome III. Additional blue-green spots at the 3-base periodicity reveal other exons. The sequence size is 16Kbp; periodicity (not frequency) is shown on the vertical scale.

Figure 2 - This spectrogram shows a piece of C. elegans chromosome III 432Kbp in length. There are many features to notice. First, the artificial white lines denote the 3-base periodicity and the 10.1-base periodicity. Long stretches of DNA have spectral energy along the 10.1-base period. Many 3-base 'bumps' can be seen highlighting various exons and genes. There are a few repeat sequences in this piece of DNA as well. At ~270Kbp there is a repeat with length > 55 bases. At ~150Kbp there is a stretch of DNA exhibiting identical repeats; this piece of DNA was likely replicated and inserted into the genome again. Finally, there is a change in color roughly above and below the 6-base periodicity. Above the 6-base periodicity the hue tends to be purplish, while below it tends to be more gray or green. By Parseval's equation this must reflect A and T being utilized at higher periodicities (lower frequencies), on average, while G and C are utilized at lower periodicities (higher frequencies), on average.

Figure 3 - A spectrogram of another stretch of C. elegans chromosome III. An artificial white line has been placed at the 35-base periodicity. This spectrogram reveals a family of repeats with very similar repeat sequences. Examples are located at ~40Kbp, ~64Kbp, ~90Kbp and ~160Kbp. The spots at the 35-base periodicity represent the fundamental frequencies of these repeats, while the regularly spaced spectral bumps reflect harmonics at higher frequencies.
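The harmonic pattern described in this caption can be reproduced with a small sketch. It assumes the same indicator-sequence DFT as a stand-in for the spectrogram computation. The 35-base unit below is an arbitrary, hypothetical string, not a sequence taken from C. elegans. An exact repeat of period 35 can only carry spectral energy at periods 35/m (35, 17.5, 11.67, ...), which is why the bumps appear regularly spaced in frequency.

```python
import cmath

UNIT = "ACGTTGCAAGCTTAGGCTAACCGTATTGCAGGATC"  # hypothetical 35-base repeat unit
SEQ = UNIT * 20                               # 700 bases of exact repeats

def power_at_period(seq, period):
    """Sum over A, C, G, T of the mean-removed squared DFT magnitude."""
    n = len(seq)
    total = 0.0
    for base in "ACGT":
        x = [1.0 if ch == base else 0.0 for ch in seq]
        mean = sum(x) / n
        acc = sum((v - mean) * cmath.exp(-2j * cmath.pi * i / period)
                  for i, v in enumerate(x))
        total += abs(acc) ** 2 / n
    return total

def harmonic_periods(fundamental, count):
    """Periods of the first `count` harmonics of a repeat."""
    return [fundamental / m for m in range(1, count + 1)]

for p in harmonic_periods(35, 5):
    print(round(p, 2), round(power_at_period(SEQ, p), 3))

# A period that is not 35/m but still falls on a DFT bin (700 / 30 bases).
print(round(power_at_period(SEQ, 700 / 30), 9))
```

How strong each individual harmonic is depends on the repeat unit itself, so some harmonics may be weak. Mutated copies of a repeat, as in real genomes, blur these lines rather than remove them.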

Figure 4 – Spectrogram showing the so-called quilt in protein FLO1 in S. cerevisiae. The quilt region corresponds to a flocculation domain. The region shown is roughly 10Kbp.

Figure 5 – A spectrogram of the mitochondrial DNA (86Kbp) of S. cerevisiae. Unusual is the very low frequency content (pink/red) at the top, corresponding to periodicities up to 200. Interestingly, the mitochondrial DNA of Neurospora crassa also shows similar low frequency content (not shown).

Figure 6 – Screenshot of SpectroFish. The spectrogram is that of 282Kbp of P. falciparum chromosome 4. Periodicity is shown on the vertical scale, while the horizontal scale gives the sequence location. Protein-coding regions are visualized by following the 3-base periodicity. On the left are shown 4 genes for erythrocyte membrane proteins. On the right is a large region of subtelomeric DNA (the period of the basic repeat is 21). Zooming allows for arbitrary frequency and sequence resolution (compare with Figure 1 and Figure 4), while a history of spectrograms is stored for easy retrieval.

Figure 7 - Spectrogram of chromosome III of C. elegans (13.8Mbp). Note the 3-base periodicity relating to protein coding. A minisatellite is noticeable in the middle (at 7.4Mbp). The periodicities visible in the right arm correspond to a family of repeated strings with period 35.
The program comes in two pieces, which are listed below. Additionally, a couple of *.fna (FASTA format) files are provided as samples.
References:
[1] D. Sussillo, A. Kundaje, D. Anastassiou, “Spectrogram Analysis of Genomes,” EURASIP Journal on Applied Signal Processing, pp. 29-42, 2004.
[2] D. Anastassiou, “Genomic Signal Processing,” IEEE Signal Processing Magazine, theme article, July 2001.
[3] D. Anastassiou, “Frequency-domain analysis of biomolecular sequences,” Bioinformatics, Vol. 16, No. 12, pp. 1073-1081, December 2000.
[4] J. C. Shepherd, “Periodic correlations in DNA sequences and evidence suggesting their evolutionary origin in a comma-less genetic code,” J Mol Evol, Vol. 17, No. 2, pp. 94-102, 1981.
[5] D. Rhodes, A. Klug, “Helical periodicity of DNA determined by enzyme digestion,” Nature (London), Vol. 286, pp. 573-578, 1980.