MORGAN: A decision tree system for finding genes in vertebrate DNA
MORGAN is an integrated system for finding genes in vertebrate DNA sequences.
MORGAN uses a variety of techniques to accomplish this task, the most distinctive
of which is a decision tree classifier. The decision tree system is combined
with new methods for identifying start codons, donor sites, and acceptor
sites, and these are brought together in a frame-sensitive dynamic programming
algorithm that finds the optimal segmentation of a DNA sequence into coding
and noncoding regions (exons and introns). The optimal segmentation is
dependent on a separate scoring function that takes a subsequence and assigns
to it a score reflecting the probability that the sequence is an exon.
The scoring functions in MORGAN are sets of decision trees that are combined
to give a probability estimate. Experimental results on a database of 570
vertebrate DNA sequences show that MORGAN has excellent performance by
many different measures. On a separate test set, it achieves an overall
accuracy of 95%, with a correlation coefficient of 0.78 and a sensitivity
and specificity for coding bases of 83% and 79%. In addition, MORGAN identifies
58% of coding exons exactly; i.e., both the beginning and end of
the coding regions are predicted correctly.
The paper describing MORGAN is S. Salzberg, A. Delcher, K. Fasman,
and J. Henderson. A
Decision Tree System for Finding Genes in DNA.
Journal of Computational
Biology 5:4 (1998), 667-680. A more tutorial introduction is
S. Salzberg. Decision Trees and Markov Chains for GeneFinding.
In S. Salzberg, D. Searls, and S. Kasif (eds.),
Methods in Molecular Biology, pp. 187-203. Amsterdam:
Elsevier Science B.V., 1998.
Obtaining the MORGAN System
As of January 2001, MORGAN is available free of charge to all. Just
download it by clicking
here. The system includes source code. Note that this is
now old, UNSUPPORTED code. It has not been changed since 1997.
Though I'd love to answer questions about MORGAN by email, I'm afraid I
don't have time. A much newer and improved gene finder, GlimmerM,
uses some of the techniques from MORGAN as well as newer methods from
our Glimmer bacterial gene finder. Read more about GlimmerM at TIGR's
software home page.
The following table and figures are from our technical report on MORGAN.
The table shows MORGAN's overall performance on the Burset and Guigo database
of 570 vertebrate genes, and for comparison it includes figures for several
other well-known gene finding systems.
Table 1: A comparison of MORGAN's performance on vertebrate sequences
with other gene-finding systems. AC is the approximate correlation proposed
by Burset and Guigo (1996), as a more accurate overall indicator than the
correlation coefficient. MORGAN's AC and CC values were identical for both
training and test sets. Sensitivity (Sn) is the fraction of true coding
bases that were correctly predicted as coding, and specificity (Sp) is
the number of bases predicted to be in coding regions that actually were
coding. The ``exact exon'' columns show the corresponding results for prediction
of whole exons, a much more stringent test. A ``whole exon'' prediction
is correct only if both ends of the coding portion of an exon are predicted
exactly. ME is the fraction of whole coding exons that are missed completely.
Figure 1: Sample decision tree for classifying human DNA. The
tree shown was used for classifying sequences as internal exons. Subsequences
are passed down the tree beginning at the top, where a ``yes'' result on
any test means that an example should be passed down to the left. The features
tested in this tree include the donor site score (donor), the sum of the
donor and acceptor site scores (d+a), the in-frame hexamer frequency (hex),
and Fickett's position asymmetry statistic (asym). The leaf nodes contain
class distributions for the two classes ``exon'' and ``pseudo-exon.'' For
example, the leftmost leaf node contains the distribution (6,560), which
means that in the training set, 6 examples that reached this point in the
tree were exons and 560 were pseudo-exons. The nodes shown as triangles
represent un-expanded subtrees.
Figure 2: How the dynamic programming algorithm finds the optimal
parse. The algorithm first marks all the candidate start, stop, donor,
and acceptor sites. Then it finds the best previous site for each one,
and marks it with a pointer, shown as an arrow in the figure. For example,
acceptor sites have pointers back to donor sites, indicating candidate
introns. Donor sites can have pointers back to previous start sites or
acceptor sites, indicating initial and internal exons. The ends of each
solid arrow in the figure mark potential exons, while the ends of the dotted
arrows indicate potential introns. The optimal parse at each stop site
can be found by tracing the pointers back to a start codon.
Check out VEIL
Please check out our other gene-finding system, VEIL,
which uses HMMs to find genes.
The MORGAN project is supported in part by the National Science Foundation
under grant IRI-9530462, and by the National Human Genome Research Institute
at NIH under grant K01-HG00022-01.
You are visitor number on this
to Computational Biology at Johns Hopkins Home Page
|Last modified on: October 28, 1999