MORGAN: A decision tree system for finding genes in vertebrate DNA

Contents

About MORGAN

MORGAN is an integrated system for finding genes in vertebrate DNA sequences. MORGAN uses a variety of techniques to accomplish this task, the most distinctive of which is a decision tree classifier. The decision tree system is combined with new methods for identifying start codons, donor sites, and acceptor sites, and these are brought together in a frame-sensitive dynamic programming algorithm that finds the optimal segmentation of a DNA sequence into coding and noncoding regions (exons and introns). The optimal segmentation is dependent on a separate scoring function that takes a subsequence and assigns to it a score reflecting the probability that the sequence is an exon. The scoring functions in MORGAN are sets of decision trees that are combined to give a probability estimate. Experimental results on a database of 570 vertebrate DNA sequences show that MORGAN has excellent performance by many different measures. On a separate test set, it achieves an overall accuracy of 95%, with a correlation coefficient of 0.78 and a sensitivity and specificity for coding bases of 83% and 79%. In addition, MORGAN identifies 58% of coding exons exactly; i.e., both the beginning and end of the coding regions are predicted correctly.

References

The paper describing MORGAN is  S. Salzberg, A. Delcher, K. Fasman, and J. Henderson.  A Decision Tree System for Finding Genes in DNA. Journal of Computational Biology 5:4 (1998), 667-680.  A more tutorial introduction is S. Salzberg.  Decision Trees and Markov Chains for GeneFinding.  In S. Salzberg, D. Searls, and S. Kasif (eds.), Computational Methods in Molecular Biology, pp. 187-203.  Amsterdam:  Elsevier Science B.V., 1998.

Obtaining the MORGAN System

As of January 2001, MORGAN is available free of charge to all.  Just download it by clicking here.  The system includes source code.  Note that this is now old, UNSUPPORTED code.  It has not been changed since 1997.  Though I'd love to answer questions about MORGAN by email, I'm afraid I don't have time.  A much newer and improved gene finder, GlimmerM, uses some of the techniques from MORGAN as well as newer methods from our Glimmer bacterial gene finder.  Read more about GlimmerM at TIGR's software home page.

 

Performance

The following table and figures are from our technical report on MORGAN. The table shows MORGAN's overall performance on the Burset and Guigo database of 570 vertebrate genes, and for comparison it includes figures for several other well-known gene finding systems.
 
 
Coding Bases Exact Exons
Gene-Finder Sn Sp AC Sn Sp Avg ME
MORGAN 0.83  0.79  0.78  0.58  0.51  0.55  0.14 
VEIL 0.83  0.72  0.73  0.53  0.49  0.51  0.19 
Genie 0.78  0.84  0.77  0.61  0.64  0.62  0.15 
FGENEH 0.77  0.85  0.78  0.61  0.61  0.61  0.15 
GRAIL 2 0.72  0.87  0.75  0.36  0.43  0.40  0.25 
GeneID 0.63  0.81  0.67  0.44  0.46  0.45  0.28 
GeneParser2 0.66  0.79  0.67  0.35  0.40  0.37  0.29 
GenLang 0.72  0.79  0.69  0.51  0.52  0.52  0.21 
SORFIND 0.71  0.85  0.73  0.42  0.47  0.45  0.24 
Xpound 0.61  0.87  0.68  0.15  0.18  0.17  0.33 

Table 1: A comparison of MORGAN's performance on vertebrate sequences with other gene-finding systems. AC is the approximate correlation proposed by Burset and Guigo (1996), as a more accurate overall indicator than the correlation coefficient. MORGAN's AC and CC values were identical for both training and test sets. Sensitivity (Sn) is the fraction of true coding bases that were correctly predicted as coding, and specificity (Sp) is the number of bases predicted to be in coding regions that actually were coding. The ``exact exon'' columns show the corresponding results for prediction of whole exons, a much more stringent test. A ``whole exon'' prediction is correct only if both ends of the coding portion of an exon are predicted exactly. ME is the fraction of whole coding exons that are missed completely.
 


Selected Figures


Figure 1: Sample decision tree for classifying human DNA. The tree shown was used for classifying sequences as internal exons. Subsequences are passed down the tree beginning at the top, where a ``yes'' result on any test means that an example should be passed down to the left. The features tested in this tree include the donor site score (donor), the sum of the donor and acceptor site scores (d+a), the in-frame hexamer frequency (hex), and Fickett's position asymmetry statistic (asym). The leaf nodes contain class distributions for the two classes ``exon'' and ``pseudo-exon.'' For example, the leftmost leaf node contains the distribution (6,560), which means that in the training set, 6 examples that reached this point in the tree were exons and 560 were pseudo-exons. The nodes shown as triangles represent un-expanded subtrees.
 


Figure 2: How the dynamic programming algorithm finds the optimal parse. The algorithm first marks all the candidate start, stop, donor, and acceptor sites. Then it finds the best previous site for each one, and marks it with a pointer, shown as an arrow in the figure. For example, acceptor sites have pointers back to donor sites, indicating candidate introns. Donor sites can have pointers back to previous start sites or acceptor sites, indicating initial and internal exons. The ends of each solid arrow in the figure mark potential exons, while the ends of the dotted arrows indicate potential introns. The optimal parse at each stop site can be found by tracing the pointers back to a start codon.

Check out VEIL

Please check out our other gene-finding system, VEIL, which uses HMMs to find genes.
 


Acknowledgements

The MORGAN project is supported in part by the National Science Foundation under grant IRI-9530462, and by the National Human Genome Research Institute at NIH under grant K01-HG00022-01.
back to contents

You are visitor number  on this site.

Back to Computational Biology at Johns Hopkins Home Page
Last modified on: October 28, 1999