Index
- Introduction
- Algorithm overview
- Training an organism specific GlimmerM
- Need of a specific organism training
- Data collection
- Training GlimmerM
- Changing the parameters obtained during training
- Running GlimmerM
- Gene Modeling Performance
- Obtaining GlimmerM
I. Introduction
GlimmerM is a gene finder developed specifically for small eukaryotes with a gene density
of around 20% (Salzberg, Pertea et al. 1999 ). Originally developed for
Plasmodium falciparum, the malaria parasite , the system has
been trained for several other organisms, including Arabidopsis thaliana, Oryza sativa
(Yuan, Quackenbush et al. 2001), and Aspergillus fumigatus, and should
also work well on closely related organisms. A special package of GlimmerM re-trains the system
using data provided by the user, therefore making the gene finder able to find genes in any
organism for which a training set can be collected.
II.Algorithm overview
The basis of GlimmerM is a dynamic programming algorithm that considers all combinations of
possible exons for inclusion in a gene model and chooses the best of all these combinations.
The possible exon-intron combinations are formed after an initial screening of the possible
translational start sites and splice sites found in the genome. Both these entities are
determined with specially designed modules based primarily on Markov chains.
There are a number of assumptions used by GlimmerM when predicting genes in the DNA sequence.
The main assumptions are
(1) the coding region of every gene begins with a start codon ATG,
(2) a gene has no in-frame stop codons except the very last codon,
(3) each exon is in a consistent reading frame with the previous exon.
These constraints significantly enhance the efficiency of computing the optimal gene models,
by restricting the search space of the DP algorithm. On the other hand, genuine frame shifts
cannot be detected by the system.
The decision about what gene model is best is a combination of the strength of the splice sites
and the score of the exons produced by a special generalization of a Markov chains called an
interpolated Markov model (IMM) (Salzberg, Pertea et al. 1999).
GlimmerM uses the same IMM algorithm as in Glimmer.
Based on the success of Glimmer in bacterial sequence annotation, we thought that IMMs should make a good foundation for eukaryotic gene finding.
This is particularly true of small eukaryotes like P. falciparum in which the gene density is
intermediate between that of prokaryotes and higher eukaryotes.
Besides IMMs, GlimmerM uses other methods to score potential coding regions. A scoring function
based on decision trees is built in order to estimate the probability that a DNA subsequence is
coding or not. Five types of subsequences are evaluated: introns, initial exons, internal exons,
final exons and single exons. The probabilities obtained with the decision trees are averaged to
produce a smoothed estimate of the probability that the given subsequence is of a certain type.
A gene model is only accepted if the IMM score over all coding sequences exceeds a fixed threshold.
III.Training an organism specific GlimmerM
III.1. Need of a specific organism training
GlimmerM trained for rice produced clearly superior gene models than GlimmerM trained for Arabidopsis
(Pertea and Salzberg 2002). Although the current trainings of GlimmerM should work
well on closely related organisms, the user should use our training procedure to re-train the system
for an other eukaryotic organism in order to improve the accuracy of the gene prediction. For all
species that GlimmerM is already trained, we expect the performance to improve even further by re-training the system as the amount of DNA sequences from these species continues to grow.
III.2 Data collection
First of all it should be noted that a thorough collection of a good training data set is a step that
should precede the training of any gene finder. This step is very important since the quality of the
data presented for training is directly proportional with the accuracy of the resulted gene finder.
As any species-specific gene finder, GlimmerM needs a training data set containing as many as possible
complete coding sequences from the organism genome for which the gene prediction is intended. By surveying
the public databases, one should be able to find all previously known genes of the target organism that
are backed by laboratory evidence. These genes would form a sound training data set if their number
would be large enough. Unfortunately, this is rarely the case, so one should use other methods in order
to construct a reliable data set. From our experience, most often the training data set was formed by
searching long ORFs (with more than 500 bases) against non redundant protein sequence databases by using
programs like BLAST in order to map known proteins into the genome.
III.3 Training GlimmerM
To train GlimmerM you should run trainGlimmerM with the following command:
trainGlimmerM <mfasta_file> <exon_file> [optional_parameters]
<mfasta_file> and <exon_file> are the multi-FASTA file and the file containing the exon coordinates of
the known genes, respectively.
One of the steps of the gene finding algorithm of GlimmerM is to determine potential splice sites in the
DNA sequences presented at input. Sequences that contain the consensus GT or AG dinucleotide and score
about a fixed threshold are retained as potential donor or acceptor sites respectively. A significant
number of false positive splice sites is further eliminated by keeping only those sequences whose score
was maximal within a fixed DNA window (Pertea, Lin et al. 2001). The default length
of the DNA window used to filter the locally optimal splice sites as before is 60bp. This value can be
changed by using two optional parameters with the trainGlimmerM procedure:
-a [filter value]
where [filter value] is an integer specifying the window length for filtering locally maximal acceptor
sites (default=60)
-d [filter value]
where [filter value] is an integer specifying the window length for filtering locally maximal donor
sites (default=60)
If not enough data is available for training the splice sites the training procedure will be unsuccessful
and exit with a warning message. In this case the user should collect more known genes with introns and
then try the training procedure again. If not enough data is available for training the translational
start sites of the genes, the training procedure will still be successful and GlimmerM will consider
any ATG a potential start site.
III.4 Changing the parameters obtained during training
This step is optional to the user but sometimes a manual tuning of the system could improve the accuracy
of the predictions.
Choosing adequate thresholds that determine the false negative-false positive ratio for specific site
detection may be a challenging problem given the small number of true sites and the overwhelming number
of false positives, and requires a suite of decisions that maximizes the accuracy of the recognition
task without a big loss in the sensitivity. In the training procedure of GlimmerM the threshold for
calling a sequence a real splice site is chosen by examining the trade-off in the false positive rate.
The system creates a sorted list of thresholds, adjusting the scoring function so that it will miss
1,2,3, etc. true sites. The default threshold is chosen to be the score at which the false positive
rate drops by less than 1%. To allow a greater flexibility in setting the signals' thresholds, our
training procedure allows the user to consult the false positive and false negative rates specific
for the training data and set his own threshold.
After running trainGlimmerM, a log file can be consulted to find the default values set for some of
the parameters of GlimmerM. The config_file in the training directory specifies on each line the
following parameters:
1) minimum intron length,
2) maximum intron length,
3) minimum exon length,
4) maximum exon length,
5) maximum gene length,
6) a flag indicating if decision trees were used for computing the acceptor site scores,
7) a flag indicating if decision trees were used for computing the donor site scores,
8) the acceptor site threshold,
9) the donor site threshold,
10) the length of the filter window for the acceptor sites,
11) the length of the filter window for the donor sites,
12) the acceptor site threshold when filtering is used,
13) the donor site threshold when filtering is used,
14) a flag indicating if start codon modeling was used,
15) the start codon threshold.
With the exception of parameters 6, 7, and 14, all of the above parameters can be changed by hand.
IV. Running GlimmerM
The program GlimmerM takes two inputs: a DNA sequence file in FASTA format and a directory
containing the training files for the program. If not specified, the training directory is by
default the current working directory. For instance, if the user is using the pre-compiled
versions of GlimmerM located in the bin directory, the following command should be used:
glimmerm_
or
glimmerm_ -d
where is linux, alpha or sun.
The optional parameters that can be given to the program are:
-d dir |
Set the directory of the training files to dir. (default is current directory.) |
-g n |
Set minimum gene length to n. (default minimum gene length = 60.) |
-t n |
Set the threshold score above which a DNA region is called a gene to n.
If the in-frame score is greater or equal to n (an integer between 0 and
100), then the region is given a number and considered a potential gene. (default threshold score = 90.) |
-r |
Don't use independent probability score column. (default false.) |
+r |
Use independent probability score column. (default true.) |
-f |
Don't use maximal local filtering of the splice sites. (default false.) |
+f |
Use maximal local filtering of the splice sites. (default true.) |
-s |
Don't use the start site model. (default false.) |
+s |
Use the start site model. (default true) |
-5 |
Use threshold for the acceptor sites. |
-3 |
Use threshold for the donor sites. |
V. Gene Modeling Performance
The main problem when training a gene finder for a newly sequenced genome is the lack of positive examples.
Moreover, the training data set should contain genes that represent a random sample from the genes in that
genome, but practical considerations often make this requirement impossible to satisfy. The accuracy of the
resulting gene finder will depend not only on the modeling technique used but also on the training set, and
one can often dramatically improve gene finding results by re-training a system as more genes become known.
Because of these aspects, the numbers presented here should be only considered an optimistic upper bound of
the gene prediction accuracy of the gene finder.
GlimmerM trained for malaria has computed rates of sensitivity and specificity for nucleotide level
recognition above 94% and 97% respectively (Salzberg, Pertea et al. 1999; Pertea, Salzberg et al. 2000).
GlimmerM's accuracy has also been evaluated on the model plant Arabidopsis thaliana (Pertea and Salzberg 2002),
and has been shown to have comparable results to GeneMark.hmm and Genscan.
VI. Obtaining GlimmerM
GlimmerM is available free of charge under the open-source Artistic License.
To download GlimmerM please click here.
References
Pertea, M., X. Lin, et al. (2001). "GeneSplicer: a new computational method for splice site prediction." Nucleic Acids Res 29(5): 1185-90.
Pertea, M. and S. L. Salzberg (2002). "Computational gene finding in plants." Plant Molecular Biology 48(1-2): 39-48.
Pertea, M., S. L. Salzberg, et al. (2000). "Finding genes in Plasmodium falciparum." Nature 404(6773): 34; discussion 34-5.
Salzberg, S. L., M. Pertea, et al. (1999). "Interpolated Markov models for eukaryotic gene finding." Genomics 59(1): 24-31.
Yuan, Q., J. Quackenbush, et al. (2001). "Rice bioinformatics. analysis of rice sequence data and leveraging the data to other plant species." Plant Physiol 125(3): 1166-74.
Back to the CBCB Software Page
|