GlimmerM User Manual

GlimmerM is a gene finder developed specifically for small eukaryotes with a gene density of around 20% (Salzberg, Pertea et al. 1999 ). Originally developed for Plasmodium falciparum, the malaria parasite , the system has been trained for several other organisms, including Arabidopsis thaliana, Oryza sativa (Yuan, Quackenbush et al. 2001), and Aspergillus fumigatus, and should also work well on closely related organisms. A special package of GlimmerM re-trains the system using data provided by the user, therefore making the gene finder able to find genes in any organism for which a training set can be collected.

II.Algorithm overview

The basis of GlimmerM is a dynamic programming algorithm that considers all combinations of possible exons for inclusion in a gene model and chooses the best of all these combinations. The possible exon-intron combinations are formed after an initial screening of the possible translational start sites and splice sites found in the genome. Both these entities are determined with specially designed modules based primarily on Markov chains. There are a number of assumptions used by GlimmerM when predicting genes in the DNA sequence.

The main assumptions are

(1) the coding region of every gene begins with a start codon ATG,

(2) a gene has no in-frame stop codons except the very last codon,

(3) each exon is in a consistent reading frame with the previous exon.

These constraints significantly enhance the efficiency of computing the optimal gene models, by restricting the search space of the DP algorithm. On the other hand, genuine frame shifts cannot be detected by the system.

The decision about what gene model is best is a combination of the strength of the splice sites and the score of the exons produced by a special generalization of a Markov chains called an interpolated Markov model (IMM) (Salzberg, Pertea et al. 1999). GlimmerM uses the same IMM algorithm as in Glimmer. Based on the success of Glimmer in bacterial sequence annotation, we thought that IMMs should make a good foundation for eukaryotic gene finding. This is particularly true of small eukaryotes like P. falciparum in which the gene density is intermediate between that of prokaryotes and higher eukaryotes. Besides IMMs, GlimmerM uses other methods to score potential coding regions. A scoring function based on decision trees is built in order to estimate the probability that a DNA subsequence is coding or not. Five types of subsequences are evaluated: introns, initial exons, internal exons, final exons and single exons. The probabilities obtained with the decision trees are averaged to produce a smoothed estimate of the probability that the given subsequence is of a certain type. A gene model is only accepted if the IMM score over all coding sequences exceeds a fixed threshold.

III.Training an organism specific GlimmerM

III.1. Need of a specific organism training

GlimmerM trained for rice produced clearly superior gene models than GlimmerM trained for Arabidopsis (Pertea and Salzberg 2002). Although the current trainings of GlimmerM should work well on closely related organisms, the user should use our training procedure to re-train the system for an other eukaryotic organism in order to improve the accuracy of the gene prediction. For all species that GlimmerM is already trained, we expect the performance to improve even further by re-training the system as the amount of DNA sequences from these species continues to grow.

III.2 Data collection

First of all it should be noted that a thorough collection of a good training data set is a step that should precede the training of any gene finder. This step is very important since the quality of the data presented for training is directly proportional with the accuracy of the resulted gene finder. As any species-specific gene finder, GlimmerM needs a training data set containing as many as possible complete coding sequences from the organism genome for which the gene prediction is intended. By surveying the public databases, one should be able to find all previously known genes of the target organism that are backed by laboratory evidence. These genes would form a sound training data set if their number would be large enough. Unfortunately, this is rarely the case, so one should use other methods in order to construct a reliable data set. From our experience, most often the training data set was formed by searching long ORFs (with more than 500 bases) against non redundant protein sequence databases by using programs like BLAST in order to map known proteins into the genome.

III.3 Training GlimmerM

To train GlimmerM you should run trainGlimmerM with the following command:

trainGlimmerM <mfasta_file> <exon_file> [optional_parameters]

<mfasta_file> and <exon_file> are the multi-FASTA file and the file containing the exon coordinates of the known genes, respectively.
One of the steps of the gene finding algorithm of GlimmerM is to determine potential splice sites in the DNA sequences presented at input. Sequences that contain the consensus GT or AG dinucleotide and score about a fixed threshold are retained as potential donor or acceptor sites respectively. A significant number of false positive splice sites is further eliminated by keeping only those sequences whose score was maximal within a fixed DNA window (Pertea, Lin et al. 2001). The default length of the DNA window used to filter the locally optimal splice sites as before is 60bp. This value can be changed by using two optional parameters with the trainGlimmerM procedure:

-a [filter value]
where [filter value] is an integer specifying the window length for filtering locally maximal acceptor sites (default=60)

-d [filter value]
where [filter value] is an integer specifying the window length for filtering locally maximal donor sites (default=60)

If not enough data is available for training the splice sites the training procedure will be unsuccessful and exit with a warning message. In this case the user should collect more known genes with introns and then try the training procedure again. If not enough data is available for training the translational start sites of the genes, the training procedure will still be successful and GlimmerM will consider any ATG a potential start site.

III.4 Changing the parameters obtained during training

This step is optional to the user but sometimes a manual tuning of the system could improve the accuracy of the predictions.
Choosing adequate thresholds that determine the false negative-false positive ratio for specific site detection may be a challenging problem given the small number of true sites and the overwhelming number of false positives, and requires a suite of decisions that maximizes the accuracy of the recognition task without a big loss in the sensitivity. In the training procedure of GlimmerM the threshold for calling a sequence a real splice site is chosen by examining the trade-off in the false positive rate.
The system creates a sorted list of thresholds, adjusting the scoring function so that it will miss 1,2,3, etc. true sites. The default threshold is chosen to be the score at which the false positive rate drops by less than 1%. To allow a greater flexibility in setting the signals' thresholds, our training procedure allows the user to consult the false positive and false negative rates specific for the training data and set his own threshold.

After running trainGlimmerM, a log file can be consulted to find the default values set for some of the parameters of GlimmerM. The config_file in the training directory specifies on each line the following parameters:
1) minimum intron length,
2) maximum intron length,
3) minimum exon length,
4) maximum exon length,
5) maximum gene length,
6) a flag indicating if decision trees were used for computing the acceptor site scores,
7) a flag indicating if decision trees were used for computing the donor site scores,
8) the acceptor site threshold,
9) the donor site threshold,
10) the length of the filter window for the acceptor sites,
11) the length of the filter window for the donor sites,
12) the acceptor site threshold when filtering is used,
13) the donor site threshold when filtering is used,
14) a flag indicating if start codon modeling was used,
15) the start codon threshold.
With the exception of parameters 6, 7, and 14, all of the above parameters can be changed by hand.

IV. Running GlimmerM

The program GlimmerM takes two inputs: a DNA sequence file in FASTA format and a directory containing the training files for the program. If not specified, the training directory is by default the current working directory. For instance, if the user is using the pre-compiled versions of GlimmerM located in the bin directory, the following command should be used: glimmerm_ or glimmerm_ -d where is linux, alpha or sun.

The optional parameters that can be given to the program are:


-d dir	Set the directory of the training files to dir. (default is current directory.)
-g n	Set minimum gene length to n. (default minimum gene length = 60.)
-t n	Set the threshold score above which a DNA region is called a gene to n. If the in-frame score is greater or equal to n (an integer between 0 and 100), then the region is given a number and considered a potential gene. (default threshold score = 90.)
-r	Don't use independent probability score column. (default false.)
+r	Use independent probability score column. (default true.)
-f	Don't use maximal local filtering of the splice sites. (default false.)
+f	Use maximal local filtering of the splice sites. (default true.)
-s	Don't use the start site model. (default false.)
+s	Use the start site model. (default true)
-5	Use threshold for the acceptor sites.
-3	Use threshold for the donor sites.

V. Gene Modeling Performance

The main problem when training a gene finder for a newly sequenced genome is the lack of positive examples. Moreover, the training data set should contain genes that represent a random sample from the genes in that genome, but practical considerations often make this requirement impossible to satisfy. The accuracy of the resulting gene finder will depend not only on the modeling technique used but also on the training set, and one can often dramatically improve gene finding results by re-training a system as more genes become known. Because of these aspects, the numbers presented here should be only considered an optimistic upper bound of the gene prediction accuracy of the gene finder.

GlimmerM trained for malaria has computed rates of sensitivity and specificity for nucleotide level recognition above 94% and 97% respectively (Salzberg, Pertea et al. 1999; Pertea, Salzberg et al. 2000).

GlimmerM's accuracy has also been evaluated on the model plant Arabidopsis thaliana (Pertea and Salzberg 2002), and has been shown to have comparable results to GeneMark.hmm and Genscan.

VI. Obtaining GlimmerM

GlimmerM is available free of charge under the open-source Artistic License.

To download GlimmerM please click here.

References

Pertea, M., X. Lin, et al. (2001). "GeneSplicer: a new computational method for splice site prediction." Nucleic Acids Res 29(5): 1185-90.

Pertea, M. and S. L. Salzberg (2002). "Computational gene finding in plants." Plant Molecular Biology 48(1-2): 39-48.

Pertea, M., S. L. Salzberg, et al. (2000). "Finding genes in Plasmodium falciparum." Nature 404(6773): 34; discussion 34-5.

Salzberg, S. L., M. Pertea, et al. (1999). "Interpolated Markov models for eukaryotic gene finding." Genomics 59(1): 24-31.

Yuan, Q., J. Quackenbush, et al. (2001). "Rice bioinformatics. analysis of rice sequence data and leveraging the data to other plant species." Plant Physiol 125(3): 1166-74.

Back to the CBCB Software Page