JIGSAW: gene prediction using
multiple sources of evidence
JIGSAW is a program that combines the output of gene finders,
splice site prediction programs and sequence alignments to predict gene
models. It provides an automated way to take advantage of the
many successful methods for computational gene prediction and can
provide substantial improvements in accuracy over an individual gene
prediction program.
JIGSAW can be used with any species. We have tested JIGSAW on Human,
Rice (Oryza sativa), Arabidopsis thaliana, C. elegans, Brugia malayi,
Cryptococcus neoformans, Entamoeba histolytica, Theileria parva,
Aspergillus fumigatus, Plasmodium falciparum and Plasmodium yoelii.
UPDATE!
The linear combiner option is now available in the current JIGSAW
software distribution. This allows JIGSAW to be run without the use of
training data. A weight is assigned to each evidence source, and gene
predictions are based on a weighted voting scheme, yielding the best
'consensus' predictions.
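The weighted voting idea can be illustrated with a minimal sketch. It is not JIGSAW's actual implementation: the function name `weighted_consensus`, the per-nucleotide voting and the 0.5 threshold are all assumptions made for illustration. Each evidence source votes for the positions it marks as coding, and a position is kept when the supporting weight exceeds the chosen fraction of the total weight.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive (start, end)


def weighted_consensus(
    evidence: Dict[str, List[Interval]],
    weights: Dict[str, float],
    length: int,
    threshold: float = 0.5,  # hypothetical cutoff, not a JIGSAW parameter
) -> List[Interval]:
    """Return intervals whose weighted support exceeds `threshold`."""
    total = sum(weights[name] for name in evidence)
    if total <= 0:
        raise ValueError("total weight must be positive")

    # Accumulate the weight of every source that covers each position.
    score = [0.0] * (length + 2)
    for name, intervals in evidence.items():
        covered = set()
        for start, end in intervals:
            covered.update(range(start, end + 1))
        for pos in covered:
            score[pos] += weights[name]

    # Merge consecutive supported positions into intervals.
    consensus: List[Interval] = []
    run_start = None
    for pos in range(1, length + 2):
        supported = pos <= length and score[pos] / total > threshold
        if supported and run_start is None:
            run_start = pos
        elif not supported and run_start is not None:
            consensus.append((run_start, pos - 1))
            run_start = None
    return consensus


# Hypothetical evidence sources and coordinates.
evidence = {"finderA": [(10, 20)], "finderB": [(15, 25)], "alignC": [(18, 30)]}
equal = {"finderA": 1.0, "finderB": 1.0, "alignC": 1.0}
print(weighted_consensus(evidence, equal, length=40))
```

With equal weights, only the region backed by at least two of the three sources survives. Raising the weight of one source lets it carry a region on its own.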
Predictions are now available for the ENCODE regions in Human
and are viewable as custom tracks in the UCSC
Human Genome Browser.
Predictions are also available for the Human
genome and are viewable as custom tracks in the UCSC
Human Genome Browser.
Prediction Program | Gene sensitivity | Gene precision | Exon sensitivity | Exon precision | Nucleotide sensitivity | Nucleotide precision
JIGSAW | 59% | 66% | 87% | 89% | 90% | 98%
Ensembl | 62% | 50% | 85% | 80% | 85% | 95%
UCSC's KnownGene track | 65% | 38% | 84% | 77% | 82% | 93%
Table 1
The results in Table 1 measure the accuracy of JIGSAW, Ensembl and cDNA
alignments from the UCSC genome browser in Human. The test set is made up
of 1,563 genes. JIGSAW uses the output from Ensembl and the cDNA
alignments along with many other evidence sources available in the UCSC
genome database, including other gene finders and expression
evidence. Sensitivity is the percentage of true genes
(exons/nucleotides) that the program finds. Precision is
the percentage of the program's predicted genes (exons/nucleotides)
that are correct.
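The two measures defined above can be written out as a short sketch. The function name and the representation of exons as (start, end) tuples are assumptions for illustration. A prediction counts as correct here only when it matches a true item exactly.

```python
def sensitivity_precision(true_items, predicted_items):
    """Return (sensitivity, precision) for exact-match predictions."""
    true_set = set(true_items)
    pred_set = set(predicted_items)
    correct = len(true_set & pred_set)
    # Sensitivity: share of true items that were found.
    sensitivity = correct / len(true_set) if true_set else 0.0
    # Precision: share of predicted items that are correct.
    precision = correct / len(pred_set) if pred_set else 0.0
    return sensitivity, precision


# Hypothetical exon coordinates: the third prediction is off by one base.
true_exons = [(1, 10), (20, 30), (40, 50), (60, 70)]
predicted_exons = [(1, 10), (20, 30), (41, 50)]
sens, prec = sensitivity_precision(true_exons, predicted_exons)
print(f"sensitivity={sens:.2f} precision={prec:.2f}")
```

The same function applies at the gene or nucleotide level if the items are gene structures or individual coding positions.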
Prediction Program | Correct Genes | Missed Genes | Correct Exons | Missed Exons | Nucleotide Sensitivity
JIGSAW | 54% | 3% | 86% | 4% | 97%
FgenesH | 42% | 4% | 80% | 8% | 95%
GeneMark.hmm | 26% | 5% | 51% | 28% | 76%
Table 2
The results in Table 2 measure the accuracy of JIGSAW, FgenesH and
GeneMark.hmm in Oryza sativa. The test set includes 5,595 genes
comprising 26,827 exons. JIGSAW uses the output from FgenesH,
GlimmerR, GeneMark.hmm and Genscan, splice site predictions from
GeneSplicer, sequence alignments from a protein database and sequence
alignments from the TIGR gene indices.
Prediction Program | Correct Genes | Missed Genes | Correct Exons | Missed Exons | Nucleotide Sensitivity
JIGSAW | 78% | 1% | 93% | 3% | 98%
TwinScan | 67% | 1% | 87% | 4% | 96%
GeneMark.hmm | 45% | 2% | 79% | 5% | 96%
Genscan | 37% | 2% | 75% | 10% | 92%
GlimmerM | 32% | 1% | 71% | 9% | 93%
Table 3
The results in Table 3 measure the accuracy of gene prediction programs
in Arabidopsis thaliana. The test set includes 1,783 genes comprising
7,510 exons. JIGSAW uses output from the other gene prediction programs
listed in the table, an earlier version of GlimmerM, splice site
predictions from GeneSplicer, sequence alignments from a protein
database and sequence alignments from the TIGR gene indices.
JIGSAW is given a training set that consists of example output
from an automated gene structure annotation pipeline along with
sequence coordinates of known genes. JIGSAW compares the pipeline's
predicted genes to the example known genes to record the prediction
accuracy of each combination of evidence. A non-linear model is built
to estimate the accuracy of the different combinations of evidence
found in new data. JIGSAW pieces together the gene structure models most
likely to be accurate based on statistics collected in the training
set.
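The step of recording accuracy per combination of evidence can be sketched as follows. This is an illustration only: a plain frequency table stands in for the non-linear model, whose form is not described here, and the names `tally_combinations`, `candidates` and `known` are hypothetical.

```python
from collections import defaultdict


def tally_combinations(candidates, known):
    """Estimate accuracy for each combination of supporting sources.

    candidates: iterable of (feature, sources) pairs, where `feature` is
                a hashable exon description and `sources` names the
                evidence that predicted it.
    known:      set of features taken from the known genes.
    """
    counts = defaultdict(lambda: [0, 0])  # combination -> [correct, total]
    for feature, sources in candidates:
        combo = frozenset(sources)
        counts[combo][1] += 1
        if feature in known:
            counts[combo][0] += 1
    return {combo: correct / total for combo, (correct, total) in counts.items()}


# Hypothetical training examples: exons as (start, end) tuples.
known_exons = {(100, 200), (300, 400), (500, 600)}
candidates = [
    ((100, 200), ["finderA", "alignC"]),
    ((300, 400), ["finderA", "alignC"]),
    ((500, 600), ["finderA"]),
    ((700, 800), ["finderA"]),
]
for combo, accuracy in tally_combinations(candidates, known_exons).items():
    print(sorted(combo), accuracy)
```

In this toy data, exons supported by both sources are always correct, while those supported by `finderA` alone are correct half the time. A learned model generalizes this kind of table to combinations that were rarely or never seen in training.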
JIGSAW predicts gene models for a user-supplied genomic sequence. The
main interface is a simple "evidence list" file, which lists, for each
prediction program, the name of its output file, the file format and the
type of evidence. JIGSAW reads several coordinate-based file formats,
including GFF.
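To give a sense of what coordinate-based evidence looks like, here is a minimal sketch that reads GFF lines into records. It relies only on the standard tab-separated GFF column layout (sequence, source, feature, start, end, score, strand, frame, optional attributes). The evidence list format itself is not shown on this page, so it is not reproduced here. The parser and the sample data are illustrative and are not part of JIGSAW.

```python
from typing import List, NamedTuple, Optional


class GffFeature(NamedTuple):
    seqid: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: str
    attributes: str


def parse_gff(lines) -> List[GffFeature]:
    """Parse GFF text lines, skipping blank lines and '#' comments."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 columns: {line!r}")
        seqid, source, feature, start, end, score, strand, frame = fields[:8]
        features.append(
            GffFeature(
                seqid, source, feature, int(start), int(end),
                None if score == "." else float(score),  # "." means no score
                strand, frame,
                fields[8] if len(fields) > 8 else "",
            )
        )
    return features


# Hypothetical sample data; source names are placeholders.
sample = (
    "##gff-version 2\n"
    "chr1\tfinderA\texon\t100\t250\t0.9\t+\t0\tgene \"g1\"\n"
    "chr1\tsplicerB\tacceptor\t100\t101\t.\t+\t.\n"
)
for record in parse_gff(sample.splitlines()):
    print(record.source, record.feature, record.start, record.end)
```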
System requirements
JIGSAW is developed in C++ and compiles using GNU gcc 3.2 or newer.
To download the most recent JIGSAW system, just click HERE.
This software is OSI Certified
Open Source Software.
The distribution includes documentation on how to get started, along
with a tutorial that demonstrates, step by step, the process
of training and running JIGSAW. The tutorial is available online HERE.
Software development documentation
Library API
Application API
References
J. E. Allen, W. H. Majoros, M. Pertea and S. L. Salzberg. JIGSAW, GeneZilla, and GlimmerHMM: puzzling out the features of human genes in the ENCODE regions. Genome Biology 7(Suppl): S9, 2007.
J. E. Allen and S. L. Salzberg. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics 21(18): 3596-3603, 2005.
J. E. Allen, M. Pertea and S. L. Salzberg. Computational gene prediction using multiple sources of evidence. Genome Research 14(1), 2004.
Development of JIGSAW was supported in part by NIH grant R01-LM06845 to SLS.
jeallen - umiacs umd edu