JIGSAW: gene prediction using multiple sources of evidence

 

Overview

JIGSAW is a program designed to use the output from gene finders, splice site prediction programs and sequence alignments to predict gene models. The program provides an automated way to take advantage of the many successful methods for computational gene prediction and can provide substantial improvements in accuracy over an individual gene prediction program.

JIGSAW is available for all species. We have tested JIGSAW on Human, Rice (Oryza sativa), Arabidopsis thaliana, Brugia malayi, Cryptococcus neoformans, Entamoeba histolytica, Theileria parva, Aspergillus fumigatus, Plasmodium falciparum and Plasmodium yoelii.

NEW!
Predictions are now available for the ENCODE regions in Human and are viewable as custom tracks in the UCSC Human Genome Browser.

Accuracy

Prediction Program   Correct Genes   Missed Genes   Correct Exons   Missed Exons   Nucleotide Sensitivity
JIGSAW               54%             3%             86%             4%             97%
FgenesH              42%             4%             80%             8%             95%
GeneMark.hmm         26%             5%             51%             28%            76%

Table 1


The results in Table 1 measure the accuracy of JIGSAW, FgenesH and GeneMark.hmm in Oryza sativa. The test set includes 5,595 genes comprising 26,827 exons. JIGSAW uses the output of FgenesH, GlimmerR, GeneMark.hmm and Genscan, splice site predictions from GeneSplicer, sequence alignments from a protein database, and sequence alignments from the TIGR gene indices.


Prediction Program   Correct Genes   Missed Genes   Correct Exons   Missed Exons   Nucleotide Sensitivity
JIGSAW               78%             1%             93%             3%             98%
TwinScan             67%             1%             87%             4%             96%
GeneMark.hmm         45%             2%             79%             5%             96%
Genscan              37%             2%             75%             10%            92%
GlimmerM             32%             1%             71%             9%             93%

Table 2


The results in Table 2 measure the accuracy of gene prediction programs in Arabidopsis thaliana. The test set includes 1,783 genes comprising 7,510 exons. JIGSAW uses the output of the other gene prediction programs listed in the table, an earlier version of GlimmerM, splice site predictions from GeneSplicer, sequence alignments from a protein database, and sequence alignments from the TIGR gene indices.
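This page does not define the table columns, so the following Python sketch assumes the definitions commonly used when evaluating gene predictions: an exon is "correct" when both of its boundaries match a predicted exon exactly, "missed" when no predicted exon overlaps it, and nucleotide sensitivity is the fraction of annotated coding bases covered by a prediction. The function names and the toy coordinates are illustrative only and are not taken from the JIGSAW distribution.

```python
def exon_metrics(known, predicted):
    """Return (correct_fraction, missed_fraction) over the known exons.

    Exons are (start, end) tuples with inclusive coordinates, assumed to lie
    on the same sequence and strand. These definitions are an assumption.
    """
    predicted_set = set(predicted)
    # Correct: both boundaries match a predicted exon exactly.
    correct = sum(1 for exon in known if exon in predicted_set)
    # Missed: no predicted exon overlaps the known exon at all.
    missed = sum(
        1
        for (ks, ke) in known
        if not any(ps <= ke and ks <= pe for (ps, pe) in predicted)
    )
    return correct / len(known), missed / len(known)


def nucleotide_sensitivity(known, predicted):
    """Fraction of known exon bases that are covered by any predicted exon."""
    known_bases = {pos for (s, e) in known for pos in range(s, e + 1)}
    predicted_bases = {pos for (s, e) in predicted for pos in range(s, e + 1)}
    return len(known_bases & predicted_bases) / len(known_bases)


# Toy data: four annotated exons, three predicted exons.
known = [(100, 200), (300, 400), (500, 600), (700, 800)]
predicted = [(100, 200), (300, 390), (500, 600)]

exon_correct, exon_missed = exon_metrics(known, predicted)
nt_sens = nucleotide_sensitivity(known, predicted)
print(f"correct exons: {exon_correct:.0%}, missed exons: {exon_missed:.0%}")
print(f"nucleotide sensitivity: {nt_sens:.1%}")
```

Under these assumed definitions an exon that is overlapped by a prediction but has a wrong boundary counts as neither correct nor missed, which would explain why the two exon columns in the tables do not sum to 100%.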

Using JIGSAW

JIGSAW is given a training set consisting of example output from an automated gene structure annotation pipeline along with the sequence coordinates of known genes. JIGSAW compares the pipeline's predicted genes to the known genes and records the prediction accuracy of each combination of evidence. A non-linear model is built to estimate the accuracy of the different combinations of evidence found in new data. JIGSAW then pieces together the gene structure models most likely to be accurate, based on the statistics collected from the training set.
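
The idea of recording accuracy per combination of evidence can be illustrated with a small Python sketch. This is a deliberate simplification: it keeps a plain lookup table of observed accuracy, whereas JIGSAW builds a non-linear model, and it is not the program's actual implementation. The source names ("finderA", "finderB", "alignment") and the training examples are hypothetical.

```python
from collections import defaultdict


def tally_combination_accuracy(examples):
    """Estimate accuracy for each combination of evidence sources.

    `examples` is an iterable of (sources, is_correct) pairs, where `sources`
    is the set of evidence sources supporting a candidate feature and
    `is_correct` says whether that feature matches a known gene.
    """
    counts = defaultdict(lambda: [0, 0])  # combination -> [correct, total]
    for sources, is_correct in examples:
        key = frozenset(sources)
        counts[key][1] += 1
        if is_correct:
            counts[key][0] += 1
    return {key: correct / total for key, (correct, total) in counts.items()}


# Hypothetical training examples (not real data).
training = [
    ({"finderA", "finderB"}, True),
    ({"finderA", "finderB"}, True),
    ({"finderA", "finderB"}, False),
    ({"finderA"}, False),
    ({"finderA"}, True),
    ({"finderB", "alignment"}, True),
]

accuracy = tally_combination_accuracy(training)
for combination, value in accuracy.items():
    print(sorted(combination), round(value, 2))
```

A table like this cannot score a combination that never occurred in training, which is one reason to fit a model rather than look up raw counts.
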
JIGSAW predicts gene models for a user-supplied genomic sequence. The main interface is a simple "evidence list" file, which lists, for each prediction program, the name of its output file, the file format and the type of evidence. JIGSAW reads several coordinate-based file formats, including GFF.
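
For readers unfamiliar with coordinate-based formats, the sketch below parses one line of GFF in Python. It relies only on the standard GFF layout of nine tab-separated columns (sequence name, source, feature, start, end, score, strand, frame, attributes). It does not reproduce the evidence list format, which is described in the documentation shipped with the distribution, and the `GffRecord` and `parse_gff_line` names and the sample line are made up for illustration.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqid: str
    source: str
    feature: str
    start: int              # 1-based, inclusive
    end: int                # 1-based, inclusive
    score: Optional[float]  # None when the column holds "."
    strand: str
    frame: str
    attributes: str


def parse_gff_line(line):
    """Parse one GFF line; return None for blank lines and comments."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields, got {len(fields)}")
    seqid, source, feature, start, end, score, strand, frame = fields[:8]
    attributes = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        seqid, source, feature, int(start), int(end),
        None if score == "." else float(score),
        strand, frame, attributes,
    )


# Illustrative line only; not taken from real program output.
sample = 'chr1\tGenscan\texon\t1050\t1200\t0.92\t+\t0\tgene_id "g1"\n'
record = parse_gff_line(sample)
print(record.source, record.feature, record.start, record.end, record.strand)
```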

System requirements

JIGSAW is developed in C++ and compiles on Linux using gcc 2.95 and gcc 3.2 and on SunOS 5.8 using gcc 2.95. (The software should compile on many other platforms as well.)


Download

To download the most recent JIGSAW system, just click HERE.

This software is OSI Certified Open Source Software.


Documentation

The distribution includes documentation on how to get started, along with a tutorial that demonstrates, step by step, the process of training and running JIGSAW. The tutorial is available online HERE.

References

J. E. Allen and S. L. Salzberg. JIGSAW: integration of multiple sources of evidence for gene prediction. Bioinformatics, August 2, 2005. doi:10.1093/bioinformatics/bti609 (Open Access!)

J. E. Allen, M. Pertea and S. L. Salzberg. Computational gene prediction using multiple sources of evidence. Genome Research, 14(1), 2004.

Acknowledgements

Development of JIGSAW was supported in part by NIH grant R01-LM06845.

Contact Information

jeallen - umiacs umd edu
