Prediction Informatics Resources And Techniques


OVERVIEW

This page is a repository for gene-finders and other bioinformatics prediction software developed by us and our collaborators.  It includes full source code for all our gene-finders, as well as C++ class libraries and other components for prediction and machine-learning tasks.

See also http://www.



Source Code -- Complete Projects
Project | Description | Language
TigrScan | GHMM gene-finder in the style of Genscan/Genie, written in highly optimized C++ and designed to be extensible and reusable for other tasks related to gene-finding.  Uses IMMs. | C++
GlimmerHMM | GHMM gene-finder in the style of Genscan/Genie that uses the techniques previously implemented in GlimmerM: OC1 decision trees and IMMs.  Fast and accurate. |
JIGSAW | Uses the output from gene-finders, splice-site prediction programs and sequence alignments to predict gene models. |
GlimmerM | Eukaryotic gene-finder using OC1 decision trees and Interpolated Markov Models (IMMs). |
Unveil | Pure HMM-based gene-finder based on the VEIL model.  Highly optimized C++. | C++
ELPH | Gibbs sampler for finding motifs in DNA; has been used for detecting exon splice enhancers (ESEs).  Also applicable to other motif-detection tasks. |
GeneSplicer | A fast, flexible system for detecting splice sites in eukaryotic DNA. |
Glimmer | Prokaryotic gene-finder using IMMs. |
TransTerm | A program that finds rho-independent transcription terminators in bacterial genomes. |
RBSfinder | A program to find ribosome binding sites in prokaryotic DNA. | Perl
VEIL | The original VEIL gene-finder. | C++
MORGAN | A eukaryotic gene-finder using OC1 decision trees. | C
All of the software listed above is Open Source and is distributed under the ARTISTIC LICENSE.  See www.opensource.org.


 
Source Code -- Reusable Software Components
Package | Description | Language
OC1 | oblique decision trees for classification | C
NET | backpropagation neural networks for classification | C++
ET | entropy-based decision trees for classification | C++
Suffix Trees | Suffix trees by Stefan Kurtz | C
tigr++ | C++ class library used by several TIGR gene-finders and other packages.  Covers string & sequence processing, math/statistics, many efficient data structures, GFF parsing, sorting, and I/O. | C++
regress | multivariate regression for classification | C++
bayes | Naive Bayes classifier | C++
GP | genetic algorithms / genetic programming | C++
KNN | K-nearest-neighbors classifier with Mahalanobis distance (to control for correlation among attributes) and feature selection based on F-ratio. | C++









All of the software listed above is Open Source and is distributed under the ARTISTIC LICENSE.  See www.opensource.org.
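
The KNN entry above mentions Mahalanobis distance as a way to control for correlation among attributes. The following is a minimal Python sketch of that idea, not the code of the KNN package itself: the function names (covariance_matrix, invert, mahalanobis_sq, knn_predict) are hypothetical, the covariance is estimated from the whole training set, and the F-ratio feature selection step is left out.

```python
from collections import Counter


def covariance_matrix(rows):
    """Sample covariance matrix of the training vectors (divides by n - 1)."""
    n = len(rows)
    mu = [sum(col) / n for col in zip(*rows)]
    d = len(mu)
    cov = [[0.0] * d for _ in range(d)]
    for r in rows:
        diff = [x - m for x, m in zip(r, mu)]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / (n - 1)
    return cov


def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting; raises on singular input."""
    n = len(matrix)
    # Augment the matrix with the identity: [A | I] is reduced to [I | A^-1].
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mahalanobis_sq(a, b, inv_cov):
    """Squared Mahalanobis distance (a - b)^T * inv_cov * (a - b)."""
    diff = [x - y for x, y in zip(a, b)]
    d = len(diff)
    return sum(diff[i] * inv_cov[i][j] * diff[j]
               for i in range(d) for j in range(d))


def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training points closest to the query."""
    inv_cov = invert(covariance_matrix(train_x))
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: mahalanobis_sq(pair[0], query, inv_cov))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

With an identity covariance matrix the distance reduces to the squared Euclidean distance; with correlated attributes, differences along a direction in which the data vary little count for more than differences along a direction of large variance.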



Documentation
TigrScan User Manual | How to install and use the TigrScan gene-finder
TigrScan Training Manual | How to train the TigrScan gene-finder
TigrScan Software Architecture | How the TigrScan gene-finder software is structured -- for those who wish to modify the program
TIGR++ Reference Manual | Class reference
Machine-learning packages | Brief descriptions of the Bayesian / neural / tree / regression / nearest-neighbor / genetic-algorithm packages listed above



Training Data
Arabidopsis.thaliana.tar.gz | GFF coordinates & FASTA file
Aspergillus.fumigatus.tar.gz | GFF coordinates & FASTA file
Aspergillus.spp.tar.gz | GFF coordinates & FASTA file
Homo.sapiens.tar.gz | GFF coordinates & FASTA file
Mus.musculus.tar.gz | GFF coordinates & FASTA file
Plasmodium.falciparum.tar.gz | GFF coordinates & FASTA file
sample-prediction-problems.tar.gz | sample training/test/configuration files for machine-learning packages
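
The training sets above pair GFF coordinates with a FASTA file. As a rough illustration of reading such coordinates, the Python sketch below parses the standard tab-separated GFF columns (sequence name, source, feature, start, end, score, strand, frame, attributes). It assumes the files follow that general layout; the exact GFF dialect and file names inside the archives are not specified here, and GffFeature and parse_gff are hypothetical names, unrelated to the GFF parser in tigr++.

```python
from typing import Iterable, Iterator, NamedTuple


class GffFeature(NamedTuple):
    seqid: str
    source: str
    feature: str
    start: int   # 1-based, inclusive
    end: int     # 1-based, inclusive
    score: str
    strand: str
    frame: str
    attributes: str

    @property
    def length(self) -> int:
        # GFF coordinates are inclusive at both ends.
        return self.end - self.start + 1


def parse_gff(lines: Iterable[str]) -> Iterator[GffFeature]:
    """Yield one GffFeature per data line, skipping blank and comment lines."""
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            raise ValueError(f"expected at least 8 tab-separated fields: {line!r}")
        # Assumption: the attributes column may be absent; pad it with "".
        fields = fields[:9] + [""] * (9 - len(fields))
        seqid, source, feature, start, end, score, strand, frame, attrs = fields
        yield GffFeature(seqid, source, feature, int(start), int(end),
                         score, strand, frame, attrs)
```

A typical use would be `for f in parse_gff(open(path)): ...`, where `path` points at a GFF file extracted from one of the archives.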



