Appendix A. Software and
Databases for Computational Biology on the Internet
This page is a supplement to the book Computational
Methods in Molecular Biology, edited by Steven Salzberg, David
Searls, and Simon Kasif. The publisher is Elsevier Science. Please
contact Steven Salzberg (salzberg@umiacs.umd.edu) if you wish to have
your software referenced on this site, or if you wish to change the
description of your software already listed here. NOTE: many of
the links below are now out of date - this list first appeared in 1998
- but I updated some of them in 2006 and 2007.
Gene finders and other sequence analysis programs
- Glimmer is a
system that uses Interpolated Markov Models (IMMs) to identify coding
regions in microbial DNA. IMMs are a generalization of Markov models
that allow great flexibility in the choice of the "context"; i.e., how
many previous bases to use in predicting the next base. Glimmer has
been tested on the complete genomes of H. influenzae, E. coli, H.
pylori, M. genitalium, and other genomes, and results to date have
proven it to be highly accurate. Glimmer was the principal gene finder
for the genomes of B. burgdorferi, T. pallidum, C.
trachomatis, C. pneumoniae, D. radiodurans, T. maritima, and
others. The complete system, including source code, is available from
this site. A version of the system built for the malaria
parasite, GlimmerM, is also available.
- GeneFinding.org is a
page with links to most of the latest eukaryotic gene
finders. It has much better links and is more up to date
than most of the links below.
- GENSCAN is a
program designed to predict complete gene structures, including exons,
introns, promoter and poly-adenylation signals, in genomic sequences.
It differs from the majority of existing gene finding algorithms in
that it allows for partial genes as well as complete genes and for the
occurrence of multiple genes in a single sequence, on either or both
DNA strands. Program versions suitable for vertebrate, nematode
(experimental), maize and Arabidopsis sequences are currently
available. The vertebrate version also works fairly well for Drosophila
sequences. Sequences can be submitted on a web-based form at this site.
The GENSCAN Web site is at Stanford University.
- GeneMark is a system
for finding genes in bacterial DNA sequences. The algorithm is based on
non-homogeneous 5th-order Markov chains, and it was used to locate the
genes in the complete genomes of H. influenzae, M. genitalium, and
several other complete genomes. The site includes documentation and a
Web interface to which sequences can be submitted. This system is at
the Georgia Institute of Technology in Atlanta, GA.
- NetPlantGene is at the
Technical University of Denmark. The NetPlantGene Web server uses
neural networks to predict splice sites in Arabidopsis thaliana
DNA. This site also contains programs for other sequence analysis
problems, such as the recognition of signal peptides.
NetPlantGene is to be replaced with NetGene2.
- Repeat
Pattern Toolkit (RPT) consists of tools for analyzing repetitive
sequences in a genome. RPT takes as input a single sequence in GenBank
format, and attempts to find both coding (possible gene duplications,
pseudogenes, homologous genes) and non-coding repeats. RPT locates all
repeats using a fast Sensitive Search Tool (SST). These repeats are
evaluated for statistical significance utilizing a sensitive All-PAM
search, and their evolutionary distance is estimated. The repeats are
classified into families of similar sequences. The classification
output is tabulated using perl scripts and plotted using gnuplot. RPT
is at the Institute for Biomedical Computing at Washington University
in St. Louis.
- SplicePredictor
is a program designed to predict donor and acceptor splice sites in
maize and Arabidopsis sequences. Sequences can be submitted on a
web-based form at this site. The system is at Stanford University.
- TESS
(Transcription Element Search Software) is a set of software routines
for locating and displaying transcription factor binding sites in DNA
sequence. TESS uses the Transfac database as its store of transcription
factors and their binding sites. This page is at the University of
Pennsylvania's Computational Biology and Informatics Laboratory.
- Genotator,
a workbench for automated sequence annotation, provides a flexible,
transparent system for automatically running a series of sequence
analysis programs on genetic sequences. It also has a graphical display
that allows users to view all of the automatically-generated
annotations and add their own. Genotator's display allows annotated
sequences to be examined at multiple levels of detail, from an overview
of the entire sequence down to individual bases. By displaying the
aligned output of multiple types of sequence analysis, Genotator
provides an intuitive way to identify the significant regions (for
example, probable exons) in a sequence. Genotator was developed by Nomi
Harris at Lawrence Berkeley National Laboratory.
- WebGene (GenView,
ORFGene, SpliceView) is a Web interface for several coding region
recognition programs, including:
- GenView: a system for protein-coding gene prediction
- ORFGene: gene structure prediction using information on
homologous protein sequences
- SpliceView: prediction of splicing signals
- HCpolya: a Hamming Clustering Method for Poly-A prediction
in eukaryotic genes
This page is at the Instituto Tecnologie Biomediche Avanzate in Italy.
- The Staden Package
contains a wealth of useful programs for sequence assembly, DNA
sequence comparison and analysis, protein sequence analysis, and
sequencing project quality control. The site is mirrored in several
locations around the world.
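
The Markov-chain idea mentioned for GeneMark and Glimmer above can be illustrated with a toy fixed-order model. This is only a minimal sketch and not the code of either program. The function names `train_markov` and `log_prob` are hypothetical, and the add-one smoothing is an assumption made for the example. The sketch uses a single context length k, whereas Glimmer's IMMs combine several context lengths and GeneMark's non-homogeneous chains depend on codon position; both refinements are omitted here.

```python
from collections import defaultdict
import math


def train_markov(seqs, k):
    """Count (k-base context -> next base) transitions over training sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    return counts


def log_prob(seq, counts, k, alphabet="ACGT"):
    """Log-probability of seq[k:] under the model (add-one smoothing, an assumption)."""
    total = 0.0
    for i in range(k, len(seq)):
        ctx = counts.get(seq[i - k:i], {})  # unseen context -> uniform estimate
        n = sum(ctx.values())
        total += math.log((ctx.get(seq[i], 0) + 1) / (n + len(alphabet)))
    return total
```

One way to use such a model is to train one set of counts on known coding sequences and another on non-coding sequences, then compare the two `log_prob` values for a candidate region; the higher score suggests which class the region resembles more.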
Databases
- The
NCBI Entrez PubMed Browser, at the National Center for
Biotechnology Information (NCBI), is the world's largest and most
heavily used resource for
searching GenBank as well as other NCBI
databases.
- PhyloFacts
is a searchable database of pre-calculated structural and phylogenomic
analyses of over 40,000 protein families. Each family contains a
multiple
sequence alignment, one or more phylogenetic trees, predicted
functional subfamilies and 3D structures, PFAM domains, GO
annotations, biological literature, predicted critical residues and
hidden Markov models; all core data can be downloaded freely from the
resource. Over 1M hidden Markov models (HMMs) for families and
subfamilies are provided to enable classification of novel sequences
to different levels of a functional hierarchy and for prediction of 3D
structure. Users can also perform database queries to retrieve protein
families meeting their specified criteria, or simply browse the
resource.
- GeneCards
is a database of human genes, their products and their involvement in
diseases. It offers concise information about the functions of all
human genes that have an approved symbol as well as selected others. It
is especially useful for those working in functional genomics and
proteomics who are searching for information. The data are collected
with knowledge discovery and data mining techniques and accessed by
means of a proprietary Guidance System that makes more or less
intelligent suggestions to the user about where and how the information
may be retrieved.
- HHS Sequence
Classification. HHS is a database of sequences that have been
clustered based on a variety of criteria. The database and clustering
algorithms are described in Chapter 6. This Web page, at the Institute
for Biomedical Computing at Washington University in St. Louis, allows
one to access classifications by sequence, group listing, structure,
and alignment.
- The
EpoDB (Erythropoiesis Database) is a database of genes that relate
to vertebrate red blood cells. A detailed description of EpoDB can be
found in Chapter 5. The database includes DNA sequence, structural
features and potential transcription factor binding sites. This Web
site is at the University of Pennsylvania's CBIL.
- The LENS
(Linking ESTs and their associated Name Space) database links and
resolves the names and identifiers of clones and ESTs generated in the
I.M.A.G.E. Consortium/WashU/Merck EST project. The name space includes
library and clone IDs and names from IMAGE Consortium, EST sequence IDs
from Washington University, sequence entry accession numbers from
dbEST/NCBI, and library and clone IDs from GDB. LENS allows for
querying of IMAGE Consortium data via all the different IDs.
- PDD, the NIMH-NCI
Protein-Disease Database is at the Laboratory of Experimental and
Computational Biology at the National Cancer Institute. This server is
part of the NIMH-NCI Protein-Disease Database project for correlating
diseases with proteins observable in serum, CSF, urine and other common
human body fluids based on biomedical literature.
- The
TRANSFAC Database is at the Gesellschaft für Biotechnologische
Forschung mbH (Germany). TRANSFAC is a transcription factor database.
It compiles data about gene regulatory DNA sequences and protein
factors binding to them. On this basis, programs are developed that
help to identify putative promoter or enhancer structures and to
suggest their features.
- TransTermHP
is a program for predicting transcription terminators in bacteria, and
the website has a database of predictions for several hundred genomes.
Motif Search:
- ELPH is a general-purpose Gibbs sampler for finding motifs
in a
set of DNA or protein sequences. The program takes as input a set
containing anywhere from a few dozen to thousands of sequences,
and searches through them for the most common motif, assuming that
each sequence contains one copy of the motif. ELPH
has been used to find patterns such as ribosome binding sites (RBSs) and
exon
splicing enhancers (ESEs).
- PROSITE
Search Form allows you to rapidly compare a protein sequence
against all patterns stored in the PROSITE pattern database.
ScanProsite: the Protein
against Prosite form allows one to scan a protein sequence (either
from SWISS-PROT or provided by the user) for occurrences of
patterns stored in the PROSITE database. Pattern against
SWISS-PROT scans all of the SWISS-PROT database (including
weekly releases) for occurrences of a pattern that can originate
from PROSITE or be provided by the user. (ExPASy)
- Motifs in
protein databases: this program determines whether a protein motif is present
in a database of protein sequences. It allows the user to
define a protein motif and then determine whether a DNA sequence might
encode it or whether it is present in a protein database. The programs
do not search a library of predefined protein motifs. A motif is
defined by entering the amino acids of interest at each
position. (Alces)
- MatInspector: A
tool for the detection of transcription factor binding sites. It is
able to locate matches of sequences of unlimited length and compare
one, several or all sequences in a sequence file against all or
selected subsets of matrices from a library of matrix descriptions of
protein binding sites. (GSF)
- MEME - Multiple EM for
Motif Elicitation system allows one to discover motifs (highly
conserved regions) in groups of related DNA or protein sequences and
to search sequence databases with those motifs using MAST.
MAST works by calculating match scores for each sequence. The match scores
are converted into various types of p-values, and these are used to
determine the overall match of the sequence to the group of
motifs. (SDSC)
- Regular Expression
Searches of Sequence DB using FPAT. This page is designed to search
a molecular sequence database (proteins only) for patterns using simple
regular expressions. At present, only protein sequence databases are
available on the server. (Univ. of Toronto)
- BCM Search Launcher:
The Baylor College of Medicine has a variety of biology-related
search and analysis services, including
general protein sequence/Pattern searches and
Species-Specific Protein Sequence Searches. (HGSC)
- Screening
pattern or alignment against PROTEIN databank: This method of
finding all occurrences of a pattern in the PROTEIN databank is almost the same
as the PROSITE screening procedure. The one difference is that
a match between a pattern letter and a fragment letter can be interpreted in a broad
sense: as similarity of the letters according to a weight matrix selected
by the user. (Genebee)
- Pattern Searching Proteins:
A collection of software tools for protein sequence analysis.
PATTINPROT scans a protein sequence or a protein database for one
or several patterns.
PROSCAN scans a protein sequence for sites/signatures against the
PROSITE database.
- PPSEARCH:
Prosite Database Searches (sequence against databases of motifs).
Allows you to search sequences for motifs or functional patterns in the
PROSITE database. (EBI)
- COGnitor:
Compare your sequence to the COG (Clusters of Orthologous Groups) database.
Each COG consists of individual proteins or groups of paralogs from at
least 3 lineages and thus corresponds to an ancient conserved domain.
(NCBI)
- HMMER (Sean Eddy): Profile
hidden Markov models can be used to do sensitive database searching
using statistical descriptions of a sequence family's consensus. The
advantage of using HMMs is that they have a formal probabilistic basis
and can be trained from unaligned sequences if a trusted alignment
isn't yet known. They do, however, make poor models of RNAs because they
cannot describe base pairs. HMMER is a freely distributable
implementation of profile HMM software for protein sequence analysis.
(Washington Univ.)
- emotif is a
research system that forms motifs for subsets of aligned sequences.
Emotif ranks the motifs that it finds by both their specificity and the
number of supplied sequences that they cover. (Stanford Bioinformatics
Group)
- FunSiteP
Promoter Recognition: Recognition and classification of eukaryotic
promoters by searching transcription factor binding sites using
transcription factor binding site consensus sequences. (GSF)
- SMART
Simple Modular Architecture Research Tool: Allows rapid identification
and annotation of signalling protein domain sequences. It is able to
determine the modular architectures of single sequences or genomes.
(EBI)
- SAM:
Sequence Alignment and Modeling System using HMMs (hidden Markov
models). SAM is a collection of software tools for creating, refining,
and using linear HMMs for biological sequence analysis. Documentation
for SAM can be found
here. (Pasteur)
- BioRainbow:
Human Promoter Extractor (Prom Extra). The Promextra Windows
application is designed to make extracting promoters
fast and simple for a given list of genes.
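
Several of the services above scan sequences against PROSITE-style patterns or simple regular expressions. As a rough illustration of how the two notations relate, the sketch below translates the common elements of PROSITE pattern syntax into a Python regular expression. It is not the code used by any of the listed servers. The function name `prosite_to_regex` is hypothetical, and the sketch assumes only the basic elements: residues, `x`, `[..]`, `{..}`, repeat counts, and the `<` and `>` anchors at the ends of a pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern, e.g. 'C-x(2,4)-C', into a regex string."""
    out = []
    for elem in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if elem.startswith("<"):          # '<' anchors the pattern at the N-terminus
            prefix, elem = "^", elem[1:]
        if elem.endswith(">"):            # '>' anchors the pattern at the C-terminus
            suffix, elem = "$", elem[:-1]
        # Split off an optional repeat count: 'x(2,4)' -> body 'x', rep '2,4'
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", elem)
        if m is None:
            raise ValueError("malformed pattern element: %r" % elem)
        body, rep = m.group(1), m.group(2)
        if body == "x":                   # any residue
            core = "."
        elif body.startswith("{") and body.endswith("}"):
            core = "[^" + body[1:-1] + "]"  # any residue except those listed
        else:
            core = body                   # single residue or [ABC] class
        if rep:
            core += "{" + rep + "}"
        out.append(prefix + core + suffix)
    return "".join(out)
```

For example, `re.search(prosite_to_regex("N-{P}-[ST]-{P}"), protein_sequence)` would look for the first position where an N is followed by any residue other than P, then S or T, then again any residue other than P; `protein_sequence` here stands for any uppercase one-letter amino acid string.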
Secondary Structure Prediction:
- THREADER2
is a program for predicting protein tertiary structure by recognizing
the correct fold from a library of alternatives. Of course, if a fold
similar to the native fold of the protein being predicted is not in the
library, then this approach will not succeed. Fortunately, certain
folds crop up time and time again, and so fold recognition methods for
predicting protein structure can be very effective. In the first
prediction contest held at Asilomar, organized by John Moult and
colleagues, THREADER correctly identified 8 out of 11 target structures
which either globally or locally resembled a previously observed fold.
Preliminary analysis of the results from the second competition (CASP2)
shows that THREADER 2 clearly improved in both fold
recognition sensitivity and sequence-structure alignment accuracy. In
CASP2, the new version of THREADER recognized 4 folds correctly out of
6 targets with recognizable structures (including the difficult task of
assigning a jelly-roll fold rather than other beta-sandwich topologies
for one target). THREADER 2 produced more correct fold predictions
(i.e. correct folds ranked at No. 1) than any other method.
- Predict
Protein is a service for sequence analysis and structure
prediction. Once you submit a protein sequence, PredictProtein
retrieves similar sequences in the database and predicts aspects of
protein structure, residue solvent accessibility and helical
transmembrane regions. (EMBL)
- MultPred - Multpredict Secondary Structure of Multiply Aligned Sequences: This
program predicts secondary structure using physicochemical information
from a set of aligned sequences and the Garnier secondary structure
decision constants. The program requires as input the sequences,
aligned using the AMPS program. (AMPS)
- NNPREDICT
Protein Secondary Structure Prediction: A program that predicts the
secondary structure type for each residue in an amino acid sequence.
The basis of the prediction is a two-layer, feed-forward neural
network. NNPREDICT takes as input a protein sequence and returns a
secondary structure prediction for each position in the sequence. (UCSF)
- PSA Protein Structure
Prediction Server: Predicts probable secondary structures and
folding classes for a given amino acid sequence. It performs three
types of protein structure/sequence analysis:
- Analysis of full length amino acid sequences that are
assumed to be monomeric globular, water-soluble proteins consisting of
a single domain
- Analysis of either complete sequences, or sequence fragments
with a minimal set of modelled structural assumptions
- Analysis of potential WD-repeat protein family
sequences
(BMERC at Boston University)
- SSCP
(Secondary Structural Content Prediction) computes predictions for
the content of helix, strand, and coil for a given protein using the
amino acid composition as the only input information. The method
used by SSCP applies analytic vector
decomposition methods to the composition vector of the query
protein.
- PREDATOR:
Secondary structure prediction from single or multiple sequences. Takes
as input a single protein sequence to be predicted and can optionally
use a set of unaligned sequences as additional information to predict
the query sequence. PREDATOR does not use multiple sequence alignment;
instead it relies on careful pairwise local alignments of the sequences
in the set with the query sequence to be predicted. If you supply a set
of sequences in the form of a multiple alignment in CLUSTAL or MSF
format, the sequences will be used, but in unaligned form. (EMBL)
- RNA
secondary structure prediction: If a multiple alignment is given by
the user, the information on conserved positions in it and
compensatory substitutions at some of those positions will be used: stems that include
such positions are given a better chance of being included in the
resulting secondary structure. The algorithm is as follows: first,
all of the possible ways of fitting together different pieces of the
sequences are searched for. Then locally optimal secondary structures are
built from the helices found. Lastly, the final structure is constructed
by optimizing the model energy of the system (which includes contributions from
conserved and complementary pairs with corresponding coefficients).
(Genebee)
- PSCAN
server page: A program to play with protein threading. Allows one to
align two sequences, find a match in the database through email (more
reliable) or without email.
- RNA-mfold
and DNA-mfold:
Performs RNA and DNA secondary structure prediction using nearest
neighbor thermodynamic (energy) rules; that is, free energies are assigned to
loops rather than to base pairs. Documentation for the programs and
more detail about how the structures are computed can be found here.
(M. Zuker at Washington University)
- SoWhat:
The SoWhat WWW server predicts distance constraints between amino acids
in proteins from the amino acid sequence. It uses a neural network
based method to predict contacts between C-alpha atoms from the amino
acid sequence. (CBS Denmark)
- Pasteur Institute:
- STRIDE:
Protein secondary structure assignment from atomic coordinates
- DSSP:
Definition of secondary structure of proteins given a set of 3D
coordinates
- DSC:
Discrimination of protein secondary structure class
- PREDATOR:
Protein secondary structure prediction from a single sequence or a set
of sequences
- environ:
Calculates accessible as well as buried surface area in a protein
structure
- confmat:
Side chain packing optimization on a given main chain template for
protein
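
The SSCP entry above describes a method whose only input is the amino acid composition of the query protein. The sketch below shows one plausible way to build such a composition vector; it illustrates the input representation only and does not implement SSCP's decomposition method or any prediction. The names `AMINO_ACIDS` and `composition_vector` are hypothetical, and the choice to ignore non-standard residue codes is an assumption made for the example.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code, in a fixed (alphabetical) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(seq):
    """Return the fraction of each standard amino acid in seq as a 20-element list."""
    counts = Counter(c for c in seq.upper() if c in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acid codes")
    return [counts[a] / total for a in AMINO_ACIDS]
```

The fractions in the returned list sum to 1, so two proteins of different lengths can be compared directly on their composition.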
Other Software and Information Sources:
- The
VSNS BioComputing Division offers educational services over the
Internet in bioinformatics/biocomputing. They have offered award-winning
online courses in sequence analysis. The site includes a
hypertext coursebook, covering topics such as pairwise sequence
alignments, networking, and multiple alignment. You can also find a
collection of online exercises, called "Sequence Analysis with
Distributed Resources", and the "Biocomputing For Everyone" and
"Biocomputing For Schools" Websites.
- The
Banbury Cross Site is a web page for benchmarking gene
identification software. Banbury Cross is at the Centre National De La
Recherche Scientifique. This Benchmark site is intended to be a forum
for scientists working in the field of gene identification and
anonymous genomic sequence annotation, with the goal of improving
current methods in the context of very large (in particular, vertebrate)
genomic sequences.
- CBIL bioWidgets,
at the University of Pennsylvania, is a collection of software
libraries used for rapid development of graphical molecular biological
applications. It includes:
- bioWidgets for Java(tm), a toolkit of biology-specific user
interface widgets useful for rapid application development in Java(tm)
- bioTK, a toolkit of biology-specific user interface widgets
useful for rapid application development in Tcl/Tk
- RSVP, a PostScript tool which lets your printer do nucleic
acid sequence analysis; it generates very nice color diagrams of the
results.
- Human
Genome Project Information at Oak Ridge National Laboratory
contains many interesting and useful items about the U.S. Human Genome
Project. They also have a more technical
Research site.
- FAKtory: A
software environment for DNA Sequencing is at the University of
Arizona. It is a prototype software environment in support of DNA
sequencing. The environment consists of
- their software library, FAK, for the core combinatorial
problem of assembling fragments
- a Tcl/Tk based interface
- a software suite supporting a database of fragments and a
processing pipeline that includes clipping, tagging, and vector removal
modules.
A key feature of FAKtory is that it is highly customizable: the
structure of the fragment database, the processing pipeline, and the
operation of each phase of the pipeline may be specified by the user.
- Computational
Analysis and Annotation of Sequence Data. This is a tutorial by A.
Baxevanis, M. Boguski, and B.F. Ouellette on how to use alignment
programs and databases for sequence comparison. It is a review that
will appear in the forthcoming book Genome Analysis: A Laboratory
Manual (Bruce Birren, Eric Green, Phil Hieter, Sue Klapholz and Rick
Myers, eds) to be published by Cold Spring Harbor Laboratory Press. The
hypertext version of the review is linked to Medline records, software
repositories, sequences, structures, and taxonomies via the Entrez
system of the National Center for Biotechnology Information.