Documentation
NOTE: This document is currently undergoing
revision as we attempt to improve the software and to automate some of
the steps listed in this manual. Please check back for updates to get the most
recent version. Thank you for your patience!
1. Introduction
TWAIN is a new comparative gene finder based on a
Generalized Pair Hidden Markov Model (GPHMM). This class of
algorithms models the DNA sequence, and the genes occurring in that DNA, for a
pair of related organisms as having been generated by a state-based
stochastic model. Each state in the model is conceptualized as
having the capacity to generate pairs of features, one feature per
genome. In this way, a state exists for generating introns,
another state exists for generating intergenic DNA, and several states
exist for generating the different types of exons which may compose a
gene.
In order for a GPHMM-based gene finder to predict genes accurately, the
gene finder must be supplied with parameters which accurately describe
the statistical properties of the average base compositions and typical
gene structures for the specific pair of organisms to which the program
is to be applied. These parameters must be estimated from sample
genes. This process of parameter estimation is called training, and is a necessary step
before the gene finder can be expected to produce reliable gene
predictions.
This document describes how to download, compile, install, train, use,
and troubleshoot the TWAIN system. Please bear in mind that TWAIN is a
new research tool consisting of ~25,000 lines of C++ code that is
currently undergoing many modifications and enhancements. It was
produced in a research environment, and is not a commercial
product. While we think TWAIN will be very useful for the
practical purpose of genome annotation, we wish to remind prospective
users that because the program is not a commercial product, it has not
undergone extensive testing for user-friendliness and
portability. Over the next several months we hope to improve the
usability of the system so that novice users can install and use it
in more of a "turn-key" fashion
on their own computer systems. Until then, we make TWAIN
available to those advanced users who are willing to invest the
required amount of effort in deploying the system at their site.
The documentation given below will assist in this undertaking.
Please report omissions and errors to the authors (an email
address is provided at the end of this document) so that they can be
corrected. Please check back often for updates to this
document. This document resides at:
2. Installation
2.1 Downloading and unpacking the programs
TWAIN consists of a number of programs, including MUMmer, ROSE, OASIS,
and TigrScan. Once these programs have been downloaded from the
TWAIN web site at
you should gunzip and untar each of the programs into a separate
directory using the command:
tar -xvzf <program>.tar.gz
We recommend that the following directory structure be created before
the programs are compiled. Some of these subdirectories will be created
automatically when the tar files are unpacked. Note that
duplicate directory names indicate symbolic links which must be created.
twain oasis obj tigr++ perl alignment obj tigr++ matrices tigrscan obj tigr++ perl rose perl mummer
To create a symbolic link in UNIX, use the command:
ln -s <directory> <linkname>
This will create a symbolic link in the current directory. The
link will appear as <linkname> and will point to
<directory>.
Installation instructions for MUMmer are given in the INSTALL file in
the MUMmer package. Additional information appears in the README file
and under the docs directory in the MUMmer package.
2.2 Compiling the programs
Instructions for compiling MUMmer are given in the INSTALL file in the
MUMmer package. The other programs can be installed as follows:
cd twain/oasis/tigrscan
make tigrscan
make mdd
make train-signal-sensor
make train-content-sensor
make WMM-add-pseudocounts
cd ../alignment
make needleman
cd ..
<edit tigrscan/tigrscan.H and uncomment the line: #define EXPLICIT_GRAPHS>
make oasis
The second-to-last step involves editing the tigrscan.H file and
uncommenting the line which defines the symbol EXPLICIT_GRAPHS.
This is necessary because the standard TigrScan distribution does not
use explicit graphs, whereas the OASIS program does. Note that
the stand-alone version of TigrScan should never be compiled with
EXPLICIT_GRAPHS.
Note that the C++ programs included in the TWAIN package were developed
under Linux and were found to compile flawlessly under the gcc C++
compiler version 3.3.3. Changes to the C++ standard often cause
source code which previously compiled to produce compile errors when
those changes are incorporated into the compiler. If your version
of g++ is not 3.3.3, compile errors may occur, necessitating changes to
the source code. These changes are generally of a superficial
nature, and you may wish to make those changes yourself, or you may
request the developers of TWAIN to make the changes to support your
system. Be aware that the developers of TWAIN have other
responsibilities, however, and may not be able to attend to your
request immediately.
Although the Perl programs included in the package do not need to be
compiled, they must be made executable before they can be
run. This can be done in UNIX using the command:
chmod +x *.pl
This command should be executed in the tigrscan, perl, oasis, and rose directories.
2.3 Installing the programs
Several environment variables need to be set before the program can be
run. If you use the csh or tcsh shells you can do this by using
the command:
setenv <variable> <value>
If you use a different shell, please consult the appropriate
documentation, or ask your system administrator. The following
variables must be set to their respective values:
TWAIN = full path to top-level twain directory
OASIS = full path to oasis subdirectory
TIGRSCAN = full path to tigrscan directory
PERLLIB = full path to perl subdirectory, where the *.pm files are stored
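Before running the pipeline, it can help to confirm that all four variables are set. The following Python sketch is only an illustration and is not part of the TWAIN distribution. The function name missing_twain_vars is hypothetical, and the check assumes that an empty value counts as unset.

```python
import os

# The four variables named in this section.
REQUIRED_VARS = ("TWAIN", "OASIS", "TIGRSCAN", "PERLLIB")


def missing_twain_vars(env=None):
    """Return the required variables that are unset or empty in env."""
    if env is None:
        env = os.environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_twain_vars()
    if missing:
        print("Not set:", ", ".join(missing))
    else:
        print("All TWAIN environment variables are set.")
```

Passing a dictionary instead of relying on os.environ makes the check easy to exercise without changing the real environment.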
3. Training the Gene Finder
There are two main components of TWAIN which need to be
trained/configured before the gene finder can be utilized: (1) the GHMM
gene finder TigrScan, and (2) the GPHMM gene finder OASIS. The
training of TigrScan is described in the documentation for TigrScan,
which can be found either in the doc
subdirectory of the TigrScan package, or online at:
Once TigrScan has been trained, OASIS can be configured by adding
entries to the TigrScan configuration files to specify the alignment
parameters and other settings which affect the operation of
OASIS. The configuration file is the one which has a filename
ending in .cfg.
The following lines should be appended to the configuration file.
Suggested default values are given for each variable shown below.
The optimal values for a given organism can be found by exploring the
parameter space and evaluating the performance of the gene finder at
each point in that space.
prob-nuc-match = 0.64
prob-amino-match = 0.58
use-signal-thresholds = true
signal-threshold-multiplier = 2
queue-capacity = 8
exon-optimism = 2
intron-optimism = 0
min-promer-identity = 40
min-promer-similarity = 45
max-promer-stop-codons = 25
off-match-portion = 0.05
signal-tolerance = 40
match-edge-tolerance = 200
nonexact-signal-tolerance = 0
reduction-level = 4
nucmer-maxgap = 30
nucmer-minmatch = 10
nucmer-mincluster = 30
needleman-matrix = NUC.4.4
needleman-gap-penalty = 0
amino-substitution-matrix = /......./alignment/matrices/blosum50
nuc-substitution-matrix = /......./alignment/matrices/NUC.4.4
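The settings above are simple name = value pairs. As an illustration only, the Python sketch below reads such lines into a dictionary so that they can be inspected or varied while exploring the parameter space. The parse_settings name is hypothetical, and the sketch assumes one setting per line with a single = separator. It makes no claim about how TigrScan or OASIS read the file.

```python
def parse_settings(text):
    """Parse 'name = value' lines into a dict of strings.

    Lines without '=' (including blank lines) are skipped; this is an
    assumption of the sketch, not documented behavior of the .cfg format.
    """
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


example = """
prob-nuc-match = 0.64
queue-capacity = 8
use-signal-thresholds = true
"""
settings = parse_settings(example)
print(float(settings["prob-nuc-match"]))  # values are strings until converted
print(int(settings["queue-capacity"]))
```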
The prob-nuc-match and prob-amino-match settings
determine, for nucleotide alignments and amino acid alignments,
respectively, the expected degree of conservation at the level of
individual residues in an alignment. They are proportions, expressed as
numbers between 0 and 1 (for example, 0.64 corresponds to 64%).
The use-signal-thresholds
parameter must be true or
false. If it is true, then any putative signal
(such as a donor, acceptor, or start or stop codon) which does not
score above a given threshold will be ignored and cannot be part of the
final gene prediction. The threshold which is applied is selected
during the training of TigrScan (see the TigrScan training manual for
details). If use-signal-thresholds
is set to false, then all
putative signals of the correct consensus will be considered as
possible signals in the final gene prediction. This generally
increases the amount of memory and time required by OASIS. The signal-threshold-multiplier
allows the signal thresholds to be modified from within the
configuration file (i.e., without retraining TigrScan). Each
signal threshold is multiplied by the value given for this
parameter. Since the thresholds are negative (being in log
space), a multiplier greater than 1 will lower the threshold and allow
more signals to be found.
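The arithmetic behind the multiplier can be sketched in Python as follows. The threshold and score values are made up for illustration, and the function names are hypothetical; the sketch only mirrors the description above (multiply the log-space threshold, then keep signals that score above it).

```python
def effective_threshold(trained_threshold, multiplier):
    """Scale a log-space signal threshold by the configured multiplier."""
    return trained_threshold * multiplier


def passes(score, trained_threshold, multiplier=1.0):
    """True if a putative signal scores above the scaled threshold."""
    return score > effective_threshold(trained_threshold, multiplier)


trained = -5.0  # hypothetical threshold chosen during training (log space)
score = -8.0    # hypothetical score of a putative donor site

print(passes(score, trained, 1))  # False: -8.0 is below -5.0
print(passes(score, trained, 2))  # True: the threshold drops to -10.0
```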
The queue-capacity
parameter dictates the sizes of the signal queues used for noncoding
features during gene prediction. When TigrScan analyzes the
sequence, it identifies putative signals by applying a threshold, as
described above. When a putative signal is found, it is linked
back to all appropriate predecessor signals in the corresponding signal
queues, and then the new signal is itself added to an appropriate
queue. These queues are kept sorted according to signal scores,
so that the worst signal currently in a queue is always at the bottom
of the queue. For noncoding features, the queues have a fixed size, so
that when a high-scoring signal enters the queue, the lowest-scoring
signal currently in the queue is dropped from the queue to make room
for the new signal. Once a signal is dropped from the queue, it cannot
be linked to any other signals found by TigrScan after this point in
the sequence, although the signal may still participate in the final
gene prediction if it was previously linked to other signals. For
coding features, the queues used in TigrScan have no upper limit on
their size, so that putative coding features are never ignored simply
because a queue has become full. Thus, the queue-capacity parameter
applies only to noncoding features.
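The fixed-size queue behavior described above can be sketched with a small Python class. This is not TigrScan's implementation: the class name, the labels, and the scores are hypothetical. The sketch also assumes that a new signal scoring below everything in a full queue is itself the one dropped, a case the description above does not cover.

```python
import heapq


class BoundedSignalQueue:
    """Keep at most `capacity` signals, dropping the lowest-scoring on overflow."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap of (score, label); worst signal at index 0

    def add(self, score, label):
        """Add a signal; return the dropped (score, label), or None."""
        item = (score, label)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
            return None
        # Queue is full: insert the new signal and remove the current minimum.
        return heapq.heappushpop(self._heap, item)

    def signals(self):
        """Return the retained signals, best score first."""
        return sorted(self._heap, reverse=True)


queue = BoundedSignalQueue(capacity=2)
queue.add(-3.0, "donor@120")
queue.add(-7.5, "donor@310")
dropped = queue.add(-1.2, "donor@480")
print(dropped)          # (-7.5, 'donor@310')
print(queue.signals())  # [(-1.2, 'donor@480'), (-3.0, 'donor@120')]
```

A min-heap keeps the worst signal at the front, so making room for a new signal costs O(log n) rather than a full re-sort.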
The exon-optimism and intron-optimism parameters are
log values which are added to the transition scores between states in
TigrScan's underlying GHMM. Their effect is to increase or
decrease the estimated probability of making a transition across an
exon or an intron. A positive value for exon-optimism will tend to
cause the gene finder to predict more exons than it otherwise would;
and similarly for intron-optimism
and the prediction of introns.
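Because these values are added in log space, they act as multiplicative factors on the underlying transition scores. The short Python sketch below illustrates this with a made-up transition probability. It assumes natural logarithms, which this document does not state, and the function name is hypothetical.

```python
import math


def adjust_transition(log_score, optimism):
    """Add an optimism value to a log-space transition score."""
    return log_score + optimism


log_p = math.log(0.1)                     # hypothetical transition probability
boosted = adjust_transition(log_p, 2)     # exon-optimism = 2
unchanged = adjust_transition(log_p, 0)   # intron-optimism = 0

# Adding 2 in natural-log space scales the underlying value by e**2 (about 7.39).
print(math.exp(boosted) / 0.1)
print(unchanged == log_p)  # True
```

Note that the adjusted value is a score rather than a normalized probability, since the scaled value can exceed 1.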
The min-promer-identity, min-promer-similarity, and max-promer-stop-codons
parameters are used to filter the HSPs (High Scoring Segment Pairs)
which are produced by the PROmer program. Any PROmer HSP not
satisfying all three of these thresholds is discarded by OASIS and
cannot be used as evidence for a conserved exon. For information
on setting these parameters, see the MUMmer reference manual. The
default values given above are reasonable values which have been found
to work well in practice for a pair of fungal genomes. These
values are percentages, expressed as integers between 1 and 100.
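The three-way filter can be sketched in Python as follows. The Hsp class and its field names are hypothetical stand-ins for whatever OASIS extracts from the PROmer output, and treating the thresholds as inclusive bounds is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    identity: float     # percent identity of the HSP
    similarity: float   # percent similarity of the HSP
    stop_codons: float  # percent stop codons in the HSP


def keep_hsp(hsp, min_identity=40, min_similarity=45, max_stop_codons=25):
    """True only if the HSP satisfies all three thresholds."""
    return (
        hsp.identity >= min_identity
        and hsp.similarity >= min_similarity
        and hsp.stop_codons <= max_stop_codons
    )


hsps = [Hsp(52.0, 60.0, 10.0), Hsp(38.0, 60.0, 10.0), Hsp(52.0, 60.0, 30.0)]
kept = [h for h in hsps if keep_hsp(h)]
print(len(kept))  # 1: the second fails on identity, the third on stop codons
```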
The off-match-portion, signal-tolerance, and match-edge-tolerance parameters
influence the sparseness of the dynamic programming matrix which is
used by OASIS to model the pairing of open reading frames between the
two species. Modifying these parameters should be done only after
consulting the source code and ensuring that you have a full
understanding of their impact. Changes to the sparseness of the
matrix can affect the accuracy of the gene finder as well as the
computational requirements, such as memory usage and run-time.
Similarly, the nonexact-signal-tolerance
parameter is provided only for research purposes, and should not be
modified by the user. It should be set to zero. The reduction-level parameter
indicates how aggressively OASIS should attempt to eliminate less
promising regions of the dynamic programming matrix. Legal values
are 0 through 4, though we highly recommend that casual users leave
this parameter set to 4.
The nucmer-maxgap, nucmer-minmatch, and nucmer-mincluster parameters
specify the maxgap, minmatch, and mincluster parameters which are
passed by ROSE to the NUCmer program, which is part of the MUMmer
package. For information on these parameters, please refer to the
MUMmer documentation.
The needleman-matrix
specifies the substitution matrix which is used by ROSE to align the
gaps between NUCmer hits when it is computing a global guide alignment
for use by OASIS. The needleman-gap-penalty
specifies the penalty which is assessed by the Needleman-Wunsch
algorithm for aligning a base to a gap.
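To show how a substitution matrix and a gap penalty interact in a Needleman-Wunsch alignment, here is a minimal Python sketch that computes only the optimal global alignment score. It is not the code used by ROSE. The simple_score function is a hypothetical stand-in for a substitution matrix, and the sketch assumes the penalty is a non-negative number subtracted once per base aligned to a gap.

```python
def needleman_wunsch_score(a, b, score, gap_penalty):
    """Return the optimal global alignment score of sequences a and b."""
    # prev[j] holds the best score for aligning the processed prefix of a
    # against the first j bases of b.
    prev = [-gap_penalty * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-gap_penalty * i]
        for j in range(1, len(b) + 1):
            curr.append(max(
                prev[j - 1] + score(a[i - 1], b[j - 1]),  # align two bases
                prev[j] - gap_penalty,                    # a[i-1] against a gap
                curr[j - 1] - gap_penalty,                # b[j-1] against a gap
            ))
        prev = curr
    return prev[-1]


def simple_score(x, y):
    """Hypothetical match/mismatch scores standing in for a matrix file."""
    return 5 if x == y else -4


print(needleman_wunsch_score("ACGT", "AGT", simple_score, 0))   # 15
print(needleman_wunsch_score("ACGT", "AGT", simple_score, 10))  # 5
```

With a penalty of 0 the unmatched C costs nothing, so the score is three matches (15). With a penalty of 10 the same alignment scores 5.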
The amino-substitution-matrix
and nuc-substitution-matrix
entries specify the substitution matrices which should be used by OASIS
to compute the percent similarity and percent identity of amino acid
and nucleotide approximate alignments, respectively.
4. Using the Gene Finder
Several scripts are provided in the ROSE package for running the TWAIN
pipeline.
The run-twain.pl script
runs the entire pipeline from beginning to end, starting with ROSE and
MUMmer and ending with OASIS. Although this may be suitable in
some circumstances, we have found it useful to run the individual
processes separately, so that once ROSE extracts the putative syntenic
regions, OASIS can be iteratively re-run on those regions using
different parameterizations until a suitable parameterization
has been found. Thus, until the user feels comfortable with the
configuration of TWAIN, it is advisable that the pipeline be run using
the runrose.pl and runoasis.pl scripts, which will
invoke ROSE and OASIS separately.
All of these scripts will produce usage statements if they are invoked
on the command line with no arguments. The usage statements
describe the required parameters to these programs in some detail.
The output from the gene finder is a pair of GFF files, one for each of
the two species. Each GFF file contains the set of gene
predictions emitted by the GPHMM. These predictions are given in
the General Feature Format, which is a convenient, standardized format
for specifying coordinates of features in genomic sequence.
Information on the syntax and meaning of GFF fields is given at:
It is important to note that the coordinates given in the GFF files are
1-based, which means that the very first base in a DNA sequence is
presumed to have index 1, rather than 0.
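When post-processing the output in a language with 0-based indexing, the coordinates need to be shifted. The Python sketch below assumes the standard tab-separated GFF layout (sequence name, source, feature, start, end, score, strand, frame, attributes) with inclusive start and end. The example line and its field values are invented for illustration and are not actual TWAIN output.

```python
def parse_gff_line(line):
    """Extract the core fields from one tab-separated GFF record."""
    fields = line.rstrip("\n").split("\t")
    return {
        "seqname": fields[0],
        "feature": fields[2],
        "start": int(fields[3]),  # 1-based, inclusive
        "end": int(fields[4]),    # 1-based, inclusive
        "strand": fields[6],
    }


def feature_sequence(sequence, start, end):
    """Slice a 0-based Python string using 1-based inclusive coordinates."""
    return sequence[start - 1:end]


line = "contig1\texample\texon\t3\t6\t.\t+\t0\tgene1"  # hypothetical record
record = parse_gff_line(line)
print(feature_sequence("ACGTACGT", record["start"], record["end"]))  # GTAC
```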
5. Troubleshooting
The following list provides troubleshooting hints that we
have found useful. This list will grow as we deploy the system
to a greater array of computing environments, so check back often for
updates.

Problem: The code does not compile.
Possible solutions:
- Try compiling the code with g++ version 3.3.3.
- Ensure that the correct directory structure has been
created, as shown above in section 2.1.
- Make sure an obj subdirectory has been created for the
object files.

Problem: The program complains that some
parameter is not set in the configuration file.
Possible solutions: Add the specified parameter to
the configuration file. The configuration file is a file having a
name ending in .cfg, and is referenced in the .iso file. You must
modify the standard .iso file so that it contains the full path to the
.cfg file on your system. See the TigrScan documentation for more
information.

Problem: PROmer runs out of memory during
the execution of ROSE.
Possible solutions: Try switching the order of the
reference and query parameters. PROmer builds a large data
structure called a suffix tree, but it builds this for only one of the
two genomes. Reversing the reference and query will force PROmer
to build the tree for the other genome, possibly requiring less space.

Problem: Perl complains that it cannot
find a module.
Possible solutions: You must set your PERLLIB or
PERL5LIB environment variable to the perl subdirectory in the twain
directory. If you do not do this, Perl cannot find the needed
modules.

Problem: The program does not produce any
predictions when a contig contains N's.
Possible solutions: If TigrScan is trained without
N's it will not be able to make predictions when the input sequence
contains N's. Edit the intergenic0-100.fasta file and insert a
long series of N's midway in one of the sequences so that TigrScan can
collect statistics for the N's. Also add the sequence NANTNCNGN
so that null dinucleotide frequencies involving N's can be
collected. Then re-train the intergenic model as described in the
TigrScan documentation.
6. Software Architecture
The software architecture of TWAIN is necessarily complex, due to the
efficiency requirements placed on the implementation of its GPHMM
subsystem. The basic GHMM gene finder underlying TWAIN is
described in the TigrScan documentation at:
Additional details pertaining specifically to ROSE and OASIS will be
added here soon.