Documentation



NOTE:
This document is currently undergoing revision as we attempt to improve the software and to automate some of the steps listed in this manual.  Please check back for updates to get the most recent version.  Thank you for your patience!




Contents
1. Introduction
2. Installation
3. Training the Gene Finder
4. Using the Gene Finder
5. Troubleshooting
6. Software Architecture


1. Introduction

TWAIN is a new comparative gene finder based on the idea of a Generalized Pair Hidden Markov Model (GPHMM).  This class of algorithms models the DNA sequences of a pair of related organisms, and the genes occurring in that DNA, as having been generated by a state-based stochastic model.  Each state in the model is conceptualized as having the capacity to generate pairs of features, one feature per genome.  In this way, a state exists for generating introns, another state exists for generating intergenic DNA, and several states exist for generating the different types of exons which may compose a gene.

In order for a GPHMM-based gene finder to predict genes accurately, the gene finder must be supplied with parameters which accurately describe the statistical properties of the average base compositions and typical gene structures for the specific pair of organisms to which the program is to be applied.  These parameters must be estimated from sample genes.  This process of parameter estimation is called training, and is a necessary step before the gene finder can be expected to produce reliable gene predictions.

This document describes how to download, compile, install, train, use, and troubleshoot the TWAIN system. Please bear in mind that TWAIN is a new research tool consisting of ~25,000 lines of C++ code that is currently undergoing many modifications and enhancements.  It was produced in a research environment and is not a commercial product.  While we think TWAIN will be very useful for the practical purpose of genome annotation, we remind prospective users that it has not undergone extensive testing for user-friendliness and portability.  We hope over the next several months to improve the usability of the system so that novice users can install and use it in more of a "turn-key" fashion on their particular computer systems.  Until then, we make TWAIN available to those advanced users who are willing to invest the required effort in deploying the system at their site.  The documentation given below will assist in this undertaking.  Reports of omissions and errors should be directed to the authors (an email address is provided at the end of this document) so that they can be corrected.  Please check back often for updates to this document.  This document resides at:



2. Installation

2.1 Downloading and unpacking the programs

TWAIN consists of a number of programs, including MUMmer, ROSE, OASIS, and TigrScan.  Once these programs have been downloaded from the TWAIN web site at


you should gunzip and untar each of the programs into a separate directory using the command:

tar -xvzf <program>.tar.gz

We recommend that the following directory structure be created before the programs are compiled. Some of these subdirectories will be created automatically when the tar files are unpacked.  Note that duplicate directory names indicate symbolic links which must be created.

twain
    oasis
        obj
        tigr++
        perl
        alignment
            obj
            tigr++
            matrices
        tigrscan
            obj
            tigr++
            perl
    rose
        perl
    mummer

To create a symbolic link in UNIX, use the command:

ln -s <directory> <linkname>

This will create a symbolic link in the current directory.  The link will appear as <linkname> and will point to <directory>.

Installation instructions for MUMmer are given in the INSTALL file in the MUMmer package. Additional information appears in the README file and under the docs directory in the MUMmer package.

2.2 Compiling the programs

Instructions for compiling MUMmer are given in the INSTALL file in the MUMmer package.  The other programs can be installed as follows:
cd twain/oasis/tigrscan
make tigrscan
make mdd
make train-signal-sensor
make train-content-sensor
make WMM-add-pseudocounts
cd ../alignment
make needleman
cd ..
<edit tigrscan/tigrscan.H and uncomment the line: #define EXPLICIT_GRAPHS>
make oasis

The second-to-last step involves editing the tigrscan.H file and uncommenting the line which defines the symbol EXPLICIT_GRAPHS.  This is necessary because the standard TigrScan distribution does not use explicit graphs, whereas the OASIS program does.  Note that the stand-alone version of TigrScan should never be compiled with EXPLICIT_GRAPHS.

Note that the C++ programs included in the TWAIN package were developed under Linux and were found to compile without errors under the GNU C++ compiler (g++) version 3.3.3.  Changes to the C++ standard often cause source code which previously compiled to produce compile errors when those changes are incorporated into the compiler.  If your version of g++ is not 3.3.3, compile errors may occur, necessitating changes to the source code.  These changes are generally of a superficial nature, and you may wish to make them yourself, or you may ask the developers of TWAIN to make the changes to support your system.  Be aware, however, that the developers of TWAIN have other responsibilities and may not be able to attend to your request immediately.

Although the Perl programs included in the package do not need to be compiled, they must be made executable before they can be run.  This can be done in UNIX using the command:

chmod +x *.pl

This command should be executed in the tigrscan, perl, oasis, and rose directories.

2.3 Installing the programs

Several environment variables need to be set before the program can be run.  If you use the csh or tcsh shells you can do this by using the command:

setenv <variable> <value>

If you use a different shell, please consult the appropriate documentation, or ask your system administrator. The following variables must be set to their respective values:

TWAIN = full path to top-level twain directory
OASIS = full path to oasis subdirectory
TIGRSCAN = full path to tigrscan directory
PERLLIB = full path to perl subdirectory, where the *.pm files are stored
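
Before running the pipeline it can be useful to confirm that all four variables are visible to the process. The following is a minimal Python sketch of such a check; the function name missing_variables is hypothetical and is not part of the TWAIN distribution.

```python
import os

# The four variables listed above.
REQUIRED_VARIABLES = ("TWAIN", "OASIS", "TIGRSCAN", "PERLLIB")


def missing_variables(environ=None):
    """Return the required variables that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


missing = missing_variables()
if missing:
    print("Please set: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

The check only tests that each variable is non-empty; it does not verify that the paths exist or point to the intended directories.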
 

3. Training the Gene Finder

There are two main components of TWAIN which need to be trained/configured before the gene finder can be utilized: (1) the GHMM gene finder TigrScan, and (2) the GPHMM gene finder OASIS.  The training of TigrScan is described in the documentation for TigrScan, which can be found either in the doc subdirectory of the TigrScan package, or online at:


Once TigrScan has been trained, OASIS can be configured by adding entries to the TigrScan configuration files to specify the alignment parameters and other settings which affect the operation of OASIS.  The configuration file is the one which has a filename ending in .cfg.

The following lines should be appended to the configuration file.  Suggested default values are given for each variable shown below.  The optimal values for a given organism can be found by exploring the parameter space and evaluating the performance of the gene finder at each point in that space.
prob-nuc-match = 0.64
prob-amino-match = 0.58
use-signal-thresholds = true
signal-threshold-multiplier = 2
queue-capacity = 8
exon-optimism = 2
intron-optimism = 0
min-promer-identity = 40
min-promer-similarity = 45
max-promer-stop-codons = 25
off-match-portion = 0.05
signal-tolerance = 40
match-edge-tolerance = 200
nonexact-signal-tolerance = 0
reduction-level = 4
nucmer-maxgap = 30
nucmer-minmatch = 10
nucmer-mincluster = 30
needleman-matrix = NUC.4.4
needleman-gap-penalty = 0
amino-substitution-matrix = /......./alignment/matrices/blosum50
nuc-substitution-matrix = /......./alignment/matrices/NUC.4.4
The prob-nuc-match and prob-amino-match settings determine, for nucleotide alignments and amino acid alignments, respectively, the expected degree of conservation at the level of individual residues in an alignment. They are proportions, each expressed as a number between 0 and 1.

The use-signal-thresholds parameter must be true or false.  If it is true, then any putative signal (such as a donor, acceptor, or start or stop codon) which does not score above a given threshold will be ignored and cannot be part of the final gene prediction. The threshold which is applied is selected during the training of TigrScan (see the TigrScan training manual for details).  If use-signal-thresholds is set to false, then all putative signals of the correct consensus will be considered as possible signals in the final gene prediction.  This generally increases the amount of memory and time required by OASIS.  The signal-threshold-multiplier allows the signal thresholds to be modified from within the configuration file (i.e., without retraining TigrScan).  Each signal threshold is multiplied by the value given for this parameter.  Since the thresholds are negative (being in log space), a multiplier greater than 1 will lower the threshold and allow more signals to be found.
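
The effect of the multiplier on a negative, log-space threshold can be illustrated with a short Python sketch. The threshold and score values, and the function names, are hypothetical and chosen only for illustration; they are not taken from TigrScan.

```python
def scaled_threshold(threshold, multiplier):
    """Return the signal threshold after applying signal-threshold-multiplier."""
    return threshold * multiplier


def passes(signal_score, threshold, multiplier=1.0):
    """A putative signal is kept only if it scores above the scaled threshold."""
    return signal_score > scaled_threshold(threshold, multiplier)


# Hypothetical log-space values (negative, as described above).
threshold = -5.0
score = -8.0

print(passes(score, threshold))                # False: -8 is not above -5
print(passes(score, threshold, multiplier=2))  # True: -8 is above -10
```

With a multiplier of 2 the threshold moves from -5 to -10, so a signal scoring -8 that was previously ignored is now retained.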

The queue-capacity parameter dictates the sizes of the signal queues used for noncoding features during gene prediction.  When TigrScan analyzes the sequence, it identifies putative signals by applying a threshold, as described above.  When a putative signal is found, it is linked back to all appropriate predecessor signals in the corresponding signal queues, and then the new signal is itself added to an appropriate queue.  These queues are kept sorted according to signal scores, so that the worst signal currently in a queue is always at the bottom of the queue. For noncoding features, the queues have a fixed size, so that when a high-scoring signal enters the queue, the lowest-scoring signal currently in the queue is dropped from the queue to make room for the new signal. Once a signal is dropped from the queue, it cannot be linked to any other signals found by TigrScan after this point in the sequence, although the signal may still participate in the final gene prediction if it was previously linked to other signals. For coding features, the queues used in TigrScan have no upper limit on their size, so that putative coding features are never ignored simply because a queue has become full. Thus, the queue-capacity parameter applies only to noncoding features.
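
The behaviour of a fixed-capacity queue of this kind can be sketched in Python as follows. This is an illustration of the idea only, not TigrScan's C++ implementation: the class name, the use of a heap, and the rule that a newcomer scoring below everything in a full queue is itself dropped are all assumptions.

```python
import heapq


class BoundedSignalQueue:
    """Fixed-capacity queue that retains only the highest-scoring signals."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []  # min-heap: the lowest-scoring signal sits at index 0

    def add(self, score, name):
        """Add a signal; return the signal dropped to make room, or None."""
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, (score, name))
            return None
        # Queue is full: insert the new signal, then remove the lowest scorer.
        # Assumption: if the newcomer itself scores lowest, it is the one dropped.
        return heapq.heappushpop(self._heap, (score, name))

    def signals(self):
        """Return the retained signals, best score first."""
        return sorted(self._heap, reverse=True)


queue = BoundedSignalQueue(capacity=3)
dropped = [queue.add(score, name)
           for score, name in [(-4.0, "a"), (-9.0, "b"), (-2.0, "c"), (-1.0, "d")]]
print(dropped[-1])      # (-9.0, 'b') is dropped when the fourth signal arrives
print(queue.signals())  # [(-1.0, 'd'), (-2.0, 'c'), (-4.0, 'a')]
```

A larger capacity keeps more low-scoring signals available for linking, at the cost of more memory and time.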

The exon-optimism and intron-optimism parameters are log values which are added to the transition scores between states in TigrScan's underlying GHMM.  Their effect is to increase or decrease the estimated probability of making a transition across an exon or an intron.  A positive value for exon-optimism will tend to cause the gene finder to predict more exons than it otherwise would; and similarly for intron-optimism and the prediction of introns.
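
Because these values are added in log space, they act as multiplicative factors on the underlying transition probability. The Python sketch below illustrates this under the assumption that the scores are natural logarithms; the base of the logarithm is not stated here, and the transition probability of 0.1 is hypothetical.

```python
import math


def adjusted_transition(log_prob, optimism):
    """Add an optimism value to a log-space transition score."""
    return log_prob + optimism


log_p = math.log(0.1)  # hypothetical transition probability of 0.1
boosted = adjusted_transition(log_p, 2)

# Adding 2 in natural-log space multiplies the probability by e**2.
factor = math.exp(boosted - log_p)
print(round(factor, 3))  # 7.389
```

An optimism of 0, as suggested for intron-optimism, leaves the transition score unchanged.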

The min-promer-identity, min-promer-similarity, and max-promer-stop-codons parameters are used to filter the HSPs (high-scoring segment pairs) which are produced by the PROmer program.  All three parameters are percentages, expressed as integers between 1 and 100.  Any PROmer HSP not satisfying all three of these thresholds is discarded by OASIS and cannot be used as evidence for a conserved exon.  For information on setting these parameters, see the MUMmer reference manual.  The default values given above have been found to work well in practice for a pair of fungal genomes.
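
The filtering rule can be sketched in Python as follows. The Hsp record and its field names are hypothetical stand-ins for the values PROmer reports, and treating the thresholds as inclusive bounds is an assumption rather than documented OASIS behaviour.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    """Hypothetical PROmer HSP summary; all three fields are percentages."""
    identity: float
    similarity: float
    stop_codons: float


def keep_hsp(hsp, min_identity=40, min_similarity=45, max_stop_codons=25):
    """Return True only if the HSP satisfies all three thresholds."""
    return (hsp.identity >= min_identity
            and hsp.similarity >= min_similarity
            and hsp.stop_codons <= max_stop_codons)


hsps = [Hsp(62, 70, 3), Hsp(38, 50, 0), Hsp(55, 60, 30)]
kept = [h for h in hsps if keep_hsp(h)]
print(len(kept))  # 1: the second fails on identity, the third on stop codons
```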

The off-match-portion, signal-tolerance, and match-edge-tolerance parameters influence the sparseness of the dynamic programming matrix which is used by OASIS to model the pairing of open reading frames between the two species.  Modifying these parameters should be done only after consulting the source code and ensuring that you have a full understanding of their impact.  Changes to the sparseness of the matrix can affect the accuracy of the gene finder as well as the computational requirements, such as memory usage and run-time.  Similarly, the nonexact-signal-tolerance parameter is provided only for research purposes, and should not be modified by the user.  It should be set to zero.  The reduction-level parameter indicates how aggressively OASIS should attempt to eliminate less promising regions of the dynamic programming matrix.  Legal values are 0 through 4, though we highly recommend that casual users leave this parameter set to 4.

The nucmer-maxgap, nucmer-minmatch, and nucmer-mincluster parameters specify the maxgap, minmatch, and mincluster parameters which are passed by ROSE to the NUCmer program, which is part of the MUMmer package.  For information on these parameters, please refer to the MUMmer documentation.

The needleman-matrix specifies the substitution matrix which is used by ROSE to align the gaps between NUCmer hits when it is computing a global guide alignment for use by OASIS. The needleman-gap-penalty specifies the penalty which is assessed by the Needleman-Wunsch algorithm for aligning a base to a gap.
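
For readers unfamiliar with the Needleman-Wunsch algorithm, the Python sketch below computes a global alignment score with a linear gap penalty. It is not the ROSE implementation: the fixed match and mismatch scores stand in for a substitution matrix such as NUC.4.4, whose actual values are not reproduced here, and subtracting the gap penalty from the score is an assumed sign convention.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap_penalty=0):
    """Return the optimal global alignment score of sequences a and b."""
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [-gap_penalty * j for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [-gap_penalty * i]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,            # a[i-1] aligned to b[j-1]
                            prev[j] - gap_penalty,        # a[i-1] aligned to a gap
                            curr[j - 1] - gap_penalty))   # b[j-1] aligned to a gap
        prev = curr
    return prev[-1]


print(needleman_wunsch_score("ACGT", "AGT"))                 # 3: the gap is free
print(needleman_wunsch_score("ACGT", "AGT", gap_penalty=2))  # 1: three matches, one gap
```

With a gap penalty of 0, as in the suggested defaults, gaps cost nothing and the score depends only on the aligned residues.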

The amino-substitution-matrix and nuc-substitution-matrix entries specify the substitution matrices which should be used by OASIS to compute the percent similarity and percent identity of amino acid and nucleotide approximate alignments, respectively.
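
The constraints described in this section can be checked mechanically before a run. The following Python sketch parses lines of the form "name = value" and reports entries that fall outside the documented ranges. It is an illustration only: the function names are hypothetical, and the parser assumes nothing about the .cfg syntax beyond the "name = value" lines shown above.

```python
def parse_config(text):
    """Parse 'name = value' lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip()] = value.strip()
    return settings


def check_oasis_settings(settings):
    """Return a list of problems found in the OASIS-related entries present."""
    problems = []
    for name in ("prob-nuc-match", "prob-amino-match"):
        if name in settings and not 0 <= float(settings[name]) <= 1:
            problems.append(name + " must lie between 0 and 1")
    for name in ("min-promer-identity", "min-promer-similarity",
                 "max-promer-stop-codons"):
        if name in settings and not 1 <= int(settings[name]) <= 100:
            problems.append(name + " must be an integer between 1 and 100")
    if settings.get("use-signal-thresholds", "true") not in ("true", "false"):
        problems.append("use-signal-thresholds must be true or false")
    if "reduction-level" in settings and not 0 <= int(settings["reduction-level"]) <= 4:
        problems.append("reduction-level must be 0 through 4")
    if settings.get("nonexact-signal-tolerance", "0") != "0":
        problems.append("nonexact-signal-tolerance should be 0")
    return problems


example = """
prob-nuc-match = 0.64
min-promer-identity = 40
reduction-level = 7
"""
print(check_oasis_settings(parse_config(example)))
# ['reduction-level must be 0 through 4']
```

Only entries that are present are checked; the sketch does not report parameters that are missing from the file.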


4. Using the Gene Finder

Several scripts are provided in the ROSE package for running the TWAIN pipeline.

The run-twain.pl script runs the entire pipeline from beginning to end, starting with ROSE and MUMmer and ending with OASIS.  Although this may be suitable in some circumstances, we have found it useful to run the individual processes separately, so that once ROSE extracts the putative syntenic regions, OASIS can be iteratively re-run on those regions using different parameterizations until a suitable one has been found.  Thus, until the user feels comfortable with the configuration of TWAIN, it is advisable that the pipeline be run using the runrose.pl and runoasis.pl scripts, which will invoke ROSE and OASIS separately.

All of these scripts will produce usage statements if they are invoked on the command line with no arguments.  The usage statements describe the required parameters to these programs in some detail.

The output from the gene finder is a pair of GFF files, one for each of the two species.  Each GFF file contains the set of gene predictions emitted by the GPHMM.  These predictions are given in the General Feature Format, which is a convenient, standardized format for specifying coordinates of features in genomic sequence.  Information on the syntax and meaning of GFF fields is given at:


It is important to note that the coordinates given in the GFF files are 1-based, which means that the very first base in a DNA sequence is presumed to have index 1, rather than 0.
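
Languages that index from 0, such as Python, therefore need a small conversion when extracting a predicted feature from the sequence. The sketch below assumes the standard GFF convention that start and end are both inclusive; the function name is hypothetical, and strand is not handled.

```python
def gff_feature_sequence(sequence, start, end):
    """Extract a feature given 1-based, inclusive GFF start/end coordinates."""
    if start < 1 or end < start or end > len(sequence):
        raise ValueError("coordinates outside the sequence")
    # 1-based inclusive [start, end] maps to the 0-based slice [start - 1, end).
    return sequence[start - 1:end]


contig = "ATGCCGTAA"
print(gff_feature_sequence(contig, 1, 3))  # ATG
print(gff_feature_sequence(contig, 7, 9))  # TAA
```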


5. Troubleshooting

The following table provides a list of troubleshooting hints that we have found useful.  This table will grow as we deploy the system to a greater array of computing environments, so check back often for updates.

Problem: The code does not compile.
Possible solutions:
  1. Try compiling the code with g++ version 3.3.3.
  2. Ensure that the correct directory structure has been created, as shown above in section 2.1.
  3. Make sure an obj subdirectory has been created for the object files.

Problem: The program complains that some parameter is not set in the configuration file.
Possible solution: Add the specified parameter to the configuration file.  The configuration file is a file having a name ending in .cfg, and is referenced in the .iso file.  You must modify the standard .iso file so that it contains the full path to the .cfg file on your system.  See the TigrScan documentation for more information.

Problem: PROmer runs out of memory during the execution of ROSE.
Possible solution: Try switching the order of the reference and query parameters.  PROmer builds a large data structure called a suffix tree, but it builds this for only one of the two genomes.  Reversing the reference and query will force PROmer to build the tree for the other genome, possibly requiring less space.

Problem: Perl complains that it cannot find a module.
Possible solution: You must set your PERLLIB or PERL5LIB environment variable to the perl subdirectory in the twain directory.  If you do not do this, Perl cannot find the needed modules.

Problem: The program does not produce any predictions when a contig contains N's.
Possible solution: If TigrScan is trained without N's it will not be able to make predictions when the input sequence contains N's.  Edit the intergenic0-100.fasta file and insert a long series of N's midway in one of the sequences so that TigrScan can collect statistics for the N's.  Also add the sequence NANTNCNGN so that null dinucleotide frequencies involving N's can be collected.  Then re-train the intergenic model as described in the TigrScan documentation.



6. Software Architecture

The software architecture of TWAIN is necessarily complex, due to the efficiency requirements placed on the implementation of its GPHMM subsystem.  The basic GHMM gene finder underlying TWAIN is described in the TigrScan documentation at:


Additional details pertaining specifically to ROSE and OASIS will be added here soon.



