ELPH User Manual

ELPH is a general-purpose Gibbs sampler for finding motifs in a set of DNA or protein sequences. The program takes as input a set containing anywhere from a few dozen to thousands of sequences, and searches through them for the most common motif, assuming that each sequence contains one copy of the motif. We have used ELPH to find patterns such as ribosome binding sites (RBSs) and exon splicing enhancers (ESEs).

II. Running ELPH

There are two different ways to run the program: as a motif finder, or as a tool to measure if there is any significant difference between the appearances of a motif in two different files.

II.1 Running the program as a motif finder

In this case the input of ELPH is a file in multi FASTA format containing the sequences:
elph <multi-fasta_file> [options]
Common usage:
elph DNAseqs.fasta LEN=5

II.1.1 Algorithm overview

The common usage of the program is to use as motif finder a Gibbs site sampler. The programs begins by randomly selecting one motif element in each sequence. After this initially setup the program iteratively runs through the following two steps:

- predictive update step: one sequence from the input file is selected, beginning with the first sequence and proceeding to the last sequence. The current motif element from the sequence is added too the background and the motif matrix is updated accordingly.

- sampling step: each possible starting position for a motif in the given sequence is assigned a probability of being a motif starting at that position; after that, a motif element is assigned to the sequence by performing a weighted sample from all the possible positions.

These steps are repeated until a local maximum is reached or a fixed maximum number of iterations are made. The Gibbs sampler is restarted several times with a different seed in order to avoid trappings into a local maximum. Once the motif alignment is found, the posteriori value of the alignment is computed. An optimizing procedure is run to maximize this posteriori value returning the MAP (maximum a posteriori probability) of the motif.

II.1.2 Command line options

Several command line options can be specified:

LEN=n : n is the length of the searched motif; if the length of the motif is not given, the program will ask for it from stdin.
ITERNO=n : n represents the maximum number of times that the Gibbs sampler is restarted in order to avoid trapping into the local maximum; default=10.
MAXLOOP=n : n represents the maximum number of iterations used by the program to compute the local maximum; default = 500.
SGFNO=n : n is the number of iterations to compute significance of motif (see also the -g option); default = 1000.
-h : prints a help with the options the program.
-o <out_file> : write output in <out_file> instead of stdout
-a : by default the multiFASTA file is considered to contain DNA sequences; if this option is specified the input file would be considered to contain amino acid sequences (this option has not been tested yet!).
-s <seed> : sets the seed for the random generation.
-p n : n represents the number of iterations before deciding that the local maximum has been reached; default=20.
-b : if this options is specified then only matrix frequencies for the background and the motif are printed; i.e. the positions of each motif element within the sequence are not shown.
-x : normally the output of the program shows the motif elements that contributed to the computation of the motif matrix; if the -x option is used the output will also show for each sequence those positions which give the maximal score by using the computed motif matrix (can be different from the motif elements' positions).
-m <motif> : use the given pattern <motif> to compute its best fit matrix to the data.
-g : if this option is specified then a significance of the motif found is computed by comparing the appearances of the motif elements within the input file to the appearances of the motif within a randomly generated file containing sequences of the same lengths as in the input file and with the same residue distribution. The randomly generated file is paired to the input file. Given the motif matrix, motif sampling is performed a number of times (specified by SGFNO), and a probability of occurrence of each motif element is computed in the two paired samples. Two significance tests are used: the Wilcoxon pair test (most reliable) and the student test.
-d : this option regards the way the significance of the motif is computed; when -v is specified, the probability of occurrence of each motif element is estimated from the motif matrix, so no there isn't necessary to run the Gibbs sampler SGFNO times; this option should accompany the -g option.
-v : if this option is given then the Gibbs sampler is not used anymore, and the motif is computed in a deterministic way which maximizes the MAP (faster).
-e : only when an additional file is used to test the significance of the motif: find only the motifs that exactly match the input pattern (-m or -t options)
-n [0..5] : degree of Markov chain used to generate the random file used to test the significance of the motif default = 2
-l : if the -m option is specified too, computes the Least Likely Consensus (LLC) score for the given motif; this score measure the information content of the motif combined in respect to its background rareness.

II.2 Running the program to compare the significant difference of a motif appearances in two different files

The input to ELPH in this case consists of two files in multi FASTA format:

elph <multi-fasta_file-1> <multi-fasta_file-2> [options]

The program computes a motif in <multi-fasta_file-1 > and then estimates if the motif is significantly more represented in <multi-fasta_file-1> compared to <multi-fasta_file-2>. All the options for computing the motif can be specified. There is an additional option which can only be given to this way of running the program:

III. Obtaining ELPH

ELPH is available free of charge under the open-source Artistic License.

To download ELPH please click here.

Back to the CBCB Software Page