primer_match
Name
primer_match - Find and count primers in a DNA
sequence database
Synopsis
primer_match [ options ]
Description
primer_match finds and counts exact and near exact instances of short
DNA sequences, usually primers, in a (much) larger DNA sequence database such as the human genome.
By default, primer_match outputs a human-readable
alignment for each occurrence of a primer in the sequence database. With
the -c option, primer_match outputs the number of
occurrences of each primer. The format of the alignments and counts is fully configurable with the
-A and -C options.
primer_match runs fastest when the sequence database has been
pre-processed with compress_seq, but this is not necessary. If the
sequence database has not been pre-processed with compress_seq, the
sequence database must be in a regular FASTA format. Each line, except
for the last, of every sequence entry must hold the same number of sequence
characters. If the sequence database is not in a regular FASTA format, the
results may be incorrect. primer_match will warn the user if the sequence
database is not in a regular FASTA format.
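The regularity rule above can be checked before running a search. The following Python sketch is illustrative only and is not part of primer_match; the function name is hypothetical, and it assumes that the line width must be the same across all entries in the file, not just within each entry.

```python
def is_regular_fasta(lines):
    """Return True if every sequence line, except the last of each entry, has one width."""
    width = None    # line width shared by all non-final sequence lines seen so far
    pending = None  # length of the previous sequence line in the current entry
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new entry starts, so the previous line was a "last" line and is exempt.
            pending = None
            continue
        if pending is not None:
            # The previous line is now known not to be the last line of its entry.
            if width is None:
                width = pending
            elif pending != width:
                return False
        pending = len(line)
    return True
```

The function accepts any iterable of lines, so an open file object can be passed directly.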
Options
-i FASTA_sequence_database
Name of the sequence database to search. Required.
-p primers
White space (space, tab, new line) separated list of primer sequences to find in the sequence database. When this
option is used on the command line, the primers will usually need to be placed in quotes ("). One of
-p, -P, -F, or -S must be supplied.
-P primer_file
File containing a white space (space, tab, new line) separated list of primer
sequences to find in the sequence database. One of
-p, -P, -F, or -S must be supplied.
-F fasta_primer_file
FASTA file containing a list of primer sequences to find in the sequence
database. One of
-p, -P, -F, or -S must be supplied.
-S sts-format-file
UniSTS format file containing a list of primer sequences to find in the
sequence database. One of
-p, -P, -F, or -S must be supplied.
-o output_file
Output is redirected into the file output_file. If absent, output goes to standard out.
-k edit_distance
The maximum number of insertions, deletions, and substitutions
permitted in any primer alignment. If absent, edit distance 0 is
assumed. When searching amino-acid sequences with peptide sequences, a
codon edit-distance is available: substitutions score as the minimum
number of nucleotide substitutions to turn a codon of one amino-acid
into a codon of the other; insertions, deletions and non-amino-acid
substitutions are scored as 3 edits. The codon scoring model is
indicated by edit-distance values .1, .2, .3.
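To make the meaning of the edit-distance bound concrete, the sketch below computes the smallest number of insertions, deletions, and substitutions needed to align a primer against any substring of a text. It is a textbook dynamic-programming formulation written in Python for illustration; it is not the algorithm primer_match uses, and the function name is hypothetical. A primer would be reported at edit distance k if this value is at most k.

```python
def best_edit_distance(primer, text):
    """Smallest edit distance between primer and any substring of text."""
    # Column for the empty text prefix: i primer characters against nothing cost i edits.
    prev = list(range(len(primer) + 1))
    best = prev[-1]
    for ch in text:
        cur = [0]  # an alignment may start at any text position at no cost
        for i, p in enumerate(primer, start=1):
            cur.append(min(
                prev[i - 1] + (p != ch),  # match (0) or substitution (1)
                prev[i] + 1,              # text character with no primer partner
                cur[i - 1] + 1,           # primer character with no text partner
            ))
        best = min(best, cur[-1])  # an alignment may end at any text position
        prev = cur
    return best
```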
-K mismatches
The maximum number of mismatches permitted in any primer
alignment. When searching amino-acid sequences with peptide sequences, a
codon edit-distance is available: substitutions score as the minimum
number of nucleotide substitutions to turn a codon of one amino-acid
into a codon of the other; non-amino-acid substitutions are scored
as 3 edits. The codon scoring model is
indicated by mismatch values .1, .2, .3.
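The codon substitution score described above can be sketched as follows. This Python example is an illustration under stated assumptions, not the program's implementation: it uses the standard genetic code, treats the stop symbol and any symbol outside the twenty amino-acid letters as a non-amino-acid (score 3), and the names BASES, AMINO, CODONS, and codon_distance are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code with codons enumerated in TCAG order; '*' marks stop codons.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino-acid letter to the list of codons that encode it.
CODONS = {}
for codon, aa in zip(product(BASES, repeat=3), AMINO):
    if aa != "*":
        CODONS.setdefault(aa, []).append("".join(codon))

def codon_distance(aa1, aa2):
    """Minimum nucleotide substitutions turning a codon of aa1 into a codon of aa2."""
    if aa1 not in CODONS or aa2 not in CODONS:
        return 3  # assumption: non-amino-acid symbols always score 3 edits
    return min(
        sum(a != b for a, b in zip(c1, c2))
        for c1 in CODONS[aa1]
        for c2 in CODONS[aa2]
    )
```

For example, F (TTT, TTC) and L (TTA, TTG, CTT, ...) differ by a single nucleotide, while M (ATG) and W (TGG) differ by two.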
-r
Search for the reverse complements of the primers too.
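For reference, the reverse complement of a primer can be computed as in the Python sketch below. The helper name is hypothetical; the complement table uses the standard IUPAC complements, and the sketch assumes uppercase input.

```python
# Standard complements for the four bases and the IUPAC ambiguity codes.
COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def reverse_complement(primer):
    """Complement every symbol, then reverse the result."""
    return primer.translate(COMPLEMENT)[::-1]
```

Applied to the primer TTACGGGCAGCTCA, this gives TGAGCTGCCCGTAA, which is the form listed on the "R" line of the counts example under Output Format.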
-x l
Length of the exact seed (word size, à la BLAST) required in any primer
alignment. Can be combined with other options.
-s ( l | ~l )
Constrain the first l primer characters to match exactly; any insertions, deletions or substitutions must occur after position
l. The reverse complement of a primer must also have its first l characters
match exactly. Note that a wildcard match is considered an exact match. With
the ~ modifier, the first l primer characters are constrained to
match inexactly; the remaining characters must match exactly.
-e ( l | ~l )
Constrain the last l primer characters to match exactly; any insertions, deletions or substitutions must occur before position
l. The reverse complement of a primer must also have its last l characters
match exactly. Note that a wildcard match is considered an exact
match. With the ~ modifier, the last l primer characters are
constrained to match inexactly; the remaining characters must match exactly.
-5 ( l | ~l )
Constrain the l primer characters at the 5' end of the primer to match
exactly; any insertions, deletions or substitutions must occur after position l
from the 5' end of the primer. The reverse complement of a primer must also
have the l characters at its 5' end match exactly. Note that a wildcard
match is considered an exact match. With the ~ modifier, the l primer
characters at the 5' end of the primer are constrained to match inexactly; the
remaining characters must match exactly.
-3 ( l | ~l )
Constrain the l primer characters at the 3' end of a primer to match
exactly; any insertions, deletions or substitutions must occur after position l
from the 3' end of the primer. The reverse complement of a primer must also have the l
characters at its 3' end match exactly. Note that a wildcard match is
considered an exact match. With the ~ modifier, the l primer characters
at the 3' end of the primer are constrained to match inexactly; the remaining
characters must match exactly.
-w
Respect IUPAC ambiguity codes as wildcards, in both the sequence database and
the primers. A symbol from the sequence database is considered a wildcard match
to a primer symbol if either set of represented DNA symbols contains the other.
The only exception is that an N in the sequence database does not match any
primer symbol. Note: this is almost certainly what you want, as long stretches
of Ns are often used to indicate gaps in assembled sequence.
-W
Respect IUPAC ambiguity codes as wildcards, in both the sequence database and
the primers. A symbol from the sequence database is considered a wildcard match
to a primer symbol if either set of represented DNA symbols contains the other.
Unlike -w, Ns in the sequence database are also respected as wildcards.
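The wildcard rule shared by -w and -W can be sketched in Python as below. This is an illustration of the rule as described here, not code from primer_match; the names IUPAC, wildcard_match, and respect_db_n are hypothetical. The respect_db_n flag stands in for the difference between the two options: False models -w, True models -W.

```python
# DNA symbols represented by each IUPAC code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def wildcard_match(db_symbol, primer_symbol, respect_db_n=False):
    """True if either symbol's set of represented bases contains the other's."""
    if db_symbol == "N" and not respect_db_n:
        return False  # with -w, an N in the database matches nothing
    db_set = set(IUPAC[db_symbol])
    primer_set = set(IUPAC[primer_symbol])
    return db_set <= primer_set or primer_set <= db_set
```

For example, a database A matches a primer R (A or G), but a database R does not match a primer Y (C or T), since neither set contains the other.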
-u
Force all primers to uppercase characters.
-M max
Stop counting primer occurrences once a primer has been seen max
times.
-A format
Output format for primer alignments. See Output Format below. If present, alignments will be output.
-0
Output start and end positions using the 0-based convention. Default: space-based.
-1
Output start and end positions using the 1-based convention. Default: space-based.
-C format
Output format for primer counts. See Output Format below. If present,
counts will be output.
-R report_interval
Usually, primer_match accumulates many matches before taking the time
to output alignments. This reduces the running time tremendously. However, if you are debugging or want reassurance that
primer_match is actually doing something, setting report_interval
to 1 will force primer_match to report alignments as they are found.
-E eos
Consider the sequence character with ASCII code eos to represent the
end of the sequence in a FASTA entry. This character can never be part of an alignment, unless it is explicitly included in a primer sequence.
By default, 12 (new line) is considered the end-of-sequence character. The end-of-sequence character is inserted by
compress_seq.
-D ( 0 | 1 | 2 | 3 | 4 )
Select the sequence database pre-processing strategy. The default, 0, will
choose the fastest strategy, based on the pre-processing done, or not done, by compress_seq.
1. Sequence database has not been pre-processed.
2. Sequence database has been indexed by compress_seq. This is the
default behavior of compress_seq.
3. Sequence database has been indexed and normalized by compress_seq,
using the option -n true.
4. Sequence database has been indexed, normalized and compressed by compress_seq,
using the option -z true.
Given the availability of pre-processed sequence database files, option 3 is
selected first, then option 4, then option 2, then option 1. This will typically
represent the fastest possible run time.
-I
Do not load the FASTA sequence database index. For some alignment
format elements, such as the absolute file position of an alignment
or for alignment counting, the index is not needed. If the alignment
format contains only those elements that do not need the index, then
-I ensures it is not loaded. This option is implicitly set
whenever counting (-c) only is selected.
-B
Use buffered standard I/O rather than mmap to stream through the sequence
database. On some platforms, where the use of mmap is somewhat unpredictable,
this option may make it possible to run primer_match reliably.
-N ( 0 | 1 | ... | 14 )
Select the primer search strategy. The default, 0, will
heuristically choose the fastest strategy, based on the sequence database pre-processing done, or not done, by compress_seq, on the characteristics of the primer set, and on the search constraints. It is not usually necessary to set this parameter explicitly.
1. Exact search, using keyword tree deterministic automata with
list-nodes. (Best for large alphabets.)
2. Exact search, using keyword tree deterministic automata with
DNA-optimized nodes. (Best for DNA sequence databases, but requires
preprocessing with compress_seq options -n true or -z true.)
3. Exact search, using keyword tree deterministic automata with
jump-table nodes. (Best for relatively few primers that have little
sequence similarity.)
4. Exact search, using bitvector and "shift_and"
algorithm. (Supports IUPAC wildcard match.)
5. Inexact search, using bitvector and approximate match
"shift_and" algorithm. (Supports IUPAC wildcard match and
arbitrarily large edit-distance.)
6. Inexact search using exact seed, using a collision-free
hash-table. (Potential loss of alignments, in rare cases. Suitable
for arbitrarily large edit-distance.)
7. Inexact search, using keyword tree deterministic automata with
list-nodes for the constrained exact sequence at beginning or end of
each primer. (Best for large alphabets.)
8. Inexact search, using keyword tree deterministic automata with
DNA-optimized nodes for the constrained exact sequence at beginning or end of
each primer. (Best for DNA sequence databases, but requires
preprocessing with compress_seq options -n true or -z true.)
9. Inexact search, using keyword tree deterministic automata with
jump-table nodes for the constrained exact sequence at beginning or end of
each primer. (Best for relatively few primers that have little
sequence similarity.)
10. Inexact search, using bitvector and "shift_and"
algorithm for the constrained exact sequence at beginning or end of
each primer. (Supports IUPAC wildcard match.)
11. Inexact search at edit-distance 1, using keyword tree deterministic automata with
list-nodes for each half of the primer. (Best for large alphabets.)
12. Inexact search at edit-distance 1, using keyword tree deterministic automata with
DNA-optimized nodes for each half of the primer. (Best for DNA sequence databases, but requires
preprocessing with compress_seq options -n true or -z true.)
13. Inexact search at edit-distance 1, using keyword tree deterministic automata with
jump-table nodes for each half of the primer. (Best for relatively few primers that have little
sequence similarity.)
14. Inexact search at edit-distance 1, using bitvector and "shift_and"
algorithm for each half of the primer. (Supports IUPAC wildcard match.)
If a DNA sequence database has been normalized by compress_seq,
then 2 is fastest for exact search, 12 for edit-distance 1 search, and
8 for edit-distance > 1 search with a start or end of primer exact
sequence constraint. For searches with edit-distance > 1 and no
constraints, option 5 is slow, but guarantees all alignments are
found, while option 6 is fast, but cannot guarantee all alignments
(à la BLAST). If IUPAC wildcards are required, then either option 4 or
5 must be used. Since these are (relatively) slow, their use for large
sequence databases or many primers is not recommended.
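As background for the bitvector strategies, the following Python sketch shows the classic exact "shift_and" algorithm for a single pattern. It is a generic textbook version written for illustration, not the primer_match source, and the function name is hypothetical. Bit i of the state is set when the first i+1 pattern characters match the text ending at the current position.

```python
def shift_and_search(pattern, text):
    """Return the 0-based start positions of exact occurrences of pattern in text."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit i set when pattern[i] == c.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit that signals a full-length match
    state = 0
    hits = []
    for pos, ch in enumerate(text):
        # Extend every partial match by one character and start a new one at bit 0.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            hits.append(pos - m + 1)
    return hits
```

Wildcard support fits this scheme naturally: the mask for a text symbol would set a bit for every pattern position whose symbol wildcard-matches it, which is consistent with the note that the "shift_and" strategies support IUPAC wildcards.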
-v
Verbose (version & diagnostic) output.
-h
Command-line help.
Output Format
The default alignment output format is
>defline
sequence start end edits
alignment
primer index rc?
where defline is the FASTA header line of the sequence entry containing the
alignment; sequence is the aligned sequence from the sequence database; start and
end are the space based start and end positions of the aligned sequence in the sequence entry;
edits is the number of insertions, deletions, and substitutions in the alignment;
alignment is a series of alignment characters indicating match, insertion, deletion or substitution at
each position of the alignment; primer is the aligned primer sequence; index is the index of this primer in the primer input set; and
rc? is "REVCOMP" if the primer matched in its reverse complement form.
For example
>CCO_UID:219000002141424:BAC_UID:human_12212001_reproc:LEN:33337
AGATCGCAGGTACATAAATGCTTCT 20115 20140 0
|||||||||||||||||||||||||
AGATCGCAGGTACATAAATGCTTCT 3242
>CCO_UID:219000002142926:BAC_UID:human_12212001_reproc:LEN:2262
CCCATTCAGTCTTTCTTTTAAAAACATTTATTTTTAATTCAT 1671 1713 0
||||||||||||||||||||||||||||||||||||||||||
CCCATTCAGTCTTTCTTTTAAAAACATTTATTTTTAATTCAT 4781 REVCOMP
and
>gi|683734|gb|U20581.1|MFU20581 Macaca fascicularis endothelin 3 mRNA
CAGCCAGATCTGAG 44 58 1
|||*||||||||||
CAGTCAGATCTGAG 3
>gi|9967394|dbj|AB047965.1| Macaca fascicularis brain cDNA
CTCAGATCTGA-TG 1569 1582 1
|||||||||||v||
CTCAGATCTGACTG 3 REVCOMP
and
>gi|21320903|dbj|AB059653.1| Macaca fascicularis PGDH1 mRNA
TGGATAATTTTT 2338 2350 1
+++|^|+||||+
WRRA-AWTTTTW 13
>gi|21320905|dbj|AB059654.1| Macaca fascicularis PGDH2 mRNA
ACCGAGGAGGA 502 513 1
||*||+|||||
ACAGAKGAGGA 11
>gi|21320905|dbj|AB059654.1| Macaca fascicularis PGDH2 mRNA
AGCTG-GTGGG 512 522 1
|||||v|||||
AGCTGYGTGGG 18
>gi|7593035|dbj|AB041420.1| Gorilla gorilla gene for alpha-1
CGCCRGCACGAGTT 596 610 1
||||+|||^|||||
CGCCAGCA-GAGTT 2
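The alignment characters in the examples above can be reproduced from the two gapped rows. The Python sketch below is inferred from those examples rather than taken from the program: "|" marks an identical symbol, "+" a wildcard match, "*" a substitution, "^" a gap in the primer row, and "v" a gap in the sequence row. The function name is hypothetical, and the sketch ignores the special treatment of N under -w.

```python
# DNA symbols represented by each IUPAC code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def alignment_string(db_row, primer_row):
    """Build the alignment line for two gapped rows of equal length."""
    chars = []
    for d, p in zip(db_row, primer_row):
        if p == "-":
            chars.append("^")  # sequence symbol with no primer partner
        elif d == "-":
            chars.append("v")  # primer symbol with no sequence partner
        elif d == p:
            chars.append("|")  # assumption: identical symbols count as plain matches
        elif set(IUPAC[d]) <= set(IUPAC[p]) or set(IUPAC[p]) <= set(IUPAC[d]):
            chars.append("+")  # wildcard match
        else:
            chars.append("*")  # substitution
    return "".join(chars)
```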
The default counts output format is
index rc? primer count ( 0-count 1-count ... )
where index is the index of the primer; rc? is "R" for the reverse
complement of the primer and "F" otherwise; primer is the sequence of
the primer if rc? is "F" and the sequence of the primer's reverse
complement if rc? is "R"; count is the number of occurrences of primer in the
sequence database; and k-count is the number of occurrences of primer in the
sequence database with k insertions, deletions, or substitutions.
For example
1 F TTACGGGCAGCTCA 9 ( 6 3 )
1 R TGAGCTGCCCGTAA 0 ( 0 0 )
2 F CCTTGCCAGTCAGATC 23 ( 8 15 )
2 R GATCTGACTGGCAAGG 0 ( 0 0 )
3 F CAGTCAGATCTGAG 15 ( 2 13 )
3 R CTCAGATCTGACTG 6 ( 0 6 )
The command line options -A and -C give the user explicit control over the
output of alignments and counts respectively. Each format string contains conversion characters, which specify pieces of the alignment or count
output.
Alignment format conversion characters:
%h    FASTA header (defline) of the sequence entry containing the alignment.
%H    First "word" of the FASTA header (defline) of the sequence entry containing the alignment. The first word is everything up to (but not including) the first whitespace character of the defline.
%f    Index of the FASTA entry containing the alignment.
%s    Start position of the alignment within the FASTA entry (space based).
%e    End position of the alignment in the FASTA entry (space based).
%l    Length of the alignment.
%5    Position of the 5' end of the alignment in the sequence entry (space based).
%3    Position of the 3' end of the alignment in the sequence entry (space based).
%S    Start position (absolute) of the alignment in the sequence database.
%E    End position (absolute) of the alignment in the sequence database.
%i    Index of the aligned primer.
%d    Edit distance (number of insertions, deletions, substitutions) of the alignment.
%D    Length difference between primer and aligned sequence.
%M    Mass (based on amino-acid residual mass) difference between peptide and aligned sequence.
%p    The (forward) sequence of the primer, whether it was found in its forward or reverse complement form.
%P    The FASTA header (defline) of the primer, if the primers came from a FASTA format file. Otherwise, "".
%I    STS id (first column) of primer entry, if the primers came from a UniSTS format file. Otherwise, "".
%L    STS length for primer entry, if the primers came from a UniSTS format file. Otherwise, "".
%a    STS accession for primer entry, if the primers came from a UniSTS format file. Otherwise, "".
%O    STS organism for primer entry, if the primers came from a UniSTS format file. Otherwise, "".
%&    Alternative STS accessions for primer entry, if the primers came from a UniSTS format file. Otherwise, "".
%X    STS chromosome for primer entry, if the primers came from a UniSTS format file. Otherwise, "".
%q    The primer sequence of the alignment.
%Q    The primer sequence of the alignment, with alignment characters to indicate an insertion.
%t    The aligned sequence from the sequence database.
%T    The aligned sequence from the sequence database, with alignment characters to indicate deletion.
%A    The string of alignment characters indicating exact match, insertion, deletion and substitution at each position of the alignment.
%|    Number of matches in the alignment.
%+    Number of wildcard matches in the alignment.
%*    Number of substitutions in the alignment.
%^    Number of insertions in the alignment.
%v    Number of deletions in the alignment.
%r    "F" if the forward form of the primer was found, "R" if the reverse complement form of the primer was found.
%R    " REVCOMP" if the reverse complement form of the primer was found, "" otherwise.
%=    Multi-line alignment output (includes primer, sequence, and alignment characters).
%%    Percent (%).
The default alignment format is ">%h\n %T %s %e %d\n %A\n %Q %i%R\n".
Count format conversion characters:
%i    The primer index.
%p    The (forward form of the) sequence of the primer.
%P    The FASTA header (defline) of the primer, if the primers came from a FASTA format file. Otherwise, "".
%q    The forward or reverse complement form of the primer.
%r    "F" for the forward form of the primer, "R" for the reverse complement form of the primer.
%R    " REVCOMP" for the reverse complement form of the primer, "" otherwise.
%c    Count for primer or reverse complement.
%C    Space separated list of counts for edit distance 0, 1, etc.
%+    Plus (+) if the count for this primer exceeded the maximum count threshold.
%%    Percent (%).
The default count format is "%i %q %c%+ ( %C )\n".
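To illustrate how a format string maps conversion characters to output, the Python sketch below expands a count format against a record of values. It is a simplified model and not the program's implementation: the function name and the record dictionary are hypothetical, unknown conversions expand to the empty string, and the format is passed with a real newline character because the handling of "\n" escapes on the command line is not described here.

```python
def expand_count_format(fmt, record):
    """Expand %-conversions in fmt using values from record, keyed by conversion character."""
    out = []
    i = 0
    while i < len(fmt):
        ch = fmt[i]
        if ch == "%" and i + 1 < len(fmt):
            key = fmt[i + 1]
            # "%%" yields a literal percent sign; other keys are looked up in the record.
            out.append("%" if key == "%" else str(record.get(key, "")))
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

# Hypothetical record for primer 1 from the counts example above.
record = {"i": 1, "r": "F", "q": "TTACGGGCAGCTCA", "c": 9, "+": "", "C": "6 3"}
line = expand_count_format("%i %r %q %c%+ ( %C )\n", record)
```

With this record, the format "%i %r %q %c%+ ( %C )\n" produces the line "1 F TTACGGGCAGCTCA 9 ( 6 3 )".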
See Also
pcr_match, compress_seq
Author
Nathan Edwards