pcr_match
Name
pcr_match - Find primer pairs in a DNA
sequence database
Synopsis
pcr_match [ options ]
Description
pcr_match finds pairs of short
DNA sequences, usually primers, in a (much) larger DNA sequence database such as the human genome.
By default, pcr_match outputs a human-readable
alignment for each occurrence of a primer in the sequence database. The format of the alignments is fully configurable with the
-A option.
pcr_match runs fastest when the sequence database has been
pre-processed with compress_seq, but this is not necessary. If the
sequence database has not been pre-processed with compress_seq, the
sequence database must be in a regular FASTA format. Each line, except
for the last, of every sequence entry must hold the same number of sequence
characters. If the sequence database is not in a regular FASTA format, the
results may be incorrect. pcr_match will warn the user if the sequence
database is not in a regular FASTA format.
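The regularity requirement can be checked before running a search. The following is a minimal sketch, not pcr_match's own check: the function name is_regular_fasta is hypothetical, and it assumes that a single line width applies to the whole file rather than one width per entry.

```python
def is_regular_fasta(lines):
    """Return True if every sequence line, except the last line of each
    entry, holds the same number of characters.

    Assumption: one width is shared by all entries in the file.
    """
    width = None    # width fixed by the first non-final sequence line
    pending = None  # length of the previous sequence line in this entry
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            # A new entry begins, so the previous line was a last line
            # and is allowed to be short.
            pending = None
            continue
        if pending is not None:
            # The previous line was followed by more sequence, so it
            # must be a full-width line.
            if width is None:
                width = pending
            elif pending != width:
                return False
        pending = len(line)
    return True
```

For a file on disk, the function can be called on an open file handle, which yields one line at a time.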
Options
-i FASTA_sequence_database
Name of the sequence database to search. Required.
-p primers
White space (space, tab, new line) separated list of primer sequences to find in the sequence database. When this
option is used on the command line, the primers will usually need to be placed in quotes (").
The primer pairs must be consecutive in the list of primers. One of
-p, -P, -F, or -S must be supplied.
-P primer_file
File containing a white space (space, tab, new line) separated list of primer
sequences to find in the sequence database. The primer pairs must be consecutive
in the list of primers. One of
-p, -P, -F, or -S must be supplied.
-F fasta_primer_file
FASTA file containing a list of primer sequences to find in the sequence
database. The primer pairs must be consecutive in the list of primers. One of
-p, -P, -F, or -S must be supplied.
-S sts-format-file
UniSTS format file containing a list of primer pairs to find in the sequence
database. One of
-p, -P, -F, or -S must be supplied.
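The consecutive pairing used by -p and -P can be illustrated with a short sketch. The function name primer_pairs is hypothetical, and raising an error on an odd number of primers is an assumption; this page does not say how pcr_match treats an unpaired primer.

```python
def primer_pairs(text):
    """Split a white space separated primer list into consecutive pairs.

    Primers 1 and 2 form the first pair, primers 3 and 4 the second,
    and so on.
    """
    primers = text.split()  # splits on spaces, tabs and new lines
    if len(primers) % 2 != 0:
        # Assumption: an unpaired trailing primer is treated as an error.
        raise ValueError("odd number of primers; the last one has no partner")
    return list(zip(primers[0::2], primers[1::2]))
```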
-o output_file
Output is redirected into the file output_file. If absent, output goes to standard out.
-k edit_distance
The maximum number of insertions, deletions, and substitutions
permitted in any primer alignment. If absent, edit distance 0 is
assumed.
-K mismatches
The maximum number of mismatches permitted in any primer
alignment.
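The difference between the -k and -K measures can be sketched as follows. This is the textbook dynamic programming computation of edit distance and a plain mismatch count, not pcr_match's search code; the function names are hypothetical.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def mismatches(a, b):
    """Number of substituted positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("mismatch counting needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))
```

For the left primer in the example under Output Format, edit_distance("CTGTAATCCCAGGACTTTGG", "CTTGTAATCCCAGAACTTTGG") is 2: one insertion or deletion and one substitution.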
-r
Reverse complement the second (reverse) primer of each pair. This option is
set automatically for UniSTS format primer pairs. Default: false.
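As an illustration of the reverse complement operation, here is a minimal sketch that also handles IUPAC ambiguity codes. The function name reverse_complement is hypothetical and the code is not taken from pcr_match.

```python
# Complement table for DNA symbols and IUPAC ambiguity codes.
_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVNacgtryswkmbdhvn",
                            "TGCAYRSWMKVHDBNtgcayrswmkvhdbn")


def reverse_complement(seq):
    """Complement every symbol, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```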
-a
Output all primer pair orientations. Default: false.
-x l
Length of exact seed or word size, à la BLAST, required by any primer
alignment. Can be combined with other options.
-s ( l | ~l )
Constrain the first l primer characters to match exactly; any insertions, deletions or substitutions must occur after position
l. The reverse complement of a primer must also have its first l characters
match exactly. Note that a wildcard match is considered an exact match. With
the ~ modifier, the first l primer characters are constrained to
match inexactly, the remaining characters must match exactly.
-e ( l | ~l )
Constrain the last l primer characters to match exactly; any insertions, deletions or substitutions must occur before position
l. The reverse complement of a primer must also have its last l characters
match exactly. Note that a wildcard match is considered an exact
match. With the ~ modifier, the last l primer characters are
constrained to match inexactly, the remaining characters must match exactly.
-5 ( l | ~l )
Constrain the l primer characters at the 5' end of the primer to match
exactly; any insertions, deletions or substitutions must occur after position l
from the 5' end of the primer. The reverse complement of a primer must also
have the l characters at its 5' end match exactly. Note that a wildcard
match is considered an exact match. With the ~ modifier, the l primer
characters at the 5' end of the primer are constrained to match inexactly, the
remaining characters must match exactly.
-3 ( l | ~l )
Constrain the l primer characters at the 3' end of a primer to match
exactly; any insertions, deletions or substitutions must occur after position l
from the 3' end of the primer. The reverse complement of a primer must also have the l
characters at its 3' end match exactly. Note that a wildcard match is
considered an exact match. With the ~ modifier, the l primer characters
at the 3' end of the primer are constrained to match inexactly, the remaining
characters must match exactly.
-w
Respect IUPAC ambiguity codes as wildcards, in both the sequence database and
the primers. A symbol from the sequence database is considered a wildcard match
to a primer symbol if either set of represented DNA symbols contains the other.
The only exception is that an N in the sequence database does not match any
primer symbol. Note: this is almost certainly what you want, as long stretches
of Ns are often used to indicate gaps in assembled sequence.
-W
Respect IUPAC ambiguity codes as wildcards, in both the sequence database and
the primers. A symbol from the sequence database is considered a wildcard match
to a primer symbol if either set of represented DNA symbols contains the other.
Also respects Ns in the sequence databases.
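The wildcard rule shared by -w and -W can be sketched as a set containment test. This is an illustration of the rule as stated above, not pcr_match's code; the function name and the respect_db_n parameter (False for -w, True for -W) are hypothetical.

```python
# DNA symbols represented by each IUPAC code.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def wildcard_match(db_symbol, primer_symbol, respect_db_n=False):
    """True if either symbol's set of DNA symbols contains the other's.

    With respect_db_n=False (the -w rule), an N in the sequence
    database matches nothing.
    """
    db_symbol = db_symbol.upper()
    primer_symbol = primer_symbol.upper()
    if db_symbol == "N" and not respect_db_n:
        return False
    db_set = set(IUPAC[db_symbol])
    primer_set = set(IUPAC[primer_symbol])
    return db_set <= primer_set or primer_set <= db_set
```

Under this rule R (A or G) matches A, but R does not match K (G or T), because neither set contains the other.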
-u
Force all primers to uppercase characters.
-m min-length
Minimum length, in bases, of the amplicon product of the primer pairs.
Default: 0.
-M max-length
Maximum length, in bases, of the amplicon product of the primer pairs.
Default: 2000.
-d deviation
Maximum deviation of the length, in bases, of the amplicon product of the
primer pairs from the length specified in the UniSTS format primer file. UniSTS
format primers required. Default: no constraint.
-b
Measure the length of the amplicon as the number of bases between the primers.
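The length options can be sketched together. In the example under Output Format the amplicon length (1714) equals the right primer end (59395) minus the left primer start (57681), so the default length appears to include both primers. The sketch below assumes space-based positions, assumes that -b counts the bases from the left primer end to the right primer start, and assumes inclusive limits; the function names are hypothetical.

```python
def amplicon_length(left_start, left_end, right_start, right_end,
                    between_primers=False):
    """Amplicon length from space-based primer alignment positions.

    between_primers=True models -b (assumed: bases strictly between
    the two primer alignments).
    """
    if between_primers:
        return right_start - left_end
    return right_end - left_start


def within_limits(length, min_length=0, max_length=2000,
                  sts_length=None, deviation=None):
    """Apply the -m, -M and -d constraints (limits assumed inclusive)."""
    if not (min_length <= length <= max_length):
        return False
    if sts_length is not None and deviation is not None:
        return abs(length - sts_length) <= deviation
    return True
```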
-A format
Output format for primer alignments. See Output Format below. If present, alignments will be output.
-0
Output start and end positions using the 0-based convention. Default: space-based.
-1
Output start and end positions using the 1-based convention. Default: space-based.
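The three position conventions can be related with a small sketch. Space-based positions count the gaps between bases, so end minus start gives the length directly. The conversions below assume that the 0-based and 1-based conventions both report an inclusive end position, which this page does not state; the function names are hypothetical.

```python
def span_length(start, end):
    """Length of a space-based interval."""
    return end - start


def space_to_zero_based(start, end):
    """Assumed 0-based convention: first base index, last base index."""
    return start, end - 1


def space_to_one_based(start, end):
    """Assumed 1-based convention: first base number, last base number."""
    return start + 1, end
```

With the positions from the example under Output Format, span_length(57681, 59395) is 1714, the reported amplicon length.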
-R report_interval
Usually, pcr_match accumulates many matches before taking the time
to output alignments. This reduces the running time tremendously. However, if you are debugging or want reassurance that
pcr_match is actually doing something, setting report_interval
to 1 will force pcr_match to report alignments as they are found.
-E eos
Consider the sequence character with ascii code eos to represent the
end of the sequence in a FASTA entry. This character can never be part of an alignment, except if explicitly included in a primer sequence.
By default, the new line character (ASCII 10 decimal, 12 octal) is considered the end of sequence character. The end of sequence character is inserted by
compress_seq.
-D ( 0 | 1 | 2 | 3 | 4 )
Select the sequence database pre-processing strategy. The default, 0, will
choose the fastest strategy, based on the pre-processing done, or not done, by compress_seq.
1. Sequence database has not been pre-processed.
2. Sequence database has been indexed by compress_seq. This is the
default behavior of compress_seq.
3. Sequence database has been indexed and normalized by compress_seq,
using the option -n true.
4. Sequence database has been indexed, normalized and compressed by compress_seq,
using the option -z true.
Given the availability of pre-processed sequence database files, option 3 is
selected first, then option 4, then option 2, then option 1. This will typically
represent the fastest possible run time.
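The preference order used when -D is 0 can be written down directly. This is a sketch of the stated order only; how pcr_match detects which pre-processed files exist is not described here, so the set of available strategies is passed in as a hypothetical argument.

```python
def choose_strategy(available):
    """Pick a -D strategy from those available, preferring 3, 4, 2, 1."""
    for strategy in (3, 4, 2, 1):
        if strategy in available:
            return strategy
    raise ValueError("no pre-processing strategy is available")
```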
-I
Do not load the FASTA sequence database index. For some alignment
format elements, such as the absolute file position of an alignment
or for alignment counting, the index is not needed. If the alignment
format contains only those elements that do not need the index, then
-I ensures it is not loaded. This option is implicitly set whenever
counting (-c) only is selected.
-B
Use buffered standard I/O rather than mmap to stream through the sequence
database. On some platforms, where the use of mmap is somewhat unpredictable,
this option may make it possible to run pcr_match reliably.
-N ( 0 | 1 | ... | 14 )
Select the primer search strategy. The default, 0, will
heuristically choose the fastest strategy, based on the sequence database pre-processing done, or not done, by compress_seq, on the characteristics of the primer set, and on the search constraints. It is not usually necessary to set this parameter explicitly.
1. Exact search, using keyword tree deterministic automata with
list-nodes. (Best for large alphabets.)
2. Exact search, using keyword tree deterministic automata with
DNA-optimized nodes. (Best for DNA sequence databases - but requires
preprocessing with compress_seq options -n true or -z true.)
3. Exact search, using keyword tree deterministic automata with
jump-table nodes. (Best for relatively few primers that have little
sequence similarity.)
4. Exact search, using bitvector and "shift_and"
algorithm. (Supports IUPAC wildcard match.)
5. Inexact search, using bitvector and approximate match
"shift_and" algorithm. (Supports IUPAC wildcard match and
arbitrarily large edit-distance.)
6. Inexact search using exact seed, using a collision free
hash-table. (Potential loss of alignments, in rare cases. Suitable
for arbitrarily large edit-distance.)
7. Inexact search, using keyword tree deterministic automata with
list-nodes for the constrained exact sequence at beginning or end of
each primer. (Best for large alphabets.)
8. Inexact search, using keyword tree deterministic automata with
DNA-optimized nodes for the constrained exact sequence at beginning or end of
each primer. (Best for DNA sequence databases - but requires
preprocessing with compress_seq options -n true or -z true.)
9. Inexact search, using keyword tree deterministic automata with
jump-table nodes for the constrained exact sequence at beginning or end of
each primer. (Best for relatively few primers that have little
sequence similarity.)
10. Inexact search, using bitvector and "shift_and"
algorithm for the constrained exact sequence at beginning or end of
each primer. (Supports IUPAC wildcard match.)
11. Inexact search at edit-distance 1, using keyword tree deterministic automata with
list-nodes for each half of the primer. (Best for large alphabets.)
12. Inexact search at edit-distance 1, using keyword tree deterministic automata with
DNA-optimized nodes for each half of the primer. (Best for DNA sequence databases - but requires
preprocessing with compress_seq options -n true or -z true.)
13. Inexact search at edit-distance 1, using keyword tree deterministic automata with
jump-table nodes for each half of the primer. (Best for relatively few primers that have little
sequence similarity.)
14. Inexact search at edit-distance 1, using bitvector and "shift_and"
algorithm for each half of the primer. (Supports IUPAC wildcard match.)
If a DNA sequence database has been normalized by compress_seq,
then 2 is fastest for exact search, 12 for edit-distance 1 search, and
8 for edit-distance > 1 search with a start or end of primer exact
sequence constraint. For searches with edit-distance > 1 and no
constraints, option 5 is slow, but guarantees all alignments are
found, while option 6 is fast, but cannot guarantee all alignments
(à la BLAST). If IUPAC wildcards are required, then either option 4 or
5 must be used. Since these are (relatively) slow, their use for large
sequence databases or many primers is not recommended.
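For readers unfamiliar with the "shift_and" algorithm named in several strategies, here is the textbook exact-match version for a single pattern. It is a generic sketch and not pcr_match's implementation, which handles many primers, wildcards and inexact matching; the function name is hypothetical.

```python
def shift_and_search(pattern, text):
    """Return the 0-based end index of every exact occurrence of
    pattern in text, using the bit-parallel shift-and algorithm."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must not be empty")
    # masks[ch] has bit i set when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit set when the whole pattern has matched
    state = 0              # bit i set: pattern[:i + 1] ends at this position
    ends = []
    for pos, ch in enumerate(text):
        # Extend every partial match by one symbol and start a new one.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            ends.append(pos)
    return ends
```

Wildcard matching fits this scheme naturally: the mask for a text symbol only needs to have bit i set for every pattern position that the symbol is allowed to match.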
-v
Verbose (version & diagnostic) output.
-h
Command-line help.
Output Format
An example of the default alignment output format:
>gi|21700565|gb|AC092408.3| Papio anubis clone RP41-446H8, complete sequence
CTTGTAATCCCAGAACTTTGG 57681 ... 1714 ... 59395 CCCCGTCTCTACTAAAAATA
||^||||||||||*||||||| |||||||||||||*||||||
CT-GTAATCCCAGGACTTTGG F R CCCCGTCTCTACTTAAAATA D11S3114 REVERSE-STRAND
The command line option -A gives the user explicit control over the
output of alignments. The format string contains conversion characters, which specify pieces of the alignment or count
output.
Alignment format conversion characters:
%h
FASTA header (defline) of the sequence entry containing the alignment.
%H
First "word" of the FASTA header (defline) of the sequence entry containing the alignment. The first word is everything up to (but not including) the first whitespace character of the defline.
%f
Index of the FASTA entry containing the alignment.
%>s
Start position of the "left" primer alignment within the FASTA entry (space based).
%<s
Start position of the "right" primer alignment within the FASTA entry (space based).
%>e
End position of the "left" primer alignment in the FASTA entry (space based).
%<e
End position of the "right" primer alignment in the FASTA entry (space based).
%>l
Length of the "left" primer alignment.
%<l
Length of the "right" primer alignment.
%l
Length of the amplicon.
%>5
Position of the 5' end of the "left" primer alignment in the sequence entry (space based).
%<5
Position of the 5' end of the "right" primer alignment in the sequence entry (space based).
%>3
Position of the 3' end of the "left" primer alignment in the sequence entry (space based).
%<3
Position of the 3' end of the "right" primer alignment in the sequence entry (space based).
%>S
Start position (absolute) of the "left" primer alignment in the sequence database.
%<S
Start position (absolute) of the "right" primer alignment in the sequence database.
%>E
End position (absolute) of the "left" primer alignment in the sequence database.
%<E
End position (absolute) of the "right" primer alignment in the sequence database.
%i
Index of the aligned primer pair.
%>d
Edit distance (number of insertions, deletions, substitutions) of the "left" primer alignment.
%<d
Edit distance (number of insertions, deletions, substitutions) of the "right" primer alignment.
%>p
The (forward) sequence of the "left" primer, whether it was found in its forward or reverse complement form.
%<p
The (forward) sequence of the "right" primer, whether it was found in its forward or reverse complement form.
%>P
The FASTA header (defline) of the "left" primer, if the primers came from a FASTA format file. Otherwise, "".
%<P
The FASTA header (defline) of the "right" primer, if the primers came from a FASTA format file. Otherwise, "".
%I
The STS identifier of the primer pair, if the primers came from a UniSTS format file. Otherwise, "".
%L
The STS length of the primer pair, if the primers came from a UniSTS format file. Otherwise, "".
%D
The absolute value of the difference between the length of the amplicon and the STS length of the primer pair, if the primers came from a UniSTS format file.
%a
The STS accession of the primer pair, if the primers came from a UniSTS format file. Otherwise, "".
%O
The STS organism of the primer pair, if the primers came from a UniSTS format file. Otherwise, "".
%&
The alternative STS accessions of the primer pair, if the primers came from a UniSTS format file. Otherwise, "".
%X
The STS chromosome of the primer pair, if the primers came from a UniSTS format file. Otherwise, "".
%>q
The "left" primer sequence of the alignment.
%<q
The "right" primer sequence of the alignment.
%>Q
The "left" primer sequence of the alignment, with alignment characters to indicate an insertion.
%<Q
The "right" primer sequence of the alignment, with alignment characters to indicate an insertion.
%>t
The "left" aligned sequence from the sequence database.
%<t
The "right" aligned sequence from the sequence database.
%>T
The "left" aligned sequence from the sequence database, with alignment characters to indicate deletion.
%<T
The "right" aligned sequence from the sequence database, with alignment characters to indicate deletion.
%>A
The string of alignment characters indicating exact match, insertion, deletion and substitution at each position of the "left" primer alignment.
%<A
The string of alignment characters indicating exact match, insertion, deletion and substitution at each position of the "right" primer alignment.
%>r
"F" if the forward form of the "left" primer was found, "R" if the reverse complement form of the "left" primer was found.
%<r
"F" if the forward form of the "right" primer was found, "R" if the reverse complement form of the "right" primer was found.
%>R
" REVCOMP" if the reverse complement form of the "left" primer was found, "" otherwise.
%<R
" REVCOMP" if the reverse complement form of the "right" primer was found, "" otherwise.
%R
" REVERSE-STRAND" if the primer pair was found in reverse strand orientation (second primer first), "" otherwise.
%@
The sequence of the amplicon amplified by the primer pair.
%N
The number of N's in the sequence of the amplicon "amplified" by the primer pair.
%0
e-PCR format output.
%%
Percent (%).
The default alignment format is ">%h\n %>T %>s ... %l ... %<e %<T\n %>A %!>s %!l %!<e %<A\n %>Q %>r%!>s %!l %!<e%<r %<Q %a%R\n". The character '!' in
the format indicates that the formatted data should be replaced in the output by the same number of spaces.
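The effect of conversion characters and of the '!' modifier can be illustrated with a toy expander. This is a sketch, not pcr_match's formatter: the function name expand and the values dictionary are hypothetical, the caller supplies the value for each conversion, and escape sequences such as \n are assumed to have been turned into real characters already.

```python
import re

# "%", an optional "!", an optional ">" or "<", then one conversion character.
_SPEC = re.compile(r"%(!?)([<>]?.)", re.DOTALL)


def expand(fmt, values):
    """Expand conversion specifications in fmt using the values mapping.

    Keys are written without the leading "%", for example ">s" or "l".
    With "!", the value is replaced by spaces of the same width, which
    keeps the lines of a multi-line alignment in register.
    """
    def replace(match):
        blank, key = match.group(1), match.group(2)
        if key == "%":
            return "%"
        text = str(values[key])
        return " " * len(text) if blank else text

    return _SPEC.sub(replace, fmt)
```

For example, expand("%>s ... %l", {">s": 57681, "l": 1714}) gives "57681 ... 1714", while "%!>s" in the same position would give five spaces.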
See Also
primer_match, compress_seq
Author
Nathan Edwards