University of Maryland Nathan Edwards
Center for Bioinformatics and Computational Biology
Home Research Teaching Publications

Research
Proteomics
Tools
Data
Research Statements


compress_seq

Name

compress_seq - Normalize and compress a multi-FASTA sequence database

Synopsis

compress_seq -i fasta_sequence_database [ options ]

Description

compress_seq takes a multi-FASTA sequence database and splits it into separate sequence, header and index files. It can output the sequence data in a variety of forms, depending of the command line options given, including a DNA optimized normalized form, and a bit compressed form. It can also add an arbitrary end of sequence character to each sequence entry, and force each sequence character to uppercase.

If compress_seq is used to pre-process a sequence database for primer_match, the best performance will result from the use of the -n true option, to index and normalize the sequence database.

By default, compress_seq splits the multi-FASTA sequence database file db into three files: db.seq, db.hdr and db.idb. In order to make compress_seq more suitable for use in computational analysis pipelines, it only re-constructs the sequence database component files if the file system timestamps indicate that it is necessary. Note that the -F option can be used to force each component to be re-made.

Command Line Options

-i fasta_sequence_database

Name of the multi-Fasta sequence database file to process. Required.

-I ( true | false )

Write fasta index file in binary format. Default: true.

-n ( true | false )

Create a normalized version of the sequence data. This creates additional files with suffixes .sqn and .tbl. Default: false.

-D ( true | false )

Optimize the normalized sequence data for DNA sequence. Default: true.

-z ( true | false )

Create a bit compressed normalized version of the sequence data. This creates additional files with suffixes .sqz and .tbz. Default: false.

-u ( true | false )

Force the sequence data to uppercase characters. Default: false.

-e ( true | false )

Add an end of sequence character after each entry from the multi-FASTA sequence database file. This can help ensure that text matching algorithms cannot find a match that straddles two FASTA entries. Default: true.

-E eos

Use eos as the end of sequence character. eos is the ascii code for the desired character, it may be specified as a decimal, octal or hexadecimal number. Default: 12 (newline).

-S ( true | false )

Insert end of sequence character before initial sequence entry. Default: false.

-F ( true | false )

Force each component of the compressed sequence database to be regenerated, even if the file timestamps indicate that this isn't necessary. Default: false.

-C ( true | false )

Cleanup unnecessary temporary files. Default: true.

-B

Use buffered standard I/O rather than mmap to stream through the sequence database. On some platforms, where the use of mmap is somewhat unpredictable, this option may make it possible to run compress_seq reliably. 

-v 

Output the release tag of the binary.

-h 

Command-line help.

See Also

primer_match, pcr_match

Author

Nathan Edwards

.......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... .......... ..........

University of Maryland     UM Home | Directories | Search | Admissions | Calendar
Original created by John Fuetsch
Questions and comments to Nathan Edwards