Peptide Sequence Databases
RECOMB Satellite Workshop on Computational Proteomics accepted talk:
Novel Peptide Identification using Expressed Sequence Tags and Sequence Database Compression
Nathan J. Edwards
Download the human gene-centric compressed EST peptide sequence database...
USHUPO 2006 Poster:
Novel Peptide Identification using ESTs and Genomic Sequence
Nathan J. Edwards, Xue Wu, and Chau-Wen Tseng
Supporting presentation for novel identified peptides, with embedded spectrum and genome browser links.
USHUPO 2005 Poster:
Faster, more sensitive peptide identification from tandem mass spectra by sequence database compression
Nathan J. Edwards
Introduction
The protein sequence databases used by tandem mass spectra search
engines are designed to be as useful as possible to as many researchers
as possible. As such, they are a less than ideal substrate for tandem
mass spectra search. Protein sequence databases typically represent
only "full-length" protein sequences and attempt to collapse protein
variants to a single "consensus" entry. Tandem mass spectra search
engines, however, chop up the protein sequence using an
in-silico enzymatic digestion (typically trypsin), so
full-length proteins are not needed in order to identify
experimentally observed peptides; and the currently available search
engines require the experimental peptides' sequences be explicitly
present in the sequence database in order to identify them, so
explicit sequence variants are very important.
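As a minimal sketch of the in-silico digestion described above, the following Python function cleaves a sequence after K or R, except before P (a common simplification of trypsin specificity). The function name `tryptic_digest`, the `missed_cleavages` parameter, and the toy sequences are illustrative assumptions and are not taken from any particular search engine.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """Return the tryptic peptides of `protein`, in order of occurrence.

    Simplified rule (an assumption): cleave after K or R unless the
    next residue is P.
    """
    # Cleavage sites are positions between residues; 0 and len() bound the sequence.
    sites = [0]
    for i, residue in enumerate(protein[:-1]):
        if residue in "KR" and protein[i + 1] != "P":
            sites.append(i + 1)
    sites.append(len(protein))

    peptides = []
    last = len(sites) - 1
    for i in range(last):
        # Join up to `missed_cleavages` additional adjacent fragments.
        for j in range(i + 1, min(i + 1 + missed_cleavages, last) + 1):
            peptide = protein[sites[i]:sites[j]]
            if peptide:
                peptides.append(peptide)
    return peptides


# Hypothetical consensus entry and a variant with a single V -> L substitution.
consensus = "AKRPGKDEVR"
variant = "AKRPGKDELR"

# Peptides that can only be matched if the variant is explicitly in the database.
variant_only = set(tryptic_digest(variant)) - set(tryptic_digest(consensus))
print(variant_only)  # {'DELR'}
```

Only one short peptide differs between the two entries, which is why a full-length protein is not needed to identify it, but the variant residue does have to appear somewhere in the searched sequence.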
Our research in peptide sequence databases seeks to remedy this problem by:
- Enumerating putative peptide sequences that might otherwise be
missed by gene/transcript/protein annotation pipelines and curation;
and
- Compressing the resulting peptide sequence databases so that
search times remain feasible.
Putative Peptide Sequences
There are many different sources of evidence for peptide sequences
that are missing from our current protein sequence databases. We could
search assembled genomes, where they are available, but there are
inherent problems with this approach: most genomic sequence lies
between gene sequences, introns break up genes into coding and
non-coding regions, and exons are short enough that many tryptic
peptides straddle introns.
Expressed sequence tags (ESTs) solve a number of these issues, but
introduce others. ESTs are the result of sequencing the ends of
complementary DNA (cDNA) derived from mRNA, the result of
transcription. ESTs, then, represent sequence that is known to be
transcribed (the first step of protein synthesis) and contain no
intronic sequence. ESTs are one of the key pieces of evidence used by
gene annotation pipelines. The primary problems with the EST databases
are that they often have many ESTs from the same region of the same
transcripts, and since they correspond to a single sequencing read,
they have an error rate of about 1%.
Fortunately, these issues with ESTs can be naturally overcome using
the "Sequence Database Compression" technique below. In addition to
ensuring that all peptides generated from the ESTs are represented in
much less sequence, we can also require that peptides be observed in
at least 2 independent ESTs. The resulting EST-based peptide sequence
database for human contains about 200Mb of sequence, a forty-fold
reduction in size from a naive 6-frame translation, which would
require about 8Gb of sequence. See
Genomic Peptide Sequence Databases
for more details. The peptide sequence database for humans is
available for download.
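The idea of a 6-frame translation combined with a "seen in at least 2 ESTs" filter can be sketched as follows. This is a toy illustration under stated assumptions: it uses the standard genetic code, treats stop codons as breaks, and counts support per amino-acid k-mer, which may differ from how the actual database was built. The names `six_frame_translations` and `supported_kmers` are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translations(dna):
    """Yield the three forward and three reverse-complement translations."""
    dna = dna.upper()
    for strand in (dna, dna.translate(COMPLEMENT)[::-1]):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            # Codons with ambiguous bases (e.g. N) are translated as X.
            yield "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def supported_kmers(ests, k, min_ests=2):
    """Return amino-acid k-mers observed in at least `min_ests` distinct ESTs."""
    support = Counter()
    for est in ests:
        seen = set()  # count each k-mer at most once per EST
        for translation in six_frame_translations(est):
            for orf in translation.split("*"):  # stop codons break the sequence
                for i in range(len(orf) - k + 1):
                    seen.add(orf[i:i + k])
        support.update(seen)
    return {kmer for kmer, count in support.items() if count >= min_ests}


# Toy ESTs: the first two agree; the third carries a single-base difference.
ests = ["ATGGCTAAAGGT", "GGATGGCTAAAGGTCC", "ATGGCTCAAGGT"]
kept = supported_kmers(ests, k=3)
```

With the toy input, k-mers shared by the first two ESTs are kept, while k-mers that arise only from the single-base difference in the third EST are dropped, which is the effect wanted for sequencing errors at a rate of about 1%.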
Sequence Database Compression
Search engines for tandem mass spectra typically have running times
proportional to the size of the amino-acid sequence database being
searched, rather than the number of distinct peptides contained in the
sequence database. Unfortunately, when we enumerate putative peptide
sequences and sequence variants, we often end up with redundant copies
of peptide sequence. This increases our search time unnecessarily.
A good example of this phenomenon is the
varsplic.pl Perl script from UniProt. The
varsplic.pl script enumerates all the sequence variants,
conflicts, and isoforms from SwissProt and other UniProt sequence
databases as full length fasta entries. Clearly, these represent
valuable peptide sequences, but the price is high: for
about 1% additional peptides, we increase the sequence database size
by 66%, which is then reflected in the running time of our search
engines.
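The mismatch between database size and distinct peptide content is easy to reproduce on a toy example. The sketch below compares total residues with the number of distinct k-mers before and after adding a full-length variant entry. The sequences, the choice of k, and the function name `size_and_distinct_kmers` are made up for illustration; the numbers it produces are not the UniProt figures quoted above.

```python
def size_and_distinct_kmers(entries, k):
    """Return (total residues, number of distinct k-mers) for a list of sequences."""
    total = sum(len(seq) for seq in entries)
    kmers = {seq[i:i + k] for seq in entries for i in range(len(seq) - k + 1)}
    return total, len(kmers)


# Hypothetical consensus entry and a full-length copy with one substitution.
consensus = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
variant = consensus[:10] + "H" + consensus[11:]

k = 5
base_size, base_kmers = size_and_distinct_kmers([consensus], k)
full_size, full_kmers = size_and_distinct_kmers([consensus, variant], k)

print(f"residues: {base_size} -> {full_size}")
print(f"distinct {k}-mers: {base_kmers} -> {full_kmers}")
```

Adding the variant as a full-length entry doubles the number of residues to search, yet a single substitution can contribute at most k new k-mers.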
Fortunately, it is possible to construct a new sequence database that
contains all peptide sequences from the original without introducing
new ones, by considering all "words" of length 30 (30-mers) from the
sequence database. The paper "Sequence database compression for
peptide identification from tandem mass spectra" describes the
process by which this new sequence database can be constructed, and
gives compression rates for commonly used protein sequence
databases. These peptide sequence databases are available for
download.
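To make the k-mer idea concrete, here is a simplified greedy sketch, which is not the algorithm from the paper. It treats each distinct k-mer as an edge between its (k-1)-residue prefix and suffix, then walks unused edges to spell out new sequences. Every k-mer of the output is a k-mer of the input and vice versa, so no new words are introduced. The function names are hypothetical, and the toy example uses k=4 rather than 30 to keep it readable.

```python
from collections import Counter, defaultdict


def kmer_set(sequences, k):
    """All distinct k-mers of the given sequences."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}


def compress(sequences, k):
    """Greedily rebuild a smaller set of sequences with the same k-mers."""
    kmers = kmer_set(sequences, k)
    # Each k-mer is an edge from its (k-1)-prefix to its (k-1)-suffix.
    out_edges = defaultdict(list)
    for kmer in sorted(kmers):
        out_edges[kmer[:-1]].append(kmer)
    unused = set(kmers)
    unused_in = Counter(kmer[1:] for kmer in kmers)  # unused edges entering a node

    def use(kmer):
        unused.discard(kmer)
        unused_in[kmer[1:]] -= 1

    paths = []
    while unused:
        # Prefer a start that no unused k-mer could still lead into.
        start = min(unused, key=lambda e: (unused_in[e[:-1]] > 0, e))
        use(start)
        path, node = start, start[1:]
        while True:
            nxt = next((e for e in out_edges[node] if e in unused), None)
            if nxt is None:
                break
            use(nxt)
            path += nxt[-1]  # overlapping k-mers share k-1 residues
            node = nxt[1:]
        paths.append(path)

    # Entries shorter than k contain no k-mers, so keep them unchanged.
    short = sorted({s for s in sequences if 0 < len(s) < k})
    return paths + short


# Toy database: a consensus, a one-residue variant, and a redundant fragment.
database = ["MKTAYIAK", "MKTAYLAK", "TAYIAK"]
compressed = compress(database, k=4)
```

On this toy input the compressed sequences are shorter in total than the original entries while containing the same set of 4-mers. A greedy walk like this gives no guarantee of minimum size; the cited paper describes how the construction is actually done and the compression it achieves.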
A proof-of-principle implementation of these techniques has been
carried out for the Mascot search engine. The results are available in
the presentation "Faster, more
sensitive peptide identification by sequence database
compression". All of the tools required for this proof of
principle are available for download; see Implementation of Sequence Database
Compression for Mascot for more details. Implementations for other
sequence database search engines should be similar.