University of Maryland Nathan Edwards
Center for Bioinformatics and Computational Biology

Peptide Sequence Databases

RECOMB Satellite Workshop on Computational Proteomics accepted talk:
Novel Peptide Identification using Expressed Sequence Tags and Sequence Database Compression
Nathan J. Edwards
Download the human gene-centric compressed EST peptide sequence database...

USHUPO 2006 Poster:
Novel Peptide Identification using ESTs and Genomic Sequence
Nathan J. Edwards, Xue Wu, and Chau-Wen Tseng
Supporting presentation for novel identified peptides, with embedded spectrum and genome browser links.

USHUPO 2005 Poster:
Faster, more sensitive peptide identification from tandem mass spectra by sequence database compression
Nathan J. Edwards

Introduction

The protein sequence databases used by tandem mass spectra search engines are designed to be as useful as possible to as many researchers as possible. As such, they are a less than ideal substrate for tandem mass spectra search. Protein sequence databases typically represent only "full-length" protein sequences and attempt to collapse protein variants into a single "consensus" entry. Tandem mass spectra search engines, however, chop up the protein sequence using an in-silico enzymatic digestion (typically trypsin), so full-length proteins are not needed in order to identify experimentally observed peptides. Moreover, the currently available search engines require that the experimental peptides' sequences be explicitly present in the sequence database in order to identify them, so explicit sequence variants are very important.
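To make the digestion step concrete, here is a minimal sketch of an in-silico tryptic digest. It assumes the common simplified rule (cleave after K or R unless the next residue is P); real search engines may apply different or configurable rules. The function names and the toy sequences are hypothetical, chosen only for illustration.

```python
def tryptic_fragments(protein):
    """Split a protein after K or R, unless the next residue is P."""
    fragments, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])  # C-terminal remainder
    return fragments


def tryptic_peptides(protein, missed_cleavages=0):
    """Return the set of peptides, allowing up to `missed_cleavages` skipped sites."""
    fragments = tryptic_fragments(protein)
    peptides = set()
    for i in range(len(fragments)):
        for j in range(i, min(i + missed_cleavages + 1, len(fragments))):
            peptides.add("".join(fragments[i:j + 1]))
    return peptides


# Toy sequences (invented for illustration): a "consensus" entry and a
# variant that differs by a single substitution (E -> Q).
consensus = "AKGDEFRPTK"
variant = "AKGDQFRPTK"

# Peptides that only the variant can produce; a search against the
# consensus entry alone could not identify them.
missing = tryptic_peptides(variant) - tryptic_peptides(consensus)
```

In this toy case `missing` contains the single peptide `GDQFRPTK`, which illustrates why a database holding only the consensus entry cannot match a spectrum from the variant peptide.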

Our research in peptide sequence databases seeks to remedy this problem by:

  1. Enumerating putative peptide sequences that might otherwise be missed by gene/transcript/protein annotation pipelines and curation; and
  2. Compressing the resulting peptide sequence databases so that search times remain feasible.

Putative Peptide Sequences

There are many different sources of evidence for peptide sequences that are missing from our current protein sequence databases. We could search assembled genomes, where they are available, but there are inherent problems with this approach: most genomic sequence lies between gene sequences, introns break up genes into coding and non-coding regions, and exons are short enough that many tryptic peptides straddle introns.

Expressed sequence tags (ESTs) solve a number of these issues, but introduce others. ESTs are the result of sequencing the ends of complementary DNA (cDNA) derived from mRNA, the result of transcription. ESTs, then, represent sequence that is known to be transcribed (the first step of protein synthesis) and contain no intronic sequence. ESTs are one of the key pieces of evidence used by gene annotation pipelines. The primary problems with the EST databases are that they often contain many ESTs from the same region of the same transcript, and that, since each EST corresponds to a single sequencing read, they have an error rate of about 1%.

Fortunately, these issues with ESTs can be naturally overcome using the "Sequence Database Compression" technique below. In addition to ensuring that all peptides generated from the ESTs are represented in much less sequence, we can also require that peptides be observed in at least 2 independent ESTs. The resulting EST-based peptide sequence database for human contains about 200 Mb of sequence, a forty-fold reduction in size compared with a naive 6-frame translation, which would require about 8 Gb of sequence. See Genomic Peptide Sequence Databases for more details. The peptide sequence database for human is available for download.
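The idea of a 6-frame translation combined with a "seen in at least 2 ESTs" filter can be sketched as follows. This is an illustration under stated assumptions, not the pipeline used to build the downloadable database: it assumes the standard genetic code, treats a "peptide" as an amino-acid word of length `k` within an open reading frame, and discards words containing ambiguous residues. The function names and parameters are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_translations(dna):
    """Yield the six amino-acid translations of a nucleotide sequence."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse_complement):
        for frame in range(3):
            codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
            # Codons with ambiguous bases (e.g. N) become 'X' (an assumption).
            yield "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def supported_words(ests, k=30, min_ests=2):
    """Return amino-acid k-mers observed in at least `min_ests` distinct ESTs."""
    support = Counter()
    for est in ests:
        seen = set()
        for translation in six_frame_translations(est):
            for orf in translation.split("*"):  # stop codons end a reading frame
                for i in range(len(orf) - k + 1):
                    word = orf[i:i + k]
                    if "X" not in word:
                        seen.add(word)
        support.update(seen)  # each EST counts at most once per word
    return {word for word, count in support.items() if count >= min_ests}
```

Counting each EST at most once per word means that a sequencing error confined to a single read cannot, by itself, introduce a word into the result.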

Sequence Database Compression

Search engines for tandem mass spectra typically have running times proportional to the size of the amino-acid sequence database being searched, rather than the number of distinct peptides contained in the sequence database. Unfortunately, when we enumerate putative peptide sequences and sequence variants, we often end up with redundant copies of peptide sequence. This increases our search time unnecessarily.

A good example of this phenomenon is the varsplic.pl Perl script from UniProt. The varsplic.pl script enumerates all the sequence variants, conflicts, and isoforms from SwissProt and other UniProt sequence databases as full-length FASTA entries. Clearly, these represent valuable peptide sequences, but the price is high: for about 1% additional peptides, we increase the sequence database size by 66%, which is then reflected in the running time of our search engines.
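The trade-off described here, a large increase in database size for a small gain in distinct peptides, can be measured on any pair of databases. The sketch below uses a simplified tryptic rule (cleave after K or R unless followed by P) and invented toy sequences; the function names are hypothetical, and the numbers it produces are not the 1% and 66% figures quoted above.

```python
def tryptic_peptides(protein):
    """Distinct fully tryptic peptides of one protein (simplified rule)."""
    peptides, start = set(), 0
    for i, residue in enumerate(protein):
        # The slice is empty at the end of the sequence, so it never equals "P".
        if residue in "KR" and protein[i + 1:i + 2] != "P":
            peptides.add(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.add(protein[start:])
    return peptides


def database_growth(base, expanded):
    """Return (fractional growth in residues, fractional growth in distinct peptides)."""
    base_size = sum(len(s) for s in base)
    expanded_size = sum(len(s) for s in expanded)
    base_peptides = {p for s in base for p in tryptic_peptides(s)}
    expanded_peptides = {p for s in expanded for p in tryptic_peptides(s)}
    return (expanded_size / base_size - 1,
            len(expanded_peptides) / len(base_peptides) - 1)


# Toy example: adding one full-length variant entry with a single substitution.
base = ["AAKCCKDDKEEK"]
expanded = base + ["AAKCCKDDKEFK"]
size_growth, peptide_growth = database_growth(base, expanded)
```

Here the database doubles in size (`size_growth` is 1.0) while the number of distinct peptides grows by only a quarter (`peptide_growth` is 0.25), because three of the four peptides in the variant entry are redundant copies.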

Fortunately, it is possible to construct a new sequence database that contains all peptide sequences from the original without introducing new ones, by considering all "words" of length 30 (30-mers) from the sequence database. The paper "Sequence database compression for peptide identification from tandem mass spectra" describes the process by which this new sequence database can be constructed, and gives compression rates for commonly used protein sequence databases. These peptide sequence databases are available for download.
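One simple way to see how such a database can be built is to treat each distinct word as an edge in a graph of overlapping words and to merge unambiguous chains of edges into single sequences. The sketch below does exactly that. It is a simplified illustration, not the algorithm from the paper, and it makes no attempt to minimize the output size. The function names are hypothetical, and sequences shorter than `k` are simply kept as they are.

```python
from collections import defaultdict


def kmer_set(sequences, k):
    """All distinct words of length k in the given sequences."""
    return {s[i:i + k] for s in sequences for i in range(len(s) - k + 1)}


def compress(sequences, k=30):
    """Return sequences containing exactly the same k-mers, each one once."""
    kmers = kmer_set(sequences, k)
    out_edges = defaultdict(list)  # (k-1)-mer -> k-mers that start with it
    in_degree = defaultdict(int)   # (k-1)-mer -> number of k-mers ending with it
    for kmer in kmers:
        out_edges[kmer[:-1]].append(kmer)
        in_degree[kmer[1:]] += 1

    def is_simple(node):
        # Exactly one way in and one way out: safe to merge through this node.
        return in_degree.get(node, 0) == 1 and len(out_edges.get(node, ())) == 1

    used = set()

    def walk(first):
        path, kmer = first, first
        used.add(first)
        while is_simple(kmer[1:]):
            following = out_edges[kmer[1:]][0]
            if following in used:  # came back around a closed cycle
                break
            used.add(following)
            path += following[-1]
            kmer = following
        return path

    # Chains start wherever the preceding node is a branch point or a dead end.
    compressed = [walk(kmer) for kmer in sorted(kmers) if not is_simple(kmer[:-1])]
    # Any k-mers left over lie on isolated cycles; start those anywhere.
    compressed += [walk(kmer) for kmer in sorted(kmers) if kmer not in used]
    # Sequences too short to contain a k-mer are kept unchanged.
    compressed += sorted({s for s in sequences if len(s) < k})
    return compressed


# Toy database (invented): a duplicated entry plus a one-residue variant.
original = ["MKTAYIAKQR", "MKTAYIAKQR", "MKTAYIAKQL"]
compressed = compress(original, k=4)
```

With `k=4`, the 30 residues of the toy database shrink to 17 (`AKQL`, `AKQR`, `MKTAYIAKQ`), and `kmer_set(compressed, 4) == kmer_set(original, 4)` holds, so no word of length 4 is lost and none is introduced. Words longer than `k` are not guaranteed to survive, which is why the choice of `k` has to cover the peptide lengths of interest.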

A proof-of-principle implementation of these techniques has been carried out for the Mascot search engine. The results are available in the presentation "Faster, more sensitive peptide identification by sequence database compression". All of the tools required for this proof of principle are available for download; see Implementation of Sequence Database Compression for Mascot for more details. Implementations for other sequence database search engines should be similar.

Original created by John Fuetsch
Questions and comments to Nathan Edwards