University of Maryland Nathan Edwards
Center for Bioinformatics and Computational Biology

Genomic Peptide Sequence Databases

Introduction

Current protein sequence databases capture only some of the peptide sequences that could be observed in proteomics identification workflows. The sequences in these databases are the product of extensive gene, transcript, and protein annotation pipelines that rely heavily on computational predictions, which weigh the various types of evidence for (and against) a full-length protein sequence. Furthermore, protein sequence databases often suppress similar sequences for the same protein in an effort to control database size and annotation overhead. Our approach is to enumerate many possible sources of peptide sequences, whether or not we have strong evidence for full-length protein sequences that contain them, so that MS/MS search engines can discover evidence for them in proteomics workflows.

Genomes

Searching genomic sequence directly is problematic for eukaryotes because of the intron/exon structure of genes. Human exons, for example, average about 150 nucleotides in length, or roughly 50 amino acids. A substantial fraction of peptide sequences therefore span an exon boundary, and searching the genome directly will miss them. Searching the genome directly is also very wasteful, since only a small fraction of the sequence is coding. We plan to address these issues by using computational gene-finders to predict exons and using the predicted exons as the basis for peptide sequence databases; however, this work is still in its early stages.
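
To get a rough feel for how often a peptide crosses an exon boundary, the following back-of-the-envelope sketch may help. It rests on simplifying assumptions that are not part of our analysis: every exon is exactly 150 nucleotides long, the peptide is encoded by consecutive nucleotides, and its start position is uniformly distributed within the exon. The function name boundary_fraction is hypothetical.

```python
def boundary_fraction(exon_nt: int, peptide_aa: int) -> float:
    """Fraction of start positions in an exon at which a peptide of
    peptide_aa residues would run past the end of the exon.

    Assumes a uniform start position and a fixed exon length; real exon
    lengths vary widely, so this is only an order-of-magnitude estimate.
    """
    span_nt = 3 * peptide_aa              # nucleotides needed to encode the peptide
    if span_nt > exon_nt:
        return 1.0                        # peptide cannot fit in a single exon
    inside_starts = exon_nt - span_nt + 1 # start positions that stay inside the exon
    return 1.0 - inside_starts / exon_nt


if __name__ == "__main__":
    for length in (10, 15, 20):
        print(length, round(boundary_fraction(150, length), 3))
```

Under these assumptions, roughly 19% of 10-residue peptides, 29% of 15-residue peptides, and 39% of 20-residue peptides would cross an exon boundary.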

EST Databases

ESTs represent transcribed sequence, so we need not worry about searching unnecessary sequence, and introns are spliced out, so peptides on exon boundaries are no longer a concern. However, ESTs introduce other issues: EST databases contain many transcripts from the same gene, resulting in significant redundancy, and ESTs are understood to have a sequencing error rate of about 1%. Fortunately, we can use our sequence database compression technique to mitigate these problems.

A brute force six-frame translation of the Human dbEST would turn the 3Gb of nucleotide sequence into 6Gb of amino-acid sequence. In addition to significant redundancy and sequencing error problems, we would also spend a lot of time searching clearly incorrect sequence frame translations.
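
A brute-force six-frame translation can be sketched as follows. This is a minimal illustration, not our production code: the function names are hypothetical, the standard genetic code is assumed, and any codon containing an ambiguous base is written as X. Each of the six frames yields about one residue per three nucleotides, so N nucleotides produce about 2N amino acids, which is where the 3Gb to 6Gb figure comes from.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (third base fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str, offset: int) -> str:
    """Translate seq starting at offset; '*' marks stops, 'X' unknown codons."""
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame(seq: str) -> list[str]:
    """Three forward frames followed by three reverse-complement frames."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return [translate(strand, offset) for strand in strands for offset in range(3)]


if __name__ == "__main__":
    for frame in six_frame("ATGGCCTAA"):
        print(frame)
```

The first three entries of the returned list are the forward-strand frames, which is all that is needed when the transcript direction is already known.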

We process the brute-force six-frame translation of the ESTs through a number of stages. First, we keep only unambiguous amino-acid sequences of length at least 50. This ensures that each retained amino-acid sequence is in an open reading frame (ORF) of at least 150 nucleotides. This is a fairly conservative filter for correct frame, but it reduces the amount of sequence to about 1Gb. Second, we compute the C3 compressed sequence database representation of all of the 30-mers contained in the resulting translated sequence. In the process, we eliminate all 30-mers that occur exactly once in the EST database. For Human ESTs, the result is about 150Mb of sequence, a 40-fold reduction in the sequence database size, which turns a 15-20 hour Mascot search into a 20 minute search. The C3 compressed peptide sequence database derived from Human ESTs is available for download.
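
The two filtering stages can be illustrated with the sketch below. It is not the C3 compression itself, which is not shown here; it only demonstrates the ORF-length filter and the removal of 30-mers seen once. The function names are hypothetical, and two assumptions are made: a retained sequence is a maximal run of the 20 standard residues (so stops and ambiguous residues both end a run), and a 30-mer's count is its total number of occurrences across all retained runs.

```python
import re
from collections import Counter

MIN_RUN = 50   # minimum length of an unambiguous amino-acid run
K = 30         # peptide window length

# Anything other than the 20 standard residues (stop '*', ambiguous 'X', ...)
# ends a run.
_BREAK = re.compile(r"[^ACDEFGHIKLMNPQRSTVWY]+")


def unambiguous_runs(frame: str, min_len: int = MIN_RUN) -> list[str]:
    """Stage 1: keep stop-free, unambiguous runs of at least min_len residues."""
    return [run for run in _BREAK.split(frame) if len(run) >= min_len]


def repeated_kmers(frames, k: int = K, min_len: int = MIN_RUN,
                   min_count: int = 2) -> set[str]:
    """Stage 2: collect k-mers from retained runs, dropping those seen once."""
    counts = Counter()
    for frame in frames:
        for run in unambiguous_runs(frame, min_len):
            for i in range(len(run) - k + 1):
                counts[run[i:i + k]] += 1
    return {kmer for kmer, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    # Toy parameters so the behaviour is visible on short strings.
    toy_frames = ["ACDEFG*AC", "ACDEFK"]
    print(sorted(repeated_kmers(toy_frames, k=3, min_len=5)))
```

Requiring a second occurrence is what discards most sequencing errors: with an error rate of about 1%, an erroneous 30-mer is unlikely to be reproduced identically in an independent EST.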

mRNA Sequences

The RefSeq mRNA sequences, like ESTs, represent transcript sequence. As such, they do not contain introns, but unlike ESTs, they consist of well-understood, full-length transcript sequences with few sequencing errors and little redundancy. We enumerate peptide sequences from RefSeq mRNA sequences using a brute-force 3-frame translation (the direction of each transcript is already established). We do not filter for short open reading frames, and we do not filter out 30-mers that occur only once.

Genomic Peptides

To capture putative peptide sequences together with well-understood peptide sequences, and in particular to avoid redundancy from the sequences they share, we form a new sequence database containing the EST amino-acid sequences in open reading frames of at least 50 unambiguous amino acids that occur at least twice; the RefSeq mRNA amino-acid sequence enumeration; all RefSeq proteins; and the corresponding IPI database, where available. For Human sequences, a brute-force enumeration of these sequences would be over 6Gb in length; following C3 compression, the genomic peptides sequence database is about 160Mb in size. The C3 compressed genomic peptide sequence database for Human sequences is available for download.
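
The way shared sequence collapses when the sources are combined can be illustrated by taking the union of their 30-mers. This sketch only shows the redundancy removal, not the C3 compressed representation. The function and parameter names are hypothetical, and the inputs are assumed to be amino-acid strings that have already been split at stop codons, with the EST 30-mers already restricted to those seen at least twice.

```python
def kmers(sequences, k: int = 30) -> set[str]:
    """All distinct k-mers found in an iterable of amino-acid strings."""
    return {seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1)}


def genomic_peptide_kmers(est_repeated_kmers, mrna_translations,
                          refseq_proteins, ipi_proteins=(), k: int = 30) -> set[str]:
    """Union of filtered EST k-mers with unfiltered k-mers from the other sources.

    A k-mer shared by several sources is stored once, which is where the
    size reduction over brute-force enumeration comes from.
    """
    combined = set(est_repeated_kmers)
    for source in (mrna_translations, refseq_proteins, ipi_proteins):
        combined |= kmers(source, k)
    return combined


if __name__ == "__main__":
    # Toy example with k=3 so the overlap between sources is easy to see.
    merged = genomic_peptide_kmers({"ACD", "CDE"}, ["ACDE"], ["CDEF"], k=3)
    print(sorted(merged))
```

The IPI argument defaults to empty because that database is only included where it is available for the organism.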

Original created by John Fuetsch
Questions and comments to Nathan Edwards