Genomic Peptide Sequence Databases
Introduction
Our current protein sequence databases capture only some of the
peptide sequences that could be observed in proteomics identification
workflows. The sequences in protein sequence databases represent the
result of extensive gene, transcript and protein annotation pipelines
that rely heavily on computational predictions that weigh all of the
various types of evidence for (and against) a full length protein
sequence. Furthermore, protein sequence databases often suppress
similar sequences for the same protein in an effort to control
database size and annotation overhead. Our approach is to enumerate
many possible sources of peptide sequences, whether or not we have
strong evidence for full length protein sequences that contain them,
so that our MS/MS search engines can discover evidence for them in
proteomics workflows.
Genomes
Searching genomic sequence directly is problematic for eukaryotes due
to the intron/exon structure of genes. Human genes, for example, have
exons that are, on average, about 150 nucleotides in length. As such,
there is a substantial chance that a peptide sequence spans an exon
boundary, and searching the genome directly will therefore miss many
potential peptide sequences. Searching the genome directly is also
wasteful, since we know that only a small fraction of the
sequence is coding. We plan to address these issues by using
computational gene-finders to predict exons and using these as a basis
for peptide sequence databases; however, this work is still in its
early stages.
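As a rough illustration of why exon boundaries matter, the sketch below
estimates the fraction of peptides of a given length whose coding
nucleotides would cross an exon boundary. It assumes, as simplifications
not made in the text above, that every exon is exactly 150 nucleotides
long, that peptide start positions are uniformly distributed within an
exon, and that codon phase can be ignored. The function name and the
peptide lengths tried are illustrative choices only.

```python
def boundary_spanning_fraction(peptide_len_aa, exon_len_nt=150):
    """Approximate fraction of peptides that cross an exon boundary.

    Assumes fixed-length exons and uniformly distributed start positions
    (both are simplifying assumptions, not measured values).
    """
    peptide_len_nt = 3 * peptide_len_aa
    # Start offsets within an exon for which the whole peptide still fits.
    fitting_starts = max(exon_len_nt - peptide_len_nt + 1, 0)
    return 1.0 - fitting_starts / exon_len_nt


if __name__ == "__main__":
    for length in (10, 15, 20):
        print(length, round(boundary_spanning_fraction(length), 3))
```

Under these assumptions the fraction grows roughly linearly with
peptide length, so longer peptides are the ones most likely to be
missed by a direct genome search.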
EST Databases
ESTs represent transcribed sequence, so we needn't worry about
searching unnecessary sequence; and introns are spliced out, so
peptides on exon boundaries are no longer a concern. However, ESTs
introduce other issues: EST databases contain many transcripts
from the same gene, resulting in significant redundancy, and ESTs are
understood to have a sequencing error rate of about 1%. Fortunately,
we can use our sequence database compression technique to mitigate
these problems.
A brute force six-frame translation of the Human dbEST would turn the
3Gb of nucleotide sequence into 6Gb of amino-acid sequence. In
addition to significant redundancy and sequencing error problems, we
would also spend a lot of time searching translations from clearly
incorrect reading frames.
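A brute force six-frame translation can be sketched as follows. This is
a minimal illustration using the standard genetic code, not the code
used to build the databases described here; the function names are
hypothetical, and the choice to emit 'X' for any codon containing a
non-ACGT base is an assumption.

```python
# Standard genetic code, with codons ordered by base in the order T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement; non-ACGT characters are left unchanged."""
    return seq.translate(COMPLEMENT)[::-1]


def translate_frame(seq, offset):
    """Translate seq from offset; '*' marks a stop, 'X' an ambiguous codon."""
    codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(seq):
    """Return three forward-strand frames, then three reverse-strand frames."""
    seq = seq.upper()
    strands = (seq, reverse_complement(seq))
    return [translate_frame(s, offset) for s in strands for offset in range(3)]
```

Each frame yields about one residue per three nucleotides, so six frames
yield about two residues per nucleotide, which is consistent with 3Gb of
nucleotide sequence becoming 6Gb of amino-acid sequence.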
We process the brute force six-frame translation of the ESTs through a
number of stages. First, we keep only unambiguous amino-acid sequences
of length at least 50. This ensures that each retained amino-acid
sequence is in an open reading frame (ORF) of at least 150
nucleotides. This is a fairly conservative filter for the correct frame,
but it reduces the amount of sequence to about 1Gb. Second, we compute the C3
compressed sequence database representation of all of the 30-mers
contained in the resulting translated sequence. In the process, we
eliminate all 30-mers that occur exactly once in the EST database. For
Human ESTs, the result is about 150Mb of sequence, a 40-fold reduction
in the sequence database size, which turns a 15-20 hour Mascot search
into a 20 minute search. The C3 compressed peptide sequence database
derived from Human ESTs is available for download.
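The two filtering stages above can be sketched on a small scale as
follows. This is an illustration only: it does not implement C3
compression, and an in-memory counter would not scale to the full Human
dbEST. The function names are hypothetical. It assumes that translated
frames mark stops with '*' and ambiguous residues with 'X', and that
every occurrence of a 30-mer is counted, including repeats within a
single EST.

```python
import re
from collections import Counter


def unambiguous_segments(translation, min_len=50):
    """Split a translated frame at stops ('*') and ambiguous residues ('X'),
    keeping only segments of at least min_len amino-acids."""
    return [seg for seg in re.split(r"[*X]", translation) if len(seg) >= min_len]


def kmers(segment, k=30):
    """Yield every overlapping k-mer of a segment."""
    return (segment[i:i + k] for i in range(len(segment) - k + 1))


def repeated_kmers(translations, min_len=50, k=30, min_count=2):
    """Return the k-mers seen at least min_count times across all retained
    segments, which drops k-mers that occur exactly once."""
    counts = Counter()
    for translation in translations:
        for segment in unambiguous_segments(translation, min_len):
            counts.update(kmers(segment, k))
    return {kmer for kmer, n in counts.items() if n >= min_count}
```

Requiring a second observation is what suppresses most sequencing
errors: an erroneous 30-mer is unlikely to be reproduced in an
independent EST, whereas a genuine one usually is.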
mRNA Sequences
The RefSeq mRNA sequences, like ESTs, represent transcript
sequence. As such, they do not contain introns, but unlike ESTs, they
consist of well understood full length transcript sequences with few
sequencing errors and little redundancy. We enumerate peptide
sequences from RefSeq mRNA sequences using a brute force 3-frame
translation, since the genomic direction is established. We do not
filter out short open reading frames, and we do not filter out 30-mers
that occur only once.
Genomic Peptides
In order to capture putative peptide sequences in concert with well
understood peptide sequences, particularly to avoid redundancy from
the sequences they share, we form a new sequence database containing
the EST amino-acid sequence from open reading frames of at least 50
unambiguous amino-acids, restricted to 30-mers that occur at least
twice; the RefSeq mRNA amino-acid sequence enumeration; all RefSeq
proteins; and the corresponding IPI database, where available. For
Human sequences, a brute force enumeration of these sequences would be
over 6Gb in length; following C3 compression, the genomic peptide
sequence database is about 160Mb in size. The C3 compressed genomic peptide sequence
database for Human sequences is available for download.
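The way the sources are combined can be sketched as a union of 30-mer
sets, so that a 30-mer shared by several sources is stored only once.
This is a small-scale illustration rather than the C3 compression
itself. The function and parameter names are hypothetical, and the
sketch assumes that EST input has already been reduced to open reading
frames of at least 50 unambiguous amino-acids and that the other
sequences have already been split at stop codons.

```python
from collections import Counter


def kmers(seq, k=30):
    """Yield every overlapping k-mer of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def genomic_peptide_kmers(est_orfs, other_sequences, k=30, min_est_count=2):
    """Merge k-mers from ESTs and from the better understood sources.

    est_orfs:        EST-derived ORF translations (assumed pre-filtered).
    other_sequences: mRNA translations, RefSeq and IPI protein sequences
                     (assumed already split at stop codons).
    """
    est_counts = Counter()
    for orf in est_orfs:
        est_counts.update(kmers(orf, k))
    # Only the EST k-mers must be observed at least min_est_count times.
    merged = {kmer for kmer, n in est_counts.items() if n >= min_est_count}
    # k-mers from the other sources are kept even if they occur only once.
    for seq in other_sequences:
        merged.update(kmers(seq, k))
    return merged
```

Because the result is a set, sequence shared between the EST, mRNA and
protein sources contributes only once, which is the redundancy the
combined database is meant to avoid.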