Sequence Database Compression for Mascot
Introduction
The installation and use of peptide sequence databases with Mascot
should be considered a proof of concept. Considerable experience with
running and configuring Mascot will be assumed in what follows. Using
peptide sequence databases to search MS/MS spectra is broken down into
two steps.
- Search the C3 compressed form of the sequence database with Mascot.
- Associate the peptide sequences with their proteins and
insert them into the search results.
The first of these steps requires us to install C3 sequence databases
and configure Mascot to use them. The second requires us to install
scripts and programs that read the Mascot results, extract the peptide
sequences, search the peptides against the original protein sequence
database, and reconstruct the Mascot search results.
Each of these installation steps is outlined next.
Installation
We assume the standard Mascot directory layout, with the Mascot root
directory at $MASCOT. This is usually
C:\INETPUB\MASCOT for Windows, and /usr/local/mascot
for Unix and Linux. Throughout, we will demonstrate the steps required
for the specific sequence database Varsplic, which is the result of
enumerating all sequence variants from Swiss-Prot.
Note that we assume the availability of Perl, which is required by Mascot.
C3 Sequence Database Installation
The required sequence databases, both the original and the C3
compressed forms, should be downloaded. Each pair of original
and C3 compressed sequence databases should be placed in their own
directory, in the usual Mascot directory structure.
For Varsplic, commands along the following lines would suffice (shown in Unix bash syntax; adapt as needed for Windows):
mkdir -p $MASCOT/sequence/Varsplic/{current,incoming,old}
cd $MASCOT/sequence/Varsplic/current
# Download uniprot_sprot_vs.fasta.gz into this directory, then:
gunzip uniprot_sprot_vs.fasta.gz
mv uniprot_sprot_vs.fasta uniprot_sprot_vs.1.fasta
mkdir -p $MASCOT/sequence/VarsplicC3/{current,incoming,old}
cd $MASCOT/sequence/VarsplicC3/current
# Download uniprot_sprot_vs.cfa.gz into this directory, then:
gunzip uniprot_sprot_vs.cfa.gz
mv uniprot_sprot_vs.cfa uniprot_sprot_vs.1.cfa
Next, we configure Mascot to use the C3 compressed Varsplic sequence
database (uniprot_sprot_vs.1.cfa). Using the standard database
maintenance interface, create a new sequence database with no taxonomy
rules, and very basic defline regular expressions. The C3 sequence
database deflines consist entirely of a unique "accession" string. In
what follows, we assume that the C3 compressed Varsplic sequence
database is installed as "VarsplicC3". The regular Varsplic sequence
database (uniprot_sprot_vs.1.fasta) should be installed as normal, as
Varsplic. Verify that Mascot correctly pre-processes VarsplicC3 and
Varsplic, and that its test search completes without error.
Configure Mascot
The C3 sequence databases introduce a non-amino-acid symbol, J, to
ensure that Mascot's in silico digestion algorithm does not
create false tryptic peptides. We obtain the correct digest behavior
by ensuring that J is never used in a peptide, and that it never
defines a tryptic digestion site.
We accomplish the first part of this by providing a modification that sets J's mass to 10 kDa (or more). In
$MASCOT/config/mod_file, add the modification:
Title:HeavyJ
Residues:J 10000.0 10000.0
*
We ensure J never marks a tryptic digestion site by creating a new
trypsin digestion rule in $MASCOT/config/enzyme:
Title:TrypsinJ
Cleavage:KR
Restrict:PJ
Cterm
*
We can now search the C3 compressed sequence database with Mascot, in
the same way that the original protein sequence database is searched,
so long as we make sure to select TrypsinJ as our digestion enzyme
rather than Trypsin, and we set the fixed modification HeavyJ.
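The combined effect of TrypsinJ and HeavyJ can be illustrated with a minimal sketch. It is not Mascot's digestion code: the function names and the toy sequence are hypothetical, and dropping every peptide that contains J is used here as a stand-in for the effect of the 10 kDa mass, which puts such peptides out of reach of any realistic precursor mass.

```python
def digest_trypsin_j(sequence, missed_cleavages=0):
    """Cleave C-terminal to K or R unless the next residue is P or J.

    Mirrors the TrypsinJ rule (Cleavage:KR, Restrict:PJ, Cterm).
    Returns peptides with up to `missed_cleavages` missed sites.
    """
    # Collect cut positions: a cut at index c splits sequence[:c] | sequence[c:].
    cuts = [0]
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] not in "PJ":
            cuts.append(i + 1)
    cuts.append(len(sequence))

    peptides = []
    for start in range(len(cuts) - 1):
        last = min(start + 2 + missed_cleavages, len(cuts))
        for end in range(start + 1, last):
            peptides.append(sequence[cuts[start]:cuts[end]])
    return peptides


def drop_j_peptides(peptides):
    """Stand-in for HeavyJ: a peptide containing J is too heavy to match."""
    return [p for p in peptides if "J" not in p]


# Hypothetical toy sequence: K before J and R before P are not cleaved.
toy = "AAKCCKJDDRPEEKFF"
all_peptides = digest_trypsin_j(toy)
usable = drop_j_peptides(all_peptides)
print(all_peptides)
print(usable)
```

In this toy case the only peptide spanning the J symbol is the one that survives digestion uncut, and it is then excluded, so no peptide is ever formed across a J.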
peptide_scan Installation
peptide_scan was written to rapidly search for each peptide
sequence in a sequence database and output the location and protein
annotation of each occurrence. The peptide_scan source code
is available from ftp://ftp.umiacs.umd.edu/pub/nedwards/peptide_scan.
Compile all of the programs from this tarball and place them in
$MASCOT/bin. Make sure $MASCOT/bin is on your path.
Once compiled, test peptide_scan as follows:
cd $MASCOT/sequence/Varsplic
compress_seq -i uniprot_sprot_vs.1.fasta -n true -D false
peptide_scan -i uniprot_sprot_vs.1.fasta -p CCAAADPHECYAK
which should produce output that looks similar, in format, to:
CCAAADPHECYAK 1 384 397 K V 4203 >ALBU_HUMAN (P02768) Serum albumin precursor
CCAAADPHECYAK 1 376 389 K V 4204 >ALBU_MACMU (Q28522) Serum albumin precursor
...
Preprocessing the original (uncompressed) sequence database using
compress_seq, as above, must be done for each C3 sequence
database installed.
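To make the role of peptide_scan concrete, the following is a naive sketch of the same kind of lookup: find every occurrence of a peptide in a FASTA database and report its position, flanking residues, and defline. It is an assumption-laden illustration, not the peptide_scan algorithm (which relies on the preprocessing done by compress_seq), and the tuple layout it returns is its own rather than the peptide_scan output format. The function names and toy records are hypothetical.

```python
def read_fasta(lines):
    """Yield (defline, sequence) pairs from an iterable of FASTA lines."""
    defline, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line, []
        elif defline is not None:
            chunks.append(line.strip())
    if defline is not None:
        yield defline, "".join(chunks)


def scan_peptide(peptide, records):
    """Return (start, end, left_flank, right_flank, defline) per occurrence.

    Positions are 0-based and end-exclusive; "-" marks a protein terminus.
    """
    hits = []
    for defline, seq in records:
        pos = seq.find(peptide)
        while pos != -1:
            end = pos + len(peptide)
            left = seq[pos - 1] if pos > 0 else "-"
            right = seq[end] if end < len(seq) else "-"
            hits.append((pos, end, left, right, defline))
            pos = seq.find(peptide, pos + 1)
    return hits


# Hypothetical two-protein database.
toy_fasta = [">P1 toy protein one", "MKCCAAK", "VLLK",
             ">P2 toy protein two", "CCAAK"]
for hit in scan_peptide("CCAAK", read_fasta(toy_fasta)):
    print(hit)
```

A linear scan like this is adequate for a handful of peptides; for the thousands of peptides in a Mascot results file, a purpose-built tool such as peptide_scan is the practical choice.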
protein_associate.pl Installation
protein_associate.pl was written to extract the peptide
sequences from the Mascot results, invoke peptide_scan (and
the other programs protein_mw and peptide_mult) to
fill in the missing protein information, and write out the modified
Mascot results. The source code for protein_associate.pl can
be found at ftp://ftp.umiacs.umd.edu/pub/nedwards/protein_associate. This
tarball should be installed in $MASCOT/bin.
protein_associate.pl takes the following command line parameters:
protein_associate.pl [options] [ input-mascot-file [ output-mascot-file ] ]
Options:
-c protein_associate.pl configuration file. Required
-C No (C)lean of temporary files
An example configuration file, protein_associate.cfg, is
provided. The configuration file provides the system information
required to run protein_associate.pl.
The GLOBALS section lists those parameters that are the same,
regardless of the C3 sequence database used.
{
Key: GLOBALS
Path: "C:/INETPUB/MASCOT/bin"
MinMW: 600.0
BufferedIO: false
}
The only parameter that should be changed is Path, which
specifies the location of peptide_scan and the other
programs installed above. In this example, we show the location of
$MASCOT/bin for a Windows installation.
The other sections of the configuration file specify the parameters of
the original sequence database whose protein annotations should be
substituted for those of the C3 sequence database.
{
Key: VarsplicC3
Name: "C:/INETPUB/MASCOT/sequence/Varsplic/uniprot_sprot_vs.1.fasta"
Handle: Varsplic
Acc_RE: "^>([^\s]+) "
Desc_RE: "^>[^\s]+\s+(.*)$"
Sequences: 214029
Amino_Acids: 97639742
}
The Key for these sections is the Mascot database name for the C3
compressed sequence database. Name gives the filename of the
original sequence database, Handle gives the Mascot database
name, Acc_RE gives the (perl) regular expression for
extracting the accession number of the protein from a defline,
Desc_RE gives the (perl) regular expression for extracting
the description of the protein from a defline, Sequences
gives the number of sequences in the original sequence database, and
Amino_Acids gives the size of the sequence portion of the
original sequence database. If compress_seq has been run, as
specified above, on the original sequence database specified in
Name, then the size of the corresponding .sqn file
is the number of amino-acids in the original sequence database, while
the number of lines of the corresponding .hdr is the number
of sequences. Alternatively, get these values from the Varsplic
statistics link on the database status page.
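As a quick sanity check of a configuration section, the two regular expressions can be tried against a sample defline, and the Sequences and Amino_Acids values can be counted directly from the FASTA file. The sketch below assumes the patterns behave the same in Python's re module as in Perl (true for these simple patterns); the function names are hypothetical, and the counts it produces are meant as a cross-check of the .hdr and .sqn figures rather than a replacement for them.

```python
import re

# Patterns copied from the VarsplicC3 section of the example configuration.
ACC_RE = re.compile(r"^>([^\s]+) ")
DESC_RE = re.compile(r"^>[^\s]+\s+(.*)$")


def parse_defline(defline):
    """Return (accession, description); None where a pattern does not match."""
    acc = ACC_RE.search(defline)
    desc = DESC_RE.search(defline)
    return (acc.group(1) if acc else None,
            desc.group(1) if desc else None)


def fasta_stats(lines):
    """Count deflines and residues in an iterable of FASTA lines."""
    sequences = 0
    amino_acids = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            sequences += 1
        else:
            amino_acids += len(line)
    return sequences, amino_acids


print(parse_defline(">ALBU_HUMAN (P02768) Serum albumin precursor"))
# For a real database, pass an open file handle, for example:
#   with open("uniprot_sprot_vs.1.fasta") as handle:
#       print(fasta_stats(handle))
```

If either element of the tuple returned by parse_defline is None for a typical defline, the corresponding Acc_RE or Desc_RE needs adjusting before protein_associate.pl is run.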
Testing
To test the installation, obtain a small dataset of MS/MS
spectra. First, search against Varsplic, using Trypsin and typical
search parameters. Then, search against VarsplicC3, using TrypsinJ and
HeavyJ, and otherwise, the same search parameters as before. Take note
of the Mascot results filename in each case. These files will be in
$MASCOT/data, under a directory named for the date of the
search. Suppose the results file of the search against Varsplic is
F001274.dat and the results file of the search against VarsplicC3 is
F001275.dat. Run
protein_associate.pl as follows:
cd $MASCOT\data\20050301
protein_associate.pl -c protein_associate.cfg F001275.dat > F001275a.dat
The Mascot search results produced, F001275a.dat, will be equivalent
to the search results in F001274.dat. To see these results in the
Mascot web interface, first browse to the search against
VarsplicC3. Find F001275.dat in the web address of the page, and
substitute F001275a.dat for it. Bring up the search against Varsplic
in another window, and compare.
The result of this test on Mascot's test dataset ('A few
peptides from an LCMS run') as static web pages:
General Usage
As described, the installation procedure is intended to be as
non-invasive for Mascot as possible. Clearly, integration with Mascot
could be streamlined considerably, such that the user is not even aware
that C3 sequence databases are being used. So far, this degree of
integration has not been implemented, but it could be carried out easily by
anyone sufficiently experienced with Mascot configuration, Perl, and
web server cgi-bin scripts.
Experimental Results
To test whether or not this approach resulted in significant running
time savings, we conducted the following experiment. We searched the
Institute for Systems Biology and Sashimi Repository's 17 Protein Mix
LC/MS/MS dataset against five protein sequence databases and their C3
compressed counterparts.
The setting for the experiment is as follows:
- ISB/Sashimi 17 Protein Mix LC/MS/MS dataset, consisting of 2043 spectra
- Mascot 2.0
- Dell PC with 512 MB RAM
- Precursor tolerance 2 Da
- Fragment tolerance 0.15 Da
- Up to 2 missed tryptic cleavages
- Sequence databases IPI-HUMAN, Swiss-Prot, Varsplic, UniProt,
UniProt-VS in their original and C3 compressed form.
The only parameter that was varied was the sequence database searched
against. In total, 10 searches were conducted, one against each
sequence database. As above, Varsplic is the result of enumerating all
sequence variants from Swiss-Prot. UniProt is the concatenation of
Swiss-Prot and TrEMBL, while UniProt-VS is the result of enumerating
all sequence variants from Swiss-Prot and TrEMBL.
| Label | Sequence Database | Amino-Acids | Mascot Search (sec) | Peptide Scan (sec) | Protein Associate (sec) | Total Time (sec) |
| 1 | IPI-HUMAN C3 | 14925233 | 147 | 27 | 42 | 189 |
| 2 | IPI-HUMAN | 21742541 | 203 | | | 203 |
| 4 | Swiss-Prot C3 | 54513970 | 515 | 42 | 59 | 574 |
| 5 | Swiss-Prot | 58246755 | 553 | | | 553 |
| 7 | Varsplic C3 | 56651639 | 529 | 54 | 82 | 611 |
| 8 | Varsplic | 92944773 | 807 | | | 807 |
| 10 | UniProt C3 | 373556205 | 3622 | 238 | 354 | 3976 |
| 11 | UniProt | 578246465 | 5586 | | | 5586 |
| 13 | UniProt-VS C3 | 375498473 | 3664 | 225 | 306 | 3970 |
| 14 | UniProt-VS | 613135481 | 5840 | | | 5840 |
Figure: Total search time in seconds, for each of the above sequence databases. Blue represents Mascot running time, green represents peptide_scan running time, and brown represents protein_associate running time minus peptide_scan running time.
Figure: Total search time relative to Mascot search time on original sequence database, for each of the above sequence databases. Blue represents Mascot running time, green represents peptide_scan running time, and brown represents protein_associate running time minus peptide_scan running time.
The Mascot search itself is faster on every C3 database because Mascot's running time is linear in the size of the sequence database, not in the number of distinct 30-mers it contains, which is the same for each pair of sequence databases.
Figure: Running time for Mascot search versus sequence database size.
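The linearity claim and the overall savings can be checked directly from the table. The sketch below copies the amino-acid counts and timings from the table and computes, for each pair, the ratio of C3 total time to original total time, along with the Mascot search cost per amino acid. The dictionary layout and function name are hypothetical; total time is taken as Mascot search time plus protein_associate time, which is how the Total Time column adds up.

```python
# Per database: C3 = (amino acids, Mascot sec, protein_associate sec),
#               original = (amino acids, Mascot sec). Values from the table.
RESULTS = {
    "IPI-HUMAN":  {"c3": (14925233, 147, 42),    "orig": (21742541, 203)},
    "Swiss-Prot": {"c3": (54513970, 515, 59),    "orig": (58246755, 553)},
    "Varsplic":   {"c3": (56651639, 529, 82),    "orig": (92944773, 807)},
    "UniProt":    {"c3": (373556205, 3622, 354), "orig": (578246465, 5586)},
    "UniProt-VS": {"c3": (375498473, 3664, 306), "orig": (613135481, 5840)},
}


def summarize(results):
    """Return total-time ratio and microseconds per amino acid per database."""
    summary = {}
    for name, pair in results.items():
        c3_aa, c3_mascot, c3_assoc = pair["c3"]
        orig_aa, orig_mascot = pair["orig"]
        summary[name] = {
            "total_ratio": (c3_mascot + c3_assoc) / orig_mascot,
            "c3_us_per_aa": 1e6 * c3_mascot / c3_aa,
            "orig_us_per_aa": 1e6 * orig_mascot / orig_aa,
        }
    return summary


for name, row in summarize(RESULTS).items():
    print(f"{name:11s} total C3/original = {row['total_ratio']:.2f}, "
          f"Mascot cost = {row['c3_us_per_aa']:.2f} (C3) vs "
          f"{row['orig_us_per_aa']:.2f} (original) microseconds per amino acid")
```

The per-amino-acid cost stays within a narrow band (roughly 8.7 to 9.9 microseconds) across all ten searches, which is what a linear dependence on database size predicts. The total-time ratio is below 1 for every pair except Swiss-Prot, where the C3 database is only slightly smaller than the original and the post-processing cost outweighs the search savings.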
The only significant difference between each pair of results is in the
Expect column. The same peptide score is given a lower expect value
(i.e. it is more significant) when searched against the C3 database
than when searched against the original database.
Figure: Expect values for best hit peptide (same sequence in each case) against Varsplic and VarsplicC3, where either expect value is <= 0.1.
The reason for this is that there is less redundancy in the C3
database. The expect value is proportional to the number of trials, in
this case, the number of peptide sequences in the sequence database
that match the precursor ion mass. Since there is less repetition of
identical peptide sequences, we have fewer trials with no loss of
information.
Figure: Number of peptide sequences that match each spectrum's
precursor mass when searched against Varsplic and VarsplicC3.
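The proportionality argument can be written out as a short worked relation. Here N is the number of candidate peptide sequences matching the precursor mass and S is the peptide score; g(S) is a placeholder, introduced only for this illustration, for whatever score-dependent factor Mascot applies, which the text does not specify.

```latex
E(S, N) = N \, g(S)
\quad\Longrightarrow\quad
\frac{E_{\mathrm{C3}}}{E_{\mathrm{orig}}}
  = \frac{N_{\mathrm{C3}} \, g(S)}{N_{\mathrm{orig}} \, g(S)}
  = \frac{N_{\mathrm{C3}}}{N_{\mathrm{orig}}} \le 1
```

For the same peptide and the same score, the ratio of the two expect values therefore reduces to the ratio of candidate counts, which is why the C3 search reports the more significant value.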