University of Maryland Nathan Edwards
Center for Bioinformatics and Computational Biology

Sequence Database Compression for Mascot

Introduction

The installation and use of peptide sequence databases with Mascot should be considered a proof of concept. Considerable experience with running and configuring Mascot will be assumed in what follows. Using peptide sequence databases to search MS/MS spectra is broken down into two steps.

  1. Search the C3 compressed form of the sequence database with Mascot.
  2. Associate the peptide sequences with their proteins and insert them into the search results.

The first of these steps requires us to install the C3 sequence databases and configure Mascot to use them. The second requires us to install scripts and programs that read the Mascot results, extract the peptide sequences, search the peptides against the original protein sequence database, and reconstruct the Mascot search results.

Each of these installation steps is outlined next.

Installation

We assume the standard Mascot directory layout, with the Mascot root directory at $MASCOT. This is usually C:\INETPUB\MASCOT for Windows, and /usr/local/mascot for Unix and Linux. Throughout, we will demonstrate the steps required for the specific sequence database Varsplic, which is the result of enumerating all sequence variants from Swiss-Prot.

Note that we assume the availability of Perl, which is required by Mascot.

C3 Sequence Database Installation

The required sequence databases, both the original and the C3 compressed forms, should be downloaded. Each pair of original and C3 compressed sequence databases should be placed in their own directory, in the usual Mascot directory structure.

For Varsplic, the following commands would suffice (shown for a Unix shell with brace expansion, such as bash; the download steps are left as comments):

  mkdir -p $MASCOT/sequence/Varsplic/{current,incoming,old}
  cd $MASCOT/sequence/Varsplic/current
  # download uniprot_sprot_vs.fasta.gz into this directory
  gunzip uniprot_sprot_vs.fasta.gz
  mv uniprot_sprot_vs.fasta uniprot_sprot_vs.1.fasta

  mkdir -p $MASCOT/sequence/VarsplicC3/{current,incoming,old}
  cd $MASCOT/sequence/VarsplicC3/current
  # download uniprot_sprot_vs.cfa.gz into this directory
  gunzip uniprot_sprot_vs.cfa.gz
  mv uniprot_sprot_vs.cfa uniprot_sprot_vs.1.cfa

Next, we configure Mascot to use the C3 compressed Varsplic sequence database (uniprot_sprot_vs.1.cfa). Using the standard database maintenance interface, create a new sequence database with no taxonomy rules, and very basic defline regular expressions. The C3 sequence database deflines consist entirely of a unique "accession" string. In what follows, we assume that the C3 compressed Varsplic sequence database is installed as "VarsplicC3". The regular Varsplic sequence database (uniprot_sprot_vs.1.fasta) should be installed as normal, as Varsplic. Verify that Mascot correctly pre-processes VarsplicC3 and Varsplic, and that its test search completes without error.

Configure Mascot

The C3 sequence databases introduce a non-amino-acid symbol, J, to ensure that Mascot's in silico digestion algorithm does not create false tryptic peptides. To get the correct digest behavior, we must ensure that J is never used in a peptide and that it never defines a tryptic digestion site.

We accomplish the first part of this by providing a modification that sets J's mass to 10 kDa (or more). In $MASCOT/config/mod_file, add the modification:

  Title:HeavyJ
  Residues:J 10000.0 10000.0
  *

We ensure J never marks a tryptic digestion site by creating a new trypsin digestion rule in $MASCOT/config/enzyme:

  Title:TrypsinJ
  Cleavage:KR
  Restrict:PJ
  Cterm
  *

We can now search the C3 compressed sequence database with Mascot, in the same way that the original protein sequence database is searched, so long as we make sure to select TrypsinJ as our digestion enzyme rather than Trypsin, and we set the fixed modification HeavyJ.
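The effect of the two rules can be illustrated with a minimal Python sketch of in silico digestion. This is not Mascot's implementation: the digest function, the toy sequence, and the "contains J" filter (a stand-in for the mass exclusion caused by HeavyJ) are all assumptions made for illustration, and missed cleavages are ignored.

```python
def digest(sequence, cleave_after="KR", restrict=("P",)):
    """Split C-terminal to each residue in cleave_after,
    unless the following residue is listed in restrict."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        is_last = i == len(sequence) - 1
        next_residue = None if is_last else sequence[i + 1]
        if is_last or (residue in cleave_after and next_residue not in restrict):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    return peptides


# Hypothetical C3-style sequence: two fragments joined by the symbol J.
sequence = "AAKCCRJDDKPEEK"

# Plain trypsin rule (restricted only by P): cleaves between R and J.
trypsin = digest(sequence, restrict=("P",))

# TrypsinJ rule (restricted by P and J): never cleaves before J.
trypsin_j = digest(sequence, restrict=("P", "J"))

# Stand-in for HeavyJ: any peptide containing J weighs at least 10 kDa,
# so it cannot match a typical precursor mass and is effectively discarded.
searchable = [p for p in trypsin_j if "J" not in p]

print(trypsin)     # ['AAK', 'CCR', 'JDDKPEEK']
print(trypsin_j)   # ['AAK', 'CCRJDDKPEEK']
print(searchable)  # ['AAK']
```

With the plain rule, the toy digest yields CCR, whose C-terminus is created by J rather than by the underlying protein sequence. With TrypsinJ, that cleavage does not occur, and the remaining J-containing peptide is excluded by its mass.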

peptide_scan Installation

peptide_scan was written to rapidly search for each peptide sequence in a sequence database and output the location and protein annotation of each occurrence. The peptide_scan source code is available from ftp://ftp.umiacs.umd.edu/pub/nedwards/peptide_scan.

Compile all of the programs from this tarball and place them in $MASCOT/bin. Make sure $MASCOT/bin is on your path.

Once compiled, test peptide_scan as follows:

  cd $MASCOT/sequence/Varsplic
  compress_seq -i uniprot_sprot_vs.1.fasta -n true -D false
  peptide_scan -i uniprot_sprot_vs.1.fasta -p CCAAADPHECYAK

This should produce output similar in format to:

  CCAAADPHECYAK 1 384 397 K V 4203 >ALBU_HUMAN (P02768) Serum albumin precursor
  CCAAADPHECYAK 1 376 389 K V 4204 >ALBU_MACMU (Q28522) Serum albumin precursor
  ...

Preprocessing the original (uncompressed) sequence database using compress_seq, as above, must be done for each C3 sequence database installed.
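To make the task performed by peptide_scan concrete, here is a naive Python sketch that reports every occurrence of a peptide in a FASTA database. It is only an illustration of the idea: peptide_scan itself is a compiled program designed for speed, and the function names, the toy records, and the zero-based half-open coordinates used here are assumptions, not a description of its algorithm or output format.

```python
def read_fasta(lines):
    """Yield (defline, sequence) pairs from FASTA-formatted lines."""
    defline, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if defline is not None:
                yield defline, "".join(chunks)
            defline, chunks = line, []
        elif line:
            chunks.append(line)
    if defline is not None:
        yield defline, "".join(chunks)


def scan(peptide, records):
    """Return (start, end, defline) for every occurrence of peptide."""
    hits = []
    for defline, sequence in records:
        pos = sequence.find(peptide)
        while pos != -1:
            hits.append((pos, pos + len(peptide), defline))
            pos = sequence.find(peptide, pos + 1)
    return hits


# Hypothetical two-protein database; the sequences are made up.
toy_fasta = """>PROT_A hypothetical protein A
MKWVTFKCCAAADPHECYAKVFDE
>PROT_B hypothetical protein B
AAAGGG
""".splitlines()

hits = scan("CCAAADPHECYAK", read_fasta(toy_fasta))
for start, end, defline in hits:
    print(start, end, defline)
```

A linear scan like this is adequate for a handful of peptides; searching thousands of peptides against a large database is the case peptide_scan was written to handle quickly.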

protein_associate.pl Installation

protein_associate.pl was written to extract the peptide sequences from the Mascot results, invoke peptide_scan (and the other programs protein_mw and peptide_mult) to fill in the missing protein information, and write out the modified Mascot results. The source code for protein_associate.pl can be found at ftp://ftp.umiacs.umd.edu/pub/nedwards/protein_associate. This tarball should be installed in $MASCOT/bin.

protein_associate.pl takes the following command line parameters:

protein_associate.pl [options] [ input-mascot-file [ output-mascot-file ] ]
Options:
  -c                protein_associate.pl configuration file. Required
  -C                No (C)lean of temporary files

An example configuration file, protein_associate.cfg, is provided. The configuration file provides the system information required to run protein_associate.pl.

The GLOBALS section lists those parameters that are the same, regardless of the C3 sequence database used.

  {
    Key:        GLOBALS
    Path:       "C:/INETPUB/MASCOT/bin"
    MinMW:      600.0
    BufferedIO: false
  }

The only parameter that should be changed is Path, which specifies the location of peptide_scan and the other programs installed above. In this example, we show the location of $MASCOT/bin for a Windows installation.
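For readers who want to inspect or validate a configuration file before running the script, the block format can be read with a few lines of Python. This is a sketch based only on the example blocks shown here; the parser inside protein_associate.pl is written in Perl and may accept syntax this sketch does not. The function name parse_config is hypothetical.

```python
def parse_config(text):
    """Parse '{ Name: value ... }' blocks into a dict keyed by each block's Key."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "{":
            current = {}
        elif line == "}":
            if current is not None:
                sections[current.get("Key")] = current
            current = None
        elif current is not None and ":" in line:
            # Split on the first colon only, so Windows paths such as
            # "C:/INETPUB/..." stay intact in the value.
            name, _, value = line.partition(":")
            current[name.strip()] = value.strip().strip('"')
    return sections


example = r"""
  {
    Key:        GLOBALS
    Path:       "C:/INETPUB/MASCOT/bin"
    MinMW:      600.0
    BufferedIO: false
  }
"""

config = parse_config(example)
print(config["GLOBALS"]["Path"])
```

All values are kept as strings; converting MinMW to a number or BufferedIO to a boolean is left to the caller.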

The other sections of the configuration file specify the parameters of the original sequence database whose protein annotations should be substituted for those of the C3 sequence database.

  {
    Key:         VarsplicC3
    Name:        "C:/INETPUB/MASCOT/sequence/Varsplic/uniprot_sprot_vs.1.fasta"
    Handle:      Varsplic
    Acc_RE:      "^>([^\s]+) "
    Desc_RE:     "^>[^\s]+\s+(.*)$"
    Sequences:   214029
    Amino_Acids: 97639742
  }

The Key for these sections is the Mascot database name for the C3 compressed sequence database. The remaining parameters are:

  • Name: the filename of the original sequence database.
  • Handle: the Mascot database name of the original sequence database.
  • Acc_RE: the Perl regular expression for extracting the accession number of the protein from a defline.
  • Desc_RE: the Perl regular expression for extracting the description of the protein from a defline.
  • Sequences: the number of sequences in the original sequence database.
  • Amino_Acids: the size of the sequence portion of the original sequence database.

If compress_seq has been run, as specified above, on the original sequence database specified in Name, then the size of the corresponding .sqn file is the number of amino acids in the original sequence database, and the number of lines in the corresponding .hdr file is the number of sequences. Alternatively, get these values from the Varsplic statistics link on the database status page.
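The two regular expressions and the two counts can be checked with a short Python sketch before they are written into the configuration file. The patterns below are the ones from the example section (they behave the same way in Python's re module as in Perl for this input). The fasta_stats helper is hypothetical, and it assumes that the amino-acid count is simply the number of non-whitespace characters on non-defline lines; compare its result with the .sqn size or the Mascot statistics page before relying on it.

```python
import re

# Patterns copied from the example configuration section.
ACC_RE = re.compile(r"^>([^\s]+) ")
DESC_RE = re.compile(r"^>[^\s]+\s+(.*)$")


def fasta_stats(lines):
    """Count deflines and residue characters in FASTA-formatted lines."""
    sequences = amino_acids = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            sequences += 1
        else:
            amino_acids += len(line)
    return sequences, amino_acids


# Defline taken from the peptide_scan example output.
defline = ">ALBU_HUMAN (P02768) Serum albumin precursor"
accession = ACC_RE.match(defline).group(1)
description = DESC_RE.match(defline).group(1)
print(accession)    # ALBU_HUMAN
print(description)  # (P02768) Serum albumin precursor

# Hypothetical toy database with made-up sequences.
toy = [">P1 first", "ACDEFG", "HIK", ">P2 second", "LMNP"]
print(fasta_stats(toy))  # (2, 13)
```

For a real database, pass an open file handle instead of the toy list, for example fasta_stats(open(path)) with path set to the file named in Name.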

Testing

To test the installation, obtain a small dataset of MS/MS spectra. First, search against Varsplic, using Trypsin and typical search parameters. Then, search against VarsplicC3, using TrypsinJ and HeavyJ, and otherwise the same search parameters as before. Take note of the Mascot results filename in each case. These files will be in $MASCOT/data, under a directory named for the date of the search. Suppose the results file of the search against Varsplic is F001274.dat and the results file of the search against VarsplicC3 is F001275.dat. Run protein_associate.pl as follows:

  cd $MASCOT\data\20050301
  protein_associate.pl -c protein_associate.cfg F001275.dat > F001275a.dat

The Mascot search results produced, F001275a.dat, will be equivalent to the search results in F001274.dat. To see these results in the Mascot web interface, first browse to the search against VarsplicC3. Find F001275.dat in the web address of the page, and substitute F001275a.dat for it. Bring up the search against Varsplic in another window, and compare.

The results of this test on Mascot's test dataset ('A few peptides from an LCMS run') are available as static web pages.

General Usage

As described, the installation procedure is intended to be as non-invasive for Mascot as possible. Clearly, integration with Mascot could be streamlined considerably, so that the user is not even aware that C3 sequence databases are being used. So far, this degree of integration has not been attempted, but it could be carried out easily by anyone sufficiently experienced with Mascot configuration, Perl, and web server cgi-bin scripts.

Experimental Results

To test whether or not this approach resulted in significant running time savings, we conducted the following experiment. We searched the Institute for Systems Biology and Sashimi Repository's 17 Protein Mix LC/MS/MS dataset against five protein sequence databases and their C3 compressed counterparts.

The setting for the experiment is as follows:

  • ISB/Sashimi 17 Protein Mix LC/MS/MS dataset, consisting of 2043 spectra
  • Mascot 2.0
    • Dell PC with 512 MB RAM
    • Precursor tolerance 2 Da
    • Fragment tolerance 0.15 Da
    • Up to 2 missed tryptic cleavages
  • Sequence databases IPI-HUMAN, Swiss-Prot, Varsplic, UniProt, UniProt-VS in their original and C3 compressed form.

The only parameter that was varied was the sequence database searched against. In total, 10 searches were conducted, one against each sequence database. As above, Varsplic is the result of enumerating all sequence variants from Swiss-Prot. UniProt is the concatenation of Swiss-Prot and TrEMBL, while UniProt-VS is the result of enumerating all sequence variants from Swiss-Prot and TrEMBL.

  Label  Sequence Database  Amino-Acids  Mascot Search (sec)  Peptide Scan (sec)  Protein Associate (sec)  Total Time (sec)
  1      IPI-HUMAN C3          14925233                  147                  27                       42               189
  2      IPI-HUMAN             21742541                  203                   -                        -               203
  4      Swiss-Prot C3         54513970                  515                  42                       59               574
  5      Swiss-Prot            58246755                  553                   -                        -               553
  7      Varsplic C3           56651639                  529                  54                       82               611
  8      Varsplic              92944773                  807                   -                        -               807
  10     UniProt C3           373556205                 3622                 238                      354              3976
  11     UniProt              578246465                 5586                   -                        -              5586
  13     UniProt-VS C3        375498473                 3664                 225                      306              3970
  14     UniProt-VS           613135481                 5840                   -                        -              5840

Figure: Total search time in seconds, for each of the above sequence databases. Blue represents Mascot running time, green represents peptide_scan running time, and brown represents protein_associate running time minus peptide_scan running time.

Figure: Total search time relative to Mascot search time on original sequence database, for each of the above sequence databases. Blue represents Mascot running time, green represents peptide_scan running time, and brown represents protein_associate running time minus peptide_scan running time.

We observe this speedup because Mascot's running time is linear in the size of the sequence database, not the number of distinct 30-mers it contains, which is the same for each pair of sequence databases.

Figure: Running time for Mascot search versus sequence database size.
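The quantities plotted in these figures can be recomputed from the timing table. The following Python sketch does so; the numbers are copied from the table, while the dictionary layout and function names are only for illustration.

```python
# (amino acids, Mascot search seconds, total seconds), copied from the timing table.
TIMINGS = {
    "IPI-HUMAN":  {"c3": (14925233, 147, 189),    "orig": (21742541, 203, 203)},
    "Swiss-Prot": {"c3": (54513970, 515, 574),    "orig": (58246755, 553, 553)},
    "Varsplic":   {"c3": (56651639, 529, 611),    "orig": (92944773, 807, 807)},
    "UniProt":    {"c3": (373556205, 3622, 3976), "orig": (578246465, 5586, 5586)},
    "UniProt-VS": {"c3": (375498473, 3664, 3970), "orig": (613135481, 5840, 5840)},
}


def relative_total(name):
    """Total C3 time (search plus post-processing) over the original search time."""
    return TIMINGS[name]["c3"][2] / TIMINGS[name]["orig"][2]


def seconds_per_million(name, form):
    """Mascot search seconds per million amino acids for one database form."""
    amino_acids, mascot_seconds, _ = TIMINGS[name][form]
    return mascot_seconds / (amino_acids / 1e6)


for name in TIMINGS:
    print(f"{name:11s} relative total {relative_total(name):.2f}  "
          f"sec/M aa: C3 {seconds_per_million(name, 'c3'):.2f}, "
          f"original {seconds_per_million(name, 'orig'):.2f}")
```

On these figures, the Mascot search rate stays between roughly 8.7 and 9.9 seconds per million amino acids across all ten searches, which is consistent with a running time that is linear in database size. The relative total time is lowest where compression removes the most sequence (about 0.68 for UniProt-VS and 0.76 for Varsplic), while for Swiss-Prot, which compresses very little, the post-processing overhead makes the C3 route slightly slower (574 versus 553 seconds).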

The only significant difference between each pair of results is in the Expect column. The same peptide score is given a lower expect value (i.e. it is more significant) when searched against the C3 database than when searched against the original database.

Figure: Expect values for best hit peptide (same sequence in each case) against Varsplic and VarsplicC3, where either expect value is <= 0.1.

The reason for this is that there is less redundancy in the C3 database. The expect value is proportional to the number of trials, in this case, the number of peptide sequences in the sequence database that match the precursor ion mass. Since there is less repetition of identical peptide sequences, we have fewer trials with no loss of information.
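Stated as a formula, and treating the chance probability p(s) of reaching score s in a single trial as the same for both databases (a simplifying assumption for illustration, not Mascot's exact calculation), the proportionality reads:

```latex
E(s) \approx N \, p(s)
\quad\Longrightarrow\quad
\frac{E_{\mathrm{C3}}(s)}{E_{\mathrm{orig}}(s)}
\approx \frac{N_{\mathrm{C3}}}{N_{\mathrm{orig}}} \le 1
```

Here N is the number of peptide sequences in the database that match the precursor ion mass. Removing repeated peptide sequences lowers N for the C3 database, so the same score maps to a smaller expect value.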

Figure: Number of peptide sequences that match each spectrum's precursor mass when searched against Varsplic and VarsplicC3.

Original created by John Fuetsch
Questions and comments to Nathan Edwards