
WO2004070643A2 - Method for predicting protein function - Google Patents

Info

Publication number
WO2004070643A2
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
matches
peptide
tags
protein
Prior art date
Application number
PCT/IB2004/000757
Other languages
French (fr)
Other versions
WO2004070643A3 (en
Inventor
Andrej Shevchenko
Shamil Sunyaev
Adam Liska
Anna Shevchenko
Alexander Golod
Peer Bork
Original Assignee
European Molecular Biology Laboratory
Max Planck Gesellschaft Zur Foerderung Der Wissens
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by European Molecular Biology Laboratory, Max Planck Gesellschaft Zur Foerderung Der Wissens filed Critical European Molecular Biology Laboratory
Publication of WO2004070643A2 publication Critical patent/WO2004070643A2/en
Publication of WO2004070643A3 publication Critical patent/WO2004070643A3/en

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids

Definitions

  • This invention is in the field of bioinformatics and specifically relates to the use of mass spectrometry in the prediction of function of proteins, whose sequences are not present in protein, EST (Expressed Sequence Tags) or genomic databases.
  • the software correlates the observed masses with theoretically predicted masses derived from peptide sequences produced by in silico digestion of protein database entries, and calculates the statistical significance of hits.
  • the software does not require a full representation of the fragment ions in the tandem mass spectrum and can positively identify the peptide even if only some of the fragment ions are matched.
  • the significance of hits increases if more fragment ions are detected and if more than one peptide sequence originating from the same database entry was recognised. Therefore, conventional database mining software is biased towards exact matching of spectra to catalogued sequences and in practice is mostly applied to the identification of proteins already residing in databases. Therefore proteomics is largely limited to organisms with sequenced genomes, despite the fact that phylogenetically related organisms share significant molecular homology and that extensive protein sequence information may be available from related species.
  • the proteomes of organisms whose genomes are unsequenced may be analysed using correlation methods based on sequence similarity (reviewed in 5).
  • FASTS modified FASTA (6)
  • MS BLAST BLAST (8)
  • homology searching algorithms which allow the mass spectrometric identification of a sizeable proportion of proteins sharing more than 50% of the sequence identity to their closest neighbours in a database (7, 10).
  • MS BLAST can use redundant, degenerate and inaccurate sequences produced by automated interpretation of tandem mass spectra (9) and can be linked to high throughput protein characterisation techniques such as LC-MS/MS through a simple scripting interface (11).
  • both the MS BLAST and FASTS methods provide independent means of evaluating the statistical significance of hits, and therefore it is not necessary to compare retrospectively the matched peptide sequences with actual tandem mass spectra to rule out false positive hits.
  • Tandem mass spectra are inherently deficient, i.e. peptide sequence fragments are often underrepresented.
  • spectra often display ions that originate from fragmentation of side chains of amino acid residues and are not accounted for by typical scoring schemes. In femtomole sequencing, low peptide content and high chemical noise commonly allow only a few informative fragment ions to be detected in MS/MS spectra; software-assisted interpretation then cannot produce credible peptide sequence proposals, and sequence similarity identification will likely yield a false negative result.
  • the peptide sequence tag approach for error-tolerant database searching developed by Mann and Wilm in 1994 (12) helps to overcome these limitations.
  • the sequence tag utilizes a short sequence (2-4 amino acid residues), which can be easily determined from low energy CID spectra acquired from multiply charged precursors, and a pair of masses that flank the determined sequence in the full length peptide sequence. These masses are the combined mass of all the amino acid residues between the N-terminus of the tryptic peptide and the short sequence, and similarly, the mass of the amino acid residues between the short sequence and the C- terminus of the tryptic peptide. In stringent database searches both the masses and the short sequence are required to match.
  • sequence tags are employed in protein, EST (13) and genomic database mining, however no evaluation of the significance of matches is provided in these searches. Therefore, even if a single hit was retrieved upon database searching, the match between the tandem mass spectra and the corresponding database hit has to be verified retrospectively by manual inspection. Error tolerant searching using this method can be carried out by allowing the sequence or one of the masses to mismatch. This approach allows cross species identifications in protein sequence databases (15). However, allowing mismatches typically results in many hundreds of hits being retrieved and manual inspection of them is tedious.
  • the present invention provides,
  • a ranked protein is dependent on the number of matches between the peptide sequence tags and the protein and the level of degeneracy of the matches as compared to the expected number of sequences from a random database that would match the same combination of tags with the same level of degeneracy or with a more specific combination of tags.
  • the method of the present invention automatically analyses the results of an error-tolerant database search, reveals proteins to which multiple fragmented peptides are matched in an error- tolerant fashion and computes the statistical significance of those matches allowing the discrimination of true matches from false positives.
  • This method improves over previous methods for the identification of proteins of unknown sequence, since it automatically generates a ranked list of proposed identities without the need for any manual annotation of the results.
  • Error tolerant searching is used to implement a statistical evaluation of matching multiple partial sequence tags in the identification of proteins with unsequenced genomes.
  • This approach, herein referred to as "Multitag", enables the identification of distantly related proteins by sequence-similarity searching using only very short stretches of peptide sequence retrieved from tandem mass spectra. This is therefore a simplified and sensitive method of exploring the proteomes of organisms with unsequenced genomes.
  • the method may preferably be used to search protein databases, although EST databases can be searched in similar fashion.
  • Sequence tag searching is a recognized method for the identification of proteins in an EST database (28).
  • Peptide sequence tags can be matched to putative protein sequences obtained by a six frame translation of cDNA sequences from EST entries.
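The six-frame translation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the software used in the patent: it assumes the standard genetic code, plain A/C/G/T input, and the hypothetical helper names `translate` and `six_frame_translate`. Codons containing any other character are reported as "X".

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna):
    """Return the three forward and three reverse-complement translations."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (dna, reverse)
        for offset in range(3)
    ]


frames = six_frame_translate("ATGAAATAG")
```

Each EST entry therefore yields six hypothetical protein sequences, of which at most one is the correct translation.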
  • proteins are generally separated out on a one- or two-dimensional polyacrylamide gel. They are then excised and digested in-gel as previously described (16). Peptide samples are prepared using techniques well established in the art and not repeated herein in detail. Briefly, enzymatic digestion may be used to cleave the amide backbone of a protein. Preferably, the enzymatic digestion includes treatment of a protein with an enzyme that cleaves with high specificity. Such enzymes include trypsin, which cleaves at the C-terminus of Arg or Lys residues; endoproteases such as Lys-C or Arg-C.
  • Proteases with lower primary specificity could also be applied provided they produce peptides in 500 - 2000 Da mass range from protein substrates.
  • Bovine or porcine trypsin is a preferred enzyme to use in this stage of the method.
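As an illustration of the cleavage rule described above, the following sketch performs an in silico tryptic digest. It is an assumption-laden simplification: it cleaves after every Lys (K) or Arg (R) exactly as stated in the text, does not model any exception to that rule, and the function name `tryptic_digest` and its `missed_cleavages` parameter are hypothetical.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """Cleave after every K or R and optionally join adjacent pieces."""
    pieces, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Trailing piece that does not end in K or R (protein C-terminus).
        pieces.append(protein[start:])

    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


fully_cleaved = tryptic_digest("AAKGGRCC")
one_missed = tryptic_digest("AAKGGRCC", missed_cleavages=1)
```

With `missed_cleavages=1` the digest also reports peptides spanning one uncleaved site, which mirrors the "one miscleavage allowed" setting used in the Mascot searches described later in the document.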
  • the peptide fragments are analysed by mass spectrometry.
  • mass spectrometry analysis may be carried out such as matrix-assisted laser desorption (MALDI), electrospray ionisation (ESI) implemented as nanoelectrospray method (26) or coupled on-line with the mass spectrometer (LC MS/MS) and related methods (e.g. Ionspray, Thermospray).
  • MALDI matrix-assisted laser desorption
  • ESI electrospray ionisation
  • LC MS/MS mass spectrometer
  • Ionspray Ionspray, Thermospray
  • these ion sources can be matched with any instrument configuration having tandem mass spectrometric capacity, such as triple quadrupole, Fourier transform ion cyclotron resonance (FTICR), ion trap, or combinations of these to give a hybrid instrument (e.g.
  • FTICR Fourier transform ion cyclotron resonance
  • In tandem MS/MS instruments, the peptide sample, usually in the form of a tryptic protein digest, is typically injected into a first mass analyzer to yield a mass spectrum of the ions present in the mixture ('normal' mass spectrum). Any ion (i.e., the precursor or parent ion) can then be channelled selectively into a collision cell in which fragment ions are generated by collision with neutral gas.
  • the fragment ions are then moved into a second mass analyzer, to yield a mass spectrum for the fragment ions.
  • An example of such an instrument is the orthogonal quadrupole time-of-flight mass spectrometer (QqTOF). In ion trap mass spectrometers, isolation and fragmentation of precursor ions are performed in the same chamber (the ion trap), and the instrument does not have a separate mass analyser for detecting fragments.
  • the analytical routine is similar to the one described above for the instruments with multiple mass analysers and consists of detection of all ions in the sample, followed by selective isolation and fragmentation of the precursors and detection of yielded peptide fragments.
  • In step i) of the method, peptide sequence tags identified from said peptide fragments are used to search a protein database in an error-tolerant manner for known protein sequences that match said peptide sequence tags.
  • This methodology is a development of the method of Mann (12).
  • Peptide sequence tags are constructed based on raw mass spectra (see Figure 2). In the case of spectra acquired using ESI on triple quadrupole or quadrupole TOF machines, sequence tags are typically identified from the high m/z region of tandem mass spectra of tryptic peptides, which are dominated by abundant y-ions. In the interpretation of MS/MS spectra acquired on ESI ion traps or with MALDI ionisation sources (when singly charged ions are fragmented), series of b- ions or a-ions should also be considered (see 27 for fragment ion nomenclature). The sequence tags consist of three sections.
  • the first part, m_N, is the summed mass of the residues between the N terminus of the peptide and the determined sequence.
  • the second part, m_A, is the determined sequence.
  • the third part, m_C, is the mass of all amino acid residues between the determined sequence and the C terminus of the peptide sequence tag.
  • sequence tags are assembled for as many fragmented tryptic peptides as possible and are compared with peptide sequences in a database. This may be done by the interpretation of tandem mass spectra using computer software, for example, the BioAnalyst QS (Applied Biosystems, CA). Sequence databases are then searched by computer using software such as PepSea (part of the BioAnalyst QS package), with different mass tolerances for fragment and precursor ions. Searches are performed against a comprehensive database, and no constraints on protein molecular weight, pI or species of origin are applied.
  • the masses are used for searching a database in both a stringent fashion (i.e. all three regions match) and in an error-tolerant, non-stringent manner (where a mismatch is tolerated in one or more of the three regions m_N, m_A or m_C).
  • Matches are preferably labelled by the mass of the precursor ion and by the matching region, wherein the matching region is abbreviated as NC for a search result with a completely matching tag; N for a search result with m_N and m_A matching; E for a search result with one amino acid error; C for a search result with m_A and m_C matching.
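The stringent and error-tolerant match types (NC, N, C, E) can be illustrated with a small sketch. It follows the definitions given above, treating m_N and m_C as summed residue masses flanking the determined sequence. The function names, the default 0.05 Da tolerance and the use of unmodified monoisotopic residue masses are assumptions made for illustration; this is not the PepSea implementation.

```python
# Monoisotopic residue masses in Da (unmodified residues).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residue_mass_sum(sequence):
    return sum(RESIDUE_MASS[residue] for residue in sequence)


def classify_match(peptide, m_n, tag_sequence, m_c, tolerance=0.05):
    """Return the most specific label (NC, N, C, E) or None if no match."""
    found = set()
    k = len(tag_sequence)
    for start in range(len(peptide) - k + 1):
        window = peptide[start:start + k]
        errors = sum(a != b for a, b in zip(window, tag_sequence))
        n_ok = abs(residue_mass_sum(peptide[:start]) - m_n) <= tolerance
        c_ok = abs(residue_mass_sum(peptide[start + k:]) - m_c) <= tolerance
        if errors == 0 and n_ok and c_ok:
            found.add("NC")   # all three regions match
        elif errors == 0 and n_ok:
            found.add("N")    # m_N and sequence match
        elif errors == 0 and c_ok:
            found.add("C")    # sequence and m_C match
        elif errors == 1 and n_ok and c_ok:
            found.add("E")    # one amino acid error, both masses match
    for label in ("NC", "N", "C", "E"):
        if label in found:
            return label
    return None


m_n = residue_mass_sum("GAS")
m_c = residue_mass_sum("K")
label = classify_match("GASLFMK", m_n, "LFM", m_c)
```

Relaxing one region at a time is what makes the search error-tolerant, at the cost of many more candidate hits for the statistical step to evaluate.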
  • Masses of precursor ions measured with high accuracy enable the identification of known proteins or proteins highly homologous to known proteins by peptide mass fingerprinting.
  • confidence in the result can be confounded by the presence of modified amino acids or if sequences of the analysed unknown protein and homologous known proteins are more diverged.
  • a single tryptic peptide may not be completely identical between two protein sequences even at the level of 75% of full length sequence identity, although the identity of relatively short sequence stretches frequently occurs (see Figure 1). If the search is widened by altering these constraints, then more hits are obviously produced. However, a number of these are not significant. Prior to the present invention, they must therefore be analysed by hand to determine which are more relevant.
  • The full list of hits is then input into step ii) without any further analysis. For example, redundant hits which match the same peptide sequence in another database entry are retained in the list of matches, as well as redundant matches between a peptide sequence tag and the same peptide in a protein entry.
  • a database search hit is a protein sequence, which completely or partially matches some sequence tags in a degenerate or non- degenerate manner.
  • E- values give the expected number of sequences from a random database, which would match the same combination of tags with the same level of degeneracy or with a more specific combination of tags.
  • a more specific combination of tags is a result of either a higher number of tags matched or a lower degeneracy of the matches.
  • hits are ranked according to their E- values, which, in turn, depend on the number of matches between the peptide sequence tags and the protein and the level of degeneracy of the matches as compared to the expected number of sequences from a random database that would match the same combination of tags with the same level of degeneracy or with a more specific combination of tags.
  • step ii) the frequency of matches is compared to:
  • a statistical significance is assigned by multiplying the probability of step c) by the total number of protein sequences in the database.
  • the peptide partially obeys the cleavage condition of the proteolytic enzyme.
  • the amino acid preceding the N-terminal amino acid residue is arginine or lysine, if trypsin was applied to digest the protein.
  • no C-terminal cleavage specificity would be requested, since C-terminal part of the peptide is not required to match. This improves specificity of searches and facilitates the correct identification of peptide fragments.
  • the significance of hits may preferably be calculated by the following process:
  • a given sequence tag has an N-terminal mass m_N, three amino acids a_1, a_2, a_3 and a C-terminal mass m_C (although the sequence tag may comprise any number of amino acid residues).
  • the probability that a random tryptic peptide would match this tag in a non-degenerate manner would be given as a product of the following three probabilities.
  • Second, the probability that this fragment of random peptide has amino acids a_1, a_2 and a_3.
  • the mass M of the random tryptic peptide can be regarded as an accumulated sum of masses of randomly generated amino acids where
  • Random values represented as successive sums of positive identically distributed values as shown in equation (1) are known in probability theory as a renewal process (18). Masses of randomly generated amino acids obey the probability distribution determined by amino acid frequencies, so that the probability that the random mass would be exactly m equals the combined frequency of amino acids of mass m and is given as p(m). Then, the distribution of the mass accumulated after n+1 steps (the probability that the peptide fragment of length n+1 would have a mass smaller than t) is calculated as:
  • the distribution of the total mass of the tryptic peptide (the probability that the peptide's total mass would not exceed t) is given by allowing for all possible lengths of the peptide:
  • the matching of the sequence tag is considered as a sequence of three consecutive, independent events, that is, the matching of the first mass, the matching of the short sequence stretch and the matching of the second mass.
  • the N-terminal mass is the first mass to match.
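The renewal-process view of accumulated residue masses can be made concrete with a short dynamic-programming sketch. It is a simplified illustration and not the patent's formula: it assumes nominal (integer) residue masses, uniform amino acid frequencies chosen purely for the example, and an unbounded random sequence (the termination of tryptic peptides at Lys or Arg is ignored). All names are hypothetical.

```python
NOMINAL_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L": 113, "I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def mass_step_distribution(frequencies):
    """p(m): combined frequency of all amino acids with nominal mass m."""
    p = {}
    for amino_acid, frequency in frequencies.items():
        m = NOMINAL_MASS[amino_acid]
        p[m] = p.get(m, 0.0) + frequency
    return p


def renewal_probabilities(p, max_mass):
    """u[t]: probability that some N-terminal fragment has mass exactly t."""
    u = [0.0] * (max_mass + 1)
    u[0] = 1.0  # the empty fragment
    for t in range(1, max_mass + 1):
        u[t] = sum(prob * u[t - m] for m, prob in p.items() if m <= t)
    return u


def prob_mass_in_window(u, centre, delta):
    """Probability that an N-terminal fragment mass lies in [centre-delta, centre+delta].

    Summing u is exact only while the window is narrower than the lightest
    residue (57), so that at most one fragment can fall inside it.
    """
    low, high = max(0, centre - delta), min(len(u) - 1, centre + delta)
    return sum(u[low:high + 1])


uniform = {amino_acid: 1 / 20 for amino_acid in NOMINAL_MASS}  # illustrative only
p = mass_step_distribution(uniform)
u = renewal_probabilities(p, 2000)
```

In practice the frequencies would be taken from the database being searched, and a pre-computed table of this kind is what makes the per-tag probabilities cheap to evaluate.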
  • the probability that the mass of any N-terminal fragment of the peptide would be in the interval (m_N - Δm, m_N + Δm) is given by:
  • $P_{\text{non-degenerate}} = P(m_N, \Delta m)\, f(a_1)\, f(a_2)\, f(a_3)\, P(m_C, \Delta m)$
  • the additional multiplier 1/q(1-q) reflects the fact that zero length tryptic peptides are not considered. Therefore only peptides with a length of one or more are considered, which give a real number. Examples of probabilities for degenerate matches are given by:
  • the probability that a random protein sequence containing K tryptic peptides would match multiple sequence tags is calculated, taking into account the tags being matched and the type of degeneracy of the match. For instance, if there were three sequence tags and the random sequence simultaneously matched sequence tag 1 with an error in the C-terminal mass, sequence tag 2 with an error in the N-terminal mass and sequence tag 3 with a mismatch at the second identified amino acid, the probability of the event would be given by equation (8):
  • This example shows how to compute the probability that a random amino acid protein sequence would match an arbitrary combination of sequence tags.
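A rough sketch of how per-peptide tag probabilities might be combined for a random protein is given below. It is not equation (8): it assumes that the K tryptic peptides of a random entry match independently, that different tags match independently of each other, and it omits the sum over more specific tag combinations that the E-value definition requires. The function names are hypothetical.

```python
def protein_match_probability(tag_probabilities, peptides_per_entry):
    """Probability that a random entry with K peptides matches every listed tag.

    tag_probabilities holds, for each matched tag, the probability that one
    random tryptic peptide matches it at the observed level of degeneracy.
    """
    result = 1.0
    for p in tag_probabilities:
        # At least one of the K peptides matches this tag.
        result *= 1.0 - (1.0 - p) ** peptides_per_entry
    return result


def expected_random_matches(probability, database_entries):
    """Expected number of random database entries reaching this probability."""
    return probability * database_entries


p_protein = protein_match_probability([2e-5, 5e-4, 1e-4], peptides_per_entry=41)
expected = expected_random_matches(p_protein, database_entries=1_000_000)
```

The tag probabilities and database size in the usage lines are made-up numbers chosen only to exercise the functions.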
  • the observed match has a probability p (given by eq. 8).
  • the E-value is given by the product of number of sequences in the database and the probability that a more or equally specific combination of tags would match a random sequence. This is given by the equation:
  • Software implementation of the method of the invention preferably uses the pre-computed distribution function F(t).
  • the software imports sequence tags in the conventional format (mc)a ⁇ ...a n (m. ⁇ ) (12) and computes probabilities for each tag to match a random tryptic peptide.
  • the software also imports a full list of hits produced by multiple degenerate and non-degenerate sequence tag searches and identifies hits corresponding to the same protein. For each hit the probability of the match is calculated (similar to equation 8), then all tag combinations giving the same or higher probability are identified, and the hit is assigned an E-value. Finally the hits are sorted according to E-value.
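The grouping step described above, in which hits from many searches are collected per protein, could look roughly like this. The tuple layout, the tag-code format (precursor mass followed by NC, N, C or E) and the function name are assumptions made for illustration; the E-value calculation itself is not shown.

```python
from collections import defaultdict


def multi_tag_proteins(hits, min_tags=2):
    """Group hits by accession and keep proteins matched by several spectra.

    hits: iterable of (tag_code, peptide_sequence, accession), where tag_code
    is the precursor mass followed by the match label, e.g. "1351.63NC".
    """
    precursors_by_protein = defaultdict(set)
    for tag_code, _sequence, accession in hits:
        # Strip the trailing label so NC/N/C/E hits of one spectrum count once.
        precursors_by_protein[accession].add(tag_code.rstrip("NCE"))

    selected = [
        (accession, sorted(precursors))
        for accession, precursors in precursors_by_protein.items()
        if len(precursors) >= min_tags
    ]
    # Proteins supported by the most distinct spectra come first.
    selected.sort(key=lambda item: (-len(item[1]), item[0]))
    return selected


hits = [
    ("1351.63NC", "AAA", "P1"),
    ("1351.63N", "AAA", "P1"),
    ("980.50E", "BBB", "P1"),
    ("980.50C", "CCC", "P2"),
]
candidates = multi_tag_proteins(hits)
```

Counting distinct precursor masses rather than raw hits avoids rewarding a protein merely because one spectrum matched it in several search modes.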
  • the method also relies upon a designated size of the database searched for producing a probability of a random match. This is straightforward with protein database searches because this is a designated number of database entries. For EST database searching, all nucleotide sequences or the query must be translated in six frames, generating additional erroneous hypothetical sequence; only one frame is the correct translation; therefore the number of entries is multiplied by 6 to account for this degeneracy.
  • sequence tags are not expected to match any single protein in the database even if it is a true homologue of one of the proteins present in the mixture, because tags representing other components of the mixture are irrelevant to this protein.
  • sequence tags in the query would correspond to different proteins. Therefore, any particular true hit (a protein being a component of the mixture) has no chance of being matched by all (or even most of) the tags.
  • PredCount predicted count
  • the ranking order is the same as the ranking order for E-values.
  • PredCount does not reflect the expected number of false positives when the entire query is searched against a database. Contrary to E-values, PredCount values only weakly depend on the number of tags in the query. Therefore in cases where most of the tags in the query were not matched, hits with PredCount values lower than around 10^-4 should be manually inspected, even if the E-values only have marginal significance.
  • the invention provides a computer apparatus adapted to perform a method according to any one of the aspects of the invention described above.
  • said computer apparatus may comprise a processor means incorporating a memory means adapted for storing data relating to amino acid or nucleotide sequences; means for inputting data relating to a plurality of protein or nucleic acid sequences; and computer software means stored in said computer memory that is adapted to perform a method according to any one of the aspects of the invention described above.
  • the invention also provides a computer-based system for identifying a protein of unknown sequence comprising means for inputting data relating to the masses of peptides generated by mass spectrometry; means adapted to search a protein database in an error-tolerant manner for known protein sequences that match said peptide sequence tags; means adapted to assign statistical significance to identified matches; means adapted to rank identified matches by their significance; and means for outputting a list of matching proteins ranked according to significance.
  • the system of this aspect of the invention may comprise a central processing unit; an input device for inputting requests; an output device; a memory; and at least one bus connecting the central processing unit, the memory, the input device and the output device.
  • the memory should store a module that is configured so that upon receiving a request to identify a protein of unknown sequence, it performs the steps listed in any one of the methods of the invention described above.
  • data may be input by downloading the MS data from a local site such as a memory or disk drive, or alternatively from a remote site accessed over a network such as the Internet.
  • the sequences may be input by keyboard, if required.
  • the generated list of matching proteins may be output in any convenient format, for example, to a printer, a word processing program, a graphics viewing program or to a screen display device. Other convenient formats will be apparent to the skilled reader.
  • the means adapted to search a protein database in an error-tolerant manner for known protein sequences that match said peptide sequence tags; means adapted to assign statistical significance to identified matches will preferably comprise computer software means.
  • any number of different computer software means may be designed to implement this teaching.
  • a computer program product for use in conjunction with a computer, said computer program comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising a module that is configured so that upon receiving a request to identify a protein of unknown sequence, it performs the steps listed in any one of the methods of the invention described above.
  • the invention provides software for the high throughput automated analysis of mass spectrometry data of peptide samples.
  • Figure 1 shows regions of partial identity between Human and Alligator alcohol dehydrogenase. Partial protein amino acid sequences for alcohol dehydrogenase are aligned above from human and alligator (75% identity). Regions alignable by error-tolerant sequence tags between the two sequences are highlighted in gray. These regions are theoretical tryptic peptides over six amino acids in length with more than three conserved amino acids from the N-terminus or more than four conserved amino acids from the C-terminus. Tryptic cleavage sites designated above are shared between both sequences. Tryptic cleavage sites not at the same point on the sequences are not designated by spaces; sites do not occur in the gray regions. Accession numbers: human, P00325; alligator, AAB28120. The sequences were aligned using the Clustal X program.
  • Figure 2 shows the type of raw mass spectra that are used to construct peptide sequence tags.
  • the MultiTag approach consists of constructing sequence tags from peptide tandem mass spectra, error-tolerant database searches, and sorting and calculation of the significance of multiple error- tolerant sequence tag alignments by the MultiTag software.
  • Panel 1 shows a tandem mass spectrum of a low abundance peptide with an overlaid sequence tag.
  • Panel 2 shows one complete and three error-tolerant sequence tag database searches, which is done for each MS/MS spectrum and corresponding sequence tag.
  • Panel 3 shows the combined list of search results (most of the 8000 entry list is not shown) from all spectra and all searches in the analysis of a single sample;
  • Tag Mass column indicates the tag's parent mass followed by an "NC" for search results with complete tags, an "N" for searches with tag regions 1 and 2, an "E" for searches with tags with one amino acid error, or a "C" for searches with tag regions 2 and 3;
  • Sequence column is the retrieved sequence found from the database search;
  • Mass column indicates the total mass in kDa of the protein from which the peptide originated; "DB Accession" the protein's accession number; "Protein name"; "Species".
  • Panel 4 shows the MultiTag output; "Tag Mass" column lists the tag-search code for the tags aligned; "Sequence" lists all of the full peptide sequences error-tolerantly aligned; "Mass" to "Species" are the same as in Panel 3; "E-values" for the probability of the alignment of the group of sorted sequence tags. An additional column, 'Predicted Counts', reflecting the number of expected random matches of a given combination of tags, is not provided here for the sake of presentation clarity.
  • Figure 3 shows the spectra for Xenopus proteins.
  • Xenopus proteins were in-gel digested and analyzed by nanoelectrospray tandem mass spectrometry. MS spectrum peaks labeled with a "*" were fragmented and peptide sequence tags were constructed from MS/MS spectra (inset).
  • Figure 4 shows a peptide tandem mass spectrum with a sequence tag; the spectrum was acquired from the doubly charged precursor m/z 676.82 on a quadrupole time-of-flight mass spectrometer.
  • a sequence tag was made using the fragment y ion series in the m/z region higher than the precursor.
  • the sequence tag is (791.4143)LFM(1182.6073), precursor mass 1351.63.
  • MassN, the amino acid sequence, and MassC correspond to regions 1, 2, and 3 of the sequence tag, respectively.
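The tag notation used in this example, (791.4143)LFM(1182.6073), can be parsed with a few lines of code. The sketch below is only illustrative: the regular expression, the function names and the 0.01 Da tolerance are assumptions. It also checks an arithmetic property of this particular tag, namely that the two bracketed values differ by the summed monoisotopic residue masses of L, F and M (about 391.193 Da), as expected for consecutive ions of one fragment series.

```python
import re

TAG_PATTERN = re.compile(r"^\((\d+(?:\.\d+)?)\)([A-Z]+)\((\d+(?:\.\d+)?)\)$")

# Monoisotopic residue masses in Da for the residues used in the example.
RESIDUE_MASS = {"L": 113.08406, "F": 147.06841, "M": 131.04049}


def parse_sequence_tag(text):
    """Split '(mass)SEQ(mass)' into (first_mass, sequence, second_mass)."""
    match = TAG_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a sequence tag: {text!r}")
    return float(match.group(1)), match.group(2), float(match.group(3))


def tag_is_consistent(first_mass, sequence, second_mass, tolerance=0.01):
    """True if the mass difference equals the summed residue masses."""
    expected = sum(RESIDUE_MASS[residue] for residue in sequence)
    return abs((second_mass - first_mass) - expected) <= tolerance


first, sequence, second = parse_sequence_tag("(791.4143)LFM(1182.6073)")
consistent = tag_is_consistent(first, sequence, second)
```

A check of this kind is a cheap way to catch transcription errors in tags before they are submitted to a database search.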
  • The method of the invention (Multitag) was used to identify proteins from Xenopus laevis, which is, among other things, an important model for the study of the cell cycle (reviewed in 20), DNA replication (21) and developmental biology (22). Fewer than 7000 Xenopus proteins have been sequenced despite a public initiative in EST sequencing. There were around 221000 largely unannotated ESTs available as of 16th August 2002 (taken from http://www.ncbi.nlm.nih.gov). This does not provide much coverage of the 3070 megabase pseudotetraploid genome (23).
  • MS BLAST identified three, however all five were identified by MultiTag. Importantly, in three cases both methods identified homologous sequences from the same organism or from different species, providing an independent validation of the MultiTag approach.
  • the "MTSearch" script was developed automatically to generate a list of database search results from a list of sequence tags.
  • Tags were used for searching a database in a stringent fashion (matching regions 1, 2 and 3, see figure 4) and in an error-tolerant fashion: a search tolerating a mismatch of the C-terminal mass (matching regions 1 and 2); a search tolerating a mismatch of the N-terminal mass (matching regions 2 and 3); and searches tolerating one mismatch in the amino acid sequence (matching regions 1 and 3); the hits were additionally encoded by the mass of the precursor ion and by the abbreviated matching region (NC, N, C, or E, respectively) in the sequence tag and compiled in a list for submission to MT.
  • NC, N, C, or E abbreviated matching region
  • MultiTag is a stand-alone application on the MS Windows platform. MultiTag code was written using C++ language with Microsoft Visual C++ and Microsoft Foundation Classes (Microsoft Inc. CA). The existing MultiTag software was modified so the average number of tryptic peptides per database entry could be specified. The average protein length in a non-redundant database was previously determined to be 492 amino acids (corresponding to ~60 kD). The average length of a tryptic peptide was designated at 12 amino acids, setting the average number of tryptic peptides per database entry at 41. Since the average length of an EST entry codes for 166 amino acids (EST_others, Nov. 27, 2002, NCBI), this number was divided by 12 and the value for EST DB searching was set at 14.
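The parameter arithmetic in this paragraph, together with the six-frame database size used for the EST searches in this document (6 x 232,755 entries), can be reproduced in a few lines. The function names are hypothetical; only the numbers come from the text.

```python
def peptides_per_entry(average_entry_length, average_peptide_length=12):
    """Average number of tryptic peptides per database entry, rounded."""
    return round(average_entry_length / average_peptide_length)


def effective_database_size(entries, reading_frames=1):
    """Number of hypothetical sequences searched (6 frames for ESTs)."""
    return entries * reading_frames


protein_setting = peptides_per_entry(492)           # protein database
est_setting = peptides_per_entry(166)               # EST database
est_database_size = effective_database_size(232_755, reading_frames=6)
```

Both settings enter the significance calculation: fewer peptides per entry lowers the chance of a random match per entry, while six-frame translation multiplies the number of entries that can produce one.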
  • Xenopus laevis proteins (provided by Andrei Popov, EMBL, Heidelberg) were isolated and resolved by one-dimensional gel electrophoresis, excised from two lanes containing similar biochemical preparations, and in-gel digested with the protease trypsin as previously described (16). Extracted peptides were analyzed by nanoelectrospray tandem mass spectrometry on a modified QSTAR Pulsar i quadrupole time-of-flight (QqTOF) instrument (MDS Sciex, Concord, Canada), using uncoated borosilicate glass capillaries (1.2 mm O.D. x 0.69 mm I.D.) from Harvard Apparatus Ltd (capillaries were drawn in-house on a Sutter P-97 puller). Database Searching.
  • QqTOF QSTAR Pulsar i quadrupole time-of-flight
  • Mascot (17) queries were generated from tandem mass spectra using the processing script Mascot v.1.6b2 as an extension of BioAnalyst QS software (Applied Biosystems). Spectra were centroided and peaks were merged at 0.05 Thomsons, and peak lists contained mass values from peaks >2% base peak.
  • Database searches with Mascot were performed on an internal server with a precursor mass tolerance of 0.1 Da and a fragment ion mass tolerance of 0.05 Da, default precursor charge states were set at +2 and +3, with trypsin enzyme specificity, one miscleavage allowed, variable methionine oxidation, fixed carboxyamidomethyl cysteine, instrument type set at default, and no restrictions for protein molecular weight, but restricted to database entries from the species Xenopus laevis.
  • the Mascot identifications were made using the Peptide Summary Report for enhanced sensitivity.
  • MultiTag searches were performed using the PepSea software as a part of BioAnalyst QS, with a precursor mass tolerance of 0.1 Da and a fragment ion mass tolerance of 0.05 Da, with trypsin enzyme specificity, and fixed carboxyamidomethyl cysteine. Search results were analyzed with the MT software described above. MT parameters were set: 1,396,530 database entries searched (6 frames × 232,755 Xenopus laevis EST entries), 0.1 Da mass accuracy, and 14 for number of peptides per entry. MultiTag has no species restriction parameter and therefore all cross-species alignments were ignored.
  • Example 3. To test the specificity of MultiTag versus Mascot in EST database searching, a model dataset generated in a screen of microtubule-associated proteins from Xenopus laevis was used. Gel separated Xenopus proteins were analyzed by nanoelectrospray tandem mass spectrometry and identified by protein database searching using multiple techniques, which often gave significant matches for a single protein with Mascot, MS BLAST (9), and MultiTag (and generally greater sequence coverage with the latter methods). To facilitate the database searching process with sequence tags, a script was developed for automated error-tolerant searching to generate unsorted search results for submission to the previously described MultiTag software (29). This script "MTSearch" was developed specifically for BioAnalyst QS (Applied Biosystems, CA) to automatically search a list of complete and error-tolerant sequence tags against a database and compile the results in an unsorted list.
  • MultiTag subsequently sorts search results by the statistical significance of combinations of multiple tags and individual tags.
  • the results of MTSearch can be directly submitted to the MultiTag statistics software ( Figure 5).
  • MultiTag relies upon an expected number of peptides per protein sequence, which was previously averaged at 41 peptides per protein for protein database searching. Since EST database sequences are shorter, one would expect fewer peptides possible from each entry; therefore the parameter designating the number of peptides expected per database entry was made adjustable to account for differences in length.
  • MT relies upon a designated size of the database searched for producing a probability of a random match. This is straightforward with protein database searches because this is a designated number of database entries. For EST database searching, all nucleotide sequences or the query must be translated in six frames, generating additional erroneous hypothetical sequence; only one frame is the correct translation; we multiplied the number of entries by 6 to account for this degeneracy.
  • MultiTag may recognize one specific EST with a query and Mascot may recognize a different EST corresponding to the same cDNA sequence
  • these top hits were carefully inspected to find alternate ESTs matching the same protein sequence.
  • the top five MultiTag hits were manually inspected by overlaying the retrieved peptide sequence "on the spectrum" using BioAnalyst QS, and comparing the observed fragment ions with theoretically calculated fragment ions (at a precision of 0.001 m/z), taking into consideration abundant a, b and y series ions, and immonium ions (see (27) for nomenclature).
  • MultiTag was able to detect 6 additional matches; 3 of these 6 were not in the Mascot top 5 hits (data not shown). This suggests that the MT method can also retrieve true matches that are not statistically significant; single hits below threshold should be manually inspected if no other alignments are made.
  • MultiTag proves to be a sensitive method for EST database searching, both because more identifications were made than with the conventional software and because more peptides were identified in total.
  • sequence tags were constructed from multiple MS/MS spectra from the analysis of a single in-gel digest ("Tags Submitted") and error-tolerantly searched against a protein database (resulting list of entries not shown).
  • the "Mass" column contains the corresponding complete mass for each sequence tag.
  • Results were sorted by MultiTag. Groups of matching partial sequence tags resulted ("Matching Tags").
  • E-values for the group of partial sequence tags were calculated by the MultiTag software (first column in Bold).
  • the final MultiTag report gave a list of database entries with diminishing E-values (data not shown). E-values are cited (column 1, "First False Positive") for the first database entry in the list to not correspond by annotated function (i.e.
  • MS BLAST Identifications Protein identifications made by MS BLAST are found in the column “MS BLAST Identifications” and peptide sequences aligned are in “MS BLAST Alignments.” Table 2. MultiTag E-values are Dependent on Amino Acids in Tag, Number of Tags, Mass Accuracy, and Database Size
  • E-value in Bold is shown in Table 1.
  • PredCount values are in italics.
  • Da corresponds to mass accuracy (in Da) used in database searches and input into MultiTag for calculations of E-values.
  • all tags submitted were included from Table 1.
  • 800,000 and 200,000 database entries correspond approximately to the NCBI Nonredundant (nrdb) and SwissProt protein databases, respectively.
  • Mass in column 2 indicates the full length of the peptide corresponding to the sequence tag in column 3. *800,000 database entries, **Mass accuracy of 0.1 Da.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Engineering & Computer Science (AREA)
  • Biotechnology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Analytical Chemistry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)
  • Investigating Or Analysing Biological Materials (AREA)
  • Peptides Or Proteins (AREA)

Abstract

This invention is in the field of bioinformatics and relates to the use of mass spectrometry in the prediction of functions of proteins, whose sequences are not present in protein, EST or genomic databases.

Description

METHOD FOR PREDICTING PROTEIN FUNCTION
All documents cited herein are incorporated by reference in their entirety.
TECHNICAL FIELD
This invention is in the field of bioinformatics and specifically relates to the use of mass spectrometry in the prediction of function of proteins, whose sequences are not present in protein, EST (Expressed Sequence Tags) or genomic databases.
BACKGROUND ART
Developments in genomic sequencing and mass spectrometry have allowed the study of the proteome in detail (reviewed in 1). Proteins are typically resolved by gel electrophoresis, followed by enzymatic digestion and identification by peptide mass fingerprinting (PMF) or tandem mass spectrometry (MS/MS). An alternative approach is to digest a complex protein mixture in solution and to identify proteins by two-dimensional liquid chromatography-mass spectrometry (LC-MS/MS; reviewed in 2-3). Masses of intact tryptic peptides or masses of fragment ions (from PMF and MS/MS respectively) are submitted for database searching using specialised software (reviewed in 4). Regardless of the database searching algorithm or mass spectrometry platform used, the basic methods are the same. The software correlates the observed masses with theoretically predicted masses derived from peptide sequences produced by in silico digestion of protein database entries, and calculates the statistical significance of hits. However, the software does not require a full representation of the fragment ions in the tandem mass spectrum and can positively identify the peptide even if only some of the fragment ions are matched. Obviously, the significance of hits increases if more fragment ions are detected and if more than one peptide sequence originating from the same database entry was recognised. Therefore, conventional database mining software is biased towards exact matching of spectra to catalogued sequences and in practice is mostly applied to the identification of proteins already residing in databases. As a result, proteomics is largely limited to organisms with sequenced genomes, despite the fact that phylogenetically related organisms share significant molecular homology and that extensive protein sequence information may be available from related species.
The proteomes of organisms whose genomes are unsequenced may be analysed using correlation methods based on sequence similarity (reviewed in 5). Recently, methods have been developed for protein identification by sequence similarity based on modified FASTA (6) (FASTS) (7) and BLAST (8) (MS BLAST) (9) homology searching algorithms, which allow the mass spectrometric identification of a sizeable proportion of proteins sharing more than 50% sequence identity with their closest neighbours in a database (7, 10). MS BLAST can use redundant, degenerate and inaccurate sequences produced by automated interpretation of tandem mass spectra (9) and can be linked to high throughput protein characterisation techniques such as LC-MS/MS through a simple scripting interface (11). Importantly, both the MS BLAST and FASTS methods provide independent means of evaluating the statistical significance of hits, and therefore it is not necessary to compare retrospectively the matched peptide sequences with actual tandem mass spectra to rule out false positive hits.
A major limitation of sequence similarity data, however, lies in the quality of de novo interpretation of raw tandem mass spectra rather than in database searching. Tandem mass spectra are inherently deficient, i.e. peptide sequence fragments are often underrepresented. At the same time, spectra often display ions that originate from fragmentation of side chains of amino acid residues and are not accounted for by typical scoring schemes. It is common in femtomole sequencing for low peptide content and high chemical noise to allow only a few informative fragment ions to be detected in MS/MS spectra, from which software-assisted interpretation cannot produce credible peptide sequence proposals and sequence similarity identification will likely yield a false negative result.
The peptide sequence tag approach for error-tolerant database searching developed by Mann and Wilm in 1994 (12) helps to overcome these limitations. The sequence tag utilizes a short sequence (2-4 amino acid residues), which can be easily determined from low energy CID spectra acquired from multiply charged precursors, and a pair of masses that flank the determined sequence in the full length peptide sequence. These masses are the combined mass of all the amino acid residues between the N-terminus of the tryptic peptide and the short sequence, and similarly, the mass of the amino acid residues between the short sequence and the C-terminus of the tryptic peptide. In stringent database searches both the masses and the short sequence are required to match. Currently sequence tags are employed in protein, EST (13) and genomic database mining; however, no evaluation of the significance of matches is provided in these searches. Therefore, even if a single hit was retrieved upon database searching, the match between the tandem mass spectra and the corresponding database hit has to be verified retrospectively by manual inspection. Error-tolerant searching using this method can be carried out by allowing the sequence or one of the masses to mismatch. This approach allows cross species identifications in protein sequence databases (15). However, allowing mismatches typically results in many hundreds of hits being retrieved, and manual inspection of them is tedious.
DISCLOSURE OF THE INVENTION
The present invention provides,
an automated method of identifying a protein of unknown sequence, said method comprising
(i) using peptide sequence tags deduced from tandem mass spectra of peptides generated by cleavage of the protein of unknown sequence to search a database in an error-tolerant manner for known peptide sequences that match said peptide sequence tags;
(ii) assigning statistical significance to identified matches;
(iii) ranking identified matches by their significance;
(iv) generating a list of matching proteins ranked according to significance;
wherein the significance of a ranked protein is dependent on the number of matches between the peptide sequence tags and the protein and the level of degeneracy of the matches as compared to the expected number of sequences from a random database that would match the same combination of tags with the same level of degeneracy or with a more specific combination of tags.
The method of the present invention automatically analyses the results of an error-tolerant database search, reveals proteins to which multiple fragmented peptides are matched in an error-tolerant fashion and computes the statistical significance of those matches, allowing the discrimination of true matches from false positives.
This method improves over previous methods for the identification of proteins of unknown sequence, since it automatically generates a ranked list of proposed identities without the need for any manual annotation of the results. Error tolerant searching is used to implement a statistical evaluation of matching multiple partial sequence tags in the identification of proteins with unsequenced genomes. This approach, herein referred to as "Multitag" enables the identification of distantly related proteins by sequence-similarity searching using only very short stretches of peptide sequence retrieved from tandem mass spectra. This is therefore a simplified and sensitive method of exploring the proteomes of organisms with unsequenced genomes.
An advantage of this method over sequence similarity searching methods such as FASTS (7) and MS BLAST (9) besides the ability to represent noisy and low intensity spectra, is that peptide sequences retrieved by sequence tag searches can be overlaid on fragment ion spectra, allowing the determination of whether the retrieved sequence is the correct sequence. Even though relatively weak matches can be evaluated in this way, the present invention can be used for high throughput searching since manual evaluation is only needed for easily recognisable borderline hits.
The method may preferably be used to search protein databases, although EST databases can be searched in similar fashion. Sequence tag searching is a recognized method for the identification of proteins in an EST database (28). Peptide sequence tags can be matched to putative protein sequences obtained by a six frame translation of cDNA sequences from EST entries.
In the preliminary steps that lead to the identification of peptide sequence tags, proteins are generally separated out on a one- or two-dimensional polyacrylamide gel. They are then excised and digested in-gel as previously described (16). Peptide samples are prepared using techniques well established in the art and not repeated herein in detail. Briefly, enzymatic digestion may be used to cleave the amide backbone of a protein. Preferably, the enzymatic digestion includes treatment of a protein with an enzyme that cleaves with high specificity. Such enzymes include trypsin, which cleaves at the C-terminus of Arg or Lys residues; endoproteases such as Lys-C or Arg-C. Proteases with lower primary specificity (as, for example, pepsin and thermolysin), could also be applied provided they produce peptides in 500 - 2000 Da mass range from protein substrates. Bovine or porcine trypsin is a preferred enzyme to use in this stage of the method.
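The cleavage rule described above (trypsin cutting at the C-terminal side of Arg or Lys) can be illustrated in silico. The following is a minimal sketch only: the function name `tryptic_digest` and the `missed_cleavages` parameter are illustrative, and the sketch ignores the reduced cleavage before proline and the 500 - 2000 Da mass window.

```python
def tryptic_digest(protein, missed_cleavages=0):
    """Split a protein sequence after every K or R residue.

    Simplified model: no proline rule, no mass filtering.
    Returns fully cleaved peptides plus peptides containing up to
    `missed_cleavages` internal cleavage sites.
    """
    # First cut the sequence into fully cleaved pieces.
    pieces, start = [], 0
    for i, residue in enumerate(protein):
        if residue in "KR":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])  # C-terminal piece without K/R

    # Then join neighbouring pieces to model missed cleavages.
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


fully_cleaved = tryptic_digest("AAKGGRCC")
one_missed = tryptic_digest("AAKGGRCC", missed_cleavages=1)
```

With `missed_cleavages=1` the sketch also returns peptides spanning one uncleaved site, which mirrors the "one miscleavage allowed" setting used in database searches.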
The peptide fragments are analysed by mass spectrometry. A variety of methods exist by which mass spectrometry analysis may be carried out such as matrix-assisted laser desorption (MALDI), electrospray ionisation (ESI) implemented as nanoelectrospray method (26) or coupled on-line with the mass spectrometer (LC MS/MS) and related methods (e.g. Ionspray, Thermospray). Alternatively, these ion sources can be matched with any instrument configuration having tandem mass spectrometric capacity, such as triple quadrupole, Fourier transform ion cyclotron resonance (FTICR), ion trap, or combinations of these to give a hybrid instrument (e.g. ion trap-time-of-flight or quadrupole time-of-flight). For ionization, numerous matrix/wavelength combinations (MALDI) or solvent combinations (ESI) can be employed. Various hybrid instruments exist that operate in the tandem MS/MS mode. In tandem MS/MS instruments, the peptide sample, usually in the form of a tryptic protein digest, is typically injected into a first mass analyzer to yield a mass spectrum of the ions present in the mixture ('normal' mass spectrum). Any ion can then be channelled selectively (i.e., the precursor or parent ion) into a collision cell in which fragment ions are generated by collision with neutral gas. The fragment ions are then moved into a second mass analyzer, to yield a mass spectrum for the fragment ions. An example of such an instrument is the orthogonal quadrupole time-of-flight mass spectrometer (QqTOF). Isolation and fragmentation of precursor ions in ion trap mass spectrometers is performed in the same chamber (ion trap) and the instrument does not have a separate mass analyser for detecting fragments. The analytical routine is similar to the one described above for the instruments with multiple mass analysers and consists of detection of all ions in the sample, followed by selective isolation and fragmentation of the precursors and detection of yielded peptide fragments.
In step i) of the method, peptide sequence tags identified from said peptide fragments are used to search a protein database in an error-tolerant manner for known protein sequences that match said peptide sequence tags. This methodology is a development of the method of Mann (12).
Peptide sequence tags are constructed based on raw mass spectra (see Figure 2). In the case of spectra acquired using ESI on triple quadrupole or quadrupole TOF machines, sequence tags are typically identified from the high m/z region of tandem mass spectra of tryptic peptides, which are dominated by abundant y-ions. In the interpretation of MS/MS spectra acquired on ESI ion traps or with MALDI ionisation sources (when singly charged ions are fragmented), series of b-ions or a-ions should also be considered (see 27 for fragment ion nomenclature). The sequence tags consist of three sections. The first part, $m_N$, is the added mass of the residues between the N terminus of the peptide and the determined sequence. The second part, $m_A$, is the determined sequence. The third part, $m_C$, is the mass of all amino acid residues between the determined sequence and the C terminus of the peptide sequence tag.
According to the method of the invention, sequence tags are assembled for as many fragmented tryptic peptides as possible and are compared with peptide sequences in a database. This may be done by the interpretation of tandem mass spectra using computer software, for example, the BioAnalyst QS (Applied Biosystems, CA). Sequence databases are then searched by computer using software such as PepSea (part of the BioAnalyst QS package) with the mass tolerance different for masses of fragment and precursor ions. Searches are performed against a comprehensive database and no constraints on protein molecular weight, pI and species of origin are considered.
The masses are used for searching a database in both a stringent fashion (i.e. all three regions match) and in an error-tolerant, non-stringent manner (where a mismatch is tolerated in one or more of the three regions $m_N$, $m_A$ or $m_C$). Matches are preferably labelled by the mass of the precursor ion and by the matching region, wherein the matching region is abbreviated as NC for a search result with a completely matching tag; N for a search result with $m_N$ and $m_A$ matching; E for a search result with one amino acid error; C for a search result with $m_A$ and $m_C$ matching.
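The NC, N, C and E labels can be illustrated with a small classifier. This is a sketch under stated assumptions, not the PepSea implementation: the function `classify_tag_match` is hypothetical, the flanking masses are taken as plain sums of monoisotopic residue masses (as in the definition of the three tag regions given in the text, rather than fragment-ion m/z values), an E match is assumed to require both masses to match, and leucine and isoleucine are compared as written.

```python
# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def classify_tag_match(peptide, tag, tol=0.1):
    """Return 'NC', 'E', 'N', 'C' or None for the best placement of a tag.

    tag is (m_n, stretch, m_c): the residue-mass sums flanking the
    determined sequence stretch, in Da.
    """
    m_n, stretch, m_c = tag
    k = len(stretch)
    masses = [RESIDUE_MASS[r] for r in peptide]
    rank = {"NC": 0, "E": 1, "N": 2, "C": 2}  # lower rank = more specific
    best = None
    for i in range(len(peptide) - k + 1):
        n_ok = abs(sum(masses[:i]) - m_n) <= tol
        c_ok = abs(sum(masses[i + k:]) - m_c) <= tol
        errors = sum(a != b for a, b in zip(peptide[i:i + k], stretch))
        if n_ok and c_ok and errors == 0:
            label = "NC"   # all three regions match
        elif n_ok and c_ok and errors == 1:
            label = "E"    # one amino acid error in the stretch
        elif n_ok and errors == 0:
            label = "N"    # N-terminal mass and stretch match
        elif c_ok and errors == 0:
            label = "C"    # stretch and C-terminal mass match
        else:
            continue
        if best is None or rank[label] < rank[best]:
            best = label
    return best


peptide = "LGEYGFQNALIVR"
m_n = sum(RESIDUE_MASS[r] for r in peptide[:3])
m_c = sum(RESIDUE_MASS[r] for r in peptide[6:])
label = classify_tag_match(peptide, (m_n, "YGF", m_c))
```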
Masses of precursor ions measured with high accuracy (for example, on hybrid quadrupole time-of-flight instruments) enable the identification of known proteins or proteins highly homologous to known proteins by peptide mass fingerprinting. However, confidence in the result can be confounded by the presence of modified amino acids or if sequences of the analysed unknown protein and homologous known proteins are more diverged. In the latter case a single tryptic peptide may not be completely identical between two protein sequences even at the level of 75% of full length sequence identity, although the identity of relatively short sequence stretches frequently occurs (see Figure 1). If the search is widened by altering these constraints, then more hits are obviously produced. However, a number of these are not significant. Prior to the present invention, these hits therefore had to be analysed by hand to determine which are more relevant.
The full list of hits is then input into step ii) without any further analysis. For example, redundant hits which match the same peptide sequence in another database entry are retained in the list of matches, as well as redundant matches between a peptide sequence tag and the same peptide in a protein entry.
The inventors have concluded that in order to identify truly homologous proteins from sequence tag search results, an evaluation of statistical significance is required. This may be done by assigning an E-value to each match, which represents the expected number of better or equally good matches found in a database at random. In this case, a database search hit is a protein sequence, which completely or partially matches some sequence tags in a degenerate or non-degenerate manner. E-values give the expected number of sequences from a random database, which would match the same combination of tags with the same level of degeneracy or with a more specific combination of tags. A more specific combination of tags is a result of either a higher number of tags matched or a lower degeneracy of the matches. Accordingly, hits are ranked according to their E-values, which, in turn, depend on the number of matches between the peptide sequence tags and the protein and the level of degeneracy of the matches as compared to the expected number of sequences from a random database that would match the same combination of tags with the same level of degeneracy or with a more specific combination of tags.
In a preferred embodiment of the invention, in step ii) the frequency of matches is compared to:
a) the probability that a given peptide sequence tag with a given type of degeneracy would match a random amino acid sequence;
b) the probability that a given combination of peptide sequence tags would match a random sequence, wherein said probability is computed as a product of the probabilities corresponding to individual matches;
c) the probability that any possible more specific (less likely) combination of peptide sequence tags than a given combination would match a random sequence;
and a statistical significance is assigned by multiplying the probability of step c) by the total number of protein sequences in the database.
Preferably, it is additionally required that the peptide partially obeys the cleavage condition of the proteolytic enzyme. For example, if matching of only the N-terminal part of the peptide is required, it is assumed that the amino acid preceding the N-terminal amino acid residue is arginine or lysine, if trypsin was applied to digest the protein. However, no C-terminal cleavage specificity would be requested, since C-terminal part of the peptide is not required to match. This improves specificity of searches and facilitates the correct identification of peptide fragments.
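The partial cleavage condition can be sketched as a simple check on the residues flanking a matched peptide. The function below is hypothetical. The text describes only the N-terminal case; the symmetric handling of C matches, the requirement of both sides for NC and E matches, and the acceptance of protein termini are assumptions made for illustration.

```python
def obeys_partial_cleavage(protein, start, end, label):
    """Check the tryptic cleavage condition for protein[start:end].

    label is the match type: 'N', 'C', 'NC' or 'E'.
    Only the side of the peptide that was required to match is tested.
    """
    # N-terminal side: the preceding residue must be K or R
    # (assumption: the protein N-terminus is also accepted).
    n_side = start == 0 or protein[start - 1] in "KR"
    # C-terminal side: the peptide itself must end in K or R
    # (assumption: the protein C-terminus is also accepted).
    c_side = end == len(protein) or protein[end - 1] in "KR"
    if label == "N":
        return n_side
    if label == "C":
        return c_side
    return n_side and c_side  # 'NC' and 'E' (assumed)


protein = "MKAAGGRTT"
ok_n = obeys_partial_cleavage(protein, 2, 7, "N")   # 'AAGGR', preceded by K
bad_n = obeys_partial_cleavage(protein, 3, 7, "N")  # 'AGGR', preceded by A
```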
The significance of hits may preferably be calculated by the following process:
As an example, a given sequence tag has an N-terminal mass $m_N$, three amino acids $a_1$, $a_2$, $a_3$ and a C-terminal mass $m_C$ (although the sequence tag may comprise any number of amino acid residues). The probability that a random tryptic peptide would match this tag in a non-degenerate manner is given as a product of the following three probabilities. First, the probability that the random tryptic peptide has an N-terminal fragment of any length whose mass lies in the interval $(m_N-\Delta m,\ m_N+\Delta m)$, where $\Delta m$ is the mass tolerance of the instrument. Second, the probability that this fragment of the random peptide is followed by the amino acids $a_1$, $a_2$ and $a_3$. This is given by the product $f(a_1)f(a_2)f(a_3)$, where $f(a_i)$ denotes the frequency of amino acid $a_i$. And third, the probability that the mass of the random peptide fragment between these amino acids and the C-terminus would be between $m_C-\Delta m$ and $m_C+\Delta m$.
In order to derive the probabilities for $m_N$ and $m_C$, the mass of a random tryptic peptide is regarded as a result of a random process. If the sequence of the random tryptic peptide was constructed by a random generator, giving random amino acids one at a time, the probability that the next amino acid will be $a_j$ is given by the frequency $f(a_j)$. At any given time the generator could produce a trypsin cleavage site (K or R residue) with the probability $q=f(K)+f(R)$ and thus stop the process. The mass $M$ of the random tryptic peptide can be regarded as an accumulated sum of masses of randomly generated amino acids where
$$M = M_1 + M_2 + M_3 + \dots \qquad (1)$$
Random values represented as successive sums of positive identically distributed values as shown in equation (1) are known in probability theory as a renewal process (18). Masses of randomly generated amino acids obey the probability distribution determined by amino acid frequencies, so that the probability that the random mass would be exactly $m$ equals the combined frequency of amino acids of mass $m$ and is given as $p(m)$. Then, the distribution of the mass accumulated after $n+1$ steps (the probability that the peptide fragment of length $n+1$ would have a mass smaller than $t$) is calculated as:
$$F_{n+1}(t) = (1-q)\sum_{i} F_n(t-m_i)\,p(m_i) \qquad (2)$$
Summation here is carried over all values of amino acid masses. Multiplication by $(1-q)$ is needed to take into account that the process survived the $(n+1)$th step, i.e. the tryptic peptide has more than $n+1$ amino acids.
The distribution of the total mass of the tryptic peptide (the probability that the peptide's total mass would not exceed $t$) is given by allowing for all possible lengths of the peptide:
$$F(t) = q\sum_{n} F_n(t) \qquad (3)$$
This implies that the probability that the peptide's mass would be in the interval from $m-\Delta m$ to $m+\Delta m$ is:
$$P(m,\Delta m) = F(m+\Delta m) - F(m-\Delta m) \qquad (4)$$
This formula holds both for the whole mass of the peptide and for any of its fragments between a fixed amino acid position and the cleavage site. If this analogy is further considered with the renewal process, the properties are retained regardless of the point that the process is considered to start (the process has no memory). Therefore, after the position of the sequence tag on the peptide sequence is fixed through matching one mass and the short sequence stretch, the probability that the second mass would also match is given by equation (4).
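Equations (2) to (4) can be tabulated numerically. The sketch below is one possible discretisation and not the MultiTag code: it works on a mass grid of 0.1 Da, truncates at 1000 Da, assumes the recursion starts from a zero-length fragment of mass zero, and uses uniform amino acid frequencies purely as a placeholder for real database statistics. Summing equation (2) over all n gives a single renewal recursion for the density of reachable masses, which is what the loop computes. The names `build_mass_distribution` and `window_probability` are illustrative.

```python
import math

# Monoisotopic residue masses in Da (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}

# ASSUMPTION: uniform frequencies, a placeholder for measured f(a).
UNIFORM_FREQ = {aa: 1.0 / len(RESIDUE_MASS) for aa in RESIDUE_MASS}


def build_mass_distribution(freq, bin_da=0.1, max_da=1000.0):
    """Tabulate F(t) of equation (3) on a mass grid; returns (F, q)."""
    q = freq["K"] + freq["R"]  # probability of a cleavage site
    n_bins = int(round(max_da / bin_da)) + 1
    steps = [(int(round(RESIDUE_MASS[aa] / bin_da)), f)
             for aa, f in freq.items()]

    # density[k]: summed over all lengths n, the probability that the
    # process has survived and accumulated a mass of k * bin_da.
    density = [0.0] * n_bins
    density[0] = 1.0  # ASSUMPTION: start from a zero-length fragment
    for k in range(1, n_bins):
        reached = sum(f * density[k - step] for step, f in steps if step <= k)
        density[k] = (1.0 - q) * reached  # survival factor of equation (2)

    # cumulative[j] = q * sum of density[0..j-1], i.e. equation (3).
    cumulative = [0.0]
    for value in density:
        cumulative.append(cumulative[-1] + q * value)

    def F(t):
        """Probability that the accumulated mass is below t (truncated grid)."""
        if t <= 0:
            return 0.0
        bins_below = min(math.ceil(t / bin_da), n_bins)
        return cumulative[bins_below]

    return F, q


def window_probability(F, m, dm):
    """Equation (4): probability that the mass lies within m +/- dm."""
    return F(m + dm) - F(m - dm)


F, q = build_mass_distribution(UNIFORM_FREQ)
p_window = window_probability(F, 500.0, 0.5)
```

Pre-computing the table once and reusing it for every tag keeps the per-tag cost to two lookups. A finer grid would be needed before the sketch could be used at a tolerance of 0.1 Da.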
The matching of the sequence tag is considered as a sequence of three consecutive, independent events, that is, the matching of the first mass, the matching of the short sequence stretch and the matching of the second mass. Although it makes no difference to the calculations, we preferably assume that the N-terminal mass is the first mass to match. The probability that the mass of any N-terminal fragment of the peptide would be in the interval $(m_N-\Delta m,\ m_N+\Delta m)$ is given by:
$$Q(m_N,\Delta m) = \frac{1}{q}\left[F(m_N+\Delta m) - F(m_N-\Delta m)\right] \qquad (5)$$
or
$$Q(m_N,\Delta m) = \frac{1}{q}\,P(m_N,\Delta m)$$
The multiplier $1/q$ is introduced because the process survives the step with this mass, i.e. all peptides with arbitrary lengths with N-terminal parts matching the mass would satisfy the condition. It is noted, however, that equation (5) only holds if the mass tolerance of the instrument used is lower than the mass of any of the amino acids, otherwise it corresponds to the expectation and not to the probability.
Since the probability of the non-degenerate match of the sequence tag would be a product of probabilities of the N-terminal mass match (which importantly fixes the position of the tag along the peptide), sequence stretch match and the C-terminal mass match, it is expressed as:
$$P_{\text{non-degenerate}} = \frac{1}{q(1-q)}\,P(m_N,\Delta m)\,f(a_1)\,f(a_2)\,f(a_3)\,P(m_C,\Delta m) \qquad (6)$$
The additional multiplier $1/(q(1-q))$ reflects the fact that zero length tryptic peptides are not considered. Therefore only peptides with a length of one or more are considered, which give a real number. Examples of probabilities for degenerate matches are given by:
$$P_{\text{N-terminal}} = \frac{1}{q(1-q)}\,\left[1-P(m_N,\Delta m)\right]f(a_1)\,f(a_2)\,f(a_3)\,P(m_C,\Delta m)$$
and
$$P_{\text{second residue}} = \frac{1}{q(1-q)}\,P(m_N,\Delta m)\,f(a_1)\left[1-f(a_2)\right]f(a_3)\,P(m_C,\Delta m) \qquad (7)$$
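The non-degenerate and degenerate tag probabilities differ only in which factor is replaced by its complement, so they can be computed by one routine. The sketch below is illustrative: `tag_match_probability` is a hypothetical name, the mass-window probabilities are passed in as plain numbers, the multiplier 1/(q(1-q)) is applied as described in the text, and the C-terminal mismatch case is added by analogy with the N-terminal one.

```python
def tag_match_probability(p_n, residue_freqs, p_c, q, mismatch=None):
    """Probability that a random tryptic peptide matches one sequence tag.

    p_n, p_c      : window probabilities P(m_N, dm) and P(m_C, dm)
    residue_freqs : frequencies f(a_1), ..., f(a_k) of the tag residues
    q             : f(K) + f(R)
    mismatch      : None for a non-degenerate match, 'N' or 'C' for a
                    mismatching flanking mass, or the 0-based index of
                    the mismatching residue in the stretch
    """
    prob = 1.0 / (q * (1.0 - q))
    prob *= (1.0 - p_n) if mismatch == "N" else p_n
    prob *= (1.0 - p_c) if mismatch == "C" else p_c
    for i, f in enumerate(residue_freqs):
        prob *= (1.0 - f) if mismatch == i else f
    return prob


# Illustrative numbers only, not values from the patent.
freqs = [0.05, 0.05, 0.05]
p_exact = tag_match_probability(0.001, freqs, 0.002, q=0.1)
p_n_error = tag_match_probability(0.001, freqs, 0.002, q=0.1, mismatch="N")
p_second = tag_match_probability(0.001, freqs, 0.002, q=0.1, mismatch=1)
```

As expected, relaxing any one region makes a random match far more likely, which is why degenerate matches contribute less to the significance of a hit.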
As the next step, the probability that a random protein sequence containing K tryptic peptides would match multiple sequence tags is calculated, taking into account the tags being matched and the type of degeneracy of the match. For instance, if there were three sequence tags and the random sequence simultaneously matched sequence tag 1 with an error in the C-terminal mass, sequence tag 2 with an error in the N-terminal mass and sequence tag 3 with a mismatch at the second identified amino acid, the probability of the event would be given by equation (8):
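$$p = \left[1 - e^{-K\,P_1(m_{1N})\,f(a_{11})\,f(a_{12})\,f(a_{13})\,\left(1-P_1(m_{1C})\right)}\right]\left[1 - e^{-K\,\left(1-P_2(m_{2N})\right)\,f(a_{21})\,f(a_{22})\,f(a_{23})\,P_2(m_{2C})}\right]\left[1 - e^{-K\,P_3(m_{3N})\,f(a_{31})\,\left(1-f(a_{32})\right)\,f(a_{33})\,P_3(m_{3C})}\right] \qquad (8)$$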
This example shows how to compute the probability that a random amino acid protein sequence would match an arbitrary combination of sequence tags.
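The structure of equation (8), one factor of the form 1 - exp(-K * p_tag) per matched tag, can be sketched as follows. The function name is hypothetical and the per-tag probabilities are supplied by the caller, so the sketch makes no claim about how they were obtained. `math.expm1` is used only to keep precision when the exponent is very small.

```python
import math


def protein_match_probability(tag_probs, peptides_per_entry):
    """Probability that a random entry with K tryptic peptides matches every tag.

    tag_probs          : per-peptide match probability of each tag, already
                         reflecting its type of degeneracy
    peptides_per_entry : K, the expected number of tryptic peptides per entry
    """
    p = 1.0
    for p_tag in tag_probs:
        # 1 - exp(-K * p_tag): at least one of the K peptides matches the tag.
        p *= -math.expm1(-peptides_per_entry * p_tag)
    return p


# Illustrative per-tag probabilities only.
p_one_tag = protein_match_probability([1e-9], 41)
p_three_tags = protein_match_probability([1e-9, 5e-7, 2e-8], 41)
```

For small per-tag probabilities each factor is close to K * p_tag, so every additional matched tag lowers the combined probability by several orders of magnitude.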
In order to calculate statistical significance (E-values) it is first necessary to calculate the probability that any combination of tags would match a random sequence of amino acids in an equally specific or more specific manner. In other words, the sum of the probabilities of all possible matches (given by equation 8), should be calculated. It is too demanding to calculate the probability using all the less likely combinations of tags and is far easier to enumerate the combinations that are more likely to happen, as they mostly involve matches with very few tags.
The observed match has a probability p (given by eq. 8). The E-value is given by the product of number of sequences in the database and the probability that a more or equally specific combination of tags would match a random sequence. This is given by the equation:
$$E = N\sum_{q_i < p} q_i \qquad (9)$$
The summation is carried over all combinations of tags, where $q_i$ is the probability that the $i$th combination will match the random sequence. In this equation, only combinations where $q_i < p$ are included. $N$ corresponds to the number of sequences in the database. The computation of E-values according to the above formula requires enumeration of a very large number of tag combinations. To avoid this, it is noted that the above formula is equivalent to
$$E = N\left(1 - \sum_{q_i > p} q_i\right)$$
In this case all combinations of tags with $q_i > p$ (and not $q_i < p$ as above) are enumerated. Practically, this is a much smaller number of combinations to be enumerated.
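The two ways of accumulating the E-value can be compared on a toy model. The sketch below is illustrative only and rests on assumptions that are not taken from the patent: each tag is treated as falling into exactly one outcome ("none", "NC", "N" or "C") with a fixed probability, the outcomes of different tags are independent, and combinations whose probability equals that of the observed match are counted in the direct sum. The outcome probabilities are invented for the example.

```python
from itertools import product


def e_value(observed, outcome_probs, n_sequences):
    """Return the E-value computed by the direct and the complementary sum.

    observed      : tuple with the observed outcome name of each tag
    outcome_probs : one dict per tag mapping outcome name -> probability
                    (each dict sums to 1 in this toy model)
    n_sequences   : N, the number of sequences in the database
    """
    def combo_prob(combo):
        p = 1.0
        for probs, name in zip(outcome_probs, combo):
            p *= probs[name]
        return p

    p_obs = combo_prob(observed)
    combos = list(product(*(probs.keys() for probs in outcome_probs)))

    # Direct form: sum over equally or less probable combinations.
    direct = sum(combo_prob(c) for c in combos if combo_prob(c) <= p_obs)
    # Complementary form: enumerate only the more probable combinations.
    likely = sum(combo_prob(c) for c in combos if combo_prob(c) > p_obs)
    return n_sequences * direct, n_sequences * (1.0 - likely)


# Hypothetical outcome probabilities for two tags.
tag = {"none": 0.989, "NC": 0.001, "N": 0.005, "C": 0.005}
e_direct, e_complement = e_value(("NC", "N"), [tag, tag], 800_000)
```

In a realistic search most combinations are very improbable, so the complementary form touches far fewer terms, at the cost of some floating-point cancellation when the E-value is very small.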
Software implementation of the method of the invention preferably uses the pre-computed distribution function $F(t)$. The software imports sequence tags in the conventional format $(m_C)a_1 \ldots a_n(m_N)$ (12) and computes probabilities for each tag to match a random tryptic peptide. The software also imports a full list of hits produced by multiple degenerate and non-degenerate sequence tag searches and identifies hits corresponding to the same protein. For each hit the probability of the match is calculated (similar to equation 8), then all tag combinations giving the same or higher probability are identified, and the hit is assigned an E-value. Finally the hits are sorted according to E-value.
As would be expected, high E-values are given for poor quality tags, which are those that have short mass lengths for tag regions 1 and 3, and designate common amino acids that have a high frequency in proteins, e.g. leucine. Lower E-values are seen where the sequence stretch is longer, where uncommon amino acids (e.g. tryptophan) appear, and for tags that have longer mass lengths in regions 1 and 3. The probability that a number of partial sequence tags will match is lower than the probability that a single sequence tag will match, and significance increases as more partial sequence tags are aligned. Two partial sequence tag alignments might suffice to make a confident identification, depending on the tag lengths. However, matching of three or more tags lowers the E-value to the range of 10^-6 to 10^-9, even at 0.1 Da mass accuracy, thus allowing a confident prediction (see Table 2).
This method relies upon an expected number of peptides per protein sequence. The average protein length in a non-redundant database is determined. The average length of a tryptic peptide is designated, allowing the average number of tryptic peptides per database entry to be calculated. Since EST database sequences are shorter, it is to be expected that there would be fewer peptides possible from each entry; therefore the parameter designating the number of peptides expected per database entry should be made adjustable to account for differences in length. The method also relies upon a designated size of the database searched for producing a probability of a random match. This is straightforward with protein database searches because this is a designated number of database entries. For EST database searching, all nucleotide sequences or the query must be translated in six frames, generating additional erroneous hypothetical sequence; only one frame is the correct translation; therefore the number of entries is multiplied by 6 to account for this degeneracy.
Where protein mixtures are analysed, many sequence tags are not expected to match any single protein in the database even if it is a true homologue of one of the proteins present in the mixture, because tags representing other components of the mixture are irrelevant to this protein. In other words, if a mixture were to be analysed, sequence tags in the query would correspond to different proteins. Therefore, any particular true hit (a protein being a component of the mixture) has no chance of being matched by all (or even most of) the tags.
The number of false positives rises with a growing number of irrelevant tags and increases the E-value of the positive hits. Simulation on a database using random sequences (data not shown) has demonstrated that the dependence of E-values on the full number of tags in the query is nonlinear, and is determined by the number of tags matched. For example, the E-value of a hit matched by only two tags increases by 3 to 5 times if the full number of tags in the query doubles. In practice, although the significance of strong hits drops with increasing number of tags in the query, all hits matched with four or more tags remain significant with any practical number of irrelevant tags. It is difficult to assess significance in borderline cases where two or three tags are matched out of a total of ten or more. For example, if three tags were matched, tripling the number of tags in the query increases the E-value by around 30 times, thus possibly rendering the hit insignificant.
Although this problem has not been noted in practice, in a preferred aspect of the invention, the following strategy may be used for borderline hits. The expected number of random matches of the same combination of tags with the same degeneracy is calculated. This predicted count (herein termed "PredCount") value reflects the specificity of matches, and the ranking order is the same as the ranking order for E-values. PredCount does not reflect the expected number of false positives when the entire query is searched against a database. Contrary to E-values, PredCount values only weakly depend on the number of tags in the query. Therefore in cases where most of the tags in the query were not matched, hits with PredCount values lower than around 10^-4 should be manually inspected, even if the E-values only have marginal significance.
In a fourth aspect the invention provides a computer apparatus adapted to perform a method according to any one of the aspects of the invention described above.
In a preferred embodiment of the invention, said computer apparatus may comprise a processor means incorporating a memory means adapted for storing data relating to amino acid or nucleotide sequences; means for inputting data relating to a plurality of protein or nucleic acid sequences; and computer software means stored in said computer memory that is adapted to perform a method according to any one of the aspects of the invention described above.
The invention also provides a computer-based system for identifying a protein of unknown sequence comprising means for inputting data relating to the masses of peptides generated by mass spectrometry; means adapted to search a protein database in an error-tolerant manner for known protein sequences that match said peptide sequence tags; means adapted to assign statistical significance to identified matches; means adapted to rank identified matches by their significance; and means for outputting a list of matching proteins ranked according to significance. The system of this aspect of the invention may comprise a central processing unit; an input device for inputting requests; an output device; a memory; and at least one bus connecting the central processing unit, the memory, the input device and the output device. The memory should store a module that is configured so that upon receiving a request to identify a protein of unknown sequence, it performs the steps listed in any one of the methods of the invention described above.
In the apparatus and systems of these embodiments of the invention, data may be input by downloading the MS data from a local site such as a memory or disk drive, or alternatively from a remote site accessed over a network such as the Internet. The sequences may be input by keyboard, if required. The generated list of matching proteins may be output in any convenient format, for example, to a printer, a word processing program, a graphics viewing program or to a screen display device. Other convenient formats will be apparent to the skilled reader.
The means adapted to search a protein database in an error-tolerant manner for known protein sequences that match said peptide sequence tags, and the means adapted to assign statistical significance to identified matches, will preferably comprise computer software means. As the skilled reader will appreciate, once the novel and inventive teaching of the invention is appreciated, any number of different computer software means may be designed to implement this teaching.
According to a still further aspect of the invention, there is provided a computer program product for use in conjunction with a computer, said computer program comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising a module that is configured so that upon receiving a request to identify a protein of unknown sequence, it performs the steps listed in any one of the methods of the invention described above. In a still further aspect, the invention provides software for the high throughput automated analysis of mass spectrometry data of peptide samples.
BRIEF DESCRIPTION OF DRAWINGS
Figure 1 shows regions of partial identity between human and alligator alcohol dehydrogenase. Partial amino acid sequences of human and alligator alcohol dehydrogenase are aligned (75% identity). Regions alignable by error-tolerant sequence tags between the two sequences are highlighted in gray. These regions are theoretical tryptic peptides over six amino acids in length with more than three conserved amino acids from the N-terminus or more than four conserved amino acids from the C-terminus. The tryptic cleavage sites designated are shared between both sequences. Tryptic cleavage sites not at the same point on the sequences are not designated by spaces; such sites do not occur in the gray regions. Accession numbers: human, P00325; alligator, AAB28120. The sequences were aligned using the Clustal X program.
Figure 2 shows the type of raw mass spectra that are used to construct peptide sequence tags. The MultiTag approach consists of constructing sequence tags from peptide tandem mass spectra, error-tolerant database searches, and sorting and calculation of the significance of multiple error-tolerant sequence tag alignments by the MultiTag software. Panel 1 shows a tandem mass spectrum of a low abundance peptide with an overlaid sequence tag. Panel 2 shows one complete and three error-tolerant sequence tag database searches, which are done for each MS/MS spectrum and corresponding sequence tag. Panel 3 shows the combined list of search results (most of the 8000 entry list is not shown) from all spectra and all searches in the analysis of a single sample; the "Tag Mass" column indicates the tag's parent mass followed by an "NC" for search results with complete tags, an "N" for searches with tag regions 1 and 2, an "E" for searches with tags with one amino acid error, or a "C" for searches with tag regions 2 and 3; the "Sequence" column is the retrieved sequence found from the database search; the "Mass" column indicates the total mass in kDa of the protein from which the peptide originated; "DB Accession" is the protein's accession number; "Protein name"; "Species". Panel 4 shows the MultiTag output; the "Tag Mass" column lists the tag-search code for the tags aligned; "Sequence" lists all of the full peptide sequences error-tolerantly aligned; "Mass" to "Species" are the same as in Panel 3; "E-values" gives the probability of the alignment of the group of sorted sequence tags. An additional column, "Predicted Counts", reflecting the number of expected random matches of a given combination of tags, is not provided here for the sake of presentation clarity.
Figure 3 shows the spectra for Xenopus proteins. Xenopus proteins were in-gel digested and analyzed by nanoelectrospray tandem mass spectrometry. MS spectrum peaks labeled with a "*" were fragmented and peptide sequence tags were constructed from MS/MS spectra (inset).
Abundant y-ions above the multiply charged precursor in MS/MS spectra allow direct determination of the partial amino acid sequence of a peptide and corresponding sequence tag construction. Peaks in the MS spectrum labeled with a T belong to trypsin. The resulting sequence tag from the MS/MS spectrum shown is (587.36)VSQ(901.52), parent mass 1047.55.
All of the determined sequence tags from the analysis of this sample are found in Table 1. The protein was identified as Isoleucyl-tRNA synthetase.
Figure 4 shows a peptide tandem mass spectrum with sequence tag, which was acquired from the doubly charged precursor m/z 676.82 on a quadrupole time-of-flight mass spectrometer. A sequence tag was made using the fragment y ion series in the m/z region higher than the precursor. The sequence tag is (791.4143)LFM(1182.6073), precursor mass 1351.63. MassN, the amino acid sequence, and MassC correspond to regions 1, 2, and 3 of the sequence tag, respectively. In this analysis of a protein digest, 25 peptides were fragmented including the one shown above, 10 sequence tags were constructed and submitted for database searching and MultiTag analysis, and 3 complete sequence tags matched one EST; Lysyl-tRNA synthetase, 12473885.
Figure 5 shows the integrated MultiTag database searching scheme. To interpret peptide tandem mass spectra with MultiTag for database searching: 1. Construct tags and list in text file; 2. Run database search script; 3. Submit search results to MultiTag software for sorting by probabilities. The MultiTag search script is written for Applied Biosystems Bioanalyst QS software. MultiTag software was modified for EST database searching as described above.
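The sequence tags quoted in the figure descriptions can be sanity-checked with a short Python sketch: in a tag read from a singly charged fragment ion series, the difference between the two flanking masses should equal the summed residue masses of the sequence stretch. The residue masses below are standard monoisotopic values; the function names and the default 0.1 Da tolerance are illustrative assumptions.

```python
# Standard monoisotopic amino acid residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def sequence_mass(sequence):
    """Sum of residue masses for a short amino acid stretch."""
    return sum(RESIDUE_MASS[residue] for residue in sequence)


def tag_is_consistent(mass_low, sequence, mass_high, tolerance=0.1):
    """True if the mass ladder spanned by the tag matches its sequence."""
    return abs((mass_high - mass_low) - sequence_mass(sequence)) <= tolerance


if __name__ == "__main__":
    # Tags quoted for Figures 3 and 4.
    print(tag_is_consistent(587.36, "VSQ", 901.52))
    print(tag_is_consistent(791.4143, "LFM", 1182.6073))
```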
Example 1
The method of the invention (MultiTag) was used to identify proteins from Xenopus laevis, which is, among other things, an important model for the study of the cell cycle (reviewed in 20), DNA replication (21) and developmental biology (22). Fewer than 7000 Xenopus proteins have been sequenced despite a public initiative in EST sequencing. There were around 221000 largely unannotated ESTs available as of 16th August 2002 (taken from http://www.ncbi.nlm.nih.gov). This does not provide much coverage of the 3070 megabase pseudotetraploid genome (23).
In this case, in-gel digests of Xenopus proteins were analysed by PMF and nanoelectrospray tandem mass spectrometry. Sequence-similarity searching methods were applied for protein identification because Mascot database searching (http://www.matrixscience.com/cgi/index.pl?page=:../home.html) with peptide mass fingerprints and with lists of fragment masses derived from uninterpreted tandem mass spectra was unable to identify proteins by stringent matching. Two methods of sequence-similarity searching were applied in parallel to the same set of MS/MS data. Peptide sequence proposals obtained by automated de novo interpretation of tandem mass spectra were submitted to MS BLAST (24). At the same time, peptide sequence tags were assembled via partial manual interpretation of spectra (Figure 3), followed by error-tolerant database searching and sorting and evaluating the results by MultiTag, as described above (Table 1).
Of the five unknown proteins attempted, MS BLAST identified three, whereas MultiTag identified all five. Importantly, in three cases both methods identified homologous sequences from the same organism or from different species, providing an independent validation of the MultiTag approach.
Example 2
Software Development.
MT-Integrated database Search Software. The "MTSearch" script was developed to automatically generate a list of database search results from a list of sequence tags. Tags were used for searching a database in a stringent fashion (matching regions 1, 2 and 3, see figure 4) and in an error-tolerant fashion: a search tolerating a mismatch of the C-terminal mass (matching regions 1 and 2); a search tolerating a mismatch of the N-terminal mass (matching regions 2 and 3); and searches tolerating one mismatch in the amino acid sequence (matching regions 1 and 3); the hits were additionally encoded by the mass of the precursor ion and by the abbreviated matching region (NC, N, C, or E, respectively) in the sequence tag and compiled in a list for submission to MT.
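The four search modes and the hit encoding described above can be sketched in Python as follows. The tag syntax is taken from the examples in the text; the class, function names, and the exact label layout are hypothetical, and the parser assumes both flanking masses are present.

```python
import re
from dataclasses import dataclass

# Matches tags written as (mass)SEQUENCE(mass), e.g. (587.36)VSQ(901.52).
TAG_PATTERN = re.compile(r"^\((\d+(?:\.\d+)?)\)([A-Z]+)\((\d+(?:\.\d+)?)\)$")

# Tag regions that must match in each search mode (regions 1, 2 and 3).
SEARCH_MODES = {
    "NC": (1, 2, 3),  # stringent search: complete tag
    "N": (1, 2),      # tolerates a mismatch of the C-terminal mass
    "C": (2, 3),      # tolerates a mismatch of the N-terminal mass
    "E": (1, 3),      # tolerates one mismatch in the amino acid sequence
}


@dataclass(frozen=True)
class SequenceTag:
    mass_region1: float
    sequence: str
    mass_region3: float


def parse_tag(text):
    """Parse a tag string into its three regions."""
    match = TAG_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a sequence tag: {text!r}")
    return SequenceTag(float(match.group(1)), match.group(2), float(match.group(3)))


def label_hit(precursor_mass, mode):
    """Encode a hit by precursor mass and abbreviated matching region."""
    if mode not in SEARCH_MODES:
        raise ValueError(f"unknown search mode: {mode!r}")
    return f"{precursor_mass:.2f} {mode}"


if __name__ == "__main__":
    tag = parse_tag("(587.36)VSQ(901.52)")
    for mode, regions in SEARCH_MODES.items():
        print(label_hit(1047.55, mode), "matches regions", regions, "of", tag)
```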
MultiTag is a stand-alone application on the MS Windows platform. MultiTag code was written using C++ language with Microsoft Visual C++ and Microsoft Foundation Classes (Microsoft Inc. CA). The existing MultiTag software was modified so the average number of tryptic peptides per database entry could be specified. The average protein length in a non-redundant database was previously determined to be 492 amino acids (corresponding to ~60kD). The average length of a tryptic peptide was designated at 12 amino acids, setting the average number of tryptic peptides per database entry at 41. Since the average length of an EST entry codes for 166 amino acids (EST_others, Nov. 27, 2002, NCBI), this number was divided by 12 and the value for EST DB searching was set at 14.
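The arithmetic behind the two settings quoted above is simple enough to check in a couple of lines of Python. The function name is hypothetical, and rounding to the nearest integer is an assumption inferred from the quoted values of 41 and 14.

```python
def peptides_per_entry(avg_entry_length_aa, avg_tryptic_peptide_length_aa=12):
    """Expected number of tryptic peptides per database entry."""
    return round(avg_entry_length_aa / avg_tryptic_peptide_length_aa)


if __name__ == "__main__":
    print(peptides_per_entry(492))  # average protein in a non-redundant database
    print(peptides_per_entry(166))  # average translated EST entry
```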
Mass Spectrometry.
Xenopus laevis proteins (provided by Andrei Popov, EMBL, Heidelberg) were isolated and resolved by one-dimensional gel electrophoresis, excised from two lanes containing similar biochemical preparations, and in-gel digested with the protease trypsin as previously described (16). Extracted peptides were analyzed by nanoelectrospray tandem mass spectrometry on a modified QSTAR Pulsar i quadrupole time-of-flight (QqTOF) instrument (MDS Sciex, Concord, Canada), using uncoated borosilicate glass capillaries (1.2 mm O.D. x 0.69 mm I.D.) from Harvard Apparatus Ltd (capillaries were drawn in-house on a Sutter P-97 puller).
Database Searching.
Mascot (17) queries were generated from tandem mass spectra using the processing script Mascot v.1.6b2 as an extension of BioAnalyst QS software (Applied Biosystems). Spectra were centroided and peaks were merged at 0.05 Thomsons, and peak lists contained mass values from peaks >2% base peak. Database searches with Mascot were performed on an internal server with a precursor mass tolerance of 0.1 Da and a fragment ion mass tolerance of 0.05 Da, default precursor charge states were set at +2 and +3, with trypsin enzyme specificity, one miscleavage allowed, variable methionine oxidation, fixed carboxyamidomethyl cysteine, instrument type set at default, and no restrictions for protein molecular weight, but restricted to database entries from the species Xenopus laevis. The Mascot identifications were made using the Peptide Summary Report for enhanced sensitivity.
Sequence tags were generated as described above. MultiTag searches were performed using the PepSea software as a part of BioAnalyst QS, with a precursor mass tolerance of 0.1 Da and a fragment ion mass tolerance of 0.05 Da, with trypsin enzyme specificity, and fixed carboxyamidomethyl cysteine. Search results were analyzed with the MT software described above. MT parameters were set: 1,396,530 database entries searched (6 frames X 232,755 Xenopus laevis EST entries), 0.1 Da mass accuracy, and 14 for number of peptides per entry. MultiTag has no species restriction parameter and therefore all cross-species alignments were ignored. Both methods searched the same database, EST_others (November 27, 2002), from the National Center for Biotechnology Information, and only used the Xenopus laevis subset of this database. The identity of all ESTs was verified by blastx database searches at the NCBI internet site.
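The MT parameter set quoted above can be written down as a small Python structure, which also makes the six-frame adjustment of the database size explicit. The class and function names are hypothetical and are not part of the MultiTag software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MTParameters:
    database_entries: int     # number of entries used for the E-value
    mass_accuracy_da: float   # mass accuracy used in the searches
    peptides_per_entry: int   # expected tryptic peptides per entry


def est_parameters(est_entries, mass_accuracy_da, peptides_per_entry, reading_frames=6):
    """Scale the entry count by the number of translated reading frames."""
    return MTParameters(
        database_entries=est_entries * reading_frames,
        mass_accuracy_da=mass_accuracy_da,
        peptides_per_entry=peptides_per_entry,
    )


if __name__ == "__main__":
    print(est_parameters(232_755, 0.1, 14))
```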
Example 3
To test the specificity of MultiTag versus Mascot in EST database searching, a model dataset generated in a screen of microtubule-associated proteins from Xenopus laevis was used. Gel-separated Xenopus proteins were analyzed by nanoelectrospray tandem mass spectrometry and identified by protein database searching using multiple techniques, which often gave significant matches for a single protein with Mascot, MS BLAST (9), and MultiTag (and generally greater sequence coverage with the latter methods). To facilitate the database searching process with sequence tags, a script was developed for automated error-tolerant searching to generate unsorted search results for submission to the previously described MultiTag software (29). This script, "MTSearch", was developed specifically for BioAnalyst QS (Applied Biosystems, CA) to automatically search a list of complete and error-tolerant sequence tags against a database and compile the results in an unsorted list.
The previously described MultiTag software subsequently sorts search results by the statistical significance of combinations of multiple tags and individual tags. The results of MTSearch can be directly submitted to the MultiTag statistics software (Figure 5). A modification was made to MultiTag for EST database searching. MultiTag relies upon an expected number of peptides per protein sequence, which was previously averaged at 41 peptides per protein for protein database searching. Since EST database sequences are shorter, one would expect fewer peptides possible from each entry; therefore the parameter designating the number of peptides expected per database entry was made adjustable to account for differences in length. Secondly, MT relies upon a designated size of the database searched for producing a probability of a random match. This is straightforward with protein database searches because this is a designated number of database entries. For EST database searching, all nucleotide sequences or the query must be translated in six frames, generating additional erroneous hypothetical sequence; only one frame is the correct translation; we multiplied the number of entries by 6 to account for this degeneracy.
A set of EST database searches was performed using Mascot, relying upon its own statistics, and MultiTag, relying upon E-values for the determination of true versus false positives, as a trial to judge the sensitivity of MultiTag. Peak lists with corresponding intensities were generated automatically from tandem mass spectra of protein digests and were submitted to Mascot for database searching. Sequence tags were constructed by manual interpretation from the same set of tandem mass spectra, averaging ~9 sequence tags per protein digest, which required ~4 minutes of spectrum interpretation per tag. Sequence tags were used for database searching and the results were analyzed by MultiTag. EST sequences from Xenopus laevis (30) were used as the reference for protein identification by both methods. In general, MultiTag can make direct identifications or cross-species identifications.
From the model data set, Mascot was able to recognize 49 peptides with optimized settings, making 20 identifications (Table 3). From the same dataset, MultiTag was able to recognize 87 peptides and produced 31 identifications, which included all of the Mascot identifications. Whereas many identifications are statistically at the borderline by Mascot, MultiTag was able to increase the coverage of these alignments by error-tolerant matching of partial peptides to provide more evidence for a positive identification in 12 cases. Furthermore, MultiTag was able to make significant alignments in 11 cases where Mascot could not. Where MultiTag was able to make a significant alignment and Mascot produced no significant matches, the top five Mascot hits were inspected to see if the same protein had been nonconfidently detected. Since it is possible that MultiTag may recognize one specific EST with a query and Mascot may recognize a different EST corresponding to the same cDNA sequence, these top hits were carefully inspected to find alternate ESTs matching the same protein sequence. Where MultiTag was unable to make a significant alignment, the top five MultiTag hits were manually inspected by overlaying the retrieved peptide sequence "on the spectrum" using BioAnalyst QS, and comparing the observed fragment ions with theoretically calculated fragment ions (at a precision of 0.001 m/z), taking into consideration abundant a, b and y series ions, and immonium ions (see (27) for nomenclature). In this manner, MultiTag was able to detect 6 additional matches; 3 of these 6 were not in the Mascot top 5 hits (data not shown). This suggests that the MT method can also retrieve true matches that are not statistically significant; single hits below threshold should be manually inspected if no other alignments are made.
From these data, MultiTag proves to be a sensitive method for EST database searching, both because more identifications were made than with the conventional software and because more peptides were identified in total.
It will be understood that the invention has been described by way of example only and modifications may be made whilst remaining within the scope and spirit of the invention.
References
(1) Anderson, N. L.; Matheson, A. D.; Steiner, S. Curr Opin Biotechnol 2000, 11, 408-412.
(2) Mann, M.; Hendrickson, R. C.; Pandey, A. Annu Rev Biochem 2001, 70, 437-473.
(3) Peng, J. M.; Gygi, S. P. J Mass Spectrom 2001, 36, 1083-1091.
(4) Fenyo, D. Curr Opin Biotechnol 2000, 11, 391-395.
(5) Liska, A.; Shevchenko, A. Proteomics 2003, 3, 19-28.
(6) Pearson, W. R. Genomics 1991, 11, 635-650.
(7) Mackey, A. J.; Haystead, T. A.; Pearson, W. R. Mol Cell Proteomics 2002, 1, 139-147.
(8) Altschul, S. F.; Madden, T. L.; Schaffer, A. A.; Zhang, J.; Zhang, Z.; Miller, W.; Lipman, D. J. Nucleic Acids Res 1997, 25, 3389-3402.
(9) Shevchenko, A.; Sunyaev, S.; Loboda, A.; Bork, P.; Ens, W.; Standing, K. G. Anal Chem 2001, 73, 1917-1926.
(10) Habermann, B.; Sunyaev, S.; Shevchenko, A. submitted
(11) Nimkar, S.; Loo, J. A. Proc. 50th ASMS Conference on Mass Spectrometry and Allied Topics, Orlando FL 2002; Abstract 334.
(12) Mann, M.; Wilm, M. Anal Chem 1994, 66, 4390-4399.
(13) Mann, M. Trends Biochem Sci 1996, 1, 494-495.
(14) Kuster, B.; Mortensen, P.; Andersen, J. S.; Mann, M. Proteomics 2001, 1, 641-650.
(15) Shevchenko, A.; Keller, P.; Scheiffele, P.; Mann, M.; Simons, K. Electrophoresis 1997, 18, 2591-2600.
(16) Shevchenko, A.; Wilm, M.; Vorm, O.; Mann, M. Anal Chem 1996, 68, 850-858.
(17) Perkins, D. N.; Pappin, D. J.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(18) Feller, W. An introduction to probability theory and its applications; John Wiley & Sons, Inc.: New York London Sydney, 1966.
(19) Baudouin-Cornu, P.; Surdin-Kerjan, Y.; Marliere, P.; Thomas, D. Science 2001, 293, 297-300.
(20) Nurse, P. Cell 2000, 100, 71-78.
(21) Herrick, J.; Stanislawski, P.; Hyrien, O.; Bensimon, A. J Mol Biol 2000, 300, 1133-1142.
(22) De Robertis, E. M.; Larrain, J.; Oelgeschlager, M.; Wessely, O. Nat Rev Genet 2000, 1, 171-181.
(23) Graf, J. D.; Kobel, H. R. Methods Cell Biol 1991, 36, 19-34.
(24) Shevchenko, A.; Sunyaev, S.; Liska, A.; Bork, P.; Shevchenko, A. Meth Mol Biol 2002, 211, 221-234.
(25) Keller, A.; Nesvizhskii, A. I.; Kolker, E.; Aebersold, R. Anal Chem 2002, 74, 5383-5392.
(26) Wilm, M.; Mann, M. Anal Chem 1996, 68, 1-8.
(27) Biemann, K. Biomed Environ Mass Spectrom 1988, 16, 99-111.
(28) Neubauer, G.; King, A.; Rappsilber, J.; Calvio, C.; Watson, M.; Ajuh, P.; Sleeman, J.; Lamond, A.; Mann, M. Nat Genet 1998, 20, 46-50.
(29) Sunyaev, S.; Liska, A. J.; Golod, A.; Shevchenko, A.; Shevchenko, A. Anal Chem 2003, 75, 1307-1315.
(30) Blackshear, P. J.; Lai, W. S.; Thorn, J. M.; Kennington, E. A.; Staffa, N. G.; Moore, D. T.; Bouffard, G. G.; Beckstrom-Sternberg, S. M.; Touchman, J. W.; Bonaldo, M. F.; Soares, M. B. Gene 2001, 261, 71-87.
Table 1- Sequence Tags used in the Identification of Xenopus Proteins by MultiTag
[Table 1 is reproduced as an image in the original publication (imgf000025_0001).]
For each sample, sequence tags were constructed from multiple MS/MS spectra from the analysis of a single in-gel digest ("Tags Submitted") and error-tolerantly searched against a protein database (resulting list of entries not shown). The "Mass" column contains the corresponding complete mass for each sequence tag. Results were sorted by MultiTag. Groups of matching partial sequence tags resulted ("Matching Tags"). E-values for the group of partial sequence tags were calculated by the MultiTag software (first column in Bold). The final MultiTag report gave a list of database entries with diminishing E-values (data not shown). E-values are cited (column 1, "First False Positive") for the first database entry in the list to not correspond by annotated function (i.e. HSP 90) to the most significant hit. Protein identifications made by MS BLAST are found in the column "MS BLAST Identifications" and peptide sequences aligned are in "MS BLAST Alignments."
Table 2. MultiTag E-values are Dependent on Amino Acids in Tag, Number of Tags, Mass Accuracy, and Database Size
| Row | Mass | Sequence Tags in the Identification of DNA Polymerase | E-value, 1.0 Da* | E-value, 0.5 Da* | E-value, 0.1 Da* | PredCount, 0.1 Da* | E-value, 1,600,000 DB Entries** | E-value, 200,000 DB Entries** |
|---|---|---|---|---|---|---|---|---|
| 1 | 816.48 | (456.31)?T(704.42) | 6062.84 | 2726.4 | 2556.48 | 175.81 | 5112.97 | 639.12 |
| 2 | 827.49 | (345.25)QEL(715.43) | 196.09 | 88.27 | 83.40 | 1.23 | 166.81 | 20.85 |
| 3 | 846.49 | (385.28)LY(661.42) | 260.83 | 130.02 | 128.97 | 2.66 | 257.95 | 32.24 |
| 4 | 908.50 | LGG(810.44) | 8358.56 | 6533.57 | 6921.29 | 980.87 | 13842.6 | 1730.32 |
| 5 | 1131.65 | LPE(872.50) | 1726.02 | 1010.50 | 951.94 | 70.56 | 1903.87 | 237.98 |
| 6 | | LGG(810.44) + LPE(872.50) | 50.82 | 27.14 | 25.08 | Θ.23E-02 | 51.60 | 6.45 |
| 7 | | LGG(810.44) + LPE(872.50) + (456.31)?T(704.42) | 0.124 | 3.09E-02 | 2.86E-02 | 2.17E-05 | 5.72E-02 | 7.15E-03 |
| 8 | | LGG(810.44) + LPE(872.50) + (456.31)?T(704.42) + (385.28)LY(661.42) | 1.97E-05 | 3.23E-06 | 3.01E-06 | 7.68E-11 | 6.02E-06 | 7.53E-07 |
| 9 | | LGG(810.44) + LPE(872.50) + (456.31)?T(704.42) + (385.28)LY(661.42) + (345.25)QEL(715.43) | 1.49E-06 | 8.60E-07 | 8.14E-07 | 1.26E-16 | 1.63E-06 | 2.03E-07 |
| 10 | | LGG(810.44) + LPE(872.50) + (456.31)?T(704.42) + (385.28)LY(661.42) + (345.25)QEL(715.43) | 3.64E-09 | 4.17E-09 | 3.55E-09 | 1.31E-16 | 7.11E-09 | 8.88E-10 |
The E-value in Bold is shown in Table 1. PredCount values are in italics. "Da" corresponds to mass accuracy (in Da) used in database searches and input into MultiTag for calculations of E-values. In the calculation of E-values for row 1-9, all tags submitted were included from Table 1. In row 10, only the tags that matched the database entry were included in the list of tags submitted for MultiTag calculations. 800,000 and 200,000 database entries correspond approximately to the NCBI Nonredundant (nrdb) and SwissProt protein databases, respectively. "Mass" in column 2 indicates the full length of the peptide corresponding to the sequence tag in column 3. *800,000 database entries, **Mass accuracy of 0.1 Da.
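Because the E-value is the product of the database size and a probability that does not depend on that size, the database-size columns of Table 2 are related by simple proportionality. The following Python sketch (the function name is hypothetical) reproduces this for row 2 at 0.1 Da mass accuracy, within the rounding of the tabulated values.

```python
def rescale_e_value(e_value, old_entries, new_entries):
    """Rescale an E-value to a database with a different number of entries."""
    return e_value * (new_entries / old_entries)


if __name__ == "__main__":
    # Row 2 of Table 2: E-value 83.40 at 0.1 Da for 800,000 entries.
    print(rescale_e_value(83.40, 800_000, 200_000))
    print(rescale_e_value(83.40, 800_000, 1_600_000))
```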
Table 3. EST database searching: Mascot vs. MultiTag
[Table 3 is reproduced as an image in the original publication (imgf000027_0001).]
Total peptide hits: 49 (Mascot), 87 (MultiTag). Identifications: 20 (Mascot), 31 (MultiTag).
Two similar sets of purified proteins contributed to the identifications above. Peptides = no. of peptides matched to any single DB entry; Tags = no. of complete and partial tags matching any single EST sequence, X & Y (complete tags & partial tags, respectively); Tags in Query = no. of tags submitted in query; † not in top 5 hits; Mascot hits below threshold score in top 5 are in italics; apparent molecular weights (MW) in kilodaltons for corresponding gel bands.

Claims

1. An automated method of identifying a protein of unknown sequence, said method comprising
(i) using peptide sequence tags deduced from tandem mass spectra of peptides generated by cleavage of the protein of unknown sequence to search a database in an error-tolerant manner for known peptide sequences that match said peptide sequence tags;
(ii) assigning statistical significance to identified matches;
(iii) ranking identified matches by their significance;
(iv) generating a list of matching proteins ranked according to significance;
wherein the significance of a ranked protein is dependent on the number of matches between the peptide sequence tags and the protein and the level of degeneracy of the matches as compared to the expected number of sequences from a random database that would match the same combination of tags with the same level of degeneracy or with a more specific combination of tags.
2. A method according to claim 1, wherein the database is a protein or an expressed sequence tag database.
3. A method according to claim 1 or claim 2, wherein in order to identify peptide sequence tags from peptide fragments, each fragment of sequence M is divided into three parts, the first part m_N is the added mass of the residues between the N terminus of the peptide and the determined sequence, m_A is the determined sequence and m_C is the mass of all amino acid residues between the determined sequence and the C terminus of the peptide, and these masses are compared to predicted masses of sequences in the database.
4. A method according to any preceding claim, wherein in order to search a database in an error-tolerant manner, one of the regions m_N, m_A or m_C is allowed to mismatch.
5. A method according to any preceding claim, wherein a search tolerating a mismatch of region m_C matches regions m_N and m_A.
6. A method according to any preceding claim, wherein a search tolerating a mismatch of region m_N matches regions m_A and m_C.
7. A method according to any preceding claim, wherein a search tolerating a mismatch of region m_A matches regions m_N and m_C.
8. A method according to any preceding claim, wherein matches are labelled by the mass of the precursor ion and by the matching region, wherein the matching region is abbreviated as NC for a search result with a completely matching tag; N for a search result with m_N and m_A matching; E for a search result with one amino acid error; C for a search result with m_A and m_C matching.
9. A method according to any preceding claim, wherein redundant hits which match the same peptide sequence in another database entry are retained in the list of matches.
10. A method according to any preceding claim, wherein multiple matches between peptide sequence tags and the same protein entry in the database are recorded.
11. A method according to any preceding claim, wherein in step iv) redundant matches between a peptide sequence tag and the same peptide in a protein entry are recorded.
12. A method according to any preceding claim, wherein a combination of matches between peptide sequence tags and a protein entry is considered more specific either due to the higher number of tags that are matched or due to the lower degeneracy of the matches.
13. A method according to any preceding claim, wherein in step iv) all matches between peptide sequence tags and protein entries are assigned a significance value by computing an estimate of the probability that such a combination of peptide sequence tags may match a protein entry at random.
14. A method according to any preceding claim, wherein in step iv) the frequency of matches is compared to:
a) the probability that a given peptide sequence tag with a given type of degeneracy would match a random amino acid sequence;
b) the probability that a given combination of peptide sequence tags would match a random sequence, wherein said probability is computed as a product of the probabilities corresponding to individual matches;
c) the probability that any possible more specific (less likely) combination of peptide sequence tags than a given combination would match a random sequence;
and a statistical significance is assigned by multiplying the probability of step c) by the total number of protein sequences in the database.
15. A method according to claim 14, wherein the probabilities in steps a), b) and c) are given in accordance with the principles of equations 7, 8 and 9 above.
16. A method according to any preceding claim, wherein the specificity of matches is also used to rank the matches.
17. A method according to any preceding claim, where it is additionally required that the peptide obey the cleavage condition of the proteolytic enzyme, such that when trypsin is used for cleavage, the amino acid residue that is N-terminal to the match is Arginine or Lysine.
18. A method according to any preceding claim, wherein peptide sequence tags are called from the high m/z region of tandem mass spectra of peptides, which are dominated by abundant y-ions.
19. A software program for high throughput automated analysis of mass spectrometry data of a peptide sample, which software is configured to perform the steps recited in any one of the preceding claims.
20. A computer apparatus adapted to perform a method according to any one of claims 1 to 18.
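
The three-region matching recited in claims 3 to 8 can be illustrated with a minimal Python sketch. It is not the applicants' implementation. The function names `flank_matches` and `label_matches`, the mass tolerance `tol`, and the example sequences are hypothetical. The sketch assumes that m_N and m_C are plain sums of monoisotopic residue masses (no water, terminal groups or ion-type offsets), that the tag m_A is compared as a literal string, and it omits the E label (mismatch within m_A) and the precursor-mass part of the label.

```python
# Monoisotopic residue masses in Da (standard values; I and L are isobaric).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def flank_matches(residues, target, tol):
    """True if a run of residues, taken in the given order, sums to target."""
    total = 0.0
    if abs(total - target) <= tol:
        return True
    for aa in residues:
        total += RESIDUE_MASS[aa]
        if abs(total - target) <= tol:
            return True
        if total > target + tol:
            break  # masses only grow, so no later run can match
    return False


def label_matches(protein, m_n, tag, m_c, tol=0.5):
    """Return (position, label) for each occurrence of `tag` in `protein`.

    Labels follow claim 8: "NC" (all three regions match), "N" (m_N and m_A
    match, m_C may mismatch), "C" (m_A and m_C match, m_N may mismatch).
    """
    hits = []
    start = protein.find(tag)
    while start != -1:
        end = start + len(tag)
        # Walk outwards from the tag: backwards for m_N, forwards for m_C.
        n_ok = flank_matches(reversed(protein[:start]), m_n, tol)
        c_ok = flank_matches(protein[end:], m_c, tol)
        if n_ok and c_ok:
            hits.append((start, "NC"))
        elif n_ok:
            hits.append((start, "N"))
        elif c_ok:
            hits.append((start, "C"))
        start = protein.find(tag, start + 1)
    return hits


# Hypothetical tag read from the peptide AGSLVTEFR: m_N = AGS, m_A = LVT, m_C = EFR.
M_N = 71.03711 + 57.02146 + 87.03203
M_C = 129.04259 + 147.06841 + 156.10111
exact = label_matches("MKAGSLVTEFRPQ", M_N, "LVT", M_C)
c_flank_changed = label_matches("MKAGSLVTDFRPQ", M_N, "LVT", M_C)
n_flank_changed = label_matches("MKTGSLVTEFRPQ", M_N, "LVT", M_C)
```

In this example the unmodified entry is labelled NC, the entry with a substitution C-terminal to the tag is labelled N, and the entry with a substitution N-terminal to the tag is labelled C, which is how a tag can still hit a homologous sequence that differs from the analysed peptide.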
PCT/IB2004/000757 2003-02-06 2004-02-06 Method for predicting protein function WO2004070643A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB0302774.5 2003-02-06
GBGB0302774.5A GB0302774D0 (en) 2003-02-06 2003-02-06 Method for predicting protein function

Publications (2)

Publication Number Publication Date
WO2004070643A2 (en) 2004-08-19
WO2004070643A3 (en) 2005-04-14

Family

ID=9952580

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2004/000757 WO2004070643A2 (en) 2003-02-06 2004-02-06 Method for predicting protein function

Country Status (2)

Country Link
GB (1) GB0302774D0 (en)
WO (1) WO2004070643A2 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
DE102004051016A1 (en) * 2004-10-20 2006-05-04 Protagen Ag Method and system for elucidating the primary structure of biopolymers
CN100580414C (en) * 2006-12-15 2010-01-13 中国科学院植物研究所 A method for in-gel enzymatic hydrolysis of proteins
CN111243679A (en) * 2020-01-15 2020-06-05 重庆邮电大学 A storage and retrieval method for microbial community species diversity data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6446010B1 (en) * 1999-06-15 2002-09-03 The Rockefeller University Method for assessing significance of protein identification

Also Published As

Publication number Publication date
WO2004070643A3 (en) 2005-04-14
GB0302774D0 (en) 2003-03-12

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IT LU MC NL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 EP: the EPO has been informed by WIPO that EP was designated in this application
122 EP: PCT application non-entry in European phase