NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.
Koonin EV, Galperin MY. Sequence - Evolution - Function: Computational Approaches in Comparative Genomics. Boston: Kluwer Academic; 2003.
The ultimate goal of genome analysis is understanding the biology of each particular organism in both functional and evolutionary terms, which requires combining disparate data from a variety of sources. Reliable information resources, compiling data on sequenced genomes and linking it to the wealth of associated functional data, are indispensable for comparative genomics. The amount of genome-related information stored in public databases and freely available to anyone with Internet access is enormous. It has been our experience, however, that many researchers who should benefit the most from this information are not comfortable navigating these databases, let alone assessing the reliability of the data. This chapter is an attempt to bring the genomic databases closer to their principal users, molecular biologists and biochemists.
3.1. General Purpose Sequence Databases
To a computer scientist, developing a biological database might seem like a daunting task. Most fields are hard to define, and there always will be a need to create new ones. Assigning an object to a particular field is almost never final, and there are numerous exceptions to almost any rule. There is a lot of connectivity between different objects, and this, too, is subject to change. Small wonder that the problem of the optimal structure for a biological database is one of the most hotly debated topics at bioinformatics conferences and in such journals as Bioinformatics or Journal of Computational Biology. We chose to not address those questions (in which none of us is a true expert) here and, instead, refer the reader to several recently published books (see Further Reading at the end of this chapter). This chapter is intended for a naïve user (a biologist, not a computer scientist) and is limited to the discussion of the relative (dis)advantages of each of the available databases for certain common tasks.
3.1.1. Nucleotide sequence databases
What makes public nucleotide sequence databases so important for modern biology? To ensure the availability of the sequence data to the general public, none of the principal scientific journals would publish a paper describing a nucleotide or protein sequence unless this sequence has been deposited in one of the three major international nucleotide sequence databases: GenBank at the NCBI (Bethesda, Maryland, USA); the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database at the European Bioinformatics Institute (EBI) in Hinxton, near Cambridge, UK; and the DNA Database of Japan (DDBJ) at the National Institute of Genetics in Mishima, Japan. These databases form an International Nucleotide Sequence Database Collaboration and exchange updates on a daily basis, so that the DNA sequence information kept in each database is essentially the same and is arranged using common principles (see http://www.ncbi.nlm.nih.gov/projects/collab). Although data representation in GenBank, EMBL, and DDBJ might differ slightly, each nucleotide sequence has the same accession number in all three databases. The information stored in these databases is available to the public by anonymous ftp and through the World Wide Web. This means that one can connect to the web site of any of the three databases, GenBank (http://www.ncbi.nlm.nih.gov/Entrez), EMBL (http://www.ebi.ac.uk), or DDBJ (http://www.ddbj.nig.ac.jp), and get the same nucleotide sequence using the same accession number. Thus, a sequence with a given GenBank accession number could have been originally submitted to EMBL or DDBJ, and vice versa. In everyday practice, people often refer to the public nucleotide database simply as "GenBank" when they actually mean the combination of all three public databases.
Although the nucleotide sequence data in GenBank, EMBL, and DDBJ are the same, these three databases differ in the additional services that they offer. NCBI, for example, maintains several other databases in addition to GenBank, such as the Taxonomy database (see 3.7) and PubMed (see 3.8). Accordingly, each nucleotide entry at the NCBI web site is hyperlinked to the corresponding journal article in PubMed (if available) and to the taxonomic entry for the source organism.
3.1.2. Protein sequence databases
For most of the 20th century, biologists usually had at least some idea of what they were studying, and new sequences were coming from well-defined projects that investigated a particular protein or a group of proteins. As a result, the first protein sequence database, Atlas of Protein Sequence and Structure, created by Margaret Dayhoff in the early 1960's [172,173], contained very few uncharacterized proteins and was used mainly to document and investigate sequence diversity between homologous proteins (e.g. globins or cytochromes) from diverse organisms. This trend continued for a few years, even after the introduction of rapid DNA sequencing methods. However, with the rapid increase in gene sequencing rate in the early 1980's, more and more new protein sequences were derived from translation of anonymous pieces of DNA (or mRNA), first as a collateral benefit of sequencing the gene of interest and later through genome projects. This quantitative growth of sequence information was accompanied by a qualitative change that brought about several major problems. Although these problems could be considered just the issues of database quality control, they touch upon fundamental scientific questions.
The first problem is getting the correct protein set, i.e. correctly predicting the protein-coding regions in DNA sequences, for which there is no experimental evidence. Gene prediction historically had been one of the most important and complex aspects of computational biology (see 4.1), and getting the correct set of, say, human proteins, still remains a daunting task. The other related and equally challenging problem is separating the wheat from the chaff, i.e. deduced protein sequences that are most likely to be correct from frameshifted fragments, sequences of pseudogenes, proteins with erroneously assigned activities, and other entries that are suspicious in one way or another, and then properly annotating them. This is an area in which the authors of this book have amassed considerable (and often painful) experience, and we try to share it with the reader in this and the next two chapters. Finally, a paramount higher-level problem is introducing some sort of database hierarchy, i.e. classifying the proteins into families and superfamilies and perhaps higher taxa according to their evolutionary relationships and organizing information in the database according to this classification. From the database angle, all these issues are aspects of database curation, i.e. adding information to the entries through expert analysis. Different protein databases, including those most relevant to genomics, have adopted substantially different approaches to these problems. Usually, there is a certain trade-off between coverage and curation in a database: small, specialized databases typically deal with a single protein (super)family and are likely to be thoroughly curated. A good list of such databases is available at the ExPASy web site at http://www.expasy.org/alinks.html. Other databases strive to cover as much protein diversity as possible but offer only basic curation, if any. In this section, we briefly discuss the general-purpose databases. The more specialized ones are addressed further in this chapter.
Entrez Proteins
Currently, the principal sources of protein sequence data are translations of nucleotide sequences deposited in the GenBank/EMBL/DDBJ database. All three international databases provide these translations, but their protein sets differ in both form and content. The NCBI protein database (Entrez Proteins, http://www.ncbi.nlm.nih.gov/entrez) offers the simplest and most complete set of deduced proteins. Each protein sequence is assigned a unique GenInfo Identifier (gi) number; if the sequence is changed (e.g. expanded or merged with another sequence), the new sequence is assigned a new number. Obviously, this makes the database excessively large and redundant. The size and redundancy of Entrez Proteins are further increased by incorporating the protein sequences from the PIR and SWISS-PROT databases (see below). While this ensures completeness of the database, for most practical purposes (such as database searches), NCBI maintains a non-redundant (NR) protein database, in which identical sequences from the same source organism and all their fragments are merged into a single entry. Thus, the NR database includes all sequence variants, however minor. Unfortunately, it also includes variants arising due to sequencing errors or because different databases may differently treat certain sequence features (e.g. keep or remove the initiator methionine). The completeness of Entrez Proteins makes it the ultimate resource for almost any protein sequence. Once the desired protein sequence is found, it is always useful to follow the link to “Related Sequences”, which might show the same sequence from other databases, such as SWISS-PROT or PIR. Almost every Entrez Proteins entry is hyperlinked to the corresponding nucleotide sequence in GenBank (these links are absent in the records derived from PIR and SWISS-PROT). Each Entrez Proteins entry also has a link to the NCBI Taxonomy database (see 3.7), which allows one to examine the taxonomic position of the source organism. Many Entrez Proteins entries have links to PubMed (see 3.8); if the three-dimensional structure of the protein is known, there is a link to MMDB, the database of protein structures (see 3.4). For proteins associated with human diseases, a special database, OMIM (see 3.5), provides plenty of references and even some clinical information.
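The merging rule behind a non-redundant set can be illustrated with a toy sketch. This is not NCBI's actual procedure: the record layout (identifier, organism, sequence), the function name `collapse_redundant`, and all sequences below are invented for illustration. The sketch folds identical sequences and exact fragments from the same organism into one entry, while a variant that differs by even a single residue remains a separate entry.

```python
from collections import defaultdict

def collapse_redundant(records):
    """Merge identical sequences and exact fragments from the same organism.

    records: iterable of (identifier, organism, sequence) tuples (hypothetical layout).
    Returns a list of (organism, representative_sequence, [identifiers]).
    """
    by_organism = defaultdict(list)
    for identifier, organism, sequence in records:
        by_organism[organism].append((identifier, sequence))

    merged = []
    for organism, entries in by_organism.items():
        representatives = []  # each item: [sequence, [identifiers]]
        # Longest first, so a fragment is always examined after its parent.
        for identifier, sequence in sorted(entries, key=lambda e: -len(e[1])):
            for rep in representatives:
                if sequence in rep[0]:  # identical, or an exact fragment
                    rep[1].append(identifier)
                    break
            else:
                representatives.append([sequence, [identifier]])
        merged.extend((organism, seq, ids) for seq, ids in representatives)
    return merged

toy = [  # invented records
    ("id1", "E. coli", "MKTAYIAKQR"),
    ("id2", "E. coli", "MKTAYIAKQR"),      # identical copy from another source
    ("id3", "E. coli", "YIAKQR"),          # exact fragment of id1
    ("id4", "E. coli", "MKTAYIAKQK"),      # single-residue variant
    ("id5", "B. subtilis", "MKTAYIAKQR"),  # same sequence, different organism
]
nr = collapse_redundant(toy)
```

In this toy set, five records collapse into three entries. The single-residue variant `id4` survives as its own entry, which is exactly how sequencing errors end up preserved in a non-redundant set.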
There are several important things one needs to know to make sequence retrieval from Entrez Proteins effective. First, each entry is assigned a unique GenInfo Identifier (gi), which never changes. If the same sequence is imported from a different source (SWISS-PROT, PIR, genome sequence translation), it receives a new gi each time. When the nucleotide sequence is updated, the protein sequence also gets a new gi. This makes gi the most stable identifier for a given version of a given sequence. Second, SWISS-PROT and PIR entries, imported into Entrez Proteins, are reformatted and may not be identical to the entries in those databases. Third, a search in Entrez Proteins can be best performed by specifying the search fields, such as author name [AUTH], EC number [ECNO], gene name [GENE], and organism [ORGN], and connecting them with Boolean operators AND, OR, and NOT (see 3.8). Two convenient search options allow one to specify the sequence length [SLEN] of the desired protein and its molecular weight [MOLWT] in Daltons (both lower and upper limits must be entered here as 6-digit numbers with leading zeros, e.g. 018500:018800[MOLWT]).
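A fielded query of this kind is easy to assemble programmatically. The following minimal sketch only builds the query string. It does not contact any server, and the helper names `field` and `padded_range` are hypothetical. The field tags and the zero-padded range format are those described above.

```python
def field(term, tag):
    """Attach a search-field tag, e.g. field("Escherichia coli", "ORGN")."""
    return f"{term}[{tag}]"

def padded_range(low, high, tag, width=6):
    """Build a numeric range with leading zeros, e.g. 018500:018800[MOLWT]."""
    if low > high:
        raise ValueError("lower limit must not exceed upper limit")
    return f"{low:0{width}d}:{high:0{width}d}[{tag}]"

# Combine the fielded terms with a Boolean operator.
query = " AND ".join([
    field("Escherichia coli", "ORGN"),
    padded_range(18500, 18800, "MOLWT"),
])
print(query)  # Escherichia coli[ORGN] AND 018500:018800[MOLWT]
```

Formatting the limits in code removes the most common mistake with this option, namely forgetting the leading zeros.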
A critical aspect of the general-purpose databases, such as GenBank/EMBL/DDBJ, is that they are archival databases, which only serve as repositories of the submitted data. Curation in these databases is limited to the verification that each entry has the correct syntax and conforms to certain basic requirements, such as being free from vector contamination and actually encoding the predicted protein sequence. The responsibility for the correctness of the sequence and its annotation rests with the submitter, and accordingly, any updates or corrections must come through the submitter (third party annotation is not permitted). These ground rules are essential for preserving the integrity of the record, but they also have a substantial effect on the reliability and utility of the data, which is important for all users to keep in mind (discussed later in this chapter).
SWISS-PROT
In contrast to Entrez Proteins, which is composed of submitter-supplied translations of sequenced genes, the other two most commonly used protein databases, SWISS-PROT and PIR, were created and are curated by human experts. SWISS-PROT (http://www.expasy.org/sprot, mirrored on several web sites including http://us.expasy.org/sprot) was started by Amos Bairoch at the University of Geneva (hence the “Swiss” in SWISS-PROT) and is currently maintained by the Bairoch group in Geneva in collaboration with the EBI. SWISS-PROT strives to perform careful sequence analysis of each database entry [69]. New sequences are included into the database only after curation by expert biologists. In cases of discrepancies between several database entries for the same protein, a combined sequence is included in the database, and the variants are listed in the annotation. SWISS-PROT annotations include descriptions of the function of a protein, its domain structure, post-translational modifications, variants, reactions catalyzed by this protein, and similarities to other sequences.
The enzyme entries are cross-referenced with the ENZYME database (http://www.expasy.org/enzyme, see 3.6.3), the official database of the Enzyme Nomenclature Commission. As indicated above, the downside of such thorough curation is a relatively poor coverage of the protein diversity: the latest (June 6, 2002) release of SWISS-PROT contained only 110,419 entries. Wherever possible, SWISS-PROT entries are hyperlinked to various external databases, including literature citations from PubMed (see 3.8); nucleotide sequences from EMBL, GenBank, and DDBJ; protein motif and domain information from InterPro, Pfam, PROSITE, ProDom, and BLOCKS (see 3.3); and three-dimensional structures from Protein Data Bank (see 3.4). In addition to the accession number (e.g., P24182), each SWISS-PROT entry is assigned a 10-character name, which consists of a four-letter gene or protein name and a five-letter species abbreviation joined by an underscore (e.g. ACCC_ECOLI). These names are very convenient and are routinely used as protein identifiers in scientific literature and in this book, although, unlike accession numbers, they may change once new information becomes available. When no SWISS-PROT name is assigned to a protein, we use the Entrez Proteins gi numbers. Others, especially European researchers, routinely use identifiers from TrEMBL, a supplement to SWISS-PROT.
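Because SWISS-PROT entry names follow a fixed layout, they are easy to take apart in code. The sketch below is only an illustration: the regular expression is inferred from the description above, and its tolerance for shorter parts and for digits is an assumption. The function name `split_entry_name` is hypothetical.

```python
import re

# Assumed layout: protein/gene mnemonic, underscore, species code.
ENTRY_NAME = re.compile(r"^([A-Z0-9]{1,4})_([A-Z0-9]{1,5})$")

def split_entry_name(name):
    """Return (protein_part, species_part) or raise ValueError."""
    match = ENTRY_NAME.match(name)
    if match is None:
        raise ValueError(f"not a SWISS-PROT-style entry name: {name!r}")
    return match.group(1), match.group(2)

protein, species = split_entry_name("ACCC_ECOLI")
print(protein, species)  # ACCC ECOLI
```

An accession number such as P24182 contains no underscore and is rejected by this pattern, which makes the function a quick way to tell the two kinds of identifier apart.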
TrEMBL
To accommodate the growing influx of protein sequences without compromising the quality of SWISS-PROT, the protein translations of the EMBL nucleotide sequences that have not been properly curated by human annotators are put into a supplemental database, TrEMBL (Translated EMBL, http://www.expasy.org/sprot). This database serves as a kind of purgatory (or a “halfway house”) for SWISS-PROT [33]. Each TrEMBL entry is assigned a SWISS-PROT-type accession number that would stay with it when the sequence is finally manually checked and accepted into SWISS-PROT. To simplify curation, TrEMBL entries are even formatted in the SWISS-PROT style. However, one should be alert to the fact that TrEMBL entries are generated automatically, so their quality is not guaranteed and their annotations should not be considered as solid as those of authentic SWISS-PROT entries. In contrast to Entrez Proteins, which is updated daily, TrEMBL is produced in quarterly releases and may miss some of the latest data. On the other hand, TrEMBL is less redundant than Entrez Proteins (although it may also contain more than one entry for the same sequence). The TrEMBL release of May 31, 2002 contained 622,751 entries.
PIR
The PIR (Protein Information Resource, http://pir.georgetown.edu) database is an outgrowth of the Protein Sequence Database, originally created by Margaret Dayhoff [173], and is currently maintained at Georgetown University in collaboration with Munich Information Center for Protein Sequences (MIPS, http://mips.gsf.de/proj/protseqdb) in Munich, Germany and the Japanese International Protein Information Database [76]. While technically also a curated database, PIR is far less rigorous than SWISS-PROT in maintaining the quality of its annotations (our personal favorite is the annotation of the D. radiodurans protein DRA0097 as "probable head morphogenesis protein", see below). The advantage of PIR, however, is in its hierarchical organization. The June 2002 release of PIR contained 283,236 entries that were classified into ~100,000 protein families and ~30,000 superfamilies. Unfortunately, as one can see from these numbers, the definitions of protein family and superfamily employed in PIR are far narrower than those used in most of the other protein databases, particularly motif-based and structure-based ones (see 3.3 and 3.4). Thus, PIR superfamilies are often composed of very similar proteins, which may be treated by other databases as members of the same family. As a result, more distant relations between proteins (the least trivial and therefore the most interesting ones) are often not represented in PIR at all. Recently, PIR has intensified its protein classification efforts with the creation of iProClass (http://pir.georgetown.edu/iproclass, [922]), a protein classification database.
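The point about narrow family definitions follows from simple arithmetic on the figures quoted above. The short calculation below uses only those numbers; the family and superfamily counts are the approximate values given in the text, so the results are rough averages.

```python
entries = 283_236        # PIR entries, June 2002 release
families = 100_000       # approximate number of protein families
superfamilies = 30_000   # approximate number of superfamilies

entries_per_family = entries / families
entries_per_superfamily = entries / superfamilies
families_per_superfamily = families / superfamilies

print(f"{entries_per_family:.1f} entries per family")            # 2.8
print(f"{entries_per_superfamily:.1f} entries per superfamily")  # 9.4
print(f"{families_per_superfamily:.1f} families per superfamily")  # 3.3
```

On average, a family holds fewer than three entries and a superfamily fewer than ten, which is why such groupings rarely capture distant relationships.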
PRF
A small number of the protein database entries (<3,000 in Entrez Proteins) come from the Protein Research Foundation (http://www.prf.or.jp/en).
3.1.3. Reliability of database entries
A critical question that emerges with any use of a database is the reliability of the information. The problem stands differently with archival and expert-curated sequence databases. Both types of databases reflect the fundamental limitation of today’s genomics: only a small minority of genes in any sequenced genome or in the entire database have been characterized in direct experiments, whereas the great majority are annotated by transfer of information from the few characterized sequences on the basis of sequence similarity. With expert-curated databases, there is good reason to believe that, on most occasions, this information transfer is done responsibly and conservatively. However, one has to keep in mind that, because of the large number of sequences involved, the potential of sequence and structural analysis is rarely exploited to the fullest in these databases. The archival databases, which, because of their completeness, are searched and used for sequence retrieval most often, present an additional layer of problems caused by almost inevitable inconsistency of the approaches used by thousands of submitters for gene identification and protein annotation. Because only the submitting author can change the entry, erroneous and/or confusing annotations can linger in the databases for years. Therefore, it would be prudent to exercise some caution before drawing any far-reaching conclusions from the sequence annotation alone, particularly when that annotation is not supported by published research. To better recognize questionable database entries, it is important to understand the common sources of unreliable and even patently wrong annotations in sequence databases; these sources are briefly discussed and exemplified below (see also 3.3.4). In 4.4.4, we discuss how to avoid making such mistakes in genome analysis and annotation.
3.1.3.1. Non-critical transfer of annotation
As we already had a chance to mention more than once, the reality of today’s biological research is that only a small minority of protein sequences deposited in the databases have experimentally proven biological activity. Most of the time, the functions of the proteins encoded in the sequenced DNA are deduced on the basis of their similarity to previously characterized proteins. Hence there is a great potential for propagation of errors.
Curated databases are generally much more reliable and trustworthy. However, one should always keep in mind that all databases are compiled by humans, and not one is perfect. SWISS-PROT entries can usually be trusted, but even they can be misleading (see below). GenBank and PIR entries, particularly those coming from complete genome sequencing projects, are especially error-prone. In some cases, the result can be quite amazing. Consider, for example, the Entrez Proteins annotation of the protein DRA0097 from D. radiodurans (AAF12241, gi|6460535) as "head morphogenesis protein, putative". PIR curators just changed that annotation into "probable head morphogenesis protein" (PIR entry C75604). Ordinarily, as a bacterium, D. radiodurans would not be expected to form a head, so where could this annotation come from? It turns out that DRA0097 is closely related to the product of gene 7 of the B. subtilis bacteriophage SPP1, which is indeed required for the formation of the bacteriophage head, although its exact function is unknown. Non-critical transfer of the annotation of the best database hit, coupled with the truncation of the first word, produced a result that could be considered funny, if it were not virtually irreversible. The next example shows that, because of the constant data flow between GenBank and related databases, it is almost impossible to completely remove a wrong entry.
3.1.3.2. Sequence annotations from unpublished research
Let us look at another D. radiodurans protein, DR2227, which is annotated as phosphonopyruvate decarboxylase. Its closest relatives, also annotated as phosphonopyruvate decarboxylases, come from the complete genomes of A. fulgidus, A. aeolicus, M. jannaschii, M. thermoautotrophicum, T. maritima, etc., meaning that they all have been annotated on the basis of sequence similarity. Therefore these annotations should not be considered reliable: all of them may be correct, but all of them could be wrong as well. The only non-genome-project protein homologous to DR2227 is a protein from Streptomyces hygroscopicus, which is currently annotated as OrfZZ in Entrez Proteins (BAA93685, gi|7416071) and as BCPC_STRHY (Q54271) in SWISS-PROT. The story here is a useful illustration of the inherent conflict between the logic of scientific investigation and the tendency of the databases to provide a snapshot of the available data. In 1994, Haruo Seto and colleagues at the University of Tokyo investigated a cluster of genes responsible for the biosynthesis of bialaphos, an antibiotic produced by S. hygroscopicus, and sequenced a piece of DNA (GenBank accession no. D37809.1, gi|520856) that appeared to participate in this process [501]. The authors provisionally annotated one of the sequenced genes (gi|520857) as probable phosphonopyruvate decarboxylase, promptly noting that there was no experimental evidence for that annotation. In the next several years, Seto and coworkers demonstrated that phosphonopyruvate decarboxylase was a thiamine pyrophosphate-dependent enzyme, sequenced it [603,604], and in 1999 replaced the original entry with a new, corrected version (GenBank accession no. D37809.2, gi|5545270). To emphasize that the original sequence was not a phosphonopyruvate decarboxylase and that its function was unknown, that sequence was renamed OrfZZ (BAA93685).
However, in the course of those five years, the original annotation of OrfZZ as phosphonopyruvate decarboxylase, although never experimentally substantiated, made its way into several databases, including PIR and SWISS-PROT, and was used to annotate the homologs of OrfZZ encoded in the genomes of A. fulgidus, A. aeolicus, M. jannaschii, T. maritima, and other prokaryotes. Even though the original incorrect annotation has been purged from the database, the ghosts it spawned still remain there and confuse scores of new annotators. As a result, several new proteins submitted to GenBank long after purging of the misannotated “phosphonopyruvate decarboxylase” were still annotated the same way, based on their similarity to misannotated proteins from A. fulgidus, A. aeolicus, M. jannaschii, and T. maritima. A detailed analysis of OrfZZ and related proteins showed that they are members of the alkaline phosphatase superfamily (see 3.3) and could function as phosphomutases, e.g. phosphoglycerate mutases [258,261]. Recently, this prediction has been experimentally confirmed [308,866] (see 2.2.6).
In the latter case, correcting the wrong annotation has been relatively straightforward, as there had been no experimental evidence whatsoever that the sequenced protein (OrfZZ) actually had the phosphonopyruvate decarboxylase activity. Such cases are caused by the strict requirement that no manuscript is accepted for publication unless the new sequences described in that manuscript are submitted to GenBank. When a manuscript describing a new sequence without sufficient experimental verification gets (justifiably) rejected at the stage of peer review, the sequence with its preliminary annotation would still linger in the database. Eventually, newly sequenced homologs of this sequence may get annotated as "protein related to" whatever was in that preliminary annotation. As the example above shows, such cases can become quite pervasive.
While the benefits of data exchange between the databases are obvious, it makes errors difficult to weed out. For example, even though the originally incorrect assignment of the protein P28176 (gi|401236) as thymidylate synthase was corrected in SWISS-PROT and the symbol TYSY_MYCTU was re-assigned to the (correct) SWISS-PROT entry O33306 (gi|2624286), it remained in Entrez Proteins until June 2001. In a similar case, although annotation of the M. tuberculosis protein Rv3018c (SWISS-PROT entry P31500, gi|399410) as dihydrofolate reductase has been recognized as erroneous in the M. tuberculosis genome annotation (gi|2791615) and this ORF has been included in TrEMBL under the new identifier O53265, this wrong annotation was lingering in both SWISS-PROT and in PIR (see Table 3.1) until June 2002.
The simple lesson from these cases is that one can trust an annotation of a protein in the database without further analysis only when this protein (or its close homolog) has been experimentally characterized and there is a trustworthy publication that supports the functional assignment.
3.1.3.3. Sequences with misinterpreted function
Unfortunately, a considerable number of erroneously annotated database entries seem to be “supported” by at least some experimental data. The most common scenario is apparently as follows. When a researcher clones a new gene, he or she usually looks for an open reading frame (ORF) that would complement an existing (known) mutation or produce an increase in the desired enzymatic activity. These effects, of course, can be caused by suppression of the mutation, provision of a missing cofactor or a transcriptional regulator, as well as a number of other mechanisms. For example, an ORF that complemented the hemG mutation in E. coli (HEMG_ECOLI, P27863) was initially correctly referred to as a “gene involved in the protoporphyrinogen oxidase activity” [746] but was later assumed to code for protoporphyrinogen oxidase itself [619], even though it represented a small flavodoxin-like protein, which usually constitutes only one of the several subunits of the dehydrogenase complex (Table 3.1). The apparent published experimental confirmation (“Cloning and identification of the hemG gene encoding protoporphyrinogen oxidase of Escherichia coli K-12”, [619]) makes such cases very difficult to recognize. The simplest way to identify them we could think of is based on the observation that such cases usually result in the database having two or more completely unrelated sequences assumed to perform the same function. While non-orthologous gene displacement resulting in such situations is common in nature, particularly among prokaryotes (see 2.2.5), each such case should be viewed with some suspicion. Table 3.1 lists several cases where the available experimental evidence does not seem sufficiently convincing to justify the current annotation of the protein. It seems likely that the functions of most of these proteins have been predicted erroneously.
Another group of misleading database entries includes cases where annotation of a protein, while technically correct, does not contain any useful biological information and should not be used for assigning functions to its homologs. Thus, M. jannaschii protein MJ1618, originally annotated as polyketide synthase CurC, is indeed homologous to one of the ORFs in an operon that encodes a polyketide synthase, which is responsible for the biosynthesis of an antibiotic, curamycin, in Streptomyces curacoi [86].
However, such an annotation is clearly flawed because (i) M. jannaschii evidently does not produce this antibiotic and (ii) polyketide synthase is a complex of several enzymes with different biochemical activities. In contrast, a detailed analysis of MJ1618 shows that it has statistically significant sequence similarity to several enzymes of the cupin superfamily [200], including phosphomannose isomerases, and, with reasonable confidence, can be annotated as a probable phosphohexomutase. Even annotating MJ1618 simply as a member of the cupin superfamily would make more sense than the “polyketide synthase” assignment.
The cases of “lost meaning” are especially common when the function of an experimentally characterized protein is complex and requires several words to explain, which does not fit into the preconfigured annotation fields. When such an entry is used to annotate an entire family of homologous proteins, confusion is almost inevitable. For example, in 1993, Michael Yarmolinsky and colleagues characterized two genes involved in the maintenance of bacteriophage P1 in the bacterial cell in the prophage form and gave them nice tongue-in-cheek names. The gene encoding the killer protein, which is responsible for cell death when the prophage is lost, was named "death on curing" (doc), whereas the gene encoding its antagonist was named "prevent host death" (phd) [502]. These puns did not go unnoticed, and homologs of these proteins in other organisms are now annotated either as Doc and Phd proteins, which is not very helpful for those unfamiliar with the original paper, or just as “analogues” (gi|1359617, gi|1359618, gi|1359619, gi|1359620), which is even less useful.
Although the problem of misleading database entries is quite serious, one cannot help enjoying these and other examples, which defy the rather common notion that sequence annotation is a tedious business. The following two items are remarkable for their attempts to properly reflect the uncertainty of the annotation:
gi|1968785, cDNA 5' end similar to similar to arrest-defective protein isolog (Homo sapiens), and gi|6522905, very hypothetical protein (Schizosaccharomyces pombe).
And here are some stimulating entries from the current version of Entrez Proteins:
Finally, an E. coli protein with gi|537235 has the following remarkable annotation: “Kenn Rudd identifies as gpmB [Escherichia coli]”. Although this protein is only distantly related to the E. coli phosphoglycerate mutase GpmA and, according to the latest results, is a broad-specificity phosphatase ([702], see 2.2.6 and 7.1), this is probably still a better way to introduce tentative annotations than any of the examples above. If readers of this book come across other exciting examples of creative protein annotation, the authors will be happy to hear about them.
3.1.3.5. Is there a way out?
The problem of unreliable annotation is real and serious (although apparently not as drastic as sometimes predicted [89]) and is being dealt with through a number of approaches. Entrez Proteins, for example, now offers a new graphical viewer, BLink, which allows the user to list all the sequence neighbors of a given protein whose BLAST similarity scores exceed a chosen threshold. In addition, BLink shows the annotations of all those hits. Comparing different annotations of closely related proteins is a good way to select the correct one. Of course, as in the phosphonopyruvate decarboxylase case described above, all those annotations might be wrong. Another recent development in Entrez Proteins is the “Domains” link that shows the user the conserved domains from CDD (see 3.2) that are found in the protein in question. While GenBank annotation may still characterize a protein as “conserved hypothetical”, the “Domains” view may offer new clues to its functions.
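The comparison of neighbor annotations can be sketched in a few lines. The example below does not use BLink or any NCBI interface. It assumes the neighbor list has already been obtained by some means and is held as (identifier, score, annotation) tuples, a hypothetical layout, and all identifiers, scores, and labels are invented.

```python
from collections import Counter

def tally_annotations(neighbors, min_score):
    """Count annotations among neighbors scoring at least min_score."""
    return Counter(
        annotation
        for _identifier, score, annotation in neighbors
        if score >= min_score
    )

toy_neighbors = [  # invented data
    ("hit1", 410, "annotation A"),
    ("hit2", 395, "annotation A"),
    ("hit3", 380, "hypothetical protein"),
    ("hit4", 120, "annotation B"),  # falls below the threshold used here
]
counts = tally_annotations(toy_neighbors, min_score=200)
print(counts.most_common())
```

A tally like this shows where the annotations agree and where they conflict. It cannot show that the majority label is right, since a single unverified annotation can propagate to every neighbor.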
Curated databases constantly work at improving their annotations. For example, the most recent release of SWISS-PROT has finally corrected a number of the mistakes listed above. PIR descriptions now include the protein annotations from SWISS-PROT and NCBI's RefSeq, which helps identify the remaining discrepancies.
Finally, the ultimate way out may be through specialized domain and protein family databases that are discussed in the next section and in Chapters 4 and 5. By annotating protein families, rather than individual proteins, these databases are capable of taking care of the most common sources of annotation problems.
3.2. Protein Sequence Motifs and Domain Databases
The terms “protein sequence motif” and “protein domain” are widely used in biological literature for describing certain parts of proteins. The exact meaning of each of these terms is not easy to define because both are used in several, partially overlapping contexts. We would broadly define a protein sequence motif as a set of conserved amino acid residues that are important for protein function and located within a certain (short) distance from one another. These motifs can often provide clues to the functions of otherwise uncharacterized proteins. A protein domain is a structurally compact, independently folding unit that forms a stable three-dimensional structure and shows a certain level of evolutionary conservation. Typically, a conserved domain contains one or more motifs. Many proteins consist of a single protein domain, whereas others contain several domains or include additional, non-globular parts, e.g. signal peptides in membrane and secreted proteins. Some protein domains are “promiscuous” and can be found in association with a variety of other domains. Therefore, during protein sequence analysis, it is often advantageous to deal with one domain at a time. To facilitate annotation of multi-domain proteins, several popular databases contain extensive listings and descriptions of all identified protein domains. In subsequent chapters, we discuss the concepts of motif and domain and especially multidomain proteins in greater depth.
3.2.1. Motif databases
PROSITE: from patterns to profiles
The oldest and best known sequence motif database is PROSITE (http://www.expasy.org/prosite, mirrored in the US at http://us.expasy.org/prosite), maintained by Amos Bairoch and tightly integrated with SWISS-PROT [220]. For many years, PROSITE has been a collection of sequence motifs, which were represented and stored as UNIX regular expressions. For example, the famous P-loop motif, first described in 1982 by John Walker and colleagues as “Motif A” and found later in many ATP- and GTP-binding proteins (see 4.3.3), corresponds to a flexible loop, sandwiched between a β-strand and an α-helix and interacting with β- and γ-phosphates of ATP or GTP [880]. This motif is represented in PROSITE as
[AG]-x(4)-G-K-[ST]
which means that the first position of the motif can be occupied by either Ala or Gly, the second, third, fourth, and fifth positions can be occupied by any amino acid residue, and the sixth and seventh positions have to be Gly and Lys, respectively, followed by either Ser or Thr.
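Because PROSITE patterns are essentially regular expressions, the P-loop description above can be checked with a few lines of code. The following is a minimal sketch, not PROSITE's own software: the function name `prosite_to_regex` is hypothetical, the pattern string `[AG]-x(4)-G-K-[ST]` is the PROSITE-style notation corresponding to the description above, and the test sequence is an N-terminal fragment of a Ras-type GTPase used only for illustration.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE pattern into a Python regular expression.

    Handles single residues, [..] alternatives, {..} exclusions, x wildcards
    and (n) or (n,m) repeat counts. Anchors (< and >) are not handled.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split an element such as "x(4)" into its core and repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)

P_LOOP = "[AG]-x(4)-G-K-[ST]"
p_loop_regex = prosite_to_regex(P_LOOP)

# Illustrative fragment only; any protein sequence string can be used instead.
fragment = "MTEYKLVVVGAGGVGKSALTIQ"
hit = re.search(p_loop_regex, fragment)
print(p_loop_regex, hit.group(), hit.start() + 1)
```

The speed of such a search is the main attraction of patterns: each one is a single regular expression scan over the sequence.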
This approach to describing sequence motifs has both advantages and disadvantages. On the plus side, a comparison of a given sequence against all the patterns in the database can be performed very fast even with limited computational resources. Virtually any user could download the whole database (less than 5 Mb) and use it on the home computer. On the other hand, regular expressions cannot fully account for the whole sequence diversity and necessarily exclude certain deviant, but closely related, sequences (see 4.3.3). An attempt to relax the motifs to accommodate this sequence diversity makes some motifs quite fuzzy and, as a result, almost useless. For example, possible sites for N-glycosylation of an Asn residue
where {P} means any amino acid other than proline, or for phosphorylation of protein Ser and Thr residues
can be found in almost every protein. To improve description of such motifs, PROSITE authors have started supplementing patterns with rules and profiles (matrices).
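A rough calculation shows why very short patterns match almost every protein. The sketch below assumes, purely for illustration, that all 20 amino acids are equally frequent and independent, and it takes the N-glycosylation pattern to be N-{P}-[ST]-{P} (the form used in PROSITE); the function name and the protein length of 300 residues are arbitrary choices.

```python
def chance_match_probability(allowed_counts, alphabet_size=20):
    """Probability that a random window matches a pattern.

    allowed_counts lists, for each pattern position, how many of the
    20 residues are permitted there (uniform composition is assumed).
    """
    probability = 1.0
    for allowed in allowed_counts:
        probability *= allowed / alphabet_size
    return probability

# N-{P}-[ST]-{P}: 1, 19, 2 and 19 residues allowed at the four positions.
p_glyco = chance_match_probability([1, 19, 2, 19])
# P-loop, [AG]-x(4)-G-K-[ST]: 2, 20, 20, 20, 20, 1, 1 and 2 residues allowed.
p_ploop = chance_match_probability([2, 20, 20, 20, 20, 1, 1, 2])

protein_length = 300
expected_glyco = p_glyco * (protein_length - 4 + 1)
expected_ploop = p_ploop * (protein_length - 8 + 1)
print(round(expected_glyco, 2), round(expected_ploop, 4))
```

Under these assumptions a 300-residue protein is expected to contain more than one chance match to the glycosylation pattern, but far less than one chance match to the P-loop pattern, which is why the former carries little information on its own.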
A rule is a textual description of a complex pattern that allows one to indicate not just what amino acid residues are permitted in a particular position but also which of these residues are most frequent (i.e. best conserved). For example, PROSITE pattern PS00008 for the N-terminal myristoylation site
is supplemented by the following rule:
- The N-terminal residue must be glycine.
- In position 2, uncharged residues are allowed. Charged residues, proline, and large hydrophobic residues are not allowed.
- In positions 3 and 4, most, if not all, residues are allowed.
- In position 5, small, uncharged residues are allowed (Ala, Ser, Thr, Cys, Asn, and Gly); serine is favored.
- In position 6, proline is not allowed.
Here, “serine is favored” clearly indicates that not all small, uncharged residues are equal in position 5, but how strongly is it favored? To answer this question, one has to go to a more complex system of notation, such as a profile (matrix). For example, ankyrin repeats (PROSITE entry PS50088), which are responsible for the interaction of p53 with the p53-binding protein [297], of NF-κB with its inhibitor IκBα [389], and for many other important protein-protein interactions, are too diverse to be described by even a complex pattern or set of rules. Instead, it is easier to align all known ankyrin repeats and calculate the frequency of each amino acid residue at each position of the alignment. This operation would produce a matrix that would have 20 frequency numbers for the first position, 20 numbers for the second one, and so on. If the alignment contains gaps, the frequency of a gap in any given position would give us the 21st number. Also, because some sequences come from acid hydrolysis, which converts Asn into Asp and Gln into Glu, there traditionally are two more letters, B (either Asn or Asp) and Z (either Gln or Glu). In addition, X would stand for an unknown amino acid residue. As a result, one would end up with a matrix of the size 24 × L, where L is the length of the motif. In practice, for the purposes of sequence comparison, it is more convenient to use the logarithms of the frequencies rather than the frequencies themselves. PROSITE, like other tools (see 3.2), uses log-odds position-specific scoring matrices (PSSMs; see also 4.3 for further discussion of PSSMs and their use in sequence searches). See Figure 3.1.
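The construction of a log-odds matrix from an ungapped alignment can be sketched as follows. This is a simplified illustration rather than the procedure used by PROSITE: it keeps only the 20 standard residues (no gap, B, Z, or X rows), assumes a uniform background frequency, and adds a pseudocount of 1 to avoid taking the logarithm of zero. The toy alignment and all function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def log_odds_pssm(alignment, pseudocount=1.0):
    """Build a log-odds PSSM (list of per-column dicts) from aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # uniform background (assumption)
    pssm = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm

def score(pssm, segment):
    """Sum the column scores of a segment of the same length as the PSSM."""
    return sum(column[aa] for column, aa in zip(pssm, segment))

# Toy, made-up alignment of four P-loop-like segments.
TOY_ALIGNMENT = ["GAGGVGKS", "GPSGSGKT", "AVIGTGKS", "GDSGVGKS"]
toy_pssm = log_odds_pssm(TOY_ALIGNMENT)
print(round(score(toy_pssm, "GAGGVGKS"), 2), round(score(toy_pssm, "WWWWWWWW"), 2))
```

Positive scores mark residues that are more frequent in a column than expected by chance, so a segment resembling the aligned family scores well above an unrelated one.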
Clearly, while the above form of presentation might be perfect for a computer, it is challenging for a human to comprehend. We would be more comfortable with something perhaps less precise but capturing the crucial features of a motif. There are several ways to achieve this. Probably the most convenient one is a Sequence Logo [752], in which the height of each letter indicates the degree of its conservation, whereas the total height of each column represents the statistical importance of the given position (Figure 3.2).
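The letter and column heights of a sequence logo can be computed directly from one alignment column. The sketch below follows the usual information-content definition (column height equals the maximum possible entropy minus the observed entropy, in bits) and omits the small-sample correction that logo programs normally apply; the function name is hypothetical and the column is assumed to contain no gaps.

```python
import math
from collections import Counter

def logo_heights(column, alphabet_size=20):
    """Return (column height in bits, {residue: letter height}) for one column."""
    counts = Counter(column)
    total = len(column)
    freqs = {aa: n / total for aa, n in counts.items()}
    entropy = -sum(f * math.log2(f) for f in freqs.values())
    information = math.log2(alphabet_size) - entropy
    # Each letter's height is its share of the column's information content.
    letters = {aa: f * information for aa, f in freqs.items()}
    return information, letters

invariant_height, invariant_letters = logo_heights("KKKK")
mixed_height, mixed_letters = logo_heights("STST")
print(round(invariant_height, 2), round(mixed_height, 2))
```

An invariant column reaches the maximum height of log2(20), about 4.32 bits, whereas a column split evenly between two residues loses exactly one bit.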
Despite all the shortcomings of sequence patterns, PROSITE remains a very convenient tool for rapid protein sequence analysis. The textual descriptions of the sequence motifs and protein families that are characterized by these motifs are of special value. They offer a unique perspective of the functional diversity of proteins that may be quite similar sequence-wise. A reader of this book would likely benefit from spending ample time looking through PROSITE documentation files (http://www.expasy.org/cgi-bin/prosite-list.pl). Another useful exercise is to take some well-characterized protein sequences, perhaps those most familiar to the reader, and search them for PROSITE patterns using the ScanProsite tool (http://www.expasy.org/tools/scnpsite.html). Finally, those interested in creating their own sequence patterns should take a look at the “optimal way to deduce motifs” picture at http://www.expasy.org/images/cartoon/prosite.gif. This nice cartoon clearly explains why motif databases other than PROSITE do not attempt to create representative patterns for the selected motifs and keep their data simply as sets of alignments or matrices.
BLOCKS
The BLOCKS database (http://www.blocks.fhcrc.org, mirrored at http://bioinformatics.weizmann.ac.il/blocks) was developed by Steven Henikoff and coworkers at the Fred Hutchinson Cancer Research Center in Seattle, WA and is based on a completely different approach from that of PROSITE [345]. Each “block” in this database is a short ungapped multiple alignment of a conserved region in a family of proteins. These blocks were originally derived from proteins with PROSITE entries but were later expanded using data from many different sources. A part of the BLOCKS database entry for the ATP-grasp superfamily of proteins, which includes biotin carboxylase, carbamoyl phosphate synthetase, succinyl-CoA synthetase, D-alanine-D-alanine ligase, and several other enzymes ([262], see 3.3) is shown below.
Obviously, this sequence block can be easily converted to a PSSM. The BLOCKS alignments were used in developing the BLOSUM series of amino acid residue substitution matrices, which are currently employed in most sequence similarity search methods (see 4.2.1). Recently, the database has been updated and now includes blocks derived from Pfam, ProDom, PRINTS, and Domo motif and/or domain databases (see below).
The BLOCKS server allows one to search a given protein or nucleotide sequence against the blocks in the database; a nucleotide sequence will be translated in all six reading frames, and each translation will be checked. The BLOCKS database also has an important feature that allows the user to submit a set of sequences, create a new block, and search this block against the database. This option can be especially useful in cases where a standard database search finds several homologous proteins with no known function.
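Six-frame translation, as performed on nucleotide queries, is straightforward to reproduce. The sketch below uses the standard genetic code and is not the BLOCKS server's own code; the function names are hypothetical, stop codons are written as `*`, and codons containing characters other than A, C, G, or T are translated as `X`.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frame_translations(dna):
    """Return translations of the three forward and three reverse frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(reverse[offset:])
    return frames

frames = six_frame_translations("ATGGGCAAATAA")
for name in sorted(frames):
    print(name, frames[name])
```

Each of the six resulting protein strings can then be compared against the blocks exactly as a protein query would be.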
Finally, an attractive feature of BLOCKS is that each sequence block in the database can be used for creating sequence logos, similar to the one in Figure 3.2. This option allows one to visualize the degree of sequence conservation in each block, which helps to memorize the principal conserved residues of each enzyme family covered in the database.
PRINTS
The PRINTS database (http://www.bioinf.man.ac.uk/dbbrowser/PRINTS), also referred to as “PRINTS-S: the database formerly known as PRINTS”, is, like BLOCKS, a collection of conserved sequence fragments in protein sequences [61,62]. In contrast to the BLOCKS database, PRINTS lists several conserved sequence blocks for each protein, which results in much smaller families than in BLOCKS. One can compare a sequence or even a library of sequences against the whole database using BLAST (see 4.3.3), making it a useful tool for identifying distant relationships among proteins. PRINTS data are now incorporated into the EBI's InterPro database (http://www.ebi.ac.uk/interpro, see 3.2.3) and can be searched at the InterPro web site.
3.2.2. Domain databases
Pfam
The Pfam database [80] was jointly developed by three groups in the UK, the USA, and Sweden and is now available at the web sites of the Sanger Centre (http://www.sanger.ac.uk/Software/Pfam), Washington University in St. Louis (http://pfam.wustl.edu), and the Karolinska Institute in Stockholm (http://www.cgr.ki.se/Pfam), as well as on the web site of INRA in France (http://pfam.jouy.inra.fr). Pfam contains protein sequence alignments that were constructed using hidden Markov models (HMMs, see 4.3.3). In contrast to Entrez Proteins, SWISS-PROT, and PIR, which include full-length protein sequences, Pfam is a protein domain database. This means that a typical Pfam entry is not a protein sequence as in Entrez Proteins, SWISS-PROT, and PIR, or a sequence pattern as in PROSITE, but an alignment of the most conserved portions (“domains”) of many related proteins from the SWISS-PROT and TrEMBL databases. Although a typical Pfam alignment consists of 20–30 sequences, the entries PF00516 (glycoprotein GP120) and PF00096 (C2H2-type zinc finger) include more than 10,000 sequences each. Altogether, more than 60% of the proteins in SWISS-PROT are included in one or more Pfam alignments.
In addition to complete alignments, Pfam provides “seed alignments”. These include fewer proteins which are, nevertheless, sufficiently different to reflect the diversity of the members of each given Pfam family. Besides multiple sequence alignments, each Pfam entry contains an HMM for the corresponding family, which combines a PSSM (see above) with a measure of the probability of the appearance of a given amino acid at a given position as a result of a mutation (see 4.3.3). The availability of precomputed HMMs for each protein family in Pfam allows a relatively fast and sensitive search of any given protein against the Pfam database.
Because proteins often include more than one conserved domain, correctly identifying domains and their boundaries is a necessary prerequisite to a detailed sequence analysis. By storing complete domain alignments, rather than selected amino acid patterns or blocks, Pfam preserves the sequence information in its entirety. This makes it a powerful tool for domain identification. As with PROSITE, simple browsing of Pfam description files provides an informative tour of important protein families.
Another useful feature of Pfam is that it now includes a supplement, referred to as Pfam-B, for entries that are only being considered for inclusion in Pfam. Pfam-B serves as a temporary storage for those entries, which have not yet been manually curated, much like the repository TrEMBL provides for SWISS-PROT. Browsing Pfam-B entries and deciding whether they really belong to the corresponding Pfam family can be a useful training exercise, which might even result in unexpected findings. Pfam data have been incorporated into the EBI's InterPro (http://www.ebi.ac.uk/interpro) and the NCBI's CDD (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml) databases (see 3.2.3) and can be searched at their respective web sites. It is important to remember, however, that the search tool at InterPro (HMMer-based) is similar but not identical to the one used at the Pfam web site, whereas CDD uses a completely different algorithm (RPS-BLAST, see 4.3.3). As a result, one should not be surprised by differences in the Pfam search outputs with the same query sequence at those three web sites.
SMART
Like Pfam, the Simple Modular Architecture Research Tool (SMART, http://smart.embl-heidelberg.de and http://smart.ox.ac.uk), developed by Peer Bork's group at EMBL and Chris Ponting at the University of Oxford, consists of multiple domain alignments and the accompanying HMMs that are used to search the database [507,757]. Although a much smaller database than Pfam, SMART concentrates on the most common domains, particularly those involved in various forms of signal transduction. SMART alignments have been curated with the greatest care, and an attempt has been made to include even the most divergent representatives of each domain. This makes SMART a highly reliable and sensitive tool for domain identification.
SMART also includes an excellent graphical tool which, in addition to displaying all the SMART domains found in a given protein, shows predicted signal peptides, transmembrane segments, and regions of low complexity identified during the SMART search (see Chapter 4 for the discussion of methods used for the identification of these features in proteins). The June 2002 release of SMART contained alignments of 639 domains. The power of extensive HMM searches, performed by the SMART team, becomes clear from the following example. The PROSITE profile for ankyrin repeats (see above) is said to correctly recognize all 134 occurrences of this repeat in the SWISS-PROT database. In contrast, SMART reports as many as 4489 statistically significant occurrences of this repeat in 1158 proteins in the non-redundant database, including 108 ankyrin repeats in C. elegans alone. Data from SMART have been incorporated into the EBI's InterPro database (http://www.ebi.ac.uk/interpro) and NCBI's CDD database (http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml, see 3.2.3) and can be searched at their web sites, which, however, cannot rival the superior graphics capabilities of the original SMART web site (Figure 3.8).
ProDom
In contrast to Pfam and SMART, which are manually curated, the ProDom database (http://www.toulouse.inra.fr/prodom.html), developed by Jérôme Gouzy, Florence Corpet, and Daniel Kahn in Toulouse, France, is created largely automatically, based on the results of PSI-BLAST searches of the SWISS-PROT and TrEMBL databases [158,159]. As an automatic compilation of homologous domains, ProDom relies on fairly high threshold values for domain assignments. As a result, homologous sequences may end up being assigned to different domain families (see [260] for an example). Nevertheless, thanks to its colorful images, ProDom offers an easy and convenient way to visualize domain organization of proteins. Importantly, ProDom allows one to display all the proteins that share at least one domain with the given protein. This useful option is included in SWISS-PROT, which links its entries to ProDom. ProDom is also extensively interlinked with Pfam and provides a good graphical option for viewing Pfam alignments. ProDom data have been incorporated into InterPro (http://www.ebi.ac.uk/interpro, see 3.2.3) and are available through its unified interface.
COGs
Although the Clusters of Orthologous Groups of proteins (COG) database, maintained at the NCBI (http://www.ncbi.nlm.nih.gov/COG, [827]), is primarily a “genome-oriented” database and is described in more detail later in this chapter (see 3.4), we also mention it here because a comparison of orthologous proteins from phylogenetically distant organisms provides a powerful way to identify sequence motifs that are common to those proteins. This makes the COG database a convenient tool for motif and domain search, particularly because the annotation of many COGs is itself based primarily on their conserved motifs. Comparing a protein sequence against the proteins included in the COG database using the COGnitor program (http://www.ncbi.nlm.nih.gov/COG/xognitor.html) often allows one to identify conserved sequence motifs that are hard to recognize by other means.
3.2.3. Integrated motif and domain databases
The rapid growth of the domain-based databases, such as Pfam, SMART, ProDom, and others, made them a valuable resource for sequence similarity searches, conveniently supplementing EBI's SP-TrEMBL and the NCBI's non-redundant protein database (see also Chapters 4 and 5). In an effort to incorporate domain databases into their web sites, EBI and NCBI have created their own integrated domain databases, InterPro and Conserved Domain Database (CDD), respectively.
InterPro
Integrated Resource of Protein Families, Domains and Sites (InterPro, http://www.ebi.ac.uk/interpro) is an EBI database unifying protein sequences from SWISS-PROT with the data on functional sites and domains from PROSITE, PRINTS, ProDom, Pfam, and SMART databases [34]. InterPro entries are assigned their unique accession numbers and include functional descriptions and literature references. Each InterPro entry lists its matches in SWISS-PROT and TrEMBL. The family, domain, and functional site definitions of InterPro are expected to greatly simplify the automated annotation of TrEMBL by increasing both its efficiency and reliability. Notably, InterPro was used as the principal protein annotation resource during the analysis of the draft sequence of the human genome [488].
CDD
The NCBI's Conserved Domain Database and Search Service (CDD, http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml, [545]) is a collection of multiple alignments of protein domains from Pfam and SMART databases, supplemented with some alignments created by NCBI's own researchers. The alignments downloaded from Pfam and SMART are trimmed to leave only those positions that are represented in at least 50% of all aligned sequences, which determines the length of the consensus and the size of the corresponding PSSM. A compilation of these PSSMs can be used as a database for rapid sequence similarity search (CD-Search, Figure 3.8) using reverse position-specific BLAST (RPS-BLAST, see 4.3.3).
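The trimming rule described above, keeping only alignment columns that are occupied in at least 50% of the sequences, can be illustrated with a short sketch. This is not CDD's actual code: the function name, the gap character `-`, and the toy alignment are assumptions made for the example.

```python
def trim_alignment(alignment, min_occupancy=0.5, gap="-"):
    """Keep only columns in which at least min_occupancy of rows are not gaps."""
    n_rows = len(alignment)
    n_cols = len(alignment[0])
    keep = [
        col for col in range(n_cols)
        if sum(seq[col] != gap for seq in alignment) / n_rows >= min_occupancy
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]

# Toy, made-up alignment: columns 3, 6 and 7 are occupied in only 1 of 4 rows.
toy = ["MK-LV--A",
       "MKTLV--A",
       "MK-LI-GA",
       "M--LVW-A"]
trimmed = trim_alignment(toy)
print(trimmed)
```

The number of columns that survive the trimming sets the length of the consensus and hence the width of the PSSM built from the alignment.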
3.3. Protein Structure Databases
Three-dimensional (3D) protein structures are much harder to determine than primary sequences, but they are, at least in some respects, more informative. Knowledge of atomic coordinates leads to elucidation of the active site architecture, packing of secondary structural elements, patterns of surface exposure of side-chains, and relative positions of individual domains. Structural information is available only for a limited number of proteins, comprising ca. 600 distinct protein folds.
PDB
The Protein Data Bank (PDB) is a public repository of 3D structures of proteins and nucleic acids. Until recently, PDB was housed by Brookhaven National Laboratory; it is now maintained by the Research Collaboratory for Structural Bioinformatics (RCSB), which unites groups at the San Diego Supercomputer Center (http://www.rcsb.org/pdb), Rutgers University (http://rutgers.rcsb.org/pdb/), and the National Institute of Standards and Technology (http://nist.rcsb.org/pdb/). PDB is mirrored around the world, including fully supported mirror sites in the UK, Singapore, Japan, and Brazil.
Just as every nucleotide sequence has to be deposited in GenBank prior to publication, atomic coordinates of all proteins and nucleic acids whose structures have been solved have to be deposited in PDB. However, processing of the submissions in PDB differs from that in GenBank in several important aspects. First, nucleotide (and protein) sequences submitted to GenBank are released to the public immediately after the publication of the paper that describes these sequences, if not earlier. The structures submitted to the PDB may remain “on hold” for up to a year after the publication. This delay has been instituted to allow successful processing of patent applications spawned by the determination of 3D structures of important drugs and drug targets. This policy is under review, as many researchers argue for release upon publication, which is the standard in sequence databases. In any case, the list of structures awaiting release is available at the PDB web site (http://www.rcsb.org/pdb/status.html). Those willing to test their skills in predicting the 3D structures can download protein sequences whose 3D structures have been determined and submitted to PDB, but are still kept on hold, and subsequently compare the predictions with released structures.
Just as GenBank automatically checks newly deposited nucleotide sequence to ensure that it indeed encodes the protein it is claimed to encode, has a correct taxonomic assignment, and contains all the required fields, the PDB submission process (ADIT) includes a number of tests that automatically validate certain parameters, such as bond distances, torsion angles, names of heteroatoms, etc. This validation procedure helps to ensure the quality of the newly submitted structures. Finally, unlike Entrez Proteins, PDB does not index structures by the degree of their similarity. This task is performed by other databases, such as MMDB, FSSP, SCOP, or CATH, each of which relies on its own approach to protein structure comparison.
MMDB
The Molecular Modeling Database (MMDB), maintained by the NCBI Structure group (http://www.ncbi.nlm.nih.gov/Structure), is tightly linked to Entrez Proteins and offers the same convenient links to similar protein sequences and to the Taxonomy and PubMed databases. In addition, for each given entry, it allows the user to access the list of structural neighbors, calculated using the VAST algorithm, developed by Steven Bryant and colleagues at the NCBI [283,535]. VAST (Vector Alignment Search Tool) searches for topologically similar fragments (α-helices, β-strands) in proteins, which is useful in comparing distantly related proteins that have no detectable sequence similarity. The output of a VAST search can be ranked by percent identity between the aligned sequences, the length of the aligned region, the RMSD (root mean square deviation) between the superimposed elements, or VAST scores and probability values. VAST offers the user several ways to view the structural alignment of the selected proteins and generate structure-based sequence alignments, which can be especially useful for the identification and analysis of distantly related proteins (see 3.1).
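The RMSD used to rank structural neighbors is simple to compute once two structures have been superimposed. The sketch below assumes that the two coordinate lists are already optimally superimposed and contain equivalent atoms in the same order; finding that superposition, which is the hard part of structure comparison, is not shown. The function name and the coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two equally long lists of (x, y, z)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Two made-up sets of atom positions (in angstroms), already superimposed.
structure_1 = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
structure_2 = [(0.0, 0.5, 0.0), (3.8, -0.5, 0.0), (7.6, 0.5, 0.0)]
print(round(rmsd(structure_1, structure_2), 3))
```

Because the value depends on how many atoms are included, RMSD is best read together with the length of the aligned region.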
FSSP
The Fold classification based on Structure-Structure alignment of Proteins database (FSSP, http://www.ebi.ac.uk/dali/fssp), created by Liisa Holm and Chris Sander at the EBI, is produced by all-against-all structural comparisons of proteins with known three-dimensional structures using the DALI program [353,354]. DALI aligns protein structures based on the minimal RMSD of the carbon atoms in the main polypeptide chain (Cα atoms). For convenience, proteins with closely related structures are clustered together, and only structures representing substantially different proteins are compared and listed in the database. FSSP allows one to search for structural neighbors of each representative structure or to see the list of all indexed structures, rendered in a hierarchical format. Like MMDB, FSSP is convenient for generating structure-based sequence alignments of distantly related proteins. However, because these two databases use radically different approaches to the structural comparisons of proteins, they can sometimes complement each other. In cases of low structural similarity, it may be useful to compare the lists of neighbors of the given structure generated by both algorithms and examine the common hits and discrepancies between the two.
SCOP
In contrast to MMDB and FSSP, which both use automated procedures to generate their lists of structural neighbors, the Structural Classification Of Proteins (SCOP) is a manually curated database of protein structures, developed at the MRC Laboratory of Molecular Biology in Cambridge, England [522,590]. SCOP (http://scop.mrc-lmb.cam.ac.uk/scop, mirrored at http://scop.berkeley.edu) is a fully hierarchical database that classifies all protein structures into families of related proteins, structural superfamilies, folds, and structural classes (see also 8.1). All known structures are divided into eight classes, namely, all alpha proteins, all beta proteins, alpha and beta proteins with beta-alpha-beta units (α/β), alpha and beta proteins with segregated alpha and beta regions (α+β), multi-domain proteins (alpha and beta), membrane and cell surface proteins, small proteins, and coiled-coil proteins. Each class contains folds, which are further divided into superfamilies.
Manual curation of the protein taxonomy in SCOP supplements automatic structural comparisons with case-by-case analysis that takes into account results of sequence comparisons, conserved sequence motifs, functional data, and other information. As a result, SCOP assignments are often used as the ultimate authority on the structural similarity and evolutionary relatedness of proteins. Indeed, many SCOP superfamilies include proteins that, in addition to structural similarity, share other common features, such as similar substrate-binding sites, common enzymatic mechanisms, and so on (Table 3.2, see [468,846]). Typically, all proteins assigned by SCOP to the same fold show enough structural similarity to be considered homologous, although this may be questioned for some common folds (see 2.1.2).
To simplify assignment of new protein structures to folds and superfamilies, SCOP now offers a possibility to compare a protein sequence against the database, which allows one to determine its nearest relative with known 3D structure. In cases of sufficient sequence similarity, such comparison may yield important structural information.
CATH
The CATH database (http://www.biochem.ucl.ac.uk/bsm/cath_new), created by Janet Thornton and colleagues at University College London [633,661], is also a hierarchical classification of protein domain structures. CATH clusters proteins at four major levels, Class (C), Architecture (A), Topology (T), and Homologous superfamily (H). CATH classification also includes manual curation of each group, based on the secondary structure content (class), orientation of secondary structure elements (architecture), topological connections between them (topology), and finally, sequence and structure comparisons (homologous superfamily level).
3.4. Specialized Genomics Databases
Since the World Wide Web makes genome sequences available to anyone with Internet access, there are a variety of databases that offer more or less convenient access to essentially the same sequence data. However, there are several convenient web sites that provide useful additional information, such as phylogenetic relationships, operon organization, functional predictions, 3D structure, or metabolic reconstructions.
Entrez Genomes
Since complete genome sequences are, in fact, nothing more than extremely long nucleotide sequences, one can always retrieve them from the NCBI FTP site (ftp://ncbi.nlm.nih.gov/Entrez/Genomes) or the FTP sites of the appropriate sequencing center. The other two public databases, EMBL and DDBJ, maintain their own genome retrieval systems, referred to as EBI Genomes (http://www.ebi.ac.uk/genomes) and Genome Information Broker (http://gib.genes.nig.ac.jp), respectively. It is important to emphasize that, as with all records from archival databases, these genomes represent original submissions and are not immune to the errors mentioned earlier in this chapter. Sometimes, submitters update genome sequences and/or their annotations. This was done, for example, for the E. coli, H. influenzae, M. genitalium, and M. pneumoniae genomes. For most genomes, however, the sequence and its annotation remain unchanged. In order to provide updated (and unified) versions of complete genomes, NCBI has recently initiated the Reference Sequences project (RefSeq, http://www.ncbi.nlm.nih.gov/RefSeq/) that links the lists of gene products with some valuable sequence analysis information, such as predicted functions for uncharacterized gene products, frameshifted proteins, and so on (Figure 3.10).
Because NCBI also maintains a special BLAST page for searching unfinished genome sequences, contributed by various genome-sequencing centers (http://www.ncbi.nlm.nih.gov/sutils/genom_table.cgi), many researchers are confused about the status of all these different databases. One needs to clearly understand the distinction between the three kinds of data maintained at the NCBI. Complete genome entries in GenBank are kept exactly as submitted and can be changed only by submitters themselves. Genome entries in the Genomes division of Entrez, which can be identified by their NC_xxxxxx RefSeq accession numbers, are GenBank entries that have been curated by the NCBI staff. These entries are supplemented by various tables that present precomputed data on the taxonomic distribution of the best hits in the database for each protein in the given genome (see Chapter 6), COG assignments of these proteins (see below), neighbors with known three-dimensional structures, and results of sequence comparison against the CDD (see 3.2.3).
Finally, unfinished genome data, submitted by various genome sequencing centers (see Appendix 2), are available only for BLAST searches. They are protected from unauthorized access just like any GenBank submission, held until publication. The authors of this book, for example, have no more access to those data than anybody else in the world. For each BLAST hit, the user can get the corresponding DNA sequence and up to 1 kb of flanking sequence on each side. This allows the users to effectively search unfinished genome sequences for any protein (gene) of interest, at the same time preventing them from any large-scale genome analysis.
COGs
The Clusters of Orthologous Groups of proteins (COG) database ([827,828], http://www.ncbi.nlm.nih.gov/COG), already mentioned in the preceding section, has been designed to facilitate comparative genomic analysis and improve functional assignments of individual proteins. The latest COG release consists of 4,620 clusters of inferred orthologs from the completely sequenced genomes of bacteria, archaea, and unicellular eukaryotes. Each COG contains sets of proteins from at least three phylogenetic lineages.
Very briefly, the COG construction procedure included the following main steps: (i) all-against-all protein sequence comparison using BLAST; (ii) detection and clustering of obvious paralogs, i.e. proteins from the same genome that are more similar to each other than to any proteins from other species; (iii) detection of triangles of mutually consistent, genome-specific best hits, taking into account the paralogous groups detected at step (ii); (iv) merging triangles with a common side to form COGs; and (v) a case-by-case analysis of each COG to eliminate potential false-positives. Since orthologs typically perform the same function, delineation of orthologous families from diverse species allows the transfer of functional annotation from better-studied organisms to less-studied ones. The COGs are classified into 18 functional groups, which include uncharacterized conserved proteins and proteins for which only a general functional assignment (typically, prediction of biochemical activity but not the actual biological function) appeared appropriate (Figure 3.11).
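Steps (iii) and (iv) of this procedure can be sketched in Python as follows. This is a deliberately simplified illustration, not the actual COG software: it assumes that best hits have already been computed, treats a "triangle" as three proteins from three genomes that are all bidirectional best hits of one another, and omits the paralog handling of step (ii) and the manual curation of step (v). All function names and the toy data are hypothetical.

```python
from itertools import combinations


def find_triangles(best_hit):
    """best_hit maps (protein, other_genome) -> best-hit protein in that genome.
    Proteins are (genome, name) tuples. Returns triangles of mutual best hits
    whose three members come from three different genomes."""
    proteins = sorted({protein for protein, _ in best_hit})

    def mutual(a, b):
        return best_hit.get((a, b[0])) == b and best_hit.get((b, a[0])) == a

    return [
        frozenset(trio)
        for trio in combinations(proteins, 3)
        if len({p[0] for p in trio}) == 3
        and all(mutual(a, b) for a, b in combinations(trio, 2))
    ]


def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) using union-find."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) >= 2:
            parent[find(i)] = find(j)

    groups = {}
    for i, triangle in enumerate(triangles):
        groups.setdefault(find(i), set()).update(triangle)
    return sorted(sorted(group) for group in groups.values())


def symmetric(pairs):
    """Build a best-hit table in which every listed pair is a mutual best hit."""
    hits = {}
    for a, b in pairs:
        hits[(a, b[0])] = b
        hits[(b, a[0])] = a
    return hits


# Toy data: four genomes A-D, one protein each.
a1, b1, c1, d1 = ("A", "a1"), ("B", "b1"), ("C", "c1"), ("D", "d1")
hits = symmetric([(a1, b1), (a1, c1), (b1, c1), (a1, d1), (b1, d1)])
clusters = merge_triangles(find_triangles(hits))
print(clusters)
```

In the toy data, c1 and d1 are not best hits of each other, yet they end up in the same cluster because the triangles a1-b1-c1 and a1-b1-d1 share the side a1-b1. This is how the procedure can bring together orthologs whose direct similarity is weak.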
The COG database is particularly useful for functional predictions in borderline cases, where the protein sequence similarity is relatively low. Due to the diversity of proteins in COGs, sequence similarity searches against the COG database (available at http://www.ncbi.nlm.nih.gov/COG/xognitor.html or from ORF finder, http://www.ncbi.nlm.nih.gov/gorf, see 3.2) can sometimes suggest a possible function for a protein that otherwise has no clear database hits. This database also offers convenient tools for comparative analysis of complete genomes, particularly phyletic pattern analysis that we use widely throughout this book (see 4.2).
KEGG
The Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.ad.jp/kegg) is a part of the GenomeNet web site, created by Minoru Kanehisa and colleagues at the Kyoto University Institute for Chemical Research for comprehensive analysis of complete genomes. KEGG aims at using genome sequences for complete reconstruction of cellular metabolism and its regulation [414]. The KEGG web site presents a comprehensive set of metabolic pathway charts, both general and specific, for each of the sequenced genomes. The enzymes that have already been identified in a particular organism are color-coded, so that one can easily trace the pathways that are likely to be present or absent (Figure 3.12). For each of the metabolic pathways that it covers, KEGG also provides the lists of orthologous genes from all sequenced genomes that code for the enzymes participating in those pathways. It is also indicated whenever these genes are adjacent in the genome and form likely operons. A convenient search tool allows the user to compare two complete genomes and identify all cases where conserved genes in both organisms are adjacent or located close (within five genes) to each other. The KEGG site is continuously updated and is currently the best source of data for the analysis of metabolism in various organisms.
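The idea behind the two-genome comparison can be sketched as follows. This is not KEGG's code: the function name, the input format (gene order lists plus an ortholog table), and the reading of "within five genes" as a difference of at most five positions in the gene order are all assumptions, and chromosome circularity is ignored for brevity.

```python
def conserved_neighbors(order1, order2, orthologs, window=5):
    """Return gene pairs from genome 1 that lie within `window` positions of
    each other and whose orthologs in genome 2 are also within `window`
    positions of each other.

    order1, order2: gene identifiers in chromosomal order (hypothetical input).
    orthologs: dict mapping genome 1 genes to genome 2 genes.
    """
    position2 = {gene: index for index, gene in enumerate(order2)}
    # Keep only genome 1 genes that have an ortholog present in genome 2.
    mapped = [
        (index, gene, orthologs[gene])
        for index, gene in enumerate(order1)
        if orthologs.get(gene) in position2
    ]
    pairs = []
    for a in range(len(mapped)):
        i, gene_a, ortholog_a = mapped[a]
        for b in range(a + 1, len(mapped)):
            j, gene_b, ortholog_b = mapped[b]
            if j - i > window:
                break  # later genes are even further away in genome 1
            if abs(position2[ortholog_a] - position2[ortholog_b]) <= window:
                pairs.append((gene_a, gene_b))
    return pairs


genome1 = ["a1", "a2", "a3", "a4"]
genome2 = ["b2", "b1", "q1", "q2", "q3", "q4", "q5", "q6", "b4"]
ortholog_table = {"a1": "b1", "a2": "b2", "a4": "b4"}
print(conserved_neighbors(genome1, genome2, ortholog_table))
```

Here only a1 and a2 are reported: their orthologs are adjacent in the second genome, whereas the ortholog of a4 lies more than five genes away from either of them.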
WIT/ERGO
The WIT (What Is There) database was originally developed by Ross Overbeek and Evgeni Selkov at the Argonne National Laboratory in Argonne, IL [642]. It is currently maintained in two different variants, as the public WIT database at the Argonne web site (http://wit.mcs.anl.gov) and as ERGO (http://ergo.integratedgenomics.com/ERGO) at the Integrated Genomics web site, which is largely closed to the public. Like KEGG, this system combines diverse tools that assist in functional annotation. WIT/ERGO is best known for its operon search tool ([640,641]; see 4.6.2). Like COGs, WIT/ERGO delineates clusters of orthologous proteins and uses these clusters to assign functions to the uncharacterized members of each cluster. In contrast to other databases discussed in this section, which perform analysis of complete genomes only, WIT/ERGO also includes proteins from many partially sequenced genomes. This allows this system to offer many more sequences of the same protein from different organisms than any other database, which facilitates detection of additional members of the respective protein families and increases the utility of operon analysis. An interesting feature of WIT/ERGO is that it allows registered users to submit their own functional annotations and comments. Eventually, this might lead to true “community annotation” projects that would offer everybody an opportunity to participate in the process.
MBGD
The Microbial Genome Database for Comparative Analysis (http://mbgd.genome.ad.jp) at the University of Tokyo is another convenient tool for comparative analysis of completely sequenced microbial genomes. Like COGs, MBGD stores precomputed results of similarity searches between all the ORFs in the complete genomes and attempts to classify them into homology clusters. In contrast to COGs, however, MBGD assigns homology relationships based solely on BLAST searches with the arbitrary cut-off P-value of 10⁻² (see 4.2). MBGD contains a hierarchical list of cellular functions, classified into 16 principal functional groups, and allows one to list the genes that are responsible for a particular function in any given genome. After selecting the gene of interest, the user can search for homologs of this gene among all other sequenced microbial genomes.
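To make the consequences of a fixed cut-off concrete, the sketch below groups ORFs by single linkage: any two ORFs connected by a chain of hits at or below the threshold fall into one cluster. Single linkage is an assumption chosen for simplicity, and the clustering actually used by MBGD may differ; the function name and the toy hit list are hypothetical.

```python
def homology_clusters(hits, cutoff=1e-2):
    """Group ORFs into clusters by single linkage over thresholded hits.

    hits: iterable of (orf_a, orf_b, p_value) tuples (hypothetical format).
    Two ORFs are linked when their P-value is at or below `cutoff`.
    """
    neighbors = {}
    for orf_a, orf_b, p_value in hits:
        neighbors.setdefault(orf_a, set())
        neighbors.setdefault(orf_b, set())
        if p_value <= cutoff:
            neighbors[orf_a].add(orf_b)
            neighbors[orf_b].add(orf_a)

    seen, clusters = set(), []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first walk over everything reachable from `start`.
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(neighbors[node] - members)
        seen |= members
        clusters.append(sorted(members))
    return clusters


toy_hits = [("x", "y", 1e-30), ("y", "z", 5e-3), ("z", "w", 0.5)]
print(homology_clusters(toy_hits))
```

Note that x and z are placed in one cluster through y even though the y-z hit barely passes the threshold; with a permissive cut-off, such chains can join proteins that share only a single domain.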
PEDANT
The Protein Extraction, Description and ANalysis Tool (PEDANT, http://pedant.gsf.de), maintained at MIPS, is a useful web resource that presents results of extensive cross-genome comparisons using a variety of popular tools [245]. The available complete genomes and a number of unfinished genome sequences are analyzed using standard PEDANT queries, such as EC numbers, PROSITE patterns, Pfam domains, BLOCKS, and SCOP domains. Because these queries comprise some of the most common questions asked in genome comparisons, PEDANT can be used as a convenient entry point into the field of comparative genome analysis. For example, if you want to find out how many proteins in H. pylori have known (or confidently predicted) 3D structure or how many NAD-dependent alcohol dehydrogenases (EC 1.1.1.1) are encoded in the C. elegans genome, PEDANT provides an easy way to do that (Figure 3.10). Although PEDANT does not allow the users to enter their own queries, the variety of data available at this web site makes it an important tool for comparative and functional genomics.
TIGR Databases
The Institute for Genomic Research (http://www.tigr.org) maintains several useful databases, including the Comprehensive Microbial Resource (CMR, http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl), devoted to the analysis of bacterial and archaeal genomes [673], the TIGR Parasites Database (http://www.tigr.org/tdb/parasites) that provides links to protozoan sequencing projects under way at TIGR, and TIGR Gene Indices (http://www.tigr.org/tdb/tgi.shtml) that integrate the data from eukaryotic genome sequencing and EST projects.
The CMR combines information on all publicly available completely sequenced genomes with pre-publication data on the genomes sequenced at TIGR. It offers a variety of search and display options, including a convenient genome browser that lists, for each gene, the evidence on which the annotation is based (e.g. HMM match, BLAST match, or PROSITE match). It also allows the user to align the DNA sequences of any two microbial genomes using MUMmer [178]. The Restriction Digest Tool searches the genomic sequences for cutting sites recognized by the most commonly used restriction endonucleases. The results can be displayed in a variety of formats, including a genomic map showing the cutting sites, a list of the predicted restriction fragments, DNA sequences of these fragments, and an image showing predicted positions of these fragments in an agarose gel.
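The core of such a restriction digest search is simple string matching. The following minimal sketch is not the CMR tool itself: it assumes a linear DNA molecule (most microbial chromosomes are circular), a single enzyme with a palindromic recognition site, and a hypothetical function name. EcoRI, which recognizes GAATTC and cuts after the first G, serves as the example.

```python
def restriction_fragments(sequence, site, cut_offset):
    """Digest a linear DNA sequence with one enzyme.

    site: recognition sequence, assumed palindromic so that scanning one
          strand finds every site.
    cut_offset: number of bases from the start of the site to the cut
                on the scanned strand.
    Returns (cut positions, list of fragments).
    """
    sequence = sequence.upper()
    cuts = []
    position = sequence.find(site)
    while position != -1:
        cuts.append(position + cut_offset)
        position = sequence.find(site, position + 1)  # allow overlapping sites
    bounds = [0] + cuts + [len(sequence)]
    fragments = [sequence[a:b] for a, b in zip(bounds, bounds[1:])]
    return cuts, fragments


# EcoRI: G^AATTC
cuts, fragments = restriction_fragments("AAGAATTCTTTTGAATTCCC", "GAATTC", 1)
print(cuts)
print([len(fragment) for fragment in fragments])
```

Sorting the fragment lengths in decreasing order gives the band pattern expected on a gel, which is essentially what the graphical output of such a tool displays.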
Other CMR options include the possibility to retrieve from various genomes genes with the same biological role (e.g. genes involved in amino acid biosynthesis), genes encoding enzymes with the same EC number, common name, and other options similar to those in PEDANT. One can also search for predicted proteins with particular properties, such as isoelectric point, molecular weight, or the number of predicted transmembrane regions.
The CMR also includes its own version of clusters of orthologs similar to COGs and TIGRFAMs, protein clusters built using the HMM searches of protein sets encoded in the complete genomes with Pfam profiles. Finally, the CMR has a well-organized list of all the transfer and ribosomal RNAs encoded in the complete microbial genomes.
TIGR Gene Indices (http://www.tigr.org/tdb/tgi.shtml) currently contain tentative consensus sequences (clustered EST sequences) from 12 protists, including Cryptosporidium parvum, Dictyostelium discoideum, Leishmania major, Plasmodium falciparum, Toxoplasma gondii, and Trypanosoma brucei, six fungi, including Saccharomyces cerevisiae, Schizosaccharomyces pombe, Neurospora crassa, and Aspergillus nidulans, 12 plants, including Arabidopsis, barley, maize, potato, rice, soybean, tomato, wheat, and cotton, and 13 animal species, such as C. elegans, Drosophila, human, mouse, rat, pig, and zebrafish. Groups of tentative consensus sequences from different organisms that encode homologous proteins form the TIGR Orthologous Gene Alignment database (TOGA, http://www.tigr.org/tdb/toga/toga.shtml). TOGA differs from COGs and other similar databases in that, instead of proteins, it clusters and aligns DNA sequences. This results in small clusters composed of closely related sequences that are most likely true orthologs (see 3.1). As a consequence, however, many orthologous genes end up being assigned to different TOGAs.
An important application of TOGAs is the identification of orthologs of human disease genes, i.e. genes in which mutations cause hereditary diseases. These data are organized in a single large table (http://www.tigr.org/tdb/tgi/ego/human_dis_gene.shtml) that lists genes involved in the pathogenesis of a variety of human disorders. Each human disease gene is hyperlinked to the respective OMIM entry (see 3.5) and is accompanied by inferred orthologs from other organisms.
3.5. Organism-specific Databases
In addition to general genomics databases, numerous databases center on a particular organism or a group of organisms. While most of these databases are useful in some respect, those devoted to model organisms, such as E. coli, B. subtilis, yeast, C. elegans, Drosophila, and mouse, are probably the ones most widely used for functional assignments in other, less thoroughly studied organisms. For someone involved in functional genomics, it is important to be able to quickly verify the reliability of each database entry. Thus, if one has reasons to doubt the database annotation of a particular gene or protein (see 3.2), it often helps to check whether a functional assignment made from studies of a particular model organism is accepted by the community of researchers studying that organism. In addition, organism-specific web sites may contain additional information that is hard to fit into the standard annotation scheme (e.g. viability of mutants, availability of clones, or results of two-hybrid experiments). The following list is by no means complete or even representative; however, it covers the databases for model organisms that the authors find most useful in their own work and that are likely to similarly help other researchers.
3.5.1. Prokaryotes
Escherichia coli
The importance of E. coli for molecular biology is reflected in the large number of databases dedicated to this bacterium. The research groups of Fred Blattner at the University of Wisconsin-Madison (http://www.genome.wisc.edu) and Hirotada Mori at the Nara Institute of Science and Technology (http://ecoli.aist-nara.ac.jp), which independently sequenced the E. coli genome, maintain useful web sites devoted to the post-genomic analysis of E. coli genes.
Since the Blattner group has recently completed sequencing the genome of enteropathogenic E. coli O157:H7 and is currently involved in genome sequencing of other enteric pathogens, such as E. coli K1, Shigella flexneri, Salmonella typhi, and Yersinia pestis, their web site is most useful as a source of data on these bacteria. It also contains a list of E. coli genes that have been amplified using gene-specific primer pairs and are now available to other researchers. There is also a partial list of genes shown to be essential for growth in E. coli (http://magpie.genome.wisc.edu/~chris/essential.html).
The group led by Mori coordinates the Japanese GenoBase project (http://ecoli.aist-nara.ac.jp/docs/genobase/index.html), aimed at elucidating the functions of E. coli genes that currently remain uncharacterized. Their web site provides a convenient link from the genomic data to the Kohara restriction map of E. coli and allows one to search for the Kohara clones that cover the region of interest. The Japanese National Institute of Genetics maintains another useful database, called Profiling of Escherichia coli Chromosome (PEC, http://shigen.lab.nig.ac.jp/ecoli/pec), which contains a detailed description of each E. coli gene, including its location, Kohara clone that covers this gene, information on whether it is essential, results of PSI-BLAST searches of its product against the PDB, PROSITE motifs and Pfam domains present in this protein, and many other pieces of valuable information.
EcoGene (http://bmb.med.miami.edu/ecogene), a database of E. coli genes created by Kenneth Rudd, currently at the University of Miami, aims at providing curated sequences of E. coli proteins. This is a good place to look for frameshifted and potentially mistranslated proteins. For each E. coli gene, EcoGene provides a short description of its function, including alternative gene names and relevant references.
A useful web site (http://web.bham.ac.uk/bcm4ght6), aptly named “The E. coli index”, is maintained by Gavin Thomas at the University of Sheffield. It contains good links devoted to clinical strains of E. coli, but the major attraction is the list of recent functional assignments in E. coli. The compilation of genes that have been annotated since the completion of the genome sequence can be found in the “Completing the E. coli proteome” section (http://web.bham.ac.uk/bcm4ght6/genome.html), whereas the “What's new” section (http://web.bham.ac.uk/bcm4ght6/gennew.html) lists the latest experimental results.
The web site of the E. coli Genetic Stock Center at Yale University (http://cgsc.biology.yale.edu) lists all the mutant strains of E. coli available in its collection. It also provides gene linkage and functional information.
The GenProtEC (http://genprotec.mbl.edu) database, created by Monica Riley at Woods Hole, the Encyclopedia of E. coli Genes and Metabolism (EcoCyc, http://www.ecocyc.org), developed by Peter Karp, and RegulonDB (http://www.cifn.unam.mx/Computational_Biology/regulondb), maintained by Julio Collado-Vides, are interconnected databases devoted to the metabolic and regulatory pathways of E. coli.
Finally, the Colibri (http://bioweb.pasteur.fr/GenoList/Colibri) database at the Institut Pasteur is specifically designed for a molecular biologist doing experimental work on E. coli. It has a good web site with a convenient feature allowing the user to download the DNA sequence of a given E. coli gene with up to 1 kb of upstream and downstream sequence. This can be useful for designing PCR primers, searching for convenient restriction sites, delineating promoters and transcription regulator-binding sites, and many other applications. However, the Colibri web site has not been updated for a long time, so its functional information is not likely to be as up to date as that in other E. coli databases.
Bacillus subtilis
B. subtilis is a popular model organism for microbiological studies. Its genome, like that of E. coli, is the subject of an ongoing functional annotation project. In contrast to E. coli, the data collection is largely centralized, with the SubtiList web site, maintained at the Institut Pasteur (http://bioweb.pasteur.fr/GenoList/SubtiList), serving as a clearinghouse for all new information concerning the B. subtilis genome. Like Colibri, SubtiList allows the user to download the DNA sequence of a given B. subtilis gene with flanking regions, which can be useful for experiment design.
For phenotypes of various mutants, one can use the Micado (a.k.a. MadBase) database (http://locus.jouy.inra.fr) at INRA, France. This site also lists 110 B. subtilis genes that had been previously mapped but have not been identified in the complete genome.
Because sporulation and its regulation are among the most actively studied aspects of B. subtilis biology, there is a useful web-based index of B. subtilis sporulation genes, maintained by Simon Cutting at the Royal Holloway University of London (http://www.rhul.ac.uk/Biological-Sciences/cutting/index.html).
Synechocystis sp.
The genome of the cyanobacterium Synechocystis sp. PCC6803 was one of the first bacterial genomes to be sequenced [417]. The Kazusa DNA Research Institute in the Japanese Prefecture of Chiba, which carried out the Synechocystis genome sequencing, maintains CyanoBase (http://www.kazusa.or.jp/cyano), a database devoted to the post-genomic studies of cyanobacterial genes. Although most of the CyanoBase gene assignment data can be found elsewhere (in GenBank, COGs, KEGG, WIT, and many other databases), this site contains a useful list of Synechocystis mutants, sorted in the order of the chromosomal locations of the corresponding genes. For each mutant, the list includes whatever functional information is available and provides the address of the researcher who constructed this mutant. This resource is expected to grow rapidly, boosted by the recent completion of the genome of the second cyanobacterium, Anabaena (Nostoc) sp. PCC7120 [416].
3.5.2. Unicellular eukaryotes
Unicellular eukaryotes are targeted by a number of ongoing genome sequencing projects (see e.g. http://www.sanger.ac.uk/Projects/Protozoa), which generate a substantial amount of sequence data. Accordingly, there exist extensive databases and numerous web sites dedicated to Candida albicans, Dictyostelium discoideum, Entamoeba histolytica, Leishmania major, Neurospora crassa, Plasmodium falciparum, Pneumocystis carinii, and other unicellular eukaryotes (see Appendix 2). However, only yeasts Saccharomyces cerevisiae and Schizosaccharomyces pombe have been sufficiently studied biochemically to generate a database that would be useful in annotating other genomes. For this reason, yeast databases remain the major source of data for genome annotation for most other eukaryotes, including protozoa.
Yeast
The baker’s yeast Saccharomyces cerevisiae was the first eukaryote whose genome had been completely sequenced [290]; it is arguably the best characterized of all eukaryotic organisms. Several databases are specifically dedicated to functional analysis of the yeast genome, including three major ones, the Saccharomyces Genome Database (SGD) at Stanford University (http://genome-www.stanford.edu/Saccharomyces), the Yeast Database at MIPS (http://mips.gsf.de/proj/yeast), and Yeast Protein Database (YPD) at Proteome, Inc. (http://www.proteome.com/databases). All three resources, SGD, MIPS, and YPD, provide useful up-to-date information on the current status of the yeast genome analysis, including periodically updated lists of proteins with known or predicted functions, phenotypes of mutants (if available), protein-protein interactions, gene expression patterns, and other data. For each gene, there is a list of appropriate references that help in understanding its cellular role, even if the exact function remains unknown. Although there is a substantial overlap in the data between these three databases, it is often useful to check each of them when searching for information about a particular yeast protein. The SGD entries are interlinked with the yeast Gene Registry (http://genome-www.stanford.edu/Saccharomyces/registry.html) that keeps a complete list of all standard and non-standard names of S. cerevisiae genes. For the researchers who experimentally characterize yeast genes, this list includes useful links to SGD Gene Naming Guidelines and the Gene Registry Form. It also has a link to Global Gene Hunter (http://genome-www.stanford.edu/cgi-bin/SGD/geneform), a simple but convenient search engine that looks for the given yeast gene in SGD, YPD, PIR, SWISS-PROT, GenBank, PubMed, and Sacch3D (yeast protein structures) databases.
The MIPS yeast database serves as a resource for new results coming from the multinational EUROFAN project [199]. YPD is a commercial site but is free for academic users.
There are several other useful sites for yeast genome analysis. TRIPLES, TRansposon-Insertion Phenotypes, Localization, and Expression in Saccharomyces database (http://ygac.med.yale.edu), maintained by Michael Snyder's laboratory at Yale University, tracks the expression of transposon-induced mutants and the cellular localization of yeast proteins, tagged with the Tn3-derived minitransposon developed in the Snyder lab. This database also offers a convenient search for the phenotypes of insertion mutants, including insertions into unannotated short (<100 codons) open reading frames.
The Yeast Mitochondrial Protein Database (http://bmerc-www.bu.edu/mito) at Boston University presents a useful compilation of information regarding both proteins encoded in the mitochondrial genome and those encoded within the nuclear genome and post-translationally imported into the mitochondria.
Ron Davis' lab at Stanford University (http://genomics.stanford.edu) maintains the Saccharomyces Genome Deletion Project, aimed at creating and characterizing PCR-generated deletion mutants in every yeast gene. Although the complete database is currently open only to members of the consortium, the strains generated in the course of the project are available to other researchers and can be searched through the project web site. The Davis lab also maintains the Saccharomyces Cell Cycle Expression Database, which presents the available data on the changes in the mRNA transcript levels during the yeast cell cycle. A list of regulatory elements and transcriptional factors in yeast is kept in the Saccharomyces cerevisiae Promoter Database (http://cgsigma.cshl.org/jian) at Cold Spring Harbor Laboratory.
3.5.3. Multicellular eukaryotes
The Human Genome Project and related projects on complete genome sequencing of model organisms, such as nematode worm, fruit fly, pufferfish, mouse, and rat, resulted in a proliferation of web sites that attempt to make use of genomic sequence data. Only a few of them, however, are concerned with sequence annotation, that is, specialize in predicting genes and evaluating their probable functions. In this section, we review only those databases that are likely to help a beginner gene hunter in finding functional assignments.
Thale cress Arabidopsis thaliana
Arabidopsis thaliana, the first plant whose genome has been sequenced [35], is widely used as a model organism in plant biology. The Arabidopsis Information Resource (TAIR, http://www.arabidopsis.org), a collaboration between the Carnegie Institution of Washington Department of Plant Biology at Stanford University and the National Center for Genome Resources, a nonprofit organization in Santa Fe, New Mexico, serves as the principal resource on the Arabidopsis biology [360]. The primary sources for Arabidopsis genome annotation are the TIGR Arabidopsis thaliana Database (http://www.tigr.org/tdb/e2k1/ath1), MIPS Arabidopsis thaliana database (http://mips.gsf.de/proj/thal/db/index.html), and Stanford/Penn/PGEC database of Arabidopsis thaliana Annotation (DAtA, http://sequence-www.stanford.edu/ara/SPP.html). Useful web sites are also maintained at the Kazusa DNA Research Institute (KAOS, http://www.kazusa.or.jp/kaos) and Cold Spring Harbor Laboratory (http://nucleus.cshl.org/protarab).
Worm Caenorhabditis elegans
The nematode worm Caenorhabditis elegans has been one of the favorite models for developmental biology for many years. With the availability of the (almost) complete genome of C. elegans, it is now becoming a target of functional genomics efforts.
WormBase (http://www.wormbase.org or http://wormbase.sanger.ac.uk) is a unified public resource on C. elegans biology, jointly maintained by researchers from Caltech, Cold Spring Harbor Laboratory, Washington University, The Sanger Centre, and CNRS (France), with contributions from scientists from all over the world [804]. WormBase combines mapping and sequencing data with phenotypic information on C. elegans. It has a powerful search engine that allows one to search the database by allele name, gene name (predicted or confirmed), cosmid or YAC clone name, author name, or GenBank accession number. WormBase has a convenient sequence viewer that displays positions of predicted curated and uncurated genes, results of transcriptional profiling, and RNA interference (RNAi) experiments. WormBase also contains a Pedigree Browser showing the complete cell lineages for the male and hermaphrodite organisms and information on each cell.
WormPD (http://www.proteome.com/databases/index.html), like YPD, is a protein database maintained at Proteome Inc. [160]. It is a useful resource for annotation of C. elegans proteins that is being continuously updated. WormPD has a convenient search engine that allows one to search the database by keywords and/or categories (organismal role, biochemical function and cellular role, mutant phenotype, subcellular localization, molecular environment, post-translational modification, number of introns in the gene, and chromosomal location of the gene) as well as by properties of the predicted proteins (isoelectric point, molecular weight, codon adaptation index, and the number of potential transmembrane segments). By following the “WormPD Facts” link, the user can retrieve the updates made within the last week.
As the work on the C. elegans genome sequence continues, the web sites of the Sanger Centre (http://www.sanger.ac.uk/Projects/C_elegans) and Washington University (http://genome.wustl.edu/gsc/Projects/C.elegans) continue to serve as valuable data sources for sequence updates.
Fruit fly Drosophila melanogaster
FlyBase (http://flybase.bio.indiana.edu/), produced by a consortium of researchers at Harvard University, University of Cambridge, Indiana University, UC Berkeley, and the EBI and mirrored in Japan, Taiwan, Australia, France, and Israel, is the ultimate data source on Drosophila melanogaster and related species. It contains a wide variety of Drosophila-related links, including one to the Insect Biology and Ecology site at Cornell University (http://www.nysaes.cornell.edu/ent/biocontrol/info/primer.html), which provides the introductory information on Drosophila and other insects. Another good site for an introduction to the Drosophila world is the Drosophila Virtual Library (http://ceolas.org/fly/).
GadFly (Berkeley Drosophila Genome Project, http://www.fruitfly.org) is another comprehensive resource that allows the user to search Drosophila genome annotations by name, chromosomal position, molecular function, or protein domain. Research results on the development and functioning of Drosophila nervous system are collected by the FlyBrain database at the University of Arizona in Tucson (http://flybrain.neurobio.arizona.edu), which is mirrored at the web sites of University of Freiburg, Germany (http://flybrain.uni-freiburg.de/) and National Institute for Basic Biology in Okazaki, Japan (http://flybrain.nibb.ac.jp).
InterActive Fly (http://www.sdbonline.org/fly/aimain/1aahome.htm) is a superb collection of information on tissue and organ development in Drosophila, compiled by Thomas and Judith Brody [121] and hosted at the Society for Developmental Biology web site. It lists development-related genes by name (in alphabetical order), by biochemical function (e.g. transcription factors, receptors), and by developmental pathways (maternal genes or zygotically transcribed genes). For convenience, there is a separate listing of the most recent additions to the database. Arguably, the most interesting part of the database is the listing of 36 evolutionarily conserved developmental pathways, common for Drosophila and other organisms, such as vertebrates (http://www.sdbonline.org/fly/aimain/aadevinx.htm).
The Drosophila microarray project (http://quantgen.med.yale.edu) aims to define the gene expression patterns of Drosophila genes in vivo using DNA microarrays. It can be useful for predicting the function(s) of an unknown gene on the basis of coexpression with a previously characterized gene (see 5.2).
Finally, the Drosophila Community Portal at CyberGenome Technologies (http://www.cybergenome.com/drosophila) contains a good collection of protocols for experimental work on Drosophila.
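A common way to quantify coexpression is the Pearson correlation between two expression profiles measured across the same set of conditions. The sketch below illustrates that general idea only; it is not the method used by the project, and the gene names and expression values are invented for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long profiles.
    Assumes neither profile is constant (otherwise the denominator is zero)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def most_coexpressed(query, profiles):
    """Return the gene whose profile correlates best with that of `query`."""
    return max(
        (gene for gene in profiles if gene != query),
        key=lambda gene: pearson(profiles[query], profiles[gene]),
    )


# Hypothetical expression levels in four conditions.
profiles = {
    "unknown": [1.0, 2.0, 3.0, 4.0],
    "geneA": [2.0, 4.0, 6.0, 8.5],
    "geneB": [4.0, 3.0, 2.0, 1.0],
}
print(most_coexpressed("unknown", profiles))
```

A high correlation with a characterized gene is only a hint of a shared pathway or process and should be weighed together with sequence-based evidence.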
Human
Although a full description of the considerable bioinformatics activity spawned by the Human Genome Project is outside the scope of this book, several useful databases certainly deserve a mention.
Because, unfortunately, a substantial part of our knowledge about human genes comes from the analysis of hereditary diseases, Online Mendelian Inheritance in Man (OMIM™, http://www.ncbi.nlm.nih.gov/Omim), a catalog of human genes and genetic disorders, is probably the most important resource on the functions of human genes. This database is based on the book Mendelian Inheritance in Man by Victor McKusick and colleagues of Johns Hopkins University. The online version of the text and the database were developed at the NCBI. Recently, OMIM has become accessible through Entrez, and now it can be queried using the Entrez retrieval system (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM), just like other NCBI databases. OMIM is supplemented with the OMIM Morbid Map (http://www.ncbi.nlm.nih.gov/htbin-post/Omim/getmorbid), an alphabetic list of all the disease genes described in OMIM with their cytogenetic map locations. Because it is intended for use by physicians and patients who might be unfamiliar with the Entrez system, OMIM has its own extensive help file (http://www.ncbi.nlm.nih.gov/entrez/Omim/omimhelp.html), which contains detailed descriptions of the possible search strategies and databases linked to OMIM. In addition, there is a detailed list of frequently asked questions (http://www.ncbi.nlm.nih.gov/entrez/Omim/omimfaq.html).
The Genes and Disease (http://www.ncbi.nlm.nih.gov/disease) section of the NCBI web site features a collection of simplified descriptions, which are similar to those in OMIM but are easier to comprehend, contain fewer references, and are intended for a more general audience than OMIM. This is a good site for introductory reading on most common human genetic diseases, such as Alzheimer disease, phenylketonuria, Marfan syndrome, diastrophic dysplasia, muscular dystrophy, and many others. This site also contains useful information on the genetic roots of cancer, atherosclerosis, and obesity.
LocusLink (recently superseded by Entrez Gene, http://www.ncbi.nlm.nih.gov/entrez/query/static/help/genefaq.html) is an NCBI resource that provides a simple unified query interface to curated portions of human, mouse, rat, fruit fly, and zebrafish genomes. It can be used to search the RefSeq records that contain a variety of genetic information, such as official nomenclature, sequence accession numbers, EC numbers, UniGene clusters, dbSNP links, STS marker links, and other data. RefSeq records are created by a combination of automated data processing with subsequent manual curation by NCBI staff, which also adds links to the relevant publications in PubMed (see 3.8). The latest release of LocusLink included 20,582 human records, 32,014 mouse records, 4,164 records for rat, 18,879 records for Drosophila, and 1,194 records for zebrafish.
LocusLink entries are also interlinked with HomoloGene (http://www.ncbi.nlm.nih.gov/HomoloGene), a collection of homologous genes in human, mouse, rat, fruit fly, zebrafish, and cow genomes, obtained from published reports and by nucleotide sequence comparisons between ESTs from each pair of organisms. It includes over 7,000 putative orthologs in human, mouse, and rat genomes. In contrast, there are only ~200 putative orthologs found in human, rodent (mouse or rat), and zebrafish.
Genomic Information for Eukaryotic Organisms database, euGenes (http://iubio.bio.indiana.edu/eugenes), maintained at the Center for Genomics and Bioinformatics at Indiana University, Bloomington, presents data automatically collected from the primary databases and available through a single convenient interface [284]. Information available through euGenes includes gene name, gene symbol, its chromosomal location, function, structure, and sequence similarity information for the gene product. Table 3.3 shows summary data from euGenes on the status of sequencing and annotation of eukaryotic genomes. Although these numbers are preliminary and not necessarily reliable, they offer a glimpse of the future comparative genomics of eukaryotes.
3.6. Taxonomy, Protein Interactions, and Other Databases
3.6.1. Taxonomy databases
NCBI Taxonomy
To organize the sequence data in accordance with the existing phylogenetic classification of organisms, NCBI maintains its own Taxonomy database (http://www.ncbi.nlm.nih.gov/Taxonomy), which contains the names of all organisms that are represented in GenBank. The NCBI Taxonomy database attempts to provide a consensus, up-to-date taxonomy tree based on a variety of sources, including published literature, web-based databases, and advice of sequence submitters and outside taxonomy experts.
The database has a hierarchical structure with six root-level taxa, Archaea, Eubacteria, Eukaryota, Viroids, Viruses, and Unclassified (the latter group, for uncultured environmental samples). For convenience, plasmids and other synthetic constructs are grouped together as “Other”. The Taxonomy database offers a convenient way to extract nucleotide or protein sequences from all organisms that belong to a particular genus, family, or a higher taxon. The user simply needs to follow the tree to the desired taxon. In case the user is unsure about the exact spelling of an organism name, there is a nifty “phonetic search” option that will search for similarly sounding names, so that an unfortunate researcher who entered “Drozofila” as a search pattern would not be completely lost.
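How a phonetic search can match “Drozofila” to “Drosophila” is easy to picture with a classical phonetic key. The sketch below uses a simplified Soundex code purely as an illustration; the algorithm actually used by the Taxonomy database is not described here, so the choice of Soundex is an assumption.

```python
# Simplified Soundex: an illustrative phonetic key, NOT necessarily the
# algorithm used by the NCBI Taxonomy "phonetic search" option.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return a 4-character phonetic key (first letter plus three digits)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")          # vowels and h, w, y have no code
        if code and code != prev:         # skip repeats of the same code
            result += code
        if ch not in "hw":                # h and w do not separate repeats
            prev = code
    return (result + "000")[:4]           # pad or truncate to 4 characters


# The misspelled and the correct name collapse to the same key.
print(soundex("Drozofila"), soundex("Drosophila"))
```

Names that share a key would be offered as candidate matches; a real service would presumably rank them further, for example by edit distance.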
The Taxonomy database offers a useful tool that allows one to construct and display a taxonomy tree for a selected set of organisms. For the organisms that are most commonly used in molecular biology, such a tree can be obtained simply by going to the Taxonomy database homepage in Entrez (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Taxonomy), selecting the desired organisms and clicking on the “Display Common tree” option. After that, the user can edit the resulting tree by adding and deleting species and selecting a complete or abbreviated lineage for each of them. Of course, it has to be kept in mind that the tree obtained from this database is only a taxonomic dendrogram, rather than a true phylogenetic tree. Nevertheless, it offers a convenient view of the taxonomic relationships between the selected organisms, which often reflects the actual phylogeny. The same tool can be reached from the BLink page for any protein in Entrez Proteins (see 3.2.2) by selecting the “Common Tree” option.
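The idea behind such a common tree can be sketched in a few lines: each organism is described by its lineage, and lineages that share a prefix are merged into shared branches. The lineages below are heavily abbreviated and typed by hand for illustration; they are not copied from the Taxonomy database, and the function names are hypothetical.

```python
# Abbreviated, hand-typed lineages (illustrative only).
LINEAGES = {
    "human": ["Eukaryota", "Metazoa", "Chordata", "Mammalia",
              "Primates", "Homo sapiens"],
    "mouse": ["Eukaryota", "Metazoa", "Chordata", "Mammalia",
              "Rodentia", "Mus musculus"],
    "fruit fly": ["Eukaryota", "Metazoa", "Arthropoda", "Insecta",
                  "Diptera", "Drosophila melanogaster"],
}


def common_tree(lineages):
    """Merge lineages into a nested dict; shared prefixes become shared nodes."""
    tree = {}
    for lineage in lineages.values():
        node = tree
        for taxon in lineage:
            node = node.setdefault(taxon, {})
    return tree


def render(tree, depth=0):
    """Return the tree as indented text lines, one taxon per line."""
    lines = []
    for taxon in sorted(tree):
        lines.append("  " * depth + taxon)
        lines.extend(render(tree[taxon], depth + 1))
    return lines


print("\n".join(render(common_tree(LINEAGES))))
```

As the text notes, the result is a taxonomic dendrogram: it reflects the classification, with no branch lengths and no claim about the underlying phylogeny.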
Ribosomal Database Project
For those who want to see an actual phylogenetic tree of small subunit rRNA sequences, the place to look is the Ribosomal Database Project at Michigan State University (RDP, http://rdp.cme.msu.edu, mirrored at the Japanese National Institute of Genetics, http://wdcm.nig.ac.jp/RDP). The latest release of RDP provides numerous precomputed phylogenetic trees for various groups of organisms, accompanied by a sensible tutorial. These trees range from incredibly large ones, such as the full prokaryotic tree with 7,322 nodes, full eukaryotic tree with 2,055 nodes, and full mitochondrial tree with 1,503 nodes, to general trees for the domains Bacteria (197 nodes) and Archaea (107 nodes), to more specific trees, covering, for example, only the genus Escherichia (105 nodes) or genera Treponema and Spirochaeta (132 nodes). The trees can be viewed and edited using a Java-based viewer and saved as pictures or in the standard nested tree format that can be read using TreeView [646] and other programs. A phylogenetic tree for organisms with completely sequenced genomes is not yet available, although 16S rRNA sequences from most of them are included in at least some precomputed trees.
3.6.2. Signal transduction, regulation, protein-protein interaction, and other useful databases
TRANSFAC
The Transcription Factor database (TRANSFAC, http://transfac.gbf.de, also available at http://www.gene-regulation.de) compiles data on eukaryotic regulatory DNA elements and protein factors interacting with them [897,898]. It is maintained by Edgar Wingender and colleagues in Braunschweig, Germany and mirrored at several sites around the world. The database consists of six tables that cover transcription factor sites of various eukaryotes. The SITE table lists 4,504 individual regulatory sites within 1,078 eukaryotic genes. It also contains 3,494 artificial sequences derived from mutagenesis studies, in vitro selection procedures starting from random oligonucleotide mixtures, etc., and 417 consensus binding sequences, mostly taken from [219]. The GENE table provides short descriptions of each of these 1,078 genes, the FACTOR table (2,785 entries) describes the proteins that bind these sites, the CLASS table lists 39 classes of transcription factors, and the CELL table lists the cellular sources of these proteins. Finally, the MATRIX table (309 entries) provides nucleotide frequency matrices for some of the transcription factor binding sites.
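For readers unfamiliar with the term, a nucleotide frequency matrix simply records how often each base occurs at each position of a set of aligned binding sites. The sketch below builds such a matrix from a handful of made-up sites; the sequences are hypothetical and are not taken from TRANSFAC, and the actual layout of MATRIX entries may differ.

```python
from collections import Counter

# Hypothetical aligned binding sites of equal length (not TRANSFAC data).
SITES = ["TATAAA", "TATAAG", "TATATA", "CATAAA"]


def frequency_matrix(sites):
    """Return one {base: count} dict per alignment column."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for column in zip(*sites):            # iterate over alignment columns
        counts = Counter(base.upper() for base in column)
        matrix.append({base: counts.get(base, 0) for base in "ACGT"})
    return matrix


for position, row in enumerate(frequency_matrix(SITES), start=1):
    print(position, row)
```

Dividing each count by the number of sites turns the counts into frequencies, which is the form typically used to score candidate sites in a new sequence.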
TRANSFAC also includes a hierarchical Classification of Transcription factors (http://transfac.gbf.de/TRANSFAC/cl/cl.html).
BRITE
The Biomolecular Relations in Information Transmission and Expression database (http://www.genome.ad.jp/brite_old), a part of KEGG (see 3.5), has long been known as a useful collection of regulatory pathways, including cell cycle control pathways for human and yeast, developmental pathways of Drosophila, and enzyme regulatory mechanisms from KEGG. This database has been recently expanded (http://www.genome.ad.jp/brite) and is now intended to serve as a collection of diverse data on all possible kinds of relations between any two proteins. It includes data on generalized protein-protein interactions (e.g. from KEGG pathway diagrams), experimental data on protein-protein interactions obtained from yeast two-hybrid systems, sequence similarity relations calculated using the Smith-Waterman algorithm (see 4.3.2.1), expression similarity relations uncovered by microarray gene expression profiles, and cross-reference links between database entries. Because this site contains data from two large-scale studies of protein-protein interactions in yeast [384,859], it is currently most useful for the analysis of yeast protein function.
DIP
The Database of Interacting Proteins (DIP, http://dip.doe-mbi.ucla.edu) is a compilation of experimentally demonstrated protein-protein interactions. It was created by David Eisenberg and colleagues at the UCLA-DOE Laboratory of Structural Biology and Molecular Medicine to provide a tool for understanding protein function and protein-protein relationships, properties of networks of interacting proteins, and protein evolution [925,926]. The data on protein-protein interactions in DIP come primarily from yeast two-hybrid experiments, although other experimental techniques, such as co-purification, immunoprecipitation (co-immunoprecipitation), binding to affinity columns, in vitro binding assays, and others, are also represented. The DIP database consists of three hyperlinked tables that list: (i) protein information, (ii) protein-protein interactions, and (iii) details of experiments. An additional table links DIP to the YPD database (see 3.6.2). The latest release of DIP lists 3,472 interactions between 2,659 proteins, reported in 1,020 publications [925]. Although more than 80% of those interactions have been reported in a single experiment, they offer useful hints to the functions of otherwise uncharacterized proteins. Besides, the fraction of confirmed protein-protein interactions in DIP is steadily growing, such that, with time, the utility of this database is most likely to increase.
BIND
The Biomolecular Interaction Network Database (BIND, http://www.bind.ca) was originally developed by Chris Hogue at the Samuel Lunenfeld Research Institute at the Mount Sinai Hospital in Toronto and Francis Ouellette at the Center for Molecular Medicine and Therapeutics of the University of British Columbia in Vancouver. It was recently transferred to a new non-profit company, Blueprint Worldwide, to initiate arguably the most ambitious project in biological database building. BIND strives to unify protein sequence data with the information on protein-protein interactions and signal transduction pathways and plans to incorporate virtually all interactions between molecules, including proteins, nucleic acids, and small molecules [66]. In addition, there are plans to include photochemical reactions and conformational changes in proteins. Although BIND is only beginning to grow and its first pathway entries might seem cumbersome, it seems to have great potential.
BioCarta
BioCarta (http://www.biocarta.com) has assembled an impressive list of pathways (http://www.biocarta.com/genes/allPathways.asp), which are presented as appealing colorful images. The site offers a place for users' comments, discussion of the pathway itself, and submission of new pathways. If nothing else, the readers should visit this web site just to enjoy its graphics.
EPD
The Eukaryotic Promoter Database (EPD, http://www.epd.isb-sib.ch, [668]) was developed by Philipp Bucher and colleagues at the Swiss Institute for Experimental Cancer Research (ISREC). EPD is a curated non-redundant collection of 1,390 eukaryotic promoters with experimentally determined transcription start sites. Each entry contains a description of the initiation site, cross-references to other databases (EMBL/GenBank/DDBJ, LocusLink, UniGene, RefSeq, SWISS-PROT), and bibliographic references.
3.6.3. Biochemical databases
Biochemical Pathways Map
Anyone working on genome annotation should have a firm grasp of cell biochemistry and, ideally, should be able to quickly recall properties and functions of hundreds of different proteins. Since very few of us actually remember all the biochemical pathways, there are several useful resources that allow one to take a quick look at the biochemical pathways and figure out whether a particular annotation is plausible.
For many years, almost every molecular biology laboratory had a wall chart of biochemical pathways, created by the now retired biochemist Gerhard Michal at Boehringer Mannheim Corp. A hyperlinked version of this chart is now available at the ExPASy web site (http://www.expasy.org/cgi-bin/search-biochem-index). For convenience, the chart is split into 120 fields, each representing a small fraction of the map. One can take a look at the whole map or examine one or two adjacent fields. The names of the enzymes and metabolites on this map can be searched as keywords. In addition, the enzyme names are hyperlinked to the ENZYME database (see below), allowing one to associate a reaction with the amino acid sequence of the enzyme.
As previously mentioned (see 3.5), a useful collection of metabolic pathways is available at the KEGG web site (http://www.genome.ad.jp/kegg). KEGG charts are much simpler and cover individual pathways, such as glycolysis or TCA cycle.
ENZYME
The ENZYME database (http://www.expasy.org/enzyme, mirrored at http://us.expasy.org/enzyme) has already been mentioned in the description of SWISS-PROT (see 3.2.2). ENZYME is a convenient source of information on the official nomenclature of enzymes, based on the recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (http://www.chem.qmw.ac.uk/iubmb/enzyme). ENZYME lists all the enzymes that have been assigned Enzyme Commission (EC) numbers and describes them with respect to the EC number, recommended and alternative names, catalytic activity, cofactors (if any), and the diseases associated with the deficiency of the enzyme (if known). Enzymes with known sequences are linked to the corresponding SWISS-PROT entries. The names of substrates and products of the catalyzed reactions are linked to their structures in the Klotho (http://www.biocheminfo.org/klotho) database.
The Nomenclature Committee web site, mentioned above, also contains some useful information on enzymes, including lists of newly approved enzymes (http://www.chem.qmw.ac.uk/iubmb/enzyme/newenz.html), retracted EC numbers, reaction schemes, and references to the original publications.
KLOTHO
Klotho: Biochemical Compounds Declarative Database (http://www.biocheminfo.org/klotho), developed by Toni Kazic and colleagues, is a listing of 439 (bio)chemical compounds, shown in a variety of representations, including Fischer diagrams, SMILES strings, and actual 3D structures.
BRENDA
BRENDA (http://www.brenda.uni-koeln.de) is a comprehensive enzyme information system maintained by Dietmar Schomburg and colleagues at the Institute of Biochemistry of the University of Köln. In addition to the information listed in ENZYME, each enzyme entry in BRENDA is associated with up to 20 extra parameters, such as specific activity, turnover number, KM for various substrates, pH range, pH optimum, and pH stability, temperature range, temperature optimum, and temperature stability, inhibitors, molecular weight, and many others. Each entry is accompanied by extensive bibliography. While these data are extremely interesting from a purely enzymological standpoint, they also prove invaluable when one needs to evaluate the substrate specificity of an enzyme encoded in a newly sequenced genome or to decide whether a given gene product can catalyze a particular reaction (see 4.2).
LIGAND
The LIGAND database (http://www.genome.ad.jp/dbget/ligand.html) is part of the GenomeNet site, maintained by the Kanehisa laboratory at Kyoto University. LIGAND is a site dedicated to enzymes and their substrates and tightly interlinked with KEGG (see 3.5). Its entries are somewhat similar to those of ENZYME but contain a much larger library of structures of enzyme substrates, specifically drawn for this database using the ISIS/Draw program (MDL Information Systems, http://www.mdli.com). This allows the user to view and, if necessary, save those structures, which definitely helps to understand the function(s) of each particular enzyme. In addition, for each metabolic enzyme, LIGAND lists its representation in the completely sequenced genomes.
AAindex
The AAindex (http://www.genome.ad.jp/dbget/aaindex.html) is yet another database from the Kanehisa Laboratory that provides an exhaustive listing of various amino acid indices and similarity matrices [429]. Amino acids can be grouped on the basis of their physico-chemical and biochemical properties, such as the propensity to form an α-helix, a turn or a β-strand, hydrophobicity, polarity, bulkiness, and many others. AAindex currently lists 434 amino acid indices that all come in handy for one particular task or another. In addition, amino acids can be grouped based on their exchangeability in protein sequences, similar to the matrix shown in Figure 3.1. Again, these matrices can be very different, depending on the evolutionary distances between proteins, on whether they are soluble or membrane-bound, globular or non-globular, and so on (see 4.2.1). AAindex currently lists 66 such amino acid substitution matrices that all can be used for evaluating sequence similarity between protein sequences in different contexts.
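An amino acid index is nothing more than a table that assigns one number to each of the 20 amino acids. As an illustration, the sketch below uses the widely cited Kyte-Doolittle hydropathy scale to compute the average hydropathy of a peptide. The choice of this particular scale and the function name are ours; the sketch does not read AAindex files, whose format is not described here.

```python
# Kyte-Doolittle hydropathy values, one number per amino acid
# (one-letter codes). Chosen here only as a familiar example of an index.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def mean_index(sequence, index=KYTE_DOOLITTLE):
    """Average an amino acid index over a sequence.

    Raises KeyError for residues missing from the index (e.g. 'X').
    """
    if not sequence:
        raise ValueError("empty sequence")
    values = [index[residue] for residue in sequence.upper()]
    return sum(values) / len(values)


# A hydrophobic stretch scores high, a charged stretch scores low.
print(mean_index("LLIVAF"), mean_index("KRDEKR"))
```

Swapping in a different index (helix propensity, bulkiness, and so on) only requires passing another 20-entry table as the second argument.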
PMD
The Protein Mutant Database (PMD, http://pmd.ddbj.nig.ac.jp), maintained at the DDBJ, is a collection of literature references that describe various mutations, naturally occurring in proteins or induced by mutagenesis. PMD allows the user to submit a protein sequence that will be compared against the sequences in the PMD using straightforward text matching. If a match is found, the mutated amino acid residues will be indicated, linked to the articles that describe the respective mutations. However, the text matching tool in PMD is not particularly powerful and cannot recognize sequences with <30% identity. Nevertheless, PMD is the only database, other than SWISS-PROT, that consistently records mutation data, which could be useful in delineating the active sites of poorly characterized enzymes.
3.7. PubMed
PubMed (http://www.ncbi.nlm.nih.gov/PubMed, or just http://pubmed.gov) is definitely the most widely used database in biology. As of the time of this writing, PubMed lists more than 11 million scientific articles, which makes the ability to find the relevant reference promptly a useful skill that requires at least some experience. Because the NCBI web site contains a PubMed overview, a vast PubMed help file, a list of frequently asked questions about PubMed, and even a detailed down-to-earth tutorial (http://www.nlm.nih.gov/bsd/pubmed_tutorial/m1001.html), we consider here only several of the less trivial aspects of PubMed searches.
The first thing to know about PubMed is that, although it contains over 11 million citations, it does not cover, and was never intended to cover, all the biological literature. For example, PubMed has poor coverage of plant science, environmental research, and many other areas of biology that are not immediately related to human health. Also, PubMed lists very few papers published before 1965 (some papers from 1958 through 1965 are kept in the OldMEDLINE database, which is available through the NLM Gateway, http://gateway.nlm.nih.gov/gw/Cmd).
The full list of journals that are indexed by PubMed is available through the Journal Browser (http://www.ncbi.nlm.nih.gov/entrez/jrbrowser.cgi), which also allows browsing selected journals issue by issue.
3.7.1. Specifying the terms in a PubMed search
By default, PubMed looks for a combination of all the terms entered in a query (i.e. each term is treated as a required string). The simplest way to find a reference is to enter into the search field as many relevant terms as possible. This method works surprisingly well, especially for topics that happen to deviate from the mainstream health research. However, entering popular search terms like “AIDS” and “drug” would return more than 15,000 citations and force one to narrow down the search. PubMed has a special option, called “Single Citation Matcher”, which allows one to enter various bits and pieces of citation information (the author, year, title, and journal name of the publication).
Boolean operators
For more complex queries, it is best to specifically indicate the field(s) to be searched and to connect them with appropriate Boolean operators AND, OR, and NOT (PubMed requires that these be in caps). The operator AND is used in PubMed by default, so it only needs to be put in complex Boolean search patterns that contain different fields (see below). This operator requires that both terms connected by it appear in the citation; it is used to narrow down the search space. The operator OR allows either of the specified search terms to appear in the output; it is often used to expand the search space to include synonyms or otherwise similar subjects. The operator NOT is used to exclude certain terms from the search. The use of Boolean operators can dramatically improve search efficiency, especially when used in combination with terms from appropriately selected fields. Most of the fields used by PubMed are self-explanatory, but some are not. Use of the fields illustrated below is not entirely straightforward but may be convenient under certain circumstances.
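The effect of the three operators can be pictured as set operations on the lists of citations retrieved for each individual term. In the sketch below, the terms and the citation identifiers are invented for illustration and do not correspond to real PubMed records.

```python
# Hypothetical citation IDs returned for each term on its own (made up).
HITS = {
    "kinase": {101, 102, 103, 104},
    "phosphatase": {103, 104, 105},
    "review": {102, 104, 106},
}

# kinase AND phosphatase: both terms required, narrows the search.
both = HITS["kinase"] & HITS["phosphatase"]

# kinase OR phosphatase: either term suffices, expands the search.
either = HITS["kinase"] | HITS["phosphatase"]

# kinase NOT review: citations matching the second term are excluded.
without = HITS["kinase"] - HITS["review"]

print(sorted(both), sorted(either), sorted(without))
```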
Affiliation [AFFL]
Looking for publications by an author [AUTH] with a common last name, such as Smith or Green, can be frustrating. For example, a search for publications by Janet L. Smith (Smith JL[AUTH]), a Purdue University biochemist specializing in amidotransferases, returns 930 papers authored by various John L. Smiths from all over the world. However, entering the search pattern
Smith JL[AUTH] AND Purdue[AFFL]
allows one to retrieve a collection of papers by Janet L. Smith on various topics, not just amidotransferases, at the same time avoiding the sea of irrelevant citations. Of course, this search will miss the papers for which Purdue University is not entered as an affiliation, including the recent review of amidotransferase mechanisms [936], but that is a different headache.
Journal [JOUR] and Publication Date [PDAT]
It has probably happened to everyone: just before leaving for vacation, you read a particularly interesting paper in, say, Trends in Biochemical Sciences, but completely forgot what it was about. What you need to do is to browse back issues of the journal and try to figure out which paper it was. Short of going to the library, one can use the Journal Browser to retrieve all the papers from that journal:
"Trends Biochem Sci"[JOUR]
and further limit the output to the papers published in July and August 2000:
"2000/07"[PDAT]:"2000/08"[PDAT]
to come up with a simple search pattern:
"Trends Biochem Sci"[JOUR] AND ("2000/07"[PDAT]:"2000/08"[PDAT])
which would narrow your search down to just 21 papers. The publication date search can also be entered through the Limits function as described below.
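Search patterns of this kind are plain strings, so they are easy to assemble programmatically. The sketch below rebuilds the pattern above from its parts and wraps it into a request URL for the NCBI E-utilities ESearch service. The helper names are hypothetical, the field tags are those used in this chapter (the current PubMed interface may prefer other spellings), and no request is actually sent.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def field(term, tag):
    """Quote a term and attach a field tag, e.g. "Smith JL"[AUTH]."""
    return f'"{term}"[{tag}]'


def date_range(start, end, tag="PDAT"):
    """Build a start:end range for a date field."""
    return f"{field(start, tag)}:{field(end, tag)}"


def all_of(*parts):
    """Join sub-queries with the Boolean operator AND (upper case)."""
    return " AND ".join(parts)


query = all_of(
    field("Trends Biochem Sci", "JOUR"),
    "(" + date_range("2000/07", "2000/08") + ")",
)
url = ESEARCH + "?" + urlencode({"db": "pubmed", "term": query})

print(query)
print(url)
```

The same helpers cover the other examples in this section, for instance `all_of('Smith JL[AUTH]', 'Purdue[AFFL]')` for the affiliation search.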
Enzyme Classification [ECNO]
When searching for information on a specific enzyme or a group of enzymes, it often turns out to be convenient to simply use the EC number as the search parameter. For example, when searching for data on NADP-dependent alcohol dehydrogenases, the last thing one would like to do is to enter NADP, dehydrogenase, and alcohol (1,194 citations). NADP, alcohol, and dehydrogenase (in that order) would return only 455 citations, because the MeSH system would recognize “alcohol dehydrogenase” as a single search term. In contrast, entering 1.1.1.2[ECNO] as the search term would return only 162 citations, most of which would be relevant to the topic.
Limits
Because the sheer number of publications broadly related to a particular topic in the database can be overwhelming, PubMed offers the user the opportunity to limit the search to certain values in particular fields. This alleviates the need to remember the syntax of the examples mentioned above and allows the user to construct fairly complex search patterns. This feature allows one to select articles published in a specific language and further specify the type of articles to retrieve, e.g. review papers only. Although the preset limits are geared mostly towards clinical studies, there are several options useful for biologists. For example, Limits allow one to directly enter the range of acceptable publication dates for the articles to look for. Importantly, the Entrez date parameter specifies the date when the new citation was added to the database. By using the Entrez date, one can search for papers added to PubMed during the last week, month, or any other period.
The Limits option is also convenient for the retrieval of protein and nucleotide sequences. For proteins, it allows one to search by gene location (genomic, mitochondrial, or chloroplast DNA) and the database (GenBank, EMBL, DDBJ, PDB, SWISS-PROT, PIR, or RefSeq). For nucleotide sequences, in addition, it offers the option of excluding patents, sequences of ESTs, STSs, GSS, and/or working draft sequences. This allows the user to significantly reduce the noise caused by the redundancy of GenBank protein and nucleotide databases.
3.7.2. Interpretation of the search pattern
Often enough, a PubMed search would not find the reference that should be there or would return references that seem to have nothing to do with the entered search pattern. One of the reasons for this is that PubMed does not simply scan all the abstracts for the word or phrase entered by the user. Instead, it first searches precompiled indexes of terms in four main lists. It starts by looking for a match in the Medical Subject Heading (MeSH) table. If it does not find a match, it looks in the Journals Table, then in the Phrase List, and finally, in the Author Index. As soon as PubMed finds a match in one of those four lists, the search stops. Thus, if one enters “Silver” as the search pattern, PubMed would not even look for papers authored by Simon Silver from the University of Illinois at Chicago or any other researcher with that last name. Instead, PubMed would interpret “silver” as a MeSH term and would ignore it in all other lists. After receiving the report that PubMed has found as many as 23,486 references, very few of which have Silver as an author, one could click “Details” and find out that the word “Silver” was translated by PubMed as
(“silver”[MeSH Terms] OR silver[Text Word]).
As discussed above, to search for papers authored by Silver, one would need to enter the search pattern “Silver[AUTH]” by typing it or going through the Limits option. If the author's initial is known, one could simply enter “silver s”, which would be interpreted as a name. Finally, to search for Silver's papers on Ag-resistance, one could use the pattern “silver s silver” and end up with a list of 13 references, 7 of which would be relevant.
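The lookup order described in this section (MeSH table, then Journals Table, then Phrase List, then Author Index, stopping at the first match) can be mimicked with a toy model. The index contents below are tiny invented stand-ins, the bare author entry "silver" stands for any author with that last name, and the fallback to a plain text-word search is our assumption; the real PubMed term mapping is considerably more elaborate.

```python
# Toy stand-ins for PubMed's precompiled indexes, in lookup order.
# The entries are illustrative; the real tables are vastly larger.
INDEXES = [
    ("MeSH Terms", {"silver", "neptune", "octoxynol"}),
    ("Journal", {"trends biochem sci"}),
    ("Phrase", {"detergent removal"}),
    ("Author", {"silver", "silver s", "smith jl"}),
]


def interpret(term):
    """Return the name of the first index that contains the term."""
    key = term.lower()
    for name, entries in INDEXES:
        if key in entries:
            return name          # the search stops at the first match
    return "Text Word"           # assumed fallback when nothing matches


for term in ("Silver", "silver s", "Trends Biochem Sci"):
    print(term, "->", interpret(term))
```

Although "silver" is present in the toy Author index as well, it is never reached, because the MeSH table is consulted first; adding the initial makes the term miss the first three lists and match as an author name.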
The option of pressing “Details” to find out how the search terms have been interpreted by PubMed offers an easy way to avoid a lot of confusion. It also allows the user to modify the search so that PubMed would look exactly for what the user wants. Consider the following real-life example: you used Triton X-100 to solubilize your protein and want to find an easy way to measure residual Triton X-100 in your sample. Simply entering “triton determination” would bring 6,663 citations, some of which are literally out of this world. Pressing “Details” shows that your search pattern has been interpreted as follows:
((“neptune”[MeSH Terms] OR triton[Text Word]) AND
(“analysis”[Subheading] OR determination[Text Word]))
Where in the world did “neptune” come from and what does it have to do with triton? Looking up “neptune” as a MeSH term brings you the following comprehensive answer:
“Neptune: The eighth planet in order from the sun. It is one of the five outer planets of the solar system. Its two natural satellites are Nereid and Triton. Year introduced: 1995.”
Even if this entry leaves you puzzled about the impact of the planet Neptune on contemporary medical science, it at least explains the link of triton to neptune. As a matter of fact, the PubMed search engine is trying to be as helpful as possible: should you enter triton X-100, it would not need any quotation marks to correctly interpret it as
(“octoxynol”[MeSH Terms] OR triton X-100[Text Word]).
Looking up “octoxynol” as a MeSH term reveals yet another reason why there are so many irrelevant references in the output:
“Octoxynol: Nonionic surfactant mixtures varying in the number of repeating ethoxy (oxy-1,2-ethanediyl) groups. They are used as detergents, emulsifiers, wetting agents, defoaming agents, etc. Octoxynol–9, the compound with 9 repeating ethoxy groups, is a spermatocide.”
This means that you might want to disallow searching for octoxynol; otherwise, you might get distracted by the contraceptive uses of Triton X-100 that have never been mentioned in biochemistry textbooks. In the long run, the easiest way is usually to simply include in the search pattern as many relevant terms as possible, hoping that their intersection retrieves the relevant publications. In this case, the pattern
triton X-100[Text Word] AND detergent removal method
returns only 39 citations, two of which are definitely relevant. Pressing the “Related articles” link brings more papers of the same kind and finally allows one to move from searching to reading.
3.7.3. NCBI Bookshelf
None of us is equally proficient in all areas of biology. However, for most cancer researchers, relative ignorance in algology or mycology can be easily forgiven. Not so for genome annotators, who encounter database hits from Synechocystis or Dictyostelium on a daily basis and need to be able to quickly decide whether those hits and their annotations are plausible and whether those annotations could be applied to human genes. The NCBI Bookshelf (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Books) project aims at putting classical textbooks on the web and hyperlinking them with PubMed abstracts. The third edition of “Molecular Biology of the Cell” by Bruce Alberts, Dennis Bray, Julian Lewis, Martin Raff, Keith Roberts, and James D. Watson, has been available online since 2001. In the last two years, it has been joined by ten other books on such diverse topics as immunobiology, developmental biology, retroviruses, and cancer medicine. While the availability of all these books online is an important development per se, their value for PubMed users goes far beyond that. By clicking the “Books” link from the abstract view, the reader can link the terms in the respective abstract to the same terms in any of the books on the bookshelf. Then, just by clicking on the obscure term, the user can jump to the book paragraph that mentions this term and see it in the proper context. It is also possible to search the Bookshelf directly from Entrez, so that the reader can use a term that is not even in PubMed. Obviously, this tool works best for subjects that are specifically covered in the books available so far; there are still very few links to algae- or fungi-related topics. Nonetheless, this is the start of a very promising trend that should help researchers to deal with unfamiliar terminology in even the most complex PubMed entries.
3.8. Conclusions and Outlook
In this chapter, we gave only a perfunctory and non-technical overview of the databases that, in our opinion, are most important for researchers working in genomics. For more detail and particularly information on technical aspects of database architecture, the reader should refer to the sources listed below and other relevant literature. Appropriate information resources are necessary for any type of research, but in genomics, the quality of the employed databases affects the science more directly than in many other areas. Throughout this chapter, we emphasized the critical distinction between archival and curated (“value-added”) databases. It would be a grave mistake to think that the latter are unconditionally “better” than the former. The two types of databases perform fundamentally different and equally essential functions. Archival databases ensure the integrity of the edifice of genomics and will exist as long as the field itself. However, it is the other type of databases, the expert-curated, specialized ones, that are currently in the phase of explosive growth, and in our opinion, the future of genomics critically depends on these resources. In Chapters 4 and 5, we shall see how these databases are already transforming comparative-genomic research.
3.9. Further Reading
1. Nucleic Acids Research, 1998–2002, January 1st issues.
2. Computer Methods for Macromolecular Sequence Analysis. 1996. Doolittle R.F., ed. (Methods in Enzymology, vol. 266). Academic Press, San Diego.
3. Analysis of Amino Acid Sequences. 2000. Bork P., ed. (Advances in Protein Chemistry, vol. 54). Academic Press, San Diego. The following articles are relevant for this chapter: Apweiler R. Protein sequence databases, pp. 31–71; Bateman A., Birney E. Searching databases to find protein domain organization, pp. 137–157; Kanehisa M. Pathway databases and higher order function, pp. 381–408. [PubMed: 10829233]
4. Bioinformatics: Databases and Systems. 1999. Letovsky S.L., ed. Kluwer Academic Publishers, Boston.