DOI: 10.1145/2484762.2484809
research-article

High performance computing workflow for protein functional annotation

Published: 22 July 2013

Abstract

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the Protein Sequence Universe (PSU) is expanding exponentially. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, whereas high compute cost limits the utility of existing automated approaches. In this study, we built an automated workflow to enable large-scale protein annotation into existing orthologous groups using High Performance Computing (HPC) architectures. We developed a low-complexity classification algorithm to assign proteins to bacterial Clusters of Orthologous Groups of proteins (COGs). Based on PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool), the algorithm was validated on simulated and archaeal data to ensure at least 80% specificity and sensitivity. The workflow, with highly scalable parallel applications for classification and sequence alignment, was developed on XSEDE (Extreme Science and Engineering Discovery Environment) supercomputers. Using the workflow, we have classified one million newly sequenced bacterial proteins. With the rapid expansion of the PSU, the proposed workflow will enable scientists to annotate big genome data.
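
The abstract does not spell out the classification rule, so the following is only a minimal sketch of how a hit-based COG assignment and the 80% sensitivity/specificity check could look. It assumes a best-hit rule with an E-value cutoff; the `Hit` record, the `max_evalue` default, the COG identifiers, and the confusion counts are all hypothetical and are not taken from the paper.

```python
from collections import namedtuple

# Hypothetical hit record: query protein, matched COG, alignment E-value.
Hit = namedtuple("Hit", ["protein", "cog", "evalue"])


def assign_cogs(hits, max_evalue=1e-3):
    """Assign each protein to the COG of its lowest-E-value hit.

    Proteins with no hit at or below max_evalue stay unassigned.
    The cutoff is an illustrative assumption, not a value from the paper.
    """
    best = {}
    for hit in hits:
        if hit.evalue > max_evalue:
            continue  # hit too weak to support an assignment
        current = best.get(hit.protein)
        if current is None or hit.evalue < current.evalue:
            best[hit.protein] = hit
    return {protein: hit.cog for protein, hit in best.items()}


def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of true members recovered
    specificity = tn / (tn + fp)  # fraction of non-members rejected
    return sensitivity, specificity


# Made-up example data for illustration only.
hits = [
    Hit("p1", "COG0001", 1e-30),
    Hit("p1", "COG0002", 1e-5),
    Hit("p2", "COG0002", 0.5),  # above the cutoff, so p2 stays unassigned
    Hit("p3", "COG0003", 1e-10),
]
assignments = assign_cogs(hits)

# Made-up validation counts, checked against the 80% target from the abstract.
sens, spec = sensitivity_specificity(tp=85, fn=15, tn=90, fp=10)
meets_target = sens >= 0.80 and spec >= 0.80
```

In a validation run of this kind, the counts would come from comparing assignments against proteins with known group membership, such as the simulated and archaeal data mentioned in the abstract.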

Published In

XSEDE '13: Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery
July 2013
433 pages
ISBN: 9781450321709
DOI: 10.1145/2484762

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. BLAST
  2. COG
  3. HSPp-BLAST
  4. PSI-BLAST
  5. PSU
  6. XSEDE
  7. computational bioinformatics
  8. data-enabled life sciences
  9. petascale
  10. protein annotation
  11. protein sequence universe
  12. science gateways
  13. sequence similarity

Qualifiers

  • Research-article

Conference

XSEDE '13

Acceptance Rates

Overall Acceptance Rate 129 of 190 submissions, 68%
