High performance computing workflow for protein functional annotation

Authors:

Larissa Stanberry,

Bhanu Rekepalli,

William Broomall

XSEDE '13: Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery

Article No.: 19, Pages 1 - 6

https://doi.org/10.1145/2484762.2484809

Published: 22 July 2013

Abstract

Functional annotation of newly sequenced genomes is one of the major challenges in modern biology. With modern sequencing technologies, the PSU (Protein Sequence Universe) expands exponentially. Newly sequenced bacterial genomes alone contain over 7.5 million proteins. The rate of data generation has far surpassed that of protein annotation. The volume of protein data makes manual curation infeasible, while high compute cost limits the utility of existing automated approaches. In this study, we built an automated workflow to enable large-scale protein annotation into existing orthologous groups using HPC (High Performance Computing) architectures. We developed a low complexity classification algorithm to assign proteins into bacterial COGs (Clusters of Orthologous Groups of proteins). The algorithm is based on PSI-BLAST (Position-Specific Iterative Basic Local Alignment Search Tool) and was validated on simulated and archaeal data to ensure at least 80% specificity and sensitivity. The workflow, with highly scalable parallel applications for classification and sequence alignment, was developed on XSEDE (Extreme Science and Engineering Discovery Environment) supercomputers. Using the workflow, we have classified one million newly sequenced bacterial proteins. With the rapid expansion of the PSU, the proposed workflow will enable scientists to annotate big genome data.
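
The abstract reports validation against a threshold of at least 80% specificity and sensitivity. As a minimal sketch of how these two metrics could be computed for a single COG, the Python function below compares known group labels with assigned ones. The function name, the label format (a COG identifier string, or `None` for an unassigned protein), and the per-COG binary definition are assumptions made for illustration; the abstract does not describe the authors' evaluation procedure.

```python
from typing import Optional, Sequence, Tuple


def sensitivity_specificity(
    truth: Sequence[Optional[str]],
    predicted: Sequence[Optional[str]],
    cog: str,
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for membership in one COG.

    Hypothetical helper: labels are COG identifiers, None means "no group".
    """
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    tp = fn = tn = fp = 0
    for t, p in zip(truth, predicted):
        if t == cog:
            # Protein truly belongs to the COG.
            if p == cog:
                tp += 1
            else:
                fn += 1
        else:
            # Protein does not belong to the COG.
            if p == cog:
                fp += 1
            else:
                tn += 1

    # Each metric is undefined (NaN) when its denominator is zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


def meets_threshold(sens: float, spec: float, threshold: float = 0.8) -> bool:
    """Check both metrics against a threshold (0.8 mirrors the stated 80%)."""
    return sens >= threshold and spec >= threshold
```

Sensitivity here is the fraction of true members of the COG that were assigned to it, and specificity is the fraction of non-members that were not assigned to it. A multi-class evaluation could average these values over all COGs, which is a design choice outside what the abstract specifies.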

Index Terms

High performance computing workflow for protein functional annotation
1. Applied computing
  1. Life and medical sciences
2. Information systems
  1. Information retrieval

Published In

XSEDE '13: Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery

July 2013

433 pages

ISBN: 9781450321709

DOI: 10.1145/2484762

General Chair:
Nancy Wilkins-Diehr
San Diego Supercomputer Center

Copyright © 2013 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

XSEDE '13

XSEDE '13: Extreme Science and Engineering Discovery Environment: Gateway to Discovery

July 22 - 25, 2013

San Diego, California, USA

Acceptance Rates

Overall Acceptance Rate 129 of 190 submissions, 68%
