[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3472456.3472460acmotherconferencesArticle/Chapter ViewAbstractPublication PagesicppConference Proceedingsconference-collections
research-article

MetaCache-GPU: Ultra-Fast Metagenomic Classification

Published: 05 October 2021 Publication History

Abstract

The cost of DNA sequencing has dropped exponentially over the past decade, making genomic data accessible to a growing number of scientists. In bioinformatics, localization of short DNA sequences (reads) within large genomic sequences is commonly facilitated by constructing index data structures which allow for efficient querying of substrings. Recent metagenomic classification pipelines annotate reads with taxonomic labels by analyzing their k-mer histograms with respect to a reference genome database. CPU-based index construction is often performed in a preprocessing phase due to the relatively high cost of building irregular data structures such as hash maps. However, the rapidly growing amount of available reference genomes establishes the need for index construction and querying at interactive speeds. In this paper, we introduce MetaCache-GPU – an ultra-fast metagenomic short read classifier specifically tailored to fit the characteristics of CUDA-enabled accelerators. Our approach employs a novel hash table variant featuring efficient minhash fingerprinting of reads for locality-sensitive hashing and their rapid insertion using warp-aggregated operations. Our performance evaluation shows that MetaCache-GPU is able to build large reference databases in a matter of seconds, enabling instantaneous operability, while popular CPU-based tools such as Kraken2 require over an hour for index construction on the same data. In the context of an ever-growing number of reference genomes, MetaCache-GPU is the first metagenomic classifier that makes analysis pipelines with on-demand composition of large-scale reference genome sets practical. The source code is publicly available at https://github.com/muellan/metacache.

References

[1]
D.A.F. Alcantara. 2011. Efficient Hash Tables on the GPU. Ph.D. Dissertation. University of California at Davis, Davis, CA, USA.
[2]
S Ashkiani, M Farach-Colton, and JD Owens. 2018. A Dynamic Hash Table for the GPU. In IPDPS 2018. IEEE, 419–429.
[3]
DC Bauer, AP Tay, L Wilson, D Reti, C Hosking, AJ McAuley, E Pharo, S Todd, V Stevens, MJ Neave, 2020. Supporting pandemic response using genomics and bioinformatics: a case study on the emergent SARS-CoV-2 outbreak. Transboundary and Emerging Diseases(2020).
[4]
S. Baxter. 2016. ModernGPU: Patterns and behaviors for GPU computing. https://github.com/moderngpu/moderngpu
[5]
AZ Broder. 2000. Identifying and Filtering Near-Duplicate Documents. In Proc. 11th Annual Symposium on Combinatorial Pattern Matching(COM ’00). 1–10.
[6]
N Cadenelli, J Polo, and D Carrera. 2017. Accelerating K-mer frequency counting with GPU and non-volatile memory. In IEEE HPCC 2017; IEEE SmartCity 2017; IEEE DSS 2017. IEEE, 434–441.
[7]
A Chacón, S Marco-Sola, A Espinosa, P Ribeca, and JC Moure. 2014. Boosting the FM-index on the GPU: Effective techniques to mitigate random memory access. IEEE/ACM Trans. on Computational Biology and Bioinformatics 12, 5(2014), 1048–1059.
[8]
S. Dalton, N. Bell, L. Olson, and M. Garland. 2015. CUSP: A C++ Templated Sparse Matrix Library. http://cusplibrary.github.io/
[9]
M Ellis, G Guidi, A Buluç, L Oliker, and K Yelick. 2019. diBELLA: Distributed Long Read to Long Read Alignment. In Proc. of the 48th Int. Conference on Parallel Processing. 1–11.
[10]
M Erbert, S Rechner, and M Müller-Hannemann. 2017. Gerbil: a fast and memory-efficient k-mer counter with GPU-support. Algorithms for Molecular Biology 12, 1 (2017), 1–12.
[11]
E Georganas, R Egan, S Hofmeyr, E Goltsman, B Arndt, A Tritt, A Buluç, L Oliker, and K Yelick. 2018. Extreme scale de novo metagenome assembly. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 122–134.
[12]
K Hou, W Liu, H Wang, and W Feng. 2017. Fast Segmented Sort on GPUs. In 31th International Conference on Supercomputing (ICS). Chicago, USA.
[13]
EJ Houtgast, V Sima, K Bertels, and Z Al-Ars. 2015. An FPGA-based systolic array to accelerate the BWA-MEM genomic mapping algorithm. In SAMOS 2015. IEEE, 221–227.
[14]
EJ Houtgast, V Sima, K Bertels, and Z Al-Ars. 2018. Hardware acceleration of BWA-MEM genomic short read mapping for longer read lengths. Computational Biology and Chemistry 75 (2018), 54–64.
[15]
D Jünger, C Hundt, and B Schmidt. 2018. WarpDrive: Massively Parallel Hashing on Multi-GPU Nodes. In IPDPS 2018. IEEE, 441–450.
[16]
Daniel Jünger, Robin Kobus, André Müller, Christian Hundt, Kai Xu, Weiguo Liu, and Bertil Schmidt. 2020. WarpCore: A Library for fast Hash Tables on GPUs. In HiPC 2020. IEEE, 11–20. https://doi.org/10.1109/HiPC50609.2020.00015
[17]
R Kobus, JM Abuín, A Müller, SL Hellmann, JC Pichel, TF Pena, A Hildebrandt, T Hankeln, and B Schmidt. 2020. A big data approach to metagenomics for all-food-sequencing. BMC Bioinformatics 21, 1 (2020), 1–15.
[18]
R Kobus, C Hundt, A Müller, and B Schmidt. 2017. Accelerating metagenomic read classification on CUDA-enabled GPUs. BMC Bioinformatics 18, 1 (2017), 1–10.
[19]
R Kobus, D Jünger, C Hundt, and B Schmidt. 2019. Gossip: Efficient Communication Primitives for Multi-GPU Systems. In 48th Int. Conference on Parallel Processing (ICPP ’19). 1–10.
[20]
K Korpela, A Salonen, LJ Virta, RA Kekkonen, K Forslund, P Bork, and WM De Vos. 2016. Intestinal microbiome is related to lifetime antibiotic use in Finnish pre-school children. Nature Communications 7(2016), 10410.
[21]
B Lessley and H Childs. 2019. Data-Parallel Hashing Techniques for GPU Architectures. IEEE Transactions on Parallel and Distributed Systems 31, 1 (2019), 237–250.
[22]
HA Lewin, GE Robinson, WJ Kress, WJ Baker, J Coddington, KA Crandall, R Durbin, SV Edwards, F Forest, MTP Gilbert, 2018. Earth BioGenome Project: Sequencing life for the future of life. Proceedings of the National Academy of Sciences 115, 17(2018), 4325–4333.
[23]
H Li, A Ramachandran, and D Chen. 2018. GPU acceleration of advanced k-mer counting for computational genomics. In 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 1–4.
[24]
S Lindgreen, K L Adair, and P Gardner. 2016. An evaluation of the accuracy and speed of metagenome analysis tools. Scientific Reports 6, 19233 (2016).
[25]
Y Liu and B Schmidt. 2013. CUSHAW2-GPU: empowering faster gapped short-read alignment using GPU computing. IEEE Design & Test 31, 1 (2013), 31–39.
[26]
C Marchet, C Boucher, SJ Puglisi, P Medvedev, M Salson, and R Chikhi. 2019. Data structures based on k-mers for querying large collections of sequencing datasets. bioRxiv (2019), 866756.
[27]
A Müller, C Hundt, A Hildebrandt, T Hankeln, and B Schmidt. 2017. MetaCache: context-aware classification of metagenomic reads using minhashing. Bioinformatics 33, 23 (2017), 3740–3748.
[28]
NA O’Leary, MW Wright, JR Brister, S Ciufo, D Haddad, R McVeigh, B Rajput, B Robbertse, B Smith-White, D Ako-Adjei, 2016. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research 44, D1 (2016), D733–D745.
[29]
A Morgulis others. 2008. Database indexing for production MegaBLAST searches. Bioinformatics 24, 16 (2008), 1757–1764.
[30]
R Ounit, S Wanamaker, TJ Close, 2015. CLARK: fast and accurate classification of metagenomic and genomic sequences using discriminative k-mers. BMC Genomics 16, 1 (2015), 1–13.
[31]
TC Pan, S Misra, and S Aluru. 2018. Optimizing high performance distributed memory parallel hash tables for DNA k-mer counting. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 135–147.
[32]
Y Qiao, B Jia, Z Hu, C Sun, Yijin X, and C Wei. 2018. MetaBinG2: a fast and accurate metagenomic sequence classification system for samples with many unknown organisms. Biology Direct 13, 1 (2018), 1–21.
[33]
K Reinert, B Langmead, D Weese, and D J Evers. 2015. Alignment of Next-Generation Sequencing Reads. Annual Review of Genomics and Human Genetics 16 (2015), 133–151.
[34]
NVIDIA Research. 2021. CUB: Cooperative primitives for CUDA C++. https://nvlabs.github.io/cub/
[35]
C Schoch. 2020. NCBI Taxonomy Help. National Center for Biotechnology Information (US). https://www.ncbi.nlm.nih.gov/books/NBK53758/
[36]
M Seppey, M Manni, and E Zdobnov. 2020. LEMMI: A continuous benchmarking platform for metagenomics classifiers. Genome Research 30 (07 2020), gr.260398.119.
[37]
ZD Stephens, SY Lee, F Faghri, RH Campbell, C Zhai, MJ Efron, R Iyer, MC Schatz, S Sinha, and GE Robinson. 2015. Big data: astronomical or genomical?PLoS Biology 13, 7 (2015), e1002195.
[38]
D E Wood, J Lu, and B Langmead. 2019. Improved metagenomic analysis with Kraken 2. Genome biology 20, 1 (2019), 257.
[39]
D E Wood and S L Salzberg. 2014. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biology 15:R46(2014).

Cited By

View all
  • (2024)Rapid GPU-Based Pangenome Graph LayoutSC24: International Conference for High Performance Computing, Networking, Storage and Analysis10.1109/SC41406.2024.00035(1-19)Online publication date: 17-Nov-2024
  • (2024)MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00054(660-677)Online publication date: 29-Jun-2024
  • (2024)ViRAL: Vision Transformer Based Accelerator for ReAL Time Lineage Assignment of Viral PathogensIEEE Access10.1109/ACCESS.2024.336780112(28353-28368)Online publication date: 2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Other conferences
ICPP '21: Proceedings of the 50th International Conference on Parallel Processing
August 2021
927 pages
ISBN:9781450390682
DOI:10.1145/3472456
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 October 2021

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. GPUs
  2. bioinformatics
  3. hash tables
  4. metagenomics

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICPP 2021

Acceptance Rates

Overall Acceptance Rate 91 of 313 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)53
  • Downloads (Last 6 weeks)3
Reflects downloads up to 25 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Rapid GPU-Based Pangenome Graph LayoutSC24: International Conference for High Performance Computing, Networking, Storage and Analysis10.1109/SC41406.2024.00035(1-19)Online publication date: 17-Nov-2024
  • (2024)MegIS: High-Performance, Energy-Efficient, and Low-Cost Metagenomic Analysis with In-Storage Processing2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)10.1109/ISCA59077.2024.00054(660-677)Online publication date: 29-Jun-2024
  • (2024)ViRAL: Vision Transformer Based Accelerator for ReAL Time Lineage Assignment of Viral PathogensIEEE Access10.1109/ACCESS.2024.336780112(28353-28368)Online publication date: 2024
  • (2024)Dedicated Bioinformatics Analysis HardwareReference Module in Life Sciences10.1016/B978-0-323-95502-7.00022-1Online publication date: 2024
  • (2023)JACC-FPGAFuture Generation Computer Systems10.1016/j.future.2022.08.005138:C(26-42)Online publication date: 1-Jan-2023
  • (2022)General-purpose GPU hashing data structures and their application in accelerated genomicsJournal of Parallel and Distributed Computing10.1016/j.jpdc.2022.01.006163:C(256-268)Online publication date: 1-May-2022

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format.

HTML Format

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media