[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3225058.3225060acmotherconferencesArticle/Chapter ViewAbstractPublication PagesicppConference Proceedingsconference-collections
research-article

SPECTR: Scalable Parallel Short Read Error Correction on Multi-core and Many-core Architectures

Published: 13 August 2018 Publication History

Abstract

Modern high throughput sequencing platforms can produce large amounts of short read DNA data at low cost. Error correction is an important but time-consuming initial step when processing this data in order to improve the quality of downstream analyses. In this paper, we present a Scalable Parallel Error CorrecToR designed to improve the throughput of DNA error correction for Illumina reads on various parallel platforms. Our design is based on a k-spectrum approach where a Bloom filter is frequently probed as a key operation and is optimized towards AVX-512-based multi-core CPUs, Xeon Phi many-cores (both KNC and KNL), and heterogeneous compute clusters. A number of architecture-specific optimizations are employed to achieve high performance such as memory alignment, vectorized Bloom filter probing, and a stack-based iteration to eliminate recursion. Our experiments show that our optimizations result in speedups of up to 2.8, 5.2, and 9.3 on a CPU (Xeon W-2123), a KNC-based Xeon Phi (31S1P), and a KNL-based Xeon Phi (7210), respectively, compared to a multi-threaded CPU reference implementation for the error correction stage. Furthermore, when executed on the same hardware, SPECTR achieves a speedup of up to 1.7, 2.1, 2.4, and 6.4, compared to the state-of-the-art tools Lighter, BLESS2, RECKONER, and Musket, respectively. In addition, our MPI implementation exhibits an efficiency of around 86% when executed on 32 nodes of the Tianhe-2 supercomputer. SPECTR is available at https://github.com/Xu-Kai/SPECTR.

References

[1]
Amin Allam, Panos Kalnis, and Victor Solovyev. 2015. Karect: accurate correction of substitution, insertion and deletion errors for next-generation sequencing data. Bioinformatics 31, 21 (2015), 3421--3428.
[2]
Burton H Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422--426.
[3]
Keith R Bradnam, Joseph N Fass, Anton Alexandrov, Paul Baranay, Michael Bechner, Inanç Birol, Sébastien Boisvert, Jarrod A Chapman, Guillaume Chapuis, Rayan Chikhi, et al. 2013. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience 2, 1 (2013), 10.
[4]
Chien-Chih Chen, Yu-Jung Chang, Wei-Chun Chung, Der-Tsai Lee, and Jan-Ming Ho. 2013. CloudRS: An error correction algorithm of high-throughput sequencing data based on scalable framework. In Proceedings of the 2013 IEEE International Conference on Big Data, 6--9 October 2013, Santa Clara, CA, USA. 717--722.
[5]
Yupeng Chen, Bertil Schmidt, and Douglas L. Maskell. 2013. Reconfigurable Accelerator for the Word-Matching Stage of BLASTN. IEEE Trans. VLSI Syst. 21, 4 (2013), 659--669.
[6]
Sebastian Deorowicz, Marek Kokot, Szymon Grabowski, and Agnieszka Debudaj-Grabysz. 2015. KMC 2: fast and resource-frugal k-mer counting. Bioinformatics 31, 10 (2015), 1569--1576.
[7]
Maciej Długosz and Sebastian Deorowicz. 2017. RECKONER: read error corrector based on KMC. Bioinformatics 33, 7 (2017), 1086--1089.
[8]
EMBL-EBI. {n. d.}. Read simulator. https://www.ebi.ac.uk/goldman-srv/simNGS/. ({n. d.}).
[9]
Yun Heo, Anand Ramachandran, Wen-mei W. Hwu, Jian Ma, and Deming Chen. 2016. BLESS 2: accurate, memory-efficient and fast error correction method. Bioinformatics 32, 15 (2016), 2369--2371.
[10]
Yun Heo, Xiao-Long Wu, Deming Chen, Jian Ma, and Wen-Mei Hwu. 2014. BLESS: bloom filter-based error correction solution for high-throughput sequencing reads. Bioinformatics 30, 10 (2014), 1354--1362.
[11]
Manuel Holtgrewe. 2010. Mason--a read simulator for second generation sequencing data. Technical Report FU Berlin (2010).
[12]
Lucian Ilie, Farideh Fazayeli, and Silvana Ilie. 2010. HiTEC: accurate error correction in high-throughput sequencing data. Bioinformatics 27, 3 (2010), 295--302.
[13]
Nagakishore Jammula, Sriram Chockalingam, and Srinivas Aluru. 2015. Parallel Read Error Correction for Big Genomic Datasets. In High Performance Computing (HiPC), 2015 IEEE 22nd International Conference on. IEEE, 446--455.
[14]
Wei-Chun Kao, Andrew H Chan, and Yun S Song. 2011. ECHO: a reference-free short-read error correction algorithm. Genome research 21, 7 (2011), 1181--1192.
[15]
David R Kelley, Michael C Schatz, and Steven L Salzberg. 2010. Quake: quality-aware detection and correction of sequencing errors. Genome biology 11, 11 (2010), R116.
[16]
Heng Li. 2015. BFC: correcting Illumina sequencing errors. Bioinformatics 31, 17 (2015), 2885--2887.
[17]
Yongchao Liu, Bertil Schmidt, and Douglas L Maskell. 2011. DecGPU: distributed error correction on massively parallel graphics processing units using CUDA and MPI. BMC bioinformatics 12, 1 (2011), 85.
[18]
Yongchao Liu, Jan Schröder, and Bertil Schmidt. 2012. Musket: a multistage k-mer spectrum-based error corrector for Illumina sequence data. Bioinformatics 29, 3 (2012), 308--315.
[19]
James R Lupski, Jeffrey G Reid, Claudia Gonzaga-Jauregui, David Rio Deiros, David CY Chen, Lynne Nazareth, Matthew Bainbridge, Huyen Dinh, Chyn Jing, David A Wheeler, et al. 2010. Whole-genome sequencing in a patient with Charcot--Marie--Tooth neuropathy. New England Journal of Medicine 362, 13 (2010), 1181--1191.
[20]
Sheng Ni, Rentong Guo, Xiaofei Liao, and Hai Jin. 2015. Parallel Bloom Filter on Xeon Phi Many-Core Processors. In International Conference on Algorithms and Architectures for Parallel Processing. Springer, 388--405.
[21]
Rasmus Nielsen, Joshua S Paul, Anders Albrechtsen, and Yun S Song. 2011. Genotype and SNP calling from next-generation sequencing data. Nature Reviews Genetics 12, 6 (2011), 443.
[22]
Jane Peterson, Susan Garges, Maria Giovanni, Pamela McInnes, Lu Wang, Jeffery A Schloss, Vivien Bonazzi, Jean E McEwen, Kris A Wetterstrand, Carolyn Deal, et al. 2009. The NIH human microbiome project. Genome research 19, 12 (2009), 2317--2323.
[23]
Orestis Polychroniou and Kenneth A Ross. 2014. Vectorized Bloom filters for advanced SIMD processors. In Proceedings of the Tenth International Workshop on Data Management on New Hardware. ACM, 6.
[24]
Anand Ramachandran, Yun Heo, Wen-mei Hwu, Jian Ma, and Deming Chen. 2015. FPGA accelerated DNA error correction. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition. EDA Consortium, 1371--1376.
[25]
Vipin Sachdeva, Srinivas Aluru, and David A Bader. 2016. A memory and time scalable parallelization of the Reptile error-correction code. In Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International. IEEE, 453--462.
[26]
Leena Salmela and Jan Schröder. 2011. Correcting errors in short reads by multiple alignments. Bioinformatics 27, 11 (2011), 1455--1461.
[27]
Jan Schröder, Heiko Schröder, Simon J Puglisi, Ranjan Sinha, and Bertil Schmidt. 2009. SHREC: a short-read error correction method. Bioinformatics 25, 17 (2009), 2157--2163.
[28]
Marcel H Schulz, David Weese, Manuel Holtgrewe, Viktoria Dimitrova, Sijia Niu, Knut Reinert, and Hugues Richard. 2014. Fiona: a parallel and automatic strategy for read error correction. Bioinformatics 30, 17 (2014), i356--i363.
[29]
Ankit R Shah, Sriram Chockalingam, and Srinivas Aluru. 2012. A parallel algorithm for spectrum-based short read error correction. In Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International. IEEE, 60--70.
[30]
Avinash Sodani, Roger Gramunt, Jesus Corbal, Ho-Seop Kim, Krishna Vinod, Sundaram Chinthamani, Steven Hutsell, Rajat Agarwal, and Yen-Chen Liu. 2016. Knights Landing: Second-Generation Intel Xeon Phi Product. IEEE Micro 36, 2 (2016), 34--46.
[31]
Brad Solomon and Carl Kingsford. 2016. Fast search of thousands of short-read sequencing experiments. Nature biotechnology 34, 3 (2016), 300.
[32]
Li Song, Liliana Florea, and Ben Langmead. 2014. Lighter: fast and memory-efficient sequencing error correction without counting. Genome biology 15, 11 (2014), 509.
[33]
K. Wetterstrand. {n. d.}. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP). http://www.genome.gov/sequencingcosts. ({n. d.}).
[34]
Adrianto Wirawan, Robert S Harris, Yongchao Liu, Bertil Schmidt, and Jan Schröder. 2014. HECTOR: a parallel multistage homopolymer spectrum based error corrector for 454 sequencing data. BMC bioinformatics 15, 1 (2014), 131.
[35]
Xiao Yang, Sriram P Chockalingam, and Srinivas Aluru. 2012. A survey of error-correction methods for next-generation sequencing. Briefings in bioinformatics 14, 1 (2012), 56--66.
[36]
Xiao Yang, Karin S Dorman, and Srinivas Aluru. 2010. Reptile: representative tiling for short read error correction. Bioinformatics 26, 20 (2010), 2526--2533.
[37]
Liang Zhao, Qingfeng Chen, Wencui Li, Peng Jiang, Limsoon Wong, and Jinyan Li. 2017. MapReduce for accurate error correction of next-generation sequencing data. Bioinformatics 33, 23 (2017), 3844--3851.

Cited By

View all
  • (2021)DNA based service data security in cloud computing environment2021 6th International Conference on Innovative Technology in Intelligent System and Industrial Applications (CITISIA)10.1109/CITISIA53721.2021.9719943(1-11)Online publication date: 24-Nov-2021
  • (2020)SMusket: Spark-based DNA error correction on distributed-memory systemsFuture Generation Computer Systems10.1016/j.future.2019.10.038111(698-713)Online publication date: Oct-2020

Index Terms

  1. SPECTR: Scalable Parallel Short Read Error Correction on Multi-core and Many-core Architectures

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Other conferences
      ICPP '18: Proceedings of the 47th International Conference on Parallel Processing
      August 2018
      945 pages
      ISBN:9781450365109
      DOI:10.1145/3225058
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      In-Cooperation

      • University of Oregon: University of Oregon

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 13 August 2018

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. Bioinformatics
      2. Bloom filter
      3. MPI
      4. Vectorization
      5. Xeon Phi

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      ICPP 2018

      Acceptance Rates

      ICPP '18 Paper Acceptance Rate 91 of 313 submissions, 29%;
      Overall Acceptance Rate 91 of 313 submissions, 29%

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)5
      • Downloads (Last 6 weeks)2
      Reflects downloads up to 15 Jan 2025

      Other Metrics

      Citations

      Cited By

      View all
      • (2021)DNA based service data security in cloud computing environment2021 6th International Conference on Innovative Technology in Intelligent System and Industrial Applications (CITISIA)10.1109/CITISIA53721.2021.9719943(1-11)Online publication date: 24-Nov-2021
      • (2020)SMusket: Spark-based DNA error correction on distributed-memory systemsFuture Generation Computer Systems10.1016/j.future.2019.10.038111(698-713)Online publication date: Oct-2020

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media