DOI: 10.1145/3459637.3482047

Accelerating Variant Calling on Human Genomes Using a Commodity Cluster

Published: 30 October 2021

Abstract

Variant calling is a fundamental task that identifies variants in an individual's genome relative to a reference human genome. It can improve our understanding of an individual's risk of disease and eventually lead to innovations in precision medicine and drug discovery. However, variant calling on a large number of human genome sequences requires significant computing and storage resources. While access to such resources is possible today (e.g., through cloud computing), reducing the cost of analyzing genomes has become a major challenge. Motivated by this, we address the problem of accelerating the variant calling pipeline on a large number of human genome sequences using a commodity cluster. We propose a novel approach that combines data and task parallelism for different stages of the variant calling pipeline across different sequences with minimal synchronization. Our approach employs futures to enable asynchronous computations, which improves overall cluster utilization and thereby accelerates the variant calling pipeline. On a 16-node cluster, we observed that our approach was 3X to 4.7X faster than the state-of-the-art Big Data Genomics software.
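
The following is a rough sketch of how futures can combine data and task parallelism across pipeline stages without a global barrier. It is not the authors' implementation: it uses Python's standard `concurrent.futures` on a single machine rather than a cluster, and the names `align`, `call_variants`, and `run_pipeline`, as well as the two-stage split and the notion of per-sample chunks, are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor, as_completed


def align(sample, chunk):
    # Hypothetical stand-in for aligning one chunk of one sample's reads.
    return (sample, chunk, "aligned")


def call_variants(aligned):
    # Hypothetical stand-in for calling variants on one aligned chunk.
    sample, chunk, _ = aligned
    return (sample, chunk, "variants")


def run_pipeline(samples, chunks_per_sample, max_workers=4):
    """Push every chunk of every sample through both stages asynchronously."""
    results = defaultdict(list)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Data parallelism: one future per (sample, chunk) pair.
        stage1 = [
            pool.submit(align, s, c)
            for s in samples
            for c in range(chunks_per_sample)
        ]
        # Task parallelism: as soon as any alignment finishes, its variant
        # calling task is submitted, so stages of different samples overlap
        # and no barrier separates stage 1 from stage 2.
        stage2 = [
            pool.submit(call_variants, f.result())
            for f in as_completed(stage1)
        ]
        for f in as_completed(stage2):
            sample, chunk, _ = f.result()
            results[sample].append(chunk)
    # Completion order is nondeterministic, so sort chunks per sample.
    return {s: sorted(chunks) for s, chunks in results.items()}


if __name__ == "__main__":
    print(run_pipeline(["sampleA", "sampleB"], chunks_per_sample=3))
```

The only synchronization point in this sketch is the final collection of results; a slow chunk of one sample does not hold back the later stage of another sample, which is the utilization benefit the abstract attributes to futures.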

      Published In

      CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
      October 2021
      4966 pages
      ISBN: 9781450384469
      DOI: 10.1145/3459637

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Author Tags

      1. cluster computing
      2. futures
      3. human genomes
      4. variant calling

      Qualifiers

      • Short-paper

      Conference

      CIKM '21
      Acceptance Rates

      Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

      Article Metrics

      • Downloads (last 12 months): 102
      • Downloads (last 6 weeks): 13
      Reflects downloads up to 06 Jan 2025

      Cited By

      • (2024) Impact of the Networking Infrastructure on the Performance of Variant Calling on Human Genomes in Commodity Clusters. Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 1-11. DOI: 10.1145/3698587.3701354. Online publication date: 22-Nov-2024
      • (2024) A Scalable Tool for Democratizing Variant Calling on Human Genomes Using Commodity Clusters. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 5275-5279. DOI: 10.1145/3627673.3679221. Online publication date: 21-Oct-2024
      • (2024) A Technique for Secure Variant Calling on Human Genome Sequences Using SmartNICs. 2024 IEEE 17th International Conference on Cloud Computing (CLOUD), pp. 328-335. DOI: 10.1109/CLOUD62652.2024.00044. Online publication date: 7-Jul-2024
      • (2022) Enabling Large-Scale Human Genome Sequence Analysis on CloudLab. IEEE INFOCOM 2022 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 1-2. DOI: 10.1109/INFOCOMWKSHPS54753.2022.9798223. Online publication date: 2-May-2022
