short-paper

Accelerating Variant Calling on Human Genomes Using a Commodity Cluster

Authors:

Arun Zachariah,

Peter Tonellato,

Eduardo Simoes

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management

Pages 3388 - 3392

https://doi.org/10.1145/3459637.3482047

Published: 30 October 2021

Abstract

Variant calling is a fundamental task that identifies variants in an individual's genome relative to a reference human genome. It can improve understanding of an individual's risk of disease and eventually lead to innovations in precision medicine and drug discovery. However, variant calling on a large number of human genome sequences requires significant computing and storage resources. While such resources are accessible today (e.g., through cloud computing), reducing the cost of analyzing genomes has become a major challenge. Motivated by this, we address the problem of accelerating the variant calling pipeline on a large number of human genome sequences using a commodity cluster. We propose a novel approach that synergistically combines data and task parallelism for different stages of the variant calling pipeline across different sequences with minimal synchronization. Our approach employs futures to enable asynchronous computations, which improves overall cluster utilization and thereby accelerates the variant calling pipeline. On a 16-node cluster, we observed that our approach was 3x to 4.7x faster than the state-of-the-art Big Data Genomics software.
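The idea of using futures to overlap pipeline stages across sequences can be illustrated with a small sketch. This is not the authors' implementation: it uses Python's standard `concurrent.futures` on a single machine rather than a cluster, and the stage functions `align_chunk` and `call_variants`, the toy string data, and the `run_pipeline` helper are all hypothetical stand-ins chosen for illustration. Chunks of each sequence are processed in parallel (data parallelism), and each sequence moves to its next stage as soon as its own chunks finish, without waiting for the other sequences (task parallelism with no global barrier).

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical stand-ins for pipeline stages. Real stages would run an
# aligner and a variant caller on genomic data.
def align_chunk(sample, chunk):
    """Data-parallel stage: process one chunk of reads from one sample."""
    return [read.upper() for read in chunk]

def call_variants(sample, aligned_chunks, reference):
    """Per-sample stage: compare the merged reads with a toy reference."""
    merged = "".join(read for chunk in aligned_chunks for read in chunk)
    return [(pos, ref, alt)
            for pos, (ref, alt) in enumerate(zip(reference, merged))
            if ref != alt]

def run_pipeline(samples, reference, chunk_size=2, workers=4):
    """Run both stages for every sample using futures (assumes non-empty samples)."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {}    # alignment future -> (sample, chunk index)
        aligned = {}    # sample -> aligned chunks, kept in chunk order
        remaining = {}  # sample -> number of chunks still being aligned
        for sample, reads in samples.items():
            chunks = [reads[i:i + chunk_size]
                      for i in range(0, len(reads), chunk_size)]
            aligned[sample] = [None] * len(chunks)
            remaining[sample] = len(chunks)
            for idx, chunk in enumerate(chunks):
                fut = pool.submit(align_chunk, sample, chunk)
                pending[fut] = (sample, idx)

        calls = {}      # variant-calling future -> sample
        for fut in as_completed(pending):
            sample, idx = pending[fut]
            aligned[sample][idx] = fut.result()
            remaining[sample] -= 1
            if remaining[sample] == 0:
                # This sample is fully aligned: start its next stage now,
                # while other samples may still be aligning.
                nxt = pool.submit(call_variants, sample, aligned[sample], reference)
                calls[nxt] = sample

        for fut in as_completed(calls):
            results[calls[fut]] = fut.result()
    return results

if __name__ == "__main__":
    toy_samples = {"s1": ["ac", "gt"], "s2": ["ac", "ga"]}
    print(run_pipeline(toy_samples, "ACGT", chunk_size=1, workers=3))
```

The coordinating loop never blocks inside a worker, so the pool cannot deadlock on its own tasks. A distributed futures framework would play the role of the thread pool on a real cluster.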

Index Terms

Accelerating Variant Calling on Human Genomes Using a Commodity Cluster
1. Applied computing
  1. Life and medical sciences
    1. Genomics
2. Computing methodologies

Information & Contributors

Information

Published In

CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management

October 2021

4966 pages

ISBN:9781450384469

DOI:10.1145/3459637

General Chairs: Gianluca Demartini (The University of Queensland, Australia), Guido Zuccon (The University of Queensland, Australia)

Program Chairs: J. Shane Culpepper (RMIT University, Australia), Zi Huang (The University of Queensland, Australia), Hanghang Tong (University of Illinois at Urbana-Champaign, USA)

Copyright © 2021 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Short-paper

Funding Sources

National Science Foundation

Conference

CIKM '21

CIKM '21: The 30th ACM International Conference on Information and Knowledge Management

November 1 - 5, 2021

Queensland, Virtual Event, Australia

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Bibliometrics

Total Citations: 4

Total Downloads: 287

Downloads (last 12 months): 102

Downloads (last 6 weeks): 13

Reflects downloads up to 06 Jan 2025

Cited By

Das M, Rao P, Xu L (2024). Impact of the Networking Infrastructure on the Performance of Variant Calling on Human Genomes in Commodity Clusters. Proceedings of the 15th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, pp. 1-11. DOI: 10.1145/3698587.3701354. Online publication date: 22-Nov-2024.
https://dl.acm.org/doi/10.1145/3698587.3701354
Shehzad K, Kumar A, Schutz M, Webb C, Nalela P, Das M, Rao P, Serra E, Spezzano F (2024). A Scalable Tool for Democratizing Variant Calling on Human Genomes Using Commodity Clusters. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pp. 5275-5279. DOI: 10.1145/3627673.3679221. Online publication date: 21-Oct-2024.
https://dl.acm.org/doi/10.1145/3627673.3679221
Rao P, Shehzad K (2024). A Technique for Secure Variant Calling on Human Genome Sequences Using SmartNICs. 2024 IEEE 17th International Conference on Cloud Computing (CLOUD), pp. 328-335. DOI: 10.1109/CLOUD62652.2024.00044. Online publication date: 7-Jul-2024.
https://doi.org/10.1109/CLOUD62652.2024.00044
Rao P, Zachariah A (2022). Enabling Large-Scale Human Genome Sequence Analysis on CloudLab. IEEE INFOCOM 2022 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pp. 1-2. DOI: 10.1109/INFOCOMWKSHPS54753.2022.9798223. Online publication date: 2-May-2022.
https://doi.org/10.1109/INFOCOMWKSHPS54753.2022.9798223
