DOI: 10.1109/ISCA45697.2020.00031
research-article

Genesis: a hardware acceleration framework for genomic data analysis

Published: 23 September 2020

Abstract

In this paper, we describe our vision to accelerate algorithms in the domain of genomic data analysis by proposing a framework called Genesis (GENomE analySIS) that contains an interface and an implementation of a system that processes genomic data efficiently. This framework can be deployed in the cloud and exploit the FPGAs-as-a-service paradigm to provide cost-efficient secondary DNA analysis. We propose conceptualizing genomic reads and associated read attributes as a very large relational database and using extended SQL as a domain-specific language to construct queries that form various data manipulation operations. To accelerate such queries, we design a Genesis hardware library which consists of primitive hardware modules that can be composed to construct a dataflow architecture specialized for those queries.
As a proof of concept for the Genesis framework, we present the architecture and hardware implementation of several genomic analysis stages in the secondary analysis pipeline corresponding to the best-known software analysis toolkit, the GATK4 workflow proposed by the Broad Institute. We walk through the construction of genomic data analysis operations using a sequence of SQL-style queries and show how Genesis hardware library modules can be composed into hardware pipelines that accelerate such queries. We exploit parallelism and data reuse through a dataflow architecture with on-chip scratchpads, and we use non-blocking APIs to manage the accelerators so that the accelerator and the host can execute concurrently. Our accelerated system deployed on a cloud FPGA performs up to 19.3x better than GATK4 running on a commodity multi-core Xeon server and obtains up to 15x better cost savings. We believe that mapping a software algorithm onto a hardware library through an already-standardized software interface such as SQL, together with an efficient mapping of that interface to primitive hardware modules as we have demonstrated here, will expedite the acceleration of domain-specific algorithms and allow easy adaptation to algorithm changes.
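
The relational view of reads described in the abstract can be illustrated with a small software sketch. It is not the Genesis implementation and does not use the paper's extended SQL: the table schema, column names, sample reads, and the query itself are hypothetical, and standard SQL is run through Python's built-in `sqlite3` module. The second half expresses the same query as a chain of simple streaming stages (scan, filter, sort, reduce), loosely mirroring the idea of composing primitive modules into a dataflow pipeline.

```python
import sqlite3
from itertools import groupby

# Hypothetical read table: (read_id, chrom, pos, strand, mapq).
# These columns are illustrative only, not the schema used by Genesis.
READS = [
    ("r1", "chr1", 100, "+", 60),
    ("r2", "chr1", 100, "+", 37),
    ("r3", "chr1", 250, "-", 12),
    ("r4", "chr2", 40, "+", 60),
    ("r5", "chr2", 40, "+", 60),
    ("r6", "chr2", 90, "-", 5),
]

# Standard SQL (an assumption; the paper uses an extended dialect):
# count the reads sharing a start position and strand, keeping the best MAPQ.
QUERY = """
    SELECT chrom, pos, strand, COUNT(*) AS n_reads, MAX(mapq) AS best_mapq
    FROM reads
    WHERE mapq >= ?
    GROUP BY chrom, pos, strand
    ORDER BY chrom, pos, strand
"""


def run_sql(reads, min_mapq):
    """Load the reads into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reads "
        "(read_id TEXT, chrom TEXT, pos INTEGER, strand TEXT, mapq INTEGER)"
    )
    conn.executemany("INSERT INTO reads VALUES (?, ?, ?, ?, ?)", reads)
    rows = conn.execute(QUERY, (min_mapq,)).fetchall()
    conn.close()
    return rows


def group_key(read):
    """Grouping key (chrom, pos, strand) shared by the sort and reduce stages."""
    return (read[1], read[2], read[3])


def filter_mapq(stream, min_mapq):
    """Filter stage: pass only reads whose MAPQ meets the threshold."""
    return (r for r in stream if r[4] >= min_mapq)


def sort_by_key(stream):
    """Sort stage: order reads so that equal keys become adjacent."""
    return iter(sorted(stream, key=group_key))


def reduce_groups(stream):
    """Reduce stage: emit one (key..., count, max MAPQ) row per group."""
    for key, group in groupby(stream, key=group_key):
        mapqs = [r[4] for r in group]
        yield (*key, len(mapqs), max(mapqs))


def run_pipeline(reads, min_mapq):
    """Compose the stages into one pipeline equivalent to QUERY."""
    return list(reduce_groups(sort_by_key(filter_mapq(iter(reads), min_mapq))))


if __name__ == "__main__":
    print(run_sql(READS, 20))
    print(run_pipeline(READS, 20))
```

Both functions return the same rows for the sample data, which is the point of the sketch: a declarative query fixes what is computed, while the stage chain is one possible way to execute it. In hardware the stages would run concurrently on streaming data rather than one after another as they do here.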


Information & Contributors

Information

Published In

ISCA '20: Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture
May 2020
1152 pages
ISBN:9781728146614

Sponsors

In-Cooperation

  • IEEE

Publisher

IEEE Press

Author Tags

  1. FPGA
  2. SQL
  3. genome sequencing
  4. genomic data analysis
  5. hardware accelerator

Conference

ISCA '20

Acceptance Rates

Overall Acceptance Rate 543 of 3,203 submissions, 17%

Cited By

  • (2024) Mozart: Taming Taxes and Composing Accelerators with Shared-Memory. Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques, pp. 183-200. DOI: 10.1145/3656019.3676896. Online publication date: 14-Oct-2024
  • (2023) Abakus: Accelerating k-mer Counting with Storage Technology. ACM Transactions on Architecture and Code Optimization, 21:1, pp. 1-26. DOI: 10.1145/3632952. Online publication date: 21-Nov-2023
  • (2023) GenDP: A Framework of Dynamic Programming Acceleration for Genome Sequencing Analysis. Proceedings of the 50th Annual International Symposium on Computer Architecture, pp. 1-15. DOI: 10.1145/3579371.3589060. Online publication date: 17-Jun-2023
  • (2023) JACC-FPGA. Future Generation Computer Systems, 138:C, pp. 26-42. DOI: 10.1016/j.future.2022.08.005. Online publication date: 1-Jan-2023
  • (2022) FPGA HLS Today: Successes, Challenges, and Opportunities. ACM Transactions on Reconfigurable Technology and Systems, 15:4, pp. 1-42. DOI: 10.1145/3530775. Online publication date: 8-Aug-2022
  • (2021) SquiggleFilter: An Accelerator for Portable Virus Detection. MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 535-549. DOI: 10.1145/3466752.3480117. Online publication date: 18-Oct-2021
