More Web Proxy on the site http://driver.im/

research-article

Public Access

SARVAVID: A Domain Specific Language for Developing Scalable Computational Genomics Applications

Authors:

Christopher Wright,

Milind Kulkarni,

Saurabh Bagchi,

Somali ChaterjiAuthors Info & Claims

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 34, Pages 1 - 12

https://doi.org/10.1145/2925426.2926283

Published: 01 June 2016 Publication History

Abstract

Breakthroughs in gene sequencing technologies have led to an exponential increase in the amount of genomic data. Efficient tools to rapidly process such large quantities of data are critical in the study of gene functions, diseases, evolution, and population variation. These tools are designed in an ad-hoc manner, and require extensive programmer effort to develop and optimize them. Often, such tools are written with the currently available data sizes in mind, and soon start to under perform due to the exponential growth in data. Furthermore, to obtain high-performance, these tools require parallel implementations, adding to the development complexity.

This paper makes an observation that most such tools contain a recurring set of software modules, or kernels. The availability of efficient implementations of such kernels can improve programmer productivity, and provide effective scalability with growing data. To achieve this goal, the paper presents a domain-specific language, called Sarvavid, which provides these kernels as language constructs. Sarvavid comes with a compiler that performs domain-specific optimizations, which are beyond the scope of libraries and generic compilers. Furthermore, Sarvavid inherently supports exploitation of parallelism across multiple nodes.

To demonstrate the efficacy of Sarvavid, we implement five well-known genomics applications---BLAST, MUMmer, E-MEM, SPAdes, and SGA---using Sarvavid. Our versions of BLAST, MUMmer, and E-MEM show a speedup of 2.4X, 2.5X, and 2.1X respectively compared to hand-optimized implementations when run on a single node, while SPAdes and SGA show the same performance as hand-written code. Moreover, Sarvavid applications scale to 1024 cores using a Hadoop backend.

References

[1]

Demo mummer. http://docs.seqan.de/seqan/2.0.0/page_DemoMummy.html.

[2]

Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403--410, 1990.

[3]

Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D Prjibelski, et al. Spades: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5):455--477, 2012.

[4]

Serafim Batzoglou, David B Jaffe, Ken Stanley, Jonathan Butler, Sante Gnerre, Evan Mauceli, Bonnie Berger, Jill P Mesirov, and Eric S Lander. Arachne: a whole-genome shotgun assembler. Genome research, 12(1):177--189, 2002.

[5]

Jessica Bolker. Model organisms: There's more to life than rats and flies. Nature, 491(7422):31--33, 2012.

[6]

Stefan Burkhardt, Andreas Crauser, Paolo Ferragina, Hans-Peter Lenhof, Eric Rivals, and Martin Vingron. q-gram based database searching using a suffix array (quasar). In Proceedings of the third annual international conference on Computational molecular biology, pages 77--83. ACM, 1999.

Digital Library

[7]

Davin Butt, Andrew J Roger, and Christian Blouin. libcov: A c++ bioinformatic library to manipulate protein structures, sequence alignments and phylogeny. BMC bioinformatics, 6(1):138, 2005.

[8]

Tracy Craddock, Colin R Harwood, Jennifer Hallinan, and Anil Wipat. e-science: relieving bottlenecks in large-scale genome analyses. Nature reviews microbiology, 6(12):948--954, 2008.

[9]

Aaron Darling, Lucas Carey, and Wu-chun Feng. The design, implementation, and evaluation of mpiblast. Proceedings of ClusterWorld, 2003:13--15, 2003.

[10]

Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107--113, 2008.

Digital Library

[11]

Andreas Döring, David Weese, Tobias Rausch, and Knut Reinert. Seqan an efficient, generic c++ library for sequence analysis. BMC bioinformatics, 9(1):11, 2008.

[12]

Alexei Drummond and Korbinian Strimmer. Pal: an object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics, 17(7):662--663, 2001.

[13]

Julien Dutheil, Sylvain Gaillard, Eric Bazin, Sylvain Glémin, Vincent Ranwez, Nicolas Galtier, and Khalid Belkhir. Bio++: a set of c++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC bioinformatics, 7(1):188, 2006.

[14]

Francisco Fernandes and Ana T Freitas. slamem: efficient retrieval of maximal exact matches using a sampled lcp array. Bioinformatics, 30(4):464--471, 2014.

[15]

Sante Gnerre, Iain MacCallum, Dariusz Przybylski, Filipe J Ribeiro, Joshua N Burton, Bruce J Walker, Ted Sharpe, Giles Hall, Terrance P Shea, Sean Sykes, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences, 108(4):1513--1518, 2011.

[16]

Gordon Gremme, Sascha Steinbiss, and Stefan Kurtz. Genometools: a comprehensive software library for efficient processing of structured genome annotations. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 10(3):645--656, 2013.

Digital Library

[17]

Leonard Guarente and Cynthia Kenyon. Genetic pathways that regulate ageing in model organisms. Nature, 408(6809):255--262, 2000.

[18]

Curtis Huttenhower, Mark Schroeder, Maria D Chikina, and Olga G Troyanskaya. The sleipnir library for computational functional genomics. Bioinformatics, 24(13):1559--1561, 2008.

Digital Library

[19]

W James Kent. Blatthe blast-like alignment tool. Genome research, 12(4):656--664, 2002.

[20]

Zia Khan, Joshua S Bloom, Leonid Kruglyak, and Mona Singh. A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays. Bioinformatics, 25(13):1609--1616, 2009.

Digital Library

[21]

Nilesh Khiste and Lucian Ilie. E-mem: efficient computation of maximal exact matches for very large genomes. Bioinformatics, 31(4):509--514, 2015.

[22]

Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu, and Steven L Salzberg. Versatile and open software for comparing large genomes. Genome biology, 5(2):R12, 2004.

[23]

Tak Wah Lam, Wing-Kin Sung, Siu-Lung Tam, Chi-Kwong Wong, and Siu-Ming Yiu. Compressed indexing and local alignment of dna. Bioinformatics, 24(6):791--797, 2008.

Digital Library

[24]

Ben Langmead, Kasper D Hansen, Jeffrey T Leek, et al. Cloud-scale rna-sequencing differential expression analysis with myrna. Genome Biol, 11(8):R83, 2010.

[25]

Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with bowtie 2. Nature methods, 9(4):357--359, 2012.

[26]

Kanak Mahadik, Somali Chaterji, Bowen Zhou, Milind Kulkarni, and Saurabh Bagchi. Orion: Scaling genomic sequence matching with fine-grained parallelization. In High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for, pages 449--460. IEEE, 2014.

Digital Library

[27]

Andréa Matsunaga, Maurício Tsugawa, and José Fortes. Cloudblast: Combining mapreduce and virtualization on distributed resources for bioinformatics applications. In eScience, 2008. eScience'08. IEEE Fourth International Conference on, pages 222--229. IEEE, 2008.

Digital Library

[28]

Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation dna sequencing data. Genome research, 20(9):1297--1303, 2010.

[29]

Saul B Needleman and Christian D Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 48(3):443--453, 1970.

[30]

Tung Nguyen, Weisong Shi, and Douglas Ruden. Cloudaligner: A fast and full-featured mapreduce based tool for sequence mapping. BMC research notes, 4(1):171, 2011.

[31]

Henrik Nordberg, Karan Bhatia, Kai Wang, and Zhong Wang. Biopig: a hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics, page btt528, 2013.

[32]

WR Pitt, Mark A Williams, M Steven, B Sweeney, Alan J. Bleasby, and David S. Moss. The bioinformatics template library&Uring;generic components for biocomputing. Bioinformatics, 17(8):729--737, 2001.

[33]

Peter Rice, Ian Longden, Alan Bleasby, et al. Emboss: the european molecular biology open software suite. Trends in genetics, 16(6):276--277, 2000.

[34]

Michael C Schatz. Cloudburst: highly sensitive read mapping with mapreduce. Bioinformatics, 25(11):1363--1369, 2009.

Digital Library

[35]

Matthew B Scholz, Chien-Chi Lo, and Patrick SG Chain. Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Current opinion in biotechnology, 23(1):9--15, 2012.

[36]

Jared T Simpson and Richard Durbin. Efficient de novo assembly of large genomes using compressed data structures. Genome research, 22(3):549--556, 2012.

[37]

Temple F Smith and Michael S Waterman. Identification of common molecular subsequences. Journal of molecular biology, 147(1):195--197, 1981.

[38]

Jason E Stajich, David Block, Kris Boulez, Steven E Brenner, Stephen A Chervitz, Chris Dagdigian, Georg Fuellen, James GR Gilbert, Ian Korf, Hilmar Lapp, et al. The bioperl toolkit: Perl modules for the life sciences. Genome research, 12(10):1611--1618, 2002.

[39]

Zachary D Stephens, Skylar Y Lee, Faraz Faghri, Roy H Campbell, Chengxiang Zhai, Miles J Efron, Ravishankar Iyer, Michael C Schatz, Saurabh Sinha, and Gene E Robinson. Big data: Astronomical or genomical? PLoS Biol, 13(7):e1002195, 2015.

[40]

D Vakatov, K Siyan, and J Ostell. The ncbi c++ toolkit {internet}. National Library of Medicine (US), National Center for Biotechnology Information, Bethesda (MD), 2003.

[41]

Michaël Vyverman, Bernard De Baets, Veerle Fack, and Peter Dawyndt. essamem: finding maximal exact matches using enhanced sparse suffix arrays. Bioinformatics, 29(6):802--804, 2013.

Digital Library

[42]

Lauren J. Young. Genomic data growing faster than twitter and youtube. http://spectrum.ieee.org/tech-talk/biomedical/diagnostics/the-human-os-is-at-the-top-of-big-data, July 2015.

Cited By

Zhou ZDu ZChaterji SZhang ARangwala H(2022)KRATOS: Context-Aware Cell Type Classification and Interpretation using Joint Dimensionality Reduction and ClusteringProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3534678.3539455(2616-2625)Online publication date: 14-Aug-2022
https://dl.acm.org/doi/10.1145/3534678.3539455
Berger BYu Y(2022)Navigating bottlenecks and trade-offs in genomic data analysisNature Reviews Genetics10.1038/s41576-022-00551-z24:4(235-250)Online publication date: 7-Dec-2022
https://doi.org/10.1038/s41576-022-00551-z
Roth MJain PKoo JChaterji S(2021)Simultaneous learning of individual microRNA-gene interactions and regulatory comodulesBMC Bioinformatics10.1186/s12859-021-04151-222:1Online publication date: 10-May-2021
https://doi.org/10.1186/s12859-021-04151-2
Show More Cited By

SARVAVID: A Domain Specific Language for Developing Scalable Computational Genomics Applications
1. Applied computing
  1. Life and medical sciences

Recommendations

G-quadruplex Structure Prediction and integration in the GenData2020 data model
BCB '16: Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics

DNA is a long polymer encoding information of all living organisms. It usually folds in a double-helix shape and packed in chromosomes. Nonetheless DNA exists in many other conformations, including A-DNA, B-DNA and Z-DNA. B-DNA is the most common form ...
eQTL networks unveil enriched mRNA master integrators downstream of complex disease-associated SNPs

Display Omitted We study the downstream transcriptome systems properties of 6432 SNPs of 574 complex diseases.eQTL network analyses unveil that downstream mRNAs are targeted by an excess number of SNPs.Thus multiple complex diseases associated SNPs tend ...
Identifying potential human and medicinal plant microRNAs against SARS-CoV-2 3′UTR region: A computational genomics assessment
Abstract
The coronavirus disease of 2019 (COVID-19) began as an outbreak and has taken a toll on human lives. The current pandemic requires scientific attention; hence we designed a systematic computational workflow to identify the cellular ...
Graphical abstract

Display Omitted
Highlights
- Analysis of 5024 SARS-CoV-2 complete genomes from 98 countries/territories worldwide.

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2016

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

Conference

ICS '16

Sponsor:

SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

15
Total Citations
View Citations
1,842
Total Downloads

Downloads (Last 12 months)1,273
Downloads (Last 6 weeks)270

Reflects downloads up to 15 Jan 2025

Other Metrics

View Author Metrics

Citations

Cited By

Zhou ZDu ZChaterji SZhang ARangwala H(2022)KRATOS: Context-Aware Cell Type Classification and Interpretation using Joint Dimensionality Reduction and ClusteringProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3534678.3539455(2616-2625)Online publication date: 14-Aug-2022
https://dl.acm.org/doi/10.1145/3534678.3539455
Berger BYu Y(2022)Navigating bottlenecks and trade-offs in genomic data analysisNature Reviews Genetics10.1038/s41576-022-00551-z24:4(235-250)Online publication date: 7-Dec-2022
https://doi.org/10.1038/s41576-022-00551-z
Roth MJain PKoo JChaterji S(2021)Simultaneous learning of individual microRNA-gene interactions and regulatory comodulesBMC Bioinformatics10.1186/s12859-021-04151-222:1Online publication date: 10-May-2021
https://doi.org/10.1186/s12859-021-04151-2
Yelick KBuluç AAwan MAzad ABrock BEgan REkanayake SEllis MGeorganas EGuidi GHofmeyr SSelvitopi OTeodoropol COliker L(2020)The parallelism motifs of genomic data analysisPhilosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences10.1098/rsta.2019.0394378:2166(20190394)Online publication date: 20-Jan-2020
https://doi.org/10.1098/rsta.2019.0394
Mahgoub AWood PMedoff AMitra SMeyer FChaterji SBagchi SDan TDahlia M(2019)SOPHIAProceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference10.5555/3358807.3358827(223-239)Online publication date: 10-Jul-2019
https://dl.acm.org/doi/10.5555/3358807.3358827
Bagheri HMuppirala UMasonbrink RSeverin ARajan H(2019)Shared data science infrastructure for genomics dataBMC Bioinformatics10.1186/s12859-019-2967-220:1Online publication date: 22-Aug-2019
https://doi.org/10.1186/s12859-019-2967-2
Shajii ANumanagić IBaghdadi RBerger BAmarasinghe S(2019)Seq: a high-performance language for bioinformaticsProceedings of the ACM on Programming Languages10.1145/33605513:OOPSLA(1-29)Online publication date: 10-Oct-2019
https://dl.acm.org/doi/10.1145/3360551
Koo JSaumya CKulkarni MBagchi S(2019)PySE: Automatic Worst-Case Test Generation by Reinforcement Learning2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST)10.1109/ICST.2019.00023(136-147)Online publication date: Apr-2019
https://doi.org/10.1109/ICST.2019.00023
Chaterji S(2019)Panel 2 Position Paper: AI could Solve the World’s Healthcare Problems and that too at Scale!2019 11th International Conference on Communication Systems & Networks (COMSNETS)10.1109/COMSNETS.2019.8711419(520-522)Online publication date: Jan-2019
https://doi.org/10.1109/COMSNETS.2019.8711419
Chaterji S(2019)Panel 1 Position Paper: Man versus Machines–Humans Need Not Apply2019 11th International Conference on Communication Systems & Networks (COMSNETS)10.1109/COMSNETS.2019.8711417(517-519)Online publication date: Jan-2019
https://doi.org/10.1109/COMSNETS.2019.8711417
Show More Cited By

View Options

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Media

Figures

Other

Tables

View Table of Contents