[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/2925426.2926283acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
research-article
Public Access

SARVAVID: A Domain Specific Language for Developing Scalable Computational Genomics Applications

Published: 01 June 2016 Publication History

Abstract

Breakthroughs in gene sequencing technologies have led to an exponential increase in the amount of genomic data. Efficient tools to rapidly process such large quantities of data are critical in the study of gene functions, diseases, evolution, and population variation. These tools are designed in an ad-hoc manner, and require extensive programmer effort to develop and optimize them. Often, such tools are written with the currently available data sizes in mind, and soon start to under perform due to the exponential growth in data. Furthermore, to obtain high-performance, these tools require parallel implementations, adding to the development complexity.
This paper makes an observation that most such tools contain a recurring set of software modules, or kernels. The availability of efficient implementations of such kernels can improve programmer productivity, and provide effective scalability with growing data. To achieve this goal, the paper presents a domain-specific language, called Sarvavid, which provides these kernels as language constructs. Sarvavid comes with a compiler that performs domain-specific optimizations, which are beyond the scope of libraries and generic compilers. Furthermore, Sarvavid inherently supports exploitation of parallelism across multiple nodes.
To demonstrate the efficacy of Sarvavid, we implement five well-known genomics applications---BLAST, MUMmer, E-MEM, SPAdes, and SGA---using Sarvavid. Our versions of BLAST, MUMmer, and E-MEM show a speedup of 2.4X, 2.5X, and 2.1X respectively compared to hand-optimized implementations when run on a single node, while SPAdes and SGA show the same performance as hand-written code. Moreover, Sarvavid applications scale to 1024 cores using a Hadoop backend.

References

[1]
Demo mummer. http://docs.seqan.de/seqan/2.0.0/page_DemoMummy.html.
[2]
Stephen F Altschul, Warren Gish, Webb Miller, Eugene W Myers, and David J Lipman. Basic local alignment search tool. Journal of molecular biology, 215(3):403--410, 1990.
[3]
Anton Bankevich, Sergey Nurk, Dmitry Antipov, Alexey A Gurevich, Mikhail Dvorkin, Alexander S Kulikov, Valery M Lesin, Sergey I Nikolenko, Son Pham, Andrey D Prjibelski, et al. Spades: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of Computational Biology, 19(5):455--477, 2012.
[4]
Serafim Batzoglou, David B Jaffe, Ken Stanley, Jonathan Butler, Sante Gnerre, Evan Mauceli, Bonnie Berger, Jill P Mesirov, and Eric S Lander. Arachne: a whole-genome shotgun assembler. Genome research, 12(1):177--189, 2002.
[5]
Jessica Bolker. Model organisms: There's more to life than rats and flies. Nature, 491(7422):31--33, 2012.
[6]
Stefan Burkhardt, Andreas Crauser, Paolo Ferragina, Hans-Peter Lenhof, Eric Rivals, and Martin Vingron. q-gram based database searching using a suffix array (quasar). In Proceedings of the third annual international conference on Computational molecular biology, pages 77--83. ACM, 1999.
[7]
Davin Butt, Andrew J Roger, and Christian Blouin. libcov: A c++ bioinformatic library to manipulate protein structures, sequence alignments and phylogeny. BMC bioinformatics, 6(1):138, 2005.
[8]
Tracy Craddock, Colin R Harwood, Jennifer Hallinan, and Anil Wipat. e-science: relieving bottlenecks in large-scale genome analyses. Nature reviews microbiology, 6(12):948--954, 2008.
[9]
Aaron Darling, Lucas Carey, and Wu-chun Feng. The design, implementation, and evaluation of mpiblast. Proceedings of ClusterWorld, 2003:13--15, 2003.
[10]
Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107--113, 2008.
[11]
Andreas Döring, David Weese, Tobias Rausch, and Knut Reinert. Seqan an efficient, generic c++ library for sequence analysis. BMC bioinformatics, 9(1):11, 2008.
[12]
Alexei Drummond and Korbinian Strimmer. Pal: an object-oriented programming library for molecular evolution and phylogenetics. Bioinformatics, 17(7):662--663, 2001.
[13]
Julien Dutheil, Sylvain Gaillard, Eric Bazin, Sylvain Glémin, Vincent Ranwez, Nicolas Galtier, and Khalid Belkhir. Bio++: a set of c++ libraries for sequence analysis, phylogenetics, molecular evolution and population genetics. BMC bioinformatics, 7(1):188, 2006.
[14]
Francisco Fernandes and Ana T Freitas. slamem: efficient retrieval of maximal exact matches using a sampled lcp array. Bioinformatics, 30(4):464--471, 2014.
[15]
Sante Gnerre, Iain MacCallum, Dariusz Przybylski, Filipe J Ribeiro, Joshua N Burton, Bruce J Walker, Ted Sharpe, Giles Hall, Terrance P Shea, Sean Sykes, et al. High-quality draft assemblies of mammalian genomes from massively parallel sequence data. Proceedings of the National Academy of Sciences, 108(4):1513--1518, 2011.
[16]
Gordon Gremme, Sascha Steinbiss, and Stefan Kurtz. Genometools: a comprehensive software library for efficient processing of structured genome annotations. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, 10(3):645--656, 2013.
[17]
Leonard Guarente and Cynthia Kenyon. Genetic pathways that regulate ageing in model organisms. Nature, 408(6809):255--262, 2000.
[18]
Curtis Huttenhower, Mark Schroeder, Maria D Chikina, and Olga G Troyanskaya. The sleipnir library for computational functional genomics. Bioinformatics, 24(13):1559--1561, 2008.
[19]
W James Kent. Blatthe blast-like alignment tool. Genome research, 12(4):656--664, 2002.
[20]
Zia Khan, Joshua S Bloom, Leonid Kruglyak, and Mona Singh. A practical algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays. Bioinformatics, 25(13):1609--1616, 2009.
[21]
Nilesh Khiste and Lucian Ilie. E-mem: efficient computation of maximal exact matches for very large genomes. Bioinformatics, 31(4):509--514, 2015.
[22]
Stefan Kurtz, Adam Phillippy, Arthur L Delcher, Michael Smoot, Martin Shumway, Corina Antonescu, and Steven L Salzberg. Versatile and open software for comparing large genomes. Genome biology, 5(2):R12, 2004.
[23]
Tak Wah Lam, Wing-Kin Sung, Siu-Lung Tam, Chi-Kwong Wong, and Siu-Ming Yiu. Compressed indexing and local alignment of dna. Bioinformatics, 24(6):791--797, 2008.
[24]
Ben Langmead, Kasper D Hansen, Jeffrey T Leek, et al. Cloud-scale rna-sequencing differential expression analysis with myrna. Genome Biol, 11(8):R83, 2010.
[25]
Ben Langmead and Steven L Salzberg. Fast gapped-read alignment with bowtie 2. Nature methods, 9(4):357--359, 2012.
[26]
Kanak Mahadik, Somali Chaterji, Bowen Zhou, Milind Kulkarni, and Saurabh Bagchi. Orion: Scaling genomic sequence matching with fine-grained parallelization. In High Performance Computing, Networking, Storage and Analysis, SC14: International Conference for, pages 449--460. IEEE, 2014.
[27]
Andréa Matsunaga, Maurício Tsugawa, and José Fortes. Cloudblast: Combining mapreduce and virtualization on distributed resources for bioinformatics applications. In eScience, 2008. eScience'08. IEEE Fourth International Conference on, pages 222--229. IEEE, 2008.
[28]
Aaron McKenna, Matthew Hanna, Eric Banks, Andrey Sivachenko, Kristian Cibulskis, Andrew Kernytsky, Kiran Garimella, David Altshuler, Stacey Gabriel, Mark Daly, et al. The genome analysis toolkit: a mapreduce framework for analyzing next-generation dna sequencing data. Genome research, 20(9):1297--1303, 2010.
[29]
Saul B Needleman and Christian D Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of molecular biology, 48(3):443--453, 1970.
[30]
Tung Nguyen, Weisong Shi, and Douglas Ruden. Cloudaligner: A fast and full-featured mapreduce based tool for sequence mapping. BMC research notes, 4(1):171, 2011.
[31]
Henrik Nordberg, Karan Bhatia, Kai Wang, and Zhong Wang. Biopig: a hadoop-based analytic toolkit for large-scale sequence data. Bioinformatics, page btt528, 2013.
[32]
WR Pitt, Mark A Williams, M Steven, B Sweeney, Alan J. Bleasby, and David S. Moss. The bioinformatics template libraryŮgeneric components for biocomputing. Bioinformatics, 17(8):729--737, 2001.
[33]
Peter Rice, Ian Longden, Alan Bleasby, et al. Emboss: the european molecular biology open software suite. Trends in genetics, 16(6):276--277, 2000.
[34]
Michael C Schatz. Cloudburst: highly sensitive read mapping with mapreduce. Bioinformatics, 25(11):1363--1369, 2009.
[35]
Matthew B Scholz, Chien-Chi Lo, and Patrick SG Chain. Next generation sequencing and bioinformatic bottlenecks: the current state of metagenomic data analysis. Current opinion in biotechnology, 23(1):9--15, 2012.
[36]
Jared T Simpson and Richard Durbin. Efficient de novo assembly of large genomes using compressed data structures. Genome research, 22(3):549--556, 2012.
[37]
Temple F Smith and Michael S Waterman. Identification of common molecular subsequences. Journal of molecular biology, 147(1):195--197, 1981.
[38]
Jason E Stajich, David Block, Kris Boulez, Steven E Brenner, Stephen A Chervitz, Chris Dagdigian, Georg Fuellen, James GR Gilbert, Ian Korf, Hilmar Lapp, et al. The bioperl toolkit: Perl modules for the life sciences. Genome research, 12(10):1611--1618, 2002.
[39]
Zachary D Stephens, Skylar Y Lee, Faraz Faghri, Roy H Campbell, Chengxiang Zhai, Miles J Efron, Ravishankar Iyer, Michael C Schatz, Saurabh Sinha, and Gene E Robinson. Big data: Astronomical or genomical? PLoS Biol, 13(7):e1002195, 2015.
[40]
D Vakatov, K Siyan, and J Ostell. The ncbi c++ toolkit {internet}. National Library of Medicine (US), National Center for Biotechnology Information, Bethesda (MD), 2003.
[41]
Michaël Vyverman, Bernard De Baets, Veerle Fack, and Peter Dawyndt. essamem: finding maximal exact matches using enhanced sparse suffix arrays. Bioinformatics, 29(6):802--804, 2013.
[42]
Lauren J. Young. Genomic data growing faster than twitter and youtube. http://spectrum.ieee.org/tech-talk/biomedical/diagnostics/the-human-os-is-at-the-top-of-big-data, July 2015.

Cited By

View all
  • (2022)KRATOS: Context-Aware Cell Type Classification and Interpretation using Joint Dimensionality Reduction and ClusteringProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3534678.3539455(2616-2625)Online publication date: 14-Aug-2022
  • (2022)Navigating bottlenecks and trade-offs in genomic data analysisNature Reviews Genetics10.1038/s41576-022-00551-z24:4(235-250)Online publication date: 7-Dec-2022
  • (2021)Simultaneous learning of individual microRNA-gene interactions and regulatory comodulesBMC Bioinformatics10.1186/s12859-021-04151-222:1Online publication date: 10-May-2021
  • Show More Cited By
  1. SARVAVID: A Domain Specific Language for Developing Scalable Computational Genomics Applications

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    ICS '16: Proceedings of the 2016 International Conference on Supercomputing
    June 2016
    547 pages
    ISBN:9781450343619
    DOI:10.1145/2925426
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 June 2016

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Computational Genomics
    2. Distributed Systems

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Funding Sources

    Conference

    ICS '16
    Sponsor:

    Acceptance Rates

    Overall Acceptance Rate 629 of 2,180 submissions, 29%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)1,273
    • Downloads (Last 6 weeks)270
    Reflects downloads up to 15 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2022)KRATOS: Context-Aware Cell Type Classification and Interpretation using Joint Dimensionality Reduction and ClusteringProceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining10.1145/3534678.3539455(2616-2625)Online publication date: 14-Aug-2022
    • (2022)Navigating bottlenecks and trade-offs in genomic data analysisNature Reviews Genetics10.1038/s41576-022-00551-z24:4(235-250)Online publication date: 7-Dec-2022
    • (2021)Simultaneous learning of individual microRNA-gene interactions and regulatory comodulesBMC Bioinformatics10.1186/s12859-021-04151-222:1Online publication date: 10-May-2021
    • (2020)The parallelism motifs of genomic data analysisPhilosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences10.1098/rsta.2019.0394378:2166(20190394)Online publication date: 20-Jan-2020
    • (2019)SOPHIAProceedings of the 2019 USENIX Conference on Usenix Annual Technical Conference10.5555/3358807.3358827(223-239)Online publication date: 10-Jul-2019
    • (2019)Shared data science infrastructure for genomics dataBMC Bioinformatics10.1186/s12859-019-2967-220:1Online publication date: 22-Aug-2019
    • (2019)Seq: a high-performance language for bioinformaticsProceedings of the ACM on Programming Languages10.1145/33605513:OOPSLA(1-29)Online publication date: 10-Oct-2019
    • (2019)PySE: Automatic Worst-Case Test Generation by Reinforcement Learning2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST)10.1109/ICST.2019.00023(136-147)Online publication date: Apr-2019
    • (2019)Panel 2 Position Paper: AI could Solve the World’s Healthcare Problems and that too at Scale!2019 11th International Conference on Communication Systems & Networks (COMSNETS)10.1109/COMSNETS.2019.8711419(520-522)Online publication date: Jan-2019
    • (2019)Panel 1 Position Paper: Man versus Machines–Humans Need Not Apply2019 11th International Conference on Communication Systems & Networks (COMSNETS)10.1109/COMSNETS.2019.8711417(517-519)Online publication date: Jan-2019
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media