DOI: 10.5555/3154690.3154705
Article

Persona: a high-performance bioinformatics framework

Published: 12 July 2017

Abstract

Next-generation genome sequencing technology has reached a point at which it is becoming cost-effective to sequence all patients. Biobanks and researchers are faced with an oncoming deluge of genomic data, whose processing requires new and scalable bioinformatics architectures and systems. Processing raw genetic sequence data is computationally expensive and datasets are large. Current software systems can require many hours to process a single genome and generally run only on a single computer. Common file formats are monolithic and row-oriented, a barrier to distributed computation.
To address these challenges, we built Persona, a cluster-scale, high-throughput bioinformatics framework. Persona currently supports paired-read alignment, sorting, and duplicate marking using well-known algorithms and techniques. Persona can significantly reduce end-to-end processing times for bioinformatics computations. A new Aggregate Genomic Data (AGD) format unifies sample data and analysis results, while enabling efficient distributed computation and I/O.
In a case study on sequence alignment, Persona sustains 1.353 gigabases aligned per second with 101-base-pair reads on a 32-node cluster and can align a full genome in ∼16.7 seconds using the SNAP algorithm. Our results demonstrate that: (1) alignment computation with Persona scales linearly across servers with no measurable completion-time imbalance and negligible framework overheads; (2) on a single server, sorting with Persona and AGD is up to 2.3× faster than commonly used tools, while duplicate marking is 3× faster; (3) with AGD, a 7-node COTS network storage system can service up to 60 alignment compute nodes; (4) server cost dominates for a balanced system running Persona, while long-term data storage dwarfs the cost of computation.
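
The headline alignment figures can be cross-checked with a little arithmetic. The sketch below is not from the paper: it takes only the four numbers quoted in the abstract and derives the read rate, the per-node share, and the dataset size they imply. It assumes that the ∼16.7-second genome alignment ran at the sustained 1.353 gigabases per second and that work is spread evenly over the 32 nodes, which the linear-scaling claim suggests but does not state outright. The function name and dictionary keys are hypothetical.

```python
# Figures quoted in the abstract.
GIGABASES_PER_SECOND = 1.353   # sustained cluster-wide alignment throughput
READ_LENGTH_BP = 101           # bases per read
CLUSTER_NODES = 32             # servers in the alignment cluster
GENOME_ALIGN_SECONDS = 16.7    # approximate time to align one full genome


def derived_figures():
    """Return quantities implied by the abstract's numbers.

    Assumption (not stated in the abstract): the full-genome run proceeds
    at the sustained throughput, and load is even across all nodes.
    """
    bases_per_second = GIGABASES_PER_SECOND * 1e9
    total_bases = bases_per_second * GENOME_ALIGN_SECONDS
    return {
        "reads_per_second": bases_per_second / READ_LENGTH_BP,
        "megabases_per_second_per_node": bases_per_second / CLUSTER_NODES / 1e6,
        "dataset_gigabases": total_bases / 1e9,
        "dataset_million_reads": total_bases / READ_LENGTH_BP / 1e6,
    }


if __name__ == "__main__":
    for name, value in derived_figures().items():
        print(f"{name}: {value:,.1f}")
```

Under these assumptions each node aligns roughly 42 megabases per second, and the genome dataset works out to about 22.6 gigabases, or roughly 224 million 101-base-pair reads. These are inferred values, not measurements reported by the authors.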

Cited By

  • (2018) High-performance genomic analysis framework with in-memory computing. ACM SIGPLAN Notices 53:1 (317-328). DOI: 10.1145/3200691.3178511. Online publication date: 10-Feb-2018
  • (2018) High-performance genomic analysis framework with in-memory computing. Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (317-328). DOI: 10.1145/3178487.3178511. Online publication date: 10-Feb-2018
  • (2018) Genax. Proceedings of the 45th Annual International Symposium on Computer Architecture (69-82). DOI: 10.1109/ISCA.2018.00017. Online publication date: 2-Jun-2018

Published In

USENIX ATC '17: Proceedings of the 2017 USENIX Conference on Usenix Annual Technical Conference
July 2017
811 pages
ISBN: 9781931971386

Sponsors

  • VMware
  • NetApp
  • Microsoft
  • Facebook
  • Oracle

Publisher

USENIX Association

United States

Qualifiers

  • Article
