Experiences in the Development of a Data Management System for Genomics

  • Conference paper
Data Management Technologies and Applications (DATA 2017)

Abstract

GMQL is a high-level query language for genomics that operates on datasets described through GDM, a unifying data model for processed data formats. Together, GMQL and GDM support the integration of processed genomic datasets, i.e., of signals produced by the genome after sequencing and long data extraction pipelines. While most of the processing load of today’s genomic platforms is due to data extraction pipelines, we anticipate that attention will soon shift towards processed datasets, as such data are being collected by large consortia and are becoming increasingly available.

In our view, biology and personalized medicine will increasingly rely on data extraction and analysis methods for inferring new knowledge from existing heterogeneous repositories of processed datasets, typically augmented with experimental data targeting individuals or small populations. While today’s big data are the raw reads of sequencing machines, tomorrow’s big data will also include billions or trillions of genomic regions, each featuring specific values that depend on the processing conditions.

Accordingly, GMQL is a high-level, declarative language inspired by big data management, and its execution engines include classic cloud-based systems, from Pig to Flink to SciDB to Spark. In this paper, we discuss how the GMQL execution environment has been developed through a major version change that marked a complete system redesign; we also discuss our experiences in comparatively evaluating the four platforms.

Notes

  1. http://www.bioinformatics.deib.polimi.it/gendata/, PRIN Italian National Project, 2013–2016.

  2. Data-Driven Genomic Computing, http://www.bioinformatics.deib.polimi.it/geco/, ERC Advanced Grant, 2016–2021.

  3. GeCo V2 software is available at https://github.com/DEIB-GECO/GMQL.

Acknowledgment

This research is funded by the ERC Advanced Grant project GeCo (Data-Driven Genomic Computing), 2016–2021.

Author information

Corresponding author

Correspondence to Stefano Ceri.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Ceri, S., Canakoglu, A., Kaitoua, A., Masseroli, M., Pinoli, P. (2018). Experiences in the Development of a Data Management System for Genomics. In: Filipe, J., Bernardino, J., Quix, C. (eds) Data Management Technologies and Applications. DATA 2017. Communications in Computer and Information Science, vol 814. Springer, Cham. https://doi.org/10.1007/978-3-319-94809-6_10

  • DOI: https://doi.org/10.1007/978-3-319-94809-6_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-94808-9

  • Online ISBN: 978-3-319-94809-6

  • eBook Packages: Computer Science, Computer Science (R0)
