Experiences in the Development of a Data Management System for Genomics

Stefano Ceri¹²,
Arif Canakoglu¹²,
Abdulrahman Kaitoua¹²,
Marco Masseroli¹² &
Pietro Pinoli¹²

Part of the book series: Communications in Computer and Information Science (CCIS, volume 814)

Included in the following conference series:

International Conference on Data Management Technologies and Applications

Abstract

GMQL is a high-level query language for genomics that operates on datasets described through GDM, a unifying data model for processed data formats. Together, they support the integration of processed genomic datasets, i.e. of signals produced by the genome after sequencing and long data extraction pipelines. While most of the processing load of today’s genomic platforms is due to data extraction pipelines, we anticipate that attention will soon shift towards processed datasets, as such data are being collected by large consortia and are becoming increasingly available.

In our view, biology and personalized medicine will increasingly rely on data extraction and analysis methods for inferring new knowledge from existing heterogeneous repositories of processed datasets, typically augmented with the results of experimental data targeting individuals or small populations. While today’s big data are raw reads of the sequencing machines, tomorrow’s big data will also include billions or trillions of genomic regions, each featuring specific values depending on the processing conditions.

Accordingly, GMQL is a high-level, declarative language inspired by big data management, and its execution engines include classic cloud-based systems, from Pig to Flink to SciDB to Spark. In this paper, we discuss how the GMQL execution environment was developed through a major version change that marked a complete system redesign; we also discuss our experience in comparatively evaluating the four platforms.

Notes

1. http://www.bioinformatics.deib.polimi.it/gendata/, PRIN Italian National Project, 2013–2016.
2. Data-Driven Genomic Computing, http://www.bioinformatics.deib.polimi.it/geco/, ERC Advanced Grant, 2016–2021.
3. GeCo V2 software is available at https://github.com/DEIB-GECO/GMQL.

Acknowledgment

This research is funded by the ERC Advanced Grant project GeCo (Data-Driven Genomic Computing), 2016–2021.

Author information

Authors and Affiliations

Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy
Stefano Ceri, Arif Canakoglu, Abdulrahman Kaitoua, Marco Masseroli & Pietro Pinoli

Corresponding author

Correspondence to Stefano Ceri.

Editor information

Editors and Affiliations

INSTICC, Polytechnic Institute of Setúbal, Setúbal, Portugal
Joaquim Filipe
University of Coimbra, Coimbra, Portugal
Jorge Bernardino
RWTH Aachen University, Aachen, Germany
Christoph Quix

About this paper

Cite this paper

Ceri, S., Canakoglu, A., Kaitoua, A., Masseroli, M., Pinoli, P. (2018). Experiences in the Development of a Data Management System for Genomics. In: Filipe, J., Bernardino, J., Quix, C. (eds) Data Management Technologies and Applications. DATA 2017. Communications in Computer and Information Science, vol 814. Springer, Cham. https://doi.org/10.1007/978-3-319-94809-6_10

DOI: https://doi.org/10.1007/978-3-319-94809-6_10
Published: 30 June 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-94808-9
Online ISBN: 978-3-319-94809-6
eBook Packages: Computer Science, Computer Science (R0)
