SPARSE ARRAY STORAGE FOR GENOMICS
Using high-level APIs provided in C++, Java*, and Spark*, users can write variant records to, and read them from, shared-nothing GenomicsDB instances in parallel, with multiple processes running in a Single Program Multiple Data (SPMD) manner.
GenomicsDB uses columnar sparse arrays in which samples are mapped to rows and genome positions, or variant sites, are mapped to columns. These columns are partitioned in a shared-nothing fashion across thousands of machines, enabling the joint genotyping workflow in the Broad Institute’s Genome Analysis Toolkit (GATK) to scale to 100,000 samples and beyond.
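The layout described above can be illustrated with a small, self-contained sketch. This is not GenomicsDB's API or storage format; the class `SparseVariantArray`, the helper `partition_for`, and the partition boundaries in `PARTITION_STARTS` are hypothetical names and values chosen only to show the idea: cells are keyed by (sample row, genomic column), only non-empty cells are stored, and each column range lives in its own independent partition.

```python
from bisect import bisect_right

# Hypothetical partition boundaries (assumption, not from GenomicsDB):
# partition i owns columns from PARTITION_STARTS[i] up to the next start.
PARTITION_STARTS = [0, 1_000_000, 2_000_000]


def partition_for(column):
    """Return the index of the partition that owns a genomic column."""
    if column < 0:
        raise ValueError("column must be non-negative")
    return bisect_right(PARTITION_STARTS, column) - 1


class SparseVariantArray:
    """Toy sparse array: rows are samples, columns are genome positions."""

    def __init__(self):
        # One independent dict per partition; nothing is shared between them,
        # so each could live on a separate machine or process.
        self.partitions = [dict() for _ in PARTITION_STARTS]

    def write(self, sample_row, column, genotype):
        """Store a single non-empty cell in the partition owning its column."""
        self.partitions[partition_for(column)][(sample_row, column)] = genotype

    def query(self, col_start, col_end):
        """Return (column, row, genotype) for col_start <= column <= col_end."""
        hits = []
        # Only partitions overlapping the requested column range are touched.
        for p in range(partition_for(col_start), partition_for(col_end) + 1):
            for (row, col), genotype in self.partitions[p].items():
                if col_start <= col <= col_end:
                    hits.append((col, row, genotype))
        return sorted(hits)


arr = SparseVariantArray()
arr.write(0, 150, "0/1")        # sample 0 has a variant at position 150
arr.write(1, 150, "1/1")        # sample 1 has a variant at the same site
arr.write(2, 1_500_000, "0/1")  # lands in the second partition
print(arr.query(100, 200))
```

Because a query over a column range only visits the partitions that overlap it, each partition can be loaded and queried by a separate process without coordination, which is the property the shared-nothing design relies on.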
GenomicsDB allows bioinformaticians to achieve analysis results with high statistical confidence. The low-level storage format enables faster and more efficient retrieval from disk than file-based approaches. GenomicsDB also compresses data on disk using libraries optimized for Intel® architecture. Taken together, these choices yield orders-of-magnitude performance improvements over existing tools.
GenomicsDB was initially developed by Intel in collaboration with the Broad Institute of MIT & Harvard. GenomicsDB is an open-source library and set of tools focused on optimizing sparse array storage for genomic data. It is currently hosted and developed by the open-source community, sponsored by dātma Health Science.
Karthik Gururaj - Primary Contributor
Eric Banks - Broad Institute
Christopher Denny - UCLA
Kemal Sonmez - Oregon Health & Science University
Jaclyn Smith - Oxford University
Melvin Lathara - dātma
Nalini Ganapati - dātma
Aleks Shar - dātma
If you would like to become a charter member, please contact us. Charter members help guide the future development of GenomicsDB.
GenomicsDB is a C++ library built on top of an array-based storage system for importing, querying, and transforming variant data.