research-article

ChronosDB: distributed, file based, geospatial array DBMS

Published: 01 June 2018

Abstract

An array DBMS streamlines the management of large N-d arrays. A large portion of such arrays originates from the geospatial domain. These arrays often come natively as raster files, and standalone command-line tools are among the most popular ways to process them. Decades of development and feedback have produced numerous feature-rich, elaborate, free, and quality-assured tools, optimized mostly for a single machine. ChronosDB partially delegates in situ data processing to such tools and offers a formal N-d array data model that abstracts from both the files and the tools. ChronosDB readily provides a rich collection of array operations at scale and outperforms SciDB by up to 75× on average.

Information & Contributors

Information

Published In

Proceedings of the VLDB Endowment, Volume 11, Issue 10
June 2018
248 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
