[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Spatial online sampling and aggregation

Published: 01 November 2015 Publication History

Abstract

The massive adoption of smart phones and other mobile devices has generated humongous amount of spatial and spatio-temporal data. The importance of spatial analytics and aggregation is ever-increasing. An important challenge is to support interactive exploration over such data. However, spatial analytics and aggregation using all data points that satisfy a query condition is expensive, especially over large data sets, and could not meet the needs of interactive exploration. To that end, we present novel indexing structures that support spatial online sampling and aggregation on large spatial and spatio-temporal data sets. In spatial online sampling, random samples from the set of spatial (or spatio-temporal) points that satisfy a query condition are generated incrementally in an online fashion. With more and more samples, various spatial analytics and aggregations can be performed in an online, interactive fashion, with estimators that have better accuracy over time. Our design works well for both memory-based and disk-resident data sets, and scales well towards different query and sample sizes. More importantly, our structures are dynamic, hence, they are able to deal with insertions and deletions efficiently. Extensive experiments on large real data sets demonstrate the improvements achieved by our indexing structures compared to other baseline methods.

References

[1]
S. Agarwal, B. Mozafari, A. Panda, H. Milner, S. Madden, and I. Stoica. BlinkDB: queries with bounded errors and bounded response times on very large data. In EuroSys, 2013.
[2]
S. Agarwal, A. Panda, B. Mozafari, A. P. Iyer, S. Madden, and I. Stoica. Blink and it's done: Interactive queries on very large data. In PVLDB, volume 5, 2012.
[3]
L. Arge. The buffer tree: A technique for designing batched external data structures. Algorithmica, 37(1):1--24, 2003.
[4]
L. Arge, M. de Berg, H. Haverkort, and K. Yi. The priority r-tree: A practically efficient and worst-case optimal r-tree. ACM Transactions on Algorithms, 4(1), 2008.
[5]
L. Azevedo, G. Zimbro1, and J. de Souza. Approximate query processing in spatial databases using raster signatures. In Advances in Geoinformatics, pages 69--86. Springer Berlin Heidelberg, 2007.
[6]
A. Belussi, B. Catania, and S. Migliorini. Approximate queries for spatial data. In Advanced Query Processing, volume 36 of Intelligent Systems Reference Library, pages 83--127. Springer Berlin Heidelberg, 2013.
[7]
S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. In SIGMOD, 1999.
[8]
A. Das Sarma, H. Lee, H. Gonzalez, J. Madhavan, and A. Halevy. Efficient spatial sampling of large geographical tables. In SIGMOD, New York, New York, USA, 2012.
[9]
R. Gemulla and W. Lehner. Deferred maintenance of disk-based random samples. In EDBT, 2006.
[10]
A. A. Guttman. R-trees: A Dynamic Index Structure for Spatial Searching. In SIGMOD, 1984.
[11]
P. Haas and J. Hellerstein. Ripple joins for online aggregation. In SIGMOD, pages 287--298, 1999.
[12]
P. J. Haas. Large-sample and deterministic confidence intervals for online aggregation. In SSDBM, 1997.
[13]
J. Hellerstein, P. Haas, and H. Wang. Online aggregation. In SIGMOD, 1997.
[14]
X. Hu, M. Qiao, and Y. Tao. Independent range sampling. In PODS, 2014.
[15]
R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas. The monte carlo database system: Stochastic analysis close to the data. ACM TODS, 36(3):18, 2011.
[16]
P. Jayachandran, K. Tunga, N. Kamat, and A. Nandi. Combining user interaction, speculative query execution and sampling in the DICE system. PVLDB, 7(13):1697--1700, 2014.
[17]
C. Jermaine, S. Arumugam, A. Pol, and A. Dobra. Scalable approximate query processing with the dbo engine. ACM Transactions on Database Systems, 33(4), Article 23, 2008.
[18]
C. Jermaine, A. Datta, and E. Omiecinski. A novel index supporting high volume data waresshouse insertion. In VLDB, pages 235--246, 1999.
[19]
C. Jermaine, A. Pol, and S. Arumugam. Online maintenance of very large random samples. In SIGMOD, 2004.
[20]
S. Joshi and C. Jermaine. Materialized sample views for database approximation. IEEE Transactions on Knowledge and Data Engineering, 20(3):337--351, 2008.
[21]
I. Kamel and C. Faloutsos. Hilbert R-tree: An Improved R-tree using Fractals. In VLDB, 1994.
[22]
A. Klein, R. Gemulla, P. Rösch, and W. Lehner. Derby/s: a DBMS for sample-based query answering. In SIGMOD, 2006.
[23]
R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.
[24]
S. Nath and P. B. Gibbons. Online maintenance of very large random samples on flash storage. PVLDB, 1(1):970--983, 2008.
[25]
S. Nirkhiwale, A. Dobra, and C. M. Jermaine. A sampling algebra for aggregate estimation. PVLDB, 6(14):1798--1809, 2013.
[26]
F. Olken. Random Sampling from Databases. PhD thesis, University of California at Berkeley, 1993.
[27]
F. Olken and D. Rotem. Random sampling from B+ trees. In VLDB, 1989.
[28]
F. Olken and D. Rotem. Sampling from spatial databases. In ICDE, 1993.
[29]
OpenStreetMap. OpenStreetMap.org, 2014.
[30]
N. Pansare, V. R. Borkar, C. Jermaine, and T. Condie. Online aggregation for large mapreduce jobs. In PVLDB, volume 4, 2011.
[31]
P. Weber. User-Generated Street Maps. Pervasive Computing, IEEE, 7(4):12--18, 2008.
[32]
F. Xu, C. M. Jermaine, and A. Dobra. Confidence bounds for sampling-based group by estimates. ACM TODS, 33(3), 2008.
[33]
Y. Yan, L. J. Chen, and Z. Zhang. Error-bounded sampling for analytics on big sparse data. PVLDB, 7(13):1508--1519, 2014.
[34]
K. Yi, L. Wang, and Z. Wei. Indexing for summary queries: Theory and practice. ACM Transactions on Database Systems, 39(1), 2014.
[35]
K. Zeng, S. Gao, J. Gu, B. Mozafari, and C. Zaniolo. ABS: a system for scalable approximate queries with accuracy guarantees. In SIGMOD, 2014.
[36]
K. Zeng, S. Gao, B. Mozafari, and C. Zaniolo. The analytical bootstrap: a new method for fast error estimation in approximate query processing. In SIGMOD, 2014.
[37]
Y. Zheng, Q. Li, Y. Chen, X. Xie, and W.-Y. Ma. Understanding mobility based on GPS data. In UbiComp, 2008.

Cited By

View all
  • (2023)dbET: Execution Time Distribution-based Plan SelectionProceedings of the ACM on Management of Data10.1145/35887111:1(1-26)Online publication date: 30-May-2023
  • (2022)Enabling efficient and general subpopulation analytics in multidimensional data streamsProceedings of the VLDB Endowment10.14778/3551793.355186715:11(3249-3262)Online publication date: 1-Jul-2022
  • (2022)Algorithmic Techniques for Independent Query SamplingProceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3517804.3526068(129-138)Online publication date: 12-Jun-2022
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 9, Issue 3
November 2015
144 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 November 2015
Published in PVLDB Volume 9, Issue 3

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)7
  • Downloads (Last 6 weeks)1
Reflects downloads up to 08 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2023)dbET: Execution Time Distribution-based Plan SelectionProceedings of the ACM on Management of Data10.1145/35887111:1(1-26)Online publication date: 30-May-2023
  • (2022)Enabling efficient and general subpopulation analytics in multidimensional data streamsProceedings of the VLDB Endowment10.14778/3551793.355186715:11(3249-3262)Online publication date: 1-Jul-2022
  • (2022)Algorithmic Techniques for Independent Query SamplingProceedings of the 41st ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems10.1145/3517804.3526068(129-138)Online publication date: 12-Jun-2022
  • (2021)Spatial Independent Range SamplingProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3452806(2023-2035)Online publication date: 9-Jun-2021
  • (2021)GeoSparkViz: a cluster computing system for visualizing massive-scale geospatial dataThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-020-00645-230:2(237-258)Online publication date: 7-Jan-2021
  • (2020)DiSAProceedings of the 28th International Conference on Advances in Geographic Information Systems10.1145/3397536.3422333(147-150)Online publication date: 3-Nov-2020
  • (2020)Marviq: Quality-Aware Geospatial Visualization of Range-Selection Queries Using MaterializationProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3389730(67-82)Online publication date: 11-Jun-2020
  • (2019)Machine learning meets big spatial dataProceedings of the VLDB Endowment10.14778/3352063.335211512:12(1982-1985)Online publication date: 1-Aug-2019
  • (2019)DeepSPACEProceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems10.1145/3347146.3359112(500-503)Online publication date: 5-Nov-2019
  • (2019)An efficient aggregation and overlap removal algorithm for circle mapsGeoinformatica10.1007/s10707-019-00342-523:3(473-498)Online publication date: 1-Jul-2019
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media