DOI: 10.1007/978-3-642-39467-6_19
Article

Sampling estimators for parallel online aggregation

Published: 08 July 2013

Abstract

Online aggregation provides estimates of the final result of a computation while the computation is still running. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution. When coupled with parallel processing, this allows for interactive exploration of the largest datasets. In this paper, we identify the main functionality requirements of sampling-based parallel online aggregation: partial aggregation, parallel sampling, and estimation. We argue for overlapped online aggregation as the only scalable solution to combine computation and estimation. We analyze the properties of existing estimators and design a novel sampling-based estimator that is robust to node delay and failure. When executed over a massive 8TB TPC-H instance, the proposed estimator achieves linear scalability and provides accurate confidence bounds early in the execution, even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size.
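To make the idea of an estimate with confidence bounds concrete, the following is a minimal single-node sketch in Python. It is not the estimator proposed in the paper: it is the textbook sampling-without-replacement estimator for SUM with a normal-approximation confidence interval and a finite population correction. The function names `estimate_sum` and `online_sum`, the chunked scan, and the default `z=1.96` (roughly 95% confidence) are assumptions made for illustration only.

```python
import math
import random


def estimate_sum(sample, population_size, z=1.96):
    """Return (estimate, low, high) for SUM over the full dataset.

    Assumes `sample` is a uniform random sample drawn without
    replacement from a dataset of `population_size` values.
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two sampled values")
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    # Finite population correction: the bounds collapse to the exact
    # answer once the whole dataset has been processed (n == N).
    fpc = (population_size - n) / population_size
    estimate = population_size * mean
    half_width = z * population_size * math.sqrt(fpc * var / n)
    return estimate, estimate - half_width, estimate + half_width


def online_sum(data, chunk_size, seed=0):
    """Yield (tuples_seen, estimate, low, high) after each chunk.

    Scanning the data in random order makes every prefix a uniform
    sample, so the estimator above applies at any point in the scan.
    """
    rng = random.Random(seed)
    order = list(data)
    rng.shuffle(order)
    seen = []
    for start in range(0, len(order), chunk_size):
        seen.extend(order[start:start + chunk_size])
        if len(seen) >= 2:
            yield (len(seen), *estimate_sum(seen, len(order)))


if __name__ == "__main__":
    for n, est, low, high in online_sum(range(1, 1001), chunk_size=100):
        print(f"{n:5d} tuples: {est:10.1f} in [{low:10.1f}, {high:10.1f}]")
```

A user could stop iterating over `online_sum` as soon as the interval is narrow enough for their purpose. The sketch ignores the parallel concerns named in the abstract (partial aggregation across nodes, node delay, and failure), which are the paper's actual contribution.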


Cited By

  • (2018) Random Sampling over Joins Revisited. Proceedings of the 2018 International Conference on Management of Data, pp. 1525-1539. DOI: 10.1145/3183713.3183739. Online publication date: 27-May-2018


Published In

BNCOD'13: Proceedings of the 29th British National conference on Big Data
July 2013
302 pages
ISBN: 9783642394669
Editors: Georg Gottlob, Giovanni Grasso, Dan Olteanu, Christian Schallhart

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. estimation
  2. online aggregation
  3. parallel databases
  4. sampling
