Abstract
Online aggregation provides estimates to the final result of a computation during the actual processing. The user can stop the computation as soon as the estimate is accurate enough, typically early in the execution. This allows for the interactive data exploration of the largest datasets.
In this paper we introduce the first framework for parallel online aggregation in which the estimation virtually does not incur any overhead on top of the actual execution. We define a generic interface to express any estimation model that abstracts completely the execution details. We design a novel estimator specifically targeted at parallel online aggregation. When executed by the framework over a massive 8 TB TPC-H instance, the estimator provides accurate confidence bounds early in the execution even when the cardinality of the final result is seven orders of magnitude smaller than the dataset size and without incurring overhead.
Similar content being viewed by others
References
Agarwal, S., Panda, A., Mozafari, B., Iyer, A.P., Madden, S., Stoica, I.: Blink and it’s done: interactive queries on very large data. Proc. VLDB Endow. 5(12), 1902–1905 (2012)
Arumugam, S., Dobra, A., Jermaine, C., Pansare, N., Perez, L.: The DataPath system: a data-centric analytic processing engine for large data warehouses. In: Proceedings of 2010 ACM SIGMOD International Conference on Management of Data, pp. 519–530 (2010)
Avnur, R., Hellerstein, J.M., Lo, B., Olston, C., Raman, B., Raman, V., Roth, T., Wylie, K.: CONTROL: continuous output and navigation technology with refinement on-line. In: Proceedings of 1998 ACM SIGMOD International Conference on Management of Data, pp. 567–569 (1998)
Chen, S., Gibbons, P.B., Nath, S.: PR-join: a non-blocking join achieving higher early result rate with statistical guarantees. In: Proceedings of 2010 ACM SIGMOD International Conference on Management of Data, pp. 147–158 (2010)
Cheng, Y., Qin, C., Rusu, F.: GLADE: big data analytics made easy. In: Proceedings of 2012 ACM SIGMOD International Conference on Management of Data, pp. 697–700 (2012)
Cochran, W.G.: Sampling Techniques. Wiley, New York (1977)
Cohen, S.: User-defined aggregate functions: bridging theory and practice. In: Proceedings of 2006 ACM SIGMOD International Conference on Management of Data, pp. 49–60 (2006)
Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., Elmeleegy, K., Sears, R.: MapReduce online. In: Proceedings of 2010 USENIX Conference on Networked Systems Design and Implementation, pp. 21–32 (2010)
Cormode, G., Garofalakis, M.N., Haas, P.J., Jermaine, C.: Synopses for massive data: samples, histograms, wavelets, sketches. Found. Trends® Databases 4(1–3), 1–294 (2012)
Dobra, A., Jermaine, C., Rusu, F., Xu, F.: Turbo-charging estimate convergence in DBO. Proc. VLDB Endow. 2(1), 419–430 (2009)
Feng, X., Kumar, A., Recht, B., Ré, C.: Towards a unified architecture for in-RDBMS analytics. In: Proceedings of 2012 ACM SIGMOD International Conference on Management of Data, pp. 325–336 (2012)
Garofalakis, M.N., Gibbon, P.B.: Approximate query processing: taming the TeraBytes. In: Proceedings of 2001 VLDB International Conference on Very Large Databases (2001)
Haas, P.J.: Large-sample and deterministic confidence intervals for online aggregation. In: Proceedings of 1997 SSDBM International Conference on Scientific and Statistical Database Management, pp. 51–63 (1997)
Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. In: Proceedings of 1999 ACM SIGMOD International Conference on Management of Data, pp. 287–298 (1999)
Hadoop: http://hadoop.apache.org/. Accessed July 2011
Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. In: Proceedings of 1997 ACM SIGMOD International Conference on Management of Data, pp. 171–182 (1997)
Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. SIGMOD Rec. 26(2), 171–182 (1997)
Jermaine, C., Arumugam, S., Pol, A., Dobra, A.: Scalable approximate query processing with the DBO engine. In: Proceedings of 2007 ACM SIGMOD International Conference on Management of Data, pp. 725–736 (2007)
Jermaine, C., Dobra, A., Arumugam, S., Joshi, S., Pol, A.: The sort-merge-shrink join. ACM TODS 31(4) (2006)
Jermaine, C., Dobra, A., Pol, A., Joshi, S.: Online estimation for subset-based SQL queries. In: Proceedings of 2005 VLDB International Conference on Very Large Databases, pp. 745–756 (2005)
Laptev, N., Zeng, K., Zaniolo, C.: Early accurate results for advanced analytics on MapReduce. Proc. VLDB Endow. 5(10), 1028–1039 (2012)
Luo, G., Ellmann, C.J., Haas, P.J., Naughton, J.F.: A scalable hash ripple join algorithm. In: Proceedings of 2002 ACM SIGMOD International Conference on Management of Data, pp. 252–262 (2002)
Olken, F.: Random sampling from databases. Ph.D. thesis, UC Berkeley (1993)
Pansare, N., Borkar, V.R., Jermaine, C., Condie, T.: Online aggregation for large MapReduce jobs. Proc. VLDB Endow. 4(11), 1135–1145 (2011)
Rowe, L.A., Stonebraker, M.: The POSTGRES data model. In: Proceedings of 1987 VLDB International Conference on Very Large Databases, pp. 83–96 (1987)
Rusu, F., Dobra, A.: GLADE: a scalable framework for efficient analytics. Oper. Syst. Rev. 46(1), 12–18 (2012)
Rusu, F., Xu, F., Perez, L.L., Wu, M., Jampani, R., Jermaine, C., Dobra, A.: The DBO database system. In: Proceedings of 2008 ACM SIGMOD International Conference on Management of Data, pp. 1223–1226 (2008)
TPC-H: http://www.tpc.org/tpch/. Accessed February 2012
Wang, H., Zaniolo, C.: Using SQL to build new aggregates and extenders for object-relational systems. In: Proceedings of 2000 VLDB International Conference on Very Large Databases, pp. 166–175 (2000)
Wu, M., Jermaine, C.: A Bayesian method for guessing the extreme values in a data set. In: Proceedings of 2007 VLDB International Conference on Very Large Databases, pp. 471–482 (2007)
Wu, S., Jiang, S., Ooi, B.C., Tan, K.-L.: Distributed online aggregation. Proc. VLDB Endow. 2(1), 443–454 (2009)
Wu, S., Ooi, B.C., Tan, K.-L.: Continuous sampling for online aggregation over multiple queries. In: Proceedings of 2010 ACM SIGMOD International Conference on Management of Data, pp. 651–662 (2010)
Xu, F., Jermaine, C., Dobra, A.: Confidence bounds for sampling-based GROUP BY estimates. ACM TODS 33(3) (2008)
Acknowledgements
This work was supported in part by a gift from LogicBlox.
Author information
Authors and Affiliations
Corresponding author
Additional information
Communicated by Feifei Li and Suman Nath.
Rights and permissions
About this article
Cite this article
Qin, C., Rusu, F. PF-OLA: a high-performance framework for parallel online aggregation. Distrib Parallel Databases 32, 337–375 (2014). https://doi.org/10.1007/s10619-013-7132-8
Published:
Issue Date:
DOI: https://doi.org/10.1007/s10619-013-7132-8