Abstract
In Zeiten wachsender Datenbankgrößen ist es unumgänglich, Anfragen näherungsweise auszuwerten um schnelle Antworten zu erhalten. Dieser Artikel stellt verschiedene Methoden vor, dieses Ziel zu erreichen, und wendet sich anschließend dem Sampling zu, welches mit Hilfe einer Stichprobe schnell zu adäquaten Ergebnissen führt. Enthalten Datenbankanfragen Verbund- oder Gruppierungsoperationen, so sinkt die Genauigkeit vieler Sampling-Verfahren sehr stark; insbesondere werden vor allem kleine Gruppen nicht erkannt. Dieser Artikel befasst sich mit hierarchischen gruppenbasiertem Sampling, welches Sampling, Gruppierung und Verbundoperationen kombiniert.
Abstract
In times of increasing database sizes it is crucial to process queries approximately in order to obtain answers quickly. This article introduces several methods for achieving this goal and afterwards focuses on sampling, yielding appropriate results by using only a subset of the actual data. If database queries contain join or group-by operations, the accuracy of many sampling methods drops significantly; especially small groups are not recognized. This article is concerned with hierarchical group-based sampling, which combines sampling, grouping and joins.
Similar content being viewed by others
Literatur
Acharya S, Gibbons PB, Poosala V (2000) Congressional Samples for Approximate Answering of Group-By Queries. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp 487–498
Acharya S, Gibbons PB, Poosala V, Ramaswamy S (1999) Join synopses for approximate query answering. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp 275–286
Babcock B, Chaudhuri S, Das G (2003) Dynamic sample selection for approximate query processing. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp 539–550
Barbará D, DuMouchel W, Faloutsos C, Haas PJ, Hellerstein JM, Ioannidis YE, Jagadish HV, Johnson T, Ng RT, Poosala V, Ross KA, Sevcik KC (1997) The New Jersey Data Reduction Report. IEEE Data Eng Bull 20(4):3–45
Chakrabarti K, Garofalakis MN, Rastogi R, Shim K (2000) Approximate Query Processing Using Wavelets. In: Proceedings of 26th International Conference on Very Large Data Bases, VLDB 2000, pp 111–122
Chaudhuri S, Das G, Datar M, Motwani R, Narasayya VR (2001) Overcoming Limitations of Sampling for Aggregation Queries. In: 17th International Conference on Data Engineering (ICDE’ 01), pp 534–544, April
Chaudhuri S, Das G, Srivastava U (2004) Effective use of block-level sampling in statistics estimation. In: Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM Press, pp 287–298
Chaudhuri S, Motwani R, Narasayya V (1999) On Random Sampling over Joins. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp 263–274
Ganti V, Lee M-L, Ramakrishnan R (2000) ICICLES: Self-Tuning Samples for Approximate Query Answering. In: The VLDB Journal, pp 176–187
Gibbons PB, Matias Y (1998) New sampling-based summary statistics for improving approximate query answers. In: Proceedings of the 1998 ACM SIGMOD international conference on Management of data. ACM Press, pp 331–342
Gryz J, Guo J, Liu L, Zuzarte C (2004) Query sampling in DB2 Universal Database. In: Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM Press, pp 839–843
Haas PJ, Hellerstein JM (1999) Ripple Joins for Online Aggregation. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp 287–298
Haas PJ, König C (2004) A bi-level Bernoulli scheme for database sampling. In: Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM Press, pp 275–286
Hellerstein JM, Haas PJ, Wang HJ (1997) Online Aggregation. In: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pp 171–182
Ioannidis YE, Poosala V (1995) Histogram-Based Solutions to Diverse Database Estimation Problems. Data Engineering Bulletin 18(3):10–18
Jermaine C (2003) Robust Estimation With Sampling and Approximate Pre-Aggregation. In: Proceedings of 29th International Conference on Very Large Data Bases, VLDB 2003, September 9–12, 2003, Berlin, Germany, Los Altos, CA 94022, USA, 2003. Morgan Kaufmann Publishers, pp 886–897
Jermaine C, Pol A, Arumugam S (2004) Online maintenance of very large random samples. In: Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM Press, pp 299–310
Olken F (1993) Random Sampling from Databases. Thesis LBL-32883, Information and Computing Sciences Division, Lawrence Berkeley National Laboratory, Mailstop 50B-3238, 1 Cyclotron Road, Berkeley, California 94720, USA
Poosala V, Ganti V, Ioannidis YE (1999) Approximate Query Answering using Histograms. IEEE Data Eng Bull 22(4):5–14
Transaction Processing Performace Council (1998) TPC-D Benchmark Version 2.1, February. http://www.tpc.org
Vitter JS (1985) Random Sampling with a Reservoir. ACM Transactions on Mathematical Software 11(1):37–57, March
Vitter JS, Wang M (1999) Approximate computation of multidimensional aggregates of sparse data using wavelets. In: Proceedings of the 1999 ACM SIGMOD international conference on Management of data. ACM Press, pp 193–204
Author information
Authors and Affiliations
Corresponding author
Additional information
CR Subject Classification
H.2.4,H.4.2,H.3.3
Rights and permissions
About this article
Cite this article
Gemulla, R., Berthold, H. & Lehner, W. Hierarchisches gruppenbasiertes Sampling. Informatik Forsch. Entw. 20, 45–56 (2005). https://doi.org/10.1007/s00450-004-0175-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00450-004-0175-3