Hierarchisches gruppenbasiertes Sampling

Rainer Gemulla¹,
Henrike Berthold¹ &
Wolfgang Lehner¹

59 Accesses
Explore all metrics

Abstract

In Zeiten wachsender Datenbankgrößen ist es unumgänglich, Anfragen näherungsweise auszuwerten um schnelle Antworten zu erhalten. Dieser Artikel stellt verschiedene Methoden vor, dieses Ziel zu erreichen, und wendet sich anschließend dem Sampling zu, welches mit Hilfe einer Stichprobe schnell zu adäquaten Ergebnissen führt. Enthalten Datenbankanfragen Verbund- oder Gruppierungsoperationen, so sinkt die Genauigkeit vieler Sampling-Verfahren sehr stark; insbesondere werden vor allem kleine Gruppen nicht erkannt. Dieser Artikel befasst sich mit hierarchischen gruppenbasiertem Sampling, welches Sampling, Gruppierung und Verbundoperationen kombiniert.

Abstract

In times of increasing database sizes it is crucial to process queries approximately in order to obtain answers quickly. This article introduces several methods for achieving this goal and afterwards focuses on sampling, yielding appropriate results by using only a subset of the actual data. If database queries contain join or group-by operations, the accuracy of many sampling methods drops significantly; especially small groups are not recognized. This article is concerned with hierarchical group-based sampling, which combines sampling, grouping and joins.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Subscribe and save

Springer+ Basic

£29.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price includes VAT (United Kingdom)

Instant access to the full article PDF.

Institutional subscriptions

Qualitative und quantitative Befragungs- und Auswertungssoftware sowie Datenbanken mit aggregierten Statistiken – Ein Überblick

Gruppenbildung: Clusteranalyse

8 Daten und Tarifierungsstatistiken

Literatur

Acharya S, Gibbons PB, Poosala V (2000) Congressional Samples for Approximate Answering of Group-By Queries. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp 487–498
Acharya S, Gibbons PB, Poosala V, Ramaswamy S (1999) Join synopses for approximate query answering. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp 275–286
Babcock B, Chaudhuri S, Das G (2003) Dynamic sample selection for approximate query processing. In: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pp 539–550
Barbará D, DuMouchel W, Faloutsos C, Haas PJ, Hellerstein JM, Ioannidis YE, Jagadish HV, Johnson T, Ng RT, Poosala V, Ross KA, Sevcik KC (1997) The New Jersey Data Reduction Report. IEEE Data Eng Bull 20(4):3–45
Chakrabarti K, Garofalakis MN, Rastogi R, Shim K (2000) Approximate Query Processing Using Wavelets. In: Proceedings of 26th International Conference on Very Large Data Bases, VLDB 2000, pp 111–122
Chaudhuri S, Das G, Datar M, Motwani R, Narasayya VR (2001) Overcoming Limitations of Sampling for Aggregation Queries. In: 17th International Conference on Data Engineering (ICDE’ 01), pp 534–544, April
Chaudhuri S, Das G, Srivastava U (2004) Effective use of block-level sampling in statistics estimation. In: Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM Press, pp 287–298
Chaudhuri S, Motwani R, Narasayya V (1999) On Random Sampling over Joins. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp 263–274
Ganti V, Lee M-L, Ramakrishnan R (2000) ICICLES: Self-Tuning Samples for Approximate Query Answering. In: The VLDB Journal, pp 176–187
Gibbons PB, Matias Y (1998) New sampling-based summary statistics for improving approximate query answers. In: Proceedings of the 1998 ACM SIGMOD international conference on Management of data. ACM Press, pp 331–342
Gryz J, Guo J, Liu L, Zuzarte C (2004) Query sampling in DB2 Universal Database. In: Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM Press, pp 839–843
Haas PJ, Hellerstein JM (1999) Ripple Joins for Online Aggregation. In: Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data, pp 287–298
Haas PJ, König C (2004) A bi-level Bernoulli scheme for database sampling. In: Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM Press, pp 275–286
Hellerstein JM, Haas PJ, Wang HJ (1997) Online Aggregation. In: Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, pp 171–182
Ioannidis YE, Poosala V (1995) Histogram-Based Solutions to Diverse Database Estimation Problems. Data Engineering Bulletin 18(3):10–18
Jermaine C (2003) Robust Estimation With Sampling and Approximate Pre-Aggregation. In: Proceedings of 29th International Conference on Very Large Data Bases, VLDB 2003, September 9–12, 2003, Berlin, Germany, Los Altos, CA 94022, USA, 2003. Morgan Kaufmann Publishers, pp 886–897
Jermaine C, Pol A, Arumugam S (2004) Online maintenance of very large random samples. In: Proceedings of the 2004 ACM SIGMOD international conference on Management of data. ACM Press, pp 299–310
Olken F (1993) Random Sampling from Databases. Thesis LBL-32883, Information and Computing Sciences Division, Lawrence Berkeley National Laboratory, Mailstop 50B-3238, 1 Cyclotron Road, Berkeley, California 94720, USA
Poosala V, Ganti V, Ioannidis YE (1999) Approximate Query Answering using Histograms. IEEE Data Eng Bull 22(4):5–14
Google Scholar
Transaction Processing Performace Council (1998) TPC-D Benchmark Version 2.1, February. http://www.tpc.org
Vitter JS (1985) Random Sampling with a Reservoir. ACM Transactions on Mathematical Software 11(1):37–57, March
Google Scholar
Vitter JS, Wang M (1999) Approximate computation of multidimensional aggregates of sparse data using wavelets. In: Proceedings of the 1999 ACM SIGMOD international conference on Management of data. ACM Press, pp 193–204

Download references

Author information

Authors and Affiliations

Database Technology Group, Technische Universität Dresden, 01062, Dresden
Rainer Gemulla, Henrike Berthold & Wolfgang Lehner

Authors

Rainer Gemulla
View author publications
You can also search for this author in PubMed Google Scholar
Henrike Berthold
View author publications
You can also search for this author in PubMed Google Scholar
Wolfgang Lehner
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Wolfgang Lehner.

Additional information

CR Subject Classification

H.2.4,H.4.2,H.3.3

Rights and permissions

Reprints and permissions

About this article

Cite this article

Gemulla, R., Berthold, H. & Lehner, W. Hierarchisches gruppenbasiertes Sampling. Informatik Forsch. Entw. 20, 45–56 (2005). https://doi.org/10.1007/s00450-004-0175-3

Download citation

Received: 21 July 2004
Accepted: 07 December 2004
Published: 15 March 2005
Issue Date: October 2005
DOI: https://doi.org/10.1007/s00450-004-0175-3

Abstract

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Qualitative und quantitative Befragungs- und Auswertungssoftware sowie Datenbanken mit aggregierten Statistiken – Ein Überblick

Gruppenbildung: Clusteranalyse

8 Daten und Tarifierungsstatistiken

Literatur

Author information

Authors and Affiliations

Corresponding author

Additional information

CR Subject Classification

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Hierarchisches gruppenbasiertes Sampling

Abstract

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Qualitative und quantitative Befragungs- und Auswertungssoftware sowie Datenbanken mit aggregierten Statistiken – Ein Überblick

Gruppenbildung: Clusteranalyse

8 Daten und Tarifierungsstatistiken

Literatur

Author information

Authors and Affiliations

Corresponding author

Additional information

CR Subject Classification

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now