[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/564691.564741acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

Dynamic multidimensional histograms

Published: 03 June 2002 Publication History

Abstract

Histograms are a concise and flexible way to construct summary structures for large data sets. They have attracted a lot of attention in database research due to their utility in many areas, including query optimization, and approximate query answering. They are also a basic tool for data visualization and analysis.In this paper, we present a formal study of dynamic multidimensional histogram structures over continuous data streams. At the heart of our proposal is the use of a dynamic summary data structure (vastly different from a histogram) maintaining a succinct approximation of the data distribution of the underlying continuous stream. On demand, an accurate histogram is derived from this dynamic data structure. We propose algorithms for extracting such an accurate histogram and we analyze their behavior and tradeoffs. The proposed algorithms are able to provide approximate guarantees about the quality of the estimation of the histograms they extract.We complement our analytical results with a thorough experimental evaluation using real data sets.

References

[1]
A. Aboulnaga and S. Chaudhuri. Self Tuning Histograms: Building Histograms Without Looking at Data. Proceedings of ACM SIGMOD, pages 181-192, June 1999.
[2]
S. Acharya, P. Gibbons, V. Poosala, and S. Ramaswamy. The Aqua Approximate Query Answering System. Proceedings of ACM SIGMOD, Philladephia PA, pages 574-578, June 1999.
[3]
B. Babcock, M. Datar, and R. Motwani. Sampling From a Moving Window Over Streaming Data. Proceedings of the Symposium on Discrete Algorithms, 2002.
[4]
S. Babu and J. Widom. Contineous Queries Over Data Streams. SIGMOD Record, Sept. 2001.
[5]
N. Bruno, L. Gravano, and S. Chaudhuri. STHoles: A Workload Aware Multidimensional Histogram. Proceedings of ACM SIGMOD, May 2001.
[6]
M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining Stream Statistics over Sliding Windows. Proceedings of the Symposium on Discrete Algorithms, 2002.
[7]
P. Gibbons, Y. Mattias, and V. Poosala. Fast Incremental Maintenance of Approximate Histograms. Proceedings of VLDB, Athens Greece, pages 466-475, Aug. 1997.
[8]
A. Gilbert, S. Guha, P. Indyk, Y. Kotadis, S. Muthukrishnan, and M. Strauss. Fast, small-space algorithms for approximate histogram maintanance. Proc. STOC, 2002.
[9]
A. Gilbert, Y. Kotadis, S. Muthukrishnan, and M. Strauss. Quicksand: quick summary and analysis of network data. DIMACS tech report.
[10]
A. Gilbert, Y. Kotadis, S. Muthukrishnan, and M. Strauss. Surfing Wavelets on Streams: One Pass Summaries for Approximate Aggregate Queries. Proceedings of VLDB, pages 79-88, 2001.
[11]
J. Gray, A. Bosworth, A. Leyman, and H. Pirahesh. Data Cube: A Relational Aggregation Operator Generalizing Group-by, Cross Tab and Sub Total. Proceedings of ICDE, pages 152-159, May 1996.
[12]
M. Greenwald and S. Khanna. Space-Efficient Online Computation of Quantile Summaries. Proceedings of ACM SIGMOD, Santa Barbara, May 2001.
[13]
S. Guha and N. Koudas. Approximating a Data Stream for Querying and Estimation: Algorithms and Performance Evaluation. ICDE, Feb. 2002.
[14]
S. Guha, N. Koudas, and K. Shim. Data Streams and Histograms. Symposium on the Theory of Computing (STOC), July 2001.
[15]
S. Guha, N. Mishra, R. Motwani, and L. O'callahan. Clustering Data Streams. Foundations of Computer Science (FOCS), Sept. 2000.
[16]
D. Gunopulos, G. Kollios, V. Tsotras, and C. Domeniconi. Approximating Multi-Dimensional Aggregate Range Queries Over Real Attributes. Proceedings of ACM SIGMOD, June 2000.
[17]
P. Haas, J. Naughton, S. Seshadri, and L. Stokes. Sampling Based Estimation Of the Number Of Distinct Values Of An Attribute. Proceedings of VLDB, pages 311-322, June 1995.
[18]
P. Haas, J. Naughton, S. Seshadri, and A. Swami. Fixed Precision Estimation Of Join Selectivity. Proceedings of ACM PODS, pages 190-201, June 1993.
[19]
P. Haas and A. Swami. Sequantial Sampling Procedures for Query Size Estimation. Proceedings of ACM SIGMOD, San Diego, CA, pages 341-350, June 1992.
[20]
P. Indyk. Stable Distributions, Pseudorandom Generators, Embeddings and Data Stream Computation". Foundations of Computer Science (FOCS), Sept. 2000.
[21]
Y. Ioannidis and V. Poosala. Balancing Histogram Optimality and Practicality for Query Result Size Estimation. Proceedings of ACM SIGMOD, San Hose, CA, pages 233-244, June 1995.
[22]
H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal Histograms with Quality Guarantees. Proceedings of VLDB, pages 275-286, Aug. 1998.
[23]
S. Khanna, S. Muthukrishnan, and S. Skiena. Efficient array partitioning. Proc. ICALP, 1997.
[24]
R. P. Kooi. The Optimization of Queries in Relational Databases. PhD Thesis, Case Western Reserve University, Sept. 1980.
[25]
J. Lee, D. Kim, and C. Chung. Multidimensional Selectivity Estimation Using Compressed Histogram Information. Proceedings of ACM SIGMOD, pages 205-214, June 1999.
[26]
S. Madden and M. Franklin. Fjording the Stream: An Architecture for Queries Over Streaming Sensor Data. Proceedings of ICDE, Feb. 2002.
[27]
Y. Mattias, J. S. Vitter, and M. Wang. Wavelet-Based Histograms for Selectivity Estimation. Proc. of the 1998 ACM SIGMOD Intern. Conf. on Management of Data, June 1998.
[28]
Y. Mattias, J. S. Vitter, and M. Wang. Dynamic Maintenance of Wavelet-Based Histograms. Proceedings of the International Conference on Very Large Databases, (VLDB), Cairo, Egypt, pages 101-111, Sept. 2000.
[29]
S. Muthukrishnan, V. Poosala, and T. Suel. Partitioning two dimensional arrays: algorithms, complexity and applications. Proc. Intl Conf. Database Theory, 1998.
[30]
V. Poosala and Y. Ioannidis. Selectivity Estimation Without the Attribute Value Independence Assumption. Proceedings of VLDB, Athens Greece, pages 486-495, Aug. 1997.
[31]
V. Poosala, Y. Ioannidis, P. Haas, and E. Shekita. Improved Histograms for Selectivity Estimation of Range Predicates. Proceedings of ACM SIGMOD, Montreal Canada, pages 294-305, June 1996.
[32]
G. Singh, S. Rajagopalan, and B. Lindsay. Random Sampling Techniques For Space Efficient Computation Of Large Datasets. Proceedings of SIGMOD, Philladelphia PA, pages 251-262, June 1999.
[33]
J. Vitter and M. Wang. Approximate computation of multidimensional aggregates on sparse data using wavelets. Proceedings of SIGMOD, pages 193-204, June 1999.
[34]
J. Vitter, M. Wang, and B. R. Iyer. Data Cube Approximation and Histograms via Wavelets. Proc. of the 1998 ACM CIKM Intern. Conf. on Information and Knowledge Management, November 1998.
[35]
Y. Wu, D. Agrawal, and A. E. Abbadi. Applying the Golden Rule of Sampling for Selectivity Estimation. Proceedings of ACM SIGMOD, May 2001.

Cited By

View all
  • (2024)Duet: Efficient and Scalable Hybrid Neural Relation Understanding2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00012(56-69)Online publication date: 13-May-2024
  • (2023)JanusAQP: Efficient Partition Tree Maintenance for Dynamic Approximate Query Processing2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00050(572-584)Online publication date: Apr-2023
  • (2023)iDMS: An Index-Based Framework for Tracking Distributed Multidimensional Data Streams2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE)10.1109/CSCE60160.2023.00231(1381-1388)Online publication date: 24-Jul-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences
SIGMOD '02: Proceedings of the 2002 ACM SIGMOD international conference on Management of data
June 2002
654 pages
ISBN:1581134975
DOI:10.1145/564691
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 03 June 2002

Permissions

Request permissions for this article.

Check for updates

Qualifiers

  • Article

Conference

SIGMOD/PODS02

Acceptance Rates

SIGMOD '02 Paper Acceptance Rate 42 of 240 submissions, 18%;
Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)26
  • Downloads (Last 6 weeks)5
Reflects downloads up to 13 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Duet: Efficient and Scalable Hybrid Neural Relation Understanding2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00012(56-69)Online publication date: 13-May-2024
  • (2023)JanusAQP: Efficient Partition Tree Maintenance for Dynamic Approximate Query Processing2023 IEEE 39th International Conference on Data Engineering (ICDE)10.1109/ICDE55515.2023.00050(572-584)Online publication date: Apr-2023
  • (2023)iDMS: An Index-Based Framework for Tracking Distributed Multidimensional Data Streams2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE)10.1109/CSCE60160.2023.00231(1381-1388)Online publication date: 24-Jul-2023
  • (2022)Near-optimal bounds for testing histogram distributionsProceedings of the 36th International Conference on Neural Information Processing Systems10.5555/3600270.3602561(31599-31611)Online publication date: 28-Nov-2022
  • (2022)Box queries over multi-dimensional streamsInformation Systems10.1016/j.is.2022.102086109:COnline publication date: 1-Nov-2022
  • (2021)AstridProceedings of the VLDB Endowment10.14778/3436905.343690714:4(471-484)Online publication date: 22-Feb-2021
  • (2021)Box queries over multi-dimensional streamsProceedings of the 15th ACM International Conference on Distributed and Event-based Systems10.1145/3465480.3466925(90-101)Online publication date: 28-Jun-2021
  • (2021)Weighted Distinct Sampling: Cardinality Estimation for SPJ QueriesProceedings of the 2021 International Conference on Management of Data10.1145/3448016.3452821(1465-1477)Online publication date: 9-Jun-2021
  • (2020)Deep Learning Models for Selectivity Estimation of Multi-Attribute QueriesProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3389741(1035-1050)Online publication date: 11-Jun-2020
  • (2020)QuickSel: Quick Selectivity Learning with Mixture ModelsProceedings of the 2020 ACM SIGMOD International Conference on Management of Data10.1145/3318464.3389727(1017-1033)Online publication date: 11-Jun-2020
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media