More Web Proxy on the site http://driver.im/

Article

On computing correlated aggregates over continual data streams

Authors:

Johannes Gehrke,

Divesh SrivastavaAuthors Info & Claims

SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data

Pages 13 - 24

https://doi.org/10.1145/375663.375665

Published: 01 May 2001 Publication History

Abstract

In many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such applications maintain basic aggregates such as running extrema values (MIN, MAX), averages, standard deviations, etc., that can be computed over data streams with limited space in a straightforward way. However, many applications require knowledge of more complex aggregates relating different attributes, so-called correlated aggregates. As an example, one might be interested in computing the percentage of international phone calls that are longer than the average duration of a domestic phone call. Exact computation of this aggregate requires multiple passes over the data stream, which is infeasible.

We propose single-pass techniques for approximate computation of correlated aggregates over both landmark and sliding window views of a data stream of tuples, using a very limited amount of space. We consider both the case where the independent aggregate (average duration in the example above) is an extrema value and the case where it is an average value, with any standard aggregate as the dependent aggregate; these can be used as building blocks for more sophisticated aggregates. We present an extensive experimental study based on some real and a wide variety of synthetic data sets to demonstrate the accuracy of our techniques. We show that this effectiveness is explained by the fact that our techniques exploit monotonicity and convergence properties of aggregates over data streams.

References

[1]

N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. JCSS: Journal of Computer and System Sciences, 58, 1999.]]

Digital Library

[2]

K. Alsabti, S.Ranka, and V. Singh. A one-pass algorithm for accurately estimating quantiles for disk-resident data. In VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 346-355, 1997.]]

Digital Library

[3]

R. Avnur and J. M. kHellerstein. Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, pages 261-272, 2000.]]

Digital Library

[4]

D. Chatziantoniou. Ad hoc OLAP: Expression and evaluation. In Proceedings of the IEEE International Conference on Data Engineering, 1999.]]

Digital Library

[5]

D. Chatziantoniou, M.Akinde, T. Johnson, and S.Kim. The MD-join: An operator for complex OLAP. In Proceedings of the IEEE International Conference on Data Engineering, 2001]]

Digital Library

[6]

D. Chatziantoniou and K. A. Ross. Querying multiple features of groups in relational databases. In Proceedings of the International Conference on Very Large Databases, pages 295-306, 1996.]]

Digital Library

[7]

A. Delis, C. Faloutsos, and S. Ghandeharizadeh, editors. SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadephia, PJjennsylvania, USA. ACM Press, 1999.]]

[8]

P.Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, pages 71-80, Boston, Ma, August 2000. ACM.]]

Digital Library

[9]

J. Feigenbaum, R. Kannan, M. Strauss, and M.Viswanathan. An approximate L1-difference algorithm for massive data streams. In IEEE Symposium on Foundations of Computer Science (FOCS), 1999.]]

Digital Library

[10]

J. Feigenbaum, S. Kannan, M. Strauss, and M.Viswanathan. Testing and spot-checking of data streams. In Proceedings of the 11th ACM-SIAM symposium on Discrete Algorithms, 2000.]]

Digital Library

[11]

A. Feldmann, A. Gilbert, and W. Willinger. jData networks as cascades: Investigating the multifractal nature of internet wan traffic. In ACM SIGCOMM, pages 42-55, 1998.]]

Digital Library

[12]

J.Fong and M. Strauss. An approximate L p-difference algorithm for massive data streams. In STACS: Annual Symposium on Theoretical Aspects of Computer Science, 2000.]]

Digital Library

[13]

J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. Boat-optimistic decision tree construction. In Delis et al. {7}, pages 169-180.]]

[14]

P. Gibbons, Y. Mattias, and V. Poosala. Fast Incremental Maintenance of Approximate Histograms. Proceedings of VLDB, Athens Greece, pages 466-475, Aug. 1997.]]

Digital Library

[15]

P.B. Gibbons and Y. Matias. Synopsis data structures for massive data sets. In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 909-910, N.Y., Jan. 17-19 1999. ACM-SIAM.]]

Digital Library

[16]

S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000.]]

Digital Library

[17]

P.J. Haas and J.M. Hellerstein. Ripple joins for online aggregation. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadephia, Pennsulvania, USA, pages 287-298, 19999.]]

Digital Library

[18]

J.M. Hellerstein, P.J. Haas, and H. Wang. Online aggregation. In J. Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA, pages 717-182. ACM Press, 1997.]]

Digital Library

[19]

M.R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report 1998-011, Digital Equipment Corporation, Systems Research Center, May, 1998.]]

[20]

G.S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In L.M. Haas and A. Tiwary, editors, SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA. pages 426-435. ACM Press, 1998.]]

Digital Library

[21]

G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In kDelis et al. {7}, pages 251- 262.]]

Digital Library

[22]

C. Olston and J. Widom. Offering a precision-performance tradeoff for aggregation queries over replicated data. In A. E. Abbadi, M. L.Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y. Whang, editors, VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 144-155. Morgan Kaufmann, 2000.]]

Digital Library

[23]

V.Raman, B. Raman, and J. M. Hellerstein. Online dynamic reordering for interactive data processing. In VLDB'99 Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, pages 709-720, 1999.]]

Digital Library

Cited By

KUZNETSOV SVELIKHOV PFu Q(2023)REAL-TIME ANALYTICS: BENEFITS, LIMITATIONS, AND TRADEOFFSПрограммирование10.31857/S0132347423010053(3-31)Online publication date: 1-Jan-2023
https://doi.org/10.31857/S0132347423010053
Moustakas TKolomvatsos K(2023)Cluster based similarity extraction upon distributed datasetsCluster Computing10.1007/s10586-023-04116-527:3(2917-2929)Online publication date: 25-Aug-2023
https://doi.org/10.1007/s10586-023-04116-5
Fountas PKolomvatsos KAnagnostopoulos C(2021)A Deep Learning Model for Data Synopses Management in Pervasive Computing ApplicationsIntelligent Computing10.1007/978-3-030-80126-7_44(619-636)Online publication date: 7-Jul-2021
https://doi.org/10.1007/978-3-030-80126-7_44
Show More Cited By

Index Terms

On computing correlated aggregates over continual data streams
1. Information systems
  1. Data management systems
    1. Database administration

Recommendations

On computing correlated aggregates over continual data streams

In many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such ...
Estimating statistical aggregates on probabilistic data streams

The probabilistic stream model was introduced by Jayram et al. [2007]. It is a generalization of the data stream model that is suited to handling probabilistic data, where each item of the stream represents a probability distribution over a set of ...
Time-decayed correlated aggregates over data streams
Best of SDM'09

Data stream analysis frequently relies on identifying correlations and posing conditional queries on the data after it has been seen. Correlated aggregates form an important example of such queries, which ask for an aggregation over one dimension of ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Conferences

SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data

May 2001

630 pages

ISBN:1581133324

DOI:10.1145/375663

ACM SIGMOD Record Volume 30, Issue 2
June 2001
625 pages
ISSN:0163-5808
DOI:10.1145/376284
Editors:
Timos Sellis
National Technical Univ. of Athens
,
Sharad Mehrotra
Univ. of California at Irvine
Issue’s Table of Contents

Copyright © 2001 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 May 2001

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Qualifiers

Article

Conference

SIGMOD/PODS01

Sponsor:

SIGMOD

SIGMOD/PODS01: ACM SIGMOD International Conference on Management of Data

May 21 - 24, 2001

California, Santa Barbara, USA

Acceptance Rates

SIGMOD '01 Paper Acceptance Rate 44 of 293 submissions, 15%;

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

208
Total Citations
View Citations
1,160
Total Downloads

Downloads (Last 12 months)10
Downloads (Last 6 weeks)1

Reflects downloads up to 13 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

KUZNETSOV SVELIKHOV PFu Q(2023)REAL-TIME ANALYTICS: BENEFITS, LIMITATIONS, AND TRADEOFFSПрограммирование10.31857/S0132347423010053(3-31)Online publication date: 1-Jan-2023
https://doi.org/10.31857/S0132347423010053
Moustakas TKolomvatsos K(2023)Cluster based similarity extraction upon distributed datasetsCluster Computing10.1007/s10586-023-04116-527:3(2917-2929)Online publication date: 25-Aug-2023
https://doi.org/10.1007/s10586-023-04116-5
Fountas PKolomvatsos KAnagnostopoulos C(2021)A Deep Learning Model for Data Synopses Management in Pervasive Computing ApplicationsIntelligent Computing10.1007/978-3-030-80126-7_44(619-636)Online publication date: 7-Jul-2021
https://doi.org/10.1007/978-3-030-80126-7_44
Kolomvatsos K(2020)A Proactive Uncertainty Driven Model for Data Synopses Management in Pervasive Applications2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)10.1109/HPCC-SmartCity-DSS50907.2020.00164(1266-1273)Online publication date: Dec-2020
https://doi.org/10.1109/HPCC-SmartCity-DSS50907.2020.00164
Brown PDasu TKanza YSrivastava D(2019)From Rocks to PebblesACM Transactions on Spatial Algorithms and Systems10.1145/33296775:3(1-38)Online publication date: 12-Aug-2019
https://dl.acm.org/doi/10.1145/3329677
Wingerath WRitter NGessert FWingerath WRitter NGessert F(2019)Data Stream ManagementReal-Time & Stream Data Management10.1007/978-3-030-10555-6_4(43-55)Online publication date: 3-Jan-2019
https://doi.org/10.1007/978-3-030-10555-6_4
Cordeiro MSarmento RBrazdil PGama J(2018)Evolving Networks and Social Network Analysis Methods and TechniquesSocial Media and Journalism - Trends, Connections, Implications10.5772/intechopen.79041Online publication date: 31-Oct-2018
https://doi.org/10.5772/intechopen.79041
Cappellari PRoantree MChun S(2018)Optimizing data stream processing for large‐scale applicationsSoftware: Practice and Experience10.1002/spe.259648:9(1607-1641)Online publication date: 19-Jun-2018
https://doi.org/10.1002/spe.2596
Brown PDasu TKanza YSrivastava D(2017)Mind the Gaps (and Bumps)Proceedings of the 8th ACM SIGSPATIAL Workshop on GeoStreaming10.1145/3148160.3148165(29-38)Online publication date: 7-Nov-2017
https://dl.acm.org/doi/10.1145/3148160.3148165
Lecue FWu J(2017)Explaining and predicting abnormal expenses at large scale using knowledge graph based reasoningWeb Semantics: Science, Services and Agents on the World Wide Web10.1016/j.websem.2017.05.00344:C(89-103)Online publication date: 1-May-2017
https://dl.acm.org/doi/10.1016/j.websem.2017.05.003
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Table of Contents