[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/375663.375665acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

On computing correlated aggregates over continual data streams

Published: 01 May 2001 Publication History

Abstract

In many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such applications maintain basic aggregates such as running extrema values (MIN, MAX), averages, standard deviations, etc., that can be computed over data streams with limited space in a straightforward way. However, many applications require knowledge of more complex aggregates relating different attributes, so-called correlated aggregates. As an example, one might be interested in computing the percentage of international phone calls that are longer than the average duration of a domestic phone call. Exact computation of this aggregate requires multiple passes over the data stream, which is infeasible.
We propose single-pass techniques for approximate computation of correlated aggregates over both landmark and sliding window views of a data stream of tuples, using a very limited amount of space. We consider both the case where the independent aggregate (average duration in the example above) is an extrema value and the case where it is an average value, with any standard aggregate as the dependent aggregate; these can be used as building blocks for more sophisticated aggregates. We present an extensive experimental study based on some real and a wide variety of synthetic data sets to demonstrate the accuracy of our techniques. We show that this effectiveness is explained by the fact that our techniques exploit monotonicity and convergence properties of aggregates over data streams.

References

[1]
N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. JCSS: Journal of Computer and System Sciences, 58, 1999.]]
[2]
K. Alsabti, S.Ranka, and V. Singh. A one-pass algorithm for accurately estimating quantiles for disk-resident data. In VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 346-355, 1997.]]
[3]
R. Avnur and J. M. kHellerstein. Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, pages 261-272, 2000.]]
[4]
D. Chatziantoniou. Ad hoc OLAP: Expression and evaluation. In Proceedings of the IEEE International Conference on Data Engineering, 1999.]]
[5]
D. Chatziantoniou, M.Akinde, T. Johnson, and S.Kim. The MD-join: An operator for complex OLAP. In Proceedings of the IEEE International Conference on Data Engineering, 2001]]
[6]
D. Chatziantoniou and K. A. Ross. Querying multiple features of groups in relational databases. In Proceedings of the International Conference on Very Large Databases, pages 295-306, 1996.]]
[7]
A. Delis, C. Faloutsos, and S. Ghandeharizadeh, editors. SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadephia, PJjennsylvania, USA. ACM Press, 1999.]]
[8]
P.Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, pages 71-80, Boston, Ma, August 2000. ACM.]]
[9]
J. Feigenbaum, R. Kannan, M. Strauss, and M.Viswanathan. An approximate L1-difference algorithm for massive data streams. In IEEE Symposium on Foundations of Computer Science (FOCS), 1999.]]
[10]
J. Feigenbaum, S. Kannan, M. Strauss, and M.Viswanathan. Testing and spot-checking of data streams. In Proceedings of the 11th ACM-SIAM symposium on Discrete Algorithms, 2000.]]
[11]
A. Feldmann, A. Gilbert, and W. Willinger. jData networks as cascades: Investigating the multifractal nature of internet wan traffic. In ACM SIGCOMM, pages 42-55, 1998.]]
[12]
J.Fong and M. Strauss. An approximate L p-difference algorithm for massive data streams. In STACS: Annual Symposium on Theoretical Aspects of Computer Science, 2000.]]
[13]
J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. Boat-optimistic decision tree construction. In Delis et al. {7}, pages 169-180.]]
[14]
P. Gibbons, Y. Mattias, and V. Poosala. Fast Incremental Maintenance of Approximate Histograms. Proceedings of VLDB, Athens Greece, pages 466-475, Aug. 1997.]]
[15]
P.B. Gibbons and Y. Matias. Synopsis data structures for massive data sets. In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 909-910, N.Y., Jan. 17-19 1999. ACM-SIAM.]]
[16]
S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000.]]
[17]
P.J. Haas and J.M. Hellerstein. Ripple joins for online aggregation. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadephia, Pennsulvania, USA, pages 287-298, 19999.]]
[18]
J.M. Hellerstein, P.J. Haas, and H. Wang. Online aggregation. In J. Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA, pages 717-182. ACM Press, 1997.]]
[19]
M.R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report 1998-011, Digital Equipment Corporation, Systems Research Center, May, 1998.]]
[20]
G.S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In L.M. Haas and A. Tiwary, editors, SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA. pages 426-435. ACM Press, 1998.]]
[21]
G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In kDelis et al. {7}, pages 251- 262.]]
[22]
C. Olston and J. Widom. Offering a precision-performance tradeoff for aggregation queries over replicated data. In A. E. Abbadi, M. L.Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y. Whang, editors, VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 144-155. Morgan Kaufmann, 2000.]]
[23]
V.Raman, B. Raman, and J. M. Hellerstein. Online dynamic reordering for interactive data processing. In VLDB'99 Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, pages 709-720, 1999.]]

Cited By

View all
  • (2023)REAL-TIME ANALYTICS: BENEFITS, LIMITATIONS, AND TRADEOFFSПрограммирование10.31857/S0132347423010053(3-31)Online publication date: 1-Jan-2023
  • (2023)Cluster based similarity extraction upon distributed datasetsCluster Computing10.1007/s10586-023-04116-527:3(2917-2929)Online publication date: 25-Aug-2023
  • (2021)A Deep Learning Model for Data Synopses Management in Pervasive Computing ApplicationsIntelligent Computing10.1007/978-3-030-80126-7_44(619-636)Online publication date: 7-Jul-2021
  • Show More Cited By

Index Terms

  1. On computing correlated aggregates over continual data streams

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data
    May 2001
    630 pages
    ISBN:1581133324
    DOI:10.1145/375663
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 May 2001

    Permissions

    Request permissions for this article.

    Check for updates

    Qualifiers

    • Article

    Conference

    SIGMOD/PODS01
    Sponsor:

    Acceptance Rates

    SIGMOD '01 Paper Acceptance Rate 44 of 293 submissions, 15%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)10
    • Downloads (Last 6 weeks)1
    Reflects downloads up to 13 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)REAL-TIME ANALYTICS: BENEFITS, LIMITATIONS, AND TRADEOFFSПрограммирование10.31857/S0132347423010053(3-31)Online publication date: 1-Jan-2023
    • (2023)Cluster based similarity extraction upon distributed datasetsCluster Computing10.1007/s10586-023-04116-527:3(2917-2929)Online publication date: 25-Aug-2023
    • (2021)A Deep Learning Model for Data Synopses Management in Pervasive Computing ApplicationsIntelligent Computing10.1007/978-3-030-80126-7_44(619-636)Online publication date: 7-Jul-2021
    • (2020)A Proactive Uncertainty Driven Model for Data Synopses Management in Pervasive Applications2020 IEEE 22nd International Conference on High Performance Computing and Communications; IEEE 18th International Conference on Smart City; IEEE 6th International Conference on Data Science and Systems (HPCC/SmartCity/DSS)10.1109/HPCC-SmartCity-DSS50907.2020.00164(1266-1273)Online publication date: Dec-2020
    • (2019)From Rocks to PebblesACM Transactions on Spatial Algorithms and Systems10.1145/33296775:3(1-38)Online publication date: 12-Aug-2019
    • (2019)Data Stream ManagementReal-Time & Stream Data Management10.1007/978-3-030-10555-6_4(43-55)Online publication date: 3-Jan-2019
    • (2018)Evolving Networks and Social Network Analysis Methods and TechniquesSocial Media and Journalism - Trends, Connections, Implications10.5772/intechopen.79041Online publication date: 31-Oct-2018
    • (2018)Optimizing data stream processing for large‐scale applicationsSoftware: Practice and Experience10.1002/spe.259648:9(1607-1641)Online publication date: 19-Jun-2018
    • (2017)Mind the Gaps (and Bumps)Proceedings of the 8th ACM SIGSPATIAL Workshop on GeoStreaming10.1145/3148160.3148165(29-38)Online publication date: 7-Nov-2017
    • (2017)Explaining and predicting abnormal expenses at large scale using knowledge graph based reasoningWeb Semantics: Science, Services and Agents on the World Wide Web10.1016/j.websem.2017.05.00344:C(89-103)Online publication date: 1-May-2017
    • Show More Cited By

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media