[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Finding hierarchical heavy hitters in streaming data

Published: 02 February 2008 Publication History

Abstract

Data items that arrive online as streams typically have attributes which take values from one or more hierarchies (time and geographic location, source and destination IP addresses, etc.). Providing an aggregate view of such data is important for summarization, visualization, and analysis. We develop an aggregate view based on certain organized sets of large-valued regions (“heavy hitters”) corresponding to hierarchically discounted frequency counts. We formally define the notion of hierarchical heavy hitters (HHHs). We first consider computing (approximate) HHHs over a data stream drawn from a single hierarchical attribute. We formalize the problem and give deterministic algorithms to find them in a single pass over the input.
In order to analyze a wider range of realistic data streams (e.g., from IP traffic-monitoring applications), we generalize this problem to multiple dimensions. Here, the semantics of HHHs are more complex, since a “child” node can have multiple “parent” nodes. We present online algorithms that find approximate HHHs in one pass, with provable accuracy guarantees. The product of hierarchical dimensions forms a mathematical lattice structure. Our algorithms exploit this structure, and so are able to track approximate HHHs using only a small, fixed number of statistics per stored item, regardless of the number of dimensions.
We show experimentally, using real data, that our proposed algorithms yields outputs which are very similar (virtually identical, in many cases) to offline computations of the exact solutions, whereas straightforward heavy-hitters-based approaches give significantly inferior answer quality. Furthermore, the proposed algorithms result in an order of magnitude savings in data structure size while performing competitively.

References

[1]
Agarwal, S., Agrawal, R., Deshpande, P., Gupta, A., Naughton, J. F., Ramakrishnan, R., and Sarawagi, S. 1996. On the computation of multidimensional aggregates. In Proceedings of the International Conference on Very Large Data Bases.
[2]
Beyer, K. and Ramakrishnan, R. 1999. Bottom-Up computation of sparse and Iceberg CUBE. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Also in SIGMOD Rec. 28, 2, 359--370.
[3]
Charikar, M., Chen, K., and Farach-Colton, M. 2002. Finding frequent items in data streams. In Procedings of the International Colloquium on Automata, Languages and Programming (ICALP), 693--703.
[4]
Cormode, G., Korn, F., Muthukrishnan, S., Johnson, T., Spatscheck, O., and Srivastava, D. 2004. Holistic UDAFs at streaming speeds. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 35--46.
[5]
Cormode, G., Korn, F., Muthukrishnan, S., and Srivastava, D. 2003. Finding hierarchical heavy hitters in data streams. In Proceedings of the International Conference on Very Large Data Bases, 464--475.
[6]
Cormode, G., Korn, F., Muthukrishnan, S., and Srivastava, D. 2004. Diamond in the rough: Finding hierarchical heavy hitters in multi-dimensional data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 155--166.
[7]
Cormode, G. and Muthukrishnan, S. 2003. What's hot and what's not: Tracking most frequent items dynamically. In Proceedings of the ACM Conference on Principles of Database Systems, 296--306.
[8]
Cormode, G. and Muthukrishnan, S. 2005. An improved data stream summary: The count-min sketch and its applications. J. Algor. 55, 1, 58--75.
[9]
Cranor, C., Johnson, T., Spatscheck, O., and Shkapenyuk, V. 2003. Gigascope: A stream database for network applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 647--651.
[10]
Demaine, E., López-Ortiz, A., and Munro, J. I. 2002. Frequency estimation of internet packet streams with limited space. In Proceedings of the European Symposium on Algorithms (ESA). Lecture Notes in Computer Science, vol. 2461. Springer, Berlin, 348--360.
[11]
Estan, C., Savage, S., and Varghese, G. 2003. Automatically inferring patterns of resource consumption in network traffic. In Proceedings of the ACM SIGCOMM Data Communications Festival.
[12]
Ganesan, P., Garcia-Molina, H., and Widom, J. 2003. Exploiting hierarchical domain structure to compute similarity. ACM Trans. Inf. Syst. 21, 1, 64--93.
[13]
Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. 2002. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the International Conference on Very Large Data Bases, 454--465.
[14]
Guha, S., Koudas, N., and Shim, K. 2001. Data streams and histograms. In Proceedings of the ACM Symposium on Theory of Computing, 471--475.
[15]
Hershberger, J., Shrivastava, N., Suri, S., and Toth, C. 2005. Space complexity of hierarchical heavy hitters in multi-dimensional data streams. In Proceedings of the ACM Conferences Principles of Database Systems.
[16]
Karp, R., Papadimitriou, C., and Shenker, S. 2003. A simple algorithm for finding frequent elements in sets and bags. ACM Trans. Database Syst. 28, 51--55.
[17]
Lakshmanan, L. V. S., Ng, R. T., Wang, C. X., Zhou, X., and Johnson, T. 2002. The generalized MDL approach for summarization. In Proceedings of the International Conference on Very Large Data Bases, 766--777.
[18]
Lee, J., Kim, D., and Chung, C. 1999. Multidimensional selectivity estimation using compressed histogram information. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 205--214.
[19]
Manjhi, A., Shkapenyuk, V., Dhamdhere, K., and Olston, C. 2005. Finding (recently) frequent items in distributed data streams. In Proceedings of the IEEE International Conference on Data Engineering, 767--778.
[20]
Manku, G. and Motwani, R. 2002. Approximate frequency counts over data streams. In Proceedings of the International Conference on Very Large Data Bases, 346--357.
[21]
Metwally, A., Agrawal, D., and Abbadi, A. E. 2005. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the International Conference on Database Theory.
[22]
Misra, J. and Gries, D. 1982. Finding repeated elements. Sci. Comput. Program. 2, 143--152.
[23]
Ng, R. T., Wagner, A. S., and Yin, Y. 2001. Iceberg-Cube computation with PC clusters. In Proceedings of the ACM SIGMOD International Conference on Management of Data.
[24]
Thaper, N., Indyk, P., Guha, S., and Koudas, N. 2002. Dynamic multidimensional histograms. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 359--366.
[25]
Vitter, J. S., Wang, M., and Iyer, B. 1998. Data cube approximation and histograms via wavelets. In Proceedings of the 7th ACM International Conference on Information and Knowledge Management. 96--104.
[26]
Zhang, Y., Singh, S., Sen, S., Duffield, N., and Lund, C. 2004. Online identification of hieararchical heavy hitters: Algorithms, evaluation and applications. In Proceedings of the Internet Measurement Conference (IMC).

Cited By

View all
  • (2023)Online Detection of 1D and 2D Hierarchical Super-Spreaders in High-Speed NetworksProceedings of the 7th Asia-Pacific Workshop on Networking10.1145/3600061.3600080(109-115)Online publication date: 29-Jun-2023
  • (2023)MVPipe: Enabling Lightweight Updates and Fast Convergence in Hierarchical Heavy Hitter DetectionIEEE/ACM Transactions on Networking10.1109/TNET.2023.327330731:6(3207-3221)Online publication date: Dec-2023
  • (2023)FTM-RCA: A Fast Two-Stage Multi-dimensional Root-Cause Analysis of Network Anomalies2023 IEEE/ACM 31st International Symposium on Quality of Service (IWQoS)10.1109/IWQoS57198.2023.10188732(01-10)Online publication date: 19-Jun-2023
  • Show More Cited By

Index Terms

  1. Finding hierarchical heavy hitters in streaming data

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Knowledge Discovery from Data
    ACM Transactions on Knowledge Discovery from Data  Volume 1, Issue 4
    January 2008
    143 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/1324172
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 02 February 2008
    Accepted: 01 September 2007
    Revised: 01 July 2007
    Received: 01 April 2007
    Published in TKDD Volume 1, Issue 4

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Data mining
    2. approximation algorithms
    3. network data analysis

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)38
    • Downloads (Last 6 weeks)3
    Reflects downloads up to 12 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)Online Detection of 1D and 2D Hierarchical Super-Spreaders in High-Speed NetworksProceedings of the 7th Asia-Pacific Workshop on Networking10.1145/3600061.3600080(109-115)Online publication date: 29-Jun-2023
    • (2023)MVPipe: Enabling Lightweight Updates and Fast Convergence in Hierarchical Heavy Hitter DetectionIEEE/ACM Transactions on Networking10.1109/TNET.2023.327330731:6(3207-3221)Online publication date: Dec-2023
    • (2023)FTM-RCA: A Fast Two-Stage Multi-dimensional Root-Cause Analysis of Network Anomalies2023 IEEE/ACM 31st International Symposium on Quality of Service (IWQoS)10.1109/IWQoS57198.2023.10188732(01-10)Online publication date: 19-Jun-2023
    • (2022)Robustness of Sketched Linear Classifiers to Adversarial AttacksProceedings of the 31st ACM International Conference on Information & Knowledge Management10.1145/3511808.3557687(4319-4323)Online publication date: 17-Oct-2022
    • (2022)Memento: Making Sliding Windows Efficient for Heavy HittersIEEE/ACM Transactions on Networking10.1109/TNET.2021.313238530:4(1440-1453)Online publication date: Aug-2022
    • (2022)Concise Retrieval of Flow Statistics for Software-Defined NetworksIEEE Systems Journal10.1109/JSYST.2021.306530616:1(554-565)Online publication date: Mar-2022
    • (2020)Structure-aware samplingProceedings of the VLDB Endowment10.14778/3402707.34027214:11(819-830)Online publication date: 3-Jun-2020
    • (2020)A case study on using heavy-hitters in interconnect bypass fraudACM SIGAPP Applied Computing Review10.1145/3429204.342920820:3(47-57)Online publication date: 8-Oct-2020
    • (2020)Efficient Online Classification and Tracking on Resource-constrained IoT DevicesACM Transactions on Internet of Things10.1145/33920511:3(1-29)Online publication date: 13-Jul-2020
    • (2020)Enabling Event-Triggered Data Plane MonitoringProceedings of the Symposium on SDN Research10.1145/3373360.3380830(14-26)Online publication date: 3-Mar-2020
    • Show More Cited By

    View Options

    Login options

    Full Access

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media