More Web Proxy on the site http://driver.im/

research-article

Finding hierarchical heavy hitters in streaming data

Authors:

Graham Cormode,

S. Muthukrishnan,

Divesh SrivastavaAuthors Info & Claims

ACM Transactions on Knowledge Discovery from Data (TKDD), Volume 1, Issue 4

Article No.: 2, Pages 1 - 48

https://doi.org/10.1145/1324172.1324174

Published: 02 February 2008 Publication History

Abstract

Data items that arrive online as streams typically have attributes which take values from one or more hierarchies (time and geographic location, source and destination IP addresses, etc.). Providing an aggregate view of such data is important for summarization, visualization, and analysis. We develop an aggregate view based on certain organized sets of large-valued regions (“heavy hitters”) corresponding to hierarchically discounted frequency counts. We formally define the notion of hierarchical heavy hitters (HHHs). We first consider computing (approximate) HHHs over a data stream drawn from a single hierarchical attribute. We formalize the problem and give deterministic algorithms to find them in a single pass over the input.

In order to analyze a wider range of realistic data streams (e.g., from IP traffic-monitoring applications), we generalize this problem to multiple dimensions. Here, the semantics of HHHs are more complex, since a “child” node can have multiple “parent” nodes. We present online algorithms that find approximate HHHs in one pass, with provable accuracy guarantees. The product of hierarchical dimensions forms a mathematical lattice structure. Our algorithms exploit this structure, and so are able to track approximate HHHs using only a small, fixed number of statistics per stored item, regardless of the number of dimensions.

We show experimentally, using real data, that our proposed algorithms yields outputs which are very similar (virtually identical, in many cases) to offline computations of the exact solutions, whereas straightforward heavy-hitters-based approaches give significantly inferior answer quality. Furthermore, the proposed algorithms result in an order of magnitude savings in data structure size while performing competitively.

References

[1]

Agarwal, S., Agrawal, R., Deshpande, P., Gupta, A., Naughton, J. F., Ramakrishnan, R., and Sarawagi, S. 1996. On the computation of multidimensional aggregates. In Proceedings of the International Conference on Very Large Data Bases.

Digital Library

[2]

Beyer, K. and Ramakrishnan, R. 1999. Bottom-Up computation of sparse and Iceberg CUBE. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Also in SIGMOD Rec. 28, 2, 359--370.

Digital Library

[3]

Charikar, M., Chen, K., and Farach-Colton, M. 2002. Finding frequent items in data streams. In Procedings of the International Colloquium on Automata, Languages and Programming (ICALP), 693--703.

Digital Library

[4]

Cormode, G., Korn, F., Muthukrishnan, S., Johnson, T., Spatscheck, O., and Srivastava, D. 2004. Holistic UDAFs at streaming speeds. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 35--46.

Digital Library

[5]

Cormode, G., Korn, F., Muthukrishnan, S., and Srivastava, D. 2003. Finding hierarchical heavy hitters in data streams. In Proceedings of the International Conference on Very Large Data Bases, 464--475.

Digital Library

[6]

Cormode, G., Korn, F., Muthukrishnan, S., and Srivastava, D. 2004. Diamond in the rough: Finding hierarchical heavy hitters in multi-dimensional data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 155--166.

Digital Library

[7]

Cormode, G. and Muthukrishnan, S. 2003. What's hot and what's not: Tracking most frequent items dynamically. In Proceedings of the ACM Conference on Principles of Database Systems, 296--306.

Digital Library

[8]

Cormode, G. and Muthukrishnan, S. 2005. An improved data stream summary: The count-min sketch and its applications. J. Algor. 55, 1, 58--75.

Digital Library

[9]

Cranor, C., Johnson, T., Spatscheck, O., and Shkapenyuk, V. 2003. Gigascope: A stream database for network applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 647--651.

Digital Library

[10]

Demaine, E., López-Ortiz, A., and Munro, J. I. 2002. Frequency estimation of internet packet streams with limited space. In Proceedings of the European Symposium on Algorithms (ESA). Lecture Notes in Computer Science, vol. 2461. Springer, Berlin, 348--360.

Digital Library

[11]

Estan, C., Savage, S., and Varghese, G. 2003. Automatically inferring patterns of resource consumption in network traffic. In Proceedings of the ACM SIGCOMM Data Communications Festival.

Digital Library

[12]

Ganesan, P., Garcia-Molina, H., and Widom, J. 2003. Exploiting hierarchical domain structure to compute similarity. ACM Trans. Inf. Syst. 21, 1, 64--93.

Digital Library

[13]

Gilbert, A. C., Kotidis, Y., Muthukrishnan, S., and Strauss, M. 2002. How to summarize the universe: Dynamic maintenance of quantiles. In Proceedings of the International Conference on Very Large Data Bases, 454--465.

Digital Library

[14]

Guha, S., Koudas, N., and Shim, K. 2001. Data streams and histograms. In Proceedings of the ACM Symposium on Theory of Computing, 471--475.

Digital Library

[15]

Hershberger, J., Shrivastava, N., Suri, S., and Toth, C. 2005. Space complexity of hierarchical heavy hitters in multi-dimensional data streams. In Proceedings of the ACM Conferences Principles of Database Systems.

Digital Library

[16]

Karp, R., Papadimitriou, C., and Shenker, S. 2003. A simple algorithm for finding frequent elements in sets and bags. ACM Trans. Database Syst. 28, 51--55.

Digital Library

[17]

Lakshmanan, L. V. S., Ng, R. T., Wang, C. X., Zhou, X., and Johnson, T. 2002. The generalized MDL approach for summarization. In Proceedings of the International Conference on Very Large Data Bases, 766--777.

Digital Library

[18]

Lee, J., Kim, D., and Chung, C. 1999. Multidimensional selectivity estimation using compressed histogram information. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 205--214.

Digital Library

[19]

Manjhi, A., Shkapenyuk, V., Dhamdhere, K., and Olston, C. 2005. Finding (recently) frequent items in distributed data streams. In Proceedings of the IEEE International Conference on Data Engineering, 767--778.

Digital Library

[20]

Manku, G. and Motwani, R. 2002. Approximate frequency counts over data streams. In Proceedings of the International Conference on Very Large Data Bases, 346--357.

Digital Library

[21]

Metwally, A., Agrawal, D., and Abbadi, A. E. 2005. Efficient computation of frequent and top-k elements in data streams. In Proceedings of the International Conference on Database Theory.

Digital Library

[22]

Misra, J. and Gries, D. 1982. Finding repeated elements. Sci. Comput. Program. 2, 143--152.

[23]

Ng, R. T., Wagner, A. S., and Yin, Y. 2001. Iceberg-Cube computation with PC clusters. In Proceedings of the ACM SIGMOD International Conference on Management of Data.

Digital Library

[24]

Thaper, N., Indyk, P., Guha, S., and Koudas, N. 2002. Dynamic multidimensional histograms. In Proceedings of the ACM SIGMOD International Conference on Management of Data, 359--366.

Digital Library

[25]

Vitter, J. S., Wang, M., and Iyer, B. 1998. Data cube approximation and histograms via wavelets. In Proceedings of the 7th ACM International Conference on Information and Knowledge Management. 96--104.

Digital Library

[26]

Zhang, Y., Singh, S., Sen, S., Duffield, N., and Lund, C. 2004. Online identification of hieararchical heavy hitters: Algorithms, evaluation and applications. In Proceedings of the Internet Measurement Conference (IMC).

Digital Library

Cited By

Su HXiao Q(2023)Online Detection of 1D and 2D Hierarchical Super-Spreaders in High-Speed NetworksProceedings of the 7th Asia-Pacific Workshop on Networking10.1145/3600061.3600080(109-115)Online publication date: 29-Jun-2023
https://dl.acm.org/doi/10.1145/3600061.3600080
Tang LHuang QLee P(2023)MVPipe: Enabling Lightweight Updates and Fast Convergence in Hierarchical Heavy Hitter DetectionIEEE/ACM Transactions on Networking10.1109/TNET.2023.327330731:6(3207-3221)Online publication date: Dec-2023
https://doi.org/10.1109/TNET.2023.3273307
Meng YZhang QTang XZhang WWang J(2023)FTM-RCA: A Fast Two-Stage Multi-dimensional Root-Cause Analysis of Network Anomalies2023 IEEE/ACM 31st International Symposium on Quality of Service (IWQoS)10.1109/IWQoS57198.2023.10188732(01-10)Online publication date: 19-Jun-2023
https://doi.org/10.1109/IWQoS57198.2023.10188732
Show More Cited By

Index Terms

Finding hierarchical heavy hitters in streaming data
1. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Constant Time Updates in Hierarchical Heavy Hitters
SIGCOMM '17: Proceedings of the Conference of the ACM Special Interest Group on Data Communication

Monitoring tasks, such as anomaly and DDoS detection, require identifying frequent flow aggregates based on common IP prefixes. These are known as hierarchical heavy hitters (HHH), where the hierarchy is determined based on the type of prefixes of ...
Differentially Private Hierarchical Heavy Hitters
PODS

The task of finding Hierarchical Heavy Hitters (HHH) was introduced by Cormode et al. [12] as a generalisation of the heavy hitter problem. While finding HHH in data streams has been studied extensively, the question of releasing HHH when the underlying ...
Finding hierarchical heavy hitters in data streams
VLDB '03: Proceedings of the 29th international conference on Very large data bases - Volume 29

Aggregation along hierarchies is a critical summary technique in a large variety of on-line applications including decision support and network management (e.g., IP clustering, denial-of-service attack monitoring). Despite the amount of recent study ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image ACM Transactions on Knowledge Discovery from Data

ACM Transactions on Knowledge Discovery from Data Volume 1, Issue 4

January 2008

143 pages

ISSN:1556-4681

EISSN:1556-472X

DOI:10.1145/1324172

Issue’s Table of Contents

Copyright © 2008 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 February 2008

Accepted: 01 September 2007

Revised: 01 July 2007

Received: 01 April 2007

Published in TKDD Volume 1, Issue 4

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed

Funding Sources

National Science Foundation

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

65
Total Citations
View Citations
1,380
Total Downloads

Downloads (Last 12 months)38
Downloads (Last 6 weeks)3

Reflects downloads up to 12 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Su HXiao Q(2023)Online Detection of 1D and 2D Hierarchical Super-Spreaders in High-Speed NetworksProceedings of the 7th Asia-Pacific Workshop on Networking10.1145/3600061.3600080(109-115)Online publication date: 29-Jun-2023
https://dl.acm.org/doi/10.1145/3600061.3600080
Tang LHuang QLee P(2023)MVPipe: Enabling Lightweight Updates and Fast Convergence in Hierarchical Heavy Hitter DetectionIEEE/ACM Transactions on Networking10.1109/TNET.2023.327330731:6(3207-3221)Online publication date: Dec-2023
https://doi.org/10.1109/TNET.2023.3273307
Meng YZhang QTang XZhang WWang J(2023)FTM-RCA: A Fast Two-Stage Multi-dimensional Root-Cause Analysis of Network Anomalies2023 IEEE/ACM 31st International Symposium on Quality of Service (IWQoS)10.1109/IWQoS57198.2023.10188732(01-10)Online publication date: 19-Jun-2023
https://doi.org/10.1109/IWQoS57198.2023.10188732
Mahadevan AMerchant AWang YMathioudakis MAl Hasan MXiong L(2022)Robustness of Sketched Linear Classifiers to Adversarial AttacksProceedings of the 31st ACM International Conference on Information & Knowledge Management10.1145/3511808.3557687(4319-4323)Online publication date: 17-Oct-2022
https://dl.acm.org/doi/10.1145/3511808.3557687
Ben Basat REinziger GKeslassy IOrda AVargaftik SWaisbard E(2022)Memento: Making Sliding Windows Efficient for Heavy HittersIEEE/ACM Transactions on Networking10.1109/TNET.2021.313238530:4(1440-1453)Online publication date: Aug-2022
https://doi.org/10.1109/TNET.2021.3132385
Chen YWang P(2022)Concise Retrieval of Flow Statistics for Software-Defined NetworksIEEE Systems Journal10.1109/JSYST.2021.306530616:1(554-565)Online publication date: Mar-2022
https://doi.org/10.1109/JSYST.2021.3065306
Cohen ECormode GDuffield N(2020)Structure-aware samplingProceedings of the VLDB Endowment10.14778/3402707.34027214:11(819-830)Online publication date: 3-Jun-2020
https://dl.acm.org/doi/10.14778/3402707.3402721
Veloso BGama JMartins CEspanha RAzevedo R(2020)A case study on using heavy-hitters in interconnect bypass fraudACM SIGAPP Applied Computing Review10.1145/3429204.342920820:3(47-57)Online publication date: 8-Oct-2020
https://dl.acm.org/doi/10.1145/3429204.3429208
Aftab MChau SShenoy P(2020)Efficient Online Classification and Tracking on Resource-constrained IoT DevicesACM Transactions on Internet of Things10.1145/33920511:3(1-29)Online publication date: 13-Jul-2020
https://dl.acm.org/doi/10.1145/3392051
Kučera JPopescu DWang HMoore AKořenek JAntichi G(2020)Enabling Event-Triggered Data Plane MonitoringProceedings of the Symposium on SDN Research10.1145/3373360.3380830(14-26)Online publication date: 3-Mar-2020
https://dl.acm.org/doi/10.1145/3373360.3380830
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents