[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3485447.3512019acmconferencesArticle/Chapter ViewAbstractPublication PagesthewebconfConference Proceedingsconference-collections
research-article

DUET: A Generic Framework for Finding Special Quadratic Elements in Data Streams

Published: 25 April 2022 Publication History

Abstract

Finding special items, like heavy hitters, top-k, and persistent items, has always been a hot issue in data stream processing for web analysis. While data streams nowadays are usually high-dimensional, most prior works focus on special items according to a certain primary dimension and yield little insight into the correlations between dimensions. Therefore, we propose to find special quadratic elements to reveal close correlations. Based on the items mentioned above, we extend our problem to three applications related to heavy hitters, top-k, and persistent items, and design a generic framework DUET to process them. Besides, we analyze the error bound of our algorithm and conduct extensive experiments on four data sets. Our experimental results show that DUET can achieve 3.5 times higher throughput and three orders of magnitude lower average relative error compared with cutting-edge algorithms.

References

[1]
[n. d.]. Advanced Persistent Threat. http://www.usenix.org/event/lisa09/tech/slides/daly. pdf.
[2]
[n. d.]. BobHash. http://burtleburtle.net/bob/hash/evahash.html.
[3]
[n. d.]. The CAIDA Anonymized Internet Traces. https://www.caida.org/data/passive/passive_2016_dataset.xml.
[4]
[n. d.]. Social Network: Reddit Hyperlink Network. http://snap.stanford.edu/data/soc-RedditHyperlinks.html.
[5]
[n. d.]. Source Codes of DUET. https://github.com/callitwhatyouwannt13/WWW_DUET.
[6]
[n. d.]. The Source Codes of HeavyKeeper. https://github.com/papergitkeeper/heavy-keeper-project.
[7]
[n. d.]. Stack Overflow Temporal Network. http://snap.stanford.edu/data/sx-stackoverflow.html.
[8]
Juan Jose Garcia Adeva and Juan Manuel Pikatza Atxa. 2007. Intrusion detection in web applications using text mining. Engineering Applications of Artificial Intelligence 20, 4(2007), 555–566.
[9]
Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Vinh The Lam, Francis Matus, Rong Pan, Navindra Yadav, 2014. CONGA: Distributed congestion-aware load balancing for datacenters. In Proceedings of ACM Conference on SIGCOMM. ACM, 503–514.
[10]
Ran Ben Basat, Gil Einziger, Roy Friedman, and Yaron Kassner. 2017. Optimal elephant flow detection. In Proceedings of International Conference on Computer Communications. IEEE, 1–9.
[11]
Ran Ben Basat, Gil Einziger, Michael Mitzenmacher, and Shay Vargaftik. 2020. Faster and More Accurate Measurement through Additive-Error Counters. In Proceedings of International Conference on Computer Communications. IEEE, 1251–1260.
[12]
Burton H Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422–426.
[13]
Moses Charikar, Kevin Chen, and Martin Farach-Colton. 2002. Finding frequent items in data streams. In Proceedings of International Colloquium on Automata, Languages, and Programming. Springer, 693–703.
[14]
Graham Cormode and Shan Muthukrishnan. 2005. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55, 1 (2005), 58–75.
[15]
Haipeng Dai, Muhammad Shahzad, Alex X Liu, and Yuankun Zhong. 2016. Finding persistent items in data streams. Proceedings of the VLDB Endowment 10, 4 (2016), 289–300.
[16]
Italo Epicoco, Massimo Cafaro, and Marco Pulimeno. 2018. Fast and accurate mining of correlated heavy hitters. Data Mining and Knowledge Discovery 32, 1 (2018), 162–186.
[17]
Cristian Estan and George Varghese. 2002. New directions in traffic measurement and accounting. In Proceedings of ACM Conference on SIGCOMM. ACM, 323–336.
[18]
Qun Huang, Patrick PC Lee, and Yungang Bao. 2018. SketchLearn: relieving user burdens in approximate measurement with automated statistical inference. In Proceedings of ACM Conference on SIGCOMM. ACM, 576–590.
[19]
Mohammad T Irfan and Tucker Gordon. 2018. The power of context in networks: Ideal point models with social interactions. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems. 910–918.
[20]
Bibudh Lahiri and Srikanta Tirthapura. 2009. Finding correlated heavy-hitters over data streams. In Proceedings of International Performance Computing and Communications Conference. IEEE, 307–314.
[21]
Bibudh Lahiri, Srikanta Tirthapura, and Jaideep Chandrashekar. 2014. Space-efficient tracking of persistent items in a massive data stream. Statistical Analysis and Data Mining: The ASA Data Science Journal 7, 1(2014), 70–92.
[22]
Jizhou Li, Zikun Li, Yifei Xu, Shiqi Jiang, Tong Yang, Bin Cui, Yafei Dai, and Gong Zhang. 2020. WavingSketch: An Unbiased and Generic Sketch for Finding Top-k Items in Data Streams. In Proceedings of International Conference on Knowledge Discovery & Data Mining. ACM, 1574–1584.
[23]
Zaoxing Liu, Ran Ben-Basat, Gil Einziger, Yaron Kassner, Vladimir Braverman, Roy Friedman, and Vyas Sekar. 2019. Nitrosketch: Robust and general sketch-based monitoring in software switches. In Proceedings of ACM Conference on SIGCOMM. ACM, 334–350.
[24]
Zaoxing Liu, Antonis Manousis, Gregory Vorsanger, Vyas Sekar, and Vladimir Braverman. 2016. One sketch to rule them all: Rethinking network flow monitoring with univmon. In Proceedings of ACM Conference on SIGCOMM. ACM, 101–114.
[25]
Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. 2005. Efficient computation of frequent and top-k elements in data streams. In Proceedings of International Conference on Database Theory. Springer, 398–412.
[26]
Katsiaryna Mirylenka, Themis Palpanas, Graham Cormode, and Divesh Srivastava. 2013. Finding interesting correlations with conditional heavy hitters. In Proceedings of International Conference on Data Engineering. IEEE, 1069–1080.
[27]
Jayadev Misra and David Gries. 1982. Finding repeated elements. Science of Computer Programming 2, 2 (1982), 143–152.
[28]
Marco Pulimeno, Italo Epicoco, Massimo Cafaro, Catiuscia Melle, and Giovanni Aloisio. 2018. Parallel Mining of Correlated Heavy Hitters on Distributed and Shared-Memory Architectures. In Proceedings of International Conference on Big Data. IEEE, 5111–5118.
[29]
Amin Shokrollahi. 2006. Raptor codes. IEEE Transactions on Information Theory 52, 6 (2006), 2551–2567.
[30]
Stuart Staniford, James A Hoagland, and Joseph M McAlerney. 2002. Practical automated detection of stealthy portscans. Journal of Computer Security 10, 1-2 (2002), 105–136.
[31]
Karthik Subbian, Charu Aggarwal, and Kshiteesh Hegde. 2016. Recommendations for streaming data. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management. 2185–2190.
[32]
Lu Tang, Qun Huang, and Patrick PC Lee. 2019. MV-Sketch: A fast and compact invertible sketch for heavy flow detection in network data streams. In Proceedings of International Conference on Computer Communications. IEEE, 2026–2034.
[33]
Tobias Weller. 2018. Compromised account detection based on clickstream data. In Companion Proceedings of the The Web Conference 2018. 819–823.
[34]
Tong Yang, Junzhi Gong, Haowei Zhang, Lei Zou, Lei Shi, and Xiaoming Li. 2018. HeavyGuardian: Separate and guard hot items in data streams. In Proceedings of International Conference on Knowledge Discovery & Data Mining. ACM, 2584–2593.
[35]
Tong Yang, Jie Jiang, Peng Liu, Qun Huang, Junzhi Gong, Yang Zhou, Rui Miao, Xiaoming Li, and Steve Uhlig. 2018. Elastic sketch: Adaptive and fast network-wide measurements. In Proceedings of the ACM Special Interest Group on Data Communication. ACM, 561–575.
[36]
Yinda Zhang, Jinyang Li, Yutian Lei, Tong Yang, Zhetao Li, Gong Zhang, and Bin Cui. 2020. On-off sketch: a fast and accurate sketch on persistence. Proceedings of the VLDB Endowment 14, 2 (2020), 128–140.

Cited By

View all
  • (2024)Rethinking Hash Tables: Challenges and Opportunities with Compute Express Link (CXL)Proceedings of the ACM Turing Award Celebration Conference - China 202410.1145/3674399.3674418(23-27)Online publication date: 5-Jul-2024
  • (2024)Scout Sketch+: Finding Both Promising and Damping Items Simultaneously in Data StreamsIEEE/ACM Transactions on Networking10.1109/TNET.2024.346919632:6(5491-5506)Online publication date: Dec-2024
  • (2024)SteadySketch: A High-Performance Algorithm for Finding Steady Flows in Data StreamsIEEE/ACM Transactions on Networking10.1109/TNET.2024.344448832:6(5004-5019)Online publication date: Dec-2024
  • Show More Cited By

Index Terms

  1. DUET: A Generic Framework for Finding Special Quadratic Elements in Data Streams
        Index terms have been assigned to the content through auto-classification.

        Recommendations

        Comments

        Please enable JavaScript to view thecomments powered by Disqus.

        Information & Contributors

        Information

        Published In

        cover image ACM Conferences
        WWW '22: Proceedings of the ACM Web Conference 2022
        April 2022
        3764 pages
        ISBN:9781450390965
        DOI:10.1145/3485447
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Sponsors

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 25 April 2022

        Permissions

        Request permissions for this article.

        Check for updates

        Author Tags

        1. data stream mining
        2. data structure
        3. sketch

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Funding Sources

        Conference

        WWW '22
        Sponsor:
        WWW '22: The ACM Web Conference 2022
        April 25 - 29, 2022
        Virtual Event, Lyon, France

        Acceptance Rates

        Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

        Contributors

        Other Metrics

        Bibliometrics & Citations

        Bibliometrics

        Article Metrics

        • Downloads (Last 12 months)54
        • Downloads (Last 6 weeks)4
        Reflects downloads up to 31 Dec 2024

        Other Metrics

        Citations

        Cited By

        View all
        • (2024)Rethinking Hash Tables: Challenges and Opportunities with Compute Express Link (CXL)Proceedings of the ACM Turing Award Celebration Conference - China 202410.1145/3674399.3674418(23-27)Online publication date: 5-Jul-2024
        • (2024)Scout Sketch+: Finding Both Promising and Damping Items Simultaneously in Data StreamsIEEE/ACM Transactions on Networking10.1109/TNET.2024.346919632:6(5491-5506)Online publication date: Dec-2024
        • (2024)SteadySketch: A High-Performance Algorithm for Finding Steady Flows in Data StreamsIEEE/ACM Transactions on Networking10.1109/TNET.2024.344448832:6(5004-5019)Online publication date: Dec-2024
        • (2024)Bamboo Filters: Make Resizing Smooth and AdaptiveIEEE/ACM Transactions on Networking10.1109/TNET.2024.340399732:5(3776-3791)Online publication date: Oct-2024
        • (2024)A Generic Framework for Finding Special Quadratic Elements in Data StreamsIEEE/ACM Transactions on Networking10.1109/TNET.2024.339202932:4(3269-3284)Online publication date: Aug-2024
        • (2024)Scout Sketch: Finding Promising Items in Data StreamsIEEE INFOCOM 2024 - IEEE Conference on Computer Communications10.1109/INFOCOM52122.2024.10621279(1561-1570)Online publication date: 20-May-2024
        • (2024)DISCO: A Dynamically Configurable Sketch Framework in Skewed Data Streams2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00365(4801-4814)Online publication date: 13-May-2024
        • (2023)SteadySketch: Finding Steady Flows in Data Streams2023 IEEE/ACM 31st International Symposium on Quality of Service (IWQoS)10.1109/IWQoS57198.2023.10188743(01-09)Online publication date: 19-Jun-2023
        • (2023)Variable-length Encoding Framework: A Generic Framework for Enhancing the Accuracy of Approximate Membership Queries2023 IEEE International Conference on Data Mining (ICDM)10.1109/ICDM58522.2023.00015(61-70)Online publication date: 1-Dec-2023

        View Options

        Login options

        View options

        PDF

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        HTML Format

        View this article in HTML Format.

        HTML Format

        Media

        Figures

        Other

        Tables

        Share

        Share

        Share this Publication link

        Share on social media