[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3549737.3549767acmotherconferencesArticle/Chapter ViewAbstractPublication PagessetnConference Proceedingsconference-collections
research-article

Weighted Reservoir Sampling On Evolving Streams: A Sampling Algorithmic Framework For Stream Event Identification

Published: 09 September 2022 Publication History

Abstract

Data streams are becoming increasingly important across a wide array of fields and are generally expected to be the preferred form of big data as aggregators and smart stream analytics in general can efficiently yield stream descriptions in various levels. Among them, event detection analytics are paramount since they typically allow the identification of distinct cases of interest like the so called black swans. Reservoir sampling refers to probabilistic class of techniques for keeping representative values of a stream given limited memory capacity. In the proposed framework event detection takes place once reservoir sampling is complete by clustering its output. The rationale behind this is that repeated representative values correspond to normal stream states, whereas any outliers indicate rare yet noteworthy events. With that information a probabilistic stream state graph can be constructed in order to examine the transition dynamics between states and to evaluate the role black swans play in the overall stream stability. A major part of the descriptive power of said graph lies on its inherent geometric interpretation in addition to the algebraic one. Results from two benchmark datasets, one coming from real world and a random one, are encouraging. The proposed framework is planned to be executed in Raspberry Pi as part of an IoT stack since it is sufficiently lightweight.

References

[1]
Marcel R Ackermann, Marcus Märtens, Christoph Raupach, Kamil Swierkot, Christiane Lammersen, and Christian Sohler. 2012. Streamkm++ a clustering algorithm for data streams. Journal of Experimental Algorithmics (JEA) 17 (2012), 2–1.
[2]
Charu C. Aggarwal, Jiawei Han, Jianyong Wang, and Philip S. Yu. 2003. A Framework for Clustering Evolving Data Streams. In Proceedings of the 29th International Conference on Very Large Data Bases - Volume 29(VLDB ’03). VLDB Endowment, Berlin, Germany, 81–92.
[3]
B Angelin and A Geetha. 2020. Outlier Detection using Clustering Techniques–K-means and K-median. In 2020 4th International Conference on Intelligent Computing and Control Systems (ICICCS). IEEE, 373–378.
[4]
David Arthur and Sergei Vassilvitskii. 2006. k-means++: The advantages of careful seeding. Technical Report. Stanford.
[5]
Azzedine Boukerche, Lining Zheng, and Omar Alfandi. 2020. Outlier detection: Methods, models, and classification. ACM Computing Surveys (CSUR) 53, 3 (2020), 1–37.
[6]
Vladimir Braverman, Rafail Ostrovsky, and Carlo Zaniolo. 2009. Optimal sampling from sliding windows. In Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 147–156.
[7]
Pierre Brémaud. 2013. Markov chains: Gibbs fields, Monte Carlo simulation, and queues. Vol. 31. Springer Science & Business Media.
[8]
Toon Calders, Nele Dexters, Joris JM Gillis, and Bart Goethals. 2014. Mining frequent itemsets in a stream. Information Systems 39(2014), 233–255.
[9]
David Camacho, Ma Victoria Luzón, and Erik Cambria. 2021. New research methods and algorithms in social network analysis., 290–293 pages.
[10]
Feng Cao, Martin Estert, Weining Qian, and Aoying Zhou. 2006. Density-based clustering over an evolving data stream with noise. In Proceedings of the Sixth SIAM International Conference on Data Mining. SIAM, SIAM, Bethesda, MD, USA, 328–339.
[11]
Jian Chen and Jeffrey S Rosenthal. 2012. Decrypting classical cipher text using Markov chain Monte Carlo. Statistics and Computing 22, 2 (2012), 397–413. https://doi.org/10.1007/s11222-011-9232-5
[12]
Ting Chen, Yizhou Sun, Yue Shi, and Liangjie Hong. 2017. On sampling strategies for neural network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 767–776.
[13]
Yun Chi, Haixun Wang, P.S. Yu, and R.R. Muntz. 2004. Moment: maintaining closed frequent itemsets over a stream sliding window. In Fourth IEEE International Conference on Data Mining (ICDM’04). IEEE, Brighton, UK, 59–66. https://doi.org/10.1109/ICDM.2004.10084
[14]
Mark Ming-Tso Chiang and Boris Mirkin. 2010. Intelligent choice of the number of clusters in k-means clustering: an experimental study with different cluster spreads. Journal of classification 27, 1 (2010), 3–40.
[15]
Ariel Debrouvier 2021. A model and query language for temporal graph databases. VLDB Journal 30, 5 (2021), 825–858.
[16]
Wei Deng, Guang Lin, and Faming Liang. 2022. An adaptively weighted stochastic gradient MCMC algorithm for Monte Carlo simulation and global optimization. Statistics and Computing 32, 4 (2022), 1–24.
[17]
Pavlos S Efraimidis and Paul G Spirakis. 2006. Weighted random sampling with a reservoir. Information processing letters 97, 5 (2006), 181–185.
[18]
Ahmed Eldawy, Yuan Li, Mohamed F. Mokbel, and Ravi Janardan. 2013. CG_Hadoop: Computational Geometry in MapReduce. In SIGSPATIAL. ACM, NY, USA, 294–303. https://doi.org/10.1145/2525314.2525349
[19]
S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. 2000. Clustering data streams. In Proceedings 41st Annual Symposium on Foundations of Computer Science. IEEE, Redondo Beach, CA, USA, 359–366. https://doi.org/10.1109/SFCS.2000.892124
[20]
John A. Hartigan. 1975. Clustering Algorithms(99th ed.). John Wiley & Sons, Inc., USA.
[21]
Saeid Iranmanesh 2019. Novel DTN mobility-driven routing in autonomous drone logistics networks. IEEE Access 8(2019), 13661–13673.
[22]
Ruoming Jin and Gagan Agrawal. 2007. Frequent Pattern Mining in Data Streams. Springer US, Boston, MA, 61–84. https://doi.org/10.1007/978-0-387-47534-9_4
[23]
Aristeidis Karras and Christos Karras. 2022. Integrating User and Item Reviews in Deep Cooperative Neural Networks for Movie Recommendation. arXiv preprint arXiv:2205.06296(2022).
[24]
Christos Karras and Aristeidis Karras. 2022. DBSOP: An Efficient Heuristic for Speedy MCMC Sampling on Polytopes. arXiv preprint arXiv:2203.10916(2022).
[25]
Christos Karras, Aristeidis Karras, Markos Avlonitis, Ioanna Giannoukou, and Spyros Sioutas. 2022. Maximum Likelihood Estimators on MCMC Sampling Algorithms for Decision Making. In IFIP International Conference on Artificial Intelligence Applications and Innovations. Springer, Cham, 345–356.
[26]
Christos Karras, Aristeidis Karras, Markos Avlonitis, and Spyros Sioutas. 2022. An Overview of MCMC Methods: From Theory to Applications. In IFIP International Conference on Artificial Intelligence Applications and Innovations. Springer, Cham, 319–332.
[27]
Christos Karras, Aristeidis Karras, and Spyros Sioutas. 2022. Pattern Recognition and Event Detection on IoT Data-streams. arXiv preprint arXiv:2203.01114(2022).
[28]
George Lagogiannis, Stavros Kontopoulos, and Christos Makris. 2019. On the randomness that generates biased samples: The limited randomness approach. Computer Science and Information Systems 16, 1 (2019), 205–225.
[29]
Aihua Li, Weijia Xu, Zhidong Liu, and Yong Shi. 2021. Improved incremental local outlier detection for data streams based on the landmark window model. Knowledge and Information Systems 63, 8 (2021), 2129–2155.
[30]
Yuan Li, Ahmed Eldawy, Jie Xue, Nadezda Knorozova, Mohamed F Mokbel, and Ravi Janardan. 2019. Scalable computational geometry in MapReduce. VLDB Journal 28, 4 (2019), 523–548.
[31]
Fernando Llorente, E Curbelo, Luca Martino, Victor Elvira, and David Delgado. 2022. MCMC-driven importance samplers. Applied Mathematical Modelling(2022).
[32]
David Luengo, Luca Martino, Mónica Bugallo, Víctor Elvira, and Simo Särkkä. 2020. A survey of Monte Carlo methods for parameter estimation. EURASIP Journal on Advances in Signal Processing 2020, 1(2020), 1–62.
[33]
Mohammad Sultan Mahmud, Joshua Zhexue Huang, Salman Salloum, Tamer Z Emara, and Kuanishbay Sadatdiynov. 2020. A survey of data partitioning and sampling methods to support big data analysis. Big Data Mining and Analytics 3, 2 (2020), 85–101.
[34]
Tomas Martin, Guy Francoeur, and Petko Valtchev. 2020. CICLAD: A Fast and Memory-Efficient Closed Itemset Miner for Streams. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (Virtual Event, CA, USA) (KDD ’20). Association for Computing Machinery, New York, NY, USA, 1810–1818. https://doi.org/10.1145/3394486.3403232
[35]
Luca Martino, Víctor Elvira, David Luengo, Jukka Corander, and Francisco Louzada. 2016. Orthogonal parallel MCMC methods for sampling and optimization. Digital Signal Processing 58 (2016), 64–84.
[36]
L. Martino, V. Elvira, D. Luengo, J. Corander, and F. Louzada. 2016. Orthogonal parallel MCMC methods for sampling and optimization. Digital Signal Processing 58 (2016), 64–84. https://doi.org/10.1016/j.dsp.2016.07.013
[37]
Eli Packer, Peter Bak, Mikko Nikkilä, Valentin Polishchuk, and Harold J. Ship. 2013. Visual analytics for spatial clustering: Using a heuristic approach for guided exploration. IEEE Transactions on Visualization and Computer Graphics 19, 12(2013), 2179–2188. https://doi.org/10.1109/TVCG.2013.224
[38]
Marcelo Pereyra, Philip Schniter, Emilie Chouzenoux, Jean-Christophe Pesquet, Jean-Yves Tourneret, Alfred O Hero, and Steve McLaughlin. 2015. A survey of stochastic simulation and optimization methods in signal processing. IEEE Journal of Selected Topics in Signal Processing 10, 2(2015), 224–241.
[39]
Chedy Raissi and Pascal Poncelet. 2007. Sampling for Sequential Pattern Mining: From Static Databases to Data Streams. In Seventh IEEE International Conference on Data Mining (ICDM 2007). IEEE, Omaha, NE, USA, 631–636. https://doi.org/10.1109/ICDM.2007.82
[40]
Pang-Ning Tan, Michael S. Steinbach, and Vipin Kumar. 2022. Introduction to Data Mining. Data Mining and Machine Learning Applications (2022).
[41]
Syed Khairuzzaman Tanbeer, Chowdhury Farhan Ahmed, Byeong-Soo Jeong, and Young-Koo Lee. 2009. Sliding window-based frequent pattern mining over data streams. Information sciences 179, 22 (2009), 3843–3865.
[42]
Hongzhi Wang, Mohamed Jaward Bah, and Mohamed Hammad. 2019. Progress in outlier detection techniques: A survey. Ieee Access 7(2019), 107964–108000.
[43]
Tongxin Wang 2021. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nature Communications 12, 1 (2021), 1–13.
[44]
Ming-Juan Wu 2019. Integrative hypergraph regularization principal component analysis for sample clustering and co-expression genes network analysis on multi-omics data. IEEE Journal of Biomedical and Health Informatics 24, 6(2019), 1823–1834.
[45]
Linli Xu and Dale Schuurmans. 2005. Unsupervised and Semi-Supervised Multi-Class Support Vector Machines. In Proceedings of the 20th National Conference on Artificial Intelligence - Volume 2(AAAI’05). AAAI Press, Pittsburgh, Pennsylvania, 904–910.

Cited By

View all
  • (2024)Evolving cybersecurity frontiersEngineering Applications of Artificial Intelligence10.1016/j.engappai.2024.109143137:PAOnline publication date: 1-Nov-2024
  • (2024)Algorithms for generating small random samplesSoftware: Practice and Experience10.1002/spe.3379Online publication date: 18-Sep-2024
  • (2022)Distributed Gibbs Sampling and LDA Modelling for Large Scale Big Data Management on PySpark2022 7th South-East Europe Design Automation, Computer Engineering, Computer Networks and Social Media Conference (SEEDA-CECNSM)10.1109/SEEDA-CECNSM57760.2022.9932990(1-8)Online publication date: 23-Sep-2022
  • Show More Cited By

Index Terms

  1. Weighted Reservoir Sampling On Evolving Streams: A Sampling Algorithmic Framework For Stream Event Identification

      Recommendations

      Comments

      Please enable JavaScript to view thecomments powered by Disqus.

      Information & Contributors

      Information

      Published In

      cover image ACM Other conferences
      SETN '22: Proceedings of the 12th Hellenic Conference on Artificial Intelligence
      September 2022
      450 pages
      ISBN:9781450395977
      DOI:10.1145/3549737
      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 09 September 2022

      Permissions

      Request permissions for this article.

      Check for updates

      Author Tags

      1. IoT
      2. black swan
      3. clustering
      4. data streams
      5. geometric analytics
      6. outlier discovery
      7. probabilistic state graphs
      8. reservoir sampling

      Qualifiers

      • Research-article
      • Research
      • Refereed limited

      Conference

      SETN 2022

      Contributors

      Other Metrics

      Bibliometrics & Citations

      Bibliometrics

      Article Metrics

      • Downloads (Last 12 months)19
      • Downloads (Last 6 weeks)2
      Reflects downloads up to 01 Jan 2025

      Other Metrics

      Citations

      Cited By

      View all
      • (2024)Evolving cybersecurity frontiersEngineering Applications of Artificial Intelligence10.1016/j.engappai.2024.109143137:PAOnline publication date: 1-Nov-2024
      • (2024)Algorithms for generating small random samplesSoftware: Practice and Experience10.1002/spe.3379Online publication date: 18-Sep-2024
      • (2022)Distributed Gibbs Sampling and LDA Modelling for Large Scale Big Data Management on PySpark2022 7th South-East Europe Design Automation, Computer Engineering, Computer Networks and Social Media Conference (SEEDA-CECNSM)10.1109/SEEDA-CECNSM57760.2022.9932990(1-8)Online publication date: 23-Sep-2022
      • (2022)An Intelligent Microprocessor Integrating TinyML in Smart Hotels for Rapid Accident Prevention2022 7th South-East Europe Design Automation, Computer Engineering, Computer Networks and Social Media Conference (SEEDA-CECNSM)10.1109/SEEDA-CECNSM57760.2022.9932982(1-7)Online publication date: 23-Sep-2022

      View Options

      Login options

      View options

      PDF

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader

      HTML Format

      View this article in HTML Format.

      HTML Format

      Media

      Figures

      Other

      Tables

      Share

      Share

      Share this Publication link

      Share on social media