Abstract
Data partitioning among the processing instances of distributed stream processing systems (DSPSs) plays a significant role in overall stream processing performance. Several data partitioning schemes, including round-robin and hash-based key-splitting strategies, are employed in this context. However, stateful operations introduce challenges such as data aggregation overhead and load imbalance among processing instances due to the skewed nature of real data. In this paper, we propose a partitioning strategy (HKS) that tracks the popularity of tuples on the fly and partitions them according to their frequency: high-frequency tuples are routed using the power-of-two-choices technique, whereas low-frequency tuples are routed using a single hash function. We perform a comprehensive experimental evaluation on synthetic and real-world data sets on the well-known Apache Storm DSPS. The results demonstrate that HKS outperforms state-of-the-art data partitioning schemes in terms of load imbalance and aggregation cost.
Department of Information and Communication Technology, DBGROUP.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Aslam, A., Simonini, G., Gagliardelli, L., Mozzillo, A., Bergamaschi, S. (2023). HKS: Efficient Data Partitioning for Stateful Streaming. In: Wrembel, R., Gamper, J., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Big Data Analytics and Knowledge Discovery. DaWaK 2023. Lecture Notes in Computer Science, vol 14148. Springer, Cham. https://doi.org/10.1007/978-3-031-39831-5_35
DOI: https://doi.org/10.1007/978-3-031-39831-5_35
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-39830-8
Online ISBN: 978-3-031-39831-5
eBook Packages: Computer Science, Computer Science (R0)