Abstract
In this paper, we study partitioning functions for stream processing systems that employ stateful data parallelism to improve application throughput. In particular, we develop partitioning functions that are effective under workloads where the domain of the partitioning key is large and its value distribution is skewed. We define various desirable properties for partitioning functions, ranging from balance properties such as memory, processing, and communication balance, structural properties such as compactness and fast lookup, and adaptation properties such as fast computation and minimal migration. We introduce a partitioning function structure that is compact and develop several associated heuristic construction techniques that exhibit good balance and low migration cost under skewed workloads. We provide experimental results that compare our partitioning functions to more traditional approaches such as uniform and consistent hashing, under different workload and application characteristics, and show superior performance.
Similar content being viewed by others
Notes
Consistent hash only migrates items from the existing nodes to the newly added node. No migrations happen between existing nodes.
The lower bound does not hold during system initialization, as there is not enough history to use.
References
Abadi, D., Ahmad, Y., Balazinska, M., Çetintemel, U., Cherniack, M., Hwang, J.H., Lindner, W., Maskey, A., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., Zdonik, S.: The design of the Borealis stream processing engine. In: Proceedings of the Innovative Data Systems Research Conference (CIDR), pp. 277–289 (2005)
Arasu, A., Manku, G.S.: Approximate counts and quantiles over sliding windows. In: Proceedings of the Symposium on Principles of Database Systems (ACM PODS) (2004)
Arasu, A., Babcock, B., Babu, S., Datar, M., Ito, K., Motwani, R., Nishizawa, I., Srivastava, U., Thomas, D., Varma, R., Widom, J.: STREAM: the stanford stream data manager. IEEE Data Eng. Bull. 26(1), 665 (2003)
Balkesen, C., Tatbul, N.: Scalable data partitioning techniques for parallel sliding window processing over data streams. In: International Workshop on Data Management for Sensor Networks (DMSN) (2011)
Cormode, G., Garofalakis, M., Haas, P., Jermaine, C.: Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches. Now Publishing, Foundations and Trends in Databases Series (2011)
Deshpande, A., Ives, Z.G., Raman, V.: Adaptive query processing. Found. Trends Databases 1(1) (2007)
DeWitt, D., Naughton, J., Schneider, D., Seshadri, S.S.: Practical skew handling in parallel joins. In: Proceedings of the Very Large Data Bases Conference (VLDB) (1992)
Gates, A.F., Natkovich, O., Chopra, S., Kamath, P., Narayanamurthy, S.M., Olston, C., Reed, B., Srinivasan, S., Srivastava, U.: Building a high-level data flow system on top of map-reduce: The PIG experience. In: Proceedings of the Very Large Data Bases Conference (VLDB) (2009)
Gedik, B., Schneider, S., Hirzel, M., Wu, K.L.: Elastic scaling for data stream processing. IBM Research Technical Report, RC25401 (2013)
Gedik, B., Andrade, H.: A model-based framework for building extensible, high performance stream processing middleware and programming language for IBM InfoSphere streams. Softw. Pract. Exp. 42(11), 1363–1391 (2012)
Gufler, B., Augsten, N., Reiser, A., Kemper, A.: Handling data skew in mapreduce. In: Proceedings of the International Conference of Cloud Computing and Services Science (2011)
Gufler, B., Augsten, N., Reiser, A., Kemper, A.: Load balancing in mapreduce based on scalable cardinality estimates. In: Proceedings of the International Conference on Data Engineering (IEEE ICDE) (2012)
Hirzel, M., Andrade, H., Gedik, B., Kumar, V., Losa, G., Mendell, M., Nasgaard, H., Soulé, R., Wu, K.L.: SPL language spec. Tech. Rep. RC24897, IBM (2009)
Jain, N., Amini, L., Andrade, H., King, R., Park, Y., Selo, P., Venkatramani, C.: Design, implementation, and evaluation of the linear road benchmark on the stream processing core. In: Proceedings of the International Conference on Management of Data (ACM SIGMOD) (2006)
Karger, D.R., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., Lewin, D.: Consistent hashing and random trees: distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of the International Symposium on Theory of Computing (ACM STOC), pp. 654–663 (1997)
Karger, D.R., Sherman, A., Berkheimer, A., Bogstad, B., Dhanidina, R., Iwamoto, K., Kim, B., Matkins, L., Yerushalmi, Y.: Web caching with consistent hashing. Comput. Netw. 31(11–16), 1203–1213 (1999)
Kwon, Y., Balazinska, M., Howe, B., Rolia, J.A.: SkewTune: mitigating skew in mapreduce applications. In: Proceedings of the International Conference on Management of Data (ACM SIGMOD) (2012)
Manku, G.S., Motwani, R.: Approximate frequency counts over data streams. In: Proceedings of the International Conference on Very Large Databases (VLDB) (2002)
MurMurHash3. http://code.google.com/p/smhasher/wiki/MurmurHash3 (2013). Retrieved May 2013
Paton, N.W., Chavez, J.B., Chen, M., Raman, V., Swart, G., Narang, I., Yellin, D.M., Fernandes, A.A.A.: Autonomic query parallelization using non-dedicated computers: An evaluation of adaptivity options. In: Proceedings of the Very Large Data Bases Conference (VLDB) (2009)
Poosala, V., Ioannidis, Y.E.: Estimation of query-result distribution and its application in parallel-join load balancing. In: Proceedings of the Very Large Data Bases Conference (VLDB) (1996)
S4 distributed stream computing platform. http://www.s4.io/ (2012). Retrieved May 2012
Schneider, S., Andrade, H., Gedik, B., Biem, A., Wu, K.L.: Elastic scaling of data parallel operators in stream processing. In: Proceedings of the International Parallel and Distributed Processing Symposium (IEEE IPDPS) (2009)
Schneider, S., Hirzel, M., Gedik, B., Wu, K.L.: Auto-parallelizing stateful distributed streaming application. In: Proceedigns of the International Conference on Parallel Architectures and Compilation Techniques (PACT), pp. 53–64 (2012)
Shah, M.A., Hellerstein, J.M., Chandrasekaran, S., Franklin, M.J.: Flux: An adaptive partitioning operator for continuous query systems. In: Proceedings of the International Conference on Data Engineering (IEEE ICDE) (2003)
Shatdal, A., Naughton, J.: Adaptive parallel aggregation algorithms. In: Proceedings of the International Conference on Management of Data (ACM SIGMOD) (1995)
Storm project. http://storm-project.net/ (2012). Retrieved May 2012
StreamBase Systems. http://www.streambase.com (2012). Retr- ieved May 2012
Walton, C., Dale, A., Jenevein, R.: A taxonomy and performance model of data skew effects in parallel joins. In: Proceedings of the Very Large Data Bases Conference (VLDB) (1991)
Xu, Y., Kostamaa, P.: Efficient outer join data skew handling in parallel dbms. In: Proceedings of the Very Large Data Bases Conference (VLDB) (2009)
Acknowledgments
We thank IBM Thomas J. Watson Research Center for providing access to compute and software resources that made this research possible.
Author information
Authors and Affiliations
Corresponding author
Rights and permissions
About this article
Cite this article
Gedik, B. Partitioning functions for stateful data parallelism in stream processing. The VLDB Journal 23, 517–539 (2014). https://doi.org/10.1007/s00778-013-0335-9
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00778-013-0335-9