[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
10.1145/3524860.3539809acmconferencesArticle/Chapter ViewAbstractPublication PagesdebsConference Proceedingsconference-collections
research-article

Substream management in distributed streaming dataflows

Published: 15 July 2022 Publication History

Abstract

Most state-of-the-art SPEs use punctuations to divide a stream into bounded substreams of messages, such as epochs and windows. The punctuation approach is powerful but has limitations: it does not support cyclic dataflows, is poorly scalable in some cases due to intensive use of broadcasts, and becomes inefficient when the number of chunks or cluster size becomes significant. We introduce a new substream tracking technique called trAcker that overcomes the limits of punctuations. We experimentally evaluate the properties of trAcker in both synthetic and real-world environments. Experiments show that our technique outperforms punctuations for a large number of substreams and efficiently handles real-world cyclic dataflows.

References

[1]
Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. 2013. MillWheel: Fault-tolerant Stream Processing at Internet Scale. Proc. VLDB 6, 11 (Aug. 2013), 1033--1044.
[2]
T. Akidau, S. Chernyak, and R. Lax. 2018. Streaming Systems: The What, Where, When, and how of Large-scale Data Processing. O'Reilly Media, Incorporated. https://books.google.ru/books?id=48-BAQAACAAJ
[3]
Ahmed Awad, Jonas Traub, and Sherif Sakr. 2019. Adaptive Watermarks: A Concept Drift-based Approach for Predicting Event-Time Progress in Data Streams. In EDBT. 622--625.
[4]
Edmon Begoli, Tyler Akidau, Slava Chernyak, Fabian Hueske, Kathryn Knight, Kenneth Knowles, Daniel Mills, and Dan Sotolongo. 2021. Watermarks in Stream Processing Systems: Semantics and Comparative Analysis of Apache Flink and Google Cloud Dataflow. Proc. VLDB Endow. 14, 12 (2021), 3135--3147. http://www.vldb.org/pvldb/vol14/p3135-begoli.pdf
[5]
Edmon Begoli, Tyler Akidau, Fabian Hueske, Julian Hyde, Kathryn Knight, and Kenneth Knowles. 2019. One SQL to Rule Them All - an Efficient and Syntactically Idiomatic Approach to Management of Streams and Tables. In Proceedings of the 2019 International Conference on Management of Data (Amsterdam, Netherlands) (SIGMOD '19). ACM, New York, NY, USA, 1757--1772.
[6]
Paris Carbone. 2018. Scalable and Reliable Data Stream Processing. Ph.D. Dissertation. KTH Royal Institute of Technology.
[7]
Paris Carbone, Stephan Ewen, Gyula Fóra, Seif Haridi, Stefan Richter, and Kostas Tzoumas. 2017. State Management in Apache Flink®: Consistent Stateful Distributed Stream Processing. Proc. VLDB 10, 12 (Aug. 2017), 1718--1729.
[8]
P. Carbone, G. Fóra, S. Ewen, S. Haridi, and K. Tzoumas. 2015. Lightweight Asynchronous Snapshots for Distributed Dataflows. ArXiv e-prints (June 2015). arXiv:1506.08603 [cs.DC]
[9]
Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering 36, 4 (2015).
[10]
Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C Platt, James F Terwilliger, and John Wernsing. 2014. Trill: A high-performance incremental query processor for diverse analytics. Proceedings of the VLDB Endowment 8, 4 (2014), 401--412.
[11]
K. Mani Chandy and Leslie Lamport. 1985. Distributed Snapshots: Determining Global States of Distributed Systems. ACM Trans. Comput. Syst. 3, 1 (Feb. 1985), 63--75.
[12]
Vincenzo Gulisano, Yiannis Nikolakopoulos, Marina Papatriantafilou, and Philippas Tsigas. 2016. Scalejoin: A deterministic, disjoint-parallel and skew-resilient stream join. IEEE Transactions on Big Data (2016).
[13]
Gabriela Jacques-Silva, Fang Zheng, Daniel Debrunner, Kun-Lung Wu, Victor Dogaru, Eric Johnson, Michael Spicer, and Ahmet Erdem Sariyüce. 2016. Consistent regions: Guaranteed tuple processing in ibm streams. Proceedings of the VLDB Endowment 9, 13 (2016), 1341--1352.
[14]
Jeyhun Karimov, Tilmann Rabl, Asterios Katsifodimos, Roman Samarev, Henri Heiskanen, and Volker Markl. 2018. Benchmarking distributed stream data processing systems. In 2018 IEEE 34th International Conference on Data Engineering (ICDE). IEEE, 1507--1518.
[15]
Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In Proc. of the 2015 ACM SIGMOD Intnl. Conf. on Management of Data (Melbourne, Victoria, Australia) (SIGMOD '15). ACM, New York, NY, USA, 239--250.
[16]
Igor E. Kuralenok, Artem Trofimov, Nikita Marshalkin, and Boris Novikov. 2018. Deterministic Model for Distributed Speculative Stream Processing. In Advances in Databases and Information Systems, András Benczúr, Bernhard Thalheim, and Tomáš Horváth (Eds.). Springer International Publishing, Cham, 233--246.
[17]
Igor E. Kuralenok, Artem Trofimov, Nikita Marshalkin, and Boris Novikov. 2018. FlameStream: Model and Runtime for Distributed Stream Processing. In Proceedings of the 5th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond (Houston, TX, USA) (BeyondMR'18). ACM, New York, NY, USA, Article 8, 2 pages.
[18]
Jin Li, Kristin Tufte, Vladislav Shkapenyuk, Vassilis Papadimos, Theodore Johnson, and David Maier. 2008. Out-of-order Processing: A New Architecture for High-performance Stream Systems. Proc. VLDB Endow. 1, 1 (Aug. 2008), 274--288.
[19]
Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles (Farminton, Pennsylvania) (SOSP '13). ACM, New York, NY, USA, 439--455.
[20]
Hannaneh Najdataei, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas, and Vincenzo Gulisano. 2019. Stretch: Scalable and elastic deterministic streaming analysis with virtual shared-nothing parallelism. In Proceedings of the 13th ACM International Conference on Distributed and Event-based Systems. 7--18.
[21]
Shadi A. Noghabi, Kartik Paramasivam, Yi Pan, Navina Ramesh, Jon Bringhurst, Indranil Gupta, and Roy H. Campbell. 2017. Samza: Stateful Scalable Stream Processing at LinkedIn. Proc. VLDB Endow. 10, 12 (Aug. 2017), 1634--1645.
[22]
Albrecht Schmidt, Florian Waas, Martin Kersten, Michael J Carey, Ioana Manolescu, and Ralph Busse. 2002. XMark: A benchmark for XML data management. In VLDB'02: Proceedings of the 28th International Conference on Very Large Databases. Elsevier, 974--985.
[23]
storm-site 2018. Apache Storm documentation, Guaranteeing Message Processing. https://storm.apache.org/releases/current/Guaranteeing-message-processing.html
[24]
storm-site 2018. Apache Storm documentation, Storm State Management. http://storm.apache.org/releases/1.2.1/State-checkpointing.html
[25]
Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy. 2014. Storm@Twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (Snowbird, Utah, USA) (SIGMOD '14). ACM, New York, NY, USA, 147--156.
[26]
Pete Tucker, Kristin Tufte, Vassilis Papadimos, and David Maier. 2008. NEXMark-A Benchmark for Queries over Data Streams (DRAFT). Technical Report. Technical report, OGI School of Science & Engineering at OHSU, Septembers.
[27]
Peter A. Tucker, David Maier, Tim Sheard, and Leonidas Fegaras. 2003. Exploiting Punctuation Semantics in Continuous Data Streams. IEEE Trans. on Knowl. and Data Eng. 15, 3 (March 2003), 555--568.
[28]
Peter A. Tucker, David Maier, Tim Sheard, and Leonidas Fegaras. 2003. Exploiting punctuation semantics in continuous data streams. IEEE Transactions on Knowledge and Data Engineering 15, 3 (2003), 555--568.
[29]
Chen Xu, Markus Holzemer, Manohar Kaul, and Volker Markl. 2016. Efficient fault-tolerance for iterative graph processing on distributed dataflow systems. In 2016 IEEE 32nd International Conference on Data Engineering (ICDE). IEEE, 613--624.
[30]
Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. 2012. Discretized Streams: An Efficient and Fault-tolerant Model for Stream Processing on Large Clusters. In Proc. of the 4th USENIX Conf. on Hot Topics in Cloud Ccomputing (Boston, MA) (HotCloud'12). USENIX Association, Berkeley, CA, USA, 10--10.
[31]
Erfan Zamanian, Carsten Binnig, Tim Harris, and Tim Kraska. 2017. The End of a Myth: Distributed Transactions Can Scale. Proc. VLDB Endow. 10, 6 (Feb. 2017), 685--696.
[32]
Yunhao Zhang, Rong Chen, and Haibo Chen. 2017. Sub-millisecond stateful stream querying over fast-evolving linked data. In Proceedings of the 26th Symposium on Operating Systems Principles. ACM, 614--630.
[33]
Zhan Zhang, Wenhao Li, Xiao Qing, Xian Liu, and Hongwei Liu. 2021. Research on Optimal Checkpointing-Interval for Flink Stream Processing Applications. Mobile Networks and Applications (2021), 1--10.

Cited By

View all
  • (2023)Bounding substreams in distributed stream processingInformation Systems10.1016/j.is.2023.102251117:COnline publication date: 1-Jul-2023

Index Terms

  1. Substream management in distributed streaming dataflows

    Recommendations

    Comments

    Please enable JavaScript to view thecomments powered by Disqus.

    Information & Contributors

    Information

    Published In

    cover image ACM Conferences
    DEBS '22: Proceedings of the 16th ACM International Conference on Distributed and Event-Based Systems
    June 2022
    210 pages
    ISBN:9781450393089
    DOI:10.1145/3524860
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Sponsors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 July 2022

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. data streams
    2. punctuations
    3. state management
    4. stream join
    5. substreams
    6. watermarks

    Qualifiers

    • Research-article

    Conference

    DEBS '22

    Acceptance Rates

    DEBS '22 Paper Acceptance Rate 10 of 19 submissions, 53%;
    Overall Acceptance Rate 145 of 583 submissions, 25%

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)7
    • Downloads (Last 6 weeks)3
    Reflects downloads up to 18 Dec 2024

    Other Metrics

    Citations

    Cited By

    View all
    • (2023)Bounding substreams in distributed stream processingInformation Systems10.1016/j.is.2023.102251117:COnline publication date: 1-Jul-2023

    View Options

    Login options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media