[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

General incremental sliding-window aggregation

Published: 01 February 2015 Publication History

Abstract

Stream processing is gaining importance as more data becomes available in the form of continuous streams and companies compete to promptly extract insights from them. In such applications, sliding-window aggregation is a central operator, and incremental aggregation helps avoid the performance penalty of re-aggregating from scratch for each window change.
This paper presents Reactive Aggregator (RA), a new framework for incremental sliding-window aggregation. RA is general in that it does not require aggregation functions to be invertible or commutative, and it does not require windows to be FIFO. We implemented RA as a drop-in replacement for the Aggregate operator of a commercial streaming engine. Given m updates on a window of size n, RA has an algorithmic complexity of O(m + m log (n/m)), rivaling the best prior algorithms for any m. Furthermore, RA's implementation minimizes overheads from allocation and pointer traversals by using a single flat array.

References

[1]
U. A. Acar, G. E. Blelloch, M. Blume, R. Harper, and K. Tangwongsan. An experimental analysis of self-adjusting computation. Transactions on Programming Languages and Systems (TOPLAS), 32(1), 2009.
[2]
T. Akidau, A. Balikov, K. Bekiroglu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. Millwheel: Fault-tolerant stream processing at internet scale. In Very Large Data Bases (VLDB) Industrial Track, pages 734--746, 2013.
[3]
A. Arasu and J. Widom. Resource sharing in continuous sliding window aggregates. In Conference on Very Large Data Bases (VLDB), pages 336--347, 2004.
[4]
A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: semantic foundations and query execution. Journal on Very Large Data Bases (VLDB J.), 15(2): 121--142, 2006.
[5]
P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and R. Pasquini. Incoop: MapReduce for incremental computations. In Symposium on Cloud Computing (SoCC), 2011.
[6]
P. Bhatotia, M. Dischinger, R. Rodrigues, and U. A. Acar. Slider: Incremental sliding-window computations for large-scale data analysis. Technical Report MPI-SWS-2012-004, Max Planck Institute for Software Systems, 2012.
[7]
I. Botan, R. Derakhshan, N. Dindar, L. Haas, R. J. Miller, and N. Tatbul. SECRET: A model for analysis of the execution semantics of stream processing systems. In Very Large Data Bases (VLDB), pages 232--243, 2010.
[8]
A. Bulut and A. K. Singh. SWAT: Hierarchical stream summarization in large networks. In International Conference on Data Engineering (ICDE), pages 303--314, 2003.
[9]
C. Cranor, T. Johnson, O. Spataschek, and V. Shkapenyuk. Gigascope: A stream database for network applications. In International Conference on Management of Data (SIGMOD) Industrial Track, pages 647--651, 2003.
[10]
A. Demers, T. Reps, and T. Teitelbaum. Incremental evaluation of attribute grammars with application to syntax-directed editors. In Principles of Programming Languages (POPL), pages 105--116, 1981.
[11]
C. L. Forgy. Rete: A fast algorithm for the many pattern/many object pattern match problem. Artificial Intelligence, 19: 17--37, 1982.
[12]
B. Gedik. Generic windowing support for extensible stream processing systems. Software Practice and Experience (SP&E), 44(9): 1105--1128, 2014.
[13]
T. M. Ghanem, M. A. Hammad, M. F. Mokbel, W. G. Aref, and A. K. Elmagarmid. Incremental evaluation of sliding-window queries over data streams. Transactions on Knowledge and Data Engineering (TKDE), 19(1): 57--72, 2007.
[14]
J. Gray, A. Bosworth, A. Layman, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-total. In International Conference on Data Engineering (ICDE), pages 152--159, 1996.
[15]
P. K. Gunda, L. Ravindranath, C. A. Thekkath, Y. Yu, and L. Zhuang. Nectar: Automatic management of data and computation in datacenters. In Operating Systems Design and Implementation (OSDI), 2010.
[16]
R. Hinze and R. Paterson. Finger trees: a simple general-purpose data structure. Journal of Functional Programming (JFP), 16(02): 197--217, 2006.
[17]
M. Hirzel. Partition and compose: Parallel complex event processing. In Conference on Distributed Event-Based Systems (DEBS), pages 191--200, 2012.
[18]
M. Hirzel, H. Andrade, B. Gedik, G. Jacques-Silva, R. Khandekar, V. Kumar, M. Mendell, H. Nasgaard, S. Schneider, R. Soulé, and K.-L. Wu. IBM Streams Processing Language: Analyzing big data in motion. IBM Journal of Research and Development (IBMRD), 57(3/4), 2013.
[19]
P. Hudak, A. Courtney, H. Nilsson, and J. Peterson. Arrows, robots, and functional reactive programming. In Summer School on Advanced Functional Programming, Oxford University, 2003.
[20]
IBMInfoSphereStreams. IBM InfoSphere Streams. http://www.ibm.com/software/data/infosphere/streams/. Accessed: 2014-08-27.
[21]
N. Jain, S. Mishra, A. Srinivasan, J. Gehrke, J. Widom, H. Balakrishnan, U. Cetintemel, M. Cherniack, R. Tibbets, and S. Zdonik. Towards a streaming SQL standard. In Very Large Data Bases (VLDB), pages 1379--1390, 2008.
[22]
S. Krishnamurthi, C. Wu, and M. Franklin. On-the-fly sharing for streamed aggregation. In International Conference on Management of Data (SIGMOD), pages 623--634, 2006.
[23]
J. Li, D. Maier, K. Tufte, V. Papadimos, and P. A. Tucker. No pane, no gain: efficient evaluation of sliding-window aggregates over data streams. ACM SIGMOD Record, 34(1): 39--44, 2005.
[24]
D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum. Stateful bulk processing for incremental analytics. In Symposium on Cloud Computing (SoCC), pages 51--62, 2010.
[25]
F. McSherry, D. G. Murray, R. Isaacs, and M. Isard. Differential dataflow. In Conference on Innovative Data Systems Research (CIDR), 2013.
[26]
B. Moon, I. F. V. López, and V. Immanuel. Scalable algorithms for large temporal aggregation. In International Conference on Data Engineering (ICDE), 2000.
[27]
D. Peng and F. Dabek. Large-scale incremental processing using distributed transactions and notifications. In Operating Systems Design and Implementation (OSDI), pages 251--264, 2010.
[28]
W. Pugh and T. Teitelbaum. Incremental computation via function caching. In Principles of Programming Languages (POPL), pages 315--328, 1989.
[29]
S. Schneider, M. Hirzel, B. Gedik, and K.-L. Wu. Auto-parallelizing stateful distributed streaming applications. In International Conference on Parallel Architectures and Compilation Techniques (PACT), pages 53--64, 2012.
[30]
R. Sedgewick. Algorithms in C++ - Parts 1--4: Fundamentals, Data Structures, Sorting, Searching (3. ed.). Addison-Wesley-Longman, 1999.
[31]
A. Shinnar, D. Cunningham, B. Herta, and V. Saraswat. M3R: Increased performance for in-memory Hadoop jobs. In Very Large Data Bases (VLDB) Industrial Track, pages 1736--1747, 2012.
[32]
E. Soisalon-Soininen and P. Widmayer. Amortized complexity of bulk updates in avl-trees. In SWAT 2002, 8th Scandinavian Workshop on Algorithm Theory, pages 439--448, 2002.
[33]
E. Soisalon-Soininen and P. Widmayer. Single and bulk updates in stratified trees: An amortized and worst-case analysis. In Computer Science in Perspective, Essays Dedicated to Thomas Ottmann, pages 278--292, 2003.
[34]
A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. Storm @twitter. In International Conference on Management of Data (SIGMOD), pages 147--156, 2014.
[35]
H. Wang, C. Zaniolo, and C. R. Luo. ATLAS: A small but complete SQL extension for data mining and data streams. In Demonstration at Very Large Data Bases (VLDB-Demo), pages 1113--1116, 2003.
[36]
J. Yang and J. Widom. Incremental computation and maintenance of temporal aggregates. In International Conference on Data Engineering (ICDE), 2001.
[37]
Y. Yu, P. K. Gunda, and M. Isard. Distributed aggregation for data-parallel computing: Interfaces and implementations. In Symposium on Operating Systems Principles (SOSP), 2009.
[38]
M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In Symposium on Operating Systems Principles (SOSP), pages 423--438, 2013.

Cited By

View all
  • (2024)Query Compilation Without RegretsProceedings of the ACM on Management of Data10.1145/36549682:3(1-28)Online publication date: 30-May-2024
  • (2024)μWheel: Aggregate Management for Streams and QueriesProceedings of the 18th ACM International Conference on Distributed and Event-based Systems10.1145/3629104.3666031(54-65)Online publication date: 24-Jun-2024
  • (2024)Revisiting Optimal Window Aggregation in Data Streams: The Prefix-Sum ApproachProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679573(1660-1669)Online publication date: 21-Oct-2024
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 8, Issue 7
February 2015
124 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 February 2015
Published in PVLDB Volume 8, Issue 7

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)62
  • Downloads (Last 6 weeks)7
Reflects downloads up to 05 Jan 2025

Other Metrics

Citations

Cited By

View all
  • (2024)Query Compilation Without RegretsProceedings of the ACM on Management of Data10.1145/36549682:3(1-28)Online publication date: 30-May-2024
  • (2024)μWheel: Aggregate Management for Streams and QueriesProceedings of the 18th ACM International Conference on Distributed and Event-based Systems10.1145/3629104.3666031(54-65)Online publication date: 24-Jun-2024
  • (2024)Revisiting Optimal Window Aggregation in Data Streams: The Prefix-Sum ApproachProceedings of the 33rd ACM International Conference on Information and Knowledge Management10.1145/3627673.3679573(1660-1669)Online publication date: 21-Oct-2024
  • (2024)Bifurcation Detections of Multidimensional Random Processes in Dynamic Expert SystemsPattern Recognition and Image Analysis10.1134/S105466182470062734:3(751-756)Online publication date: 17-Oct-2024
  • (2024)FPCS: Feature Preserving Compensated Sampling of Streaming Time Series DataIEEE Transactions on Visualization and Computer Graphics10.1109/TVCG.2024.345637531:1(1333-1342)Online publication date: 9-Sep-2024
  • (2023)DBSP: Automatic Incremental View Maintenance for Rich Query LanguagesProceedings of the VLDB Endowment10.14778/3587136.358713716:7(1601-1614)Online publication date: 8-May-2023
  • (2023)LAQy: Efficient and Reusable Query Approximations via Lazy SamplingProceedings of the ACM on Management of Data10.1145/35893191:2(1-26)Online publication date: 20-Jun-2023
  • (2023)FlowKV: A Semantic-Aware Store for Large-Scale State Management of Stream Processing EnginesProceedings of the Eighteenth European Conference on Computer Systems10.1145/3552326.3567493(768-783)Online publication date: 8-May-2023
  • (2023)Survey of window types for aggregation in stream processing systemsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-022-00778-632:5(985-1011)Online publication date: 17-Feb-2023
  • (2022)SWSProceedings of the VLDB Endowment10.14778/3503585.350359115:4(814-827)Online publication date: 14-Apr-2022
  • Show More Cited By

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media