DOI: 10.1145/3399666.3399911
research-article

The Collection Virtual Machine: An Abstraction for Multi-Frontend Multi-Backend Data Analysis

Published: 14 June 2020

Abstract

Getting the best performance from the ever-increasing number of hardware platforms has been a recurring challenge for data processing systems. In recent years, the advent of data science, with its increasingly numerous and complex types of analytics, has made this challenge even more difficult. In practice, system designers are overwhelmed by the number of combinations and typically implement a single analytics type on one platform, leading to repeated implementation effort and a plethora of semi-compatible tools for data scientists.
In this paper, we propose the "Collection Virtual Machine" (or CVM), an extensible compiler framework designed to keep the specialization process of data analytics systems tractable. It can simultaneously capture the essence of a large span of low-level, hardware-specific implementation techniques and the high-level operations of different types of analyses. At its core lies a language for defining nested, collection-oriented intermediate representations (IRs). Frontends produce programs in their own IR flavors defined in that language. These programs are optimized through a series of rewritings (possibly changing the IR flavor multiple times) until the program is finally expressed in an IR of platform-specific operators. While reducing the overall implementation effort, this also improves the interoperability of both analyses and hardware platforms. We have used CVM successfully to build specialized backends for platforms as diverse as multi-core CPUs, RDMA clusters, and serverless computing infrastructure in the cloud, and we expect similar results for many more frontends and hardware platforms in the near future.
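
The rewriting process described in the abstract can be illustrated with a minimal Python sketch. It is not CVM's actual language or API: the `Op` class, the flavor names (`dataframe`, `relational`, `cpu`), the operator names, and the two lowering rules are all hypothetical and chosen only to show a nested program changing IR flavor step by step until it consists of platform-specific operators.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Op:
    """One operator in a nested program (hypothetical, not CVM's real IR)."""
    flavor: str                      # IR flavor the operator belongs to
    name: str                        # operator name within that flavor
    inputs: Tuple["Op", ...] = ()    # nested child operators


# A rule returns a replacement operator, or None if it does not apply.
Rule = Callable[[Op], Optional[Op]]


def lower_dataframe(op: Op) -> Optional[Op]:
    """Frontend flavor -> relational flavor (illustrative mapping only)."""
    mapping = {"groupby_sum": "aggregate", "read": "scan"}
    if op.flavor == "dataframe" and op.name in mapping:
        return Op("relational", mapping[op.name], op.inputs)
    return None


def lower_relational(op: Op) -> Optional[Op]:
    """Relational flavor -> platform-specific CPU operators (illustrative)."""
    mapping = {"aggregate": "hash_aggregate", "scan": "column_scan"}
    if op.flavor == "relational" and op.name in mapping:
        return Op("cpu", mapping[op.name], op.inputs)
    return None


def rewrite(op: Op, rules: List[Rule]) -> Op:
    """Rewrite children first, then apply rules to this node until none fires."""
    op = Op(op.flavor, op.name, tuple(rewrite(c, rules) for c in op.inputs))
    changed = True
    while changed:
        changed = False
        for rule in rules:
            replacement = rule(op)
            if replacement is not None:
                op = replacement
                changed = True
    return op


def flavors(op: Op) -> Set[str]:
    """Collect every IR flavor that still occurs in the program."""
    found = {op.flavor}
    for child in op.inputs:
        found |= flavors(child)
    return found


# A frontend program: sum per group over a table that is read from storage.
program = Op("dataframe", "groupby_sum", (Op("dataframe", "read"),))
lowered = rewrite(program, [lower_dataframe, lower_relational])
```

In this sketch the program passes through the intermediate `relational` flavor before reaching `cpu`, so a second frontend that emits `relational` operators directly could reuse `lower_relational` unchanged. That reuse is the kind of saving in implementation effort the abstract refers to, shown here under the assumptions stated above.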

Information

Published In

DaMoN '20: Proceedings of the 16th International Workshop on Data Management on New Hardware
June 2020
127 pages
ISBN: 9781450380249
DOI: 10.1145/3399666
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '20

Acceptance Rates

DaMoN '20 Paper Acceptance Rate 18 of 22 submissions, 82%;
Overall Acceptance Rate 94 of 127 submissions, 74%

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 26
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 24 Jan 2025

Cited By

  • (2024) Configuring the Structure of the Serverless System for Efficient Data Collection. Advances in Cyber-Physical Systems 9:1 (39-45). DOI: 10.23939/acps2024.01.039. Online publication date: 18-Jun-2024
  • (2024) Query Compilation Without Regrets. Proceedings of the ACM on Management of Data 2:3 (1-28). DOI: 10.1145/3654968. Online publication date: 30-May-2024
  • (2023) Declarative Sub-Operators for Universal Data Processing. Proceedings of the VLDB Endowment 16:11 (3461-3474). DOI: 10.14778/3611479.3611539. Online publication date: 24-Aug-2023
  • (2022) Designing an open framework for query optimization and compilation. Proceedings of the VLDB Endowment 15:11 (2389-2401). DOI: 10.14778/3551793.3551801. Online publication date: 1-Jul-2022
  • (2022) Low-latency query compilation. The VLDB Journal 31:6 (1171-1184). DOI: 10.1007/s00778-022-00741-5. Online publication date: 10-May-2022
  • (2021) Database technology for the masses. Proceedings of the VLDB Endowment 14:11 (2483-2490). DOI: 10.14778/3476249.3476296. Online publication date: 27-Oct-2021
  • (2020) Lambada: Interactive Data Analytics on Cold Data Using Serverless Cloud Infrastructure. Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data (115-130). DOI: 10.1145/3318464.3389758. Online publication date: 11-Jun-2020
