More Web Proxy on the site http://driver.im/

research-article

Distributed numerical and machine learning computations via two-phase execution of aggregated join trees

Authors:

Dimitrije Jankov,

Chris JermaineAuthors Info & Claims

Proceedings of the VLDB Endowment, Volume 14, Issue 7

Pages 1228 - 1240

https://doi.org/10.14778/3450980.3450991

Published: 01 March 2021 Publication History

Abstract

When numerical and machine learning (ML) computations are expressed relationally, classical query execution strategies (hash-based joins and aggregations) can do a poor job distributing the computation. In this paper, we propose a two-phase execution strategy for numerical computations that are expressed relationally, as aggregated join trees (that is, expressed as a series of relational joins followed by an aggregation). In a pilot run, lineage information is collected; this lineage is used to optimally plan the computation at the level of individual records. Then, the computation is actually executed. We show experimentally that a relational system making use of this two-phase strategy can be an excellent platform for distributed ML computations.

References

[1]

[n.d.]. Intel® Math Kernel Library. https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html. Accessed: 2021-03-10.

[2]

[n.d.]. MXNet. https://mxnet.apache.org/.

[3]

[n.d.]. Pytorch. https://pytorch.org/. Accessed: 2021-03-10.

[4]

[n.d.]. Pytorch Einstein Notation. https://pytorch.org/docs/master/generated/torch.einsum.html. Accessed: 2021-03-10.

[5]

[n.d.]. Ring-allreduce. https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/. Accessed: 2021-03-10.

[6]

[n.d.]. ScaLAPACK. http://www.netlib.org/scalapack/. Accessed: 2021-03-10.

[7]

[n.d.]. scikit. https://scikit-learn.org/stable/. Accessed: 2021-03-10.

[8]

[n.d.]. Tensorflow. https://www.tensorflow.org/. Accessed: 2021-03-10.

[9]

Mahmoud Abo Khamis, Hung Q Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. 2018. In-Database Learning with Sparse Tensors. Proceedings of the ACM Symposium on Principles of Database Systems, 325--340.

Digital Library

[10]

Foto N AFRATI and Jeffrey D ULLMAN. [n.d.]. Optimizing Multiway Joins in a Map-Reduce Environment. IEEE Transactions on Knowledge and Data Engineering, 1282--1298.

Digital Library

[11]

Ramesh C Agarwal, Susanne M Balle, Fred G Gustavson, Mahesh Joshi, and Prasad Palkar. 1995. A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development, 575--582.

Digital Library

[12]

Parag Agrawal, Omar Benjelloun, Anish Das Sarma, Chris Hayworth, Shubha Nabar, Tomoe Sugihara, and Jennifer Widom. 2006. Trio: A system for data, uncertainty, and lineage. Proceedings of the VLDB Endowment, 1151--1154.

Digital Library

[13]

Philip A. Bernstein, Nathan Goodman, Eugene Wong, Christopher L. Reeve, and James B. Rothnie. 1981. Query Processing in a System for Distributed Databases (SDD-1). ACM Transactions on Database Systems, 602--625.

Digital Library

[14]

Robert Brijder, Floris Geerts, Jan Van den Bussche, and Timmy Weerwag. 2017. On the expressive power of query languages for matrices. arXiv (2017).

[15]

Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv (2020).

[16]

Surajit Chaudhuri. 1998. An Overview of Query Optimization in Relational Systems. Proceedings of the ACM Symposium on Principles of Database Systems, 34--43.

Digital Library

[17]

A. Chen and V. Li. 1984. Improvement Algorithms for Semijoin Query Processing Programs in Distributed Database Systems. IEEE Trans. Comput., 959--967.

Digital Library

[18]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv (2018).

[19]

Robert Epstein, Michael Stonebraker, and Eugene Wong. 1978. Distributed Query Processing in a Relational Data Base System. Proceedings of the ACM SIGMOD International Conference on Management of Data, 169--180.

Digital Library

[20]

Ronald Fagin, Amnon Lotem, and Moni Naor. 2003. Optimal Aggregation Algorithms for Middleware. J. Comput. System Sci., 102--113.

Digital Library

[21]

Amol Ghoting, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, and Shivakumar Vaithyanathan. 2011. SystemML: Declarative machine learning on MapReduce. IEEE International Conference on Data Engineering, 231--242.

Digital Library

[22]

Ravindra Guravannavar, HS Ramanujam, and S Sudarshan. 2005. Optimizing Nested Queries with Parameter Sort Orders. Proceedings of the VLDB Endowment, 481--492.

Digital Library

[23]

Peter J. Haas and Joseph M. Hellerstein. 1999. Ripple Joins for Online Aggregation. Proceedings of the ACM SIGMOD International Conference on Management of Data, 287--298.

Digital Library

[24]

Donghyoung Han, Yoon-Min Nam, Jihye Lee, Kyongseok Park, Hyunwoo Kim, and Min-Soo Kim. 2019. DistME: A Fast and Elastic Distributed Matrix Computation Engine using GPUs. Proceedings of the ACM SIGMOD International Conference on Management of Data, 759--774.

Digital Library

[25]

Joseph M Hellerstein, Christoper Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, et al. 2012. The MADlib Analytics Library. Proceedings of the VLDB Endowment, 1700--1711.

Digital Library

[26]

Alan R Hevner and S Bing Yao. 1979. Query Processing in Distributed Database System. IEEE Transactions on Software Engineering, 177--187.

Digital Library

[27]

Wei Hong. 1992. Exploiting inter-operation parallelism in XPRS. Proceedings of the ACM SIGMOD International Conference on Management of Data, 19--28.

Digital Library

[28]

Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. arXiv (2019).

[29]

Dimitrije Jankov, Shangyu Luo, Binhang Yuan, Zhuhua Cai, Jia Zou, Chris Jermaine, and Zekai J Gao. 2019. Declarative recursive computation on an RDBMS: or, why you should use a database for distributed machine learning. Proceedings of the VLDB Endowment, 43--50.

Digital Library

[30]

Matthias Jasny, Tobias Ziegler, Tim Kraska, Uwe Roehm, and Carsten Binnig. 2020. DB4ML - An In-Memory Database Kernel with Machine Learning Support. Proceedings of the ACM SIGMOD International Conference on Management of Data, 159--173.

Digital Library

[31]

Konstantinos Karanasos, Andrey Balmin, Marcel Kutsch, Fatma Ozcan, Vuk Ercegovac, Chunyang Xia, and Jesse Jackson. 2014. Dynamically Optimizing Queries over Large Scale Data Platforms. Proceedings of the ACM SIGMOD International Conference on Management of Data, 943--954.

Digital Library

[32]

Mahmoud Abo Khamis, Hung Q Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. 2018. AC/DC: in-database learning thunderstruck. Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, 1--10.

Digital Library

[33]

Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv (2020).

[34]

Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. Proceedings of the USENIX Conference on Operating Systems Design and Implementation, 583--598.

Digital Library

[35]

Xupeng Li, Bin Cui, Yiru Chen, Wentao Wu, and Ce Zhang. 2017. Mlog: Towards declarative in-database machine learning. Proceedings of the VLDB Endowment, 1933--1936.

Digital Library

[36]

Feilong Liu, Ario Salmasi, Spyros Blanas, and Anastasios Sidiropoulos. 2018. Chasing similarity: Distribution-aware aggregation scheduling. Proceedings of the VLDB Endowment, 292--306.

Digital Library

[37]

Shangyu Luo, Zekai J Gao, Michael Gubanov, Luis L Perez, and Christopher Jermaine. 2018. Scalable Linear Algebra on a Relational Database System. IEEE Transactions on Knowledge and Data Engineering, 1224--1238.

Digital Library

[38]

Julian McAuley, Rahul Pandey, and Jure Leskovec. 2015. Inferring networks of substitutable and complementary products. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785--794.

Digital Library

[39]

Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. Proceedings of the International ACM SIGIR Conference on Research and Development in Information retrieval, 43--52.

Digital Library

[40]

Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2020. Memory-Efficient Pipeline-Parallel DNN Training. arXiv (2020).

[41]

Orestis Polychroniou, Rajkumar Sen, and Kenneth A Ross. 2014. Track Join: Distributed Joins with Minimal Network Traffic. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1483--1494.

Digital Library

[42]

Fotis Psallidas and Eugene Wu. 2018. Smoke: Fine-grained Lineage at Interactive Speed. Proceedings of the VLDB Endowment, 719--732.

Digital Library

[43]

P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. 1979. Access Path Selection in a Relational Database Management System. Proceedings of the ACM SIGMOD International Conference on Management of Data, 23--34.

Digital Library

[44]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. arXiv (2017).

[45]

Annita N Wilschut and Peter MG Apers. 1990. Pipelining in query execution. International Conference on Databases, Parallel Architectures, and Their Applications, 562--562.

[46]

Bowen Yang, Jian Zhang, Jonathan Li, Christopher Ré, Christopher R Aberger, and Christopher De Sa. 2019. PipeMare: Asynchronous Pipeline Parallel DNN Training. arXiv (2019).

[47]

Binhang Yuan, Dimitrije Jankov, Jia Zou, Yuxin Tang, Daniel Bourgeois, and Chris Jermaine. 2020. Tensor Relational Algebra for Machine Learning System Design. arXiv (2020).

[48]

Jia Zou, R Matthew Barnett, Tania Lorido-Botran, Shangyu Luo, Carlos Monroy, Sourav Sikdar, Kia Teymourian, Binhang Yuan, and Chris Jermaine. 2018. PlinyCompute: A platform for high-performance, distributed, data-intensive tool development. Proceedings of the VLDB Endowment, 1189--1204.

Digital Library

Cited By

Xu LQiu SYuan BJiang JRenggli CGan SKara KLi GLiu JWu WYe JZhang C(2024)Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systemsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00845-033:5(1231-1255)Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.1007/s00778-024-00845-0
Huang ZSen RLiu JWu E(2023)JoinBoost: Grow Trees over Normalized Data Using Only SQLProceedings of the VLDB Endowment10.14778/3611479.361150916:11(3071-3084)Online publication date: 24-Aug-2023
https://dl.acm.org/doi/10.14778/3611479.3611509
Boehm MInterlandi MJermaine CDas SPandis ISelçuk Candan KAmer-Yahia S(2023)Optimizing Tensor Computations: From Applications to Compilation and Runtime TechniquesCompanion of the 2023 International Conference on Management of Data10.1145/3555041.3589407(53-59)Online publication date: 4-Jun-2023
https://dl.acm.org/doi/10.1145/3555041.3589407
Show More Cited By

Recommendations

Distributed stream join query processing with semijoins

This paper addresses the distributed stream processing of window-based multi-way join queries considering the semijoin as a key join operator. In distributed stream processing, data streams arriving at remote sites need to be shipped to the processing ...
Combining Joint and Semi-Join Operations for Distributed Query Processing

The application of a combination of join and semi-join operations to minimize the amount of data transmission required for distributed query processing is discussed. Specifically, two important concepts that occur with the use of join operations as ...
Interleaving a Join Sequence with Semijoins in Distributed Query Processing

The problem of combining join and semijoin reducers for distributed query processing is studied. An approach based on interleaving a join sequence with beneficial semijoins is proposed. A join sequence is mapped into a join sequence tree first. The join ...

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment

Proceedings of the VLDB Endowment Volume 14, Issue 7

March 2021

130 pages

ISSN:2150-8097

Editors:
Xin Luna Dong
Amazon
,
Felix Naumann
HPI, University of Potsdam

Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 March 2021

Published in PVLDB Volume 14, Issue 7

Qualifiers

Research-article

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

5
Total Citations
View Citations
59
Total Downloads

Downloads (Last 12 months)11
Downloads (Last 6 weeks)0

Reflects downloads up to 14 Dec 2024

Other Metrics

View Author Metrics

Citations

Cited By

Xu LQiu SYuan BJiang JRenggli CGan SKara KLi GLiu JWu WYe JZhang C(2024)Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systemsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00845-033:5(1231-1255)Online publication date: 1-Sep-2024
https://dl.acm.org/doi/10.1007/s00778-024-00845-0
Huang ZSen RLiu JWu E(2023)JoinBoost: Grow Trees over Normalized Data Using Only SQLProceedings of the VLDB Endowment10.14778/3611479.361150916:11(3071-3084)Online publication date: 24-Aug-2023
https://dl.acm.org/doi/10.14778/3611479.3611509
Boehm MInterlandi MJermaine CDas SPandis ISelçuk Candan KAmer-Yahia S(2023)Optimizing Tensor Computations: From Applications to Compilation and Runtime TechniquesCompanion of the 2023 International Conference on Management of Data10.1145/3555041.3589407(53-59)Online publication date: 4-Jun-2023
https://dl.acm.org/doi/10.1145/3555041.3589407
Yadav RAiken AKjolstad FJhala RDillig I(2022)DISTAL: the distributed tensor algebra compilerProceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation10.1145/3519939.3523437(286-300)Online publication date: 9-Jun-2022
https://dl.acm.org/doi/10.1145/3519939.3523437
Xu LQiu SYuan BJiang JRenggli CGan SKara KLi GLiu JWu WYe JZhang CIves ZBonifati AEl Abbadi A(2022)In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data ShuffleProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3526150(1286-1300)Online publication date: 10-Jun-2022
https://dl.acm.org/doi/10.1145/3514221.3526150

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Media

Figures

Other

Tables

View Issue’s Table of Contents