[go: up one dir, main page]
More Web Proxy on the site http://driver.im/ skip to main content
research-article

Distributed numerical and machine learning computations via two-phase execution of aggregated join trees

Published: 01 March 2021 Publication History

Abstract

When numerical and machine learning (ML) computations are expressed relationally, classical query execution strategies (hash-based joins and aggregations) can do a poor job distributing the computation. In this paper, we propose a two-phase execution strategy for numerical computations that are expressed relationally, as aggregated join trees (that is, expressed as a series of relational joins followed by an aggregation). In a pilot run, lineage information is collected; this lineage is used to optimally plan the computation at the level of individual records. Then, the computation is actually executed. We show experimentally that a relational system making use of this two-phase strategy can be an excellent platform for distributed ML computations.

References

[1]
[n.d.]. Intel® Math Kernel Library. https://software.intel.com/content/www/us/en/develop/tools/math-kernel-library.html. Accessed: 2021-03-10.
[2]
[n.d.]. MXNet. https://mxnet.apache.org/.
[3]
[n.d.]. Pytorch. https://pytorch.org/. Accessed: 2021-03-10.
[4]
[n.d.]. Pytorch Einstein Notation. https://pytorch.org/docs/master/generated/torch.einsum.html. Accessed: 2021-03-10.
[5]
[n.d.]. Ring-allreduce. https://andrew.gibiansky.com/blog/machine-learning/baidu-allreduce/. Accessed: 2021-03-10.
[6]
[n.d.]. ScaLAPACK. http://www.netlib.org/scalapack/. Accessed: 2021-03-10.
[7]
[n.d.]. scikit. https://scikit-learn.org/stable/. Accessed: 2021-03-10.
[8]
[n.d.]. Tensorflow. https://www.tensorflow.org/. Accessed: 2021-03-10.
[9]
Mahmoud Abo Khamis, Hung Q Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. 2018. In-Database Learning with Sparse Tensors. Proceedings of the ACM Symposium on Principles of Database Systems, 325--340.
[10]
Foto N AFRATI and Jeffrey D ULLMAN. [n.d.]. Optimizing Multiway Joins in a Map-Reduce Environment. IEEE Transactions on Knowledge and Data Engineering, 1282--1298.
[11]
Ramesh C Agarwal, Susanne M Balle, Fred G Gustavson, Mahesh Joshi, and Prasad Palkar. 1995. A three-dimensional approach to parallel matrix multiplication. IBM Journal of Research and Development, 575--582.
[12]
Parag Agrawal, Omar Benjelloun, Anish Das Sarma, Chris Hayworth, Shubha Nabar, Tomoe Sugihara, and Jennifer Widom. 2006. Trio: A system for data, uncertainty, and lineage. Proceedings of the VLDB Endowment, 1151--1154.
[13]
Philip A. Bernstein, Nathan Goodman, Eugene Wong, Christopher L. Reeve, and James B. Rothnie. 1981. Query Processing in a System for Distributed Databases (SDD-1). ACM Transactions on Database Systems, 602--625.
[14]
Robert Brijder, Floris Geerts, Jan Van den Bussche, and Timmy Weerwag. 2017. On the expressive power of query languages for matrices. arXiv (2017).
[15]
Tom B Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. arXiv (2020).
[16]
Surajit Chaudhuri. 1998. An Overview of Query Optimization in Relational Systems. Proceedings of the ACM Symposium on Principles of Database Systems, 34--43.
[17]
A. Chen and V. Li. 1984. Improvement Algorithms for Semijoin Query Processing Programs in Distributed Database Systems. IEEE Trans. Comput., 959--967.
[18]
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv (2018).
[19]
Robert Epstein, Michael Stonebraker, and Eugene Wong. 1978. Distributed Query Processing in a Relational Data Base System. Proceedings of the ACM SIGMOD International Conference on Management of Data, 169--180.
[20]
Ronald Fagin, Amnon Lotem, and Moni Naor. 2003. Optimal Aggregation Algorithms for Middleware. J. Comput. System Sci., 102--113.
[21]
Amol Ghoting, Rajasekar Krishnamurthy, Edwin Pednault, Berthold Reinwald, Vikas Sindhwani, Shirish Tatikonda, Yuanyuan Tian, and Shivakumar Vaithyanathan. 2011. SystemML: Declarative machine learning on MapReduce. IEEE International Conference on Data Engineering, 231--242.
[22]
Ravindra Guravannavar, HS Ramanujam, and S Sudarshan. 2005. Optimizing Nested Queries with Parameter Sort Orders. Proceedings of the VLDB Endowment, 481--492.
[23]
Peter J. Haas and Joseph M. Hellerstein. 1999. Ripple Joins for Online Aggregation. Proceedings of the ACM SIGMOD International Conference on Management of Data, 287--298.
[24]
Donghyoung Han, Yoon-Min Nam, Jihye Lee, Kyongseok Park, Hyunwoo Kim, and Min-Soo Kim. 2019. DistME: A Fast and Elastic Distributed Matrix Computation Engine using GPUs. Proceedings of the ACM SIGMOD International Conference on Management of Data, 759--774.
[25]
Joseph M Hellerstein, Christoper Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, et al. 2012. The MADlib Analytics Library. Proceedings of the VLDB Endowment, 1700--1711.
[26]
Alan R Hevner and S Bing Yao. 1979. Query Processing in Distributed Database System. IEEE Transactions on Software Engineering, 177--187.
[27]
Wei Hong. 1992. Exploiting inter-operation parallelism in XPRS. Proceedings of the ACM SIGMOD International Conference on Management of Data, 19--28.
[28]
Yanping Huang, Youlong Cheng, Ankur Bapna, Orhan Firat, Dehao Chen, Mia Chen, HyoukJoong Lee, Jiquan Ngiam, Quoc V Le, Yonghui Wu, et al. 2019. Gpipe: Efficient training of giant neural networks using pipeline parallelism. arXiv (2019).
[29]
Dimitrije Jankov, Shangyu Luo, Binhang Yuan, Zhuhua Cai, Jia Zou, Chris Jermaine, and Zekai J Gao. 2019. Declarative recursive computation on an RDBMS: or, why you should use a database for distributed machine learning. Proceedings of the VLDB Endowment, 43--50.
[30]
Matthias Jasny, Tobias Ziegler, Tim Kraska, Uwe Roehm, and Carsten Binnig. 2020. DB4ML - An In-Memory Database Kernel with Machine Learning Support. Proceedings of the ACM SIGMOD International Conference on Management of Data, 159--173.
[31]
Konstantinos Karanasos, Andrey Balmin, Marcel Kutsch, Fatma Ozcan, Vuk Ercegovac, Chunyang Xia, and Jesse Jackson. 2014. Dynamically Optimizing Queries over Large Scale Data Platforms. Proceedings of the ACM SIGMOD International Conference on Management of Data, 943--954.
[32]
Mahmoud Abo Khamis, Hung Q Ngo, XuanLong Nguyen, Dan Olteanu, and Maximilian Schleich. 2018. AC/DC: in-database learning thunderstruck. Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, 1--10.
[33]
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv (2020).
[34]
Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, and Bor-Yiing Su. 2014. Scaling Distributed Machine Learning with the Parameter Server. Proceedings of the USENIX Conference on Operating Systems Design and Implementation, 583--598.
[35]
Xupeng Li, Bin Cui, Yiru Chen, Wentao Wu, and Ce Zhang. 2017. Mlog: Towards declarative in-database machine learning. Proceedings of the VLDB Endowment, 1933--1936.
[36]
Feilong Liu, Ario Salmasi, Spyros Blanas, and Anastasios Sidiropoulos. 2018. Chasing similarity: Distribution-aware aggregation scheduling. Proceedings of the VLDB Endowment, 292--306.
[37]
Shangyu Luo, Zekai J Gao, Michael Gubanov, Luis L Perez, and Christopher Jermaine. 2018. Scalable Linear Algebra on a Relational Database System. IEEE Transactions on Knowledge and Data Engineering, 1224--1238.
[38]
Julian McAuley, Rahul Pandey, and Jure Leskovec. 2015. Inferring networks of substitutable and complementary products. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785--794.
[39]
Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. Proceedings of the International ACM SIGIR Conference on Research and Development in Information retrieval, 43--52.
[40]
Deepak Narayanan, Amar Phanishayee, Kaiyu Shi, Xie Chen, and Matei Zaharia. 2020. Memory-Efficient Pipeline-Parallel DNN Training. arXiv (2020).
[41]
Orestis Polychroniou, Rajkumar Sen, and Kenneth A Ross. 2014. Track Join: Distributed Joins with Minimal Network Traffic. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1483--1494.
[42]
Fotis Psallidas and Eugene Wu. 2018. Smoke: Fine-grained Lineage at Interactive Speed. Proceedings of the VLDB Endowment, 719--732.
[43]
P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. 1979. Access Path Selection in a Relational Database Management System. Proceedings of the ACM SIGMOD International Conference on Management of Data, 23--34.
[44]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. arXiv (2017).
[45]
Annita N Wilschut and Peter MG Apers. 1990. Pipelining in query execution. International Conference on Databases, Parallel Architectures, and Their Applications, 562--562.
[46]
Bowen Yang, Jian Zhang, Jonathan Li, Christopher Ré, Christopher R Aberger, and Christopher De Sa. 2019. PipeMare: Asynchronous Pipeline Parallel DNN Training. arXiv (2019).
[47]
Binhang Yuan, Dimitrije Jankov, Jia Zou, Yuxin Tang, Daniel Bourgeois, and Chris Jermaine. 2020. Tensor Relational Algebra for Machine Learning System Design. arXiv (2020).
[48]
Jia Zou, R Matthew Barnett, Tania Lorido-Botran, Shangyu Luo, Carlos Monroy, Sourav Sikdar, Kia Teymourian, Binhang Yuan, and Chris Jermaine. 2018. PlinyCompute: A platform for high-performance, distributed, data-intensive tool development. Proceedings of the VLDB Endowment, 1189--1204.

Cited By

View all
  • (2024)Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systemsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00845-033:5(1231-1255)Online publication date: 1-Sep-2024
  • (2023)JoinBoost: Grow Trees over Normalized Data Using Only SQLProceedings of the VLDB Endowment10.14778/3611479.361150916:11(3071-3084)Online publication date: 24-Aug-2023
  • (2023)Optimizing Tensor Computations: From Applications to Compilation and Runtime TechniquesCompanion of the 2023 International Conference on Management of Data10.1145/3555041.3589407(53-59)Online publication date: 4-Jun-2023
  • Show More Cited By

Recommendations

Comments

Please enable JavaScript to view thecomments powered by Disqus.

Information & Contributors

Information

Published In

cover image Proceedings of the VLDB Endowment
Proceedings of the VLDB Endowment  Volume 14, Issue 7
March 2021
130 pages
ISSN:2150-8097
Issue’s Table of Contents

Publisher

VLDB Endowment

Publication History

Published: 01 March 2021
Published in PVLDB Volume 14, Issue 7

Qualifiers

  • Research-article

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)11
  • Downloads (Last 6 weeks)0
Reflects downloads up to 14 Dec 2024

Other Metrics

Citations

Cited By

View all
  • (2024)Stochastic gradient descent without full data shuffle: with applications to in-database machine learning and deep learning systemsThe VLDB Journal — The International Journal on Very Large Data Bases10.1007/s00778-024-00845-033:5(1231-1255)Online publication date: 1-Sep-2024
  • (2023)JoinBoost: Grow Trees over Normalized Data Using Only SQLProceedings of the VLDB Endowment10.14778/3611479.361150916:11(3071-3084)Online publication date: 24-Aug-2023
  • (2023)Optimizing Tensor Computations: From Applications to Compilation and Runtime TechniquesCompanion of the 2023 International Conference on Management of Data10.1145/3555041.3589407(53-59)Online publication date: 4-Jun-2023
  • (2022)DISTAL: the distributed tensor algebra compilerProceedings of the 43rd ACM SIGPLAN International Conference on Programming Language Design and Implementation10.1145/3519939.3523437(286-300)Online publication date: 9-Jun-2022
  • (2022)In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data ShuffleProceedings of the 2022 International Conference on Management of Data10.1145/3514221.3526150(1286-1300)Online publication date: 10-Jun-2022

View Options

Login options

Full Access

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Media

Figures

Other

Tables

Share

Share

Share this Publication link

Share on social media