
Model averaging in distributed machine learning: a case study with Apache Spark

Published: 15 April 2021

Abstract

The increasing popularity of Apache Spark has attracted many users to put their data into its ecosystem. On the other hand, it has been observed in the literature that Spark is slow when it comes to distributed machine learning (ML). One recourse is to switch to specialized systems such as parameter servers, which are claimed to have better performance. Nonetheless, users then have to undergo the painful procedure of moving data into and out of Spark. In this paper, we investigate the performance bottlenecks of MLlib (Spark’s official ML package) in detail, focusing on its implementation of stochastic gradient descent (SGD), the workhorse behind the training of many ML models. We show that the performance inferiority of Spark is caused by implementation issues rather than by fundamental flaws of the bulk synchronous parallel (BSP) model that governs Spark’s execution: we can significantly improve Spark’s performance by leveraging the well-known “model averaging” (MA) technique in distributed ML. Indeed, model averaging is not limited to SGD, and we further showcase an application of MA to training latent Dirichlet allocation (LDA) models within Spark. Our implementation is not intrusive and requires light development effort. Experimental evaluation results reveal that the MA-based versions of SGD and LDA can be orders of magnitude faster than their counterparts without MA.
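To make the idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation and not an MLlib API) of model averaging applied to SGD on Spark, using logistic regression as an example generalized linear model. Each partition runs sequential SGD over its local data starting from broadcast weights, and the driver averages the per-partition models once per round instead of averaging per-mini-batch gradients. Names such as ModelAveragingSGDSketch and localSgd, and the hyperparameters, are illustrative assumptions.

// Hypothetical sketch of model-averaging (MA) SGD on a Spark RDD.
// Each partition trains a local logistic-regression model; the driver
// averages the per-partition weight vectors once per round.
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object ModelAveragingSGDSketch {

  // One sequential SGD pass over a single partition's data (logistic loss).
  def localSgd(points: Iterator[LabeledPoint], init: Array[Double], lr: Double): Array[Double] = {
    val w = init.clone()
    points.foreach { p =>
      val x = p.features.toArray
      val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum
      val err = 1.0 / (1.0 + math.exp(-margin)) - p.label // gradient factor
      var i = 0
      while (i < w.length) { w(i) -= lr * err * x(i); i += 1 }
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ma-sgd-sketch").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    // Toy training set; a real job would load data from HDFS/S3 instead.
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.1, 0.2)),
      LabeledPoint(0.0, Vectors.dense(0.2, 0.1)),
      LabeledPoint(1.0, Vectors.dense(0.8, 0.9)),
      LabeledPoint(1.0, Vectors.dense(0.9, 0.7))
    ), numSlices = 2).cache()

    var weights = Array.fill(2)(0.0)
    val rounds = 20  // arbitrary hyperparameters for this sketch
    val lr = 0.5

    for (_ <- 1 to rounds) {
      val bcast = sc.broadcast(weights)
      // Train one local model per partition, starting from the broadcast weights.
      val localModels = data.mapPartitions(it => Iterator.single(localSgd(it, bcast.value, lr)))
      // Model averaging: the driver averages the per-partition models once per round.
      val (sum, count) = localModels.map(w => (w, 1L)).reduce { case ((w1, c1), (w2, c2)) =>
        (w1.zip(w2).map { case (a, b) => a + b }, c1 + c2)
      }
      weights = sum.map(_ / count)
      bcast.destroy()
    }

    println(s"Averaged model weights: ${weights.mkString(", ")}")
    spark.stop()
  }
}

The key contrast with MLlib's gradient-descent solvers is the unit of communication: here, one weight vector per partition per round is exchanged, rather than a gradient aggregation per mini-batch, which is where the communication savings of model averaging come from.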


Cited By

  • (2024) Demystifying Data Management for Large Language Models. Companion of the 2024 International Conference on Management of Data, pp. 547–555. https://doi.org/10.1145/3626246.3654683. Online publication date: 9 June 2024.



      Published In

      The VLDB Journal — The International Journal on Very Large Data Bases, Volume 30, Issue 4
      Jul 2021
      214 pages

      Publisher

      Springer-Verlag

      Berlin, Heidelberg

      Publication History

      Published: 15 April 2021
      Accepted: 02 September 2020
      Revision received: 26 July 2020
      Received: 03 December 2019

      Author Tags

      1. Distributed machine learning
      2. Apache Spark MLlib
      3. Generalized linear models
      4. Latent Dirichlet allocation

      Qualifiers

      • Research-article
