
Model averaging in distributed machine learning: a case study with Apache Spark

Published: 15 April 2021

Abstract

The increasing popularity of Apache Spark has attracted many users to put their data into its ecosystem. On the other hand, it has been observed in the literature that Spark is slow when it comes to distributed machine learning (ML). One recourse is to switch to specialized systems such as parameter servers, which are claimed to have better performance. Nonetheless, users then have to undergo the painful procedure of moving data into and out of Spark. In this paper, we investigate the performance bottlenecks of MLlib (Spark’s official ML package) in detail, focusing on its implementation of stochastic gradient descent (SGD), the workhorse behind the training of many ML models. We show that the performance inferiority of Spark is caused by implementation issues rather than by fundamental flaws of the bulk synchronous parallel (BSP) model that governs Spark’s execution: we can significantly improve Spark’s performance by leveraging the well-known “model averaging” (MA) technique in distributed ML. Indeed, model averaging is not limited to SGD, and we further showcase an application of MA to training latent Dirichlet allocation (LDA) models within Spark. Our implementation is not intrusive and requires light development effort. Experimental evaluation results reveal that the MA-based versions of SGD and LDA can be orders of magnitude faster than their counterparts without MA.
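To make the idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation and not an MLlib API) of model averaging applied to SGD on Spark, using logistic regression as an example generalized linear model. Each partition runs sequential SGD over its local data starting from broadcast weights, and the driver averages the per-partition models once per round instead of averaging per-mini-batch gradients. Names such as ModelAveragingSGDSketch and localSgd, and the hyperparameters, are illustrative assumptions.

// Hypothetical sketch of model-averaging (MA) SGD on a Spark RDD.
// Each partition trains a local logistic-regression model; the driver
// averages the per-partition weight vectors once per round.
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object ModelAveragingSGDSketch {

  // One sequential SGD pass over a single partition's data (logistic loss).
  def localSgd(points: Iterator[LabeledPoint], init: Array[Double], lr: Double): Array[Double] = {
    val w = init.clone()
    points.foreach { p =>
      val x = p.features.toArray
      val margin = w.zip(x).map { case (wi, xi) => wi * xi }.sum
      val err = 1.0 / (1.0 + math.exp(-margin)) - p.label // gradient factor
      var i = 0
      while (i < w.length) { w(i) -= lr * err * x(i); i += 1 }
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("ma-sgd-sketch").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    // Toy training set; a real job would load data from HDFS/S3 instead.
    val data = sc.parallelize(Seq(
      LabeledPoint(0.0, Vectors.dense(0.1, 0.2)),
      LabeledPoint(0.0, Vectors.dense(0.2, 0.1)),
      LabeledPoint(1.0, Vectors.dense(0.8, 0.9)),
      LabeledPoint(1.0, Vectors.dense(0.9, 0.7))
    ), numSlices = 2).cache()

    var weights = Array.fill(2)(0.0)
    val rounds = 20  // arbitrary hyperparameters for this sketch
    val lr = 0.5

    for (_ <- 1 to rounds) {
      val bcast = sc.broadcast(weights)
      // Train one local model per partition, starting from the broadcast weights.
      val localModels = data.mapPartitions(it => Iterator.single(localSgd(it, bcast.value, lr)))
      // Model averaging: the driver averages the per-partition models once per round.
      val (sum, count) = localModels.map(w => (w, 1L)).reduce { case ((w1, c1), (w2, c2)) =>
        (w1.zip(w2).map { case (a, b) => a + b }, c1 + c2)
      }
      weights = sum.map(_ / count)
      bcast.destroy()
    }

    println(s"Averaged model weights: ${weights.mkString(", ")}")
    spark.stop()
  }
}

The key contrast with MLlib's gradient-descent solvers is the unit of communication: here, one weight vector per partition per round is exchanged, rather than a gradient aggregation per mini-batch, which is where the communication savings of model averaging come from.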


Cited By

  • (2024) Demystifying Data Management for Large Language Models. Companion of the 2024 International Conference on Management of Data, pp. 547–555. https://doi.org/10.1145/3626246.3654683. Online publication date: 9 June 2024.



      Published In

      The VLDB Journal — The International Journal on Very Large Data Bases, Volume 30, Issue 4
      Jul 2021
      214 pages

      Publisher

      Springer-Verlag

      Berlin, Heidelberg

      Publication History

      Published: 15 April 2021
      Accepted: 02 September 2020
      Revision received: 26 July 2020
      Received: 03 December 2019

      Author Tags

      1. Distributed machine learning
      2. Apache Spark MLlib
      3. Generalized linear models
      4. Latent Dirichlet allocation

      Qualifiers

      • Research-article
