DOI: 10.1145/3307681.3325407

CuLDA: Solving Large-scale LDA Problems on GPUs

Published: 17 June 2019

Abstract

Latent Dirichlet Allocation (LDA) is a popular topic model. Because the input corpus of an LDA algorithm typically consists of millions to billions of tokens, LDA training is very time-consuming, which prevents the adoption of LDA in many scenarios, e.g., online services. GPUs have benefited modern machine learning algorithms and big data analysis, as they provide high memory bandwidth and tremendous computational power; accordingly, many frameworks, e.g., TensorFlow, Caffe, and CNTK, support GPUs for accelerating data-intensive machine learning algorithms. However, we observe that the performance of existing LDA solutions on GPUs is not satisfying. In this paper, we present CuLDA, an efficient and scalable GPU-based approach to large-scale LDA problems. CuLDA is designed to solve LDA problems at high throughput. To this end, we first carefully design a workload partitioning and synchronization mechanism to exploit multiple GPUs. We then offload the LDA sampling process to each individual GPU, optimizing it from the sampling-algorithm, parallelization, and data-compression perspectives. Experimental evaluations show that CuLDA outperforms state-of-the-art LDA solutions by a large margin (up to 7.3X) on a single GPU, and achieves a further 7.5X speedup on 8 GPUs for large data sets.
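The core computation CuLDA accelerates is the per-token sampling step of LDA inference. As a point of reference, here is a minimal CPU sketch in Python of collapsed Gibbs sampling (CGS), a standard LDA inference algorithm that GPU LDA systems build on; all function and variable names here are ours for illustration, and CuLDA's actual GPU kernels replace this dense O(K) inner loop with sparsity-aware sampling, massive token-level parallelism, and compressed count matrices.

    # Minimal collapsed Gibbs sampling sweep for LDA (illustrative sketch only;
    # not CuLDA's implementation or API).
    import numpy as np

    def gibbs_sweep(docs, z, ndk, nkw, nk, alpha, beta, rng):
        """Resample the topic of every token once.

        docs : list of int arrays; docs[d] holds the word ids of document d
        z    : list of int arrays; z[d][i] is the current topic of token i
        ndk  : (D, K) document-topic counts
        nkw  : (K, V) topic-word counts
        nk   : (K,)   per-topic token totals
        """
        K, V = nkw.shape
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                k = z[d][i]
                # Remove the token's current assignment from all counts.
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Full conditional: p(k) proportional to
                # (ndk + alpha) * (nkw + beta) / (nk + V * beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Record the new assignment and restore the counts.
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
                z[d][i] = k

    # Tiny usage example: 2 documents, vocabulary of 4 words, 2 topics.
    rng = np.random.default_rng(0)
    docs = [np.array([0, 1, 2]), np.array([2, 3])]
    D, V, K = len(docs), 4, 2
    z = [rng.integers(K, size=len(ws)) for ws in docs]
    ndk, nkw, nk = np.zeros((D, K), int), np.zeros((K, V), int), np.zeros(K, int)
    for d, ws in enumerate(docs):
        for i, w in enumerate(ws):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(100):
        gibbs_sweep(docs, z, ndk, nkw, nk, alpha=0.1, beta=0.01, rng=rng)

Every token update touches one row of the document-topic counts, one column of the topic-word counts, and the global topic totals. A multi-GPU implementation therefore partitions documents across devices and must periodically synchronize the shared topic-word matrix, which is what the abstract's workload partitioning and synchronization mechanism addresses.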




Published In

HPDC '19: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing
June 2019
278 pages
ISBN: 9781450366700
DOI: 10.1145/3307681
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 June 2019


Author Tags

  1. GPU
  2. LDA
  3. topic modeling

Qualifiers

  • Research-article

Funding Sources

  • NSF China

Conference

HPDC '19

Acceptance Rates

HPDC '19 paper acceptance rate: 22 of 106 submissions (21%).
Overall acceptance rate: 166 of 966 submissions (17%).


Cited By

  • (2022) CoopMC: Algorithm-Architecture Co-Optimization for Markov Chain Monte Carlo Accelerators. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 38-52. DOI: 10.1109/HPCA53966.2022.00012. Published: Apr 2022.
  • (2021) Change in Threads on Twitter Regarding Influenza, Vaccines, and Vaccination During the COVID-19 Pandemic: Artificial Intelligence–Based Infodemiology Study. JMIR Infodemiology 1(1), e31983. DOI: 10.2196/31983. Published: 14 Oct 2021.
  • (2020) FlexTensor. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 859-873. DOI: 10.1145/3373376.3378508. Published: 9 Mar 2020.
  • (2020) DistSNNMF: Solving Large-Scale Semantic Topic Model Problems on HPC for Streaming Texts. Recent Advances in Intelligent Systems and Smart Applications, 429-449. DOI: 10.1007/978-3-030-47411-9_23. Published: 27 Jun 2020.
  • (2019) SPART: Optimizing CNNs by Utilizing Both Sparsity of Weights and Feature Maps. Advanced Parallel Processing Technologies, 71-85. DOI: 10.1007/978-3-030-29611-7_6. Published: 9 Aug 2019.