DOI: 10.1145/3307681.3325407

CuLDA: Solving Large-scale LDA Problems on GPUs

Published: 17 June 2019

Abstract

Latent Dirichlet Allocation (LDA) is a popular topic model. Because the input corpus of an LDA algorithm typically consists of millions to billions of tokens, LDA training is very time-consuming, which prevents the adoption of LDA in many scenarios, e.g., online services. GPUs have benefited modern machine learning algorithms and big data analysis, as they provide high memory bandwidth and tremendous computational power; accordingly, many frameworks, e.g., TensorFlow, Caffe, and CNTK, support GPUs for accelerating data-intensive machine learning algorithms. However, we observe that the performance of existing LDA solutions on GPUs is not satisfying. In this paper, we present CuLDA, an efficient and scalable GPU-based approach to large-scale LDA problems. CuLDA is designed to solve LDA problems at high throughput. To this end, we first carefully design a workload partitioning and synchronization mechanism to exploit multiple GPUs. We then offload the LDA sampling process to each individual GPU, optimizing it from the sampling-algorithm, parallelization, and data-compression perspectives. Experimental evaluations show that CuLDA outperforms state-of-the-art LDA solutions by a large margin (up to 7.3X) on a single GPU, and achieves a further 7.5X speedup on 8 GPUs for large data sets.
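The core computation CuLDA accelerates is the per-token sampling step of LDA inference. As a point of reference, here is a minimal CPU sketch in Python of collapsed Gibbs sampling (CGS), a standard LDA inference algorithm that GPU LDA systems build on; all function and variable names here are ours for illustration, and CuLDA's actual GPU kernels replace this dense O(K) inner loop with sparsity-aware sampling, massive token-level parallelism, and compressed count matrices.

    # Minimal collapsed Gibbs sampling sweep for LDA (illustrative sketch only;
    # not CuLDA's implementation or API).
    import numpy as np

    def gibbs_sweep(docs, z, ndk, nkw, nk, alpha, beta, rng):
        """Resample the topic of every token once.

        docs : list of int arrays; docs[d] holds the word ids of document d
        z    : list of int arrays; z[d][i] is the current topic of token i
        ndk  : (D, K) document-topic counts
        nkw  : (K, V) topic-word counts
        nk   : (K,)   per-topic token totals
        """
        K, V = nkw.shape
        for d, words in enumerate(docs):
            for i, w in enumerate(words):
                k = z[d][i]
                # Remove the token's current assignment from all counts.
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Full conditional: p(k) proportional to
                # (ndk + alpha) * (nkw + beta) / (nk + V * beta)
                p = (ndk[d] + alpha) * (nkw[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                # Record the new assignment and restore the counts.
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
                z[d][i] = k

    # Tiny usage example: 2 documents, vocabulary of 4 words, 2 topics.
    rng = np.random.default_rng(0)
    docs = [np.array([0, 1, 2]), np.array([2, 3])]
    D, V, K = len(docs), 4, 2
    z = [rng.integers(K, size=len(ws)) for ws in docs]
    ndk, nkw, nk = np.zeros((D, K), int), np.zeros((K, V), int), np.zeros(K, int)
    for d, ws in enumerate(docs):
        for i, w in enumerate(ws):
            ndk[d, z[d][i]] += 1; nkw[z[d][i], w] += 1; nk[z[d][i]] += 1
    for _ in range(100):
        gibbs_sweep(docs, z, ndk, nkw, nk, alpha=0.1, beta=0.01, rng=rng)

Every token update touches one row of the document-topic counts, one column of the topic-word counts, and the global topic totals. A multi-GPU implementation therefore partitions documents across devices and must periodically synchronize the shared topic-word matrix, which is what the abstract's workload partitioning and synchronization mechanism addresses.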




Published In

HPDC '19: Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing
June 2019
278 pages
ISBN: 9781450366700
DOI: 10.1145/3307681
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 17 June 2019


Author Tags

  1. GPU
  2. LDA
  3. topic modeling

Qualifiers

  • Research-article

Funding Sources

  • NSF China

Conference

HPDC '19

Acceptance Rates

HPDC '19 paper acceptance rate: 22 of 106 submissions (21%).
Overall acceptance rate: 166 of 966 submissions (17%).


Cited By

  • (2022) CoopMC: Algorithm-Architecture Co-Optimization for Markov Chain Monte Carlo Accelerators. 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA), 38-52. DOI: 10.1109/HPCA53966.2022.00012. Published: Apr 2022.
  • (2021) Change in Threads on Twitter Regarding Influenza, Vaccines, and Vaccination During the COVID-19 Pandemic: Artificial Intelligence–Based Infodemiology Study. JMIR Infodemiology 1(1), e31983. DOI: 10.2196/31983. Published: 14 Oct 2021.
  • (2020) FlexTensor. Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, 859-873. DOI: 10.1145/3373376.3378508. Published: 9 Mar 2020.
  • (2020) DistSNNMF: Solving Large-Scale Semantic Topic Model Problems on HPC for Streaming Texts. Recent Advances in Intelligent Systems and Smart Applications, 429-449. DOI: 10.1007/978-3-030-47411-9_23. Published: 27 Jun 2020.
  • (2019) SPART: Optimizing CNNs by Utilizing Both Sparsity of Weights and Feature Maps. Advanced Parallel Processing Technologies, 71-85. DOI: 10.1007/978-3-030-29611-7_6. Published: 9 Aug 2019.