An architecture for parallel topic models

Published: 01 September 2010

Abstract

This paper describes a high-performance sampling architecture for inference of latent topic models on a cluster of workstations. Our system is faster than previous work by over an order of magnitude, and it is capable of handling hundreds of millions of documents and thousands of topics.
The algorithm relies on a novel communication structure, namely the use of a distributed (key, value) store for synchronizing the sampler state between computers. Our architecture entirely obviates the need for separate computation and synchronization phases. Instead, disk, CPU, and network are used simultaneously to achieve high performance. We show that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.
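
As a concrete, purely illustrative reading of the abstract, the Python sketch below shows how a sampler can keep working while a concurrent thread reconciles its state with a shared (key, value) store: each worker updates only local word-topic counts, and a synchronization thread continuously pushes the unsynchronized deltas and pulls back the merged global counts. The KVStore class, the push/pull bookkeeping, and the random toy "sampler" are assumptions made for this sketch; they are not the authors' implementation, which uses an actual distributed store and a real topic-model sampler over document collections.

import random
import threading
import time
from collections import defaultdict

class KVStore:
    """In-memory stand-in for a distributed (key, value) store of global counts."""
    def __init__(self):
        self._counts = defaultdict(int)
        self._lock = threading.Lock()

    def add_and_get(self, key, delta):
        # Atomically fold a worker's local delta into the global count and
        # return the merged value, which includes other workers' updates.
        with self._lock:
            self._counts[key] += delta
            return self._counts[key]

    def total(self):
        with self._lock:
            return sum(self._counts.values())

class Worker:
    def __init__(self, store, vocab, num_topics):
        self.store = store
        self.vocab = vocab
        self.num_topics = num_topics
        keys = [(w, t) for w in vocab for t in range(num_topics)]
        self.local = {k: 0 for k in keys}        # counts produced by this worker
        self.pushed = {k: 0 for k in keys}       # portion already sent to the store
        self.global_view = {k: 0 for k in keys}  # last merged counts pulled back
        self.done = threading.Event()

    def sample(self, iterations=20000):
        # Toy stand-in for a sampling sweep: it touches only local state,
        # so the network never blocks sampling.
        for _ in range(iterations):
            word = random.choice(self.vocab)
            topic = random.randrange(self.num_topics)
            self.local[(word, topic)] += 1
        self.done.set()

    def _push_pull(self):
        for key, count in self.local.items():
            delta = count - self.pushed[key]
            if delta:
                self.global_view[key] = self.store.add_and_get(key, delta)
                self.pushed[key] += delta

    def sync_loop(self):
        # Runs concurrently with sample(): there is no separate synchronization
        # phase, only continuous reconciliation with the shared store.
        while not self.done.is_set():
            self._push_pull()
            time.sleep(0.001)
        self._push_pull()  # flush whatever remains once sampling has finished

if __name__ == "__main__":
    store = KVStore()
    vocab = ["model", "topic", "data", "cluster"]
    workers = [Worker(store, vocab, num_topics=4) for _ in range(2)]
    threads = [threading.Thread(target=f)
               for w in workers for f in (w.sync_loop, w.sample)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Every local update eventually reaches the global store.
    assert store.total() == sum(sum(w.local.values()) for w in workers)
    print("global count total:", store.total())

In this toy run the final assertion only checks that every local update eventually reaches the shared store; in the architecture described above, the store is spread across the cluster and the merged counts pulled back into global_view are what a sampler would consult on its next sweep, so that computation and network traffic overlap instead of alternating.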

Information

Published In

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages

Publisher

VLDB Endowment

Publication History

Published: 01 September 2010
Published in PVLDB Volume 3, Issue 1-2

Qualifiers

  • Research-article

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 42
  • Downloads (Last 6 weeks): 7
Reflects downloads up to 18 Dec 2024

Citations

  • (2024) Local AdaGrad-type algorithm for stochastic convex-concave optimization. Machine Learning 113(4): 1819-1838. DOI: 10.1007/s10994-022-06239-z. Online publication date: 1-Apr-2024.
  • (2023) DIV-DU: Data Integrity Verification and Dynamic Update of Cloud Storage in Distributed Machine Learning. Proceedings of the 2023 International Joint Conference on Robotics and Artificial Intelligence: 8-12. DOI: 10.1145/3632971.3632984. Online publication date: 7-Jul-2023.
  • (2023) Optimizing Tensor Computations: From Applications to Compilation and Runtime Techniques. Companion of the 2023 International Conference on Management of Data: 53-59. DOI: 10.1145/3555041.3589407. Online publication date: 4-Jun-2023.
  • (2022) NuPS: A Parameter Server for Machine Learning with Non-Uniform Parameter Access. Proceedings of the 2022 International Conference on Management of Data: 481-495. DOI: 10.1145/3514221.3517860. Online publication date: 10-Jun-2022.
  • (2022) Federated Data Preparation, Learning, and Debugging in Apache SystemDS. Proceedings of the 31st ACM International Conference on Information & Knowledge Management: 4813-4817. DOI: 10.1145/3511808.3557162. Online publication date: 17-Oct-2022.
  • (2022) The Evolution of Topic Modeling. ACM Computing Surveys 54(10s): 1-35. DOI: 10.1145/3507900. Online publication date: 10-Nov-2022.
  • (2022) DOSP: an optimal synchronization of parameter server for distributed machine learning. The Journal of Supercomputing 78(12): 13865-13892. DOI: 10.1007/s11227-022-04422-6. Online publication date: 1-Aug-2022.
  • (2022) Iteration number-based hierarchical gradient aggregation for distributed deep learning. The Journal of Supercomputing 78(4): 5565-5587. DOI: 10.1007/s11227-021-04083-x. Online publication date: 1-Mar-2022.
  • (2022) On the performance and convergence of distributed stream processing via approximate fault tolerance. The VLDB Journal 28(5): 821-846. DOI: 10.1007/s00778-019-00565-w. Online publication date: 11-Mar-2022.
  • (2021) Fast doubly-adaptive MCMC to estimate the Gibbs partition function with weak mixing time bounds. Proceedings of the 35th International Conference on Neural Information Processing Systems: 25760-25772. DOI: 10.5555/3540261.3542233. Online publication date: 6-Dec-2021.