research-article

An architecture for parallel topic models

Published: 01 September 2010

Abstract

This paper describes a high-performance sampling architecture for inference of latent topic models on a cluster of workstations. Our system is faster than previous work by more than an order of magnitude, and it can handle hundreds of millions of documents and thousands of topics.
The algorithm relies on a novel communication structure, namely the use of a distributed (key, value) storage for synchronizing the sampler state between computers. Our architecture entirely obviates the need for separate computation and synchronization phases. Instead, disk, CPU, and network are used simultaneously to achieve high performance. We show that this architecture is entirely general and that it can be extended easily to more sophisticated latent variable models such as n-grams and hierarchies.
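
The idea of synchronizing sampler state through a (key, value) store while sampling continues can be illustrated with a small single-process sketch. This is not the authors' implementation. It assumes an LDA-style model sampled with a collapsed Gibbs sampler, and it uses a lock-protected dictionary as a stand-in for the distributed store. All names (`KVStore`, `Worker`, `sync_once`, `run`) and the hyperparameter values are hypothetical. Each worker samples against a local, possibly stale view of the counts while a separate thread pushes its pending deltas and pulls fresh global values, so there is no separate synchronization phase.

```python
import random
import threading
from collections import defaultdict


class KVStore:
    """In-process stand-in (assumption) for a distributed (key, value) store."""

    def __init__(self):
        self._data = defaultdict(int)
        self._lock = threading.Lock()

    def add_and_get(self, key, delta):
        # Atomically apply a count delta and return the new global value.
        with self._lock:
            self._data[key] += delta
            return self._data[key]

    def snapshot(self):
        with self._lock:
            return dict(self._data)


class Worker:
    """Owns one shard; keys are (word, topic), or (None, topic) for topic totals."""

    def __init__(self, docs, num_topics, store, seed):
        self.docs, self.K, self.store = docs, num_topics, store
        self.rng = random.Random(seed)
        self.lock = threading.Lock()
        self.view = defaultdict(int)     # local, possibly stale view of global counts
        self.pending = defaultdict(int)  # local changes not yet pushed to the store
        self.n_dk = [[0] * num_topics for _ in docs]  # per-document topic counts
        self.z = [[self.rng.randrange(num_topics) for _ in doc] for doc in docs]
        for d, doc in enumerate(docs):
            for w, k in zip(doc, self.z[d]):
                self._bump(d, w, k, +1)

    def _bump(self, d, w, k, delta):
        self.n_dk[d][k] += delta
        for key in ((w, k), (None, k)):
            self.view[key] += delta
            self.pending[key] += delta

    def sweep(self, vocab_size, alpha=0.1, beta=0.01):
        # One collapsed Gibbs pass over the shard, using the local view only.
        for d, doc in enumerate(self.docs):
            for i, w in enumerate(doc):
                with self.lock:
                    self._bump(d, w, self.z[d][i], -1)
                    weights = [
                        (self.n_dk[d][k] + alpha)
                        * (self.view[(w, k)] + beta)
                        / (self.view[(None, k)] + vocab_size * beta)
                        for k in range(self.K)
                    ]
                    self.z[d][i] = self.rng.choices(range(self.K), weights)[0]
                    self._bump(d, w, self.z[d][i], +1)

    def sync_once(self):
        # Push pending deltas and pull fresh global counts, one key at a time.
        with self.lock:
            keys = list(self.view)
        for key in keys:
            with self.lock:
                delta = self.pending.pop(key, 0)
            fresh = self.store.add_and_get(key, delta)
            with self.lock:
                # Global value plus whatever was sampled since the delta was taken.
                self.view[key] = fresh + self.pending.get(key, 0)


def run(shards, num_topics=3, vocab_size=6, sweeps=20):
    store = KVStore()
    workers = [Worker(s, num_topics, store, seed=i) for i, s in enumerate(shards)]
    stop = threading.Event()

    def sync_loop(worker):
        # Runs concurrently with sampling until asked to stop.
        while not stop.wait(0.001):
            worker.sync_once()

    def sample_loop(worker):
        for _ in range(sweeps):
            worker.sweep(vocab_size)

    syncers = [threading.Thread(target=sync_loop, args=(w,)) for w in workers]
    samplers = [threading.Thread(target=sample_loop, args=(w,)) for w in workers]
    for t in syncers + samplers:
        t.start()
    for t in samplers:
        t.join()
    stop.set()
    for t in syncers:
        t.join()
    for w in workers:
        w.sync_once()  # final flush so the store reflects every assignment
    return store, workers
```

Calling `run([[[0, 1, 2, 0], [1, 1, 3]], [[4, 5, 4], [2, 5, 5, 0]]])` samples two shards of word-id lists in parallel. Because workers exchange deltas rather than absolute counts, the store converges to the exact global counts after the final flush, even though each worker sampled from a stale view along the way.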

Cited By

  • (2025) Parallelize Single-Site Dynamics up to Dobrushin Criterion. Journal of the ACM 72:1 (1-33). DOI: 10.1145/3708558. Online publication date: 25-Jan-2025
  • (2025) A Survey on Parameter Server Architecture: Approaches for Optimizing Distributed Centralized Learning. IEEE Access 13 (30993-31015). DOI: 10.1109/ACCESS.2025.3535085. Online publication date: 2025
  • (2025) Topic modelling through the bibliometrics lens and its technique. Artificial Intelligence Review 58:3. DOI: 10.1007/s10462-024-11011-x. Online publication date: 6-Jan-2025

Information & Contributors

Information

Published In

Proceedings of the VLDB Endowment, Volume 3, Issue 1-2
September 2010
1658 pages

Publisher

VLDB Endowment

Publication History

Published: 01 September 2010
Published in PVLDB Volume 3, Issue 1-2

Qualifiers

  • Research-article

Citations

  • (2024) Distributed bilevel optimization with communication compression. Proceedings of the 41st International Conference on Machine Learning (17877-17920). DOI: 10.5555/3692070.3692786. Online publication date: 21-Jul-2024
  • (2024) A Distributed Training Framework for Cross-Domain Heterogeneous Clusters. Proceedings of the 2024 4th International Conference on Signal Processing and Communication Technology (254-259). DOI: 10.1145/3712464.3712510. Online publication date: 27-Dec-2024
  • (2024) Deep Class-Incremental Learning From Decentralized Data. IEEE Transactions on Neural Networks and Learning Systems 35:5 (7190-7203). DOI: 10.1109/TNNLS.2022.3214573. Online publication date: May-2024
  • (2024) Markov Chain Monte Carlo Multiscan Data Association for Sets of Trajectories. IEEE Transactions on Aerospace and Electronic Systems 60:6 (7804-7819). DOI: 10.1109/TAES.2024.3419785. Online publication date: Dec-2024
  • (2024) LHCC: Low-Latency and Hi-Precision Congestion Control in RDMA Datacenter Networks. 2024 IEEE/ACM 32nd International Symposium on Quality of Service (IWQoS) (1-10). DOI: 10.1109/IWQoS61813.2024.10682889. Online publication date: 19-Jun-2024
  • (2024) Machine Learning With Computer Networks: Techniques, Datasets, and Models. IEEE Access 12 (54673-54720). DOI: 10.1109/ACCESS.2024.3384460. Online publication date: 2024
  • (2024) Asynchronous Algorithms in Distributed Optimization Over Multi-Agent Network. Reference Module in Materials Science and Materials Engineering. DOI: 10.1016/B978-0-443-14081-5.00084-2. Online publication date: 2024
