
GibbsLDA++ is a C/C++ implementation of Latent Dirichlet Allocation (LDA) that uses Gibbs sampling for parameter estimation and inference. It is very fast and is designed to analyze the hidden/latent topic structures of large-scale datasets, including large collections of text/Web documents. LDA was first introduced by David Blei et al. [Blei03]. There have been several implementations of this model in C (using variational methods), Java, and Matlab. We decided to release this implementation of LDA in C/C++ using Gibbs sampling to provide an alternative to the topic-model community.

GibbsLDA++ is useful for the following potential application areas:

  • Information retrieval and search (analyzing semantic/latent topic/concept structures of large text collections for more intelligent information search).
  • Document classification/clustering, document summarization, and text/Web mining in general.
  • Content-based image clustering, object recognition, and other applications of computer vision in general.
  • Other potential applications in biological data.

Contact us: all comments, suggestions, and bug reports are highly appreciated. If you run into any problems, please contact us:

Xuan-Hieu Phan (pxhieu at gmail dot com), was at Tohoku University, Japan (now at Vietnam National University, Hanoi)
Cam-Tu Nguyen (ncamtu at gmail dot com), was at Vietnam National University, Hanoi (now at Google Japan)

License: GibbsLDA++ is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. GibbsLDA++ is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with GibbsLDA++; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA.

You can find and download the source code, documentation, and case studies of GibbsLDA++ at the project page. You should download version 0.2, which includes bug fixes and code optimizations and is thus faster than version 0.1.

A Java implementation (JGibbLDA) is also available. You can download it at its project page.

Here are some other tools developed by the same author(s):

  • FlexCRFs: Flexible Conditional Random Fields
  • CRFTagger: CRF English POS Tagger
  • CRFChunker: CRF English Phrase Chunker
  • JTextPro: A Java-based Text Processing Toolkit
  • JWebPro: A Java-based Web Processing Toolkit
  • JVnSegmenter: A Java-based Vietnamese Word Segmentation Tool
  • JVnTextPro: A Java-based Vietnamese Text Processing Tool

Environments

Unix, Linux, Cygwin, and MinGW

System requirements

A C/C++ compiler and the STL library. The Makefile uses g++ as the default compiler command. If the C/C++ compiler on your system has another name (e.g., cc, cpp, CC, CPP, etc.), modify the CC variable in the Makefile so that make runs smoothly.

The computation time of GibbsLDA++ depends heavily on the size of the input data, the CPU speed, and the memory size. If your dataset is quite large (e.g., larger than 100,000 documents or so), it is better to train GibbsLDA++ on a system with at least a 2 GHz CPU and 1 GB of RAM.

Untar and unzip the source code

$ gunzip GibbsLDA++.tar.gz

$ tar -xf GibbsLDA++.tar

Compile

Go to the home directory of GibbsLDA++ (i.e. GibbsLDA++ directory), and type:

$ make clean

$ make all

After compiling GibbsLDA++, we have an executable file "lda" in the GibbsLDA++/src directory.

Parameter estimation from scratch

Command line:

$ lda -est [-alpha <double>] [-beta <double>] [-ntopics <int>] [-niters <int>] [-savestep <int>] [-twords <int>] -dfile <string>

in which (parameters in [] are optional):

  • -est: Estimate the LDA model from scratch
  • -alpha <double>: The value of alpha, a hyper-parameter of LDA. The default value of alpha is 50 / K (K is the number of topics). See [Griffiths04] for a detailed discussion of choosing alpha and beta values.
  • -beta <double>: The value of beta, also a hyper-parameter of LDA. Its default value is 0.1.
  • -ntopics <int>: The number of topics. Its default value is 100. This depends on the input dataset. See [Griffiths04] and [Blei03] for a more careful discussion of selecting the number of topics.
  • -niters <int>: The number of Gibbs sampling iterations. The default value is 2000.
  • -savestep <int>: The step (counted by the number of Gibbs sampling iterations) at which the LDA model is saved to hard disk. The default value is 200.
  • -twords <int>: The number of most likely words for each topic. The default value is zero. If you set this parameter to a value larger than zero, e.g., 20, GibbsLDA++ will print out the list of the top 20 most likely words for each topic each time it saves the model to hard disk, according to the savestep parameter above.
  • -dfile <string>: The input training data file. See section "Input data format" for a description of input data format.
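
As a rough illustration of how the options above fit together, the following Python sketch assembles an argument list for lda -est and fills in the documented defaults, including alpha = 50 / K. The function name build_est_command and the lda_bin parameter are hypothetical helpers introduced here for illustration; they are not part of GibbsLDA++.

```python
def build_est_command(dfile, ntopics=100, alpha=None, beta=0.1,
                      niters=2000, savestep=200, twords=0, lda_bin="lda"):
    """Assemble the argument list for `lda -est` with the documented defaults.

    build_est_command and lda_bin are illustrative names, not GibbsLDA++ APIs.
    """
    if alpha is None:
        # Documented default: alpha = 50 / K, where K is the number of topics.
        alpha = 50.0 / ntopics
    return [
        lda_bin, "-est",
        "-alpha", str(alpha),
        "-beta", str(beta),
        "-ntopics", str(ntopics),
        "-niters", str(niters),
        "-savestep", str(savestep),
        "-twords", str(twords),
        "-dfile", dfile,
    ]


cmd = build_est_command("models/casestudy/trndocs.dat")
print(" ".join(cmd))
```

The resulting list could be passed to subprocess.run if the lda executable is on the search path; spelling out every option makes the effective settings visible in logs.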

Parameter estimation from a previously estimated model

Command line:

$ lda -estc -dir <string> -model <string> [-niters <int>] [-savestep <int>] [-twords <int>]

in which (parameters in [] are optional):

  • -estc: Continue to estimate the model from a previously estimated model.
  • -dir <string>: The directory containing the previously estimated model.
  • -model <string>: The name of the previously estimated model. See section "Outputs" to know how GibbsLDA++ saves outputs on hard disk.
  • -niters <int>: The number of Gibbs sampling iterations to continue estimating. The default value is 2000.
  • -savestep <int>: The step (counted by the number of Gibbs sampling iterations) at which the LDA model is saved to hard disk. The default value is 200.
  • -twords <int>: The number of most likely words for each topic. The default value is zero. If you set this parameter to a value larger than zero, e.g., 20, GibbsLDA++ will print out the list of the top 20 most likely words for each topic each time it saves the model to hard disk, according to the savestep parameter above.

Inference for previously unseen (new) data

Command line:

$ lda -inf -dir <string> -model <string> [-niters <int>] [-twords <int>] -dfile <string>

in which (parameters in [] are optional):

  • -inf: Do inference for previously unseen (new) data using a previously estimated LDA model.
  • -dir <string>: The directory containing the previously estimated model.
  • -model <string>: The name of the previously estimated model. See section "Outputs" to know how GibbsLDA++ saves outputs on hard disk.
  • -niters <int>: The number of Gibbs sampling iterations for inference. The default value is 20.
  • -twords <int>: The number of most likely words for each topic of the new data. The default value is zero. If you set this parameter to a value larger than zero, e.g., 20, GibbsLDA++ will print out the list of the top 20 most likely words for each topic after inference.
  • -dfile <string>: The file containing the new data. See section "Input data format" for a description of input data format.

Input data format

Both data for training/estimating the model and new data (i.e., previously unseen data) have the same format as follows:

[M]

[document1]

[document2]

...

[documentM]

in which the first line is the total number of documents [M]. Each line after that is one document. [document_i] is the i-th document of the dataset and consists of a list of N_i words/terms.

[document_i] = [word_i1] [word_i2] ... [word_iN_i]

in which all [word_ij] (i=1..M, j=1..N_i) are text strings separated by the blank character.

Note that the terms document and word here are abstract and should not only be understood as normal text documents. This is because LDA can be used to discover the underlying topic structures of any kind of discrete data. Therefore, GibbsLDA++ is not limited to text and natural language processing but can also be applied to other kinds of data like images and biological sequences. Also, keep in mind that for text/Web data collections, we should first preprocess the data (e.g., removing stop words and rare words, stemming, etc.) before estimating with GibbsLDA++.
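
The format described above can be produced with a few lines of code. The sketch below, in Python, serializes already-tokenized documents into that layout: the document count on the first line, then one blank-separated document per line. The function name to_gibbslda_input and the sample tokens are made up for illustration, and the sketch assumes that preprocessing has already been done and that no token contains whitespace.

```python
def to_gibbslda_input(documents):
    """Serialize tokenized documents: first line is M, then one document per line."""
    lines = [str(len(documents))]
    for tokens in documents:
        # Words/terms within a document are separated by a single blank.
        lines.append(" ".join(tokens))
    return "\n".join(lines) + "\n"


# Hypothetical toy corpus; the "words" may be any discrete symbols.
docs = [["topic", "model", "gibbs"], ["image", "patch", "patch"]]
text = to_gibbslda_input(docs)
print(text, end="")
```

Writing text to a file then gives something that can be passed to the -dfile option.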

Outputs

Outputs of Gibbs sampling estimation of GibbsLDA++ include the following files:

<model_name>.others

<model_name>.phi

<model_name>.theta

<model_name>.tassign

<model_name>.twords

in which:

<model_name>: is the name of an LDA model corresponding to the time step at which it was saved to hard disk. For example, the model saved at the 400th Gibbs sampling iteration is named model-00400. Similarly, the model saved at the 1200th iteration is named model-01200. The model saved after the last Gibbs sampling iteration is named model-final.
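
The naming scheme can be reproduced with a zero-padded, five-digit iteration number. The Python sketch below is an illustration only: model_name and checkpoint_names are hypothetical helpers, and the second one assumes that a model is written whenever the iteration count is a multiple of savestep and that model-final is written at the end.

```python
def model_name(iteration=None):
    """Return the model name for a saved iteration, or model-final when None."""
    if iteration is None:
        return "model-final"
    return f"model-{iteration:05d}"


def checkpoint_names(niters, savestep, start=0):
    """List the names expected after running niters more iterations from start.

    Assumption: a model is saved whenever the iteration count is a multiple
    of savestep, and model-final is saved after the last iteration.
    """
    last = start + niters
    names = [model_name(i) for i in range(start + 1, last + 1) if i % savestep == 0]
    return names + [model_name()]


print(model_name(400))
print(checkpoint_names(1000, 100))
```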

<model_name>.others: This file contains some parameters of LDA model, such as:

alpha=?

beta=?

ntopics=? # i.e., number of topics

ndocs=? # i.e., number of documents

nwords=? # i.e., the vocabulary size

liter=? # i.e., the Gibbs sampling iteration at which the model was saved
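
A file in this key=value form is easy to read back. The Python sketch below parses such text into a dictionary; parse_others is a hypothetical helper, the sample values are invented for illustration, and the handling of trailing "#" comments is a precaution rather than something the format is known to require.

```python
def parse_others(text):
    """Parse key=value lines of a .others file into a dict."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop any trailing comment
        if "=" not in line:
            continue
        key, value = (part.strip() for part in line.split("=", 1))
        # alpha and beta are real-valued; the remaining fields are counts.
        params[key] = float(value) if key in ("alpha", "beta") else int(value)
    return params


# Invented sample content, not output from a real run.
SAMPLE = "alpha=0.5\nbeta=0.1\nntopics=100\nndocs=2000\nnwords=15000\nliter=400\n"
print(parse_others(SAMPLE))
```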

<model_name>.phi: This file contains the word-topic distributions, i.e., p(word_w | topic_t). Each line is a topic and each column is a word in the vocabulary.

<model_name>.theta: This file contains the topic-document distributions, i.e., p(topic_t | document_m). Each line is a document and each column is a topic.
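
Since both .phi and .theta are matrices stored one row per line, they can be loaded the same way. The Python sketch below assumes the columns are separated by whitespace (an assumption about the file layout) and then picks the most probable topic for each document from theta. The helper names and the sample numbers are illustrative only.

```python
def parse_matrix(text):
    """Parse a row-per-line matrix of whitespace-separated probabilities."""
    return [[float(x) for x in line.split()]
            for line in text.splitlines() if line.strip()]


def dominant_topics(theta):
    """For each document (row), return the index of its most probable topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in theta]


# Invented theta with 2 documents and 3 topics.
theta = parse_matrix("0.7 0.2 0.1\n0.1 0.1 0.8\n")
print(dominant_topics(theta))
```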

<model_name>.tassign: This file contains the topic assignments for words in the training data. Each line is a document that consists of a list of <word_ij>:<topic of word_ij> pairs.
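
A line of this file can be split back into (word, topic) pairs. The Python sketch below assumes the pairs are separated by whitespace and splits each pair at its last colon; the word part is kept as a string so that it works whether the file stores text or integer IDs. The helper names and the sample line are invented for illustration.

```python
from collections import Counter


def parse_tassign_line(line):
    """Split one document line into (word, topic) pairs."""
    pairs = []
    for item in line.split():
        word, topic = item.rsplit(":", 1)  # split at the last colon
        pairs.append((word, int(topic)))
    return pairs


def topic_counts(pairs):
    """Count how many word occurrences in a document were assigned to each topic."""
    return Counter(topic for _, topic in pairs)


pairs = parse_tassign_line("12:3 7:0 12:3")  # invented sample line
print(pairs, dict(topic_counts(pairs)))
```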

<model_name>.twords: This file contains the twords most likely words of each topic. The value of twords is specified on the command line.

GibbsLDA++ also saves a file called wordmap.txt that contains the mapping between words and their integer IDs. This is because GibbsLDA++ works internally with integer IDs of words/terms instead of text strings.

Outputs of Gibbs sampling inference for previously unseen data

The outputs of GibbsLDA++ inference are almost the same as those of the estimation process, except that the contents of those files describe the new data. The <model_name> is exactly the same as the filename of the input (new) data.

For example, suppose we want to estimate an LDA model for a collection of documents stored in a file called models/casestudy/trndocs.dat and then use that model to do inference for new data stored in the file models/casestudy/newdocs.dat.

We want to estimate 100 topics with alpha = 0.5 and beta = 0.1. We want to perform 1000 Gibbs sampling iterations, save a model every 100 iterations, and, each time a model is saved, print out the list of the 20 most likely words for each topic. Assuming that we are in the home directory of GibbsLDA++, we execute the following command to estimate the LDA model from scratch:

$ src/lda -est -alpha 0.5 -beta 0.1 -ntopics 100 -niters 1000 -savestep 100 -twords 20 -dfile models/casestudy/trndocs.dat

Now look into the models/casestudy directory to see the outputs.

Now we want to perform another 800 Gibbs sampling iterations starting from the previously estimated model model-01000, with savestep = 100 and twords = 30. We run the following command:

$ src/lda -estc -dir models/casestudy/ -model model-01000 -niters 800 -savestep 100 -twords 30

Now, look into the casestudy directory to see the outputs.

Now, if we want to do inference (30 Gibbs sampling iterations) for the new data newdocs.dat (note that the new data file is stored in the same directory as the LDA models) using one of the previously estimated LDA models, for example model-01800, we run the following command:

$ src/lda -inf -dir models/casestudy/ -model model-01800 -niters 30 -twords 20 -dfile newdocs.dat

Now look into the casestudy directory to see the outputs of the inference:

newdocs.dat.others

newdocs.dat.phi

newdocs.dat.tassign

newdocs.dat.theta

newdocs.dat.twords
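
Because the inference outputs take the name of the input file, their paths can be derived from the -dir and -dfile arguments. The Python sketch below does this for the five files listed above; inference_outputs is a hypothetical helper, and posixpath is used so that the result uses forward slashes on any platform.

```python
import posixpath


def inference_outputs(model_dir, dfile):
    """Return the expected inference output paths for a new data file."""
    suffixes = ("others", "phi", "tassign", "theta", "twords")
    return [posixpath.join(model_dir, f"{dfile}.{suffix}") for suffix in suffixes]


for path in inference_outputs("models/casestudy/", "newdocs.dat"):
    print(path)
```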

Here are the outputs of two large-scale datasets:

Research and papers that use GibbsLDA++ for conducting experiments should include the following citation:

Xuan-Hieu Phan and Cam-Tu Nguyen. GibbsLDA++: A C/C++ implementation of latent Dirichlet allocation (LDA), 2007

Here is an incomplete list of published papers that use and cite GibbsLDA++:

  1. Michael Welch et al. Search Result Diversity for Information Queries. In Proceedings of The 20th International World Wide Web Conference (WWW 2011), Hyderabad, India.
  2. V. Suresh et al. A Non-syntactic Approach for Text Sentiment Classification with Stopwords, In Proceedings of The 20th International World Wide Web Conference (WWW 2011), Hyderabad, India.
  3. Xuan-Hieu Phan et al. A hidden topic-based framework toward building applications with short Web documents, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), Vol.23, No.7, 2011.
  4. C. Lin et al. Weakly-supervised Joint Sentiment-Topic Detection from Text, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), to appear.
  5. Shuguang Li and Suresh Manandhar. Improving question recommendation by exploiting information need, In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), 2011.
  6. Xin Zhao et al. Comparing Twitter and traditional media using topic models. In Proceedings of the 33rd European Conference on Information Retrieval (ECIR), 2011.
  7. Sebastian GA Konietzny et al. Inferring functional modules of protein families with probabilistic topic models, BMC Bioinformatics, 12:141, 2011.
  8. Lidong Bing and Wai Lam. Investigation of Web Query Refinement via Topic Analysis and Learning with Personalization, In Proceedings of the ACM SIGIR Workshop on Query Representation and Understanding, 2011.
  9. Fady Draidi et al. P2Prec: a P2P recommendation system for large-scale data sharing, Transactions on Large-Scale Data and Knowledge-Centered System III, 2011.
  10. Hiroyuki Koga and Tadahiro Taniguchi. Developing a User Recommendation Engine on Twitter Using Estimated Latent Topics, In Proceedings of the Human Computer Interaction (HCI), 2011.
  11. Raphael Rubino and Georges Linares. A multi-view approach for term translation spotting. In Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2011), 2011.
  12. Viet Cuong Nguyen et al. Improving Text Segmentation with Non-systematic Semantic Relation. In Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2011), 2011.
  13. Nathalie Camelin et al. Unsupervised concept annotation using latent Dirichlet allocation and segmental methods, In Proceedings of the EMNLP Workshop on Unsupervised Learning in NLP, 2011.
  14. Avishay Livne et al. Networks and language in the 2010 election, In Proceedings of the 4th Annual Political Networks Conference, 2011.
  15. Evan Wei Xiang et al. Bridging domains using World Wide Web knowledge for transfer learning, IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), Vol.22, No.6, 2010.
  16. Kerui Min et al. Decomposing background topics from keywords by principal component pursuit, In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (ACM CIKM), 2010.
  17. Samuel Brody and Noemie Elhadad. An Unsupervised Aspect-Sentiment Model for Online Reviews, In Proceedings of the 2010 Human Language Technologies and The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (HLT-NAACL), PA, USA, 2010.
  18. Samuel Brody and Noemie Elhadad. Detecting salient aspects in online reviews of health providers, In Proceedings of the AMIA Annual Symposium, 2010.
  19. Georgiana Dinu and Mirella Lapata. Topic models for meaning similarity in context, In Proceedings of the 23rd International Conference on Computational Linguistics (COLING), PA, USA, 2010.
  20. Ye Tian et al. Topic detection and organization of mobile text messages, In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (ACM CIKM), 2010.
  21. Wayne Zhao et al. Context modeling for ranking and tagging bursty features in text streams, In Proceedings of the 19th ACM International Conference on Information and Knowledge Management (ACM CIKM), 2010.
  22. Lei Zhang et al. A hybrid unsupervised image re-ranking approach with latent topic contents, In Proceedings of the ACM International Conference on Image and Video Retrieval (ACM CIVR), 2010.
  23. Trevor Fountain and Mirella Lapata. Meaning representation in natural language categorization, In Proceedings of the Annual Meeting of the Cognitive Science Society (COGSCI), 2010.
  24. Gaston L’Huillier et al. Topic-based social network analysis for virtual communities of interests in the Dark Web, In Proceedings of the ACM SIGKDD Workshop on Intelligence and Security Informatics, 2010.
  25. Han Xiao and Thomas Stibor. Efficient Collapsed Gibbs Sampling for Latent Dirichlet Allocation, In Proceedings of the 2nd Asian Conference on Machine Learning (ACML), 2010, Tokyo.
  26. Sanae Fujita et al. MSS: Investigating the effectiveness of domain combinations and topic features for word sense disambiguation, In Proceedings of the 5th International Workshop on Semantic Evaluation, USA, 2010.
  27. Xiao Zhang and Prasenjit Mitra. Learning topical transition probabilities in click through data with regression models, In Proceedings of 13th International Workshop on the Web and Databases (WebDB), 2010.
  28. Piotr Mirowski et al. Dynamic Auto-Encoders for Semantic Indexing, In Proceedings of the NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2010.
  29. Trevor Savage et al. TopicXP: Exploring Topics in Source Code using Latent Dirichlet Allocation, In Proceedings of the 26th IEEE International Conference on Software Maintenance, Romania, 2010.
  30. Xin Jin et al. LDA-based related word detection in advertising, In Proceedings of the 2010 Seventh Web Information Systems and Applications Conference, 2010.
  31. He Yulan et al. Exploring English lexicon knowledge for Chinese sentiment analysis. In Proceedings of the CIPS-SIGHAN Joint Conference on Chinese Language Processing, 2010, Beijing, China.
  32. Mingrong Liu et al. Predicting best answerers for new questions in community question answering, In Proceedings of the 11th international conference on Web-age information management, 2010.
  33. Mai-Vu Tran et al. User interest analysis with hidden topic in news recommendation system, In Proceedings of the International Conference on Asian Language Processing, 2010.
  34. Scott Grant and James R. Cordy. Estimating the optimal number of latent concepts in source code analysis, In Proceedings of the 10th IEEE Working Conference on Source Code Analysis and Manipulation, 2010.
  35. Cam-Tu Nguyen et al. Web search clustering and labeling with hidden topics, ACM Transactions on Asian Language Information Processing (ACM TALIP), Vol.8, No.3, 2009.
  36. Chenghua Lin and Yulan He. Joint sentiment/topic model for sentiment analysis, In Proceedings of the 18th ACM Conference on Information and Knowledge Management (ACM CIKM), 2009.
  37. Michael Paul and Roxana Girju. Cross-cultural analysis of blogs and forums with mixed-collection topic models, In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2009.
  38. Makoto P. Kato. Rhythmixearch: searching for unknown music by mixing known music, In Proceedings of the 10th International Society for Music Information Retrieval Conference, 2009.
  39. Arun R. et al. Stopwords and Stylometry: A Latent Dirichlet Allocation Approach, In Proceedings of the NIPS Workshop on Applications of Topic Models: Text and Beyond, Canada, 2009.
  40. Kai Tian et al. Using Latent Dirichlet Allocation for automatic categorization of software, In Proceedings 6th IEEE Working Conference on Mining Software Repositories, Canada 2009.
  41. Serafettin Tasci and Tunga Güngör. LDA-based keyword selection in text categorization. In Proceedings of the 24th International Symposium on Computer and Information Sciences, 2009.
  42. Yixun Liu et al. Modeling class cohesion as mixtures of latent topics, In Proceedings of the IEEE International Conference on Software Maintenance, 2009.
  43. Ocampo-Guzman, I. et al. Data-driven approach for ontology learning, In Proceedings of the 6th International Conference on Electrical Engineering, Computing Science and Automatic Control, 2009.
  44. Sao Carlos et al. Towards the automatic learning of ontologies, In Proceedings of the 7th Brazilian Symposium in Information and Human Language Technology, 2009.
  45. Istvan Biro et al. A Comparative Analysis of Latent Variable Models for Web Page Classification. In Proceedings of the Latin American Web Conference, 2008.
  46. Xuan-Hieu Phan et al. Learning to Classify Short and Sparse Text & Web with Hidden Topics from Large-scale Data Collections. In Proceedings of the 17th International World Wide Web Conference (WWW 2008), pp.91-100, April 2008, Beijing, China.
  47. Istvan Biro et al. Latent Dirichlet Allocation in Web Spam Filtering. In Proceedings of the Fourth International Workshop on Adversarial Information Retrieval on the Web, WWW2008, April 2008, Beijing, China.
  48. Dieu-Thu Le et al. Matching and ranking with hidden topics towards online contextual advertising, In Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (WI), 2008.

Query the Web for more information about papers citing GibbsLDA++

References

Acknowledgements

Our code is based on the Java code of Gregor Heinrich and the theoretical description of Gibbs sampling for LDA in [Heinrich]. We would like to thank Heinrich for sharing the code and a comprehensive technical report.

We would like to thank Sourceforge.net for hosting this project.
