
Clustering scientific documents with topic modeling

Published: 01 September 2014

Abstract

Topic modeling is a statistical machine learning technique for discovering the latent "topics" that occur in a collection of documents. Latent Dirichlet allocation (LDA) is currently a popular modeling approach. In this paper, we investigate methods, including LDA and its extensions, for separating a set of scientific publications into several clusters. To evaluate the results, we generate a collection of documents containing academic papers from several different fields and check whether papers in the same field are clustered together. We also explore potential scientometric applications of such text analysis capabilities.
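The evaluation idea in the abstract can be illustrated with a minimal sketch. It assumes each document already has a topic distribution from a fitted model such as LDA, assigns each document to its most probable topic, and scores the result with cluster purity. Purity is an assumption chosen for illustration, since the abstract does not name the metric the paper uses. The document-topic proportions, field labels, and the function names `assign_clusters` and `purity` are all hypothetical.

```python
from collections import Counter, defaultdict


def assign_clusters(doc_topic):
    """Hard-assign each document to its highest-probability topic."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]


def purity(clusters, fields):
    """Fraction of documents that share the majority field of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, field in zip(clusters, fields):
        by_cluster[cluster][field] += 1
    # For every cluster, count only the documents in its most common field.
    majority_total = sum(
        counts.most_common(1)[0][1] for counts in by_cluster.values()
    )
    return majority_total / len(fields)


# Hypothetical document-topic proportions (one row per paper, one column per
# topic). In practice these would come from a fitted topic model.
doc_topic = [
    [0.80, 0.15, 0.05],
    [0.70, 0.20, 0.10],
    [0.10, 0.85, 0.05],
    [0.55, 0.40, 0.05],  # a physics paper that lands in the biology cluster
    [0.05, 0.10, 0.85],
    [0.10, 0.10, 0.80],
]
# Hypothetical ground-truth field for each paper.
fields = ["biology", "biology", "physics", "physics", "economics", "economics"]

clusters = assign_clusters(doc_topic)
score = purity(clusters, fields)
print(clusters, round(score, 3))
```

A purity of 1.0 would mean every cluster contains papers from a single field. Purity rises trivially as the number of clusters grows, so it is best compared across models with the same number of topics.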



Published In

Scientometrics, Volume 100, Issue 3
September 2014
183 pages

Publisher

Springer-Verlag

Berlin, Heidelberg

Author Tags

  1. Latent Dirichlet allocation
  2. Text analysis
  3. Topic modeling

Qualifiers

  • Article
