Abstract
Topic detection in short textual data is a challenging task because such data are represented as a high-dimensional and extremely sparse document-term matrix. In this paper we focus on the problem of classifying textual data on the basis of their (unique) topic. For unsupervised classification, a popular approach called Mixture of Unigrams models the word counts as a mixture of multinomial distributions, with each component corresponding to a different topic. Placing a Dirichlet prior on the multinomial parameters extends this model to a mixture of compound Dirichlet-Multinomial distributions, which is preferable for sparse data. We propose a gradient descent estimation method for fitting the model, and investigate its supervised and unsupervised classification performance on real empirical problems.
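To make the model concrete, the following is a minimal sketch of the Dirichlet-Multinomial log-probability of a word-count vector and of the resulting topic membership probabilities in a mixture. It is an illustration under stated assumptions, not the estimation method proposed in the paper: the function names `dm_log_pmf` and `topic_posteriors`, the argument names, and the toy parameter values are all hypothetical, and the parameters are taken as given rather than fitted by gradient descent.

```python
import math


def dm_log_pmf(counts, alpha):
    """Log-probability of a count vector under a Dirichlet-Multinomial.

    counts: non-negative word counts for one document (hypothetical input).
    alpha:  positive Dirichlet parameters, one per vocabulary term.
    """
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: n! / prod_j x_j!
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Gamma(A) / Gamma(n + A), with A the sum of the Dirichlet parameters
    log_ratio = math.lgamma(a) - math.lgamma(n + a)
    # prod_j Gamma(x_j + alpha_j) / Gamma(alpha_j)
    log_terms = sum(
        math.lgamma(x + al) - math.lgamma(al) for x, al in zip(counts, alpha)
    )
    return log_coef + log_ratio + log_terms


def topic_posteriors(counts, weights, alphas):
    """Posterior probability of each topic for one document.

    weights: mixing proportions, one per topic (assumed to sum to 1).
    alphas:  one Dirichlet parameter vector per topic.
    """
    log_joint = [
        math.log(w) + dm_log_pmf(counts, al) for w, al in zip(weights, alphas)
    ]
    # Subtract the maximum before exponentiating for numerical stability
    m = max(log_joint)
    unnorm = [math.exp(v - m) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Toy example: two topics over a two-word vocabulary (made-up values)
post = topic_posteriors([3, 0], [0.5, 0.5], [[5.0, 1.0], [1.0, 5.0]])
```

In this toy setting the document contains only the first word, so the first topic, whose Dirichlet parameters favour that word, receives the larger posterior probability. Assigning each document to the topic with the highest posterior gives the classification rule; working on the log scale avoids underflow when documents or vocabularies are large.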
Cite this article
Anderlucci, L., Viroli, C. Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data. Adv Data Anal Classif 14, 759–770 (2020). https://doi.org/10.1007/s11634-020-00399-3