Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data


Abstract

Topic detection in short textual data is challenging because such data are represented as a high-dimensional and extremely sparse document-term matrix. In this paper we focus on the problem of classifying textual data on the basis of their (unique) topic. For unsupervised classification, a popular approach called Mixture of Unigrams models the word counts as a mixture of multinomial distributions, with each component corresponding to a different topic. Placing a Dirichlet prior on the multinomial parameters yields a compound model, the mixture of Dirichlet-Multinomial distributions, which is preferable for sparse data. We propose a gradient descent estimation method for fitting the model, and investigate supervised and unsupervised classification performance on real empirical problems.
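
To make the model in the abstract concrete, the following is a minimal sketch, not the authors' implementation, of the standard Dirichlet-Multinomial log-probability of a word-count vector and of the posterior topic probabilities under a mixture of such components. The function names, the toy parameter values, and the use of fixed (already estimated) parameters are assumptions made for illustration only. The gradient descent estimation step proposed in the paper is not shown.

```python
import math


def dm_log_pmf(counts, alpha):
    """Log-probability of a word-count vector under a Dirichlet-Multinomial.

    counts: non-negative integer word counts for one document.
    alpha:  positive Dirichlet parameters, one per vocabulary term.
    """
    n = sum(counts)
    a = sum(alpha)
    # Log of the multinomial coefficient n! / prod_j(x_j!)
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Gamma-function ratios obtained by integrating out the word probabilities
    log_ratio = math.lgamma(a) - math.lgamma(n + a)
    log_terms = sum(
        math.lgamma(x + al) - math.lgamma(al) for x, al in zip(counts, alpha)
    )
    return log_coef + log_ratio + log_terms


def posterior_topic_probs(counts, weights, alphas):
    """Posterior probability of each topic (mixture component) for one document.

    weights: mixing proportions, one per topic (assumed to sum to 1).
    alphas:  one Dirichlet parameter vector per topic.
    """
    log_joint = [
        math.log(w) + dm_log_pmf(counts, al) for w, al in zip(weights, alphas)
    ]
    # Subtract the maximum before exponentiating for numerical stability
    m = max(log_joint)
    unnorm = [math.exp(v - m) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical toy example: vocabulary of 3 terms, 2 topics
weights = [0.5, 0.5]
alphas = [[5.0, 1.0, 1.0], [1.0, 1.0, 5.0]]
document = [3, 0, 0]  # a short document using only the first term
probs = posterior_topic_probs(document, weights, alphas)
predicted_topic = probs.index(max(probs))
```

In this toy setting the document is assigned to the topic whose Dirichlet parameters put most weight on the first term. In the unsupervised case the same posterior probabilities would serve as cluster memberships; in the supervised case the parameters would be estimated from labelled documents first.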



Author information

Authors and Affiliations

Department of Statistical Sciences, University of Bologna, via Belle Arti, 41, 40126, Bologna, Italy
Laura Anderlucci & Cinzia Viroli

Corresponding author

Correspondence to Laura Anderlucci.


About this article

Cite this article

Anderlucci, L., Viroli, C. Mixtures of Dirichlet-Multinomial distributions for supervised and unsupervised classification of short text data. Adv Data Anal Classif 14, 759–770 (2020). https://doi.org/10.1007/s11634-020-00399-3

Received: 12 July 2019
Revised: 22 December 2019
Accepted: 25 April 2020
Published: 25 May 2020
Issue Date: December 2020
DOI: https://doi.org/10.1007/s11634-020-00399-3

Keywords

Mathematics Subject Classification

62H30
