Abstract
This paper addresses the clustering of binary data with feature selection within the maximum likelihood (ML) and classification maximum likelihood (CML) frameworks. To perform clustering with feature selection efficiently, we propose an appropriate Bernoulli mixture model and derive two algorithms with embedded feature selection: Expectation-Maximization (EM) and Classification EM (CEM). Without requiring the number of clusters to be known in advance, both algorithms optimize two approximations of the minimum message length (MML) criterion. To exploit the advantages of EM for clustering and of CEM for fast convergence, we combine the two algorithms. We rigorously validate the approach with Monte Carlo simulations over varying model parameters, and we illustrate our contribution on real datasets commonly used in document clustering.
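To make the setting concrete, the sketch below shows EM for a plain Bernoulli mixture on binary data. It deliberately omits the paper's feature-selection weights, the MML penalty, and the CEM classification step; `bernoulli_mixture_em` and all of its parameters are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bernoulli_mixture_em(X, K, n_iter=50, seed=0, eps=1e-10):
    """EM for a K-component Bernoulli mixture on binary data X (n x d).

    Returns mixing proportions pi, Bernoulli parameters mu (K x d),
    and a hard cluster assignment per row.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(K, 1.0 / K)                    # mixing proportions
    mu = rng.uniform(0.25, 0.75, size=(K, d))   # Bernoulli parameters
    for _ in range(n_iter):
        # E-step: posterior responsibilities, computed in log space
        log_p = (X @ np.log(mu + eps).T
                 + (1.0 - X) @ np.log(1.0 - mu + eps).T
                 + np.log(pi + eps))
        log_p -= log_p.max(axis=1, keepdims=True)   # stabilize before exp
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update proportions and per-cluster Bernoulli means
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp.T @ X) / (nk[:, None] + eps)
    return pi, mu, resp.argmax(axis=1)
```

A CEM variant would replace the soft responsibilities with a hard argmax assignment after each E-step, which is what gives CEM its faster convergence at the cost of biased parameter estimates.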
Copyright information
© 2014 Springer International Publishing Switzerland
Cite this paper
Laclau, C., Nadif, M. (2014). Fast Simultaneous Clustering and Feature Selection for Binary Data. In: Blockeel, H., van Leeuwen, M., Vinciotti, V. (eds) Advances in Intelligent Data Analysis XIII. IDA 2014. Lecture Notes in Computer Science, vol 8819. Springer, Cham. https://doi.org/10.1007/978-3-319-12571-8_17
DOI: https://doi.org/10.1007/978-3-319-12571-8_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12570-1
Online ISBN: 978-3-319-12571-8
eBook Packages: Computer Science (R0)