Short paper. DOI: 10.1145/3430036.3430066

XplainableClusterExplorer: a novel approach for interactive feature selection for clustering

Published: 08 December 2020

Abstract

Human-centered machine learning is an emerging field that aims to enable domain experts who do not necessarily have a data science background to use machine learning applications. In unsupervised machine learning in particular, e.g. cluster analysis, models cannot be tuned autonomously towards an optimal solution for a given application because ground truth such as class labels is absent. In cluster analysis, different feature subsets may lead to different clusterings. Identifying the best subset of the given features is therefore essential to improve overall clustering performance and to obtain a clustering that suits a given application. To support users in finding an optimal clustering solution, we propose XplainableClusterExplorer, an interactive and explorative approach to feature selection for clustering. The user and the machine learning models work in combination: evaluation criteria and visualizations support the user in determining feature subsets and adjusting hyperparameters. For feature subset selection, we propose combining this with feature importances from random forests and LIME. Since this requires a supervised setting, the cluster assignments are used as tentative class labels in a subsequent step. Our experimental results show that this subsequent classification step, which leverages the calculated feature importances, can facilitate feature subset selection and thereby enhance overall clustering performance.
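
The core loop described in the abstract (cluster, treat the cluster assignments as tentative class labels, train a classifier, read off feature importances) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes scikit-learn, k-means as the clustering algorithm, synthetic data, and the silhouette coefficient as a stand-in evaluation criterion. The function name `rank_features_by_cluster_labels` is hypothetical, and the LIME step and the interactive visualizations are omitted.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score


def rank_features_by_cluster_labels(X, n_clusters=3, seed=0):
    """Cluster X, then rank features by how well they predict the clusters."""
    # Step 1: unsupervised clustering on the current feature set.
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(X)
    # Step 2: use the cluster assignments as tentative class labels.
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X, labels)
    # Step 3: impurity-based feature importances, highest first.
    importances = forest.feature_importances_
    order = np.argsort(importances)[::-1]
    return labels, importances, order


# Synthetic data (an assumption for illustration): three features that carry
# cluster structure, followed by five features of pure Gaussian noise.
rng = np.random.default_rng(0)
X_info, _ = make_blobs(n_samples=300, centers=3, n_features=3, random_state=0)
X_noise = rng.normal(size=(300, 5))
X = np.hstack([X_info, X_noise])

labels_all, importances, order = rank_features_by_cluster_labels(X)

# Candidate subset: the three highest-ranked features. In an interactive tool
# the user would inspect the ranking and decide, rather than take a fixed cut.
top = np.sort(order[:3])
labels_sub = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X[:, top])

score_all = silhouette_score(X, labels_all)
score_sub = silhouette_score(X[:, top], labels_sub)
print("feature ranking (most important first):", order)
print(f"silhouette, all features:     {score_all:.3f}")
print(f"silhouette, candidate subset: {score_sub:.3f}")
```

Note that the two silhouette values are computed in different feature spaces, so they are a rough guide for the user rather than a strict comparison. This is one reason a human-in-the-loop judgment is useful at this step.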

Cited By

  • (2023) Performance Evaluation of a Feature-Importance-based Feature Selection Method for Time Series Prediction. Journal of information and communication convergence engineering 21:1 (82-89). DOI: 10.56977/jicce.2023.21.1.82. Online publication date: 31-Mar-2023
  • (2022) VisGIL: machine learning-based visual guidance for interactive labeling. The Visual Computer 39:10 (5097-5119). DOI: 10.1007/s00371-022-02648-2. Online publication date: 25-Sep-2022

Published In

VINCI '20: Proceedings of the 13th International Symposium on Visual Information Communication and Interaction
December 2020
205 pages
ISBN:9781450387507
DOI:10.1145/3430036

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering
  2. feature selection
  3. human-centered machine learning
  4. k-means
  5. unsupervised machine learning

Qualifiers

  • Short-paper

Conference

VINCI 2020

Acceptance Rates

Overall Acceptance Rate 71 of 193 submissions, 37%
