Computer Science > Machine Learning

arXiv:2205.08046 (cs)

[Submitted on 17 May 2022 (v1), last revised 5 Sep 2022 (this version, v3)]

Title: Shape complexity in cluster analysis

Authors: Eduardo J. Aguilar, Valmir C. Barbosa

Abstract: In cluster analysis, a common first step is to scale the data so that they can be better partitioned into clusters. Although many techniques have been introduced to this end over the years, it is probably fair to say that the workhorse of this preprocessing phase has been division of the data by the standard deviation along each dimension. Like division by the standard deviation, the great majority of scaling techniques can be said to have roots in some sort of statistical take on the data. Here we explore the use of multidimensional shapes of data to obtain scaling factors for use prior to clustering by some method, like k-means, that makes explicit use of distances between samples. We borrow from the field of cosmology and related areas the recently introduced notion of shape complexity, which in the variant we use is a relatively simple, data-dependent nonlinear function that we show can be used to help determine appropriate scaling factors. Focusing on what might be called "midrange" distances, we formulate a constrained nonlinear programming problem and use it to produce candidate scaling-factor sets that can be sifted on the basis of further considerations of the data, say via expert knowledge. We give results on some iconic data sets, highlighting the strengths and potential weaknesses of the new approach. These results are generally positive across all the data sets used.
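For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of dividing the data by the standard deviation along each dimension before running a distance-based method such as k-means. It assumes NumPy; the function names `scale_by_std` and `kmeans`, the plain Lloyd-style iteration, and the synthetic toy data are illustrative assumptions and are not taken from the paper. It does not implement the shape-complexity approach.

```python
import numpy as np


def scale_by_std(X):
    """Divide each column (dimension) of X by its sample standard deviation.

    Returns the scaled data and the per-dimension scaling factors.
    Hypothetical helper, not the paper's method.
    """
    std = X.std(axis=0, ddof=1)
    std = np.where(std > 0, std, 1.0)  # leave constant columns unchanged
    factors = 1.0 / std
    return X * factors, factors


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd-style k-means using Euclidean distances between samples."""
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct samples chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every sample to every center, shape (n, k).
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers


# Synthetic toy data: two groups whose second dimension has a much larger
# numeric range than the first, so it would dominate unscaled distances.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=[0.0, 0.0], scale=[1.0, 100.0], size=(50, 2))
group_b = rng.normal(loc=[5.0, 0.0], scale=[1.0, 100.0], size=(50, 2))
X = np.vstack([group_a, group_b])

X_scaled, factors = scale_by_std(X)
labels, centers = kmeans(X_scaled, k=2)
```

Because k-means relies on distances between samples, the choice of per-dimension factors directly changes which samples count as close; the sketch only fixes those factors to the reciprocal standard deviations, which is the conventional choice the abstract contrasts with.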

Comments:	Minor improvements and fixes in this version
Subjects:	Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2205.08046 [cs.LG]
	(or arXiv:2205.08046v3 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2205.08046
Journal reference:	PLoS One 18 (2023), e0286312
Related DOI:	https://doi.org/10.1371/journal.pone.0286312

Submission history

From: Valmir C. Barbosa
[v1] Tue, 17 May 2022 01:33:15 UTC (1,864 KB)
[v2] Wed, 18 May 2022 10:59:59 UTC (1,864 KB)
[v3] Mon, 5 Sep 2022 19:08:44 UTC (1,864 KB)
