
Parameter-free basis allocation for efficient multiple metric learning


Published 8 December 2023 © 2023 The Author(s). Published by IOP Publishing Ltd
Citation: Dongyeon Kim et al 2023 Mach. Learn.: Sci. Technol. 4 045049. DOI 10.1088/2632-2153/ad113b


Abstract

Metric learning involves learning a metric function for distance measurement, which plays an important role in improving the performance of classification and similarity-based algorithms. Multiple metric learning is essential for efficiently reflecting the local properties between instances, because single metric learning has limitations in capturing the nonlinear structure of complex datasets. Previous research proposed a method for learning a smooth metric matrix function over the data manifold to address the challenge of independently learning multiple metrics. However, this method uses a basic distance-based clustering algorithm to set the anchor points that serve as the basis for local metric learning, and the number of basis metrics must be chosen by the user. We propose a new method that assigns more sophisticated anchor points by iteratively identifying clusters that mix instances of multiple classes and partitioning them so that the most similar class instances are clustered together. In experiments, we demonstrate the reliability of the automatically determined parameter by comparing it against the distribution of error rates obtained by the existing algorithm over different numbers of basis metrics. Furthermore, we show that the proposed method outperforms existing algorithms with fixed parameter settings and confirm its relative superiority in classification accuracy through comparisons with baseline algorithms.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The performance of similarity-based classification algorithms such as the nearest neighbor (NN) classifier [1] is highly dependent on the distance metric. Because distance metric learning learns a metric from the training set that draws instances of the same class closer together, various metric learning algorithms have been studied in combination with K-NN classifiers [2–6]. In particular, unlike single metric learning (SML), multiple metric learning methods that learn several metrics are effective in improving classification performance on relatively complex datasets [7–9]. In multiple metric learning, defining and setting the number of bases for metric learning is important because it directly affects the resulting performance [10]. Therefore, research on differentiated and optimized basis settings for efficient multiple metric learning is important.

Basically, SML [2, 3, 5, 11–13] learns a single metric by reflecting the structure of the entire training set and the class information of instances. However, a single metric has limitations on complex datasets with a mixture of different nonlinear structures [14–16], because no single metric satisfies all the constraints of such datasets. To address this limitation, various multiple metric learning algorithms have been proposed that learn several metrics for a given problem [17–22]. Multiple metric learning can partition nonlinear structures through multiple metric settings and learn locally appropriate metrics [8, 23]. Multiple metric large margin NN (MMLMNN) [2] is an extension of large margin NN (LMNN) that enforces larger margins between target neighbors and impostors and learns a local metric for each class. Clustered multi-metric learning (CMML) [24] performs clustering to form initial clusters and learns the local metric of each cluster from triplet constraints; each local metric is kept as close as possible to a single global metric to reflect the properties of other clusters and prevent overfitting. Parametric local metric learning (PLML) [4] learns a corresponding metric for each instance. To define the anchor points that serve as the basis for metric learning, a clustering method is used to form initial clusters from the training set. The basis metrics and the weight values associated with the assigned anchor points are then combined through a weighted linear combination.

PLML significantly improves on the limitations of a single metric, but there are several considerations. First, the number of basis metrics to be learned is user-dependent. The number of basis metrics is an important hyperparameter, because it controls the complexity and fitting performance of the learned model, but in practice the tuning process is often omitted or the number is simply set to the number of classes or an arbitrary value [4, 8, 24]. Second, anchor points are assigned simply by the K-means clustering algorithm [25]. Anchor points play a key role in the basis metric learning and local linear weight learning that define the smooth metric matrix function. Therefore, assigning anchor points based purely on distance-based clustering may degrade performance by ignoring the class information of neighboring instances.

We propose a new algorithm for assigning a sufficient number of anchor points to appropriate locations. For the initial clusters formed by the clustering algorithm, we apply linear discriminant analysis (LDA) to identify clusters that are mixed with multiple classes. The cluster split threshold is calculated and additional anchor points are assigned by splitting the cluster. A constraint on weak clusters is added to prevent overfitting due to the assignment of too many anchor points.

We make the following four contributions:

  • Solving the problem of assigning anchor points from mixed clusters of multiple classes, and assigning anchor points to more sophisticated locations.
  • Setting the number of anchor points automatically so that hyperparameter tuning can be omitted. This solves an important problem in multiple metric learning.
  • Verifying the reliability of the parameters calculated by the proposed algorithm through experiments on 20 datasets and demonstrating the superiority of the proposed method by comparing the classification accuracy with other metric learning algorithms.
  • Demonstrating the scalability of the proposed method by applying it to other multiple metric learning algorithms with similar constraints and evaluating their classification accuracy.

2. Preliminaries

2.1. Distance metric learning for NN classifiers

Distance metric learning to improve the performance of the NN classifier directly learns the positive semi-definite (PSD) matrix of the Mahalanobis distance by reflecting the class information of the training set. If $\mathbb{X}\subset \mathbb{R}^d$ is an instance space and $\mathbb{Y} = \left\{1, 2, {\ldots}, l \right\}$ is a class space, the entire training dataset $\mathbb{S}$ composed of N instances can be expressed as $\left\{\left ( x_{i}, y_{i} \right )\right\}^N_{i = 1}$, and satisfies $\left ( x_{i}, y_{i} \right ) \in \mathbb{X} \times \mathbb{Y}$ for all i (l is the total number of classes). The Mahalanobis distance between all $x_i, x_j \in \mathbb{X}$ is defined as:

$d^{2}_{M}\left(x_i, x_j\right) = \left(x_i - x_j\right)^{\mathrm{T}} M \left(x_i - x_j\right). \qquad (1)$

$M \in \mathbb{R}^{d \times d}$ is decomposed into $B^\mathrm{T}B$ as a positive semi-definite matrix. Equation (1) is equivalent to measuring the Euclidean distance in a new space mapped by the transformation matrix B. In other words, distance metric learning is the same problem as learning a transformation matrix to a new space that can improve the classification performance of the NN classifier. Based on the learned Mahalanobis distance, the K-NN classifier finds K NNs for each test instance and classifies them into the most dominant class.
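For concreteness, the short numerical sketch below (not code from the paper) checks that a Mahalanobis distance with $M = B^\mathrm{T}B$ coincides with the Euclidean distance computed after mapping the data with B; the matrix B here is a random placeholder rather than a learned transformation.

    import numpy as np

    def mahalanobis_sq(x_i, x_j, M):
        """Squared Mahalanobis distance (x_i - x_j)^T M (x_i - x_j)."""
        diff = x_i - x_j
        return float(diff @ M @ diff)

    # Check that M = B^T B reduces to the (squared) Euclidean distance in the
    # space mapped by B; all values below are arbitrary placeholders.
    rng = np.random.default_rng(0)
    d = 4
    B = rng.normal(size=(d, d))          # stand-in for a learned transformation
    M = B.T @ B                          # PSD metric matrix
    x_i, x_j = rng.normal(size=d), rng.normal(size=d)

    assert np.isclose(mahalanobis_sq(x_i, x_j, M),
                      float(np.sum((B @ (x_i - x_j)) ** 2)))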

2.2. PLML

PLML [4] learns a smooth metric matrix function for the entire data manifold and returns a corresponding vectorized local metric matrix as output for each input instance. Prior to metric learning for each instance, a basis matrix U is defined by extracting m anchor points from the instance space, and a basis metric is learned. After that, they are linearly combined using the approximation error bound of the metric matrix function, and manifold regularization is performed so that the metric matrix function can change smoothly. As we learn the local metric corresponding to each instance, the distance between instances xi and xj is defined as:

$d^{2}_{M_i}\left(x_i, x_j\right) = \left(x_i - x_j\right)^{\mathrm{T}} M_i \left(x_i - x_j\right). \qquad (2)$

$M_i$ is the local metric for $x_i$. $M_i$ is approximated by a weighted linear combination of the elements of the basis metric set $\left\{M_{b_1}, {\ldots}, M_{b_m} \right\}$ and parameterized as:

$M_i = \sum_{b_k} W_{i b_k} M_{b_k}. \qquad (3)$

W is the N×m matrix containing the weight values, $W_i$ contains the weight values for each element of the basis metric set $\left\{M_{b_1}, {\ldots}, M_{b_m} \right\}$, and each element is nonnegative. In addition, the condition $\sum_{b_k}W_{ib_k} = 1$ is included to prevent scaling problems between the local metrics. PLML performs a two-step algorithm to parameterize the local metric corresponding to each instance: the matrix W is learned in the smooth local linear weighting stage, which learns the weight for each basis metric, and the basis metrics are learned in the large margin basis metric learning stage.
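As a toy illustration of equations (2) and (3) (not the authors' code), the sketch below assembles a local metric as a weighted linear combination of placeholder basis metrics and evaluates the corresponding squared distance.

    import numpy as np

    def local_metric(W_i, basis_metrics):
        """Approximate M_i as the weighted combination sum_k W_{i b_k} M_{b_k}."""
        return np.einsum('k,kij->ij', W_i, basis_metrics)

    def local_distance_sq(x_i, x_j, M_i):
        diff = x_i - x_j
        return float(diff @ M_i @ diff)

    # Toy example with m = 3 diagonal basis metrics in d = 2 dimensions.
    rng = np.random.default_rng(1)
    d, m = 2, 3
    basis = np.stack([(1.0 + 0.1 * k) * np.eye(d) for k in range(m)])  # placeholder PSD metrics
    W_i = np.array([0.5, 0.3, 0.2])      # nonnegative weights summing to 1
    M_i = local_metric(W_i, basis)
    print(local_distance_sq(rng.normal(size=d), rng.normal(size=d), M_i))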

In order to define a smooth metric matrix function that minimizes approximation error for each instance and reflects local properties, matrix W is learned. The objective function is defined as:

$\min_{W}\ \left\|X - WU\right\|_{\mathrm{F}}^{2} + \lambda_1\, \mathrm{tr}\left(WG\right) + \lambda_2\, \mathrm{tr}\left(W^{\mathrm{T}} L W\right) \quad \mathrm{s.t.}\ W_{i b_k} \geqslant 0,\ \textstyle\sum_{b_k} W_{i b_k} = 1. \qquad (4)$

$\lambda_1$ and $\lambda_2$ are balance parameters. X is the N×d instance matrix and U is the m×d anchor point matrix. $\mathrm{tr}(\cdot)$ and $\left\|\cdot\right\|_\mathrm{F}$ denote the trace (the sum of the diagonal components) and the Frobenius norm of a matrix, respectively. The m×N matrix G is the distance matrix between each anchor point and instance, and the N×N matrix L is the Laplacian matrix defined as D − S, where D is a diagonal matrix satisfying $D_{ii} = \sum_{k}S_{ik}$ and S is a symmetric pairwise similarity matrix of instances. Through this objective function, similar instances learn similar weights while the approximation error is minimized and the local properties are reflected.
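As an illustration only, the helper below evaluates an objective of this form for a given W; it assumes the matrix shapes stated above (X: N×d, U: m×d, W: N×m, G: m×N, L: N×N) and omits the actual solver as well as the nonnegativity and row-sum constraints on W.

    import numpy as np

    def weight_objective(W, X, U, G, L, lam1, lam2):
        """Value of a smooth local linear weighting objective as described in the text:
        reconstruction error ||X - W U||_F^2, an anchor-distance term tr(W G) and a
        manifold-smoothness term tr(W^T L W). Constraint handling is not shown."""
        recon = np.linalg.norm(X - W @ U, 'fro') ** 2
        dist_term = np.trace(W @ G)        # G holds anchor-to-instance distances (m x N)
        smooth = np.trace(W.T @ L @ W)     # L = D - S is the graph Laplacian (N x N)
        return recon + lam1 * dist_term + lam2 * smooth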

Given that the basis metrics are learned with the large margin method, $k_1$ NNs of the same class and $k_2$ NNs of different classes are defined for each instance [2, 16, 26–28]. Based on these, triplet constraints are imposed so that the distance to the $k_2$ instances is larger than the distance to the $k_1$ instances, and the objective function for learning the basis metrics $\left\{M_{b_1}, {\ldots}, M_{b_m} \right\}$ is defined as:

$\min_{\left\{M_{b_k}\right\},\, \xi}\ \sum_{i,\,j} d^{2}_{M_i}\left(x_i, x_j\right) + \alpha_1 \sum_{i,\,j,\,k} \xi_{ijk} + \alpha_2 \sum_{b_k} \left\| M_{b_k} \right\|_{\mathrm{F}}^{2} \quad \mathrm{s.t.}\ d^{2}_{M_i}\left(x_i, x_k\right) - d^{2}_{M_i}\left(x_i, x_j\right) \geqslant 1 - \xi_{ijk},\ \xi_{ijk} \geqslant 0,\ M_{b_k} \succeq 0, \qquad (5)$

where j ranges over the $k_1$ same-class NNs of $x_i$ and k over its $k_2$ different-class NNs.

$\alpha_1$ and $\alpha_2$ are balance parameters, and ξ is the large margin error. The basis metrics are trained to minimize the sum of the large margin errors and the squared pairwise distances to the $k_1$ NNs computed with the local metric. Furthermore, unlike LMNN, the squared Frobenius norm of each basis metric is added to facilitate optimization.
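A hedged sketch of the per-triplet large margin error used in such objectives is given below; it is a generic hinge formulation consistent with the description above, not PLML's actual optimizer.

    import numpy as np

    def triplet_hinge(M_i, x_i, x_j, x_k, margin=1.0):
        """Large margin error for one triplet: the squared distance to the
        different-class neighbour x_k should exceed the squared distance to the
        same-class neighbour x_j by at least `margin`."""
        d_ij = (x_i - x_j) @ M_i @ (x_i - x_j)
        d_ik = (x_i - x_k) @ M_i @ (x_i - x_k)
        return float(max(0.0, margin + d_ij - d_ik))

    # Toy usage: with the identity metric, the error is zero once x_k is far enough away.
    print(triplet_hinge(np.eye(2), np.zeros(2), np.array([0.5, 0.0]), np.array([3.0, 0.0])))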

For the test instances, the same method can be used to learn the weights for the basis metric via (4). However, we applied the NN method per test instance to use the learned weight values of the nearest training instances.

The weight matrix W is learned based on the initially assigned anchor point, and the basis metric $\left\{M_{b_1}, {\ldots}, M_{b_m} \right\}$ is learned based on the matrix W. Therefore, the number of anchor points and how they are defined has a noticeable impact on the performance of the trained metrics.

3. Proposed method

3.1. Overview

Parameter sensitivity. The PLML algorithm can define sophisticated local metrics by learning a smooth metric matrix function, but many hyperparameters are required. In particular, the number of basis metrics is user-dependent, and the resulting deviation in classification accuracy is quite large. Figure 1 shows the performance change of the PLML algorithm with the number of basis metrics for each dataset listed in table 1, grouped into four categories according to the range of difference values. For each category, the blue circular marker line indicates the dataset with the largest deviation, while the red triangular marker line indicates the dataset with the smallest deviation. Because the deviation depending on the number of basis metrics is quite large, separate tuning is required to learn a good metric, which is time consuming.


Figure 1. Performance of the PLML according to the number of basis metrics. The x-axis indicates the number of basis metrics, and the y-axis indicates the difference between the resulting accuracy and the accuracy when the number of basis metrics is set to the number of class types in the dataset. Each graph shows the results for each group of datasets divided by the range of difference values. The red line indicates the dataset with the smallest difference in the group, and the blue line indicates the dataset with the largest difference in the group.


Table 1. Two dataset groups of different sizes. Datasets with fewer than 1000 instances are listed in the first group, and datasets with more than 1000 instances in the second group. The last column of each group indicates the number of basis metrics automatically assigned by the proposed algorithm.

Datasets of small size
Dataset           Instances  Features  Classes  Basis metrics
Wine                    178        13        3              6
Glass                   214         9        6             11
Newthyroid              215         5        3              9
Heart                   270        13        2              8
Movement-libras         360        90       15             29
Car                     392         7        3              8
ILPD                    583        10        2              7
Balance                 625         4        3              9
Vehicle                 813        18        3              9
Vowel                   990        13       11             25

Datasets of large size
Dataset           Instances  Features  Classes  Basis metrics
Winequality-red        1599        11       11             24
Optdigits              1797        64       10             18
Segment                2310        19        7             19
Wilt                   4839         5        2              9
Texture                5500        40       11             24
MNIST (10%)            6000       784       10             18
Pendigits (55%)        6000        16       10             22
Satimage               6430        36        7             19
USPS                   7291       256       10             21
Isolet                 7797       617       26             45

Assignment of anchor points. For anchor point allocation, which is the basis for metric learning and weight learning, the existing algorithm simply applies the K-means clustering algorithm from the training data. This is distance-based clustering, which uses Euclidean distance to measure the similarity between instances, so the class information is not taken into account at all. As a result, many initial clusters may be formed that contain a mixture of instances from multiple classes. If the center of such a cluster is designated as an anchor point, it will result in unsophisticated basis metric learning and initial weight value assignment, which will reduce the K-NN classification accuracy.

To overcome this issue, we identify clusters that contain a mixture of multiple classes and partition the clusters so that local property and class information between neighboring instances can be considered simultaneously. The goal is to finally improve the accuracy of the K-NN classifier by assigning anchor points from the refined clusters and automatically determining the number of basis metrics.
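For reference, the baseline anchor assignment described above can be sketched as follows (an illustrative use of scikit-learn, not the authors' code); the function name kmeans_anchors is ours.

    import numpy as np
    from sklearn.cluster import KMeans

    def kmeans_anchors(X_train, m, seed=0):
        """Baseline anchor assignment: cluster the training set with K-means and
        take the m centroids as the anchor point matrix U (m x d). Class labels
        play no role here, which is the limitation discussed above."""
        kmeans = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X_train)
        return kmeans.cluster_centers_

    # Toy usage with random data standing in for a training set.
    U = kmeans_anchors(np.random.default_rng(0).normal(size=(200, 5)), m=3)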

3.2. LDA for initial cluster

To form initial clusters for anchor point assignment, as in PLML, a set of clusters $\mathbb{C} = \left\{C_{b_1}, {\ldots}, C_{b_m}\right\}$ is formed from the entire training set $\left\{\left ( x_{i}, y_{i} \right )\right\}^N_{i = 1}$ via K-means clustering. To ensure that at least one dominant basis metric is learned for each class, the number of clusters m is set to the number of classes in the training set. Because the similarity between instances is measured with the Euclidean distance when forming clusters, the class information of neighboring instances is not taken into account at all. Consequently, there is a high probability that the initial clusters contain a mixture of multiple classes. To identify these clusters, we apply LDA to each cluster and project all of its instances onto a major axis that preserves the discriminative information between the classes within that cluster. Using excessive class-discriminative information can lead to overfitting in metric learning, as it ignores the local properties of instances; therefore, we project onto only the single most discriminative axis to reflect both class discrimination and local properties simultaneously:

$\tilde{x} = w_{b_k}^{\mathrm{T}} x, \qquad x \in C_{b_k}. \qquad (6)$

The vector $w_{b_k}$ is the axis onto which all instances in cluster $C_{b_k}$ are projected, and $\tilde{x}$ is an instance projected onto $w_{b_k}$. The axis $w_{b_k}$ that maximally preserves the discriminative information of the multiple classes in the cluster is derived from the following optimization problem:

$w_{b_k} = \underset{w}{\arg\max}\ \frac{w^{\mathrm{T}} \eta_{B_k} w}{w^{\mathrm{T}} \eta_{A_k} w}. \qquad (7)$

The basic idea of LDA is to minimize the variance between instances belonging to the same class while maximizing the distance between groups of instances belonging to different classes. As LDA is applied to each cluster, the objective function considers only the class information of the instances in that single cluster. Here, $\eta_{A_k}$ denotes the within-class scatter and $\eta_{B_k}$ the between-class scatter of cluster $C_{b_k}$, defined as follows:

$\eta_{A_k} = \sum_{i=1}^{l} \sum_{x \in \gamma_{ik}} \left(x - \mu_{ik}\right)\left(x - \mu_{ik}\right)^{\mathrm{T}}. \qquad (8)$

l is the total number of classes, $\gamma_{ik}$ denotes the set of instances of class i within cluster $C_{b_k}$, and $\mu_{ik}$ is the mean of the elements of $\gamma_{ik}$. Thus, $\eta_{A_k}$ captures the variance of the instances belonging to each class:

$\eta_{B_k} = \sum_{i=1}^{l} \left(\mu_{ik} - \mu_{k}\right)\left(\mu_{ik} - \mu_{k}\right)^{\mathrm{T}}. \qquad (9)$

Here, $\mu_k$ is the average of the per-class means in cluster $C_{b_k}$, so $\eta_{B_k}$ measures how far the center of each class group lies from this overall mean. Finally, the axis $w_{b_k}$ that preserves the discriminative information of each class is obtained through the eigendecomposition of $\eta_{A_k}^{-1}\eta_{B_k}$.

While instances with similar local properties are clustered together, the projection condenses instances of the same class as much as possible, so the projected instances of each class usually follow a normal distribution. Based on these normal distributions, a partitioning threshold is calculated to distinguish between the classes.
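A minimal numpy sketch of this within-cluster LDA step is given below; it assumes the unweighted between-class scatter around the average of the per-class means, as described above, and uses a pseudo-inverse for numerical safety.

    import numpy as np

    def lda_axis(X_c, y_c):
        """Most discriminative projection axis of one cluster: the leading
        eigenvector of pinv(within-class scatter) @ between-class scatter."""
        classes = np.unique(y_c)
        d = X_c.shape[1]
        mu_classes = np.array([X_c[y_c == c].mean(axis=0) for c in classes])
        mu_overall = mu_classes.mean(axis=0)        # average of the per-class means
        S_w = np.zeros((d, d))
        for c, mu_c in zip(classes, mu_classes):
            diff = X_c[y_c == c] - mu_c
            S_w += diff.T @ diff                    # within-class scatter
        S_b = np.zeros((d, d))
        for mu_c in mu_classes:
            v = (mu_c - mu_overall)[:, None]
            S_b += v @ v.T                          # between-class scatter
        eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_w) @ S_b)
        w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
        return w / np.linalg.norm(w)

    # Projected one-dimensional values for the instances of a cluster:
    # x_tilde = X_c @ lda_axis(X_c, y_c)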

3.3. Additional assignment of anchor points

For each cluster, the projected instances are distributed so as to preserve the discriminative information of the multiple classes as much as possible. Given that the distribution usually takes the form of a mixture of normal distributions, it is fitted with a Gaussian mixture model. To split the initial cluster into two subclusters, the number of Gaussian components is set to two. Because assigning excessive anchor points leads to overfitting, a condition check is performed for weak clusters, and clusters containing a mixture of classes are split recursively.

To divide the projected data into two groups, the split threshold is calculated as $\left(\mu_{1}+\mu_{2}\right)/2$, the average of the means of the two fitted Gaussian distributions. However, some instances near the split threshold, where the two Gaussian components overlap, are likely to be assigned to the wrong group. To prevent this, the proposed algorithm computes the distance from each projected instance to the means of the two groups separated by the threshold and assigns the instance to the closer group. Each partitioned cluster is defined as follows:

Equation (10)

Let $\omega_1$ be the mean of the projected instances whose values are smaller than the split threshold, and $\omega_2$ the mean of those whose values are larger. The instances in the existing cluster $C_{b_k}$ that satisfy the condition form a new cluster $C_{b_{m+1}}$:

Equation (11)

Cluster $C_{b_k}$ is redefined to contain only instances that satisfy the condition among the instances contained in the existing cluster $C_{b_k}$.

In the process of splitting clusters, splitting even weak clusters with very few instances can increase the computational complexity of metric learning and reduce classification accuracy due to overfitting. Therefore, a constraint is added so that weak clusters, that is, clusters containing fewer instances than a rate $\epsilon$ of the number of instances N in the entire training set, are not split. The rate $\epsilon$ is defined as follows:

Equation (12)

Here, N is the number of instances in the training set and l is the number of classes. Since datasets with more instances and classes require more sophisticated anchor point assignment through cluster partitioning, the rate for identifying weak clusters in equation (12) reflects the number of instances and classes in the dataset. More detailed validation results for this formula can be found in the appendix.

The two clusters $C_{b_k}$ and $C_{b_{m+1}}$ derived from the cluster partitioning process may still contain a mixture of multiple classes. Therefore, we perform further partitioning by recursively applying the proposed process to each cluster. Because the constraint in equation (12) is checked at every split, the recursive process does not cause overfitting and enables more detailed partitioning.

The proposed algorithm is performed on the initially formed m clusters. That is, we construct the final cluster set by sequentially performing SPLIT-CLUSTER on each element of the cluster set $\mathbb{C}$ (algorithm 1). The resulting centroid of each cluster is defined as an anchor point for metric learning, solving the traditional problem of setting the number of bases.

Algorithm 1. SPLIT-CLUSTER
Input: cluster set $\mathbb{C}$, cluster number k
Output: new cluster set $\mathbb{C}$
1. Project instances from $C_{b_k}$ onto the $w_{b_k}$ axis
2. Calculate the border value $\left(\mu_{1}+\mu_{2}\right)/2$ using a Gaussian mixture model
3. Split the cluster into $C_{b_{m+1}}, \ C_{b_{k}}$
4. if the instance rates of both $C_{b_{m+1}}$ and $C_{b_{k}}$ exceed $\epsilon$ then
5.   Update cluster set $\mathbb{C}$, total number of cluster $m \gets m+1$
6.   $\mathbb{C} \gets$ SPLIT-CLUSTER($\mathbb{C}$, k)
7.   $\mathbb{C} \gets$ SPLIT-CLUSTER($\mathbb{C}$, m)
8. end if
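The sketch below is one possible realization of SPLIT-CLUSTER with scikit-learn, not the authors' implementation; the function name split_cluster and the parameter eps_count (a minimum sub-cluster size standing in for the rate $\epsilon$ of equation (12), assumed to be at least 1) are illustrative, and the anchor points would afterwards be taken as the centroids of the returned sub-clusters.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.mixture import GaussianMixture

    def split_cluster(X_c, y_c, eps_count):
        """Recursively split one cluster along its most discriminative LDA axis.
        Returns a list of (X, y) sub-clusters; eps_count is the minimum number of
        instances each part must keep for a split to be accepted."""
        # A cluster that is pure, or too small to yield two valid parts, is left as is.
        if len(np.unique(y_c)) < 2 or len(y_c) < 2 * eps_count:
            return [(X_c, y_c)]

        # 1. Project the instances onto the cluster's most discriminative LDA axis.
        x_tilde = LinearDiscriminantAnalysis(n_components=1).fit_transform(X_c, y_c).ravel()

        # 2. Fit a two-component Gaussian mixture to the projections and take the
        #    midpoint of the two component means as the split threshold.
        gmm = GaussianMixture(n_components=2, random_state=0).fit(x_tilde.reshape(-1, 1))
        mu1, mu2 = np.sort(gmm.means_.ravel())
        threshold = (mu1 + mu2) / 2.0

        # 3. Compute the means of the two groups on either side of the threshold
        #    and assign every instance to the nearer mean.
        below = x_tilde < threshold
        if below.sum() == 0 or (~below).sum() == 0:
            return [(X_c, y_c)]
        omega1, omega2 = x_tilde[below].mean(), x_tilde[~below].mean()
        to_new = np.abs(x_tilde - omega2) < np.abs(x_tilde - omega1)

        # 4. Weak-cluster constraint: keep the split only if both parts are large enough.
        if to_new.sum() < eps_count or (~to_new).sum() < eps_count:
            return [(X_c, y_c)]

        # 5. Recurse on both resulting sub-clusters.
        return (split_cluster(X_c[~to_new], y_c[~to_new], eps_count)
                + split_cluster(X_c[to_new], y_c[to_new], eps_count))

    # Anchor points are then the centroids of the returned sub-clusters:
    # anchors = [Xs.mean(axis=0) for Xs, _ in split_cluster(X_cluster, y_cluster, eps_count)]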

To derive the time complexity of SPLIT-CLUSTER, the complexity of each subprocess is as follows. LDA has a complexity of $O(Ndt+t^3)$, where N is the total number of instances, d is the dimension of an instance, and $t = \min(N,d)$ [29]. A Gaussian mixture model has a complexity of $O(Nd^2tk)$, where t here denotes the number of iterations of the expectation-maximization (EM) algorithm and k is the number of components [30, 31]. In the proposed algorithm, k = 2, which gives $O(Nd^2t)$. Finally, assuming no weak-cluster identification criterion, a mixed cluster can be split up to N times in the worst case, resulting in an overall time complexity of $O(N^2 d^2 t)$.

4. Experiment

4.1. Datasets

To evaluate the performance of the proposed method, we check whether the accuracy improves over the PLML algorithm and conduct comparison experiments with various metric learning algorithms. In all experiments, 20 datasets were used, ranging from datasets with relatively few instances and classes to relatively large ones [32]. By analyzing this wide range of datasets, we check whether the proposed method improves performance through efficient cluster partitioning even on datasets with few instances, and whether it can assign more sophisticated anchor points on structurally complex datasets. Table 1 summarizes the datasets used in the experiments. For datasets with a large number of features, metric learning is computationally intensive. Therefore, we applied principal component analysis (PCA) for dimensionality reduction to the following datasets: Optdigits, Texture, USPS, MNIST, and Isolet, retaining the components corresponding to 95% of the total variance. We also standardized the features of all datasets and unified their scale through normalization [4]. All classifiers are evaluated with five-fold cross-validation. In addition, to prevent analysis errors due to outlier values, five replicate experiments with the same settings are conducted and the average accuracy is reported.
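One possible reading of this preprocessing pipeline is sketched below with scikit-learn; fitting the scaler and PCA on the training fold only is our assumption, as is the helper name preprocess_folds.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.model_selection import StratifiedKFold
    from sklearn.preprocessing import StandardScaler

    def preprocess_folds(X, y, reduce_dim=False, n_splits=5, seed=0):
        """Yield standardized (and optionally PCA-reduced) train/test folds.
        PCA keeps the components explaining 95% of the variance and, to avoid
        leakage, both transforms are fitted on the training fold only."""
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train_idx, test_idx in skf.split(X, y):
            X_tr, X_te = X[train_idx], X[test_idx]
            scaler = StandardScaler().fit(X_tr)
            X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
            if reduce_dim:
                pca = PCA(n_components=0.95).fit(X_tr)
                X_tr, X_te = pca.transform(X_tr), pca.transform(X_te)
            yield X_tr, y[train_idx], X_te, y[test_idx]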

4.2. Visualization of trained metrics

To analyze the results of metric learning, we inspect some dimensions of the instances mapped by the trained metric matrix. Figure 2 visualizes the learned metric for the Wilt dataset. Based on the result in (b), metric learning has properly transformed the original data space into a space where instances of different classes can be distant from each other. Nevertheless, the misclassification probability of K-NN remains quite high, because most of the instances overlap at the boundary between the two classes. In contrast, in the space transformed by the proposed method, the instances belonging to class 2 are much more condensed. This is the result of learning a metric based on anchor points assigned from clusters that group adjacent instances of similar classes. Therefore, the probability of selecting a NN of the same class with the Euclidean distance is higher than in the space in (b).


Figure 2. Visualization of trained metrics for the Wilt dataset. (a) is the original data, (b) is the result of mapping with PLML's learned metrics, and (c) is the result of applying the proposed method. The blue circles represent instances of class 1, and the orange triangles represent instances of class 2.


4.3. Reliability of auto-tuned parameters

As shown in figure 1, the resulting performance of PLML is strongly affected by the number of basis metrics. Therefore, while the existing method requires a parameter tuning process, the proposed method omits it and automatically assigns the number of basis metrics. To ensure the reliability of the automated parameters, we compare the error rate of the automated parameters with the error rate distribution of the PLML algorithm as the number of basis metrics changes. To measure the result distribution of the PLML, we set the number of basis metrics from 1% to 20% of the training set for small datasets and 5% for large datasets. Figure 3 is a boxplot showing the distribution of error rates for each dataset.


Figure 3. Performance comparison against PLML error rate distribution. The range within the blue box represents the range from 25% to 75% of the total error rate results depending on the number of metrics setting, while the red line represents the median value. The black whisker indicates values within 1.5 times the IQR from the box boundary value, and values outside of it are considered outliers. The red + marker indicates the error rate of the PLML with our proposed method.


Based on the experimental results, several important points are observed. First, for most datasets, the error rate of PLML with our proposed method is much lower than the error rate distribution of the original PLML algorithm. For 16 datasets, the error rate of the proposed method is lower than the median of the PLML result distribution, and for 13 datasets it falls below the lower quartile of the distribution. This indicates that the derived parameters are very robust, even though we omitted a separate parameter tuning process. Second, the error rate decreased for some datasets whose performance did not change with the number of basis metrics. This shows that we have overcome the limitations of the existing anchor point assignment based on distance-based clustering and demonstrates the efficiency of the proposed algorithm.

4.4. Comparative analysis of original PLML

To check the performance change of PLML with the proposed algorithm, a comparison with the original PLML was performed. The number of basis metrics for the PLML was set to the number of classes in each dataset, and the remaining hyperparameters were set to default values. Figure 4 shows the results for all datasets. As expected from previous experiments, we can see that the proposed method improves performance for most datasets. In particular, there is a relatively large improvement in accuracy for datasets with a relatively small number of instances. This may be due to the additional allocation of anchor points considering class information, which makes the weighted linear combination more sophisticated than before.


Figure 4. Difference in classification accuracies. The y-axis shows the accuracy difference between the PLML with the proposed method and the original PLML, where a positive value means an increase in accuracy and a negative value means a decrease.


4.5. Comparative analysis of other algorithms

To verify the competitiveness of PLML with the proposed method, we conduct a comparison experiment with 11 metric learning algorithms where the hyperparameters were set to the default values specified by the authors.

  • Euclidean: apply Euclidean distance directly without performing metric learning.
  • SML [4]: derived from PLML, learning a single metric.
  • LMNN [2]: learn a metric so that NNs of the same class are closer while instances of different classes are farther away.
  • Neighbourhood components analysis (NCA) [5]: directly maximizes a stochastic variant of the leave-one-out K-NN score on the training set.
  • BoostMetric [12]: learns a distance metric using a boosting-based technique.
  • Information-theoretic metric learning (ITML) [3]: an information-theory-based technique that minimizes the LogDet divergence subject to linear constraints.
  • Geometric mean metric learning (GMML) [13]: approaches the task of learning a symmetric positive definite matrix by formulating it as a smooth, strictly convex optimization problem.
  • MMLMNN [2]: an extension of LMNN that learns a local linear transformation for each class while separating similar and dissimilar instances by larger margins.
  • Cluster-based multi metric learning (CBLML) [4]: derived from PLML, but the weight matrix W is not learned; a weight of 1 is assigned to the basis metric of the cluster containing the instance and 0 to the rest.
  • PLML [4]: the algorithm we want to improve in this paper.
  • CMML [24]: learning local metrics for each cluster, considering global regularization to preserve properties of other clusters.

Table 2 presents the overall experimental results for classification accuracy. The results show that the classification performance of K-NN with distance metric learning is significantly better than with the simple Euclidean distance. We also observe that multiple metric learning algorithms generally perform better than SML algorithms. In terms of average rank, PLML with the proposed method is the best, the CMML algorithm is second best, and the original PLML is third best. It is noteworthy that CBLML performs poorly despite being a multiple metric learning method; this is likely because its initial clusters are formed by simple distance-based clustering and its local metrics are learned without reflecting the properties of other clusters. MMLMNN outperforms LMNN on most datasets. In particular, there is a clear improvement on relatively large datasets, which implies that multiple metric learning is relatively more efficient for complex datasets.

Table 2. Classification accuracy of metric learning algorithms. The best value in each row is marked with an asterisk (*). SML, LMNN, NCA, BoostMetric, ITML and GMML are single metric learning methods; MMLMNN, CBLML, PLML and CMML are multiple metric learning methods. The bottom row lists the average rank over all datasets.

Dataset           Proposed (+PLML)  SML     LMNN    NCA     BoostMetric  ITML    GMML    MMLMNN  CBLML   PLML    CMML    Euclidean
Balance           95.87*            95.20   84.93   92.80   94.08        91.46   92.59   82.82   90.56   95.74   92.19   82.56
Car               74.80             73.87   74.89   72.19   74.95        73.31   75.45*  74.38   71.07   74.59   74.96   74.48
Glass             65.89             62.91   70.65   66.82   68.55        67.10   68.41   72.15*  65.05   65.24   65.60   71.67
Heart             80.30             80.00   80.29   80.00   81.11        82.22   80.22   82.32*  79.93   80.00   80.22   80.81
ILPD              66.49             66.39   66.91   62.87   65.77        64.35   65.14   67.27   67.77*  66.46   66.91   66.63
Isolet            95.25*            93.64   85.84   95.02   94.42        91.73   90.03   85.05   93.15   95.04   94.72   84.21
MNIST (10%)       93.60*            93.52   91.83   93.57   92.63        89.37   90.13   90.80   91.97   93.57   93.05   89.69
Movement_libras   81.17*            77.61   78.00   79.67   77.77        76.83   74.59   79.72   73.50   79.72   81.14   77.77
Newthyroid        90.05             89.95   89.20   92.10*  90.60        90.88   90.23   89.95   89.21   89.67   91.93   88.76
Optdigits         97.85*            97.78   97.02   96.87   96.56        97.13   97.36   97.22   96.89   97.75   97.85*  96.89
Pendigits (55%)   99.28             99.29   98.15   98.70   99.35        99.31   99.02   99.30   99.15   99.28   99.36*  98.02
Satimage          89.82*            89.34   87.89   89.38   89.46        89.04   89.16   89.10   89.36   89.58   89.45   87.89
Segment           95.72*            95.33   94.81   94.81   95.36        94.40   95.26   95.07   95.41   95.63   95.41   94.88
Texture           96.64             96.35   95.98   96.15   96.33        95.99   97.28*  96.52   96.18   96.64   96.58   95.42
USPS              92.41*            91.92   90.63   91.93   91.83        91.04   90.96   91.56   89.69   92.26   91.80   89.32
Vehicle           95.15             94.02   90.16   93.62   96.46*       93.35   89.35   92.15   94.44   95.10   95.01   89.35
Vowel             96.22*            94.32   95.83   94.32   95.33        92.54   93.85   95.39   94.32   95.57   96.09   94.66
Wilt              97.66             97.73   97.20   97.93   98.22        98.11   98.41*  97.16   97.60   97.63   97.80   97.21
Wine              97.41             96.96   96.62   96.63   96.71        97.08   97.90   94.72   95.17   97.53   99.43*  95.05
Winequality-red   57.72             5.00    58.06   58.00   55.78        57.46   58.81   58.92   51.34   52.41   58.76   59.06*
AVG RANKING       3.40*             6.95    8.10    6.9     5.20         7.80    6.65    6.65    8.45    4.95    3.85    8.95

4.6. Expansion to other multiple metric learning algorithms

Other algorithms that learn local metrics based on clustering algorithms can theoretically apply our proposed method. We applied the proposed method to CBLML and CMML and analyzed the changes in classification accuracy. Figure 5 shows the overall experimental results for classification accuracy.


Figure 5. Difference in accuracies of CMML and CBLML. The y-axis shows the accuracy difference between the original algorithm and the proposed method, with positive values indicating an increase in accuracy and negative values indicating a decrease. For each dataset, the blue graph on the left shows the results of CBLML, and the red graph on the right shows the results of CMML.


As CBLML omits the weighted linear combination, its accuracy is strongly influenced by the shape of the clusters formed. Accordingly, the performance changes and the accuracy improves for most of the datasets. Exceptionally, accuracy decreased significantly for Movement-libras. As this dataset contains few instances and many classes, the additional local metric assignment through cluster partitioning likely caused overfitting. In contrast, CMML learns local metrics based on a global metric, so the probability of overfitting is relatively low. As a result, the classification accuracy steadily increased for complex datasets owing to the formation of more refined clusters. However, some results show a relative decrease in accuracy compared to PLML for simple datasets. This suggests that, because PLML defines a local metric through a weighted linear combination of the basis metrics, the proposed cluster partitioning enables a more sophisticated approximation and is effective for more datasets.

5. Conclusion

In this study, we identify the importance of the basis decision in multiple metric learning and propose a new algorithm for efficiently assigning anchor points. For the initial clusters formed for anchor point assignment, LDA is used to determine whether multiple classes are mixed within a single cluster. Then, a Gaussian mixture model is applied to calculate the partition threshold and assign additional anchor points. Instead of the existing distance-based clustering method, we reflect the class information of each instance and set anchor points while considering local properties. With the proposed method, the parameter is set automatically, and the cost of parameter tuning can be significantly reduced. Experiments show that the results of applying the automated parameters on various datasets are superior to the error rate distributions of the existing methods. Furthermore, comparative experiments with various algorithms as well as PLML show that the proposed method outperforms them. Finally, the proposed method is shown to be scalable by analyzing the results of its application to related metric learning algorithms that use clustering techniques.

Acknowledgment

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. NRF-2022R1F1A1074228), and was also supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2023-RS-2023-00254592) grant funded by the Korea government (MSIT).

Data availability statement

The datasets that support the findings of this study are openly available at the following URL/DOI: https://archive.ics.uci.edu/.

Appendix

In the proposed method, to avoid the overfitting problem caused by excessive cluster partitioning, we constrain cluster partitioning through equation (12). Here, we demonstrate the efficiency of the formula by comparing the accuracy obtained with equation (12) against the accuracy obtained when $\epsilon$ is set manually. For the evaluation, we used three datasets with varying numbers of classes from each of the small and large dataset groups: ILPD, Vehicle, Vowel, Winequality-red, Satimage, and Isolet. The $\epsilon$ values for comparison were set manually in steps of 0.25, ranging from 0.25 to 7.0, and the comparison with the accuracy obtained from equation (12) is shown in figure A1.


Figure A1. Performance comparison with $\epsilon$ values. The x-axis represents the $\epsilon$ value and the y-axis represents the accuracy. The red line shows the accuracy of the original PLML algorithm, and the blue line shows the accuracy of our proposed method with different settings of the $\epsilon$ value. The black dashed line represents the $\epsilon$ value derived from equation (12).


From the results, we can see that the proposed formula efficiently calculates the $\epsilon$ value for most datasets. In particular, for datasets with many instances and classes, such as Winequality-red, Satimage, and Isolet, the $\epsilon$ value was set to a range that achieved relatively high accuracy. This shows that the rate of weak cluster identification can have a significant impact on the performance of the results, demonstrating the importance of reflecting the number of instances and classes in the dataset.
