Open Access | CC BY 4.0 license | Published by De Gruyter, December 2, 2023

Improved kernel density peaks clustering for plant image segmentation applications

  • Jiaze Bi, Pingzhe Zhang, Yujia Gao, Menglong Dong, Yongzhi Zhuang, Ao Liu, Wei Zhang and Yiqiong Chen (corresponding author)

Abstract

To better address the shortcomings of the k-means clustering method and the density peaks clustering (DPC) method in agricultural image segmentation, this work proposes a method for dividing points in a high-dimensional space, yielding a clustering method that separates crops from soil. In the DPC assignment process, if a point is divided incorrectly, a series of points may be assigned to a cluster unrelated to them. To address this problem, this study uses the decision graph to select the centroids and a Gaussian kernel to map the data to a high-dimensional space. Each centroid searches for the most relevant points in the high-dimensional space until a temporary boundary point is found, at which the first assignment strategy stops; the points that remain unclustered are then assigned to the correct cluster to complete the clustering. Experiments on multiple artificial datasets and UCI datasets show that the proposed method achieves a better clustering effect than other clustering methods, and it is finally applied to plant image segmentation.

1 Introduction

When researching crops, plants require different amounts of water and fertiliser under different conditions, so it is essential to accurately understand the current crop growth status. Image segmentation is an essential technology of image processing, and demand for it in agriculture is growing [1]. With the ongoing in-depth study of image segmentation technology, many machine learning methods have been applied in this field [2,3]. Clustering, which aims to divide similar points into the same groups, has been widely used in many fields [4]. The k-means method is a distance-based clustering method with fast convergence and robust versatility, making it one of the more effective clustering methods [4]. In agriculture, k-means is often used to segment crop morphological images. However, the colour of crops in the early stage of growth is often similar to the soil colour, which exposes the k-means method's shortcomings. The value of K in k-means must be set manually; the method then iterates from randomly chosen initial centres to find the cluster centres.

In many cases, this randomness biases the selection of cluster centres, and the k-means method has difficulty converging on non-convex data. Many domestic and foreign scholars have proposed improvements to address these two shortcomings. Arthur and Vassilvitskii [5] proposed the k-means++ method, which prefers points far from the existing centroids as the next centroid, optimising the selection of centre points. The authors of [6] used a kernel function to map all points to a high-dimensional feature space before clustering [7,8]; this kernel k-means approach can capture the non-linear structure in data.

The Fuzzy c-means algorithm is a data clustering method based on optimising an objective function. It is an unsupervised fuzzy clustering method that requires no human intervention during execution. Compared with the hard clustering of k-means, Fuzzy c-means provides more flexible clustering results. However, when the data sample set is large and has many features, the real-time performance of the algorithm is poor; the algorithm also falls easily into local minima and is sensitive to initial values [9].

In addition, clustering by fast search and find of density peaks is a density-based clustering method published by the Italian scholars Rodriguez and Laio in Science in 2014 [10]. Compared with the k-means method, density peak clustering (DPC) suits a wider range of datasets [11,12]. However, the method still has some shortcomings: (1) after the centroids are selected according to the decision graph, each remaining point is assigned to the class of its nearest neighbour with higher density. During this assignment, points close to a cluster centre are assigned correctly, but if a point is mistakenly divided into an unrelated cluster, a chain reaction leads to a series of points being allocated incorrectly [13,14]. (2) For some non-linear datasets, especially those with higher dimensions or overlapping point distributions, DPC cannot achieve satisfactory clustering results [15,16]. The gravitation-based density peaks clustering algorithm (GDPC) overcomes the limitations of DPC's anomaly detection [17], but when the dataset density is small and the distribution is irregular, its clustering effect degrades greatly. A kernel function implies a mapping from a low-dimensional space to a high-dimensional space, which can turn a linearly inseparable dataset in the low-dimensional space into a linearly separable one in the high-dimensional space [18,19]. Based on the above analysis, a new clustering method is proposed in this study. First, a decision graph is used to determine the centroids, as in DPC; then a Gaussian kernel (radial basis function, RBF) is used to map the data to a high-dimensional space; finally, starting from each centroid, the method searches for the most relevant points in the high-dimensional space until a temporary boundary point of the cluster is found, and then assigns the unclustered points to the cluster of the data point with the greatest similarity. In summary, the proposed method supports the selection of appropriate centroids and better solves the problem of clustering non-linear datasets. More importantly, it has been successfully applied to image segmentation at the beginning of plant growth to support agriculture.

The remainder of this study is organised as follows. Section 2 introduces related methods and background knowledge. Section 3 gives the details of our method. Section 4 describes the experimental results that verify the effectiveness of the method. Finally, Section 5 concludes this study.

2 Preliminaries

This section focuses on explaining the basic idea and clustering process of the density peak method and points out the problems of the method.

2.1 Density peak algorithm

As a newer density-based method, the density peak method is of practical value in clustering and can easily detect datasets of arbitrary shape. The DPC method puts forward two hypotheses: (1) a cluster centre has the largest local density compared to the surrounding points; (2) the distances between different cluster centres are relatively large. After the centre points are found, each remaining point is assigned to the class of the closest point with higher local density. In DPC, two quantities need to be calculated for each point: the local density ρ and the minimum distance δ to any point with higher density; the calculation methods follow the literature [10].

(1) $\rho_i = \sum_{j \in X,\, j \neq i} \chi(d_{ij} - d_c).$

Or use Gaussian kernel function for calculation:

(2) $\rho_i = \sum_{j \in X,\, j \neq i} \exp\left(-\frac{d_{ij}^2}{d_c^2}\right),$

where we assume the dataset is $X = \{x_1, x_2, \ldots, x_m\}$, with m denoting the number of data points, and $d_c \in \mathbb{R}$ is the cutoff distance. Following the literature [10], $d_c$ is chosen so that the average number of neighbours is about 1–2% of the total number of points; $d_{ij}$ denotes the Euclidean distance between $x_i$ and $x_j$. With $u = d_{ij} - d_c$, the characteristic function $\chi$ used in the literature [10] to calculate ρ is given by

(3) $\chi(u) = \begin{cases} 1, & u < 0, \\ 0, & u \geq 0. \end{cases}$

Under these conditions, δ can be obtained as

(4) $\delta_i = \min_{j \in X,\, \rho_j > \rho_i} d_{ij}.$

The magnitude of $\rho_i$ reflects how densely points surround $x_i$, and $\delta_i$ is defined as the minimum distance from $x_i$ to any other point with greater density. A large δ therefore generally indicates that no point of higher density lies in the vicinity. DPC assumes that points with both a larger δ and a larger ρ are more suitable as centroids. Note that the point with the largest local density in the dataset has no δ defined by equation (4); DPC treats its δ as a very large value, defaulting to infinity.

The centre points in DPC are selected manually from a decision graph in which ρ and δ serve as the x and y axes, respectively. Normally, the density of a cluster centre is greater than that of the adjacent data points, and no nearby point exceeds its local density, so its δ is also relatively large. DPC therefore assumes that both ρ and δ of a centre point exceed those of the non-centre data points, and only a small number of points satisfy this condition.

In general, the points distributed in the upper-right region of the decision graph can be roughly regarded as centroids.

Once the centroids are determined, the remaining data points are clustered: they are first sorted in descending order of density, and each point is then assigned to the class of its nearest neighbour with higher density.
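To make these quantities concrete, the following is a minimal NumPy sketch of how ρ, δ, and the centroid selection could be computed; the function names are ours, and the γ = ρ × δ ranking in `pick_centroids` is one common way to automate what the decision graph does by hand, not the paper's manual selection.

```python
import numpy as np

def dpc_rho_delta(X, dc_percent=2.0):
    """Compute the DPC local density rho and distance delta for each point."""
    m = X.shape[0]
    # Pairwise Euclidean distances.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Cutoff distance: the dc_percent percentile of all pairwise distances,
    # one reading of the 1-2% rule of thumb from the DPC paper.
    dc = np.percentile(d[np.triu_indices(m, k=1)], dc_percent)
    # Gaussian-kernel local density, equation (2); subtract the self-term.
    rho = np.exp(-(d / dc) ** 2).sum(axis=1) - 1.0
    # delta: minimum distance to any point of higher density, equation (4);
    # the globally densest point gets its maximum distance instead.
    delta = np.zeros(m)
    for i in range(m):
        higher = np.where(rho > rho[i])[0]
        delta[i] = d[i].max() if higher.size == 0 else d[i, higher].min()
    return rho, delta, dc

def pick_centroids(rho, delta, n_clusters):
    """Select the n_clusters points with the largest rho * delta."""
    return np.argsort(rho * delta)[-n_clusters:]
```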

2.2 The proposed hypothesis

Since the DPC method cannot determine the specific number of clusters from the decision graph alone, this study assumes that the number of clusters is known. In the following experiments, fixing the number of clusters ensures that it does not affect the clustering results, which makes the comparison more objective.

2.3 Existing problems

The above analysis shows that whether DPC uses the traditional Gaussian kernel density method or the method in the literature [10] to calculate all data points, the computation is very sensitive to the truncation distance $d_c$. The value of $d_c$ contributes significantly to the result, so its selection profoundly influences the final outcome. Compared with the correct division of points shown in Figure 1(a), one cluster may contain two centroids, resulting in the incorrect selection shown in Figure 1(b).

Figure 1: (a) The right division of points. (b) Wrong selection of dataset centre.

We have observed that the DPC method assumes each point should be assigned to the class of its nearest neighbour with higher density; the implementation sorts data points in descending order of local density to achieve this division. In our opinion, for many datasets this assumption causes a specific point to be divided into an unrelated cluster, after which a series of points are divided incorrectly.

3 The proposed method

According to Section 2, k-means and most of its improved methods assume that each point belongs to the cluster of the centroid closest to it. However, the relationships between points in most datasets are extremely complex, and not all datasets are spherical. The inconsistency between this assumption and reality causes problems for such clustering methods. Unlike some previous methods that focus on finding simple mappings between data points, we add the kernel method to the clustering method and solve the point division in a high-dimensional space.

Figure 2: Plant segmentation image based on the proposed method.

The proposed method is composed of two major steps:

  1. In the first step, we retain the DPC method’s process of generating decision graphs based on ρ and δ to select the dataset’s centroids.

  2. Using the RBF to map all the points into the high-dimensional space, a matrix containing similarity between each point can be obtained. Use the centroids obtained in step 1 as the initial value of the proposed method to realise the allocation of other points, and we will adopt two allocation strategies in this process. Finally, the proposed method is applied to agricultural image segmentation. The process of plant segmentation image based on the proposed method is shown in Figure 2.

The kernel function projects linearly inseparable data into a higher-dimensional space, making the data as linearly separable as possible in the new space. The kernel trick avoids the "curse of dimensionality" and greatly reduces the computational effort: the dimension n of the input space has no effect on the size of the kernel matrix. Theoretically, the kernel function approach can therefore handle high-dimensional inputs effectively.

3.1 RBF kernel function

In this work, we present a method of assigning points in high-dimensional space, and it can effectively solve some of the shortcomings of standard clustering methods. The kernel function is the critical factor in solving non-linear data, and it is also widely used in various research fields. The proposed method uses the idea of the kernel method, which mainly uses the dot product in the mapping space to represent the similarity measure, so in the solution process, it is not necessary to explicitly calculate the mapping of the data in the high-dimensional feature space. This work uses RBF to map data to high-dimensional feature space. The equation for RBF is

(5) $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right),$

where $\|x_i - x_j\|^2$ is the squared Euclidean distance between the two feature vectors and σ is a free parameter. The RBF $K(x_i, x_j)$, $i, j \in \mathbb{N}^+$, takes the two arguments $x_i$ and $x_j$; its value decreases with distance, and $1 \geq K(x_i, x_j) = K(x_j, x_i) > 0$, so it is a similarity measure. The kernel matrix K is obviously symmetric and non-negative, which means $K(x_i, x_j)$ can represent the similarity between $x_i$ and $x_j$: the closer $x_i$ and $x_j$ are to each other, the greater their similarity. As the radial basis width, a smaller σ gives a narrower function; its size determines how fast the RBF decreases, in other words, how quickly the function value decays away from the centre point.

In particular, since the dot product of x i and x j in the high-dimensional feature space is expressed in the form of a kernel function, RBF can extend the feature space to infinite dimensions, which is an excellent solution to the problem of non-linear data differentiation.
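As a minimal sketch, equation (5) can be applied directly to a precomputed distance matrix to obtain the similarity matrix used in the rest of the method; the function name and the default σ = 0.1 (the value used in Section 3.4) are our choices.

```python
import numpy as np

def rbf_kernel_matrix(d, sigma=0.1):
    """Map pairwise distances to RBF similarities, equation (5).

    d     : (m, m) matrix of Euclidean distances ||x_i - x_j||
    sigma : radial basis width; smaller sigma -> faster decay
    """
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2))
```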

3.2 First allocation method

First, for given indices i and j, let $X_i$ be the point with the greatest similarity to the centroid $C_n$ in the sample space Ω. Second, taking $X_i$ as the starting point, the method searches among the points not yet assigned to a cluster for the most similar point $X_j$ ($i \neq j$); this search is repeated for each $C_n$ in Ω. Finally, $X_j$ replaces $X_i$ as the starting point and the search continues. During this repeated maximum-similarity search, each point found is divided into the cluster of $C_n$, and the similarity value $S_k$ of the k-th search and the average similarity deviation $D_i$ in the sample space Ω are recorded.

(6) $S_k = \begin{cases} K(C_n, X_i), & k = 1, \\ K(X_i, X_j), & k > 1, \end{cases}$

(7) $D_i = \frac{S_k - S_{k-1}}{k}, \quad k > 1.$

If a temporary boundary point is found, the search stops and the first allocation method terminates; the second allocation method is then performed.

In this work, the temporary boundary point is defined as follows: the average similarity deviation $D_i$ must exceed a fraction $d_q$ of the mean local density of all the centroids in the sample space Ω, that is,

(8) $D_i > \frac{\sum_{c=1}^{n} \rho_{C_c}}{n} \times d_q,$

where n is the number of centroids and $d_q$ is a fixed parameter; the i-th point is then called a temporary boundary point.

The density of the centroids differs between datasets. When the centroid density is large, finding a temporary boundary point is difficult; when it is small, the process is simple. Experimentally, the results are optimal for $d_q = \frac{1}{10}$.
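The sketch below shows one possible reading of the first allocation method under the stated assumptions: the chain search of equation (6), the deviation of equation (7) taken as a magnitude, and the stopping rule of equation (8) with $d_q = 1/10$. All function and variable names are ours.

```python
import numpy as np

def first_allocation(K, centroids, rho, d_q=0.1):
    """Chain search from each centroid with temporary-boundary stopping.

    K         : (m, m) RBF similarity matrix (Section 3.1)
    centroids : indices of the cluster centres from the decision graph
    rho       : local densities, used for the threshold of equation (8)
    d_q       : fixed parameter; 1/10 was found optimal experimentally
    """
    m = K.shape[0]
    labels = np.full(m, -1)                     # -1 means "not yet assigned"
    threshold = rho[centroids].mean() * d_q     # right-hand side of eq. (8)
    for c_label, c in enumerate(centroids):
        labels[c] = c_label
        current, s_prev, k = c, None, 0
        while True:
            unassigned = np.where(labels == -1)[0]
            if unassigned.size == 0:
                break
            # Most similar still-unassigned point, equation (6).
            nxt = unassigned[np.argmax(K[current, unassigned])]
            k += 1
            s_k = K[current, nxt]
            # Average similarity deviation, eq. (7), read here as a magnitude.
            if s_prev is not None and abs(s_k - s_prev) / k > threshold:
                break        # nxt is a temporary boundary point; stop chain
            labels[nxt] = c_label
            current, s_prev = nxt, s_k
    return labels
```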

We will get a preliminary result after the first allocation method, as in Figure 3(b). These points are relatively important in the sample space Ω because they are closely related to each other. Figure 3(a) shows the search process for the temporary boundary point in the ideal case.

Figure 3: (a) Temporary boundary point concept map. (b) Example of the end of the first allocation method.

3.3 Second allocation method

At this stage, based on the obtained preliminary result, we compute the similarity $S_k = K(X_i, X_j)$ between every remaining point and every point in the preliminary result. We then filter out the remaining point $X_i$ that is most similar to a point in the preliminary result, check which cluster that most similar point belongs to, assign $X_i$ to that cluster, and update the result after the successful assignment. These steps are repeated until all points are assigned.

We will get a clustering result at the end of the second allocation method, as shown in Figure 4. The proposed method uses similarity to perform clustering in the execution of the second allocation method, and the proposed method effectively reduces the sensitivity to d c compared to DPC.
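A short sketch of the second allocation method, again under our naming assumptions: at each step the globally most similar (unassigned, assigned) pair is found, the unassigned point inherits that cluster, and the result is updated after every assignment as described above.

```python
import numpy as np

def second_allocation(K, labels):
    """Assign every remaining point to the cluster of its most similar
    already-assigned point, updating the result after each assignment."""
    labels = labels.copy()
    while (labels == -1).any():
        unassigned = np.where(labels == -1)[0]
        assigned = np.where(labels != -1)[0]
        # Similarities between every unassigned and every assigned point.
        sims = K[np.ix_(unassigned, assigned)]
        u, a = np.unravel_index(np.argmax(sims), sims.shape)
        labels[unassigned[u]] = labels[assigned[a]]
    return labels
```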

Figure 4: Example of the end of the second allocation method.

3.4 Procedure of the proposed method

The proposed method.
Inputs:
  m data points for clustering, $X = \{x_1, x_2, \ldots, x_m\}$; following the literature [10], $d_c$ is taken from the top 1–2% of all pairwise distances.
Outputs:
  labels of the data points $L = \{l_i\}, i \in X$.
1: $d_{ij} = \mathrm{DIST}(X_i, X_j),\ i, j \in X$
2: $\rho_i = \mathrm{RHO}(X_i, d_{ij}),\ i \in X$
3: $\delta_i = \mathrm{DELTA}(X_i, \rho_i),\ i \in X$
4: $C_n = \mathrm{CENTER}(\rho_i, \delta_i),\ i \in X$
5: $\sigma = 0.1$
6: kernel_matrix = RBF($d_{ij}$, $\sigma$) // $M$ = kernel_matrix
7: for $n = 1, \ldots, p$ do // p is the number of cluster centre points
8:   $M[n][n] = 0$
9: end for
10: $k = 0$
11: for each centroid $C_n$ do
12:   Find the point $X_i, i \in X$, with the maximum similarity in $M[n]$
13:   Take $X_i$ as the starting point and continue finding the maximum-similarity point $X_j, j \in X$, in the same way
14:   Set $k = k + 1$ and record $S_k$ for each search
15:   if $\frac{S_k - S_{k-1}}{k} > \frac{\sum_{c=1}^{n} \rho_{C_c}}{n} \times \frac{1}{10}$ then mark $X_j$ as a temporary boundary point and stop the search for $C_n$
16:   else continue the search
17: end for
18: // Second allocation: assign the points not yet divided into clusters
19: for each unassigned point $X_i$ do
20:   $M[m] = \mathrm{SORT}(M[m])$
21:   Assign the point to the cluster of the data point with the greatest similarity
22: end for
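Assuming the sketches from Sections 2.1, 3.1, 3.2, and 3.3, the procedure above could be driven end to end roughly as follows; the dataset and cluster count here are placeholders.

```python
import numpy as np

# Placeholder data and cluster count; any (m, n) dataset works the same way.
X = np.random.rand(300, 2)
n_clusters = 3

rho, delta, dc = dpc_rho_delta(X, dc_percent=2.0)       # lines 1-3
centroids = pick_centroids(rho, delta, n_clusters)      # line 4 (decision graph)
d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
K = rbf_kernel_matrix(d, sigma=0.1)                     # lines 5-6
labels = first_allocation(K, centroids, rho, d_q=0.1)   # lines 10-17
labels = second_allocation(K, labels)                   # lines 18-22
```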

3.5 The complexity of the proposed algorithm

The proposed method uses the RBF to map the data in this work. It is true that adding the kernel function raises the complexity slightly. However, testing shows that the complexity of the proposed algorithm is equivalent to that of similar algorithms, with no surge.

The complexity of the proposed algorithm is shown in Table 1, where N is the number of samples:

Table 1

Complexity of the proposed algorithm

Method Complexity
DPC O(N²)
RBF O(N²)
Total O(N²)

4 Experiment

To verify the effectiveness of the proposed method, this work tests clustering performance on synthetic and real datasets and compares the results of the proposed method with those of existing clustering methods, including k-means [4], k-means++ [5], kernel k-means [6], traditional DPC [10], Fuzzy c-means [9], and GDPC [17]. Accuracy (ACC), normalised mutual information (NMI), and adjusted Rand index (ARI) metrics are used to evaluate the clustering performance quantitatively.

The ACC evaluates the proportion of correctly labelled instances within the clustering results, assessing the consistency between the predicted values and the ground truth; its value lies in the interval [0,1]. The ACC can be obtained from Table 2 as:

(9) $\mathrm{ACC} = \frac{TP + TN}{TP + FN + FP + TN}.$

Table 2

Pair confusion matrix

                               Same cluster (predicted)   Not the same cluster (predicted)
Same cluster (actual)          True positive (TP)         False negative (FN)
Not the same cluster (actual)  False positive (FP)        True negative (TN)

The NMI measures the similarity between the clustering result and the correct partition of the point set, which can verify the robustness of the clustering; the results lie in the interval [0,1]. $P_A(a)$ and $P_B(b)$ denote the marginal probability distributions of A and B, and $P_{A,B}(a,b)$ denotes their joint probability distribution.

(10) $H(A) = -\sum_{a} P_A(a) \log P_A(a),$

(11) $H(B) = -\sum_{b} P_B(b) \log P_B(b),$

(12) $H(A,B) = -\sum_{a,b} P_{A,B}(a,b) \log P_{A,B}(a,b).$

Based on the above, the NMI can be given as follows:

(13) $\mathrm{NMI} = \frac{H(A) + H(B)}{H(A,B)}.$

The ARI measures the degree of agreement between two data distributions; the result lies in the interval [−1,1], and a negative value indicates that the clustering result is inconsistent with the actual situation. The closer each of the three evaluation indexes is to 1, the better the result. The ARI can be obtained from Table 2 as:

(14) $\mathrm{ARI} = \frac{2 \times (TP \times TN - FN \times FP)}{(TP + FN)(FN + TN) + (TP + FP)(FP + TN)}.$
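For reference, the three metrics can be computed directly from their definitions. The sketch below transcribes equations (9)–(14) over the pair confusion counts of Table 2 and is written for clarity rather than speed; libraries such as scikit-learn offer optimised equivalents.

```python
import numpy as np

def pair_counts(y_true, y_pred):
    """TP/FN/FP/TN over all point pairs (the pair confusion matrix, Table 2)."""
    tp = fn = fp = tn = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            same_true = y_true[i] == y_true[j]
            same_pred = y_pred[i] == y_pred[j]
            tp += same_true and same_pred
            fn += same_true and not same_pred
            fp += not same_true and same_pred
            tn += not same_true and not same_pred
    return tp, fn, fp, tn

def acc(tp, fn, fp, tn):                       # equation (9)
    return (tp + tn) / (tp + fn + fp + tn)

def ari(tp, fn, fp, tn):                       # equation (14)
    return 2 * (tp * tn - fn * fp) / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))

def nmi(y_true, y_pred):                       # equations (10)-(13)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    def entropy(p):
        p = p[p > 0]
        return -(p * np.log(p)).sum()
    pa = np.unique(y_true, return_counts=True)[1] / n
    pb = np.unique(y_pred, return_counts=True)[1] / n
    joint = np.array([[np.mean((y_true == a) & (y_pred == b))
                       for b in np.unique(y_pred)] for a in np.unique(y_true)])
    return (entropy(pa) + entropy(pb)) / entropy(joint.ravel())
```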

For the comparative methods, the k-means [4], k-means++ [5], and kernel k-means [6] methods use a maximum of 500 iterations on all datasets; the DPC threshold is set to the default value of 2%; and the kernel k-means method uses the RBF as its similarity function. The six comparison methods were each run 20–25 times on every dataset, and the optimal evaluation indices of each method within its parameter range were compared, together with the clustering renderings, against the result of the proposed method.

The experimental environment is an Intel® Core™ i5-9300H CPU @ 2.40 GHz with 8.00 GB of memory, running Microsoft Windows 10; the methods are implemented in Python 3.8.

The basic information of the artificial and UCI datasets is shown in Tables 3 and 4. The synthetic datasets, Spiral, Flame, Two moons, Spherical, and Zigzag, present some significant challenges for clustering methods, including irregular shapes, varied densities, and varied cluster sizes. The real datasets include Aggregation, Iris, Robot Navigation, Heart-Statlog, Dermatology, Seeds, Haberman, Ecoli, and Ionosphere. In conclusion, the diversity of the synthetic and real datasets used in our experiments helps to evaluate the validity of the proposed clustering method.

Table 3

Five synthetic datasets

Dataset Instances Number of features Number of classes
Spiral 312 2 3
Flame 238 2 2
Two moons 1,502 2 2
Spherical 1,146 2 4
Zigzag 1,002 2 3
Table 4

Nine UCI datasets

Dataset Instances Number of classes (N_c) Number of features
Aggregation 788 7 2
Iris 150 3 4
Robot navigation 5,456 4 25
Heart-Statlog 270 2 13
Dermatology 366 6 34
Seeds 210 3 7
Haberman 306 2 3
Ecoli 336 8 8
Ionosphere 351 2 34

For the proposed method and kernel k-means, the best value of σ is selected from {10, 0.1, 0.01, 0.001}. The number of clusters K for all k-means-style methods and for DPC is selected from 2 to 10. The detailed parameter configurations of the compared methods are determined from their corresponding references and are listed in Table 5.

Table 5

Detailed parameter configurations

Method Parameter setting
k-means Input N_c
k-means++ Input N_c
Kernel k-means Input N_c, σ
DPC d_c = 2% (point number), 2 < K < 10
Fuzzy c-means Input N_c
GDPC d_c = 2% (point number), 2 < K < 10
The proposed method d_c = 2% (point number), 2 < K < 10, σ

4.1 Artificial dataset

In this section, five artificial datasets are selected to verify the proposed method and to compare the clustering results of the seven clustering methods. Figure 5 shows the clustering effect diagrams of the seven methods on the Spiral, Flame, Two moons, Spherical, and Zigzag datasets. Tables 6–8 show the comparison of clustering results on the artificial datasets under the ACC, NMI, and ARI measures.

Figure 5: The clustering effects of seven methods on (a) the Spiral dataset, (b) the Flame dataset, (c) the Two moons dataset, (d) the Spherical dataset, and (e) the Zigzag dataset.

Table 6

Comparison of clustering results on artificial datasets with ACC measure

Dataset k-means k-means++ kernel k-means DPC Fuzzy c-means GDPC The proposed method
Spiral 0.3462 0.3462 0.4455 1 0.3397 1 1
Flame 0.8445 0.8445 0.8571 1 0.8445 0.7857 1
Two moons 0.7357 0.7357 0.7377 0.6518 0.5919 0.7583 1
Spherical 0.2329 0.2329 1 0.2696 0.2015 0.2818 1
Zigzag 0.4810 0.4810 0.4810 0.4870 0.4551 0.4351 1
Table 7

Comparison of clustering results on artificial datasets with NMI measure

Dataset k-means k-means++ kernel k-means DPC Fuzzy c-means GDPC The proposed method
Spiral 0.0004 0.0004 0.1445 1 0.0003 1 1
Flame 0.45 0.45 0.4807 1 0.4305 0.4102 1
Two moons 0.1774 0.1774 0.1859 0.2008 0.2333 0.2145 1
Spherical 0.3736 0.3736 1 0.4127 0.3293 0.4130 1
Zigzag 0.1881 0.1896 0.1895 0.0989 0.1924 0.1427 1
Table 8

Comparison of clustering results on artificial datasets with ARI measure

Dataset k-means k-means++ kernel k-means DPC Fuzzy c-means GDPC The proposed method
Spiral −0.006 −0.006 0.1445 1 −0.0062 1 1
Flame 0.4612 0.4612 0.4807 1 0.4727 0.3228 1
Two moons 0.2188 0.2188 0.1859 0.1525 0.1908 0.2659 1
Spherical 0.0931 0.0931 1 0.1409 0.3293 0.1495 1
Zigzag 0.1131 0.1144 0.1896 0.0364 0.1924 0.0713 1

The Spiral dataset is characterised by a spiral distribution composed of three ring-shaped clusters, which tests a method's ability to process ring datasets. The Flame and Two moons datasets are both generated from two classes of data; the purpose is to test whether a method can effectively distinguish the boundary points of different clusters. The Spherical and Zigzag datasets demonstrate the proposed method's ability to detect spherical clusters.

According to Figure 5, thanks to its improvements over the k-means clustering method and its optimisation of the clustering results, the proposed method can accurately divide the point sets. The k-means method divides the dataset into roughly equal parts of space and assigns each part to one category, so its clustering results are poor. The k-means++ method refines the selection of centroids within the k-means framework, but it does not improve the determination of each data point's cluster; like k-means, it cannot identify clusters with non-convex shapes well, and its clustering effect is equally poor. The kernel k-means method calculates the similarity between points using the RBF, but its selection of centroids is poor and it fails to achieve a better clustering effect. The DPC method divides points with similar density into clusters; since the edges of the two clusters in the Two moons dataset contain a few points with similar densities, an error in clustering one point causes another point with similar density to be clustered incorrectly. Fuzzy c-means chooses its centroids poorly: on Spiral the centroids are chosen wrongly, so a good clustering effect is not obtained. GDPC is improved based on DPC and can judge the number of clusters well, but its performance on densely distributed, cluttered datasets is not ideal, and it incorrectly assigns points to clusters they do not belong to. In contrast, the proposed method accurately obtains the clustering results on datasets of different shapes.

4.2 UCI dataset

We select nine real datasets with different dimensions for the experiments in this section. The real datasets, taken from the UCI machine learning repository, include Robot Navigation, Heart-Statlog, Dermatology, Seeds, Haberman, Iris, Ionosphere, Ecoli, and Aggregation. Table 4 presents a brief summary of their characteristics.

Each dataset was processed by the seven methods, and the evaluation indicators of the clustering results are recorded in Tables 9–11. The bold data in the tables are the optimal values for the corresponding dataset among the clustering results.

According to Tables 9–11, the proposed method is better than the k-means method and its improved versions in the NMI measure. The proposed algorithm outperforms the GDPC and Fuzzy c-means methods in all three metrics; GDPC matches the proposed algorithm only on the Aggregation dataset, where all three metrics are 1. The three evaluation results of DPC on the Aggregation dataset are likewise all 1, and the proposed method also performs well on this dataset, with all three evaluation indicators equal to 1. The Robot Navigation dataset has relatively many instances and dimensions, and the results obtained by the proposed method on it are better than the others: compared with the six comparison methods, ACC increases by about 30%, and NMI and ARI also improve significantly. Although kernel k-means also divides points in a high-dimensional space, it performs poorly here. These results further prove that the proposed method has better clustering ability in high-dimensional spaces. On the Seed dataset, we can observe that the proposed method still has some deficiencies, most likely caused by an error in determining the temporary boundary points of a cluster.

Table 9

Comparison of clustering results on UCI datasets with ACC measure

Dataset k-means k-means++ kernel k-means DPC Fuzzy c-means GDPC The proposed method
Robot navigation 0.4034 0.403 0.4173 0.3616 0.3564 0.4043 0.7428
Heart-statlog 0.5926 0.5926 0.6037 0.5519 0.6111 0.5481 0.6444
Dermatology 0.2709 0.2765 0.3436 0.1564 0.0446 0.3045 0.4385
Seed 0.8952 0.8952 0.9048 0.8857 0.8952 0.8857 0.8952
Haberman 0.5229 0.5229 0.6176 0.5719 0.5588 0.7353 0.7386
Iris 0.8933 0.8933 0.893 0.9067 0.8933 0.6733 0.9733
Ionosphere 0.7123 0.7123 0.7379 0.6781 0.7066 0.6439 0.8575
Ecoli 0.4226 0.4405 0.4464 0.625 0.4732 0.4405 0.6994
Aggregation 0.6561 0.6218 0.5444 1 0.7728 1 1
Table 10

Comparison of clustering results on UCI datasets with NMI measure

Dataset k-means k-means++ kernel k-means DPC Fuzzy c-means GDPC The proposed method
Robot Navigation 0.1143 0.1123 0.0708 0.0738 0.0847 0.0014 0.5764
Heart-statlog 0.0199 0.0199 0.0041 0.0002 0.0328 0.0119 0.2314
Dermatology 0.1077 0.1096 0.1517 0.0737 0.2433 0.0269 0.5688
Seed 0.6949 0.6949 0.7006 0.6982 0.6982 0.6893 0.7313
Haberman 0.0001 0.0001 0.0166 0.0067 0.0119 0.0072 0.011
Iris 0.7581 0.7581 0.7581 0.8057 0.7496 0.7235 0.9143
Ionosphere 0.1349 0.1349 0.2214 0.0838 0.1292 0.017 0.4146
Ecoli 0.5998 0.5997 0.5639 0.454 0.5265 0.0897 0.6132
Aggregation 0.8528 0.87678 0.7695 1 0.8453 1 1
Table 11

Comparison of clustering results on UCI datasets with ARI measure

Dataset k-means k-means++ kernel k-means DPC Fuzzy c-means GDPC The proposed method
Robot navigation 0.0583 0.0569 0.0272 0.0422 0.0648 −0.003 0.4575
Heart-statlog 0.0302 0.0302 0.0007 −0.001 0.0457 0.003 0.0772
Dermatology 0.0266 0.0278 0.0706 0.022 0.0011 −0.0046 0.322
Seed 0.7166 0.7166 0.7384 0.7027 0.7145 0.7026 0.7166
Haberman −0.0018 −0.0018 −0.0352 −0.0277 0.0112 0.0073 0.0187
Iris 0.7302 0.7302 0.7302 0.7592 0.7294 0.5657 0.9222
Ionosphere 0.1776 0.1776 0.2185 0.1232 0.1681 0.009 0.5005
Ecoli 0.4237 0.4598 0.3397 0.376 0.3599 0.0235 0.6385
Aggregation 0.7637 0.76 0.5502 1 0.7167 1 1

In general, the proposed method can obtain a more accurate clustering result on both high-dimensional and low-dimensional datasets.

4.3 Plant image segmentation

In this section, we selected the Leaf counting dataset published by Aarhus University, which can be used to segment images of small plants and help estimate their growth stages. In this experiment, we segment two images with different shooting angles, each of size 100 × 100, and compare the segmentation results of the proposed method with those of the k-means, k-means++, kernel k-means, DPC, Fuzzy c-means, and GDPC methods.
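As a rough sketch of how the pixel clustering is wired up (the file name, subsample size, and cluster count are placeholders, and the helper functions are the sketches from Section 3), one could proceed as follows. The full pairwise matrices for a 100 × 100 image would be 10,000 × 10,000, so the sketch subsamples the pixels for illustration.

```python
import numpy as np
from PIL import Image

# Hypothetical driver: cluster the RGB pixels of a 100 x 100 plant image.
img = np.asarray(Image.open("plant.png").convert("RGB")).astype(float) / 255.0
pixels = img.reshape(-1, 3)                # one 3-D colour point per pixel

# Subsampling keeps the m x m distance/similarity matrices manageable.
idx = np.random.default_rng(0).choice(len(pixels), size=2000, replace=False)
sample = pixels[idx]

rho, delta, dc = dpc_rho_delta(sample, dc_percent=2.0)
centroids = pick_centroids(rho, delta, n_clusters=3)
d = np.linalg.norm(sample[:, None, :] - sample[None, :, :], axis=-1)
K = rbf_kernel_matrix(d, sigma=0.1)
labels = second_allocation(K, first_allocation(K, centroids, rho))
# labels now holds a cluster id (e.g. plant vs. soil) per sampled pixel.
```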

Figures 6 and 7 show the decision graphs used by the proposed method to determine the centroids before image segmentation, with the centroids coloured red.

Figure 6: (a) The corresponding decision graphs. (b) Results of comparative experiments.

Figure 7: (a) The corresponding decision graphs. (b) Results of comparative experiments.

According to the decision graph, three centre points are selected when segmenting the small plant picture in Figure 6(b). Judging from the segmentation effect in the figure, the proposed method clearly outlines the plant and relatively completely clusters the pixels of the plant's main body. However, because each component of the RGB colour space is affected by brightness, the clustering in the middle of the small plants is not completely successful; still, compared with the results of the DPC and GDPC methods, the proposed method clusters the middle of the plants better. The other four comparison methods all merge part of the soil with the plants and struggle to correctly classify similar colours, resulting in poor image segmentation results.

Judging by the decision graph, the picture in Figure 7(b) can be divided into two categories, and the same number of clusters is input into the comparison methods. The proposed method accurately segments the soil and the small plant leaves, and the clustering of the plant's main-body pixels is better than that of the other methods. As the colour values of the plant body and the soil in Figure 7(b) are similar, the edges of each part obtained by the other six clustering methods are confused, and their image segmentation results are poor.

5 Conclusion

k-means and DPC are common and relatively effective clustering methods, while the Fuzzy c-means method and the improved DPC method, GDPC, are newer clustering methods proposed in recent years; all of them have shortcomings. In this work, we proposed a new strategy for dividing points in a high-dimensional space. The method retains DPC's way of selecting centroids; starting from each centroid, it searches for the most relevant points in the high-dimensional space until a temporary boundary point of the cluster is found, completing the first clustering, and then assigns the points not yet divided into clusters to the cluster of the data point with the greatest similarity. The proposed method can effectively segment crop images.

It is worth noting that in this work, the way we select cluster centres is similar to that of the DPC method. In the future, we will investigate an adaptive method to select appropriate cluster centres to replace the existing selection method. A common drawback of kernel functions is their slightly higher time complexity, which we will also optimise in future work.

  1. Funding information: The Open Fund of State Key Laboratory of Tea Plant Biology and Utilisation (SKLTOF20200116).

  2. Author contributions: The research presented in this paper is the result of collaborative efforts. Jiaze Bi and Pingzhe Zhang contributed significantly to all stages of the paper, including concept development, experimental design, data collection, data analysis, data visualization, and the drafting of the initial manuscript. Ao Liu, Wei Zhang, and Yujia Gao meticulously revised and polished the manuscript, enhancing the clarity and quality of the text. Menglong Dong and Yongzhi Zhuang provided assistance during the data collection phase. Yujia Gao contributed not only by providing the GPU resources and funding necessary for running the simulations but also by offering key suggestions to the research ideas proposed by Jiaze Bi and Pingzhe Zhang and played a facilitating role in the experimentation and analysis processes. Yiqiong Chen, as the corresponding author, was responsible for coordinating the research team’s efforts, external communication, and identifying critical errors in the experimental process, ensuring the smooth progression of the research. All authors participated in the review and approval of the final manuscript.

  3. Conflict of interest: No potential conflict of interest was reported by the authors.

  4. Data availability statement: All data generated or analysed during this study are included in this article.

References

[1] Saxena L, Armstrong L. A survey of image processing techniques for agriculture. Proceedings of Asian Federation for Information Technology in Agriculture. Perth: Australian Society of Information and Communication Technologies in Agriculture; 2014. p. 401–13.

[2] Guo R, Zhang L, Yang Z. Multiphase image segmentation model based on clustering method. 2021 IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC). IEEE; 2021. p. 1236–9. doi: 10.1109/IPEC51340.2021.9421074.

[3] Ma Z, Liu Z, Luo C, Song L. Evidential classification of incomplete instance based on k-nearest centroid neighbor. J Intell Fuzzy Syst. 2021;41(6):7101–15. doi: 10.3233/JIFS-210991.

[4] Jain AK. Data clustering: 50 years beyond k-means. Pattern Recognit Lett. 2010;31(8):651–66. doi: 10.1016/j.patrec.2009.09.011.

[5] Arthur D, Vassilvitskii S. k-means++: The advantages of careful seeding. Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms; 2007. p. 1027–35.

[6] Tripathy BK, Ghosh A, Panda GK. Kernel based K-means clustering using rough set. 2012 International Conference on Computer Communication and Informatics. IEEE; 2012. p. 1–5. doi: 10.1109/ICCCI.2012.6158827.

[7] Bergman S, Schiffer M. Kernel function and conformal mapping. Compositio Mathematica. 1951;8:205–49.

[8] Dhillon IS, Guan Y, Kulis B. Kernel k-means: spectral clustering and normalized cuts. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2004. p. 551–6. doi: 10.1145/1014052.1014118.

[9] Bezdek JC, Ehrlich R, Full W. FCM: The fuzzy c-means clustering algorithm. Comput Geosci. 1984;10(2–3):191–203. doi: 10.1016/0098-3004(84)90020-7.

[10] Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science. 2014;344(6191):1492–6. doi: 10.1126/science.1242072.

[11] Hou J, Zhang A, Qi N. Density peak clustering based on relative density relationship. Pattern Recognit. 2020;108:107554. doi: 10.1016/j.patcog.2020.107554.

[12] Likas A, Vlassis N, Verbeek JJ. The global k-means clustering method. Pattern Recognit. 2003;36(2):451–61. doi: 10.1016/S0031-3203(02)00060-2.

[13] Tao X, Guo W, Ren C, Li Q, He Q, Liu R, et al. Density peak clustering using global and local consistency adjustable manifold distance. Inf Sci. 2021;577:769–804. doi: 10.1016/j.ins.2021.08.036.

[14] Du M, Ding S, Jia H. Study on density peaks clustering based on k-nearest neighbors and principal component analysis. Knowl Based Syst. 2016;99:135–45. doi: 10.1016/j.knosys.2016.02.001.

[15] Cai J, Wei H, Yang H, Zhao X. A novel clustering method based on DPC and PSO. IEEE Access. 2020;8:88200–14. doi: 10.1109/ACCESS.2020.2992903.

[16] Rashid M, Zhang GZ, Bie RF, Dawood H, Ahmad H. Clustering by fast search and find of density peaks via heat diffusion. Neurocomputing. 2016;208:210–21. doi: 10.1016/j.neucom.2016.01.102.

[17] Jiang J, Hao D, Chen Y, Parmar M, Li K. GDPC: Gravitation-based density peaks clustering algorithm. Phys A: Stat Mech Its Appl. 2018;502:345–55. doi: 10.1016/j.physa.2018.02.084.

[18] He XQ. Multivariate statistical analysis. Beijing: China Renmin University Press; 2015. p. 41–5.

[19] Tan PN, Steinbach M, Kumar V. Introduction to data mining. Fan M, Fan HJ, translators. Beijing: Posts & Telecom Press; 2011. p. 327–77.

Received: 2022-03-14
Revised: 2022-08-19
Accepted: 2022-10-05
Published Online: 2023-12-02

© 2023 the author(s), published by De Gruyter

This work is licensed under the Creative Commons Attribution 4.0 International License.
