CN111506730A - Data clustering method and related device - Google Patents
Data clustering method and related device
- Publication number
- CN111506730A (application number CN202010305219.8A)
- Authority
- CN
- China
- Prior art keywords
- data
- data samples
- sample
- attribution
- iteration
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Abstract
The embodiment of the application discloses a data clustering method and a related apparatus. In the process of clustering N data samples, a computer device performs multiple computation iterations of the attraction degree and the attribution degree to determine the data samples that serve as cluster centers. In the t-th iteration, the attraction degrees among the N data samples can be calculated according to the similarities among the N data samples. While the attraction degrees are being calculated, first auxiliary arrays corresponding to the N data samples can be obtained through statistics; the first auxiliary array corresponding to the ith data sample records, for the t-th iteration, the sum of the effective attraction degrees of the other data samples in the N data samples with respect to the ith data sample. The first auxiliary array can be applied to the calculation of the attribution degrees among the N data samples in the t-th iteration, so that when the attribution degree of one data sample with respect to the ith data sample is calculated, the first auxiliary array can be used directly. The method improves the efficiency of data clustering.
Description
Technical Field
The present application relates to the field of data processing, and in particular, to a data clustering method and a related apparatus.
Background
Data clustering technology can classify data along various dimensions. For example, text data samples can be divided into multiple classes according to text similarity, so that the text similarity between text data samples in the same class is high.
When data clustering is performed with the related art, the cluster centers among the data samples can be determined only after the attraction degrees and attribution degrees among the samples have been calculated through multiple iterations, whereby the final clustering is realized.
The related art has low clustering computation efficiency and can hardly meet current data clustering requirements.
Disclosure of Invention
In order to solve the above technical problem, the present application provides a data clustering method and a related apparatus, which reduce the computational complexity of calculating the attribution degree and improve the efficiency of data clustering.
The embodiment of the application discloses the following technical scheme:
in one aspect, an embodiment of the present application provides a data clustering method, which includes multiple computation iterations for an attraction degree and an attribution degree of N data samples in a data clustering process of the N data samples, where the method is performed by a computer device, and the method includes:
in the t-th iteration, according to the similarity among the N data samples, calculating the attraction among the N data samples; the tth iteration is one of the multiple computing iterations;
counting first auxiliary arrays corresponding to the N data samples respectively; for the ith data sample among the N data samples, the first auxiliary array corresponding to the ith data sample is used to identify, in the t-th iteration, the sum of the effective attraction degrees of the other data samples in the N data samples with respect to the ith data sample;
and calculating the attribution degree among the N data samples in the t iteration according to the attraction degree among the N data samples and the first auxiliary array.
In another aspect, an embodiment of the present application provides a computer device, where the computer device includes a first computing unit, a statistical unit, and a second computing unit:
the first calculating unit is used for performing multiple calculation iterations aiming at the attraction degree and the attribution degree of the N data samples in the data clustering process of the N data samples, and calculating the attraction degree among the N data samples according to the similarity among the N data samples in the t-th iteration; the tth iteration is one of the multiple computing iterations;
the statistical unit is configured to count first auxiliary arrays corresponding to the N data samples respectively; for the ith data sample among the N data samples, the first auxiliary array corresponding to the ith data sample is used to identify, in the t-th iteration, the sum of the effective attraction degrees of the other data samples in the N data samples with respect to the ith data sample;
and the second calculating unit is used for calculating the attribution degree among the N data samples in the t iteration according to the attraction degree among the N data samples and the first auxiliary array.
In another aspect, an embodiment of the present application provides a computer device, where the computer device includes a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is used for executing the data clustering method according to the instructions in the program codes.
In another aspect, an embodiment of the present application provides a computer-readable storage medium, which is used for storing program codes, where the program codes are used for executing the above data clustering method.
According to the above technical solution, in the process of clustering N data samples, the computer device performs multiple computation iterations of the attraction degree and the attribution degree to determine the data samples that serve as cluster centers. In the t-th iteration, the attraction degrees among the N data samples can be calculated according to the similarities among the N data samples. While the attraction degrees are being calculated, the first auxiliary arrays corresponding to the N data samples can be obtained through statistics; the first auxiliary array corresponding to the ith data sample records, for the t-th iteration, the sum of the effective attraction degrees of the other data samples in the N data samples with respect to the ith data sample. The first auxiliary array can then be applied to the calculation of the attribution degrees among the N data samples in the t-th iteration, so that when the attribution degree of one data sample with respect to the ith data sample is calculated, the first auxiliary array can be used directly, without additionally traversing the effective attraction degrees of the other data samples with respect to the ith data sample. This reduces the computational complexity of the attribution calculation and improves the efficiency of data clustering.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Apparently, the drawings in the following description show merely some embodiments of the present application, and those skilled in the art may derive other drawings from these drawings without creative effort.
Fig. 1 is a schematic view of an application scenario of a data clustering method according to an embodiment of the present application;
fig. 2 is a flowchart of a data clustering method according to an embodiment of the present application;
FIG. 3 is a flow chart of an iterative method;
fig. 4 is a scene schematic diagram of a concurrent computation method provided in the embodiment of the present application;
fig. 5 is a scene schematic diagram of a concurrent computation method provided in the embodiment of the present application;
FIG. 6a is a block diagram of a computer device according to an embodiment of the present application;
FIG. 6b is a block diagram of a computer device according to an embodiment of the present application;
FIG. 6c is a block diagram of a computer device according to an embodiment of the present application;
FIG. 7 is a block diagram of a computer device according to an embodiment of the present application;
fig. 8 is a block diagram of a server according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described below with reference to the accompanying drawings.
When data clustering is performed with the related art, the cluster centers among the data samples can be determined only after the attraction degrees and attribution degrees among the data samples have been calculated through multiple iterations, whereby the final clustering is realized. Within one iteration of the related art, when the attribution degree of any data sample a with respect to a data sample b is calculated from the attraction degrees among the data samples, besides referring to the attraction degree of data sample a with respect to data sample b, the attraction degrees of the other data samples with respect to data sample b must also be accumulated. A large amount of extra data therefore has to be traversed when the attribution degrees among the data samples are calculated, which raises the computational complexity, lowers the efficiency of the clustering computation, and makes current data clustering requirements hard to meet.
Therefore, the embodiment of the application provides a data clustering method, and a first auxiliary array is introduced, so that when the attribution degree of one data sample relative to a target data sample is calculated, the attraction degree of other data samples relative to the target data sample does not need to be traversed additionally, the calculation complexity of calculating the attribution degree is reduced, and the efficiency of data clustering is improved.
First, the entity executing the embodiments of the present application is described. The data clustering method provided by the present application can be applied to a computer device, which may be a terminal device or a server. The terminal device may be, for example, a smart phone, a computer, a Personal Digital Assistant (PDA), a tablet computer, a Point of Sale (POS) terminal, a vehicle-mounted computer, or the like. The data clustering method can also be applied to a server, which may be an independent server, a server in a cluster, a cloud server, or the like.
In addition, it should be noted that the data clustering method provided by the embodiment of the present application can be applied to various scenarios in which data samples need to be clustered. In some embodiments, for example, the method may be applied to a corpus clustering scenario, where the data samples to be clustered may be corpus samples; the corpus samples are clustered along dimensions such as semantic similarity to obtain sets of semantically similar samples among the corpus samples. A semantic matching model can then be trained on the semantically similar corpus samples, which improves the model's performance.
For another example, the data clustering method may also be applied to a scene of node clustering, where the data samples to be clustered in the scene may be node samples, and the node samples may be data samples of different entities, such as user data samples, and a node set with similar entity characteristics may be obtained by clustering the node samples. For example, by clustering user data samples, a user set with close interests can be obtained, so that friend recommendation and the like are realized.
Similarly, the method can also be applied to various other scenarios whose functions rely on clustering, such as image classification and human-computer question answering systems; image clustering, semantic similarity mining, and the like improve image classification precision and the recall and accuracy of question answering systems.
Next, a data clustering method provided by the embodiment of the present application is introduced by taking a server as an execution subject and combining an application scenario of semantic clustering.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a data clustering method provided in an embodiment of the present application. Fig. 1 includes a server 101, and assuming that N corpus samples to be clustered are stored in the server 101, based on semantic information carried by the corpus samples, the server 101 may perform data clustering on the corpus samples by executing the data clustering method to obtain a sample set with similar semantics.
The data clustering method provided by the embodiment of the present application may be an Affinity Propagation (AP) clustering method. The AP clustering method computes the cluster centers of the data samples based on message passing among the data samples, thereby realizing data clustering; the attraction degree and the attribution degree described below correspond to the responsibility and availability messages of the AP algorithm.
The server 101 may determine the similarity between the N corpus samples in the data clustering process of the N corpus samples, so as to be used for calculating the attraction in the iterative process, where the similarity between the corpus samples does not change in the clustering process. In each iteration, such as the t-th iteration, performed by the server 101, the attraction and the attribution between the N corpus samples may be calculated and updated. The attractiveness and the attribution are messages transmitted among the corpus samples in the iterative process of the AP clustering method. After multiple iterations are completed, the clustering center among the N corpus samples can be determined according to the attraction degree and the attribution degree among the corpus samples, and the clustering center can be used for determining the corpus samples with semanteme close to the clustering center, so that semantic clustering is realized according to the clustering center.
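The final step described above, determining cluster centers from the accumulated attraction and attribution degrees, can be sketched in code. This is an illustration only: the criterion diag(A + R) > 0 and the assignment rule are the conventional AP choices, not quoted from this patent, and the function and variable names are our own.

```python
import numpy as np

def extract_centers_and_labels(A, R):
    """Pick cluster centers and assign samples after the AP iterations.

    Conventional AP criterion (an assumption, not quoted from the patent):
    sample k is a center when A(k, k) + R(k, k) > 0; every sample then
    joins the center maximizing the combined criterion in its row.
    """
    E = A + R
    centers = np.flatnonzero(np.diag(E) > 0)
    if centers.size == 0:
        return centers, None
    labels = centers[np.argmax(E[:, centers], axis=1)]
    labels[centers] = centers          # centers belong to themselves
    return centers, labels
```

Given converged A and R matrices, this returns the indices of the cluster centers and, for each sample, the index of the center it is assigned to.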
In order to facilitate understanding of the technical solutions provided in the embodiments of the present application, the following describes the similarity S, the attraction degree R, and the attribution degree A mentioned in the solutions. The ith and jth corpus samples mentioned hereinafter are different corpus samples among the aforementioned N corpus samples, i = 1, 2, ..., N; j = 1, 2, ..., N.
S(i, j) is the similarity of the ith corpus sample with respect to the jth corpus sample, and reflects the ability of the jth corpus sample to serve as the cluster center of the ith corpus sample.
R(i, j) is the attraction degree of the ith corpus sample with respect to the jth corpus sample, and reflects the degree to which the ith corpus sample is passively attracted by the jth corpus sample, that is, the objective degree to which the jth corpus sample is suited to become the cluster center of the ith corpus sample. R(i, j) may be determined from S(i, j).
A(i, j) is the attribution degree of the ith corpus sample with respect to the jth corpus sample, and reflects the degree to which the ith corpus sample actively tends to select the jth corpus sample as its cluster center, that is, the subjective degree to which the jth corpus sample becomes the cluster center of the ith corpus sample.
After the above descriptions about similarity, attraction, and attribution are completed, the following describes the technical solution of the present application.
In the embodiment of the present application, referring to fig. 1, in the t-th iteration, the server 101 may calculate R(i, j) according to the similarity S(i, j) among the N corpus samples. It should be noted that, unless otherwise specified, the R(i, j) and A(i, j) mentioned hereinafter refer to data calculated in the t-th iteration.
In the related art, when A(i, j) among the N corpus samples is calculated, as mentioned above, besides the R(i, j) of the ith corpus sample with respect to the jth corpus sample, the attraction degrees of the other corpus samples in the N corpus samples with respect to the jth corpus sample have to be traversed. Thus, in one iteration of the server 101, computing the N² values of A(i, j) among the N corpus samples requires, in addition to reading the R(i, j) between the N corpus samples, N² extra traversals over the attraction degrees among the N corpus samples, so the computational complexity of the attribution calculation in one iteration is O(N²).
In order to reduce the computational complexity of the attribution calculation, in the embodiment of the present application, the server 101 may, while calculating R(i, j) in the t-th iteration, count the first auxiliary array RSUM(k) corresponding to each of the N corpus samples, k = 1, 2, ..., N. RSUM(k) is a statistic of the t-th iteration and records, for the t-th iteration, the sum of the effective attraction degrees of the other corpus samples in the N corpus samples with respect to the kth corpus sample, that is, RSUM(k) = Σ_{i, i≠k} max{0, R(i, k)}, where an effective attraction degree is a non-negative attraction degree.
Thus, when the server 101 needs to calculate A(i, j), it only has to read the R(i, j) of the ith corpus sample with respect to the jth corpus sample and directly call RSUM(j), that is, the already-counted sum of the effective attraction degrees of the other corpus samples with respect to the jth corpus sample, with no additional traversal. Furthermore, in one iteration, the server 101 only needs to read the attraction degrees R(i, j) between the N corpus samples and the first auxiliary array to complete the calculation of A(i, j) among the N corpus samples, and the computational complexity of the attribution calculation is O(N). The method therefore lowers the computational complexity of the attribution calculation in each iteration by an order of magnitude, improving the efficiency of data clustering.
Next, a data clustering method provided in the embodiment of the present application will be described with a server as an execution subject.
In the data clustering method, the number of data samples for clustering is N, and N is a positive integer. In the data clustering process of the N data samples, a plurality of calculation iterations of the attraction degree and the attribution degree among the N data samples are included.
Before the iterations, the relevant parameters involved in the data clustering process can be preconfigured, including: a reference degree p, a damping coefficient λ, a maximum number of iterations, and a convergence iteration number (Convergence_iter). Here, p is the reference degree of a data sample to serve as a cluster center and adjusts the clustering scale; it is usually taken as the mean of the similarities among the data samples. λ is used to damp oscillation of the data during the iterations. Convergence_iter is a threshold on the number of iterations for which the computed cluster centers remain in a steady state.
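The parameters above drive the main loop. The sketch below is an illustration under our own naming: it folds the standard AP attraction and attribution updates (including the λ damping and the Convergence_iter stopping rule) into one function, and is not the patent's literal implementation. It assumes the reference degree p has already been placed on the diagonal of S.

```python
import numpy as np

def affinity_propagation(S, damping=0.5, max_iter=200, convergence_iter=15):
    """Minimal AP sketch. S is the N x N similarity matrix with the
    reference degree p already on its diagonal. Names are illustrative."""
    n = S.shape[0]
    R = np.zeros((n, n))            # attraction (responsibility)
    A = np.zeros((n, n))            # attribution (availability)
    history = []
    for _ in range(max_iter):
        # attraction update: R'(i,j) = S(i,j) - max_{k!=j}(A(i,k)+S(i,k))
        AS = A + S
        idx = np.argmax(AS, axis=1)
        best = AS[np.arange(n), idx]
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - best[:, None]
        R_new[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R_new + (1 - damping) * R
        # attribution update via RSUM(j) = sum_{k!=j} max(0, R(k,j))
        Rpos = np.maximum(R, 0.0)
        np.fill_diagonal(Rpos, 0.0)
        rsum = Rpos.sum(axis=0)
        A_new = np.minimum(0.0, np.diag(R)[None, :] + rsum[None, :] - Rpos)
        np.fill_diagonal(A_new, rsum)
        A = damping * A_new + (1 - damping) * A
        centers = tuple(np.flatnonzero(np.diag(A + R) > 0))
        history.append(centers)
        # stop once the centers have stayed unchanged for convergence_iter rounds
        if len(centers) and len(history) >= convergence_iter \
                and len(set(history[-convergence_iter:])) == 1:
            break
    centers = np.array(history[-1], dtype=int)
    if centers.size == 0:
        return centers, None
    labels = centers[np.argmax(S[:, centers], axis=1)]
    labels[centers] = centers
    return centers, labels
```

On well-separated data this typically finds one center per group; the damping weight on the new value matches the form of the patent's update formulas.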
Referring to fig. 2, this figure shows a flowchart of a data clustering method provided in an embodiment of the present application, and as shown in fig. 2, the method includes:
s201: in the t-th iteration, the attraction degree among the N data samples is calculated according to the similarity among the N data samples.
Wherein the t-th iteration is any one of the multiple calculation iterations.
In the embodiment of the present application, since the N data samples carry information whose similarity can be compared in certain respects, the similarity S(i, j) among the N data samples can be calculated.
The embodiment of the present application does not limit the representation of the similarity S(i, j); a suitable form can be chosen according to the actual scenario or different requirements. For example, the similarity between two data samples may be expressed as a distance between them; the similarity determined in this way is symmetric, that is, S(i, j) = S(j, i). The similarity between data samples may also be expressed in other forms, and the resulting similarity may be asymmetric, that is, S(i, j) ≠ S(j, i).
It should be noted that in the embodiment of the present application, S(i, i), R(i, i), and A(i, i) of each data sample with respect to itself may also be calculated. S(i, i) may be the initial degree to which the ith data sample serves as a cluster center, and may be set to the reference degree p. R(i, i) represents the objective degree to which the ith data sample evaluates itself as a cluster center, and A(i, i) represents the confidence of the ith data sample in itself as a cluster center. In addition, before the first iteration, R and A among the N data samples may be initialized, for example to 0.
It should be noted that, in the process of multiple iterations, the similarity between the data samples is fixed and unchanged, and will not change due to the iteration process.
In the t-th iteration, R between N data samples may be calculated based on S between the N data samples.
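As a concrete illustration of S201, one common symmetric choice of similarity is the negative squared Euclidean distance, with the reference degree placed on the diagonal. The function name and the median default below are our own assumptions, not prescribed by the patent.

```python
import numpy as np

def similarity_matrix(X, preference=None):
    """Build S(i, j) = -||x_i - x_j||^2, a symmetric similarity, and set
    S(i, i) to the reference degree p (defaulting to the median similarity)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    S = -d2
    if preference is None:
        preference = np.median(S)   # p tunes how many cluster centers emerge
    np.fill_diagonal(S, preference)
    return S
```

As the text notes, S stays fixed across all iterations; it is computed once before the loop begins.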
S202: and counting first auxiliary arrays corresponding to the N data samples respectively.
The first auxiliary array RSUM(i) corresponding to the ith data sample among the N data samples may be used to identify, in the t-th iteration, the sum of the effective attraction degrees of the other data samples in the N data samples with respect to the ith data sample, that is, RSUM(i) = Σ_{j, j≠i} max{0, R(j, i)}. The effective attraction degree described here is a non-negative attraction degree; the effective attraction degree of the ith data sample with respect to the jth data sample is max{0, R(i, j)}.
Thus, in the embodiment of the present application, the first auxiliary arrays corresponding to the N data samples may be counted in the process of calculating the attraction between the N data samples.
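In code, the statistics of S202 amount to a single pass over the attraction matrix. A minimal numpy sketch (the function name is ours) computing RSUM(i) = Σ_{j≠i} max{0, R(j, i)}:

```python
import numpy as np

def first_auxiliary_array(R):
    """RSUM(i): total effective (non-negative) attraction that the ith
    sample receives from all other samples, per the t-th iteration's R."""
    Rpos = np.maximum(R, 0.0)
    np.fill_diagonal(Rpos, 0.0)   # exclude j == i from the sum
    return Rpos.sum(axis=0)       # column i sums over rows j != i
```

In practice the array can be accumulated on the fly while each R(i, j) is being updated, so it adds no extra pass at all.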
S203: and calculating the attribution degree among the N data samples in the t iteration according to the attraction degree among the N data samples and the first auxiliary array.
In a specific implementation, A(i, j) among the N data samples in the t-th iteration can be calculated according to the following formulas (1) and (2):
A′(i, j) = min{0, R(j, j) + Σ_{k, k∉{i, j}} max{0, R(k, j)}}, i ≠ j; A′(j, j) = Σ_{k, k≠j} max{0, R(k, j)}; (1)
A(i, j) = λA′(i, j) + (1 − λ)A_{t−1}(i, j); (2)
wherein the sum Σ_{k, k∉{i, j}} max{0, R(k, j)} in formula (1) equals RSUM(j) − max{0, R(i, j)}, so it can be read directly from the first auxiliary array; A_{t−1}(i, j) in formula (2) is the attribution degree of the ith data sample with respect to the jth data sample in the (t−1)-th iteration.
Through the method, the attribution degree among the N data samples in the t iteration is obtained through calculation.
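S203 can then be written without any inner traversal. The sketch below uses our own naming; A′ follows the standard AP availability rule expressed through RSUM, and the damping matches the form of formula (2). It computes all attribution degrees from R and the first auxiliary array alone.

```python
import numpy as np

def update_attribution(R, A_prev, damping=0.5):
    """A'(i,j) = min(0, R(j,j) + RSUM(j) - max(0, R(i,j))) for i != j,
    A'(j,j) = RSUM(j); then A = damping*A' + (1-damping)*A_prev."""
    Rpos = np.maximum(R, 0.0)
    np.fill_diagonal(Rpos, 0.0)
    rsum = Rpos.sum(axis=0)                      # first auxiliary array
    A_new = np.minimum(0.0, np.diag(R)[None, :] + rsum[None, :] - Rpos)
    np.fill_diagonal(A_new, rsum)
    return damping * A_new + (1.0 - damping) * A_prev
```

Subtracting max{0, R(i, j)} from RSUM(j) removes the ith sample's own contribution, which is exactly the trick that avoids re-traversing the attraction column for every (i, j) pair.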
According to the above technical solution, in the process of clustering N data samples, the computer device performs multiple computation iterations of the attraction degree and the attribution degree to determine the data samples that serve as cluster centers. In the t-th iteration, the attraction degrees among the N data samples can be calculated according to the similarities among the N data samples. While the attraction degrees are being calculated, the first auxiliary arrays corresponding to the N data samples can be obtained through statistics; the first auxiliary array corresponding to the ith data sample records, for the t-th iteration, the sum of the effective attraction degrees of the other data samples in the N data samples with respect to the ith data sample. The first auxiliary array can then be applied to the calculation of the attribution degrees among the N data samples in the t-th iteration, so that when the attribution degree of one data sample with respect to the ith data sample is calculated, the first auxiliary array can be used directly, without additionally traversing the effective attraction degrees of the other data samples with respect to the ith data sample. This reduces the computational complexity of the attribution calculation and improves the efficiency of data clustering.
In the related art, when R(i, j) is calculated in the t-th iteration, besides referring to the S(i, j) of the ith data sample with respect to the jth data sample, the similarities between the ith data sample and the other data samples and the attribution degrees of the ith data sample with respect to the other data samples in the previous (t−1)-th iteration have to be traversed, to obtain the maximum of the sum of similarity and attribution degree of the ith data sample over the other data samples. That is, to compute the N² attraction degrees R(i, j) among the N data samples in the t-th iteration, N² additional traversals over the similarities between the data samples and the attribution degrees from the previous iteration are required. The computational complexity of calculating R(i, j) among the data samples in the related art is therefore O(N²), which affects the efficiency of data clustering.
For this purpose, in a possible implementation manner, in the process of calculating the attribution degree of the N data samples in the t-th iteration in S203, the method further includes: and determining second auxiliary arrays corresponding to the N data samples respectively according to the attribution degree.
The second auxiliary array corresponding to the ith data sample of the N data samples may be used to identify the maximum attribution information corresponding to the ith data sample and the data sample corresponding to the maximum attribution information in the t-th iteration.
The maximum attribution information corresponding to the ith data sample may be denoted ASmax1(i); ASmax1(i) is the maximum of the sum of the similarity and the attribution degree of the ith data sample with respect to the other data samples (including the ith data sample itself). The data sample corresponding to the maximum attribution information is recorded by its index, denoted ASmax1_index(i); the sum of the similarity and the attribution degree of the ith data sample with respect to the ASmax1_index(i)-th data sample attains that maximum. ASmax1(i) and ASmax1_index(i) are statistics collected during the t-th iteration and used to calculate the attraction degrees in the (t+1)-th iteration.
That is, in the process of calculating A(i, j) among the data samples in the t-th iteration, the maximum attribution information ASmax1(i) corresponding to the ith data sample and the index ASmax1_index(i) of the corresponding data sample can be counted.
Thus, when R(i, j) among the N data samples is calculated in the (t+1)-th iteration, if the jth data sample is not the data sample counted in the t-th iteration as ASmax1_index(i) (that is, j ≠ ASmax1_index(i)), then max_{k≠j}{A(i, k) + S(i, k)} = ASmax1(i), and R(i, j) in the (t+1)-th iteration can be calculated by reading the similarity between the ith and jth data samples and directly calling the counted ASmax1(i), through the following formulas (3) and (4), without traversing the attribution degrees of the ith data sample with respect to the other data samples in the t-th iteration:
R′(i, j) = S(i, j) − max_{k, k≠j} {A(i, k) + S(i, k)}; (3)
R(i, j) = λR′(i, j) + (1 − λ)R_{t−1}(i, j); (4)
R_{t−1}(i, j) in formula (4) is the attraction degree of the ith data sample with respect to the jth data sample in the (t−1)-th iteration.
Furthermore, in one iteration, the calculation of R(i, j) can be completed by reading only the S(i, j) of the N data samples with respect to the other data samples and the second auxiliary array, and the computational complexity of the attraction calculation is reduced to O(N). The method lowers the computational complexity of the attraction calculation by an order of magnitude and improves the efficiency of data clustering.
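A numpy sketch of this attraction update (our naming; ASmax1, ASmax1_index, and ASmax2 are the second and third auxiliary statistics described in the text, and the damping matches the form of formula (4)):

```python
import numpy as np

def update_attraction(S, A_prev, R_prev, damping=0.5):
    """R'(i,j) = S(i,j) - max_{k!=j}(A(i,k)+S(i,k)), realized with
    ASmax1(i), ASmax1_index(i), ASmax2(i) so no row is re-traversed."""
    n = S.shape[0]
    AS = A_prev + S
    idx = np.argmax(AS, axis=1)                  # ASmax1_index(i)
    asmax1 = AS[np.arange(n), idx]               # ASmax1(i)
    AS[np.arange(n), idx] = -np.inf
    asmax2 = AS.max(axis=1)                      # ASmax2(i)
    R_new = S - asmax1[:, None]                  # case j != ASmax1_index(i)
    R_new[np.arange(n), idx] = S[np.arange(n), idx] - asmax2
    return damping * R_new + (1.0 - damping) * R_prev
```

The two precomputed row maxima replace the per-entry max in formula (3): every column except the argmax column uses ASmax1(i), and the argmax column falls back to ASmax2(i).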
In addition, in a possible implementation manner, in the process of calculating the attribution degree of the N data samples in the t-th iteration in S203, the method further includes: and determining third auxiliary arrays corresponding to the N data samples respectively according to the attribution degree.
The third auxiliary array corresponding to the ith data sample is used to identify the second-largest attribution information corresponding to the ith data sample in the t-th iteration. The second-largest attribution information corresponding to the ith data sample is the second-largest value of the sum of the similarity and the attribution degree of the ith data sample with respect to the other data samples (including the ith data sample itself), denoted ASmax2(i).
That is, in the process of calculating A(i, j) among the data samples in the t-th iteration, ASmax2(i) can also be counted.
Thus, when R(i, j) among the N data samples is calculated in the (t+1)-th iteration, if the jth data sample is the data sample counted in the t-th iteration as ASmax1_index(i) (that is, j = ASmax1_index(i)), the attraction degree R(i, j) in the (t+1)-th iteration can be calculated by reading the similarity between the ith and jth data samples and calling the counted ASmax2(i), through the above formulas (3) and (4), without additionally traversing the attribution degrees of the ith data sample with respect to the other data samples.
With this approach, when R(i, j) among the data samples is calculated in the (t+1)-th iteration, the computational complexity of calculating the attraction degree of the ith data sample with respect to the ASmax1_index(i)-th data sample is also reduced, which further improves the efficiency of data clustering.
It should be noted that the embodiments of the present application do not limit the manner in which the N² similarities among the N data samples, and the attraction and attribution degrees in each iteration, are represented. The data between the data samples can be represented in a suitable manner according to the actual situation or different requirements.
In one possible implementation, the similarity S, the attraction R, and the attribution a among the data samples may be represented in the form of a matrix.
The similarities among the N data samples may be represented by a similarity matrix, in which the ith row represents S(i, j) between the ith data sample and the other data samples (including the ith data sample itself). For example, the 1st row of the similarity matrix represents the similarities between the 1st data sample and the data samples, namely S(1,1), ..., S(1, j), ..., S(1, N).
Similarly, the attraction degrees among the N data samples may be represented by an attraction matrix, in which the ith row represents R(i, j) between the ith data sample and the other data samples (including the ith data sample itself).
The attribution degrees among the N data samples may likewise be represented by an attribution matrix, in which the ith row represents A(i, j) between the ith data sample and the other data samples (including the ith data sample itself).
The similarity, the attraction degree and the attribution degree among the N data samples are represented in a matrix form, so that data traversal is facilitated, and the data reading efficiency in the iterative process is improved.
It will be appreciated that in an actual scenario, some data samples may be similar in some respects, while others may be completely dissimilar. Since the messages in the AP clustering algorithm are transmitted globally, the similarity between some data samples may be high while the similarity between others may be low. When S(i, j) of the i-th data sample relative to the j-th data sample is close to 0, the R(i, j) calculated by the above equations (3)-(4) is less than 0, so the corresponding effective attraction degree is 0. In this case, whether or not R(i, j) is calculated, the value of the first auxiliary array Rsum(j) of the j-th data sample is not affected. In addition, when S(i, j) is close to 0, the A(i, j) calculated by the above formulas (1)-(2) satisfies A(i, j) ≤ 0, so that A(i, j) + S(i, j) is close to 0. In this way, whether or not A(i, j) is calculated, the maximum attribution information ASmax1(i) in the second auxiliary array corresponding to the i-th data sample and the second-largest attribution information ASmax2(i) in the third auxiliary array are not affected.
That is, when S(i, j) is low, whether or not the corresponding R(i, j) and A(i, j) are calculated does not affect the attraction and attribution results among the other data samples. Moreover, when S(i, j) is low, both the corresponding R(i, j) and A(i, j) are low, so the corresponding data sample does not become a clustering center.
It can be seen that ignoring the R and A calculations for data samples with very low similarity does not affect the final data clustering result; for such data samples, the message-passing computation is redundant.
Thus, in one possible implementation, the method further includes:
s301: and determining the target position with the similarity smaller than a threshold value in the similarity matrix.
The threshold may be used to determine similarity that does not need to be calculated for the attraction degree and the attribution degree, and when it is determined that the similarity is smaller than the threshold, the attraction degree and the attribution degree of the corresponding data sample may not be calculated.
In the embodiment of the present application, when the similarity in the similarity matrix is smaller than the threshold, the position of the similarity in the similarity matrix may be determined as the target position.
S302: and setting the numerical value of the target position in the similarity matrix, the attraction matrix and the attribution matrix to be null.
It will be appreciated that the data at the target location in the similarity, attraction and attribution matrices all correspond to the same data sample. That is, assuming that the target position is the ith row and the jth column, the corresponding data at the target position in the three matrices are the similarity, attraction and attribution of the ith data sample relative to the jth data sample, respectively.
Wherein the attraction matrix and the attribution matrix may be the attraction matrix and the attribution matrix in each iteration.
After the numerical values of the target positions in the similarity matrix, the attraction matrix and the attribution matrix are set to be null, the aim of avoiding the iterative calculation of R and A on the part of data at the target positions is achieved.
In a specific implementation, after converting the original dense matrix (a matrix in which most elements are non-zero) into a sparse matrix (a matrix in which most elements are zero), the matrix may be stored in the Compressed Sparse Row (CSR) format.
In the method, the dense matrix is converted into a sparse matrix by setting the target positions in the matrix to null, so that the space complexity is reduced from O(N²) to O(M), where M is the number of non-zero elements in the matrix after data processing, saving the memory space for storing the matrix; if M << N², the storage space can be greatly reduced. In addition, in each iteration, calculation of the attraction degree and the attribution degree at the target positions in the matrix is avoided, the time complexity of the calculation is reduced to O(M), and the efficiency of data clustering is improved.
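A minimal sketch of steps S301-S302 in Python, without external libraries: entries below the threshold are treated as null and simply not stored, which yields the three standard CSR arrays (values, column indices, row pointers). The function name and the threshold value in the usage below are illustrative, not from the patent.

```python
def dense_to_csr(dense, threshold):
    """Convert a dense similarity matrix to Compressed Sparse Row form,
    dropping (i.e. setting to null) every entry below `threshold`.
    Only the M surviving entries are stored, so space falls from
    O(N^2) to O(M)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for j, value in enumerate(row):
            if value >= threshold:       # keep only non-null target positions
                data.append(value)
                indices.append(j)
        indptr.append(len(data))         # running count closes each row
    return data, indices, indptr
```

For example, `dense_to_csr([[0.9, 0.1, 0.0], [0.05, 0.8, 0.6]], 0.5)` keeps only the three entries at or above 0.5, so the iteration can later skip the nulled positions entirely.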
In the related art, for the case of representing S, R and A between data samples in matrix form, refer to fig. 3, which shows a flowchart of an iterative method. As shown in fig. 3, when calculating R(i, j) of the i-th data sample relative to the other data samples in the t-th iteration, in addition to referring to S(i, j) of the i-th data sample relative to the other data samples, i.e., the row data of the i-th data sample in the similarity matrix (outlined by a gray frame), the attribution degrees of the i-th data sample relative to the other data samples in the previous (t-1-th) iteration, i.e., the row data corresponding to the i-th data sample in the attribution matrix of the t-1-th iteration (outlined by a gray frame), need to be traversed to obtain the maximum value of the sum of the similarity and the attribution degree of the i-th data sample relative to the other data samples, thereby realizing the calculation of the attraction degree.
That is to say, when the attraction degree of the ith data sample relative to other data samples is calculated in the t-th iteration, the attraction degree can be obtained by traversing the row data corresponding to the ith data sample in the similarity matrix and the row data of the ith data sample in the t-1-th iteration attribution degree matrix.
When calculating a (i, j) of the ith data sample relative to the jth data sample, it is necessary to traverse the column data of the jth column in the attraction matrix of this iteration, i.e., R (i, j) of other data samples relative to the jth data sample (outlined by a gray line frame), to obtain the sum of the effective attractions of other data samples relative to the jth data sample, thereby realizing the calculation of the attribution degree. For example, when calculating a (i,1), it is necessary to traverse the column data of the 1 st column in the attraction matrix; when calculating a (i, N), it is necessary to traverse the column data of the nth column in the attraction matrix. That is, when calculating the attribution degree of the ith data sample relative to other data samples, not only the row data corresponding to the ith data sample in the attraction degree matrix needs to be traversed, but also each column data in the attraction degree matrix needs to be traversed.
When A(i, j) of the i-th data sample relative to the other data samples is calculated in each iteration, each column of data needs to be traversed over the complete attraction matrix, so the attraction matrix cannot be split to concurrently calculate the attribution degrees of part of the data samples relative to the other data samples.
In an embodiment of the present application, in one possible implementation, a way to concurrently calculate the attraction and the attribution in each iteration is provided.
First, it should be noted that the three matrices used for concurrent computation may be either the dense matrices without the above data processing or the sparse matrices obtained through the data processing, which is not limited in this application.
In the embodiment of the present application, the N data samples may be divided into M sample groups, where M < N. Each sample group may include one or more data samples, thereby enabling concurrent computation of the M sample groups, respectively. The concurrent computation referred to in the present application may refer to a method of splitting data into multiple parts, so as to perform computation on the parts in parallel in different threads. The efficiency of calculating A and R among all data samples can be improved by performing concurrent calculation on A and R corresponding to the data samples in each data group.
In this way, in the t-th iteration, the method for calculating the attraction degree between N data samples according to the similarity degree between N data samples in the above S201 includes:
and (4) concurrently calculating the attraction degree of the M sample groups. The attraction between the data samples in the sample group and the N data samples can be calculated according to the row data corresponding to the data samples in the sample group in the similarity matrix.
It can be understood that, in the t-th iteration, when one data sample in the sample group is calculated to be the attraction degree of the ith data sample relative to other data samples, the similarity degree of the ith data sample relative to other data samples can be obtained through the line data corresponding to the ith data sample in the similarity matrix, so that the attraction degree of the ith data sample relative to other data samples can be calculated.
In S203, calculating the attribution degree between the N data samples in the t-th iteration according to the attraction degree between the N data samples and the first auxiliary array, including:
in the t-th iteration, the attribution degree is calculated for the M sample groups concurrently. The attribution degree between the data samples in the sample group and the N data samples can be calculated according to the corresponding row data of the data samples in the sample group in the attraction degree matrix and the first auxiliary array.
The first auxiliary array corresponding to the data sample in each sample group is counted in advance, and the first auxiliary array is the sum of the effective attraction degrees of other data samples relative to the data sample, so that when the attribution degree of the ith data sample relative to the jth data sample in the sample group is calculated, the first auxiliary array corresponding to the jth data sample is directly called to calculate the attribution degree, and the sum of the effective attraction degrees of other data samples relative to the jth data sample does not need to be determined through column traversal.
That is, by introducing the first auxiliary array, it is not necessary to perform a column traversal of each column of the attraction matrix when calculating the attribution degree of one data sample relative to the other data samples.
In the method, when the attraction degree and the attribution degree of the i-th data sample relative to the other data samples are calculated, only the first auxiliary arrays corresponding to the N data samples and the data corresponding to the i-th data sample in the similarity matrix, the attraction matrix and the attribution matrix need to be traversed, without traversing the other data in the matrices, so that the attraction degrees and attribution degrees among different data samples can be calculated concurrently.
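As a sketch, the row-local attribution update enabled by the first auxiliary array might look as follows in Python. The min(0, ·) form and the self-attribution A(j, j) are assumptions based on standard affinity propagation (formulas (1)-(2) are not reproduced in this excerpt); what the text guarantees is only that Rsum(j), the precomputed sum of effective attractions toward sample j, replaces the column traversal.

```python
import numpy as np

def attribution_row(R, Rsum, i):
    """Compute A(i, :) using only row i of R plus the first auxiliary
    array Rsum, where Rsum[j] = sum over k != j of max(0, R[k, j])
    (the "sum of effective attractions" toward j)."""
    N = R.shape[0]
    A_row = np.empty(N)
    for j in range(N):
        if i == j:
            A_row[j] = Rsum[j]    # self-attribution: total effective attraction
        else:
            # subtract i's own contribution so the sum excludes k in {i, j}
            A_row[j] = min(0.0, R[j, j] + Rsum[j] - max(0.0, R[i, j]))
    return A_row
```

Each call touches only R[i, :], the diagonal, and Rsum, so no column of the attraction matrix is ever scanned.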
It should be noted that the number of concurrent calculations performed on the attraction degree and the attribution degree is not limited in the embodiments of the present application, and in a specific implementation, the number may be determined according to the computing capability of the computer device. On the multi-core processor, a plurality of thread pools can be created, and the similarity matrix, the attraction matrix and the attribution matrix are uniformly distributed into each thread, so that concurrent calculation of the attraction, the attribution and the first auxiliary array, the second auxiliary array and the third auxiliary array can be performed.
For the following description of the concurrent computation mode, refer to fig. 4, which shows a schematic view of a scenario of the concurrent computation mode provided in the embodiment of the present application. As shown in fig. 4, in the t-th iteration, it is assumed that there are 100 data samples, and the corresponding similarity matrix and the attribution matrix of the t-1-th iteration are both 100 × 100 matrices. The 100 data samples may be divided into 3 (that is, M = 3) sample groups, namely sample group 1, sample group 2 and sample group 3. Sample group 1 includes the first 30 data samples; sample group 2 includes the 31st to 70th data samples; sample group 3 includes the last 30 data samples.
Then, as shown in the attraction calculation procedure corresponding to the thin solid line in fig. 4, the second auxiliary arrays (ASmax1(1)-ASmax1(100) and ASmax1_index(1)-ASmax1_index(100)) and the third auxiliary arrays (ASmax2(1)-ASmax2(100)) corresponding to each data sample may be determined from the similarity matrix and the attribution matrix of the t-1-th iteration. The row data corresponding to the first 30 data samples of the similarity matrix, together with the corresponding second and third auxiliary arrays (ASmax1(1)-ASmax1(30), ASmax1_index(1)-ASmax1_index(30), and ASmax2(1)-ASmax2(30)), may be input into thread 1 to calculate the attraction degrees of the first 30 data samples relative to the 100 data samples.
Similarly, the row data corresponding to the 31st to 70th data samples of the similarity matrix, together with the corresponding second and third auxiliary arrays (ASmax1(31)-ASmax1(70), ASmax1_index(31)-ASmax1_index(70), and ASmax2(31)-ASmax2(70)), are input into thread 2 to calculate the attraction degrees of the 31st to 70th data samples relative to the 100 data samples. The row data corresponding to the last 30 data samples of the similarity matrix, together with the corresponding second and third auxiliary arrays (ASmax1(71)-ASmax1(100), ASmax1_index(71)-ASmax1_index(100), and ASmax2(71)-ASmax2(100)), are input into thread 3 to calculate the attraction degrees of the last 30 data samples relative to the 100 data samples.
After the calculation of the attraction degree among the 100 data samples is completed, the first auxiliary array (Rsum (1) -Rsum (100)) corresponding to each data sample can be determined.
As shown in fig. 4, in the attribution calculation procedure corresponding to the thick solid line, the attraction degrees of the first 30 data samples relative to the other data samples and the corresponding first auxiliary arrays (Rsum(1)-Rsum(30)) are input into thread 1 to calculate the attribution degrees of the first 30 data samples relative to the 100 data samples. Similarly, the attraction degrees of the 31st to 70th data samples relative to the other data samples and the corresponding first auxiliary arrays (Rsum(31)-Rsum(70)) are input into thread 2 to calculate the attribution degrees of the 31st to 70th data samples relative to the 100 data samples. The attraction degrees of the last 30 data samples and the corresponding first auxiliary arrays (Rsum(71)-Rsum(100)) are input into thread 3 to calculate the attribution degrees of the last 30 data samples relative to the 100 data samples.
In the method, the introduction of the first auxiliary array avoids the column traversal otherwise required by the attribution calculation over the matrix data, making concurrent calculation possible; the concurrent calculation reduces the iterative calculation time and improves the efficiency of data clustering.
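Because each row update now depends only on the shared auxiliary arrays and that row's own data, the per-group dispatch of Fig. 4 can be sketched with a standard thread pool. This is an illustrative Python sketch (the patent does not prescribe a threading API), and in CPython a true parallel speedup additionally depends on the row function releasing the GIL, e.g. inside NumPy operations.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def update_rows_concurrently(S, A_prev, row_fn, n_groups=3):
    """Split the N samples into n_groups sample groups and update each
    group's rows in its own thread, as in Fig. 4 (3 groups of 30/40/30).
    `row_fn(S, A_prev, i)` must compute row i independently, which is
    exactly what the auxiliary arrays make possible."""
    N = S.shape[0]
    groups = np.array_split(np.arange(N), n_groups)
    out = np.empty_like(S)

    def update_group(rows):
        for i in rows:
            out[i] = row_fn(S, A_prev, i)   # each thread writes disjoint rows

    with ThreadPoolExecutor(max_workers=n_groups) as pool:
        list(pool.map(update_group, groups))
    return out
```

The same dispatcher serves both passes: pass the attraction row function for the thin-solid-line step and the attribution row function for the thick-solid-line step.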
In an embodiment of the present application, after completing multiple computation iterations for N data samples, in one possible implementation, the method further includes:
s401: and determining the target data sample as a clustering center according to the attraction degree and the attribution degree among the N data samples.
In the embodiment of the application, when the number of iterations reaches the maximum number of iterations, or the clustering centers determined by successive iterations remain unchanged for a number of consecutive iterations reaching the convergence iteration number, the target data samples serving as clustering centers are determined according to the attraction degrees and attribution degrees among the N data samples.
In the embodiment of the present application, for example, the actually existing data samples corresponding to R(i, i) + A(i, i) > 0 may be determined as clustering centers, and the data samples serving as clustering centers may be determined as the target data samples.
S402: and finishing data clustering aiming at the N data samples according to the target data samples.
In a specific implementation, the way of completing data clustering according to the target data samples includes: if a data sample is not connected with any target data sample, the data sample may be separated into its own class (being connected here means that the similarity between the two is high); if the data sample is connected with exactly one target data sample, it is determined that the data sample belongs to the class of that target data sample; and if the data sample is connected with a plurality of target data samples, it is determined that the data sample belongs to the class of the target data sample with the maximum similarity among all the connected target data samples.
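Steps S401-S402 can be sketched as follows. The R(i, i) + A(i, i) > 0 criterion is quoted from the text; reading "connected" as "the exemplar with the largest similarity" in the multi-connection case follows the last rule above, and the function name is illustrative.

```python
import numpy as np

def assign_clusters(S, R, A):
    """S401: pick as clustering centers the samples with R(i,i)+A(i,i) > 0.
    S402: every sample joins the class of the target data sample it is
    most similar to; a clustering center keeps its own class."""
    diag = np.diag(R) + np.diag(A)
    exemplars = np.flatnonzero(diag > 0)
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars   # each center belongs to itself
    return exemplars, labels
```

On a toy 4-sample case where samples 0 and 2 satisfy the center criterion, samples 1 and 3 are assigned to whichever of the two they are most similar to.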
It can be understood that, when the present scheme is applied to a corpus clustering scenario, according to the method of S401-S402, target data samples can be determined from the corpus samples after multiple computation iterations, and clustering of all corpus samples is completed according to the target data samples, so as to obtain sets of semantically similar samples among the corpus samples. In this way, corpus generalization and training can be performed according to semantically similar corpus samples, improving the performance of the semantic matching model.
When the method is applied to a node clustering scene, a target data sample serving as a clustering center is determined from node samples by the method of S401-S402, clustering of all the node samples is realized, a node sample set with approximate entity characteristics is determined, and functions such as friend recommendation are realized according to the node sample sets.
Similarly, when the method is applied to other scenarios in which functions need to be realized through clustering, such as image classification or a human-computer interaction question-answering system, the clustering of all data samples can be realized by the above method, thereby improving image classification precision through image clustering, and improving the recall rate and accuracy of the question-answering system through semantic similarity mining.
By the method, the N data samples are clustered.
Next, a data clustering method provided in the embodiment of the present application is introduced in combination with an actual application scenario.
The embodiment provides a sparse affinity propagation clustering method supporting multi-thread concurrent computation, which can realize rapid clustering of data samples according to the similarity degree among the data samples. In the method, data samples are abstracted into network nodes, after a similarity sparse matrix among the data samples is input, two types of messages of attraction and attribution among the network nodes can be calculated in a multithreading and concurrent mode in each iteration based on the similarity among the network nodes, the attraction and the attribution of each network node are continuously updated through a multi-round iteration process until a plurality of high-quality clustering centers, namely target data samples, are generated, and therefore all the data samples are distributed to corresponding clusters.
Referring to fig. 5, which illustrates a flowchart of a data clustering method provided in an embodiment of the present application, and as shown in fig. 5, the method includes:
s501: and constructing a corresponding similarity sparse matrix according to the N data samples to be clustered.
Because the data samples have similarity information in some aspects, a similarity matrix among the N data samples can be constructed, and the similarity matrix is subjected to data processing through the method to obtain a corresponding similarity sparse matrix.
S502: and configuring related parameters, and initializing an attribution degree matrix, an attraction degree matrix and an auxiliary array.
The auxiliary array may include a first auxiliary array, a second auxiliary array, and a third auxiliary array.
Before the 1 st iteration, the attribution degree matrix, the attraction degree matrix and the auxiliary array may be initialized, for example, to make the data therein all 0.
The parameters for the required configuration are as described above and will not be described herein.
S503: and concurrently updating the attribution degree matrix, the attraction degree matrix and the auxiliary array, and calculating a target data sample serving as a clustering center.
In the embodiment of the application, in each iteration, the attribution degree, the attraction degree and the auxiliary array among the data samples can be updated in a concurrent calculation mode, so that the target data sample serving as a clustering center in each iteration is calculated.
S504: and determining whether a clustering iteration ending condition is met.
If so, go to step S505, otherwise, go to step S503.
The clustering iteration ending condition is that the number of iterations reaches the maximum number of iterations, or that the clustering centers determined by successive iterations remain unchanged for a number of consecutive iterations reaching the convergence iteration number.
S505: a class to which the data sample belongs is determined.
The specific manner for determining the class to which the data sample belongs is as described in S402, and is not described herein again.
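Putting S501-S505 together, an end-to-end iteration loop might look like the sketch below. The damping factor and the exact R/A update formulas are assumptions based on standard affinity propagation (the patent's equations (1)-(4) and its parameter list are not reproduced in this excerpt); the max-iteration/convergence-count stopping rule follows S504, and the exemplar criterion R(i, i) + A(i, i) > 0 follows S401.

```python
import numpy as np

def ap_cluster(S, max_iter=200, conv_iter=15, damping=0.5):
    """Iterate attraction (R) and attribution (A) updates until max_iter
    is reached or the exemplar set stays unchanged for conv_iter
    consecutive iterations (the ending condition of S504)."""
    N = S.shape[0]
    R = np.zeros((N, N))
    A = np.zeros((N, N))
    last_exemplars, stable = None, 0
    for _ in range(max_iter):
        # attraction: R(i,j) = S(i,j) - max_{k != j} (A(i,k) + S(i,k)),
        # via the row maximum and second maximum (cf. ASmax1/ASmax2)
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(N), idx]
        AS[np.arange(N), idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[np.arange(N), idx] = S[np.arange(N), idx] - second
        R = damping * R + (1 - damping) * R_new
        # attribution via column sums of effective attraction (cf. Rsum)
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))
        A_new = Rp.sum(axis=0)[None, :] - Rp
        dA = np.diag(A_new).copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, dA)
        A = damping * A + (1 - damping) * A_new
        # S504: convergence check on the exemplar set
        exemplars = frozenset(np.flatnonzero(np.diag(R) + np.diag(A) > 0))
        stable = stable + 1 if exemplars == last_exemplars else 0
        last_exemplars = exemplars
        if stable >= conv_iter:
            break
    exemplars = np.flatnonzero(np.diag(R) + np.diag(A) > 0)
    if len(exemplars) == 0:
        return exemplars, np.full(N, -1), R, A
    labels = exemplars[np.argmax(S[:, exemplars], axis=1)]
    labels[exemplars] = exemplars
    return exemplars, labels, R, A
```

Replacing the dense NumPy updates with the auxiliary-array, per-group threaded versions described earlier yields the concurrent sparse variant of the same loop.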
In the embodiment of the present application, the experimental results shown in table 1 are obtained under the same configuration parameters, based on 8.6k corpus samples and 29.8M semantic similarity relations. Comparison of the experimental results shows that the multi-thread concurrent computation scheme is superior to other related technologies in terms of memory overhead and clustering efficiency.
TABLE 1 results of the experiment
Based on the data clustering method provided by the foregoing embodiment, an embodiment of the present application provides a computer device, which may be the foregoing mentioned computer device. Referring to fig. 6a, this figure shows a structure diagram of a computer device provided in an embodiment of the present application, where the device includes a first computing unit 601, a statistics unit 602, and a second computing unit 603:
the first calculating unit 601 is configured to perform multiple calculation iterations on the attraction degree and the attribution degree of the N data samples in a data clustering process of the N data samples, and calculate the attraction degree between the N data samples according to the similarity between the N data samples in a t-th iteration; the tth iteration is one of the multiple computing iterations;
the counting unit 602 is configured to count first auxiliary arrays corresponding to the N data samples respectively; aiming at the ith data sample in the N data samples, a first auxiliary array corresponding to the ith data sample is used for marking the sum of effective attraction degrees of other data samples in the N data samples to the ith data sample in the t iteration;
the second calculating unit 603 is configured to calculate an attribution degree between the N data samples in the t-th iteration according to the attraction degree between the N data samples and the first auxiliary array.
In a possible implementation manner, the second calculating unit 603 is specifically configured to:
in the process of calculating the attribution degrees of the N data samples in the t-th iteration, determining second auxiliary arrays corresponding to the N data samples respectively according to the attribution degrees; the second auxiliary array corresponding to the ith data sample is used for identifying the maximum attribution information corresponding to the ith data sample and the data sample corresponding to the maximum attribution information in the t iteration; the second auxiliary array is used for calculating the attraction degree among the N data samples in the (t + 1) th iteration.
In a possible implementation manner, the second calculating unit 603 is specifically configured to:
in the process of calculating the attribution degrees of the N data samples in the t-th iteration, determining third auxiliary arrays corresponding to the N data samples respectively according to the attribution degrees; the third auxiliary array corresponding to the ith data sample is used for identifying the secondary big attribution information corresponding to the ith data sample in the t iteration; and the third auxiliary array is used for calculating the attraction degree among the N data samples in the (t + 1) th iteration.
In a possible implementation manner, the similarity among the N data samples is represented by a similarity matrix, where the ith row is used to represent the similarity between the ith data sample and other data samples;
the attraction degree among the N data samples is represented by an attraction degree matrix, wherein the ith row is used for showing the attraction degree between the ith data sample and other data samples;
and the attribution degrees of the N data samples are represented by an attribution degree matrix, wherein the ith row is used for embodying the attribution degrees between the ith data sample and other data samples.
In a possible implementation manner, the first calculating unit 601 is specifically configured to:
The N data samples are divided into M sample groups, where M < N, and in the t-th iteration, attraction calculation is performed on the M sample groups concurrently; the attraction degrees between the data samples in a sample group and the N data samples are calculated according to the row data corresponding to the data samples in the sample group in the similarity matrix;
the second calculating unit 603 is specifically configured to:
in the t iteration, the attribution degree of the M sample groups is calculated concurrently; and calculating the attribution degree between the data samples in the sample group and the N data samples according to the corresponding row data of the data samples in the sample group in the attraction degree matrix and the first auxiliary array.
In a possible implementation manner, referring to fig. 6b, this figure shows a structure diagram of a computer device provided in an embodiment of the present application, where the device further includes a setting unit 604, and the setting unit 604 is specifically configured to:
determining the target position with the similarity smaller than a threshold value in the similarity matrix;
and setting the numerical value of the target position in the similarity matrix, the attraction matrix and the attribution matrix to be null.
In a possible implementation manner, referring to fig. 6c, this figure shows a structure diagram of a computer device provided in an embodiment of the present application, where the device further includes a clustering unit 605, and the clustering unit 605 is specifically configured to:
determining a target data sample serving as a clustering center according to the attraction degree and the attribution degree among the N data samples;
and finishing data clustering aiming at the N data samples according to the target data samples.
According to the technical scheme, in the process of clustering data of N data samples by computer equipment, the attraction degree and the attribution degree are subjected to multiple calculation iterations to determine the data samples serving as the clustering centers. In the t-th iteration, the attraction degree among the N data samples can be calculated according to the similarity among the N data samples. In the process of calculating the attraction degree, first auxiliary arrays corresponding to the N data samples can be obtained through statistics, the first auxiliary array corresponding to the ith data sample can be embodied in the t-th iteration, and the sum of the effective attraction degrees of the ith data sample on other data samples in the N data samples can be obtained through statistics. The first auxiliary array can be applied to the calculation of the attribution degree among the N data samples in the t-th iteration, so that when the attribution degree of one data sample relative to the ith data sample is calculated, the first auxiliary array can be directly used without additionally traversing the effective attraction degrees of other data samples relative to the ith data sample, the calculation complexity of the calculation of the attribution degree is reduced, and the efficiency of data clustering is improved.
The embodiment of the application also provides a computer device, which is described below with reference to the accompanying drawings. Referring to fig. 7, an embodiment of the present application provides a structure diagram of a computer device, where the device 700 may also be a terminal device, taking the terminal device as a mobile phone as an example:
fig. 7 is a block diagram illustrating a part of the structure of a mobile phone according to an embodiment of the present application. Referring to fig. 7, the handset includes: radio Frequency (RF) circuit 710, memory 720, input unit 730, display unit 740, sensor 750, audio circuit 760, wireless fidelity (WiFi) module 770, processor 780, and power supply 790. Those skilled in the art will appreciate that the handset configuration shown in fig. 7 is not intended to be limiting and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 7:
In general, the RF circuit 710 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a Low Noise Amplifier (LNA), a duplexer, and the like. The RF circuit 710 may also communicate with a network and other devices through wireless communication.
The memory 720 may be used to store software programs and modules, and the processor 780 may execute various functional applications and data processing of the cellular phone by operating the software programs and modules stored in the memory 720. The memory 720 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the cellular phone, and the like. Further, the memory 720 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device.
The input unit 730 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the cellular phone. Specifically, the input unit 730 may include a touch panel 731 and other input devices 732. The touch panel 731, also referred to as a touch screen, can collect touch operations of a user (e.g. operations of the user on or near the touch panel 731 by using any suitable object or accessory such as a finger, a stylus, etc.) and drive the corresponding connection device according to a preset program. Alternatively, the touch panel 731 may include two portions of a touch detection device and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts it to touch point coordinates, and sends the touch point coordinates to the processor 780, and can receive and execute commands from the processor 780. In addition, the touch panel 731 may be implemented by various types, such as a resistive type, a capacitive type, an infrared ray, and a surface acoustic wave. The input unit 730 may include other input devices 732 in addition to the touch panel 731. In particular, other input devices 732 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 740 may include a display panel 741. Optionally, the display panel 741 may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED) display, and the like. Further, the touch panel 731 may cover the display panel 741; when the touch panel 731 detects a touch operation on or near it, the operation is transmitted to the processor 780 to determine the type of the touch event, and the processor 780 then provides a corresponding visual output on the display panel 741 according to the type of the touch event.
The handset may also include at least one sensor 750, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel 741 according to the brightness of ambient light, and a proximity sensor that turns off the display panel 741 and/or a backlight when the mobile phone is moved to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
WiFi is a short-range wireless transmission technology. Through the WiFi module 770, the mobile phone can help a user receive and send e-mails, browse web pages, access streaming media, and the like, providing the user with wireless broadband Internet access. Although fig. 7 shows the WiFi module 770, it is understood that it is not an essential component of the handset and can be omitted entirely as needed without changing the essence of the invention.
The processor 780 is a control center of the mobile phone, connects various parts of the entire mobile phone by using various interfaces and lines, and performs various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 720 and calling data stored in the memory 720, thereby integrally monitoring the mobile phone. Optionally, processor 780 may include one or more processing units; preferably, the processor 780 may integrate an application processor, which primarily handles operating systems, user interfaces, applications, etc., and a modem processor, which primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into processor 780.
The handset also includes a power supply 790 (e.g., a battery) for powering the various components. Preferably, the power supply is logically coupled to the processor 780 via a power management system, so that charging, discharging, and power consumption are managed through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In this embodiment, the processor 780 included in the terminal device further has the following functions:
in the data clustering process of the N data samples, a plurality of calculation iterations of the attraction degree and the attribution degree aiming at the N data samples are included,
in the t-th iteration, according to the similarity among the N data samples, calculating the attraction among the N data samples; the tth iteration is one of the multiple computing iterations;
counting first auxiliary arrays corresponding to the N data samples respectively; aiming at the ith data sample in the N data samples, a first auxiliary array corresponding to the ith data sample is used for marking the sum of effective attraction degrees of other data samples in the N data samples to the ith data sample in the t iteration;
and calculating the attribution degree among the N data samples in the t iteration according to the attraction degree among the N data samples and the first auxiliary array.
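The attraction/attribution iteration that the processor 780 performs corresponds to affinity propagation, in which "attraction" plays the role of responsibility and "attribution" the role of availability. The sketch below is a minimal, illustrative NumPy rendering of one such iteration (damping omitted); the function and variable names are assumptions for illustration, not taken from the patent. The first auxiliary array appears as `aux`, the per-column sum of effective (positive) attractions, which lets each attribution be read off in constant time instead of re-summing a column per entry:

```python
import numpy as np

def ap_iteration(S, A):
    """One attraction (responsibility) / attribution (availability) iteration.
    S: N x N similarity matrix, A: current N x N attribution matrix.
    Illustrative sketch of the embodiment's auxiliary-array idea."""
    N = S.shape[0]
    idx = np.arange(N)
    # attraction: r(i,k) = s(i,k) - max_{k'!=k}(a(i,k') + s(i,k'))
    AS = A + S
    best = np.argmax(AS, axis=1)
    max1 = AS[idx, best]
    AS[idx, best] = -np.inf
    max2 = AS.max(axis=1)
    R = S - max1[:, None]
    R[idx, best] = S[idx, best] - max2   # in the argmax column, use the runner-up
    # first auxiliary array: per-column sum of effective (positive) attractions
    Rp = np.maximum(R, 0.0)
    Rp[idx, idx] = R[idx, idx]           # self-attraction enters the sum unclipped
    aux = Rp.sum(axis=0)                 # one pass over the matrix
    # attribution from the auxiliary array, O(1) per entry:
    # subtract each entry's own contribution instead of re-summing the column
    A_new = aux[None, :] - Rp
    dA = A_new[idx, idx].copy()
    A_new = np.minimum(A_new, 0.0)
    A_new[idx, idx] = dA
    return R, A_new
```

With the column sums precomputed once, the attribution step costs O(N^2) overall rather than O(N^3), which is the efficiency gain the embodiment describes.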
The computer device provided in this embodiment of the present application may be a server. Referring to fig. 8, fig. 8 is a structural diagram of a server 800 provided in this embodiment of the present application. The server 800 may vary considerably in configuration or performance, and may include one or more Central Processing Units (CPUs) 822 (e.g., one or more processors), a memory 832, and one or more storage media 830 (e.g., one or more mass storage devices) storing an application 842 or data 844. The memory 832 and the storage medium 830 may be transient or persistent storage. The program stored in the storage medium 830 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Still further, the central processor 822 may be configured to communicate with the storage medium 830 to execute, on the server 800, the series of instruction operations in the storage medium 830.
The server 800 may also include one or more power supplies 826, one or more wired or wireless network interfaces 850, one or more input-output interfaces 858, and/or one or more operating systems 841, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM, and so forth.
The steps in the above embodiments may also be performed by a server, which may be based on the server structure shown in fig. 8.
The embodiment of the present application further provides a computer-readable storage medium, which is used for storing a program code, where the program code is used for executing the method described in the foregoing embodiments.
The embodiments of the present application also provide a computer program product including instructions, which when run on a computer, cause the computer to perform the method described in the foregoing embodiments.
The terms "first," "second," "third," "fourth," and the like in the description of the application and the above-described figures, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of these items, including any combination of single or plural items. For example, "at least one of a, b, or c" may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be singular or plural.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium may be at least one of the following media: various media that can store program codes, such as read-only memory (ROM), RAM, magnetic disk, or optical disk.
It should be noted that, in the present specification, all the embodiments are described in a progressive manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus and system embodiments, since they are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for related points. The above-described embodiments of the apparatus and system are merely illustrative, and the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (15)
1. A data clustering method, characterized in that in a data clustering process of N data samples, a plurality of calculation iterations of attractiveness and attribution for the N data samples are included, the method is executed by a computer device, and the method comprises:
in the t-th iteration, according to the similarity among the N data samples, calculating the attraction among the N data samples; the tth iteration is one of the multiple computing iterations;
counting first auxiliary arrays corresponding to the N data samples respectively; aiming at the ith data sample in the N data samples, a first auxiliary array corresponding to the ith data sample is used for marking the sum of effective attraction degrees of other data samples in the N data samples to the ith data sample in the t iteration;
and calculating the attribution degree among the N data samples in the t iteration according to the attraction degree among the N data samples and the first auxiliary array.
2. The method of claim 1, wherein in calculating the attribution of the N data samples in the tth iteration, the method further comprises:
determining second auxiliary arrays corresponding to the N data samples respectively according to the attribution degree; the second auxiliary array corresponding to the ith data sample is used for identifying the maximum attribution information corresponding to the ith data sample and the data sample corresponding to the maximum attribution information in the t iteration; the second auxiliary array is used for calculating the attraction degree among the N data samples in the (t + 1) th iteration.
3. The method of claim 1, wherein in calculating the attribution of the N data samples in the tth iteration, the method further comprises:
determining third auxiliary arrays corresponding to the N data samples respectively according to the attribution degree; the third auxiliary array corresponding to the ith data sample is used for identifying the second-largest attribution information corresponding to the ith data sample in the t iteration; and the third auxiliary array is used for calculating the attraction degree among the N data samples in the (t + 1) th iteration.
4. The method according to any one of claims 1 to 3, wherein the similarity among the N data samples is represented by a similarity matrix, wherein the ith row is used for representing the similarity between the ith data sample and other data samples;
the attraction degree among the N data samples is represented by an attraction degree matrix, wherein the ith row is used for showing the attraction degree between the ith data sample and other data samples;
and the attribution degrees of the N data samples are represented by an attribution degree matrix, wherein the ith row is used for embodying the attribution degrees between the ith data sample and other data samples.
5. The method of claim 4, wherein N data samples are divided into M sample groups, M < N, and wherein calculating the attraction between N data samples based on the similarity between N data samples comprises:
in the t iteration, the M sample groups are subjected to attraction calculation concurrently; calculating the attraction degree between the data samples in the sample group and N data samples according to the corresponding row data of the data samples in the sample group in the similarity matrix;
the calculating the attribution degree among the N data samples in the t iteration according to the attraction degree among the N data samples and the first auxiliary array comprises the following steps:
in the t iteration, the attribution degree of the M sample groups is calculated concurrently; and calculating the attribution degree between the data samples in the sample group and the N data samples according to the corresponding row data of the data samples in the sample group in the attraction degree matrix and the first auxiliary array.
6. The method of claim 4, further comprising:
determining the target position with the similarity smaller than a threshold value in the similarity matrix;
and setting the numerical value of the target position in the similarity matrix, the attraction matrix and the attribution matrix to be null.
7. The method of claim 1, wherein after completing a plurality of computational iterations, the method further comprises:
determining a target data sample serving as a clustering center according to the attraction degree and the attribution degree among the N data samples;
and finishing data clustering aiming at the N data samples according to the target data samples.
8. A computer device, characterized in that the device comprises a first calculation unit, a statistics unit and a second calculation unit:
the first calculating unit is used for performing multiple calculation iterations aiming at the attraction degree and the attribution degree of the N data samples in the data clustering process of the N data samples, and calculating the attraction degree among the N data samples according to the similarity among the N data samples in the t-th iteration; the tth iteration is one of the multiple computing iterations;
the statistical unit is used for counting first auxiliary arrays corresponding to the N data samples respectively; aiming at the ith data sample in the N data samples, a first auxiliary array corresponding to the ith data sample is used for marking the sum of effective attraction degrees of other data samples in the N data samples to the ith data sample in the t iteration;
and the second calculating unit is used for calculating the attribution degree among the N data samples in the t iteration according to the attraction degree among the N data samples and the first auxiliary array.
9. The device according to claim 8, wherein the second computing unit is specifically configured to:
in the process of calculating the attribution degrees of the N data samples in the t-th iteration, determining second auxiliary arrays corresponding to the N data samples respectively according to the attribution degrees; the second auxiliary array corresponding to the ith data sample is used for identifying the maximum attribution information corresponding to the ith data sample and the data sample corresponding to the maximum attribution information in the t iteration; the second auxiliary array is used for calculating the attraction degree among the N data samples in the (t + 1) th iteration.
10. The device according to claim 8, wherein the second computing unit is specifically configured to:
in the process of calculating the attribution degrees of the N data samples in the t-th iteration, determining third auxiliary arrays corresponding to the N data samples respectively according to the attribution degrees; the third auxiliary array corresponding to the ith data sample is used for identifying the second-largest attribution information corresponding to the ith data sample in the t iteration; and the third auxiliary array is used for calculating the attraction degree among the N data samples in the (t + 1) th iteration.
11. The apparatus according to any one of claims 8 to 10, wherein the similarity among the N data samples is represented by a similarity matrix, wherein the ith row is used for representing the similarity between the ith data sample and other data samples;
the attraction degree among the N data samples is represented by an attraction degree matrix, wherein the ith row is used for showing the attraction degree between the ith data sample and other data samples;
and the attribution degrees of the N data samples are represented by an attribution degree matrix, wherein the ith row is used for embodying the attribution degrees between the ith data sample and other data samples.
12. The device according to claim 11, wherein the first computing unit is specifically configured to:
n data samples are divided into M sample groups, M is less than N, and in the t iteration, the M sample groups are subjected to attraction calculation concurrently; calculating the attraction degree between the data samples in the sample group and N data samples according to the corresponding row data of the data samples in the sample group in the similarity matrix;
the second computing unit is specifically configured to:
in the t iteration, the attribution degree of the M sample groups is calculated concurrently; and calculating the attribution degree between the data samples in the sample group and the N data samples according to the corresponding row data of the data samples in the sample group in the attraction degree matrix and the first auxiliary array.
13. The device according to claim 11, further comprising a setting unit, the setting unit being specifically configured to:
determining the target position with the similarity smaller than a threshold value in the similarity matrix;
and setting the numerical value of the target position in the similarity matrix, the attraction matrix and the attribution matrix to be null.
14. A computer device, the device comprising a processor and a memory:
the memory is used for storing program codes and transmitting the program codes to the processor;
the processor is configured to perform the data clustering method of any one of claims 1 to 7 according to instructions in the program code.
15. A computer-readable storage medium for storing program code for performing the data clustering method of any one of claims 1 to 7.
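Claims 2 and 3 cache, for each data sample, the largest attribution-related value together with the data sample (column) it comes from, and the second-largest such value, for reuse when computing attraction in iteration t + 1. The sketch below is a hedged NumPy illustration of that idea, assuming the cached quantity is the row-wise maximum of a(i,k') + s(i,k'); the names are illustrative, not from the claims. With the maximum, its index, and the runner-up stored, each attraction entry r(i,k) = s(i,k) - max over k' != k of (a(i,k') + s(i,k')) is obtained without rescanning the row: the runner-up is substituted only in the column that holds the maximum.

```python
import numpy as np

def row_max_arrays(A, S):
    """Second/third auxiliary arrays sketched from claims 2-3: per row of
    A + S, the largest value with its column index, and the second-largest
    value (illustrative names and representation)."""
    AS = A + S
    idx = np.arange(AS.shape[0])
    best_col = np.argmax(AS, axis=1)      # second auxiliary array: argmax column ...
    best_val = AS[idx, best_col]          # ... and the maximum value itself
    AS_masked = AS.copy()
    AS_masked[idx, best_col] = -np.inf    # mask the maximum to expose the runner-up
    second_val = AS_masked.max(axis=1)    # third auxiliary array
    return best_val, best_col, second_val

def attraction(S, best_val, best_col, second_val):
    """Attraction in the next iteration, using the cached row maxima:
    subtract the row maximum everywhere, except in the column holding it,
    where the second-largest value is used instead."""
    R = S - best_val[:, None]
    idx = np.arange(S.shape[0])
    R[idx, best_col] = S[idx, best_col] - second_val
    return R
```

The second-largest value is what makes the cache correct: excluding column k from the maximum only changes the result when k itself is the argmax, and in exactly that case the runner-up is the right answer.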
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010305219.8A CN111506730A (en) | 2020-04-17 | 2020-04-17 | Data clustering method and related device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111506730A true CN111506730A (en) | 2020-08-07 |
Family
ID=71867464
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010305219.8A Pending CN111506730A (en) | 2020-04-17 | 2020-04-17 | Data clustering method and related device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111506730A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112307906A (en) * | 2020-10-14 | 2021-02-02 | 北方工业大学 | Energy storage battery fault classification feature screening and dimension reduction method under neighbor propagation clustering |
CN112307906B (en) * | 2020-10-14 | 2023-07-04 | 北方工业大学 | Energy storage battery fault classification feature screening dimension reduction method under neighbor propagation clustering |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40027460; Country of ref document: HK |
| SE01 | Entry into force of request for substantive examination | |