CN111340082B - Data processing method and device, processor, electronic equipment and storage medium
- Publication number
- CN111340082B (application CN202010102367.XA)
- Authority
- CN
- China
- Prior art keywords
- node
- similarity
- nodes
- clustered
- alternative
- Prior art date
- Legal status: Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The application discloses a data processing method and device, a processor, electronic equipment and a storage medium. The method comprises the following steps: acquiring n nodes, wherein n is an integer greater than or equal to 2, and the nodes are used for representing objects to be clustered; determining a node, of the n nodes, with similarity to a first node being greater than or equal to a reference threshold, as a first alternative node, the first node belonging to the n nodes; and connecting the first node with the first alternative node to obtain an adjacency graph, wherein the adjacency graph is used for clustering objects to be clustered represented by the n nodes.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and apparatus, a processor, an electronic device, and a storage medium.
Background
Clustering is one of the key technologies in fields such as data mining and machine learning. Clustering refers to dividing similar objects to be clustered into the same cluster and dissimilar objects to be clustered into different clusters. An object set to be clustered is clustered by using the information of the objects in the set and the association information among different objects to be clustered.
An adjacency graph of the object set to be clustered is obtained according to the associations among different objects to be clustered in the set. The adjacency graph comprises at least two nodes, and each node corresponds to an object to be clustered. The adjacency graph contains the information of the objects to be clustered corresponding to the nodes and the association information among different objects to be clustered. Conventional clustering methods can realize clustering of the object set to be clustered by processing the adjacency graph, using the information of the objects to be clustered and the association information among them. However, adjacency graphs obtained by conventional methods contain information with low accuracy.
Disclosure of Invention
The application provides a data processing method and device, a processor, electronic equipment and a storage medium.
In a first aspect, a data processing method is provided, the method comprising:
acquiring n nodes, wherein n is an integer greater than or equal to 2, and the nodes are used for representing objects to be clustered;
Determining a node, of the n nodes, with similarity to a first node being greater than or equal to a reference threshold, as a first alternative node, the first node belonging to the n nodes;
And connecting the first node with the first alternative node to obtain an adjacency graph, wherein the adjacency graph is used for clustering objects to be clustered represented by the n nodes.
In this aspect, by taking the reference similarity threshold as a basis for determining the first candidate nodes of the first node, the number of noise-associated nodes of the first node among the first candidate nodes can be reduced. Thus, the quality of the adjacency graph can be improved.
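As a concrete illustration of this aspect, the following is a minimal sketch in Python, assuming the objects to be clustered are represented by an (n, d) array of feature vectors and that cosine similarity is used; the function name, feature representation and threshold handling are illustrative assumptions, not part of the claims:

```python
import numpy as np

def first_alternative_nodes(features, first_idx, ref_threshold):
    """Return the indices of nodes whose similarity to the first node
    is greater than or equal to the reference threshold."""
    # Normalize rows so that dot products equal cosine similarities.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ f[first_idx]        # similarity of every node to the first node
    sims[first_idx] = -np.inf      # the first node is not its own candidate
    return np.flatnonzero(sims >= ref_threshold)
```

Connecting the first node to each returned node yields the edges incident to the first node in the adjacency graph.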
In combination with any one of the embodiments of the present application, before determining that a node, of the n nodes, having a similarity with the first node that is greater than or equal to a reference threshold, is the first candidate node, the method further includes:
determining the similarity between the n nodes and the first node to obtain a first similarity set;
taking the nodes corresponding to the k largest similarities in the first similarity set as second alternative nodes;
The determining, as a first candidate node, a node, of the n nodes, having a similarity with the first node greater than or equal to a reference threshold, includes:
and determining a node with the similarity being greater than or equal to the reference threshold value in the second alternative nodes as the first alternative node.
In this embodiment, the k nodes with the highest similarity to the first node are selected as the second alternative nodes. Subsequently, when the first alternative nodes are determined from the second alternative nodes, the number of first alternative nodes is limited to at most k. In this way, the number of first alternative nodes is limited while the number of noise-associated nodes of the first node among the first alternative nodes is reduced. Thereby, the amount of data processing for constructing the adjacency graph is reduced while the quality of the adjacency graph is improved.
In combination with any embodiment of the present application, the connecting the first node with the first alternative node to obtain an adjacency graph includes:
Determining an adjacency relation between the first alternative node and the first node according to the similarity between the first alternative node and the first node;
and connecting the first node with the first alternative node to enable the first node and the first alternative node to meet the adjacency relation, so as to obtain the adjacency graph.
In such an embodiment, the adjacency between nodes is determined based on the similarity between the nodes. By making the adjacency graph satisfy the adjacency relation, the accuracy of information in the adjacency graph can be improved, and the quality of the adjacency graph can be further improved.
In combination with any one of the embodiments of the present application, the adjacency includes a distance between the first candidate node and the first node;
The determining the adjacency relation between the first alternative node and the first node according to the similarity between the first alternative node and the first node comprises the following steps:
And determining the distance between the first alternative node and the first node according to the similarity between the first alternative node and the first node, wherein the distance is positively correlated with the similarity.
In such an embodiment, the adjacency includes a distance between the first alternative node and the first node. And determining the distance between the first alternative node and the first node according to the similarity between the first alternative node and the first node. The distance between the nodes contains the similarity information between the nodes, so that the quality of the adjacency graph is improved.
In combination with any one of the embodiments of the present application, the determining, according to the similarity between the first candidate node and the first node, a distance between the first candidate node and the first node includes:
Taking the similarity between the first alternative node and the first node as an alternative similarity set, and determining the minimum value in the alternative similarity set as a reference similarity;
Obtaining a first weight and a second weight according to the difference between the first similarity and the reference similarity and the difference between the second similarity and the reference similarity, wherein the first similarity and the second similarity belong to the alternative similarity set;
And determining the distance between the first node and the second node and the distance between the first node and the third node according to the first weight and the second weight, wherein the second node is a node corresponding to the first similarity, and the third node is a node corresponding to the second similarity.
In this embodiment, the first weight and the second weight are obtained according to the difference between the first similarity and the reference similarity and the difference between the second similarity and the reference similarity. The distance between the first node and the second node and the distance between the first node and the third node are then determined according to the first weight and the second weight. Because the distance between nodes carries the similarity information between nodes, determining the distances (the distance between the first node and the second node and the distance between the first node and the third node) according to the weights (the first weight and the second weight) gives different weights to the similarity information between different pairs of nodes, which facilitates clustering using the inter-node information when the adjacency graph is processed. This embodiment can thus improve the quality of the adjacency graph.
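A minimal sketch of this weighting, assuming the weight of each candidate is simply its margin over the reference (minimum) similarity and the distance is proportional to that weight; the exact functional forms are not fixed by this summary:

```python
import numpy as np

def weighted_distances(alternative_sims, t=1.0):
    """Map candidate similarities to distances via weights that measure
    each similarity's margin over the smallest one in the set."""
    sims = np.asarray(alternative_sims, dtype=float)
    ref_sim = sims.min()          # reference similarity: minimum of the set
    weights = sims - ref_sim      # first weight, second weight, ...
    return t * weights            # distance positively correlated with similarity

# e.g. weighted_distances([0.8, 0.7, 0.75]) ≈ [0.1, 0.0, 0.05]
```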
In combination with any one of the embodiments of the present application, before determining, as the first candidate node, a node with a similarity greater than or equal to the reference threshold in the second candidate node, the method further includes:
Performing feature extraction processing on a first object to be clustered in the n objects to be clustered to obtain first feature data;
Determining the data type of the first object to be clustered according to the first characteristic data, wherein the data type comprises images, voices and sentences;
And obtaining the reference threshold according to the data type and the reference mapping relation of the first object to be clustered, wherein the reference mapping relation is a mapping relation between the data type and the similarity threshold.
In this embodiment, the reference threshold is determined according to the data type of the first object to be clustered in the n objects to be clustered and the reference mapping relation, so that different reference thresholds can be set for data of different data types. And determining a first alternative node of the first node according to the reference threshold value, so that noise association nodes in the first alternative node can be reduced, noise association in the adjacency graph can be reduced, and the quality of the adjacency graph is improved.
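A minimal sketch of the reference mapping relation, assuming a plain lookup table keyed by data type; the threshold values are illustrative, and how the data type is inferred from the first feature data is left abstract here:

```python
# Hypothetical mapping relation between data type and similarity threshold.
REFERENCE_MAPPING = {"image": 0.75, "speech": 0.70, "sentence": 0.65}

def reference_threshold_for(data_type: str) -> float:
    """Look up the reference threshold for the given data type."""
    return REFERENCE_MAPPING[data_type]
```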
In combination with any one of the embodiments of the present application, before the node corresponding to the k maximum similarities in the first similarity set is used as the second candidate node, the method further includes:
acquiring a reference time length and/or a reference storage capacity;
and obtaining the k according to the reference time length and/or the reference storage capacity.
In such an embodiment, k is determined according to the reference time length and/or the reference storage capacity, so that the user requirement can be better satisfied. For example, a user who wants to shorten the time taken to construct the adjacency graph may reduce the reference time length. According to the reference time length, the data processing device can maximize the value of k on the premise that the time for constructing the adjacency graph is less than or equal to the reference time length. Therefore, the quality of the obtained adjacency graph is improved on the premise of meeting the user requirement (the time for constructing the adjacency graph is less than or equal to the reference time length).
In combination with any of the embodiments of the present application, the method further comprises:
Acquiring a clustering network;
and processing the adjacency graph by using the clustering network to obtain a clustering result of the objects to be clustered represented by the n nodes.
In the embodiment, the adjacency graph obtained based on the technical scheme provided by the application is processed to obtain the clustering results of n objects to be clustered, so that the accuracy of the clustering results can be improved.
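The application does not pin down the architecture of the clustering network. As one plausible reading, a graph-convolution step that propagates node features over the adjacency matrix might look like the following sketch; the row normalization, self-loops and single layer are assumptions:

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: aggregate each node's neighborhood
    over the adjacency matrix, apply a linear map, then a ReLU."""
    a_hat = adj + np.eye(adj.shape[0])           # add self-loops
    deg = a_hat.sum(axis=1, keepdims=True)       # node degrees (always >= 1)
    propagated = (a_hat / deg) @ feats           # row-normalized aggregation
    return np.maximum(propagated @ weight, 0.0)  # ReLU
```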
In combination with any one of the embodiments of the present application, the determining the similarity between the n nodes and the first node, to obtain a first similarity set includes:
And respectively determining the similarity between the object to be clustered represented by the first node and the object to be clustered represented by each node in the n nodes to obtain the first similarity set.
In a second aspect, there is provided a data processing apparatus, the apparatus comprising:
The acquisition unit is used for acquiring n nodes, wherein n is an integer greater than or equal to 2, and the nodes are used for representing objects to be clustered;
A first determining unit, configured to determine a node, among the n nodes, having a similarity with a first node that is greater than or equal to a reference threshold, as a first candidate node, where the first node belongs to the n nodes;
And the connection unit is used for connecting the first node with the first alternative node to obtain an adjacency graph, wherein the adjacency graph is used for clustering objects to be clustered represented by the n nodes.
In combination with any of the embodiments of the application, the device further comprises:
a second determining unit, configured to determine, before determining, as a first candidate node, a similarity between the n nodes and the first node, to obtain a first similarity set, where the similarity between the n nodes and the first node is greater than or equal to a reference threshold;
the second determining unit is further configured to use a node corresponding to the k maximum similarities in the first similarity set as a second candidate node;
The first determining unit is configured to:
and determining a node with the similarity being greater than or equal to the reference threshold value in the second alternative nodes as the first alternative node.
In combination with any one of the embodiments of the present application, the connection unit is configured to:
Determining an adjacency relation between the first alternative node and the first node according to the similarity between the first alternative node and the first node;
and connecting the first node with the first alternative node to enable the first node and the first alternative node to meet the adjacency relation, so as to obtain the adjacency graph.
In combination with any one of the embodiments of the present application, the adjacency includes a distance between the first candidate node and the first node;
The connecting unit is used for:
And determining the distance between the first alternative node and the first node according to the similarity between the first alternative node and the first node, wherein the distance is positively correlated with the similarity.
In combination with any one of the embodiments of the present application, the connection unit is configured to:
Taking the similarity between the first alternative node and the first node as an alternative similarity set, and determining the minimum value in the alternative similarity set as a reference similarity;
Obtaining a first weight and a second weight according to the difference between the first similarity and the reference similarity and the difference between the second similarity and the reference similarity, wherein the first similarity and the second similarity belong to the alternative similarity set;
And determining the distance between the first node and the second node and the distance between the first node and the third node according to the first weight and the second weight, wherein the second node is a node corresponding to the first similarity, and the third node is a node corresponding to the second similarity.
In combination with any of the embodiments of the application, the device further comprises:
a feature extraction processing unit, configured to perform feature extraction processing on a first object to be clustered among the n objects to be clustered, before a node with similarity greater than or equal to the reference threshold among the second alternative nodes is determined as the first alternative node, to obtain first feature data;
a third determining unit, configured to determine the data type of the first object to be clustered according to the first feature data, wherein the data type includes images, voices and sentences;
a first processing unit, configured to obtain the reference threshold according to the data type of the first object to be clustered and the reference mapping relation, the reference mapping relation being a mapping relation between data types and similarity thresholds.
In connection with any one of the embodiments of the present application,
The obtaining unit is configured to obtain a reference duration and/or a reference storage capacity before the node corresponding to the k maximum similarities in the first similarity set is used as the second candidate node;
The apparatus further comprises:
and the second processing unit is used for obtaining the k according to the reference time length and/or the reference storage capacity.
In combination with any one of the embodiments of the present application, the obtaining unit is further configured to obtain a clustering network;
The apparatus further comprises:
and the third processing unit is used for processing the adjacency graph by using the clustering network to obtain a clustering result of the objects to be clustered represented by the n nodes.
In combination with any one of the embodiments of the present application, the second determining unit is configured to:
And respectively determining the similarity between the object to be clustered represented by the first node and the object to be clustered represented by each node in the n nodes to obtain the first similarity set.
In a third aspect, a processor is provided for performing the method of the first aspect and any one of its possible implementation manners described above.
In a fourth aspect, there is provided an electronic device comprising: a processor, transmission means, input means, output means and memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to carry out the method as described in the first aspect and any one of its possible implementations.
In a fifth aspect, there is provided a computer readable storage medium having stored therein a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out a method as in the first aspect and any one of its possible implementations.
In a sixth aspect, a computer program product is provided, the computer program product comprising a computer program or instructions which, when run on a computer, cause the computer to perform the method of the first aspect and any one of the possible implementations thereof.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application as claimed.
Drawings
In order to more clearly describe the embodiments of the present application or the technical solutions in the background art, the following description will describe the drawings that are required to be used in the embodiments of the present application or the background art.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
- FIG. 1 is a schematic diagram of an adjacency graph according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 3 is a flowchart illustrating another data processing method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a reference frame and an azimuth angle according to an embodiment of the present application;
- FIG. 5 is a schematic diagram of another adjacency graph provided by an embodiment of the present application;
- FIG. 6 is a schematic diagram of another adjacency graph provided by an embodiment of the present application;
- FIG. 7 is a schematic diagram of another adjacency graph provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
Fig. 9 is a schematic hardware structure of a data processing apparatus according to an embodiment of the present application.
Detailed Description
In order that those skilled in the art will better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without making any inventive effort shall fall within the scope of the application.
The terms first, second and the like in the description and in the claims and in the above-described figures are used for distinguishing between different objects and not necessarily for describing a sequential or chronological order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.
The term "and/or" is herein merely an association relationship describing an associated object, meaning that there may be three relationships, e.g., a and/or B, may represent: a exists alone, A and B exist together, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, may mean including any one or more elements selected from the group consisting of A, B and C.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.
The execution body of the embodiment of the application is a data processing device, and the data processing device can be any one of the following: cell phone, computer, server, panel computer.
The adjacency graph comprises at least two nodes, and each node corresponds to one object to be clustered. The adjacency relationship between every two nodes can be used for representing similarity information between objects to be clustered corresponding to the nodes. For example, in the adjacency graph, the connection of two nodes indicates that the similarity of the two nodes is high, or that the probability of the two objects to be clustered corresponding to the two nodes being the same is high.
For example, there are 6 nodes in the adjacency graph shown in fig. 1 (each circle in the figure represents one node). Node 1 corresponds to the image a to be processed, node 2 corresponds to the image b to be processed, node 3 corresponds to the image c to be processed, node 4 corresponds to the image d to be processed, node 5 corresponds to the image e to be processed, and node 6 corresponds to the image f to be processed. As can be seen from fig. 1, the distance between node 2 and node 1 is smaller than the distance between node 4 and node 1; correspondingly, the similarity between b and a is greater than the similarity between d and a. Similarly, the similarity between c and a is greater than the similarity between c and e, and the similarity between d and f is greater than the similarity between d and a.
The adjacency graph contains the category information of the nodes and the similarity information between the nodes. By processing the adjacency graph, clustering of the nodes can be realized by using the category information of the nodes and the similarity information between the nodes, and a clustering result of the object set to be clustered corresponding to the nodes in the adjacency graph is obtained. The category information of a node is the category information of the object to be clustered corresponding to the node. For example, if the object to be clustered a corresponds to the node A and the category of the object to be clustered a is apple, then the category of the node A is apple. For another example, the object set to be clustered includes an object A to be clustered and an object B to be clustered, the information of the object A to be clustered includes its category a, and the association information between the object A to be clustered and the object B to be clustered includes a similarity of 80% between them. According to the information of the object A to be clustered and the association information between the object A to be clustered and the object B to be clustered, the probability that the category of the object B to be clustered is a is 80%, and the category of the object B to be clustered can thus be determined, thereby realizing the clustering of the object set to be clustered.
If the similarity information between the nodes contained in the adjacency graph is inaccurate, the clustering result obtained based on the adjacency graph is inaccurate. For example (example 1), node A corresponds to image a and node B corresponds to image b. The basis for judging that the categories of the two images are the same is that the similarity of the two images is greater than or equal to a preset threshold. The similarity between image a and image b is smaller than the preset threshold. If node A and node B are connected in the adjacency graph, the category of image a and the category of image b will be erroneously determined to be the same.
In the embodiment of the present application, if the categories of two objects to be clustered are different, the connection between the two nodes corresponding to the two objects to be clustered is referred to as a noise association (for example, in example 1, the connection between node A and node B is a noise association). If the categories of the two objects to be clustered are the same, the connection between the two nodes corresponding to the two objects to be clustered is called a valid association. If the connection between node 1 and node 2 is a noise association, node 2 is referred to as a noise-associated node of node 1. Similarly, node 1 may be referred to as a noise-associated node of node 2. For example, in example 1, node A is a noise-associated node of node B, and node B is a noise-associated node of node A. If the connection between node 1 and node 2 is a valid association, node 2 is referred to as a valid associated node of node 1. Similarly, node 1 may be referred to as a valid associated node of node 2.
Before proceeding with the following explanation, the quality of the adjacency graph is first defined. In the embodiment of the application, the quality of the adjacency graph refers to the accuracy of information in the adjacency graph, wherein the information comprises information of nodes and similarity information between the nodes. The higher the quality of the adjacency graph, the higher the accuracy of the information in the adjacency graph.
Obviously, the more noise associations in the adjacency graph, the less accurate the information in the adjacency graph (where the information includes the similarity information between nodes), and the lower the quality of the adjacency graph. In order to reduce noise associations in the adjacency graph and improve its quality, the embodiment of the application provides a technical scheme for constructing the adjacency graph. Embodiments of the present application will be described below with reference to the accompanying drawings.
Referring to fig. 2, fig. 2 is a flow chart of a data processing method according to an embodiment of the application.
201. Acquire n nodes.
In the embodiment of the application, n is an integer greater than or equal to 2, and each of the n nodes represents an object to be clustered. An object to be clustered may be an image, a speech segment, a sentence, or the like. For example, when the objects to be clustered are images to be processed, the n nodes are in one-to-one correspondence with the n images. For another example, when the objects to be clustered are sentences, the n nodes are in one-to-one correspondence with the n sentences.
The manner of obtaining the n nodes may be to receive n nodes input by a user through an input component, where the input component includes: a keyboard, a mouse, a touch screen, a touch pad, an audio input device, and the like. The manner of obtaining n nodes may also be receiving n nodes sent by a terminal, where the terminal includes a mobile phone, a computer, a server, a tablet computer, and the like. The method for obtaining n nodes may also be that the data processing apparatus generates n nodes according to the n objects to be clustered after obtaining the n objects to be clustered. The application is not limited in the manner of acquiring n nodes.
Optionally, the n objects to be clustered may be acquired by receiving the n objects to be clustered input by the user through an input component, where the input component includes: a keyboard, a mouse, a touch screen, a touch pad, an audio input device, and the like. The n objects to be clustered may also be acquired by receiving the n objects to be clustered sent by a terminal, where the terminal includes a mobile phone, a computer, a server, a tablet computer, and the like. The application does not limit the manner of acquiring the n objects to be clustered.
202. Determine a node, among the n nodes, whose similarity with the first node is greater than or equal to a reference threshold, as a first alternative node.
In the embodiment of the application, the similarity between two nodes is the similarity between the two objects to be clustered corresponding to the two nodes. For example, node A corresponds to the object a to be clustered and node B corresponds to the object b to be clustered. If the similarity between the object a to be clustered and the object b to be clustered is c, then the similarity between node A and node B is c. The similarity may be any of the following: cosine similarity, Wasserstein distance (Wasserstein metric), Euclidean distance, or JS divergence (Jensen-Shannon divergence). The application is not limited to the specific form of the similarity.
The reference threshold is a number greater than or equal to 0 and less than or equal to 1. Alternatively, the reference threshold is 70%. In the case that the similarity between two nodes is greater than or equal to the reference threshold, the probability that the two objects to be clustered corresponding to the two nodes belong to the same category is high. For example, assume that the reference threshold is 75%, node A corresponds to the object a to be clustered, and node B corresponds to the object b to be clustered. The similarity between node A and node B is 80%. Since 80% is greater than 75%, the probability that the object a to be clustered and the object b to be clustered belong to the same category is high (for example, if the category of the object a to be clustered is apple, the probability that the category of the object b to be clustered is apple is high).
The first node may be any one of the n nodes. The similarity between the first node and each of the n nodes is calculated respectively, and a node whose similarity is greater than or equal to the reference threshold is used as a first alternative node of the first node. For example, the n nodes include: node A, node B, node C, node D, where node A is the first node. The reference threshold is 70%. The similarity between node A and node B is 80%, the similarity between node A and node C is 60%, and the similarity between node A and node D is 70%. Since 80% is greater than 70%, node B is a first alternative node. Since 60% is less than 70%, node C is not a first alternative node. Since the similarity between node A and node D is equal to 70%, node D is a first alternative node.
Alternatively, a first alternative node may be determined for each of the n nodes separately. For example, the n nodes include: node a, node B, node C. The reference threshold is 75%. The degree of similarity between node a and node B (hereinafter, will be referred to as similarity a) is 60%, the degree of similarity between node a and node C (hereinafter, will be referred to as similarity B) is 75%, and the degree of similarity between node B and node C (hereinafter, will be referred to as similarity C) is 80%. Since the similarity a is less than the reference threshold, node a is not the first candidate node for node B, and node B is not the first candidate node for node a. Since the similarity b is equal to the reference threshold, node a is the first candidate node for node C, and node C is the first candidate node for node a. Since the similarity C is greater than the reference threshold, node B is the first candidate node for node C, which is the first candidate node for node B.
203. Connect the first node with the first alternative node to obtain an adjacency graph.
Since the probability that the category of the first alternative node is the same as the category of the first node is high, the first node and the first alternative node can be connected to obtain the adjacency graph. This reduces the noise associations in the adjacency graph and thus improves the accuracy of the information in the adjacency graph. When clustering of the objects to be clustered represented by the n nodes is realized by processing the adjacency graph, the accuracy of the obtained clustering result can be improved.
In the embodiment of the application, the adjacency graph can be an adjacency matrix, and the adjacency graph can also be an adjacency table.
According to this implementation, the reference threshold is used as the basis for determining the first alternative nodes of the first node, so that the number of noise-associated nodes of the first node among the first alternative nodes can be reduced, and the quality of the adjacency graph can be improved.
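Putting steps 201 to 203 together, the following is a minimal sketch that builds the adjacency graph as an adjacency matrix, again assuming feature vectors and cosine similarity (illustrative assumptions):

```python
import numpy as np

def build_adjacency(features, ref_threshold):
    """Symmetric 0/1 adjacency matrix: node i and node j are connected
    when their cosine similarity is >= the reference threshold."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ f.T                       # pairwise similarities of the n nodes
    adj = (sims >= ref_threshold).astype(float)
    np.fill_diagonal(adj, 0.0)           # no self-connections
    return adj

# e.g. adj = build_adjacency(np.random.rand(6, 128), ref_threshold=0.7)
```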
Optionally, before performing step 202, the following steps may also be performed:
21. Determine the similarity between the n nodes and the first node to obtain a first similarity set.
Among the n nodes, there are n-1 nodes other than the first node. The similarity between the first node and each of these n-1 nodes is calculated respectively to obtain n-1 similarities, which serve as the first similarity set. For example (example 2), the n nodes include: the first node, node A, and node B. The similarity between the first node and node A is similarity a, and the similarity between the first node and node B is similarity b, so the first similarity set of the first node includes similarity a and similarity b.
Alternatively, a first similarity set may be determined for each of the n nodes separately. For example, the n nodes include: node A, node B, node C. The similarity between node A and node B is similarity a, the similarity between node A and node C is similarity b, and the similarity between node B and node C is similarity c. The first similarity set of node A includes similarity a and similarity b. The first similarity set of node B includes similarity a and similarity c. The first similarity set of node C includes similarity b and similarity c.
22. Take the nodes corresponding to the k largest similarities in the first similarity set as second alternative nodes.
Select the nodes corresponding to the k largest similarities from the first similarity set of the first node to obtain the second alternative nodes of the first node.
For example, the n nodes include: a first node, node A, node B, node C. The similarity between the first node and node A is similarity a, the similarity between the first node and node B is similarity b, and the similarity between the first node and node C is similarity c. The first similarity set of the first node includes similarity a, similarity b and similarity c. Assume that k=2 and that similarity a is greater than similarity b, which is greater than similarity c. The largest 2 similarities in the first similarity set are similarity a and similarity b. The node corresponding to similarity a is node A, and the node corresponding to similarity b is node B, so the second alternative nodes of the first node include node A and node B.
Alternatively, a second alternative node may be determined for each of the n nodes separately. For example, the n nodes include: node A, node B, node C. The similarity between node A and node B is similarity a, the similarity between node A and node C is similarity b, and the similarity between node B and node C is similarity c. The first similarity set of node A includes similarity a and similarity b. The first similarity set of node B includes similarity a and similarity c. The first similarity set of node C includes similarity b and similarity c. Assume that k=1 and that similarity a is greater than similarity b, which is greater than similarity c. The largest similarity in the first similarity set of node A is similarity a; the node corresponding to similarity a is node B, so the second alternative node of node A includes node B. The largest similarity in the first similarity set of node B is similarity a; the node corresponding to similarity a is node A, so the second alternative node of node B includes node A. The largest similarity in the first similarity set of node C is similarity b; the node corresponding to similarity b is node A, so the second alternative node of node C includes node A.
In this step, the k nodes with the highest similarity to the first node are selected as the second alternative nodes. Subsequently, when the first alternative nodes are determined from the second alternative nodes, the number of first alternative nodes is limited to at most k. In this way, the number of first alternative nodes may be limited while reducing the number of noise-associated nodes of the first node among the first alternative nodes. Thereby, the amount of data processing for constructing the adjacency graph is reduced while the quality of the adjacency graph is improved. As an alternative embodiment, the specific implementation of step 202 may be:
23. Determine a node with similarity greater than or equal to the reference threshold among the second alternative nodes as a first alternative node.
The reference threshold in this step is the same as the reference threshold in step 202. As described in step 202, in the case that the similarity between two nodes is greater than or equal to the reference threshold, the probability that the two objects to be clustered corresponding to the two nodes belong to the same category is high. By taking a similarity greater than or equal to the reference threshold as the basis, valid associated nodes can be screened out from the second alternative nodes. Specifically, a node with similarity greater than or equal to the reference threshold among the second alternative nodes is used as a first alternative node.
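A minimal sketch of steps 22 and 23 for a single first node, assuming its first similarity set is given as an array: first keep the k largest similarities, then keep only those meeting the reference threshold:

```python
import numpy as np

def first_alternatives_via_topk(sims, k, ref_threshold):
    """sims[i] is the similarity between the first node and the i-th other
    node. Returns the indices of the first alternative nodes."""
    sims = np.asarray(sims, dtype=float)
    k = min(k, sims.size)
    top_k = np.argpartition(sims, -k)[-k:]      # second alternative nodes (step 22)
    return top_k[sims[top_k] >= ref_threshold]  # first alternative nodes (step 23)
```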
In the embodiment of the application, k is a positive integer, and in the process of implementing the technical scheme provided by the embodiment of the application, the size of k can be determined according to the requirement of a user. Different values of k have different effects, and specifically include the following points:
1. The value of k may affect the speed at which the second alternative nodes of the first node are determined from the n nodes, and thus the speed at which the adjacency graph is constructed. For example, in the case where n=100000 and k=8000, the time required to determine the second alternative nodes for each of the n nodes is t1. In the case where n=100000 and k=500, the time required to determine the second alternative nodes for each of the n nodes is t2. Obviously, t1 is greater than t2. That is, k is inversely related to the speed at which the adjacency graph is built.
2. The value of k may affect the amount of data processing required to determine the second alternative nodes of the first node from the n nodes. Specifically, the value of k is positively correlated with the data processing amount. An important indicator of whether the data processing device executing the technical scheme provided by the embodiment of the application can support the data processing amount is the storage capacity of the data processing apparatus. It is clear that the storage capacity is positively correlated with the memory cost, i.e. the storage capacity is positively correlated with the cost of the data processing device. Further, the data processing amount is positively correlated with the cost of the data processing apparatus. Still further, k is positively correlated with the cost of the data processing device.
3. The value of k can affect the number of valid associated nodes among the first alternative nodes determined based on the second alternative nodes, thereby affecting the number of valid associations in the adjacency graph. Obviously, the larger the value of k, the greater the probability that all the valid associated nodes of the first node are included in the first alternative nodes of the first node. For example, assume that the number of valid associated nodes of the first node among the n nodes is 1000. In the case of k=800, the first alternative nodes cannot contain all the valid associated nodes. In the case of k=1200, the first alternative nodes may contain all the valid associated nodes. That is, k is positively correlated with the quality of the adjacency graph.
The value of k may be determined by comprehensively considering the above three effects, so as to improve the quality of the obtained adjacency graph.
In one possible implementation, the data processing apparatus obtains a reference time length and/or a reference storage capacity, and k is obtained according to the reference time length and/or the reference storage capacity. In the embodiment of the application, the reference time length may be the expected time length for constructing the adjacency graph. For example, if the user expects the process of constructing the adjacency graph based on the n objects to be clustered to be completed within 10 minutes, the reference time length is 10 minutes. The reference storage capacity may be the storage capacity of the data processing apparatus. The data processing apparatus may acquire the reference time length by receiving the reference time length input by the user through the input component, or by receiving the reference time length sent by a terminal. Similarly, the data processing apparatus may acquire the reference storage capacity by receiving the reference storage capacity input by the user through the input component, or by receiving the reference storage capacity sent by a terminal. In the case where the data processing apparatus acquires the reference time length, k can be obtained based on the reference time length. In the case where the data processing apparatus acquires the reference storage capacity, k can be obtained from the reference storage capacity. In the case where the data processing apparatus acquires both the reference time length and the reference storage capacity, k may be obtained from the reference time length and the reference storage capacity.
In one implementation of obtaining k according to the reference time length, assuming that the reference time length is t_r, the reference time length and k satisfy the following formula:

k = (a × t_r) / n + b        (1)

wherein n is the number of objects to be clustered, and a and b are both positive numbers. Alternatively, a=100000 and b=2. In formula (1), t_r is given in seconds. For example, when a=100000, n=5000, t_r=1.8 seconds and b=5, k=41 can be determined according to formula (1). It should be understood that if the result according to formula (1) is not an integer, the result may be rounded to obtain k. For example, if the result obtained according to formula (1) is 80.3, rounding 80.3 gives 80, i.e. the value of k.
In another implementation of obtaining k according to the reference time length, assuming that the reference time length is t_r, the reference time length and k satisfy the following formula:

k = (a × t_r) / n²        (2)

wherein n is the number of objects to be clustered, and a is a positive number. Alternatively, a=10000000. In formula (2), t_r is given in seconds. For example, when a=10000000, n=1000 and t_r=1 second, k=10 can be determined according to formula (2). It should be understood that if the result according to formula (2) is not an integer, the result may be rounded to obtain k. For example, if the result obtained according to formula (2) is 100.6, rounding 100.6 gives 101, i.e. the value of k.
In one implementation of obtaining k according to the reference storage capacity, assuming that the reference storage capacity is c_r, the reference storage capacity and k satisfy the following formula:

k = √((a × c_r) / n)        (3)

wherein n is the number of objects to be clustered, and a is a positive number. Alternatively, a=1000. In formula (3), c_r is in bytes. For example, a=1000, n=50000 and c_r=10240 bytes. The result according to formula (3) is 14.3. Rounding 14.3 may determine k=14.
In another implementation of obtaining k according to the reference storage capacity, assuming that the reference storage capacity is c_r, the reference storage capacity and k satisfy the following formula:

k = √((a × c_r) / n) + b        (4)

wherein n is the number of objects to be clustered, and a and b are both positive numbers. Alternatively, a=1000 and b=5. In formula (4), c_r is in bytes. For example, a=1000, n=50000, c_r=10240 bytes and b=5. The result according to formula (4) is 19.3. Rounding 19.3 may determine k=19.
In one implementation of obtaining k according to the reference time length and the reference storage capacity, assuming that the reference time length is t_r and the reference storage capacity is c_r, the reference time length, the reference storage capacity and k satisfy a formula (5) in which n is the number of objects to be clustered, and a and b are both positive numbers. Alternatively, a=100 and b=10000. In formula (5), t_r is in seconds and c_r is in bytes. For example, a=100, b=10000, n=50000, t_r=1.5 seconds and c_r=25600 bytes. The result according to formula (5) is 81.2. Rounding 81.2 may determine k=81.
In another implementation of obtaining k according to the reference time length and the reference storage capacity, assuming that the reference time length is t_r and the reference storage capacity is c_r, the reference time length, the reference storage capacity and k satisfy a formula (6) in which n is the number of objects to be clustered, and a, b and c are positive numbers. Alternatively, a=100, b=10000 and c=5. In formula (6), t_r is in seconds and c_r is in bytes. For example, a=100, b=10000, c=5, n=50000, t_r=1.5 seconds and c_r=25600 bytes. The result according to formula (6) is 86.2. Rounding 86.2 may determine k=86.
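Using formulas (1) to (4) as reconstructed above (formulas (5) and (6) are not reproduced in this text, so they are omitted here), the selection of k can be sketched as follows; the worked examples from the description serve as checks:

```python
import math

def k_from_duration(t_r, n, a=100000, b=2):
    """Formula (1): k = a * t_r / n + b, with t_r in seconds."""
    return round(a * t_r / n + b)

def k_from_duration_quadratic(t_r, n, a=10000000):
    """Formula (2): k = a * t_r / n**2, with t_r in seconds."""
    return round(a * t_r / n ** 2)

def k_from_storage(c_r, n, a=1000, b=0):
    """Formulas (3)/(4): k = sqrt(a * c_r / n) + b, with c_r in bytes."""
    return round(math.sqrt(a * c_r / n) + b)

# Worked examples from the description:
assert k_from_duration(1.8, 5000, a=100000, b=5) == 41
assert k_from_duration_quadratic(1, 1000) == 10
assert k_from_storage(10240, 50000) == 14
assert k_from_storage(10240, 50000, b=5) == 19
```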
k is determined according to the reference time length and/or the reference storage capacity, and the above three effects can be comprehensively considered to determine a proper value for k. For example, a user who wants to shorten the time taken to cluster the objects to be clustered may reduce the reference time length. According to the reference time length, the data processing device can maximize the value of k on the premise that the time consumed by the clustering processing is less than or equal to the reference time length, and thereby determine the value of k. Therefore, the quality of the adjacency graph is improved on the premise of meeting the user requirement (the time for constructing the adjacency graph is less than or equal to the reference time length). For another example, a user who desires to perform the clustering processing on a data processing apparatus with a small storage capacity may reduce the reference storage capacity. According to the reference storage capacity, the data processing device maximizes the value of k on the premise that it can complete the clustering processing, and thereby determines the value of k. Therefore, the quality of the adjacency graph is improved on the premise of meeting the user requirement (completing the clustering processing with a data processing apparatus of small storage capacity). For another example, a user who desires both to use a data processing apparatus with a small storage capacity and to shorten the time consumed by the clustering processing may reduce the reference storage capacity and the reference time length at the same time. According to the reference storage capacity and the reference time length, the data processing device maximizes the value of k on the premise that it can complete the clustering processing and that the time consumed is less than or equal to the reference time length, and thereby determines the value of k. Therefore, the quality of the adjacency graph is improved on the premise of meeting the user requirement (completing the clustering processing with a data processing apparatus of small storage capacity, and the time for constructing the adjacency graph being less than or equal to the reference time length).
Referring to fig. 3, fig. 3 is a flowchart illustrating another data processing method according to an embodiment of the application.
301. Determining an adjacency relationship between the first candidate node and the first node according to the similarity between the first candidate node and the first node.
In an embodiment of the present application, the adjacency includes at least one of: a distance between the first candidate node and the first node, an azimuth angle between the first candidate node and the first node.
The azimuth angle between the first candidate node and the first node includes an azimuth angle of the first candidate node with respect to the first node and an azimuth angle of the first node with respect to the first candidate node.
In the embodiment of the application, the n nodes belong to the same plane, which is referred to as a reference plane. As shown in fig. 4, the reference plane carries a reference coordinate system xoy. The vector pointing from the first candidate node to the first node is called a first vector, and the vector pointing from the first node to the first candidate node is called a second vector. The azimuth angle of the first candidate node relative to the first node comprises the included angle between the first vector and the x-axis and the included angle between the first vector and the y-axis; the azimuth angle of the first node relative to the first candidate node comprises the included angle between the second vector and the x-axis and the included angle between the second vector and the y-axis. The included angle between a vector (first vector or second vector) and the x-axis is hereinafter referred to as the x azimuth angle, and the included angle between a vector and the y-axis as the y azimuth angle; an azimuth angle thus comprises an x azimuth angle and a y azimuth angle. As shown in fig. 4, the first candidate nodes of the first node (i.e., node A in fig. 4) include node B and node C. The vector from node B to node A is a first vector, and the vector from node A to node C is a second vector. The included angle between the first vector and the x-axis is β and that between the first vector and the y-axis is λ; the included angle between the second vector and the x-axis is η and that between the second vector and the y-axis is θ. The azimuth angle of node B relative to node A includes β and λ, and the azimuth angle of node A relative to node C includes η and θ.
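For concreteness, a small sketch of computing the x and y azimuth angles of the vector from one node to another is given below. The node coordinates are hypothetical, and taking the arccosine of each normalized component is one standard way to obtain the included angle between a vector and an axis.

```python
import math

def azimuth_angles(src, dst):
    """x and y azimuth angles, in degrees, of the vector src -> dst."""
    vx, vy = dst[0] - src[0], dst[1] - src[1]
    norm = math.hypot(vx, vy)
    x_azimuth = math.degrees(math.acos(vx / norm))  # included angle with the x-axis
    y_azimuth = math.degrees(math.acos(vy / norm))  # included angle with the y-axis
    return x_azimuth, y_azimuth

node_a = (0.0, 0.0)     # first node (hypothetical coordinates)
node_b = (-1.0, -2.0)   # candidate node (hypothetical coordinates)
beta, lam = azimuth_angles(node_b, node_a)    # first vector: B -> A
eta, theta = azimuth_angles(node_a, node_b)   # second vector: A -> B
print(round(beta), round(lam), round(eta), round(theta))   # -> 63 27 117 153
```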
The similarity between the first candidate node and the first node is referred to as a target similarity, and the adjacency relationship may represent the size of the target similarity. In one possible implementation, the adjacency relationship includes a distance (hereinafter referred to as a reference distance) between the first candidate node and the first node. The reference distance may be determined based on the target similarity, wherein the reference distance is positively correlated with the target similarity. For example, the first candidate nodes include a node a and a node B, the distance between node a and the first node is reference distance a, the distance between node B and the first node is reference distance b, the similarity between node a and the first node is target similarity a, and the similarity between node B and the first node is target similarity b. Assuming that target similarity a is greater than target similarity b, reference distance a may be made greater than reference distance b. Alternatively, the reference distance may be made proportional to the target similarity, wherein the scale factor t is a positive number.
For convenience of description, the value interval greater than or equal to a and less than or equal to b will be denoted by [a, b], the value interval greater than c and less than or equal to d will be denoted by (c, d], and the value interval greater than or equal to e and less than f will be denoted by [e, f).
In another possible implementation, the adjacency relationship includes an azimuth angle between the first candidate node and the first node. In the case where the x azimuth angle is within (0°, 90°] and the y azimuth angle is within (0°, 90°], the target similarity is within a first preset interval and the x azimuth angle is positively correlated with the similarity. In the case where the x azimuth angle is within (90°, 180°] and the y azimuth angle is within (0°, 90°], the target similarity is within a second preset interval and the x azimuth angle is negatively correlated with the similarity. In the case where the x azimuth angle is within (90°, 180°] and the y azimuth angle is within (90°, 180°], the target similarity is within the first preset interval and the x azimuth angle is positively correlated with the similarity. In the case where the x azimuth angle is within (0°, 90°] and the y azimuth angle is within (90°, 180°], the target similarity is within the second preset interval and the x azimuth angle is negatively correlated with the similarity.
For example, assume that the first preset interval is (0.5, 1) and the second preset interval is [0, 0.5). With the x azimuth angle equal to 60° and the y azimuth angle equal to 60°, the target similarity is within (0.5, 1) and is positively correlated with the x azimuth angle. With the x azimuth angle within (90°, 180°] and the y azimuth angle equal to 60°, the target similarity is within [0, 0.5) and is negatively correlated with the x azimuth angle. With the x azimuth angle equal to 80° and the y azimuth angle equal to 80°, the target similarity is within (0.5, 1) and is positively correlated with the x azimuth angle.
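Read compactly, the y azimuth angle selects which pair of cases applies and the x azimuth angle selects the interval within it. The sketch below encodes that mapping; the concrete endpoints 0.5 and 1 come from the example above, and the function is an illustrative reading of the four cases rather than the claimed construction.

```python
def interval_for_azimuth(x_az: float, y_az: float):
    """Preset similarity interval encoded by an edge's azimuth angles (sketch)."""
    first = (0.5, 1.0)    # the interval (0.5, 1) from the example above
    second = (0.0, 0.5)   # the interval [0, 0.5) from the example above
    if y_az <= 90.0:
        return first if x_az <= 90.0 else second
    return second if x_az <= 90.0 else first

print(interval_for_azimuth(60.0, 60.0))    # -> (0.5, 1.0), first preset interval
print(interval_for_azimuth(120.0, 60.0))   # -> (0.0, 0.5), second preset interval
```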
In one possible implementation, the adjacency includes a distance from node to node. Determining an adjacency relationship between the first candidate node and the first node according to the similarity between the first candidate node and the first node may include the following steps:
Taking the similarities between the first candidate nodes and the first node as a candidate similarity set, and determining the minimum value in the candidate similarity set as a reference similarity.

Obtaining a first weight and a second weight according to the difference between the first similarity and the reference similarity and the difference between the second similarity and the reference similarity, wherein the first similarity and the second similarity belong to the candidate similarity set.

Determining the distance between the first node and the second node and the distance between the first node and the third node according to the first weight and the second weight, wherein the second node is the node corresponding to the first similarity, and the third node is the node corresponding to the second similarity.
For example, the first candidate nodes of the first node include node a and node B. The similarity between the first node and node a (hereinafter referred to as similarity 1) is 0.6, and the similarity between the first node and node B (hereinafter referred to as similarity 2) is 0.8. The candidate similarity set of the first node therefore includes 0.6 and 0.8, of which 0.6 is the reference similarity. The difference between similarity 1 and the reference similarity (hereinafter referred to as the first difference) is 0, and the difference between similarity 2 and the reference similarity (hereinafter referred to as the second difference) is 0.2. The first weight and the second weight may be determined based on the first difference and the second difference, and the distance between the first node and node a and the distance between the first node and node B may then be determined based on the first weight and the second weight.
In an implementation manner of obtaining the first weight and the second weight according to the difference between the first similarity and the reference similarity and the difference between the second similarity and the reference similarity, assume that the first difference is c_1, the second difference is c_2, the first weight is w_1 and the second weight is w_2. Then c_1, c_2 and w_1 satisfy one formula and c_1, c_2 and w_2 satisfy another, the two formulas normalizing the first difference and the second difference into weights.
In this possible implementation manner, the first weight and the second weight, each between 0 and 1, may be obtained by normalizing the first difference and the second difference respectively, and the sum of the first weight and the second weight may be equal to 1. In this way, when the adjacency graph is processed later, the fusion weights of the information of the second node and the information of the third node can be determined according to the first weight and the second weight. For example, when a graph convolutional network is used to convolve the adjacency graph shown in fig. 5 to determine the class of node No. 1, the class information (A) of node No. 2 and the class information (B) of node No. 3 are extracted respectively, and the class information propagated to node No. 1 is determined by weighted summation: A×d_1 + B×d_2 = C, where C is the class information propagated to node No. 1, d_1 is the weight of node No. 2, and d_2 is the weight of node No. 3.
In another implementation manner of obtaining the first weight and the second weight according to the difference between the first similarity and the reference similarity and the difference between the second similarity and the reference similarity, with the first difference c_1, the second difference c_2, the first weight w_1 and the second weight w_2 as above, c_1, c_2 and w_1 satisfy one formula and c_1, c_2 and w_2 satisfy another, the formulas differing from those of the preceding implementation.
In an implementation of determining the distance between the first node and the second node (hereinafter referred to as a first distance) and the distance between the first node and the third node (hereinafter referred to as a second distance) according to the first weight and the second weight, assuming that the first weight is w_1, the second weight is w_2, the first distance is D_1 and the second distance is D_2, then w_1, w_2, D_1 and D_2 satisfy a formula parameterized by a positive number t, each distance being obtained from the corresponding weight.
In the embodiment of the application, the distance between the first alternative node and the first node is determined according to the similarity between the first node and the first alternative node, so that the distance between the nodes and the similarity between the nodes are positively correlated. Therefore, when the adjacency graph is processed subsequently, the weight of the information of different nodes can be determined according to the distance between the nodes, and the accuracy of the information in the adjacency graph, namely the quality of the adjacency graph, is improved.
In another implementation manner of determining the first distance and the second distance according to the first weight and the second weight, w_1, w_2, D_1 and D_2 satisfy a different formula, again parameterized by a positive number t.
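Putting these steps together, the sketch below goes from the candidate similarities of one first node to per-edge distances. Because the exact formulas are not reproduced above, the proportional normalization of the differences and the linear scaling by t are labeled assumptions; they merely satisfy the stated properties (weights between 0 and 1 that sum to 1, and distances positively correlated with similarity).

```python
def edge_distances(similarities, t=1.0):
    """Distances from a first node to its candidate nodes (illustrative).

    Steps mirror the description: the minimum similarity is the reference
    similarity; the differences from it are normalized into weights; the
    weights are scaled by a positive number t into distances. Both the
    normalization and the scaling are assumptions, not the claimed formulas.
    """
    ref = min(similarities)                    # reference similarity
    diffs = [s - ref for s in similarities]    # c_1, c_2, ...
    total = sum(diffs)
    if total == 0.0:                           # all similarities equal
        weights = [1.0 / len(diffs)] * len(diffs)
    else:
        weights = [c / total for c in diffs]   # w_i in [0, 1], summing to 1
    return [t * w for w in weights]            # D_i = t * w_i (assumed)

# Worked example from the text: similarities 0.6 (node a) and 0.8 (node B)
print(edge_distances([0.6, 0.8]))              # -> [0.0, 1.0]
```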
302. Connecting the first node with the first candidate node such that the first node and the first candidate node satisfy the adjacency relationship, to obtain the adjacency graph.

By connecting the first node with the first candidate node such that they satisfy the adjacency relationship, the obtained adjacency graph contains the similarity information between the first candidate node and the first node.
For example, in the adjacency graph shown in fig. 6, the first candidate nodes include node A and node B; the distance between the first node (i.e., node C in fig. 6) and node A is d_1, and the distance between the first node and node B is d_2. If d_1/d_2 = 3/4, this characterizes that the ratio of the similarity between node A and node C to the similarity between node B and node C is 3/4. Optionally, when the adjacency graph is processed later, the distances between nodes are used to determine the weights of the information between nodes. For example, in the adjacency graph shown in fig. 6, if the category of node C is to be determined using the category information of node A and the category information of node B, the category information propagated from node A and node B to node C may be determined by weighted summation, for example A_m×w_1 + B_m×w_2 = C_m, wherein A_m is the category information of node A, B_m is the category information of node B, and C_m is the category information propagated to node C. Here w_1 and w_2 are constants determined according to d_1 and d_2.
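As a sketch of this weighted propagation, assume (an assumption, not the claimed formula) that each neighbor's weight is its distance normalized over the distances of all neighbors, so that edges with larger distances, i.e., higher similarities under the construction above, contribute more:

```python
def propagate(category_info, distances):
    """Weighted sum of neighbor category information (illustrative).

    Assumption: w_i = d_i / sum(d), so neighbors at larger distances
    (i.e., higher similarity under the construction above) weigh more.
    """
    total = sum(distances)
    weights = [d / total for d in distances]
    return sum(a * w for a, w in zip(category_info, weights))

# Fig. 6 example: d1/d2 = 3/4, with hypothetical category scores for A and B
A_m, B_m = 0.9, 0.2
C_m = propagate([A_m, B_m], [3.0, 4.0])
print(round(C_m, 3))   # -> 0.5  (0.9 * 3/7 + 0.2 * 4/7)
```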
According to the method, the adjacent relation between the nodes is determined according to the similarity between the nodes. By making the adjacency graph satisfy the adjacency relation, the accuracy of information in the adjacency graph can be improved, and the quality of the adjacency graph can be further improved.
Obviously, the reference threshold should differ for different types of data. Determining whether a certain node of the n nodes is taken as the first candidate node of the first node based on a reference threshold with a fixed value is therefore unreasonable, and will also introduce more noise associations into the subsequently built adjacency graph.
For example, assume that the data corresponding to a first node (hereinafter referred to as node A) and the data corresponding to a node B are both images, while the data corresponding to a node C and the data corresponding to a node D are both voice data. The similarity threshold for judging whether two images belong to the same category may be larger than the similarity threshold for judging whether two pieces of voice data belong to the same category: for example, two images are determined to belong to the same category when their similarity is greater than or equal to 90%, whereas two pieces of voice data are determined to belong to the same category when their similarity is greater than or equal to 80%. It is obviously not reasonable to use the similarity threshold of voice data to determine whether node B is the first candidate node of node A, nor to use the similarity threshold of images to determine whether node D is the first candidate node of node C. That is, it is not reasonable to use the same reference threshold both to determine whether node B is the first candidate node of node A and to determine whether node D is the first candidate node of node C.
Considering that the reference threshold serves as the basis for judging whether the data corresponding to two nodes belong to the same category, the reference threshold can be determined according to the data type of the objects to be clustered. In one possible implementation manner, feature extraction processing is performed on any one of the n objects to be clustered to obtain first feature data. The data type of the object to be clustered is determined according to the first feature data, wherein the data type includes image, voice and sentence. A reference threshold is then obtained according to the data type of the object to be clustered and a reference mapping relationship, wherein the reference mapping relationship is a mapping relationship between data types and similarity thresholds. Alternatively, the reference mapping relationship may be as shown in table 1.
Data type | Similarity threshold
---|---
Image | 0.9
Voice | 0.88
Sentence | 0.85

TABLE 1
For example, if the data type of the object to be clustered is determined to be voice according to the first feature data, the reference threshold may be determined to be 0.88 according to the reference mapping relationship shown in table 1. For another example, if the data type of the object to be clustered is determined to be sentence according to the first feature data, the reference threshold is determined to be 0.85 according to the reference mapping relationship shown in table 1.
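A minimal sketch of this lookup follows; the threshold values are the ones from table 1, while the way the data type is obtained (in practice, from the feature-based type determination described above) is left outside the sketch.

```python
# Mapping relationship between data type and similarity threshold (table 1)
REFERENCE_MAPPING = {"image": 0.90, "voice": 0.88, "sentence": 0.85}

def reference_threshold(data_type: str) -> float:
    """Reference threshold for an object's data type, per table 1."""
    return REFERENCE_MAPPING[data_type]

print(reference_threshold("voice"))      # -> 0.88
print(reference_threshold("sentence"))   # -> 0.85
```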
Optionally, feature extraction processing may be performed on at least two of the n objects to be clustered to obtain at least two pieces of feature data. The confidences of the data types of the at least two objects to be clustered are determined according to the at least two pieces of feature data; the data types of the at least two objects to be clustered are determined according to these confidences; and the reference threshold is obtained according to the data types of the at least two objects to be clustered and the reference mapping relationship.
For example, the n objects to be clustered include a first object to be clustered and a second object to be clustered. Feature extraction processing is performed on the first object to be clustered to obtain first feature data, and on the second object to be clustered to obtain second feature data. According to the first feature data, the confidence that the data type of the first object to be clustered is image is determined to be 0.8; according to the second feature data, the confidence that the data type of the second object to be clustered is sentence is determined to be 0.6. Since 0.8 is greater than 0.6, the data type of the first object to be clustered may be determined to be image. Since the data types of the n objects to be clustered are generally the same, once the data type of the first object to be clustered is determined to be image, it may be determined that the data types of both objects to be clustered are images (i.e., the data types of the n objects to be clustered are images).
For another example, the n objects to be clustered include a first object to be clustered, a second object to be clustered and a third object to be clustered. Feature extraction processing is performed on each of them to obtain first, second and third feature data respectively. According to the first feature data, the confidence that the data type of the first object to be clustered is image is determined to be 0.8; according to the second feature data, the confidence that the data type of the second object to be clustered is sentence is determined to be 0.9; according to the third feature data, the confidence that the data type of the third object to be clustered is image is determined to be 0.78. From these confidences, the data type of the first object to be clustered can be determined to be image, that of the second object to be clustered to be sentence, and that of the third object to be clustered to be image. Since the number of objects to be clustered whose data type is image is 2 and the number whose data type is sentence is 1, the data types of the first, second and third objects to be clustered can all be determined to be image according to the principle of "the minority obeys the majority" (i.e., the data types of the n objects to be clustered are images).
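The sketch below captures this majority vote. The per-object (type, confidence) pairs would in practice come from the feature-based type determination, which is assumed here; the listed values are those of the worked example.

```python
from collections import Counter

def majority_data_type(per_object_predictions):
    """per_object_predictions: one (data_type, confidence) pair per object,
    each already the most confident type for that object. Returns the type
    shared by the majority of objects."""
    votes = Counter(data_type for data_type, _ in per_object_predictions)
    return votes.most_common(1)[0][0]

# Worked example from the text: image@0.8, sentence@0.9, image@0.78
predictions = [("image", 0.8), ("sentence", 0.9), ("image", 0.78)]
print(majority_data_type(predictions))   # -> "image"
```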
According to the embodiment, the reference threshold is determined according to the data type of the first object to be clustered in the n objects to be clustered and the reference mapping relation, so that different reference thresholds can be set for data of different data types. And determining a first alternative node of the first node according to the reference threshold value, so that noise association nodes in the first alternative node can be reduced, noise association in the adjacency graph can be reduced, and the quality of the adjacency graph is improved.
Based on the technical scheme provided by the embodiment of the application, the embodiment of the application also provides several possible application scenes.
In the era of rapid expansion of data volume, hidden association and information between data can be obtained by clustering the data. Therefore, how to efficiently and accurately cluster data is of great importance.
Based on the similarities between the data in a data set, an adjacency graph corresponding to the data set can be constructed. The adjacency graph is then processed by a clustering network to obtain a clustering result of the data set. Thus, the accuracy of the information in the adjacency graph, which includes the number of valid associated nodes, affects the accuracy of the clustering result of the data set. Alternatively, the clustering network may be a graph convolutional network (GCN).
According to the technical scheme provided by the embodiment of the application, noise association in the adjacent graphs of n objects to be clustered can be reduced, and the number of effective association nodes is increased. Therefore, the accuracy of the clustering result of the n objects to be clustered can be improved.
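As an overview of how the pieces fit together, a possible end-to-end flow is sketched below. `build_adjacency_graph` composes the steps discussed earlier (the k most similar candidates, then the reference threshold), while `gcn_cluster`, the `similarity` callable and all names are hypothetical placeholders rather than the patented implementation.

```python
def build_adjacency_graph(features, similarity, reference_threshold, k):
    """Weighted adjacency over n nodes from pairwise similarities (illustrative)."""
    n = len(features)
    edges = {}
    for i in range(n):
        sims = [(similarity(features[i], features[j]), j)
                for j in range(n) if j != i]
        sims.sort(reverse=True)
        # second candidate nodes: the k most similar; first candidate nodes:
        # those among them whose similarity reaches the reference threshold
        for s, j in [(s, j) for s, j in sims[:k] if s >= reference_threshold]:
            edges[(i, j)] = s     # the edge carries the similarity information
    return edges

# clusters = gcn_cluster(build_adjacency_graph(objs, cosine_sim, 0.9, k=86))
```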
Scene A: Neural networks have been widely used in recent years for various tasks (e.g., image recognition, sentence recognition) thanks to their powerful performance. The performance of a neural network in these fields depends on its training effect, and the training effect depends on the quantity of training data: the more training data there is, the better the training effect, and the better the trained neural network performs the corresponding task (such as image recognition or sentence recognition).
The training data refers to images or sentences with labeling information. For example, if the task to be executed is to cluster the content contained in images and determine which images contain apples, bananas, pears, peaches, oranges or watermelons, the labeling information includes apple, banana, pear, peach, orange and watermelon. For another example, if the task to be executed is to cluster the contents of sentence descriptions and determine whether each sentence describes an automobile fault, the labeling information includes "sentence describing an automobile fault" and "sentence not describing an automobile fault".
The more accurate the labeling information of the training data, the better the training effect of the neural network; that is, the higher the matching degree between the labeling information of the training data and its real content, the better the training effect. For example, labeling an image containing pears as apple is incorrect. For another example, labeling the sentence "Where shall we eat tonight?" as describing an automobile fault is also incorrect. Training data with incorrect labeling information degrades the training effect, so conventional methods mostly complete the labeling of training data by manual labeling. However, when the quantity of training data is large, manual labeling is inefficient and labor costs are high.
By applying the technical scheme provided by the embodiment of the application, an adjacency graph of the data set to be labeled is constructed. The adjacency graph is processed by the GCN to obtain a clustering result of the data set to be labeled, and the labeling information of the data to be labeled is determined according to the clustering result. Because the accuracy of the information in an adjacency graph constructed by the technical scheme provided by the embodiment of the application is high, the accuracy of the clustering result obtained based on the adjacency graph can be improved, thereby improving the accuracy of the labeling information of the data to be labeled.
Scene B: With the rapid development of internet technology, social networks are becoming more popular, and people can communicate by establishing friend relationships on social networks. If each user on the social network is regarded as a node, the whole social network can be regarded as a pending adjacency graph, wherein the connection relationships between nodes in the pending adjacency graph are determined by the friend relationships between users. Taking the similarity between user attributes (such as age, gender, hobbies, home location and education background) as the similarity between nodes, a social network adjacency graph can be constructed according to the technical scheme provided by the embodiment of the application. The attributes of a node, i.e., the attributes of the corresponding user, may then be determined by processing the social network adjacency graph using the GCN.
For example, in the social adjacency graph shown in fig. 7, the user corresponding to node No. 1 is Zhang San, the user corresponding to node No. 2 is Li Si, and the user corresponding to node No. 3 is Wang Wu. The basis for constructing the social adjacency graph is the educational background of the users; that is, the similarity between users' educational backgrounds is used as the similarity between the corresponding nodes, and the social adjacency graph is then constructed according to the technical scheme provided by the embodiment of the application. Two nodes in the social adjacency graph being connected characterizes the corresponding users as having a friend relationship; here Zhang San, Li Si and Wang Wu are all friends. Suppose the hobby of both Zhang San and Li Si is basketball, while the attributes of Wang Wu contain no hobby information. By processing the social adjacency graph with the GCN, the hobby information of node No. 3 can be determined, for example, that the probability of node No. 3 liking basketball is 90%, i.e., Wang Wu very probably likes playing basketball. Further, after determining that Wang Wu probably likes basketball, information related to basketball may be pushed to Wang Wu's account.
It will be appreciated by those skilled in the art that in the above-described method of the specific embodiments, the written order of steps is not meant to imply a strict order of execution but rather should be construed according to the function and possibly inherent logic of the steps.
The foregoing has detailed the method of the embodiments of the present application; the apparatus of the embodiments of the present application is described below.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application, where the apparatus 1 includes: an acquisition unit 11, a first determination unit 12, a connection unit 13, a second determination unit 14, a feature extraction processing unit 15, a third determination unit 16, a first processing unit 17, a second processing unit 18, wherein:
an obtaining unit 11, configured to obtain n nodes, where n is an integer greater than or equal to 2, and the nodes are used to represent objects to be clustered;
A first determining unit 12, configured to determine, as a first candidate node, a node, which has a similarity with a first node, among the n nodes, that is greater than or equal to a reference threshold, the first node belonging to the n nodes;
and the connection unit 13 is used for connecting the first node with the first alternative node to obtain an adjacency graph, wherein the adjacency graph is used for clustering objects to be clustered represented by the n nodes.
In combination with any of the embodiments of the present application, the device 1 further comprises:
A second determining unit 14, configured to determine the similarities between the n nodes and the first node to obtain a first similarity set, before the node whose similarity with the first node is greater than or equal to the reference threshold is determined as the first candidate node;
The second determining unit 14 is further configured to use, as a second candidate node, a node corresponding to the k similarities that are the largest in the first similarity set;
The first determining unit 12 is configured to:
and determining a node with the similarity being greater than or equal to the reference threshold value in the second alternative nodes as the first alternative node.
In combination with any embodiment of the present application, the connection unit 13 is configured to:
Determining an adjacency relation between the first alternative node and the first node according to the similarity between the first alternative node and the first node;
and connecting the first node with the first alternative node to enable the first node and the first alternative node to meet the adjacency relation, so as to obtain the adjacency graph.
In combination with any one of the embodiments of the present application, the adjacency includes a distance between the first candidate node and the first node;
The connection unit 13 is configured to:
And determining the distance between the first alternative node and the first node according to the similarity between the first alternative node and the first node, wherein the distance is positively correlated with the similarity.
In combination with any embodiment of the present application, the connection unit 13 is configured to:
Taking the similarity between the first alternative node and the first node as an alternative similarity set, and determining the minimum value in the alternative similarity set as a reference similarity;
Obtaining a first weight and a second weight according to the difference between the first similarity and the reference similarity and the difference between the second similarity and the reference similarity, wherein the first similarity and the second similarity belong to the alternative similarity set;
And determining the distance between the first node and the second node and the distance between the first node and the third node according to the first weight and the second weight, wherein the second node is a node corresponding to the first similarity, and the third node is a node corresponding to the second similarity.
In combination with any of the embodiments of the present application, the device 1 further comprises:
the feature extraction processing unit 15 is configured to perform feature extraction processing on the object to be clustered to obtain first feature data before determining, as the first candidate node, a node with a similarity greater than or equal to the reference threshold in the second candidate nodes;
A third determining unit 16, configured to determine a data type of the object to be clustered according to the first feature data, where the data type includes an image, a voice, and a sentence;
the first processing unit 17 is configured to obtain the reference threshold according to the data type of the object to be clustered and a reference mapping relationship, where the reference mapping relationship is a mapping relationship between the data type and a similarity threshold.
In connection with any one of the embodiments of the present application,
The obtaining unit 11 is configured to obtain a reference duration and/or a reference storage capacity before the node corresponding to the k biggest similarities in the first similarity set is used as the second candidate node;
The apparatus further comprises:
A second processing unit 18, configured to obtain the k according to the reference duration and/or the reference storage capacity.
In combination with any embodiment of the present application, the obtaining unit 11 is further configured to obtain a clustering network;
the device 1 further comprises:
And the third processing unit 16 is configured to process the adjacency graph by using the clustering network, so as to obtain the clustering result of the objects to be clustered represented by the n nodes.
The second determining unit 14 is configured to:
And respectively determining the similarity between the object to be clustered represented by the first node and the object to be clustered represented by each node in the n nodes to obtain the first similarity set.
In this implementation, taking the reference threshold as the basis for determining the first candidate nodes of the first node can reduce the number of noise-associated nodes of the first node among the first candidate nodes. Thus, the quality of the adjacency graph can be improved.
In some embodiments, the functions or modules included in the apparatus provided by the embodiments of the present application may be used to perform the methods described in the foregoing method embodiments, and specific implementations thereof may refer to descriptions of the foregoing method embodiments, which are not repeated herein for brevity.
Fig. 9 is a schematic hardware structure of a data processing apparatus according to an embodiment of the present application. The data processing apparatus 2 comprises a processor 21, a memory 22, an input device 23 and an output device 24. The processor 21, the memory 22, the input device 23 and the output device 24 are coupled by connectors, which include various interfaces, transmission lines or buses, etc.; the embodiments of the present application are not limited in this respect. It should be appreciated that in various embodiments of the application, "coupled" means interconnected in a particular way, including directly or indirectly through other devices, for example through various interfaces, transmission lines, buses, etc.
The processor 21 may be one or more graphics processors (graphics processing unit, GPUs), which in the case of a GPU as the processor 21 may be a single core GPU or a multi-core GPU. Alternatively, the processor 21 may be a processor group formed by a plurality of GPUs, and the plurality of processors are coupled to each other through one or more buses. In the alternative, the processor may be another type of processor, and the embodiment of the application is not limited.
Memory 22 may be used to store computer program instructions as well as various types of computer program code for performing aspects of the present application. Optionally, the memory includes, but is not limited to, random access memory (random access memory, RAM), read-only memory (ROM), erasable programmable read-only memory (erasable programmable read only memory, EPROM), or portable read-only memory (compact disc read-only memory, CD-ROM) for associated instructions and data.
The input means 23 are for inputting data and/or signals and the output means 24 are for outputting data and/or signals. The input device 23 and the output device 24 may be separate devices or may be an integral device.
It will be appreciated that, in the embodiment of the present application, the memory 22 may be used to store not only related instructions, but also related data, for example, the memory 22 may be used to store objects to be clustered acquired through the input device 23, or the memory 22 may be used to store an adjacency graph obtained through the processor 21, etc., and the embodiment of the present application is not limited to the data specifically stored in the memory.
It will be appreciated that figure 9 only shows a simplified design of a data processing apparatus. In practical applications, the data processing apparatus may also include other necessary elements, including but not limited to any number of input/output devices, processors, memories, etc., and all data processing apparatuses capable of implementing the embodiments of the present application are within the scope of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein. It will be further apparent to those skilled in the art that the descriptions of the various embodiments of the present application are provided with emphasis, and that the same or similar parts may not be described in detail in different embodiments for convenience and brevity of description, and thus, parts not described in one embodiment or in detail may be referred to in description of other embodiments.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, the procedures or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in or transmitted across a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wirelessly (e.g., infrared, radio, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital versatile disc (DVD)), or a semiconductor medium (e.g., a solid state disk (SSD)), or the like.
Those of ordinary skill in the art will appreciate that implementing all or part of the above-described method embodiments may be accomplished by a computer program to instruct related hardware, the program may be stored in a computer readable storage medium, and the program may include the above-described method embodiments when executed. And the aforementioned storage medium includes: a read-only memory (ROM) or a random-access memory (random access memory, RAM), a magnetic disk or an optical disk, or the like.
Claims (7)
1. A method of data processing, the method comprising:
Acquiring n nodes, wherein n is an integer greater than or equal to 2, and the nodes are used for representing objects to be clustered, and the objects to be clustered comprise images, voices or sentences;
determining the similarity between the n nodes and a first node to obtain a first similarity set, wherein the first node belongs to the n nodes;
acquiring a reference time length and/or a reference storage capacity;
Obtaining k according to the reference time length and/or the reference storage capacity;
taking the nodes corresponding to the k largest similarities in the first similarity set as second alternative nodes;
Performing feature extraction processing on the objects to be clustered to obtain first feature data;
Determining the data type of the object to be clustered according to the first characteristic data, wherein the data type comprises images, voices and sentences;
obtaining a reference threshold according to the data type and the reference mapping relation of the objects to be clustered, wherein the reference mapping relation is a mapping relation between the data type and the similarity threshold;
Determining a node with similarity greater than or equal to the reference threshold value in the second alternative nodes as a first alternative node;
Connecting the first node with the first alternative node to obtain an adjacency graph, wherein the adjacency graph is used for clustering objects to be clustered represented by the n nodes; the step of connecting the first node with the first alternative node to obtain an adjacency graph includes: determining an adjacency relation between the first alternative node and the first node according to the similarity between the first alternative node and the first node, wherein the adjacency relation comprises an azimuth angle between the first alternative node and the first node, and the azimuth angle is positively correlated with the similarity; connecting the first node with the first alternative node to enable the first node and the first alternative node to meet the adjacency relation, so as to obtain the adjacency graph;
the determining the adjacency relation between the first alternative node and the first node according to the similarity between the first alternative node and the first node comprises the following steps: taking the similarity between the first alternative node and the first node as an alternative similarity set, and determining the minimum value in the alternative similarity set as a reference similarity; obtaining a first weight and a second weight according to the difference between the first similarity and the reference similarity and the difference between the second similarity and the reference similarity, wherein the first similarity and the second similarity belong to the alternative similarity set; and determining the distance between the first node and the second node and the distance between the first node and the third node according to the first weight and the second weight, wherein the second node is a node corresponding to the first similarity, and the third node is a node corresponding to the second similarity.
2. The method according to claim 1, wherein the method further comprises:
Acquiring a clustering network;
and processing the adjacency graph by using the clustering network to obtain a clustering result of the objects to be clustered represented by the n nodes.
3. The method of claim 1, wherein the determining the similarities between the n nodes and the first node, resulting in a first set of similarities, comprises:
And respectively determining the similarity between the object to be clustered represented by the first node and the object to be clustered represented by each node in the n nodes to obtain the first similarity set.
4. A data processing apparatus, the apparatus comprising:
the acquisition unit is used for acquiring n nodes, wherein n is an integer greater than or equal to 2, the nodes are used for representing the objects to be clustered, and the objects to be clustered comprise images, voices or sentences;
A second determining unit, configured to determine similarities between the n nodes and a first node, to obtain a first similarity set, where the first node belongs to the n nodes;
acquiring a reference time length and/or a reference storage capacity;
Obtaining k according to the reference time length and/or the reference storage capacity;
The second determining unit is configured to use, as second candidate nodes, the nodes corresponding to the k largest similarities in the first similarity set;
Performing feature extraction processing on the objects to be clustered to obtain first feature data;
Determining the data type of the object to be clustered according to the first characteristic data, wherein the data type comprises images, voices and sentences;
obtaining a reference threshold according to the data type and the reference mapping relation of the objects to be clustered, wherein the reference mapping relation is a mapping relation between the data type and the similarity threshold;
A first determining unit, configured to determine, as a first candidate node, a node, of the second candidate nodes, where the similarity is greater than or equal to the reference threshold;
the connection unit is used for connecting the first node with the first alternative node to obtain an adjacency graph, wherein the adjacency graph is used for clustering objects to be clustered represented by the n nodes;
The connecting unit is specifically configured to: determining an adjacency relation between the first alternative node and the first node according to the similarity between the first alternative node and the first node, wherein the adjacency relation comprises an azimuth angle between the first alternative node and the first node, and the azimuth angle is positively correlated with the similarity; connecting the first node with the first alternative node to enable the first node and the first alternative node to meet the adjacency relation, so as to obtain the adjacency graph;
the connecting unit is specifically configured to: taking the similarity between the first alternative node and the first node as an alternative similarity set, and determining the minimum value in the alternative similarity set as a reference similarity; obtaining a first weight and a second weight according to the difference between the first similarity and the reference similarity and the difference between the second similarity and the reference similarity, wherein the first similarity and the second similarity belong to the alternative similarity set; and determining the distance between the first node and the second node and the distance between the first node and the third node according to the first weight and the second weight, wherein the second node is a node corresponding to the first similarity, and the third node is a node corresponding to the second similarity.
5. A processor for performing the method of any one of claims 1 to 3.
6. An electronic device, comprising: a processor, transmission means, input means, output means and memory for storing computer program code comprising computer instructions which, when executed by the processor, cause the electronic device to perform the method of any one of claims 1 to 3.
7. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of any of claims 1 to 3.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010102367.XA CN111340082B (en) | 2020-02-19 | 2020-02-19 | Data processing method and device, processor, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010102367.XA CN111340082B (en) | 2020-02-19 | 2020-02-19 | Data processing method and device, processor, electronic equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111340082A CN111340082A (en) | 2020-06-26 |
CN111340082B true CN111340082B (en) | 2024-08-13 |
Family
ID=71183979
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010102367.XA Active CN111340082B (en) | 2020-02-19 | 2020-02-19 | Data processing method and device, processor, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111340082B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107291760A (en) * | 2016-04-05 | 2017-10-24 | 阿里巴巴集团控股有限公司 | Unsupervised feature selection approach, device |
CN109614510A (en) * | 2018-11-23 | 2019-04-12 | 腾讯科技(深圳)有限公司 | A kind of image search method, device, graphics processor and storage medium |
CN110705629A (en) * | 2019-09-27 | 2020-01-17 | 北京市商汤科技开发有限公司 | Data processing method and related device |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10289802B2 (en) * | 2010-12-27 | 2019-05-14 | The Board Of Trustees Of The Leland Stanford Junior University | Spanning-tree progression analysis of density-normalized events (SPADE) |
CN110781957B (en) * | 2019-10-24 | 2023-05-30 | 深圳市商汤科技有限公司 | Image processing method and device, electronic equipment and storage medium |
2020
- 2020-02-19 CN CN202010102367.XA patent/CN111340082B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN111340082A (en) | 2020-06-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |