CN113111133A - User classification method and device - Google Patents
User classification method and device Download PDFInfo
- Publication number
- CN113111133A CN113111133A CN202110383817.1A CN202110383817A CN113111133A CN 113111133 A CN113111133 A CN 113111133A CN 202110383817 A CN202110383817 A CN 202110383817A CN 113111133 A CN113111133 A CN 113111133A
- Authority
- CN
- China
- Prior art keywords
- community
- user
- determining
- users
- attribute
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 44
- 239000013598 vector Substances 0.000 claims description 36
- 238000013145 classification model Methods 0.000 claims description 12
- 238000004590 computer program Methods 0.000 claims description 10
- 238000010276 construction Methods 0.000 claims description 5
- 230000003542 behavioural effect Effects 0.000 claims description 3
- 230000006399 behavior Effects 0.000 description 47
- 238000010586 diagram Methods 0.000 description 14
- 230000006870 function Effects 0.000 description 10
- 230000000875 corresponding effect Effects 0.000 description 9
- 238000004891 communication Methods 0.000 description 7
- 230000008569 process Effects 0.000 description 7
- 238000012545 processing Methods 0.000 description 7
- 238000004364 calculation method Methods 0.000 description 4
- 230000008859 change Effects 0.000 description 4
- 230000003287 optical effect Effects 0.000 description 4
- 230000000694 effects Effects 0.000 description 3
- 230000004048 modification Effects 0.000 description 3
- 238000012986 modification Methods 0.000 description 3
- XOFYZVNMUHMLCC-ZPOLXVRWSA-N prednisone Chemical compound O=C1C=C[C@]2(C)[C@H]3C(=O)C[C@](C)([C@@](CC4)(O)C(=O)CO)[C@@H]4[C@@H]3CCC2=C1 XOFYZVNMUHMLCC-ZPOLXVRWSA-N 0.000 description 3
- 230000008901 benefit Effects 0.000 description 2
- 239000000835 fiber Substances 0.000 description 2
- 238000007726 management method Methods 0.000 description 2
- 238000005457 optimization Methods 0.000 description 2
- 230000000644 propagated effect Effects 0.000 description 2
- 239000004065 semiconductor Substances 0.000 description 2
- 230000002159 abnormal effect Effects 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 238000004422 calculation algorithm Methods 0.000 description 1
- 230000001276 controlling effect Effects 0.000 description 1
- 230000002596 correlated effect Effects 0.000 description 1
- 238000009795 derivation Methods 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000002372 labelling Methods 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 239000013307 optical fiber Substances 0.000 description 1
- 238000005192 partition Methods 0.000 description 1
- 238000012216 screening Methods 0.000 description 1
- 238000000638 solvent extraction Methods 0.000 description 1
- 238000006467 substitution reaction Methods 0.000 description 1
- 238000012549 training Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Human Resources & Organizations (AREA)
- Health & Medical Sciences (AREA)
- Economics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Marketing (AREA)
- Primary Health Care (AREA)
- Strategic Management (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a user classification method and device, and relates to the technical field of computers. One embodiment of the method comprises: constructing a user relationship network according to the behavior data and the attribute data of each user; dividing all users into a plurality of communities according to the network structure of the user relationship network and the attribute information of each node in the user relationship network; determining the attribute information of the community according to the attribute information of each node in the community, and determining the user category of the user in the community according to the network structure and the attribute information of the community. According to the embodiment, the user relationship network is established according to the behavior data and the attribute data of the users, the possibility that the users with weak association relation are divided into the same community can be reduced, the community division purity and precision are improved, and the accuracy of user classification and identification is improved.
Description
Technical Field
The invention relates to the technical field of computers, in particular to a user classification method and device.
Background
A relational network is a data structure made up of nodes and edges. In a relational network, entities correspond to nodes in the network, and the association relationship between the entities constitutes the whole network structure. The connection between the entities has relevance with different strengths, the part with tight connection can be regarded as a community, the nodes in the community have tight connection, and the relative connection between two communities is sparse, which is called community division. The relationship network can be used for describing human social attributes, incidence relations among mechanisms, internet user group analysis and other fields.
The relationship network is reasonably constructed, and the user categories of user groups in the community are identified by a community division method, so that the method has great significance to enterprises.
Disclosure of Invention
In view of this, embodiments of the present invention provide a method and an apparatus for classifying users, which construct a user relationship network according to behavior data and attribute data of users, so as to reduce the possibility that users with weak association are classified into the same community, improve the purity and precision of community classification, and improve the accuracy of user classification and identification.
To achieve the above object, according to an aspect of an embodiment of the present invention, there is provided a user classification method, including:
constructing a user relationship network according to the behavior data and the attribute data of each user;
dividing all users into a plurality of communities according to the network structure of the user relationship network and the attribute information of each node in the user relationship network;
determining the attribute information of the community according to the attribute information of each node in the community, and determining the user category of the user in the community according to the network structure and the attribute information of the community.
Optionally, constructing a user relationship network according to the behavior data and the attribute data of each user includes:
determining an edge weight relationship between any two users according to behavior data of the any two users, and constructing a basic relationship network by taking each user as a node and taking the edge weight relationship between the users as an edge;
and determining the attribute feature vector of each user according to the attribute data of each user, and taking the attribute feature vector of each user as the attribute information of the corresponding node in the basic relationship network to obtain the user relationship network.
Optionally, determining an edge weight relationship between any two users according to behavior data of the any two users includes:
determining whether medium association exists between any two users according to the behavior data of any two users; if not, setting the edge weight relation between any two users as a first weight; and if so, increasing the first weight.
Optionally, the media association comprises at least one of:
the same behavior object exists between any two users; any two users have the same IP; any two users have the same contact information; any two users have the same address.
Optionally, dividing all users into a plurality of communities according to the network structure of the user relationship network and the attribute information of each node in the user relationship network, including:
each node in the user relationship network is respectively used as a community; for any community, the following steps are performed:
determining the modularity of any community according to the network structure of any community, determining the attribute similarity of any community according to the attribute information of any community, and determining the community consistency of any community according to the modularity and the attribute similarity;
determining a combined module degree of any community after being respectively combined with each neighbor community according to a network structure of the user relationship network, determining a combined attribute similarity of any community after being respectively combined with each neighbor community according to attribute information of any community and each neighbor community, and determining a combined community consistency of any community after being respectively combined with each neighbor community according to the combined module degree and the combined attribute similarity;
and determining a community consistency increment according to the community consistency and the merged community consistency, judging whether the community consistency which is greater than 0 exists, and if so, merging any community into a neighbor community corresponding to the maximum community consistency increment.
Optionally, determining the community consistency of any community according to the modularity and the attribute consistency includes:
and carrying out weighted summation on the modularity and the attribute consistency, and taking the result obtained by weighted summation as the community consistency of any community.
Optionally, determining the user category of the user in the community according to the network structure and the attribute information of the community includes:
determining a structural feature vector of the community according to the network structure of the community, determining an attribute feature vector of the community according to attribute information of the community, inputting the structural feature vector and the attribute feature vector into a pre-trained classification model, and determining the user category of users in the community according to the output of the classification model.
According to a second aspect of the embodiments of the present invention, there is provided an apparatus for classifying users, including:
the network construction module is used for constructing a user relationship network according to the behavior data and the attribute data of each user;
the community dividing module is used for dividing all users into a plurality of communities according to the network structure of the user relationship network and the attribute information of each node in the user relationship network;
and the classification identification module is used for determining the attribute information of the community according to the attribute information of each node in the community and determining the user category of the user in the community according to the network structure and the attribute information of the community.
Optionally, the network building module builds a user relationship network according to the behavior data and the attribute data of each user, including:
determining an edge weight relationship between any two users according to behavior data of the any two users, and constructing a basic relationship network by taking each user as a node and taking the edge weight relationship between the users as an edge;
and determining the attribute feature vector of each user according to the attribute data of each user, and taking the attribute feature vector of each user as the attribute information of the corresponding node in the basic relationship network to obtain the user relationship network.
Optionally, the determining, by the network building module, an edge weight relationship between any two users according to behavior data of the any two users includes:
determining whether medium association exists between any two users according to the behavior data of any two users; if not, setting the edge weight relation between any two users as a first weight; and if so, increasing the first weight.
Optionally, the media association comprises at least one of:
the same behavior object exists between any two users; any two users have the same IP; any two users have the same contact information; any two users have the same address.
Optionally, the community dividing module divides all users into a plurality of communities according to the network structure of the user relationship network and the attribute information of each node in the user relationship network, and includes:
each node in the user relationship network is respectively used as a community; for any community, the following steps are performed:
determining the modularity of any community according to the network structure of any community, determining the attribute similarity of any community according to the attribute information of any community, and determining the community consistency of any community according to the modularity and the attribute similarity;
determining a combined module degree of any community after being respectively combined with each neighbor community according to a network structure of the user relationship network, determining a combined attribute similarity of any community after being respectively combined with each neighbor community according to attribute information of any community and each neighbor community, and determining a combined community consistency of any community after being respectively combined with each neighbor community according to the combined module degree and the combined attribute similarity;
and determining a community consistency increment according to the community consistency and the merged community consistency, judging whether the community consistency which is greater than 0 exists, and if so, merging any community into a neighbor community corresponding to the maximum community consistency increment.
Optionally, the determining, by the community dividing module, the community consistency of any community according to the modularity and the attribute consistency includes:
and carrying out weighted summation on the modularity and the attribute consistency, and taking the result obtained by weighted summation as the community consistency of any community.
Optionally, the determining, by the classification identification module, the user category of the user in the community according to the network structure and the attribute information of the community includes:
determining a structural feature vector of the community according to the network structure of the community, determining an attribute feature vector of the community according to attribute information of the community, inputting the structural feature vector and the attribute feature vector into a pre-trained classification model, and determining the user category of users in the community according to the output of the classification model.
According to a third aspect of the embodiments of the present invention, there is provided an electronic device for user classification, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method provided by the first aspect of the embodiments of the present invention.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the method provided by the first aspect of embodiments of the present invention.
One embodiment of the above invention has the following advantages or benefits: by constructing the user relationship network according to the behavior data and the attribute data of the users, the possibility that the users with weak association relation are divided into the same community can be reduced, the purity and the precision of community division are improved, and the accuracy of user classification and identification is improved.
Further effects of the above-mentioned non-conventional alternatives will be described below in connection with the embodiments.
Drawings
The drawings are included to provide a better understanding of the invention and are not to be construed as unduly limiting the invention. Wherein:
FIG. 1 is a schematic diagram of the main flow of a method of user classification of an embodiment of the present invention;
FIG. 2 is a schematic diagram of the main flow of a method of user classification in an alternative embodiment of the invention;
FIG. 3 is a schematic diagram of the main modules of an apparatus for user classification according to an embodiment of the present invention;
FIG. 4 is an exemplary system architecture diagram in which embodiments of the present invention may be employed;
fig. 5 is a schematic block diagram of a computer system suitable for use in implementing a terminal device or server of an embodiment of the invention.
Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
According to an aspect of an embodiment of the present invention, a method for user classification is provided.
Fig. 1 is a schematic diagram of a main flow of a user classification method according to an embodiment of the present invention, and as shown in fig. 1, the user classification method includes: step S101, step S102, and step S103.
And step S101, constructing a user relationship network according to the behavior data and the attribute data of each user.
User behavior data refers to data relating to user behavior. Taking the e-commerce field as an example, the user behavior can be the behaviors of browsing, purchasing, collecting, clicking, paying attention to and the like of the user on the e-commerce platform. Taking the bank field as an example, the user behavior can be the behaviors of purchasing or redeeming financing, paying electric charges, transacting new business and the like in the bank. The attribute data refers to data reflecting the attribute of the user. The attribute data of the user can comprise basic attribute characteristics of the user such as gender, name, age, preference, academic calendar, native place, mobile phone number and the like, historical preference attribute characteristics of the user such as the longest historical purchase category and the most common delivery address and the like, and the number of accounts using the same IP or mobile phone number and other media to generate behaviors.
A relational network is a data structure composed of nodes and edges, wherein each node in the network represents a user, and the edges between two nodes represent the association relationship between the users. In the actual application process, in order to reduce the calculation amount of network construction and improve the timeliness of the constructed relationship network, behavior data and attribute data of a user in a preset time period, such as behavior data and attribute data in one day, can be collected.
In some optional embodiments, a comprehensive feature vector containing the behavior features and the attribute features of the user may be determined according to the behavior data and the attribute data of the user, and a user relationship network may be constructed according to the comprehensive feature vector of each user. In further alternative embodiments, constructing the user relationship network from the behavioral data and attribute data of each user comprises: determining an edge weight relationship between any two users according to behavior data of the any two users, and constructing a basic relationship network by taking each user as a node and taking the edge weight relationship between the users as an edge; and determining the attribute feature vector of each user according to the attribute data of each user, and taking the attribute feature vector of each user as the attribute information of the corresponding node in the basic relationship network to obtain the user relationship network. According to the embodiment, the calculation amount of the subsequent steps can be reduced on the basis of ensuring the subsequent community division precision, and the user classification efficiency is improved.
Optionally, determining the edge weight relationship between any two users according to the behavior data of the any two users includes: determining whether medium association exists between any two users according to the behavior data of the any two users; if not, setting the edge weight relation between any two users as a first weight; if yes, the first weight is increased. Media refers to objects that cause an association between two users. Optionally, the media association comprises at least one of: the same behavior object or IP exists between any two users, any two users have the same contact information, and any two users have the same address.
Illustratively, the edge weight between two users is 0 by default, i.e. the first weight is 0. Determining the edge weight A epsilon R between users according to the following association relationn*n: when users i and j browse or click or purchase the same commodity, A (i, j) is A (i, j) + 1; if users i and j use the same IP, a (i, j) ═ a (i, j) + 1; if users i and j use the same mobile phone number, a (i, j) ═ a (i, j) + 1; when the users i and j receive goods using the same address, a (i, j) ═ a (i, j) + 1. The larger A (i, j) is, the more common behaviors between users i and j are shown, and the closer the relationship between users i and j is.
When determining the attribute feature vector of each user according to the attribute data of the user, each attribute of the user can be converted into a numerical value to obtain the attribute vector of the user.
Step S102, dividing all users into a plurality of communities according to the network structure of the user relationship network and the attribute information of each node in the user relationship network.
In the user relationship network, users correspond to nodes in a network graph, and the association relationship among the users forms edges between the nodes. The connection between users has relevance with different strengths, the part with tight connection can be regarded as a community, the nodes in the community have tight connection, and the relative connection between two communities is sparse, which is called community division. In this step, users with similar behavior characteristics and attribute characteristics are divided into the same communities.
Given a user relationship network G ═ { V, a, F }, where V ═ V }1,v2,v3...,vnIs a user node set, n is the number of nodes, A belongs to Rn*nFor the inter-user edge weight matrix, F ∈ Rl*nEach node is provided with characteristic attribute information of dimension l as node attribute information. The community division target is to find a division C ═ C1,c2,c3...cX}。ciE {1, 2,. X } represents the community to which the node i belongs, and X is the number of communities. The node division C satisfies that V is equal to Ui=1,2...Xset(vc=i) That is, the union of all communities containing nodes is equal to the full graph node set, and any subdivision intersection is satisfied to be empty (i.e., the nodes between communities do not overlap). The consistency of the structure information and the attribute information is maximized through graph node division, namely, the user adjacency relation in the same community is close, the attribute similarity is high, the connectivity between different community intervals is weak, and the attribute information difference between the nodes is large.
The structural consistency describes the similarity of the behavior patterns of the users in the same community, and the structural consistency of the community can be described by adopting modularity or betweenness. The modularity is adopted to measure the structural consistency of community division, and the definition is as follows:
where m is the sum of the edge weights, kiIs the sum of the edge weights connected to node i, AijIs the edge weight between nodes i, j, ciIndicating the community to which node i belongs. Delta (c)i,cj) The function characteristic ensures that the modularity value is only available when the i, j nodes belong to the same community, so the modularity calculation can be focused only in the communityBased on this modularity, the following derivation can be made:
wherein, Σ in represents the sum of the edge weights among all nodes in the community c, and represents the compactness of the structure in the community; Σ tot is the sum of two partial values, one of which is Σ in, and the other of which is the sum of edge weights connected to nodes within the community c.
If the result is positive, the degree of tightness of the nodes in the community is more tight than that of the whole graph.
The physical meaning of the modularity is the difference value of the closeness degree of the edge relations between nodes in the community and nodes outside the community, and the community division with high structural consistency can be obtained by optimizing the global modularity objective function. The value range of Q is [ -1, 1), and the larger the Q value is, the higher the consistency of the community partition structure is.
The attribute consistency describes the similarity of attribute information of users in the same community. Nodes within the same community tend to have similar attributes, such as a shipping home, class preferences, and the like. Let F ∈ Rl*nThe attribute set of the graph G nodes is n characteristic vectors with dimension l. The attribute consistency of community partitioning can be defined as:
wherein n is the number of nodes in the relational network graph, d (F) describes the overall inconsistency of the node attribute set F, and can be characterized by calculating the distance (such as euclidean distance, cosine distance, hamming distance, etc.) between each feature vector of dimension l and the central point g of the node attribute set F;whereinDescribing attribute information of each node and attribute information f of node i in relational network diagramiThe sum of the distances of (a) and (b). I fi-fj‖2Is the distance of the attribute information between the nodes i, j.
For the same reason, the reason is that delta (c)i,cj) The homogeneity S may only concern the computation within the community, and therefore may be derived as follows:
wherein, Sigma DoutIs the sum of Euclidean distances between the node in the community c and the central point g of the node attribute set F, sigma DinThe sum of Euclidean distances between every two nodes in the community c. The physical meaning of homogeneity is to measure the attribute distance between the node attribute in the community and the node outside the community and the sum of the distance between every two nodes in the community, and the value range is [ -1, 1). Communities with small internal attributes and large external attribute differences can be obtained by maximizing S.
In community division, Q and S maximization, that is, community consistency maximization is sought. The community consistency is an index value positively correlated with the modularity and the attribute consistency, and the mode of determining the community consistency according to the modularity and the attribute consistency can be selectively set according to the actual situation, for example, the community consistency is obtained by adding the sum of the modularity and the attribute consistency and a preset value. Optionally, the module degree and the attribute consistency degree are subjected to weighted summation, and a result obtained by the weighted summation is used as the community consistency degree of the community.
Illustratively, given a set of graph vertices V ═ V1,v2,v3...,vnSolving the community division C ═ C1,c2,c3...cnDenotes a community to which the node i belongs such that the community division consistency E is maximized, i.e., Maximize E (c) ═ 1- α q (c) + αs (c) st.ciE.g. {0, 1, 2,. X }. Where α adjusts the weight of structure-to-attribute correspondence.
The consistency maximum solution is a typical discrete combinatorial optimization problem, the solution of which is an NP-HARD problem. NP (non-deterministic polynomial) refers to a non-deterministic polynomial, and NP-hard refers to a problem to which all NP problems can be reduced within the polynomial time complexity. The violent exhaustion is exponential complexity, so an approximate solving scheme can be adopted in the actual solving process. Optionally, the objective function is solved by using a greedy local optimization idea, an algorithm defaults to use each node as a community, and then the following steps are iteratively executed:
the method comprises the following steps: for any community: determining the modularity of any community according to the network structure of any community, determining the attribute similarity of any community according to the attribute information of any community, and determining the community consistency of any community according to the modularity and the attribute similarity; determining the combined modularity of any community after being respectively combined with each neighbor community according to the network structure of the user relationship network, determining the combined attribute similarity of any community after being respectively combined with each neighbor community according to the attribute information of any community and each neighbor community, and determining the combined community consistency of any community after being respectively combined with each neighbor community according to the combined modularity and the combined attribute similarity; determining community consistency increment according to the community consistency and the merged community consistency;
step two: and judging whether the community consistency degree larger than 0 exists, if so, merging any community into the neighbor community corresponding to the maximum community consistency degree increment. This step is intended to add nodes within a community to the neighbor community that can maximize community consistency e (c).
In this embodiment, each community (including a new community formed after merging) is iteratively traversed until convergence, and a convergence condition may be selectively set, for example, the iteration number is greater than a set number threshold, or until nodes included in all communities do not change. The weight of the edge between the new community and the other communities obtained by each combination is the sum of the weights of the edges between all the original nodes in the two communities before combination and the other communities. The attribute information of the new community formed after the nodes are merged may be an average value of the attribute information of each node in the community, or a maximum value, a median value, an expected value, and the like of the attribute information of each node in the community. The embodiment of the present invention does not specifically limit the determination method of the attribute information of the new community after the merging.
Illustratively, traverse node viE.g. V, calculating node ViDistributing to the community consistency E (c) of the neighbor node, calculating the consistency change delta E before and after distribution, recording the neighbor node with the maximum delta E, and if max delta E is more than 0, then distributing the node viIs distributed to the community c where the neighbor node with the maximum Delta E is positionedmaxEOtherwise, the value remains unchanged. General community cmaxEAll nodes in the community are compressed into a new node, the updated node attribute is the average value of the attributes of all nodes in the community, and the updated node weight is the sum of the edge weights among all nodes in the community. And repeating the steps to carry out loop iteration until the nodes contained in all communities do not change any more, and at the moment, the iteration is converged, and the consistency of the whole relation network does not change any more.
According to the embodiment of the invention, the objective function is solved by adopting the greedy local optimal idea, so that the calculation amount can be greatly simplified, the community division efficiency is improved, and the user classification efficiency is further improved.
Step S103, determining the attribute information of the community according to the attribute information of each node in the community, and determining the user category of the user in the community according to the network structure and the attribute information of the community.
The manner of determining the user category of the user in the community according to the network structure and the attribute information of the community can be selectively set, for example, by human judgment. Optionally, determining the user category of the user in the community according to the network structure and the attribute information of the community includes: determining a structure characteristic vector of the community according to a network structure of the community, determining an attribute characteristic vector of the community according to attribute information of the community, inputting the structure characteristic vector and the attribute characteristic vector into a pre-trained classification model, and determining a user category of a user in the community according to output of the classification model. The network structure of the classification model can be selected according to actual conditions, such as the network structure of unsupervised learning and supervised learning. Optionally, recognition of user classes is accomplished using a supervised machine learning XGB model.
Illustratively, community feature processing is performed first. Community features mainly include two aspects: structural features and attribute features. The structural features are defined as: the number of nodes in the community, the sum of weights in the community, the average value of the node degrees in the community, and the triangular count in the community (the number of triangles formed between each node in the community). The attribute characteristics are defined as: the average or maximum of the node attributes within the community. Categories are then booked for those user categories. Finishing feature processing and community labeling to form a sample set, and carrying out classification training on community features by adopting a supervised model XGB (extreme gradient boosting) classifier to form a classification model so as to finish community division. And a supervised classifier is adopted to replace an artificial classification rule, so that the accuracy of user classification can be improved.
Fig. 2 is a schematic diagram of the main flow of a method of user classification in an alternative embodiment of the invention. As shown in fig. 2, the method for classifying users mainly includes several parts, namely, graph structure initialization, community division, and classification identification. The process of graph structure initialization is the process of building a user relationship network. The community division mainly comprises the following parts: (1) and (3) node movement, wherein the part is mainly to calculate the community consistency before and after merging of one community and each neighbor community so as to determine the community consistency increment before and after merging. The relevant content is not described herein again with reference to the description in step S102. (2) And (4) merging the communities, which is mainly used for judging whether a community consistency increment larger than 0 exists, and merging the community traversed in the step one with the neighbor community corresponding to the maximum community consistency increment when the community consistency increment exists, wherein related contents refer to the description in the step S102, and are not described again here. (3) And (3) convergence of community consistency, namely judging whether each community and the nodes included in the community are changed before and after the step (1) and the step (2) are executed again. If not, outputting the result of community division and entering a classification identification part; if the community consistency convergence is changed, continuing to iteratively execute the steps (1) - (2) and judging the community consistency convergence after the step (2). In the classification identification section, a user category of a user in a certain community is identified, or a user category in each community is identified. Of course, communities belonging to a specific user category among the divided communities may be filtered.
In the practical application process, the user relationship network can be constructed only according to the behavior data or the attribute data of the user, but the user relationship network constructed in the way has the problem of poor accuracy. Taking the example of constructing a user relationship network according to behavior data of users alone in the e-commerce field, nodes in the network have rich attribute information, such as age, gender, activity degree, activity time, category preference and the like of the users. When a user relationship network is established independently according to behavior data of a user, the attribute information is not effectively applied, and the problem of insufficient identification accuracy and recall rate is caused, normal users are likely to generate association relations due to behaviors that abnormal users browse the same page, purchase the same commodity and the like, only the user behavior data is used for establishing the network, and only structural information based on a network graph is used as the basis of community division. According to the embodiment of the invention, the user relationship network is constructed according to the behavior data and the attribute data of the user, so that each node in the user relationship network carries the node attribute information, the possibility of dividing the users with weak association relationship into the same community in the subsequent step can be reduced, the community division purity and precision are improved, and the accuracy of user classification and identification is improved.
The method provided by the embodiment of the invention can be used for discovering the abnormally aggregated communities, controlling the loss of the abnormally aggregated communities to the e-commerce platform, and screening the communities of the specific user categories and knowing the distribution conditions of the user categories and other scenes.
According to a second aspect of the embodiments of the present invention, there is provided an apparatus for implementing the above method.
Fig. 3 is a schematic diagram of main modules of an apparatus for classifying users according to an embodiment of the present invention, and as shown in fig. 3, the apparatus 300 for classifying users includes:
the network construction module 301 constructs a user relationship network according to the behavior data and the attribute data of each user;
the community dividing module 302 is configured to divide all users into multiple communities according to the network structure of the user relationship network and the attribute information of each node in the user relationship network;
the classification identification module 303 determines attribute information of the community according to the attribute information of each node in the community, and determines a user category of a user in the community according to the network structure and the attribute information of the community.
Optionally, the network building module builds a user relationship network according to the behavior data and the attribute data of each user, including:
determining an edge weight relationship between any two users according to behavior data of the any two users, and constructing a basic relationship network by taking each user as a node and taking the edge weight relationship between the users as an edge;
and determining the attribute feature vector of each user according to the attribute data of each user, and taking the attribute feature vector of each user as the attribute information of the corresponding node in the basic relationship network to obtain the user relationship network.
Optionally, the determining, by the network building module, an edge weight relationship between any two users according to behavior data of the any two users includes:
determining whether medium association exists between any two users according to the behavior data of any two users; if not, setting the edge weight relation between any two users as a first weight; and if so, increasing the first weight.
Optionally, the media association comprises at least one of:
the same behavior object exists between any two users; any two users have the same IP; any two users have the same contact information; any two users have the same address.
Optionally, the community dividing module divides all users into a plurality of communities according to the network structure of the user relationship network and the attribute information of each node in the user relationship network, and includes:
each node in the user relationship network is respectively used as a community; for any community, the following steps are performed:
determining the modularity of any community according to the network structure of any community, determining the attribute similarity of any community according to the attribute information of any community, and determining the community consistency of any community according to the modularity and the attribute similarity;
determining a combined module degree of any community after being respectively combined with each neighbor community according to a network structure of the user relationship network, determining a combined attribute similarity of any community after being respectively combined with each neighbor community according to attribute information of any community and each neighbor community, and determining a combined community consistency of any community after being respectively combined with each neighbor community according to the combined module degree and the combined attribute similarity;
and determining a community consistency increment according to the community consistency and the merged community consistency, judging whether the community consistency which is greater than 0 exists, and if so, merging any community into a neighbor community corresponding to the maximum community consistency increment.
Optionally, the determining, by the community dividing module, the community consistency of any community according to the modularity and the attribute consistency includes:
and carrying out weighted summation on the modularity and the attribute consistency, and taking the result obtained by the weighted summation as the community consistency of any community.
Optionally, the determining, by the classification identification module, the user category of the user in the community according to the network structure and the attribute information of the community includes:
determining a structural feature vector of the community according to the network structure of the community, determining an attribute feature vector of the community according to attribute information of the community, inputting the structural feature vector and the attribute feature vector into a pre-trained classification model, and determining the user category of users in the community according to the output of the classification model.
According to a third aspect of the embodiments of the present invention, there is provided an electronic device for user classification, including:
one or more processors;
a storage device for storing one or more programs,
when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method provided by the first aspect of the embodiments of the present invention.
According to a fourth aspect of embodiments of the present invention, there is provided a computer readable medium, on which a computer program is stored, which when executed by a processor, implements the method provided by the first aspect of embodiments of the present invention.
Fig. 4 shows an exemplary system architecture 400 of a user classification method or apparatus to which embodiments of the invention may be applied.
As shown in fig. 4, the system architecture 400 may include terminal devices 401, 402, 403, a network 404, and a server 405. The network 404 serves as a medium for providing communication links between the terminal devices 401, 402, 403 and the server 405. Network 404 may include various types of connections, such as wire, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 401, 402, 403 to interact with a server 405 over a network 404 to receive or send messages or the like. The terminal devices 401, 402, 403 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 401, 402, 403 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 405 may be a server providing various services, such as a background management server (for example only) providing support for shopping websites browsed by users using the terminal devices 401, 402, 403. The backend management server may analyze and perform other processing on the received data such as the product information query request, and feed back a processing result (for example, target push information, product information — just an example) to the terminal device.
It should be noted that the method for user classification provided by the embodiment of the present invention is generally executed by the server 405, and accordingly, the device for user classification is generally disposed in the server 405.
It should be understood that the number of terminal devices, networks, and servers in fig. 4 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The driver 510 is also connected to the I/O interface 505 as necessary. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as necessary, so that a computer program read out therefrom is mounted into the storage section 508 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the system of the present invention when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor comprising: the system comprises a network building module, a community dividing module and a classification and identification module. The names of these modules do not in some cases constitute a limitation on the modules themselves, and for example, the network building module may also be described as a "module that divides all users into multiple communities".
As another aspect, the present invention also provides a computer-readable medium that may be contained in the apparatus described in the above embodiments; or may be separate and not incorporated into the device. The computer readable medium carries one or more programs which, when executed by a device, cause the device to comprise: constructing a user relationship network according to the behavior data and the attribute data of each user; dividing all users into a plurality of communities according to the network structure of the user relationship network and the attribute information of each node in the user relationship network; determining the attribute information of the community according to the attribute information of each node in the community, and determining the user category of the user in the community according to the network structure and the attribute information of the community.
According to the technical scheme of the embodiment of the invention, the user relationship network is constructed according to the behavior data and the attribute data of the users, so that the possibility that the users with weak association relation are divided into the same community can be reduced, the purity and the precision of community division are improved, and the accuracy of user classification and identification is improved.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (10)
1. A method of user classification, comprising:
constructing a user relationship network according to the behavior data and the attribute data of each user;
dividing all users into a plurality of communities according to the network structure of the user relationship network and the attribute information of each node in the user relationship network;
determining the attribute information of the community according to the attribute information of each node in the community, and determining the user category of the user in the community according to the network structure and the attribute information of the community.
2. The method of claim 1, wherein constructing a user relationship network from the behavioral data and attribute data for each user comprises:
determining an edge weight relationship between any two users according to behavior data of the any two users, and constructing a basic relationship network by taking each user as a node and taking the edge weight relationship between the users as an edge;
and determining the attribute feature vector of each user according to the attribute data of each user, and taking the attribute feature vector of each user as the attribute information of the corresponding node in the basic relationship network to obtain the user relationship network.
3. The method of claim 2, wherein determining an edge weight relationship between any two users from behavioral data of the any two users comprises:
determining whether medium association exists between any two users according to the behavior data of any two users; if not, setting the edge weight relation between any two users as a first weight; and if so, increasing the first weight.
4. The method of claim 3, wherein the media association comprises at least one of:
the same behavior object exists between any two users; any two users have the same IP; any two users have the same contact information; any two users have the same address.
5. The method of claim 1, wherein dividing all users into a plurality of communities according to a network structure of the user relationship network and attribute information of each node in the user relationship network comprises:
each node in the user relationship network is respectively used as a community; for any community, the following steps are performed:
determining the modularity of any community according to the network structure of any community, determining the attribute similarity of any community according to the attribute information of any community, and determining the community consistency of any community according to the modularity and the attribute similarity;
determining a combined module degree of any community after being respectively combined with each neighbor community according to a network structure of the user relationship network, determining a combined attribute similarity of any community after being respectively combined with each neighbor community according to attribute information of any community and each neighbor community, and determining a combined community consistency of any community after being respectively combined with each neighbor community according to the combined module degree and the combined attribute similarity;
and determining a community consistency increment according to the community consistency and the merged community consistency, judging whether the community consistency which is greater than 0 exists, and if so, merging any community into a neighbor community corresponding to the maximum community consistency increment.
6. The method of claim 5, wherein determining the community compliance for any community based on the modularity and the attribute compliance comprises:
and carrying out weighted summation on the modularity and the attribute consistency, and taking the result obtained by weighted summation as the community consistency of any community.
7. The method of claim 1, wherein determining the user category of the user within the community based on the network structure and the attribute information of the community comprises:
determining a structural feature vector of the community according to the network structure of the community, determining an attribute feature vector of the community according to attribute information of the community, inputting the structural feature vector and the attribute feature vector into a pre-trained classification model, and determining the user category of users in the community according to the output of the classification model.
8. An apparatus for user classification, comprising:
the network construction module is used for constructing a user relationship network according to the behavior data and the attribute data of each user;
the community dividing module is used for dividing all users into a plurality of communities according to the network structure of the user relationship network and the attribute information of each node in the user relationship network;
and the classification identification module is used for determining the attribute information of the community according to the attribute information of each node in the community and determining the user category of the user in the community according to the network structure and the attribute information of the community.
9. An electronic device for user classification, comprising:
one or more processors;
a storage device for storing one or more programs,
when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-7.
10. A computer-readable medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110383817.1A CN113111133A (en) | 2021-04-09 | 2021-04-09 | User classification method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110383817.1A CN113111133A (en) | 2021-04-09 | 2021-04-09 | User classification method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113111133A true CN113111133A (en) | 2021-07-13 |
Family
ID=76715344
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110383817.1A Pending CN113111133A (en) | 2021-04-09 | 2021-04-09 | User classification method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113111133A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113554308A (en) * | 2021-07-23 | 2021-10-26 | 中信银行股份有限公司 | User community division and risk user identification method and device and electronic equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007148661A (en) * | 2005-11-25 | 2007-06-14 | Ntt Docomo Inc | Community control system and community control method |
US20080229222A1 (en) * | 2007-03-16 | 2008-09-18 | Sony Computer Entertainment Inc. | User interface for processing data by utilizing attribute information on data |
US20190179615A1 (en) * | 2016-10-27 | 2019-06-13 | Tencent Technology (Shenzhen) Company Limited | Community discovery method, device, server and computer storage medium |
CN111932386A (en) * | 2020-09-09 | 2020-11-13 | 腾讯科技(深圳)有限公司 | User account determining method and device, information pushing method and device, and electronic equipment |
CN112101390A (en) * | 2019-05-29 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Attribute information determination method, attribute information determination device and electronic equipment |
-
2021
- 2021-04-09 CN CN202110383817.1A patent/CN113111133A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2007148661A (en) * | 2005-11-25 | 2007-06-14 | Ntt Docomo Inc | Community control system and community control method |
US20080229222A1 (en) * | 2007-03-16 | 2008-09-18 | Sony Computer Entertainment Inc. | User interface for processing data by utilizing attribute information on data |
US20190179615A1 (en) * | 2016-10-27 | 2019-06-13 | Tencent Technology (Shenzhen) Company Limited | Community discovery method, device, server and computer storage medium |
CN112101390A (en) * | 2019-05-29 | 2020-12-18 | 腾讯科技(深圳)有限公司 | Attribute information determination method, attribute information determination device and electronic equipment |
CN111932386A (en) * | 2020-09-09 | 2020-11-13 | 腾讯科技(深圳)有限公司 | User account determining method and device, information pushing method and device, and electronic equipment |
Non-Patent Citations (1)
Title |
---|
李张涛: "基于进化算法的属性网络社区检测及应用", 《中国知网硕士学位论文电子期刊》, vol. 2018, no. 4, 15 April 2018 (2018-04-15), pages 3 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113554308A (en) * | 2021-07-23 | 2021-10-26 | 中信银行股份有限公司 | User community division and risk user identification method and device and electronic equipment |
CN113554308B (en) * | 2021-07-23 | 2024-05-28 | 中信银行股份有限公司 | User community division and risk user identification method and device and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022237667A1 (en) | Method and apparatus for determining order fulfillment warehouse | |
US20180330192A1 (en) | Load-Balancing Training of Recommender System for Heterogeneous Systems | |
US20160350376A1 (en) | Estimating the cost of data-mining services | |
Lu et al. | Show me the money: Dynamic recommendations for revenue maximization | |
CN104077723B (en) | A kind of social networks commending system and method | |
CN111044062B (en) | Path planning and recommending method and device | |
Tang et al. | A factorization machine-based QoS prediction approach for mobile service selection | |
CN113450167A (en) | Commodity recommendation method and device | |
US11693579B2 (en) | Value-based replication of streaming data | |
CN115423555A (en) | Commodity recommendation method and device, electronic equipment and storage medium | |
CN113761350A (en) | Data recommendation method, related device and data recommendation system | |
CN113760521B (en) | Virtual resource allocation method and device | |
CN113111133A (en) | User classification method and device | |
Zhang et al. | Personalized quality centric service recommendation | |
US11113741B2 (en) | Arranging content on a user interface of a computing device | |
US11182833B2 (en) | Estimating annual cost reduction when pricing information technology (IT) service deals | |
Xu et al. | A collaborative framework of web service recommendation with clustering-extended matrix factorisation | |
CN116777205A (en) | Supply chain risk pre-judging method, electronic equipment and storage medium | |
US10185980B1 (en) | Efficiently computing a feature based on a plurality of variables | |
CN116089367A (en) | Dynamic barrel dividing method, device, electronic equipment and medium | |
US20220391746A1 (en) | Api optimizer using contextual analysis of interface data exchange | |
US11836656B2 (en) | Cognitive enabled blockchain based resource prediction | |
US20220207001A1 (en) | Cross-domain structural mapping in machine learning processing | |
CN114298203B (en) | Method, apparatus, device and computer readable medium for data classification | |
Ling et al. | TMP: Meta-path based recommendation on time-weighted heterogeneous information networks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination |