CN115099875A - Data classification method based on decision tree model and related equipment - Google Patents
- Publication number
- CN115099875A (application number CN202210836961.0A)
- Authority
- CN
- China
- Prior art keywords
- attribute
- data set
- data
- decision tree
- condition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/02—Marketing; Price estimation or determination; Fundraising
- G06Q30/0201—Market modelling; Market analysis; Collecting market data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2465—Query processing support for facilitating data mining operations in structured databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2471—Distributed queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/08—Insurance
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Data Mining & Analysis (AREA)
- Development Economics (AREA)
- Strategic Management (AREA)
- General Engineering & Computer Science (AREA)
- Marketing (AREA)
- Economics (AREA)
- General Business, Economics & Management (AREA)
- Entrepreneurship & Innovation (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Computational Linguistics (AREA)
- Technology Law (AREA)
- Game Theory and Decision Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the application belong to the technical field of artificial intelligence and relate to a data classification method based on a decision tree model and related equipment. The data classification method comprises: extracting service features from acquired historical service data to generate a data set, and extracting a subset from the data set as a first training data set; determining an index attribute and a condition attribute in the first training data set, and determining a first total information entropy of the first training data set based on the index attribute; generating nodes according to the first total information entropy and the condition attributes, and generating a decision tree model based on the nodes; verifying the decision tree model through a verification data set to obtain a verification result and, once the verification result meets a preset condition, outputting the final decision tree model as a classification prediction model; and inputting target service data into the classification prediction model to obtain a classification result. In addition, the application also relates to blockchain technology, and the service features can be stored in a blockchain. The method and the device can improve the classification efficiency and the classification accuracy of business data.
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a data classification method based on a decision tree model and related devices.
Background
With the rapid development of information technology, purchasing insurance on the internet has become widespread. An insurance company provides an internet platform that is open to external third-party partners, and these partners can query and purchase insurance products on the company's internet interface platform.
An insurance company usually conducts business docking with many partners, and a large amount of interface access data is generated and stored during this process. The data query and report statistics functions of a traditional database cannot identify important clients efficiently, quickly, and accurately, cannot discover the relations and rules existing in the data, and cannot make predictions based on the existing data.
Disclosure of Invention
The embodiment of the application aims to provide a data classification method based on a decision tree model and related equipment, so as to solve the technical problems that important clients cannot be identified efficiently, quickly and accurately, relationships and rules in data cannot be found, and expected speculation cannot be performed according to the existing data in the related technology.
In order to solve the above technical problem, an embodiment of the present application provides a data classification method based on a decision tree model, which adopts the following technical solutions:
acquiring historical service data, extracting service features from the historical service data, generating a data set according to the service features, and extracting a subset from the data set to serve as a first training data set;
determining an index attribute and a condition attribute in the first training data set, and determining a first total information entropy of the first training data set based on the index attribute;
generating a node according to the first total information entropy and the condition attribute, and generating a decision tree model based on the node;
acquiring a verification data set from the data set, and verifying the decision tree model through the verification data set to obtain a verification result;
determining whether the verification result meets a preset condition, if not, updating the decision tree model until the verification result meets the preset condition, and outputting a final decision tree model as a classification prediction model;
and acquiring target service data, and inputting the target service data into the classification prediction model to obtain a classification result.
Further, the step of determining a first total entropy of the first training data set based on the indicator attribute comprises:
determining a probability of each index feature in the index attributes in the first training data set;
and calculating to obtain a first total information entropy of the first training data set based on the probability.
Further, the step of generating a node according to the first total information entropy and the condition attribute includes:
step A, calculating to obtain the information gain of each condition attribute according to the first total information entropy and the attribute data of the condition attribute;
b, obtaining an optimized weight of the condition attribute, and optimizing corresponding information gain through the optimized weight to obtain optimized information gain;
step C, determining an optimal condition attribute based on the optimization information gain, and taking the optimal condition attribute as a node;
step D, forming a second training data set by the condition attributes and the index attributes outside the nodes, and calculating a second total information entropy of the second training data set according to the index attributes;
and step E, repeating steps A to D until nodes have been generated for all the condition attributes.
Further, the step of calculating the information gain of each condition attribute according to the first total information entropy and the attribute data of the condition attribute comprises:
calculating the attribute information entropy of each attribute feature in each condition attribute according to the attribute data;
calculating a condition information entropy corresponding to the condition attribute based on the attribute information entropy;
and calculating to obtain information gain according to the first total information entropy and the condition information entropy.
Further, before the step of calculating an information gain of each condition attribute according to the first total information entropy and the attribute data of the condition attribute, the method further includes:
determining whether the attribute data has abnormal data;
and if the abnormal data exist, correcting the abnormal data.
Further, before the step of obtaining the optimized weight of the condition attribute, the method further includes:
determining a condition attribute corresponding to the abnormal data, and counting the proportion of the abnormal data in the condition attribute;
and calculating according to the proportion to obtain an adjustment coefficient, and adjusting the optimized weight of the condition attribute according to the adjustment coefficient.
Further, the step of verifying the decision tree model through the verification data set to obtain a verification result includes:
inputting the verification data set into the decision tree model and outputting a prediction result;
and calculating the prediction accuracy according to the prediction result, and taking the prediction accuracy as a verification result.
In order to solve the above technical problem, an embodiment of the present application further provides a data classification device based on a decision tree model, which adopts the following technical solutions:
the extraction module is used for acquiring historical business data, extracting business features from the historical business data, generating a data set according to the business features, and extracting subsets from the data set to serve as a first training data set;
the determining module is used for determining an index attribute and a condition attribute in the first training data set and determining a first total information entropy of the first training data set based on the index attribute;
the generating module is used for generating nodes according to the first total information entropy and the condition attributes and generating a decision tree model based on the nodes;
the verification module is used for acquiring a verification data set from the data set and verifying the decision tree model through the verification data set to obtain a verification result;
and the output module is used for determining whether the verification result meets a preset condition, if not, updating the decision tree model until the verification result meets the preset condition, and outputting the final decision tree model as a classification prediction model.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solutions:
the computer device comprises a memory having computer readable instructions stored therein, and a processor implementing the steps of the decision tree model based data classification method as described above when executing the computer readable instructions.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solutions:
the computer readable storage medium has stored thereon computer readable instructions which, when executed by a processor, implement the steps of the decision tree model based data classification method as described above.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects:
the method comprises the steps of extracting service features from historical service data by acquiring the historical service data, generating a data set according to the service features, and extracting a subset from the data set to serve as a first training data set; determining an index attribute and a condition attribute in the first training data set, and determining a first total information entropy of the first training data set based on the index attribute; generating nodes according to the first total information entropy and the condition attributes, and generating a decision tree model based on the nodes; acquiring a verification data set from the data set, and verifying the decision tree model through the verification data set to obtain a verification result; determining whether the verification result meets a preset condition, if not, updating the decision tree model until the verification result meets the preset condition, and outputting a final decision tree model as a classification prediction model; acquiring target service data, and inputting the target service data into a classification prediction model to obtain a classification result; according to the method and the device, the training data set is extracted through the acquired data set, the nodes are determined according to the total information entropy of the training data set, the decision tree model is generated based on the nodes, the target business data are classified according to the generated decision tree model, the classification efficiency and the classification accuracy of the business data can be improved, important partners are further identified efficiently and accurately, and certain assistance is provided for business personnel to predict partner characteristics through rules in the model.
Drawings
In order to more clearly illustrate the solution of the present application, the drawings needed for describing the embodiments of the present application will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a decision tree model based data classification method according to the present application;
FIG. 3 is a schematic block diagram of an embodiment of a decision tree model-based data classification apparatus according to the present application;
FIG. 4 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof in the description and claims of this application and the description of the figures above, are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein may be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
The application provides a data classification method based on a decision tree model, which relates to artificial intelligence and can be applied to a system architecture 100 shown in fig. 1, wherein the system architecture 100 can comprise terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the data classification method based on the decision tree model provided in the embodiments of the present application is generally executed by a server/terminal device, and accordingly, the data classification apparatus based on the decision tree model is generally disposed in the server/terminal device.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for an implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of a decision tree model based data classification method according to the present application is shown, comprising the steps of:
step S201, obtaining historical service data, extracting service features from the historical service data, generating a data set according to the service features, and extracting a subset from the data set as a first training data set.
The historical service data can be obtained from a corresponding service database, and the service database can be a pre-established database specially storing the service data and can also be a storage database of the insurance system.
In this embodiment, the business data includes the number of times each partner accesses the insurance system, the policy amount, policy information including the policy number, the product type, the amount of the applied insurance, the time of the applied insurance, and the like.
It should be noted that, in the process of acquiring the service data, continuous data are discretized, and missing or abnormal values (no data, or negative numbers) are corrected to 0.
And extracting business characteristics from the business data based on the business rules, wherein the business characteristics comprise a partner, access times, a product type, a policy amount, a total amount of the application of the policy, whether the partner is concerned or not and the like, and forming a policy data set based on the business characteristics.
And a subset is extracted from the policy data set to serve as a first training data set, so that the problem that the data size is too large and convergence is difficult can be avoided. In a specific implementation manner of this embodiment, the first training data set is shown in table 1.
Table 1 first training data set D1
It is emphasized that, to further ensure the privacy and security of the service features, they may also be stored in nodes of a blockchain.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions, used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Step S202, determining an index attribute and a condition attribute in the first training data set, and determining a first total information entropy of the first training data set based on the index attribute.
In this embodiment, the index attribute is a decision attribute, and the decision attribute of the data set is judged according to the condition attributes. For example, the decision attribute may be whether to play basketball, judged according to condition attributes such as weather, temperature, humidity, and wind; or the decision attribute may be whether a snack vendor's income on a certain day is good, measured according to condition attributes such as the weather, whether students are on holiday, whether security patrols intervene, and whether a promotion is being run.
In this embodiment, the step of determining the first total information entropy of the first training data set based on the indicator attribute comprises:
determining a probability of each index feature in the index attributes in the first training data set;
and calculating to obtain a first total information entropy of the first training data set based on the probability.
Specifically, the first total information entropy is calculated using the following formula:

Ent(D1) = -Σ_{i=1}^{n} p_i log2(p_i)

where the index attribute comprises n index features, and p_i represents the probability that the i-th index feature occurs in all samples of the first training data set.
For example, in the first training data set D1 shown in table 1, if the determined index attribute is whether to pay attention to a partner, this is judged according to the condition attributes, such as product type, number of visits, policy amount, and total amount of insurance applied. Here the index attribute whether to pay attention to the partner includes two index features, yes and no, and the first total information entropy of the first data set is computed from their probabilities using the formula above.
it should be noted that, the degree of disorder of data distribution can be reflected by calculating the information entropy, and the classification of high-dimensional data can be applied according to the degree of disorder of data distribution.
And S203, generating nodes according to the first total information entropy and the condition attributes, and generating a decision tree model based on the nodes.
In this embodiment, the conditional attribute with the largest information gain is selected as the classification attribute by using the information entropy principle, and the branches of the decision tree are recursively expanded to complete the structure of the decision tree.
Determining the node according to the first total information entropy and the condition attribute, wherein the specific process is as follows:
step A, calculating to obtain the information gain of each condition attribute according to the first total information entropy and the attribute data of the condition attribute;
b, obtaining an optimized weight of the condition attribute, and optimizing corresponding information gain through the optimized weight to obtain optimized information gain;
step C, determining an optimal condition attribute based on the optimal information gain, and taking the optimal condition attribute as a node;
step D, forming the condition attributes except the nodes into a second training data set, and calculating a second total information entropy of the second training data set according to the index attributes;
and E, circulating the step A to the step D until all the condition attribute generation nodes.
Wherein, the step of calculating the information gain of each condition attribute according to the first total information entropy and the attribute data of the condition attribute comprises:
calculating the attribute information entropy of each attribute feature in each condition attribute according to the attribute data;
calculating to obtain a condition information entropy corresponding to the condition attribute based on the attribute information entropy;
and calculating to obtain information gain according to the first total information entropy and the condition information entropy.
In this embodiment, each condition attribute includes m attribute features. For example, when the condition attribute is the product type, the attribute features include property insurance, accident insurance, pet insurance, and the like; when the condition attribute is whether the number of visits exceeds 10,000 per day, the attribute features are yes and no.
Determining the attribute probability of each index feature in each attribute feature, calculating the attribute information entropy of the attribute feature based on the attribute probability, calculating the condition information entropy of the corresponding condition attribute according to the attribute information entropy of each attribute feature, and calculating the difference value between the first total information entropy and the condition information entropy to obtain the information gain of the condition attribute.
Specifically, the information gain of a condition attribute adopts the following calculation formula:

Gain(condition attribute) = Ent(D1) - Ent(D1 | condition attribute)

In the formula, Ent(D1 | condition attribute) represents the condition information entropy of the condition attribute, which is calculated as:

Ent(D1 | condition attribute) = Σ_{j=1}^{m} p_j · Ent(D1 | C_j)

where p_j is the probability that the j-th attribute feature C_j occurs in all samples of the condition attribute, and Ent(D1 | C_j) represents the attribute information entropy of C_j:

Ent(D1 | C_j) = -Σ_{i=1}^{n} p_{ij} log2(p_{ij})

where p_{ij} represents the probability that the i-th index feature occurs in the samples having attribute feature C_j.
It should be noted that the larger the information gain is, the higher the uniformity of dividing the sample subset according to the condition attribute is, and the classification is more facilitated.
And a node is selected to generate a decision tree model based on the information gain, so that the classification result of the decision tree has better interpretability, and the accuracy of data classification is improved.
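Steps A to C (attribute information entropy, condition information entropy, information gain) can be sketched in Python roughly as follows; the field names `visits>10k` and `focus` are hypothetical stand-ins for the patent's condition and index attributes.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr, label):
    # Gain(A) = Ent(D1) - Ent(D1|A), where Ent(D1|A) is the weighted sum of
    # the attribute information entropies over each feature value of A.
    n = len(rows)
    cond_ent = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == value]
        cond_ent += len(subset) / n * entropy(subset)
    return entropy([r[label] for r in rows]) - cond_ent

rows = [
    {"visits>10k": "yes", "focus": "yes"},
    {"visits>10k": "yes", "focus": "yes"},
    {"visits>10k": "no",  "focus": "no"},
    {"visits>10k": "no",  "focus": "no"},
]
# This attribute splits the index attribute perfectly, so the gain equals
# the total entropy of the data set:
print(information_gain(rows, "visits>10k", "focus"))  # → 1.0
```

The condition attribute with the largest gain would then be chosen as the node, consistent with the note above that a larger gain means a more uniform division of the sample subsets.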
In some optional implementations, before the step of calculating the information gain of each condition attribute according to the first total information entropy and the attribute data of the condition attribute, the method further includes:
determining whether the attribute data has abnormal data;
and if the abnormal data exists, correcting the abnormal data.
In this embodiment, if there is abnormal data in the training data set, the abnormal data is corrected. For example, if the attribute data of the condition attribute policy amount (>1000/day) contains an abnormal 0 value, the abnormal data is converted into a similar value, that is, the 0 value is converted into "no"; if the attribute data of the condition attribute total amount of insurance applied (ten thousand/day) contains an abnormal 0 value, it is converted into "< 10".
After the correction, the information gain calculation is carried out, so that the interference can be reduced, and the accuracy can be improved.
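A minimal sketch of this correction step, assuming that missing, zero, or negative entries are mapped to a neighbouring category value as in the examples above; the helper name and sample values are illustrative.

```python
def correct_abnormal(values, fallback):
    # Map missing, zero, or negative entries to a neighbouring category
    # value, e.g. a 0 under "policy amount > 1000/day" becomes "no".
    return [fallback if v is None or (isinstance(v, (int, float)) and v <= 0)
            else v for v in values]

print(correct_abnormal(["yes", 0, None, "no"], "no"))
# → ['yes', 'no', 'no', 'no']
```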
In this embodiment, after the information gain of each condition attribute is obtained by calculation, the information gain is weighted and corrected using the preset optimized weight to obtain the optimized information gain.
The optimized weight corresponding to the condition attribute is pre-configured, can be set by service personnel according to importance, can also be obtained by using a pre-trained weight generation model, and is specifically selected according to actual needs without limitation.
In this embodiment, if there is abnormal data in the attribute data and the abnormal data is corrected, and accordingly, the optimized weight of the condition attribute corresponding to the abnormal data needs to be adjusted, the specific steps are as follows:
determining condition attributes corresponding to the abnormal data, and counting the proportion of the abnormal data in the condition attributes;
and calculating according to the proportion to obtain an adjusting coefficient, and adjusting the optimized weight of the condition attribute according to the adjusting coefficient.
Specifically, the adjustment coefficient is calculated from the proportion of abnormal data in the condition attribute, where the numerator of the proportion denotes the amount of abnormal data in the condition attribute and the denominator Σ_{x∈D1} k denotes the number of all samples in the condition attribute. Assuming that the initial optimized weight of the condition attribute A is w_1 and the adjustment coefficient is g_1, the adjusted optimized weight is w_1 × g_1.
In this embodiment, the information gain of each condition attribute is optimized by its optimization weight to obtain an optimized information gain. The optimized information gains of the condition attributes are compared, and the condition attribute with the largest optimized information gain is selected as the optimal condition attribute, that is, the optimal partition feature, which serves as the root node of the decision tree.
It should be noted that selecting nodes by information gain gives the generated decision tree a better classification effect; at the same time, weighting and correcting the information gain further reduces interference and improves classification accuracy.
In this embodiment, after the root node is generated, child nodes and leaf nodes are generated next. A child node is an intermediate node of the decision tree, while a leaf node is a bottommost node and also represents a decision result, that is, an index attribute.
The condition attributes other than the root node, together with the index attribute, form a second training data set, and a second total information entropy of the second training data set is calculated from the index attribute in the same way as the first total information entropy of the first training data set; the details are not repeated here.
Then, the information gain of each condition attribute is calculated from the second total information entropy and the attribute data of the condition attributes in the second training data set; that is, steps A to D are looped in sequence until leaf nodes are generated, after which the decision tree model is generated from all the nodes.
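Steps A through D, looped until only leaves remain, amount to an ID3-style recursion. A minimal self-contained sketch follows; the weighted-gain criterion stands in for the optimization-weight correction described above, and the dict node layout is an assumption for illustration.

```python
from collections import Counter
import math

def entropy(labels):
    """Base-2 information entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, labels):
    """Total entropy minus the conditional entropy of splitting on attr."""
    n = len(labels)
    cond = 0.0
    for v in set(r[attr] for r in rows):
        sub = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

def build_tree(rows, labels, attrs, weights=None):
    """Steps A-E: pick the attribute with the largest (weighted) gain
    as a node, split on it, and recurse until a subset is pure or the
    attributes are exhausted."""
    if len(set(labels)) == 1:
        return labels[0]                              # leaf: pure subset
    if not attrs:
        return Counter(labels).most_common(1)[0][0]   # leaf: majority vote
    w = weights or {a: 1.0 for a in attrs}
    best = max(attrs, key=lambda a: w[a] * info_gain(rows, a, labels))
    node = {"attr": best, "children": {}}
    for v in set(r[best] for r in rows):
        idx = [i for i, r in enumerate(rows) if r[best] == v]
        node["children"][v] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [a for a in attrs if a != best], w)
    return node
```

Pruning and the stopping conditions of a production implementation are omitted for brevity.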
By way of example, the description continues with Table 1 above.
First, the attribute information entropies corresponding to the four condition attributes (product type, number of accesses, policy amount, and total amount of insurance application) are calculated as follows:
then the conditional information entropy of the product type is:
the information gain of the product type is further calculated:
Gain(product type) = Ent(D1) - Ent(D1 | product type) = 0.9710 - 0.8880 = 0.0830;
The information gains corresponding to the number of accesses, the policy amount, and the total amount of insurance application are then calculated in turn according to the same formula:
Gain(number of accesses) = Ent(D1) - Ent(D1 | number of accesses) = 0.9710 - 0.647 = 0.324;
Gain(policy amount) = Ent(D1) - Ent(D1 | policy amount) = 0.9710 - 0.551 = 0.420;
Gain(total amount of insurance application) = Ent(D1) - Ent(D1 | total amount of insurance application) = 0.9710 - 0.608 = 0.363.
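Table 1 itself is not reproduced in this text, but the quoted figures can be checked: a 2:3 label split gives the stated total entropy Ent(D1) = 0.9710, and each gain is that total entropy minus the attribute's conditional entropy. A short check:

```python
import math

def ent(counts):
    """Base-2 information entropy of a class-count distribution."""
    n = sum(counts)
    return -sum(c / n * math.log2(c / n) for c in counts if c)

total_entropy = ent([2, 3])                 # matches the quoted Ent(D1) = 0.9710
gain_product_type = total_entropy - 0.8880  # = 0.0830, as above
```

The 2:3 split is an inference from the quoted value, not taken from Table 1.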
If abnormal data exists in the condition attributes "policy amount" and "total amount of insurance application", the information gains of these two attributes are weighted and corrected.
With the initial value of each condition attribute weight defaulting to 1, the adjustment coefficient of the policy amount and the adjustment coefficient of the total amount of insurance application are computed from their respective abnormal-data proportions; the corrected information gains are:
After the weighting correction, the information gain of the policy amount is the largest, so the policy amount is taken as the first classification feature, that is, as the root node.
The condition attributes other than the policy amount (product type, number of accesses, and total amount of insurance application) and the index attribute (whether to pay attention to the partner) form a second training data set D2. This set comprises two subsets: the subset with "policy amount > 1000/day" equal to "yes" and the subset with "policy amount > 1000/day" equal to "no". Since every sample with "policy amount > 1000/day" equal to "yes" also has "whether to pay attention" equal to "yes", the entropy Ent(D1 | "policy amount > 1000/day" = yes) = 0, and the corresponding attribute data is deleted when forming the second training data set D2; see Table 2.
That is, before a new training data set is formed, the attribute data is processed, and the attribute data with the information gain of 0 is removed, so that the data processing efficiency is improved.
Table 2: Second training data set D2
Calculating a second total information entropy of the second training data set D2:
The attribute information entropy of each condition attribute in the second training data set D2 is calculated in the same manner as above; for the condition attribute "product type":
the conditional information entropy of the product type is:
the information gain of the product type is further calculated:
Gain(D2, product type) = Ent(D2) - Ent(D2 | product type) = 0.9183 - 0.6667 = 0.2516
Entropy of attribute information of conditional attribute "number of accesses":
the conditional information entropy of the number of accesses is:
the information gain of the number of accesses is:
Gain(D2, number of accesses) = Ent(D2) - Ent(D2 | number of accesses) = 0.9183 - ...
The attribute information entropy of the condition attribute "total amount of insurance application" is calculated as follows:
the conditional information entropy of the total amount of the application is as follows:
then the information gain of the total amount of the insurance application is:
Gain(D2, total amount of insurance application) = Ent(D2) - Ent(D2 | total amount of insurance application) = 0.9183 - 0.5394 = 0.3789
Because abnormal data exists in the total amount of insurance application, its information gain is weighted and corrected:
and comparing the information gain of each condition attribute, wherein the information gain of the access times of the condition attributes is the maximum, namely the access times are used as a second classification characteristic, and the access times are intermediate nodes below the root node.
The condition attributes other than the number of accesses (product type and total amount of insurance application) and the index attribute (whether to pay attention to the partner) then form a third training data set, and the above method is looped until all condition attributes have formed nodes; the decision tree model is obtained from the generated nodes.
And S204, acquiring a verification data set from the data set, and verifying the decision tree model through the verification data set to obtain a verification result.
In this embodiment, the subset is extracted from the dataset as a verification dataset, the data of the verification dataset is input into the decision tree model, the prediction result is output, and the prediction accuracy of the prediction result is calculated as the verification result.
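The verification step reduces to scoring the model on the held-out subset. A minimal sketch, where the model is any callable mapping a record to a predicted label (names are illustrative):

```python
def prediction_accuracy(predict, rows, labels):
    """Fraction of validation records the model classifies correctly;
    this accuracy is the verification result described above."""
    correct = sum(1 for row, label in zip(rows, labels)
                  if predict(row) == label)
    return correct / len(labels)
```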
In this embodiment, the accuracy of classification of the decision tree model can be improved by verifying the generated decision tree model.
Step S205, determining whether the verification result meets the preset condition, if not, updating the decision tree model until the verification result meets the preset condition, and outputting the final decision tree model as a classification prediction model.
In this embodiment, the verification result is the prediction accuracy, and the preset condition is that the prediction accuracy is greater than or equal to a preset threshold; the preset condition is not satisfied when the prediction accuracy is less than the preset threshold.
If the preset condition is not met, reestablishing the decision tree model, extracting the first training data set from the data set again, and repeating the steps S202 to S203 to obtain a new decision tree model; and if the preset conditions are met, outputting the current decision tree model as a classification prediction model.
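The retrain-until-accepted loop of steps S202 to S205 can be sketched as below; the threshold, round limit, and sampling fraction are illustrative assumptions, and build_model / validate stand in for the model-building and verification steps.

```python
import random

def train_until_accepted(dataset, build_model, validate,
                         threshold=0.9, max_rounds=10,
                         train_frac=0.7, seed=0):
    """Re-draw a first training data set and rebuild the decision tree
    until the verification result meets the preset condition."""
    rng = random.Random(seed)
    model, result = None, 0.0
    for _ in range(max_rounds):
        train = rng.sample(dataset, int(len(dataset) * train_frac))
        model = build_model(train)           # steps S202-S203
        result = validate(model, dataset)    # step S204
        if result >= threshold:              # step S205
            break
    return model, result
```

The round limit guards against looping forever when the data cannot reach the threshold, a safeguard the text leaves implicit.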
Step S206, target business data is obtained and input into the classification prediction model to obtain a classification result.
In this embodiment, the obtained target service data is input into the classification prediction model to obtain a classification result, and subsequent service decisions are made according to the classification result.
Service personnel can judge the access attributes, behavior information, and the like of a partner according to the classification prediction model, analyze the partner's insurance product types, application proportion, and so on, and identify important partners more accurately. For example, if a new partner A1 accesses the insurance system for several consecutive days with an interface access frequency of 10000 times/day, the model predicts from this access frequency that partner A1 needs attention; that is, partner A1 is treated as a key user, service personnel are reminded to continuously follow partner A1's business situation, and targeted business negotiation and customer service support can be provided.
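Applying the trained model to a new partner is then a walk from the root to a leaf. A sketch, assuming the dict node layout {"attr": ..., "children": {...}} and a hypothetical single-split tree on interface access frequency:

```python
def classify(tree, record):
    """Descend a dict-based decision tree until a leaf (the decision
    result, i.e. the index attribute value) is reached."""
    while isinstance(tree, dict):
        tree = tree["children"][record[tree["attr"]]]
    return tree

# Hypothetical one-node tree: flag partners with very high access frequency.
tree = {"attr": "accesses_gt_5000_per_day",
        "children": {"yes": "key partner", "no": "ordinary partner"}}
label = classify(tree, {"accesses_gt_5000_per_day": "yes"})
```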
In the embodiment, the decision-making judgment is carried out on the target service data through the classification prediction model, so that the classification accuracy and the decision-making efficiency are improved.
According to the method and the device, a training data set is extracted from the acquired data set, nodes are determined according to the total information entropy of the training data set, a decision tree model is generated based on the nodes, and target business data is classified according to the generated decision tree model. This improves the classification efficiency and accuracy of the business data, enables important partners to be identified efficiently and accurately, and assists business personnel in predicting partner characteristics through the rules in the model.
The application is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by hardware associated with computer readable instructions, which can be stored in a computer readable storage medium, and when executed, the processes of the embodiments of the methods described above can be included. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages; these are not necessarily performed at the same time or in sequence, but may be performed at different times, alternating with other steps or with sub-steps or stages of other steps.
With further reference to fig. 3, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a data classification apparatus based on a decision tree model, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices.
As shown in fig. 3, the decision tree model-based data classification apparatus 300 according to the present embodiment includes: an extraction module 301, a determination module 302, a generation module 303, a verification module 304, an output module 305, and a classification module 306. Wherein:
the extraction module 301 is configured to obtain historical service data, extract service features from the historical service data, generate a data set according to the service features, and extract a subset from the data set as a first training data set;
the determining module 302 is configured to determine an index attribute and a condition attribute in the first training data set, and determine a first total information entropy of the first training data set based on the index attribute;
the generating module 303 is configured to generate a node according to the first total information entropy and the condition attribute, and generate a decision tree model based on the node;
the verification module 304 is configured to obtain a verification data set from the data set, and verify the decision tree model through the verification data set to obtain a verification result;
the output module 305 is configured to determine whether the verification result meets a preset condition, and if the verification result does not meet the preset condition, update the decision tree model until the verification result meets the preset condition, and output a final decision tree model as a classification prediction model;
the classification module 306 is configured to obtain target service data, and input the target service data into the classification prediction model to obtain a classification result.
The data classification device based on the decision tree model extracts a training data set from the acquired data set, determines nodes according to the total information entropy of the training data set, generates a decision tree model based on the nodes, and classifies target business data according to the generated decision tree model. This improves the classification efficiency and accuracy of the business data, enables important partners to be identified efficiently and accurately, and assists business personnel in predicting partner characteristics through the rules in the model.
In this embodiment, the determining module 302 includes a determining submodule and a calculating submodule, where the determining submodule is configured to determine a probability of each index feature in the index attribute in the first training data set; the calculation submodule is used for calculating and obtaining a first total information entropy of the first training data set based on the probability.
Calculating the information entropy reflects the degree of disorder of the data distribution, which makes the method suitable for classifying high-dimensional data.
In this embodiment, the generating module 303 includes a first calculating submodule, an optimizing submodule, a determining submodule, a second calculating submodule, and a circulating submodule, where:
the first calculation submodule is used for calculating and obtaining the information gain of each condition attribute according to the first total information entropy and the attribute data of the condition attribute;
the optimization submodule is used for acquiring an optimization weight of the condition attribute, and optimizing corresponding information gain through the optimization weight to obtain optimized information gain;
the determining submodule is used for determining an optimal condition attribute based on the optimization information gain and taking the optimal condition attribute as a node;
the second calculation submodule is used for forming a second training data set by the condition attributes and the index attributes except the nodes, and calculating a second total information entropy of the second training data set according to the index attributes;
and the circulation submodule is used for looping steps A to D until all the condition attributes have generated nodes.
In this embodiment, the first computing submodule is further configured to:
calculating the attribute information entropy of each attribute feature in each condition attribute according to the attribute data;
calculating a condition information entropy corresponding to the condition attribute based on the attribute information entropy;
and calculating to obtain information gain according to the first total information entropy and the condition information entropy.
The decision tree model is generated by selecting nodes through information gain, so that the classification result of the decision tree has better interpretability, and the accuracy of data classification is improved.
In some optional implementations, the determining module 302 further includes a determining submodule and a modifying submodule, where the determining submodule is configured to determine whether there is abnormal data in the attribute data; and the correction submodule is used for correcting the abnormal data if the abnormal data exists.
After the correction, the information gain calculation is carried out, so that the interference can be reduced, and the accuracy can be improved.
In this embodiment, the determining module 302 further includes an adjusting sub-module, configured to:
determining condition attributes corresponding to the abnormal data, and counting the proportion of the abnormal data in the condition attributes;
and calculating according to the proportion to obtain an adjustment coefficient, and adjusting the optimized weight of the condition attribute according to the adjustment coefficient.
The generated decision tree can have a better classification effect by selecting the nodes through the information gain, and meanwhile, the information gain is weighted and corrected, so that the interference can be further reduced, and the classification accuracy is improved.
In some optional implementations of the present embodiment, the verification module 304 is further configured to:
inputting the verification data set into the decision tree model and outputting a prediction result;
and calculating the prediction accuracy according to the prediction result, and taking the prediction accuracy as a verification result.
By verifying the generated decision tree model, the accuracy of decision tree model classification can be improved.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 4, fig. 4 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 4 comprises a memory 41, a processor 42, and a network interface 43 communicatively connected to each other via a system bus. It is noted that only the computer device 4 having components 41-43 is shown, but it should be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes, but is not limited to, a microprocessor, an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user through a keyboard, a mouse, a remote controller, a touch panel or voice control equipment and the like.
The memory 41 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or a memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the computer device 4. Of course, the memory 41 may also include both internal and external storage devices of the computer device 4. In this embodiment, the memory 41 is generally used for storing an operating system installed on the computer device 4 and various types of application software, such as computer readable instructions of a data classification method based on a decision tree model. Further, the memory 41 may also be used to temporarily store various types of data that have been output or are to be output.
The processor 42 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 42 is typically used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to execute computer readable instructions stored in the memory 41 or process data, for example, execute computer readable instructions of the decision tree model-based data classification method.
The network interface 43 may comprise a wireless network interface or a wired network interface, and the network interface 43 is generally used for establishing communication connection between the computer device 4 and other electronic devices.
In this embodiment, when the processor executes the computer readable instructions stored in the memory, the steps of the decision tree model-based data classification method of the above embodiments are implemented: a training data set is extracted from the acquired data set, nodes are determined according to the total information entropy of the training data set, a decision tree model is generated based on the nodes, and target business data is classified according to the generated decision tree model. This improves the classification efficiency and accuracy of the business data, enables important partners to be identified efficiently and accurately, and assists business personnel in predicting partner characteristics through the rules in the model.
The present application further provides another embodiment: a computer-readable storage medium storing computer-readable instructions executable by at least one processor, so that the at least one processor performs the steps of the data classification method based on the decision tree model as described above. A training data set is extracted from the acquired data set, nodes are determined according to the total information entropy of the training data set, a decision tree model is generated based on the nodes, and target business data is classified according to the generated decision tree model. This improves the classification efficiency and accuracy of the business data, enables important partners to be identified efficiently and accurately, and assists business personnel in predicting partner characteristics through the rules in the model.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present application or portions thereof that contribute to the prior art may be embodied in the form of a software product, where the computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, and an optical disk), and includes several instructions for enabling a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely illustrative of some, but not all, of the embodiments of the present application, and that the appended drawings illustrate preferred embodiments without limiting the scope of the application. This application may be embodied in many different forms, and these embodiments are provided so that the disclosure of the application will be thorough. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the technical solutions described in the foregoing embodiments, or equivalents may be substituted for some of their features. All equivalent structures made using the contents of the specification and drawings of the present application, applied directly or indirectly in other related technical fields, fall within the protection scope of the present application.
Claims (10)
1. A data classification method based on a decision tree model is characterized by comprising the following steps:
acquiring historical service data, extracting service features from the historical service data, generating a data set according to the service features, and extracting a subset from the data set to serve as a first training data set;
determining an index attribute and a condition attribute in the first training data set, and determining a first total information entropy of the first training data set based on the index attribute;
generating a node according to the first total information entropy and the condition attribute, and generating a decision tree model based on the node;
obtaining a verification data set from the data set, and verifying the decision tree model through the verification data set to obtain a verification result;
determining whether the verification result meets a preset condition, if not, updating the decision tree model until the verification result meets the preset condition, and outputting a final decision tree model as a classification prediction model;
and acquiring target service data, and inputting the target service data into the classification prediction model to obtain a classification result.
2. The decision tree model-based data classification method according to claim 1, wherein the step of determining the first total information entropy of the first training data set based on the index attribute comprises:
determining a probability of each index feature in the index attributes in the first training data set;
and calculating to obtain a first total information entropy of the first training data set based on the probability.
3. The decision tree model-based data classification method according to claim 1, wherein the step of generating a node according to the first total entropy and the condition attribute comprises:
step A, calculating to obtain the information gain of each condition attribute according to the first total information entropy and the attribute data of the condition attribute;
b, obtaining an optimized weight of the condition attribute, and optimizing corresponding information gain through the optimized weight to obtain optimized information gain;
step C, determining an optimal condition attribute based on the optimization information gain, and taking the optimal condition attribute as a node;
step D, forming a second training data set by the condition attributes and the index attributes outside the nodes, and calculating a second total information entropy of the second training data set according to the index attributes;
and E, looping step A to step D until all the condition attributes have generated nodes.
4. The decision tree model-based data classification method according to claim 3, wherein the step of calculating an information gain for each of the condition attributes based on the first total information entropy and the attribute data of the condition attributes comprises:
calculating the attribute information entropy of each attribute feature in each condition attribute according to the attribute data;
calculating a condition information entropy corresponding to the condition attribute based on the attribute information entropy;
and calculating to obtain information gain according to the first total information entropy and the condition information entropy.
5. The decision tree model-based data classification method according to claim 3, wherein before the step of calculating the information gain of each condition attribute according to the first total information entropy and the attribute data of the condition attribute, the method further comprises:
determining whether the attribute data has abnormal data;
and if the abnormal data exist, correcting the abnormal data.
6. The decision tree model-based data classification method according to claim 5, further comprising, before the step of obtaining the optimized weights of the condition attributes:
determining a condition attribute corresponding to the abnormal data, and counting the proportion of the abnormal data in the condition attribute;
and calculating according to the proportion to obtain an adjusting coefficient, and adjusting the optimized weight of the condition attribute according to the adjusting coefficient.
7. The decision tree model-based data classification method according to claim 1, wherein the step of verifying the decision tree model by the verification dataset to obtain a verification result comprises:
inputting the verification data set into the decision tree model and outputting a prediction result;
and calculating the prediction accuracy according to the prediction result, and taking the prediction accuracy as a verification result.
8. A decision tree model-based data classification device, comprising:
the extraction module is used for acquiring historical business data, extracting business features from the historical business data, generating a data set according to the business features, and extracting a subset from the data set to serve as a first training data set;
a determining module, configured to determine an index attribute and a condition attribute in the first training data set, and determine a first total information entropy of the first training data set based on the index attribute;
a generating module, configured to generate nodes according to the first total information entropy and the condition attributes, and generate a decision tree model based on the nodes;
a verification module, configured to acquire a verification data set from the data set and verify the decision tree model through the verification data set to obtain a verification result;
an output module, configured to determine whether the verification result meets a preset condition, update the decision tree model if the verification result does not meet the preset condition until the verification result meets the preset condition, and output the final decision tree model as a classification prediction model;
and a classification module, configured to acquire target service data and input the target service data into the classification prediction model to obtain a classification result.
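The determining and generating modules of claim 8 compute a total information entropy over the index attribute and build nodes from the condition attributes. A minimal sketch of those two computations, assuming Shannon entropy and ID3-style information gain as the node-selection criterion — the claims name the entropy but not the exact splitting rule:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of the index-attribute (class) distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr_index):
    """Reduction in total entropy after splitting on one condition attribute.

    The attribute with the highest gain would become the next node;
    the index-based row layout is an illustrative assumption.
    """
    total = entropy(labels)
    groups = {}
    for row, y in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return total - remainder
```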
9. A computer device, comprising a memory and a processor, the memory having computer readable instructions stored therein, wherein the computer readable instructions, when executed by the processor, implement the steps of the decision tree model-based data classification method according to any one of claims 1 to 7.
10. A computer-readable storage medium, having computer-readable instructions stored thereon, which, when executed by a processor, implement the steps of the decision tree model-based data classification method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210836961.0A CN115099875A (en) | 2022-07-15 | 2022-07-15 | Data classification method based on decision tree model and related equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115099875A true CN115099875A (en) | 2022-09-23 |
Family
ID=83298379
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210836961.0A Pending CN115099875A (en) | 2022-07-15 | 2022-07-15 | Data classification method based on decision tree model and related equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115099875A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116910669A (en) * | 2023-09-13 | 2023-10-20 | 深圳市智慧城市科技发展集团有限公司 | Data classification method, device, electronic equipment and readable storage medium |
CN116910669B (en) * | 2023-09-13 | 2024-07-26 | 深圳市智慧城市科技发展集团有限公司 | Data classification method, device, electronic equipment and readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112148987B (en) | Message pushing method based on target object activity and related equipment | |
CN112101172B (en) | Weight grafting-based model fusion face recognition method and related equipment | |
WO2021120677A1 (en) | Warehousing model training method and device, computer device and storage medium | |
WO2022110640A1 (en) | Model optimization method and apparatus, computer device and storage medium | |
CN112633973B (en) | Commodity recommendation method and related equipment thereof | |
WO2022083093A1 (en) | Probability calculation method and apparatus in graph, computer device and storage medium | |
CN112288025B (en) | Abnormal case identification method, device, equipment and storage medium based on tree structure | |
CN112308173B (en) | Multi-target object evaluation method based on multi-evaluation factor fusion and related equipment thereof | |
CN112231592B (en) | Graph-based network community discovery method, device, equipment and storage medium | |
CN110855648B (en) | Early warning control method and device for network attack | |
CN112785005B (en) | Multi-objective task assistant decision-making method and device, computer equipment and medium | |
CN112288163A (en) | Target factor prediction method of target object and related equipment | |
CN113283222B (en) | Automatic report generation method and device, computer equipment and storage medium | |
CN112035549A (en) | Data mining method and device, computer equipment and storage medium | |
CN112990583B (en) | Method and equipment for determining model entering characteristics of data prediction model | |
CN116684330A (en) | Traffic prediction method, device, equipment and storage medium based on artificial intelligence | |
CN112668482A (en) | Face recognition training method and device, computer equipment and storage medium | |
CN113420161B (en) | Node text fusion method and device, computer equipment and storage medium | |
CN115099875A (en) | Data classification method based on decision tree model and related equipment | |
CN114219664A (en) | Product recommendation method and device, computer equipment and storage medium | |
CN115545753A (en) | Partner prediction method based on Bayesian algorithm and related equipment | |
CN116777646A (en) | Artificial intelligence-based risk identification method, apparatus, device and storage medium | |
CN116776150A (en) | Interface abnormal access identification method and device, computer equipment and storage medium | |
CN114048330B (en) | Risk conduction probability knowledge graph generation method, apparatus, device and storage medium | |
CN114925275A (en) | Product recommendation method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||