Customer risk index screening method and system based on random forest
Technical Field
The present disclosure relates to a Customer Risk Rating (CRR) system, and more particularly, to a method and system for customer risk indicator screening based on random forests.
Background
Internet finance (ITFIN) refers to a novel financial business model in which traditional financial institutions and internet enterprises use internet and information communication technology to provide fund integration, payment, investment, and information intermediary services. Internet finance is not a simple combination of the internet and the financial industry; rather, it is a new model and a new class of services that arose naturally, once network technologies such as security and mobility had matured and been accepted by users (in particular through the acceptance of electronic commerce), to meet new requirements. It is an emerging field combining the traditional financial industry with internet technology. With the continuous development of internet technology, more and more financial institutions, enterprises, merchants, and ordinary users rely on internet finance for financial transactions such as inter-bank transfers, online payment, online investment and financing, digital money, and online shopping.
On the other hand, internet finance faces many problems. First, internet finance has not yet been connected to the People's Bank of China credit system, has no credit information sharing mechanism, and lacks the risk control, compliance, and clearing mechanisms of banks, so various risk problems can easily occur, such as P2P platforms absconding with funds, money laundering, and financial fraud. Second, internet finance is weakly supervised: in China it is still in its initial stage, lacks regulation and legal constraints, lacks admission thresholds and industry specifications, and faces many policy and legal risks. For example, traditional financial criminal activities such as money laundering, illegal fundraising, fraud, and gambling have begun to migrate to the internet, bringing enormous harm and destructive power to society.
Therefore, there is an urgent need for a mechanism that can effectively assess the risk level of a customer using internet finance, so as to safeguard the development of internet finance.
Disclosure of Invention
The disclosure relates to a customer risk index screening method based on a random forest, and a method and a system for evaluating a customer risk level by using a customer risk index screened by the method.
According to a first aspect of the disclosure, a customer risk indicator screening method based on a random forest is provided, which includes: collecting customer characteristic data comprising a plurality of customer characteristics; constructing a random forest based on the customer characteristic data; for each customer characteristic in the customer characteristic data, performing the following importance calculation steps: determining the importance of the customer characteristic at each node of a decision tree in the random forest; determining the importance of the customer characteristic in each decision tree in the random forest; and determining the importance of the customer characteristic throughout the random forest; ranking the plurality of customer characteristics according to their importance throughout the random forest, and selecting a subset containing customer characteristics of a desired importance as the customer risk indicators.
According to a second aspect of the present disclosure, there is provided a method for evaluating the risk level of a customer by using customer risk indicators screened through a random forest, comprising: collecting customer characteristic data; screening out a subset of the customer characteristics of desired importance as customer risk indicators for assessing the customer risk level according to the method of the first aspect; for each screened customer risk indicator, calculating a corresponding risk indicator score through a customer risk level model; and weighting and summing all the calculated risk indicator scores to calculate the total risk level score of the customer; wherein the weight of each risk indicator score in the weighted sum is based on the importance of the customer characteristic associated therewith.
According to a third aspect of the present disclosure, there is provided a computer system comprising: a customer characteristic data collection module for collecting customer characteristic data; a customer characteristic screening module for screening out, from the collected customer characteristic data, a subset of customer characteristics having a desired importance as customer risk indicators for assessing a customer risk level according to the method of the first aspect; a customer risk level scoring module for calculating, for each screened customer risk indicator, a corresponding risk indicator score through the customer risk level model; and a customer risk level total scoring module for weighting and summing all the calculated customer risk indicator scores to calculate the total risk level score of the customer.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
In order to describe the manner in which the above-recited and other advantages and features of the disclosure can be obtained, a more particular description of the disclosure briefly described above will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. Understanding that these drawings depict only typical embodiments of the disclosure and are not therefore to be considered to be limiting of its scope, the disclosure will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:
fig. 1 is a flowchart illustrating a conventional scheme for assessing a risk level of a customer based on expert experience.
Fig. 2 shows a conventional flow chart for constructing a random forest.
FIG. 3 illustrates a flow of customer risk indicator screening based on random forests according to one embodiment of the present disclosure.
FIG. 4 shows a flow of a scheme for assessing a risk level of a customer using customer risk indicators after random forest based screening according to an embodiment of the present disclosure.
FIG. 5 illustrates a computer system for assessing a risk level of a customer using customer risk indicators after random forest based screening according to one embodiment of the present disclosure.
Detailed Description
In order to assess the risk level of a customer using internet finance to avoid financial risks, such as money laundering activities, a Customer Risk Rating (CRR) system typically determines the risk level of the customer by collecting various customer data associated with the customer as indicators for assessing the risk of the customer and applying the data to a risk assessment model to obtain a risk score for the customer. The customer risk indicators may include, for example, customer characteristics (e.g., age, gender, address, etc.), occupation (affiliated company, position, time of day, income, etc.), territory (e.g., country and region where activities such as money laundering, fraud are rampant), business type (e.g., transaction type, flow of funds, etc.), and various other indicators.
However, as customer data of different dimensions are collected and accumulated, financial institutions can extract more and more customer characteristics. If all of the collected customer characteristic data were used as risk indicators and applied to the risk assessment model, each risk assessment of a customer would consume substantial system resources and time, which runs contrary to the low-cost, high-efficiency character of internet finance.
Therefore, a mechanism is needed to accurately mine, from a large number of customer characteristics, those most indicative of risk, for use as indicators in customer risk level assessment. This mining may be referred to as "feature selection", i.e., the process of selecting a subset of relevant features (i.e., attributes, indicators) from a large feature set in order to construct a model. The main motivation for feature selection is that the collected customer characteristics used as training data generally include many redundant or irrelevant features; removing them loses little information while greatly simplifying the model, shortening training time, and saving system resources.
A conventional Customer Risk Rating (CRR) system usually screens customer features manually based on expert experience, and then calculates a total customer risk score from the model scores of the screened indicators, thereby obtaining the customer risk rating. The specific process is shown in FIG. 1:
in the process shown in FIG. 1, first, at step 110, a Customer Risk Rating (CRR) system collects customer characteristic data of different dimensions from various data sources. For example, the current location of the customer may be obtained from a cell phone used by the customer, the various transaction types used by the customer may be collected from various APPs used by the customer, the latest address information and cell phone number of the customer may be obtained from the customer's address book, the flow of funds for the customer may be obtained from a bank, and so on. Some of these customer characteristic data are closely related to the customer risk level, such as transaction type, fund flow, etc., while some customer characteristic data are less relevant, such as location and address, which have little impact on the customer risk level. Therefore, it is necessary to filter the customer feature data of these different dimensions to select the features that best differentiate the risk of the user.
Thus, in step 120, the Customer Risk Rating (CRR) system manually screens features from the large amount of customer characteristic data using expert experience, yielding risk indicators 1, 2, …, N.
In step 130, based on the screened risk indicators 1, 2, …, N, the customer risk level model calculates a corresponding score for each: score S1 for risk indicator 1, score S2 for risk indicator 2, …, and score SN for risk indicator N.
Then, in step 140, all the calculated risk indicator scores are weighted and summed to calculate a total risk level score for the customer, and the risk rating of the customer may be assessed based on this total score. Here, the weight of each risk indicator score in the total is specified by experts based on their experience; for example, customer data such as abnormal transactions and frequent transfers may be assigned a high weight. Alternatively, after the total risk level score is determined, the customer's risk level may be determined by comparing the total score against the threshold ranges set in a customer risk level classification table.
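The weighted summation and threshold lookup of step 140 can be sketched minimally as follows. The indicator names, scores, weights, and level thresholds below are illustrative assumptions, not values from the disclosure:

```python
# Illustrative sketch of step 140: weighted sum of risk indicator scores,
# then mapping the total to a risk level. All values here are made up.
scores = {"transaction_type": 80.0, "fund_flow": 65.0, "address": 20.0}
weights = {"transaction_type": 0.5, "fund_flow": 0.4, "address": 0.1}

# Total risk level score = sum of (indicator score * indicator weight).
total = sum(scores[k] * weights[k] for k in scores)   # 40 + 26 + 2 = 68.0

def risk_level(total_score):
    """Map a total score to a level via an assumed classification table."""
    if total_score >= 70:
        return "high"
    if total_score >= 40:
        return "medium"
    return "low"
```

In the disclosure's scheme, the weights would come from the feature importances computed later, rather than from expert judgment.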
Although the conventional customer risk level assessment process described above reduces the number of customer data (characteristics) to be processed by relying on expert experience, expert experience is a manual decision-making process that can only be summarized and refined from historical experience, so the importance of each characteristic is difficult to quantify. Specifically, expert experience can only use historical data to identify potential risk points among risky customers as risk indicators; it is difficult to quantify the importance of each such indicator. Moreover, as more and more customer characteristics are collected, finding possible risk indicators among massive features through expert screening consumes great manpower and resources.
To address the above-mentioned problems of screening customer characteristic data by expert experience, some automated algorithms have been provided to replace expert experience. For example, a wrapper-type feature selection algorithm uses a predictive model to score feature subsets. A feature subset is a combination of one or more of the collected customer features; each feature subset is used to train the model once, and the trained model is then tested on a validation data set. Each feature subset is scored by counting the number of errors the model makes on the validation data set (i.e., the model's error rate). Finally, the feature subset (combination) with the desired score is selected as the set of risk indicators. This mechanism automates the screening of customer characteristics, requires no manual intervention, and greatly reduces manual labor in risk assessment.
But this mechanism has its own drawbacks. Since a wrapper-type algorithm trains the model once for each feature subset (combination), it is computationally expensive. For example, given three customer features A, B, and C, the possible feature subsets include the seven combinations A, B, C, AB, AC, BC, and ABC; for each of these, a model must be trained once and tested on a validation data set to obtain a score. Although in practice some pruning (e.g., forward/backward search) may remove some less relevant feature subsets (combinations), when the number of customer features is itself large, such pruning does not significantly reduce the number of subsets. Moreover, if the Customer Risk Rating (CRR) system adds customer data of a new dimension as a new customer feature, many new feature subsets (combinations) are generated, each requiring training and scoring again, which greatly increases the subsequent processing cost.
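The combinatorial cost of the wrapper approach can be seen in a few lines; this is an illustration of the subset count only, not an implementation of any model training:

```python
from itertools import combinations

# The wrapper method must train and validate one model per non-empty
# feature subset. For the three-feature example from the text:
features = ["A", "B", "C"]
subsets = [c for r in range(1, len(features) + 1)
           for c in combinations(features, r)]
# 2**3 - 1 = 7 subsets: (A,), (B,), (C,), (A,B), (A,C), (B,C), (A,B,C).
# Adding one new feature D would double this (plus one) to 2**4 - 1 = 15.
```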
Therefore, to solve the problems faced when screening massive customer characteristics, such as those mentioned above, the present disclosure proposes a mechanism for screening customer risk indicators based on random forests. In other words, the solution of the present disclosure mainly improves step 120 of the conventional customer risk level assessment process of FIG. 1, i.e., the step of screening customer characteristics to serve as risk indicators. A random forest is constructed over the customer's feature set to screen out the feature subset that best distinguishes customer risk as the risk indicators, and the importance of each customer risk indicator can be indicated explicitly.
Before describing the scheme of the present disclosure, the concept of "random forest" is first explained. A random forest is a classifier that trains on and predicts samples using multiple trees, i.e., a machine learning classification algorithm comprising multiple decision trees, where the class output by the random forest is the mode of the classes output by the individual trees.
Wherein, as shown in fig. 2, each decision tree in a random forest can be built according to the following procedure:
1) Step 210: let N denote the number of training cases (samples) and M denote the number of features;
2) Step 220: sample N times, with replacement, from the N training cases (samples) to form a training set (i.e., bootstrap sampling), and use the cases (samples) that were not drawn as a prediction set to evaluate the tree's error;
3) Step 230: when growing the decision tree (tree classifier), randomly select r features at each node, where r is far smaller than M; then, from these r features, select one feature as the node's splitting attribute according to a certain strategy (such as information gain, information gain ratio, or the Gini index), and note that throughout the growing of the decision tree there is no pruning, splitting continuing until no further split is possible;
4) Step 240: build a number of decision trees per steps 220 and 230, thereby forming the random forest.
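The two sampling steps above (220 and 230) can be sketched with the standard library alone. The data, and the values of N, M, and r, are toy assumptions for illustration; growing the actual tree is omitted:

```python
import random

random.seed(0)

# Toy setup: N training cases, M features, r candidate features per split.
N, M, r = 100, 8, 3
samples = list(range(N))   # stand-ins for the N training cases

def bootstrap(cases):
    """Step 220: draw len(cases) cases with replacement as the training
    set; cases never drawn (out-of-bag) can estimate the tree's error."""
    drawn = [random.choice(cases) for _ in cases]
    drawn_set = set(drawn)
    out_of_bag = [c for c in cases if c not in drawn_set]
    return drawn, out_of_bag

def candidate_features():
    """Step 230: at each node, only r randomly chosen features (r << M)
    compete to become the splitting attribute."""
    return random.sample(range(M), r)

train_set, oob_set = bootstrap(samples)
split_candidates = candidate_features()
```

With sampling with replacement, roughly a third of the cases typically end up out-of-bag, which is what makes the per-tree error estimate possible.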
The main work of decision trees in random forests is to select features to divide a data set and finally attach two different types of labels to the data. A decision tree corresponds to an expert who classifies new data by learning knowledge in the data set itself. Random forests are algorithms that attempt to build multiple decision trees so that the final classification effect can exceed that of a single expert.
Based on the above process, it can be understood that the construction of a random forest mainly involves two kinds of randomness: 1. random selection of data; 2. random selection of candidate features.
1. Random selection of data:
First, a sub-data set is constructed from the original data set by sampling with replacement; the sub-data set has the same size as the original data set. Elements may be repeated across different sub-data sets, and also within the same sub-data set. Second, a sub-decision tree is constructed from each sub-data set, and data placed into each sub-decision tree yields one output result. Finally, when new data needs a classification result from the random forest, the forest's output can be obtained by voting over the judgments of the sub-decision trees. For example, if there are 3 decision trees in the random forest and the classification result of 2 subtrees is class A while that of 1 subtree is class B, the classification result of the random forest is class A.
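The voting step just described is a simple mode computation; a minimal sketch:

```python
from collections import Counter

def forest_vote(subtree_results):
    """The forest's output class is the mode of the subtrees' classes."""
    return Counter(subtree_results).most_common(1)[0][0]

# The example from the text: 2 subtrees output class A, 1 outputs class B,
# so the forest outputs class A.
result = forest_vote(["A", "A", "B"])
```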
2. Random selection of features to be selected
Similar to the random selection of data, each splitting process of a subtree in the random forest does not consider all candidate features; instead, a certain number of features are randomly selected from all candidates, and the optimal feature is then chosen from among them. This makes the decision trees in the random forest differ from one another, improving the diversity of the system and thus its classification performance. For selecting the optimal feature at a split, the following algorithms may be adopted: the ID3 algorithm (proposed by J. Ross Quinlan in 1986) selects the feature with maximum information gain; the C4.5 algorithm (J. Ross Quinlan, 1993) selects by information gain ratio; and the CART algorithm (Breiman et al., 1984) uses the Gini index minimization criterion, among others.
In the above, the concept, construction flow and features of "random forest" are introduced. The concepts, construction processes and features of the random forest belong to the technical knowledge well known in the field of machine learning algorithms, and are not described in detail herein. The skilled person may refer to technical literature on "random forests" for further understanding.
The purpose of building a random forest is usually classification, i.e., deciding the final class of a sample by summarizing the votes of its decision trees. By studying the properties and algorithms involved in constructing a random forest, the scheme of the present disclosure uses part of those properties and algorithms to calculate the importance of each customer characteristic, and then screens the customer characteristics by importance to select, as risk indicators, the subset of customer characteristics that best distinguishes customer risk.
Specifically, as described above in conjunction with the flow of building a random forest in FIG. 2, during the splitting of each node in step 230, a certain strategy is adopted to select one of the r candidate features as the splitting attribute. To make the selected feature the optimal splitting attribute, the strategies may include, for example, the information-gain-based ID3 algorithm, the information-gain-ratio-based C4.5 algorithm, and the Gini-index-based CART (Classification and Regression Tree) algorithm, among others. When these algorithms are used to select the feature for a subsequent split (bifurcation), we find that the quantities they compute, such as the information gain, the information gain ratio, or the Gini index, can also be used to evaluate the importance of a feature. The CART algorithm is used below as an example to describe how to calculate feature importance using the Gini index.
In the CART algorithm, the Gini index minimization criterion is used for feature selection to generate a binary tree. The Gini index (also known as Gini impurity or Gini uncertainty) represents the probability that a randomly selected sample in a sample set is misclassified. Note: a smaller Gini index indicates a smaller probability that a sample selected from the set is misclassified, i.e., the purity of the set is higher; conversely, a larger Gini index means the set is less pure.
That is, the Gini index equals, summed over the classes, the probability that a sample is selected as a given class multiplied by the probability that it is then misclassified.

Assuming that a risk indicator can classify users into K classes, that the probability that a sample (customer feature) point belongs to the k-th class is p_k, and that the probability of such a sample being misclassified is (1 − p_k), the Gini index is expressed by the formula:

Gini(p) = Σ_{k=1}^{K} p_k (1 − p_k) = 1 − Σ_{k=1}^{K} p_k²
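The Gini index formula above is one line of code; this sketch shows both equivalent forms agree:

```python
def gini(probs):
    """Gini index per the formula above: sum over the K classes of
    p_k * (1 - p_k), which equals 1 - sum of p_k**2."""
    return sum(p * (1.0 - p) for p in probs)

# A pure node (one class, p = 1) scores 0; an even two-class node scores
# 0.5, the maximum possible for two classes.
```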
the computation of the kini index of a feature is a commonly used technique in constructing the individual decision trees in a random forest. As previously mentioned, the Keyny index represents the probability that a sample set is misclassified. The smaller the Kini index is, the better the certainty of the sample set is, and the smaller the error probability is, and the larger the Kini index is, the larger the uncertainty of the sample set is, and the higher the error probability is. Thus, it can be readily appreciated that if the difference between the current kini index of the sample set at one node (e.g., node m) and the kini indexes of the sample sets of the left and right child nodes after splitting based on the selected customer feature is larger, the more important the selected customer feature is to be accounted for. The larger the difference between the kini indexes before and after splitting is, the smaller the kini indexes of two sample subsets (left child node l and right child node r) divided based on the customer feature is, that is, the better the certainty of the sample subsets is, the smaller the possibility of error is, so the larger the influence of the customer feature on the whole decision tree and even the whole random forest is.
Based on the above principle, the present disclosure provides a scheme comprising: while or after constructing the random forest over the customer's feature set, measuring the importance of each customer feature on each decision tree by the change in the CART Gini index before and after each split (branching), then averaging each customer feature's importance across the trees to calculate its importance over the whole random forest. The customer features are then ranked by importance, and the most important ones (e.g., the Top N customer features) are selected as the risk indicators for the risk level score calculation.
In particular, assume that the importance of the customer feature i at decision tree node m is I_m. It is measured by the change between node m's Gini index and those of the new nodes after splitting (the Gini indexes of node m's left-branch new node l and right-branch new node r), i.e., the importance I_m can be expressed by the following formula (1):

I_m = Gini_m − Gini_l − Gini_r    (Formula 1)

where Gini_m denotes the Gini index of node m, and Gini_l and Gini_r denote the Gini indexes of the left-branch new node l and right-branch new node r after the split, respectively. These Gini indexes are all already calculated by the CART algorithm when the decision tree is constructed, so the scheme of the present disclosure simply reuses them. As mentioned above, a larger difference in Gini index between a node m and its two child nodes indicates that the sample subsets split on the feature are more correct (one may also say "more certain" or "purer"), and thus that feature i is more important to node m of the decision tree.
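Formula (1) can be sketched directly. Note the formula as stated subtracts the child Gini indexes without sample-weighting them; the sketch follows the text literally, and the label lists are invented toy data:

```python
def gini_of(labels):
    """Gini index of a node's list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def node_importance(parent, left, right):
    """Formula (1): I_m = Gini_m - Gini_l - Gini_r, taken literally as
    stated in the text (child terms are not weighted by sample count)."""
    return gini_of(parent) - gini_of(left) - gini_of(right)

# A split that separates the two classes perfectly leaves pure children
# (Gini 0), so the feature is credited with the parent's full impurity.
i_m = node_importance([0, 0, 1, 1], [0, 0], [1, 1])
```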
It should be understood that, within a decision tree t, the customer feature i is not necessarily used only for splitting node m; it may also be used for splitting other nodes. It may therefore occur x times (x being an integer of 1 or more). Accordingly, each time feature i is used for a node split during the construction of the whole decision tree, its importance at that node must be calculated, and these importances must be accumulated.
For example, the importance I_t of the customer feature i over the whole decision tree t can be computed by the following formula (2):

I_t = Σ_{k=1}^{x} I_m^(k)    (Formula 2)

where I_m^(k) represents the importance of feature i at the node where decision tree t uses feature i for a split the k-th time, calculated according to formula (1) above.
Subsequently, using formulas (1) and (2) in the same manner, the importance I_m of feature i at each node of the other decision trees, and its importance I_t over each of those whole trees, can be calculated.
Then, the importances I_t of the customer feature i in the individual decision trees can be gathered and averaged to calculate the importance of feature i over the whole random forest. Assuming that splits on customer feature i occur in Y trees of the random forest (Y being an integer greater than or equal to 1), the importance I_f of customer feature i over the whole random forest f can be expressed as formula (3):

I_f = (1/Y) Σ_{e=1}^{Y} I_t^(e)    (Formula 3)

where I_t^(e) represents the importance of feature i on the e-th decision tree, and I_f is the final (averaged) importance of the customer feature i.
For each customer feature, a calculation similar to that described above for feature i is performed in turn, obtaining a (final) importance for every customer feature.
Finally, the customer features are sorted by importance, and the Top N most important features are selected as the risk indicators for the customer risk level model. This concludes the introduction of the customer risk indicator screening mechanism based on the CART algorithm as used in the random forest.
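Formulas (2) and (3) plus the final ranking step reduce to a sum, an average, and a sort. All the per-node importances, feature names, and values below are invented for illustration:

```python
# Hypothetical per-node importances I_m of one feature, grouped by the
# Y = 3 decision trees that used it for splitting.
per_tree_node_importances = [
    [0.30, 0.10],        # the feature split 2 nodes in tree 1
    [0.25],              # 1 node in tree 2
    [0.05, 0.05, 0.10],  # 3 nodes in tree 3
]

# Formula (2): I_t = sum of the feature's node importances within a tree.
tree_importances = [sum(nodes) for nodes in per_tree_node_importances]

# Formula (3): I_f = average of I_t over the Y trees that used the feature.
forest_importance = sum(tree_importances) / len(tree_importances)

# Ranking step: keep the Top N features (N = 2 here) as risk indicators.
importances = {"fund_flow": 0.28, "transaction_type": 0.22, "address": 0.03}
top_n = sorted(importances, key=importances.get, reverse=True)[:2]
```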
It should be understood that the CART algorithm and the Gini index are described merely as one example of a decision tree feature selection strategy; this does not mean that the customer risk indicator screening mechanism of the present disclosure can only be based on the CART algorithm and the Gini index. In practice, when the feature selection strategy of the decision tree employs, for example, the information-gain-based ID3 algorithm or the information-gain-ratio-based C4.5 algorithm, the concept of "information gain" is used instead. ID3 is a decision-tree-based classification algorithm that applies a top-down greedy strategy, building the tree based on information gain. Information gain measures how well a given attribute separates the sample set into classes. Because it uses information gain, a decision tree built by the ID3 algorithm is small and fast to query. The C4.5 algorithm is an improvement of ID3; it can handle continuous data and uses the information gain ratio instead of the information gain.
ID3 and C4.5 algorithm core:
The core of the ID3 and C4.5 algorithms is "information entropy". Taking ID3 as an example: the information gain of each attribute (feature) is calculated, attributes with high information gain are considered good attributes, and at each division the attribute with the highest information gain is selected as the division standard. This process is repeated until a decision tree that can perfectly classify the training samples is generated.
ID3 uses information gain as a method to select the optimal split attribute, and entropy as a standard to measure node purity.
Information entropy is the expectation of the information content of a random variable; it measures the degree of uncertainty of the information. The larger the information entropy, the harder the information is to pin down; processing information to make it clear is a process of entropy reduction. The calculation formula of the information entropy is as follows:

Entropy(S) = − Σ_{i=1}^{n} p_i log2(p_i)

The information gain measures how much attribute A contributes to reducing the entropy of the sample set S; the larger the information gain, the more suitable A is for classifying S. The calculation formula of the information gain is as follows:

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) · Entropy(S_v)

where Entropy(S) represents the entropy of the parent node and Entropy(S_v) represents the entropy of the child node corresponding to value v of attribute A. From the formula above, it can readily be understood that the information gain of feature i at node m can itself serve as an index of that feature's importance, i.e., I_m = infoGain(i, m), the information gain of feature i at node m.
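The two formulas above can be sketched directly; the label lists are toy data chosen so the expected values are easy to verify by hand:

```python
import math

def entropy(labels):
    """Entropy(S) = -sum(p_i * log2(p_i)) over the class frequencies."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(parent, partitions):
    """Gain(S, A) = Entropy(S) - sum(|S_v|/|S| * Entropy(S_v)),
    where `partitions` are the child sample sets S_v induced by A."""
    n = len(parent)
    return entropy(parent) - sum(len(s) / n * entropy(s) for s in partitions)

# Splitting a 50/50 parent (entropy = 1 bit) into two pure children
# (entropy 0) recovers the full 1 bit as gain.
g = info_gain([0, 0, 1, 1], [[0, 0], [1, 1]])
```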
The C4.5 algorithm proceeds in the same way as ID3, except that the feature selection criterion changes from the information gain to the information gain ratio; its calculation likewise includes computing the information gain of a feature at a node. Therefore, the importance I_m of feature i at node m is again the information gain infoGain of feature i at that node m.
Having seen how the ID3 and C4.5 algorithms yield the importance I_m of a feature i at a node m, formula (2) above can then be used to calculate the importance I_t of the customer feature i over the whole decision tree t, and formula (3) to calculate its importance I_f over the whole random forest f. Thus, using the information gain in these algorithms, the final importance of each feature can likewise be calculated as described above, and the most important features selected by ranking as the risk indicators for the customer risk level model. The ID3 and C4.5 algorithms and the information gain are well known to those skilled in the art and are therefore only briefly introduced here.
With the principles of the customer characteristic index screening of the present disclosure in mind, a specific flow of the customer risk index screening based on random forests according to one embodiment of the present disclosure is described below in conjunction with fig. 3.
As shown in FIG. 3, first, at step 310, customer characteristic data is collected. Taking the problem of anti-money laundering in the field of internet finance as an example, when assessing the risk level of a customer, characteristics such as the customer's name, age, occupation, income, address, identification number, mobile phone number, bank account, account transaction records, logistics information, transfer records, and consumption records can be collected from various data sources. Some of these customer characteristics are closely related to, for example, money laundering activity, such as bank accounts, transfer records, transaction records, and other information relating to the flow of funds, while others are less related, such as income and address. The collected customer characteristics therefore need to be screened efficiently, as follows, to improve the accuracy and efficiency of assessing the customer's risk level.
After the collection is complete, a random forest is constructed based on the collected customer characteristic data in step 320. The construction of the random forest may follow the construction flow shown in fig. 2, which is not repeated here. It should be noted that the customer risk indicator screening process described in this disclosure may be performed either while the random forest is being constructed or after it has been constructed. In the former case, while the splitting attribute of a feature at a node of a decision tree (such as the Gini index or the information gain) is being calculated, the corresponding importance of the feature can be calculated according to formulas (1) to (3) as described above (the importance at the node, in the decision tree and in the whole random forest is accumulated step by step as construction progresses). In the latter case, since parameters such as the Gini index or the information gain of each feature at every node of every decision tree already exist, the overall importance of the feature can be calculated directly according to formulas (1) to (3). Specifically, the importance calculation flow is as follows:
Assuming that M customer features are collected in step 310, for each customer feature i (1 ≤ i ≤ M), the following steps are performed:
In step 330, the importance I_m of the customer feature i at each node of the decision tree t is determined. As described above, if the splitting of the decision tree is based on the CART algorithm, the importance of the customer feature i at node m of the decision tree t is determined as I_m = Gini_m - Gini_l - Gini_r, i.e. the Gini index of feature i at the node minus the Gini indexes of feature i at its left and right child nodes. When the splitting of the decision tree is based on the ID3 or C4.5 algorithm, I_m is the information gain infoGain of the feature i at the node. As mentioned above, in the process of constructing the decision tree t, the customer feature i may be used to split not only node m but also other nodes. Therefore, for each node split using feature i, the importance of feature i at the corresponding node can be calculated as described above.
Next, in step 340, the importance I_t of the customer feature i in each decision tree is determined. For example, the importances of feature i at the individual nodes of the decision tree t can be added up by the above formula (2) to obtain its importance I_t in the decision tree t. In the same way, the importance of the customer feature i in each of the other decision trees can be obtained by accumulating its importance at each node of that tree.
Then, in step 350, the importance I_f of the customer feature i in the whole random forest f is determined. For example, the importances of feature i in the individual decision trees can be accumulated and averaged by the above formula (3) to obtain the final importance I_f in the whole random forest f. This importance I_f is the final importance of the customer feature i.
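The three-step calculation of steps 330-350 can be sketched as follows. This is an illustrative implementation of formulas (1) to (3) for the CART case, assuming each decision tree exposes, for every node split on the feature, the Gini index of that node and of its two children; all names and the toy numbers are hypothetical.

```python
def node_importance(gini_m, gini_l, gini_r):
    """Formula (1): importance of the feature at one node (CART case),
    i.e. the node's Gini index minus those of its left and right children."""
    return gini_m - gini_l - gini_r

def tree_importance(splits):
    """Formula (2): sum the node importances over all nodes of one tree
    that split on the feature. `splits` is a list of
    (gini_m, gini_l, gini_r) tuples for that feature."""
    return sum(node_importance(m, l, r) for m, l, r in splits)

def forest_importance(per_tree_splits):
    """Formula (3): average the per-tree importances over all trees."""
    totals = [tree_importance(s) for s in per_tree_splits]
    return sum(totals) / len(totals)

# Toy example: the feature splits one node in tree 1 and two nodes in tree 2.
forest = [
    [(0.50, 0.20, 0.10)],                      # tree 1: I_t = 0.20
    [(0.48, 0.30, 0.05), (0.40, 0.15, 0.15)],  # tree 2: I_t = 0.13 + 0.10
]
print(forest_importance(forest))  # average of 0.20 and 0.23, i.e. 0.215
```

For the ID3/C4.5 case, `node_importance` would instead return the information gain of the feature at the node, with the accumulation steps unchanged.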
After the above steps 330-350 have been performed iteratively for each customer feature to obtain its importance, in step 360 the customer features are sorted by importance, and a subset containing the N customer features with the required importance (e.g. the most important Top N, N ≧ 1, or all features not below a certain importance threshold) is selected as the customer risk indicators. This completes the random forest based screening scheme for customer risk indicators according to the present disclosure.
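Step 360 can be sketched as a simple ranking over the final importances I_f. Both selection modes described above (Top N and importance threshold) are shown; the feature names and importance values are illustrative only.

```python
def select_risk_indicators(importances, top_n=None, min_importance=None):
    """Rank features by final importance I_f and keep the Top N and/or
    all features whose importance is not below a threshold.
    `importances` maps feature name -> importance value."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    if top_n is not None:
        ranked = ranked[:top_n]
    if min_importance is not None:
        ranked = [(f, v) for f, v in ranked if v >= min_importance]
    return [f for f, _ in ranked]

# Hypothetical importances for five collected customer features.
importances = {
    "transfer_frequency": 0.31, "transaction_amount": 0.24,
    "transaction_count": 0.18, "customer_age": 0.04, "address": 0.01,
}
print(select_risk_indicators(importances, top_n=3))
# -> ['transfer_frequency', 'transaction_amount', 'transaction_count']
```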
Through the screening process shown in fig. 3, only a few customer characteristics with high importance are selected as the customer risk indicator. These customer risk indicators may be applied in steps 130 and 140 as described in fig. 1 to calculate a total score for the risk level of the customer and determine the risk level of the customer based on the total score.
A flow chart of a scheme for evaluating the risk level of a customer using customer risk indicators screened based on random forests according to the present disclosure is described below with reference to a specific application scenario. The application scenario may be, for example, an "anti-money laundering" scenario, or another internet financial security scenario requiring assessment of a customer's risk level, such as online credit or P2P lending. In these scenarios, the risk level of the customer needs to be evaluated before normal transactions can be executed.
Especially in "anti-money laundering" scenarios, common money laundering approaches span many fields such as banking, insurance, securities and real estate. Anti-money laundering is a systematic undertaking in which governments apply legislative and judicial resources and mobilize relevant organizations and commercial institutions to identify possible money laundering activities, dispose of the related funds, and punish the institutions and persons involved, thereby preventing criminal activity. International experience shows that the major money laundering and anti-money laundering activities take place in the financial field: almost all countries place the anti-money laundering work of financial institutions at the core, and international anti-money laundering cooperation is likewise concentrated in the financial field. With the development of internet finance, money laundering activities have also shifted from offline to more concealed online channels. Therefore, fighting internet "money laundering crime" is also becoming a major concern in keeping our country's financial system stable. The difficulties of internet anti-money laundering are as follows:
(I) Customer identification is difficult. In the traditional payment transaction process, service organizations such as banks can require customers to provide real identification, which makes obtaining evidence of money laundering crimes relatively easy. In a network environment, a customer can access an online financial account and transfer funds from anywhere, and without visual identification it is difficult to find a traceable entry point; even when electronic evidence is found, the real identity of a lawbreaker can be difficult to determine because passwords, identification numbers and the like may have been stolen.
(II) Suspicious transactions are difficult to discover. Manual analysis of suspicious transactions involving online banking faces several difficulties. First, online banking transaction records are mostly stored as electronic backups, with no paper certificates retained. In addition, online banking can move funds rapidly and repeatedly within a short time, producing huge transaction volumes and complex transaction behavior. Moreover, because internet banking lacks a complete customer identification system, it is difficult to verify counterfeit identity documents and accounts opened under fraudulently registered companies.
(III) The flow of funds is difficult to track and monitor. Under existing conditions, it is not easy to screen suspicious transactions from the large amount of network transaction data generated every day, and in the case of long-distance or even transnational fund transfers, suspected money laundering is often discovered too late to act on. Unlike traditional business, network transactions are not limited by time and space: funds can be transferred instantly, and lawless persons can break up a fund chain by frequent network transfers, frequently changing transaction counterparties, multiple deposits and withdrawals, and the like, moving the funds to a safe place. Thus, funds moved through network transactions are more difficult to track and control.
(IV) Complete information is difficult to obtain. Fragmentation of the payment process separates transaction information from customer identity information: completing a single transaction may require the joint participation of a card issuer, an internet payment institution, a telecommunication operator, an acquiring institution, the card holder, the merchant and even outsourced service providers. This fragmentation of participants causes the identity and transaction information of the same customer to be stored dispersedly across different institutions, making it difficult for anti-money laundering agencies to obtain complete information.
It is because of the difficulties of internet "anti-money laundering" work described above that the customer data that needs to be collected and verified when assessing a customer's risk level is hard to enumerate. Conventional customer risk rating schemes that rely on expert experience clearly cannot meet the supervision requirements of ever-growing internet financial transaction activity. With the random forest based customer risk indicator screening scheme according to the present disclosure, the most important customer risk indicators required for customer risk rating can be screened out without manual labor, greatly reducing the workload of customer risk rating. Specifically, the flow of the scheme for evaluating the risk level of a customer based on the customer risk indicators screened by the random forest is shown in fig. 4.
In FIG. 4, first, at step 410, the Customer Risk Rating (CRR) system collects customer characteristic data of different dimensions from various data sources. Some of these characteristics are closely related to the customer risk level, such as transaction type and flow of funds, while others, such as location and address, are less relevant: they may have little effect on the customer risk level and may even introduce unnecessary "noise". Therefore, these customer data of different dimensions need to be filtered to select the features that best differentiate the customer's risk.
Thus, in step 420, a random forest is constructed based on the customer characteristic data. That is, as shown in fig. 2, a random forest including a plurality of decision trees is constructed by, for example, the ID3 algorithm based on information gain, the C4.5 algorithm based on information gain ratio, or the Classification and Regression Tree (CART) algorithm based on the Gini index. While constructing the random forest or after constructing it, according to the flow of customer risk indicator screening based on the random forest shown in fig. 3, the corresponding importance can be calculated for each customer feature using the splitting parameters used in constructing the actual forest (such as the Gini index of the CART algorithm or the information gain of the ID3 and C4.5 algorithms); see the description of the flow of fig. 3. Then, the customer features are sorted by importance, and a subset of, for example, the Top N customer features (i.e., the customer features ranked in the top N by importance) is screened out as the customer risk indicators 1, 2, …, N. The other, unselected customer features may be added to a candidate set for manual screening by experts, if necessary. For example, in an "anti-money laundering" scenario, the selected customer features may include transfer frequency, transaction count and transaction amount, while the unselected customer features may include customer age, gender and transaction notes.
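An end-to-end sketch of steps 410-420 can be written with scikit-learn, whose `feature_importances_` attribute is the impurity-based (Gini) importance averaged over the trees, matching the calculation described above. The feature names and the synthetic data below are purely illustrative; real customer data would replace them.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_names = ["transfer_frequency", "transaction_count",
                 "transaction_amount", "customer_age", "address_code"]
X = rng.normal(size=(500, len(feature_names)))
# Synthetic risk label, driven mainly by the first three features.
y = (X[:, 0] + 0.8 * X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)

# Step 420: construct the random forest, then rank features by importance.
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranked = sorted(zip(feature_names, forest.feature_importances_),
                key=lambda kv: kv[1], reverse=True)
top_n = [name for name, _ in ranked[:3]]
print(top_n)  # the three signal-bearing features should dominate
```

The remaining two features (ranked[3:]) would go to the candidate set for optional expert review, as described above.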
In step 430, based on the screened customer risk indicators 1, 2, …, N, the corresponding risk indicator scores are calculated by the customer risk level model, such as score S1 of risk indicator 1, score S2 of risk indicator 2, …, and score SN of risk indicator N.
Then, in step 440, all calculated risk indicator scores of the customer are weighted and summed to obtain the customer's total risk level score. The weight of each risk indicator score in the total is determined by the importance of the screened customer feature associated with that score: the higher the importance of the customer feature, the more heavily the corresponding risk indicator score is weighted. For example, assuming that, in order of importance from high to low, the customer feature "transfer frequency" ranks first, "transaction amount" second and "transaction count" last, then when calculating the total score the weight of the risk indicator score based on "transfer frequency" is largest (e.g. 0.6), that based on "transaction amount" is next (e.g. 0.3), and that based on "transaction count" is smallest (e.g. 0.1), so that the total score = score("transfer frequency") × 0.6 + score("transaction amount") × 0.3 + score("transaction count") × 0.1. The above weight ratios are for illustration only; in practice, the relationship between importance and weight may be set according to various algorithms, which are not described in detail here.
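Step 440 can be sketched as follows, using the illustrative scores and the 0.6/0.3/0.1 weights from the example above; the score values themselves are hypothetical.

```python
def total_risk_score(scores, weights):
    """Weighted sum of risk indicator scores (step 440).
    `scores` and `weights` are dicts keyed by risk indicator name;
    the weights are assumed to be normalized to sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(scores[k] * weights[k] for k in scores)

# Hypothetical indicator scores from step 430 and importance-derived weights.
scores = {"transfer_frequency": 400, "transaction_amount": 300,
          "transaction_count": 200}
weights = {"transfer_frequency": 0.6, "transaction_amount": 0.3,
           "transaction_count": 0.1}
print(total_risk_score(scores, weights))  # 240 + 90 + 20 = 350
```

In practice the weights could be obtained directly by normalizing the screened features' importances I_f so that they sum to 1, which is one simple instance of the importance-to-weight mapping mentioned above.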
In another embodiment, after calculating the customer's total risk score, the customer may be classified into a corresponding risk level according to, for example, a customer risk level classification table. For example, customers with a total score of 0-200 may be rated as low risk, customers with a total score of 200-500 as medium risk, and customers with a total score exceeding 500 as high risk. The grading scale can be adjusted according to actual needs.
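The grading step can be sketched with the example cut-offs from the text. Since the text does not specify which band a boundary score belongs to, this sketch assigns boundary values to the lower band; the thresholds are parameters and can be adjusted as the text notes.

```python
def risk_level(total_score, low_cap=200, medium_cap=500):
    """Map a total risk score to a level using illustrative cut-offs:
    scores up to low_cap are low risk, up to medium_cap medium, above high.
    Boundary values fall into the lower band by assumption."""
    if total_score <= low_cap:
        return "low"
    if total_score <= medium_cap:
        return "medium"
    return "high"

print(risk_level(150), risk_level(350), risk_level(800))
# -> low medium high
```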
In other embodiments, after the random forest-based screening of a set of highly important customer features in step 420, a manual screening step may be added, in which the automatically screened set of customer features is provided to an expert in the anti-money laundering field, for example, as a recommended risk indicator. The expert then reviews whether to use the set of customer characteristics as a risk indicator, or, alternatively, further adds or deletes portions of the customer characteristics before providing them to step 430 for calculation of a risk indicator score.
With the understanding of the solution for assessing risk levels of customers based on customer risk indicators after random forest screening according to the present disclosure, a computer system for implementing the solution is described below.
In FIG. 5, a computer system 500 for assessing the risk level of a customer using customer risk indicators screened based on random forests according to one embodiment of the present disclosure is shown. As shown, the computer system 500 includes a customer characteristic data collection module 510, a random forest construction module 520, a customer characteristic screening module 530, a customer risk level scoring module 540 and a customer risk level total scoring module 550; an optional expert screening module 560 may be included between the customer characteristic screening module 530 and the customer risk level scoring module 540, but is not required.
Among the modules described above, customer characteristic data collection module 510 is used to collect customer characteristic data from various sources, such as customer account, transaction type, transaction amount, flow of funds, etc., as required for anti-money laundering risk assessment.
The random forest construction module 520 is configured to construct a random forest based on the collected customer feature data; the construction process may refer to the flow illustrated in fig. 2. The random forest construction module may also be included as a sub-module in the customer characteristic screening module 530 as part of its functionality.
While constructing the random forest or after completing the construction of the random forest, the customer feature screening module 530 screens out a subset of customer features having a desired importance from the collected customer feature data as a customer risk indicator for evaluating a customer risk level according to the method shown in fig. 3;
the client risk level scoring module 540 is used for calculating a corresponding risk index score for each screened client risk index through the client risk level model;
When the corresponding risk indicator scores for all the screened customer features have been calculated, the customer risk level total scoring module 550 performs a weighted summation of all the calculated customer risk indicator scores to obtain the customer's total risk level score, where the weight of each customer risk indicator score in the summation may be associated with the importance of the customer feature associated with it. The customer risk level total scoring module 550 may also classify customers into corresponding risk levels according to their total risk level scores based on a customer risk level classification table.
In another embodiment, the computer system 500 may further include an expert screening module 560. After the customer characteristic screening module 530 screens out the subset of customer characteristics with the required importance, the expert screening module 560 may further perform manual adjustment on the screened subset of customer characteristics through expert experience, such as deleting or adding some customer characteristics, and then provide the customer characteristics screened by the expert as a customer risk indicator to the customer risk level scoring module 540.
In conclusion, by using a random forest to calculate the corresponding importance of each of a large number of candidate customer features, appropriate features can be screened out quickly and efficiently as risk indicators, solving the problems that expert experience is difficult to apply to scenarios with a large number of customer features and that the importance of features cannot be quantified. Moreover, the importance of all customer features can be obtained by training the random forest model only once; compared with wrapper feature selection algorithms, the computational complexity is greatly reduced, the speed is greatly improved, and the manual participation of experts is reduced.
The foregoing description of specific embodiments of the present disclosure has been described. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous. Moreover, those skilled in the relevant art will recognize that the embodiments can be practiced with various modifications in form and detail without departing from the spirit and scope of the present disclosure, as defined by the appended claims. Thus, the breadth and scope of the present disclosure should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.