
AU2019101160A4 - Application of decision tree and random forest in cash loan - Google Patents


Info

Publication number
AU2019101160A4
AU2019101160A4
Authority
AU
Australia
Prior art keywords
loan
departments
cash
decision tree
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
AU2019101160A
Inventor
Jingxuan Chen
Jingya Li
Yani Li
Xuanning Liu
Zihao Mei
Yupu Zheng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chen Jingxuan Miss
Li Jingya Miss
Li Yani Miss
Zheng Yupu Miss
Original Assignee
Chen Jingxuan Miss
Li Jingya Miss
Li Yani Miss
Zheng Yupu Miss
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chen Jingxuan Miss, Li Jingya Miss, Li Yani Miss, Zheng Yupu Miss filed Critical Chen Jingxuan Miss
Priority to AU2019101160A priority Critical patent/AU2019101160A4/en
Application granted granted Critical
Publication of AU2019101160A4 publication Critical patent/AU2019101160A4/en
Ceased legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03Credit; Loans; Processing thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/01Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • General Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Computational Linguistics (AREA)
  • Strategic Management (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Artificial Intelligence (AREA)
  • Technology Law (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

"Cash loan" is a kind of consumer loan business for applicants. It has a convenient and flexible loan and repayment process, as well as real-time approval and fast acquisition. Platform participants in cash loans can be divided into private departments, banking departments, listing departments, state-owned departments and venture capital departments. Nowadays, more and more people choose the "cash loan" business to purchase things they cannot yet afford. But this raises a question: how do we recognize an individual's credit? To determine whether an individual is qualified for the loan, we implement the following steps: data preprocessing (normalization, handling missing data, labeling), training and optimizing a decision tree, training and optimizing a random forest, then testing and summarizing (Figure 1).

Description

Title
Application of decision tree and random forest in cash loan
Field of the Invention
This invention relates to the field of cash loan systems and determines whether a borrower is qualified for a loan.
Background of the Invention
"Cash loan" means a small cash loan business, which is a kind of consumer loan business for applicants. It has a convenient and flexible loan and repayment process, as well as real-time approval and fast acquisition. Platform participants in cash loans can be divided into private departments, banking departments, listing departments, state-owned departments and venture capital departments.
The Advantage of Cash Loan
The biggest advantage of cash loans is that the verification process is more lenient than at a bank and that the target population is wider. The users of cash loans are generally low-income or unstable young groups, often dismissed by banks when they encounter emergencies. Cash loans do not require a mortgage and can often help borrowers get through short-term difficulties. In addition, the approval speed of the smart automatic approval system is relatively fast and brings a lot of convenience. Borrowers can use their own credit as a guarantee, with no need to pledge collateral. Cash loans can be borrowed repeatedly within the credit line, and can be borrowed at any time. And the credit rank of users with good credit will continue to increase.
The Disadvantage of Cash Loan
From a lender's perspective, cash loans can provide exorbitant profit, but there are also several disadvantages. First of all, the way a cash loan makes profit leads to high risk. Unlike credit card companies, which charge sellers to make a profit, cash loan lenders gain most of their profit from those who cannot pay their debt in a short time and keep paying high interest. Meanwhile, there is not enough constraint to make sure cash loan borrowers pay their debt. Usually they use their personal ID and information as the mortgage instead of real valuable goods. In some situations, borrowers are either too poor to pay the debt or no longer care about their reputation.
2019101160 30 Sep 2019
SUMMARY
To determine whether an individual is qualified for the loan, we implement the following steps to solve the problem. Firstly, we preprocess the data by normalizing it and dealing with the missing data. The process of normalization puts the data in the same order of magnitude so that all features have the same extent of impact on the results, avoiding data of large orders of magnitude having too much impact on the outcome and thus making it inaccurate. As for the missing data, we use the mean value to replace it. Secondly, we label the specific column '3d' as the classification attribute. Thirdly, we use the features and attribute of the training data to establish the decision tree classification. Then, we construct the random forest classification to improve the classification. Lastly, we optimize the main parameters to fit the model and make the outcome more accurate.
DESCRIPTION OF DRAWING
Figure 1 is the Flow Chart
DESCRIPTION OF PREFERRED EMBODIMENT
1.1 The Algorithm of Cash Loan Model
- Decision tree
Decision tree is a data mining tool which builds tree-like classification or regression models to handle categorical or numerical data. It starts with a root node and breaks down into two or more branches, which lead to decision nodes. Each decision node has its own branches, and the tree keeps growing until it reaches a classification or decision. The final nodes are called leaf nodes. Thanks to this tree-shaped structure, the results of a decision tree can be easily understood.
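The node-and-branch structure described above can be sketched in code. The following is an illustrative Python sketch; the Node class and the hand-built tree are hypothetical, not part of the patent's implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Internal decision node: tests one feature against a threshold.
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    # Leaf node: holds the final class label.
    label: Optional[int] = None

def predict(node: Node, x: list) -> int:
    """Walk from the root to a leaf, following one branch per decision node."""
    while node.label is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.label

# A tiny hand-built tree: split on feature 0 at 0.5, then on feature 1 at 2.0.
tree = Node(feature=0, threshold=0.5,
            left=Node(label=0),
            right=Node(feature=1, threshold=2.0,
                       left=Node(label=0), right=Node(label=1)))

print(predict(tree, [0.9, 3.0]))  # → 1 (right branch, then right branch)
```

Because every prediction is just a path of threshold tests from root to leaf, the decision can be read off directly, which is what makes tree results easy to understand.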
Table 1 is the Parameter Description

Parameter      Description
n_estimators   the number of decision trees
bootstrap      whether the sample set is sampled with replacement to build each tree
oob_score      whether to use out-of-bag samples to evaluate the quality of the model
max_depth      the maximum depth of each decision tree
criterion      the partition criterion of a node
A decision tree is built top-down from the root node, which contains all the data, to leaf nodes, which contain instances with similar values (also known as homogeneous). The homogeneity of a sample is calculated using entropy, which is defined as:

H(T) = -Σ_{i=1}^{J} p_i log2(p_i)

where J is the number of classes and p_i is the proportion of each class. An entropy of zero means the sample is completely homogeneous, and an entropy of one implies the sample is equally divided. When building a decision tree, the entropy of the parent node is first calculated. Then the dataset is split on the different attributes, and the new entropy of the child nodes is calculated as the weighted sum of the entropy of each branch:
H(T|a) = -Σ_i Pr(i|a) log2(Pr(i|a))
To determine which attribute to use to split the dataset, the entropy after splitting is subtracted from the entropy before the split; the result is called the information gain:

IG(T, a) = H(T) - H(T|a)
The attribute that has the largest information gain is used as the decision node to divide the dataset by its branches. This process is repeated until a branch has zero entropy, at which point that branch becomes a leaf node.
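The entropy and information-gain calculations above can be sketched as follows; this is an illustrative Python sketch, and the function names are ours, not the patent's:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(T) = -sum_i p_i log2(p_i) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy before the split minus the weighted entropy of each branch."""
    n = len(labels)
    after = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - after

# Splitting an equally divided sample into two pure branches
# recovers the full entropy as information gain.
labels = [1, 1, 0, 0]
print(entropy(labels))                             # → 1.0 (equally divided)
print(information_gain(labels, [[1, 1], [0, 0]]))  # → 1.0 (pure branches)
```

A split is chosen by evaluating this gain for every candidate attribute and keeping the largest, exactly as the text describes.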
- Random Forest
Random forest is a relatively new machine learning model. The classic machine learning model is the neural network, with a history of more than half a century. Neural network predictions are accurate, but they are computationally intensive. In the 1980s, Breiman et al. invented the classification tree algorithm.
It uses the bootstrap resampling technique to randomly extract k samples from the original training sample set N to generate new training sample sets, and then grows a classification tree on each bootstrap sample set. The k classification trees constitute a random forest, and the classification result for new data is determined by the votes of the classification trees. Its essence is an improvement of the decision tree algorithm, which combines multiple decision trees.
Each classification tree in the random forest is a binary tree, and its generation follows the principle of top-down recursive splitting; that is, the training set is divided from the root node downward. In the binary tree, the root node contains all the training data. According to the principle of minimum node impurity, the root node is split into left and right nodes, which respectively contain subsets of the training data, and the nodes continue to split according to the same rule until the branch stop rule is satisfied and the growth stops. If the classification data on node n all come from the same category, then the node's impurity I(n) = 0.
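The bootstrap-and-vote procedure can be sketched as follows. This is an illustrative Python sketch in which depth-1 trees (decision stumps) stand in for full classification trees; all names and the toy data are hypothetical:

```python
import random
from collections import Counter

def fit_stump(X, y):
    """A depth-1 'tree': choose the split (feature, threshold) with fewest errors."""
    best, best_err = None, len(y) + 1
    for f in range(len(X[0])):
        for t in {row[f] for row in X}:
            left  = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] >  t]
            lp = Counter(left).most_common(1)[0][0] if left else 0
            rp = Counter(right).most_common(1)[0][0] if right else 0
            err = sum(lp != yi for yi in left) + sum(rp != yi for yi in right)
            if err < best_err:
                best, best_err = (f, t, lp, rp), err
    return best

def stump_predict(stump, row):
    f, t, lp, rp = stump
    return lp if row[f] <= t else rp

def random_forest(X, y, k=15, seed=0):
    """Draw k bootstrap samples (with replacement), fit one tree on each."""
    rng = random.Random(seed)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return forest

def forest_predict(forest, row):
    """Classify new data by majority vote over the individual trees."""
    votes = Counter(stump_predict(s, row) for s in forest)
    return votes.most_common(1)[0][0]

# Toy data: low feature values labeled 0, high values labeled 1.
X = [[1.0], [2.0], [3.0], [8.0], [9.0], [10.0]]
y = [0, 0, 0, 1, 1, 1]
forest = random_forest(X, y)
print(forest_predict(forest, [1.0]), forest_predict(forest, [10.0]))
```

Each tree sees a slightly different bootstrap sample, and the vote aggregates them, which is the essential improvement over a single decision tree that the text describes.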
1.2 Procedure
- Step 1: Data Acquisition
- Step 2: Data Preprocessing
A data set is usually arranged in rows and columns. Each row can be interpreted as a single client of the cash loaning enterprise. The columns represent different features of each client. Among all the columns there is a '3d' column that indicates the true label for each client.
The first step of data preprocessing is dealing with missing data. Every data set might contain missing values. Since they do not carry any helpful information themselves, missing values would disturb further classification unless we handle them appropriately. When missing values are scarce, we may manually delete the "bad" rows and classify the rest. However, this scenario rarely happens in reality, where missing values are usually widespread or hard to recognize in a big data set. Removing rows with missing values arbitrarily would eventually result in a huge loss of information. Hence a method that not only identifies missing values but can somehow fix them is preferable. Still, to what extent the missing values can be fixed depends on the values that already exist. We define a data preprocessing function which calculates the mean value of each column. Each missing value is then replaced by the corresponding mean. Since mean values are calculated from all existing data in a feature column, they are fairly representative and can fill missing values convincingly.
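The mean-value imputation described here can be sketched as follows; this is an illustrative Python sketch assuming missing values are represented as None and every column has at least one observed value:

```python
def impute_means(rows):
    """Replace each missing value (None) with the mean of its column,
    computed from the values that do exist in that column."""
    n_cols = len(rows[0])
    means = []
    for c in range(n_cols):
        present = [r[c] for r in rows if r[c] is not None]
        means.append(sum(present) / len(present))
    return [[means[c] if r[c] is None else r[c] for c in range(n_cols)]
            for r in rows]

# Each None is filled with its column mean computed from the existing entries.
data = [[1.0, 10.0],
        [None, 20.0],
        [3.0, None]]
print(impute_means(data))  # → [[1.0, 10.0], [2.0, 20.0], [3.0, 15.0]]
```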
The following is the Confusion Matrix

                        Actual Positive    Actual Negative
Predicted Positive      True Positive      False Positive
Predicted Negative      False Negative     True Negative
Next comes data normalization. After we generate a relatively comprehensive data set with our mean value function, one notices that the data varies significantly among different feature columns. Consequently, classification becomes painstaking once the range of values gets too large. The logic follows that the importance of each feature is not determined by the size of the values it contains. For example, the feature "age" might only vary between 20 and 80, whereas the feature "credit" might have values many times larger than those under "age". However, we know that the feature "credit" is not necessarily that much more important than "age". Indeed, different features are not always comparable to each other. What we are really concerned with is, given a particular feature (e.g. credit), which person performs relatively better than the others. In order to eliminate potential misinterpretation by our model, we achieve data normalization by rescaling the values under our feature columns into a range of 0 to 1. The idea is to transform the data into ratios that measure the relative differences between individuals. At the same time, normalization dramatically reduces the scale of calculation and thus boosts data processing with only a few lines of code.
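The rescaling into the range 0 to 1 described above is commonly done with min-max normalization, which can be sketched as follows; this is an illustrative Python sketch, and the toy "age" and "credit" values are hypothetical:

```python
def min_max_scale(rows):
    """Rescale every feature column into [0, 1]:
    x' = (x - min) / (max - min), so only relative differences remain."""
    n_cols = len(rows[0])
    lo = [min(r[c] for r in rows) for c in range(n_cols)]
    hi = [max(r[c] for r in rows) for c in range(n_cols)]
    return [[(r[c] - lo[c]) / (hi[c] - lo[c]) if hi[c] > lo[c] else 0.0
             for c in range(n_cols)]
            for r in rows]

# "age" in [20, 80] and "credit" in the thousands end up on the same scale,
# so neither feature dominates purely because of its magnitude.
rows = [[20, 1000], [50, 4000], [80, 7000]]
print(min_max_scale(rows))  # → [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
```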
- Step 3: Training and Optimization
- Decision tree
Now that we have preprocessed our data, the decision tree model is ready for training. The '3d' column, which holds the true labels, will be compared with the labels predicted by our model. Therefore, the data is first divided into the variable train_y, which contains the '3d' column, and train_x, which contains the remainder. We feed train_x to the model, which will later provide us with the corresponding predicted labels.
- Random forest
We have already predicted the labels using our decision tree model, but the outcome is not accurate enough, so we establish the random forest classification to solve this problem. The variable train_y contains the label '3d' and train_x contains the remainder. Then, we repeat the step in part 4.2 and compare the true labels in column '3d' with the labels we predicted using this classification.
- Optimization
The random forest classification already fits our data set much better. However, we can still do some optimization by changing the main parameters of the random forest classification, including n_estimators, that is, how many decision trees we establish in the random forest, and max_depth, the maximum depth of each decision tree.
In the process of optimization, we first consider n_estimators in our model. According to some relevant materials, we set the range of n_estimators as 100 to 200. The output of our code indicates that when n_estimators is 150, the indexes we utilize to measure the classification attain their best condition.
Moving on, we add another parameter, max_depth, to the code, with its range set as 6 to 10. With two loops in our code, we can acquire the best result considering the two parameters. The final output is presented in the following picture.
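The two-loop parameter search can be sketched as follows. This is an illustrative Python sketch: the evaluate function is a hypothetical stand-in for the real scoring on the loan data, shaped so that its optimum matches the n_estimators = 150 reported in the text (the max_depth optimum of 8 is assumed purely for illustration):

```python
# Hypothetical stand-in score; in the patent's procedure this would be the
# classifier's measured performance on the loan data set.
def evaluate(n_estimators, max_depth):
    # Peaks at n_estimators = 150 (as the text reports) and max_depth = 8.
    return 1.0 - abs(n_estimators - 150) / 1000 - abs(max_depth - 8) / 100

best_score, best_params = -1.0, None
for n_estimators in range(100, 201, 10):   # first loop: number of trees
    for max_depth in range(6, 11):         # second loop: maximum tree depth
        score = evaluate(n_estimators, max_depth)
        if score > best_score:
            best_score, best_params = score, (n_estimators, max_depth)

print(best_params)  # → (150, 8)
```

The two nested loops simply enumerate every combination of the two parameter ranges and keep the best-scoring pair.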
- Step 4: Testing
In the process of testing, we utilize the preprocessed testing data to examine the performance of our classifications. The principles of testing the decision tree classification and the random forest classification are quite similar. Therefore we only list the procedure for testing the decision tree classification; testing the random forest classification proceeds in a similar fashion.
We have labeled the features and attributes of the test data in the preprocessing stage. The features are regarded as test_x and will be applied to our decision tree classification and random forest classification. The model generates a vector called y_pre which indicates the result of our prediction. By comparing test_y with y_pre, we can measure the behavior of the model. The measurement standards we utilize include ROC_AUC, F1 score, precision and recall, with ROC_AUC being the major standard we consider. ROC is the abbreviation for receiver operating characteristic curve, while AUC represents the area under the ROC curve. The ROC curve plots the true positive rate against the false positive rate, both of which have a maximum value of 1. It behaves as a convex curve above the diagonal of the coordinates. The closer the AUC is to 1, the more accurate our model is. The ideas behind the other standards are explained as follows. With the testing, it is obvious that the random forest classification behaves much better than the decision tree classification. So we choose the random forest to establish our model. With optimization, the classification fits our data increasingly well.
The following is the F1 Score

F1 = 2 * (precision * recall) / (precision + recall)
The following shows the Recall and Precision

recall = true positives / (true positives + false negatives)
precision = true positives / (true positives + false positives)
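The F1 score, recall and precision above can be computed directly from the four confusion-matrix counts, as in the following illustrative Python sketch with hypothetical counts:

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall and F1 from the four confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical counts: 80 true positives, 20 false positives,
# 10 false negatives, 90 true negatives.
p, r, f1 = metrics(tp=80, fp=20, fn=10, tn=90)
print(p, r, f1)  # → 0.8, 0.888..., 0.842...
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high, which is why it complements ROC_AUC as a summary measure.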

Claims (1)

  1. CLAIM
    1. An application of decision tree and random forest in cash loan, wherein decision tree is a data mining tool which builds tree-like classification or regression models to handle categorical or numerical data; it starts with a root node and breaks down into two or more branches, which are called decision nodes; each decision node has its own branches and the tree keeps growing until it reaches a classification or decision; the final nodes are called leaf nodes.
AU2019101160A 2019-09-30 2019-09-30 Application of decision tree and random forest in cash loan Ceased AU2019101160A4 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
AU2019101160A AU2019101160A4 (en) 2019-09-30 2019-09-30 Application of decision tree and random forest in cash loan

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
AU2019101160A AU2019101160A4 (en) 2019-09-30 2019-09-30 Application of decision tree and random forest in cash loan

Publications (1)

Publication Number Publication Date
AU2019101160A4 true AU2019101160A4 (en) 2019-10-31

Family

ID=68341990

Family Applications (1)

Application Number Title Priority Date Filing Date
AU2019101160A Ceased AU2019101160A4 (en) 2019-09-30 2019-09-30 Application of decision tree and random forest in cash loan

Country Status (1)

Country Link
AU (1) AU2019101160A4 (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111062422A (en) * 2019-11-29 2020-04-24 上海观安信息技术股份有限公司 Method and device for systematic identification of road loan
CN111062422B (en) * 2019-11-29 2023-07-14 上海观安信息技术股份有限公司 Method and device for identifying set-way loan system
CN112767177A (en) * 2020-12-30 2021-05-07 中国人寿保险股份有限公司上海数据中心 Insurance customer information management system for customer grading based on random forest
CN114444738A (en) * 2022-04-08 2022-05-06 国网浙江省电力有限公司物资分公司 Electrical equipment maintenance cycle generation method

Similar Documents

Publication Publication Date Title
AU2019101160A4 (en) Application of decision tree and random forest in cash loan
Chen et al. Research on credit card default prediction based on k-means SMOTE and BP neural network
Eddy et al. Credit scoring models: Techniques and issues
Gramespacher et al. Employing explainable AI to optimize the return target function of a loan portfolio
Wu et al. Adaptive Feature Interaction Model for Credit Risk Prediction in the Digital Finance Landscape
CN117474669A (en) Loan overdue prediction method, device, equipment and storage medium
Wagner The use of credit scoring in the mortgage industry
KR102519878B1 (en) Apparatus, method and recording medium storing commands for providing artificial-intelligence-based risk management solution in credit exposure business of financial institution
IMBALANCE Ensemble Adaboost in classification and regression trees to overcome class imbalance in credit status of bank customers
Archana et al. Comparison of Various Machine Learning Algorithms and Deep Learning Algorithms for Prediction of Loan Eligibility
Reza et al. A machine learning approach to identify customer attrition for a long time business planning
Vahid et al. Modeling corporate customers’ credit risk considering the ensemble approaches in multiclass classification: evidence from Iranian corporate credits
Krishnaraj et al. Comparing Machine Learning Techniques for Loan Approval Prediction
Khedr et al. A New Prediction Approach for Preventing Default Customers from Applying Personal Loans Using Machine Learning
Preetham et al. A Stacked Model for Approving Bank Loans
Yazdani Developing a model for validation and prediction of bank customer credit using information technology (case study of Dey Bank)
Aribowo et al. Feasibility study for banking loan using association rule mining classifier
Liu et al. A comparison of machine learning algorithms for prediction of past due service in commercial credit
KR102576143B1 (en) Method for performing continual learning on credit scoring without reject inference and recording medium recording computer readable program for executing the method
Pertiwi et al. Combination of Stacking with Genetic Algorithm Feature Selection to Improve Default Prediction in P2P Lending
Wan Research on predicting credit card customers’ service using logistic regression and Bp neural network
Parvin et al. A machine learning-based credit lending eligibility prediction and suitable bank recommendation: an Android app for entrepreneurs
Liu et al. Research on Investment Model of Internet Financial Loan Platform
Sulejmani Using deep learning and explainable AI to predict and explain loan defaults
Şakar Variable Importance Analysis in Default Prediction using Machine Learning Techniques

Legal Events

Date Code Title Description
FGI Letters patent sealed or granted (innovation patent)
MK22 Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry