CN113781056A - Method and device for predicting user fraud behavior
- Publication number
- CN113781056A (application CN202111093212.5A)
- Authority
- CN
- China
- Prior art keywords
- sample data
- data
- fraud
- processing
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q20/00—Payment architectures, schemes or protocols
- G06Q20/38—Payment protocols; Details thereof
- G06Q20/40—Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
- G06Q20/401—Transaction verification
- G06Q20/4016—Transaction verification involving fraud or risk level assessment in transaction processing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/03—Credit; Loans; Processing thereof
Abstract
The invention discloses a method and a device for predicting user fraud behavior, relating to the technical field of artificial intelligence. The method comprises: acquiring basic information and current transaction data of a user to be predicted; and inputting the basic information and the current transaction data into a pre-established fraud behavior prediction model to obtain a prediction result of whether the user exhibits fraud behavior. The fraud behavior prediction model is obtained by fusing a LightGBM model and a CatBoost model, both of which are pre-trained on sample data consisting of basic information and historical transaction data of a plurality of users. The invention enables efficient and accurate prediction of user fraud behavior.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a method and a device for predicting user fraud.
Background
This section is intended to provide a background or context to the embodiments of the invention that are recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
With the vigorous development of the economy and the rising level of internationalization in China, the share of personal credit balances in China's financial industry has gradually increased. In today's era of informatization and intelligence, both traditional financial institutions and internet finance platforms have accumulated large amounts of user data, and the spread of online business means that financial institutions urgently need a reliable, intelligent and efficient risk control scheme. Existing user information and the user's transaction data on the platform can therefore be used to identify whether a transaction involves fraud, providing a basis for decisions that give clients a safe and reliable financial transaction environment. Existing schemes for predicting fraudulent behavior rely mainly on manual checks and suffer from low efficiency and low accuracy.
Disclosure of Invention
The embodiment of the invention provides a method for predicting user fraud, which is used for efficiently and accurately predicting the user fraud and comprises the following steps:
acquiring basic information and current transaction data of a user to be predicted;
inputting basic information of a user to be predicted and current transaction data into a pre-established fraud prediction model to obtain a prediction result of whether the user has fraud; the fraud behavior prediction model is obtained by fusing a LightGBM model and a Catboost model, and the LightGBM model and the Catboost model are generated by pre-training according to basic information of a plurality of users and sample data of historical transaction data.
The embodiment of the invention also provides a device for predicting user fraud behavior, which is used for efficiently and accurately predicting user fraud behavior and comprises:
an acquisition unit, configured to acquire basic information and current transaction data of a user to be predicted;
the prediction unit is used for inputting the basic information of the user to be predicted and the current transaction data into a pre-established fraud prediction model to obtain a prediction result of whether the user has fraud; the fraud behavior prediction model is obtained by fusing a LightGBM model and a Catboost model, and the LightGBM model and the Catboost model are generated by pre-training according to basic information of a plurality of users and sample data of historical transaction data.
The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method for predicting the user fraud is implemented.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program for executing the method for predicting a user fraud is stored in the computer-readable storage medium.
In the embodiment of the invention, the scheme for predicting the fraud behavior of the user comprises the following steps: acquiring basic information and current transaction data of a user to be predicted; inputting basic information of a user to be predicted and current transaction data into a pre-established fraud prediction model to obtain a prediction result of whether the user has fraud; the fraud behavior prediction model is obtained by fusing a LightGBM model and a Catboost model, and the LightGBM model and the Catboost model are generated by pre-training according to basic information of a plurality of users and sample data of historical transaction data, so that the fraud behaviors of the users can be efficiently and accurately predicted.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts. In the drawings:
FIG. 1 is a flow chart illustrating a method for predicting fraud by a user according to an embodiment of the present invention;
FIG. 2 is a schematic diagram illustrating the establishment of a fraud prediction model according to an embodiment of the present invention;
FIG. 3 is a schematic flow chart illustrating pre-establishing the fraud prediction model according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart illustrating pre-building the fraud prediction model according to another embodiment of the present invention;
FIG. 5 is a schematic flow chart illustrating the pre-processing of the acquired sample data according to an embodiment of the present invention;
FIG. 6 is a schematic flow chart illustrating feature engineering processing performed on preprocessed sample data according to an embodiment of the present disclosure;
FIG. 7 is a block diagram of an apparatus for predicting fraud by a user according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a setup unit according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a setup unit according to another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the embodiments of the present invention are further described in detail below with reference to the accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
Before describing embodiments of the present invention, terms related to the embodiments of the present invention will be described.
1. Machine learning: an interdisciplinary field drawing on probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It studies how a computer can simulate or realize human learning behavior in order to acquire new knowledge or skills and reorganize existing knowledge structures so as to continuously improve its own performance.
2. Training data: a data set used for machine learning; it is the input to machine learning.
3. Feature engineering: in essence, an engineering activity that aims to extract as many useful features as possible from raw data for use by algorithms and models. If a model learns from all attributes of the original data, it may fail to find the underlying trend of the data; when the data is first processed through feature engineering, the model is less disturbed by noise and can find the trend more easily.
4. CatBoost: a gradient boosting algorithm library that handles categorical features well.
5. LightGBM: Gradient Boosting Decision Tree (GBDT) is an iterative decision tree algorithm, and an open-source implementation is available in Python. LightGBM is an advanced GBDT implementation with better training efficiency.
6. AUC: Area Under Curve, defined as the area enclosed by the ROC curve and the coordinate axes. Its value does not exceed 1. It serves as a model evaluation metric: the larger the AUC, the better the classifier.
7. PCA dimensionality reduction: principal component analysis maps high-dimensional data to a low-dimensional space, thereby avoiding the curse of dimensionality.
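As an illustration of the AUC metric defined above, the following minimal sketch computes it directly from its probabilistic meaning: the chance that a randomly chosen positive (fraud) sample receives a higher score than a randomly chosen negative one. The function name `auc` and the toy labels and scores are assumptions made for this example and do not come from the patent.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive sample is scored
    higher than a randomly chosen negative one (ties count as one half).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy labels (1 = fraud) and classifier scores, invented for illustration.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(auc(y_true, y_score))
```

The pairwise form is quadratic in the number of samples, so it is suited only to small illustrations; a rank-based computation would normally be used on real data.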
The inventors have found that existing schemes that predict card-swiping fraud through manual checks have the following disadvantages:
1. Massive data and complicated information: a large amount of user transaction and operation data is generated on the platform every year. The content is voluminous and genuine and false records are hard to tell apart, so decisions made purely by manual work cannot be both simple and accurate.
2. Repeated checking is inefficient: the same user characteristics have to be checked repeatedly, and the result still depends on the subjective judgment of service personnel, so efficiency is low.
3. Lack of knowledge accumulation: the experience of service personnel cannot be captured and is difficult to pass on. The invention can automatically learn from the user's past transaction records on the platform and, combined with data mining algorithms, automatically give a probability prediction of whether fraud is present, which greatly reduces the difficulty of decision-making.
4. Low accuracy: existing machine learning models have low prediction accuracy and need to be improved by a new method.
In summary, existing schemes that predict card-swiping fraud through manual inspection suffer from low efficiency and low accuracy, so the security of card-swiping transactions cannot be ensured.
In view of the above technical problems, an embodiment of the present invention provides a scheme for predicting user fraud behavior. The scheme preprocesses massive user data, performs feature engineering (feature derivation and feature extraction), selects algorithm models and integrates them, and trains on this basis to mine from the data rules (policies) capable of effective classification, thereby generating a classifier (an aggregated rule set). For a newly input user, the classifier outputs the probability that the behavior is fraudulent, providing a basis for deciding whether to intervene in the transaction manually. The scheme is described in detail below.
FIG. 1 is a schematic flow chart of a method for predicting user fraud behavior in an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
step 101: acquiring basic information and current transaction data of a user to be predicted;
step 102: inputting basic information of a user to be predicted and current transaction data into a pre-established fraud prediction model to obtain a prediction result of whether the user has fraud; the fraud behavior prediction model is obtained by fusing a LightGBM model and a Catboost model, and the LightGBM model and the Catboost model are generated by pre-training according to basic information of a plurality of users and sample data of historical transaction data.
The method for predicting the user fraud provided by the embodiment of the invention can realize the purpose of efficiently and accurately predicting the user fraud. This is described in detail below with reference to fig. 2.
First, the step of establishing the fraud prediction model in advance is introduced.
In one embodiment, as shown in FIG. 3, the method for predicting user fraud behavior may further include pre-establishing the fraud prediction model as follows:
step 301: acquiring basic information and historical transaction data of a plurality of users as sample data; the sample data comprises positive samples where fraudulent activity exists and negative samples where fraudulent activity does not exist;
step 304: dividing the sample data into a training set, a test set and a verification set;
step 305: training the LightGBM model and the Catboost model by using the training set to obtain a trained LightGBM model and a Catboost model;
step 306: performing weighted fusion processing on the trained LightGBM model and the Catboost model to obtain a fused fraud behavior prediction model;
step 307: testing the fused fraud prediction model by using the test set to obtain a tested fraud prediction model;
step 308: and verifying the tested fraud prediction model by using the verification set to obtain a pre-established fraud prediction model.
In specific implementation, pre-establishing the fraud prediction model in this way improves the efficiency and accuracy of model building, and in turn the accuracy and efficiency of prediction with the established model. The steps are described in detail below.
1. First, step 301 is introduced: basic information (i.e. basic feature data such as a user's education level, occupation, customer home region, marital status and customer grade) and historical transaction data (e.g. a user's card-swiping transaction data or loan transaction data within a certain historical period) of a plurality of users (a massive number of users) are obtained as sample data. That is, user information and transaction data are first taken from the financial institution's platform, and each record is labeled as overdue or not. A positive sample may be a sample with fraudulent behavior, such as a sample with fraudulent card swiping or a sample judged fraudulent because it is overdue; a negative sample may be a sample without fraudulent behavior.
The data structure on which the embodiment of the present invention is based is shown in Table 1 below; it is labeled data used for training and related steps:
TABLE 1
In one embodiment, as shown in fig. 4, the method for predicting fraud of a user may further include:
step 302: preprocessing the acquired sample data to obtain preprocessed sample data;
step 303: performing feature engineering processing on the preprocessed sample data to obtain the sample data after the feature engineering processing;
Dividing the sample data into a training set, a test set and a verification set may include: dividing the sample data after the feature engineering processing into a training set, a test set and a verification set.
In specific implementation, the acquired sample data is preprocessed and then subjected to feature engineering before the subsequent training, testing and verification that yield the final fraud prediction model. This further improves the prediction accuracy of the model and hence of user fraud prediction. Steps 302 and 303 are described in detail below.
2. Next, step 302 is introduced, i.e. preprocessing such as acquiring and cleaning the data of a large number of historical customers to ensure the accuracy and diversity of the training data.
In an embodiment, as shown in fig. 5, preprocessing the acquired sample data to obtain preprocessed sample data may include:
step 3021: carrying out missing value processing on the user characteristic data of a preset category in the sample data to obtain characteristic data after the missing value processing;
step 3022: abnormal value processing is carried out on the characteristic data with the abnormal degree larger than the preset abnormal value in the characteristic data after the missing value processing, so that the characteristic data after the abnormal value processing is obtained;
step 3023: and performing time stamp processing on the feature data after the abnormal value processing to obtain preprocessed sample data.
In specific implementation, the data preprocessing process mentioned in FIG. 2 is as follows:
(1) Missing value processing: missing values of categorical features (user feature data of preset categories) such as education level, occupation, customer home region, marital status and customer grade are filled with "-1" so that they form a distinct category. Other missing values are left untreated and are subsequently handled by the model.
(2) Abnormal value processing: obviously abnormal feature values (feature data whose degree of abnormality exceeds a preset threshold), such as an age greater than 99, are filled with the mean. Values in a feature column that appear only once across the training set and test set are also filled with the mean.
(3) Time stamp processing: the time stamp is converted into a time type, and time features such as year, month, day, hour, minute, second, day of the week and week of the year are extracted.
In specific implementation, preprocessing the acquired sample data in this way improves the quality of the model input data, which in turn improves the accuracy of model building and of model prediction, i.e. the accuracy of user fraud prediction.
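The three preprocessing rules above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the field names (`education`, `occupation`, `home_region`, `marital_status`, `customer_grade`, `age`, `timestamp`), the timestamp format, and the choice to compute the mean age over plausible values only are all hypothetical, since the source does not specify them. The rule for values that appear only once is omitted for brevity.

```python
from datetime import datetime

# Hypothetical names for the categorical fields listed in the text.
CATEGORY_FIELDS = ("education", "occupation", "home_region",
                   "marital_status", "customer_grade")
MAX_PLAUSIBLE_AGE = 99  # ages above this are treated as abnormal


def preprocess(records):
    """Return cleaned copies of the records; the input is left unmodified."""
    # Mean age over plausible values only (an assumption of this sketch).
    valid_ages = [r["age"] for r in records
                  if r.get("age") is not None and r["age"] <= MAX_PLAUSIBLE_AGE]
    mean_age = sum(valid_ages) / len(valid_ages) if valid_ages else None
    cleaned = []
    for record in records:
        row = dict(record)
        # (1) Missing categorical values become the sentinel category "-1".
        for field in CATEGORY_FIELDS:
            if row.get(field) is None:
                row[field] = "-1"
        # (2) Obviously abnormal ages are replaced by the mean.
        if row.get("age") is not None and row["age"] > MAX_PLAUSIBLE_AGE:
            row["age"] = mean_age
        # (3) The timestamp string is parsed and expanded into time features.
        ts = datetime.strptime(row.pop("timestamp"), "%Y-%m-%d %H:%M:%S")
        row.update(year=ts.year, month=ts.month, day=ts.day,
                   hour=ts.hour, minute=ts.minute, second=ts.second,
                   weekday=ts.weekday(),               # Monday is 0
                   week_of_year=ts.isocalendar()[1])   # ISO week number
        cleaned.append(row)
    return cleaned


# Toy records invented for illustration.
sample = [
    {"age": 30, "education": None, "timestamp": "2021-09-17 08:30:15"},
    {"age": 150, "education": "bachelor", "timestamp": "2021-09-18 00:00:00"},
    {"age": 40, "education": "master", "timestamp": "2021-09-19 12:00:00"},
]
for row in preprocess(sample):
    print(row)
```

Non-categorical missing values (for example a missing age) are deliberately left as they are, matching the statement that they are handled later by the model.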
3. Next, the above-mentioned step 303, i.e. feature derivation, feature screening process and parameter optimization process, is introduced.
In an embodiment, as shown in fig. 6, performing feature engineering on the preprocessed sample data to obtain the sample data after the feature engineering processing may include:
step 3031: multiplying the feature data whose correlation coefficient is negative in the preprocessed sample data by a preset negative number to obtain sign-adjusted sample data;
step 3032: scaling the sign-adjusted sample data to a preset range interval to obtain scaled sample data;
step 3033: performing feature derivation processing on the scaled sample data to obtain the sample data after the feature derivation processing;
step 3034: performing PCA dimension reduction processing on the sample data after the feature derivation processing to obtain the sample data after the feature engineering processing.
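The first two of these steps (sign adjustment and scaling) can be sketched as below. The sketch assumes that the "correlation coefficient" is the Pearson correlation between a feature column and the fraud label, and that scaling is a min-max mapping onto the preset interval; the source states neither explicitly, and the function names are hypothetical.

```python
def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no correlation
    return cov / (var_x * var_y) ** 0.5


def flip_negative(column, labels, factor=-1.0):
    """Multiply a negatively correlated feature by a preset negative number."""
    if pearson(column, labels) < 0:
        return [factor * v for v in column]
    return list(column)


def scale(column, low=-1.0, high=1.0):
    """Min-max scale a column onto the interval from low to high."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [(low + high) / 2 for _ in column]
    return [low + (v - lo) * (high - low) / (hi - lo) for v in column]


# Toy feature that decreases as the label increases (invented data).
feature = [10.0, 8.0, 4.0, 2.0]
labels = [0, 0, 1, 1]
adjusted = flip_negative(feature, labels)
print(scale(adjusted))
```

After the sign adjustment every feature is non-negatively correlated with the label, which keeps the direction of all features consistent before scaling.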
In specific implementation, the feature engineering process mentioned in FIG. 2 is as follows:
(1) Features whose correlation coefficient is negative are each multiplied by a preset negative number, for example -1.
(2) The features are scaled to the (-1, 1) interval (the preset range interval) to prepare data for a neural network model (a fraud behavior prediction model). Feature scaling is necessary because of the large amount of computation required in deep learning. Feature scaling normalizes the range of the independent variables.
(3) Feature derivation: there are 200 original features, and 6 new features are derived from each of them:
attr_n: the original feature scaled to (0, 1);
concat_count_attr_n: the number of times each value of attr_n occurs in the corresponding column;
concat_count_round4_attr_n: the value of attr_n transformed as (attr_n * 10^4 * 2 + 1) // 2 / 10^4;
concat_count_round3_attr_n: the value of attr_n transformed as (attr_n * 10^3 * 2 + 1) // 2 / 10^3;
concat_count_round2_attr_n: the value of attr_n transformed as (attr_n * 10^2 * 2 + 1) // 2 / 10^2;
attr_num: the feature sequence of attr_n.
(4) PCA dimension reduction: the feature derivation above produces 6 new features per original feature, and since there are already many original features, the feature dimension grows considerably. The dimension is therefore reduced afterwards to avoid the curse of dimensionality, keeping only the features with larger influence.
In specific implementation, this feature engineering processing further improves the efficiency and accuracy of model building, and in turn the accuracy and efficiency of subsequent model prediction.
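The derived features listed under (3) can be sketched for a single column as follows. The transform `(x * 10^d * 2 + 1) // 2 / 10^d` rounds a non-negative value half-up to `d` decimal places. The sketch assumes the input column has already been scaled to (0, 1), follows the text in storing the transformed value itself for the `round` features, and leaves out `attr_num` because the source does not define it precisely. The function names are hypothetical.

```python
from collections import Counter


def round_half_up(value, digits):
    """The transform from the text: (value * 10^digits * 2 + 1) // 2 / 10^digits."""
    factor = 10 ** digits
    return (value * factor * 2 + 1) // 2 / factor


def derive(column):
    """Build the count and rounding features for one already-scaled column."""
    counts = Counter(column)
    derived = {
        "attr_n": list(column),
        # How often each value occurs in its own column.
        "concat_count_attr_n": [counts[v] for v in column],
    }
    for digits in (4, 3, 2):
        derived[f"concat_count_round{digits}_attr_n"] = [
            round_half_up(v, digits) for v in column
        ]
    return derived


# Toy column invented for illustration.
for name, values in derive([0.126, 0.126, 0.5]).items():
    print(name, values)
```

Rounding at several precisions groups nearby values together, so that near-duplicates which differ only in low-order digits become visible to the model.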
4. Next, the above-mentioned step 304 is introduced.
In specific implementation, 80% of the data may be assigned to the training data (training set) and 20% to the test data (test set). Part of the data may also be set aside as verification data (verification set) as necessary.
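A minimal sketch of such an 80/20 split is shown below. Shuffling before the split and the fixed seed are assumptions of this example (the source only gives the proportions), and `split_samples` is a hypothetical name.

```python
import random


def split_samples(samples, train_fraction=0.8, seed=0):
    """Shuffle the samples reproducibly and split them into train and test."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


train_set, test_set = split_samples(range(10))
print(len(train_set), len(test_set))
```

A verification set can be carved out in the same way by calling the function again on the training portion.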
5. Next, the above-described step 305 is introduced.
In specific implementation, the model training mentioned in FIG. 2 may use LightGBM and CatBoost to perform 5-fold cross-validation training on the data, stopping training when the validation-set error has not decreased for 50 iterations; that is, the LightGBM model and the CatBoost model are trained with the training set to obtain the trained LightGBM model and CatBoost model. In other words, the LightGBM (LGB) model and the CatBoost model are selected, the features produced by feature engineering are fed into the models, and the models are tuned and trained to obtain two locally optimal models.
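The two control mechanisms named here, 5-fold cross-validation and stopping after 50 iterations without improvement, can be sketched independently of any boosting library. The helper names `k_fold_indices` and `best_iteration` are hypothetical, and the list of validation errors stands in for the per-iteration errors that a real LightGBM or CatBoost training run would report.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def best_iteration(validation_errors, patience=50):
    """Return the index of the best iteration seen before early stopping."""
    best_error = float("inf")
    best_index = -1
    for i, error in enumerate(validation_errors):
        if error < best_error:
            best_error, best_index = error, i
        elif i - best_index >= patience:
            break  # no improvement for `patience` consecutive iterations
    return best_index


for train_idx, valid_idx in k_fold_indices(10, k=5):
    print(valid_idx)

# Invented error curve; a small patience makes the stop visible.
print(best_iteration([1.0, 0.8, 0.9, 0.95, 0.7], patience=2))
```

With a patience of 2 the loop stops before it reaches the later, lower error, which is the usual trade-off of early stopping: a larger patience such as 50 tolerates longer plateaus at the cost of more training iterations.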
6. Next, step 306 is described above.
In specific implementation, the model fusion mentioned in FIG. 2 may be as follows: weighted fusion processing is performed on the trained LightGBM model and CatBoost model (the models chosen in the algorithm selection step of FIG. 2, i.e. selection of a classification algorithm and parameter optimization) to obtain the fused fraud behavior prediction model. That is, the two locally optimal models are combined with weights to obtain a new model.
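One common way to realize such a weighted combination is to take a weighted average of the fraud probabilities produced by the two models. The sketch below assumes this form of fusion and an equal default weight; the source does not state the weights or the exact fusion formula, and `fuse` is a hypothetical name.

```python
def fuse(lightgbm_probs, catboost_probs, lightgbm_weight=0.5):
    """Weighted average of two models' predicted fraud probabilities."""
    if not 0.0 <= lightgbm_weight <= 1.0:
        raise ValueError("lightgbm_weight must lie between 0 and 1")
    if len(lightgbm_probs) != len(catboost_probs):
        raise ValueError("both models must score the same samples")
    catboost_weight = 1.0 - lightgbm_weight
    return [lightgbm_weight * a + catboost_weight * b
            for a, b in zip(lightgbm_probs, catboost_probs)]


# Invented probabilities standing in for real model outputs.
print(fuse([0.25, 0.75], [0.75, 0.25], lightgbm_weight=0.5))
```

Because the weights sum to 1, the fused value remains a valid probability; the weight itself could be tuned on held-out data.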
7. Next, the above step 307 and step 308, i.e. the model prediction and verification mentioned in FIG. 2, are introduced together.
In specific implementation, the test set is used to test the fused fraud prediction model to obtain a tested fraud prediction model; model verification is then carried out, i.e. the verification set is used to verify the tested fraud prediction model to obtain the pre-established fraud prediction model.
In specific implementation, the new model is evaluated on the verification set; when the AUC index no longer improves, a good model is considered to have been obtained, and the fraud behavior prediction model at that point can be used to predict new data.
Next, for ease of understanding, the above steps 101 and 102 are introduced together: after the fraud prediction model has been established, the model is used to perform the actual prediction.
In specific implementation, the basic information and current transaction data of the user to be predicted are acquired and input into the fraud behavior prediction model pre-established as described above, to obtain a prediction result of whether the user exhibits fraud behavior.
In addition, embodiments of the present invention involve the following tools: the Python programming language, LightGBM and CatBoost. The feature engineering processing in the embodiment of the invention realizes the derivation and processing of longitudinal features.
In summary, the method for predicting user fraud behavior provided by the embodiment of the invention can help a financial platform obtain a client's loan overdue probability from the client's basic information and transaction data, give business personnel a basis for deciding whether to lend, reduce the workload of the financial institution's business personnel, improve efficiency, and raise the institution's level of intelligent risk control.
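Steps 101 and 102 can be pictured with the sketch below, in which two stand-in scoring functions play the role of the trained LightGBM and CatBoost models. Everything here is hypothetical: the function `predict_fraud`, the feature names, the equal weighting and the 0.5 decision threshold are assumptions for illustration, and a real deployment would call the trained models instead of the lambdas.

```python
from typing import Callable, Dict, Tuple

Features = Dict[str, float]
Scorer = Callable[[Features], float]  # returns a fraud probability


def predict_fraud(basic_info: Features, transaction: Features,
                  lightgbm_scorer: Scorer, catboost_scorer: Scorer,
                  lightgbm_weight: float = 0.5,
                  threshold: float = 0.5) -> Tuple[float, bool]:
    """Merge the user's inputs, score them with both models, and fuse."""
    # Step 101: the acquired basic information and current transaction data.
    features = {**basic_info, **transaction}
    # Step 102: weighted fusion of the two models' probabilities.
    probability = (lightgbm_weight * lightgbm_scorer(features)
                   + (1.0 - lightgbm_weight) * catboost_scorer(features))
    return probability, probability >= threshold


# Stand-in scorers with fixed outputs, used only to make the sketch runnable.
probability, is_fraud = predict_fraud(
    {"age": 30.0}, {"amount": 9999.0},
    lightgbm_scorer=lambda f: 0.9,
    catboost_scorer=lambda f: 0.7,
)
print(probability, is_fraud)
```

The returned probability can be passed to business personnel as the basis for deciding whether to intervene, while the boolean is a convenience for automated flagging.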
The embodiment of the invention also provides a device for predicting user fraud, which is described in the following embodiment. Because the principle by which the device solves the problem is similar to that of the method for predicting user fraud, the implementation of the device can refer to the implementation of the method, and repeated details are omitted.
Fig. 7 is a schematic structural diagram of an apparatus for predicting fraud of a user in an embodiment of the present invention. As shown in fig. 7, the apparatus includes:
the acquisition unit 01 is used for acquiring basic information of a user to be predicted and current transaction data;
the prediction unit 02 is used for inputting the basic information of the user to be predicted and the current transaction data into a pre-established fraud prediction model to obtain the prediction result of whether the user has fraud; the fraud behavior prediction model is obtained by fusing a LightGBM model and a CatBoost model, and the LightGBM model and the CatBoost model are generated by pre-training according to sample data comprising basic information and historical transaction data of a plurality of users.
In an embodiment, the above apparatus for predicting fraud of a user may further include an establishing unit, configured to establish the fraud prediction model in advance according to the following method; as shown in fig. 8, the establishing unit includes:
a sample obtaining module 031, configured to obtain basic information and historical transaction data of a plurality of users as sample data; the sample data comprises positive samples where fraudulent activity exists and negative samples where fraudulent activity does not exist;
a sample dividing module 034, configured to divide the sample data into a training set, a test set, and a verification set;
the training module 035 is configured to train the LightGBM model and the CatBoost model by using the training set to obtain a trained LightGBM model and a trained CatBoost model;
the fusion processing module 036 is configured to perform weighted fusion processing on the trained LightGBM model and the CatBoost model to obtain a fused fraud behavior prediction model;
the testing module 037 is configured to test the merged fraud prediction model by using the test set to obtain a tested fraud prediction model;
the verification module 038 is configured to verify the tested fraud prediction model by using the verification set, so as to obtain a pre-established fraud prediction model.
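The sample dividing and weighted fusion steps handled by the modules above can be sketched as follows, purely for illustration. The split ratios, the random seed, the grid of candidate weights and the use of AUC as the selection criterion are all assumptions; the embodiment does not specify how the fusion weights are chosen. The lists `p_lgbm` and `p_cat` stand for the probabilities output by the two trained models on a held-out set.

```python
import random


def split_samples(samples, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle and divide samples into training, test and verification sets."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * ratios[0])
    n_test = round(len(shuffled) * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_test],
            shuffled[n_train + n_test:])


def auc(labels, scores):
    """Pairwise AUC; ties between a positive and a negative count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def pick_fusion_weight(labels, p_lgbm, p_cat, steps=10):
    """Grid-search the LightGBM weight w in [0, 1] that maximises the AUC of
    w * p_lgbm + (1 - w) * p_cat on held-out data (assumed criterion)."""
    best_w, best_auc = 0.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        fused = [w * a + (1.0 - w) * b for a, b in zip(p_lgbm, p_cat)]
        score = auc(labels, fused)
        if score > best_auc:
            best_w, best_auc = w, score
    return best_w, best_auc
```

A stratified split that preserves the ratio of positive to negative samples in each set may be preferable when fraudulent samples are rare; the plain shuffle above is kept only for brevity.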
In one embodiment, as shown in fig. 9, the establishing unit further includes:
a preprocessing module 032, configured to preprocess the acquired sample data to obtain preprocessed sample data;
a feature engineering processing module 033, configured to perform feature engineering processing on the preprocessed sample data to obtain the sample data after the feature engineering processing;
the sample dividing module is specifically configured to: divide the sample data after the characteristic engineering processing into a training set, a test set and a verification set.
In one embodiment, the preprocessing module is specifically configured to:
carrying out missing value processing on the user characteristic data of a preset category in the sample data to obtain characteristic data after the missing value processing;
abnormal value processing is carried out on the characteristic data with the abnormal degree larger than the preset abnormal value in the characteristic data after the missing value processing, so that the characteristic data after the abnormal value processing is obtained;
and performing time stamp processing on the feature data after the abnormal value processing to obtain preprocessed sample data.
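The three preprocessing steps above can be illustrated with the following sketch. The concrete choices (median filling, a z-score as the measure of abnormal degree, clipping as the abnormal value treatment, and the hour and weekday features extracted from a time stamp) are assumptions made for the example; the embodiment leaves these details open.

```python
from datetime import datetime
from statistics import mean, median, pstdev


def fill_missing(values):
    """Missing value processing: replace None with the median of observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def clip_outliers(values, max_z=3.0):
    """Abnormal value processing: values whose z-score magnitude exceeds the
    preset value max_z are clipped to the corresponding boundary."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return list(values)
    lo, hi = mu - max_z * sigma, mu + max_z * sigma
    return [min(max(v, lo), hi) for v in values]


def timestamp_features(ts, fmt="%Y-%m-%d %H:%M:%S"):
    """Time stamp processing: turn a transaction time string into numeric features."""
    dt = datetime.strptime(ts, fmt)
    return {"hour": dt.hour, "weekday": dt.weekday(), "is_weekend": dt.weekday() >= 5}
```

For example, a column of transaction amounts could be passed through `fill_missing` and then `clip_outliers`, while each transaction time is expanded with `timestamp_features` before feature engineering.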
In one embodiment, the feature engineering processing module is specifically configured to:
multiplying the characteristic data with negative correlation coefficient in the preprocessed sample data by a preset negative number to obtain the sample data after the second processing;
scaling the sample data after the second processing to a preset range interval to obtain scaled sample data;
performing characteristic derivation processing on the scaled sample data to obtain the sample data after the characteristic derivation processing;
and carrying out PCA dimension reduction processing on the sample data after the characteristic derivation processing to obtain the sample data after the characteristic engineering processing.
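The first three feature engineering steps can be sketched as below, under stated assumptions: the correlation coefficient is taken to be the Pearson correlation between a feature column and the fraud label, the preset negative number defaults to -1, the preset range is [0, 1], and a simple ratio stands in for the characteristic derivation. None of these specifics is given in the embodiment. The PCA step is omitted from the sketch because it would normally be delegated to a numerical library.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def flip_negative(column, labels, factor=-1.0):
    """Multiply a feature column by a preset negative number when it is
    negatively correlated with the label (assumed reference variable)."""
    if pearson(column, labels) < 0:
        return [factor * v for v in column]
    return list(column)


def min_max_scale(column, lo=0.0, hi=1.0):
    """Scale a column to the preset range [lo, hi]."""
    cmin, cmax = min(column), max(column)
    if cmax == cmin:
        return [lo for _ in column]
    return [lo + (v - cmin) * (hi - lo) / (cmax - cmin) for v in column]


def derive_ratio(numer, denom, eps=1e-9):
    """Hypothetical derived feature: element-wise ratio of two columns."""
    return [a / (b + eps) for a, b in zip(numer, denom)]
```

Applied in order, `flip_negative`, `min_max_scale` and `derive_ratio` yield columns that could then be passed to a PCA routine to obtain the sample data after the characteristic engineering processing.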
The embodiment of the present invention further provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the method for predicting the user fraud is implemented.
An embodiment of the present invention further provides a computer-readable storage medium, where a computer program for executing the method for predicting a user fraud is stored in the computer-readable storage medium.
In the embodiment of the invention, the scheme for predicting the fraud behavior of the user comprises the following steps: acquiring basic information and current transaction data of a user to be predicted; inputting the basic information and the current transaction data of the user to be predicted into a pre-established fraud prediction model to obtain a prediction result of whether the user has fraud. The fraud behavior prediction model is obtained by fusing a LightGBM model and a CatBoost model, and the LightGBM model and the CatBoost model are generated by pre-training according to sample data comprising basic information and historical transaction data of a plurality of users, so that the fraud behaviors of users can be predicted efficiently and accurately.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (12)
1. A method of predicting fraud by a user, comprising:
acquiring basic information and current transaction data of a user to be predicted;
inputting basic information of a user to be predicted and current transaction data into a pre-established fraud prediction model to obtain a prediction result of whether the user has fraud; the fraud behavior prediction model is obtained by fusing a LightGBM model and a CatBoost model, and the LightGBM model and the CatBoost model are generated by pre-training according to sample data comprising basic information and historical transaction data of a plurality of users.
2. A method of predicting fraud by a user according to claim 1, further comprising pre-establishing the fraud prediction model as follows:
acquiring basic information and historical transaction data of a plurality of users as sample data; the sample data comprises positive samples where fraudulent activity exists and negative samples where fraudulent activity does not exist;
dividing the sample data into a training set, a test set and a verification set;
training the LightGBM model and the CatBoost model by using the training set to obtain a trained LightGBM model and a trained CatBoost model;
performing weighted fusion processing on the trained LightGBM model and the trained CatBoost model to obtain a fused fraud behavior prediction model;
testing the fused fraud prediction model by using the test set to obtain a tested fraud prediction model;
and verifying the tested fraud prediction model by using the verification set to obtain a pre-established fraud prediction model.
3. A method of predicting fraud by a user according to claim 2, further comprising:
preprocessing the acquired sample data to obtain preprocessed sample data;
performing characteristic engineering processing on the preprocessed sample data to obtain the sample data after the characteristic engineering processing;
wherein dividing the sample data into a training set, a test set and a verification set comprises: dividing the sample data after the characteristic engineering processing into a training set, a test set and a verification set.
4. The method of predicting fraud by a user according to claim 3, wherein the preprocessing the acquired sample data to obtain preprocessed sample data comprises:
carrying out missing value processing on the user characteristic data of a preset category in the sample data to obtain characteristic data after the missing value processing;
abnormal value processing is carried out on the characteristic data with the abnormal degree larger than the preset abnormal value in the characteristic data after the missing value processing, so that the characteristic data after the abnormal value processing is obtained;
and performing time stamp processing on the feature data after the abnormal value processing to obtain preprocessed sample data.
5. The method of claim 3, wherein the performing feature engineering on the preprocessed sample data to obtain feature engineered sample data comprises:
multiplying the characteristic data with negative correlation coefficient in the preprocessed sample data by a preset negative number to obtain the sample data after the second processing;
scaling the sample data after the second processing to a preset range interval to obtain scaled sample data;
performing characteristic derivation processing on the scaled sample data to obtain the sample data after the characteristic derivation processing;
and carrying out PCA dimension reduction processing on the sample data after the characteristic derivation processing to obtain the sample data after the characteristic engineering processing.
6. An apparatus for predicting fraud by a user, comprising:
the acquisition unit is used for acquiring basic information and current transaction data of a user to be predicted;
the prediction unit is used for inputting the basic information of the user to be predicted and the current transaction data into a pre-established fraud prediction model to obtain a prediction result of whether the user has fraud; the fraud behavior prediction model is obtained by fusing a LightGBM model and a CatBoost model, and the LightGBM model and the CatBoost model are generated by pre-training according to sample data comprising basic information and historical transaction data of a plurality of users.
7. An apparatus for predicting fraud by a user according to claim 6, further comprising an establishing unit for establishing said fraud prediction model in advance; the establishing unit includes:
the sample acquisition module is used for acquiring basic information and historical transaction data of a plurality of users as sample data; the sample data comprises positive samples where fraudulent activity exists and negative samples where fraudulent activity does not exist;
the sample dividing module is used for dividing the sample data into a training set, a test set and a verification set;
the training module is used for training the LightGBM model and the CatBoost model by utilizing the training set to obtain a trained LightGBM model and a trained CatBoost model;
the fusion processing module is used for carrying out weighted fusion processing on the trained LightGBM model and the trained CatBoost model to obtain a fused fraud behavior prediction model;
the testing module is used for testing the fused fraud prediction model by utilizing the testing set to obtain a tested fraud prediction model;
and the verification module is used for verifying the tested fraud prediction model by utilizing the verification set to obtain a pre-established fraud prediction model.
8. An apparatus for predicting fraud by a user according to claim 7, wherein said establishing unit further comprises:
the preprocessing module is used for preprocessing the acquired sample data to obtain preprocessed sample data;
the characteristic engineering processing module is used for carrying out characteristic engineering processing on the preprocessed sample data to obtain the sample data after the characteristic engineering processing;
the sample dividing module is specifically configured to: divide the sample data after the characteristic engineering processing into a training set, a test set and a verification set.
9. The apparatus for predicting fraud by a user of claim 8, wherein the preprocessing module is specifically configured to:
carrying out missing value processing on the user characteristic data of a preset category in the sample data to obtain characteristic data after the missing value processing;
abnormal value processing is carried out on the characteristic data with the abnormal degree larger than the preset abnormal value in the characteristic data after the missing value processing, so that the characteristic data after the abnormal value processing is obtained;
and performing time stamp processing on the feature data after the abnormal value processing to obtain preprocessed sample data.
10. The apparatus for predicting fraud by a user of claim 8, wherein the feature engineering processing module is specifically configured to:
multiplying the characteristic data with negative correlation coefficient in the preprocessed sample data by a preset negative number to obtain the sample data after the second processing;
scaling the sample data after the second processing to a preset range interval to obtain scaled sample data;
performing characteristic derivation processing on the scaled sample data to obtain the sample data after the characteristic derivation processing;
and carrying out PCA dimension reduction processing on the sample data after the characteristic derivation processing to obtain the sample data after the characteristic engineering processing.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any one of claims 1 to 5 when executing the computer program.
12. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program for executing the method of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111093212.5A CN113781056A (en) | 2021-09-17 | 2021-09-17 | Method and device for predicting user fraud behavior |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113781056A (en) | 2021-12-10 |
Family
ID=78852129
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111093212.5A Pending CN113781056A (en) | 2021-09-17 | 2021-09-17 | Method and device for predicting user fraud behavior |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113781056A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110706090A (en) * | 2019-08-26 | 2020-01-17 | 阿里巴巴集团控股有限公司 | Credit fraud identification method and device, electronic equipment and storage medium |
CN114358922A (en) * | 2022-01-10 | 2022-04-15 | 中国银行股份有限公司 | Credit card fraud prediction method and device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109255506A (en) * | 2018-11-22 | 2019-01-22 | 重庆邮电大学 | A kind of internet finance user's overdue loan prediction technique based on big data |
CN109447149A (en) * | 2018-10-25 | 2019-03-08 | 腾讯科技(深圳)有限公司 | A kind of training method of detection model, device and terminal device |
CN111105241A (en) * | 2019-12-20 | 2020-05-05 | 浙江工商大学 | Identification method for anti-fraud of credit card transaction |
CN111311401A (en) * | 2020-03-30 | 2020-06-19 | 百维金科(上海)信息科技有限公司 | Financial default probability prediction model based on LightGBM |
CN112308459A (en) * | 2020-11-23 | 2021-02-02 | 国网北京市电力公司 | Power grid household transformation relation identification method and identification device, and electronic equipment |
CN112581265A (en) * | 2020-12-23 | 2021-03-30 | 百维金科(上海)信息科技有限公司 | Internet financial client application fraud detection method based on AdaBoost |
Non-Patent Citations (1)
Title |
---|
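Ma Xiaojun; Song Yanqi; Chang Baishu; Yuan Mingyi; Su Heng: "Application research on a P2P default prediction model based on the CatBoost algorithm" (基于CatBoost算法的P2P违约预测模型应用研究), 统计与信息论坛, no. 07, 10 July 2020 (2020-07-10) * |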
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI789345B (en) | Modeling method and device for machine learning model | |
CN110998608B (en) | Machine learning system for various computer applications | |
CN110751557B (en) | Abnormal fund transaction behavior analysis method and system based on sequence model | |
CN112819604A (en) | Personal credit evaluation method and system based on fusion neural network feature mining | |
CN110738564A (en) | Post-loan risk assessment method and device and storage medium | |
CN112258312B (en) | Personal credit scoring method and system, electronic equipment and storage medium | |
US20070124236A1 (en) | Credit risk profiling method and system | |
CN113011895A (en) | Associated account sample screening method, device and equipment and computer storage medium | |
CN107392217B (en) | Computer-implemented information processing method and device | |
CN111199469A (en) | User payment model generation method and device and electronic equipment | |
CN113781056A (en) | Method and device for predicting user fraud behavior | |
CN111210332A (en) | Method and device for generating post-loan management strategy and electronic equipment | |
CN116503158A (en) | Enterprise bankruptcy risk early warning method, system and device based on data driving | |
Sujatha et al. | Loan prediction using machine learning and its deployement on web application | |
CN112836750A (en) | System resource allocation method, device and equipment | |
CN109242165A (en) | A kind of model training and prediction technique and device based on model training | |
CN112801784A (en) | Bit currency address mining method and device for digital currency exchange | |
CN114581249B (en) | Financial product recommendation method and system based on investment risk bearing capacity assessment | |
CN118115281B (en) | Risk level assessment method for banking financial clients | |
CN116562901B (en) | Automatic generation method of anti-fraud rule based on machine learning | |
CN113706258B (en) | Product recommendation method, device, equipment and storage medium based on combined model | |
CN110458684A (en) | A kind of anti-fraud detection method of finance based on two-way shot and long term Memory Neural Networks | |
CN116611911A (en) | Credit risk prediction method and device based on support vector machine | |
CN117455681A (en) | Service risk prediction method and device | |
CN114861680B (en) | Dialogue processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||