CN116579842B - Credit data analysis method and system based on user behavior data

Info

Publication number: CN116579842B
Application number: CN202310854274.6A
Authority: CN
Inventors: 刘晓光; 王潇霏; 王刚; 陈静怡; 王文蕊; 赵思浓
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2023-07-13
Filing date: 2023-07-13
Publication date: 2023-10-03
Anticipated expiration: 2043-07-13
Also published as: CN116579842A

Abstract

The invention relates to the technical field of data processing and discloses a credit data analysis method and system based on user behavior data, which improve the accuracy of credit data analysis. The method comprises the following steps: collecting a plurality of user behavior data and performing label matching to determine label data; integrating the plurality of user behavior data and the label data to obtain a user data set; preprocessing the user data set to obtain a data set to be analyzed; performing first feature extraction on the data set to be analyzed through a filter feature extraction algorithm to obtain a first candidate feature set; performing second feature extraction on the first candidate feature set through a wrapper feature extraction algorithm to obtain a second candidate feature set; performing data drift detection and feature screening on the second candidate feature set to obtain a target feature set; and performing credit data analysis on the target feature set to obtain a credit data analysis result, and transmitting the credit data analysis result to a preset data processing terminal.

Description

Credit data analysis method and system based on user behavior data

Technical Field

The present invention relates to the field of data processing technology, and in particular to a credit data analysis method and system based on user behavior data.

Background Art

In recent years, the rapid development of Internet finance has made user credit data analysis increasingly important. Rating a user's credit on basic information alone makes it difficult to judge the user's default risk effectively. In addition, data on financial attributes that are strongly related to the user is difficult and costly to obtain, so the effective data available is very limited, which makes it hard to build a high-accuracy credit data analysis system. With the rapid development of Internet finance, the number of data dimensions has also grown explosively, so the data is high-dimensional and sparse. Furthermore, in risk control modeling, cleaning and processing structured data is laborious, data transformation produces sparse matrices that lose too much information, and feature extraction is difficult. At the same time, such high-dimensional data is beyond the range that traditional scorecard models can handle.

Machine learning models, however, have clear advantages in modeling data with these characteristics. On the one hand, they can help screen out irrelevant and redundant features that degrade the modeling result. Feature selection effectively reduces the dimensionality of the data, lowers the computational complexity of the model, and improves its speed and accuracy. On the other hand, machine learning models can discover rules and patterns in high-dimensional sparse data and have strong generalization ability. Machine learning modeling can therefore effectively improve prediction and classification performance while preventing overfitting.

At the same time, data drift has made it very difficult to put machine learning models into production in recent years. Data drift means that the distribution of the data gradually changes over time or space, so that the data to be predicted or validated is distributed noticeably differently from the data used for training, which significantly reduces the predictive performance of the system model. As a result, the accuracy of credit data analysis based on user behavior data is low.

Summary of the Invention

In view of this, an embodiment of the present invention provides a credit data analysis method and system based on user behavior data, which solves the technical problem of low accuracy when performing credit data analysis based on user behavior data.

The present invention provides a credit data analysis method based on user behavior data, comprising: collecting multiple user behavior data, and performing label matching on the multiple user behavior data to determine the label data corresponding to each user behavior data; integrating the multiple user behavior data and the label data corresponding to each user behavior data to obtain a user data set; preprocessing the user data set to obtain a data set to be analyzed; performing first feature extraction on the data set to be analyzed through a filter feature extraction algorithm to obtain a first candidate feature set; performing second feature extraction on the first candidate feature set through a wrapper feature extraction algorithm to obtain a second candidate feature set; performing data drift detection and feature screening on the second candidate feature set to obtain a target feature set; and performing credit data analysis on the target feature set through a preset target credit data analysis model to obtain a credit data analysis result, and transmitting the credit data analysis result to a preset data processing terminal.

In the present invention, the step of collecting multiple user behavior data, performing label matching on the multiple user behavior data, and determining the label data corresponding to each user behavior data includes: collecting multiple user behavior data and extracting time data from each user behavior data to determine the time data corresponding to each user behavior data; and, based on the time data corresponding to each user behavior data, performing label matching on the multiple user behavior data to determine the label data corresponding to each user behavior data.

In the present invention, the step of preprocessing the user data set to obtain the data set to be analyzed includes: performing outlier analysis on the user data set to determine target outliers, and performing missing value analysis on the user data set using the target outliers to determine target missing values; and, based on the target missing values, performing data filling on the user data set to obtain the data set to be analyzed.

In the present invention, the step of performing first feature extraction on the data set to be analyzed through a filter feature extraction algorithm to obtain a first candidate feature set includes: removing redundant features from the data set to be analyzed through the filter feature extraction algorithm to obtain a feature set to be processed; performing feature correlation analysis on the feature set to be processed to obtain a feature correlation analysis result; and performing feature extraction on the feature set to be processed using the feature correlation analysis result to obtain the first candidate feature set.

In the present invention, the step of performing second feature extraction on the first candidate feature set through a wrapper feature extraction algorithm to obtain a second candidate feature set includes: performing importance analysis on each first candidate feature in the first candidate feature set through the wrapper feature extraction algorithm to determine the importance data of each first candidate feature; and performing second feature extraction on the first candidate feature set based on the importance data of each first candidate feature to obtain the second candidate feature set.

In the present invention, the step of performing data drift detection and feature screening on the second candidate feature set to obtain a target feature set includes: performing data drift detection on the second candidate feature set through a preset adversarial classifier to generate a data drift detection result; and performing feature screening on the second candidate feature set using the data drift detection result to obtain the target feature set.

In the present invention, after the step of performing data drift detection and feature screening on the second candidate feature set to obtain the target feature set, and before the step of performing credit data analysis on the target feature set through the preset target credit data analysis model and transmitting the credit data analysis result to the preset data processing terminal, the method further includes: performing initial hyperparameter analysis on an initial credit data analysis model to determine an initial hyperparameter combination; performing prior probability distribution analysis on the initial hyperparameter combination to determine prior probability distribution data; performing model training on the initial credit data analysis model using the second candidate feature set to generate a training set and a test set; performing posterior probability distribution analysis on the initial hyperparameter combination using the training set and the test set to determine posterior probability distribution data; performing iterative analysis on the initial hyperparameter combination based on the posterior probability distribution data to determine an optimal hyperparameter combination; and configuring the initial credit data analysis model with the optimal hyperparameter combination to obtain the target credit data analysis model.

The present invention also provides a credit data analysis system based on user behavior data, comprising:

a data acquisition module, configured to collect multiple user behavior data and perform label matching on the multiple user behavior data to determine the label data corresponding to each user behavior data;

a data integration module, configured to integrate the multiple user behavior data and the label data corresponding to each user behavior data to obtain a user data set;

a data processing module, configured to preprocess the user data set to obtain a data set to be analyzed;

a first extraction module, configured to perform first feature extraction on the data set to be analyzed through a filter feature extraction algorithm to obtain a first candidate feature set;

a second extraction module, configured to perform second feature extraction on the first candidate feature set through a wrapper feature extraction algorithm to obtain a second candidate feature set;

a feature screening module, configured to perform data drift detection and feature screening on the second candidate feature set to obtain a target feature set;

a credit analysis module, configured to perform credit data analysis on the target feature set through a preset target credit data analysis model to obtain a credit data analysis result, and to transmit the credit data analysis result to a preset data processing terminal.

In the technical solution provided by the present invention, multiple user behavior data are collected and label matching is performed to determine the corresponding label data; the multiple user behavior data and the label data are integrated to obtain a user data set; the user data set is preprocessed to obtain a data set to be analyzed; first feature extraction is performed on the data set to be analyzed through a filter feature extraction algorithm to obtain a first candidate feature set; second feature extraction is performed on the first candidate feature set through a wrapper feature extraction algorithm to obtain a second candidate feature set; data drift detection and feature screening are performed on the second candidate feature set to obtain a target feature set; and credit data analysis is performed on the target feature set to obtain a credit data analysis result, which is transmitted to a preset data processing terminal. The embodiment of the present invention focuses on how user behavior data affects the user's credit status. Without needing costly and hard-to-obtain data on financial attributes strongly related to the user, it establishes a credit data analysis system based on user behavior data with high accuracy.
On the one hand, in the embodiment of the present invention, categorical features are converted directly into numerical features, so operations such as one-hot encoding are not needed and the data dimensionality does not increase, which is fast and efficient. On the other hand, by using an unbiased estimate of the gradient, the present invention reduces the impact of estimation bias compared with traditional gradient estimation methods and solves the problems of gradient bias and prediction shift, thereby effectively improving the generalization ability of the system model. The present invention can therefore predict a user's credit status with a fast training speed, more accurate predictions, and better generalization performance, further improving the accuracy of credit data analysis based on user behavior data.

Brief Description of the Drawings

To illustrate the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flow chart of a credit data analysis method based on user behavior data in an embodiment of the present invention.

FIG. 2 is a decile distribution diagram of redundant features in an embodiment of the present invention.

FIG. 3 is a decile distribution diagram of non-redundant features in an embodiment of the present invention.

FIG. 4 is a flow chart of performing second feature extraction on the first candidate feature set through the wrapper feature extraction algorithm in an embodiment of the present invention.

FIG. 5 is an importance distribution diagram of the features remaining after filter feature selection in an embodiment of the present invention.

FIG. 6 is a schematic diagram of a credit data analysis system based on user behavior data in an embodiment of the present invention.

Reference numerals:

3001, data acquisition module; 3002, data integration module; 3003, data processing module; 3004, first extraction module; 3005, second extraction module; 3006, feature screening module; 3007, credit analysis module; 3008, parameter analysis module; 3009, distribution analysis module; 3010, model training module; 3011, probability analysis module; 3012, iterative analysis module; 3013, parameter configuration module.

Detailed Description

The technical solution of the present invention is described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained from them by a person of ordinary skill in the art without creative effort fall within the scope of protection of the present invention.

In the description of the present invention, the terms "first", "second" and "third" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. In addition, the technical features of the different embodiments described below can be combined with each other as long as they do not conflict.

Referring to FIG. 1, which is a flow chart of a credit data analysis method based on user behavior data according to an embodiment of the present invention, the method includes the following steps:

S101. Collect multiple user behavior data, and perform label matching on the multiple user behavior data to determine the label data corresponding to each user behavior data.

It should be noted that the user behavior data includes information on the Apps used by the user, information on the device used by the user, and the user's recent location movement information. Label matching is then performed on the multiple user behavior data to determine the label data corresponding to each user behavior data. In this embodiment, the user behavior data mainly includes the list of Apps used by the user during the collection period and the category to which each App belongs; the Apps are subsequently reclassified with reference to App classification information and related business partnerships. Before classification, the collected App data must be cleaned, because it contains garbled characters and its authenticity needs to be verified. To handle the garbled characters, App data whose name is longer than 20 characters is deleted first; the App names are then matched against commonly used Chinese characters, and only App data with a high matching degree is retained.
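The garbled-name cleanup described above can be sketched as follows. This is only an illustration under stated assumptions: the 20-character limit comes from the text, while the function names, the 0.5 matching ratio, and the decision to also accept ASCII letters and digits are hypothetical choices, since the text does not define how the "matching degree" is computed.

```python
MAX_NAME_LENGTH = 20      # from the text: names longer than 20 characters are deleted
MIN_MATCH_RATIO = 0.5     # assumed value; the text only says "high matching degree"


def is_valid_char(ch: str) -> bool:
    """Accept CJK Unified Ideographs, plus ASCII letters and digits (an assumption)."""
    return "\u4e00" <= ch <= "\u9fff" or (ch.isascii() and ch.isalnum())


def match_ratio(name: str) -> float:
    """Share of characters in the name that look like valid name characters."""
    if not name:
        return 0.0
    return sum(is_valid_char(ch) for ch in name) / len(name)


def clean_app_names(names):
    """Drop over-long names, then keep names whose match ratio is high enough."""
    kept = []
    for name in names:
        if len(name) > MAX_NAME_LENGTH:
            continue  # treated as garbled because of its length
        if match_ratio(name) >= MIN_MATCH_RATIO:
            kept.append(name)
    return kept
```

A real pipeline would likely match against a curated list of common characters rather than a whole Unicode range; the range check is used here only to keep the sketch self-contained.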

Table 1 - Specific meanings of some features in each category

Secondly, real App data is downloaded from Qimai Data, a professional mobile promotion data analysis platform in China, to verify the authenticity of the collected App data. Finally, the usage frequency of each App across all current users is counted. Of the 319,071 Apps, almost 83% were used only once, so only the 50,000 most frequently used Apps are retained. The device information includes the device's launch price, launch year, number of years since launch, most recent active time, and the number of days the device was active during the collection period. The user's location movement information during the collection period is resolved coarsely to the province, district, and county where the user is located, and precisely to the latitude and longitude of the user's position. The three parts of behavior data actually collected are then processed according to defined processing logic. Data collection and processing consumed a great deal of time and manpower, but the resulting data has significant advantages in credibility, validity, and generality. A true and accurate data source is key to modeling and is the basis for the long-term applicability of the system model.
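The frequency cut-off described above can be illustrated with a short sketch. It assumes the usage log is available as a flat sequence of App names, one entry per recorded use; the function name and that input shape are hypothetical.

```python
from collections import Counter


def top_apps(usage_records, top_n=50_000):
    """Return the top_n most frequently used App names, most frequent first.

    usage_records: iterable of App names, one entry per recorded use
    (an assumed input shape, not specified in the text).
    """
    counts = Counter(usage_records)
    return [app for app, _ in counts.most_common(top_n)]
```

With the figures given in the text, `top_n=50_000` would keep roughly the most-used sixth of the 319,071 Apps.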

It should be noted that the behavior data features in the data set to be analyzed fall into three parts. 1) Information on the Apps used by the user, comprising four major categories: financial App usage preferences, other App usage preferences, financial tags, and other tags. Specifically, App usage preference features are the numbers of Apps of each type installed, newly added, and uninstalled on the user's device in the past 7, 15, 30, or 90 days, together with the number of active days for each type of App. Tag features are the number of times the user opened a given class of feature. 2) Information on the device used by the user, including the device's launch price, launch year, number of years since launch, and the number of MAC addresses associated with the device in the past 30 days. 3) The user's recent location movement information, including the number of times the user recently appeared in convenience stores. The user behavior data contains 94 features in six categories; Table 1 details the specific meanings of some features in each category.

S102. Integrate the multiple user behavior data and the label data corresponding to each user behavior data to obtain a user data set.

Specifically, the multiple user behavior data and the label data corresponding to each user behavior data are merged to obtain the user data set. The present invention also obtains the credit record data of the same batch of users and defines the user performance period as three months. If, within the performance period, the time data in a user's credit record data exceeds a preset first threshold, the user is defined as a positive sample; if the time data in the user's credit record data does not exceed a preset second threshold, the user is defined as a negative sample. The modeling data only needs to include data that is clearly defined as a positive or negative sample. The data set covers September 1, 2021 to December 31, 2021 and contains 142,793 records. The results of integrating and splitting the data set are shown in Table 2.
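The two-threshold sample definition above can be sketched as follows. The text does not state the threshold values or the unit of the "time data", so the numbers below are placeholders, and the function names and the `(user_id, time_value)` record shape are hypothetical.

```python
from typing import Optional

FIRST_THRESHOLD = 30.0   # placeholder value; not given in the text
SECOND_THRESHOLD = 3.0   # placeholder value; not given in the text


def label_sample(time_value: float,
                 first_threshold: float = FIRST_THRESHOLD,
                 second_threshold: float = SECOND_THRESHOLD) -> Optional[int]:
    """Return 1 (positive), 0 (negative), or None when the sample is ambiguous."""
    if time_value > first_threshold:
        return 1
    if time_value <= second_threshold:
        return 0
    return None  # between the thresholds: not clearly positive or negative


def build_labeled_set(records):
    """Keep only the (user_id, label) pairs that are clearly defined."""
    labeled = []
    for user_id, time_value in records:
        label = label_sample(time_value)
        if label is not None:
            labeled.append((user_id, label))
    return labeled
```

Samples falling between the two thresholds are dropped, which matches the statement that only clearly defined positive and negative samples enter the modeling data.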

Table 2 - Data set split results

It should be noted that OOT in Table 2 denotes the out-of-time validation set, which consists of the samples in the last time slice of the modeling sample period.
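An out-of-time split of this kind can be sketched as below. The cut-off date is an assumption, because the text only says that the OOT set is the last time slice; the function name and the `(sample_date, payload)` record shape are also hypothetical.

```python
from datetime import date


def split_oot(samples, oot_start):
    """Split samples into (in_time, out_of_time) by date.

    samples: list of (sample_date, payload) tuples (assumed shape).
    oot_start: first date that belongs to the out-of-time validation set.
    """
    in_time = [s for s in samples if s[0] < oot_start]
    out_of_time = [s for s in samples if s[0] >= oot_start]
    return in_time, out_of_time


# Example with an assumed cut-off inside the 2021-09-01 to 2021-12-31 range.
samples = [(date(2021, 9, 15), "a"), (date(2021, 11, 30), "b"), (date(2021, 12, 20), "c")]
in_time, oot = split_oot(samples, oot_start=date(2021, 12, 1))
```

The in-time part would then be divided further into training and test sets, while the OOT part is held back to check how the model behaves on later data.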

S103. Preprocess the user data set to obtain a data set to be analyzed.

It should be noted that, when preprocessing the user data set, the user behavior data and the default/non-default label data are first integrated to form the final data set, and the data set is split at the same time. Outlier handling and missing value handling are then performed, and the method for filling missing values is chosen according to model performance.

Specifically, when preprocessing the user data set, outlier analysis is performed on the user data set to determine target outliers; missing value analysis is performed on the user data set using the target outliers to determine target missing values; and, based on the target missing values, data filling is performed on the user data set to obtain the data set to be analyzed.

It should be noted that, because features with a high missing rate degrade the modeling result, feature columns whose missing rate exceeds 80% are deleted outright. Next, box plots combined with expert experience are used to detect and confirm outliers, and data confirmed as outliers is treated as missing. Finally, several filling methods are tried, including fixed-value filling, mean filling, previous-value filling, and interpolation. Based on the resulting model performance, missing values in each column are filled with that column's mean.
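The three preprocessing steps above can be sketched in plain Python as follows. The 80% missing-rate limit and the mean fill come from the text; the 1.5 x IQR whisker rule is the conventional box-plot criterion and is an assumption here, as are the function names and the column-dictionary data layout. The expert review mentioned in the text is not modeled.

```python
import statistics

MAX_MISSING_RATE = 0.8  # from the text: drop columns with more than 80% missing


def missing_rate(values):
    """Fraction of entries that are None."""
    if not values:
        return 1.0
    return sum(v is None for v in values) / len(values)


def mask_outliers(values, k=1.5):
    """Replace box-plot outliers with None (k=1.5 is the usual whisker rule, assumed)."""
    present = [v for v in values if v is not None]
    if len(present) < 4:
        return list(values)  # too few points to estimate quartiles
    q1, _, q3 = statistics.quantiles(present, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v if v is not None and low <= v <= high else None for v in values]


def fill_with_mean(values):
    """Fill None entries with the mean of the observed entries."""
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    mean = statistics.fmean(present)
    return [mean if v is None else v for v in values]


def preprocess(columns):
    """columns: dict mapping feature name to a list of float-or-None values."""
    result = {}
    for name, values in columns.items():
        if missing_rate(values) > MAX_MISSING_RATE:
            continue  # column is dropped entirely
        result[name] = fill_with_mean(mask_outliers(values))
    return result
```

Treating outliers as missing before filling means the column mean is computed without the extreme values, which is the order the text describes.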

S104. Perform first feature extraction on the data set to be analyzed through a filter feature extraction algorithm to obtain a first candidate feature set.

S105. Perform second feature extraction on the first candidate feature set through a wrapper feature extraction algorithm to obtain a second candidate feature set.

In this embodiment, feature selection uses both filter and wrapper methods. In the filter feature extraction step, to select the features relevant to modeling, redundant features are first screened out using three statistical methods: decile distribution, the rank-sum test, and the standard score (z-score). Then, considering both linear and nonlinear relationships among features, the Pearson correlation coefficient method and the maximal information coefficient (MIC) method are used to update the feature set. This completes the first feature extraction on the data set to be analyzed and yields the first candidate feature set.
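As an illustration of the linear part of this filter step only, the sketch below drops one feature from each highly correlated pair using the Pearson correlation coefficient. The decile, rank-sum, z-score, and MIC checks are not shown. The 0.9 threshold, the function names, and the keep-first policy are assumptions; the text does not specify how correlated pairs are resolved.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.9):
    """Keep a feature only if it is not strongly correlated with one already kept.

    columns: dict mapping feature name to a list of numbers.
    threshold: assumed cut-off on |r|; not given in the text.
    """
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

Because features are visited in dictionary order, the earlier of two correlated features is the one retained; a production version would more plausibly keep the one with the stronger relationship to the label.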

Further, second feature extraction is performed on the first candidate feature set through the wrapper feature extraction algorithm to obtain the second candidate feature set. When the server performs this second feature extraction, it uses the feature importance scores of a classification tree model to determine a better candidate feature set, and that set is taken as the second candidate feature set.
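One common way to combine a wrapper search with tree-based importance scores is backward elimination, sketched below. This is not necessarily the procedure shown in FIG. 4: the stopping rule, the function name, and the `evaluate` callback are assumptions. The callback stands in for training a classification tree model on a feature subset and returning its validation score together with per-feature importances.

```python
def wrapper_select(features, evaluate, min_features=1):
    """Greedy backward elimination driven by model importance scores.

    features: list of candidate feature names.
    evaluate: hypothetical callback; given a list of features it returns
              (score, importances), where importances maps name -> float.
    """
    current = list(features)
    best_score, importances = evaluate(current)
    while len(current) > min_features:
        # Propose removing the feature the model currently finds least important.
        weakest = min(current, key=lambda f: importances[f])
        candidate = [f for f in current if f != weakest]
        score, candidate_importances = evaluate(candidate)
        if score < best_score:
            break  # removal hurt the model, so keep the current subset
        current, best_score, importances = candidate, score, candidate_importances
    return current
```

In practice `evaluate` would refit the model at every iteration, which is what makes wrapper methods more expensive than the filter step that precedes them.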

S106. Perform data drift detection and feature screening on the second candidate feature set to obtain a target feature set.

Specifically, in this embodiment, HyperGBM is used to detect and handle data drift in the user data. HyperGBM is a full-pipeline automated machine learning tool that covers, end to end, the whole process from data cleaning, preprocessing, feature processing and screening through to model selection and hyperparameter optimization. Feature screening is performed at the same time to obtain the target feature set.
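The idea behind adversarial drift detection is to check how easily training data can be told apart from later data. The sketch below is a deliberately simplified, per-feature stand-in that does not use HyperGBM or its API: instead of training a classifier, it computes for each feature the AUC obtained when the raw feature value is used as the score for separating the two sets. The 0.7 cut-off and all names are assumptions.

```python
def auc_train_vs_test(train_values, test_values):
    """Probability that a random test value exceeds a random train value.

    Ties count as one half. A result near 0.5 means the two samples are
    hard to tell apart on this feature; near 0 or 1 means they are easy.
    """
    wins = 0.0
    for t in test_values:
        for r in train_values:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5
    return wins / (len(train_values) * len(test_values))


def drop_drifting(train, test, max_auc=0.7):
    """Keep features whose train/test separability stays at or below max_auc.

    train, test: dicts mapping feature name to a list of numbers.
    max_auc: assumed cut-off; not taken from the text.
    """
    kept = []
    for name in train:
        auc = auc_train_vs_test(train[name], test[name])
        separability = max(auc, 1.0 - auc)  # direction of the shift does not matter
        if separability <= max_auc:
            kept.append(name)
    return kept
```

A full adversarial classifier would be trained on all features jointly and can catch drift that only appears in feature combinations, which this single-feature check cannot.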

S107. Perform credit data analysis on the target feature set through a preset target credit data analysis model to obtain a credit data analysis result, and transmit the credit data analysis result to a preset data processing terminal.

By executing the above steps, multiple user behavior data are collected and label matching is performed to determine the corresponding label data; the multiple user behavior data and the label data are integrated to obtain a user data set; the user data set is preprocessed to obtain a data set to be analyzed; first feature extraction processing is performed on the data set to be analyzed through a filter-based feature extraction algorithm to obtain a first candidate feature set; second feature extraction processing is performed on the first candidate feature set through a wrapper-based feature extraction algorithm to obtain a second candidate feature set; data drift detection and feature screening processing are performed on the second candidate feature set to obtain a target feature set; and credit data analysis is performed on the target feature set to obtain a credit data analysis result, which is transmitted to a preset data processing terminal.

This embodiment of the present invention focuses on how user behavior data affects a user's credit status. It establishes a credit data analysis system based on user behavior data with relatively high accuracy, without requiring financial-attribute data that is strongly related to the user but costly and difficult to obtain. On the one hand, categorical features are converted directly into numerical features, so operations such as one-hot encoding are not needed and the data dimension does not increase, which is fast and efficient. On the other hand, the present invention uses an unbiased estimate of the gradient, which reduces the influence of estimation bias compared with traditional gradient estimation methods and addresses the problems of gradient bias and prediction shift, thereby effectively improving the generalization ability of the system model. The present invention can therefore predict a user's credit status with a relatively fast training speed, more accurate predictions, and better generalization performance, further improving the accuracy of credit data analysis based on user behavior data.
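The paragraph above does not say how categorical features are turned into numbers. Since the description later names CatBoost, one plausible reading is an ordered target statistic, where each row is encoded using only the labels of rows that precede it in a random order. The sketch below is an illustration under that assumption; the function name and the `prior`, `weight`, and `seed` parameters are hypothetical and do not come from the source.

```python
import random

def ordered_target_encode(categories, targets, prior=0.5, weight=1.0, seed=0):
    """Encode one categorical column as one numeric column (no one-hot expansion).

    Each row only sees the targets of rows that precede it in a random
    permutation, so a row's own label never leaks into its encoding.
    """
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for idx in order:
        cat = categories[idx]
        s = sums.get(cat, 0.0)
        c = counts.get(cat, 0)
        # Smoothed mean of earlier targets for this category (assumed form).
        encoded[idx] = (s + weight * prior) / (c + weight)
        sums[cat] = s + targets[idx]
        counts[cat] = c + 1
    return encoded

# Usage: a device-brand column becomes a single numeric column.
brands = ["x", "y", "x", "x", "y"]
good_credit = [1, 0, 1, 0, 0]
print(ordered_target_encode(brands, good_credit))
```

The output has the same length as the input column, which is the property the paragraph relies on: the feature count does not grow with the number of categories.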

In a specific embodiment, the process of executing step S101 may specifically include the following steps:

(1) Collect multiple user behavior data, extract time data from each user behavior data, and determine the time data corresponding to each user behavior data;

(2) Based on the time data corresponding to each user behavior data, perform label matching on the multiple user behavior data to determine the label data corresponding to each user behavior data.

It should be noted that the user behavior data includes information about the apps a user uses, information about the devices a user uses, and information about the user's recent location movements. Label matching is then performed on the multiple user behavior data to determine the label data corresponding to each user behavior data. In this embodiment of the present invention, the user performance period is defined as three months. If, within the performance period, the time data in a user's credit record data exceeds a preset first threshold, the user is defined as a positive sample; if the time data in the user's credit record data does not exceed a preset second threshold, the user is defined as a negative sample.
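The labeling rule above can be sketched as a small function. The source does not state the threshold values or what happens to a user whose time data falls between the two thresholds; in this sketch such users are left unlabeled, which is an assumption, and the names `assign_label`, `time_value`, `first_threshold`, and `second_threshold` are hypothetical.

```python
from typing import Optional

def assign_label(time_value: float, first_threshold: float,
                 second_threshold: float) -> Optional[int]:
    """Return 1 (positive sample), 0 (negative sample), or None (unlabeled)."""
    if time_value > first_threshold:
        return 1   # exceeds the first threshold: positive sample
    if time_value <= second_threshold:
        return 0   # does not exceed the second threshold: negative sample
    return None    # between the thresholds: not covered by the source (assumption)

# Usage with illustrative thresholds only.
print([assign_label(v, first_threshold=30, second_threshold=10) for v in (40, 5, 20)])
```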

In a specific embodiment, the process of executing step S103 may specifically include the following steps:

(1) Perform outlier analysis on the user data set to determine target outliers, and perform missing value analysis on the user data set based on the outliers to determine target missing values;

(2) Based on the target missing values, perform data filling processing on the user data set to obtain the data set to be analyzed.

In this step, it should be noted that data quality directly determines the prediction and generalization capabilities of the system, and data preprocessing is a prerequisite for data quality, so preprocessing is crucial to the modeling work. Because the data was collected over a long time span and through complex collection methods, a high degree of missingness is unavoidable. The present invention considers that features with extremely high missingness harm the modeling effect, so feature columns whose missingness exceeds 80% are deleted outright. This removes 18 feature columns with extremely high missingness and leaves a data set with 76 features. Because the remaining data is all numerical, box plots are first drawn to detect outliers, and the outliers are then confirmed using expert experience. Users are not correlated with one another, and a user may well install or uninstall an extremely large or extremely small number of apps of some category within a given period, so the outliers flagged by the box plot must be confirmed by expert judgment. Data confirmed as an outlier is treated as a missing value.

For missing value filling, the present invention evaluates seven methods: fixed value filling, mean filling, median filling, mode filling, interpolation filling, previous-value filling, and next-value filling. In terms of model performance, fixed value, mean, median, and mode filling outperform interpolation, previous-value, and next-value filling by about 5% to 10%. This is because there is almost no correlation between the user records in the present invention, so completing one user's missing data from the previous or next user's data is not appropriate. Among the four better-performing methods, mean filling outperforms the other three by about 1% to 2%. The present invention therefore uses column mean filling for missing values.
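The preprocessing described above (drop columns that are more than 80% missing, flag box-plot outliers, treat them as missing, fill with the column mean) can be sketched with the standard library. This is a minimal sketch under stated assumptions: the 1.5 × IQR fence is the conventional box-plot rule and is not given in the source, the expert review step is omitted, and the `preprocess` function and its column-dictionary input format are hypothetical.

```python
from statistics import mean, quantiles

def preprocess(columns, max_missing=0.8):
    """columns: dict mapping a feature name to a list of floats or None."""
    cleaned = {}
    for name, values in columns.items():
        missing_ratio = sum(v is None for v in values) / len(values)
        if missing_ratio > max_missing:
            continue  # drop feature columns with more than 80% missing values
        present = [v for v in values if v is not None]
        if len(present) >= 2:
            q1, _, q3 = quantiles(present, n=4)
            iqr = q3 - q1
            low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # assumed box-plot fences
        else:
            low, high = float("-inf"), float("inf")
        # Flagged values become missing; the source adds an expert check here.
        kept = [v if v is not None and low <= v <= high else None for v in values]
        fill = mean(v for v in kept if v is not None)
        cleaned[name] = [fill if v is None else v for v in kept]
    return cleaned

data = {
    "mostly_missing": [None] * 9 + [1.0],
    "installs": [1.0, 2.0, 3.0, 2.0, None, 2.0, 3.0, 1.0, 2.0, 500.0],
}
print(preprocess(data))
```

In the usage example, the first column is dropped, and both the missing entry and the extreme value 500.0 in `installs` are replaced by the mean of the remaining values.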

In this way, outlier analysis is performed on the user data set to determine the target outliers, missing value analysis is performed on the user data set based on the outliers to determine the target missing values, and data filling processing is performed on the user data set based on the target missing values to obtain the data set to be analyzed.

In a specific embodiment, the process of executing step S104 may specifically include the following steps:

(1) Remove redundant features from the data set to be analyzed through a filter-based feature extraction algorithm to obtain a feature set to be processed;

(2) Perform feature correlation analysis on the feature set to be processed to obtain a feature correlation analysis result;

(3) Perform feature extraction on the feature set to be processed based on the feature correlation analysis result to obtain the first candidate feature set.

It should be noted that the purpose of feature extraction is to select features that help modeling. The present invention uses a filter method and a wrapper method to screen out features that are irrelevant to modeling. Redundant features increase the model's computation, slow down training, and may even cause overfitting. Screening out these features reduces unnecessary resource consumption and improves the prediction performance of the system model. The three statistics-based feature selection methods (decile distribution, rank-sum test, and standard score) aim to screen out redundant features. In addition, the degree of correlation between two variables can also serve as a basis for feature screening: the stronger the correlation between two variables, the more information they share, so it is enough to keep one feature from each group of strongly correlated features. The present invention therefore uses the Pearson correlation coefficient method and the maximal information coefficient method to update the feature set, covering both linear and nonlinear relationships between features. Table 3 shows how the value of the correlation coefficient maps to the strength of the correlation between features.

In the present invention, redundant features are removed from the data set to be analyzed through the filter-based feature extraction algorithm to obtain the feature set to be processed; feature correlation analysis is performed on the feature set to be processed to obtain the feature correlation analysis result; and feature extraction is performed on the feature set to be processed based on that result to obtain the first candidate feature set.
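The idea of keeping one feature from each strongly correlated group can be sketched as follows. This is an illustration only: it covers the Pearson part, not the maximal information coefficient, and it keeps whichever feature comes first in the input order, whereas the description chooses the retained feature with the help of the decile distribution. The `threshold` default of 0.8 and both function names are assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def drop_correlated(features, threshold=0.8):
    """features: dict name -> values, in priority order. Returns the kept names."""
    kept = []
    for name, values in features.items():
        # Keep a feature only if it is not strongly correlated with any kept one.
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept

features = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8.1],   # almost a linear copy of "a"
    "c": [4, 1, 3, 2],
}
print(drop_correlated(features))
```

Taking the absolute value of the coefficient matches the description's choice to work on a [0,1] scale.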

Table 3 - Relationship between the value of the correlation coefficient and the strength of the correlation between features

The decile distribution is a feature selection method based on deciles that shows directly how well each feature separates negative samples from positive samples. Figures 2 and 3 show the decile distribution plots of a redundant feature and a non-redundant feature, respectively. In Figure 2, naw denotes the number of express-logistics or other express-logistics apps installed in the past 7 days, and Deciles denotes the deciles of the redundant feature. In Figure 3, ECA08 denotes the number of times the user opened apps in the e-commerce industry, e-commerce online behavior, or vertical e-commerce (digital/3C) categories in the past 7 days, and Deciles denotes the deciles of the non-redundant feature. The figures show that when the cumulative distributions of negative and positive samples coincide, the feature does nothing to separate negative from positive samples and is therefore redundant. Conversely, when the two cumulative distributions differ, the feature can separate negative from positive samples, so it is not redundant and is retained. The rank-sum test and the standard score, like the decile distribution, are non-parametric statistical methods that directly reflect how well each feature separates negative and positive samples. Combining the three statistical methods removes 29 redundant features and leaves a candidate feature set of 47 features.

The Pearson correlation coefficient measures the linear correlation between features using the covariance and standard deviations of two feature variables. The present invention defines the value range of the Pearson correlation coefficient as [0,1]; Table 3 shows how the value of the coefficient maps to the strength of the correlation. Counting the extremely strongly correlated feature pairs shows that several such pairs can form an extremely strongly correlated feature set, that is, a set in which any two features are extremely strongly correlated. In some feature sets a few feature pairs have a correlation coefficient below 0.8 but above 0.7; the present invention also treats such sets as extremely strongly correlated feature sets.

Further, the correlation coefficients between the features in the data to be analyzed are calculated. The maximal information coefficient (MIC) measures the nonlinear correlation between features based on the mutual information and joint probability of two variables. The MIC value range is defined as [0,1], and the mapping from MIC value to correlation strength is similar to that of the Pearson correlation coefficient method. Table 4 lists the extremely strongly linearly and nonlinearly correlated feature pairs and feature sets found by the two methods. The decile distribution method is then used to decide which feature to keep from each extremely strongly correlated pair or set; the selected features are shown in bold in Table 4. In the end, the Pearson correlation coefficient method and the maximal information coefficient method together remove 22 extremely strongly correlated features, reducing the feature set to 25 features.
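The decile comparison described above can be approximated numerically. The sketch below computes the cumulative share of positive and of negative samples at each decile boundary of a feature and reports the largest gap; a gap near zero corresponds to coinciding curves (a redundant feature). The source judges this from plots and gives no numeric cutoff, so `decile_gap` is a hypothetical helper and no threshold is implied.

```python
from bisect import bisect_right
from statistics import quantiles

def decile_gap(values, labels):
    """Largest gap between cumulative positive and negative shares at the deciles."""
    cuts = quantiles(values, n=10)  # the nine decile boundaries of the feature
    pos = sorted(v for v, y in zip(values, labels) if y == 1)
    neg = sorted(v for v, y in zip(values, labels) if y == 0)
    gaps = []
    for c in cuts:
        pos_share = bisect_right(pos, c) / len(pos)  # positives at or below the cut
        neg_share = bisect_right(neg, c) / len(neg)  # negatives at or below the cut
        gaps.append(abs(pos_share - neg_share))
    return max(gaps)

values = list(range(20))
separating = [0] * 10 + [1] * 10   # low values are negative, high values positive
mixed = [0, 1] * 10                # labels unrelated to the value
print(decile_gap(values, separating), decile_gap(values, mixed))
```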

Table 4 - Statistics of extremely strongly linearly and nonlinearly correlated feature pairs and feature sets

In a specific embodiment, as shown in FIG. 4, the process of executing step S105 may specifically include the following steps:

S201. Perform importance analysis on each first candidate feature in the first candidate feature set through a wrapper-based feature extraction algorithm to determine the importance data of each first candidate feature;

S202. Perform second feature extraction processing on the first candidate feature set based on the importance data of each first candidate feature to obtain the second candidate feature set.

Specifically, the feature importance of the classification tree model is obtained by counting how many times each feature is used as a split attribute across all trees, which directly reflects how much each feature influences the system's decisions. Figure 5 shows the importance distribution of the features remaining after filter-based feature selection, where Feature Importance denotes the feature importance value of the classification tree model. Based on the performance of the CatBoost model, the 20 features with the highest importance are selected as the better feature set. Finally, second feature extraction processing is performed on the first candidate feature set based on the importance data of each first candidate feature to obtain the second candidate feature set. Table 5 lists the description of each feature in the second candidate feature set.
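Split-count importance as described above can be illustrated with a toy representation of a trained ensemble. The list-of-split-features format and the function name are hypothetical; a real library's built-in importance measure may be defined differently, so this only shows the counting idea and the top-k cut.

```python
from collections import Counter

def top_features_by_split_count(trees, k=20):
    """trees: iterable of lists naming the feature used at each split of one tree."""
    counts = Counter(feature for tree in trees for feature in tree)
    # most_common sorts by count, descending; ties keep first-seen order.
    return [name for name, _ in counts.most_common(k)]

# Three toy trees; each inner list is the sequence of split features in one tree.
toy_trees = [
    ["device_price", "store_visits", "device_price"],
    ["device_price", "app_installs"],
    ["store_visits"],
]
print(top_features_by_split_count(toy_trees, k=2))
```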

Table 5 - Description of each feature in the second candidate feature set

It should be noted that Table 5 lists the second candidate features together with the category each belongs to and its specific meaning, ordered from highest to lowest importance in the classification tree model. Table 5 shows first that the launch price of the device a user uses is relatively important for predicting whether that user has good credit.

The statistics show that about 73% of users whose device price is above RMB 2,500 have good credit, whereas only 21% of users whose device price is below RMB 2,500 do. The statistics also show that about 67% of users who recently appeared at convenience-store shopping venues more than 10 times have good credit, compared with about 26% of users who appeared there fewer than 10 times. In summary, when evaluating whether a user has good credit, the price of the user's device and the number of the user's recent appearances at convenience-store shopping venues can both be taken into account.

In a specific embodiment, the process of executing step S106 may specifically include the following steps:

(1) Perform data drift detection on the second candidate feature set through a preset adversarial classifier to generate a data drift detection result;

(2) Perform feature screening processing on the second candidate feature set based on the data drift detection result to obtain the target feature set.
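The two steps above follow the general pattern of adversarial validation: label each row by which data set it came from, see how well that origin can be predicted, and screen out the features that make it predictable. The sketch below is a simplified, per-feature stand-in that uses each feature's raw value as the classifier score; it is not HyperGBM's procedure, and the `tolerance` value and function names are assumptions.

```python
def auc(scores, labels):
    """Rank-based AUC; tied scores receive their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def drifting_features(train, test, tolerance=0.2):
    """train/test: dict name -> values. Flags features that separate the two sets."""
    flagged = []
    for name in train:
        scores = list(train[name]) + list(test[name])
        origin = [0] * len(train[name]) + [1] * len(test[name])
        # AUC near 0.5 means the feature cannot tell the two sets apart.
        if abs(auc(scores, origin) - 0.5) > tolerance:
            flagged.append(name)
    return flagged

train = {"stable": [1, 2, 3, 4], "shifted": [1, 2, 3, 4]}
test = {"stable": [1, 2, 3, 4], "shifted": [11, 12, 13, 14]}
print(drifting_features(train, test))
```

Features returned by `drifting_features` would be the candidates to remove in step (2); a full adversarial classifier would instead be trained on all features jointly.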

In a specific embodiment, after executing step S106 and before executing step S107, the method further includes the following steps:

(1) Perform initial hyperparameter analysis on the initial credit data analysis model to determine an initial hyperparameter combination;

(2) Perform prior probability distribution analysis on the initial hyperparameter combination to determine prior probability distribution data;

(3) Train the initial credit data analysis model using the second candidate feature set, generating a training set and a test set;

(4) Perform posterior probability distribution analysis on the initial hyperparameter combination using the training set and the test set to determine posterior probability distribution data;

(5) Iteratively analyze the initial hyperparameter combination based on the posterior probability distribution data to determine the optimal hyperparameter combination;

(6) Configure the parameters of the initial credit data analysis model based on the optimal hyperparameter combination to obtain the target credit data analysis model.

Specifically, in this step, hyperparameters are tuned with Bayesian optimization, which uses Bayes' theorem to estimate the posterior distribution of the objective function from the observed data and then selects the next hyperparameter combination to sample according to that distribution. Because it makes full use of the information from previous sampling points, it can adjust the current parameters more effectively and quickly find the parameters that globally maximize the objective function. Compared with grid search, Bayesian optimization needs fewer iterations and runs faster. Once a specific range is given for each parameter, multiple parameters can be tuned at once, so Bayesian optimization does not suffer a dimensionality explosion when there are many parameters.
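The update and selection rule described above can be written compactly. The notation is an assumption introduced here for illustration: $\theta$ is a hyperparameter combination drawn from the search range $\Theta$, $y_i$ is the score observed for $\theta_i$, $f$ is the unknown objective, and $\alpha$ is an acquisition function. The source names neither the surrogate model nor the acquisition function.

```latex
% Posterior over the objective after t evaluations (Bayes' theorem)
p(f \mid \mathcal{D}_t) \;\propto\; p(\mathcal{D}_t \mid f)\, p(f),
\qquad \mathcal{D}_t = \{(\theta_i, y_i)\}_{i=1}^{t}

% Next hyperparameter combination to evaluate
\theta_{t+1} = \arg\max_{\theta \in \Theta} \; \alpha(\theta \mid \mathcal{D}_t)
```

Each new observation $(\theta_{t+1}, y_{t+1})$ is added to $\mathcal{D}_t$ and the posterior is recomputed, which is how earlier sampling points inform later choices.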

In this embodiment of the present invention, initial hyperparameter analysis is performed on the initial credit data analysis model to determine the initial hyperparameter combination; prior probability distribution analysis is performed on the initial hyperparameter combination to determine the prior probability distribution data; the initial credit data analysis model is trained using the second candidate feature set, generating a training set and a test set; posterior probability distribution analysis is performed on the initial hyperparameter combination using the training set and the test set to determine the posterior probability distribution data; the initial hyperparameter combination is iteratively analyzed based on the posterior probability distribution data to determine the optimal hyperparameter combination; and the parameters of the initial credit data analysis model are configured based on the optimal hyperparameter combination to obtain the target credit data analysis model. The model tunes important parameters such as the learning rate, tree depth, sample sampling ratio, and column sampling ratio. The parameter ranges and final tuning results are shown in Table 6.

Table 6 - Bayesian optimization tuning ranges and tuning results

Further, as shown in Table 7, the initial system model achieves a KS of 0.1925 on the training set and 0.1523 on the test set. After Bayesian optimization tuning, the KS is 0.1728 on the training set and 0.1638 on the test set. Parameter tuning effectively reduces overfitting, and the test-set KS improves by about 7.5% relative to the initial system model (from 0.1523 to 0.1638).
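The figures above can be checked with a few lines of arithmetic. The sketch uses only the four KS values quoted from Table 7; the variable names are illustrative.

```python
train_before, test_before = 0.1925, 0.1523
train_after, test_after = 0.1728, 0.1638

# A smaller train-test gap indicates less overfitting.
gap_before = train_before - test_before
gap_after = train_after - test_after

# Relative improvement of the test-set KS over the initial model.
relative_gain = (test_after - test_before) / test_before

print(f"train-test gap: {gap_before:.4f} -> {gap_after:.4f}")  # 0.0402 -> 0.0090
print(f"relative test KS gain: {relative_gain:.2%}")           # 7.55%
```

The train-test gap shrinks from about 0.040 to about 0.009, which is the sense in which tuning reduces overfitting here.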

Table 7 - Comparison of credit data analysis model performance before and after tuning

It should be noted that the system evaluation metric is KS (Kolmogorov-Smirnov), defined as the maximum, over all bin intervals, of the absolute difference between the cumulative proportion of positive samples and the cumulative proportion of negative samples. In a risk control system, the KS value represents the system's discriminative power: the larger the KS value, the stronger the system's ability to rank risk. The prediction ability of the system refers to its prediction accuracy; the better the prediction ability, the stronger the discriminative power. The generalization ability of the system refers to its prediction ability on new data sets that follow the same patterns. The stability of the system refers to how much its prediction results fluctuate under different random sampling results.
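The KS definition above can be sketched directly. As a simplification, this sketch treats every distinct score as its own bin instead of using a fixed number of bins, which is an assumption; the function name `ks_statistic` is illustrative.

```python
def ks_statistic(scores, labels):
    """Max absolute gap between cumulative positive and negative proportions."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels))  # scan scores in ascending order
    cum_pos = cum_neg = 0
    best = 0.0
    for i, (score, y) in enumerate(pairs):
        if y == 1:
            cum_pos += 1
        else:
            cum_neg += 1
        # Evaluate the gap only after all samples sharing this score are counted.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            best = max(best, abs(cum_pos / n_pos - cum_neg / n_neg))
    return best

# Perfectly separated scores give KS = 1; uninformative scores give KS = 0.
print(ks_statistic([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]))
print(ks_statistic([0.5, 0.5, 0.5, 0.5], [0, 1, 0, 1]))
```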

This embodiment of the present invention also compares the prediction performance of five algorithms on the user behavior data set. Table 8 shows the performance comparison of the different machine learning models.

First, the ensemble models clearly outperform the single models. This is because an ensemble model aims to reduce prediction bias or variance by combining several models according to some strategy to improve prediction performance. Second, among the ensemble models, CatBoost and LightGBM, which are based on boosting, achieve clearly higher KS values than RandomForest, which is based on bagging. On the one hand, the GBDT algorithm combines multiple base learners to improve the generalization and robustness of the system model while also focusing on prediction accuracy, whereas RandomForest focuses only on generalization and robustness. On the other hand, bagging builds its models in parallel, whereas boosting builds them sequentially with the goal of improving accuracy, so each model takes full account of the training results of the previous one.

Finally, CatBoost shows a clear advantage over LightGBM in both training and testing, and generalizes well on the OOT set. This is because CatBoost handles categorical features more quickly and efficiently than LightGBM, and its Ordered boosting method yields an unbiased estimate of the gradient, which addresses the problems of gradient bias and prediction shift and thereby effectively improves the prediction performance and generalization ability of the system model.

For training time, Table 8 reports the average training time of the five models over 50 training runs. Table 8 shows that RandomForest takes the longest to train, followed by CatBoost. The description attributes this to RandomForest trees considering all features at each split, which leads to long training times. CatBoost's strength lies in fast handling of categorical features; if the data contains many categorical features, CatBoost's training time drops considerably. For the two models with better prediction performance, LightGBM and CatBoost, CatBoost takes about 10 times as long to train as LightGBM, but its test-set prediction performance is 36.98% higher than LightGBM's, and its OOT performance is also better. Finally, after discussion with experts, a CatBoost training time of about 7 s is considered acceptable in an actual production environment.

For test time, Table 8 reports the average test time of the five models over 50 test runs. Table 8 shows that the test times of CatBoost, LightGBM, Gaussian Naive Bayes, and Gaussian Mixture Model are an order of magnitude better than that of RandomForest, and the CatBoost model's training-time advantage is also fairly evident.

In summary, based on the comparison of the different algorithms, the present invention uses CatBoost to build the credit data analysis system based on user behavior data. The system shows excellent prediction ability as well as good generalization ability and notable stability. As shown in Table 8, the test-set KS reaches 0.1638, indicating good generalization. The stability of the system refers to how much its prediction results fluctuate under different random sampling results. Because the present invention sets two sampling-ratio parameters in the CatBoost model, one for samples and one for features, the training objects differ in every iteration of training, and which objects are drawn depends on the random seed. The stability of the system can therefore be assessed by observing how KS changes under different random seeds.
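The stability check described above amounts to repeating the evaluation under several random seeds and summarizing the spread of KS. A minimal sketch follows; `evaluate` stands for any function that trains the model with a given seed and returns the test-set KS, and both it and the toy numbers in the usage line are hypothetical.

```python
from statistics import mean, pstdev

def ks_stability(evaluate, seeds):
    """Return (mean, population standard deviation, range) of KS across seeds."""
    values = [evaluate(seed) for seed in seeds]
    return mean(values), pstdev(values), max(values) - min(values)

# Usage with a stand-in evaluation function, for illustration only.
fake_evaluate = lambda seed: 0.16 + 0.001 * seed
print(ks_stability(fake_evaluate, seeds=[0, 1, 2]))
```

A small standard deviation and range across seeds would indicate the stability the paragraph refers to.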

Table 8 - Performance comparison of different machine learning models

The technical method of the present invention has been described in detail above. Specific examples have been used to explain the principles and implementation of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. A person of ordinary skill in the art may, following the idea of the present invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the present invention.

An embodiment of the present invention further provides a credit data analysis system based on user behavior data. As shown in FIG. 6, the credit data analysis system based on user behavior data specifically includes:

A data collection module 3001, used to collect multiple user behavior data and perform label matching on the multiple user behavior data to determine the label data corresponding to each user behavior data;

A data integration module 3002, used to integrate the multiple user behavior data and the label data corresponding to each user behavior data to obtain a user data set;

A data processing module 3003, used to perform data preprocessing on the user data set to obtain a data set to be analyzed;

A first extraction module 3004, used to perform a first feature extraction process on the data set to be analyzed through a filter-type feature extraction algorithm to obtain a first candidate feature set;

A second extraction module 3005, used to perform a second feature extraction process on the first candidate feature set through a wrapper-type feature extraction algorithm to obtain a second candidate feature set;

A feature screening module 3006, used to perform data drift detection and feature screening on the second candidate feature set to obtain a target feature set;

A credit analysis module 3007, used to perform credit data analysis on the target feature set through a preset target credit data analysis model to obtain a credit data analysis result, and to transmit the credit data analysis result to a preset data processing terminal.

Optionally, the data collection module 3001 is specifically used to: collect multiple user behavior data and extract time data from each user behavior data to determine the time data corresponding to each user behavior data; and, based on the time data corresponding to each user behavior data, perform label matching on the multiple user behavior data to determine the label data corresponding to each user behavior data.

Optionally, the data processing module 3003 is specifically used to: perform outlier analysis on the user data set to determine target outliers, and perform missing value analysis on the user data set using the outliers to determine target missing values; and, based on the target missing values, perform data filling on the user data set to obtain the data set to be analyzed.
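The preprocessing flow above (outliers identified, then handled together with missing values, then filled) can be sketched as follows. The specific rules here are assumptions chosen for illustration and are not stated in the text: outliers are flagged with the interquartile-range rule, flagged values are treated as missing, and missing values are filled with the median of the remaining values. The function name `clean_and_fill` and the parameter `k` are hypothetical.

```python
import statistics

def clean_and_fill(values, k=1.5):
    """Mark IQR outliers as missing, then fill all missing entries with the median.

    `values` is one numeric column in which None denotes a missing entry.
    """
    observed = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(observed, n=4)
    low = q1 - k * (q3 - q1)
    high = q3 + k * (q3 - q1)
    # Outlier analysis: values outside the fences become missing values.
    masked = [v if v is not None and low <= v <= high else None for v in values]
    # Data filling: replace every missing value with the median of valid values.
    fill = statistics.median([v for v in masked if v is not None])
    return [fill if v is None else v for v in masked]
```

For example, in the column `[1, 2, 3, 4, 5, 100, None]` the value 100 falls outside the upper fence, so both it and the original missing entry are filled with the median of the remaining values.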

Optionally, the first extraction module 3004 is specifically used to: remove redundant features from the data set to be analyzed through the filter-type feature extraction algorithm to obtain a feature set to be processed; perform feature correlation analysis on the feature set to be processed to obtain a feature correlation analysis result; and perform feature extraction on the feature set to be processed based on the feature correlation analysis result to obtain the first candidate feature set.
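A filter-type selection of this kind can be sketched as below. This is a simplified illustration under stated assumptions: redundancy removal is reduced to dropping constant columns, the correlation analysis uses only the Pearson correlation coefficient with the label, and the threshold `min_abs_corr` is a hypothetical parameter. The other tests named in the claims (decile distribution, rank sum test, maximal information coefficient) are not reproduced here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either side is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def filter_select(features, labels, min_abs_corr=0.1):
    """Keep features that vary and are correlated with the label.

    `features` maps a feature name to its column of numeric values.
    """
    kept = []
    for name, column in features.items():
        # Redundancy removal (simplified): a constant column carries no signal.
        if len(set(column)) <= 1:
            continue
        # Correlation analysis: keep features whose |r| with the label is large enough.
        if abs(pearson(column, labels)) >= min_abs_corr:
            kept.append(name)
    return kept
```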

Optionally, the second extraction module 3005 is specifically used to: perform importance analysis on each first candidate feature in the first candidate feature set through the wrapper-type feature extraction algorithm to determine the importance data of each first candidate feature; and perform the second feature extraction process on the first candidate feature set based on the importance data of each first candidate feature to obtain the second candidate feature set.
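One way to read this importance-based wrapper step is sketched below. It is an assumption for illustration, not the method fixed by the text: the importance of a feature is taken as the drop in a model score when that feature is removed, and `score_fn` is a hypothetical callback that trains and evaluates a model on a given feature subset (for example, returning a validation KS).

```python
def drop_importance(features, score_fn):
    """Importance of each feature = score with all features minus score without it."""
    base = score_fn(features)
    return {
        f: base - score_fn([g for g in features if g != f])
        for f in features
    }

def select_by_importance(features, score_fn, min_gain=0):
    """Keep the features whose removal lowers the score by more than `min_gain`."""
    importance = drop_importance(features, score_fn)
    return [f for f in features if importance[f] > min_gain]
```

Because every importance value requires retraining through `score_fn`, this kind of wrapper is more expensive than a filter step, which is consistent with applying it only to the already reduced first candidate feature set.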

Optionally, the feature screening module 3006 is specifically used to: perform data drift detection on the second candidate feature set through a preset adversarial classifier to generate a data drift detection result; and perform feature screening on the second candidate feature set based on the data drift detection result to obtain the target feature set.
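The idea behind adversarial drift detection is to check whether a classifier can tell training samples from test samples: if it can, the feature distributions have drifted. The sketch below is a deliberately reduced version under stated assumptions. Instead of training a full classifier, it scores each feature on its own by the AUC obtained when the raw feature value is used to separate the two sets, and it drops features whose AUC deviates from 0.5 by more than a hypothetical tolerance `tol`. The function names are illustrative.

```python
def separation_auc(train_values, test_values):
    """AUC of using the raw value to separate test (positive) from train (negative)."""
    wins = 0.0
    for t in test_values:
        for r in train_values:
            if t > r:
                wins += 1.0
            elif t == r:
                wins += 0.5  # ties count as half
    return wins / (len(test_values) * len(train_values))

def screen_drifting_features(train, test, tol=0.1):
    """Keep features whose train/test distributions are hard to tell apart.

    `train` and `test` map a feature name to its column of values.
    """
    kept = []
    for name in train:
        auc = separation_auc(train[name], test[name])
        # An AUC near 0.5 means the feature does not reveal which set a sample is from.
        if abs(auc - 0.5) <= tol:
            kept.append(name)
    return kept
```

A full adversarial classifier trained on all features jointly would also catch drift that only shows up in feature combinations, which this per-feature simplification cannot detect.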

Optionally, the credit data analysis system based on user behavior data further includes:

A parameter analysis module 3008, used to perform initial hyperparameter analysis on an initial credit data analysis model to determine an initial hyperparameter combination;

A distribution analysis module 3009, used to perform prior probability distribution analysis on the initial hyperparameter combination to determine prior probability distribution data;

A model training module 3010, used to perform model training on the initial credit data analysis model using the second candidate feature set to generate a training set and a test set;

A probability analysis module 3011, used to perform posterior probability distribution analysis on the initial hyperparameter combination using the training set and the test set to determine posterior probability distribution data;

An iterative analysis module 3012, used to perform iterative analysis on the initial hyperparameter combination based on the posterior probability distribution data to determine an optimal hyperparameter combination;

A parameter configuration module 3013, used to configure the parameters of the initial credit data analysis model based on the optimal hyperparameter combination to obtain the target credit data analysis model.
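The prior, posterior and iteration steps handled by modules 3008 to 3013 follow the general pattern of Bayesian hyperparameter search. The sketch below is a simplified stand-in and not a full Bayesian optimization with a surrogate over a continuous space: it assumes a finite list of candidate hyperparameter combinations, places an independent Gaussian prior on the score of each, updates the posterior of a candidate after each evaluation with the standard normal-normal conjugate update, and picks the next candidate by an upper confidence bound. All names and default values (`bayes_search`, `evaluate`, `kappa`, the prior and noise variances) are hypothetical.

```python
import math

def bayes_search(candidates, evaluate, n_iter=10,
                 prior_mean=0.0, prior_var=1.0, noise_var=0.01, kappa=2.0):
    """Iteratively refine a Gaussian posterior over each candidate's score.

    `candidates` is a list of hashable hyperparameter combinations and
    `evaluate(candidate)` returns an observed score such as a validation KS.
    """
    # Prior distribution: the same (mean, variance) for every candidate.
    posterior = {c: (prior_mean, prior_var) for c in candidates}
    for _ in range(n_iter):
        # Choose the candidate with the highest upper confidence bound.
        chosen = max(
            candidates,
            key=lambda c: posterior[c][0] + kappa * math.sqrt(posterior[c][1]),
        )
        observed = evaluate(chosen)
        mean, var = posterior[chosen]
        # Conjugate normal-normal update of the chosen candidate's posterior.
        new_var = 1.0 / (1.0 / var + 1.0 / noise_var)
        new_mean = new_var * (mean / var + observed / noise_var)
        posterior[chosen] = (new_mean, new_var)
    best = max(candidates, key=lambda c: posterior[c][0])
    return best, posterior
```

The returned `best` combination plays the role of the optimal hyperparameter combination that is then used to configure the model.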

Through the cooperation of the above modules, multiple user behavior data are collected and label matching is performed to determine the corresponding label data; the multiple user behavior data and the label data are integrated to obtain a user data set; the user data set is preprocessed to obtain a data set to be analyzed; a first feature extraction process is performed on the data set to be analyzed through a filter-type feature extraction algorithm to obtain a first candidate feature set; a second feature extraction process is performed on the first candidate feature set through a wrapper-type feature extraction algorithm to obtain a second candidate feature set; data drift detection and feature screening are performed on the second candidate feature set to obtain a target feature set; and credit data analysis is performed on the target feature set to obtain a credit data analysis result, which is transmitted to a preset data processing terminal. The embodiment of the present invention focuses on the influence of a user's behavior data on that user's credit status, and establishes a credit data analysis system based on user behavior data with high accuracy, without requiring financial attribute data that is strongly related to the user but costly and difficult to obtain. On the one hand, in the embodiment of the present invention, categorical features are converted directly into numerical features, so operations such as one-hot encoding are not needed and the data dimension is not increased, which makes the processing fast and efficient. On the other hand, through unbiased estimation of gradients, the present invention reduces the influence of estimation bias compared with traditional gradient estimation methods and solves the problems of gradient bias and prediction shift, thereby effectively improving the generalization ability of the system model. The present invention can therefore predict the credit status of users with a fast training speed, more accurate predictive ability and better generalization performance, further improving the accuracy of credit data analysis based on user behavior data.
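The direct conversion of categorical features into numerical features mentioned above can be illustrated with an ordered target statistic, the kind of encoding CatBoost is known for. The sketch below is a simplified, hypothetical version: it uses the given row order as the single permutation, and the smoothing parameters `prior` and `weight` are assumptions chosen for illustration. Each row is encoded using only the labels of earlier rows in the same category, so a row's own label never leaks into its encoding.

```python
def ordered_target_encode(categories, labels, prior=0.5, weight=1.0):
    """Encode each category value from the labels of earlier rows only."""
    sums = {}    # running sum of labels seen so far, per category
    counts = {}  # running number of rows seen so far, per category
    encoded = []
    for cat, label in zip(categories, labels):
        s = sums.get(cat, 0.0)
        n = counts.get(cat, 0)
        # Smoothed mean of earlier labels; falls back to `prior` for unseen categories.
        encoded.append((s + prior * weight) / (n + weight))
        # Update the running statistics only after encoding the current row.
        sums[cat] = s + label
        counts[cat] = n + 1
    return encoded
```

Each categorical column is replaced by a single numeric column, so the data dimension stays the same, in contrast to one-hot encoding, which adds one column per category value.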

The above embodiments are intended only to illustrate the technical solutions of the present invention and not to limit them. Although the present invention has been described in detail with reference to the embodiments, a person of ordinary skill in the art should understand that the specific implementations of the present invention may still be modified or replaced with equivalents, and any modification or equivalent replacement that does not depart from the spirit and scope of the present invention shall fall within the scope of the claims of the present invention.

Claims

1. A credit data analysis method based on user behavior data, characterized by comprising:

Collect multiple user behavior data, perform label matching on the multiple user behavior data, and determine the label data corresponding to each user behavior data;

Perform data integration on the plurality of user behavior data and label data corresponding to each user behavior data to obtain a user data set;

Perform data preprocessing on the user data set to obtain a data set to be analyzed;

Perform a first feature extraction process on the data set to be analyzed through a filter-type feature extraction algorithm to obtain a first candidate feature set, which specifically includes: removing redundant features from the data set to be analyzed through the filter-type feature extraction algorithm to obtain a feature set to be processed; performing feature correlation analysis on the feature set to be processed to obtain a feature correlation analysis result; and performing feature extraction on the feature set to be processed based on the feature correlation analysis result to obtain the first candidate feature set;

Wherein performing the first feature extraction process on the data set to be analyzed through the filter-type feature extraction algorithm includes screening out redundant features through a decile distribution algorithm, a rank sum test algorithm and a standard dividing algorithm, and then updating the feature set to be processed using the Pearson correlation coefficient method and the maximal information coefficient method to obtain the first candidate feature set;

Perform a second feature extraction process on the first candidate feature set through a wrapper-type feature extraction algorithm to obtain a second candidate feature set, which specifically includes: performing importance analysis on each first candidate feature in the first candidate feature set through the wrapper-type feature extraction algorithm to determine the importance data of each first candidate feature; and performing the second feature extraction process on the first candidate feature set based on the importance data of each first candidate feature to obtain the second candidate feature set;

Perform data drift detection and feature screening on the second candidate feature set to obtain a target feature set, which specifically includes: performing data drift detection on the second candidate feature set through a preset adversarial classifier to generate a data drift detection result; and performing feature screening on the second candidate feature set based on the data drift detection result to obtain the target feature set;

Perform initial hyperparameter analysis on an initial credit data analysis model to determine an initial hyperparameter combination; perform prior probability distribution analysis on the initial hyperparameter combination to determine prior probability distribution data; perform model training on the initial credit data analysis model using the second candidate feature set to generate a training set and a test set; perform posterior probability distribution analysis on the initial hyperparameter combination using the training set and the test set to determine posterior probability distribution data; perform iterative analysis on the initial hyperparameter combination based on the posterior probability distribution data to determine an optimal hyperparameter combination; and configure the parameters of the initial credit data analysis model based on the optimal hyperparameter combination to obtain a target credit data analysis model;

Perform credit data analysis on the target feature set through the target credit data analysis model to obtain credit data analysis results, and transmit the credit data analysis results to a preset data processing terminal.

2. The credit data analysis method based on user behavior data according to claim 1, characterized in that the step of collecting multiple user behavior data, performing label matching on the multiple user behavior data, and determining the label data corresponding to each user behavior data includes:

Collect multiple user behavior data, extract time data for each user behavior data, and determine the time data corresponding to each user behavior data;

Based on the time data corresponding to each user behavior data, label matching is performed on multiple user behavior data to determine the label data corresponding to each user behavior data.

3. The credit data analysis method based on user behavior data according to claim 1, characterized in that the step of performing data preprocessing on the user data set to obtain the data set to be analyzed includes:

Perform outlier analysis on the user data set to determine target outliers, and perform missing value analysis on the user data set through the outliers to determine target missing values;

Based on the target missing values, data filling processing is performed on the user data set to obtain a data set to be analyzed.

4. A credit data analysis system based on user behavior data, used to perform the credit data analysis method based on user behavior data according to any one of claims 1 to 3, characterized in that it includes:

A data collection module, used to collect multiple user behavior data, perform label matching on multiple user behavior data, and determine the label data corresponding to each user behavior data;

A data integration module, configured to perform data integration on the plurality of user behavior data and label data corresponding to each of the user behavior data, to obtain a user data set;

A data processing module, used to perform data preprocessing on the user data set to obtain a data set to be analyzed;

A first extraction module, used to perform a first feature extraction process on the data set to be analyzed through a filter-type feature extraction algorithm to obtain a first candidate feature set, specifically including: removing redundant features from the data set to be analyzed through the filter-type feature extraction algorithm to obtain a feature set to be processed; performing feature correlation analysis on the feature set to be processed to obtain a feature correlation analysis result; and performing feature extraction on the feature set to be processed based on the feature correlation analysis result to obtain the first candidate feature set;

A second extraction module, used to perform a second feature extraction process on the first candidate feature set through a wrapper-type feature extraction algorithm to obtain a second candidate feature set, specifically including: performing importance analysis on each first candidate feature in the first candidate feature set through the wrapper-type feature extraction algorithm to determine the importance data of each first candidate feature; and performing the second feature extraction process on the first candidate feature set based on the importance data of each first candidate feature to obtain the second candidate feature set;

A feature screening module, used to perform data drift detection and feature screening on the second candidate feature set to obtain a target feature set, specifically including: performing data drift detection on the second candidate feature set through a preset adversarial classifier to generate a data drift detection result; and performing feature screening on the second candidate feature set based on the data drift detection result to obtain the target feature set;

A credit analysis module, used to: perform initial hyperparameter analysis on an initial credit data analysis model to determine an initial hyperparameter combination; perform prior probability distribution analysis on the initial hyperparameter combination to determine prior probability distribution data; perform model training on the initial credit data analysis model using the second candidate feature set to generate a training set and a test set; perform posterior probability distribution analysis on the initial hyperparameter combination using the training set and the test set to determine posterior probability distribution data; perform iterative analysis on the initial hyperparameter combination based on the posterior probability distribution data to determine an optimal hyperparameter combination; and configure the parameters of the initial credit data analysis model based on the optimal hyperparameter combination to obtain the target credit data analysis model;