WO2020143233A1 - Method, apparatus, computer device and storage medium for establishing a scorecard model - Google Patents
- Publication number
- WO2020143233A1 (PCT/CN2019/103489, CN2019103489W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sample
- variable
- bins
- chi
- sub
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
Definitions
- the present application relates to a method, device, computer equipment and storage medium for establishing a scorecard model.
- a method, device, computer device, and storage medium for establishing a scorecard model are provided.
- a method for establishing a scorecard model is executed by a computer device.
- the method includes: acquiring sample data of multiple training samples, the sample data including multiple sample variables; performing a binning operation on each sample variable; determining the number of bins corresponding to each sample variable and comparing whether the number of bins exceeds a threshold; if so, calculating, for each bin of the sample variable, the bin proportion, the bad sample rate, and the chi-square value with the adjacent bin, merging the multiple bins of the sample variable according to the bin proportion, the bad sample rate and the chi-square value, and returning to the step of determining the number of bins corresponding to each sample variable; otherwise, calculating the WOE (weight of evidence) value of each sample variable, screening the sample variables according to the WOE value, and establishing a scorecard model based on the sample variables obtained by the screening.
- performing the binning operation on each sample variable includes: identifying associated samples of the training sample and crawling associated data of the associated samples, the associated data including multiple associated variables; receiving model configuration information sent by the terminal, extracting derivative factors from the model configuration information, and obtaining the derivative variable corresponding to each derivative factor of the training sample; and performing binning operations on each sample variable, associated variable and derivative variable.
- merging the multiple bins of the sample variable according to the bin proportion, the bad sample rate and the chi-square value includes: determining the monotonic characteristic of the multiple bins according to the bad sample rate; identifying the bins that do not conform to the monotonic characteristic, whose bad sample rate equals a preset value, whose bin proportion is the smallest, or whose chi-square value is the smallest, and recording each of them as a bin to be merged; and merging each bin to be merged with its previous adjacent bin or its next adjacent bin.
- determining the monotonic characteristic of the multiple bins according to the bad sample rate includes: counting the number of bins whose bad sample rate follows each monotonic trend; determining the monotonic trend with the largest number of bins; and determining the monotonic characteristic of the corresponding sample variable according to that trend.
- merging the bin to be merged with the previous adjacent bin or the next adjacent bin includes: calculating the chi-square value of the bin to be merged and the previous adjacent bin, recorded as the first chi-square value; calculating the chi-square value of the bin to be merged and the next adjacent bin, recorded as the second chi-square value; comparing whether the first chi-square value is equal to the second chi-square value; if so, merging the bin to be merged with whichever of the previous adjacent bin and the next adjacent bin has the smaller bin proportion; otherwise, merging the bin to be merged with whichever of the two adjacent bins has the smaller chi-square value.
- establishing a scorecard model based on the sample variables obtained by the screening includes: screening target samples from the training samples and extracting sample features of the target samples; performing reinforcement learning on the sample features to obtain more derived samples; training the basic model with the training samples and the derived samples to obtain the scorecard model, calculating the accuracy of the scorecard model, and comparing whether the accuracy reaches the threshold; if not, generating regenerated samples based on the derived samples, using the regenerated samples as the current derived samples, and returning to the step of training the basic model with the training samples and the derived samples, obtaining the scorecard model, calculating its accuracy and comparing whether the accuracy reaches the threshold, until the accuracy reaches the threshold.
- An apparatus for establishing a scorecard model includes: a data binning module for acquiring sample data of multiple training samples; the sample data includes multiple sample variables; and performing binning operations on each sample variable;
- a binning and merging module, used to determine the number of bins corresponding to each sample variable and compare whether the number of bins exceeds the threshold; if so, calculate the bin proportion, the bad sample rate and the chi-square value with the adjacent bin for each bin of the sample variable, merge the multiple bins of the sample variable according to the bin proportion, the bad sample rate and the chi-square value, and return to the step of determining the number of bins corresponding to each sample variable; and a model building module, used to calculate the WOE value of each sample variable when the number of bins corresponding to the sample variable is less than or equal to the threshold, screen the sample variables based on the WOE value, and establish a scorecard model based on the sample variables obtained by the screening.
- the data binning module is further used to identify the associated samples of the training samples and crawl the associated data of the associated samples, the associated data including multiple associated variables; receive the model configuration information sent by the terminal, extract the derivative factors from the model configuration information, and obtain the derivative variable corresponding to each derivative factor of the training sample; and perform binning operations on each sample variable, associated variable and derivative variable.
- a computer device includes a memory and one or more processors; the memory stores computer-readable instructions which, when executed by the one or more processors, implement the steps of the risk prediction processing method provided in any embodiment of the present application.
- One or more non-volatile computer-readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the risk prediction processing method provided in any one of the embodiments of the present application.
- FIG. 1 is an application scenario diagram of a method for establishing a scorecard model according to one or more embodiments.
- FIG. 2 is a schematic flowchart of a method for establishing a scorecard model according to one or more embodiments.
- FIG. 3 is a schematic flowchart of the steps of binning and merging according to one or more embodiments.
- FIG. 4 is a structural block diagram of an apparatus for establishing a scorecard model according to one or more embodiments.
- FIG. 5 is a block diagram of a computer device according to one or more embodiments.
- the method for establishing a scorecard model provided in this application can be applied to the application environment shown in FIG. 1.
- the terminal 102 and the server 104 communicate via the network.
- the terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
- the server 104 may be implemented by an independent server or a server cluster composed of multiple servers.
- the user may send a scoring request to the server 104 through the terminal 102.
- the server 104 obtains the monitoring data of the monitoring object according to the scoring request, and calls the score card model to process the monitoring data.
- the score card model may be obtained by the server 104 through training on the sample variables obtained by the screening.
- the user may send a model establishment request to the server 104 through the terminal 102.
- the server 104 obtains sample data of multiple training samples according to the model establishment request.
- the sample data includes multiple sample variables.
- the server 104 performs binning operations on each sample variable according to the unsupervised binning method or the supervised binning method.
- the server 104 determines the number of bins corresponding to each sample variable and compares whether the number of bins exceeds the threshold. If the number of bins of a sample variable exceeds the threshold, the server 104 calculates, for each bin of the sample variable, the bin proportion, the bad sample rate, and the chi-square value with the adjacent bin, and determines the monotonic characteristic of the multiple bins according to the bad sample rate. The server 104 identifies the bins that do not conform to the monotonic characteristic, whose bad sample rate equals the preset value, whose bin proportion is the smallest, or whose chi-square value is the smallest, and records each of them as a bin to be merged.
- the server 104 merges each bin to be merged with its previous adjacent bin or its next adjacent bin and, after merging, judges again whether the number of bins of the sample variable exceeds the threshold. If the threshold is still exceeded, the server 104 merges the multiple bins of the sample variable again in the above manner until the number of bins of the sample variable is less than or equal to the threshold.
- the server 104 calculates the WOE value of each sample variable, performs sample variable screening based on the WOE value, and establishes a score card model based on the sample variables obtained by the screening.
- the training samples are balanced based on the improved chi-square binning method, and establishing the scorecard model by training on the multiple sample sets obtained through sample balancing can improve the accuracy of the model; reducing the number of bins can in turn improve model training efficiency.
- a method for establishing a scorecard model is provided.
- taking the method as applied to the server in FIG. 1 as an example, it includes the following steps:
- Step 202 Obtain sample data of multiple training samples; the sample data includes multiple sample variables.
- Each training sample is a monitoring object of virtual resources, such as an enterprise or individual that has been exposed to credit risk, or an enterprise or individual that has not been exposed to credit risk.
- Virtual resources can be stocks, bonds, etc.
- the monitoring object is a risk monitoring object.
- the sample data includes monitoring data of multiple dimensions of the monitoring object.
- for example, the sample data corresponding to training sample 1 includes financial variable A1, financial variable A2, judicial variable B1, public opinion variable C1, and industry variable D1; the sample data corresponding to training sample 2 includes financial variable A1, financial variable A2, financial variable A3, public opinion variable C2, and industry variable D2.
- Step 204 Perform binning operations on each sample variable.
- Variable types include qualitative and quantitative variables.
- when the variable type is a qualitative variable, initial binning is performed based on the attribute values of the qualitative variable.
- the initial binning of quantitative variables is based on the unsupervised binning method or the supervised binning method.
- the unsupervised binning method can be equidistant binning, equal width binning, etc.
- the supervised binning method can be chi-square binning and so on.
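As an illustration of the unsupervised option, the following is a minimal sketch of equal-width binning for a quantitative variable. The function name `equal_width_bins` and the sample values are hypothetical and do not come from the application; the application does not prescribe a particular implementation.

```python
def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based index)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything falls into a single bin.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would land on the upper edge, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical values of a financial variable, split into 3 initial bins.
initial_bins = equal_width_bins([100, 250, 400, 550, 700], n_bins=3)
```

The resulting bin indices are only the initial binning; the merging steps described below would then reduce the number of bins where it exceeds the threshold.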
- Step 206 Determine the number of bins corresponding to each sample variable, and compare whether the number of bins exceeds the threshold.
- the threshold of the number of bins can be set freely according to experience, such as 5. It should be noted that the thresholds for the number of bins corresponding to different sample variables can be different.
- Step 208 if so, calculate the bin ratio, the bad sample rate and the chi-square value with the adjacent bin for each bin corresponding to the sample variable; merge the multiple bins of the sample variable according to the bin ratio, the bad sample rate and the chi-square value, and return to the step of determining the number of bins corresponding to each sample variable.
- the bin ratio refers to the ratio of the number of training samples whose value of the sample variable falls into the current bin to the number of all training samples containing the sample variable. For example, the bin ratio of financial variable A2 in the [500, 700] bin is 10000/120000.
- the bad sample rate refers to the ratio of the number of bad samples in the current bin to the number of all samples in the current bin. For example, the bad sample rate of financial variable A2 in the [500, 700] bin is 2500/50000.
- the chi-square value with the adjacent binning is a statistic in the non-parametric test, which is used to test the data correlation of the adjacent binning.
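The three per-bin quantities can be sketched as follows. This is a minimal illustration that assumes each bin is summarised as a `(good_count, bad_count)` pair and that the chi-square value is the Pearson statistic of the 2x2 good/bad table formed by two adjacent bins; the function names and the counts are hypothetical.

```python
def bin_ratio(bin_counts, total_samples):
    """Share of all training samples that fall into this bin."""
    good, bad = bin_counts
    return (good + bad) / total_samples


def bad_rate(bin_counts):
    """Share of bad samples among the samples in this bin."""
    good, bad = bin_counts
    return bad / (good + bad)


def chi_square(bin_a, bin_b):
    """Pearson chi-square statistic for the 2x2 table of two adjacent bins."""
    col_totals = [bin_a[0] + bin_b[0], bin_a[1] + bin_b[1]]
    n = sum(col_totals)
    chi2 = 0.0
    for row in (bin_a, bin_b):
        row_total = sum(row)
        for j in range(2):
            expected = row_total * col_totals[j] / n
            if expected > 0:  # skip empty cells to avoid division by zero
                chi2 += (row[j] - expected) ** 2 / expected
    return chi2


# Hypothetical bins as (good, bad) counts.
similar = chi_square((90, 10), (90, 10))    # identical bad rates
different = chi_square((90, 10), (50, 50))  # very different bad rates
```

A small chi-square value indicates that the two adjacent bins have similar good/bad distributions, which is why the pair with the smallest value is a natural candidate for merging.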
- the server recognizes whether there is a bin with a bad sample rate of 0 or ⁇ , and marks the bin with a bad sample rate of 0 or ⁇ as a bin to be merged.
- the server recognizes a pair of adjacent bins with the smallest chi-square value, and marks the adjacent bins with the smallest chi-square value as the bins to be merged.
- the server identifies the bin with the smallest bin ratio, and marks the bin with the smallest bin ratio as the bin to be merged.
- the server merges the to-be-consolidated bins with the previous adjacent bin or the next adjacent bin. After the merge processing is completed, the server re-judges whether the number of bins of the sample variable still exceeds the threshold. If yes, continue to merge the multiple bins corresponding to the sample variable in the above manner until the number of bins corresponding to the sample variable is less than or equal to the threshold.
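The repeat-until-threshold loop can be sketched as below. This is a deliberately simplified sketch: it applies only the smallest-chi-square condition and ignores the bad-sample-rate, bin-ratio and monotonicity conditions described in the text. Bins are assumed to be `(good_count, bad_count)` pairs, and `merge_until_threshold` is a hypothetical name.

```python
def chi_square(a, b):
    """Closed-form Pearson chi-square for the 2x2 table of bins a and b."""
    n = a[0] + a[1] + b[0] + b[1]
    denom = (a[0] + a[1]) * (b[0] + b[1]) * (a[0] + b[0]) * (a[1] + b[1])
    if denom == 0:
        return 0.0  # a zero margin means the bins cannot be distinguished
    return n * (a[0] * b[1] - a[1] * b[0]) ** 2 / denom


def merge_until_threshold(bins, max_bins):
    """Merge the most similar adjacent pair until at most max_bins remain."""
    bins = [tuple(b) for b in bins]
    while len(bins) > max_bins:
        scores = [chi_square(bins[i], bins[i + 1]) for i in range(len(bins) - 1)]
        i = scores.index(min(scores))
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i:i + 2] = [merged]  # replace the pair with the merged bin
    return bins


# Hypothetical (good, bad) counts for four initial bins, threshold of 3.
result = merge_until_threshold([(90, 10), (88, 12), (60, 40), (30, 70)], max_bins=3)
```

In this example the first two bins have nearly identical bad rates, so they are the pair that gets merged.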
- Step 210 otherwise, calculate the WOE value of each sample variable, screen the sample variables according to the WOE value, and establish a score card model based on the sample variables obtained by the screening.
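The application does not spell out the WOE formula or the screening rule. The sketch below assumes the commonly used convention WOE = ln(share of good samples in the bin / share of bad samples in the bin), and additionally computes the information value (IV), which is a common WOE-based screening statistic that is swapped in here for illustration and is not stated in the text. Bins are assumed to be `(good_count, bad_count)` pairs with no zero counts.

```python
import math


def woe_and_iv(bins):
    """Return the per-bin WOE values and the information value of one variable."""
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    woes, iv = [], 0.0
    for good, bad in bins:
        good_share = good / total_good
        bad_share = bad / total_bad
        woe = math.log(good_share / bad_share)  # undefined if a count is zero
        woes.append(woe)
        iv += (good_share - bad_share) * woe
    return woes, iv


# Hypothetical variable with two bins of (good, bad) counts.
woes, iv = woe_and_iv([(80, 20), (20, 80)])
```

Because the logarithm is undefined when a bin contains no good or no bad samples, bins with an extreme bad sample rate need to be merged away before this step, which is consistent with the merging conditions described above.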
- sample equalization processing is performed on the training samples, and a score card model is built based on the training of multiple sample sets obtained by sample equalization.
- the sample data of multiple training samples is obtained, and the multiple sample variables contained in the sample data are binned to determine the number of bins corresponding to each sample variable; when the number of bins exceeds the threshold, the bin ratio, the bad sample rate and the chi-square value of each bin of the sample variable can be calculated, and the multiple bins of the sample variable are merged according to the bin ratio, the bad sample rate and the chi-square value; when the number of bins is less than or equal to the threshold, the WOE value of each sample variable can be calculated and the sample variables screened based on the WOE value; a scorecard model can then be built based on the sample variables obtained by the screening.
- because binning is constrained by multiple conditions (bin ratio, bad sample rate, and chi-square value), the number of bins for each sample variable can be limited, and rare samples can be distributed evenly across different bins to achieve sample balance. Screening sample variables under this binning mechanism can therefore improve model training efficiency and accuracy.
- performing binning operations on each sample variable includes: identifying associated samples of the training sample and crawling associated data of the associated samples, the associated data including multiple associated variables; receiving model configuration information sent by the terminal, extracting the derivative factors from the model configuration information, and obtaining the derivative variable corresponding to each derivative factor of the training sample; and performing binning operations on each sample variable, associated variable and derivative variable.
- the server calls the preset risk conduction model.
- Risk conduction models include relationship extraction models and conduction prediction models.
- the server crawls the social relationship data of the monitored object at the designated website, inputs the social relationship data into the relationship extraction model, determines one or more associated samples corresponding to the training sample, and generates a knowledge graph corresponding to the training sample based on the determined associated samples.
- the associated sample may be an associated object that has an investment relationship, supply relationship, or other relationship with the monitored object.
- the knowledge graph includes monitoring object nodes and multiple associated object nodes.
- the relationship extraction model includes an intimacy operator model, which is used to calculate the intimacy between each associated sample and the training sample. The calculation uses the set of adjacent nodes of the training sample node v and the number of adjacent nodes that the training sample node v has in common with the associated sample node w.
- depending on the association relationship, the intimacy may be an investment ratio, a contribution ratio, etc.
- the server crawls the associated data of the associated samples at the specified website.
- the server inputs the associated data and the corresponding intimacy into the conduction prediction model, calculates the conduction risk score of the associated sample, and marks the conduction risk score as an associated variable.
- the server may use the highest or the median of the conduction risk scores corresponding to multiple associated objects as the associated variable, or use the average of those conduction risk scores as the associated variable; there is no restriction on this.
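Reading the sentence above as allowing the maximum, the median, or the average of the conduction risk scores, the aggregation can be sketched as follows. The function name `associated_variable`, the `how` parameter and the scores are hypothetical.

```python
from statistics import mean, median


def associated_variable(scores, how="max"):
    """Collapse the conduction risk scores of several associated objects into one value."""
    reducers = {"max": max, "median": median, "mean": mean}
    return reducers[how](scores)


# Hypothetical conduction risk scores of three associated objects.
scores = [60, 80, 70]
by_max = associated_variable(scores, "max")
by_median = associated_variable(scores, "median")
by_mean = associated_variable(scores, "mean")
```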
- the scorecard model provided by the virtual resource acquisition platform itself only provides a model framework. If the user is not satisfied with the scorecard model used for virtual resource risk analysis, he can send a model configuration request to the server through the terminal, and then change the scorecard model according to his own industry experience. Specifically, the server obtains the corresponding scorecard model according to the model identifier carried in the model configuration request. The server recognizes the editable elements in the scorecard model, replaces the editable elements with blank cells, and fills the editable elements into the blank cells to obtain the model editing page and returns the model editing page to the terminal.
- the model editing page allows users to freely edit based on industry experience on the basis of the model framework to achieve model customization. For example, allowing users to modify variable weights, change variable values, etc.
- the model editing page also includes a "new indicator" button to support users to add new variables.
- the server returns the model editing page to the terminal according to the model configuration request.
- the editing information includes the changed scorecard model.
- variable type can be finance, public opinion, etc.
- for example, the new indicator may be whether senior executives have judicial punishment.
- the server regularly conducts a network-wide screening, and adds the newly added variables of the user (denoted as derived variables) to the variable library, so that the user or other users can use them again later.
- different users may have different customization logic for the scorecard model. In order to protect the user's customization logic, different users may perform data isolation on the customization operation of the scorecard model.
- the server splits the formula, obtains corresponding multiple itemized variables in the monitoring data, and performs preset logic operations on the multiple itemized variables according to the formula logic to obtain the corresponding variable values.
- the new variable is a natural language
- the user is also allowed to configure variable value acquisition logic for the new variable, and the variable value is then acquired automatically based on the configured logic. For example, for the newly added variable “Whether executives have judicial punishment”, the server first crawls information about judicial punishment of senior executives from the designated website and identifies the subject involved (whether it is an executive expected to be monitored); if the subjects are consistent, it conducts public opinion analysis on the crawled information to obtain the corresponding variable value. The value of the new variable can also be entered manually.
- the server performs binning operations on the sample variables, associated variables and derived variables in the above manner.
- the associated samples of the training sample are also identified, associated variables are extracted from the risk data corresponding to the associated samples, and the associated variables are included in the scope of risk measurement, which expands the risk prediction dimensions and can further improve the accuracy of risk prediction.
- providing users with a common scorecard model lowers the threshold for risk prediction, and allowing users to change the scorecard model according to their own industry experience makes the model customizable, so that the virtual resource platform is suitable for users from any industry background.
- the binning combining step includes:
- Step 302 Determine the monotonic characteristics of multiple bins according to the bad sample rate.
- the monotonic characteristic refers to the characteristic that the bad sample rate shows a continuously increasing or decreasing trend.
- determining the monotonic characteristic of the multiple bins according to the bad sample rate includes: counting the number of bins whose bad sample rate follows each monotonic trend; determining the monotonic trend with the largest number of bins; and determining the monotonic characteristic of the corresponding sample variable according to that trend.
- Step 304 Identify the bins that do not conform to the monotonic characteristic, whose bad sample rate equals the preset value, whose bin ratio is the smallest, or whose chi-square value is the smallest, and record each of them as a bin to be merged.
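One possible reading of this majority rule is sketched below: count how many adjacent steps in the bad sample rate go up and how many go down, take the more frequent direction as the monotonic characteristic, and flag the bins that break it. The function names are hypothetical, and resolving a tie in favour of "increasing" is an arbitrary assumption.

```python
def monotonic_trend(bad_rates):
    """Return the dominant direction of the bad sample rate across ordered bins."""
    pairs = list(zip(bad_rates, bad_rates[1:]))
    ups = sum(1 for prev, cur in pairs if cur > prev)
    downs = sum(1 for prev, cur in pairs if cur < prev)
    return "increasing" if ups >= downs else "decreasing"


def violating_bins(bad_rates):
    """Return the indices of bins whose bad rate moves against the dominant trend."""
    trend = monotonic_trend(bad_rates)
    violations = []
    for i in range(1, len(bad_rates)):
        step = bad_rates[i] - bad_rates[i - 1]
        if (trend == "increasing" and step < 0) or (trend == "decreasing" and step > 0):
            violations.append(i)
    return violations


# Hypothetical bad sample rates of five ordered bins.
rates = [0.05, 0.10, 0.08, 0.20, 0.30]
trend = monotonic_trend(rates)
to_merge = violating_bins(rates)
```

Here three steps go up and one goes down, so the variable is treated as increasing and the third bin (index 2) is marked as a bin to be merged.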
- the server recognizes whether there is a bin with a bad sample rate of 0 or ⁇ in the sample variable, and marks the bin with a bad sample rate of 0 or ⁇ as the bin to be merged.
- the server recognizes whether there is a binning that does not meet the monotonic characteristics, and marks the binning that does not meet the monotonic characteristics as the to-be-consolidated binning.
- the server recognizes a pair of adjacent bins with the smallest chi-square value, and marks the adjacent bins with the smallest chi-square value as the bins to be merged.
- the server identifies the bin with the smallest bin ratio, and marks the bin with the smallest bin ratio as the bin to be merged.
- Step 306 merge the sub-boxes to be merged with the previous adjacent sub-box or the next adjacent sub-box.
- merging the bin to be merged with the previous adjacent bin or the next adjacent bin includes: calculating the chi-square value of the bin to be merged and the previous adjacent bin, recorded as the first chi-square value; calculating the chi-square value of the bin to be merged and the next adjacent bin, recorded as the second chi-square value; comparing whether the first chi-square value is equal to the second chi-square value; if so, merging the bin to be merged with whichever of the previous adjacent bin and the next adjacent bin has the smaller bin ratio; otherwise, merging the bin to be merged with whichever of the two adjacent bins has the smaller chi-square value.
- the server obtains a bin with a bad sample rate of 0 or ⁇ or a bin that does not meet the monotonic characteristics, and is recorded as the first bin to be merged. If the first bin to be merged is the first bin, the first bin to be merged is merged with the next adjacent bin. If the first bin to be merged is the last bin, the first bin to be merged is merged with the previous adjacent bin.
- otherwise, the server obtains the chi-square value of the first bin to be merged and the previous adjacent bin (recorded as the first chi-square value) and the chi-square value of the first bin to be merged and the next adjacent bin (recorded as the second chi-square value), and compares the first chi-square value with the second chi-square value. If the first chi-square value is greater than the second chi-square value, the first bin to be merged is merged with the next adjacent bin. If the first chi-square value is smaller than the second chi-square value, the first bin to be merged is merged with the previous adjacent bin. If the first chi-square value is equal to the second chi-square value, the bin to be merged is merged with whichever of the previous adjacent bin and the next adjacent bin has the smaller bin ratio.
- because binning is constrained by multiple conditions (bin ratio, bad sample rate, chi-square value of adjacent bins, and the monotonic characteristic), the number of bins for each sample variable can be limited and rare samples can be distributed evenly across different bins to achieve sample balance. Screening sample variables under this binning mechanism can improve model training efficiency and accuracy.
- establishing a scorecard model based on the sample variables obtained by the screening includes: screening target samples from the training samples and extracting the sample features of the target samples; performing reinforcement learning on the sample features to obtain more derived samples; training the basic model with the training samples and the derived samples to obtain the scorecard model, calculating the accuracy of the scorecard model, and comparing whether the accuracy reaches the threshold; if not, generating regenerated samples based on the derived samples, using the regenerated samples as the current derived samples, and returning to the step of training the basic model with the training samples and the derived samples, obtaining the scorecard model, calculating its accuracy and comparing whether the accuracy reaches the threshold, until the accuracy reaches the threshold.
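The neighbour-selection rule in the two paragraphs above can be sketched as a small decision function. The function name `choose_neighbor` and its parameters are hypothetical; the chi-square values and bin ratios are assumed to have been computed beforehand, and at least two bins are assumed to exist.

```python
def choose_neighbor(index, n_bins, chi_prev, chi_next, ratio_prev, ratio_next):
    """Return the index of the adjacent bin that bin `index` should be merged with."""
    if index == 0:
        return 1  # the first bin only has a next neighbour
    if index == n_bins - 1:
        return index - 1  # the last bin only has a previous neighbour
    if chi_prev < chi_next:
        return index - 1  # previous neighbour is more similar
    if chi_prev > chi_next:
        return index + 1  # next neighbour is more similar
    # Equal chi-square values: prefer the neighbour with the smaller bin ratio.
    return index - 1 if ratio_prev <= ratio_next else index + 1


# Hypothetical inputs for a variable with four bins.
first_bin_target = choose_neighbor(0, 4, 0.0, 2.0, 0.1, 0.2)
tie_target = choose_neighbor(2, 4, 1.0, 1.0, 0.3, 0.1)
```

When both the chi-square values and the bin ratios are equal, this sketch falls back to the previous neighbour; the text does not specify that case.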
- Training samples include good samples and bad samples.
- the same monitoring object often does not always have risky behavior. It is possible that a period of time (recorded as a white period) does not have risky behavior, and a period of time (recorded as a black period) has risky behavior.
- for example, if company A is exposed to financial fraud risk from 2017.08 to 2017.11 and, after being required by the regulatory department to rectify, revises the financial data for that period, then the financial data exposed from 2017.08 to 2017.11 may be used as a bad sample.
- the revised monitoring data can be used as a good sample.
- Training samples also include gray samples. In fact, most of the monitored objects are in the gray period between the white period and the black period. The gray period refers to the period when there may be risky behavior but it is not exposed. The number of
- the training samples have corresponding classification labels.
- the server obtains the training samples and inputs the training samples into the base classifier to obtain the model classification results.
- the server compares whether the model classification result is consistent with the corresponding classification label. If not, the server marks the existing sample as the target sample.
- the target sample refers to a bad sample that actually has risky behavior but is not recognized by the scorecard model. Sample characteristics include normal indicators of bad samples and one or more abnormal indicators.
- the server extracts the sample features of the target sample. Specifically, the server obtains sample data corresponding to the target sample; preprocesses the sample data to obtain multiple sample indicators. The server marks one or more of the sample indicators as abnormal indicators according to the penalty documents published by the regulatory authorities such as the CSRC, and then determines the indicator type of the sample indicators. Indicator types include normal indicators and abnormal indicators.
- the server performs reinforcement learning on the sample features to obtain more derived samples.
- the server pre-stores reinforcement learning rules corresponding to multiple sample indexes.
- Reinforcement learning rules include the corresponding increase or decrease of various sample indicators.
- the reinforcement learning rules include the first-level amplitude and the second-level amplitude that increase or decrease the sample index value.
- the first-level amplitude refers to the extent to which the sample index value increases or decreases when reinforcement learning is performed on the sample index for the first time according to the reinforcement learning rule; the second-level amplitude refers to the extent to which the sample index value increases or decreases when reinforcement learning is performed on the same sample index for the second time according to the reinforcement learning rule, and so on.
- the server increases the abnormal index according to the increase amplitude; or reduces the abnormal index according to the decrease amplitude.
- the server combines multiple abnormal indicators after the enhanced processing to obtain multiple indicator combinations.
- the server generates a derivative sample based on the normal indicators and each indicator combination after the enhanced processing.
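The generation of derived samples can be sketched as below. The sketch assumes that an amplitude is a relative change applied multiplicatively to the indicator value, and that the indicator combinations are the Cartesian product of the adjusted values of each abnormal indicator; neither detail is stated in the text. The function name `derive_samples` and the indicator names are hypothetical.

```python
from itertools import product


def derive_samples(normal, abnormal, amplitudes):
    """Build derived samples from normal indicators plus adjusted abnormal indicators.

    normal, abnormal: dicts mapping indicator name to value.
    amplitudes: dict mapping each abnormal indicator to a list of relative changes,
    for example 0.5 for a 50% increase or -0.5 for a 50% decrease.
    """
    names = sorted(abnormal)
    options = [[abnormal[name] * (1 + a) for a in amplitudes[name]] for name in names]
    derived = []
    for combo in product(*options):
        sample = dict(normal)  # normal indicators are copied unchanged
        sample.update(zip(names, combo))
        derived.append(sample)
    return derived


# Hypothetical target sample with one normal and two abnormal indicators.
derived = derive_samples(
    normal={"revenue": 100.0},
    abnormal={"receivables": 200.0, "inventory": 50.0},
    amplitudes={"receivables": [0.5, 1.0], "inventory": [-0.5]},
)
```

Each derived sample keeps the normal indicators of the target sample and replaces the abnormal indicators with one combination of adjusted values.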
- the server uses training samples and derived samples to perform semi-supervised training on the basic model to obtain a scorecard model.
- the base classifier may be a Gradient Boosting Decision Tree (GBDT) model.
- the basic model can also be other models, which is not limited. Good samples and bad samples have clear labels for whether they are fraudulent (recorded as classification labels), while gray samples have no clear classification labels. In other words, the labeled sample data is scarce, while the unlabeled sample data is many. Based on limited good samples, bad samples, and a large number of gray samples, the semi-supervised training of the basic model makes the trained scorecard model more in line with the actual situation, which can improve the accuracy of model monitoring.
- the scorecard model can capture the credit risk caused by more situations, and then gradually identify Increase the credit risk in more and more situations and improve the accuracy of the model.
- a large number of gray period sample data are used as samples to conduct semi-supervised training on the model, which can enable the model to learn as much as possible of the risk behavior characteristics of most companies in the normal state, which can further improve the model accuracy and can be early. Identify companies that are still in the grey period but have signs of risky behavior.
- although the steps in the flowcharts of FIG. 2 and FIG. 3 are displayed in the order indicated by the arrows, the steps are not necessarily executed in that order. Unless clearly stated herein, the execution of these steps is not strictly limited in order, and the steps can be executed in other orders. Moreover, at least some of the steps in FIG. 2 and FIG. 3 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
- a device for establishing a score card model including: a data binning module 402, a binning combining module 404, and a model building module 406, wherein:
- the data binning module 402 is used to obtain sample data of multiple training samples; the sample data includes multiple sample variables; binning operations are performed on each sample variable;
- the bin merging module 404 is used to determine the number of bins corresponding to each sample variable and compare whether the number of bins exceeds the threshold; if so, to calculate, for each bin of the sample variable, the bin proportion, the bad sample rate, and the chi-square value with the adjacent bin; and to merge the bins of the sample variable according to the bin proportion, bad sample rate, and chi-square value, and return to the step of determining the number of bins corresponding to each sample variable;
- the model building module 406 is used to calculate the WOE value of each sample variable when the number of bins corresponding to the sample variable is less than or equal to the threshold, screen the sample variables according to the WOE values, and establish a scorecard model based on the screened sample variables.
- the data binning module 402 is also used to identify the associated samples of the training samples and crawl the associated data of the associated samples, the associated data including multiple associated variables; to receive the model configuration information sent by the terminal, extract derivative factors from the model configuration information, and obtain the derived variable of the training samples corresponding to each derivative factor; and to perform the binning operation on each sample variable, associated variable, and derived variable.
- the bin merging module 404 is also used to determine the monotonic characteristic of the bins according to the bad sample rate; to identify the bins that do not conform to the monotonic characteristic, whose bad sample rate equals the preset value, whose bin proportion is the smallest, or whose chi-square value is the smallest, and record each of them as a bin to be merged; and to merge the bin to be merged with the preceding adjacent bin or the following adjacent bin.
- the bin merging module 404 is also used to count the number of bins whose bad sample rate follows each monotonic trend; to determine the monotonic trend with the largest number of bins; and to determine the monotonic characteristic of the corresponding sample variable according to the monotonic trend with the largest number of bins.
- the bin merging module 404 is further used to calculate the chi-square value between the bin to be merged and the preceding adjacent bin, recorded as the first chi-square value;
- to calculate the chi-square value between the bin to be merged and the following adjacent bin, recorded as the second chi-square value; to compare whether the first chi-square value is equal to the second chi-square value; if so, to merge the bin to be merged with whichever of the preceding and following adjacent bins has the smaller bin proportion; otherwise, to merge the bin to be merged with whichever of the preceding and following adjacent bins has the smaller chi-square value.
- the model building module 406 is also used to screen target samples from the training samples and extract the sample features of the target samples; to perform reinforcement learning on the sample features to obtain more derived samples; to train the basic model with the training samples and derived samples to obtain the scorecard model, calculate the accuracy of the scorecard model, and compare whether the accuracy reaches the threshold; if not, to generate regenerated samples based on the derived samples; and to take the regenerated samples as the current derived samples and return to the step of training the basic model with the training samples and derived samples, calculating the accuracy of the scorecard model, and comparing whether the accuracy reaches the threshold, until the accuracy reaches the threshold.
- Each module in the above-mentioned scorecard model building device may be implemented in whole or in part by software, hardware, and a combination thereof.
- the above modules may be embedded in, or independent of, the processor of the computer device in hardware form, or may be stored in the memory of the computer device in software form so that the processor can call and execute the operations corresponding to the above modules.
- a computer device is provided.
- the computer device may be a server, and its internal structure may be as shown in FIG. 5.
- the computer device includes a processor, memory, network interface, and database connected by a system bus. Among them, the processor of the computer device is used to provide computing and control capabilities.
- the memory of the computer device includes a non-volatile storage medium and an internal memory.
- the non-volatile storage medium stores an operating system, computer-readable instructions, and a database.
- the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
- the database of the computer device is used to store the sample data of the training samples and the related data of the related samples.
- the network interface of the computer device is used to communicate with external terminals through a network connection.
- FIG. 5 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
- the specific computer device may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
- One or more non-volatile storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the method for establishing a scorecard model provided in any embodiment of the present application.
- Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
- Volatile memory can include random access memory (RAM) or external cache memory.
- RAM random access memory
- DRAM dynamic RAM
- SDRAM synchronous DRAM
- DDRSDRAM double data rate SDRAM
- ESDRAM enhanced SDRAM
- SLDRAM synchronous link (Synchlink) DRAM
- RDRAM Rambus direct RAM
- DRDRAM direct Rambus dynamic RAM
- RDRAM Rambus dynamic RAM
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Hardware Design (AREA)
- Evolutionary Computation (AREA)
- Geometry (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
- Complex Calculations (AREA)
Abstract
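A method for establishing a scorecard model, comprising: acquiring sample data of a plurality of training samples, the sample data comprising a plurality of sample variables; performing a binning operation on each sample variable; determining the number of bins corresponding to each sample variable and comparing whether the number of bins exceeds a threshold; if so, calculating, for each bin of the sample variable, the bin proportion, the bad sample rate, and the chi-square value with respect to the adjacent bin; merging the bins of the sample variable according to the bin proportion, the bad sample rate, and the chi-square value, and returning to the step of determining the number of bins corresponding to each sample variable; otherwise, calculating the WOE value of each sample variable, screening the sample variables according to the WOE values, and establishing the scorecard model based on the screened sample variables.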
Description
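This application claims priority to the Chinese patent application No. 201910012412X, filed with the Chinese Patent Office on January 7, 2019 and entitled "评分卡模型的建立方法、装置、计算机设备和存储介质" (Method, apparatus, computer device, and storage medium for establishing a scorecard model), the entire contents of which are incorporated herein by reference.
The present application relates to a method, an apparatus, a computer device, and a storage medium for establishing a scorecard model.
When a classification model is built, continuous variables usually need to be discretized; feature discretization makes the model more stable and reduces the risk of overfitting. For example, when a scorecard model is built with a logistic regression model as the base model, the continuous variables must be discretized, and discretization is usually done by binning. Conventional data binning methods, however, tend to produce too many bins, which lowers model training efficiency and degrades the accuracy of the model output.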
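Summary of the Invention
According to various embodiments disclosed in this application, a method, an apparatus, a computer device, and a storage medium for establishing a scorecard model are provided.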
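A method for establishing a scorecard model, executed by a computer device, the method comprising: acquiring sample data of a plurality of training samples, the sample data comprising a plurality of sample variables; performing a binning operation on each sample variable; determining the number of bins corresponding to each sample variable and comparing whether the number of bins exceeds a threshold; if so, calculating, for each bin of the sample variable, the bin proportion, the bad sample rate, and the chi-square value with respect to the adjacent bin; merging the bins of the sample variable according to the bin proportion, the bad sample rate, and the chi-square value, and returning to the step of determining the number of bins corresponding to each sample variable; and otherwise, calculating the WOE value of each sample variable, screening the sample variables according to the WOE values, and establishing the scorecard model based on the screened sample variables.
In one embodiment, performing the binning operation on each sample variable comprises: identifying associated samples of the training samples and crawling associated data of the associated samples, the associated data comprising a plurality of associated variables; receiving model configuration information sent by a terminal, extracting derivative factors from the model configuration information, and obtaining the derived variable of the training samples corresponding to each derivative factor; and performing the binning operation on each sample variable, associated variable, and derived variable.
In one embodiment, merging the bins of the sample variable according to the bin proportion, the bad sample rate, and the chi-square value comprises: determining the monotonic characteristic of the bins according to the bad sample rate; identifying the bins that do not conform to the monotonic characteristic, whose bad sample rate equals a preset value, whose bin proportion is the smallest, or whose chi-square value is the smallest, and recording each of them as a bin to be merged; and merging the bin to be merged with the preceding adjacent bin or the following adjacent bin.
In one embodiment, determining the monotonic characteristic of the bins according to the bad sample rate comprises: counting the number of bins whose bad sample rate follows each monotonic trend; determining the monotonic trend with the largest number of bins; and determining the monotonic characteristic of the corresponding sample variable according to the monotonic trend with the largest number of bins.
In one embodiment, merging the bin to be merged with the preceding adjacent bin or the following adjacent bin comprises: calculating the chi-square value between the bin to be merged and the preceding adjacent bin, recorded as a first chi-square value; calculating the chi-square value between the bin to be merged and the following adjacent bin, recorded as a second chi-square value; comparing whether the first chi-square value is equal to the second chi-square value; if so, merging the bin to be merged with whichever of the preceding and following adjacent bins has the smaller bin proportion; and otherwise, merging the bin to be merged with whichever of the preceding and following adjacent bins has the smaller chi-square value.
In one embodiment, establishing the scorecard model based on the screened sample variables comprises: screening target samples from the training samples and extracting sample features of the target samples; performing reinforcement learning on the sample features to obtain more derived samples; training a basic model with the training samples and the derived samples to obtain the scorecard model, calculating the accuracy of the scorecard model, and comparing whether the accuracy reaches a threshold; and if not, generating regenerated samples based on the derived samples, taking the regenerated samples as the current derived samples, and returning to the step of training the basic model with the training samples and the derived samples to obtain the scorecard model, calculating the accuracy of the scorecard model, and comparing whether the accuracy reaches the threshold, until the accuracy reaches the threshold.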
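An apparatus for establishing a scorecard model, the apparatus comprising: a data binning module, configured to acquire sample data of a plurality of training samples, the sample data comprising a plurality of sample variables, and to perform a binning operation on each sample variable; a bin merging module, configured to determine the number of bins corresponding to each sample variable and compare whether the number of bins exceeds a threshold; if so, to calculate, for each bin of the sample variable, the bin proportion, the bad sample rate, and the chi-square value with respect to the adjacent bin; and to merge the bins of the sample variable according to the bin proportion, the bad sample rate, and the chi-square value, and return to the step of determining the number of bins corresponding to each sample variable; and a model building module, configured to, when the number of bins corresponding to the sample variable is less than or equal to the threshold, calculate the WOE value of each sample variable, screen the sample variables according to the WOE values, and establish the scorecard model based on the screened sample variables.
In one embodiment, the data binning module is further configured to identify associated samples of the training samples and crawl associated data of the associated samples, the associated data comprising a plurality of associated variables; to receive model configuration information sent by a terminal, extract derivative factors from the model configuration information, and obtain the derived variable of the training samples corresponding to each derivative factor; and to perform the binning operation on each sample variable, associated variable, and derived variable.
A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the processor, implement the steps of the method for establishing a scorecard model provided in any embodiment of this application.
One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the method for establishing a scorecard model provided in any embodiment of this application.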
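The details of one or more embodiments of this application are set forth in the accompanying drawings and the description below. Other features and advantages of this application will become apparent from the description, the drawings, and the claims.
To describe the technical solutions in the embodiments of this application more clearly, the drawings used in the embodiments are briefly introduced below. The drawings described below are merely some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a diagram of an application scenario of the method for establishing a scorecard model according to one or more embodiments.
FIG. 2 is a schematic flowchart of the method for establishing a scorecard model according to one or more embodiments.
FIG. 3 is a schematic flowchart of the bin merging step according to one or more embodiments.
FIG. 4 is a structural block diagram of the apparatus for establishing a scorecard model according to one or more embodiments.
FIG. 5 is a block diagram of a computer device according to one or more embodiments.
To make the technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain this application and are not intended to limit it.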
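The method for establishing a scorecard model provided in this application can be applied in the application environment shown in FIG. 1. The terminal 102 communicates with the server 104 over a network. The terminal 102 may be, but is not limited to, a personal computer, a notebook computer, a smartphone, a tablet computer, or a portable wearable device; the server 104 may be implemented as an independent server or as a server cluster composed of multiple servers.
When a user needs to score a target object, a scoring request can be sent to the server 104 through the terminal 102. The server 104 obtains the monitoring data of the monitored object according to the scoring request and calls the scorecard model to process the monitoring data. The scorecard model may have been trained by the server 104 on the screened sample variables. Specifically, when a scorecard model needs to be established, the user can send a model establishment request to the server 104 through the terminal 102. The server 104 acquires sample data of a plurality of training samples according to the model establishment request. The sample data includes a plurality of sample variables. The server 104 performs a binning operation on each sample variable using an unsupervised or a supervised binning method. The server 104 determines the number of bins corresponding to each sample variable and compares whether the number of bins exceeds the threshold. If the number of bins of a sample variable exceeds the threshold, the server 104 calculates, for each bin of that sample variable, the bin proportion, the bad sample rate, and the chi-square value with respect to the adjacent bin, and determines the monotonic characteristic of the bins according to the bad sample rate. The server 104 identifies the bins that do not conform to the monotonic characteristic, whose bad sample rate equals the preset value, whose bin proportion is the smallest, or whose chi-square value is the smallest, and records each of them as a bin to be merged. The server 104 merges the bin to be merged with the preceding or the following adjacent bin and, after merging, checks again whether the number of bins of the sample variable exceeds the threshold. If it still does, the server 104 merges the bins of the sample variable again in the same way, until the number of bins of the sample variable is less than or equal to the threshold. When the number of bins of the sample variable is less than or equal to the threshold, the server 104 calculates the WOE value of each sample variable, screens the sample variables according to the WOE values, and establishes the scorecard model based on the screened sample variables. In this process, the training samples are balanced using an improved chi-square binning approach, and the scorecard model is trained on the sample sets obtained from this balancing, which can improve model accuracy and reduce the number of bins, and in turn improve model training efficiency.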
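In one embodiment, as shown in FIG. 2, a method for establishing a scorecard model is provided. Taking the application of the method to the server in FIG. 1 as an example, the method includes the following steps:
Step 202: acquire sample data of a plurality of training samples; the sample data includes a plurality of sample variables.
Each training sample is a monitored object of a virtual resource, for example an enterprise or individual that has already been exposed as having credit risk, or one that has not yet been exposed as having credit risk. Virtual resources may be stocks, bonds, and the like. The monitored object serves as the risk monitoring object, and the sample data includes monitoring data of the monitored object in multiple dimensions. For example, the sample data of training sample 1 includes financial variable A1, financial variable A2, judicial variable B1, public opinion variable C1, and industry variable D1; the sample data of training sample 2 includes financial variable A1, financial variable A2, financial variable A3, public opinion variable C2, and industry variable D2.
Step 204: perform a binning operation on each sample variable.
The variable type of each sample variable is determined. Variable types include qualitative variables and quantitative variables.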
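As shown in Table 1 below, if the variable type is qualitative, initial binning is performed according to the attribute values of the qualitative variable.
Table 1
Financial variable A1 | Responded (bad samples) | Not responded (good samples) | Total | Bad sample rate |
---|---|---|---|---|
Very good | 4000 | 16000 | 20000 | 20% |
Good | 3000 | 27000 | 30000 | 10% |
Average | 3000 | 12000 | 15000 | 20% |
Poor | 1500 | 8500 | 10000 | 15% |
Very poor | 1000 | 5000 | 5000 | 10% |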
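As shown in Table 2 below, if the variable type is quantitative, the quantitative variable is initially binned using an unsupervised or a supervised binning method. Unsupervised binning methods include equal-distance binning, equal-width binning, and the like. Supervised binning methods include chi-square binning and the like.
Table 2
Financial variable A2 | Responded (bad samples) | Not responded (good samples) | Total | Bad sample rate |
---|---|---|---|---|
< 100 yuan | 2500 | 47500 | 50000 | 5% |
[100,200] | 3000 | 27000 | 30000 | 10% |
[200,500] | 3000 | 12000 | 15000 | 20% |
[500,700] | 1500 | 8500 | 10000 | 15% |
[700,900] | 2000 | 8000 | 10000 | 20% |
≥ 900 yuan | 1000 | 4000 | 5000 | 20% |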
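As an illustration of initial unsupervised binning, the following is a minimal sketch of equal-width binning followed by a per-bin summary in the column layout of Table 2. The function names, the sample amounts, and the labels are hypothetical and are not taken from the application.

```python
from collections import Counter

def equal_width_bins(values, n_bins):
    """Assign each value to one of n_bins equal-width intervals (0-based index)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall into index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def bin_summary(bin_ids, is_bad):
    """Per bin: (bad count, good count, total, bad sample rate)."""
    bad = Counter(b for b, y in zip(bin_ids, is_bad) if y)
    total = Counter(bin_ids)
    return {b: (bad[b], total[b] - bad[b], total[b], bad[b] / total[b])
            for b in sorted(total)}

# Hypothetical data: one quantitative variable and a bad-sample flag (1 = bad).
amounts = [50, 120, 180, 450, 620, 880, 950, 1000]
labels = [0, 0, 1, 1, 0, 1, 0, 1]

ids = equal_width_bins(amounts, 4)
summary = bin_summary(ids, labels)
```

Equal-width binning ignores the label entirely, which is why the later merging steps are needed to make the bad sample rate behave sensibly across bins.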
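Step 206: determine the number of bins corresponding to each sample variable and compare whether the number of bins exceeds the threshold.
The threshold for the number of bins can be set freely based on experience, for example 5. Note that different sample variables may have different thresholds.
Step 208: if so, calculate, for each bin of the sample variable, the bin proportion, the bad sample rate, and the chi-square value with respect to the adjacent bin; merge the bins of the sample variable according to the bin proportion, the bad sample rate, and the chi-square value; and return to the step of determining the number of bins corresponding to each sample variable.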
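The bin proportion is the ratio of the number of samples whose variable value falls into the current bin to the number of all training samples that contain the sample variable. For example, the bin proportion of financial variable A2 in the [500,700] bin is 10000/120000. The bad sample rate is the ratio of the number of bad samples in the current bin to the number of all samples in the current bin. For example, the bad sample rate of financial variable A2 in the [500,700] bin is 1500/10000. The chi-square value with respect to the adjacent bin is a statistic from non-parametric testing and is used to test how closely the data in adjacent bins are related.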
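The three per-bin quantities can be sketched as follows, using the counts of Table 2. The chi-square value is computed here as the Pearson statistic of the 2x2 table (bad and good counts of two adjacent bins); the application does not spell out the formula, so this choice and the function names are assumptions.

```python
# (bad samples, good samples) per bin of financial variable A2, in Table 2 order.
bins = [(2500, 47500), (3000, 27000), (3000, 12000),
        (1500, 8500), (2000, 8000), (1000, 4000)]

def bin_proportion(bins, i):
    """Share of all samples that fall into bin i."""
    total = sum(b + g for b, g in bins)
    return (bins[i][0] + bins[i][1]) / total

def bad_rate(bins, i):
    """Share of bad samples inside bin i."""
    b, g = bins[i]
    return b / (b + g)

def chi_square(bin_a, bin_b):
    """Pearson chi-square statistic of the 2x2 table formed by two bins."""
    rows = [bin_a, bin_b]
    n = sum(sum(r) for r in rows)
    col_totals = [rows[0][k] + rows[1][k] for k in range(2)]
    stat = 0.0
    for r in rows:
        row_total = sum(r)
        for k in range(2):
            expected = row_total * col_totals[k] / n
            if expected > 0:
                stat += (r[k] - expected) ** 2 / expected
    return stat

share_4th = bin_proportion(bins, 3)      # the [500,700] bin
rate_4th = bad_rate(bins, 3)
chi_last = chi_square(bins[4], bins[5])  # the last two bins share a 20% bad rate
```

A chi-square value close to zero means the two bins have nearly the same bad sample rate, so merging them loses little information.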
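The server checks whether there is a bin whose bad sample rate is 0 or ∞ and marks any such bin as a bin to be merged. The server identifies the pair of adjacent bins with the smallest chi-square value and marks those adjacent bins as bins to be merged. The server identifies the bin with the smallest bin proportion and marks it as a bin to be merged. The server merges each bin to be merged with the preceding or the following adjacent bin. After merging, the server checks again whether the number of bins of the sample variable still exceeds the threshold. If so, it continues to merge the bins of the sample variable in the same way, until the number of bins corresponding to the sample variable is less than or equal to the threshold.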
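The repeat-until-threshold loop can be sketched as below. This is a deliberately reduced version that uses only the smallest-chi-square criterion to choose what to merge; the other criteria described above (bad sample rate, bin proportion) are omitted, and the function names are hypothetical.

```python
def chi_square(a, b):
    """Pearson chi-square of the 2x2 table [[a_bad, a_good], [b_bad, b_good]]."""
    (a1, a2), (b1, b2) = a, b
    n = a1 + a2 + b1 + b2
    denom = (a1 + a2) * (b1 + b2) * (a1 + b1) * (a2 + b2)
    return 0.0 if denom == 0 else n * (a1 * b2 - a2 * b1) ** 2 / denom

def merge_until(bins, max_bins):
    """Merge the most similar adjacent pair until at most max_bins bins remain."""
    bins = list(bins)
    while len(bins) > max_bins:
        # Index i of the adjacent pair (i, i + 1) with the smallest chi-square value.
        i = min(range(len(bins) - 1),
                key=lambda k: chi_square(bins[k], bins[k + 1]))
        merged = (bins[i][0] + bins[i + 1][0], bins[i][1] + bins[i + 1][1])
        bins[i:i + 2] = [merged]
    return bins

# (bad, good) counts from Table 2; the threshold of 5 is the example value in the text.
table2 = [(2500, 47500), (3000, 27000), (3000, 12000),
          (1500, 8500), (2000, 8000), (1000, 4000)]
result = merge_until(table2, 5)
```

With these counts the last two bins, which share a 20% bad sample rate, are merged first, leaving five bins.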
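Step 210: otherwise, calculate the WOE value of each sample variable, screen the sample variables according to the WOE values, and establish the scorecard model based on the screened sample variables.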
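A minimal sketch of the WOE calculation follows. The application does not give the formula; the version here uses one common convention, WOE = ln(bad share / good share) per bin (the opposite sign convention is also widely used), and adds the information value (IV) as one possible screening score. Both choices, and the function name, are assumptions.

```python
import math

def woe_and_iv(bins):
    """bins: list of (bad, good) counts per bin. Returns (WOE per bin, IV).

    Every bin must contain at least one bad and one good sample, otherwise
    the logarithm is undefined; merging such bins beforehand avoids this.
    """
    total_bad = sum(b for b, _ in bins)
    total_good = sum(g for _, g in bins)
    woes, iv = [], 0.0
    for bad, good in bins:
        bad_share = bad / total_bad
        good_share = good / total_good
        woe = math.log(bad_share / good_share)
        woes.append(woe)
        iv += (bad_share - good_share) * woe
    return woes, iv

# Table 2 counts after the last two bins have been merged (five bins).
merged = [(2500, 47500), (3000, 27000), (3000, 12000),
          (1500, 8500), (3000, 12000)]
woes, iv = woe_and_iv(merged)
```

Under this convention a negative WOE marks a bin with a lower-than-average bad sample rate. A variable whose IV falls below a chosen cut-off could then be dropped; the cut-off itself is a modelling decision that the application leaves open.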
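The training samples are balanced using the improved binning method, and the scorecard model is trained on the sample sets obtained from this balancing, which can improve model accuracy and reduce the number of bins, and in turn improve model training efficiency.
In this embodiment, sample data of a plurality of training samples is acquired and the sample variables contained in the sample data are binned, so that the number of bins corresponding to each sample variable can be determined. When the number of bins exceeds the threshold, the bin proportion, the bad sample rate, and the chi-square value with respect to the adjacent bin are calculated for each bin of the sample variable, and the bins are merged according to these three quantities. When the number of bins is less than or equal to the threshold, the WOE value of each sample variable is calculated and the sample variables are screened according to the WOE values; the scorecard model is then established based on the screened sample variables. Because binning is driven by several conditions at once (bin proportion, bad sample rate, and chi-square value), the number of bins of each sample variable is bounded and scarce samples are spread evenly across bins, achieving sample balance. Screening sample variables under this binning mechanism can therefore improve model training efficiency and accuracy.
In one embodiment, performing the binning operation on each sample variable includes: identifying associated samples of the training samples and crawling associated data of the associated samples, the associated data including a plurality of associated variables; receiving model configuration information sent by the terminal, extracting derivative factors from the model configuration information, and obtaining the derived variable of the training samples corresponding to each derivative factor; and performing the binning operation on each sample variable, associated variable, and derived variable.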
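The server calls a preset risk conduction model. The risk conduction model includes a relationship extraction model and a conduction prediction model. The server crawls social relationship data of the monitored object from designated websites, inputs the social relationship data into the relationship extraction model, determines one or more associated samples corresponding to the training sample, and generates a knowledge graph for the training sample based on the determined associated samples. An associated sample may be an associated object that has an investment relationship, a supply relationship, or another relationship with the monitored object. The knowledge graph includes a monitored object node and a plurality of associated object nodes.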
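The relationship extraction model includes an intimacy measurement sub-model, which is used to calculate the intimacy between each associated sample and the training sample … N(v) denotes the set of nodes adjacent to the training sample node v; the number of adjacent nodes shared by training sample node v and associated sample node w is |N(v)∩N(w)|; the number of nodes adjacent to at least one of training sample node v and associated sample node w is |N(v)∪N(w)|. In another embodiment, depending on the type of association, the intimacy may be an investment ratio, a pledge ratio, a capital contribution ratio, or the like.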
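The formula itself did not survive in this text. Given that the two quantities defined are |N(v)∩N(w)| and |N(v)∪N(w)|, one plausible reading is their ratio, that is, the Jaccard coefficient of the two neighbor sets. The sketch below implements that reading as an assumption only; the graph layout, the node names, and the function name are hypothetical.

```python
def intimacy(graph, v, w):
    """Jaccard coefficient of the neighbor sets of v and w (assumed formula)."""
    nv = graph.get(v, set())
    nw = graph.get(w, set())
    union = nv | nw
    return len(nv & nw) / len(union) if union else 0.0

# Hypothetical knowledge graph: node -> set of adjacent nodes.
graph = {
    "v": {"a", "b", "c"},   # training sample node
    "w": {"b", "c", "d"},   # associated sample node
    "x": set(),             # isolated node
}
score_vw = intimacy(graph, "v", "w")   # 2 shared neighbors out of 4 distinct ones
score_vx = intimacy(graph, "v", "x")
```

A value near 1 means the two objects share almost all of their neighbors, and a value of 0 means they share none.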
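The server crawls the associated data of the associated samples from designated websites. The server inputs the associated data and the corresponding intimacy into the conduction prediction model, calculates the conduction risk score of each associated sample, and marks the conduction risk score as an associated variable. In another embodiment, the server takes the highest of the conduction risk scores of the associated objects as one associated variable, or takes the average of those conduction risk scores as one associated variable; this is not limited here.
The scorecard model supplied by the virtual resource acquisition platform itself only provides a model framework. If the user is not satisfied with the scorecard model used for virtual resource risk analysis, the user can send a model configuration request to the server through the terminal and then modify the scorecard model based on the user's own industry experience. Specifically, the server obtains the corresponding scorecard model according to the model identifier carried in the model configuration request. The server identifies the editable elements in the scorecard model, replaces each editable element with a blank cell, and fills the editable element into the blank cell to obtain a model editing page, which it returns to the terminal. The model editing page allows the user to edit freely on top of the model framework based on industry experience, which makes the model customizable; for example, the user may modify variable weights or change variable values. The model editing page also includes an "Add indicator" button so that the user can add entirely new variables. The server returns the model editing page to the terminal according to the model configuration request. The editing information includes the modified scorecard model.
When the user adds a new variable, the variable name, variable type, and variable value of the new variable are entered on the user terminal in natural language, as a formula, or in a similar form. The variable type may be financial, public opinion, and so on. For example, the user adds the indicator "whether an executive has received a judicial penalty". The server periodically screens the whole network and adds the variables added by users (recorded as derived variables) to the variable library so that the same user or other users can reuse them later. In another embodiment, different users may customize the scorecard model with different logic; to protect each user's customization logic, the customization operations of different users on the scorecard model can be kept in isolated data.
If the added factor is a formula, the server splits the formula, obtains the corresponding component variables from the monitoring data, and applies the preset logical operations to the component variables according to the formula logic to obtain the variable value. If the added variable is expressed in natural language, the user may also configure the logic for obtaining its value, and the value is then obtained automatically according to the configured logic. For example, for the added variable "whether an executive has received a judicial penalty", the server first crawls information about judicial penalties of executives from designated websites and performs subject identification on the executives involved (whether they are the executives to be monitored); if the subject matches, the server performs public opinion analysis on the crawled information to obtain the variable value. The value of an added variable may also simply be entered manually.
The server performs the binning operation on the sample variables, associated variables, and derived variables in the manner described above.
In this embodiment, in addition to extracting sample variables from the sample data of the training sample itself, the associated samples of the training sample are identified and associated variables are extracted from the risk data of the associated samples. Taking associated variables into account in the risk calculation extends the dimensions of risk prediction and can therefore improve its accuracy. Furthermore, providing users with a general scorecard model lowers the threshold for risk prediction, while allowing users to modify the scorecard model based on their own industry experience makes the model customizable, so that the virtual resource platform suits users with or without an industry background.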
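In one embodiment, as shown in FIG. 3, merging the bins of the sample variable according to the bin proportion, the bad sample rate, and the chi-square value, that is, the bin merging step, includes:
Step 302: determine the monotonic characteristic of the bins according to the bad sample rate.
The monotonic characteristic is the property that the bad sample rate follows a continuously increasing or a continuously decreasing trend.
In one embodiment, determining the monotonic characteristic of the bins according to the bad sample rate includes: counting the number of bins whose bad sample rate follows each monotonic trend; determining the monotonic trend with the largest number of bins; and determining the monotonic characteristic of the corresponding sample variable according to the monotonic trend with the largest number of bins.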
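For example, in Table 2 above, financial variable A2 has six bins. Looking at their bad sample rates, bins 1 to 3 increase continuously, bins 3 to 4 decrease continuously, bins 4 to 5 increase continuously, and bins 5 to 6 stay constant. Counting these, the number of bins following a continuously increasing trend is 4 and the number following a continuously decreasing trend is 1, so the monotonic characteristic of financial variable A2 is a continuously increasing trend.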
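The trend vote can be sketched as follows. For simplicity this sketch counts rising and falling transitions between neighboring bins rather than bins, which may differ from the counting convention used in the example above but leads to the same dominant trend for Table 2. The function names are hypothetical.

```python
def dominant_trend(bad_rates):
    """Return 'increasing', 'decreasing', or 'none' by majority of transitions."""
    pairs = list(zip(bad_rates, bad_rates[1:]))
    rising = sum(1 for prev, cur in pairs if cur > prev)
    falling = sum(1 for prev, cur in pairs if cur < prev)
    if rising == falling:
        return "none"
    return "increasing" if rising > falling else "decreasing"

def violating_bins(bad_rates, trend):
    """Indices of bins whose bad rate moves against the trend relative to the previous bin."""
    out = []
    for i in range(1, len(bad_rates)):
        if trend == "increasing" and bad_rates[i] < bad_rates[i - 1]:
            out.append(i)
        elif trend == "decreasing" and bad_rates[i] > bad_rates[i - 1]:
            out.append(i)
    return out

# Bad sample rates of financial variable A2 from Table 2.
rates = [0.05, 0.10, 0.20, 0.15, 0.20, 0.20]
trend = dominant_trend(rates)
offenders = violating_bins(rates, trend)   # 0-based index 3 is the [500,700] bin
```

The bin flagged here is the one whose 15% bad sample rate breaks the otherwise rising sequence, which makes it a natural candidate for merging.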
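Step 304: identify the bins that do not conform to the monotonic characteristic, whose bad sample rate equals the preset value, whose bin proportion is the smallest, or whose chi-square value is the smallest, and record each of them as a bin to be merged.
The server checks whether the sample variable has a bin whose bad sample rate is 0 or ∞ and marks any such bin as a bin to be merged. The server checks whether there is a bin that does not conform to the monotonic characteristic and marks any such bin as a bin to be merged. The server identifies the pair of adjacent bins with the smallest chi-square value and marks those adjacent bins as bins to be merged. The server identifies the bin with the smallest bin proportion and marks it as a bin to be merged.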
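A sketch of collecting the merge candidates by reason is given below. The "0 or ∞" condition is interpreted here as a bin that contains no bad samples or no good samples; that interpretation, the dictionary keys, and the function names are assumptions. The monotonicity check is left out to keep the sketch short.

```python
def chi_square(a, b):
    """Pearson chi-square of the 2x2 table [[a_bad, a_good], [b_bad, b_good]]."""
    (a1, a2), (b1, b2) = a, b
    n = a1 + a2 + b1 + b2
    denom = (a1 + a2) * (b1 + b2) * (a1 + b1) * (a2 + b2)
    return 0.0 if denom == 0 else n * (a1 * b2 - a2 * b1) ** 2 / denom

def merge_candidates(bins):
    """bins: list of (bad, good) counts. Returns candidate bin indices by reason."""
    totals = [b + g for b, g in bins]
    # Bins with only good or only bad samples (assumed meaning of "0 or infinity").
    pure = [i for i, (b, g) in enumerate(bins) if b == 0 or g == 0]
    # Bin holding the smallest share of samples.
    smallest = min(range(len(bins)), key=totals.__getitem__)
    # Adjacent pair with the smallest chi-square value.
    k = min(range(len(bins) - 1),
            key=lambda i: chi_square(bins[i], bins[i + 1]))
    return {"pure": pure, "smallest_share": smallest, "min_chi_pair": (k, k + 1)}

table2 = [(2500, 47500), (3000, 27000), (3000, 12000),
          (1500, 8500), (2000, 8000), (1000, 4000)]
found = merge_candidates(table2)
```

For the Table 2 counts no bin is pure, the last bin holds the smallest share of samples, and the last two bins form the most similar adjacent pair.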
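Step 306: merge the bin to be merged with the preceding adjacent bin or the following adjacent bin.
In one embodiment, merging the bin to be merged with the preceding or the following adjacent bin includes: calculating the chi-square value between the bin to be merged and the preceding adjacent bin, recorded as the first chi-square value; calculating the chi-square value between the bin to be merged and the following adjacent bin, recorded as the second chi-square value; comparing whether the first chi-square value is equal to the second chi-square value; if so, merging the bin to be merged with whichever of the preceding and following adjacent bins has the smaller bin proportion; otherwise, merging the bin to be merged with whichever of the preceding and following adjacent bins has the smaller chi-square value.
The server obtains the bins whose bad sample rate is 0 or ∞ or that do not conform to the monotonic characteristic, and records each as a first bin to be merged. If the first bin to be merged is the first bin, it is merged with the following adjacent bin. If the first bin to be merged is the last bin, it is merged with the preceding adjacent bin.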
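If the first bin to be merged is a bin in the middle of the sample variable, the server obtains its chi-square value with the preceding adjacent bin (the first chi-square value) and with the following adjacent bin (the second chi-square value), and compares the two. If the first chi-square value is greater than the second, the first bin to be merged is merged with the following adjacent bin. If the first chi-square value is less than the second, the first bin to be merged is merged with the preceding adjacent bin. If the two values are equal, the bin to be merged is merged with whichever of the preceding and following adjacent bins has the smaller bin proportion.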
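The choice of merge direction described above can be sketched as follows. The function names are hypothetical, and the behavior when the two neighbors also have equal bin proportions (the preceding bin is chosen) is an assumption, since the text does not cover that case.

```python
def chi_square(a, b):
    """Pearson chi-square of the 2x2 table [[a_bad, a_good], [b_bad, b_good]]."""
    (a1, a2), (b1, b2) = a, b
    n = a1 + a2 + b1 + b2
    denom = (a1 + a2) * (b1 + b2) * (a1 + b1) * (a2 + b2)
    return 0.0 if denom == 0 else n * (a1 * b2 - a2 * b1) ** 2 / denom

def choose_neighbor(bins, i):
    """Index of the adjacent bin that bin i should be merged with."""
    if i == 0:
        return 1                      # first bin: only a following neighbor exists
    if i == len(bins) - 1:
        return i - 1                  # last bin: only a preceding neighbor exists
    first = chi_square(bins[i], bins[i - 1])    # first chi-square value
    second = chi_square(bins[i], bins[i + 1])   # second chi-square value
    if first > second:
        return i + 1
    if first < second:
        return i - 1
    # Equal chi-square values: pick the neighbor with the smaller share of samples.
    return i - 1 if sum(bins[i - 1]) <= sum(bins[i + 1]) else i + 1

table2 = [(2500, 47500), (3000, 27000), (3000, 12000),
          (1500, 8500), (2000, 8000), (1000, 4000)]
target = choose_neighbor(table2, 3)   # the [500,700] bin
```

For the [500,700] bin of Table 2 the chi-square value with the preceding bin is larger than with the following bin, so the sketch selects the following bin.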
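In this embodiment, binning is driven by several conditions at once: bin proportion, bad sample rate, chi-square value between adjacent bins, and the monotonic characteristic. This bounds the number of bins of each sample variable and spreads scarce samples evenly across bins, achieving sample balance. Screening sample variables under this binning mechanism can improve model training efficiency and accuracy.
In one embodiment, establishing the scorecard model based on the screened sample variables includes: screening target samples from the training samples and extracting the sample features of the target samples; performing reinforcement learning on the sample features to obtain more derived samples; training the basic model with the training samples and derived samples to obtain the scorecard model, calculating the accuracy of the scorecard model, and comparing whether the accuracy reaches the threshold; if not, generating regenerated samples based on the derived samples; and taking the regenerated samples as the current derived samples and returning to the step of training the basic model with the training samples and derived samples, calculating the accuracy of the scorecard model, and comparing whether the accuracy reaches the threshold, until the accuracy reaches the threshold.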
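The control flow of this train, evaluate, and regenerate loop can be sketched as below. The training, scoring, and regeneration routines are passed in as callables because the application does not fix their implementation; all parameter names are hypothetical, and the `max_rounds` safeguard is an addition of this sketch, not part of the described method.

```python
def build_model(training, derived, fit, score, regenerate, threshold, max_rounds=10):
    """Retrain with regenerated derived samples until the accuracy reaches the threshold.

    fit(samples) -> model, score(model) -> accuracy, regenerate(derived) -> new derived samples.
    Returns (model, derived samples used in the last round, number of training rounds).
    """
    model = fit(training + derived)
    rounds = 1
    while score(model) < threshold and rounds < max_rounds:
        derived = regenerate(derived)      # regenerated samples become the current derived samples
        model = fit(training + derived)
        rounds += 1
    return model, derived, rounds

# Toy stand-ins, only to exercise the control flow: the "model" is the sample count
# and its "accuracy" grows with the amount of data.
model, used, rounds = build_model(
    training=[1, 2],
    derived=[3],
    fit=len,
    score=lambda m: m / 10,
    regenerate=lambda d: d + d,
    threshold=0.5,
)
```

In a real setting `fit` would train the basic model and `score` would evaluate it on held-out labeled samples.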
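The training samples include good samples and bad samples. The same monitored object often does not exhibit risk behavior all the time; it may show no risk behavior during one period (recorded as a white period) and show risk behavior during another (recorded as a black period). For example, enterprise A was exposed for financial fraud during 2017.08 to 2017.11 and, after being ordered by the regulator to rectify, corrected its financial data for that period. The financial data exposed for 2017.08 to 2017.11 can then serve as bad samples, and the corresponding corrected monitoring data can serve as good samples. The training samples also include gray samples. In practice, most monitored objects are in a gray period between the white period and the black period. A gray period is a period in which risk behavior may exist but has not been exposed. The numbers of good samples and bad samples are limited, whereas gray samples are relatively plentiful.
The training samples have corresponding classification labels. The server obtains the training samples and inputs them into the base classifier to obtain the model classification results. The server compares whether each model classification result is consistent with the corresponding classification label. If not, the server marks the sample as a target sample. A target sample is a bad sample that actually exhibits risk behavior but was not identified by the scorecard model. The sample features include the normal indicators of the bad sample and one or more abnormal indicators.
The server extracts the sample features of the target samples. Specifically, the server obtains the sample data corresponding to the target sample and preprocesses the sample data to obtain a plurality of sample indicators. Based on penalty documents published by regulators such as the China Securities Regulatory Commission, the server marks one or more of the sample indicators as abnormal indicators and thereby determines the indicator type of each sample indicator. Indicator types include normal indicators and abnormal indicators.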
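The server performs reinforcement learning on the sample features to obtain more derived samples. Specifically, the server stores in advance the reinforcement learning rules corresponding to a plurality of sample indicators. The reinforcement learning rules include the increase amplitude or decrease amplitude corresponding to each sample indicator. In other words, the rules include a first-level amplitude, a second-level amplitude, and so on, by which the sample indicator value is increased or decreased. The first-level amplitude is the extent to which the sample indicator value is increased or decreased when reinforcement learning is applied to the sample indicator for the first time according to the rules; the second-level amplitude is the extent to which the value is increased or decreased when reinforcement learning is applied to the same sample indicator for the second time; and so on. The server increases an abnormal indicator according to the increase amplitude, or decreases it according to the decrease amplitude. The server combines the enhanced abnormal indicators to obtain a plurality of indicator combinations. The server generates one derived sample from the normal indicators and each enhanced indicator combination.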
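One way to read this rule-based generation of derived samples is sketched below. Amplitudes are modelled as signed relative changes, one per level, and an "indicator combination" is taken to be any non-empty subset of the enhanced abnormal indicators; both readings, as well as the indicator names and values, are assumptions made for illustration.

```python
from itertools import combinations

def derive_samples(normal, abnormal, rules, level):
    """Generate derived samples from one target sample.

    normal:   dict indicator -> value, copied unchanged into every derived sample
    abnormal: dict indicator -> value of the abnormal indicators
    rules:    dict indicator -> list of signed relative amplitudes, one per level
    level:    0 for the first-level amplitude, 1 for the second-level amplitude, ...
    """
    enhanced = {k: v * (1 + rules[k][level]) for k, v in abnormal.items()}
    names = sorted(enhanced)
    samples = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            sample = dict(normal)
            sample.update(abnormal)                        # start from the original values
            sample.update({k: enhanced[k] for k in combo})  # apply this combination
            samples.append(sample)
    return samples

# Hypothetical indicators of one target sample.
normal = {"debt_ratio": 0.6}
abnormal = {"revenue": 100.0, "receivables": 200.0}
rules = {"revenue": [0.10, 0.20], "receivables": [-0.10, -0.20]}

derived = derive_samples(normal, abnormal, rules, level=0)
```

Two abnormal indicators yield three derived samples at each level: one per single enhanced indicator and one with both enhanced.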
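The server uses the training samples and derived samples to perform semi-supervised training on the basic model and obtain the scorecard model. Specifically, the base classifier may be a gradient boosting decision tree (GBDT) model. The basic model may also be another model; this is not limited here. Good samples and bad samples carry a clear label indicating whether they are fraudulent (recorded as the classification label), whereas gray samples have no clear classification label. In other words, labeled sample data is scarce and unlabeled sample data is plentiful. Semi-supervised training of the basic model on the limited good and bad samples together with a large number of gray samples brings the trained scorecard model closer to the actual situation, which can improve the accuracy of model monitoring.
In this embodiment, because there are so many credit risk situations, the initial scorecard model has difficulty identifying credit risk in specific situations. Reinforcement learning enables the scorecard model to capture credit risk arising in more situations, and thereby gradually identify credit risk in more and more situations and improve model accuracy. In addition, using a large amount of gray-period sample data for semi-supervised training lets the model learn as much as possible about the risk behavior characteristics of most enterprises in their normal state, which can further improve model accuracy and allows early detection of enterprises that are still in the gray period but show signs of risk behavior.
It should be understood that although the steps in the flowcharts of FIG. 2 and FIG. 3 are displayed in the order indicated by the arrows, the steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be executed in other orders. Moreover, at least some of the steps in FIG. 2 and FIG. 3 may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment and may be executed at different times; their execution order is not necessarily sequential, and they may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.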
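In one embodiment, as shown in FIG. 4, an apparatus for establishing a scorecard model is provided, including a data binning module 402, a bin merging module 404, and a model building module 406, where:
the data binning module 402 is configured to acquire sample data of a plurality of training samples, the sample data including a plurality of sample variables, and to perform a binning operation on each sample variable;
the bin merging module 404 is configured to determine the number of bins corresponding to each sample variable and compare whether the number of bins exceeds the threshold; if so, to calculate, for each bin of the sample variable, the bin proportion, the bad sample rate, and the chi-square value with respect to the adjacent bin; and to merge the bins of the sample variable according to the bin proportion, the bad sample rate, and the chi-square value, and return to the step of determining the number of bins corresponding to each sample variable;
the model building module 406 is configured to, when the number of bins corresponding to the sample variable is less than or equal to the threshold, calculate the WOE value of each sample variable, screen the sample variables according to the WOE values, and establish the scorecard model based on the screened sample variables.
In one embodiment, the data binning module 402 is further configured to identify associated samples of the training samples and crawl associated data of the associated samples, the associated data including a plurality of associated variables; to receive model configuration information sent by the terminal, extract derivative factors from the model configuration information, and obtain the derived variable of the training samples corresponding to each derivative factor; and to perform the binning operation on each sample variable, associated variable, and derived variable.
In one embodiment, the bin merging module 404 is further configured to determine the monotonic characteristic of the bins according to the bad sample rate; to identify the bins that do not conform to the monotonic characteristic, whose bad sample rate equals the preset value, whose bin proportion is the smallest, or whose chi-square value is the smallest, and record each of them as a bin to be merged; and to merge the bin to be merged with the preceding or the following adjacent bin.
In one embodiment, the bin merging module 404 is further configured to count the number of bins whose bad sample rate follows each monotonic trend; to determine the monotonic trend with the largest number of bins; and to determine the monotonic characteristic of the corresponding sample variable according to the monotonic trend with the largest number of bins.
In one embodiment, the bin merging module 404 is further configured to calculate the chi-square value between the bin to be merged and the preceding adjacent bin, recorded as the first chi-square value; to calculate the chi-square value between the bin to be merged and the following adjacent bin, recorded as the second chi-square value; to compare whether the first chi-square value is equal to the second chi-square value; if so, to merge the bin to be merged with whichever of the preceding and following adjacent bins has the smaller bin proportion; otherwise, to merge the bin to be merged with whichever of the preceding and following adjacent bins has the smaller chi-square value.
In one embodiment, the model building module 406 is further configured to screen target samples from the training samples and extract the sample features of the target samples; to perform reinforcement learning on the sample features to obtain more derived samples; to train the basic model with the training samples and derived samples to obtain the scorecard model, calculate the accuracy of the scorecard model, and compare whether the accuracy reaches the threshold; if not, to generate regenerated samples based on the derived samples; and to take the regenerated samples as the current derived samples and return to the step of training the basic model with the training samples and derived samples, calculating the accuracy of the scorecard model, and comparing whether the accuracy reaches the threshold, until the accuracy reaches the threshold.
For the specific limitations of the apparatus for establishing a scorecard model, refer to the limitations of the method for establishing a scorecard model above; they are not repeated here. Each module of the apparatus may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in, or independent of, the processor of the computer device in hardware form, or may be stored in the memory of the computer device in software form so that the processor can call and execute the operations corresponding to each module.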
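In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 5. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database of the computer device stores the sample data of the training samples and the associated data of the associated samples. The network interface of the computer device is used to communicate with external terminals over a network connection. The computer-readable instructions, when executed by the processor, implement a method for establishing a scorecard model.
A person skilled in the art will understand that the structure shown in FIG. 5 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
One or more non-volatile storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the method for establishing a scorecard model provided in any embodiment of this application.
A person of ordinary skill in the art will understand that all or part of the processes of the methods in the above embodiments can be completed by computer-readable instructions that instruct the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database, or another medium used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application. Their description is relatively specific and detailed, but it should not be understood as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. The protection scope of this patent application shall therefore be subject to the appended claims.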
Claims (20)
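- A method for establishing a scorecard model, executed by a computer device, comprising: acquiring sample data of a plurality of training samples, the sample data comprising a plurality of sample variables; performing a binning operation on each sample variable; determining the number of bins corresponding to each sample variable and comparing whether the number of bins exceeds a threshold; if so, calculating, for each bin of the sample variable, the bin proportion, the bad sample rate, and the chi-square value with respect to the adjacent bin; merging the bins of the sample variable according to the bin proportion, the bad sample rate, and the chi-square value, and returning to the step of determining the number of bins corresponding to each sample variable; and otherwise, calculating the WOE value of each sample variable, screening the sample variables according to the WOE values, and establishing the scorecard model based on the screened sample variables.
- The method according to claim 1, wherein performing the binning operation on each sample variable comprises: identifying associated samples of the training samples and crawling associated data of the associated samples, the associated data comprising a plurality of associated variables; receiving model configuration information sent by a terminal, extracting derivative factors from the model configuration information, and obtaining the derived variable of the training samples corresponding to each derivative factor; and performing the binning operation on each sample variable, associated variable, and derived variable.
- The method according to claim 1, wherein merging the bins of the sample variable according to the bin proportion, the bad sample rate, and the chi-square value comprises: determining the monotonic characteristic of the bins according to the bad sample rate; identifying bins that do not conform to the monotonic characteristic and recording them as bins to be merged; identifying bins whose bad sample rate is not equal to a preset value and recording them as bins to be merged; identifying bins whose bin proportion is not equal to the minimum value and recording them as bins to be merged; identifying bins whose chi-square value is not equal to the minimum value and recording them as bins to be merged; and merging the bin to be merged with the preceding adjacent bin or the following adjacent bin.
- The method according to claim 3, wherein determining the monotonic characteristic of the bins according to the bad sample rate comprises: counting the number of bins whose bad sample rate follows each monotonic trend; determining the monotonic trend with the largest number of bins; and determining the monotonic characteristic of the corresponding sample variable according to the monotonic trend with the largest number of bins.
- The method according to claim 3, wherein merging the bin to be merged with the preceding adjacent bin or the following adjacent bin comprises: calculating the chi-square value between the bin to be merged and the preceding adjacent bin, recorded as a first chi-square value; calculating the chi-square value between the bin to be merged and the following adjacent bin, recorded as a second chi-square value; comparing whether the first chi-square value is equal to the second chi-square value; if so, merging the bin to be merged with whichever of the preceding and following adjacent bins has the smaller bin proportion; otherwise, merging the bin to be merged with whichever of the preceding and following adjacent bins has the smaller chi-square value.
- The method according to claim 1, wherein establishing the scorecard model based on the screened sample variables comprises: screening target samples from the training samples and extracting sample features of the target samples; performing reinforcement learning on the sample features to obtain more derived samples; training a basic model with the training samples and the derived samples to obtain the scorecard model, calculating the accuracy of the scorecard model, and comparing whether the accuracy reaches a threshold; if not, generating regenerated samples based on the derived samples; and taking the regenerated samples as the current derived samples and returning to the step of training the basic model with the training samples and the derived samples to obtain the scorecard model, calculating the accuracy of the scorecard model, and comparing whether the accuracy reaches the threshold, until the accuracy reaches the threshold.
- An apparatus for establishing a scorecard model, the apparatus comprising: a data binning module, configured to acquire sample data of a plurality of training samples, the sample data comprising a plurality of sample variables, and to perform a binning operation on each sample variable; a bin merging module, configured to determine the number of bins corresponding to each sample variable and compare whether the number of bins exceeds a threshold; if so, to calculate, for each bin of the sample variable, the bin proportion, the bad sample rate, and the chi-square value with respect to the adjacent bin; and to merge the bins of the sample variable according to the bin proportion, the bad sample rate, and the chi-square value, and return to the step of determining the number of bins corresponding to each sample variable; and a model building module, configured to, when the number of bins corresponding to the sample variable is less than or equal to the threshold, calculate the WOE value of each sample variable, screen the sample variables according to the WOE values, and establish the scorecard model based on the screened sample variables.
- The apparatus according to claim 7, wherein the data binning module is further configured to identify associated samples of the training samples and crawl associated data of the associated samples, the associated data comprising a plurality of associated variables; to receive model configuration information sent by a terminal, extract derivative factors from the model configuration information, and obtain the derived variable of the training samples corresponding to each derivative factor; and to perform the binning operation on each sample variable, associated variable, and derived variable.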
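- A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps: acquiring sample data of a plurality of training samples, the sample data comprising a plurality of sample variables; performing a binning operation on each sample variable; determining the number of bins corresponding to each sample variable and comparing whether the number of bins exceeds a threshold; if so, calculating, for each bin of the sample variable, the bin proportion, the bad sample rate, and the chi-square value with respect to the adjacent bin; merging the bins of the sample variable according to the bin proportion, the bad sample rate, and the chi-square value, and returning to the step of determining the number of bins corresponding to each sample variable; and otherwise, calculating the WOE value of each sample variable, screening the sample variables according to the WOE values, and establishing the scorecard model based on the screened sample variables.
- The computer device according to claim 9, wherein the processor, when executing the computer-readable instructions, further performs the following steps: identifying associated samples of the training samples and crawling associated data of the associated samples, the associated data comprising a plurality of associated variables; receiving model configuration information sent by a terminal, extracting derivative factors from the model configuration information, and obtaining the derived variable of the training samples corresponding to each derivative factor; and performing the binning operation on each sample variable, associated variable, and derived variable.
- The computer device according to claim 9, wherein the processor, when executing the computer-readable instructions, further performs the following steps: determining the monotonic characteristic of the bins according to the bad sample rate; identifying bins that do not conform to the monotonic characteristic and recording them as bins to be merged; identifying bins whose bad sample rate is not equal to a preset value and recording them as bins to be merged; identifying bins whose bin proportion is not equal to the minimum value and recording them as bins to be merged; identifying bins whose chi-square value is not equal to the minimum value and recording them as bins to be merged; and merging the bin to be merged with the preceding adjacent bin or the following adjacent bin.
- The computer device according to claim 11, wherein the processor, when executing the computer-readable instructions, further performs the following steps: counting the number of bins whose bad sample rate follows each monotonic trend; determining the monotonic trend with the largest number of bins; and determining the monotonic characteristic of the corresponding sample variable according to the monotonic trend with the largest number of bins.
- The computer device according to claim 11, wherein the processor, when executing the computer-readable instructions, further performs the following steps: calculating the chi-square value between the bin to be merged and the preceding adjacent bin, recorded as a first chi-square value; calculating the chi-square value between the bin to be merged and the following adjacent bin, recorded as a second chi-square value; comparing whether the first chi-square value is equal to the second chi-square value; if so, merging the bin to be merged with whichever of the preceding and following adjacent bins has the smaller bin proportion; otherwise, merging the bin to be merged with whichever of the preceding and following adjacent bins has the smaller chi-square value.
- The computer device according to claim 9, wherein the processor, when executing the computer-readable instructions, further performs the following steps: screening target samples from the training samples and extracting sample features of the target samples; performing reinforcement learning on the sample features to obtain more derived samples; training a basic model with the training samples and the derived samples to obtain the scorecard model, calculating the accuracy of the scorecard model, and comparing whether the accuracy reaches a threshold; if not, generating regenerated samples based on the derived samples; and taking the regenerated samples as the current derived samples and returning to the step of training the basic model with the training samples and the derived samples to obtain the scorecard model, calculating the accuracy of the scorecard model, and comparing whether the accuracy reaches the threshold, until the accuracy reaches the threshold.
- One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring sample data of a plurality of training samples, the sample data comprising a plurality of sample variables; performing a binning operation on each sample variable; determining the number of bins corresponding to each sample variable and comparing whether the number of bins exceeds a threshold; if so, calculating, for each bin of the sample variable, the bin proportion, the bad sample rate, and the chi-square value with respect to the adjacent bin; merging the bins of the sample variable according to the bin proportion, the bad sample rate, and the chi-square value, and returning to the step of determining the number of bins corresponding to each sample variable; and otherwise, calculating the WOE value of each sample variable, screening the sample variables according to the WOE values, and establishing the scorecard model based on the screened sample variables.
- The storage medium according to claim 15, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed: identifying associated samples of the training samples and crawling associated data of the associated samples, the associated data comprising a plurality of associated variables; receiving model configuration information sent by a terminal, extracting derivative factors from the model configuration information, and obtaining the derived variable of the training samples corresponding to each derivative factor; and performing the binning operation on each sample variable, associated variable, and derived variable.
- The storage medium according to claim 15, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed: determining the monotonic characteristic of the bins according to the bad sample rate; identifying bins that do not conform to the monotonic characteristic and recording them as bins to be merged; identifying bins whose bad sample rate is not equal to a preset value and recording them as bins to be merged; identifying bins whose bin proportion is not equal to the minimum value and recording them as bins to be merged; identifying bins whose chi-square value is not equal to the minimum value and recording them as bins to be merged; and merging the bin to be merged with the preceding adjacent bin or the following adjacent bin.
- The storage medium according to claim 17, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed: counting the number of bins whose bad sample rate follows each monotonic trend; determining the monotonic trend with the largest number of bins; and determining the monotonic characteristic of the corresponding sample variable according to the monotonic trend with the largest number of bins.
- The storage medium according to claim 17, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed: calculating the chi-square value between the bin to be merged and the preceding adjacent bin, recorded as a first chi-square value; calculating the chi-square value between the bin to be merged and the following adjacent bin, recorded as a second chi-square value; comparing whether the first chi-square value is equal to the second chi-square value; if so, merging the bin to be merged with whichever of the preceding and following adjacent bins has the smaller bin proportion; otherwise, merging the bin to be merged with whichever of the preceding and following adjacent bins has the smaller chi-square value.
- The storage medium according to claim 15, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed: screening target samples from the training samples and extracting sample features of the target samples; performing reinforcement learning on the sample features to obtain more derived samples; training a basic model with the training samples and the derived samples to obtain the scorecard model, calculating the accuracy of the scorecard model, and comparing whether the accuracy reaches a threshold; if not, generating regenerated samples based on the derived samples; and taking the regenerated samples as the current derived samples and returning to the step of training the basic model with the training samples and the derived samples to obtain the scorecard model, calculating the accuracy of the scorecard model, and comparing whether the accuracy reaches the threshold, until the accuracy reaches the threshold.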
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910012412.XA CN109598095B (zh) | 2019-01-07 | 2019-01-07 | 评分卡模型的建立方法、装置、计算机设备和存储介质 |
CN201910012412.X | 2019-01-07 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020143233A1 true WO2020143233A1 (zh) | 2020-07-16 |
Family
ID=65965053
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/103489 WO2020143233A1 (zh) | 2019-01-07 | 2019-08-30 | 评分卡模型的建立方法、装置、计算机设备和存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109598095B (zh) |
WO (1) | WO2020143233A1 (zh) |
Families Citing this family (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109598095B (zh) * | 2019-01-07 | 2023-08-08 | 平安科技(深圳)有限公司 | 评分卡模型的建立方法、装置、计算机设备和存储介质 |
CN110675029A (zh) * | 2019-08-30 | 2020-01-10 | 阿里巴巴集团控股有限公司 | 商户的动态管控方法、装置、服务器及可读存储介质 |
CN110717650B (zh) * | 2019-09-06 | 2024-07-16 | 深圳平安医疗健康科技服务有限公司 | 单据数据处理方法、装置、计算机设备和存储介质 |
CN110837894B (zh) * | 2019-10-28 | 2024-02-13 | 腾讯科技(深圳)有限公司 | 一种特征处理方法、装置及存储介质 |
CN110807159B (zh) * | 2019-10-30 | 2021-05-11 | 同盾控股有限公司 | 数据标记方法、装置、存储介质及电子设备 |
CN110992043B (zh) * | 2019-11-05 | 2022-08-05 | 支付宝(杭州)信息技术有限公司 | 一种风险实体挖掘的方法和装置 |
CN111105144A (zh) * | 2019-11-26 | 2020-05-05 | 苏宁金融科技(南京)有限公司 | 数据处理方法、装置和目标对象风险监控方法 |
CN111178722B (zh) * | 2019-12-20 | 2023-05-02 | 上海数策软件股份有限公司 | 适用于销售线索评级和分配的机器学习系统、方法及介质 |
CN111859682A (zh) * | 2020-07-24 | 2020-10-30 | 北京睿知图远科技有限公司 | 基于GroupLasso的变量自动选择方法、系统及可读介质 |
CN112232944B (zh) * | 2020-09-29 | 2024-05-31 | 中诚信征信有限公司 | 一种评分卡创建方法、装置和电子设备 |
CN112102074B (zh) * | 2020-10-14 | 2024-01-30 | 深圳前海弘犀智能科技有限公司 | 一种评分卡建模方法 |
CN112330048A (zh) * | 2020-11-18 | 2021-02-05 | 中国光大银行股份有限公司 | 评分卡模型训练方法、装置、存储介质及电子装置 |
CN112766649B (zh) * | 2020-12-31 | 2022-03-15 | 平安科技(深圳)有限公司 | 基于多评分卡融合的目标对象评价方法及其相关设备 |
CN113158947B (zh) * | 2021-04-29 | 2023-04-07 | 重庆长安新能源汽车科技有限公司 | 一种动力电池健康评分方法、系统及存储介质 |
CN113570259A (zh) * | 2021-07-30 | 2021-10-29 | 北京房江湖科技有限公司 | 基于维度模型的数据评估方法和计算机程序产品 |
CN114240215A (zh) * | 2021-12-22 | 2022-03-25 | 中国建设银行股份有限公司 | 用户失联等级获取方法、装置及存储介质 |
CN118261553B (zh) * | 2024-03-11 | 2024-11-01 | 深圳市高斯全球信息技术有限公司 | 一种基于机器学习的签证自动化管理系统 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015163720A1 (ko) * | 2014-04-24 | 2015-10-29 | 세종대학교 산학협력단 | 3차원 영상 생성 방법, 이를 수행하는 3차원 영상 생성 장치 및 이를 저장하는 기록매체 |
CN108334954A (zh) * | 2018-01-22 | 2018-07-27 | 中国平安人寿保险股份有限公司 | 逻辑回归模型的构建方法、装置、存储介质及终端 |
CN108876076A (zh) * | 2017-05-09 | 2018-11-23 | 中国移动通信集团广东有限公司 | 基于指令数据的个人信用评分方法及装置 |
CN108959187A (zh) * | 2018-04-09 | 2018-12-07 | 中国平安人寿保险股份有限公司 | 一种变量分箱方法、装置、终端设备及存储介质 |
CN109598095A (zh) * | 2019-01-07 | 2019-04-09 | 平安科技(深圳)有限公司 | 评分卡模型的建立方法、装置、计算机设备和存储介质 |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108366045B (zh) * | 2018-01-02 | 2020-09-01 | 北京奇艺世纪科技有限公司 | 一种风控评分卡的设置方法和装置 |
CN108416495B (zh) * | 2018-01-30 | 2021-02-26 | 杭州排列科技有限公司 | 基于机器学习的评分卡模型建立方法及装置 |
CN108765127A (zh) * | 2018-04-26 | 2018-11-06 | 浙江邦盛科技有限公司 | 一种基于蒙特卡罗搜索的信用评分卡特征选择方法 |
CN108830707A (zh) * | 2018-06-05 | 2018-11-16 | 重庆小雨点小额贷款有限公司 | 基于最大化iv的数据分组方法、装置、储存介质及设备 |
-
2019
- 2019-01-07 CN CN201910012412.XA patent/CN109598095B/zh active Active
- 2019-08-30 WO PCT/CN2019/103489 patent/WO2020143233A1/zh active Application Filing
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112711765A (zh) * | 2020-12-30 | 2021-04-27 | 深圳前海微众银行股份有限公司 | 样本特征的信息价值确定方法、终端、设备和存储介质 |
CN112836765A (zh) * | 2021-03-01 | 2021-05-25 | 深圳前海微众银行股份有限公司 | 分布式学习的数据处理方法、装置、电子设备 |
CN112836765B (zh) * | 2021-03-01 | 2023-12-22 | 深圳前海微众银行股份有限公司 | 分布式学习的数据处理方法、装置、电子设备 |
CN113051317A (zh) * | 2021-04-09 | 2021-06-29 | 上海云从企业发展有限公司 | 一种数据探查方法和系统、数据挖掘模型更新方法和系统 |
CN113051317B (zh) * | 2021-04-09 | 2024-05-28 | 上海云从企业发展有限公司 | 一种数据挖掘模型更新方法、系统、计算机设备及可读介质 |
CN113657481A (zh) * | 2021-08-13 | 2021-11-16 | 上海晓途网络科技有限公司 | 一种模型构建系统及方法 |
CN114266641A (zh) * | 2021-09-27 | 2022-04-01 | 东方微银科技股份有限公司 | 基于逻辑回归和规则的评分模型构建方法 |
CN114298532A (zh) * | 2021-12-27 | 2022-04-08 | 智慧芽信息科技(苏州)有限公司 | 评分卡模型生成方法、使用方法、装置、设备及存储介质 |
CN115229846A (zh) * | 2022-04-26 | 2022-10-25 | 浙江大学 | 一种机械臂运动平稳性自动评分系统 |
Also Published As
Publication number | Publication date |
---|---|
CN109598095B (zh) | 2023-08-08 |
CN109598095A (zh) | 2019-04-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 19908153 Country of ref document: EP Kind code of ref document: A1 |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 19908153 Country of ref document: EP Kind code of ref document: A1 |