CN112711765A - Sample characteristic information value determination method, terminal, device and storage medium - Google Patents

Sample characteristic information value determination method, terminal, device and storage medium

Info

Publication number
CN112711765A
Authority
CN
China
Prior art keywords
sample
value
samples
information
terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011619669.0A
Other languages
Chinese (zh)
Other versions
CN112711765B (en
Inventor
康焱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd
Priority to CN202011619669.0A
Publication of CN112711765A
Application granted
Publication of CN112711765B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/602 Providing cryptographic facilities or services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Finance (AREA)
  • Artificial Intelligence (AREA)
  • Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Technology Law (AREA)
  • Computing Systems (AREA)
  • Accounting & Taxation (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioethics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to the technical field of financial technology and discloses a method, a terminal, a device and a storage medium for determining the information value of sample features. The method comprises the following steps: a first terminal obtains first sample information corresponding to first samples in a second terminal, wherein the first sample information comprises an identifier of each first sample and a binning identifier of the feature of each first sample, and the second terminal performs a binning operation on the feature values of the feature of the first samples to obtain the binning identifier corresponding to the feature of each first sample; determining a label value corresponding to each first sample according to the identifier; determining the information value of each first sample on the feature according to the label value and the binning identifier; and encrypting the information value of each first sample on the feature, and sending the encrypted information value to the second terminal to complete the determination of the information value of the sample feature in a vertical federated learning scenario. The invention avoids the data leakage caused by calculating the information value.

Description

Sample characteristic information value determination method, terminal, device and storage medium
Technical Field
The invention relates to the technical field of financial technology (Fintech), and in particular to a method, a terminal, a device and a storage medium for determining the information value of sample features.
Background
With the development of computer technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology (Fintech). However, the security and real-time requirements of the financial industry also place higher demands on these technologies.
In the field of financial risk control, the Information Value (IV) of a sample feature is an important index for measuring the predictive power of the feature. Calculating the information value requires both the labels of the samples and their feature values, and the labels and the feature values are held by different parties. At present, determining the information value requires combining the features of party A with the labels of party B, but in this calculation mode the feature values or the labels must be sent to the other party, so that the other party learns the real information. That is, the existing information value calculation has a data leakage problem.
Disclosure of Invention
The invention mainly aims to provide a method, a terminal, equipment and a readable storage medium for determining information value of sample characteristics, and aims to solve the problem of data leakage in the existing information value calculation.
In order to achieve the above object, the present invention provides a method for determining an information value of a sample feature, including:
a first terminal acquires first sample information corresponding to a first sample in a second terminal, wherein the first sample information comprises an identifier of the first sample and a binning identifier of the characteristic of the first sample, and the second terminal performs binning operation on the characteristic value of the characteristic of each first sample to obtain a binning identifier corresponding to the characteristic of each first sample;
determining a label value corresponding to the first sample according to the identifier;
determining the information value of each first sample on the feature according to the label value and the binning identifier;
and encrypting the information value of each first sample on the feature, and sending the encrypted information value to the second terminal to complete the determination of the information value of the sample feature in a vertical federated learning scenario.
In one embodiment, the step of determining the information value of each first sample on the feature according to the label value and the binning identifier comprises:
determining sample attributes of the first sample according to the label value of the first sample, the sample attributes including positive sample and negative sample;
determining a first number of second samples and a second number of third samples corresponding to each bin according to the sample attributes, wherein the second samples are the first samples that are positive samples, the third samples are the first samples that are negative samples, and the bins are determined according to the binning identifiers;
determining a first total number of the second samples and a second total number of the third samples;
determining the information value of each bin according to the first number, the second number, the first sample, the first total number and the second total number of the bins;
and summing the information values of the bins to obtain the information value of the first sample on the feature.
In an embodiment, the step of determining the information value of each bin according to the first number, the second number, the first sample, the first total number and the second total number of the bins comprises:
determining a first ratio of the first number of the bin to the first total number and a second ratio of the second number of the bin to the second total number;
and determining the information value of the bin according to the first ratio and the second ratio.
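The ratio-based calculation above can be sketched in Python. This is a minimal sketch, not the patented implementation: it assumes the conventional information value definition, in which a bin contributes (first ratio - second ratio) * ln(first ratio / second ratio). The function names, the `bin_counts` layout and the `eps` guard against empty bins are assumptions introduced for illustration.

```python
import math

def bin_information_value(pos_in_bin, neg_in_bin, pos_total, neg_total, eps=1e-6):
    """IV contribution of one bin from its positive and negative counts."""
    if pos_total == 0 or neg_total == 0:
        raise ValueError("both positive and negative samples are required")
    # First ratio: share of all positive samples that fall in this bin.
    first_ratio = max(pos_in_bin / pos_total, eps)
    # Second ratio: share of all negative samples that fall in this bin.
    # eps (an assumption of this sketch) keeps the logarithm finite
    # when a bin contains no samples of one class.
    second_ratio = max(neg_in_bin / neg_total, eps)
    return (first_ratio - second_ratio) * math.log(first_ratio / second_ratio)

def feature_information_value(bin_counts):
    """bin_counts maps a binning identifier to (positives, negatives)."""
    pos_total = sum(p for p, _ in bin_counts.values())
    neg_total = sum(n for _, n in bin_counts.values())
    # The feature's IV is the sum of the per-bin information values.
    return sum(
        bin_information_value(p, n, pos_total, neg_total)
        for p, n in bin_counts.values()
    )
```

For example, `feature_information_value({"A": (1, 3), "B": (3, 1)})` sums the contributions of two bins whose positive and negative shares differ, while bins with identical shares contribute zero.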
In an embodiment, the step of determining the first number of the second samples and the second number of the third samples corresponding to each bin according to the sample attributes includes:
determining the first samples that have the same binning identifier as the samples contained in the bin corresponding to that binning identifier;
determining, among the samples contained in each bin, the first number of second samples and the second number of third samples.
In an embodiment, the step of determining the label value corresponding to the first sample according to the identifier includes:
determining second sample information corresponding to the first sample according to the identifier, wherein the second sample information comprises the identifier and a label value corresponding to the identifier;
and determining a label value corresponding to the first sample according to the second sample information.
In an embodiment, the step of the first terminal obtaining the first sample information corresponding to the first sample in the second terminal includes:
the first terminal receives the encrypted information sent by the second terminal;
and decrypting the encrypted information to obtain first sample information corresponding to the first sample in the second terminal.
In an embodiment, after the step of determining the information value of each first sample on the feature according to the label value and the binning identifier, the method further includes:
determining the interval in which the information value of the feature lies;
and storing the predictive capability corresponding to the interval in association with the feature.
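The interval lookup can be illustrated with a short Python sketch. The interval boundaries and capability names below are assumptions: they follow a commonly quoted rule of thumb for information value and are not specified in this text. `feature_store` is a hypothetical in-memory stand-in for wherever the association is persisted.

```python
import bisect

# Hypothetical interval boundaries and names (rule of thumb, not from the patent).
IV_BOUNDS = [0.02, 0.1, 0.3, 0.5]
IV_CAPABILITIES = ["not predictive", "weak", "medium", "strong", "suspiciously strong"]

def predictive_capability(iv):
    """Return the capability name of the interval that contains iv."""
    # bisect_right counts how many boundaries are <= iv, which is
    # exactly the index of the half-open interval [lower, upper).
    return IV_CAPABILITIES[bisect.bisect_right(IV_BOUNDS, iv)]

feature_store = {}

def store_capability(feature_name, iv):
    """Store the information value and its capability alongside the feature."""
    feature_store[feature_name] = {
        "iv": iv,
        "capability": predictive_capability(iv),
    }
    return feature_store[feature_name]
```

A lookup table with `bisect` keeps the boundaries in one place, so a different set of intervals can be substituted without touching the lookup logic.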
In order to achieve the above object, the present invention further provides a terminal, including:
an obtaining module, configured to obtain, by a first terminal, first sample information corresponding to a first sample in a second terminal, where the first sample information includes an identifier of the first sample and a binning identifier of a feature of the first sample, and the second terminal performs binning operation on a feature value of the feature of each first sample to obtain a binning identifier corresponding to the feature of each first sample;
a determining module, configured to determine, according to the identifier, a label value corresponding to the first sample;
the determining module is further configured to determine the information value of each first sample on the feature according to the label value and the binning identifier;
and an encryption module, configured to encrypt the information value of each first sample on the feature and send the encrypted information value to the second terminal, so as to complete the determination of the information value of the sample feature in a vertical federated learning scenario.
To achieve the above object, the present invention further provides an apparatus including a memory, a processor, and a determination program stored in the memory and executable on the processor, the determination program, when executed by the processor, implementing the information value determination method of the sample feature as described above.
To achieve the above object, the present invention also provides a storage medium storing a determination program that, when executed by a processor, implements the information value determination method of the sample feature as described above.
To achieve the above object, the present invention also provides a computer program product comprising a computer program which, when executed by a processor, implements the method for determining an information value of a sample feature as described above.
The invention provides a method, a terminal, a device and a storage medium for determining the information value of sample features. A second terminal performs a binning operation on the feature values of the feature of the samples to obtain a binning identifier for each sample; a first terminal obtains the identifiers and binning identifiers corresponding to the first samples in the second terminal, determines the label values of the corresponding samples in the first terminal according to the identifiers, determines the information value of each sample on the feature according to the label values and the binning identifiers, encrypts the information value of each first sample on the feature, and sends the encrypted information value to the second terminal to complete the determination of the information value of the sample feature in a vertical federated learning scenario. In the prior art, the information value is determined by sending the feature values or the labels of the samples to the other party. In the present invention, by contrast, only the binning identifiers that represent the feature values of the samples are sent to the other party, so the information value of the sample feature can be determined without sending the feature values themselves, and data leakage is avoided.
Drawings
FIG. 1 is a schematic diagram of a hardware structure of a terminal/device in a hardware operating environment according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a first embodiment of a method for determining information value of a sample feature according to the present invention;
FIG. 3 is a schematic flow chart of a second embodiment of a method for determining information value of a sample feature according to the present invention;
FIG. 4 is a functional block diagram of the terminal according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Referring to fig. 1, fig. 1 is a schematic hardware structure diagram of a hardware operating environment related to a terminal or a device according to an embodiment of the present invention.
As shown in fig. 1, the terminal/device may include: a processor 1001, such as a CPU, a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to enable communication between these components. The user interface 1003 may include a display screen (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory). The memory 1005 may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the configuration shown in fig. 1 does not constitute a limitation of the terminal or device, which may include more or fewer components than those shown, combine some components, or arrange the components differently.
As shown in fig. 1, a memory 1005, which is a kind of computer storage medium, may include therein an operating system, a network communication module, a user interface module, and a determination program.
In the terminal or device shown in fig. 1, the network interface 1004 is mainly used for connecting to a background server and performing data communication with the background server; the user interface 1003 is mainly used for connecting a client and performing data communication with the client; and the processor 1001 may be configured to call the determination program stored in the memory 1005 and perform the following operations:
a first terminal acquires first sample information corresponding to a first sample in a second terminal, wherein the first sample information comprises an identifier of the first sample and a binning identifier of the characteristic of the first sample, and the second terminal performs binning operation on the characteristic value of the characteristic of each first sample to obtain a binning identifier corresponding to the characteristic of each first sample;
determining a label value corresponding to the first sample according to the identifier;
determining the information value of each first sample on the feature according to the label value and the binning identifier;
and encrypting the information value of each first sample on the feature, and sending the encrypted information value to the second terminal to complete the determination of the information value of the sample feature in a vertical federated learning scenario.
In one embodiment, the processor 1001 may call the determination program stored in the memory 1005, and further perform the following operations:
determining sample attributes of the first sample according to the label value of the first sample, the sample attributes including positive sample and negative sample;
determining a first number of second samples and a second number of third samples corresponding to each bin according to the sample attributes, wherein the second samples are the first samples that are positive samples, the third samples are the first samples that are negative samples, and the bins are determined according to the binning identifiers;
determining a first total number of the second samples and a second total number of the third samples;
determining the information value of each bin according to the first number, the second number, the first sample, the first total number and the second total number of the bins;
and summing the information values of the bins to obtain the information value of the first sample on the feature.
In one embodiment, the processor 1001 may call the determination program stored in the memory 1005, and further perform the following operations:
determining a first ratio of the first number of the bin to the first total number and a second ratio of the second number of the bin to the second total number;
and determining the information value of the bin according to the first ratio and the second ratio.
In one embodiment, the processor 1001 may call the determination program stored in the memory 1005, and further perform the following operations:
determining the first samples that have the same binning identifier as the samples contained in the bin corresponding to that binning identifier;
determining, among the samples contained in each bin, the first number of second samples and the second number of third samples.
In one embodiment, the processor 1001 may call the determination program stored in the memory 1005, and further perform the following operations:
determining second sample information corresponding to the first sample according to the identifier, wherein the second sample information comprises the identifier and a label value corresponding to the identifier;
and determining a label value corresponding to the first sample according to the second sample information.
In one embodiment, the processor 1001 may call the determination program stored in the memory 1005, and further perform the following operations:
the first terminal receives the encrypted information sent by the second terminal;
and decrypting the encrypted information to obtain first sample information corresponding to the first sample in the second terminal.
In one embodiment, the processor 1001 may call the determination program stored in the memory 1005, and further perform the following operations:
determining the interval in which the information value of the feature lies;
and storing the predictive capability corresponding to the interval in association with the feature.
Based on the hardware structure of the terminal, the embodiments of the information value determination method of the sample characteristics are provided.
The invention provides a method for determining information value of sample characteristics.
Referring to fig. 2, fig. 2 is a first embodiment of a method for determining an information value of a sample feature according to the present invention, where the method for determining an information value of a sample feature includes:
step S10, a first terminal acquires first sample information corresponding to a first sample in a second terminal, wherein the first sample information includes an identifier of the first sample and a binning identifier of a feature of the first sample, and the second terminal performs binning operation on a feature value of the feature of each first sample to obtain a binning identifier corresponding to the feature of each first sample;
In this embodiment, the first terminal stores a plurality of samples, the second terminal stores a plurality of samples, and a sample that is present in both the first terminal and the second terminal is defined as a first sample. The first terminal and the second terminal send the feature vectors corresponding to their respective samples to a server, and the server aligns the received samples based on the vertical federated learning scenario. That is, the server calculates the distance between feature vectors, determines two feature vectors whose distance is smaller than a preset threshold to be feature vectors of the same sample and marks them, and feeds the marked feature vectors back to the first terminal and the second terminal. The first terminal and the second terminal then determine the first samples among their samples and set the same identifier for each first sample.
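The server-side alignment described above might look roughly like the following Python sketch. Several details are assumptions made for illustration: Euclidean distance as the distance measure, matching each vector to its nearest unmatched counterpart when several fall below the threshold, and the `ID_<n>` identifier format. The function name `align_samples` is hypothetical.

```python
import math

def align_samples(vectors_a, vectors_b, threshold):
    """Pair feature vectors from two terminals that lie closer than threshold.

    vectors_a and vectors_b map each terminal's local sample key to a
    feature vector. Returns a list of (key_a, key_b, shared_identifier).
    """
    matches = []
    used_b = set()
    for key_a, vec_a in vectors_a.items():
        best_key, best_dist = None, threshold
        for key_b, vec_b in vectors_b.items():
            if key_b in used_b:
                continue
            dist = math.dist(vec_a, vec_b)  # Euclidean distance (assumed)
            # Only distances strictly below the threshold count as a match;
            # among those, keep the nearest one (an assumption of this sketch).
            if dist < best_dist:
                best_key, best_dist = key_b, dist
        if best_key is not None:
            used_b.add(best_key)
            # Both terminals receive the same identifier for the matched sample.
            matches.append((key_a, best_key, f"ID_{len(matches)}"))
    return matches
```

Samples that find no counterpart below the threshold are simply left out of the result, which corresponds to them not being first samples.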
The feature of each first sample in the second terminal has a corresponding feature value, and the second terminal performs a binning operation on the feature values of the feature of the first samples, thereby obtaining the binning identifier corresponding to the feature of each first sample. Equidistant binning is taken as an example. Assume that the second terminal includes first samples a, b, c, d, e, f, g, h, i and k, whose feature values are 8, 9, 39, 7, 11, 3, 37, 45, 1 and 23, respectively. Binning these 10 values with a bin width of 10 places feature values 1, 3, 7, 8 and 9 in bin A, feature value 11 in bin B, feature value 23 in bin C, feature values 37 and 39 in bin D, and feature value 45 in bin E. Therefore, the binning identifier of first samples a, b, d, f and i is A; the binning identifier of first sample e is B; the binning identifier of first sample k is C; the binning identifier of first samples c and g is D; and the binning identifier of first sample h is E.
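The equidistant binning example can be reproduced with a few lines of Python. This is a sketch under stated assumptions: bins are taken to be the half-open intervals [0, 10), [10, 20) and so on, lettered from A, which matches the assignments in the example but is not spelled out in the text. The function name `equidistant_bin` is hypothetical.

```python
import string

def equidistant_bin(value, width=10):
    """Map a non-negative feature value to a lettered bin of the given width."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative feature values")
    index = int(value // width)  # [0, 10) -> 0, [10, 20) -> 1, ...
    if index >= len(string.ascii_uppercase):
        raise ValueError("value falls outside the 26 lettered bins")
    return string.ascii_uppercase[index]

# Feature values of the ten first samples from the example.
feature_values = {
    "a": 8, "b": 9, "c": 39, "d": 7, "e": 11,
    "f": 3, "g": 37, "h": 45, "i": 1, "k": 23,
}

# Only these binning identifiers, not the raw values, leave the second terminal.
bin_ids = {sample: equidistant_bin(v) for sample, v in feature_values.items()}
```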
The first sample in the second terminal has a corresponding identifier, and the second terminal packages the identifier and the binning identifier of the first sample into the first sample information; that is, the first sample information includes the identifier of the first sample and the binning identifier. The second terminal sends each piece of first sample information to the first terminal, so that the first terminal obtains the first sample information of the first samples in the second terminal.
Step S20, determining the label value corresponding to the first sample according to the identifier;
The first sample in the first terminal also has a corresponding identifier and a corresponding label value. The identifier and the label value of the first sample in the first terminal are stored in association as the second sample information; that is, the second sample information includes the identifier and the label value of the first sample.
The first terminal determines the second sample information corresponding to the first sample according to the identifier in the acquired first sample information, and then determines the label value corresponding to the first sample according to the second sample information. Specifically, after obtaining each piece of first sample information, the first terminal extracts the identifier of the first sample and compares it with the identifiers in the stored second sample information, thereby finding the second sample information that contains the identifier; this is the second sample information of the first sample corresponding to the identifier. The first terminal then extracts the label value from the second sample information, and this label value is the label value of the first sample corresponding to the identifier obtained from the second terminal. For example, if the first terminal obtains the identifier ID_0 from the second terminal and the second sample information a includes the identifier ID_0 and the label value 0, then the label value of the first sample with the identifier ID_0 in the second terminal is 0.
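The identifier lookup amounts to a keyed join between the received first sample information and the locally stored second sample information. A minimal Python sketch follows; the function name `lookup_labels`, the tuple and dictionary layouts, and the choice to skip identifiers with no local match are assumptions made for illustration.

```python
def lookup_labels(first_sample_info, second_sample_info):
    """Attach label values to the (identifier, bin_id) pairs received.

    first_sample_info: list of (identifier, bin_id) from the second terminal.
    second_sample_info: dict mapping identifier -> label value, held locally
    by the first terminal.
    Returns a list of (identifier, bin_id, label_value).
    """
    labelled = []
    for identifier, bin_id in first_sample_info:
        if identifier not in second_sample_info:
            # No local record for this identifier: skipped in this sketch.
            continue
        labelled.append((identifier, bin_id, second_sample_info[identifier]))
    return labelled

# Example mirroring the text: identifier ID_0 carries label value 0.
received = [("ID_0", "A"), ("ID_1", "B")]
local_labels = {"ID_0": 0, "ID_1": 1}
labelled_samples = lookup_labels(received, local_labels)
```

Storing the second sample information in a dictionary keyed by identifier replaces the pairwise comparison described in the text with a direct lookup; the result is the same.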
Step S30, determining the information value of each first sample on the feature according to the label value and the binning identifier;
and step S40, encrypting the information value of each first sample on the feature, and sending the encrypted information value to the second terminal to complete the determination of the information value of the sample feature in a vertical federated learning scenario.
After the first terminal determines the label value corresponding to each first sample in the second terminal, the information value of each first sample on the feature can be determined according to the label values and the binning identifiers. Specifically, the binning identifier represents the magnitude of the feature value: the smaller the bin width used in the equidistant binning operation, the more accurately the binning identifier represents the feature value of the feature of the first sample. The bin width is a preset value chosen so that the feature value represented by the binning identifier is within an acceptable error range. The first terminal determines the feature value of the feature of each first sample according to the binning identifier, and then substitutes the label values and the feature values into the calculation formula of the information value, thereby obtaining the information value of each first sample on the feature, which is a specific numerical value.
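The calculation formula itself is not reproduced at this point in the text. For reference, the conventional definition of information value, which works from the per-bin counts of positive and negative samples, can be written as follows. The symbols are notation introduced here and are not taken from the patent: $P_i$ and $N_i$ are the numbers of positive and negative samples in bin $i$, and $P$ and $N$ are the totals over all bins.

$$
p_i = \frac{P_i}{P}, \qquad n_i = \frac{N_i}{N}, \qquad \mathrm{IV}_i = (p_i - n_i)\,\ln\frac{p_i}{n_i}, \qquad \mathrm{IV} = \sum_i \mathrm{IV}_i
$$

As a worked example under this definition, take two bins with $P_A = 1$, $N_A = 3$, $P_B = 3$, $N_B = 1$, so that $P = N = 4$. Then $\mathrm{IV}_A = (0.25 - 0.75)\ln(0.25/0.75) = 0.5\ln 3$ and $\mathrm{IV}_B = (0.75 - 0.25)\ln(0.75/0.25) = 0.5\ln 3$, which gives $\mathrm{IV} = \ln 3 \approx 1.099$.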
After the first terminal obtains the information value, it encrypts the information value of each first sample on the feature and then sends the encrypted information value to the second terminal, which prevents the information value from being leaked during transmission. In addition, when the second terminal needs to determine the information value of each first sample on the feature, it encrypts the first sample information corresponding to each first sample to obtain encrypted information and sends the encrypted information to the first terminal. The first terminal decrypts the encrypted information to obtain the first sample information corresponding to the first samples in the second terminal.
In the technical solution provided by this embodiment, the second terminal performs a binning operation on the feature values of the feature of the samples to obtain a binning identifier for each sample; the first terminal obtains the identifiers and binning identifiers corresponding to the first samples in the second terminal and determines the label values of the corresponding samples in the first terminal according to the identifiers, so that the information value of each sample on the feature is determined according to the label values and the binning identifiers; the information value of each first sample on the feature is then encrypted and sent to the second terminal, which completes the determination of the information value of the sample feature in a vertical federated learning scenario. In the prior art, the information value is determined by sending the feature values or the labels of the samples to the other party. In the present invention, by contrast, only the binning identifiers that represent the feature values of the samples are sent to the other party, so the information value of the sample feature can be determined without sending the feature values themselves, and data leakage is avoided.
Referring to fig. 3, fig. 3 is a second embodiment of the information value determining method for sample characteristics according to the present invention, and based on the first embodiment, the step S20 includes:
step S21, determining sample attributes of the first sample according to the label value of the first sample, wherein the sample attributes comprise a positive sample and a negative sample;
in this embodiment, the first terminal may determine a sample attribute of the first sample according to the tag value, where the sample attribute includes a positive sample and a negative sample. Specifically, the label value is a specific numerical value, for example, the label value may be 0 or 1, if the label value is 0, the sample attribute of the first sample is a negative sample, and if the label value is 1, the sample attribute of the first sample is a positive sample.
Step S22, determining a first number of second samples and a second number of third samples corresponding to each bin according to the sample attributes, and determining a first total number of the second samples and a second total number of the third samples, where the second samples are the first samples of positive samples, the third samples are the first samples of negative samples, and the bins are determined according to the bin identifiers;
After the first terminal obtains the binning identifiers, it treats each distinct binning identifier as one bin, so the first terminal can determine a plurality of bins. The first terminal takes the first samples that share the same binning identifier as the samples of the corresponding bin. Among the samples of each bin, the first terminal determines a first number of second samples and a second number of third samples, where the second samples are first samples whose sample attribute is positive and the third samples are first samples whose sample attribute is negative. In other words, the first terminal determines the number of positive samples and the number of negative samples in each bin.
In addition, the first terminal needs to count the first total number of the second samples and the second total number of the third samples, that is, determine the total number of positive first samples and the total number of negative first samples.
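The counting in step S22 can be sketched as follows. The record layout (pairs of bin identifier and tag value) and the name `count_by_bin` are assumptions chosen for illustration.

```python
from collections import Counter


def count_by_bin(records):
    """Count positive and negative samples per bin and in total.

    `records` is an iterable of (bin_id, tag_value) pairs, where a tag value
    of 1 marks a positive sample and 0 marks a negative sample.
    """
    positives = Counter()
    negatives = Counter()
    for bin_id, tag_value in records:
        if tag_value == 1:
            positives[bin_id] += 1
        else:
            negatives[bin_id] += 1
    # First total number (all positives) and second total number (all negatives).
    return positives, negatives, sum(positives.values()), sum(negatives.values())


# Hypothetical joined records held by the first terminal.
records = [(0, 0), (0, 1), (1, 1), (1, 1), (1, 0), (2, 0)]
pos, neg, total_pos, total_neg = count_by_bin(records)
```

A `Counter` returns 0 for bins that have no samples of a given attribute, which makes empty cells easy to detect before computing ratios.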
Step S23, determining the information value of each bin according to the first number, the second number, the first total number and the second total number of the bins;
After the first terminal determines the first number and the second number of each bin, the information value corresponding to each bin can be determined according to the first number, the second number, the first total number and the second total number. Specifically, the first terminal determines a first ratio between the first number corresponding to the bin and the first total number, determines a second ratio between the second number corresponding to the bin and the second total number, and determines the information value of the bin from the first ratio and the second ratio. A detailed description follows.
Suppose the number of positive samples in bin A is $P_A$ (the first number), the number of negative samples is $N_A$ (the second number), the first total number is $P$, and the second total number is $N$. Then the first ratio is $R_P = P_A / P$, the second ratio is $R_N = N_A / N$, and the information value of the bin is $IV_A = (R_P - R_N)\ln(R_P / R_N)$.
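The per-bin formula can be written directly in code. The sketch below is illustrative; the function name and the choice to raise an error for a bin with no positive or no negative samples are assumptions, since the text does not say how such bins are handled.

```python
import math


def bin_information_value(p_a, n_a, p_total, n_total):
    """Information value of one bin: (R_P - R_N) * ln(R_P / R_N)."""
    r_p = p_a / p_total  # first ratio: positives in the bin / all positives
    r_n = n_a / n_total  # second ratio: negatives in the bin / all negatives
    if r_p == 0 or r_n == 0:
        # The logarithm is undefined here; the handling is an assumption.
        raise ValueError("bin must contain at least one positive and one negative sample")
    return (r_p - r_n) * math.log(r_p / r_n)


# Hypothetical counts: 30 of 100 positives and 10 of 100 negatives fall in bin A.
iv_a = bin_information_value(30, 10, 100, 100)
```

Because the two factors always share the same sign, the information value of a bin is never negative, and it is zero exactly when the two ratios are equal.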
Step S24, summing the information values of the bins to obtain the information value of the first samples on the feature.
After determining the information value corresponding to each bin, the first terminal sums the information values of the bins to obtain the information value of the first samples on the feature.
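Putting steps S22 to S24 together, the information value of a feature is the sum of the per-bin values. The following sketch assumes every bin contains at least one positive and one negative sample; the bin names and counts are made up for the example.

```python
import math


def feature_information_value(bin_counts):
    """Sum the per-bin information values for one feature.

    `bin_counts` maps a bin identifier to a (positives, negatives) pair.
    Assumes no bin has a zero count, which would make the logarithm undefined.
    """
    total_pos = sum(p for p, _ in bin_counts.values())
    total_neg = sum(n for _, n in bin_counts.values())
    iv = 0.0
    for pos, neg in bin_counts.values():
        r_p = pos / total_pos
        r_n = neg / total_neg
        iv += (r_p - r_n) * math.log(r_p / r_n)
    return iv


# Hypothetical per-bin counts for a feature split into three bins.
bin_counts = {"A": (30, 10), "B": (50, 40), "C": (20, 50)}
iv = feature_information_value(bin_counts)
```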
In the technical solution provided by this embodiment, the first terminal determines the sample attribute of each first sample according to the tag value, determines the first number of positive samples and the second number of negative samples corresponding to each bin according to the sample attributes, determines the information value of each bin according to the first number, the second number, the total number of second samples and the total number of third samples, and sums the information values of the bins to accurately obtain the information value of the first samples on the feature.
In an embodiment, after step S30, the method further includes:
determining an interval in which the information value of the feature is located;
and storing the prediction capability corresponding to the interval in association with the feature.
In this embodiment, the information value (IV) is an index closely related to WOE (Weight of Evidence). The IV can be used to evaluate the prediction capability of a sample feature, so that sample features suitable for training a risk control model can be screened out based on that prediction capability.
Specifically, the IV range is divided into a plurality of intervals, for example: a first interval (0, 0.02), a second interval [0.02, 0.1), a third interval [0.1, 0.3), a fourth interval [0.3, 0.5), and a fifth interval [0.5, +∞). The prediction capability corresponding to the first interval is little predictive power; the second interval, weak; the third interval, medium; the fourth interval, high; and the fifth interval, extremely high.
The first terminal determines the interval in which the information value of the feature is located and stores the prediction capability corresponding to that interval in association with the feature. When a risk control model needs to be trained, features are then screened out based on the prediction effect required of the model. For example, if the accuracy requirement of the risk control model is 90%, the features with extremely high prediction capability are selected, that is, all features whose information value is 0.5 or greater are used to train the model and obtain the risk control model.
In addition, the second terminal can also store the characteristics and the prediction capability in an associated manner.
In the technical scheme provided by this embodiment, the first terminal determines an interval in which the information value of the feature is located, and performs associated storage on the prediction capability corresponding to the interval and the feature, so that when a model is trained, features which meet the prediction effect of the model are screened out based on the prediction capability for training.
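The interval lookup and associated storage can be sketched as below, using the example interval boundaries given in this embodiment. The names `predictive_power` and `feature_store`, the feature names, and their information values are hypothetical. The first interval is open at 0, so treating an information value of exactly 0 as "little predictive power" is an assumption of this sketch.

```python
from bisect import bisect_right

# Lower bounds of the second to fifth intervals from the example above.
BOUNDS = [0.02, 0.1, 0.3, 0.5]
LEVELS = ["little predictive power", "weak", "medium", "high", "extremely high"]


def predictive_power(iv):
    """Return the prediction capability for the interval containing `iv`."""
    if iv < 0:
        raise ValueError("information value cannot be negative")
    # bisect_right makes each interval closed on the left and open on the right.
    return LEVELS[bisect_right(BOUNDS, iv)]


# Store the prediction capability in association with each (hypothetical) feature.
feature_ivs = {"age": 0.35, "income": 0.62, "zip_code": 0.01}
feature_store = {name: predictive_power(iv) for name, iv in feature_ivs.items()}

# Screen out the features to train on, here those rated "extremely high".
selected = [name for name, level in feature_store.items() if level == "extremely high"]
```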
The invention also provides a terminal.
Referring to fig. 4, fig. 4 is a functional module diagram of the terminal of the present invention.
As shown in fig. 4, the terminal includes:
an obtaining module 10, configured to obtain first sample information corresponding to a first sample in a second terminal, where the first sample information includes an identifier of the first sample and a binning identifier of a feature of the first sample, and the second terminal performs binning operation on a feature value of the feature of each first sample to obtain a binning identifier corresponding to the feature of each first sample;
a determining module 20, configured to determine, according to the identifier, a tag value corresponding to the first sample;
the determining module 20 is further configured to determine, according to the tag value and the binning identifier, an information value of each first sample on the feature;
and an encryption module 30, configured to encrypt the information value of each first sample on the feature and send the encrypted information value to the second terminal, so as to complete the determination of the information value of the sample feature in a vertical federated learning scenario.
In one embodiment, the terminal further includes:
a determining module 20, configured to determine a sample attribute of the first sample according to the tag value of the first sample, where the sample attribute is either positive sample or negative sample;
a determining module 20, configured to determine, according to the sample attributes, a first number of second samples and a second number of third samples corresponding to each bin, where the second samples are the first samples that are positive samples, the third samples are the first samples that are negative samples, and the bins are determined according to the binning identifiers;
a determining module 20, configured to determine a first total number of the second samples and a second total number of the third samples;
a determining module 20, configured to determine an information value of each bin according to the first number, the second number, the first total number, and the second total number of the bins;
and a summing module, configured to sum the information values of the bins to obtain the information value of the first samples on the feature.
In one embodiment, the terminal further includes:
a determining module 20 for determining a first ratio of the first number of bins to the first total number and a second ratio of the second number of bins to the second total number;
and a determining module 20, configured to determine the information value of the bin according to the first ratio and the second ratio.
In one embodiment, the terminal further includes:
a determining module 20, configured to determine the first samples with the same binning identifier as samples included in a bin corresponding to the binning identifier;
a determining module 20 for determining a first number of second samples and a second number of third samples among the samples contained in the bins.
In one embodiment, the terminal further includes:
a determining module 20, configured to determine, according to the identifier, second sample information corresponding to the first sample, where the second sample information includes the identifier and a tag value corresponding to the identifier;
and a determining module 20, configured to determine, according to the second sample information, a tag value corresponding to the first sample.
In one embodiment, the terminal further includes:
the receiving module is used for receiving the encrypted information sent by the second terminal;
and the decryption module is used for decrypting the encrypted information to obtain first sample information corresponding to the first sample in the second terminal.
The function implementation of each module in the terminal corresponds to each step in the embodiment of the method for determining the information value of the sample characteristic, and the function and implementation process are not described in detail herein.
In one embodiment, the terminal further includes:
a determining module 20, configured to determine an interval in which the information value of the feature is located;
and the association module is used for associating and storing the prediction capability corresponding to the interval with the characteristics.
The present invention also provides a storage medium having a determination program stored thereon, which when executed by a processor, implements the steps of the information value determination method of the sample feature as described in any one of the above embodiments.
The specific embodiment of the readable storage medium of the present invention is substantially the same as the embodiments of the method for determining the information value of the sample characteristic, and is not repeated herein.
The present invention also provides a computer program product comprising a computer program, wherein the computer program is configured to, when executed by a processor, implement the method for determining an information value of a sample feature according to the above embodiments.
The specific embodiment of the computer program product of the present invention is substantially the same as the embodiments of the method for determining the information value of the sample feature, and will not be described herein again.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (11)

1. A method for determining an information value of a sample feature, the method comprising:
a first terminal acquires first sample information corresponding to a first sample in a second terminal, wherein the first sample information comprises an identifier of the first sample and a binning identifier of the characteristic of the first sample, and the second terminal performs binning operation on the characteristic value of the characteristic of each first sample to obtain a binning identifier corresponding to the characteristic of each first sample;
determining a tag value corresponding to the first sample according to the identifier;
determining the information value of each first sample on the feature according to the tag value and the binning identifier;
and encrypting the information value of each first sample on the feature, and sending the encrypted information value to the second terminal to complete the determination of the information value of the sample feature in a vertical federated learning scenario.
2. The method of claim 1, wherein the step of determining the information value of each of the first samples on the feature based on the tag value and the bin identification comprises:
determining sample attributes of the first sample according to the tag value of the first sample, the sample attributes including a positive sample and a negative sample;
determining a first number of second samples and a second number of third samples corresponding to each bin according to the sample attributes, wherein the second samples are the first samples of positive samples, the third samples are the first samples of negative samples, and the bins are determined according to the bin identifiers;
determining a first total number of the second samples and a second total number of the third samples;
determining the information value of each bin according to the first number, the second number, the first total number and the second total number of the bins;
and summing the information values of the bins to obtain the information value of the first samples on the feature.
3. The method of determining an information value of a sample feature of claim 2, wherein the step of determining an information value of each bin based on the first number, the second number, the first total number, and the second total number of bins comprises:
determining a first ratio of the first number of bins to the first total number and a second ratio of the second number of bins to the second total number;
and determining the information value of the bin according to the first ratio and the second ratio.
4. The method of determining the information value of a sample feature of claim 2, wherein the step of determining a first number of second samples and a second number of third samples for each bin based on the sample attributes comprises:
determining the first samples with the same bin identifications as the samples contained in the bins corresponding to the bin identifications;
among the samples contained in the bins, a first number of second samples and a second number of third samples are determined.
5. The method of claim 1, wherein the step of determining the tag value corresponding to the first sample based on the identifier comprises:
determining second sample information corresponding to the first sample according to the identifier, wherein the second sample information comprises the identifier and a tag value corresponding to the identifier;
and determining the tag value corresponding to the first sample according to the second sample information.
6. The method for determining the information value of the sample feature according to any one of claims 1 to 5, wherein the step of the first terminal obtaining the first sample information corresponding to the first sample in the second terminal comprises:
the first terminal receives the encrypted information sent by the second terminal;
and decrypting the encrypted information to obtain first sample information corresponding to the first sample in the second terminal.
7. The method of determining the informative value of said sample characteristics according to any one of claims 1 to 5, wherein said step of determining the informative value of each of said first samples on said characteristics based on said tag value and said bin identification further comprises:
determining an interval in which the information value of the features is located;
and associating and storing the prediction capability corresponding to the interval with the characteristics.
8. A terminal, characterized in that the terminal comprises:
an obtaining module, configured to obtain, by a first terminal, first sample information corresponding to a first sample in a second terminal, where the first sample information includes an identifier of the first sample and a binning identifier of a feature of the first sample, and the second terminal performs binning operation on a feature value of the feature of each first sample to obtain a binning identifier corresponding to the feature of each first sample;
a determining module, configured to determine, according to the identifier, a tag value corresponding to the first sample;
the determining module is further configured to determine an information value of each first sample on the feature according to the tag value and the binning identifier;
and an encryption module, configured to encrypt the information value of each first sample on the feature and send the encrypted information value to the second terminal, so as to complete the determination of the information value of the sample feature in a vertical federated learning scenario.
9. An apparatus comprising a memory, a processor, and a determination program stored in the memory and executable on the processor, the determination program when executed by the processor implementing a method of information value determination of a sample feature as claimed in any one of claims 1 to 7.
10. A storage medium characterized by storing a determination program which, when executed by a processor, implements the information value determination method of the sample feature according to any one of claims 1 to 7.
11. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements a method of information value determination for a sample feature as claimed in any one of claims 1 to 7.
CN202011619669.0A 2020-12-30 2020-12-30 Information value determining method, terminal, device and storage medium for sample characteristics Active CN112711765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011619669.0A CN112711765B (en) 2020-12-30 2020-12-30 Information value determining method, terminal, device and storage medium for sample characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011619669.0A CN112711765B (en) 2020-12-30 2020-12-30 Information value determining method, terminal, device and storage medium for sample characteristics

Publications (2)

Publication Number Publication Date
CN112711765A true CN112711765A (en) 2021-04-27
CN112711765B CN112711765B (en) 2024-06-14

Family

ID=75547519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011619669.0A Active CN112711765B (en) 2020-12-30 2020-12-30 Information value determining method, terminal, device and storage medium for sample characteristics

Country Status (1)

Country Link
CN (1) CN112711765B (en)


Also Published As

Publication number Publication date
CN112711765B (en) 2024-06-14


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant