
CN114154202B - Wind control data exploration method and system based on differential privacy - Google Patents

Wind control data exploration method and system based on differential privacy Download PDF

Info

Publication number
CN114154202B
CN114154202B (application CN202210119822.6A)
Authority
CN
China
Prior art keywords
data
module
expanded
user
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210119822.6A
Other languages
Chinese (zh)
Other versions
CN114154202A (en)
Inventor
申书恒
傅欣艺
王维强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210119822.6A priority Critical patent/CN114154202B/en
Publication of CN114154202A publication Critical patent/CN114154202A/en
Application granted granted Critical
Publication of CN114154202B publication Critical patent/CN114154202B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a wind control data exploration method based on differential privacy, comprising the following steps: acquiring model-input data at a user end; adding perturbation to the acquired model-input data at the user end to obtain an expanded model-input data set; obtaining, at a server, a statistical approximation for the model-input data based on the expanded model-input data set; and correcting, at the server, the statistical approximation for the model-input data.

Description

Wind control data exploration method and system based on differential privacy
Technical Field
The present disclosure relates generally to wind control, and more particularly to data exploration in wind control.
Background
With citizens' growing awareness of privacy protection and the tightening supervision of data collection, mobile phone manufacturers have begun to restrict the collection of private data (such as the AppList) on terminal devices. In a wind control system, the data to be probed mainly comprises report information, historical transaction relation information, and terminal abnormal behavior information. Restricting terminal data collection can greatly degrade the performance of the intelligent wind control system, raising the threshold for auditing black-market (fraud) activity and reducing its detection coverage.
Therefore, there is a need in the art for an effective and efficient wind control data exploration scheme that makes use of terminal risk features without violating user privacy.
Disclosure of Invention
To solve the above technical problem, the present disclosure provides a wind control data exploration scheme based on differential privacy which, by adding noise to the original data, can obtain an approximation of the statistical information under a large-data-volume assumption without invading user privacy.
In an embodiment of the present disclosure, a wind control data exploration method based on differential privacy is provided, comprising: acquiring model-input data at a user end; adding perturbation to the acquired model-input data at the user end to obtain an expanded model-input data set; obtaining, at a server, a statistical approximation for the model-input data based on the expanded model-input data set; and correcting, at the server, the statistical approximation for the model-input data.
In another embodiment of the present disclosure, adding perturbation to the acquired model-input data at the user end comprises using a random response mechanism.
In yet another embodiment of the present disclosure, obtaining the statistical approximation for the model-input data comprises obtaining the validity or stability of the model-input data with respect to the model.
In another embodiment of the present disclosure, the model-input data comprises a training sample set and a validation sample set.
In yet another embodiment of the present disclosure, obtaining the stability of the model-input data with respect to the model comprises: obtaining, based on the expanded training sample set and validation sample set, the difference in distribution between the training samples and the validation samples.
In another embodiment of the present disclosure, obtaining the validity of the model-input data with respect to the model comprises: obtaining, based on the expanded model-input data set, the response sample proportion of the model-input data.
In an embodiment of the present disclosure, a wind control data exploration system based on differential privacy comprises: a data acquisition module for acquiring model-input data at a user end; a perturbation module for adding perturbation to the acquired model-input data at the user end to obtain an expanded model-input data set; and a data analysis module for obtaining, at a server, a statistical approximation for the model-input data based on the expanded model-input data set and correcting, at the server, the statistical approximation for the model-input data.
In another embodiment of the present disclosure, the perturbation module adding perturbation to the acquired model-input data at the user end comprises the perturbation module employing a random response mechanism.
In yet another embodiment of the present disclosure, the data analysis module obtaining a statistical approximation for the model-input data comprises the data analysis module obtaining the validity or stability of the model-input data with respect to the model.
In another embodiment of the present disclosure, the model-input data comprises a training sample set and a validation sample set.
In another embodiment of the present disclosure, the data analysis module obtaining the stability of the model-input data with respect to the model comprises: the data analysis module obtaining, based on the expanded training sample set and validation sample set, the difference in distribution between the training samples and the validation samples.
In another embodiment of the present disclosure, the data analysis module obtaining the validity of the model-input data with respect to the model comprises: the data analysis module obtaining, based on the expanded model-input data set, the response sample proportion of the model-input data.
In an embodiment of the disclosure, a computer-readable storage medium is provided that stores instructions that, when executed, cause a machine to perform the method as previously described.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
The foregoing summary, as well as the following detailed description of the present disclosure, will be better understood when read in conjunction with the appended drawings. It is to be noted that the appended drawings are intended as examples of the claimed invention. In the drawings, like reference characters designate the same or similar elements.
Fig. 1 is a flowchart illustrating a wind control data exploration method based on differential privacy according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram illustrating, respectively, a plaintext-return-based wind control data exploration process and a differential privacy-based wind control data exploration process according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating, respectively, a centralized differential privacy-based wind control data exploration process according to an embodiment of the present disclosure and a localized differential privacy-based wind control data exploration process according to another embodiment of the present disclosure;
fig. 4 is a data flow diagram illustrating a differential privacy-based wind control data exploration process according to an embodiment of the present disclosure;
fig. 5 is a block diagram illustrating a differential privacy-based wind control data exploration system according to an embodiment of the present disclosure.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, embodiments accompanying the present disclosure are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein, and thus the present disclosure is not limited to the specific embodiments disclosed below.
In the mobile internet era, the user's device carries nearly all of an individual's secrets. For example, if a data set with ID-class information removed is published, the data set is legally compliant, since in terms of law and ethics it does not involve individual privacy. However, if technical means are used to infer specific personal information by exploiting associations between different data sets and public information, a user privacy problem can still arise.
Society as a whole has a heightened awareness of privacy protection, and the collection of private data is increasingly restricted. Restricting terminal data collection, in turn, greatly degrades the performance of the intelligent wind control system. It is therefore desirable to design a data exploration scheme that enables the use of terminal risk features without violating user privacy.
When a model-input variable provides little information for label determination, or when the population stability of the variable is poor, the model generalizes poorly and its performance fluctuates widely. The first step of wind control modeling is therefore data exploration: probing the validity and stability of the existing data and screening for features suitable as model inputs, so as to improve the generalization and robustness of the model. For example, the data exploration indices include the IV value and the PSI value, where a larger IV value indicates that the feature provides more information for label determination, and a smaller PSI value indicates that the variable's distribution fluctuates less over time.
The present disclosure provides a wind control data exploration scheme based on Differential Privacy (DP) which, by adding noise to the original data, can obtain an approximation of the statistical information under a large-data-volume assumption without invading user privacy.
The differential privacy algorithm scrambles individual user data so that it cannot be technically traced back. Batch computations are then performed on the data without access to the original values, and the computation results are output. This yields the data resources required for machine learning while protecting users' private data.
Fig. 1 is a flowchart illustrating a wind control data exploration method 100 based on differential privacy according to an embodiment of the present disclosure.
In the present disclosure, differential privacy is employed to ensure that the query results of the statistical database are not affected by the private data of any single user. Thus, the data collector or data analyst cannot infer the data for any single user.
At 102, model-input data is acquired at the user end.
Data is the fundamental prerequisite for user behavior analysis. In the present disclosure, the collected user behavior data originates from the user end, and the wind control model targets this user behavior data. Since the user behavior data flows from the user end to the server end, the user privacy it contains must be safeguarded during transmission.
At 104, perturbation is added to the acquired model-input data at the user end to obtain an expanded model-input data set.
To guarantee privacy protection of the acquired model-input data, perturbation is added at the user end, thereby obtaining an expanded model-input data set.
Adding perturbation at the user end effectively randomizes each individual user's data before the perturbed, or randomized, data set is sent. In this way, the raw words or search keys entered by the user are never collected, and even if the perturbed or randomized data were leaked during transmission, it could not be reversed to recover the original values.
In an embodiment of the present disclosure, when localized differential privacy is applied, adding perturbation employs a random response mechanism. In another embodiment of the present disclosure, when localized differential privacy is applied, adding perturbation employs an information compression and distortion mechanism.
At 106, a statistical approximation for the model-input data is obtained at the server based on the expanded model-input data set.
Because many scenarios lack sufficiently large data, and the data may be mutually isolated and impossible to exchange or share, the perturbed, expanded data sets can be converged at the server, where the statistical approximation for the model-input data is then computed: for example, probing the validity and stability of the existing data and screening for features suitable as model inputs.
After perturbation is added to data from different servers and different scenarios, the data of any single user is hidden, but the statistical trend of the overall data set is unaffected. Therefore, when a model-input variable provides little information for label determination or its population stability is poor, performing data exploration on the perturbed, expanded data set can still effectively improve the generalization and robustness of the model.
At 108, the statistical approximation for the model-input data is corrected at the server.
The statistical approximation for the model-input data obtained at 106 is not a true unbiased estimate and needs to be corrected at the server end.
For example, the statistical approximation may be corrected using Maximum Likelihood Estimation. Those skilled in the art will understand that other correction methods may also be used, which are not described here.
In this way, by adding noise to the original data with a differential privacy algorithm, an approximation of the statistical information can be obtained under a large-data-volume assumption without invading user privacy.
Fig. 2 is a schematic diagram illustrating, respectively, a plaintext-return-based wind control data exploration process and a differential privacy-based wind control data exploration process according to an embodiment of the present disclosure.
As shown in the upper diagram of fig. 2, the plaintext-return-based wind control data exploration process mainly comprises: returning the plaintext to the cloud and computing the IV and PSI statistics on the plaintext. This process returns the user's plaintext data directly to the cloud, where statistical queries, such as the computation of the information value (IV value) and stability (PSI value), are performed directly on the plaintext. User privacy is violated both during the plaintext return and during the statistical computation of the IV and PSI values.
The information value IV measures the amount of information a variable carries; its magnitude reflects how strongly the independent variable influences the target variable, and it is essentially a weighted sum of the variable's WOE values. WOE (Weight of Evidence) is an encoding of the original variable. To perform WOE encoding, the variable must first be grouped, i.e., binned or discretized; common discretization methods are equal-width grouping, equal-height grouping, or grouping by decision tree. After grouping into $n$ groups (a continuous variable must first be discretized into bins), the WOE of the $i$-th group is computed as
$$\mathrm{WOE}_i = \ln\frac{y_i/y_T}{n_i/n_T},$$
where $y_i$ and $n_i$ are the numbers of responding and non-responding clients in the $i$-th group, and $y_T$ and $n_T$ are the total numbers of responding and non-responding clients. It expresses the difference between the share of responding clients in the current group (out of all responding clients) and the share of non-responding clients in the current group (out of all non-responding clients), which can also be read as a ratio of response proportions.
For group $i$, the corresponding IV value is
$$\mathrm{IV}_i = \left(\frac{y_i}{y_T} - \frac{n_i}{n_T}\right) \cdot \mathrm{WOE}_i.$$
Note that no group of the variable should have a response count of 0 or a non-response count of 0: when the response count of a group is 0, the corresponding WOE is negative infinity and the IV value is positive infinity. Where possible, such a group should instead be turned directly into a rule, serving as a pre-filter or complement to the model.
After the IV values of a variable's groups are calculated, the IV of the whole variable is
$$\mathrm{IV} = \sum_{i=1}^{n} \mathrm{IV}_i.$$
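As a concrete illustration of the WOE and IV formulas above, the following Python sketch computes both from binned counts; the function name `woe_iv` and the sample counts in the usage note are illustrative, not from the patent.

```python
import math

def woe_iv(resp, nonresp):
    """Per-bin WOE and total IV from binned response counts.

    resp[i]    -- number of responding clients in bin i
    nonresp[i] -- number of non-responding clients in bin i
    """
    y_t, n_t = sum(resp), sum(nonresp)
    woe, iv = [], 0.0
    for y_i, n_i in zip(resp, nonresp):
        # A zero count in any bin makes WOE (and IV) infinite; such a bin
        # should instead be split off as a standalone rule, as noted above.
        assert y_i > 0 and n_i > 0
        w = math.log((y_i / y_t) / (n_i / n_t))
        woe.append(w)
        iv += (y_i / y_t - n_i / n_t) * w
    return woe, iv
```

For example, `woe_iv([20, 30, 50], [40, 30, 30])` yields an IV of roughly 0.24, which by the commonly cited rule of thumb indicates medium predictive power.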
PSI (Population Stability Index) reflects the stability between the distribution of the validation samples over the score segments and the distribution of the modeling/training samples. In modeling, the PSI is often used to screen characteristic variables and to evaluate model stability. Stability is relative, so two distributions are required: the actual distribution and the expected distribution. In modeling, the training sample (INS) is usually taken as the expected distribution and the validation sample as the actual distribution. Validation samples generally include Out-of-Sample (OOS) and Out-of-Time (OOT) samples.
The PSI is calculated by putting the two distributions together and comparing their difference bin by bin:
$$\mathrm{PSI} = \sum_{i=1}^{n} (A_i - E_i) \cdot \ln\frac{A_i}{E_i},$$
where $A_i$ denotes the actual percentage in the $i$-th bin and $E_i$ denotes the expected percentage in the $i$-th bin.
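The PSI formula can likewise be sketched in Python; the function name and sample proportions below are illustrative, not from the patent.

```python
import math

def psi(actual, expected):
    """PSI = sum_i (A_i - E_i) * ln(A_i / E_i) over bin proportions."""
    # Both inputs are proportion vectors and must each sum to 1.
    assert abs(sum(actual) - 1.0) < 1e-9 and abs(sum(expected) - 1.0) < 1e-9
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

Identical distributions give a PSI of 0, and drift between them raises it; a commonly cited rule of thumb reads PSI below 0.1 as stable and above 0.25 as unstable.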
The lower diagram of fig. 2 illustrates the differential privacy-based wind control data exploration process according to an embodiment of the present disclosure.
A Centralized Differential Privacy (CDP) algorithm collects the data scattered across terminal devices into a trusted data center, obtains statistical information based on the CDP algorithm, and publishes that information externally. The premise is a trusted third-party data collector, i.e., one that protects the collected data from leakage and theft. In practical application scenarios it is difficult to find such a trusted third party, which greatly limits the application of CDP.
By contrast, a Local Differential Privacy (LDP) algorithm completes the differential-privacy perturbation directly on the terminal, without the participation of a trusted third party, and then transmits the perturbed data to the server (i.e., ciphertext return); the server performs subsequent processing on the perturbed data.
The LDP is defined as follows: an algorithm $M$ satisfies $\varepsilon$-LDP if, for any two inputs $x$ and $x'$ and any possible output $y$,
$$\Pr[M(x) = y] \le e^{\varepsilon} \cdot \Pr[M(x') = y].$$
Random Response (randomized response) is the dominant perturbation mechanism for LDP, defined as follows: assume the user variable $t$ has the value range $\{0, 1\}$ and let $t'$ be the perturbed value. If
$$\Pr[t' = t] = \frac{e^{\varepsilon}}{1 + e^{\varepsilon}}, \qquad \Pr[t' = 1 - t] = \frac{1}{1 + e^{\varepsilon}},$$
then $t'$ satisfies $\varepsilon$-LDP.
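A minimal sketch of the binary random response and its server-side de-biasing, assuming the keep probability $e^{\varepsilon}/(1+e^{\varepsilon})$ given above; the helper names `randomized_response` and `debias` are illustrative.

```python
import math
import random

def randomized_response(t, eps, rng=random):
    """Report t unchanged with prob. e^eps/(1+e^eps); flip it otherwise."""
    p_keep = math.exp(eps) / (1 + math.exp(eps))
    return t if rng.random() < p_keep else 1 - t

def debias(mean_perturbed, eps):
    """Invert E[t'] = (1-p) + t*(2p-1) to estimate the true mean of t."""
    p = math.exp(eps) / (1 + math.exp(eps))
    return (mean_perturbed - (1 - p)) / (2 * p - 1)
```

With eps = 1 each bit is kept with probability about 0.73; the server sees only flipped bits yet can recover the population mean via `debias`.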
In an embodiment, the IV statistic calculation under local privacy protection uses LDP to estimate $p_i^1$ and $p_i^0$, where $p_i^1$ denotes the proportion of users with label 1 in the $i$-th group among all users with label 1, and $p_i^0$ denotes the proportion of users with label 0 in the $i$-th group among all users with label 0. Assuming there are $n$ groups, only the total numbers of users labeled 1 and labeled 0 in each group need to be estimated via LDP.
Before the LDP-based statistical calculation of the IV value, the data must be binned in advance. Common binning modes include equal-width binning and equal-frequency binning; but since equal-frequency binning requires access to the private information of the data and introduces extra privacy budget, it is recommended to perform equal-width binning in advance according to expert experience (estimating the variable's value range and distribution).
The IV statistic calculation under local privacy protection comprises the following steps:
(1) Each user u returns a vector $r = (r_1, \ldots, r_n)$ of length $n$. If the user's variable belongs to the $i$-th group, then with probability $p$ set $r_i = 1$ and $r_j = 0$ for all $j \neq i$; with probability $1 - p$ set $r_i = 0$, randomly select one $j \neq i$ and set $r_j = 1$, with all remaining $r_k = 0$.
(2) After the server receives the vectors returned by all $N$ users, it sums the vectors of all users with label 1 to obtain $c^1$. By the definition of the random response, the final estimate is
$$\hat{p}_i^1 = \frac{c_i^1/N_1 - q}{p - q}, \qquad q = \frac{1 - p}{n - 1},$$
where the $i$-th entry of $\hat{p}^1$ is an estimate of the proportion, among all users with label 1, of users belonging to the $i$-th group. $\hat{p}^0$ can be computed in the same way.
(3) Since $\hat{p}^1$ and $\hat{p}^0$ obtained in the previous step are estimated random variables, the elements of each vector do not sum to 1; the two vectors are therefore each normalized. Multiplying the normalized vectors by $N_1$ and $N_0$ respectively yields estimates of the numbers of users labeled 1 and labeled 0 in each group, and this estimation satisfies $\varepsilon$-LDP.
(4) From the per-group counts of users labeled 1 and 0 estimated in step (3), the LDP-based IV value is calculated.
Proof: by the definition of LDP and the probabilities of the random cases in step (1), the worst-case ratio of output probabilities for two different inputs must satisfy
$$\frac{p}{(1 - p)/(n - 1)} \le e^{\varepsilon}.$$
From the above formula one obtains
$$p = \frac{e^{\varepsilon}}{e^{\varepsilon} + n - 1},$$
i.e., this choice of $p$ ensures that the algorithm satisfies $\varepsilon$-LDP.
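Steps (1) through (3) can be sketched as follows, under the assumption that the de-biasing estimate takes the standard form implied by the probabilities in step (1); the function names `encode` and `estimate` are illustrative, not from the patent.

```python
import math
import random

def encode(group, n, eps, rng=random):
    """Step (1): a user whose variable falls in bin `group` (0-based)
    reports a length-n indicator vector under the n-ary random response."""
    p = math.exp(eps) / (math.exp(eps) + n - 1)
    r = [0] * n
    if rng.random() < p:
        r[group] = 1                          # report the true bin
    else:
        r[rng.choice([j for j in range(n) if j != group])] = 1
    return r

def estimate(counts, total, eps):
    """Steps (2)-(3): de-bias the summed per-bin counts into proportion
    estimates, then normalize so the estimates sum to 1."""
    n = len(counts)
    p = math.exp(eps) / (math.exp(eps) + n - 1)
    q = (1 - p) / (n - 1)                     # prob. of reporting any other bin
    est = [(c / total - q) / (p - q) for c in counts]
    s = sum(est)
    return [e / s for e in est]
```

In expectation the de-biased estimates already sum to 1, so the normalization in `estimate` only corrects sampling noise.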
In practical applications, the security level of the private data can be adjusted by setting $\varepsilon$; the disclosure recommends one setting of $\varepsilon$ when $n = 2$, a second when $2 < n \le 5$, and a third when $5 < n \le 10$.
The above algorithm assumes exactly one sample per user, which guarantees that the IV calculation for a single variable satisfies $\varepsilon$-LDP.
In another embodiment of the present disclosure, when a user's sample size is greater than 1, the data is first aggregated locally and the aggregated vector is then transmitted to the server. Since the server can know in advance whether a user's label is 1, 0, or None, users whose label is None do not participate in the algorithm's computation; and if the value of the variable itself is None, None may be binned alone as a separate group.
In addition, in wind control modeling, transaction data can be acquired directly at the server end, while terminal data is mainly user-dimension data.
In another embodiment of the present disclosure, when multiple variables participate in the calculation simultaneously, $\varepsilon$-LDP is still guaranteed if all variables are independent and identically distributed. When non-independent, non-identically-distributed variables occur in units of groups, then, since LDP satisfies additivity, assuming the maximum capacity of each group is $S$, allocating a privacy budget of $\varepsilon/S$ per variable ensures that the algorithm satisfies $\varepsilon$-LDP.
In an embodiment of the present disclosure, the PSI statistic calculation under local privacy protection is based on the sequential composition of local differential privacy.
Differential privacy composition: given a data set $D$, suppose random algorithms $M_1, M_2, \ldots, M_k$ have differential privacy budgets $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_k$ respectively. Then the combined algorithm $M = (M_1, M_2, \ldots, M_k)$ provides $\left(\sum_{i=1}^{k} \varepsilon_i\right)$-differential privacy protection. That is, for the same data set, a sequence of composed differential privacy protection algorithms provides a protection level equal to the sum of their differential privacy budgets.
Unlike calculating IV values, calculating the PSI requires each user to provide multiple groups of variables. Let INS be the data of $t_1$ days and OOT the data of $t_2$ days; then each user must return a total of $(t_1 + t_2)$ group variables, where the $k$-th group variable indicates the bin to which the user's variable belongs on the $k$-th day. The process of calculating the PSI comprises the following steps:
(1) per user u, return (t)1+t2) A vector of length n
Figure DEST_PATH_IMAGE092
If the user's variable belongs to the first
Figure DEST_PATH_IMAGE094
A group is formed by
Figure DEST_PATH_IMAGE096
Probability r of𝑖=1, rj=0,
Figure DEST_PATH_IMAGE098
(ii) a To be provided with
Figure DEST_PATH_IMAGE100
Probability r of𝑖And = 0. Randomly choosing a j such that rj=1, the rest rk=0,
Figure DEST_PATH_IMAGE102
(2) The server receives all N user return
Figure DEST_PATH_IMAGE104
Then, sum of all user vectors of each day is calculated
Figure DEST_PATH_IMAGE106
According to the definition of the random response, final
Figure DEST_PATH_IMAGE108
Figure DEST_PATH_IMAGE110
In
Figure DEST_PATH_IMAGE112
To (1)
Figure DEST_PATH_IMAGE114
Item representation pair
Figure DEST_PATH_IMAGE116
All users in the day belong to the first
Figure DEST_PATH_IMAGE114A
Estimation of the proportion of users of a packet.
(3) Due to calculation in the previous step
Figure DEST_PATH_IMAGE118
For the estimated random variables, the sum of the elements in each vector is not 1, so each vector is normalized respectively to obtain
Figure DEST_PATH_IMAGE120
Corresponding to INS and OOT, respectively
Figure DEST_PATH_IMAGE122
Is averaged to obtain
Figure DEST_PATH_IMAGE124
And
Figure DEST_PATH_IMAGE126
the estimation satisfies
Figure DEST_PATH_IMAGE128
(4) And (4) calculating to obtain PSI values based on LDP according to the actual and expected occupation ratios of each packet calculated in the step (3).
Proof: by the definition of LDP, the privacy budget consumed per day is $\varepsilon_0$. By the composition of differential privacy, the total privacy budget over $(t_1 + t_2)$ days is $(t_1 + t_2)\,\varepsilon_0$. Letting
$$\varepsilon_0 = \frac{\varepsilon}{t_1 + t_2}, \qquad p = \frac{e^{\varepsilon_0}}{e^{\varepsilon_0} + n - 1},$$
ensures that the algorithm satisfies $\varepsilon$-LDP.
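Assuming the per-day normalized bin-proportion vectors have already been estimated as above, steps (3) and (4) reduce to averaging and the PSI formula; the sketch below is illustrative, not the patent's code, and takes the first $t_1$ vectors as INS (expected) and the rest as OOT (actual).

```python
import math

def psi_from_daily(daily_props, t1):
    """Average INS days into the expected distribution and OOT days into
    the actual one, then compute PSI between the two averages."""
    def avg(vecs):
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    expected = avg(daily_props[:t1])   # INS: modeling/training period
    actual = avg(daily_props[t1:])     # OOT: out-of-time validation period
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

When the daily distributions do not drift, the result is 0; a shift between the INS and OOT periods produces a positive PSI.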
Fig. 3 is a schematic diagram respectively illustrating a centralized differential privacy-based wind-control data exploration process according to an embodiment of the present disclosure and a localized differential privacy-based wind-control data exploration process according to another embodiment of the present disclosure.
As shown in the upper diagram of Fig. 3, in the centralized differential privacy-based wind-control data exploration process according to an embodiment of the present disclosure, the raw data of a plurality of users (user 1, user 2, …, user n) is transmitted to a trusted central server. The central server performs centralized differential privacy processing, i.e. adds perturbation, and then computes approximations of statistical queries, such as information value (IV) queries, stability (PSI) queries, Top-k queries, mean queries, and the like.
As shown in the lower diagram of Fig. 3, in the localized differential privacy-based wind-control data exploration process according to another embodiment of the present disclosure, each of a plurality of users (user 1, user 2, …, user n) performs localized differential privacy processing, i.e. adds perturbation, on its own raw data at the user side, and the approximation of the statistical query is then computed at the server side.
Those skilled in the art will appreciate that after the statistical query approximation is obtained, the approximation may be corrected to improve the accuracy of the statistical analysis.
Fig. 4 is a data flow diagram illustrating a differential privacy-based wind-control data exploration process according to an embodiment of the present disclosure.
As shown in Fig. 4, user privacy is hidden by means of differential privacy all the way from the collection of sensitive data to the publication of data. The main purpose of differential privacy is to release as much useful batched query information as possible while ensuring that the privacy leakage does not exceed a preset budget ε.
Differential privacy mainly comprises perturbation and sampling. For perturbation, noise is added to the input data, the intermediate data, or the output data so that the result satisfies ε-differential privacy. A typical scheme for input perturbation is randomized response, and a typical scheme for output perturbation is the Laplace mechanism. Intermediate data can be regarded both as the output of the preceding sub-stage and as the input of the following sub-stage, so an input-perturbation or output-perturbation algorithm can be flexibly selected for it.
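A minimal sketch of the Laplace mechanism mentioned above, for a numeric query with bounded L1 sensitivity; the function name and parameter values are illustrative assumptions:

```python
import random

def laplace_mechanism(true_value: float, sensitivity: float, eps: float) -> float:
    """Output perturbation: add Laplace(0, sensitivity/eps) noise, which
    makes the released value eps-differentially private for a query with
    the given L1 sensitivity."""
    scale = sensitivity / eps
    # The difference of two i.i.d. Exp(1/scale) variates is Laplace(0, scale).
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_value + noise

# A counting query has sensitivity 1: one user changes the count by at most 1.
random.seed(0)
noisy_count = laplace_mechanism(1000.0, sensitivity=1.0, eps=0.5)
```

A smaller ε yields a larger noise scale and thus stronger privacy at the cost of accuracy.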
In the embodiment shown in Fig. 4, the input data, i.e., the in-model data at the user side, is perturbed to obtain perturbed in-model data (i.e., an expanded in-model data set). The expanded in-model data set is then input into a model to obtain a statistical approximation of the in-model data. Finally, the obtained statistical approximation is corrected and can then be published. In this way, the user privacy contained in the user data is reliably protected.
For sampling, in an embodiment of the present disclosure, assume that the query function is f. The data is divided into k parts, and the query function f is run on each part to obtain the query results f(d1), f(d2), …, f(dk). Any ε-differential privacy algorithm (e.g. randomized response) is then applied to these query results to obtain the final result. The advantage is that the ε-differential privacy algorithm ultimately operates on the smaller data set f(d1), f(d2), …, f(dk).
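The sampling scheme above can be sketched roughly as follows; here the k block results are aggregated with a Laplace-noised mean rather than randomized response (an illustrative assumption, not necessarily the patent's exact algorithm):

```python
import random

def sample_and_aggregate(data, f, k, eps, lo, hi):
    """Sampling-based DP: split the data into k disjoint blocks, run the
    query f on each block, clip the block results into [lo, hi], and
    release their Laplace-noised mean. One record affects at most one
    block result, so the mean has L1 sensitivity (hi - lo) / k."""
    data = list(data)
    random.shuffle(data)
    blocks = [data[i::k] for i in range(k)]
    results = [min(max(f(b), lo), hi) for b in blocks]
    mean = sum(results) / k
    scale = (hi - lo) / (k * eps)
    # Difference of two i.i.d. exponentials is a Laplace(0, scale) variate.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return mean + noise

random.seed(0)
est = sample_and_aggregate(range(1000), lambda b: sum(b) / len(b),
                           k=50, eps=2.0, lo=0, hi=999)
```

As the text notes, the DP step only ever touches the k block outputs, not the raw records.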
Fig. 5 is a block diagram illustrating a differential privacy-based wind-control data exploration system 500 according to an embodiment of the present disclosure.
The differential privacy-based wind-control data exploration system 500 according to an embodiment of the present disclosure includes a data acquisition module 502, a perturbation module 506, and a data analysis module 508.
The data acquisition module 502 acquires the in-model data at the user side. In the present disclosure, the collected user behavior data originates from the user side, and the wind control model is built on this user behavior data. Since the user behavior data flows from the user side to the server side, the user privacy it contains needs to be protected during transmission to the server side.
The perturbation module 506 adds perturbation to the acquired in-model data at the user side to obtain an expanded in-model data set.
Adding the perturbation at the user side by the perturbation module 506 in effect randomizes each single user's data; the set of perturbed (randomized) user-side data is then sent. In an embodiment of the present disclosure, the perturbation is added using a randomized response mechanism under localized differential privacy. In another embodiment of the present disclosure, the perturbation is added using an information compression and warping mechanism under localized differential privacy.
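A minimal sketch of the randomized response mechanism the perturbation module might use, assuming the user's value is one of k discrete groups (function name and parameters are illustrative):

```python
import math
import random

def randomized_response(value: int, k: int, eps: float) -> int:
    """k-ary randomized response: keep the true group id with probability
    e^eps / (e^eps + k - 1); otherwise report one of the other k - 1
    groups uniformly at random. Satisfies eps-local differential privacy."""
    p_keep = math.exp(eps) / (math.exp(eps) + k - 1)
    if random.random() < p_keep:
        return value
    other = random.randrange(k - 1)          # pick among the k - 1 other groups
    return other if other < value else other + 1

# Each user perturbs its own group id before the data leaves the device.
random.seed(0)
reports = [randomized_response(2, k=4, eps=1.0) for _ in range(10)]
```

Because the randomization happens on the user side, the server only ever sees already-perturbed values.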
The data analysis module 508 obtains, at the server, a statistical approximation for the in-model data based on the expanded in-model data set.
Because many scenarios lack a sufficiently large amount of data, and the data may be isolated and unable to be shared, the perturbed expanded data sets can be aggregated at the server by the data analysis module 508, and the statistical approximation for the in-model data is then computed at the server, for example to explore the validity and stability of the existing data and to screen features suitable for use as model inputs.
After perturbation is added to data from different servers and different scenarios, the data of any single user is hidden, while the statistical trend of the overall data set is unaffected. Therefore, when an in-model variable provides little information for label discrimination, or the population stability of the variable is poor, data exploration on the perturbed expanded data set can effectively improve the generalization and robustness of the model.
Further, the data analysis module 508 corrects the statistical approximation for the in-model data at the server side. The obtained statistical approximation is not a true unbiased estimate and therefore needs to be corrected at the server side.
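The server-side correction can be illustrated for the randomized-response case: inverting the perturbation in expectation yields an unbiased estimate, which is then clipped and normalized, mirroring the normalization in step (3). This is a sketch under assumed parameters, not the disclosed implementation:

```python
import math

def debias_frequencies(observed, k, eps):
    """Invert k-ary randomized response in expectation. With keep
    probability p and flip probability q = (1 - p) / (k - 1), each
    observed frequency satisfies E[o_j] = p * f_j + q * (1 - f_j), so
    the unbiased estimate is f_j = (o_j - q) / (p - q). Estimates can
    fall outside [0, 1], so they are clipped and re-normalized."""
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    q = (1.0 - p) / (k - 1)
    est = [(o - q) / (p - q) for o in observed]
    clipped = [max(e, 0.0) for e in est]
    total = sum(clipped)
    return [c / total for c in clipped]
```

Applied to the aggregated perturbed reports, this recovers the true group proportions in expectation while each individual report remains randomized.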
Thus, by adding noise to the original data through a differential privacy algorithm, and under the assumption of a large data volume, the differential privacy-based wind-control data exploration system can use terminal risk features and obtain approximations of the statistical information without invading user privacy.
The various steps and modules of the differential privacy based method and system for detecting wind control data described above may be implemented in hardware, software, or a combination thereof. If implemented in hardware, the various illustrative steps, modules, and circuits described in connection with the present invention may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or other programmable logic component, hardware component, or any combination thereof. A general purpose processor may be a processor, microprocessor, controller, microcontroller, state machine, or the like. If implemented in software, the various illustrative steps, modules, etc. described in connection with the present invention may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Software modules implementing the various operations of the present invention may reside in storage media such as RAM, flash memory, ROM, EPROM, EEPROM, registers, hard disk, a removable disk, a CD-ROM, cloud storage, etc. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium, and execute the corresponding program modules to perform the steps of the present invention. Furthermore, software-based embodiments may be uploaded, downloaded, or accessed remotely through suitable communication means. Such suitable communication means include, for example, the internet, the world wide web, an intranet, software applications, cable (including fiber optic cable), magnetic communication, electromagnetic communication (including RF, microwave, and infrared communication), electronic communication, or other such communication means.
It is also noted that the embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged.
The disclosed methods, apparatus, and systems should not be limited in any way. Rather, the invention encompasses all novel and non-obvious features and aspects of the various disclosed embodiments, both individually and in various combinations and sub-combinations with each other. The disclosed methods, apparatus, and systems are not limited to any specific aspect or feature or combination thereof, nor do any of the disclosed embodiments require that any one or more specific advantages be present or that a particular or all technical problem be solved.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes may be made in the embodiments without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. A differential privacy-based wind-control data exploration method, comprising:
acquiring in-model data at a user side;
adding perturbation to the acquired in-model data at the user side to obtain an expanded in-model data set;
aggregating the expanded in-model data set at a server;
obtaining, at the server, a statistical approximation for the in-model data based on the aggregated expanded in-model data set, wherein obtaining the statistical approximation for the in-model data comprises: acquiring, based on the expanded in-model data set, a difference between the distributions of training samples and validation samples, or acquiring a response sample ratio of the in-model data;
wherein obtaining the statistical approximation for the in-model data is based on: for variables within groups that are not independent and identically distributed,
Figure DEST_PATH_IMAGE002
where s is the maximum capacity of each group; and
correcting, at the server, the statistical approximation for the in-model data.
2. The method of claim 1, wherein adding the perturbation to the acquired in-model data at the user side comprises using a randomized response mechanism.
3. The method of claim 1, wherein obtaining the statistical approximation for the in-model data comprises obtaining a validity or stability of the in-model data for a model.
4. The method of claim 3, wherein the in-model data comprises a training sample set and a validation sample set.
5. A differential privacy-based wind-control data exploration system, comprising:
a data acquisition module that acquires in-model data at a user side;
a perturbation module that adds perturbation to the acquired in-model data at the user side to obtain an expanded in-model data set; and
a data analysis module that aggregates the expanded in-model data set at a server, obtains, at the server, a statistical approximation for the in-model data based on the aggregated expanded in-model data set, and corrects, at the server, the statistical approximation for the in-model data,
wherein the data analysis module obtaining the statistical approximation for the in-model data comprises: acquiring, based on the expanded in-model data set, a difference between the distributions of training samples and validation samples, or acquiring a response sample ratio of the in-model data;
wherein the data analysis module obtains the statistical approximation for the in-model data based on: for variables within groups that are not independent and identically distributed,
Figure DEST_PATH_IMAGE002A
where s is the maximum capacity of each group.
6. The system of claim 5, wherein the perturbation module adding the perturbation to the acquired in-model data at the user side comprises the perturbation module employing a randomized response mechanism.
7. The system of claim 5, wherein the data analysis module obtaining the statistical approximation for the in-model data comprises the data analysis module obtaining a validity or stability of the in-model data for a model.
8. The system of claim 7, wherein the in-model data comprises a training sample set and a validation sample set.
9. A computer-readable storage medium having stored thereon instructions that, when executed, cause a machine to perform the method of any of claims 1-4.
CN202210119822.6A 2022-02-09 2022-02-09 Wind control data exploration method and system based on differential privacy Active CN114154202B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210119822.6A CN114154202B (en) 2022-02-09 2022-02-09 Wind control data exploration method and system based on differential privacy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210119822.6A CN114154202B (en) 2022-02-09 2022-02-09 Wind control data exploration method and system based on differential privacy

Publications (2)

Publication Number Publication Date
CN114154202A CN114154202A (en) 2022-03-08
CN114154202B true CN114154202B (en) 2022-06-24

Family

ID=80450277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210119822.6A Active CN114154202B (en) 2022-02-09 2022-02-09 Wind control data exploration method and system based on differential privacy

Country Status (1)

Country Link
CN (1) CN114154202B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902506A (en) * 2019-01-08 2019-06-18 中国科学院软件研究所 A Local Differential Privacy Data Sharing Method and System with Multiple Privacy Budgets
CN110727957A (en) * 2019-10-15 2020-01-24 电子科技大学 Differential privacy protection method and system based on sampling
CN112329056A (en) * 2020-11-03 2021-02-05 石家庄铁道大学 Government affair data sharing-oriented localized differential privacy method

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US10467234B2 (en) * 2015-11-02 2019-11-05 LeapYear Technologies, Inc. Differentially private database queries involving rank statistics
CN112163896B (en) * 2020-10-19 2022-05-06 科技谷(厦门)信息技术有限公司 Federated learning system
CN112434280B (en) * 2020-12-17 2024-02-13 浙江工业大学 Federal learning defense method based on blockchain
CN113591133B (en) * 2021-09-27 2021-12-24 支付宝(杭州)信息技术有限公司 Method and device for performing feature processing based on differential privacy


Non-Patent Citations (2)

Title
Local Differential Privacy for Evolving Data; Joseph, M. et al.; Advances in Neural Information Processing Systems 31; 2019-04-05; full text *
Research on Privacy Protection Algorithms for Government Data Sharing Based on Localized Differential Privacy; Hao Yurong et al.; Journal of Intelligence; April 2021; Vol. 40, No. 2; pp. 169-175+137 *

Also Published As

Publication number Publication date
CN114154202A (en) 2022-03-08

Similar Documents

Publication Publication Date Title
Keshk et al. A privacy-preserving-framework-based blockchain and deep learning for protecting smart power networks
Li et al. Intrusion detection system using Online Sequence Extreme Learning Machine (OS-ELM) in advanced metering infrastructure of smart grid
Yang et al. Toward data integrity attacks against optimal power flow in smart grid
Higgins et al. Stealthy MTD against unsupervised learning-based blind FDI attacks in power systems
Jow et al. A survey of intrusion detection systems in smart grid
Yang et al. VoteTrust: Leveraging friend invitation graph to defend against social network sybils
CN109040130A (en) Mainframe network behavior pattern measure based on attributed relational graph
CN116383753B (en) Abnormal behavior prompting method, device, equipment and medium based on Internet of things
David et al. Detection of distributed denial of service attacks based on information theoretic approach in time series models
Abdulaal et al. Privacy-preserving detection of power theft in smart grid change and transmit (CAT) advanced metering infrastructure
Bajtoš et al. Network intrusion detection with threat agent profiling
Al-Ghaili et al. A Review of anomaly detection techniques in advanced metering infrastructure
Desai et al. Mitigating consumer privacy breach in smart grid using obfuscation-based generative adversarial network
Huang et al. LDPGuard: Defenses against data poisoning attacks to local differential privacy protocols
Touré et al. A framework for detecting zero-day exploits in network flows
Ibrahem Privacy-preserving and efficient electricity theft detection and data collection for AMI using machine learning
Feng et al. Sentinel: An aggregation function to secure decentralized federated learning
CN114154202B (en) Wind control data exploration method and system based on differential privacy
Zuo et al. ApaPRFL: robust privacy-preserving federated learning scheme against poisoning adversaries for intelligent devices using edge computing
CN116702922B (en) Training method, training device, terminal equipment and training medium based on malicious behavior detection
Pliatsios et al. Trust management in smart grid: A markov trust model
WO2024007565A1 (en) Network analysis using optical quantum computing
CN112257098B (en) Method and device for determining safety of rule model
Ahmed et al. Smart integration of cloud computing and MCMC based secured WSN to monitor environment
Lavrova et al. Detection of cyber threats to network infrastructure of digital production based on the methods of Big Data and multifractal analysis of traffic

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant