
CN114154202B - Wind control data exploration method and system based on differential privacy - Google Patents

Wind control data exploration method and system based on differential privacy Download PDF

Info

Publication number
CN114154202B
CN114154202B (application CN202210119822.6A)
Authority
CN
China
Prior art keywords
data
module
expanded
user
server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210119822.6A
Other languages
Chinese (zh)
Other versions
CN114154202A (en)
Inventor
申书恒
傅欣艺
王维强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202210119822.6A priority Critical patent/CN114154202B/en
Publication of CN114154202A publication Critical patent/CN114154202A/en
Application granted granted Critical
Publication of CN114154202B publication Critical patent/CN114154202B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462 Approximate or statistical queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6227 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database where protection concerns the structure of data, e.g. records, types, queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a wind control data exploration method based on differential privacy, comprising the following steps: acquiring model-input data at a user end; adding perturbation to the acquired model-input data at the user end to obtain an expanded model-input data set; obtaining, at a server, a statistical approximation for the model-input data based on the expanded model-input data set; and correcting, at the server, the statistical approximation for the model-input data.

Description

Wind control data exploration method and system based on differential privacy
Technical Field
The present disclosure relates generally to wind control, and more particularly to data exploration in wind control.
Background
With citizens' growing awareness of privacy protection and the tightening supervision of data collection, mobile phone manufacturers have begun to restrict the collection of private data (such as the AppList) on terminal devices. In a wind control system, the data to be probed mainly comprises report information, historical transaction relation information, and terminal abnormal behavior information. Restricting terminal data collection can greatly degrade the performance of the intelligent wind control system, raising the threshold for auditing black-market (fraud) activity and reducing its detection coverage.
Therefore, there is a need in the art for an effective and efficient wind control data exploration scheme that makes use of terminal risk features without violating user privacy.
Disclosure of Invention
To solve the above technical problem, the present disclosure provides a wind control data exploration scheme based on differential privacy which, by adding noise to the original data, can obtain an approximation of the statistical information under a large-data-volume assumption without invading user privacy.
In an embodiment of the present disclosure, a wind control data exploration method based on differential privacy is provided, comprising: acquiring model-input data at a user end; adding perturbation to the acquired model-input data at the user end to obtain an expanded model-input data set; obtaining, at a server, a statistical approximation for the model-input data based on the expanded model-input data set; and correcting, at the server, the statistical approximation for the model-input data.
In another embodiment of the present disclosure, adding perturbation to the acquired model-input data at the user end comprises using a random response mechanism.
In yet another embodiment of the present disclosure, obtaining the statistical approximation for the model-input data comprises obtaining the validity or stability of the model-input data with respect to the model.
In another embodiment of the present disclosure, the model-input data comprises a training sample set and a validation sample set.
In yet another embodiment of the present disclosure, obtaining the stability of the model-input data with respect to the model comprises: obtaining, based on the expanded training sample set and validation sample set, the difference in distribution between the training samples and the validation samples.
In another embodiment of the present disclosure, obtaining the validity of the model-input data with respect to the model comprises: obtaining, based on the expanded model-input data set, the response sample proportion of the model-input data.
In an embodiment of the present disclosure, a wind control data exploration system based on differential privacy comprises: a data acquisition module for acquiring model-input data at a user end; a perturbation module for adding perturbation to the acquired model-input data at the user end to obtain an expanded model-input data set; and a data analysis module for obtaining, at a server, a statistical approximation for the model-input data based on the expanded model-input data set and correcting, at the server, the statistical approximation for the model-input data.
In another embodiment of the present disclosure, the perturbation module adding perturbation to the acquired model-input data at the user end comprises the perturbation module employing a random response mechanism.
In yet another embodiment of the present disclosure, the data analysis module obtaining a statistical approximation for the model-input data comprises the data analysis module obtaining the validity or stability of the model-input data with respect to the model.
In another embodiment of the present disclosure, the model-input data comprises a training sample set and a validation sample set.
In another embodiment of the present disclosure, the data analysis module obtaining the stability of the model-input data with respect to the model comprises: the data analysis module obtaining, based on the expanded training sample set and validation sample set, the difference in distribution between the training samples and the validation samples.
In another embodiment of the present disclosure, the data analysis module obtaining the validity of the model-input data with respect to the model comprises: the data analysis module obtaining, based on the expanded model-input data set, the response sample proportion of the model-input data.
In an embodiment of the disclosure, a computer-readable storage medium is provided that stores instructions that, when executed, cause a machine to perform the method as previously described.
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
Drawings
The foregoing summary, as well as the following detailed description of the present disclosure, will be better understood when read in conjunction with the appended drawings. It is to be noted that the appended drawings are intended as examples of the claimed invention. In the drawings, like reference characters designate the same or similar elements.
Fig. 1 is a flowchart illustrating a wind control data exploration method based on differential privacy according to an embodiment of the present disclosure;
fig. 2 is a schematic diagram illustrating, respectively, a plaintext-return-based wind control data exploration process and a differential privacy-based wind control data exploration process according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating, respectively, a centralized differential privacy-based wind control data exploration process according to an embodiment of the present disclosure and a localized differential privacy-based wind control data exploration process according to another embodiment of the present disclosure;
fig. 4 is a data flow diagram illustrating a differential privacy-based wind control data exploration process according to an embodiment of the present disclosure;
fig. 5 is a block diagram illustrating a differential privacy-based wind control data exploration system according to an embodiment of the present disclosure.
Detailed Description
In order to make the aforementioned objects, features and advantages of the present disclosure more comprehensible, embodiments accompanying the present disclosure are described in detail below.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein, and thus the present disclosure is not limited to the specific embodiments disclosed below.
In the mobile internet era, the user's device carries nearly all of an individual's secrets. For example, if a data set with ID-class information removed is published, the data set is legally compliant, since in terms of law and ethics it does not involve individual privacy. However, if technical means are used to infer specific personal information by exploiting associations between different data sets and public information, a user privacy problem can still arise.
Society as a whole has a heightened awareness of privacy protection, and the collection of private data is increasingly restricted. Restricting terminal data collection, in turn, greatly degrades the performance of the intelligent wind control system. It is therefore desirable to design a data exploration scheme that enables the use of terminal risk features without violating user privacy.
When a model-input variable provides little information for label determination, or when the population stability of the variable is poor, the model generalizes poorly and its performance fluctuates widely. The first step of wind control modeling is therefore data exploration: probing the validity and stability of the existing data and screening for features suitable as model inputs, so as to improve the generalization and robustness of the model. For example, the data exploration indices include the IV value and the PSI value, where a larger IV value indicates that the feature provides more information for label determination, and a smaller PSI value indicates that the variable's distribution fluctuates less over time.
The present disclosure provides a wind control data exploration scheme based on Differential Privacy (DP) which, by adding noise to the original data, can obtain an approximation of the statistical information under a large-data-volume assumption without invading user privacy.
The differential privacy algorithm scrambles individual user data so that it cannot be technically traced back. Batch computations are then performed on the data without access to the original values, and the computation results are output. This yields the data resources required for machine learning while protecting users' private data.
Fig. 1 is a flowchart illustrating a wind control data exploration method 100 based on differential privacy according to an embodiment of the present disclosure.
In the present disclosure, differential privacy is employed to ensure that the query results of the statistical database are not affected by the private data of any single user. Thus, the data collector or data analyst cannot infer the data for any single user.
At 102, model-input data is acquired at the user end.
Data is the fundamental prerequisite for user behavior analysis. In the present disclosure, the collected user behavior data originates from the user end, and the wind control model targets this user behavior data. Since the user behavior data flows from the user end to the server end, the user privacy it contains must be safeguarded during transmission.
At 104, perturbation is added to the acquired model-input data at the user end to obtain an expanded model-input data set.
To guarantee privacy protection of the acquired model-input data, perturbation is added at the user end, thereby obtaining an expanded model-input data set.
Adding perturbation at the user end effectively randomizes each individual user's data before the perturbed, or randomized, data set is sent. In this way, the raw words or search keys entered by the user are never collected, and even if the perturbed or randomized data were leaked during transmission, it could not be reversed to recover the original values.
In an embodiment of the present disclosure, when localized differential privacy is applied, adding perturbation employs a random response mechanism. In another embodiment of the present disclosure, when localized differential privacy is applied, adding perturbation employs an information compression and distortion mechanism.
At 106, a statistical approximation for the model-input data is obtained at the server based on the expanded model-input data set.
Because many scenarios lack sufficiently large data, and the data may be mutually isolated and impossible to exchange or share, the perturbed, expanded data sets can be converged at the server, where the statistical approximation for the model-input data is then computed: for example, probing the validity and stability of the existing data and screening for features suitable as model inputs.
After perturbation is added to data from different servers and different scenarios, the data of any single user is hidden, but the statistical trend of the overall data set is unaffected. Therefore, when a model-input variable provides little information for label determination or its population stability is poor, performing data exploration on the perturbed, expanded data set can still effectively improve the generalization and robustness of the model.
At 108, the statistical approximation for the model-input data is corrected at the server.
The statistical approximation for the model-input data obtained at 106 is not a true unbiased estimate and needs to be corrected at the server end.
For example, the statistical approximation may be corrected using Maximum Likelihood Estimation. Those skilled in the art will understand that other correction methods may also be used, which are not described here.
In this way, by adding noise to the original data with a differential privacy algorithm, an approximation of the statistical information can be obtained under a large-data-volume assumption without invading user privacy.
Fig. 2 is a schematic diagram illustrating, respectively, a plaintext-return-based wind control data exploration process and a differential privacy-based wind control data exploration process according to an embodiment of the present disclosure.
As shown in the upper diagram of fig. 2, the plaintext-return-based wind control data exploration process mainly comprises: returning the plaintext to the cloud and computing the IV and PSI statistics on the plaintext. This process returns the user's plaintext data directly to the cloud, where statistical queries, such as the computation of the information value (IV value) and stability (PSI value), are performed directly on the plaintext. User privacy is violated both during the plaintext return and during the statistical computation of the IV and PSI values.
The information value IV measures the amount of information a variable carries; its magnitude reflects how strongly the independent variable influences the target variable, and it is essentially a weighted sum of the variable's WOE values. WOE (Weight of Evidence) is an encoding of the original variable. To perform WOE encoding, the variable must first be grouped, i.e., binned or discretized; common discretization methods are equal-width grouping, equal-height grouping, or grouping by decision tree. After grouping into $n$ groups (a continuous variable must first be discretized into bins), the WOE of the $i$-th group is computed as
$$\mathrm{WOE}_i = \ln\frac{y_i/y_T}{n_i/n_T},$$
where $y_i$ and $n_i$ are the numbers of responding and non-responding clients in the $i$-th group, and $y_T$ and $n_T$ are the total numbers of responding and non-responding clients. It expresses the difference between the share of responding clients in the current group (out of all responding clients) and the share of non-responding clients in the current group (out of all non-responding clients), which can also be read as a ratio of response proportions.
For group $i$, the corresponding IV value is
$$\mathrm{IV}_i = \left(\frac{y_i}{y_T} - \frac{n_i}{n_T}\right) \cdot \mathrm{WOE}_i.$$
Note that no group of the variable should have a response count of 0 or a non-response count of 0: when the response count of a group is 0, the corresponding WOE is negative infinity and the IV value is positive infinity. Where possible, such a group should instead be turned directly into a rule, serving as a pre-filter or complement to the model.
After the IV values of a variable's groups are calculated, the IV of the whole variable is
$$\mathrm{IV} = \sum_{i=1}^{n} \mathrm{IV}_i.$$
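As a concrete illustration of the WOE and IV formulas above, the following Python sketch computes both from binned counts; the function name `woe_iv` and the sample counts in the usage note are illustrative, not from the patent.

```python
import math

def woe_iv(resp, nonresp):
    """Per-bin WOE and total IV from binned response counts.

    resp[i]    -- number of responding clients in bin i
    nonresp[i] -- number of non-responding clients in bin i
    """
    y_t, n_t = sum(resp), sum(nonresp)
    woe, iv = [], 0.0
    for y_i, n_i in zip(resp, nonresp):
        # A zero count in any bin makes WOE (and IV) infinite; such a bin
        # should instead be split off as a standalone rule, as noted above.
        assert y_i > 0 and n_i > 0
        w = math.log((y_i / y_t) / (n_i / n_t))
        woe.append(w)
        iv += (y_i / y_t - n_i / n_t) * w
    return woe, iv
```

For example, `woe_iv([20, 30, 50], [40, 30, 30])` yields an IV of roughly 0.24, which by the commonly cited rule of thumb indicates medium predictive power.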
PSI (Population Stability Index) reflects the stability between the distribution of the validation samples over the score segments and the distribution of the modeling/training samples. In modeling, the PSI is often used to screen characteristic variables and to evaluate model stability. Stability is relative, so two distributions are required: the actual distribution and the expected distribution. In modeling, the training sample (INS) is usually taken as the expected distribution and the validation sample as the actual distribution. Validation samples generally include Out-of-Sample (OOS) and Out-of-Time (OOT) samples.
The PSI is calculated by putting the two distributions together and comparing their difference bin by bin:
$$\mathrm{PSI} = \sum_{i=1}^{n} (A_i - E_i) \cdot \ln\frac{A_i}{E_i},$$
where $A_i$ denotes the actual percentage in the $i$-th bin and $E_i$ denotes the expected percentage in the $i$-th bin.
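The PSI formula can likewise be sketched in Python; the function name and sample proportions below are illustrative, not from the patent.

```python
import math

def psi(actual, expected):
    """PSI = sum_i (A_i - E_i) * ln(A_i / E_i) over bin proportions."""
    # Both inputs are proportion vectors and must each sum to 1.
    assert abs(sum(actual) - 1.0) < 1e-9 and abs(sum(expected) - 1.0) < 1e-9
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

Identical distributions give a PSI of 0, and drift between them raises it; a commonly cited rule of thumb reads PSI below 0.1 as stable and above 0.25 as unstable.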
The lower diagram of fig. 2 illustrates the differential privacy-based wind control data exploration process according to an embodiment of the present disclosure.
A Centralized Differential Privacy (CDP) algorithm collects the data scattered across terminal devices into a trusted data center, obtains statistical information based on the CDP algorithm, and publishes that information externally. The premise is a trusted third-party data collector, i.e., one that protects the collected data from leakage and theft. In practical application scenarios it is difficult to find such a trusted third party, which greatly limits the application of CDP.
By contrast, a Local Differential Privacy (LDP) algorithm completes the differential-privacy perturbation directly on the terminal, without the participation of a trusted third party, and then transmits the perturbed data to the server (i.e., ciphertext return); the server performs subsequent processing on the perturbed data.
The LDP is defined as follows: an algorithm $M$ satisfies $\varepsilon$-LDP if, for any two inputs $x$ and $x'$ and any possible output $y$,
$$\Pr[M(x) = y] \le e^{\varepsilon} \cdot \Pr[M(x') = y].$$
Random Response (randomized response) is the dominant perturbation mechanism for LDP, defined as follows: assume the user variable $t$ has the value range $\{0, 1\}$ and let $t'$ be the perturbed value. If
$$\Pr[t' = t] = \frac{e^{\varepsilon}}{1 + e^{\varepsilon}}, \qquad \Pr[t' = 1 - t] = \frac{1}{1 + e^{\varepsilon}},$$
then $t'$ satisfies $\varepsilon$-LDP.
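A minimal sketch of the binary random response and its server-side de-biasing, assuming the keep probability $e^{\varepsilon}/(1+e^{\varepsilon})$ given above; the helper names `randomized_response` and `debias` are illustrative.

```python
import math
import random

def randomized_response(t, eps, rng=random):
    """Report t unchanged with prob. e^eps/(1+e^eps); flip it otherwise."""
    p_keep = math.exp(eps) / (1 + math.exp(eps))
    return t if rng.random() < p_keep else 1 - t

def debias(mean_perturbed, eps):
    """Invert E[t'] = (1-p) + t*(2p-1) to estimate the true mean of t."""
    p = math.exp(eps) / (1 + math.exp(eps))
    return (mean_perturbed - (1 - p)) / (2 * p - 1)
```

With eps = 1 each bit is kept with probability about 0.73; the server sees only flipped bits yet can recover the population mean via `debias`.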
In an embodiment, the IV statistic calculation under local privacy protection uses LDP to estimate $p_i^1$ and $p_i^0$, where $p_i^1$ denotes the proportion of users with label 1 in the $i$-th group among all users with label 1, and $p_i^0$ denotes the proportion of users with label 0 in the $i$-th group among all users with label 0. Assuming there are $n$ groups, only the total numbers of users labeled 1 and labeled 0 in each group need to be estimated via LDP.
Before the LDP-based statistical calculation of the IV value, the data must be binned in advance. Common binning modes include equal-width binning and equal-frequency binning; but since equal-frequency binning requires access to the private information of the data and introduces extra privacy budget, it is recommended to perform equal-width binning in advance according to expert experience (estimating the variable's value range and distribution).
The IV statistic calculation under local privacy protection comprises the following steps:
(1) Each user u returns a vector $r = (r_1, \ldots, r_n)$ of length $n$. If the user's variable belongs to the $i$-th group, then with probability $p$ set $r_i = 1$ and $r_j = 0$ for all $j \neq i$; with probability $1 - p$ set $r_i = 0$, randomly select one $j \neq i$ and set $r_j = 1$, with all remaining $r_k = 0$.
(2) After the server receives the vectors returned by all $N$ users, it sums the vectors of all users with label 1 to obtain $c^1$. By the definition of the random response, the final estimate is
$$\hat{p}_i^1 = \frac{c_i^1/N_1 - q}{p - q}, \qquad q = \frac{1 - p}{n - 1},$$
where the $i$-th entry of $\hat{p}^1$ is an estimate of the proportion, among all users with label 1, of users belonging to the $i$-th group. $\hat{p}^0$ can be computed in the same way.
(3) Since $\hat{p}^1$ and $\hat{p}^0$ obtained in the previous step are estimated random variables, the elements of each vector do not sum to 1; the two vectors are therefore each normalized. Multiplying the normalized vectors by $N_1$ and $N_0$ respectively yields estimates of the numbers of users labeled 1 and labeled 0 in each group, and this estimation satisfies $\varepsilon$-LDP.
(4) From the per-group counts of users labeled 1 and 0 estimated in step (3), the LDP-based IV value is calculated.
Proof: by the definition of LDP and the probabilities of the random cases in step (1), the worst-case ratio of output probabilities for two different inputs must satisfy
$$\frac{p}{(1 - p)/(n - 1)} \le e^{\varepsilon}.$$
From the above formula one obtains
$$p = \frac{e^{\varepsilon}}{e^{\varepsilon} + n - 1},$$
i.e., this choice of $p$ ensures that the algorithm satisfies $\varepsilon$-LDP.
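Steps (1) through (3) can be sketched as follows, under the assumption that the de-biasing estimate takes the standard form implied by the probabilities in step (1); the function names `encode` and `estimate` are illustrative, not from the patent.

```python
import math
import random

def encode(group, n, eps, rng=random):
    """Step (1): a user whose variable falls in bin `group` (0-based)
    reports a length-n indicator vector under the n-ary random response."""
    p = math.exp(eps) / (math.exp(eps) + n - 1)
    r = [0] * n
    if rng.random() < p:
        r[group] = 1                          # report the true bin
    else:
        r[rng.choice([j for j in range(n) if j != group])] = 1
    return r

def estimate(counts, total, eps):
    """Steps (2)-(3): de-bias the summed per-bin counts into proportion
    estimates, then normalize so the estimates sum to 1."""
    n = len(counts)
    p = math.exp(eps) / (math.exp(eps) + n - 1)
    q = (1 - p) / (n - 1)                     # prob. of reporting any other bin
    est = [(c / total - q) / (p - q) for c in counts]
    s = sum(est)
    return [e / s for e in est]
```

In expectation the de-biased estimates already sum to 1, so the normalization in `estimate` only corrects sampling noise.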
In practical applications, the security level of the private data can be adjusted by setting $\varepsilon$; the disclosure recommends one setting of $\varepsilon$ when $n = 2$, a second when $2 < n \le 5$, and a third when $5 < n \le 10$.
The above algorithm assumes exactly one sample per user, which guarantees that the IV calculation for a single variable satisfies $\varepsilon$-LDP.
In another embodiment of the present disclosure, when a user's sample size is greater than 1, the data is first aggregated locally and the aggregated vector is then transmitted to the server. Since the server can know in advance whether a user's label is 1, 0, or None, users whose label is None do not participate in the algorithm's computation; and if the value of the variable itself is None, None may be binned alone as a separate group.
In addition, in wind control modeling, transaction data can be acquired directly at the server end, while terminal data is mainly user-dimension data.
In another embodiment of the present disclosure, when multiple variables participate in the calculation simultaneously, $\varepsilon$-LDP is still guaranteed if all variables are independent and identically distributed. When non-independent, non-identically-distributed variables occur in units of groups, then, since LDP satisfies additivity, assuming the maximum capacity of each group is $S$, allocating a privacy budget of $\varepsilon/S$ per variable ensures that the algorithm satisfies $\varepsilon$-LDP.
In an embodiment of the present disclosure, the PSI statistic calculation under local privacy protection is based on the sequential composition of local differential privacy.
Differential privacy composition: given a data set $D$, suppose random algorithms $M_1, M_2, \ldots, M_k$ have differential privacy budgets $\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_k$ respectively. Then the combined algorithm $M = (M_1, M_2, \ldots, M_k)$ provides $\left(\sum_{i=1}^{k} \varepsilon_i\right)$-differential privacy protection. That is, for the same data set, a sequence of composed differential privacy protection algorithms provides a protection level equal to the sum of their differential privacy budgets.
Unlike calculating IV values, calculating the PSI requires each user to provide multiple groups of variables. Let INS be the data of $t_1$ days and OOT the data of $t_2$ days; then each user must return a total of $(t_1 + t_2)$ group variables, where the $k$-th group variable indicates the bin to which the user's variable belongs on the $k$-th day. The process of calculating the PSI comprises the following steps:
(1) per user u, return (t)1+t2) A vector of length n
Figure DEST_PATH_IMAGE092
If the user's variable belongs to the first
Figure DEST_PATH_IMAGE094
A group is formed by
Figure DEST_PATH_IMAGE096
Probability r of𝑖=1, rj=0,
Figure DEST_PATH_IMAGE098
(ii) a To be provided with
Figure DEST_PATH_IMAGE100
Probability r of𝑖And = 0. Randomly choosing a j such that rj=1, the rest rk=0,
Figure DEST_PATH_IMAGE102
(2) The server receives all N user return
Figure DEST_PATH_IMAGE104
Then, sum of all user vectors of each day is calculated
Figure DEST_PATH_IMAGE106
According to the definition of the random response, final
Figure DEST_PATH_IMAGE108
Figure DEST_PATH_IMAGE110
In
Figure DEST_PATH_IMAGE112
To (1)
Figure DEST_PATH_IMAGE114
Item representation pair
Figure DEST_PATH_IMAGE116
All users in the day belong to the first
Figure DEST_PATH_IMAGE114A
Estimation of the proportion of users of a packet.
(3) Due to calculation in the previous step
Figure DEST_PATH_IMAGE118
For the estimated random variables, the sum of the elements in each vector is not 1, so each vector is normalized respectively to obtain
Figure DEST_PATH_IMAGE120
Corresponding to INS and OOT, respectively
Figure DEST_PATH_IMAGE122
Is averaged to obtain
Figure DEST_PATH_IMAGE124
And
Figure DEST_PATH_IMAGE126
the estimation satisfies
Figure DEST_PATH_IMAGE128
(4) And (4) calculating to obtain PSI values based on LDP according to the actual and expected occupation ratios of each packet calculated in the step (3).
Proof: by the definition of LDP, the privacy budget consumed per day is $\varepsilon_0$. By the composition of differential privacy, the total privacy budget over $(t_1 + t_2)$ days is $(t_1 + t_2)\,\varepsilon_0$. Letting
$$\varepsilon_0 = \frac{\varepsilon}{t_1 + t_2}, \qquad p = \frac{e^{\varepsilon_0}}{e^{\varepsilon_0} + n - 1},$$
ensures that the algorithm satisfies $\varepsilon$-LDP.
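Assuming the per-day normalized bin-proportion vectors have already been estimated as above, steps (3) and (4) reduce to averaging and the PSI formula; the sketch below is illustrative, not the patent's code, and takes the first $t_1$ vectors as INS (expected) and the rest as OOT (actual).

```python
import math

def psi_from_daily(daily_props, t1):
    """Average INS days into the expected distribution and OOT days into
    the actual one, then compute PSI between the two averages."""
    def avg(vecs):
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    expected = avg(daily_props[:t1])   # INS: modeling/training period
    actual = avg(daily_props[t1:])     # OOT: out-of-time validation period
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))
```

When the daily distributions do not drift, the result is 0; a shift between the INS and OOT periods produces a positive PSI.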
Fig. 3 is a schematic diagram respectively illustrating a centralized differential privacy-based wind-control data exploration process according to an embodiment of the present disclosure and a localized differential privacy-based wind-control data exploration process according to another embodiment of the present disclosure.
As shown in the upper diagram of Fig. 3, in the centralized differential privacy-based wind-control data exploration process according to an embodiment of the present disclosure, the raw data of a plurality of users (user 1, user 2, …, user n) is transmitted to a trusted central server. The central server performs centralized differential privacy processing, i.e. adds perturbation, and then computes approximations of statistical queries, such as information value (IV) queries, stability (PSI) queries, Top-k queries, mean queries, and the like.
As shown in the lower diagram of Fig. 3, in the localized differential privacy-based wind-control data exploration process according to another embodiment of the present disclosure, each of a plurality of users (user 1, user 2, …, user n) performs localized differential privacy processing, i.e. adds perturbation, on its own raw data at the user side, and the approximation of the statistical query is then computed at the server side.
Those skilled in the art will appreciate that after the statistical query approximation is obtained, the approximation may be corrected to improve the accuracy of the statistical analysis.
Fig. 4 is a data flow diagram illustrating a differential privacy-based wind-control data exploration process according to an embodiment of the present disclosure.
As shown in Fig. 4, user privacy is hidden by means of differential privacy all the way from the collection of sensitive data to the publication of data. The main purpose of differential privacy is to release as much useful batched query information as possible while ensuring that the privacy leakage does not exceed a preset budget ε.
Differential privacy mainly comprises perturbation and sampling. For perturbation, noise is added to the input data, the intermediate data, or the output data so that the result satisfies ε-differential privacy. A typical scheme for input perturbation is randomized response, and a typical scheme for output perturbation is the Laplace mechanism. Intermediate data can be regarded both as the output of the preceding sub-stage and as the input of the following sub-stage, so an input-perturbation or output-perturbation algorithm can be flexibly selected for it.
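A minimal sketch of the Laplace mechanism mentioned above, for a numeric query with bounded L1 sensitivity; the function name and parameter values are illustrative assumptions:

```python
import random

def laplace_mechanism(true_value: float, sensitivity: float, eps: float) -> float:
    """Output perturbation: add Laplace(0, sensitivity/eps) noise, which
    makes the released value eps-differentially private for a query with
    the given L1 sensitivity."""
    scale = sensitivity / eps
    # The difference of two i.i.d. Exp(1/scale) variates is Laplace(0, scale).
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return true_value + noise

# A counting query has sensitivity 1: one user changes the count by at most 1.
random.seed(0)
noisy_count = laplace_mechanism(1000.0, sensitivity=1.0, eps=0.5)
```

A smaller ε yields a larger noise scale and thus stronger privacy at the cost of accuracy.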
In the embodiment shown in Fig. 4, the input data, i.e., the in-model data at the user side, is perturbed to obtain perturbed in-model data (i.e., an expanded in-model data set). The expanded in-model data set is then input into a model to obtain a statistical approximation of the in-model data. Finally, the obtained statistical approximation is corrected and can then be published. In this way, the user privacy contained in the user data is reliably protected.
For sampling, in an embodiment of the present disclosure, assume that the query function is f. The data is divided into k parts, and the query function f is run on each part to obtain the query results f(d1), f(d2), …, f(dk). Any ε-differential privacy algorithm (e.g. randomized response) is then applied to these query results to obtain the final result. The advantage is that the ε-differential privacy algorithm ultimately operates on the smaller data set f(d1), f(d2), …, f(dk).
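The sampling scheme above can be sketched roughly as follows; here the k block results are aggregated with a Laplace-noised mean rather than randomized response (an illustrative assumption, not necessarily the patent's exact algorithm):

```python
import random

def sample_and_aggregate(data, f, k, eps, lo, hi):
    """Sampling-based DP: split the data into k disjoint blocks, run the
    query f on each block, clip the block results into [lo, hi], and
    release their Laplace-noised mean. One record affects at most one
    block result, so the mean has L1 sensitivity (hi - lo) / k."""
    data = list(data)
    random.shuffle(data)
    blocks = [data[i::k] for i in range(k)]
    results = [min(max(f(b), lo), hi) for b in blocks]
    mean = sum(results) / k
    scale = (hi - lo) / (k * eps)
    # Difference of two i.i.d. exponentials is a Laplace(0, scale) variate.
    noise = random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)
    return mean + noise

random.seed(0)
est = sample_and_aggregate(range(1000), lambda b: sum(b) / len(b),
                           k=50, eps=2.0, lo=0, hi=999)
```

As the text notes, the DP step only ever touches the k block outputs, not the raw records.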
Fig. 5 is a block diagram illustrating a differential privacy-based wind-control data exploration system 500 according to an embodiment of the present disclosure.
The differential privacy-based wind-control data exploration system 500 according to an embodiment of the present disclosure includes a data acquisition module 502, a perturbation module 506, and a data analysis module 508.
The data acquisition module 502 acquires the in-model data at the user side. In the present disclosure, the collected user behavior data originates from the user side, and the wind control model is built on this user behavior data. Since the user behavior data flows from the user side to the server side, the user privacy it contains needs to be protected during transmission to the server side.
The perturbation module 506 adds perturbation to the acquired in-model data at the user side to obtain an expanded in-model data set.
Adding the perturbation at the user side by the perturbation module 506 in effect randomizes each single user's data; the set of perturbed (randomized) user-side data is then sent. In an embodiment of the present disclosure, the perturbation is added using a randomized response mechanism under localized differential privacy. In another embodiment of the present disclosure, the perturbation is added using an information compression and warping mechanism under localized differential privacy.
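A minimal sketch of the randomized response mechanism the perturbation module might use, assuming the user's value is one of k discrete groups (function name and parameters are illustrative):

```python
import math
import random

def randomized_response(value: int, k: int, eps: float) -> int:
    """k-ary randomized response: keep the true group id with probability
    e^eps / (e^eps + k - 1); otherwise report one of the other k - 1
    groups uniformly at random. Satisfies eps-local differential privacy."""
    p_keep = math.exp(eps) / (math.exp(eps) + k - 1)
    if random.random() < p_keep:
        return value
    other = random.randrange(k - 1)          # pick among the k - 1 other groups
    return other if other < value else other + 1

# Each user perturbs its own group id before the data leaves the device.
random.seed(0)
reports = [randomized_response(2, k=4, eps=1.0) for _ in range(10)]
```

Because the randomization happens on the user side, the server only ever sees already-perturbed values.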
The data analysis module 508 obtains, at the server, a statistical approximation for the in-model data based on the expanded in-model data set.
Because many scenarios lack a sufficiently large amount of data, and the data may be isolated and unable to be shared, the perturbed expanded data sets can be aggregated at the server by the data analysis module 508, and the statistical approximation for the in-model data is then computed at the server, for example to explore the validity and stability of the existing data and to screen features suitable for use as model inputs.
After perturbation is added to data from different servers and different scenarios, the data of any single user is hidden, while the statistical trend of the overall data set is unaffected. Therefore, when an in-model variable provides little information for label discrimination, or the population stability of the variable is poor, data exploration on the perturbed expanded data set can effectively improve the generalization and robustness of the model.
Further, the data analysis module 508 corrects the statistical approximation for the in-model data at the server side. The obtained statistical approximation is not a true unbiased estimate and therefore needs to be corrected at the server side.
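The server-side correction can be illustrated for the randomized-response case: inverting the perturbation in expectation yields an unbiased estimate, which is then clipped and normalized, mirroring the normalization in step (3). This is a sketch under assumed parameters, not the disclosed implementation:

```python
import math

def debias_frequencies(observed, k, eps):
    """Invert k-ary randomized response in expectation. With keep
    probability p and flip probability q = (1 - p) / (k - 1), each
    observed frequency satisfies E[o_j] = p * f_j + q * (1 - f_j), so
    the unbiased estimate is f_j = (o_j - q) / (p - q). Estimates can
    fall outside [0, 1], so they are clipped and re-normalized."""
    p = math.exp(eps) / (math.exp(eps) + k - 1)
    q = (1.0 - p) / (k - 1)
    est = [(o - q) / (p - q) for o in observed]
    clipped = [max(e, 0.0) for e in est]
    total = sum(clipped)
    return [c / total for c in clipped]
```

Applied to the aggregated perturbed reports, this recovers the true group proportions in expectation while each individual report remains randomized.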
Thus, by adding noise to the original data through a differential privacy algorithm, and under the assumption of a large data volume, the differential privacy-based wind-control data exploration system can use terminal risk features and obtain approximations of the statistical information without invading user privacy.
The various steps and modules of the differential privacy based method and system for detecting wind control data described above may be implemented in hardware, software, or a combination thereof. If implemented in hardware, the various illustrative steps, modules, and circuits described in connection with the present invention may be implemented or performed with a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or other programmable logic component, hardware component, or any combination thereof. A general purpose processor may be a processor, microprocessor, controller, microcontroller, state machine, or the like. If implemented in software, the various illustrative steps, modules, etc. described in connection with the present invention may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Software modules implementing the various operations of the present invention may reside in storage media such as RAM, flash memory, ROM, EPROM, EEPROM, registers, hard disk, a removable disk, a CD-ROM, cloud storage, etc. A storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium, and execute the corresponding program modules to perform the steps of the present invention. Furthermore, software-based embodiments may be uploaded, downloaded, or accessed remotely through suitable communication means. Such suitable communication means include, for example, the internet, the world wide web, an intranet, software applications, cable (including fiber optic cable), magnetic communication, electromagnetic communication (including RF, microwave, and infrared communication), electronic communication, or other such communication means.
It is also noted that the embodiments may be described as a process which is depicted as a flowchart, a flow diagram, a structure diagram, or a block diagram. Although a flowchart may describe the operations as a sequential process, many of the operations can be performed in parallel or concurrently. In addition, the order of the operations may be rearranged.
The disclosed methods, apparatus, and systems should not be limited in any way. Rather, the invention encompasses all novel and non-obvious features and aspects of the various disclosed embodiments, both individually and in various combinations and sub-combinations with each other. The disclosed methods, apparatus, and systems are not limited to any specific aspect or feature or combination thereof, nor do any of the disclosed embodiments require that any one or more specific advantages be present or that a particular or all technical problem be solved.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes may be made in the embodiments without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. A differential privacy-based wind-control data exploration method, comprising:
acquiring in-model data at a user side;
adding perturbation to the acquired in-model data at the user side to obtain an expanded in-model data set;
aggregating the expanded in-model data set at a server;
obtaining, at the server, a statistical approximation for the in-model data based on the aggregated expanded in-model data set, wherein obtaining the statistical approximation for the in-model data comprises: acquiring, based on the expanded in-model data set, a difference between the distributions of training samples and validation samples, or acquiring a response sample ratio of the in-model data;
wherein obtaining the statistical approximation for the in-model data is based on: for variables within groups that are not independent and identically distributed,
Figure DEST_PATH_IMAGE002
where s is the maximum capacity of each group; and
correcting, at the server, the statistical approximation for the in-model data.
2. The method of claim 1, wherein adding the perturbation to the acquired in-model data at the user side comprises using a randomized response mechanism.
3. The method of claim 1, wherein obtaining the statistical approximation for the in-model data comprises obtaining a validity or stability of the in-model data for a model.
4. The method of claim 3, wherein the in-model data comprises a training sample set and a validation sample set.
5. A differential privacy-based wind-control data exploration system, comprising:
a data acquisition module that acquires in-model data at a user side;
a perturbation module that adds perturbation to the acquired in-model data at the user side to obtain an expanded in-model data set; and
a data analysis module that aggregates the expanded in-model data set at a server, obtains, at the server, a statistical approximation for the in-model data based on the aggregated expanded in-model data set, and corrects, at the server, the statistical approximation for the in-model data,
wherein the data analysis module obtaining the statistical approximation for the in-model data comprises: acquiring, based on the expanded in-model data set, a difference between the distributions of training samples and validation samples, or acquiring a response sample ratio of the in-model data;
wherein the data analysis module obtains the statistical approximation for the in-model data based on: for variables within groups that are not independent and identically distributed,
Figure DEST_PATH_IMAGE002A
where s is the maximum capacity of each group.
6. The system of claim 5, wherein the perturbation module adding the perturbation to the acquired in-model data at the user side comprises the perturbation module employing a randomized response mechanism.
7. The system of claim 5, wherein the data analysis module obtaining the statistical approximation for the in-model data comprises the data analysis module obtaining a validity or stability of the in-model data for a model.
8. The system of claim 7, wherein the in-model data comprises a training sample set and a validation sample set.
9. A computer-readable storage medium having stored thereon instructions that, when executed, cause a machine to perform the method of any of claims 1-4.
CN202210119822.6A 2022-02-09 2022-02-09 Wind control data exploration method and system based on differential privacy Active CN114154202B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210119822.6A CN114154202B (en) 2022-02-09 2022-02-09 Wind control data exploration method and system based on differential privacy

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210119822.6A CN114154202B (en) 2022-02-09 2022-02-09 Wind control data exploration method and system based on differential privacy

Publications (2)

Publication Number Publication Date
CN114154202A CN114154202A (en) 2022-03-08
CN114154202B true CN114154202B (en) 2022-06-24

Family

ID=80450277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210119822.6A Active CN114154202B (en) 2022-02-09 2022-02-09 Wind control data exploration method and system based on differential privacy

Country Status (1)

Country Link
CN (1) CN114154202B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902506A (en) * 2019-01-08 2019-06-18 中国科学院软件研究所 A Local Differential Privacy Data Sharing Method and System with Multiple Privacy Budgets
CN110727957A (en) * 2019-10-15 2020-01-24 电子科技大学 Differential privacy protection method and system based on sampling
CN112329056A (en) * 2020-11-03 2021-02-05 石家庄铁道大学 Government affair data sharing-oriented localized differential privacy method

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
US10467234B2 (en) * 2015-11-02 2019-11-05 LeapYear Technologies, Inc. Differentially private database queries involving rank statistics
CN112163896B (en) * 2020-10-19 2022-05-06 科技谷(厦门)信息技术有限公司 Federated learning system
CN112434280B (en) * 2020-12-17 2024-02-13 浙江工业大学 Federal learning defense method based on blockchain
CN113591133B (en) * 2021-09-27 2021-12-24 支付宝(杭州)信息技术有限公司 Method and device for performing feature processing based on differential privacy


Non-Patent Citations (2)

Title
Local Differential Privacy for Evolving Data; Joseph, M. et al.; Advances in Neural Information Processing Systems 31; 2019-04-05; full text *
Research on Privacy Protection Algorithms for Government Data Sharing Based on Localized Differential Privacy; Hao Yurong et al.; Journal of Intelligence; April 2021; Vol. 40, No. 2; pp. 169-175+137 *

Also Published As

Publication number Publication date
CN114154202A (en) 2022-03-08

Similar Documents

Publication Publication Date Title
Keshk et al. A privacy-preserving-framework-based blockchain and deep learning for protecting smart power networks
Li et al. Intrusion detection system using Online Sequence Extreme Learning Machine (OS-ELM) in advanced metering infrastructure of smart grid
Yang et al. Toward data integrity attacks against optimal power flow in smart grid
Higgins et al. Stealthy MTD against unsupervised learning-based blind FDI attacks in power systems
Jow et al. A survey of intrusion detection systems in smart grid
Yang et al. VoteTrust: Leveraging friend invitation graph to defend against social network sybils
CN109040130A (en) Mainframe network behavior pattern measure based on attributed relational graph
CN116383753B (en) Abnormal behavior prompting method, device, equipment and medium based on Internet of things
David et al. Detection of distributed denial of service attacks based on information theoretic approach in time series models
Abdulaal et al. Privacy-preserving detection of power theft in smart grid change and transmit (CAT) advanced metering infrastructure
Bajtoš et al. Network intrusion detection with threat agent profiling
Al-Ghaili et al. A Review of anomaly detection techniques in advanced metering infrastructure
Desai et al. Mitigating consumer privacy breach in smart grid using obfuscation-based generative adversarial network
Huang et al. LDPGuard: Defenses against data poisoning attacks to local differential privacy protocols
Touré et al. A framework for detecting zero-day exploits in network flows
Ibrahem Privacy-preserving and efficient electricity theft detection and data collection for AMI using machine learning
Feng et al. Sentinel: An aggregation function to secure decentralized federated learning
CN114154202B (en) Wind control data exploration method and system based on differential privacy
Zuo et al. ApaPRFL: robust privacy-preserving federated learning scheme against poisoning adversaries for intelligent devices using edge computing
CN116702922B (en) Training method, training device, terminal equipment and training medium based on malicious behavior detection
Pliatsios et al. Trust management in smart grid: A markov trust model
WO2024007565A1 (en) Network analysis using optical quantum computing
CN112257098B (en) Method and device for determining safety of rule model
Ahmed et al. Smart integration of cloud computing and MCMC based secured WSN to monitor environment
Lavrova et al. Detection of cyber threats to network infrastructure of digital production based on the methods of Big Data and multifractal analysis of traffic

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant