CN116566650B - Key value data collection method based on loose local differential privacy model - Google Patents
Key value data collection method based on loose local differential privacy model Download PDFInfo
- Publication number
- CN116566650B CN116566650B CN202310354580.3A CN202310354580A CN116566650B CN 116566650 B CN116566650 B CN 116566650B CN 202310354580 A CN202310354580 A CN 202310354580A CN 116566650 B CN116566650 B CN 116566650B
- Authority
- CN
- China
- Prior art keywords
- key
- value
- data
- privacy
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 48
- 238000013480 data collection Methods 0.000 title claims abstract description 20
- 230000007246 mechanism Effects 0.000 claims abstract description 37
- 230000008569 process Effects 0.000 claims abstract description 10
- 230000004044 response Effects 0.000 claims abstract description 9
- 238000004364 calculation method Methods 0.000 claims abstract description 7
- 238000005070 sampling Methods 0.000 claims description 30
- 238000006243 chemical reaction Methods 0.000 claims description 13
- 102100029469 WD repeat and HMG-box DNA-binding protein 1 Human genes 0.000 claims description 4
- 101710097421 WD repeat and HMG-box DNA-binding protein 1 Proteins 0.000 claims description 4
- 238000000354 decomposition reaction Methods 0.000 claims description 4
- 238000007781 pre-processing Methods 0.000 claims description 4
- 238000007619 statistical method Methods 0.000 claims description 4
- 230000007704 transition Effects 0.000 claims description 3
- 230000009466 transformation Effects 0.000 claims description 2
- 238000000844 transformation Methods 0.000 claims description 2
- 230000002776 aggregation Effects 0.000 claims 1
- 238000004220 aggregation Methods 0.000 claims 1
- 238000011156 evaluation Methods 0.000 claims 1
- 238000004422 calculation algorithm Methods 0.000 description 4
- 230000000875 corresponding effect Effects 0.000 description 4
- 238000011161 development Methods 0.000 description 4
- 238000002474 experimental method Methods 0.000 description 4
- 238000004458 analytical method Methods 0.000 description 3
- 238000007405 data analysis Methods 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 230000006872 improvement Effects 0.000 description 2
- 230000004807 localization Effects 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000001276 controlling effect Effects 0.000 description 1
- 230000002596 correlated effect Effects 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000004088 simulation Methods 0.000 description 1
- 238000013179 statistical model Methods 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/04—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
- H04L63/0428—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/10—Pre-processing; Data cleansing
- G06F18/15—Statistical pre-processing, e.g. techniques for normalisation or restoring missing data
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/462—Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/45—Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
- H04N21/466—Learning process for intelligent management, e.g. learning user preferences for recommending movies
- H04N21/4667—Processing of monitored end-user data, e.g. trend analysis based on the log file of viewer selections
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/50—Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Signal Processing (AREA)
- Multimedia (AREA)
- Computer Security & Cryptography (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Computing Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Networks & Wireless Communication (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Complex Calculations (AREA)
Abstract
The invention discloses a key value data collection method based on a loose local differential privacy model, which comprises the steps of defining a local differential privacy model of a key value data type, disclosing and optimally distributing privacy budget under the loose local differential privacy under a generalized random response mechanism. The problems are solved in the following application scenarios: and each user reports the values of a plurality of attributes to form a key value pair, probability calculation is carried out on privacy budget allocation and disturbance, then the user takes the value according to the privacy budget provided by the server, data disturbance is carried out on the values of the data key value pair in the user local terminal, and the disturbed noise value is submitted to a data collector. The server collects data after disturbance of the user, and analyzes and calculates a frequency estimation result of an original data key and a mean value estimation result corresponding to the key value. The invention can lead the protection process to be all at the client and resist the attack of the untrusted third party server.
Description
Technical Field
The invention belongs to the information security technology, and relates to a key value data collection method based on a loose local differential privacy model.
Background
With the development of new technological revolution and information age, the generation of network data brings great development and progress to enterprises and governments, and an important scientific basis is provided for subsequent policies and actions through data collection and analysis, so that better economic effect is realized. The data pair is a common data form in real life and has very wide application scenes. By analyzing the relation between the data, the information of the big data can be mined, so that data support is better provided for a manager, and service is better provided for a user. For example, when shopping or watching a movie, the user's preferences are analyzed and a large number of user preferences for movies are collected, and new products or high-score movies are recommended according to the habits of new and old users. Meanwhile, big data are double-edged swords, along with the development of the big data, the safety problem is increasingly developed, personal rights, public interests and even national interests are increasingly threatened, the importance of users on personal privacy protection is higher, and a large amount of privacy information leakage risks exist in the collecting process. Therefore, how to protect privacy is a urgent problem for the associated data.
Privacy calculations are a collection of techniques that cross-integrate many disciplines such as cryptography, statistics, artificial intelligence, computer hardware, etc. The method can calculate, circulate, and have a controllable value on the premise of not exposing the original data. Aiming at the defects of infinite privacy attack and privacy protection mechanism, dwork in 2006 proposes a differential privacy model (differential privacy, DP for short) [ Cynthia Dwork. Differential privacy. ICALP (2) 2006:1-12)]This technique is a data processing means in cryptography, and aims to provide a way to maximize the accuracy of data queries when they are queried from a statistical database, while minimizing the chance of identifying their records. Differential privacy not only strictly defines the background knowledge of an attacker, but also has a strict statistical model, which has become a standard for private data distribution. With the development of technology, there is a third party unreliable problem in real life, and the local differential privacy (local differential privacy, LDP for short) is generated, and the model inherits the advantages of differential privacy, in the local differential privacy, each user uses a mechanism M to perform disturbance processing on input x and upload y=m (x) to an unreliable server for data analysis, under the mechanism, only the user has own original data, and the server side can only receive y=m (x) and cannot directly access the original data, so that the unreliable server can be resistedThe third parties sum up. Formalized definition of local differential privacy is as follows: for a given ε εR+, the random mechanism M satisfies (ε, δ) -LDP if and only if for any pair of inputs x, x' and any pair of outputs y, the probability relation for outputting the same y is: pr (M (x) =y). Ltoreq.e ε Pr (M (x')=y) +delta, where ε is 0 or more and δ ε [0,1 ]]. Epsilon is a privacy budget, is a parameter for measuring the exposure degree of the privacy of a user, and can reflect the degree of privacy delivery of the user to a third party, and can be used for controlling the intensity of privacy protection, wherein the larger epsilon is, and the smaller epsilon is, so that the stronger privacy protection is represented. When epsilon is small, an attacker needs to ensure that the exact information is in a low state when trying to distinguish between any pair of inputs x, x'. Delta is the loose degree, which is a parameter for measuring the degree of local differential privacy which cannot be guaranteed by an algorithm, and the smaller delta is, the higher the possibility that the algorithm guarantees to meet epsilon-LDP is, but the larger the applicability can be changed for a certain loose degree. To exacting local difference privacy when δ=0, when δ>Loose local differential privacy at 0. Existing LDP focuses mainly on simple statistical queries such as frequency and histogram estimation of classified data.
However, real world applications typically involve many different data types, key-value data being a type of data common in many applications. For example, for a movie rating system, each user has a plurality of movie viewing records (keys) and scores (values) for the corresponding movie viewing records, i.e., each user has a set of key-value pairs < k, v >. The data collector may aggregate all user ratings records and statistically analyze the relevant attributes of a movie, such as the number of persons watching the movie and the average value of the scores for the movie. After statistical analysis of the data, the server may provide movie suggestions by selecting a high frequency and large average means, where both the frequency of the keys and the average of the values under each key must be estimated simultaneously. For data collection privacy protection using key values, the Xiaolan Gu et al [ Xiaolan Gu, mig Li, yueqiang Cheng, li Xiong, yang cao.pckv: locally Differentially Private Correlated Key-Value Data Collection with Optimized availability.usenix Security Symposium,2020:967-984] proposes a local differential privacy-based key value data (PCKV-GRR) to implement a key value-to-data collection framework that utilizes a related perturbation mechanism to enhance privacy protection effects. But this is only for the special case of strict local differential privacy and does not extend to key value data mechanisms under loose local differential privacy. In the invention, the aim is to design a high-precision key value data acquisition and analysis mechanism under the privacy constraint of an (epsilon, delta) -LDP model.
Disclosure of Invention
The invention aims to: the invention provides a key value data collection method based on a loose local differential privacy model, which extends the existing strict local differential privacy key value data mechanism so as to be more widely applied later.
The invention discloses a key value data collection method based on a loose local differential privacy model, which comprises the following steps:
s1, setting a server under a generalized random response mechanism, wherein the server comprises an overall privacy budget epsilon and an overall looseness delta of a setting model, and the privacy budget epsilon is distributed into the privacy budget epsilon related to key disturbance through decomposition 1 And degree of loosening of key perturbation delta 1 Setting privacy budget ε related to value perturbation 2 Degree of looseness delta of sum disturbance 2 At the same time epsilon and epsilon 1 、δ 1 、ε 2 And delta 2 Publishing to a user to deduce the optimal probability of key and value conversion in the element;
s2, each user locally transmits the original data<k,v>Preprocessing a data set, setting the original data as S, filling, and setting the filling length as l; for the whole mechanism, the number of all original keys is d, and the total length d' =d+l in the protocol is set; then sampling key value pairs and sampling the sampled dataAccording to->Discretizing the size of (3);
s3, the user performs key value pair according to the data in the samplingPerturbation of the key and value is performed and the perturbed data is +.>Sending the data to a server;
and S4, the server performs statistical analysis according to disturbance data sent by the user, and estimates a frequency distribution result and an average value estimation size of the original data.
In step (S1), the optimal allocation method of the privacy budget of the key-value data collection mechanism is as follows:
s11, in a key value and data collection method for loose local differential privacy under a generalized random response mechanism (GRR for short), the privacy budgets of key and value disturbance are assumed to be epsilon respectively 1 And epsilon 2 Degree of loosening of key perturbation delta 1 Degree of loosening of value disturbance delta 2 . Pr (M (x) =y). Ltoreq.e according to the definition of (ε, δ) -LDP ε Pr (M (x')=y), i.e. the case where the key conversion in the key pair is deduced, the probability of 1.fwdarw.1 is a, and the probability of 0.fwdarw.1 is b. The disturbance probability of a and b in the key is calculated respectively:
s12, the definition of (epsilon, delta) -LDP should be satisfied for the value disturbance stage in the key value pair, namely the probability of the median 1-1 of the key value pair isAnd the probability of 1.fwdarw.1 is 1-p.
S13, calculating the probability of the whole disturbance aiming at the key value. Data for discrete key values<k,v>Disturbance into the PCKV mechanism<k',v'>V when k.epsilon. { d+1, … d '} based on the boundary value of k and Pr (y' |S) * k =0, deriving the privacy budget requires a borderline discussion. Case 1 when Privacy budgets at the time Case 2 when->Privacy budgets at the time
S14, optimizing the privacy budget to make full use of the privacy budget. The method adopts a method for evaluating the utility of a mechanism, and uses a Mean Square Error (MSE) to evaluate, namely, the smaller the mean square error is, the better the utility is. Estimation of the quantityThe mean square error of (a) can be calculated by the sum of squares of the variance and its deviation, i.e.)>According to the function analysis, need to be about>Taking the minimum value. Finally, a relation between the disturbance probability and the privacy can be obtained. The fitness conversion probability a= (l (e) ε -1)+2+2lδ(d'-1)/(l(e ε -1) +2d'); the bond transition probability b=2 (1-lδ)/(l (e) ε -1) +2d'). Key value pairConversion probability p= (l (e) ε -1)+1+lδ(2d'-1))/(l(e ε -1)+2+2lδ(d'-1))。
In step (S2), the method for filling and sampling the original real data of the user terminal includes the following steps:
s21, filling and sampling the original data S, setting the filling length as l, and filling l data when the user data set is empty. For the whole mechanism, the number of all original keys is d, d '=d+l is set, where d' represents the size of the filled key field. The key field K '= {1,2, the. D' }.
S22, sampling probability eta calculation is carried out on non-pseudo key value pairs, eta= |S|/[ max { |S|, l }]This parameter will be used as two randomly distributed parameters. The sampled value is preserved when the sampled key is truly data, and if a dummy key is sampled, a dummy value v will be assigned * =0, so as to ensure that sampling to the pseudo-key does not affect the mean value estimation.
S23, sampling the sampled dataAccording to->Is discretized. At this time, the sampled dataAccording to->Is discretized. Wherein the probability of a value of 1 is +.>The probability of conversion to-1 is
In step (S4), the process of estimating the frequency and the mean value of the data after the server receives the disturbance is as follows:
s41, after uploading the value output of all users to the server, the server counts the output 1 and-1 numbers, respectively denoted as n 1 And n 2 Wherein n is 1 =num(y'=<k,1>) And n 2 =num(y'=<k,-1>). Then, for n 1 And n 2 Calibration is performed to estimate the frequency and average of the keys K e K.
S42, for frequency estimation, a baseline estimation method is adopted and a frequency estimator is used, which is unbiased when the size of the data set of each user does not exceed l. Due to n 1 +n 2 Is the observed number of users possessing keys, the method has the following equivalent frequency estimators
The probability of false value-1 or 1 generated in the server is 0.5 (i.e. the expected value of false value is zero) respectively, so there is no reference meaning in the process, but the average value can be estimated by dividing the sum by the count of the real keys. The count of the real keys reported as owned may be usedTo approximate, because the sampling rate is 1/l, the probability that a real key is reported as owned is a. Thus, the corresponding average estimator +.>
The beneficial effects are that: according to the invention, privacy protection optimization is carried out locally on loose local differential privacy under a generalized random response mechanism, and optimal allocation is carried out on privacy parameters epsilon and delta through calculation so as to realize minimum variance, so that the full-sensitive data privacy protection degree is not reduced in the model, and the accuracy of frequency estimation and mean value estimation results is ensured; the invention fills and samples the data key value pairs in the collecting process, protects the relevance among homogeneous data, ensures the unbiasedness of the data, and improves the overall data utility on the basis of the original scheme.
Drawings
FIG. 1 is an overall flow chart of the present invention;
FIG. 2 is a schematic diagram of a client flow chart of an embodiment of the present invention;
fig. 3 is a schematic flow chart of a user side according to an embodiment of the present invention.
Detailed Description
The above-described aspects are further described below in conjunction with specific embodiments. It should be understood that these examples are illustrative of the present invention and are not intended to limit the scope of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without departing from the inventive concepts herein, are intended to be within the scope of the invention.
For a detailed description of the disclosed embodiments, reference will now be made to the accompanying drawings and examples. Firstly, the invention combines the problem that the frequency of key value data and the average value under each key value cannot be calculated simultaneously aiming at homoplasmic data in the existing Local Differential Privacy (LDP) application, in addition, an attacker is prevented from stealing data submitted by the user, and the attacker is prevented from possibly acquiring the data of the user from a trusted third-party server, so that the user has the risk of data privacy violation, such as data of position, shopping records, film scoring and the like. According to the invention, privacy budget allocation is carried out on the model by utilizing the loose local differential privacy data key value, then a data disturbance mechanism is carried out by optimizing privacy budget according to the defined model, and finally data is uploaded to the server, so that the safety of the original data is ensured. I.e. no data in a single user key can be obtained by an attacker, by any means.
Specifically, the loose local differential privacy key value implementation steps of the data acquisition method are as follows:
s1, setting an overall privacy budget epsilon and an overall looseness delta. By decomposition, allocation of privacy budget ε as key perturbation involvesDue to privacy budget ε 1 And degree of loosening of key perturbation delta 1 Setting privacy budget ε related to value perturbation 2 Degree of looseness delta of sum disturbance 2 At the same time epsilon and epsilon 1 、δ 1 、ε 2 And delta 2 The optimal probability of key and value transformations in the element will be deduced for the user.
S2, each user locally transmits the original data<k,v>Preprocessing a data set: setting the original data as S and filling, and setting the filling length as l. For the whole mechanism, the number of all original keys is d, the total length in the protocol d' =d+l is set. Sampling key value pairs, and sampling the sampled dataAccording to->Is discretized.
S3, the user performs key value pair according to the data in the samplingPerturbation of the key and value is performed and the perturbed data is +.>Sending the data to a server;
and S4, the server performs statistical analysis according to disturbance data sent by the user, and estimates a frequency distribution result and an average value estimation size of the original data.
In step S1, the optimal allocation method of the privacy budget of the key-value data collection mechanism is as follows:
s11, for a generalized random response mechanism (GRR for short) in a key value under loose local differential privacy and data collection mechanism, the privacy budgets of key and value disturbance are assumed to be epsilon respectively 1 And epsilon 2 Degree of loosening of key perturbation delta 1 Degree of loosening of value disturbance delta 2 . Pr (M (x) =y). Ltoreq.e according to the definition of (ε, δ) -LDP ε Pr (M (x')=y), i.e. derivingThe probability of 1 to 1 in the key-out value pair is a, and the probability of 0 to 1 is b. The disturbance probability of a and b in the key is calculated respectively:
s12, the definition of (epsilon, delta) -LDP should be satisfied for the value disturbance stage in the key value pair, namely the probability of the median 1-1 of the key value pair is
S13, calculating the probability of the whole disturbance aiming at the key value. Data for discrete key values<k,v>Disturbance into the PCKV mechanism<k',v'>V when k.epsilon. { d+1, … d '} based on the boundary value of k and Pr (y' |S) * k =0, deriving the privacy budget requires a borderline discussion. Case 1 when Privacy budgets at the time Case 2 when->Privacy budgets at the time
S14, in order to fully utilize privacy budgetFor compact use, it will be optimized next. The method adopts a method for evaluating the utility of a mechanism, namely a Mean Square Error (MSE) is used, namely the smaller the mean square error is, the better the utility is. Estimation of the quantityThe mean square error of (a) can be calculated by the sum of squares of the variance and its deviation, i.e.)> According to the function analysis, need to be about>Taking the minimum value. Finally, a relation between the disturbance probability and the privacy can be obtained. The fitness conversion probability a= (l (e) ε -1)+2+2lδ(d'-1)/(l(e ε -1) +2d'); the bond transition probability b=2 (1-lδ)/(l (e) ε -1) +2d'). The probability of conversion p= (l (e) ε -1)+1+lδ(2d'-1))/(l(e ε -1)+2+2lδ(d′-1))。
In step S2, the method for filling and sampling the original real data of the user terminal includes the following steps:
s21, filling and sampling the original data S, setting the filling length as l, and filling l data when the user data set is empty. For the whole mechanism, the number of all keys originally is d, d '=d+l is set, where d' represents the size of the filled key field, the key field K '= {1,2, the. D' }.
S22, sampling probability eta calculation is carried out on non-pseudo key value pairs, eta= |S|/[ max { |S|, l }]This parameter is used in experimental simulations as two randomly distributed parameters. When the sampled key is actually present, the sampled value is reserved and converted according to the size of the value range, and if a pseudo key is sampled, a pseudo value v is allocated * =0, so as to ensure that sampling to the pseudo-key does not affect the mean value estimation.
S23, discretizing sampling data: sampling the dataAccording to->Is discretized. When a pseudo key is sampled, a pseudo value v is assigned * =0, thus ensuring that sampling to the pseudo-key does not affect the mean value estimation. At this time, the sampled data +.>According to->Is discretized. Wherein the probability of a value of 1 is +.>The probability of conversion to-1 is +.>
In step S4, the process of estimating the frequency and the mean value of the data by the server end after receiving the disturbed value is as follows:
s41, after uploading the value output of all users to the server, the server counts the output 1 and-1 numbers, respectively denoted as n 1 And n 2 Wherein n is 1 =num(y'=<k,1>) And n 2 =num(y'=<k,-1>). Then, for n 1 And n 2 Calibration is performed.
S42, for frequency estimation, a baseline estimation method is adopted and a frequency estimator is used, and when the size of the data set of each user does not exceed l, the estimator is unbiased. Due to n 1 +n 2 Is the observed number of users possessing keys, has the following equivalent frequencyRate estimator
The probability of false value-1 or 1 generated in the server is 0.5 (i.e. the expected value of false value is zero) respectively, so there is no reference meaning in the process, but the average value can be estimated by dividing the sum by the count of the real keys. The count of the real keys reported as owned may be usedTo approximate, because the sampling rate is 1/l, the probability that a real key is reported as owned is a. Thus, the corresponding average estimator +.>
Referring to fig. 1, the specific flow steps of the user side in the loose local differential privacy key value data acquisition method are as follows:
step one: the original data set S for possession is padded and sampled, the length of the padding is set to l, and when the user data set is empty, l data will be padded. For the entire mechanism, the number of all keys originally is d, d ' =d+l is set, key field K ' = {1,2, the. D ' }.
Step two: for the non-pseudo key value pair, sampling probability eta is calculated, eta= |S|/[ max { |S|, l } ], and the parameter is used as the parameter of the two random distribution. And executing the third step if the sampled key value actually exists; if not, executing the fourth step.
Step three: and when the sampled key is actually present, the sampled value is reserved and converted according to the size of the value range, and the fifth step is executed.
Step four: if a pseudo key is sampled, a pseudo value v will be assigned * And (3) the method comprises the following steps of (1) and (0) ensuring that the mean value estimation is not influenced when the pseudo key is sampled, and executing the fifth step.
Step five: sampling the dataAccording to->Is discretized. Wherein the probability of a value of 1 is +.> The probability of conversion to-1 is +.>
Step six: user will disturb the resultsAnd sending the data to a server.
With reference to fig. 2, the specific flow steps of the server side in the loose local differential privacy key value data acquisition method are as follows:
step one: the overall privacy budget epsilon and the overall looseness delta are set. By decomposition, the privacy budget epsilon is distributed as the privacy budget epsilon related to key disturbance 1 And degree of loosening of key perturbation delta 1 Setting privacy budget ε related to value perturbation 2 Degree of looseness delta of sum disturbance 2 At the same time epsilon and epsilon 1 、δ 1 、ε 2 And delta 2 Published to the user.
Step two: the server collects the disturbance data results sent by all n users, counts the number of 1 and-1 according to the disturbed key value data sent by the users, and respectively represents the number as n 1 And n 2 Wherein n is 1 =num(y'=<k,1>) And n 2 =num(y'=<k,-1>)。
Step three: calculating a frequency estimate, when the size of the data set for each user does not exceed l,the estimator is unbiased. Due to n 1 +n 2 Is the observed number of users with keys, the frequency estimator calculates as:
step four: estimating an average by dividing the sum by the count of real keys, the count of real keys reported as owned may be usedTo approximate, the probability that a real key is reported as owned is a. According to the probability of 0.5 of false value-1 or 1 generated in the mechanism, it is proved that the mean value estimation can be obtained when the sum of the values is not changed in statistics>
Step five: the server obtains the frequency estimation results of all keys and the average value estimation results of the corresponding values.
According to the deduction, the optimized privacy allocation of the key value data collection mechanism under the loose local differential privacy model can be obtained, and in addition, the random response scheme disturbance probability under the strict localization and the loose localization is summarized, so that the table 1 can be seen. Furthermore, the present invention evaluates the utility mechanism by means of Mean Square Error (MSE). The scientific credibility of the mechanism in the use process is improved.
TABLE 1 random response scheme perturbation probability under strictly localized and loosely localized
The following are experimental results of the present invention. The dataset used in the experiment was closing, which is a dataset on Kaggle in which scores of 105508 users for 5850 things were recorded, each score being a record, togetherThere were 192544 records. The invention constrains the record of each user score, sets the filling number l to about 10 to 10000, and specifies the range of the loosening degree delta, delta= [10 ] -7 ,10 -2 ]. 7 privacy classes were divided in the experiment, with privacy budgets of 0.1,1.0,2.0,3.0,4.0,5.0,6.0, respectively. Each user perturbs its own data locally in the experiment and sends the results to the server, becauseThe MSE is affected little, so the server calculates the mean value result MSE mean 。
TABLE 2 influence of mean square error by parameter l
TABLE 3 influence of the mean square error by the parameter delta
TABLE 4 influence of the mean square error by the parameter ε
As can be seen from table 2, setting the filling size to be about l=100 for the data set used in the present experiment can make the overall performance better. By fixing δ at a small value close to 0 through table 3, it can be seen that the mean square error curve of the frequency is substantially the same as the curve without δ, indicating a smaller degree of looseness δ, with little impact on overall scheme performance; fixing δ at a larger value, it can be seen that the mean square error curve of frequency is smaller than that of table 3, indicating that the greater the looseness δ, the greater the impact on the overall scheme performance, and under this parameter, the greater the improvement in overall mechanism performance. As can be seen from table 4, the mean square error of the mean estimate in the optimal distribution of the privacy budget is substantially smaller than for the simple distribution. It can be derived that δ is a parameter of the degree of looseness in the loose local differential privacy, and the influence on the whole algorithm is that the stability of the algorithm performance is reduced when δ is larger.
The method and the device can protect the original data of the user from being acquired by an attacker, resist the attacker with arbitrary background knowledge and prevent privacy attack from an untrusted third party, solve the problem that the existing local differential privacy key value data acquisition method does not consider two-dimensional data, realize finer-granularity protection, reasonably protect sensitive data on the basis of ensuring the security of the privacy data, and improve the utility of estimating the privacy data.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Claims (5)
1. The key value data collection method based on the loose local differential privacy model is characterized by comprising the following steps of:
s1, setting a server under a generalized random response mechanism, wherein the server comprises an overall privacy budget epsilon and an overall looseness delta of a setting model, and the privacy budget epsilon is distributed into the privacy budget epsilon related to key disturbance through decomposition 1 And degree of loosening of key perturbation delta 1 Setting privacy budget ε related to value perturbation 2 Degree of looseness delta of sum disturbance 2 At the same time epsilon and epsilon 1 、δ 1 、ε 2 And delta 2 Publishing to a user to deduce the optimal probability of key and value conversion in the element;
the server in step (S1) optimizes the privacy allocation, and deriving the optimal probability of key and value transformations in the element comprises the following steps:
s11, based on a generalized random response mechanism, the method assumes that privacy budgets of key and value disturbance are epsilon respectively 1 And epsilon 2 Degree of loosening of key perturbation delta 1 Degree of loosening of value disturbance delta 2 Pr (M (x) =y). Ltoreq.e according to the definition of (ε, δ) -LDP ε Pr (M (x')=y), i.e. deriving key-to-key conversion in key-value pairs: the probability of 1-1 is a, the probability of 0-1 is b, and then the disturbance probabilities of a and b in the key are calculated respectively:
s12, the definition of (epsilon, delta) -LDP should be satisfied for the value disturbance stage in the key value pair, namely the probability of the median 1-1 of the key value pair isAnd the probability of 1.fwdarw. -1 is 1-p;
s13, calculating probability of overall disturbance of the key value, and aiming at the discrete key value, calculating data<k,v>Is disturbed to be<k′,v′>V when k e { d+1,..d '} based on the boundary value of k and Pr (y' |s) * k =0, deriving the privacy budget requires a boundary to be discussed;
case 1: when (when)Privacy budget +.>
Case 2: when (when)Privacy budget +.>
S14, optimizing privacy budget in a mode of evaluating mechanism effectiveness to maximize utilization rate, wherein the step uses mean square error for evaluation and estimation quantityIs calculated by the sum of squares of the variance and its deviation, i.e./> Then analyzing the required pair according to the function>Taking the minimum value, and finally obtaining a relation between the disturbance probability and the privacy;
the fitness conversion probability a= (l (e) ε -1)+2+2lδ(d′-1)/(l(e ε -1)+2d′);
The bond transition probability b=2 (1-lδ)/(l (e) ε -1)+2d′);
The probability of conversion p= (l (e) ε -1)+1+lδ(2d′-1))/(l(e ε -1)+2+2lδ(d′-1));
S2, each user locally transmits the original data<k,v>Preprocessing a data set, setting the original data as S, filling, and setting the filling length as l; for the whole mechanism, the number of all original keys is d, and the total length d' =d+l in the protocol is set; then sampling key value pairs and sampling the sampled dataAccording to->Discretizing the size of (3);
s3, the user performs key value pair according to the data in the samplingPerforming key and value perturbation and data after perturbationSending the data to a server;
and S4, the server performs statistical analysis according to disturbance data sent by the user, and estimates a frequency distribution result and an average value estimation size of the original data.
2. The loose local differential privacy model-based key value data collection method of claim 1, wherein: the preprocessing of the data set in the step (S2) further comprises the filling and sampling of the original real data by the user side, including the following procedures:
s21, filling and sampling the original data S, setting the filling length as l, and filling l data when the user data set is empty; for the whole mechanism, the number of all original keys is d, d '=d+l is set, where d' represents the size of the filled key field; a key domain K '= {1,2, …, d' };
s22, sampling probability eta calculation is carried out on non-pseudo key value pairs, eta= |S|/[ max { |S|, l }]The parameter is to be used as two randomly distributed parameters; the sampled value is preserved when the sampled key is truly data, and if a dummy key is sampled, a dummy value v will be assigned * =0, so as to ensure that the mean value estimation is not affected when the pseudo key is sampled;
s23, sampling the sampled dataAccording to->Discretizing the size of (3); at this time, the sampled dataAccording to->Discretizing the size of (3); wherein the probability of a value of 1 is +.>The probability of conversion to-1 is
3. The key value data collection method based on the loose local differential privacy model according to claim 1, wherein in step (S4), the process of estimating the frequency and the mean value of the data after receiving the disturbance by the server is as follows:
s41, after uploading the value output of all users to the server, the server respectively represents the number of 1 and-1 which are counted and output as n 1 And n 2 Wherein n is 1 =num(y′=<k,1>) And n 2 =num(y′=<k,-1>) The method comprises the steps of carrying out a first treatment on the surface of the Then, for n 1 And n 2 Performing calibration;
s42, for frequency estimation, adopting a baseline estimation method and using a frequency estimator, wherein when the size of a data set of each user is not more than l, the estimator is unbiased;
due to n 1 +n 2 Is the observed number of users with keys, so there are the following equivalent frequency estimators
S43, the probability of false value-1 or 1 generated in the server is 0.5 respectively, so that the reference meaning is not existed in the process, but the average value can be estimated by dividing the sum by the count of the real keys;
counting of real keys reported as ownedTo be similar, because the sampling rate is 1/l, the probability that a real key is reported as owned is a; thus, the corresponding average estimator
4. The loose local differential privacy model-based key value data collection method of claim 1, wherein: in the method, a user performs local-end on own original dataAnd carrying out disturbance, sending the disturbed result to a server, and carrying out aggregation calculation by the server to obtain all user data estimated values.
5. The loose local differential privacy model-based key value data collection method of claim 1, wherein: the method calculates the privacy budget by using the mean square error, wherein the estimator calculates the privacy budget by back-deriving the optimized privacy budget allocation by achieving better mechanism performanceMay be calculated by the sum of squares of the variance and its deviation; and deriving a relation between the disturbance probability and the privacy through mean square error of frequency estimation and mean value estimation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310354580.3A CN116566650B (en) | 2023-04-06 | 2023-04-06 | Key value data collection method based on loose local differential privacy model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310354580.3A CN116566650B (en) | 2023-04-06 | 2023-04-06 | Key value data collection method based on loose local differential privacy model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116566650A CN116566650A (en) | 2023-08-08 |
CN116566650B true CN116566650B (en) | 2024-01-26 |
Family
ID=87499082
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310354580.3A Active CN116566650B (en) | 2023-04-06 | 2023-04-06 | Key value data collection method based on loose local differential privacy model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116566650B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113206831A (en) * | 2021-03-31 | 2021-08-03 | 南京邮电大学 | Data acquisition privacy protection method facing edge calculation |
CN114462032A (en) * | 2022-04-13 | 2022-05-10 | 北京理工大学 | Method for detecting data collection virus attack of key value under localized differential privacy |
CN114884682A (en) * | 2022-07-07 | 2022-08-09 | 湖南工商大学 | Crowd sensing data stream privacy protection method based on self-adaptive local differential privacy |
CN115718927A (en) * | 2022-10-21 | 2023-02-28 | 桂林电子科技大学 | Difference privacy mixed recommendation method based on untrusted server |
CN115906164A (en) * | 2022-11-22 | 2023-04-04 | 南京航空航天大学 | Local differential privacy-based utility optimization key value data protection method and device |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11328084B2 (en) * | 2020-02-11 | 2022-05-10 | LeapYear Technologies, Inc. | Adaptive differentially private count |
-
2023
- 2023-04-06 CN CN202310354580.3A patent/CN116566650B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113206831A (en) * | 2021-03-31 | 2021-08-03 | 南京邮电大学 | Data acquisition privacy protection method facing edge calculation |
CN114462032A (en) * | 2022-04-13 | 2022-05-10 | 北京理工大学 | Method for detecting data collection virus attack of key value under localized differential privacy |
CN114884682A (en) * | 2022-07-07 | 2022-08-09 | 湖南工商大学 | Crowd sensing data stream privacy protection method based on self-adaptive local differential privacy |
CN115718927A (en) * | 2022-10-21 | 2023-02-28 | 桂林电子科技大学 | Difference privacy mixed recommendation method based on untrusted server |
CN115906164A (en) * | 2022-11-22 | 2023-04-04 | 南京航空航天大学 | Local differential privacy-based utility optimization key value data protection method and device |
Non-Patent Citations (3)
Title |
---|
《PCKV: Locally Differentially Private Correlated Key-Value Data Collection with Optimized Utility》;Gu, XL;《29th USENIX Security Symposium》;967-984 * |
基于本地差分隐私的键-值数据精确收集方法;张啸剑;付楠;孟小峰;;计算机学报(第08期);97-110 * |
本地化差分隐私研究综述;叶青青;孟小峰;朱敏杰;霍峥;;软件学报(第07期);159-183 * |
Also Published As
Publication number | Publication date |
---|---|
CN116566650A (en) | 2023-08-08 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Wang et al. | Privacy preservation in big data from the communication perspective—A survey | |
Sun et al. | LDP-FL: Practical private aggregation in federated learning with local differential privacy | |
KR100595786B1 (en) | A system and method to determine the validity of an interaction on a network | |
US8069210B2 (en) | Graph based bot-user detection | |
Wang et al. | Privset: Set-valued data analyses with locale differential privacy | |
KR20150115772A (en) | Privacy against interference attack against mismatched prior | |
Kotsogiannis et al. | One-sided differential privacy | |
CN110866263B (en) | User privacy information protection method and system capable of resisting longitudinal attack | |
Ozturk et al. | From existing trends to future trends in privacy‐preserving collaborative filtering | |
Huang et al. | An improved federated learning approach enhanced internet of health things framework for private decentralized distributed data | |
Fu et al. | GC-NLDP: A graph clustering algorithm with local differential privacy | |
CN115130119B (en) | Utility optimization set data protection method based on local differential privacy | |
Kleiner et al. | A general bootstrap performance diagnostic | |
Li et al. | Differential privacy location protection method based on the Markov model | |
CN116566650B (en) | Key value data collection method based on loose local differential privacy model | |
CN112968873B (en) | Encryption method and device for private data transmission | |
Ioannidis et al. | Privacy tradeoffs in predictive analytics | |
Giordano et al. | Uprise-iot: User-centric privacy & security in the iot | |
Venkatasubramanian | Measures of anonymity | |
Pon et al. | Data quality inference | |
Zhu et al. | Multimedia fusion privacy protection algorithm based on iot data security under network regulations | |
Boenisch | Differential privacy: general survey and analysis of practicability in the context of machine learning | |
Fu et al. | DP-HORUS: Differentially Private Hierarchical Count Histograms under Untrusted Server | |
Narule et al. | Privacy preservation using optimized Federated Learning: A critical survey | |
Cheng et al. | Universal interactive verification framework for federated learning protocol |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |