CN116566650B

CN116566650B - Key value data collection method based on loose local differential privacy model

Info

Publication number: CN116566650B
Application number: CN202310354580.3A
Authority: CN
Inventors: 胡朋; 朱友文; 王凤祥; 张跃
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2023-04-06
Filing date: 2023-04-06
Publication date: 2024-01-26
Anticipated expiration: 2043-04-06
Also published as: CN116566650A

Abstract

The invention discloses a key value data collection method based on a loose local differential privacy model, which comprises the steps of defining a local differential privacy model of a key value data type, disclosing and optimally distributing privacy budget under the loose local differential privacy under a generalized random response mechanism. The problems are solved in the following application scenarios: and each user reports the values of a plurality of attributes to form a key value pair, probability calculation is carried out on privacy budget allocation and disturbance, then the user takes the value according to the privacy budget provided by the server, data disturbance is carried out on the values of the data key value pair in the user local terminal, and the disturbed noise value is submitted to a data collector. The server collects data after disturbance of the user, and analyzes and calculates a frequency estimation result of an original data key and a mean value estimation result corresponding to the key value. The invention can lead the protection process to be all at the client and resist the attack of the untrusted third party server.

Description

Key value data collection method based on loose local differential privacy model

Technical Field

The invention belongs to the information security technology, and relates to a key value data collection method based on a loose local differential privacy model.

Background

With the development of new technological revolution and information age, the generation of network data brings great development and progress to enterprises and governments, and an important scientific basis is provided for subsequent policies and actions through data collection and analysis, so that better economic effect is realized. The data pair is a common data form in real life and has very wide application scenes. By analyzing the relation between the data, the information of the big data can be mined, so that data support is better provided for a manager, and service is better provided for a user. For example, when shopping or watching a movie, the user's preferences are analyzed and a large number of user preferences for movies are collected, and new products or high-score movies are recommended according to the habits of new and old users. Meanwhile, big data are double-edged swords, along with the development of the big data, the safety problem is increasingly developed, personal rights, public interests and even national interests are increasingly threatened, the importance of users on personal privacy protection is higher, and a large amount of privacy information leakage risks exist in the collecting process. Therefore, how to protect privacy is a urgent problem for the associated data.

Privacy calculations are a collection of techniques that cross-integrate many disciplines such as cryptography, statistics, artificial intelligence, computer hardware, etc. The method can calculate, circulate, and have a controllable value on the premise of not exposing the original data. Aiming at the defects of infinite privacy attack and privacy protection mechanism, dwork in 2006 proposes a differential privacy model (differential privacy, DP for short) [ Cynthia Dwork. Differential privacy. ICALP (2) 2006:1-12)]This technique is a data processing means in cryptography, and aims to provide a way to maximize the accuracy of data queries when they are queried from a statistical database, while minimizing the chance of identifying their records. Differential privacy not only strictly defines the background knowledge of an attacker, but also has a strict statistical model, which has become a standard for private data distribution. With the development of technology, there is a third party unreliable problem in real life, and the local differential privacy (local differential privacy, LDP for short) is generated, and the model inherits the advantages of differential privacy, in the local differential privacy, each user uses a mechanism M to perform disturbance processing on input x and upload y=m (x) to an unreliable server for data analysis, under the mechanism, only the user has own original data, and the server side can only receive y=m (x) and cannot directly access the original data, so that the unreliable server can be resistedThe third parties sum up. Formalized definition of local differential privacy is as follows: for a given ε εR+, the random mechanism M satisfies (ε, δ) -LDP if and only if for any pair of inputs x, x' and any pair of outputs y, the probability relation for outputting the same y is: pr (M (x) =y). Ltoreq.e ^ε Pr (M (x')=y) +delta, where ε is 0 or more and δ ε [0,1 ]]. Epsilon is a privacy budget, is a parameter for measuring the exposure degree of the privacy of a user, and can reflect the degree of privacy delivery of the user to a third party, and can be used for controlling the intensity of privacy protection, wherein the larger epsilon is, and the smaller epsilon is, so that the stronger privacy protection is represented. When epsilon is small, an attacker needs to ensure that the exact information is in a low state when trying to distinguish between any pair of inputs x, x'. Delta is the loose degree, which is a parameter for measuring the degree of local differential privacy which cannot be guaranteed by an algorithm, and the smaller delta is, the higher the possibility that the algorithm guarantees to meet epsilon-LDP is, but the larger the applicability can be changed for a certain loose degree. To exacting local difference privacy when δ=0, when δ>Loose local differential privacy at 0. Existing LDP focuses mainly on simple statistical queries such as frequency and histogram estimation of classified data.

However, real world applications typically involve many different data types, key-value data being a type of data common in many applications. For example, for a movie rating system, each user has a plurality of movie viewing records (keys) and scores (values) for the corresponding movie viewing records, i.e., each user has a set of key-value pairs < k, v >. The data collector may aggregate all user ratings records and statistically analyze the relevant attributes of a movie, such as the number of persons watching the movie and the average value of the scores for the movie. After statistical analysis of the data, the server may provide movie suggestions by selecting a high frequency and large average means, where both the frequency of the keys and the average of the values under each key must be estimated simultaneously. For data collection privacy protection using key values, the Xiaolan Gu et al [ Xiaolan Gu, mig Li, yueqiang Cheng, li Xiong, yang cao.pckv: locally Differentially Private Correlated Key-Value Data Collection with Optimized availability.usenix Security Symposium,2020:967-984] proposes a local differential privacy-based key value data (PCKV-GRR) to implement a key value-to-data collection framework that utilizes a related perturbation mechanism to enhance privacy protection effects. But this is only for the special case of strict local differential privacy and does not extend to key value data mechanisms under loose local differential privacy. In the invention, the aim is to design a high-precision key value data acquisition and analysis mechanism under the privacy constraint of an (epsilon, delta) -LDP model.

Disclosure of Invention

The invention aims to: the invention provides a key value data collection method based on a loose local differential privacy model, which extends the existing strict local differential privacy key value data mechanism so as to be more widely applied later.

The invention discloses a key value data collection method based on a loose local differential privacy model, which comprises the following steps:

s1, setting a server under a generalized random response mechanism, wherein the server comprises an overall privacy budget epsilon and an overall looseness delta of a setting model, and the privacy budget epsilon is distributed into the privacy budget epsilon related to key disturbance through decomposition ₁ And degree of loosening of key perturbation delta ₁ Setting privacy budget ε related to value perturbation ₂ Degree of looseness delta of sum disturbance ₂ At the same time epsilon and epsilon ₁ 、δ ₁ 、ε ₂ And delta ₂ Publishing to a user to deduce the optimal probability of key and value conversion in the element;

s2, each user locally transmits the original data<k,v>Preprocessing a data set, setting the original data as S, filling, and setting the filling length as l; for the whole mechanism, the number of all original keys is d, and the total length d' =d+l in the protocol is set; then sampling key value pairs and sampling the sampled dataAccording to->Discretizing the size of (3);

s3, the user performs key value pair according to the data in the samplingPerturbation of the key and value is performed and the perturbed data is +.>Sending the data to a server;

and S4, the server performs statistical analysis according to disturbance data sent by the user, and estimates a frequency distribution result and an average value estimation size of the original data.

In step (S1), the optimal allocation method of the privacy budget of the key-value data collection mechanism is as follows:

s11, in a key value and data collection method for loose local differential privacy under a generalized random response mechanism (GRR for short), the privacy budgets of key and value disturbance are assumed to be epsilon respectively ₁ And epsilon ₂ Degree of loosening of key perturbation delta ₁ Degree of loosening of value disturbance delta ₂ . Pr (M (x) =y). Ltoreq.e according to the definition of (ε, δ) -LDP ^ε Pr (M (x')=y), i.e. the case where the key conversion in the key pair is deduced, the probability of 1.fwdarw.1 is a, and the probability of 0.fwdarw.1 is b. The disturbance probability of a and b in the key is calculated respectively:

s12, the definition of (epsilon, delta) -LDP should be satisfied for the value disturbance stage in the key value pair, namely the probability of the median 1-1 of the key value pair isAnd the probability of 1.fwdarw.1 is 1-p.

S13, calculating the probability of the whole disturbance aiming at the key value. Data for discrete key values<k,v>Disturbance into the PCKV mechanism<k',v'>V when k.epsilon. { d+1, … d '} based on the boundary value of k and Pr (y' |S) ^* _k =0, deriving the privacy budget requires a borderline discussion. Case 1 when Privacy budgets at the time Case 2 when->Privacy budgets at the time

S14, optimizing the privacy budget to make full use of the privacy budget. The method adopts a method for evaluating the utility of a mechanism, and uses a Mean Square Error (MSE) to evaluate, namely, the smaller the mean square error is, the better the utility is. Estimation of the quantityThe mean square error of (a) can be calculated by the sum of squares of the variance and its deviation, i.e.)>According to the function analysis, need to be about>Taking the minimum value. Finally, a relation between the disturbance probability and the privacy can be obtained. The fitness conversion probability a= (l (e) ^ε -1)+2+2lδ(d'-1)/(l(e ^ε -1) +2d'); the bond transition probability b=2 (1-lδ)/(l (e) ^ε -1) +2d'). Key value pairConversion probability p= (l (e) ^ε -1)+1+lδ(2d'-1))/(l(e ^ε -1)+2+2lδ(d'-1))。

In step (S2), the method for filling and sampling the original real data of the user terminal includes the following steps:

s21, filling and sampling the original data S, setting the filling length as l, and filling l data when the user data set is empty. For the whole mechanism, the number of all original keys is d, d '=d+l is set, where d' represents the size of the filled key field. The key field K '= {1,2, the. D' }.

S22, sampling probability eta calculation is carried out on non-pseudo key value pairs, eta= |S|/[ max { |S|, l }]This parameter will be used as two randomly distributed parameters. The sampled value is preserved when the sampled key is truly data, and if a dummy key is sampled, a dummy value v will be assigned ^* =0, so as to ensure that sampling to the pseudo-key does not affect the mean value estimation.

S23, sampling the sampled dataAccording to->Is discretized. At this time, the sampled dataAccording to->Is discretized. Wherein the probability of a value of 1 is +.>The probability of conversion to-1 is

In step (S4), the process of estimating the frequency and the mean value of the data after the server receives the disturbance is as follows:

s41, after uploading the value output of all users to the server, the server counts the output 1 and-1 numbers, respectively denoted as n ₁ And n ₂ Wherein n is ₁ ＝num(y'＝<k,1>) And n ₂ ＝num(y'＝<k,-1>). Then, for n ₁ And n ₂ Calibration is performed to estimate the frequency and average of the keys K e K.

S42, for frequency estimation, a baseline estimation method is adopted and a frequency estimator is used, which is unbiased when the size of the data set of each user does not exceed l. Due to n ₁ +n ₂ Is the observed number of users possessing keys, the method has the following equivalent frequency estimators

The probability of false value-1 or 1 generated in the server is 0.5 (i.e. the expected value of false value is zero) respectively, so there is no reference meaning in the process, but the average value can be estimated by dividing the sum by the count of the real keys. The count of the real keys reported as owned may be usedTo approximate, because the sampling rate is 1/l, the probability that a real key is reported as owned is a. Thus, the corresponding average estimator +.>

The beneficial effects are that: according to the invention, privacy protection optimization is carried out locally on loose local differential privacy under a generalized random response mechanism, and optimal allocation is carried out on privacy parameters epsilon and delta through calculation so as to realize minimum variance, so that the full-sensitive data privacy protection degree is not reduced in the model, and the accuracy of frequency estimation and mean value estimation results is ensured; the invention fills and samples the data key value pairs in the collecting process, protects the relevance among homogeneous data, ensures the unbiasedness of the data, and improves the overall data utility on the basis of the original scheme.

Drawings

FIG. 1 is an overall flow chart of the present invention;

FIG. 2 is a schematic diagram of a client flow chart of an embodiment of the present invention;

fig. 3 is a schematic flow chart of a user side according to an embodiment of the present invention.

Detailed Description

The above-described aspects are further described below in conjunction with specific embodiments. It should be understood that these examples are illustrative of the present invention and are not intended to limit the scope of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without departing from the inventive concepts herein, are intended to be within the scope of the invention.

For a detailed description of the disclosed embodiments, reference will now be made to the accompanying drawings and examples. Firstly, the invention combines the problem that the frequency of key value data and the average value under each key value cannot be calculated simultaneously aiming at homoplasmic data in the existing Local Differential Privacy (LDP) application, in addition, an attacker is prevented from stealing data submitted by the user, and the attacker is prevented from possibly acquiring the data of the user from a trusted third-party server, so that the user has the risk of data privacy violation, such as data of position, shopping records, film scoring and the like. According to the invention, privacy budget allocation is carried out on the model by utilizing the loose local differential privacy data key value, then a data disturbance mechanism is carried out by optimizing privacy budget according to the defined model, and finally data is uploaded to the server, so that the safety of the original data is ensured. I.e. no data in a single user key can be obtained by an attacker, by any means.

Specifically, the loose local differential privacy key value implementation steps of the data acquisition method are as follows:

s1, setting an overall privacy budget epsilon and an overall looseness delta. By decomposition, allocation of privacy budget ε as key perturbation involvesDue to privacy budget ε ₁ And degree of loosening of key perturbation delta ₁ Setting privacy budget ε related to value perturbation ₂ Degree of looseness delta of sum disturbance ₂ At the same time epsilon and epsilon ₁ 、δ ₁ 、ε ₂ And delta ₂ The optimal probability of key and value transformations in the element will be deduced for the user.

S2, each user locally transmits the original data<k,v>Preprocessing a data set: setting the original data as S and filling, and setting the filling length as l. For the whole mechanism, the number of all original keys is d, the total length in the protocol d' =d+l is set. Sampling key value pairs, and sampling the sampled dataAccording to->Is discretized.

In step S1, the optimal allocation method of the privacy budget of the key-value data collection mechanism is as follows:

s11, for a generalized random response mechanism (GRR for short) in a key value under loose local differential privacy and data collection mechanism, the privacy budgets of key and value disturbance are assumed to be epsilon respectively ₁ And epsilon ₂ Degree of loosening of key perturbation delta ₁ Degree of loosening of value disturbance delta ₂ . Pr (M (x) =y). Ltoreq.e according to the definition of (ε, δ) -LDP ^ε Pr (M (x')=y), i.e. derivingThe probability of 1 to 1 in the key-out value pair is a, and the probability of 0 to 1 is b. The disturbance probability of a and b in the key is calculated respectively:

s12, the definition of (epsilon, delta) -LDP should be satisfied for the value disturbance stage in the key value pair, namely the probability of the median 1-1 of the key value pair is

S14, in order to fully utilize privacy budgetFor compact use, it will be optimized next. The method adopts a method for evaluating the utility of a mechanism, namely a Mean Square Error (MSE) is used, namely the smaller the mean square error is, the better the utility is. Estimation of the quantityThe mean square error of (a) can be calculated by the sum of squares of the variance and its deviation, i.e.)> According to the function analysis, need to be about>Taking the minimum value. Finally, a relation between the disturbance probability and the privacy can be obtained. The fitness conversion probability a= (l (e) ^ε -1)+2+2lδ(d'-1)/(l(e ^ε -1) +2d'); the bond transition probability b=2 (1-lδ)/(l (e) ^ε -1) +2d'). The probability of conversion p= (l (e) ^ε -1)+1+lδ(2d'-1))/(l(e ^ε -1)+2+2lδ(d′-1))。

In step S2, the method for filling and sampling the original real data of the user terminal includes the following steps:

s21, filling and sampling the original data S, setting the filling length as l, and filling l data when the user data set is empty. For the whole mechanism, the number of all keys originally is d, d '=d+l is set, where d' represents the size of the filled key field, the key field K '= {1,2, the. D' }.

S22, sampling probability eta calculation is carried out on non-pseudo key value pairs, eta= |S|/[ max { |S|, l }]This parameter is used in experimental simulations as two randomly distributed parameters. When the sampled key is actually present, the sampled value is reserved and converted according to the size of the value range, and if a pseudo key is sampled, a pseudo value v is allocated ^* =0, so as to ensure that sampling to the pseudo-key does not affect the mean value estimation.

S23, discretizing sampling data: sampling the dataAccording to->Is discretized. When a pseudo key is sampled, a pseudo value v is assigned ^* =0, thus ensuring that sampling to the pseudo-key does not affect the mean value estimation. At this time, the sampled data +.>According to->Is discretized. Wherein the probability of a value of 1 is +.>The probability of conversion to-1 is +.>

In step S4, the process of estimating the frequency and the mean value of the data by the server end after receiving the disturbed value is as follows:

s41, after uploading the value output of all users to the server, the server counts the output 1 and-1 numbers, respectively denoted as n ₁ And n ₂ Wherein n is ₁ ＝num(y'＝<k,1>) And n ₂ ＝num(y'＝<k,-1>). Then, for n ₁ And n ₂ Calibration is performed.

S42, for frequency estimation, a baseline estimation method is adopted and a frequency estimator is used, and when the size of the data set of each user does not exceed l, the estimator is unbiased. Due to n ₁ +n ₂ Is the observed number of users possessing keys, has the following equivalent frequencyRate estimator

Referring to fig. 1, the specific flow steps of the user side in the loose local differential privacy key value data acquisition method are as follows:

step one: the original data set S for possession is padded and sampled, the length of the padding is set to l, and when the user data set is empty, l data will be padded. For the entire mechanism, the number of all keys originally is d, d ' =d+l is set, key field K ' = {1,2, the. D ' }.

Step two: for the non-pseudo key value pair, sampling probability eta is calculated, eta= |S|/[ max { |S|, l } ], and the parameter is used as the parameter of the two random distribution. And executing the third step if the sampled key value actually exists; if not, executing the fourth step.

Step three: and when the sampled key is actually present, the sampled value is reserved and converted according to the size of the value range, and the fifth step is executed.

Step four: if a pseudo key is sampled, a pseudo value v will be assigned ^* And (3) the method comprises the following steps of (1) and (0) ensuring that the mean value estimation is not influenced when the pseudo key is sampled, and executing the fifth step.

Step five: sampling the dataAccording to->Is discretized. Wherein the probability of a value of 1 is +.> The probability of conversion to-1 is +.>

Step six: user will disturb the resultsAnd sending the data to a server.

With reference to fig. 2, the specific flow steps of the server side in the loose local differential privacy key value data acquisition method are as follows:

step one: the overall privacy budget epsilon and the overall looseness delta are set. By decomposition, the privacy budget epsilon is distributed as the privacy budget epsilon related to key disturbance ₁ And degree of loosening of key perturbation delta ₁ Setting privacy budget ε related to value perturbation ₂ Degree of looseness delta of sum disturbance ₂ At the same time epsilon and epsilon ₁ 、δ ₁ 、ε ₂ And delta ₂ Published to the user.

Step two: the server collects the disturbance data results sent by all n users, counts the number of 1 and-1 according to the disturbed key value data sent by the users, and respectively represents the number as n ₁ And n ₂ Wherein n is ₁ ＝num(y'＝<k,1>) And n ₂ ＝num(y'＝<k,-1>)。

Step three: calculating a frequency estimate, when the size of the data set for each user does not exceed l,the estimator is unbiased. Due to n ₁ +n ₂ Is the observed number of users with keys, the frequency estimator calculates as:

step four: estimating an average by dividing the sum by the count of real keys, the count of real keys reported as owned may be usedTo approximate, the probability that a real key is reported as owned is a. According to the probability of 0.5 of false value-1 or 1 generated in the mechanism, it is proved that the mean value estimation can be obtained when the sum of the values is not changed in statistics>

Step five: the server obtains the frequency estimation results of all keys and the average value estimation results of the corresponding values.

According to the deduction, the optimized privacy allocation of the key value data collection mechanism under the loose local differential privacy model can be obtained, and in addition, the random response scheme disturbance probability under the strict localization and the loose localization is summarized, so that the table 1 can be seen. Furthermore, the present invention evaluates the utility mechanism by means of Mean Square Error (MSE). The scientific credibility of the mechanism in the use process is improved.

TABLE 1 random response scheme perturbation probability under strictly localized and loosely localized

The following are experimental results of the present invention. The dataset used in the experiment was closing, which is a dataset on Kaggle in which scores of 105508 users for 5850 things were recorded, each score being a record, togetherThere were 192544 records. The invention constrains the record of each user score, sets the filling number l to about 10 to 10000, and specifies the range of the loosening degree delta, delta= [10 ] ^-7 ,10 ^-2 ]. 7 privacy classes were divided in the experiment, with privacy budgets of 0.1,1.0,2.0,3.0,4.0,5.0,6.0, respectively. Each user perturbs its own data locally in the experiment and sends the results to the server, becauseThe MSE is affected little, so the server calculates the mean value result MSE _mean 。

TABLE 2 influence of mean square error by parameter l

TABLE 3 influence of the mean square error by the parameter delta

TABLE 4 influence of the mean square error by the parameter ε

As can be seen from table 2, setting the filling size to be about l=100 for the data set used in the present experiment can make the overall performance better. By fixing δ at a small value close to 0 through table 3, it can be seen that the mean square error curve of the frequency is substantially the same as the curve without δ, indicating a smaller degree of looseness δ, with little impact on overall scheme performance; fixing δ at a larger value, it can be seen that the mean square error curve of frequency is smaller than that of table 3, indicating that the greater the looseness δ, the greater the impact on the overall scheme performance, and under this parameter, the greater the improvement in overall mechanism performance. As can be seen from table 4, the mean square error of the mean estimate in the optimal distribution of the privacy budget is substantially smaller than for the simple distribution. It can be derived that δ is a parameter of the degree of looseness in the loose local differential privacy, and the influence on the whole algorithm is that the stability of the algorithm performance is reduced when δ is larger.

The method and the device can protect the original data of the user from being acquired by an attacker, resist the attacker with arbitrary background knowledge and prevent privacy attack from an untrusted third party, solve the problem that the existing local differential privacy key value data acquisition method does not consider two-dimensional data, realize finer-granularity protection, reasonably protect sensitive data on the basis of ensuring the security of the privacy data, and improve the utility of estimating the privacy data.

The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. The key value data collection method based on the loose local differential privacy model is characterized by comprising the following steps of:

the server in step (S1) optimizes the privacy allocation, and deriving the optimal probability of key and value transformations in the element comprises the following steps:

s11, based on a generalized random response mechanism, the method assumes that privacy budgets of key and value disturbance are epsilon respectively ₁ And epsilon ₂ Degree of loosening of key perturbation delta ₁ Degree of loosening of value disturbance delta ₂ Pr (M (x) =y). Ltoreq.e according to the definition of (ε, δ) -LDP ^ε Pr (M (x')=y), i.e. deriving key-to-key conversion in key-value pairs: the probability of 1-1 is a, the probability of 0-1 is b, and then the disturbance probabilities of a and b in the key are calculated respectively:

s12, the definition of (epsilon, delta) -LDP should be satisfied for the value disturbance stage in the key value pair, namely the probability of the median 1-1 of the key value pair isAnd the probability of 1.fwdarw. -1 is 1-p;

s13, calculating probability of overall disturbance of the key value, and aiming at the discrete key value, calculating data<k，v>Is disturbed to be<k′，v′>V when k e { d+1,..d '} based on the boundary value of k and Pr (y' |s) ^* _k =0, deriving the privacy budget requires a boundary to be discussed;

case 1: when (when)Privacy budget +.>

Case 2: when (when)Privacy budget +.>

S14, optimizing privacy budget in a mode of evaluating mechanism effectiveness to maximize utilization rate, wherein the step uses mean square error for evaluation and estimation quantityIs calculated by the sum of squares of the variance and its deviation, i.e./> Then analyzing the required pair according to the function>Taking the minimum value, and finally obtaining a relation between the disturbance probability and the privacy;

the fitness conversion probability a= (l (e) ^ε -1)+2+2lδ(d′-1)/(l(e ^ε -1)+2d′)；

The bond transition probability b=2 (1-lδ)/(l (e) ^ε -1)+2d′)；

The probability of conversion p= (l (e) ^ε -1)+1+lδ(2d′-1))/(l(e ^ε -1)+2+2lδ(d′-1))；

S2, each user locally transmits the original data<k，v>Preprocessing a data set, setting the original data as S, filling, and setting the filling length as l; for the whole mechanism, the number of all original keys is d, and the total length d' =d+l in the protocol is set; then sampling key value pairs and sampling the sampled dataAccording to->Discretizing the size of (3);

s3, the user performs key value pair according to the data in the samplingPerforming key and value perturbation and data after perturbationSending the data to a server;

2. The loose local differential privacy model-based key value data collection method of claim 1, wherein: the preprocessing of the data set in the step (S2) further comprises the filling and sampling of the original real data by the user side, including the following procedures:

s21, filling and sampling the original data S, setting the filling length as l, and filling l data when the user data set is empty; for the whole mechanism, the number of all original keys is d, d '=d+l is set, where d' represents the size of the filled key field; a key domain K '= {1,2, …, d' };

s22, sampling probability eta calculation is carried out on non-pseudo key value pairs, eta= |S|/[ max { |S|, l }]The parameter is to be used as two randomly distributed parameters; the sampled value is preserved when the sampled key is truly data, and if a dummy key is sampled, a dummy value v will be assigned ^* =0, so as to ensure that the mean value estimation is not affected when the pseudo key is sampled;

s23, sampling the sampled dataAccording to->Discretizing the size of (3); at this time, the sampled dataAccording to->Discretizing the size of (3); wherein the probability of a value of 1 is +.>The probability of conversion to-1 is

3. The key value data collection method based on the loose local differential privacy model according to claim 1, wherein in step (S4), the process of estimating the frequency and the mean value of the data after receiving the disturbance by the server is as follows:

s41, after uploading the value output of all users to the server, the server respectively represents the number of 1 and-1 which are counted and output as n ₁ And n ₂ Wherein n is ₁ ＝num(y′＝<k，1>) And n ₂ ＝num(y′＝<k，-1>) The method comprises the steps of carrying out a first treatment on the surface of the Then, for n ₁ And n ₂ Performing calibration;

s42, for frequency estimation, adopting a baseline estimation method and using a frequency estimator, wherein when the size of a data set of each user is not more than l, the estimator is unbiased;

due to n ₁ +n ₂ Is the observed number of users with keys, so there are the following equivalent frequency estimators

S43, the probability of false value-1 or 1 generated in the server is 0.5 respectively, so that the reference meaning is not existed in the process, but the average value can be estimated by dividing the sum by the count of the real keys;

counting of real keys reported as ownedTo be similar, because the sampling rate is 1/l, the probability that a real key is reported as owned is a; thus, the corresponding average estimator

4. The loose local differential privacy model-based key value data collection method of claim 1, wherein: in the method, a user performs local-end on own original dataAnd carrying out disturbance, sending the disturbed result to a server, and carrying out aggregation calculation by the server to obtain all user data estimated values.

5. The loose local differential privacy model-based key value data collection method of claim 1, wherein: the method calculates the privacy budget by using the mean square error, wherein the estimator calculates the privacy budget by back-deriving the optimized privacy budget allocation by achieving better mechanism performanceMay be calculated by the sum of squares of the variance and its deviation; and deriving a relation between the disturbance probability and the privacy through mean square error of frequency estimation and mean value estimation.