CN114969656B - Streaming histogram release method and system for weighted sliding window under differential privacy - Google Patents
Streaming histogram release method and system for weighted sliding window under differential privacy Download PDFInfo
- Publication number
- CN114969656B CN114969656B CN202210343677.XA CN202210343677A CN114969656B CN 114969656 B CN114969656 B CN 114969656B CN 202210343677 A CN202210343677 A CN 202210343677A CN 114969656 B CN114969656 B CN 114969656B
- Authority
- CN
- China
- Prior art keywords
- interval
- sliding window
- current
- histogram
- noise
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 60
- 230000008569 process Effects 0.000 claims abstract description 18
- 238000004422 calculation algorithm Methods 0.000 claims description 79
- 230000036961 partial effect Effects 0.000 claims description 43
- 238000004364 calculation method Methods 0.000 claims description 18
- 238000012545 processing Methods 0.000 claims description 8
- 238000010276 construction Methods 0.000 claims description 7
- 238000000638 solvent extraction Methods 0.000 claims description 6
- 238000007781 pre-processing Methods 0.000 claims description 3
- 238000012163 sequencing technique Methods 0.000 claims description 2
- 230000002860 competitive effect Effects 0.000 abstract description 8
- 230000007246 mechanism Effects 0.000 abstract description 7
- 230000000875 corresponding effect Effects 0.000 description 8
- 208000027418 Wounds and injury Diseases 0.000 description 5
- 238000010835 comparative analysis Methods 0.000 description 5
- 230000006378 damage Effects 0.000 description 5
- 238000010586 diagram Methods 0.000 description 5
- 208000014674 injury Diseases 0.000 description 5
- 206010039203 Road traffic accident Diseases 0.000 description 4
- 238000004458 analytical method Methods 0.000 description 4
- 230000008901 benefit Effects 0.000 description 3
- 230000007423 decrease Effects 0.000 description 3
- 238000005516 engineering process Methods 0.000 description 3
- 230000002829 reductive effect Effects 0.000 description 3
- 230000003068 static effect Effects 0.000 description 3
- 230000002596 correlated effect Effects 0.000 description 2
- 230000006870 function Effects 0.000 description 2
- 238000012544 monitoring process Methods 0.000 description 2
- 230000035945 sensitivity Effects 0.000 description 2
- 230000006978 adaptation Effects 0.000 description 1
- 230000003044 adaptive effect Effects 0.000 description 1
- 230000002238 attenuated effect Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 230000003139 buffering effect Effects 0.000 description 1
- 230000008859 change Effects 0.000 description 1
- 238000004590 computer program Methods 0.000 description 1
- 230000001276 controlling effect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000000670 limiting effect Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 238000010223 real-time analysis Methods 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 231100000279 safety data Toxicity 0.000 description 1
- 238000012216 screening Methods 0.000 description 1
- 238000007619 statistical method Methods 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2474—Sequence data queries, e.g. querying versioned data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/50—Reducing energy consumption in communication networks in wire-line communication networks, e.g. low power modes or reduced link rate
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Mathematical Physics (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Computational Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Pure & Applied Mathematics (AREA)
- Bioethics (AREA)
- Mathematical Optimization (AREA)
- Mathematical Analysis (AREA)
- Medical Informatics (AREA)
- Primary Health Care (AREA)
- Evolutionary Biology (AREA)
- Computer Security & Cryptography (AREA)
- Life Sciences & Earth Sciences (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Epidemiology (AREA)
- Computational Linguistics (AREA)
- Fuzzy Systems (AREA)
- Public Health (AREA)
- Operations Research (AREA)
- Computer Hardware Design (AREA)
- Algebra (AREA)
- Complex Calculations (AREA)
Abstract
The invention provides a method and a system for issuing a flow type histogram of a weighted sliding window under differential privacy, and relates to the technical field of information security; firstly, a new sketch structure is created, and counting information of each histogram interval of each time instance is maintained; then, the invention provides a selective release mechanism which utilizes the difference between the interval count value and the noise value approximately estimated by each histogram interval to select better count information, and adopts greedy grouping to all interval counts in a weighted sliding window, thereby ensuring the competitive usability of query data in most cases. The invention can rapidly process the data stream based on the weighted sliding window and ensure the competitive usability of the release data under most conditions.
Description
Technical Field
The invention relates to the technical field of information security, in particular to a method and a system for issuing a stream type histogram of a weighted sliding window under differential privacy.
Background
With the rapid development of big data and internet of things (IoT), streaming data (data streams) are ubiquitous in a variety of real-world scenarios, such as real-time traffic streams and hospital patient information data streams. Histograms serve as an important statistical method for capturing streaming data distribution information, playing a vital role in real-time analysis applications in many such scenarios. However, if the histogram is directly distributed without privacy protection, sensitive personal information may be exposed to malicious adversaries, who may benefit from the data being distributed. Thus, the continuous construction of publicly-published histograms on streaming data is becoming increasingly important for data privacy-sensitive analysis applications.
In general, recent data is often of greater interest than older data because the importance of an element decreases as the element ages. Thus, consider a weighted sliding window model, which is generalized by adding a weighting factor γε [0,1 ]. Weighted sliding window models have wide application in the prior art. For example, considering a monitoring system in a certain network scenario, the most recent packets are considered more important than the older packets because they reflect the network state more accurately and more timely. Thus, weighted sliding window models are often used in the data flow of these online monitoring systems. In another application example, in the context of a social network, the most recent data is greater than the old data in building trust relationships between users.
Currently, there is much research effort on data distribution, whether static or dynamic data sets. The methods proposed in the prior literature are not applicable to weighted sliding window models where the importance of the data decays over time, as they always require buffering of all elements in each sliding window. In particular in weighted sliding window histogram publications, they suffer from two drawbacks:
(1) The existing method has no effective mechanism, and the importance of new elements is higher than that of old elements;
(2) When the existing method is used for constructing the histogram, each element contained in each sliding window needs to be scanned repeatedly, so that the calculation cost is high and the storage burden is high.
Disclosure of Invention
The invention aims to provide a method and a system for issuing a flow histogram of weighted sliding windows under differential privacy, which creatively proposes a new sketch data structure to store each weighted sliding window by modeling a data flow by using a series of weighted sliding windows; specifically, in the weighted sliding window model, the importance of the data is characterized by a weighting factor, and the weighting factor is specified by a user according to actual requirements; by adopting the new sketch data structure, the calculation cost can be remarkably reduced, the difference between the real data and the noise data can be adaptively controlled, and the competitive availability of the weighted sliding window model for inquiring the data under most conditions is ensured.
In order to achieve the above purpose, the present invention proposes the following technical scheme: a method for issuing a flow histogram of a weighted sliding window under differential privacy comprises the following steps:
determining an interval of a noise adding histogram to be issued in data flow;
determining an optimal privacy budget allocation factor alpha, and allocating the whole privacy budget epsilon as a first partial privacy budget alpha epsilon and a second partial privacy budget epsilon (1-alpha);
Constructing an approximate estimation sketch structure for each interval of the noise adding histogram to be issued to approximate estimate the data in the weighted sliding window, and obtaining interval count values of all the intervals;
adding a first privacy budget alpha epsilon to the noise adding histogram to be issued, and respectively obtaining a first partial privacy budget noise value of each interval;
determining the approximate statistical frequency of each interval according to the interval count value of any interval of the noise adding histogram to be issued and the corresponding first partial privacy budget noise value, and constructing the approximate statistical histogram of the current moment according to the approximate statistical frequency of each interval;
Clustering the intervals sequenced by the approximate statistical histogram of the current moment according to a greedy clustering algorithm by adopting a second partial privacy budget epsilon (1-alpha) to obtain a noise adding histogram of the sliding window of the current moment;
and issuing a noise adding histogram of the sliding window at the current moment.
Further, the method for determining the privacy budget allocation factor alpha comprises the following steps:
Preprocessing a small sample, and setting different values for the privacy budget allocation factor alpha;
Respectively calculating the mean square error between the noise adding value and the true value of each interval of the noise adding histogram to be issued under each value;
And selecting the privacy budget allocation factor alpha corresponding to the minimum mean square error as the optimal privacy budget allocation factor.
Further, the process of constructing an approximate estimation sketch structure to approximate estimate the data in the weighted sliding window and obtain the interval count value is as follows:
Judging whether the current data entering the weighted sliding window data stream accords with a noise adding histogram judgment statistical interval to be issued or not; if the current data meets the current interval range, setting the current data to be 1; if the current data does not meet the current interval range, the current data is set to 0;
and partitioning the weighted sliding window, and calculating an approximation error threshold value theta of the weighted sliding window and a weighted count value y of the current timestamp, wherein the calculation formulas are respectively as follows: y=y×γ +x t; wherein W represents the size of the sliding window, gamma represents the weighting factor, beta represents the approximation error factor, S i [ t-W+1, t ] represents the sliding window of the ith binary data stream, x t represents an element instance in the sliding window at the current time S i [ t-W+1, t ];
Judging whether the data stream S i [ t-W+1, t ] has an expired data element on the current timestamp;
calculating the interval count value of the current timestamp When the data stream S i [ t-W+1, t ] generates an expired data element at the current timestamp, thenWhen the data stream S i [ t-W+1, t ] has no expired data elements at the current timestamp, thenWherein B represents a two-dimensional array for storing weighted sliding window count information, wherein B 1 (i) represents a first sub-array, records a timestamp of a current element in the data stream, and B 2 (i) represents a second sub-array, stores a section count value of an ith binary data stream S i [ t-W+1, t ]Old (B 1 (i)) represents the first child array of the oldest block, old (B 2 (i)) represents the second child array of the oldest block;
Judging the interval count value of the current time stamp Is to count the interval valueThe result of less than 0 is set to 0 to obtain all interval count values in the current weighted sliding window
Further, the process of determining the approximate statistical frequency of each interval according to the interval count value of any interval of the noise adding histogram to be issued and the corresponding first partial privacy budget noise value includes:
judging the magnitude of the approximation error and the noise threshold value, and determining an approximation statistics frequency; wherein the approximation error is the difference between the interval count value and the true value, and the noise threshold represents the maximum difference between the first partial privacy budget noise value and the true value
When the approximate error threshold is smaller than the noise threshold, the approximate statistical frequency is the interval count valueWhen the noise threshold is smaller than the approximation error, the approximation statistics frequency is the noise adding value, i.eWherein G i represents the true value of the noisy histogram calculation interval to be issued;
judging the magnitude of the approximate statistical frequency, setting the result of the approximate statistical frequency smaller than 0 to be 0, and obtaining the approximate statistical frequency of all the intervals in the current weighted sliding window.
Further, the clustering process of the interval sequenced by the approximate statistical histogram of the current moment according to the greedy clustering algorithm by adopting the second partial privacy budget epsilon (1-alpha) comprises the following steps:
Approximate statistical frequency of all intervals in weighted sliding window according to ordering algorithm Sequencing, and calculating the combined error and the uncombined error between each frequency number and other frequency numbers;
Judging the interval group, comprising: when the combination error of the two sections is smaller than the non-combination error, combining the two sections and setting the two sections as a group; when the combination error of the two sections is larger than the non-combination error, the two sections are not combined and are set into two groups; the merging error is the result of carrying out square difference on the average value of each interval and the current value of each interval; the non-merging error is the result of performing square deviation on the average value of each interval and the existing value of each interval;
regrouping all intervals in the current weighted sliding window;
replacing the approximate statistical frequency value of each packet by using the average value of each packet, and determining a second partial privacy budget of each interval by the proportion of the approximate statistical frequency value in each packet to the total number of the data statistical values in the whole weighted sliding window so as to add Laplace noise to each packet interval;
calculating the approximate statistical frequency of each group after adding the Laplace noise, judging the size of the approximate statistical frequency, and setting the result of the approximate statistical frequency smaller than 0 to be 0;
And obtaining a noise adding histogram of the sliding window at the current moment.
Further, the specific process of the interval grouping is as follows:
The first interval of the ordered approximate statistical histogram is self-organized, and the noise error introduced after the second interval is added into the current grouping and the noise error introduced by the new grouping which is reestablished by taking the second interval as a starting point are judged; if the noise error introduced by adding the current packet is smaller than that of establishing a new packet, adding the current packet, otherwise, establishing the new packet; until all the bins of the noise plus histogram are noisy.
The invention further discloses a flow histogram release system of a weighted sliding window under differential privacy, which comprises the following program modules:
the first determining module is used for determining an interval of a noise adding histogram to be issued in data in a data stream;
A second determining module, configured to determine a privacy budget allocation factor α, and allocate the entire privacy budget epsilon as a first partial privacy budget epsilon and a second partial privacy budget epsilon (1-alpha);
the construction module is used for constructing an approximate estimation sketch structure for each interval of the noise-added histogram to be issued and carrying out approximate estimation on the data in the weighted sliding window to obtain interval count values of all the intervals;
The first processing module is used for adding a first privacy budget alpha epsilon to the noise adding histogram to be issued and respectively obtaining a first partial privacy budget noise value of each interval;
the third determining module is used for determining the approximate statistical frequency of each interval according to the interval count value of any interval of the noise adding histogram to be issued and the corresponding first partial privacy budget noise value, and constructing the approximate statistical histogram of the current moment according to the approximate statistical frequency of each interval;
The second processing module is used for clustering the intervals sequenced by the approximate statistical histogram of the current moment according to a greedy clustering algorithm by adopting a second partial privacy budget epsilon (1-alpha) to obtain a noise adding histogram of the sliding window of the current moment;
And the release module is used for releasing the noise adding histogram of the sliding window at the current moment.
Further, the construction module performs approximate estimation on the data in the weighted sliding window by constructing an approximate estimation sketch structure, and the process of obtaining the interval count value is as follows:
Judging whether the current data entering the weighted sliding window data stream accords with a noise adding histogram judgment statistical interval to be issued or not; if the current data meets the current interval range, setting the current data to be 1; if the current data does not meet the current interval range, the current data is set to 0;
and partitioning the weighted sliding window, and calculating an approximation error threshold value theta of the weighted sliding window and a weighted count value y of the current timestamp, wherein the calculation formulas are respectively as follows: y=y×γ +x t; wherein W represents the size of the sliding window, gamma represents the weighting factor, beta represents the approximation error factor, S i [ t-W+1, t ] represents the sliding window of the ith binary data stream, x t represents an element instance in the sliding window at the current time S i [ t-W+1, t ];
Judging whether the data stream S i [ t-W+1, t ] has an expired data element on the current timestamp;
calculating the interval count value of the current timestamp When the data stream S i [ t-W+1, t ] generates an expired data element at the current timestamp, thenWhen the data stream S i [ t-W+1, t ] has no expired data elements at the current timestamp, thenWherein B represents a two-dimensional array for storing weighted sliding window count information, wherein B 1 (i) represents a first sub-array, records a timestamp of a current element in the data stream, and B 2 (i) represents a second sub-array, stores a section count value of an ith binary data stream S i [ t-W+1, t ]Old (B 1 (i)) represents the first child array of the oldest block, old (B 2 (i)) represents the second child array of the oldest block;
Judging the interval count value of the current time stamp Is to count the interval valueThe result of less than 0 is set to 0 to obtain all interval count values in the current weighted sliding window
Further, the process of determining the approximate statistical frequency of each interval by the third determining module is as follows:
judging the magnitude of the approximation error and the noise threshold value, and determining an approximation statistics frequency; wherein the approximation error is the difference between the interval count value and the true value, and the noise threshold represents the maximum difference between the first partial privacy budget noise value and the true value
When the approximate error threshold is smaller than the noise threshold, the approximate statistical frequency is the interval count valueWhen the noise threshold is smaller than the approximation error, the approximation statistics frequency is the noise adding value, i.eWherein G i represents the true value of the noisy histogram calculation interval to be issued;
judging the magnitude of the approximate statistical frequency, setting the result of the approximate statistical frequency smaller than 0 to be 0, and obtaining the approximate statistical frequency of all the intervals in the current weighted sliding window.
The invention also discloses an electronic device, which comprises a memory and a processor, wherein the memory is used for storing at least one instruction, and the processor is used for executing the at least one instruction to realize the streaming histogram release method of the weighted sliding window under the differential privacy.
According to the technical scheme, the following beneficial effects are achieved:
(1) The invention discloses a flow histogram release method and a system of a differential privacy weighting sliding window, which create a new sketch structure which is marked as AESketch so as to maintain the counting information of each histogram interval of each time instance; the sketch structure AESketch can be used in places with severe hardware conditions by selectively compressing the count information of elements in the data stream D and the time stamps of the elements into a two-dimensional array to rapidly process the data stream data in the weighted sliding window model, so that the running time is effectively shortened, the memory consumption can be reduced, the hardware conditions of the software are reduced, and the data stream data are used in places with severe hardware conditions.
(2) The invention discloses a streaming histogram release method and a system of a differential privacy weighting sliding window, and provides a selective release mechanism; the mechanism uses the difference between the approximate statistics and noise values for each histogram interval to select better count information and employs greedy groupings for all interval counts in a weighted sliding window. Therefore, the invention not only can obviously reduce the calculation cost, but also ensures that the weighted sliding window model has competitive usability for inquiring data under most conditions by controlling the privacy allocation budget allocation factor alpha to adaptively control the difference between real data and noise data.
It should be understood that all combinations of the foregoing concepts, as well as additional concepts described in more detail below, may be considered a part of the inventive subject matter of the present disclosure as long as such concepts are not mutually inconsistent.
The foregoing and other aspects, embodiments, and features of the present teachings will be more fully understood from the following description, taken together with the accompanying drawings. Other additional aspects of the invention, such as features and/or advantages of the exemplary embodiments, will be apparent from the description which follows, or may be learned by practice of the embodiments according to the teachings of the invention.
Drawings
The drawings are not intended to be drawn to scale. In the drawings, each identical or nearly identical component that is illustrated in various figures may be represented by a like numeral. For purposes of clarity, not every component may be labeled in every drawing. Embodiments of various aspects of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of a weighted sliding window of the present invention;
FIG. 2 is a schematic diagram of the privacy budget allocation results of a high-efficiency streaming histogram distribution algorithm;
FIG. 3 (a) is a block diagram illustrating the overall run time of example 1 for a British car accident dataset at different window sizes with a privacy budget of 0.5;
FIG. 3 (b) is the total run time of example 1 for a British car accident dataset at different window sizes with a privacy budget of 1;
FIG. 3 (c) is a block diagram illustrating the overall run time of the British traffic accident data set at different window sizes for example 1 with a privacy budget of 1.5;
FIG. 4 (a) is the accuracy of example 1 for British traffic accident data sets at different window sizes with a privacy budget of 0.5;
FIG. 4 (b) is the accuracy of example 1 for British traffic accident data sets at different window sizes with a privacy budget of 1;
FIG. 4 (c) is the accuracy of the British traffic accident data set at different window sizes for example 1 with a privacy budget of 1.5;
FIG. 5 (a) is the accuracy of the New York yellow taxi trip dataset at different window sizes for example 2 at a privacy budget of 0.5;
FIG. 5 (b) is the accuracy of the New York yellow taxi trip dataset at different window sizes for privacy budget 1 of example 2;
FIG. 5 (c) is the accuracy of the New York yellow taxi trip dataset at different window sizes for privacy budget of 1.5 of example 2;
FIG. 6 (a) is the accuracy of example 1 in selecting different weighting factors for British car accident datasets at a privacy budget of 0.5;
FIG. 6 (b) is the accuracy of example 1 in selecting different weighting factors for British car accident datasets with a privacy budget of 1;
FIG. 6 (c) is the accuracy of example 1 in selecting different weighting factors for British car accident datasets at a privacy budget of 1.5;
FIG. 7 (a) is the accuracy of example 2 in selecting different weighting factors for the New York yellow taxi trip dataset at a privacy budget of 0.5;
FIG. 7 (b) is the accuracy of example 2 in selecting different weighting factors for the New York yellow taxi trip dataset when the privacy budget is 1;
FIG. 7 (c) is the accuracy of example 2 in selecting different weighting factors for the New York yellow taxi trip dataset at a privacy budget of 1.5;
FIG. 8 (a) is an illustration of the accuracy of example 1 in selecting different approximation error factors for a British car accident dataset at a privacy budget of 0.5;
FIG. 8 (b) is the accuracy of example 1 in selecting different approximation error factors for a British car accident dataset when the privacy budget is 1;
FIG. 8 (c) is the accuracy of example 1 in selecting different approximation error factors for a British car accident dataset at a privacy budget of 1.5;
FIG. 9 (a) is the accuracy of example 2 in selecting different approximation error factors for the New York yellow taxi trip dataset at a privacy budget of 0.5;
FIG. 9 (b) is the accuracy of example 2 in selecting different approximation error factors for the New York yellow taxi trip dataset when the privacy budget is 1;
FIG. 9 (c) is the accuracy of example 2 in selecting different approximation error factors for the New York yellow taxi trip dataset at a privacy budget of 1.5;
FIG. 10 is a flow chart of a method for streaming histogram distribution of a differential privacy-weighted sliding window in accordance with the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention more clear, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. It will be apparent that the described embodiments are some, but not all, embodiments of the invention. All other embodiments, which can be made by a person skilled in the art without creative efforts, based on the described embodiments of the present invention fall within the protection scope of the present invention. Unless defined otherwise, technical or scientific terms used herein should be given the ordinary meaning as understood by one of ordinary skill in the art to which this invention belongs.
The terms "first," "second," and the like in the description and in the claims, are not used for any order, quantity, or importance, but are used for distinguishing between different elements. Also, unless the context clearly indicates otherwise, singular forms "a," "an," or "the" and similar terms do not denote a limitation of quantity, but rather denote the presence of at least one. The terms "comprises," "comprising," or the like are intended to cover a feature, integer, step, operation, element, and/or component recited as being present in the element or article that "comprises" or "comprising" does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
In the prior art, although there are many methods for issuing histograms based on sliding window in a data stream environment, these methods cannot be directly applied to data with time-importance, cannot protect weighted data, and face the problem of high issue error of histograms. In view of this, the present embodiment provides a method and a system for issuing a streaming histogram of a differential privacy-weighted sliding window, which integrates the sliding window and the differential privacy technology to further optimize the histogram issuing technology under the differential privacy condition, and aims at solving the problem that the importance of elements in a data stream decreases with time, by modeling the data stream by using a series of weighted sliding windows, and researching how to issue histograms continuously on the windows.
The invention performs single scanning on a given data stream and performs statistical distribution by adopting a weighted sliding window model. The invention adopts the high-efficiency flow histogram release (ESHP) algorithm, firstly creates an approximate statistical sketch structure AESketch, and can store the key statistical information of each element in the data flow into a sketch data structure of a high-efficiency memory; then, a selective publishing mechanism is used to publish, which utilizes the difference between the approximate statistical information and the noise value of each histogram interval to select better counting information, and greedy grouping is carried out on all interval counts in a weighted sliding window, so that the competitive usability of query data in most cases is ensured.
Given a binary data stream D, D = { x 1,x2,···,xn, ·· · a sliding window of size W, a weighting factor γ, γ e [0,1], a timestamp t, t >0, weighted sliding window referring to the newer W elements in data stream D; the weighted importance of each element in the weighted sliding window is controlled by the time stamp and gamma, and the weighted importance of the weighted sliding window at the time of the time stamp t consists of the following two parts:
(a) The sliding window at time t of the time stamp is
(B) A weight array of < gamma W-1,γW-2,···,γ0 > assigned to each element in the sliding windowAssociated with a different weight I (x i)=γt-i I, I representing that element x i is located in a sliding window)An i-th position within.
A weighted sliding window schematic is shown in fig. 1. From the definition of the weighted sliding window, the latest elementThe most heavy, 1, most recent second elementWith the second greatest weight, γ, and so on.
In combination with the data stream D, the method for issuing the streaming histogram of the differential privacy-weighted sliding window disclosed by the invention, as shown in fig. 10, specifically comprises the following steps:
(1) Determining an interval of a noise adding histogram to be issued in data flow;
(2) Determining an optimal privacy budget allocation factor alpha by utilizing a small sample preprocessing mode, dividing the whole privacy budget epsilon into two parts, wherein the first part of privacy budget epsilon is marked as epsilon 1 and used for a selective release mechanism, and the second part of privacy budget epsilon (1-alpha) is marked as epsilon 2 and used for a greedy clustering part;
the optimal privacy budget allocation factor alpha calculation process is as follows: the availability of the existing data is controlled by adjusting the privacy budget allocation factor alpha to find out the optimal value, and the mean square error between the noise adding value and the true value of each interval of the noise adding histogram to be issued is calculated respectively due to the mode that the issued data has the highest availability, namely, different values are set through the privacy budget allocation factor alpha; selecting a privacy budget allocation factor alpha with the value of the privacy budget allocation factor corresponding to the minimum mean square error as the optimal privacy budget allocation factor alpha; the mean square error is calculated as the sum of squares of the absolute values of the differences between the true value and the noise added value.
(3) Constructing an approximate estimation sketch structure for each interval of the noise adding histogram to be issued to approximate estimate the data in the weighted sliding window, and obtaining interval count values of all the intervals; the method comprises the following steps:
Judging whether the current data entering the weighted sliding window data stream accords with a noise adding histogram judgment statistical interval to be issued or not; if the current data meets the current interval range, setting the current data to be 1; if the current data does not meet the current interval range, the current data is set to 0;
and partitioning the weighted sliding window, and calculating an approximation error threshold value theta of the weighted sliding window and a weighted count value y of the current timestamp, wherein the calculation formulas are respectively as follows: y=y×γ +x t; wherein W represents the size of the sliding window, gamma represents the weighting factor, beta represents the approximation error factor, S i [ t-W+1, t ] represents the sliding window of the ith binary data stream, x t represents an element instance in the sliding window at the current time S i [ t-W+1, t ];
Judging whether the data stream S i [ t-W+1, t ] has an expired data element on the current timestamp;
calculating the interval count value of the current timestamp When the data stream S i [ t-W+1, t ] generates an expired data element at the current timestamp, thenWhen the data stream S i [ t-W+1, t ] has no expired data elements at the current timestamp, thenWherein B represents a two-dimensional array for storing weighted sliding window count information, wherein B 1 (i) represents a first sub-array, records a timestamp of a current element in the data stream, and B 2 (i) represents a second sub-array, stores a section count value of an ith binary data stream S i [ t-W+1, t ]Old (B 1 (i)) represents the first child array of the oldest block, old (B 2 (i)) represents the second child array of the oldest block;
Judging the interval count value of the current time stamp Is to count the interval valueThe result of less than 0 is set to 0 to obtain all interval count values in the current weighted sliding window
(4) Adding a first privacy budget alpha epsilon to the noise adding histogram to be issued, and respectively obtaining a first partial privacy budget noise value of each interval;
(5) Determining the approximate statistical frequency of each interval according to the interval count value of any interval of the noise adding histogram to be issued and the corresponding first partial privacy budget noise value, and constructing the approximate statistical histogram of the current moment according to the approximate statistical frequency of each interval;
That is, a selective publishing mechanism is proposed that utilizes the difference between the bin count value of each histogram bin and the first partial privacy budget noise value to select a better publishing value; specifically, the step of comparing the difference between the interval count value and the true value to select a suitable approximate statistical frequency number includes the steps of:
Judging whether the weighted count value of the weighted sliding window in the current time stamp exceeds the approximate error threshold of the weighted sliding window, namely judging the size of the approximate error and the noise threshold, and determining the approximate statistical frequency; wherein the approximation error is the difference between the interval count value and the true value, and the noise threshold represents the maximum difference between the first partial privacy budget noise value and the true value
When the approximate error threshold is smaller than the noise threshold, the approximate statistical frequency is the interval count valueWhen the noise threshold is smaller than the approximation error, the approximation statistics frequency is the noise adding value, i.eWherein G i represents the true value of the noisy histogram calculation interval to be issued;
judging the magnitude of the approximate statistical frequency, setting the result of the approximate statistical frequency smaller than 0 to be 0, and obtaining the approximate statistical frequency of all the intervals in the current weighted sliding window.
(6) Clustering the sequenced intervals of the current time approximate statistical histogram by using a second partial privacy budget epsilon (1-alpha), and optimizing errors of data in the group to obtain a noise adding histogram of a current time sliding window;
The process comprises the following steps:
inserting the approximate statistical frequency of each interval of the current sliding window after the screening in the step (5) into the current histogram statistical data;
Sorting the approximate statistical frequency numbers of all the intervals in the weighted sliding window according to a sorting algorithm, and calculating the combined error and the uncombined error between each frequency number and other frequency numbers;
Judging the interval group, comprising: when the combination error of the two sections is smaller than the non-combination error, combining the two sections and setting the two sections as a group; when the combination error of the two sections is larger than the non-combination error, the two sections are not combined and are set into two groups; the combined error is the result of combining interval and interval, and the average value of each interval and the current value of each interval are subjected to square deviation; the error which is not combined is the result of performing square deviation on the average value of each interval and the existing value of each interval;
regrouping all intervals in the current weighted sliding window;
replacing the approximate statistical frequency value of each packet by using the average value of each packet, and determining a second partial privacy budget of each interval by the proportion of the approximate statistical frequency value in each packet to the total number of the data statistical values in the whole weighted sliding window so as to add Laplace noise to each packet interval;
Calculating the approximate statistical frequency of each group after adding the Laplace noise, judging the size of the approximate statistical frequency, and setting the result of the approximate statistical frequency smaller than 0 to be 0; reducing errors generated by approximation statistics;
And obtaining a noise adding histogram of the sliding window at the current moment.
(7) And issuing a noise adding histogram of the sliding window at the current moment.
The efficient streaming histogram distribution (ESHP) algorithm proposed by the invention satisfies epsilon-differential privacy and is implemented in the following two points.
(1) The ESHP algorithm provided by the invention selects a proper value by comparing the difference between the estimated value and the noise value of the self-adaptive selection part (the self-adaptive selection part meets epsilon 1 -differential privacy);
(2) The ESHP algorithm employs a greedy grouping strategy, adding Laplacian noise to the grouping portion (so the grouping portion satisfies ε 2 -differential privacy).
Based on the two facts described above and epsilon=epsilon 1+ε2, the conclusion that ESHP algorithm satisfies epsilon-differential privacy is drawn from the combined nature of differential privacy.
Secondly, when considering the time complexity and the space complexity of ESHP algorithm and analyzing the calculation cost and the running time, the following preconditions are considered first.
The conditions are as follows: the block number k in the two-dimensional array B used by the approximate estimation sketch structure algorithm adopted by the invention is not more than
The reason is the fact that: (1) By weighting the nature of the sliding windows, all windows are all 1, thus obtaining a maximum count value of(2) Block threshold for each block in the approximate estimation sketch structure algorithm: from (1) and (2), the maximum number of blocks used in the two-dimensional array B is
The time complexity analysis of the ESHP algorithm of the present invention is as follows:
assuming that the number of histogram bins issued in ESHP algorithm is M, the time overhead of ESHP algorithm is
AESketch requires time complexity based on the preconditions of the application of the approximate estimated sketch structure algorithmIs performed in a weighted sliding window. When performing a query on the histograms of M bins, AESketch has a temporal complexity ofSince ESHP algorithm uses greedy grouping strategy, the M interval counts need to be ordered, so the time overhead of greedy grouping is O (Mlog (M)); based on this, the time overhead of ESHP algorithm is
The spatial complexity analysis of the ESHP algorithm of the present invention is as follows:
assuming that the number of histogram bins in ESHP algorithm is M, ESHP algorithm requires Memory of bits.
The space cost of the ESHP algorithm is primarily determined by the data structure formed by AESketch and the data structure formed by the greedy packet portion. From the approximate estimated sketch structure algorithm, the known binary array B needs toBits, therefore, needThe bit memory processes M binary data streams. In the ordering process, the greedy grouping part occupies the main space memory, and the histogram interval number represents the memory usage size M. Therefore, the memory space used by ESHP algorithm is
The invention discloses a method and a system for issuing a flow histogram of a differential privacy weighting sliding window, which are further specifically introduced by combining a drawing and a specific embodiment.
Example 1
A uk car accident data set for extracting detailed road safety data of the condition of the uk personal injury road accident from 2005 to 2014, each record comprising the attributes of injury age, sex, injury severity, injury social class, injury type, etc., wherein the age interval is [0, 100];
example 2
The new york yellow taxi trip data set is extracted, and a detailed trip data set related to the new york yellow taxi of 2019 is extracted, wherein each record comprises the attributes of the pick-up date/time, the pick-up location, the trip distance, the rate type, the passenger count reported by a driver and the like, the passenger count interval is [0,6], and histogram release is carried out along with time.
The data of example 1 and example 2 were processed and histogram distributed using ESHP algorithm, FAST algorithm, rtp_dmm algorithm and APB algorithm, respectively.
In example comparative analysis of privacy budget allocation in connection with the ESHP algorithm shown in FIG. 2, the results indicate that: in the step (2), as the privacy budget allocation factor alpha increases, that is, the first partial privacy budget epsilon 1 increases, the mean square error of the published data decreases; the reason is that the ESHP algorithm uses the difference between the approximate statistics and the noise value to select the appropriate value, resulting in a reduction of ESHP total error.
As shown in connection with fig. 3 (a) to 3 (c), for comparative analysis of the total run time of example 1 at different weighted sliding window sizes, the results show that: the total running time of the FAST, the APB and the RTP_DMM algorithm is increased along with the increase of a sliding window, and the increase of the sliding window has little influence on the total running time of ESHP algorithms; the reason is that the pretreatment time of ESHP algorithm isThe size of the sliding window affects the spatial complexity of the ESHP algorithm, not the total run time. The RTP_DMM algorithm uses a binary tree, the required processing time is O (log 2 W), and when the histograms in the window are required to be issued, the APB algorithm and the FAST algorithm all take the most time to buffer all the data in the cheering window and have a logarithmic linear relation with the size of the sliding window.
In comparative analysis of the accuracy at different sliding window sizes in connection with example 1 and example 2 shown in fig. 4 (a) to 5 (c), the results show that the ESHP algorithm has the smallest mean square error among the four algorithms as the weighted sliding window size increases; the FAST algorithm requires the addition of noise to each data, its mean square error is relatively high; the RTP_DMM algorithm needs to add noise to the statistical information in the weighted sliding window, so that the mean square error of the RTP_DMM algorithm is smaller than that of the FAST algorithm; due to the increase in size of the sliding window in the ESHP algorithm, the approximate statistical error in the sliding window tends to be constant, resulting in a total error that tends to be constant. The ESHP algorithm adopts a self-adaptive noise adding method, so that huge errors caused by directly adding noise in a sliding window are avoided. When the weighted sliding window is smaller, the mean square error of the APB algorithm is smaller than ESHP algorithm; as the window increases, the mean square error of ESHP algorithm is smaller than APB algorithm, FAST algorithm and rtp_dmm algorithm. In summary, the ESHP algorithm ensures competitive availability of published data in most cases.
In example comparative analysis combining accuracy at different weighting factors in fig. 6 (a) to 7 (c), the results show that: the mean square error of ESHP algorithm and APB algorithm are positively correlated with the weighting coefficients, and the mean square error of FAST algorithm and rtp_dmm algorithm are a constant value, independent of the weighting factors. When the weighting factor is small, the mean square error of ESHP algorithm is smaller than that of FAST algorithm, rtp_dmm algorithm and APB algorithm. The reason for this is that an increase in the weighting factor changes the global sensitivity, affecting the increase in noise. As the weighting factor increases, the ESHP algorithm publishes data to ensure competitive availability in most cases.
In a comparative analysis showing the accuracy of example 1 and example 2 at different approximation error factors, the results are shown in connection with fig. 8 (a) to 9 (c): the query errors of ESHP and APB algorithms are positively correlated with the approximation error factor, while the mean square error of FAST algorithm and rtp_dmm algorithm is a constant value, independent of the approximation error factor. When the approximation error coefficient is small, the mean square error of ESHP algorithm is significantly lower than that of APB algorithm. The reasons for this are two ways: (1) An increase in the approximation error factor changes the global sensitivity and increases in the laplace noise, resulting in an increase in the overall error of the histogram; (2) The ESHP algorithm adopts an adaptive selection algorithm, so that the problem that the total error of the histogram is increased due to the fact that Laplacian noise is directly added into a weighted sliding window is avoided. The mean square error of the rtp_dmm algorithm is independent of the approximation error factor, resulting in the same error. As the approximation error factor increases, the error of ESHP algorithm will be larger than that of rtp_dmm algorithm, so the fact shows that ESHP algorithm is very competitive in most cases.
The method for issuing the flow histogram of the weighted sliding window under the differential privacy disclosed by the embodiment of the invention can be used for rapidly issuing data with different importance in a data flow environment and is suitable for a weighted sliding window model. According to the method, the histogram data in the data stream environment can be rapidly released under the condition of low space cost, and theoretical analysis shows that the method can effectively improve the availability of the data and reduce the running time while reducing the privacy budget. Experimental results on two different real data sets of example 1 and example 2 also demonstrate that the present invention is significantly superior to existing methods in terms of data availability and run time.
In an embodiment of the present invention, there is also provided an electronic device including a processor and a memory for storing at least one instruction; when the processor executes the at least one instruction, the streaming histogram issuing method of the weighted sliding window under differential privacy disclosed in the above embodiment is executed.
The above-described streaming histogram distribution method of differential privacy-weighted sliding windows may be run in a processor or may also be stored in a computer-readable storage medium, including permanent and non-permanent, removable and non-removable media, which may be implemented by any method or technique. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of storage media include, but are not limited to, phase change memory (PRAM), static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), read Only Memory (ROM), electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, read only compact disc read only memory (CD-ROM), digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information that can be accessed by a computing device. Storage media, as defined herein, does not include transitory computer readable media such as modulated data signals and carrier waves.
These computer programs may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart block or blocks and/or block diagram block or blocks, and corresponding steps may be implemented in different modules.
Illustratively, in this embodiment, there is provided a system for streaming histogram distribution of weighted sliding windows under differential privacy, where the system includes the following program modules: the first determining module is used for determining an interval of a noise adding histogram to be issued in data in a data stream; a second determining module, configured to determine a privacy budget allocation factor α, and allocate the entire privacy budget epsilon as a first partial privacy budget epsilon and a second partial privacy budget epsilon (1-alpha); the construction module is used for constructing an approximate estimation sketch structure for each interval of the noise-added histogram to be issued and carrying out approximate estimation on the data in the weighted sliding window to obtain interval count values of all the intervals; the first processing module is used for adding a first privacy budget alpha epsilon to the noise adding histogram to be issued and respectively obtaining a first partial privacy budget noise value of each interval; the third determining module is used for determining the approximate statistical frequency of each interval according to the interval count value of any interval of the noise adding histogram to be issued and the corresponding first partial privacy budget noise value, and constructing the approximate statistical histogram of the current moment according to the approximate statistical frequency of each interval; the second processing module is used for clustering the intervals sequenced by the approximate statistical histogram of the current moment according to a greedy clustering algorithm by adopting a second partial privacy budget epsilon (1-alpha) to obtain a noise adding histogram of the sliding window of the current moment; and the release module is used for releasing the noise adding histogram of the sliding window at the current moment.
The system is used for implementing the functions of the method in the above embodiment, and each module in the system corresponds to each step in the method, which has been described in the method, and will not be described herein.
For example, the construction module performs approximate estimation on the data in the weighted sliding window by constructing an approximate estimation sketch structure, and the process of obtaining the interval count value is as follows:
Judging whether the current data entering the weighted sliding window data stream accords with a noise adding histogram judgment statistical interval to be issued or not; if the current data meets the current interval range, setting the current data to be 1; if the current data does not meet the current interval range, the current data is set to 0;
and partitioning the weighted sliding window, and calculating an approximation error threshold value theta of the weighted sliding window and a weighted count value y of the current timestamp, wherein the calculation formulas are respectively as follows: y=y×γ +x t; wherein W represents the size of the sliding window, gamma represents the weighting factor, beta represents the approximation error factor, S i [ t-W+1, t ] represents the sliding window of the ith binary data stream, x t represents an element instance in the sliding window at the current time S i [ t-W+1, t ];
Judging whether the data stream S i [ t-W+1, t ] has an expired data element on the current timestamp;
calculating the interval count value of the current timestamp When the data stream S i [ t-W+1, t ] generates an expired data element at the current timestamp, thenWhen the data stream S i [ t-W+1, t ] has no expired data elements at the current timestamp, thenWherein B represents a two-dimensional array for storing weighted sliding window count information, wherein B 1 (i) represents a first sub-array, records a timestamp of a current element in the data stream, and B 2 (i) represents a second sub-array, stores count information for an ith binary data stream S i [ t-W+1, t ]Old (B 1 (i)) represents the first child array of the oldest block, old (B 2 (i)) represents the second child array of the oldest block;
Judging the interval count value of the current time stamp Is to count the interval valueThe result of less than 0 is set to 0 to obtain all interval count values in the current weighted sliding window
For another example, the third determining module determines the approximate statistical frequency of each interval as follows:
judging the magnitude of the approximation error and the noise threshold value, and determining an approximation statistics frequency; wherein the approximation error is the difference between the interval count value and the true value, and the noise threshold represents the maximum difference between the first partial privacy budget noise value and the true value
When the approximate error threshold is smaller than the noise threshold, the approximate statistical frequency is the interval count valueWhen the noise threshold is smaller than the approximation error, the approximation statistics frequency is the noise adding value, i.eWherein G i represents the true value of the noisy histogram calculation interval to be issued;
judging the magnitude of the approximate statistical frequency, setting the result of the approximate statistical frequency smaller than 0 to be 0, and obtaining the approximate statistical frequency of all the intervals in the current weighted sliding window.
The method for publishing the static data set or the dynamic data stream provided by the prior art is not suitable for a weighted sliding window model with the data importance attenuated along with time, the importance degree of new and old elements cannot be considered, and the problems of high calculation cost and high storage burden of the existing method are solved, so that histogram data in a data stream environment can be published rapidly under the condition of low space cost, the availability of the data is improved effectively, and the running time is shortened.
While the invention has been described with reference to preferred embodiments, it is not intended to be limiting. Those skilled in the art will appreciate that various modifications and adaptations can be made without departing from the spirit and scope of the present invention. Accordingly, the scope of the invention is defined by the appended claims.
Claims (8)
1. A method for streaming histogram distribution of weighted sliding windows under differential privacy, comprising:
determining an interval of a noise adding histogram to be issued in data flow;
determining an optimal privacy budget allocation factor alpha, and allocating the whole privacy budget epsilon as a first partial privacy budget alpha epsilon and a second partial privacy budget epsilon (1-alpha);
Constructing an approximate estimation sketch structure for each interval of the noise adding histogram to be issued to approximate estimate the data in the weighted sliding window, and obtaining interval count values of all the intervals;
adding a first privacy budget alpha epsilon to the noise adding histogram to be issued, and respectively obtaining a first partial privacy budget noise value of each interval;
determining the approximate statistical frequency of each interval according to the interval count value of any interval of the noise adding histogram to be issued and the corresponding first partial privacy budget noise value, and constructing the approximate statistical histogram of the current moment according to the approximate statistical frequency of each interval;
Clustering the intervals sequenced by the approximate statistical histogram of the current moment according to a greedy clustering algorithm by adopting a second partial privacy budget epsilon (1-alpha) to obtain a noise adding histogram of the sliding window of the current moment;
Issuing a noise adding histogram of the sliding window at the current moment;
the process of constructing an approximate estimation sketch structure to approximate estimate the data in the weighted sliding window and obtaining the interval count value is as follows:
Judging whether the current data entering the weighted sliding window data stream accords with a noise adding histogram judgment statistical interval to be issued or not; if the current data meets the current interval range, setting the current data to be 1; if the current data does not meet the current interval range, the current data is set to 0;
and partitioning the weighted sliding window, and calculating an approximation error threshold value theta of the weighted sliding window and a weighted count value y of the current timestamp, wherein the calculation formulas are respectively as follows: y=y×γ +x t; wherein W represents the size of the sliding window, gamma represents the weighting factor, beta represents the approximation error factor, S i [ t-W+1, t ] represents the sliding window of the ith binary data stream, x t represents an element instance in the sliding window at the current time S i [ t-W+1, t ];
Judging whether the data stream S i [ t-W+1, t ] has an expired data element on the current timestamp;
calculating the interval count value of the current timestamp When the data stream S i [ t-W+1, t ] generates an expired data element at the current timestamp, thenWhen the data stream S i [ t-W+1, t ] has no expired data elements at the current timestamp, thenWherein B represents a two-dimensional array for storing weighted sliding window count information, wherein B 1 (i) represents a first sub-array, records a timestamp of a current element in the data stream, and B 2 (i) represents a second sub-array, stores a section count value of an ith binary data stream S i [ t-W+1, t ]Old (B 1 (i)) represents the first child array of the oldest block, old (B 2 (i)) represents the second child array of the oldest block;
Judging the interval count value of the current time stamp Is to count the interval valueThe result of less than 0 is set to 0 to obtain all interval count values in the current weighted sliding window
2. The method for distributing the flow histogram of the weighted sliding window under the differential privacy according to claim 1, wherein the method for determining the privacy budget allocation factor α is as follows:
Preprocessing a small sample, and setting different values for the privacy budget allocation factor alpha;
Respectively calculating the mean square error between the noise adding value and the true value of each interval of the noise adding histogram to be issued under each value;
And selecting the privacy budget allocation factor alpha corresponding to the minimum mean square error as the optimal privacy budget allocation factor.
3. The method for distributing the flow histogram of the weighted sliding window under the differential privacy according to claim 1, wherein the process of determining the approximate statistical frequency of each interval according to the interval count value of any interval of the noisy histogram to be distributed and the corresponding first partial privacy budget noise value is as follows:
judging the magnitude of the approximation error and the noise threshold value, and determining an approximation statistics frequency; wherein the approximation error is the difference between the interval count value and the true value, and the noise threshold represents the maximum difference between the first partial privacy budget noise value and the true value
When the approximate error threshold is smaller than the noise threshold, the approximate statistical frequency is the interval count value
When the noise threshold is smaller than the approximation error, the approximation statistics frequency is the noise adding value, i.eWherein G i represents the true value of the noisy histogram calculation interval to be issued;
judging the magnitude of the approximate statistical frequency, setting the result of the approximate statistical frequency smaller than 0 to be 0, and obtaining the approximate statistical frequency of all the intervals in the current weighted sliding window.
4. The method for distributing the flow histogram of the differential privacy-weighted sliding window according to claim 3, wherein the clustering process of the interval sequenced by the approximate statistical histogram of the current moment according to the greedy clustering algorithm by adopting the second partial privacy budget epsilon (1-alpha) is as follows:
Approximate statistical frequency of all intervals in weighted sliding window according to ordering algorithm Sequencing, and calculating the combined error and the uncombined error between each frequency number and other frequency numbers;
Judging the interval group, comprising: when the combination error of the two sections is smaller than the non-combination error, combining the two sections and setting the two sections as a group; when the combination error of the two sections is larger than the non-combination error, the two sections are not combined and are set into two groups; the merging error is the result of carrying out square difference on the average value of each interval and the current value of each interval; the non-merging error is the result of performing square deviation on the average value of each interval and the existing value of each interval;
regrouping all intervals in the current weighted sliding window;
replacing the approximate statistical frequency value of each packet by using the average value of each packet, and determining a second partial privacy budget of each interval by the proportion of the approximate statistical frequency value in each packet to the total number of the data statistical values in the whole weighted sliding window so as to add Laplace noise to each packet interval;
calculating the approximate statistical frequency of each group after adding the Laplace noise, judging the size of the approximate statistical frequency, and setting the result of the approximate statistical frequency smaller than 0 to be 0;
And obtaining a noise adding histogram of the sliding window at the current moment.
5. The streaming histogram distribution method of a differential privacy-weighted sliding window according to claim 4, wherein the specific procedure of the interval grouping is as follows:
The first interval of the ordered approximate statistical histogram is self-organized, and the noise error introduced after the second interval is added into the current grouping and the noise error introduced by the new grouping which is reestablished by taking the second interval as a starting point are judged; if the noise error introduced by adding the current packet is smaller than that of establishing a new packet, adding the current packet, otherwise, establishing the new packet; until all the bins of the noise plus histogram are noisy.
6. A streaming histogram distribution system of weighted sliding windows under differential privacy, comprising:
the first determining module is used for determining an interval of a noise adding histogram to be issued in data in a data stream;
A second determining module, configured to determine a privacy budget allocation factor α, and allocate the entire privacy budget epsilon as a first partial privacy budget epsilon and a second partial privacy budget epsilon (1-alpha);
the construction module is used for constructing an approximate estimation sketch structure for each interval of the noise-added histogram to be issued and carrying out approximate estimation on the data in the weighted sliding window to obtain interval count values of all the intervals;
The first processing module is used for adding a first privacy budget alpha epsilon to the noise adding histogram to be issued and respectively obtaining a first partial privacy budget noise value of each interval;
the third determining module is used for determining the approximate statistical frequency of each interval according to the interval count value of any interval of the noise adding histogram to be issued and the corresponding first partial privacy budget noise value, and constructing the approximate statistical histogram of the current moment according to the approximate statistical frequency of each interval;
The second processing module is used for clustering the intervals sequenced by the approximate statistical histogram of the current moment according to a greedy clustering algorithm by adopting a second partial privacy budget epsilon (1-alpha) to obtain a noise adding histogram of the sliding window of the current moment;
the release module is used for releasing the noise adding histogram of the sliding window at the current moment;
The construction module carries out approximate estimation on the data in the weighted sliding window by constructing an approximate estimation sketch structure, and the process of obtaining the interval count value is as follows:
Judging whether the current data entering the weighted sliding window data stream accords with a noise adding histogram judgment statistical interval to be issued or not; if the current data meets the current interval range, setting the current data to be 1; if the current data does not meet the current interval range, the current data is set to 0;
and partitioning the weighted sliding window, and calculating an approximation error threshold value theta of the weighted sliding window and a weighted count value y of the current timestamp, wherein the calculation formulas are respectively as follows: y=y×γ +x t; wherein W represents the size of the sliding window, gamma represents the weighting factor, beta represents the approximation error factor, S i [ t-W+1, t ] represents the sliding window of the ith binary data stream, x t represents an element instance in the sliding window at the current time S i [ t-W+1, t ];
Judging whether the data stream S i [ t-W+1, t ] has an expired data element on the current timestamp;
calculating the interval count value of the current timestamp When the data stream S i [ t-W+1, t ] generates an expired data element at the current timestamp, thenWhen the data stream S i [ t-W+1, t ] has no expired data elements at the current timestamp, thenWherein B represents a two-dimensional array for storing weighted sliding window count information, wherein B 1 (i) represents a first sub-array, records a timestamp of a current element in the data stream, and B 2 (i) represents a second sub-array, stores count information for an ith binary data stream S i [ t-W+1, t ]Old (B 1 (i)) represents the first child array of the oldest block, old (B 2 (i)) represents the second child array of the oldest block;
Judging the interval count value of the current time stamp Is to count the interval valueThe result of less than 0 is set to 0 to obtain all interval count values in the current weighted sliding window
7. The system for streaming histogram distribution of weighted sliding windows under differential privacy according to claim 6, wherein the third determining module determines the approximate statistical frequency of each interval by:
judging the magnitude of the approximation error and the noise threshold value, and determining an approximation statistics frequency; wherein the approximation error is the difference between the interval count value and the true value, and the noise threshold represents the maximum difference between the first partial privacy budget noise value and the true value
When the approximate error threshold is smaller than the noise threshold, the approximate statistical frequency is the interval count value
When the noise threshold is smaller than the approximation error, the approximation statistics frequency is the noise adding value, i.eWherein G i represents the true value of the noisy histogram calculation interval to be issued;
judging the magnitude of the approximate statistical frequency, setting the result of the approximate statistical frequency smaller than 0 to be 0, and obtaining the approximate statistical frequency of all the intervals in the current weighted sliding window.
8. An electronic device comprising a memory for storing at least one instruction and a processor for executing the at least one instruction to implement the method of streaming histogram distribution of a differential privacy-weighted sliding window of any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210343677.XA CN114969656B (en) | 2022-03-31 | 2022-03-31 | Streaming histogram release method and system for weighted sliding window under differential privacy |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210343677.XA CN114969656B (en) | 2022-03-31 | 2022-03-31 | Streaming histogram release method and system for weighted sliding window under differential privacy |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114969656A CN114969656A (en) | 2022-08-30 |
CN114969656B true CN114969656B (en) | 2024-10-22 |
Family
ID=82977881
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210343677.XA Active CN114969656B (en) | 2022-03-31 | 2022-03-31 | Streaming histogram release method and system for weighted sliding window under differential privacy |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114969656B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118568780A (en) * | 2024-07-31 | 2024-08-30 | 杭州世平信息科技有限公司 | Differential privacy method, system and equipment based on dynamic database histogram release |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109558426A (en) * | 2018-11-22 | 2019-04-02 | 河南财经政法大学 | A kind of dissemination method of the streaming histogram based on difference privacy |
CN112307078A (en) * | 2020-09-29 | 2021-02-02 | 安徽工业大学 | Data stream differential privacy histogram publishing method based on sliding window |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113934772A (en) * | 2021-10-20 | 2022-01-14 | 飞马智科信息技术股份有限公司 | Adaptive histogram publishing method facing data stream sliding window |
-
2022
- 2022-03-31 CN CN202210343677.XA patent/CN114969656B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109558426A (en) * | 2018-11-22 | 2019-04-02 | 河南财经政法大学 | A kind of dissemination method of the streaming histogram based on difference privacy |
CN112307078A (en) * | 2020-09-29 | 2021-02-02 | 安徽工业大学 | Data stream differential privacy histogram publishing method based on sliding window |
Also Published As
Publication number | Publication date |
---|---|
CN114969656A (en) | 2022-08-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112307078B (en) | Data stream differential privacy histogram publishing method based on sliding window | |
CN107249035B (en) | Shared repeated data storage and reading method with dynamically variable levels | |
CN112114960B (en) | Scheduling strategy for remote sensing image parallel cluster processing adapting to internet scene | |
CN112417500B (en) | Privacy-protected data stream statistics and release method | |
CN114969656B (en) | Streaming histogram release method and system for weighted sliding window under differential privacy | |
US20140258672A1 (en) | Demand determination for data blocks | |
Kondo et al. | Non-Gaussian statistics in global atmospheric dynamics: a study with a 10 240-member ensemble Kalman filter using an intermediate atmospheric general circulation model | |
US20110173243A1 (en) | Scaled exponential smoothing | |
CN107038067B (en) | Management method and device for processing resources in distributed stream processing | |
CN113271631B (en) | Novel content cache deployment scheme based on user request possibility and space-time characteristics | |
CN113934772A (en) | Adaptive histogram publishing method facing data stream sliding window | |
CN116841753B (en) | Stream processing and batch processing switching method and switching device | |
CN117997902A (en) | Cloud edge collaboration-based data distribution method and system | |
CN112651172A (en) | Rainfall peak type dividing method, device, equipment and storage medium | |
CN118283082A (en) | Big data acquisition method and system based on cloud computing | |
CN106557469B (en) | Method and device for processing data in data warehouse | |
CN113672979B (en) | Differential privacy non-equidistant histogram release method and device based on barrel structure division | |
CN109685101B (en) | Multi-dimensional data self-adaptive acquisition method and system | |
CN118056185A (en) | Anomaly aware cloud resource management system receiving external information and including short-term and long-term resource planning | |
CN115499513A (en) | Data request processing method and device, computer equipment and storage medium | |
CN113572653B (en) | Method, device and equipment for obtaining flow prediction range and storage medium | |
CN117349563A (en) | Data stream sampling and publishing method and system with differential privacy | |
CN114119888B (en) | Method for rapidly obtaining discrete point density through twice statistics | |
CN117519913B (en) | Method and system for elastically telescoping scheduling of container memory resources | |
CN112559571B (en) | Approximate outlier calculation method and system for numerical value type stream data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |