WO2021248937A1 - 一种基于差分隐私的地理分布式图计算方法及系统 - Google Patents
一种基于差分隐私的地理分布式图计算方法及系统 Download PDFInfo
- Publication number
- WO2021248937A1 WO2021248937A1 PCT/CN2021/077138 CN2021077138W WO2021248937A1 WO 2021248937 A1 WO2021248937 A1 WO 2021248937A1 CN 2021077138 W CN2021077138 W CN 2021077138W WO 2021248937 A1 WO2021248937 A1 WO 2021248937A1
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- iteration
- data
- differential privacy
- data center
- preset
- Prior art date
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
- G06F21/6263—Protecting personal data, e.g. for financial or medical purposes during internet communication, e.g. revealing personal data from cookies
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L63/00—Network architectures or network communication protocols for network security
- H04L63/04—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks
- H04L63/0428—Network architectures or network communication protocols for network security for providing a confidential data exchange among entities communicating through data packet networks wherein the data content is protected, e.g. by encrypting or encapsulating the payload
Definitions
- This application relates to the field of large-scale graph segmentation processing, and in particular to a geographically distributed graph computing method and system based on differential privacy.
- differential privacy technology When performing graph processing on a geographically distributed data center (DC: Data Center), in order to protect personal privacy, differential privacy technology can be applied. Differential privacy is a strictly proven differential technology that can protect personal privacy. It implements differential privacy by adding random noise to the communication between different DCs. The size of this random noise is mainly determined by two parameters, one is the privacy budget (budget), and the other is the sensitivity (sensitivity). The relationship between the size of the budget and the effect of privacy protection and the size of noise is as follows: the larger the budget, the smaller the noise added, and the worse the protection effect; the smaller the budget, the larger the noise added, and the better the protection effect. The budget mentioned here refers to the total budget size.
- this application provides a method and system for computing geographically distributed graphs based on differential privacy.
- the technical problem to be solved is to overcome the problem of overcoming the geographically distributed graph computing in the prior art when applying differential privacy technology for some iterative features. It is difficult to converge because the noise is too large, or the data availability of experimental results is low due to the influence of noise after applying differential privacy.
- an embodiment of the present application provides a method for calculating a geographically distributed map based on differential privacy, which includes the following steps: calculate the geographical distribution map based on the differential privacy using a preset processing model, and calculate the geographical distribution map according to the index allocation mechanism. Allocate budgets for each iteration in each round;
- Each data center receives the data sent by other data centers after the previous iteration, and updates the effective value of the vertex, and repeats the adding of aggregators in the data center to collect the data that needs to be sent to neighboring data centers, and save them all Add up the noise corresponding to the current iteration, and then divide it evenly and send it to the adjacent data center until the preset convergence condition is reached, and the iteration ends; each data center performs geographic operations according to the processing model that meets the preset convergence condition. Data transfer between distributed graphs.
- the method before adding an aggregator in a data center to collect messages that need to be sent to other data centers, the method further includes:
- the method for obtaining the effective value of each vertex includes: the shortest single-source path algorithm or the PageRank algorithm; when obtained by the shortest single-source path algorithm, the effective value of each vertex is the shortest path length; when obtained by the PageRank algorithm , The effective value of each vertex is the rank value.
- the re-sampling probability formula is:
- rank represents the effective value of a vertex in the current iteration
- n the initial effective value of the vertex.
- the preset iteration conditions include: the average value of the effective values of each data center in the current round of iteration reaches a preset value, the number of iterations is equal to the preset maximum number of iterations, or the effective value of each vertex in the current round of iteration is relative to The change value of the effective value of the last round is less than the preset value, at least one of them.
- the formula of the preset index allocation mechanism is as follows:
- max represents the maximum number of iterations
- the preset processing model is a Pregel model.
- the embodiments of the present application provide a geographically distributed graph computing system based on differential privacy, including:
- Each round of iterative budget allocation module is used to calculate the geographic distribution map based on differential privacy using a preset processing model, and allocate the budget for each iteration of the geographic distribution map according to the index allocation mechanism;
- Noise adding module used to add an aggregator in the data center to collect the data that needs to be sent to the adjacent data center, and add them all together plus the noise corresponding to this round of iteration, and then divide it evenly and send it to the adjacent data In the center, the noise is obtained through the Laplace mechanism conversion of the budget allocated in this iteration;
- Iteration module is used for each data center to receive the data sent by other data centers after the previous iteration, and update the effective value of the vertex itself, and repeat the above adding an aggregator in the data center to collect the data that needs to be sent to the adjacent data center Data, and add it all up plus the noise corresponding to the current iteration, and then divide it evenly and send it to the adjacent data center until the preset convergence condition is reached, and the iteration ends; each data center meets the preset convergence condition
- the processing model is used to transfer data between geographically distributed graphs.
- an embodiment of the present application provides a computer-readable storage medium, the computer-readable storage medium stores computer instructions, and the computer instructions are used to make the computer execute the difference-based Privacy-based geographically distributed graph computing method.
- an embodiment of the present application provides a computer device, including: a memory and a processor, the memory and the processor are communicatively connected to each other, the memory stores computer instructions, and the processor executes all The computer instructions are described to execute the geographically distributed graph calculation method based on differential privacy in the first aspect of the embodiments of the present application.
- the geographically distributed graph computing method and system based on differential privacy minimizes the impact of noise by assigning the total budget to the exponential mechanism of each iteration on the premise of satisfying differential privacy;
- An aggregator is added to DC to reduce the introduction of noise without affecting the protection effect;
- the probability sampling method is used to reduce the number of vertices in each iteration, thereby reducing the introduction of noise without affecting the protection effect.
- FIG. 1 is a flow chart of a specific example of a geographically distributed graph calculation method based on differential privacy in an embodiment of the application;
- Figure 2 is a schematic diagram of budget allocation performed by a common Pregel model in an embodiment of the application
- FIG. 3 is a schematic diagram of budget allocation during iteration of a common Pregel model in an embodiment of the application
- FIG. 4 is a schematic diagram of budget allocation after adding an aggregator to a common Pregel model in an embodiment of the application;
- FIG. 5 is a flowchart of another specific example of a geographically distributed graph calculation method based on differential privacy in an embodiment of the application
- Fig. 6 is a block diagram of a geographically distributed graph computing system based on differential privacy in an embodiment of the application
- FIG. 7 is a diagram of another module composition of a geographically distributed graph computing system based on differential privacy in an embodiment of the application.
- FIG. 8 is a composition diagram of a specific example of a computer device provided by an embodiment of the application.
- the embodiment of the present application provides a geographically distributed graph calculation method based on differential privacy, as shown in FIG. 1, including the following steps:
- Step S10 Perform map calculation on the geographic distribution map using a preset processing model based on differential privacy, and allocate a budget for each iteration of the geographic distribution map according to the index allocation mechanism.
- Differential privacy is a strictly proven differential technology that can protect personal privacy. It adds random noise to the communication between different DCs (the method of adding noise generally includes an exponential mechanism and a Laplace mechanism). , To achieve differential privacy. Privacy is defined as the difference: random algorithm has M, P M for the M output probability for all possible sets of configuration, for any two adjacent data sets D and D 'and any subset P M S M, if M algorithm satisfies :
- Algorithm M provides ⁇ -differential privacy protection.
- the size of the random noise is mainly determined by two parameters, one is the privacy budget (budget), and the other is the sensitivity (sensitivity). Sensitivity is not the main improvement point of this application, so it is set according to the worst case, that is, under this setting, the differential privacy can be strictly guaranteed; there are generally two ways to add noise: an exponential mechanism and a Laplace mechanism. This application uses the Laplace mechanism to calculate:
- sensitivity Given a function set F, D and D’ are adjacent data sets, the sensitivity is defined as follows:
- the embodiments of this application are calculated based on the Pregel model.
- the Pregel model is based on edge cutting, and its calculation process is composed of a series of iterative processes.
- a user-defined function is executed in parallel on each vertex, which describes the operation that a vertex V needs to perform in a superstep S.
- the result will be sent to the other vertices it needs, but at this time other vertices will not accept the message immediately, but wait for the next iteration to arrive before receiving the message.
- the vertex can read the messages sent by other vertices during the previous iteration and continue to execute the user-defined function. This iteration continues until all vertices are in an inactive state (when a vertex does not need to perform further calculations, it will be set to an inactive state).
- the embodiment of this application is based on the Pregel model for calculation, but this is not a limitation, and it is also applicable to other graph calculation models, such as the GAS model.
- the embodiment of this application adopts the Pregel model to achieve better technical effects.
- the total budget is allocated to each iteration.
- General methods include distribution methods such as equal distribution, linear distribution, Fibonacci sequence, etc.
- the total budget setting is often relatively small, so noise is often relatively small.
- the revised index allocation mechanism formula is as follows:
- max represents the maximum number of iterations
- Step S20 Add an aggregator in the data center to collect data that needs to be sent to adjacent data centers, add all of them and add the noise corresponding to the current iteration, and then divide them evenly and send them to adjacent data centers.
- an aggregator is added to the Pregel model.
- the aggregator is responsible for collecting messages that need to be sent to other DCs, and adding them all together, so that the budget allocation rises from the vertex allocation level to the aggregator Level, that is, budget i only needs to be assigned to the created aggregators at this time.
- the budget is allocated to all vertices that require cross-DC communication (the number of vertices is 10e+05 level or above).
- Step S30 Each data center receives the data sent by other data centers after the previous iteration, and updates the effective value of the vertex, and repeats the adding of an aggregator in the data center to collect the data that needs to be sent to adjacent data centers, and Add them up and add the noise corresponding to the current iteration, and then divide them evenly and send them to adjacent data centers until the preset convergence conditions are reached, and the iteration ends; each data center follows the processing model that meets the preset convergence conditions , For data transmission between geographically distributed graphs.
- the effective value of each vertex can be obtained in the following ways: the shortest single-source path algorithm sssp or the page ranking PageRank algorithm; when obtained by the shortest single-source path algorithm, the effective value of each vertex is the shortest path length; when obtained by the PageRank algorithm When, the effective value of each vertex is the rank value.
- the PageRank algorithm is taken as an example, and the PR value of a webpage is calculated as follows:
- the preset iteration conditions in the embodiments of the present application include: the average value of the effective values of each data center in this round of iteration reaches the preset value, the number of iterations is equal to the preset maximum number of iterations, or the effective value of each vertex in this round of iteration is relative to the previous round.
- the change value of the effective value of is less than the preset value, at least one of them.
- the embodiment of the present application further includes:
- Step 11 Discard all vertices in a certain round of iteration. After all vertices are resampled according to the probability obtained by the preset resampling formula, the vertices that are sampled successfully will be allocated to the aggregator to which they should belong.
- rank represents the rank value of a vertex in the current iteration
- n is the initial rank value of the vertex of the PageRank algorithm, which should be set according to different applications. In this application, since ⁇ in the PageRank calculation formula is 0.85, n corresponds to 0.15.
- the geographically distributed graph calculation method based on differential privacy provided by the embodiments of this application, on the premise of satisfying differential privacy, minimizes the impact of noise by assigning the total budget to the exponential mechanism of each iteration; in DC A new aggregator is added to reduce the introduction of noise without affecting the protection effect; the probability sampling method is used to reduce the number of vertices in each iteration, thereby reducing the introduction of noise without affecting the protection effect. Thereby improving the convergence ability of the iteration, while greatly improving the availability of data.
- the embodiment of the application provides a geographically distributed graph computing system based on differential privacy, as shown in FIG. 6, including:
- Each round of iterative budget allocation module 10 is used to calculate the geographic distribution map based on differential privacy using a preset processing model, and allocate a budget to each iteration of the geographic distribution map according to an index allocation mechanism. This module executes the method described in step S10 in embodiment 1, which will not be repeated here.
- the noise adding module 20 is used to add an aggregator in the data center to collect the data that needs to be sent to the adjacent data center, and add all of them together with the noise corresponding to the current iteration, and then divide it evenly and send it to the adjacent In the data center, the noise is obtained through the Laplace mechanism conversion of the budget allocated in this iteration.
- This module executes the method described in step S20 in Embodiment 1, which will not be repeated here.
- the iteration module 30 is used for each data center to receive the data sent by other data centers after the previous iteration, and update the effective value of the vertex itself, and repeat the above adding aggregator in the data center to collect the data that needs to be sent to the adjacent data center
- the data is added up and the noise corresponding to the current iteration is added, and then divided equally and sent to the adjacent data center until the preset convergence condition is reached, the iteration ends; each data center reaches the preset convergence Conditional processing model for data transmission between geographically distributed graphs.
- This module executes the method described in step S30 in embodiment 1, which will not be repeated here.
- the above-mentioned geographically distributed graph computing system based on differential privacy further includes:
- the re-sampling module 11 is used to discard all vertices in a certain round of iteration, and after all vertices are re-sampled according to the probability obtained by the preset re-sampling formula, the vertices that are sampled successfully will be allocated to the aggregator to which they should belong.
- This module executes the method described in step S11 in embodiment 1, which will not be repeated here.
- the embodiment of the application provides a geographically distributed graph computing system based on differential privacy.
- the total budget is allocated to the index mechanism of each iteration to minimize the impact of noise;
- a new aggregator is added to DC to reduce the introduction of noise without affecting the protection effect;
- the probability sampling method is used to reduce the number of vertices in each iteration, thereby reducing the introduction of noise without affecting the protection effect.
- FIG. 8 An embodiment of the present application provides a computer device. As shown in FIG. 8, the device may include a processor 51 and a memory 52, where the processor 51 and the memory 52 may be connected by a bus or in other ways. FIG. 8 uses a bus connection as an example .
- the processor 51 may be a central processing unit (Central Processing Unit, CPU).
- the processor 51 may also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), Application Specific Integrated Circuit (ASIC), Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA), or Chips such as other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, or a combination of the above types of chips.
- DSP Digital Signal Processor
- ASIC Application Specific Integrated Circuit
- FPGA Field-Programmable Gate Array
- Chips such as other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, or a combination of the above types of chips.
- the memory 52 can be used to store non-transitory software programs, non-transitory computer executable programs and modules, such as corresponding program instructions/modules in the embodiments of the present application.
- the processor 51 executes various functional applications and data processing of the processor by running non-transitory software programs, instructions, and modules stored in the memory 52, that is, realizing the geographically distributed map based on differential privacy in the foregoing method embodiment. Calculation method.
- the memory 52 may include a program storage area and a data storage area.
- the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created by the processor 51 and the like.
- the memory 52 may include a high-speed random access memory, and may also include a non-transitory memory, such as at least one magnetic disk storage device, a flash memory device, or other non-transitory solid-state storage devices.
- the memory 52 may optionally include memories remotely provided with respect to the processor 51, and these remote memories may be connected to the processor 51 through a network. Examples of the aforementioned network include, but are not limited to, the Internet, an intranet, an intranet, a mobile communication network, and combinations thereof.
- One or more modules are stored in the memory 52, and when executed by the processor 51, a geographically distributed graph calculation method based on differential privacy in Embodiment 1 is executed.
- a computer program can be used to instruct relevant hardware to complete the program, which can be stored in a computer readable storage medium, and when the program is executed , May include the processes of the above-mentioned method embodiments.
- the storage media can be magnetic disks, optical disks, read-only memory (Read-Only Memory, ROM), random access memory (RAM), flash memory (Flash Memory), hard disk (Hard Disk Drive) , Abbreviation: HDD) or solid-state drive (Solid-State Drive, SSD), etc.; the storage medium may also include a combination of the foregoing types of memories.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- Bioethics (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computer Hardware Design (AREA)
- Health & Medical Sciences (AREA)
- Signal Processing (AREA)
- Computer Networks & Wireless Communication (AREA)
- Medical Informatics (AREA)
- Databases & Information Systems (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Telephonic Communication Services (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
- Complex Calculations (AREA)
Abstract
本申请公开了基于差分隐私的地理分布式图计算方法及系统,基于差分隐私利用预设处理模型对地理分布图进行图计算,按指数分配机制对每轮迭代分配预算;在DC中增加聚合器收集需要发送向相邻DC数据,将其全部加起来加上本轮迭代对应的噪音平均划分发送给相邻的DC;各DC接收上轮迭代后其他DC发送的数据更新顶点的有效值,重复在DC中增加聚合器来收集需要发送向相邻DC的数据,将其全部加起来加上本轮迭代对应的噪音平均划分发送给相邻的DC的步骤,直至达到收敛条件迭代结束;各DC按照达到收敛条件的处理模型进行分布式图间的数据传输。本申请通过减小噪音的引入而不影响保护效果,提高迭代的收敛能力,同时大大提高了数据的可用性。
Description
本申请涉及大规模图分割处理领域,具体涉及一种基于差分隐私的地理分布式图计算方法及系统。
在地理分布式的数据中心(DC:Data Center)上进行图处理时,为了保护个人隐私,可以应用差分隐私技术。差分隐私是一种经过严格证明的能够保护个人隐私的差分技术,它通过在不同DC之间的通信上加随机噪音(noise)的方法来实现差分隐私。这个随机的noise的大小主要是由两个参数决定的,一是隐私预算(budget),一是敏感度(sensitivity)。budget的大小与隐私保护效果、noise的大小之间的关系是这样的:budget越大,所加入的noise越小,保护效果越差;budget越小,加入的noise越大,保护效果越好。这里所说的budget是指总的budget大小,对于计算过程具有迭代特征的应用(PageRank、sssp等),还需要把这个budget按照某种规则分配给每个迭代过程,然后在具体的每次迭代中再细分给各个顶点。现有技术存在的主要问题有两个:1、对于具有迭代特征的某些应用差分隐私技术时由于noise太大而难以收敛;2、应用了差分隐私之后由于noise的影响实验结果数据可用性较低。
发明内容
因此,本申请提供一种基于差分隐私的地理分布式图计算方法及系统, 要解决的技术问题在于克服现有技术中地理分布式图计算时,对于具有迭代特征的某些应用差分隐私技术时由于noise太大而难以收敛,或应用差分隐私之后由于noise的影响实验结果数据可用性较低的缺陷。
为达到上述目的,本申请提供如下技术方案:
第一方面,本申请实施例提供一种基于差分隐私的地理分布式图计算方法,包括如下步骤:基于差分隐私利用预设处理模型对地理分布图进行图计算,按照指数分配机制对地理分布图中每一轮迭代分配预算;
在数据中心中增加聚合器来收集需要发送向相邻数据中心的数据,并将其全部加起来加上本轮迭代对应的噪音,再平均划分后发送给相邻的数据中心;
各数据中心接收上一轮迭代后其他数据中心发送的数据,并更新顶点的有效值,并重复所述在数据中心中增加聚合器来收集需要发送向相邻数据中心的数据,并将其全部加起来加上本轮迭代对应的噪音,再平均划分后发送给相邻的数据中心的步骤,直至达到预设收敛条件,迭代结束;各个数据中心按照达到预设收敛条件的处理模型,进行地理分布式图之间的数据传输。
在一实施例中,在数据中心中增加聚合器来收集需要发送向其他数据中心的消息的步骤之前,还包括:
在某轮迭代中丢弃所有顶点,按照预设重新采样公式得到的概率对所有顶点进行重取样之后,取样成功的顶点将会分配给其应归属的聚合器。
在一实施例中,各个顶点有效值的获取方式包括:最短单源路径算法或PageRank算法;当通过最短单源路径算法获取时,各个顶点的有效值为最短路径长度;当通过PageRank算法获取时,各个顶点的有效值为rank值。
在一实施例中,重取样概率公式为:
式中,rank代表本轮迭代中某个顶点的有效值;
n表征顶点的初始有效值。
在一实施例中,所述预设迭代条件包括:本轮迭代中各个数据中心有效值的平均值达到预设值、迭代次数等于预设最大迭代次数或本轮迭代中各个顶点有效值相对于上轮的有效值的变化值均小于预设值,中的至少之一种。
在一实施例中,预设指数分配机制公式如下:
在一实施例中,所述预设处理模型为Pregel模型。
第二方面,本申请实施例提供一种基于差分隐私的地理分布式图计算系统,包括:
每轮迭代预算分配模块,用于基于差分隐私利用预设处理模型对地理分布图进行图计算,按照指数分配机制对地理分布图中每轮迭代分配预算;
噪声添加模块,用于在数据中心中增加聚合器来收集需要发送向相邻数据中心的数据,并将其全部加起来加上本轮迭代对应的噪音,再平均划分后发送给相邻的数据中心,所述噪音通过该轮迭代分配的预算进行拉普拉斯机制转换得到;
迭代模块,用于各数据中心接收上一轮迭代后其他数据中心发送的数据,并更新顶点自身的有效值,并重复所述在数据中心中增加聚合器来收集需要发送向相邻数据中心的数据,并将其全部加起来加上本轮迭代对应的噪音,再平均划分后发送给相邻的数据中心的步骤,直至达到预设收敛条件,迭代结束;各个数据中心按照达到预设收敛条件的处理模型,进行地理分布式图之间的数据传输。
第三方面,本申请实施例提供一种计算机可读存储介质,所述计算机可读存储介质存储有计算机指令,所述计算机指令用于使所述计算机执行 本申请实施例第一方面的基于差分隐私的地理分布式图计算方法。
第四方面,本申请实施例提供一种计算机设备,包括:存储器和处理器,所述存储器和所述处理器之间互相通信连接,所述存储器存储有计算机指令,所述处理器通过执行所述计算机指令,从而执行本申请实施例第一方面的基于差分隐私的地理分布式图计算方法。
本申请技术方案,具有如下优点:
本申请提供的一种基于差分隐私的地理分布式图计算方法及系统,在满足差分隐私的前提下,通过将总的budget分配给各轮迭代的指数机制,最大程度地减小noise的影响;在DC中新增aggregator来减小noise的引入而不影响保护效果;通过概率取样的方法来减少每轮迭代中顶点的数量,从而减小noise的引入而不影响保护效果。从而提高了迭代的收敛能力,同时大大提高了数据的可用性。
为了更清楚地说明本申请具体实施方式或现有技术中的技术方案,下面将对具体实施方式或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施方式,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1为本申请实施例中基于差分隐私的地理分布式图计算方法的一个具体示例的流程图;
图2为本申请实施例中的普通的Pregel模型进行预算分配的示意图;
图3为本申请实施例中的普通的Pregel模型在迭代时预算分配的示意图;
图4为本申请实施例中在普通的Pregel模型中加入聚合器后的预算分配的示意图;
图5为本申请实施例中基于差分隐私的地理分布式图计算方法的另一个具体示例的流程图;
图6为本申请实施例中基于差分隐私的地理分布式图计算系统的一个模块组成图;
图7为本申请实施例中基于差分隐私的地理分布式图计算系统的另一个模块组成图;
图8为本申请实施例提供的计算机设备一个具体示例的组成图。
下面将结合附图对本申请的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
此外,下面所描述的本申请不同实施方式中所涉及的技术特征只要彼此之间未构成冲突就可以相互结合。
实施例1
本申请实施例提供一种基于差分隐私的地理分布式图计算方法,如图1 所示,包括如下步骤:
步骤S10:基于差分隐私利用预设处理模型对地理分布图进行图计算,按照指数分配机制对地理分布图中每一轮迭代分配预算。
差分隐私是一种经过严格证明的能够保护个人隐私的差分技术,它通过在不同DC之间的通信上加随机noise(噪音的添加方式一般有指数机制以及拉普拉斯机制两种)的方法,来实现差分隐私。定义差分隐私为:设有随机算法M,P
M为M的所有可能输出构成的集合的概率,对于任意两个邻近数据集D与D’以及P
M的任意子集S
M,若算法M满足:
P{M(D)∈S
M}≤e
ε·P{M(D')∈S
M}
则称算法M提供ε-差分隐私保护。
随机的noise的大小主要是由两个参数决定的,一是隐私预算(budget),一是敏感度(sensitivity)。敏感度不是本申请的主要改进点,因此是按照最差的情况来设置的,即在该设置下能够严格保证满足差分隐私;噪音的添加方式一般有指数机制以及拉普拉斯机制两种,本申请采用拉普拉斯机制进行计算:
敏感度(sensitivity)的定义:给定一个函数集F,D和D’为邻近数据集,其敏感度定义如下:
给定一个函数f:D→R
d,若隐私保护算法A满足ε-差分隐私,当且 仅当下述表达式成立:
可知ε(budget)的大小与noise的大小以及差分隐私保护效果之间的关系为:ε越小,Laplace noise越大,隐私保护效果越好。
本申请实施例是基于Pregel模型进行计算的,Pregel模型是基于边切割的,它的计算过程是由一系列迭代过程组成的。在每个迭代过程中,每个顶点上面都会并行执行用户自定义的函数,该函数描述了一个顶点V在一个超步S中需要执行的操作。执行完该函数之后即将得到的结果发送给其所需的其他顶点,但是此时其他顶点并不会马上接受该消息,而是等待下一轮迭代到来才会接收该消息。在下一次迭代中顶点可以读取上一次迭代过程中其他顶点发过来的消息并继续执行用户自定义的函数。该迭代一直持续直至所有顶点处于非活跃状态(当一个顶点不需要执行进一步的计算时会被设置为非活跃状态)为止。
需要说明的是本申请实施例是基于Pregel模型进行计算,但是不以此作为限制,也适用于其他图计算模型,例如是GAS模型,本申请实施例采用Pregel模型的技术效果更优。
将总的budget分配给各轮迭代,一般的方法有诸如平均分配、线性分配、斐波那契数列等分配方式,但是实际应用中由于总的budget设置往往是比较小的,因此noise往往也是比较大的,对此为了尽可能地减小noise的影响,希望能够在迭代的前期分配较少的budget,迭代的后期分配较大 的budget,能够最大程度地减小noise的影响。因此本申请实施例提供了一种新的经过修改的指数分配机制。如图2所示,假设总的budget为3,则其会被按照指数机制分配到每一轮迭代中。修改的指数分配机制公式如下:
步骤S20:在数据中心中增加聚合器来收集需要发送向相邻数据中心的数据,并将其全部加起来加上本轮迭代对应的噪音,再平均划分后发送给相邻的数据中心。
如图3所示,假设在某轮迭代中,DC0中有四个顶点需要与DC1通信,如果按照普通的Pregel模型,则需要将本轮迭代分配到的budget
i继续分配个这四个顶点(这里只是举例说明,实际中这样的顶点数量通常是达到10e+05或者以上级别的)。
如图4所示,本申请实施例中在Pregel模型中加入了聚合器aggregator,aggregator负责收集需要发送向其他DC的消息,并将它们全部加起来,这 样budget分配就从顶点分配级别上升到了aggregator级别,即此时budget
i只需分配给创建的aggregators即可。对比普通的Pregel模型的将budget分配给所有需要跨DC通信的顶点(顶点数量是10e+05级别或以上),加入aggregator之后,由于aggregator数量可以自己定义(通常不建议设置太多aggregator),因此每个aggregator分配到的budget将会远大于Pregel模型下顶点分配到的budget。因此加入aggregator之后可以在不降低隐私保护效果的同时,大大降低noise的影响。
步骤S30:各数据中心接收上一轮迭代后其他数据中心发送的数据,并更新顶点的有效值,并重复所述在数据中心中增加聚合器来收集需要发送向相邻数据中心的数据,并将其全部加起来加上本轮迭代对应的噪音,再平均划分后发送给相邻的数据中心的步骤,直至达到预设收敛条件,迭代结束;各个数据中心按照达到预设收敛条件的处理模型,进行地理分布式图之间的数据传输。
实际应用中,各个顶点有效值的获取方式包括:最短单源路径算法sssp或网页排序PageRank算法;当通过最短单源路径算法获取时,各个顶点的有效值为最短路径长度;当通过PageRank算法获取时,各个顶点的有效值为rank值。
本申请实施例以PageRank算法为例,一个网页的PR值计算如下:
根据上述的公式计算每个网页的PR值,在不断迭代趋于平稳(即收敛)的时候,即为最终结果。
本申请实施例中的预设迭代条件包括:本轮迭代中各个数据中心有效值的平均值达到预设值、迭代次数等于预设最大迭代次数或本轮迭代中各个顶点有效值相对于上轮的有效值的变化值均小于预设值,中的至少之一种。
需要注意的是,由于aggregator的工作原理,它负责收集消息并将这些消息加起来统一加一次noise,之后aggregator负责将这些消息发送给DC1时不能够再按照原来的Msg_rank的比例去还原成4份,而是需要平均划分成4份,否则将会使得其不满足ε-差分隐私。但是平均划分的方法有一个缺点:改变了原顶点的rank值,会使得最终结果误差上升。但是该额外引入的误差对比于不使用aggregator时的noise显得微不足道,因此总体上反而使得加入aggregator之后的数据可用性大大提高,并且通过修改的指数机制的方式已经能够解决Pregel模型下PageRank算法无法收敛的问题,但是数据可用性依然不足。因此为了克服其存在不不足,本申请实施例在数据中心中增加聚合器来收集需要发送向其他数据中心的消息的步骤之前,如图5所示,还包括:
步骤11:在某轮迭代中丢弃所有顶点,按照预设重新采样公式得到的概率对所有顶点进行重取样之后,取样成功的顶点将会分配给其应归属的聚合器。
式中,rank代表本轮迭代中某个顶点的rank值;
n的含义是PageRank算法的顶点的初始rank值,该值应根据不同的应用进行设置,本申请中由于PageRank计算公式中的α取0.85,因此n对应取0.15。
本申请实施例提供的基于差分隐私的地理分布式图计算方法,在满足差分隐私的前提下,通过将总的budget分配给各轮迭代的指数机制,最大程度地减小noise的影响;在DC中新增aggregator来减小noise的引入而不影响保护效果;通过概率取样的方法来减少每轮迭代中顶点的数量,从而减小noise的引入而不影响保护效果。从而提高了迭代的收敛能力,同时大大提高了数据的可用性。
实施例2
本申请实施例提供一种基于差分隐私的地理分布式图计算系统,如图6所示,包括:
每轮迭代预算分配模块10,用于基于差分隐私利用预设处理模型对地理分布图进行图计算,按照指数分配机制对地理分布图中每轮迭代分配预算。此模块执行实施例1中的步骤S10所描述的方法,在此不再赘述。
噪声添加模块20,用于在数据中心中增加聚合器来收集需要发送向相邻数据中心的数据,并将其全部加起来加上本轮迭代对应的噪音,再平均划分后发送给相邻的数据中心,所述噪音通过该轮迭代分配的预算进行拉 普拉斯机制转换得到。此模块执行实施例1中的步骤S20所描述的方法,在此不再赘述。
迭代模块30,用于各数据中心接收上一轮迭代后其他数据中心发送的数据,并更新顶点自身的有效值,并重复所述在数据中心中增加聚合器来收集需要发送向相邻数据中心的数据,并将其全部加起来加上本轮迭代对应的噪音,再平均划分后发送给相邻的数据中心的步骤,直至达到预设收敛条件,迭代结束;各个数据中心按照达到预设收敛条件的处理模型,进行地理分布式图之间的数据传输。此模块执行实施例1中的步骤S30所描述的方法,在此不再赘述。
在一实施例中,上述基于差分隐私的地理分布式图计算系统,如图7所示,还包括:
重采样模块11,用于在某轮迭代中丢弃所有顶点,按照预设重新采样公式得到的概率对所有顶点进行重取样之后,取样成功的顶点将会分配给其应归属的聚合器。此模块执行实施例1中的步骤S11所描述的方法,在此不再赘述。
本申请实施例提供一种基于差分隐私的地理分布式图计算系统,在满足差分隐私的前提下,通过将总的budget分配给各轮迭代的指数机制,最大程度地减小noise的影响;在DC中新增aggregator来减小noise的引入而不影响保护效果;通过概率取样的方法来减少每轮迭代中顶点的数量,从而减小noise的引入而不影响保护效果。从而提高了迭代的收敛能力,同时大大提高了数据的可用性。
实施例3
本申请实施例提供一种计算机设备,如图8所示,该设备可以包括处理器51和存储器52,其中处理器51和存储器52可以通过总线或者其他方式连接,图8以通过总线连接为例。
处理器51可以为中央处理器(Central Processing Unit,CPU)。处理器51还可以为其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等芯片,或者上述各类芯片的组合。
存储器52作为一种非暂态计算机可读存储介质,可用于存储非暂态软件程序、非暂态计算机可执行程序以及模块,如本申请实施例中的对应的程序指令/模块。处理器51通过运行存储在存储器52中的非暂态软件程序、指令以及模块,从而执行处理器的各种功能应用以及数据处理,即实现上述方法实施例中的基于差分隐私的地理分布式图计算方法。
存储器52可以包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需要的应用程序;存储数据区可存储处理器51所创建的数据等。此外,存储器52可以包括高速随机存取存储器,还可以包括非暂态存储器,例如至少一个磁盘存储器件、闪存器件、或其他非暂态固态存储器件。在一些实施例中,存储器52可选包括相对于处理器51远程设置的存储器,这些远程存储器可以通过网络连接至处理器51。上述 网络的实例包括但不限于互联网、企业内部网、企业内网、移动通信网及其组合。
一个或者多个模块存储在存储器52中,当被处理器51执行时,执行实施例1中的一种基于差分隐私的地理分布式图计算方法。
上述计算机设备具体细节可以对应参阅实施例1中对应的相关描述和效果进行理解,此处不再赘述。
本领域技术人员可以理解,实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成的程序可存储于一计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)、随机存储记忆体(Random Access Memory,RAM)、快闪存储器(Flash Memory)、硬盘(Hard Disk Drive,缩写:HDD)或固态硬盘(Solid-State Drive,SSD)等;存储介质还可以包括上述种类的存储器的组合。
显然,上述实施例仅仅是为清楚地说明所作的举例,而并非对实施方式的限定。对于所属领域的普通技术人员来说,在上述说明的基础上还可以做出其它不同形式的变化或变动。这里无需也无法对所有的实施方式予以穷举。而由此所引申出的显而易见的变化或变动仍处于本申请的保护范围之中。
Claims (11)
- 一种基于差分隐私的地理分布式图计算方法,其特征在于,包括如下步骤:基于差分隐私利用预设处理模型对地理分布图进行图计算,按照指数分配机制对地理分布图中每一轮迭代分配预算;在数据中心中增加聚合器来收集需要发送向相邻数据中心的数据,并将其全部加起来加上本轮迭代对应的噪音,再平均划分后发送给相邻的数据中心;各数据中心接收上一轮迭代后其他数据中心发送的数据,并更新顶点的有效值,并重复所述在数据中心中增加聚合器来收集需要发送向相邻数据中心的数据,并将其全部加起来加上本轮迭代对应的噪音,再平均划分后发送给相邻的数据中心的步骤,直至达到预设收敛条件,迭代结束;各个数据中心按照达到预设收敛条件的处理模型,进行地理分布式图之间的数据传输。
- 根据权利要求1所述的基于差分隐私的地理分布式图计算方法,其特征在于,在数据中心中增加聚合器来收集需要发送向其他数据中心的消息的步骤之前,还包括:在某轮迭代中丢弃所有顶点,按照预设重新采样公式得到的概率对所有顶点进行重取样之后,取样成功的顶点将会分配给其应归属的聚合器。
- 根据权利要求2所述的基于差分隐私的地理分布式图计算方法,其特 征在于,各个顶点有效值的获取方式包括:最短单源路径算法或PageRank算法;当通过最短单源路径算法获取时,各个顶点的有效值为最短路径长度;当通过PageRank算法获取时,各个顶点的有效值为rank值。
- 根据权利要求1所述的基于差分隐私的地理分布式图计算方法,其特征在于,所述预设迭代条件包括:本轮迭代中各个数据中心有效值的平均值达到预设值、迭代次数等于预设最大迭代次数或本轮迭代中各个顶点有效值相对于上轮的有效值的变化值均小于预设值,中的至少之一种。
- 根据权利要求1-6任一所述的基于差分隐私的地理分布式图计算方法,其特征在于,所述预设处理模型为Pregel模型。
- 一种基于差分隐私的地理分布式图计算系统,其特征在于,包括:每轮迭代预算分配模块,用于基于差分隐私利用预设处理模型对地理分布图进行图计算,按照指数分配机制对地理分布图中每轮迭代分配预算;噪声添加模块,用于在数据中心中增加聚合器来收集需要发送向相邻数据中心的数据,并将其全部加起来加上本轮迭代对应的噪音,再平均划分后发送给相邻的数据中心,所述噪音通过该轮迭代分配的预算进行拉普拉斯机制转换得到;迭代模块,用于各数据中心接收上一轮迭代后其他数据中心发送的数据,并更新顶点自身的有效值,并重复所述在数据中心中增加聚合器来收集需要发送向相邻数据中心的数据,并将其全部加起来加上本轮迭代对应的噪音,再平均划分后发送给相邻的数据中心的步骤,直至达到预设收敛条件,迭代结束;各个数据中心按照达到预设收敛条件的处理模型,进行地理分布式图之间的数据传输。
- 根据权利要求8所述的基于差分隐私的地理分布式图计算系统,其特征在于,还包括:重采样模块,用于在某轮迭代中丢弃所有顶点,按照预设重新采样公式得到的概率对所有顶点进行重取样之后,取样成功的顶点将会分配给其应归属的聚合器。
- 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质存储有计算机指令,所述计算机指令用于使所述计算机执行如权利要求1-7任一项所述的基于差分隐私的地理分布式图计算方法。
- 一种计算机设备,其特征在于,包括:存储器和处理器,所述存储器和所述处理器之间互相通信连接,所述存储器存储有计算机指令,所述处理器通过执行所述计算机指令,从而执行如权利要求1-7任一项所述的基于差分隐私的地理分布式图计算方法。
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010518901.5A CN111914285B (zh) | 2020-06-09 | 2020-06-09 | 一种基于差分隐私的地理分布式图计算方法及系统 |
CN202010518901.5 | 2020-06-09 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021248937A1 true WO2021248937A1 (zh) | 2021-12-16 |
Family
ID=73237698
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/077138 WO2021248937A1 (zh) | 2020-06-09 | 2021-02-22 | 一种基于差分隐私的地理分布式图计算方法及系统 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111914285B (zh) |
WO (1) | WO2021248937A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117892357A (zh) * | 2024-03-15 | 2024-04-16 | 大连优冠网络科技有限责任公司 | 基于差分隐私防护的能源大数据共享分发风险控制方法 |
CN117910046A (zh) * | 2024-03-18 | 2024-04-19 | 青岛他坦科技服务有限公司 | 基于差分隐私保护的电力大数据发布方法 |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111914285B (zh) * | 2020-06-09 | 2022-06-17 | 深圳大学 | 一种基于差分隐私的地理分布式图计算方法及系统 |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8335405B2 (en) * | 2008-11-07 | 2012-12-18 | The United States Of America, As Represented By The Secretary Of The Navy | Method and apparatus for measuring fiber twist by polarization tracking |
CN106778314A (zh) * | 2017-03-01 | 2017-05-31 | 全球能源互联网研究院 | 一种基于k‑means的分布式差分隐私保护方法 |
CN108280366A (zh) * | 2018-01-17 | 2018-07-13 | 上海理工大学 | 一种基于差分隐私的批量线性查询方法 |
US10223547B2 (en) * | 2016-10-11 | 2019-03-05 | Palo Alto Research Center Incorporated | Method for differentially private aggregation in a star topology under a realistic adversarial model |
CN110334757A (zh) * | 2019-06-27 | 2019-10-15 | 南京邮电大学 | 面向大数据分析的隐私保护聚类方法及计算机存储介质 |
US20190347278A1 (en) * | 2018-05-09 | 2019-11-14 | Sogang University Research Foundation | K-means clustering based data mining system and method using the same |
CN111914285A (zh) * | 2020-06-09 | 2020-11-10 | 深圳大学 | 一种基于差分隐私的地理分布式图计算方法及系统 |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109495476B (zh) * | 2018-11-19 | 2020-11-20 | 中南大学 | 一种基于边缘计算的数据流差分隐私保护方法及系统 |
CN110347511B (zh) * | 2019-07-10 | 2021-08-06 | 深圳大学 | 含隐私约束条件的地理分布式进程映射方法、装置及终端 |
-
2020
- 2020-06-09 CN CN202010518901.5A patent/CN111914285B/zh active Active
-
2021
- 2021-02-22 WO PCT/CN2021/077138 patent/WO2021248937A1/zh active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8335405B2 (en) * | 2008-11-07 | 2012-12-18 | The United States Of America, As Represented By The Secretary Of The Navy | Method and apparatus for measuring fiber twist by polarization tracking |
US10223547B2 (en) * | 2016-10-11 | 2019-03-05 | Palo Alto Research Center Incorporated | Method for differentially private aggregation in a star topology under a realistic adversarial model |
CN106778314A (zh) * | 2017-03-01 | 2017-05-31 | 全球能源互联网研究院 | 一种基于k‑means的分布式差分隐私保护方法 |
CN108280366A (zh) * | 2018-01-17 | 2018-07-13 | 上海理工大学 | 一种基于差分隐私的批量线性查询方法 |
US20190347278A1 (en) * | 2018-05-09 | 2019-11-14 | Sogang University Research Foundation | K-means clustering based data mining system and method using the same |
CN110334757A (zh) * | 2019-06-27 | 2019-10-15 | 南京邮电大学 | 面向大数据分析的隐私保护聚类方法及计算机存储介质 |
CN111914285A (zh) * | 2020-06-09 | 2020-11-10 | 深圳大学 | 一种基于差分隐私的地理分布式图计算方法及系统 |
Non-Patent Citations (1)
Title |
---|
MA YIN-FANG , ZHANG LIN: "KDCK-medoids Dynamic Clustering Algorithm Based on Differential Privacy", COMPUTER SCIENCE, vol. 43, no. 11A, 15 November 2016 (2016-11-15), pages 368 - 372, XP055878834 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117892357A (zh) * | 2024-03-15 | 2024-04-16 | 大连优冠网络科技有限责任公司 | 基于差分隐私防护的能源大数据共享分发风险控制方法 |
CN117892357B (zh) * | 2024-03-15 | 2024-05-31 | 国网河南省电力公司经济技术研究院 | 基于差分隐私防护的能源大数据共享分发风险控制方法 |
CN117910046A (zh) * | 2024-03-18 | 2024-04-19 | 青岛他坦科技服务有限公司 | 基于差分隐私保护的电力大数据发布方法 |
CN117910046B (zh) * | 2024-03-18 | 2024-06-07 | 国网河南省电力公司经济技术研究院 | 基于差分隐私保护的电力大数据发布方法 |
Also Published As
Publication number | Publication date |
---|---|
CN111914285B (zh) | 2022-06-17 |
CN111914285A (zh) | 2020-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021248937A1 (zh) | 一种基于差分隐私的地理分布式图计算方法及系统 | |
US10289451B2 (en) | Method, apparatus, and system for adjusting deployment location of virtual machine | |
US10078594B2 (en) | Cache management for map-reduce applications | |
EP2849099B1 (en) | A computer-implemented method for designing an industrial product modeled with a binary tree. | |
WO2015196911A1 (zh) | 数据挖掘方法和节点 | |
WO2019085709A1 (zh) | 一种应用于卷积神经网络的池化处理的方法及系统 | |
US10249070B2 (en) | Dynamic interaction graphs with probabilistic edge decay | |
WO2018133573A1 (zh) | 业务生存性分析方法及装置 | |
CN103345508A (zh) | 一种适用于社会网络图的数据存储方法及系统 | |
US9928317B2 (en) | Additive design of heat sinks | |
WO2021238305A1 (zh) | 一种基于强化学习的通用分布式图处理方法及系统 | |
US20190220357A1 (en) | Storage system management method, electronic device, storage system and computer program product | |
CN105302536A (zh) | MapReduce应用的相关参数的配置方法和装置 | |
WO2023109142A1 (zh) | 一种能源物联网物理与信息系统的关键路径确定方法及装置 | |
CN108509532B (zh) | 一种应用于地图的聚点方法和装置 | |
CN117014318B (zh) | 多尺度网络节点间链路的添加方法、装置、设备及介质 | |
CN110264467B (zh) | 基于顶点切割的动态幂律图实时重划分方法 | |
CN108768735B (zh) | 一种测试床拓扑结构的二分图采样方法及装置 | |
WO2023185144A1 (zh) | 基于geohash的空间数据处理方法、装置及电子设备 | |
CN113360736B (zh) | 互联网数据的抓取方法和装置 | |
CN116137058A (zh) | 四面体网格优化方法、装置、设备及存储介质 | |
CN114741029A (zh) | 应用于去重存储系统的数据分配方法及相关设备 | |
CN113778645A (zh) | 基于边缘计算的任务调度方法、装置、设备及存储介质 | |
JP6961950B2 (ja) | 格納方法、格納装置および格納プログラム | |
CN115759233B (zh) | 模型的训练方法、图数据处理方法、装置及电子设备 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21822565 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15.03.2023) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21822565 Country of ref document: EP Kind code of ref document: A1 |