CN102156755A

CN102156755A - K-cryptonym improving method

Info

Publication number: CN102156755A
Application number: CN2011101173038A
Authority: CN
Inventors: 张静
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2011-05-06
Filing date: 2011-05-06
Publication date: 2011-08-17

Abstract

The invention discloses a K-anonymous improved method, which relates to the field of data mining, selects a quasi-identifier according to an original data set, determines a generalization mode, and establishes an initial generalization grid corresponding to the generalization mode; judges whether the initial generalization grid is is empty, if not, select the global optimal node from all the nodes in the initial generalized lattice according to the optimal node selection method, and obtain the first generalized lattice; perform anonymization processing on the data to be published according to the global optimal node, and obtain the anonymous The number of clusters; judge whether the number of anonymous clusters is less than the preset number, if yes, calculate the optimal node selection method for the first generalized lattice, and obtain the optimal node; if not, the anonymous cluster is an isolated cluster, and the first generalized lattice The grid performs secondary K-anonymous calculations to obtain the optimal node; generalizes the data to be released according to the generalization method corresponding to the optimal node, obtains the generalized data, and publishes the generalized data. The invention shortens the execution time and improves the accuracy of information.

Description

An improved K-anonymity method

技术领域technical field

本发明涉及数据挖掘领域中的K-匿名(K-Anonymity)，特别涉及一种K-匿名改进方法。The invention relates to K-anonymity in the field of data mining, in particular to an improved K-anonymity method.

背景技术Background technique

数据匿名化采用的常用处理手段源于统计数据库中的数据处理方法，主要是通过以发布数据中的属性值的信息损失为代价，换取通过这些属性值再标识某些个体的准确性，同时尽可能保证发布数据的可用性，在发布数据的准确性和隐私保护之间达到一种平衡。传统的隐私保护方法，为了保证发布数据的整体趋势，往往以牺牲单个数据记录的准确性为代价。The commonly used processing methods for data anonymization are derived from the data processing methods in statistical databases, mainly through the loss of information on the attribute values in the published data in exchange for the accuracy of re-identifying some individuals through these attribute values, and at the same time It is possible to guarantee the availability of published data, and to achieve a balance between the accuracy of published data and privacy protection. Traditional privacy protection methods often sacrifice the accuracy of individual data records in order to ensure the overall trend of published data.

K-Anonymity(K-匿名算法)：K-匿名(K-Anonymity)是不同于传统的访问控制等基于目标的隐私保护技术，是一个典型的微数据发布模型(微数据定义为一条表达和描述个体信息的数据记录，为个体信息的载体。这些信息包括个体的标识信息(如姓名、身份证号等)、敏感信息(如病史等)、以及一些非敏感信息(如性别)。每个信息都是以个体属性和相应的属性值匹配的方式作为微数据(记录)的某个分量。)。它要求首先对原始数据进行预处理以满足匿名要求，然后将已处理的数据予以发布。它并不要求限制对已发布数据的访问，相反尽可能的保持数据的可统计性。因而数据泛化(泛化是对于数据的一个属性，用概括值代替原来的值，使其意义更为广阔)是常用的数据预处理手段。K匿名就是要求在一个集合中(这里是指广义集合，即允许包含相同元素，类似于包(Bag)或簇(Cluster)的概念)中只能以不大于1/k(k是一个常数)的概率确定任何一个元素，即要求任何一个元素，集合中至少存在k-1个相同的副本元素。用形式化语言表述K匿名的概念，一般是将发布数据表中的个体记录的属性分为标识符、准标识符和敏感属性三类，并提出了等价类的概念。以下是相关定义：K-Anonymity (K-anonymity algorithm): K-anonymity (K-Anonymity) is different from traditional access control and other goal-based privacy protection technologies. It is a typical microdata release model (microdata is defined as an expression and description The data record of individual information is the carrier of individual information. These information include individual identification information (such as name, ID number, etc.), sensitive information (such as medical history, etc.), and some non-sensitive information (such as gender). Each information They are all used as a certain component of microdata (record) in the way of matching individual attributes and corresponding attribute values.). It requires that raw data be pre-processed to meet anonymity requirements before the processed data is published. It does not require restricting access to published data, but instead keeps data as statistically available as possible. Therefore, data generalization (generalization is an attribute of data, replacing the original value with a summary value to make it more meaningful) is a commonly used data preprocessing method. K anonymity means that in a set (here refers to a generalized set, which allows to contain the same element, similar to the concept of bag (Bag) or cluster (Cluster)) can only be no more than 1/k (k is a constant) The probability of any element is determined, that is, any element is required to have at least k-1 identical copy elements in the set. To express the concept of K-anonymity in a formal language, the attributes of individual records in the published data table are generally divided into three categories: identifier, quasi-identifier and sensitive attribute, and the concept of equivalence class is proposed. Here are the relevant definitions:

标识符(Identifiers)：标识符属性是指能够直接标识出个体身份的属性，如姓名、身份证号码、社会保险号码等属性，通过这些属性值能够直接确定具体的个体。Identifiers: Identifier attributes refer to attributes that can directly identify the identity of an individual, such as name, ID number, social security number and other attributes, through which the specific individual can be directly determined.

准标识符(Quasi-Indentifiers，QI)：给定实体集合U、实体表T(A₁，A₂，LA_n)，f_c：U→T以及f_g：T→U′，其中，

实体表T的准标识符QI为属性组其中，

且满足f_g(f_c(p_i[QI]))＝p_i。换言之，同时存在于发布数据表和外部数据源表中，利用此两种数据表进行连接的推演来表示个人隐私信息的一组属性称为准标识符属性。准标识符属性也叫做类标识符属性。不同的发布数据表可以根据不同的情况划分不同的准标识符属性，一般情况下准标识符由专家选择，而非用户随便选取。一般情况下可以以年龄、教育程度、性别作为准标识符。Quasi-Indentifiers (QI): given entity set U, entity table T(A ₁ , A ₂ , LA _n ), f _c : U→T and f _g : T→U′, where,

The quasi-identifier QI of the entity table T is an attribute group in,

And it satisfies f _g (f _c (p _i [QI]))=p _i . In other words, a group of attributes that exist in both the published data table and the external data source table, and are deduced by using these two data tables to represent personal private information are called quasi-identifier attributes. Quasi-identifier properties are also called class-identifier properties. Different published data tables can be divided into different quasi-identifier attributes according to different situations. Generally, quasi-identifiers are selected by experts, not randomly selected by users. Generally, age, education level, and gender can be used as quasi-identifiers.

敏感属性(Sensitive-Attributes，SA)，个人隐私属性。发布数据中，个体不希望其他用户知道的信息属性。比如说个人的工资水平以及患者的就诊记录中的所患疾病。发布数据时，为了防止个人敏感信息的泄露，标识符必须被删除，发布的数据记录只保留准标识符属性和敏感属性，称为匿名化处理。Sensitive attributes (Sensitive-Attributes, SA), personal privacy attributes. In the published data, the individual does not want other users to know the information attributes. For example, the salary level of the individual and the disease in the patient's medical record. When publishing data, in order to prevent the leakage of personal sensitive information, the identifier must be deleted, and the published data records only retain quasi-identifier attributes and sensitive attributes, which is called anonymization.

等价组：在准标识符上的投影完全相同的记录组成的等价组，即：等价组中所有的记录在准标识符上的属性值完全相同，其他的属性值可以不同。Equivalence group: An equivalence group composed of records whose projections on the quasi-identifier are exactly the same, that is, all records in the equivalence group have exactly the same attribute values on the quasi-identifier, and other attribute values can be different.

K-匿名：给定数据表T(A₁，A₂…A_n)，QI是与T相关联的准标识符，当且仅当在T[QI]中出现的每个值序列至少在T[QI]中出现K次，则T满足K-匿名。T[QI]表示T表中的元组在QI上的投影。K-Anonymity: Given a data table T(A ₁ , A ₂ ...A _n ), QI is a quasi-identifier associated with T if and only if every sequence of values appearing in T[QI] is at least in T [QI] appears K times, then T satisfies K-anonymity. T[QI] represents the projection of the tuples in the T table on QI.

现实生活中，将医疗、投票和求职等信息公开的同时又要保证隐藏相关患者、投票人和求职人等的个体标识信息并确保这些公布的数据不能用来推导出这些标识信息，K-匿名就是非常好的可选模型。当数据发布到公共数据库，数据的拥有者不再继续控制数据的使用方式和范围时，在这种情况下为了不暴露数据主体的身份移出所有涉及到个体标识的数据项信息De-Identification(去标识)就是一种常用的方法。In real life, while disclosing information such as medical treatment, voting, and job hunting, it is necessary to hide the individual identification information of relevant patients, voters, and job applicants, and ensure that the published data cannot be used to derive these identification information. K-anonymity It is a very good optional model. When the data is released to the public database, and the owner of the data no longer continues to control the use and scope of the data, in this case, in order not to expose the identity of the data subject, remove all data item information related to individual identification. De-Identification (de-identification) identification) is a commonly used method.

发明人在实现本发明的过程中发现现有技术中至少存在以下的缺点：The inventor finds that there is at least the following shortcoming in the prior art in the process of realizing the present invention:

现有技术中的K-匿名方法在判断和比较的时候，都需要比较泛化格中的所有节点，当泛化格的规模比较大时，执行时间将会很长，这对于数据处理是很不利的；这种方法多半是全局最优的，由于数据分布的不均匀性，即存在着孤立簇(即数量很小的集合)，为了达到匿名要求，不得不采用更高的泛化层次，这显然会降低信息的准确性。The K-anonymity method in the prior art needs to compare all the nodes in the generalization grid when judging and comparing. When the scale of the generalization grid is relatively large, the execution time will be very long, which is very important for data processing. Unfavorable; this method is mostly globally optimal. Due to the inhomogeneity of data distribution, that is, there are isolated clusters (that is, a small number of sets), in order to meet the anonymous requirements, a higher generalization level has to be adopted. This obviously reduces the accuracy of the information.

发明内容Contents of the invention

为了缩短执行时间，提高信息的准确性，本发明提供了一种K-匿名改进方法，详见下文描述：In order to shorten the execution time and improve the accuracy of information, the present invention provides a K-anonymous improved method, which is described in detail below:

一种K-匿名改进方法，所述方法包括以下步骤：A kind of K-anonymous improved method, described method comprises the following steps:

(1)根据原始数据集选择准标识符，由所述准标识符确定泛化方式，并建立与所述泛化方式对应的初始泛化格；(1) Select a quasi-identifier according to the original data set, determine the generalization mode by the quasi-identifier, and establish an initial generalization grid corresponding to the generalization mode;

(2)判断所述初始泛化格是否为空，如果是，流程结束；如果否，执行步骤(3)；(2) Judging whether the initial generalization lattice is empty, if yes, the process ends; if not, perform step (3);

(3)根据最优节点选择方式从所述初始泛化格的所有节点中选择出全局最优节点，获取第一泛化格；(3) selecting the global optimal node from all the nodes of the initial generalization lattice according to the optimum node selection mode, and obtaining the first generalization lattice;

(4)根据所述全局最优节点对待发布数据进行匿名化处理，获取和所述全局最优节点相应的匿名簇的数量；(4) Anonymize the data to be released according to the global optimal node, and obtain the number of anonymous clusters corresponding to the global optimal node;

(5)判断所述匿名簇的数量是否小于预设数量，如果是，执行步骤(6)；如果否，执行步骤(7)；(5) judging whether the quantity of the anonymous cluster is less than a preset quantity, if yes, perform step (6); if not, perform step (7);

(6)对所述第一泛化格进行所述最优节点选择方式计算，获取最优节点；(6) Performing the calculation of the optimal node selection method on the first generalized lattice to obtain the optimal node;

(7)匿名簇为非孤立簇，对所述第一泛化格进行二次K-匿名计算，获取所述最优节点；(7) The anonymous cluster is a non-isolated cluster, and the second K-anonymous calculation is performed on the first generalized lattice to obtain the optimal node;

(8)将所述待发布数据按照所述最优节点对应的泛化方式进行泛化，获取泛化后的数据，将所述泛化后的数据发布，流程结束。(8) Generalize the data to be released according to the generalization method corresponding to the optimal node, obtain generalized data, publish the generalized data, and the process ends.

步骤(3)中的所述根据最优节点选择方式从所述初始泛化格的所有节点中选择出全局最优节点，获取第一泛化格，具体为：According to the optimal node selection method in step (3), the global optimal node is selected from all the nodes of the initial generalized lattice, and the first generalized lattice is obtained, specifically:

①计算所述初始泛化格中所有节点的度；① Calculate the degrees of all nodes in the initial generalization lattice;

②对所述初始泛化格中所有节点按照度进行排序，获取度最大节点；②Sorting all the nodes in the initial generalization lattice according to degree, and obtaining the node with the largest degree;

③判断所述度最大节点是否满足K-匿名，如果是，执行步骤④；如果否，执行步骤⑤；③ Judging whether the node with the largest degree satisfies K-anonymity, if yes, execute step ④; if not, execute step ⑤;

④所述度最大节点的所有父节点都为K-匿名节点，删除所述度最大节点的所有祖先节点，查找所述原始数据集中保存的K-min集合，判断所述K-min集合中是否有所述度最大节点的祖先，如果是，从所述K-min集合中删除所述度最大节点的祖先；如果否，执行步骤⑥；4. All parent nodes of the node with the largest degree are K-anonymous nodes, delete all ancestor nodes of the node with the largest degree, search the K-min set preserved in the original data set, and judge whether the K-min set is There is the ancestor of the node with the maximum degree, if yes, delete the ancestor of the node with the maximum degree from the K-min set; if not, perform step ⑥;

⑤所述度最大节点的所有子孙节点都不是K-匿名节点，删除所述度最大节点以及所述度最大节点的所有子孙节点；5. All descendant nodes of the node with the maximum degree are not K-anonymous nodes, delete the node with the maximum degree and all descendant nodes of the node with the maximum degree;

⑥计算所述K-min集合中所有节点的信息损失量，获取最小信息损失量，将所述最小信息损失量对应的节点作为全局最优节点，获取所述第一泛化格；6. Calculate the information loss amount of all nodes in the K-min set, obtain the minimum information loss amount, use the node corresponding to the minimum information loss amount as the global optimal node, and obtain the first generalization lattice;

其中，计算所述K-min集合中所有节点的信息损失量具体为：Wherein, calculating the amount of information loss of all nodes in the K-min set is specifically:

$InfoLoss InfoLoss = = \frac{{Σ Σ}_{i i = = 11}^{N N} \frac{{h h}_{i i}}{{DGH DGH}_{i i}}}{N N}$

其中，N表示元组集中的准标识符个数、DGH_i表示N个准标识符中第i个准标识符的泛化等级、h_i表示准标识符i的泛化程度。Among them, N represents the number of quasi-identifiers in the tuple set, DGH _i represents the generalization level of the i-th quasi-identifier among N quasi-identifiers, and h _i represents the generalization degree of quasi-identifier i.

本发明提供的技术方案的有益效果是：The beneficial effects of the technical solution provided by the invention are:

本发明提供了一种K-匿名改进方法，本发明提供的方法缩短了执行时间，提高了信息的准确性，满足了实际应用中的需要。The invention provides an improved K-anonymity method. The method provided by the invention shortens the execution time, improves the accuracy of information, and meets the requirements in practical applications.

附图说明Description of drawings

图1为本发明提供的年龄Age的泛化方式；Fig. 1 is the generalization mode of age Age provided by the present invention;

图2为本发明提供的性别Sex的泛化方式；Fig. 2 is the generalization mode of gender Sex provided by the present invention;

图3为本发明提供的年龄Age和性别Sex的泛化格；Fig. 3 is the generalized grid of age Age and sex Sex provided by the present invention;

图4为本发明提供的一种K-匿名改进方法的流程图。Fig. 4 is a flowchart of a K-anonymity improvement method provided by the present invention.

具体实施方式Detailed ways

为使本发明的目的、技术方案和优点更加清楚，下面将结合附图对本发明实施方式作进一步地详细描述。In order to make the object, technical solution and advantages of the present invention clearer, the implementation manner of the present invention will be further described in detail below in conjunction with the accompanying drawings.

为了缩短执行时间，提高信息的准确性，本发明实施例提供了一种K-匿名改进方法，本发明实施例基于K匿名算法，用于对隐私数据进行去标识，本发明实施例采用的预处理主要在得到最优节点后，对泛化格进行进一步的优化，详见下文描述：In order to shorten the execution time and improve the accuracy of information, the embodiment of the present invention provides an improved K-anonymity method. The embodiment of the present invention is based on the K-anonymity algorithm for de-identifying private data. The processing is mainly to further optimize the generalization lattice after obtaining the optimal node, as described below for details:

101：根据原始数据集选择准标识符，由准标识符确定泛化方式，并建立与泛化方式对应的初始泛化格；101: Select a quasi-identifier according to the original data set, determine the generalization method by the quasi-identifier, and establish an initial generalization lattice corresponding to the generalization method;

其中，遍历初始泛化格后保存有K-min集合。参见图1、图2和图3，例如：从原始数据集中选择准标识符年龄Age和性别Sex，由准标识符年龄Age和性别Sex确定泛化方式，泛化方式由以下泛化向量组成：L(a₁...a_i...a_k)，其中a_i表示节点每个属性的泛化等级(泛化高度)，k为表中的属性个数；泛化等级：

表示该节点的在初始泛化格中的泛化等级(泛化高度)。两个或多个准标识符进行不同等级的泛化得到的结果构成准标识符泛化序列，这些准标识符泛化序列构成基于准标识符的泛化等级序列，称为泛化格。例如：年龄Age和性别Sex可以构成图3中的初始泛化格。根据各个准标识符相应的泛化方式可以建立初始泛化格，令T_i(A₁，…，A_k)和T_j(A₁，…，A_k)是两个不同表(即两者为初始泛化格Lattice中不同的节点，(A₁，…，A_n)为数据的k个准标识符，A_k为第i个准标识符的泛化等级或泛化高度)。泛化格中的每一个节点，表示一个准标识符一次泛化后的表，即为各准标识符泛化到此时对应的发布数据。Among them, the K-min set is saved after traversing the initial generalization lattice. See Figure 1, Figure 2 and Figure 3, for example: select the quasi-identifier Age and gender Sex from the original data set, and determine the generalization method by the quasi-identifier Age and sex Sex, the generalization method consists of the following generalization vectors: L(a ₁ ...a _i ...a _k ), where a _i represents the generalization level (generalization height) of each attribute of the node, and k is the number of attributes in the table; the generalization level:

Indicates the generalization level (generalization height) of the node in the initial generalization lattice. The results of different levels of generalization of two or more quasi-identifiers constitute a quasi-identifier generalization sequence, and these quasi-identifier generalization sequences constitute a quasi-identifier-based generalization level sequence, called a generalization lattice. For example: Age and Sex can constitute the initial generalization lattice in Figure 3. The initial generalization lattice can be established according to the corresponding generalization methods of each quasi-identifier, let T _i (A ₁ ,...,A _k ) and T _j (A ₁ ,...,A _k ) be two different tables (namely, both are different nodes in the initial generalization lattice Lattice, (A ₁ ,...,A _n ) are k quasi-identifiers of the data, A _k is the generalization level or generalization height of the i-th quasi-identifier). Each node in the generalization grid represents a table after one generalization of a quasi-identifier, that is, the corresponding published data of each quasi-identifier after generalization.

其中，可以根据实际应用中的需要确定准标识符和泛化方式的数量，具体实现时，本发明实施例对此不做限制。Wherein, the number of quasi-identifiers and generalization methods can be determined according to requirements in practical applications, and the embodiment of the present invention does not limit this during specific implementation.

102：判断初始泛化格是否为空，如果是，流程结束；如果否，执行步骤103；102: Determine whether the initial generalization grid is empty, if yes, the process ends; if not, execute step 103;

103：根据最优节点选择方式从初始泛化格的所有节点中选择出全局最优节点，获取第一泛化格；103: Select the global optimal node from all the nodes of the initial generalized lattice according to the optimal node selection method, and obtain the first generalized lattice;

其中，最优节点选择方式的步骤具体包括：Among them, the steps of the optimal node selection method specifically include:

(1)计算初始泛化格中所有节点的度；(1) Calculate the degree of all nodes in the initial generalized lattice;

其中，初始泛化格中所有节点包括父节点和子节点，父节点为与该节点相连的上一层节点；子节点为与该节点相连的下一层节点；计算每个节点的度具体为：该节点的所有父节点数与所有直接子节点数的乘积。Among them, all nodes in the initial generalization grid include parent nodes and child nodes, and the parent node is the upper layer node connected to the node; the child node is the lower layer node connected to the node; the calculation of the degree of each node is specifically: The product of the number of all parents of this node and the number of all direct children.

(2)对初始泛化格中所有节点按照度进行排序，获取度最大节点；(2) Sort all the nodes in the initial generalized lattice according to degree, and obtain the node with the largest degree;

其中，排序可以采用任何一种排序方式，例如：由高到低排序或由低到高排序，具体实现时，本发明实施例对此不做限制。The sorting may adopt any sorting manner, for example, sorting from high to low or sorting from low to high, which is not limited in this embodiment of the present invention during specific implementation.

(3)判断度最大节点是否满足K-匿名，如果是，执行步骤(4)；如果否，执行步骤(5)；(3) Determine whether the node with the largest degree satisfies K-anonymity, if yes, execute step (4); if not, execute step (5);

(4)度最大节点的所有父节点都为K-匿名节点，删除度最大节点的所有祖先节点，查找原始数据集中保存的K-min集合，判断K-min集合中是否有度最大节点的祖先，如果是，从K-min集合中删除度最大节点的祖先；如果否，执行步骤(6)；(4) All parent nodes of the node with the largest degree are K-anonymous nodes, delete all ancestor nodes of the node with the largest degree, search the K-min set saved in the original data set, and judge whether there is an ancestor of the node with the largest degree in the K-min set , if yes, delete the ancestor of the node with the largest degree from the K-min set; if no, perform step (6);

其中，祖先节点具体为：度最大节点的父节点以及父节点的上n层节点，n的取值为大于等于1的正整数。Wherein, the ancestor node is specifically: the parent node of the node with the largest degree and the upper n layer nodes of the parent node, and the value of n is a positive integer greater than or equal to 1.

(5)度最大节点的所有子孙节点都不是K-匿名节点，删除度最大节点以及度最大节点的所有子孙节点；(5) All descendant nodes of the node with the maximum degree are not K-anonymous nodes, delete the node with the maximum degree and all descendant nodes of the node with the maximum degree;

(6)计算K-min集合中所有节点的信息损失量，获取最小信息损失量，将最小信息损失量对应的节点作为全局最优节点，获取第一泛化格。(6) Calculate the information loss of all nodes in the K-min set, obtain the minimum information loss, and use the node corresponding to the minimum information loss as the global optimal node to obtain the first generalization lattice.

数据经过泛化后就会出现一定程度的失真，泛化程度越高，那么数据失真度就越大。发布数据常常是用来分析或者研究某些问题，因此在发布数据的时候不仅要保护隐私的泄露，还要尽量保证发布的数据有较小的损失，否则即使达到了隐私保护的目的，发布的数据也失去了价值，隐私保护也没有任何意义。After the data is generalized, there will be a certain degree of distortion. The higher the degree of generalization, the greater the degree of data distortion. Publishing data is often used to analyze or research certain issues. Therefore, when publishing data, it is necessary not only to protect privacy leakage, but also to ensure that the published data has a small loss. Otherwise, even if the purpose of privacy protection is achieved, the published Data also loses value, and privacy protection doesn't make any sense.

其中，计算K-min集合中所有节点的信息损失量具体为：Among them, the calculation of the information loss of all nodes in the K-min set is specifically:

其中，N表示元组集中的准标识符个数、DGH_i表示N个准标识符中第i个准标识符的泛化等级、h_i表示准标识符i的泛化程度。由上述公式可知泛化程度越高、信息损失量越大；泛化程度越低，信息损失量越小。Among them, N represents the number of quasi-identifiers in the tuple set, DGH _i represents the generalization level of the i-th quasi-identifier among N quasi-identifiers, and h _i represents the generalization degree of quasi-identifier i. It can be seen from the above formula that the higher the degree of generalization, the greater the amount of information loss; the lower the degree of generalization, the smaller the amount of information loss.

104：根据全局最优节点对待发布数据进行匿名化处理，获取和全局最优节点相应的匿名簇的数量；104: Anonymize the data to be released according to the global optimal node, and obtain the number of anonymous clusters corresponding to the global optimal node;

其中，待发布数据由实际应用中的需要进行设定，具体实现时，本发明实施例对此不做限制。Wherein, the data to be released is set according to the needs in the actual application, and the embodiment of the present invention does not limit it during specific implementation.

105：判断匿名簇的数量是否小于预设的数量，如果是，执行步骤106；如果否，执行步骤107；105: Determine whether the number of anonymous clusters is less than the preset number, if yes, execute step 106; if not, execute step 107;

106：对第一泛化格进行最优节点选择方式计算，获取最优节点；106: Calculate the optimal node selection method for the first generalization lattice, and obtain the optimal node;

107：匿名簇为非孤立簇，对第一泛化格进行二次K-匿名计算，获取最优节点；107: The anonymous cluster is a non-isolated cluster, and the second K-anonymous calculation is performed on the first generalized lattice to obtain the optimal node;

其中，对第一泛化格进行二次K-匿名计算具体为：取出最小K匿名节点集合中的第i个节点；计算第一泛化格节点的簇信息，包括代表每个簇的节点信息和对应的容量；取出第j簇的信息；添加簇的信息；建立新的lattice；在数据集firstdata上计算节点的信息损失量；计算二次K-匿名的抑制率；在数据集seconddata上进行抑制率为secondratio的K-匿名计算(度优先算法)；将最小K-匿名节点集合中信息损失量最小的节点找出来，获取最优节点。Among them, the second K-anonymity calculation for the first generalization lattice is specifically: taking out the i-th node in the minimum K anonymous node set; calculating the cluster information of the nodes of the first generalization lattice, including the node information representing each cluster And the corresponding capacity; take out the information of the jth cluster; add the information of the cluster; create a new lattice; calculate the information loss of the node on the data set firstdata; calculate the suppression rate of the secondary K-anonymity; perform on the data set seconddata K-anonymous calculation of suppression rate secondratio (degree-first algorithm); find out the node with the smallest amount of information loss in the smallest K-anonymous node set to obtain the optimal node.

其中，二次K-匿名计算的代码详见下文描述：Among them, the code for the secondary K-anonymous calculation is described below:

输入：大小簇分界值thresholdInput: size cluster cutoff value threshold

输出：最小信息损失量smallinfoOutput: minimum information loss smallinfo

变量：firstdata用于存储孤立簇数据；seconddata用于存储进行二次匿名的数据；kmin用于存储最小K匿名节点的集合Variables: firstdata is used to store isolated cluster data; seconddata is used to store secondary anonymous data; kmin is used to store the minimum set of K anonymous nodes

getnode(i)：取出最小K匿名节点集合中的第i个节点getnode(i): Take out the i-th node in the minimum K anonymous node set

getInfoOf(latticenode)：计算latticenode节点的簇信息，包括代表每个簇的节点信息和对应的容量getInfoOf(latticenode): Calculate the cluster information of the latticenode node, including the node information representing each cluster and the corresponding capacity

getcluster(j)：取出第j簇的信息getcluster(j): Take out the information of the jth cluster

add(ClusterInfo.get(j))：添加簇的信息add(ClusterInfo.get(j)): Add cluster information

Lattice(Bnode，latticenode)：建立新的latticeLattice (Bnode, latticenode): create a new lattice

infoloss(lattticenode，firstdata)：infoloss(latticenode, firstdata):

在数据集firstdata上计算节点latticenode的信息损失量Calculate the information loss of the node latticenode on the dataset firstdata

SecondRatio()：计算二次K-匿名的抑制率SecondRatio(): Calculate the suppression rate of the secondary K-anonymity

Kmin(lattice，secondratio，seconddata)：在数据集seconddata上进行抑制率为secondratio的K-匿名计算(度优先算法)Kmin(lattice, secondratio, seconddata): K-anonymous calculation of the suppression rate secondratio on the data set seconddata (degree priority algorithm)

GetsmallLatticenode(latticenode.kmin)：将最小K-匿名节点集合中信息损失量最小的节点找出来GetsmallLatticenode(latticenode.kmin): Find the node with the smallest amount of information loss in the smallest K-anonymous node set

108：将待发布数据按照最优节点对应的泛化方式进行泛化，获取泛化后的数据，将泛化后的数据发布，流程结束。108: Generalize the data to be released according to the generalization method corresponding to the optimal node, obtain the generalized data, publish the generalized data, and the process ends.

综上所述，本发明实施例提供了一种K-匿名改进方法，本发明实施例提供的方法缩短了执行时间，提高了信息的准确性，满足了实际应用中的需要。To sum up, the embodiment of the present invention provides an improved K-anonymity method. The method provided by the embodiment of the present invention shortens the execution time, improves the accuracy of information, and meets the needs of practical applications.

本领域技术人员可以理解附图只是一个优选实施例的示意图，上述本发明实施例序号仅仅为了描述，不代表实施例的优劣。Those skilled in the art can understand that the accompanying drawing is only a schematic diagram of a preferred embodiment, and the serial numbers of the above-mentioned embodiments of the present invention are for description only, and do not represent the advantages and disadvantages of the embodiments.

以上所述仅为本发明的较佳实施例，并不用以限制本发明，凡在本发明的精神和原则之内，所作的任何修改、等同替换、改进等，均应包含在本发明的保护范围之内。The above descriptions are only preferred embodiments of the present invention, and are not intended to limit the present invention. Any modifications, equivalent replacements, improvements, etc. made within the spirit and principles of the present invention shall be included in the protection of the present invention. within range.

Claims

1. A kind of K-anonymous improved method is characterized in that, described method comprises the following steps:

(1) Select a quasi-identifier according to the original data set, determine the generalization mode by the quasi-identifier, and establish an initial generalization grid corresponding to the generalization mode;

(2) Judging whether the initial generalization lattice is empty, if yes, the process ends; if not, perform step (3);

(3) selecting the global optimal node from all the nodes of the initial generalization lattice according to the optimum node selection mode, and obtaining the first generalization lattice;

(4) Anonymize the data to be released according to the global optimal node, and obtain the number of anonymous clusters corresponding to the global optimal node;

(5) judging whether the quantity of the anonymous cluster is less than a preset quantity, if yes, perform step (6); if not, perform step (7);

(6) Performing the calculation of the optimal node selection method on the first generalized lattice to obtain the optimal node;

(7) The anonymous cluster is a non-isolated cluster, and the second K-anonymous calculation is performed on the first generalized lattice to obtain the optimal node;

(8) Generalize the data to be released according to the generalization method corresponding to the optimal node, obtain generalized data, publish the generalized data, and the process ends.

2. A kind of K-anonymity improved method according to claim 1, is characterized in that, described in step (3) selects global optimal node from all nodes of described initial generalization lattice according to optimal node selection mode. The superior node obtains the first generalization lattice, specifically:

① calculating the degrees of all nodes in the initial generalization lattice;

②Sorting all the nodes in the initial generalization lattice according to degree, and obtaining the node with the largest degree;

③ Judging whether the node with the largest degree satisfies K-anonymity, if yes, execute step ④; if not, execute step ⑤;

4. All parent nodes of the node with the largest degree are K-anonymous nodes, delete all ancestor nodes of the node with the largest degree, search the K-min set preserved in the original data set, and judge whether the K-min set is There is the ancestor of the node with the maximum degree, if yes, delete the ancestor of the node with the maximum degree from the K-min set; if not, perform step ⑥;

5. All descendant nodes of the node with the maximum degree are not K-anonymous nodes, delete the node with the maximum degree and all descendant nodes of the node with the maximum degree;

6. Calculate the information loss amount of all nodes in the K-min set, obtain the minimum information loss amount, use the node corresponding to the minimum information loss amount as the global optimal node, and obtain the first generalization lattice;

Wherein, calculating the amount of information loss of all nodes in the K-min set is specifically:

InfoLoss InfoLoss = = \frac{{Σ Σ}_{i i = = 11}^{N N} \frac{{h h}_{i i}}{{DGH DGH}_{i i}}}{N N}

Among them, N represents the number of quasi-identifiers in the tuple set, DGH _i represents the generalization level of the i-th quasi-identifier among N quasi-identifiers, and h _i represents the generalization degree of quasi-identifier i.