CN108776762A

CN108776762A - Data desensitization processing method and device

Info

Publication number: CN108776762A
Application number: CN201810586230.9A
Authority: CN
Inventors: 林鸿; 欧阳红; 袁葆; 江再玉; 赵加奎; 熊根鑫; 王宇坤; 于喻; 宋振世; 王奕; 郑倩
Original assignee: State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; Beijing China Power Information Technology Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Information and Telecommunication Co Ltd; Beijing China Power Information Technology Co Ltd
Priority date: 2018-06-08
Filing date: 2018-06-08
Publication date: 2018-11-09
Anticipated expiration: 2038-06-08
Also published as: CN108776762B

Abstract

This application provides a data desensitization processing method and device. The method determines the type of the target data; calls the corresponding sub-lexicon of a word segmentation reference lexicon according to that type and segments the target data with the corresponding word segmentation method; and determines a desensitization method according to the type and length of the target data, then applies that method to the sensitive data obtained by segmenting the target data. Segmenting the target data yields data with a definite structure, so the parts carrying the main sensitive information can be desensitized and all or most of the sensitive information masked. This improves the effectiveness of data desensitization, safeguards data assets, protects customer information to the greatest extent, and avoids leakage of customer information through abnormal queries and exports.

Description

A data desensitization processing method and device

Technical field

The present invention relates to the technical field of data processing, and more specifically to a data desensitization processing method and device.

Background

To implement the requirements of the national Cybersecurity Law on protecting sensitive customer information, to safeguard the data assets of power marketing customers, and to protect their legitimate rights and interests, the sensitive information of power marketing customers must be desensitized. The aim is to protect power customer information to the greatest extent while still meeting normal business needs, and to avoid leakage of that information through abnormal queries, exports and similar means.

At present, desensitization of power marketing data mainly uses mask desensitization, which retains part of the information and keeps its length unchanged. The main rules are as follows:

(1) Contact address

Format: not fixed; a string of variable length.

Desensitization rule: retention is tiered by length. For a length of 5 characters or fewer, keep the first character and the last 2 characters; for a length of 6 to 9 characters, keep the last 5 characters; for a length of 10 characters or more, hide the 4 characters preceding the last 5 characters. Hidden characters are replaced with *.

(2) Enterprise account name

Format: the enterprise account name matches the business license; it is the company name and consists of several Chinese characters.

Desensitization rule: retention is tiered by length. For a length of 4 characters or fewer, keep 1 character at each end; for a length of 5 to 6 characters, keep 2 characters at each end; for an odd length of 7 characters or more, hide the middle 3 characters; for an even length of 8 characters or more, hide the middle 4 characters. Hidden characters are replaced with *.
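The two existing tiered rules can be sketched in Python as follows. This is a minimal illustration rather than the utility's actual implementation: the function names are hypothetical, and the handling of strings too short to have anything to hide (3 characters or fewer for addresses, 2 or fewer for names) is an assumption, since the rules do not cover that case.

```python
MASK = "*"


def mask_contact_address(text: str) -> str:
    """Existing rule (1): tiered retention for contact addresses."""
    n = len(text)
    if n <= 3:
        # Assumption: first character plus last 2 already covers everything.
        return text
    if n <= 5:
        # Keep the first character and the last 2 characters.
        return text[0] + MASK * (n - 3) + text[-2:]
    if n <= 9:
        # Keep only the last 5 characters.
        return MASK * (n - 5) + text[-5:]
    # 10 or more: hide the 4 characters that precede the last 5.
    return text[:-9] + MASK * 4 + text[-5:]


def mask_enterprise_name(text: str) -> str:
    """Existing rule (2): tiered retention for enterprise account names."""
    n = len(text)
    if n <= 2:
        # Assumption: one character at each end already covers everything.
        return text
    if n <= 4:
        # Keep 1 character at each end.
        return text[0] + MASK * (n - 2) + text[-1]
    if n <= 6:
        # Keep 2 characters at each end.
        return text[:2] + MASK * (n - 4) + text[-2:]
    # 7 or more: hide the middle 3 (odd length) or middle 4 (even length).
    hidden = 3 if n % 2 else 4
    start = (n - hidden) // 2
    return text[:start] + MASK * hidden + text[start + hidden:]
```

Because both rules look only at character positions, they cannot tell key words from filler, which is the weakness the following paragraphs describe.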

The main disadvantage of the existing power marketing data desensitization rules is as follows:

When the two kinds of power marketing data, electricity addresses and enterprise account names, are desensitized under the current rules, non-key words are masked while key words are retained. For example, under the rule for enterprise account names, the desensitized name may still contain sensitive information: some key words survive and the desensitization has little effect. For example: 青岛惠丰电机制造有限公司 (Qingdao Huifeng Motor Manufacturing Co., Ltd.) -> 青岛惠丰****有限公司; 青岛贰零贰零商业服务有限公司 (Qingdao 2020 Business Service Co., Ltd.) -> 青岛贰******务有限公司.

The rule for contact addresses has a similar problem. For example: 山东省济南市市中区山川大街天桥北居委会纬三路齐鲁安康苑小区2-1-101 (2-1-101, Qilu Ankangyuan Community, Weisan Road, Tianqiao North Neighborhood Committee, Shanchuan Street, Shizhong District, Jinan City, Shandong Province) -> 山东省济南市市中区山川大街天桥北居委会纬三路齐鲁安康苑****1-101. Almost the entire address remains readable.

Summary of the invention

In view of this, the present invention discloses a data desensitization processing method and device. Before desensitization, the target data is segmented by calling a word segmentation reference lexicon, which makes the desensitization more effective.

To achieve the above object, the present invention provides the following technical solution:

A data desensitization processing method, comprising:

determining the type of target data;

calling the corresponding sub-lexicon of a word segmentation reference lexicon according to the type of the target data, and performing word segmentation with the word segmentation method corresponding to the type of the target data;

determining a desensitization method for the target data according to the type of the target data and the length of the target data, and applying that desensitization method to the sensitive data obtained by segmenting the target data.

Optionally, the method further comprises:

constructing the word segmentation reference lexicon, which comprises a plurality of sub-lexicons, each containing one type of sensitive word.

Optionally, when the type of the target data is electricity address, calling the corresponding sub-lexicon of the word segmentation reference lexicon according to the type of the target data and performing word segmentation with the corresponding word segmentation method comprises:

calling the general address sub-lexicon, the place name sub-lexicon, the residential community name sub-lexicon and the administrative division set sub-lexicon, and segmenting the target data by forward maximum matching Chinese word segmentation.

Optionally, when the type of the target data is enterprise account name, calling the corresponding sub-lexicon of the word segmentation reference lexicon according to the type of the target data and performing word segmentation with the corresponding word segmentation method comprises:

calling the region set sub-lexicon, the industry set sub-lexicon and the company organization set sub-lexicon, and segmenting the target data by bidirectional maximum matching Chinese word segmentation.

Optionally, before determining the desensitization method for the target data according to the type of the target data and the length of the target data, the method further comprises:

calculating the accuracy of the word segmentation result of the target data;

judging whether the accuracy of the word segmentation result of the target data is greater than a first preset value;

if so, performing the step of determining the desensitization method for the target data according to the type of the target data and the length of the target data;

if not, segmenting the target data based on a hidden Markov model, and then performing the step of determining the desensitization method for the target data according to the type of the target data and the length of the target data.
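The control flow of this optional check can be sketched as below. The text here does not say how the accuracy is computed or how the hidden Markov model is trained, so the sketch takes both as caller-supplied functions; every name in it (`segment_with_fallback`, `dictionary_segment`, `hmm_segment`, `accuracy_of`) is hypothetical.

```python
from typing import Callable, List

Segmenter = Callable[[str], List[str]]


def segment_with_fallback(
    text: str,
    dictionary_segment: Segmenter,
    hmm_segment: Segmenter,
    accuracy_of: Callable[[str, List[str]], float],
    first_preset_value: float,
) -> List[str]:
    """Return the dictionary-based segmentation if it is accurate enough,
    otherwise fall back to a hidden-Markov-model segmenter."""
    tokens = dictionary_segment(text)
    if accuracy_of(text, tokens) > first_preset_value:
        # Accuracy exceeds the first preset value: keep this result.
        return tokens
    # Otherwise re-segment with the HMM-based segmenter.
    return hmm_segment(text)
```

Either branch hands its tokens to the same next step, namely choosing the desensitization method from the data type and length.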

Optionally, when the type of the target data is electricity address, determining the desensitization method for the target data according to the type of the target data and the length of the target data, and applying that method to the sensitive data obtained by segmenting the target data, comprises:

judging whether the length of the target data is greater than a second preset value;

when the length of the target data is greater than the second preset value, determining that the desensitization method for the target data is a first electricity address data desensitization method;

using the first electricity address data desensitization method, extracting from the word segmentation result of the target data the last 5 characters of the house number data and the province, city, district and county data, the rest being the remaining data;

retaining the last 5 characters of the house number data and the province, city, district and county data, and masking the remaining data of the target data to obtain the desensitized target data;

when the length of the target data is not greater than the second preset value, determining that the desensitization method for the target data is a second electricity address data desensitization method;

using the second electricity address data desensitization method, extracting the retained part of the target data according to a first tiered retention rule based on the length of the target data, and masking the remaining part of the target data to obtain the desensitized target data.
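A minimal sketch of this address branch follows. It assumes that the segmentation step has already labelled each token as administrative (province, city, district or county), house number, or other; those labels, the function names, and the choice to emit one `*` per hidden character are all assumptions made for illustration. The second method is passed in as `short_rule`, because the first tiered retention rule is not spelled out at this point.

```python
from typing import Callable, List, Tuple

MASK = "*"
ADMIN, HOUSE, OTHER = "admin", "house", "other"  # hypothetical token labels


def mask_long_address(labelled_tokens: List[Tuple[str, str]]) -> str:
    """First method: keep administrative tokens and the last 5 characters
    of the house number data, mask everything else."""
    house_total = sum(len(t) for t, label in labelled_tokens if label == HOUSE)
    keep_from = max(house_total - 5, 0)  # index of first retained house char
    seen = 0
    out = []
    for token, label in labelled_tokens:
        if label == ADMIN:
            out.append(token)
        elif label == HOUSE:
            for ch in token:
                out.append(ch if seen >= keep_from else MASK)
                seen += 1
        else:
            out.append(MASK * len(token))
    return "".join(out)


def desensitize_address(
    labelled_tokens: List[Tuple[str, str]],
    second_preset_value: int,
    short_rule: Callable[[str], str],
) -> str:
    """Pick the first or second method by comparing length to the threshold."""
    text = "".join(token for token, _ in labelled_tokens)
    if len(text) > second_preset_value:
        return mask_long_address(labelled_tokens)
    return short_rule(text)
```

Unlike the purely positional rule, the street and community names are hidden here while the coarse administrative region stays readable.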

Optionally, when the type of the target data is enterprise account name, determining the desensitization method for the target data according to the type of the target data and the length of the target data, and applying that method to the sensitive data obtained by segmenting the target data, comprises:

judging whether the length of the target data is greater than a third preset value;

when the length of the target data is greater than the third preset value, determining that the desensitization method for the target data is a first enterprise account name data desensitization method;

using the first enterprise account name data desensitization method, extracting from the word segmentation result of the target data the first character of the trade name data and the last character of the industry data, leaving the remaining data of the trade name data and the remaining data of the industry data;

masking the remaining data of the trade name data and the remaining data of the industry data, and retaining the other data of the target data, to obtain the desensitized target data;

when the length of the target data is not greater than the third preset value, determining that the desensitization method for the target data is a second enterprise account name data desensitization method;

using the second enterprise account name data desensitization method, extracting the retained part of the target data according to a second tiered retention rule based on the length of the target data, and masking the remaining part of the target data to obtain the desensitized target data.
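The first enterprise account name method can be sketched as follows, again assuming the segmentation step has labelled each token. The label names and the function name are hypothetical, and emitting one `*` per hidden character is an assumption.

```python
from typing import List, Tuple

MASK = "*"
REGION, TRADE_NAME, INDUSTRY, ORG = "region", "trade_name", "industry", "org"


def mask_long_enterprise_name(labelled_tokens: List[Tuple[str, str]]) -> str:
    """Keep the first character of the trade name and the last character of
    the industry term, mask the rest of those two parts, keep other tokens."""
    out = []
    for token, label in labelled_tokens:
        if label == TRADE_NAME:
            out.append(token[:1] + MASK * (len(token) - 1))
        elif label == INDUSTRY:
            out.append(MASK * (len(token) - 1) + token[-1:])
        else:
            out.append(token)
    return "".join(out)
```

For a name segmented as 青岛 | 惠丰 | 电机制造 | 有限公司 (region, trade name, industry, organization form), this sketch yields 青岛惠****造有限公司, so the identifying trade name and industry are mostly hidden while the generic region and company form remain.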

A data desensitization processing device, comprising:

a type determination unit, configured to determine the type of target data;

a first word segmentation processing unit, configured to call the corresponding sub-lexicon of a word segmentation reference lexicon according to the type of the target data, and to perform word segmentation with the word segmentation method corresponding to the type of the target data;

a desensitization processing unit, configured to determine a desensitization method for the target data according to the type of the target data and the length of the target data, and to apply that desensitization method to the sensitive data obtained by segmenting the target data.

Optionally, the device further comprises:

a lexicon construction unit, configured to construct the word segmentation reference lexicon, which comprises a plurality of sub-lexicons, each containing one type of sensitive word.

Optionally, when the type of the target data is electricity address, the first word segmentation processing unit is specifically configured to:

Optionally, when the type of the target data is enterprise account name, the first word segmentation processing unit is specifically configured to:

Optionally, the device further comprises:

a calculation unit, configured to calculate the accuracy of the word segmentation result of the target data;

a judging unit, configured to judge whether the accuracy of the word segmentation result of the target data is greater than a first preset value;

if so, to trigger the desensitization processing unit;

if not, to trigger a second word segmentation processing unit, which is configured to segment the target data based on a hidden Markov model and then trigger the desensitization processing unit.

Optionally, when the type of the target data is electricity address, the desensitization processing unit comprises:

a first judging subunit, configured to judge whether the length of the target data is greater than a second preset value;

a first determining subunit, configured to determine, when the length of the target data is greater than the second preset value, that the desensitization method for the target data is a first electricity address data desensitization method;

a first extraction subunit, configured to use the first electricity address data desensitization method to extract from the word segmentation result of the target data the last 5 characters of the house number data and the province, city, district and county data, the rest being the remaining data;

a first desensitization processing subunit, configured to retain the last 5 characters of the house number data and the province, city, district and county data, and to mask the remaining data of the target data to obtain the desensitized target data;

a second determining subunit, configured to determine, when the length of the target data is not greater than the second preset value, that the desensitization method for the target data is a second electricity address data desensitization method;

a second desensitization processing subunit, configured to use the second electricity address data desensitization method to extract the retained part of the target data according to a first tiered retention rule based on the length of the target data, and to mask the remaining part of the target data to obtain the desensitized target data.

Optionally, when the type of the target data is enterprise account name, the desensitization processing unit comprises:

a second judging subunit, configured to judge whether the length of the target data is greater than a third preset value;

a third determining subunit, configured to determine, when the length of the target data is greater than the third preset value, that the desensitization method for the target data is a first enterprise account name data desensitization method;

a second extraction subunit, configured to use the first enterprise account name data desensitization method to extract from the word segmentation result of the target data the first character of the trade name data and the last character of the industry data, leaving the remaining data of the trade name data and the remaining data of the industry data;

a third desensitization processing subunit, configured to mask the remaining data of the trade name data and the remaining data of the industry data, and to retain the other data of the target data, to obtain the desensitized target data;

a fourth determining subunit, configured to determine, when the length of the target data is not greater than the third preset value, that the desensitization method for the target data is a second enterprise account name data desensitization method;

a fourth desensitization processing subunit, configured to use the second enterprise account name data desensitization method to extract the retained part of the target data according to a second tiered retention rule based on the length of the target data, and to mask the remaining part of the target data to obtain the desensitized target data.

Compared with the prior art, the beneficial effects of the present invention are as follows:

In the data desensitization processing method and device provided by the present invention, the target data is segmented by calling a word segmentation reference lexicon before desensitization. This yields data with a definite structure, so the parts carrying the main sensitive information can be desensitized and all or most of the sensitive information masked, which improves the effectiveness of data desensitization. Calling the sub-lexicon that corresponds to the type of the target data and using the word segmentation method that corresponds to that type improves segmentation accuracy. Determining the desensitization method from the type and length of the target data gives differentiated desensitization for data of different types and lengths, which further improves effectiveness.

Description of drawings

To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are merely embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a data desensitization processing method disclosed in an embodiment of the present invention;

Fig. 2 is a schematic diagram of the general address sub-lexicon disclosed in an embodiment of the present invention;

Fig. 3 is a schematic diagram of the place name sub-lexicon disclosed in an embodiment of the present invention;

Fig. 4 is a schematic diagram of the residential community name sub-lexicon disclosed in an embodiment of the present invention;

Fig. 5 is a schematic diagram of the administrative division set sub-lexicon disclosed in an embodiment of the present invention;

Fig. 6 is a schematic diagram of the region set sub-lexicon disclosed in an embodiment of the present invention;

Fig. 7 is a schematic diagram of the industry set sub-lexicon disclosed in an embodiment of the present invention;

Fig. 8 is a schematic diagram of the company organization set sub-lexicon disclosed in an embodiment of the present invention;

Fig. 9 is a schematic diagram of the forward maximum matching Chinese word segmentation method disclosed in an embodiment of the present invention;

Fig. 10 is a flowchart of the electricity address data desensitization processing method disclosed in an embodiment of the present invention;

Fig. 11 is a flowchart of the enterprise account name data desensitization processing method disclosed in an embodiment of the present invention;

Fig. 12 is a flowchart of another data desensitization processing method disclosed in an embodiment of the present invention;

Fig. 13 is a schematic structural diagram of a data desensitization processing device disclosed in an embodiment of the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

Referring to Fig. 1, this embodiment discloses a data desensitization processing method, which comprises the following steps:

S101: Determine the type of target data;

The target data is the data that needs to be desensitized. Types of target data may include telephone data, address data, user name data, bank account data, and so on.

S102: Call the corresponding sub-lexicon of the word segmentation reference lexicon according to the type of the target data, and perform word segmentation with the word segmentation method corresponding to the type of the target data;

Word segmentation splits a sequence of Chinese characters into individual words; it is the process of recombining a continuous character sequence into a word sequence according to a given specification.

To segment the target data more accurately, the sub-lexicon of the word segmentation reference lexicon that corresponds to the type of the target data is called to segment it.

It should be noted that the data desensitization processing method further comprises:

constructing the word segmentation reference lexicon.

The word segmentation reference lexicon comprises a plurality of sub-lexicons, each containing one type of sensitive word.

Figs. 2 to 8 show, respectively, the general address sub-lexicon, the place name sub-lexicon, the residential community name sub-lexicon, the administrative division set sub-lexicon, the region set sub-lexicon, the industry set sub-lexicon and the company organization set sub-lexicon of the word segmentation reference lexicon.

The sub-lexicons and the word segmentation method are both chosen by data type. For example, when the type of the target data is electricity address, the general address sub-lexicon, the place name sub-lexicon, the residential community name sub-lexicon and the administrative division set sub-lexicon are called, and the target data is segmented by forward maximum matching Chinese word segmentation. When the type of the target data is enterprise account name, the region set sub-lexicon, the industry set sub-lexicon and the company organization set sub-lexicon are called, and the target data is segmented by bidirectional maximum matching Chinese word segmentation.
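The mapping from data type to sub-lexicons and segmentation method can be expressed as a small lookup table. The sketch below is illustrative only: the key strings, the `SEGMENTATION_PROFILES` table and the `load_lexicon` helper are hypothetical names, and the sub-lexicons are assumed to be available as in-memory word lists.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical identifiers for the data types, sub-lexicons and methods
# named in the description.
SEGMENTATION_PROFILES = {
    "electricity_address": {
        "sub_lexicons": (
            "general_address",
            "place_name",
            "community_name",
            "administrative_division_set",
        ),
        "method": "forward_maximum_matching",
    },
    "enterprise_account_name": {
        "sub_lexicons": ("region_set", "industry_set", "company_organization_set"),
        "method": "bidirectional_maximum_matching",
    },
}


def load_lexicon(
    data_type: str, sub_lexicons: Dict[str, Iterable[str]]
) -> Tuple[Set[str], str]:
    """Merge the sub-lexicons registered for a data type and return them
    together with the name of the segmentation method to use."""
    profile = SEGMENTATION_PROFILES[data_type]
    merged: Set[str] = set()
    for name in profile["sub_lexicons"]:
        merged |= set(sub_lexicons.get(name, ()))
    return merged, profile["method"]
```

Keeping the mapping in data rather than in branching code makes it straightforward to register further types, such as telephone or bank account data, with their own sub-lexicons.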

As shown in Fig. 9, electricity address data is segmented with the forward maximum matching Chinese word segmentation algorithm, which works as follows:

Scanning from left to right, runs of consecutive characters in the target data are matched against the word list, and a word is cut out when a match is found. There is a catch, however: to obtain the maximum match, the text cannot be cut at the first match. Take the following text to be segmented:

content[] = {"洪", "山", "街", "道", "双", "河", "社", "区", ……}

Word list: dict[] = {"长沙市", "开福区", "洪山", "洪山街道", ……}

(1) Starting from content[1], when content[2] is scanned, "洪山" is found to be in the word list dict[]. It cannot be cut out yet, because the following characters may form a longer word (maximum match);

(2) Scanning continues to content[3]; "洪山街" is not a word in dict[]. It is still not certain that "洪山" is the longest word, because "洪山街" is a prefix of the dict[] entry "洪山街道";

(3) content[4] is scanned; "洪山街道" is a word in dict[]. Scanning continues;

(4) When content[5] is scanned, "洪山街道双" is neither a word in the word list nor a prefix of one. The longest word found so far, "洪山街道", can therefore be cut out.

In other words, a maximum match ends only when the next scan yields neither a word in the word list nor a prefix of one. The forward maximum matching algorithm then continues in a loop until the rest of the text is segmented. For example, the address "长沙市开福区洪山街道双河社区福元西路199号当代万国城三期10栋二单元1706" (Room 1706, Unit 2, Building 10, Phase III, Dangdai Wanguocheng, No. 199 Fuyuan West Road, Shuanghe Community, Hongshan Subdistrict, Kaifu District, Changsha City) is finally segmented as:

“长沙市|开福区|洪山街道|双河社区|福元西路|199|号|当代万国城三期|10|栋|二|单元|1706”.
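A compact way to obtain the same longest-match behaviour is to try the longest candidate first and shorten it until a lexicon word is found, rather than extending a prefix character by character as in the walkthrough; both give the longest lexicon word at each position. The sketch below takes that route. The function name is hypothetical, and two details are assumptions not stated in the text: runs of ASCII digits are grouped into one token, and a character that starts no lexicon word becomes a single-character token.

```python
from typing import Iterable, List

ASCII_DIGITS = "0123456789"


def forward_max_match(text: str, lexicon: Iterable[str]) -> List[str]:
    """Greedy left-to-right segmentation taking the longest lexicon word."""
    words = set(lexicon)
    max_len = max((len(w) for w in words), default=1)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        if text[i] in ASCII_DIGITS:
            # Assumption: keep a run of digits together as one token.
            j = i
            while j < len(text) and text[j] in ASCII_DIGITS:
                j += 1
        else:
            j = i + 1  # fallback: a single-character token
            # Try the longest candidate first, then shorter ones.
            for length in range(min(max_len, len(text) - i), 1, -1):
                if text[i:i + length] in words:
                    j = i + length
                    break
        tokens.append(text[i:j])
        i = j
    return tokens
```

With a lexicon containing 长沙市, 开福区, 洪山, 洪山街道, 双河社区, 福元西路, 号, 当代万国城三期, 栋 and 单元, the sample address is split into the thirteen tokens listed above; note that 洪山街道 wins over the shorter entry 洪山.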

Enterprise account name data is segmented with the bidirectional maximum matching Chinese word segmentation method. This method first runs forward maximum matching and reverse maximum matching separately, then compares the two segmentation results and applies a different strategy depending on the outcome. For example, one of the two results can be selected for output on the principle that more large-granularity words is better and fewer non-dictionary words and single-character words is better.
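One way to turn that selection principle into code is to score each candidate and keep the one with the lower penalty. In the sketch below the function name, the order in which the three criteria are compared, and the choice to prefer the reverse result on a tie are all assumptions; the two candidate segmentations are taken as inputs so that any forward and reverse matcher can be plugged in.

```python
from typing import Iterable, List, Tuple


def choose_segmentation(
    forward: List[str], backward: List[str], lexicon: Iterable[str]
) -> List[str]:
    """Pick between forward and reverse maximum-matching results."""
    if forward == backward:
        return forward
    words = set(lexicon)

    def penalty(tokens: List[str]) -> Tuple[int, int, int]:
        non_dictionary = sum(1 for t in tokens if t not in words)
        single_chars = sum(1 for t in tokens if len(t) == 1)
        # Fewer tokens overall means larger-granularity words.
        return (non_dictionary, single_chars, len(tokens))

    # Assumption: on a tie, prefer the reverse-matching result.
    return forward if penalty(forward) < penalty(backward) else backward
```

For the well-known ambiguous string 研究生命的起源, forward matching gives 研究生|命|的|起源 and reverse matching gives 研究|生命|的|起源; with 研究, 研究生, 生命, 的 and 起源 in the lexicon, the sketch returns the reverse result because it contains no non-dictionary word.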

最大正向匹配中文分词算法已经详细描述。最大逆向匹配中文分词算法跟最大正向匹配算法类似，不同的是扫描的方向，它是从右往左取子串进行匹配。算法流程可描述为：The maximum forward matching Chinese word segmentation algorithm has been described in detail. The maximum reverse matching Chinese word segmentation algorithm is similar to the maximum forward matching algorithm, the difference is the scanning direction, which takes substrings from right to left for matching. The algorithm flow can be described as:

(1) Input the preprocessed sentence content to be segmented, and initialize index = content.length;

(2) Obtain the length of each sub-dictionary in the dictionary database;

(3) Obtain the length of the string to be segmented and compare it with the longest sub-dictionary in the dictionary database. If the maximum sub-dictionary length is greater than the length of the string to be segmented, take the length of the remaining string as the maximum length; otherwise segment with the maximum length;

(4) Use binary search to find the sub-dictionary whose length equals the current maximum matching length. If it is found, go to (5); otherwise decrease the maximum length by one and repeat (4);

(5) Take the candidate string SubStr and look it up in the dictionary. If it is found, add it to the List. If it is not found, check whether the length of SubStr is greater than 1: if so, delete the last character of SubStr and repeat (5); otherwise set the segmentation flag and go to (6);

(6) Determine whether Index is greater than 1; if it is less, go to (3); otherwise save the List and exit.
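The right-to-left scan can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the procedure above: the vocabulary is a plain in-memory set rather than length-indexed sub-dictionaries searched by binary search, and the function name `reverse_max_match` and the three-word vocabulary are hypothetical.

```python
def reverse_max_match(content, vocabulary, max_word_len=None):
    """Segment `content` from right to left, taking the longest dictionary word each time."""
    if max_word_len is None:
        max_word_len = max((len(w) for w in vocabulary), default=1)
    words = []
    index = len(content)  # scan position, starts at the right end
    while index > 0:
        # The candidate window cannot be longer than the text that is left.
        length = min(max_word_len, index)
        # Shrink the window until it is a dictionary word or a single character.
        while length > 1 and content[index - length:index] not in vocabulary:
            length -= 1
        words.append(content[index - length:index])
        index -= length
    words.reverse()  # words were collected right to left
    return words


# Hypothetical vocabulary: region, industry and organization words only.
vocab = {"青岛", "电机制造", "有限公司"}
print("|".join(reverse_max_match("青岛惠丰电机制造有限公司", vocab)))
# 青岛|惠|丰|电机制造|有限公司
```

Characters that belong to no dictionary word, here the trade name "惠丰", fall out as single-character segments.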

The bidirectional maximum matching algorithm combines the forward and reverse matching algorithms. The string is first segmented with both the maximum forward matching and the maximum reverse matching algorithm, and the two results are compared: when they agree, that result is returned; when they differ, the result with fewer words is returned; when the word counts are equal, the reverse result is returned. The steps of the bidirectional maximum matching Chinese word segmentation algorithm are as follows:

(1) Input the sentence content to be segmented;

(2) Preprocess content, then segment it with the maximum forward matching algorithm and the maximum reverse matching algorithm and compare the results. If the two results are identical, go to (3); if they differ, go to (4);

(3) Select either result, output it, and end the algorithm;

(4) Compare the number of words in the two results. If the numbers are equal, select and output the reverse result and end the algorithm; otherwise select and output the result with fewer words and end the algorithm.
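The selection rule in steps (3) and (4) can be sketched in a few lines of Python. The function name `choose_segmentation` is hypothetical, and the two inputs are assumed to be word lists already produced by forward and reverse maximum matching.

```python
def choose_segmentation(forward, reverse):
    """Pick between forward and reverse maximum-matching results."""
    if forward == reverse:
        return forward  # step (3): identical results, either one will do
    if len(forward) == len(reverse):
        return reverse  # step (4): same word count, prefer the reverse result
    # step (4): otherwise prefer the result with fewer words
    return forward if len(forward) < len(reverse) else reverse


# Illustrative word lists, not taken from the text above.
print(choose_segmentation(["研究生", "命", "的", "起源"], ["研究", "生命", "的", "起源"]))
# ['研究', '生命', '的', '起源']
```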

S103: Determine the desensitization method of the target data according to the type and the length of the target data, and use that method to desensitize the sensitive data obtained by segmenting the target data.

Referring to Fig. 10, when the type of the target data is a power consumption address, S103 is executed as follows:

S201: Determine whether the length of the target data is greater than a second preset value; if yes, execute S202; if no, execute S203;

S202: Determine that the desensitization method of the target data is the first power consumption address data desensitization method;

S204: Using the first power consumption address data desensitization method, extract the last 5 characters of the house number data and the province, city, and district/county data from the word segmentation result of the target data, leaving the remaining data;

S205: Retain the last 5 characters of the house number data and the province, city, and district/county data, and mask the remaining data of the target data to obtain the desensitized data of the target data;

S203: Determine that the desensitization method of the target data is the second power consumption address data desensitization method;

S206: Using the second power consumption address data desensitization method, extract the retained part of the target data according to the first tiered retention rule, based on the length of the target data, and mask the remaining part of the target data to obtain the desensitized data of the target data.

For example, power consumption address data of 10 characters or fewer is desensitized with the second power consumption address data desensitization method, retaining characters in tiers by length: for 5 characters or fewer, keep the first character and the last 2 characters; for 6 to 9 characters, keep the last 5 characters.
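A minimal Python sketch of this tiered rule follows. The function name `mask_short_address` and the sample addresses are hypothetical. The sketch also assumes that any input longer than 5 characters falls into the "keep the last 5" tier, because the example gives no separate rule for exactly 10 characters.

```python
def mask_short_address(address, mask_char="*"):
    """Tiered retention for short addresses; all other characters are masked."""
    n = len(address)
    if n <= 5:
        # Keep the first character and the last 2 characters.
        keep = set(range(min(1, n))) | set(range(max(n - 2, 0), n))
    else:
        # Keep the last 5 characters (assumed to apply to every longer input).
        keep = set(range(n - 5, n))
    return "".join(ch if i in keep else mask_char for i, ch in enumerate(address))


print(mask_short_address("和平里8号"))      # 和**8号
print(mask_short_address("和平里小区8号"))  # **里小区8号
```

For inputs of 3 characters or fewer the retained positions cover the whole string, so nothing is masked; a deployment would need to decide whether that is acceptable.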

Power consumption address data of 10 characters or more is desensitized with the first power consumption address data desensitization method. A power consumption address generally consists of province, city, district/county, street/township, neighborhood committee/village, road, residential community, and house number. The last 5 characters of the house number part are retained, the province, city, and district/county are retained, and every other part is replaced with *. For example:

山东省济南市市中区山川大街天桥北居委会纬三路齐鲁安康苑小区2-1-101 -> 山东省济南市市中区**********************1-101 (the original address reads: 2-1-101, Qilu Ankangyuan Community, Weisan Road, Tianqiao North Neighborhood Committee, Shanchuan Street, Shizhong District, Jinan City, Shandong Province).
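The masking of a long address can be sketched in Python as below. This assumes the segmentation step has already labelled which segments are administrative divisions and which one is the house number; that labelling, the function name `mask_long_address`, and its parameters are assumptions made for illustration.

```python
def mask_long_address(admin_parts, other_parts, house_number, mask_char="*"):
    """Keep province/city/district and the last 5 characters of the house number."""
    kept_tail = house_number[-5:]
    # Everything between the administrative prefix and the kept tail is masked,
    # one mask character per original character.
    masked_len = sum(len(p) for p in other_parts) + len(house_number) - len(kept_tail)
    return "".join(admin_parts) + mask_char * masked_len + kept_tail


result = mask_long_address(
    ["山东省", "济南市", "市中区"],
    ["山川大街", "天桥北居委会", "纬三路", "齐鲁安康苑小区"],
    "2-1-101",
)
print(result)  # 山东省济南市市中区 followed by 22 asterisks and 1-101
```

Masking one character per original character preserves the length of the address, which matches the worked example but also reveals how long the hidden part is.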

Referring to Fig. 11, when the type of the target data is an enterprise account name, S103 is executed as follows:

S301: Determine whether the length of the target data is greater than a third preset value; if yes, execute S302; if no, execute S303;

S302: Determine that the desensitization method of the target data is the first enterprise account name data desensitization method;

S304: Using the first enterprise account name data desensitization method, extract the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data, leaving the remaining trade name data and the remaining industry data;

S305: Mask the remaining trade name data and the remaining industry data, and retain the other data of the target data, to obtain the desensitized data of the target data;

S303: Determine that the desensitization method of the target data is the second enterprise account name data desensitization method;

S306: Using the second enterprise account name data desensitization method, extract the retained part of the target data according to the second tiered retention rule, based on the length of the target data, and mask the remaining part of the target data to obtain the desensitized data of the target data.

For example, enterprise account name data shorter than 6 characters is desensitized with the second enterprise account name data desensitization method, retaining characters in tiers by length: names of 4 characters or fewer keep 1 character at each end; names of 5 to 6 characters keep 2 characters at each end.
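As a rough Python illustration of this second tiered rule (the function name `mask_short_company` and the sample names are hypothetical, and names too short to have a middle part are returned unchanged as an assumption):

```python
def mask_short_company(name, mask_char="*"):
    """Keep 1 character at each end (length <= 4) or 2 at each end (longer names)."""
    n = len(name)
    k = 1 if n <= 4 else 2
    if n <= 2 * k:
        return name  # nothing lies between the retained ends
    return name[:k] + mask_char * (n - 2 * k) + name[-k:]


print(mask_short_company("惠丰电机"))    # 惠**机
print(mask_short_company("惠丰电机厂"))  # 惠丰*机厂
```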

Enterprise account name data of 6 characters or more is desensitized with the first enterprise account name data desensitization method. An enterprise account name generally consists of four parts: region, trade name, industry, and company organization. The region at the front and the organization at the end are kept unchanged, and the trade name and industry are masked. The trade name keeps its first character and has every other character replaced with *; the industry keeps its last character and has every other character replaced with *. For example:

青岛惠丰电机制造有限公司 -> 青岛惠****造有限公司 (Qingdao Huifeng Motor Manufacturing Co., Ltd.);

青岛贰零贰零商业服务有限公司 -> 青岛贰******务有限公司 (Qingdao 2020 Business Service Co., Ltd.).
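The two examples can be reproduced with a short Python sketch, assuming word segmentation has already split the name into its four parts. The function name `mask_company_name` and its parameter names are hypothetical.

```python
def mask_company_name(region, trade_name, industry, organization, mask_char="*"):
    """Mask the trade name (keep first char) and the industry (keep last char)."""
    masked_trade = trade_name[:1] + mask_char * (len(trade_name) - 1)
    masked_industry = mask_char * (len(industry) - 1) + industry[-1:]
    # Region and organization are kept unchanged.
    return region + masked_trade + masked_industry + organization


print(mask_company_name("青岛", "惠丰", "电机制造", "有限公司"))      # 青岛惠****造有限公司
print(mask_company_name("青岛", "贰零贰零", "商业服务", "有限公司"))  # 青岛贰******务有限公司
```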

In the data desensitization processing method disclosed in this embodiment, the target data is segmented by calling the word segmentation benchmark lexicon before desensitization, which yields data with a definite structure. The parts carrying the main sensitive information are then desensitized, masking all or most of the sensitive information, which improves the effectiveness of data desensitization. Calling the sub-thesaurus that corresponds to the type of the target data, and using the word segmentation method that corresponds to that type, improves segmentation accuracy. Determining the desensitization method from the type and length of the target data gives differentiated desensitization for data of different types and lengths, which further improves the effectiveness of data desensitization.

Referring to Fig. 12, this embodiment discloses another data desensitization processing method, which includes the following steps:

S401: Determine the type of the target data;

S402: Call the corresponding sub-thesaurus in the word segmentation benchmark lexicon according to the type of the target data, and segment the data with the word segmentation method corresponding to the type of the target data;

S403: Calculate the accuracy of the word segmentation result of the target data;

S404: Determine whether the accuracy of the word segmentation result of the target data is greater than a first preset value; if yes, execute S405; if no, execute S406;

S405: Determine the desensitization method of the target data according to the type and the length of the target data, and use that method to desensitize the sensitive data obtained by segmenting the target data;

S406: Segment the target data based on a hidden Markov model, and then execute S405.
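The control flow of S402 to S406 can be sketched in Python as follows. How the accuracy in S403 is computed is not stated here, so the sketch takes it as a caller-supplied function; `segment_with_fallback` and all of its parameter names are hypothetical.

```python
def segment_with_fallback(text, dictionary_segment, hmm_segment, accuracy, threshold):
    """Dictionary-based segmentation first, HMM segmentation only when accuracy is low."""
    words = dictionary_segment(text)            # S402
    if accuracy(text, words) > threshold:       # S403 and S404
        return words                            # continue to S405 with this result
    return hmm_segment(text)                    # S406, then continue to S405


# Toy stand-ins for the two segmenters and the accuracy measure.
fast = lambda t: list(t)
slow = lambda t: [t]
print(segment_with_fallback("青岛惠丰", fast, slow, lambda t, w: 0.2, 0.8))  # ['青岛惠丰']
```

Keeping the slower model behind a threshold means most records only pay for the dictionary lookup.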

A hidden Markov model (HMM) is used to perform Chinese word segmentation on the two types of data, enterprise account names and power consumption addresses. When the training corpus is large enough and covers enough domains, the HMM algorithm can achieve higher segmentation accuracy. This class of segmentation algorithm models Chinese on the basis of manually annotated parts of speech and statistical features; that is, the model parameters are estimated (trained) from the observed data, which is the annotated corpus. In the segmentation stage, the model computes the probability of each candidate segmentation, and the segmentation with the highest probability is taken as the final result. HMM is a common sequence labeling model; it handles ambiguity and out-of-vocabulary words well and performs better than string matching.

A hidden Markov model is a doubly stochastic process: the specific state sequence is unknown and only the state transition probabilities are known. In other words, the state transition process of the model is unobservable (hidden), and the stochastic process of observable events is a random function of the hidden state transition process.

An HMM consists of:

The number of states in the model, N;

The number of distinct symbols that each state may output, M;

The state transition probability matrix $A = \{a_{ij}\}$, where $a_{ij}$ is the probability of moving from state $C_i$ to state $C_j$;

The probability distribution matrix $B = \{b_j(k)\}$ of observing a particular symbol $O_k$ in state $C_j$; the probability of observing a symbol is also called the symbol emission probability;

The probability distribution of the initial state, $\pi = \{\pi_i\}$.

In general, an HMM is written as a five-tuple $\mu = (C, O, A, B, \pi)$, where C is the set of states, O is the set of output symbols, and $\pi$, A and B are the initial state probability distribution, the state transition probabilities, and the symbol emission probabilities, respectively.

For Chinese word segmentation, a corpus is used to train the HMM. With the classic character tagging model, the set of four labels is C = {B, E, M, S}, with the following meanings:

B: the beginning of a word

E: the end of a word

M: the middle of a word

S: a single character that forms a word by itself
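To make the labeling concrete, the Python sketch below converts a segmented sentence into its B/M/E/S tag string, which is how a segmented corpus becomes training data. The function name `words_to_tags` is hypothetical.

```python
def words_to_tags(words):
    """Map each word to B...E tags, or S for a single-character word."""
    tags = []
    for word in words:
        if not word:
            continue  # ignore empty segments
        if len(word) == 1:
            tags.append("S")
        else:
            tags.append("B" + "M" * (len(word) - 2) + "E")
    return "".join(tags)


print(words_to_tags(["长沙市", "开福区", "的"]))  # BMEBMES
```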

Once the corpus has been tagged with the four labels, an HMM can be built statistically, under the assumption that the label of each character depends only on the label of the preceding character. This yields the state transition matrix A and the symbol emission probabilities B of the HMM, where:

In the formulas, C = {B, E, M, S}, O = {character set}, and Count denotes frequency. When computing $B_{ij}$, data sparsity means that many characters never appear in the training set, which produces zero probabilities in B. To remedy this, add-one data smoothing is applied, namely:
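A count-based estimate of A and B can be sketched in Python as below. The sketch makes two assumptions that are not confirmed by the text: probabilities are relative frequencies of the counts, and add-one smoothing also adds the alphabet size to the denominator (Laplace smoothing) so that each row of B still sums to 1. The function name `estimate_hmm` and the two-sentence corpus are hypothetical.

```python
from collections import Counter

STATES = "BEMS"


def estimate_hmm(tagged_sentences, alphabet):
    """Estimate A and B from (characters, tags) pairs of equal length."""
    trans = Counter()        # Count(C_i followed by C_j)
    trans_from = Counter()   # Count(C_i with a successor)
    emit = Counter()         # Count(character O_k tagged C_i)
    state_total = Counter()  # Count(C_i)
    for chars, tags in tagged_sentences:
        for ch, tag in zip(chars, tags):
            emit[tag, ch] += 1
            state_total[tag] += 1
        for prev, cur in zip(tags, tags[1:]):
            trans[prev, cur] += 1
            trans_from[prev] += 1
    A = {i: {j: (trans[i, j] / trans_from[i] if trans_from[i] else 0.0)
             for j in STATES} for i in STATES}
    size = len(alphabet)
    # Add-one smoothing: unseen characters get a small non-zero probability.
    B = {i: {ch: (emit[i, ch] + 1) / (state_total[i] + size) for ch in alphabet}
         for i in STATES}
    return A, B


A, B = estimate_hmm([("长沙市", "BME"), ("的", "S")], "长沙市的")
print(A["B"]["M"], B["B"]["长"], B["B"]["沙"])  # 1.0 0.4 0.2
```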

The initial vector is set to π = {0.5, 0.0, 0.0, 0.5}, in the order B, E, M, S, because M and E cannot appear at the start of a sentence. This completes the construction of the HMM. Based on this HMM, for a given observation sequence the Viterbi algorithm is used to obtain a hidden sequence over {B, E, M, S}.

The Viterbi search algorithm is:

1. Initialization: $\delta_1(i) = \pi_i b_i(O_1), \quad 1 \le i \le N$,

Most probable path variable:

2. Recursive calculation:

3. Recording the backtracking path:

4. Termination:

The path (state sequence) is obtained by backtracking:
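The following Python sketch shows the textbook form of these four steps. It is an illustration under assumptions, not necessarily the exact formulation intended here: it works in log space, uses the standard recursion in which the best score of state j at time t is the maximum over i of the previous score of i times $a_{ij}$, multiplied by $b_j(O_t)$, and gives unseen characters a small floor probability. The function name `viterbi`, the `floor` parameter, and the toy transition values are hypothetical.

```python
import math


def viterbi(obs, states, pi, A, B, floor=1e-8):
    """Return the most probable state sequence for the observation sequence `obs`."""
    if not obs:
        return []

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    def emit(state, symbol):
        return log(B.get(state, {}).get(symbol, floor))

    # 1. Initialization: delta_1(i) = pi_i * b_i(O_1), in log space.
    delta = [{s: log(pi.get(s, 0.0)) + emit(s, obs[0]) for s in states}]
    psi = [{}]
    # 2 and 3. Recursion, remembering the best predecessor of every state.
    for t in range(1, len(obs)):
        delta.append({})
        psi.append({})
        for j in states:
            best = max(states, key=lambda i: delta[t - 1][i] + log(A.get(i, {}).get(j, 0.0)))
            psi[t][j] = best
            delta[t][j] = delta[t - 1][best] + log(A.get(best, {}).get(j, 0.0)) + emit(j, obs[t])
    # 4. Termination, then backtracking along the stored predecessors.
    path = [max(states, key=lambda s: delta[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(psi[t][path[-1]])
    path.reverse()
    return path


pi = {"B": 0.5, "E": 0.0, "M": 0.0, "S": 0.5}
A = {"B": {"M": 0.3, "E": 0.7}, "M": {"M": 0.4, "E": 0.6},
     "E": {"B": 0.6, "S": 0.4}, "S": {"B": 0.5, "S": 0.5}}
print(viterbi("ab", "BEMS", pi, A, {}))  # ['B', 'E']
```

The two nested loops over states inside the loop over positions give the $O(N^2 T)$ running time.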

The time complexity of the Viterbi algorithm is $O(N^2 T)$. For example, the output state sequence for the address "长沙市开福区洪山街道双河社区福元西路199号当代万国城三期10栋二单元1706" is:

"BMEBMEBMMEBMMEBMMEBMMEBMMMMMEBMEBMEBMME"

According to this state sequence, the Chinese text can be segmented, with each word running from a B tag to the next E tag.

The final Chinese word segmentation result is as follows:

"长沙市|开福区|洪山街道|双河社区|福元西路|199号|当代万国城三期|10栋|二单元|1706".
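Reading a state sequence back into words is mechanical: a word ends at every E or S tag. A minimal Python sketch, using the address and state sequence given above (the function name `tags_to_words` is hypothetical):

```python
def tags_to_words(text, tags):
    """Cut `text` after every character tagged E (word end) or S (single-character word)."""
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in "ES":
            words.append(text[start:i + 1])
            start = i + 1
    if start < len(text):
        words.append(text[start:])  # trailing characters without a closing tag
    return words


address = "长沙市开福区洪山街道双河社区福元西路199号当代万国城三期10栋二单元1706"
tags = "BMEBMEBMMEBMMEBMMEBMMEBMMMMMEBMEBMEBMME"
print("|".join(tags_to_words(address, tags)))
# 长沙市|开福区|洪山街道|双河社区|福元西路|199号|当代万国城三期|10栋|二单元|1706
```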

The data desensitization processing method disclosed in this embodiment first segments the target data with the maximum forward matching method or the bidirectional maximum matching Chinese word segmentation method, both of which have low algorithmic complexity, which keeps segmentation fast. The accuracy of the segmentation result is then calculated. When the accuracy is below the threshold, the target data is segmented with the hidden Markov model, which has higher algorithmic complexity but also higher segmentation accuracy, which safeguards the accuracy of the segmentation result.

Based on the data desensitization processing method disclosed in the above embodiments, and referring to Fig. 13, this embodiment correspondingly discloses a data desensitization processing device, including:

A type determination unit 501, configured to determine the type of the target data;

A first word segmentation processing unit 502, configured to call the corresponding sub-thesaurus in the word segmentation benchmark lexicon according to the type of the target data, and to segment the data with the word segmentation method corresponding to the type of the target data;

A desensitization processing unit 503, configured to determine the desensitization method of the target data according to the type and the length of the target data, and to use that method to desensitize the sensitive data obtained by segmenting the target data.

Optionally, the device further includes:

Optionally, when the type of the target data is a power consumption address, the first word segmentation processing unit 502 is specifically configured to:

Optionally, when the type of the target data is an enterprise account name, the first word segmentation processing unit 502 is specifically configured to:

Optionally, the device further includes:

Optionally, when the type of the target data is a power consumption address, the desensitization processing unit 503 includes:

Optionally, when the type of the target data is an enterprise account name, the desensitization processing unit 503 includes:

In the data desensitization processing device disclosed in this embodiment, the target data is segmented by calling the word segmentation benchmark lexicon before desensitization, which yields data with a definite structure. The parts carrying the main sensitive information are then desensitized, masking all or most of the sensitive information, which improves the effectiveness of data desensitization. Calling the sub-thesaurus that corresponds to the type of the target data, and using the word segmentation method that corresponds to that type, improves segmentation accuracy. Determining the desensitization method from the type and length of the target data gives differentiated desensitization for data of different types and lengths, which further improves the effectiveness of data desensitization.

The above description of the disclosed embodiments enables a person skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A processing method for data desensitization, comprising:

determining the type of target data;

calling the corresponding sub-thesaurus in the word segmentation benchmark lexicon according to the type of the target data, and performing word segmentation with a word segmentation method corresponding to the type of the target data;

determining a desensitization method of the target data according to the type of the target data and the length of the target data, and using the desensitization method of the target data to desensitize the sensitive data obtained after word segmentation of the target data.

2. The method according to claim 1, characterized in that the method further comprises:

A word segmentation reference thesaurus is constructed, and the word segmentation reference thesaurus includes a plurality of sub-thesauruses, and each sub-thesaurus includes a type of sensitive word.

3. The method according to claim 1, characterized in that, when the type of the target data is a power consumption address, calling the corresponding sub-thesaurus in the word segmentation benchmark lexicon according to the type of the target data and performing word segmentation with the word segmentation method corresponding to the type of the target data comprises:

calling the general address sub-thesaurus, the place name sub-thesaurus, the residential community name sub-thesaurus and the administrative division sub-thesaurus, and segmenting the target data with the maximum forward matching Chinese word segmentation method.

4. The method according to claim 1, characterized in that, when the type of the target data is an enterprise account name, calling the corresponding sub-thesaurus in the word segmentation benchmark lexicon according to the type of the target data and performing word segmentation with the word segmentation method corresponding to the type of the target data comprises:

calling the region set sub-thesaurus, the industry set sub-thesaurus and the company organization set sub-thesaurus, and segmenting the target data with the bidirectional maximum matching Chinese word segmentation method.

5. The method according to claim 1, wherein, before determining the desensitization method of the target data according to the type of the target data and the length of the target data, the method further comprises:

calculating the accuracy of the word segmentation result of the target data;

judging whether the accuracy of the word segmentation result of the target data is greater than a first preset value;

if yes, performing the step of determining the desensitization method of the target data according to the type of the target data and the length of the target data;

if not, segmenting the target data based on a hidden Markov model, and then performing the step of determining the desensitization method of the target data according to the type of the target data and the length of the target data.

6. The method according to claim 1, wherein, when the type of the target data is a power consumption address, determining the desensitization method of the target data according to the type of the target data and the length of the target data, and using the desensitization method of the target data to desensitize the sensitive data obtained after word segmentation of the target data, comprises:

judging whether the length of the target data is greater than a second preset value;

when the length of the target data is greater than the second preset value, determining that the desensitization method of the target data is the first power consumption address data desensitization method;

using the first power consumption address data desensitization method, extracting the last 5 characters of the house number data and the province, city, and district/county data from the word segmentation result of the target data, leaving the remaining data;

retaining the last 5 characters of the house number data and the province, city, and district/county data, and masking the remaining data of the target data to obtain desensitized data of the target data;

when the length of the target data is not greater than the second preset value, determining that the desensitization method of the target data is the second power consumption address data desensitization method;

using the second power consumption address data desensitization method, extracting the retained part of the target data according to the first tiered retention rule, based on the length of the target data, and masking the remaining part of the target data to obtain desensitized data of the target data.

7. The method according to claim 1, wherein, when the type of the target data is an enterprise account name, determining the desensitization method of the target data according to the type of the target data and the length of the target data, and using the desensitization method of the target data to desensitize the sensitive data obtained after word segmentation of the target data, comprises:

judging whether the length of the target data is greater than a third preset value;

when the length of the target data is greater than the third preset value, determining that the desensitization method of the target data is the first enterprise account name data desensitization method;

using the first enterprise account name data desensitization method, extracting the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data, leaving the remaining trade name data and the remaining industry data;

masking the remaining trade name data and the remaining industry data, and retaining the other data of the target data, to obtain desensitized data of the target data;

when the length of the target data is not greater than the third preset value, determining that the desensitization method of the target data is the second enterprise account name data desensitization method;

using the second enterprise account name data desensitization method, extracting the retained part of the target data according to the second tiered retention rule, based on the length of the target data, and masking the remaining part of the target data to obtain desensitized data of the target data.

8. A processing device for data desensitization, comprising:

a type determination unit, configured to determine the type of the target data;

The first word segmentation processing unit is used to call the corresponding sub-thesaurus in the word segmentation reference thesaurus according to the type of the target data, and use the word segmentation method corresponding to the type of the target data to perform word segmentation;

a desensitization processing unit, configured to determine the desensitization method of the target data according to the type of the target data and the length of the target data, and to use the desensitization method of the target data to desensitize the sensitive data obtained after word segmentation of the target data.

9. The device according to claim 8, further comprising:

The thesaurus construction unit is used to construct a word segmentation reference thesaurus, the word segmentation reference thesaurus includes a plurality of sub-thesaurus, and each sub-thesaurus includes a type of sensitive word.

10. The device according to claim 8, wherein, when the type of the target data is a power consumption address, the first word segmentation processing unit is specifically configured to:

11. The device according to claim 8, wherein, when the type of the target data is an enterprise account name, the first word segmentation processing unit is specifically configured to:

12. The device according to claim 8, further comprising:

a calculation unit, configured to calculate the accuracy of the word segmentation result of the target data;

a judging unit, configured to judge whether the accuracy of the word segmentation result of the target data is greater than a first preset value;

if so, the desensitization processing unit is triggered;

if not, a second word segmentation processing unit is triggered, the second word segmentation processing unit being configured to segment the target data based on a hidden Markov model and then trigger the desensitization processing unit.

13. The device according to claim 8, wherein, when the type of the target data is a power consumption address, the desensitization processing unit comprises:

a first judging subunit, configured to judge whether the length of the target data is greater than a second preset value;

a first determining subunit, configured to determine that the desensitization method of the target data is the first power consumption address data desensitization method when the length of the target data is greater than the second preset value;

a first extraction subunit, configured to use the first power consumption address data desensitization method to extract the last 5 characters of the house number data and the province, city, and district/county data from the word segmentation result of the target data, leaving the remaining data;

a first desensitization processing subunit, configured to retain the last 5 characters of the house number data and the province, city, and district/county data, and to mask the remaining data of the target data to obtain desensitized data of the target data;

a second determining subunit, configured to determine that the desensitization method of the target data is the second power consumption address data desensitization method when the length of the target data is not greater than the second preset value;

a second desensitization processing subunit, configured to use the second power consumption address data desensitization method to extract the retained part of the target data according to the first tiered retention rule, based on the length of the target data, and to mask the remaining part of the target data to obtain desensitized data of the target data.

14. The device according to claim 8, wherein, when the type of the target data is an enterprise account name, the desensitization processing unit comprises:

a second judging subunit, configured to judge whether the length of the target data is greater than a third preset value;

a third determining subunit, configured to determine that the desensitization method of the target data is the first enterprise account name data desensitization method when the length of the target data is greater than the third preset value;

a second extraction subunit, configured to use the first enterprise account name data desensitization method to extract the first character of the trade name data and the last character of the industry data from the word segmentation result of the target data, leaving the remaining trade name data and the remaining industry data;

a third desensitization processing subunit, configured to mask the remaining trade name data and the remaining industry data, and to retain the other data of the target data, to obtain desensitized data of the target data;

a fourth determining subunit, configured to determine that the desensitization method of the target data is the second enterprise account name data desensitization method when the length of the target data is not greater than the third preset value;

a fourth desensitization processing subunit, configured to use the second enterprise account name data desensitization method to extract the retained part of the target data according to the second tiered retention rule, based on the length of the target data, and to mask the remaining part of the target data to obtain desensitized data of the target data.