
CN114154503A - Sensitive data type identification method - Google Patents

Sensitive data type identification method

Info

Publication number
CN114154503A
Authority
CN
China
Prior art keywords
data
bilstm
word
model
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111463036.XA
Other languages
Chinese (zh)
Inventor
徐小雄
魏华强
彭曦
杨洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Cric Technology Co ltd
Original Assignee
Sichuan Cric Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Cric Technology Co ltd filed Critical Sichuan Cric Technology Co ltd
Priority to CN202111463036.XA
Publication of CN114154503A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a sensitive data type identification method, which comprises: preprocessing training samples, and training a BiLSTM-CRF model with the vector matrix of each piece of preprocessed data and the label sequence corresponding to each piece of data; identifying the sensitive data types of a file with the trained BiLSTM-CRF model; and, after data post-processing of the result returned by the BiLSTM-CRF model, feeding back the final result, wherein the data post-processing comprises data inverse mapping and sensitive data extraction. Based on sensitive data identification with a supervised bidirectional long short-term memory network and a conditional random field, the method scans data in the industrial internet in real time or offline, locates and identifies multiple sensitive data types from the text content of files, performs sensitive data type identification on multiple types of file data, and further improves model performance by combining a deep neural network with a CRF model.

Description

Sensitive data type identification method
Technical Field
The invention relates to the technical field of data security, in particular to a sensitive data type identification method.
Background
Currently, network attacks on enterprises are rising year by year and cause a large number of data leakage and data extortion incidents. The threat of data leakage is considerable, and in an industrial internet environment the protection of data, especially of sensitive data, is all the more important. In the prior art, some approaches scan data with preset rules; this brute-force search is inefficient, and the data must be analyzed manually and continuously to add new rules that improve the system. Others search for named entities in system logs with a CRF model, but that is not an identification task aimed at sensitive data types, and the data detected are system logs. Thus, the prior art offers no efficient way to identify sensitive data present in a file.
Disclosure of Invention
The invention aims to provide a sensitive data type identification method to solve the problem that the prior art has no efficient way to identify sensitive data present in a file.
The invention solves this problem through the following technical scheme:
a sensitive data type identification method, comprising:
S100, training a BiLSTM-CRF model, comprising the following steps:
Step S110, preprocessing training samples, wherein the preprocessing comprises text feature extraction, core dictionary construction, data vectorization and fixing the data length, and the text features comprise single characters, parts of speech and word boundaries; the core dictionaries comprise a core dictionary of single characters, a core dictionary of parts of speech, a core dictionary of word boundaries and a core dictionary of labels corresponding to the single characters; the labels correspond one-to-one to the single characters and identify the sensitive entity types of the single characters; data vectorization means that the extracted single characters, parts of speech and word boundaries are mapped through their respective core dictionaries to obtain a vector matrix; fixing the data length means truncating or padding the data length of the vector matrix;
S120, training the BiLSTM-CRF model with the vector matrix of each piece of preprocessed data and the label sequence corresponding to each piece of data;
S200, identifying the sensitive data types of the file with the trained BiLSTM-CRF model:
S210, after text feature extraction and data cleaning are performed on the received file, mapping it into a vector matrix through the created core dictionaries, sending the vector matrix to the trained BiLSTM-CRF model for sensitive data identification, inversely mapping the identification result through the core dictionary of labels back into labels, locating the labels in the text, and integrating and returning the locations and sensitive entity types;
S220, performing data post-processing on the result returned by the BiLSTM-CRF model and feeding back the final result, wherein the data post-processing comprises data inverse mapping and sensitive data extraction.
Preferably, the BiLSTM-CRF model comprises an embedding layer, a BiLSTM layer and a CRF layer; the preprocessing further comprises training the text data of the training samples with the pre-trained word vector tools Word2Vec, FastText and GloVe respectively to obtain a character embedding feature matrix, a part-of-speech embedding feature matrix and a word boundary embedding feature matrix, reading the word embedding feature matrix from the character embedding feature matrix, loading the matrices into the three embedding layers of the BiLSTM-CRF model, and concatenating the three embedding layers into one group of word embedding layers;
the BiLSTM layer receives the embedding of each piece of data, predicts the probability of each character for each label, and inputs it to the CRF layer, which outputs the most likely annotation sequence.
Preferably, in the process of training the BiLSTM-CRF model, the input sequence is defined as $X = (x_1, \ldots, x_i, \ldots, x_n)$ and the output sequence as $y = (y_1, \ldots, y_i, \ldots, y_n)$; the output matrix of the BiLSTM layer is $P$, where $P_{i,y_i}$ represents the non-normalized probability of word $x_i$ mapping to label $y_i$, and the dimension of $P$ is $n \times k$, with $k$ the number of label categories; the transition matrix of the CRF layer is $A$, where $A_{y_i,y_{i+1}}$ represents the transition score from label $y_i$ to label $y_{i+1}$, with dimension $(k+2) \times (k+2)$ to accommodate the added start and end symbols; the score function is defined as:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $y_0$ is the start symbol, $y_{n+1}$ is the end symbol, and $n$ is the number of words in the input sequence; next, for all possible predicted label sequences $\tilde{y} \in Y_X$, a probability value is defined with the Softmax function:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

the model loss function is then derived from the above function as:

$$\mathcal{L} = -\log p(y \mid X) = \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} - s(X, y)$$
after the loss values are obtained, all parameters of the model are optimized by performing gradient descent calculation using an Adam optimizer until the loss values are minimized.
Preferably, the method further comprises evaluating the trained BiLSTM-CRF model with the following parameters, where $TP_i$ is the number of type-$i$ sensitive entities the model identifies correctly (the prediction result is completely consistent with the label content), $FP_i$ the number of type-$i$ entities predicted incorrectly, and $FN_i$ the number of type-$i$ entities missed:

$$\text{Precision}_i = \frac{TP_i}{TP_i + FP_i}$$

$$\text{Recall}_i = \frac{TP_i}{TP_i + FN_i}$$

$$F1_i = \frac{2 \times \text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}$$

The model with the optimal evaluation result is saved as the final BiLSTM-CRF model.
Preferably, the data cleaning in step S210 comprises text denoising, full-width-to-half-width conversion, and sentence segmentation with word segmentation.
Preferably, text denoising comprises removing redundant line breaks, spaces and garbled characters from the text; full-width-to-half-width conversion comprises converting English letters and punctuation marks into half-width format; sentence segmentation with word segmentation comprises splitting the read long text sentence by sentence, then using Hanlp to segment each sentence and complete its length, obtaining a character sequence, a part-of-speech sequence and a word boundary sequence.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the method is based on the sensitive data identification of a supervised two-way long-section memory network and a conditional random field, carries out real-time or off-line data scanning on data in the industrial Internet, and positions and identifies various sensitive data types from text contents in files. Automatically learning the mode of the text in an artificial intelligence mode to identify the sensitive data in the text; sensitive data type recognition can be carried out on various types of file data, and the performance of the model is further improved by combining a deep neural network and a CRF model.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of sample data preprocessing at a model training stage;
FIG. 3 is a schematic diagram of text data processing during a model identification phase;
FIG. 4 is a schematic diagram of a tag format;
FIG. 5 is a schematic structural diagram of the BiLSTM-CRF model.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Before describing a specific embodiment of the present invention, the technical terms to which the present invention relates will be described:
Sensitive Data Identification: identification of sensitive data;
Industrial Internet of Things: the industrial internet;
OCR (Optical Character Recognition): optical character recognition;
Named Entity Recognition: named entity recognition;
Neural Network: neural network;
BiLSTM (Bidirectional Long Short-Term Memory): bidirectional long short-term memory network;
CRF (Conditional Random Field): conditional random field;
Embedding Layer: embedding layer.
Example:
Referring to FIG. 1, a sensitive data type identification method comprises:
S100, training a BiLSTM-CRF model, comprising the following steps:
Step S110, preprocessing training samples, wherein the preprocessing comprises text feature extraction, core dictionary construction, data vectorization and fixing the data length, and the text features comprise single characters, parts of speech and word boundaries; the core dictionaries comprise a core dictionary of single characters, a core dictionary of parts of speech, a core dictionary of word boundaries and a core dictionary of labels corresponding to the single characters; the labels correspond one-to-one to the single characters and identify the sensitive entity types of the single characters; data vectorization means that the extracted single characters, parts of speech and word boundaries are mapped through their respective core dictionaries to obtain a vector matrix; fixing the data length means truncating or padding the data length of the vector matrix;
As shown in FIG. 2, text feature extraction obtains word segmentation and part-of-speech information of the text data through the Hanlp word segmentation tool, and obtains word boundary information by converting the word segmentation information into BMESO format, where B (Beginning) marks the beginning of a word; M (Middle), the middle of a word; E (End), the end of a word; S (Single), a single-character word; and O (Outside), a non-entity character. For example, "Beijing University" (北京大学, four characters) is denoted "BMME" after word segmentation and BMES conversion. At the same time, each word boundary mark is assigned a corresponding number, forming a word boundary core dictionary that is stored, e.g. { "E":1, "B":2, "M":3 … }.
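To make the boundary conversion concrete, here is a minimal Python sketch; it assumes the word segmentation has already been produced (e.g. by Hanlp), and the function name and data layout are illustrative rather than taken from the patent:

```python
def words_to_bmes(words):
    """Convert segmented words to per-character BMES boundary tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                     # single-character word
        else:
            tags.append("B")                     # beginning of word
            tags.extend("M" * (len(word) - 2))   # middle characters
            tags.append("E")                     # end of word
    return tags

# "Beijing University" segmented as one four-character word -> B M M E
print(words_to_bmes(["北京大学"]))  # ['B', 'M', 'M', 'E']
```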
Part-of-speech information works the same way: for example, the part of speech of "Xiaoming" obtained through the Hanlp tool is labeled "/nz", meaning "other proper noun"; "goes to school" is labeled "/vi", meaning "intransitive verb". Similarly, a part-of-speech core dictionary is constructed and stored, e.g. { "n":1, "w":2, "nnt":3 … }.
The text data is broken into individual characters, the frequency of occurrence of each character is counted, and characters with a frequency below 100 are deleted. By assigning a number to each remaining character, the core dictionary of single characters is constructed and stored, e.g. { ",":1, "。":2, "1":3, "0":4 … }.
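A sketch of this frequency-thresholded dictionary construction; the function name and the choice of reserving ID 0 for padding are assumptions, while the cutoff of 100 comes from the patent:

```python
from collections import Counter

def build_char_dict(texts, min_freq=100):
    """Assign an ID to every character whose corpus frequency >= min_freq."""
    counts = Counter(ch for text in texts for ch in text)
    kept = [ch for ch, freq in counts.most_common() if freq >= min_freq]
    # ID 0 is reserved for padding (an assumption; see the fixed data length step)
    return {ch: i for i, ch in enumerate(kept, start=1)}
```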
Label processing standardizes the label corresponding to each character of the text data; the label identifies the character's sensitive entity type. Part of the defined sensitive entity types and corresponding labels are shown in the following table:
(Table: defined sensitive entity types and corresponding labels — available only as images in the original document.)
The labels are processed into the format "word boundary_entity type" shown in FIG. 4, and each processed new type is assigned an index number, forming the core dictionary of labels, which is stored, e.g. { "B_PER":0, "M_PER":1, "E_PER":2, … }.
Text mapping: the training text is divided into a single-character sequence plus the part-of-speech and word boundary sequences obtained with Hanlp, and these are mapped through the core dictionary of single characters, the core dictionary of parts of speech and the core dictionary of word boundaries respectively, yielding three groups of ID-mapped vector matrices. As shown in the ID-mapping part of FIG. 2, the character vector matrix, the part-of-speech vector matrix and the word boundary vector matrix are obtained, from top to bottom, after mapping the sample sentence "Xiaoming studies at Peking University".
Fixed data length: in the model training process, every piece of data in an input batch must have the same length, so each group of vector matrices is truncated or padded. For example, if the longest data length in the current batch is 150, any row of vectors in the batch's vector matrix group shorter than 150 is padded with 0 at the tail until its length reaches 150.
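A minimal sketch of this truncate-or-pad step (names are illustrative; the per-batch maximum length and tail-padding with 0 follow the patent):

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Fix an ID sequence to exactly max_len entries."""
    if len(ids) >= max_len:
        return ids[:max_len]                       # truncate long rows
    return ids + [pad_id] * (max_len - len(ids))   # pad short rows at the tail

batch = [[5, 2, 9], [7, 1, 4, 4, 3]]
max_len = max(len(row) for row in batch)  # longest data length in the batch
print([pad_or_truncate(row, max_len) for row in batch])
# [[5, 2, 9, 0, 0], [7, 1, 4, 4, 3]]
```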
Step S120, training the BiLSTM-CRF model with the vector matrix of each piece of preprocessed data and the label sequence corresponding to each piece of data; the structure of the BiLSTM-CRF model is shown in FIG. 5.
After data processing, data samples are obtained in which the vector matrix of each piece of data is $x \in \{x_1, x_2, x_3, \ldots, x_n\}$ and the label sequence of each piece of data is $y \in \{y_1, y_2, y_3, \ldots, y_n\}$; the model optimizes its parameters through an optimization algorithm and the training data labels, and the finally trained model can identify the sensitive entity types from the input text content. The structure of the BiLSTM-CRF model used is described below; for the part-of-speech and word boundary embedding feature matrices, random initialization is used.
(1) Embedding layer: the word embedding feature matrix is read from the saved character embedding feature matrix (Word2Vec, FastText, GloVe) files and loaded into the three embedding layers; for the part-of-speech and word boundary embedding feature matrices, two randomly initialized embedding layers are constructed. The character embedding dimension is 300, the part-of-speech embedding dimension is 150, and the word boundary embedding dimension is 50. The groups of word embeddings are concatenated into one word embedding layer. The purpose of this layer is to reduce the dimensionality of the input data while ensuring that data information is not excessively lost.
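A PyTorch sketch of this three-way embedding, assuming PyTorch as the framework (the patent names none); only the dimensions 300/150/50 and the random initialization of the part-of-speech and boundary embeddings come from the text:

```python
import torch
import torch.nn as nn

class TripleEmbedding(nn.Module):
    """Concatenate character, part-of-speech and word-boundary embeddings."""

    def __init__(self, n_chars, n_pos, n_bounds, char_vectors=None):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, 300, padding_idx=0)
        if char_vectors is not None:
            # Pretrained (n_chars, 300) character matrix: Word2Vec/FastText/GloVe
            self.char_emb.weight.data.copy_(char_vectors)
        self.pos_emb = nn.Embedding(n_pos, 150, padding_idx=0)      # random init
        self.bound_emb = nn.Embedding(n_bounds, 50, padding_idx=0)  # random init

    def forward(self, chars, pos, bounds):
        # (batch, seq_len, 300 + 150 + 50) = (batch, seq_len, 500)
        return torch.cat(
            [self.char_emb(chars), self.pos_emb(pos), self.bound_emb(bounds)],
            dim=-1,
        )
```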
(2) BiLSTM-CRF layer: the BiLSTM receives the embedding of each piece of data, predicts the score of each character for each training label (emission score), and inputs it into the CRF layer, which outputs the most likely annotation sequence. The hidden dimension parameter of the BiLSTM is set to 250.
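Continuing the sketch above, the BiLSTM emission scores and CRF decoding could look as follows; the `pytorch-crf` package is an assumption, since the patent does not name a CRF implementation:

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (assumed implementation)

class BiLSTMCRF(nn.Module):
    def __init__(self, embed: TripleEmbedding, num_tags: int):
        super().__init__()
        self.embed = embed
        # Hidden dimension 250 per direction, as stated in the patent
        self.lstm = nn.LSTM(500, 250, bidirectional=True, batch_first=True)
        self.emission = nn.Linear(2 * 250, num_tags)  # per-tag emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, chars, pos, bounds, tags, mask):
        feats, _ = self.lstm(self.embed(chars, pos, bounds))
        # Negative log-likelihood of the gold label sequence (mask: bool tensor)
        return -self.crf(self.emission(feats), tags, mask=mask)

    def decode(self, chars, pos, bounds, mask):
        feats, _ = self.lstm(self.embed(chars, pos, bounds))
        return self.crf.decode(self.emission(feats), mask=mask)  # best sequences
```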
(3) Loss function:
In the process of training the BiLSTM-CRF model, the input sequence is defined as $X = (x_1, \ldots, x_i, \ldots, x_n)$ and the output sequence as $y = (y_1, \ldots, y_i, \ldots, y_n)$; the output matrix of the BiLSTM layer is $P$, where $P_{i,y_i}$ represents the non-normalized probability of word $x_i$ mapping to label $y_i$, and the dimension of $P$ is $n \times k$, with $k$ the number of label categories; the transition matrix of the CRF layer is $A$, where $A_{y_i,y_{i+1}}$ represents the transition score from label $y_i$ to label $y_{i+1}$, with dimension $(k+2) \times (k+2)$ to accommodate the added start and end symbols; the score function is defined as:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $y_0$ is the start symbol, $y_{n+1}$ is the end symbol, and $n$ is the number of words in the input sequence; next, for all possible predicted label sequences $\tilde{y} \in Y_X$, a probability value is defined with the Softmax function:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

the model loss function is then derived from the above function as:

$$\mathcal{L} = -\log p(y \mid X) = \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} - s(X, y)$$
after the loss values are obtained, all parameters of the model are optimized by performing gradient descent calculation using an Adam optimizer until the loss values are minimized.
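A training-loop sketch under the same assumptions; the learning rate, epoch count and the synthetic batch are illustrative, with only the Adam optimizer named by the patent:

```python
model = BiLSTMCRF(TripleEmbedding(n_chars=5000, n_pos=60, n_bounds=6), num_tags=32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One synthetic batch standing in for the real preprocessed data
chars = torch.randint(1, 5000, (8, 150))
pos = torch.randint(1, 60, (8, 150))
bounds = torch.randint(1, 6, (8, 150))
tags = torch.randint(0, 32, (8, 150))
mask = torch.ones(8, 150, dtype=torch.bool)

for epoch in range(10):
    optimizer.zero_grad()
    loss = model.loss(chars, pos, bounds, tags, mask)  # CRF NLL from above
    loss.backward()
    optimizer.step()  # Adam updates all model parameters
```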
The method also comprises evaluating the trained BiLSTM-CRF model with the following parameters, where $TP_i$ is the number of type-$i$ sensitive entities the model identifies correctly (the prediction result is completely consistent with the label content), $FP_i$ the number of type-$i$ entities predicted incorrectly, and $FN_i$ the number of type-$i$ entities missed:

$$\text{Precision}_i = \frac{TP_i}{TP_i + FP_i}$$

$$\text{Recall}_i = \frac{TP_i}{TP_i + FN_i}$$

$$F1_i = \frac{2 \times \text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}$$

The model with the optimal evaluation result is saved as the final BiLSTM-CRF model.
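A sketch of these per-type, entity-level metrics; treating entities as exact spans follows the "completely consistent with the label content" criterion, while the set-based layout is an assumption:

```python
def evaluate_type(pred_spans, gold_spans):
    """Precision/recall/F1 for one sensitive-entity type.

    Both arguments are sets of (start, end) spans; a prediction counts
    as correct only if it exactly matches a gold span.
    """
    tp = len(pred_spans & gold_spans)
    fp = len(pred_spans - gold_spans)
    fn = len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```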
S200, identifying the sensitive data type of the file by adopting the trained BilSTM-CRF model:
step S210, after text feature extraction and data cleaning are carried out on the file received from the client, the file is mapped into a vector matrix through the created core dictionary, the vector matrix is sent to the trained BilTM-CRF model for sensitive data recognition, the recognition result and the core dictionary of the label are reversely mapped into a label, the label is positioned to the text through the label, and the positioning and sensitive entity types are integrated and returned. The data cleaning comprises text denoising, full angle turning to half angle turning and sentence segmentation. The text denoising comprises the step of removing redundant line feed characters, space characters and messy codes in the text; converting the full angle to the half angle comprises converting English letters and punctuation marks into a half angle format; the sentence segmentation and word segmentation includes splitting the read long text by taking a sentence as a unit, and then performing splitting and length completion on each sentence of text by using Hanlp to obtain a word sequence, a part of speech sequence and a word boundary sequence, as shown in FIG. 3.
Step S220, performing data post-processing on the result returned by the BiLSTM-CRF model and feeding back the final result, wherein the data post-processing comprises data inverse mapping and sensitive data extraction.
Although the present invention has been described herein with reference to the illustrated embodiments thereof, which are intended to be preferred embodiments of the present invention, it is to be understood that the invention is not limited thereto, and that numerous other modifications and embodiments can be devised by those skilled in the art that will fall within the spirit and scope of the principles of this disclosure.

Claims (6)

1. A sensitive data type identification method, comprising:
S100, training a BiLSTM-CRF model, comprising the following steps:
Step S110, preprocessing training samples, wherein the preprocessing comprises text feature extraction, core dictionary construction, data vectorization and fixing the data length, and the text features comprise single characters, parts of speech and word boundaries; the core dictionaries comprise a core dictionary of single characters, a core dictionary of parts of speech, a core dictionary of word boundaries and a core dictionary of labels corresponding to the single characters; the labels correspond one-to-one to the single characters and identify the sensitive entity types of the single characters; data vectorization means that the extracted single characters, parts of speech and word boundaries are mapped through their respective core dictionaries to obtain a vector matrix; fixing the data length means truncating or padding the data length of the vector matrix;
S120, training the BiLSTM-CRF model with the vector matrix of each piece of preprocessed data and the label sequence corresponding to each piece of data;
S200, identifying the sensitive data types of the file with the trained BiLSTM-CRF model:
S210, after text feature extraction and data cleaning are performed on the received file, mapping it into a vector matrix through the created core dictionaries, sending the vector matrix to the trained BiLSTM-CRF model for sensitive data identification, inversely mapping the identification result through the core dictionary of labels back into labels, locating the labels in the text, and integrating and returning the locations and sensitive entity types;
S220, performing data post-processing on the result returned by the BiLSTM-CRF model and feeding back the final result, wherein the data post-processing comprises data inverse mapping and sensitive data extraction.
2. The sensitive data type identification method of claim 1, wherein the BiLSTM-CRF model comprises an embedding layer, a BiLSTM layer and a CRF layer; the preprocessing further comprises training the text data of the training samples with the pre-trained word vector tools Word2Vec, FastText and GloVe respectively to obtain a character embedding feature matrix, a part-of-speech embedding feature matrix and a word boundary embedding feature matrix, reading the word embedding feature matrix from the character embedding feature matrix, loading the matrices into the three embedding layers of the BiLSTM-CRF model, and concatenating the three embedding layers into one group of word embedding layers;
the BiLSTM layer receives the embedding of each piece of data, predicts the probability of each character for each label, and inputs it to the CRF layer, which outputs the most likely annotation sequence.
3. The sensitive data type identification method according to claim 2, wherein in the process of training the BiLSTM-CRF model, the input sequence is defined as $X = (x_1, \ldots, x_i, \ldots, x_n)$ and the output sequence as $y = (y_1, \ldots, y_i, \ldots, y_n)$; the output matrix of the BiLSTM layer is $P$, where $P_{i,y_i}$ represents the non-normalized probability of word $x_i$ mapping to label $y_i$, and the dimension of $P$ is $n \times k$, with $k$ the number of label categories; the transition matrix of the CRF layer is $A$, where $A_{y_i,y_{i+1}}$ represents the transition score from label $y_i$ to label $y_{i+1}$, with dimension $(k+2) \times (k+2)$; the score function is defined as:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $y_0$ is the start symbol, $y_{n+1}$ is the end symbol, and $n$ is the number of words in the input sequence; next, for all possible predicted label sequences $\tilde{y} \in Y_X$, a probability value is defined with the Softmax function:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

the model loss function is then derived from the above function as:

$$\mathcal{L} = -\log p(y \mid X) = \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} - s(X, y)$$
after the loss values are obtained, all parameters of the model are optimized by performing gradient descent calculation using an Adam optimizer until the loss values are minimized.
4. The method of claim 3, further comprising evaluating the trained BiLSTM-CRF model with the following parameters, where $TP_i$ is the number of type-$i$ sensitive entities the model identifies correctly (the prediction result is completely consistent with the label content), $FP_i$ the number of type-$i$ entities predicted incorrectly, and $FN_i$ the number of type-$i$ entities missed:

$$\text{Precision}_i = \frac{TP_i}{TP_i + FP_i}$$

$$\text{Recall}_i = \frac{TP_i}{TP_i + FN_i}$$

$$F1_i = \frac{2 \times \text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}$$

and saving the model with the optimal evaluation result as the final BiLSTM-CRF model.
5. The sensitive data type identification method of claim 1, wherein the data cleaning in step S210 comprises text denoising, full-width-to-half-width conversion, and sentence segmentation with word segmentation.
6. The method of claim 5, wherein text denoising comprises removing redundant line breaks, spaces and garbled characters from the text; full-width-to-half-width conversion comprises converting English letters and punctuation marks into half-width format; sentence segmentation with word segmentation comprises splitting the read long text sentence by sentence, then using Hanlp to segment each sentence and complete its length, obtaining a character sequence, a part-of-speech sequence and a word boundary sequence.
CN202111463036.XA 2021-12-02 2021-12-02 Sensitive data type identification method Pending CN114154503A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111463036.XA CN114154503A (en) 2021-12-02 2021-12-02 Sensitive data type identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111463036.XA CN114154503A (en) 2021-12-02 2021-12-02 Sensitive data type identification method

Publications (1)

Publication Number Publication Date
CN114154503A 2022-03-08

Family

ID=80456224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111463036.XA Pending CN114154503A (en) 2021-12-02 2021-12-02 Sensitive data type identification method

Country Status (1)

Country Link
CN (1) CN114154503A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657230A (en) * 2018-11-06 2019-04-19 众安信息技术服务有限公司 Merge the name entity recognition method and device of term vector and part of speech vector
CN110826320A (en) * 2019-11-28 2020-02-21 上海观安信息技术股份有限公司 Sensitive data discovery method and system based on text recognition
CN111709242A (en) * 2020-06-01 2020-09-25 广州多益网络股份有限公司 Chinese punctuation mark adding method based on named entity recognition
CN112232195A (en) * 2020-10-15 2021-01-15 北京临近空间飞行器系统工程研究所 Handwritten Chinese character recognition method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Ziniu et al., "Chinese Named Entity Recognition Method Based on BERT", Computer Science, vol. 46, no. 11, 15 November 2019, pp. 138-142 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127321A (en) * 2023-02-16 2023-05-16 广东工业大学 Training method, pushing method and system for ship news pushing model

Similar Documents

Publication Publication Date Title
CN111222305B (en) Information structuring method and device
CN111414479B (en) Label extraction method based on short text clustering technology
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN114036930A (en) Text error correction method, device, equipment and computer readable medium
CN113553853B (en) Named entity recognition method and device, computer equipment and storage medium
CN110569486B (en) Sequence labeling method and device based on double architectures and computer equipment
CN113590778A (en) Intelligent customer service intention understanding method, device, equipment and storage medium
CN110413972B (en) Intelligent table name field name complementing method based on NLP technology
CN114298035A (en) Text recognition desensitization method and system thereof
CN114416979A (en) Text query method, text query equipment and storage medium
CN115827819A (en) Intelligent question and answer processing method and device, electronic equipment and storage medium
CN110008473B (en) Medical text named entity identification and labeling method based on iteration method
CN117668180A (en) Document question-answering method, document question-answering device, and readable storage medium
CN115438650B (en) Contract text error correction method, system, equipment and medium fusing multi-source characteristics
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN111708870A (en) Deep neural network-based question answering method and device and storage medium
CN112417823A (en) Chinese text word order adjusting and quantitative word completion method and system
CN114154503A (en) Sensitive data type identification method
CN112445862B (en) Internet of things equipment data set construction method and device, electronic equipment and storage medium
CN114139537A (en) Word vector generation method and device
CN117291192B (en) Government affair text semantic understanding analysis method and system
CN112052649B (en) Text generation method, device, electronic equipment and storage medium
CN112784227A (en) Dictionary generating system and method based on password semantic structure
CN111104422A (en) Training method, device, equipment and storage medium of data recommendation model
CN111581963B (en) Method and device for extracting time character string, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination