CN114154503A - Sensitive data type identification method - Google Patents
Sensitive data type identification method
- Publication number
- CN114154503A (application CN202111463036.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- bilstm
- word
- model
- text
- Prior art date
- Legal status
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a sensitive data type identification method, which comprises: preprocessing training samples and training a BiLSTM-CRF model with the vector matrix of each piece of preprocessed data and the label sequence corresponding to each piece of data; identifying the sensitive data types of a file with the trained BiLSTM-CRF model; and performing data post-processing on the result returned by the BiLSTM-CRF model before feeding back the final result, where the post-processing comprises data inverse mapping and sensitive data extraction. Based on sensitive data identification with a supervised bidirectional long short-term memory network and a conditional random field, the method scans data in the industrial Internet in real time or offline, locates and identifies multiple sensitive data types from the text content of files, performs sensitive data type identification on many types of file data, and further improves model performance by combining a deep neural network with a CRF model.
Description
Technical Field
The invention relates to the technical field of data security, in particular to a sensitive data type identification method.
Background
Currently, network attacks on enterprises are rising year by year and cause a large number of data leakage and data extortion events. The threat of data leakage should not be underestimated, and in an industrial internet environment the protection of data is all the more important, especially the protection of sensitive data. In the prior art, some approaches scan data with preset rules, but this brute-force search is inefficient and requires continuous manual analysis of the data to add new rules before the system improves; others use a CRF model to find named entities in system logs, but they do not target the task of sensitive data type identification, and the detected data type is limited to system logs. Thus, the prior art offers no efficient way to identify the sensitive data present in a file.
Disclosure of Invention
The invention aims to provide a sensitive data type identification method that solves the prior-art problem of there being no efficient method for identifying the sensitive data present in a file.
The invention solves the problems through the following technical scheme:
a sensitive data type identification method, comprising:
S100, training a BiLSTM-CRF model, comprising the following steps:
step S110, preprocessing training samples, wherein the preprocessing comprises text feature extraction, core dictionary construction, data vectorization and fixing the data length, and the text features comprise single characters, parts of speech and word boundaries; the core dictionaries comprise a core dictionary of single characters, a core dictionary of parts of speech, a core dictionary of word boundaries and a core dictionary of the labels corresponding to the single characters; the labels correspond one-to-one to the single characters and identify the sensitive entity types of the single characters; data vectorization means mapping the extracted single characters, parts of speech and word boundaries through their respective core dictionaries to obtain a vector matrix; fixing the data length means truncating or padding the vector matrix to a set length;
S120, training the BiLSTM-CRF model with the vector matrix of each piece of preprocessed data and the label sequence corresponding to each piece of data;
S200, identifying the sensitive data types of a file with the trained BiLSTM-CRF model:
S210, after performing text feature extraction and data cleaning on the received file, mapping it into a vector matrix through the created core dictionaries, sending the vector matrix to the trained BiLSTM-CRF model for sensitive data identification, inverse-mapping the identification result to labels through the label core dictionary, locating the labels in the text, and integrating and returning the positions and sensitive entity types.
S220, performing data post-processing on the result returned by the BiLSTM-CRF model and feeding back the final result, wherein the data post-processing comprises data inverse mapping and sensitive data extraction.
Preferably, the BiLSTM-CRF model comprises an embedding layer, a BiLSTM layer and a CRF layer; the preprocessing further comprises training the text data of the training samples with the pre-trained word vector tools Word2Vec, FastText and GloVe respectively to obtain a character embedding feature matrix, a part-of-speech embedding feature matrix and a word boundary embedding feature matrix, reading the word embedding feature matrix from the character embedding feature matrix, loading the matrices into the three embedding layers of the BiLSTM-CRF model, and concatenating the three embedding layers into one group of word embedding layers;
the BiLSTM layer receives the embedded layer of each piece of data and predicts the probability of each character for each label and inputs it to the CRF layer, which outputs the most likely annotation sequence.
Preferably, in training the BiLSTM-CRF model, the input sequence is defined as $X = (x_1, \ldots, x_i, \ldots, x_n)$ and the output sequence as $y = (y_1, \ldots, y_i, \ldots, y_n)$; the output matrix of the BiLSTM layer is $P$, where $P_{i,y_i}$ represents the non-normalized probability of word $x_i$ mapping to tag $y_i$; the dimension of $P$ is $n \times k$, with $k$ the number of label categories; the transition matrix of the CRF layer is $A$, where $A_{y_i,y_{i+1}}$ represents the transition score from label $y_i$ to label $y_{i+1}$, with dimension $(k+2) \times (k+2)$; the score function is defined as:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

wherein $y_0$ is the start symbol, $y_{n+1}$ is the end symbol, $Y_X$ denotes all possible predicted tag sequences, and $n$ is the number of words in the input sequence;

next, for each possible predicted tag sequence, a probability value is defined using the Softmax function:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

the model loss function is then derived from the above as:

$$\mathcal{L} = -\log p(y \mid X) = -s(X, y) + \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$
after the loss values are obtained, all parameters of the model are optimized by performing gradient descent calculation using an Adam optimizer until the loss values are minimized.
Preferably, the method further comprises evaluating the trained BiLSTM-CRF model with the following metrics:

$$\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}$$

wherein Precision is the accuracy and Recall the recall for each sensitive entity type $i$; the model correctly identifies a type-$i$ sensitive entity when the prediction result is completely consistent with the label content ($TP_i$), $FP_i$ counts type-$i$ predictions with no matching label, and $FN_i$ counts labeled type-$i$ entities that the model misses;

and saving the model with the optimal evaluation result as the final BiLSTM-CRF model.
Preferably, the data cleaning in step S210 comprises text denoising, full-width to half-width conversion, and sentence segmentation.
Preferably, the text denoising comprises removing redundant line breaks, spaces and garbled characters from the text; full-width to half-width conversion comprises converting English letters and punctuation marks to half-width format; and sentence segmentation and word segmentation comprise splitting the read long text sentence by sentence, then using Hanlp to segment each sentence and pad it to length, obtaining a character sequence, a part-of-speech sequence and a word boundary sequence.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the method is based on the sensitive data identification of a supervised two-way long-section memory network and a conditional random field, carries out real-time or off-line data scanning on data in the industrial Internet, and positions and identifies various sensitive data types from text contents in files. Automatically learning the mode of the text in an artificial intelligence mode to identify the sensitive data in the text; sensitive data type recognition can be carried out on various types of file data, and the performance of the model is further improved by combining a deep neural network and a CRF model.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of sample data preprocessing at a model training stage;
FIG. 3 is a schematic diagram of text data processing during a model identification phase;
FIG. 4 is a schematic diagram of a tag format;
FIG. 5 is a schematic structural diagram of the BiLSTM-CRF model.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Before describing a specific embodiment of the present invention, the technical terms to which the present invention relates will be described:
Sensitive Data Identification: identification of sensitive data;
Industrial Internet of Things: the industrial internet;
OCR (Optical Character Recognition): optical character recognition;
Named Entity Recognition: named entity recognition;
Neural Network: neural network;
BiLSTM (Bidirectional Long Short-Term Memory): bidirectional long short-term memory network;
CRF (Conditional Random Field): conditional random field;
Embedding Layer: embedding layer.
Example:
Referring to FIG. 1, a sensitive data type identification method includes:
S100, training a BiLSTM-CRF model, comprising the following steps:
step S110, preprocessing training samples, wherein the preprocessing comprises text feature extraction, core dictionary construction, data vectorization and fixing the data length, and the text features comprise single characters, parts of speech and word boundaries; the core dictionaries comprise a core dictionary of single characters, a core dictionary of parts of speech, a core dictionary of word boundaries and a core dictionary of the labels corresponding to the single characters; the labels correspond one-to-one to the single characters and identify the sensitive entity types of the single characters; data vectorization means mapping the extracted single characters, parts of speech and word boundaries through their respective core dictionaries to obtain a vector matrix; fixing the data length means truncating or padding the vector matrix to a set length;
As shown in FIG. 2, text feature extraction obtains word segmentation information and part-of-speech information of the text data through the Hanlp word segmentation tool, and obtains word boundary information by converting the segmentation information into BMESO format, where B (Beginning) marks the beginning of a word; M (Middle) marks the middle of a word; E (End) marks the end of a word; S (Single) marks a single-character word; and O (Outside) marks a non-entity character. For example, "北京大学" (Peking University), segmented as one word, is denoted "BMME" after BMES conversion. At the same time, each word boundary mark is assigned a corresponding number, forming the word boundary core dictionary, which is stored, e.g. { "E": 1, "B": 2, "M": 3 … }.
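As a minimal sketch of this boundary conversion (the function name and the use of plain Python are assumptions of this illustration, not taken from the patent):

```python
def words_to_bmes(words):
    """Map each character of each segmented word to a B/M/E/S boundary tag."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                    # single-character word
        else:
            tags.append("B")                    # beginning of the word
            tags.extend("M" * (len(word) - 2))  # middle characters, if any
            tags.append("E")                    # end of the word
    return tags

# "北京大学" segmented as one word yields the patent's "BMME" example.
print(words_to_bmes(["北京大学"]))  # ['B', 'M', 'M', 'E']
```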
For part-of-speech information: for example, the part of speech of "小明" (Xiaoming) obtained by the Hanlp tool is labeled "/nz", meaning "other proper noun", while "上学" (attending school) is labeled "/vi", an intransitive verb. Similarly, a part-of-speech core dictionary is constructed and stored, e.g. { "n": 1, "w": 2, "nnt": 3 … }.
The text data is split into single characters, the frequency of occurrence of each character is counted, and characters occurring fewer than 100 times are deleted. Each remaining character is assigned a number, constructing the single-character core dictionary, which is stored, e.g. { ",": 1, "。": 2, … }.
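A sketch of this dictionary construction might look as follows; the reserved padding and unknown-character slots are assumptions added for illustration, not stated in the patent:

```python
from collections import Counter

def build_char_dict(texts, min_freq=100):
    """Index characters whose corpus frequency is at least min_freq."""
    counts = Counter(ch for text in texts for ch in text)
    char_dict = {"<PAD>": 0, "<UNK>": 1}  # assumed reserved indices
    for ch, freq in counts.most_common():
        if freq >= min_freq:
            char_dict[ch] = len(char_dict)
    return char_dict
```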
Label processing standardizes the label corresponding to each character of the text data; the label identifies the character's sensitive entity type. A partial list of the defined sensitive entity types and their corresponding tags is shown in the following table:
the label is processed into the format of "word boundary _ entity type" as shown in FIG. 4, and all the processed new types are assigned with an index number, which constitutes the core dictionary of the label and is stored, for example, { "B _ PER":0, "M _ PER":1, "E _ PER":2, … }.
Text mapping: the training text is divided into a single-character sequence plus the part-of-speech and word-boundary sequences obtained with Hanlp, and these are mapped through the single-character, part-of-speech and word-boundary core dictionaries respectively, yielding three groups of ID-mapped vector matrices. As shown in the ID mapping part of FIG. 2, mapping "小明在北京大学上学" (Xiaoming studies at Peking University) yields, from top to bottom, the character vector matrix, the part-of-speech vector matrix and the word boundary vector matrix.
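The mapping step reduces to a dictionary lookup per symbol. A small self-contained sketch, with toy dictionaries that stand in for the real corpus-built ones:

```python
def map_to_ids(seq, core_dict, unk_id=1):
    """Map a sequence of symbols (characters, POS tags or boundary tags) to IDs."""
    return [core_dict.get(sym, unk_id) for sym in seq]

# Toy dictionaries for illustration only; real ones come from the training corpus.
char_dict = {"<PAD>": 0, "<UNK>": 1, "北": 2, "京": 3, "大": 4, "学": 5}
boundary_dict = {"<PAD>": 0, "B": 1, "M": 2, "E": 3, "S": 4, "O": 5}

print(map_to_ids(list("北京大学"), char_dict))          # [2, 3, 4, 5]
print(map_to_ids(["B", "M", "M", "E"], boundary_dict))  # [1, 2, 2, 3]
```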
Fixed data length: during model training, every input batch must have the same length, so each group of vector matrices is truncated or padded. For example, if the longest data length in the current batch is 150, any row vector in the batch shorter than 150 is completed to length 150 by padding 0 at its tail.
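A minimal sketch of this truncate-or-pad step, assuming 0 as the padding ID as in the example above:

```python
def fix_length(ids, max_len, pad_id=0):
    """Truncate or right-pad an ID sequence to a fixed length."""
    if len(ids) >= max_len:
        return ids[:max_len]                       # truncation case
    return ids + [pad_id] * (max_len - len(ids))   # pad 0 at the tail

print(fix_length([2, 3, 4, 5], 8))  # [2, 3, 4, 5, 0, 0, 0, 0]
```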
Step S120, training the BiLSTM-CRF model with the vector matrix of each piece of preprocessed data and the corresponding label sequence; the structure of the BiLSTM-CRF model is shown in FIG. 5.
After data processing, data samples are obtained in which the vector matrix of each piece of data is $x \in \{x_1, x_2, x_3, \ldots, x_n\}$ and the label sequence of each piece of data is $y \in \{y_1, y_2, y_3, \ldots, y_n\}$; the model optimizes its parameters through an optimization algorithm and the training data labels. The finally trained model can identify sensitive entity types from the input text content. The structure of the BiLSTM-CRF model used is as follows; for the part-of-speech embedding feature matrix and the word boundary embedding feature matrix, random initialization is used.
(1) Embedding layer: the character embedding feature matrix is read from the saved (Word2Vec, FastText, GloVe) files and loaded into its embedding layer, while the part-of-speech and word-boundary embedding feature matrices require two randomly initialized embedding layers. The character embedding dimension is 300, the part-of-speech embedding dimension 150 and the word boundary embedding dimension 50. The embeddings are concatenated into one word embedding layer. The purpose of this layer is to reduce the dimensionality of the input data while ensuring that data information is not excessively lost.
(2) BiLSTM-CRF layers: the BiLSTM accepts the embedded representation of each piece of data, predicts each character's score for each training label (the emission score), and inputs it into the CRF layer, which outputs the most likely label sequence. The hidden dimension of the BiLSTM is set to 250.
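A minimal PyTorch sketch of this architecture follows. The vocabulary sizes, padding convention and the use of the third-party pytorch-crf package are assumptions of the sketch, not the patent's implementation; the embedding dimensions (300/150/50) and BiLSTM hidden size (250) come from the description above.

```python
import torch
import torch.nn as nn
from torchcrf import CRF  # third-party "pytorch-crf" package, an implementation choice

class BiLSTMCRF(nn.Module):
    """Sketch: char(300) + POS(150) + boundary(50) embeddings concatenated,
    a BiLSTM with hidden size 250, a linear emission layer and a CRF."""

    def __init__(self, n_chars, n_pos, n_bound, n_tags):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, 300, padding_idx=0)
        self.pos_emb = nn.Embedding(n_pos, 150, padding_idx=0)
        self.bound_emb = nn.Embedding(n_bound, 50, padding_idx=0)
        self.bilstm = nn.LSTM(300 + 150 + 50, 250,
                              bidirectional=True, batch_first=True)
        self.emission = nn.Linear(2 * 250, n_tags)
        self.crf = CRF(n_tags, batch_first=True)

    def _emissions(self, chars, pos, bounds):
        # Concatenate the three embeddings into one word embedding per character.
        x = torch.cat([self.char_emb(chars), self.pos_emb(pos),
                       self.bound_emb(bounds)], dim=-1)
        h, _ = self.bilstm(x)
        return self.emission(h)  # per-character scores for each tag

    def loss(self, chars, pos, bounds, tags, mask):
        # Negative log-likelihood of the gold tag sequence under the CRF.
        return -self.crf(self._emissions(chars, pos, bounds), tags, mask=mask)

    def decode(self, chars, pos, bounds, mask):
        # Viterbi decoding of the most likely tag sequence.
        return self.crf.decode(self._emissions(chars, pos, bounds), mask=mask)
```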
(3) Loss function:
In the process of training the BiLSTM-CRF model, the input sequence is defined as $X = (x_1, \ldots, x_i, \ldots, x_n)$ and the output sequence as $y = (y_1, \ldots, y_i, \ldots, y_n)$; the output matrix of the BiLSTM layer is $P$, where $P_{i,y_i}$ represents the non-normalized probability of word $x_i$ mapping to tag $y_i$; the dimension of $P$ is $n \times k$, with $k$ the number of label categories; the transition matrix of the CRF layer is $A$, where $A_{y_i,y_{i+1}}$ represents the transition score from label $y_i$ to label $y_{i+1}$, with dimension $(k+2) \times (k+2)$; the score function is defined as:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

wherein $y_0$ is the start symbol, $y_{n+1}$ is the end symbol, $Y_X$ denotes all possible predicted tag sequences, and $n$ is the number of words in the input sequence.

Next, for each possible predicted tag sequence, a probability value is defined using the Softmax function:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

The model loss function is then derived from the above as:

$$\mathcal{L} = -\log p(y \mid X) = -s(X, y) + \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$
after the loss values are obtained, all parameters of the model are optimized by performing gradient descent calculation using an Adam optimizer until the loss values are minimized.
The method also comprises evaluating the trained BiLSTM-CRF model with the following metrics:

$$\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}$$

wherein Precision is the accuracy and Recall the recall for each sensitive entity type $i$; the model correctly identifies a type-$i$ sensitive entity when the prediction result is completely consistent with the label content ($TP_i$), $FP_i$ counts type-$i$ predictions with no matching label, and $FN_i$ counts labeled type-$i$ entities that the model misses.

The model with the optimal evaluation result is saved as the final BiLSTM-CRF model.
S200, identifying the sensitive data type of the file by adopting the trained BilSTM-CRF model:
step S210, after text feature extraction and data cleaning are carried out on the file received from the client, the file is mapped into a vector matrix through the created core dictionary, the vector matrix is sent to the trained BilTM-CRF model for sensitive data recognition, the recognition result and the core dictionary of the label are reversely mapped into a label, the label is positioned to the text through the label, and the positioning and sensitive entity types are integrated and returned. The data cleaning comprises text denoising, full angle turning to half angle turning and sentence segmentation. The text denoising comprises the step of removing redundant line feed characters, space characters and messy codes in the text; converting the full angle to the half angle comprises converting English letters and punctuation marks into a half angle format; the sentence segmentation and word segmentation includes splitting the read long text by taking a sentence as a unit, and then performing splitting and length completion on each sentence of text by using Hanlp to obtain a word sequence, a part of speech sequence and a word boundary sequence, as shown in FIG. 3.
Step S220: data post-processing is performed on the result returned by the BiLSTM-CRF model and the final result is fed back; the data post-processing comprises data inverse mapping and sensitive data extraction.
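A sketch of the inverse mapping and sensitive-data extraction, using the "boundary_type" tag scheme from FIG. 4 (B_PER, M_PER, E_PER, O, etc.); the span-assembly logic and return format are assumptions of the sketch:

```python
def extract_entities(chars, tag_ids, id_to_tag):
    """Inverse-map predicted tag IDs to labels and pull out entity spans."""
    tags = [id_to_tag[i] for i in tag_ids]        # data inverse mapping
    entities, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B_"):
            start = i                             # entity begins here
        elif tag.startswith("E_") and start is not None:
            entities.append((tag[2:], start, i, "".join(chars[start:i + 1])))
            start = None
        elif tag.startswith("S_"):                # single-character entity
            entities.append((tag[2:], i, i, chars[i]))
            start = None
        elif tag == "O":
            start = None                          # non-entity resets the span
    return entities

id_to_tag = {0: "B_PER", 1: "M_PER", 2: "E_PER", 3: "O"}
print(extract_entities(list("小明在学"), [0, 2, 3, 3], id_to_tag))
# [('PER', 0, 1, '小明')]
```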
Although the present invention has been described herein with reference to the illustrated embodiments thereof, which are intended to be preferred embodiments of the present invention, it is to be understood that the invention is not limited thereto, and that numerous other modifications and embodiments can be devised by those skilled in the art that will fall within the spirit and scope of the principles of this disclosure.
Claims (6)
1. A sensitive data type identification method, comprising:
S100, training a BiLSTM-CRF model, comprising the following steps:
step S110, preprocessing training samples, wherein the preprocessing comprises text feature extraction, core dictionary construction, data vectorization and fixing the data length, and the text features comprise single characters, parts of speech and word boundaries; the core dictionaries comprise a core dictionary of single characters, a core dictionary of parts of speech, a core dictionary of word boundaries and a core dictionary of the labels corresponding to the single characters; the labels correspond one-to-one to the single characters and identify the sensitive entity types of the single characters; data vectorization means mapping the extracted single characters, parts of speech and word boundaries through their respective core dictionaries to obtain a vector matrix; fixing the data length means truncating or padding the vector matrix to a set length;
S120, training the BiLSTM-CRF model with the vector matrix of each piece of preprocessed data and the label sequence corresponding to each piece of data;
S200, identifying the sensitive data types of a file with the trained BiLSTM-CRF model:
S210, after performing text feature extraction and data cleaning on the received file, mapping it into a vector matrix through the created core dictionaries, sending the vector matrix to the trained BiLSTM-CRF model for sensitive data identification, inverse-mapping the identification result to labels through the label core dictionary, locating the labels in the text, and integrating and returning the positions and sensitive entity types.
S220, performing data post-processing on the result returned by the BiLSTM-CRF model and feeding back the final result, wherein the data post-processing comprises data inverse mapping and sensitive data extraction.
2. The sensitive data type identification method of claim 1, wherein the BiLSTM-CRF model comprises an embedding layer, a BiLSTM layer and a CRF layer; the preprocessing further comprises training the text data of the training samples with the pre-trained word vector tools Word2Vec, FastText and GloVe respectively to obtain a character embedding feature matrix, a part-of-speech embedding feature matrix and a word boundary embedding feature matrix, reading the word embedding feature matrix from the character embedding feature matrix, loading the matrices into the three embedding layers of the BiLSTM-CRF model, and concatenating the three embedding layers into one group of word embedding layers;
the BiLSTM layer receives the embedded layer of each piece of data and predicts the probability of each character for each label and inputs it to the CRF layer, which outputs the most likely annotation sequence.
3. The sensitive data type identification method according to claim 2, wherein, in the process of training the BiLSTM-CRF model, the input sequence is defined as $X = (x_1, \ldots, x_i, \ldots, x_n)$ and the output sequence as $y = (y_1, \ldots, y_i, \ldots, y_n)$; the output matrix of the BiLSTM layer is $P$, where $P_{i,y_i}$ represents the non-normalized probability of word $x_i$ mapping to tag $y_i$; the dimension of $P$ is $n \times k$, with $k$ the number of label categories; the transition matrix of the CRF layer is $A$, where $A_{y_i,y_{i+1}}$ represents the transition score from label $y_i$ to label $y_{i+1}$, with dimension $(k+2) \times (k+2)$; the score function is defined as:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

wherein $y_0$ is the start symbol, $y_{n+1}$ is the end symbol, $Y_X$ denotes all possible predicted tag sequences, and $n$ is the number of words in the input sequence;

next, for each possible predicted tag sequence, a probability value is defined using the Softmax function:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

the model loss function is then derived from the above as:

$$\mathcal{L} = -\log p(y \mid X) = -s(X, y) + \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$

after the loss value is obtained, all parameters of the model are optimized by gradient descent with an Adam optimizer until the loss value is minimized.
4. The method of claim 3, further comprising evaluating the trained BiLSTM-CRF model with the following metrics:

$$\mathrm{Precision}_i = \frac{TP_i}{TP_i + FP_i}, \qquad \mathrm{Recall}_i = \frac{TP_i}{TP_i + FN_i}$$

wherein Precision is the accuracy and Recall the recall for each sensitive entity type $i$; the model correctly identifies a type-$i$ sensitive entity when the prediction result is completely consistent with the label content ($TP_i$), $FP_i$ counts type-$i$ predictions with no matching label, and $FN_i$ counts labeled type-$i$ entities that the model misses;

and saving the model with the optimal evaluation result as the final BiLSTM-CRF model.
5. The sensitive data type identification method of claim 1, wherein the data cleaning in step S210 comprises text denoising, full-width to half-width conversion and sentence segmentation.
6. The method of claim 5, wherein the text denoising comprises removing redundant line breaks, spaces and garbled characters from the text; full-width to half-width conversion comprises converting English letters and punctuation marks to half-width format; and sentence segmentation and word segmentation comprise splitting the read long text sentence by sentence, then using Hanlp to segment each sentence and pad it to length, obtaining a character sequence, a part-of-speech sequence and a word boundary sequence.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111463036.XA CN114154503A (en) | 2021-12-02 | 2021-12-02 | Sensitive data type identification method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114154503A true CN114154503A (en) | 2022-03-08 |
Family
ID=80456224
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111463036.XA Pending CN114154503A (en) | 2021-12-02 | 2021-12-02 | Sensitive data type identification method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114154503A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109657230A (en) * | 2018-11-06 | 2019-04-19 | 众安信息技术服务有限公司 | Merge the name entity recognition method and device of term vector and part of speech vector |
CN110826320A (en) * | 2019-11-28 | 2020-02-21 | 上海观安信息技术股份有限公司 | Sensitive data discovery method and system based on text recognition |
CN111709242A (en) * | 2020-06-01 | 2020-09-25 | 广州多益网络股份有限公司 | Chinese punctuation mark adding method based on named entity recognition |
CN112232195A (en) * | 2020-10-15 | 2021-01-15 | 北京临近空间飞行器系统工程研究所 | Handwritten Chinese character recognition method, device and storage medium |
Non-Patent Citations (1)
Title |
---|
王子牛 (Wang Ziniu) et al.: "BERT-based Chinese named entity recognition method" (基于BERT的中文命名实体识别方法), Computer Science (《计算机科学》), vol. 46, no. 11, 15 November 2019 (2019-11-15), pages 138-142 *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116127321A (en) * | 2023-02-16 | 2023-05-16 | 广东工业大学 | Training method, pushing method and system for ship news pushing model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |