
CN114154503A - Sensitive data type identification method - Google Patents

Sensitive data type identification method

Info

Publication number
CN114154503A
Authority
CN
China
Prior art keywords
data
bilstm
word
model
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111463036.XA
Other languages
Chinese (zh)
Inventor
徐小雄
魏华强
彭曦
杨洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Cric Technology Co ltd
Original Assignee
Sichuan Cric Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Cric Technology Co ltd filed Critical Sichuan Cric Technology Co ltd
Priority to CN202111463036.XA
Publication of CN114154503A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a sensitive data type identification method, which comprises: preprocessing training samples, and training a BiLSTM-CRF model with the vector matrix of each piece of preprocessed data and the label sequence corresponding to each piece of data; identifying the sensitive data types of a file with the trained BiLSTM-CRF model; and, after data post-processing of the result returned by the BiLSTM-CRF model, feeding back the final result, wherein the data post-processing comprises data inverse mapping and sensitive data extraction. Based on sensitive data identification with a supervised bidirectional long short-term memory network and a conditional random field, the method scans data in the industrial internet in real time or offline, locates and identifies multiple sensitive data types from the text content of files, performs sensitive data type identification on multiple types of file data, and further improves model performance by combining a deep neural network with a CRF model.

Description

Sensitive data type identification method
Technical Field
The invention relates to the technical field of data security, in particular to a sensitive data type identification method.
Background
Currently, network attacks on enterprises are rising year by year and cause a large number of data leakage and data extortion incidents. The threat of data leakage is considerable, and in an industrial internet environment the protection of data, especially of sensitive data, is all the more important. In the prior art, some approaches scan data with preset rules; this brute-force search is inefficient, and the data must be analyzed manually and continuously to add new rules that improve the system. Others search for named entities in system logs with a CRF model, but that is not an identification task aimed at sensitive data types, and the data detected are system logs. Thus, the prior art offers no efficient way to identify sensitive data present in a file.
Disclosure of Invention
The invention aims to provide a sensitive data type identification method to solve the problem that the prior art has no efficient way to identify sensitive data present in a file.
The invention solves this problem through the following technical scheme:
a sensitive data type identification method, comprising:
S100, training a BiLSTM-CRF model, comprising the following steps:
Step S110, preprocessing training samples, wherein the preprocessing comprises text feature extraction, core dictionary construction, data vectorization and fixing the data length, and the text features comprise single characters, parts of speech and word boundaries; the core dictionaries comprise a core dictionary of single characters, a core dictionary of parts of speech, a core dictionary of word boundaries and a core dictionary of labels corresponding to the single characters; the labels correspond one-to-one to the single characters and identify the sensitive entity types of the single characters; data vectorization means that the extracted single characters, parts of speech and word boundaries are mapped through their respective core dictionaries to obtain a vector matrix; fixing the data length means truncating or padding the data length of the vector matrix;
S120, training the BiLSTM-CRF model with the vector matrix of each piece of preprocessed data and the label sequence corresponding to each piece of data;
S200, identifying the sensitive data types of the file with the trained BiLSTM-CRF model:
S210, after text feature extraction and data cleaning are performed on the received file, mapping it into a vector matrix through the created core dictionaries, sending the vector matrix to the trained BiLSTM-CRF model for sensitive data identification, inversely mapping the identification result through the core dictionary of labels back into labels, locating the labels in the text, and integrating and returning the locations and sensitive entity types;
S220, performing data post-processing on the result returned by the BiLSTM-CRF model and feeding back the final result, wherein the data post-processing comprises data inverse mapping and sensitive data extraction.
Preferably, the BiLSTM-CRF model comprises an embedding layer, a BiLSTM layer and a CRF layer; the preprocessing further comprises training the text data of the training samples with the pre-trained word vector tools Word2Vec, FastText and GloVe respectively to obtain a character embedding feature matrix, a part-of-speech embedding feature matrix and a word boundary embedding feature matrix, reading the word embedding feature matrix from the character embedding feature matrix, loading the matrices into the three embedding layers of the BiLSTM-CRF model, and concatenating the three embedding layers into one group of word embedding layers;
the BiLSTM layer receives the embedding of each piece of data, predicts the probability of each character for each label, and inputs it to the CRF layer, which outputs the most likely annotation sequence.
Preferably, in the process of training the BiLSTM-CRF model, the input sequence is defined as $X = (x_1, \ldots, x_i, \ldots, x_n)$ and the output sequence as $y = (y_1, \ldots, y_i, \ldots, y_n)$; the output matrix of the BiLSTM layer is $P$, where $P_{i,y_i}$ represents the non-normalized probability of word $x_i$ mapping to label $y_i$, and the dimension of $P$ is $n \times k$, with $k$ the number of label categories; the transition matrix of the CRF layer is $A$, where $A_{y_i,y_{i+1}}$ represents the transition score from label $y_i$ to label $y_{i+1}$, with dimension $(k+2) \times (k+2)$ to accommodate the added start and end symbols; the score function is defined as:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $y_0$ is the start symbol, $y_{n+1}$ is the end symbol, and $n$ is the number of words in the input sequence; next, for all possible predicted label sequences $\tilde{y} \in Y_X$, a probability value is defined with the Softmax function:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

the model loss function is then derived from the above function as:

$$\mathcal{L} = -\log p(y \mid X) = \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} - s(X, y)$$
after the loss values are obtained, all parameters of the model are optimized by performing gradient descent calculation using an Adam optimizer until the loss values are minimized.
Preferably, the method further comprises evaluating the trained BiLSTM-CRF model with the following parameters, where $TP_i$ is the number of type-$i$ sensitive entities the model identifies correctly (the prediction result is completely consistent with the label content), $FP_i$ the number of type-$i$ entities predicted incorrectly, and $FN_i$ the number of type-$i$ entities missed:

$$\text{Precision}_i = \frac{TP_i}{TP_i + FP_i}$$

$$\text{Recall}_i = \frac{TP_i}{TP_i + FN_i}$$

$$F1_i = \frac{2 \times \text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}$$

The model with the optimal evaluation result is saved as the final BiLSTM-CRF model.
Preferably, the data cleaning in step S210 comprises text denoising, full-width-to-half-width conversion, and sentence segmentation with word segmentation.
Preferably, text denoising comprises removing redundant line breaks, spaces and garbled characters from the text; full-width-to-half-width conversion comprises converting English letters and punctuation marks into half-width format; sentence segmentation with word segmentation comprises splitting the read long text sentence by sentence, then using Hanlp to segment each sentence and complete its length, obtaining a character sequence, a part-of-speech sequence and a word boundary sequence.
Compared with the prior art, the invention has the following advantages and beneficial effects:
the method is based on the sensitive data identification of a supervised two-way long-section memory network and a conditional random field, carries out real-time or off-line data scanning on data in the industrial Internet, and positions and identifies various sensitive data types from text contents in files. Automatically learning the mode of the text in an artificial intelligence mode to identify the sensitive data in the text; sensitive data type recognition can be carried out on various types of file data, and the performance of the model is further improved by combining a deep neural network and a CRF model.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic diagram of sample data preprocessing at a model training stage;
FIG. 3 is a schematic diagram of text data processing during a model identification phase;
FIG. 4 is a schematic diagram of a tag format;
FIG. 5 is a schematic structural diagram of the BiLSTM-CRF model.
Detailed Description
The present invention will be described in further detail with reference to examples, but the embodiments of the present invention are not limited thereto.
Before describing a specific embodiment of the present invention, the technical terms to which the present invention relates will be described:
Sensitive Data Identification: identification of sensitive data;
Industrial Internet of Things: the industrial internet;
OCR (Optical Character Recognition): optical character recognition;
Named Entity Recognition: named entity recognition;
Neural Network: neural network;
BiLSTM (Bidirectional Long Short-Term Memory): bidirectional long short-term memory network;
CRF (Conditional Random Field): conditional random field;
Embedding Layer: embedding layer.
Example:
Referring to FIG. 1, a sensitive data type identification method comprises:
S100, training a BiLSTM-CRF model, comprising the following steps:
Step S110, preprocessing training samples, wherein the preprocessing comprises text feature extraction, core dictionary construction, data vectorization and fixing the data length, and the text features comprise single characters, parts of speech and word boundaries; the core dictionaries comprise a core dictionary of single characters, a core dictionary of parts of speech, a core dictionary of word boundaries and a core dictionary of labels corresponding to the single characters; the labels correspond one-to-one to the single characters and identify the sensitive entity types of the single characters; data vectorization means that the extracted single characters, parts of speech and word boundaries are mapped through their respective core dictionaries to obtain a vector matrix; fixing the data length means truncating or padding the data length of the vector matrix;
As shown in FIG. 2, text feature extraction obtains word segmentation and part-of-speech information of the text data through the Hanlp word segmentation tool, and obtains word boundary information by converting the word segmentation information into BMESO format, where B (Beginning) marks the beginning of a word; M (Middle), the middle of a word; E (End), the end of a word; S (Single), a single-character word; and O (Outside), a non-entity character. For example, "Beijing University" (北京大学, four characters) is denoted "BMME" after word segmentation and BMES conversion. At the same time, each word boundary mark is assigned a corresponding number, forming a word boundary core dictionary that is stored, e.g. { "E":1, "B":2, "M":3 … }.
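To make the boundary conversion concrete, here is a minimal Python sketch; it assumes the word segmentation has already been produced (e.g. by Hanlp), and the function name and data layout are illustrative rather than taken from the patent:

```python
def words_to_bmes(words):
    """Convert segmented words to per-character BMES boundary tags."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")                     # single-character word
        else:
            tags.append("B")                     # beginning of word
            tags.extend("M" * (len(word) - 2))   # middle characters
            tags.append("E")                     # end of word
    return tags

# "Beijing University" segmented as one four-character word -> B M M E
print(words_to_bmes(["北京大学"]))  # ['B', 'M', 'M', 'E']
```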
Part-of-speech information works the same way: for example, the part of speech of "Xiaoming" obtained through the Hanlp tool is labeled "/nz", meaning "other proper noun"; "goes to school" is labeled "/vi", meaning "intransitive verb". Similarly, a part-of-speech core dictionary is constructed and stored, e.g. { "n":1, "w":2, "nnt":3 … }.
The text data is broken into individual characters, the frequency of occurrence of each character is counted, and characters with a frequency below 100 are deleted. By assigning a number to each remaining character, the core dictionary of single characters is constructed and stored, e.g. { ",":1, "。":2, "1":3, "0":4 … }.
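A sketch of this frequency-thresholded dictionary construction; the function name and the choice of reserving ID 0 for padding are assumptions, while the cutoff of 100 comes from the patent:

```python
from collections import Counter

def build_char_dict(texts, min_freq=100):
    """Assign an ID to every character whose corpus frequency >= min_freq."""
    counts = Counter(ch for text in texts for ch in text)
    kept = [ch for ch, freq in counts.most_common() if freq >= min_freq]
    # ID 0 is reserved for padding (an assumption; see the fixed data length step)
    return {ch: i for i, ch in enumerate(kept, start=1)}
```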
Label processing standardizes the label corresponding to each character of the text data; the label identifies the character's sensitive entity type. Part of the defined sensitive entity types and corresponding labels are shown in the following table:
(Table: defined sensitive entity types and corresponding labels — available only as images in the original document.)
The labels are processed into the format "word boundary_entity type" shown in FIG. 4, and each processed new type is assigned an index number, forming the core dictionary of labels, which is stored, e.g. { "B_PER":0, "M_PER":1, "E_PER":2, … }.
Text mapping: the training text is divided into a single-character sequence plus the part-of-speech and word boundary sequences obtained with Hanlp, and these are mapped through the core dictionary of single characters, the core dictionary of parts of speech and the core dictionary of word boundaries respectively, yielding three groups of ID-mapped vector matrices. As shown in the ID-mapping part of FIG. 2, the character vector matrix, the part-of-speech vector matrix and the word boundary vector matrix are obtained, from top to bottom, after mapping the sample sentence "Xiaoming studies at Peking University".
Fixed data length: in the model training process, every piece of data in an input batch must have the same length, so each group of vector matrices is truncated or padded. For example, if the longest data length in the current batch is 150, any row of vectors in the batch's vector matrix group shorter than 150 is padded with 0 at the tail until its length reaches 150.
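A minimal sketch of this truncate-or-pad step (names are illustrative; the per-batch maximum length and tail-padding with 0 follow the patent):

```python
def pad_or_truncate(ids, max_len, pad_id=0):
    """Fix an ID sequence to exactly max_len entries."""
    if len(ids) >= max_len:
        return ids[:max_len]                       # truncate long rows
    return ids + [pad_id] * (max_len - len(ids))   # pad short rows at the tail

batch = [[5, 2, 9], [7, 1, 4, 4, 3]]
max_len = max(len(row) for row in batch)  # longest data length in the batch
print([pad_or_truncate(row, max_len) for row in batch])
# [[5, 2, 9, 0, 0], [7, 1, 4, 4, 3]]
```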
Step S120, training the BiLSTM-CRF model with the vector matrix of each piece of preprocessed data and the label sequence corresponding to each piece of data; the structure of the BiLSTM-CRF model is shown in FIG. 5.
After data processing, data samples are obtained in which the vector matrix of each piece of data is $x \in \{x_1, x_2, x_3, \ldots, x_n\}$ and the label sequence of each piece of data is $y \in \{y_1, y_2, y_3, \ldots, y_n\}$; the model optimizes its parameters through an optimization algorithm and the training data labels, and the finally trained model can identify the sensitive entity types from the input text content. The structure of the BiLSTM-CRF model used is described below; for the part-of-speech and word boundary embedding feature matrices, random initialization is used.
(1) Embedding layer: the word embedding feature matrix is read from the saved character embedding feature matrix (Word2Vec, FastText, GloVe) files and loaded into the three embedding layers; for the part-of-speech and word boundary embedding feature matrices, two randomly initialized embedding layers are constructed. The character embedding dimension is 300, the part-of-speech embedding dimension is 150, and the word boundary embedding dimension is 50. The groups of word embeddings are concatenated into one word embedding layer. The purpose of this layer is to reduce the dimensionality of the input data while ensuring that data information is not excessively lost.
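A PyTorch sketch of this three-way embedding, assuming PyTorch as the framework (the patent names none); only the dimensions 300/150/50 and the random initialization of the part-of-speech and boundary embeddings come from the text:

```python
import torch
import torch.nn as nn

class TripleEmbedding(nn.Module):
    """Concatenate character, part-of-speech and word-boundary embeddings."""

    def __init__(self, n_chars, n_pos, n_bounds, char_vectors=None):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, 300, padding_idx=0)
        if char_vectors is not None:
            # Pretrained (n_chars, 300) character matrix: Word2Vec/FastText/GloVe
            self.char_emb.weight.data.copy_(char_vectors)
        self.pos_emb = nn.Embedding(n_pos, 150, padding_idx=0)      # random init
        self.bound_emb = nn.Embedding(n_bounds, 50, padding_idx=0)  # random init

    def forward(self, chars, pos, bounds):
        # (batch, seq_len, 300 + 150 + 50) = (batch, seq_len, 500)
        return torch.cat(
            [self.char_emb(chars), self.pos_emb(pos), self.bound_emb(bounds)],
            dim=-1,
        )
```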
(2) BiLSTM-CRF layer: the BiLSTM receives the embedding of each piece of data, predicts the score of each character for each training label (emission score), and inputs it into the CRF layer, which outputs the most likely annotation sequence. The hidden dimension parameter of the BiLSTM is set to 250.
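Continuing the sketch above, the BiLSTM emission scores and CRF decoding could look as follows; the `pytorch-crf` package is an assumption, since the patent does not name a CRF implementation:

```python
import torch.nn as nn
from torchcrf import CRF  # pip install pytorch-crf (assumed implementation)

class BiLSTMCRF(nn.Module):
    def __init__(self, embed: TripleEmbedding, num_tags: int):
        super().__init__()
        self.embed = embed
        # Hidden dimension 250 per direction, as stated in the patent
        self.lstm = nn.LSTM(500, 250, bidirectional=True, batch_first=True)
        self.emission = nn.Linear(2 * 250, num_tags)  # per-tag emission scores
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, chars, pos, bounds, tags, mask):
        feats, _ = self.lstm(self.embed(chars, pos, bounds))
        # Negative log-likelihood of the gold label sequence (mask: bool tensor)
        return -self.crf(self.emission(feats), tags, mask=mask)

    def decode(self, chars, pos, bounds, mask):
        feats, _ = self.lstm(self.embed(chars, pos, bounds))
        return self.crf.decode(self.emission(feats), mask=mask)  # best sequences
```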
(3) Loss function:
In the process of training the BiLSTM-CRF model, the input sequence is defined as $X = (x_1, \ldots, x_i, \ldots, x_n)$ and the output sequence as $y = (y_1, \ldots, y_i, \ldots, y_n)$; the output matrix of the BiLSTM layer is $P$, where $P_{i,y_i}$ represents the non-normalized probability of word $x_i$ mapping to label $y_i$, and the dimension of $P$ is $n \times k$, with $k$ the number of label categories; the transition matrix of the CRF layer is $A$, where $A_{y_i,y_{i+1}}$ represents the transition score from label $y_i$ to label $y_{i+1}$, with dimension $(k+2) \times (k+2)$ to accommodate the added start and end symbols; the score function is defined as:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $y_0$ is the start symbol, $y_{n+1}$ is the end symbol, and $n$ is the number of words in the input sequence; next, for all possible predicted label sequences $\tilde{y} \in Y_X$, a probability value is defined with the Softmax function:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

the model loss function is then derived from the above function as:

$$\mathcal{L} = -\log p(y \mid X) = \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} - s(X, y)$$
after the loss values are obtained, all parameters of the model are optimized by performing gradient descent calculation using an Adam optimizer until the loss values are minimized.
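A training-loop sketch under the same assumptions; the learning rate, epoch count and the synthetic batch are illustrative, with only the Adam optimizer named by the patent:

```python
model = BiLSTMCRF(TripleEmbedding(n_chars=5000, n_pos=60, n_bounds=6), num_tags=32)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One synthetic batch standing in for the real preprocessed data
chars = torch.randint(1, 5000, (8, 150))
pos = torch.randint(1, 60, (8, 150))
bounds = torch.randint(1, 6, (8, 150))
tags = torch.randint(0, 32, (8, 150))
mask = torch.ones(8, 150, dtype=torch.bool)

for epoch in range(10):
    optimizer.zero_grad()
    loss = model.loss(chars, pos, bounds, tags, mask)  # CRF NLL from above
    loss.backward()
    optimizer.step()  # Adam updates all model parameters
```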
The method also comprises evaluating the trained BiLSTM-CRF model with the following parameters, where $TP_i$ is the number of type-$i$ sensitive entities the model identifies correctly (the prediction result is completely consistent with the label content), $FP_i$ the number of type-$i$ entities predicted incorrectly, and $FN_i$ the number of type-$i$ entities missed:

$$\text{Precision}_i = \frac{TP_i}{TP_i + FP_i}$$

$$\text{Recall}_i = \frac{TP_i}{TP_i + FN_i}$$

$$F1_i = \frac{2 \times \text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}$$

The model with the optimal evaluation result is saved as the final BiLSTM-CRF model.
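A sketch of these per-type, entity-level metrics; treating entities as exact spans follows the "completely consistent with the label content" criterion, while the set-based layout is an assumption:

```python
def evaluate_type(pred_spans, gold_spans):
    """Precision/recall/F1 for one sensitive-entity type.

    Both arguments are sets of (start, end) spans; a prediction counts
    as correct only if it exactly matches a gold span.
    """
    tp = len(pred_spans & gold_spans)
    fp = len(pred_spans - gold_spans)
    fn = len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```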
S200, identifying the sensitive data type of the file by adopting the trained BilSTM-CRF model:
step S210, after text feature extraction and data cleaning are carried out on the file received from the client, the file is mapped into a vector matrix through the created core dictionary, the vector matrix is sent to the trained BilTM-CRF model for sensitive data recognition, the recognition result and the core dictionary of the label are reversely mapped into a label, the label is positioned to the text through the label, and the positioning and sensitive entity types are integrated and returned. The data cleaning comprises text denoising, full angle turning to half angle turning and sentence segmentation. The text denoising comprises the step of removing redundant line feed characters, space characters and messy codes in the text; converting the full angle to the half angle comprises converting English letters and punctuation marks into a half angle format; the sentence segmentation and word segmentation includes splitting the read long text by taking a sentence as a unit, and then performing splitting and length completion on each sentence of text by using Hanlp to obtain a word sequence, a part of speech sequence and a word boundary sequence, as shown in FIG. 3.
Step S220, performing data post-processing on the result returned by the BiLSTM-CRF model and feeding back the final result, wherein the data post-processing comprises data inverse mapping and sensitive data extraction.
Although the present invention has been described herein with reference to the illustrated embodiments thereof, which are intended to be preferred embodiments of the present invention, it is to be understood that the invention is not limited thereto, and that numerous other modifications and embodiments can be devised by those skilled in the art that will fall within the spirit and scope of the principles of this disclosure.

Claims (6)

1. A sensitive data type identification method, comprising:
S100, training a BiLSTM-CRF model, comprising the following steps:
Step S110, preprocessing training samples, wherein the preprocessing comprises text feature extraction, core dictionary construction, data vectorization and fixing the data length, and the text features comprise single characters, parts of speech and word boundaries; the core dictionaries comprise a core dictionary of single characters, a core dictionary of parts of speech, a core dictionary of word boundaries and a core dictionary of labels corresponding to the single characters; the labels correspond one-to-one to the single characters and identify the sensitive entity types of the single characters; data vectorization means that the extracted single characters, parts of speech and word boundaries are mapped through their respective core dictionaries to obtain a vector matrix; fixing the data length means truncating or padding the data length of the vector matrix;
S120, training the BiLSTM-CRF model with the vector matrix of each piece of preprocessed data and the label sequence corresponding to each piece of data;
S200, identifying the sensitive data types of the file with the trained BiLSTM-CRF model:
S210, after text feature extraction and data cleaning are performed on the received file, mapping it into a vector matrix through the created core dictionaries, sending the vector matrix to the trained BiLSTM-CRF model for sensitive data identification, inversely mapping the identification result through the core dictionary of labels back into labels, locating the labels in the text, and integrating and returning the locations and sensitive entity types;
S220, performing data post-processing on the result returned by the BiLSTM-CRF model and feeding back the final result, wherein the data post-processing comprises data inverse mapping and sensitive data extraction.
2. The sensitive data type identification method of claim 1, wherein the BiLSTM-CRF model comprises an embedding layer, a BiLSTM layer and a CRF layer; the preprocessing further comprises training the text data of the training samples with the pre-trained word vector tools Word2Vec, FastText and GloVe respectively to obtain a character embedding feature matrix, a part-of-speech embedding feature matrix and a word boundary embedding feature matrix, reading the word embedding feature matrix from the character embedding feature matrix, loading the matrices into the three embedding layers of the BiLSTM-CRF model, and concatenating the three embedding layers into one group of word embedding layers;
the BiLSTM layer receives the embedding of each piece of data, predicts the probability of each character for each label, and inputs it to the CRF layer, which outputs the most likely annotation sequence.
3. The sensitive data type identification method according to claim 2, wherein in the process of training the BiLSTM-CRF model, the input sequence is defined as $X = (x_1, \ldots, x_i, \ldots, x_n)$ and the output sequence as $y = (y_1, \ldots, y_i, \ldots, y_n)$; the output matrix of the BiLSTM layer is $P$, where $P_{i,y_i}$ represents the non-normalized probability of word $x_i$ mapping to label $y_i$, and the dimension of $P$ is $n \times k$, with $k$ the number of label categories; the transition matrix of the CRF layer is $A$, where $A_{y_i,y_{i+1}}$ represents the transition score from label $y_i$ to label $y_{i+1}$, with dimension $(k+2) \times (k+2)$; the score function is defined as:

$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $y_0$ is the start symbol, $y_{n+1}$ is the end symbol, and $n$ is the number of words in the input sequence; next, for all possible predicted label sequences $\tilde{y} \in Y_X$, a probability value is defined with the Softmax function:

$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$

the model loss function is then derived from the above function as:

$$\mathcal{L} = -\log p(y \mid X) = \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})} - s(X, y)$$
after the loss values are obtained, all parameters of the model are optimized by performing gradient descent calculation using an Adam optimizer until the loss values are minimized.
4. The method of claim 3, further comprising evaluating the trained BiLSTM-CRF model with the following parameters, where $TP_i$ is the number of type-$i$ sensitive entities the model identifies correctly (the prediction result is completely consistent with the label content), $FP_i$ the number of type-$i$ entities predicted incorrectly, and $FN_i$ the number of type-$i$ entities missed:

$$\text{Precision}_i = \frac{TP_i}{TP_i + FP_i}$$

$$\text{Recall}_i = \frac{TP_i}{TP_i + FN_i}$$

$$F1_i = \frac{2 \times \text{Precision}_i \times \text{Recall}_i}{\text{Precision}_i + \text{Recall}_i}$$

and saving the model with the optimal evaluation result as the final BiLSTM-CRF model.
5. The sensitive data type identification method of claim 1, wherein the data cleaning in step S210 comprises text denoising, full-width-to-half-width conversion, and sentence segmentation with word segmentation.
6. The method of claim 5, wherein text denoising comprises removing redundant line breaks, spaces and garbled characters from the text; full-width-to-half-width conversion comprises converting English letters and punctuation marks into half-width format; sentence segmentation with word segmentation comprises splitting the read long text sentence by sentence, then using Hanlp to segment each sentence and complete its length, obtaining a character sequence, a part-of-speech sequence and a word boundary sequence.
CN202111463036.XA 2021-12-02 2021-12-02 Sensitive data type identification method Pending CN114154503A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111463036.XA CN114154503A (en) 2021-12-02 2021-12-02 Sensitive data type identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111463036.XA CN114154503A (en) 2021-12-02 2021-12-02 Sensitive data type identification method

Publications (1)

Publication Number Publication Date
CN114154503A 2022-03-08

Family

ID=80456224

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111463036.XA Pending CN114154503A (en) 2021-12-02 2021-12-02 Sensitive data type identification method

Country Status (1)

Country Link
CN (1) CN114154503A (en)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109657230A (en) * 2018-11-06 2019-04-19 众安信息技术服务有限公司 Merge the name entity recognition method and device of term vector and part of speech vector
CN110826320A (en) * 2019-11-28 2020-02-21 上海观安信息技术股份有限公司 Sensitive data discovery method and system based on text recognition
CN111709242A (en) * 2020-06-01 2020-09-25 广州多益网络股份有限公司 Chinese punctuation mark adding method based on named entity recognition
CN112232195A (en) * 2020-10-15 2021-01-15 北京临近空间飞行器系统工程研究所 Handwritten Chinese character recognition method, device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Ziniu et al., "Chinese Named Entity Recognition Method Based on BERT", Computer Science, vol. 46, no. 11, 15 November 2019, pp. 138-142 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116127321A (en) * 2023-02-16 2023-05-16 广东工业大学 Training method, pushing method and system for ship news pushing model

Similar Documents

Publication Publication Date Title
CN111222305B (en) Information structuring method and device
CN111414479B (en) Label extraction method based on short text clustering technology
CN110580308B (en) Information auditing method and device, electronic equipment and storage medium
CN114036930A (en) Text error correction method, device, equipment and computer readable medium
CN113553853B (en) Named entity recognition method and device, computer equipment and storage medium
CN110569486B (en) Sequence labeling method and device based on double architectures and computer equipment
CN113590778A (en) Intelligent customer service intention understanding method, device, equipment and storage medium
CN110413972B (en) Intelligent table name field name complementing method based on NLP technology
CN114298035A (en) Text recognition desensitization method and system thereof
CN114416979A (en) Text query method, text query equipment and storage medium
CN115827819A (en) Intelligent question and answer processing method and device, electronic equipment and storage medium
CN110008473B (en) Medical text named entity identification and labeling method based on iteration method
CN117668180A (en) Document question-answering method, document question-answering device, and readable storage medium
CN115438650B (en) Contract text error correction method, system, equipment and medium fusing multi-source characteristics
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
CN111708870A (en) Deep neural network-based question answering method and device and storage medium
CN112417823A (en) Chinese text word order adjusting and quantitative word completion method and system
CN114154503A (en) Sensitive data type identification method
CN112445862B (en) Internet of things equipment data set construction method and device, electronic equipment and storage medium
CN114139537A (en) Word vector generation method and device
CN117291192B (en) Government affair text semantic understanding analysis method and system
CN112052649B (en) Text generation method, device, electronic equipment and storage medium
CN112784227A (en) Dictionary generating system and method based on password semantic structure
CN111104422A (en) Training method, device, equipment and storage medium of data recommendation model
CN111581963B (en) Method and device for extracting time character string, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination