CN116644737A

CN116644737A - Proper noun error correction method based on automatic word stock updating and prefix tree structure

Info

Publication number: CN116644737A
Application number: CN202310163847.0A
Authority: CN
Inventors: 王晶; 李国定
Original assignee: Consistent Zhifu Hangzhou Technology Co ltd
Current assignee: Consistent Zhifu Hangzhou Technology Co ltd
Priority date: 2023-02-09
Filing date: 2023-02-09
Publication date: 2023-08-25

Abstract

The invention discloses a proper noun error correction method based on automatic lexicon updating and a prefix tree structure, comprising the following steps: step 1, automatically updating a proper noun lexicon; step 2, acquiring trigger words of the proper nouns, and acquiring a prefix tree of the trigger words and a trigger dictionary of the proper nouns; step 3, searching for trigger words in the text to be corrected based on the prefix tree, and acquiring proper noun candidate words based on the trigger dictionary; and step 4, intercepting a plurality of text fragments associated with the trigger words from the text to be corrected, calculating the edit distances between the text fragments and the proper noun candidate words, and, taking the minimum edit distance as the target, selecting the longest text fragment on which to execute the corresponding editing operation.

Description

Proper noun error correction method based on automatic word stock updating and prefix tree structure

Technical Field

The invention belongs to the technical field of semantic recognition, and particularly relates to a proper noun error correction method based on automatic word stock updating and prefix tree structure.

Background

Proper nouns in news texts often contain spelling, grammatical and similar errors, which give readers a poor reading experience and undermine the credibility of the news. Manual auditing is costly, so building an automatic error correction method is important for improving the efficiency of news auditing and reducing its cost. As a practical application scenario in the field of text error correction, proper noun error correction has good prospects for checking and correcting news texts. Current mainstream technical schemes fall into two types: model-based error correction and rule-based error correction. The model-based method trains a deep learning model with an error correction function on a labeled training corpus; the rule-based method constructs a lexicon in advance, detects erroneous words through formulated rule logic, and searches the lexicon for related words as the correct words by consulting a near-pronunciation dictionary and a near-shape dictionary.

In the proper noun error correction scenario, the model-based method is often inferior to the rule-based method in accuracy and stability. On the other hand, the accuracy of the rule-based method depends mainly on the lexicon, so new words must be continually added to the lexicon by hand, which is costly.

Disclosure of Invention

The invention aims to provide a proper noun error correction method based on automatic lexicon updating and a prefix tree structure, so as to solve the problems identified in the background art: that the model-based error correction method is poor in accuracy and stability, and that maintaining the accuracy of the rule-based error correction method is costly.

In order to achieve the above purpose, the present invention provides the following technical solutions:

a proper noun error correction method based on automated lexicon updating and prefix tree structure, the method comprising the steps of:

step 1, automatically updating a proper noun lexicon;

step 2, acquiring trigger words of proper nouns, and acquiring prefix trees related to the trigger words and trigger dictionaries of the proper nouns;

step 3, searching for trigger words in the text to be corrected based on the prefix tree, and acquiring proper noun candidate words based on the trigger dictionary;

step 4, intercepting a plurality of text fragments associated with the trigger words from the text to be corrected, calculating the edit distances between the text fragments and the proper noun candidate words, and, taking the minimum edit distance as the target, selecting the longest text fragment on which to execute the corresponding editing operation.

Preferably, in the step 1, the automatic updating of the proper noun lexicon acquires text data through a timed crawling task and performs proper noun recognition through a trained entity recognition model, so as to update the proper noun lexicon.

Preferably, the step 1 includes the steps of:

step 1.1, crawling news texts containing proper nouns, and performing manual labeling, including labeling of the proper nouns;

step 1.2, training on the labeled text by utilizing NER entity recognition technology to obtain a trained entity recognition model;

step 1.3, regularly crawling news texts containing proper nouns and identifying through a trained entity identification model to obtain new proper nouns;

and step 1.4, adding the new proper nouns into a proper noun library to update the proper noun library.

Preferably, the step 2 includes the steps of:

step 2.1, slicing any proper noun in the proper noun library and taking the sliced proper noun as a trigger word of the proper noun;

step 2.2, constructing a prefix tree based on the trigger words, and constructing a trigger dictionary {key: [word1, word2, ...]} based on the correspondence between the trigger words and the proper nouns, where key represents a trigger word and word represents a proper noun.

Preferably, in the step 2.1, for any proper noun with word length L_i, windows with lengths in the interval [L_i/2, L_i] are generated, and the proper noun is sliced by a sliding-window method into the trigger words of the proper noun.

Preferably, the step 4 includes the steps of:

step 4.1, based on the trigger word length, intercepting window fragments at corresponding positions in the text to be corrected;

step 4.2, intercepting a plurality of text fragments in the window fragments based on the offset of the trigger word positions;

step 4.3, comparing each text segment with the candidate word to obtain the editing distance and the editing operation;

step 4.4, retaining the editing operation with the shortest edit distance and the longest text fragment as the optimal error correction for the candidate word.

Preferably, in step 4.1, if the trigger word is at the start position or the end position, characters of a preset length are intercepted before or after the trigger word and form the window segment together with the candidate word; otherwise, characters of the preset length are intercepted both before and after the trigger word and form the window segment together with the candidate word.

Preferably, the step 4.2 comprises the steps of:

step 4.2.1, determining an interception length interval for the text fragments, wherein the minimum interception length is the trigger word length and the maximum interception length is the window fragment length;

step 4.2.2, selecting different interception lengths within the interception length interval, and intercepting the text fragments from the window fragments by a sliding-window method.

Preferably, in the step 4.4, if two editing operations apply to the same place in the text to be corrected and the error correction operations are identical, only one is retained; if the text fragments targeted by two editing operations are in a containment relation, the one for the longer fragment is retained.

Compared with the prior art, the invention has the beneficial effects that:

The invention uses entity recognition technology to update the proper noun error correction lexicon continuously and automatically, adding new words to the lexicon through this automatic update and thereby solving the high cost of updating the lexicon manually; meanwhile, the error detection method based on the prefix tree and sliding windows remains compatible with the updated lexicon in real time, solving the problem of poor reusability.

Drawings

FIG. 1 is a flow chart of the present invention.

Fig. 2 is a schematic diagram of an entity recognition model.

Fig. 3 is a schematic diagram of the structure of a prefix tree.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and fully below with reference to the accompanying drawings. The embodiments described are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person skilled in the art on the basis of these embodiments without inventive effort fall within the scope of protection of the present invention.

Referring to fig. 1, a proper noun error correction method based on automatic lexicon update and prefix tree structure is implemented in 4 steps.

Step 1, automatic update of the proper noun lexicon: text data is obtained through a timed crawling task, and proper noun recognition is carried out through the trained entity recognition model so as to update the proper noun lexicon.
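The update loop of step 1 can be sketched as follows. This is a minimal illustration only: `update_lexicon` and `toy_recognizer` are hypothetical names, and the recognizer here is a trivial stand-in for the trained entity recognition model, not the model described in this document.

```python
from typing import Callable, Iterable, List, Set


def update_lexicon(
    lexicon: Set[str],
    texts: Iterable[str],
    recognize: Callable[[str], List[str]],
) -> Set[str]:
    """Add newly recognized proper nouns to `lexicon` and return the additions."""
    new_nouns: Set[str] = set()
    for text in texts:
        for noun in recognize(text):
            if noun not in lexicon:
                new_nouns.add(noun)
    lexicon |= new_nouns  # in-place update of the caller's set
    return new_nouns


# Hypothetical stand-in recognizer: treats capitalized tokens as proper nouns.
def toy_recognizer(text: str) -> List[str]:
    return [tok for tok in text.split() if tok[:1].isupper()]


lexicon = {"Alpha"}
added = update_lexicon(lexicon, ["Alpha met Beta", "then Gamma"], toy_recognizer)
```

In a scheduled job, `texts` would be the freshly crawled news texts and `recognize` would wrap the trained model's inference call.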

Specifically, the step 1 includes the following procedures:

As one embodiment of the present invention, the specific operation of step 1.1 is as follows: 1000 news texts containing proper nouns are crawled from an official website, and each text is segmented according to a maximum length of 512, finally yielding 4000 pieces of data, which are divided into a training set and a verification set at a ratio of 8:2. The proper nouns, their categories, and their head and tail positions are labeled in the text contained in each piece of data. Each labeled text is segmented using a tokenizer to obtain the corresponding text sequence (w_1, w_2, ..., w_m), 4000 sequences in total. Each labeled text is preprocessed using the BIO labeling scheme, referring to FIG. 1, as follows: the starting position of each proper noun is marked "B-X", and the positions other than the starting position within each proper noun are marked "I-X", where "X" represents the category of the current noun, such as "political policy class"; in addition, the non-proper-noun parts of the text are marked with the letter "O". A label sequence (t_1, t_2, ..., t_m) is thus obtained.

The operation of step 1.2 is as follows: the text sequence (w_1, w_2, ..., w_m) is taken as the input of the pre-trained model Bert, giving the output (v_1, v_2, ..., v_m), where each element v_ε (ε = 1, 2, ..., m) is a 768-dimensional vector. The result learned by the pre-trained model Bert is taken as the input of a fully connected layer, so that the dimension of each element v_ε is reduced to the number of "BIO" labels, giving (α_1, α_2, ..., α_m). (α_1, α_2, ..., α_m) is taken as the current label scores of the CRF model, a label transition matrix is randomly initialized, and the label sequence (t_1, t_2, ..., t_m) is taken as the target path to learn the association information between labels. Assuming that the number of all possible paths for the label transitions of the current text is N, the total score of all paths is calculated as:

wherein the path score denotes the score of the i-th label path up to the tail position m of the text, and S_i denotes the sum of the label scores and the label transition scores of the i-th label path;

defining a loss function as:

wherein the label score term denotes the label score at the i-th position of the text under the correct path, and the transition score term denotes the score of transitioning from the label y_i at the i-th position of the text to the label y_{i+1} at the (i+1)-th position under the correct path;
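The formulas referred to in this passage are not reproduced in the text. A conventional linear-chain CRF formulation that is consistent with the definitions given above would read as follows; the symbols P_total, P_i, S_real, x and t are assumptions chosen for illustration and are not taken from the original.

```latex
% Total score over all N label paths, with P_i = e^{S_i}
P_{\mathrm{total}} = \sum_{i=1}^{N} P_i = \sum_{i=1}^{N} e^{S_i}

% Score of the correct path: label scores plus label transition scores
S_{\mathrm{real}} = \sum_{i=1}^{m} x_{i,\,y_i} + \sum_{i=1}^{m-1} t_{y_i,\,y_{i+1}}

% Negative log-likelihood loss
\mathrm{Loss} = -\log \frac{e^{S_{\mathrm{real}}}}{P_{\mathrm{total}}}
             = \log P_{\mathrm{total}} - S_{\mathrm{real}}
```

Under this reading, minimizing the loss raises the score of the labeled path relative to the total over all N paths.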

The entity recognition model Bert-CRF is trained on the training set by gradient descent so that the loss function reaches a minimum. After each round of training, the model is evaluated on the verification set, and the model parameters of the round with the best evaluation result are saved, giving the trained entity recognition model.
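The BIO labeling used to prepare the training data in step 1.1 can be illustrated with a short sketch. The function name `bio_tags`, the span format (start, end, category) with an exclusive end, and the category name "POLICY" are assumptions made for this example.

```python
from typing import List, Tuple


def bio_tags(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Label tokens with BIO tags.

    Each span is (start, end, category) in token indices, with `end` exclusive.
    """
    tags = ["O"] * len(tokens)  # non-proper-noun positions
    for start, end, category in spans:
        tags[start] = f"B-{category}"  # first position of the proper noun
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"  # remaining positions of the proper noun
    return tags


tokens = ["the", "new", "tax", "law", "applies"]
tags = bio_tags(tokens, [(1, 4, "POLICY")])
print(tags)  # ['O', 'B-POLICY', 'I-POLICY', 'I-POLICY', 'O']
```

The resulting tag list plays the role of the label sequence (t_1, ..., t_m) that the CRF layer is trained against.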

Step 2, acquiring the trigger words of the proper nouns, and acquiring a prefix tree of the trigger words and a trigger dictionary of the proper nouns.

The step 2 specifically comprises the following steps:

step 2.2, constructing a prefix tree based on the trigger words, and constructing a trigger dictionary {key: [word1, word2, ...]} based on the correspondence between the trigger words and the proper nouns, where key represents a trigger word and word represents a proper noun.

As one embodiment of the present invention, in the step 2.1, for any proper noun with word length L_i, windows with lengths in the interval [L_i/2, L_i] are generated, and the proper noun is sliced by a sliding-window method into the trigger words of the proper noun. For example, for the proper noun "new tax law" the window length interval is 2 to 4 characters, giving 3 window lengths (2, 3 and 4); the trigger words of the proper noun are "new", "tax", "new tax", "tax", and "new tax".
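The slicing of step 2.1 can be sketched as below. The function name `trigger_words` is hypothetical, the four-character string "ABCD" stands in for a four-character proper noun, and rounding L_i/2 up for odd lengths is an assumption, since the text only gives an even-length example.

```python
from typing import List


def trigger_words(noun: str) -> List[str]:
    """Slice a proper noun into trigger words with windows of length L/2 .. L."""
    length = len(noun)
    min_window = max(1, (length + 1) // 2)  # assumption: L/2 rounded up
    triggers = []
    for window in range(min_window, length + 1):
        for start in range(length - window + 1):
            triggers.append(noun[start:start + window])
    return triggers


print(trigger_words("ABCD"))
# ['AB', 'BC', 'CD', 'ABC', 'BCD', 'ABCD']
```

For a word of length 4 this uses the three window lengths 2, 3 and 4, matching the example above.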

In step 2.2 of the present invention, taking the proper noun "personal income tax" as an example, the marker "1" indicates that the end of the search has been reached; the tree structure is shown in fig. 3.
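One possible way to build the prefix tree and the trigger dictionary of step 2.2 is sketched below, using nested dictionaries. All names are hypothetical. The end marker is written as the multi-character key "<end>" rather than "1" so that it cannot collide with a single-character edge; that substitution is a choice made for this sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

END = "<end>"  # end-of-trigger marker (the figure described above uses "1")


def sliding_triggers(noun: str) -> List[str]:
    """Trigger words from windows of length L/2 .. L (L/2 rounded up: assumption)."""
    n = len(noun)
    return [noun[s:s + w]
            for w in range(max(1, (n + 1) // 2), n + 1)
            for s in range(n - w + 1)]


def build_index(lexicon: List[str]) -> Tuple[dict, Dict[str, List[str]]]:
    """Return (prefix tree, trigger dictionary {trigger: [proper nouns]})."""
    trie: dict = {}
    trigger_dict: Dict[str, List[str]] = defaultdict(list)
    for noun in lexicon:
        for trig in sliding_triggers(noun):
            if noun not in trigger_dict[trig]:
                trigger_dict[trig].append(noun)
            node = trie
            for ch in trig:
                node = node.setdefault(ch, {})  # descend, creating nodes as needed
            node[END] = True  # a trigger word ends at this node
    return trie, dict(trigger_dict)


trie, trigger_dict = build_index(["ABCD", "ABXY"])
print(trigger_dict["AB"])  # ['ABCD', 'ABXY']
```

Because the index is rebuilt directly from the lexicon, it can be regenerated whenever step 1 adds new proper nouns.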

Step 3, searching for trigger words in the text to be corrected according to the prefix tree, and, based on the trigger words found, acquiring all corresponding proper nouns from the trigger word dictionary as candidate words.

For example, if the text to be corrected is "the personal tax should be paid on time and the @ is performed", the trigger word "personal place" is found according to the prefix tree, and all proper nouns [word1, word2, ...] related to the trigger word "personal place" are recorded in the trigger word dictionary.
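The lookup of step 3 can be sketched as a scan that walks the prefix tree from every start position of the text. The names `build_trie` and `find_triggers`, the "<end>" marker, and the toy trigger dictionary are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Tuple

END = "<end>"  # end-of-trigger marker (hypothetical key name)


def build_trie(words: Iterable[str]) -> dict:
    root: dict = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def find_triggers(text: str, trie: dict) -> List[Tuple[int, str]]:
    """Return (start index, trigger word) for every trigger found in `text`."""
    hits = []
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:  # no trigger continues with this character
                break
            if END in node:  # a complete trigger word ends here
                hits.append((start, text[start:end + 1]))
    return hits


trigger_dict: Dict[str, List[str]] = {
    "AB": ["ABCD", "ABXY"], "ABC": ["ABCD"], "BC": ["ABCD"],
}
trie = build_trie(trigger_dict)
hits = find_triggers("xABCy", trie)
candidates = {trig: trigger_dict[trig] for _, trig in hits}
print(hits)  # [(1, 'AB'), (1, 'ABC'), (2, 'BC')]
```

Each hit keeps its start index, which step 4 needs in order to cut the window segment around the trigger word.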

Step 4, intercepting a plurality of text fragments associated with the trigger words from the text to be corrected, calculating the edit distances between the text fragments and the proper noun candidate words, and, taking the minimum edit distance as the target, selecting the longest text fragment on which to execute the corresponding editing operation.

Edit distance is the minimum number of editing operations required to transform one string into another; the greater the distance, the more the two strings differ. The permitted editing operations are replacing one character with another, inserting one character, and deleting one character.
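The definition above corresponds to the standard Levenshtein distance. A minimal dynamic-programming sketch that returns both the distance and one sequence of editing operations is given below; the function name `edit_ops` and the tuple layout of an operation are assumptions, and the additional matching rules of step 4.3 are not implemented here.

```python
from typing import List, Tuple

Op = Tuple[str, int, str, str]  # (kind, index in source, old char, new char)


def edit_ops(source: str, target: str) -> Tuple[int, List[Op]]:
    """Return the Levenshtein distance and one minimal list of edit operations."""
    m, n = len(source), len(target)
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1,         # insert
                             dist[i - 1][j - 1] + cost)  # keep or replace
    # Walk back from the bottom-right cell to recover the operations.
    ops: List[Op] = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and source[i - 1] == target[j - 1] \
                and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops.append(("replace", i - 1, source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", i - 1, source[i - 1], ""))
            i -= 1
        else:
            ops.append(("insert", i, "", target[j - 1]))
            j -= 1
    ops.reverse()
    return dist[m][n], ops


distance, ops = edit_ops("kitten", "sitting")
print(distance)  # 3
```

The number of returned operations always equals the distance, so the list can be applied directly as the correction for a matched text fragment.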

The step 4 specifically comprises the following steps:

step 4.1, determining the length of a cutting window based on the length of the trigger word and the position of the trigger word, and cutting out window fragments at corresponding positions in the text to be corrected;

step 4.4, retaining the editing operation with the shortest edit distance and the longest text fragment as the optimal error correction for the corresponding candidate word.

In step 4.1 of the present invention, if the trigger word is at the start position or the end position, characters of a preset length are intercepted before or after the trigger word and form a window segment with the candidate word; otherwise, characters of the preset length are intercepted both before and after the trigger word and form a window segment with the candidate word. Here, the preset length is 2 characters, the length of the trigger word "personal" is 3, and the trigger word is not at the beginning or end of a sentence, so 2 characters are intercepted before and 2 characters after the trigger word "personal", giving the window segment "personal tax sum".
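The window cut of step 4.1 can be sketched as below. The worked example (a trigger word of length 3 with 2 characters on each side, and a maximum fragment length of 7 in step 4.2) suggests that the window is the trigger word plus its padding; this sketch assumes that reading. The name `window_segment` and the parameter `pad` are hypothetical.

```python
from typing import Tuple


def window_segment(text: str, start: int, trigger_len: int,
                   pad: int = 2) -> Tuple[int, str]:
    """Cut the trigger word plus up to `pad` characters on each side.

    Returns (offset of the window in `text`, window string). Clipping at the
    text boundaries covers the case where the trigger is at the start or end.
    """
    left = max(0, start - pad)
    right = min(len(text), start + trigger_len + pad)
    return left, text[left:right]


# Trigger "ABC" in the middle of the text: 2 + 3 + 2 = 7 characters.
print(window_segment("uvwxABCyzrs", 4, 3))  # (2, 'wxABCyz')
# Trigger "ABC" at the start: only the characters after it are available.
print(window_segment("ABCyzrs", 0, 3))      # (0, 'ABCyz')
```

Returning the offset alongside the window lets later steps map an edit inside the window back to a position in the full text.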

Step 4.2 of the present invention comprises the following steps: step 4.2.1, determining an interception length interval for the text segments based on the trigger word, wherein the minimum interception length is the trigger word length and the maximum interception length is the window segment length; step 4.2.2, selecting different interception lengths within the interception length interval, and intercepting the text segments from the window segment by a sliding-window method. Here, the minimum and maximum interception lengths of the text segments are 3 and 7 respectively, so the sliding window takes 5 lengths: 3, 4, 5, 6 and 7. The window segment is sliced with these sliding windows to obtain a plurality of text segments; for example, when the window length is 3, the text segments include "personal", "personal taxed", "tax and" and "tax".
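The enumeration of step 4.2 can be sketched as a plain sliding window over the window segment. Treating every start position as valid is an assumption, since the text does not spell out how the trigger word offset restricts the slices; the name `text_fragments` is hypothetical.

```python
from typing import List, Optional, Tuple


def text_fragments(window: str, min_len: int,
                   max_len: Optional[int] = None) -> List[Tuple[int, str]]:
    """Return (offset in window, fragment) for every length in [min_len, max_len]."""
    if max_len is None:
        max_len = len(window)  # maximum interception length = window length
    fragments = []
    for size in range(min_len, max_len + 1):
        for start in range(len(window) - size + 1):
            fragments.append((start, window[start:start + size]))
    return fragments


frags = text_fragments("wxABCyz", min_len=3)  # trigger length 3, window length 7
print(len(frags))  # 15
print([f for _, f in frags if len(f) == 3])
# ['wxA', 'xAB', 'ABC', 'BCy', 'Cyz']
```

With a window of length 7 and a trigger of length 3 this yields 5 + 4 + 3 + 2 + 1 = 15 fragments across the five window lengths 3 to 7.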

In step 4.3 of the invention, the edit distance between each text segment and each candidate word is calculated. During the calculation, if the text segment is longer than the candidate word, a deletion operation is performed first to delete the redundant characters at the start and tail positions. If there are no redundant characters at the start and tail positions of the text segment, the text segment is considered not to match the candidate word when either a deletion is required and the deleted character is not a character of the proper noun, or a replacement is required and the replacement character is not a homophone of the original character; in that case no further editing operation or edit distance calculation is performed, and the next text segment is compared. Likewise, if the total number of characters modified by the other operations exceeds a preset threshold, the text segment is considered not to match the candidate word, no further editing operation or edit distance calculation is performed, and the next text segment is compared. Here the preset threshold is set to 2.

For example, a window segment corresponds to a plurality of text segments, one of which is "personal tax", and one of the proper nouns corresponding to the trigger word "personal tax" is "personal income tax", which is a candidate word to be compared with the text segment "personal tax" to calculate the edit distance. Since the length of the text segment is 7 and the length of the candidate word is 5, it must first be determined whether there are redundant characters at the start and tail positions. Here the redundant characters are deleted, at which point the edit distance is 3, giving "personal tax"; an insertion operation would then give "personal income tax", raising the edit distance from 3 to 4. However, because the deletion operation modifies 3 characters in total, exceeding the preset threshold of 2, the text segment is considered not to match the candidate word; the insertion operation and the edit distance calculation are not carried out, and the next text segment is compared instead.

In step 4.4 of the present invention, for each window segment, all the text segments obtained are compared with all the candidate words, and the editing operation with the shortest edit distance and the longest text segment is selected as the optimal error correction. It should be noted that "an editing operation" here may comprise one or more individual edits rather than only one; it denotes the set of edits that transforms a text segment into the corresponding candidate word.

Furthermore, if two editing operations edit the same window segment in the text to be corrected and the error correction operations are identical, only one is retained; if the text segments targeted by two editing operations are in a containment relation, the editing operation corresponding to the longer text segment is retained.
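The selection and de-duplication rules of step 4.4 can be sketched as follows. The `Correction` record, its fields, and both function names are hypothetical; spans are character offsets in the text to be corrected, with an exclusive end.

```python
from typing import List, NamedTuple


class Correction(NamedTuple):
    start: int       # span start in the text to be corrected
    end: int         # span end (exclusive)
    candidate: str   # proper noun the span is corrected to
    distance: int    # edit distance between span and candidate


def best_correction(options: List[Correction]) -> Correction:
    """Shortest edit distance first, then the longest text fragment."""
    return min(options, key=lambda c: (c.distance, -(c.end - c.start)))


def merge_corrections(corrections: List[Correction]) -> List[Correction]:
    """Drop exact duplicates, then drop spans contained in a longer span."""
    unique = list(dict.fromkeys(corrections))  # order-preserving de-duplication
    kept = []
    for c in unique:
        contained = any(
            o is not c
            and o.start <= c.start and c.end <= o.end
            and (o.end - o.start) > (c.end - c.start)
            for o in unique
        )
        if not contained:
            kept.append(c)
    return kept


best = best_correction([
    Correction(0, 3, "X", 1), Correction(0, 5, "X", 1), Correction(0, 7, "X", 2),
])
merged = merge_corrections([
    Correction(2, 5, "X", 1), Correction(2, 5, "X", 1),
    Correction(1, 6, "X", 1), Correction(10, 13, "Y", 1),
])
print(best)    # Correction(start=0, end=5, candidate='X', distance=1)
print(merged)  # the (1, 6) and (10, 13) corrections remain
```

Sorting on the pair (distance, negative length) expresses the two criteria in priority order, so a longer fragment only wins among corrections with the same minimum edit distance.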

Claims

1. An automatic word stock updating and prefix tree structure-based proper noun error correction method, which is characterized by comprising the following steps:

step 1, automatically updating a proper noun lexicon;

step 3, searching trigger words in the text to be corrected based on the prefix tree, and acquiring proper noun candidate words based on the trigger dictionary;

2. The method for correcting proper nouns based on automatic word stock updating and prefix tree structure according to claim 1, wherein in the step 1, the proper noun stock is automatically updated by acquiring text data through a timed crawling task, and proper noun recognition is performed through a trained entity recognition model, and the proper noun stock is updated.

3. The method for proper noun error correction based on automatic lexicon update and prefix tree structure according to claim 2, wherein said step 1 comprises the steps of:

step 1.3, regularly crawling news texts containing proper nouns, and identifying through a trained entity identification model to obtain new proper nouns;

4. The method for proper noun error correction based on automatic lexicon update and prefix tree structure according to claim 1, wherein said step 2 comprises the steps of:

5. The method for error correction of proper nouns based on automatic word stock update and prefix tree structure according to claim 4, wherein in said step 2.1, for any proper noun with word length L_i, windows with lengths in the interval [L_i/2, L_i] are generated, and the proper noun is sliced by a sliding-window method into the trigger words of the proper noun.

6. The method for proper noun error correction based on automatic lexicon update and prefix tree structure according to claim 1, wherein said step 4 comprises the steps of:

7. The method for error correction of proper nouns based on automatic updating and prefix tree structure according to claim 6, wherein in the step 4.1, if the position of the trigger word is the start position or the end position, the character with the preset length is intercepted before or after the trigger word to form a window segment with the candidate word, otherwise, the character with the preset length is intercepted before and after the trigger word to form a window segment with the candidate word.

8. The proper noun error correction method based on an automated update and prefix tree structure according to claim 7, wherein said step 4.2 includes the steps of:

9. The method for error correction of proper nouns based on automatic updating and prefix tree structure according to claim 7, wherein in the step 4.4, if two editing operations apply to the same place in the text to be corrected and the error correction operations are identical, only one editing operation is retained; if the text fragments targeted by two editing operations are in a containment relation, the one for the longer fragment is retained.