
CN117875313B - Chinese grammar error correction method and system - Google Patents


Info

Publication number
CN117875313B
CN117875313B (application number CN202410279802.4A)
Authority
CN
China
Prior art keywords
probability
representing
error
text
errors
Prior art date
Legal status
Active
Application number
CN202410279802.4A
Other languages
Chinese (zh)
Other versions
CN117875313A (en)
Inventor
康占英
黄惟
王青
肖峰
徐伯辰
刘优
彭卓
汤达夫
李芳芳
Current Assignee
Changsha Zhiwei Information Technology Co ltd
Original Assignee
Changsha Zhiwei Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Changsha Zhiwei Information Technology Co ltd
Priority to CN202410279802.4A
Publication of CN117875313A
Application granted
Publication of CN117875313B
Status: Active
Anticipated expiration

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/253: Grammatical analysis; Style critique
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/0455: Auto-encoder networks; Encoder-decoder networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/09: Supervised learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Machine Translation (AREA)

Abstract

The application relates to a Chinese grammar error correction method and system. The method comprises the following steps: acquiring an original text containing grammar errors; inputting the original text into a pretrained Bert model and outputting a semantic representation vector; passing the semantic representation vector through two different two-layer feedforward neural networks with normalization to obtain a copy probability and an error type probability respectively, returning the index of the maximum value of the error type probability, and determining a copy distribution vector based on that index; calculating an edit label probability based on the semantic representation vector; fusing the edit label probability, the copy distribution vector and the copy probability to obtain a final edit label probability, and determining the final edit label based on the final edit label probability; modifying the original text based on the final edit label; and repeating until the original text is free of errors, obtaining the correct text.

Description

Chinese grammar error correction method and system
Technical Field
The application relates to the technical field of Chinese grammar error correction, and in particular to a Chinese grammar error correction method and system.
Background
Existing Chinese grammar error correction methods mainly follow two schemes: a seq2seq approach based on machine translation, and a seq2edit approach based on edit label prediction. Because the machine-translation-based seq2seq architecture is an autoregressive language model, it suffers from slow inference and requires large amounts of training data; it also has poor interpretability, cannot distinguish the specific grammar error type in a sentence, and its speed cannot meet the requirements of an actual production environment. The current seq2edit architecture also has several problems. First, the Bert pre-trained language model is built from two pre-training tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP); it lacks pre-training tasks related to word insertion and deletion, while the grammar error correction task contains many redundancy and missing errors. Second, it places high demands on the edit labels, and the prediction space of the edit labels is too large.
Disclosure of Invention
The application aims to provide a Chinese grammar error correction method and system that improve the accuracy of Chinese grammar error correction.
The embodiment of the application provides a Chinese grammar error correction method, which comprises the following steps:
S1: acquiring an original text containing grammar errors;
S2: inputting the original text into a pretrained Bert model, and outputting a semantic representation vector;
S3: passing the semantic representation vector through two different two-layer feedforward neural networks with normalization to obtain a copy probability and an error type probability respectively, returning the index of the maximum value of the error type probability, and determining a copy distribution vector based on that index;
calculating an edit label probability based on the semantic representation vector;
S4: fusing the edit label probability, the copy distribution vector and the copy probability to obtain a final edit label probability, and determining the final edit label based on the final edit label probability;
S5: modifying the original text based on the final edit label;
S6: repeating S2-S5 until the original text is free of errors, obtaining the correct text.
The embodiment of the application also provides a Chinese grammar error correction system, which comprises:
an acquisition module, used for acquiring an original text containing grammar errors;
a representation module, used for inputting the original text into the pretrained Bert model and outputting a semantic representation vector;
a copy module, used for passing the semantic representation vector through two different two-layer feedforward neural networks with normalization to obtain a copy probability and an error type probability respectively, returning the index of the maximum value of the error type probability, and determining a copy distribution vector based on that index;
an edit label probability calculation module, used for calculating an edit label probability based on the semantic representation vector;
an edit label prediction module, used for fusing the edit label probability, the copy distribution vector and the copy probability to obtain a final edit label probability, and determining the final edit label based on the final edit label probability;
a modification module, used for modifying the original text based on the final edit label;
an iteration module, used for running the representation module, the copy module, the edit label probability calculation module, the edit label prediction module and the modification module once in sequence as one cycle, and looping until the original text is free of errors, obtaining the correct text.
The application has the following beneficial effects: the original text is input into the pretrained Bert model to output a semantic representation vector; a copy probability and an error type probability are obtained from the semantic representation vector, the index of the maximum value of the error type probability is returned, and a copy distribution vector is determined based on that index; an edit label probability is calculated based on the semantic representation vector; the edit label probability, the copy distribution vector and the copy probability are fused to obtain a final edit label probability, from which the final edit label is determined; the original text is modified according to the final edit label; and this process is executed in a loop until the original text is free of errors, yielding the correct text. This improves how well the pretrained Bert model fits the grammar error correction task, alleviates the label sparsity problem of the downstream grammar error correction task, improves overall performance, and improves the accuracy of Chinese grammar error correction.
Drawings
Fig. 1 is a flowchart of the Chinese grammar error correction method provided by an embodiment of the present application.
Fig. 2 is a flowchart of pretraining a Bert model provided in an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
As shown in Fig. 1, the present embodiment provides a Chinese grammar error correction method, which includes:
S1: acquiring an original text containing grammar errors.
S2: inputting the original text into a pretrained Bert model, and outputting a semantic representation vector.
Specifically, as shown in Fig. 2, the pretraining process of the Bert model includes:
S2.1: obtaining an error-free text, and performing replacement and insertion operations on the error-free text to obtain a corrupted text; the corrupted text comprises text with missing errors and text with redundant errors.
Further, the process of obtaining the corrupted text includes:
randomly extracting 20% of the characters from the error-free text, selecting 50% of the extracted characters for the replacement operation to obtain the text with missing errors, and selecting the remaining 50% for the insertion operation to obtain the text with redundant errors;
the replacement operation: each selected character is replaced with the "[mask]" token with 50% probability, replaced with a random character with 25% probability, and replaced with a character drawn from the 10 characters ranked highest by prediction probability with 25% probability;
the insertion operation: an insertion position is chosen at random among the selected characters; with 50% probability a "[mask]" token is inserted, with 15% probability a character randomly chosen from the error-free text is inserted, and with 35% probability a character randomly chosen from the 10 characters ranked highest by prediction probability in the MLM pre-training task of the Bert model is inserted.
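A minimal Python sketch of this corruption procedure follows; the `top10_mlm_chars` helper (which would return the ten characters ranked highest by the Bert MLM head at a given position) and the `vocab` argument are hypothetical stand-ins, not names from the patent:

```python
import random

MASK = "[mask]"

def corrupt(text, top10_mlm_chars, vocab):
    """Sketch of S2.1: build one corrupted sample from error-free text."""
    chars = list(text)
    n = max(1, int(len(chars) * 0.20))                 # extract 20% of characters
    picked = random.sample(range(len(chars)), n)
    to_replace, to_insert = picked[: n // 2], picked[n // 2 :]

    for i in to_replace:                               # replacement -> missing errors
        r = random.random()
        if r < 0.50:
            chars[i] = MASK                            # 50%: "[mask]" token
        elif r < 0.75:
            chars[i] = random.choice(vocab)            # 25%: random character
        else:
            chars[i] = random.choice(top10_mlm_chars(text, i))  # 25%: top-10 MLM char

    inserts = []                                       # insertion -> redundant errors
    for i in to_insert:
        r = random.random()
        if r < 0.50:
            inserts.append((i, MASK))                  # 50%: insert "[mask]"
        elif r < 0.65:
            inserts.append((i, random.choice(text)))   # 15%: char from the clean text
        else:
            inserts.append((i, random.choice(top10_mlm_chars(text, i))))  # 35%: top-10
    for i, c in sorted(inserts, reverse=True):         # insert back-to-front
        chars.insert(i, c)
    return "".join(chars)
```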
S2.2: inputting the corrupted text into the Bert model, and outputting semantic feature representations.
S2.3: passing the semantic feature representation through a Softmax fully connected layer to obtain a prediction probability, where the calculation formula is:

$\hat{P} = \mathrm{Softmax}(W_p H + b_p)$

where $\hat{P}$ represents the prediction probability; $\mathrm{Softmax}(\cdot)$ represents the Softmax fully connected layer; $H$ represents the semantic feature representation; $W_p$ represents the weight of the Softmax fully connected layer; $b_p$ represents the bias of the Softmax fully connected layer.
S2.4: returning the index of the maximum value of the prediction probability to obtain the predicted character result, where the calculation formula is:

$\hat{Y} = \mathrm{Argmax}(\hat{P})$

where $\hat{Y}$ represents the predicted character result; $\mathrm{Argmax}(\cdot)$ represents the Argmax function.
S2.5: calculating the pre-training loss function based on the real characters in the error-free text and the predicted character results, where the calculation formula is:

$L_{pre} = -\sum_{i \in R} \log P(\hat{y}_i = y_i \mid h_i) - \sum_{i \in I} \log P(\hat{y}_i = \mathrm{[DELETE]} \mid h_i)$

where $L_{pre}$ represents the pre-training loss function; $i$ is an ordinal number; $R$ represents the set of replacement position markers in the text with missing errors; $I$ represents the set of insertion position markers in the text with redundant errors; $\hat{y}_i$ represents the $i$-th predicted character result; $y_i$ represents the $i$-th real character in the error-free text; $h_i$ represents the semantic feature representation of the $i$-th character in the corrupted text; $\mathrm{[DELETE]}$ represents the delete marker; $P(\cdot)$ represents the probability that, given $h_i$, $\hat{y}_i$ is the real character or the delete marker: in the case of a substitution error $\hat{y}_i$ should be the real character $y_i$, and in the case of a redundant error it should be the delete marker.
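A PyTorch sketch of this loss under the reconstruction above; `logits` (raw scores from the Softmax head) and `delete_id` (the vocabulary index of the delete marker) are assumed names:

```python
import torch
import torch.nn.functional as F

def pretrain_loss(logits, orig_ids, replaced_pos, inserted_pos, delete_id):
    """Sketch of S2.5: cross-entropy where the target at a replaced position is
    the real character and the target at an inserted position is [DELETE].
    logits: (seq_len, vocab_size); orig_ids: original character ids."""
    positions = list(replaced_pos) + list(inserted_pos)
    targets = [orig_ids[i] for i in replaced_pos] + [delete_id] * len(inserted_pos)
    positions = torch.tensor(positions, dtype=torch.long)
    targets = torch.tensor(targets, dtype=torch.long)
    # negative log-likelihood of the correct label at each corrupted position
    return F.cross_entropy(logits[positions], targets)
```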
S2.6: training the Bert model according to the pre-training loss function until convergence or the maximum number of iterations is reached, obtaining the pretrained Bert model.
S3: passing the semantic representation vector through two different two-layer feedforward neural networks with normalization to obtain the copy probability and the error type probability respectively, returning the index of the maximum value of the error type probability, and determining the copy distribution vector based on that index.
Specifically, the calculation formulas for the copy probability and the error type probability are:

$m_i = W_1 h_i + b_1$, $\hat{m}_i = \mathrm{LN}(\sigma(m_i))$, $p^{copy}_i = \mathrm{Sigmoid}(W_2 \hat{m}_i)$

$n_i = W'_1 h_i + b'_1$, $\hat{n}_i = \mathrm{LN}(\sigma(n_i))$, $p^{type}_i = \mathrm{Sigmoid}(W'_2 \hat{n}_i)$

where $W_1$ represents the weight of the first layer of the first two-layer feedforward neural network with normalization and $b_1$ its bias; $\mathrm{LN}(\cdot)$ represents any normalization function; $h_i$ represents the $i$-th vector in the semantic representation vector; $\sigma(\cdot)$ represents any activation function; $p^{copy}_i$ represents the copy probability; $W_2$ represents the weight of the Sigmoid fully connected layer in the first network; $W'_1$ represents the weight of the first layer of the second two-layer feedforward neural network with normalization and $b'_1$ its bias; $p^{type}_i$ represents the error type probability; $W'_2$ represents the weight of the Sigmoid fully connected layer in the second network; $m_i$ and $n_i$ represent the intermediate outputs of the first fully connected layers of the two networks; $\hat{m}_i$ and $\hat{n}_i$ represent those outputs after activation and normalization.
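A PyTorch sketch of the two heads, taking ReLU for the unspecified activation, LayerNorm for the unspecified normalization, and arbitrary layer sizes (all assumptions):

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    """Sketch of S3: two separate two-layer feedforward networks with
    normalization, applied to each token's semantic vector h_i."""
    def __init__(self, hidden=768, mid=256, n_types=4):
        super().__init__()
        self.copy_fc1 = nn.Linear(hidden, mid)   # first FFN -> copy probability
        self.copy_norm = nn.LayerNorm(mid)
        self.copy_fc2 = nn.Linear(mid, 1)
        self.type_fc1 = nn.Linear(hidden, mid)   # second FFN -> error type probability
        self.type_norm = nn.LayerNorm(mid)
        self.type_fc2 = nn.Linear(mid, n_types)  # {none, redundant, substitution, missing}

    def forward(self, h):                        # h: (batch, seq, hidden)
        m = self.copy_norm(torch.relu(self.copy_fc1(h)))
        p_copy = torch.sigmoid(self.copy_fc2(m))           # (batch, seq, 1)
        n = self.type_norm(torch.relu(self.type_fc1(h)))
        p_type = torch.sigmoid(self.type_fc2(n))           # (batch, seq, 4)
        error_type = p_type.argmax(dim=-1)                 # index of the maximum
        return p_copy, p_type, error_type
```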
Further, returning the index of the maximum value of the error type probability and determining the copy distribution vector based on that index includes:
returning the index of the maximum value of the error type probability through the Argmax function, where the calculation formula is:

$type = \mathrm{Argmax}(p^{type})$

where $type$ represents the index of the maximum value of the error type probability: 0 indicates no error, 1 indicates a redundant error, 2 indicates a substitution error, and 3 indicates a missing error; $\mathrm{Argmax}(\cdot)$ represents the Argmax function; $p^{type}$ represents the error type probability;
determining the values in the copy distribution vector from the index of the maximum value of the error type probability, where the copy distribution vector is expressed as:

$CopyDistribute = (k, d, r_1, r_2, \ldots, r_{N_r}, a_1, a_2, \ldots, a_{N_a}) \in \mathbb{R}^N$

where $CopyDistribute$ denotes the copy distribution vector; $k$ denotes the element for no error; $d$ denotes the element for a redundant error; $r_1$ and $r_2$ denote the 1st and 2nd elements for substitution errors and $r_{N_r}$ the $N_r$-th; $a_1$ and $a_2$ denote the 1st and 2nd elements for missing errors and $a_{N_a}$ the $N_a$-th; $N$ denotes the number of edit labels.
In the copy distribution vector:
when $type = 0$, $k = 1$ and all remaining elements are 0;
when $type = 1$, $d = 1$ and all remaining elements are 0;
when $type = 2$, $r_1, r_2, \ldots, r_{N_r}$ are all 1 and all other elements are 0;
when $type = 3$, $a_1, a_2, \ldots, a_{N_a}$ are all 1 and all other elements are 0.
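A sketch of this mask construction; the tag-space layout (keep, delete, then the replace tags, then the append tags) and the size arguments mirror the expression above but are assumptions:

```python
import torch

def copy_distribution(error_type, n_labels, n_replace, n_append):
    """Sketch: build the 0/1 copy distribution vector over the edit-label space
    from the predicted error type (0 none, 1 redundant, 2 substitution, 3 missing)."""
    v = torch.zeros(n_labels)
    if error_type == 0:
        v[0] = 1.0                                         # k = 1: only KEEP
    elif error_type == 1:
        v[1] = 1.0                                         # d = 1: only DELETE
    elif error_type == 2:
        v[2 : 2 + n_replace] = 1.0                         # all replace tags
    else:
        v[2 + n_replace : 2 + n_replace + n_append] = 1.0  # all append tags
    return v
```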
An edit label probability is then calculated based on the semantic representation vector.
Specifically, the calculation formula is:

$q_i = W''_1 h_i + b''_1$, $\hat{q}_i = \mathrm{LN}(\sigma(q_i))$, $P^{edit}_i = \mathrm{Softmax}(W''_2 \hat{q}_i)$

where $h_i$ represents the $i$-th vector in the semantic representation vector; $\mathrm{LN}(\cdot)$ represents any normalization function; $\sigma(\cdot)$ represents any activation function; $P^{edit}_i$ represents the edit label probability of the $i$-th character; $q_i$ represents the intermediate output of the first fully connected layer in the third two-layer feedforward neural network with normalization and $\hat{q}_i$ its output after activation and normalization; $W''_1$ represents the weight of the first layer of the third two-layer feedforward neural network with normalization and $b''_1$ its bias; $W''_2$ represents the weight of the second layer of the third two-layer feedforward neural network with normalization.
S4: fusing the edit label probability, the copy distribution vector and the copy probability to obtain the final edit label probability, and returning the index of the maximum value of the final edit label probability through the Argmax function to obtain the final edit label.
Further, the calculation formula of the final edit label probability is:

$P^{final}_i = p^{copy}_i \cdot (CopyDistribute \odot P^{edit}_i) + (1 - p^{copy}_i) \cdot P^{edit}_i$

where $P^{final}_i$ represents the final edit label probability; $p^{copy}_i$ represents the copy probability; $CopyDistribute$ denotes the copy distribution vector; $P^{edit}_i$ represents the edit label probability.
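A minimal sketch of this soft-copy fusion as reconstructed above, operating on torch tensors; the final renormalization is an assumption:

```python
def fuse(p_copy, copy_dist, p_edit):
    """Sketch of S4: gate the edit-label distribution with the copy distribution
    mask, weighted by the copy probability; a low p_copy shields a wrong
    auxiliary judgment, a high p_copy reinforces a correct one."""
    gated = p_copy * (copy_dist * p_edit) + (1.0 - p_copy) * p_edit
    final = gated / gated.sum(dim=-1, keepdim=True)   # renormalize (assumption)
    return final, final.argmax(dim=-1)                # final label via Argmax
```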
S5: modifying the original text based on the final edit label.
S6: repeating S2-S5 until the original text is free of errors, obtaining the correct text.
In this embodiment, the edit labels include keep, delete, replace, and insert.
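A sketch of how S5 and S6 might be realized; the tag encoding ('KEEP', 'DELETE', ('REPLACE', c), ('APPEND', c)) and the `predict_tags` callable standing in for the full model are hypothetical:

```python
def apply_edits(text, tags):
    """Sketch of S5: rewrite the text according to per-character edit labels."""
    out = []
    for ch, tag in zip(text, tags):
        if tag == "KEEP":
            out.append(ch)
        elif tag == "DELETE":
            continue                          # redundant character: drop it
        elif tag[0] == "REPLACE":
            out.append(tag[1])                # substitution: swap in new character
        elif tag[0] == "APPEND":
            out.extend([ch, tag[1]])          # missing character: insert after ch
    return "".join(out)

def correct(text, predict_tags, max_iters=5):
    """Sketch of S6: iterate S2-S5 until only KEEP labels remain."""
    for _ in range(max_iters):
        tags = predict_tags(text)             # runs the full pipeline once
        if all(t == "KEEP" for t in tags):
            break
        text = apply_edits(text, tags)
    return text
```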
In the Chinese grammar error correction method provided by this embodiment, the original text is input into the pretrained Bert model to output a semantic representation vector; a copy probability and an error type probability are obtained from the semantic representation vector, the index of the maximum value of the error type probability is returned, and a copy distribution vector is determined based on that index; an edit label probability is calculated based on the semantic representation vector; the edit label probability, the copy distribution vector and the copy probability are fused to obtain a final edit label probability, from which the final edit label is determined; the original text is modified according to the final edit label; and this process runs in a loop until the original text is free of errors, yielding the correct text. This improves how well the pretrained Bert model fits the grammar error correction task, alleviates the label sparsity problem of the downstream grammar error correction task, improves overall performance, and improves the accuracy of Chinese grammar error correction.
The embodiment also provides a Chinese grammar error correction system, which comprises:
an acquisition module, a pre-training module and a correction module;
the acquisition module is used for acquiring an original text containing grammar errors;
the pre-training module is used for pretraining the Bert model;
the correction module includes: a representation module, a copy module, an edit label probability calculation module, an edit label prediction module, a modification module and an iteration module;
the representation module is used for inputting the original text into the pretrained Bert model and outputting a semantic representation vector;
the copy module is used for passing the semantic representation vector through two different two-layer feedforward neural networks with normalization to obtain a copy probability and an error type probability respectively, returning the index of the maximum value of the error type probability, and determining a copy distribution vector based on that index;
the copy module includes a copy probability calculation sub-module and an error type judgment sub-module;
the copy probability calculation sub-module is used for calculating the copy probability based on the semantic representation vector;
the error type judgment sub-module is used for calculating the error type probability based on the semantic representation vector, returning the index of the maximum value of the error type probability, and determining the copy distribution vector based on that index;
the edit label probability calculation module is used for calculating the edit label probability based on the semantic representation vector;
the edit label prediction module is used for fusing the edit label probability, the copy distribution vector and the copy probability to obtain the final edit label probability, and determining the final edit label based on the final edit label probability;
the modification module is used for modifying the original text based on the final edit label;
the iteration module is used for running the representation module, the copy module, the edit label probability calculation module, the edit label prediction module and the modification module once in sequence as one cycle, and looping until the original text is free of errors, obtaining the correct text.
In the Chinese grammar error correction system provided by this embodiment, the confidence of the [PAD] label at grammar error positions is judged through the MLM pre-training task of the Bert model in order to compute the pre-training loss function, so that after convergence the Bert model can express information about missing and redundant errors, providing reliable and comprehensive information for the downstream grammar correction task. In addition, a copy module is introduced into the correction module, and its error type judgment task (embodied by the copy distribution vector) serves as an auxiliary task for the edit label prediction task, which narrows the search range of the edit labels and improves the accuracy of edit label prediction. Meanwhile, the copy probability is introduced and a method for fusing the information of the error type judgment task with that of the edit label prediction task is designed: wrong judgments by the auxiliary task can be shielded, correct judgments can be reinforced, and the soft-copy mechanism of the copy probability adds a buffer between the two tasks.
Exemplary embodiments of the present disclosure are specifically illustrated and described above. It is to be understood that this disclosure is not limited to the particular arrangements, instrumentalities and methods of implementation described herein; on the contrary, the disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (8)

1. A Chinese grammar error correction method, comprising:
S1: acquiring an original text containing grammar errors;
S2: inputting the original text into a pretrained Bert model, and outputting a semantic representation vector;
wherein the pretraining process of the Bert model comprises the following steps:
S2.1: obtaining an error-free text, and performing replacement and insertion operations on the error-free text to obtain a corrupted text; the corrupted text comprises text with missing errors and text with redundant errors;
S2.2: inputting the corrupted text into the Bert model, and outputting semantic feature representations;
S2.3: passing the semantic feature representation through a Softmax fully connected layer to obtain a prediction probability, where the calculation formula is:

$\hat{P} = \mathrm{Softmax}(W_p H + b_p)$

where $\hat{P}$ represents the prediction probability; $\mathrm{Softmax}(\cdot)$ represents the Softmax fully connected layer; $H$ represents the semantic feature representation; $W_p$ represents the weight of the Softmax fully connected layer; $b_p$ represents the bias of the Softmax fully connected layer;
S2.4: returning the index of the maximum value of the prediction probability to obtain the predicted character result, where the calculation formula is:

$\hat{Y} = \mathrm{Argmax}(\hat{P})$

where $\hat{Y}$ represents the predicted character result; $\mathrm{Argmax}(\cdot)$ represents the Argmax function;
S2.5: calculating the pre-training loss function based on the real characters in the error-free text and the predicted character results, where the calculation formula is:

$L_{pre} = -\sum_{i \in R} \log P(\hat{y}_i = y_i \mid h_i) - \sum_{i \in I} \log P(\hat{y}_i = \mathrm{[DELETE]} \mid h_i)$

where $L_{pre}$ represents the pre-training loss function; $i$ is an ordinal number; $R$ represents the set of replacement position markers in the text with missing errors; $I$ represents the set of insertion position markers in the text with redundant errors; $\hat{y}_i$ represents the $i$-th predicted character result; $y_i$ represents the $i$-th real character in the error-free text; $h_i$ represents the semantic feature representation of the $i$-th character in the corrupted text; $\mathrm{[DELETE]}$ represents the delete marker; $P(\cdot)$ represents the probability that, given $h_i$, $\hat{y}_i$ is the real character or the delete marker: in the case of a substitution error $\hat{y}_i$ should be the real character $y_i$, and in the case of a redundant error it should be the delete marker;
S2.6: training the Bert model according to the pre-training loss function until convergence or the maximum number of iterations is reached, obtaining the pretrained Bert model;
S3: passing the semantic representation vector through two different two-layer feedforward neural networks with normalization to obtain a copy probability and an error type probability respectively, returning the index of the maximum value of the error type probability, and determining a copy distribution vector based on that index;
wherein returning the index of the maximum value of the error type probability and determining the copy distribution vector based on that index comprises:
returning the index of the maximum value of the error type probability through the Argmax function, where the calculation formula is:

$type = \mathrm{Argmax}(p^{type})$

where $type$ represents the index of the maximum value of the error type probability: 0 indicates no error, 1 indicates a redundant error, 2 indicates a substitution error, and 3 indicates a missing error; $\mathrm{Argmax}(\cdot)$ represents the Argmax function; $p^{type}$ represents the error type probability;
determining the values in the copy distribution vector from the index of the maximum value of the error type probability, where the copy distribution vector is expressed as:

$CopyDistribute = (k, d, r_1, r_2, \ldots, r_{N_r}, a_1, a_2, \ldots, a_{N_a}) \in \mathbb{R}^N$

where $CopyDistribute$ denotes the copy distribution vector; $k$ denotes the element for no error; $d$ denotes the element for a redundant error; $r_1$ and $r_2$ denote the 1st and 2nd elements for substitution errors and $r_{N_r}$ the $N_r$-th; $a_1$ and $a_2$ denote the 1st and 2nd elements for missing errors and $a_{N_a}$ the $N_a$-th; $N$ denotes the number of edit labels;
in the copy distribution vector:
when $type = 0$, $k = 1$ and all remaining elements are 0;
when $type = 1$, $d = 1$ and all remaining elements are 0;
when $type = 2$, $r_1, r_2, \ldots, r_{N_r}$ are all 1 and all other elements are 0;
when $type = 3$, $a_1, a_2, \ldots, a_{N_a}$ are all 1 and all other elements are 0;
calculating an edit label probability based on the semantic representation vector;
S4: fusing the edit label probability, the copy distribution vector and the copy probability to obtain a final edit label probability, and determining the final edit label based on the final edit label probability;
S5: modifying the original text based on the final edit label;
S6: repeating S2-S5 until the original text is free of errors, obtaining the correct text.
2. The method according to claim 1, wherein in S2.1, performing the replacement and insertion operations on the error-free text to obtain the corrupted text comprises:
randomly extracting 20% of the characters from the error-free text, selecting 50% of the extracted characters for the replacement operation to obtain the text with missing errors, and selecting the remaining 50% for the insertion operation to obtain the text with redundant errors;
the replacement operation comprises: each selected character is replaced with the "[mask]" token with 50% probability, replaced with a random character with 25% probability, and replaced with a character drawn from the 10 characters ranked highest by prediction probability with 25% probability;
the insertion operation comprises: an insertion position is chosen at random among the selected characters; with 50% probability a "[mask]" token is inserted, with 15% probability a character randomly chosen from the error-free text is inserted, and with 35% probability a character randomly chosen from the 10 characters ranked highest by prediction probability is inserted.
3. The method according to claim 1, wherein in S3, the calculation formulas for the copy probability and the error type probability are:

$m_i = W_1 h_i + b_1$, $\hat{m}_i = \mathrm{LN}(\sigma(m_i))$, $p^{copy}_i = \mathrm{Sigmoid}(W_2 \hat{m}_i)$

$n_i = W'_1 h_i + b'_1$, $\hat{n}_i = \mathrm{LN}(\sigma(n_i))$, $p^{type}_i = \mathrm{Sigmoid}(W'_2 \hat{n}_i)$

where $W_1$ represents the weight of the first layer of the first two-layer feedforward neural network with normalization and $b_1$ its bias; $\mathrm{LN}(\cdot)$ represents any normalization function; $h_i$ represents the $i$-th vector in the semantic representation vector; $\sigma(\cdot)$ represents any activation function; $p^{copy}_i$ represents the copy probability; $W_2$ represents the weight of the Sigmoid fully connected layer in the first network; $W'_1$ represents the weight of the first layer of the second two-layer feedforward neural network with normalization and $b'_1$ its bias; $p^{type}_i$ represents the error type probability; $W'_2$ represents the weight of the Sigmoid fully connected layer in the second network; $m_i$ and $n_i$ represent the intermediate outputs of the first fully connected layers of the two networks; $\hat{m}_i$ and $\hat{n}_i$ represent those outputs after activation and normalization.
4. The method according to claim 1, wherein in S3, calculating the edit label probability based on the semantic representation vector comprises:

$q_i = W''_1 h_i + b''_1$, $\hat{q}_i = \mathrm{LN}(\sigma(q_i))$, $P^{edit}_i = \mathrm{Softmax}(W''_2 \hat{q}_i)$

where $h_i$ represents the $i$-th vector in the semantic representation vector; $\mathrm{LN}(\cdot)$ represents any normalization function; $\sigma(\cdot)$ represents any activation function; $P^{edit}_i$ represents the edit label probability of the $i$-th character; $q_i$ represents the intermediate output of the first fully connected layer in the third two-layer feedforward neural network with normalization and $\hat{q}_i$ its output after activation and normalization; $W''_1$ represents the weight of the first layer of the third two-layer feedforward neural network with normalization and $b''_1$ its bias; $W''_2$ represents the weight of the second layer of the third two-layer feedforward neural network with normalization.
5. The method according to claim 1, wherein in S4, the calculation formula of the final edit label probability is:

$P^{final}_i = p^{copy}_i \cdot (CopyDistribute \odot P^{edit}_i) + (1 - p^{copy}_i) \cdot P^{edit}_i$

where $P^{final}_i$ represents the final edit label probability; $p^{copy}_i$ represents the copy probability; $CopyDistribute$ denotes the copy distribution vector; $P^{edit}_i$ represents the edit label probability.
6. The method according to claim 1, wherein in S4, determining the final edit label based on the final edit label probability comprises returning the index of the maximum value of the final edit label probability through the Argmax function to obtain the final edit label.
7. The method according to claim 1, wherein the edit labels include keep, delete, replace, and insert.
8. A Chinese grammar error correction system, comprising:
an acquisition module, used for acquiring an original text containing grammar errors;
a representation module, used for inputting the original text into a pretrained Bert model and outputting a semantic representation vector;
wherein the pretraining process of the Bert model comprises the following steps:
S2.1: obtaining an error-free text, and performing replacement and insertion operations on the error-free text to obtain a corrupted text; the corrupted text comprises text with missing errors and text with redundant errors;
S2.2: inputting the corrupted text into the Bert model, and outputting semantic feature representations;
S2.3: passing the semantic feature representation through a Softmax fully connected layer to obtain a prediction probability, where the calculation formula is:

$\hat{P} = \mathrm{Softmax}(W_p H + b_p)$

where $\hat{P}$ represents the prediction probability; $\mathrm{Softmax}(\cdot)$ represents the Softmax fully connected layer; $H$ represents the semantic feature representation; $W_p$ represents the weight of the Softmax fully connected layer; $b_p$ represents the bias of the Softmax fully connected layer;
S2.4: returning the index of the maximum value of the prediction probability to obtain the predicted character result, where the calculation formula is:

$\hat{Y} = \mathrm{Argmax}(\hat{P})$

where $\hat{Y}$ represents the predicted character result; $\mathrm{Argmax}(\cdot)$ represents the Argmax function;
S2.5: calculating the pre-training loss function based on the real characters in the error-free text and the predicted character results, where the calculation formula is:

$L_{pre} = -\sum_{i \in R} \log P(\hat{y}_i = y_i \mid h_i) - \sum_{i \in I} \log P(\hat{y}_i = \mathrm{[DELETE]} \mid h_i)$

where $L_{pre}$ represents the pre-training loss function; $i$ is an ordinal number; $R$ represents the set of replacement position markers in the text with missing errors; $I$ represents the set of insertion position markers in the text with redundant errors; $\hat{y}_i$ represents the $i$-th predicted character result; $y_i$ represents the $i$-th real character in the error-free text; $h_i$ represents the semantic feature representation of the $i$-th character in the corrupted text; $\mathrm{[DELETE]}$ represents the delete marker; $P(\cdot)$ represents the probability that, given $h_i$, $\hat{y}_i$ is the real character or the delete marker: in the case of a substitution error $\hat{y}_i$ should be the real character $y_i$, and in the case of a redundant error it should be the delete marker;
S2.6: training the Bert model according to the pre-training loss function until convergence or the maximum number of iterations is reached, obtaining the pretrained Bert model;
a copy module, used for passing the semantic representation vector through two different two-layer feedforward neural networks with normalization to obtain a copy probability and an error type probability respectively, returning the index of the maximum value of the error type probability, and determining a copy distribution vector based on that index;
wherein returning the index of the maximum value of the error type probability and determining the copy distribution vector based on that index comprises:
returning the index of the maximum value of the error type probability through the Argmax function, where the calculation formula is:

$type = \mathrm{Argmax}(p^{type})$

where $type$ represents the index of the maximum value of the error type probability: 0 indicates no error, 1 indicates a redundant error, 2 indicates a substitution error, and 3 indicates a missing error; $\mathrm{Argmax}(\cdot)$ represents the Argmax function; $p^{type}$ represents the error type probability;
determining the values in the copy distribution vector from the index of the maximum value of the error type probability, where the copy distribution vector is expressed as:

$CopyDistribute = (k, d, r_1, r_2, \ldots, r_{N_r}, a_1, a_2, \ldots, a_{N_a}) \in \mathbb{R}^N$

where $CopyDistribute$ denotes the copy distribution vector; $k$ denotes the element for no error; $d$ denotes the element for a redundant error; $r_1$ and $r_2$ denote the 1st and 2nd elements for substitution errors and $r_{N_r}$ the $N_r$-th; $a_1$ and $a_2$ denote the 1st and 2nd elements for missing errors and $a_{N_a}$ the $N_a$-th; $N$ denotes the number of edit labels;
in the copy distribution vector:
when $type = 0$, $k = 1$ and all remaining elements are 0;
when $type = 1$, $d = 1$ and all remaining elements are 0;
when $type = 2$, $r_1, r_2, \ldots, r_{N_r}$ are all 1 and all other elements are 0;
when $type = 3$, $a_1, a_2, \ldots, a_{N_a}$ are all 1 and all other elements are 0;
an edit label probability calculation module, used for calculating an edit label probability based on the semantic representation vector;
an edit label prediction module, used for fusing the edit label probability, the copy distribution vector and the copy probability to obtain a final edit label probability, and determining the final edit label based on the final edit label probability;
a modification module, used for modifying the original text based on the final edit label;
an iteration module, used for running the representation module, the copy module, the edit label probability calculation module, the edit label prediction module and the modification module once in sequence as one cycle, and looping until the original text is free of errors, obtaining the correct text.
CN202410279802.4A 2024-03-12 2024-03-12 Chinese grammar error correction method and system Active CN117875313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410279802.4A CN117875313B (en) 2024-03-12 2024-03-12 Chinese grammar error correction method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410279802.4A CN117875313B (en) 2024-03-12 2024-03-12 Chinese grammar error correction method and system

Publications (2)

Publication Number Publication Date
CN117875313A (en) 2024-04-12
CN117875313B (en) 2024-07-02

Family

ID=90588765

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410279802.4A Active CN117875313B (en) 2024-03-12 2024-03-12 Chinese grammar error correction method and system

Country Status (1)

Country Link
CN (1) CN117875313B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021212612A1 (en) * 2020-04-23 2021-10-28 平安科技(深圳)有限公司 Intelligent text error correction method and apparatus, electronic device and readable storage medium
CN115146621A (en) * 2022-05-09 2022-10-04 腾讯科技(深圳)有限公司 Training method, application method, device and equipment of text error correction model

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8583416B2 (en) * 2007-12-27 2013-11-12 Fluential, Llc Robust information extraction from utterances
US11494647B2 (en) * 2019-12-06 2022-11-08 Adobe Inc. Slot filling with contextual information
CN112528671A (en) * 2020-12-02 2021-03-19 北京小米松果电子有限公司 Semantic analysis method, semantic analysis device and storage medium
KR102552811B1 (en) * 2020-12-14 2023-07-06 박지우 System for providing cloud based grammar checker service
CN112836496B (en) * 2021-01-25 2024-02-13 之江实验室 Text error correction method based on BERT and feedforward neural network
CN115034218A (en) * 2022-06-10 2022-09-09 哈尔滨福涛科技有限责任公司 Chinese grammar error diagnosis method based on multi-stage training and editing level voting
KR102542220B1 (en) * 2022-09-19 2023-06-13 아주대학교 산학협력단 Method of semantic segmentation based on self-knowledge distillation and semantic segmentation device based on self-knowledge distillation

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021212612A1 (en) * 2020-04-23 2021-10-28 平安科技(深圳)有限公司 Intelligent text error correction method and apparatus, electronic device and readable storage medium
CN115146621A (en) * 2022-05-09 2022-10-04 腾讯科技(深圳)有限公司 Training method, application method, device and equipment of text error correction model

Also Published As

Publication number Publication date
CN117875313A (en) 2024-04-12

Similar Documents

Publication Publication Date Title
CN110489760B (en) Text automatic correction method and device based on deep neural network
CN108416058B (en) Bi-LSTM input information enhancement-based relation extraction method
CN111708882B (en) Transformer-based Chinese text information missing completion method
CN111985239A (en) Entity identification method and device, electronic equipment and storage medium
CN110276069B (en) Method, system and storage medium for automatically detecting Chinese braille error
CN114818668B (en) Name correction method and device for voice transcription text and computer equipment
CN112215013B (en) Clone code semantic detection method based on deep learning
CN111767718B (en) Chinese grammar error correction method based on weakened grammar error feature representation
CN110209822A (en) Sphere of learning data dependence prediction technique based on deep learning, computer
CN116306600B (en) MacBert-based Chinese text error correction method
CN113657098B (en) Text error correction method, device, equipment and storage medium
CN114781651A (en) Small sample learning robustness improving method based on contrast learning
CN114818669B (en) Method for constructing name error correction model and computer equipment
CN114330350B (en) Named entity recognition method and device, electronic equipment and storage medium
CN114586038B (en) Method and device for event extraction and extraction model training, equipment and medium
CN114239584A (en) Named entity identification method based on self-supervision learning
CN114330349A (en) Specific field named entity recognition method
CN117875313B (en) Chinese grammar error correction method and system
CN112488111A (en) Instruction expression understanding method based on multi-level expression guide attention network
CN117744658A (en) Ship naming entity identification method based on BERT-BiLSTM-CRF
CN115437511B (en) Pinyin Chinese character conversion method, conversion model training method and storage medium
CN114925170B (en) Text proofreading model training method and device and computing equipment
CN116611428A (en) Non-autoregressive decoding Vietnam text regularization method based on editing alignment algorithm
CN115952284A (en) Medical text relation extraction method fusing density clustering and ERNIE
CN115964486A (en) Small sample intention recognition method based on data enhancement

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant