CN114564942B - Text error correction method, storage medium and device for supervision field - Google Patents

Text error correction method, storage medium and device for supervision field

Info

Publication number
CN114564942B
Authority
CN
China
Prior art keywords
text
error correction
bert
model
sub
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111052921.9A
Other languages
Chinese (zh)
Other versions
CN114564942A (en
Inventor
孙晓兵
齐路
唐会军
刘栓林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Nextdata Times Technology Co ltd
Original Assignee
Beijing Nextdata Times Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Nextdata Times Technology Co ltd
Priority to CN202111052921.9A
Publication of CN114564942A
Application granted
Publication of CN114564942B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a text error correction method, a storage medium and a device for the supervision field, and relates to the live broadcast field. The method comprises the following steps: acquiring ASR real-time translation text; classifying the translation text through a trained BERT classification model, outputting the supervision sub-field to which the translation text belongs, and labeling the translation text according to that sub-field; and performing error correction on the labeled translation text through the trained sub-field BERT error correction model to obtain an error correction text. Because the translation text is first classified by the BERT classification model and then corrected by the BERT error correction model that matches its supervision sub-field, the method effectively improves the word accuracy of the text that ASR (automatic speech recognition) produces from audio in each supervision field in live broadcast scenes, and it can be applied quickly to related fields.

Description

Text error correction method, storage medium and device for supervision field
Technical Field
The present invention relates to the field of live broadcasting, and in particular, to a text error correction method, a storage medium, and an apparatus for the field of supervision.
Background
With the rise of the network live broadcast industry, people's social channels have expanded greatly, and supervision of the industry has become correspondingly more complex. Owing to the live broadcast environment and the shortcomings of automatic speech recognition (hereinafter ASR), ASR output contains translation errors serious enough to change the meaning of the audio, and supervision problems arise continually. Correcting the ASR translation data of supervision-field content in live broadcast scenes has therefore become an important technical bottleneck.
The traditional approach is to apply a general-domain error correction method directly to the text translated by ASR. One drawback is that the data distributions of the supervision field and the general domain do not match well: the general-domain distribution is meant to cover that of the supervision field but is much broader. In addition, the supervision field comprises many subdivided sub-fields. When the error correction tasks of several supervision sub-fields are lumped together, accurate field-specific results are hard to obtain, so traditional error correction performs poorly when evaluated on the supervision field.
Disclosure of Invention
The invention aims to solve the technical problem of providing a text error correction method, a storage medium and a device for the supervision field aiming at the defects of the prior art.
The technical scheme for solving the technical problems is as follows:
A text error correction method for the supervision field, comprising:
S1, acquiring ASR real-time translation text;
S2, classifying the translation text through the trained BERT classification model, outputting the supervision sub-field to which the translation text belongs, and labeling the translation text according to the supervision sub-field;
S3, performing error correction processing on the translation text with the label through the trained sub-field BERT error correction model to obtain an error correction text.
The beneficial effects of the invention are as follows: the translation text is classified by the BERT classification model and then corrected by the sub-field BERT error correction model that matches its supervision sub-field, yielding the error correction text. This effectively improves the word accuracy of the text that ASR (automatic speech recognition) produces from audio in each supervision field in live broadcast scenes, and the method can be applied quickly to related fields.
The data of each supervision sub-field is corrected with a BERT-based method. The BERT model only needs fine-tuning on sub-field data to be plug-and-play, which improves the fit between each supervision sub-field and the error correction algorithm, as well as the supervision accuracy.
A classification algorithm based on the bidirectional auto-encoding pre-trained language model BERT labels the data as non-supervision data or as a specific supervision category, so that supervision-field data is distinguished from non-supervision data and a finer division into supervision sub-fields is obtained.
Further, the step S3 further includes:
inputting the error correction text into the BERT classification model for classification, and returning the error correction text and the label when the field in the classification result is consistent with the label; if not, reclassifying the translated text.
The beneficial effects of adopting the further scheme are as follows: the error correction text is fed back into the BERT classification model, and the field in the classification result is checked against the label. This provides a classification review of the error correction text and improves the accuracy of both classification and error correction.
Further, before S2, the method further includes:
adopting a bidirectional-encoding BERT model;
setting a sentence vector at the output layer of the BERT model;
using a softmax function at the output layer of the BERT model, and setting the classification parameters at the output layer;
calculating the iteration loss of the BERT model through a cross entropy loss function;
and updating the BERT model parameters with the Adam optimizer and learning rate decay, thereby constructing the BERT classification model.
The beneficial effects of adopting the further scheme are as follows: according to the invention, the supervision sub-field division of the translation text is realized through the constructed BERT classification model, and the suitability and supervision accuracy of each sub-supervision field and an error correction algorithm are improved.
Further, the step S2 further includes:
collecting the ASR translation text and the manually translated standard text of original voice information in historical live broadcast scenes;
labeling the translation text and the standard text respectively according to the supervision type of the supervision sub-field to which the standard text belongs, obtaining a first labeling corpus translation text and a first labeling corpus standard text;
and storing the first labeling corpus translation text and the first labeling corpus standard text into the corresponding supervision domain database to form an original corpus.
The beneficial effects of adopting the further scheme are as follows: according to the scheme, a training source is provided for the BERT classification model and the BERT error correction model in the sub-field through the constructed original corpus.
Further, the step S2 further includes:
setting the first labeling corpus translation text as the negative samples of the BERT classification model;
selecting, from the first labeling corpus standard text, positive corpus that is easily misjudged as negative, and randomly selecting normal corpus, as the positive samples of the BERT classification model;
constructing a classification training set from the negative samples and the positive samples;
setting the model parameters of the BERT classification model;
and inputting the classification training set into the BERT classification model for training to obtain the trained BERT classification model.
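The sample construction above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name, the label strings, the number of random samples and the placeholder texts are all assumptions, and the source does not say how the easily misjudged positive corpus is chosen, so it is simply passed in.

```python
import random

def build_classification_set(supervised, hard_normal, normal_pool,
                             n_random, seed=0):
    """Return shuffled (text, label) pairs for the sub-field classifier.

    supervised  : list of (asr_text, sub_field) pairs, the negative samples
    hard_normal : normal texts that are easily misjudged as supervised
    normal_pool : larger pool of ordinary normal texts
    n_random    : how many ordinary normal texts to draw at random
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    samples = [(text, sub_field) for text, sub_field in supervised]
    samples += [(text, "normal") for text in hard_normal]
    drawn = rng.sample(normal_pool, min(n_random, len(normal_pool)))
    samples += [(text, "normal") for text in drawn]
    rng.shuffle(samples)
    return samples

# Placeholder data (hypothetical); real data would come from the original corpus.
supervised = [("asr text a", "sub_field_1"), ("asr text b", "sub_field_2")]
hard_normal = ["normal text that looks risky"]
normal_pool = ["normal text %d" % i for i in range(10)]
train_set = build_classification_set(supervised, hard_normal, normal_pool,
                                     n_random=3)
```

Passing the hard positives in separately from the random normal pool keeps them from being diluted when the pool is large.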
The beneficial effects of adopting the further scheme are as follows: a classification algorithm based on the bidirectional auto-encoding pre-trained language model BERT labels the data as non-supervision data or as a specific supervision category, so that supervision-field data is distinguished from non-supervision data and a finer division into supervision sub-fields is obtained.
Further, before the step S3, the method further includes:
respectively acquiring the translation text and the manually translated standard text of the ASR of each supervision sub-field in the original corpus;
and performing alignment processing on the translation text and the standard text by using an alignment algorithm to obtain an error correction training set.
Further, before the step S3, the method further includes: setting training parameters of the sub-domain BERT error correction model, taking the error correction training set as input of the sub-domain BERT error correction model, taking the standard text as a training target, and training the sub-domain BERT error correction model to obtain the trained sub-domain BERT error correction model.
The beneficial effects of adopting the further scheme are as follows: through the trained sub-field BERT error correction model, the data of each supervision sub-field is corrected with a BERT-based method. The BERT model only needs fine-tuning on sub-field data to be plug-and-play, which improves the fit between each supervision sub-field and the error correction algorithm, as well as the supervision accuracy.
Further, before the step S3, the method further includes:
adding a fully connected network layer after the output layer of the sub-field BERT error correction model;
mapping the output of each token neuron of the sub-field BERT error correction model to the word vector dimension of the BERT pre-training model;
applying layer normalization as a normalization constraint to the sub-field BERT error correction model, and obtaining the embedding parameter matrix of the normalized sub-field BERT error correction model;
mapping the fully connected output at each token position to a vector over the vocabulary of the sub-field BERT error correction model through the embedding parameter matrix;
normalizing the mapped output through softmax, and calculating the fine-tuning iteration loss of the sub-field BERT error correction model at the valid character positions through a cross entropy loss function;
and finally updating the embedding parameter matrix of the sub-field BERT error correction model with the Adam optimizer and learning rate decay.
The beneficial effects of adopting the further scheme are as follows: a fully connected network layer is added, the output of each token neuron is mapped to the dimension of the BERT word vector, and the fully connected output at each token position is mapped to a vector of the BERT vocabulary size. The iteration loss of the fine-tuned BERT at the valid character positions is calculated through a cross entropy loss function, and the model parameters are finally updated using Adam as the optimizer with learning rate decay, which optimizes the parameters of the sub-field BERT error correction model.
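The update rule named above (Adam combined with learning rate decay) can be illustrated on a single scalar parameter. The source does not state which decay schedule or hyperparameters are used, so the exponential schedule, the values of `base_lr`, `decay_rate` and `decay_steps`, and the toy quadratic loss below are all assumptions chosen only to make the sketch runnable.

```python
import math

def decayed_lr(base_lr, decay_rate, step, decay_steps):
    """Exponential decay (assumed schedule): base_lr * decay_rate ** (step / decay_steps)."""
    return base_lr * decay_rate ** (step / decay_steps)

def adam_step(param, grad, m, v, step, lr,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter; step counts from 1."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** step)             # bias correction
    v_hat = v / (1 - beta2 ** step)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Minimise f(x) = (x - 3)^2 as a stand-in for the fine-tuning loss.
x, m, v = 0.0, 0.0, 0.0
for step in range(1, 2001):
    grad = 2 * (x - 3)
    lr = decayed_lr(base_lr=0.1, decay_rate=0.5, step=step, decay_steps=500)
    x, m, v = adam_step(x, grad, m, v, step, lr)
```

In a real fine-tuning run the same two pieces would be applied to every parameter tensor, with the loss gradient coming from backpropagation rather than a closed form.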
The other technical scheme for solving the technical problems is as follows:
a storage medium having instructions stored therein which, when read by a computer, cause the computer to perform a text error correction method for a supervision domain as described in any one of the above aspects.
The other technical scheme for solving the technical problems is as follows:
a text error correction apparatus comprising:
a memory for storing a computer program;
and the processor is used for executing the computer program to realize the text error correction method for the supervision field according to any one of the schemes.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
FIG. 1 is a schematic flow chart provided by an embodiment of the present invention;
FIG. 2 is a schematic diagram of a text error correction method according to other embodiments of the present invention;
FIG. 3 is a schematic diagram of a BERT classification model according to other embodiments of the present invention;
FIG. 4 is a schematic structural diagram of a transformer block provided in other embodiments of the present invention;
FIG. 5 is a schematic structural diagram of a sub-domain BERT error correction model according to another embodiment of the present invention.
Detailed Description
The principles and features of the present invention are described below with reference to the drawings; the illustrated embodiments are provided for illustration only and are not intended to limit the scope of the present invention.
As shown in fig. 1, a text error correction method for a supervision domain according to an embodiment of the present invention includes:
S1, acquiring ASR real-time translation text;
S2, classifying the translation text through the trained BERT classification model, outputting the supervision sub-field to which the translation text belongs, and labeling the translation text according to the supervision sub-field;
In one embodiment, obtaining the trained BERT classification model may comprise:
Collect the ASR translation text corpus and the original voice information from live broadcasts, and annotate the data using the voice information to obtain standard text information. At the same time, manually extract annotated corpus for each supervision category as training samples for the supervision sub-field division model, and manually label each ASR sample with the supervision sub-field it belongs to. The supervision categories may include pornography, politics, prohibited content and so on, and the data distributions of these categories differ considerably; for example, pornography: text 1 containing pornography; prohibited content: text 2 containing regulated or contraband content. Note that the annotated corpus can be obtained by manual labeling. The main purpose here is to determine the supervision label of each annotated item, and two kinds of annotation are involved: (1) annotation of the text corresponding to the audio, which produces the pairs used to train the error correction models; (2) annotation of the supervision label to which the text belongs, which is defined specifically to distinguish the label of each text, so that the pairs are subdivided into error correction pairs belonging to different supervision labels.
The annotated corpus can be extracted in several ways: (1) randomly sample live broadcast service data and carry out the two annotation processes (text for the audio, and text label); (2) use an existing supervision classification model for preliminary screening to obtain a rough classification, then carry out the two annotation processes to obtain the final data. Generally, method (1) is used at the start; once preliminary data has been collected and a classification model has been built, method (2) is used.
Set the ASR annotated corpus samples that require supervision as the negative samples of the model; select positive corpus that is easily misjudged as negative, together with a randomly selected batch of normal corpus, as the positive samples; and construct the sample set from them. A note on selecting positive corpus that is easily misjudged as negative: a pair such as "gambling" and "blogging" is a correlated sample for error correction, whereas the samples constructed here are supervision label classification samples, used by the classification model to decide whether a sample belongs to a supervision label. The positive and negative samples here must be correlated, but not in the way "gambling" and "blogging" are; under pornography, for example: text 1 containing pornography.
Select the bidirectional-encoding BERT model, Chinese_base version, as the pre-training model. Use the "[CLS]" token vector at the first token position of the output layer as the sentence vector of the whole input sentence. Apply softmax at the output layer with n+1 classes, which represent supervision sub-field 1, supervision sub-field 2, ..., supervision sub-field n and the normal class. Calculate the iteration loss of BERT with a cross entropy loss function, and finally update the model parameters using Adam as the optimizer with learning rate decay, thereby constructing the BERT supervision sub-field division model.
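The classification head described above (a linear layer over the "[CLS]" sentence vector, softmax over n+1 classes, cross entropy loss) can be sketched in plain Python. The 4-dimensional vector, the hand-picked weights and the label names are hypothetical toy values; a real model would take a 768-dimensional vector from BERT-base and learn the weights.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target_index])

def classify(cls_vector, weights, biases, labels):
    """Linear layer over the [CLS] sentence vector, then softmax."""
    logits = [sum(w * x for w, x in zip(row, cls_vector)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs

# Hypothetical label set: n = 2 supervision sub-fields plus the normal class.
LABELS = ["sub_field_1", "sub_field_2", "normal"]
# Toy [CLS] vector and weights, for illustration only.
cls_vec = [0.5, -1.0, 2.0, 0.0]
W = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
b = [0.0, 0.0, 0.0]

label, probs = classify(cls_vec, W, b, LABELS)
loss = cross_entropy(probs, LABELS.index("normal"))
```

Keeping the normal class as one of the n+1 outputs lets a single forward pass both separate supervision data from normal data and pick the sub-field.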
Set the training parameters of the BERT sub-field division model, and feed the constructed samples into the BERT classification model as input for training.
Store the trained BERT sub-field division model.
S3, performing error correction processing on the translation text with the label through the trained sub-field BERT error correction model to obtain an error correction text.
In an embodiment, obtaining the trained sub-domain BERT error correction model may include:
Collect the ASR translation text corpus and the original voice information of the supervision sub-field, and annotate the data using the voice information to obtain standard text information, which serves as the original corpus for fine-tuning the ASR error correction model. For example, an ASR translation text and the audio annotation text containing regulated or contraband content, whose pronunciations are close to each other, form a text pair.
Using the original voice information, collect the N-best translation text corpus through the ASR system to expand the corpus that corresponds to the standard text of each utterance, as data enhancement for the corpus used to fine-tune the ASR error correction model. In one embodiment, when translating audio to text, the ASR system produces the top N best (N-best) text candidates for a piece of audio, and the final translation result is the best of them. With top-50, for example, the 50 candidates usually contain more information about translation errors than the top-1 result alone, which makes this a good way to enhance and expand the data.
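As a rough sketch of this N-best expansion, each distinct candidate can be paired with the same standard text. The function name and the English placeholder strings are hypothetical, and the choice to deduplicate candidates while keeping a candidate that already equals the standard text is an assumption, since the source does not specify either.

```python
def expand_with_nbest(reference, nbest):
    """Pair each distinct N-best hypothesis with the reference text.

    reference : the manually annotated standard text for one utterance
    nbest     : list of ASR candidate texts, best first
    Returns a list of (hypothesis, reference) training pairs.
    """
    seen = set()
    pairs = []
    for hyp in nbest:
        if hyp in seen:        # skip verbatim duplicate candidates
            continue
        seen.add(hyp)
        pairs.append((hyp, reference))
    return pairs

# Placeholder utterance with four candidates, one of them duplicated.
pairs = expand_with_nbest(
    "the cat sat",
    ["the cat sat", "the cat sad", "the cat sad", "the hat sat"],
)
```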
Construct the training corpus for fine-tuning BERT from all the corpus obtained; the construction process may be as follows:
Align the standard text and the translated text as character strings using an alignment algorithm based on the Levenshtein distance, which yields correct, insert, delete and replace marks. Taking the standard text as the template, keep only the characters marked correct or replace in the translated text, and replace the characters carrying other marks with the characters at the corresponding positions in the standard text, thereby constructing the training corpus for fine-tuning BERT.
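The alignment step can be sketched with a standard Levenshtein dynamic program and a backtrace. The tag convention is an assumption made for this sketch: "insert" marks a character present only in the ASR text and "delete" marks a character of the standard text that the ASR text lacks. With that convention the constructed input always has the same length as the standard text, which is what a per-position correction model needs.

```python
def align(hyp, ref):
    """Levenshtein alignment of an ASR hypothesis against the reference.

    Returns a list of (tag, hyp_char, ref_char) with tag in
    {"correct", "replace", "insert", "delete"}.
    """
    n, m = len(hyp), len(ref)
    # dist[i][j] = edit distance between hyp[:i] and ref[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / replace
                             dist[i - 1][j] + 1,          # extra hyp char
                             dist[i][j - 1] + 1)          # missing ref char
    # Backtrace from the bottom-right corner of the table.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            if dist[i][j] == dist[i - 1][j - 1] + cost:
                tag = "correct" if cost == 0 else "replace"
                ops.append((tag, hyp[i - 1], ref[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("insert", hyp[i - 1], None))
            i -= 1
        else:
            ops.append(("delete", None, ref[j - 1]))
            j -= 1
    ops.reverse()
    return ops

def build_training_input(hyp, ref):
    """Keep hypothesis characters tagged correct/replace; fill the rest from ref."""
    chars = []
    for tag, h, r in align(hyp, ref):
        if tag in ("correct", "replace"):
            chars.append(h)
        elif tag == "delete":
            chars.append(r)
        # "insert": character exists only in the hypothesis, so it is dropped
    return "".join(chars)
```

For example, `build_training_input("abxd", "abcd")` keeps the substituted character and returns `"abxd"`, while `build_training_input("abd", "abcd")` fills the missing position from the standard text and returns `"abcd"`.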
Select the bidirectional-encoding BERT model, Chinese_base version, as the pre-training model. After the 12 layers, add one fully connected network layer that maps the output of each token neuron of the BERT model to the dimension of the BERT word vector, and apply a layer normalization constraint. Using the embedding parameter matrix shared with the BERT pre-training model, map the fully connected output at each token position to a vector of the BERT vocabulary size, and normalize it into probabilities with softmax. Calculate the iteration loss of the fine-tuned BERT at the valid character positions with a cross entropy loss function, and finally update the model parameters using Adam as the optimizer with learning rate decay.
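The output head described above can be sketched in plain Python: fully connected layer, layer normalization, projection through the shared embedding matrix, softmax, and cross entropy averaged over valid positions only. This is a simplified illustration under stated assumptions: bias terms, the activation function and the learnable layer normalization gain are omitted, and the 2-dimensional hidden states and 3-entry vocabulary are toy values, not those of BERT-base.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def layer_norm(vec, eps=1e-12):
    """Normalise a vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def correction_loss(hidden_states, fc_weights, embedding, targets, mask):
    """Mean cross entropy over the positions where mask is 1.

    hidden_states : per-token output vectors of the last transformer layer
    fc_weights    : weights of the added fully connected layer (d x d)
    embedding     : shared embedding matrix (vocab_size x d)
    targets       : vocabulary index of the correct character per position
    mask          : 1 for valid character positions, 0 for padding
    """
    total, count = 0.0, 0
    for h, target, valid in zip(hidden_states, targets, mask):
        if not valid:
            continue
        z = layer_norm(matvec(fc_weights, h))
        probs = softmax(matvec(embedding, z))  # logits = E . z (tied weights)
        total -= math.log(probs[target])
        count += 1
    return total / count if count else 0.0

# Toy example: vocabulary of 3 entries, hidden size 2, last position is padding.
E = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
FC = [[1.0, 0.0], [0.0, 1.0]]
H = [[2.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
loss = correction_loss(H, FC, E, targets=[0, 1, 2], mask=[1, 1, 0])
```

Reusing the embedding matrix for the output projection means the head adds only the fully connected layer's parameters on top of the pre-trained model.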
Set the training parameters of BERT; the training data uses the translated text character sequence constructed in the steps above as input and the annotated text character sequence as the target.
Store the trained BERT error correction model.
In one embodiment, as shown in fig. 2, the text error correction method specifically includes:
Step 1: Acquire the ASR real-time translation text.
Step 2: Pass the ASR real-time translation text through the BERT sub-field division model, which outputs the supervision sub-field to which the text belongs; if the text does not belong to the normal class, mark it with the supervision sub-label to which it belongs.
Step 3: Take the ASR translation text carrying a supervision sub-label from Step 2 and construct the reference text for the fine-tuned error correction BERT, that is, convert the plain translation text into the input form of the model. Select the BERT error correction model fine-tuned on the corpus of that supervision sub-field and correct the text to obtain the error correction text. Note that separate error correction BERT models should be built for supervision sub-fields whose data distributions differ greatly, while strongly correlated sub-fields, such as severe pornography and mild pornography, can share a single model. If resources are insufficient, a single model can be used for the supervision field; the choice can be made according to the actual application scenario.
Step 4: Divide the error correction text produced by sub-field error correction into supervision sub-fields again, and if it does not belong to the normal class, return the final error correction result and the supervision label.
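Steps 1 to 4 can be sketched as a small routing function. The classifier and correctors below are keyword-based stand-ins for the trained BERT models, and their names are hypothetical. What happens when the review label disagrees with the original label is only loosely specified in the source, so retrying with the reviewed label for a bounded number of rounds is an assumption of this sketch.

```python
NORMAL = "normal"

def correct_with_review(asr_text, classify, correctors, max_rounds=2):
    """Classify, correct with the matching sub-field model, then re-classify.

    classify   : callable, text -> label (NORMAL or a sub-field name)
    correctors : dict mapping sub-field name -> callable(text) -> text
    Returns (text, label).
    """
    label = classify(asr_text)
    corrected = asr_text
    for _ in range(max_rounds):
        if label == NORMAL:
            return corrected, NORMAL
        corrected = correctors[label](asr_text)
        reviewed = classify(corrected)
        if reviewed == label:
            return corrected, label
        label = reviewed  # labels disagree: retry with the reviewed label
    return corrected, label

# Toy stand-ins for the trained models (illustrative only).
def toy_classify(text):
    return "gambling" if "gamb" in text else NORMAL

toy_correctors = {"gambling": lambda t: t.replace("gambel", "gamble")}

result = correct_with_review("lets gambel tonight", toy_classify, toy_correctors)
```

Bounding the number of rounds keeps latency predictable for real-time streams even when the classifier and a corrector keep disagreeing.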
In this method, the translation text is classified by the BERT classification model and then corrected by the sub-field BERT error correction model that matches its supervision sub-field, yielding the error correction text. This effectively improves the word accuracy of the text that ASR (automatic speech recognition) produces from audio in each supervision field in live broadcast scenes, and the method can be applied quickly to related fields.
The data of each supervision sub-field is corrected with a BERT-based method. The BERT model only needs fine-tuning on sub-field data to be plug-and-play, which improves the fit between each supervision sub-field and the error correction algorithm, as well as the supervision accuracy.
A classification algorithm based on the bidirectional auto-encoding pre-trained language model BERT labels the data as non-supervision data or as a specific supervision category, so that supervision-field data is distinguished from non-supervision data and a finer division into supervision sub-fields is obtained.
Preferably, in any of the above embodiments, S3 further includes:
inputting the error correction text into the BERT classification model for classification, and returning the error correction text and the label when the field in the classification result is consistent with the label; if not, reclassifying the translated text.
In this scheme, the error correction text is fed back into the BERT classification model, and the field in the classification result is checked against the label. This provides a classification review of the error correction text and improves the accuracy of both classification and error correction.
Preferably, in any of the above embodiments, before S2, further includes:
adopting a bidirectional-encoding BERT model;
setting a sentence vector at the output layer of the BERT model;
using a softmax function at the output layer of the BERT model, and setting the classification parameters at the output layer;
calculating the iteration loss of the BERT model through a cross entropy loss function;
and updating the BERT model parameters with the Adam optimizer and learning rate decay, thereby constructing the BERT classification model.
In one embodiment, as shown in FIG. 3, the BERT classification model structure includes: input, embedding, a BERT bidirectional encoding structure and output, wherein the BERT bidirectional encoding structure comprises a plurality of Trm units, each Trm being a transformer block, as shown in FIG. 4.
According to the invention, the supervision sub-field division of the translation text is realized through the constructed BERT classification model, and the suitability and supervision accuracy of each sub-supervision field and an error correction algorithm are improved.
Preferably, in any of the foregoing embodiments, S2 further includes:
collecting the ASR translation text and the manually translated standard text of original voice information in historical live broadcast scenes;
labeling the translation text and the standard text respectively according to the supervision type of the supervision sub-field to which they belong, obtaining a first labeling corpus translation text and a first labeling corpus standard text;
and storing the first labeling corpus translation text and the first labeling corpus standard text into the corresponding supervision domain database to form an original corpus.
According to the scheme, a training source is provided for the BERT classification model and the BERT error correction model in the sub-field through the constructed original corpus.
Preferably, in any of the foregoing embodiments, S2 further includes:
setting the first labeling corpus translation text as the negative samples of the BERT classification model;
selecting, from the first labeling corpus standard text, positive corpus that is easily misjudged as negative, and randomly selecting normal corpus, as the positive samples of the BERT classification model;
constructing a classification training set from the negative samples and the positive samples;
setting the model parameters of the BERT classification model;
and inputting the classification training set into the BERT classification model for training to obtain the trained BERT classification model.
The scheme uses a classification algorithm based on BERT, a bidirectional auto-encoding pre-trained language model, to label data as non-supervision data or as detailed supervision data, so that supervision domain data is distinguished from non-supervision data and a finer supervision sub-field division is obtained.
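The construction of the classification training set can be illustrated with a short sketch. The text does not say how "easily misjudged" positive corpus is identified, so the predicate `is_hard_positive` below is a hypothetical stand-in supplied by the caller; the function name, the string labels, and the fixed random seed are likewise assumptions made for illustration.

```python
import random

def build_classification_set(asr_texts, standard_texts, is_hard_positive,
                             n_random, seed=0):
    """Combine negative and positive samples into one shuffled training set."""
    rng = random.Random(seed)
    # Labeled ASR translation text serves as the negative samples.
    negatives = [(text, "negative") for text in asr_texts]
    # Positive corpus that is easily misjudged as negative is always kept.
    hard = [t for t in standard_texts if is_hard_positive(t)]
    # The remaining normal corpus is sampled at random.
    normal = [t for t in standard_texts if not is_hard_positive(t)]
    sampled = rng.sample(normal, min(n_random, len(normal)))
    positives = [(text, "positive") for text in hard + sampled]
    dataset = negatives + positives
    rng.shuffle(dataset)
    return dataset
```

Always keeping the hard positives while sampling the rest concentrates the positive class on the cases the classifier is most likely to get wrong.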
Preferably, in any of the above embodiments, before S3, the method further includes:
acquiring, from the original corpus, the ASR translation text and the manually translated standard text of each supervision sub-field;
and aligning the translation text with the standard text by using an alignment algorithm to obtain an error correction training set.
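Claim 1 specifies that the alignment is based on the Levenshtein distance: characters marked correct or replace are kept from the translation text, and characters with other marks are taken from the standard text. The sketch below is one possible reading of that step, with hypothetical function names. It assumes that "delete" means a character present in the standard text but missing from the ASR output, and that "insert" means an extra ASR character, which is dropped so that the training input has the same length as the standard text.

```python
def align_ops(ref, hyp):
    """Levenshtein alignment of standard text (ref) and ASR text (hyp).

    Returns a list of (mark, ref_char, hyp_char) tuples.
    """
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # correct / replace
                             dist[i - 1][j] + 1,         # delete
                             dist[i][j - 1] + 1)         # insert
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # backtrace from the bottom-right corner
        same = i > 0 and j > 0 and ref[i - 1] == hyp[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            ops.append(("correct" if same else "replace", ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("delete", ref[i - 1], None))
            i -= 1
        else:
            ops.append(("insert", None, hyp[j - 1]))
            j -= 1
    ops.reverse()
    return ops

def build_training_input(ref, hyp):
    """Keep ASR characters marked correct/replace; fill the rest from ref."""
    out = []
    for mark, ref_char, hyp_char in align_ops(ref, hyp):
        if mark in ("correct", "replace"):
            out.append(hyp_char)
        elif mark == "delete":
            out.append(ref_char)
        # "insert": the extra ASR character is dropped
    return "".join(out)
```

Under these assumptions each training input lines up character by character with its standard text, which suits a model that predicts one output token per input position.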
Preferably, in any of the above embodiments, before S3, the method further includes: setting training parameters of the sub-field BERT error correction model, taking an error correction training set as input of the sub-field BERT error correction model, taking a standard text as a training target, and training the sub-field BERT error correction model to obtain a trained sub-field BERT error correction model.
In one embodiment, as shown in fig. 5, the sub-field BERT error correction model structure includes: input, embedding, a BERT bidirectional encoding structure, and output, wherein the BERT bidirectional encoding structure comprises a plurality of Trm units, each Trm being a Transformer block, as shown in fig. 4.
With the trained sub-field BERT error correction model, the data of each supervision sub-field is corrected by a BERT-based method. The BERT model becomes plug-and-play after fine-tuning on sub-field data alone, which improves the fit between each supervision sub-field and the error correction algorithm, as well as the supervision accuracy.
Preferably, in any of the above embodiments, before S3, the method further includes:
adding a new fully connected network layer to the output layer of the sub-field BERT error correction model;
mapping the output of each token neuron of the sub-field BERT error correction model to the word vector dimension of the BERT pre-training model;
applying a normalization constraint to the sub-field BERT error correction model through layer normalization to obtain the normalized embedding parameter matrix of the sub-field BERT error correction model;
mapping the output of the fully connected network layer at each token neuron position to word vector values of the sub-field BERT error correction model through the embedding parameter matrix;
normalizing the word vector dimension value and the word vector value through softmax, and calculating, through a cross entropy loss function, the iteration loss for fine-tuning the sub-field BERT error correction model at the valid character positions;
and finally updating the embedding parameter matrix of the sub-field BERT error correction model with Adam under a learning-rate decay schedule.
According to the scheme, a fully connected network layer is added, the output of each token neuron is mapped to the BERT word vector dimension, the fully connected output at each token position is mapped to a vector of the BERT vocabulary size, the iteration loss for fine-tuning BERT at the valid character positions is calculated through a cross entropy loss function, and finally the model parameters are updated with Adam as the optimizer under a learning-rate decay schedule, thereby optimizing the parameters of the sub-field BERT error correction model.
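The output head described above can be sketched in plain Python with small lists standing in for tensors. This is a simplified illustration under stated assumptions: the function names are hypothetical, layer normalization is shown without learned gain and bias, and padding positions are marked by a boolean mask so that only valid character positions contribute to the loss.

```python
import math

def softmax(logits):
    """Numerically stable softmax over vocabulary scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(vec, eps=1e-12):
    """Layer normalization without learned gain and bias (simplification)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def token_logits(hidden, fc_weight, embedding):
    """Project one token's hidden state to vocabulary-size scores.

    fc_weight maps the hidden state to the word vector dimension;
    embedding has one row per vocabulary entry.
    """
    projected = [sum(w * h for w, h in zip(row, hidden)) for row in fc_weight]
    normed = layer_norm(projected)
    return [sum(e * p for e, p in zip(emb_row, normed)) for emb_row in embedding]

def masked_cross_entropy(logits_per_token, target_ids, valid_mask):
    """Mean cross-entropy over valid character positions only."""
    total, count = 0.0, 0
    for logits, target, valid in zip(logits_per_token, target_ids, valid_mask):
        if not valid:
            continue  # skip padding and other non-character positions
        total += -math.log(softmax(logits)[target])
        count += 1
    return total / max(count, 1)
```

Reusing the embedding matrix for the final projection means the vocabulary scores share parameters with the input embeddings, which is consistent with the text's statement that the embedding parameter matrix is what gets updated.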
A storage medium having instructions stored therein which, when read by a computer, cause the computer to perform a text error correction method for use in the supervision field as in any one of the embodiments described above.
A text error correction apparatus comprising:
a memory for storing a computer program;
a processor for executing a computer program to implement a text error correction method for a supervision domain as in any one of the embodiments above.
It is to be understood that in some embodiments, some or all of the alternatives described in the various embodiments above may be included.
It should be noted that, the foregoing embodiments are product embodiments corresponding to the previous method embodiments, and the description of each optional implementation manner in the product embodiments may refer to the corresponding description in the foregoing method embodiments, which is not repeated herein.
The reader will appreciate that, in this specification, descriptions referring to the terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples" mean that a particular feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may join and combine the different embodiments or examples described in this specification, and their features, provided they do not contradict one another.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation. For instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment of the present invention.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.
The present invention is not limited to the above embodiments, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the present invention, and these modifications and substitutions are intended to be included in the scope of the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (6)

1. A text error correction method for a supervision domain, comprising:
S1, acquiring ASR real-time translation text;
S2, classifying the translation text through the trained BERT classification model, outputting the supervision sub-field to which the translation text belongs, and labeling the translation text according to the supervision sub-field;
S3, performing error correction processing on the labeled translation text through the trained sub-field BERT error correction model to obtain an error correction text;
wherein, before S2, the method further comprises:
adopting a BERT model with bidirectional encoding;
setting sentence vectors at an output layer of the BERT model;
using a softmax function at an output layer of the BERT model, and setting classification parameters at the output layer;
calculating the iteration loss of the BERT model through a cross entropy loss function;
updating the BERT model parameters with Adam under a learning-rate decay schedule to construct the BERT classification model;
the step S2 further includes:
collecting translation text and manually translated standard text of ASR of original voice information under a historical live broadcast scene;
labeling the translation text and the standard text, respectively, according to the supervision type in the supervision sub-field to which the standard text belongs, to obtain a first labeling corpus translation text and a first labeling corpus standard text;
storing the first labeling corpus translation text and the first labeling corpus standard text into a corresponding supervision domain database in a database to form an original corpus;
the step S2 further includes: setting the first labeling corpus translation text as a negative sample of the BERT classification model;
selecting positive corpus which is easy to misjudge as the negative sample from the first labeling corpus standard text, and randomly selecting normal corpus as a positive sample of the BERT classification model;
constructing a classification training set according to the negative sample and the positive sample;
setting model parameters of the BERT classification model;
inputting the classification training set into the BERT classification model to train the BERT classification model, and obtaining the trained BERT classification model;
before S3, the method further comprises: acquiring, from the original corpus, the ASR translation text and the manually translated standard text of each supervision sub-field;
performing alignment processing on the translation text and the standard text by using an alignment algorithm to obtain an error correction training set;
wherein performing alignment processing on the translation text and the standard text by using the alignment algorithm to obtain the error correction training set specifically comprises:
aligning the text character strings of the standard text and the translation text by using an alignment algorithm based on the Levenshtein distance to obtain correct, insert, delete, and replace marks after alignment; and, taking the standard text as a template, keeping only the characters marked correct or replace in the translation text and replacing the characters with other marks by the characters at the corresponding positions in the standard text, to obtain the error correction training set.
2. The text error correction method for a supervision domain according to claim 1, wherein the step S3 further comprises:
inputting the error correction text into the BERT classification model for classification processing, and returning the error correction text and the label when the field in the classification result is consistent with the label; otherwise, reclassifying the translation text.
3. The text error correction method for a supervision domain according to claim 1, further comprising, before S3: setting training parameters of the sub-field BERT error correction model, taking the error correction training set as the input of the sub-field BERT error correction model and the standard text as the training target, and training the sub-field BERT error correction model to obtain the trained sub-field BERT error correction model.
4. The text error correction method for a supervision domain according to claim 1 or 3, further comprising, before S3:
adding a new fully connected network layer to the output layer of the sub-field BERT error correction model;
mapping the output of each token neuron of the sub-field BERT error correction model to the word vector dimension of the BERT pre-training model;
applying a normalization constraint to the sub-field BERT error correction model through layer normalization to obtain the normalized embedding parameter matrix of the sub-field BERT error correction model;
mapping the output of the fully connected network layer at each token neuron position to word vector values of the sub-field BERT error correction model through the embedding parameter matrix;
normalizing the word vector dimension value and the word vector value through softmax, and calculating, through a cross entropy loss function, the iteration loss for fine-tuning the sub-field BERT error correction model at the valid character positions;
and finally updating the embedding parameter matrix of the sub-field BERT error correction model with Adam under a learning-rate decay schedule.
5. A storage medium having instructions stored therein which, when read by a computer, cause the computer to perform the text error correction method for a supervision domain according to any one of claims 1 to 4.
6. A text error correction apparatus, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the text error correction method for a supervision domain according to any one of claims 1 to 4.
CN202111052921.9A 2021-09-06 2021-09-06 Text error correction method, storage medium and device for supervision field Active CN114564942B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111052921.9A CN114564942B (en) 2021-09-06 2021-09-06 Text error correction method, storage medium and device for supervision field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111052921.9A CN114564942B (en) 2021-09-06 2021-09-06 Text error correction method, storage medium and device for supervision field

Publications (2)

Publication Number Publication Date
CN114564942A CN114564942A (en) 2022-05-31
CN114564942B true CN114564942B (en) 2023-07-18

Family

ID=81712134

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111052921.9A Active CN114564942B (en) 2021-09-06 2021-09-06 Text error correction method, storage medium and device for supervision field

Country Status (1)

Country Link
CN (1) CN114564942B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858776B (en) * 2022-10-31 2023-06-23 北京数美时代科技有限公司 Variant text classification recognition method, system, storage medium and electronic equipment

Citations (1)

Publication number Priority date Publication date Assignee Title
CN108091328A (en) * 2017-11-20 2018-05-29 北京百度网讯科技有限公司 Speech recognition error correction method, device and readable medium based on artificial intelligence

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN101655837B (en) * 2009-09-08 2010-10-13 北京邮电大学 Method for detecting and correcting error on text after voice recognition
CN109918497A (en) * 2018-12-21 2019-06-21 厦门市美亚柏科信息股份有限公司 A kind of file classification method, device and storage medium based on improvement textCNN model
CN113297833B (en) * 2020-02-21 2024-10-22 华为技术有限公司 Text error correction method, device, terminal equipment and computer storage medium
CN111931490B (en) * 2020-09-27 2021-01-08 平安科技(深圳)有限公司 Text error correction method, device and storage medium

Non-Patent Citations (2)

Title
Understanding BERT in One Article (Principles); 废柴当自强; CSDN; 20190419; 1-13 *
BERT-Based ASR Error Correction; zenRRan; CSDN; 20200716; 1-6 *

Also Published As

Publication number Publication date
CN114564942A (en) 2022-05-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant