CN110263349A - Corpus assessment model training method and apparatus, storage medium, and computer device - Google Patents
Corpus assessment model training method and apparatus, storage medium, and computer device
- Publication number: CN110263349A
- Application number: CN201910176030.0A
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06F40/58 — Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation (G06F — Electric digital data processing; G06F40/00 — Handling natural language data; G06F40/40 — Processing or translation of natural language)
- G06N3/045 — Combinations of networks (G06N — Computing arrangements based on specific computational models; G06N3/00 — Computing arrangements based on biological models; G06N3/02 — Neural networks; G06N3/04 — Architecture, e.g. interconnection topology)
Abstract
This application relates to a corpus assessment model training method, apparatus, computer-readable storage medium, and computer device. The method includes: obtaining a parallel corpus, the parallel corpus including a source text and a corresponding reference translation; translating the source text with a machine translation model to obtain a corresponding machine-translated text; using the source text and the machine-translated text together as a training sample for a corpus assessment model; comparing the machine-translated text with the reference translation to obtain a training label corresponding to the training sample; and training the corpus assessment model with the training sample and the corresponding training label. The solution provided by this application improves model training efficiency and model quality.
Description
Technical field
This application relates to the field of machine learning, and in particular to a corpus assessment model training method and apparatus, a storage medium, and a computer device.
Background
With the development of machine learning, machine translation technology has emerged. In the field of machine translation, the parallel corpora used to train machine translation models may contain a large amount of noise, which degrades the quality of the trained models. How to screen low-noise parallel corpora out of a large pool of parallel corpora for training machine translation models has therefore become a pressing problem.
The traditional way of screening parallel corpora relies mainly on hand-engineered features for corpus assessment: noise-free parallel corpora serve as positive examples, and negative examples are constructed by applying artificial noise, matched to the hand-engineered features, to the reference translations of the corpora. Training data for the data screening model is built in this way, and the model is then trained on it.
However, because this traditional mode of model training requires large amounts of manually constructed corpora and manually labeled data, considerable time is spent both on obtaining the training data and on training the model with it, so the training efficiency of the machine learning model is low.
Summary of the invention
Based on this, in view of the low training efficiency of traditional data screening models, it is necessary to provide a corpus assessment model training method and apparatus, a computer-readable storage medium, and a computer device.
A corpus assessment model training method comprises:
obtaining a parallel corpus, the parallel corpus including a source text and a corresponding reference translation;
translating the source text with a machine translation model to obtain a corresponding machine-translated text;
using the source text and the machine-translated text together as a training sample for a corpus assessment model;
comparing the machine-translated text with the reference translation to obtain a training label corresponding to the training sample; and
training the corpus assessment model with the training sample and the corresponding training label.
A corpus assessment model training apparatus comprises:
an obtaining module, configured to obtain a parallel corpus, the parallel corpus including a source text and a corresponding reference translation;
a translation module, configured to translate the source text with a machine translation model to obtain a corresponding machine-translated text;
a determining module, configured to use the source text and the machine-translated text together as a training sample for a corpus assessment model;
a comparison module, configured to compare the machine-translated text with the reference translation to obtain a training label corresponding to the training sample; and
a training module, configured to train the corpus assessment model with the training sample and the corresponding training label.
A computer-readable storage medium stores a computer program that, when executed by a processor, causes the processor to perform the following steps:
obtaining a parallel corpus, the parallel corpus including a source text and a corresponding reference translation;
translating the source text with a machine translation model to obtain a corresponding machine-translated text;
using the source text and the machine-translated text together as a training sample for a corpus assessment model;
comparing the machine-translated text with the reference translation to obtain a training label corresponding to the training sample; and
training the corpus assessment model with the training sample and the corresponding training label.
A computer device includes a memory and a processor, the memory storing a computer program that, when executed by the processor, causes the processor to perform the following steps:
obtaining a parallel corpus, the parallel corpus including a source text and a corresponding reference translation;
translating the source text with a machine translation model to obtain a corresponding machine-translated text;
using the source text and the machine-translated text together as a training sample for a corpus assessment model;
comparing the machine-translated text with the reference translation to obtain a training label corresponding to the training sample; and
training the corpus assessment model with the training sample and the corresponding training label.
With the above corpus assessment model training method, apparatus, computer-readable storage medium, and computer device, a parallel corpus containing a source text and a corresponding reference translation is obtained, and the source text is translated by a machine translation model to obtain a corresponding machine-translated text. The source text and the machine-translated text together serve as a training sample for the corpus assessment model, and the machine-translated text is compared with the reference translation to obtain the training label corresponding to the training sample. Because machine-translated text naturally contains diverse noise, constructing negative examples no longer depends on manually added noise. Determining the training label from the comparison of the machine-translated text with the reference translation allows large amounts of training data to be constructed without manually annotated corpora, which greatly improves the efficiency of preparing training data and, in turn, the training efficiency of the model. Moreover, the noise contained in the machine-translated text output by the machine translation model covers a wider range than artificial noise and is closer to real scenarios, so the overfitting caused by the limitations of training data is well avoided, and a high-performance corpus assessment model can be trained efficiently.
Brief description of the drawings
Fig. 1 is a diagram of the application environment of the corpus assessment model training method in one embodiment;
Fig. 2 is a flowchart of the corpus assessment model training method in one embodiment;
Fig. 3 is a flowchart of the step of training the corpus assessment model with training samples and corresponding training labels in one embodiment;
Fig. 4 is a schematic diagram of the model structure of the corpus assessment model in one embodiment;
Fig. 5 is a flowchart of processing a parallel corpus with the corpus assessment model in one embodiment;
Fig. 6 is a flowchart of the step of screening target parallel corpora in one embodiment;
Fig. 7 is a flowchart of the step of reordering candidate translations in one embodiment;
Fig. 8 is a flowchart of the corpus assessment model training method in a specific embodiment;
Fig. 9 is a structural block diagram of the corpus assessment model training apparatus in one embodiment;
Fig. 10 is a structural block diagram of the corpus assessment model training apparatus in another embodiment;
Fig. 11 is a structural block diagram of a computer device in one embodiment.
Detailed description
To make the objectives, technical solutions, and advantages of this application clearer, the application is further described below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the application, not to limit it.
Fig. 1 is a diagram of the application environment of the corpus assessment model training method in one embodiment. Referring to Fig. 1, the corpus assessment model training method is applied to a model training system. The model training system includes a terminal 110 and a server 120, connected through a network. The terminal 110 may specifically be a desktop terminal or a mobile terminal, and the mobile terminal may be at least one of a mobile phone, a tablet computer, a notebook computer, and the like. The server 120 may be implemented as an independent server or as a server cluster composed of multiple servers. The terminal 110 and the server 120 can each independently execute the corpus assessment model training method provided in the embodiments of this application, or they can cooperate to execute it.
It should be noted that two kinds of machine learning models are involved in the embodiments of this application. A machine learning model is a model that acquires a certain capability by learning from samples. The first machine learning model in the embodiments of this application is a machine translation model that acquires translation capability by learning from samples. Translation refers to converting language of one language type into language of another language type, for example translating Chinese into English, or Japanese into Korean. The languages here may also be regional languages, such as Minnan (Hokkien) or Cantonese. The second machine learning model in the embodiments of this application is a corpus assessment model that acquires corpus assessment capability by learning from samples. Corpus assessment is the process of scoring the degree of agreement or difference between the source text and the corresponding translated text in a parallel corpus.
The machine learning models may use neural network models, such as a CNN (Convolutional Neural Network) model, an RNN (Recurrent Neural Network) model, or a Transformer model. Of course, the machine learning models may also use other types of models; the embodiments of this application impose no limitation here.
It can be understood that when a corpus assessment model needs to be trained but only low-noise parallel corpora are available, the solution provided in the embodiments of this application can be used: machine-translated texts containing diverse noise are obtained with a machine translation model. The source text and the machine-translated text serve as the training sample, and the comparison result of the machine-translated text and the reference translation serves as the training label for training the corpus assessment model. Constructing negative examples thus no longer depends on manually adding various kinds of noise, training data can be constructed conveniently, and the training efficiency of the model is greatly improved.
The machine translation model used in this application may be a machine translation model pre-trained on the obtained parallel corpora, or a machine translation model pre-trained on other parallel corpora. The translated texts produced by this machine translation model contain diverse noise. Here, noise refers to factors that affect the accuracy of a translated text, for example wrong translations, disordered word order, missing translations, or incoherent logic. The noise in a translated text reflects its translation quality: the more noise, the worse the translation quality; the less noise, the better the translation quality. For the pre-training process of the machine translation model, refer to the detailed description in the subsequent embodiments.
In the embodiments of this application, the corpus assessment model trained with the above training data can be used in various scenarios that require data screening. For example, in the field of machine translation, the parallel corpora used for model training may contain a large amount of noise, which affects the quality of the machine translation model. The trained corpus assessment model can then assess the parallel corpora to obtain corresponding corpus assessment scores, and the parallel corpora with higher assessment scores are selected as training data for the machine translation model, so that a machine translation model with high translation accuracy is obtained by training.
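As an illustration of this screening scenario, the following is a minimal Python sketch. It assumes the trained corpus assessment model is exposed as a callable `assess(source, translation) -> float`; the function name and the 0.8 threshold are hypothetical choices for illustration, not prescribed by this application.

```python
# Minimal corpus-screening sketch. `assess` stands in for the trained corpus
# assessment model (a hypothetical wrapper); the threshold is illustrative.
from typing import Callable, List, Tuple

def screen_corpus(
    pairs: List[Tuple[str, str]],          # (source text, translated text) pairs
    assess: Callable[[str, str], float],   # trained corpus assessment model
    threshold: float = 0.8,
) -> List[Tuple[str, str]]:
    """Keep only the parallel pairs whose assessment score clears the threshold."""
    return [(src, tgt) for src, tgt in pairs if assess(src, tgt) >= threshold]
```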
As shown in Fig. 2, in one embodiment, a corpus assessment model training method is provided. This embodiment is mainly illustrated with the method applied to a computer device, which may specifically be the terminal 110 or the server 120 in Fig. 1. Referring to Fig. 2, the method specifically includes the following steps:
S202: obtain a parallel corpus; the parallel corpus includes a source text and a corresponding reference translation.
A parallel corpus consists of bilingual (or multilingual) text pairs composed of source texts and their parallel corresponding translated texts; the alignment granularity may be word level, sentence level, paragraph level, or document level. The source text and the translated text correspond to different languages. For example, if the source text is a Chinese text, the corresponding translated text may be a text in a non-Chinese language such as English or French. A parallel corpus can be denoted <X, Y>, where X is the source text and Y is the reference translation parallel and corresponding to the source text, for example <"谢谢", "Thank you"> or <"今天", "Today"> (the source-side items being Chinese in the original examples).
The reference translation is the standard translation of the source text and is a low-noise or noise-free translated text. The reference translation may be a translation produced manually from the source text, or a machine-translated text that has been manually corrected. It can be understood that a standard translation at least satisfies conditions such as being fluent and expressing a meaning identical or similar to that of the source text. The source text may specifically be a word, a sentence, a paragraph, or a document; correspondingly, the reference translation may also be a word, sentence, paragraph, or document corresponding to the source text. The source text may be a text in any language, and the reference translation is then a text in a language different from that of the source text.
Specifically, the computer device may crawl texts in corresponding different languages from the Internet as parallel corpora, or may receive texts in corresponding different languages sent by other computer devices as parallel corpora. Among the corresponding texts in different languages, which language's text serves as the source text and which serves as the reference translation depends on the translation direction of the machine translation model.
It can be understood that the parallel corpus obtained by the computer device in step S202 includes more than one bilingual sentence pair of source text and reference translation, so that more than one set of training data can be constructed. In general, the more sets of training data there are, the wider the range of text content covered, and the more beneficial this is to model training.
S204: translate the source text with the machine translation model to obtain a corresponding machine-translated text.
A machine-translated text is a translation obtained when the machine translation model translates text in actual use. Because a machine-translated text is produced by a machine translation model, its quality depends on the performance of that machine learning model, so the machine-translated text may contain diverse noise, such as incorrect word order, missing words, superfluous words, or confused grammar. It can be understood that the source text may be a text in any language, the machine-translated text is a text in a language different from that of the source text, and the machine-translated text and the reference translation are texts in the same language.
For example, suppose the source text is a Chinese sentence meaning "Can I buy a TV when I move in?", the reference translation is "Can I get a TV when I move in?", and the machine translation model outputs "I move in buy a TV?". Clearly, the machine-translated text produced for this source text is not very accurate and has semantic and grammatical gaps.
The machine translation model is a pre-trained machine learning model. During pre-training, it learns to translate text of the language type of the source text into text of the language type of the reference translation in S202. After pre-training, the machine translation model can therefore process the source text in the obtained parallel corpus and output a machine-translated text corresponding to the source text.
Specifically, the computer device can obtain the machine-translated text produced by translating the source text with the machine translation model. The computer device may translate the source text directly with the machine translation model to obtain the machine-translated text, or may obtain, from another computer device or from the network, a machine-translated text that the machine translation model produced for the source text in advance.
In one embodiment, after obtaining the source text, the computer device may perform word segmentation on it to obtain the word sequence composed of the resulting words. Through word embedding, the computer device can convert the discrete word sequence into a corresponding initial vector sequence. The initial vector sequence is then input into the pre-trained machine translation model; the hidden layers of the machine translation model process the initial vector sequence to obtain the corresponding hidden-layer vectors, and the output layer converts the hidden-layer vectors into the machine-translated text for output.
A hidden layer is a term from neural network models; it is an intermediate layer relative to the input layer and the output layer, and it contains the model parameters obtained by training the neural network model. Here, the hidden layers of the machine translation model are the intermediate layers relative to the input layer and the output layer of the machine translation model. All the intermediate layers between the input layer and the output layer may be referred to collectively as the hidden layer, or these intermediate layers may be divided into multiple hidden layers. The hidden layers of the machine translation model may include a multi-layer neural network structure, and each layer of that structure may include one or more neural network layers.
The hidden layers of the machine translation model can be viewed as a black box. A hidden-layer vector is the result obtained after the hidden layers process the data input into them. There may be one or more hidden-layer vectors; when there are multiple, they form a hidden-layer vector sequence.
The pre-trained machine translation model uses a neural-network-based Sequence-to-Sequence framework, which includes an Encoder-Decoder structure. The Encoder-Decoder structure converts an input sequence into an output sequence: the encoder transforms the input sequence into a vector, and the decoder accepts the vector and generates the output sequence step by step in temporal order. The encoder and the decoder may use the same type of neural network model or different types, for example a CNN (Convolutional Neural Network) model, an RNN (Recurrent Neural Network) model, a Long Short-Term Memory (LSTM) model, a time-delay network model, or a gated convolutional neural network model.
In one embodiment, after obtaining the parallel corpora, the computer device may train the machine translation model with them. That is, the initial vector sequence corresponding to the source text is input into the machine translation model, processed by the hidden layers of the model, and a translation result is output. The computer device can then adjust the model parameters in the direction that reduces the difference between the translation result and the reference translation, and continue training until the training stop condition is reached.
In one embodiment, the computer device may directly acquire a machine translation model trained on other parallel corpora, and translate the source text with that model to obtain the machine-translated text. It should be noted that the machine translation model used for this translation may also be a model with translation capability obtained by other training methods or with other model structures; no limitation is imposed here.
S206: use the source text and the machine-translated text together as a training sample for the corpus assessment model.
Specifically, the computer device may use the source text and the corresponding machine-translated text together as a training sample for the corpus assessment model. Because the machine-translated text contains diverse noise, its coverage is wider than that of manually added noise and closer to real scenarios, which makes it well suited as training data for the corpus assessment model.
In one embodiment, after obtaining the source text and the machine-translated text, the computer device may perform word segmentation on the source text to obtain the word sequence composed of its words, and perform word segmentation on the machine-translated text to obtain the word sequence composed of its words. Further, through word embedding, the computer device can convert the discrete word sequence of the source text into a corresponding initial vector sequence, and convert the discrete word sequence of the machine-translated text into a corresponding initial vector sequence. The initial vector sequence corresponding to the source text and the initial vector sequence corresponding to the machine-translated text are then input separately into the corpus assessment model and processed by the hidden layers included in the corpus assessment model.
S208: compare the machine-translated text with the reference translation to obtain a training label corresponding to the training sample.
Specifically, the computer device may compare the machine-translated text with the reference translation to obtain a comparison result, and determine the training label corresponding to the training sample according to the comparison result.
In one embodiment, the computer device may use a preset text matching method to compute the matching degree between the machine-translated text and the reference translation, and use the matching degree, or a linear transformation of it, as the training label corresponding to the training sample. The computer device may also use a preset text difference calculation method to compute the difference degree between the machine-translated text and the reference translation, and use the difference degree, or a linear transformation of it, as the training label corresponding to the training sample.
It can be understood that the higher the matching degree between the machine-translated text and the reference translation, the better the translation quality of the machine-translated text is considered to be; the lower the matching degree, the worse the translation quality. Equivalently, the smaller the difference between the machine-translated text and the reference translation, the better the translation quality; the larger the difference, the worse the translation quality. Since the corpus assessment model is a model that evaluates the quality of parallel corpora, which can also be understood as a model that evaluates the translation quality of translated texts, the computer device can determine the training label of a training sample from the matching degree or the difference between the machine-translated text and the reference translation.
In one embodiment, step S208, comparing the machine-translated text with the reference translation to obtain the training label corresponding to the training sample, specifically includes the following steps: computing the matching degree between the machine-translated text and the reference translation according to a preset text matching method, and using the matching degree as the training label corresponding to the training sample.
The text matching method is the strategy used to compute the matching degree between the machine-translated text and the reference translation. There are many text matching methods, and the computer device can choose any one of them in advance as the preset text matching method.
In general, after a source text is translated into a machine-translated text by a machine translation model, an evaluation metric is needed to evaluate the quality of the translation, so the calculation method of such an evaluation metric can serve as the text matching method. Evaluation metrics include, for example, BLEU (Bilingual Evaluation Understudy), NIST (National Institute of Standards and Technology), word error rate (WER), and TER (translation error rate).
Specifically, after obtaining the machine-translated text produced by the machine translation model for the source text, the computer device can compare it with the reference translation, compute the matching degree between the machine-translated text and the reference translation according to the preset text matching method, and use the matching degree as the training label corresponding to the training sample. The matching degree can be denoted M(Y', Y), where Y' is the machine-translated text and Y is the reference translation.
In the above embodiment, the matching degree between the machine-translated text and the reference translation can be computed according to the preset text matching method and used as the training label corresponding to the training sample, which avoids the need to manually construct and label corpora and greatly improves the efficiency of model training.
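As a concrete illustration, a simplified sentence-level matching degree in the spirit of BLEU (clipped n-gram precision combined with a brevity penalty) can be computed as in the sketch below. This is an illustrative approximation, not the exact formula prescribed by this application.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of the token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def matching_degree(machine_tokens, reference_tokens, max_n=4):
    """Simplified BLEU-style matching degree M(Y', Y):
    clipped n-gram precisions combined with a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(machine_tokens, n))
        ref = Counter(ngrams(reference_tokens, n))
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        total = sum(cand.values())
        precisions.append(max(overlap, 1e-9) / max(total, 1))  # smooth zero counts
    bp = min(1.0, math.exp(1 - len(reference_tokens) / max(len(machine_tokens), 1)))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# The matching degree then serves directly as the training label.
label = matching_degree("I move in buy a TV ?".split(),
                        "Can I get a TV when I move in ?".split())
```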
S210: train the corpus assessment model with the training sample and the corresponding training label.
Specifically, the training of the corpus assessment model is a supervised training process. The computer device inputs the training sample into the corpus assessment model and uses the corresponding training label as the target output; by adjusting the model parameters of the corpus assessment model, the actual output of the model is made to approach the target output continually.
In one embodiment, the computer device may input the training sample into the corpus assessment model for training and obtain a corpus assessment result. A loss function is constructed from the difference between the corpus assessment result and the training label. The model parameters that minimize the loss function are taken as the model parameters of the corpus assessment model, and the process returns to the step of inputting training samples into the corpus assessment model for training, until the training stop condition is met and training stops.
The training stop condition is the condition for ending model training. It may be reaching a preset number of iterations, or the performance metric of the corpus assessment model reaching a preset level after the model parameters are adjusted.
In one embodiment, the computer device may use the correlation between the predictions of the corpus assessment model and the actual results as a measure of the model's performance. Specifically, the correlation between predictions and actual results can be embodied by their Pearson correlation coefficient.
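For instance, with NumPy the Pearson correlation coefficient between predicted and actual scores can be computed directly; the arrays below are made-up illustrative values, not data from this application.

```python
import numpy as np

# Hypothetical assessment scores predicted by the model vs. the actual
# labels M(Y', Y); the values are invented for illustration.
predicted = np.array([0.91, 0.40, 0.75, 0.12])
actual = np.array([0.88, 0.35, 0.80, 0.20])

# Pearson correlation coefficient; the closer to 1, the better the model
# tracks the true corpus quality.
pearson_r = np.corrcoef(predicted, actual)[0, 1]
print(f"Pearson r = {pearson_r:.3f}")
```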
In one embodiment, because the machine-translated texts output by the machine translation model contain diverse noise whose coverage is wider than artificial noise and closer to real scenarios, training the corpus assessment model with the training samples and training labels of the embodiments of this application generally does not lead to overfitting when the amount of data is sufficient.
With the above corpus assessment model training method, apparatus, computer-readable storage medium, and computer device, a parallel corpus containing a source text and a corresponding reference translation is obtained, and the source text is translated by the machine translation model into a corresponding machine-translated text. The source text and the machine-translated text together serve as a training sample for the corpus assessment model, and the machine-translated text is compared with the reference translation to obtain the training label corresponding to the training sample. Because machine-translated text contains diverse noise, constructing negative examples no longer depends on manually added noise. Determining the training label from the comparison of the machine-translated text with the reference translation allows large amounts of training data to be constructed without manually annotated corpora, which greatly improves the efficiency of preparing training data and, in turn, the training efficiency of the model. Moreover, the noise contained in the machine-translated texts output by the machine translation model covers a wider range than artificial noise and is closer to real scenarios, so the overfitting caused by the limitations of training data is well avoided, and a high-performance corpus assessment model can be trained efficiently.
In one embodiment, step S210, training the corpus assessment model with the training sample and the corresponding training label, specifically includes the following steps:
S302: process the word sequence of the source text and the word sequence of the machine-translated text respectively through two parallel recurrent neural network structures in the corpus assessment model.
A recurrent neural network (RNN) structure is a kind of neural network that takes sequence data as input, recurses along the progression of the sequence, and connects all its nodes (recurrent units) in a chain.
Specifically, after obtaining the source text and the machine-translated text, the computer device may perform word segmentation on each of them to obtain the word sequence composed of the words of the source text and the word sequence composed of the words of the machine-translated text. Further, the word sequence of the source text and the word sequence of the machine-translated text can be processed respectively through two parallel recurrent neural network structures in the corpus assessment model.
In one embodiment, through word embedding, the computer device may convert the discrete word sequence of the source text into a corresponding initial vector sequence, and convert the discrete word sequence of the machine-translated text into a corresponding initial vector sequence. The initial vector sequence corresponding to the source text is then input into one of the recurrent neural network structures in the corpus assessment model, and the initial vector sequence corresponding to the machine-translated text is input into the other recurrent neural network structure.
In one embodiment, the recurrent neural network structures in the corpus assessment model serve to extract the semantic information implicit in the input data and to encode the input data into vector form. In one embodiment, the recurrent neural network structures in the corpus assessment model are implemented through an LSTM-based Encoder-Decoder framework, or only through the encoder of the Encoder-Decoder structure.
S304: through the feed-forward neural network structure of the corpus assessment model, splice the vectors output by the two parallel recurrent neural network structures and continue processing to obtain a corpus assessment result.
A feed-forward neural network structure is a one-way multi-layer neural network structure in which each layer contains several neurons, neurons within the same layer are not interconnected, and information is transmitted between layers in one direction only. The corpus assessment result is the result obtained after corpus assessment processing of the source text and the machine-translated text; it characterizes how good the translation quality of the machine-translated text is.
Specifically, through the feed-forward neural network structure of the corpus assessment model, the computer device may splice the vectors output by the two parallel recurrent neural network structures to obtain a spliced vector, and apply at least one of a linear transformation and a nonlinear transformation to the spliced vector to obtain the corpus assessment result.
In one embodiment, the feed-forward neural network structure of the corpus assessment model may apply a linear transformation and a sigmoid function to the spliced vector and output a value between 0 and 1, which serves as the corpus assessment result.
Refer to Fig. 4, a schematic diagram of the model structure of the corpus assessment model in one embodiment. As shown in Fig. 4, the word sequence of the source text and the word sequence of the machine-translated text are input into two corresponding recurrent neural network channels (RNN channel 1 and RNN channel 2 in Fig. 4). The two parallel recurrent neural network structures each process the word sequence input to them. The vectors output by the two parallel recurrent neural network structures are input into the feed-forward neural network structure, which splices the input vectors into a spliced vector, processes the spliced vector, and outputs the corpus assessment result.
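A minimal PyTorch sketch of the two-channel structure in Fig. 4 follows. The use of LSTM cells, the embedding and hidden dimensions, and mean-pooling as the weighted sum are illustrative assumptions; the application only requires two parallel recurrent structures whose output vectors are spliced and processed by a feed-forward structure.

```python
import torch
import torch.nn as nn

class CorpusAssessmentModel(nn.Module):
    """Two parallel recurrent channels plus feed-forward fusion, as in Fig. 4."""
    def __init__(self, src_vocab, tgt_vocab, emb_dim=256, hidden_dim=512):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb_dim)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb_dim)
        self.rnn_src = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # RNN channel 1
        self.rnn_tgt = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # RNN channel 2
        self.ffn = nn.Sequential(                                      # feed-forward structure
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, src_ids, mt_ids):
        h_src, _ = self.rnn_src(self.src_emb(src_ids))  # (B, T_src, H) hidden-layer vectors
        h_mt, _ = self.rnn_tgt(self.tgt_emb(mt_ids))    # (B, T_mt, H)
        v_src = h_src.mean(dim=1)                       # weighted sum (here: averaging)
        v_mt = h_mt.mean(dim=1)
        fused = torch.cat([v_src, v_mt], dim=-1)        # splice the two output vectors
        return torch.sigmoid(self.ffn(fused)).squeeze(-1)  # assessment score in (0, 1)
```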
S306: adjust the model parameters of the corpus assessment model according to the difference between the corpus assessment result and the training label and continue training, ending training when the training stop condition is met.
The training stop condition is the condition for ending model training; it may be reaching a preset number of iterations, or the performance metric of the corpus assessment model reaching a preset level after the model parameters are adjusted. Adjusting the model parameters of the corpus assessment model means making adjustments to the model parameters of the corpus assessment model.
Specifically, the computer device may compare the corpus assessment result with the training label and adjust the model parameters of the corpus assessment model in the direction that reduces the difference. If the training stop condition is not met after the model parameters are adjusted, the process returns to step S302 to continue training, and training ends when the training stop condition is met.
In one embodiment, the computer device may construct the loss function from the squared error between the corpus assessment result and the training label. In each round of training, the model parameters of the corpus assessment model are updated by minimizing the loss function or driving it below a preset threshold, and training stops when the training stop condition is met.
In the above embodiment, the word sequence of the source text and the word sequence of the machine-translated text are processed respectively through two parallel recurrent neural network structures in the corpus assessment model, and the vectors output by the two structures are spliced and further processed by the feed-forward neural network structure of the corpus assessment model to obtain the corpus assessment result. The corpus assessment model can then be trained according to the difference between the corpus assessment result and the training label, in the direction that reduces the difference. In this way, the assessment model can learn deep semantic information and perform corpus assessment at the semantic level. During model training, by continually adjusting the model parameters, a corpus assessment model that accurately assesses the translation quality of translated texts can be trained quickly, improving training efficiency.
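Building on the model sketch above, one round of this training procedure might look like the following sketch. The optimizer, learning rate, and `train_loader` (an iterable of batched tensors) are assumptions for illustration; the squared-error objective follows the embodiment described above.

```python
# Illustrative training loop for the corpus assessment model sketched above.
# `train_loader` is an assumed iterable of (src_ids, mt_ids, labels) batches;
# optimizer choice and learning rate are arbitrary illustrative settings.
model = CorpusAssessmentModel(src_vocab=32000, tgt_vocab=32000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()  # squared error between assessment result and training label

for src_ids, mt_ids, labels in train_loader:  # labels are matching degrees M(Y', Y)
    scores = model(src_ids, mt_ids)           # corpus assessment results in (0, 1)
    loss = loss_fn(scores, labels)            # difference from the training label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                          # adjust parameters to reduce the difference
```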
In one embodiment, step S302, processing the word sequence of the source text and the word sequence of the machine-translated text respectively through two parallel recurrent neural network structures in the corpus assessment model, specifically includes: performing semantic encoding on the word sequence of the source text through the encoder of the first recurrent neural network structure in the corpus assessment model to obtain a first semantic vector sequence, and then decoding the first semantic vector sequence step by step through the decoder of the first recurrent neural network structure to obtain a first hidden-layer vector sequence; computing a weighted sum of the vectors in the first hidden-layer vector sequence through the first recurrent neural network structure and outputting the resulting vector; performing semantic encoding on the word sequence of the machine-translated text through the encoder of the second recurrent neural network structure in the corpus assessment model to obtain a second semantic vector sequence, and then decoding the second semantic vector sequence step by step through the decoder of the second recurrent neural network structure to obtain a second hidden-layer vector sequence; and computing a weighted sum of the vectors in the second hidden-layer vector sequence through the second recurrent neural network structure and outputting the resulting vector; where the first recurrent neural network structure and the second recurrent neural network structure are parallel.
Specifically, the two parallel recurrent neural network structures in the corpus assessment model process their input data in the same way: the encoder of each recurrent neural network structure performs semantic encoding on the input word sequence to obtain a semantic vector sequence, and the decoder of each structure then decodes the semantic vector sequence into a hidden-layer vector sequence. The difference is that one recurrent neural network structure processes the word sequence corresponding to the source text, while the other processes the word sequence corresponding to the machine-translated text.
In one embodiment, the encoder of the first recurrent neural network structure in the corpus assessment model may perform semantic encoding on each word of the source text's word sequence in order, obtaining the semantic vector corresponding to each word and thus the first semantic vector sequence corresponding to the word sequence of the source text.
When semantically encoding the current word, the hidden layer of the encoder may take the semantic vectors of the preceding words, directly or after processing, as input for the semantic encoding of the current word, obtaining the semantic vector of the current word. That is, the semantic vector of the current word fuses the semantic vectors of the preceding words. In this way, the semantic vector of each word in the word sequence of the source text contains not only the semantic information of that word but also the semantic information of the preceding words, so that the semantic representation of the resulting first semantic vector sequence is more accurate. Here, the preceding words are the words before the current word, which may be all of those words or only some of them.
The following example illustrates how the encoder of the first recurrent neural network structure in the corpus assessment model semantically encodes the word sequence of the source text word by word to obtain the first semantic vector sequence. Take a source text X: the computer device segments the source text into the word sequence X = (x_1, x_2, ..., x_m). After the word sequence is input into the encoder of the first recurrent neural network structure, the hidden layer of the encoder semantically encodes x_1 to obtain the corresponding semantic vector v_1, then encodes x_2 according to v_1 to obtain the corresponding semantic vector v_2, and so on, until the semantic vector v_m corresponding to x_m is obtained, finally yielding the first semantic vector sequence V = (v_1, v_2, ..., v_m).
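In compact form, this encoder recurrence can be written as follows (notation ours, summarizing the walkthrough above; $f_{\mathrm{enc}}$ denotes the encoder's hidden-layer transformation and $v_0$ an initial, e.g. zero, vector):

```latex
v_t = f_{\mathrm{enc}}(v_{t-1},\, x_t), \quad t = 1, \dots, m,
\qquad V = (v_1, v_2, \dots, v_m)
```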
Further, the encoder of the first recurrent neural network structure in the corpus assessment model passes the obtained first semantic vector sequence to the decoder, and the hidden layer of the decoder decodes the first semantic vector sequence to obtain the first hidden-layer vector sequence. Here, the first semantic vector sequence and the first hidden-layer vector sequence can both reflect the semantic information and syntactic information of the word sequence of the source text.
In one embodiment, the decoder of the first recurrent neural network generates the first hidden-layer vectors one by one. When the decoder generates the current first hidden-layer vector, it can obtain the first hidden-layer vector output at the previous step, and decode the first semantic vector sequence output by the encoder according to that previously output vector, obtaining the current first hidden-layer vector. In this way, each first hidden-layer vector contains not only the semantic information of each word in the word sequence of the source text but also the semantic information of the previously output first hidden-layer vector. The decoder splices the first hidden-layer vectors in the temporal order of their generation to obtain the first hidden-layer vector sequence.
In one embodiment, when the decoder decodes the first of the first hidden-layer vectors, a random vector or a default vector can serve as the "previous" first hidden-layer vector: based on the random or default vector and the first semantic vector sequence passed by the encoder, the decoder obtains the first of the first hidden-layer vectors. It then decodes based on that first hidden-layer vector and the semantic vector sequence to obtain the second first hidden-layer vector, and so on, until the last first hidden-layer vector is obtained.
The following example illustrates how the decoder of the first recurrent neural network in the corpus assessment model decodes the first semantic vector sequence to obtain the first hidden-layer vector sequence. The hidden layer of the decoder decodes based on a random or default vector and the first semantic vector sequence V = (v_1, v_2, ..., v_m) to obtain the first hidden-layer vector h_1; it then decodes based on h_1 and V = (v_1, v_2, ..., v_m) to obtain the second hidden-layer vector h_2; and so on, until h_n is obtained, finally yielding the first hidden-layer vector sequence H = (h_1, h_2, ..., h_n).
Further, through the first recurrent neural network structure, a weighted sum of the vectors in the first hidden-layer vector sequence is computed, and the resulting vector is output. The weight of each first hidden-layer vector in the weighted sum may be a preset weight or a weight computed by the corpus assessment model. In one embodiment, the computer device may average the first hidden-layer vectors to obtain a vector representing the source text; this vector fuses the semantic information and syntactic information of each word in the source text.
It can be understood that the operations of semantically encoding the word sequence of the machine-translated text through the encoder of the second recurrent neural network structure in the corpus assessment model to obtain the second semantic vector sequence, and then decoding the second semantic vector sequence step by step through the decoder of the second recurrent neural network structure to obtain the second hidden-layer vector sequence, are the same as the encoding and decoding operations performed by the first recurrent neural network structure described above. The difference is that the first recurrent neural network structure processes the word sequence corresponding to the source text, while the second recurrent neural network structure processes the word sequence corresponding to the machine-translated text. For details of how the second recurrent neural network structure encodes and decodes the word sequence corresponding to the machine-translated text, refer to the above description of how the first recurrent neural network structure encodes and decodes the word sequence corresponding to the source text.
In one embodiment, when the second recurrent neural network structure processes the word sequence corresponding to the machine-translated text, initial data can be provided by the first recurrent neural network structure. For example, when the second recurrent neural network structure decodes, the previous second hidden-layer vector used at the current decoding step can be determined from the last first hidden-layer vector output by the first recurrent neural network structure. Alternatively, when the second recurrent neural network structure decodes, the content vector to be computed at the current decoding step can be determined from the last content vector in the first recurrent neural network structure.
Further, through the second recurrent neural network structure, a weighted sum of the vectors in the second hidden-layer vector sequence is computed, and the resulting vector is output. The weight of each second hidden-layer vector in the weighted sum may be a preset weight or a weight computed by the corpus assessment model. In one embodiment, the computer device may average the second hidden-layer vectors to obtain a vector representing the machine-translated text; this vector fuses the semantic information and syntactic information of each word in the machine-translated text.
In the above embodiment, the encoders and decoders of the two parallel recurrent neural network structures in the corpus assessment model perform the corresponding encoding and decoding processing on the word sequence of the source text and the word sequence of the machine-translated text respectively, obtaining the first hidden-layer vector sequence and the second hidden-layer vector sequence. The two parallel recurrent neural network structures then compute weighted sums over the vectors in the first hidden-layer vector sequence and in the second hidden-layer vector sequence respectively, and each outputs the resulting vector. In this way, the source text and the machine-translated text are encoded and decoded separately by the two parallel recurrent neural network structures, and features corresponding to the source text and the machine-translated text can be extracted at a deep semantic level.
In one embodiment, the step of decoding the first semantic vector sequence step by step through the decoder of the first recurrent neural network structure to obtain the first hidden-layer vector sequence specifically includes: obtaining, through the decoder of the first recurrent neural network structure, the attention distribution weight vector corresponding to the first semantic vector sequence at the current decoding step; computing the current content vector from the attention distribution weight vector and the first semantic vector sequence; computing the current first hidden-layer vector from the first hidden-layer vector previously output by the decoder of the first recurrent neural network structure and the current content vector; and combining the first hidden-layer vectors sequentially output by the decoder of the first recurrent neural network structure to obtain the first hidden-layer vector sequence corresponding to the source text.
In one embodiment, the decoder of the first recurrent neural network may apply an attention mechanism (Attention) to the first semantic vectors to obtain the content vector corresponding to the first semantic vectors. This content vector fuses the semantic information and syntactic information of the source text.
In one embodiment, during the current decoding step, the computer device may obtain the attention distribution weight vector corresponding to the first semantic vector sequence. Each attention distribution weight in this vector corresponds to one of the first semantic vectors in the first semantic vector sequence. The first semantic vectors are fused according to their corresponding attention distribution weights to obtain the content vector; the fusion may specifically take the form of a weighted summation. The decoder can then decode based on the content vector and the previously output first hidden-layer vector to obtain the current first hidden-layer vector. The computer device combines the first hidden-layer vectors sequentially output by the decoder of the first recurrent neural network structure to obtain the first hidden-layer vector sequence corresponding to the source text. The attention distribution weight of each first semantic vector indicates how strongly the information produced by the encoder is selected to assist decoding.
In one embodiment, the attention distribution weight of each first semantic vector is computed as follows: the first hidden-layer vector output by the decoder at the previous step is compared with each first semantic vector, that is, a score function F(h_{i-1}, v_m) gives the affinity between the current first hidden-layer vector and each first semantic vector. The outputs of F(h_{i-1}, v_m) are then normalized by a Softmax function to obtain attention distribution weights whose values lie in the interval of a probability distribution. The combination of these attention distribution weights is the attention distribution weight vector. Here, i denotes the i-th decoding step.
The following illustrates, by way of example, the process in which the decoder of the first recurrent neural network in the corpus assessment model decodes the first semantic vector sequence to obtain the first hidden-layer vector sequence. In the current decoding step, the current content vector is computed as

c_i = \sum_m \alpha_{i,m} \cdot v_m

where \alpha_{i,m} denotes the attention distribution weight between the i-th decoding step and the m-th first semantic vector, v_m denotes the m-th first semantic vector, and \cdot denotes the vector dot operation. Denoting the current first hidden-layer vector by h_i and the previous first hidden-layer vector by h_{i-1}, the first hidden-layer vector is computed as

h_i = f(h_{i-1}, c_i)

where f(\cdot) denotes an activation function. Each first hidden-layer vector is computed by this formula in turn, and the decoder then concatenates the first hidden-layer vectors to obtain the hidden-layer vector sequence.
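To make the two formulas above concrete, the following is a minimal NumPy sketch of one attention-based decoding step. The score function F is taken as a simple dot product and the activation f as tanh over a sum; the patent text fixes neither choice, so both are illustrative assumptions only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_step(h_prev, semantic_vectors):
    """One decoding step: attention weights -> content vector -> hidden vector.

    h_prev: previous first hidden-layer vector h_{i-1}, shape (d,)
    semantic_vectors: first semantic vector sequence, shape (m, d)
    """
    # F(h_{i-1}, v_m): dot-product score, one per semantic vector (assumed form of F)
    scores = semantic_vectors @ h_prev
    # Softmax normalization yields the attention distribution weights alpha_{i,m}
    alpha = softmax(scores)
    # c_i = sum_m alpha_{i,m} * v_m
    c_i = alpha @ semantic_vectors
    # h_i = f(h_{i-1}, c_i); tanh over a sum is an assumed choice of f
    h_i = np.tanh(h_prev + c_i)
    return h_i, alpha

# run the decoder over a few steps and collect the hidden-layer vector sequence
V = np.random.randn(4, 8)   # four first semantic vectors
h = np.zeros(8)             # initial hidden state
hidden_sequence = []
for _ in range(4):
    h, _ = decode_step(h, V)
    hidden_sequence.append(h)
```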
In one embodiment, the computer device may also compute the current first hidden-layer vector from the first hidden-layer vector previously output by the decoder of the first recurrent neural network structure, the previously output target word, and the current content vector. The current target word is then determined from the current first hidden-layer vector, and is in turn used in computing the next first hidden-layer vector.
In the above embodiment, fusing information through the attention mechanism selects the relevant information from the encoder to assist decoding, so that each hidden layer in the recurrent neural network structure learns a more complete representation. This reduces the loss of useful information during corpus assessment and substantially improves the accuracy of corpus assessment.
It can be understood that the operation of decoding the second semantic vector sequence step by step through the decoder of the second recurrent neural network structure to obtain the second hidden-layer vector sequence is the same as the operation, described above, of decoding the first semantic vector sequence through the decoder of the first recurrent neural network structure to obtain the first hidden-layer vector sequence. The difference is that the decoder of the first recurrent neural network structure decodes the first semantic vector sequence, while the decoder of the second recurrent neural network structure decodes the second semantic vector sequence. For the details of how the decoder of the second recurrent neural network structure decodes the second semantic vectors, refer to the description of how the decoder of the first recurrent neural network structure decodes the first semantic vectors.
Refer to Fig. 5, which is a schematic diagram of processing a parallel corpus through the corpus assessment model in one embodiment. As shown in Fig. 5, the computer device may input the word sequence of the source text, for example X = (x_1, x_2, x_3, x_4), and the word sequence of the machine translation text, for example Y' = (y'_1, y'_2, y'_3, y'_4), into the two recurrent neural network channels of the corpus assessment model, respectively. It can be understood that X = (x_1, x_2, x_3, x_4) and Y' = (y'_1, y'_2, y'_3, y'_4) are only examples; this application places no limit on the lengths of the word sequences of the source text and the machine translation text. The encoders of the two parallel recurrent neural network channels process the word sequence of the source text and the word sequence of the machine translation text, respectively, yielding the corresponding content vector sequences C = (c_1, c_2, c_3, c_4) and C' = (c'_1, c'_2, c'_3, c'_4). The decoders of the two parallel recurrent neural network channels then perform the decoding, yielding the corresponding first hidden-layer vector sequence H = (h_1, h_2, h_3, h_4) and second hidden-layer vector sequence H' = (h'_1, h'_2, h'_3, h'_4). The first recurrent neural network structure averages the first hidden-layer vectors in H = (h_1, h_2, h_3, h_4) and outputs the vector h_src representing the source text. The second recurrent neural network structure averages the second hidden-layer vectors in H' = (h'_1, h'_2, h'_3, h'_4) and outputs the vector h_mt representing the machine translation text. The vectors h_src and h_mt are then input into the feedforward neural network structure for further processing, which outputs the corpus assessment result.
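The end-to-end flow of Fig. 5 can be sketched roughly as follows in PyTorch. This is a schematic reconstruction under stated assumptions — GRU units standing in for the unspecified recurrent cells, mean pooling for the averaging step, and a small two-layer feedforward head — not the patented architecture itself.

```python
import torch
import torch.nn as nn

class CorpusAssessmentModel(nn.Module):
    """Two parallel recurrent channels plus a feedforward head (cf. Fig. 5)."""

    def __init__(self, vocab_src, vocab_mt, emb_dim=64, hid_dim=64):
        super().__init__()
        self.emb_src = nn.Embedding(vocab_src, emb_dim)
        self.emb_mt = nn.Embedding(vocab_mt, emb_dim)
        # GRUs are an assumed choice of recurrent unit
        self.rnn_src = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.rnn_mt = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(2 * hid_dim, hid_dim), nn.Tanh(),
            nn.Linear(hid_dim, 1), nn.Sigmoid(),  # assessment score in [0, 1]
        )

    def forward(self, x, y):
        h_seq_src, _ = self.rnn_src(self.emb_src(x))  # first hidden-layer sequence H
        h_seq_mt, _ = self.rnn_mt(self.emb_mt(y))     # second hidden-layer sequence H'
        h_src = h_seq_src.mean(dim=1)                 # averaging -> h_src
        h_mt = h_seq_mt.mean(dim=1)                   # averaging -> h_mt
        return self.ffn(torch.cat([h_src, h_mt], dim=-1))  # corpus assessment result

model = CorpusAssessmentModel(vocab_src=1000, vocab_mt=1000)
x = torch.randint(0, 1000, (1, 4))   # X = (x1, x2, x3, x4)
y = torch.randint(0, 1000, (1, 4))   # Y' = (y'1, y'2, y'3, y'4)
score = model(x, y)
```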
In one embodiment, the corpus assessment model training method further includes a step of determining a combined dimension vector. This step specifically includes: obtaining the results of assessing the translation quality of the machine translation text under different dimensions; and determining the corresponding combined dimension vector from the results corresponding to the different dimensions. The step of obtaining the corpus assessment result by continuing, through the feedforward neural network structure of the corpus assessment model, to process the concatenation of the vectors output by the two parallel recurrent neural network structures then includes: through the feedforward neural network structure of the corpus assessment model, concatenating the vectors output by the two parallel recurrent neural network structures with the combined dimension vector and continuing processing, obtaining the corpus assessment result.
Here, the different dimensions refer to different evaluation-index dimensions, such as a text-length dimension, a text-distance dimension, and a word-alignment dimension. Assessing the translation quality of the machine translation text under different dimensions may specifically mean computing the degree of match or the degree of difference between the source text and the machine translation text by means such as bidirectional cross entropy, language-model scoring, cosine distance, and word alignment. The degrees of match or difference between the source text and the machine translation text computed in these different ways are regarded as the results of assessing the translation quality of the machine translation text under the different dimensions.
Specifically, the computer device may obtain the results of assessing the translation quality of the machine translation text under the different dimensions, and convert these results into quantized values on a common scale, for example converting the result obtained under each dimension into a numerical value characterizing the degree of match between the source text and the machine translation text. The computer device may then concatenate the results corresponding to the different dimensions into the combined dimension vector.
Further, the computer device may, through the feedforward neural network structure of the corpus assessment model, concatenate the vectors output by the two parallel recurrent neural network structures with the combined dimension vector to obtain a concatenated vector, and apply at least one of a linear transformation and a nonlinear transformation to the concatenated vector to obtain the corpus assessment result.
In one embodiment, the order in which the vectors output by the two parallel recurrent neural network structures and the combined dimension vector are concatenated is not limited; it need only remain the same throughout each training run of the model.
In the above embodiment, the combined feature vector is concatenated with the vectors of the sentence pair output by the recurrent neural networks; that is, hand-engineered features are concatenated into the vector representation of the sentence pair. The corpus assessment model can then learn the feature combination that minimizes the loss, so there is no need to set the weights of the different dimension features a priori, nor to search for suitable weights by grid search, which substantially improves the training efficiency and effectiveness of the model.
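A minimal sketch of this feature concatenation follows, assuming three illustrative dimension features — a length ratio, a cosine-distance-derived value, and a language-model score — that are already converted to match-degree values; the actual features and their normalization are not fixed by the text.

```python
import torch

def combined_dimension_vector(src_len, mt_len, cos_dist, lm_score):
    """Concatenate per-dimension quality results into one feature vector.

    All inputs are assumed to be already converted to match-degree
    values on a common scale, as described above.
    """
    length_ratio = min(src_len, mt_len) / max(src_len, mt_len)
    return torch.tensor([length_ratio, 1.0 - cos_dist, lm_score])

h_src = torch.randn(64)   # pooled source-text vector
h_mt = torch.randn(64)    # pooled machine-translation vector
dims = combined_dimension_vector(src_len=4, mt_len=5,
                                 cos_dist=0.2, lm_score=0.7)
# the splice order is arbitrary but must stay fixed across training
ffn_input = torch.cat([h_src, h_mt, dims])  # fed to the feedforward structure
```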
In one embodiment, the corpus assessment model training method further includes a step of screening a target parallel corpus, which specifically includes the following steps:
S602: obtain a candidate parallel corpus to be processed; the candidate parallel corpus includes candidate source texts and corresponding candidate translation texts.
Specifically, the computer device may crawl texts in the corresponding different languages from the Internet as the candidate parallel corpus. It can be understood that the candidate translation texts in the candidate parallel corpus may contain various kinds of noise, and their translation quality is uneven.
In one embodiment, the computer device may first obtain a monolingual corpus, such as candidate source texts, and then translate the candidate source texts with a pre-trained machine translation model to obtain the candidate translation texts. Alternatively, in another scenario, the computer device may first obtain a monolingual corpus, such as candidate translation texts, and then back-translate the candidate translation texts with a pre-trained machine translation model to obtain the candidate source texts. Since texts produced by a machine translation model may contain various kinds of noise, the candidate parallel corpus constructed in this way is also called a pseudo parallel corpus.
In one embodiment, the candidate source texts are texts in a first language and the candidate translation texts are texts in a second language. The step of obtaining the candidate parallel corpus to be processed specifically includes: obtaining a first parallel corpus and a second parallel corpus, where the first parallel corpus includes candidate source texts in the first language and corresponding candidate intermediate texts in a third language, and the second parallel corpus includes candidate intermediate texts in the third language and corresponding candidate translation texts in the second language; and constructing the candidate parallel corpus from the first parallel corpus and the second parallel corpus, the candidate parallel corpus including candidate source texts in the first language and corresponding candidate translation texts in the second language.
It can be understood that, in a practical machine translation scenario, the low-noise parallel corpus usable for training a machine translation model is usually scarce, and sometimes unavailable altogether. In one embodiment, the computer device may obtain the first parallel corpus and the second parallel corpus from the network. The first parallel corpus includes candidate source texts in the first language and corresponding candidate intermediate texts in the third language; the second parallel corpus includes candidate intermediate texts in the third language and corresponding candidate translation texts in the second language. The translation direction of interest is from source texts in the first language to translation texts in the second language. The computer device can construct the candidate parallel corpus by back-translation through the candidate intermediate texts in the third language shared by the first and second parallel corpora. The candidate parallel corpus includes candidate source texts in the first language and corresponding candidate translation texts in the second language.
For example, suppose a machine translation model between language A and language C is needed, but the computer device can only obtain a first parallel corpus of language A and language B and a second parallel corpus of language B and language C. Say the computer device obtains the first parallel corpus [A1, B1] and the second parallel corpus [B2, C2]. It can train a translation model from language B to language A on the corpus [A1, B1], and then use this trained translation model to back-translate corpus B2 into corpus A2, thereby conveniently and efficiently constructing the pseudo parallel corpus [A2, C2], that is, the candidate parallel corpus.
To reduce the noise introduced into the constructed candidate parallel corpus by back-translation, the trained corpus assessment model can perform corpus assessment on the candidate parallel corpus, so as to screen out a low-noise parallel corpus of language A and language C. The screened parallel corpus can then be used to train a machine translation model, obtaining a machine translation model with good translation ability.
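The pivot construction just described can be sketched as follows. The `translate_b_to_a` function here is hypothetical, standing in for whatever B-to-A model was trained on [A1, B1]; the sketch only shows how the pseudo pairs [A2, C2] are assembled before they are passed to the corpus assessment model for screening.

```python
from typing import Callable, List, Tuple

def build_pseudo_corpus(
    second_corpus: List[Tuple[str, str]],    # pairs (b2, c2) from [B2, C2]
    translate_b_to_a: Callable[[str], str],  # hypothetical B->A model
) -> List[Tuple[str, str]]:
    """Back-translate the B side of [B2, C2] to assemble pseudo pairs [A2, C2]."""
    pseudo = []
    for b2, c2 in second_corpus:
        a2 = translate_b_to_a(b2)  # back-translation; may carry noise
        pseudo.append((a2, c2))
    return pseudo
```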
S604: perform corpus assessment on the candidate source texts and corresponding candidate translation texts through the trained corpus assessment model, obtaining corpus assessment scores corresponding to the candidate translation texts.
Here, a corpus assessment score is the corpus assessment result output by the trained corpus assessment model after a candidate source text and its corresponding candidate translation text are input to it and processed. The corpus assessment score can measure the degree of match or difference between the candidate source text and the candidate translation text: the higher the match (or the smaller the difference) between the candidate source text and the candidate translation text, the higher the corresponding corpus assessment score; the lower the match (or the larger the difference), the lower the corresponding corpus assessment score.
Specifically, the computer device may input the word sequences corresponding to the candidate source text and the candidate translation text into the trained corpus assessment model, which performs corpus assessment on the candidate source text and the corresponding candidate translation text and outputs the corpus assessment score corresponding to the candidate translation text.
S606: screen out, from the candidate parallel corpus, the target parallel corpus whose corpus assessment scores satisfy a preset screening condition.
Here, the preset screening condition may specifically be that the corpus assessment score is greater than or equal to a preset threshold, or that the corpus pairs are ranked by corpus assessment score from high to low and the top N are taken. Specifically, the computer device may screen out, from the candidate parallel corpus, the target parallel corpus whose corpus assessment scores satisfy the preset screening condition. In one embodiment, the computer device may train a machine translation model with the screened target parallel corpus, obtaining a machine translation model with good translation ability. Since the trained corpus assessment model has learned the deep semantic information of source texts and translation texts, it can screen data at the semantic level.
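Both screening conditions named above — a score threshold and a top-N ranking — can be sketched as follows; `score_fn` is a placeholder for the trained corpus assessment model, and the defaults are illustrative.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]  # (candidate source text, candidate translation text)

def screen_corpus(
    candidates: List[Pair],
    score_fn: Callable[[str, str], float],  # trained corpus assessment model
    threshold: Optional[float] = None,
    top_n: Optional[int] = None,
) -> List[Pair]:
    """Keep candidate pairs whose corpus assessment score meets the condition."""
    scored = [(score_fn(src, mt), (src, mt)) for src, mt in candidates]
    if threshold is not None:
        # threshold variant: score greater than or equal to a preset value
        kept = [pair for s, pair in scored if s >= threshold]
    else:
        # top-N variant: rank scores from high to low and take the first N
        scored.sort(key=lambda item: item[0], reverse=True)
        kept = [pair for _, pair in scored[:top_n]]
    return kept
```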
It can be understood that, in practical application scenarios, one often encounters the situation of having only a small amount of low-noise parallel corpus and a large amount of high-noise parallel corpus, and directly training a machine translation model on the high-noise parallel corpus degrades its performance. In this case, the computer device can use the corpus assessment model training method of the various embodiments of this application to train a corpus assessment model on the small amount of low-noise parallel corpus, then screen the large high-noise parallel corpus with the trained corpus assessment model, selecting the target parallel corpus whose corpus assessment scores satisfy the screening condition. The computer device can then train the machine translation model jointly on the low-noise parallel corpus and the screened target parallel corpus, obtaining a machine translation model with good translation ability.
It can be understood that another scenario may also arise in practice: the computer device can only obtain a small amount of low-noise parallel corpus together with a large monolingual corpus. In this case, the computer device can back-translate the monolingual corpus to expand the data, but the back-translated corpus carries a certain amount of noise. The computer device can then use the corpus assessment model training method of the various embodiments of this application to train a corpus assessment model on the small amount of low-noise parallel corpus, screen the back-translated parallel corpus with the trained corpus assessment model, and select the target parallel corpus whose corpus assessment scores satisfy the screening condition. The computer device can then train the machine translation model jointly on the low-noise parallel corpus and the screened target parallel corpus, obtaining a machine translation model with good translation ability.
In the above embodiment, the trained corpus assessment model performs corpus assessment on the candidate source texts and corresponding candidate translation texts, yielding corpus assessment scores corresponding to the candidate translation texts, and the target parallel corpus can be conveniently and efficiently screened out of the candidate parallel corpus according to these scores. In this way, the trained corpus assessment model can extract a low-noise parallel corpus from a high-noise one as the target parallel corpus.
It can be understood that, in the scenario where a machine translation model outputs multiple alternative translation texts, if these alternative translation texts need to be ranked, or a better translation needs to be selected from them, the corpus assessment model obtained in the embodiments of this application can perform corpus assessment on the parallel corpora formed by the alternative translation texts, obtaining a corpus assessment score for each alternative translation text; these scores can then be used to rerank the alternative translation texts.
It should be noted that the machine translation model used in this scenario is not limited to the machine translation models mentioned in the embodiments of this application; it may also be a model with translation capability obtained by other training methods or with other model structures.
In one embodiment, the corpus assessment model training method further includes a step of reranking the alternative translation texts, which specifically includes the following steps:
S702: obtain a text to be translated.
Specifically, the computer device may obtain the text to be translated locally or from another computer device.
S704: input the text to be translated into a machine translation model to obtain multiple alternative translation texts.
Specifically, in the translation scenario the computer device segments the obtained text to be translated into a word sequence and inputs the word sequence into the machine translation model. When the machine translation model outputs multiple alternative translation texts that need to be ranked or selected among, the corpus assessment model trained in the above embodiments of this application can be obtained.
S706: form more than one set of alternative parallel corpora from the text to be translated and each alternative translation text.
Specifically, the computer device may pair the text to be translated with each alternative translation text to form multiple sets of alternative parallel corpora. For example, if the text to be translated is X and machine-translating X with the machine translation model yields the alternative translation texts Y1, Y2, Y3 and Y4, the computer device may form the alternative parallel corpora: alternative parallel corpus 1 [X, Y1], alternative parallel corpus 2 [X, Y2], alternative parallel corpus 3 [X, Y3] and alternative parallel corpus 4 [X, Y4].
S708: perform corpus assessment on each alternative parallel corpus through the trained corpus assessment model, obtaining a corpus assessment score corresponding to each alternative translation text.
Specifically, the computer device may input the multiple alternative parallel corpora into the trained corpus assessment model to obtain the corpus assessment score corresponding to each alternative translation text; the corresponding alternative translation texts can then be reranked according to these scores.
S710: rerank the alternative translation texts according to the corpus assessment scores.
Here, reranking means re-sorting the original ranking of the alternative translation texts. It can be understood that, when the machine translation model translates the text to be translated, there may be multiple choices each time a target word is output, so multiple alternative word sequences, that is, multiple alternative translation texts, are ultimately obtained, each corresponding to a translation probability. Before outputting the target translation, the machine translation model ranks the alternative translation texts by their translation probabilities and outputs the one with the maximum translation probability as the target translation. Thus, before the alternative translation texts are ranked according to the corpus assessment model, the machine translation model has already ranked them preliminarily, and the corpus assessment scores then affect the reranking result.
In one embodiment, the computer device may re-rank the alternative translation texts directly by the magnitude of their corpus assessment scores. Alternatively, the computer device may jointly consider the translation probability and the corpus assessment score of each alternative translation text when re-ranking. For example, the translation probability and the corpus assessment score may each be converted to a hundred-point scale, combined by weighted summation, and the texts reranked by the weighted sum. Of course, the computer device may also use other ways of combining them, which are not limited here.
Here, the reranking may be in descending order, that is, alternative translation texts with high corpus assessment scores rank first and those with low scores rank last; or in ascending order, that is, alternative translation texts with high corpus assessment scores rank last and those with low scores rank first.
In one embodiment, when a better alternative translation text needs to be selected from the alternative translation texts, the top-ranked one or several alternative translation texts can be chosen from the descending ranking, or the last-ranked one or several from the ascending ranking. The screened alternative translation text serves as the target translation corresponding to the text to be translated.
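The weighted combination of translation probability and corpus assessment score can be sketched as follows; both scores are assumed to be rescaled to a hundred-point scale beforehand, and the weights are illustrative, not prescribed by the text.

```python
from typing import List, Tuple

def rerank(
    candidates: List[Tuple[str, float, float]],  # (text, translation_prob, corpus_score)
    w_prob: float = 0.4,
    w_corpus: float = 0.6,
) -> List[str]:
    """Rerank alternative translation texts by a weighted sum of the two scores.

    Both scores are assumed already converted to a hundred-point scale;
    the weights here are illustrative assumptions.
    """
    ranked = sorted(
        candidates,
        key=lambda c: w_prob * c[1] + w_corpus * c[2],
        reverse=True,  # descending order: highest combined score first
    )
    return [text for text, _, _ in ranked]

# e.g. four alternatives Y1..Y4 with hundred-point scores
best = rerank([("Y1", 92, 71), ("Y2", 88, 90), ("Y3", 75, 95), ("Y4", 81, 60)])[0]
```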
In the above embodiment, once training of the corpus assessment model is complete, the trained corpus assessment model can rerank or select among the multiple alternative translation texts produced for a text to be translated, and can thus be applied alongside a machine translation model, widening its range of application.
In a specific embodiment, the corpus assessment model training method specifically includes the following steps (a compact sketch of the procedure follows the steps below):
S802: obtain a parallel corpus; the parallel corpus includes source texts and corresponding reference translation texts.
S804: translate the source text through a machine translation model to obtain the corresponding machine translation text.
S806: take the source text and the machine translation text together as a training sample of the corpus assessment model.
S808: compute the degree of match between the machine translation text and the reference translation text according to a preset text-matching method.
S810: take the degree of match as the training label corresponding to the training sample.
S812: perform semantic encoding on the word sequence of the source text through the encoder of the first recurrent neural network structure in the corpus assessment model to obtain the first semantic vector sequence, and continue, through the decoder of the first recurrent neural network structure, to decode the first semantic vector sequence step by step to obtain the first hidden-layer vector sequence.
S814: perform, through the first recurrent neural network structure, a weighted summation over the vectors in the first hidden-layer vector sequence and output the resulting vector.
S816: perform semantic encoding on the word sequence of the machine translation text through the encoder of the second recurrent neural network structure in the corpus assessment model to obtain the second semantic vector sequence, and continue, through the decoder of the second recurrent neural network structure, to decode the second semantic vector sequence step by step to obtain the second hidden-layer vector sequence.
S818: perform, through the second recurrent neural network structure, a weighted summation over the vectors in the second hidden-layer vector sequence and output the resulting vector; the first recurrent neural network structure and the second recurrent neural network structure are parallel.
S820: obtain the results of assessing the translation quality of the machine translation text under different dimensions.
S822: determine the corresponding combined dimension vector from the results corresponding to the different dimensions.
S824: through the feedforward neural network structure of the corpus assessment model, concatenate the vectors output by the two parallel recurrent neural network structures with the combined dimension vector and continue processing to obtain the corpus assessment result.
S826: adjust the model parameters of the corpus assessment model according to the difference between the corpus assessment result and the training label and continue training, ending training when a training stop condition is met.
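To tie steps S802-S826 together, the following sketch shows one training iteration, reusing the CorpusAssessmentModel sketched after Fig. 5 above. The matching-degree function is a deliberately simple token-overlap ratio standing in for whatever preset text-matching method is actually used (e.g., a BLEU-style metric), and MSE is an assumed choice of loss; neither is prescribed by the text.

```python
import torch
import torch.nn.functional as F

def matching_degree(mt_tokens, ref_tokens):
    """Toy stand-in for the preset text-matching method of S808."""
    overlap = len(set(mt_tokens) & set(ref_tokens))
    return overlap / max(len(set(ref_tokens)), 1)

def train_step(model, optimizer, src_ids, mt_ids, mt_tokens, ref_tokens):
    # S808-S810: matching degree of MT vs. reference becomes the training label
    label = torch.tensor([[matching_degree(mt_tokens, ref_tokens)]])
    # S812-S824: forward pass over (source text, machine translation text)
    pred = model(src_ids, mt_ids)
    # S826: adjust parameters by the difference between result and label
    loss = F.mse_loss(pred, label)  # assumed loss; not fixed by the text
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

model = CorpusAssessmentModel(vocab_src=1000, vocab_mt=1000)  # from the Fig. 5 sketch
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
src = torch.randint(0, 1000, (1, 4))
mt = torch.randint(0, 1000, (1, 5))
loss = train_step(model, opt, src, mt, ["the", "cat", "sat"], ["the", "cat", "sits"])
```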
With the above corpus assessment model training method, a parallel corpus including source texts and corresponding reference translation texts is obtained, and the source text is translated by a machine translation model to obtain the corresponding machine translation text. The source text and the corresponding machine translation text together serve as a training sample of the corpus assessment model. The machine translation text is compared with the corresponding reference translation text to obtain the training label corresponding to the training sample. Since the machine translation text contains various kinds of noise, counterexamples no longer need to be constructed by manually adding noise. Determining the training label from the comparison between the machine translation text and the reference translation text means that large amounts of training data can be constructed without relying on manually annotated corpora, which substantially improves the efficiency of preparing training data and, in turn, the training efficiency of the model. Moreover, since the machine translation text output by the machine translation model contains various kinds of noise, it covers a wider range than artificial noise and is closer to real scenarios, which effectively avoids model overfitting caused by the limitations of the training data and allows a high-performance corpus assessment model to be trained efficiently.
Fig. 8 is a flow diagram of the corpus assessment model training method in one embodiment. It should be understood that although the steps in the flow diagram of Fig. 8 are shown in the order indicated by the arrows, they are not necessarily executed in that order. Unless expressly stated otherwise herein, the execution of these steps is not strictly ordered, and they may be executed in other orders. Moreover, at least some of the steps in Fig. 8 may include multiple sub-steps or stages, which are not necessarily completed at the same moment but may be executed at different times; nor is their execution order necessarily sequential, as they may be executed in turn or alternately with other steps or with sub-steps or stages of other steps.
As shown in Fig. 9, in one embodiment a corpus assessment model training apparatus 900 is provided, including an obtaining module 901, a translation module 902, a determining module 903, a comparison module 904 and a training module 905.
The obtaining module 901 is used to obtain a parallel corpus; the parallel corpus includes source texts and corresponding reference translation texts.
The translation module 902 is used to translate the source text through a machine translation model to obtain the corresponding machine translation text.
The determining module 903 is used to take the source text and the machine translation text together as a training sample of the corpus assessment model.
The comparison module 904 is used to compare the machine translation text with the reference translation text to obtain the training label corresponding to the training sample.
The training module 905 is used to train the corpus assessment model through the training sample and the corresponding training label.
In one embodiment, the comparison module 904 is further used to compute the degree of match between the machine translation text and the reference translation text according to a preset text-matching method, and to take the degree of match as the training label corresponding to the training sample.
In one embodiment, the training module 905 is further used to: process the word sequence of the source text and the word sequence of the machine translation text respectively through two parallel recurrent neural network structures in the corpus assessment model; continue, through the feedforward neural network structure of the corpus assessment model, to process the concatenation of the vectors output by the two parallel recurrent neural network structures, obtaining the corpus assessment result; and adjust the model parameters of the corpus assessment model according to the difference between the corpus assessment result and the training label and continue training, ending training when a training stop condition is met.
In one embodiment, the training module 905 is further used to: perform semantic encoding on the word sequence of the source text through the encoder of the first recurrent neural network structure in the corpus assessment model to obtain the first semantic vector sequence, and continue, through the decoder of the first recurrent neural network structure, to decode the first semantic vector sequence step by step to obtain the first hidden-layer vector sequence; perform, through the first recurrent neural network structure, a weighted summation over the vectors in the first hidden-layer vector sequence and output the resulting vector; perform semantic encoding on the word sequence of the machine translation text through the encoder of the second recurrent neural network structure in the corpus assessment model to obtain the second semantic vector sequence, and continue, through the decoder of the second recurrent neural network structure, to decode the second semantic vector sequence step by step to obtain the second hidden-layer vector sequence; and perform, through the second recurrent neural network structure, a weighted summation over the vectors in the second hidden-layer vector sequence and output the resulting vector; the first recurrent neural network structure and the second recurrent neural network structure are parallel.
In one embodiment, the training module 905 is further used to: obtain, through the decoder of the first recurrent neural network structure, the attention distribution weight vector corresponding to the first semantic vector sequence at the current decoding step; compute the current content vector from the attention distribution weight vector and the first semantic vector sequence; compute the current first hidden-layer vector from the first hidden-layer vector previously output by the decoder of the first recurrent neural network structure and the current content vector; and combine the first hidden-layer vectors sequentially output by the decoder of the first recurrent neural network structure to obtain the first hidden-layer vector sequence corresponding to the source text.
In one embodiment, the training module 905 is further used to: obtain the results of assessing the translation quality of the machine translation text under different dimensions; determine the corresponding combined dimension vector from the results corresponding to the different dimensions; and, through the feedforward neural network structure of the corpus assessment model, concatenate the vectors output by the two parallel recurrent neural network structures with the combined dimension vector and continue processing to obtain the corpus assessment result.
With reference to Figure 10, in one embodiment the corpus assessment model training apparatus 900 further includes an application module 906. The obtaining module 901 is further used to obtain a candidate parallel corpus to be processed; the candidate parallel corpus includes candidate source texts and corresponding candidate translation texts. The trained corpus assessment model performs corpus assessment on the candidate source texts and corresponding candidate translation texts to obtain corpus assessment scores, and the target parallel corpus whose corpus assessment scores satisfy a preset screening condition is screened out of the candidate parallel corpus.
In one embodiment, the candidate source texts are texts in the first language and the candidate translation texts are texts in the second language. The obtaining module 901 is further used to: obtain a first parallel corpus and a second parallel corpus, where the first parallel corpus includes candidate source texts in the first language and corresponding candidate intermediate texts in the third language, and the second parallel corpus includes candidate intermediate texts in the third language and corresponding candidate translation texts in the second language; and construct the candidate parallel corpus from the first parallel corpus and the second parallel corpus, the candidate parallel corpus including candidate source texts in the first language and corresponding candidate translation texts in the second language.
In one embodiment, the application module 906 is further used to: obtain a text to be translated; input the text to be translated into a machine translation model to obtain multiple alternative translation texts; form more than one set of alternative parallel corpora from the text to be translated and each alternative translation text; perform corpus assessment on each alternative parallel corpus through the trained corpus assessment model to obtain the corpus assessment score corresponding to each alternative translation text; and rerank the alternative translation texts according to the corpus assessment scores.
With the above corpus assessment model training apparatus, a parallel corpus including source texts and corresponding reference translation texts is obtained, and the source text is translated by a machine translation model to obtain the corresponding machine translation text. The source text and the corresponding machine translation text together serve as a training sample of the corpus assessment model. The machine translation text is compared with the corresponding reference translation text to obtain the training label corresponding to the training sample. Since the machine translation text contains various kinds of noise, counterexamples no longer need to be constructed by manually adding noise. Determining the training label from the comparison between the machine translation text and the reference translation text means that large amounts of training data can be constructed without relying on manually annotated corpora, which substantially improves the efficiency of preparing training data and, in turn, the training efficiency of the model. Moreover, since the machine translation text output by the machine translation model contains various kinds of noise, it covers a wider range than artificial noise and is closer to real scenarios, which effectively avoids model overfitting caused by the limitations of the training data and allows a high-performance corpus assessment model to be trained efficiently.
Figure 11 shows the internal structure of a computer device in one embodiment. The computer device may specifically be the terminal 110 or the server 120 in Fig. 1. As shown in Figure 11, the computer device includes a processor, a memory and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the corpus assessment model training method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to execute the corpus assessment model training method.
Those skilled in the art will understand that the structure shown in Figure 11 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
In one embodiment, the corpus assessment model training apparatus provided by this application may be implemented in the form of a computer program, and the computer program may run on a computer device as shown in Figure 11. The memory of the computer device may store the program modules that make up the corpus assessment model training apparatus, for example the obtaining module, translation module, determining module, comparison module and training module shown in Fig. 9. The computer program formed by these program modules causes the processor to execute the steps in the corpus assessment model training method of each embodiment of this application described in this specification.
For example, the computer device shown in Figure 11 may execute step S202 through the obtaining module of the corpus assessment model training apparatus shown in Fig. 9, step S204 through the translation module, step S206 through the determining module, step S208 through the comparison module, and step S210 through the training module.
In one embodiment, a computer device is provided, including a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to execute the steps of the above corpus assessment model training method. Here the steps of the corpus assessment model training method may be the steps in the corpus assessment model training method of each of the above embodiments.
In one embodiment, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, causes the processor to execute the steps of the above corpus assessment model training method. Here the steps of the corpus assessment model training method may be the steps in the corpus assessment model training method of each of the above embodiments.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For brevity of description, not all possible combinations of the technical features of the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application, and their description is relatively specific and detailed, but they should not be construed as limiting the patent scope of this application. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application patent shall be subject to the appended claims.
Claims (12)
1. A corpus assessment model training method, comprising:
obtaining a parallel corpus, the parallel corpus comprising a source text and a corresponding reference translation text;
translating the source text through a machine translation model to obtain a corresponding machine translation text;
taking the source text and the machine translation text together as a training sample of a corpus assessment model;
comparing the machine translation text with the reference translation text to obtain a training label corresponding to the training sample; and
training the corpus assessment model through the training sample and the corresponding training label.
2. The method according to claim 1, wherein the comparing the machine translation text with the reference translation text to obtain a training label corresponding to the training sample comprises:
computing a degree of match between the machine translation text and the reference translation text according to a preset text-matching method; and
taking the degree of match as the training label corresponding to the training sample.
3. The method according to claim 1, wherein the training the corpus assessment model through the training sample and the corresponding training label comprises:
processing a word sequence of the source text and a word sequence of the machine translation text respectively through two parallel recurrent neural network structures in the corpus assessment model;
continuing, through a feedforward neural network structure of the corpus assessment model, to process a concatenation of vectors respectively output by the two parallel recurrent neural network structures, obtaining a corpus assessment result; and
adjusting model parameters of the corpus assessment model according to a difference between the corpus assessment result and the training label and continuing training, ending training when a training stop condition is met.
4. The method according to claim 3, wherein the processing a word sequence of the source text and a word sequence of the machine translation text respectively through two parallel recurrent neural network structures in the corpus assessment model comprises:
performing semantic encoding on the word sequence of the source text through an encoder of a first recurrent neural network structure in the corpus assessment model to obtain a first semantic vector sequence, and continuing, through a decoder of the first recurrent neural network structure, to decode the first semantic vector sequence step by step to obtain a first hidden-layer vector sequence;
performing, through the first recurrent neural network structure, a weighted summation over vectors in the first hidden-layer vector sequence and outputting the resulting vector;
performing semantic encoding on the word sequence of the machine translation text through an encoder of a second recurrent neural network structure in the corpus assessment model to obtain a second semantic vector sequence, and continuing, through a decoder of the second recurrent neural network structure, to decode the second semantic vector sequence step by step to obtain a second hidden-layer vector sequence; and
performing, through the second recurrent neural network structure, a weighted summation over vectors in the second hidden-layer vector sequence and outputting the resulting vector;
wherein the first recurrent neural network structure and the second recurrent neural network structure are parallel.
5. The method according to claim 4, wherein the decoding the first semantic vector sequence step by step through the decoder of the first recurrent neural network structure to obtain a first hidden-layer vector sequence comprises:
obtaining, through the decoder of the first recurrent neural network structure, an attention distribution weight vector corresponding to the first semantic vector sequence at a current decoding step;
computing a current content vector according to the attention distribution weight vector and the first semantic vector sequence;
computing a current first hidden-layer vector according to a first hidden-layer vector previously output by the decoder of the first recurrent neural network structure and the current content vector; and
combining first hidden-layer vectors sequentially output by the decoder of the first recurrent neural network structure to obtain the first hidden-layer vector sequence corresponding to the source text.
6. The method according to claim 3, further comprising:
obtaining results of assessing translation quality of the machine translation text under different dimensions; and
determining a corresponding combined dimension vector according to the results corresponding to the different dimensions;
wherein the continuing, through the feedforward neural network structure of the corpus assessment model, to process the concatenation of the vectors respectively output by the two parallel recurrent neural network structures, obtaining a corpus assessment result, comprises:
concatenating, through the feedforward neural network structure of the corpus assessment model, the vectors respectively output by the two parallel recurrent neural network structures with the combined dimension vector, and continuing processing to obtain the corpus assessment result.
7. The method according to any one of claims 1 to 6, further comprising:
obtaining a candidate parallel corpus to be processed, the candidate parallel corpus comprising a candidate source text and a corresponding candidate translation text;
performing corpus assessment on the candidate source text and the corresponding candidate translation text through the trained corpus assessment model to obtain a corpus assessment score; and
screening out, from the candidate parallel corpus, a target parallel corpus whose corpus assessment score satisfies a preset screening condition.
8. The method according to claim 7, wherein the candidate source text is a text in a first language and the candidate translation text is a text in a second language, and the obtaining a candidate parallel corpus to be processed comprises:
obtaining a first parallel corpus and a second parallel corpus, the first parallel corpus comprising a candidate source text in the first language and a corresponding candidate intermediate text in a third language, and the second parallel corpus comprising a candidate intermediate text in the third language and a corresponding candidate translation text in the second language; and
constructing the candidate parallel corpus according to the first parallel corpus and the second parallel corpus, the candidate parallel corpus comprising the candidate source text in the first language and the corresponding candidate translation text in the second language.
9. The method according to any one of claims 1 to 6, further comprising:
obtaining a text to be translated;
inputting the text to be translated into a machine translation model to obtain multiple alternative translation texts;
forming more than one set of alternative parallel corpora from the text to be translated and each alternative translation text;
performing corpus assessment on each alternative parallel corpus through the trained corpus assessment model to obtain a corpus assessment score corresponding to each alternative translation text; and
reranking the alternative translation texts according to the corpus assessment scores.
10. A corpus assessment model training apparatus, comprising:
an obtaining module, configured to obtain a parallel corpus, the parallel corpus comprising a source text and a corresponding reference translation text;
a translation module, configured to translate the source text through a machine translation model to obtain a corresponding machine translation text;
a determining module, configured to take the source text and the machine translation text together as a training sample of a corpus assessment model;
a comparison module, configured to compare the machine translation text with the reference translation text to obtain a training label corresponding to the training sample; and
a training module, configured to train the corpus assessment model through the training sample and the corresponding training label.
11. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the steps of the method according to any one of claims 1 to 9.
12. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to execute the steps of the method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910176030.0A CN110263349B (en) | 2019-03-08 | 2019-03-08 | Corpus evaluation model training method and device, storage medium and computer equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110263349A true CN110263349A (en) | 2019-09-20 |
CN110263349B CN110263349B (en) | 2024-09-13 |
Family
ID=67911765
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910176030.0A Active CN110263349B (en) | 2019-03-08 | 2019-03-08 | Corpus evaluation model training method and device, storage medium and computer equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110263349B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103678285A (en) * | 2012-08-31 | 2014-03-26 | 富士通株式会社 | Machine translation method and machine translation system |
CN104750687A (en) * | 2013-12-25 | 2015-07-01 | 株式会社东芝 | Method for improving bilingual corpus, device for improving bilingual corpus, machine translation method and machine translation device |
US20150286632A1 (en) * | 2014-04-03 | 2015-10-08 | Xerox Corporation | Predicting the quality of automatic translation of an entire document |
CN107632981A (en) * | 2017-09-06 | 2018-01-26 | 沈阳雅译网络技术有限公司 | A kind of neural machine translation method of introducing source language chunk information coding |
CN108829684A (en) * | 2018-05-07 | 2018-11-16 | 内蒙古工业大学 | A kind of illiteracy Chinese nerve machine translation method based on transfer learning strategy |
CN109271643A (en) * | 2018-08-08 | 2019-01-25 | 北京捷通华声科技股份有限公司 | A kind of training method of translation model, interpretation method and device |
CN109359309A (en) * | 2018-12-11 | 2019-02-19 | 成都金山互动娱乐科技有限公司 | A kind of interpretation method and device, the training method of translation model and device |
Non-Patent Citations (2)
Title |
---|
丁亮; 姚长青; 何彦青; 李辉: "Research on the application of deep learning in domain adaptation for statistical machine translation", 情报工程, no. 03, 15 June 2017 (2017-06-15) *
樊文婷; 侯宏旭; 王洪彬; 武静; 李金廷: "Mongolian-Chinese neural machine translation model incorporating prior information", 中文信息学报, no. 06, 15 June 2018 (2018-06-15) *
Cited By (66)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110674871B (en) * | 2019-09-24 | 2023-04-07 | 北京中科凡语科技有限公司 | Translation-oriented automatic scoring method and automatic scoring system |
CN110674871A (en) * | 2019-09-24 | 2020-01-10 | 北京中科凡语科技有限公司 | Translation-oriented automatic scoring method and automatic scoring system |
CN110866119A (en) * | 2019-11-14 | 2020-03-06 | 腾讯科技(深圳)有限公司 | Article quality determination method and device, electronic equipment and storage medium |
CN110956018A (en) * | 2019-11-22 | 2020-04-03 | 腾讯科技(深圳)有限公司 | Training method of text processing model, text processing method, text processing device and storage medium |
CN110956018B (en) * | 2019-11-22 | 2023-04-18 | 腾讯科技(深圳)有限公司 | Training method of text processing model, text processing method, text processing device and storage medium |
CN111027681A (en) * | 2019-12-09 | 2020-04-17 | 腾讯科技(深圳)有限公司 | Time sequence data processing model training method, data processing device and storage medium |
CN111027681B (en) * | 2019-12-09 | 2023-06-27 | 腾讯科技(深圳)有限公司 | Time sequence data processing model training method, data processing method, device and storage medium |
CN111144137A (en) * | 2019-12-17 | 2020-05-12 | 语联网(武汉)信息技术有限公司 | Method and device for generating edited model corpus after machine translation |
CN111144137B (en) * | 2019-12-17 | 2023-09-05 | 语联网(武汉)信息技术有限公司 | Method and device for generating corpus of machine post-translation editing model |
CN111178094A (en) * | 2019-12-20 | 2020-05-19 | 沈阳雅译网络技术有限公司 | Pre-training-based scarce resource neural machine translation training method |
CN111178088A (en) * | 2019-12-20 | 2020-05-19 | 沈阳雅译网络技术有限公司 | Configurable neural machine translation method oriented to XML document |
CN111178094B (en) * | 2019-12-20 | 2023-04-07 | 沈阳雅译网络技术有限公司 | Pre-training-based scarce resource neural machine translation training method |
CN111178088B (en) * | 2019-12-20 | 2023-06-02 | 沈阳雅译网络技术有限公司 | Configurable neural machine translation method for XML document |
CN111178097A (en) * | 2019-12-24 | 2020-05-19 | 语联网(武汉)信息技术有限公司 | Method and device for generating Chinese and Tai bilingual corpus based on multi-level translation model |
CN111178097B (en) * | 2019-12-24 | 2023-07-04 | 语联网(武汉)信息技术有限公司 | Method and device for generating Zhongtai bilingual corpus based on multistage translation model |
CN111221969A (en) * | 2019-12-31 | 2020-06-02 | 国网北京市电力公司 | Text difference recognition method and device |
CN111159356B (en) * | 2019-12-31 | 2023-06-09 | 重庆和贯科技有限公司 | Knowledge graph construction method based on teaching content |
CN111159356A (en) * | 2019-12-31 | 2020-05-15 | 重庆和贯科技有限公司 | Knowledge graph construction method based on teaching content |
CN111325038A (en) * | 2020-02-03 | 2020-06-23 | 苏州交驰人工智能研究院有限公司 | Translation training data generation method and device, computer equipment and storage medium |
CN111325038B (en) * | 2020-02-03 | 2023-08-18 | 苏州交驰人工智能研究院有限公司 | Translation training data generation method, device, computer equipment and storage medium |
CN111259652A (en) * | 2020-02-10 | 2020-06-09 | 腾讯科技(深圳)有限公司 | Bilingual corpus sentence alignment method and device, readable storage medium and computer equipment |
CN111259652B (en) * | 2020-02-10 | 2023-08-15 | 腾讯科技(深圳)有限公司 | Bilingual corpus sentence alignment method and device, readable storage medium and computer equipment |
CN111339789B (en) * | 2020-02-20 | 2023-08-01 | 北京字节跳动网络技术有限公司 | Translation model training method and device, electronic equipment and storage medium |
CN111339789A (en) * | 2020-02-20 | 2020-06-26 | 北京字节跳动网络技术有限公司 | Translation model training method and device, electronic equipment and storage medium |
CN111046679A (en) * | 2020-03-13 | 2020-04-21 | 腾讯科技(深圳)有限公司 | Quality information acquisition method and device of translation model and computer equipment |
CN111368566B (en) * | 2020-03-19 | 2023-06-30 | 中国工商银行股份有限公司 | Text processing method, text processing device, electronic equipment and readable storage medium |
CN111368566A (en) * | 2020-03-19 | 2020-07-03 | 中国工商银行股份有限公司 | Text processing method and device, electronic equipment and readable storage medium |
CN111507114A (en) * | 2020-04-10 | 2020-08-07 | 苏州思必驰信息科技有限公司 | Reverse translation-based spoken language text enhancement method and system |
CN111507114B (en) * | 2020-04-10 | 2023-04-18 | 思必驰科技股份有限公司 | Reverse translation-based spoken language text enhancement method and system |
CN113535969B (en) * | 2020-04-20 | 2023-11-03 | 南京大学 | Corpus expansion method, corpus expansion device, computer equipment and storage medium |
CN113535969A (en) * | 2020-04-20 | 2021-10-22 | 南京大学 | Corpus expansion method and device, computer equipment and storage medium |
CN111554275A (en) * | 2020-05-15 | 2020-08-18 | 深圳前海微众银行股份有限公司 | Speech recognition method, device, equipment and computer readable storage medium |
CN111554275B (en) * | 2020-05-15 | 2023-11-03 | 深圳前海微众银行股份有限公司 | Speech recognition method, device, equipment and computer readable storage medium |
CN111859997B (en) * | 2020-06-16 | 2024-01-26 | 北京百度网讯科技有限公司 | Model training method and device in machine translation, electronic equipment and storage medium |
CN111859997A (en) * | 2020-06-16 | 2020-10-30 | 北京百度网讯科技有限公司 | Model training method and device in machine translation, electronic equipment and storage medium |
CN111797639B (en) * | 2020-06-28 | 2024-03-26 | 语联网(武汉)信息技术有限公司 | Machine translation quality assessment method and system |
CN111797639A (en) * | 2020-06-28 | 2020-10-20 | 语联网(武汉)信息技术有限公司 | Machine translation quality evaluation method and system |
CN111914552A (en) * | 2020-07-31 | 2020-11-10 | 平安科技(深圳)有限公司 | Training method and device of data enhancement model |
CN111898389A (en) * | 2020-08-17 | 2020-11-06 | 腾讯科技(深圳)有限公司 | Information determination method and device, computer equipment and storage medium |
CN111898389B (en) * | 2020-08-17 | 2023-09-19 | 腾讯科技(深圳)有限公司 | Information determination method, information determination device, computer equipment and storage medium |
CN112287656A (en) * | 2020-10-12 | 2021-01-29 | 四川语言桥信息技术有限公司 | Text comparison method, device, equipment and storage medium |
CN112287656B (en) * | 2020-10-12 | 2024-05-28 | 四川语言桥信息技术有限公司 | Text comparison method, device, equipment and storage medium |
CN112257472B (en) * | 2020-11-13 | 2024-04-26 | 腾讯科技(深圳)有限公司 | Training method of text translation model, text translation method and device |
CN112257472A (en) * | 2020-11-13 | 2021-01-22 | 腾讯科技(深圳)有限公司 | Training method of text translation model, and text translation method and device |
CN112668345B (en) * | 2020-12-24 | 2024-06-04 | 中国科学技术大学 | Grammar defect data identification model construction method and grammar defect data identification method |
CN112668345A (en) * | 2020-12-24 | 2021-04-16 | 科大讯飞股份有限公司 | Grammar defect data identification model construction method and grammar defect data identification method |
US11790186B2 (en) * | 2020-12-29 | 2023-10-17 | XL8 Inc | Machine translation apparatus and method |
US20220207245A1 (en) * | 2020-12-29 | 2022-06-30 | XL8 Inc | Machine translation apparatus and method |
CN112800745A (en) * | 2021-02-01 | 2021-05-14 | 北京明略昭辉科技有限公司 | Method, device and equipment for text generation quality evaluation |
CN113705251A (en) * | 2021-04-01 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Training method of machine translation model, language translation method and equipment |
CN113705251B (en) * | 2021-04-01 | 2024-08-06 | 腾讯科技(深圳)有限公司 | Training method of machine translation model, language translation method and equipment |
CN112966530A (en) * | 2021-04-08 | 2021-06-15 | 中译语通科技股份有限公司 | Self-adaptive method, system, medium and computer equipment in machine translation field |
CN112966530B (en) * | 2021-04-08 | 2022-07-22 | 中译语通科技股份有限公司 | Self-adaptive method, system, medium and computer equipment in machine translation field |
CN113160795B (en) * | 2021-04-28 | 2024-03-05 | 平安科技(深圳)有限公司 | Language feature extraction model training method, device, equipment and storage medium |
CN113160795A (en) * | 2021-04-28 | 2021-07-23 | 平安科技(深圳)有限公司 | Language feature extraction model training method, device, equipment and storage medium |
CN113761944A (en) * | 2021-05-20 | 2021-12-07 | 腾讯科技(深圳)有限公司 | Corpus processing method, apparatus, device and storage medium for translation model |
CN113761944B (en) * | 2021-05-20 | 2024-03-15 | 腾讯科技(深圳)有限公司 | Corpus processing method, device and equipment for translation model and storage medium |
CN113408291A (en) * | 2021-07-09 | 2021-09-17 | 平安国际智慧城市科技股份有限公司 | Training method, device and equipment for Chinese entity recognition model and storage medium |
CN113408291B (en) * | 2021-07-09 | 2023-06-30 | 平安国际智慧城市科技股份有限公司 | Training method, training device, training equipment and training storage medium for Chinese entity recognition model |
CN114462429A (en) * | 2022-01-24 | 2022-05-10 | 语联网(武汉)信息技术有限公司 | Machine-translated text quality evaluation method, post-translation editing method and device |
CN114580437A (en) * | 2022-03-02 | 2022-06-03 | 阿里巴巴(中国)有限公司 | Training method of translation evaluation model, translation evaluation method, device and medium |
CN114896993A (en) * | 2022-05-06 | 2022-08-12 | 北京百度网讯科技有限公司 | Translation model generation method and device, electronic equipment and storage medium |
CN115153563A (en) * | 2022-05-16 | 2022-10-11 | 天津大学 | Mandarin auditory attention decoding method and device based on EEG |
CN116579352A (en) * | 2023-04-25 | 2023-08-11 | 无锡捷通数智科技有限公司 | Translation model training method and device, mobile terminal and storage medium |
CN117972434A (en) * | 2024-03-28 | 2024-05-03 | 腾讯科技(深圳)有限公司 | Training method, training device, training equipment, training medium and training program product for text processing model |
CN117972434B (en) * | 2024-03-28 | 2024-06-11 | 腾讯科技(深圳)有限公司 | Training method, training device, training equipment, training medium and training program product for text processing model |
Also Published As
Publication number | Publication date |
---|---|
CN110263349B (en) | 2024-09-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110263349A (en) | Corpus assessment models training method, device, storage medium and computer equipment | |
CN111444311B (en) | Semantic understanding model training method, device, computer equipment and storage medium | |
CN111368565B (en) | Text translation method, text translation device, storage medium and computer equipment | |
CN108052512B (en) | Image description generation method based on depth attention mechanism | |
CN110263350A (en) | Model training method, device, computer readable storage medium and computer equipment | |
CN110598206A (en) | Text semantic recognition method and device, computer equipment and storage medium | |
CN111241855A (en) | Text translation method, text translation device, storage medium and computer equipment | |
CN111897957B (en) | Capsule neural network integrating multi-scale feature attention and text classification method | |
CN111506702A (en) | Knowledge distillation-based language model training method, text classification method and device | |
CN107832310A (en) | Structuring argument generation method and system based on seq2seq models | |
CN109902750A (en) | Method is described based on two-way single attention mechanism image | |
CN110263348A (en) | Interpretation method, device, computer equipment and storage medium | |
CN108845994A (en) | Utilize the neural machine translation system of external information and the training method of translation system | |
CN112699690B (en) | Translation model training method, translation method, electronic device and storage medium | |
CN112446221A (en) | Translation evaluation method, device and system and computer storage medium | |
CN113836192B (en) | Parallel corpus mining method and device, computer equipment and storage medium | |
CN110532372B (en) | Text object accurate pushing method for excavating deep features based on neural collaborative filtering | |
CN117034961B (en) | BERT-based medium-method inter-translation quality assessment method | |
CN110008482A (en) | Text handling method, device, computer readable storage medium and computer equipment | |
CN113779185B (en) | Natural language model generation method and computer equipment | |
Zhong et al. | Codegen-test: An automatic code generation model integrating program test information | |
CN117521641A (en) | Automatic text proofreading system and method based on natural language processing | |
CN115422369B (en) | Knowledge graph completion method and device based on improved TextRank | |
CN108763230A (en) | Utilize the neural machine translation method of external information | |
CN111259147A (en) | Sentence-level emotion prediction method and system based on adaptive attention mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||