A Mongolian-Chinese neural machine translation method based on a triangular framework
Technical field
The invention belongs to the field of machine translation methods, and in particular relates to a Mongolian-Chinese neural machine translation method based on a triangular framework.
Background Art
Machine translation uses a computer to automatically translate one language into another, and is one of the most powerful means of overcoming language barriers. In recent years, many large search enterprises and service providers such as Google and Baidu have carried out large-scale research on machine translation and made significant contributions. As a result, translation quality between major languages is already close to the level of human translation, and millions of people use online translation systems and mobile translation applications to communicate across language barriers. In the recent wave of deep learning, machine translation has become a top priority and an important component in promoting global communication.
As a data-driven method, the performance of neural machine translation depends heavily on the scale, quality and domain coverage of the parallel corpus. However, apart from resource-rich languages such as Chinese and English, most languages in the world lack large-scale, high-quality, wide-coverage parallel corpora, and Mongolian is a typical example. Therefore, how to make full use of existing data to alleviate the problem of resource scarcity has become an important research direction in neural machine translation.
At present, end-to-end neural machine translation has developed rapidly, its translation quality has improved significantly over traditional machine translation methods, and it has become the core technology of commercial online machine translation systems. However, translation for low-resource languages with scarce parallel corpora still lags considerably behind translation between major languages.
Summary of the invention
To overcome the above shortcomings of the prior art, the purpose of the present invention is to provide a Mongolian-Chinese neural machine translation method based on a triangular framework. Aimed mainly at the problem of limited parallel corpora for rare languages, and in particular the scarcity of Mongolian-Chinese parallel corpora, the method treats Mongolian (z) as an intermediate latent variable and introduces it into the translation between English (x) and Chinese (y), decomposing English-Chinese translation into two steps via Mongolian.
To achieve the above goals, the technical solution adopted by the present invention is as follows:
A Mongolian-Chinese neural machine translation method based on a triangular framework, characterized in that Mongolian is introduced as an intermediate latent variable into the translation between a major language x (such as English, French or Japanese) and Chinese, so that translation between the major language x and Chinese is decomposed into two steps via Mongolian. Under the objective of maximizing the translation likelihood between the major language x and Chinese, a unified bidirectional EM algorithm jointly optimizes the two Mongolian translation models, improving Mongolian-Chinese translation quality; translation between any pair of languages still uses an end-to-end encoder-decoder structure.
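As a minimal illustration of the two-step decomposition (the function and the toy stand-in models below are hypothetical, not the actual trained translation models):

```python
def translate_via_pivot(x, model_xz, model_zy):
    """Translate a source sentence x into the target language via the pivot
    language z: step 1 applies the x -> z model, step 2 the z -> y model."""
    z = model_xz(x)      # potential pivot (Mongolian) translation
    return model_zy(z)   # final target translation

# toy stand-ins so the sketch runs; a real system would plug in NMT models
toy_xz = lambda s: s.upper()
toy_zy = lambda s: s[::-1]
print(translate_via_pivot("abc", toy_xz, toy_zy))  # "CBA"
```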
Mongolian is denoted z and Chinese is denoted y. The unified bidirectional EM algorithm proceeds as follows:
The x → y direction
E-step: optimize θ_{z|x}:

θ_{z|x} = argmin_{θ_{z|x}} KL(p(z|x) || p(z|y))

where θ_{z|x} denotes the parameter values at which the accuracy of translating Mongolian z from the major language x reaches the set threshold; p(z|x) denotes the accuracy of translating Mongolian z from the major language x and is the true distribution; p(z|y) denotes the accuracy of translating Mongolian z from Chinese y and is the fitting distribution of p(z|x); KL(·) is the Kullback-Leibler divergence, and KL(p(z|x) || p(z|y)) denotes the information loss incurred when p(z|y) is used to fit the true distribution p(z|x).
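The KL term minimized in this E-step can be illustrated on small discrete distributions (the probability values below are made-up toy numbers, not real translation distributions):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# toy p(z|x): the "true" distribution over candidate pivot translations z
p_z_given_x = [0.7, 0.2, 0.1]
# toy p(z|y): the fitting distribution; the E-step drives this divergence down
p_z_given_y = [0.6, 0.3, 0.1]
print(kl_divergence(p_z_given_x, p_z_given_y))  # small positive number
print(kl_divergence(p_z_given_x, p_z_given_x))  # 0.0 when the fit is exact
```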
M-step: optimize θ_{y|z}:

θ_{y|z} = argmax_{θ_{y|z}} Σ_{(x,y)∈D} E_{z~p(z|x)}[log p(y|z)]

where θ_{y|z} denotes the parameter values at which the accuracy of translating Chinese y from Mongolian z reaches the set threshold; E_{z~p(z|x)} denotes the mathematical expectation of z when Mongolian z is translated from the major language x; p(y|z) denotes the accuracy of translating Chinese y from Mongolian z; and D denotes the entire training set.
The y → x direction
E-step: optimize θ_{z|y}:

θ_{z|y} = argmin_{θ_{z|y}} KL(p(z|y) || p(z|x))

where θ_{z|y} denotes the parameter values at which the accuracy of translating Mongolian z from Chinese y reaches the set threshold.
M-step: optimize θ_{x|z}:

θ_{x|z} = argmax_{θ_{x|z}} Σ_{(x,y)∈D} E_{z~p(z|y)}[log p(x|z)]

where θ_{x|z} denotes the parameter values at which the accuracy of translating the major language x from Mongolian z reaches the set threshold; E_{z~p(z|y)} denotes the mathematical expectation of z when Mongolian z is translated from Chinese y; and p(x|z) denotes the accuracy of translating the major language x from Mongolian z.
p(z|x) and p(y|z) are jointly trained with the help of p(z|y), and p(z|y) and p(x|z) are jointly trained with the help of p(z|x).
p(z|x), p(z|y), p(y|z) and p(x|z) are each trained on samples that they themselves generate.
The training of the x → y translation is decomposed into two stages with two translation models: the first model, x → z, generates a potential Mongolian translation z from an input sentence in the major language x, and the second model, z → y, generates the final Chinese translation y from that potential translation. Following the steps of the standard EM algorithm and Jensen's inequality, the lower bound of p(y|x) over the entire training set D is:

L(θ; D) = Σ_{(x,y)∈D} log p(y|x) = Σ_{(x,y)∈D} log Σ_z p(z|x) p(y|z) ≥ Σ_{(x,y)∈D} Σ_z Q(z) log [p(z|x) p(y|z) / Q(z)] = L(Q)

where L(Q) is the lower bound of L(θ; D); L(θ; D) is the likelihood function; θ collects the parameters of p(z|x) and p(y|z) at which the translation accuracy reaches the set threshold; p(y|x) denotes the accuracy of translating Chinese y from the major language x; and Q(z) is an arbitrary posterior distribution over z, with Q(z) = p(z|x).
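Written out in full, the Jensen step behind this bound (with Q(z) any distribution over z) is:

```latex
\log p(y\mid x)
  = \log \sum_{z} p(z\mid x)\, p(y\mid z)
  = \log \sum_{z} Q(z)\,\frac{p(z\mid x)\, p(y\mid z)}{Q(z)}
  \ge \sum_{z} Q(z) \log \frac{p(z\mid x)\, p(y\mid z)}{Q(z)}
```

The inequality is Jensen's inequality applied to the concave logarithm; it is tight when Q(z) is proportional to p(z|x) p(y|z), and summing over (x, y) ∈ D yields L(Q).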
The generated translations are weighted and scored with the IBM model, and translation probabilities are calculated from the given bilingual data, where the bilingual data refers to the low-resource pairs (x; z) or (y; z).
Pseudo samples generated by the model p(z|x) or p(z|y) and real bilingual samples are mixed in the same minibatch at a ratio of 1:1 to stabilize the training process.
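A minimal sketch of this 1:1 mixing (the function name and data layout are illustrative assumptions):

```python
import random

def build_minibatch(real_pairs, pseudo_pairs, batch_size, seed=0):
    """Fill half the minibatch with real bilingual pairs and half with
    model-generated pseudo pairs, then shuffle them together."""
    rng = random.Random(seed)
    half = batch_size // 2
    batch = rng.sample(real_pairs, half) + rng.sample(pseudo_pairs, half)
    rng.shuffle(batch)
    return batch

real = [("x%d" % i, "z%d" % i) for i in range(10)]      # real (x; z) pairs
pseudo = [("x%d" % i, "z'%d" % i) for i in range(10)]   # pseudo pairs from p(z|x)
batch = build_minibatch(real, pseudo, 8)
print(len(batch))  # 8: four real pairs and four pseudo pairs
```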
The entire training process of the present invention is as follows:
Input: resource-rich bilingual data (x; y); low-resource bilingual data (x; z) and (y; z)
Output: parameters θ_{z|x}, θ_{y|z}, θ_{z|y} and θ_{x|z}
1: pre-train p(z|x), p(z|y), p(x|z), p(y|z)
2: while not converged do
3:   take a parallel corpus (x, y) ∈ D between the major language x and Chinese y, a parallel corpus (x*, z*) ∈ D between the major language x and Mongolian z, and a parallel corpus (y*, z*) ∈ D between Chinese y and Mongolian z
4:   x → y direction: optimize θ_{z|x}, θ_{y|z}
5:   generate z′ from p(z′|x) and build the training batches B1 = (x, z′) ∪ (x*, z*) and B2 = (y, z′) ∪ (y*, z*), where B1 is the (x; z) parallel corpus after the trained pseudo parallel corpus is added to the (x; z) samples, B2 is the (y; z) parallel corpus after the trained pseudo parallel corpus is added to the (y; z) samples, and z′ denotes the newly generated Mongolian corpus
6:   E-step: update θ_{z|x} with B1
7:   M-step: update θ_{y|z} with B2
8:   y → x direction: optimize θ_{z|y}, θ_{x|z}
9:   generate z′ from p(z′|y) and build the training batches B3 = (y, z′) ∪ (y*, z*) and B4 = (x, z′) ∪ (x*, z*)
10:  E-step: update θ_{z|y} with B3
11:  M-step: update θ_{x|z} with B4
12: end while
13: return θ_{z|x}, θ_{y|z}, θ_{z|y}, θ_{x|z}
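The control flow of the algorithm above can be sketched as a skeleton (the `update` callback is a hypothetical stand-in for one gradient update of the named model; batch construction is elided):

```python
def em_training_loop(rounds, update):
    """Skeleton of the unified bidirectional EM loop: each round runs an
    E-step and an M-step in the x -> y direction, then in y -> x."""
    for _ in range(rounds):
        # x -> y direction
        update("theta_z|x")  # E-step on B1 = (x, z') U (x*, z*)
        update("theta_y|z")  # M-step on B2 = (y, z') U (y*, z*)
        # y -> x direction
        update("theta_z|y")  # E-step on B3 = (y, z') U (y*, z*)
        update("theta_x|z")  # M-step on B4 = (x, z') U (x*, z*)

calls = []
em_training_loop(1, calls.append)
print(calls)  # the four parameter groups, updated in order
```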
Compared with existing end-to-end neural machine translation methods, the present invention fully considers the problem of limited parallel corpora for rare languages, in particular the scarcity of Mongolian-Chinese parallel corpora, and improves Mongolian-Chinese translation quality under the premise of corpus scarcity. Secondly, a unified bidirectional EM algorithm jointly optimizes the two Mongolian translation models. Finally, pseudo samples generated by the model x → z or z → y and real bilingual samples are mixed in the same minibatch at a ratio of 1:1 to stabilize the training process.
Brief Description of the Drawings
Fig. 1 is the triangular learning architecture diagram for low-resource NMT.
Fig. 2 is the end-to-end encoder-decoder structure.
Specific Embodiments
The present invention will be described in detail below with reference to the accompanying drawings and embodiments.
Problem description: the Mongolian-Chinese neural machine translation method based on the triangular framework jointly optimizes the Mongolian translation models with a unified bidirectional EM algorithm.
Mongolian is denoted z, Chinese is denoted y, and English is denoted x. The unified bidirectional generalized EM process is as follows:
The training of the x → y translation is decomposed into two stages with two translation models: the first model, x → z, generates a potential translation z from an input sentence in x, and the second model, z → y, generates the final translation in language y from that potential translation; both processes use an end-to-end encoder-decoder structure. In addition, following the steps of the standard EM algorithm and Jensen's inequality, the lower bound of p(y|x) over the entire training set D is:

L(θ; D) = Σ_{(x,y)∈D} log p(y|x) = Σ_{(x,y)∈D} log Σ_z p(z|x) p(y|z) ≥ Σ_{(x,y)∈D} Σ_z Q(z) log [p(z|x) p(y|z) / Q(z)] = L(Q)

where L(Q) is the lower bound of L(θ; D); L(θ; D) is the likelihood function; θ collects the parameters of p(z|x) and p(y|z) at which the translation accuracy reaches the set threshold; p(z|x) denotes the accuracy of translating language z from language x; p(y|z) denotes the accuracy of translating language y from language z; p(y|x) denotes the accuracy of translating language y from language x; D denotes the entire training set; and Q(z) is an arbitrary posterior distribution over z, with Q(z) = p(z|x).
The x → y direction
E-step: optimize θ_{z|x}. To minimize the gap between L(Q) and L(θ; D), the following formula is used:

θ_{z|x} = argmin_{θ_{z|x}} KL(p(z|x) || p(z|y))

where θ_{z|x} denotes the parameter values at which the accuracy of translating Mongolian z from the major language x reaches the set threshold; p(z|x) denotes the accuracy of translating Mongolian z from the major language x and is the true distribution; p(z|y) denotes the accuracy of translating Mongolian z from Chinese y and is the fitting distribution of p(z|x); KL(·) is the Kullback-Leibler divergence, and KL(p(z|x) || p(z|y)) denotes the information loss incurred when p(z|y) is used to fit the true distribution p(z|x).
M-step: optimize θ_{y|z}:

θ_{y|z} = argmax_{θ_{y|z}} Σ_{(x,y)∈D} E_{z~p(z|x)}[log p(y|z)]

where θ_{y|z} denotes the parameter values at which the accuracy of translating language y from language z reaches the set threshold, and E_{z~p(z|x)} denotes the mathematical expectation of z when z is translated from x.
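The expectation in this M-step can be approximated by sampling from p(z|x); a toy Monte Carlo sketch (the two-point distribution and the scores below are invented for illustration, not real translation models):

```python
import random

def mc_expectation(sample_z, log_p_y_given_z, n=10000, seed=0):
    """Monte Carlo estimate of E_{z ~ p(z|x)}[ log p(y|z) ]."""
    rng = random.Random(seed)
    return sum(log_p_y_given_z(sample_z(rng)) for _ in range(n)) / n

# toy p(z|x): two equally likely candidate pivot translations
sample_z = lambda rng: rng.choice([0, 1])
# toy log p(y|z): pivot 0 explains the target sentence better than pivot 1
log_p = lambda z: -1.0 if z == 0 else -2.0
estimate = mc_expectation(sample_z, log_p)
print(estimate)  # close to the true expectation -1.5
```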
The y → x direction
E-step: optimize θ_{z|y}:

θ_{z|y} = argmin_{θ_{z|y}} KL(p(z|y) || p(z|x))

where θ_{z|y} denotes the parameter values at which the accuracy of translating language z from language y reaches the set threshold, and p(z|y) denotes the accuracy of translating language z from language y.
M-step: optimize θ_{x|z}:

θ_{x|z} = argmax_{θ_{x|z}} Σ_{(x,y)∈D} E_{z~p(z|y)}[log p(x|z)]

where θ_{x|z} denotes the parameter values at which the accuracy of translating language x from language z reaches the set threshold; E_{z~p(z|y)} denotes the mathematical expectation of z when z is translated from y; p(x|z) denotes the accuracy of translating language x from language z; and p(x|y) denotes the accuracy of translating language x from language y.
On the basis of the above derivation, the overall system structure is analyzed, as shown in Fig. 1: the dotted arrows indicate the p(y|x) direction, in which p(z|x) and p(y|z) are jointly trained with the help of p(z|y); the solid arrows indicate the p(x|y) direction, in which p(z|y) and p(x|z) are jointly trained with the help of p(z|x). Similar to reinforcement learning, the models p(z|x), p(z|y), p(y|z) and p(x|z) are trained on samples that they themselves generate.
In the above bidirectional training process, each E-step is trained by gradient descent. In the x → y direction, the gradient of the E-step objective with respect to θ_{z|x} follows from the definition of the KL divergence:

∇_{θ_{z|x}} KL(p(z|x) || p(z|y)) = E_{z~p(z|x)}[(log p(z|x) − log p(z|y)) ∇_{θ_{z|x}} log p(z|x)]
The training process algorithm is as follows:
Input: resource-rich bilingual data (x; y); low-resource bilingual data (x; z) and (y; z)
Output: parameters θ_{z|x}, θ_{y|z}, θ_{z|y} and θ_{x|z}
1: pre-train p(z|x), p(z|y), p(x|z), p(y|z)
2: while not converged do
3:   take a parallel corpus (x, y) ∈ D between language x and language y, a parallel corpus (x*, z*) ∈ D between language x and language z, and a parallel corpus (y*, z*) ∈ D between language y and language z
4:   x → y direction: optimize θ_{z|x}, θ_{y|z}
5:   generate z′ from p(z′|x) and build the training batches B1 = (x, z′) ∪ (x*, z*) and B2 = (y, z′) ∪ (y*, z*), where B1 is the (x; z) parallel corpus after the trained pseudo parallel corpus is added, B2 is the (y; z) parallel corpus after the trained pseudo parallel corpus is added, and z′ denotes the newly generated corpus in language z
6:   E-step: update θ_{z|x} with B1
7:   M-step: update θ_{y|z} with B2
8:   y → x direction: optimize θ_{z|y}, θ_{x|z}
9:   generate z′ from p(z′|y) and build the training batches B3 = (y, z′) ∪ (y*, z*) and B4 = (x, z′) ∪ (x*, z*)
10:  E-step: update θ_{z|y} with B3
11:  M-step: update θ_{x|z} with B4
12: end while
13: return θ_{z|x}, θ_{y|z}, θ_{z|y}, θ_{x|z}
Guaranteeing the stability of the training process: to ensure stability, the pseudo samples generated by the model x → z or z → y and the real bilingual samples are mixed in the same minibatch at a ratio of 1:1.
The following is the process of translating any bilingual pair among English-Mongolian, Mongolian-English, Mongolian-Chinese and Chinese-Mongolian using the end-to-end encoder-decoder structure:
Referring to Fig. 2: first (upper half of Fig. 2), the encoder encodes the source-language sentence to generate a group of context semantic vectors, which then serve as the encoding of the user's intent. During generation (lower half of Fig. 2), the decoder, combined with an attention mechanism, generates each word of the target language; while generating each word, it attends to the context semantic vectors corresponding to the current input, so that the generated content stays consistent with the meaning of the source language.
Specific translation steps are as follows:
1. The encoder reads the input source-language sentence;
2. The encoder encodes the read sentence into hidden-layer states using a recurrent neural network, forming a group of context semantic vectors;
3. The decoder, combined with the attention mechanism, sequentially generates each word of the target language.
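Step 3 above, in which attention weights the context semantic vectors, can be sketched with simple dot-product attention (a simplified stand-in; real systems use learned score functions, and all values here are toy):

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Weight each encoder hidden state by its similarity to the current
    decoder state (softmax over dot products) and return the context vector."""
    scores = encoder_states @ decoder_state   # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax: weights sum to 1
    return weights @ encoder_states           # convex combination of states

# toy context semantic vector group (3 source positions, hidden dimension 2)
enc = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
dec = np.array([1.0, 0.0])                    # current decoder hidden state
ctx = attention_context(dec, enc)
print(ctx.shape)  # (2,)
```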