CN111444730A - Data enhancement Weihan machine translation system training method and device based on Transformer model - Google Patents
- Publication number
- CN111444730A (application CN202010226101.6A)
- Authority
- CN
- China
- Prior art keywords
- phrase
- word
- phrases
- chinese
- uygur
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a method and a device for training a data-enhanced Uyghur-Chinese machine translation system based on the Transformer model. The Transformer model consists of an encoder and a decoder: the left half of the model is the encoder side, composed of 6 identical layers of two sub-layers each; the right half is the decoder side, composed of 6 identical layers of three sub-layers each. The invention greatly alleviates the poor translation performance of neural machine translation models under resource-scarce conditions and improves the generalization ability of the model. Experimental results show that generating pseudo-parallel data from 170,000 Uyghur-Chinese parallel sentence pairs and training the translation model on it yields a measurable improvement in translation quality.
Description
Technical Field
The invention relates to the technical field of translation, and in particular to a method and a device for training a data-enhanced Uyghur-Chinese machine translation system based on the Transformer model.
Background
Machine translation is the process of converting one natural language into another by machine. Since the concept of machine translation was proposed, the field has gone through roughly four stages: rule-based machine translation, example-based machine translation, statistical machine translation, and neural machine translation. Traditional machine translation methods need manually designed translation rules and wide-coverage parallel corpora, and suffer from high cost and long development cycles. After the concept of neural machine translation was proposed it attracted the attention of many researchers, and its translation performance now exceeds that of traditional machine translation methods.
Neural machine translation differs in approach from statistical machine translation: the main idea of statistical machine translation is to build a statistical translation model from counts over a large parallel corpus, whereas neural machine translation first converts text into numbers and then operates on those numbers to build the translation model. Text can be converted into numbers with either discrete or distributed representations. With a one-hot word vector, the vocabulary size is taken as the vector length, one dimension has the value 1 and all other dimensions are 0, but such a vector cannot effectively represent the meaning of a word at the semantic level. Google released the Word2vec word vector training tool in 2013; Word2vec trains word vector models quickly and efficiently on given text data. The model represents words as vectors at the semantic level, so the similarity of two words can be computed conveniently. Word2vec is a milestone in the field of natural language processing and has advanced many natural language processing tasks.
A neural machine translation system consists mainly of an encoder and a decoder: the encoder encodes a source-language sentence of arbitrary length, and the decoder takes the fixed-length vector output by the encoder as input and decodes it into a target-language sentence. The structure is modeled end to end, with all model parameters trained against one objective function. Fig. 1 shows the structure of the encoder-decoder model.
Different neural machine translation systems implement the encoder and decoder with a recurrent neural network (RNN), long short-term memory (LSTM), gated recurrent unit (GRU), the Transformer, and other architectures.
Existing machine translation depends on large-scale, high-quality parallel corpora: millions or even tens of millions of parallel sentence pairs are needed to train a model to a reasonable level. For a low-resource language such as Uyghur, large-scale parallel corpora cannot be obtained, and even where larger corpora exist, the quality of long-sentence translation from statistical machine translation and LSTM-based machine translation is not high.
disclosure of Invention
The invention aims to provide a method and a device for training a data-enhanced Uyghur-Chinese machine translation system based on the Transformer model, so as to solve the problems described in the background section.
To achieve this aim, the invention provides the following technical scheme: a training device for a data-enhanced Uyghur-Chinese machine translation system based on the Transformer model, the model comprising an encoder and a decoder. The left half of the model is the encoder side and consists of 6 identical layers, each layer consisting of two sub-layers. The right half is the decoder side, which consists of 6 identical layers, each layer consisting of three sub-layers.
Preferably, the first sub-layer of the encoder is a self-attention layer and the second sub-layer is a feed-forward neural network; each word first passes through the self-attention layer, where it is encoded and its position information is obtained from a positional encoder; query and key-value pair vectors are created from the input vector, and these three vectors are trained with the scaled dot-product attention algorithm.
Preferably, the training method comprises the following steps:
A. preprocessing the corpus;
B. phrase alignment, extraction and filtering, and extracting noun phrases;
C. pseudo parallel sentence pairs are generated.
Preferably, the preprocessing in step A includes preprocessing Chinese and preprocessing Uyghur: an Uyghur preprocessing tool and word segmentation tool are used to convert the Uyghur text from the Unicode extended (presentation-form) area to the basic area and to segment it into words, while the Chinese corpus is converted from full-width to half-width characters and segmented with the Harbin Institute of Technology (HIT) Chinese word segmentation tool.
Preferably, in step B phrase alignment and phrase-pair extraction are performed with the statistical machine translation tool Moses, yielding about ten million phrase pairs; the extracted phrase pairs are then filtered with the following simple rules:
a. filtering phrase pairs containing punctuation marks;
b. filtering pairs of phrases containing numbers;
c. filtering phrase pairs in which the Chinese phrase contains non-Chinese characters or the Uygur phrase contains non-Uygur characters;
d. filtering phrase pairs with too large or too small length proportion;
e. filtering single words and non-noun phrases, after which 3.24 million phrase pairs remain;
Noun phrase extraction: the Chinese sentence is syntactically parsed with the Harbin Institute of Technology (HIT) syntactic parser and all noun phrases in the sentence are extracted; since no Uyghur syntactic parser is available, the phrase alignment table is used to find the Uyghur noun phrases corresponding to the Chinese noun phrases.
Preferably, step C includes:
a. training word vectors: word vector models are trained on Chinese and Uyghur monolingual corpora using the skip-gram model of word2vec;
b. calculating phrase similarity: phrase vectors are first computed from the word vectors, and the similarity of two phrases is then computed with cosine similarity. The vector of a phrase is obtained by summing the vectors of its words and averaging; the similarity of each phrase with every phrase in the phrase table is then computed, using cosine similarity. The phrase vector and phrase similarity are computed as:

$$p = \frac{1}{n}\sum_{i=1}^{n} w_i, \qquad \mathrm{sim}(p_i, p_j) = \frac{p_i \cdot p_j}{\lVert p_i \rVert\,\lVert p_j \rVert}$$

where p is a phrase vector, w_i is the vector of the i-th word, n is the number of words in the phrase, and p_i and p_j are the two phrase vectors whose similarity is to be computed;
c. sentence generation: the noun phrases in the original sentence pair are replaced with the most similar phrases from the phrase table; the similarity is computed over the Uyghur phrases, and when a Uyghur phrase is replaced, the corresponding phrase in the Chinese sentence is replaced at the same time.
d. screening the pseudo-parallel corpus: language models for Uyghur and Chinese are trained with SRILM on monolingual data of 3.59 million Uyghur sentences and 3.54 million Chinese sentences respectively; the perplexity of every newly generated sentence is computed with the trained language models, and sentences whose perplexity is more than 5 above that of the original sentence are filtered out. Perplexity is an information-theoretic measure used to evaluate how well a language model (a probability model) predicts a sample; the lower the perplexity, the better. Given a text corpus w_1, w_2, ..., w_n containing n words and a language model LM that assigns a probability to each word based on its history, the perplexity on the corpus is

$$\mathrm{PPL} = P_{LM}(w_1, w_2, \ldots, w_n)^{-\frac{1}{n}} = \exp\left(-\frac{1}{n}\sum_{i=1}^{n}\log P_{LM}(w_i \mid w_1, \ldots, w_{i-1})\right)$$
compared with the prior art, the invention has the beneficial effects that: the invention greatly improves the problem of poor translation performance of the neural machine translation model under the condition of resource shortage, and improves the generalization capability of the model. Experimental results show that data are forged by 17 ten thousand pairs of Wei-Han parallel linguistic data and a translation model is trained, and finally the translation quality is improved to a certain extent.
Drawings
FIG. 1 is a schematic diagram of a prior art encoder-decoder model;
FIG. 2 is a diagram of a prior art system architecture;
FIG. 3 is a model block diagram of the present invention;
FIG. 4 is a vector diagram of a target sentence corresponding to a query vector according to the present invention;
FIG. 5 is a diagram of a data query architecture in accordance with the present invention;
FIG. 6 is a schematic illustration of the present invention in position embedding;
FIG. 7 is a flow chart of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to figs. 1-7, the invention provides the following technical scheme: a training device for a data-enhanced Uyghur-Chinese machine translation system based on the Transformer model, the model comprising an encoder and a decoder. The left half of the model is the encoder side and consists of 6 identical layers, each layer consisting of two sub-layers. The right half is the decoder side, which consists of 6 identical layers, each layer consisting of three sub-layers.
In the invention, the first sub-layer of the encoder is a self-attention layer and the second sub-layer is a feed-forward neural network. Each word first passes through the self-attention layer, where it is encoded and its position information is obtained from a positional encoder; query and key-value pair vectors are created from the input vector, and these three vectors are trained with the scaled dot-product attention algorithm. In this algorithm the key k and the value v are assumed to be the same vector and, as shown in fig. 4, the query vector q corresponds to the vector of the target sentence.
The specific operation comprises three steps:
1. Compute the dot product of each query vector q with the keys.
2. Normalize the dot-product results with softmax.
3. Multiply the normalized weights by the values v to obtain the attention vector.
The calculation is expressed mathematically as

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
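As a minimal illustration of these three steps, the numpy sketch below implements scaled dot-product attention; the array shapes and toy inputs are assumptions made for the example, not values fixed by the invention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v) -> (n_q, d_v)."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # step 1: dot product of queries and keys, scaled
    weights = softmax(scores, axis=-1)   # step 2: normalize with softmax
    return weights @ v                   # step 3: weighted sum of the values

# toy usage, with key and value taken as the same vectors, as assumed above
rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))              # 3 query positions, width 8
k = v = rng.normal(size=(4, 8))          # 4 source positions
context = scaled_dot_product_attention(q, k, v)   # shape (3, 8)
```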
attention of multiple heads: the multi-head attention repeats the zooming dot product attention process h times in order to acquire more semantic information in sentences as much as possible, and a plurality of query values q1 are obtained; the final results of parallel computations performed on n { q1, q 2.., qn } are combined into a matrix, and the architecture is shown in fig. 5.
The invention discloses a method for training a data-enhanced Uyghur-Chinese machine translation system based on the Transformer model, characterized in that the training method comprises the following steps:
A. preprocessing the corpus;
B. phrase alignment, extraction and filtering, and extracting noun phrases;
C. pseudo parallel sentence pairs are generated.
The preprocessing in step A includes preprocessing Chinese and preprocessing Uyghur: an Uyghur preprocessing tool and word segmentation tool are used to convert the Uyghur text from the Unicode extended (presentation-form) area to the basic area and to segment it into words, while the Chinese corpus is converted from full-width to half-width characters and segmented with the Harbin Institute of Technology (HIT) Chinese word segmentation tool.
In step B, phrase alignment and phrase-pair extraction are performed with the statistical machine translation tool Moses, yielding about ten million phrase pairs; the extracted phrase pairs are then filtered with the following simple rules (a code sketch of these rules is given after step B below):
a. filtering phrase pairs containing punctuation marks;
b. filtering pairs of phrases containing numbers;
c. filtering phrase pairs in which the Chinese phrase contains non-Chinese characters or the Uygur phrase contains non-Uygur characters;
d. filtering phrase pairs with too large or too small length proportion;
e. filtering single words and non-noun phrases, after which 3.24 million phrase pairs remain;
Noun phrase extraction: the Chinese sentence is syntactically parsed with the Harbin Institute of Technology (HIT) syntactic parser and all noun phrases in the sentence are extracted; since no Uyghur syntactic parser is available, the phrase alignment table is used to find the Uyghur noun phrases corresponding to the Chinese noun phrases.
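The following Python sketch illustrates filtering rules a-d; the character-class regular expressions and length-ratio bounds are illustrative assumptions rather than values fixed by the invention, and the noun-phrase part of rule e is omitted because it requires the syntactic parser described above.

```python
import re

PUNCT = re.compile(r"[,.!?;:，。！？；：、\"'()（）]")    # assumed punctuation set
DIGIT = re.compile(r"[0-9０-９]")                         # ASCII and full-width digits
HAN   = re.compile(r"^[\u4e00-\u9fff]+$")                 # Chinese characters only
UG    = re.compile(r"^[\u0600-\u06ff]+$")                 # Arabic-script (Uyghur) letters only

def keep_phrase_pair(zh, ug, ratio_low=0.5, ratio_high=2.0):
    """Return True if the (Chinese, Uyghur) phrase pair passes rules a-e."""
    zh_toks, ug_toks = zh.split(), ug.split()
    if PUNCT.search(zh) or PUNCT.search(ug):              # rule a: punctuation marks
        return False
    if DIGIT.search(zh) or DIGIT.search(ug):              # rule b: numbers
        return False
    if not HAN.match(zh.replace(" ", "")) or not UG.match(ug.replace(" ", "")):
        return False                                      # rule c: non-Chinese / non-Uyghur characters
    ratio = len(zh_toks) / max(len(ug_toks), 1)
    if ratio < ratio_low or ratio > ratio_high:           # rule d: length ratio too large or too small
        return False
    if len(zh_toks) < 2 or len(ug_toks) < 2:              # rule e (partial): drop single words
        return False
    return True
```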
Step C comprises the following steps (a code sketch of steps a, b and d is given after this list):
a. training word vectors: word vector models are trained on Chinese and Uyghur monolingual corpora using the skip-gram model of word2vec;
b. calculating phrase similarity: phrase vectors are first computed from the word vectors, and the similarity of two phrases is then computed with cosine similarity. The vector of a phrase is obtained by summing the vectors of its words and averaging; the similarity of each phrase with every phrase in the phrase table is then computed, using cosine similarity. The phrase vector and phrase similarity are computed as:

$$p = \frac{1}{n}\sum_{i=1}^{n} w_i, \qquad \mathrm{sim}(p_i, p_j) = \frac{p_i \cdot p_j}{\lVert p_i \rVert\,\lVert p_j \rVert}$$

where p is a phrase vector, w_i is the vector of the i-th word, n is the number of words in the phrase, and p_i and p_j are the two phrase vectors whose similarity is to be computed;
c. sentence generation: the noun phrases in the original sentence pair are replaced with the most similar phrases from the phrase table; the similarity is computed over the Uyghur phrases, and when a Uyghur phrase is replaced, the corresponding phrase in the Chinese sentence is replaced at the same time.
d. screening the pseudo-parallel corpus: language models for Uyghur and Chinese are trained with SRILM on monolingual data of 3.59 million Uyghur sentences and 3.54 million Chinese sentences respectively; the perplexity of every newly generated sentence is computed with the trained language models, and sentences whose perplexity is more than 5 above that of the original sentence are filtered out. Perplexity is an information-theoretic measure used to evaluate how well a language model (a probability model) predicts a sample; the lower the perplexity, the better. Given a text corpus w_1, w_2, ..., w_n containing n words and a language model LM that assigns a probability to each word based on its history, the perplexity on the corpus is

$$\mathrm{PPL} = P_{LM}(w_1, w_2, \ldots, w_n)^{-\frac{1}{n}} = \exp\left(-\frac{1}{n}\sum_{i=1}^{n}\log P_{LM}(w_i \mid w_1, \ldots, w_{i-1})\right)$$
a good language model will assign a higher probability to the samples in the corpus and will also have a lower confusion value.
The invention uses the Transformer model to train the Uyghur-Chinese machine translation model. The Transformer follows the encoder-decoder structure and, by using attention layers and fully connected layers, effectively alleviates the long-range dependence problem of neural networks, achieving better results. Fig. 3 shows the structure of the Transformer model: the left half is the encoder side, composed of multi-head attention layers and fully connected layers; the right half is the decoder side, likewise composed of multi-head attention layers and fully connected layers, except that, unlike the encoder side, its multi-head attention consists of a self-attention layer and an encoder-decoder attention layer. The concrete structure is as follows:
1. Grouped (multi-head) attention network
The attention network can be viewed as mapping a query Q onto key (K)-value (V) pairs and producing a weighted output. Unlike the conventional attention mechanism, which uses a single attention network to generate one context vector, the grouped attention network concatenates several attention networks: given (Q, K, V), Q, K and V are first mapped into different spaces with different linear mappings, context vectors are then computed in those spaces with different attention networks, and the context vectors are concatenated into the final output. The calculation is
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\,W^{O}$$

where $\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$; $W_i^{Q}$, $W_i^{K}$ and $W_i^{V}$ are the linear projections of group i, and $W^{O}$ is the linear mapping parameter that produces the final context after concatenation.
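A compact numpy sketch of this grouped (multi-head) computation is given below; the per-head projection matrices are random placeholders standing in for learned parameters, and the dimensions are illustrative assumptions.

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def _attention(q, k, v):
    # scaled dot-product attention, as in the earlier sketch
    return _softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    """W_q, W_k, W_v: lists of per-head projection matrices; W_o: output projection."""
    heads = [_attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i]) for i in range(len(W_q))]
    return np.concatenate(heads, axis=-1) @ W_o     # Concat(head_1, ..., head_h) W^O

# toy usage: d_model = 8, h = 2 heads, 5 positions, self-attention (Q = K = V = X)
rng = np.random.default_rng(1)
d_model, h = 8, 2
d_head = d_model // h
W_q = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_k = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_v = [rng.normal(size=(d_model, d_head)) for _ in range(h)]
W_o = rng.normal(size=(h * d_head, d_model))
X = rng.normal(size=(5, d_model))
out = multi_head_attention(X, X, X, W_q, W_k, W_v, W_o)   # shape (5, d_model)
```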
Position coding: an encoder and decoder built from attention networks cannot by themselves take position information into account, yet position is important for language understanding and generation. To solve this problem, position embedding is applied to the attention-based encoder and decoder. As shown in fig. 6, the position embedding vector is added element-wise to the word embedding vector, so the position embedding has the same length as the word embedding.
Unlike learned position embeddings, which require additional parameters, the invention uses a fixed position embedding defined with trigonometric functions that needs no learning. It is defined as:

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$

where pos is the index of the word, 2i and 2i+1 are dimensions of the position code, and d_model is the length of the position code. PE(pos, 2i) gives the value of the 2i-th dimension of the position code at position pos; similarly, PE(pos, 2i+1) gives its (2i+1)-th dimension. The advantage of this design is that the position code of a position (pos + k), separated from pos by k words, is a linear transformation of the position code of pos determined only by k; this follows from the angle-addition identities sin(α+β) = sinα·cosβ + cosα·sinβ and cos(α+β) = cosα·cosβ - sinα·sinβ.
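The sinusoidal definition above can be computed in a few lines; this sketch assumes an even d_model and uses illustrative sizes.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of fixed sinusoidal position codes (d_model even)."""
    pe = np.zeros((max_len, d_model))
    pos = np.arange(max_len)[:, None]                 # word position "pos"
    i = np.arange(0, d_model, 2)[None, :]             # even dimension index 2i
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)                       # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                       # PE(pos, 2i+1)
    return pe

# added element-wise to the word embeddings of a sentence of length n:
# x = word_embeddings + positional_encoding(n, d_model)
pe = positional_encoding(50, 8)                       # toy sizes
```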
self-attention network performance: in both the encoder and decoder, a self-attention network is used to model Uygur sentences and Chinese sentences. The temporal complexity of the self-attention network is lower than for constructing encoders and decoders using a recurrent neural network and a convolutional neural network to model sentences. Assuming that the sentence length is n, the length of the hidden state vector is d, and the convolution kernel size of the convolutional neural network is k, the respective computational complexity is shown in table 1.
Per-layer computational complexity refers to the total computation when a single layer of such a network is used. When the sentence length n is smaller than the hidden-vector length d, the total computation of the self-attention network is smaller; in most cases the sentence length is indeed much smaller than the hidden-vector length. The number of sequential operations is the number of operations that must be executed one after another to produce the hidden state of every word in the sentence (for example, a recurrent neural network can only generate the hidden state of the next word after the hidden state of the previous word has been produced, so hidden-state generation cannot be parallelized); the larger this number, the weaker the parallelization capability. The maximum path length is the maximum number of operations needed to guarantee that every word in the sentence can influence every other word. In a recurrent network, for the first word of a sentence to influence the last one, the information must be propagated through all (n - 1) subsequent hidden states, so the maximum path length is O(n); each layer of a convolutional network only lets the k words inside a convolution kernel influence each other, so log_k(n) convolutional layers must be stacked; the self-attention network can directly connect any two words in a sentence through the attention mechanism.
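Table 1 is not reproduced in the text above; the per-layer comparison below follows the standard values from the Transformer literature and matches the description in the preceding paragraph (n: sentence length, d: hidden-vector length, k: convolution kernel size):

| Layer type | Per-layer complexity | Sequential operations | Maximum path length |
|---|---|---|---|
| Self-attention | O(n²·d) | O(1) | O(1) |
| Recurrent | O(n·d²) | O(n) | O(n) |
| Convolutional | O(k·n·d²) | O(1) | O(log_k n) |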
The invention uses 170,000 Uyghur-Chinese parallel sentence pairs as the initial training corpus for its experiments and verifies the effect of the invention. To avoid the generated data being too similar, the 30,000 and 80,000 best-quality sentence pairs are screened out of the 2 million generated sentences and used as pseudo-parallel sentence pairs in separate tests; the test results are shown in Table 2:
from Table 2, it can be seen that in the case of using only the original corpus, the model using the Transformer model is improved by 9.35B L EU values compared with the model using RNN, and is improved by 4.52B L EU. compared with the model using statistical machine translation Moses, the model of the original corpus running under the Transformer model is used as a pre-training model, and then the training corpus consisting of the original corpus and newly generated 3 ten thousand sentence pairs is used to continue training in the Transformer model, and the final result is improved by 0.7B L EU values compared with the result of using only the original corpus and the Transformer model, and the training corpus consisting of the original corpus and the newly generated 13 ten thousand sentence pairs is improved by 1.05B L EU values on the Transformer model.
In conclusion, the invention greatly alleviates the poor translation performance of neural machine translation models under resource-scarce conditions and improves the generalization ability of the model. Experimental results show that generating pseudo-parallel data from 170,000 Uyghur-Chinese parallel sentence pairs and training the translation model on it yields a measurable improvement in translation quality.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Claims (6)
1. A training device for a data-enhanced Uyghur-Chinese machine translation system based on the Transformer model, the Transformer model being composed of an encoder and a decoder, characterized in that: the left half of the model is the encoder side and consists of 6 identical layers, each layer consisting of two sub-layers; the right half is the decoder side, which consists of 6 identical layers, each layer consisting of three sub-layers.
2. The device for training a data-enhanced Uyghur-Chinese machine translation system based on the Transformer model according to claim 1, wherein: the first sub-layer is a self-attention layer and the second sub-layer is a feed-forward neural network; each word first passes through the self-attention layer, where it is encoded and its position information is obtained from a positional encoder; query vectors and key-value pair vectors are created from the input vector, and these three vectors are trained with the scaled dot-product attention algorithm.
3. A method for training a data-enhanced Uyghur-Chinese machine translation system based on the Transformer model, characterized in that the training method comprises the following steps:
A. preprocessing the corpus;
B. phrase alignment, extraction and filtering, and extracting noun phrases;
C. pseudo parallel sentence pairs are generated.
4. The method for training a data-enhanced Uyghur-Chinese machine translation system based on the Transformer model according to claim 3, wherein: the preprocessing in step A includes preprocessing Chinese and preprocessing Uyghur: an Uyghur preprocessing tool and word segmentation tool are used to convert the Uyghur text from the Unicode extended (presentation-form) area to the basic area and to segment it into words, while the Chinese corpus is converted from full-width to half-width characters and segmented with the Harbin Institute of Technology (HIT) Chinese word segmentation tool.
5. The method for training a data-enhanced Uyghur-Chinese machine translation system based on the Transformer model according to claim 3, wherein: in step B, phrase alignment and phrase-pair extraction are performed with the statistical machine translation tool Moses, yielding about ten million phrase pairs; the extracted phrase pairs are then filtered with the following simple rules:
a. filtering phrase pairs containing punctuation marks;
b. filtering pairs of phrases containing numbers;
c. filtering phrase pairs in which the Chinese phrase contains non-Chinese characters or the Uygur phrase contains non-Uygur characters;
d. filtering phrase pairs with too large or too small length proportion;
e. filtering single words and non-noun phrases, after which 3.24 million phrase pairs remain;
Noun phrase extraction: the Chinese sentence is syntactically parsed with the Harbin Institute of Technology (HIT) syntactic parser and all noun phrases in the sentence are extracted; since no Uyghur syntactic parser is available, the phrase alignment table is used to find the Uyghur noun phrases corresponding to the Chinese noun phrases.
6. The method for training a data-enhanced Uyghur-Chinese machine translation system based on the Transformer model according to claim 3, wherein: step C comprises the following steps:
a. training word vectors: word vector models are trained on Chinese and Uyghur monolingual corpora using the skip-gram model of word2vec;
b. calculating phrase similarity: phrase vectors are first computed from the word vectors, and the similarity of two phrases is then computed with cosine similarity. The vector of a phrase is obtained by summing the vectors of its words and averaging; the similarity of each phrase with every phrase in the phrase table is then computed, using cosine similarity. The phrase vector and phrase similarity are computed as:

$$p = \frac{1}{n}\sum_{i=1}^{n} w_i, \qquad \mathrm{sim}(p_i, p_j) = \frac{p_i \cdot p_j}{\lVert p_i \rVert\,\lVert p_j \rVert}$$

where p is a phrase vector, w_i is the vector of the i-th word, n is the number of words in the phrase, and p_i and p_j are the two phrase vectors whose similarity is to be computed;
c. sentence generation: the noun phrases in the original sentence pair are replaced with the most similar phrases from the phrase table; the similarity is computed over the Uyghur phrases, and when a Uyghur phrase is replaced, the corresponding phrase in the Chinese sentence is replaced at the same time.
d. screening the pseudo-parallel corpus: language models for Uyghur and Chinese are trained with SRILM on monolingual data of 3.59 million Uyghur sentences and 3.54 million Chinese sentences respectively; the perplexity of every newly generated sentence is computed with the trained language models, and sentences whose perplexity is more than 5 above that of the original sentence are filtered out. Perplexity is an information-theoretic measure used to evaluate how well a language model (a probability model) predicts a sample; the lower the perplexity, the better. Given a text corpus w_1, w_2, ..., w_n containing n words and a language model LM that assigns a probability to each word based on its history, the perplexity on the corpus is

$$\mathrm{PPL} = P_{LM}(w_1, w_2, \ldots, w_n)^{-\frac{1}{n}} = \exp\left(-\frac{1}{n}\sum_{i=1}^{n}\log P_{LM}(w_i \mid w_1, \ldots, w_{i-1})\right)$$
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010226101.6A CN111444730A (en) | 2020-03-27 | 2020-03-27 | Data enhancement Weihan machine translation system training method and device based on Transformer model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010226101.6A CN111444730A (en) | 2020-03-27 | 2020-03-27 | Data enhancement Weihan machine translation system training method and device based on Transformer model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111444730A true CN111444730A (en) | 2020-07-24 |
Family
ID=71652486
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010226101.6A Pending CN111444730A (en) | 2020-03-27 | 2020-03-27 | Data enhancement Weihan machine translation system training method and device based on Transformer model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111444730A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112507734A (en) * | 2020-11-19 | 2021-03-16 | 南京大学 | Roman Uygur language-based neural machine translation system |
CN112633018A (en) * | 2020-12-28 | 2021-04-09 | 内蒙古工业大学 | Mongolian Chinese neural machine translation method based on data enhancement |
CN113742467A (en) * | 2021-09-02 | 2021-12-03 | 新疆大学 | Dialog state generation method and device for hierarchically selecting slot-position-related context |
CN116562311A (en) * | 2023-07-07 | 2023-08-08 | 中铁四局集团有限公司 | Operation and maintenance method and system based on natural language machine translation |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102654867A (en) * | 2011-03-02 | 2012-09-05 | 北京百度网讯科技有限公司 | Webpage sorting method and system in cross-language search |
CN104050160A (en) * | 2014-03-12 | 2014-09-17 | 北京紫冬锐意语音科技有限公司 | Machine and human translation combined spoken language translation method and device |
CN104391842A (en) * | 2014-12-18 | 2015-03-04 | 苏州大学 | Translation model establishing method and system |
CN105808530A (en) * | 2016-03-23 | 2016-07-27 | 苏州大学 | Translation method and device in statistical machine translation |
CN106484682A (en) * | 2015-08-25 | 2017-03-08 | 阿里巴巴集团控股有限公司 | Based on the machine translation method of statistics, device and electronic equipment |
-
2020
- 2020-03-27 CN CN202010226101.6A patent/CN111444730A/en active Pending
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102654867A (en) * | 2011-03-02 | 2012-09-05 | 北京百度网讯科技有限公司 | Webpage sorting method and system in cross-language search |
CN104050160A (en) * | 2014-03-12 | 2014-09-17 | 北京紫冬锐意语音科技有限公司 | Machine and human translation combined spoken language translation method and device |
CN104391842A (en) * | 2014-12-18 | 2015-03-04 | 苏州大学 | Translation model establishing method and system |
CN106484682A (en) * | 2015-08-25 | 2017-03-08 | 阿里巴巴集团控股有限公司 | Based on the machine translation method of statistics, device and electronic equipment |
CN105808530A (en) * | 2016-03-23 | 2016-07-27 | 苏州大学 | Translation method and device in statistical machine translation |
Non-Patent Citations (3)
Title |
---|
张金超 (Zhang Jinchao) et al.: "Large-scale Uyghur-Chinese neural network machine translation model based on multiple encoders and decoders", Journal of Chinese Information Processing (《中文信息学报》) * |
杨洋 (Yang Yang): "Research and implementation of Uyghur-Chinese translation based on neural networks", China Master's Theses Electronic Journal Network (《中国优秀硕士论文电子期刊网》) * |
祁青山 (Qi Qingshan) et al.: "Uyghur noun anaphora resolution based on ATT-IndRNN-CNN", Journal of Chinese Information Processing (《中文信息学报》) * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112507734A (en) * | 2020-11-19 | 2021-03-16 | 南京大学 | Roman Uygur language-based neural machine translation system |
CN112507734B (en) * | 2020-11-19 | 2024-03-19 | 南京大学 | Neural machine translation system based on romanized Uygur language |
CN112633018A (en) * | 2020-12-28 | 2021-04-09 | 内蒙古工业大学 | Mongolian Chinese neural machine translation method based on data enhancement |
CN113742467A (en) * | 2021-09-02 | 2021-12-03 | 新疆大学 | Dialog state generation method and device for hierarchically selecting slot-position-related context |
CN113742467B (en) * | 2021-09-02 | 2023-08-08 | 新疆大学 | Method and device for generating dialogue state of hierarchical selection slot phase context |
CN116562311A (en) * | 2023-07-07 | 2023-08-08 | 中铁四局集团有限公司 | Operation and maintenance method and system based on natural language machine translation |
CN116562311B (en) * | 2023-07-07 | 2023-12-01 | 中铁四局集团有限公司 | Operation and maintenance method and system based on natural language machine translation |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111382582B (en) | Neural machine translation decoding acceleration method based on non-autoregressive | |
Zhang et al. | Understanding subtitles by character-level sequence-to-sequence learning | |
CN111444730A (en) | Data enhancement Weihan machine translation system training method and device based on Transformer model | |
CN109062907B (en) | Neural machine translation method integrating dependency relationship | |
CN110059324B (en) | Neural network machine translation method and device based on dependency information supervision | |
CN110598221A (en) | Method for improving translation quality of Mongolian Chinese by constructing Mongolian Chinese parallel corpus by using generated confrontation network | |
CN109635124A (en) | A kind of remote supervisory Relation extraction method of combination background knowledge | |
CN111414476A (en) | Attribute-level emotion analysis method based on multi-task learning | |
CN111382574B (en) | Semantic parsing system combining syntax under virtual reality and augmented reality scenes | |
CN113569562B (en) | Method and system for reducing cross-modal and cross-language barriers of end-to-end voice translation | |
CN111401079A (en) | Training method and device of neural network machine translation model and storage medium | |
Kesavan et al. | Deep learning based automatic image caption generation | |
Leng et al. | Using recurrent neural network structure with enhanced multi-head self-attention for sentiment analysis | |
CN113657123A (en) | Mongolian aspect level emotion analysis method based on target template guidance and relation head coding | |
Hong et al. | Interpretable sequence classification via prototype trajectory | |
CN113887251A (en) | Mongolian Chinese machine translation method combining Meta-KD framework and fine-grained compression | |
CN113947072A (en) | Text error correction method and text error correction device | |
CN118262874A (en) | Knowledge-graph-based traditional Chinese medicine diagnosis and treatment model data expansion system and method | |
CN116681090A (en) | BestTransformer Haematococcus conversion method and system | |
CN116821326A (en) | Text abstract generation method and device based on self-attention and relative position coding | |
Hujon et al. | Neural machine translation systems for English to Khasi: A case study of an Austroasiatic language | |
CN115169285A (en) | Event extraction method and system based on graph analysis | |
CN114580376A (en) | Chinese abstract generating method based on component sentence method analysis | |
CN116048613A (en) | Attention mechanism-based code abstracting method for graph sequence association | |
CN116681087B (en) | Automatic problem generation method based on multi-stage time sequence and semantic information enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20200724 |
|
RJ01 | Rejection of invention patent application after publication |