
Integrating Reconstructor and Post-Editor into Neural Machine Translation

Published: 16 June 2023

Abstract

Neural machine translation (NMT) mainly comprises an encoder and a decoder. The encoder extracts a feature vector from the source-language sentence, and the decoder predicts the next token according to this feature vector and the information available at the current step. In this process, there is no guarantee that the features extracted by the encoder faithfully capture the meaning of the source-language sentence, nor that the decoder accurately predicts the corresponding token. These issues can lead to over-translation and under-translation in the output. Previous researchers alleviated this problem by measuring the gap between the source-language sentence and a reconstruction of it. Inspired by this method, we propose to integrate a reconstructor and a post-editor into NMT during training. The reconstructor takes the translation produced by NMT as input and reconstructs the source sentence, while the post-editor takes the translation as input and post-edits it to predict the target sentence. Training the reconstructor and the post-editor forces the semantics of the translation to follow both the source sentence and the target sentence. Experimental results show that our approach effectively improves the performance of NMT on multiple translation tasks.

1 Introduction

Neural Machine Translation (NMT) [2, 11, 15, 25, 27] has achieved great progress on various translation tasks. The current mainstream NMT models are based on the encoder-decoder structure, which maps the source sentence into a distributed representation through the encoder and then uses the decoder to predict the target sentence based on this representation. During training, NMT uses the teacher forcing strategy [30] to force the model to generate the target sentence. During inference, decoding strategies such as greedy search and beam search are used to generate the translation. Therefore, the ability of the encoder to extract linguistic information and the ability of the decoder to infer the token at the current step determine the performance of the translation model. Under the teacher forcing strategy, these abilities can only be judged indirectly through the difference between the generated sentence and the target-language sentence; the difference between the feature vector produced by the encoder and the source-language sentence cannot be measured directly. At the same time, greedy search and beam search make the generated target sentence differ from the features predicted by the decoder.
The translation produced by NMT may deviate from the semantics of the source sentence. The main reason is that the translation model cannot ensure that the information of the source sentence is fully conveyed to the target sentence. During translation, some information in the source sentence may be omitted or distorted, which causes errors such as over-translation and under-translation. Many researchers improve encoder or decoder performance by introducing different evaluations into NMT. The works in [7, 23, 28, 33] use different evaluation methods to measure the difference between the generated sentence and the target-language sentence so that the decoder's predictions come closer to the target-language sentence. An encoder-decoder-reconstructor framework was proposed by Tu et al. [26], in which the target feature vector predicted by the decoder is reconstructed into the source sentence through a reconstructor to guide the optimization of the translation model. Inspired by the above methods, we consider two kinds of mapping of the translation result to reduce the omission and distortion of source information during translation. If we retranslate the translation back to the source side, the difference between the retranslated sentence and the source sentence reflects the quality of the translation to a certain extent. Similarly, if we post-edit the translation on the target side, the difference between the post-edited result and the target sentence also reflects the translation quality. The omission and distortion of source information will be effectively reduced if the translation can be successfully mapped back to the source sentence or forward to the target sentence. Therefore, we can use the quality of these mappings to evaluate the translation and guide the training of NMT.
Based on this analysis, we propose to integrate a reconstructor and a post-editor into NMT during training. We use the encoder-decoder structure for both the reconstructor and the post-editor, and both take the translation produced by NMT as input. First, we run the NMT model and obtain the translation of the source sentence. Then the translation is retranslated into the source language by the reconstructor and post-edited into the target language by the post-editor. Finally, we evaluate the difference between the reconstructed sentence and the source sentence, as well as the difference between the post-edited sentence and the target sentence, and use these differences to evaluate the translation and guide the training of NMT. The overall process is shown in Figure 1. Our contributions are as follows:
Fig. 1.
Fig. 1. Examples of neural machine translation with a reconstructor and a post-editor. Unlike the reconstruction translation model, we use a full translation model as the reconstructor. In addition to reconstructing the source sentence, we also obtain a target sentence through the post-editor.
In addition to employing a reconstructed translation, we use post-editing, and we optimize the NMT translation results by evaluating both outputs.
Considering that different decoding strategies may affect the content predicted by NMT, we evaluate the translation results obtained by greedy search in order to influence the performance of NMT; because greedy search is non-differentiable, a reinforcement-learning-style method is required to achieve this goal.
Our method only requires these evaluations at training time; the inference stage is identical to the original NMT inference process, so it does not increase inference time.

2 Background

Since our method is built upon the Transformer [27], we briefly introduce the Transformer in this section. The Transformer uses an encoder-decoder architecture; we describe the encoder and decoder below. We denote the source sentence as \(X=\lbrace x_{1},x_{2}, \ldots , x_{n}\rbrace\) and the target sentence as \(Y=\lbrace y_1,y_2,\ldots ,y_m\rbrace\), where \(n\) and \(m\) are the lengths of \(X\) and \(Y\), respectively. The Transformer network structure is shown in Figure 2.
Fig. 2.
Fig. 2. The structure of the Transformer [27].

2.1 Encoder

The encoder of the Transformer is composed of \(N\) layers with the same structure. Before entering the first layer, word embeddings are generated and then modified by an additive positional encoding. The positional encoding is necessary because the network does not capture the order of the sequence through recurrence or convolution. The encoded word embeddings are then used as input to the encoder layers. Each layer is composed of two sub-layers: (1) the multi-head attention layer and (2) the position-wise feed-forward layer. The multi-head attention layer extracts the information in the sequence through self-attention. The position-wise feed-forward layer then performs feature combination and a nonlinear mapping of the feature vector containing the extracted information. The output of layer \(k-1\) is fed to layer \(k\) as input:
\begin{equation} \begin{aligned} s_k=EncoderLayer(s_{k-1}) \end{aligned} \end{equation}
(1)
where \(s_k\) is the output of the \(k\)-th layer. We denote the output of the \(N\)-th layer as \(s\), which is the extracted feature of the source sentence. We summarize the entire process of the encoder as follows:
\begin{equation} \begin{aligned} s=Encoder(X). \end{aligned} \end{equation}
(2)
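As a concrete illustration of Equations (1) and (2), the following PyTorch sketch stacks \(N\) identical encoder layers over position-encoded word embeddings. This is a minimal sketch under our own naming (a learned positional embedding stands in for the additive positional encoding); it is not the paper's Fairseq implementation.

```python
import torch
import torch.nn as nn


class SketchEncoder(nn.Module):
    """Minimal sketch of s = Encoder(X) from Equations (1)-(2)."""

    def __init__(self, vocab_size, d_model=512, n_heads=8, d_ff=2048,
                 n_layers=6, max_len=1024, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Learned positional embedding in place of the sinusoidal encoding.
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=d_ff, dropout=dropout, batch_first=True)
        # N identical layers: s_k = EncoderLayer(s_{k-1}), Equation (1).
        self.layers = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):
        # x: (batch, src_len) token ids -> s: (batch, src_len, d_model)
        positions = torch.arange(x.size(1), device=x.device)
        h = self.embed(x) + self.pos(positions)
        return self.layers(h)


# Usage sketch: s = SketchEncoder(vocab_size=32000)(torch.randint(0, 32000, (2, 7)))
```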

2.2 Decoder

The decoder architecture is similar to the encoder's: it is also composed of \(N\) layers with the same structure. In addition to the two sub-layers of the encoder, the decoder inserts a multi-head cross-attention sub-layer between the multi-head self-attention layer and the position-wise feed-forward layer, which performs multi-head attention over the encoder output \(s\). It is also worth noting that the self-attention layer in the decoder must be masked with a lower triangular matrix so that the prediction for position \(i\) can depend only on positions less than \(i\) during training. The process of the decoder is:
\begin{equation} \begin{aligned} c_k=DecoderLayer(c_{k-1}, s), \end{aligned} \end{equation}
(3)
where \(c_k\) is the output of the \(k\)-th layer and \(s\) is the feature of the source sentence extracted by the encoder. Finally, a softmax layer outputs the probability distribution over target words, and the model is trained with the cross-entropy loss, which minimizes the negative log-likelihood:
\begin{equation} \begin{aligned} \mathcal {L}_{NMT}=-\sum ^m_{i=1}{\log {p(y_i|y_{\lt i},X)}}, \end{aligned} \end{equation}
(4)
where \(y_i\) is the \(i\)-th word in the target sentence \(Y\), and \(y_{\lt i}\) denotes the preceding words in the target sentence. In this paper, we denote the decoding process of the NMT model as:
\begin{equation} \begin{aligned} Y^{\prime }=NMT(X) \end{aligned} \end{equation}
(5)
where \(Y^{\prime }\) is the translation result of NMT.
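The masked decoder of Equation (3) and the loss of Equation (4) can be sketched as follows. This is a hedged PyTorch illustration with hypothetical module names (the positional encoding is omitted for brevity), not the actual Fairseq code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SketchDecoder(nn.Module):
    """Minimal sketch of the masked decoder of Equation (3) plus output projection."""

    def __init__(self, vocab_size, d_model=512, n_heads=8, d_ff=2048, n_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)   # positional encoding omitted
        layer = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=d_ff, batch_first=True)
        self.layers = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, y_in, s):
        # y_in: (batch, tgt_len) shifted target ids; s: encoder output from Equation (2).
        # Lower-triangular mask: position i may only attend to positions < i.
        causal = torch.triu(
            torch.ones(y_in.size(1), y_in.size(1), dtype=torch.bool, device=y_in.device),
            diagonal=1)
        c = self.layers(self.embed(y_in), s, tgt_mask=causal)
        return self.out(c)                               # logits over target words


def nmt_loss(logits, y_out, pad_id=0):
    """Equation (4): negative log-likelihood of the reference target tokens."""
    return F.cross_entropy(logits.transpose(1, 2), y_out, ignore_index=pad_id)
```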

3 Methodology

In this section, we describe our approach, which integrates a reconstructor and a post-editor into NMT during training. First, we use the translation model to predict the probability distribution on the target side based on the source sentence. Then we obtain the translation result \(Y^{\prime }=NMT(X)\) through greedy search. In the following, we describe the reconstructor and the post-editor, both of which take the translation \(Y^{\prime }\) as input for reconstruction and post-editing.

3.1 Reconstructor

The structure of the reconstructor is illustrated in Figure 3. Unlike Tu et al. [26], who only use a decoder to reconstruct the source sentence, we use a full translation model. We first obtain the translation result \(Y^{\prime }=NMT(X)\) by decoding the feature vectors output by the decoder with a greedy strategy. Greedy search is much simpler than beam search, and it widens the gap between the search result and the feature vector, which allows a better evaluation of NMT. The reconstructor then translates the decoded result back to obtain the source sentence, \(X^{\prime }=NMT^{\prime }(Y^{\prime })\); the structure of the reconstructor is the same as that of NMT. We use the cross-entropy loss to evaluate the gap between \(X\) and \(X^{\prime }\), and this gap indirectly expresses the performance of NMT. The same cross-entropy also trains the reconstructor by minimizing the negative log-likelihood:
\begin{equation} \begin{aligned} \mathcal {L}_r=-\sum _{i=1}^{n}{\log {p_r(x_i|x_{\lt i},NMT(X))}}, \end{aligned} \end{equation}
(6)
where \(p_r\) denotes the probability distribution predicted by the reconstructor model.
Fig. 3.
Fig. 3. Network structure of the reconstructor. We use the source sentences to predict the probability distribution over the target with a translation model. Then we use greedy search to transform the probability distribution into target sentences. These target sentences are fed into another translation model to reconstruct the source sentences.
The translation obtained through greedy search is usually very different from the ground truth. These differences come from the omission and distortion of source information and from the search process itself, since the result of the greedy search is used when reconstructing the source sentence. Because greedy search is non-differentiable, parameter updates during training cannot propagate between the two translation models. We address this problem by sharing some parameters between the two structures, namely the embedding layer parameters in the decoder and the parameters of the final linear layer. The parameters shared by the two translation structures improve the robustness of the translation model.
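A hedged sketch of the greedy search \(Y^{\prime }=NMT(X)\), the reconstruction loss of Equation (6), and the handling of the non-differentiable search step is given below. The `encode`/`decode` methods and the seq2seq call signature are our own assumptions rather than the paper's actual interface; only the logic (greedily decode, then score the source under the reconstructor) mirrors the text.

```python
import torch
import torch.nn.functional as F


def greedy_decode(model, src, bos_id, eos_id, max_len=256):
    """Greedy search Y' = NMT(X); `model.encode`/`model.decode` are assumed helpers."""
    s = model.encode(src)
    y = torch.full((src.size(0), 1), bos_id, dtype=torch.long, device=src.device)
    for _ in range(max_len):
        logits = model.decode(y, s)                       # (batch, len, vocab)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        y = torch.cat([y, next_tok], dim=1)
        if (next_tok == eos_id).all():
            break
    return y


def reconstruction_loss(nmt, reconstructor, src, bos_id=1, eos_id=2, pad_id=0):
    """Equation (6): greedily translate X, then score X under the reconstructor."""
    with torch.no_grad():            # greedy search yields discrete, non-differentiable Y'
        y_prime = greedy_decode(nmt, src, bos_id, eos_id)
    # reconstructor(src_tokens, tgt_prefix) -> logits is an assumed seq2seq signature;
    # in the paper, its decoder embedding and final linear layer are shared with NMT.
    logits = reconstructor(y_prime, src[:, :-1])
    return F.cross_entropy(logits.transpose(1, 2), src[:, 1:], ignore_index=pad_id)
```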

3.2 Post-Editor

We denote the post-editor model as \(PE\), whose objective is to generate the target sentence based on the source sentence and the translation \(Y^{\prime }=NMT(X)\). We denote the output of the \(PE\) model as \(Y^{\prime \prime }\):
\begin{equation} \begin{aligned} Y^{\prime \prime }=PE(X,Y^{\prime }). \end{aligned} \end{equation}
(7)
The post-editor is similar to the teacher-student models used for joint training [14, 17, 20], where the process of generating \(Y^{\prime }\) is the teacher and the process of generating \(Y^{\prime \prime }\) is the student. Unlike the traditional teacher-student model, our goal is not to improve the student's performance or speed through the teacher. Instead, our objective is to evaluate the teacher's output through the student.
The network structure is shown in Figure 4 and is similar to the reconstructor. The difference is that the post-editor regenerates the target sentence rather than the source sentence, and it also takes the source sentence as input. We use the encoder of the NMT model to encode the source sentence and the encoder of the PE model to encode the translation \(Y^{\prime }\). The outputs of the two encoders are concatenated, and the concatenated features are fed to the decoder:
\begin{equation} c^{\prime }=Decoder\left([s_{Y^{\prime }},s_X],Y \right), \end{equation}
(8)
\begin{equation} p(Y|Y^{\prime },X) \propto \exp (Linear(c^{\prime })), \end{equation}
(9)
where \(s_{Y^{\prime }}\) and \(s_X\) are the outputs of the two encoders. We denote the output probability of the post-editor as \(p_e\) and use the cross-entropy loss to train the post-editor:
\begin{equation} \begin{aligned} \mathcal {L}_{e}=-\sum _{i=1}^{m}{\log {p_e(y_i|y_{\lt i},Y^{\prime },X)}}. \end{aligned} \end{equation}
(10)
Fig. 4.
Fig. 4. Network structure of the post-editor. The overall process is similar to that used to reconstruct the source sentence. When we regenerate the target sentence, the feature input to the decoder is not just the target sentence output by greedy search, but rather the concatenation of the target and source information.
To make the training of generating \(Y^{\prime \prime }\) effectively affect the stage of generating \(Y^{\prime }\), we concatenate the encoded states of \(X\) with the encoded states of the translation \(Y^{\prime }\), so that the introduced \(Y^{\prime }\) interferes with the information provided by \(X\). Through the post-editor's subsequent evaluation against \(Y\), we can quantify the degree to which \(Y^{\prime }\) interferes with the information provided by \(X\). The smaller this interference, the higher the quality of \(Y^{\prime }\) and the higher the reward of the post-editor.
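The concatenation of Equations (8)-(10) can be sketched as follows, reusing the loss pattern above. The separate `nmt_encoder`/`pe_encoder`/`pe_decoder` modules are hypothetical stand-ins for the paper's components; the sketch only shows how the two encoder state sequences are joined before decoding the reference \(Y\).

```python
import torch
import torch.nn.functional as F


def post_editor_loss(nmt_encoder, pe_encoder, pe_decoder, src, y_prime, tgt, pad_id=0):
    """Equations (8)-(10): decode Y from the concatenated states [s_{Y'}, s_X]."""
    s_x = nmt_encoder(src)                    # s_X:  (batch, src_len, d_model)
    s_y = pe_encoder(y_prime)                 # s_Y': (batch, hyp_len, d_model)
    joint = torch.cat([s_y, s_x], dim=1)      # [s_{Y'}, s_X] in Equation (8)
    logits = pe_decoder(tgt[:, :-1], joint)   # teacher-forced on the reference Y
    # Equation (10): cross-entropy of the post-edited prediction against Y.
    return F.cross_entropy(logits.transpose(1, 2), tgt[:, 1:], ignore_index=pad_id)
```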
We combine the losses of the NMT, the reconstructor, and the post-editor as the final training objective:
\begin{equation} \begin{aligned}\mathcal {L}=\mathcal {L}_{NMT} + \mathcal {L}_{r} + \mathcal {L}_{e}. \end{aligned} \end{equation}
(11)
We use a reinforcement-learning-style scheme in the training phase. The model with the reconstructor and post-editor obtains the corresponding sentences through greedy search, so gradients cannot be back-propagated through this step, and the reconstructor and post-editor would otherwise not affect the translation model. Therefore, we treat the reconstructor's loss \(\mathcal {L}_{r}\) and the post-editor's loss \(\mathcal {L}_{e}\) as rewards, similar to reinforcement learning, that influence the translation structure. For the translation structure, the loss of the reconstructor or post-editor acts as the reward, and for the reconstructor or post-editor, the loss of the translation structure acts as the reward. In the combined loss \(\mathcal {L}\), the reconstructor loss \(\mathcal {L}_{r}\) and the post-editor loss \(\mathcal {L}_{e}\) act as a bias on the translation loss, and the same applies to the reconstructor and the post-editor. At the same time, because the word embedding layer of the translation structure is shared with the reconstructor and the post-editor, the parameter updates of the word embedding layer are affected by both the rewards and the losses.
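Putting the pieces together, one training step under Equation (11) might look like the sketch below, reusing the helpers sketched above; `nmt.encoder`, the `nmt(src, tgt_in)` call, and the token ids are assumptions, not the paper's interface. Because \(Y^{\prime }\) comes from a non-differentiable greedy search, the gradients of \(\mathcal {L}_{r}\) and \(\mathcal {L}_{e}\) cannot flow back through \(Y^{\prime }\) itself; they influence the translation model through the shared parameters (and, for the post-editor, through the shared NMT encoder), which is what gives them the reward-like role described above.

```python
def training_step(batch, nmt, reconstructor, pe_encoder, pe_decoder, optimizer):
    """One sketched optimisation step of L = L_NMT + L_r + L_e (Equation (11))."""
    src, tgt = batch["src"], batch["tgt"]

    # Standard teacher-forced NMT loss, Equation (4).
    loss_nmt = nmt_loss(nmt(src, tgt[:, :-1]), tgt[:, 1:])

    # Reward-like auxiliary losses computed on the detached greedy translation Y'.
    y_prime = greedy_decode(nmt, src, bos_id=1, eos_id=2)
    loss_r = reconstruction_loss(nmt, reconstructor, src)                   # Eq. (6)
    loss_e = post_editor_loss(nmt.encoder, pe_encoder, pe_decoder,
                              src, y_prime, tgt)                            # Eq. (10)

    loss = loss_nmt + loss_r + loss_e                                       # Eq. (11)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```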

4 Experiments

4.1 Dataset

To illustrate the effectiveness of the proposed method, we tested it on multiple translation tasks. The WMT dataset [3] was used for the Estonian-English, Russian-English, and English-German translation tasks. The Estonian-English training set is the data provided by WMT18,1 with 800,000 sentences; the validation set is Newsdev18 and the test set is Newstest18, each with 2,000 sentences. The Russian-English training data is from WMT14,2 with 1.5 million training sentences; the validation and test sets are Newstest13 and Newstest14, each with 3,000 sentences. The English-German training data is also from WMT14, with 4.5 million training sentences; the validation and test sets are Newstest13 and Newstest14, with 3,000 and 2,737 sentences, respectively. The IWSLT dataset [4] was used for the German-English, Chinese-English, and German-Italian translation tasks.3 These training sets are the data provided by IWSLT17. The German-English training set has 200,000 sentences; the validation and test sets are TED.dev2010 (888 sentences) and TED.tst2010 (1,080 sentences). The Chinese-English training set has 230,000 sentences; the validation and test sets are TED.dev2010 (879 sentences) and TED.tst2015 (1,000 sentences). The German-Italian training set has 180,000 sentences; the validation and test sets are TED.dev2010 (923 sentences) and TED.tst2010 (1,567 sentences). The data were tokenized with NLTK [18], and the number of byte pair encoding [22] merge operations was 32,000. Table 1 summarizes these datasets.
Table 1.
Task | train | valid | test
et-en | WMT18, 0.8M | Newsdev18, 2000 | Newstest18, 2000
ru-en | WMT14, 1.5M | Newstest13, 3000 | Newstest14, 3000
en-de1 | WMT14, 4.5M | Newstest13, 3000 | Newstest14, 2737
de-en2 | IWSLT17, 0.2M | TED.dev2010, 888 | TED.tst2010, 1080
zh-en | IWSLT17, 0.23M | TED.dev2010, 879 | TED.tst2015, 1000
de-it | IWSLT17, 0.18M | TED.dev2010, 923 | TED.tst2010, 1567
en-tr | WMT17, 0.19M | Newsdev2017, 1001 | Newstest17, 3000
Table 1. The Data Used in the Et-En Translation Experiment was Provided by Europarl v8
In addition, because English-German data from two different sources is used in the experiments, this article denotes the WMT14 data as en\(\rightarrow\)de1, while de\(\rightarrow\)en2 and en\(\rightarrow\)de2 denote data from IWSLT17.
Table 2.
system | Transformer | GS | GT | MG
rate | x | 1.94x | 1.94x | 2.16x
Table 2. Proportion of Time Spent on Training Different Models

4.2 Hyperparameters and Systems

The Base system is trained with the open-source toolkit Fairseq,4 and the other models are also implemented as modifications of Fairseq. The model hyperparameters are set as follows:
The number of heads in multi-head attention is 8.
The number of layers in the encoder and the decoder is 6.
The dimension of the word vectors and of the model hidden state is 512.
The hidden dimension of the feed-forward network is 2,048.
The optimizer is Adam [16], with adam\(\_\)beta1 = 0.9 and adam\(\_\)beta2 = 0.998.
Dropout [24] is 0.1.
The number of warmup steps [13] is 4,000.
The learning rate is 0.0007.
The label\(\_\)smoothing is 0.1.
In the decoding stage, beam search [9] is used with 24 search candidates, and the length ratio of the translated target sentence to the source sentence is 1.6. We use the Transformer as the base system for comparison with other methods. The following systems are used in the experiments, grouped by model structure:
transformer: The Transformer with base parameters [27].
RL: The Transformer trained under the reinforcement learning framework with BLEU as the reward [34], specifically the REINFORCE algorithm [29].
BOW: The implementation of [19] on top of the Transformer.
base: Our training result for the Transformer.
Source: Our implementation of the reconstruction translation model of Tu et al. [26] on the Transformer.
Target: The model in which the reconstruction target is changed to the target sentence.
no cat: The model that feeds only the output of the PE encoder into the PE decoder.
GS: The reconstructor model in our method.
GT: The post-editor in our method.
MG: The combination of the reconstructor and the post-editor.
The structures of Source and Target are shown in Figure 5. We reproduced the method of Tu et al. [26] on the Transformer. We believe that using the reconstructor to obtain a target-side sentence has its own effect compared with using it to obtain a source-side sentence. The only difference is that reconstructing the source sentence spans the gap between the features produced by the decoder and those produced by the encoder, whereas reconstructing the target-side sentence narrows the gap between the reconstructor and the decoder, thereby reducing error propagation between them and indirectly improving the robustness of the decoder. Therefore, we added an experiment that reconstructs the target-side sentence on top of the original integration. All results are evaluated by computing BLEU with multi-bleu.perl5 after averaging 10 models.
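Since the reported scores come from averaging 10 checkpoints before BLEU scoring with multi-bleu.perl, the following sketch shows one way to average saved model parameters. It assumes each file stores a plain PyTorch state_dict; Fairseq ships its own scripts/average_checkpoints.py for its checkpoint format.

```python
import torch


def average_checkpoints(paths):
    """Average the parameters of several checkpoints (here, the last 10 epochs)."""
    avg = None
    for path in paths:
        state = torch.load(path, map_location="cpu")   # assumed: a plain state_dict
        if avg is None:
            avg = {k: v.float().clone() for k, v in state.items()}
        else:
            for k, v in state.items():
                avg[k] += v.float()
    return {k: v / len(paths) for k, v in avg.items()}


# Usage sketch (hypothetical file names):
# model.load_state_dict(average_checkpoints([f"epoch{i}.pt" for i in range(10, 20)]))
```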
Fig. 5.
Fig. 5. The reconstruction translation model was originally built on an RNN-based translation model, whose network structure and translation accuracy differ considerably from the Transformer. We therefore modified the Transformer according to the structure of the original reconstruction translation model. We also added a target-oriented reconstruction experiment on top of the original source-only reconstruction model.

5 Results

5.1 Comparison of Results from Different Methods

Experimental results on low-resource translation tasks show that the Source and Target models clearly improve over the base model on the en\(\rightarrow\)de2 and de\(\rightarrow\)en2 translation tasks. However, on the zh\(\rightarrow\)en and en\(\rightarrow\)zh translation tasks, the performance of these two models does not improve over the baseline but decreases. Compared with the baseline model and the Source and Target models, our proposed MG model achieves a consistent improvement on these translation tasks, and the improvement is especially noticeable on zh\(\rightarrow\)en and en\(\rightarrow\)zh. Compared with the Source model, our method improves en\(\rightarrow\)de2 and de\(\rightarrow\)en2 less than it improves zh\(\rightarrow\)en and en\(\rightarrow\)zh. The results are shown in Table 3, and the results of the other low-resource translation tasks are shown in Table 4.
Table 3.
system | zh\(\rightarrow\)en | en\(\rightarrow\)zh | en\(\rightarrow\)de2 | de\(\rightarrow\)en2
Base | 21.21 | 19.62 | 27.75 | 32.27
RL | 21.22 | 19.44 | 28.43 | 32.05
Source | 21.14 | 19.47 | 28.51 | 33.31
Target | 21.01 | 19.27 | 28.46 | 32.57
no cat | 21.48 | 19.64 | 28.21 | 32.68
GS | 22.04 | 19.81 | 28.53 | 33.38
GT | 22.12 | 20.06 | 28.15 | 33.21
MG | 22.34 | 20.10 | 28.96 | 33.67
Table 3. The BLEU Scores of Four Low-resource Translation Tasks
The bolded result indicates that it is better than the transformer.
Table 4.
system | de\(\rightarrow\)it | it\(\rightarrow\)de | en\(\rightarrow\)tr | tr\(\rightarrow\)en
Base | 19.61 | 20.04 | 12.6 | 15.7
GS | 20.35 | 21.1 | 13.36 | 16.52
GT | 20.35 | 20.88 | 13.51 | 16.91
MG | 21.21 | 20.99 | 13.62 | 17.31
Table 4. The BLEU Scores of Other Low-resource Translation Tasks
This may be because the difference between Chinese and English is relatively large, while the gap between English and German is not as large. In the same language space, the distance between the features of German and English sentences is much smaller than that between Chinese and English sentences, so the transformation between Chinese and English sentence features is more complex than that between German and English sentence features. The purpose of the Source model is to make the decoder's output features close to the source sentence's features, which can be viewed as a transformation between two sets of features. Narrowing the gap between the semantic features of English and German is much easier than narrowing it between Chinese and English. That is why the Source model yields a smaller improvement on the zh\(\rightarrow\)en and en\(\rightarrow\)zh translation tasks than on the en\(\rightarrow\)de2 and de\(\rightarrow\)en2 translation tasks. The same is true for the Target model.
The GS and GT experiments can be seen as ablations of MG. From the experimental results, GS and GT perform similarly to MG, but MG outperforms both. Since MG is a fusion of the GS and GT methods, this suggests that the two methods are complementary to some extent. On the en\(\rightarrow\)de2 and de\(\rightarrow\)en2 translation tasks, the results of GS and GT are not much different from those of Source and Target, and the structures share certain similarities, which shows that GS and Source play the same role. However, because our method uses NMT as the reconstructor, shares the parameters of the word embedding layer, and introduces a reinforcement-learning-style method, it adapts to different languages better than Source, as GS and GT also improve on the zh\(\rightarrow\)en and en\(\rightarrow\)zh translation tasks. To demonstrate the validity of concatenating the encoded states of \(X\) with the encoded states of the translation \(Y^{\prime }\), we contrast this approach with one that uses only the encoded states of the translation \(Y^{\prime }\). The results in Table 3 show that concatenating the two states is better than using only the encoded states of the translation \(Y^{\prime }\). This result also validates our ideas in Section 3.2.
To further verify the effect of our method on different languages, we conducted experiments on Estonian-English and German-Italian. The experimental results for these tasks in both translation directions are shown in Table **************; the datasets are released by IWSLT and WMT. The experimental results show that GS and GT provide higher-accuracy translations, as does MG. However, for the en\(\rightarrow\)et and it\(\rightarrow\)de tasks, GS yields a larger improvement than GT, while in the other experiments the results are the opposite. For MG, which combines GS and GT, the it\(\rightarrow\)de accuracy is lower than that of GS, while the results on the other translation tasks are more accurate than those of the previous two methods. In particular, the improvement from Estonian to English and from German to Italian is significantly better than with the previous two methods. The experimental results are generally consistent with those on Chinese-English and English-German.
The inference time of our method is the same as that of the Transformer model. In terms of training time, however, the proposed model structure is similar to a combination of two translation models, so the training time should be roughly twice that of the original model, with some additional time for the greedy search. The training time of the Transformer model is 3,848 seconds. Table 2 shows the training-time ratios of the other models relative to the Transformer model.

5.2 Methods on Abundant Resources

Previously we conducted experiments on low-resource translation tasks. In this subsection, we conduct experiments on resource-rich language pairs. To better demonstrate our method, we add comparisons with the BOW and reinforcement learning methods in the abundant-resource setting. The experimental results are shown in Table 5. The results of the RL and BOW methods are taken from the original papers; since those papers do not report results for the en\(\rightarrow\)ru and ru\(\rightarrow\)en translation tasks, we omit those entries and only provide their results on the en\(\rightarrow\)de1 translation task.
Table 5.
system | ru\(\rightarrow\)en | en\(\rightarrow\)ru | en\(\rightarrow\)de1
transformer | - | - | 27.30
RL | - | - | 27.49
BOW | - | - | 27.35
Base | 27.62 | 29.63 | 27.5
GS | 27.44 | 29.21 | 27.72
GT | 27.97 | 29.94 | 28.02
MG | 29.43 | 29.84 | 27.35
Table 5. Results of Our Methods with Abundant Resources
On the en\(\rightarrow\)ru, ru\(\rightarrow\)en, and en\(\rightarrow\)de1 translation tasks, the GT and MG results are more accurate than the baseline model, and the improvement on Russian-English and English-German is quite obvious. When more data is available to train the baseline model so that it can extract more semantic information, GT gives more accurate results than GS. This may be because the task of the post-editor in GT is the same as the task of the translation model, so optimizing the shared parameters during training helps generate better target sentences, whereas the shared parameters of GS are more inclined to make the encoder output more accurate features. Likewise, compared with the RL and BOW methods, our method improves the results to a certain extent, which shows that, in addition to the RL component our method introduces, the other components also affect the performance of the model. At the same time, our method is competitive with these other methods.

5.3 The Relationship between Model and Regularization

From Table 3 and Table 5, the results of our method on low-resource translation tasks differ somewhat from those on resource-rich tasks. On low-resource tasks, MG, which integrates GS and GT, produces more accurate translations, while on resource-rich tasks GT may perform better than MG. This may occur because the dropout value among the model hyperparameters is small and the dimensionality of the model's hidden state is large. With small training sets, the extracted features often contain invalid or even harmful information, so the model cannot adapt well to the training data. For this reason, we conducted experiments in which dropout was increased from 0.1 to 0.3 on en\(\rightarrow\)de2, de\(\rightarrow\)en2, zh\(\rightarrow\)en, and en\(\rightarrow\)zh. The experimental results are shown in Table 6. MG produces more accurate translations than the other two methods on the Chinese-English translation tasks. This shows that GS and GT affect shared parameters such as the model's word embedding layer. When the dropout value becomes higher, the model discards more information, so the parts of GS and GT that most affect the model may be discarded, which makes the parts of the model optimized by the different methods more uniform. As the dropout increases, the model fits the data better, and the improvement these methods bring decreases. The goal of these methods is to help the model represent the data accurately by evaluating the results of the initial translation model.
Table 6.
system | zh\(\rightarrow\)en | en\(\rightarrow\)zh | en\(\rightarrow\)de2 | de\(\rightarrow\)en2
Base | 22.57 | 20.58 | 29.00 | 34.19
GS | 23.02 | 20.91 | 29.29 | 34.91
GT | 22.51 | 20.57 | 29.95 | 34.76
MG | 23.29 | 20.84 | 29.43 | 34.7
Table 6. Results with Our Method when Dropout = 0.3

5.4 Reconstruction Results

To better understand the impact of the model structure on model performance, we compare the initial translation results and the post-editor's output on the IWSLT English-German data. The results are shown in Table 7. From the experimental results, one can see that both the initial translation results and the post-edited results are more accurate than those of the baseline model. We use NMT as the teacher structure and the post-editor as the student structure. Consistent with the teacher-student approach, the student model yields lower results than the teacher model. The experiments also show that the results obtained by our teacher model are much better than the results of using NMT alone for translation. This aligns with our goal of optimizing the initial translation structure by evaluating the post-editing results, and a better teacher model also improves the performance of the student model. It also shows that, due to the limitations of greedy search, the intermediate results can be considered significantly perturbed, and we can evaluate the quality of the translation results through the post-editor's loss. In this way, the expected teacher-student improvement is achieved.
Table 7.
system | de\(\rightarrow\)en2 | en\(\rightarrow\)de2
Base | 32.27 | 27.75
Translation result | 33.11 | 28.12
Post-Editor results | 32.94 | 27.81
Table 7. Post-translation Editing and Translation Results

5.5 Losses in Some Translation Tasks

In this paper, we introduce an evaluation module into the translation model and add the evaluation results to the loss, aiming to achieve better translation results with the help of the evaluation. We visualize the loss curves on the validation set during training on the de\(\rightarrow\)en2 and ru\(\rightarrow\)en translation tasks, as shown in Figures 6 and 7. These figures show that our translation model reaches a lower loss, which corresponds to a higher BLEU score. For the de\(\rightarrow\)en2 translation task, one can see from the loss curves that a gap between the methods begins to appear at the 8th epoch. At the 17th epoch, the losses of these methods reach their lowest point, and the best model we selected for the de\(\rightarrow\)en2 translation task averages the model parameters from epochs 10 to 19. For the ru\(\rightarrow\)en translation task, the losses of these methods differ considerably in the initial stage, and the loss of GS is greater than that of the other methods. Finally, we also calculated the BLEU scores of the different methods on the test set; the results show that our method is better than the other methods.
Fig. 6.
Fig. 6. The loss on the validation dataset for the German-to-English translation task.
Fig. 7.
Fig. 7. The loss on the validation dataset for the Russian-to-English translation task.

6 Related Work

We evaluate the quality of translated sentences by introducing a reconstructor and a post-editor and use this evaluation to guide the optimization of the model. Many researchers have also studied reconstructors. For example, an encoder-decoder-reconstructor framework was proposed by Tu et al. [26], in which the target feature vector predicted by the decoder is reconstructed into the source sentence through a reconstructor to guide the optimization of the translation model. Aiming at semi-supervised machine translation [5], a method to train an NMT model was introduced by Zhang et al. [36], where a parallel corpus and a monolingual corpus were mixed and the monolingual corpus was reconstructed with an autoencoder. For iterative back-translation [21, 36] and back-translation [21], a dual reconstruction objective that provides a unified view of iterative back-translation and dual learning was introduced by Xu et al. [32].
Many researchers introduce different evaluation methods into the translation model. Feng et al. [7] introduced an evaluation module into the translation model to guide the predictions of NMT; the module evaluates each prediction from the perspectives of fluency and fidelity to encourage the model to generate tokens that are more related to its past and future translations. Shao et al. [23] proposed a training objective based on n-gram matching probability, which scores the greedy search results during training and uses the score to train the model, reducing the exposure bias problem. Yang et al. [33] compute a sentence-level loss between the feature vector output by the encoder and the feature vector output by the target word embedding layer. Wieting et al. [28] introduced a semantic similarity function into machine translation so that the predicted sentence can be better evaluated; this work uses a search strategy to generate predicted sentences during training. Some other works also use similar methods to generate predicted sentences and evaluate them, bringing the training phase closer to the prediction phase. Yang et al. [34] adopt the REINFORCE algorithm [29] and sample translations according to the probability distribution: they sample a translation from all possible translations and perform gradient descent through the sampled translation with a sequence-level reward under the reinforcement learning framework. Ma et al. [19] presented a bag-of-words loss that encourages the model to generate translations containing the words of the ground truth but in flexible word order. [1, 6, 10] explored the use of MBR decoding in machine translation. Fernandes et al. [8] propose quality-aware decoding for NMT, leveraging recent breakthroughs in reference-free and reference-based MT evaluation through inference methods such as N-best reranking and minimum Bayes risk decoding.
For post-editing, Xia et al. [31] introduce the deliberation process into the encoder-decoder framework to generate better sentences. Zhang et al. [35] generate the target sentence from left to right and from right to left through two decoders, so that the model makes better use of the source and target contexts. Our work is also related to the dual learning of He et al. [12]. It differs from dual learning, which introduces two translation models that form dual tasks and improve each other through feedback; instead, we give feedback to the current translation through the result of the dual translation task and the post-editing task of the same task.

7 Conclusion

To alleviate the problems of missing translations, translation errors, and predicted sentences whose meaning differs from the source sentence, we propose a neural machine translation model with a reconstructor and a post-editor to ensure high translation quality. The model evaluates the results of the translation model through the source and target sentences regenerated by the reconstructor and the post-editor, and it enhances the robustness of the translation model through the sharing of some parameters. In this way, the model makes better use of the parallel data. Finally, the model's performance was verified on multiple translation tasks, and we performed a comparative analysis of the different methods.

References

[1]
Chantal Amrhein and Rico Sennrich. 2022. Identifying weaknesses in machine translation metrics through minimum Bayes risk decoding: A case study for COMET. arXiv preprint arXiv:2202.05148 (2022).
[2]
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).
[3]
Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2018. Findings of the 2018 conference on machine translation (WMT’18). In Proceedings of the 3rd Conference on Machine Translation: Shared Task Papers, Association for Computational Linguistics, Belgium, 272–303. https://aclanthology.org/W18-6401.
[4]
Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Jan Niehues, Sebastian Stüker, Katsuhito Sudoh, Koichiro Yoshino, and Christian Federmann. 2017. Overview of the IWSLT 2017 evaluation campaign. In International Workshop on Spoken Language Translation. 2–14.
[5]
Yong Cheng. 2019. Semi-supervised learning for neural machine translation. In Joint Training for Neural Machine Translation. Springer, 25–40.
[6]
Bryan Eikema and Wilker Aziz. 2021. Sampling-based minimum Bayes risk decoding for neural machine translation. arXiv preprint arXiv:2108.04718 (2021).
[7]
Yang Feng, Wanying Xie, Shuhao Gu, Chenze Shao, Wen Zhang, Zhengxin Yang, and Dong Yu. 2020. Modeling fluency and faithfulness for diverse neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 59–66.
[8]
Patrick Fernandes, António Farinhas, Ricardo Rei, José G. C. de Souza, Perez Ogayo, Graham Neubig, and André F. T. Martins. 2022. Quality-aware decoding for neural machine translation. arXiv preprint arXiv:2205.00978 (2022).
[9]
Markus Freitag and Yaser Al-Onaizan. 2017. Beam search strategies for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation. 56–60.
[10]
Markus Freitag, David Grangier, Qijun Tan, and Bowen Liang. (n.d.). Minimum Bayes risk decoding with neural metrics of translation quality.
[11]
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N. Dauphin. 2017. Convolutional sequence to sequence learning. In International Conference on Machine Learning. PMLR, 1243–1252.
[12]
Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learning for machine translation. Advances in Neural Information Processing Systems 29 (2016), 820–828.
[13]
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[14]
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015).
[15]
Nal Kalchbrenner and Phil Blunsom. 2013. Recurrent continuous translation models. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 1700–1709.
[16]
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[17]
Jinyu Li, Rui Zhao, Jui-Ting Huang, and Yifan Gong. 2014. Learning small-size DNN with output-distribution-based criteria. In Fifteenth Annual Conference of the International Speech Communication Association.
[18]
Edward Loper and Steven Bird. 2002. NLTK: The natural language toolkit. arXiv preprint cs/0205028 (2002).
[19]
Shuming Ma, Xu Sun, Yizhong Wang, and Junyang Lin. 2018. Bag-of-words as target for neural machine translation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). 332–338.
[20]
Zhong Meng, Jinyu Li, Yong Zhao, and Yifan Gong. 2019. Conditional teacher-student learning. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 6445–6449.
[21]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Improving neural machine translation models with monolingual data. arXiv preprint arXiv:1511.06709 (2015).
[22]
Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015).
[23]
Chenze Shao, Yang Feng, and Xilin Chen. 2018. Greedy search with probabilistic n-gram matching for neural machine translation. arXiv preprint arXiv:1809.03132 (2018).
[24]
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[25]
Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215 (2014).
[26]
Zhaopeng Tu, Yang Liu, Lifeng Shang, Xiaohua Liu, and Hang Li. 2017. Neural machine translation with reconstruction. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31.
[27]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017).
[28]
John Wieting, Taylor Berg-Kirkpatrick, Kevin Gimpel, and Graham Neubig. 2019. Beyond BLEU: Training neural machine translation with semantic similarity. arXiv preprint arXiv:1909.06694 (2019).
[29]
Ronald J. Williams. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8, 3 (1992), 229–256.
[30]
Ronald J. Williams and David Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation 1, 2 (1989), 270–280.
[31]
Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation networks: Sequence generation beyond one-pass decoding. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 1782–1792.
[32]
Weijia Xu, Xing Niu, and Marine Carpuat. 2020. Dual reconstruction: A unifying objective for semi-supervised neural machine translation. arXiv preprint arXiv:2010.03412 (2020).
[33]
Mingming Yang, Rui Wang, Kehai Chen, Masao Utiyama, Eiichiro Sumita, Min Zhang, and Tiejun Zhao. 2019. Sentence-level agreement for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 3076–3082.
[34]
Zhen Yang, Wei Chen, Feng Wang, and Bo Xu. 2018. Improving neural machine translation with conditional sequence generative adversarial nets. In Proceedings of NAACL-HLT. 1346–1355.
[35]
Xiangwen Zhang, Jinsong Su, Yue Qin, Yang Liu, Rongrong Ji, and Hongji Wang. 2018. Asynchronous bidirectional decoding for neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[36]
Zhirui Zhang, Shujie Liu, Mu Li, Ming Zhou, and Enhong Chen. 2018. Joint training for neural machine translation models with monolingual data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.

Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 6
June 2023, 635 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3604597
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

Published: 16 June 2023
Online AM: 24 March 2023
Accepted: 17 March 2023
Revised: 29 December 2022
Received: 28 August 2022
Published in TALLIP Volume 22, Issue 6

Author Tags

1. Neural network
2. neural machine translation
3. reconstructor
4. post-editor
5. loss function

Qualifiers

• Research-article

Funding Sources

• National Natural Science Foundation of China
