CN110688860B - Weight distribution method based on multiple attention mechanisms of transformer - Google Patents
Weight distribution method based on multiple attention mechanisms of transformer
- Publication number
- CN110688860B · CN201910924914.XA
- Authority
- CN
- China
- Prior art keywords
- output
- delta
- attention
- model
- attention mechanism
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Machine Translation (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a weight distribution method based on multiple transformer attention mechanisms, comprising the following: the inputs to the attention mechanism are the word vectors of the target language and the source language, and the output is an alignment tensor. Multiple alignment tensors can be produced using multiple attention mechanism functions, and each output differs because of random parameter variation in the computation process. All attention mechanism models are put into operation, and a regularized calculation is carried out over the various attention outputs to approximate the optimal output. The regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model; if one attention model performs particularly well in experiments, its weight is increased so that it has more influence on the final output, thereby improving the translation quality.
Description
Technical Field
The invention relates to the field of neural machine translation, and in particular to a weight distribution method based on multiple attention mechanisms of a transformer.
Background
Neural network machine translation is a machine translation method proposed in recent years. In contrast to traditional statistical machine translation, neural machine translation trains a neural network that maps one sequence to another and can output sequences of variable length, achieving very good performance in translation, dialogue, and text summarization. Neural machine translation is in fact an encoding-decoding system: the encoder encodes the source-language sequence and extracts its information, and decoding converts that information into another language, the target language, thereby completing the translation.
While the model is generating output, it produces an attention range indicating which parts of the input sequence to focus on for the next output, then generates that output from the attended region, and so on. The attention mechanism is somewhat similar to human behaviour: when reading a passage, a person typically pays attention only to the informative words rather than to all of them; in other words, the person gives each word a different attention weight. The attention mechanism increases the training difficulty of the model but improves the quality of text generation. In this patent we improve on the attention mechanism function.
After the neural machine translation system was proposed in 2013, and with the rapid growth of computing power, neural machine translation developed quickly; the seq2seq model, the Transformer model, and others were proposed in succession. In 2013, Nal Kalchbrenner and Phil Blunsom proposed a novel end-to-end encoder-decoder structure for machine translation [4]. The model encodes a given piece of source text into a continuous vector using a convolutional neural network (CNN) and then converts the state vector into the target language using a recurrent neural network (RNN) as the decoder. In 2017 Google released a new machine learning model, the Transformer, which performs far beyond existing algorithms in machine translation and other language understanding tasks.
The traditional technology has the following technical problems:
In the alignment process of the attention mechanism function, the existing framework first computes the similarity of the word vectors of the two input sentences and then performs a series of calculations to obtain an alignment function. Each alignment function produces one output per pass, and that output is used as the input to the next calculation. Such single-threaded calculation easily leads to an accumulation of errors. We introduce multiple attention mechanisms for weight distribution, that is, we search for the optimal solution across multiple computation processes, so as to achieve the best translation quality.
Disclosure of Invention
Accordingly, in order to overcome the above shortcomings, the present invention provides a weight distribution method based on multiple attention mechanisms of a transformer, applied to a transformer framework model based on the attention mechanism. It comprises the following: the inputs to the attention mechanism are the word vectors of the target language and the source language, and the output is an alignment tensor. Multiple alignment tensors can be produced using multiple attention mechanism functions, and each output differs because of random parameter variation in the computation process. Many attention mechanism models have been proposed, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism, and the local attention mechanism, and each has different outputs and characteristics. We put all of these attention mechanism models into operation and carry out a regularized calculation over their outputs to approximate the optimal output.
The invention is realized as follows: a weight distribution method based on multiple attention mechanisms of a transformer is constructed and applied to a transformer model based on the attention mechanism, characterized in that the method comprises the following steps:
Step 1: in the transformer model, select the model outputs that perform better for the application scenario.
Step 2: initialize the value of the weight sequence δ; in the first calculation it is a random sequence, with δ_1 + δ_2 + ... + δ_i = 1;
Step 3: carry out a regularized calculation on each model output, compute the center point of the outputs (the point closest to all values), and use the formula Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i to calculate the optimal matching value as the final output, where δ_1 + δ_2 + ... + δ_i = 1, δ_i is a weight parameter set by us, and O_i is the output of the various attention models;
Step 4: substitute the final output into the subsequent operations and compute the change of the loss function compared with the previous training step; if the loss function decreases, increase the proportion of the entries of δ near the center point; if the loss function rises, increase the proportion of the entries of δ farthest from the center point; the whole process strictly obeys the rule δ_1 + δ_2 + ... + δ_i = 1;
Step 5: repeat the loop iterative computation to finally determine the optimal weight sequence δ.
The invention has the following advantages. The invention discloses a weight distribution method based on multiple transformer attention mechanisms, applied to a transformer framework model based on the attention mechanism. It comprises the following: the inputs to the attention mechanism are the word vectors of the target language and the source language, and the output is an alignment tensor. Multiple alignment tensors can be produced using multiple attention mechanism functions, and each output differs because of random parameter variation in the computation process. Many attention mechanism models have been proposed, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism, and the local attention mechanism, and each has different outputs and characteristics. We put all of these attention mechanism models into operation and carry out a regularized calculation over their outputs to approximate the optimal output. The formula applied is Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i, where δ_1 + δ_2 + ... + δ_i = 1, δ_i is a weight parameter we set, and O_i is the output of the various attention models. The regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model; if one attention model performs particularly well in experiments, its weight is increased so that it has more influence on the final output, thereby improving the translation quality.
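A minimal sketch of the weight-distribution procedure (steps 2 to 5 above) is given here for illustration. It assumes NumPy, that the attention outputs O_1..O_i are tensors of a common shape, that the center point is the element-wise mean of the outputs, and that the weight adjustment uses a fixed step size; none of these choices is fixed by the patent text.

```python
import numpy as np

def combine_attention_outputs(outputs, deltas):
    """Step 3 as we read it: Fin_out = delta_1*O_1 + ... + delta_i*O_i.

    outputs : list of alignment tensors O_1..O_i, all of the same shape
    deltas  : weight sequence delta, with sum(deltas) == 1
    The center point is assumed here to be the element-wise mean of the outputs.
    """
    stacked = np.stack(outputs)                        # shape (i, ...)
    deltas = np.asarray(deltas, dtype=float)
    assert np.isclose(deltas.sum(), 1.0), "delta_1 + ... + delta_i must equal 1"
    center = stacked.mean(axis=0)                      # assumed center point
    fin_out = np.tensordot(deltas, stacked, axes=1)    # weighted combination
    return fin_out, center

def update_deltas(deltas, dists_to_center, prev_loss, curr_loss, step=0.05):
    """Step 4 as we read it: shift weight toward the output nearest the center
    point when the loss decreased, or toward the farthest one when it rose,
    then renormalize so that the weights still sum to 1. The step size is an
    assumed adjustment rate."""
    deltas = np.asarray(deltas, dtype=float)
    if curr_loss < prev_loss:
        target = int(np.argmin(dists_to_center))       # output closest to the center
    else:
        target = int(np.argmax(dists_to_center))       # output farthest from the center
    deltas[target] += step
    return deltas / deltas.sum()

# Step 2: random initial weights that sum to 1; step 5 repeats the update
# inside the training loop until the weight sequence delta stabilizes.
rng = np.random.default_rng(0)
outputs = [rng.random((2, 4)) for _ in range(3)]       # three attention-model outputs
deltas = rng.random(3)
deltas /= deltas.sum()
fin_out, center = combine_attention_outputs(outputs, deltas)
dists = [np.linalg.norm(o - center) for o in outputs]
deltas = update_deltas(deltas, dists, prev_loss=1.0, curr_loss=0.9)
```

In this sketch the distance of each output from the center point is taken as the Euclidean norm; the patent leaves the distance measure unspecified.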
Detailed Description
The following detailed description of the present invention will clearly and fully describe the technical solutions of the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Through the above improvement, the invention provides a weight distribution method based on multiple attention mechanisms of a transformer, applied to a transformer framework model based on the attention mechanism.
Introduction to the transformer framework:
The Encoder consists of 6 identical layers, each containing two sub-layers: the first is a multi-head attention layer, followed by a simple fully connected layer. Each sub-layer adds a residual connection and layer normalization.
The Decoder also consists of 6 identical layers, but these differ from the encoder layers: each contains three sub-layers, namely a self-attention layer, an encoder-decoder attention layer, and finally a fully connected layer. The first two sub-layers are based on multi-head attention. One particular point here is masking, which prevents future output words from being used during training.
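The sub-layer structure of one encoder layer can be sketched as follows; this is an illustration assuming PyTorch and the hyper-parameters of the original Transformer paper (d_model = 512, 8 heads, feed-forward width 2048), not code taken from the patent.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One of the 6 identical encoder layers: multi-head self-attention plus a
    fully connected sub-layer, each followed by a residual connection and
    layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                  # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)              # multi-head attention sub-layer
        x = self.norm1(x + attn_out)                       # residual connection + normalization
        x = self.norm2(x + self.ff(x))                     # fully connected sub-layer
        return x

encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])   # the 6 identical layers
out = encoder(torch.randn(2, 10, 512))                         # example forward pass
```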
Attention model:
The encoder-decoder model, while very classical, is also very limited. Its major limitation is that the link between encoding and decoding is a fixed-length semantic vector C: the encoder compresses the information of the entire sequence into a fixed-length vector. This has two drawbacks: the semantic vector cannot fully represent the information of the whole sequence, and the information carried by earlier inputs is diluted by later inputs. The longer the input sequence, the more serious this phenomenon. As a result, insufficient information about the input sequence is available at the beginning of decoding, and decoding accuracy suffers to a certain degree.
In order to solve the above problem, an attention model was proposed about a year after the appearance of Seq2Seq. When generating an output, the model produces an attention range indicating which parts of the input sequence to focus on for the next output, then generates that output from the attended region, and so on. Attention is somewhat similar to human behaviour: when reading a passage, a person typically pays attention only to the informative words rather than to all of them, in effect giving each word a different attention weight. The attention model increases the training difficulty of the model but improves the quality of text generation.
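As a generic illustration of how such an attention step assigns each source word its own weight and builds a context vector for the next prediction, consider the sketch below; the dot-product score, NumPy, and all variable names are assumptions made for illustration rather than the patent's own formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(s_prev, encoder_states):
    """Compute per-word attention weights and a context vector for one decoding step.

    s_prev         : previous decoder hidden state, shape (d,)
    encoder_states : source-side hidden states h_1..h_n, shape (n, d)
    A plain dot-product score is assumed; the scoring function is not fixed here.
    """
    scores = encoder_states @ s_prev      # one score per source word
    alphas = softmax(scores)              # attention weights, one per source word
    context = alphas @ encoder_states     # weighted sum of the encoder states
    return alphas, context

# Example with a 3-word source sentence and 4-dimensional hidden states.
rng = np.random.default_rng(0)
alphas, context = attention_step(rng.random(4), rng.random((3, 4)))
```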
First, the semantic vector at the current time step is generated:
s_t = tanh(W[s_{t-1}, y_{t-1}])
Secondly, the hidden layer information is transmitted and the prediction is made.
A number of attention mechanism models have been proposed today, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism, and the local attention mechanism, each with different outputs and characteristics.
The improvement here is a modification in the attention function.
All attention mechanism models are put into operation, and a regularized calculation is carried out over the various attention outputs to approximate the optimal output. The formula applied is Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i, where δ_1 + δ_2 + ... + δ_i = 1 and δ_i is a weight parameter we set; O_i is the output of the various attention models. The regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model. The specific implementation steps are as follows:
Step 1: in the transformer model, select the model outputs that perform better for the application scenario.
Step 2: initialize the value of the weight sequence δ; in the first calculation it is a random sequence, with δ_1 + δ_2 + ... + δ_i = 1;
Step 3: carry out a regularized calculation on each model output, compute the center point (the point closest to all values) of the outputs, and use the formula Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i to calculate the optimal matching value as the final output.
Step 4: substitute the final output into the subsequent operations and compute the change of the loss function compared with the previous training step; if the loss function decreases, increase the proportion of the entries of δ near the center point; if it rises, increase the proportion of the entries of δ farthest from the center point; the whole process strictly obeys the rule δ_1 + δ_2 + ... + δ_i = 1.
Step 5: repeat the loop iterative computation to finally determine the optimal weight sequence δ.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (1)
1. A weight distribution method based on multiple attention mechanisms of a transformer, applied in machine translation, wherein the input of the attention mechanism is the word vectors of a target language and a source language, and the output is an alignment tensor;
characterized in that the method comprises the following steps:
step 1: in a transformer model, selecting the model output that performs better for the application scenario;
step 2: initializing the value of the weight sequence δ, the weight sequence being random in the first calculation, with δ_1 + δ_2 + ... + δ_i = 1;
step 3: carrying out a regularized calculation on each model output, computing the center point of the outputs, and calculating the optimal matching value as the final output according to the formula Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i, where δ_1 + δ_2 + ... + δ_i = 1, δ_i is a set weight parameter, O_i is the output of the various attention models, and the center point is the point closest to all values;
step 4: substituting the final output into subsequent operations and calculating the change of the loss function compared with the previous training step; if the loss function decreases, increasing the proportion of the entries of δ near the center point; if the loss function rises, increasing the proportion of the entries of the weight sequence δ farthest from the center point, with δ_1 + δ_2 + ... + δ_i = 1;
step 5: carrying out repeated loop iterative computation to finally determine the optimal weight sequence δ.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910924914.XA CN110688860B (en) | 2019-09-27 | 2019-09-27 | Weight distribution method based on multiple attention mechanisms of transformer |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910924914.XA CN110688860B (en) | 2019-09-27 | 2019-09-27 | Weight distribution method based on multiple attention mechanisms of transformer |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110688860A CN110688860A (en) | 2020-01-14 |
CN110688860B true CN110688860B (en) | 2024-02-06 |
Family
ID=69110821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910924914.XA Active CN110688860B (en) | 2019-09-27 | 2019-09-27 | Weight distribution method based on multiple attention mechanisms of transformer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110688860B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112381581B (en) * | 2020-11-17 | 2022-07-08 | 东华理工大学 | Advertisement click rate estimation method based on improved Transformer |
CN112992129B (en) * | 2021-03-08 | 2022-09-30 | 中国科学技术大学 | Method for keeping monotonicity of attention mechanism in voice recognition task |
CN113505193A (en) * | 2021-06-01 | 2021-10-15 | 华为技术有限公司 | Data processing method and related equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110688860A (en) | 2020-01-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |