CN110688860B - Weight distribution method based on multiple attention mechanisms of transformer - Google Patents
Weight distribution method based on multiple attention mechanisms of transformer
- Publication number
- CN110688860B · CN201910924914.XA
- Authority
- CN
- China
- Prior art keywords
- output
- delta
- attention
- model
- attention mechanism
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Machine Translation (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The invention discloses a weight distribution method based on multiple transformer attention mechanisms, comprising the following: the inputs to the attention mechanism are the word vectors of the target language and the source language, and the output is an alignment tensor. Multiple alignment tensors can be produced using multiple attention mechanism functions, and each output differs because of random parameter variation in the computation process. All attention mechanism models are put into operation, and a regularized calculation is carried out over the various attention outputs to approximate the optimal output. The regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model; if one attention model performs particularly well in experiments, its weight is increased so that it has more influence on the final output, thereby improving the translation quality.
Description
Technical Field
The invention relates to the field of neural machine translation, and in particular to a weight distribution method based on multiple attention mechanisms of a transformer.
Background
Neural network machine translation is a machine translation method proposed in recent years. In contrast to traditional statistical machine translation, neural machine translation trains a neural network that maps one sequence to another and can output sequences of variable length, achieving very good performance in translation, dialogue, and text summarization. Neural machine translation is in fact an encoding-decoding system: the encoder encodes the source-language sequence and extracts its information, and decoding converts that information into another language, the target language, thereby completing the translation.
While the model is generating output, it produces an attention range indicating which parts of the input sequence to focus on for the next output, then generates that output from the attended region, and so on. The attention mechanism is somewhat similar to human behaviour: when reading a passage, a person typically pays attention only to the informative words rather than to all of them; in other words, the person gives each word a different attention weight. The attention mechanism increases the training difficulty of the model but improves the quality of text generation. In this patent we improve on the attention mechanism function.
After the neural machine translation system was proposed in 2013, and with the rapid growth of computing power, neural machine translation developed quickly; the seq2seq model, the Transformer model, and others were proposed in succession. In 2013, Nal Kalchbrenner and Phil Blunsom proposed a novel end-to-end encoder-decoder structure for machine translation [4]. The model encodes a given piece of source text into a continuous vector using a convolutional neural network (CNN) and then converts the state vector into the target language using a recurrent neural network (RNN) as the decoder. In 2017 Google released a new machine learning model, the Transformer, which performs far beyond existing algorithms in machine translation and other language understanding tasks.
The traditional technology has the following technical problems:
In the alignment process of the attention mechanism function, the existing framework first computes the similarity of the word vectors of the two input sentences and then performs a series of calculations to obtain an alignment function. Each alignment function produces one output per pass, and that output is used as the input to the next calculation. Such single-threaded calculation easily leads to an accumulation of errors. We introduce multiple attention mechanisms for weight distribution, that is, we search for the optimal solution across multiple computation processes, so as to achieve the best translation quality.
Disclosure of Invention
Accordingly, in order to overcome the above shortcomings, the present invention provides a weight distribution method based on multiple attention mechanisms of a transformer, applied to a transformer framework model based on the attention mechanism. It comprises the following: the inputs to the attention mechanism are the word vectors of the target language and the source language, and the output is an alignment tensor. Multiple alignment tensors can be produced using multiple attention mechanism functions, and each output differs because of random parameter variation in the computation process. Many attention mechanism models have been proposed, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism, and the local attention mechanism, and each has different outputs and characteristics. We put all of these attention mechanism models into operation and carry out a regularized calculation over their outputs to approximate the optimal output.
The invention is realized as follows: a weight distribution method based on multiple attention mechanisms of a transformer is constructed and applied to a transformer model based on the attention mechanism, characterized in that the method comprises the following steps:
Step 1: in the transformer model, select the model outputs that perform better for the application scenario.
Step 2: initialize the value of the weight sequence δ; in the first calculation it is a random sequence, with δ_1 + δ_2 + ... + δ_i = 1;
Step 3: carry out a regularized calculation on each model output, compute the center point of the outputs (the point closest to all values), and use the formula Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i to calculate the optimal matching value as the final output, where δ_1 + δ_2 + ... + δ_i = 1, δ_i is a weight parameter set by us, and O_i is the output of the various attention models;
Step 4: substitute the final output into the subsequent operations and compute the change of the loss function compared with the previous training step; if the loss function decreases, increase the proportion of the entries of δ near the center point; if the loss function rises, increase the proportion of the entries of δ farthest from the center point; the whole process strictly obeys the rule δ_1 + δ_2 + ... + δ_i = 1;
Step 5: repeat the loop iterative computation to finally determine the optimal weight sequence δ.
The invention has the following advantages. The invention discloses a weight distribution method based on multiple transformer attention mechanisms, applied to a transformer framework model based on the attention mechanism. It comprises the following: the inputs to the attention mechanism are the word vectors of the target language and the source language, and the output is an alignment tensor. Multiple alignment tensors can be produced using multiple attention mechanism functions, and each output differs because of random parameter variation in the computation process. Many attention mechanism models have been proposed, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism, and the local attention mechanism, and each has different outputs and characteristics. We put all of these attention mechanism models into operation and carry out a regularized calculation over their outputs to approximate the optimal output. The formula applied is Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i, where δ_1 + δ_2 + ... + δ_i = 1, δ_i is a weight parameter we set, and O_i is the output of the various attention models. The regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model; if one attention model performs particularly well in experiments, its weight is increased so that it has more influence on the final output, thereby improving the translation quality.
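A minimal sketch of the weight-distribution procedure (steps 2 to 5 above) is given here for illustration. It assumes NumPy, that the attention outputs O_1..O_i are tensors of a common shape, that the center point is the element-wise mean of the outputs, and that the weight adjustment uses a fixed step size; none of these choices is fixed by the patent text.

```python
import numpy as np

def combine_attention_outputs(outputs, deltas):
    """Step 3 as we read it: Fin_out = delta_1*O_1 + ... + delta_i*O_i.

    outputs : list of alignment tensors O_1..O_i, all of the same shape
    deltas  : weight sequence delta, with sum(deltas) == 1
    The center point is assumed here to be the element-wise mean of the outputs.
    """
    stacked = np.stack(outputs)                        # shape (i, ...)
    deltas = np.asarray(deltas, dtype=float)
    assert np.isclose(deltas.sum(), 1.0), "delta_1 + ... + delta_i must equal 1"
    center = stacked.mean(axis=0)                      # assumed center point
    fin_out = np.tensordot(deltas, stacked, axes=1)    # weighted combination
    return fin_out, center

def update_deltas(deltas, dists_to_center, prev_loss, curr_loss, step=0.05):
    """Step 4 as we read it: shift weight toward the output nearest the center
    point when the loss decreased, or toward the farthest one when it rose,
    then renormalize so that the weights still sum to 1. The step size is an
    assumed adjustment rate."""
    deltas = np.asarray(deltas, dtype=float)
    if curr_loss < prev_loss:
        target = int(np.argmin(dists_to_center))       # output closest to the center
    else:
        target = int(np.argmax(dists_to_center))       # output farthest from the center
    deltas[target] += step
    return deltas / deltas.sum()

# Step 2: random initial weights that sum to 1; step 5 repeats the update
# inside the training loop until the weight sequence delta stabilizes.
rng = np.random.default_rng(0)
outputs = [rng.random((2, 4)) for _ in range(3)]       # three attention-model outputs
deltas = rng.random(3)
deltas /= deltas.sum()
fin_out, center = combine_attention_outputs(outputs, deltas)
dists = [np.linalg.norm(o - center) for o in outputs]
deltas = update_deltas(deltas, dists, prev_loss=1.0, curr_loss=0.9)
```

In this sketch the distance of each output from the center point is taken as the Euclidean norm; the patent leaves the distance measure unspecified.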
Detailed Description
The following detailed description of the present invention will clearly and fully describe the technical solutions of the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Through the above improvement, the invention provides a weight distribution method based on multiple attention mechanisms of a transformer, applied to a transformer framework model based on the attention mechanism.
Introduction to the transformer framework:
The Encoder consists of 6 identical layers, each containing two sub-layers: the first is a multi-head attention layer, followed by a simple fully connected layer. Each sub-layer adds a residual connection and layer normalization.
The Decoder also consists of 6 identical layers, but these differ from the encoder layers: each contains three sub-layers, namely a self-attention layer, an encoder-decoder attention layer, and finally a fully connected layer. The first two sub-layers are based on multi-head attention. One particular point here is masking, which prevents future output words from being used during training.
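The sub-layer structure of one encoder layer can be sketched as follows; this is an illustration assuming PyTorch and the hyper-parameters of the original Transformer paper (d_model = 512, 8 heads, feed-forward width 2048), not code taken from the patent.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One of the 6 identical encoder layers: multi-head self-attention plus a
    fully connected sub-layer, each followed by a residual connection and
    layer normalization."""

    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                  # x: (batch, seq_len, d_model)
        attn_out, _ = self.self_attn(x, x, x)              # multi-head attention sub-layer
        x = self.norm1(x + attn_out)                       # residual connection + normalization
        x = self.norm2(x + self.ff(x))                     # fully connected sub-layer
        return x

encoder = nn.Sequential(*[EncoderLayer() for _ in range(6)])   # the 6 identical layers
out = encoder(torch.randn(2, 10, 512))                         # example forward pass
```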
Attention model:
The encoder-decoder model, while very classical, is also very limited. Its major limitation is that the link between encoding and decoding is a fixed-length semantic vector C: the encoder compresses the information of the entire sequence into a fixed-length vector. This has two drawbacks: the semantic vector cannot fully represent the information of the whole sequence, and the information carried by earlier inputs is diluted by later inputs. The longer the input sequence, the more serious this phenomenon. As a result, insufficient information about the input sequence is available at the beginning of decoding, and decoding accuracy suffers to a certain degree.
In order to solve the above problem, an attention model was proposed about a year after the appearance of Seq2Seq. When generating an output, the model produces an attention range indicating which parts of the input sequence to focus on for the next output, then generates that output from the attended region, and so on. Attention is somewhat similar to human behaviour: when reading a passage, a person typically pays attention only to the informative words rather than to all of them, in effect giving each word a different attention weight. The attention model increases the training difficulty of the model but improves the quality of text generation.
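As a generic illustration of how such an attention step assigns each source word its own weight and builds a context vector for the next prediction, consider the sketch below; the dot-product score, NumPy, and all variable names are assumptions made for illustration rather than the patent's own formulation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_step(s_prev, encoder_states):
    """Compute per-word attention weights and a context vector for one decoding step.

    s_prev         : previous decoder hidden state, shape (d,)
    encoder_states : source-side hidden states h_1..h_n, shape (n, d)
    A plain dot-product score is assumed; the scoring function is not fixed here.
    """
    scores = encoder_states @ s_prev      # one score per source word
    alphas = softmax(scores)              # attention weights, one per source word
    context = alphas @ encoder_states     # weighted sum of the encoder states
    return alphas, context

# Example with a 3-word source sentence and 4-dimensional hidden states.
rng = np.random.default_rng(0)
alphas, context = attention_step(rng.random(4), rng.random((3, 4)))
```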
First, the semantic vector at the current time step is generated:
s_t = tanh(W[s_{t-1}, y_{t-1}])
Secondly, the hidden layer information is transmitted and the prediction is made.
A number of attention mechanism models have been proposed today, such as the self-attention mechanism, the multi-head attention mechanism, the global attention mechanism, and the local attention mechanism, each with different outputs and characteristics.
The improvement here is a modification in the attention function.
All attention mechanism models are put into operation, and a regularized calculation is carried out over the various attention outputs to approximate the optimal output. The formula applied is Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i, where δ_1 + δ_2 + ... + δ_i = 1 and δ_i is a weight parameter we set; O_i is the output of the various attention models. The regularization calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model. The specific implementation steps are as follows:
Step 1: in the transformer model, select the model outputs that perform better for the application scenario.
Step 2: initialize the value of the weight sequence δ; in the first calculation it is a random sequence, with δ_1 + δ_2 + ... + δ_i = 1;
Step 3: carry out a regularized calculation on each model output, compute the center point (the point closest to all values) of the outputs, and use the formula Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i to calculate the optimal matching value as the final output.
Step 4: substitute the final output into the subsequent operations and compute the change of the loss function compared with the previous training step; if the loss function decreases, increase the proportion of the entries of δ near the center point; if it rises, increase the proportion of the entries of δ farthest from the center point; the whole process strictly obeys the rule δ_1 + δ_2 + ... + δ_i = 1.
Step 5: repeat the loop iterative computation to finally determine the optimal weight sequence δ.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (1)
1. A weight distribution method based on multiple attention mechanisms of a transformer, applied in machine translation, wherein the input of the attention mechanism is the word vectors of a target language and a source language, and the output is an alignment tensor;
characterized in that the method comprises the following steps:
step 1: in a transformer model, selecting the model output that performs better for the application scenario;
step 2: initializing the value of the weight sequence δ, the weight sequence being random in the first calculation, with δ_1 + δ_2 + ... + δ_i = 1;
step 3: carrying out a regularized calculation on each model output, computing the center point of the outputs, and calculating the optimal matching value as the final output according to the formula Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i, where δ_1 + δ_2 + ... + δ_i = 1, δ_i is a set weight parameter, O_i is the output of the various attention models, and the center point is the point closest to all values;
step 4: substituting the final output into subsequent operations and calculating the change of the loss function compared with the previous training step; if the loss function decreases, increasing the proportion of the entries of δ near the center point; if the loss function rises, increasing the proportion of the entries of the weight sequence δ farthest from the center point, with δ_1 + δ_2 + ... + δ_i = 1;
step 5: carrying out repeated loop iterative computation to finally determine the optimal weight sequence δ.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910924914.XA CN110688860B (en) | 2019-09-27 | 2019-09-27 | Weight distribution method based on multiple attention mechanisms of transformer |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910924914.XA CN110688860B (en) | 2019-09-27 | 2019-09-27 | Weight distribution method based on multiple attention mechanisms of transformer |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110688860A CN110688860A (en) | 2020-01-14 |
CN110688860B true CN110688860B (en) | 2024-02-06 |
Family
ID=69110821
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910924914.XA Active CN110688860B (en) | 2019-09-27 | 2019-09-27 | Weight distribution method based on multiple attention mechanisms of transformer |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110688860B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112381581B (en) * | 2020-11-17 | 2022-07-08 | 东华理工大学 | Advertisement click rate estimation method based on improved Transformer |
CN112992129B (en) * | 2021-03-08 | 2022-09-30 | 中国科学技术大学 | Method for keeping monotonicity of attention mechanism in voice recognition task |
CN113505193A (en) * | 2021-06-01 | 2021-10-15 | 华为技术有限公司 | Data processing method and related equipment |
Also Published As
Publication number | Publication date |
---|---|
CN110688860A (en) | 2020-01-14 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |