
CN110688860B - Weight distribution method based on multiple attention mechanisms of transformer - Google Patents

Weight distribution method based on multiple attention mechanisms of transformer Download PDF

Info

Publication number
CN110688860B
CN110688860B (application CN201910924914.XA)
Authority
CN
China
Prior art keywords
output
delta
attention
model
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910924914.XA
Other languages
Chinese (zh)
Other versions
CN110688860A (en)
Inventor
闫明明
陈绪浩
罗华成
赵宇
段世豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201910924914.XA
Publication of CN110688860A
Application granted
Publication of CN110688860B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a weight distribution method based on multiple transformer attention mechanisms. It comprises the following: the inputs of the attention mechanism are the word vectors of the target language and the source language, and the output is an alignment tensor. Using multiple attention mechanism functions, multiple alignment tensors can be produced, and each output differs because of random parameter variations during computation. All attention mechanism models are put into operation, and a regularized calculation is carried out over the various attention outputs to approximate the optimal output. The regularized calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model; if one attention model performs particularly well, its weight is increased so that it has more influence on the final output, thereby improving the translation effect.

Description

Weight distribution method based on multiple attention mechanisms of transformer
Technical Field
The invention relates to the field of neural machine translation, in particular to a weight distribution method based on multiple attention mechanisms of a transformer.
Background
Neural machine translation is a machine translation method proposed in recent years. In contrast to traditional statistical machine translation, neural machine translation trains a neural network that maps one sequence to another and outputs a variable-length sequence, achieving very good performance in translation, dialogue, and text summarization. Neural machine translation is essentially an encoder-decoder system: the encoder encodes the source language sequence and extracts its information, and the decoder converts that information into another language, the target language, thereby completing the translation.
While the model generates output, it also produces an attention range indicating which parts of the input sequence should be focused on for the next output; the next output is then generated according to the attended region, and so on. The attention mechanism is somewhat similar to human behavior: when reading a passage, a person typically attends only to the informative words rather than to all words, in effect giving each word a different attention weight. The attention mechanism increases the training difficulty of the model but improves the quality of the generated text. In this patent, we improve the attention mechanism function.
After the neural machine translation system was proposed in 2013, and along with the rapid growth of computing power, neural machine translation developed quickly; the seq2seq model, the Transformer model, and others were proposed in succession. In 2013, Nal Kalchbrenner and Phil Blunsom proposed a novel end-to-end encoder-decoder structure for machine translation [4]. The model encodes a given piece of source text into a continuous vector using a convolutional neural network (CNN) and then converts the state vector into the target language using a recurrent neural network (RNN) as the decoder. In 2017, Google released a new machine learning model, the Transformer, which far outperforms existing algorithms in machine translation and other language understanding tasks.
The traditional technology has the following technical problems:
in the alignment process of the attention mechanism function, the existing framework first computes the similarity of the word vectors of the two input sentences and then performs a series of calculations to obtain an alignment function. Each alignment function produces one output per pass, and that output is used as the input of the next calculation. Such single-threaded calculation easily leads to an accumulation of errors. We introduce multiple attention mechanisms for weight distribution, that is, we seek the optimal solution across several computation processes, so as to achieve the best translation effect.
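For illustration only, the sketch below implements one conventional single-path alignment pass as scaled dot-product attention; the helper names, dimensions, and random toy inputs are assumptions for the example, not values taken from the patent.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """One conventional alignment pass: score the similarity of the two
    word-vector sequences, then return the softmax-weighted alignment output."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of target and source word vectors
    weights = softmax(scores, axis=-1)    # alignment weights
    return weights @ V                    # aligned output fed to the next computation

# Toy example: 4 target positions attending over 6 source positions, dimension 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))               # target-side word vectors
K = rng.normal(size=(6, 8))               # source-side word vectors
V = rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```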
Disclosure of Invention
Accordingly, in order to overcome the above shortcomings, the present invention provides a weight distribution method based on multiple transformer attention mechanisms, applied to an attention-based transformer framework model. It comprises the following: the inputs of the attention mechanism are the word vectors of the target language and the source language, and the output is an alignment tensor. Using multiple attention mechanism functions, multiple alignment tensors can be produced, and each output differs because of random parameter variations during computation. Many attention mechanism models have been proposed, such as the self-attention mechanism, multi-head attention mechanism, full attention mechanism, and local attention mechanism; each attention mechanism has different outputs and characteristics. We put all of these attention mechanism models into operation and apply a regularized calculation to their outputs to approximate the optimal output.
The invention is realized in the following way: a weight distribution method based on multiple transformer attention mechanisms is constructed and applied to an attention-based transformer model, characterized in that the method comprises the following steps:
Step 1: in the transformer model, select the attention model outputs that perform better for the application scenario.
Step 2: initialize the weight sequence δ; in the first calculation the weights are random numbers, subject to δ_1 + δ_2 + ... + δ_i = 1.
Step 3: carry out a regularized calculation over the model outputs, compute the center point of the outputs (the point closest to all values), and use the formula Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i to calculate the optimal matching value as the final output; here δ_1 + δ_2 + ... + δ_i = 1, δ_i is a weight parameter set by us, and O_i is the output of each attention model.
Step 4: substitute the final output into the subsequent computation and calculate the change of the loss function compared with the previous training step; if the loss function decreases, increase the proportion of the weights in δ belonging to outputs near the center point; if the loss function rises, increase the proportion of the weight belonging to the output farthest from the center point, with the whole process strictly obeying the rule δ_1 + δ_2 + ... + δ_i = 1.
Step 5: repeat the loop iteratively to finally determine the optimal weight sequence δ.
The invention has the following advantages. The invention discloses a weight distribution method based on multiple transformer attention mechanisms, applied to an attention-based transformer framework model. It comprises the following: the inputs of the attention mechanism are the word vectors of the target language and the source language, and the output is an alignment tensor. Using multiple attention mechanism functions, multiple alignment tensors can be produced, and each output differs because of random parameter variations during computation. Many attention mechanism models have been proposed, such as the self-attention mechanism, multi-head attention mechanism, full attention mechanism, and local attention mechanism; each attention mechanism has different outputs and characteristics. We put all of these attention mechanism models into operation and apply a regularized calculation to their outputs to approximate the optimal output, using the formula Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i, where δ_1 + δ_2 + ... + δ_i = 1, δ_i is a weight parameter we set, and O_i is the output of each attention model. The regularized calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model; if one attention model performs particularly well, its weight is increased so that it has more influence on the final output, thereby improving the translation effect.
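A minimal sketch of the weighting scheme described above, assuming the per-model outputs O_i are already computed as same-shaped arrays; the mean-based center point, the adjustment rate, and all names are illustrative assumptions, since the patent does not fix these details.

```python
import numpy as np

def combine_outputs(outputs, delta):
    """Step 3: Fin_out = δ_1·O_1 + δ_2·O_2 + ... + δ_i·O_i."""
    return sum(d * o for d, o in zip(delta, outputs))

def update_weights(outputs, delta, loss_decreased, rate=0.05):
    """Step 4: shift weight toward the output nearest the center point when the
    loss fell, or toward the farthest output when it rose, then renormalize."""
    center = np.mean(outputs, axis=0)                        # assumed center point
    dists = [np.linalg.norm(o - center) for o in outputs]
    target = int(np.argmin(dists)) if loss_decreased else int(np.argmax(dists))
    delta = delta.copy()
    delta[target] += rate
    return delta / delta.sum()                               # keep δ_1 + ... + δ_i = 1

# Toy usage: 3 attention models, each producing a (4, 8) alignment tensor.
rng = np.random.default_rng(1)
outputs = [rng.normal(size=(4, 8)) for _ in range(3)]
delta = rng.random(3)
delta /= delta.sum()                                         # Step 2: random init, sums to 1
fin_out = combine_outputs(outputs, delta)                    # Step 3
delta = update_weights(outputs, delta, loss_decreased=True)  # Step 4
print(delta, delta.sum())                                    # weights still sum to 1
```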
Detailed Description
The following detailed description of the present invention will clearly and fully describe the technical solutions of the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The invention provides, through this improvement, a weight distribution method based on multiple transformer attention mechanisms, applied to an attention-based transformer framework model.
Introduction to the transformer framework:
The encoder consists of 6 identical layers, each containing two sub-layers: the first sub-layer is a multi-head attention layer, followed by a simple fully connected layer. Each sub-layer adds a residual connection and layer normalization.
The decoder also consists of 6 identical layers, but its layers differ from the encoder's: each contains three sub-layers, a self-attention layer, an encoder-decoder attention layer, and finally a fully connected layer. The first two sub-layers are based on multi-head attention. One particular point here is masking, which prevents future output words from being used during training.
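As an illustration of one encoder layer just described (a multi-head attention sub-layer and a fully connected sub-layer, each with a residual connection and normalization), here is a minimal PyTorch sketch; the hyperparameters d_model=512, 8 heads, and d_ff=2048 are common defaults assumed for the example, not values fixed by the patent.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: a multi-head attention sub-layer and a fully connected
    sub-layer, each wrapped with a residual connection and layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # multi-head self-attention sub-layer
        x = self.norm1(x + attn_out)       # residual connection + normalization
        x = self.norm2(x + self.ff(x))     # fully connected sub-layer, same wrapping
        return x

# Toy usage: a batch of 2 sentences, 10 tokens each, model dimension 512.
x = torch.randn(2, 10, 512)
print(EncoderLayer()(x).shape)             # torch.Size([2, 10, 512])
```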
Attention model:
the encoder-decoder model, while very classical, is also very limited. A major limitation is that the link between encoding and decoding is a fixed length semantic vector C. That is, the encoder compresses the entire sequence of information into a fixed length vector. However, there are two drawbacks in doing so, namely, the semantic vector cannot completely represent the information of the whole sequence, and the information carried by the content input first can be diluted by the information input later. The longer the input sequence, the more serious this phenomenon. This results in insufficient information being obtained from the input sequence at the beginning of the decoding process, and thus in a certain degree of accuracy being compromised at the time of decoding.
In order to solve the above problem, the attention model was proposed about a year after the appearance of Seq2Seq. When generating an output, the model produces an attention range indicating which portions of the input sequence should be focused on for the next output, then generates the next output based on the attended region, and so on. Attention is somewhat similar to human behavior: when reading a passage, a person typically focuses only on the informative words rather than on all of them, in effect giving each word a different attention weight. The attention model increases the training difficulty but improves the quality of the generated text.
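The contrast just described can be made concrete with a small sketch, under illustrative assumptions (dot-product scoring, random toy states): the fixed-length approach reuses a single vector C at every decoding step, while attention rebuilds a weighted context for each step.

```python
import numpy as np

rng = np.random.default_rng(4)
H = rng.normal(size=(7, 16))        # encoder hidden states for a 7-token source sentence

# Fixed-length encoding: the whole sequence is compressed into one vector C,
# which every decoding step must rely on (the limitation described above).
C = H[-1]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_t, H):
    """Attention: score each encoder state against the decoder state s_t and
    build a step-specific weighted context instead of reusing a fixed C."""
    scores = H @ s_t                # dot-product scoring (illustrative choice)
    weights = softmax(scores)       # a different attention weight per source word
    return weights @ H

s_t = rng.normal(size=16)           # decoder state at the current step
c_t = attention_context(s_t, H)
print(C.shape, c_t.shape)           # both (16,), but c_t is recomputed at every step
```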
First, generate the semantic vector at the current time step:
s_t = tanh(W[s_{t-1}, y_{t-1}])
Second, transmit the hidden layer information and make the prediction:
A number of attention mechanism models have been proposed today, such as the self-attention mechanism, multi-head attention mechanism, full attention mechanism, and local attention mechanism; each attention mechanism has different outputs and characteristics.
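To make "different outputs" concrete, the sketch below runs full attention and a windowed local attention over the same toy inputs and shows that they produce different alignment outputs; the window width and all names are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # block positions outside the window
    return softmax(scores) @ V

rng = np.random.default_rng(2)
Q = rng.normal(size=(5, 8))
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))

O_full = attention(Q, K, V)                     # full attention: every position sees all others

# Local attention: each position only sees a window of +/-1 around itself (assumed width).
idx = np.arange(5)
window = np.abs(idx[:, None] - idx[None, :]) <= 1
O_local = attention(Q, K, V, mask=window)

print(np.abs(O_full - O_local).max() > 0)       # True: the two mechanisms give different outputs O_i
```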
The improvement here is a modification in the attention function.
All attention mechanism models are put into operation, and a regularized calculation is carried out over the various attention outputs to approximate the optimal output. The formula applied is Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i, where δ_1 + δ_2 + ... + δ_i = 1, δ_i is a weight parameter we set, and O_i is the output of each attention model. The regularized calculation ensures that the obtained value does not deviate too far from the optimal value while preserving the strengths of each attention model. The specific implementation steps are as follows:
Step 1: in the transformer model, select the attention model outputs that perform better for the application scenario.
Step 2: initialize the weight sequence δ; in the first calculation the weights are random numbers, subject to δ_1 + δ_2 + ... + δ_i = 1.
Step 3: carry out a regularized calculation over the model outputs, compute the center point of the outputs (the point closest to all values), and use the formula Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i to calculate the optimal matching value as the final output.
Step 4: substitute the final output into the subsequent computation and calculate the change of the loss function compared with the previous training step; if the loss function decreases, increase the proportion of the weights in δ belonging to outputs near the center point; if it rises, increase the proportion of the weight belonging to the output farthest from the center point, with the whole process strictly obeying the rule δ_1 + δ_2 + ... + δ_i = 1.
Step 5: repeat the loop iteratively to finally determine the optimal weight sequence δ.
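The five steps can be tied together as the following self-contained loop sketch; the attention outputs and the loss are stubbed with random placeholders, and the mean-based center point and the 0.05 adjustment rate are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def fake_attention_outputs():
    """Stand-in for the i attention models; each would return an alignment tensor."""
    return [rng.normal(size=(4, 8)) for _ in range(3)]

def fake_loss(fin_out):
    """Stand-in for the downstream translation loss."""
    return float(np.mean(fin_out ** 2))

delta = rng.random(3)
delta /= delta.sum()                                    # Step 2: random init, δ_1 + ... + δ_i = 1
prev_loss, rate = None, 0.05

for step in range(100):                                 # Step 5: repeated loop iteration
    outputs = fake_attention_outputs()
    fin_out = sum(d * o for d, o in zip(delta, outputs))    # Step 3: weighted combination
    loss = fake_loss(fin_out)
    center = np.mean(outputs, axis=0)
    dists = [np.linalg.norm(o - center) for o in outputs]
    if prev_loss is None or loss < prev_loss:           # Step 4: loss fell -> favor near-center output
        delta[int(np.argmin(dists))] += rate
    else:                                               # loss rose -> favor the farthest output
        delta[int(np.argmax(dists))] += rate
    delta /= delta.sum()                                # keep the weights summing to 1
    prev_loss = loss

print(delta)                                            # the learned weight sequence δ
```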
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (1)

1. A weight distribution method based on multiple attention mechanisms of a transformer, applied in machine translation, wherein the inputs of the attention mechanism are the word vectors of a target language and a source language, and the output is an alignment tensor;
the method is characterized by comprising the following steps:
step 1: in the transformer model, selecting the attention model outputs that perform better for the application scenario;
step 2: initializing the weight sequence δ, the weights being random numbers in the first calculation, with δ_1 + δ_2 + ... + δ_i = 1;
step 3: carrying out a regularized calculation over the model outputs, computing the center point of the outputs, and calculating the optimal matching value as the final output according to the formula Fin_out = δ_1·O_1 + δ_2·O_2 + δ_3·O_3 + ... + δ_i·O_i, where δ_1 + δ_2 + ... + δ_i = 1, δ_i is a set weight parameter, O_i is the output of each attention model, and the center point is the point closest to all values;
step 4: substituting the final output into the subsequent computation and calculating the change of the loss function compared with the previous training step; if the loss function decreases, increasing the proportion of the weights in δ near the center point; if the loss function rises, increasing the proportion of the weight in δ farthest from the center point, with δ_1 + δ_2 + ... + δ_i = 1;
step 5: repeating the loop iteratively to finally determine the optimal weight sequence δ.
CN201910924914.XA 2019-09-27 2019-09-27 Weight distribution method based on multiple attention mechanisms of transformer Active CN110688860B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910924914.XA CN110688860B (en) 2019-09-27 2019-09-27 Weight distribution method based on multiple attention mechanisms of transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910924914.XA CN110688860B (en) 2019-09-27 2019-09-27 Weight distribution method based on multiple attention mechanisms of transformer

Publications (2)

Publication Number Publication Date
CN110688860A CN110688860A (en) 2020-01-14
CN110688860B (en) 2024-02-06

Family

ID=69110821

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910924914.XA Active CN110688860B (en) 2019-09-27 2019-09-27 Weight distribution method based on multiple attention mechanisms of transformer

Country Status (1)

Country Link
CN (1) CN110688860B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381581B (en) * 2020-11-17 2022-07-08 东华理工大学 Advertisement click rate estimation method based on improved Transformer
CN112992129B (en) * 2021-03-08 2022-09-30 中国科学技术大学 Method for keeping monotonicity of attention mechanism in voice recognition task
CN113505193A (en) * 2021-06-01 2021-10-15 华为技术有限公司 Data processing method and related equipment

Also Published As

Publication number Publication date
CN110688860A (en) 2020-01-14

Similar Documents

Publication Publication Date Title
Liu et al. Diffsinger: Singing voice synthesis via shallow diffusion mechanism
CN110413785B (en) Text automatic classification method based on BERT and feature fusion
CN111382582B (en) Neural machine translation decoding acceleration method based on non-autoregressive
CN113657124B (en) Multi-mode Mongolian translation method based on cyclic common attention transducer
WO2020140487A1 (en) Speech recognition method for human-machine interaction of smart apparatus, and system
CN110688860B (en) Weight distribution method based on multiple attention mechanisms of transformer
CN110929092A (en) Multi-event video description method based on dynamic attention mechanism
CN110457661B (en) Natural language generation method, device, equipment and storage medium
CN109522403A (en) A kind of summary texts generation method based on fusion coding
CN114091478A (en) Dialog emotion recognition method based on supervised contrast learning and reply generation assistance
CN114281982B (en) Book propaganda abstract generation method and system adopting multi-mode fusion technology
CN113609284A (en) Method and device for automatically generating text abstract fused with multivariate semantics
CN112668346A (en) Translation method, device, equipment and storage medium
CN107463928A (en) Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM
CN110276396B (en) Image description generation method based on object saliency and cross-modal fusion features
JP7579022B1 (en) Method and system for intelligent analysis of bills based on semantic graph model
CN115860054A (en) Sparse codebook multiple access coding and decoding system based on generation countermeasure network
CN113299268A (en) Speech synthesis method based on stream generation model
CN115841119B (en) Emotion cause extraction method based on graph structure
CN117877460A (en) Speech synthesis method, device, speech synthesis model training method and device
CN113297374B (en) Text classification method based on BERT and word feature fusion
CN115471665A (en) Matting method and device based on tri-segmentation visual Transformer semantic information decoder
CN110717342B (en) Distance parameter alignment translation method based on transformer
CN112465929A (en) Image generation method based on improved graph convolution network
CN116543289B (en) Image description method based on encoder-decoder and Bi-LSTM attention model

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant