CN110442880B - Translation method, device and storage medium for machine translation - Google Patents
- Publication number
- CN110442880B (application CN201910721252.6A)
- Authority
- CN
- China
- Prior art keywords
- translation
- word
- penalty
- beam search
- evaluation function
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a translation method, device and storage medium for machine translation, comprising: receiving a source sentence to be translated; performing word segmentation on the source sentence; obtaining the part of speech of each segmented word; integrating, according to a word vector model, the part of speech into the word vector corresponding to each word to obtain a fused word vector sequence; inputting the word vector sequence into an encoder-decoder model to obtain an encoding-decoding result; and evaluating that result with a beam search evaluation function, wherein the evaluation function includes a penalty term based on length comparison and a penalty term based on repeat detection; a translation is then obtained from the evaluation result. Applying the embodiments of the present invention alleviates repeated fragments and omitted source content in the translation, with wide applicability, strong specificity, and high translation quality.
Description
Technical Field
The present invention relates to the technical field of machine translation improvement, and in particular to a translation method, device and storage medium for machine translation.
Background
Language is the most important carrier of everyday human communication and has a profound influence on the development of society as a whole, so automatic machine translation has become an urgent need. Automating translation between different languages has enormous application potential.
At present, rule-based machine translation methods require professional linguists to formulate large numbers of rules, with high labor costs and poor scalability. Interlingua-based methods require designing a universal intermediate language, which is prohibitively difficult and lacks robustness. Statistical machine translation lowers labor costs and improves scalability, but its output quality remains poor. Neural machine translation is currently the most advanced approach, yet there is still room to improve the quality of its output.
Summary of the Invention
The purpose of the present invention is to provide a translation method, device and storage medium for machine translation, aiming to solve the problem that translations generated by existing machine translation models are of poor quality.
To achieve the above object, the present invention provides a translation method for machine translation, the method comprising:
receiving a source sentence to be translated;
performing word segmentation on the source sentence;
obtaining the part of speech of each segmented word;
integrating, according to a word vector model, the part of speech into the word vector corresponding to each word, to obtain a fused word vector sequence;
inputting the word vector sequence into an encoder-decoder model to obtain an encoding-decoding result;
evaluating the encoding-decoding result with a beam search evaluation function, wherein the beam search evaluation function includes a penalty term based on length comparison and a penalty term based on repeat detection;
obtaining a translation according to the evaluation result.
Further, the beam search evaluation function is expressed as:
s(Y, X) = log(P(Y|X)) + d(x) + l(x)
where s(Y, X) is the beam search evaluation function, log(P(Y|X)) is the log-probability of Y given X, d(x) is the penalty term based on repeat detection, l(x) is the penalty term based on length comparison, and P is the distribution function;
the penalty term based on the length ratio is added to the beam search evaluation function to address partial omissions in the translation;
the penalty term based on repeat detection is added to the beam search evaluation function to address repeated content in the translation.
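The combined score above can be sketched in a few lines; the penalty magnitudes below are illustrative assumptions, not values from the patent:

```python
import math

def beam_score(log_prob, d_penalty, l_penalty):
    """Evaluation score s(Y, X) = log P(Y|X) + d(x) + l(x).

    log_prob  : cumulative log-probability log P(Y|X) of candidate Y
    d_penalty : repeat-detection penalty d(x), zero or negative
    l_penalty : length-ratio penalty l(x), zero or negative
    """
    return log_prob + d_penalty + l_penalty

# Toy comparison: a slightly more probable but repetitive candidate
# is demoted below a clean one once the penalties are applied.
clean = beam_score(math.log(0.20), 0.0, -0.05)
loopy = beam_score(math.log(0.25), -0.80, -0.05)
```

Here the repetitive hypothesis `loopy` ends up with the lower score despite its higher raw probability, which is exactly the reranking effect the two penalty terms are meant to produce.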
Further, the repeat-detection penalty term d(x) is given by the following formula (rendered as an image in the original):
where c is the index of the word currently being translated, δ is the range of repeat detection, ε is the penalty coefficient, y is the matrix corresponding to the candidate translation, y_{c-j} and y_{c-i-j} are the two matrices compared during repeat detection, and i, j are traversal variables.
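Since the formula itself is only an image in the source, the following is a plausible sketch built from the named parameters c, δ (`delta`) and ε (`eps`): fragments of several sizes ending at the current position are compared against earlier fragments within the window, and the exact weighting by fragment size and distance is an assumption:

```python
def repeat_penalty(tokens, delta=4, eps=0.5):
    """Hedged sketch of the repeat-detection penalty d(x).

    tokens : the candidate translation produced so far
    delta  : range of repeat detection (fragment sizes and lookback)
    eps    : penalty coefficient
    """
    c = len(tokens)  # index of the word currently being emitted
    penalty = 0.0
    for i in range(1, delta + 1):        # fragment size i
        recent = tokens[c - i:c]         # fragment just produced
        for j in range(1, delta + 1):    # distance j looking back
            start = c - i - j
            if start < 0:
                break
            earlier = tokens[start:start + i]
            if recent == earlier:
                # assumed weighting: longer and nearer repeats cost more
                penalty -= eps * i / j
    return penalty
```

With this sketch, `repeat_penalty(["the", "cat", "the", "cat"])` is negative while a repetition-free sequence incurs no penalty at all.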
Further, the step of evaluating the encoding-decoding result based on the beam search evaluation function includes:
computing the ratio of the length of the source sentence to the length of the target translation;
fitting the length ratios by linear regression to obtain a cumulative distribution function;
when the beam search candidates contain both the end-of-sentence marker and ordinary words, adding the probability F_X(x) that the translation has already ended and the probability 1 - F_X(x) that it has not to the respective evaluation terms: l(x) = θF_X(x) for non-EOS candidates and l(x) = θ(1 - F_X(x)) for the EOS candidate, where EOS is the end-of-sentence marker and θ is a parameter;
when the candidate word is the end-of-sentence marker, multiplying the probability that translation is not yet complete by the penalty factor as the penalty term;
when the candidate word is not the end-of-sentence marker, multiplying the probability that translation is complete by the penalty factor as the penalty term;
adding the resulting length-ratio penalty term to the beam search evaluation function;
evaluating the results based on the beam search evaluation function.
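The l(x) selection in the steps above can be sketched as follows; the patent only calls θ a parameter, so its sign and magnitude here are assumptions:

```python
def length_penalty(candidate_is_eos, cdf_value, theta=-1.0):
    """Length-ratio penalty l(x).

    cdf_value : F_X(x), the fitted probability that a translation of
                the current length would already be finished
    theta     : the patent's parameter (negative weight assumed here)
    """
    if candidate_is_eos:
        # ending now: penalise by the probability translation is NOT done
        return theta * (1.0 - cdf_value)
    # continuing: penalise by the probability translation IS already done
    return theta * cdf_value
```

At a high F_X(x) (the translation is statistically "long enough") the EOS candidate is penalised less than continuing, nudging the search toward stopping; at a low F_X(x) the pressure reverses.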
Further, in the encoder-decoder model, both the encoder and the decoder use a bidirectional recurrent neural network.
Further, the step of inputting the word vector sequence into the encoder-decoder model to obtain the encoding-decoding result includes:
inputting the word vector sequence into the encoder-decoder model;
converting the word vector sequence into a sentence vector with the encoder of the deep learning framework;
converting the sentence vector back into a word vector sequence with the decoder.
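As a toy illustration of this encode-then-decode shape — a single-direction recurrent cell with random weights standing in for the patent's trained bidirectional networks, and dimensions chosen arbitrarily:

```python
import math
import random

random.seed(0)
D_WORD, D_SENT = 4, 6  # illustrative word/sentence vector sizes

def rand_matrix(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)]
            for _ in range(rows)]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

W_in = rand_matrix(D_SENT, D_WORD)   # input-to-hidden weights
W_rec = rand_matrix(D_SENT, D_SENT)  # recurrent weights
W_out = rand_matrix(D_WORD, D_SENT)  # hidden-to-output weights

def encode(word_vectors):
    """Encoder: fold the word-vector sequence into one sentence vector."""
    h = [0.0] * D_SENT
    for w in word_vectors:
        h = [math.tanh(a + b)
             for a, b in zip(matvec(W_in, w), matvec(W_rec, h))]
    return h

def decode(sentence_vector, steps):
    """Decoder: unfold the sentence vector back into word vectors."""
    h, out = sentence_vector, []
    for _ in range(steps):
        h = [math.tanh(x) for x in matvec(W_rec, h)]
        out.append(matvec(W_out, h))
    return out

src = [[random.uniform(-1, 1) for _ in range(D_WORD)] for _ in range(5)]
sent = encode(src)           # word vectors -> sentence vector
out = decode(sent, steps=5)  # sentence vector -> word vectors
```

The point of the sketch is only the data flow: a variable-length sequence is compressed into a fixed-size sentence vector and then expanded back into a sequence.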
In addition, the present invention also discloses a machine translation device. The device includes a processor and a memory connected to the processor through a communication bus, wherein
the memory stores a translation program for machine translation; and
the processor executes the translation program to implement the translation steps of any of the methods described above.
The invention further provides a computer storage medium storing one or more programs, executable by one or more processors, to cause the one or more processors to perform the translation steps of any of the methods described above.
By applying the translation method, device and storage medium provided by the embodiments of the present invention, word vectors are constructed that both establish semantic associations between different words and capture their meanings under different parts of speech, and the beam search evaluation function is revised to alleviate repeated fragments and omitted source content in the translation, offering wide applicability, strong specificity, and high translation quality.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of an embodiment of the present invention.
FIG. 2 is a schematic structural diagram of an embodiment of the present invention.
FIG. 3 is another schematic structural diagram of an embodiment of the present invention.
FIG. 4 is a schematic description of the repeat-detection penalty algorithm of an embodiment of the present invention.
FIG. 5 is a schematic description of the length-ratio penalty algorithm of an embodiment of the present invention.
FIG. 6 is a schematic diagram of an English-to-Chinese translation result of an embodiment of the present invention.
Detailed Description
The embodiments of the present invention are described below through specific examples; those skilled in the art can readily understand other advantages and effects of the invention from the contents disclosed in this specification. The invention can also be implemented or applied through other specific embodiments, and the details in this specification may be modified or changed from different viewpoints and for different applications without departing from the spirit of the invention.
A language model is a simple, unified, abstract formal system. Once the objective facts of a language are described by a language model, they can be processed automatically by a computer; language models are therefore of great significance for natural language information processing and play an important role in research on part-of-speech tagging, syntactic analysis, and speech recognition.
In machine translation, both the input source-language sentence and the output target-language translation can be viewed as sequences, so machine translation can be treated as a sequence-to-sequence problem. The current mainstream approach to sequence-to-sequence problems is the encoder-decoder model: the encoder encodes the source sentence into a sentence vector, and the decoder decodes that vector into the target-language translation.
It should be noted that recurrent neural networks (RNNs) are typically used as the encoder and decoder. An RNN is a classic neural network architecture containing recurrent units, so it can process serializable data and persist information across time steps: at each step it combines the current input with its previous state to produce an output. A bidirectional recurrent neural network (Bi-RNN) is an improved structure based on the RNN. In some tasks the network's output depends not only on past inputs but also on subsequent ones, so the reverse sequence must be fed in as well as the forward sequence. A Bi-RNN consists of two RNN layers and accepts the forward and reverse sequences simultaneously, effectively improving performance.
Word vectors (word embeddings) are the collective name for a family of language-modeling and feature-learning techniques in natural language processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of real numbers. Conceptually, this is a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension. The skip-gram model is a model structure for producing distributed word representations when training a neural language model: it takes the word vector of the current word as input and predicts that word's context. Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set.
Beam search is an optimization of best-first search that reduces its memory requirements. Best-first search is a graph search that ranks all partial solutions (states) by a heuristic estimating how close a partial solution is to a complete solution (the goal state). In beam search, however, only a predetermined number of the best partial solutions are kept as candidates.
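The pruning just described can be sketched generically; the per-step candidate distributions below are toy stand-ins for the decoder's conditional probabilities:

```python
import math

def beam_search(step_probs, beam_width=2):
    """Keep only the beam_width best partial hypotheses at each step,
    scored by cumulative log-probability.

    step_probs[t] maps each candidate token to its probability at
    step t (an illustrative stand-in for the decoder output).
    """
    beams = [([], 0.0)]  # (tokens so far, cumulative log P)
    for probs in step_probs:
        expanded = [
            (tokens + [tok], score + math.log(p))
            for tokens, score in beams
            for tok, p in probs.items()
        ]
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = expanded[:beam_width]  # prune to the best few
    return beams[0][0]                 # highest-scoring hypothesis

best = beam_search([
    {"I": 0.6, "We": 0.4},
    {"run": 0.7, "ran": 0.3},
])
```

In the patent's method, the `score` update would use the full evaluation function s(Y, X) — the log-probability plus the two penalty terms — rather than log-probability alone.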
Please refer to FIGS. 1-6. It should be noted that the drawings provided in this embodiment illustrate the basic concept of the invention only schematically, so they show only the components related to the invention rather than the actual number, shape, and size of components in a real implementation; in practice the type, quantity, and proportion of each component may vary freely, and the component layout may be more complex.
As shown in FIG. 1, the present invention provides a translation method for machine translation, the method comprising:
S110, receiving the source sentence to be translated.
S120, performing word segmentation on the source sentence.
It can be understood that word segmentation is performed on each sentence of the received source text.
S130, obtaining the part of speech of each segmented word.
It should be noted that each source sentence is first segmented; a part-of-speech tagging tool is then applied to each word to obtain its part of speech, and the corresponding abbreviation is looked up in a part-of-speech abbreviation table. Finally, the original word and its part-of-speech abbreviation are joined with an underscore ("_") to form a word/part-of-speech string, which replaces the original word in the source sentence.
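This joining step is simple to show concretely; the tag set and abbreviation table below are illustrative assumptions, not the table used in the patent:

```python
def attach_pos(tagged_words, abbrev):
    """Join each word with its POS abbreviation via '_', producing the
    word/part-of-speech strings described above."""
    return [f"{word}_{abbrev[tag]}" for word, tag in tagged_words]

tokens = attach_pos(
    [("dog", "noun"), ("runs", "verb")],
    {"noun": "NN", "verb": "VB"},  # assumed abbreviation table
)
# tokens == ["dog_NN", "runs_VB"]
```

Each resulting string then stands in for the original word, so "dog" used as a noun and "dog" used as a verb receive distinct vocabulary entries downstream.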
S140, integrating, according to the word vector model, the part of speech into the word vector corresponding to each word, to obtain the fused word vector sequence.
Word vectors (word embeddings) map words or phrases from a vocabulary to vectors of real numbers, embedding a space with one dimension per word into a continuous vector space of much lower dimension.
In one implementation of the present invention, all word/part-of-speech strings in the source sentences obtained in steps S120 and S130 are collected to build a dictionary, and each string in the dictionary is assigned an index and saved. The word/part-of-speech strings in each sentence are then converted into index values, and the index sequence representing each sentence is fed into the skip-gram model for training, yielding trained word vectors fused with part-of-speech features and hence the fused word vector sequence.
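The dictionary-building and indexing steps, plus the (centre, context) pairs that skip-gram training consumes, can be sketched as follows; the window size and the tiny corpus are illustrative:

```python
def build_index(sentences):
    """Index every word/POS string in the corpus (the dictionary step)."""
    vocab = {}
    for sent in sentences:
        for tok in sent:
            vocab.setdefault(tok, len(vocab))
    return vocab

def skipgram_pairs(indices, window=2):
    """(centre, context) index pairs fed to skip-gram training:
    the centre word predicts each neighbour within the window."""
    pairs = []
    for pos, centre in enumerate(indices):
        for off in range(-window, window + 1):
            ctx = pos + off
            if off != 0 and 0 <= ctx < len(indices):
                pairs.append((centre, indices[ctx]))
    return pairs

vocab = build_index([["dog_NN", "runs_VB", "fast_RB"]])
pairs = skipgram_pairs([vocab[t] for t in ["dog_NN", "runs_VB", "fast_RB"]])
```

The actual embedding training (learning the vectors from these pairs) is done by the skip-gram network itself and is omitted here.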
Exemplarily, as shown in FIG. 2, with input w(t), the trained skip-gram model outputs w(t-2), w(t-1), w(t+1), and w(t+2).
S150, inputting the word vector sequence into the encoder-decoder model to obtain the encoding-decoding result.
It should be noted that the trained word vectors replace the words of each sentence in the original corpus, converting the sentences of the corpus into word vector sequences. Each word vector sequence is then fed into the encoder-decoder model to obtain the encoding-decoding result. The structure of the encoder-decoder model is shown in FIG. 3.
S160, evaluating the encoding-decoding result with the beam search evaluation function, wherein the beam search evaluation function includes a penalty term based on length comparison and a penalty term based on repeat detection.
It can be understood that beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is an optimization of best-first search that reduces memory requirements: best-first search ranks all partial solutions (states) by a heuristic estimating their closeness to the complete solution (goal state), whereas beam search keeps only a predetermined number of the best partial solutions as candidates. In the embodiments of the present invention, the beam search evaluation function is improved by adding a penalty term based on repeat detection and a penalty term based on the length ratio.
S170, obtaining the translation according to the evaluation result.
The final translation is obtained through the encoder-decoder model and beam search.
In one implementation of the present invention, the beam search evaluation function is expressed as:
s(Y, X) = log(P(Y|X)) + d(x) + l(x)
where s(Y, X) is the beam search evaluation function, log(P(Y|X)) is the log-probability of Y given X, d(x) is the penalty term based on repeat detection, l(x) is the penalty term based on length comparison, and P is the distribution function;
the penalty term based on the length ratio is added to the beam search evaluation function to address partial omissions in the translation;
the penalty term based on repeat detection is added to the beam search evaluation function to address repeated content in the translation.
It should be noted that the embodiments of the present invention improve the beam search evaluation function by adding a penalty term based on the length ratio and a penalty term based on repeat detection. The length-ratio penalty addresses translations that are too long or too short: the ratio of source-sentence length to translation length is gathered statistically, and the resulting penalty is used in the beam search's evaluation of candidate words. The repeat-detection penalty divides the translation into fragments of different sizes for comparison, taking into account the distance between the position of a repeated word and the position currently being translated, and the resulting penalty is likewise used in the evaluation of candidate words. This alleviates repeated fragments and omitted source content in the translation, with wide applicability, strong specificity, and high translation quality.
Further, the repeat-detection penalty term d(x) is given by the following formula (rendered as an image in the original):
where c is the index of the word currently being translated, δ is the range of repeat detection, ε is the penalty coefficient, y is the matrix corresponding to the candidate translation, y_{c-j} and y_{c-i-j} are the two matrices compared during repeat detection, and i, j are traversal variables.
As shown in FIG. 4, the candidate sentences of the beam search together with the parameters δ and ε are taken as the algorithm's input; the candidates are divided into fragments of several sizes for comparison, the penalty for each is computed, and the results are accumulated with weights. In FIG. 5, the current candidate word, the value F_X(x) of the cumulative distribution function at the current length, and the parameter θ are taken as the algorithm's input: a vector operation first determines whether each candidate word is EOS (1 if so, 0 otherwise), and the value of l(x) is then obtained by a dot product.
In one implementation of the present invention, the step of evaluating the encoding-decoding result based on the beam search evaluation function includes:
computing the ratio of the length of the source sentence to the length of the target translation;
fitting the length ratios by linear regression to obtain a cumulative distribution function;
when the beam search candidates contain both the end-of-sentence marker and ordinary words, adding the probability F_X(x) that the translation has already ended and the probability 1 - F_X(x) that it has not to the respective evaluation terms: l(x) = θF_X(x) for non-EOS candidates and l(x) = θ(1 - F_X(x)) for the EOS candidate, where EOS is the end-of-sentence marker and θ is a parameter;
when the candidate word is the end-of-sentence marker, multiplying the probability that translation is not yet complete by the penalty factor as the penalty term;
when the candidate word is not the end-of-sentence marker, multiplying the probability that translation is complete by the penalty factor as the penalty term;
adding the resulting length-ratio penalty term to the beam search evaluation function;
evaluating the results based on the beam search evaluation function.
It can be understood that the lengths of the source sentences and the target translations are first counted separately and the length ratios are computed; the ratios are then fitted by linear regression to obtain the cumulative distribution function F_X(x) = P(X < x), where X is the ratio of the length of the target translation to the length of the source sentence. When the beam search candidates contain both EOS (the end-of-sentence marker) and ordinary words, the probability F_X(x) that the translation has already ended and the probability 1 - F_X(x) that it has not are added to their respective evaluation terms: l(x) = θF_X(x) for non-EOS candidates and l(x) = θ(1 - F_X(x)) for the EOS candidate. When the candidate word is the EOS marker, the probability that translation is not yet complete is multiplied by the penalty factor as the penalty term; when it is not, the probability that translation is already complete is multiplied by the penalty factor. Finally, the resulting length-ratio penalty term is added to the beam search evaluation function, as shown in FIG. 5.
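The patent fits F_X(x) with linear regression; as a simpler stand-in, the distribution can also be estimated empirically from observed target/source length ratios (the sample ratios below are illustrative):

```python
def empirical_cdf(ratios):
    """F_X(x) = P(X < x) estimated from observed target/source
    length ratios. The patent fits this curve by linear regression;
    this sketch uses the raw empirical distribution instead."""
    data = sorted(ratios)

    def cdf(x):
        return sum(r < x for r in data) / len(data)

    return cdf

# illustrative ratios gathered from a parallel corpus
F = empirical_cdf([0.8, 0.9, 1.0, 1.1, 1.2])
```

The value F(x) at the current candidate length is what feeds the l(x) terms above: a translation already longer than most observed ratios gets F close to 1, favouring the EOS candidate.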
The final optimal translation is obtained through the encoder-decoder model and beam search, as shown in FIG. 6.
It should be noted that the encoder-decoder deep learning framework, combined with the beam search evaluation function, yields the final optimal translation, solving the problems of repeated fragments and omitted source sentences, with wide applicability, strong specificity, and high translation quality.
Further, in the encoder-decoder model, both the encoder and the decoder use a bidirectional recurrent neural network.
It should be noted that a bidirectional recurrent neural network (Bi-RNN) is an improved structure based on the RNN. In some tasks the network's output depends not only on past inputs but also on subsequent ones, so the reverse sequence must be fed in as well as the forward sequence. A Bi-RNN consists of two RNN layers and accepts the forward and reverse sequences simultaneously, effectively improving performance.
In one implementation of the present invention, the step of inputting the word-vector sequence into the encoder-decoder model to obtain the encoding-decoding result includes:
inputting the word-vector sequence into the encoder-decoder model;
converting, by the encoder of the deep-learning encoder-decoder framework, the word-vector sequence into a sentence vector; and
converting, by the decoder, the sentence vector into a word-vector sequence.
It can be understood that the encoder-decoder is a deep-learning framework in which the encoder converts a word-vector sequence into a sentence vector, and the decoder converts a sentence vector back into a word-vector sequence.
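The encoder/decoder division of labor can be shown schematically. This is not the patent's actual networks: the mean-pooling encoder and the decay-based decoder step rule are purely illustrative assumptions meant only to show a sequence being folded into one fixed-size vector and unrolled back out.

```python
def encode(word_vectors):
    """Fold the word-vector sequence into one sentence vector (mean pooling)."""
    dim = len(word_vectors[0])
    return [sum(v[i] for v in word_vectors) / len(word_vectors) for i in range(dim)]

def decode(sentence_vector, length):
    """Unroll the sentence vector into `length` word vectors (toy step rule)."""
    out, state = [], list(sentence_vector)
    for _ in range(length):
        state = [0.9 * s for s in state]   # decay the state at each step
        out.append(list(state))
    return out
```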
The present invention also provides a translation device for machine translation. The device includes a processor and a memory connected to the processor through a communication bus, wherein:
the memory is configured to store a translation program for machine translation; and
the processor is configured to execute the translation program so as to implement the translation steps of any of the methods described above.
The present invention also provides a computer storage medium storing one or more programs, which are executable by one or more processors to cause the one or more processors to perform the translation steps of any of the methods described above.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Therefore, all equivalent modifications or changes made by those with ordinary knowledge in the technical field without departing from the spirit and technical ideas disclosed herein shall still be covered by the claims of the present invention.
Claims (8)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910721252.6A CN110442880B (en) | 2019-08-06 | 2019-08-06 | Translation method, device and storage medium for machine translation |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110442880A CN110442880A (en) | 2019-11-12 |
CN110442880B true CN110442880B (en) | 2022-09-30 |
Family
ID=68433418
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910721252.6A Active CN110442880B (en) | 2019-08-06 | 2019-08-06 | Translation method, device and storage medium for machine translation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110442880B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112541364A (en) * | 2020-12-03 | 2021-03-23 | 昆明理工大学 | Chinese-Vietnamese neural machine translation method fusing multilevel language feature knowledge |
CN112632996A (en) * | 2020-12-08 | 2021-04-09 | 浙江大学 | Entity relation triple extraction method based on contrastive learning |
CN113435215A (en) * | 2021-06-22 | 2021-09-24 | 北京捷通华声科技股份有限公司 | A machine translation method and device |
CN113191165B (en) * | 2021-07-01 | 2021-09-24 | 南京新一代人工智能研究院有限公司 | Method for avoiding duplication of machine translation fragments |
CN113836950B (en) * | 2021-09-22 | 2024-04-02 | 广州华多网络科技有限公司 | Commodity title text translation method and device, equipment and medium thereof |
CN114254630B (en) * | 2021-11-29 | 2025-01-10 | 北京捷通华声科技股份有限公司 | A translation method, device, electronic device and readable storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2018058046A1 (en) * | 2016-09-26 | 2018-03-29 | Google Llc | Neural machine translation systems |
CN107967262A (en) * | 2017-11-02 | 2018-04-27 | 内蒙古工业大学 | A neural-network Mongolian-Chinese machine translation method |
Non-Patent Citations (1)
Title |
---|
A Mongolian-Chinese neural network machine translation model incorporating prior information; Fan Wenting et al.; Journal of Chinese Information Processing (《中文信息学报》); 2018-06-15 (No. 06); full text *
Also Published As
Publication number | Publication date |
---|---|
CN110442880A (en) | 2019-11-12 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
CB03 | Change of inventor or designer information | Inventor after: Lin Xinyue; Liu Jin; Song Junjie. Inventor before: Lin Xinyue; Liu Jin. |
GR01 | Patent grant ||