CN109979461B - Voice translation method and device - Google Patents
- Publication number
- CN109979461B (application number CN201910199082.XA)
- Authority
- CN
- China
- Prior art keywords
- text
- translation
- final
- speech
- target voice
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
Abstract
The application discloses a voice translation method and device. The method comprises the following steps: acquiring target voice to be translated; and translating a first translation object and a second translation object to obtain a final translation text of the target voice, wherein the first translation object is the recognition text of the target voice and the second translation object is the target voice itself. Compared with the prior art, in which the target voice is first recognized to obtain a recognition text and then only that recognition text is translated as the translation object, the translation objects in this method are richer, namely they include both the recognition text and the target voice, so that a more accurate translation text of the target voice can be determined by translating these two translation objects.
Description
Technical Field
The present application relates to the field of speech translation technologies, and in particular, to a speech translation method and apparatus.
Background
Existing speech translation methods typically include two steps, namely speech recognition and text translation. Specifically, first, a piece of speech is recognized into a text in the same language as the piece of speech by a speech recognition technique, and then the recognized text is translated into a text in another language by a text translation technique, thereby implementing a speech translation process.
However, when speech translation is performed by combining a speech recognition technique with a text translation technique, errors accumulate. For example, if a word is recognized incorrectly by the speech recognition technique, the text translation technique translates that incorrect word and produces an incorrect translation result. Errors from the speech recognition stage thus accumulate into the text translation stage, resulting in inaccurate translation results.
Disclosure of Invention
The embodiment of the present application mainly aims to provide a method and an apparatus for speech translation, which can improve the accuracy of a speech translation result.
The embodiment of the application provides a voice translation method, which comprises the following steps:
acquiring target voice to be translated;
and translating a first translation object and a second translation object to obtain a final translation text of the target voice, wherein the first translation object is a recognition text of the target voice, and the second translation object is the target voice.
Optionally, the translating the first translation object and the second translation object to obtain a final translation text of the target speech includes:
generating a first probability distribution and a second probability distribution corresponding to a kth word in the final translation text;
the first probability distribution comprises a first decoding probability when a kth word obtained by decoding the recognition text of the target voice is each to-be-selected word in the word list, and the second probability distribution comprises a second decoding probability when the kth word obtained by decoding the target voice is each to-be-selected word in the word list;
and obtaining a translation result of the kth word according to the first probability distribution and the second probability distribution.
Optionally, the obtaining a translation result of the k-th word according to the first probability distribution and the second probability distribution includes:
in the first probability distribution and the second probability distribution, fusing a first decoding probability and a second decoding probability corresponding to the same word to be selected to obtain a fused decoding probability corresponding to the kth word in the final translation text as each word to be selected;
and selecting the word to be selected corresponding to the maximum fusion decoding probability as the translation result of the kth word.
Optionally, the translating the first translation object and the second translation object to obtain a final translation text of the target speech includes:
translating the recognition text of the target voice to obtain a first translation text;
directly translating the target voice to obtain a second translation text;
and obtaining a final translation text of the target voice according to the first translation text and the second translation text.
Optionally, the obtaining a final translation text of the target speech according to the first translation text and the second translation text includes:
determining a confidence level when the first translated text is used as a final translated text of the target voice;
determining a confidence level when the second translated text is used as a final translated text of the target voice;
and selecting the translation text corresponding to the higher confidence coefficient as the final translation text of the target voice.
Optionally, the determining the confidence level of the first translated text as the final translated text of the target speech includes:
acquiring decoding probabilities corresponding to the text units of the first translation text, wherein the decoding probabilities represent the possibility of the corresponding text units belonging to the translation result;
and determining the confidence coefficient when the first translation text is used as the final translation text of the target voice according to the decoding probability corresponding to each text unit of the first translation text.
Optionally, the determining the confidence level of the second translated text as the final translated text of the target speech includes:
acquiring decoding probabilities corresponding to the text units of the second translation text, wherein the decoding probabilities represent the possibility of the corresponding text units belonging to the translation result;
and determining the confidence coefficient when the second translation text is used as the final translation text of the target voice according to the decoding probability corresponding to each text unit of the second translation text.
Optionally, the translating the first translation object and the second translation object includes:
recognizing the target voice by utilizing a pre-constructed voice recognition model to obtain a recognition text;
translating the recognition text by utilizing a pre-constructed text translation model;
translating the target voice by utilizing a pre-constructed voice translation model;
wherein the speech translation model shares or does not share part of the model parameters with the speech recognition model.
An embodiment of the present application further provides a speech translation apparatus, including:
the target voice acquiring unit is used for acquiring target voice to be translated;
and the translation text obtaining unit is used for translating a first translation object and a second translation object to obtain a final translation text of the target voice, wherein the first translation object is the recognition text of the target voice, and the second translation object is the target voice.
Optionally, the translated text obtaining unit includes:
a probability distribution generating subunit, configured to generate a first probability distribution and a second probability distribution corresponding to a kth word in the final translated text;
the first probability distribution comprises a first decoding probability when a kth word obtained by decoding the recognition text of the target voice is each to-be-selected word in the word list, and the second probability distribution comprises a second decoding probability when the kth word obtained by decoding the target voice is each to-be-selected word in the word list;
and the translation result obtaining subunit is used for obtaining a translation result of the kth word according to the first probability distribution and the second probability distribution.
Optionally, the translation result obtaining subunit includes:
a fused decoding probability obtaining subunit, configured to fuse, in the first probability distribution and the second probability distribution, a first decoding probability and a second decoding probability that correspond to the same to-be-selected word, so as to obtain a fused decoding probability that corresponds to a kth word in the final translation text when the kth word is each to-be-selected word;
and the first translation result obtaining subunit is used for selecting the word to be selected corresponding to the maximum fusion decoding probability as the translation result of the kth word.
Optionally, the translated text obtaining unit includes:
the first translation text obtaining subunit is used for translating the recognition text of the target voice to obtain a first translation text;
the second translation text obtaining subunit is used for directly translating the target voice to obtain a second translation text;
and the final translation text obtaining subunit is used for obtaining a final translation text of the target voice according to the first translation text and the second translation text.
Optionally, the final translation text obtaining subunit includes:
a first confidence determining subunit, configured to determine a confidence at which the first translated text is a final translated text of the target speech;
a second confidence determining subunit, configured to determine a confidence at which the second translated text is a final translated text of the target speech;
and the second translation result obtaining subunit is used for selecting the translation text corresponding to the higher confidence coefficient as the final translation text of the target voice.
Optionally, the first confidence determining subunit includes:
a first decoding probability obtaining subunit, configured to obtain a decoding probability corresponding to each text unit of the first translation text, where the decoding probability represents a possibility that the corresponding text unit belongs to a translation result;
and the first confidence obtaining subunit is configured to determine, according to the decoding probability corresponding to each text unit of the first translated text, a confidence when the first translated text is used as a final translated text of the target speech.
Optionally, the second confidence determining subunit includes:
a second decoding probability obtaining subunit, configured to obtain a decoding probability corresponding to each text unit of the second translation text, where the decoding probability represents a possibility that the corresponding text unit belongs to a translation result;
and the second confidence obtaining subunit is configured to determine, according to the decoding probability corresponding to each text unit of the second translated text, a confidence at which the second translated text is used as the final translated text of the target speech.
Optionally, the translated text obtaining unit includes:
the text recognition subunit is used for recognizing the target voice by utilizing a pre-constructed voice recognition model to obtain a recognition text;
the text translation subunit is used for translating the recognition text by utilizing a pre-constructed text translation model;
the voice translation subunit is used for translating the target voice by utilizing a pre-constructed voice translation model;
wherein the speech translation model shares or does not share part of the model parameters with the speech recognition model.
An embodiment of the present application further provides a speech translation apparatus, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which, when executed by the processor, cause the processor to execute any implementation of the speech translation method.
An embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the terminal device is enabled to execute any implementation manner of the foregoing speech translation method.
The embodiment of the present application further provides a computer program product, which, when running on a terminal device, enables the terminal device to execute any implementation manner of the above speech translation method.
In the speech translation method provided by the embodiments of the present application, after the target voice to be translated is acquired, the recognition text of the target voice and the target voice itself are jointly translated to obtain the final translation text of the target voice. Compared with the prior art, in which the target voice is first recognized to obtain a recognized text and then only that recognized text is translated as the translation object, the translation objects in this method are richer, namely they include both the recognized text and the target voice, so that a more accurate translation text of the target voice can be determined by translating these two translation objects.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a schematic flowchart of a speech translation method according to an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a speech translation model and a speech recognition model provided in an embodiment of the present application;
fig. 3 is a second schematic structural diagram of a speech translation model and a speech recognition model provided in the embodiment of the present application;
fig. 4 is a schematic structural diagram of a speech translation model, a speech recognition model, and a text translation model provided in an embodiment of the present application;
fig. 5 is a schematic flowchart of obtaining a translation result of a kth word according to a first probability distribution and a second probability distribution according to an embodiment of the present application;
fig. 6 is a schematic flowchart of obtaining a final translated text of a target speech according to a first translated text and a second translated text according to an embodiment of the present application;
fig. 7 is a schematic composition diagram of a speech translation apparatus according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First embodiment
Referring to fig. 1, a schematic flow chart of a speech translation method provided in this embodiment is shown, where the method includes the following steps:
s101: and acquiring target voice to be translated.
In this embodiment, any voice subjected to voice translation by using this embodiment is defined as a target voice. In addition, this embodiment does not limit the language of the target speech; for example, the target speech may be Chinese speech, English speech, or the like. Likewise, this embodiment does not limit the length of the target speech; for example, the target speech may be one sentence or multiple sentences.
It can be understood that the target voice can be obtained by recording and the like according to actual needs, for example, phone call voice or conference recording and the like in daily life of people can be used as the target voice, and after the target voice is obtained, translation of the target voice can be realized by using the embodiment.
S102: and translating the first translation object and the second translation object to obtain a final translation text of the target voice, wherein the first translation object is a recognition text of the target voice, and the second translation object is the target voice.
In this embodiment, in order to determine a more accurate translation text of the target speech, the recognition text of the target speech and the target speech themselves may be translated as translation objects, and since the translation objects are abundant, a more accurate translation text may be obtained to serve as a final translation text of the target speech.
Specifically, after the target speech to be translated is acquired in step S101, speech recognition may be performed on the target speech to obtain a corresponding recognized text, and the recognized text may be used as a first translation object, and then text translation may be performed on the first translation object (i.e., the recognized text) to obtain intermediate data in the translation process or a translated text of the first translation object. Similarly, after the target voice to be translated is acquired in step S101, the target voice itself may be used as a second translation object, and then the second translation object (i.e., the target voice) is directly translated (without performing voice recognition), so as to obtain intermediate data in the translation process or a translated text of the second translation object.
In an implementation manner of this embodiment, this step S102 may include: and after the first translation object and the second translation object are translated to obtain intermediate data in the translation process corresponding to each translation object, obtaining a final translation text of the target voice based on the intermediate data.
In this implementation, the intermediate data may be probability distribution data. Specifically, by translating the first translation object (i.e., the recognized text of the target speech) and the second translation object (i.e., the target speech) in step S102, a first probability distribution and a second probability distribution corresponding to the kth word in the final translated text may be generated, where the first probability distribution includes the first decoding probability that the kth word obtained by decoding the recognized text of the target speech is each candidate word in the vocabulary, and the second probability distribution includes the second decoding probability that the kth word obtained by decoding the target speech is each candidate word in the vocabulary. It should be noted that a specific description of how the first probability distribution and the second probability distribution corresponding to the kth word in the final translated text are obtained by translating the first translation object and the second translation object can be found in the second embodiment, and a specific implementation of obtaining the translation result of the kth word in the final translated text on the basis of these two probability distributions is also described in the second embodiment.
In another implementation manner of this embodiment, this step S102 may include: and after the first translation object and the second translation object are translated to obtain the translation texts corresponding to the first translation object and the second translation object respectively, obtaining a final translation text of the target voice based on the two translation texts.
In this implementation, the first translation object (i.e., the recognition text of the target speech) is translated in step S102 to obtain the first translation text, and the second translation object (i.e., the target speech) is directly translated to obtain the second translation text. It should be noted that, in this implementation, specific descriptions of the first translated text and the second translated text obtained by translating the first translated object and the second translated object may be referred to in the third embodiment, and on the basis of the first translated text and the second translated text, a specific implementation manner of obtaining a final translated text of the target speech is also described in the third embodiment.
Further, the step S102 may be implemented by using three models, which may specifically include the following steps A1-A3:
step A1: and recognizing the target voice by utilizing a pre-constructed voice recognition model to obtain a recognition text.
In this implementation manner, after the target speech to be translated is acquired in step S101, the acquired target speech may be recognized by using a pre-constructed speech recognition model as shown in the right side of fig. 2, so as to obtain a recognition text. The speech recognition model includes an encoder, an Attention layer (Attention) and a recognition decoder, and can perform speech recognition on a target speech, for example, a target speech of chinese is recognized as a chinese recognition text in the same language.
Specifically, in this embodiment, an alternative implementation manner is that the pre-constructed speech recognition model may adopt a network structure as shown in the right side of fig. 3, and then, taking this speech recognition model as an example, an implementation process for recognizing the target speech by using it will be described:
(1) inputting audio features of a target speech
First, audio feature extraction is performed on the target speech to be translated; for example, the Mel spectrum features (Mel Bank Features) of the target speech can be extracted as the audio features of the target speech. The audio features can be represented in the form of a feature vector, defined as x_{1...T}, where T represents the dimension of the audio feature vector of the target speech, i.e., the number of vector elements contained in the audio feature vector. Then x_{1...T} may be fed as input data into the speech recognition model shown on the right side of fig. 3.
(2) Generating coding vector corresponding to audio characteristic of target voice
As shown in fig. 3, the encoding part of the speech recognition model includes two layers of Convolutional Neural Networks (CNNs) with max pooling layers (MaxPooling), one layer of Convolutional Long Short-Term Memory network (ConvLSTM), and three layers of Bi-directional Long Short-Term Memory network (BiLSTM).
After the audio feature x_{1...T} of the target speech is input through step (1), it is first encoded by one layer of CNN and then down-sampled by MaxPooling; this operation is repeated with another layer of CNN and MaxPooling to obtain an encoding vector of length L. The encoding vector is then processed by one layer of ConvLSTM and three layers of BiLSTM to obtain the final encoding vector, defined as h_{1...L}, where L represents the dimension of the encoding vector obtained by encoding the audio features of the target speech, i.e., the number of vector elements contained in the encoding vector. The specific calculation formula of h_{1...L} is as follows:
h_{1...L} = enc(W_enc · x_{1...T})   (1)
where enc represents the whole encoding calculation process of the encoding part of the model, and W_enc represents all the network parameters of each layer of the network in the encoding part of the model.
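To make the encoding pipeline of step (2) concrete, the following is a minimal sketch of such an encoder, assuming PyTorch, 80-dimensional Mel filterbank features, illustrative layer sizes, and an ordinary LSTM standing in for the ConvLSTM layer; none of these specifics are prescribed by the patent.

```python
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Minimal sketch: 2 x (CNN + MaxPooling), then recurrent layers,
    producing the encoding vectors h_1..L. An ordinary LSTM stands in
    for the ConvLSTM layer described in the text."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # halves time and frequency
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                       # halves them again, so L = T/4
        )
        feat_dim = 32 * (n_mels // 4)
        self.conv_lstm_substitute = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.bilstm = nn.LSTM(hidden, hidden, num_layers=3,
                              batch_first=True, bidirectional=True)

    def forward(self, x):                          # x: (batch, T, n_mels)
        x = x.unsqueeze(1)                         # (batch, 1, T, n_mels)
        x = self.conv(x)                           # (batch, 32, T/4, n_mels/4)
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.conv_lstm_substitute(x)
        h, _ = self.bilstm(x)                      # h: (batch, L, 2*hidden) = h_1..L
        return h

# toy usage: one utterance of 200 frames of 80-dim Mel features
h = SpeechEncoder()(torch.randn(1, 200, 80))
print(h.shape)                                     # torch.Size([1, 50, 512])
```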
(3) Generating a decoded vector corresponding to the encoded vector
As shown in fig. 3, the decoding part of the speech recognition model comprises a 4-layer unidirectional Long Short-Term Memory network (LSTM) and a softmax classifier.
After the audio features of the target speech are encoded by the encoding part of the model in step (2) to obtain the encoding vector, an attention operation can be performed on the encoding vector so as to focus on the data in it that is relevant for generating the decoding vector; the result is then decoded by the 4-layer LSTM and the softmax classifier to obtain the corresponding decoding vector, which is used to generate the recognition text of the target speech, defined as z_{1...N}, where N represents the number of words contained in the recognition text.
The specific calculation formulas of the decoding part are as follows:
c_k = att(s_k, h_{1...L})   (2)
s_k = lstm(z_{k-1}, s_{k-1}, c_{k-1})   (3)
z_k = softmax(W_z[s_k, c_k] + b_z)   (4)
where h_{1...L} represents the encoding vector corresponding to the audio features of the target speech; c_k represents the k-th attention calculation result; att represents the attention calculation process; c_{k-1} represents the (k-1)-th attention calculation result; s_k represents the k-th hidden-layer vector output by the 4-layer LSTM network contained in the decoding part; lstm represents the calculation process of the 4-layer LSTM network contained in the decoding part; s_{k-1} represents the (k-1)-th hidden-layer vector output by the 4-layer LSTM network contained in the decoding part; z_k represents the k-th word (or character) contained in the recognition text; z_{k-1} represents the (k-1)-th word (or character) contained in the recognition text; and W_z and b_z represent the model parameters of the softmax classifier.
If W_asr is used to represent all the network parameters of each layer of the network in the decoding part of the model, the recognition text z_{1...N} of the target speech output by the model is calculated as follows:
z_{1...N} = dec(W_asr · h_{1...L})   (5)
where dec represents the whole decoding calculation process of the decoding part of the model; W_asr represents all the network parameters of each layer of the network in the decoding part of the model; and h_{1...L} represents the encoding vector corresponding to the audio features of the target speech.
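A numerical sketch of one decoding step corresponding to formulas (2)-(4) is given below, assuming NumPy, dot-product attention as one possible choice for att, and randomly initialised parameters purely for illustration; the recurrent state update of formula (3) is replaced here by a placeholder vector, whereas the actual model uses a trained 4-layer LSTM.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, vocab = 50, 256, 10000            # encoder length, hidden size, vocabulary size

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def att(s_k, h):                         # formula (2), with dot-product attention
    weights = softmax(h @ s_k)           # relevance of each encoder position to s_k
    return weights @ h                   # context vector c_k

h_enc = rng.standard_normal((L, d))      # h_1..L from the encoder
W_z = rng.standard_normal((vocab, 2 * d)) * 0.01
b_z = np.zeros(vocab)

# one decoding step k; s_k stands in for lstm(z_{k-1}, s_{k-1}, c_{k-1}) of formula (3)
s_k = rng.standard_normal(d)
c_k = att(s_k, h_enc)                                     # formula (2)
p_k = softmax(W_z @ np.concatenate([s_k, c_k]) + b_z)     # formula (4)
z_k = int(p_k.argmax())                 # index of the k-th output word in the vocabulary
print(z_k, p_k[z_k])
```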
It should be noted that the network structure of the encoder and the decoder in the speech recognition model shown on the right side of fig. 2 is not exclusive, and the network structure shown on the right side of fig. 3 is only one example, and other network structures or network layer numbers may be adopted. For example, the encoder of the model may also use a Recurrent Neural Network (RNN) or the like for encoding, and the number of layers of the Network may also be set according to an actual situation, which is not limited in this embodiment of the present application. The number of layers of CNN, BiLSTM, and the like described above or in the following is merely an example, and the present application is not limited to the number of layers, and may be the number of layers mentioned in the embodiments of the present application, or may be another number of layers.
Step A2: and translating the recognition text of the target voice by using a pre-constructed text translation model.
In this implementation manner, after the recognized text of the target speech is obtained in step a1, the recognized text may be translated by using a pre-constructed text translation model as shown in the upper part of fig. 4, so as to obtain intermediate data in the translation process or a translated text corresponding to the recognized text.
The text translation model includes a text encoder, an Attention layer (Attention) and a text decoder, and the text encoder is connected to a recognition decoder in the speech recognition model, as shown in fig. 4.
Next, a description will be given of a process of translating the recognized text by using the text translation model:
(1) inputting a recognized text of a target speech
As shown in fig. 4, the recognition text z_{1...N} of the target speech obtained in step A1 (which may be in vector form) can first be used as input data for the text encoder of the text translation model.
(2) Generating code vector corresponding to recognition text of target voice
In this embodiment, the text encoder of the text translation model may be composed of a BiLSTM. After the recognition text z_{1...N} of the target speech is input through the above step (1), it can be encoded by the BiLSTM to obtain the corresponding encoding vector, defined as s_{1...N}, with the following specific calculation formula:
s_{1...N} = enc(U_enc · z_{1...N})   (6)
where enc represents the whole encoding calculation process of the encoding part of the text translation model, and U_enc represents all the network parameters of the encoding part of the text translation model.
(3) Decoding to obtain intermediate data or the first translation text corresponding to the recognition text
In this embodiment, the text decoder of the text translation model may contain a unidirectional LSTM and a softmax classifier. After the encoding vector s_{1...N} is generated through the above step (2), an attention operation may be performed on the encoding vector so as to focus on the data in it that is relevant for generating the decoding result; the result is then decoded by the unidirectional LSTM and the softmax classifier to obtain intermediate data in the translation process or the first translation text corresponding to the recognition text.
It should be noted that the network configuration of the encoder and the decoder in the text translation model is not unique, and the model network structure described in the implementation process is only one example, and other network structures or network layer numbers may also be adopted. For example, the encoder of the model may perform encoding using RNN or the like, and the number of layers of the network may be set according to actual circumstances, which is not limited in the embodiment of the present application.
Step A3: and translating the target voice by utilizing a pre-constructed voice translation model.
In this embodiment, the speech translation model is used to translate the target speech directly, generating intermediate data in the translation process or a translation text corresponding to the target speech. The speech translation model may or may not share part of the model parameters with the speech recognition model described in step A1 above.
When the speech translation model shares part of the model parameters with the speech recognition model described in step a1, an alternative implementation is that the network structure of the speech translation model may be as shown in the left diagram of fig. 2, which shares an encoder with the speech recognition model, and the speech translation model includes a translation decoder. It should be noted that, in fig. 2, the network structures of the recognition decoder of the speech recognition model and the translation decoder of the speech translation model may be the same or different, and the respective specific constituent structures may be set according to actual situations, which is not limited in the embodiment of the present application.
In this embodiment, an optional implementation manner is that the speech translation model and the speech recognition model may adopt a network structure as shown in fig. 3, and based on the network structure of the model, a specific process of directly translating the target speech is as follows:
(1) inputting audio features of a target speech
Firstly, audio feature extraction is performed on the target speech; for example, the Mel spectrum features of the target speech can be extracted as its audio features, with the feature vector defined as x_{1...T}. Then x_{1...T} is fed as input data into the encoder shown in fig. 3.
(2) Generating coding vector corresponding to audio characteristic of target voice
After the audio feature x_{1...T} of the target speech input in the above step (1) is encoded by the encoder shown in fig. 3, the final encoding vector h_{1...L} can be obtained, where L represents the dimension of the encoding vector obtained by encoding the audio features of the target speech, i.e., the number of vector elements contained in the encoding vector. The specific calculation formula of h_{1...L} is the above formula (1):
h_{1...L} = enc(W_enc · x_{1...T})
where enc represents the whole encoding calculation process of the encoder in fig. 3, and W_enc represents all the network parameters of each layer of the network in the encoder of fig. 3.
(3) Decoding to obtain intermediate data or second translation text corresponding to the target voice
As shown in fig. 3, it is assumed that the network structure of the recognition decoder of the speech recognition model and the network structure of the translation decoder of the speech translation model are the same, both comprising a 4-layer LSTM and a softmax classifier, but the trained parameters of their LSTMs and softmax classifiers are not shared.
After the audio features of the target speech are encoded by the encoding part of the model in step (2) to obtain the encoding vector h_{1...L}, an attention operation may be performed on the encoding vector as shown in fig. 3, and the attention result is then decoded by the 4-layer LSTM and softmax classifier in the translation decoder, so as to obtain intermediate data in the translation process or the second translation text corresponding to the target speech.
It should be noted that the way in which the speech translation model and the speech recognition model share the encoder parameters shown in fig. 2 is not exclusive, but is merely an example, and other parameter sharing modes may be adopted.
In addition, the speech translation model and the speech recognition model may not share model parameters, and in this case, the speech translation model and the speech recognition model are two separate models, in this case, the network structures of the speech translation model and the speech recognition model may be the same or different, and the specific composition structures of the two models may be set according to the actual situation, which is not limited in the embodiment of the present application.
It should be noted that this embodiment does not limit the execution order of A1-A2 (A1 is executed first and then A2) and A3: A1-A2 may be executed first and then A3, or A3 may be executed first and then A1-A2, or A1-A2 and A3 may be executed simultaneously.
Further, since the "integration module" shown in fig. 4 is connected to the translation decoder of the speech recognition model and the text decoder of the text translation model, respectively, the "integration module" shown in fig. 4 may be used to determine the final translated text of the target speech according to the intermediate data in the translation process obtained by translating the first translation object and the second translation object or the respective translated text.
Specifically, in an implementation manner of this embodiment, a first probability distribution corresponding to a kth word in a final translation text output by a text decoder and a second probability distribution corresponding to the kth word in the final translation text output by a translation decoder may be respectively input to an "integration module", and the first probability distribution and the second probability distribution are fused by the "integration module" to determine a translation result of the kth word in the final translation text according to the fused probability distributions, specifically please refer to the second embodiment.
In another implementation manner of this embodiment, the first translated text output by the text decoder and the second translated text output by the translation decoder may be respectively input to the "integration module", and the two translated texts are compared by the "integration module" to determine a more accurate translated text of the target speech according to the comparison result, specifically please refer to the third embodiment.
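The cooperation between the three models and the "integration module" can be summarised with the following sketch (Python; the model calls are stubbed placeholders, and all function names and probability values are illustrative rather than taken from the patent). It shows the two integration strategies elaborated in the second and third embodiments.

```python
def recognize(speech):                  # speech recognition model (step A1), stubbed
    return "recognition text of the target speech"

def translate_text(text):               # text translation model (step A2), stubbed:
    return {"system": 0.87, "table": 0.05}   # P_text for the current word

def translate_speech(speech):           # speech translation model (step A3), stubbed:
    return {"system": 0.76, "box": 0.10}     # P_trans for the current word

def integrate_by_fusion(p_text, p_trans, alpha=0.6):
    """Second embodiment: fuse the two candidate-word distributions (formula (9))."""
    words = set(p_text) | set(p_trans)
    fused = {w: alpha * p_trans.get(w, 0.0) + (1 - alpha) * p_text.get(w, 0.0)
             for w in words}
    return max(fused, key=fused.get)

def integrate_by_confidence(first_text, score_text, second_text, score_trans):
    """Third embodiment: keep the whole translated text with the higher confidence."""
    return first_text if score_text >= score_trans else second_text

speech = object()                       # placeholder for the target speech audio
word = integrate_by_fusion(translate_text(recognize(speech)), translate_speech(speech))
print(word)                             # -> "system"
print(integrate_by_confidence("first translated text", 0.85,
                              "second translated text", 0.79))   # illustrative scores
```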
In summary, according to the voice translation method provided in this embodiment, after the target voice to be translated is obtained, the recognition text of the target voice and the target voice are jointly used as the translation object to be translated to obtain the final translation text of the target voice, compared with a method in the prior art that the target voice is recognized to obtain the recognition text, and then the recognition text is used as the translation object to be translated, the method for translating the target voice in the present application has more abundant translation objects, that is, includes two translation objects of the recognition text and the target voice, and therefore, by means of translating the two translation objects, the more accurate translation text of the target voice can be determined.
Second embodiment
In this embodiment, by translating the first translation object (i.e., the recognition text of the target speech) in step S102 of the first embodiment, a first probability distribution corresponding to the kth word in the final translated text of the target speech can be generated; the first probability distribution can be defined as P_text(y_k), where y_k refers to the kth word in the final translated text of the target speech.
The first probability distribution P_text(y_k) may include the first decoding probability that the kth word y_k obtained by decoding the recognition text of the target speech is each candidate word in the vocabulary. The larger the value of the first decoding probability, the greater the probability that the kth word y_k obtained by decoding the recognition text of the target speech is the corresponding candidate word.
With reference to the network structure shown in fig. 4, the first probability distribution P_text(y_k) corresponding to the kth word y_k obtained by decoding the recognition text of the target speech, as output by the text translation model, is calculated as follows:
P_text(y_k) = softmax(dec(U_dec · s_{1...N}))   (7)
where dec represents the whole decoding calculation process of the decoding part of the text translation model; U_dec represents all the network parameters of the decoding part of the text translation model; s_{1...N} represents the encoding vector corresponding to the recognition text of the target speech; and P_text(y_k) represents the first decoding probability that the kth word y_k obtained by decoding the recognition text of the target speech is each candidate word in the vocabulary.
Similarly, in this embodiment, by translating the second translation object (i.e., the target speech) in step S102 of the first embodiment, a second probability distribution corresponding to the kth word in the final translated text may be generated; the second probability distribution can be defined as P_trans(y_k), where y_k refers to the kth word in the final translated text of the target speech.
The second probability distribution P_trans(y_k) may include the second decoding probability that the kth word y_k obtained by decoding the target speech is each candidate word in the vocabulary. The larger the value of the second decoding probability, the greater the probability that the kth word y_k obtained by decoding the target speech is the corresponding candidate word.
With reference to the network structure shown in fig. 4, the second probability distribution P_trans(y_k) corresponding to the kth word y_k obtained by decoding the target speech, as output by the speech translation model, is calculated as follows:
P_trans(y_k) = softmax(dec(W_dec · h_{1...L}))   (8)
where dec represents the whole decoding calculation process of the decoding part of the speech translation model; W_dec represents all the network parameters of the decoding part of the speech translation model; h_{1...L} represents the encoding vector corresponding to the audio features of the target speech; and P_trans(y_k) represents the second decoding probability that the kth word y_k obtained by decoding the target speech is each candidate word in the vocabulary.
Based on this, the translation result of the kth word can further be obtained from the first probability distribution P_text(y_k) and the second probability distribution P_trans(y_k) corresponding to the kth word in the generated final translated text.
Next, this embodiment introduces a specific implementation process of obtaining the translation result of the kth word from the first probability distribution P_text(y_k) and the second probability distribution P_trans(y_k) corresponding to the kth word in the generated final translated text.
Referring to fig. 5, a schematic diagram of a flow of obtaining a translation result of a k-th word according to the first probability distribution and the second probability distribution provided in this embodiment is shown, where the flow includes the following steps:
s501: and in the first probability distribution and the second probability distribution, fusing the first decoding probability and the second decoding probability corresponding to the same word to be selected to obtain the fused decoding probability corresponding to the k-th word in the final translation text as each word to be selected.
In this embodiment, after the first translation object (i.e., the recognition text of the target speech) is translated, the first probability distribution P_text(y_k) corresponding to the kth word y_k in the final translated text is generated; that is, P_text(y_k) includes the first decoding probability that the kth word y_k obtained by decoding the recognition text of the target speech is each candidate word in the vocabulary. Likewise, after the second translation object (i.e., the target speech) is translated, the second probability distribution P_trans(y_k) corresponding to the kth word y_k in the final translated text is generated; that is, P_trans(y_k) includes the second decoding probability that the kth word y_k obtained by decoding the target speech is each candidate word in the vocabulary.
Further, the "integration module" shown in fig. 4 may be used to perform "decoding probability fusion" on the first probability distribution P_text(y_k) and the second probability distribution P_trans(y_k), i.e., to fuse the decoding probabilities in P_text(y_k) and P_trans(y_k) that correspond to the same candidate word, so as to obtain the fused decoding probability that the kth word in the final translated text is each candidate word. These fused decoding probabilities form a fused probability distribution, defined as P_ensemble(y_k).
For example: suppose the candidate words are 10000 words contained in an English vocabulary, and these 10000 words contain the word "system". If the first probability distribution P_text(y_k) includes a first decoding probability of 0.87 that the kth word y_k obtained by decoding the recognition text of the target speech is the word "system", and the second probability distribution P_trans(y_k) includes a second decoding probability of 0.76 that the kth word y_k obtained by decoding the target speech is the word "system", then the two decoding probability values 0.87 and 0.76 corresponding to the word "system" can be fused to obtain a fused decoding probability corresponding to the word "system", which represents the probability that the kth word y_k in the final translated text is the word "system".
In this embodiment, the specific calculation formula of the fused probability distribution corresponding to the kth word y_k in the final translated text is as follows:
P_ensemble(y_k) = α · P_trans(y_k) + (1 - α) · P_text(y_k)   (9)
where α represents the fusion weight of the decoding probabilities, which can be obtained through experiments or experience; P_ensemble(y_k) represents the fused decoding probability that the kth word y_k in the final translated text is each candidate word in the vocabulary; P_text(y_k) represents the first decoding probability (i.e., the first probability distribution) that the kth word y_k obtained by decoding the recognition text of the target speech is each candidate word in the vocabulary; and P_trans(y_k) represents the second decoding probability (i.e., the second probability distribution) that the kth word y_k obtained by decoding the target speech is each candidate word in the vocabulary.
For example: based on the above example, suppose the first decoding probability that the kth word y_k obtained by decoding the recognition text of the target speech is the word "system" is 0.87, and the second decoding probability that the kth word y_k obtained by decoding the target speech is the word "system" is 0.76; if the value of α determined by experiment is 0.6, then by the above formula (9) the fused decoding probability that the kth word y_k in the final translated text is the word "system" is 0.6 × 0.76 + (1 - 0.6) × 0.87 = 0.804. In the same way, the fused decoding probabilities of the kth word y_k for the other candidate words can be calculated, forming the fused probability distribution P_ensemble(y_k) corresponding to the kth word y_k.
S502: and selecting the word to be selected corresponding to the maximum fusion decoding probability as the translation text of the kth word.
In this embodiment, the k-th word y in the final translation text is obtained through step S501kAfter the fused decoding probabilities respectively corresponding to the words to be selected are obtained, the word to be selected corresponding to the maximum fused decoding probability can be selected from the fused decoding probabilities to be used as the translation result of the kth word.
For example, the following steps are carried out: assuming that each word to be selected is 10000 words such as "system", "table", "box" … … included in an english word list, for the kth word in the final translation text, each word in the 10000 words corresponds to one fused decoding probability, and then the word to be selected corresponding to the largest fused decoding probability may be selected, for example, the word "system" corresponding to the largest fused decoding probability of 0.89 may be used as the translation result of the kth word.
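The fusion of S501 and the selection of S502 can be reproduced numerically with the figures from the examples above; the following NumPy sketch uses a toy three-word vocabulary, α = 0.6 as in the example, and made-up values for the remaining probabilities.

```python
import numpy as np

vocab = ["system", "table", "box"]                 # toy stand-in for the 10000-word list
p_text  = np.array([0.87, 0.08, 0.05])             # first probability distribution  P_text(y_k)
p_trans = np.array([0.76, 0.04, 0.20])             # second probability distribution P_trans(y_k)
alpha = 0.6                                        # fusion weight from formula (9)

p_ensemble = alpha * p_trans + (1 - alpha) * p_text
for w, p in zip(vocab, p_ensemble):
    print(w, round(float(p), 3))                   # system 0.804, table 0.056, box 0.14
print(vocab[int(p_ensemble.argmax())])             # system -> translation result of the k-th word
```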
To sum up, in this embodiment, by means of decoding probability fusion, the first probability distribution P_text(y_k) corresponding to the kth word y_k obtained by decoding the recognition text of the target speech and the second probability distribution P_trans(y_k) corresponding to the kth word y_k obtained by decoding the target speech are fused, so that a more accurate fused decoding probability distribution corresponding to the kth word y_k in the final translated text is obtained; the candidate word corresponding to the maximum fused decoding probability can then be selected from this fused distribution as the translation result of the kth word. In this way, the translation result of each word in the final translated text can be obtained in turn.
Third embodiment
In this embodiment, the first translation object (i.e., the recognition text of the target speech) is translated in step S102 in the above-described first embodiment, so that a first translation text can be obtained.
The first translated text is a text in the target translation language, and can be defined as y^text_{1...K1}, where K1 indicates the number of individual characters (or words) contained in the first translated text. For example, if the target speech is Chinese speech and the target translation language is English, i.e., the target speech needs to be translated into an English text, then the first translated text is an English text y^text_{1...K1}, where K1 indicates the number of words contained in that English text.
An alternative implementation is that, in conjunction with the network architecture shown in fig. 4, the decoding vector obtained by the text decoder can be used to generate the first translated text y^text_{1...K1} of the target speech, where K1 indicates the number of words contained in the first translated text.
If U_dec is used to represent all the network parameters of the text decoder of the text translation model in fig. 4, the first translated text output by the model is calculated as follows:
y^text_{1...K1} = dec(U_dec · s_{1...N})
where dec represents the whole decoding calculation process of the text decoder of the text translation model in fig. 4; U_dec represents all the network parameters of the text decoder of the text translation model in fig. 4; and s_{1...N} represents the encoding vector corresponding to the recognition text of the target speech.
Further, in this embodiment, the second translation text can be obtained by directly translating the second translation object (i.e., the target speech) in step S102 in the above-described first embodiment.
The second translated text is a text in the target translation language, and can be defined as y^trans_{1...K2}, where K2 represents the number of individual words (or characters) contained in the second translated text. For example, if the target speech is Chinese speech and the target translation language is still English, i.e., the target speech still needs to be translated into an English text, then the second translated text is an English text y^trans_{1...K2}, where K2 indicates the number of words contained in that English text.
An alternative implementation is that, in conjunction with the network architecture shown in fig. 4, the decoding vector obtained by the translation decoder can be used to generate the second translated text y^trans_{1...K2} of the target speech, where K2 represents the number of individual words (or characters) contained in the second translated text; for the specific calculation formulas of the decoding part, reference may be made to the above formulas (2), (3) and (4), which are not repeated here.
If W_dec is used to represent all the network parameters of the translation decoder of the speech translation model in fig. 4, the second translated text output by the model is calculated as follows:
y^trans_{1...K2} = dec(W_dec · h_{1...L})
where dec represents the whole decoding calculation process of the translation decoder of the speech translation model in fig. 4; W_dec represents all the network parameters of the translation decoder of the speech translation model in fig. 4; and h_{1...L} represents the encoding vector corresponding to the audio features of the target speech.
In addition, it should be noted that the number of words (or characters) contained in the first translated text and in the second translated text in this embodiment may be the same or different, i.e., K1 = K2 or K1 ≠ K2; however, the first translated text and the second translated text belong to the same language, for example, both are Chinese text or both are English text.
Based on this, the final translated text of the target speech can further be obtained from the generated first translated text y^text_{1...K1} and second translated text y^trans_{1...K2}.
Next, this embodiment introduces a specific implementation process of obtaining the final translated text of the target speech from the generated first translated text and second translated text.
Referring to fig. 6, a schematic flow chart of obtaining a final translated text of a target speech according to a first translated text and a second translated text provided in this embodiment is shown, where the flow chart includes the following steps:
s601: a confidence level is determined when the first translated text is the final translated text of the target speech.
In this embodiment, if the first translation object (i.e., the recognition text of the target speech) is translated to obtain the first translated text, the data related to the first translated text may be further processed to determine the confidence level when the first translated text is used as the final translated text of the target speech, and this confidence level may be defined as score_text.
In an implementation manner of this embodiment, S601 may specifically include the following steps B1-B2:
step B1: and acquiring the decoding probability corresponding to each text unit of the first translation text.
In this implementation, in order to determine the confidence score_text when the first translated text is used as the final translated text of the target speech, each text unit contained in the first translated text is determined first. A text unit refers to a basic constituent unit of the first translated text, and differs with the language to which the first translated text belongs; for example, if the first translated text is a Chinese text, the text units it contains may be characters and words, and if the first translated text is an English text, the text units it contains may be words, and so on.
Then, the decoding probability corresponding to each text unit contained in the first translated text can be obtained, where the decoding probability is the probability that the corresponding text unit belongs to the translation result; specifically, it may be one of the first decoding probabilities in the first probability distribution corresponding to the k-th word obtained by decoding the recognition text of the target speech in the second embodiment. It can be understood that the greater the decoding probability, the more likely its corresponding text unit is the translation result of the k-th word; conversely, the smaller the decoding probability, the less likely it is.
Step B2: determine the confidence of the first translated text as the final translated text of the target speech according to the decoding probability corresponding to each text unit of the first translated text.
After the decoding probabilities corresponding to the text units of the first translated text are obtained through step B1, these decoding probabilities may be further processed so as to determine, according to the processing result, the confidence score_text of the first translated text as the final translated text of the target speech.
Specifically, in an optional implementation, the decoding probabilities corresponding to the text units of the first translated text are summed, and the sum is divided by K1, the total number of text units contained in the first translated text, to obtain the average decoding probability per text unit; this average represents the confidence score_text of the first translated text as the final translated text of the target speech.
For example: assume the first translated text is an English text containing 6 words, and the decoding probabilities corresponding to the 1st through 6th words are 0.82, 0.78, 0.91, 0.85, 0.81, and 0.93, respectively. The sum of these 6 decoding probabilities is 0.82 + 0.78 + 0.91 + 0.85 + 0.81 + 0.93 = 5.1; dividing 5.1 by the 6 words contained in the first translated text gives an average decoding probability of 5.1 / 6 = 0.85 per word. This average value of 0.85 can represent the confidence of the first translated text as the final translated text of the target speech, i.e. score_text = 0.85.
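As a minimal illustration of steps B1-B2, the averaging above can be written as a short Python sketch; the helper name `confidence` is an assumption introduced here for illustration only.

```python
def confidence(decoding_probs):
    """Confidence of a translated text: the average decoding probability
    over its text units (steps B1-B2 / C1-C2)."""
    return sum(decoding_probs) / len(decoding_probs)

# Worked example above (first translated text, 6 words):
score_text = confidence([0.82, 0.78, 0.91, 0.85, 0.81, 0.93])
print(round(score_text, 2))  # 0.85
```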
S602: a confidence level is determined when the second translated text is the final translated text of the target speech.
In this embodiment, if the second translation object (i.e., the target speech) is directly translated to obtain the second translated text, the data related to the second translated text may be further processed to determine the confidence of the second translated text as the final translated text of the target speech; this confidence is denoted score_trans.
In an implementation manner of this embodiment, the step S602 may specifically include the following steps C1-C2:
Step C1: acquire the decoding probability corresponding to each text unit of the second translated text.
In this implementation, in order to determine the confidence score_trans of the second translated text as the final translated text of the target speech, each text unit contained in the second translated text is determined first. A text unit is a basic constituent unit of the second translated text, and it differs with the language to which the second translated text belongs: for example, if the second translated text is Chinese text, its text units may be characters and words; if the second translated text is English text, its text units may be words, and so on.
Then, the decoding probability corresponding to each text unit contained in the second translated text can be obtained, where the decoding probability is the probability that the corresponding text unit belongs to the translation result; specifically, it may be one of the second decoding probabilities in the second probability distribution corresponding to the k-th word obtained by decoding the target speech in the second embodiment. It can be understood that the greater the decoding probability, the more likely its corresponding text unit is the translation result of the k-th word; conversely, the smaller the decoding probability, the less likely it is.
Step C2: determine the confidence of the second translated text as the final translated text of the target speech according to the decoding probability corresponding to each text unit of the second translated text.
After the decoding probabilities corresponding to the text units of the second translated text are obtained through step C1, these decoding probabilities may be further processed so as to determine, according to the processing result, the confidence score_trans of the second translated text as the final translated text of the target speech.
Specifically, in an optional implementation, the decoding probabilities corresponding to the text units of the second translated text are summed, and the sum is divided by K2, the total number of text units contained in the second translated text, to obtain the average decoding probability per text unit; this average represents the confidence score_trans of the second translated text as the final translated text of the target speech.
For example: assume the second translated text is an English text containing 8 words, and the decoding probabilities corresponding to the 1st through 8th words are 0.76, 0.78, 0.92, 0.72, 0.89, 0.91, 0.75, and 0.83, respectively. The sum of these 8 decoding probabilities is 0.76 + 0.78 + 0.92 + 0.72 + 0.89 + 0.91 + 0.75 + 0.83 = 6.56; dividing 6.56 by the 8 words contained in the second translated text gives an average decoding probability of 6.56 / 8 = 0.82 per word. This average value of 0.82 can represent the confidence of the second translated text as the final translated text of the target speech, i.e. score_trans = 0.82.
S603: select the translated text corresponding to the higher confidence as the final translated text of the target speech.
In this embodiment, after the confidence score_text of the first translated text as the final translated text of the target speech is determined through step S601 and the confidence score_trans of the second translated text as the final translated text of the target speech is determined through step S602, the translated text corresponding to the larger of score_text and score_trans can be selected as the final translated text of the target speech.
Specifically, if the value of score_text is greater than the value of score_trans, indicating that each text unit in the first translated text has a higher probability of belonging to the translation result, the first translated text corresponding to score_text can be selected as the final translated text of the target speech; conversely, if the value of score_trans is greater than the value of score_text, indicating that each text unit in the second translated text has a higher probability of belonging to the translation result, the second translated text corresponding to score_trans can be selected as the final translated text of the target speech.
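The selection in S601-S603 then reduces to comparing the two confidences and returning the corresponding text. The sketch below reuses the hypothetical `confidence` helper from the earlier example and is illustrative only, not the embodiment's prescribed implementation.

```python
def select_final_translation(first_text, first_probs, second_text, second_probs):
    """Pick the translated text with the larger confidence (S601-S603)."""
    score_text = confidence(first_probs)    # confidence of the first translated text
    score_trans = confidence(second_probs)  # confidence of the second translated text
    return first_text if score_text >= score_trans else second_text

# With the two worked examples above, score_text = 0.85 > score_trans = 0.82,
# so the first translated text would be returned as the final translation.
```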
In summary, in this embodiment the confidences of the first translated text and the second translated text as the final translated text of the target speech are compared with each other, and the translated text whose text units are more likely to belong to the translation result is selected, according to the comparison result, as the final translated text of the target speech. A more accurate translation of the target speech can thus be determined, improving the accuracy of the speech translation result.
Fourth embodiment
In this embodiment, a speech translation apparatus will be described, and for related contents, please refer to the above method embodiment.
Referring to fig. 7, a schematic composition diagram of a speech translation apparatus provided in this embodiment is shown, where the apparatus 700 includes:
a target speech acquisition unit 701 configured to acquire a target speech to be translated;
a translated text obtaining unit 702, configured to translate a first translation object and a second translation object to obtain a final translated text of the target speech, where the first translation object is the recognition text of the target speech, and the second translation object is the target speech.
In an implementation manner of this embodiment, the translated text obtaining unit 702 includes:
a probability distribution generating subunit, configured to generate a first probability distribution and a second probability distribution corresponding to a kth word in the final translated text;
the first probability distribution comprises a first decoding probability when a kth word obtained by decoding the recognition text of the target voice is each to-be-selected word in the word list, and the second probability distribution comprises a second decoding probability when the kth word obtained by decoding the target voice is each to-be-selected word in the word list;
and the translation result obtaining subunit is used for obtaining a translation result of the kth word according to the first probability distribution and the second probability distribution.
In an implementation manner of this embodiment, the translation result obtaining subunit includes:
a fused decoding probability obtaining subunit, configured to fuse, in the first probability distribution and the second probability distribution, a first decoding probability and a second decoding probability that correspond to the same to-be-selected word, so as to obtain a fused decoding probability that corresponds to a kth word in the final translation text when the kth word is each to-be-selected word;
and the first translation result obtaining subunit is used for selecting the word to be selected corresponding to the maximum fusion decoding probability as the translation result of the kth word.
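As a hedged illustration of the fused decoding probability handled by the above subunits, one simple fusion is a weighted sum of the two distributions followed by an argmax over the candidate words; the interpolation weight `alpha` and the function name are assumptions introduced here for illustration, not something specified by this embodiment.

```python
import numpy as np

def fuse_and_pick(first_probs, second_probs, vocab, alpha=0.5):
    """Fuse the first (text-path) and second (speech-path) probability
    distributions for the k-th word and return the candidate word with the
    largest fused decoding probability. alpha is an assumed interpolation weight."""
    fused = alpha * np.asarray(first_probs) + (1.0 - alpha) * np.asarray(second_probs)
    return vocab[int(np.argmax(fused))]
```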
In an implementation manner of this embodiment, the translated text obtaining unit 702 includes:
the first translation text obtaining subunit is used for translating the recognition text of the target voice to obtain a first translation text;
the second translation text obtaining subunit is used for directly translating the target voice to obtain a second translation text;
and the final translation text obtaining subunit is used for obtaining a final translation text of the target voice according to the first translation text and the second translation text.
In an implementation manner of this embodiment, the final translation text obtaining subunit includes:
a first confidence determining subunit, configured to determine a confidence at which the first translated text is a final translated text of the target speech;
a second confidence determining subunit, configured to determine a confidence at which the second translated text is a final translated text of the target speech;
and the second translation result obtaining subunit is used for selecting the translation text corresponding to the higher confidence coefficient as the final translation text of the target voice.
In an implementation manner of this embodiment, the first confidence determining subunit includes:
a first decoding probability obtaining subunit, configured to obtain a decoding probability corresponding to each text unit of the first translation text, where the decoding probability represents a possibility that the corresponding text unit belongs to a translation result;
and the first confidence obtaining subunit is configured to determine, according to the decoding probability corresponding to each text unit of the first translated text, a confidence when the first translated text is used as a final translated text of the target speech.
In one implementation manner of this embodiment, the second confidence determining subunit includes:
a second decoding probability obtaining subunit, configured to obtain a decoding probability corresponding to each text unit of the second translation text, where the decoding probability represents a possibility that the corresponding text unit belongs to a translation result;
and the second confidence obtaining subunit is configured to determine, according to the decoding probability corresponding to each text unit of the second translated text, a confidence at which the second translated text is used as the final translated text of the target speech.
In an implementation manner of this embodiment, the translated text obtaining unit 702 includes:
the text recognition subunit is used for recognizing the target voice by utilizing a pre-constructed voice recognition model to obtain a recognition text;
the text translation subunit is used for translating the recognition text by utilizing a pre-constructed text translation model;
the voice translation subunit is used for translating the target voice by utilizing a pre-constructed voice translation model;
wherein the speech translation model shares or does not share part of the model parameters with the speech recognition model.
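The optional parameter sharing between the speech translation model and the speech recognition model can be sketched as follows, assuming a PyTorch-style module layout; the class, attribute, and dimension names are illustrative assumptions rather than the actual model structure.

```python
import torch.nn as nn

class SpeechTranslationModel(nn.Module):
    """Illustrative sketch: the speech translation model may optionally reuse
    (share) the encoder of a pre-constructed speech recognition model."""
    def __init__(self, asr_model=None, feat_dim=80, hidden=256, vocab_size=10000):
        super().__init__()
        if asr_model is not None:
            self.encoder = asr_model.encoder                        # shared parameters with ASR
        else:
            self.encoder = nn.LSTM(feat_dim, hidden, batch_first=True)  # independent encoder
        self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.output = nn.Linear(hidden, vocab_size)                 # distribution over the word list
```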
Further, an embodiment of the present application further provides a speech translation apparatus, including: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is used for storing one or more programs, and the one or more programs comprise instructions which, when executed by the processor, cause the processor to execute any one of the implementation methods of the speech translation method.
Further, an embodiment of the present application further provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are run on a terminal device, the instructions cause the terminal device to execute any implementation method of the foregoing speech translation method.
Further, an embodiment of the present application further provides a computer program product, which when running on a terminal device, causes the terminal device to execute any implementation method of the above-mentioned speech translation method.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that all or part of the steps in the above embodiment methods can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network communication device such as a media gateway, etc.) to execute the method according to the embodiments or some parts of the embodiments of the present application.
It should be noted that, in the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a(n)..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (12)
1. A method of speech translation, comprising:
acquiring target voice to be translated;
translating a first translation object and a second translation object to obtain a final translation text of the target voice, wherein the first translation object is a recognition text of the target voice, and the second translation object is the target voice;
the translating the first translation object and the second translation object to obtain a final translation text of the target voice includes:
generating a first probability distribution and a second probability distribution corresponding to a kth word in the final translation text;
obtaining a translation result of the kth word according to the first probability distribution and the second probability distribution;
the first probability distribution includes a first decoding probability when a kth word obtained by decoding the recognition text of the target speech is each to-be-selected word in the word list, and the second probability distribution includes a second decoding probability when the kth word obtained by decoding the target speech is each to-be-selected word in the word list.
2. The method of claim 1, wherein obtaining translation results for a k-th word from the first probability distribution and the second probability distribution comprises:
in the first probability distribution and the second probability distribution, fusing a first decoding probability and a second decoding probability corresponding to the same word to be selected to obtain a fused decoding probability corresponding to the kth word in the final translation text as each word to be selected;
and selecting the word to be selected corresponding to the maximum fusion decoding probability as the translation result of the kth word.
3. The method of any one of claims 1 to 2, wherein translating the first translation object and the second translation object comprises:
recognizing the target voice by utilizing a pre-constructed voice recognition model to obtain a recognition text;
translating the recognition text by utilizing a pre-constructed text translation model;
translating the target voice by utilizing a pre-constructed voice translation model;
wherein the speech translation model shares or does not share part of the model parameters with the speech recognition model.
4. A method of speech translation, comprising:
acquiring target voice to be translated;
translating a first translation object and a second translation object to obtain a final translation text of the target voice, wherein the first translation object is a recognition text of the target voice, and the second translation object is the target voice;
the translating the first translation object and the second translation object to obtain a final translation text of the target voice includes:
translating the recognition text of the target voice to obtain a first translation text;
directly translating the target voice to obtain a second translation text;
obtaining a final translation text of the target voice according to the first translation text and the second translation text;
the obtaining of the final translation text of the target voice according to the first translation text and the second translation text includes:
determining a confidence level when the first translated text is used as a final translated text of the target voice;
determining a confidence level when the second translated text is used as a final translated text of the target voice;
selecting a translation text corresponding to a larger confidence coefficient as a final translation text of the target voice;
the determining the confidence level when the first translated text is used as the final translated text of the target voice comprises:
acquiring decoding probabilities corresponding to the text units of the first translation text, wherein the decoding probabilities represent the possibility of the corresponding text units belonging to the translation result;
and determining the confidence coefficient when the first translation text is used as the final translation text of the target voice according to the decoding probability corresponding to each text unit of the first translation text.
5. The method of claim 4, wherein translating the first translation object and the second translation object comprises:
recognizing the target voice by utilizing a pre-constructed voice recognition model to obtain a recognition text;
translating the recognition text by utilizing a pre-constructed text translation model;
translating the target voice by utilizing a pre-constructed voice translation model;
wherein the speech translation model shares or does not share part of the model parameters with the speech recognition model.
6. A method of speech translation, comprising:
acquiring target voice to be translated;
translating a first translation object and a second translation object to obtain a final translation text of the target voice, wherein the first translation object is a recognition text of the target voice, and the second translation object is the target voice;
the translating the first translation object and the second translation object to obtain a final translation text of the target voice includes:
translating the recognition text of the target voice to obtain a first translation text;
directly translating the target voice to obtain a second translation text;
obtaining a final translation text of the target voice according to the first translation text and the second translation text;
the obtaining of the final translation text of the target voice according to the first translation text and the second translation text includes:
determining a confidence level when the first translated text is used as a final translated text of the target voice;
determining a confidence level when the second translated text is used as a final translated text of the target voice;
selecting a translation text corresponding to a larger confidence coefficient as a final translation text of the target voice;
the determining the confidence level when the second translated text is used as the final translated text of the target voice comprises:
acquiring decoding probabilities corresponding to the text units of the second translation text, wherein the decoding probabilities represent the possibility of the corresponding text units belonging to the translation result;
and determining the confidence coefficient when the second translation text is used as the final translation text of the target voice according to the decoding probability corresponding to each text unit of the second translation text.
7. The method of claim 6, wherein translating the first translation object and the second translation object comprises:
recognizing the target voice by utilizing a pre-constructed voice recognition model to obtain a recognition text;
translating the recognition text by utilizing a pre-constructed text translation model;
translating the target voice by utilizing a pre-constructed voice translation model;
wherein the speech translation model shares or does not share part of the model parameters with the speech recognition model.
8. A speech translation apparatus, comprising:
the target voice acquiring unit is used for acquiring target voice to be translated;
a translation text obtaining unit, configured to translate a first translation object and a second translation object to obtain a final translation text of the target speech, where the first translation object is a recognition text of the target speech, and the second translation object is the target speech;
the translated text obtaining unit includes:
a probability distribution generating subunit, configured to generate a first probability distribution and a second probability distribution corresponding to a kth word in the final translated text;
a translation result obtaining subunit, configured to obtain a translation result of the kth word according to the first probability distribution and the second probability distribution;
the first probability distribution includes a first decoding probability when a kth word obtained by decoding the recognition text of the target speech is each to-be-selected word in the word list, and the second probability distribution includes a second decoding probability when the kth word obtained by decoding the target speech is each to-be-selected word in the word list.
9. A speech translation apparatus, comprising:
the target voice acquiring unit is used for acquiring target voice to be translated;
a translation text obtaining unit, configured to translate a first translation object and a second translation object to obtain a final translation text of the target speech, where the first translation object is a recognition text of the target speech, and the second translation object is the target speech;
the translated text obtaining unit includes:
the first translation text obtaining subunit is used for translating the recognition text of the target voice to obtain a first translation text;
the second translation text obtaining subunit is used for directly translating the target voice to obtain a second translation text;
a final translation text obtaining subunit, configured to obtain a final translation text of the target speech according to the first translation text and the second translation text;
the final translated text obtaining subunit includes:
a first confidence determining subunit, configured to determine a confidence at which the first translated text is a final translated text of the target speech;
a second confidence determining subunit, configured to determine a confidence at which the second translated text is a final translated text of the target speech;
the second translation result obtaining subunit is used for selecting a translation text corresponding to a larger confidence coefficient as a final translation text of the target voice;
the first confidence determination subunit includes:
a first decoding probability obtaining subunit, configured to obtain a decoding probability corresponding to each text unit of the first translation text, where the decoding probability represents a possibility that the corresponding text unit belongs to a translation result;
and the first confidence obtaining subunit is configured to determine, according to the decoding probability corresponding to each text unit of the first translated text, a confidence when the first translated text is used as a final translated text of the target speech.
10. A speech translation apparatus, comprising:
the target voice acquiring unit is used for acquiring target voice to be translated;
a translation text obtaining unit, configured to translate a first translation object and a second translation object to obtain a final translation text of the target speech, where the first translation object is a recognition text of the target speech, and the second translation object is the target speech;
the translated text obtaining unit includes:
the first translation text obtaining subunit is used for translating the recognition text of the target voice to obtain a first translation text;
the second translation text obtaining subunit is used for directly translating the target voice to obtain a second translation text;
a final translation text obtaining subunit, configured to obtain a final translation text of the target speech according to the first translation text and the second translation text;
the final translated text obtaining subunit includes:
a first confidence determining subunit, configured to determine a confidence at which the first translated text is a final translated text of the target speech;
a second confidence determining subunit, configured to determine a confidence at which the second translated text is a final translated text of the target speech;
the second translation result obtaining subunit is used for selecting a translation text corresponding to a larger confidence coefficient as a final translation text of the target voice;
the second confidence determining subunit includes:
a second decoding probability obtaining subunit, configured to obtain a decoding probability corresponding to each text unit of the second translation text, where the decoding probability represents a possibility that the corresponding text unit belongs to a translation result;
and the second confidence obtaining subunit is configured to determine, according to the decoding probability corresponding to each text unit of the second translated text, a confidence at which the second translated text is used as the final translated text of the target speech.
11. A speech translation apparatus, comprising: a processor, a memory, a system bus;
the processor and the memory are connected through the system bus;
the memory is configured to store one or more programs, the one or more programs comprising instructions which, when executed by the processor, cause the processor to perform the method of any one of claims 1-7.
12. A computer-readable storage medium having stored therein instructions that, when executed on a terminal device, cause the terminal device to perform the method of any one of claims 1-7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910199082.XA CN109979461B (en) | 2019-03-15 | 2019-03-15 | Voice translation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910199082.XA CN109979461B (en) | 2019-03-15 | 2019-03-15 | Voice translation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109979461A CN109979461A (en) | 2019-07-05 |
CN109979461B true CN109979461B (en) | 2022-02-25 |
Family
ID=67079130
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910199082.XA Active CN109979461B (en) | 2019-03-15 | 2019-03-15 | Voice translation method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109979461B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112668346B (en) * | 2020-12-24 | 2024-04-30 | 中国科学技术大学 | Translation method, device, equipment and storage medium |
CN112818704B (en) * | 2021-01-19 | 2024-04-02 | 传神语联网网络科技股份有限公司 | Multilingual translation system and method based on inter-thread consensus feedback |
CN114048758A (en) * | 2021-11-10 | 2022-02-15 | 北京有竹居网络技术有限公司 | Training method, speech translation method, apparatus and computer readable medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050055217A1 (en) * | 2003-09-09 | 2005-03-10 | Advanced Telecommunications Research Institute International | System that translates by improving a plurality of candidate translations and selecting best translation |
US20090210214A1 (en) * | 2008-02-19 | 2009-08-20 | Jiang Qian | Universal Language Input |
CN103678285A (en) * | 2012-08-31 | 2014-03-26 | 富士通株式会社 | Machine translation method and machine translation system |
CN107170453A (en) * | 2017-05-18 | 2017-09-15 | 百度在线网络技术(北京)有限公司 | Across languages phonetic transcription methods, equipment and computer-readable recording medium based on artificial intelligence |
CN108986793A (en) * | 2018-09-28 | 2018-12-11 | 北京百度网讯科技有限公司 | translation processing method, device and equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5545467B2 (en) * | 2009-10-21 | 2014-07-09 | 独立行政法人情報通信研究機構 | Speech translation system, control device, and information processing method |
-
2019
- 2019-03-15 CN CN201910199082.XA patent/CN109979461B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050055217A1 (en) * | 2003-09-09 | 2005-03-10 | Advanced Telecommunications Research Institute International | System that translates by improving a plurality of candidate translations and selecting best translation |
US20090210214A1 (en) * | 2008-02-19 | 2009-08-20 | Jiang Qian | Universal Language Input |
CN103678285A (en) * | 2012-08-31 | 2014-03-26 | 富士通株式会社 | Machine translation method and machine translation system |
CN107170453A (en) * | 2017-05-18 | 2017-09-15 | 百度在线网络技术(北京)有限公司 | Across languages phonetic transcription methods, equipment and computer-readable recording medium based on artificial intelligence |
CN108986793A (en) * | 2018-09-28 | 2018-12-11 | 北京百度网讯科技有限公司 | translation processing method, device and equipment |
Non-Patent Citations (1)
Title |
---|
Web Data Selection Based on Word Embedding for Low-Resource Speech Recognition; Chuandong Xie et al.; INTERSPEECH 2016; 2016-09-12; pp. 1340-1344 *
Also Published As
Publication number | Publication date |
---|---|
CN109979461A (en) | 2019-07-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109785824B (en) | Training method and device of voice translation model | |
CN107590192B (en) | Mathematical processing method, device, equipment and storage medium for text questions | |
EP4073787B1 (en) | System and method for streaming end-to-end speech recognition with asynchronous decoders | |
JP7407968B2 (en) | Speech recognition method, device, equipment and storage medium | |
CN109887484B (en) | Dual learning-based voice recognition and voice synthesis method and device | |
CN112528637B (en) | Text processing model training method, device, computer equipment and storage medium | |
CN110163181B (en) | Sign language identification method and device | |
JP2020505650A (en) | Voice recognition system and voice recognition method | |
CN110210043B (en) | Text translation method, device, electronic equipment and readable storage medium | |
CN109903750B (en) | Voice recognition method and device | |
JP2015075706A (en) | Error correction model learning device and program | |
CN110069612B (en) | Reply generation method and device | |
CN112599128A (en) | Voice recognition method, device, equipment and storage medium | |
CN109979461B (en) | Voice translation method and device | |
CN110134971A (en) | A kind of method of machine translation, equipment and computer readable storage medium | |
CN111950275B (en) | Emotion recognition method and device based on recurrent neural network and storage medium | |
WO2023280265A1 (en) | Word or sentence generation method, model training method and related device | |
CN112668346A (en) | Translation method, device, equipment and storage medium | |
CN111767697B (en) | Text processing method and device, computer equipment and storage medium | |
CN117875395A (en) | Training method, device and storage medium of multi-mode pre-training model | |
JP7329393B2 (en) | Audio signal processing device, audio signal processing method, audio signal processing program, learning device, learning method and learning program | |
CN114239607A (en) | Conversation reply method and device | |
CN112951209A (en) | Voice recognition method, device, equipment and computer readable storage medium | |
CN112528168A (en) | Social network text emotion analysis method based on deformable self-attention mechanism | |
CN112989794A (en) | Model training method and device, intelligent robot and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | | |
SE01 | Entry into force of request for substantive examination | | |
GR01 | Patent grant | | |