CN114783426A - Speech recognition method and apparatus, electronic device and storage medium
- Publication number: CN114783426A
- Application number: CN202210393911.XA
- Authority: CN (China)
- Prior art keywords: speech, text, coding network, voice, training
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
Abstract
The invention provides a speech recognition method and apparatus, an electronic device, and a storage medium. The method includes: determining a speech to be recognized; and determining a recognition text of the speech to be recognized based on a speech recognition model. The speech recognition model is obtained by training with a first speech-text pair on the basis of a first coding network, and the first coding network is obtained by jointly training a speech coding network and a text coding network with a second speech-text pair. The speech to be recognized and the first speech-text pair belong to a first language, and the second speech-text pair belongs to a language other than the first language. Because the second speech-text pair is easy to obtain, and because a larger data scale improves the effect of supervised training, the first coding network used to pre-train the speech recognition model of the first language guarantees strong performance of the model, thereby realizing accurate and reliable speech recognition for a low-resource language.
Description
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a speech recognition method and apparatus, an electronic device, and a storage medium.
Background
Speech recognition technology is one of the important interfaces for human-computer interaction: it brings a more convenient experience to users and lowers the threshold of interaction between humans and machines. However, a serious data bottleneck remains when constructing a multilingual, multi-dialect low-resource speech recognition system.
Due to the scarcity of training data, a low-resource speech recognition model trained with a supervised method generally has a poor recognition effect. To address this problem, existing methods for improving the low-resource speech recognition rate usually rely on self-supervised pre-training: a robust feature extractor is obtained through unsupervised training and used as the feature extractor of the low-resource task, so that more robust features can be extracted during low-resource speech recognition, thereby improving the recognition effect.
However, in self-supervised pre-training, once the amount of pre-training data reaches a certain scale, further increasing that scale no longer brings continued improvement to low-resource speech recognition.
Disclosure of Invention
The present invention provides a speech recognition method and apparatus, an electronic device, and a storage medium, which are used to solve the problem in the prior art that the improvement of the low-resource speech recognition effect is limited.
The present invention provides a speech recognition method, which includes the following steps:
determining a speech to be recognized;
determining a recognition text of the speech to be recognized based on a speech recognition model;
wherein the speech recognition model is obtained by training with a first speech-text pair on the basis of a first coding network, and the first coding network is obtained by jointly training a speech coding network and a text coding network with a second speech-text pair;
the speech to be recognized and the first speech-text pair belong to a first language, and the second speech-text pair belongs to a language other than the first language.
According to the speech recognition method provided by the invention, the first coding network is obtained based on the following training steps:
determining a speech vector of a second speech based on the speech coding network;
determining a text vector of a second text based on the text encoding network;
and performing contrastive training on the speech coding network and the text coding network based on the similarity between the speech vector and the text vector and on whether the second speech and the second text belong to the same second speech-text pair, and determining the speech coding network after the contrastive training as the first coding network.
According to the speech recognition method provided by the present invention, the contrastive training of the speech coding network and the text coding network based on the similarity between the speech vector and the text vector and on whether the second speech and the second text belong to the same second speech-text pair includes:
determining a positive-example similarity based on the speech vector of a second speech and the text vector of a second text belonging to the same second speech-text pair, and determining a negative-example similarity based on the speech vector of a second speech and the text vector of a second text belonging to different second speech-text pairs;
and performing contrastive training on the speech coding network and the text coding network with the goals of maximizing the positive-example similarity and minimizing the negative-example similarity.
According to the speech recognition method provided by the present invention, the speech recognition model is trained based on the following steps:
determining a second coding network, the second coding network being the encoder in an end-to-end speech recognition model;
determining a joint coding network based on the first coding network and the second coding network;
and training the joint coding network with the first speech-text pair, and determining the speech recognition model based on the trained joint coding network.
According to a speech recognition method provided by the present invention, the determining the second coding network includes:
determining a synthesized speech corresponding to a third text based on a speech synthesis model, and constructing a third speech-text pair based on the third text and the synthesized speech, wherein the third speech-text pair belongs to the first language;
and training an initial end-to-end model with the third speech-text pair to obtain the end-to-end speech recognition model, and determining the encoder in the end-to-end speech recognition model as the second coding network.
According to a speech recognition method provided by the present invention, the determining a joint coding network based on the first coding network and the second coding network includes:
determining a joint coding network based on the first coding network, the second coding network, and a fusion network;
the fusion network is used to determine a fusion weight based on a first output of the first coding network and a second output of the second coding network, and to perform feature fusion on the first output and the second output based on the fusion weight.
According to a speech recognition method provided by the present invention, the determining the speech recognition model based on the trained joint coding network includes:
connecting at least two decoding networks after the trained joint coding network to obtain the speech recognition model;
the at least two decoding networks are determined based on different training frameworks.
The present invention also provides a speech recognition apparatus, comprising:
a speech determination unit, configured to determine a speech to be recognized;
a speech recognition unit, configured to determine a recognition text of the speech to be recognized based on a speech recognition model;
wherein the speech recognition model is obtained by training with a first speech-text pair on the basis of a first coding network, and the first coding network is obtained by jointly training a speech coding network and a text coding network with a second speech-text pair;
the speech to be recognized and the first speech-text pair belong to a first language, and the second speech-text pair belongs to a language other than the first language.
The present invention also provides an electronic device, comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the steps of any one of the speech recognition methods described above.
The present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the speech recognition method described in any one of the above.
With the speech recognition method and apparatus, the electronic device, and the storage medium provided by the present invention, the first coding network is obtained by jointly training the speech coding network and the text coding network on the supervised second speech-text pairs. Because the second speech-text pairs are easy to obtain, and because a larger data scale improves the effect of supervised training, the first coding network used to pre-train the speech recognition model of the first language is guaranteed strong performance. Although the scale of the first speech-text pairs of the first language is limited, the speech recognition model obtained by training still maintains an excellent recognition effect, realizing accurate and reliable speech recognition for the low-resource language.
Drawings
In order to more clearly illustrate the technical solutions of the present invention or of the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description illustrate some embodiments of the present invention, and that those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a speech recognition method according to the present invention;
FIG. 2 is a schematic diagram illustrating a training process of a first coding network according to the present invention;
FIG. 3 is a second schematic diagram illustrating a training process of the first coding network according to the present invention;
FIG. 4 is a schematic diagram of a training process of a speech recognition model provided by the present invention;
FIG. 5 is a schematic diagram illustrating a training process of a second coding network provided by the present invention;
FIG. 6 is a schematic structural diagram of a fusion network provided by the present invention;
FIG. 7 is a second schematic diagram illustrating a training process of a speech recognition model according to the present invention;
FIG. 8 is a schematic diagram of a voice recognition apparatus according to the present invention;
FIG. 9 is a schematic structural diagram of an electronic device provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive step based on the embodiments of the present invention, are within the scope of protection of the present invention.
Current speech recognition technology still suffers from strong data dependency when facing complex and diverse practical application requirements, and in particular faces a serious data bottleneck when constructing a multilingual, multi-dialect low-resource speech recognition system.
Due to the scarcity of training data, a low-resource speech recognition model trained with a supervised method generally has a poor recognition effect. To address this problem, existing methods for improving the low-resource speech recognition rate usually rely on self-supervised pre-training, in which a robust feature extractor is obtained through unsupervised training and used as the feature extractor of the low-resource task. This reduces the demand of the speech recognition task for supervised training data and allows more robust features to be extracted during low-resource speech recognition, thereby improving the recognition effect.
However, self-supervised pre-training does not use supervised data, and once the amount of unsupervised pre-training data reaches a certain scale, for example on the order of one hundred thousand hours, further increasing that scale no longer brings continued improvement to low-resource speech recognition.
In view of this problem, embodiments of the present invention provide a speech recognition method. Fig. 1 is a schematic flow chart of the speech recognition method provided by the present invention. As shown in Fig. 1, the method includes:
Step 110, determining a speech to be recognized.
Specifically, the speech to be recognized is the speech on which speech recognition needs to be performed. It may be acquired through a sound pickup device, which may be a smart phone, a tablet computer, or a smart home appliance such as a speaker, a television, or an air conditioner. After picking up the speech through its microphone array, the sound pickup device may amplify and denoise the speech to be recognized.
Step 120, determining a recognition text of the speech to be recognized based on a speech recognition model;
wherein the speech recognition model is obtained by training with a first speech-text pair on the basis of a first coding network, and the first coding network is obtained by jointly training a speech coding network and a text coding network with a second speech-text pair;
the speech to be recognized and the first speech-text pair belong to a first language, and the second speech-text pair belongs to a language other than the first language.
Specifically, speech recognition of the speech to be recognized can be realized by a speech recognition model. The speech recognition model here should be capable of recognizing speech in the language to which the speech to be recognized belongs; assuming the speech to be recognized belongs to the first language, the speech recognition model itself can be used to recognize speech of the first language.
Here, the first language may be a low-resource language, i.e., a language with a narrow application area and a small audience, such as Frisian or Uzbek, whose supervised data is small in scale and difficult to collect. In order to improve the recognition effect of the speech recognition model for the first language, when supervised model training is performed with the first speech-text pairs of the first language, it is carried out on the basis of a pre-trained first coding network, where the first coding network is obtained by jointly training a speech coding network and a text coding network with second speech-text pairs.
Both the first speech-text pairs and the second speech-text pairs are supervised training data, i.e., each pair includes a speech and its corresponding text. The difference is that the first speech-text pair belongs to the first language, i.e., the low-resource language, while the second speech-text pair belongs to a second language, i.e., any language other than the first language, which may be one language or several languages. Compared with the supervised training data of the low-resource first language, the second speech-text pairs have the advantages of large quantity and low collection difficulty.
It should be noted that a language referred to in the embodiments of the present invention may be the language of a country or a region, for example Chinese, or a dialect subdivision of Chinese, for example Hakka, Cantonese, or Southern Min (the southern Fujian dialect).
Because the second speech-text pairs are easy to collect, a large number of them can be used to jointly train the speech coding network and the text coding network. In this process, the speech coding network encodes the speech in a second speech-text pair into a speech vector, and the text coding network encodes the text in a second speech-text pair into a text vector. Using the natural correspondence between speech and text within each second speech-text pair, the parameters of the two networks are iteratively updated with the goal that a speech vector and a text vector that correspond to each other are as close as possible, while a speech vector and a text vector that do not correspond are as different as possible. The two networks thus fully learn the characteristics of speech and text during training, and the trained speech coding network serves as the pre-trained first coding network with speech coding capability.
Here, a large number of supervised second speech-text pairs are used in the pre-training of the first coding network, and increasing the scale of the second speech-text pairs brings continued improvement to the coding effect of the first coding network.
The method provided by the embodiment of the present invention jointly trains the speech coding network and the text coding network on the supervised second speech-text pairs to obtain the first coding network. Because the second speech-text pairs are easy to obtain, and because a larger data scale improves the effect of supervised training, the pre-trained first coding network guarantees the performance of the speech recognition model of the first language, realizing accurate and reliable speech recognition for the low-resource language.
Based on the foregoing embodiment, Fig. 2 is a schematic diagram of the training process of the first coding network provided by the present invention. As shown in Fig. 2, the first coding network is obtained based on the following steps:
Step 210, determining a speech vector of a second speech based on the speech coding network;
Step 220, determining a text vector of a second text based on the text coding network;
Step 230, performing contrastive training on the speech coding network and the text coding network based on the similarity between the speech vector and the text vector and on whether the second speech and the second text belong to the same second speech-text pair, and determining the speech coding network after the contrastive training as the first coding network.
Specifically, during training, the speech coding network performs speech coding on the second speech to obtain the speech vector of the second speech; correspondingly, the text coding network performs text coding on the second text to obtain the text vector of the second text. It should be noted that both the second speech and the second text come from second speech-text pairs of the second language. They may come from the same second speech-text pair or from different ones: if they come from the same pair, a natural correspondence exists between them, i.e., the second speech and the second text match each other; if they come from different pairs, no association exists between them, i.e., they do not match.
After the speech vector of the second speech and the text vector of the second text are obtained, the similarity between them can be calculated, and contrastive training is performed on the speech coding network and the text coding network according to whether the second speech and the second text belong to the same second speech-text pair. In this process, when the second speech and the second text belong to the same second speech-text pair, a higher similarity between the speech vector and the text vector indicates a better contrastive-training effect, while a lower similarity indicates that further adjustment is needed; when they belong to different second speech-text pairs, a lower similarity indicates a better contrastive-training effect, while a higher similarity indicates that further adjustment is needed. After the contrastive training is completed, the speech coding network after contrastive training is determined as the first coding network.
Based on any of the foregoing embodiments, in step 230, performing contrastive training on the speech coding network and the text coding network based on the similarity between the speech vector and the text vector and on whether the second speech and the second text belong to the same second speech-text pair includes:
determining a positive-example similarity based on the speech vector of a second speech and the text vector of a second text belonging to the same second speech-text pair, and determining a negative-example similarity based on the speech vector of a second speech and the text vector of a second text belonging to different second speech-text pairs;
and performing contrastive training on the speech coding network and the text coding network with the goals of maximizing the positive-example similarity and minimizing the negative-example similarity.
Specifically, for all speech vectors and text vectors, the pairwise similarity can be calculated. For the speech vector of a second speech and the text vector of a second text belonging to the same second speech-text pair, the second speech and the second text form a positive example, and the similarity between their vectors is recorded as the positive-example similarity; for the speech vector of a second speech and the text vector of a second text belonging to different second speech-text pairs, the second speech and the second text form a negative example, and the similarity between their vectors is recorded as the negative-example similarity. The similarity between vectors, including the positive-example and negative-example similarity, can be computed by methods such as the vector inner product, cosine similarity, or Euclidean distance, which is not specifically limited in the embodiments of the present invention.
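For reference, a minimal PyTorch sketch of the three similarity measures mentioned above follows; the vector dimension is an assumption, and the sketch is illustrative rather than part of the original disclosure.

```python
import torch
import torch.nn.functional as F

s = torch.randn(256)                        # a speech vector (dimension assumed)
t = torch.randn(256)                        # a text vector

inner = torch.dot(s, t)                     # vector inner product
cosine = F.cosine_similarity(s, t, dim=0)   # cosine similarity
euclid = torch.norm(s - t, p=2)             # Euclidean distance (smaller = more similar)
```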
On this basis, contrastive training is performed on the speech coding network and the text coding network with maximizing the positive-example similarity and minimizing the negative-example similarity as the training objective. Specifically, a loss function can be constructed for this objective, the loss value is calculated from the positive-example and negative-example similarities, and the network parameters of the speech coding network and the text coding network are iteratively updated based on the loss value until the loss function converges.
Based on any of the above embodiments: in the related art, besides self-supervised pre-training, the low-resource speech recognition rate can also be improved through transfer learning, that is, by training an initialization model on a large amount of supervised data of other languages as the initialization network for the low-resource language, and then training again with the supervised data of the low-resource language. To address the problem of limited effect improvement, in the embodiment of the present invention both the second speech and the second text take the sentence as the unit, and the ceiling on effect improvement is broken through sentence-level contrastive learning. Fig. 3 is a second schematic diagram of the training process of the first coding network provided by the present invention. As shown in Fig. 3, a large number of pre-collected second speech-text pairs may be divided into a number of minibatches, each minibatch containing N second speech-text pairs, i.e., N sentences of second speech and N sentences of second text.
The N sentences of second speech are passed through the speech coding network to obtain a speech vector for each sentence, denoted S_1, S_2, S_3, ..., S_N; the N sentences of second text are passed through the text coding network to obtain a text vector for each sentence, denoted T_1, T_2, T_3, ..., T_N. Here, the speech coding network may be a Transformer structure or another neural network structure capable of speech coding, and the text coding network may be a Transformer structure or another neural network structure capable of text coding.
After the speech vectors and text vectors of the N sentences are obtained, the inner products between [S_1, S_2, S_3, ..., S_N]^T and [T_1, T_2, T_3, ..., T_N] can be calculated, yielding the matrix shown in Fig. 3. The row direction of the matrix can be regarded as a classifier for the second speech, used to judge which of the N sentences of second text matches a given second speech; the column direction can be regarded as a classifier for the second text, used to judge which of the N sentences of second speech matches a given second text. Since it is known whether each second speech and second text match, i.e., belong to the same second speech-text pair, the objective of the contrastive training of the speech coding network and the text coding network can be set to maximize the inner products of matched speech-text vector pairs, i.e., the diagonal elements of the matrix S_1T_1, S_2T_2, S_3T_3, ..., S_NT_N, which are the positive-example similarities, and to minimize the inner products of unrelated vectors, i.e., the off-diagonal elements of the matrix, which are the negative-example similarities.
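By way of illustration only (this sketch is not part of the original disclosure; the minibatch size and vector dimension are assumptions), the N x N inner-product matrix and its positive and negative entries can be formed in PyTorch as follows:

```python
import torch

N, d = 8, 256                      # minibatch of N sentence pairs (assumed sizes)
S = torch.randn(N, d)              # speech vectors S_1..S_N from the speech coding network
T = torch.randn(N, d)              # text vectors T_1..T_N from the text coding network

# Inner products between [S_1..S_N]^T and [T_1..T_N]: an N x N similarity matrix.
sim = S @ T.t()                    # sim[i, j] = <S_i, T_j>

# Diagonal entries are positive-example similarities (matched pairs);
# off-diagonal entries are negative-example similarities (unmatched pairs).
pos = sim.diag()
neg = sim[~torch.eye(N, dtype=torch.bool)]
```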
The loss function L (W, (Y, T, S)) of the comparative training can be expressed as the following equation:
in the formula Dw(T, S) denotes the Euclidean distance (two-norm), D, of the text vector T and the speech vector Sw(T,S)=||T-S||2Y is a label indicating whether the text vector T and the speech vector S are matched, and Y ═ 1 indicates that the samples are matched, that is, the second text and the second speech are corresponding and belong to the same sentence, that is, belong to the same second speech text pair; y-0 indicates that the samples do not match, i.e. the second text and the second speech do not correspond, in two sentences, i.e. belong to different pairs of second speech texts. m is a set threshold value, and N is the number of samples.
Based on this loss function, the gradient can be calculated, and the text coding network and the speech coding network are updated through the forward and backward passes of the neural network.
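For concreteness, a minimal PyTorch sketch of this margin-based contrastive loss and its use in an update step is given below; the margin value, the optimizer, and the encoder modules are assumptions not fixed by the text above:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(T, S, Y, m=1.0):
    """Margin-based contrastive loss over (text, speech) vector pairs.

    T, S: (N, d) text and speech vectors; Y: (N,) float labels, 1 = same second
    speech-text pair, 0 = different pairs. The margin m = 1.0 is an assumption.
    """
    d_w = torch.norm(T - S, p=2, dim=1)                 # D_w(T, S) = ||T - S||_2
    loss = Y * d_w.pow(2) + (1 - Y) * F.relu(m - d_w).pow(2)
    return loss.mean() / 2                              # (1 / 2N) * sum over the batch

# One forward/backward update step (encoders and optimizer are placeholders):
# loss = contrastive_loss(text_encoder(texts), speech_encoder(speeches), Y)
# loss.backward(); optimizer.step(); optimizer.zero_grad()
```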
Based on any of the above embodiments, before the second text and the second speech are input to the text coding network and the speech coding network, word-embedding coding may be performed on each word of the second text, and the resulting vectors are input to the text coding network; filter-bank spectral features are extracted from the second speech and normalized by mean-variance normalization, and the normalized vectors are input to the speech coding network.
In addition, during the joint training of the text coding network and the speech coding network, a random mask may be applied to the spectral features of the N sentences of second speech, so as to improve the robustness of the trained first coding network. The random mask may be a small rectangular block used to zero out a random position of the spectral features; the block size may be 4 dimensions by 8 frames, or 6 dimensions by 4 frames, etc., which is not specifically limited in the embodiments of the present invention.
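A minimal sketch of such block masking on a spectral-feature matrix follows (illustrative only; the feature shape and the number of masked blocks are assumptions):

```python
import torch

def random_block_mask(feats, block_dims=4, block_frames=8, num_blocks=2):
    """Zero out small rectangular blocks of a (frames, dims) spectral-feature matrix.

    Block size defaults to 4 feature dimensions x 8 frames, per the example in the
    text; the number of blocks is an assumption.
    """
    frames, dims = feats.shape
    out = feats.clone()
    for _ in range(num_blocks):
        f0 = torch.randint(0, max(frames - block_frames, 1), (1,)).item()
        d0 = torch.randint(0, max(dims - block_dims, 1), (1,)).item()
        out[f0:f0 + block_frames, d0:d0 + block_dims] = 0.0
    return out

masked = random_block_mask(torch.randn(200, 80))   # e.g. 200 frames of 80-dim fbank
```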
Based on any of the above embodiments, Fig. 4 is a schematic diagram of the training process of the speech recognition model provided by the present invention. As shown in Fig. 4, the speech recognition model is trained based on the following steps:
Step 410, determining a second coding network, the second coding network being the encoder in an end-to-end speech recognition model.
Specifically, the end-to-end speech recognition model integrates acoustic-model training and language-model training within a single model. It consists mainly of an encoder and a decoder, takes the word as the modeling unit in the speech recognition process, and directly predicts words in an autoregressive manner.
Here, the end-to-end speech recognition model is capable of performing speech recognition on speech of the first language, and the encoder it contains, i.e., the second coding network, has the function of encoding speech of the first language.
Step 420, determining a joint coding network based on the first coding network and the second coding network.
In order to further improve the recognition effect of the speech recognition model, in the embodiment of the present invention the first coding network and the second coding network are combined to construct a joint coding network. In the joint coding network, the first coding network and the second coding network perform speech coding on the input speech to be recognized in parallel, each outputting a speech vector for the speech to be recognized; the joint coding network may also include a network structure that fuses the speech vectors output by the first coding network and the second coding network, thereby realizing the joint coding of the two networks.
Step 430, training the joint coding network with the first speech-text pair, and determining the speech recognition model based on the trained joint coding network.
Specifically, after the joint coding network is determined, supervised training is performed on it with the first speech-text pairs, and the trained joint coding network can be combined with a language model and/or a hidden Markov model for the first language for decoding, thereby realizing speech recognition for the first language.
The method provided by the embodiment of the present invention obtains the first coding network and the second coding network under the two training frameworks of contrastive learning and end-to-end training, respectively, and uses them to construct the speech recognition model, which helps ensure the robustness of the speech recognition model and further improves the speech recognition effect.
Based on any of the above embodiments, step 410 includes:
determining a synthesized speech corresponding to a third text based on a speech synthesis model, and constructing a third speech-text pair based on the third text and the synthesized speech, wherein the third speech-text pair belongs to the first language;
and training an initial end-to-end model with the third speech-text pair to obtain the end-to-end speech recognition model, and determining the encoder in the end-to-end speech recognition model as the second coding network.
Specifically, given that the end-to-end speech recognition model is a speech recognition model for the first language, its training requires a large number of supervised samples of the first language. However, the first language is a low-resource language, and the collection difficulty and cost of supervised samples are relatively high, which is why the synthesized third speech-text pairs are constructed through the speech synthesis model.
The third speech-text pairs thus obtained, i.e., synthesized supervised samples of the first language, can be used to train the initial end-to-end model, thereby obtaining the end-to-end speech recognition model. Here the initial end-to-end model is a model with an encoder-decoder structure.
Further, when training the initial end-to-end model, the first speech-text pairs and the third speech-text pairs may be mixed as supervised samples for model training; the specific mixing ratio may be 1:3, 1:4, etc., which is not specifically limited in the embodiments of the present invention. Fig. 5 is a schematic diagram of the training process of the second coding network provided by the present invention. As shown in Fig. 5, a sample speech may be a first speech from a first speech-text pair or a synthesized speech from a third speech-text pair; the encoder and the decoder form the initial end-to-end model, and the output text of the decoder is compared with the text of the speech-text pair containing the sample speech to calculate the loss function, so as to update the network parameters of the encoder and the decoder. The coded vector output by the encoder is the speech vector of the sample speech, and after training is completed, the encoder can be used as the second coding network.
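The mixing of real and synthesized pairs might be set up as in the following sketch, where the stand-in datasets are assumptions and 1:3 is the example ratio from the text:

```python
import torch
from torch.utils.data import TensorDataset, ConcatDataset, WeightedRandomSampler, DataLoader

# Stand-in datasets: real first-language pairs and synthesized third pairs (assumed shapes).
real_pairs = TensorDataset(torch.randn(100, 80), torch.randint(0, 500, (100,)))
synth_pairs = TensorDataset(torch.randn(300, 80), torch.randint(0, 500, (300,)))

mixed = ConcatDataset([real_pairs, synth_pairs])

# Sample real : synthesized at roughly 1:3 (the example ratio in the text).
weights = torch.cat([
    torch.full((len(real_pairs),), 1.0 / len(real_pairs)),
    torch.full((len(synth_pairs),), 3.0 / len(synth_pairs)),
])
sampler = WeightedRandomSampler(weights, num_samples=len(mixed), replacement=True)
loader = DataLoader(mixed, batch_size=16, sampler=sampler)
```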
Based on any of the above embodiments, in the initial end-to-end model the encoder may use a MobileNet structure; since the MobileNet structure has a small number of parameters, applying the MobileNet-based second coding network in the speech recognition model can accelerate model inference. The decoder may use a Transformer structure, and its model size need not be compressed.
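As a rough illustration of such an encoder-decoder pairing (all layer sizes here are assumptions, not the patent's concrete architecture), depthwise-separable convolutions in the MobileNet style can feed a standard Transformer decoder:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableBlock(nn.Module):
    """One MobileNet-style block over (batch, channels, frames) features."""
    def __init__(self, ch_in, ch_out):
        super().__init__()
        self.depthwise = nn.Conv1d(ch_in, ch_in, kernel_size=3, padding=1, groups=ch_in)
        self.pointwise = nn.Conv1d(ch_in, ch_out, kernel_size=1)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

# Lightweight encoder (assumed sizes) plus a Transformer decoder stack.
encoder = nn.Sequential(DepthwiseSeparableBlock(80, 128),
                        DepthwiseSeparableBlock(128, 256))
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=4,
)

feats = torch.randn(2, 80, 200)                    # (batch, fbank dims, frames)
memory = encoder(feats).transpose(1, 2)            # (batch, frames, 256) for the decoder
tgt = torch.randn(2, 10, 256)                      # embedded target tokens (assumed)
out = decoder(tgt, memory)                         # autoregressive prediction features
```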
Based on any of the above embodiments, step 420 includes:
determining a joint coding network based on the first coding network, the second coding network, and a fusion network;
the fusion network is used to determine a fusion weight based on a first output of the first coding network and a second output of the second coding network, and to perform feature fusion on the first output and the second output based on the fusion weight.
Specifically, considering that the outputs of the first coding network and the second coding network differ greatly, and that directly concatenating them may harm the model training effect, in the embodiment of the present invention a fusion network is connected after the first coding network and the second coding network to realize weighted fusion of the first output of the first coding network and the second output of the second coding network.
Here, the fusion weight used for weighted fusion in the fusion network is adaptively adjusted based on the first output of the first coding network and the second output of the second coding network. In this process, by combining the first output and the second output, it is possible to determine how much information each of them contributes to speech recognition, and thereby to decide whether the first output or the second output should be weighted more heavily in the weighted fusion.
According to the method provided by the embodiment of the present invention, the fusion weight is determined from the first output and the second output, and feature fusion is carried out according to the fusion weight, so that the information beneficial to speech recognition is highlighted and the stability of model training is improved.
Based on any of the above embodiments, in the fusion network the fusion weight may be determined based on the following steps:
adjusting the first output of the first coding network and the second output of the second coding network to the same feature dimension, and concatenating the first output and the second output under the same feature dimension to obtain a concatenated feature;
and performing feature extraction on the concatenated feature to obtain the fusion weight.
Here, adjusting the first output and the second output to the same feature dimension may specifically be done by compressing the feature dimension of the first output or by upsampling the features of the second output. The feature-dimension compression can be realized by a deep neural network.
The feature extraction on the concatenated feature can be realized by a deep neural network or by feature mapping.
In the fusion network, performing feature fusion on the first output and the second output based on the fusion weight may specifically mean determining weights for the first output and the second output from the fusion weight and then computing their weighted sum; alternatively, the fusion weight may be directly used as the weight of the first output or of the second output, weighting only that output, for example adding the product of the second output and the fusion weight to the first output.
Based on any of the above embodiments, Fig. 6 is a schematic structural diagram of the fusion network provided by the present invention. As shown in Fig. 6, DNN1 and DNN2 are two deep neural networks and sigmoid is an activation function. The first output is compressed by DNN1 to the same feature dimension as the second output and then concatenated with the second output. DNN2 performs further feature extraction on the concatenated feature, and a sigmoid activation function maps it to a fusion weight with a value between 0 and 1. The fusion weight is multiplied by the second output to weight the second output, and the DNN1-compressed first output is concatenated with the weighted second output, realizing the weighted fusion of the first output and the second output. After the weighted fusion, the classification result can be output through a classification layer, for example the distribution probability of triphones.
Assume the first output is l_t and the second output is s_t. The first output after DNN1 compression is denoted h_t, i.e., h_t = DNN1(l_t). After h_t and s_t are concatenated, the fusion weight g_t is obtained through the sigmoid activation function, i.e., g_t = σ(W · DNN2([s_t; h_t]) + b), where σ is the activation function and W and b are learned parameters. The feature obtained by weighted fusion is denoted s'_t, i.e., s'_t = [g_t * s_t; h_t]. After the weighted fusion, the fused feature is passed through the classification layer, whose output here is the probability distribution over triphones: P(y_t | x) = softmax(s'_t).
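A compact PyTorch rendering of this fusion structure follows; it is a sketch in which the layer widths are assumptions, and the learned W and b are realized as a final linear layer inside the gate, consistent with the formulas above:

```python
import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    """Gated fusion of the two encoder outputs, following
    h_t = DNN1(l_t), g_t = sigmoid(W . DNN2([s_t; h_t]) + b), s'_t = [g_t * s_t; h_t]."""
    def __init__(self, dim_l=512, dim_s=256, num_triphones=3000):  # assumed sizes
        super().__init__()
        self.dnn1 = nn.Sequential(nn.Linear(dim_l, dim_s), nn.ReLU())   # compress l_t
        self.dnn2 = nn.Sequential(nn.Linear(2 * dim_s, dim_s), nn.ReLU())
        self.gate = nn.Linear(dim_s, dim_s)            # the learned W and b
        self.classifier = nn.Linear(2 * dim_s, num_triphones)

    def forward(self, l_t, s_t):
        h_t = self.dnn1(l_t)                           # h_t = DNN1(l_t)
        g_t = torch.sigmoid(self.gate(self.dnn2(torch.cat([s_t, h_t], dim=-1))))
        fused = torch.cat([g_t * s_t, h_t], dim=-1)    # s'_t = [g_t * s_t; h_t]
        return torch.softmax(self.classifier(fused), dim=-1)  # P(y_t | x)

probs = FusionNetwork()(torch.randn(2, 100, 512), torch.randn(2, 100, 256))
```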
Based on any of the above embodiments, step 430 includes:
connecting at least two decoding networks after the trained joint coding network to obtain the speech recognition model;
the at least two decoding networks are determined based on different training frameworks.
Specifically, in order to further improve the recognition effect of the speech recognition model, at least two decoding networks are adopted for speech decoding when constructing the speech recognition model. The at least two decoding networks are determined based on different training frameworks, and decoding networks trained under different frameworks complement each other at the speech-decoding level, thereby ensuring the reliability of speech recognition.
Preferably, the at least two decoding networks may be a hidden Markov model and the decoder of the end-to-end speech recognition model, respectively. Integrating the hidden Markov model, which performs fine-grained frame-level modeling, and the end-to-end speech recognition model, which performs coarse-grained word-level modeling, into the same model can effectively optimize the speech recognition effect.
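One simple way to combine scores from two such decoding branches, which the text does not specify, is log-linear interpolation of their hypothesis scores; the interpolation weight below is an assumption:

```python
def combine_scores(hmm_logprob: float, e2e_logprob: float, lam: float = 0.5) -> float:
    """Log-linear interpolation of frame-level (HMM) and word-level (end-to-end)
    scores for one recognition hypothesis. lam is an assumed interpolation weight."""
    return lam * hmm_logprob + (1.0 - lam) * e2e_logprob

# Rescoring an n-best list: keep the hypothesis with the best combined score.
nbest = [("hyp a", -12.3, -10.1), ("hyp b", -11.8, -11.5)]  # (text, hmm, e2e) scores
best = max(nbest, key=lambda h: combine_scores(h[1], h[2]))
```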
Based on any of the above embodiments, Fig. 7 is a second schematic diagram of the training process of the speech recognition model provided by the present invention. As shown in Fig. 7, the training of the speech recognition model is realized based on the following steps:
First, the speech coding network and the text coding network are jointly trained with the second speech-text pairs through contrastive training, and the trained speech coding network is determined as the first coding network. Meanwhile, speech is synthesized by the speech synthesis model to construct third speech-text pairs, an end-to-end speech recognition model is trained on the third speech-text pairs, and the encoder of the end-to-end speech recognition model is determined as the second coding network.
After the first coding network and the second coding network are obtained, a fusion network is connected after the two coding networks, thereby constructing the joint coding network.
The first speech of a first speech-text pair is taken as the input of the joint coding network, and the triphone states predicted by the joint coding network from the first speech are compared with the triphone states of the first text corresponding to the first speech in the first speech-text pair; the loss value of the joint coding network is calculated from this comparison, and a gradient update is applied to the joint coding network based on the loss value, thereby training the joint coding network. The speech recognition model is then determined based on the trained joint coding network.
In this process, the gradient update of the joint coding network based on the loss value specifically means that only the network parameters of the first coding network and the fusion network are iteratively updated, while the network parameters of the second coding network are kept fixed.
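In PyTorch terms, fixing the second coding network while updating the first coding network and the fusion network could look like the following sketch; all module and tensor names here are placeholders, not names from the original disclosure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder modules standing in for the actual networks (assumed shapes).
first_encoder = nn.Linear(80, 256)     # stands in for the first coding network
second_encoder = nn.Linear(80, 256)    # stands in for the second coding network
fusion = nn.Sequential(nn.Linear(512, 3000), nn.Softmax(dim=-1))  # stands in for the fusion network

# Fix the second coding network; only the first coding network and the
# fusion network receive gradient updates.
for p in second_encoder.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(
    list(first_encoder.parameters()) + list(fusion.parameters()), lr=1e-4)  # assumed lr

speech = torch.randn(4, 80)                      # a batch of first-speech features
triphone_targets = torch.randint(0, 3000, (4,))  # triphone states of the first text

probs = fusion(torch.cat([first_encoder(speech), second_encoder(speech)], dim=-1))
loss = F.nll_loss(probs.log(), triphone_targets)  # compare predicted vs. target states
loss.backward()
optimizer.step()
optimizer.zero_grad()
```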
Correspondingly, when speech recognition is performed with the speech recognition model, the first coding network and the second coding network in the speech recognition model perform speech coding on the speech to be recognized respectively, and the fusion network fuses the speech vectors output by the two coding networks and outputs the fused speech vector.
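At inference time, reusing the placeholder modules from the previous sketch, the same components might be wired as follows, with a greedy argmax over triphone states standing in for a full decoder:

```python
with torch.no_grad():                            # inference: no parameter updates
    probs = fusion(torch.cat([first_encoder(speech), second_encoder(speech)], dim=-1))
    states = probs.argmax(dim=-1)                # most likely triphone state per frame
```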
Based on any of the above embodiments, fig. 8 is a schematic structural diagram of a speech recognition apparatus provided by the present invention, and as shown in fig. 8, the apparatus includes:
a speech determination unit 810, configured to determine a speech to be recognized;
a speech recognition unit 820, configured to determine a recognition text of the speech to be recognized based on a speech recognition model;
wherein the speech recognition model is obtained by training with a first speech-text pair on the basis of a first coding network, and the first coding network is obtained by jointly training a speech coding network and a text coding network with a second speech-text pair;
the speech to be recognized and the first speech-text pair belong to a first language, and the second speech-text pair belongs to a language other than the first language.
The apparatus provided by the embodiment of the present invention obtains the first coding network by jointly training the speech coding network and the text coding network on the supervised second speech-text pairs. Because the second speech-text pairs are easy to obtain, and because a larger data scale improves the effect of supervised training, the first coding network used to pre-train the speech recognition model of the first language is guaranteed strong performance; although the scale of the first speech-text pairs of the first language is limited, the speech recognition model obtained by training still maintains an excellent recognition effect, realizing accurate and reliable speech recognition for the low-resource language.
Based on any of the above embodiments, the apparatus further includes a first coding network training unit, configured to:
determining a speech vector of a second speech based on the speech coding network;
determining a text vector of a second text based on the text encoding network;
and performing contrastive training on the speech coding network and the text coding network based on the similarity between the speech vector and the text vector and on whether the second speech and the second text belong to the same second speech-text pair, and determining the speech coding network after the contrastive training as the first coding network.
Based on any of the above embodiments, the first coding network training unit is specifically configured to:
determining a positive-example similarity based on the speech vector of a second speech and the text vector of a second text belonging to the same second speech-text pair, and determining a negative-example similarity based on the speech vector of a second speech and the text vector of a second text belonging to different second speech-text pairs;
and performing contrastive training on the speech coding network and the text coding network with the goals of maximizing the positive-example similarity and minimizing the negative-example similarity.
Based on any of the above embodiments, the apparatus further includes a second coding network training unit, configured to:
determining a second coding network, the second coding network being an encoder in an end-to-end speech recognition model;
the apparatus further comprises a joint coding network training unit configured to:
determining a joint coding network based on the first coding network and the second coding network;
and training the joint coding network with the first speech-text pair, and determining the speech recognition model based on the trained joint coding network.
Based on any of the above embodiments, the second coding network training unit is specifically configured to:
determining a synthesized speech corresponding to a third text based on a speech synthesis model, and constructing a third speech-text pair based on the third text and the synthesized speech, wherein the third speech-text pair belongs to the first language;
and training an initial end-to-end model with the third speech-text pair to obtain the end-to-end speech recognition model, and determining the encoder in the end-to-end speech recognition model as the second coding network.
Based on any of the embodiments above, the joint coding network training unit is specifically configured to:
determining a joint coding network based on the first coding network, the second coding network, and a fusion network;
the fusion network is used to determine a fusion weight based on a first output of the first coding network and a second output of the second coding network, and to perform feature fusion on the first output and the second output based on the fusion weight.
Based on any of the embodiments above, the joint coding network training unit is specifically configured to:
connecting at least two decoding networks after the trained joint coding network to obtain the speech recognition model;
the at least two decoding networks are determined based on different training frameworks.
Fig. 9 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Fig. 9, the electronic device may include: a processor (processor) 910, a communications interface (Communications Interface) 920, a memory (memory) 930, and a communication bus 940, wherein the processor 910, the communications interface 920, and the memory 930 communicate with each other via the communication bus 940. The processor 910 may invoke logic instructions in the memory 930 to perform a speech recognition method comprising:
determining a speech to be recognized;
determining a recognition text of the speech to be recognized based on a speech recognition model;
wherein the speech recognition model is obtained by training with a first speech-text pair on the basis of a first coding network, and the first coding network is obtained by jointly training a speech coding network and a text coding network with a second speech-text pair;
the speech to be recognized and the first speech-text pair belong to a first language, and the second speech-text pair belongs to a language other than the first language.
Furthermore, the logic instructions in the memory 930 may be implemented in the form of software functional units and, when sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
In another aspect, the present invention also provides a computer program product comprising a computer program stored on a non-transitory computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer, enable the computer to perform the speech recognition method provided above, the method comprising:
determining a speech to be recognized;
determining a recognition text of the speech to be recognized based on a speech recognition model;
wherein the speech recognition model is obtained by training with a first speech-text pair on the basis of a first coding network, and the first coding network is obtained by jointly training a speech coding network and a text coding network with a second speech-text pair;
the speech to be recognized and the first speech-text pair belong to a first language, and the second speech-text pair belongs to a language other than the first language.
In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the speech recognition method provided above, the method comprising:
determining a speech to be recognized;
determining a recognition text of the speech to be recognized based on a speech recognition model;
wherein the speech recognition model is obtained by training with a first speech-text pair on the basis of a first coding network, and the first coding network is obtained by jointly training a speech coding network and a text coding network with a second speech-text pair;
the speech to be recognized and the first speech-text pair belong to a first language, and the second speech-text pair belongs to a language other than the first language.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A speech recognition method, comprising:
determining a speech to be recognized;
determining a recognition text of the speech to be recognized based on a speech recognition model;
wherein the speech recognition model is obtained by training with a first speech-text pair on the basis of a first coding network, and the first coding network is obtained by jointly training a speech coding network and a text coding network with a second speech-text pair;
the speech to be recognized and the first speech-text pair belong to a first language, and the second speech-text pair belongs to a language other than the first language.
2. The speech recognition method of claim 1, wherein the first coding network is obtained by training based on the following steps:
determining a speech vector of a second speech based on the speech coding network;
determining a text vector of a second text based on the text coding network;
performing contrastive training on the speech coding network and the text coding network based on the similarity between the speech vector and the text vector and on whether the second speech and the second text belong to the same second speech-text pair, and determining the contrastively trained speech coding network as the first coding network.
3. The speech recognition method of claim 2, wherein the performing contrastive training on the speech coding network and the text coding network based on the similarity between the speech vector and the text vector and on whether the second speech and the second text belong to the same second speech-text pair comprises:
determining a positive-example similarity based on the speech vector of a second speech and the text vector of a second text belonging to the same second speech-text pair, and determining a negative-example similarity based on the speech vector of a second speech and the text vector of a second text belonging to different second speech-text pairs;
performing contrastive training on the speech coding network and the text coding network with the goals of maximizing the positive-example similarity and minimizing the negative-example similarity.
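A minimal sketch of the contrastive objective of claims 2 and 3, under the common assumption that the other pairs in a batch serve as negative examples (an InfoNCE-style formulation; the in-batch-negatives choice and the temperature parameter are implementation assumptions, not part of the claims):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(speech_vecs, text_vecs, temperature=0.07):
    """InfoNCE-style loss: row i of each matrix comes from the same
    second speech-text pair; all other rows serve as negatives."""
    s = F.normalize(speech_vecs, dim=-1)   # (B, D) speech vectors
    t = F.normalize(text_vecs, dim=-1)     # (B, D) text vectors
    logits = s @ t.T / temperature         # (B, B) pairwise similarities
    labels = torch.arange(s.size(0))       # positive pairs on the diagonal
    # Maximizes positive-example similarity and minimizes negative-example
    # similarity, symmetrically in both speech->text and text->speech directions.
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2

# Example: a batch of 8 already-encoded speech/text vector pairs
loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```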
4. The speech recognition method according to any one of claims 1 to 3, wherein the speech recognition model is obtained by training based on the following steps:
determining a second coding network, the second coding network being the encoder of an end-to-end speech recognition model;
determining a joint coding network based on the first coding network and the second coding network;
training the joint coding network with the first speech-text pair, and determining the speech recognition model based on the trained joint coding network.
5. The speech recognition method of claim 4, wherein the determining a second coding network comprises:
determining a synthesized speech corresponding to a third text based on a speech synthesis model, and constructing a third speech-text pair from the third text and the synthesized speech, the third speech-text pair belonging to the first language;
training an initial end-to-end model with the third speech-text pair to obtain the end-to-end speech recognition model, and determining the encoder of the end-to-end speech recognition model as the second coding network.
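Claim 5 bootstraps the second coding network from synthetic data: a speech synthesis model turns first-language text into speech, and the resulting pairs train an end-to-end recognizer whose encoder is then reused. A hedged sketch of that data-construction step follows; `tts_model`, `train_end_to_end`, and `initial_model` are hypothetical placeholders for whatever synthesis and training components an implementation actually uses.

```python
def build_third_pairs(third_texts, tts_model):
    """Construct third speech-text pairs in the first language by
    synthesizing a waveform (or spectrogram) for each third text."""
    return [(tts_model(text), text) for text in third_texts]

# Hypothetical downstream use: train an initial end-to-end model on the
# synthetic pairs, then keep only its encoder as the second coding network.
#   e2e_model = train_end_to_end(initial_model, build_third_pairs(texts, tts_model))
#   second_coding_network = e2e_model.encoder
```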
6. The speech recognition method of claim 4, wherein the determining a joint coding network based on the first coding network and the second coding network comprises:
determining the joint coding network based on the first coding network, the second coding network, and a fusion network;
wherein the fusion network is configured to determine a fusion weight based on a first output of the first coding network and a second output of the second coding network, and to perform feature fusion on the first output and the second output based on the fusion weight.
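Claim 6's fusion network derives a data-dependent weight from the two encoder outputs and mixes them. One common realization, sketched below, is a sigmoid gate over the concatenated features; the gating formulation is an assumption — the claim only requires that a fusion weight be computed from the two outputs and used for feature fusion.

```python
import torch
import torch.nn as nn

class FusionNetwork(nn.Module):
    """Derives a fusion weight from the first and second outputs and
    fuses them as a convex combination (one possible realization)."""
    def __init__(self, dim=256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, first_out, second_out):   # each: (batch, frames, dim)
        w = self.gate(torch.cat([first_out, second_out], dim=-1))  # fusion weight
        return w * first_out + (1.0 - w) * second_out              # feature fusion

# Example: fuse two encoder outputs of shape (2, 50, 256)
fused = FusionNetwork(dim=256)(torch.randn(2, 50, 256), torch.randn(2, 50, 256))
```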
7. The speech recognition method of claim 4, wherein the determining the speech recognition model based on the trained joint coding network comprises:
connecting at least two decoding networks after the trained joint coding network to obtain the speech recognition model;
wherein the at least two decoding networks are determined based on different training frameworks.
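Claim 7 attaches at least two decoding networks, from different training frameworks, to the shared trained joint coding network. The sketch below assumes a common hybrid pairing — a CTC branch and an attention-based branch — purely as an illustration; the claim itself does not name specific frameworks.

```python
import torch
import torch.nn as nn

class MultiDecoderASR(nn.Module):
    """A joint coding network followed by two decoding networks trained
    under different frameworks (here CTC and attention, as one example)."""
    def __init__(self, joint_encoder, dim=256, vocab=1000):
        super().__init__()
        self.encoder = joint_encoder                 # assumed to emit (B, T, dim)
        self.ctc_head = nn.Linear(dim, vocab)        # CTC-framework branch
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.att_decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.emb = nn.Embedding(vocab, dim)
        self.out = nn.Linear(dim, vocab)             # attention-framework branch

    def forward(self, feats, prev_tokens):
        enc = self.encoder(feats)                    # shared joint-encoder output
        ctc_logits = self.ctc_head(enc)              # per-frame logits for CTC loss
        dec = self.att_decoder(self.emb(prev_tokens), enc)
        att_logits = self.out(dec)                   # per-token logits for CE loss
        return ctc_logits, att_logits

# Example with an identity "encoder" standing in for the trained joint network
model = MultiDecoderASR(joint_encoder=nn.Identity())
ctc_logits, att_logits = model(torch.randn(2, 50, 256), torch.randint(0, 1000, (2, 10)))
```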
8. A speech recognition apparatus, comprising:
a speech determination unit, configured to determine a speech to be recognized;
a speech recognition unit, configured to determine a recognition text of the speech to be recognized based on a speech recognition model;
wherein the speech recognition model is obtained by training with a first speech-text pair on the basis of a first coding network, and the first coding network is obtained by jointly training a speech coding network and a text coding network with a second speech-text pair;
the speech to be recognized and the first speech-text pair belong to a first language, and the second speech-text pair belongs to a language other than the first language.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the speech recognition method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the speech recognition method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210393911.XA | 2022-04-14 | 2022-04-14 | Voice recognition method and device, electronic equipment and storage medium
Publications (1)
Publication Number | Publication Date |
---|---|
CN114783426A (en) | 2022-07-22
Family
ID=82429618
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210393911.XA (published as CN114783426A, pending) | Voice recognition method and device, electronic equipment and storage medium | 2022-04-14 | 2022-04-14
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114783426A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024183560A1 (en) * | 2023-03-03 | 2024-09-12 | 抖音视界有限公司 | Speech recognition method and apparatus, and electronic device |
Similar Documents
Publication | Title
---|---
CN111754976B (en) | Rhythm control voice synthesis method, system and electronic device
CN112735373B (en) | Speech synthesis method, device, equipment and storage medium
CN110491382B (en) | Speech recognition method and device based on artificial intelligence and speech interaction equipment
Kannan et al. | Large-scale multilingual speech recognition with a streaming end-to-end model
CN110706692B (en) | Training method and system of child voice recognition model
CN111833845B (en) | Multilingual speech recognition model training method, device, equipment and storage medium
CN112115687B (en) | Method for generating problem by combining triplet and entity type in knowledge base
CN109887484A (en) | A kind of speech recognition based on paired-associate learning and phoneme synthesizing method and device
CN109523989A (en) | Phoneme synthesizing method, speech synthetic device, storage medium and electronic equipment
CN111816169B (en) | Method and device for training Chinese and English hybrid speech recognition model
CN113539242A (en) | Speech recognition method, speech recognition device, computer equipment and storage medium
CN112837669B (en) | Speech synthesis method, device and server
CN111862934A (en) | Method for improving speech synthesis model and speech synthesis method and device
Han et al. | Self-supervised learning with cluster-aware-dino for high-performance robust speaker verification
CN112967713A (en) | Audio-visual voice recognition method, device, equipment and storage medium based on multi-modal fusion
CN114550703A (en) | Training method and device of voice recognition system, and voice recognition method and device
CN115394287A (en) | Mixed language voice recognition method, device, system and storage medium
Qu et al. | LipSound: Neural Mel-Spectrogram Reconstruction for Lip Reading.
CN112185340B (en) | Speech synthesis method, speech synthesis device, storage medium and electronic equipment
CN114999443A (en) | Voice generation method and device, storage medium and electronic equipment
CN117877460A (en) | Speech synthesis method, device, speech synthesis model training method and device
CN111653270B (en) | Voice processing method and device, computer readable storage medium and electronic equipment
CN114783426A (en) | Voice recognition method and device, electronic equipment and storage medium
CN113593534B (en) | Method and device for multi-accent speech recognition
CN115966197A (en) | Speech synthesis method, speech synthesis device, electronic equipment and storage medium
Legal Events
Code | Title
---|---
PB01 | Publication
SE01 | Entry into force of request for substantive examination