CN111653275A - Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method - Google Patents
- Publication number
- CN111653275A CN202010253075.6A
- Authority
- CN
- China
- Prior art keywords
- model
- output
- input
- sequence
- layer
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
Landscapes
- Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Evolutionary Computation (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a method and a device for constructing a speech recognition model based on LSTM-CTC tail convolution and a speech recognition method. The LSTM is used for training a voice recognition model, the CTC is used as a loss function, and the convolution layer is used for parallelizing calculation which needs to be carried out simultaneously by an original full-connection layer. The LSTM-CTC network based on the convolutional layer utilizes the characteristic of parallel computation of convolutional kernels, so that the original computation of the fully-connected layer does not need to be input into a memory at the same time, and the optimization of the network is accelerated. Compared with the prior art, the method and the device have the advantages that the training of the voice model is accelerated, the time cost of a developer is reduced, and the requirement standard of hardware is reduced to a certain extent.
Description
Technical Field
The invention relates to the field of voice recognition, in particular to a method and a device for constructing a voice recognition model based on LSTM-CTC tail convolution and a voice recognition method.
Background
Speech recognition is a technology that lets a machine convert a speech signal into the corresponding text or command through a process of recognition and understanding. In recent years, with the surge of interest in artificial intelligence, speech recognition technology has developed rapidly, and speech recognition models have been updated and optimized several times. Typical models include Hidden Markov Models (HMMs), Deep Neural Networks (DNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory networks (LSTMs).
Among them, the long short-term memory network with CTC as the loss function (LSTM-CTC) is widely used for speech recognition because it is easy to train, decodes efficiently, and performs well.
In the process of implementing the present invention, the inventor of the present application found that the prior-art method has at least the following technical problem:
although LSTM-CTC has many advantages, the time-step dependence of the LSTM makes network training difficult to parallelize, so training is very time-consuming, and the hardware requirements of the machine also increase to some extent.
Therefore, the prior art has the technical problem of long model training time.
Disclosure of Invention
The invention provides a method and a device for constructing a speech recognition model based on LSTM-CTC tail convolution and a speech recognition method, which are used for solving or at least partially solving the technical problem of long model training time in the method in the prior art.
In order to solve the above technical problems, a first aspect of the present invention provides a method for constructing a speech recognition model based on LSTM-CTC tail convolution, including:
S1: acquiring training data;
S2: constructing a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the LSTM layers are used for extracting a hidden state sequence with the same length as the input feature sequence, the full convolution layer is used for reducing the rank of and classifying the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain a class prediction;
S3: inputting the acquired training data into the neural network model, training the neural network model with a CTC loss function, judging from the CTC loss function whether the model is optimal, and stopping training when the model is optimal, the trained model being used as the speech recognition model.
In one embodiment, S1 specifically includes:
FBank features extracted from the speech data are used as training data.
In one embodiment, S3 specifically includes:
S3.1: calculating the forward variable α(t, u), where α(t, u) is the sum of the probabilities of all paths whose output has length t and which yield the sequence l after mapping, as follows:
$$\alpha(t,u) = y^{t}_{l'_u} \sum_{i=f(u)}^{u} \alpha(t-1,i), \qquad f(u) = \begin{cases} u-1, & l'_u = b \ \text{or} \ l'_{u-2} = l'_u \\ u-2, & \text{otherwise} \end{cases}$$
where u denotes the length of the sequence, $y^{t}_{b}$ denotes the probability that the output at time t is the blank character b, and $l'_u$ denotes the label output at time step t;
S3.2: calculating the backward variable β(t, u), which is the sum of the probabilities of all paths π' that, starting at time t + 1 and appended to the forward variable α(t, u), yield the sequence l after the final mapping, as follows:
$$\beta(t,u) = \sum_{i=u}^{g(u)} \beta(t+1,i)\, y^{t+1}_{l'_i}, \qquad g(u) = \begin{cases} u+1, & l'_u = b \ \text{or} \ l'_{u+2} = l'_u \\ u+2, & \text{otherwise} \end{cases}$$
where u denotes the length of the sequence, $y^{t+1}_{b}$ denotes the probability that the output at time t + 1 is the blank character, and $l'_u$ denotes the label output at time step t;
S3.3: obtaining the CTC loss function L(x, z) from the forward and backward variables, as follows:
$$L(x,z) = -\ln p(z \mid x) = -\ln \sum_{u=1}^{|z'|} \alpha(t,u)\,\beta(t,u)$$
S3.4: training the model with the stochastic gradient descent algorithm, and calculating the gradient of the loss function with respect to the network output:
$$\frac{\partial L(x,z)}{\partial y^{t}_{k}} = -\frac{1}{p(z \mid x)\, y^{t}_{k}} \sum_{u \in B(z,k)} \alpha(t,u)\,\beta(t,u)$$
where B(z, k) is the set of all positions at which label k appears in the sequence z', $y^{t}_{k}$ denotes the probability of outputting character k at time t, p(z | x) denotes the posterior probability of the label z given the input x, x denotes the training data, and z denotes the text corresponding to the speech, i.e., the label;
s3.5: and judging whether the model reaches the optimum according to the output of the loss function, and stopping training when the model reaches the optimum to obtain a trained model.
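As a concrete illustration of S3.1 and S3.3, the forward variable and the loss can be computed with the standard CTC recursion over the blank-extended label sequence. The following is a minimal pure-Python sketch, not the patented implementation; the function names, the use of index 0 for the blank and the 0-based indexing of t and u are assumptions made for the example.

```python
import math


def ctc_forward(probs, labels, blank=0):
    """Return (alpha, p(z|x)) for per-frame class probabilities `probs`.

    probs[t][k] is the network output probability of class k at frame t.
    labels is the target label sequence without blanks.
    """
    # Blank-extended sequence: blank, l1, blank, l2, ..., blank
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]
    T, U = len(probs), len(ext)

    alpha = [[0.0] * U for _ in range(T)]
    # A valid path starts with a blank or with the first label.
    alpha[0][0] = probs[0][ext[0]]
    if U > 1:
        alpha[0][1] = probs[0][ext[1]]

    for t in range(1, T):
        for u in range(U):
            total = alpha[t - 1][u]
            if u >= 1:
                total += alpha[t - 1][u - 1]
            # Skipping the intermediate blank is allowed only between
            # two different non-blank labels.
            if u >= 2 and ext[u] != blank and ext[u] != ext[u - 2]:
                total += alpha[t - 1][u - 2]
            alpha[t][u] = probs[t][ext[u]] * total

    # A valid path ends in the last blank or in the last label.
    p = alpha[T - 1][U - 1] + (alpha[T - 1][U - 2] if U > 1 else 0.0)
    return alpha, p


def ctc_loss(probs, labels, blank=0):
    """Negative log-likelihood -ln p(z|x)."""
    _, p = ctc_forward(probs, labels, blank)
    return -math.log(p)


# Two frames, two classes (0 = blank, 1 = a label), uniform outputs.
probs = [[0.5, 0.5], [0.5, 0.5]]
_, p = ctc_forward(probs, [1])
print(p, ctc_loss(probs, [1]))
```

In this toy case three of the four possible paths, (1, 1), (1, blank) and (blank, 1), map to the label sequence [1], which is what the recursion sums over.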
Based on the same inventive concept, the second aspect of the present invention provides an apparatus for constructing a speech recognition model based on LSTM-CTC tail convolution, comprising:
the training data acquisition module is used for acquiring training data;
the model building module is used for building a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the LSTM layer is used for extracting a hidden state sequence with the same length as the input characteristic sequence, the full convolution layer is used for reducing the rank and classifying the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain class prediction;
and the model training module is used for inputting the acquired training data into the neural network model, training the neural network model by adopting a CTC loss function, judging whether the model is optimal or not according to the CTC loss function, and stopping training when the model is optimal to obtain a trained model serving as a voice recognition model.
Based on the same inventive concept, a third aspect of the present invention provides a speech recognition method, comprising:
and after feature extraction is carried out on the voice data to be recognized, inputting the voice data to be recognized into the voice recognition model constructed in the first aspect to obtain a voice recognition result.
In one embodiment, the recognition process of the speech recognition model includes:
s1: extracting a hidden state sequence with the same length as the input characteristic sequence through an LSTM layer;
s2: the method comprises the steps of performing rank reduction and classification on an input hidden state sequence through a full convolution layer;
s3: the output of the full convolution layer is mapped by the Softmax layer to obtain a class prediction.
In one embodiment, the LSTM layer involves the input word $x_t$ at time t, the cell state $C_t$, the temporary cell state $\tilde{C}_t$, the hidden state $h_t$, the forget gate $f_t$, the input (memory) gate $i_t$ and the output gate $o_t$, and extracting a hidden state sequence with the same length as the input feature sequence through the LSTM layer comprises:
S1.1: calculating the forget gate, selecting the information to be forgotten:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
where the input is the hidden state $h_{t-1}$ at the previous time and the input word $x_t$ at the current time, the output is $f_t$, and $W_f$, $b_f$ are respectively the weight matrix and offset of the forget gate;
S1.2: calculating the input gate, selecting the information to be memorized:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
where the input is the hidden state $h_{t-1}$ at the previous time and the input word $x_t$ at the current time, the output is the value $i_t$ of the memory gate and the temporary cell state $\tilde{C}_t$, $W_i$, $b_i$ are respectively the weight matrix and offset of the input gate, and $W_C$, $b_C$ are respectively the weight matrix and offset of the temporary cell state;
S1.3: calculating the cell state at the current time:
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$
where the input is the value $i_t$ of the memory gate, the value $f_t$ of the forget gate, the temporary cell state $\tilde{C}_t$ and the cell state $C_{t-1}$ at the previous time, and the output is the cell state $C_t$ at the current time;
S1.4: calculating the output gate and the hidden state at the current time:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t * \tanh(C_t)$$
where the input is the hidden state $h_{t-1}$ at the previous time, the input word $x_t$ at the current time and the cell state $C_t$ at the current time, and the output is the value $o_t$ of the output gate and the hidden state $h_t$;
S1.5: finally, the hidden state sequence $\{h_0, h_1, \ldots, h_{n-1}\}$ with the same length as the input feature sequence is obtained through calculation.
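Steps S1.1 to S1.5 can be sketched as a single-layer LSTM written with plain Python lists. This is only an illustration of the standard LSTM recurrence; the helper names, the parameter dictionary keys (`Wf`, `bf`, and so on) and the zero initial states are assumptions made for the example, not details taken from the patent.

```python
import math


def sigmoid(v):
    """Element-wise logistic function."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def affine(W, b, v):
    """Compute W.v + b for a matrix W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One time step: returns the new hidden state and cell state."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f = sigmoid(affine(params["Wf"], params["bf"], z))            # S1.1 forget gate
    i = sigmoid(affine(params["Wi"], params["bi"], z))            # S1.2 input gate
    c_tilde = [math.tanh(a)
               for a in affine(params["Wc"], params["bc"], z)]    # temporary cell state
    c = [ft * cp + it * ct
         for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)]        # S1.3 cell state
    o = sigmoid(affine(params["Wo"], params["bo"], z))            # S1.4 output gate
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]              # hidden state
    return h, c


def lstm_layer(xs, params, hidden_size):
    """S1.5: run over the whole input, one hidden state per input frame."""
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    hs = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, params)
        hs.append(h)
    return hs


def zero_params(hidden_size, input_size):
    """Hypothetical helper: all-zero weights, useful only for a smoke test."""
    rows = [[0.0] * (hidden_size + input_size) for _ in range(hidden_size)]
    p = {}
    for name in ("f", "i", "c", "o"):
        p["W" + name] = [row[:] for row in rows]
        p["b" + name] = [0.0] * hidden_size
    return p


xs = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9]]
hs = lstm_layer(xs, zero_params(2, 3), hidden_size=2)
print(len(hs) == len(xs))
```

The loop in `lstm_layer` makes the sequential dependence visible: each step needs the hidden and cell states of the previous step, which is why the recurrence itself cannot be computed for all frames at once.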
In one embodiment, S3 specifically includes: the features output by the full convolution layer are expressed as relative probabilities among the different classes to obtain the final class prediction,
$$S_i = \frac{e^{V_i}}{\sum_{j=1}^{N} e^{V_j}}$$
where i denotes the i-th class, N denotes the total number of classes, $V_i$ denotes the value of the i-th class before softmax processing, and $S_i$ denotes the probability value of the i-th class after softmax processing.
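The softmax mapping described here can be written in a few lines. The sketch below subtracts the maximum value before exponentiating, which is a common numerical-stability measure and an addition of this example rather than something stated in the patent; the function name is likewise hypothetical.

```python
import math


def softmax(values):
    """Map N class values V_1..V_N to probabilities S_1..S_N that sum to 1."""
    shift = max(values)  # subtracting a constant leaves the result unchanged
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
print(probs, sum(probs))
```

The class with the largest input value always receives the largest probability, so taking the arg-max of the softmax output gives the class prediction for a frame.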
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
the invention provides a method for constructing a speech recognition model based on LSTM-CTC tail convolution, wherein a constructed neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the full convolution layer is adopted to replace a full connection layer between the LSTM layer and the Softmax layer in the traditional scheme, compared with the existing full connection layer, a convolution kernel is used for calculating in the convolution layer, and the calculation of the convolution kernel is parallel, so that the training time of the model can be reduced.
Based on the constructed speech recognition model, the invention also provides a speech recognition method based on the model, thereby improving the speech recognition efficiency.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative work.
FIG. 1 is a schematic flow chart of an implementation of a method for constructing a speech recognition model based on LSTM-CTC tail convolution according to the present invention;
FIG. 2 is a flow chart of a LSTM-CTC model provided by an embodiment of the present invention;
FIG. 3 is a block diagram of the construction device of the speech recognition model based on LSTM-CTC tail convolution according to the present invention;
FIG. 4 is a flow chart of the operation of speech recognition using the speech recognition model of the present invention.
Detailed Description
Through extensive research and practice, the inventor of the present application found the following: according to prior knowledge, a long short-term memory network depends on the prediction of the previous time point during backpropagation, so the three gates and the memory cell cannot be computed in parallel. This makes the LSTM very time-consuming to train, and the temporal nature of the LSTM makes the network very difficult to parallelize. On this basis, the present invention aims to reduce the training time of the speech recognition model by modifying the network structure of the LSTM-CTC.
In order to achieve the above object, the main concept of the present invention is as follows:
the invention provides a method for constructing a speech recognition model based on LSTM-CTC (Long Short-Term Memory, Connectionist Temporal Classification), which replaces the fully-connected layer between the BiLSTM layer and the softmax layer with a full convolution layer in order to accelerate network training. The LSTM is used for training the speech recognition model, CTC is used as the loss function, and the convolution layer parallelizes the calculation that the original fully-connected layer has to carry out all at once. The convolution-based LSTM-CTC network exploits the parallel computation of convolution kernels, so that the computation of the original fully-connected layer no longer needs to be loaded into memory all at once, which accelerates the optimization of the network. Compared with the prior art, the method and the device accelerate the training of the speech model, reduce the time cost for the developer, and lower the hardware requirements to a certain extent.
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Example one
The embodiment provides a method for constructing a speech recognition model based on LSTM-CTC tail convolution, please refer to fig. 1, and the method includes:
S1: acquiring training data;
S2: constructing a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the LSTM layers are used for extracting a hidden state sequence with the same length as the input feature sequence, the full convolution layer is used for reducing the rank of and classifying the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain a class prediction;
S3: inputting the acquired training data into the neural network model, training the neural network model with a CTC loss function, judging from the CTC loss function whether the model is optimal, and stopping training when the model is optimal, the trained model being used as the speech recognition model.
Specifically, the training data in S1 may be acquired by speech recognition.
In the S2, a neural network model framework is constructed, the invention innovatively replaces the full-connection layer between the LSTM layer and the softmax layer with the convolutional layer, and the efficiency of model training is improved through the parallel calculation of the convolutional layer.
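To make the replacement of the fully-connected layer concrete, the sketch below applies a one-dimensional convolution along the time axis of the hidden state sequence. The patent text does not state the kernel width; the example assumes a width of 1, in which case the convolution produces the same per-frame outputs as a fully-connected layer while every output frame can be computed independently. All function names and the kernel layout are assumptions made for the illustration.

```python
def dense(h, W, b):
    """Fully-connected layer applied to one hidden state vector."""
    return [sum(w * a for w, a in zip(row, h)) + bi for row, bi in zip(W, b)]


def conv1d_time(hs, kernel, bias):
    """'Valid' 1-D convolution over time.

    hs[t][d]        hidden state at frame t, feature d
    kernel[c][k][d] weight for output channel c, tap k, input feature d
    """
    width = len(kernel[0])
    out = []
    for t in range(len(hs) - width + 1):  # each t is independent of the others
        frame = []
        for c, taps in enumerate(kernel):
            s = bias[c]
            for k in range(width):
                s += sum(w * a for w, a in zip(taps[k], hs[t + k]))
            frame.append(s)
        out.append(frame)
    return out


# Three frames of 2-dimensional hidden states mapped to 3 classes.
hs = [[1, 2], [3, 4], [5, 6]]
W = [[1, 0], [0, 1], [1, 1]]
b = [0, 0, 0]

# A width-1 kernel is the dense weight matrix with one tap per channel.
kernel = [[row] for row in W]
print(conv1d_time(hs, kernel, b) == [dense(h, W, b) for h in hs])
```

Because the outer loop over `t` carries no state from one frame to the next, this layer, unlike the LSTM recurrence before it, can be evaluated for all frames in parallel.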
The CTC (Connectionist Temporal Classification) used in S3 allows the model to be trained directly on sequences. CTC introduces a new loss function that can be trained directly on unsegmented sequences.
In one embodiment, S1 specifically includes:
FBank features extracted from the speech data are used as training data.
Specifically, the FBank feature of the audio can be obtained by acquiring voice data through an audio input device and then by audio front-end processing.
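The patent does not describe the front-end processing in detail. As a rough sketch of what FBank extraction usually involves, the code below builds triangular mel filters and applies them to the power spectrum of one frame. The mel formula 2595·log10(1 + f/700), the 26 filters, the 512-point FFT and the 16 kHz sampling rate are common conventions assumed for the example, and the power spectrum is taken as already computed.

```python
import math


def hz_to_mel(f):
    return 2595.0 * math.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters over the n_fft // 2 + 1 FFT bins."""
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 points equally spaced on the mel scale, mapped to bin indices.
    points = [mel_to_hz(mel_max * i / (n_filters + 1))
              for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * f / sample_rate)) for f in points]

    bank = [[0.0] * n_bins for _ in range(n_filters)]
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):      # rising edge
            bank[m - 1][k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge, peak of 1.0 at center
            bank[m - 1][k] = (right - k) / (right - center)
    return bank


def fbank(power_spectrum, bank, floor=1e-10):
    """Log mel filterbank energies for one frame."""
    return [math.log(max(sum(w * p for w, p in zip(row, power_spectrum)), floor))
            for row in bank]


bank = mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000)
frame_power = [1.0] * 257  # placeholder spectrum for illustration only
features = fbank(frame_power, bank)
print(len(features))
```

One such feature vector per frame forms the input feature sequence that is fed to the first LSTM layer.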
In one embodiment, S3 specifically includes:
S3.1: calculating the forward variable α(t, u), where α(t, u) is the sum of the probabilities of all paths whose output has length t and which yield the sequence l after mapping, as follows:
$$\alpha(t,u) = y^{t}_{l'_u} \sum_{i=f(u)}^{u} \alpha(t-1,i), \qquad f(u) = \begin{cases} u-1, & l'_u = b \ \text{or} \ l'_{u-2} = l'_u \\ u-2, & \text{otherwise} \end{cases}$$
where u denotes the length of the sequence, $y^{t}_{b}$ denotes the probability that the output at time t is the blank character b, and $l'_u$ denotes the label output at time step t;
S3.2: calculating the backward variable β(t, u), which is the sum of the probabilities of all paths π' that, starting at time t + 1 and appended to the forward variable α(t, u), yield the sequence l after the final mapping, as follows:
$$\beta(t,u) = \sum_{i=u}^{g(u)} \beta(t+1,i)\, y^{t+1}_{l'_i}, \qquad g(u) = \begin{cases} u+1, & l'_u = b \ \text{or} \ l'_{u+2} = l'_u \\ u+2, & \text{otherwise} \end{cases}$$
where u denotes the length of the sequence, $y^{t+1}_{b}$ denotes the probability that the output at time t + 1 is the blank character, and $l'_u$ denotes the label output at time step t;
S3.3: obtaining the CTC loss function L(x, z) from the forward and backward variables, as follows:
$$L(x,z) = -\ln p(z \mid x) = -\ln \sum_{u=1}^{|z'|} \alpha(t,u)\,\beta(t,u)$$
S3.4: training the model with the stochastic gradient descent algorithm, and calculating the gradient of the loss function with respect to the network output:
$$\frac{\partial L(x,z)}{\partial y^{t}_{k}} = -\frac{1}{p(z \mid x)\, y^{t}_{k}} \sum_{u \in B(z,k)} \alpha(t,u)\,\beta(t,u)$$
where B(z, k) is the set of all positions at which label k appears in the sequence z', $y^{t}_{k}$ denotes the probability of outputting character k at time t, p(z | x) denotes the posterior probability of the label z given the input x, x denotes the training data, and z denotes the text corresponding to the speech, i.e., the label;
s3.5: and judging whether the model reaches the optimum according to the output of the loss function, and stopping training when the model reaches the optimum to obtain a trained model.
Specifically, CTC is used as a loss function, a Stochastic Gradient Descent (SGD) algorithm is adopted to train the network, whether a model is optimal or not is measured through the loss function, if the model is optimal, the training is stopped, and if the model is not optimal, the next training and optimization of the network are guided by matching with the Stochastic gradient descent algorithm.
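The text does not say how "optimal" is decided from the loss. One common reading is to stop when the loss has stopped improving; the sketch below assumes that reading and uses a patience counter. The names `train_until_converged`, `run_epoch`, `patience` and `min_delta` are hypothetical, and `run_epoch` stands in for one pass of stochastic gradient descent that returns the CTC loss.

```python
def train_until_converged(run_epoch, max_epochs=100, patience=3, min_delta=1e-4):
    """Call run_epoch(epoch) until the loss stops improving.

    Returns the best loss seen and the full loss history.
    """
    best = float("inf")
    stale = 0
    history = []
    for epoch in range(max_epochs):
        loss = run_epoch(epoch)
        history.append(loss)
        if best - loss > min_delta:
            best, stale = loss, 0  # improvement: keep training
        else:
            stale += 1
            if stale >= patience:  # no improvement for `patience` epochs
                break
    return best, history


# Stand-in for real training: a recorded loss curve that flattens out.
recorded = [5.0, 4.0, 3.0, 3.0, 3.0, 3.0, 3.0]
best, history = train_until_converged(lambda epoch: recorded[epoch])
print(best, len(history))
```

In practice the loss would usually be measured on held-out data rather than on the training set, so that the stopping point reflects generalization rather than memorization.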
Referring to fig. 2, which is a flow chart of the speech recognition model, training data is first input, and then the network structure is constructed: two LSTM layers (LSTM1 and LSTM2), a full convolution layer and a Softmax layer. After the structure of the model is determined, the model is trained with the CTC loss function, and the speech recognition model is finally obtained.
Compared with the prior art, the invention has the following advantages and beneficial effects: the time cost of network training is saved, and the hardware requirement of the network training is reduced to a certain extent.
Example two
Based on the same inventive concept, the embodiment provides a device for constructing a speech recognition model based on LSTM-CTC tail convolution, please refer to fig. 3, the device includes:
a training data acquisition module 201, configured to acquire training data;
the model building module 202 is configured to build a neural network model, where the neural network model includes two LSTM layers, a full convolution layer and a Softmax layer, where the LSTM layer is used to extract a hidden state sequence with the same length as an input feature sequence, the full convolution layer is used to reduce the rank and classify the input hidden state sequence, and the Softmax layer is used to map the output of the full convolution layer to obtain a category prediction;
and the model training module 203 is used for inputting the acquired training data into the neural network model, training the neural network model by adopting a CTC loss function, judging whether the model is optimal according to the CTC loss function, and stopping training when the model is optimal to obtain a trained model serving as a voice recognition model.
Since the device described in the second embodiment of the present invention is a device used for implementing the method for constructing the speech recognition model based on the LSTM-CTC tail convolution in the first embodiment of the present invention, a person skilled in the art can understand the specific structure and deformation of the device based on the method described in the first embodiment of the present invention, and thus details are not described herein. All the devices adopted in the method of the first embodiment of the present invention belong to the protection scope of the present invention.
EXAMPLE III
Based on the same inventive concept, the embodiment provides a speech recognition method, including:
and after feature extraction is carried out on the voice data to be recognized, inputting the voice data to be recognized into the voice recognition model constructed in the first embodiment to obtain a voice recognition result.
In one embodiment, the recognition process of the speech recognition model includes:
s1: extracting a hidden state sequence with the same length as the input characteristic sequence through an LSTM layer;
s2: the method comprises the steps of performing rank reduction and classification on an input hidden state sequence through a full convolution layer;
s3: the output of the full convolution layer is mapped by the Softmax layer to obtain a class prediction.
In one embodiment, the LSTM layer involves the input word $x_t$ at time t, the cell state $C_t$, the temporary cell state $\tilde{C}_t$, the hidden state $h_t$, the forget gate $f_t$, the input (memory) gate $i_t$ and the output gate $o_t$, and extracting a hidden state sequence with the same length as the input feature sequence through the LSTM layer comprises:
S1.1: calculating the forget gate, selecting the information to be forgotten:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
where the input is the hidden state $h_{t-1}$ at the previous time and the input word $x_t$ at the current time, the output is $f_t$, and $W_f$, $b_f$ are respectively the weight matrix and offset of the forget gate;
S1.2: calculating the input gate, selecting the information to be memorized:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
where the input is the hidden state $h_{t-1}$ at the previous time and the input word $x_t$ at the current time, the output is the value $i_t$ of the memory gate and the temporary cell state $\tilde{C}_t$, $W_i$, $b_i$ are respectively the weight matrix and offset of the input gate, and $W_C$, $b_C$ are respectively the weight matrix and offset of the temporary cell state;
S1.3: calculating the cell state at the current time:
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$
where the input is the value $i_t$ of the memory gate, the value $f_t$ of the forget gate, the temporary cell state $\tilde{C}_t$ and the cell state $C_{t-1}$ at the previous time, and the output is the cell state $C_t$ at the current time;
S1.4: calculating the output gate and the hidden state at the current time:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t * \tanh(C_t)$$
where the input is the hidden state $h_{t-1}$ at the previous time, the input word $x_t$ at the current time and the cell state $C_t$ at the current time, and the output is the value $o_t$ of the output gate and the hidden state $h_t$;
S1.5: finally, the hidden state sequence $\{h_0, h_1, \ldots, h_{n-1}\}$ with the same length as the input feature sequence is obtained through calculation.
Specifically, S1.1 to S1.5 describe the implementation of the LSTM layer in detail. The two LSTM layers have the same function. Using multiple LSTM layers deepens the network and can strengthen the expressive power of the model, but because of the vanishing-gradient problem, two LSTM layers are chosen for training and prediction.
In one embodiment, S3 specifically includes: the features output by the full convolution layer are expressed as relative probabilities among the different classes to obtain the final class prediction,
$$S_i = \frac{e^{V_i}}{\sum_{j=1}^{N} e^{V_j}}$$
where i denotes the i-th class, N denotes the total number of classes, $V_i$ denotes the value of the i-th class before softmax processing, and $S_i$ denotes the probability value of the i-th class after softmax processing.
Referring to fig. 4, which is a flowchart of performing speech recognition with the speech recognition model: the FBank features extracted from the training speech are used for model training, and the resulting decoding model is the final speech recognition model; the speech to be recognized, or the test speech, is then input into the decoding model to obtain the final recognition result, i.e., the recognized text.
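The patent does not specify how the per-frame class predictions are turned into the recognized text. A simple and widely used option for CTC-trained models is best-path (greedy) decoding: take the most probable class in each frame, merge consecutive repeats, then drop blanks. The sketch below assumes that approach; the function name and the use of index 0 for the blank are choices made for the example.

```python
def ctc_greedy_decode(probs, blank=0):
    """Best-path decoding: arg-max per frame, merge repeats, remove blanks."""
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    decoded = []
    previous = None
    for k in best_path:
        if k != previous and k != blank:
            decoded.append(k)
        previous = k
    return decoded


def one_hot(index, size):
    """Hypothetical helper that builds an idealized softmax output."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Frame-level path 1 1 blank 1 2 2 blank over classes {0: blank, 1, 2}
frames = [one_hot(k, 3) for k in (1, 1, 0, 1, 2, 2, 0)]
print(ctc_greedy_decode(frames))  # [1, 1, 2]
```

The blank between the two runs of class 1 is what lets the decoder output the label 1 twice; without it the repeated frames would be merged into a single label.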
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various modifications and variations can be made in the embodiments of the present invention without departing from the spirit or scope of the embodiments of the invention. Thus, if such modifications and variations of the embodiments of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to encompass these modifications and variations.
Claims (8)
1. The method for constructing the voice recognition model based on the LSTM-CTC tail convolution is characterized by comprising the following steps of:
S1: acquiring training data;
S2: constructing a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer, the LSTM layers are used for extracting a hidden state sequence with the same length as the input feature sequence, the full convolution layer is used for reducing the rank of and classifying the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain a class prediction;
S3: inputting the acquired training data into the neural network model, training the neural network model with a CTC loss function, judging from the CTC loss function whether the model is optimal, and stopping training when the model is optimal, the trained model being used as the speech recognition model.
2. The method of claim 1, wherein S1 specifically comprises:
FBank features extracted from the speech data are used as training data.
3. The method of claim 1, wherein S3 specifically comprises:
S3.1: calculating the forward variable α(t, u), where α(t, u) is the sum of the probabilities of all paths whose output has length t and which yield the sequence l after mapping, as follows:
$$\alpha(t,u) = y^{t}_{l'_u} \sum_{i=f(u)}^{u} \alpha(t-1,i), \qquad f(u) = \begin{cases} u-1, & l'_u = b \ \text{or} \ l'_{u-2} = l'_u \\ u-2, & \text{otherwise} \end{cases}$$
where u denotes the length of the sequence, $y^{t}_{b}$ denotes the probability that the output at time t is the blank character b, and $l'_u$ denotes the label output at time step t;
S3.2: calculating the backward variable β(t, u), which is the sum of the probabilities of all paths π' that, starting at time t + 1 and appended to the forward variable α(t, u), yield the sequence l after the final mapping, as follows:
$$\beta(t,u) = \sum_{i=u}^{g(u)} \beta(t+1,i)\, y^{t+1}_{l'_i}, \qquad g(u) = \begin{cases} u+1, & l'_u = b \ \text{or} \ l'_{u+2} = l'_u \\ u+2, & \text{otherwise} \end{cases}$$
where u denotes the length of the sequence, $y^{t+1}_{b}$ denotes the probability that the output at time t + 1 is the blank character, and $l'_u$ denotes the label output at time step t;
S3.3: obtaining the CTC loss function L(x, z) from the forward and backward variables, as follows:
$$L(x,z) = -\ln p(z \mid x) = -\ln \sum_{u=1}^{|z'|} \alpha(t,u)\,\beta(t,u)$$
S3.4: training the model with the stochastic gradient descent algorithm, and calculating the gradient of the loss function with respect to the network output:
$$\frac{\partial L(x,z)}{\partial y^{t}_{k}} = -\frac{1}{p(z \mid x)\, y^{t}_{k}} \sum_{u \in B(z,k)} \alpha(t,u)\,\beta(t,u)$$
where B(z, k) is the set of all positions at which label k appears in the sequence z', $y^{t}_{k}$ denotes the probability of outputting character k at time t, p(z | x) denotes the posterior probability of the label z given the input x, x denotes the training data, and z denotes the text corresponding to the speech, i.e., the label;
s3.5: and judging whether the model reaches the optimum according to the output of the loss function, and stopping training when the model reaches the optimum to obtain a trained model.
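For orientation, the widely used textbook form of the CTC forward recursion is $\alpha(t,u) = y^t_{l'_u} \sum_{i=f(u)}^{u} \alpha(t-1,i)$, with $f(u) = u-1$ if $l'_u$ is the blank or $l'_{u-2} = l'_u$ and $f(u) = u-2$ otherwise, and the loss is $L(x,z) = -\ln p(z|x)$. Whether this matches the claim's formulas term for term is an assumption. The sketch below implements that textbook recursion in plain Python and checks it against a brute-force sum over all paths; the blank index, function names and toy probabilities are hypothetical.

```python
import math
from itertools import product
from typing import List

BLANK = 0  # assumed index of the blank label


def ctc_loss(probs: List[List[float]], labels: List[int]) -> float:
    """Return -ln p(z|x) via the forward recursion over the blank-extended labels."""
    ext = [BLANK]
    for k in labels:
        ext += [k, BLANK]  # z' = blank, z1, blank, z2, ..., blank
    T, U = len(probs), len(ext)
    alpha = [[0.0] * U for _ in range(T)]
    alpha[0][0] = probs[0][ext[0]]
    if U > 1:
        alpha[0][1] = probs[0][ext[1]]
    for t in range(1, T):
        for u in range(U):
            total = alpha[t - 1][u]
            if u >= 1:
                total += alpha[t - 1][u - 1]
            # Skipping the blank is allowed only between two different labels.
            if u >= 2 and ext[u] != BLANK and ext[u] != ext[u - 2]:
                total += alpha[t - 1][u - 2]
            alpha[t][u] = probs[t][ext[u]] * total
    # Valid paths end on the last label or on the trailing blank.
    p = alpha[T - 1][U - 1] + (alpha[T - 1][U - 2] if U > 1 else 0.0)
    return -math.log(p)


def collapse(path) -> List[int]:
    """CTC mapping: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def brute_force_loss(probs: List[List[float]], labels: List[int]) -> float:
    """Sum the probability of every path that maps to labels (tiny inputs only)."""
    num_classes = len(probs[0])
    p = 0.0
    for path in product(range(num_classes), repeat=len(probs)):
        if collapse(path) == labels:
            p += math.prod(probs[t][s] for t, s in enumerate(path))
    return -math.log(p)


# Toy softmax outputs: T=4 frames, 3 classes (class 0 is the blank).
y = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.1, 0.4]]
loss_fast = ctc_loss(y, [1, 2])
loss_slow = brute_force_loss(y, [1, 2])
```

The two values agree up to rounding, while the recursion needs only on the order of T·|z'| operations instead of enumerating every path. The sketch works with raw probabilities; a practical implementation would typically work in the log domain to avoid underflow on long sequences.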
4. A device for constructing a voice recognition model based on LSTM-CTC tail convolution, characterized by comprising:
a training data acquisition module, used for acquiring training data;
a model building module, used for constructing a neural network model, wherein the neural network model comprises two LSTM layers, a full convolution layer and a Softmax layer; the LSTM layers are used for extracting a hidden state sequence of the same length as the input feature sequence, the full convolution layer is used for rank reduction and classification of the input hidden state sequence, and the Softmax layer is used for mapping the output of the full convolution layer to obtain a class prediction;
and a model training module, used for inputting the acquired training data into the neural network model, training the neural network model with a CTC loss function, judging from the CTC loss function whether the model is optimal, and stopping training when the model is optimal, to obtain a trained model that serves as the voice recognition model.
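The claim does not say how "optimal" is judged from the CTC loss. One common reading is patience-based early stopping: training halts once the loss has failed to improve for a fixed number of epochs. The sketch below illustrates that reading only; the function name, the `patience` and `min_delta` parameters and the loss values are all hypothetical.

```python
from typing import Callable, Dict


def train_with_early_stopping(
    run_epoch: Callable[[int], float],
    max_epochs: int = 100,
    patience: int = 3,
    min_delta: float = 1e-4,
) -> Dict[str, float]:
    """Call run_epoch(epoch) until the returned loss stops improving.

    run_epoch is expected to train for one epoch and return the CTC loss
    used to judge the model (for example, on held-out data).
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_run = 0
    for epoch in range(max_epochs):
        loss = run_epoch(epoch)
        epochs_run += 1
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch  # new best model
        elif epoch - best_epoch >= patience:
            break                                # no improvement for `patience` epochs
    return {"best_loss": best_loss, "best_epoch": best_epoch, "epochs_run": epochs_run}


# Stand-in for real training: a pre-recorded loss curve with made-up numbers.
fake_losses = [5.0, 3.0, 2.0, 1.5, 1.6, 1.55, 1.52, 1.7, 1.4]
result = train_with_early_stopping(lambda e: fake_losses[e], max_epochs=len(fake_losses))
```

With this curve the loop stops after the seventh epoch and reports the fourth epoch (index 3) as the best one. Note that it never sees the later value 1.4, which is the usual trade-off of a patience threshold.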
5. A speech recognition method, comprising:
performing feature extraction on the voice data to be recognized and inputting the extracted features into the voice recognition model constructed by the method according to any one of claims 1 to 3, to obtain a voice recognition result.
6. The method of claim 5, wherein the recognition process of the speech recognition model comprises:
S1: extracting a hidden state sequence of the same length as the input feature sequence through the LSTM layer;
S2: performing rank reduction and classification on the input hidden state sequence through the full convolution layer;
S3: mapping the output of the full convolution layer through the Softmax layer to obtain a class prediction.
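The per-frame class predictions still have to be turned into a label sequence. A minimal sketch of best-path (greedy) CTC decoding follows: take the most probable class in each frame, merge consecutive repeats, then drop blanks. The claims do not state which decoding rule is used, so this choice, the blank index and the toy posteriors are assumptions.

```python
from typing import List, Sequence

BLANK = 0  # assumed index of the blank class


def greedy_ctc_decode(posteriors: Sequence[Sequence[float]]) -> List[int]:
    """Best-path decoding: per-frame argmax, merge repeats, remove blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in posteriors]
    decoded, prev = [], None
    for k in best:
        if k != prev and k != BLANK:
            decoded.append(k)
        prev = k
    return decoded


# Toy Softmax outputs for 5 frames and 3 classes (class 0 is the blank).
frames = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
]
labels = greedy_ctc_decode(frames)
```

Here the per-frame argmax is 1, 1, blank, 1, 2. The first two frames merge into a single label 1, and the blank in between keeps the second 1 as a separate label.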
7. The method of claim 6, wherein the LSTM layer includes an input word $x_t$ at time t, a cell state $C_t$, a temporary cell state $\tilde{C}_t$, a hidden state $h_t$, a forget gate $f_t$, an input gate $i_t$ and an output gate $o_t$, and extracting a hidden state sequence of the same length as the input feature sequence through the LSTM layer comprises the following steps:
S1.1: calculating the forget gate, which selects the information to be forgotten:
$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
wherein the input is the hidden state $h_{t-1}$ at the previous time and the input word $x_t$ at the current time, the output is $f_t$, and $W_f$, $b_f$ are respectively the weight matrix and the bias of the forget gate;
S1.2: calculating the input gate, which selects the information to be memorized:
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
wherein the input is the hidden state $h_{t-1}$ at the previous time and the input word $x_t$ at the current time, the output is the value $i_t$ of the input (memory) gate and the temporary cell state $\tilde{C}_t$, $W_i$, $b_i$ are respectively the weight matrix and the bias of the input gate, and $W_C$, $b_C$ are respectively the weight matrix and the bias of the temporary cell state;
S1.3: calculating the cell state at the current time:
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$
wherein the input is the value $i_t$ of the input (memory) gate, the value $f_t$ of the forget gate, the temporary cell state $\tilde{C}_t$ and the cell state $C_{t-1}$ at the previous time, and the output is the cell state $C_t$ at the current time;
S1.4: calculating the output gate and the hidden state at the current time:
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$h_t = o_t * \tanh(C_t)$
wherein the input is the hidden state $h_{t-1}$ at the previous time, the input word $x_t$ at the current time and the cell state $C_t$ at the current time, and the output is the value $o_t$ of the output gate and the hidden state $h_t$;
S1.5: finally obtaining, by calculation, a hidden state sequence $\{h_0, h_1, \ldots, h_{n-1}\}$ of the same length as the input feature sequence.
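The gate computations in steps S1.1 to S1.5 can be sketched in plain Python as a single LSTM step applied along a sequence. The temporary-cell-state and cell-state updates follow the standard LSTM formulation, the initial states are assumed to be zero, and the parameter dictionary layout, function names and toy weights are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def affine(W: Matrix, b: Vector, v: Vector) -> Vector:
    """Compute W·v + b for a small dense matrix."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector, params: Dict) -> Tuple[Vector, Vector]:
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(params["Wf"], params["bf"], z)]          # forget gate
    i = [sigmoid(a) for a in affine(params["Wi"], params["bi"], z)]          # input gate
    c_tilde = [math.tanh(a) for a in affine(params["WC"], params["bC"], z)]  # temporary cell state
    c = [ft * cp + it * ct for ft, cp, it, ct in zip(f, c_prev, i, c_tilde)] # cell state
    o = [sigmoid(a) for a in affine(params["Wo"], params["bo"], z)]          # output gate
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]                         # hidden state
    return h, c


def lstm_sequence(xs: List[Vector], params: Dict, hidden_size: int) -> List[Vector]:
    """Run the cell over xs and return one hidden state per input frame."""
    h, c = [0.0] * hidden_size, [0.0] * hidden_size  # assumed zero initial states
    hidden_states = []
    for x_t in xs:
        h, c = lstm_step(x_t, h, c, params)
        hidden_states.append(h)
    return hidden_states


# Toy setup: input size 2, hidden size 1, every weight 0.5, every bias 0.
H, D = 1, 2
params = {}
for name in ("f", "i", "C", "o"):
    params["W" + name] = [[0.5] * (H + D) for _ in range(H)]
    params["b" + name] = [0.0] * H
hidden_states = lstm_sequence([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], params, H)
```

The returned list has exactly one hidden state per input frame, matching the requirement that the hidden state sequence has the same length as the input feature sequence.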
8. The method of claim 6, wherein S3 specifically comprises: expressing the features output by the full convolution layer as relative probabilities among the different classes to obtain the final class prediction:
$S_i = \frac{e^{V_i}}{\sum_{j=1}^{N} e^{V_j}}$
wherein i denotes the i-th class, N denotes the total number of classes, $V_i$ denotes the probability value of the i-th class, and $S_i$ denotes the probability value of the i-th class after softmax processing.
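A minimal sketch of the softmax mapping follows. Subtracting the maximum before exponentiating is a standard numerical safeguard that leaves the result unchanged; it is an implementation detail added here, not something stated in the claim, and the scores are made-up values.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Map N class values V_i to probabilities S_i that sum to 1."""
    shift = max(values)  # avoids overflow in exp() without changing the ratios
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]
probs = softmax(scores)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

The predicted class is simply the index with the largest probability, which is also the index of the largest input value because softmax preserves ordering.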
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010253075.6A CN111653275B (en) | 2020-04-02 | 2020-04-02 | Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111653275A (en) | 2020-09-11 |
CN111653275B (en) | 2022-06-03 |
Family
ID=72352085
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010253075.6A Active CN111653275B (en) | 2020-04-02 | 2020-04-02 | Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111653275B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190130578A1 (en) * | 2017-10-27 | 2019-05-02 | Siemens Healthcare Gmbh | Vascular segmentation using fully convolutional and recurrent neural networks |
CN109710922A (en) * | 2018-12-06 | 2019-05-03 | 深港产学研基地产业发展中心 | Text recognition method, device, computer equipment and storage medium |
US20190180188A1 (en) * | 2017-12-13 | 2019-06-13 | Cognizant Technology Solutions U.S. Corporation | Evolution of Architectures For Multitask Neural Networks |
US20190341052A1 (en) * | 2018-05-02 | 2019-11-07 | Simon Says, Inc. | Machine Learning-Based Speech-To-Text Transcription Cloud Intermediary |
CN110633646A (en) * | 2019-08-21 | 2019-12-31 | 数字广东网络建设有限公司 | Method and device for detecting image sensitive information, computer equipment and storage medium |
Non-Patent Citations (2)
Title |
---|
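Wu Bangyu et al.: "Chinese dialogue model using pinyin dimensionality reduction", Journal of Chinese Information Processing, no. 05, 15 May 2019 (2019-05-15) * |
Yang Yanfang et al.: "Acceleration gesture recognition based on a deep convolutional long short-term memory network", Electronic Measurement Technology, no. 21, 8 November 2019 (2019-11-08) * |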
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112235470A (en) * | 2020-09-16 | 2021-01-15 | 重庆锐云科技有限公司 | Incoming call client follow-up method, device and equipment based on voice recognition |
CN112233655A (en) * | 2020-09-28 | 2021-01-15 | 上海声瀚信息科技有限公司 | Neural network training method for improving voice command word recognition performance |
CN112802491A (en) * | 2021-02-07 | 2021-05-14 | 武汉大学 | Voice enhancement method for generating countermeasure network based on time-frequency domain |
CN112802491B (en) * | 2021-02-07 | 2022-06-14 | 武汉大学 | Voice enhancement method for generating confrontation network based on time-frequency domain |
CN113192489A (en) * | 2021-05-16 | 2021-07-30 | 金陵科技学院 | Paint spraying robot voice recognition method based on multi-scale enhancement BiLSTM model |
CN113808581A (en) * | 2021-08-17 | 2021-12-17 | 山东大学 | Chinese speech recognition method for acoustic and language model training and joint optimization |
CN113808581B (en) * | 2021-08-17 | 2024-03-12 | 山东大学 | Chinese voice recognition method based on acoustic and language model training and joint optimization |
CN115563508A (en) * | 2022-11-08 | 2023-01-03 | 北京百度网讯科技有限公司 | Model training method, device and equipment |
Also Published As
Publication number | Publication date |
---|---|
CN111653275B (en) | 2022-06-03 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||