CN114203161A - Speech recognition method, apparatus, device and storage medium - Google Patents
- Publication number: CN114203161A
- Application number: CN202111658938.9A
- Authority: CN (China)
- Prior art keywords: voice, speech, network, frames, features
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G10L15/16—Speech classification or search using artificial neural networks
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters, the extracted parameters being the cepstrum
- G10L2015/025—Phonemes, fenemes or fenones being the recognition units
(All classifications fall under G—Physics, G10—Musical instruments; acoustics, G10L—Speech analysis techniques or speech synthesis; speech recognition; speech or voice processing techniques; speech or audio coding or decoding.)
Abstract
Disclosed are a speech recognition method, apparatus, device and storage medium. The method comprises: acquiring acoustic features of a speech signal, wherein the speech signal comprises a plurality of speech frames; inputting the acoustic features into a convolutional network to obtain convolution features of the acoustic features; inputting the convolution features into a long short-term memory (LSTM) recurrent neural network to obtain associated features of the plurality of speech frames output by the LSTM recurrent neural network; and obtaining a phoneme recognition result of the first speech frame preceding a set duration according to the associated features of the plurality of speech frames within the set duration.
Description
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for speech recognition.
Background
Speech recognition is widely used in many fields, such as smart speakers, smart vehicles, smart customer service, mobile phone voice assistants, and other daily services. It is also applied to human-computer interaction and virtual-human driving; related technologies are adopted by virtual anchors, virtual idols, and the like.
Most speech-related scenarios, such as voice keyword detection, speech transcription, and speech-driven animation, require the algorithm to support streaming input and output. On the one hand, streaming input/output matches how the human brain processes speech; on the other hand, it is indispensable for real-time application scenarios. For example, during speech transcription, the reader expects to see text output in real time as the speech signal arrives, rather than dozens or hundreds of characters appearing only after a long stretch of speech data has been input.
Therefore, how to better achieve real-time speech recognition is a problem that deserves active study.
Disclosure of Invention
The disclosed embodiments provide a speech recognition scheme.
According to a first aspect of the present disclosure, there is provided a speech recognition method, the method comprising: acquiring acoustic features of a speech signal, wherein the speech signal comprises a plurality of speech frames; inputting the acoustic features into a convolutional network to obtain convolution features of the acoustic features; inputting the convolution features into a long short-term memory (LSTM) recurrent neural network to obtain associated features of the plurality of speech frames output by the LSTM recurrent neural network; and obtaining a phoneme recognition result of the first speech frame preceding a set duration according to the associated features of the plurality of speech frames within the set duration.
In some embodiments, obtaining the phoneme recognition result of the first speech frame preceding the set duration according to the associated features of the plurality of speech frames within the set duration includes: inputting the associated features of the plurality of speech frames output by the LSTM recurrent neural network into a feature queue in real time, wherein the maximum capacity of the feature queue is determined according to the set duration; when the number of associated features in the feature queue has reached the maximum capacity, outputting, in response to a new associated feature entering the queue, the associated feature of the speech frame that entered the feature queue first, that speech frame being the first speech frame preceding the set duration; and determining the phoneme recognition result of the first speech frame according to the associated features of the speech frames in the feature queue.
In some embodiments, the method further comprises: after the associated features of the plurality of speech frames output by the LSTM recurrent neural network have been input into the feature queue in real time, inputting a set number of speech frames whose features are all zero into the feature queue, wherein the set number is equal to the maximum capacity of the feature queue.
In some embodiments, determining the phoneme recognition result of the first speech frame according to the associated features of the plurality of speech frames in the feature queue includes: simultaneously inputting the associated feature of the first speech frame and the associated features of the speech frames in the feature queue into a prediction network, wherein the prediction network predicts the target feature of the first speech frame from the associated features of the speech frames in the feature queue; and performing a fully connected operation on the target feature to obtain the phoneme posterior probability of the first speech frame.
In some embodiments, the speech recognition method is performed using a speech coding network comprising at least the LSTM recurrent neural network, the convolutional network, and the prediction network, the speech coding network being trained using triphone samples.
In some embodiments, obtaining the phoneme recognition result of the first speech frame preceding the set duration according to the associated features of the plurality of speech frames within the set duration includes: obtaining a triphone recognition result of the first speech frame preceding the set duration according to the associated features of the plurality of speech frames within the set duration; and obtaining a monophone recognition result of the first speech frame according to its triphone recognition result.
In some embodiments, the speech coding network further comprises a normalization module configured to normalize the acoustic features of the speech signal before the convolution features of the acoustic features are obtained.
In some embodiments, the speech coding network further comprises an activation module configured to perform an activation operation on the convolution features before they are input into the LSTM recurrent neural network.
In some embodiments, the method further comprises: after obtaining the phoneme recognition result of the first speech frame preceding the set duration, clearing the associated features from before the set duration.
According to a second aspect of the present disclosure, there is provided a speech recognition apparatus, the apparatus comprising: a first acquisition unit configured to acquire acoustic features of a speech signal, the speech signal comprising a plurality of speech frames; a second acquisition unit configured to input the acoustic features into a convolutional network to obtain convolution features of the acoustic features; a third acquisition unit configured to input the convolution features into a long short-term memory (LSTM) recurrent neural network to obtain associated features of the plurality of speech frames output by the LSTM recurrent neural network; and a recognition unit configured to obtain a phoneme recognition result of the first speech frame preceding a set duration according to the associated features of the speech frames within the set duration.
In some embodiments, the recognition unit is specifically configured to: input the associated features of the plurality of speech frames output by the LSTM recurrent neural network into a feature queue in real time, wherein the maximum capacity of the feature queue is determined according to the set duration; when the number of associated features in the feature queue has reached the maximum capacity, output, in response to a new associated feature entering the queue, the associated feature of the speech frame that entered the feature queue first, that speech frame being the first speech frame preceding the set duration; and determine the phoneme recognition result of the first speech frame according to the associated features of the speech frames in the feature queue.
In some embodiments, the apparatus further includes an output unit configured to input a set number of speech frames whose features are all zero into the feature queue after the associated features of the plurality of speech frames output by the LSTM recurrent neural network have been input into the feature queue in real time, the set number being equal to the maximum capacity of the feature queue.
In some embodiments, the recognition unit is specifically configured to: simultaneously input the associated feature of the first speech frame and the associated features of the speech frames in the feature queue into a prediction network, wherein the prediction network predicts the target feature of the first speech frame from the associated features of the speech frames in the feature queue; and perform a fully connected operation on the target feature to obtain the phoneme posterior probability of the first speech frame.
In some embodiments, the speech recognition apparatus is applied to a speech coding network comprising at least the LSTM recurrent neural network, the convolutional network, and the prediction network, the speech coding network being trained using triphone samples.
In some embodiments, the recognition unit is specifically configured to: obtain a triphone recognition result of the first speech frame preceding the set duration according to the associated features of the plurality of speech frames within the set duration; and obtain a monophone recognition result of the first speech frame according to its triphone recognition result.
In some embodiments, the speech coding network further comprises a normalization module configured to normalize the acoustic features of the speech signal before the convolution features of the acoustic features are obtained.
In some embodiments, the speech coding network further comprises an activation module configured to perform an activation operation on the convolution features before they are input into the LSTM recurrent neural network.
In some embodiments, the apparatus further includes a clearing unit configured to clear the associated features from before the set duration after the phoneme recognition result of the first speech frame preceding the set duration has been obtained.
According to a third aspect of the present disclosure, an electronic device is provided, the device comprising a memory for storing computer instructions executable on a processor, the processor being configured to implement the speech recognition method according to any one of the embodiments provided in the present disclosure when executing the computer instructions.
According to a fourth aspect of the present disclosure, there is provided a computer-readable storage medium, on which a computer program is stored, which when executed by a processor, implements the speech recognition method according to any one of the embodiments provided in the present disclosure.
Drawings
To more clearly illustrate the technical solutions in one or more embodiments of this specification or in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some of the embodiments described in this specification, and those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a speech recognition method proposed by at least one embodiment of the present disclosure;
Fig. 2A is a schematic diagram of a speech recognition process proposed by at least one embodiment of the present disclosure;
Fig. 2B is a schematic diagram of the structure of a speech coding network shown in at least one embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of a speech recognition apparatus according to at least one embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of an electronic device according to at least one embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, and may mean including any one or more elements selected from the group consisting of A, B and C.
At least one embodiment of the present disclosure provides a speech recognition method. The method may be performed by an electronic device such as a terminal device or a server, where the terminal device may be a fixed or mobile terminal, such as a mobile phone, tablet computer, game console, desktop computer, advertising machine, kiosk, or vehicle-mounted terminal, and the server may be a local server or a cloud server. The method may also be implemented by a processor calling computer-readable instructions stored in a memory.
Fig. 1 illustrates a flowchart of a speech recognition method according to at least one embodiment of the present disclosure. As shown in Fig. 1, the method includes steps 101 to 104.
In step 101, acoustic features of a speech signal are acquired.
The speech signal comprises a plurality of speech frames, arranged in time order.
The acoustic features may be Mel-frequency cepstral coefficients (MFCC), Fbank features, or the like.
In the embodiments of the present disclosure, the acoustic features of each speech frame may be obtained by means of a sliding window.
For example, for a 300 ms speech signal, applying a sliding window with a step size of 10 ms yields 30 frames of acoustic features, for example an 80-dimensional Fbank feature per frame.
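As an illustration, the feature extraction of step 101 can be sketched in a few lines of Python. This is a minimal sketch assuming torchaudio's Kaldi-compatible front end and a hypothetical input file; the patent itself does not name a feature-extraction toolkit or a window length.

```python
import torchaudio

# Hypothetical input file; any 16 kHz mono recording would do.
waveform, sample_rate = torchaudio.load("speech.wav")

# 80-dimensional Fbank features with a 10 ms frame shift, so a 300 ms
# signal yields roughly 30 frames, matching the example in the text.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    sample_frequency=sample_rate,
    num_mel_bins=80,
    frame_length=25.0,  # window length in ms (an assumed value)
    frame_shift=10.0,   # sliding-window step in ms
)
print(fbank.shape)  # (num_frames, 80)
```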
In step 102, the acoustic features are input into a convolutional network to obtain the convolution features of the acoustic features.
See the schematic diagram of the speech recognition method in Fig. 2A. The acoustic features are streamed into the speech coding network 20, specifically into the convolutional network 201 within it; that is, the convolution features of each speech frame are obtained by performing a convolution operation on that frame's acoustic features.
The convolutional network 201 may be a one-dimensional convolutional layer (conv1d), with which the acoustic features can be projected into a higher-dimensional feature space.
In step 103, the convolution features are input into the long short-term memory (LSTM) recurrent neural network 202 in the speech coding network 20, and the associated features of the plurality of speech frames output by the LSTM recurrent neural network 202 are obtained.
Because an LSTM recurrent neural network applies the same network repeatedly over time, with the network at each time step passing information on to the next, it can learn the historical context of the input convolution features; the associated features it outputs therefore encode the influence of past information on the current frame.
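The way history is carried forward can be illustrated with a stateful, unidirectional LSTM in PyTorch. This is a sketch only: the dimensions and the chunk generator are assumptions, not values from the patent.

```python
import torch.nn as nn

# A unidirectional LSTM passes its hidden state from one streamed chunk to
# the next, so each output frame reflects the history of all earlier frames.
lstm = nn.LSTM(input_size=256, hidden_size=512, batch_first=True)

state = None  # (h, c); None initializes the state to zeros
for chunk in stream_of_conv_features():  # hypothetical generator of (1, T, 256) tensors
    assoc_feats, state = lstm(chunk, state)  # `state` carries the history forward
```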
In step 104, the phoneme recognition result of the first speech frame preceding the set duration is obtained according to the associated features of the plurality of speech frames within the set duration.
The set duration may be the length of a time period counted backwards from the current moment at which the speech signal is being received, or from some other moment before the current one. The associated features of the plurality of speech frames within the set duration are the associated features of the speech frames received during that time period; the first speech frame preceding the set duration is the most recently received speech frame before that time period.
Using the associated features of the plurality of speech frames within the set duration, the phoneme class of the first speech frame preceding the set duration can be predicted more accurately, yielding a more accurate phoneme recognition result. For a speech signal acquired in real time, outputting the phoneme recognition result of the first speech frame preceding the set duration is equivalent to outputting the phoneme recognition result of each speech frame with a fixed delay, that is, streaming output with a set delay, thereby achieving real-time processing of the speech signal.
In the embodiments of the present disclosure, the acoustic features of a speech signal are first acquired; the convolutional network is then used to obtain the convolution features of those acoustic features; the convolution features are input into the long short-term memory (LSTM) recurrent neural network to obtain the associated features of the plurality of speech frames contained in the speech signal; and finally, the phoneme recognition result of the first speech frame preceding the set duration is obtained according to the associated features of the plurality of speech frames within the set duration.
In some embodiments, a feature queue may be used to manage the associated features of the plurality of speech frames output by the LSTM recurrent neural network, implementing the set delay and providing both the associated features of the plurality of speech frames within the set duration and the phoneme recognition result of the first speech frame preceding it. See the schematic diagram of the speech recognition method in Fig. 2A.
The acoustic features are streamed into the convolutional network 201, the convolution features output by the convolutional network 201 are input into the LSTM recurrent neural network 202, and the LSTM recurrent neural network 202 streams out the associated features of the speech frames.
The associated features of the speech frames output by the LSTM recurrent neural network 202 pass through the feature queue in sequence; that is, the associated feature of each speech frame output by the LSTM recurrent neural network 202 enters the feature queue in real time.
The maximum capacity of the feature queue is determined according to the set duration. For example, if the set duration is 300 ms and each frame lasts 10 ms, the maximum capacity of the feature queue may be set to 30; that is, the queue holds the associated features of at most 30 speech frames.
The feature queue may operate in first-in-first-out fashion. When the associated features in the feature queue have reached the maximum capacity, for example when 30 associated features are already present, the arrival of the associated feature of another speech frame causes the associated feature of the speech frame that entered the queue first to be dequeued. That speech frame is the first speech frame preceding the set duration. As the associated features of new speech frames keep entering the feature queue, the associated feature of the then-first speech frame keeps being dequeued, so that the associated features of the speech frames are streamed out with the set duration as the delay. After the subsequent recognition processing is applied to these associated features, the speech recognition result is output in streaming fashion.
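A minimal Python sketch of such a fixed-delay FIFO queue is shown below; the 300 ms / 10 ms figures follow the example above, and everything else is an illustrative assumption.

```python
from collections import deque

SET_DURATION_MS = 300  # the set duration, per the example above
FRAME_MS = 10          # duration of one speech frame
MAX_CAPACITY = SET_DURATION_MS // FRAME_MS  # 30 associated features

feature_queue = deque()

def push(assoc_feat):
    """Enqueue one frame's associated feature. Once the queue is full, each
    new arrival dequeues the oldest feature -- the associated feature of the
    'first speech frame preceding the set duration' -- and returns it."""
    dequeued = None
    if len(feature_queue) == MAX_CAPACITY:
        dequeued = feature_queue.popleft()
    feature_queue.append(assoc_feat)
    return dequeued  # None until the fixed delay has been filled
```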
The phoneme recognition result of the first speech frame can then be determined according to the associated features of the plurality of speech frames in the feature queue.
In the embodiments of the present disclosure, by using the feature queue to realize the set delay, the associated features of the plurality of speech frames within the set duration and the associated feature of the first speech frame preceding it can be obtained at the same time, and the associated features of the speech frames can be output in real time.
In one example, after the associated features of the plurality of speech frames output by the LSTM recurrent neural network have been input into the feature queue in real time, a set number of speech frames whose features are all zero are input into the feature queue, the set number being equal to the maximum capacity of the feature queue.
Taking 80-dimensional Fbank features and a feature-queue maximum capacity of N as an example, N × 80 zeros are appended after the input associated features.
By appending, after the associated features of the input speech frames, a number of all-zero frames equal to the maximum capacity, the last associated features remaining in the feature queue are also dequeued, so that the speech recognition result is complete.
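Continuing the queue sketch above, the end-of-stream flush might look as follows; the feature dimensionality follows the text's 80-dimensional example, and the downstream recognize call is hypothetical.

```python
import numpy as np

FEAT_DIM = 80  # feature dimensionality, per the 80-dimensional example in the text

# End-of-stream flush: push MAX_CAPACITY all-zero features so that the last
# real associated features are also dequeued and recognized.
for _ in range(MAX_CAPACITY):
    dequeued = push(np.zeros(FEAT_DIM, dtype=np.float32))
    if dequeued is not None:
        recognize(dequeued, list(feature_queue))  # hypothetical downstream call
```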
In some embodiments, a look-ahead prediction module may be used to predict the phoneme recognition result of the first speech frame from the associated features of the speech frames in the feature queue. Referring to the schematic diagram of the speech recognition method shown in Fig. 2A, the speech coding network 20 further includes a prediction network 203 and a fully connected network 204.
The associated feature of the first speech frame output by the feature queue and the associated features of the plurality of speech frames in the feature queue are input into the prediction network 203 simultaneously. That is, once the number of associated features in the feature queue has reached the maximum capacity, whenever the associated feature of another speech frame enters the queue, the associated feature of the first speech frame output by the queue is input directly into the prediction network 203, while the associated features of all speech frames in the queue are input at the same time. The prediction network predicts the target feature of the dequeued first speech frame from the received associated features of all speech frames in the queue and outputs it.
The target feature is then input into the fully connected network 204, where a fully connected operation is performed to obtain the phoneme posterior probability of the first speech frame, that is, to determine its phoneme class. The phoneme class indicates which phoneme in the phoneme table in use the frame corresponds to.
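A sketch of the prediction network and fully connected network in PyTorch is given below. The internal architecture (a linear projection over the concatenated queue contents), the feature dimensionality, and the number of phoneme classes are all assumptions; the patent specifies only the inputs and outputs.

```python
import torch
import torch.nn as nn

class LookaheadPredictor(nn.Module):
    """Sketch of prediction network 203 plus fully connected network 204."""

    def __init__(self, feat_dim=512, queue_len=30, num_phones=120):
        super().__init__()
        # Prediction network: maps the first frame's feature plus the whole
        # queue contents to the target feature of the first frame.
        self.predict = nn.Linear(feat_dim * (queue_len + 1), feat_dim)
        # Fully connected network: target feature -> phoneme posterior.
        self.fc = nn.Linear(feat_dim, num_phones)

    def forward(self, first_frame_feat, queue_feats):
        # first_frame_feat: (batch, feat_dim), the dequeued first speech frame
        # queue_feats:      (batch, queue_len, feat_dim), the queue contents
        ctx = torch.cat([first_frame_feat, queue_feats.flatten(1)], dim=-1)
        target = self.predict(ctx)                         # target feature
        return torch.log_softmax(self.fc(target), dim=-1)  # phoneme posterior (log-probabilities)
```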
In the embodiments of the present disclosure, combining the LSTM recurrent neural network with the prediction network makes it possible, without forgetting past information, to use a fixed amount of delayed future information to predict the current phoneme posterior probability. This improves recognition accuracy over a conventional unidirectional LSTM recurrent neural network, while retaining an ability to output phoneme recognition results in real time that a bidirectional LSTM recurrent neural network lacks.
In some embodiments, the speech coding network 20 may be trained using triphone samples. In this case, the speech coding network 20 outputs the triphone recognition result of the first speech frame preceding the set duration, and the final monophone recognition result is then derived from the triphone recognition result.
Compared with a monophone, a triphone carries context: its state represents not only the current frame but also the preceding and following ones. When the recognition target is a triphone, phoneme detection is therefore improved.
In some embodiments, the triphone samples may be frame-level phonemes obtained with the Kaldi-MFA phoneme recognition method. Frame-level phonemes obtained this way are highly accurate, which further improves the training of the speech coding network.
The triphone samples may be drawn from the P2FA phone table. This phone table contains relatively few phonemes, which improves training efficiency, and it performs well in certain application scenarios; for example, in expression driving of digital humans it yields a finer driving effect.
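Deriving the monophone from a triphone can be sketched as extracting the center phone. The Kaldi-style "left-center+right" label format below is an assumption; the patent only states that the monophone result is obtained from the triphone result.

```python
def triphone_to_monophone(triphone: str) -> str:
    """Return the center phone of a 'left-center+right' triphone label."""
    center = triphone
    if "-" in center:
        center = center.split("-", 1)[1]  # drop the left context
    if "+" in center:
        center = center.split("+", 1)[0]  # drop the right context
    return center

print(triphone_to_monophone("s-ih+l"))  # -> "ih"
```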
In some embodiments, the speech coding network may further comprise a normalization module. Referring to the schematic structural diagram of the speech coding network shown in Fig. 2B, a normalization module 201A may be placed before the convolutional network 201; the normalization module 201A is configured to normalize the acoustic features of the speech signal before the convolution features of the acoustic features are obtained.
In some embodiments, the speech coding network may further comprise an activation module. Referring to the schematic structural diagram of the speech coding network shown in Fig. 2B, an activation module 201B may be placed after the convolutional network 201; the activation module 201B is configured to perform an activation operation on the convolution features before they are input into the LSTM recurrent neural network.
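Putting these pieces together, the front end of the speech coding network (normalization, conv1d, activation, LSTM) might be sketched as follows. The layer types (LayerNorm, ReLU), kernel size, and dimensions are assumptions consistent with the text, not specified by it.

```python
import torch.nn as nn

class FrontEnd(nn.Module):
    """Sketch of normalization module 201A, convolutional network 201,
    activation module 201B, and LSTM recurrent neural network 202."""

    def __init__(self, feat_dim=80, conv_dim=256, hidden_dim=512):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)  # normalization module (assumed type)
        self.conv = nn.Conv1d(feat_dim, conv_dim, kernel_size=3, padding=1)  # conv1d
        self.act = nn.ReLU()                # activation module (assumed type)
        self.lstm = nn.LSTM(conv_dim, hidden_dim, batch_first=True)

    def forward(self, fbank, state=None):
        # fbank: (batch, T, feat_dim) acoustic features
        x = self.norm(fbank).transpose(1, 2)        # -> (batch, feat_dim, T) for Conv1d
        x = self.act(self.conv(x)).transpose(1, 2)  # -> (batch, T, conv_dim)
        assoc, state = self.lstm(x, state)          # associated features + carried state
        return assoc, state
```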
In some embodiments, after the phoneme recognition result of the first speech frame preceding the set duration has been obtained, the associated features from before the set duration are cleared. Referring to the schematic diagram of the speech recognition method shown in Fig. 2A, among the associated features streamed out by the LSTM recurrent neural network 202, those of speech frames that have left the feature queue, for example the associated feature of each dequeued speech frame, may be deleted from storage, ensuring that already-used data does not accumulate.
The present disclosure also provides an engineering implementation of the speech coding network. In this implementation, the speech coding network is split into three modules: the first comprises the convolutional network and the LSTM recurrent neural network; the second comprises the prediction network; and the third comprises the fully connected network. Implementing the speech coding network as three separate parts makes it possible to control, in real time, how the associated features of speech frames enter and leave the queue, meeting the real-time, low-latency input and output requirements of speech recognition.
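A streaming driver loop tying the three modules together might look as follows; it reuses the hypothetical FrontEnd, feature queue, and LookaheadPredictor from the sketches above, and the frame source is likewise hypothetical.

```python
import torch

front_end = FrontEnd()
predictor = LookaheadPredictor(feat_dim=512, queue_len=MAX_CAPACITY)
state = None

for fbank_chunk in stream_of_fbank_frames():  # hypothetical real-time source of (1, T, 80) tensors
    assoc, state = front_end(fbank_chunk, state)  # module 1: conv + LSTM
    for frame_feat in assoc.squeeze(0):           # one frame's associated feature at a time
        dequeued = push(frame_feat)               # feature queue: fixed-delay FIFO
        if dequeued is not None:
            queue_feats = torch.stack(list(feature_queue)).unsqueeze(0)
            # modules 2 + 3: prediction network, then fully connected network
            posterior = predictor(dequeued.unsqueeze(0), queue_feats)
```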
In addition, for the Fbank features, since Python toolkits can differ from C++ implementations in numerical precision, the method unifies the computation logic and decimal precision of the Fbank front end across Python and C++, thereby realizing a consistent Fbank-based speech recognition process.
Fig. 3 is a schematic structural diagram of a speech recognition apparatus according to at least one embodiment of the present disclosure. As shown in Fig. 3, the apparatus may include: a first acquisition unit 301 configured to acquire acoustic features of a speech signal, the speech signal comprising a plurality of speech frames; a second acquisition unit 302 configured to input the acoustic features into a convolutional network to obtain convolution features of the acoustic features; a third acquisition unit 303 configured to input the convolution features into a long short-term memory (LSTM) recurrent neural network to obtain associated features of the plurality of speech frames output by the LSTM recurrent neural network; and a recognition unit 304 configured to obtain a phoneme recognition result of the first speech frame preceding a set duration according to the associated features of the speech frames within the set duration.
In some embodiments, the recognition unit is specifically configured to: input the associated features of the plurality of speech frames output by the LSTM recurrent neural network into a feature queue in real time, wherein the maximum capacity of the feature queue is determined according to the set duration; when the number of associated features in the feature queue has reached the maximum capacity, output, in response to a new associated feature entering the queue, the associated feature of the speech frame that entered the feature queue first, that speech frame being the first speech frame preceding the set duration; and determine the phoneme recognition result of the first speech frame according to the associated features of the speech frames in the feature queue.
In some embodiments, the apparatus further includes an output unit configured to input a set number of speech frames whose features are all zero into the feature queue after the associated features of the plurality of speech frames output by the LSTM recurrent neural network have been input into the feature queue in real time, the set number being equal to the maximum capacity of the feature queue.
In some embodiments, the recognition unit is specifically configured to: simultaneously input the associated feature of the first speech frame and the associated features of the speech frames in the feature queue into a prediction network, wherein the prediction network predicts the target feature of the first speech frame from the associated features of the speech frames in the feature queue; and perform a fully connected operation on the target feature to obtain the phoneme posterior probability of the first speech frame.
In some embodiments, the speech recognition apparatus is applied to a speech coding network comprising at least the LSTM recurrent neural network, the convolutional network, and the prediction network, the speech coding network being trained using triphone samples.
In some embodiments, the recognition unit is specifically configured to: obtain a triphone recognition result of the first speech frame preceding the set duration according to the associated features of the plurality of speech frames within the set duration; and obtain a monophone recognition result of the first speech frame according to its triphone recognition result.
In some embodiments, the speech coding network further comprises a normalization module configured to normalize the acoustic features of the speech signal before the convolution features of the acoustic features are obtained.
In some embodiments, the speech coding network further comprises an activation module configured to perform an activation operation on the convolution features before they are input into the LSTM recurrent neural network.
In some embodiments, the apparatus further includes a clearing unit configured to clear the associated features from before the set duration after the phoneme recognition result of the first speech frame preceding the set duration has been obtained.
At least one embodiment of the present disclosure also provides an electronic device. As shown in Fig. 4, the device includes a memory for storing computer instructions executable on a processor, and the processor is configured to implement the speech recognition method according to any embodiment of the present disclosure when executing the computer instructions.
At least one embodiment of the present disclosure also provides a computer-readable storage medium on which a computer program is stored, which when executed by a processor implements the speech recognition method according to any one of the embodiments of the present disclosure.
At least one embodiment of the present disclosure also provides a computer program product comprising a computer program that, when executed by a processor, implements the speech recognition method of any of the embodiments of the present disclosure.
As will be appreciated by one skilled in the art, one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the data processing apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CDROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, features described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.
Claims (12)
1. A method of speech recognition, the method comprising:
acquiring acoustic features of a speech signal, wherein the speech signal comprises a plurality of speech frames;
inputting the acoustic features into a convolutional network to obtain convolution features of the acoustic features;
inputting the convolution features into a long short-term memory (LSTM) recurrent neural network to obtain associated features of a plurality of speech frames output by the LSTM recurrent neural network;
and obtaining a phoneme recognition result of a first speech frame preceding a set duration according to the associated features of the plurality of speech frames within the set duration.
2. The method of claim 1, wherein obtaining the phoneme recognition result of the first speech frame preceding the set duration according to the associated features of the speech frames within the set duration comprises:
inputting the associated features of the plurality of speech frames output by the LSTM recurrent neural network into a feature queue in real time, wherein the maximum capacity of the feature queue is determined according to the set duration;
when the number of associated features in the feature queue has reached the maximum capacity, outputting, in response to a new associated feature entering the queue, the associated feature of the speech frame that entered the feature queue first, that speech frame being the first speech frame preceding the set duration;
and determining the phoneme recognition result of the first speech frame according to the associated features of the speech frames in the feature queue.
3. The method of claim 2, further comprising:
after the associated features of the plurality of speech frames output by the LSTM recurrent neural network have been input into the feature queue in real time, inputting a set number of speech frames whose features are all zero into the feature queue, wherein the set number is equal to the maximum capacity of the feature queue.
4. The method of claim 3, wherein determining the phoneme recognition result of the first speech frame according to the associated features of the plurality of speech frames in the feature queue comprises:
simultaneously inputting the associated feature of the first speech frame and the associated features of the speech frames in the feature queue into a prediction network, wherein the prediction network predicts the target feature of the first speech frame from the associated features of the speech frames in the feature queue;
and performing a fully connected operation on the target feature to obtain the phoneme posterior probability of the first speech frame.
5. The method of any of claims 1 to 4, wherein the speech recognition method is performed using a speech coding network comprising at least the LSTM recurrent neural network, the convolutional network and the prediction network, the speech coding network being trained using triphone samples.
6. The method according to any one of claims 1 to 5, wherein obtaining the phoneme recognition result of the first speech frame preceding the set duration according to the associated features of the plurality of speech frames within the set duration comprises:
obtaining a triphone recognition result of the first speech frame preceding the set duration according to the associated features of the plurality of speech frames within the set duration;
and obtaining a monophone recognition result of the first speech frame according to the triphone recognition result of the first speech frame.
7. The method according to claim 5 or 6, wherein the speech coding network further comprises a normalization module configured to normalize the acoustic features of the speech signal before the convolution features of the acoustic features are obtained.
8. The method of any one of claims 5 to 7, wherein the speech coding network further comprises an activation module configured to perform an activation operation on the convolution features before they are input into the LSTM recurrent neural network.
9. The method according to any one of claims 1 to 8, further comprising:
and after obtaining the phoneme recognition result of the first speech frame preceding the set duration, clearing the associated features from before the set duration.
10. A speech recognition apparatus, characterized in that the apparatus comprises:
a first acquisition unit configured to acquire acoustic features of a speech signal, wherein the speech signal comprises a plurality of speech frames;
a second acquisition unit configured to input the acoustic features into a convolutional network to obtain convolution features of the acoustic features;
a third acquisition unit configured to input the convolution features into a long short-term memory (LSTM) recurrent neural network to obtain associated features of a plurality of speech frames output by the LSTM recurrent neural network;
and a recognition unit configured to obtain a phoneme recognition result of a first speech frame preceding a set duration according to the associated features of the speech frames within the set duration.
11. An electronic device, comprising a memory for storing computer instructions executable on a processor, the processor being configured to implement the method of any one of claims 1 to 9 when executing the computer instructions.
12. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method of any one of claims 1 to 9.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111658938.9A CN114203161A (en) | 2021-12-30 | 2021-12-30 | Speech recognition method, apparatus, device and storage medium |
PCT/CN2022/128619 WO2023124500A1 (en) | 2021-12-30 | 2022-10-31 | Voice recognition method and apparatus, device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111658938.9A CN114203161A (en) | 2021-12-30 | 2021-12-30 | Speech recognition method, apparatus, device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114203161A true CN114203161A (en) | 2022-03-18 |
Family
ID=80657541
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111658938.9A Pending CN114203161A (en) | 2021-12-30 | 2021-12-30 | Speech recognition method, apparatus, device and storage medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN114203161A (en) |
WO (1) | WO2023124500A1 (en) |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9620108B2 (en) * | 2013-12-10 | 2017-04-11 | Google Inc. | Processing acoustic sequences using long short-term memory (LSTM) neural networks that include recurrent projection layers |
US9721562B2 (en) * | 2013-12-17 | 2017-08-01 | Google Inc. | Generating representations of acoustic sequences |
CN109754789B (en) * | 2017-11-07 | 2021-06-08 | 北京国双科技有限公司 | Method and device for recognizing voice phonemes |
CN111009235A (en) * | 2019-11-20 | 2020-04-14 | 武汉水象电子科技有限公司 | Voice recognition method based on CLDNN + CTC acoustic model |
US11244668B2 (en) * | 2020-05-29 | 2022-02-08 | TCL Research America Inc. | Device and method for generating speech animation |
CN114203161A (en) * | 2021-12-30 | 2022-03-18 | 深圳市慧鲤科技有限公司 | Speech recognition method, apparatus, device and storage medium |
Application events:
- 2021-12-30: Application CN202111658938.9A filed in China; published as CN114203161A (status: pending)
- 2022-10-31: International application PCT/CN2022/128619 filed; published as WO2023124500A1
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108550364A (en) * | 2018-04-20 | 2018-09-18 | 百度在线网络技术(北京)有限公司 | Audio recognition method, device, equipment and storage medium |
CN110675860A (en) * | 2019-09-24 | 2020-01-10 | 山东大学 | Voice information identification method and system based on improved attention mechanism and combined with semantics |
CN112786016A (en) * | 2019-11-11 | 2021-05-11 | 北京声智科技有限公司 | Voice recognition method, device, medium and equipment |
WO2021136054A1 (en) * | 2019-12-30 | 2021-07-08 | Oppo广东移动通信有限公司 | Voice wake-up method, apparatus and device, and storage medium |
CN111798840A (en) * | 2020-07-16 | 2020-10-20 | 中移在线服务有限公司 | Voice keyword recognition method and device |
Non-Patent Citations (1)
- Chao Sui et al., "Extracting deep bottleneck features for visual speech recognition", 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6 August 2015, pages 1518-1522.
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023124500A1 (en) * | 2021-12-30 | 2023-07-06 | 深圳市慧鲤科技有限公司 | Voice recognition method and apparatus, device and storage medium |
WO2024139805A1 (en) * | 2022-12-26 | 2024-07-04 | 腾讯科技(深圳)有限公司 | Audio processing method and related device |
CN116825109A (en) * | 2023-08-30 | 2023-09-29 | 深圳市友杰智新科技有限公司 | Processing method, device, equipment and medium for voice command misrecognition |
CN116825109B (en) * | 2023-08-30 | 2023-12-08 | 深圳市友杰智新科技有限公司 | Processing method, device, equipment and medium for voice command misrecognition |
Also Published As
Publication number | Publication date |
---|---|
WO2023124500A1 (en) | 2023-07-06 |
Legal Events
- PB01: Publication
- SE01: Entry into force of request for substantive examination
- REG: Reference to a national code (country: HK; legal event code: DE; document number: 40061476)