CN112259084B - Speech recognition method, device and storage medium - Google Patents
- Publication number
- CN112259084B CN112259084B CN202010597703.2A CN202010597703A CN112259084B CN 112259084 B CN112259084 B CN 112259084B CN 202010597703 A CN202010597703 A CN 202010597703A CN 112259084 B CN112259084 B CN 112259084B
- Authority
- CN
- China
- Prior art keywords
- text
- sentence
- current
- lattice
- neural network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L2015/226—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics
- G10L2015/228—Procedures used during a speech recognition process, e.g. man-machine dialogue using non-speech characteristics of application context
Landscapes
- Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The disclosure provides a speech recognition method, a speech recognition device and a storage medium, and relates to the technical field of speech recognition. A speech recognition method of the present disclosure includes: acquiring a candidate lattice according to the speech signal of the current sentence; resetting a neural network model according to the context text corresponding to the current sentence, wherein the context text is the recognition text of the one or more sentences preceding the current sentence; re-scoring the candidate lattice with the reset neural network model to obtain a re-scored lattice; and determining the recognition text of the current sentence according to the re-scored lattice. In this way, the speech recognition of the current sentence can take the information of one or more preceding sentences into account, so that prior information is used more fully, the re-scoring is more accurate, and the accuracy of speech recognition is improved.
Description
Technical Field
The present disclosure relates to the field of speech recognition technology, and in particular, to a speech recognition method, apparatus, and storage medium.
Background
Speech recognition is a key technology in systems such as voice quality inspection and man-machine dialogue, and is widely applied in fields such as logistics, finance and industry. High recognition accuracy is essential for any speech system: in a conversation robot, for example, poor speech recognition accuracy means the speaker's true intention cannot be understood correctly, and a wrong instruction may then be issued.
Disclosure of Invention
It is an object of the present disclosure to improve the accuracy of speech recognition.
According to an aspect of some embodiments of the present disclosure, there is provided a speech recognition method, including: acquiring a candidate lattice according to the speech signal of the current sentence; resetting a neural network model according to the context text corresponding to the current sentence, wherein the context text is the recognition text of the one or more sentences preceding the current sentence, and the neural network model is generated by training on corpus samples that carry context; re-scoring the candidate lattice with the reset neural network model to obtain a re-scored lattice; and determining the recognition text of the current sentence according to the re-scored lattice.
In some embodiments, the speech recognition method further comprises: storing the recognition text of the current sentence in a buffer for use as the context text of a subsequent sentence.
In some embodiments, the speech recognition method further comprises: acquiring the context text corresponding to the current sentence from the buffer.
In some embodiments, obtaining the candidate lattice from the speech signal of the current sentence includes: decoding the speech signal in one pass based on an acoustic model and a language model to obtain the candidate lattice.
In some embodiments, determining the recognition text of the current sentence from the re-scored lattice includes: performing acoustic-weight and language-weight analysis on the re-scored lattice to obtain the decoding result of the highest-scoring path as the recognition text of the current sentence.
In some embodiments, the neural network model includes an LSTM (Long Short-Term Memory) model or a GRU (Gated Recurrent Unit) model.
In some embodiments, where the speech signal is the speech signal of a dialogue, the context text corresponding to the current sentence includes the recognition text of the most recent utterance, before the current sentence, of the previous speaker.
In some embodiments, the speech recognition method further comprises: training the neural network model using samples that carry context until the output of the loss function converges, including: acquiring a sample candidate lattice according to the speech signal of the current sample sentence; resetting the neural network model to be trained according to the context sample text corresponding to the current sample sentence, wherein the context sample text is the sample text of the one or more sentences preceding the current sample sentence; re-scoring the sample candidate lattice with the reset neural network model to be trained, obtaining a re-scored sample lattice, and determining the recognition text of the current sample sentence; and determining the output of the loss function according to the recognition text of the current sample sentence and the sample text of the current sample sentence.
In this way, the information of one or more preceding sentences can be taken into account in the speech recognition of the current sentence, so that prior information is used more fully, the re-scoring is more accurate, and the accuracy of speech recognition is improved.
According to an aspect of other embodiments of the present disclosure, there is provided a speech recognition apparatus including: a decoding unit configured to acquire a candidate lattice according to the speech signal of the current sentence; a resetting unit configured to reset a neural network model according to the context text corresponding to the current sentence, wherein the context text is the recognition text of the one or more sentences preceding the current sentence, and the neural network model is generated by training on corpus samples that carry context; a re-scoring unit configured to re-score the candidate lattice with the reset neural network model to obtain a re-scored lattice; and a recognition unit configured to determine the recognition text of the current sentence according to the re-scored lattice.
In some embodiments, the speech recognition apparatus further comprises: a caching unit configured to store the recognition text of the current sentence in a buffer for use as the context text of a subsequent sentence.
In some embodiments, the resetting unit is further configured to acquire the context text corresponding to the current sentence from the buffer.
In some embodiments, the decoding unit is configured to decode the speech signal in one pass based on an acoustic model and a language model to obtain the candidate lattice.
In some embodiments, the recognition unit is configured to perform acoustic-weight and language-weight analysis on the re-scored lattice to obtain the decoding result of the highest-scoring path as the recognition text of the current sentence.
In some embodiments, the neural network model includes an LSTM model or a GRU model.
In some embodiments, where the speech signal is the speech signal of a dialogue, the context text corresponding to the current sentence includes the recognition text of the most recent utterance, before the current sentence, of the previous speaker.
In some embodiments, the speech recognition apparatus further comprises: a training unit configured to train the neural network model with samples that carry context until the output of the loss function converges.
According to an aspect of some embodiments of the present disclosure, there is provided a speech recognition apparatus including: a memory; and a processor coupled to the memory, the processor configured to perform any of the speech recognition methods mentioned above based on instructions stored in the memory.
Such a device can take the information of one or more preceding sentences into account in the speech recognition of the current sentence, so that prior information is used more fully, the re-scoring is more accurate, and the accuracy of speech recognition is improved.
According to an aspect of some embodiments of the present disclosure, a computer-readable storage medium is presented, on which computer program instructions are stored, which instructions, when executed by a processor, implement the steps of any one of the speech recognition methods mentioned above.
By executing the instructions on the computer-readable storage medium, the information of one or more preceding sentences can be taken into account in the speech recognition of the current sentence, so that prior information is used more fully, the re-scoring is more accurate, and the accuracy of speech recognition is improved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate and explain the present disclosure, and together with the description serve to explain the present disclosure. In the drawings:
Fig. 1 is a flow chart of some embodiments of a speech recognition method of the present disclosure.
Fig. 2 is a flow chart of other embodiments of the speech recognition method of the present disclosure.
Fig. 3 is a schematic diagram of some embodiments of a speech recognition device of the present disclosure.
Fig. 4 is a schematic diagram of further embodiments of a speech recognition device of the present disclosure.
Fig. 5 is a schematic diagram of still other embodiments of a speech recognition apparatus of the present disclosure.
Detailed Description
The technical scheme of the present disclosure is described in further detail below through the accompanying drawings and examples.
A speech recognition system first performs fast decoding with a simple language model to generate a lattice network, and then re-scores the generated lattice with a complex language model to obtain higher recognition accuracy. The recognition rate obtained by one-pass decoding alone is often low; accuracy can be further improved by re-scoring with a complex language model trained on a large corpus. The language model used for re-scoring was initially a high-order n-gram model, which was later replaced by neural network language models owing to their superior modeling capability.
The inventors found that, although neural networks perform well, the related art usually re-scores based only on the relationship between adjacent words and does not consider the logic between adjacent sentences.
A flowchart of some embodiments of the speech recognition method of the present disclosure is shown in fig. 1.
In step 101, a candidate lattice is obtained from the speech signal of the current sentence.
In some embodiments, the speech signal may be decoded in one pass based on an acoustic model and a language model to obtain the candidate lattice. One-pass decoding may be performed in any manner of the related art, and the resulting original lattice network is used as the candidate lattice.
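As an illustrative sketch only (the disclosure does not prescribe a data structure), a candidate lattice can be pictured as a small directed acyclic graph whose arcs carry a word together with acoustic and language-model log-scores; the states, words, and scores below are invented for illustration:

```python
# Illustrative only: states 0..2, arcs = (word, acoustic_logprob, lm_logprob, next_state).
lattice = {
    0: [("the", -1.0, -0.5, 1)],
    1: [("order", -2.0, -1.0, 2), ("odor", -1.8, -3.0, 2)],
    2: [],  # final state
}

def enumerate_paths(lattice, state=0, prefix=(), score=0.0):
    """Yield (text, total log-score) for every complete path through the lattice."""
    if not lattice[state]:
        yield " ".join(prefix), score
        return
    for word, am, lm, nxt in lattice[state]:
        yield from enumerate_paths(lattice, nxt, prefix + (word,), score + am + lm)

paths = sorted(enumerate_paths(lattice), key=lambda p: -p[1])
print(paths[0][0])  # "the order"
```

A real first-pass decoder would build this graph from the acoustic and language models; here only the resulting structure matters.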
In step 102, the neural network model is reset according to the recognized context text corresponding to the current sentence. The context text may be the recognition text of one or more sentences preceding the current sentence, for example a predetermined number of sentences immediately preceding the current sentence, or the preceding paragraph. In some embodiments, paragraphs may be divided in time by speech intervals, or distinguished by keywords.
In some embodiments, the order of execution of steps 101, 102 may be arbitrary.
In step 103, the candidate lattice is re-scored by the reset neural network model to obtain the re-scored lattice. In some embodiments, acoustic-weight and language-weight analysis may then be performed on the re-scored lattice to obtain the decoding result of the highest-scoring path as the recognition text of the current sentence.
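The "reset, then re-score" idea can be sketched with a toy stand-in for the neural network model; the `ContextLM` class, its `reset`/`score` methods, and the word-overlap heuristic are assumptions for illustration only (the disclosure uses an LSTM or GRU whose state is conditioned on the context text):

```python
# Toy stand-in (assumption, not the patent's model): "resetting" stores the
# context, and scoring rewards hypotheses consistent with that context.
class ContextLM:
    def reset(self, context_text):
        # An LSTM/GRU would advance its hidden state over the context text;
        # here we simply remember the context vocabulary.
        self.context_words = set(context_text.split())

    def score(self, hypothesis):
        words = hypothesis.split()
        overlap = sum(1 for w in words if w in self.context_words)
        return overlap / max(len(words), 1)

lm = ContextLM()
lm.reset("what is the shipping status of my order")
candidates = ["the order has shipped", "the odor has chipped"]
rescored = max(candidates, key=lm.score)
print(rescored)  # "the order has shipped"
```

The two candidates are acoustically similar; the context makes "order" far more plausible than "odor", which is exactly the prior information the re-scoring step exploits.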
In step 104, the recognition text of the current sentence is determined based on the re-scored lattice.
In this way, the information of one or more preceding sentences can be taken into account in the speech recognition of the current sentence, so that prior information is used more fully, the re-scoring is more accurate, and the accuracy of speech recognition is improved.
In some embodiments, where the speech signal is the speech signal of a dialogue, the context text corresponding to the current sentence includes the recognition text of the most recent utterance, before the current sentence, of the previous speaker. In some embodiments, a change of speaker may be detected based on timbre.
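A minimal sketch of this context-selection rule, assuming dialogue turns are already labeled with speaker identities (for example via the timbre-based speaker-change detection mentioned above); the function name and data layout are hypothetical:

```python
def context_for_current(turns, current_speaker):
    """Return the most recent recognized utterance by a different speaker.

    turns: (speaker_id, recognized_text) pairs, oldest first; speaker_id
    might come from timbre-based speaker-change detection.
    """
    for speaker, text in reversed(turns):
        if speaker != current_speaker:
            return text
    return ""  # no other speaker has spoken yet

turns = [
    ("customer", "what is the shipping status of my order"),
    ("agent", "let me check the system"),
]
print(context_for_current(turns, "customer"))  # "let me check the system"
```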
In this way, the question-and-answer logic of the conversation can be fully utilized, further improving the accuracy of speech recognition.
A flowchart of further embodiments of the speech recognition method of the present disclosure is shown in fig. 2.
In step 201, the speech signal is decoded in one pass based on the acoustic model and a low-order language model to obtain the candidate lattice.
In step 202, the context text corresponding to the current sentence is obtained from the buffer. In some embodiments, the corresponding context text may be retrieved from the buffer according to a predetermined policy, such as taking the recognition text of the most recent utterance of the previous speaker, or the recognition text of the previous sentence or previous paragraph.
In step 203, the neural network model is reset according to the context text obtained from the buffer.
In step 204, the candidate lattice is re-scored by the reset neural network model to obtain the re-scored lattice. In some embodiments, the neural network model includes an LSTM model or a GRU model.
In step 205, acoustic-weight and language-weight analysis is performed on the re-scored lattice to obtain the decoding result of the highest-scoring path, which is used as the recognition text of the current sentence.
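The weighted combination in step 205 can be sketched as follows; the weight values and path scores are illustrative assumptions, not values from the disclosure:

```python
def best_path(paths, acoustic_weight=1.0, lm_weight=0.8):
    """paths: (text, acoustic_logprob, lm_logprob); returns the top-scoring text."""
    def combined(p):
        _, am, lm = p
        return acoustic_weight * am + lm_weight * lm  # higher (less negative) wins
    return max(paths, key=combined)[0]

paths = [
    ("recognize speech", -12.0, -3.1),
    ("wreck a nice beach", -11.5, -7.9),
]
print(best_path(paths))  # "recognize speech"
```

Here the second hypothesis has a slightly better acoustic score but a much worse language score, so the weighted sum selects the first.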
In step 206, the recognition text of the current sentence is stored in the buffer as the context text for subsequent sentences.
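The buffer of step 206 can be sketched as a small fixed-capacity store of recent recognition texts; the class name and capacity are hypothetical:

```python
from collections import deque

class ContextBuffer:
    """Keep the recognition texts of the last few sentences for later resets."""

    def __init__(self, max_sentences=3):
        self.buf = deque(maxlen=max_sentences)  # oldest entries are dropped

    def store(self, recognized_text):
        self.buf.append(recognized_text)

    def context(self):
        return " ".join(self.buf)

buf = ContextBuffer(max_sentences=2)
for s in ["sentence one", "sentence two", "sentence three"]:
    buf.store(s)
print(buf.context())  # "sentence two sentence three"
```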
In this way, the recognized text is cached and managed in time to serve as a basis for recognizing subsequent sentences, and the neural network model is reset in time so that the current sentence can be analyzed and estimated using the context information, improving the prediction accuracy of the language model.
In some embodiments, the neural network model needs to be trained before speech recognition is performed by any of the methods above. The training corpus samples must carry context. In some embodiments, training text with context may be collected according to the target application scenario, and training ends when the result of the loss function converges to a stable state (e.g., the change in its output is less than a predetermined value). During training, a sample candidate lattice can be obtained according to the speech signal of the current sample sentence, and the neural network model is reset using the context sample text corresponding to the current sample sentence. In some embodiments, the context sample text is the sample text of the one or more sentences preceding the current sample sentence. The sample candidate lattice is then re-scored by the reset neural network model to be trained, and the best recognition text is determined.
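The convergence criterion described above (stop when the change in the loss output falls below a predetermined value) can be sketched generically; the toy quadratic loss stands in for the actual re-scoring loss, which the disclosure does not specify in closed form:

```python
def train_until_converged(loss_fn, step_fn, params, tol=1e-3, max_epochs=100):
    """Iterate step_fn until the change in loss_fn output drops below tol."""
    prev = loss_fn(params)
    for _ in range(max_epochs):
        params = step_fn(params)
        cur = loss_fn(params)
        if abs(prev - cur) < tol:  # "change in output less than a predetermined value"
            break
        prev = cur
    return params

# Toy stand-in for one training update: gradient descent on (x - 2)^2.
loss = lambda x: (x - 2.0) ** 2
step = lambda x: x - 0.1 * 2.0 * (x - 2.0)
result = train_until_converged(loss, step, 0.0)
```

With these assumed settings the parameter settles near the loss minimum at 2.0 once successive losses differ by less than the tolerance.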
In this way, the neural network model can be trained on corpus samples that carry context, so that the resulting model is capable of re-scoring using the logic between adjacent sentences, further improving the accuracy of speech recognition.
In tests with a speech test dataset, the method of the embodiments of the present disclosure reduced the PPL (perplexity) of a single-layer LSTM neural language model from 43.2 to 40.05; at the same time, lattice re-scoring improved speech recognition accuracy by an absolute 0.7%, a clear improvement.
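For reference, PPL (perplexity) is the exponential of the negative average log-probability the language model assigns to each token, so a drop from 43.2 to 40.05 means the model is on average less "surprised" by the test text:

```python
import math

def perplexity(token_log_probs):
    """PPL = exp(-(1/N) * sum of natural-log token probabilities)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns every token probability 1/10 has perplexity 10.
print(perplexity([math.log(0.1)] * 5))
```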
A schematic diagram of some embodiments of the speech recognition apparatus of the present disclosure is shown in fig. 3.
The decoding unit 301 can acquire a candidate lattice from the speech signal of the current sentence. In some embodiments, the speech signal may be decoded in one pass based on an acoustic model and a language model to obtain the candidate lattice.
The resetting unit 302 can reset the neural network model according to the recognized context text corresponding to the current sentence. The context text may be the recognition text of one or more sentences preceding the current sentence, for example a predetermined number of sentences immediately preceding the current sentence, or the preceding paragraph. In some embodiments, paragraphs may be divided in time by speech intervals, or distinguished by keywords.
The re-scoring unit 303 can re-score the candidate lattice through the reset neural network model to obtain the re-scored lattice. In some embodiments, acoustic-weight and language-weight analysis may be performed on the re-scored lattice to obtain the decoding result of the highest-scoring path as the recognition text of the current sentence.
The recognition unit 304 can determine the recognition text of the current sentence from the re-scored lattice.
Such a device can take the information of one or more preceding sentences into account in the speech recognition of the current sentence, so that prior information is used more fully, the re-scoring is more accurate, and the accuracy of speech recognition is improved.
In some embodiments, as shown in fig. 3, the speech recognition apparatus may further include a caching unit 305 capable of storing the recognition text of the current sentence in the buffer as the context text for subsequent sentences. The resetting unit 302 can obtain the context text corresponding to the current sentence from the buffer and reset the neural network model according to the obtained text. In some embodiments, the corresponding context text may be retrieved from the buffer according to a predetermined policy, such as taking the recognition text of the most recent utterance of the previous speaker, or the recognition text of the previous sentence or previous paragraph.
Such a device can cache and manage the recognized text in time to serve as a basis for recognizing subsequent sentences; it resets the neural network model in time, analyzes and estimates the current sentence using the context information, and improves the prediction accuracy of the language model.
In some embodiments, as shown in fig. 3, the speech recognition apparatus may further include a training unit 306 that can train the neural network model until the output of the loss function converges, generating the model used by the re-scoring unit 303. The corpus samples on which training is based must carry context. In some embodiments, training may be performed on the initial speech recognition apparatus shown in fig. 3: the training unit 306 inputs the corpus samples into the decoding unit 301, which obtains a sample candidate lattice according to the speech signal of the current sample sentence; the resetting unit resets the neural network model to be trained using the context sample text corresponding to the current sample sentence; the re-scoring unit re-scores the sample candidate lattice with the reset neural network model to be trained to obtain the re-scored sample lattice; and the recognition unit determines the recognition text of the current sample sentence. The training unit 306 then determines the output of the loss function from the recognition text of the current sample sentence and the sample text of the current sample sentence; if the change in the output is smaller than a predetermined value, the output is determined to have converged and training of the neural network model is complete.
Such a device can train the neural network model on corpus samples that carry context, so that the resulting model is capable of re-scoring using the logic between adjacent sentences, further improving the accuracy of speech recognition.
A schematic structural diagram of one embodiment of a speech recognition device of the present disclosure is shown in fig. 4. The speech recognition device comprises a memory 401 and a processor 402, wherein: the memory 401 may be a magnetic disk, flash memory, or any other non-volatile storage medium, and is used to store instructions for the embodiments of the speech recognition method above. The processor 402 is coupled to the memory 401 and may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller. The processor 402 is configured to execute the instructions stored in the memory, so that prior information can be more fully utilized, the re-scoring is more accurate, and the accuracy of speech recognition is improved.
In one embodiment, as also shown in fig. 5, the speech recognition device 500 includes a memory 501 and a processor 502. The processor 502 is coupled to the memory 501 via a bus 503. The speech recognition device 500 may also be connected to an external storage device 505 via a storage interface 504 for invoking external data, and may also be connected to a network or another computer system (not shown) via a network interface 506; these components will not be described in detail here.
In this embodiment, data and instructions are stored in the memory and the instructions are processed by the processor, so that prior information can be more fully utilized, the re-scoring is more accurate, and the accuracy of speech recognition is improved.
In another embodiment, a computer readable storage medium has stored thereon computer program instructions which, when executed by a processor, implement the steps of the method in the corresponding embodiments of the speech recognition method. It will be apparent to those skilled in the art that embodiments of the present disclosure may be provided as a method, apparatus, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Thus far, the present disclosure has been described in detail. In order to avoid obscuring the concepts of the present disclosure, some details known in the art are not described. How to implement the solutions disclosed herein will be fully apparent to those skilled in the art from the above description.
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, firmware. The above-described sequence of steps for the method is for illustration only, and the steps of the method of the present disclosure are not limited to the sequence specifically described above unless specifically stated otherwise. Furthermore, in some embodiments, the present disclosure may also be implemented as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
Finally, it should be noted that: the above embodiments are merely for illustrating the technical solution of the present disclosure and are not limiting thereof; although the present disclosure has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will appreciate that: modifications may be made to the specific embodiments of the disclosure or equivalents may be substituted for part of the technical features; without departing from the spirit of the technical solutions of the present disclosure, it should be covered in the scope of the technical solutions claimed in the present disclosure.
Claims (12)
1. A method of speech recognition, comprising:
acquiring a candidate lattice according to the speech signal of the current sentence;
resetting a neural network model according to the context text corresponding to the current sentence, wherein the context text is the recognition text of one or more sentences preceding the current sentence, the neural network model is generated by training on corpus samples that carry context, and, in the case that the speech signal is a speech signal of a dialogue, the context text corresponding to the current sentence comprises the recognition text of the most recent utterance, before the current sentence, of the previous speaker;
re-scoring the candidate lattice by the reset neural network model to obtain a re-scored lattice; and
determining the recognition text of the current sentence according to the re-scored lattice.
2. The method of claim 1, further comprising:
storing the recognition text of the current sentence in a buffer for use as the context text of a subsequent sentence.
3. The method of claim 2, further comprising:
acquiring the context text corresponding to the current sentence from the buffer.
4. The method of claim 1, wherein the acquiring a candidate lattice from the speech signal of the current sentence comprises:
decoding the speech signal in one pass based on an acoustic model and a language model to obtain the candidate lattice.
5. The method of claim 1, wherein the determining the recognition text of the current sentence from the re-scored lattice comprises:
performing acoustic-weight and language-weight analysis on the re-scored lattice to obtain the decoding result of the highest-scoring path as the recognition text of the current sentence.
6. The method of claim 1, wherein the neural network model comprises an LSTM model or a GRU model.
7. The method of any one of claims 1-6, further comprising:
training the neural network model with context-carrying samples until the output of the loss function converges, the training comprising:
acquiring a sample candidate lattice according to a speech signal of a current sample sentence;
resetting a neural network model to be trained according to a context sample text corresponding to the current sample sentence, wherein the context sample text is the sample text of one or more sentences preceding the current sample sentence;
re-scoring the sample candidate lattice with the reset neural network model to be trained to obtain a re-scored sample lattice, and determining the recognition text of the current sample sentence; and
determining the output of the loss function according to the recognition text of the current sample sentence and the sample text of the current sample sentence.
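Again for illustration only: real training per claim 7 would backpropagate a loss through the LSTM/GRU, whereas the sketch below collapses the model parameters into a single interpolation weight and uses sentence-level mismatch as the loss. The `rescore_fn` callable and the sample-record format are hypothetical names introduced for this sketch.

```python
# Loss = number of sample sentences whose re-scored recognition text
# disagrees with the reference sample text.
def training_loss(samples, rescore_fn, weight):
    loss = 0
    for sample in samples:
        recognized = rescore_fn(sample["lattice"], sample["context"], weight)
        loss += int(recognized != sample["reference"])
    return loss

def fit_weight(samples, rescore_fn, candidates=(0.0, 0.5, 1.0, 2.0)):
    # "Train until the loss converges": here, pick the candidate weight
    # that minimizes the loss over the sample set.
    return min(candidates, key=lambda w: training_loss(samples, rescore_fn, w))
```

The structure mirrors the claim: decode a sample lattice, re-score it with the context-reset model, compare the result against the reference sample text, and adjust the model to reduce the disagreement.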
8. A speech recognition apparatus comprising:
a decoding unit configured to acquire a candidate lattice according to a speech signal of a current sentence;
a resetting unit configured to reset a neural network model according to a context text corresponding to the current sentence, wherein the context text is the recognition text of one or more sentences preceding the current sentence, the neural network model is generated by training on corpus samples that carry context, and, in the case that the speech signal is a speech signal of a dialogue, the context text corresponding to the current sentence comprises the recognition text of the speaker's speech signal that precedes and is closest to the current sentence;
a re-scoring unit configured to re-score the candidate lattice with the reset neural network model to obtain a re-scored lattice; and
a recognition unit configured to determine the recognition text of the current sentence according to the re-scored lattice.
9. The apparatus of claim 8, further comprising:
a caching unit configured to store the recognition text of the current sentence in a buffer to serve as the context text of a subsequent sentence.
10. The apparatus of claim 8 or 9, further comprising:
a training unit configured to train the neural network model with context-carrying samples until the output of the loss function converges.
11. A speech recognition apparatus comprising:
a memory; and
a processor coupled to the memory, the processor configured to perform the method of any of claims 1-7 based on instructions stored in the memory.
12. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010597703.2A CN112259084B (en) | 2020-06-28 | 2020-06-28 | Speech recognition method, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112259084A CN112259084A (en) | 2021-01-22 |
CN112259084B (en) | 2024-07-16
Family
ID=74224197
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010597703.2A Active CN112259084B (en) | 2020-06-28 | 2020-06-28 | Speech recognition method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112259084B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112885338B (en) * | 2021-01-29 | 2024-05-14 | 深圳前海微众银行股份有限公司 | Speech recognition method, device, computer-readable storage medium, and program product |
CN113838456B (en) * | 2021-09-28 | 2024-05-31 | 中国科学技术大学 | Phoneme extraction method, voice recognition method, device, equipment and storage medium |
CN114171003A (en) * | 2021-12-09 | 2022-03-11 | 云知声智能科技股份有限公司 | Re-scoring method and device for voice recognition system, electronic equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108711422A (en) * | 2018-05-14 | 2018-10-26 | 腾讯科技(深圳)有限公司 | Audio recognition method, device, computer readable storage medium and computer equipment |
CN110517693A (en) * | 2019-08-01 | 2019-11-29 | 出门问问(苏州)信息科技有限公司 | Audio recognition method, device, electronic equipment and computer readable storage medium |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9520068B2 (en) * | 2004-09-10 | 2016-12-13 | Jtt Holdings, Inc. | Sentence level analysis in a reading tutor |
KR100755677B1 (en) * | 2005-11-02 | 2007-09-05 | 삼성전자주식회사 | Apparatus and method for dialogue speech recognition using topic detection |
JP4674609B2 (en) * | 2008-02-18 | 2011-04-20 | ソニー株式会社 | Information processing apparatus and method, program, and recording medium |
KR102097710B1 (en) * | 2014-11-20 | 2020-05-27 | 에스케이텔레콤 주식회사 | Apparatus and method for separating of dialogue |
US10304013B2 (en) * | 2016-06-13 | 2019-05-28 | Sap Se | Real time animation generator for voice content representation |
US10861446B2 (en) * | 2018-12-10 | 2020-12-08 | Amazon Technologies, Inc. | Generating input alternatives |
CN111145733B (en) * | 2020-01-03 | 2023-02-28 | 深圳追一科技有限公司 | Speech recognition method, speech recognition device, computer equipment and computer readable storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106683680B (en) | Speaker recognition method and device, computer equipment and computer readable medium | |
JP5901001B1 (en) | Method and device for acoustic language model training | |
CN110364171B (en) | Voice recognition method, voice recognition system and storage medium | |
CN111429946A (en) | Voice emotion recognition method, device, medium and electronic equipment | |
US8818813B2 (en) | Methods and system for grammar fitness evaluation as speech recognition error predictor | |
CN112259084B (en) | Speech recognition method, device and storage medium | |
CN110033760A (en) | Modeling method, device and the equipment of speech recognition | |
WO2018192186A1 (en) | Speech recognition method and apparatus | |
CN113327575B (en) | Speech synthesis method, device, computer equipment and storage medium | |
CN109036471B (en) | Voice endpoint detection method and device | |
JP6615736B2 (en) | Spoken language identification apparatus, method thereof, and program | |
WO2021103712A1 (en) | Neural network-based voice keyword detection method and device, and system | |
CN110473527B (en) | Method and system for voice recognition | |
WO2019017462A1 (en) | Satisfaction estimation model learning device, satisfaction estimation device, satisfaction estimation model learning method, satisfaction estimation method, and program | |
JP6495792B2 (en) | Speech recognition apparatus, speech recognition method, and program | |
CN113053390B (en) | Text processing method and device based on voice recognition, electronic equipment and medium | |
CN113744727A (en) | Model training method, system, terminal device and storage medium | |
US12136435B2 (en) | Utterance section detection device, utterance section detection method, and program | |
JP2018004947A (en) | Text correction device, text correction method, and program | |
CN113053414A (en) | Pronunciation evaluation method and device | |
Damavandi et al. | NN-grams: Unifying neural network and n-gram language models for speech recognition | |
JP6716513B2 (en) | VOICE SEGMENT DETECTING DEVICE, METHOD THEREOF, AND PROGRAM | |
CN113920987B (en) | Voice recognition method, device, equipment and storage medium | |
CN111883109B (en) | Voice information processing and verification model training method, device, equipment and medium | |
US12112749B2 (en) | Command analysis device, command analysis method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20210526 Address after: 100176 room 1004, 10th floor, building 1, 18 Kechuang 11th Street, Beijing Economic and Technological Development Zone, Daxing District, Beijing Applicant after: Beijing Huijun Technology Co.,Ltd. Address before: Room A402, 4th floor, building 2, No.18, Kechuang 11th Street, Daxing District, Beijing, 100176 Applicant before: BEIJING WODONG TIANJUN INFORMATION TECHNOLOGY Co.,Ltd. Applicant before: BEIJING JINGDONG CENTURY TRADING Co.,Ltd. |
GR01 | Patent grant | ||