
WO2018223796A1 - Speech recognition method, storage medium, and speech recognition device - Google Patents

Speech recognition method, storage medium, and speech recognition device

Info

Publication number
WO2018223796A1
WO2018223796A1 (PCT application PCT/CN2018/085819; priority application CN2018085819W)
Authority
WO
WIPO (PCT)
Prior art keywords
custom
decoding
model
decoding model
slot
Prior art date
Application number
PCT/CN2018/085819
Other languages
French (fr)
Chinese (zh)
Inventor
饶丰
卢鲤
马建雄
赵贺楠
孙彬
王尔玉
周领良
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2018223796A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Definitions

  • the present invention relates to the field of automatic speech recognition (ASR) technology, and in particular, to a speech recognition method, a storage medium, and a speech recognition device.
  • ASR technology is a technique for converting vocabulary content in human speech into computer readable input characters.
  • Speech recognition has a complex processing flow, which mainly includes four processes: acoustic model training, language model training, decoding resource network construction and decoding.
  • the existing speech recognition scheme is mainly obtained by calculating the maximum posterior probability of the speech signal based on the text, and is generally divided into two decoding modes: dynamic decoding and static decoding.
  • The speech recognition solution based on static decoding is mainly implemented on a finite state transducer (FST) network; in practice a weighted finite state transducer (WFST), an FST whose transitions carry weights, is used.
  • Most components, including the pronunciation dictionary, acoustic model, and grammar information, are integrated into a finite state transition graph, which is then searched with decoding tokens to obtain the optimal speech recognition result.
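As a rough illustration of this token-based search, the sketch below runs a lowest-cost path search over a tiny hand-made weighted transition graph. The graph, the costs, and the function names are assumptions for illustration only, not taken from the patent.

```python
import heapq

# Toy stand-in for a composed decoding graph:
# state -> list of (next_state, output_word, weight)
GRAPH = {
    0: [(1, "zhang", 0.2), (1, "zhan", 0.9)],
    1: [(2, "san", 0.1)],
    2: [],  # final state
}

def best_path(graph, start=0, final=2):
    """Dijkstra-style decoding-token search for the cheapest
    output word sequence from the start to the final state."""
    heap = [(0.0, start, [])]
    while heap:
        cost, state, words = heapq.heappop(heap)
        if state == final:
            return words, cost
        for nxt, word, weight in graph[state]:
            heapq.heappush(heap, (cost + weight, nxt, words + [word]))
    return None, float("inf")

words, cost = best_path(GRAPH)  # lowest-cost hypothesis
```

In a real system the graph is the composition of the pronunciation dictionary, acoustic model, and grammar, and the weights come from acoustic and language model scores.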
  • In embodiments of the present application, a voice recognition method, a storage medium, and a voice recognition device are provided.
  • a speech recognition method comprising:
  • the voice recognition device acquires a custom corpus corresponding to the current account while continuously acquiring the voice signal;
  • the voice recognition device analyzes and processes the custom corpus to construct at least one corresponding custom decoding model;
  • the voice recognition device loads the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model;
  • the voice recognition device decodes the voice signal using the new decoding model to obtain a voice recognition result.
  • One or more non-transitory computer readable storage media store computer executable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring a custom corpus corresponding to the current account while continuously acquiring the voice signal;
  • the voice signal is decoded using the new decoding model to obtain a voice recognition result.
  • A speech recognition apparatus comprises a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the following steps:
  • the voice signal is decoded using the new decoding model to obtain a voice recognition result.
  • FIG. 1 is a diagram of the application environment of a voice recognition method in an embodiment;
  • FIG. 2-1 is a schematic flowchart of an implementation process of a voice recognition method in an embodiment;
  • FIG. 2-3 is a schematic flowchart of an implementation of a voice recognition method in another embodiment;
  • FIG. 3-1 is a schematic diagram of a voice recognition interface in an embodiment;
  • FIG. 3-2 is a schematic diagram of a voice recognition interface in another embodiment;
  • FIG. 4-1 is a block diagram of an implementation process of a voice recognition method in an embodiment;
  • FIG. 4-2 is a block diagram of an implementation of a voice recognition method in another embodiment;
  • FIG. 4-3 is a partial schematic diagram of a new WFST network in an embodiment;
  • FIG. 5 is a schematic structural diagram of a unit of a voice recognition device in an embodiment.
  • FIG. 6 is a schematic diagram showing the internal structure of a voice recognition device in an embodiment.
  • the embodiment of the present application provides a voice recognition method, which is applied to a voice recognition device.
  • the speech recognition device can function as a speech recognition engine.
  • The voice recognition device may be a cloud voice recognition device, that is, a voice recognition server or a component disposed in the voice recognition server; it may also be a local voice recognition device, that is, a terminal or a component disposed in the terminal.
  • FIG. 1 is an application environment diagram of a voice recognition method in an embodiment of the present application.
  • voice recognition server 110 can communicate with terminal 120 over network 200.
  • the voice recognition device is the voice recognition server 110
  • the voice recognition method is performed by the voice recognition server 110
  • the voice recognition method is performed by the terminal 120.
  • The voice recognition device may be configured to: acquire a custom corpus corresponding to the current account while continuously acquiring the voice signal; analyze and process the custom corpus to construct at least one custom decoding model; load the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and decode the voice signal using the new decoding model to obtain a voice recognition result.
  • The voice recognition method is described below taking the voice recognition server as the voice recognition device as an example.
  • the foregoing method may include:
  • S213 The voice recognition server acquires a custom corpus corresponding to the current account while continuously acquiring the voice signal.
  • The voice signal continuously acquired by the voice recognition server is sent by the terminal. Because the terminal continuously sends voice signals to the voice recognition server, the server continuously receives them, and during this process it can acquire the custom corpus corresponding to the current account.
  • Text is usually used to represent language instances; that is, the text is used as the corpus.
  • The custom corpus may include one of the following: contact information corresponding to the current account, such as a telephone address book or instant messaging contact information; or proprietary text of at least one field uploaded by the current account, such as legal provisions, communication standards, or industry standards.
  • the custom corpus can also be other texts, which is not specifically limited in the embodiment of the present application.
  • The custom corpus may be read by the voice recognition server from the user account information server or the terminal after receiving the voice signal uploaded by the terminal, or it may be uploaded to the voice recognition server by the user through an application on the terminal.
  • The user presses the voice input control 301 in the voice recognition interface 30 shown in FIG. 3-1 and then speaks into the microphone, and the real-time voice recognition result is streamed back.
  • A voice activity detection (VAD) module detects whether the user is speaking.
  • For example, the voice recognition server reads the contact information of the current account from the user account information server or the terminal.
  • The terminal loads the proprietary text of at least one field required by the user, such as legal provisions, and uploads it to the voice recognition server, which thereby obtains the legal provisions.
  • custom corpus may be classified into categories or may not be classified, and is not specifically limited in the embodiment of the present application.
  • S214 The speech recognition server analyzes and processes the custom corpus and constructs at least one custom decoding model.
  • S214 may include: classifying the custom corpus to obtain a custom language model for each category, and constructing at least one custom decoding model corresponding to each category based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category.
  • The speech recognition server classifies the custom corpus to obtain a custom language model for each category. For example, if the voice recognition server obtains both the contact information and the legal provisions corresponding to the current account, it first classifies them, obtaining a language model corresponding to the contact information and a language model corresponding to the legal provisions. It then constructs at least one custom decoding model for each category from the pre-stored acoustic model, the dictionary model, and each category's custom language model; that is, it constructs a decoding model corresponding to the contact information and a decoding model corresponding to the legal provisions.
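The classify-then-build step above can be sketched as follows. A unigram relative-frequency model stands in for a real language model, and all names and the sample entries are illustrative assumptions rather than the patent's implementation.

```python
from collections import defaultdict

def classify_corpus(entries):
    """Group (category, text) pairs into per-category text lists."""
    by_category = defaultdict(list)
    for category, text in entries:
        by_category[category].append(text)
    return dict(by_category)

def build_language_model(texts):
    """Unigram relative-frequency model over one category's texts."""
    counts = defaultdict(int)
    total = 0
    for text in texts:
        for word in text.split():
            counts[word] += 1
            total += 1
    return {word: count / total for word, count in counts.items()}

entries = [
    ("CONTACT", "Zhang San"),
    ("CONTACT", "Li Si"),
    ("LAW", "Article 1 of the contract law"),
]
corpora = classify_corpus(entries)
models = {cat: build_language_model(texts) for cat, texts in corpora.items()}
```

A real system would then compose each per-category language model with the acoustic and dictionary models to obtain the per-category custom decoding models.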
  • S215 The speech recognition server loads the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model.
  • The general decoding model is a decoding model built on everyday language; it is universal and recognizes everyday language well.
  • S215 may further include: acquiring a context template with a slot, where the slot is an information variable in the context template and the context template is obtained by data mining the historical voice data of the current account; and, according to the classification tag of the slot, adding the slot between the start symbol and the end symbol of the general decoding model and associating it with the custom decoding model carrying the same classification tag among the at least one custom decoding model, to generate a new decoding model.
  • The voice recognition server may acquire historical voice data of the current account before the user uses the voice recognition service, perform data mining on the data, and obtain at least one context template with a slot. For example, to recognize person names in speech, data mining yields name-related context templates such as "@NAME@, come find me for dinner" and "@NAME@ and I are good friends". Note that "@NAME@" is the slot in these context templates, and "NAME" is the classification tag of the slot.
  • The speech recognition server adds the slot between the start symbol and the end symbol of the general decoding model according to the context templates, and associates the slot with the custom decoding model having the same classification tag among the at least one custom decoding model, generating a new decoding model.
  • For example, according to the context template "@NAME@, come find me for dinner", the voice recognition server inserts the slot corresponding to "@NAME@" into the general decoding model and, according to the classification tag "NAME", associates that slot with the decoding model corresponding to the contact information, thus generating a new decoding model.
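A minimal sketch of this slot-insertion step, assuming a dict-based stand-in for the decoding models and the "@TAG@" slot syntax from the example above; the representation and names are illustrative assumptions.

```python
import re

SLOT_RE = re.compile(r"@([A-Z]+)@")

def extract_slots(template):
    """Return the classification tags of all slots in a context template."""
    return SLOT_RE.findall(template)

def insert_slots(general_model, templates, custom_models):
    """Add one slot per tag found in the templates, associating it with
    the custom decoding model carrying the same classification tag."""
    new_model = {"base": general_model, "slots": {}}
    for template in templates:
        for tag in extract_slots(template):
            if tag in custom_models:
                new_model["slots"][tag] = custom_models[tag]
    return new_model

templates = ["@NAME@ come find me for dinner", "@NAME@ and I are good friends"]
custom_models = {"NAME": {"entries": ["Zhang San", "Li Si"]}}
new_model = insert_slots({"vocab": "general"}, templates, custom_models)
```

In the WFST realization described later, the "slot" is an arc between the start and end symbols of the general network whose expansion is the custom network.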
  • S216 The speech recognition server decodes the voice signal using the new decoding model to obtain a voice recognition result.
  • S216 may include: decoding and recognizing the voice signal according to the new decoding model; when the decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding there; and, after decoding in the custom decoding model associated with the slot completes, returning from the slot and continuing decoding in the general decoding model until the voice recognition result is obtained.
  • the speech recognition server can input the speech signal to the new decoding model for decoding.
  • The speech recognition server performs a phoneme search in the general decoding model until the decoding token encounters a slot inserted into the general decoding model; at that point it jumps to the custom decoding model associated with the slot to continue the phoneme search. After the search in the custom decoding model completes, it returns from the slot and continues searching after the slot in the general decoding model until the character string with the highest probability value is obtained as the voice recognition result.
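The jump-and-return walk can be sketched over word lists, standing in for the phoneme-level search; the function and its greedy matching are illustrative assumptions, not the patent's algorithm.

```python
def decode_with_slots(words, template_words, custom_entries):
    """Walk a template such as ["@NAME@", "come", "find", "me"]:
    ordinary words are matched in the 'general model', while a
    "@TAG@" slot jumps into the associated 'custom model'
    (custom_entries) and returns once an entry is matched."""
    out, i = [], 0
    for tw in template_words:
        if tw.startswith("@") and tw.endswith("@"):
            # Jump into the custom model: greedily match an entry.
            for entry in custom_entries:
                parts = entry.split()
                if words[i:i + len(parts)] == parts:
                    out.append(entry)   # matched inside the custom model
                    i += len(parts)
                    break               # return to the general model
        else:
            if i < len(words) and words[i] == tw:
                out.append(tw)
                i += 1
    return " ".join(out)

result = decode_with_slots(
    ["Zhang", "San", "come", "find", "me"],
    ["@NAME@", "come", "find", "me"],
    ["Zhang San", "Li Si"],
)
```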
  • FIG. 2-2 is a schematic diagram of an implementation process of a voice recognition method in an embodiment. Referring to FIG. 2-2, the following steps are also included before S213:
  • S211 The terminal collects a voice signal input by the user.
  • the terminal can install an application having a voice input function, such as an instant messaging application, a voice input method application, or a voice assistant.
  • the user can use these applications to input voice signals.
  • For example, the instant messaging application invokes the voice collection device, such as by turning on the microphone; the user then speaks into the microphone, and the terminal thus collects the voice signal input by the user.
  • S212 The terminal sends the collected voice signal to the voice recognition server.
  • the terminal transmits the collected voice signal to the voice recognition server.
  • The terminal can send the voice signal to the voice recognition server via a wireless local area network, a cellular data network, or the like.
  • S217 The voice recognition server sends the voice recognition result to the terminal.
  • S218 The terminal outputs a speech recognition result.
  • The speech recognition server transmits the voice recognition result, that is, the character string, to the terminal, which displays it on the voice recognition interface.
  • For example, the user speaks the sentence "Zhang San, come find me for dinner". The sentence is decoded by the new decoding model, generated by inserting the custom decoding model corresponding to the contact information into the general decoding model, to obtain a character string. The voice recognition server sends the string to the terminal; as shown in FIG. 3-2, the terminal can display the string 302 in the voice recognition interface 30, or convert the string into a voice signal, output it to the user, and perform voice interaction with the user.
  • In other embodiments, other output manners may also be used; this is not specifically limited in the embodiments of the present application.
  • The voice recognition method is described below taking the terminal as the voice recognition device as an example.
  • the foregoing method may include:
  • S221 The terminal collects a voice signal input by the user.
  • the terminal can collect the voice signal input by the user through the voice collection device.
  • the terminal can install an application having a voice input function, such as an instant messaging application, a voice input method application, or a voice assistant.
  • the user can use these applications to input voice signals.
  • For example, the instant messaging application invokes the voice collection device, such as by turning on the microphone; the user can then start speaking into the microphone, and the terminal thus collects the voice signal input by the user.
  • The terminal can transmit the voice signal collected by the voice collection device to the processor, that is, the decoder, through the communication bus.
  • S223 The terminal acquires a customized corpus corresponding to the current account in the process of continuously acquiring the voice signal.
  • the terminal may acquire a custom corpus corresponding to the current account by the processor during the process of continuously acquiring the voice signal.
  • the processor continuously receives the voice signals, and the processor can obtain the customized corpus corresponding to the current account in the process of continuously receiving the voice signals.
  • The custom corpus may include one of the following: contact information corresponding to the current account, such as a phone address book or instant messaging contact information; or proprietary text of at least one field uploaded by the current account, such as legal texts, communication standards, or industry standards.
  • the custom corpus can also be other texts, which is not specifically limited in the embodiment of the present application.
  • the customized corpus may be read from the user account information server or locally after receiving the voice signal collected by the voice collection device, or may be stored locally by the user in advance.
  • custom corpus may be classified into categories or may not be classified, and is not specifically limited in the embodiment of the present application.
  • S224 The terminal analyzes and processes the customized corpus, and constructs at least one custom decoding model.
  • S224 may include: classifying the custom corpus to obtain a custom language model for each category, and constructing at least one custom decoding model corresponding to each category based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category.
  • the terminal may analyze and process the custom corpus through the processor, and construct corresponding at least one custom decoding model.
  • The processor classifies the custom corpus to obtain a custom language model for each category. For example, if the processor obtains both the contact information and the legal provisions corresponding to the current account, it first classifies them, obtaining a language model corresponding to the contact information and a language model corresponding to the legal provisions. It then constructs at least one custom decoding model for each category from the pre-stored acoustic model, the dictionary model, and each category's custom language model; that is, it constructs a decoding model corresponding to the contact information and a decoding model corresponding to the legal provisions.
  • S225 The terminal loads at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model.
  • S225 may further include: acquiring a context template with a slot, where the context template is obtained by data mining the historical voice data of the current account; and, according to the classification tag of the slot, adding the slot between the start symbol and the end symbol of the general decoding model and associating it with the custom decoding model carrying the same classification tag among the at least one custom decoding model, to generate a new decoding model.
  • the terminal may load at least one custom decoding model into a pre-stored general decoding model by the processor to generate a new decoding model.
  • The processor may acquire historical voice data of the current account, perform data mining on the data, and obtain at least one context template with a slot. For example, to recognize person names in speech, data mining yields name-related context templates such as "@NAME@, come find me for dinner" and "@NAME@ and I are good friends". Note that "@NAME@" is the slot in these context templates, and "NAME" is the classification tag of the slot.
  • The processor adds the slot between the start symbol and the end symbol of the general decoding model and associates the slot with the custom decoding model having the same classification tag among the at least one custom decoding model, generating a new decoding model. For example, according to the context template "@NAME@, come find me for dinner", the processor inserts the slot corresponding to "@NAME@" into the general decoding model and, according to the classification tag "NAME", associates that slot with the decoding model corresponding to the contact information, thus generating a new decoding model.
  • S226 The terminal decodes the voice signal by using a new decoding model to obtain a voice recognition result.
  • S226 may include: decoding and recognizing the voice signal according to the new decoding model; when the decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding there; and, after decoding in the custom decoding model associated with the slot completes, returning from the slot and continuing decoding in the general decoding model until the voice recognition result is obtained.
  • the terminal can decode the voice signal by using a new decoding model to obtain a voice recognition result.
  • the voice signal can be input to the new decoding model for decoding.
  • The processor performs a phoneme search in the general decoding model until the decoding token encounters a slot inserted into the general decoding model; at that point it jumps to the custom decoding model associated with the slot to continue the phoneme search. After the search in the custom decoding model completes, it returns from the slot and continues searching after the slot in the general decoding model until the character string with the highest probability value is obtained as the voice recognition result.
  • S227 The terminal outputs a speech recognition result.
  • the terminal can output a speech recognition result through the processor.
  • the processor may display the character string on the voice recognition interface as shown in FIG. 3-2, or convert the character string into a voice signal, and output it to the user for voice interaction with the user.
  • In other embodiments, other output manners may also be used; this is not specifically limited in the embodiments of the present application.
  • In summary, the voice recognition device acquires a custom corpus corresponding to the current account, such as the contact information of the current account and the proprietary text of a specific field uploaded by the current account, while continuously acquiring the voice signal.
  • The custom corpus is analyzed and processed to construct at least one custom decoding model; the constructed at least one custom decoding model is then loaded into a pre-stored general decoding model to generate a new decoding model; finally, the voice signal is decoded using the new decoding model to obtain the voice recognition result.
  • In this way, the probability value of the user's custom corpus in the general decoding model can be significantly increased, reducing recognition offset for speech containing the custom corpus and improving the overall accuracy rate of speech recognition.
  • the WFST network can be used in practical applications to implement the decoding model.
  • FIG. 4-1 is a block diagram of an implementation process of a voice recognition method in an embodiment of the present application.
  • The decoding model is constructed in an offline environment.
  • the static WFST network 414 is constructed by integrating the acoustic model 411, the dictionary 412, the language model 413, and the like.
  • the WFST network is first loaded.
  • When the service receives a speech signal, it first converts it into speech features; then, by computing the acoustic model score and the weight score in the WFST network, it obtains the output text combination with the greatest posterior probability.
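A minimal sketch of picking the maximum-posterior output: sum the acoustic log score and the WFST weight log score per hypothesis and take the argmax. The hypotheses and score values below are invented for illustration.

```python
import math

def best_hypothesis(hypotheses):
    """hypotheses: list of (text, acoustic_logprob, wfst_log_weight).
    Returns the text whose combined log score is greatest."""
    return max(hypotheses, key=lambda h: h[1] + h[2])[0]

hyps = [
    ("Zhang San come find me to eat", math.log(0.4), math.log(0.5)),
    ("Zhang Shan come find me to eat", math.log(0.4), math.log(0.1)),
]
best = best_hypothesis(hyps)
```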
  • FIG. 4-2 is a block diagram of an implementation process of the speech recognition method in the embodiment of the present application.
  • While the voice recognition online service is running, the custom corpus 421 corresponding to the current account, such as contact information and proprietary text of at least one field, is analyzed and processed.
  • Users may favor some uncommon vocabulary, such as "Martian" internet slang; these words are probably not in the general vocabulary. Therefore, a user-customized out-of-vocabulary (OOV) dictionary is first built, and a new vocabulary is obtained by combining the OOV dictionary with the general vocabulary. The new vocabulary is then used together with the user's personal data to build a custom WFST network 423.
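The vocabulary merge can be sketched as a set union between the general vocabulary and the OOV words mined from the user's custom corpus; the function names and sample data are illustrative assumptions.

```python
def find_oov(custom_texts, general_vocab):
    """Words in the user's custom corpus that the general vocabulary lacks."""
    words = {w for text in custom_texts for w in text.split()}
    return words - general_vocab

def merge_vocab(general_vocab, oov_dict):
    """New vocabulary = general vocabulary united with the OOV dictionary."""
    return general_vocab | oov_dict

general_vocab = {"come", "find", "me", "eat"}
custom_texts = ["Zhang San", "Li Si"]
oov = find_oov(custom_texts, general_vocab)
new_vocab = merge_vocab(general_vocab, oov)
```

The new vocabulary would then feed the construction of the custom WFST network, along with the user's personal data.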
  • the custom decoding model described in the foregoing embodiment may be a custom WFST network; the general decoding model may be a general-purpose WFST network.
  • Correspondingly, the step in the foregoing embodiments of loading the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model may include: merging the custom WFST network with the general WFST network to obtain a new WFST network.
  • The step in the foregoing embodiments of decoding the voice signal using the new decoding model to obtain the voice recognition result may include: searching and decoding the voice signal using the new WFST network to obtain the voice recognition result.
  • FIG. 4-3 is a partial schematic diagram of a new WFST network in the embodiment of the present application.
  • A slot 432 is inserted in the general WFST network 431, and the slot 432 is associated with a custom WFST network 433 corresponding to the contact information to form a new WFST network.
  • When the decoding token searches to the location of the slot in the general WFST network, it directly enters the custom WFST network and continues searching there; when the search in the custom WFST network ends, the token returns to the general WFST network and continues searching. In this way, a dedicated decoding space can be built for each user.
  • The steps in the embodiments of the present application are not necessarily performed in the order indicated. Except as explicitly stated herein, the execution order of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some steps may include multiple sub-steps or stages that are not necessarily performed at the same time but may be executed at different times, and the execution order of these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some sub-steps or stages of other steps. Based on the same inventive concept, an embodiment of the present application provides a voice recognition device, which may be the voice recognition device described in one or more of the foregoing embodiments.
  • For the internal structure of the speech recognition apparatus, reference may be made to the structure shown in FIG. 6.
  • Each of the units described below may be implemented in whole or in part by software, hardware, or a combination thereof.
  • The voice recognition device 500 may include: a voice signal acquiring unit 501 for continuously acquiring a voice signal; a corpus acquiring unit 502 for acquiring a custom corpus corresponding to the current account while the voice signal is continuously acquired; a model construction unit 503 configured to analyze and process the custom corpus and construct at least one corresponding custom decoding model; and a loading unit 504 configured to load the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model.
  • the decoding unit 505 is configured to decode the speech signal by using a new decoding model to obtain a speech recognition result.
  • the custom corpus corresponding to the current account includes at least one of the following: contact information of the current account and proprietary text of at least one domain.
  • The foregoing custom decoding model may be a custom WFST network, and the general decoding model may be a general WFST network; correspondingly, the loading unit is further configured to merge the custom WFST network with the general WFST network to obtain the new WFST network, and the decoding unit is further configured to search and decode the voice signal using the new WFST network to obtain the speech recognition result.
  • the model construction unit is further configured to classify the custom corpus to obtain a customized language model of each category; based on the pre-stored acoustic model, the dictionary model, and the custom language model of each category, At least one custom decoding model corresponding to each category is constructed.
  • The loading unit is further configured to perform data mining on the historical voice data of the current account to obtain a context template with a slot, add the slot between the start symbol and the end symbol of the general decoding model according to the classification tag of the slot, and associate it with the custom decoding model carrying the same classification tag among the at least one custom decoding model, to generate a new decoding model.
  • the decoding unit is specifically configured to decode and recognize the voice signal according to the new decoding model; when the decoding token encounters a slot, it jumps to the custom decoding model associated with that slot and continues decoding there; after decoding in the slot's custom decoding model is completed, it returns to the slot and continues decoding in the general decoding model until the speech recognition result is obtained.
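The slot-jump behaviour can be illustrated with a toy matcher. The template and vocabulary representation below is an assumption for illustration, far simpler than a real token-passing WFST search:

```python
def decode_with_slots(template, slot_models, words):
    """Toy walk of a recognized word sequence against one template path.
    A label like '@NAME@' is a slot: the walk defers to the custom model
    (here just a vocabulary set) for that category, then resumes in the
    general template. Illustrative only; not a real WFST token search."""
    out, i = [], 0
    for label in template:
        if i >= len(words):
            return None
        if label.startswith("@") and label.endswith("@"):
            vocab = slot_models[label.strip("@")]   # jump into the slot's custom model
            if words[i] not in vocab:
                return None                         # no path through the slot
        elif words[i] != label:                     # decode in the general model
            return None
        out.append(words[i])
        i += 1                                      # return and continue after the slot
    return " ".join(out) if i == len(words) else None
```

A name that exists in the per-account contact model is accepted inside the slot; a name absent from it yields no path, mirroring how the custom sub-network constrains what the slot may recognize.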
  • FIG. 6 is a schematic diagram of an internal structure of a voice recognition device according to an embodiment of the present application.
  • the voice recognition device 600 includes a processor, a memory, and a communication interface that are connected through a system bus.
  • the memory comprises a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device can store an operating system and computer readable instructions that, when executed, cause the processor to perform a speech recognition method.
  • the processor of the computer device is used to provide computing and control capabilities to support the operation of the entire computer device.
  • Computer readable instructions may be stored in the internal memory of the computer device, and when the computer readable instructions are executed by the processor, the processor may be caused to perform a speech recognition method.
  • the computer device can be a cell phone, a tablet, a personal digital assistant, a wearable device, or the like. It will be understood by those skilled in the art that the structure shown in FIG. 6 is only a block diagram of the part of the structure related to the solution of the present application, and does not constitute a limitation on the terminal to which the solution of the present application is applied; the specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • the processor executes the following steps: in the process of continuously acquiring the voice signal through the communication interface, acquiring a custom corpus corresponding to the current account; analyzing the custom corpus to construct at least one corresponding custom decoding model; loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and decoding the speech signal with the new decoding model to obtain a speech recognition result.
  • the custom corpus corresponding to the current account includes at least one of the following: contact information of the current account and proprietary text of at least one domain.
  • the custom decoding model may be a custom WFST network and the general decoding model may be a general-purpose WFST network; accordingly, when executing the computer readable instructions, the processor further merges the custom WFST network with the general WFST network to obtain a new WFST network, and uses the new WFST network to search and decode the speech signal to obtain the speech recognition result.
  • when the processor executes the computer readable instructions, the following steps are further performed: classifying the custom corpus to obtain a custom language model for each category; and constructing, based on the pre-stored acoustic model, the dictionary model, and the custom language model of each category, at least one custom decoding model corresponding to each category.
  • when executing the computer readable instructions, the processor further performs the following steps: performing data mining on historical voice data of the current account to obtain a context template with a slot; adding the slot between the start symbol and the end symbol of the general decoding model according to the slot's classification tag; and associating the slot with the custom decoding model having that classification tag among the at least one custom decoding model, to generate a new decoding model.
  • when the processor executes the computer readable instructions, the following steps are further implemented: decoding and recognizing the voice signal according to the new decoding model; when the decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding there; and after decoding in that custom decoding model is completed, returning to the slot and continuing to decode in the general decoding model until the speech recognition result is obtained.
  • the processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor.
  • the memory may be a removable storage device, a read only memory (ROM), a magnetic disk, or an optical disk. It is to be understood that the electronic devices implementing the functions of the processor and the memory described above may be of other types, which are not specifically limited in the embodiments of the present application.
  • the communication interface may be an interface between the terminal and the voice recognition server.
  • when the voice recognition device is a local voice recognition device, that is, a terminal, it further includes a voice collection device. The voice collection device may be a microphone, a microphone array, or the like, which is not specifically limited in this embodiment.
  • the communication interface on the terminal can be an interface between the processor and the voice collection device, such as an interface between the processor and a microphone or a microphone array.
  • the foregoing communication interface may have other implementation forms, which are not specifically limited in this embodiment.
  • an embodiment of the present application provides a computer readable storage medium storing computer readable instructions that, when executed by a processor, implement the following steps: in the process of continuously acquiring a voice signal, acquiring a custom corpus corresponding to the current account; analyzing the custom corpus to construct at least one corresponding custom decoding model; loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and decoding the speech signal using the new decoding model to obtain a speech recognition result.
  • the custom corpus corresponding to the current account includes at least one of the following: contact information of the current account and proprietary text of at least one domain.
  • the custom decoding model may be a custom WFST network and the universal decoding model may be a general-purpose WFST network; accordingly, when the computer readable instructions are executed by the processor, the following steps are also implemented: merging the custom WFST network with the universal WFST network to obtain a new WFST network, and using the new WFST network to search and decode the voice signal to obtain the speech recognition result.
  • when executed by the processor, the computer readable instructions further cause the processor to: classify the custom corpus to obtain a custom language model for each category; and construct, based on the pre-stored acoustic model, the dictionary model, and the custom language model of each category, at least one custom decoding model corresponding to each category.
  • when the computer readable instructions are executed by the processor, the following steps are further performed: performing data mining on historical voice data of the current account to obtain a context template with a slot; adding the slot between the start symbol and the end symbol of the general decoding model according to the slot's classification tag; and associating the slot with the custom decoding model having that classification tag among the at least one custom decoding model, to generate a new decoding model.
  • when the computer readable instructions are executed by the processor, the following steps are further implemented: decoding and recognizing the speech signal according to the new decoding model; when the decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding there; and after decoding in that custom decoding model is completed, returning to the slot and continuing to decode in the general decoding model until the speech recognition result is obtained.
  • the computer readable instructions are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods of the embodiments of the present application.
  • the foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read only memory (ROM), a magnetic disk, or an optical disk.
  • the disclosed apparatus and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division into units is only a division by logical function; in actual implementation there may be other division manners: for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately, or two or more units may be integrated into one unit; the integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional units.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A speech recognition method comprises: in a process of continuously acquiring speech signals, acquiring custom language data corresponding to a current account (S213); performing analytical processing on the custom language data to construct at least one corresponding custom decoding model (S214); loading the at least one custom decoding model to a pre-stored universal decoding model to generate a new decoding model (S215); and using the new decoding model to decode the speech signals to obtain a speech recognition result (S216).

Description

Speech recognition method, storage medium and speech recognition device
This application claims priority to Chinese Patent Application No. 201710425219X, entitled "A Speech Recognition Method, Apparatus, and Speech Recognition Engine", filed with the Chinese Patent Office on June 7, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of automatic speech recognition (ASR) technology, and in particular to a speech recognition method, a storage medium, and a speech recognition device.
Background
ASR technology converts the vocabulary content of human speech into computer readable input characters. Speech recognition involves a complex processing flow, mainly comprising four processes: acoustic model training, language model training, decoding resource network construction, and decoding.
At present, existing speech recognition schemes mainly work by computing the maximum posterior probability of text given the speech signal, and generally fall into two decoding modes: dynamic decoding and static decoding. Speech recognition solutions based on static decoding are mainly implemented on a Finite State Transducer (FST) network. For example, a Weighted Finite State Transducer (WFST) network integrates most of the components of the speech recognition process, including the pronunciation dictionary, the acoustic model, and grammar information, into a finite state transition graph; a decoding token then searches this graph to obtain the optimal speech recognition result.
However, since the integrated finite state transition graph is fixed, it cannot be modified once generated. Moreover, the speech content of different users varies widely: in algorithmic terms, each user's language model is different, and acoustic models also differ with accent, so the finite state transition graph corresponding to each user is different. To match all users, a finite state transition graph would have to be generated for each user, which is usually infeasible with limited storage resources; typically, only a finite state transition graph for common speech is stored. As a result, every user performs the speech search on the same graph, which often causes data bias and low speech recognition accuracy.
Summary
According to various embodiments provided in the present application, a speech recognition method, a storage medium, and a speech recognition device are provided.
A speech recognition method includes:
a speech recognition device acquiring, in the process of continuously acquiring a speech signal, a custom corpus corresponding to the current account;
the speech recognition device analyzing the custom corpus to construct at least one corresponding custom decoding model;
the speech recognition device loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and
the speech recognition device decoding the speech signal using the new decoding model to obtain a speech recognition result.
One or more non-volatile computer readable storage media storing computer executable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
in the process of continuously acquiring a speech signal, acquiring a custom corpus corresponding to the current account;
analyzing the custom corpus to construct at least one corresponding custom decoding model;
loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and
decoding the speech signal using the new decoding model to obtain a speech recognition result.
A speech recognition device includes a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, cause the processor to perform the following steps:
in the process of continuously acquiring a speech signal, acquiring a custom corpus corresponding to the current account;
analyzing the custom corpus to construct at least one corresponding custom decoding model;
loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and
decoding the speech signal using the new decoding model to obtain a speech recognition result.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the present application will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art may obtain other drawings from them without creative effort.
FIG. 1 is an application environment diagram of a speech recognition method in an embodiment;
FIG. 2-1 is a schematic flowchart of the implementation of a speech recognition method in an embodiment;
FIG. 2-2 is a schematic flowchart of the implementation of a speech recognition method in another embodiment;
FIG. 2-3 is a schematic flowchart of the implementation of a speech recognition method in another embodiment;
FIG. 3-1 is a schematic diagram of a speech recognition interface in an embodiment;
FIG. 3-2 is a schematic diagram of a speech recognition interface in another embodiment;
FIG. 4-1 is a block diagram of the implementation of a speech recognition method in an embodiment;
FIG. 4-2 is a block diagram of the implementation of a speech recognition method in another embodiment;
FIG. 4-3 is a partial schematic diagram of a new WFST network in an embodiment;
FIG. 5 is a schematic structural diagram of the units of a speech recognition device in an embodiment; and
FIG. 6 is a schematic diagram of the internal structure of a speech recognition device in an embodiment.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the application and are not intended to limit it.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments.
The embodiments of the present application provide a speech recognition method applied to a speech recognition device, which can serve as a speech recognition engine. The speech recognition device may be a cloud device, that is, a speech recognition server or a component disposed in a speech recognition server; it may also be a local device, that is, a terminal or a component disposed in a terminal.
FIG. 1 is an application environment diagram of the speech recognition method in an embodiment of the present application. Referring to FIG. 1, the speech recognition server 110 can communicate with the terminal 120 over the network 200. When the speech recognition device is the speech recognition server 110, the speech recognition method is performed by the speech recognition server 110; when the speech recognition device is the terminal 120, the method is performed by the terminal 120.
The speech recognition device can thus be used to: acquire, in the process of continuously acquiring a speech signal, a custom corpus corresponding to the current account; analyze the custom corpus to construct at least one corresponding custom decoding model; load the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and decode the speech signal using the new decoding model to obtain a speech recognition result.
The speech recognition method is described below by taking a speech recognition server as the speech recognition device.
FIG. 2-1 is a schematic flowchart of the implementation of the speech recognition method in an embodiment of the present application. Referring to FIG. 2-1, the method may include:
S213: The speech recognition server acquires, in the process of continuously acquiring a speech signal, a custom corpus corresponding to the current account.
Here, the speech signal continuously acquired by the speech recognition server is sent by the terminal. Since the terminal keeps sending speech signals to the speech recognition server, the server keeps receiving them, and in the process of receiving these speech signals, it can obtain the custom corpus corresponding to the current account.
In practical applications, text usually stands in for language instances, that is, text is used as the corpus. The custom corpus may then include one of the following: contact information corresponding to the current account, such as a telephone address book or instant messaging contacts; or proprietary text of at least one domain uploaded under the current account, such as legal provisions, communication standards, or industry standards. Of course, the custom corpus may also be other text, which is not specifically limited in the embodiments of the present application.
In other embodiments of the present application, the custom corpus may be read by the speech recognition server from a user account information server or from the terminal after receiving the speech signal uploaded by the terminal, or it may be uploaded by the user to the speech recognition server through an application on the terminal. Of course, other acquisition methods are possible and are not specifically limited in the embodiments of the present application.
For example, the user presses the voice input control 301 in the voice recognition interface 30 shown in FIG. 3-1 and then speaks into the microphone, and the real-time speech recognition results are streamed back. In this process, a Voice Activity Detection (VAD) module first extracts the effective part of the speech signal, at which point speech recognition for that segment begins; after recognition starts, the speech recognition server reads the contact information of the current account from the user account information server or the terminal. Alternatively, after the user starts using the speech recognition service, the terminal loads the proprietary text of at least one domain required by the user, such as legal provisions, and uploads it to the speech recognition server, whereupon the server obtains the legal provisions.
It should be noted that the custom corpus may or may not be divided into categories, which is not specifically limited in the embodiments of the present application.
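The VAD step mentioned in the example above is only named in the text; a toy energy-threshold detector (an assumed stand-in, not the disclosed module) illustrates the idea of extracting the effective part of the signal:

```python
def vad_segments(samples, frame_len=160, threshold=0.01):
    """Toy energy-based voice activity detection: return (start, end)
    sample index ranges whose mean frame energy exceeds the threshold.
    Illustrative only; frame length and threshold are arbitrary."""
    segments, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy >= threshold and start is None:
            start = i                      # speech onset
        elif energy < threshold and start is not None:
            segments.append((start, i))    # speech offset
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments
```

Only the segments returned here would be forwarded to decoding; silence around them is discarded.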
S214: The speech recognition server analyzes the custom corpus and constructs at least one corresponding custom decoding model.
In a specific implementation, to make speech recognition more accurate, S214 may include: classifying the custom corpus to obtain a custom language model for each category; and constructing, based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category, at least one custom decoding model corresponding to each category.
Here, after obtaining the custom corpus, the speech recognition server classifies it to obtain a custom language model for each category. For example, if the server obtains both the contact information and the legal provisions corresponding to the current account, it first classifies them, producing a language model for the contact information and a language model for the legal provisions; it then constructs, from the pre-stored acoustic model, the dictionary model, and these category-specific language models, at least one custom decoding model per category, that is, a decoding model for the contact information and a decoding model for the legal provisions.
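As an illustration of the per-category step, a unigram frequency model can stand in for the custom language model of each category. This is a deliberate simplification: a real system would train n-gram language models and compose them with the acoustic and dictionary models into WFST decoding networks:

```python
from collections import Counter

def build_language_models(corpus_by_category):
    """Sketch: one unigram relative-frequency model per corpus category
    (e.g. 'CONTACTS', 'LEGAL'). Stand-in for per-category n-gram LMs."""
    models = {}
    for category, sentences in corpus_by_category.items():
        counts = Counter(word for s in sentences for word in s.split())
        total = sum(counts.values())
        models[category] = {w: c / total for w, c in counts.items()}
    return models
```

Keeping one model per category is what later allows each slot type (names, legal terms, ...) to be linked to its own sub-model.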
S215: The speech recognition server loads the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model.
Here, the general decoding model is a decoding model built for everyday language; it is universal and recognizes everyday speech well.
In a specific implementation, since different users differ in language habits and accent, to achieve more accurate speech recognition, S215 may further include: acquiring a context template with a slot, where the slot is an information variable in the context template and the context template is obtained by data mining the historical voice data of the current account; and, according to the slot's classification tag, adding the slot between the start symbol and the end symbol of the general decoding model and associating it with the custom decoding model carrying that classification tag among the at least one custom decoding model, to generate a new decoding model.
Here, before the user uses the speech recognition service, the speech recognition server can acquire the historical voice data of the current account and mine it to obtain at least one context template with a slot. For example, to recognize person names in speech, data mining may yield name-related context templates such as "@NAME@来找我吃饭" ("@NAME@ is coming over to eat with me") and "我和@NAME@是好朋友" ("@NAME@ and I are good friends"). In these templates, "@NAME@" is the slot and "NAME" is its classification tag. The server then, according to these templates, adds the slot between the start symbol and the end symbol of the general decoding model and associates it with the custom decoding model having the same classification tag, generating a new decoding model. For example, according to the template "@NAME@来找我吃饭", the server inserts the slot corresponding to "@NAME@" into the general decoding model and, according to the classification tag "NAME", links that slot to the decoding model corresponding to the contact information, thus generating a new decoding model.
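Step S215 can be sketched as follows. The list-of-labels "model" and the function name are illustrative assumptions, standing in for inserting slot arcs between the start and end symbols of a real WFST:

```python
def insert_slots(templates, custom_models):
    """Sketch of S215: splice each mined template between the start
    symbol <s> and end symbol </s>, and link every slot label
    (e.g. '@NAME@') to the custom decoding model whose classification
    tag matches (here 'NAME'). Representation is illustrative only."""
    paths, links = [], {}
    for template in templates:
        paths.append(["<s>"] + list(template) + ["</s>"])
        for label in template:
            if label.startswith("@") and label.endswith("@"):
                tag = label.strip("@")  # classification tag, e.g. 'NAME'
                if tag in custom_models:
                    links[label] = custom_models[tag]
    return {"paths": paths, "slot_links": links}
```

The returned `slot_links` mapping is what lets the decoder later jump from a slot into the matching custom sub-model and back.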
S216: The speech recognition server decodes the speech signal using the new decoding model to obtain a speech recognition result.
In a specific implementation, S216 may include: decoding the speech signal according to the new decoding model; when the decoding token encounters a slot, jumping to the custom decoding model associated with the slot; performing decoding in the custom decoding model associated with the slot; and, after decoding in the custom decoding model associated with the slot is completed, returning to the slot and continuing to decode in the general decoding model until the speech recognition result is obtained.
Here, after the speech recognition server has constructed the new decoding model, it can input the speech signal into the new decoding model for decoding. First, the speech recognition server performs a phoneme search in the general decoding model until the decoding token encounters a slot inserted in the general decoding model; at that point, it jumps to the custom decoding model associated with the slot and continues the phoneme search there. After the search in the custom decoding model is completed, it returns to the slot and continues searching through the symbols after the slot in the general decoding model, until the character string with the highest probability value is obtained as the speech recognition result.
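The token walk just described can be illustrated with a toy text matcher over the same slot-node representation. This sketch is hypothetical and deliberately simplified: a real decoder searches phoneme lattices with acoustic scores, whereas here a dictionary node simply makes the "token" jump into the associated custom model, match there, and return to the general model.

```python
def decode(model, text):
    """Walk the toy model left to right. A dict node is a slot: the
    token jumps into the associated custom model, matches an entry,
    then returns and continues in the general model."""
    pos, out = 0, []
    for node in model:
        if isinstance(node, dict):                 # slot: jump to sub-model
            for entry in node["model"]:            # search custom model
                if text.startswith(entry, pos):
                    out.append(entry)
                    pos += len(entry)
                    break                          # return to general model
        elif node not in ("<s>", "</s>"):          # ordinary symbol
            if text.startswith(node, pos):
                out.append(node)
                pos += len(node)
    return "".join(out)

new_model = ["<s>", {"slot": "NAME", "model": ["张三", "李四"]}, "来找我吃饭", "</s>"]
result = decode(new_model, "张三来找我吃饭")
```

Here the token matches "张三" inside the NAME slot's custom model, returns to the general model, and finishes on the remaining symbols after the slot.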
FIG. 2-2 is a schematic flowchart of an implementation of the speech recognition method in an embodiment. Referring to FIG. 2-2, the following steps are further included before S213:
S211: The terminal collects a voice signal input by the user.
Here, the terminal may be installed with an application having a voice input function, such as an instant messaging application, a voice input method application, or a voice assistant. The user may use these applications to input voice signals. For example, when the user needs to input voice while using an instant messaging application, the user opens the voice recognition interface 30 shown in FIG. 3-1 and presses and holds the voice input control 301 in the interface. At this point, the instant messaging application invokes a voice collection device, for example by turning on the microphone, so that the user can start speaking into the microphone; that is, the terminal collects the voice signal input by the user.
S212: The terminal sends the collected voice signal to the speech recognition server.
Here, the terminal sends the collected voice signal to the speech recognition server. In practical applications, the terminal may send it to the speech recognition server over a wireless local area network, a cellular data network, or the like.
With continued reference to FIG. 2-2, the following steps are further included after S216:
S217: The speech recognition server sends the speech recognition result to the terminal.
S218: The terminal outputs the speech recognition result.
Here, after obtaining the speech recognition result, the speech recognition server sends the result, that is, the character string, to the terminal, so that the terminal displays it on the voice recognition interface. For example, the user inputs the sentence "Zhang San is coming to have dinner with me" by voice; this sentence is decoded using the new decoding model generated by inserting the custom decoding model corresponding to the contact information into the general decoding model, and the character string "Zhang San is coming to have dinner with me" is obtained. The speech recognition server sends this character string to the terminal. As shown in FIG. 3-2, the terminal may display the character string 302 in the voice recognition interface 30, or may convert the character string into a voice signal and output it to the user for voice interaction with the user. Of course, other output manners are also possible, which are not specifically limited in the embodiments of the present application.
At this point, the speech recognition process is completed.
The above speech recognition method is described below by taking the case where the speech recognition device is a terminal as an example.
FIG. 2-3 is a schematic flowchart of an implementation of the speech recognition method in an embodiment of the present application. Referring to FIG. 2-3, the method may include:
S221: The terminal collects a voice signal input by the user.
Here, the terminal may collect the voice signal input by the user through a voice collection device. Specifically, the terminal may be installed with an application having a voice input function, such as an instant messaging application, a voice input method application, or a voice assistant. The user may use these applications to input voice signals. For example, when the user needs to input voice while using an instant messaging application, the user opens the voice recognition interface 30 shown in FIG. 3-1 and presses and holds the voice input control 301 in the interface. At this point, the instant messaging application invokes the voice collection device, for example by turning on the microphone, so that the user can start speaking into the microphone; that is, the terminal collects the voice signal input by the user.
Here, the voice collection device of the terminal may send the collected voice signal to the processor, that is, the decoder, through a communication bus.
S223: In the process of continuously acquiring the voice signal, the terminal acquires a custom corpus corresponding to the current account.
Here, the terminal may acquire the custom corpus corresponding to the current account through the processor in the process of continuously acquiring the voice signal. Specifically, since the voice collection device continuously sends voice signals to the processor, the processor continuously receives these voice signals; in the process of continuously receiving them, the processor can obtain the custom corpus corresponding to the current account.
In practical applications, the custom corpus may include one of the following: contact information corresponding to the current account, such as a phone address book or instant messaging contact information; or domain-specific text of at least one field uploaded under the current account, such as legal provisions, communication standards, or industry standards. Of course, the custom corpus may also be other text, which is not specifically limited in the embodiments of the present application.
In other embodiments of the present application, the custom corpus may be read by the processor from a user account information server or from local storage after the processor receives the voice signal collected by the voice collection device, or it may be stored locally by the user in advance. Of course, the custom corpus may also be obtained in other manners, which are not specifically limited in the embodiments of the present application.
It should be noted that the custom corpus may or may not be divided into categories, which is not specifically limited in the embodiments of the present application.
S224: The terminal analyzes and processes the custom corpus to construct at least one corresponding custom decoding model.
In a specific implementation, in order to make the speech recognition more accurate, S224 may include: classifying the custom corpus to obtain a custom language model for each category; and constructing, based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category, at least one custom decoding model corresponding to each category.
Here, the terminal may analyze and process the custom corpus through the processor to construct the corresponding at least one custom decoding model. Specifically, after obtaining the custom corpus, the processor classifies it to obtain a custom language model for each category. For example, if the processor obtains both the contact information corresponding to the current account and legal provisions, the processor first needs to classify the contact information and the legal provisions to obtain a language model corresponding to the contact information and a language model corresponding to the legal provisions. Then, the processor constructs at least one custom decoding model corresponding to each category according to the pre-stored acoustic model, the dictionary model, and the custom language model of each category; that is, the processor constructs a decoding model corresponding to the contact information and a decoding model corresponding to the legal provisions.
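The classification step of S224 can be sketched with a toy per-category unigram estimator. This is a hedged illustration under simplifying assumptions: the input format, function name, and whitespace tokenization are invented for the example, and a real system would further combine each category's language model with the acoustic and dictionary models.

```python
from collections import Counter

def build_custom_language_models(corpus):
    """corpus: list of (category, text) pairs. Returns, per category, a
    unigram probability table over whitespace-separated tokens."""
    grouped = {}
    for category, text in corpus:
        grouped.setdefault(category, []).extend(text.split())
    models = {}
    for category, tokens in grouped.items():
        counts = Counter(tokens)
        total = sum(counts.values())
        models[category] = {w: c / total for w, c in counts.items()}
    return models

# One contact-information category and one legal-provisions category,
# mirroring the example in the description.
corpus = [("CONTACT", "张三 李四"), ("LEGAL", "第一条 第一条 第二条")]
models = build_custom_language_models(corpus)
```

Each category yields its own probability table, so a separate custom decoding model can be built per category afterwards.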
S225: The terminal loads the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model.
In a specific implementation, since different users have different language habits and accents, in order to achieve more accurate speech recognition, S225 may further include: acquiring a context template with a slot, where the context template is obtained by performing data mining on historical voice data of the current account; and, according to the classification tag of the slot, adding the slot between the start symbol and the end symbol of the general decoding model, and associating the slot with a custom decoding model that has the classification tag among the at least one custom decoding model, to generate a new decoding model.
Here, the terminal may load the at least one custom decoding model into the pre-stored general decoding model through the processor to generate a new decoding model. Specifically, before the user uses the speech recognition service, the processor may acquire historical voice data of the current account and perform data mining on the data to obtain at least one context template with a slot. For example, to recognize person names in speech, data mining may yield name-related context templates such as "@NAME@ is coming to have dinner with me" and "@NAME@ and I are good friends". It should be noted that in the above context templates, "@NAME@" is the slot and "NAME" is the classification tag of the slot. Then, according to these context templates, the processor adds the slot between the start symbol and the end symbol of the general decoding model, and associates the slot with the custom decoding model that has the same classification tag among the at least one custom decoding model, thereby generating a new decoding model. For example, according to the context template "@NAME@ is coming to have dinner with me", the processor inserts the slot corresponding to "@NAME@" into the general decoding model and, according to the classification tag "NAME", associates the slot corresponding to "@NAME@" with the decoding model corresponding to the contact information, thus generating a new decoding model.
S226: The terminal decodes the voice signal using the new decoding model to obtain a speech recognition result.
In a specific implementation, S226 may include: decoding the speech signal according to the new decoding model; when the decoding token encounters a slot, jumping to the custom decoding model associated with the slot; performing decoding in the custom decoding model associated with the slot; and, after decoding in the custom decoding model associated with the slot is completed, returning to the slot and continuing to decode in the general decoding model until the speech recognition result is obtained.
Here, the terminal may decode the voice signal using the new decoding model through the processor to obtain the speech recognition result. Specifically, after constructing the new decoding model, the processor can input the voice signal into the new decoding model for decoding. First, the processor performs a phoneme search in the general decoding model until the decoding token encounters a slot inserted in the general decoding model; at that point, it jumps to the custom decoding model associated with the slot and continues the phoneme search there. After the search in the custom decoding model is completed, it returns to the slot and continues searching through the symbols after the slot in the general decoding model, until the character string with the highest probability value is obtained as the speech recognition result.
S227: The terminal outputs the speech recognition result.
Here, the terminal may output the speech recognition result through the processor. Specifically, the processor may display the character string on the voice recognition interface shown in FIG. 3-2, or may convert the character string into a voice signal and output it to the user for voice interaction with the user. Of course, other output manners are also possible, which are not specifically limited in the embodiments of the present application.
It can be seen that, in the embodiments of the present application, in the process of continuously acquiring the voice signal, the speech recognition device acquires the custom corpus corresponding to the current account, such as the contact information of the current account or domain-specific text uploaded under the current account; it then analyzes and processes the custom corpus to construct the corresponding at least one custom decoding model; next, it loads the constructed at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model; finally, it decodes the voice signal using the new decoding model to obtain the speech recognition result. In this way, the new decoding model can significantly raise the otherwise low probability values that the user's custom corpus would receive in the general decoding model, thereby reducing the probability of data skew when recognizing speech containing the custom corpus and improving the overall accuracy of speech recognition.
Based on the foregoing embodiments, in practical applications, a weighted finite-state transducer (WFST) network may be used to implement the decoding model.
In the embodiments of the present application, FIG. 4-1 is a block diagram of an implementation flow of the speech recognition method in an embodiment of the present application. Referring to FIG. 4-1, the figure shows a general speech recognition service. In an offline environment, a static WFST network 414 is constructed by integrating an acoustic model 411, a dictionary 412, a language model 413, and the like. In an online environment, the WFST network is first loaded. After the service receives a speech signal, it first converts the signal into speech features, and then obtains the output text combination with the maximum posterior probability by computing the acoustic model scores and the weight scores in the WFST network.
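The maximum-posterior selection just described can be reduced to a one-line comparison. The candidate strings and all score values below are made-up numbers chosen purely to illustrate summing an acoustic score with a WFST weight score (both as log-probabilities) and keeping the best total; they are not drawn from any actual system.

```python
def best_hypothesis(candidates):
    """candidates: list of (text, acoustic_score, wfst_weight), with
    both scores given as log-probabilities (higher is better). Returns
    the text whose combined score is maximal."""
    return max(candidates, key=lambda c: c[1] + c[2])[0]

candidates = [
    ("张三来找我吃饭", -12.0, -3.0),   # acoustically slightly worse, strong LM weight
    ("章三来找我吃饭", -11.5, -6.0),   # acoustically better, weak LM weight
]
best = best_hypothesis(candidates)
```

The first candidate wins (-15.0 versus -17.5), showing how the WFST weight can overrule a small acoustic advantage of a homophone.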
In order to improve the accuracy of speech recognition, in other embodiments of the present application, FIG. 4-2 is a block diagram of an implementation flow of the speech recognition method in an embodiment of the present application. Referring to FIG. 4-2, on the basis of the foregoing embodiment, the online speech recognition service is maintained, and the custom corpus 421 corresponding to the current account, such as contact information and domain-specific text of at least one field, is analyzed and processed. First, an out-of-vocabulary (OOV) dictionary 422 is extracted. Considering that users may prefer rare or unconventional words, such as "Martian language" Internet slang, which are very likely not in the general vocabulary, a user-customized word list is first constructed: a new vocabulary is obtained by combining the OOV dictionary with the general vocabulary. Then, the new vocabulary is combined with the user's personal data to construct a custom WFST network 423.
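The OOV step in FIG. 4-2 amounts to a set difference followed by a union. The sketch below is a minimal, hypothetical rendering of that idea; the function names and the tiny example vocabularies are invented for illustration.

```python
def extract_oov(custom_corpus_tokens, general_vocab):
    """Tokens from the user's custom corpus that are absent from the
    general vocabulary form the OOV dictionary."""
    return {w for w in custom_corpus_tokens if w not in general_vocab}

def merge_vocab(general_vocab, oov_dictionary):
    """Combine the OOV dictionary with the general vocabulary to obtain
    the user-customized word list."""
    return set(general_vocab) | set(oov_dictionary)

general = {"吃饭", "朋友"}
oov = extract_oov(["火星文", "吃饭"], general)
new_vocab = merge_vocab(general, oov)
```

The rare word "火星文" ends up in the OOV dictionary and therefore in the merged vocabulary, so the custom WFST network built from it can emit that word.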
Accordingly, the custom decoding model described in the foregoing embodiments may be a custom WFST network, and the general decoding model may be a general WFST network.
In the embodiments of the present application, the step in the foregoing embodiments of loading the at least one custom decoding model into the pre-stored general decoding model to generate the new decoding model may include: merging the custom WFST network with the general WFST network to obtain a new WFST network. Correspondingly, the step in the foregoing embodiments of decoding the voice signal using the new decoding model to obtain the speech recognition result may include: performing search decoding on the voice signal using the new WFST network to obtain the speech recognition result.
For example, FIG. 4-3 is a partial schematic diagram of the new WFST network in an embodiment of the present application. Referring to FIG. 4-3, a slot 432 is inserted into the general WFST network 431, and the slot 432 is associated with the custom WFST network 433 corresponding to the contact information, forming the new WFST network. Then, when decoding a voice signal, whenever the decoding token reaches the position of the slot while searching in the general WFST network, it directly enters the associated custom WFST network and continues searching there; when the search in the custom WFST network ends, the decoding token returns to the general WFST network and continues searching. In this way, a user-specific decoding space is constructed for each user.
It should be understood that the steps in the embodiments of the present application are not necessarily performed in the order indicated by the step numbers. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be performed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or multiple stages; these sub-steps or stages are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential, as they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps. Based on the same inventive concept, an embodiment of the present application provides a speech recognition device, which may be the speech recognition device described in one or more of the above embodiments. For the internal structure of the speech recognition device, reference may be made to the structure shown in FIG. 6. Each of the units described below may be implemented in whole or in part by software, hardware, or a combination thereof.
FIG. 5 is a schematic diagram of the unit structure of the speech recognition device in an embodiment of the present application. Referring to FIG. 5, the speech recognition device 500 may include: a voice signal acquiring unit 501, configured to continuously acquire a voice signal; a corpus obtaining unit 502, configured to acquire, in the process of continuously acquiring the voice signal, a custom corpus corresponding to the current account; a model building unit 503, configured to analyze and process the custom corpus to construct at least one corresponding custom decoding model; a loading unit 504, configured to load the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and a decoding unit 505, configured to decode the voice signal using the new decoding model to obtain a speech recognition result.
In other embodiments of the present application, the custom corpus corresponding to the current account includes at least one of the following: contact information of the current account and domain-specific text of at least one field.
In other embodiments of the present application, the custom decoding model may be a custom WFST network, and the general decoding model may be a general WFST network. Correspondingly, the loading unit is further configured to merge the custom WFST network with the general WFST network to obtain a new WFST network, and the decoding unit is further configured to perform search decoding on the voice signal using the new WFST network to obtain the speech recognition result.
In other embodiments of the present application, the model building unit is further configured to classify the custom corpus to obtain a custom language model for each category, and to construct, based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category, at least one custom decoding model corresponding to each category.
In other embodiments of the present application, the loading unit is further configured to perform data mining on historical voice data of the current account to obtain a context template with a slot, and, according to the classification tag of the slot, to add the slot between the start symbol and the end symbol of the general decoding model and associate the slot with a custom decoding model that has the classification tag among the at least one custom decoding model, to generate a new decoding model.
In other embodiments of the present application, the decoding unit is specifically configured to decode the voice signal according to the new decoding model; when the decoding token encounters a slot, to jump to the custom decoding model associated with the slot; to perform decoding in the custom decoding model associated with the slot; and, after decoding in the custom decoding model associated with the slot is completed, to return to the slot and continue decoding in the general decoding model until the speech recognition result is obtained.
It should be noted here that the description of the above device embodiments is similar to the description of the above method embodiments, and the device embodiments have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the device embodiments of the present application, reference may be made to the description of the method embodiments of the present application.
FIG. 6 is a schematic diagram of the internal structure of the speech recognition device in an embodiment of the present application. Referring to FIG. 6, the speech recognition device 600 includes a processor, a memory, and a communication interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device may store an operating system and computer-readable instructions which, when executed, cause the processor to perform a speech recognition method. The processor of the computer device is configured to provide computing and control capabilities to support the operation of the entire computer device. The internal memory of the computer device may store computer-readable instructions which, when executed by the processor, cause the processor to perform a speech recognition method. The computer device may be a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or the like. Those skilled in the art will understand that the structure shown in FIG. 6 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the terminal to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
When executing the computer-readable instructions, the processor implements the following steps: acquiring, in the process of continuously acquiring a voice signal through the communication interface, a custom corpus corresponding to the current account; analyzing and processing the custom corpus to construct at least one corresponding custom decoding model; loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and decoding the voice signal using the new decoding model to obtain a speech recognition result.
In other embodiments of the present application, the custom corpus corresponding to the current account includes at least one of the following: contact information of the current account and domain-specific text of at least one field.
In other embodiments of the present application, the custom decoding model may be a custom WFST network, and the general decoding model may be a general WFST network. Correspondingly, when executing the computer-readable instructions, the processor further implements the following steps: merging the custom WFST network with the general WFST network to obtain a new WFST network; and performing search decoding on the voice signal using the new WFST network to obtain the speech recognition result.
In other embodiments of the present application, when executing the computer-readable instructions, the processor further implements the following steps: classifying the custom corpus to obtain a custom language model for each category; and constructing, based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category, at least one custom decoding model corresponding to each category.
In other embodiments of the present application, when executing the computer-readable instructions, the processor further implements the following steps: performing data mining on historical voice data of the current account to obtain a context template with a slot; and, according to the classification tag of the slot, adding the slot between the start symbol and the end symbol of the general decoding model and associating the slot with a custom decoding model that has the classification tag among the at least one custom decoding model, to generate a new decoding model.
In other embodiments of the present application, when executing the computer-readable instructions, the processor further implements the following steps: decoding the voice signal according to the new decoding model; when the decoding token encounters a slot, jumping to the custom decoding model associated with the slot; performing decoding in the custom decoding model associated with the slot; and, after decoding in the custom decoding model associated with the slot is completed, returning to the slot and continuing to decode in the general decoding model until the speech recognition result is obtained.
In practical applications, the processor may be at least one of an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a central processing unit (CPU), a controller, a microcontroller, or a microprocessor. The memory may be a removable storage device, a read-only memory (ROM), a magnetic disk, an optical disk, or the like. It should be understood that the electronic devices implementing the above processor and memory functions may take other forms, which are not specifically limited in the embodiments of the present application.
Further, if the speech recognition device is a cloud speech recognition device, i.e., a speech recognition server, the communication interface may be the interface between a terminal and the speech recognition server. If the speech recognition device is a local speech recognition device, i.e., a terminal, it further includes a voice collection apparatus. The voice collection apparatus may be a microphone, a microphone array, a transmitter, or the like, which is not specifically limited in the embodiments of the present application. In that case, the communication interface on the terminal may be the interface between the processor and the voice collection apparatus, such as the interface between the processor and a microphone or transmitter. Of course, the communication interface may take other forms, which are not specifically limited in the embodiments of the present application.
Based on the same inventive concept, an embodiment of the present application provides a computer readable storage medium storing computer readable instructions that, when executed by a processor, implement the following steps: in the process of continuously acquiring a speech signal, acquiring a custom corpus corresponding to the current account; analyzing the custom corpus to construct at least one corresponding custom decoding model; loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and decoding the speech signal using the new decoding model to obtain a speech recognition result.
In other embodiments of the present application, the custom corpus corresponding to the current account includes at least one of the following: contact information of the current account and domain-specific text of at least one domain.
In other embodiments of the present application, the custom decoding model may be a custom WFST network and the general decoding model may be a general WFST network. Accordingly, when the computer readable instructions are executed by the processor, the following steps are further implemented: merging the custom WFST network with the general WFST network to obtain a new WFST network; and performing search decoding on the speech signal using the new WFST network to obtain a speech recognition result.
In other embodiments of the present application, when the computer readable instructions are executed by the processor, the following steps are further implemented: classifying the custom corpus to obtain a custom language model for each category; and constructing, based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category, at least one custom decoding model corresponding to each category.
In other embodiments of the present application, when the computer readable instructions are executed by the processor, the following steps are further implemented: performing data mining on historical voice data of the current account to obtain a context template with a slot; and, according to the classification mark of the slot, adding the slot between the start symbol and the end symbol of the general decoding model and associating the slot with the custom decoding model carrying that classification mark among the at least one custom decoding model, thereby generating a new decoding model.
In other embodiments of the present application, when the computer readable instructions are executed by the processor, the following steps are further implemented: decoding the speech signal according to the new decoding model; when the decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding within that model; and, after decoding in the custom decoding model associated with the slot is completed, returning to the slot and continuing decoding in the general decoding model until a speech recognition result is obtained.
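The embodiments above repeatedly walk through the same four steps: acquire the account's custom corpus, build custom models from it, load them into the general model, and decode. The stub below strings those steps together end to end. Every function here is a hypothetical stand-in with toy behaviour (the "models" are plain phrase sets and "decoding" is a lookup), intended only to show how the steps compose, not how the patented system works.

```python
# Hypothetical end-to-end sketch of the four steps described above.
# All helpers are toy stand-ins, not the patented implementation.

def fetch_custom_corpus(account):
    # Step 1: custom corpus for the current account (e.g. contact names).
    return {"contacts": ["zhang san", "li si"]}

def build_custom_models(corpus):
    # Step 2: one custom "model" per category; here just a phrase set.
    return {cat: set(phrases) for cat, phrases in corpus.items()}

def load_into(general_vocab, custom_models):
    # Step 3: merge the custom models into the general model.
    merged = set(general_vocab)
    for phrases in custom_models.values():
        merged |= phrases
    return merged

def decode(utterance, model):
    # Step 4: "recognize" by lookup against the merged model.
    return utterance if utterance in model else None

def recognize(utterance, account, general_vocab):
    corpus = fetch_custom_corpus(account)
    custom_models = build_custom_models(corpus)
    new_model = load_into(general_vocab, custom_models)
    return decode(utterance, new_model)
```

The point of the composition is visible even in this toy: a phrase known only to the account's custom corpus is recognized after the merge, while the general vocabulary alone would reject it.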
In the embodiments of the present application, the above computer readable instructions are stored in a storage medium and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present application. The foregoing storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk. Thus, the embodiments of the present application are not limited to any particular combination of hardware and software.
It should be noted here that the above description of the computing device and computer readable storage medium embodiments is similar to the description of the method embodiments above and has similar beneficial effects. For technical details not disclosed in the computing device or storage medium embodiments of the present application, refer to the description of the method embodiments of the present application.
It should be understood that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic relating to the embodiment is included in at least one embodiment of the present application. Thus, appearances of "in one embodiment" or "in an embodiment" in various places throughout this specification do not necessarily all refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should also be understood that, in the various embodiments of the present application, the sequence numbers of the above processes do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application. The above serial numbers of the embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
It should be noted that, as used herein, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
In the several embodiments provided in the present application, it should be understood that the disclosed devices and methods may be implemented in other manners. The device embodiments described above are merely illustrative. For example, the division of the units is merely a division of logical functions; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections between the components shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.
A person of ordinary skill in the art may understand that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a non-volatile computer readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope described in this specification.
The above embodiments merely express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (18)

  1. A speech recognition method, comprising:
    acquiring, by a speech recognition device, a custom corpus corresponding to a current account in a process of continuously acquiring a speech signal;
    analyzing, by the speech recognition device, the custom corpus to construct at least one corresponding custom decoding model;
    loading, by the speech recognition device, the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and
    decoding, by the speech recognition device, the speech signal using the new decoding model to obtain a speech recognition result.
  2. The method according to claim 1, wherein the custom corpus corresponding to the current account comprises at least one of: contact information of the current account and domain-specific text of at least one domain.
  3. The method according to claim 1, wherein the custom decoding model is a custom Weighted Finite-State Transducer (WFST) network and the general decoding model is a general WFST network;
    the loading, by the speech recognition device, the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model comprises: merging, by the speech recognition device, the custom WFST network with the general WFST network to obtain a new WFST network; and
    the decoding, by the speech recognition device, the speech signal using the new decoding model to obtain a speech recognition result comprises: performing, by the speech recognition device, search decoding on the speech signal using the new WFST network to obtain a speech recognition result.
  4. The method according to claim 1, wherein the analyzing, by the speech recognition device, the custom corpus to construct a corresponding custom decoding model comprises:
    classifying, by the speech recognition device, the custom corpus to obtain a custom language model for each category; and
    constructing, by the speech recognition device, the at least one custom decoding model corresponding to each category based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category.
  5. The method according to claim 4, wherein the loading, by the speech recognition device, the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model comprises:
    acquiring, by the speech recognition device, a context template with a slot, wherein the context template is obtained by performing data mining on historical voice data of the current account; and
    adding, by the speech recognition device, the slot between a start symbol and an end symbol of the general decoding model according to a classification mark of the slot, and associating the slot with the custom decoding model carrying the classification mark among the at least one custom decoding model, to generate the new decoding model.
  6. The method according to claim 5, wherein the decoding, by the speech recognition device, the speech signal using the new decoding model to obtain a speech recognition result comprises:
    decoding, by the speech recognition device, the speech signal according to the new decoding model; when a decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding within the custom decoding model associated with the slot; and
    returning, by the speech recognition device, to the slot after decoding in the custom decoding model associated with the slot is completed, and continuing decoding in the general decoding model until the speech recognition result is obtained.
  7. One or more non-volatile storage media storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring a custom corpus corresponding to a current account in a process of continuously acquiring a speech signal;
    analyzing the custom corpus to construct at least one corresponding custom decoding model;
    loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and
    decoding the speech signal using the new decoding model to obtain a speech recognition result.
  8. The storage medium according to claim 7, wherein the custom corpus corresponding to the current account comprises at least one of: contact information of the current account and domain-specific text of at least one domain.
  9. The storage medium according to claim 7, wherein the custom decoding model is a custom Weighted Finite-State Transducer (WFST) network and the general decoding model is a general WFST network;
    the loading the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model comprises: merging the custom WFST network with the general WFST network to obtain a new WFST network; and
    the decoding the speech signal using the new decoding model to obtain a speech recognition result comprises: performing search decoding on the speech signal using the new WFST network to obtain a speech recognition result.
  10. The storage medium according to claim 7, wherein the analyzing the custom corpus to construct a corresponding custom decoding model comprises:
    classifying the custom corpus to obtain a custom language model for each category; and
    constructing the at least one custom decoding model corresponding to each category based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category.
  11. The storage medium according to claim 10, wherein the loading the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model comprises:
    acquiring a context template with a slot, wherein the context template is obtained by performing data mining on historical voice data of the current account; and
    adding the slot between a start symbol and an end symbol of the general decoding model according to a classification mark of the slot, and associating the slot with the custom decoding model carrying the classification mark among the at least one custom decoding model, to generate the new decoding model.
  12. The storage medium according to claim 11, wherein the decoding the speech signal using the new decoding model to obtain a speech recognition result comprises:
    decoding the speech signal according to the new decoding model; when a decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding within the custom decoding model associated with the slot; and
    returning to the slot after decoding in the custom decoding model associated with the slot is completed, and continuing decoding in the general decoding model until the speech recognition result is obtained.
  13. A speech recognition device, comprising a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, cause the processor to perform the following steps:
    acquiring a custom corpus corresponding to a current account in a process of continuously acquiring a speech signal;
    analyzing the custom corpus to construct at least one corresponding custom decoding model;
    loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and
    decoding the speech signal using the new decoding model to obtain a speech recognition result.
  14. The speech recognition device according to claim 13, wherein the custom corpus corresponding to the current account comprises at least one of: contact information of the current account and domain-specific text of at least one domain.
  15. The speech recognition device according to claim 13, wherein the custom decoding model is a custom Weighted Finite-State Transducer (WFST) network and the general decoding model is a general WFST network;
    the loading the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model comprises: merging the custom WFST network with the general WFST network to obtain a new WFST network; and
    the decoding the speech signal using the new decoding model to obtain a speech recognition result comprises: performing search decoding on the speech signal using the new WFST network to obtain a speech recognition result.
  16. The speech recognition device according to claim 13, wherein the analyzing the custom corpus to construct a corresponding custom decoding model comprises:
    classifying the custom corpus to obtain a custom language model for each category; and
    constructing the at least one custom decoding model corresponding to each category based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category.
  17. The speech recognition device according to claim 16, wherein the loading the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model comprises:
    acquiring a context template with a slot, wherein the context template is obtained by performing data mining on historical voice data of the current account; and
    adding the slot between a start symbol and an end symbol of the general decoding model according to a classification mark of the slot, and associating the slot with the custom decoding model carrying the classification mark among the at least one custom decoding model, to generate the new decoding model.
  18. The speech recognition device according to claim 17, wherein the decoding the speech signal using the new decoding model to obtain a speech recognition result comprises:
    decoding the speech signal according to the new decoding model; when a decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding within the custom decoding model associated with the slot; and
    returning to the slot after decoding in the custom decoding model associated with the slot is completed, and continuing decoding in the general decoding model until the speech recognition result is obtained.
PCT/CN2018/085819 2017-06-07 2018-05-07 Speech recognition method, storage medium, and speech recognition device WO2018223796A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710425219.XA CN108288467B (en) 2017-06-07 2017-06-07 Voice recognition method and device and voice recognition engine
CN201710425219.X 2017-06-07

Publications (1)

Publication Number Publication Date
WO2018223796A1 true WO2018223796A1 (en) 2018-12-13

Family

ID=62831581

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/085819 WO2018223796A1 (en) 2017-06-07 2018-05-07 Speech recognition method, storage medium, and speech recognition device

Country Status (2)

Country Link
CN (1) CN108288467B (en)
WO (1) WO2018223796A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349569A (en) * 2019-07-02 2019-10-18 苏州思必驰信息科技有限公司 The training and recognition methods of customized product language model and device

Families Citing this family (14)

Publication number Priority date Publication date Assignee Title
CN108288467B (en) * 2017-06-07 2020-07-14 腾讯科技(深圳)有限公司 Voice recognition method and device and voice recognition engine
CN108922531B (en) * 2018-07-26 2020-10-27 腾讯科技(北京)有限公司 Slot position identification method and device, electronic equipment and storage medium
CN109246214B (en) * 2018-09-10 2022-03-04 北京奇艺世纪科技有限公司 Prompt tone obtaining method and device, terminal and server
CN109272995A (en) * 2018-09-26 2019-01-25 出门问问信息科技有限公司 Audio recognition method, device and electronic equipment
CN109087645B (en) * 2018-10-24 2021-04-30 科大讯飞股份有限公司 Decoding network generation method, device, equipment and readable storage medium
CN109524017A (en) * 2018-11-27 2019-03-26 北京分音塔科技有限公司 A kind of the speech recognition Enhancement Method and device of user's custom words
CN110164421B (en) * 2018-12-14 2022-03-11 腾讯科技(深圳)有限公司 Voice decoding method, device and storage medium
CN110046276B (en) * 2019-04-19 2021-04-20 北京搜狗科技发展有限公司 Method and device for searching keywords in voice
CN110223695B (en) * 2019-06-27 2021-08-27 维沃移动通信有限公司 Task creation method and mobile terminal
CN110517692A (en) * 2019-08-30 2019-11-29 苏州思必驰信息科技有限公司 Hot word audio recognition method and device
CN110570857B (en) * 2019-09-06 2020-09-15 北京声智科技有限公司 Voice wake-up method and device, electronic equipment and storage medium
CN111667821A (en) * 2020-05-27 2020-09-15 山西东易园智能家居科技有限公司 Voice recognition system and recognition method
CN112530416B (en) * 2020-11-30 2024-06-14 北京汇钧科技有限公司 Speech recognition method, apparatus, device and computer readable medium
CN114242046B (en) * 2021-12-01 2022-08-16 广州小鹏汽车科技有限公司 Voice interaction method and device, server and storage medium

Citations (13)

Publication number Priority date Publication date Assignee Title
EP1981020A1 (en) * 2007-04-12 2008-10-15 France Télécom Method and system for automatic speech recognition adapted for detecting utterances out of context
CN102270450A (en) * 2010-06-07 2011-12-07 Shufei Electronics Co., Ltd. System and method of multi-model adaptation and voice recognition
CN102270451A (en) * 2011-08-18 2011-12-07 Anhui USTC iFlytek Co., Ltd. Method and system for speaker identification
US20110313767A1 (en) * 2010-06-18 2011-12-22 At&T Intellectual Property I, L.P. System and method for data intensive local inference
CN103377651A (en) * 2012-04-28 2013-10-30 Beijing Samsung Telecommunication Technology Research Co., Ltd. Device and method for automatic speech synthesis
CN104123933A (en) * 2014-08-01 2014-10-29 Institute of Automation, Chinese Academy of Sciences Self-adaptive non-parallel training based voice conversion method
US20150081293A1 (en) * 2013-09-19 2015-03-19 Maluuba Inc. Speech recognition using phoneme matching
CN104508739A (en) * 2012-06-21 2015-04-08 Google Inc. Dynamic language model
US9190055B1 (en) * 2013-03-14 2015-11-17 Amazon Technologies, Inc. Named entity recognition with personalized models
CN105448292A (en) * 2014-08-19 2016-03-30 Beijing Yushanzhi Information Technology Co., Ltd. Scene-based real-time voice recognition system and method
CN105575386A (en) * 2015-12-18 2016-05-11 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for voice recognition
CN105719649A (en) * 2016-01-19 2016-06-29 Baidu Online Network Technology (Beijing) Co., Ltd. Voice recognition method and device
CN108288467A (en) * 2017-06-07 2018-07-17 Tencent Technology (Shenzhen) Co., Ltd. Speech recognition method, device and speech recognition engine

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004334193A (en) * 2003-05-01 2004-11-25 Microsoft Corp System with composite statistical and rule-based grammar model for speech recognition and natural language understanding
JPWO2007088853A1 (en) * 2006-01-31 2009-06-25 Panasonic Corporation Speech coding apparatus, speech decoding apparatus, speech coding system, speech coding method, and speech decoding method
US7716039B1 (en) * 2006-04-14 2010-05-11 At&T Intellectual Property Ii, L.P. Learning edit machines for robust multimodal understanding
CN103971675B (en) * 2013-01-29 2016-03-02 Tencent Technology (Shenzhen) Co., Ltd. Automatic speech recognition method and system
CN103971686B (en) * 2013-01-30 2015-06-10 Tencent Technology (Shenzhen) Co., Ltd. Method and system for automatically recognizing speech
CN103325370B (en) * 2013-07-01 2015-11-25 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition method and speech recognition system
CN106294460B (en) * 2015-05-29 2019-10-22 Institute of Acoustics, Chinese Academy of Sciences Chinese speech keyword retrieval method based on a word and character hybrid language model
CN105118501B (en) * 2015-09-07 2019-05-21 Xu Yang Speech recognition method and system
CN105976812B (en) * 2016-04-28 2019-04-26 Tencent Technology (Shenzhen) Co., Ltd. Speech recognition method and device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349569A (en) * 2019-07-02 2019-10-18 AISpeech Co., Ltd. Training and recognition method and device for customized product language models

Also Published As

Publication number Publication date
CN108288467B (en) 2020-07-14
CN108288467A (en) 2018-07-17

Similar Documents

Publication Publication Date Title
WO2018223796A1 (en) Speech recognition method, storage medium, and speech recognition device
US20240069860A1 (en) Search and knowledge base question answering for a voice user interface
CN107590135B (en) Automatic translation method, device and system
US10176804B2 (en) Analyzing textual data
CN112002308B (en) Voice recognition method and device
CN110431626B (en) Use of pairwise comparisons for hyperarticulation detection in repeated speech queries to improve speech recognition
US11823678B2 (en) Proactive command framework
US11037553B2 (en) Learning-type interactive device
US20170084274A1 (en) Dialog management apparatus and method
US9324323B1 (en) Speech recognition using topic-specific language models
KR102041621B1 (en) System for providing an artificial-intelligence-based conversational corpus analysis service, and building method therefor
US20190005951A1 (en) Method of processing dialogue based on dialog act information
US10963497B1 (en) Multi-stage query processing
US20190172444A1 (en) Spoken dialog device, spoken dialog method, and recording medium
JP2014145842A (en) Utterance analysis device, voice interaction control device, method, and program
JP2016536652A (en) Real-time speech evaluation system and method for mobile devices
JP6370962B1 (en) Generating device, generating method, and generating program
US20120221335A1 (en) Method and apparatus for creating voice tag
WO2013056343A1 (en) System, method and computer program for correcting speech recognition information
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
JP7034027B2 (en) Recognition device, recognition method and recognition program
CN111126084A (en) Data processing method and device, electronic equipment and storage medium
CN112885335B (en) Speech recognition method and related device
CN113593523A (en) Speech detection method and device based on artificial intelligence and electronic equipment
KR102464156B1 (en) Call center service providing apparatus, method, and program for matching a user and an agent based on the user's status and the agent's status

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18813419

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18813419

Country of ref document: EP

Kind code of ref document: A1