
WO2018223796A1 - Speech recognition method, storage medium, and speech recognition device - Google Patents

Speech recognition method, storage medium, and speech recognition device

Info

Publication number
WO2018223796A1
WO2018223796A1 (PCT application PCT/CN2018/085819; priority application CN2018085819W)
Authority
WO
WIPO (PCT)
Prior art keywords
custom
decoding
model
decoding model
slot
Prior art date
Application number
PCT/CN2018/085819
Other languages
French (fr)
Chinese (zh)
Inventor
饶丰
卢鲤
马建雄
赵贺楠
孙彬
王尔玉
周领良
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2018223796A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/28 Constructional details of speech recognition systems
    • G10L15/34 Adaptation of a single recogniser for parallel processing, e.g. by use of multiple processors or cloud computing
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice

Definitions

  • the present invention relates to the field of automatic speech recognition (ASR) technology, and in particular, to a speech recognition method, a storage medium, and a speech recognition device.
  • ASR technology is a technique for converting vocabulary content in human speech into computer readable input characters.
  • Speech recognition has a complex processing flow, which mainly includes four processes: acoustic model training, language model training, decoding resource network construction and decoding.
  • the existing speech recognition scheme is mainly obtained by calculating the maximum posterior probability of the speech signal based on the text, and is generally divided into two decoding modes: dynamic decoding and static decoding.
  • The speech recognition solution based on static decoding is mainly implemented on a finite state transducer (FST) network; in practice a weighted finite state transducer (WFST), an FST whose transitions carry weights, is used.
  • Most components, including the pronunciation dictionary, acoustic model, and grammar information, are integrated into a finite state transition graph, which is then searched with decoding tokens to obtain the optimal speech recognition result.
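As a rough illustration of this token-based search, the sketch below runs a lowest-cost path search over a tiny hand-made weighted transition graph. The graph, the costs, and the function names are assumptions for illustration only, not taken from the patent.

```python
import heapq

# Toy stand-in for a composed decoding graph:
# state -> list of (next_state, output_word, weight)
GRAPH = {
    0: [(1, "zhang", 0.2), (1, "zhan", 0.9)],
    1: [(2, "san", 0.1)],
    2: [],  # final state
}

def best_path(graph, start=0, final=2):
    """Dijkstra-style decoding-token search for the cheapest
    output word sequence from the start to the final state."""
    heap = [(0.0, start, [])]
    while heap:
        cost, state, words = heapq.heappop(heap)
        if state == final:
            return words, cost
        for nxt, word, weight in graph[state]:
            heapq.heappush(heap, (cost + weight, nxt, words + [word]))
    return None, float("inf")

words, cost = best_path(GRAPH)  # lowest-cost hypothesis
```

In a real system the graph is the composition of the pronunciation dictionary, acoustic model, and grammar, and the weights come from acoustic and language model scores.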
  • In embodiments of the present application, a voice recognition method, a storage medium, and a voice recognition device are provided.
  • a speech recognition method comprising:
  • the voice recognition device acquires a custom corpus corresponding to the current account while continuously acquiring the voice signal;
  • the voice recognition device analyzes and processes the custom corpus to construct at least one corresponding custom decoding model;
  • the voice recognition device loads the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model;
  • the voice recognition device decodes the voice signal using the new decoding model to obtain a voice recognition result.
  • One or more non-transitory computer readable storage media store computer executable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps: acquiring a custom corpus corresponding to the current account while continuously acquiring the voice signal;
  • the voice signal is decoded using the new decoding model to obtain a voice recognition result.
  • A speech recognition apparatus comprises a memory and a processor, the memory storing computer readable instructions which, when executed by the processor, cause the processor to perform the following steps:
  • the voice signal is decoded using the new decoding model to obtain a voice recognition result.
  • FIG. 1 is a diagram of the application environment of a voice recognition method in an embodiment;
  • FIG. 2-1 is a schematic flowchart of an implementation process of a voice recognition method in an embodiment;
  • FIG. 2-3 is a schematic flowchart of an implementation of a voice recognition method in another embodiment;
  • FIG. 3-1 is a schematic diagram of a voice recognition interface in an embodiment;
  • FIG. 3-2 is a schematic diagram of a voice recognition interface in another embodiment;
  • FIG. 4-1 is a block diagram of an implementation process of a voice recognition method in an embodiment;
  • FIG. 4-2 is a block diagram of an implementation of a voice recognition method in another embodiment;
  • FIG. 4-3 is a partial schematic diagram of a new WFST network in an embodiment;
  • FIG. 5 is a schematic structural diagram of a unit of a voice recognition device in an embodiment.
  • FIG. 6 is a schematic diagram showing the internal structure of a voice recognition device in an embodiment.
  • the embodiment of the present application provides a voice recognition method, which is applied to a voice recognition device.
  • the speech recognition device can function as a speech recognition engine.
  • The voice recognition device may be a cloud voice recognition device, that is, a voice recognition server or a component disposed in the voice recognition server; it may also be a local voice recognition device, that is, a terminal or a component disposed in the terminal.
  • FIG. 1 is an application environment diagram of a voice recognition method in an embodiment of the present application.
  • voice recognition server 110 can communicate with terminal 120 over network 200.
  • the voice recognition device is the voice recognition server 110
  • the voice recognition method is performed by the voice recognition server 110
  • the voice recognition method is performed by the terminal 120.
  • The voice recognition device may be configured to: acquire a custom corpus corresponding to the current account while continuously acquiring the voice signal; analyze and process the custom corpus to construct at least one custom decoding model; load the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and decode the voice signal using the new decoding model to obtain a voice recognition result.
  • The voice recognition method is described below taking the voice recognition server as the voice recognition device as an example.
  • the foregoing method may include:
  • S213 The voice recognition server acquires a custom corpus corresponding to the current account while continuously acquiring the voice signal.
  • The voice signal continuously acquired by the voice recognition server is sent by the terminal. Because the terminal continuously sends voice signals to the voice recognition server, the server continuously receives them, and during this process it can acquire the custom corpus corresponding to the current account.
  • Text is usually used to represent language instances; that is, the text is used as the corpus.
  • The custom corpus may include one of the following: contact information corresponding to the current account, such as a telephone address book or instant messaging contact information; or proprietary text of at least one field uploaded by the current account, such as legal provisions, communication standards, or industry standards.
  • the custom corpus can also be other texts, which is not specifically limited in the embodiment of the present application.
  • The custom corpus may be read by the voice recognition server from the user account information server or the terminal after receiving the voice signal uploaded by the terminal, or it may be uploaded to the voice recognition server by the user through an application on the terminal.
  • The user presses the voice input control 301 in the voice recognition interface 30 shown in FIG. 3-1 and then speaks into the microphone, and the real-time voice recognition result is streamed back.
  • A voice activity detection (VAD) module detects whether the user is speaking.
  • For example, the voice recognition server reads the contact information of the current account from the user account information server or the terminal.
  • The terminal loads the proprietary text of at least one field required by the user, such as legal provisions, and uploads it to the voice recognition server, which thereby obtains the legal provisions.
  • custom corpus may be classified into categories or may not be classified, and is not specifically limited in the embodiment of the present application.
  • S214 The speech recognition server analyzes and processes the custom corpus and constructs at least one custom decoding model.
  • S214 may include: classifying the custom corpus to obtain a custom language model for each category, and constructing at least one custom decoding model corresponding to each category based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category.
  • The speech recognition server classifies the custom corpus to obtain a custom language model for each category. For example, if the voice recognition server obtains both the contact information and the legal provisions corresponding to the current account, it first classifies them, obtaining a language model corresponding to the contact information and a language model corresponding to the legal provisions. It then constructs at least one custom decoding model for each category from the pre-stored acoustic model, the dictionary model, and each category's custom language model; that is, it constructs a decoding model corresponding to the contact information and a decoding model corresponding to the legal provisions.
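The classify-then-build step above can be sketched as follows. A unigram relative-frequency model stands in for a real language model, and all names and the sample entries are illustrative assumptions rather than the patent's implementation.

```python
from collections import defaultdict

def classify_corpus(entries):
    """Group (category, text) pairs into per-category text lists."""
    by_category = defaultdict(list)
    for category, text in entries:
        by_category[category].append(text)
    return dict(by_category)

def build_language_model(texts):
    """Unigram relative-frequency model over one category's texts."""
    counts = defaultdict(int)
    total = 0
    for text in texts:
        for word in text.split():
            counts[word] += 1
            total += 1
    return {word: count / total for word, count in counts.items()}

entries = [
    ("CONTACT", "Zhang San"),
    ("CONTACT", "Li Si"),
    ("LAW", "Article 1 of the contract law"),
]
corpora = classify_corpus(entries)
models = {cat: build_language_model(texts) for cat, texts in corpora.items()}
```

A real system would then compose each per-category language model with the acoustic and dictionary models to obtain the per-category custom decoding models.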
  • S215 The speech recognition server loads the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model.
  • The general decoding model is a decoding model built on everyday language; it is universal and recognizes everyday language well.
  • S215 may further include: acquiring a context template with a slot, where the slot is an information variable in the context template and the context template is obtained by data mining the historical voice data of the current account; and, according to the classification tag of the slot, adding the slot between the start symbol and the end symbol of the general decoding model and associating it with the custom decoding model carrying the same classification tag among the at least one custom decoding model, to generate a new decoding model.
  • The voice recognition server may acquire historical voice data of the current account before the user uses the voice recognition service, perform data mining on the data, and obtain at least one context template with a slot. For example, to recognize person names in speech, data mining yields name-related context templates such as "@NAME@, come find me for dinner" and "@NAME@ and I are good friends". Note that "@NAME@" is the slot in these context templates, and "NAME" is the classification tag of the slot.
  • The speech recognition server adds the slot between the start symbol and the end symbol of the general decoding model according to the context templates, and associates the slot with the custom decoding model having the same classification tag among the at least one custom decoding model, generating a new decoding model.
  • For example, according to the context template "@NAME@, come find me for dinner", the voice recognition server inserts the slot corresponding to "@NAME@" into the general decoding model and, according to the classification tag "NAME", associates that slot with the decoding model corresponding to the contact information, thus generating a new decoding model.
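A minimal sketch of this slot-insertion step, assuming a dict-based stand-in for the decoding models and the "@TAG@" slot syntax from the example above; the representation and names are illustrative assumptions.

```python
import re

SLOT_RE = re.compile(r"@([A-Z]+)@")

def extract_slots(template):
    """Return the classification tags of all slots in a context template."""
    return SLOT_RE.findall(template)

def insert_slots(general_model, templates, custom_models):
    """Add one slot per tag found in the templates, associating it with
    the custom decoding model carrying the same classification tag."""
    new_model = {"base": general_model, "slots": {}}
    for template in templates:
        for tag in extract_slots(template):
            if tag in custom_models:
                new_model["slots"][tag] = custom_models[tag]
    return new_model

templates = ["@NAME@ come find me for dinner", "@NAME@ and I are good friends"]
custom_models = {"NAME": {"entries": ["Zhang San", "Li Si"]}}
new_model = insert_slots({"vocab": "general"}, templates, custom_models)
```

In the WFST realization described later, the "slot" is an arc between the start and end symbols of the general network whose expansion is the custom network.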
  • S216 The speech recognition server decodes the voice signal using the new decoding model to obtain a voice recognition result.
  • S216 may include: decoding and recognizing the voice signal according to the new decoding model; when the decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding there; and, after decoding in the custom decoding model associated with the slot completes, returning from the slot and continuing decoding in the general decoding model until the voice recognition result is obtained.
  • the speech recognition server can input the speech signal to the new decoding model for decoding.
  • The speech recognition server performs a phoneme search in the general decoding model until the decoding token encounters a slot inserted into the general decoding model; at that point it jumps to the custom decoding model associated with the slot to continue the phoneme search. After the search in the custom decoding model completes, it returns from the slot and continues searching after the slot in the general decoding model until the character string with the highest probability value is obtained as the voice recognition result.
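The jump-and-return walk can be sketched over word lists, standing in for the phoneme-level search; the function and its greedy matching are illustrative assumptions, not the patent's algorithm.

```python
def decode_with_slots(words, template_words, custom_entries):
    """Walk a template such as ["@NAME@", "come", "find", "me"]:
    ordinary words are matched in the 'general model', while a
    "@TAG@" slot jumps into the associated 'custom model'
    (custom_entries) and returns once an entry is matched."""
    out, i = [], 0
    for tw in template_words:
        if tw.startswith("@") and tw.endswith("@"):
            # Jump into the custom model: greedily match an entry.
            for entry in custom_entries:
                parts = entry.split()
                if words[i:i + len(parts)] == parts:
                    out.append(entry)   # matched inside the custom model
                    i += len(parts)
                    break               # return to the general model
        else:
            if i < len(words) and words[i] == tw:
                out.append(tw)
                i += 1
    return " ".join(out)

result = decode_with_slots(
    ["Zhang", "San", "come", "find", "me"],
    ["@NAME@", "come", "find", "me"],
    ["Zhang San", "Li Si"],
)
```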
  • FIG. 2-2 is a schematic diagram of an implementation process of a voice recognition method in an embodiment. Referring to FIG. 2-2, the following steps are also included before S213:
  • S211 The terminal collects a voice signal input by the user.
  • the terminal can install an application having a voice input function, such as an instant messaging application, a voice input method application, or a voice assistant.
  • the user can use these applications to input voice signals.
  • For example, the instant messaging application invokes the voice collection device, such as by turning on the microphone; the user then speaks into the microphone, and the terminal thus collects the voice signal input by the user.
  • S212 The terminal sends the collected voice signal to the voice recognition server.
  • the terminal transmits the collected voice signal to the voice recognition server.
  • The terminal can send the voice signal to the voice recognition server via a wireless local area network, a cellular data network, or the like.
  • S217 The voice recognition server sends the voice recognition result to the terminal.
  • S218 The terminal outputs a speech recognition result.
  • The speech recognition server transmits the voice recognition result, that is, the character string, to the terminal, which displays it on the voice recognition interface.
  • For example, the user speaks the sentence "Zhang San, come find me for dinner". The sentence is decoded by the new decoding model, generated by inserting the custom decoding model corresponding to the contact information into the general decoding model, to obtain a character string. The voice recognition server sends the string to the terminal; as shown in FIG. 3-2, the terminal can display the string 302 in the voice recognition interface 30, or convert the string into a voice signal, output it to the user, and perform voice interaction with the user.
  • In other embodiments, other output manners may also be used; this is not specifically limited in the embodiments of the present application.
  • The voice recognition method is described below taking the terminal as the voice recognition device as an example.
  • the foregoing method may include:
  • S221 The terminal collects a voice signal input by the user.
  • the terminal can collect the voice signal input by the user through the voice collection device.
  • the terminal can install an application having a voice input function, such as an instant messaging application, a voice input method application, or a voice assistant.
  • the user can use these applications to input voice signals.
  • For example, the instant messaging application invokes the voice collection device, such as by turning on the microphone; the user can then start speaking into the microphone, and the terminal thus collects the voice signal input by the user.
  • The terminal can transmit the voice signal collected by the voice collection device to the processor, that is, the decoder, through the communication bus.
  • S223 The terminal acquires a customized corpus corresponding to the current account in the process of continuously acquiring the voice signal.
  • the terminal may acquire a custom corpus corresponding to the current account by the processor during the process of continuously acquiring the voice signal.
  • the processor continuously receives the voice signals, and the processor can obtain the customized corpus corresponding to the current account in the process of continuously receiving the voice signals.
  • The custom corpus may include one of the following: contact information corresponding to the current account, such as a phone address book or instant messaging contact information; or proprietary text of at least one field uploaded by the current account, such as legal texts, communication standards, or industry standards.
  • the custom corpus can also be other texts, which is not specifically limited in the embodiment of the present application.
  • the customized corpus may be read from the user account information server or locally after receiving the voice signal collected by the voice collection device, or may be stored locally by the user in advance.
  • custom corpus may be classified into categories or may not be classified, and is not specifically limited in the embodiment of the present application.
  • S224 The terminal analyzes and processes the customized corpus, and constructs at least one custom decoding model.
  • S224 may include: classifying the custom corpus to obtain a custom language model for each category, and constructing at least one custom decoding model corresponding to each category based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category.
  • the terminal may analyze and process the custom corpus through the processor, and construct corresponding at least one custom decoding model.
  • The processor classifies the custom corpus to obtain a custom language model for each category. For example, if the processor obtains both the contact information and the legal provisions corresponding to the current account, it first classifies them, obtaining a language model corresponding to the contact information and a language model corresponding to the legal provisions. It then constructs at least one custom decoding model for each category from the pre-stored acoustic model, the dictionary model, and each category's custom language model; that is, it constructs a decoding model corresponding to the contact information and a decoding model corresponding to the legal provisions.
  • S225 The terminal loads at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model.
  • S225 may further include: acquiring a context template with a slot, where the context template is obtained by data mining the historical voice data of the current account; and, according to the classification tag of the slot, adding the slot between the start symbol and the end symbol of the general decoding model and associating it with the custom decoding model carrying the same classification tag among the at least one custom decoding model, to generate a new decoding model.
  • the terminal may load at least one custom decoding model into a pre-stored general decoding model by the processor to generate a new decoding model.
  • The processor may acquire historical voice data of the current account, perform data mining on the data, and obtain at least one context template with a slot. For example, to recognize person names in speech, data mining yields name-related context templates such as "@NAME@, come find me for dinner" and "@NAME@ and I are good friends". Note that "@NAME@" is the slot in these context templates, and "NAME" is the classification tag of the slot.
  • The processor adds the slot between the start symbol and the end symbol of the general decoding model and associates the slot with the custom decoding model having the same classification tag among the at least one custom decoding model, generating a new decoding model. For example, according to the context template "@NAME@, come find me for dinner", the processor inserts the slot corresponding to "@NAME@" into the general decoding model and, according to the classification tag "NAME", associates that slot with the decoding model corresponding to the contact information, thus generating a new decoding model.
  • S226 The terminal decodes the voice signal by using a new decoding model to obtain a voice recognition result.
  • S226 may include: decoding and recognizing the voice signal according to the new decoding model; when the decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding there; and, after decoding in the custom decoding model associated with the slot completes, returning from the slot and continuing decoding in the general decoding model until the voice recognition result is obtained.
  • the terminal can decode the voice signal by using a new decoding model to obtain a voice recognition result.
  • the voice signal can be input to the new decoding model for decoding.
  • The processor performs a phoneme search in the general decoding model until the decoding token encounters a slot inserted into the general decoding model; at that point it jumps to the custom decoding model associated with the slot to continue the phoneme search. After the search in the custom decoding model completes, it returns from the slot and continues searching after the slot in the general decoding model until the character string with the highest probability value is obtained as the voice recognition result.
  • S227 The terminal outputs a speech recognition result.
  • the terminal can output a speech recognition result through the processor.
  • the processor may display the character string on the voice recognition interface as shown in FIG. 3-2, or convert the character string into a voice signal, and output it to the user for voice interaction with the user.
  • In other embodiments, other output manners may also be used; this is not specifically limited in the embodiments of the present application.
  • In summary, the voice recognition device acquires a custom corpus corresponding to the current account, such as the contact information of the current account and the proprietary text of a specific field uploaded by the current account, while continuously acquiring the voice signal.
  • The custom corpus is analyzed and processed to construct at least one custom decoding model; the constructed at least one custom decoding model is then loaded into a pre-stored general decoding model to generate a new decoding model; finally, the voice signal is decoded using the new decoding model to obtain the voice recognition result.
  • In this way, the probability value of the user's custom corpus in the general decoding model can be significantly increased, reducing recognition offset for speech containing the custom corpus and improving the overall accuracy rate of speech recognition.
  • the WFST network can be used in practical applications to implement the decoding model.
  • FIG. 4-1 is a block diagram of an implementation process of a voice recognition method in an embodiment of the present application.
  • The decoding model is constructed in an offline environment.
  • the static WFST network 414 is constructed by integrating the acoustic model 411, the dictionary 412, the language model 413, and the like.
  • the WFST network is first loaded.
  • When the service receives a speech signal, it first converts it into speech features; then, by computing the acoustic model score and the weight score in the WFST network, it obtains the output text combination with the greatest posterior probability.
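A minimal sketch of picking the maximum-posterior output: sum the acoustic log score and the WFST weight log score per hypothesis and take the argmax. The hypotheses and score values below are invented for illustration.

```python
import math

def best_hypothesis(hypotheses):
    """hypotheses: list of (text, acoustic_logprob, wfst_log_weight).
    Returns the text whose combined log score is greatest."""
    return max(hypotheses, key=lambda h: h[1] + h[2])[0]

hyps = [
    ("Zhang San come find me to eat", math.log(0.4), math.log(0.5)),
    ("Zhang Shan come find me to eat", math.log(0.4), math.log(0.1)),
]
best = best_hypothesis(hyps)
```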
  • FIG. 4-2 is a block diagram of an implementation process of the speech recognition method in the embodiment of the present application.
  • While the voice recognition online service is running, the custom corpus 421 corresponding to the current account, such as contact information and proprietary text of at least one field, is analyzed and processed.
  • Users may favor some uncommon vocabulary, such as "Martian" internet slang; these words are probably not in the general vocabulary. Therefore, a user-customized out-of-vocabulary (OOV) dictionary is first built, and a new vocabulary is obtained by combining the OOV dictionary with the general vocabulary. The new vocabulary is then used together with the user's personal data to build a custom WFST network 423.
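The vocabulary merge can be sketched as a set union between the general vocabulary and the OOV words mined from the user's custom corpus; the function names and sample data are illustrative assumptions.

```python
def find_oov(custom_texts, general_vocab):
    """Words in the user's custom corpus that the general vocabulary lacks."""
    words = {w for text in custom_texts for w in text.split()}
    return words - general_vocab

def merge_vocab(general_vocab, oov_dict):
    """New vocabulary = general vocabulary united with the OOV dictionary."""
    return general_vocab | oov_dict

general_vocab = {"come", "find", "me", "eat"}
custom_texts = ["Zhang San", "Li Si"]
oov = find_oov(custom_texts, general_vocab)
new_vocab = merge_vocab(general_vocab, oov)
```

The new vocabulary would then feed the construction of the custom WFST network, along with the user's personal data.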
  • the custom decoding model described in the foregoing embodiment may be a custom WFST network; the general decoding model may be a general-purpose WFST network.
  • Correspondingly, the step in the foregoing embodiments of loading the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model may include: merging the custom WFST network with the general WFST network to obtain a new WFST network.
  • The step in the foregoing embodiments of decoding the voice signal using the new decoding model to obtain the voice recognition result may include: searching and decoding the voice signal using the new WFST network to obtain the voice recognition result.
  • FIG. 4-3 is a partial schematic diagram of a new WFST network in the embodiment of the present application.
  • A slot 432 is inserted in the general WFST network 431, and the slot 432 is associated with a custom WFST network 433 corresponding to the contact information to form a new WFST network.
  • When the decoding token searches to the location of the slot in the general WFST network, it directly enters the custom WFST network and continues searching there; when the search in the custom WFST network ends, the token returns to the general WFST network and continues searching. In this way, a dedicated decoding space can be built for each user.
  • The steps in the embodiments of the present application are not necessarily performed in the order indicated. Except as explicitly stated herein, the execution order of these steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some steps may include multiple sub-steps or stages that are not necessarily performed at the same time but may be executed at different times, and the execution order of these sub-steps or stages is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least some sub-steps or stages of other steps. Based on the same inventive concept, an embodiment of the present application provides a voice recognition device, which may be the voice recognition device described in one or more of the foregoing embodiments.
  • For the internal structure of the speech recognition apparatus, reference may be made to the structure shown in FIG. 6.
  • Each of the units described below may be implemented in whole or in part by software, hardware, or a combination thereof.
  • The voice recognition device 500 may include: a voice signal acquiring unit 501 for continuously acquiring a voice signal; a corpus acquiring unit 502 for acquiring a custom corpus corresponding to the current account while the voice signal is continuously acquired; a model construction unit 503 configured to analyze and process the custom corpus and construct at least one corresponding custom decoding model; and a loading unit 504 configured to load the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model.
  • the decoding unit 505 is configured to decode the speech signal by using a new decoding model to obtain a speech recognition result.
  • the custom corpus corresponding to the current account includes at least one of the following: contact information of the current account and proprietary text of at least one domain.
  • The foregoing custom decoding model may be a custom WFST network, and the general decoding model may be a general WFST network; correspondingly, the loading unit is further configured to merge the custom WFST network with the general WFST network to obtain the new WFST network, and the decoding unit is further configured to search and decode the voice signal using the new WFST network to obtain the speech recognition result.
  • the model construction unit is further configured to classify the custom corpus to obtain a customized language model of each category; based on the pre-stored acoustic model, the dictionary model, and the custom language model of each category, At least one custom decoding model corresponding to each category is constructed.
  • The loading unit is further configured to perform data mining on the historical voice data of the current account to obtain a context template with a slot, add the slot between the start symbol and the end symbol of the general decoding model according to the classification tag of the slot, and associate it with the custom decoding model carrying the same classification tag among the at least one custom decoding model, to generate a new decoding model.
  • the decoding unit is specifically configured to decode and recognize the voice signal according to the new decoding model; when the decoding token encounters a slot, it jumps to the custom decoding model associated with that slot and continues decoding there; after decoding in the slot's custom decoding model is completed, it returns to the slot and continues decoding in the general decoding model until the speech recognition result is obtained.
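The slot-jump behaviour can be illustrated with a toy matcher. The template and vocabulary representation below is an assumption for illustration, far simpler than a real token-passing WFST search:

```python
def decode_with_slots(template, slot_models, words):
    """Toy walk of a recognized word sequence against one template path.
    A label like '@NAME@' is a slot: the walk defers to the custom model
    (here just a vocabulary set) for that category, then resumes in the
    general template. Illustrative only; not a real WFST token search."""
    out, i = [], 0
    for label in template:
        if i >= len(words):
            return None
        if label.startswith("@") and label.endswith("@"):
            vocab = slot_models[label.strip("@")]   # jump into the slot's custom model
            if words[i] not in vocab:
                return None                         # no path through the slot
        elif words[i] != label:                     # decode in the general model
            return None
        out.append(words[i])
        i += 1                                      # return and continue after the slot
    return " ".join(out) if i == len(words) else None
```

A name that exists in the per-account contact model is accepted inside the slot; a name absent from it yields no path, mirroring how the custom sub-network constrains what the slot may recognize.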
  • FIG. 6 is a schematic diagram of an internal structure of a voice recognition device according to an embodiment of the present application.
  • the voice recognition device 600 includes a processor, a memory, and a communication interface that are connected through a system bus.
  • the memory comprises a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium of the computer device can store an operating system and computer readable instructions that, when executed, cause the processor to perform a speech recognition method.
  • the processor of the computer device is used to provide computing and control capabilities to support the operation of the entire computer device.
  • Computer readable instructions may be stored in the internal memory of the computer device, and when the computer readable instructions are executed by the processor, the processor may be caused to perform a speech recognition method.
  • the computer device can be a cell phone, a tablet, a personal digital assistant, a wearable device, or the like. It will be understood by those skilled in the art that the structure shown in FIG. 6 is only a block diagram of the part of the structure related to the solution of the present application, and does not constitute a limitation on the terminal to which the solution of the present application is applied; the specific computer device may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
  • the processor executes the following steps: in the process of continuously acquiring the voice signal through the communication interface, acquiring a custom corpus corresponding to the current account; analyzing the custom corpus to construct at least one corresponding custom decoding model; loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and decoding the speech signal with the new decoding model to obtain a speech recognition result.
  • the custom corpus corresponding to the current account includes at least one of the following: contact information of the current account and proprietary text of at least one domain.
  • the custom decoding model may be a custom WFST network and the general decoding model may be a general-purpose WFST network; accordingly, when executing the computer readable instructions, the processor further merges the custom WFST network with the general WFST network to obtain a new WFST network, and uses the new WFST network to search and decode the speech signal to obtain the speech recognition result.
  • when the processor executes the computer readable instructions, the following steps are further performed: classifying the custom corpus to obtain a custom language model for each category; and constructing, based on the pre-stored acoustic model, the dictionary model, and the custom language model of each category, at least one custom decoding model corresponding to each category.
  • when executing the computer readable instructions, the processor further performs the following steps: performing data mining on historical voice data of the current account to obtain a context template with a slot; adding the slot between the start symbol and the end symbol of the general decoding model according to the slot's classification tag; and associating the slot with the custom decoding model having that classification tag among the at least one custom decoding model, to generate a new decoding model.
  • when the processor executes the computer readable instructions, the following steps are further implemented: decoding and recognizing the voice signal according to the new decoding model; when the decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding there; and after decoding in that custom decoding model is completed, returning to the slot and continuing to decode in the general decoding model until the speech recognition result is obtained.
  • the processor may be at least one of an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU), a controller, a microcontroller, and a microprocessor.
  • the memory may be a removable storage device, a read only memory (ROM), a magnetic disk, or an optical disk. It is to be understood that the electronic devices implementing the functions of the processor and the memory described above may be of other types, which are not specifically limited in the embodiments of the present application.
  • the communication interface may be an interface between the terminal and the voice recognition server.
  • when the voice recognition device is a local voice recognition device, that is, a terminal, it further includes a voice collection device. The voice collection device may be a microphone, a microphone array, or the like, which is not specifically limited in this embodiment.
  • the communication interface on the terminal can be an interface between the processor and the voice collection device, such as an interface between the processor and a microphone or a microphone array.
  • the foregoing communication interface may have other implementation forms, which are not specifically limited in this embodiment.
  • an embodiment of the present application provides a computer readable storage medium storing computer readable instructions that, when executed by a processor, implement the following steps: in the process of continuously acquiring a voice signal, acquiring a custom corpus corresponding to the current account; analyzing the custom corpus to construct at least one corresponding custom decoding model; loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and decoding the speech signal using the new decoding model to obtain a speech recognition result.
  • the custom corpus corresponding to the current account includes at least one of the following: contact information of the current account and proprietary text of at least one domain.
  • the custom decoding model may be a custom WFST network and the universal decoding model may be a general-purpose WFST network; accordingly, when the computer readable instructions are executed by the processor, the following steps are also implemented: merging the custom WFST network with the universal WFST network to obtain a new WFST network, and using the new WFST network to search and decode the voice signal to obtain the speech recognition result.
  • when executed by the processor, the computer readable instructions further cause the processor to: classify the custom corpus to obtain a custom language model for each category; and construct, based on the pre-stored acoustic model, the dictionary model, and the custom language model of each category, at least one custom decoding model corresponding to each category.
  • when the computer readable instructions are executed by the processor, the following steps are further performed: performing data mining on historical voice data of the current account to obtain a context template with a slot; adding the slot between the start symbol and the end symbol of the general decoding model according to the slot's classification tag; and associating the slot with the custom decoding model having that classification tag among the at least one custom decoding model, to generate a new decoding model.
  • when the computer readable instructions are executed by the processor, the following steps are further implemented: decoding and recognizing the speech signal according to the new decoding model; when the decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding there; and after decoding in that custom decoding model is completed, returning to the slot and continuing to decode in the general decoding model until the speech recognition result is obtained.
  • the computer readable instructions are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods of the embodiments of the present application.
  • the foregoing storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read only memory (ROM), a magnetic disk, or an optical disk.
  • the disclosed apparatus and method may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division into units is only a division by logical function; in actual implementation there may be other division manners: for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately, or two or more units may be integrated into one unit; the integrated unit can be implemented in the form of hardware, or in the form of hardware plus software functional units.
  • Non-volatile memory can include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory can include random access memory (RAM) or external cache memory.
  • RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Telephonic Communication Services (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

A speech recognition method comprises: in a process of continuously acquiring speech signals, acquiring custom language data corresponding to a current account (S213); performing analytical processing on the custom language data to construct at least one corresponding custom decoding model (S214); loading the at least one custom decoding model to a pre-stored universal decoding model to generate a new decoding model (S215); and using the new decoding model to decode the speech signals to obtain a speech recognition result (S216).

Description

Speech recognition method, storage medium and speech recognition device
This application claims priority to Chinese Patent Application No. 201710425219X, entitled "A Speech Recognition Method, Apparatus, and Speech Recognition Engine", filed with the Chinese Patent Office on June 7, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of automatic speech recognition (ASR) technology, and in particular to a speech recognition method, a storage medium, and a speech recognition device.
Background
ASR technology converts the vocabulary content of human speech into computer readable input characters. Speech recognition involves a complex processing flow, mainly comprising four processes: acoustic model training, language model training, decoding resource network construction, and decoding.
At present, existing speech recognition schemes mainly work by computing the maximum posterior probability of text given the speech signal, and generally fall into two decoding modes: dynamic decoding and static decoding. Speech recognition solutions based on static decoding are mainly implemented on a Finite State Transducer (FST) network. For example, a Weighted Finite State Transducer (WFST) network integrates most of the components of the speech recognition process, including the pronunciation dictionary, the acoustic model, and grammar information, into a finite state transition graph; a decoding token then searches this graph to obtain the optimal speech recognition result.
However, since the integrated finite state transition graph is fixed, it cannot be modified once generated. Moreover, the speech content of different users varies widely: in algorithmic terms, each user's language model is different, and acoustic models also differ with accent, so the finite state transition graph corresponding to each user is different. To match all users, a finite state transition graph would have to be generated for each user, which is usually infeasible with limited storage resources; typically, only a finite state transition graph for common speech is stored. As a result, every user performs the speech search on the same graph, which often causes data bias and low speech recognition accuracy.
Summary
According to various embodiments provided in the present application, a speech recognition method, a storage medium, and a speech recognition device are provided.
A speech recognition method includes:
a speech recognition device acquiring, in the process of continuously acquiring a speech signal, a custom corpus corresponding to the current account;
the speech recognition device analyzing the custom corpus to construct at least one corresponding custom decoding model;
the speech recognition device loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and
the speech recognition device decoding the speech signal using the new decoding model to obtain a speech recognition result.
One or more non-volatile computer readable storage media storing computer executable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
in the process of continuously acquiring a speech signal, acquiring a custom corpus corresponding to the current account;
analyzing the custom corpus to construct at least one corresponding custom decoding model;
loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and
decoding the speech signal using the new decoding model to obtain a speech recognition result.
A speech recognition device includes a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, cause the processor to perform the following steps:
in the process of continuously acquiring a speech signal, acquiring a custom corpus corresponding to the current account;
analyzing the custom corpus to construct at least one corresponding custom decoding model;
loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and
decoding the speech signal using the new decoding model to obtain a speech recognition result.
Details of one or more embodiments of the present application are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the present application will become apparent from the description, the drawings, and the claims.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application, and those of ordinary skill in the art may obtain other drawings from them without creative effort.
FIG. 1 is an application environment diagram of a speech recognition method in an embodiment;
FIG. 2-1 is a schematic flowchart of the implementation of a speech recognition method in an embodiment;
FIG. 2-2 is a schematic flowchart of the implementation of a speech recognition method in another embodiment;
FIG. 2-3 is a schematic flowchart of the implementation of a speech recognition method in another embodiment;
FIG. 3-1 is a schematic diagram of a speech recognition interface in an embodiment;
FIG. 3-2 is a schematic diagram of a speech recognition interface in another embodiment;
FIG. 4-1 is a block diagram of the implementation of a speech recognition method in an embodiment;
FIG. 4-2 is a block diagram of the implementation of a speech recognition method in another embodiment;
FIG. 4-3 is a partial schematic diagram of a new WFST network in an embodiment;
FIG. 5 is a schematic structural diagram of the units of a speech recognition device in an embodiment; and
FIG. 6 is a schematic diagram of the internal structure of a speech recognition device in an embodiment.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the application and are not intended to limit it.
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings in the embodiments.
The embodiments of the present application provide a speech recognition method applied to a speech recognition device, which can serve as a speech recognition engine. The speech recognition device may be a cloud device, that is, a speech recognition server or a component disposed in a speech recognition server; it may also be a local device, that is, a terminal or a component disposed in a terminal.
FIG. 1 is an application environment diagram of the speech recognition method in an embodiment of the present application. Referring to FIG. 1, the speech recognition server 110 can communicate with the terminal 120 over the network 200. When the speech recognition device is the speech recognition server 110, the speech recognition method is performed by the speech recognition server 110; when the speech recognition device is the terminal 120, the method is performed by the terminal 120.
The speech recognition device can thus be used to: acquire, in the process of continuously acquiring a speech signal, a custom corpus corresponding to the current account; analyze the custom corpus to construct at least one corresponding custom decoding model; load the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and decode the speech signal using the new decoding model to obtain a speech recognition result.
The speech recognition method is described below by taking a speech recognition server as the speech recognition device.
FIG. 2-1 is a schematic flowchart of the implementation of the speech recognition method in an embodiment of the present application. Referring to FIG. 2-1, the method may include:
S213: The speech recognition server acquires, in the process of continuously acquiring a speech signal, a custom corpus corresponding to the current account.
Here, the speech signal continuously acquired by the speech recognition server is sent by the terminal. Since the terminal keeps sending speech signals to the speech recognition server, the server keeps receiving them, and in the process of receiving these speech signals, it can obtain the custom corpus corresponding to the current account.
In practical applications, text usually stands in for language instances, that is, text is used as the corpus. The custom corpus may then include one of the following: contact information corresponding to the current account, such as a telephone address book or instant messaging contacts; or proprietary text of at least one domain uploaded under the current account, such as legal provisions, communication standards, or industry standards. Of course, the custom corpus may also be other text, which is not specifically limited in the embodiments of the present application.
In other embodiments of the present application, the custom corpus may be read by the speech recognition server from a user account information server or from the terminal after receiving the speech signal uploaded by the terminal, or it may be uploaded by the user to the speech recognition server through an application on the terminal. Of course, other acquisition methods are possible and are not specifically limited in the embodiments of the present application.
For example, the user presses the voice input control 301 in the voice recognition interface 30 shown in FIG. 3-1 and then speaks into the microphone, and the real-time speech recognition results are streamed back. In this process, a Voice Activity Detection (VAD) module first extracts the effective part of the speech signal, at which point speech recognition for that segment begins; after recognition starts, the speech recognition server reads the contact information of the current account from the user account information server or the terminal. Alternatively, after the user starts using the speech recognition service, the terminal loads the proprietary text of at least one domain required by the user, such as legal provisions, and uploads it to the speech recognition server, whereupon the server obtains the legal provisions.
It should be noted that the custom corpus may or may not be divided into categories, which is not specifically limited in the embodiments of the present application.
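The VAD step mentioned in the example above is only named in the text; a toy energy-threshold detector (an assumed stand-in, not the disclosed module) illustrates the idea of extracting the effective part of the signal:

```python
def vad_segments(samples, frame_len=160, threshold=0.01):
    """Toy energy-based voice activity detection: return (start, end)
    sample index ranges whose mean frame energy exceeds the threshold.
    Illustrative only; frame length and threshold are arbitrary."""
    segments, start = [], None
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy >= threshold and start is None:
            start = i                      # speech onset
        elif energy < threshold and start is not None:
            segments.append((start, i))    # speech offset
            start = None
    if start is not None:
        segments.append((start, len(samples)))
    return segments
```

Only the segments returned here would be forwarded to decoding; silence around them is discarded.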
S214: The speech recognition server analyzes the custom corpus and constructs at least one corresponding custom decoding model.
In a specific implementation, to make speech recognition more accurate, S214 may include: classifying the custom corpus to obtain a custom language model for each category; and constructing, based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category, at least one custom decoding model corresponding to each category.
Here, after obtaining the custom corpus, the speech recognition server classifies it to obtain a custom language model for each category. For example, if the server obtains both the contact information and the legal provisions corresponding to the current account, it first classifies them, producing a language model for the contact information and a language model for the legal provisions; it then constructs, from the pre-stored acoustic model, the dictionary model, and these category-specific language models, at least one custom decoding model per category, that is, a decoding model for the contact information and a decoding model for the legal provisions.
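As an illustration of the per-category step, a unigram frequency model can stand in for the custom language model of each category. This is a deliberate simplification: a real system would train n-gram language models and compose them with the acoustic and dictionary models into WFST decoding networks:

```python
from collections import Counter

def build_language_models(corpus_by_category):
    """Sketch: one unigram relative-frequency model per corpus category
    (e.g. 'CONTACTS', 'LEGAL'). Stand-in for per-category n-gram LMs."""
    models = {}
    for category, sentences in corpus_by_category.items():
        counts = Counter(word for s in sentences for word in s.split())
        total = sum(counts.values())
        models[category] = {w: c / total for w, c in counts.items()}
    return models
```

Keeping one model per category is what later allows each slot type (names, legal terms, ...) to be linked to its own sub-model.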
S215: The speech recognition server loads the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model.
Here, the general decoding model is a decoding model built for everyday language; it is universal and recognizes everyday speech well.
In a specific implementation, since different users differ in language habits and accent, to achieve more accurate speech recognition, S215 may further include: acquiring a context template with a slot, where the slot is an information variable in the context template and the context template is obtained by data mining the historical voice data of the current account; and, according to the slot's classification tag, adding the slot between the start symbol and the end symbol of the general decoding model and associating it with the custom decoding model carrying that classification tag among the at least one custom decoding model, to generate a new decoding model.
Here, before the user uses the speech recognition service, the speech recognition server can acquire the historical voice data of the current account and mine it to obtain at least one context template with a slot. For example, to recognize person names in speech, data mining may yield name-related context templates such as "@NAME@来找我吃饭" ("@NAME@ is coming over to eat with me") and "我和@NAME@是好朋友" ("@NAME@ and I are good friends"). In these templates, "@NAME@" is the slot and "NAME" is its classification tag. The server then, according to these templates, adds the slot between the start symbol and the end symbol of the general decoding model and associates it with the custom decoding model having the same classification tag, generating a new decoding model. For example, according to the template "@NAME@来找我吃饭", the server inserts the slot corresponding to "@NAME@" into the general decoding model and, according to the classification tag "NAME", links that slot to the decoding model corresponding to the contact information, thus generating a new decoding model.
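Step S215 can be sketched as follows. The list-of-labels "model" and the function name are illustrative assumptions, standing in for inserting slot arcs between the start and end symbols of a real WFST:

```python
def insert_slots(templates, custom_models):
    """Sketch of S215: splice each mined template between the start
    symbol <s> and end symbol </s>, and link every slot label
    (e.g. '@NAME@') to the custom decoding model whose classification
    tag matches (here 'NAME'). Representation is illustrative only."""
    paths, links = [], {}
    for template in templates:
        paths.append(["<s>"] + list(template) + ["</s>"])
        for label in template:
            if label.startswith("@") and label.endswith("@"):
                tag = label.strip("@")  # classification tag, e.g. 'NAME'
                if tag in custom_models:
                    links[label] = custom_models[tag]
    return {"paths": paths, "slot_links": links}
```

The returned `slot_links` mapping is what lets the decoder later jump from a slot into the matching custom sub-model and back.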
S216: The speech recognition server decodes the speech signal using the new decoding model to obtain a speech recognition result.
In a specific implementation, S216 may include: decoding the speech signal according to the new decoding model; when the decoding token encounters a slot, jumping to the custom decoding model associated with the slot; performing decoding in the custom decoding model associated with the slot; and, after decoding in the custom decoding model associated with the slot is completed, returning to the slot and continuing to decode in the general decoding model until the speech recognition result is obtained.
Here, after the speech recognition server has constructed the new decoding model, it can input the speech signal into the new decoding model for decoding. First, the speech recognition server performs a phoneme search in the general decoding model until the decoding token encounters a slot inserted in the general decoding model; at that point, it jumps to the custom decoding model associated with the slot and continues the phoneme search there. After the search in the custom decoding model is completed, it returns to the slot and continues searching through the symbols after the slot in the general decoding model, until the character string with the highest probability value is obtained as the speech recognition result.
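The token walk just described can be illustrated with a toy text matcher over the same slot-node representation. This sketch is hypothetical and deliberately simplified: a real decoder searches phoneme lattices with acoustic scores, whereas here a dictionary node simply makes the "token" jump into the associated custom model, match there, and return to the general model.

```python
def decode(model, text):
    """Walk the toy model left to right. A dict node is a slot: the
    token jumps into the associated custom model, matches an entry,
    then returns and continues in the general model."""
    pos, out = 0, []
    for node in model:
        if isinstance(node, dict):                 # slot: jump to sub-model
            for entry in node["model"]:            # search custom model
                if text.startswith(entry, pos):
                    out.append(entry)
                    pos += len(entry)
                    break                          # return to general model
        elif node not in ("<s>", "</s>"):          # ordinary symbol
            if text.startswith(node, pos):
                out.append(node)
                pos += len(node)
    return "".join(out)

new_model = ["<s>", {"slot": "NAME", "model": ["张三", "李四"]}, "来找我吃饭", "</s>"]
result = decode(new_model, "张三来找我吃饭")
```

Here the token matches "张三" inside the NAME slot's custom model, returns to the general model, and finishes on the remaining symbols after the slot.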
FIG. 2-2 is a schematic flowchart of an implementation of the speech recognition method in an embodiment. Referring to FIG. 2-2, the following steps are further included before S213:
S211: The terminal collects a voice signal input by the user.
Here, the terminal may be installed with an application having a voice input function, such as an instant messaging application, a voice input method application, or a voice assistant. The user may use these applications to input voice signals. For example, when the user needs to input voice while using an instant messaging application, the user opens the voice recognition interface 30 shown in FIG. 3-1 and presses and holds the voice input control 301 in the interface. At this point, the instant messaging application invokes a voice collection device, for example by turning on the microphone, so that the user can start speaking into the microphone; that is, the terminal collects the voice signal input by the user.
S212: The terminal sends the collected voice signal to the speech recognition server.
Here, the terminal sends the collected voice signal to the speech recognition server. In practical applications, the terminal may send it to the speech recognition server over a wireless local area network, a cellular data network, or the like.
With continued reference to FIG. 2-2, the following steps are further included after S216:
S217: The speech recognition server sends the speech recognition result to the terminal.
S218: The terminal outputs the speech recognition result.
Here, after obtaining the speech recognition result, the speech recognition server sends the result, that is, the character string, to the terminal, so that the terminal displays it on the voice recognition interface. For example, the user inputs the sentence "Zhang San is coming to have dinner with me" by voice; this sentence is decoded using the new decoding model generated by inserting the custom decoding model corresponding to the contact information into the general decoding model, and the character string "Zhang San is coming to have dinner with me" is obtained. The speech recognition server sends this character string to the terminal. As shown in FIG. 3-2, the terminal may display the character string 302 in the voice recognition interface 30, or may convert the character string into a voice signal and output it to the user for voice interaction with the user. Of course, other output manners are also possible, which are not specifically limited in the embodiments of the present application.
At this point, the speech recognition process is completed.
The above speech recognition method is described below by taking the case where the speech recognition device is a terminal as an example.
FIG. 2-3 is a schematic flowchart of an implementation of the speech recognition method in an embodiment of the present application. Referring to FIG. 2-3, the method may include:
S221: The terminal collects a voice signal input by the user.
Here, the terminal may collect the voice signal input by the user through a voice collection device. Specifically, the terminal may be installed with an application having a voice input function, such as an instant messaging application, a voice input method application, or a voice assistant. The user may use these applications to input voice signals. For example, when the user needs to input voice while using an instant messaging application, the user opens the voice recognition interface 30 shown in FIG. 3-1 and presses and holds the voice input control 301 in the interface. At this point, the instant messaging application invokes the voice collection device, for example by turning on the microphone, so that the user can start speaking into the microphone; that is, the terminal collects the voice signal input by the user.
Here, the voice collection device of the terminal may send the collected voice signal to the processor, that is, the decoder, through a communication bus.
S223: In the process of continuously acquiring the voice signal, the terminal acquires a custom corpus corresponding to the current account.
Here, the terminal may acquire the custom corpus corresponding to the current account through the processor in the process of continuously acquiring the voice signal. Specifically, since the voice collection device continuously sends voice signals to the processor, the processor continuously receives these voice signals; in the process of continuously receiving them, the processor can obtain the custom corpus corresponding to the current account.
In practical applications, the custom corpus may include one of the following: contact information corresponding to the current account, such as a phone address book or instant messaging contact information; or domain-specific text of at least one field uploaded under the current account, such as legal provisions, communication standards, or industry standards. Of course, the custom corpus may also be other text, which is not specifically limited in the embodiments of the present application.
In other embodiments of the present application, the custom corpus may be read by the processor from a user account information server or from local storage after the processor receives the voice signal collected by the voice collection device, or it may be stored locally by the user in advance. Of course, the custom corpus may also be obtained in other manners, which are not specifically limited in the embodiments of the present application.
It should be noted that the custom corpus may or may not be divided into categories, which is not specifically limited in the embodiments of the present application.
S224: The terminal analyzes and processes the custom corpus to construct at least one corresponding custom decoding model.
In a specific implementation, in order to make the speech recognition more accurate, S224 may include: classifying the custom corpus to obtain a custom language model for each category; and constructing, based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category, at least one custom decoding model corresponding to each category.
Here, the terminal may analyze and process the custom corpus through the processor to construct the corresponding at least one custom decoding model. Specifically, after obtaining the custom corpus, the processor classifies it to obtain a custom language model for each category. For example, if the processor obtains both the contact information corresponding to the current account and legal provisions, the processor first needs to classify the contact information and the legal provisions to obtain a language model corresponding to the contact information and a language model corresponding to the legal provisions. Then, the processor constructs at least one custom decoding model corresponding to each category according to the pre-stored acoustic model, the dictionary model, and the custom language model of each category; that is, the processor constructs a decoding model corresponding to the contact information and a decoding model corresponding to the legal provisions.
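The classification step of S224 can be sketched with a toy per-category unigram estimator. This is a hedged illustration under simplifying assumptions: the input format, function name, and whitespace tokenization are invented for the example, and a real system would further combine each category's language model with the acoustic and dictionary models.

```python
from collections import Counter

def build_custom_language_models(corpus):
    """corpus: list of (category, text) pairs. Returns, per category, a
    unigram probability table over whitespace-separated tokens."""
    grouped = {}
    for category, text in corpus:
        grouped.setdefault(category, []).extend(text.split())
    models = {}
    for category, tokens in grouped.items():
        counts = Counter(tokens)
        total = sum(counts.values())
        models[category] = {w: c / total for w, c in counts.items()}
    return models

# One contact-information category and one legal-provisions category,
# mirroring the example in the description.
corpus = [("CONTACT", "张三 李四"), ("LEGAL", "第一条 第一条 第二条")]
models = build_custom_language_models(corpus)
```

Each category yields its own probability table, so a separate custom decoding model can be built per category afterwards.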
S225: The terminal loads the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model.
In a specific implementation, since different users have different language habits and accents, in order to achieve more accurate speech recognition, S225 may further include: acquiring a context template with a slot, where the context template is obtained by performing data mining on historical voice data of the current account; and, according to the classification tag of the slot, adding the slot between the start symbol and the end symbol of the general decoding model, and associating the slot with a custom decoding model that has the classification tag among the at least one custom decoding model, to generate a new decoding model.
Here, the terminal may load the at least one custom decoding model into the pre-stored general decoding model through the processor to generate a new decoding model. Specifically, before the user uses the speech recognition service, the processor may acquire historical voice data of the current account and perform data mining on the data to obtain at least one context template with a slot. For example, to recognize person names in speech, data mining may yield name-related context templates such as "@NAME@ is coming to have dinner with me" and "@NAME@ and I are good friends". It should be noted that in the above context templates, "@NAME@" is the slot and "NAME" is the classification tag of the slot. Then, according to these context templates, the processor adds the slot between the start symbol and the end symbol of the general decoding model, and associates the slot with the custom decoding model that has the same classification tag among the at least one custom decoding model, thereby generating a new decoding model. For example, according to the context template "@NAME@ is coming to have dinner with me", the processor inserts the slot corresponding to "@NAME@" into the general decoding model and, according to the classification tag "NAME", associates the slot corresponding to "@NAME@" with the decoding model corresponding to the contact information, thus generating a new decoding model.
S226: The terminal decodes the voice signal using the new decoding model to obtain a speech recognition result.
In a specific implementation, S226 may include: decoding the speech signal according to the new decoding model; when the decoding token encounters a slot, jumping to the custom decoding model associated with the slot; performing decoding in the custom decoding model associated with the slot; and, after decoding in the custom decoding model associated with the slot is completed, returning to the slot and continuing to decode in the general decoding model until the speech recognition result is obtained.
Here, the terminal may decode the voice signal using the new decoding model through the processor to obtain the speech recognition result. Specifically, after constructing the new decoding model, the processor can input the voice signal into the new decoding model for decoding. First, the processor performs a phoneme search in the general decoding model until the decoding token encounters a slot inserted in the general decoding model; at that point, it jumps to the custom decoding model associated with the slot and continues the phoneme search there. After the search in the custom decoding model is completed, it returns to the slot and continues searching through the symbols after the slot in the general decoding model, until the character string with the highest probability value is obtained as the speech recognition result.
S227: The terminal outputs the speech recognition result.
Here, the terminal may output the speech recognition result through the processor. Specifically, the processor may display the character string on the voice recognition interface shown in FIG. 3-2, or may convert the character string into a voice signal and output it to the user for voice interaction with the user. Of course, other output manners are also possible, which are not specifically limited in the embodiments of the present application.
It can be seen that, in the embodiments of the present application, in the process of continuously acquiring the voice signal, the speech recognition device acquires the custom corpus corresponding to the current account, such as the contact information of the current account or domain-specific text uploaded under the current account; it then analyzes and processes the custom corpus to construct the corresponding at least one custom decoding model; next, it loads the constructed at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model; finally, it decodes the voice signal using the new decoding model to obtain the speech recognition result. In this way, the new decoding model can significantly raise the otherwise low probability values that the user's custom corpus would receive in the general decoding model, thereby reducing the probability of data skew when recognizing speech containing the custom corpus and improving the overall accuracy of speech recognition.
Based on the foregoing embodiments, in practical applications, a weighted finite-state transducer (WFST) network may be used to implement the decoding model.
In the embodiments of the present application, FIG. 4-1 is a block diagram of an implementation flow of the speech recognition method in an embodiment of the present application. Referring to FIG. 4-1, the figure shows a general speech recognition service. In an offline environment, a static WFST network 414 is constructed by integrating an acoustic model 411, a dictionary 412, a language model 413, and the like. In an online environment, the WFST network is first loaded. After the service receives a speech signal, it first converts the signal into speech features, and then obtains the output text combination with the maximum posterior probability by computing the acoustic model scores and the weight scores in the WFST network.
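The maximum-posterior selection just described can be reduced to a one-line comparison. The candidate strings and all score values below are made-up numbers chosen purely to illustrate summing an acoustic score with a WFST weight score (both as log-probabilities) and keeping the best total; they are not drawn from any actual system.

```python
def best_hypothesis(candidates):
    """candidates: list of (text, acoustic_score, wfst_weight), with
    both scores given as log-probabilities (higher is better). Returns
    the text whose combined score is maximal."""
    return max(candidates, key=lambda c: c[1] + c[2])[0]

candidates = [
    ("张三来找我吃饭", -12.0, -3.0),   # acoustically slightly worse, strong LM weight
    ("章三来找我吃饭", -11.5, -6.0),   # acoustically better, weak LM weight
]
best = best_hypothesis(candidates)
```

The first candidate wins (-15.0 versus -17.5), showing how the WFST weight can overrule a small acoustic advantage of a homophone.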
In order to improve the accuracy of speech recognition, in other embodiments of the present application, FIG. 4-2 is a block diagram of an implementation flow of the speech recognition method in an embodiment of the present application. Referring to FIG. 4-2, on the basis of the foregoing embodiment, the online speech recognition service is maintained, and the custom corpus 421 corresponding to the current account, such as contact information and domain-specific text of at least one field, is analyzed and processed. First, an out-of-vocabulary (OOV) dictionary 422 is extracted. Considering that users may prefer rare or unconventional words, such as "Martian language" Internet slang, which are very likely not in the general vocabulary, a user-customized word list is first constructed: a new vocabulary is obtained by combining the OOV dictionary with the general vocabulary. Then, the new vocabulary is combined with the user's personal data to construct a custom WFST network 423.
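The OOV step in FIG. 4-2 amounts to a set difference followed by a union. The sketch below is a minimal, hypothetical rendering of that idea; the function names and the tiny example vocabularies are invented for illustration.

```python
def extract_oov(custom_corpus_tokens, general_vocab):
    """Tokens from the user's custom corpus that are absent from the
    general vocabulary form the OOV dictionary."""
    return {w for w in custom_corpus_tokens if w not in general_vocab}

def merge_vocab(general_vocab, oov_dictionary):
    """Combine the OOV dictionary with the general vocabulary to obtain
    the user-customized word list."""
    return set(general_vocab) | set(oov_dictionary)

general = {"吃饭", "朋友"}
oov = extract_oov(["火星文", "吃饭"], general)
new_vocab = merge_vocab(general, oov)
```

The rare word "火星文" ends up in the OOV dictionary and therefore in the merged vocabulary, so the custom WFST network built from it can emit that word.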
Accordingly, the custom decoding model described in the foregoing embodiments may be a custom WFST network, and the general decoding model may be a general WFST network.
In the embodiments of the present application, the step in the foregoing embodiments of loading the at least one custom decoding model into the pre-stored general decoding model to generate the new decoding model may include: merging the custom WFST network with the general WFST network to obtain a new WFST network. Correspondingly, the step in the foregoing embodiments of decoding the voice signal using the new decoding model to obtain the speech recognition result may include: performing search decoding on the voice signal using the new WFST network to obtain the speech recognition result.
For example, FIG. 4-3 is a partial schematic diagram of the new WFST network in an embodiment of the present application. Referring to FIG. 4-3, a slot 432 is inserted into the general WFST network 431, and the slot 432 is associated with the custom WFST network 433 corresponding to the contact information, forming the new WFST network. Then, when decoding a voice signal, whenever the decoding token reaches the position of the slot while searching in the general WFST network, it directly enters the associated custom WFST network and continues searching there; when the search in the custom WFST network ends, the decoding token returns to the general WFST network and continues searching. In this way, a user-specific decoding space is constructed for each user.
It should be understood that the steps in the embodiments of the present application are not necessarily performed in the order indicated by the step numbers. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and the steps may be performed in other orders. Moreover, at least some of the steps in each embodiment may include multiple sub-steps or multiple stages; these sub-steps or stages are not necessarily performed at the same moment but may be performed at different moments, and their execution order is not necessarily sequential, as they may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps. Based on the same inventive concept, an embodiment of the present application provides a speech recognition device, which may be the speech recognition device described in one or more of the above embodiments. For the internal structure of the speech recognition device, reference may be made to the structure shown in FIG. 6. Each of the units described below may be implemented in whole or in part by software, hardware, or a combination thereof.
FIG. 5 is a schematic diagram of the unit structure of the speech recognition device in an embodiment of the present application. Referring to FIG. 5, the speech recognition device 500 may include: a voice signal acquiring unit 501, configured to continuously acquire a voice signal; a corpus obtaining unit 502, configured to acquire, in the process of continuously acquiring the voice signal, a custom corpus corresponding to the current account; a model building unit 503, configured to analyze and process the custom corpus to construct at least one corresponding custom decoding model; a loading unit 504, configured to load the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and a decoding unit 505, configured to decode the voice signal using the new decoding model to obtain a speech recognition result.
In other embodiments of the present application, the custom corpus corresponding to the current account includes at least one of the following: contact information of the current account and domain-specific text of at least one field.
In other embodiments of the present application, the custom decoding model may be a custom WFST network, and the general decoding model may be a general WFST network. Correspondingly, the loading unit is further configured to merge the custom WFST network with the general WFST network to obtain a new WFST network, and the decoding unit is further configured to perform search decoding on the voice signal using the new WFST network to obtain the speech recognition result.
In other embodiments of the present application, the model building unit is further configured to classify the custom corpus to obtain a custom language model for each category, and to construct, based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category, at least one custom decoding model corresponding to each category.
In other embodiments of the present application, the loading unit is further configured to perform data mining on historical voice data of the current account to obtain a context template with a slot, and, according to the classification tag of the slot, to add the slot between the start symbol and the end symbol of the general decoding model and associate the slot with a custom decoding model that has the classification tag among the at least one custom decoding model, to generate a new decoding model.
In other embodiments of the present application, the decoding unit is specifically configured to decode the voice signal according to the new decoding model; when the decoding token encounters a slot, to jump to the custom decoding model associated with the slot; to perform decoding in the custom decoding model associated with the slot; and, after decoding in the custom decoding model associated with the slot is completed, to return to the slot and continue decoding in the general decoding model until the speech recognition result is obtained.
It should be noted here that the description of the above device embodiments is similar to the description of the above method embodiments, and the device embodiments have beneficial effects similar to those of the method embodiments. For technical details not disclosed in the device embodiments of the present application, reference may be made to the description of the method embodiments of the present application.
FIG. 6 is a schematic diagram of the internal structure of the speech recognition device in an embodiment of the present application. Referring to FIG. 6, the speech recognition device 600 includes a processor, a memory, and a communication interface connected through a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device may store an operating system and computer-readable instructions which, when executed, cause the processor to perform a speech recognition method. The processor of the computer device is configured to provide computing and control capabilities to support the operation of the entire computer device. The internal memory of the computer device may store computer-readable instructions which, when executed by the processor, cause the processor to perform a speech recognition method. The computer device may be a mobile phone, a tablet computer, a personal digital assistant, a wearable device, or the like. Those skilled in the art will understand that the structure shown in FIG. 6 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the terminal to which the solution of the present application is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
When executing the computer-readable instructions, the processor implements the following steps: acquiring, in the process of continuously acquiring a voice signal through the communication interface, a custom corpus corresponding to the current account; analyzing and processing the custom corpus to construct at least one corresponding custom decoding model; loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and decoding the voice signal using the new decoding model to obtain a speech recognition result.
In other embodiments of the present application, the custom corpus corresponding to the current account includes at least one of the following: contact information of the current account and domain-specific text of at least one field.
In other embodiments of the present application, the custom decoding model may be a custom WFST network, and the general decoding model may be a general WFST network. Correspondingly, when executing the computer-readable instructions, the processor further implements the following steps: merging the custom WFST network with the general WFST network to obtain a new WFST network; and performing search decoding on the voice signal using the new WFST network to obtain the speech recognition result.
In other embodiments of the present application, when executing the computer-readable instructions, the processor further implements the following steps: classifying the custom corpus to obtain a custom language model for each category; and constructing, based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category, at least one custom decoding model corresponding to each category.
In other embodiments of the present application, when executing the computer-readable instructions, the processor further implements the following steps: performing data mining on historical voice data of the current account to obtain a context template with a slot; and, according to the classification tag of the slot, adding the slot between the start symbol and the end symbol of the general decoding model and associating the slot with a custom decoding model that has the classification tag among the at least one custom decoding model, to generate a new decoding model.
In other embodiments of the present application, when executing the computer-readable instructions, the processor further implements the following steps: decoding the voice signal according to the new decoding model; when the decoding token encounters a slot, jumping to the custom decoding model associated with the slot; performing decoding in the custom decoding model associated with the slot; and, after decoding in the custom decoding model associated with the slot is completed, returning to the slot and continuing to decode in the general decoding model until the speech recognition result is obtained.
In practical applications, the processor may be at least one of an application-specific integrated circuit (ASIC), a digital signal processor (DSP), a digital signal processing device (DSPD), a programmable logic device (PLD), a field-programmable gate array (FPGA), a central processing unit (CPU), a controller, a microcontroller, or a microprocessor. The memory may be a removable storage device, a read-only memory (ROM), a magnetic disk, an optical disk, or the like. It should be understood that the electronic devices implementing the above processor and memory functions may take other forms, which are not specifically limited in the embodiments of the present application.
Further, if the speech recognition device is a cloud speech recognition device, i.e., a speech recognition server, the communication interface may be the interface between a terminal and the speech recognition server. If the speech recognition device is a local speech recognition device, i.e., a terminal, it further includes a voice collection apparatus. The voice collection apparatus may be a microphone, a microphone array, a transmitter, or the like, which is not specifically limited in the embodiments of the present application. In that case, the communication interface on the terminal may be the interface between the processor and the voice collection apparatus, such as the interface between the processor and a microphone or transmitter. Of course, the communication interface may take other forms, which are not specifically limited in the embodiments of the present application.
Based on the same inventive concept, an embodiment of the present application provides a computer readable storage medium storing computer readable instructions that, when executed by a processor, implement the following steps: in the process of continuously acquiring a speech signal, acquiring a custom corpus corresponding to the current account; analyzing the custom corpus to construct at least one corresponding custom decoding model; loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and decoding the speech signal using the new decoding model to obtain a speech recognition result.
In other embodiments of the present application, the custom corpus corresponding to the current account includes at least one of the following: contact information of the current account and domain-specific text of at least one domain.
In other embodiments of the present application, the custom decoding model may be a custom WFST network and the general decoding model may be a general WFST network. Accordingly, when the computer readable instructions are executed by the processor, the following steps are further implemented: merging the custom WFST network with the general WFST network to obtain a new WFST network; and performing search decoding on the speech signal using the new WFST network to obtain a speech recognition result.
In other embodiments of the present application, when the computer readable instructions are executed by the processor, the following steps are further implemented: classifying the custom corpus to obtain a custom language model for each category; and constructing, based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category, at least one custom decoding model corresponding to each category.
In other embodiments of the present application, when the computer readable instructions are executed by the processor, the following steps are further implemented: performing data mining on historical voice data of the current account to obtain a context template with a slot; and, according to the classification mark of the slot, adding the slot between the start symbol and the end symbol of the general decoding model and associating the slot with the custom decoding model carrying that classification mark among the at least one custom decoding model, thereby generating a new decoding model.
In other embodiments of the present application, when the computer readable instructions are executed by the processor, the following steps are further implemented: decoding the speech signal according to the new decoding model; when the decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding within that model; and, after decoding in the custom decoding model associated with the slot is completed, returning to the slot and continuing decoding in the general decoding model until a speech recognition result is obtained.
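The embodiments above repeatedly walk through the same four steps: acquire the account's custom corpus, build custom models from it, load them into the general model, and decode. The stub below strings those steps together end to end. Every function here is a hypothetical stand-in with toy behaviour (the "models" are plain phrase sets and "decoding" is a lookup), intended only to show how the steps compose, not how the patented system works.

```python
# Hypothetical end-to-end sketch of the four steps described above.
# All helpers are toy stand-ins, not the patented implementation.

def fetch_custom_corpus(account):
    # Step 1: custom corpus for the current account (e.g. contact names).
    return {"contacts": ["zhang san", "li si"]}

def build_custom_models(corpus):
    # Step 2: one custom "model" per category; here just a phrase set.
    return {cat: set(phrases) for cat, phrases in corpus.items()}

def load_into(general_vocab, custom_models):
    # Step 3: merge the custom models into the general model.
    merged = set(general_vocab)
    for phrases in custom_models.values():
        merged |= phrases
    return merged

def decode(utterance, model):
    # Step 4: "recognize" by lookup against the merged model.
    return utterance if utterance in model else None

def recognize(utterance, account, general_vocab):
    corpus = fetch_custom_corpus(account)
    custom_models = build_custom_models(corpus)
    new_model = load_into(general_vocab, custom_models)
    return decode(utterance, new_model)
```

The point of the composition is visible even in this toy: a phrase known only to the account's custom corpus is recognized after the merge, while the general vocabulary alone would reject it.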
In the embodiments of the present application, the above computer readable instructions are stored in a storage medium and include several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present application. The foregoing storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a magnetic disk, or an optical disk. Thus, the embodiments of the present application are not limited to any particular combination of hardware and software.
It should be noted here that the above description of the computing device and computer readable storage medium embodiments is similar to the description of the method embodiments above and has similar beneficial effects. For technical details not disclosed in the computing device or storage medium embodiments of the present application, refer to the description of the method embodiments of the present application.
It should be understood that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic relating to the embodiment is included in at least one embodiment of the present application. Thus, appearances of "in one embodiment" or "in an embodiment" in various places throughout this specification do not necessarily all refer to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should also be understood that, in the various embodiments of the present application, the sequence numbers of the above processes do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application. The above serial numbers of the embodiments of the present application are for description only and do not represent the relative merits of the embodiments.
It should be noted that, as used herein, the terms "include", "comprise", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
In the several embodiments provided in the present application, it should be understood that the disclosed devices and methods may be implemented in other manners. The device embodiments described above are merely illustrative. For example, the division of the units is merely a division of logical functions; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the couplings, direct couplings, or communication connections between the components shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may serve as a separate unit, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.
A person of ordinary skill in the art may understand that all or part of the processes of the methods in the above embodiments may be implemented by a computer program instructing relevant hardware. The program may be stored in a non-volatile computer readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, databases, or other media used in the embodiments provided in the present application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope described in this specification.
The above embodiments merely express several implementations of the present application, and their descriptions are relatively specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent application shall be subject to the appended claims.

Claims (18)

  1. A speech recognition method, comprising:
    acquiring, by a speech recognition device, a custom corpus corresponding to a current account in a process of continuously acquiring a speech signal;
    analyzing, by the speech recognition device, the custom corpus to construct at least one corresponding custom decoding model;
    loading, by the speech recognition device, the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and
    decoding, by the speech recognition device, the speech signal using the new decoding model to obtain a speech recognition result.
  2. The method according to claim 1, wherein the custom corpus corresponding to the current account comprises at least one of: contact information of the current account and domain-specific text of at least one domain.
  3. The method according to claim 1, wherein the custom decoding model is a custom Weighted Finite-State Transducer (WFST) network and the general decoding model is a general WFST network;
    the loading, by the speech recognition device, the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model comprises: merging, by the speech recognition device, the custom WFST network with the general WFST network to obtain a new WFST network; and
    the decoding, by the speech recognition device, the speech signal using the new decoding model to obtain a speech recognition result comprises: performing, by the speech recognition device, search decoding on the speech signal using the new WFST network to obtain a speech recognition result.
  4. The method according to claim 1, wherein the analyzing, by the speech recognition device, the custom corpus to construct a corresponding custom decoding model comprises:
    classifying, by the speech recognition device, the custom corpus to obtain a custom language model for each category; and
    constructing, by the speech recognition device, the at least one custom decoding model corresponding to each category based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category.
  5. The method according to claim 4, wherein the loading, by the speech recognition device, the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model comprises:
    acquiring, by the speech recognition device, a context template with a slot, wherein the context template is obtained by performing data mining on historical voice data of the current account; and
    adding, by the speech recognition device, the slot between a start symbol and an end symbol of the general decoding model according to a classification mark of the slot, and associating the slot with the custom decoding model carrying the classification mark among the at least one custom decoding model, to generate the new decoding model.
  6. The method according to claim 5, wherein the decoding, by the speech recognition device, the speech signal using the new decoding model to obtain a speech recognition result comprises:
    decoding, by the speech recognition device, the speech signal according to the new decoding model; when a decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding within the custom decoding model associated with the slot; and
    returning, by the speech recognition device, to the slot after decoding in the custom decoding model associated with the slot is completed, and continuing decoding in the general decoding model until the speech recognition result is obtained.
  7. One or more non-volatile storage media storing computer readable instructions that, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring a custom corpus corresponding to a current account in a process of continuously acquiring a speech signal;
    analyzing the custom corpus to construct at least one corresponding custom decoding model;
    loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and
    decoding the speech signal using the new decoding model to obtain a speech recognition result.
  8. The storage medium according to claim 7, wherein the custom corpus corresponding to the current account comprises at least one of: contact information of the current account and domain-specific text of at least one domain.
  9. The storage medium according to claim 7, wherein the custom decoding model is a custom Weighted Finite-State Transducer (WFST) network and the general decoding model is a general WFST network;
    the loading the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model comprises: merging the custom WFST network with the general WFST network to obtain a new WFST network; and
    the decoding the speech signal using the new decoding model to obtain a speech recognition result comprises: performing search decoding on the speech signal using the new WFST network to obtain a speech recognition result.
  10. The storage medium according to claim 7, wherein the analyzing the custom corpus to construct a corresponding custom decoding model comprises:
    classifying the custom corpus to obtain a custom language model for each category; and
    constructing the at least one custom decoding model corresponding to each category based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category.
  11. The storage medium according to claim 10, wherein the loading the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model comprises:
    acquiring a context template with a slot, wherein the context template is obtained by performing data mining on historical voice data of the current account; and
    adding the slot between a start symbol and an end symbol of the general decoding model according to a classification mark of the slot, and associating the slot with the custom decoding model carrying the classification mark among the at least one custom decoding model, to generate the new decoding model.
  12. The storage medium according to claim 11, wherein the decoding the speech signal using the new decoding model to obtain a speech recognition result comprises:
    decoding the speech signal according to the new decoding model; when a decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding within the custom decoding model associated with the slot; and
    returning to the slot after decoding in the custom decoding model associated with the slot is completed, and continuing decoding in the general decoding model until the speech recognition result is obtained.
  13. A speech recognition device, comprising a memory and a processor, the memory storing computer readable instructions that, when executed by the processor, cause the processor to perform the following steps:
    acquiring a custom corpus corresponding to a current account in a process of continuously acquiring a speech signal;
    analyzing the custom corpus to construct at least one corresponding custom decoding model;
    loading the at least one custom decoding model into a pre-stored general decoding model to generate a new decoding model; and
    decoding the speech signal using the new decoding model to obtain a speech recognition result.
  14. The speech recognition device according to claim 13, wherein the custom corpus corresponding to the current account comprises at least one of: contact information of the current account and domain-specific text of at least one domain.
  15. The speech recognition device according to claim 13, wherein the custom decoding model is a custom Weighted Finite-State Transducer (WFST) network and the general decoding model is a general WFST network;
    the loading the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model comprises: merging the custom WFST network with the general WFST network to obtain a new WFST network; and
    the decoding the speech signal using the new decoding model to obtain a speech recognition result comprises: performing search decoding on the speech signal using the new WFST network to obtain a speech recognition result.
  16. The speech recognition device according to claim 13, wherein the analyzing the custom corpus to construct a corresponding custom decoding model comprises:
    classifying the custom corpus to obtain a custom language model for each category; and
    constructing the at least one custom decoding model corresponding to each category based on a pre-stored acoustic model, a dictionary model, and the custom language model of each category.
  17. The speech recognition device according to claim 16, wherein the loading the at least one custom decoding model into the pre-stored general decoding model to generate a new decoding model comprises:
    acquiring a context template with a slot, wherein the context template is obtained by performing data mining on historical voice data of the current account; and
    adding the slot between a start symbol and an end symbol of the general decoding model according to a classification mark of the slot, and associating the slot with the custom decoding model carrying the classification mark among the at least one custom decoding model, to generate the new decoding model.
  18. The speech recognition device according to claim 17, wherein the decoding the speech signal using the new decoding model to obtain a speech recognition result comprises:
    decoding the speech signal according to the new decoding model; when a decoding token encounters the slot, jumping to the custom decoding model associated with the slot and decoding within the custom decoding model associated with the slot; and
    returning to the slot after decoding in the custom decoding model associated with the slot is completed, and continuing decoding in the general decoding model until the speech recognition result is obtained.
PCT/CN2018/085819 2017-06-07 2018-05-07 Speech recognition method, storage medium, and speech recognition device WO2018223796A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710425219.XA CN108288467B (en) 2017-06-07 2017-06-07 Voice recognition method and device and voice recognition engine
CN201710425219.X 2017-06-07

Publications (1)

Publication Number Publication Date
WO2018223796A1 true WO2018223796A1 (en) 2018-12-13

Family

ID=62831581

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/085819 WO2018223796A1 (en) 2017-06-07 2018-05-07 Speech recognition method, storage medium, and speech recognition device

Country Status (2)

Country Link
CN (1) CN108288467B (en)
WO (1) WO2018223796A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349569A (en) * 2019-07-02 2019-10-18 苏州思必驰信息科技有限公司 The training and recognition methods of customized product language model and device

Families Citing this family (14)

Publication number Priority date Publication date Assignee Title
CN108288467B (en) * 2017-06-07 2020-07-14 腾讯科技(深圳)有限公司 Voice recognition method and device and voice recognition engine
CN108922531B (en) * 2018-07-26 2020-10-27 腾讯科技(北京)有限公司 Slot position identification method and device, electronic equipment and storage medium
CN109246214B (en) * 2018-09-10 2022-03-04 北京奇艺世纪科技有限公司 Prompt tone obtaining method and device, terminal and server
CN109272995A (en) * 2018-09-26 2019-01-25 出门问问信息科技有限公司 Audio recognition method, device and electronic equipment
CN109087645B (en) * 2018-10-24 2021-04-30 科大讯飞股份有限公司 Decoding network generation method, device, equipment and readable storage medium
CN109524017A (en) * 2018-11-27 2019-03-26 北京分音塔科技有限公司 A kind of the speech recognition Enhancement Method and device of user's custom words
CN110164421B (en) * 2018-12-14 2022-03-11 腾讯科技(深圳)有限公司 Voice decoding method, device and storage medium
CN110046276B (en) * 2019-04-19 2021-04-20 北京搜狗科技发展有限公司 Method and device for searching keywords in voice
CN110223695B (en) * 2019-06-27 2021-08-27 维沃移动通信有限公司 Task creation method and mobile terminal
CN110517692A (en) * 2019-08-30 2019-11-29 苏州思必驰信息科技有限公司 Hot word audio recognition method and device
CN110570857B (en) * 2019-09-06 2020-09-15 北京声智科技有限公司 Voice wake-up method and device, electronic equipment and storage medium
CN111667821A (en) * 2020-05-27 2020-09-15 山西东易园智能家居科技有限公司 Voice recognition system and recognition method
CN112530416B (en) * 2020-11-30 2024-06-14 北京汇钧科技有限公司 Speech recognition method, apparatus, device and computer readable medium
CN114242046B (en) * 2021-12-01 2022-08-16 广州小鹏汽车科技有限公司 Voice interaction method and device, server and storage medium

Citations (13)

Publication number Priority date Publication date Assignee Title
EP1981020A1 (en) * 2007-04-12 2008-10-15 France Télécom Method and system for automatic speech recognition adapted for detecting utterances out of context
CN102270450A (en) * 2010-06-07 2011-12-07 Shufei Electronics Co., Ltd. System and method of multi-model adaptation and voice recognition
CN102270451A (en) * 2011-08-18 2011-12-07 Anhui USTC iFlytek Co., Ltd. Method and system for speaker identification
US20110313767A1 (en) * 2010-06-18 2011-12-22 At&T Intellectual Property I, L.P. System and method for data intensive local inference
CN103377651A (en) * 2012-04-28 2013-10-30 Beijing Samsung Telecommunication Technology Research Co., Ltd. Device and method for automatic speech synthesis
CN104123933A (en) * 2014-08-01 2014-10-29 Institute of Automation, Chinese Academy of Sciences Self-adaptive non-parallel training based voice conversion method
US20150081293A1 (en) * 2013-09-19 2015-03-19 Maluuba Inc. Speech recognition using phoneme matching
CN104508739A (en) * 2012-06-21 2015-04-08 Google Inc. Dynamic language model
US9190055B1 (en) * 2013-03-14 2015-11-17 Amazon Technologies, Inc. Named entity recognition with personalized models
CN105448292A (en) * 2014-08-19 2016-03-30 Beijing Yushanzhi Information Technology Co., Ltd. Scene-based real-time voice recognition system and method
CN105575386A (en) * 2015-12-18 2016-05-11 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for voice recognition
CN105719649A (en) * 2016-01-19 2016-06-29 Baidu Online Network Technology (Beijing) Co., Ltd. Voice recognition method and device
CN108288467A (en) * 2017-06-07 2018-07-17 Tencent Technology (Shenzhen) Co., Ltd. Speech recognition method, device and speech recognition engine

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004334193A (en) * 2003-05-01 2004-11-25 Microsoft Corp System with composite statistical and rule-based grammar model for speech recognition and natural language understanding
JPWO2007088853A1 (en) * 2006-01-31 2009-06-25 Panasonic Corporation Speech coding apparatus, speech decoding apparatus, speech coding system, speech coding method, and speech decoding method
US7716039B1 (en) * 2006-04-14 2010-05-11 At&T Intellectual Property Ii, L.P. Learning edit machines for robust multimodal understanding
CN103971675B (en) * 2013-01-29 2016-03-02 Tencent Technology (Shenzhen) Co., Ltd. Automatic speech recognition method and system
CN103971686B (en) * 2013-01-30 2015-06-10 Tencent Technology (Shenzhen) Co., Ltd. Method and system for automatically recognizing speech
CN103325370B (en) * 2013-07-01 2015-11-25 Baidu Online Network Technology (Beijing) Co., Ltd. Speech recognition method and speech recognition system
CN106294460B (en) * 2015-05-29 2019-10-22 Institute of Acoustics, Chinese Academy of Sciences Chinese speech keyword retrieval method based on a word and character hybrid language model
CN105118501B (en) * 2015-09-07 2019-05-21 Xu Yang Speech recognition method and system
CN105976812B (en) * 2016-04-28 2019-04-26 Tencent Technology (Shenzhen) Co., Ltd. Speech recognition method and device

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110349569A (en) * 2019-07-02 2019-10-18 AISpeech Co., Ltd. Training and recognition method and device for customized product language models

Also Published As

Publication number Publication date
CN108288467B (en) 2020-07-14
CN108288467A (en) 2018-07-17

Similar Documents

Publication Publication Date Title
WO2018223796A1 (en) Speech recognition method, storage medium, and speech recognition device
US20240069860A1 (en) Search and knowledge base question answering for a voice user interface
CN107590135B (en) Automatic translation method, device and system
US10176804B2 (en) Analyzing textual data
CN112002308B (en) Voice recognition method and device
CN110431626B (en) Use of pairwise comparisons for hyperarticulation detection in repeated speech queries to improve speech recognition
US11823678B2 (en) Proactive command framework
US11037553B2 (en) Learning-type interactive device
US20170084274A1 (en) Dialog management apparatus and method
US9324323B1 (en) Speech recognition using topic-specific language models
KR102041621B1 (en) System for providing an artificial-intelligence-based conversational corpus analysis service, and building method therefor
US20190005951A1 (en) Method of processing dialogue based on dialog act information
US10963497B1 (en) Multi-stage query processing
US20190172444A1 (en) Spoken dialog device, spoken dialog method, and recording medium
JP2014145842A (en) Utterance analysis device, voice interaction control device, method, and program
JP2016536652A (en) Real-time speech evaluation system and method for mobile devices
JP6370962B1 (en) Generating device, generating method, and generating program
US20120221335A1 (en) Method and apparatus for creating voice tag
WO2013056343A1 (en) System, method and computer program for correcting speech recognition information
CN116343747A (en) Speech synthesis method, speech synthesis device, electronic device, and storage medium
JP7034027B2 (en) Recognition device, recognition method and recognition program
CN111126084A (en) Data processing method and device, electronic equipment and storage medium
CN112885335B (en) Speech recognition method and related device
CN113593523A (en) Speech detection method and device based on artificial intelligence and electronic equipment
KR102464156B1 (en) Call center service providing apparatus, method, and program for matching a user and an agent based on the user's status and the agent's status

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18813419

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18813419

Country of ref document: EP

Kind code of ref document: A1