CN115862604A - Voice wakeup model training and voice wakeup method, device and computer equipment

Info

Publication number: CN115862604A
Application number: CN202211481741.7A
Authority: CN
Inventors: 李蒙
Original assignee: Mgjia Beijing Technology Co ltd
Current assignee: Mgjia Beijing Technology Co ltd
Priority date: 2022-11-24
Filing date: 2022-11-24
Publication date: 2023-03-28
Anticipated expiration: 2042-11-24
Also published as: CN115862604B

Abstract

The invention provides a voice wake-up model training method, a voice wake-up method, an apparatus and a computer device. The training method comprises: acquiring voice sample data, wherein the voice sample data carries labels related to a wake-up word; and, based on the voice sample data, simultaneously performing multi-task learning of voice wake-up classification, sequence alignment labeling and phoneme recognition on a wake-up model consisting of an encoder and a decoder, to obtain a target voice wake-up model. Multi-task learning lets the wake-up model learn, at the same time, different kinds of information that are useful for wake-up, so that the information learned by each task complements that of the others. This complementarity improves the recognition accuracy of the voice wake-up model and keeps false wake-ups to a minimum, so that the model balances false wake-up against false rejection: false wake-ups are reduced as far as possible while the wake-up rate is maintained, which improves the user experience.

Description

Voice wakeup model training and voice wakeup method, device and computer equipment

Technical Field

The invention relates to the field of computer technology, and in particular to a voice wake-up model training method, a voice wake-up method, an apparatus and a computer device.

Background

Before voice interaction can take place, a device must first be woken up, that is, switched from a dormant state to a working state, so that it can process the user's instructions normally. Common wake-up modes include touch wake-up (for example, the lock-screen key), timed wake-up (for example, an alarm clock) and passive wake-up (for example, an incoming call). Voice wake-up switches the device from the dormant state to the working state by means of speech.

In the related art, the main difficulty of voice wake-up technology is balancing false wake-up against false rejection, that is, improving the wake-up rate while keeping the false wake-up rate low. To achieve a low false wake-up rate, current mainstream schemes usually sacrifice the wake-up rate and therefore produce more false rejections, and a low wake-up rate degrades the user experience.

Disclosure of Invention

Therefore, the technical problem to be solved by the present invention is to overcome the above defect of the prior art by providing a voice wake-up model training method, a voice wake-up method, an apparatus and a computer device.

According to a first aspect, the invention provides a method for training a voice wake-up model, the method comprising:

acquiring voice sample data, wherein the voice sample data carries labels related to a wake-up word;

and, based on the voice sample data, simultaneously performing multi-task learning of voice wake-up classification, sequence alignment labeling and phoneme recognition on a wake-up model formed by an encoder and a decoder, to obtain a target voice wake-up model.

In this way, multi-task learning lets the wake-up model learn, at the same time, different kinds of information that are useful for wake-up, so that the information learned by each task complements that of the others. This complementarity improves the recognition accuracy of the voice wake-up model and keeps false wake-ups to a minimum, so that the model balances false wake-up against false rejection: false wake-ups are reduced as far as possible while the wake-up rate is maintained, which improves the user experience.

With reference to the first aspect, in a first embodiment of the first aspect, training the wake-up model composed of an encoder and a decoder on voice wake-up classification based on the voice sample data includes:

inputting the voice sample data into the encoder to obtain output features;

averaging the output features over the time dimension, inputting the averaged output features into a one-layer fully connected network for a first classification, and performing supervised training using the labels related to the wake-up word in the voice sample data, wherein the classes of the first classification comprise: all wake-up words, and one class representing non-wake-up words.

With reference to the first aspect, in a second embodiment of the first aspect, training the wake-up model composed of an encoder and a decoder on sequence alignment labeling based on the voice sample data includes:

inputting the output features into a one-layer fully connected network for a second classification, and performing supervised training using the labels related to the wake-up word in the voice sample data, wherein the classes of the second classification comprise: the phonemes contained in the wake-up word, a silence class, and an other-pronunciation class.

With reference to the first aspect, in a third embodiment of the first aspect, training the wake-up model composed of an encoder and a decoder on phoneme recognition based on the voice sample data includes:

inputting the output features into the decoder for acoustic unit integration and a third classification, and performing supervised training using the labels related to the wake-up word in the voice sample data, wherein the classes of the third classification comprise: the phonemes contained in the wake-up word, a silence class, and an other-pronunciation class.

With reference to the first aspect, in a fourth embodiment of the first aspect, the voice sample data includes original voice sample data and enhanced voice sample data, the enhanced voice sample data being obtained by performing speech enhancement processing on the original voice sample data, and the method further comprises:

during the training of the wake-up model on voice wake-up classification and sequence alignment labeling, calculating the model output distributions corresponding to the original voice sample data and to the enhanced voice sample data respectively;

calculating a KL divergence loss based on the model output distributions corresponding to the original voice sample data and the enhanced voice sample data;

training the wake-up model based on the KL divergence loss.

According to a second aspect, the present invention further provides a voice wake-up method, including:

inputting a target voice into a target voice wake-up model trained by the voice wake-up model training method of the first aspect or any of its optional embodiments, and performing voice wake-up classification to obtain a first wake-up word;

inputting the target voice into the target voice wake-up model trained by the voice wake-up model training method of the first aspect or any of its optional embodiments, and performing sequence alignment labeling;

checking whether the sequence alignment labeling result contains the first wake-up word;

when the sequence alignment labeling result contains the first wake-up word, inputting the encoder output features of the target voice wake-up model that correspond to the position of the wake-up word in the sequence alignment labeling result into the decoder for phoneme recognition;

checking whether the phonemes contained in the wake-up word, the silence class and the other-pronunciation class in the phoneme recognition result are consistent with the first wake-up word;

and, when the phonemes contained in the wake-up word, the silence class and the other-pronunciation class in the phoneme recognition result are consistent with the first wake-up word, waking up the target object based on the first wake-up word.

In this way, a voice wake-up model trained jointly on different tasks is used, and a multi-stage wake-up flow is carried out that combines the respective advantages of those tasks. False wake-ups can thus be controlled at a minimal cost in false rejections, the false rejection rate and the false wake-up rate are both greatly reduced, a balance between false rejection and false wake-up is achieved, and the user experience is further improved.

With reference to the second aspect, in a first embodiment of the second aspect, the wake-up of the target object is rejected when the voice wake-up classification result is the non-wake-up word class, or when the sequence alignment labeling result does not contain the first wake-up word, or when the phonemes contained in the wake-up word, the silence class and the other-pronunciation class in the phoneme recognition result are inconsistent with the first wake-up word.

According to a third aspect, the present invention further provides a voice wake-up model training apparatus, including:

an acquisition unit, configured to acquire voice sample data, wherein the voice sample data carries labels related to a wake-up word;

and a training unit, configured to simultaneously perform, based on the voice sample data, multi-task learning of voice wake-up classification, sequence alignment labeling and phoneme recognition on a wake-up model formed by an encoder and a decoder, to obtain a target voice wake-up model.

With reference to the third aspect, in a first embodiment of the third aspect, the training unit includes:

a first input unit, configured to input the voice sample data into the encoder to obtain output features;

a first training unit, configured to average the output features over the time dimension, input the averaged output features into a one-layer fully connected network for a first classification, and perform supervised training using the labels related to the wake-up word in the voice sample data, wherein the classes of the first classification comprise: all wake-up words, and one class representing non-wake-up words.

With reference to the third aspect, in a second embodiment of the third aspect, the training unit includes:

a second input unit, configured to input the voice sample data into the encoder to obtain output features;

a second training unit, configured to input the output features into a one-layer fully connected network for a second classification, and perform supervised training using the labels related to the wake-up word in the voice sample data, wherein the classes of the second classification comprise: the phonemes contained in the wake-up word, a silence class, and an other-pronunciation class.

With reference to the third aspect, in a third embodiment of the third aspect, the training unit includes:

a third input unit, configured to input the voice sample data into the encoder to obtain output features;

a third training unit, configured to input the output features into the decoder for acoustic unit integration and a third classification, and perform supervised training using the labels related to the wake-up word in the voice sample data, wherein the classes of the third classification comprise: the phonemes contained in the wake-up word, a silence class, and an other-pronunciation class.

With reference to the third aspect, in a fourth embodiment of the third aspect, the training unit further includes:

a first calculation unit, configured to calculate, during the training of the wake-up model on voice wake-up classification and sequence alignment labeling, the model output distributions corresponding to the original voice sample data and to the enhanced voice sample data respectively;

a second calculation unit, configured to calculate a KL divergence loss based on the model output distributions corresponding to the original voice sample data and the enhanced voice sample data;

a fourth training unit, configured to train the wake-up model based on the KL divergence loss.

According to a fourth aspect, the present invention further provides a voice wake-up apparatus, comprising:

a voice wake-up classification unit, configured to input a target voice into a target voice wake-up model trained by the voice wake-up model training method of the first aspect or any of its optional embodiments, and perform voice wake-up classification to obtain a first wake-up word;

a sequence alignment labeling unit, configured to input the target voice into the target voice wake-up model trained by the voice wake-up model training method of the first aspect or any of its optional embodiments, and perform sequence alignment labeling;

a first comparison unit, configured to check whether the sequence alignment labeling result contains the first wake-up word;

a phoneme recognition unit, configured to input, when the sequence alignment labeling result contains the first wake-up word, the encoder output features of the target voice wake-up model that correspond to the position of the wake-up word in the sequence alignment labeling result into the decoder for phoneme recognition;

a second comparison unit, configured to check whether the phonemes contained in the wake-up word, the silence class and the other-pronunciation class in the phoneme recognition result are consistent with the first wake-up word;

and a wake-up unit, configured to wake up the target object based on the first wake-up word when the phonemes contained in the wake-up word, the silence class and the other-pronunciation class in the phoneme recognition result are consistent with the first wake-up word.

With reference to the fourth aspect, in a first embodiment of the fourth aspect, the apparatus further comprises:

a rejection unit, configured to reject the wake-up of the target object when the voice wake-up classification result is the non-wake-up word class, or when the sequence alignment labeling result does not contain the first wake-up word, or when the phonemes contained in the wake-up word, the silence class and the other-pronunciation class in the phoneme recognition result are inconsistent with the first wake-up word.

According to a fifth aspect, the present invention further provides a computer device, comprising a memory and a processor, wherein the memory and the processor are communicatively connected to each other, the memory stores computer instructions, and the processor executes the computer instructions to perform the voice wakeup model training method of any one of the first aspect and its optional embodiments and the voice wakeup method of any one of the second aspect and its optional embodiments.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the embodiments or the prior art descriptions will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings can be obtained by those skilled in the art without creative efforts.

Fig. 1 is a flowchart of a method for training a voice wakeup model according to an exemplary embodiment.

Fig. 2 is a flow chart of a voice wake-up method according to an exemplary embodiment.

Fig. 3 is a block diagram of a voice wakeup model training apparatus according to an exemplary embodiment.

Fig. 4 is a block diagram of a voice wake-up apparatus according to an exemplary embodiment.

Fig. 5 is a schematic diagram of a hardware structure of a computer device according to an exemplary embodiment.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the accompanying drawings, and it should be understood that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the related art, the main difficulty of voice wake-up technology is balancing false wake-up against false rejection, that is, improving the wake-up rate while keeping the false wake-up rate low. Current mainstream schemes usually sacrifice the wake-up rate in order to achieve fewer false wake-ups, and therefore produce more false rejections.

To solve the above problem, an embodiment of the present invention provides a voice wake-up model training method for use in a computer device. The method may be executed by a voice wake-up model training apparatus, which may be implemented as part or all of the computer device in software, hardware, or a combination of the two. The computer device may be a terminal, a client, or a server; the server may be a single server or a server cluster composed of multiple servers; and the terminal may be a smartphone, a personal computer, a tablet computer, a wearable device, an intelligent robot, or another intelligent hardware device. In the following method embodiments, the executing entity is taken to be a computer device by way of example.

The computer device in this embodiment is suitable for voice wake-up scenarios. With the voice wake-up model training method provided by the invention, multi-task learning lets the wake-up model learn, at the same time, different kinds of information that are useful for wake-up, so that the information learned by each task complements that of the others, which improves the accuracy of the voice wake-up model. The three-stage wake-up flow keeps false rejections and false wake-ups to a minimum, so that the model balances false wake-up against false rejection: false wake-ups are reduced as far as possible while a high wake-up rate is maintained, which improves the user experience.

Fig. 1 is a flowchart of a method for training a voice wakeup model according to an exemplary embodiment. As shown in fig. 1, the method for training the voice wakeup model includes the following steps S101 to S102.

In step S101, voice sample data is acquired.

In the embodiment of the present invention, the voice sample data carries labels related to a wake-up word. The labels may include a wake-up word label, and labels for the phonemes contained in the wake-up word, a silence class and an other-pronunciation class.

In step S102, based on the voice sample data, multi-task learning of voice wake-up classification, sequence alignment labeling and phoneme recognition is performed simultaneously on the wake-up model composed of the encoder and the decoder, to obtain a target voice wake-up model.

In the embodiment of the present invention, after the voice sample data is received, a multi-stage voice wake-up model is trained by multi-task learning so that the target object can be woken up conveniently and accurately. The wake-up model consists of an encoder and a decoder. In one example, the backbone of the wake-up model is divided into an encoder and a decoder, where the encoder is a 12-layer gMLP model and the decoder consists of a continuous integrate-and-fire (CIF) module and a one-layer transformer decoder.

In the embodiment of the present invention, training the wake-up model composed of the encoder and the decoder on voice wake-up classification based on the voice sample data includes: inputting the voice sample data into the encoder to obtain output features; and averaging the output features over the time dimension, inputting the averaged features into a one-layer fully connected network for a first classification, and performing supervised training using the labels related to the wake-up word in the voice sample data, wherein the classes of the first classification comprise: all wake-up words, and one class representing non-wake-up words.

In one example, after the voice sample data has been input into the encoder to obtain the output features, the voice wake-up classification task may proceed as follows: the encoder output features of the voice sample are averaged over the time dimension, the averaged features are input into a one-layer fully connected network, and the wake-up model is trained under supervision with a cross-entropy loss, the classes of the first classification being all wake-up words plus one class representing non-wake-up words.
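As a minimal illustrative sketch of the voice wake-up classification head described above (not the implementation of the invention), the following plain-Python code averages toy encoder features over time, applies one fully connected layer and computes a cross-entropy loss. The feature values, the weight matrix `W`, the bias `b` and the class layout (index 0 for non-wake-up words) are assumptions made only for illustration.

```python
import math

def mean_pool(frames):
    """Average T frame vectors (each of length D) over the time dimension."""
    T = len(frames)
    D = len(frames[0])
    return [sum(f[d] for f in frames) / T for d in range(D)]

def linear(x, weight, bias):
    """One fully connected layer; weight holds one row of D values per class."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, target):
    """Negative log-probability of the labeled class."""
    return -math.log(probs[target])

# Toy encoder output: T=3 frames, D=2 features (illustrative values).
encoder_out = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
# Assumed class layout: 0 = non-wake-up word, 1 and 2 = two wake-up words.
W = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0, 0.0]

pooled = mean_pool(encoder_out)
probs = softmax(linear(pooled, W, b))
loss = cross_entropy(probs, target=1)  # sample labeled with wake-up word 1
```

In a real system the encoder features and layer parameters would come from the trained network; only the pooling, classification and loss steps are shown here.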

In the embodiment of the present invention, training the wake-up model composed of the encoder and the decoder on sequence alignment labeling based on the voice sample data includes: inputting the voice sample data into the encoder to obtain output features; and inputting the output features into a one-layer fully connected network for a second classification, and performing supervised training using the labels related to the wake-up word in the voice sample data, wherein the classes of the second classification comprise: the phonemes contained in the wake-up word, a silence class, and an other-pronunciation class.

In one example, after the voice sample data has been input into the encoder to obtain the output features, the sequence alignment labeling task may proceed as follows: the encoder output features of the voice sample are input into a one-layer fully connected network for the second classification, and the wake-up model is trained under supervision with a cross-entropy loss applied to the network output for each frame of the voice sample, the classes of the second classification being the phonemes contained in the wake-up word, a silence class, and an other-pronunciation class.
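The frame-level supervision described above can be sketched as follows. This is an illustrative sketch only: the label inventory `CLASSES` and the logits are hypothetical, and the logits are assumed to be the output of the one-layer fully connected network for each frame.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def frame_cross_entropy(frame_logits, frame_labels):
    """Mean cross-entropy over frames, with one class label per frame."""
    total = 0.0
    for logits, label in zip(frame_logits, frame_labels):
        total -= log_softmax(logits)[label]
    return total / len(frame_labels)

# Hypothetical inventory: silence, other pronunciation, two wake-word phonemes.
CLASSES = ["sil", "other", "n", "i"]
frame_logits = [
    [4.0, 0.0, 0.0, 0.0],  # frame 0 scores "sil" highest
    [0.0, 0.0, 4.0, 0.0],  # frame 1 scores "n" highest
    [0.0, 0.0, 0.0, 4.0],  # frame 2 scores "i" highest
]
frame_labels = [0, 2, 3]  # aligned reference label index for each frame

loss = frame_cross_entropy(frame_logits, frame_labels)
# Per-frame argmax gives the predicted alignment labels.
predicted = [CLASSES[max(range(len(l)), key=l.__getitem__)] for l in frame_logits]
```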

In the embodiment of the present invention, training the wake-up model composed of the encoder and the decoder on phoneme recognition based on the voice sample data includes: inputting the voice sample data into the encoder to obtain output features; and inputting the output features into the decoder for acoustic unit integration and a third classification, and performing supervised training using the labels related to the wake-up word in the voice sample data, wherein the classes of the third classification comprise: the phonemes contained in the wake-up word, a silence class, and an other-pronunciation class.

In one example, after the voice sample data has been input into the encoder to obtain the output features, the phoneme recognition task may proceed as follows: the encoder output features of the voice sample are input into the continuous integrate-and-fire (CIF) module to obtain integrated acoustic unit features; the acoustic unit features are input into the transformer decoder, which predicts the phoneme label corresponding to each acoustic unit feature; and the loss of the phoneme recognition task is a cross-entropy loss used for supervised training, the classes being the phonemes contained in the wake-up word, a silence class, and an other-pronunciation class.
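The acoustic unit integration step can be illustrated with a simplified sketch of the continuous integrate-and-fire idea, assuming a per-frame weight in [0, 1] has already been predicted for each encoder frame. The function name `cif`, the threshold of 1.0 and the toy values are assumptions; the transformer decoder that classifies each emitted unit is not shown.

```python
def cif(frames, alphas, threshold=1.0):
    """Integrate weighted frames and emit one acoustic-unit vector each time
    the accumulated weight reaches the threshold.
    Assumes every weight lies in [0, threshold], so a frame fires at most once."""
    dim = len(frames[0])
    acc_weight = 0.0
    acc_state = [0.0] * dim
    fired = []
    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # No unit boundary yet: keep integrating this frame.
            acc_weight += alpha
            acc_state = [s + alpha * x for s, x in zip(acc_state, frame)]
        else:
            # A boundary falls inside this frame: split its weight in two.
            used = threshold - acc_weight
            fired.append([s + used * x for s, x in zip(acc_state, frame)])
            rest = alpha - used
            acc_weight = rest
            acc_state = [rest * x for x in frame]
    return fired

# Four one-dimensional frames, each carrying weight 0.5 (illustrative values).
frames = [[1.0], [2.0], [3.0], [4.0]]
alphas = [0.5, 0.5, 0.5, 0.5]
units = cif(frames, alphas)  # two frames are integrated into each unit
```

With these weights every two frames are merged into one acoustic unit, so the four frames yield two units; weight left over at the end of the utterance is simply discarded in this sketch.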

Through the above embodiment, multi-task learning lets the wake-up model learn, at the same time, different kinds of information that are useful for wake-up, so that the information learned by each task complements that of the others, and this complementarity improves the accuracy of the voice wake-up model.
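The text does not state how the losses of the three tasks are combined during simultaneous training. One common choice, assumed here purely for illustration, is a weighted sum of the per-task losses; the task names, loss values and weights below are hypothetical.

```python
def multitask_loss(losses, weights=None):
    """Combine per-task losses into one scalar by a weighted sum.
    With no weights given, every task contributes equally (assumption)."""
    if weights is None:
        weights = {name: 1.0 for name in losses}
    return sum(weights[name] * value for name, value in losses.items())

# Hypothetical per-batch losses of the three tasks.
losses = {"wake_cls": 0.5, "align": 0.25, "phoneme": 1.0}

total = multitask_loss(losses)
weighted = multitask_loss(
    losses, {"wake_cls": 1.0, "align": 2.0, "phoneme": 0.5}
)
```

Because all three losses are computed from the same encoder output, a single backward pass on the combined scalar would update the shared encoder with gradients from every task.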

In one embodiment, the wake-up model is additionally trained with a self-distillation task in order to make the target voice wake-up model robust. Self-distillation training lets the target voice wake-up model adapt to noise, which improves its resistance to interference, its robustness, and hence its wake-up performance in noisy environments. In one example, the enhanced voice sample data may be obtained by applying speech enhancement processing to the training sample audio, where the enhancement methods include volume perturbation, pitch perturbation, noise addition, reverberation addition and the like.
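Two of the enhancement methods listed above, volume perturbation and noise addition, can be sketched as follows. This is only an illustration: the gain in decibels, the white Gaussian noise model, the target signal-to-noise ratio and the synthetic sine-wave input are all assumptions, and pitch perturbation and reverberation are omitted.

```python
import math
import random

def perturb_volume(samples, gain_db):
    """Scale the waveform by a gain given in decibels."""
    gain = 10.0 ** (gain_db / 20.0)
    return [gain * s for s in samples]

def add_noise(samples, snr_db, rng):
    """Add white Gaussian noise scaled to the requested signal-to-noise ratio."""
    signal_power = sum(s * s for s in samples) / len(samples)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [s + rng.gauss(0.0, sigma) for s in samples]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Synthetic stand-in for an original sample: 0.1 s of a 440 Hz tone at 16 kHz.
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]

louder = perturb_volume(clean, gain_db=6.0)
noisy = add_noise(clean, snr_db=20.0, rng=rng)
```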

In an embodiment of the present invention, the self-distillation training method may include: during the training of the wake-up model on voice wake-up classification and sequence alignment labeling, calculating the model output distributions corresponding to the original voice sample data and to the enhanced voice sample data respectively; calculating a KL divergence loss based on the model output distributions corresponding to the original voice sample data and the enhanced voice sample data; and training the wake-up model based on the KL divergence loss.
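A minimal sketch of the KL divergence loss between the two output distributions is given below. The text does not specify the direction of the divergence or whether either branch is held fixed, so the choice of KL(original || enhanced) and the logit values are assumptions for illustration.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical classification logits for one utterance and its enhanced copy.
original_logits = [2.0, 0.5, -1.0]
enhanced_logits = [1.5, 0.8, -0.5]

p_original = softmax(original_logits)
p_enhanced = softmax(enhanced_logits)
kl_loss = kl_divergence(p_original, p_enhanced)
```

For the sequence alignment labeling task the same quantity could be computed per frame and averaged over frames; minimizing it pushes the model to give matching outputs for an utterance and its enhanced version.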

Fig. 2 is a flow chart of a voice wake-up method according to an exemplary embodiment. As shown in fig. 2, the voice wake-up method includes the following steps.

In step S201, a target voice is input into a target voice wake-up model trained by the voice wake-up model training method of any of the above embodiments, and voice wake-up classification is performed to obtain a first wake-up word.

In step S202, the target voice is input into the target voice wake-up model trained by the voice wake-up model training method of any of the above embodiments, and sequence alignment labeling is performed.

In step S203, it is checked whether the sequence alignment labeling result contains the first wake-up word.

In step S204, when the sequence alignment labeling result contains the first wake-up word, the encoder output features of the target voice wake-up model that correspond to the position of the wake-up word in the sequence alignment labeling result are input into the decoder for phoneme recognition.

In step S205, it is checked whether the phonemes contained in the wake-up word, the silence class and the other-pronunciation class in the phoneme recognition result are consistent with the first wake-up word.

In step S206, when the phonemes contained in the wake-up word, the silence class and the other-pronunciation class in the phoneme recognition result are consistent with the first wake-up word, the target object is woken up based on the first wake-up word.

In the embodiment of the present invention, the wake-up of the target object is rejected when the voice wake-up classification result is the non-wake-up word class, or when the sequence alignment labeling result does not contain the first wake-up word, or when the phonemes contained in the wake-up word, the silence class and the other-pronunciation class in the phoneme recognition result are inconsistent with the first wake-up word.
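The accept and reject logic of steps S201 to S206 can be summarized in a short control-flow sketch. The three callables `classify`, `align` and `recognize` are hypothetical stand-ins for the corresponding outputs of the trained model, and the wake-up word `"hello_car"` is an invented example.

```python
def three_stage_wakeup(classify, align, recognize, speech):
    """Return the wake-up word to act on, or None to reject the wake-up."""
    # S201: utterance-level classification; None means the non-wake-up class.
    candidate = classify(speech)
    if candidate is None:
        return None
    # S202-S203: sequence alignment labeling; a (start, end) span if the
    # labeling result contains the candidate wake-up word, otherwise None.
    span = align(speech, candidate)
    if span is None:
        return None
    # S204-S205: phoneme recognition on the encoder features of that span.
    if not recognize(speech, span, candidate):
        return None
    # S206: all three checks agree, so the wake-up is performed.
    return candidate

# Toy stand-ins for the model outputs (hypothetical).
speech = [0.0] * 160
accepted = three_stage_wakeup(
    classify=lambda s: "hello_car",
    align=lambda s, w: (10, 42),
    recognize=lambda s, span, w: True,
    speech=speech,
)
rejected = three_stage_wakeup(
    classify=lambda s: "hello_car",
    align=lambda s, w: None,  # the alignment stage finds no wake-up word
    recognize=lambda s, span, w: True,
    speech=speech,
)
```

Each later stage runs only when the previous one passes, so the more expensive decoding steps are skipped for most non-wake-up audio.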

In one example, the voice wake-up method can be divided into three phases, including:

Stage one: fast rejection is performed using the target voice wake-up model trained by the above voice wake-up model training method. This stage is fast and efficient and involves no decoding or post-processing; it has a high wake-up rate but also a high false wake-up rate. Specifically, the wake-up class of the input speech is obtained through the voice wake-up classification task; if the class is one of the wake-up words, the corresponding wake-up processing proceeds; otherwise, the input is rejected.

Stage two: sequence alignment labeling is performed using the target voice wake-up model trained by the above voice wake-up model training method, and the result is decoded with a beam search algorithm to obtain the phoneme sequence corresponding to the speech sequence. If the phoneme sequence contains the wake-up word, the start and end positions of the wake-up word pronunciation are recorded and processing continues to stage three; otherwise, the input is rejected.

Stage three: the encoder output features corresponding to the wake-up word position found in stage two are input into the phoneme recognition task module and decoded with a beam search algorithm to obtain the final, accurate phoneme sequence for that position. If the resulting phoneme sequence contains the wake-up word, the wake-up is finally performed; otherwise, the input is rejected.
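The stage-two step of locating the wake-up word and recording its start and end positions can be sketched as follows. For brevity this sketch replaces beam search with a plain collapse of per-frame labels, which is a simplification and not the decoding described above; the frame labels and the phoneme spelling of the wake-up word are hypothetical.

```python
def collapse(frame_labels, skip=("sil",)):
    """Merge runs of identical frame labels into (label, start, end) segments,
    dropping labels listed in `skip`; `end` is exclusive."""
    segments = []
    start = 0
    for i in range(1, len(frame_labels) + 1):
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            if frame_labels[start] not in skip:
                segments.append((frame_labels[start], start, i))
            start = i
    return segments

def find_wake_word(segments, wake_phonemes):
    """Return (start_frame, end_frame) of the first match, or None."""
    labels = [seg[0] for seg in segments]
    n = len(wake_phonemes)
    for k in range(len(labels) - n + 1):
        if labels[k:k + n] == list(wake_phonemes):
            return segments[k][1], segments[k + n - 1][2]
    return None

# Hypothetical per-frame labels from the sequence alignment labeling output.
frame_labels = ["sil", "sil", "other", "n", "n", "i", "h", "h", "ao", "sil"]
segments = collapse(frame_labels)
span = find_wake_word(segments, ["n", "i", "h", "ao"])
```

The returned frame span is what stage three would use to select the encoder output features passed to the decoder. Note that collapsing runs cannot represent a wake-up word containing the same phoneme twice in a row, which is one reason a real system would use a proper decoder.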

Through the above embodiment, a voice wake-up model trained jointly on different tasks is used, and a multi-stage wake-up flow is carried out that combines the respective advantages of those tasks. False wake-ups can thus be controlled at a minimal cost in false rejections, the false wake-up rate is greatly reduced, a balance between false rejection and false wake-up is achieved, and the user experience is further improved.

Based on the same inventive concept, the invention also provides a voice awakening model training device.

Fig. 3 is a block diagram of a voice wakeup model training apparatus according to an exemplary embodiment. As shown in fig. 3, the apparatus for training a voice wakeup model includes an obtaining unit 301 and a training unit 302.

The obtaining unit 301 is configured to obtain voice sample data, wherein the voice sample data carries labels related to a wake-up word.

The training unit 302 is configured to simultaneously perform, based on the voice sample data, multi-task learning of voice wake-up classification, sequence alignment labeling and phoneme recognition on a wake-up model composed of an encoder and a decoder, to obtain a target voice wake-up model.

In one embodiment, the training unit 302 includes: a first input unit, configured to input the voice sample data into the encoder to obtain output features; and a first training unit, configured to average the output features over the time dimension, input the averaged output features into a one-layer fully connected network for a first classification, and perform supervised training using the labels related to the wake-up word in the voice sample data, wherein the classes of the first classification comprise: all wake-up words, and one class representing non-wake-up words.

In another embodiment, the training unit 302 includes: a second input unit, configured to input the voice sample data into the encoder to obtain output features; and a second training unit, configured to input the output features into a single-layer fully connected network for a second classification and perform supervised training using the labels related to the wake-up word in the voice sample data, wherein the categories of the second classification comprise: the phonemes contained in the wake-up word, a silence class, and an other-pronunciation class.
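The frame-level second classification can be illustrated with a small sketch. It assumes one label per frame and a cross-entropy loss averaged over frames; the loss choice and the label inventory in `LABELS` are assumptions made for illustration, since the passage only states that the training is supervised.

```python
import math
from typing import List, Sequence

# Hypothetical label inventory: silence, other pronunciation, wake-word phonemes.
LABELS = ["sil", "other", "n", "i"]

def log_softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def frame_cross_entropy(frame_logits: Sequence[Sequence[float]],
                        frame_targets: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target label at every frame."""
    if len(frame_logits) != len(frame_targets):
        raise ValueError("need exactly one target label per frame")
    total = 0.0
    for logits, target in zip(frame_logits, frame_targets):
        total -= log_softmax(logits)[target]
    return total / len(frame_targets)

# Uninformative logits give a loss of ln(number of classes).
uniform_loss = frame_cross_entropy([[0.0, 0.0, 0.0, 0.0]], [LABELS.index("n")])
# A confident, correct prediction gives a loss close to zero.
confident_loss = frame_cross_entropy([[0.0, 0.0, 10.0, 0.0]], [LABELS.index("n")])
```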

In yet another embodiment, the training unit 302 includes: a third input unit, configured to input the voice sample data into the encoder to obtain output features; and a third training unit, configured to input the output features into the decoder for acoustic unit integration and a third classification, and perform supervised training using the labels related to the wake-up word in the voice sample data, wherein the categories of the third classification comprise: the phonemes contained in the wake-up word, a silence class, and an other-pronunciation class.

In yet another embodiment, the training unit 302 further includes: a first calculation unit, configured to calculate, during the learning training of voice wake-up classification and sequence alignment labeling on the wake-up model, the model output distributions corresponding to the original voice sample data and to the enhanced voice sample data respectively; a second calculation unit, configured to calculate a KL divergence loss based on the model output distributions corresponding to the original voice sample data and the enhanced voice sample data; and a fourth training unit, configured to train the wake-up model based on the KL divergence loss.
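A minimal sketch of the KL divergence loss between the two output distributions is given below. The direction KL(original || enhanced), the averaging over items, and the smoothing constant `eps` are assumptions; the passage does not fix these details.

```python
import math
from typing import Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def consistency_loss(orig_dists: Sequence[Sequence[float]],
                     enhanced_dists: Sequence[Sequence[float]]) -> float:
    """Average KL divergence between model outputs on original and enhanced audio.

    Each entry is one output distribution (one per utterance for the wake-up
    classification task, or one per frame for the alignment task).
    """
    pairs = list(zip(orig_dists, enhanced_dists))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)

same = consistency_loss([[0.5, 0.5]], [[0.5, 0.5]])       # identical outputs
different = consistency_loss([[0.5, 0.5]], [[0.9, 0.1]])  # diverging outputs
```

Minimizing this term pushes the model to produce the same output distribution for a sample and its enhanced version.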

For the specific limitations and beneficial effects of the above voice wake-up model training apparatus, refer to the limitations of the voice wake-up model training method above; they are not repeated here. Each of the above modules may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them and execute the corresponding operations.

Fig. 4 is a block diagram of a voice wake-up apparatus according to an exemplary embodiment. As shown in fig. 4, the voice wake-up apparatus includes a voice wake-up classification unit 401, a sequence alignment labeling unit 402, a first comparison unit 403, a phoneme recognition unit 404, a second comparison unit 405, and a wake-up unit 406.

a voice wake-up classification unit 401, configured to input a target voice into a target voice wake-up model trained by any one of the above voice wake-up model training methods and perform voice wake-up classification with that model to obtain a first wake-up word;

a sequence alignment labeling unit 402, configured to input the target voice into the target voice wake-up model trained by any one of the above voice wake-up model training methods to perform sequence alignment labeling;

a first comparing unit 403, configured to compare whether the sequence alignment annotation result includes a first wakeup word;

a phoneme recognition unit 404, configured to, when the sequence alignment tagging result includes the first wake-up word, input an output feature of the target voice wake-up model encoder corresponding to the location of the wake-up word in the sequence alignment tagging result into a decoder for phoneme recognition;
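How the wake-up word position might be read off a sequence alignment labeling result, and how the matching encoder features could be selected for the decoder, can be sketched as follows. The per-frame label format, the `"sil"` label name, and the helper names are illustrative assumptions.

```python
from typing import List, Optional, Sequence, Tuple

def collapse_frames(frame_labels: Sequence[str]) -> List[Tuple[str, int, int]]:
    """Merge runs of identical frame labels into (label, start, end) segments."""
    segments: List[Tuple[str, int, int]] = []
    for idx, label in enumerate(frame_labels):
        if segments and segments[-1][0] == label:
            segments[-1] = (label, segments[-1][1], idx + 1)
        else:
            segments.append((label, idx, idx + 1))
    return segments

def find_wake_span(frame_labels: Sequence[str],
                   wake_phonemes: Sequence[str],
                   ignore: Sequence[str] = ("sil",)) -> Optional[Tuple[int, int]]:
    """Return the [start, end) frame span of the wake-up word, or None if absent."""
    n = len(wake_phonemes)
    if n == 0:
        return None
    segs = [s for s in collapse_frames(frame_labels) if s[0] not in ignore]
    for i in range(len(segs) - n + 1):
        if [s[0] for s in segs[i:i + n]] == list(wake_phonemes):
            return segs[i][1], segs[i + n - 1][2]
    return None

# Toy alignment: silence, then the phonemes of a hypothetical wake-up word.
labels = ["sil", "sil", "n", "n", "i", "i", "i", "h", "a", "o", "sil"]
span = find_wake_span(labels, ["n", "i", "h", "a", "o"])

# Select only the encoder features at the wake-up word position for the decoder.
encoder_out = [[float(t)] for t in range(len(labels))]  # placeholder features
wake_features = encoder_out[span[0]:span[1]] if span else []
```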

a second comparing unit 405, configured to compare whether the phoneme recognition result, expressed in terms of the phonemes contained in the wake-up word, the silence class, and the other-pronunciation class, is consistent with the first wake-up word;

and a wake-up unit 406, configured to perform a wake-up operation on the target object based on the first wake-up word when the phoneme recognition result is consistent with the first wake-up word.

In an embodiment, the apparatus further comprises: a rejection unit, configured to refuse to wake up the target object when the voice wake-up classification result is the non-wake-up-word category, or when the sequence alignment labeling result does not contain the first wake-up word, or when the phoneme recognition result is inconsistent with the first wake-up word.
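The interplay of units 401 to 406 and the rejection unit can be summarized as a short-circuiting cascade. The sketch below assumes each stage is supplied as a callable; `NON_WAKE` and the function names are hypothetical placeholders for the model-backed stages.

```python
from typing import Callable, Optional

NON_WAKE = "<non_wake>"  # hypothetical label for the non-wake-up-word category

def three_stage_wake(classify: Callable[[], str],
                     alignment_contains: Callable[[str], bool],
                     phonemes_match: Callable[[str], bool]) -> Optional[str]:
    """Return the wake-up word if all three stages agree, otherwise None (reject)."""
    word = classify()                 # stage 1: voice wake-up classification
    if word == NON_WAKE:
        return None
    if not alignment_contains(word):  # stage 2: sequence alignment labeling
        return None
    if not phonemes_match(word):      # stage 3: phoneme recognition via the decoder
        return None
    return word

# Toy usage: stage outcomes are hard-coded for illustration.
accepted = three_stage_wake(lambda: "hello", lambda w: True, lambda w: True)
rejected = three_stage_wake(lambda: "hello", lambda w: True, lambda w: False)
```

Because later stages run only when earlier ones pass, the more expensive decoder is invoked only for candidates that survive the first two checks.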

For the specific limitations and beneficial effects of the voice wake-up apparatus, refer to the limitations of the voice wake-up method above; they are not repeated here. Each of the above modules may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can call them and execute the corresponding operations.

Fig. 5 is a schematic diagram of a hardware structure of a computer device according to an exemplary embodiment. As shown in fig. 5, the device includes one or more processors 510 and a memory 520, where the memory 520 includes a persistent memory, a volatile memory, and a hard disk; one processor 510 is taken as an example in fig. 5. The device may further include an input device 530 and an output device 540.

The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or other means, and the bus connection is exemplified in fig. 5.

The processor 510 may be a Central Processing Unit (CPU). The processor 510 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or any combination thereof. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 520, as a non-transitory computer-readable storage medium, includes a persistent memory, a volatile memory, and a hard disk, and can be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the voice wake-up model training method and the voice wake-up method in the embodiments of the present application. The processor 510 runs the non-transitory software programs, instructions, and modules stored in the memory 520 to execute the various functional applications and data processing of the server, thereby implementing any one of the above voice wake-up model training methods and voice wake-up methods.

The memory 520 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data used as needed or desired, and the like. Further, the memory 520 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, memory 520 may optionally include memory located remotely from processor 510, which may be connected to a data processing device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The input device 530 may receive input numeric or character information and generate key signal inputs related to user settings and function control. The output device 540 may include a display device such as a display screen.

One or more modules are stored in the memory 520, and when executed by the one or more processors 510, perform the methods shown in fig. 1-2.

The product can execute the method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. Details of the technique that are not described in detail in the present embodiment may be specifically referred to the related description in the embodiments shown in fig. 1 to fig. 2.

Embodiments of the present invention further provide a non-transitory computer storage medium storing computer-executable instructions, and the computer-executable instructions can execute the voice wake-up model training method or the voice wake-up method in any of the above method embodiments. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a flash memory, a Hard Disk Drive (HDD), a Solid State Drive (SSD), or the like; the storage medium may also comprise a combination of memories of the above kinds.

It should be understood that the above embodiments are given only for clarity of illustration and are not intended to limit the implementations. Other variations and modifications in different forms will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to enumerate all embodiments here. Obvious variations or modifications derived therefrom remain within the scope of protection of the invention.

Claims

1. A method for training a voice wakeup model, the method comprising:

and synchronously performing multi-task learning training of voice awakening classification, sequence alignment labeling and phoneme recognition on an awakening model formed by an encoder and a decoder based on the voice sample data to obtain a target voice awakening model.

2. The method of claim 1, wherein the learning training of the voice wake classification for the wake model composed of an encoder and a decoder based on the voice sample data comprises:

3. The method of claim 1, wherein the training of learning sequence alignment labeling for a wake-up model composed of an encoder and a decoder based on the voice sample data comprises:

4. The method of claim 1, wherein the learning training for phoneme recognition of a wake-up model comprised of an encoder and a decoder based on the speech sample data comprises:

5. The method of claim 1, wherein the voice sample data comprises: original voice sample data and enhanced voice sample data, wherein the enhanced voice sample data is obtained by performing voice enhancement processing on the original voice sample data, and the method further comprises the following steps:

training the wake-up model based on the KL divergence loss.

6. A voice wake-up method, comprising:

inputting a target voice into a target voice wake-up model obtained by training with the voice wake-up model training method according to any one of claims 1-5 to perform voice wake-up classification, so as to obtain a first wake-up word;

inputting the target voice into the target voice wake-up model obtained by training with the voice wake-up model training method according to any one of claims 1-5 to perform sequence alignment labeling;

comparing whether the sequence alignment labeling result contains the first awakening word or not;

7. The method of claim 6,

and when the voice awakening classification result is in a non-awakening word type, or when the sequence alignment labeling result does not contain the first awakening word, or when the phoneme, the mute type and other pronunciation types contained in the awakening word in the phoneme recognition result are inconsistent with the first awakening word, refusing to awaken the target object.

8. An apparatus for voice wake-up model training, the apparatus comprising:

and the training unit is used for synchronously carrying out multi-task learning training of voice awakening classification, sequence alignment marking and phoneme recognition on an awakening model formed by an encoder and a decoder based on the voice sample data to obtain a target voice awakening model.

9. A voice wake-up apparatus, the apparatus comprising:

a voice wake-up classification unit, configured to input a target voice into a target voice wake-up model obtained by training with the voice wake-up model training method according to any one of claims 1 to 5 to perform voice wake-up classification, so as to obtain a first wake-up word;

a sequence alignment labeling unit, which is used for inputting the target voice into the target voice awakening model obtained by training by adopting the voice awakening model training method according to any one of claims 1-5 to carry out sequence alignment labeling;

the first comparison unit is used for comparing whether phonemes, silence classes and other pronunciation classes contained in the awakening words in the sequence alignment labeling result are consistent with the first awakening words or not;

the phoneme recognition unit is used for inputting the output characteristics of the target voice awakening model coder corresponding to the position of the awakening word in the sequence alignment tagging result into a decoder for phoneme recognition when the phoneme, the mute class and other pronunciation classes contained in the awakening word in the sequence alignment tagging result are consistent with the first awakening word;

and the awakening unit is used for awakening the target object based on the first awakening word when the phoneme, the silence class and other pronunciation classes contained in the awakening word in the phoneme recognition result are consistent with the first awakening word.

10. A computer device comprising a memory and a processor, the memory and the processor being communicatively coupled to each other, the memory storing therein computer instructions, and the processor executing the computer instructions to perform the voice wakeup model training method of any one of claims 1 to 5 or to perform the voice wakeup method of any one of claims 6 to 7.