CN116884407A - Lightweight personalized voice awakening method, device and equipment - Google Patents
Lightweight personalized voice awakening method, device and equipment
- Publication number
- CN116884407A (application number CN202310969471.2A)
- Authority
- CN
- China
- Prior art keywords
- wake
- acoustic
- voiceprint
- model
- word
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
Abstract
The invention discloses a lightweight personalized voice wake-up method, device and equipment. The method comprises the following steps: training a general wake-up word acoustic model and a speaker voiceprint recognition model, constructing a multi-teacher network structure, and performing multi-teacher knowledge distillation to generate a student model. Based on the student model, the acoustic features of the registered wake-up word and its speaker voiceprint features in the registration audio are obtained, and a wake-up word template and a speaker voiceprint template are generated. The acoustic features of the wake-up word under test and its speaker voiceprint features in the current test audio are then obtained, the acoustic features of the wake-up word under test are scored against the wake-up word template, and whether to wake up is determined from this score together with the voiceprint features of the wake-up word under test. The invention uses a single student model to complete both user-defined wake-up word detection and speaker voiceprint identity verification; the model has relatively few parameters and, under the joint guidance of the acoustic and voiceprint teacher models, achieves higher accuracy.
Description
Technical Field
The invention belongs to the technical field of voice processing, and particularly relates to a lightweight personalized voice awakening method, device and equipment.
Background
With the continuous maturation and application of intelligent voice technology, more and more intelligent devices support voice interaction, such as Xiaomi smart speakers, Huawei smartphones and Baidu robots. Voice wake-up serves as the entrance to voice interaction: the voice interaction module on a device is activated by continuously monitoring the input voice signal and detecting a target wake-up word contained in it. Most current intelligent devices use predefined wake-up words, such as "Xiao Ai Tongxue", "Tmall Genie" and "Xiao Du". To make wake-up voices more distinguishable, these predefined wake-up words are often unfamiliar to users, and memorizing a large number of unfamiliar wake-up words is a considerable burden. If the predefined wake-up words are replaced with wake-up words registered by the user, a large amount of registered wake-up word material generally has to be collected and the model retrained, which causes huge time and cost overhead. At present, although a few intelligent devices support a user-defined wake-up word function, the wake-up word is registered in text form, and when the user's pronunciation does not match the standard pronunciation implied by the text, the wake-up effect is easily degraded.
In addition to the above problems, current voice wake-up techniques essentially do not consider speaker voiceprint recognition. In particular, when an utterance from a stranger contains a predefined wake-up word, or is similar to it in pronunciation, the target device can easily be woken up by mistake, which greatly reduces the security of the intelligent device. To improve the user's voice interaction experience, the user-voice-oriented mode of "custom wake-up word detection + speaker voiceprint recognition" has become a new trend in the development of voice wake-up technology. In this mode, the user is allowed to change the wake-up word according to his or her own pronunciation habits, and speaker voiceprint recognition is supported for the wake-up word provided by the user, which greatly improves the usability and security of the intelligent device. However, advanced custom wake-up word detection and speaker voiceprint recognition techniques usually adopt deeper or more complex network structures to obtain rich and comprehensive acoustic and voiceprint features. If the models involved in the two techniques are simply cascaded, the combined model can contain millions or even tens of millions of parameters; such an oversized or overly complex model occupies a large amount of memory during wake-up, and its heavy computation at run time produces high execution latency, making the voice wake-up system difficult to deploy on lightweight devices with low power consumption and limited computing resources, such as IoT terminals.
Disclosure of Invention
The invention provides a lightweight personalized customized voice wake-up method, device and equipment, which can automatically detect a user-defined wake-up word during voice wake-up while recognizing the speaker voiceprint information carried by the wake-up word, thereby meeting the requirements of low memory occupation and low computation and achieving high accuracy.
In order to achieve the above purpose, the invention is realized by adopting the following technical scheme:
a lightweight personalized voice wake-up method comprises the following steps:
training and obtaining a general wake-up word acoustic model; training and obtaining a text-independent speaker voiceprint recognition model;
based on the general wake-up word acoustic model and the text-independent speaker voiceprint recognition model, constructing a multi-teacher network structure and performing multi-teacher knowledge distillation to generate a student model that fuses acoustic and voiceprint features;
according to the student model, acquiring the acoustic features of the registered wake-up word and its speaker voiceprint features in the registrant audio, and generating a wake-up word template and a speaker voiceprint template; according to the student model, acquiring the acoustic features of the wake-up word under test and its speaker voiceprint features in the current tester audio;
scoring the acoustic features of the wake-up word under test in the current tester audio based on the wake-up word template, and obtaining the scoring result of the wake-up word under test;
judging whether the scoring result of the wake-up word under test exceeds a predefined wake-up word threshold:
if not, the wake-up fails;
if yes, scoring the speaker voiceprint features of the wake-up word under test in the current tester audio based on the speaker voiceprint template, and obtaining the scoring result of the current speaker;
judging whether the scoring result of the current speaker exceeds a predefined speaker threshold:
if yes, the wake-up succeeds;
if not, the wake-up fails.
Further, training and obtaining a general wake-up word acoustic model is achieved through the following steps:
collecting an acoustic corpus training set, and acquiring corresponding acoustic audio sequences and transcribed text label sequence information thereof;
and constructing a convolutional recurrent neural network, training it on the acoustic audio sequences and their transcribed text label sequence information, and generating the general wake-up word acoustic model.
Further, training and obtaining the text-independent speaker voiceprint recognition model is achieved through the following steps:
collecting a voiceprint corpus training set, and obtaining a corresponding voiceprint audio sequence and speaker tag information thereof;
and constructing a deep residual network, training it on the voiceprint audio sequences and their speaker label information, and generating the text-independent speaker voiceprint recognition model.
Further, the specific steps of constructing a multi-teacher network structure and distilling the multi-teacher knowledge to generate a student model integrating acoustic and voiceprint features are as follows:
constructing a multi-teacher network structure, taking a general wake-up word acoustic model as an acoustic teacher model, taking a speaker voiceprint recognition model irrelevant to texts as a voiceprint teacher model, taking a transcription text label with wake-up words and audio data of the speaker label as input of the acoustic teacher model and the voiceprint teacher model, outputting a wake-up word acoustic feature prediction result by the acoustic teacher model, and outputting a speaker voiceprint feature prediction result by the voiceprint teacher model;
constructing a student model, wherein the student model comprises a body part, an acoustic head and a voiceprint head, the acoustic head and the voiceprint head share parameters of the body part, the acoustic head adopts a feedforward neural network, and the voiceprint head adopts a shallow depth residual network;
performing knowledge distillation on the acoustic teacher model so that the acoustic head of the student model imitates the acoustic feature distribution of the acoustic teacher model, taking the speech posterior probability sequence output by the output layer of the acoustic teacher model as knowledge, and optimizing the acoustic head of the student model to minimize the loss function $L_{AH}$, where $L_{CTC}$ denotes the CTC loss function representing the difference between the student model acoustic head prediction and the true transcribed text label sequence, and $L_{KL}$ denotes the KL divergence loss function between the student model acoustic head prediction and the acoustic teacher model prediction;
performing knowledge distillation on the voiceprint teacher model so that the voiceprint head of the student model imitates the voiceprint feature representation of the voiceprint teacher model, taking the embedded feature output by the embedding layer of the voiceprint teacher model as knowledge, and optimizing the voiceprint head of the student model to minimize the loss function $L_{VH}$, where $L_{CE}$ denotes the cross-entropy loss function between the student model voiceprint head prediction and the true speaker label, and $L_{COS}$ denotes the cosine similarity loss function between the embedded feature output by the student model voiceprint head embedding layer and the embedded feature output by the voiceprint teacher model;
based on the acoustic teacher knowledge distillation and the voiceprint teacher knowledge distillation, the multi-teacher knowledge distillation optimization objective of the student model is to minimize the loss function $L_{MT}$, a weighted combination of $L_{AH}$ and $L_{VH}$;
After multi-teacher knowledge is distilled, a student model fused with acoustic and voiceprint features is output, a voice posterior probability sequence output by an acoustic head output layer of the student model is wake-up word acoustic features, and an embedding feature output by an embedding layer of the voiceprint head of the student model is speaker voiceprint features.
Further, the specific steps of obtaining the acoustic features of the registered wake-up words and the voice print features of the speaker in the registrant audio and generating the wake-up word template and the voice print template of the speaker are as follows:
collecting registrant audio;
extracting a voice portion from the registrant audio by a voice endpoint detection technique;
inputting the extracted voice part into the student model, and outputting acoustic features of registered wake-up words and voice print features of a speaker thereof;
when the registrant provides only one piece of registration audio, the acoustic features of the registered wake-up word are used as the wake-up word template, and its speaker voiceprint features are used as the speaker voiceprint template;
when the registrant provides two or more pieces of registration audio, the acoustic features of the wake-up word obtained from each piece of registration audio are aligned at the frame level using a dynamic time warping algorithm, the mean is computed frame by frame after alignment and the frame-level mean is used as the wake-up word template, and the mean of the speaker voiceprint features of all registration audios is computed and used as the speaker voiceprint template.
Further, the step of obtaining the acoustic feature of the wake-up word to be tested and the voiceprint feature of the speaker thereof in the current tester audio comprises the following steps:
collecting a tester audio;
extracting a voice portion from the tester audio by a voice endpoint detection technique;
and inputting the extracted voice part into the student model, and outputting the acoustic characteristics of the current wake-up word to be tested and the voice print characteristics of the speaker thereof.
Further, scoring the acoustic features of the wake-up word under test in the current tester audio based on the wake-up word template is realized by using a dynamic time warping algorithm to calculate the similarity between the speech posterior probabilities in the wake-up word template and the speech posterior probabilities of the wake-up word under test.
Further, scoring the speaker voiceprint features of the wake-up word under test in the current tester audio based on the speaker voiceprint template is realized by calculating the cosine similarity between the speaker voiceprint features in the speaker voiceprint template and the speaker voiceprint features of the wake-up word under test.
A lightweight, personalized, customized voice wake-up device, comprising:
the server side comprises a collecting module, a training module and a model storage module;
the collecting module is responsible for obtaining a general wake-up word acoustic model and a speaker voiceprint model irrelevant to texts, the training module carries out multi-teacher knowledge distillation by taking the acoustic model and the voiceprint model in the collecting module as a teacher model, generates a student model fused with acoustic and voiceprint characteristics, and the model storage module is used for storing the trained student model;
the user terminal comprises a registration module, an updating module, a template generation module, a testing module and a verification module;
the registration module is used for acquiring acoustic characteristics of a registered wake-up word in the registrant audio and voice print characteristics of a speaker; the updating module is responsible for updating the student model in the remote server-side model storage module to the local user terminal; the template generation module is responsible for generating a wake-up word template from acoustic features of the registered wake-up words and generating a speaker voiceprint template from voiceprint features of the registered wake-up words; the test module is used for acquiring acoustic characteristics of the detected wake-up word and voice print characteristics of a speaker in the audio of the tester, and the verification module is responsible for judging whether the acoustic characteristics and the voice print characteristics of the detected wake-up word are consistent with the wake-up word template and the voice print template of the speaker generated in the template generation module and feeding back a wake-up result.
A computer device comprising a memory and a processor electrically connected, said memory having stored thereon a computer program executable on the processor, said processor, when executing said computer program, performing the steps of the method described above.
Compared with the prior art, the invention has at least the following beneficial technical effects. On the one hand, the invention fuses acoustic and voiceprint features, allows the user to define wake-up words according to his or her own pronunciation habits, and can recognize and verify the speaker's voiceprint identity while detecting the user-defined wake-up word. The personalized requirement of the user for the wake-up word is met, the target device is not easily woken up by mistake by strangers, and high security is ensured. On the other hand, in the invention a single student model completes both user-defined wake-up word detection and speaker voiceprint identity verification, replacing the two mutually independent acoustic and voiceprint teacher models. Compared with the two teacher models, the student model adopts a structure with a shared body and separate heads, has relatively few parameters, low memory occupation, a small amount of computation and faster inference, and is therefore better suited to deployment in environments with low power consumption and limited resources. In addition, by using the multi-teacher knowledge distillation technique, the student model can simultaneously imitate the acoustic feature distribution of the acoustic teacher model and the voiceprint feature representation of the voiceprint teacher model, can quickly learn more useful acoustic and voiceprint knowledge at lower cost, and exhibits higher accuracy.
Drawings
FIG. 1 is a flow chart of a lightweight personalized voice wake-up method;
FIG. 2 is a diagram of a multi-teacher network architecture;
FIG. 3 is a flow chart of a multi-teacher single student knowledge distillation;
fig. 4 is a schematic structural diagram of a lightweight personalized voice wake-up device.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it should be understood that the specific embodiments described herein are only for explaining the present invention and are not limiting the present invention.
Example 1
The invention provides a lightweight personalized customized voice awakening method, which is shown in figures 1, 2 and 3 and comprises the following steps:
and step 1, training and obtaining a general wake-up word acoustic model. Firstly, an acoustic corpus training set with transcription text labels is collected, corresponding acoustic audio sequences and transcription text label sequence information thereof are obtained, and filter-group-based features FBank (hereinafter referred to as FBank features) in the acoustic audio sequences are extracted through an input layer. The extracted FBank features are then trained using a convolutional recurrent neural network and CTC loss function, the convolutional recurrent neural network comprising one convolutional layer, three recurrent layers, and one output layer. The method comprises the steps that a convolutional layer carries out feature extraction on an output result of an input layer by using CNN, a feature map is output as input of the convolutional layer, three convolutional layers are sequentially connected according to a sequence, unidirectional RNN is used for predicting feature sequences obtained from a previous layer, each feature vector in the feature sequences is learned, a high-dimensional feature sequence is output as input of a next layer, an output layer carries out voice tag classification prediction on the high-dimensional feature sequences obtained from a third convolutional layer by adopting full connection, a softmax function is used for normalizing the prediction result, a voice posterior probability sequence is output, voice tags in the voice posterior probability sequence can be represented by voice units such as phonemes, words or syllables, and a CTC loss function is responsible for automatically aligning the voice posterior probability sequence with a true transcribed text tag sequence. After multiple training, a general wake-up word acoustic model can be obtained.
Step 2, training and obtaining a text-independent speaker voiceprint recognition model. First, a voiceprint corpus training set with speaker labels is collected, the corresponding voiceprint audio sequences and their speaker label information are obtained, and FBank features are extracted from the voiceprint audio sequences by an input layer. The extracted FBank features are then trained using a deep residual network and a CE (cross-entropy) loss function; the deep residual network comprises a speaker feature extractor, a pooling layer, an embedding layer and an output layer. The speaker feature extractor uses the convolutional layers and residual blocks of a ResNet34 deep residual network to extract speaker features from the output of the input layer; the pooling layer collects the mean and standard deviation statistics of the extracted speaker features; the embedding layer pools the speaker information through a fully connected layer and outputs the embedded feature r-vector; the output layer performs speaker classification prediction on the embedded feature r-vector through a fully connected layer, uses an AAM-softmax function to improve the separation between different speakers in the prediction, and outputs the optimized speaker posterior probabilities; the CE loss function is responsible for minimizing the gap between the speaker posterior probabilities and the true speaker labels. After multiple rounds of training, the text-independent speaker voiceprint recognition model is output.
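A minimal sketch of the pooling, embedding and AAM-softmax stages of the voiceprint teacher described in step 2 is shown below; the ResNet34 trunk is represented by a placeholder module passed in by the caller, and all dimensions, the margin and the scale are illustrative assumptions.

```python
# Illustrative sketch only: statistics pooling, r-vector embedding and an
# AAM-softmax speaker classifier. The ResNet34 trunk is a placeholder argument.
import torch
import torch.nn as nn
import torch.nn.functional as F

class StatsPooling(nn.Module):
    """Collects the per-utterance mean and standard deviation of frame-level features."""
    def forward(self, x):                      # x: (batch, channels, time)
        return torch.cat([x.mean(dim=-1), x.std(dim=-1)], dim=-1)

class AAMSoftmaxHead(nn.Module):
    """Additive angular margin classifier used to sharpen speaker separation."""
    def __init__(self, emb_dim, num_speakers, margin=0.2, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, emb_dim))
        self.margin, self.scale = margin, scale

    def forward(self, emb, labels):
        cosine = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        target = F.one_hot(labels, cosine.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.margin), cosine)
        return self.scale * logits             # fed to the CE loss

class VoiceprintTeacher(nn.Module):
    def __init__(self, trunk, trunk_dim=256, emb_dim=192, num_speakers=1000):
        super().__init__()
        self.trunk = trunk                     # e.g. a ResNet34 frame-level extractor
        self.pool = StatsPooling()
        self.embed = nn.Linear(2 * trunk_dim, emb_dim)   # r-vector layer
        self.classifier = AAMSoftmaxHead(emb_dim, num_speakers)

    def forward(self, fbank, labels=None):
        frames = self.trunk(fbank)             # assumed shape (batch, trunk_dim, time)
        r_vector = self.embed(self.pool(frames))
        if labels is None:
            return r_vector                    # enrollment / test path
        return r_vector, self.classifier(r_vector, labels)  # training path
```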
Step 3, constructing a multi-teacher network structure, as shown in FIG. 2. First, the general wake-up word acoustic model of step 1 is taken as the acoustic teacher model, and the text-independent speaker voiceprint recognition model of step 2 is taken as the voiceprint teacher model. The acoustic teacher model and the voiceprint teacher model then share the same input layer, which receives corpus data carrying wake-up word transcription text labels and speaker labels and extracts FBank features as the input of the acoustic teacher model and the voiceprint teacher model respectively. The acoustic teacher model outputs a speech posterior probability sequence through the trained convolutional recurrent neural network, and the voiceprint teacher model outputs speaker posterior probabilities through the joint action of the trained speaker feature extractor, pooling layer, embedding layer and output layer.
Step 4, multi-teacher knowledge distillation, as shown in FIG. 3. Before multi-teacher knowledge distillation, a student model is constructed; the student model mainly comprises a body part, an acoustic head and a voiceprint head. The body part receives the FBank features extracted from the training data by the input layer and performs feature extraction on them with a convolutional layer; the acoustic head and the voiceprint head share the parameters of the same body part. The acoustic head is trained with a feedforward neural network comprising a fully connected layer and an output layer: the fully connected layer receives the feature extraction result output by the body part and outputs a high-dimensional acoustic feature sequence, and the output layer performs speech label classification prediction on this high-dimensional acoustic feature sequence through a fully connected layer, normalizes the prediction with a softmax function, and outputs a speech posterior probability sequence as the acoustic feature. The voiceprint head is trained with a shallow residual network comprising a speaker feature extractor, a pooling layer, an embedding layer and an output layer: the speaker feature extractor extracts speaker features from the output of the input layer using the convolutional layers and residual blocks of a ResNet18 deep residual network, the pooling layer collects the mean and standard deviation statistics of the extracted speaker features, the embedding layer pools the speaker information through a fully connected layer and outputs the embedded feature r-vector, and the output layer performs speaker classification prediction on the embedded feature r-vector through a fully connected layer, using an AAM-softmax function to improve the separation between different speakers in the prediction, and outputs the speaker posterior probabilities; the embedded feature r-vector output by the embedding layer serves as the speaker voiceprint feature.
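A hedged sketch of the student model described in step 4 follows: a shared convolutional body, a feed-forward acoustic head, and a shallow residual-style voiceprint head with statistics pooling and an embedding (r-vector) layer. Layer sizes are illustrative assumptions; a full implementation would use genuine ResNet18-style residual blocks in the voiceprint head.

```python
# Illustrative sketch only: a student model with a shared body, a feed-forward
# acoustic head and a shallow residual-style voiceprint head. All sizes are assumed.
import torch
import torch.nn as nn

class StudentModel(nn.Module):
    def __init__(self, fbank_dim=80, body_dim=128, num_labels=72,
                 emb_dim=192, num_speakers=1000):
        super().__init__()
        # Shared body: convolutional feature extraction over the FBank input
        self.body = nn.Sequential(
            nn.Conv1d(fbank_dim, body_dim, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Acoustic head: feed-forward layers + softmax output (speech posteriors)
        self.acoustic_head = nn.Sequential(
            nn.Linear(body_dim, 256), nn.ReLU(),
            nn.Linear(256, num_labels),
        )
        # Voiceprint head: residual-style frame layer, stats pooling, r-vector, classifier
        self.vp_frames = nn.Conv1d(body_dim, body_dim, kernel_size=3, padding=1)
        self.vp_embed = nn.Linear(2 * body_dim, emb_dim)
        self.vp_out = nn.Linear(emb_dim, num_speakers)

    def forward(self, fbank):                        # fbank: (batch, time, fbank_dim)
        shared = self.body(fbank.transpose(1, 2))    # (batch, body_dim, time)
        phone_logits = self.acoustic_head(shared.transpose(1, 2))
        frames = torch.relu(self.vp_frames(shared) + shared)   # residual connection
        stats = torch.cat([frames.mean(-1), frames.std(-1)], dim=-1)
        r_vector = self.vp_embed(stats)              # speaker voiceprint feature
        # speech posteriors, r-vector embedding, speaker logits
        return phone_logits.log_softmax(-1), r_vector, self.vp_out(r_vector)
```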
In order to make the acoustic feature distribution of the acoustic head in the student model consistent with the acoustic feature distribution in the acoustic teacher model, knowledge distillation needs to be performed on the acoustic teacher model. Taking the speech posterior probability sequence output by the output layer of the acoustic teacher model as the acoustic feature knowledge, the student model is optimized to minimize the KL divergence loss function between its acoustic head and the acoustic teacher model:

$L_{KL} = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T}\hat{y}_{i,t}\log\frac{\hat{y}_{i,t}}{y_{i,t}}$

where N denotes the number of input audio samples, T denotes the length of the input audio sequence, $\hat{y}_{i,t}$ denotes the acoustic feature knowledge of the i-th sample obtained from the acoustic teacher model, i.e. the speech posterior probability sequence of the i-th sample output by the teacher model, and $y_{i,t}$ is the prediction result of the i-th sample processed by the acoustic head of the student model. The CTC loss between the student model acoustic head prediction and the true transcription text label sequence, $L_{CTC}$, is expressed as:

$L_{CTC} = -\frac{1}{N}\sum_{i=1}^{N}\log p(l_i \mid x_i), \qquad p(l_i \mid x_i)=\sum_{\pi_i \in \mathcal{B}^{-1}(l_i)}\prod_{t=1}^{T}p(\pi_{i,t} \mid x_i)$

where $l_i$ denotes the i-th true transcription text label sequence, $\pi_i$ is a label sequence equivalent to the true transcription text label sequence $l_i$, $p(\pi_{i,t} \mid x_i)$ denotes the probability of outputting label $\pi_{i,t}$ at frame t given the input $x_i$, and $\mathcal{B}$ is a many-to-one mapping function that merges any labels repeated in succession and then deletes the blank characters; for example, $\mathcal{B}(aa\text{-}b\text{-}cc\text{-}) = abc$, where a, b and c are labels and '-' is the blank character. When training the acoustic head of the student model, the KL divergence loss function $L_{KL}$ and the CTC loss function $L_{CTC}$ jointly optimize the student model acoustic head, with the optimization objective of minimizing the loss function $L_{AH}$, namely:

$L_{AH} = L_{CTC} + \alpha L_{KL}$

where α is a hyper-parameter that adjusts the relative weight of the two loss functions $L_{CTC}$ and $L_{KL}$.
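A minimal sketch of the acoustic-head objective defined above, assuming the weighting $L_{AH} = L_{CTC} + \alpha L_{KL}$, is shown below; the tensor layouts follow the PyTorch conventions for these losses and are assumptions about how the embodiment would be implemented.

```python
# Illustrative sketch only: CTC loss against the true transcription plus a KL term
# toward the teacher's frame-level posteriors, combined as L_AH = L_CTC + alpha * L_KL.
import torch.nn.functional as F

def acoustic_head_loss(student_log_probs, teacher_log_probs,
                       targets, input_lengths, target_lengths, alpha=1.0):
    # student_log_probs / teacher_log_probs: (time, batch, num_labels), log-softmaxed
    ctc = F.ctc_loss(student_log_probs, targets, input_lengths, target_lengths,
                     blank=0, zero_infinity=True)
    # KL divergence between the teacher and student per-frame posterior distributions
    kl = F.kl_div(student_log_probs, teacher_log_probs,
                  reduction="batchmean", log_target=True)
    return ctc + alpha * kl
```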
In order to keep the voiceprint features in the student model consistent with those in the voiceprint teacher model, knowledge distillation needs to be performed on the voiceprint teacher model. Taking the embedded feature r-vector output by the embedding layer of the voiceprint teacher model as the speaker voiceprint knowledge, the student model is optimized to minimize the cosine similarity loss function between its voiceprint head and the voiceprint teacher model:

$L_{COS} = \frac{1}{N}\sum_{i=1}^{N}\left(1-\cos(\hat{v}_i, v_i)\right)$

where N denotes the number of input audio samples, $\hat{v}_i$ denotes the speaker voiceprint knowledge, i.e. the embedded feature r-vector of the i-th sample output by the teacher model, and $v_i$ is the embedded feature r-vector of the i-th sample processed by the voiceprint head of the student model. The cross-entropy loss between the student model voiceprint head prediction and the true speaker label, $L_{CE}$, is expressed as:

$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{SC} z_{i,c}\log\hat{z}_{i,c}$

where SC denotes the number of speaker classification categories, $z_{i,c}$ is the true speaker label of the i-th sample expressed as a one-hot vector, and $\hat{z}_{i,c}$ is the prediction result of the student model voiceprint head for the i-th sample. When training the voiceprint head of the student model, the cosine similarity loss function $L_{COS}$ and the cross-entropy loss function $L_{CE}$ jointly optimize the student model voiceprint head, with the optimization objective of minimizing the loss function $L_{VH}$, namely:

$L_{VH} = L_{CE} + \beta L_{COS}$

where β is a hyper-parameter that adjusts the relative weight of the two loss functions $L_{CE}$ and $L_{COS}$.
According to the above acoustic head loss function $L_{AH}$ and voiceprint head loss function $L_{VH}$, the optimization objective of multi-teacher knowledge distillation for the whole student model is to minimize the loss function $L_{MT}$, namely:

$L_{MT} = \gamma L_{AH} + \delta L_{VH}$

where γ and δ are hyper-parameters that adjust the relative weights of the two loss functions $L_{AH}$ and $L_{VH}$.
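The voiceprint-head objective $L_{VH}$ and the combined multi-teacher objective $L_{MT}$ can be sketched in the same way; the specific weighting of the terms follows the formulas above and the code is an illustrative assumption, not a reference implementation.

```python
# Illustrative sketch only: cross-entropy on speaker labels plus a cosine term toward
# the teacher's r-vector (L_VH = L_CE + beta * L_COS), then the weighted multi-teacher
# objective L_MT = gamma * L_AH + delta * L_VH.
import torch.nn.functional as F

def voiceprint_head_loss(student_logits, speaker_labels,
                         student_rvec, teacher_rvec, beta=1.0):
    ce = F.cross_entropy(student_logits, speaker_labels)
    # 1 - cos(teacher, student) pulls the student embedding toward the teacher's
    cos = 1.0 - F.cosine_similarity(student_rvec, teacher_rvec, dim=-1).mean()
    return ce + beta * cos

def multi_teacher_loss(l_acoustic_head, l_voiceprint_head, gamma=1.0, delta=1.0):
    return gamma * l_acoustic_head + delta * l_voiceprint_head
```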
Step 4 comprises the steps of:
step 4.1, constructing a multi-teacher network structure, taking a general wake-up word acoustic model as an acoustic teacher model, taking a speaker voiceprint recognition model irrelevant to texts as a voiceprint teacher model, taking a transcribed text label with wake-up words and audio data of the speaker label as input of the acoustic teacher model and the voiceprint teacher model, outputting a wake-up word acoustic feature prediction result by the acoustic teacher model, and outputting a speaker voiceprint feature prediction result by the voiceprint teacher model;
step 4.2, constructing a student model, wherein the student model comprises a body part, an acoustic head and a voiceprint head, the acoustic head and the voiceprint head share parameters of the body part, the acoustic head adopts a feedforward neural network, and the voiceprint head adopts a shallow depth residual network;
step 4.3, performing knowledge distillation on the acoustic teacher model so that the acoustic head of the student model imitates the acoustic feature distribution of the acoustic teacher model, taking the speech posterior probability sequence output by the output layer of the acoustic teacher model as knowledge, and optimizing the acoustic head of the student model to minimize the loss function $L_{AH} = L_{CTC} + \alpha L_{KL}$ defined above, where $L_{CTC}$ is the CTC loss between the student model acoustic head prediction and the true transcription text label sequence and $L_{KL}$ is the KL divergence loss between the student model acoustic head prediction and the acoustic teacher model prediction;
step 4.4, performing knowledge distillation on the voiceprint teacher model so that the voiceprint head of the student model imitates the voiceprint feature representation of the voiceprint teacher model, taking the embedded feature output by the embedding layer of the voiceprint teacher model as knowledge, and optimizing the voiceprint head of the student model to minimize the loss function $L_{VH} = L_{CE} + \beta L_{COS}$ defined above, where $L_{CE}$ is the cross-entropy loss between the student model voiceprint head prediction and the true speaker label and $L_{COS}$ is the cosine similarity loss between the embedded feature output by the student model voiceprint head embedding layer and the embedded feature output by the voiceprint teacher model;
step 4.5, based on the acoustic teacher knowledge distillation and the voiceprint teacher knowledge distillation, the multi-teacher knowledge distillation optimization objective of the student model is to minimize the loss function $L_{MT} = \gamma L_{AH} + \delta L_{VH}$ defined above;
And 4.6, after distilling the multi-teacher knowledge, outputting a student model fused with acoustic and voiceprint features, wherein a voice posterior probability sequence output by an acoustic head output layer of the student model is wake-up word acoustic features, and an embedding feature output by an embedding layer of a voiceprint head of the student model is speaker voiceprint features.
Step 5, generating the student model that fuses acoustic and voiceprint features. Based on the multi-teacher knowledge distillation described in step 4, a student model fusing acoustic and voiceprint features can be obtained. The speech posterior probability sequence output by the acoustic head of the student model is the wake-up word acoustic feature, and the embedded feature output by the embedding layer of the voiceprint head of the student model is the speaker voiceprint feature.
Step 6, acquiring the acoustic features of the registered wake-up word and its speaker voiceprint features from the registration audio. A piece of registrant audio is collected with the microphone of the user terminal, the voice part is extracted from the registrant audio by the voice endpoint detection technique of the user terminal registration module, and the FBank features of the extracted voice, denoted x, are input to the trained student model, which outputs the acoustic feature $y_{enroll}$ of the user's registered wake-up word and its speaker voiceprint feature $v_{enroll}$.
Step 7, generating the wake-up word template and the speaker voiceprint template. When the registrant provides only one piece of registration audio, the acoustic feature $y_{enroll}$ of the registered wake-up word from step 6 is used as the wake-up word template $y_{template}$, and the speaker voiceprint feature $v_{enroll}$ from step 6 is used as the speaker voiceprint template $v_{template}$. If the registrant provides two or more pieces of registration audio, a dynamic time warping (DTW) algorithm is used to align the acoustic features of the registered wake-up word in each piece of registration audio at the frame level, the mean is computed frame by frame after alignment, and the resulting frame-level mean $y_{avg\_enroll}$ is taken as the final wake-up word acoustic feature and used as the wake-up word template $y_{template}$; similarly, the mean of the speaker voiceprint features of all registration audios is computed, and the result $v_{avg\_enroll}$ is taken as the final speaker voiceprint feature and used as the speaker voiceprint template $v_{template}$.
The method specifically comprises the following steps:
step 7.1, collecting registrant audio;
step 7.2, extracting a voice part from the registrant audio through a voice endpoint detection technology;
step 7.3, inputting the extracted voice part into the student model, and outputting the acoustic characteristics of the registered wake-up word and the voice print characteristics of the speaker;
and 7.4, when the registrant only provides one piece of registration audio, taking the acoustic features of the registration wake-up words as a wake-up word template and taking the voice print features of the registration wake-up words as a voice print template.
And 7.5, when a registrant provides two or more than two pieces of registration audio, carrying out frame-level alignment on the acoustic features of the wake-up words obtained in the registration audio one by adopting a DTW algorithm, calculating the average value frame by frame after the alignment, taking the result obtained by calculating the frame-level average value as a wake-up word template, calculating the average value of the voice print features of all the registration audio, and taking the average value of the voice print features of the registration audio as a voice print template of the voice print of the voice.
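The sketch below illustrates steps 7.1 to 7.5 under the assumption that the first enrollment utterance serves as the DTW reference and that a cosine-distance cost is used for alignment; neither choice is specified by this embodiment.

```python
# Illustrative sketch only: DTW-based frame alignment of enrollment posteriors and
# frame-level averaging into the wake-up word template, plus a mean r-vector template.
import numpy as np

def dtw_path(ref, other):
    """Return the DTW alignment path between two (frames, dim) feature arrays."""
    n, m = len(ref), len(other)
    dist = 1.0 - (ref @ other.T) / (
        np.linalg.norm(ref, axis=1, keepdims=True) * np.linalg.norm(other, axis=1) + 1e-8)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j],
                                                  cost[i, j - 1],
                                                  cost[i - 1, j - 1])
    path, i, j = [], n, m                     # backtrack the optimal path
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def build_templates(acoustic_feats, voiceprint_feats):
    """acoustic_feats: list of (frames, labels) posteriors; voiceprint_feats: list of r-vectors."""
    if len(acoustic_feats) == 1:              # single enrollment utterance
        return acoustic_feats[0], voiceprint_feats[0]
    ref = acoustic_feats[0]                   # assumed reference utterance
    aligned_sum, counts = ref.copy(), np.ones(len(ref))
    for feats in acoustic_feats[1:]:
        for i, j in dtw_path(ref, feats):
            aligned_sum[i] += feats[j]
            counts[i] += 1
    wake_word_template = aligned_sum / counts[:, None]       # frame-level mean
    voiceprint_template = np.mean(voiceprint_feats, axis=0)  # mean r-vector
    return wake_word_template, voiceprint_template
```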
Step 8, acquiring the acoustic features of the wake-up word under test and its speaker voiceprint features from the current test audio. A piece of tester audio is collected with the microphone of the user terminal, the voice part is extracted from the tester audio by the voice endpoint detection technique of the user terminal test module, and the FBank features of the extracted voice, denoted $x'$, are input to the trained student model, which outputs the acoustic feature $y_{test}$ of the current wake-up word under test and its speaker voiceprint feature $v_{test}$.
Step 9, obtaining the acoustic feature score of the current wake-up word under test. The DTW algorithm is used to compute the similarity between the wake-up word template $y_{template}$ of the registered wake-up word and the acoustic feature $y_{test}$ of the wake-up word under test, yielding the acoustic feature score $score_{wuw}$, namely:

$score_{wuw} = \mathrm{DTW}(y_{template}, y_{test})$
Step 10, judging whether the acoustic feature score exceeds the predefined wake-up word threshold. If the acoustic feature score $score_{wuw}$ exceeds the predefined wake-up word threshold, this indicates that the wake-up word under test in the current tester audio is the same as the registered wake-up word in the registrant audio, and the method proceeds to step 11 to further judge whether the voiceprint features of the wake-up word under test are consistent with those of the registered wake-up word; if not, the wake-up fails.
Step 11, obtaining the voiceprint feature score of the current wake-up word under test. The cosine similarity between the speaker voiceprint template $v_{template}$ of the registered wake-up word and the voiceprint feature $v_{test}$ of the wake-up word under test is computed to obtain the voiceprint feature score $score_{spk}$, namely:

$score_{spk} = \cos(v_{template}, v_{test}) = \frac{v_{template} \cdot v_{test}}{\|v_{template}\|\,\|v_{test}\|}$

Step 12, judging whether the voiceprint feature score exceeds the predefined speaker threshold. If the voiceprint feature score $score_{spk}$ exceeds the predefined speaker threshold, this indicates that the speaker voiceprint of the wake-up word under test is consistent with that of the registered wake-up word and the wake-up succeeds; otherwise, the wake-up fails.
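A compact sketch of the two-stage decision of steps 9 to 12, reusing the dtw_path helper from the previous sketch; the threshold values are placeholders rather than values from this embodiment, and the DTW score is computed here as the mean cosine similarity along the alignment path.

```python
# Illustrative sketch only: DTW similarity against the wake-up word template, then
# cosine similarity against the speaker voiceprint template; thresholds are placeholders.
import numpy as np

def dtw_score(template, test):
    path = dtw_path(template, test)           # dtw_path from the previous sketch
    sims = [np.dot(template[i], test[j]) /
            (np.linalg.norm(template[i]) * np.linalg.norm(test[j]) + 1e-8)
            for i, j in path]
    return float(np.mean(sims))               # higher means more similar

def wake_decision(y_template, y_test, v_template, v_test,
                  wake_threshold=0.6, speaker_threshold=0.7):
    # Stage 1: wake-up word detection against the wake-up word template
    if dtw_score(y_template, y_test) <= wake_threshold:
        return False                          # wake-up fails
    # Stage 2: speaker verification against the speaker voiceprint template
    cos = float(np.dot(v_template, v_test) /
                (np.linalg.norm(v_template) * np.linalg.norm(v_test) + 1e-8))
    return cos > speaker_threshold            # True -> wake-up succeeds
```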
Example 2
The invention also provides a lightweight personalized voice wake-up device, which comprises a server side and a user terminal as shown in fig. 4.
The server side is composed of a collecting module, a training module and a model storage module. The collection module is responsible for obtaining a general wake word acoustic model and a speaker voiceprint model irrelevant to texts, the training module carries out multi-teacher knowledge distillation by taking the acoustic model and the voiceprint model in the collection module as a teacher model, a student model integrating acoustic and voiceprint characteristics is generated, and the model storage module is used for storing the trained student model.
The user terminal is composed of a registration module, an updating module, a template generation module, a testing module and a verification module, used in the order updating, registration, template generation, testing and verification. The updating module is used for model updating: the student model generated in the remote server-side model storage module is updated to the local user terminal. The registration module extracts the voice part from the registrant audio using a voice endpoint detection technique, inputs the extracted voice part into the trained student model held by the updating module for feature extraction, and outputs the acoustic features of the registered wake-up word in the registrant audio and its speaker voiceprint features. The template generation module generates the wake-up word template from the acoustic features of the registered wake-up word obtained from the registration module and generates the speaker voiceprint template from the voiceprint features of the registered wake-up word obtained from the registration module. The test module extracts the voice part from the tester audio using a voice endpoint detection technique, inputs the extracted voice part into the trained student model held by the updating module for feature extraction, and outputs the acoustic features of the wake-up word under test in the tester audio and its speaker voiceprint features. The verification module is used to judge whether the acoustic features and voiceprint features of the wake-up word under test are consistent with the wake-up word template and speaker voiceprint template generated by the template generation module.
The module workflow of the server side and the user terminal in FIG. 4 mainly comprises the following steps. First, the collecting module of the server side is used to train and generate the general wake-up word acoustic model and the text-independent speaker voiceprint model; the training module then constructs the multi-teacher network structure and performs multi-teacher knowledge distillation to generate the student model fusing acoustic and voiceprint features, and the student model is stored in the model storage module and updated to the updating module of the user terminal. Next, the registrant registers the user-defined wake-up word and voiceprint identity using the registration module of the user terminal, and the generated wake-up word template and speaker voiceprint template are stored in the template generation module. Finally, the tester generates the acoustic features and voiceprint features of the current wake-up word under test using the testing module and verifies them using the verification module; if both scores exceed the thresholds set for the templates generated by the template generation module, the wake-up succeeds, otherwise the wake-up fails.
Example 3
The invention provides a computer device, which comprises a memory and a processor which are electrically connected, wherein the memory stores a computer program which can run on the processor, and when the processor executes the computer program, the wake-up method in the embodiment 1 is realized.
The computer program may be divided into one or more modules/units, which are stored in the memory and executed by the processor to accomplish the present invention.
The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
The memory may be used to store the computer program and/or module, and the processor may implement the various functions of the wake-up device/terminal device by running or executing the computer program and/or module stored in the memory and invoking data stored in the memory.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
Example 4
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The specific content described above in connection with the drawings is only illustrative, and does not limit the scope of the invention, and various modifications or variations made by researchers in the field without the need for creative efforts are still within the scope of the invention based on the lightweight personalized voice wake-up method, device, apparatus and medium provided by the invention.
Claims (10)
1. The lightweight personalized voice wake-up method is characterized by comprising the following steps of:
training and obtaining a general wake-up word acoustic model; training and obtaining a text-independent speaker voiceprint recognition model;
based on the general wake-up word acoustic model and the text-independent speaker voiceprint recognition model, constructing a multi-teacher network structure and performing multi-teacher knowledge distillation to generate a student model that fuses acoustic and voiceprint features;
according to the student model, acquiring the acoustic features of the registered wake-up word and its speaker voiceprint features in the registrant audio, and generating a wake-up word template and a speaker voiceprint template; according to the student model, acquiring the acoustic features of the wake-up word under test and its speaker voiceprint features in the current tester audio;
scoring the acoustic features of the wake-up word under test in the current tester audio based on the wake-up word template, and obtaining the scoring result of the wake-up word under test;
judging whether the scoring result of the wake-up word under test exceeds a predefined wake-up word threshold:
if not, the wake-up fails;
if yes, scoring the speaker voiceprint features of the wake-up word under test in the current tester audio based on the speaker voiceprint template, and obtaining the scoring result of the current speaker;
judging whether the scoring result of the current speaker exceeds a predefined speaker threshold:
if yes, the wake-up succeeds;
if not, the wake-up fails.
2. The method for lightweight personalized customized voice wakeup according to claim 1, wherein the training and obtaining a generic wake word acoustic model is obtained by:
collecting an acoustic corpus training set, and acquiring corresponding acoustic audio sequences and transcribed text label sequence information thereof;
and constructing a convolutional recurrent neural network, training it on the acoustic audio sequences and their transcribed text label sequence information, and generating the general wake-up word acoustic model.
3. The method for waking up a lightweight personalized speech according to claim 1, wherein the training and obtaining a text-independent speaker voiceprint recognition model is obtained by:
collecting a voiceprint corpus training set, and obtaining a corresponding voiceprint audio sequence and speaker tag information thereof;
and constructing a deep residual network, training it on the voiceprint audio sequences and their speaker label information, and generating the text-independent speaker voiceprint recognition model.
4. The method for waking up lightweight personalized customized voice according to claim 1, wherein the specific steps of constructing a multi-teacher network structure and distilling multi-teacher knowledge to generate a student model with integrated acoustic and voiceprint features are as follows:
constructing a multi-teacher network structure, taking a general wake-up word acoustic model as an acoustic teacher model, taking a speaker voiceprint recognition model irrelevant to texts as a voiceprint teacher model, taking a transcription text label with wake-up words and audio data of the speaker label as input of the acoustic teacher model and the voiceprint teacher model, outputting a wake-up word acoustic feature prediction result by the acoustic teacher model, and outputting a speaker voiceprint feature prediction result by the voiceprint teacher model;
constructing a student model, wherein the student model comprises a body part, an acoustic head and a voiceprint head, the acoustic head and the voiceprint head share parameters of the body part, the acoustic head adopts a feedforward neural network, and the voiceprint head adopts a shallow depth residual network;
performing knowledge distillation on the acoustic teacher model so that the acoustic head of the student model imitates the acoustic feature distribution of the acoustic teacher model, taking the speech posterior probability sequence output by the output layer of the acoustic teacher model as knowledge, and optimizing the acoustic head of the student model to minimize the loss function $L_{AH}$, where $L_{CTC}$ denotes the CTC loss function representing the difference between the student model acoustic head prediction and the true transcribed text label sequence, and $L_{KL}$ denotes the KL divergence loss function between the student model acoustic head prediction and the acoustic teacher model prediction;
performing knowledge distillation on the voiceprint teacher model so that the voiceprint head of the student model imitates the voiceprint feature representation of the voiceprint teacher model, taking the embedded feature output by the embedding layer of the voiceprint teacher model as knowledge, and optimizing the voiceprint head of the student model to minimize the loss function $L_{VH}$, where $L_{CE}$ denotes the cross-entropy loss function between the student model voiceprint head prediction and the true speaker label, and $L_{COS}$ denotes the cosine similarity loss function between the embedded feature output by the student model voiceprint head embedding layer and the embedded feature output by the voiceprint teacher model;
based on the acoustic teacher knowledge distillation and the voiceprint teacher knowledge distillation, the multi-teacher knowledge distillation optimization objective of the student model is to minimize the loss function $L_{MT}$, a weighted combination of $L_{AH}$ and $L_{VH}$;
After multi-teacher knowledge is distilled, a student model fused with acoustic and voiceprint features is output, a voice posterior probability sequence output by an acoustic head output layer of the student model is wake-up word acoustic features, and an embedding feature output by an embedding layer of the voiceprint head of the student model is speaker voiceprint features.
5. The method for waking up a lightweight personalized customized voice according to claim 1, wherein the specific steps of obtaining the acoustic features of the registered wake-up word and the speaker voiceprint features thereof in the registrant audio, and generating the wake-up word template and the speaker voiceprint template are as follows:
collecting registrant audio;
extracting a voice portion from the registrant audio by a voice endpoint detection technique;
inputting the extracted voice part into the student model, and outputting the acoustic features of the registered wake-up word and its speaker voiceprint features;
when the registrant provides only one piece of registration audio, using the acoustic features of the registered wake-up word as the wake-up word template and its speaker voiceprint features as the speaker voiceprint template;
when the registrant provides two or more pieces of registration audio, performing frame-level alignment on the wake-up word acoustic features obtained from each piece of registration audio by using a dynamic time warping algorithm, calculating the frame-by-frame mean after alignment and using the result of the frame-level mean calculation as the wake-up word template, and calculating the mean of the speaker voiceprint features of all the registration audio and using this mean as the speaker voiceprint template.
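A sketch of the template-building step for the multi-enrollment case, using a plain textbook DTW implementation; the feature shapes and the way aligned frames are averaged are assumptions for illustration.

```python
# Illustrative sketch: building wake-up word and voiceprint templates from several enrollments.
import numpy as np

def dtw_path(a, b):
    """Frame-level DTW alignment path between two (T, D) posterior sequences."""
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], Ta, Tb
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def build_templates(acoustic_feats, voiceprint_feats):
    """acoustic_feats: list of (T_i, D) posterior sequences; voiceprint_feats: list of (E,) embeddings."""
    ref = acoustic_feats[0]
    aligned = [ref]
    for feat in acoustic_feats[1:]:
        # Map each reference frame to the mean of the frames aligned to it.
        warped = np.zeros_like(ref)
        counts = np.zeros(len(ref))
        for i, j in dtw_path(ref, feat):
            warped[i] += feat[j]
            counts[i] += 1
        aligned.append(warped / np.maximum(counts, 1)[:, None])
    wake_word_template = np.mean(aligned, axis=0)            # frame-level mean after alignment
    voiceprint_template = np.mean(voiceprint_feats, axis=0)  # mean speaker embedding
    return wake_word_template, voiceprint_template
```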
6. The lightweight personalized customized voice wake-up method according to claim 1, wherein the specific steps of obtaining the acoustic features of the wake-up word under test and its speaker voiceprint features in the current tester audio are as follows:
collecting tester audio;
extracting a voice portion from the tester audio by a voice endpoint detection technique;
inputting the extracted voice part into the student model, and outputting the acoustic features of the current wake-up word under test and its speaker voiceprint features.
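A short sketch of this test-time feature extraction step; `vad_trim` and `student_model` are hypothetical placeholders standing in for the endpoint-detection routine and the distilled student model, not APIs defined by the patent.

```python
# Illustrative sketch of test-time feature extraction.
import torch

def extract_test_features(waveform, student_model, vad_trim):
    speech = vad_trim(waveform)  # keep only the voiced portion (voice endpoint detection)
    with torch.no_grad():
        # The student model is assumed to return both heads' outputs.
        acoustic_posteriors, voiceprint_embedding = student_model(speech)
    return acoustic_posteriors, voiceprint_embedding
```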
7. The lightweight personalized customized voice wake-up method according to claim 1, wherein the acoustic features of the wake-up word under test in the current tester audio are scored by using a dynamic time warping algorithm to calculate the similarity between the speech posterior probabilities in the wake-up word template and the speech posterior probabilities of the wake-up word under test.
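A sketch of this DTW-based acoustic scoring, reusing the `dtw_path` helper from the sketch after claim 5; the sign and normalization conventions are assumptions, not values fixed by the patent.

```python
# Illustrative DTW-based acoustic scoring of a test utterance against the wake-up word template.
import numpy as np

def acoustic_score(template, test_posteriors):
    """Higher score means the test wake-up word acoustics are closer to the template."""
    path = dtw_path(template, test_posteriors)
    total = sum(np.linalg.norm(template[i] - test_posteriors[j]) for i, j in path)
    return -total / len(path)  # negate the mean path cost so larger values indicate higher similarity
```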
8. The lightweight personalized customized voice wake-up method according to claim 1, wherein the speaker voiceprint features of the wake-up word under test in the current tester audio are scored by calculating the cosine similarity between the speaker voiceprint features in the speaker voiceprint template and the speaker voiceprint features of the wake-up word under test.
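A minimal sketch of the cosine-similarity voiceprint scoring; the acceptance threshold mentioned in the comment is an assumption about how the score would typically be used, not a value from the patent.

```python
# Illustrative cosine-similarity scoring of speaker voiceprint embeddings.
import numpy as np

def voiceprint_score(template_embedding, test_embedding):
    num = float(np.dot(template_embedding, test_embedding))
    den = float(np.linalg.norm(template_embedding) * np.linalg.norm(test_embedding))
    return num / den  # in [-1, 1]; compared against a tuned threshold to accept or reject the speaker
```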
9. A lightweight, personalized, customized voice wake-up device, comprising:
the server side comprises a collecting module, a training module and a model storage module;
the collecting module is responsible for obtaining a general wake-up word acoustic model and a text-independent speaker voiceprint model; the training module performs multi-teacher knowledge distillation with the acoustic model and the voiceprint model in the collecting module as teacher models and generates a student model fusing acoustic and voiceprint features; and the model storage module is used for storing the trained student model;
the user terminal comprises a registration module, an updating module, a template generation module, a testing module and a verification module;
the registration module is used for obtaining the acoustic features of the registered wake-up word and its speaker voiceprint features in the registrant audio; the updating module is responsible for updating the student model in the remote server-side model storage module to the local user terminal; the template generation module is responsible for generating a wake-up word template from the acoustic features of the registered wake-up word and generating a speaker voiceprint template from its speaker voiceprint features; the test module is used for obtaining the acoustic features of the wake-up word under test and its speaker voiceprint features in the tester audio; and the verification module is responsible for judging whether the acoustic features and speaker voiceprint features of the wake-up word under test are consistent with the wake-up word template and the speaker voiceprint template generated in the template generation module, and feeding back a wake-up result.
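A sketch of the verification module's final decision, combining the two scores defined in claims 7 and 8; the threshold values and the AND-combination rule are assumptions for illustration, not thresholds specified by the patent.

```python
# Illustrative wake-up decision: both the acoustic score and the voiceprint score
# must exceed their (assumed) thresholds for the device to wake up.
def verify(acoustic_s: float, voiceprint_s: float,
           acoustic_thr: float = -0.5, voiceprint_thr: float = 0.6) -> bool:
    return acoustic_s >= acoustic_thr and voiceprint_s >= voiceprint_thr
```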
10. A computer device, comprising a memory and a processor that are electrically connected, wherein the memory stores a computer program executable on the processor, and the processor, when executing the computer program, performs the steps of the method according to any one of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310969471.2A CN116884407A (en) | 2023-08-02 | 2023-08-02 | Lightweight personalized voice awakening method, device and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116884407A true CN116884407A (en) | 2023-10-13 |
Family
ID=88268021
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310969471.2A Pending CN116884407A (en) | 2023-08-02 | 2023-08-02 | Lightweight personalized voice awakening method, device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116884407A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118248131A (en) * | 2024-05-09 | 2024-06-25 | 广州番禺职业技术学院 | Voice wake-up method and system capable of quickly customizing wake-up word |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |