
WO2022227507A1 - Wake-up degree recognition model training method and speech wake-up degree acquisition method - Google Patents

Wake-up degree recognition model training method and speech wake-up degree acquisition method Download PDF

Info

Publication number
WO2022227507A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
sample
wake
degree
arousal
Prior art date
Application number
PCT/CN2021/131223
Other languages
French (fr)
Chinese (zh)
Inventor
邵池
黄东延
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司 filed Critical 深圳市优必选科技股份有限公司
Publication of WO2022227507A1 publication Critical patent/WO2022227507A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • the invention relates to the field of speech processing, in particular to a method for training an arousal degree recognition model and a method for acquiring a speech arousal degree.
  • Emotion recognition has become an integral part of modern human-computer interaction systems in many healthcare, education, and safety-related scenarios.
  • speech, text, video, etc. can be used as separate inputs, or a combination of them can be used as multimodal inputs.
  • This paper focuses on speech-based emotion recognition.
  • speech emotion recognition is performed in a supervised manner using segmented short sentences, and labels for emotions can be in two formats: either discrete labels such as happy, sad, angry and neutral, or continuous labels such as activation (calm versus aroused), valence (negative versus positive), and dominance (weak versus strong).
  • continuous emotional attributes have received a lot of attention due to their flexibility in describing more complex emotional states.
  • Continuous attribute classification plays an extremely important role in speech emotion recognition, and the degree of arousal also affects the speed and accuracy of emotion recognition. Generally speaking, the higher the degree of arousal, the faster the emotion recognition, and the higher the recognition accuracy. The accuracy of semantic emotion recognition can also be improved to a certain extent by identifying the degree of arousal in advance.
  • embodiments of the present invention provide a method for training an arousal degree recognition model and a method for acquiring a voice arousal degree.
  • an embodiment of the present invention provides a method for training an arousal degree recognition model, including:
  • the feature matrix of the frame sequence corresponding to the various arousal degree labels and the corresponding arousal degree labels are input into the neural network for training.
  • the step of acquiring the wake-up degree label of the sample speech includes:
  • the first type of sample speech corresponding to the first arousal degree label, the second type of sample speech corresponding to the second arousal degree label, and the third type of sample speech corresponding to the third arousal degree label are selected.
  • the step of acquiring the wake-up degree label of the sample speech includes:
  • if the difference between the numbers of sample voices under the various arousal degree labels is greater than or equal to a preset number difference, data enhancement processing is performed on the sample voices of the smaller classes until the difference between the numbers of sample voices under the various arousal degree labels is less than the preset number difference.
  • the step of performing data enhancement processing on a small number of sample speeches includes:
  • the combination of the initial sample speech and the augmented speech is used as the sample speech for training.
  • the step of adding noise to the sample speech to obtain the amplified speech includes:
  • S_i represents the floating-point time series
  • L represents the length of the floating-point time series
  • r is the coefficient of w
  • w is a floating-point number that obeys a Gaussian distribution.
  • the step of extracting the feature matrix of the frame sequence corresponding to the sample speech includes:
  • the neural network includes a gated recurrent unit, an attention layer, and a first fully connected layer for sentiment classification;
  • the feature matrix of the frame sequence corresponding to the sample speech and the corresponding wake-up degree label are fed into the gated recurrent unit, and a hidden state corresponding to each time step is formed inside the gated recurrent unit;
  • the level of the sample speech is input into the first fully connected layer, and the classification result of the arousal degree label of the sample speech is obtained.
  • the step of feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding wake-up degree label into the gated recurrent unit, and forming inside the gated recurrent unit a hidden state corresponding to each time step, includes:
  • the hidden states corresponding to the time series are input into the attention layer, the feature weight value of each time step is determined, and the hidden state and feature weight value corresponding to each time step are weighted and summed,
  • the steps of obtaining the level of the corresponding sample speech include:
  • α_t represents the feature weight value at time step t
  • h_t is the hidden state output by the gated recurrent unit
  • W represents the parameter vector to be learned
  • C represents the level of the sample speech.
  • the neural network further includes a second fully connected layer for gender classification
  • the method further includes:
  • the level of the sample speech is input into the second fully connected layer to obtain the speaker gender classification result of the sample speech.
  • an embodiment of the present invention provides a method for acquiring a voice arousal degree, the method comprising:
  • the to-be-recognized speech is input into a wake-up level recognition model, and a wake-up level label of the to-be-recognized speech is output, where the wake-up level recognition model is obtained according to any one of the arousal level recognition model training methods described above.
  • an embodiment of the present invention provides an apparatus for training an arousal degree recognition model, the apparatus comprising:
  • an acquisition module used for acquiring the wake-up degree label of the sample voice, and performing data enhancement on part of the sample voice according to the wake-up degree label of the sample voice;
  • an extraction module for extracting the feature matrix of the corresponding frame sequence of the sample speech
  • the training module is used for inputting the feature matrix of the frame sequence corresponding to various arousal degree labels and the corresponding arousal degree labels into the neural network for training.
  • an embodiment of the present invention provides a device for acquiring a voice wake-up degree, the device comprising:
  • an acquisition module used to acquire the speech to be recognized
  • a recognition module configured to input the voice to be recognized into a wake-up level recognition model, and output a wake-up level label of the to-be-recognized voice, the wake-up level recognition model being obtained by the training method according to any one of the first aspect.
  • an embodiment of the present invention provides a computer device, including a memory and a processor, where the memory is used to store a computer program, and when the computer program runs on the processor, it executes the method according to any one of the first aspect.
  • an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and when the computer program runs on a processor, it executes the training method for the arousal degree recognition model according to any one of the first aspect, or the method for acquiring the voice arousal degree described in the second aspect.
  • in the above-mentioned training method for the arousal degree recognition model and the method for acquiring the voice arousal degree provided by the present application, feature extraction is performed on sample voices of different arousal degrees, and the features are input into the neural network for training, so that an arousal degree recognition model capable of recognizing the voice arousal degree can be obtained.
  • the arousal degree recognition model is applied to the speech recognition scene, and the recognition of arousal degree is added on the basis of basic speech recognition, so as to enhance the accuracy and diversity of speech recognition.
  • FIG. 1 shows a schematic flowchart of a training method for an arousal degree recognition model provided by an embodiment of the present application
  • FIG. 2 shows a schematic flowchart of part of the data enhancement involved in the training method for an arousal degree identification model provided by an embodiment of the present application
  • FIG. 3 shows a schematic partial flowchart of a feature matrix extraction involved in a method for training an arousal degree identification model provided by an embodiment of the present application
  • FIG. 4 shows a schematic flowchart of part of the model training involved in the method for training an arousal degree recognition model provided by an embodiment of the present application
  • FIG. 5 shows a schematic structural diagram of a part of the neural network involved in the method for training an arousal degree recognition model provided by an embodiment of the present application
  • FIG. 6 shows a schematic flowchart of a method for acquiring a voice arousal degree provided by an embodiment of the present application
  • FIG. 7 shows a block diagram of a module of an apparatus for training an arousal degree recognition model provided by an embodiment of the present application
  • FIG. 8 shows a block diagram of a module of an apparatus for acquiring a voice wakefulness degree provided by an embodiment of the present application
  • FIG. 9 shows a hardware structure diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a training method for an arousal degree recognition model (hereinafter referred to as the model training method) provided by an embodiment of the present invention.
  • the model training method mainly includes the following steps:
  • the model training method provided in this embodiment mainly uses the sample speech of the known arousal degree Arousal to train the basic neural network, so as to train the arousal degree recognition model with the arousal degree recognition function.
  • the level of arousal represents the level of emotional physiological activation; for example, "anger" or "excitement" corresponds to a higher level of arousal relative to calm.
  • Arousal degree labels are usually continuous emotional attributes, and the values of their original labels are distributed between [1, 5].
  • continuous emotional attributes can be discretized into three categories; for example, the continuous arousal value is divided into 3 intervals: the degree of arousal in [1, 2] is classified as the first arousal degree (relatively low), the degree of arousal in (2, 4) as the second arousal degree (middle), and the degree of arousal in [4, 5] as the third arousal degree (relatively high).
  • the voices belonging to these three categories can also be re-assigned labels 1, 2, and 3, so that the problem is transformed into a three-class emotion classification problem on the arousal label.
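The interval-to-label mapping above can be sketched as a small Python function (the function name and the toy values are illustrative, not from the patent):

```python
def discretize_arousal(value: float) -> int:
    """Map a continuous arousal value in [1, 5] to class 1 (low), 2 (middle), or 3 (high)."""
    if 1.0 <= value <= 2.0:   # [1, 2] -> first arousal degree (relatively low)
        return 1
    elif value < 4.0:         # (2, 4) -> second arousal degree (middle)
        return 2
    elif value <= 5.0:        # [4, 5] -> third arousal degree (relatively high)
        return 3
    raise ValueError("arousal value outside [1, 5]")

labels = [discretize_arousal(v) for v in (1.5, 3.0, 4.2)]  # toy annotations
```

With this mapping, the continuous annotation problem becomes an ordinary three-class classification problem over the re-assigned labels.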
  • the step of obtaining sample speech corresponding to various arousal degrees described in S101 may include:
  • the first type of sample speech corresponding to the first arousal degree label, the second type of sample speech corresponding to the second arousal degree label, and the third type of sample speech corresponding to the third arousal degree label are selected.
  • the awakening degree of the voice to be recognized can be divided into three levels, and the corresponding labels are respectively defined as the first awakening degree label, the second awakening degree label and the third awakening degree label.
  • the degrees of arousal corresponding to these three labels can be set to increase sequentially.
  • the corresponding sample speech is then obtained. That is, the first type of sample speech with a relatively low arousal degree corresponds to the first arousal degree label, the second type of sample speech with a middle arousal degree corresponds to the second arousal degree label, and the third type of sample speech with a relatively high arousal degree corresponds to the third arousal degree label.
  • the preset data set may be the Interactive Emotional Dyadic Motion Capture (IEMOCAP) data set.
  • the sample voice whose arousal degree value is in the range of [1, 2] is used as the first type of sample voice
  • the sample speech whose arousal degree value range is (2, 4) is used as the second type of sample speech
  • the sample speech whose arousal degree value is [4, 5] is used as the third type of sample speech.
  • a larger number of sample speeches is required to train a model with higher recognition accuracy.
  • the total number of sample voices can be expanded by means of data enhancement to improve the recognition accuracy of the trained model.
  • the step in S101 of acquiring the wake-up level label of the sample speech and performing data enhancement on part of the sample speech according to the wake-up level label of the sample speech includes:
  • the preset number of sample voices allowed for training can be about 3000
  • the allowed difference between the numbers of the various types of sample voices is the preset number difference
  • the preset number difference can be set to 0, that is, the numbers of sample voices of the various types are required to be exactly the same; it can also be set to a value greater than 0, that is, a partial difference between the numbers of sample voices of the various types is allowed.
  • the actual number difference between the sample speeches of the various arousal degree labels is compared with the preset number difference. If the actual number difference is greater than or equal to the preset number difference, data enhancement processing is required for the smaller classes of sample speeches; if the actual number difference is less than the preset number difference, no data enhancement processing is required.
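A minimal sketch of this balance check, with the gap computation and the toy labels assumed for illustration:

```python
from collections import Counter

def class_count_gap(labels) -> int:
    """Gap between the largest and smallest class counts across arousal labels."""
    counts = Counter(labels)
    return max(counts.values()) - min(counts.values())

# Toy arousal labels: three low, two middle, one high sample.
labels = [1, 1, 1, 2, 2, 3]
preset_diff = 1
# Augmentation is triggered when the actual gap reaches the preset difference.
needs_augmentation = class_count_gap(labels) >= preset_diff
```

Here the gap is 2, so the medium and high classes would be augmented until the gap falls below `preset_diff`.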
  • the above-mentioned steps of performing data enhancement processing on a small number of sample speeches may include:
  • the combination of the initial sample speech and the augmented speech is used as the sample speech for training.
  • the step of adding noise to the sample speech to obtain the amplified speech includes computing S'_i = S_i + r·w, where:
  • S_i represents the floating-point time series
  • L represents the length of the floating-point time series
  • r is the coefficient of w
  • the value range of r is [0.001, 0.002]
  • w is a floating-point number that obeys a Gaussian distribution.
  • the noise is Gaussian white noise.
  • the specific difference can be customized according to the specific sample type or model recognition accuracy.
  • in Python, w can be generated by numpy.random.normal(0, 1, len(S)), which is essentially a sequence of length L of numbers that conform to the Gaussian distribution.
  • in this way the data is amplified, the imbalance in the numbers of samples of the three categories (low, medium, and high) is alleviated, and it is ensured that no batch contains too many samples of one class, so as to prevent the trained model from always biasing towards the class with more samples.
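The noise-augmentation step above can be sketched as follows, assuming the sample speech S is already loaded as a floating-point time series (the 16 kHz example signal is illustrative):

```python
import numpy as np

def add_gaussian_noise(S: np.ndarray, r: float = 0.001) -> np.ndarray:
    """Return amplified speech S' with S'_i = S_i + r * w_i, w ~ N(0, 1)."""
    w = np.random.normal(0, 1, len(S))  # Gaussian white noise of length L
    return S + r * w

S = np.zeros(16000)                     # 1 s of silence at 16 kHz (illustrative)
augmented = add_gaussian_noise(S, r=0.001)  # r in the stated range [0.001, 0.002]
```

Because r is tiny relative to typical speech amplitudes, the augmented signal stays perceptually close to the original while still differing numerically from it.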
  • after acquiring the sample voices corresponding to various arousal degrees, the sample voices are divided into frames to obtain a frame sequence corresponding to each sample voice.
  • the feature matrix corresponding to the frame sequence is extracted, which is used to learn and summarize the speech features of various arousal degrees.
  • the step of extracting the feature matrix of the frame sequence corresponding to the sample speech in S102 may specifically include:
  • the sample speech is divided into speech frames corresponding to the time axis, and the features between adjacent speech frames are related or even overlapped in adjacent time periods.
  • the Opensmile tool can be used to extract Low-Level Descriptor (LLD) features and their first-order derivatives.
  • the low-level descriptor can be IS13_compare.
  • there are 65 low-level descriptor features and 65 first-order derivatives of the low-level descriptor features, resulting in a total of 65 + 65 = 130 features.
  • the frame length can be set to 20ms, and the frame shift can be set to 10ms.
  • the length of each speech is not fixed, so the number of frames extracted from each speech is also different.
  • the maximum number of frames for each voice can be uniformly set to 750. If the actual number of frames (frame_num) is less than 750, a padding operation is performed, that is, (750 - frame_num) rows of zeros are appended after the extracted two-dimensional features. If the actual number of frames is greater than 750, a truncation operation is performed. Finally, the feature matrix of each sample speech is of size number of frames * number of features, that is, a two-dimensional matrix of size 750*130.
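The pad-or-truncate step above can be sketched directly with NumPy (the function name is illustrative; 750 and 130 are the values from the description):

```python
import numpy as np

MAX_FRAMES, N_FEATURES = 750, 130

def fix_length(features: np.ndarray) -> np.ndarray:
    """Pad with rows of zeros or truncate so every utterance is 750 x 130."""
    frame_num = features.shape[0]
    if frame_num < MAX_FRAMES:
        # Append (750 - frame_num) rows of zeros after the extracted features.
        pad = np.zeros((MAX_FRAMES - frame_num, features.shape[1]))
        return np.vstack([features, pad])
    return features[:MAX_FRAMES]        # truncate longer utterances

short = fix_length(np.ones((300, N_FEATURES)))  # padded to 750 rows
long_ = fix_length(np.ones((900, N_FEATURES)))  # truncated to 750 rows
```

Fixing the frame dimension this way lets utterances of different lengths be batched into a single tensor for training.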
  • the various feature matrices and the corresponding arousal degree labels can be input into the neural network prepared in advance for training, and the characteristics are learned and summarized, so as to obtain Arousal level recognition model capable of identifying different speech arousal levels.
  • the feature matrix of the frame sequence corresponding to various types of arousal degree labels and the corresponding arousal degree labels are input to the neural network for training.
  • the neural network includes a gated recurrent unit, an attention layer and a first fully connected layer for sentiment classification.
  • the neural network for encoding the feature matrix adopts a recurrent neural network (Recurrent Neural Network, RNN for short), and the RNN sequentially includes a gated recurrent unit (Gated Recurrent Unit, GRU for short), an attention layer and a first fully connected layer; adjacent layers have a data transmission relationship, and usually the output data of the upper layer is the input of the lower layer.
  • the unit that performs feature encoding may also be another encoding unit, such as a long short-term memory layer (Long Short-Term Memory, LSTM for short), which is not limited here.
  • the method may specifically include:
  • the feature matrix of the frame sequence corresponding to the sample speech and the corresponding wake-up degree label are fed into the gated recurrent unit, and a hidden state corresponding to each time step is formed inside the gated recurrent unit;
  • the step of feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding wake-up degree label into the gated recurrent unit, and forming inside the gated recurrent unit a hidden state corresponding to each time step, includes:
  • the feature x_t and the hidden state h_{t-1} of the previous time step are used to update the hidden state at each time step, where the hidden state update formula is: h_t = f_θ(x_t, h_{t-1})
  • f_θ is the RNN function with weight parameter θ
  • h_t represents the hidden state at the t-th time step
  • the attention layer is used to pay attention to the parts related to emotion. Specifically, as shown in Figure 4, at time step t, the output of the GRU is h_t, and the feature weight of normalized importance is first calculated by the softmax function: α_t = exp(Wᵀh_t) / Σ_τ exp(Wᵀh_τ), where:
  • α_t represents the feature weight value at time step t
  • h_t is the hidden state output by the gated recurrent unit
  • W represents the parameter vector to be learned.
  • the weighted sum is then performed according to the weights: the hidden state and feature weight value corresponding to each time step are weighted and summed to obtain the level of the corresponding sample speech, C = Σ_t α_t h_t.
  • S404 Input the level of the sample speech into the first fully connected layer to obtain a classification result of the arousal degree of the sample speech.
  • the sentence level C obtained through the attention layer is input to the sentiment classification network, namely the first fully connected layer, for sentiment classification.
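The attention pooling described above can be sketched numerically as follows (the hidden size of 128 and the random inputs are assumed for illustration):

```python
import numpy as np

def attention_pool(H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """H: (T, d) GRU hidden states; W: (d,) learned parameter vector.
    Returns the sentence-level representation C = sum_t alpha_t * h_t."""
    scores = H @ W                          # one scalar score W.h_t per time step
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha = alpha / alpha.sum()             # feature weight values, sum to 1
    return alpha @ H                        # weighted sum over time steps

H = np.random.randn(750, 128)               # T = 750 frames, hidden size 128 (assumed)
W = np.random.randn(128)
C = attention_pool(H, W)                    # sentence level C, shape (128,)
```

The softmax weights let the model emphasize the emotion-relevant frames while the weighted sum collapses the variable-length sequence into a fixed-size vector for the fully connected classifiers.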
  • the neural network further includes a second fully-connected layer for gender classification.
  • the method further includes:
  • the level of the sample speech is input into the second fully connected layer to obtain the speaker gender classification result of the sample speech.
  • the multi-classification task includes emotion classification and gender classification, wherein gender classification is a binary classification task, which is an auxiliary task of emotion classification.
  • the emotion classification network includes the first fully connected layer and the softmax layer; the gender classification network includes the second fully connected layer and the softmax layer.
  • the structure is shown in Figure 5, where yE represents the probability that a predicted sentence belongs to the three emotion categories of low, medium and high, and yG represents the probability that the predicted gender of the speaker of a certain sentence belongs to the male or female category.
  • the loss equation for multi-task classification is as follows: l = α·l_emotion + β·l_gender
  • l_emotion and l_gender denote the losses for emotion classification and gender classification, respectively.
  • α and β represent the weights of the two tasks, and in this study, both values are set to 1.
  • the loss function of the two tasks is the cross-entropy loss, calculated for the K-class emotion task as l_emotion = -(1/N) Σ_i Σ_k y_{i,k} log p_{i,k}, and for the binary gender task as l_gender = -(1/N) Σ_i [y_i log p_i + (1 - y_i) log(1 - p_i)], where:
  • N represents the total number of samples
  • K is the total number of emotional categories
  • y_{i,k} represents the true probability that the i-th sample belongs to the k-th class
  • p_{i,k} represents the predicted probability that the i-th sample belongs to the k-th class.
  • y_i represents the true label of the sample
  • p_i is the predicted probability that the sample belongs to the first (positive) class.
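The multi-task loss above can be sketched with NumPy; the function names, the one-hot toy inputs, and the small epsilon for numerical stability are assumptions for illustration:

```python
import numpy as np

def cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """y_true: (N, K) one-hot labels; y_pred: (N, K) predicted probabilities."""
    return float(-np.mean(np.sum(y_true * np.log(y_pred + 1e-12), axis=1)))

def multitask_loss(yE_true, yE_pred, yG_true, yG_pred, alpha=1.0, beta=1.0):
    """l = alpha * l_emotion + beta * l_gender, with alpha = beta = 1 by default."""
    return alpha * cross_entropy(yE_true, yE_pred) + beta * cross_entropy(yG_true, yG_pred)

yE = np.eye(3)[[0, 1, 2]]   # 3 samples, K = 3 emotion classes (low/medium/high)
yG = np.eye(2)[[0, 1, 0]]   # gender as a binary auxiliary task
perfect = multitask_loss(yE, yE, yG, yG)  # near-zero when predictions match labels
```

Treating gender as an auxiliary head sharing the sentence-level representation is a common way to regularize the main emotion task without extra inference cost.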
  • the above method extracts features from sample speeches with different arousal degree labels and inputs them into a neural network for training, so that an arousal degree recognition model capable of recognizing voice arousal degree labels can be obtained.
  • the arousal degree recognition model is applied to the speech recognition scene, and the recognition of arousal degree is added on the basis of basic speech recognition, so as to enhance the accuracy and diversity of speech recognition.
  • FIG. 6 is a schematic flowchart of a method for acquiring a voice arousal degree according to an embodiment of the present invention. As shown in FIG. 6, the method includes the following steps:
  • S602 Input the voice to be recognized into an arousal degree recognition model, and output a wake-up degree label of the voice to be recognized.
  • the arousal degree identification model is obtained according to the arousal degree identification model training method described in the above embodiment.
  • the arousal degree recognition model built in the above-mentioned embodiment is loaded into the computer device and applied to the scene of obtaining the voice arousal degree.
  • the voice to be recognized may be the voice collected by computer equipment, or the voice obtained from other channels such as the Internet.
  • the arousal degree recognition model training device 700 mainly includes:
  • an acquisition module 701 configured to acquire a wake-up degree label of the sample voice, and perform data enhancement on part of the sample voice according to the wake-up degree label of the sample voice;
  • An extraction module 702 configured to extract the feature matrix of the frame sequence corresponding to the sample speech
  • the training module 703 is used for inputting the feature matrix of the frame sequence corresponding to the various types of arousal degree labels and the corresponding arousal degree labels into the neural network for training.
  • FIG. 8 is a block diagram of the modules of an apparatus for acquiring a voice arousal degree according to an embodiment of the present invention.
  • the apparatus 800 for obtaining the voice wakefulness degree includes:
  • the identification module 802 is configured to input the voice to be recognized into a wake-up level recognition model, and output the wake-up level label of the to-be-recognized voice, where the wake-up level recognition model is obtained according to the wake-up level recognition model training method described in the above embodiment.
  • an embodiment of the present disclosure provides a computer device, including a memory and a processor, where the memory stores a computer program, and when the computer program runs on the processor, it executes the wake-up degree recognition model training method or the voice arousal degree acquisition method provided by the above method embodiments.
  • the computer device 900 includes but is not limited to: a radio frequency unit 901 , a network module 902 , an audio output unit 903 , an input unit 904 , and a sensor 905 , a display unit 906 , a user input unit 907 , an interface unit 908 , a memory 909 , a processor 910 , and a power supply 911 and other components.
  • the structure of the computer device shown in FIG. 9 does not constitute a limitation on the computer device, and the computer device may include more or fewer components than shown, or combine some components, or have a different component layout.
  • the computer equipment includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
  • the radio frequency unit 901 can be used for receiving and sending signals during information transmission or during a call. Specifically, downlink data received from the base station is processed by the processor 910, and uplink data is sent to the base station.
  • the radio frequency unit 901 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like.
  • the radio frequency unit 901 can also communicate with the network and other devices through a wireless communication system.
  • the computer device provides the user with wireless broadband Internet access through the network module 902, such as helping the user to send and receive emails, browse the web, access streaming media, and so on.
  • the audio output unit 903 may convert audio data received by the radio frequency unit 901 or the network module 902 or stored in the memory 909 into audio signals and output as sound. Also, the audio output unit 903 may also provide audio output related to a specific function performed by the computer device 900 (eg, call signal reception sound, message reception sound, etc.).
  • the audio output unit 903 includes a speaker, a buzzer, a receiver, and the like.
  • the input unit 904 is used to receive audio or video signals.
  • the input unit 904 may include a graphics processor (Graphics Processing Unit, GPU for short) 9041 and a microphone 9042, and the graphics processor 9041 processes image data of still pictures or videos obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode.
  • the processed image frames can be displayed on the display unit 906.
  • the image frames processed by the graphics processor 9041 may be stored in the memory 909 (or other storage medium) or transmitted via the radio frequency unit 901 or the network module 902 .
  • the microphone 9042 can receive sound and can process such sound into audio data.
  • in the case of a telephone call mode, the processed audio data can be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 901 and output.
  • the computer device 900 also includes at least one sensor 905, including at least the barometer mentioned in the above embodiments.
  • the sensor 905 may also be other sensors such as light sensors, motion sensors, and other sensors.
  • the light sensor includes an ambient light sensor and a proximity sensor, wherein the ambient light sensor can adjust the brightness of the display panel 9061 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 9061 and/or the backlight when the computer device 900 is moved to the ear.
  • the accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of the computer device (such as horizontal/vertical screen switching, related games, magnetometer attitude calibration) and for vibration-recognition related functions (such as pedometer, tapping); the sensor 905 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and the like, which are not described in detail here.
  • the display unit 906 is used for displaying information input by the user or information provided to the user.
  • the display unit 906 may include a display panel 9061, which may be in the form of a liquid crystal panel (Liquid Crystal Display, LCD for short), an organic light-emitting diode (Organic Light-Emitting Diode, OLED for short) panel, and the like.
  • the user input unit 907 may be used to receive input numerical or character information, and generate key signal input related to user settings and function control of the computer device.
  • the user input unit 907 includes a touch panel 9071 and other input devices 9072 .
  • the touch panel 9071, also referred to as a touch screen, can collect touch operations by the user on or near it (such as operations performed by the user with a finger, a stylus, or any other suitable object or accessory on or near the touch panel 9071).
  • the touch panel 9071 may include two parts: a touch detection device and a touch controller.
  • the touch detection device detects the user's touch orientation and the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 910, and receives and executes commands sent by the processor 910.
  • the touch panel 9071 can be implemented in various types such as resistive, capacitive, infrared, and surface acoustic waves.
  • the user input unit 907 may also include other input devices 9072 .
  • other input devices 9072 may include, but are not limited to, physical keyboards, function keys (such as volume control keys, switch keys, etc.), trackballs, mice, and joysticks, which will not be repeated here.
  • the touch panel 9071 can be overlaid on the display panel 9061.
  • when the touch panel 9071 detects a touch operation on or near it, it transmits the operation to the processor 910 to determine the type of the touch event, and the processor 910 then provides a corresponding visual output on the display panel 9061 according to the type of the touch event.
  • although the touch panel 9071 and the display panel 9061 are described here as two independent components that realize the input and output functions of the computer device, in some embodiments the touch panel 9071 and the display panel 9061 can be integrated to realize the input and output functions of the computer device, which is not specifically limited here.
  • the interface unit 908 is an interface for connecting an external computer device to the computer device 900 .
  • the external computer devices may include a wired or wireless headset port, an external power (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, audio input/output (I/O) ports, video I/O ports, earphone ports, and the like.
  • the interface unit 908 may be used to receive input (e.g., data information, power, etc.) from an external computer device and transmit the received input to one or more elements within the computer device 900, or may be used to transfer data between the computer device 900 and external computer devices.
  • the memory 909 may be used to store software programs as well as various data.
  • the memory 909 may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like; the stored data area may store data created according to the use of the device (such as audio data, a phone book, etc.), and the like.
  • the memory 909 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the processor 910 is the control center of the computer device; it uses various interfaces and lines to connect all parts of the entire device, and performs various functions of the device and processes data by running or executing the software programs and/or modules stored in the memory 909 and calling the data stored in the memory 909, so as to monitor the device as a whole.
  • the processor 910 may include one or more processing units; preferably, the processor 910 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 910.
  • the computer device 900 may also include a power supply 911 (such as a battery) for supplying power to various components.
  • the power supply 911 may be logically connected to the processor 910 through a power management system, so as to implement functions such as charging management, discharging management, and power consumption management through the power management system.
  • the computer device 900 may also include some functional modules that are not shown, which are not repeated here.
  • the memory is used for storing a computer program, and the computer program, when run by the processor, executes the above-mentioned method for training the arousal degree recognition model or the method for acquiring the voice arousal degree.
  • an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program; the computer program, when run on a processor, executes the above-mentioned method for training an arousal degree recognition model or method for acquiring a speech arousal degree.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures.
  • each block of the block diagrams and/or flow diagrams, and combinations of blocks in the block diagrams and/or flow diagrams, can be implemented by dedicated hardware-based systems that perform the specified functions or actions, or by a combination of special-purpose hardware and computer instructions.
  • each functional module or unit in each embodiment of the present invention may be integrated to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part.
  • if the functions are implemented in the form of software function modules and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a smart phone, a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a mobile hard disk, a Read-Only Memory (ROM for short), a Random Access Memory (RAM for short), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application provide a wake-up degree recognition model training method and a speech wake-up degree acquisition method. The wake-up degree recognition model training method comprises: obtaining a wake-up degree label of a sample speech, and performing data enhancement on part of the sample speech according to the wake-up degree label of the sample speech; extracting a feature matrix of a frame sequence corresponding to the sample speech; and inputting feature matrices of frame sequences corresponding to different classes of wake-up degree labels and the corresponding wake-up degree labels into a neural network for training. With the provided wake-up degree recognition model training solution, features of sample speeches having different wake-up degrees are extracted and input into the neural network for training, so that a wake-up degree recognition model capable of recognizing a speech wake-up degree can be obtained. When the wake-up degree recognition model is applied to a speech recognition scenario, wake-up degree recognition is added on the basis of basic speech recognition, thereby enhancing the accuracy and diversity of speech recognition.

Description

唤醒程度识别模型训练方法及语音唤醒程度获取方法Arousal degree recognition model training method and voice arousal degree acquisition method
相关申请的交叉引用CROSS-REFERENCE TO RELATED APPLICATIONS
本申请要求于2021年04月27日提交中国专利局的申请号为2021104622780、名称为“唤醒程度识别模型训练方法及语音唤醒程度获取方法”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese Patent Application No. 2021104622780, which was filed with the China Patent Office on April 27, 2021, and is entitled "Arousal Level Recognition Model Training Method and Voice Arousal Level Acquiring Method", the entire contents of which are incorporated by reference in this application.
技术领域technical field
本发明涉及语音处理领域,尤其涉及一种唤醒程度识别模型训练方法及语音唤醒程度获取方法。The invention relates to the field of speech processing, in particular to a method for training an arousal degree recognition model and a method for acquiring a speech arousal degree.
背景技术Background technique
在许多与医疗健康、教育和安全相关的场景中，情感识别成为现代人机交互系统不可或缺的一部分。在情感识别系统中，可以将语音、文本、视频等作为单独的输入，也可以使用它们的组合作为多模态的输入，本文主要关注基于语音的情感识别。通常，语音情感识别是采用经过切分的简短语句以有监督的方式进行识别，情感的标签可以采用两种格式，即离散标签，例如幸福，悲伤，愤怒和中性，或连续标签，例如激活(镇静)对(唤起)、效价(负对正)和优势(弱对强)。近年来，连续情绪属性因在描述更复杂的情绪状态方面更加灵活，而受到许多关注。连续属性分类在语音情绪识别中有极其重要的作用，唤醒程度也影响了情感识别的速度和准确度，一般来讲，唤醒程度越高，情感识别就越迅速，识别准确率也相应越高，通过预先识别唤醒程度也可以一定程度上提高语义情感识别的准确度。Emotion recognition has become an integral part of modern human-computer interaction systems in many healthcare, education, and safety-related scenarios. In an emotion recognition system, speech, text, video, etc. can be used as separate inputs, or a combination of them can be used as multimodal input; this application focuses on speech-based emotion recognition. Typically, speech emotion recognition is performed in a supervised manner on segmented short utterances, and emotion labels can take two formats: discrete labels, such as happy, sad, angry, and neutral, or continuous labels, such as activation (calm versus aroused), valence (negative versus positive), and dominance (weak versus strong). In recent years, continuous emotional attributes have received much attention for their flexibility in describing more complex emotional states. Continuous attribute classification plays an extremely important role in speech emotion recognition, and the degree of arousal also affects the speed and accuracy of emotion recognition. Generally speaking, the higher the degree of arousal, the faster the emotion recognition and the correspondingly higher the recognition accuracy; recognizing the degree of arousal in advance can also improve the accuracy of semantic emotion recognition to a certain extent.
可见,亟需一种能识别语音的连续情感中的唤醒程度高低的方法。It can be seen that there is an urgent need for a method that can identify the level of arousal in the continuous emotion of speech.
申请内容Application content
为了解决上述技术问题,本发明实施例提供了一种唤醒程度识别模型训练方法及语音唤醒程度获取方法。In order to solve the above technical problems, embodiments of the present invention provide a method for training an arousal degree recognition model and a method for acquiring a voice arousal degree.
第一方面,本发明实施例提供了一种唤醒程度识别模型训练方法,包括:In a first aspect, an embodiment of the present invention provides a method for training an arousal degree recognition model, including:
获取样本语音的唤醒程度标签,并根据所述样本语音的唤醒程度标签对部分所述样本语音进行数据增强;Obtain the wake-up level label of the sample voice, and perform data enhancement on part of the sample voice according to the wake-up level label of the sample voice;
提取所述样本语音对应帧序列的特征矩阵;extracting the feature matrix of the frame sequence corresponding to the sample speech;
将各类唤醒程度标签对应帧序列的特征矩阵及对应的唤醒程度标签输入神经网络进行训练。The feature matrix of the frame sequence corresponding to the various arousal degree labels and the corresponding arousal degree labels are input into the neural network for training.
根据本公开的一种具体实施方式,所述获取样本语音的唤醒程度标签的步骤,包括:According to a specific embodiment of the present disclosure, the step of acquiring the wake-up degree label of the sample speech includes:
从预设数据集中,选取对应第一唤醒程度标签的第一类样本语音、对应第二唤醒程度标签的第二类样本语音和对应第三唤醒程度标签的第三类样本语音。From the preset data set, the first type of sample speech corresponding to the first arousal degree label, the second type of sample speech corresponding to the second arousal degree label, and the third type of sample speech corresponding to the third arousal degree label are selected.
根据本公开的一种具体实施方式,所述获取样本语音的唤醒程度标签的步骤,包括:According to a specific embodiment of the present disclosure, the step of acquiring the wake-up degree label of the sample speech includes:
判断各类唤醒程度标签的样本语音的数量之间的差值是否大于或者等于预设数量差值;Determine whether the difference between the number of sample voices of various types of arousal degree labels is greater than or equal to the preset number difference;
若各类唤醒程度标签的样本语音的数量之间的差值大于或者等于预设数量差值，对数量较少的样本语音进行数据增强处理，直至各类唤醒程度标签的样本语音的数量之间的差值小于所述预设数量差值。If the difference between the numbers of sample voices of the various arousal degree labels is greater than or equal to the preset number difference, data enhancement processing is performed on the sample voices of the smaller classes until the difference between the numbers of sample voices of the various arousal degree labels is less than the preset number difference.
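The count-gap check described above can be sketched in a few lines; the concrete threshold value and label names below are illustrative assumptions, not part of the original disclosure.

```python
from collections import Counter

def needs_augmentation(labels, max_gap=50):
    """Return True when the largest and smallest class counts differ by
    at least the preset gap (max_gap is a hypothetical threshold)."""
    counts = Counter(labels)
    return max(counts.values()) - min(counts.values()) >= max_gap

# a toy imbalanced label set: the smallest class would need augmentation
labels = ["low"] * 120 + ["mid"] * 300 + ["high"] * 90
print(needs_augmentation(labels))  # True: 300 - 90 >= 50
```

In practice the check would be re-run after each augmentation round until it returns False.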
根据本公开的一种具体实施方式,所述对数量较少的样本语音进行数据增强处理的步骤,包括:According to a specific embodiment of the present disclosure, the step of performing data enhancement processing on a small number of sample speeches includes:
为初始的样本语音添加噪声,得到扩增语音;Add noise to the initial sample speech to get the augmented speech;
将初始的样本语音和扩增语音相加后的语音作为用于训练的样本语音。The speech after adding the initial sample speech and the augmented speech is used as the sample speech for training.
根据本公开的一种具体实施方式,所述为样本语音添加噪声,得到扩增语音的步骤,包括:According to a specific embodiment of the present disclosure, the step of adding noise to the sample speech to obtain the amplified speech includes:
利用librosa库加载所述样本音频,得到浮点型时间序列;Use the librosa library to load the sample audio to obtain a floating-point time series;
对浮点型时间序列S进行以下公式的计算,得到加噪后的扩增语音SNi,Calculate the following formula for the floating-point time series S to obtain the amplified speech SNi after adding noise,
SN_i = S_i + r·w
其中,i=1,2,...,L,Si表示浮点型时间序列,L表示浮点型时间序列的长度,r为w的系数,r的取值范围为[0.001,0.002],w为服从高斯分布的浮点数。Among them, i=1,2,...,L, Si represents the floating-point time series, L represents the length of the floating-point time series, r is the coefficient of w, and the value range of r is [0.001, 0.002], w is a floating-point number that obeys a Gaussian distribution.
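As a minimal sketch of this noise-augmentation step, the function below adds Gaussian noise scaled by a coefficient r in the stated [0.001, 0.002] range to a floating-point time series such as the one returned by `librosa.load`; the synthetic sine tone standing in for a real utterance is an illustrative assumption.

```python
import numpy as np

def add_gaussian_noise(s, r=0.0015, seed=0):
    """Augmented copy SN_i = S_i + r * w, where w is drawn from a
    Gaussian distribution and r lies in [0.001, 0.002]."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(len(s))
    return s + r * w

# a synthetic 440 Hz tone stands in for an utterance loaded via librosa.load
s = np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)
sn = add_gaussian_noise(s, r=0.002)
```

The initial series and its noisy copy together then form the enlarged training set described above.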
根据本公开的一种具体实施方式,所述提取所述样本语音对应帧序列的特征矩阵的步骤,包括:According to a specific embodiment of the present disclosure, the step of extracting the feature matrix of the frame sequence corresponding to the sample speech includes:
将样本语音划分为预设数量的语音帧;Divide the sample speech into a preset number of speech frames;
按照帧序列提取各语音帧的低级描述符特征及一阶导;Extract the low-level descriptor features and first-order derivatives of each speech frame according to the frame sequence;
根据帧序列和各语音帧的低级描述符特征及一阶导,得到对应各类样本语音的特征矩阵。According to the frame sequence and the low-level descriptor features and first-order derivatives of each speech frame, feature matrices corresponding to various sample speeches are obtained.
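The framing-plus-descriptor pipeline above can be illustrated with a toy example; frame log-energy stands in for the real low-level descriptors (which would typically come from a toolkit such as openSMILE or librosa), so the concrete feature choice and frame sizes here are assumptions.

```python
import numpy as np

def frame_feature_matrix(s, frame_len=400, hop=160):
    """Split the waveform into frames, compute one toy low-level
    descriptor (log energy) per frame, append its first-order
    difference, and stack them into a (num_frames, 2) feature matrix
    ordered by frame sequence."""
    n = 1 + (len(s) - frame_len) // hop
    frames = np.stack([s[i * hop : i * hop + frame_len] for i in range(n)])
    lld = np.log(np.sum(frames ** 2, axis=1) + 1e-10)  # one LLD per frame
    delta = np.diff(lld, prepend=lld[:1])              # first-order derivative
    return np.stack([lld, delta], axis=1)

feats = frame_feature_matrix(np.random.default_rng(0).standard_normal(16000))
print(feats.shape)  # (98, 2)
```

A real system would stack many descriptors per frame (pitch, MFCCs, energy, etc.) plus their derivatives, giving a wider matrix with the same frame-by-frame layout.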
根据本公开的一种具体实施方式,所述神经网络包括门控循环单元、注意力层和用 于情感分类的第一全连接层;According to a specific embodiment of the present disclosure, the neural network includes a gated recurrent unit, an attention layer, and a first fully connected layer for sentiment classification;
所述将各类唤醒程度标签对应帧序列的特征矩阵及对应的唤醒程度标签输入神经网络进行训练的步骤,包括:The step of inputting the feature matrix of the frame sequence corresponding to each type of arousal degree label and the corresponding arousal degree label into the neural network for training includes:
将样本语音对应帧序列的特征矩阵及对应的唤醒程度标签馈入所述门控循环单元,在所述门控循环单元内部形成对应各时间步的隐藏状态;The feature matrix of the corresponding frame sequence of the sample speech and the corresponding wake-up degree label are fed into the gated loop unit, and a hidden state corresponding to each time step is formed inside the gated loop unit;
将对应时间序列的隐藏状态模型输入注意力层,确定各时间步的特征权重值;Input the hidden state model of the corresponding time series into the attention layer to determine the feature weight value of each time step;
将对应各时间步的隐藏状态及特征权重值加权求和,得到对应样本语音的级别;Weighted summation of the hidden state and feature weight values corresponding to each time step to obtain the level of the corresponding sample speech;
将所述样本语音的级别输入所述第一全连接层,得到所述样本语音的唤醒程度标签分类结果。The level of the sample speech is input into the first fully connected layer, and the classification result of the arousal degree label of the sample speech is obtained.
根据本公开的一种具体实施方式，所述将样本语音对应帧序列的特征矩阵及对应的唤醒程度标签馈入所述门控循环单元，在所述门控循环单元内部形成对应各时间步的隐藏状态的步骤，包括：According to a specific embodiment of the present disclosure, the step of feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding arousal degree label into the gated recurrent unit, and forming hidden states corresponding to the time steps inside the gated recurrent unit, includes:
将样本语音对应帧序列的特征矩阵及对应的唤醒程度标签馈入所述门控循环单元,在所述门控循环单元内部形成内部隐藏状态ht;Feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding wake-up degree label into the gated loop unit, and an internal hidden state ht is formed inside the gated loop unit;
在每个时间步使用特征x_t和先前时间步的隐藏状态h_{t-1}更新；其中，隐藏状态更新公式为h_t=f_θ(h_{t-1},x_t)，f_θ是权重参数为θ的RNN函数，h_t表示第t个时间步的隐藏状态，x_t表示x={x_{1:t}}中的第t个特征。 At each time step, the hidden state is updated using the feature x_t and the hidden state h_{t-1} of the previous time step; the hidden state update formula is h_t = f_θ(h_{t-1}, x_t), where f_θ is an RNN function with weight parameters θ, h_t denotes the hidden state at the t-th time step, and x_t denotes the t-th feature in x = {x_{1:t}}.
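The recurrence h_t = f_θ(h_{t-1}, x_t) can be sketched with a plain tanh RNN cell in NumPy; a GRU adds update/reset gating on top of the same per-step pattern, and all weight shapes below are illustrative assumptions.

```python
import numpy as np

def rnn_hidden_states(x, W_x, W_h, b):
    """Run h_t = f_theta(h_{t-1}, x_t) over a feature sequence, where
    f_theta here is tanh(W_x x_t + W_h h_{t-1} + b); returns one hidden
    state per time step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))  # 10 time steps, 4-dim frame features
hs = rnn_hidden_states(x, rng.standard_normal((8, 4)) * 0.1,
                       rng.standard_normal((8, 8)) * 0.1, np.zeros(8))
print(hs.shape)  # (10, 8)
```

The stack of per-step hidden states is exactly what the attention layer described next consumes.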
根据本公开的一种具体实施方式，所述将对应时间序列的隐藏状态模型输入注意力层，确定各时间步的特征权重值，将对应各时间步的隐藏状态及特征权重值加权求和，得到对应样本语音的级别的步骤，包括：According to a specific embodiment of the present disclosure, the steps of inputting the hidden states corresponding to the time series into the attention layer, determining the feature weight value of each time step, and weighting and summing the hidden states and feature weight values corresponding to the time steps to obtain the level of the corresponding sample speech include:
计算得到的各时间步的特征权重值 The feature weight value of each time step is computed as

α_t = exp(Wᵀ·h_t) / Σ_{i=1}^{T} exp(Wᵀ·h_i)

以及，样本语音的级别 and the level of the sample speech as

C = Σ_{t=1}^{T} α_t·h_t
其中,α t表示时间步t的特征权重值,h t为门控循环单元输出的隐藏状态,W表示要学习的参数向量,C表示样本语音的级别。 Among them, α t represents the feature weight value at time step t, h t is the hidden state output by the gated recurrent unit, W represents the parameter vector to be learned, and C represents the level of the sample speech.
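A minimal NumPy sketch of this attention pooling, softmax weights over the GRU hidden states followed by a weighted sum, is shown below; the hidden-state and parameter-vector dimensions are illustrative.

```python
import numpy as np

def attention_pool(hs, W):
    """alpha_t = softmax over t of (W . h_t); C = sum_t alpha_t * h_t."""
    scores = hs @ W                        # one scalar score per time step
    alpha = np.exp(scores - scores.max())  # numerically stable softmax
    alpha = alpha / alpha.sum()
    return alpha, alpha @ hs               # weights and pooled level C

rng = np.random.default_rng(0)
hs = rng.standard_normal((10, 8))          # 10 hidden states from the GRU
alpha, C = attention_pool(hs, rng.standard_normal(8))
print(round(float(alpha.sum()), 6), C.shape)  # 1.0 (8,)
```

The pooled vector C is what the fully connected classification layer(s) receive.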
根据本公开的一种具体实施方式,所述神经网络还包括用于性别分类的第二全连接层;According to a specific embodiment of the present disclosure, the neural network further includes a second fully connected layer for gender classification;
所述将对应各时间步的隐藏状态及特征权重值加权求和,得到对应样本语音的级别的步骤之后,所述方法还包括:After the weighted summation of the hidden state and feature weight value corresponding to each time step to obtain the level of the corresponding sample speech, the method further includes:
将所述样本语音的级别输入所述第二全连接层,得到所述样本语音的说话人性别分类结果。The level of the sample speech is input into the second fully connected layer to obtain the speaker gender classification result of the sample speech.
第二方面,本发明实施例提供了一种语音唤醒程度获取方法,所述方法包括:In a second aspect, an embodiment of the present invention provides a method for acquiring a voice arousal degree, the method comprising:
获取待识别语音;Get the speech to be recognized;
将所述待识别语音输入唤醒程度识别模型,输出所述待识别语音的唤醒程度标签,所述唤醒程度识别模型是根据上述任一项所述的唤醒程度识别模型训练方法获得的。The to-be-recognized speech is input into a wake-up level recognition model, and a wake-up level label of the to-be-recognized speech is output, where the wake-up level recognition model is obtained according to any one of the arousal level recognition model training methods described above.
第三方面,本发明实施例提供了一种唤醒程度识别模型训练装置,所述装置包括:In a third aspect, an embodiment of the present invention provides an apparatus for training an arousal degree recognition model, the apparatus comprising:
获取模块,用于获取样本语音的唤醒程度标签,并根据所述样本语音的唤醒程度标签对部分所述样本语音进行数据增强;an acquisition module, used for acquiring the wake-up degree label of the sample voice, and performing data enhancement on part of the sample voice according to the wake-up degree label of the sample voice;
提取模块,用于提取所述样本语音对应帧序列的特征矩阵;an extraction module, for extracting the feature matrix of the corresponding frame sequence of the sample speech;
训练模块,用于将各类唤醒程度标签对应帧序列的特征矩阵及对应的唤醒程度标签输入神经网络进行训练。The training module is used for inputting the feature matrix of the frame sequence corresponding to various arousal degree labels and the corresponding arousal degree labels into the neural network for training.
第四方面,本发明实施例提供了一种语音唤醒程度获取装置,所述装置包括:In a fourth aspect, an embodiment of the present invention provides a device for acquiring a voice wake-up degree, the device comprising:
获取模块,用于获取待识别语音;an acquisition module, used to acquire the speech to be recognized;
识别模块，用于将所述待识别语音输入唤醒程度识别模型，输出所述待识别语音的唤醒程度标签，所述唤醒程度识别模型是根据第一方面中任一项所述的唤醒程度识别模型训练方法获得的。a recognition module, configured to input the speech to be recognized into a wake-up degree recognition model and output a wake-up degree label of the speech to be recognized, where the wake-up degree recognition model is obtained according to the wake-up degree recognition model training method of any one of the first aspect.
第五方面，本发明实施例提供了一种计算机设备，包括存储器以及处理器，所述存储器用于存储计算机程序，所述计算机程序在所述处理器运行时执行第一方面中任一项所述的唤醒程度识别模型训练方法，或者第二方面所述的语音唤醒程度获取方法。In a fifth aspect, an embodiment of the present invention provides a computer device, including a memory and a processor, where the memory is used to store a computer program, and the computer program, when run on the processor, executes the method for training an arousal degree recognition model according to any one of the first aspect, or the method for acquiring a voice arousal degree according to the second aspect.
第六方面，本发明实施例提供了一种计算机可读存储介质，其存储有计算机程序，所述计算机程序在处理器上运行时执行第一方面中任一项所述的唤醒程度识别模型训练方法，或者第二方面所述的语音唤醒程度获取方法。In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program, and the computer program, when run on a processor, executes the method for training an arousal degree recognition model according to any one of the first aspect, or the method for acquiring a voice arousal degree according to the second aspect.
上述本申请提供的唤醒程度识别模型训练方法及语音唤醒程度获取方法，针对不同唤醒程度的样本语音进行特征提取，并输入到神经网络中进行训练，这样即可得到能够识别语音唤醒程度的唤醒程度识别模型。将唤醒程度识别模型应用于语音识别场景，在基础语音识别的基础上增加唤醒程度的识别，增强语音识别的准确性和多样性。In the above-described arousal degree recognition model training method and voice arousal degree acquisition method provided by the present application, features are extracted from sample voices with different arousal degrees and input into a neural network for training, so that an arousal degree recognition model capable of recognizing the arousal degree of speech can be obtained. Applying the arousal degree recognition model to speech recognition scenarios adds arousal degree recognition on top of basic speech recognition, enhancing the accuracy and diversity of speech recognition.
附图说明Description of drawings
为了更清楚地说明本发明的技术方案,下面将对实施例中所需要使用的附图作简单 地介绍,应当理解,以下附图仅示出了本发明的某些实施例,因此不应被看作是对本发明保护范围的限定。在各个附图中,类似的构成部分采用类似的编号。In order to illustrate the technical solutions of the present invention more clearly, the accompanying drawings required in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention, and therefore should not be It is regarded as the limitation of the protection scope of the present invention. In the various figures, similar components are numbered similarly.
图1示出了本申请实施例提供的一种唤醒程度识别模型训练方法的流程示意图;FIG. 1 shows a schematic flowchart of a training method for an arousal degree recognition model provided by an embodiment of the present application;
图2示出了本申请实施例提供的唤醒程度识别模型训练方法所涉及的数据增强的部分流程示意图;FIG. 2 shows a schematic flowchart of part of the data enhancement involved in the training method for an arousal degree identification model provided by an embodiment of the present application;
图3示出了本申请实施例提供的唤醒程度识别模型训练方法所涉及的提取特征矩阵的部分流程示意图;FIG. 3 shows a schematic partial flowchart of a feature matrix extraction involved in a method for training an arousal degree identification model provided by an embodiment of the present application;
图4示出了本申请实施例提供的唤醒程度识别模型训练方法所涉及的模型训练的部分流程示意图;FIG. 4 shows a schematic flowchart of part of the model training involved in the method for training an arousal degree recognition model provided by an embodiment of the present application;
图5示出了本申请实施例提供的唤醒程度识别模型训练方法所涉及的神经网络的部分结构示意图;FIG. 5 shows a schematic structural diagram of a part of the neural network involved in the method for training an arousal degree recognition model provided by an embodiment of the present application;
图6示出了本申请实施例提供的一种语音唤醒程度获取方法的流程示意图;FIG. 6 shows a schematic flowchart of a method for acquiring a voice arousal degree provided by an embodiment of the present application;
图7示出了本申请实施例提供的一种唤醒程度识别模型训练装置的模块框图;FIG. 7 shows a block diagram of a module of an apparatus for training an arousal degree recognition model provided by an embodiment of the present application;
图8示出了本申请实施例提供的一种语音唤醒程度获取装置的模块框图;FIG. 8 shows a block diagram of a module of an apparatus for acquiring a voice wakefulness degree provided by an embodiment of the present application;
图9示出了本申请实施例提供的一种计算机设备的硬件结构图。FIG. 9 shows a hardware structure diagram of a computer device provided by an embodiment of the present application.
具体实施方式Detailed ways
下面将结合本发明实施例中附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments.
通常在此处附图中描述和示出的本发明实施例的组件可以以各种不同的配置来布置和设计。因此,以下对在附图中提供的本发明的实施例的详细描述并非旨在限制要求保护的本发明的范围,而是仅仅表示本发明的选定实施例。基于本发明的实施例,本领域技术人员在没有做出创造性劳动的前提下所获得的所有其他实施例,都属于本发明保护的范围。The components of the embodiments of the invention generally described and illustrated in the drawings herein may be arranged and designed in a variety of different configurations. Thus, the following detailed description of the embodiments of the invention provided in the accompanying drawings is not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments of the invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present invention.
在下文中,可在本发明的各种实施例中使用的术语“包括”、“具有”及其同源词仅意在表示特定特征、数字、步骤、操作、元件、组件或前述项的组合,并且不应被理解为首先排除一个或更多个其它特征、数字、步骤、操作、元件、组件或前述项的组合的存在或增加一个或更多个特征、数字、步骤、操作、元件、组件或前述项的组合的可能性。Hereinafter, the terms "comprising", "having" and their cognates, which may be used in various embodiments of the present invention, are only intended to denote particular features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be construed as first excluding the presence of or adding one or more other features, numbers, steps, operations, elements, components or combinations of the foregoing or the possibility of a combination of the foregoing.
此外,术语“第一”、“第二”、“第三”等仅用于区分描述,而不能理解为指示或暗 示相对重要性。Furthermore, the terms "first", "second", "third", etc. are only used to differentiate the description and should not be construed as indicating or implying relative importance.
除非另有限定,否则在这里使用的所有术语(包括技术术语和科学术语)具有与本发明的各种实施例所属领域普通技术人员通常理解的含义相同的含义。所述术语(诸如在一般使用的词典中限定的术语)将被解释为具有与在相关技术领域中的语境含义相同的含义并且将不被解释为具有理想化的含义或过于正式的含义,除非在本发明的各种实施例中被清楚地限定。Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of this invention belong. The terms (such as those defined in commonly used dictionaries) will be interpreted as having the same meaning as the contextual meaning in the relevant technical field and will not be interpreted as having an idealized or overly formal meaning, unless explicitly defined in the various embodiments of the present invention.
实施例1Example 1
Referring to FIG. 1, which is a schematic flowchart of a training method for an arousal degree recognition model (hereinafter referred to as the model training method) provided by an embodiment of the present invention. As shown in FIG. 1, the model training method mainly includes the following steps:
S101: obtaining arousal labels of the sample speech, and performing data augmentation on part of the sample speech according to the arousal labels of the sample speech;
The model training method provided in this embodiment mainly uses sample speech with known arousal (Arousal) values to train a basic neural network, so as to obtain an arousal degree recognition model capable of recognizing arousal. Arousal represents the level of emotional physiological activation; for example, "anger" or "excitement" corresponds to a higher arousal level than calm.
Arousal labels are usually continuous emotional attributes, with raw label values distributed in [1, 5]. To facilitate discrimination, the continuous attribute can be discretized into three classes, for example by dividing the continuous arousal values into three intervals: arousal values in [1, 2] are classified as the first (relatively low) arousal level, values in (2, 4) as the second (medium) arousal level, and values in [4, 5] as the third (relatively high) arousal level. For ease of description, the speech belonging to these three classes can be relabeled as 1, 2, 3, etc., which turns the problem into a three-class emotion classification problem on the arousal label. Of course, other classification schemes are also possible, such as four labels (zero, low, medium, and high), without limitation.
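The interval scheme above can be sketched as a small mapping function; the function name and the integer class labels are illustrative, not from the original:

```python
def arousal_class(arousal):
    """Map a continuous arousal value in [1, 5] onto the three discrete
    levels described above: [1, 2] -> 1 (low), (2, 4) -> 2 (medium),
    [4, 5] -> 3 (high)."""
    if not 1.0 <= arousal <= 5.0:
        raise ValueError("arousal value must lie in [1, 5]")
    if arousal <= 2.0:
        return 1  # relatively low arousal
    if arousal < 4.0:
        return 2  # medium arousal
    return 3      # relatively high arousal
```

The boundary values 2 and 4 are assigned to the closed intervals, matching the [1, 2] / (2, 4) / [4, 5] notation in the text.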
When preparing the sample speech for training the arousal degree recognition model, sample speech of each arousal level must be prepared and labeled with the corresponding arousal labels, so that the neural network can learn the speech features of different arousal levels.
Sample speech can be obtained in various ways. According to a specific implementation of the present disclosure, the step in S101 of obtaining sample speech corresponding to the various arousal levels may include:
selecting, from a preset data set, first-class sample speech corresponding to the first arousal label, second-class sample speech corresponding to the second arousal label, and third-class sample speech corresponding to the third arousal label.
Regarding the coverage of arousal levels, the arousal of the speech to be recognized can be divided into three levels, whose labels are defined as the first, second, and third arousal labels, respectively, and the arousal corresponding to these three labels can be set to increase in that order. The corresponding sample speech is then obtained according to each arousal label: first-class sample speech with relatively low arousal is selected for the first arousal label, second-class sample speech with medium arousal for the second arousal label, and third-class sample speech with relatively high arousal for the third arousal label.
Further, the IEMOCAP data set is one of the most widely used data sets in speech emotion recognition: the whole data set is relatively well standardized from dialogue design to emotion annotation, it contains many dialogues, and its annotations include both discrete and continuous emotion labels, which meets the requirements of the present invention. Therefore, in this embodiment, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) data set is selected as the preset data set. In other embodiments, other qualifying data sets may also be selected.
When extracting sample speech from the IEMOCAP data set, the arousal value recorded for each sample in the data set can be used: for example, samples with arousal values in [1, 2] serve as the first-class sample speech, samples with arousal values in (2, 4) as the second-class sample speech, and samples with arousal values in [4, 5] as the third-class sample speech. Other division and selection schemes are of course possible, without limitation. In addition, model training requires a relatively large number of speech samples to reach higher recognition accuracy. Since the number of samples obtained from the preset data set, or the IEMOCAP data set, is small, the total number of samples can be expanded through data augmentation to improve the recognition accuracy of the trained model.
To optimize the training effect, the numbers of input samples of the various classes should preferably be equal or close. According to a specific implementation of the present disclosure, as shown in FIG. 2, the step in S101 of obtaining the arousal labels of the sample speech and performing data augmentation on part of the sample speech according to those labels includes:
S201: judging whether the difference between the numbers of samples under the various arousal labels is greater than or equal to a preset difference;
S202: if the difference between the numbers of samples under the various arousal labels is greater than or equal to the preset difference, performing data augmentation on the under-represented sample speech until the difference between the numbers of samples under the various arousal labels is smaller than the preset difference.
In this implementation, the number of training samples allowed per class may be preset to about 3000, and the difference between the numbers of samples of the various classes is compared against the preset difference. The preset difference may be set to 0, requiring the class sizes to be exactly equal, or to another value greater than 0, allowing a partial difference between the class sizes.
In a specific implementation, after the sample speech is obtained, it is first judged whether the difference between the numbers of samples under the various arousal labels is greater than or equal to the preset difference. If the actual difference is greater than or equal to the preset difference, data augmentation is performed on the under-represented sample speech; if the actual difference is smaller than the preset difference, no data augmentation of the sample speech is needed.
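The S201 check can be sketched as a one-line comparison; the function name and the dict-based class counts are assumptions for illustration:

```python
def needs_augmentation(class_counts, preset_diff):
    """S201: return True when the gap between the largest and smallest
    class is greater than or equal to the preset difference, which
    triggers augmentation of the under-represented classes (S202)."""
    counts = list(class_counts.values())
    return max(counts) - min(counts) >= preset_diff
```

For example, with 1000/4000/3500 samples per class and a preset difference of 500, augmentation is triggered.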
In a specific implementation, the above step of performing data augmentation on the under-represented sample speech may include:
adding noise to the initial sample speech to obtain augmented speech;
using the initial sample speech together with the augmented speech as the sample speech for training.
Further, the step of adding noise to the sample speech to obtain the augmented speech includes:
loading the sample audio with the librosa library to obtain a floating-point time series;
computing the following formula on the floating-point time series S to obtain the noise-augmented speech SN_i:

SN_i = S_i + r · w_i,  (1)

where i = 1, 2, ..., L, S_i denotes the i-th element of the floating-point time series, L denotes the length of the time series, r is the coefficient of w, with r taking values in [0.001, 0.002], and w is a sequence of floating-point numbers obeying a Gaussian distribution. In this embodiment, the noise is white Gaussian noise.
For example, initially there are 1000 low-class samples, 4000 medium-class samples, and 3500 high-class samples. For the low class, r = 0.001 can first be taken and noise added to the initial samples to obtain 1000 new samples, increasing the low-class training samples to 2000. If r = 0.002 is then taken on this basis and noise is added to the original samples again, the low-class samples can be increased to 3000 or even more. The specific difference can be customized according to the sample type or the required recognition accuracy of the model. In Python, w is generated by numpy.random.normal(0, 1, len(S)); it is essentially a sequence of L Gaussian-distributed numbers.
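The augmentation step can be sketched with NumPy alone; in the described pipeline the series S would come from librosa.load(), but a plain array stands in here so the sketch is self-contained:

```python
import numpy as np

def add_gaussian_noise(signal, r=0.001):
    """Compute SN = S + r * w, where w ~ N(0, 1) has the same length
    as S and r is a small coefficient in [0.001, 0.002], so the added
    white Gaussian noise is barely audible."""
    w = np.random.normal(0, 1, len(signal))  # Gaussian floats, length L
    return signal + r * w

# Example: one augmentation pass doubles an under-represented class.
originals = [np.zeros(16000) for _ in range(3)]        # stand-in clips
augmented = [add_gaussian_noise(s, r=0.001) for s in originals]
training_set = originals + augmented                   # 3 -> 6 samples
```

A second pass with r = 0.002 over the same originals would triple the class, as in the worked example above.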
Augmenting the speech data by adding noise avoids producing exact copies of the original speech: the noised audio differs from the original, and because r is set to a small value the difference is barely audible to the human ear, so the emotion before and after adding noise is unaffected.
In this implementation, adding noise to the speech of under-represented classes achieves data augmentation, alleviates the size differences among the low, medium, and high classes, and ensures that no single class dominates any batch, thereby preventing, to some extent, the trained model from always biasing its predictions toward the majority class. Alternatively, when obtaining the sample speech, the numbers of samples of the various classes can be directly constrained to differ by less than the preset difference, or the sample speech can simply be duplicated as-is for augmentation, to reduce the impact on the training effect.
S102: extracting the feature matrix of the frame sequence corresponding to the sample speech;
After the sample speech corresponding to the various arousal levels is obtained, each sample is divided into frames to obtain the frame sequence corresponding to each sample. The feature matrix corresponding to the frame sequence is extracted and used to learn and summarize the speech features of the various arousal levels.
Specifically, according to a specific implementation of the present disclosure, the step in S102 of extracting the feature matrix of the frame sequence corresponding to the sample speech, as shown in FIG. 3, may include:
S301: dividing the sample speech into a preset number of speech frames;
S302: extracting the low-level descriptor features and their first-order derivatives for each speech frame in frame order;
S303: obtaining the feature matrices corresponding to the various classes of sample speech according to the frame sequence and the low-level descriptor features and first-order derivatives of each speech frame.
In speech emotion recognition, the sample speech is divided into speech frames along the time axis, and the features of adjacent frames are correlated, or even overlap, over adjacent periods. In the feature extraction stage, the Opensmile tool can be used to extract Low-Level Descriptor (LLD) features and their first-order derivatives; the low-level descriptor set may be IS13_compare. With 65 LLD features and 65 first-order derivatives of those features, the total number of features is 65 + 65 = 130.
When framing the sample speech, the frame length can be set to 20 ms and the frame shift to 10 ms. In the IEMOCAP data set the length of each utterance is not fixed, so the number of frames extracted per utterance also varies. In a specific implementation, the maximum number of frames per utterance can be uniformly set to 750: if the actual number of frames (frame_num) is less than 750, a padding operation is performed, appending (750 − frame_num) rows of zeros after the extracted two-dimensional features; if the actual number of frames is greater than 750, a truncation operation is performed. The feature matrix of each sample is thus frames × features, i.e., a two-dimensional matrix of size 750 × 130.
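The padding/truncation rule can be sketched as below; this is a minimal sketch that assumes the matrix already holds the 130 per-frame LLD features:

```python
import numpy as np

MAX_FRAMES = 750    # maximum frames per utterance
NUM_FEATURES = 130  # 65 LLDs + 65 first-order derivatives

def pad_or_truncate(features):
    """Force a (frame_num, 130) feature matrix to the fixed shape
    (750, 130): append rows of zeros when frame_num < 750, otherwise
    truncate to the first 750 frames."""
    frame_num = features.shape[0]
    if frame_num < MAX_FRAMES:
        zeros = np.zeros((MAX_FRAMES - frame_num, NUM_FEATURES))
        return np.vstack([features, zeros])
    return features[:MAX_FRAMES]
```

Every utterance then yields the same 750 × 130 matrix regardless of its original duration, which is what a fixed-shape network input requires.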
S103: inputting the feature matrices of the frame sequences corresponding to the various arousal labels and the corresponding arousal labels into a neural network, and learning and training to obtain the arousal degree recognition model.
After the feature matrices corresponding to the samples of the various arousal labels are obtained through the above steps, the feature matrices and the corresponding arousal labels can be input into a pre-prepared neural network for training, which learns and summarizes the features, so as to obtain an arousal degree recognition model capable of recognizing different speech arousal levels.
According to a specific implementation of the present disclosure, as shown in FIGS. 2 and 4, the feature matrices of the frame sequences corresponding to the various arousal labels and the corresponding arousal labels are input into the neural network for training. As shown in FIG. 5, the neural network includes a gated recurrent unit, an attention layer, and a first fully connected layer for emotion classification. In this implementation, the neural network encoding the feature matrix is a Recurrent Neural Network (RNN) that comprises, in order, a Gated Recurrent Unit (GRU), an attention layer, and the first fully connected layer, with a data-transfer relationship between adjacent layers; usually the output of the upper layer is the input of the lower layer. Of course, the gated encoding unit may also be another encoding unit, such as a Long Short-Term Memory (LSTM) layer, without limitation.
As shown in FIGS. 4 and 5, the method may specifically include:
S401: feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding arousal label into the gated recurrent unit, forming a hidden state corresponding to each time step inside the gated recurrent unit;
According to a specific implementation of the present disclosure, the step of feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding arousal label into the gated recurrent unit and forming the hidden states corresponding to each time step inside the gated recurrent unit includes:
feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding arousal label into the gated recurrent unit, forming an internal hidden state h_t inside the gated recurrent unit;
updating at each time step using the feature x_t and the hidden state h_{t-1} of the previous time step, where the hidden-state update formula is:

h_t = f_θ(h_{t-1}, x_t),  (2)

where f_θ is the RNN function with weight parameters θ, h_t denotes the hidden state at the t-th time step, and x_t denotes the t-th feature in x = {x_{1:t}}.
S402: inputting the hidden states of the corresponding time sequence into the attention layer and determining the feature weight of each time step;
The attention layer is used to focus on the emotion-relevant parts. Specifically, as shown in FIG. 4, at time step t the output of the GRU is h_t, and the normalized importance weight is first computed with the softmax function:

α_t = exp(W · h_t) / Σ_{τ=1}^{T} exp(W · h_τ),  (3)

where α_t denotes the feature weight at time step t, h_t is the hidden state output by the gated recurrent unit, W denotes the parameter vector to be learned, and T is the total number of time steps.
S403: computing the weighted sum of the hidden states and feature weights corresponding to all time steps to obtain the utterance-level representation of the sample speech;
A weighted sum is then performed according to these weights, summing the hidden states of all time steps weighted by their feature weights, to obtain the utterance-level representation of the sample speech:

C = Σ_{t=1}^{T} α_t h_t.  (4)
S404: inputting the utterance-level representation of the sample speech into the first fully connected layer to obtain the arousal classification result of the sample speech.
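The softmax weighting and the weighted sum described above can be sketched together as one NumPy attention-pooling step; the hidden states and the parameter vector W are taken as given arrays here, since training itself is out of scope for the sketch:

```python
import numpy as np

def attention_pool(h, W):
    """Given GRU hidden states h of shape (T, d) and a learned parameter
    vector W of shape (d,), compute the normalized weights
    alpha_t = softmax(W . h_t) and return C = sum_t alpha_t * h_t."""
    scores = h @ W                   # un-normalized scores, shape (T,)
    scores = scores - scores.max()   # shift for numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum()      # softmax over time steps
    return alpha @ h                 # utterance-level representation C
```

With two hidden states and a W strongly aligned with the first one, the pooled C lies almost entirely on that first state, showing how attention emphasizes emotion-relevant frames.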
The sentence-level representation C obtained through the attention layer is input into the emotion classification network, i.e., the first fully connected layer, for emotion classification. In addition, in order to perform multi-task classification, on the basis of the first fully connected layer and according to a specific implementation of the present disclosure, the neural network further includes a second fully connected layer for gender classification.
After the step of computing the weighted sum of the hidden states and feature weights corresponding to all time steps to obtain the utterance-level representation of the sample speech, the method further includes:
inputting the utterance-level representation of the sample speech into the second fully connected layer to obtain the speaker gender classification result of the sample speech.
In this implementation, the multi-task setup includes emotion classification and gender classification, where gender classification is a binary task serving as an auxiliary task to emotion classification. The emotion classification network includes the first fully connected layer and a softmax layer; the gender classification network includes the second fully connected layer and a softmax layer. The structure is shown in FIG. 5, where y_E denotes the predicted probabilities that a sentence belongs to the low, medium, and high emotion classes, and y_G denotes the predicted probabilities that the speaker of a sentence is male or female. The loss equation for multi-task classification is as follows:
L = α · l_emotion + β · l_gender,  (5)
where l_emotion and l_gender denote the emotion classification loss and the gender classification loss, respectively, and α and β denote the weights of the two tasks; in this work both are set to 1. The loss functions of both tasks are cross-entropy losses, computed as follows:
l_emotion = −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} y_{i,k} log p_{i,k},  (6)
where N denotes the total number of samples, K is the total number of emotion classes, y_{i,k} denotes the true probability that the i-th sample belongs to class k, and p_{i,k} denotes the predicted probability that the i-th sample belongs to class k.
l_gender = −(1/N) Σ_{i=1}^{N} [ y_i log p_i + (1 − y_i) log(1 − p_i) ],  (7)
where y_i denotes the true label of the sample, and p_i denotes the predicted probability that the sample belongs to class 1.
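The loss computation described above can be sketched as follows; the labels are assumed one-hot for emotion and binary for gender, and the function names are illustrative:

```python
import numpy as np

def cross_entropy(y_true, p_pred):
    """Multi-class cross entropy:
    -(1/N) * sum_i sum_k y_{i,k} * log(p_{i,k})."""
    return -np.mean(np.sum(y_true * np.log(p_pred), axis=1))

def binary_cross_entropy(y_true, p_pred):
    """Binary cross entropy for the auxiliary gender task:
    -(1/N) * sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]."""
    return -np.mean(y_true * np.log(p_pred)
                    + (1 - y_true) * np.log(1 - p_pred))

def multitask_loss(y_emo, p_emo, y_gen, p_gen, alpha=1.0, beta=1.0):
    """Weighted sum of the two task losses; both weights are 1 in the text."""
    return (alpha * cross_entropy(y_emo, p_emo)
            + beta * binary_cross_entropy(y_gen, p_gen))
```

A confident, correct emotion prediction yields a much smaller combined loss than a confidently wrong one, which is the gradient signal the multi-task head trains on.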
In summary, the model training method provided by the present application extracts features from sample speech with different arousal labels and inputs them into a neural network for training, so that an arousal degree recognition model capable of recognizing speech arousal labels is obtained. Applying the arousal degree recognition model to speech recognition scenarios adds arousal recognition on top of basic speech recognition, enhancing the accuracy and diversity of speech recognition.
Example 2
Referring to FIG. 6, which is a schematic flowchart of a method for obtaining a speech arousal degree provided by an embodiment of the present invention. As shown in FIG. 6, the method includes the following steps:
S601: obtaining the speech to be recognized;
S602: inputting the speech to be recognized into an arousal degree recognition model and outputting the arousal label of the speech to be recognized.
The arousal degree recognition model is obtained according to the arousal degree recognition model training method described in the above embodiment.
In this implementation, the arousal degree recognition model built in the above embodiment is loaded into a computer device and applied to speech arousal acquisition scenarios. The speech to be recognized is input into the computer device loaded with the arousal degree recognition model, which then outputs the arousal degree of that speech. The speech to be recognized may be speech collected by the computer device, or speech obtained from the network or other channels.
For the specific implementation of the speech arousal degree acquisition method provided in this embodiment, reference may be made to the specific implementation of the arousal degree recognition model training method provided by the embodiment shown in FIG. 1, which will not be repeated here.
Example 3
Referring to FIG. 7, which is a module block diagram of an apparatus for training an arousal degree recognition model provided by an embodiment of the present invention. As shown in FIG. 7, the arousal degree recognition model training apparatus 700 mainly includes:
an obtaining module 701, configured to obtain the arousal labels of the sample speech and perform data augmentation on part of the sample speech according to those labels;
an extraction module 702, configured to extract the feature matrix of the frame sequence corresponding to the sample speech;
a training module 703, configured to input the feature matrices of the frame sequences corresponding to the various arousal labels and the corresponding arousal labels into the neural network for training.
Example 4
Referring to FIG. 8, which is a module block diagram of an apparatus for obtaining a speech arousal degree provided by an embodiment of the present invention. As shown in FIG. 8, the speech arousal degree obtaining apparatus 800 includes:
an obtaining module 801, configured to obtain the speech to be recognized;
a recognition module 802, configured to input the speech to be recognized into an arousal degree recognition model and output the arousal label of the speech to be recognized, the arousal degree recognition model being obtained according to the arousal degree recognition model training method described in the above embodiment.
In addition, an embodiment of the present disclosure provides a computer device including a memory and a processor, where the memory stores a computer program that, when running on the processor, executes the arousal degree recognition model training method or the speech arousal degree acquisition method provided by the above method embodiments.
具体的,如图9所示,为实现本发明各个实施例的一种计算机设备,该计算机设备900包括但不限于:射频单元901、网络模块902、音频输出单元903、输入单元904、传感器905、显示单元906、用户输入单元907、接口单元908、存储器909、处理器910、以及电源911等部件。本领域技术人员可以理解,图9中示出的计算机设备结构并不构成对计算机设备的限定,计算机设备可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。在本发明实施例中,计算机设备包括但不限于手机、平板电脑、笔记本电脑、掌上电脑、车载终端、可穿戴设备、以及计步器等。Specifically, as shown in FIG. 9 , in order to implement a computer device according to various embodiments of the present invention, the computer device 900 includes but is not limited to: a radio frequency unit 901 , a network module 902 , an audio output unit 903 , an input unit 904 , and a sensor 905 , a display unit 906 , a user input unit 907 , an interface unit 908 , a memory 909 , a processor 910 , and a power supply 911 and other components. Those skilled in the art can understand that the structure of the computer device shown in FIG. 9 does not constitute a limitation on the computer device, and the computer device may include more or less components than the one shown, or combine some components, or different components layout. In this embodiment of the present invention, the computer equipment includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
应理解的是,本发明实施例中,射频单元901可用于收发信息或通话过程中,信号的接收和发送,具体的,将来自基站的下行数据接收后,给处理器910处理;另外,将上行的数据发送给基站。通常,射频单元901包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器、双工器等。此外,射频单元901还可以通过无线通信系统与网络和其他设备通信。It should be understood that, in this embodiment of the present invention, the radio frequency unit 901 can be used for receiving and sending signals during sending and receiving of information or during a call. Specifically, after receiving the downlink data from the base station, it is processed by the processor 910; The uplink data is sent to the base station. Generally, the radio frequency unit 901 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 901 can also communicate with the network and other devices through a wireless communication system.
计算机设备通过网络模块902为用户提供了无线的宽带互联网访问,如帮助用户收发电子邮件、浏览网页和访问流式媒体等。The computer device provides the user with wireless broadband Internet access through the network module 902, such as helping the user to send and receive emails, browse the web, access streaming media, and so on.
音频输出单元903可以将射频单元901或网络模块902接收的或者在存储器909中存储的音频数据转换成音频信号并且输出为声音。而且,音频输出单元903还可以提供与计算机设备900执行的特定功能相关的音频输出(例如,呼叫信号接收声音、消息接收声音等等)。音频输出单元903包括扬声器、蜂鸣器以及受话器等。The audio output unit 903 may convert audio data received by the radio frequency unit 901 or the network module 902 or stored in the memory 909 into audio signals and output as sound. Also, the audio output unit 903 may also provide audio output related to a specific function performed by the computer device 900 (eg, call signal reception sound, message reception sound, etc.). The audio output unit 903 includes a speaker, a buzzer, a receiver, and the like.
输入单元904用于接收音频或视频信号。输入单元904可以包括图形处理器(Graphics Processing Unit,简称GPU)9041和麦克风9042,图形处理器9041对在视频捕获模式或图像捕获模式中由图像捕获计算机设备(如摄像头)获得的静态图片或视频的图 像数据进行处理。处理后的图像帧可以视频播放在显示单元906上。经图形处理器9041处理后的图像帧可以存储在存储器909(或其它存储介质)中或者经由射频单元901或网络模块902进行发送。麦克风9042可以接收声音,并且能够将这样的声音处理为音频数据。处理后的音频数据可以在电话通话模式的情况下转换为可经由射频单元901发送到移动通信基站的格式输出。The input unit 904 is used to receive audio or video signals. The input unit 904 may include a graphics processor (Graphics Processing Unit, GPU for short) 9041 and a microphone 9042, and the graphics processor 9041 is used for still pictures or videos obtained by an image capture computer device (such as a camera) in a video capture mode or an image capture mode. image data for processing. The processed image frames can be video-played on the display unit 906 . The image frames processed by the graphics processor 9041 may be stored in the memory 909 (or other storage medium) or transmitted via the radio frequency unit 901 or the network module 902 . The microphone 9042 can receive sound and can process such sound into audio data. The processed audio data can be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 901 for output in the case of a telephone call mode.
The computer device 900 further includes at least one sensor 905, including at least the barometer mentioned in the above embodiments. In addition, the sensor 905 may also be another sensor such as a light sensor or a motion sensor. Specifically, the light sensor includes an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display panel 9061 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 9061 and/or the backlight when the computer device 900 is moved to the ear. As a kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes) and can detect the magnitude and direction of gravity when stationary; it can be used to identify the posture of the computer device (such as switching between landscape and portrait screens, related games, and magnetometer posture calibration) and for vibration-recognition-related functions (such as a pedometer or tapping). The sensor 905 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and the like, which will not be described in detail here.
The display unit 906 is used to display information input by the user or information provided to the user. The display unit 906 may include a display panel 9061, which may take the form of a liquid crystal display (LCD) panel, an organic light-emitting diode (OLED) panel, or the like.
The user input unit 907 may be used to receive input numeric or character information and to generate key signal input related to user settings and function control of the computer device. Specifically, the user input unit 907 includes a touch panel 9071 and other input devices 9072. The touch panel 9071, also called a touch screen, can collect the user's touch operations on or near it (such as operations performed by the user on or near the touch panel 9071 with a finger, a stylus, or any other suitable object or accessory). The touch panel 9071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 910, and receives and executes commands sent by the processor 910. In addition, the touch panel 9071 can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 9071, the user input unit 907 may also include other input devices 9072. Specifically, the other input devices 9072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, which will not be described in detail here.
Further, the touch panel 9071 can be overlaid on the display panel 9061. When the touch panel 9071 detects a touch operation on or near it, it transmits the operation to the processor 910 to determine the type of the touch event, and the processor 910 then provides a corresponding visual output on the display panel 9061 according to the type of the touch event. Although in FIG. 9 the touch panel 9071 and the display panel 9061 are implemented as two independent components to realize the input and output functions of the computer device, in some embodiments the touch panel 9071 and the display panel 9061 may be integrated to realize the input and output functions of the computer device, which is not specifically limited here.
The interface unit 908 is an interface for connecting an external device to the computer device 900. For example, the external device may include a wired or wireless headset port, an external power (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device with an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 908 may be used to receive input (for example, data information, power, etc.) from an external device and transmit the received input to one or more elements within the computer device 900, or may be used to transfer data between the computer device 900 and an external device.
The memory 909 may be used to store software programs and various data. The memory 909 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function (such as a sound playback function or an image playback function), and the like, and the data storage area may store data created according to the use of the mobile phone (such as audio data and a phone book). In addition, the memory 909 may include a high-speed random access memory, and may also include a non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The processor 910 is the control center of the computer device. It uses various interfaces and lines to connect all parts of the entire computer device, and performs the various functions of the computer device and processes data by running or executing the software programs and/or modules stored in the memory 909 and calling the data stored in the memory 909, thereby monitoring the computer device as a whole. The processor 910 may include one or more processing units; preferably, the processor 910 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 910.
The computer device 900 may further include a power supply 911 (such as a battery) for supplying power to the various components. Preferably, the power supply 911 may be logically connected to the processor 910 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
In addition, the computer device 900 includes some functional modules that are not shown, which will not be described in detail here.
The memory is used to store a computer program which, when run on the processor, executes the above-described arousal degree recognition model training method or speech arousal degree acquisition method.
In addition, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when run on a processor, executes the above-described arousal degree recognition model training method or speech arousal degree acquisition method.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may also be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings show the possible architectures, functions, and operations of apparatuses, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in a block may occur out of the order noted in the figures. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the functional modules or units in the various embodiments of the present invention may be integrated together to form an independent part, each module may exist alone, or two or more modules may be integrated to form an independent part.
If the functions are implemented in the form of software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a smartphone, a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present invention, and all such changes or substitutions shall fall within the protection scope of the present invention.

Claims (15)

  1. A method for training an arousal degree recognition model, characterized in that the method comprises:
    acquiring an arousal degree label of sample speech, and performing data enhancement on part of the sample speech according to the arousal degree label of the sample speech;
    extracting a feature matrix of a frame sequence corresponding to the sample speech; and
    inputting the feature matrices of the frame sequences corresponding to the respective arousal degree labels and the corresponding arousal degree labels into a neural network for training.
  2. The method according to claim 1, characterized in that the step of acquiring the arousal degree label of the sample speech comprises:
    selecting, from a preset data set, a first type of sample speech corresponding to a first arousal degree label, a second type of sample speech corresponding to a second arousal degree label, and a third type of sample speech corresponding to a third arousal degree label.
  3. The method according to claim 2, characterized in that the step of performing data enhancement on part of the sample speech according to the arousal degree label of the sample speech comprises:
    determining whether the difference between the numbers of sample speeches of the respective arousal degree labels is greater than or equal to a preset number difference; and
    if the difference between the numbers of sample speeches of the respective arousal degree labels is greater than or equal to the preset number difference, performing data enhancement processing on the sample speeches of smaller number until the difference between the numbers of sample speeches of the respective arousal degree labels is less than the preset number difference.
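The balancing loop described in claim 3 can be sketched in a few lines of Python. This is a minimal illustration under assumptions of my own (the threshold value, the dictionary layout, and the trivial `augment` callback are placeholders), not the patent's implementation:

```python
import random

def balance_by_augmentation(samples_by_label, max_diff, augment):
    """Augment the under-represented label class until the count gap
    between the largest and smallest classes drops below max_diff."""
    counts = {label: len(s) for label, s in samples_by_label.items()}
    while max(counts.values()) - min(counts.values()) >= max_diff:
        # pick the label with the fewest samples and augment one of its samples
        label = min(counts, key=counts.get)
        source = random.choice(samples_by_label[label])
        samples_by_label[label].append(augment(source))
        counts[label] += 1
    return samples_by_label

# Toy usage: the "augmentation" here simply copies the sample.
data = {"low": [[0.1]] * 2, "medium": [[0.2]] * 8, "high": [[0.3]] * 7}
balanced = balance_by_augmentation(data, max_diff=2, augment=lambda s: list(s))
```

In the patent's setting, `augment` would be the noise-addition routine of claims 4 and 5 rather than a copy.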
  4. The method according to claim 3, characterized in that the step of performing data enhancement processing on the sample speeches of smaller number comprises:
    adding noise to initial sample speech to obtain augmented speech; and
    taking the combination of the initial sample speech and the augmented speech as the sample speech used for training.
  5. The method according to claim 4, characterized in that the step of adding noise to the sample speech to obtain the augmented speech comprises:
    loading the sample speech using the librosa library to obtain a floating-point time series; and
    calculating the noise-added augmented speech SN_i from the floating-point time series S according to the following formula:

    SN_i = S_i + r · w,

    where i = 1, 2, ..., L; S_i denotes the i-th element of the floating-point time series; L denotes the length of the floating-point time series; r is the coefficient of w, with a value range of [0.001, 0.002]; and w is a floating-point number obeying a Gaussian distribution.
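A runnable sketch of claim 5's noise addition. In practice the floating-point time series would come from `librosa.load`; here a plain Python list stands in for it so the example stays self-contained, and `random.gauss` supplies the Gaussian values w:

```python
import random

def add_gaussian_noise(series, r=0.001, seed=0):
    """SN_i = S_i + r * w, with w drawn from a standard Gaussian.
    r is kept inside the range [0.001, 0.002] stated in the claim."""
    assert 0.001 <= r <= 0.002, "claim 5 restricts r to [0.001, 0.002]"
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [s + r * rng.gauss(0.0, 1.0) for s in series]

# In practice: series, sr = librosa.load("sample.wav", sr=None)
clean = [0.0, 0.5, -0.5, 0.25]
noisy = add_gaussian_noise(clean, r=0.002)
```

Because r is at most 0.002, the perturbation is tiny relative to typical speech amplitudes, which matches the claim's intent of augmenting rather than corrupting the sample.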
  6. The method according to any one of claims 1 to 5, characterized in that the step of extracting the feature matrix of the frame sequence corresponding to the sample speech comprises:
    dividing the sample speech into a preset number of speech frames;
    extracting the low-level descriptor features and first-order derivatives of each speech frame according to the frame sequence; and
    obtaining the feature matrices corresponding to the respective types of sample speech according to the frame sequence and the low-level descriptor features and first-order derivatives of each speech frame.
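Claim 6's framing and per-frame feature extraction can be sketched as follows. The patent does not name the specific low-level descriptors, so frame energy and zero-crossing rate are used here purely as illustrative LLDs, and the first-order derivative is approximated by the difference between consecutive frames:

```python
def frame_signal(signal, num_frames):
    """Split the sample into a preset number of equal-length frames."""
    size = max(1, len(signal) // num_frames)
    return [signal[i * size:(i + 1) * size] for i in range(num_frames)]

def lld_features(frame):
    """Illustrative low-level descriptors: mean energy and zero-crossing rate."""
    energy = sum(x * x for x in frame) / len(frame)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / len(frame)
    return [energy, zcr]

def feature_matrix(signal, num_frames):
    llds = [lld_features(f) for f in frame_signal(signal, num_frames)]
    # first-order derivative: difference with the previous frame (zeros for frame 0)
    deltas = [[0.0] * len(llds[0])] + [
        [c - p for c, p in zip(cur, prev)] for prev, cur in zip(llds, llds[1:])
    ]
    # one row per frame: LLDs concatenated with their first-order derivatives
    return [lld + d for lld, d in zip(llds, deltas)]

mat = feature_matrix([0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8], num_frames=4)
```

The resulting matrix has one row per frame in sequence order, which is the shape the gated recurrent unit of claim 7 consumes.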
  7. The method according to claim 6, characterized in that the neural network comprises a gated recurrent unit, an attention layer, and a first fully connected layer for emotion classification;
    the step of inputting the feature matrices of the frame sequences corresponding to the respective arousal degree labels and the corresponding arousal degree labels into the neural network for training comprises:
    feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding arousal degree label into the gated recurrent unit, and forming hidden states corresponding to the respective time steps inside the gated recurrent unit;
    inputting the hidden states of the corresponding time series into the attention layer to determine a feature weight value of each time step;
    weighting and summing the hidden states and feature weight values corresponding to the respective time steps to obtain the level of the corresponding sample speech; and
    inputting the level of the sample speech into the first fully connected layer to obtain an arousal degree label classification result of the sample speech.
  8. The method according to claim 7, characterized in that the step of feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding arousal degree label into the gated recurrent unit, and forming the hidden states corresponding to the respective time steps inside the gated recurrent unit, comprises:
    feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding arousal degree label into the gated recurrent unit, and forming an internal hidden state h_t inside the gated recurrent unit; and
    updating, at each time step, using the feature x_t and the hidden state h_{t-1} of the previous time step, wherein the hidden state update formula is h_t = f_θ(h_{t-1}, x_t), which is an RNN function with weight parameter θ, h_t represents the hidden state at the t-th time step, and x_t represents the t-th feature in x = {x_{1:t}}.
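The update h_t = f_θ(h_{t-1}, x_t) of claim 8 is realized by the standard gated-recurrent-unit equations. A minimal pure-Python single-unit GRU step is sketched below; the weights in `theta` are arbitrary placeholders, not trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h_prev, x, theta):
    """One GRU update h_t = f_theta(h_{t-1}, x_t) for a single hidden unit."""
    z = sigmoid(theta["wz"] * x + theta["uz"] * h_prev + theta["bz"])  # update gate
    r = sigmoid(theta["wr"] * x + theta["ur"] * h_prev + theta["br"])  # reset gate
    h_cand = math.tanh(theta["wh"] * x + theta["uh"] * (r * h_prev) + theta["bh"])
    return (1.0 - z) * h_prev + z * h_cand  # interpolate old state and candidate

theta = {"wz": 0.5, "uz": 0.1, "bz": 0.0,
         "wr": 0.4, "ur": 0.2, "br": 0.0,
         "wh": 0.9, "uh": 0.3, "bh": 0.0}

# Run the sequence of per-frame features x_1..x_T through the recurrence.
h = 0.0
hidden_states = []
for x in [0.2, -0.1, 0.7]:
    h = gru_step(h, x, theta)
    hidden_states.append(h)
```

The list of per-step hidden states is exactly what the attention layer of claim 9 pools over.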
  9. The method according to claim 8, characterized in that the step of inputting the hidden states of the corresponding time series into the attention layer, determining the feature weight value of each time step, and weighting and summing the hidden states and feature weight values corresponding to the respective time steps to obtain the level of the corresponding sample speech comprises:
    calculating the feature weight value of each time step as

    α_t = exp(Wᵀh_t) / Σ_{τ=1..T} exp(Wᵀh_τ),

    and the level of the sample speech as

    C = Σ_{t=1..T} α_t · h_t,

    where α_t represents the feature weight value at time step t, h_t is the hidden state output by the gated recurrent unit, W represents the parameter vector to be learned, and C represents the level of the sample speech.
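Claim 9's attention pooling — a softmax over a learned scoring of each hidden state, followed by a weighted sum — can be sketched in pure Python. The score Wᵀh_t is a plain dot product here, and the vector `w` is an arbitrary placeholder rather than a learned parameter:

```python
import math

def attention_pool(hidden_states, w):
    """alpha_t = softmax_t(w . h_t); C = sum_t alpha_t * h_t (element-wise)."""
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in hidden_states]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, hidden_states)) for d in range(dim)]
    return alphas, pooled

# Toy hidden states from three time steps of a 2-unit GRU.
states = [[0.1, 0.3], [0.5, -0.2], [0.2, 0.4]]
alphas, level = attention_pool(states, w=[1.0, 0.5])
```

The pooled vector `level` is the utterance-level representation C that the fully connected layers of claims 7 and 10 classify.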
  10. The method according to claim 9, characterized in that the neural network further comprises a second fully connected layer for gender classification;
    after the step of weighting and summing the hidden states and feature weight values corresponding to the respective time steps to obtain the level of the corresponding sample speech, the method further comprises:
    inputting the level of the sample speech into the second fully connected layer to obtain a speaker gender classification result of the sample speech.
  11. A method for acquiring a speech arousal degree, characterized in that the method comprises:
    acquiring speech to be recognized; and
    inputting the speech to be recognized into an arousal degree recognition model, and outputting an arousal degree label of the speech to be recognized, wherein the arousal degree recognition model is obtained according to the arousal degree recognition model training method of any one of claims 1-10.
  12. An apparatus for training an arousal degree recognition model, characterized in that the apparatus comprises:
    an acquisition module, configured to acquire an arousal degree label of sample speech and perform data enhancement on part of the sample speech according to the arousal degree label of the sample speech;
    an extraction module, configured to extract a feature matrix of a frame sequence corresponding to the sample speech; and
    a training module, configured to input the feature matrices of the frame sequences corresponding to the respective arousal degrees and the corresponding arousal degree labels into a neural network for training.
  13. An apparatus for acquiring a speech arousal degree, characterized in that the apparatus comprises:
    an acquisition module, configured to acquire speech to be recognized; and
    a recognition module, configured to input the speech to be recognized into an arousal degree recognition model and output an arousal degree label of the speech to be recognized, wherein the arousal degree recognition model is obtained according to the arousal degree recognition model training method of any one of claims 1-10.
  14. A computer device, characterized by comprising a memory and a processor, wherein the memory is configured to store a computer program which, when run on the processor, executes the arousal degree recognition model training method of any one of claims 1 to 10, or the speech arousal degree acquisition method of claim 11.
  15. A computer-readable storage medium, characterized in that it stores a computer program which, when run on a processor, executes the arousal degree recognition model training method of any one of claims 1 to 10, or the speech arousal degree acquisition method of claim 11.
PCT/CN2021/131223 2021-04-27 2021-11-17 Wake-up degree recognition model training method and speech wake-up degree acquisition method WO2022227507A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110462278.0 2021-04-27
CN202110462278.0A CN113192537B (en) 2021-04-27 2021-04-27 Awakening degree recognition model training method and voice awakening degree acquisition method

Publications (1)

Publication Number Publication Date
WO2022227507A1 true WO2022227507A1 (en) 2022-11-03

Family

ID=76979709

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/131223 WO2022227507A1 (en) 2021-04-27 2021-11-17 Wake-up degree recognition model training method and speech wake-up degree acquisition method

Country Status (2)

Country Link
CN (1) CN113192537B (en)
WO (1) WO2022227507A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116312513A (en) * 2023-02-13 2023-06-23 陕西省君凯电子科技有限公司 Intelligent voice control system
CN117058597A (en) * 2023-10-12 2023-11-14 清华大学 Dimension emotion recognition method, system, equipment and medium based on audio and video

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192537B (en) * 2021-04-27 2024-04-09 深圳市优必选科技股份有限公司 Awakening degree recognition model training method and voice awakening degree acquisition method

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105609116A (en) * 2015-12-23 2016-05-25 东南大学 Speech emotional dimensions region automatic recognition method
US20170365277A1 (en) * 2016-06-16 2017-12-21 The George Washington University Emotional interaction apparatus
CN108091323A (en) * 2017-12-19 2018-05-29 想象科技(北京)有限公司 For identifying the method and apparatus of emotion from voice
CN108597541A (en) * 2018-04-28 2018-09-28 南京师范大学 A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying
US20190074028A1 (en) * 2017-09-01 2019-03-07 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
CN111966824A (en) * 2020-07-11 2020-11-20 天津大学 Text emotion recognition method based on emotion similarity attention mechanism
CN112216307A (en) * 2019-07-12 2021-01-12 华为技术有限公司 Speech emotion recognition method and device
US20210104245A1 (en) * 2019-06-03 2021-04-08 Amazon Technologies, Inc. Multiple classifications of audio data
CN113192537A (en) * 2021-04-27 2021-07-30 深圳市优必选科技股份有限公司 Awakening degree recognition model training method and voice awakening degree obtaining method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107272607A (en) * 2017-05-11 2017-10-20 上海斐讯数据通信技术有限公司 A kind of intelligent home control system and method
CN110444224B (en) * 2019-09-09 2022-05-27 深圳大学 Voice processing method and device based on generative countermeasure network
CN111311327A (en) * 2020-02-19 2020-06-19 平安科技(深圳)有限公司 Service evaluation method, device, equipment and storage medium based on artificial intelligence
CN111554279A (en) * 2020-04-27 2020-08-18 天津大学 Multi-mode man-machine interaction system based on Kinect
CN111933114B (en) * 2020-10-09 2021-02-02 深圳市友杰智新科技有限公司 Training method and use method of voice awakening hybrid model and related equipment

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105609116A (en) * 2015-12-23 2016-05-25 东南大学 Speech emotional dimensions region automatic recognition method
US20170365277A1 (en) * 2016-06-16 2017-12-21 The George Washington University Emotional interaction apparatus
US20190074028A1 (en) * 2017-09-01 2019-03-07 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
CN108091323A (en) * 2017-12-19 2018-05-29 想象科技(北京)有限公司 For identifying the method and apparatus of emotion from voice
CN108597541A (en) * 2018-04-28 2018-09-28 南京师范大学 A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying
US20210104245A1 (en) * 2019-06-03 2021-04-08 Amazon Technologies, Inc. Multiple classifications of audio data
CN112216307A (en) * 2019-07-12 2021-01-12 华为技术有限公司 Speech emotion recognition method and device
CN111966824A (en) * 2020-07-11 2020-11-20 天津大学 Text emotion recognition method based on emotion similarity attention mechanism
CN113192537A (en) * 2021-04-27 2021-07-30 深圳市优必选科技股份有限公司 Awakening degree recognition model training method and voice awakening degree obtaining method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
UNRELIABLE CAT: "Audio Data Enhancement by Using Python", XP009540606, Retrieved from the Internet <URL:https://baijiahao.baidu.com/s?id=1664050947493095222&wfr=spider&for=pc> *
ZHANG ZIXING; WU BINGWEN; SCHULLER BJORN: "Attention-augmented End-to-end Multi-task Learning for Emotion Prediction from Speech", ICASSP 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 12 May 2019 (2019-05-12), pages 6705 - 6709, XP033565426, DOI: 10.1109/ICASSP.2019.8682896 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116312513A (en) * 2023-02-13 2023-06-23 陕西省君凯电子科技有限公司 Intelligent voice control system
CN117058597A (en) * 2023-10-12 2023-11-14 清华大学 Dimension emotion recognition method, system, equipment and medium based on audio and video
CN117058597B (en) * 2023-10-12 2024-01-05 清华大学 Dimension emotion recognition method, system, equipment and medium based on audio and video

Also Published As

Publication number Publication date
CN113192537A (en) 2021-07-30
CN113192537B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
US10956771B2 (en) Image recognition method, terminal, and storage medium
WO2021036644A1 (en) Voice-driven animation method and apparatus based on artificial intelligence
WO2021135577A9 (en) Audio signal processing method and apparatus, electronic device, and storage medium
WO2022227507A1 (en) Wake-up degree recognition model training method and speech wake-up degree acquisition method
CN111402866B (en) Semantic recognition method and device and electronic equipment
CN110263131B (en) Reply information generation method, device and storage medium
CN108494947B (en) Image sharing method and mobile terminal
CN112820299B (en) Voiceprint recognition model training method and device and related equipment
CN110830368A (en) Instant messaging message sending method and electronic equipment
US11830501B2 (en) Electronic device and operation method for performing speech recognition
CN114333774B (en) Speech recognition method, device, computer equipment and storage medium
CN113744736B (en) Command word recognition method and device, electronic equipment and storage medium
CN111292727B (en) Voice recognition method and electronic equipment
CN115691486A (en) Voice instruction execution method, electronic device and medium
WO2023246558A1 (en) Semantic understanding method and apparatus, and medium and device
CN116860913A (en) Voice interaction method, device, equipment and storage medium
CN114462580B (en) Training method of text recognition model, text recognition method, device and equipment
CN111597823B (en) Method, device, equipment and storage medium for extracting center word
CN110868634A (en) Video processing method and electronic equipment
CN113707132B (en) Awakening method and electronic equipment
US12118983B2 (en) Electronic device and operation method thereof
CN113535926B (en) Active dialogue method and device and voice terminal
CN114155859B (en) Detection model training method, voice dialogue detection method and related equipment
CN109829167B (en) Word segmentation processing method and mobile terminal
CN115910051A (en) Audio data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21938967

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21938967

Country of ref document: EP

Kind code of ref document: A1