
WO2022227507A1 - Wake-up degree recognition model training method and speech wake-up degree acquisition method - Google Patents

Wake-up degree recognition model training method and speech wake-up degree acquisition method Download PDF

Info

Publication number
WO2022227507A1
Authority
WO
WIPO (PCT)
Prior art keywords
speech
sample
wake
degree
arousal
Prior art date
Application number
PCT/CN2021/131223
Other languages
French (fr)
Chinese (zh)
Inventor
邵池
黄东延
Original Assignee
深圳市优必选科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市优必选科技股份有限公司 filed Critical 深圳市优必选科技股份有限公司
Publication of WO2022227507A1 publication Critical patent/WO2022227507A1/en

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state

Definitions

  • the invention relates to the field of speech processing, in particular to a method for training an arousal degree recognition model and a method for acquiring a speech arousal degree.
  • Emotion recognition has become an integral part of modern human-computer interaction systems in many healthcare, education, and safety-related scenarios.
  • speech, text, video, etc. can be used as separate inputs, or a combination of them can be used as multimodal inputs.
  • This paper focuses on speech-based emotion recognition.
  • speech emotion recognition is performed in a supervised manner using segmented short sentences, and labels for emotions can be in two formats: either discrete labels such as happy, sad, angry and neutral, or continuous labels such as activation (calm versus aroused), valence (negative versus positive), and dominance (weak versus strong).
  • continuous emotional attributes have received a lot of attention due to their flexibility in describing more complex emotional states.
  • Continuous attribute classification plays an extremely important role in speech emotion recognition, and the degree of arousal also affects the speed and accuracy of emotion recognition. Generally speaking, the higher the degree of arousal, the faster the emotion recognition, and the higher the recognition accuracy. The accuracy of semantic emotion recognition can also be improved to a certain extent by identifying the degree of arousal in advance.
  • embodiments of the present invention provide a method for training an arousal degree recognition model and a method for acquiring a voice arousal degree.
  • an embodiment of the present invention provides a method for training an arousal degree recognition model, including:
  • the feature matrix of the frame sequence corresponding to the various arousal degree labels and the corresponding arousal degree labels are input into the neural network for training.
  • the step of acquiring the wake-up degree label of the sample speech includes:
  • the first type of sample speech corresponding to the first arousal degree label, the second type of sample speech corresponding to the second arousal degree label, and the third type of sample speech corresponding to the third arousal degree label are selected.
  • the step of acquiring the wake-up degree label of the sample speech includes:
  • if the difference between the numbers of sample voices under the various arousal degree labels is greater than or equal to a preset number difference, data enhancement processing is performed on the sample voices of the smaller classes until the difference between the numbers of sample voices under the various arousal degree labels is less than the preset number difference.
  • the step of performing data enhancement processing on a small number of sample speeches includes:
  • the combination of the initial sample speech and the augmented speech is used as the sample speech for training.
  • the step of adding noise to the sample speech to obtain the amplified speech includes:
  • S_i represents the floating-point time series
  • L represents the length of the floating-point time series
  • r is the coefficient of w
  • w is a floating-point number that obeys a Gaussian distribution.
  • the step of extracting the feature matrix of the frame sequence corresponding to the sample speech includes:
  • the neural network includes a gated recurrent unit, an attention layer, and a first fully connected layer for sentiment classification;
  • the feature matrix of the frame sequence corresponding to the sample speech and the corresponding wake-up degree label are fed into the gated recurrent unit, and a hidden state corresponding to each time step is formed inside the gated recurrent unit;
  • the level of the sample speech is input into the first fully connected layer, and the classification result of the arousal degree label of the sample speech is obtained.
  • the step of feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding wake-up degree label into the gated recurrent unit, and forming inside the gated recurrent unit a hidden state corresponding to each time step, includes:
  • the hidden states corresponding to the time series are input into the attention layer, the feature weight value of each time step is determined, and the hidden state and feature weight value corresponding to each time step are weighted and summed,
  • the steps of obtaining the level of the corresponding sample speech include:
  • α_t represents the feature weight value at time step t
  • h_t is the hidden state output by the gated recurrent unit
  • W represents the parameter vector to be learned
  • C represents the level of the sample speech.
  • the neural network further includes a second fully connected layer for gender classification
  • the method further includes:
  • the level of the sample speech is input into the second fully connected layer to obtain the speaker gender classification result of the sample speech.
  • an embodiment of the present invention provides a method for acquiring a voice arousal degree, the method comprising:
  • the to-be-recognized speech is input into a wake-up level recognition model, and a wake-up level label of the to-be-recognized speech is output, where the wake-up level recognition model is obtained according to any one of the arousal level recognition model training methods described above.
  • an embodiment of the present invention provides an apparatus for training an arousal degree recognition model, the apparatus comprising:
  • an acquisition module used for acquiring the wake-up degree label of the sample voice, and performing data enhancement on part of the sample voice according to the wake-up degree label of the sample voice;
  • an extraction module for extracting the feature matrix of the corresponding frame sequence of the sample speech
  • the training module is used for inputting the feature matrix of the frame sequence corresponding to various arousal degree labels and the corresponding arousal degree labels into the neural network for training.
  • an embodiment of the present invention provides a device for acquiring a voice wake-up degree, the device comprising:
  • an acquisition module used to acquire the speech to be recognized
  • a recognition module configured to input the voice to be recognized into a wake-up level recognition model, and output a wake-up level label of the to-be-recognized voice, the wake-up level recognition model being obtained by the training method according to any one of the first aspect.
  • an embodiment of the present invention provides a computer device, including a memory and a processor, where the memory is used to store a computer program, and when the computer program runs on the processor, it executes the method according to any one of the first aspect.
  • an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program, and when the computer program runs on a processor, it executes the training method for the arousal degree recognition model according to any one of the first aspect, or the method for acquiring the voice arousal degree described in the second aspect.
  • in the above-mentioned training method for the arousal degree recognition model and the method for acquiring the voice arousal degree provided by the present application, feature extraction is performed on sample voices of different arousal degrees, and the features are input into the neural network for training, so that an arousal degree recognition model capable of recognizing the voice arousal degree can be obtained.
  • the arousal degree recognition model is applied to the speech recognition scene, and the recognition of arousal degree is added on the basis of basic speech recognition, so as to enhance the accuracy and diversity of speech recognition.
  • FIG. 1 shows a schematic flowchart of a training method for an arousal degree recognition model provided by an embodiment of the present application
  • FIG. 2 shows a schematic flowchart of part of the data enhancement involved in the training method for an arousal degree identification model provided by an embodiment of the present application
  • FIG. 3 shows a schematic partial flowchart of a feature matrix extraction involved in a method for training an arousal degree identification model provided by an embodiment of the present application
  • FIG. 4 shows a schematic flowchart of part of the model training involved in the method for training an arousal degree recognition model provided by an embodiment of the present application
  • FIG. 5 shows a schematic structural diagram of a part of the neural network involved in the method for training an arousal degree recognition model provided by an embodiment of the present application
  • FIG. 6 shows a schematic flowchart of a method for acquiring a voice arousal degree provided by an embodiment of the present application
  • FIG. 7 shows a block diagram of a module of an apparatus for training an arousal degree recognition model provided by an embodiment of the present application
  • FIG. 8 shows a block diagram of a module of an apparatus for acquiring a voice wakefulness degree provided by an embodiment of the present application
  • FIG. 9 shows a hardware structure diagram of a computer device provided by an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a training method for an arousal degree recognition model (hereinafter referred to as the model training method) provided by an embodiment of the present invention.
  • the model training method mainly includes the following steps:
  • the model training method provided in this embodiment mainly uses the sample speech of the known arousal degree Arousal to train the basic neural network, so as to train the arousal degree recognition model with the arousal degree recognition function.
  • the level of arousal represents the level of emotional physiological activation; for example, "anger" or "excitement" corresponds to a higher level of arousal relative to calm.
  • Arousal degree labels are usually continuous emotional attributes, and the values of their original labels are distributed between [1, 5].
  • continuous emotional attributes can be discretized into three categories; for example, the continuous arousal value is divided into 3 intervals: the degree of arousal in [1, 2] is classified as the first arousal degree (relatively low), the degree of arousal in (2, 4) as the second arousal degree (middle), and the degree of arousal in [4, 5] as the third arousal degree (relatively high).
  • the voices belonging to these three categories can also be re-assigned labels 1, 2, and 3, so that the problem is transformed into a three-class emotion classification problem on the arousal label.
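The interval-to-label mapping above can be sketched as a small Python function (the function name and the toy values are illustrative, not from the patent):

```python
def discretize_arousal(value: float) -> int:
    """Map a continuous arousal value in [1, 5] to class 1 (low), 2 (middle), or 3 (high)."""
    if 1.0 <= value <= 2.0:   # [1, 2] -> first arousal degree (relatively low)
        return 1
    elif value < 4.0:         # (2, 4) -> second arousal degree (middle)
        return 2
    elif value <= 5.0:        # [4, 5] -> third arousal degree (relatively high)
        return 3
    raise ValueError("arousal value outside [1, 5]")

labels = [discretize_arousal(v) for v in (1.5, 3.0, 4.2)]  # toy annotations
```

With this mapping, the continuous annotation problem becomes an ordinary three-class classification problem over the re-assigned labels.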
  • the step of obtaining sample speech corresponding to various arousal degrees described in S101 may include:
  • the first type of sample speech corresponding to the first arousal degree label, the second type of sample speech corresponding to the second arousal degree label, and the third type of sample speech corresponding to the third arousal degree label are selected.
  • the awakening degree of the voice to be recognized can be divided into three levels, and the corresponding labels are respectively defined as the first awakening degree label, the second awakening degree label and the third awakening degree label.
  • the degrees of arousal corresponding to these three labels can be set to increase sequentially.
  • the corresponding sample speech is then obtained. That is, the first type of sample speech with a relatively low arousal degree corresponds to the first arousal degree label, the second type of sample speech with a middle arousal degree corresponds to the second arousal degree label, and the third type of sample speech with a relatively high arousal degree corresponds to the third arousal degree label.
  • the preset data set may be the Interactive Emotional Dyadic Motion Capture (IEMOCAP) data set.
  • the sample voice whose arousal degree value is in the range of [1, 2] is used as the first type of sample voice
  • the sample speech whose arousal degree value range is (2, 4) is used as the second type of sample speech
  • the sample speech whose arousal degree value is [4, 5] is used as the third type of sample speech.
  • a larger number of sample speeches is required to train a model with higher recognition accuracy.
  • the total number of sample voices can be expanded by means of data enhancement to improve the recognition accuracy of the trained model.
  • the step in S101 of acquiring the wake-up level label of the sample speech and performing data enhancement on part of the sample speech according to the wake-up level label of the sample speech includes:
  • the preset number of sample voices allowed for training can be about 3000
  • the allowed difference between the numbers of the various types of sample voices is the preset number difference
  • the preset number difference can be set to 0, that is, the numbers of sample voices of the various types are required to be exactly the same; it can also be set to a value greater than 0, that is, a partial difference between the numbers of sample voices of the various types is allowed.
  • the actual number difference between the sample speeches of the various arousal degree labels is compared with the preset number difference. If the actual number difference is greater than or equal to the preset number difference, data enhancement processing is required for the smaller classes of sample speeches; if the actual number difference is less than the preset number difference, no data enhancement processing is required.
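A minimal sketch of this balance check, with the gap computation and the toy labels assumed for illustration:

```python
from collections import Counter

def class_count_gap(labels) -> int:
    """Gap between the largest and smallest class counts across arousal labels."""
    counts = Counter(labels)
    return max(counts.values()) - min(counts.values())

# Toy arousal labels: three low, two middle, one high sample.
labels = [1, 1, 1, 2, 2, 3]
preset_diff = 1
# Augmentation is triggered when the actual gap reaches the preset difference.
needs_augmentation = class_count_gap(labels) >= preset_diff
```

Here the gap is 2, so the medium and high classes would be augmented until the gap falls below `preset_diff`.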
  • the above-mentioned steps of performing data enhancement processing on a small number of sample speeches may include:
  • the combination of the initial sample speech and the augmented speech is used as the sample speech for training.
  • the step of adding noise to the sample speech to obtain the amplified speech includes computing S'_i = S_i + r·w, where:
  • S_i represents the floating-point time series
  • L represents the length of the floating-point time series
  • r is the coefficient of w
  • the value range of r is [0.001, 0.002]
  • w is a floating-point number that obeys a Gaussian distribution.
  • the noise is Gaussian white noise.
  • the specific difference can be customized according to the specific sample type or model recognition accuracy.
  • in Python, w can be generated by numpy.random.normal(0, 1, len(S)), which is essentially a sequence of length L of numbers that conform to the Gaussian distribution.
  • in this way the data is amplified, the imbalance in the numbers of samples of the three categories (low, medium, and high) is alleviated, and it is ensured that no batch contains too many samples of one class, so as to prevent the trained model from always biasing towards the class with more samples.
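The noise-augmentation step above can be sketched as follows, assuming the sample speech S is already loaded as a floating-point time series (the 16 kHz example signal is illustrative):

```python
import numpy as np

def add_gaussian_noise(S: np.ndarray, r: float = 0.001) -> np.ndarray:
    """Return amplified speech S' with S'_i = S_i + r * w_i, w ~ N(0, 1)."""
    w = np.random.normal(0, 1, len(S))  # Gaussian white noise of length L
    return S + r * w

S = np.zeros(16000)                     # 1 s of silence at 16 kHz (illustrative)
augmented = add_gaussian_noise(S, r=0.001)  # r in the stated range [0.001, 0.002]
```

Because r is tiny relative to typical speech amplitudes, the augmented signal stays perceptually close to the original while still differing numerically from it.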
  • after acquiring the sample voices corresponding to various arousal degrees, the sample voices are divided into frames to obtain a frame sequence corresponding to each sample voice.
  • the feature matrix corresponding to the frame sequence is extracted, which is used to learn and summarize the speech features of various arousal degrees.
  • the step of extracting the feature matrix of the frame sequence corresponding to the sample speech in S102 may specifically include:
  • the sample speech is divided into speech frames corresponding to the time axis, and the features between adjacent speech frames are related or even overlapped in adjacent time periods.
  • the Opensmile tool can be used to extract Low-Level Descriptor (LLD) features and their first-order derivatives.
  • the low-level descriptor can be IS13_compare.
  • there are 65 low-level descriptor features and 65 first-order derivatives of the low-level descriptor features, resulting in a total of 65 + 65 = 130 features.
  • the frame length can be set to 20ms, and the frame shift can be set to 10ms.
  • the length of each speech is not fixed, so the number of frames extracted from each speech is also different.
  • the maximum number of frames for each voice can be uniformly set to 750. If the actual number of frames (frame_num) is less than 750, a padding operation is performed, that is, (750 - frame_num) rows of zeros are appended after the extracted two-dimensional features. If the actual number of frames is greater than 750, a truncation operation is performed. Finally, the feature matrix of each sample speech is of size number of frames * number of features, that is, a two-dimensional matrix of size 750*130.
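The pad-or-truncate step above can be sketched directly with NumPy (the function name is illustrative; 750 and 130 are the values from the description):

```python
import numpy as np

MAX_FRAMES, N_FEATURES = 750, 130

def fix_length(features: np.ndarray) -> np.ndarray:
    """Pad with rows of zeros or truncate so every utterance is 750 x 130."""
    frame_num = features.shape[0]
    if frame_num < MAX_FRAMES:
        # Append (750 - frame_num) rows of zeros after the extracted features.
        pad = np.zeros((MAX_FRAMES - frame_num, features.shape[1]))
        return np.vstack([features, pad])
    return features[:MAX_FRAMES]        # truncate longer utterances

short = fix_length(np.ones((300, N_FEATURES)))  # padded to 750 rows
long_ = fix_length(np.ones((900, N_FEATURES)))  # truncated to 750 rows
```

Fixing the frame dimension this way lets utterances of different lengths be batched into a single tensor for training.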
  • the various feature matrices and the corresponding arousal degree labels can be input into the neural network prepared in advance for training, and the characteristics are learned and summarized, so as to obtain Arousal level recognition model capable of identifying different speech arousal levels.
  • the feature matrix of the frame sequence corresponding to various types of arousal degree labels and the corresponding arousal degree labels are input to the neural network for training.
  • the neural network includes a gated recurrent unit, an attention layer and a first fully connected layer for sentiment classification.
  • the neural network for encoding the feature matrix adopts a recurrent neural network (Recurrent Neural Network, RNN for short), and the RNN sequentially includes a gated recurrent unit (Gated Recurrent Unit, GRU for short), an attention layer and a first fully connected layer; adjacent layers have a data transmission relationship, and usually the output data of the upper layer is the input of the lower layer.
  • the unit that performs feature encoding may also be another encoding unit, such as a long short-term memory layer (Long Short-Term Memory, LSTM for short), which is not limited here.
  • the method may specifically include:
  • the feature matrix of the frame sequence corresponding to the sample speech and the corresponding wake-up degree label are fed into the gated recurrent unit, and a hidden state corresponding to each time step is formed inside the gated recurrent unit;
  • the step of feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding wake-up degree label into the gated recurrent unit, and forming inside the gated recurrent unit a hidden state corresponding to each time step, includes:
  • the feature x_t and the hidden state h_{t-1} of the previous time step are used to update the hidden state at each time step, where the hidden state update formula is: h_t = f_θ(x_t, h_{t-1})
  • f_θ is the RNN function with weight parameter θ
  • h_t represents the hidden state at the t-th time step
  • the attention layer is used to pay attention to the parts related to emotion. Specifically, as shown in Figure 4, at time step t, the output of the GRU is h_t, and the feature weight of normalized importance is first calculated by the softmax function: α_t = exp(Wᵀh_t) / Σ_τ exp(Wᵀh_τ), where:
  • α_t represents the feature weight value at time step t
  • h_t is the hidden state output by the gated recurrent unit
  • W represents the parameter vector to be learned.
  • the weighted sum is then performed according to the weights: the hidden state and feature weight value corresponding to each time step are weighted and summed to obtain the level of the corresponding sample speech, C = Σ_t α_t h_t.
  • S404 Input the level of the sample speech into the first fully connected layer to obtain a classification result of the arousal degree of the sample speech.
  • the sentence level C obtained through the attention layer is input to the sentiment classification network, namely the first fully connected layer, for sentiment classification.
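The attention pooling described above can be sketched numerically as follows (the hidden size of 128 and the random inputs are assumed for illustration):

```python
import numpy as np

def attention_pool(H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """H: (T, d) GRU hidden states; W: (d,) learned parameter vector.
    Returns the sentence-level representation C = sum_t alpha_t * h_t."""
    scores = H @ W                          # one scalar score W.h_t per time step
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha = alpha / alpha.sum()             # feature weight values, sum to 1
    return alpha @ H                        # weighted sum over time steps

H = np.random.randn(750, 128)               # T = 750 frames, hidden size 128 (assumed)
W = np.random.randn(128)
C = attention_pool(H, W)                    # sentence level C, shape (128,)
```

The softmax weights let the model emphasize the emotion-relevant frames while the weighted sum collapses the variable-length sequence into a fixed-size vector for the fully connected classifiers.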
  • the neural network further includes a second fully-connected layer for gender classification.
  • the method further includes:
  • the level of the sample speech is input into the second fully connected layer to obtain the speaker gender classification result of the sample speech.
  • the multi-classification task includes emotion classification and gender classification, wherein gender classification is a binary classification task, which is an auxiliary task of emotion classification.
  • the emotion classification network includes the first fully connected layer and the softmax layer; the gender classification network includes the second fully connected layer and the softmax layer.
  • the structure is shown in Figure 5, where yE represents the probability that a predicted sentence belongs to the three emotion categories of low, medium and high, and yG represents the probability that the predicted gender of the speaker of a certain sentence belongs to the male or female category.
  • the loss equation for multi-task classification is as follows: l = α·l_emotion + β·l_gender
  • l_emotion and l_gender denote the losses for emotion classification and gender classification, respectively.
  • α and β represent the weights of the two tasks, and in this study, both values are set to 1.
  • the loss function of the two tasks is the cross-entropy loss, calculated for the K-class emotion task as l_emotion = -(1/N) Σ_i Σ_k y_{i,k} log p_{i,k}, and for the binary gender task as l_gender = -(1/N) Σ_i [y_i log p_i + (1 - y_i) log(1 - p_i)], where:
  • N represents the total number of samples
  • K is the total number of emotional categories
  • y_{i,k} represents the true probability that the i-th sample belongs to the k-th class
  • p_{i,k} represents the predicted probability that the i-th sample belongs to the k-th class.
  • y_i represents the true label of the sample
  • p_i is the predicted probability that the sample belongs to the first (positive) class.
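The multi-task loss above can be sketched with NumPy; the function names, the one-hot toy inputs, and the small epsilon for numerical stability are assumptions for illustration:

```python
import numpy as np

def cross_entropy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """y_true: (N, K) one-hot labels; y_pred: (N, K) predicted probabilities."""
    return float(-np.mean(np.sum(y_true * np.log(y_pred + 1e-12), axis=1)))

def multitask_loss(yE_true, yE_pred, yG_true, yG_pred, alpha=1.0, beta=1.0):
    """l = alpha * l_emotion + beta * l_gender, with alpha = beta = 1 by default."""
    return alpha * cross_entropy(yE_true, yE_pred) + beta * cross_entropy(yG_true, yG_pred)

yE = np.eye(3)[[0, 1, 2]]   # 3 samples, K = 3 emotion classes (low/medium/high)
yG = np.eye(2)[[0, 1, 0]]   # gender as a binary auxiliary task
perfect = multitask_loss(yE, yE, yG, yG)  # near-zero when predictions match labels
```

Treating gender as an auxiliary head sharing the sentence-level representation is a common way to regularize the main emotion task without extra inference cost.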
  • the above method extracts features from sample speeches with different arousal degree labels and inputs them into a neural network for training, so that an arousal degree recognition model capable of recognizing voice arousal degree labels can be obtained.
  • the arousal degree recognition model is applied to the speech recognition scene, and the recognition of arousal degree is added on the basis of basic speech recognition, so as to enhance the accuracy and diversity of speech recognition.
  • FIG. 6 is a schematic flowchart of a method for acquiring a voice arousal degree according to an embodiment of the present invention. As shown in FIG. 6, the method includes the following steps:
  • S602 Input the voice to be recognized into an arousal degree recognition model, and output a wake-up degree label of the voice to be recognized.
  • the arousal degree identification model is obtained according to the arousal degree identification model training method described in the above embodiment.
  • the arousal degree recognition model built in the above-mentioned embodiment is loaded into the computer device and applied to the scene of obtaining the voice arousal degree.
  • the voice to be recognized may be the voice collected by computer equipment, or the voice obtained from other channels such as the Internet.
  • the arousal degree recognition model training device 700 mainly includes:
  • an acquisition module 701 configured to acquire a wake-up degree label of the sample voice, and perform data enhancement on part of the sample voice according to the wake-up degree label of the sample voice;
  • An extraction module 702 configured to extract the feature matrix of the frame sequence corresponding to the sample speech
  • the training module 703 is used for inputting the feature matrix of the frame sequence corresponding to the various types of arousal degree labels and the corresponding arousal degree labels into the neural network for training.
  • FIG. 8 is a block diagram of the modules of an apparatus for acquiring a voice arousal degree according to an embodiment of the present invention.
  • the apparatus 800 for obtaining the voice wakefulness degree includes:
  • the identification module 802 is configured to input the voice to be recognized into a wake-up level recognition model, and output the wake-up level label of the to-be-recognized voice, where the wake-up level recognition model is obtained according to the wake-up level recognition model training method described in the above embodiment.
  • an embodiment of the present disclosure provides a computer device, including a memory and a processor, where the memory stores a computer program, and when the computer program runs on the processor, it executes the wake-up degree recognition model training method or the voice arousal degree acquisition method provided by the above method embodiments.
  • the computer device 900 includes but is not limited to: a radio frequency unit 901 , a network module 902 , an audio output unit 903 , an input unit 904 , and a sensor 905 , a display unit 906 , a user input unit 907 , an interface unit 908 , a memory 909 , a processor 910 , and a power supply 911 and other components.
  • the structure of the computer device shown in FIG. 9 does not constitute a limitation on the computer device, and the computer device may include more or fewer components than shown, or combine some components, or have a different component layout.
  • the computer equipment includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
  • the radio frequency unit 901 can be used for receiving and sending signals during information transmission or during a call. Specifically, downlink data received from the base station is processed by the processor 910, and uplink data is sent to the base station.
  • the radio frequency unit 901 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like.
  • the radio frequency unit 901 can also communicate with the network and other devices through a wireless communication system.
  • the computer device provides the user with wireless broadband Internet access through the network module 902, such as helping the user to send and receive emails, browse the web, access streaming media, and so on.
  • the audio output unit 903 may convert audio data received by the radio frequency unit 901 or the network module 902 or stored in the memory 909 into audio signals and output as sound. Also, the audio output unit 903 may also provide audio output related to a specific function performed by the computer device 900 (eg, call signal reception sound, message reception sound, etc.).
  • the audio output unit 903 includes a speaker, a buzzer, a receiver, and the like.
  • the input unit 904 is used to receive audio or video signals.
  • the input unit 904 may include a graphics processor (Graphics Processing Unit, GPU for short) 9041 and a microphone 9042, and the graphics processor 9041 processes image data of still pictures or videos obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode.
  • the processed image frames can be displayed on the display unit 906.
  • the image frames processed by the graphics processor 9041 may be stored in the memory 909 (or other storage medium) or transmitted via the radio frequency unit 901 or the network module 902 .
  • the microphone 9042 can receive sound and can process such sound into audio data.
  • in the case of a telephone call mode, the processed audio data can be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 901 and output.
  • the computer device 900 also includes at least one sensor 905, including at least the barometer mentioned in the above embodiments.
  • the sensor 905 may also be other sensors such as light sensors, motion sensors, and other sensors.
  • the light sensor includes an ambient light sensor and a proximity sensor, wherein the ambient light sensor can adjust the brightness of the display panel 9061 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 9061 and/or the backlight when the computer device 900 is moved to the ear.
  • the accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of the computer device (such as horizontal/vertical screen switching, related games, magnetometer attitude calibration) and for vibration-recognition related functions (such as pedometer, tapping); the sensor 905 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and the like, which are not described in detail here.
  • the display unit 906 is used for displaying information input by the user or information provided to the user.
  • the display unit 906 may include a display panel 9061, which may be in the form of a liquid crystal panel (Liquid Crystal Display, LCD for short), an organic light-emitting diode (Organic Light-Emitting Diode, OLED for short) panel, and the like.
  • the user input unit 907 may be used to receive input numerical or character information, and generate key signal input related to user settings and function control of the computer device.
  • the user input unit 907 includes a touch panel 9071 and other input devices 9072 .
  • the touch panel 9071, also referred to as a touch screen, can collect touch operations by the user on or near it (such as operations performed by the user with a finger, a stylus, or any other suitable object or accessory on or near the touch panel 9071).
  • the touch panel 9071 may include two parts: a touch detection device and a touch controller.
  • the touch detection device detects the user's touch orientation and the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 910, and receives and executes commands sent by the processor 910.
  • the touch panel 9071 can be implemented in various types such as resistive, capacitive, infrared, and surface acoustic waves.
  • the user input unit 907 may also include other input devices 9072 .
  • other input devices 9072 may include, but are not limited to, physical keyboards, function keys (such as volume control keys, switch keys, etc.), trackballs, mice, and joysticks, which will not be repeated here.
  • the touch panel 9071 can be overlaid on the display panel 9061.
  • when the touch panel 9071 detects a touch operation on or near it, it transmits the operation to the processor 910 to determine the type of the touch event, and the processor 910 then provides a corresponding visual output on the display panel 9061 according to the type of the touch event.
  • although the touch panel 9071 and the display panel 9061 are described here as two independent components that realize the input and output functions of the computer device, in some embodiments the touch panel 9071 and the display panel 9061 can be integrated to realize the input and output functions of the computer device, which is not specifically limited here.
  • the interface unit 908 is an interface for connecting an external computer device to the computer device 900 .
  • the external computer devices may include a wired or wireless headset port, an external power (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device having an identification module, audio input/output (I/O) ports, video I/O ports, earphone ports, and the like.
  • the interface unit 908 may be used to receive input (e.g., data information, power, etc.) from an external computer device and transmit the received input to one or more elements within the computer device 900, or may be used to transfer data between the computer device 900 and external computer devices.
  • the memory 909 may be used to store software programs as well as various data.
  • the memory 909 may mainly include a stored program area and a stored data area, wherein the stored program area may store an operating system, an application program required for at least one function (such as a sound playback function, an image playback function, etc.), and the like; the stored data area may store data created according to the use of the device (such as audio data, a phone book, etc.), and the like.
  • the memory 909 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the processor 910 is the control center of the computer device; it uses various interfaces and lines to connect all parts of the entire device, and performs various functions of the device and processes data by running or executing the software programs and/or modules stored in the memory 909 and calling the data stored in the memory 909, so as to monitor the device as a whole.
  • the processor 910 may include one or more processing units; preferably, the processor 910 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 910.
  • the computer device 900 may also include a power supply 911 (such as a battery) for supplying power to various components.
  • the power supply 911 may be logically connected to the processor 910 through a power management system, so as to implement functions such as charging management, discharging management, and power consumption management through the power management system.
  • the computer device 900 may also include some functional modules that are not shown, which are not repeated here.
  • the memory is used for storing a computer program, and the computer program, when run by the processor, executes the above-mentioned method for training the arousal degree recognition model or the method for acquiring the voice arousal degree.
  • an embodiment of the present invention provides a computer-readable storage medium, which stores a computer program; the computer program, when run on a processor, executes the above-mentioned method for training an arousal degree recognition model or method for acquiring a speech arousal degree.
  • each block in the flowchart or block diagrams may represent a module, segment, or portion of code that contains one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in the block may occur out of the order noted in the figures.
  • each block of the block diagrams and/or flow diagrams, and combinations of blocks in the block diagrams and/or flow diagrams, can be implemented by dedicated hardware-based systems that perform the specified functions or actions, or by a combination of special-purpose hardware and computer instructions.
  • each functional module or unit in each embodiment of the present invention may be integrated to form an independent part, or each module may exist alone, or two or more modules may be integrated to form an independent part.
  • if the functions are implemented in the form of software function modules and sold or used as independent products, they can be stored in a computer-readable storage medium.
  • the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a smart phone, a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention.
  • the aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a mobile hard disk, a Read-Only Memory (ROM for short), a Random Access Memory (RAM for short), a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application provide a wake-up degree recognition model training method and a speech wake-up degree acquisition method. The wake-up degree recognition model training method comprises: obtaining a wake-up degree label of a sample speech, and performing data enhancement on part of the sample speech according to the wake-up degree label of the sample speech; extracting a feature matrix of a frame sequence corresponding to the sample speech; and inputting feature matrices of frame sequences corresponding to different classes of wake-up degree labels and the corresponding wake-up degree labels into a neural network for training. With the provided wake-up degree recognition model training solution, features of sample speeches having different wake-up degrees are extracted and input into the neural network for training, so that a wake-up degree recognition model capable of recognizing a speech wake-up degree can be obtained. When the wake-up degree recognition model is applied to a speech recognition scenario, wake-up degree recognition is added on the basis of basic speech recognition, thereby enhancing the accuracy and diversity of speech recognition.

Description

唤醒程度识别模型训练方法及语音唤醒程度获取方法Arousal degree recognition model training method and voice arousal degree acquisition method
相关申请的交叉引用CROSS-REFERENCE TO RELATED APPLICATIONS
本申请要求于2021年04月27日提交中国专利局的申请号为2021104622780、名称为“唤醒程度识别模型训练方法及语音唤醒程度获取方法”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of the Chinese Patent Application No. 2021104622780, which was filed with the China Patent Office on April 27, 2021, and is entitled "Arousal Level Recognition Model Training Method and Voice Arousal Level Acquiring Method", the entire contents of which are incorporated by reference in this application.
技术领域technical field
本发明涉及语音处理领域,尤其涉及一种唤醒程度识别模型训练方法及语音唤醒程度获取方法。The invention relates to the field of speech processing, in particular to a method for training an arousal degree recognition model and a method for acquiring a speech arousal degree.
背景技术Background technique
在许多与医疗健康、教育和安全相关的场景中，情感识别成为现代人机交互系统不可或缺的一部分。在情感识别系统中，可以将语音、文本、视频等作为单独的输入，也可以使用它们的组合作为多模态的输入，本文主要关注基于语音的情感识别。通常，语音情感识别是采用经过切分的简短语句以有监督的方式进行识别，情感的标签可以采用两种格式，即离散标签，例如幸福，悲伤，愤怒和中性，或连续标签，例如激活(镇静)对(唤起)、效价(负对正)和优势(弱对强)。近年来，连续情绪属性因在描述更复杂的情绪状态方面更加灵活，而受到许多关注。连续属性分类在语音情绪识别中有极其重要的作用，唤醒程度也影响了情感识别的速度和准确度，一般来讲，唤醒程度越高，情感识别就越迅速，识别准确率也相应越高，通过预先识别唤醒程度也可以一定程度上提高语义情感识别的准确度。Emotion recognition has become an integral part of modern human-computer interaction systems in many healthcare, education, and safety-related scenarios. In an emotion recognition system, speech, text, video, etc. can be used as separate inputs, or a combination of them can be used as multimodal input; this application focuses on speech-based emotion recognition. Typically, speech emotion recognition is performed in a supervised manner on segmented short utterances, and emotion labels can take two formats: discrete labels, such as happy, sad, angry, and neutral, or continuous labels, such as activation (calm versus aroused), valence (negative versus positive), and dominance (weak versus strong). In recent years, continuous emotional attributes have received much attention for their flexibility in describing more complex emotional states. Continuous attribute classification plays an extremely important role in speech emotion recognition, and the degree of arousal also affects the speed and accuracy of emotion recognition. Generally speaking, the higher the degree of arousal, the faster the emotion recognition and the correspondingly higher the recognition accuracy; recognizing the degree of arousal in advance can also improve the accuracy of semantic emotion recognition to a certain extent.
可见,亟需一种能识别语音的连续情感中的唤醒程度高低的方法。It can be seen that there is an urgent need for a method that can identify the level of arousal in the continuous emotion of speech.
申请内容Application content
为了解决上述技术问题,本发明实施例提供了一种唤醒程度识别模型训练方法及语音唤醒程度获取方法。In order to solve the above technical problems, embodiments of the present invention provide a method for training an arousal degree recognition model and a method for acquiring a voice arousal degree.
第一方面,本发明实施例提供了一种唤醒程度识别模型训练方法,包括:In a first aspect, an embodiment of the present invention provides a method for training an arousal degree recognition model, including:
获取样本语音的唤醒程度标签,并根据所述样本语音的唤醒程度标签对部分所述样本语音进行数据增强;Obtain the wake-up level label of the sample voice, and perform data enhancement on part of the sample voice according to the wake-up level label of the sample voice;
提取所述样本语音对应帧序列的特征矩阵;extracting the feature matrix of the frame sequence corresponding to the sample speech;
将各类唤醒程度标签对应帧序列的特征矩阵及对应的唤醒程度标签输入神经网络进行训练。The feature matrix of the frame sequence corresponding to the various arousal degree labels and the corresponding arousal degree labels are input into the neural network for training.
根据本公开的一种具体实施方式,所述获取样本语音的唤醒程度标签的步骤,包括:According to a specific embodiment of the present disclosure, the step of acquiring the wake-up degree label of the sample speech includes:
从预设数据集中,选取对应第一唤醒程度标签的第一类样本语音、对应第二唤醒程度标签的第二类样本语音和对应第三唤醒程度标签的第三类样本语音。From the preset data set, the first type of sample speech corresponding to the first arousal degree label, the second type of sample speech corresponding to the second arousal degree label, and the third type of sample speech corresponding to the third arousal degree label are selected.
根据本公开的一种具体实施方式,所述获取样本语音的唤醒程度标签的步骤,包括:According to a specific embodiment of the present disclosure, the step of acquiring the wake-up degree label of the sample speech includes:
判断各类唤醒程度标签的样本语音的数量之间的差值是否大于或者等于预设数量差值;Determine whether the difference between the number of sample voices of various types of arousal degree labels is greater than or equal to the preset number difference;
若各类唤醒程度标签的样本语音的数量之间的差值大于或者等于预设数量差值，对数量较少的样本语音进行数据增强处理，直至各类唤醒程度标签的样本语音的数量之间的差值小于所述预设数量差值。If the difference between the numbers of sample voices of the various arousal degree labels is greater than or equal to the preset number difference, data enhancement processing is performed on the sample voices of the smaller classes until the difference between the numbers of sample voices of the various arousal degree labels is less than the preset number difference.
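The count-gap check described above can be sketched in a few lines; the concrete threshold value and label names below are illustrative assumptions, not part of the original disclosure.

```python
from collections import Counter

def needs_augmentation(labels, max_gap=50):
    """Return True when the largest and smallest class counts differ by
    at least the preset gap (max_gap is a hypothetical threshold)."""
    counts = Counter(labels)
    return max(counts.values()) - min(counts.values()) >= max_gap

# a toy imbalanced label set: the smallest class would need augmentation
labels = ["low"] * 120 + ["mid"] * 300 + ["high"] * 90
print(needs_augmentation(labels))  # True: 300 - 90 >= 50
```

In practice the check would be re-run after each augmentation round until it returns False.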
根据本公开的一种具体实施方式,所述对数量较少的样本语音进行数据增强处理的步骤,包括:According to a specific embodiment of the present disclosure, the step of performing data enhancement processing on a small number of sample speeches includes:
为初始的样本语音添加噪声,得到扩增语音;Add noise to the initial sample speech to get the augmented speech;
将初始的样本语音和扩增语音相加后的语音作为用于训练的样本语音。The speech after adding the initial sample speech and the augmented speech is used as the sample speech for training.
根据本公开的一种具体实施方式,所述为样本语音添加噪声,得到扩增语音的步骤,包括:According to a specific embodiment of the present disclosure, the step of adding noise to the sample speech to obtain the amplified speech includes:
利用librosa库加载所述样本音频,得到浮点型时间序列;Use the librosa library to load the sample audio to obtain a floating-point time series;
对浮点型时间序列S进行以下公式的计算,得到加噪后的扩增语音SNi,Calculate the following formula for the floating-point time series S to obtain the amplified speech SNi after adding noise,
SN_i = S_i + r·w
其中,i=1,2,...,L,Si表示浮点型时间序列,L表示浮点型时间序列的长度,r为w的系数,r的取值范围为[0.001,0.002],w为服从高斯分布的浮点数。Among them, i=1,2,...,L, Si represents the floating-point time series, L represents the length of the floating-point time series, r is the coefficient of w, and the value range of r is [0.001, 0.002], w is a floating-point number that obeys a Gaussian distribution.
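As a minimal sketch of this noise-augmentation step, the function below adds Gaussian noise scaled by a coefficient r in the stated [0.001, 0.002] range to a floating-point time series such as the one returned by `librosa.load`; the synthetic sine tone standing in for a real utterance is an illustrative assumption.

```python
import numpy as np

def add_gaussian_noise(s, r=0.0015, seed=0):
    """Augmented copy SN_i = S_i + r * w, where w is drawn from a
    Gaussian distribution and r lies in [0.001, 0.002]."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(len(s))
    return s + r * w

# a synthetic 440 Hz tone stands in for an utterance loaded via librosa.load
s = np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)
sn = add_gaussian_noise(s, r=0.002)
```

The initial series and its noisy copy together then form the enlarged training set described above.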
根据本公开的一种具体实施方式,所述提取所述样本语音对应帧序列的特征矩阵的步骤,包括:According to a specific embodiment of the present disclosure, the step of extracting the feature matrix of the frame sequence corresponding to the sample speech includes:
将样本语音划分为预设数量的语音帧;Divide the sample speech into a preset number of speech frames;
按照帧序列提取各语音帧的低级描述符特征及一阶导;Extract the low-level descriptor features and first-order derivatives of each speech frame according to the frame sequence;
根据帧序列和各语音帧的低级描述符特征及一阶导,得到对应各类样本语音的特征矩阵。According to the frame sequence and the low-level descriptor features and first-order derivatives of each speech frame, feature matrices corresponding to various sample speeches are obtained.
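The framing-plus-descriptor pipeline above can be illustrated with a toy example; frame log-energy stands in for the real low-level descriptors (which would typically come from a toolkit such as openSMILE or librosa), so the concrete feature choice and frame sizes here are assumptions.

```python
import numpy as np

def frame_feature_matrix(s, frame_len=400, hop=160):
    """Split the waveform into frames, compute one toy low-level
    descriptor (log energy) per frame, append its first-order
    difference, and stack them into a (num_frames, 2) feature matrix
    ordered by frame sequence."""
    n = 1 + (len(s) - frame_len) // hop
    frames = np.stack([s[i * hop : i * hop + frame_len] for i in range(n)])
    lld = np.log(np.sum(frames ** 2, axis=1) + 1e-10)  # one LLD per frame
    delta = np.diff(lld, prepend=lld[:1])              # first-order derivative
    return np.stack([lld, delta], axis=1)

feats = frame_feature_matrix(np.random.default_rng(0).standard_normal(16000))
print(feats.shape)  # (98, 2)
```

A real system would stack many descriptors per frame (pitch, MFCCs, energy, etc.) plus their derivatives, giving a wider matrix with the same frame-by-frame layout.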
根据本公开的一种具体实施方式,所述神经网络包括门控循环单元、注意力层和用 于情感分类的第一全连接层;According to a specific embodiment of the present disclosure, the neural network includes a gated recurrent unit, an attention layer, and a first fully connected layer for sentiment classification;
所述将各类唤醒程度标签对应帧序列的特征矩阵及对应的唤醒程度标签输入神经网络进行训练的步骤,包括:The step of inputting the feature matrix of the frame sequence corresponding to each type of arousal degree label and the corresponding arousal degree label into the neural network for training includes:
将样本语音对应帧序列的特征矩阵及对应的唤醒程度标签馈入所述门控循环单元,在所述门控循环单元内部形成对应各时间步的隐藏状态;The feature matrix of the corresponding frame sequence of the sample speech and the corresponding wake-up degree label are fed into the gated loop unit, and a hidden state corresponding to each time step is formed inside the gated loop unit;
将对应时间序列的隐藏状态模型输入注意力层,确定各时间步的特征权重值;Input the hidden state model of the corresponding time series into the attention layer to determine the feature weight value of each time step;
将对应各时间步的隐藏状态及特征权重值加权求和,得到对应样本语音的级别;Weighted summation of the hidden state and feature weight values corresponding to each time step to obtain the level of the corresponding sample speech;
将所述样本语音的级别输入所述第一全连接层,得到所述样本语音的唤醒程度标签分类结果。The level of the sample speech is input into the first fully connected layer, and the classification result of the arousal degree label of the sample speech is obtained.
根据本公开的一种具体实施方式，所述将样本语音对应帧序列的特征矩阵及对应的唤醒程度标签馈入所述门控循环单元，在所述门控循环单元内部形成对应各时间步的隐藏状态的步骤，包括：According to a specific embodiment of the present disclosure, the step of feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding arousal degree label into the gated recurrent unit, and forming hidden states corresponding to the time steps inside the gated recurrent unit, includes:
将样本语音对应帧序列的特征矩阵及对应的唤醒程度标签馈入所述门控循环单元,在所述门控循环单元内部形成内部隐藏状态ht;Feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding wake-up degree label into the gated loop unit, and an internal hidden state ht is formed inside the gated loop unit;
在每个时间步使用特征x_t和先前时间步的隐藏状态h_{t-1}更新；其中，隐藏状态更新公式为h_t=f_θ(h_{t-1},x_t)，f_θ是权重参数为θ的RNN函数，h_t表示第t个时间步的隐藏状态，x_t表示x={x_{1:t}}中的第t个特征。 At each time step, the hidden state is updated using the feature x_t and the hidden state h_{t-1} of the previous time step; the hidden state update formula is h_t = f_θ(h_{t-1}, x_t), where f_θ is an RNN function with weight parameters θ, h_t denotes the hidden state at the t-th time step, and x_t denotes the t-th feature in x = {x_{1:t}}.
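The recurrence h_t = f_θ(h_{t-1}, x_t) can be sketched with a plain tanh RNN cell in NumPy; a GRU adds update/reset gating on top of the same per-step pattern, and all weight shapes below are illustrative assumptions.

```python
import numpy as np

def rnn_hidden_states(x, W_x, W_h, b):
    """Run h_t = f_theta(h_{t-1}, x_t) over a feature sequence, where
    f_theta here is tanh(W_x x_t + W_h h_{t-1} + b); returns one hidden
    state per time step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 4))  # 10 time steps, 4-dim frame features
hs = rnn_hidden_states(x, rng.standard_normal((8, 4)) * 0.1,
                       rng.standard_normal((8, 8)) * 0.1, np.zeros(8))
print(hs.shape)  # (10, 8)
```

The stack of per-step hidden states is exactly what the attention layer described next consumes.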
根据本公开的一种具体实施方式，所述将对应时间序列的隐藏状态模型输入注意力层，确定各时间步的特征权重值，将对应各时间步的隐藏状态及特征权重值加权求和，得到对应样本语音的级别的步骤，包括：According to a specific embodiment of the present disclosure, the steps of inputting the hidden states corresponding to the time series into the attention layer, determining the feature weight value of each time step, and weighting and summing the hidden states and feature weight values corresponding to the time steps to obtain the level of the corresponding sample speech include:
计算得到的各时间步的特征权重值 The feature weight value of each time step is computed as

α_t = exp(Wᵀ·h_t) / Σ_{i=1}^{T} exp(Wᵀ·h_i)

以及，样本语音的级别 and the level of the sample speech as

C = Σ_{t=1}^{T} α_t·h_t
其中,α t表示时间步t的特征权重值,h t为门控循环单元输出的隐藏状态,W表示要学习的参数向量,C表示样本语音的级别。 Among them, α t represents the feature weight value at time step t, h t is the hidden state output by the gated recurrent unit, W represents the parameter vector to be learned, and C represents the level of the sample speech.
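A minimal NumPy sketch of this attention pooling, softmax weights over the GRU hidden states followed by a weighted sum, is shown below; the hidden-state and parameter-vector dimensions are illustrative.

```python
import numpy as np

def attention_pool(hs, W):
    """alpha_t = softmax over t of (W . h_t); C = sum_t alpha_t * h_t."""
    scores = hs @ W                        # one scalar score per time step
    alpha = np.exp(scores - scores.max())  # numerically stable softmax
    alpha = alpha / alpha.sum()
    return alpha, alpha @ hs               # weights and pooled level C

rng = np.random.default_rng(0)
hs = rng.standard_normal((10, 8))          # 10 hidden states from the GRU
alpha, C = attention_pool(hs, rng.standard_normal(8))
print(round(float(alpha.sum()), 6), C.shape)  # 1.0 (8,)
```

The pooled vector C is what the fully connected classification layer(s) receive.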
根据本公开的一种具体实施方式,所述神经网络还包括用于性别分类的第二全连接层;According to a specific embodiment of the present disclosure, the neural network further includes a second fully connected layer for gender classification;
所述将对应各时间步的隐藏状态及特征权重值加权求和,得到对应样本语音的级别的步骤之后,所述方法还包括:After the weighted summation of the hidden state and feature weight value corresponding to each time step to obtain the level of the corresponding sample speech, the method further includes:
将所述样本语音的级别输入所述第二全连接层,得到所述样本语音的说话人性别分类结果。The level of the sample speech is input into the second fully connected layer to obtain the speaker gender classification result of the sample speech.
第二方面,本发明实施例提供了一种语音唤醒程度获取方法,所述方法包括:In a second aspect, an embodiment of the present invention provides a method for acquiring a voice arousal degree, the method comprising:
获取待识别语音;Get the speech to be recognized;
将所述待识别语音输入唤醒程度识别模型,输出所述待识别语音的唤醒程度标签,所述唤醒程度识别模型是根据上述任一项所述的唤醒程度识别模型训练方法获得的。The to-be-recognized speech is input into a wake-up level recognition model, and a wake-up level label of the to-be-recognized speech is output, where the wake-up level recognition model is obtained according to any one of the arousal level recognition model training methods described above.
第三方面,本发明实施例提供了一种唤醒程度识别模型训练装置,所述装置包括:In a third aspect, an embodiment of the present invention provides an apparatus for training an arousal degree recognition model, the apparatus comprising:
获取模块,用于获取样本语音的唤醒程度标签,并根据所述样本语音的唤醒程度标签对部分所述样本语音进行数据增强;an acquisition module, used for acquiring the wake-up degree label of the sample voice, and performing data enhancement on part of the sample voice according to the wake-up degree label of the sample voice;
提取模块,用于提取所述样本语音对应帧序列的特征矩阵;an extraction module, for extracting the feature matrix of the corresponding frame sequence of the sample speech;
训练模块,用于将各类唤醒程度标签对应帧序列的特征矩阵及对应的唤醒程度标签输入神经网络进行训练。The training module is used for inputting the feature matrix of the frame sequence corresponding to various arousal degree labels and the corresponding arousal degree labels into the neural network for training.
第四方面,本发明实施例提供了一种语音唤醒程度获取装置,所述装置包括:In a fourth aspect, an embodiment of the present invention provides a device for acquiring a voice wake-up degree, the device comprising:
获取模块,用于获取待识别语音;an acquisition module, used to acquire the speech to be recognized;
识别模块，用于将所述待识别语音输入唤醒程度识别模型，输出所述待识别语音的唤醒程度标签，所述唤醒程度识别模型是根据第一方面中任一项所述的唤醒程度识别模型训练方法获得的。a recognition module, configured to input the speech to be recognized into a wake-up degree recognition model and output a wake-up degree label of the speech to be recognized, where the wake-up degree recognition model is obtained according to the wake-up degree recognition model training method of any one of the first aspect.
第五方面，本发明实施例提供了一种计算机设备，包括存储器以及处理器，所述存储器用于存储计算机程序，所述计算机程序在所述处理器运行时执行第一方面中任一项所述的唤醒程度识别模型训练方法，或者第二方面所述的语音唤醒程度获取方法。In a fifth aspect, an embodiment of the present invention provides a computer device, including a memory and a processor, where the memory is used to store a computer program, and the computer program, when run on the processor, executes the method for training an arousal degree recognition model according to any one of the first aspect, or the method for acquiring a voice arousal degree according to the second aspect.
第六方面，本发明实施例提供了一种计算机可读存储介质，其存储有计算机程序，所述计算机程序在处理器上运行时执行第一方面中任一项所述的唤醒程度识别模型训练方法，或者第二方面所述的语音唤醒程度获取方法。In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program, and the computer program, when run on a processor, executes the method for training an arousal degree recognition model according to any one of the first aspect, or the method for acquiring a voice arousal degree according to the second aspect.
上述本申请提供的唤醒程度识别模型训练方法及语音唤醒程度获取方法，针对不同唤醒程度的样本语音进行特征提取，并输入到神经网络中进行训练，这样即可得到能够识别语音唤醒程度的唤醒程度识别模型。将唤醒程度识别模型应用于语音识别场景，在基础语音识别的基础上增加唤醒程度的识别，增强语音识别的准确性和多样性。In the above-described arousal degree recognition model training method and voice arousal degree acquisition method provided by the present application, features are extracted from sample voices with different arousal degrees and input into a neural network for training, so that an arousal degree recognition model capable of recognizing the arousal degree of speech can be obtained. Applying the arousal degree recognition model to speech recognition scenarios adds arousal degree recognition on top of basic speech recognition, enhancing the accuracy and diversity of speech recognition.
附图说明Description of drawings
为了更清楚地说明本发明的技术方案,下面将对实施例中所需要使用的附图作简单 地介绍,应当理解,以下附图仅示出了本发明的某些实施例,因此不应被看作是对本发明保护范围的限定。在各个附图中,类似的构成部分采用类似的编号。In order to illustrate the technical solutions of the present invention more clearly, the accompanying drawings required in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention, and therefore should not be It is regarded as the limitation of the protection scope of the present invention. In the various figures, similar components are numbered similarly.
图1示出了本申请实施例提供的一种唤醒程度识别模型训练方法的流程示意图;FIG. 1 shows a schematic flowchart of a training method for an arousal degree recognition model provided by an embodiment of the present application;
图2示出了本申请实施例提供的唤醒程度识别模型训练方法所涉及的数据增强的部分流程示意图;FIG. 2 shows a schematic flowchart of part of the data enhancement involved in the training method for an arousal degree identification model provided by an embodiment of the present application;
图3示出了本申请实施例提供的唤醒程度识别模型训练方法所涉及的提取特征矩阵的部分流程示意图;FIG. 3 shows a schematic partial flowchart of a feature matrix extraction involved in a method for training an arousal degree identification model provided by an embodiment of the present application;
图4示出了本申请实施例提供的唤醒程度识别模型训练方法所涉及的模型训练的部分流程示意图;FIG. 4 shows a schematic flowchart of part of the model training involved in the method for training an arousal degree recognition model provided by an embodiment of the present application;
图5示出了本申请实施例提供的唤醒程度识别模型训练方法所涉及的神经网络的部分结构示意图;FIG. 5 shows a schematic structural diagram of a part of the neural network involved in the method for training an arousal degree recognition model provided by an embodiment of the present application;
图6示出了本申请实施例提供的一种语音唤醒程度获取方法的流程示意图;FIG. 6 shows a schematic flowchart of a method for acquiring a voice arousal degree provided by an embodiment of the present application;
图7示出了本申请实施例提供的一种唤醒程度识别模型训练装置的模块框图;FIG. 7 shows a block diagram of a module of an apparatus for training an arousal degree recognition model provided by an embodiment of the present application;
图8示出了本申请实施例提供的一种语音唤醒程度获取装置的模块框图;FIG. 8 shows a block diagram of a module of an apparatus for acquiring a voice wakefulness degree provided by an embodiment of the present application;
图9示出了本申请实施例提供的一种计算机设备的硬件结构图。FIG. 9 shows a hardware structure diagram of a computer device provided by an embodiment of the present application.
具体实施方式Detailed ways
下面将结合本发明实施例中附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only a part of the embodiments of the present invention, but not all of the embodiments.
通常在此处附图中描述和示出的本发明实施例的组件可以以各种不同的配置来布置和设计。因此,以下对在附图中提供的本发明的实施例的详细描述并非旨在限制要求保护的本发明的范围,而是仅仅表示本发明的选定实施例。基于本发明的实施例,本领域技术人员在没有做出创造性劳动的前提下所获得的所有其他实施例,都属于本发明保护的范围。The components of the embodiments of the invention generally described and illustrated in the drawings herein may be arranged and designed in a variety of different configurations. Thus, the following detailed description of the embodiments of the invention provided in the accompanying drawings is not intended to limit the scope of the invention as claimed, but is merely representative of selected embodiments of the invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative work fall within the protection scope of the present invention.
在下文中,可在本发明的各种实施例中使用的术语“包括”、“具有”及其同源词仅意在表示特定特征、数字、步骤、操作、元件、组件或前述项的组合,并且不应被理解为首先排除一个或更多个其它特征、数字、步骤、操作、元件、组件或前述项的组合的存在或增加一个或更多个特征、数字、步骤、操作、元件、组件或前述项的组合的可能性。Hereinafter, the terms "comprising", "having" and their cognates, which may be used in various embodiments of the present invention, are only intended to denote particular features, numbers, steps, operations, elements, components, or combinations of the foregoing, and should not be construed as first excluding the presence of or adding one or more other features, numbers, steps, operations, elements, components or combinations of the foregoing or the possibility of a combination of the foregoing.
此外,术语“第一”、“第二”、“第三”等仅用于区分描述,而不能理解为指示或暗 示相对重要性。Furthermore, the terms "first", "second", "third", etc. are only used to differentiate the description and should not be construed as indicating or implying relative importance.
除非另有限定,否则在这里使用的所有术语(包括技术术语和科学术语)具有与本发明的各种实施例所属领域普通技术人员通常理解的含义相同的含义。所述术语(诸如在一般使用的词典中限定的术语)将被解释为具有与在相关技术领域中的语境含义相同的含义并且将不被解释为具有理想化的含义或过于正式的含义,除非在本发明的各种实施例中被清楚地限定。Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of this invention belong. The terms (such as those defined in commonly used dictionaries) will be interpreted as having the same meaning as the contextual meaning in the relevant technical field and will not be interpreted as having an idealized or overly formal meaning, unless explicitly defined in the various embodiments of the present invention.
实施例1Example 1
Referring to FIG. 1, which is a schematic flowchart of a training method for an arousal degree recognition model (hereinafter referred to as the model training method) provided by an embodiment of the present invention. As shown in FIG. 1, the model training method mainly includes the following steps:
S101: obtaining arousal labels of the sample speech, and performing data augmentation on part of the sample speech according to the arousal labels of the sample speech;
The model training method provided in this embodiment mainly uses sample speech with known arousal (Arousal) values to train a basic neural network, so as to obtain an arousal degree recognition model capable of recognizing arousal. Arousal represents the level of emotional physiological activation; for example, "anger" or "excitement" corresponds to a higher arousal level than calm.
Arousal labels are usually continuous emotional attributes, with raw label values distributed in [1, 5]. To facilitate discrimination, the continuous attribute can be discretized into three classes, for example by dividing the continuous arousal values into three intervals: arousal values in [1, 2] are classified as the first (relatively low) arousal level, values in (2, 4) as the second (medium) arousal level, and values in [4, 5] as the third (relatively high) arousal level. For ease of description, the speech belonging to these three classes can be relabeled as 1, 2, 3, etc., which turns the problem into a three-class emotion classification problem on the arousal label. Of course, other classification schemes are also possible, such as four labels (zero, low, medium, and high), without limitation.
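The interval scheme above can be sketched as a small mapping function; the function name and the integer class labels are illustrative, not from the original:

```python
def arousal_class(arousal):
    """Map a continuous arousal value in [1, 5] onto the three discrete
    levels described above: [1, 2] -> 1 (low), (2, 4) -> 2 (medium),
    [4, 5] -> 3 (high)."""
    if not 1.0 <= arousal <= 5.0:
        raise ValueError("arousal value must lie in [1, 5]")
    if arousal <= 2.0:
        return 1  # relatively low arousal
    if arousal < 4.0:
        return 2  # medium arousal
    return 3      # relatively high arousal
```

The boundary values 2 and 4 are assigned to the closed intervals, matching the [1, 2] / (2, 4) / [4, 5] notation in the text.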
When preparing the sample speech for training the arousal degree recognition model, sample speech of each arousal level must be prepared and labeled with the corresponding arousal labels, so that the neural network can learn the speech features of different arousal levels.
Sample speech can be obtained in various ways. According to a specific implementation of the present disclosure, the step in S101 of obtaining sample speech corresponding to the various arousal levels may include:
selecting, from a preset data set, first-class sample speech corresponding to the first arousal label, second-class sample speech corresponding to the second arousal label, and third-class sample speech corresponding to the third arousal label.
Regarding the coverage of arousal levels, the arousal of the speech to be recognized can be divided into three levels, whose labels are defined as the first, second, and third arousal labels, respectively, and the arousal corresponding to these three labels can be set to increase in that order. The corresponding sample speech is then obtained according to each arousal label: first-class sample speech with relatively low arousal is selected for the first arousal label, second-class sample speech with medium arousal for the second arousal label, and third-class sample speech with relatively high arousal for the third arousal label.
Further, the IEMOCAP data set is one of the most widely used data sets in speech emotion recognition: the whole data set is relatively well standardized from dialogue design to emotion annotation, it contains many dialogues, and its annotations include both discrete and continuous emotion labels, which meets the requirements of the present invention. Therefore, in this embodiment, the Interactive Emotional Dyadic Motion Capture (IEMOCAP) data set is selected as the preset data set. In other embodiments, other qualifying data sets may also be selected.
When extracting sample speech from the IEMOCAP data set, the arousal value recorded for each sample in the data set can be used: for example, samples with arousal values in [1, 2] serve as the first-class sample speech, samples with arousal values in (2, 4) as the second-class sample speech, and samples with arousal values in [4, 5] as the third-class sample speech. Other division and selection schemes are of course possible, without limitation. In addition, model training requires a relatively large number of speech samples to reach higher recognition accuracy. Since the number of samples obtained from the preset data set, or the IEMOCAP data set, is small, the total number of samples can be expanded through data augmentation to improve the recognition accuracy of the trained model.
To optimize the training effect, the numbers of input samples of the various classes should preferably be equal or close. According to a specific implementation of the present disclosure, as shown in FIG. 2, the step in S101 of obtaining the arousal labels of the sample speech and performing data augmentation on part of the sample speech according to those labels includes:
S201: judging whether the difference between the numbers of samples under the various arousal labels is greater than or equal to a preset difference;
S202: if the difference between the numbers of samples under the various arousal labels is greater than or equal to the preset difference, performing data augmentation on the under-represented sample speech until the difference between the numbers of samples under the various arousal labels is smaller than the preset difference.
In this implementation, the number of training samples allowed per class may be preset to about 3000, and the difference between the numbers of samples of the various classes is compared against the preset difference. The preset difference may be set to 0, requiring the class sizes to be exactly equal, or to another value greater than 0, allowing a partial difference between the class sizes.
In a specific implementation, after the sample speech is obtained, it is first judged whether the difference between the numbers of samples under the various arousal labels is greater than or equal to the preset difference. If the actual difference is greater than or equal to the preset difference, data augmentation is performed on the under-represented sample speech; if the actual difference is smaller than the preset difference, no data augmentation of the sample speech is needed.
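The S201 check can be sketched as a one-line comparison; the function name and the dict-based class counts are assumptions for illustration:

```python
def needs_augmentation(class_counts, preset_diff):
    """S201: return True when the gap between the largest and smallest
    class is greater than or equal to the preset difference, which
    triggers augmentation of the under-represented classes (S202)."""
    counts = list(class_counts.values())
    return max(counts) - min(counts) >= preset_diff
```

For example, with 1000/4000/3500 samples per class and a preset difference of 500, augmentation is triggered.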
In a specific implementation, the above step of performing data augmentation on the under-represented sample speech may include:
adding noise to the initial sample speech to obtain augmented speech;
using the initial sample speech together with the augmented speech as the sample speech for training.
Further, the step of adding noise to the sample speech to obtain the augmented speech includes:
loading the sample audio with the librosa library to obtain a floating-point time series;
computing the following formula on the floating-point time series S to obtain the noise-augmented speech SN_i:

SN_i = S_i + r · w_i,  (1)

where i = 1, 2, ..., L, S_i denotes the i-th element of the floating-point time series, L denotes the length of the time series, r is the coefficient of w, with r taking values in [0.001, 0.002], and w is a sequence of floating-point numbers obeying a Gaussian distribution. In this embodiment, the noise is white Gaussian noise.
For example, initially there are 1000 low-class samples, 4000 medium-class samples, and 3500 high-class samples. For the low class, r = 0.001 can first be taken and noise added to the initial samples to obtain 1000 new samples, increasing the low-class training samples to 2000. If r = 0.002 is then taken on this basis and noise is added to the original samples again, the low-class samples can be increased to 3000 or even more. The specific difference can be customized according to the sample type or the required recognition accuracy of the model. In Python, w is generated by numpy.random.normal(0, 1, len(S)); it is essentially a sequence of L Gaussian-distributed numbers.
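The augmentation step can be sketched with NumPy alone; in the described pipeline the series S would come from librosa.load(), but a plain array stands in here so the sketch is self-contained:

```python
import numpy as np

def add_gaussian_noise(signal, r=0.001):
    """Compute SN = S + r * w, where w ~ N(0, 1) has the same length
    as S and r is a small coefficient in [0.001, 0.002], so the added
    white Gaussian noise is barely audible."""
    w = np.random.normal(0, 1, len(signal))  # Gaussian floats, length L
    return signal + r * w

# Example: one augmentation pass doubles an under-represented class.
originals = [np.zeros(16000) for _ in range(3)]        # stand-in clips
augmented = [add_gaussian_noise(s, r=0.001) for s in originals]
training_set = originals + augmented                   # 3 -> 6 samples
```

A second pass with r = 0.002 over the same originals would triple the class, as in the worked example above.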
Augmenting the speech data by adding noise avoids producing exact copies of the original speech: the noised audio differs from the original, and because r is set to a small value the difference is barely audible to the human ear, so the emotion before and after adding noise is unaffected.
In this implementation, adding noise to the speech of under-represented classes achieves data augmentation, alleviates the size differences among the low, medium, and high classes, and ensures that no single class dominates any batch, thereby preventing, to some extent, the trained model from always biasing its predictions toward the majority class. Alternatively, when obtaining the sample speech, the numbers of samples of the various classes can be directly constrained to differ by less than the preset difference, or the sample speech can simply be duplicated as-is for augmentation, to reduce the impact on the training effect.
S102: extracting the feature matrix of the frame sequence corresponding to the sample speech;
After the sample speech corresponding to the various arousal levels is obtained, each sample is divided into frames to obtain the frame sequence corresponding to each sample. The feature matrix corresponding to the frame sequence is extracted and used to learn and summarize the speech features of the various arousal levels.
Specifically, according to a specific implementation of the present disclosure, the step in S102 of extracting the feature matrix of the frame sequence corresponding to the sample speech, as shown in FIG. 3, may include:
S301: dividing the sample speech into a preset number of speech frames;
S302: extracting the low-level descriptor features and their first-order derivatives for each speech frame in frame order;
S303: obtaining the feature matrices corresponding to the various classes of sample speech according to the frame sequence and the low-level descriptor features and first-order derivatives of each speech frame.
In speech emotion recognition, the sample speech is divided into speech frames along the time axis, and the features of adjacent frames are correlated, or even overlap, over adjacent periods. In the feature extraction stage, the Opensmile tool can be used to extract Low-Level Descriptor (LLD) features and their first-order derivatives; the low-level descriptor set may be IS13_compare. With 65 LLD features and 65 first-order derivatives of those features, the total number of features is 65 + 65 = 130.
When framing the sample speech, the frame length can be set to 20 ms and the frame shift to 10 ms. In the IEMOCAP data set the length of each utterance is not fixed, so the number of frames extracted per utterance also varies. In a specific implementation, the maximum number of frames per utterance can be uniformly set to 750: if the actual number of frames (frame_num) is less than 750, a padding operation is performed, appending (750 − frame_num) rows of zeros after the extracted two-dimensional features; if the actual number of frames is greater than 750, a truncation operation is performed. The feature matrix of each sample is thus frames × features, i.e., a two-dimensional matrix of size 750 × 130.
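The padding/truncation rule can be sketched as below; this is a minimal sketch that assumes the matrix already holds the 130 per-frame LLD features:

```python
import numpy as np

MAX_FRAMES = 750    # maximum frames per utterance
NUM_FEATURES = 130  # 65 LLDs + 65 first-order derivatives

def pad_or_truncate(features):
    """Force a (frame_num, 130) feature matrix to the fixed shape
    (750, 130): append rows of zeros when frame_num < 750, otherwise
    truncate to the first 750 frames."""
    frame_num = features.shape[0]
    if frame_num < MAX_FRAMES:
        zeros = np.zeros((MAX_FRAMES - frame_num, NUM_FEATURES))
        return np.vstack([features, zeros])
    return features[:MAX_FRAMES]
```

Every utterance then yields the same 750 × 130 matrix regardless of its original duration, which is what a fixed-shape network input requires.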
S103: inputting the feature matrices of the frame sequences corresponding to the various arousal labels and the corresponding arousal labels into a neural network, and learning and training to obtain the arousal degree recognition model.
After the feature matrices corresponding to the samples of the various arousal labels are obtained through the above steps, the feature matrices and the corresponding arousal labels can be input into a pre-prepared neural network for training, which learns and summarizes the features, so as to obtain an arousal degree recognition model capable of recognizing different speech arousal levels.
According to a specific implementation of the present disclosure, as shown in FIGS. 2 and 4, the feature matrices of the frame sequences corresponding to the various arousal labels and the corresponding arousal labels are input into the neural network for training. As shown in FIG. 5, the neural network includes a gated recurrent unit, an attention layer, and a first fully connected layer for emotion classification. In this implementation, the neural network encoding the feature matrix is a Recurrent Neural Network (RNN) that comprises, in order, a Gated Recurrent Unit (GRU), an attention layer, and the first fully connected layer, with a data-transfer relationship between adjacent layers; usually the output of the upper layer is the input of the lower layer. Of course, the gated encoding unit may also be another encoding unit, such as a Long Short-Term Memory (LSTM) layer, without limitation.
As shown in FIGS. 4 and 5, the method may specifically include:
S401: feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding arousal label into the gated recurrent unit, forming a hidden state corresponding to each time step inside the gated recurrent unit;
According to a specific implementation of the present disclosure, the step of feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding arousal label into the gated recurrent unit and forming the hidden states corresponding to each time step inside the gated recurrent unit includes:
feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding arousal label into the gated recurrent unit, forming an internal hidden state h_t inside the gated recurrent unit;
updating at each time step using the feature x_t and the hidden state h_{t-1} of the previous time step, where the hidden-state update formula is:

h_t = f_θ(h_{t-1}, x_t),  (2)

where f_θ is the RNN function with weight parameters θ, h_t denotes the hidden state at the t-th time step, and x_t denotes the t-th feature in x = {x_{1:t}}.
S402: inputting the hidden states of the corresponding time sequence into the attention layer and determining the feature weight of each time step;
The attention layer is used to focus on the emotion-relevant parts. Specifically, as shown in FIG. 4, at time step t the output of the GRU is h_t, and the normalized importance weight is first computed with the softmax function:

α_t = exp(W · h_t) / Σ_{τ=1}^{T} exp(W · h_τ),  (3)

where α_t denotes the feature weight at time step t, h_t is the hidden state output by the gated recurrent unit, W denotes the parameter vector to be learned, and T is the total number of time steps.
S403: computing the weighted sum of the hidden states and feature weights corresponding to all time steps to obtain the utterance-level representation of the sample speech;
A weighted sum is then performed according to these weights, summing the hidden states of all time steps weighted by their feature weights, to obtain the utterance-level representation of the sample speech:

C = Σ_{t=1}^{T} α_t h_t.  (4)
S404: inputting the utterance-level representation of the sample speech into the first fully connected layer to obtain the arousal classification result of the sample speech.
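The softmax weighting and the weighted sum described above can be sketched together as one NumPy attention-pooling step; the hidden states and the parameter vector W are taken as given arrays here, since training itself is out of scope for the sketch:

```python
import numpy as np

def attention_pool(h, W):
    """Given GRU hidden states h of shape (T, d) and a learned parameter
    vector W of shape (d,), compute the normalized weights
    alpha_t = softmax(W . h_t) and return C = sum_t alpha_t * h_t."""
    scores = h @ W                   # un-normalized scores, shape (T,)
    scores = scores - scores.max()   # shift for numerical stability
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum()      # softmax over time steps
    return alpha @ h                 # utterance-level representation C
```

With two hidden states and a W strongly aligned with the first one, the pooled C lies almost entirely on that first state, showing how attention emphasizes emotion-relevant frames.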
The sentence-level representation C obtained through the attention layer is input into the emotion classification network, i.e., the first fully connected layer, for emotion classification. In addition, in order to perform multi-task classification, on the basis of the first fully connected layer and according to a specific implementation of the present disclosure, the neural network further includes a second fully connected layer for gender classification.
After the step of computing the weighted sum of the hidden states and feature weights corresponding to all time steps to obtain the utterance-level representation of the sample speech, the method further includes:
inputting the utterance-level representation of the sample speech into the second fully connected layer to obtain the speaker gender classification result of the sample speech.
In this implementation, the multi-task setup includes emotion classification and gender classification, where gender classification is a binary task serving as an auxiliary task to emotion classification. The emotion classification network includes the first fully connected layer and a softmax layer; the gender classification network includes the second fully connected layer and a softmax layer. The structure is shown in FIG. 5, where y_E denotes the predicted probabilities that a sentence belongs to the low, medium, and high emotion classes, and y_G denotes the predicted probabilities that the speaker of a sentence is male or female. The loss equation for multi-task classification is as follows:
L = α · l_emotion + β · l_gender,  (5)
where l_emotion and l_gender denote the emotion classification loss and the gender classification loss, respectively, and α and β denote the weights of the two tasks; in this work both are set to 1. The loss functions of both tasks are cross-entropy losses, computed as follows:
l_emotion = −(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} y_{i,k} log p_{i,k},  (6)
where N denotes the total number of samples, K is the total number of emotion classes, y_{i,k} denotes the true probability that the i-th sample belongs to class k, and p_{i,k} denotes the predicted probability that the i-th sample belongs to class k.
l_gender = −(1/N) Σ_{i=1}^{N} [ y_i log p_i + (1 − y_i) log(1 − p_i) ],  (7)
where y_i denotes the true label of the sample, and p_i denotes the predicted probability that the sample belongs to class 1.
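The loss computation described above can be sketched as follows; the labels are assumed one-hot for emotion and binary for gender, and the function names are illustrative:

```python
import numpy as np

def cross_entropy(y_true, p_pred):
    """Multi-class cross entropy:
    -(1/N) * sum_i sum_k y_{i,k} * log(p_{i,k})."""
    return -np.mean(np.sum(y_true * np.log(p_pred), axis=1))

def binary_cross_entropy(y_true, p_pred):
    """Binary cross entropy for the auxiliary gender task:
    -(1/N) * sum_i [ y_i log p_i + (1 - y_i) log(1 - p_i) ]."""
    return -np.mean(y_true * np.log(p_pred)
                    + (1 - y_true) * np.log(1 - p_pred))

def multitask_loss(y_emo, p_emo, y_gen, p_gen, alpha=1.0, beta=1.0):
    """Weighted sum of the two task losses; both weights are 1 in the text."""
    return (alpha * cross_entropy(y_emo, p_emo)
            + beta * binary_cross_entropy(y_gen, p_gen))
```

A confident, correct emotion prediction yields a much smaller combined loss than a confidently wrong one, which is the gradient signal the multi-task head trains on.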
In summary, the model training method provided by the present application extracts features from sample speech with different arousal labels and inputs them into a neural network for training, so that an arousal degree recognition model capable of recognizing speech arousal labels is obtained. Applying the arousal degree recognition model to speech recognition scenarios adds arousal recognition on top of basic speech recognition, enhancing the accuracy and diversity of speech recognition.
Example 2
Referring to FIG. 6, which is a schematic flowchart of a method for obtaining a speech arousal degree provided by an embodiment of the present invention. As shown in FIG. 6, the method includes the following steps:
S601: obtaining the speech to be recognized;
S602: inputting the speech to be recognized into an arousal degree recognition model and outputting the arousal label of the speech to be recognized.
The arousal degree recognition model is obtained according to the arousal degree recognition model training method described in the above embodiment.
In this implementation, the arousal degree recognition model built in the above embodiment is loaded into a computer device and applied to speech arousal acquisition scenarios. The speech to be recognized is input into the computer device loaded with the arousal degree recognition model, which then outputs the arousal degree of that speech. The speech to be recognized may be speech collected by the computer device, or speech obtained from the network or other channels.
For the specific implementation of the speech arousal degree acquisition method provided in this embodiment, reference may be made to the specific implementation of the arousal degree recognition model training method provided by the embodiment shown in FIG. 1, which will not be repeated here.
Example 3
Referring to FIG. 7, which is a module block diagram of an apparatus for training an arousal degree recognition model provided by an embodiment of the present invention. As shown in FIG. 7, the arousal degree recognition model training apparatus 700 mainly includes:
an obtaining module 701, configured to obtain the arousal labels of the sample speech and perform data augmentation on part of the sample speech according to those labels;
an extraction module 702, configured to extract the feature matrix of the frame sequence corresponding to the sample speech;
a training module 703, configured to input the feature matrices of the frame sequences corresponding to the various arousal labels and the corresponding arousal labels into the neural network for training.
Example 4
Referring to FIG. 8, which is a module block diagram of an apparatus for obtaining a speech arousal degree provided by an embodiment of the present invention. As shown in FIG. 8, the speech arousal degree obtaining apparatus 800 includes:
an obtaining module 801, configured to obtain the speech to be recognized;
a recognition module 802, configured to input the speech to be recognized into an arousal degree recognition model and output the arousal label of the speech to be recognized, the arousal degree recognition model being obtained according to the arousal degree recognition model training method described in the above embodiment.
In addition, an embodiment of the present disclosure provides a computer device including a memory and a processor, where the memory stores a computer program that, when running on the processor, executes the arousal degree recognition model training method or the speech arousal degree acquisition method provided by the above method embodiments.
具体的,如图9所示,为实现本发明各个实施例的一种计算机设备,该计算机设备900包括但不限于:射频单元901、网络模块902、音频输出单元903、输入单元904、传感器905、显示单元906、用户输入单元907、接口单元908、存储器909、处理器910、以及电源911等部件。本领域技术人员可以理解,图9中示出的计算机设备结构并不构成对计算机设备的限定,计算机设备可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。在本发明实施例中,计算机设备包括但不限于手机、平板电脑、笔记本电脑、掌上电脑、车载终端、可穿戴设备、以及计步器等。Specifically, as shown in FIG. 9 , in order to implement a computer device according to various embodiments of the present invention, the computer device 900 includes but is not limited to: a radio frequency unit 901 , a network module 902 , an audio output unit 903 , an input unit 904 , and a sensor 905 , a display unit 906 , a user input unit 907 , an interface unit 908 , a memory 909 , a processor 910 , and a power supply 911 and other components. Those skilled in the art can understand that the structure of the computer device shown in FIG. 9 does not constitute a limitation on the computer device, and the computer device may include more or less components than the one shown, or combine some components, or different components layout. In this embodiment of the present invention, the computer equipment includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a palmtop computer, a vehicle-mounted terminal, a wearable device, a pedometer, and the like.
应理解的是,本发明实施例中,射频单元901可用于收发信息或通话过程中,信号的接收和发送,具体的,将来自基站的下行数据接收后,给处理器910处理;另外,将上行的数据发送给基站。通常,射频单元901包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器、双工器等。此外,射频单元901还可以通过无线通信系统与网络和其他设备通信。It should be understood that, in this embodiment of the present invention, the radio frequency unit 901 can be used for receiving and sending signals during sending and receiving of information or during a call. Specifically, after receiving the downlink data from the base station, it is processed by the processor 910; The uplink data is sent to the base station. Generally, the radio frequency unit 901 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier, a duplexer, and the like. In addition, the radio frequency unit 901 can also communicate with the network and other devices through a wireless communication system.
计算机设备通过网络模块902为用户提供了无线的宽带互联网访问,如帮助用户收发电子邮件、浏览网页和访问流式媒体等。The computer device provides the user with wireless broadband Internet access through the network module 902, such as helping the user to send and receive emails, browse the web, access streaming media, and so on.
音频输出单元903可以将射频单元901或网络模块902接收的或者在存储器909中存储的音频数据转换成音频信号并且输出为声音。而且,音频输出单元903还可以提供与计算机设备900执行的特定功能相关的音频输出(例如,呼叫信号接收声音、消息接收声音等等)。音频输出单元903包括扬声器、蜂鸣器以及受话器等。The audio output unit 903 may convert audio data received by the radio frequency unit 901 or the network module 902 or stored in the memory 909 into audio signals and output as sound. Also, the audio output unit 903 may also provide audio output related to a specific function performed by the computer device 900 (eg, call signal reception sound, message reception sound, etc.). The audio output unit 903 includes a speaker, a buzzer, a receiver, and the like.
输入单元904用于接收音频或视频信号。输入单元904可以包括图形处理器(Graphics Processing Unit,简称GPU)9041和麦克风9042,图形处理器9041对在视频捕获模式或图像捕获模式中由图像捕获计算机设备(如摄像头)获得的静态图片或视频的图 像数据进行处理。处理后的图像帧可以视频播放在显示单元906上。经图形处理器9041处理后的图像帧可以存储在存储器909(或其它存储介质)中或者经由射频单元901或网络模块902进行发送。麦克风9042可以接收声音,并且能够将这样的声音处理为音频数据。处理后的音频数据可以在电话通话模式的情况下转换为可经由射频单元901发送到移动通信基站的格式输出。The input unit 904 is used to receive audio or video signals. The input unit 904 may include a graphics processor (Graphics Processing Unit, GPU for short) 9041 and a microphone 9042, and the graphics processor 9041 is used for still pictures or videos obtained by an image capture computer device (such as a camera) in a video capture mode or an image capture mode. image data for processing. The processed image frames can be video-played on the display unit 906 . The image frames processed by the graphics processor 9041 may be stored in the memory 909 (or other storage medium) or transmitted via the radio frequency unit 901 or the network module 902 . The microphone 9042 can receive sound and can process such sound into audio data. The processed audio data can be converted into a format that can be transmitted to a mobile communication base station via the radio frequency unit 901 for output in the case of a telephone call mode.
The computer device 900 further includes at least one sensor 905, including at least the barometer mentioned in the above embodiments. In addition, the sensor 905 may also be another sensor such as a light sensor or a motion sensor. Specifically, the light sensor includes an ambient light sensor and a proximity sensor: the ambient light sensor can adjust the brightness of the display panel 9061 according to the brightness of the ambient light, and the proximity sensor can turn off the display panel 9061 and/or the backlight when the computer device 900 is moved to the ear. As a kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes) and can detect the magnitude and direction of gravity when stationary; it can be used to identify the posture of the computer device (such as switching between landscape and portrait screens, related games, and magnetometer posture calibration) and for vibration-recognition-related functions (such as a pedometer or tapping). The sensor 905 may also include a fingerprint sensor, a pressure sensor, an iris sensor, a molecular sensor, a gyroscope, a barometer, a hygrometer, a thermometer, an infrared sensor, and the like, which will not be described in detail here.
The display unit 906 is used to display information input by the user or information provided to the user. The display unit 906 may include a display panel 9061, which may take the form of a liquid crystal display (LCD) panel, an organic light-emitting diode (OLED) panel, or the like.
The user input unit 907 may be used to receive input numeric or character information and to generate key signal input related to user settings and function control of the computer device. Specifically, the user input unit 907 includes a touch panel 9071 and other input devices 9072. The touch panel 9071, also called a touch screen, can collect the user's touch operations on or near it (such as operations performed by the user on or near the touch panel 9071 with a finger, a stylus, or any other suitable object or accessory). The touch panel 9071 may include two parts: a touch detection device and a touch controller. The touch detection device detects the user's touch orientation, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 910, and receives and executes commands sent by the processor 910. In addition, the touch panel 9071 can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 9071, the user input unit 907 may also include other input devices 9072. Specifically, the other input devices 9072 may include, but are not limited to, a physical keyboard, function keys (such as volume control keys and switch keys), a trackball, a mouse, and a joystick, which will not be described in detail here.
Further, the touch panel 9071 can be overlaid on the display panel 9061. When the touch panel 9071 detects a touch operation on or near it, it transmits the operation to the processor 910 to determine the type of the touch event, and the processor 910 then provides a corresponding visual output on the display panel 9061 according to the type of the touch event. Although in FIG. 9 the touch panel 9071 and the display panel 9061 are implemented as two independent components to realize the input and output functions of the computer device, in some embodiments the touch panel 9071 and the display panel 9061 may be integrated to realize the input and output functions of the computer device, which is not specifically limited here.
The interface unit 908 is an interface for connecting an external device to the computer device 900. For example, the external device may include a wired or wireless headset port, an external power (or battery charger) port, a wired or wireless data port, a memory card port, a port for connecting a device with an identification module, an audio input/output (I/O) port, a video I/O port, an earphone port, and the like. The interface unit 908 may be used to receive input (for example, data information, power, etc.) from an external device and transmit the received input to one or more elements within the computer device 900, or may be used to transfer data between the computer device 900 and an external device.
The memory 909 may be used to store software programs and various data. The memory 909 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required for at least one function (such as a sound playback function or an image playback function), and the like, and the data storage area may store data created according to the use of the mobile phone (such as audio data and a phone book). In addition, the memory 909 may include a high-speed random access memory, and may also include a non-volatile memory, for example at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The processor 910 is the control center of the computer device. It uses various interfaces and lines to connect all parts of the entire computer device, and performs the various functions of the computer device and processes data by running or executing the software programs and/or modules stored in the memory 909 and calling the data stored in the memory 909, thereby monitoring the computer device as a whole. The processor 910 may include one or more processing units; preferably, the processor 910 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 910.
The computer device 900 may further include a power supply 911 (such as a battery) for supplying power to the various components. Preferably, the power supply 911 may be logically connected to the processor 910 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
In addition, the computer device 900 includes some functional modules that are not shown, which will not be described in detail here.
The memory is used to store a computer program which, when run on the processor, executes the above-described arousal degree recognition model training method or speech arousal degree acquisition method.
In addition, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when run on a processor, executes the above-described arousal degree recognition model training method or speech arousal degree acquisition method.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may also be implemented in other manners. The apparatus embodiments described above are merely illustrative. For example, the flowcharts and block diagrams in the accompanying drawings show the possible architectures, functions, and operations of apparatuses, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in alternative implementations, the functions noted in a block may occur out of the order noted in the figures. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the functional modules or units in the various embodiments of the present invention may be integrated together to form an independent part, each module may exist alone, or two or more modules may be integrated to form an independent part.
If the functions are implemented in the form of software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a smartphone, a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only specific embodiments of the present invention, but the protection scope of the present invention is not limited thereto. Any person skilled in the art can readily conceive of changes or substitutions within the technical scope disclosed by the present invention, and all such changes or substitutions shall fall within the protection scope of the present invention.

Claims (15)

  1. A method for training an arousal degree recognition model, characterized in that the method comprises:
    acquiring an arousal degree label of sample speech, and performing data enhancement on part of the sample speech according to the arousal degree label of the sample speech;
    extracting a feature matrix of a frame sequence corresponding to the sample speech; and
    inputting the feature matrices of the frame sequences corresponding to the respective arousal degree labels and the corresponding arousal degree labels into a neural network for training.
  2. The method according to claim 1, characterized in that the step of acquiring the arousal degree label of the sample speech comprises:
    selecting, from a preset data set, a first type of sample speech corresponding to a first arousal degree label, a second type of sample speech corresponding to a second arousal degree label, and a third type of sample speech corresponding to a third arousal degree label.
  3. The method according to claim 2, characterized in that the step of performing data enhancement on part of the sample speech according to the arousal degree label of the sample speech comprises:
    determining whether the difference between the numbers of sample speeches of the respective arousal degree labels is greater than or equal to a preset number difference; and
    if the difference between the numbers of sample speeches of the respective arousal degree labels is greater than or equal to the preset number difference, performing data enhancement processing on the sample speeches of smaller number until the difference between the numbers of sample speeches of the respective arousal degree labels is less than the preset number difference.
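The balancing loop described in claim 3 can be sketched in a few lines of Python. This is a minimal illustration under assumptions of my own (the threshold value, the dictionary layout, and the trivial `augment` callback are placeholders), not the patent's implementation:

```python
import random

def balance_by_augmentation(samples_by_label, max_diff, augment):
    """Augment the under-represented label class until the count gap
    between the largest and smallest classes drops below max_diff."""
    counts = {label: len(s) for label, s in samples_by_label.items()}
    while max(counts.values()) - min(counts.values()) >= max_diff:
        # pick the label with the fewest samples and augment one of its samples
        label = min(counts, key=counts.get)
        source = random.choice(samples_by_label[label])
        samples_by_label[label].append(augment(source))
        counts[label] += 1
    return samples_by_label

# Toy usage: the "augmentation" here simply copies the sample.
data = {"low": [[0.1]] * 2, "medium": [[0.2]] * 8, "high": [[0.3]] * 7}
balanced = balance_by_augmentation(data, max_diff=2, augment=lambda s: list(s))
```

In the patent's setting, `augment` would be the noise-addition routine of claims 4 and 5 rather than a copy.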
  4. The method according to claim 3, characterized in that the step of performing data enhancement processing on the sample speeches of smaller number comprises:
    adding noise to initial sample speech to obtain augmented speech; and
    taking the combination of the initial sample speech and the augmented speech as the sample speech used for training.
  5. The method according to claim 4, characterized in that the step of adding noise to the sample speech to obtain the augmented speech comprises:
    loading the sample speech using the librosa library to obtain a floating-point time series; and
    calculating the noise-added augmented speech SN_i from the floating-point time series S according to the following formula:

    SN_i = S_i + r · w,

    where i = 1, 2, ..., L; S_i denotes the i-th element of the floating-point time series; L denotes the length of the floating-point time series; r is the coefficient of w, with a value range of [0.001, 0.002]; and w is a floating-point number obeying a Gaussian distribution.
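A runnable sketch of claim 5's noise addition. In practice the floating-point time series would come from `librosa.load`; here a plain Python list stands in for it so the example stays self-contained, and `random.gauss` supplies the Gaussian values w:

```python
import random

def add_gaussian_noise(series, r=0.001, seed=0):
    """SN_i = S_i + r * w, with w drawn from a standard Gaussian.
    r is kept inside the range [0.001, 0.002] stated in the claim."""
    assert 0.001 <= r <= 0.002, "claim 5 restricts r to [0.001, 0.002]"
    rng = random.Random(seed)  # seeded so the example is reproducible
    return [s + r * rng.gauss(0.0, 1.0) for s in series]

# In practice: series, sr = librosa.load("sample.wav", sr=None)
clean = [0.0, 0.5, -0.5, 0.25]
noisy = add_gaussian_noise(clean, r=0.002)
```

Because r is at most 0.002, the perturbation is tiny relative to typical speech amplitudes, which matches the claim's intent of augmenting rather than corrupting the sample.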
  6. The method according to any one of claims 1 to 5, characterized in that the step of extracting the feature matrix of the frame sequence corresponding to the sample speech comprises:
    dividing the sample speech into a preset number of speech frames;
    extracting the low-level descriptor features and first-order derivatives of each speech frame according to the frame sequence; and
    obtaining the feature matrices corresponding to the respective types of sample speech according to the frame sequence and the low-level descriptor features and first-order derivatives of each speech frame.
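Claim 6's framing and per-frame feature extraction can be sketched as follows. The patent does not name the specific low-level descriptors, so frame energy and zero-crossing rate are used here purely as illustrative LLDs, and the first-order derivative is approximated by the difference between consecutive frames:

```python
def frame_signal(signal, num_frames):
    """Split the sample into a preset number of equal-length frames."""
    size = max(1, len(signal) // num_frames)
    return [signal[i * size:(i + 1) * size] for i in range(num_frames)]

def lld_features(frame):
    """Illustrative low-level descriptors: mean energy and zero-crossing rate."""
    energy = sum(x * x for x in frame) / len(frame)
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0) / len(frame)
    return [energy, zcr]

def feature_matrix(signal, num_frames):
    llds = [lld_features(f) for f in frame_signal(signal, num_frames)]
    # first-order derivative: difference with the previous frame (zeros for frame 0)
    deltas = [[0.0] * len(llds[0])] + [
        [c - p for c, p in zip(cur, prev)] for prev, cur in zip(llds, llds[1:])
    ]
    # one row per frame: LLDs concatenated with their first-order derivatives
    return [lld + d for lld, d in zip(llds, deltas)]

mat = feature_matrix([0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8], num_frames=4)
```

The resulting matrix has one row per frame in sequence order, which is the shape the gated recurrent unit of claim 7 consumes.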
  7. The method according to claim 6, characterized in that the neural network comprises a gated recurrent unit, an attention layer, and a first fully connected layer for emotion classification;
    the step of inputting the feature matrices of the frame sequences corresponding to the respective arousal degree labels and the corresponding arousal degree labels into the neural network for training comprises:
    feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding arousal degree label into the gated recurrent unit, and forming hidden states corresponding to the respective time steps inside the gated recurrent unit;
    inputting the hidden states of the corresponding time series into the attention layer to determine a feature weight value of each time step;
    weighting and summing the hidden states and feature weight values corresponding to the respective time steps to obtain the level of the corresponding sample speech; and
    inputting the level of the sample speech into the first fully connected layer to obtain an arousal degree label classification result of the sample speech.
  8. The method according to claim 7, characterized in that the step of feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding arousal degree label into the gated recurrent unit, and forming the hidden states corresponding to the respective time steps inside the gated recurrent unit, comprises:
    feeding the feature matrix of the frame sequence corresponding to the sample speech and the corresponding arousal degree label into the gated recurrent unit, and forming an internal hidden state h_t inside the gated recurrent unit; and
    updating, at each time step, using the feature x_t and the hidden state h_{t-1} of the previous time step, wherein the hidden state update formula is h_t = f_θ(h_{t-1}, x_t), which is an RNN function with weight parameter θ, h_t represents the hidden state at the t-th time step, and x_t represents the t-th feature in x = {x_{1:t}}.
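The update h_t = f_θ(h_{t-1}, x_t) of claim 8 is realized by the standard gated-recurrent-unit equations. A minimal pure-Python single-unit GRU step is sketched below; the weights in `theta` are arbitrary placeholders, not trained parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h_prev, x, theta):
    """One GRU update h_t = f_theta(h_{t-1}, x_t) for a single hidden unit."""
    z = sigmoid(theta["wz"] * x + theta["uz"] * h_prev + theta["bz"])  # update gate
    r = sigmoid(theta["wr"] * x + theta["ur"] * h_prev + theta["br"])  # reset gate
    h_cand = math.tanh(theta["wh"] * x + theta["uh"] * (r * h_prev) + theta["bh"])
    return (1.0 - z) * h_prev + z * h_cand  # interpolate old state and candidate

theta = {"wz": 0.5, "uz": 0.1, "bz": 0.0,
         "wr": 0.4, "ur": 0.2, "br": 0.0,
         "wh": 0.9, "uh": 0.3, "bh": 0.0}

# Run the sequence of per-frame features x_1..x_T through the recurrence.
h = 0.0
hidden_states = []
for x in [0.2, -0.1, 0.7]:
    h = gru_step(h, x, theta)
    hidden_states.append(h)
```

The list of per-step hidden states is exactly what the attention layer of claim 9 pools over.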
  9. The method according to claim 8, characterized in that the step of inputting the hidden states of the corresponding time series into the attention layer, determining the feature weight value of each time step, and weighting and summing the hidden states and feature weight values corresponding to the respective time steps to obtain the level of the corresponding sample speech comprises:
    calculating the feature weight value of each time step as

    α_t = exp(Wᵀh_t) / Σ_{τ=1..T} exp(Wᵀh_τ),

    and the level of the sample speech as

    C = Σ_{t=1..T} α_t · h_t,

    where α_t represents the feature weight value at time step t, h_t is the hidden state output by the gated recurrent unit, W represents the parameter vector to be learned, and C represents the level of the sample speech.
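Claim 9's attention pooling — a softmax over a learned scoring of each hidden state, followed by a weighted sum — can be sketched in pure Python. The score Wᵀh_t is a plain dot product here, and the vector `w` is an arbitrary placeholder rather than a learned parameter:

```python
import math

def attention_pool(hidden_states, w):
    """alpha_t = softmax_t(w . h_t); C = sum_t alpha_t * h_t (element-wise)."""
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in hidden_states]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden_states[0])
    pooled = [sum(a * h[d] for a, h in zip(alphas, hidden_states)) for d in range(dim)]
    return alphas, pooled

# Toy hidden states from three time steps of a 2-unit GRU.
states = [[0.1, 0.3], [0.5, -0.2], [0.2, 0.4]]
alphas, level = attention_pool(states, w=[1.0, 0.5])
```

The pooled vector `level` is the utterance-level representation C that the fully connected layers of claims 7 and 10 classify.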
  10. The method according to claim 9, characterized in that the neural network further comprises a second fully connected layer for gender classification;
    after the step of weighting and summing the hidden states and feature weight values corresponding to the respective time steps to obtain the level of the corresponding sample speech, the method further comprises:
    inputting the level of the sample speech into the second fully connected layer to obtain a speaker gender classification result of the sample speech.
  11. A method for acquiring a speech arousal degree, characterized in that the method comprises:
    acquiring speech to be recognized; and
    inputting the speech to be recognized into an arousal degree recognition model, and outputting an arousal degree label of the speech to be recognized, wherein the arousal degree recognition model is obtained according to the arousal degree recognition model training method of any one of claims 1-10.
  12. An apparatus for training an arousal degree recognition model, characterized in that the apparatus comprises:
    an acquisition module, configured to acquire an arousal degree label of sample speech and perform data enhancement on part of the sample speech according to the arousal degree label of the sample speech;
    an extraction module, configured to extract a feature matrix of a frame sequence corresponding to the sample speech; and
    a training module, configured to input the feature matrices of the frame sequences corresponding to the respective arousal degrees and the corresponding arousal degree labels into a neural network for training.
  13. An apparatus for acquiring a speech arousal degree, characterized in that the apparatus comprises:
    an acquisition module, configured to acquire speech to be recognized; and
    a recognition module, configured to input the speech to be recognized into an arousal degree recognition model and output an arousal degree label of the speech to be recognized, wherein the arousal degree recognition model is obtained according to the arousal degree recognition model training method of any one of claims 1-10.
  14. A computer device, characterized by comprising a memory and a processor, wherein the memory is configured to store a computer program which, when run on the processor, executes the arousal degree recognition model training method of any one of claims 1 to 10, or the speech arousal degree acquisition method of claim 11.
  15. A computer-readable storage medium, characterized in that it stores a computer program which, when run on a processor, executes the arousal degree recognition model training method of any one of claims 1 to 10, or the speech arousal degree acquisition method of claim 11.
PCT/CN2021/131223 2021-04-27 2021-11-17 Wake-up degree recognition model training method and speech wake-up degree acquisition method WO2022227507A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110462278.0 2021-04-27
CN202110462278.0A CN113192537B (en) 2021-04-27 2021-04-27 Awakening degree recognition model training method and voice awakening degree acquisition method

Publications (1)

Publication Number Publication Date
WO2022227507A1 true WO2022227507A1 (en) 2022-11-03

Family

ID=76979709

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/131223 WO2022227507A1 (en) 2021-04-27 2021-11-17 Wake-up degree recognition model training method and speech wake-up degree acquisition method

Country Status (2)

Country Link
CN (1) CN113192537B (en)
WO (1) WO2022227507A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116312513A (en) * 2023-02-13 2023-06-23 陕西省君凯电子科技有限公司 Intelligent voice control system
CN117058597A (en) * 2023-10-12 2023-11-14 清华大学 Dimension emotion recognition method, system, equipment and medium based on audio and video

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113192537B (en) * 2021-04-27 2024-04-09 深圳市优必选科技股份有限公司 Awakening degree recognition model training method and voice awakening degree acquisition method

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105609116A (en) * 2015-12-23 2016-05-25 东南大学 Speech emotional dimensions region automatic recognition method
US20170365277A1 (en) * 2016-06-16 2017-12-21 The George Washington University Emotional interaction apparatus
CN108091323A (en) * 2017-12-19 2018-05-29 想象科技(北京)有限公司 For identifying the method and apparatus of emotion from voice
CN108597541A (en) * 2018-04-28 2018-09-28 南京师范大学 A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying
US20190074028A1 (en) * 2017-09-01 2019-03-07 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
CN111966824A (en) * 2020-07-11 2020-11-20 天津大学 Text emotion recognition method based on emotion similarity attention mechanism
CN112216307A (en) * 2019-07-12 2021-01-12 华为技术有限公司 Speech emotion recognition method and device
US20210104245A1 (en) * 2019-06-03 2021-04-08 Amazon Technologies, Inc. Multiple classifications of audio data
CN113192537A (en) * 2021-04-27 2021-07-30 深圳市优必选科技股份有限公司 Awakening degree recognition model training method and voice awakening degree obtaining method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107272607A (en) * 2017-05-11 2017-10-20 上海斐讯数据通信技术有限公司 A kind of intelligent home control system and method
CN110444224B (en) * 2019-09-09 2022-05-27 深圳大学 Voice processing method and device based on generative countermeasure network
CN111311327A (en) * 2020-02-19 2020-06-19 平安科技(深圳)有限公司 Service evaluation method, device, equipment and storage medium based on artificial intelligence
CN111554279A (en) * 2020-04-27 2020-08-18 天津大学 Multi-mode man-machine interaction system based on Kinect
CN111933114B (en) * 2020-10-09 2021-02-02 深圳市友杰智新科技有限公司 Training method and use method of voice awakening hybrid model and related equipment

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105609116A (en) * 2015-12-23 2016-05-25 东南大学 Speech emotional dimensions region automatic recognition method
US20170365277A1 (en) * 2016-06-16 2017-12-21 The George Washington University Emotional interaction apparatus
US20190074028A1 (en) * 2017-09-01 2019-03-07 Newton Howard Real-time vocal features extraction for automated emotional or mental state assessment
CN108091323A (en) * 2017-12-19 2018-05-29 想象科技(北京)有限公司 For identifying the method and apparatus of emotion from voice
CN108597541A (en) * 2018-04-28 2018-09-28 南京师范大学 A kind of speech-emotion recognition method and system for enhancing indignation and happily identifying
US20210104245A1 (en) * 2019-06-03 2021-04-08 Amazon Technologies, Inc. Multiple classifications of audio data
CN112216307A (en) * 2019-07-12 2021-01-12 华为技术有限公司 Speech emotion recognition method and device
CN111966824A (en) * 2020-07-11 2020-11-20 天津大学 Text emotion recognition method based on emotion similarity attention mechanism
CN113192537A (en) * 2021-04-27 2021-07-30 深圳市优必选科技股份有限公司 Awakening degree recognition model training method and voice awakening degree obtaining method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
UNRELIABLE CAT: "Audio Data Enhancement by Using Python", XP009540606, Retrieved from the Internet <URL:https://baijiahao.baidu.com/s?id=1664050947493095222&wfr=spider&for=pc> *
ZHANG ZIXING; WU BINGWEN; SCHULLER BJORN: "Attention-augmented End-to-end Multi-task Learning for Emotion Prediction from Speech", ICASSP 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), IEEE, 12 May 2019 (2019-05-12), pages 6705 - 6709, XP033565426, DOI: 10.1109/ICASSP.2019.8682896 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116312513A (en) * 2023-02-13 2023-06-23 陕西省君凯电子科技有限公司 Intelligent voice control system
CN117058597A (en) * 2023-10-12 2023-11-14 清华大学 Dimension emotion recognition method, system, equipment and medium based on audio and video
CN117058597B (en) * 2023-10-12 2024-01-05 清华大学 Dimension emotion recognition method, system, equipment and medium based on audio and video

Also Published As

Publication number Publication date
CN113192537A (en) 2021-07-30
CN113192537B (en) 2024-04-09

Similar Documents

Publication Publication Date Title
US10956771B2 (en) Image recognition method, terminal, and storage medium
WO2021036644A1 (en) Voice-driven animation method and apparatus based on artificial intelligence
WO2021135577A9 (en) Audio signal processing method and apparatus, electronic device, and storage medium
WO2022227507A1 (en) Wake-up degree recognition model training method and speech wake-up degree acquisition method
CN111402866B (en) Semantic recognition method and device and electronic equipment
CN110263131B (en) Reply information generation method, device and storage medium
CN108494947B (en) Image sharing method and mobile terminal
CN112820299B (en) Voiceprint recognition model training method and device and related equipment
CN110830368A (en) Instant messaging message sending method and electronic equipment
US11830501B2 (en) Electronic device and operation method for performing speech recognition
CN114333774B (en) Speech recognition method, device, computer equipment and storage medium
CN113744736B (en) Command word recognition method and device, electronic equipment and storage medium
CN111292727B (en) Voice recognition method and electronic equipment
CN115691486A (en) Voice instruction execution method, electronic device and medium
WO2023246558A1 (en) Semantic understanding method and apparatus, and medium and device
CN116860913A (en) Voice interaction method, device, equipment and storage medium
CN114462580B (en) Training method of text recognition model, text recognition method, device and equipment
CN111597823B (en) Method, device, equipment and storage medium for extracting center word
CN110868634A (en) Video processing method and electronic equipment
CN113707132B (en) Awakening method and electronic equipment
US12118983B2 (en) Electronic device and operation method thereof
CN113535926B (en) Active dialogue method and device and voice terminal
CN114155859B (en) Detection model training method, voice dialogue detection method and related equipment
CN109829167B (en) Word segmentation processing method and mobile terminal
CN115910051A (en) Audio data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21938967

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21938967

Country of ref document: EP

Kind code of ref document: A1