
CN115240657A - Voice processing method, device, equipment and storage medium - Google Patents

Voice processing method, device, equipment and storage medium

Info

Publication number
CN115240657A
CN115240657A
Authority
CN
China
Prior art keywords
emotion
voice
cluster
clustering
recognition result
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210892174.8A
Other languages
Chinese (zh)
Inventor
张欢韵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Huace Huihong Technology Co ltd
Original Assignee
Shenzhen Huace Huihong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Huace Huihong Technology Co ltd filed Critical Shenzhen Huace Huihong Technology Co ltd
Priority to CN202210892174.8A priority Critical patent/CN115240657A/en
Publication of CN115240657A publication Critical patent/CN115240657A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/06 - Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 - Training
    • G10L15/08 - Speech classification or search
    • G10L15/16 - Speech classification or search using artificial neural networks
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 - Speech or voice analysis techniques specially adapted for particular use
    • G10L25/51 - Speech or voice analysis techniques specially adapted for comparison or discrimination
    • G10L25/63 - Speech or voice analysis techniques specially adapted for estimating an emotional state

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Child & Adolescent Psychology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hospice & Palliative Care (AREA)
  • Psychiatry (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A voice processing method, device, equipment and storage medium. The method includes: acquiring to-be-processed voice data of a target application scene and determining an emotion recognition result set of the to-be-processed voice data based on a speech emotion recognition model, the emotion recognition result set comprising the emotion category to which each voice segment of the to-be-processed voice data belongs; if the emotion recognition result set does not satisfy the scene rules of the target application scene (which include predefined proportion rules for the reference emotion categories), clustering the voice segments to obtain a plurality of clusters, where the voice segments in the same cluster correspond to one emotion category; determining, from the plurality of clusters, a first cluster whose emotion category does not match the target application scene, and acquiring an emotion marking label corresponding to the first cluster; and determining an emotion evaluation result of the to-be-processed voice data based on the emotion marking label corresponding to the first cluster and the emotion recognition result set. The method and the device can improve the accuracy of speech emotion recognition.

Description

Voice processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing speech.
Background
At present, most speech emotion analysis relies on a general model: a number of speakers are recruited to produce utterances in various emotions, and the collected recordings are used to train a model that is then applied to recognize the speech emotion of other people. When recognition accuracy is low, some researchers combine other information about the speaker (such as facial expressions or the transcript of the speech) to obtain a result, but this approach is costly. Moreover, each person has his or her own way of expressing emotion in speech; sometimes a speaker is simply loud, yet a general model will, with high probability, classify the utterance as anger or excitement. Therefore, how to improve the accuracy of speech emotion recognition is an urgent problem to be solved.
Disclosure of Invention
The embodiments of the present application provide a voice processing method, device, equipment and storage medium, which can improve the accuracy of speech emotion recognition.
In one aspect, an embodiment of the present application provides a speech processing method, where the method includes:
acquiring voice data to be processed of a target application scene, and determining an emotion recognition result set of the voice data to be processed based on a voice emotion recognition model, wherein the emotion recognition result set comprises emotion types to which each voice fragment of a plurality of voice fragments of the voice data to be processed belongs;
if the emotion recognition result set does not satisfy the scene rule of the target application scene, clustering the plurality of voice segments to obtain a plurality of clusters, wherein the scene rule comprises predefined proportion rules for the reference emotion categories, and the voice segments in the same cluster correspond to one emotion category;
determining a first cluster with an emotion category which is not matched with the target application scene from the plurality of clusters, and acquiring an emotion marking label corresponding to the first cluster;
and determining the emotion evaluation result of the voice data to be processed based on the emotion marking label corresponding to the first clustering cluster and the emotion recognition result set.
In one aspect, an embodiment of the present application provides a speech processing apparatus, where the apparatus includes:
an acquisition unit, configured to acquire to-be-processed voice data of a target application scene and determine an emotion recognition result set of the to-be-processed voice data based on a speech emotion recognition model, wherein the emotion recognition result set comprises the emotion category to which each voice segment of a plurality of voice segments of the to-be-processed voice data belongs;
a processing unit, configured to cluster the plurality of voice segments to obtain a plurality of clusters if the emotion recognition result set does not satisfy the scene rules of the target application scene, wherein the scene rules comprise predefined proportion rules for the reference emotion categories, and the voice segments in the same cluster correspond to one emotion category;
the processing unit is further configured to determine a first cluster in which the emotion category is not matched with the target application scene from the plurality of clusters, and acquire an emotion label corresponding to the first cluster;
the processing unit is further configured to determine an emotion evaluation result of the to-be-processed voice data based on the emotion marking label corresponding to the first cluster and the emotion recognition result set.
In one aspect, the present application provides a computer device, where the computer device includes a processor, a communication interface, and a memory, where the processor, the communication interface, and the memory are connected to each other, where the memory stores a computer program, and the processor is configured to invoke the computer program to execute the voice processing method according to any one of the foregoing possible implementation manners.
In one aspect, the present application provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the speech processing method of any one of the possible implementations.
In one aspect, the present application further provides a computer program product, where the computer program product includes a computer program or a computer instruction, and the computer program or the computer instruction is executed by a processor to implement the steps of the speech processing method provided in the present application.
In an aspect, an embodiment of the present application further provides a computer program, where the computer program includes computer instructions, the computer instructions are stored in a computer-readable storage medium, a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the speech processing method provided in the embodiment of the present application.
In the embodiments of the present application, to-be-processed voice data of a target application scene can be obtained, and an emotion recognition result set of the to-be-processed voice data is determined based on a speech emotion recognition model, the emotion recognition result set comprising the emotion category to which each of a plurality of voice segments of the to-be-processed voice data belongs. If the emotion recognition result set does not satisfy the scene rules of the target application scene (which include predefined proportion rules for the reference emotion categories), the plurality of voice segments are clustered to obtain a plurality of clusters, where the voice segments in the same cluster correspond to one emotion category. A first cluster whose emotion category does not match the target application scene can then be determined from the plurality of clusters, an emotion marking label corresponding to the first cluster is obtained, and an emotion evaluation result of the to-be-processed voice data is determined based on the emotion marking label corresponding to the first cluster and the emotion recognition result set. With this method, whether the recognition result of the speech emotion recognition model is accurate can be judged by whether the emotion categories of the voice segments conform to objective scene rules, and a small amount of manual intervention is introduced only when the recognition result is inaccurate, thereby greatly improving the accuracy of speech emotion recognition.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings used in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a schematic system architecture diagram of a speech processing system according to an embodiment of the present application;
fig. 2 is a schematic flowchart of a speech processing method according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of another speech processing method according to an embodiment of the present application;
fig. 4 is a schematic diagram of a clustering effect provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a cluster center provided in an embodiment of the present application;
FIG. 6 is a flowchart illustrating another speech processing method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is to be understood that the described embodiments are only a part, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a schematic system structural diagram of a speech processing system according to an embodiment of the present application, as shown in fig. 1, the system includes a speech processing device 10 and a database 11, and the speech processing device 10 and the database 11 may be connected in a wired or wireless manner.
The speech processing device 10 may comprise one or more of a terminal and a server. That is, the voice processing method proposed in the embodiment of the present application may be executed by a terminal, may be executed by a server, or may be executed by both a terminal and a server capable of communicating with each other. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, a cloud function, cloud storage, network service, cloud communication, middleware service, domain name service, security service, content Delivery Network (CDN), big data and an artificial intelligence platform. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device, an intelligent appliance, a vehicle-mounted terminal, and the like.
The database 11 may be a local database of the speech processing device 10, a cloud database accessible by the speech processing device 10, or a local database of another computer device. The database 11 may be used to store voice data, and may specifically store voice data to be processed.
The speech processing device 10 may be equipped with a speech emotion recognition model, and the interaction process between the speech emotion recognition model and the database 11 is as follows:
the method comprises the steps of obtaining voice data to be processed of a target application scene from a database 11, determining an emotion recognition result set of the voice data to be processed based on a carried voice emotion recognition model, wherein the emotion recognition result set comprises emotion types of each voice segment in a plurality of voice segments of the voice data to be processed, and specifically, the emotion recognition result set is obtained by performing emotion recognition processing on each voice segment through the voice emotion recognition model. And if the emotion recognition result set does not meet the scene rule of the target application scene, clustering the voice segments to obtain a plurality of clustering clusters, wherein the scene rule comprises predefined proportion rules of all reference emotion categories, and the voice segments in the same clustering cluster correspond to one emotion category. The first clustering cluster with the unmatched emotion category and the unmatched target application scene can be determined from the multiple clustering clusters, at this time, the voice emotion recognition model can be considered to be incapable of accurately recognizing the emotion category to which each voice fragment in the first clustering cluster belongs, then, the emotion marking label corresponding to the first clustering cluster is determined through manual marking, and the emotion marking label can be used for indicating the real emotion category of each voice fragment in the first clustering cluster. And determining an emotion evaluation result of the voice data to be processed based on the emotion marking label and the emotion recognition result set corresponding to the first clustering cluster.
In an embodiment, the requesting terminal may send the voice data to be processed to the voice processing apparatus 10, when determining that the emotion recognition result set of the voice data to be processed satisfies the scene rule of the target application scenario, the voice processing apparatus 10 may determine an emotion evaluation result of the voice data to be processed by using the emotion recognition result set of the voice data to be processed, and when determining that the emotion recognition result set of the voice data to be processed does not satisfy the scene rule of the target application scenario, may determine an emotion evaluation result of the voice data to be processed by using the emotion marking tag and the emotion recognition result set corresponding to the first cluster. After obtaining the emotion evaluation result of the voice data to be processed, the voice processing device 10 may return the emotion evaluation result to the request terminal, which is beneficial to improving the accuracy of the voice emotion evaluation.
Existing methods cannot meet the speech emotion recognition needs of different listeners or different conversation scenes. For example, for the same piece of voice data (an interview recording), different listeners (human resources staff) may judge the candidate's emotion differently; and for the analysis of sales-call recordings, the speaker generally talks louder and faster (to attract the user's attention as much as possible), so a general model classifies such speech as anger or excitement with high probability. In this scheme, whether the emotion categories of the voice segments in the to-be-processed voice data conform to objective scene rules is judged, which meets the speech emotion recognition needs of different conversation scenes; in addition, a small amount of manual intervention can be introduced on top of the model's recognition, namely manually annotating the emotion of the first cluster, which meets the speech emotion recognition needs of different listeners and greatly improves the accuracy of speech emotion recognition.
Referring to fig. 2, fig. 2 is a schematic flowchart illustrating a voice processing method according to an embodiment of the present application. The method can be applied to the voice processing device 10 in the voice processing system, and the method comprises the following steps:
s201, obtaining voice data to be processed of a target application scene, and determining an emotion recognition result set of the voice data to be processed based on a voice emotion recognition model, wherein the emotion recognition result set comprises emotion types of each voice fragment in a plurality of voice fragments of the voice data to be processed.
The voice processing scheme provided by the present application is applicable to scenes that require the analysis of long recordings (rather than single-sentence analysis), such as interview speech analysis, the analysis of recordings from follow-up or undercover visits, and the analysis of recordings from telemarketing or customer service calls.
The target application scenario refers to a dialog scenario in which the interlocutor is located in the to-be-processed voice data, and the dialog scenario may include an interview scenario, a sales scenario, a customer service scenario, a conference scenario, and the like, which is not limited in the present application. The target application scenario can be determined by the dialogue content in the voice data to be processed, for example, the interview scenario generally has self introduction, and the sales scenario generally focuses on the products being sold.
The to-be-processed voice data is the voice data on which speech emotion recognition needs to be performed. The to-be-processed voice data can be segmented according to a preset frame length, each resulting piece being one voice segment, thereby obtaining a plurality of voice segments of the to-be-processed voice data. In an embodiment, the to-be-processed voice data may first be preprocessed; the preprocessing may specifically be pre-emphasis, which emphasizes the high-frequency part of the speech signal and increases the high-frequency resolution, and the pre-emphasized voice data is then segmented to obtain the plurality of voice segments.
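As a non-limiting illustration, the pre-emphasis and fixed-length segmentation described above may be sketched as follows; the pre-emphasis coefficient of 0.97, the frame length in seconds and the function names are illustrative assumptions, not values specified by the embodiment:

```python
import numpy as np

def preemphasize(signal: np.ndarray, alpha: float = 0.97) -> np.ndarray:
    """Boost the high-frequency part of the signal: y[n] = x[n] - alpha * x[n-1]."""
    return np.append(signal[0], signal[1:] - alpha * signal[:-1])

def split_into_segments(signal: np.ndarray, sample_rate: int, frame_seconds: float = 3.0):
    """Cut the (pre-emphasized) speech into fixed-length voice segments."""
    frame_len = int(frame_seconds * sample_rate)
    return [signal[i:i + frame_len] for i in range(0, len(signal), frame_len)]
```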
The speech emotion recognition model is a general model for speech emotion recognition tasks. In one implementation, a limited set of category boundaries may be determined, i.e., it is assumed that the emotions of all people can be roughly covered by N (a positive integer) emotion categories; for example, a conventional set of collected emotion categories is {anger, fear/stress, happiness, neutral, sadness/grievance, surprise/excitement}, and the set can of course be larger and finer-grained, but it remains finite. Training data may be collected from a limited number M of people (e.g., 6 speakers, including 3 men and 3 women), each providing speech samples for the N emotion categories. The collected speech samples and their corresponding emotion categories can be used to train a classification model, which may be an MLP (Multi-Layer Perceptron), an SVM (Support Vector Machine), an LSTM (Long Short-Term Memory network), a CNN (Convolutional Neural Network), and the like. After training, the resulting classification model is used as the speech emotion recognition model. In addition, the speech emotion recognition model may also be a general model that performs speech emotion recognition by combining other information about the speaker; the present application does not limit this.
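As a non-limiting sketch, such a general classification model could be trained, for example, with scikit-learn's MLPClassifier on fixed-length per-sample feature vectors; the network size, the validation split and the feature extraction are illustrative assumptions:

```python
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

def train_emotion_model(X, y):
    """X: (num_samples, feature_dim) speech features; y: emotion labels such as
    'anger', 'fear/stress', 'happiness', 'neutral', 'sadness/grievance', 'surprise/excitement'."""
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    model = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500, random_state=0)
    model.fit(X_train, y_train)
    print("validation accuracy:", model.score(X_val, y_val))
    return model
```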
The speech emotion recognition model can be used to determine the emotion category to which each of the plurality of voice segments of the to-be-processed voice data belongs, and the emotion recognition result set of the to-be-processed voice data can be formed from those emotion categories. For example, after a long piece of speech (hereinafter referred to as long speech A) is preprocessed, it may be split into a plurality of sentences (each sentence being one voice segment): sentence 1, sentence 2, sentence 3, sentence 4, sentence 5, sentence 6, sentence 7, sentence 8, …, sentence m. Emotion recognition is performed on each sentence with the speech emotion recognition model to obtain the emotion category to which each sentence belongs, and the emotion recognition result set A of the speaker corresponding to long speech A may finally be: {1: "neutral"; 2: "neutral"; 3: "neutral"; 4: "neutral"; 5: "excitement"; 6: "neutral"; 7: "tension"; 8: "neutral"; …; m: "tension"}, where 1, 2, …, m identify sentence 1, sentence 2, …, sentence m. It can be seen that emotion recognition result set A contains the emotion category to which each sentence belongs.
S202, if the emotion recognition result set does not meet the scene rule of the target application scene, clustering the voice fragments to obtain a plurality of clustering clusters, wherein the scene rule comprises predefined proportion rules of all reference emotion categories, and the voice fragments in the same clustering cluster correspond to one emotion category.
The scene rules of the target application scene may include predefined proportion rules for the reference emotion categories, and the reference emotion categories may specifically include one or more of neutral emotion, positive emotion and negative emotion. For example, an interview scene should mostly contain neutral or positive emotion, so the scene rule of the interview scene may be that the proportion of negative emotion is less than 20%; a sales scene should mostly contain positive emotion, so the scene rule of the sales scene may be that the proportion of positive emotion is greater than 70%; in a customer service scene the customer may show a certain proportion of negative emotion while the agent should show only neutral or positive emotion, so the scene rule of the customer service scene may be that the proportion of the customer's negative emotion may exceed 20% while the proportion of the agent's negative emotion is 0%; a meeting scene should mostly contain neutral or positive emotion, so the scene rule of the meeting scene may be that the combined proportion of neutral and positive emotion is greater than 90%.
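As a non-limiting sketch, the predefined proportion rules may be encoded per scene as constraints on reference emotion categories; the structure below is an assumption for illustration, with the thresholds taken from the examples above:

```python
# Each rule: (reference emotion categories, comparison, threshold on the combined segment proportion).
SCENE_RULES = {
    "interview": [(("negative",), "<", 0.20)],
    "sales":     [(("positive",), ">", 0.70)],
    "meeting":   [(("neutral", "positive"), ">", 0.90)],
    # A customer service scene could carry separate rules for the customer side and the agent side,
    # e.g. the agent's negative-emotion proportion must equal 0.
}
```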
Neutral emotion refers to an emotion between positive and negative, such as calm, surprise. Positive emotions refer to a positive mood, such as happy, excited, happy. Negative emotions refer to a negative mood such as depression, fear, anger. The emotion classification that can be recognized by the speech emotion recognition model can be classified into any reference emotion classification, i.e., into one of neutral emotion, positive emotion and negative emotion.
In a feasible manner, the reference emotion categories may include one or more emotion categories that can be recognized by the speech emotion recognition model, for example, the scene rule of the target application scene may be more than 10% of "excited" and more than 20% of "surprised".
In one embodiment, obtaining the voice segment proportions corresponding to the different emotion categories in the emotion recognition result set includes: determining, according to the emotion category to which each voice segment in the emotion recognition result set belongs, the voice segments corresponding to the same emotion category, and calculating the voice segment proportion of that emotion category from the number of voice segments corresponding to it and the total number of voice segments in the emotion recognition result set, i.e., voice segment proportion of an emotion category = (number of voice segments corresponding to the emotion category / total number of voice segments in the emotion recognition result set) × 100%. In this way, the voice segment proportions corresponding to the different emotion categories in the emotion recognition result set can be obtained. For example, for emotion recognition result set A: {1: "neutral"; 2: "neutral"; 3: "neutral"; 4: "neutral"; 5: "excitement"; 6: "neutral"; 7: "tension"; 8: "neutral"; 9: "tension"}, the proportion of voice segments corresponding to "neutral" is 67% = 6/9 × 100%, the proportion corresponding to "excitement" is 11% = 1/9 × 100%, and the proportion corresponding to "tension" is 22% = 2/9 × 100%.
It is further judged whether the voice segment proportions corresponding to the different emotion categories in the emotion recognition result set satisfy the proportion rules of the corresponding reference emotion categories. For example, if the target application scene is an interview scene, the proportion of negative emotion in the interview scene needs to be less than 20%; if the proportion of sentences identified as "anger" (a negative emotion) in emotion recognition result set A is greater than 20%, then the voice segment proportion corresponding to "anger" does not satisfy the proportion rule of the corresponding negative emotion. When the voice segment proportion corresponding to any emotion category does not satisfy the proportion rule of the corresponding reference emotion category, it can be determined that the emotion recognition result set does not satisfy the scene rule of the target application scene.
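Continuing the sketch above, the check of whether a result set satisfies the scene rules may look as follows; the mapping from model labels to reference emotion categories is an assumption for illustration:

```python
from collections import Counter

# Assumed mapping from model emotion labels to reference emotion categories.
REFERENCE_CATEGORY = {
    "neutral": "neutral", "surprise": "neutral",
    "happiness": "positive", "excitement": "positive",
    "anger": "negative", "tension": "negative", "sadness": "negative",
}

def category_proportions(result_set):
    """result_set: {segment_id: emotion_label}. Returns {reference category: proportion}."""
    counts = Counter(REFERENCE_CATEGORY[label] for label in result_set.values())
    total = len(result_set)
    return {cat: n / total for cat, n in counts.items()}

def satisfies_scene_rules(result_set, scene_rules):
    """scene_rules: list of (categories, comparison, threshold), as in SCENE_RULES above."""
    props = category_proportions(result_set)
    for categories, op, threshold in scene_rules:
        value = sum(props.get(c, 0.0) for c in categories)
        if op == "<" and not value < threshold:
            return False
        if op == ">" and not value > threshold:
            return False
    return True

# Example with result set A from the text and the interview rule (negative < 20%):
result_set_a = {1: "neutral", 2: "neutral", 3: "neutral", 4: "neutral", 5: "excitement",
                6: "neutral", 7: "tension", 8: "neutral", 9: "tension"}
print(satisfies_scene_rules(result_set_a, [(("negative",), "<", 0.20)]))  # "tension" is 2/9, about 22% -> False
```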
When it is determined that the emotion recognition result set does not satisfy the scene rule, the speech emotion recognition model can be considered unable to accurately recognize the emotion of the plurality of voice segments of the to-be-processed voice data. At this point, the plurality of voice segments can be clustered: voice segments whose speech features are highly similar are grouped into the same cluster, and voice segments whose speech features differ greatly are placed in different clusters, yielding a plurality of clusters. Because voice segments belonging to the same emotion category reflect the same emotional characteristics, the similarity between their speech features is far higher than the similarity between their features and those of voice segments belonging to other emotion categories; this is also why, in common speech emotion classification, speech of the same category can be grouped into the same class. The purpose of the clustering is therefore to group the voice segments belonging to the same emotion category into one cluster, so that the voice segments in the same cluster correspond to one emotion category, and at the same time to expose voice segments whose classification was inaccurate.
The speech features may reflect time-domain and/or frequency-domain characteristics of the speech data and may include, for example, timbre, pitch, formants, spectrum, brightness and roughness, as well as the spectral centroid, spectral plane, Mel-frequency cepstral coefficient (MFCC) features with different numbers of coefficients, the Mel spectrogram, the chromagram, root-mean-square energy, and the like. In addition, a speech feature may have different dimensionalities, i.e., it may be a one-dimensional feature or a two-dimensional feature; the present application does not limit the specific form of the speech features.
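As a non-limiting sketch, such per-segment features may be extracted, for example, with the librosa library; the particular feature set and the use of frame-wise means to obtain a fixed-length vector are illustrative choices, not requirements of the embodiment:

```python
import numpy as np
import librosa

def segment_features(segment: np.ndarray, sr: int) -> np.ndarray:
    """Summarize one voice segment as a fixed-length feature vector (means over frames)."""
    mfcc = librosa.feature.mfcc(y=segment, sr=sr, n_mfcc=13)        # MFCC features
    centroid = librosa.feature.spectral_centroid(y=segment, sr=sr)  # spectral centroid
    chroma = librosa.feature.chroma_stft(y=segment, sr=sr)          # chromagram
    rms = librosa.feature.rms(y=segment)                            # root-mean-square energy
    return np.concatenate([f.mean(axis=1) for f in (mfcc, centroid, chroma, rms)])
```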
S203, determining a first cluster with the unmatched emotion category and the target application scene from the plurality of clusters, and acquiring an emotion marking label corresponding to the first cluster.
In an embodiment, the emotion category corresponding to any one of the multiple clustering clusters can be determined based on the characteristic that the voice segment in the same clustering cluster corresponds to one emotion category, that is, the emotion category corresponding to the voice segment in any one clustering cluster is taken as the emotion category corresponding to any one clustering cluster.
If the emotion category corresponding to a cluster is not a forward emotion category matched with the target application scene, it can be determined that the emotion category of the voice segments in that cluster does not conform to the objective rules of the target application scene, i.e., it is an emotion category that is unlikely to appear in that scene, and that cluster can be determined to be a first cluster whose emotion category does not match the target application scene. At this point manual intervention can be introduced: an emotion marking label corresponding to the first cluster is determined and used as the real emotion category of each voice segment in the first cluster. With this embodiment, the erroneous recognition results of the speech emotion recognition model can be corrected with only a small amount of manual intervention, and because the real emotion category of each voice segment in the first cluster is determined by manual annotation, the accuracy of the emotion evaluation result of the to-be-processed voice data is improved.
The forward emotion category matched with the target application scene refers to an emotion category which appears in the target application scene at a high probability. For example, if the interview scene should be mostly neutral or positive emotion, the positive emotion categories matched with the interview scene are neutral emotion and positive emotion; if most sales scenes are positive emotions, the positive emotion types matched with the sales scenes are the positive emotions; the client in the customer service scene may have a certain proportion of negative emotions, and the customer service should be all neutral or positive emotions, and the positive emotion types matched with the customer service scene are negative emotions, neutral emotions and positive emotions; if the conference scene should be mostly neutral or positive emotion, the positive emotion category matched with the conference scene is neutral emotion and positive emotion.
When the emotion category corresponding to any cluster can be classified into the forward emotion category matched with the target application scene, the emotion category corresponding to the cluster can be considered as the forward emotion category matched with the target application scene, otherwise, when the emotion category corresponding to the cluster cannot be classified into the forward emotion category matched with the target application scene, the emotion category corresponding to the cluster can be considered as not the forward emotion category matched with the target application scene.
And S204, determining an emotion evaluation result of the voice data to be processed based on the emotion marking label and the emotion recognition result set corresponding to the first clustering cluster.
In an embodiment, the emotion category to which each voice segment in the first cluster belongs in the emotion recognition result set may be modified to the emotion marking label corresponding to the first cluster, yielding a modified emotion recognition result set. The emotion evaluation result of the to-be-processed voice data is then determined from a set of scoring rules and the modified emotion recognition result set. For example, a score may be set for each emotion category, each voice segment is scored according to the emotion category to which it belongs in the modified emotion recognition result set, and the sum of the scores of all voice segments is used as the emotion score of the to-be-processed voice data, which is the emotion evaluation result. The scores of positive emotions (e.g., excitement, happiness) may be relatively high and the scores of negative emotions (e.g., sadness, grievance, fear) relatively low; a higher emotion score indicates that the to-be-processed voice data tends to express positive emotion, and a lower emotion score indicates a tendency toward negative emotion. This is only an example and other scoring methods are possible; the present application is not limited in this respect. Determining the emotion evaluation result in this scoring manner shows which emotion category the to-be-processed voice data tends to express and facilitates analysis of the conversation atmosphere in the target application scene: for example, a conversation atmosphere that tends toward positive emotion is relaxed and pleasant, while one that tends toward negative emotion is tense and serious.
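As a non-limiting sketch of such a scoring rule (the score table and function names are assumptions for illustration):

```python
# Assumed per-category scores: positive emotions high, negative emotions low.
EMOTION_SCORE = {"excitement": 2, "happiness": 2, "neutral": 1, "surprise": 1,
                 "tension": -1, "sadness": -2, "anger": -2, "fear": -2}

def emotion_evaluation(result_set, first_cluster_ids=None, marking_label=None):
    """Replace the labels of segments in the first cluster with the manual marking label,
    then sum the per-segment scores to obtain the emotion score of the voice data."""
    corrected = dict(result_set)
    for seg_id in (first_cluster_ids or []):
        corrected[seg_id] = marking_label
    return sum(EMOTION_SCORE.get(label, 0) for label in corrected.values())
```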
By adopting the method, whether the speech emotion recognition model accurately carries out speech emotion recognition can be determined by whether the proportion of the speech segments corresponding to different emotion types in the speech data to be processed meets the proportion rule of the corresponding scene, and when the recognition result of the speech emotion recognition model is inaccurate, a little manual intervention is introduced, and the real emotion type of each speech segment in the first cluster is determined through manual marking, so that the accuracy of the speech emotion recognition is greatly improved.
Referring to fig. 3, fig. 3 is a flowchart illustrating another speech processing method according to an embodiment of the present application. The method can be applied to the voice processing device 10 in the voice processing system described above, and includes:
s301, obtaining voice data to be processed of a target application scene, and determining an emotion recognition result set of the voice data to be processed based on a voice emotion recognition model, wherein the emotion recognition result set comprises emotion types of each voice fragment in a plurality of voice fragments of the voice data to be processed.
In one embodiment, the speech data to be processed may be pre-emphasized, which may emphasize high frequency portions of the speech signal in the speech data to be processed and increase high frequency resolution. Endpoint detection may then be performed on the pre-emphasized voice data to be processed. The endpoint detection refers to detecting the start time and the end time of a voice signal in a section of voice data, and may be specifically implemented by an endpoint detection algorithm. The pre-emphasized voice data to be processed can be segmented (windowed and framed) based on the obtained end point detection result, that is, the starting time and the ending time of the voice signal are used as segmentation points, and the voice data between the starting time and the ending time is used as a voice segment, so that a plurality of voice segments of the voice data to be processed are obtained, the voice segments are effective voice data in the voice data to be processed, and silent segment noise of the voice data to be processed is eliminated.
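As a non-limiting sketch, a very simple energy-threshold form of endpoint detection could look as follows; practical systems typically use more robust voice activity detection, and the frame length and threshold ratio below are illustrative assumptions:

```python
import numpy as np

def detect_endpoints(signal: np.ndarray, sr: int, frame_ms: int = 25, energy_ratio: float = 0.1):
    """Frames whose short-time energy exceeds a fraction of the peak frame energy
    are treated as speech; consecutive speech frames become (start, end) sample spans."""
    frame_len = int(sr * frame_ms / 1000)
    frames = [signal[i:i + frame_len] for i in range(0, len(signal) - frame_len, frame_len)]
    if not frames:
        return []
    energy = np.array([float(np.sum(f ** 2)) for f in frames])
    active = energy > energy_ratio * energy.max()
    spans, start = [], None
    for i, is_speech in enumerate(active):
        if is_speech and start is None:
            start = i * frame_len
        elif not is_speech and start is not None:
            spans.append((start, i * frame_len))
            start = None
    if start is not None:
        spans.append((start, len(signal)))
    return spans
```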
Further, the speech feature of each of the plurality of voice segments is determined; the speech feature may include one or more of frequency-domain features and time-domain features, which the present application does not limit. The speech feature of each voice segment is input into the speech emotion recognition model to obtain the emotion category to which that segment belongs, and the emotion recognition result set of the to-be-processed voice data is finally determined from the emotion categories of all the segments. For example, after a long piece of speech (hereinafter referred to as long speech A) is preprocessed, it may be split into a plurality of sentences: sentence 1, sentence 2, sentence 3, sentence 4, sentence 5, sentence 6, sentence 7, sentence 8, …, sentence m. Classifying each sentence with the speech emotion recognition model gives the emotion category to which each sentence belongs, and thus the emotion recognition result set A of the speaker corresponding to long speech A: {1: "neutral"; 2: "neutral"; 3: "neutral"; 4: "neutral"; 5: "excitement"; 6: "neutral"; 7: "tension"; 8: "neutral"; …; m: "tension"}. It can be seen that emotion recognition result set A contains the emotion category to which each sentence belongs.
S302, if the emotion recognition result set does not meet the scene rule of the target application scene, clustering the voice fragments to obtain a plurality of clustering clusters, wherein the scene rule comprises predefined proportion rules of all reference emotion categories, and the voice fragments in the same clustering cluster correspond to one emotion category.
In one embodiment, the voice segment proportions corresponding to the different emotion categories in the emotion recognition result set are obtained; if the voice segment proportions corresponding to all emotion categories satisfy the proportion rules of the corresponding reference emotion categories, it is determined that the emotion recognition result set satisfies the scene rules of the target application scene. In that case the speech emotion recognition model can be considered to have accurately recognized the emotion of the plurality of voice segments of the to-be-processed voice data, and the emotion score of the to-be-processed voice data is determined directly from a set of scoring rules and the emotion recognition result set. For example, a score may be set for each emotion category, each voice segment is scored according to the emotion category to which it belongs in the emotion recognition result set, and the sum of the scores of all voice segments is used as the emotion score of the to-be-processed voice data, which is the emotion evaluation result. The scores of positive emotions (e.g., excitement, happiness) may be relatively high and the scores of negative emotions (e.g., sadness, grievance, fear) relatively low; the higher the emotion score, the more the to-be-processed voice data tends to express positive emotion, and the lower the score, the more it tends to express negative emotion. This is only an example and other scoring methods are possible; the present application is not limited in this respect. Determining the emotion evaluation result in this scoring manner shows which emotion category the to-be-processed voice data tends to express and facilitates analysis of the conversation atmosphere in the target application scene.
In another embodiment, the voice segment proportions corresponding to the different emotion categories in the emotion recognition result set are obtained; if the voice segment proportion corresponding to any emotion category does not satisfy the proportion rule of the corresponding reference emotion category, it is determined that the emotion recognition result set does not satisfy the scene rules of the target application scene. For example, the target application scene is a meeting scene, and the combined proportion of neutral and positive emotion in the meeting scene needs to be greater than 90%. If, in emotion recognition result set A, the proportion of sentences identified as "neutral" (a neutral emotion) is 67%, the proportion identified as "tension" (a negative emotion) is 22%, and the proportion identified as "excitement" (a positive emotion) is 11%, then the combined proportion of neutral and positive emotion is 78%, which does not satisfy the corresponding proportion rule.
When it is determined that the emotion recognition result set does not meet the scene rule of the voice data to be processed, the voice feature of each voice fragment in the plurality of voice fragments can be determined, the voice feature and the target expected number of each voice fragment in the plurality of voice fragments are utilized, clustering processing is carried out on the plurality of voice fragments, at least one cluster is obtained, the number of the at least one cluster is the target expected number, namely the target expected number refers to the number of the cluster which is expected to be obtained through clustering processing.
Specifically, the speech features of the voice segments are processed based on a clustering algorithm and the target expected number to obtain a plurality of class clusters, and the voice segments whose speech features belong to the same class cluster are grouped into one cluster, thereby obtaining at least one cluster. The target expected number may be a first number, N × 2, where N is the number of emotion categories recognizable by the speech emotion recognition model. Taking the k-means algorithm as the clustering algorithm as an example: (1) select N × 2 initial cluster centers; (2) compute the distance between the speech feature of each voice segment and each of the N × 2 cluster centers, and assign the speech feature to the nearest cluster; (3) for each class cluster, compute the average of the speech features currently in the cluster and use it as the new cluster center; (4) repeat steps (2) to (3) until a termination condition is reached, where the termination condition may be that a preset upper limit on the number of iterations is reached or that the speech features in the clusters no longer change. When the termination condition is reached, N × 2 class clusters are obtained, and the voice segments whose speech features belong to the same class cluster are grouped into one cluster, giving N × 2 clusters, for example B: {1: [1,2,4,5,10]; 2: [3,8,9,13,15,16,26]; 3: [5,6,7,11,12,14]; …; N × 2: [x, x, x]}. As shown in fig. 4, which is an effect graph obtained by clustering the original data, A, B, C and D in fig. 4 are the effect graphs of the various clusters.
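As a non-limiting sketch, the clustering step may be implemented, for example, with scikit-learn's k-means, assuming one feature vector per voice segment; the function name and return format are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_segments(features: np.ndarray, n_emotions: int) -> dict:
    """features: (num_segments, feature_dim). Returns {cluster_id: [segment indices]}."""
    k = n_emotions * 2  # target expected number of clusters (the first number, N x 2)
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    return {c: np.where(labels == c)[0].tolist() for c in range(k)}
```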
Further, the total number of voice segments in each of the at least one cluster is obtained, and a cluster whose total number of voice segments is less than a preset total (for example, 5) is determined to be a second cluster. If the number of second clusters is less than a preset number, it can be considered that only a few clusters contain too few voice segments, for example only one or two of 20 clusters contain fewer than 5 voice segments; in this case, the maximum voice segment proportion corresponding to each cluster is determined based on the voice segment proportions corresponding to the different emotion categories in each of the at least one cluster.
Specifically, according to the emotion category to which each voice segment in each cluster of at least one cluster belongs, voice segments corresponding to the same emotion category are determined, and based on the number of the voice segments corresponding to the same emotion category and the total amount of the voice segments in each cluster, the voice segment occupation ratio corresponding to the same emotion category is calculated, that is, the voice segment occupation ratio corresponding to the same emotion category = (the number of the voice segments corresponding to the same emotion category/the total amount of the voice segments in each cluster) x 100%, so that the voice segment occupation ratios corresponding to different emotion categories in each cluster can be obtained. And determining the maximum voice segment ratio from the voice segment ratios corresponding to the different emotion types, and taking the maximum voice segment ratio as the maximum voice segment ratio corresponding to each cluster, namely taking the maximum voice segment ratio corresponding to one cluster as the voice segment ratio corresponding to the emotion type with the maximum ratio in the same cluster.
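As a non-limiting sketch, the emotion category with the largest share in a cluster and the corresponding maximum voice segment proportion may be computed as follows, assuming the segment identifiers index into the emotion recognition result set:

```python
from collections import Counter

def dominant_emotion(cluster_segment_ids, result_set: dict):
    """Return (emotion category with the largest share in the cluster, its proportion)."""
    counts = Counter(result_set[seg_id] for seg_id in cluster_segment_ids)
    emotion, n = counts.most_common(1)[0]
    return emotion, n / len(cluster_segment_ids)
```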
If the maximum voice segment proportion corresponding to each cluster is greater than a preset proportion (for example, 80%), it indicates that the voice segments in each cluster generally reflect the same emotional characteristic and that the clustering result of each cluster is accurate. An example of judging whether a clustering result is accurate: for cluster 1, the emotion recognition results of sentence 1, sentence 2, sentence 3, sentence 4 and sentence 5 are obtained from emotion recognition result set A; if 83% of the sentences belong to "neutral" and 17% belong to "excitement", and the preset proportion b% is 80%, then since 83% is greater than 80%, the clustering result of cluster 1 is accurate. If the maximum voice segment proportion corresponding to every one of the at least one cluster is greater than the preset proportion, the clustering results can be considered accurate, and the clusters other than the second clusters among the at least one cluster are determined to be the plurality of clusters.
If the number of second clusters is greater than or equal to the preset number, it can be considered that too many clusters contain too few voice segments, for example 15 of 20 clusters contain fewer than 5 voice segments. In this case, the target expected number may be updated to a second number (smaller than the first number), and the step of clustering the plurality of voice segments based on the speech feature of each voice segment and the target expected number to obtain at least one cluster is performed again, so as to obtain the plurality of clusters. In this way, clusters containing too few voice segments due to noisy data can be avoided.
If the maximum voice fragment occupation ratio corresponding to one cluster is smaller than or equal to the preset occupation ratio in at least one cluster, the clustering result of the cluster is not accurate, the target expected number is updated to be a third number (larger than the first number), and clustering processing is performed on the plurality of voice fragments based on the voice characteristics of each voice fragment and the target expected number to obtain at least one cluster. The number of the voice fragments in each cluster can be reduced by increasing the number of the clusters, so that the clustering result of the clusters is accurate.
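As a non-limiting sketch, the adaptive adjustment of the target expected number described in the last few paragraphs could be organized as a loop; the step size of 2, the round limit and the assumption that segment identifiers are row indices into the feature matrix and the result set are all illustrative:

```python
import numpy as np
from collections import Counter
from sklearn.cluster import KMeans

def cluster_with_adaptive_k(features, result_set, n_emotions,
                            min_segments=5, max_small=3, min_purity=0.80, max_rounds=5):
    """Decrease the target expected number when too many clusters are tiny,
    increase it when some cluster mixes emotion categories."""
    k = n_emotions * 2                                   # first number
    clusters = {}
    for _ in range(max_rounds):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        clusters = {c: np.where(labels == c)[0] for c in range(k)}
        small = [c for c, ids in clusters.items() if len(ids) < min_segments]
        if len(small) >= max_small:
            k = max(n_emotions, k - 2)                   # second number: fewer clusters
            continue
        purities = [Counter(result_set[i] for i in ids).most_common(1)[0][1] / len(ids)
                    for c, ids in clusters.items() if c not in small]
        if purities and min(purities) <= min_purity:
            k += 2                                       # third number: more clusters
            continue
        return {c: ids.tolist() for c, ids in clusters.items() if c not in small}
    return {c: ids.tolist() for c, ids in clusters.items()}
```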
S303, determining the emotion category corresponding to the maximum speech segment proportion based on the speech segment proportion corresponding to different emotion categories in any one of the plurality of clusters, and determining the emotion category corresponding to the maximum speech segment proportion as the emotion category corresponding to any one of the clusters.
Because the emotion category corresponding to the maximum voice segment proportion of each of the plurality of clusters is the emotional characteristic generally reflected by that cluster, in an embodiment the maximum voice segment proportion corresponding to any cluster can be determined according to the emotion category to which each voice segment in that cluster belongs; the maximum voice segment proportion of a cluster is the voice segment proportion corresponding to the emotion category with the largest share in that cluster, and the emotion category corresponding to the maximum voice segment proportion can then be obtained. For example, for emotion recognition result set A: {1: "neutral"; 2: "neutral"; 3: "neutral"; 4: "neutral"; 5: "excitement"; 6: "neutral"; 7: "tension"; 8: "neutral"; 9: "tension"}, the voice segment proportion corresponding to "neutral" is 67%, the proportion corresponding to "excitement" is 11%, and the proportion corresponding to "tension" is 22%; the maximum voice segment proportion is 67%, and the emotion category corresponding to it is "neutral". Finally, the emotion category corresponding to the maximum voice segment proportion is used as the emotion category corresponding to that cluster, i.e., the emotion category corresponding to the voice segments in that cluster.
And S304, acquiring the forward emotion category matched with the target application scene.
The forward emotion category matched with the target application scene refers to an emotion category which appears in the target application scene at a high probability, for example, if the interview scene should be mostly neutral or positive emotion, the forward emotion categories matched with the interview scene are neutral emotion and positive emotion; if most sales scenes are positive emotions, the positive emotion types matched with the sales scenes are the positive emotions; the client in the customer service scene may have a certain proportion of negative emotions, and the customer service should be all neutral or positive emotions, and the positive emotion types matched with the customer service scene are negative emotions, neutral emotions and positive emotions; if the conference scene should be mostly neutral or positive emotion, the positive emotion category matched with the conference scene is neutral emotion and positive emotion.
S305, if the emotion category corresponding to any cluster is not the forward emotion category matched with the target application scene, determining that any cluster is the first cluster with the emotion category not matched with the target application scene.
When the emotion category corresponding to any cluster can be classified into the forward emotion category matched with the target application scene, the emotion category corresponding to the cluster can be considered as the forward emotion category matched with the target application scene. Conversely, when the emotion classification corresponding to any cluster can not be classified into the forward emotion classification matched with the target application scene, the emotion classification corresponding to any cluster is considered not to be the forward emotion classification matched with the target application scene, and at this moment, the any cluster is the first cluster of which the emotion classification is not matched with the target application scene.
S306, obtaining the emotion marking label corresponding to the first clustering cluster, and determining an emotion evaluation result of the voice data to be processed based on the emotion marking label corresponding to the first clustering cluster and the emotion recognition result set.
Because the emotion category corresponding to the first cluster is one that is unlikely to appear in the target application scene, the emotion characteristics reflected by the speech segments in the first cluster do not conform to the objective scene rules, and the emotion categories to which these segments were assigned can be considered inaccurate; that is, the speech emotion recognition model failed to recognize the emotion of the segments in the first cluster correctly. In this case, the speech segment corresponding to the cluster center of the first cluster is obtained; this may be the speech segment whose speech feature is exactly the cluster center, or the speech segment whose speech feature is closest to the cluster center. For example, as shown in FIG. 5, the cluster centers of the clusters are marked by five-pointed stars.
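For illustration only, the following non-limiting sketch selects the speech segment whose speech feature is closest to a given cluster center; the reference to scikit-learn's KMeans cluster_centers_ attribute is an assumption about one possible clustering backend, not a statement about the disclosed implementation.

import numpy as np

def segment_nearest_center(segment_features, cluster_center):
    # segment_features: (num_segments, feature_dim) speech features of one cluster;
    # cluster_center: (feature_dim,) center produced by the clustering step
    # (e.g. one row of the cluster_centers_ attribute of scikit-learn's KMeans).
    distances = np.linalg.norm(np.asarray(segment_features) - np.asarray(cluster_center), axis=1)
    return int(np.argmin(distances))  # index of the segment sent for manual labeling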
The speech segment corresponding to the cluster center of the first cluster is sent to a user terminal, so that the user can label that segment and produce a labeling category (the emotion category annotated by the user). The user terminal obtains the labeling category of the segment and returns it to the speech processing device, and the speech processing device receives the labeling category and uses it as the emotion marking label corresponding to the first cluster.
Because the speech segments in the first cluster correspond to the same emotion category, the emotion marking label of the first cluster can be taken as the true emotion category of every speech segment in that cluster. In an embodiment, the emotion category of each speech segment belonging to the first cluster in the emotion recognition result set is modified into the emotion marking label of the first cluster, yielding a modified emotion recognition result set; the modified set is then processed according to a preset scoring rule to obtain an emotion score of the voice data to be processed, and this score is used as the emotion evaluation result. For example, a score may be set for each emotion category, each speech segment may be scored according to the emotion category it belongs to in the modified emotion recognition result set, and the sum of the scores of all segments may be used as the emotion score of the voice data to be processed. The scores of positive emotions (e.g., excitement, happiness) can be relatively high and the scores of negative emotions (e.g., sadness, grievance, fear) relatively low, so a larger emotion score indicates that the voice data to be processed tends to express positive emotions, while a lower emotion score indicates that it tends to express negative emotions. Determining the emotion evaluation result in this scoring manner reveals the emotion category that the voice data to be processed tends to express, which facilitates analysis of the conversation atmosphere in the target application scene.
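For illustration only, a minimal sketch of such a scoring rule is given below; the per-category scores and function names are hypothetical, since the text leaves the preset scoring rule open.

# Hypothetical per-category scores; the preset scoring rule itself is not fixed by the text.
EMOTION_SCORES = {"excitement": 2, "happiness": 2, "neutral": 1,
                  "tension": -1, "sadness": -2, "grievance": -2, "fear": -2}

def emotion_evaluation(emotion_results, first_cluster_ids, annotation_label):
    # Overwrite the recognized category of every segment in the first cluster with the
    # user-annotated label, then sum the per-segment scores as the emotion score.
    corrected = dict(emotion_results)
    for seg_id in first_cluster_ids:
        corrected[seg_id] = annotation_label
    return sum(EMOTION_SCORES.get(label, 0) for label in corrected.values())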
In summary, referring to fig. 6, fig. 6 is a schematic flowchart of another speech processing method according to an embodiment of the present application, where the method includes:
S601, obtaining voice data to be processed.
S602, using the general speech emotion recognition model to recognize the emotion category to which each speech segment in the voice data to be processed belongs, and recording these categories to generate an emotion recognition result set.
S603, judging whether the recognition results in the emotion recognition result set are normal, that is, whether they conform to the scene rule.
S604, if the recognition results in the emotion recognition result set are abnormal, that is, do not conform to the scene rule, performing clustering based on the speech features and dividing the plurality of speech segments of the voice data to be processed into a plurality of clusters according to the clustering result.
S605, traversing all the clusters and judging them against the emotion recognition result set: if, within one cluster, the proportion of recognition results belonging to the same emotion category exceeds a certain threshold, the clustering result of that cluster is considered correct; otherwise, the number of clusters is increased and the judgment continues until every cluster has an emotion category whose proportion exceeds the threshold.
S606, for each cluster judged to be abnormal, selecting the speech segment corresponding to its cluster center for manual judgment, and taking the manual judgment result as the emotion category to which each speech segment in that cluster belongs. Such an abnormal cluster is the first cluster described above.
S607, ending and returning the emotion evaluation result according to the preset rule. Specifically, the emotion evaluation result of the voice data to be processed is determined according to the preset rule, the manual judgment result and the emotion recognition result set. In addition, if the recognition results in the emotion recognition result set are normal, that is, conform to the scene rule, the flow also ends and the emotion evaluation result is returned according to the preset rule; in that case, the emotion recognition result set is processed directly according to the preset rule to obtain the emotion evaluation result of the voice data to be processed.
By adopting this method, whether the speech emotion recognition model has recognized speech emotion accurately can be determined according to whether the proportions of speech segments corresponding to different emotion categories in the voice data to be processed satisfy the proportion rules of the corresponding scene, and a small amount of manual intervention is introduced only when the model has not, which greatly improves the accuracy of speech emotion recognition. In addition, dividing the speech segments of the voice data to be processed into a plurality of clusters ensures that the result of each cluster is reliable, that is, the speech segments in each cluster tend to reflect the same emotion characteristic, so the emotion marking label of the first cluster is accurate when used as the true emotion category of each speech segment in that cluster.
It can be understood that the specific implementations of the present application involve related data such as the voice data to be processed. When the above embodiments of the present application are applied to specific products or technologies, user permission or consent needs to be obtained, and the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
The method of the embodiments of the present application is described in detail above, and in order to better implement the method of the embodiments of the present application, the following provides a device of the embodiments of the present application. Referring to fig. 7, fig. 7 is a schematic structural diagram of a speech processing apparatus according to an embodiment of the present application, in an embodiment, the speech processing apparatus 70 may include:
an obtaining unit 701, configured to obtain to-be-processed voice data of a target application scene, and determine an emotion recognition result set of the to-be-processed voice data based on a voice emotion recognition model, where the emotion recognition result set includes an emotion category to which each of a plurality of voice segments of the to-be-processed voice data belongs;
a processing unit 702, configured to perform clustering processing on the multiple voice segments to obtain multiple clustering clusters if the emotion recognition result set does not meet a scene rule of the target application scene, where the scene rule includes predefined ratio rules of each reference emotion category, and voice segments in a same clustering cluster correspond to one emotion category;
the processing unit 702 is further configured to determine, from the plurality of clusters, a first cluster whose emotion category is not matched with the target application scene, and acquire an emotion labeling label corresponding to the first cluster;
the processing unit 702 is further configured to determine an emotion evaluation result of the to-be-processed speech data based on the emotion marking label corresponding to the first cluster and the emotion recognition result set.
In an embodiment, the obtaining unit 701 is specifically configured to: acquiring the voice segment proportion corresponding to different emotion types in the emotion recognition result set;
the processing unit 702 is specifically configured to: if the speech segment occupation ratio corresponding to one emotion category in the emotion recognition result set does not satisfy the occupation ratio rule of the corresponding reference emotion category, determining that the emotion recognition result set does not satisfy the scene rule of the target application scene.
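For illustration only, the scene rule check described above could be sketched as follows, assuming each reference emotion category is given an allowed (minimum, maximum) range for its speech segment ratio; the concrete categories and ranges are assumptions.

# Hypothetical ratio rules for one scene: an allowed (min, max) range of the speech
# segment ratio for each reference emotion category.
RATIO_RULES = {"neutral": (0.3, 1.0), "excitement": (0.0, 1.0), "tension": (0.0, 0.3)}

def satisfies_scene_rule(emotion_results, ratio_rules):
    # Returns False as soon as one category's speech segment ratio leaves its allowed range.
    total = len(emotion_results)
    for category, (low, high) in ratio_rules.items():
        ratio = sum(1 for label in emotion_results.values() if label == category) / total
        if not low <= ratio <= high:
            return False
    return True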
In an embodiment, the processing unit 702 is specifically configured to: determining the emotion category corresponding to the maximum speech segment proportion based on the speech segment proportion corresponding to different emotion categories in any one of the plurality of clustering clusters, and determining the emotion category corresponding to the maximum speech segment proportion as the emotion category corresponding to the any one clustering cluster;
the obtaining unit 701 is specifically configured to: acquiring a forward emotion category matched with the target application scene;
the processing unit 702 is specifically configured to: if the emotion category corresponding to any cluster is not the forward emotion category matched with the target application scene, determining that any cluster is a first cluster with the emotion category not matched with the target application scene.
In an embodiment, the processing unit 702 is specifically configured to: determining the voice characteristics of each voice fragment in the plurality of voice fragments, and clustering the plurality of voice fragments based on the voice characteristics of each voice fragment and the target expected quantity to obtain at least one cluster, wherein the quantity of the at least one cluster is the target expected quantity;
the obtaining unit 701 is specifically configured to: acquiring the total amount of voice fragments of each cluster in the at least one cluster, and determining the cluster of which the corresponding total amount of voice fragments is less than a preset total amount as a second cluster;
the processing unit 702 is specifically configured to: if the number of the second cluster is smaller than the preset number, determining the maximum voice fragment occupation ratio corresponding to each cluster based on the voice fragment occupation ratios corresponding to different emotion categories in each cluster, wherein the maximum voice fragment occupation ratio corresponding to one cluster is the voice fragment occupation ratio corresponding to the emotion category with the maximum occupation ratio in the one cluster; and if the maximum voice fragment occupation ratio corresponding to each cluster in the at least one cluster is larger than a preset occupation ratio, determining the cluster in the at least one cluster except the second cluster as a plurality of clusters.
In one embodiment, the target expected number is a first number; the processing unit 702 is specifically configured to: if the number of the second clustering clusters is greater than or equal to the preset number, update the target expected number to a second number and perform the step of clustering the plurality of voice segments based on the voice features of each voice segment and the target expected number to obtain at least one clustering cluster, wherein the second number is smaller than the first number; or, if the maximum voice fragment occupation ratio corresponding to one cluster in the at least one cluster is smaller than or equal to the preset occupation ratio, update the target expected number to a third number and perform the step of clustering the plurality of voice segments based on the voice features of each voice segment and the target expected number to obtain at least one clustering cluster, wherein the third number is larger than the first number.
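For illustration only, the following non-limiting sketch combines the clustering and validity checks described above, using scikit-learn's KMeans as one possible clustering backend; the thresholds (minimum cluster size, allowed number of second clusters, preset occupation ratio, number of adjustment rounds) are assumptions.

import numpy as np
from sklearn.cluster import KMeans

def cluster_with_validation(features, emotion_results, k,
                            min_total=3, max_small=2, min_ratio=0.6, max_rounds=5):
    # features: (num_segments, feature_dim) speech features; emotion_results: list giving the
    # recognized category of each segment, in the same order as features. k is the target
    # expected number of clusters.
    features = np.asarray(features)
    clusters = []
    for _ in range(max_rounds):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        groups = [np.where(labels == c)[0] for c in range(k)]
        clusters = [g for g in groups if len(g) >= min_total]      # candidate clusters
        small = [g for g in groups if len(g) < min_total]          # "second" clusters
        if len(small) >= max_small:
            k = max(k - 1, 1)                                      # too many tiny clusters: use fewer clusters
            continue
        ratios = []
        for g in clusters:
            cats = [emotion_results[i] for i in g]
            ratios.append(max(cats.count(c) for c in set(cats)) / len(cats))
        if all(r > min_ratio for r in ratios):
            return clusters                                        # every cluster is dominated by one category
        k += 1                                                     # a cluster mixes categories: use more clusters
    return clusters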
In an embodiment, the processing unit 702 is specifically configured to: determining a voice segment corresponding to the cluster center of the first clustering cluster; sending the voice fragment corresponding to the cluster center to a user terminal so that the user terminal can acquire the labeling category labeled by a user aiming at the voice fragment corresponding to the cluster center; and receiving the label category sent by the user terminal, and taking the label category as an emotion label corresponding to the first clustering cluster.
In an embodiment, the processing unit 702 is specifically configured to: modifying the emotion types of the voice segments included in the first clustering cluster in the emotion recognition result set into emotion label labels corresponding to the first clustering cluster to obtain a modified emotion recognition result set; and processing the modified emotion recognition result set according to a preset scoring rule to obtain an emotion evaluation result of the voice data to be processed.
In an embodiment, the processing unit 702 is specifically configured to: performing endpoint detection on the voice data to be processed, and performing segmentation processing on the voice data to be processed based on an endpoint detection result to obtain a plurality of voice fragments of the voice data to be processed; determining a speech feature for each of the plurality of speech segments; inputting the voice characteristics of each voice segment into a voice emotion recognition model to obtain the emotion category to which each voice segment belongs; and determining an emotion recognition result set of the voice data to be processed based on the emotion category to which each voice segment belongs.
It can be understood that the functions of the functional units of the speech processing apparatus described in the embodiment of the present application can be specifically implemented according to the method in the foregoing method embodiment, and the specific implementation process of the method embodiment may refer to the description related to the foregoing method embodiment, which is not described herein again.
With this apparatus, whether a small amount of manual intervention needs to be introduced on top of the recognition performed by the speech emotion recognition model can be determined based on whether the emotion categories of the plurality of speech segments of the voice data to be processed conform to the objective scene rules, so that the accuracy of speech emotion recognition is greatly improved.
As shown in fig. 8, fig. 8 is a schematic structural diagram of a computer device provided in an embodiment of the present application, and an internal structure of the computer device 80 is shown in fig. 8, and includes: one or more processors 801, memory 802, and a communication interface 803. The processor 801, the memory 802 and the communication interface 803 may be connected by a bus 804 or in other ways, and the embodiment of the present application is exemplified by the bus 804.
The processor 801 (or CPU, central processing unit) is the computing core and control core of the computer device 80; it can parse various instructions in the computer device 80 and process various data of the computer device 80. For example, the CPU can parse a power on/off instruction sent by the user to the computer device 80 and control the computer device 80 to perform the power on/off operation; for another example, the CPU can transfer various types of interactive data between the internal structures of the computer device 80, and so on. The communication interface 803 may optionally include a standard wired interface or a wireless interface (e.g., Wi-Fi, a mobile communication interface, etc.), and transmits and receives data under the control of the processor 801. The memory 802 is a memory device in the computer device 80 for storing computer programs and data. It can be understood that the memory 802 may include the built-in memory of the computer device 80 and may also include an expansion memory supported by the computer device 80. The memory 802 provides storage space that stores the operating system of the computer device 80, which may include, but is not limited to: a Windows system, a Linux system, an Android system, an iOS system, and the like, which are not limited in this application. The processor 801 performs the following operations by running the computer program stored in the memory 802:
acquiring to-be-processed voice data of a target application scene, and determining an emotion recognition result set of the to-be-processed voice data based on a voice emotion recognition model, wherein the emotion recognition result set comprises emotion types to which each voice fragment of a plurality of voice fragments of the to-be-processed voice data belongs;
if the emotion recognition result set does not meet the scene rule of the target application scene, clustering the voice fragments to obtain a plurality of clustering clusters, wherein the scene rule comprises predefined ratio rules of all reference emotion categories, and the voice fragments in the same clustering cluster correspond to one emotion category;
determining a first cluster with an emotion category not matched with the target application scene from the plurality of clusters, and acquiring an emotion marking label corresponding to the first cluster;
and determining the emotion evaluation result of the voice data to be processed based on the emotion marking label corresponding to the first clustering cluster and the emotion recognition result set.
In an embodiment, the processor 801 is specifically configured to: acquiring the voice segment proportion corresponding to different emotion types in the emotion recognition result set; and if the speech segment occupation ratio corresponding to one emotion category does not meet the occupation ratio rule of the corresponding reference emotion category, determining that the emotion recognition result set does not meet the scene rule of the target application scene.
In an embodiment, the processor 801 is specifically configured to: determining the emotion category corresponding to the maximum speech segment proportion based on the speech segment proportion corresponding to different emotion categories in any one of the plurality of clustering clusters, and determining the emotion category corresponding to the maximum speech segment proportion as the emotion category corresponding to the any one clustering cluster; acquiring a forward emotion category matched with the target application scene; if the emotion category corresponding to any cluster is not the forward emotion category matched with the target application scene, determining that any cluster is a first cluster with the emotion category not matched with the target application scene.
In an embodiment, the processor 801 is specifically configured to: determining the voice characteristics of each voice fragment in the plurality of voice fragments, and clustering the plurality of voice fragments based on the voice characteristics of each voice fragment and the target expected number to obtain at least one cluster, wherein the number of the at least one cluster is the target expected number; acquiring the total amount of voice fragments of each cluster in the at least one cluster, and determining the cluster of which the corresponding total amount of voice fragments is less than a preset total amount as a second cluster; if the number of the second cluster is smaller than the preset number, determining the maximum voice segment occupation ratio corresponding to each cluster based on the voice segment occupation ratios corresponding to different emotion types in each cluster, wherein the maximum voice segment occupation ratio corresponding to one cluster is the voice segment occupation ratio corresponding to the emotion type with the maximum occupation ratio in the one cluster; and if the maximum voice fragment occupation ratio corresponding to each cluster in the at least one cluster is larger than a preset occupation ratio, determining the cluster in the at least one cluster except the second cluster as a plurality of clusters.
In one embodiment, the target desired number is a first number; the processor 801 is specifically configured to: if the number of the second clustering clusters is larger than or equal to the preset number, updating the target expected number to a second number, executing the step of clustering the plurality of voice segments based on the voice features of each voice segment and the target expected number to obtain at least one clustering cluster, wherein the second number is smaller than the first number; or if the maximum voice fragment occupation ratio corresponding to one cluster is smaller than or equal to the preset occupation ratio, updating the target expected number to a third number, and performing clustering processing on the voice fragments based on the voice features and the target expected number of each voice fragment to obtain at least one cluster, wherein the third number is larger than the first number.
In an embodiment, the processor 801 is specifically configured to: determining a voice segment corresponding to the cluster center of the first clustering cluster; sending the voice fragment corresponding to the cluster center to a user terminal so that the user terminal can acquire the labeling category of the voice fragment label corresponding to the cluster center; and receiving the label category sent by the user terminal, and taking the label category as an emotion label corresponding to the first clustering cluster.
In an embodiment, the processor 801 is specifically configured to: modifying the emotion types of the voice fragments included in the first cluster in the emotion recognition result set into emotion label corresponding to the first cluster to obtain a modified emotion recognition result set; and processing the modified emotion recognition result set according to a preset scoring rule to obtain an emotion evaluation result of the voice data to be processed.
In an embodiment, the processor 801 is specifically configured to: performing endpoint detection on the voice data to be processed, and performing segmentation processing on the voice data to be processed based on an endpoint detection result to obtain a plurality of voice segments of the voice data to be processed; determining a speech feature of each of the plurality of speech segments; inputting the voice characteristics of each voice segment into a voice emotion recognition model to obtain the emotion category to which each voice segment belongs; and determining an emotion recognition result set of the voice data to be processed based on the emotion category to which each voice fragment belongs.
In specific implementation, the processor 801, the memory 802, and the communication interface 803 described in this embodiment may execute an implementation manner described in a voice processing method provided in this embodiment, and may also execute an implementation manner described in a voice processing apparatus provided in this embodiment, which are not described herein again.
With this device, whether a small amount of manual intervention needs to be introduced on top of the recognition performed by the speech emotion recognition model can be determined based on whether the emotion categories of the plurality of speech segments of the voice data to be processed conform to the objective scene rules, so that the accuracy of speech emotion recognition is greatly improved.
An embodiment of the present application further provides a computer-readable storage medium, in which a computer program is stored, and when the computer program runs on a computer device, the computer device is caused to execute the voice processing method according to any one of the foregoing possible implementation manners. For specific implementation, reference may be made to the foregoing description, which is not repeated herein.
Embodiments of the present application further provide a computer program product, where the computer program product includes a computer program or computer instructions, and when the computer program or the computer instructions are executed by a processor, the steps of the speech processing method provided in the embodiments of the present application are implemented. For a specific implementation, reference may be made to the foregoing description, which is not repeated herein.
The embodiment of the present application further provides a computer program, where the computer program includes computer instructions, the computer instructions are stored in a computer-readable storage medium, a processor of a computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device executes the speech processing method provided in the embodiment of the present application. For specific implementation, reference may be made to the foregoing description, which is not repeated herein.
It should be noted that, for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art will recognize that the embodiments described in this specification are preferred embodiments and that acts or modules referred to are not necessarily required for this application.
Those skilled in the art will appreciate that all or part of the steps in the methods of the above embodiments may be implemented by related hardware instructed by a program, and the program may be stored in a computer-readable storage medium, which may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disc, and the like.
The above disclosure describes only some embodiments of the present application and is certainly not intended to limit the scope of the present application; equivalent changes made according to the claims of the present application therefore still fall within the scope covered by the present application.

Claims (10)

1. A method of speech processing, the method comprising:
acquiring to-be-processed voice data of a target application scene, and determining an emotion recognition result set of the to-be-processed voice data based on a voice emotion recognition model, wherein the emotion recognition result set comprises emotion types to which each voice fragment of a plurality of voice fragments of the to-be-processed voice data belongs;
if the emotion recognition result set does not meet the scene rule of the target application scene, clustering the voice fragments to obtain a plurality of clustering clusters, wherein the scene rule comprises predefined ratio rules of all reference emotion categories, and the voice fragments in the same clustering cluster correspond to one emotion category;
determining a first cluster with an emotion category which is not matched with the target application scene from the plurality of clusters, and acquiring an emotion marking label corresponding to the first cluster;
and determining the emotion evaluation result of the voice data to be processed based on the emotion marking label corresponding to the first clustering cluster and the emotion recognition result set.
2. The method of claim 1, further comprising:
acquiring the voice segment proportion corresponding to different emotion types in the emotion recognition result set;
and if the speech segment occupation ratio corresponding to one emotion type does not satisfy the occupation ratio rule of the corresponding reference emotion type in the emotion recognition result set, determining that the emotion recognition result set does not satisfy the scene rule of the target application scene.
3. The method of claim 1, wherein the determining a first cluster from the plurality of clusters that the corresponding emotion classification does not match the target application scenario comprises:
determining the emotion category corresponding to the maximum speech segment proportion based on the speech segment proportion corresponding to different emotion categories in any one of the multiple clustering clusters, and determining the emotion category corresponding to the maximum speech segment proportion as the emotion category corresponding to any one clustering cluster;
acquiring a forward emotion category matched with the target application scene;
if the emotion category corresponding to any cluster is not the forward emotion category matched with the target application scene, determining that any cluster is a first cluster with the emotion category not matched with the target application scene.
4. The method according to any one of claims 1-3, wherein said clustering said plurality of speech segments to obtain a plurality of clusters comprises:
determining the voice characteristics of each voice fragment in the plurality of voice fragments, and clustering the plurality of voice fragments based on the voice characteristics of each voice fragment and the target expected number to obtain at least one cluster, wherein the number of the at least one cluster is the target expected number;
acquiring the total amount of voice fragments of each cluster in the at least one cluster, and determining the cluster of which the corresponding total amount of voice fragments is less than a preset total amount as a second cluster;
if the number of the second cluster is smaller than the preset number, determining the maximum voice fragment occupation ratio corresponding to each cluster based on the voice fragment occupation ratios corresponding to different emotion categories in each cluster, wherein the maximum voice fragment occupation ratio corresponding to one cluster is the voice fragment occupation ratio corresponding to the emotion category with the maximum occupation ratio in the one cluster;
and if the maximum voice fragment occupation ratio corresponding to each cluster in the at least one cluster is larger than a preset occupation ratio, determining the cluster in the at least one cluster except the second cluster as a plurality of clusters.
5. The method of claim 4, wherein the target expected number is a first number; the method further comprises the following steps:
if the number of the second clustering clusters is larger than or equal to the preset number, updating the target expected number to a second number, executing the step of clustering the plurality of voice segments based on the voice features of each voice segment and the target expected number to obtain at least one clustering cluster, wherein the second number is smaller than the first number;
or,
if the maximum voice fragment occupation ratio corresponding to one cluster exists in the at least one cluster and is smaller than or equal to the preset occupation ratio, updating the target expected number to a third number, and executing the step of clustering the voice fragments based on the voice characteristics of each voice fragment and the target expected number to obtain at least one cluster, wherein the third number is larger than the first number.
6. The method according to any one of claims 1 to 3, wherein the obtaining of the emotion marking label corresponding to the first cluster comprises:
determining a voice segment corresponding to the cluster center of the first cluster;
sending the voice fragment corresponding to the cluster center to a user terminal so that the user terminal can acquire the labeling category of the voice fragment label corresponding to the cluster center;
and receiving the label type sent by the user terminal, and taking the label type as an emotion label corresponding to the first cluster.
7. The method according to any one of claims 1 to 3, wherein the determining the emotion evaluation result of the speech data to be processed based on the emotion marking label corresponding to the first cluster and the emotion recognition result set comprises:
modifying the emotion types of the voice segments included in the first clustering cluster in the emotion recognition result set into emotion label labels corresponding to the first clustering cluster to obtain a modified emotion recognition result set;
and processing the modified emotion recognition result set according to a preset scoring rule to obtain an emotion evaluation result of the voice data to be processed.
8. A speech processing apparatus, characterized in that the apparatus comprises:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring voice data to be processed of a target application scene and determining an emotion recognition result set of the voice data to be processed based on a voice emotion recognition model, and the emotion recognition result set comprises emotion types to which each voice fragment of a plurality of voice fragments of the voice data to be processed belongs;
the processing unit is used for clustering the plurality of voice segments to obtain a plurality of clustering clusters if the emotion recognition result set does not meet the scene rules of the target application scene, wherein the scene rules comprise predefined ratio rules of all reference emotion categories, and the voice segments in the same clustering cluster correspond to one emotion category;
the processing unit is further configured to determine a first cluster in which the emotion category is not matched with the target application scene from the plurality of clusters, and acquire an emotion label corresponding to the first cluster;
the processing unit is further configured to determine an emotion evaluation result of the to-be-processed voice data based on the emotion marking label corresponding to the first cluster and the emotion recognition result set.
9. A computer device, comprising a memory, a communication interface, and a processor, the memory, the communication interface, and the processor being interconnected; the memory stores a computer program that the processor calls upon for implementing the speech processing method according to any of claims 1-7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the speech processing method of any one of claims 1 to 7.
CN202210892174.8A 2022-07-27 2022-07-27 Voice processing method, device, equipment and storage medium Pending CN115240657A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210892174.8A CN115240657A (en) 2022-07-27 2022-07-27 Voice processing method, device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN115240657A true CN115240657A (en) 2022-10-25

Family

ID=83676544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210892174.8A Pending CN115240657A (en) 2022-07-27 2022-07-27 Voice processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115240657A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102637433A (en) * 2011-02-09 2012-08-15 富士通株式会社 Method and system for identifying affective state loaded in voice signal
CN102663432A (en) * 2012-04-18 2012-09-12 电子科技大学 Kernel fuzzy c-means speech emotion identification method combined with secondary identification of support vector machine
CN106297826A (en) * 2016-08-18 2017-01-04 竹间智能科技(上海)有限公司 Speech emotional identification system and method
JP2019020684A (en) * 2017-07-21 2019-02-07 日本電信電話株式会社 Emotion interaction model learning device, emotion recognition device, emotion interaction model learning method, emotion recognition method, and program
US20190220777A1 (en) * 2018-01-16 2019-07-18 Jpmorgan Chase Bank, N.A. System and method for implementing a client sentiment analysis tool
CN108962255A (en) * 2018-06-29 2018-12-07 北京百度网讯科技有限公司 Emotion identification method, apparatus, server and the storage medium of voice conversation
CN110047469A (en) * 2019-04-09 2019-07-23 平安科技(深圳)有限公司 Voice data Emotion tagging method, apparatus, computer equipment and storage medium
CN113421594A (en) * 2021-06-30 2021-09-21 平安科技(深圳)有限公司 Speech emotion recognition method, device, equipment and storage medium
CN114299995A (en) * 2021-12-29 2022-04-08 上海理工大学 Language emotion recognition method for emotion assessment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CARDONE B, DI MARTINO F, SENATORE S.: "Improving the emotion‐based classification by exploiting the fuzzy entropy in FCM clustering", INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, vol. 36, no. 11, 21 July 2021 (2021-07-21), pages 6944 - 6967 *
隋小芸, 朱廷劭, 汪静莹: "基于局部特征优化的语音情感识别", 中国科学院大学学报, vol. 34, no. 4, 15 July 2017 (2017-07-15), pages 431 - 438 *

Similar Documents

Publication Publication Date Title
CN112804400B (en) Customer service call voice quality inspection method and device, electronic equipment and storage medium
US10991366B2 (en) Method of processing dialogue query priority based on dialog act information dependent on number of empty slots of the query
CN110021308B (en) Speech emotion recognition method and device, computer equipment and storage medium
US10403282B2 (en) Method and apparatus for providing voice service
CN108305643B (en) Method and device for determining emotion information
CN108428446A (en) Audio recognition method and device
CN112530408A (en) Method, apparatus, electronic device, and medium for recognizing speech
CN111241357A (en) Dialogue training method, device, system and storage medium
CN108447471A (en) Audio recognition method and speech recognition equipment
CN107680584B (en) Method and device for segmenting audio
CN113223560A (en) Emotion recognition method, device, equipment and storage medium
CN110781329A (en) Image searching method and device, terminal equipment and storage medium
CN117352000A (en) Speech classification method, device, electronic equipment and computer readable medium
CN113658586A (en) Training method of voice recognition model, voice interaction method and device
CN117711376A (en) Language identification method, system, equipment and storage medium
CN117636911A (en) Intelligent light control method, equipment and medium
CN115240657A (en) Voice processing method, device, equipment and storage medium
CN116978367A (en) Speech recognition method, device, electronic equipment and storage medium
CN114049875A (en) TTS (text to speech) broadcasting method, device, equipment and storage medium
CN114519094A (en) Method and device for conversational recommendation based on random state and electronic equipment
CN114528851A (en) Reply statement determination method and device, electronic equipment and storage medium
CN114925159A (en) User emotion analysis model training method and device, electronic equipment and storage medium
CN115018633B (en) Service recommendation method and device, computer equipment and storage medium
CN118692445B (en) Speech translation method, device, equipment and computer readable storage medium
CN113836346B (en) Method, device, computing equipment and storage medium for generating abstract for audio file

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 2601, 2602, 2603, 2606, Zhongzhou building, No. 3088, Jintian Road, Gangxia community, Futian street, Futian District, Shenzhen, Guangdong 518000

Applicant after: Shenzhen Xiaoyudian Digital Technology Co.,Ltd.

Address before: 2601, 2602, 2603, 2606, Zhongzhou building, No. 3088, Jintian Road, Gangxia community, Futian street, Futian District, Shenzhen, Guangdong 518000

Applicant before: Shenzhen Huace Huihong Technology Co.,Ltd.