CN112966082B

CN112966082B - Audio quality inspection method, device, equipment and storage medium

Info

Publication number: CN112966082B
Application number: CN202110253354.7A
Authority: CN
Inventors: 赵情恩; 曾新贵; 熊新雷; 陈蓉; 肖岩
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2021-03-05
Filing date: 2021-03-05
Publication date: 2024-08-09
Anticipated expiration: 2041-03-05
Also published as: CN112966082A

Abstract

The application discloses an audio quality inspection method, device, equipment and storage medium, and relates to the field of artificial intelligence such as voice recognition, natural language processing and deep learning. One embodiment of the method comprises the following steps: acquiring dialogue audio, wherein the dialogue audio records dialogue between a client and customer service; performing voice separation on the dialogue audio to obtain a first audio and a second audio, wherein the first audio and the second audio only comprise one speaker; performing voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; performing role judgment on the first text and the second text, and selecting a text corresponding to customer service; and carrying out text content semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio. This embodiment enables a fully automated audio quality inspection.

Description

Audio quality inspection method, device, equipment and storage medium

Technical Field

The embodiment of the application relates to the field of computers, in particular to the field of artificial intelligence such as voice recognition, natural language processing, deep learning and the like, and particularly relates to an audio quality inspection method, device, equipment and storage medium.

Background

The main purpose of quality inspection in a call center is to assess the quality of customer service work and thereby raise the overall level and quality of customer service. The quality inspector is a standard post in a call center, responsible for supervising service, finding problems, summarizing experience, and proposing and promoting improvements.

In general, a quality inspector randomly samples from the massive volume of dialogue audio between customers and customer service, listens to the samples, and scores the customer service quality of each dialogue according to a given scoring rule template.

Disclosure of Invention

The embodiment of the application provides an audio quality inspection method, device, equipment and storage medium.

In a first aspect, an embodiment of the present application provides an audio quality inspection method, including: acquiring dialogue audio, wherein the dialogue audio records dialogue between a client and customer service; performing voice separation on the dialogue audio to obtain a first audio and a second audio, wherein the first audio and the second audio only comprise one speaker; performing voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; performing role judgment on the first text and the second text, and selecting a text corresponding to customer service; and carrying out text content semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio.

In a second aspect, an embodiment of the present application provides an audio quality inspection apparatus, including: an acquisition module configured to acquire dialogue audio, wherein the dialogue audio records a dialogue between a customer and a customer service; the separation module is configured to perform voice separation on the dialogue audio to obtain first audio and second audio, wherein the first audio and the second audio only comprise one speaker; the recognition module is configured to perform voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; the judging module is configured to judge roles of the first text and the second text and select a text corresponding to customer service; and the classification module is configured to perform text content semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio.

In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.

In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as described in any of the implementations of the first aspect.

In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.

According to the audio quality inspection method, the audio quality inspection device, the audio quality inspection equipment and the storage medium, firstly, the acquired dialogue audio is subjected to human-voice separation to obtain first audio and second audio; then, performing voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; then, performing role judgment on the first text and the second text, and selecting a text corresponding to customer service; and finally, text content semantic classification is carried out on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio, so that the full-automatic audio quality inspection can be realized.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:

FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;

FIG. 2 is a flow chart of one embodiment of an audio quality inspection method according to the present application;

FIG. 3 is a flow chart of yet another embodiment of an audio quality inspection method according to the present application;

FIG. 4 is an application scenario diagram in which an audio quality inspection method of an embodiment of the present application may be implemented;

FIG. 5 is a schematic structural view of one embodiment of an audio quality testing apparatus according to the present application;

FIG. 6 is a block diagram of an electronic device for implementing an audio quality inspection method according to an embodiment of the present application.

Detailed Description

Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.

Fig. 1 shows an exemplary system architecture 100 in which embodiments of an audio quality inspection method or audio quality inspection apparatus of the present application may be applied.

As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.

A user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or transmit dialogue audio or the like. Various client applications, such as a recording application, an audio quality inspection application, etc., may be installed on the terminal devices 101, 102, 103.

The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smartphones, tablets, laptop and desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the above-described electronic devices, and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No specific limitation is imposed here.

The server 105 may provide various services. For example, the server 105 may analyze and process the dialogue audio acquired from the terminal devices 101, 102, 103, and generate processing results (e.g., quality inspection results of the dialogue audio).

It should be noted that the server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. No specific limitation is imposed here.

It should be noted that, the audio quality inspection method provided in the embodiment of the present application is generally executed by the server 105, and accordingly, the audio quality inspection device is generally disposed in the server 105.

It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.

With continued reference to fig. 2, a flow 200 of one embodiment of an audio quality inspection method in accordance with the present application is shown. The audio quality inspection method comprises the following steps:

In step 201, dialog audio is acquired.

In this embodiment, the execution subject of the audio quality inspection method (e.g., the server 105 shown in fig. 1) may acquire dialogue audio. The dialogue audio may be audio in which a dialogue between a customer and a customer service is recorded.

Typically, when a call center receives an incoming call from a customer, the call can be automatically assigned to customer service. When the call between the customer and customer service is established, the terminal device of the customer service (for example, terminal devices 101, 102, 103 shown in fig. 1) may start a recording function and record the dialogue between the customer and customer service until the call ends, thereby obtaining the dialogue audio. Businesses that sell products (e.g., physical goods, virtual services, etc.) typically set up call centers to provide after-sales service for their products. To improve the service quality of customer service, an enterprise needs to perform quality inspection on the recorded dialogue audio. Based on the quality inspection results, good practices are distilled and promoted, and poor practices are supervised and corrected. For fast-growing enterprises, call center traffic climbs continuously, and inspecting the full volume of dialogue audio would be an enormous workload. To improve quality inspection efficiency, part of the dialogue audio is therefore extracted from the full volume in proportion for quality inspection. For example, where the average duration of a dialogue audio is about 6 minutes, dialogue audio is randomly extracted from the full volume at a ratio of 1% to 2% for quality inspection.

Step 202, performing voice separation on the dialogue audio to obtain a first audio and a second audio.

In this embodiment, the executing body may perform voice separation on the dialogue audio to obtain the first audio and the second audio. Each of the first audio and the second audio contains only one speaker.

Because the dialogue is between the customer and customer service, the dialogue audio typically involves two speakers: the customer and customer service. Different speakers have different voiceprints, so performing voice separation on the dialogue audio based on voiceprints separates out a first audio and a second audio that each contain only one speaker, that speaker being either the customer or customer service. For example, the first audio is the customer's audio and the second audio is the customer service's audio.

It should be noted that, the voice separation of the dialogue audio can separate out the audio containing only one speaker, but cannot identify the specific speaker contained in the audio.

Step 203, performing voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio.

In this embodiment, the executing body may perform speech recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio.

Specifically, the spoken content in the first audio and the second audio can be converted into corresponding words by using a voice recognition technology, so as to obtain a first text corresponding to the first audio and a second text corresponding to the second audio. The first text may include the words corresponding to the spoken content in the first audio. The second text may include the words corresponding to the spoken content in the second audio.

Step 204, performing role judgment on the first text and the second text, and selecting the text corresponding to the customer service.

In this embodiment, the executing body may perform role judgment on the first text and the second text, label the roles corresponding to the first text and the second text, and then select the text corresponding to the customer service from the two texts.

Specifically, the contents of the first text and the second text can be analyzed to determine the roles corresponding to the first text and the second text. For example, the role corresponding to a text that contains a welcome phrase or a closing phrase is typically customer service. For another example, the role corresponding to a text that contains more questions about a product is typically the customer, and the role corresponding to a text that contains more answers about a product is typically customer service.
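The content cues just described (welcome or closing phrases versus question-heavy text) can be illustrated with a minimal keyword heuristic. This is only a sketch: the phrase lists, the weights, and the function names `agent_score` and `select_agent_text` are hypothetical and are not taken from the application.

```python
# Hypothetical cue phrases; a real call center would supply its own scripts.
GREETING_OR_CLOSING = (
    "welcome to",
    "how may i help",
    "thank you for calling",
    "have a nice day",
)
QUESTION_CUES = ("how do i", "why", "can you", "?")


def agent_score(text: str) -> int:
    """Return a score that is higher when the text looks like customer service."""
    lowered = text.lower()
    # Greeting/closing phrases count toward customer service (weight 2 is arbitrary).
    score = sum(2 for phrase in GREETING_OR_CLOSING if phrase in lowered)
    # Question cues count toward the customer.
    score -= sum(1 for cue in QUESTION_CUES if cue in lowered)
    return score


def select_agent_text(first_text: str, second_text: str) -> str:
    """Pick the text that scores higher as customer service (ties pick the first)."""
    if agent_score(first_text) >= agent_score(second_text):
        return first_text
    return second_text
```

A heuristic of this kind is brittle, so a trained role determination model may be preferable in practice.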

Step 205, performing text content semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio.

In this embodiment, the execution body may perform text content semantic classification on a text corresponding to a customer service, to obtain a quality inspection result of a dialogue audio. The quality inspection result can be used for representing the service quality of customer service in the conversation.

In some optional implementations of this embodiment, a bonus category set and a deduction category set may be preset. Text content semantic classification is performed on the text corresponding to the customer service to determine at least one bonus category and at least one deduction category to which that text belongs, thereby obtaining the quality inspection result of the dialogue audio. A bonus category in the bonus category set may be a positive category worth promoting. A deduction category in the deduction category set may be a negative category requiring correction. Taking quality inspection of service flow compliance and language compliance as an example, the bonus category set may include categories such as welcoming by the standard, closing by the standard, confirming customer information, and soothing a complaining customer's emotions. The deduction category set may include categories such as presence of service disabilities, passively recommending products on outbound calls, aggressively recommending products, and fraudulently inducing customers.

The quality inspection result of the dialogue audio is determined based on the at least one bonus category and the at least one deduction category to which the text corresponding to the customer service belongs. For example, the at least one bonus category and the at least one deduction category are used directly as the quality inspection result. The behavior behind the at least one bonus category can then be distilled and promoted, and the behavior behind the at least one deduction category can be supervised and corrected, thereby improving the service quality of customer service. For another example, each bonus category in the bonus category set is labeled with a corresponding bonus score, and each deduction category in the deduction category set is labeled with a corresponding deduction score. The difference between the sum of the bonus scores of the at least one bonus category and the sum of the deduction scores of the at least one deduction category can then be calculated and used as the quality inspection result of the dialogue audio. Generally, the larger the difference, the higher the service quality of customer service in the current dialogue; the smaller the difference, the lower the service quality. Dialogue audio with higher service quality can be distilled and promoted, and dialogue audio with lower service quality can be supervised and corrected, thereby improving the service quality of customer service.
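The score difference described above can be sketched as follows. The category names and the point values in `BONUS_SCORES` and `DEDUCTION_SCORES` are hypothetical placeholders chosen for illustration; the application does not specify concrete scores.

```python
# Hypothetical per-category scores; real values would come from the scoring rules.
BONUS_SCORES = {
    "standard_welcome": 5,
    "standard_closing": 5,
    "confirm_customer_info": 3,
    "soothe_complaint": 4,
}
DEDUCTION_SCORES = {
    "aggressive_recommendation": 6,
    "fraudulent_inducement": 10,
}


def quality_score(predicted_categories):
    """Sum of bonus scores minus sum of deduction scores for the predicted categories."""
    bonus = sum(BONUS_SCORES.get(c, 0) for c in predicted_categories)
    deduction = sum(DEDUCTION_SCORES.get(c, 0) for c in predicted_categories)
    return bonus - deduction
```

For instance, a dialogue classified as `standard_welcome`, `confirm_customer_info` and `aggressive_recommendation` would score 5 + 3 - 6 under these placeholder values.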

In some optional implementations of this embodiment, the execution body may further count quality inspection results of a plurality of conversational audios of the same customer service, and perform tracking analysis on quality of service of the customer service to form historical service data of the customer service. In addition, the statistical results can also be used for performance assessment of the customer service.
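One way to accumulate such per-agent statistics is sketched below. It assumes each quality inspection result has already been reduced to a numeric score; the function name `summarize_by_agent` and the chosen summary fields are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_agent(results):
    """Group (agent_id, score) pairs by agent and compute simple summary statistics."""
    history = defaultdict(list)
    for agent_id, score in results:
        history[agent_id].append(score)
    return {
        agent_id: {"count": len(scores), "mean": mean(scores), "min": min(scores)}
        for agent_id, scores in history.items()
    }
```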

According to the audio quality inspection method provided by the embodiment of the application, firstly, the acquired dialogue audio is subjected to human voice separation to obtain a first audio and a second audio; then, voice recognition is performed on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; then, role judgment is performed on the first text and the second text, and the text corresponding to customer service is selected; and finally, text content semantic classification is performed on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio, so that fully automatic audio quality inspection can be realized. Across human voice separation, voice recognition, role judgment and the final quality inspection, labor cost is greatly reduced, the audio can be analyzed rapidly, problems can be located accurately, and stable and efficient customer service quality is ensured. Compared with manual audio quality inspection, the method reduces the time consumed by audio quality inspection, improves its efficiency, reduces its cost, improves its accuracy, and eliminates its subjectivity. It can support a large volume of complex quality inspection work and thus keep pace with the rapid growth of an enterprise.

With further reference to fig. 3, a flow 300 of yet another embodiment of an audio quality inspection method according to the present application is shown. The audio quality inspection method comprises the following steps:

In step 301, dialog audio is acquired.

In this embodiment, the specific operation of step 301 is described in detail in step 201 in the embodiment shown in fig. 2, and will not be described herein.

Step 302, inputting dialogue audio to a pre-trained human voice separation model to obtain first audio and second audio.

In this embodiment, the executing body of the audio quality inspection method (for example, the server 105 shown in fig. 1) may input the dialogue audio to a pre-trained human voice separation model to obtain the first audio and the second audio. Segmenting the audio of different roles with the human voice separation model reduces the cost of manual listening and identification. The human voice separation model may include, but is not limited to, AI (Artificial Intelligence) models such as Xvector-AHC (Xvector-Agglomerative Hierarchical Clustering, i.e., a voiceprint model combined with agglomerative hierarchical clustering), GMM (Gaussian Mixture Model), and HMM (Hidden Markov Model), and is obtained by training a neural network using a training sample set. The training samples in the training sample set may be sample dialogue audio labeled with speakers.

In some alternative implementations of the present embodiment, the human voice separation model is Xvector-AHC. Wherein Xvector-AHC may include Xvector and AHC. The corresponding human voice separation step may include:

First, dialog audio is divided into a plurality of audio segments.

In general, dialog audio may be segmented uniformly. For example, for a 10 second dialog audio, segments may be made every 500 milliseconds, resulting in 20 audio segments.
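The uniform segmentation can be sketched as below. The sketch assumes the audio is available as a flat sequence of samples at a known sample rate; the function name `split_uniform` and its parameters are illustrative.

```python
def split_uniform(samples, sample_rate, segment_ms=500):
    """Split a sample sequence into consecutive segments of segment_ms milliseconds.

    The last segment may be shorter if the audio length is not an exact multiple.
    """
    segment_len = sample_rate * segment_ms // 1000
    if segment_len <= 0:
        raise ValueError("segment length must be at least one sample")
    return [samples[i:i + segment_len] for i in range(0, len(samples), segment_len)]
```

With a 10-second recording and `segment_ms=500`, this yields the 20 segments mentioned in the example above.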

And then, respectively inputting the plurality of audio fragments to Xvector to obtain the characteristics of the plurality of audio fragments.

A common Xvector network structure includes, in order, a frame-level layer, a statistics pooling layer, a segment-level layer, and a softmax activation function layer. Here, the softmax activation function layer is removed from the trained Xvector network, and the Xvector features output by the segment-level layer are taken as the features of the audio segment.
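To illustrate how a statistics pooling layer turns a variable number of frame-level vectors into one fixed-length vector, the sketch below concatenates the per-dimension mean and standard deviation over frames, which is how this layer is commonly described for x-vector networks. The use of the population standard deviation and the function name are assumptions of this sketch, and the learned frame-level and segment-level layers are omitted.

```python
from statistics import mean, pstdev


def statistics_pooling(frames):
    """Pool frame-level feature vectors into [means..., standard deviations...]."""
    if not frames:
        raise ValueError("at least one frame is required")
    per_dimension = list(zip(*frames))  # transpose: one tuple of values per feature dimension
    means = [mean(values) for values in per_dimension]
    stds = [pstdev(values) for values in per_dimension]
    return means + stds
```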

Then, the features of the plurality of audio clips are clustered using the AHC, and the categories of the plurality of audio clips are determined based on the clustering result.

In general, AHC can be divided into two types according to the direction of clustering: top-down and bottom-up. The bottom-up clustering algorithm initially treats each sample as a separate category and then merges categories step by step until only one category remains. The result is a tree structure whose root is a category containing all sample points and whose leaves are clusters of a single sample each. Here, categories are merged based on a distance metric, and two categories are merged into one in each iteration. The feature of one audio segment is one sample. The root of the tree obtained by clustering the features of the plurality of audio segments with AHC has two child nodes. The features of audio segments within the same child node are similar, while the features of audio segments in different child nodes differ. Thus, the audio segments corresponding to one child node belong to one category, and the audio segments corresponding to the other child node belong to the other category.
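A minimal bottom-up clustering sketch follows. Stopping when two clusters remain is equivalent to reading off the two child nodes of the root. The choice of Euclidean distance with average linkage is an assumption of this sketch; the text above only states that categories are merged based on a distance metric.

```python
import math


def _average_distance(cluster_a, cluster_b, features):
    """Average pairwise Euclidean distance between two clusters of sample indices."""
    total = sum(math.dist(features[i], features[j]) for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def ahc_two_clusters(features):
    """Merge the closest pair of clusters repeatedly until two remain; return 0/1 labels."""
    if len(features) < 2:
        raise ValueError("need at least two samples")
    clusters = [[i] for i in range(len(features))]  # every sample starts as its own cluster
    while len(clusters) > 2:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = _average_distance(clusters[x], clusters[y], features)
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # merge y into x
        del clusters[y]
    labels = [0] * len(features)
    for index in clusters[1]:
        labels[index] = 1
    return labels
```

The labels only distinguish two speakers; they do not say which one is the customer service.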

And finally, combining the audio clips in the same category to obtain a first audio and a second audio.

Typically, for the same category of audio segments, the audio segments may be combined in the order they were in the dialog audio, resulting in corresponding audio. The audio segments corresponding to one child node may be combined into a first audio and the audio segments corresponding to another child node may be combined into a second audio.
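The merging step can be sketched as follows, assuming the segments are kept in their original order and each has a category label of 0 or 1 (for example, from a clustering step). The function name `merge_by_label` is illustrative.

```python
def merge_by_label(segments, labels):
    """Concatenate segments per category, preserving their order in the dialogue audio."""
    if len(segments) != len(labels):
        raise ValueError("segments and labels must have the same length")
    first_audio, second_audio = [], []
    for segment, label in zip(segments, labels):
        target = first_audio if label == 0 else second_audio
        target.extend(segment)
    return first_audio, second_audio
```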

Step 303, inputting the first audio and the second audio into a pre-trained speech recognition model respectively to obtain a first text and a second text.

In this embodiment, the executing body may input the first audio and the second audio respectively into a pre-trained speech recognition model to obtain the first text and the second text. Recognizing the audio content with an end-to-end speech recognition model greatly improves the efficiency of content acquisition. The speech recognition model may include, but is not limited to, AI models such as LSTM-CTC (Long Short-Term Memory-Connectionist Temporal Classification), GMM, and HMM, and is obtained by training a neural network using a training sample set. The training samples in the training sample set may include sample audio and corresponding sample text.

In some alternative implementations of the present embodiment, the speech recognition model is LSTM-CTC. Among them, LSTM-CTC may include LSTM and CTC. The corresponding voice recognition step includes:

First, the first audio and the second audio are respectively input to the LSTM, and the characteristics of the first audio and the second audio are obtained.

LSTM is a recurrent neural network that mitigates the vanishing-gradient and exploding-gradient problems of ordinary recurrent neural networks. Its core idea is that a channel called the cell state runs through the whole time sequence, and information is removed from or added to the cell state through structures called gates. An LSTM has three gates: the forget gate, the input gate, and the output gate.
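The role of the three gates can be seen in a single time step of the standard LSTM update. To stay short, the sketch below uses a scalar input and a scalar hidden state, which is a simplification; the parameter layout (`params` mapping a gate name to an input weight, a recurrent weight and a bias) is an assumption made for illustration.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps 'f', 'i', 'o', 'g' to (w_x, w_h, bias)."""
    def pre_activation(gate):
        w_x, w_h, bias = params[gate]
        return w_x * x + w_h * h_prev + bias

    f = _sigmoid(pre_activation("f"))   # forget gate: how much old cell state to keep
    i = _sigmoid(pre_activation("i"))   # input gate: how much new candidate to add
    o = _sigmoid(pre_activation("o"))   # output gate: how much cell state to expose
    g = math.tanh(pre_activation("g"))  # candidate cell content
    c = f * c_prev + i * g              # updated cell state
    h = o * math.tanh(c)                # updated hidden state
    return h, c
```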

Then, the features of the first audio and the second audio are respectively input into the CTC to obtain a first text and a second text.

The CTC is mainly used for solving the alignment problem of the input features and the output labels.
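One common way to turn per-frame CTC outputs into a label sequence is greedy (best-path) decoding: take the most likely symbol in each frame, collapse consecutive repeats, and drop the blank symbol. The application does not state which decoding method is used, so this is only an illustration; the function name and the convention that index 0 is the blank are assumptions.

```python
def ctc_greedy_decode(frame_probs, symbols, blank=0):
    """Greedy CTC decoding.

    frame_probs: one list of probabilities per frame, indexed like symbols.
    symbols: list of output symbols; symbols[blank] is the CTC blank.
    """
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    previous = None
    for index in best_path:
        # Emit a symbol only when it differs from the previous frame and is not blank.
        if index != previous and index != blank:
            decoded.append(symbols[index])
        previous = index
    return "".join(decoded)
```

Because repeats are collapsed before blanks are removed, a blank between two identical symbols is what allows a doubled letter to survive decoding.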

Step 304, the first text and the second text are respectively input into a pre-trained character judging model, a character corresponding to the first text and a character corresponding to the second text are obtained, and a text corresponding to customer service is selected.

In this embodiment, the executing body may input the first text and the second text respectively into a pre-trained role determination model to obtain the role corresponding to the first text and the role corresponding to the second text, and select the text corresponding to the customer service. Judging roles with the role determination model gives better results and higher robustness than keyword matching. The role determination model may include, but is not limited to, AI models such as TextCNN (Text Convolutional Neural Network), CharCNN (Character-level Convolutional Neural Network), RCNN (Region-based Convolutional Neural Network), Transformer, ELMo (Embeddings from Language Models, a deep contextualized word representation model), and BERT (Bidirectional Encoder Representations from Transformers), and is obtained by training a neural network using a training sample set. The training samples in the training sample set may be sample texts labeled with roles.

Step 305, inputting the text corresponding to the customer service into a pre-trained semantic classification model to obtain a quality inspection result.

In this embodiment, the execution body may input the text corresponding to the customer service into a pre-trained semantic classification model to obtain a quality inspection result. Content classification by the semantic classification model achieves the purpose of quality inspection. The semantic classification model may include, but is not limited to, AI models such as BERT, ELMo, TextCNN, CharCNN, RCNN, and Transformer, and is obtained by training a neural network using a training sample set. The training samples in the training sample set may include sample customer service texts labeled with quality inspection results.

In some optional implementations of this embodiment, the semantic classification model may be BERT. BERT is a bidirectional Transformer model that captures the semantic relations within the context in fine detail and therefore yields better semantic classification results, which serves the purpose of quality inspection.

According to the audio quality inspection method provided by the embodiment of the application, firstly, the audio of different roles is segmented by the human voice separation model, which reduces the cost of manual listening and identification; then, the audio content is recognized by the end-to-end speech recognition model, which greatly improves the efficiency of content acquisition; then, the roles are judged by the role determination model, which gives better results and higher robustness than keyword matching; finally, content classification by the semantic classification model achieves the purpose of quality inspection.

The embodiment of the application provides an intelligent quality inspection approach that takes AI technology as its core and uses it to take over standardized work. In terms of the work content of quality inspection, it monitors massive volumes of dialogue audio, scores them according to set rules, produces standardized analysis documents, and accurately locates problematic dialogue audio. All dialogue audio can be inspected in full, without blind spots, in a short time. It also meets the requirements placed on quality inspectors, such as fairness, quality control, and incorporation of business knowledge. Compared with manual quality inspection, intelligent quality inspection is stable and efficient, and it reduces the workload of quality inspectors in analyzing basic data. The advantages of full-volume and real-time quality inspection using AI technology are therefore more pronounced than those of manual quality inspection.

For ease of understanding, fig. 4 shows an application scenario diagram in which an audio quality inspection method of an embodiment of the present application may be implemented. As shown in fig. 4, first, dialogue audio is input to Xvector-AHC for human voice separation, resulting in first audio and second audio. And then, respectively inputting the first audio and the second audio into the LSTM-CTC for voice recognition to obtain a first text corresponding to the first audio and a second text corresponding to the second audio. And then, respectively inputting the first text and the second text into TextCNN for role judgment, and selecting the text corresponding to the customer service. And finally, inputting the text corresponding to the customer service to the BERT for intelligent quality inspection to obtain a quality inspection result.
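The four stages of this scenario can be chained as in the sketch below. The four callables stand in for the trained models (for example Xvector-AHC, LSTM-CTC, TextCNN and BERT); their names, signatures and the role label `"customer_service"` are assumptions of this sketch, not an interface defined by the application.

```python
def inspect_dialogue(dialogue_audio, separate, recognize, judge_role, classify):
    """Run separation, recognition, role judgment and classification in sequence.

    separate(audio) -> (first_audio, second_audio)
    recognize(audio) -> text
    judge_role(text) -> "customer_service" or another role label
    classify(text) -> quality inspection result
    """
    first_audio, second_audio = separate(dialogue_audio)
    first_text = recognize(first_audio)
    second_text = recognize(second_audio)
    if judge_role(first_text) == "customer_service":
        agent_text = first_text
    elif judge_role(second_text) == "customer_service":
        agent_text = second_text
    else:
        raise ValueError("no customer service text was identified")
    return classify(agent_text)
```

Passing the models in as callables keeps each stage replaceable, which matches the way the embodiments list several alternative models per stage.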

With further reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of an audio quality inspection apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus is particularly applicable to various electronic devices.

As shown in fig. 5, the audio quality inspection apparatus 500 of the present embodiment may include: an acquisition module 501, a separation module 502, a recognition module 503, a judging module 504, and a classification module 505. The acquisition module 501 is configured to acquire dialogue audio, wherein the dialogue audio records a dialogue between a client and customer service; the separation module 502 is configured to perform voice separation on the dialogue audio to obtain a first audio and a second audio, wherein the first audio and the second audio each comprise only one speaker; the recognition module 503 is configured to perform voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; the judging module 504 is configured to perform role judgment on the first text and the second text, and select the text corresponding to the customer service; and the classification module 505 is configured to perform text content semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio.

In this embodiment, for the specific processing of the acquisition module 501, the separation module 502, the recognition module 503, the judging module 504 and the classification module 505 in the audio quality inspection apparatus 500, and the technical effects thereof, reference may be made to the description of steps 201 to 205 in the embodiment corresponding to fig. 2, which is not repeated here.

In some alternative implementations of the present embodiment, the separation module 502 includes a separation submodule configured to input the dialogue audio to a pre-trained human voice separation model to obtain the first audio and the second audio, wherein the human voice separation model comprises one of the following: x-vector voiceprint model with agglomerative hierarchical clustering (Xvector-AHC), Gaussian mixture model (GMM), hidden Markov model (HMM).

In some alternative implementations of the present embodiment, the human voice separation model is Xvector-AHC, which includes Xvector and AHC; and the separation submodule is further configured to: divide the dialogue audio into a plurality of audio segments; input the plurality of audio segments to Xvector respectively to obtain features of the plurality of audio segments; cluster the features of the plurality of audio segments using AHC, and determine the categories of the plurality of audio segments based on the clustering result; and merge the audio segments of the same category to obtain the first audio and the second audio.
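As a non-limiting illustration of the clustering and merging steps, the following sketch applies average-linkage agglomerative hierarchical clustering with cosine distance to per-segment feature vectors and then concatenates the segments of each cluster. The feature vectors are assumed to have been produced already by an Xvector-style network, which is not shown. The linkage criterion, the distance measure, the fixed target of two clusters, and all function names are assumptions made for this sketch rather than details stated in the application.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; returns 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def agglomerative_cluster(embeddings: List[Sequence[float]],
                          n_clusters: int = 2) -> List[int]:
    """Average-linkage AHC: merge the closest pair until n_clusters remain."""
    clusters: List[List[int]] = [[i] for i in range(len(embeddings))]
    while len(clusters) > n_clusters:
        best: Tuple[float, int, int] = (math.inf, 0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [cosine_distance(embeddings[p], embeddings[q])
                         for p in clusters[i] for q in clusters[j]]
                avg = sum(dists) / len(dists)
                if avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        # j > i, so removing cluster j does not shift the position of cluster i.
        clusters[i].extend(clusters.pop(j))
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for idx in members:
            labels[idx] = label
    return labels


def merge_by_speaker(segments: List[List[float]],
                     labels: List[int]) -> Dict[int, List[float]]:
    """Concatenate segments sharing a cluster label, preserving time order."""
    merged: Dict[int, List[float]] = {}
    for segment, label in zip(segments, labels):
        merged.setdefault(label, []).extend(segment)
    return merged
```

Stopping at exactly two clusters reflects the two-speaker dialogue assumed throughout the embodiments; a distance threshold could be used instead where the number of speakers is not known in advance.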

In some alternative implementations of the present embodiment, the recognition module 503 is further configured to: respectively input the first audio and the second audio into a pre-trained voice recognition model to obtain the first text and the second text, wherein the voice recognition model comprises one of the following: long short-term memory network with connectionist temporal classification (LSTM-CTC), GMM, HMM.
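Where the voice recognition model is LSTM-CTC, the per-frame outputs of the network are commonly converted to text by taking the best symbol in each frame, collapsing consecutive repeats, and removing the blank symbol. The following is a minimal sketch of this greedy CTC decoding step only; the LSTM itself is omitted, and the function name ctc_greedy_decode, the choice of index 0 as the blank, and the toy vocabulary are assumptions made for illustration and are not taken from the application.

```python
from typing import List, Sequence

BLANK = 0  # assumption: index 0 of the vocabulary is the CTC blank symbol


def ctc_greedy_decode(frame_scores: Sequence[Sequence[float]],
                      vocab: Sequence[str]) -> str:
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best_path = [max(range(len(frame)), key=lambda k: frame[k])
                 for frame in frame_scores]
    chars: List[str] = []
    previous = BLANK
    for index in best_path:
        # Emit a symbol only when it differs from the previous frame's symbol
        # and is not the blank; a blank between two equal symbols keeps both.
        if index != previous and index != BLANK:
            chars.append(vocab[index])
        previous = index
    return "".join(chars)


# Toy example: five frames over the hypothetical vocabulary [blank, "h", "i"].
toy_vocab = ["", "h", "i"]
toy_frames = [
    [0.1, 0.8, 0.1],    # "h"
    [0.1, 0.8, 0.1],    # "h" repeated, collapsed
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # "i"
    [0.1, 0.1, 0.8],    # "i" repeated, collapsed
]
decoded = ctc_greedy_decode(toy_frames, toy_vocab)
```

A beam search with a language model could replace the greedy step where higher accuracy is required; the greedy form is shown because it is the simplest correct instance of the decoding rule.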

In some alternative implementations of the present embodiment, the judging module 504 is further configured to: respectively input the first text and the second text into a pre-trained role judgment model to obtain a role corresponding to the first text and a role corresponding to the second text, wherein the role judgment model comprises one of the following: text-level convolutional neural network (TextCNN), character-level convolutional neural network (CharCNN), regional convolutional neural network (RCNN), Transformer, deep contextualized word representation model (ELMo), Bidirectional Encoder Representations from Transformers (BERT).
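As a non-limiting illustration of how the output of the role judgment model might be used to select the customer service text, the following sketch compares a per-text probability of belonging to customer service and keeps the text with the higher value. Comparing the two probabilities, rather than thresholding each one independently, is an assumption of this sketch motivated by the fact that exactly one of the two speakers is the customer service. The function names are hypothetical, and toy_agent_probability is a keyword-based placeholder used only so that the sketch can be executed; it stands in for a trained model such as TextCNN and is not the approach of the embodiments.

```python
from typing import Callable, Tuple


def select_customer_service_text(
    first_text: str,
    second_text: str,
    agent_probability: Callable[[str], float],
) -> Tuple[str, str]:
    """Return (customer_service_text, customer_text).

    agent_probability is any model that maps a text to the probability
    that it was spoken by customer service.
    """
    p_first = agent_probability(first_text)
    p_second = agent_probability(second_text)
    if p_first >= p_second:
        return first_text, second_text
    return second_text, first_text


def toy_agent_probability(text: str) -> float:
    """Placeholder scorer (assumption); a trained classifier would go here."""
    cues = ("how may i help", "thank you for calling")
    return 0.9 if any(cue in text.lower() for cue in cues) else 0.2
```

For example, calling select_customer_service_text with a customer complaint as the first text and an agent greeting as the second text returns the greeting as the customer service text.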

In some alternative implementations of the present embodiment, the classification module 505 is further configured to: input the text corresponding to the customer service into a pre-trained semantic classification model to obtain the quality inspection result, wherein the semantic classification model comprises one of the following: BERT, ELMo, TextCNN, CharCNN, RCNN, Transformer.
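The claims recite that the quality inspection result is the sum of the points attached to the point-adding (scoring) types matched by the customer service text, minus the sum of the points attached to the matched point-deducting types. As a non-limiting illustration of that arithmetic, the following sketch assumes the semantic classification model has already returned the matched types; the type names, point values, and function name are invented for the example and are not taken from the application.

```python
from typing import Dict, Iterable

# Hypothetical preset sets: each type is labeled with its number of points.
SCORING_POINTS: Dict[str, int] = {
    "polite_greeting": 5,
    "confirms_resolution": 10,
}
DEDUCTION_POINTS: Dict[str, int] = {
    "interrupts_customer": 3,
    "prohibited_phrase": 8,
}


def quality_inspection_result(
    matched_scoring: Iterable[str],
    matched_deductions: Iterable[str],
    scoring_points: Dict[str, int] = SCORING_POINTS,
    deduction_points: Dict[str, int] = DEDUCTION_POINTS,
) -> int:
    """Sum of matched scoring points minus sum of matched deduction points."""
    gained = sum(scoring_points[name] for name in matched_scoring)
    lost = sum(deduction_points[name] for name in matched_deductions)
    return gained - lost


# A text classified as a polite greeting that confirms resolution but
# interrupts the customer once: (5 + 10) - 3 = 12.
example_result = quality_inspection_result(
    ["polite_greeting", "confirms_resolution"],
    ["interrupts_customer"],
)
```

Keeping the point tables separate from the classifier means the scoring rules can be adjusted without retraining the semantic classification model.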

According to embodiments of the present application, the present application also provides an electronic device, a readable storage medium and a computer program product.

Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 6, the device 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 602 or a computer program loaded from a storage unit 608 into a random access memory (RAM) 603. The RAM 603 may also store various programs and data required for the operation of the device 600. The computing unit 601, the ROM 602, and the RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.

Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.

The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as an audio quality inspection method. For example, in some embodiments, the audio quality inspection method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by computing unit 601, one or more steps of the audio quality inspection method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the audio quality inspection method by any other suitable means (e.g. by means of firmware).

Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for carrying out the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the internet.

The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure can be achieved, which is not limited herein.

The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims

1. An audio quality inspection method comprising:

acquiring dialogue audio, wherein the dialogue audio records a dialogue between a client and customer service;

dividing the dialogue audio into a plurality of audio segments; inputting the plurality of audio segments to Xvector respectively to obtain features of the plurality of audio segments; clustering the features of the plurality of audio segments by using AHC, and determining categories of the plurality of audio segments based on a clustering result; merging audio segments of the same category to obtain a first audio and a second audio, wherein the network structure of Xvector sequentially comprises a frame-level layer, a pooling layer and a segment-level layer;

performing voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio;

performing role judgment on the first text and the second text, and selecting a text corresponding to customer service;

performing text content semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio, wherein a scoring type set and a deduction type set are preset, each scoring type in the scoring type set is labeled with a corresponding score value, and each deduction type in the deduction type set is labeled with a corresponding deduction value; text content semantic classification is performed on the text corresponding to the customer service to determine at least one scoring type and at least one deduction type to which the text corresponding to the customer service belongs; a difference between the sum of the score values corresponding to the at least one scoring type and the sum of the deduction values corresponding to the at least one deduction type is calculated; and the difference is taken as the quality inspection result of the dialogue audio.

2. The method of claim 1, wherein the performing speech recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio comprises:

inputting the first audio and the second audio into a pre-trained voice recognition model respectively to obtain the first text and the second text, wherein the voice recognition model comprises one of the following: long short-term memory network with connectionist temporal classification (LSTM-CTC), GMM, HMM.

3. The method of claim 1, wherein the performing a role determination on the first text and the second text comprises:

respectively inputting the first text and the second text into a pre-trained role judgment model to obtain a role corresponding to the first text and a role corresponding to the second text, wherein the role judgment model comprises one of the following: text-level convolutional neural network (TextCNN), character-level convolutional neural network (CharCNN), regional convolutional neural network (RCNN), Transformer, deep contextualized word representation model (ELMo), Bidirectional Encoder Representations from Transformers (BERT).

4. The method of claim 1, wherein the performing text content semantic classification on the text corresponding to the customer service to obtain the quality inspection result of the dialogue audio comprises:

inputting the text corresponding to the customer service into a pre-trained semantic classification model to obtain the quality inspection result, wherein the semantic classification model comprises one of the following: BERT, ELMo, TextCNN, CharCNN, RCNN, Transformer.

5. An audio quality inspection device, comprising:

an acquisition module configured to acquire dialogue audio, wherein the dialogue audio records a dialogue between a customer and a customer service;

a separation module configured to divide the dialogue audio into a plurality of audio segments; input the plurality of audio segments to Xvector respectively to obtain features of the plurality of audio segments; cluster the features of the plurality of audio segments by using AHC, and determine categories of the plurality of audio segments based on a clustering result; and merge audio segments of the same category to obtain a first audio and a second audio, wherein the network structure of Xvector sequentially comprises a frame-level layer, a pooling layer and a segment-level layer;

a recognition module configured to perform voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio;

a judging module configured to perform role judgment on the first text and the second text, and select a text corresponding to the customer service;

a classification module configured to perform text content semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio, wherein a scoring type set and a deduction type set are preset, each scoring type in the scoring type set is labeled with a corresponding score value, and each deduction type in the deduction type set is labeled with a corresponding deduction value; text content semantic classification is performed on the text corresponding to the customer service to determine at least one scoring type and at least one deduction type to which the text corresponding to the customer service belongs; a difference between the sum of the score values corresponding to the at least one scoring type and the sum of the deduction values corresponding to the at least one deduction type is calculated; and the difference is taken as the quality inspection result of the dialogue audio.

6. The apparatus of claim 5, wherein the identification module is further configured to:

7. The apparatus of claim 5, wherein the determination module is further configured to:

8. The apparatus of claim 5, wherein the classification module is further configured to:

9. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.

10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-4.

11. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-4.