CN112966082B - Audio quality inspection method, device, equipment and storage medium - Google Patents

Info

Publication number
CN112966082B
Authority
CN
China
Prior art keywords
audio
text
quality inspection
customer service
dialogue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110253354.7A
Other languages
Chinese (zh)
Other versions
CN112966082A (en
Inventor
赵情恩
曾新贵
熊新雷
陈蓉
肖岩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110253354.7A
Publication of CN112966082A
Application granted
Publication of CN112966082B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3343Query execution using phonetics
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Telephonic Communication Services (AREA)
  • Machine Translation (AREA)

Abstract

The application discloses an audio quality inspection method, device, equipment and storage medium, and relates to artificial intelligence fields such as voice recognition, natural language processing and deep learning. One embodiment of the method comprises the following steps: acquiring dialogue audio, wherein the dialogue audio records a dialogue between a customer and customer service; performing voice separation on the dialogue audio to obtain a first audio and a second audio, wherein the first audio and the second audio each comprise only one speaker; performing voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; performing role judgment on the first text and the second text, and selecting the text corresponding to customer service; and performing semantic classification of text content on the text corresponding to customer service to obtain a quality inspection result of the dialogue audio. This embodiment enables fully automated audio quality inspection.

Description

Audio quality inspection method, device, equipment and storage medium
Technical Field
The embodiments of the application relate to the field of computers, in particular to artificial intelligence fields such as voice recognition, natural language processing and deep learning, and more particularly to an audio quality inspection method, device, equipment and storage medium.
Background
The main purpose of quality inspection in a call center is to assess the quality of customer service work and thereby raise the overall level and quality of service. Quality inspector is a standard post in a call center, responsible for supervising service, finding problems, summarizing experience, and proposing and driving improvements.
In general, a quality inspector randomly samples from the massive volume of dialogue audio between customers and customer service, listens to the samples, and scores the customer service quality of each dialogue according to a given scoring rule template.
Disclosure of Invention
The embodiment of the application provides an audio quality inspection method, device, equipment and storage medium.
In a first aspect, an embodiment of the present application provides an audio quality inspection method, including: acquiring dialogue audio, wherein the dialogue audio records dialogue between a client and customer service; performing voice separation on the dialogue audio to obtain a first audio and a second audio, wherein the first audio and the second audio only comprise one speaker; performing voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; performing role judgment on the first text and the second text, and selecting a text corresponding to customer service; and carrying out text content semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio.
In a second aspect, an embodiment of the present application provides an audio quality inspection apparatus, including: an acquisition module configured to acquire dialogue audio, wherein the dialogue audio records a dialogue between a customer and a customer service; the separation module is configured to perform voice separation on the dialogue audio to obtain first audio and second audio, wherein the first audio and the second audio only comprise one speaker; the recognition module is configured to perform voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; the judging module is configured to judge roles of the first text and the second text and select a text corresponding to customer service; and the classification module is configured to perform text content semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in any one of the implementations of the first aspect.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform a method as described in any of the implementations of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
According to the audio quality inspection method, the audio quality inspection device, the audio quality inspection equipment and the storage medium, firstly, the acquired dialogue audio is subjected to human-voice separation to obtain first audio and second audio; then, performing voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; then, performing role judgment on the first text and the second text, and selecting a text corresponding to customer service; and finally, text content semantic classification is carried out on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio, so that the full-automatic audio quality inspection can be realized.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings. The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of an audio quality inspection method according to the present application;
FIG. 3 is a flow chart of yet another embodiment of an audio quality inspection method according to the present application;
FIG. 4 is an application scenario diagram in which an audio quality inspection method of an embodiment of the present application may be implemented;
FIG. 5 is a schematic structural view of one embodiment of an audio quality inspection apparatus according to the present application;
FIG. 6 is a block diagram of an electronic device for implementing an audio quality inspection method according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will now be described with reference to the accompanying drawings, in which various details of the embodiments of the present application are included to facilitate understanding, and are to be considered merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
Fig. 1 shows an exemplary system architecture 100 in which embodiments of an audio quality inspection method or audio quality inspection apparatus of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used as a medium to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, among others.
A user may interact with the server 105 via the network 104 using the terminal devices 101, 102, 103 to receive or transmit video frames or the like. Various client applications, such as a recording application, an audio quality inspection application, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smartphones, tablets, laptop computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the above-described electronic devices, and may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module. No particular limitation is imposed here.
The server 105 may provide various services. For example, the server 105 may analyze and process the dialogue audio acquired from the terminal devices 101, 102, 103, and generate processing results (e.g., quality inspection results of the dialogue audio).
It should be noted that the server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster formed by a plurality of servers, or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. No particular limitation is imposed here.
It should be noted that, the audio quality inspection method provided in the embodiment of the present application is generally executed by the server 105, and accordingly, the audio quality inspection device is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to fig. 2, a flow 200 of one embodiment of an audio quality inspection method in accordance with the present application is shown. The audio quality inspection method comprises the following steps:
Step 201, acquiring dialogue audio.
In this embodiment, the execution subject of the audio quality inspection method (e.g., the server 105 shown in fig. 1) may acquire dialogue audio. The dialogue audio may be audio in which a dialogue between a customer and a customer service is recorded.
Typically, when a call center receives an incoming call from a customer, the call is automatically assigned to a customer service agent. Once the call is established, the terminal device of the customer service (for example, terminal devices 101, 102, 103 shown in fig. 1) may start a recording function and record the dialogue between the customer and the customer service until the call ends, yielding the dialogue audio. Enterprises that sell products (e.g., physical goods, virtual services, etc.) typically set up call centers to provide after-sales service for their products. To improve the service quality of customer service, an enterprise needs to perform quality inspection on the recorded dialogue audio. Based on the quality inspection results, favorable practices are distilled and promoted, and unfavorable ones are supervised and corrected. For a fast-growing enterprise, call center traffic climbs continuously, and inspecting all of the dialogue audio would be an enormous workload. To improve quality inspection efficiency, a portion of the dialogue audio is therefore drawn from the full set in proportion. For example, where the average duration of a dialogue audio is about 6 minutes, dialogue audio is randomly drawn from the full set at a proportion of 1% to 2% for quality inspection.
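The proportional random draw described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `sample_for_inspection`, the identifier format, and the pool size are all assumptions made for the example.

```python
import random


def sample_for_inspection(audio_ids, ratio=0.01, seed=None):
    """Randomly pick a fixed proportion of recordings for quality inspection."""
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    if not audio_ids:
        return []
    rng = random.Random(seed)
    # Always inspect at least one recording when the pool is non-empty.
    k = max(1, round(len(audio_ids) * ratio))
    return rng.sample(list(audio_ids), k)


# Hypothetical pool of 1,000 recording identifiers, sampled at 2%.
pool = [f"call-{i:04d}" for i in range(1000)]
picked = sample_for_inspection(pool, ratio=0.02, seed=0)
print(len(picked))  # 20
```

Sampling without replacement keeps any recording from being inspected twice in the same batch.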
Step 202, performing voice separation on the dialogue audio to obtain a first audio and a second audio.
In this embodiment, the executing body may perform voice separation on the dialogue audio to obtain the first audio and the second audio. The first audio and the second audio each comprise only one speaker.
Because the dialogue takes place between a customer and customer service, the recorded dialogue audio typically involves two speakers: the customer and the customer service. Different speakers have different voiceprints, so performing voice separation on the dialogue audio based on voiceprints yields a first audio and a second audio that each contain only one speaker, either the customer or the customer service. For example, the first audio is the customer's audio and the second audio is the customer service's audio.
It should be noted that, the voice separation of the dialogue audio can separate out the audio containing only one speaker, but cannot identify the specific speaker contained in the audio.
Step 203, performing voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio.
In this embodiment, the executing body may perform speech recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio.
Specifically, speech recognition technology can be used to convert the spoken content of the first audio and the second audio into written text, yielding a first text corresponding to the first audio and a second text corresponding to the second audio. The first text may include the words spoken in the first audio, and the second text may include the words spoken in the second audio.
Step 204, performing role judgment on the first text and the second text, and selecting the text corresponding to the customer service.
In this embodiment, the executing body may perform role determination on the first text and the second text, mark the roles corresponding to the first text and the second text, and then select from them the text corresponding to the customer service.
Specifically, the contents of the first text and the second text can be analyzed to determine the role corresponding to each. For example, the role corresponding to a text containing a greeting or closing phrase is typically the customer service. As another example, the role corresponding to a text containing more product inquiries is typically the customer, and the role corresponding to a text containing more answers about a product is typically the customer service.
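A content-based role heuristic of this kind might look like the sketch below. The phrase lists and the names `agent_score` and `pick_agent_text` are hypothetical and chosen only for illustration; the text does not specify any particular phrases or scoring rule.

```python
# Hypothetical phrase lists; a real deployment would use its own scripts.
AGENT_PHRASES = ("how may i help you", "thank you for calling", "is there anything else")
CUSTOMER_PHRASES = ("i would like to ask", "how much", "my order")


def agent_score(text):
    """Count agent-like phrases minus customer-like phrases in a transcript."""
    lowered = text.lower()
    agent_hits = sum(lowered.count(p) for p in AGENT_PHRASES)
    customer_hits = sum(lowered.count(p) for p in CUSTOMER_PHRASES)
    return agent_hits - customer_hits


def pick_agent_text(first_text, second_text):
    """Return (agent_text, customer_text); ties default to the first text."""
    if agent_score(first_text) >= agent_score(second_text):
        return first_text, second_text
    return second_text, first_text


first = "Hello, how much is shipping for my order?"
second = "Thank you for calling, how may I help you today?"
agent_text, customer_text = pick_agent_text(first, second)
print(agent_text)
```

Phrase matching of this sort is brittle against recognition errors and paraphrase, which is one reason a trained classifier may be preferred.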
Step 205, performing text content semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio.
In this embodiment, the execution body may perform text content semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio. The quality inspection result can be used to represent the service quality of the customer service in the dialogue.
In some optional implementations of this embodiment, a set of bonus categories and a set of deduction categories may be preset. Text content semantic classification is performed on the text corresponding to the customer service to determine at least one bonus category and at least one deduction category to which the text belongs, thereby obtaining the quality inspection result of the dialogue audio. The bonus categories in the bonus category set may be positive categories worth promoting. The deduction categories in the deduction category set may be negative categories that require correction. Taking quality inspection of service process compliance and language compliance as an example, the set of bonus categories may include categories such as greeting per the standard, closing per the standard, confirming customer information, and soothing a complaining customer. The set of deduction categories may include categories such as use of prohibited service language, passively recommending products on outbound calls, aggressively recommending products, and fraudulently inducing customers.
The quality inspection result of the dialogue audio is determined based on the at least one bonus category and the at least one deduction category to which the text corresponding to the customer service belongs. For example, these categories are used directly as the quality inspection result: the behaviors in the bonus categories can be distilled and promoted, and those in the deduction categories supervised and corrected, thereby improving the service quality of customer service. As another example, each bonus category in the bonus category set is labeled with a corresponding bonus score, and each deduction category in the deduction category set is labeled with a corresponding deduction score. The difference between the sum of the bonus scores of the at least one bonus category and the sum of the deduction scores of the at least one deduction category can then be calculated and used as the quality inspection result of the dialogue audio. Generally, the larger the difference, the higher the service quality of customer service in the dialogue; the smaller the difference, the lower the service quality. Dialogue audio with higher service quality can be distilled and promoted, and dialogue audio with lower service quality can be supervised and corrected, thereby improving the service quality of customer service.
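The score-difference variant can be sketched in a few lines. The category names and point values below are invented for illustration; the text only says that each category carries a score and that the result is the sum of added scores minus the sum of deducted scores.

```python
# Hypothetical category names and point values, for illustration only.
BONUS_SCORES = {"standard_greeting": 5, "standard_closing": 5, "confirm_customer_info": 3}
DEDUCTION_SCORES = {"prohibited_language": 10, "aggressive_recommendation": 8}


def quality_score(bonus_hits, deduction_hits):
    """Sum of bonus scores minus sum of deduction scores for one dialogue."""
    # Each category counts once, however many times the classifier reports it.
    bonus = sum(BONUS_SCORES[c] for c in set(bonus_hits))
    deduction = sum(DEDUCTION_SCORES[c] for c in set(deduction_hits))
    return bonus - deduction


score = quality_score(
    ["standard_greeting", "confirm_customer_info"],
    ["aggressive_recommendation"],
)
print(score)  # 5 + 3 - 8 = 0
```

Counting each category once per dialogue is a design choice of this sketch; a scheme that penalizes repeated violations would sum over occurrences instead.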
In some optional implementations of this embodiment, the execution body may further count quality inspection results of a plurality of conversational audios of the same customer service, and perform tracking analysis on quality of service of the customer service to form historical service data of the customer service. In addition, the statistical results can also be used for performance assessment of the customer service.
According to the audio quality inspection method provided by the embodiment of the application, the acquired dialogue audio is first subjected to voice separation to obtain a first audio and a second audio; voice recognition is then performed on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; role judgment is then performed on the first text and the second text, and the text corresponding to customer service is selected; finally, text content semantic classification is performed on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio, so that fully automatic audio quality inspection can be realized. At every stage, whether voice separation, voice recognition, role judgment or the final quality inspection, labor cost is greatly reduced, speech can be analyzed quickly and problems located precisely, and stable, efficient customer service quality is ensured. Compared with manual audio quality inspection, the method reduces the time consumed, improves efficiency, lowers cost, improves accuracy, and eliminates the subjectivity of audio quality inspection. It can support a large volume of complex quality inspection work and thus keep pace with the rapid growth of an enterprise.
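The five-step flow summarized above can be viewed as a simple chain of stages. The sketch below only shows that wiring; the function name `inspect_dialogue` and the toy stand-ins for each stage are assumptions, and real separation, recognition, role and classification models would be supplied in their place.

```python
def inspect_dialogue(dialogue_audio, separate, recognize, pick_agent_text, classify):
    """Chain the steps of the flow; each stage is supplied by the caller."""
    first_audio, second_audio = separate(dialogue_audio)   # voice separation
    first_text = recognize(first_audio)                    # speech recognition
    second_text = recognize(second_audio)
    agent_text = pick_agent_text(first_text, second_text)  # role determination
    return classify(agent_text)                            # semantic classification


# Toy stand-ins so the sketch runs: "audio" is a list of (speaker, words) turns.
dialogue = [("A", "hello how may I help you"), ("B", "my order is late")]
result = inspect_dialogue(
    dialogue,
    separate=lambda d: ([t for t in d if t[0] == "A"], [t for t in d if t[0] == "B"]),
    recognize=lambda turns: " ".join(words for _, words in turns),
    pick_agent_text=lambda a, b: a if "help you" in a else b,
    classify=lambda text: ["standard_greeting"] if text.startswith("hello") else [],
)
print(result)  # ['standard_greeting']
```

Passing the stages in as callables keeps the chain independent of any particular model choice.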
With further reference to fig. 3, a flow 300 of yet another embodiment of an audio quality inspection method according to the present application is shown. The audio quality inspection method comprises the following steps:
Step 301, acquiring dialogue audio.
In this embodiment, the specific operation of step 301 is described in detail in step 201 in the embodiment shown in fig. 2, and will not be described herein.
Step 302, inputting dialogue audio to a pre-trained human voice separation model to obtain first audio and second audio.
In this embodiment, the executing body of the audio quality inspection method (for example, the server 105 shown in fig. 1) may input the dialogue audio to a pre-trained human voice separation model to obtain the first audio and the second audio. Segmenting the audio of different roles with the human voice separation model reduces the cost of manual identification by listening. The human voice separation model may include, but is not limited to, AI (Artificial Intelligence) models such as Xvector-AHC (Xvector-Agglomerative Hierarchical Clustering, a voiceprint model combined with agglomerative hierarchical clustering), GMM (Gaussian Mixture Model), and HMM (Hidden Markov Model), and is obtained by training a neural network using a training sample set. The training samples in the training sample set here may be sample dialogue audio labeled with speakers.
In some alternative implementations of the present embodiment, the human voice separation model is Xvector-AHC. Wherein Xvector-AHC may include Xvector and AHC. The corresponding human voice separation step may include:
First, the dialogue audio is divided into a plurality of audio segments.
In general, the dialogue audio may be segmented uniformly. For example, a 10-second dialogue audio may be cut every 500 milliseconds, resulting in 20 audio segments.
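Uniform segmentation can be sketched as below, treating the audio as a flat list of samples. The 8 kHz sampling rate and the function name `split_uniform` are assumptions for the example; the text gives only the 10-second duration and the 500-millisecond segment length.

```python
def split_uniform(samples, sample_rate, segment_ms=500):
    """Split a sequence of audio samples into consecutive fixed-length segments."""
    seg_len = sample_rate * segment_ms // 1000
    if seg_len <= 0:
        raise ValueError("segment is shorter than one sample")
    # The final segment may be shorter when the length is not a multiple of seg_len.
    return [samples[i:i + seg_len] for i in range(0, len(samples), seg_len)]


# 10 seconds of silence at a hypothetical 8 kHz sampling rate.
audio = [0] * (10 * 8000)
segments = split_uniform(audio, sample_rate=8000)
print(len(segments), len(segments[0]))  # 20 4000
```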
Then, the plurality of audio segments are respectively input to Xvector to obtain the features of the plurality of audio segments.
A common Xvector network structure includes, in order, frame-level layers (frame-level), a statistics pooling layer (statistics pooling), segment-level layers (segment-level), and an activation function layer (softmax). Here, the activation function layer is removed from the trained neural network, and the Xvector features output by the segment-level layer are used as the features of the audio segment.
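Of the layers listed, the statistics pooling layer is what turns a variable number of frame-level vectors into one fixed-length vector per segment. The sketch below assumes the common formulation, in which the per-dimension mean and standard deviation over frames are concatenated; the text itself does not spell this out.

```python
import math


def statistics_pooling(frames):
    """Concatenate the per-dimension mean and standard deviation over frames."""
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    return means + stds


# Three hypothetical 2-dimensional frame-level outputs for one segment.
frames = [[1.0, 0.0], [3.0, 0.0], [5.0, 0.0]]
pooled = statistics_pooling(frames)
print(len(pooled))  # 4: two means followed by two standard deviations
```

The output length depends only on the feature dimension, not on how many frames the segment contains.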
Then, the features of the plurality of audio clips are clustered using the AHC, and the categories of the plurality of audio clips are determined based on the clustering result.
In general, hierarchical clustering can be divided into two kinds according to the direction of clustering: top-down (divisive) and bottom-up (agglomerative); AHC is the bottom-up kind. A bottom-up clustering algorithm initially treats each sample as a separate category and then merges categories step by step until only one category remains. The result is a tree structure whose root is a category containing all sample points and whose leaves are clusters of a single sample. Here, categories are merged based on a distance metric, and two categories are merged into one in each iteration. The feature of one audio segment is one sample. The root of the tree obtained by clustering the features of the plurality of audio segments with AHC has two child nodes. The features of audio segments within the same child node are similar, and the features of audio segments in different child nodes differ. Thus, the audio segments corresponding to one child node belong to one category, and the audio segments corresponding to the other child node belong to the other category.
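Bottom-up clustering into two categories can be sketched as below. Stopping the merging when two clusters remain is equivalent to reading off the two children of the root. The average-linkage rule and Euclidean distance are assumptions of this sketch, since the text says only that merging is based on a distance metric; the feature values are made up.

```python
import math


def ahc_two_clusters(features):
    """Bottom-up clustering with average linkage, stopped at two clusters.

    Returns one label (0 or 1) per input feature vector.
    """
    clusters = [[i] for i in range(len(features))]  # every sample starts alone

    def linkage(a, b):
        # Average pairwise Euclidean distance between two clusters.
        total = sum(math.dist(features[i], features[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 2:
        # Find the closest pair of clusters and merge them.
        pairs = [(x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))]
        x, y = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[x].extend(clusters[y])
        del clusters[y]

    labels = [0] * len(features)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Hypothetical 2-dimensional segment features from two alternating speakers.
feats = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9], [0.1, 0.2]]
labels = ahc_two_clusters(feats)
print(labels)
```

This naive version recomputes all pairwise linkages on every merge, which is fine for a few dozen segments but would need a cached distance matrix for long recordings.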
And finally, combining the audio clips in the same category to obtain a first audio and a second audio.
Typically, for the same category of audio segments, the audio segments may be combined in the order they were in the dialog audio, resulting in corresponding audio. The audio segments corresponding to one child node may be combined into a first audio and the audio segments corresponding to another child node may be combined into a second audio.
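Merging the segments of each category in their original order can be sketched as follows. The function name `merge_by_label` and the toy sample values are illustrative assumptions.

```python
def merge_by_label(segments, labels):
    """Concatenate segments with the same label, keeping their original order."""
    merged = {}
    for segment, label in zip(segments, labels):
        merged.setdefault(label, []).extend(segment)
    return merged


# Four toy segments in dialogue order, alternating between two categories.
segments = [[1, 2], [9, 9], [3, 4], [8, 8]]
labels = [0, 1, 0, 1]
tracks = merge_by_label(segments, labels)
first_audio, second_audio = tracks[0], tracks[1]
print(first_audio, second_audio)  # [1, 2, 3, 4] [9, 9, 8, 8]
```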
Step 303, inputting the first audio and the second audio into a pre-trained speech recognition model respectively to obtain a first text and a second text.
In this embodiment, the executing body may input the first audio and the second audio respectively into a pre-trained speech recognition model to obtain the first text and the second text. Recognizing the audio content with an end-to-end speech recognition model greatly improves the efficiency of content acquisition. The speech recognition model may include, but is not limited to, AI models such as LSTM-CTC (Long Short-Term Memory network with Connectionist Temporal Classification), GMM, and HMM, and is obtained by training a neural network using a training sample set. The training samples in the training sample set here may include sample audio and corresponding sample text.
In some alternative implementations of the present embodiment, the speech recognition model is LSTM-CTC. Among them, LSTM-CTC may include LSTM and CTC. The corresponding voice recognition step includes:
First, the first audio and the second audio are respectively input to the LSTM, and the characteristics of the first audio and the second audio are obtained.
LSTM is a recurrent neural network that mitigates the vanishing-gradient and exploding-gradient problems of ordinary recurrent neural networks. Its core idea is a channel called the cell state that runs through the whole time sequence; information is removed from or added to the cell state through structures called gates. An LSTM has three gates: the forget gate, the input gate, and the output gate.
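How the three gates act on the cell state can be seen in a single time step of a scalar LSTM cell. The sketch below uses the standard LSTM update equations, which the text does not write out; the weight layout and the name `lstm_step` are assumptions, and a real model would use vectors and learned weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    w maps each gate name to (input weight, recurrent weight, bias).
    """
    def pre(name):
        wx, wh, b = w[name]
        return wx * x + wh * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to write
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # updated hidden state
    return h, c


# With all-zero weights every gate is 0.5 and the candidate is 0,
# so the cell state is simply halved.
zero_weights = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, w=zero_weights)
print(c)  # 1.0
```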
Then, the features of the first audio and the second audio are respectively input into the CTC to obtain a first text and a second text.
The CTC is mainly used for solving the alignment problem of the input features and the output labels.
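One way to see how CTC handles alignment is its collapsing rule: per-frame labels, which include a special blank symbol, are reduced to an output sequence by merging consecutive repeats and then dropping blanks. The sketch below applies that rule to a sequence of already chosen per-frame labels (best-path decoding); the blank symbol `_` is an arbitrary choice, and the text does not say which decoding strategy the model uses.

```python
def ctc_greedy_decode(frame_labels, blank="_"):
    """Collapse repeated labels, then drop blanks (CTC best-path decoding)."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Emit a label only when it starts a new run and is not the blank.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return "".join(decoded)


print(ctc_greedy_decode(list("hh_e_ll_lo__")))  # hello
```

The blank between the two runs of `l` is what lets a genuinely doubled letter survive the collapse.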
Step 304, inputting the first text and the second text respectively into a pre-trained role determination model to obtain the role corresponding to the first text and the role corresponding to the second text, and selecting the text corresponding to customer service.
In this embodiment, the executing body may input the first text and the second text respectively into a pre-trained role determination model, obtain the role corresponding to the first text and the role corresponding to the second text, and select the text corresponding to the customer service. Judging roles with the role determination model gives better results and higher robustness than keyword matching. The role determination model may include, but is not limited to, AI models such as TextCNN (Text Convolutional Neural Network), CharCNN (Character-level Convolutional Neural Network), RCNN (Recurrent Convolutional Neural Network), Transformer, ELMo (Embeddings from Language Models, a deep contextualized word representation model), and BERT (Bidirectional Encoder Representations from Transformers), and is obtained by training a neural network using a training sample set. The training samples in the training sample set here may be sample texts labeled with roles.
Step 305, the text corresponding to the customer service is input into a pre-trained semantic classification model to obtain a quality inspection result.
In this embodiment, the execution body may input the text corresponding to the customer service into a pre-trained semantic classification model to obtain a quality inspection result. Classifying the content with the semantic classification model achieves the purpose of quality inspection. The semantic classification model may include, but is not limited to, AI models such as BERT, ELMO, TextCNN, CharCNN, RCNN, and Transformer, and is obtained by training a neural network using a training sample set. The training samples in the training sample set may include sample customer service texts labeled with quality inspection results.
In some alternative implementations of this embodiment, the semantic classification model may be BERT. BERT is a bidirectional Transformer model that captures the semantic relations within the context in fine detail and therefore yields better semantic classification results, which serves the purpose of quality inspection.
According to the audio quality inspection method provided by the embodiment of the application, the audio of different roles is first segmented by the human voice separation model, which reduces the cost of identifying speakers by manual listening; the audio content is then recognized by the end-to-end speech recognition model, which greatly improves the efficiency of content acquisition; the roles are then determined by the role determination model, which gives better results and higher robustness than keyword matching; finally, the content is classified by the semantic classification model, which achieves the purpose of quality inspection.
The embodiment of the application provides an intelligent quality inspection approach that takes AI technology as its core and uses it to replace standardized manual work. In terms of the content of quality inspection work, it monitors massive amounts of dialogue audio, scores the audio according to set rules, produces standardized analysis documents, and accurately locates the dialogue audio that has problems. All dialogue audio can be inspected in a short time with no blind spots. In terms of the working requirements placed on quality inspectors, it meets requirements such as fairness, quality control, and integration of business knowledge. Compared with manual quality inspection, intelligent quality inspection is more stable and more efficient, and it reduces the workload of quality inspectors in analyzing basic data. AI technology therefore has a clear advantage over manual quality inspectors for full-volume and real-time quality inspection.
For ease of understanding, fig. 4 shows an application scenario diagram in which an audio quality inspection method of an embodiment of the present application may be implemented. As shown in fig. 4, first, dialogue audio is input to Xvector-AHC for human voice separation, resulting in first audio and second audio. And then, respectively inputting the first audio and the second audio into the LSTM-CTC for voice recognition to obtain a first text corresponding to the first audio and a second text corresponding to the second audio. And then, respectively inputting the first text and the second text into TextCNN for role judgment, and selecting the text corresponding to the customer service. And finally, inputting the text corresponding to the customer service to the BERT for intelligent quality inspection to obtain a quality inspection result.
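The four-stage flow of this scenario can be sketched as a function that chains four pluggable components. The sketch is illustrative only: the function name, the role label, and the toy stand-in components are assumptions introduced here, and in the described scenario the components would be trained models (Xvector-AHC, LSTM-CTC, TextCNN, and BERT) rather than the placeholders shown.

```python
from typing import Any, Callable, Tuple

CUSTOMER_SERVICE = "customer_service"  # assumed role label


def inspect_dialogue(
    dialogue_audio: Any,
    separate: Callable[[Any], Tuple[Any, Any]],
    recognize: Callable[[Any], str],
    judge_role: Callable[[str], str],
    classify: Callable[[str], str],
) -> str:
    """Run separation, recognition, role determination and classification."""
    # Stage 1: human voice separation into two single-speaker audios.
    first_audio, second_audio = separate(dialogue_audio)
    # Stage 2: speech recognition on each audio.
    first_text = recognize(first_audio)
    second_text = recognize(second_audio)
    # Stage 3: role determination; with two speakers, the text not judged
    # to be customer service is assumed to belong to the customer.
    if judge_role(first_text) == CUSTOMER_SERVICE:
        service_text = first_text
    else:
        service_text = second_text
    # Stage 4: semantic classification of the customer service text.
    return classify(service_text)


# Toy stand-ins (hypothetical; real components would be trained models).
def fake_separate(audio):
    return audio["speaker_a"], audio["speaker_b"]


def fake_recognize(audio):
    return audio  # the toy "audio" is already a transcript string


def fake_judge_role(text):
    # Placeholder rule only; the embodiment uses a trained model instead.
    return CUSTOMER_SERVICE if "how may I help" in text else "customer"


def fake_classify(text):
    return "unqualified" if "shut up" in text else "qualified"


dialogue = {
    "speaker_a": "my order has not arrived",
    "speaker_b": "hello, how may I help you",
}
result = inspect_dialogue(dialogue, fake_separate, fake_recognize,
                          fake_judge_role, fake_classify)
print(result)
```

Passing the stages in as callables keeps each model replaceable, which mirrors the way the embodiment lists several alternative models for every stage.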
With further reference to fig. 5, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of an audio quality inspection apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus is particularly applicable to various electronic devices.
As shown in fig. 5, the audio quality inspection apparatus 500 of the present embodiment may include: an acquisition module 501, a separation module 502, a recognition module 503, a determination module 504, and a classification module 505. The acquisition module 501 is configured to acquire dialogue audio, wherein the dialogue audio records a dialogue between a customer and customer service; the separation module 502 is configured to perform human voice separation on the dialogue audio to obtain a first audio and a second audio, wherein the first audio and the second audio each contain only one speaker; the recognition module 503 is configured to perform speech recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio; the determination module 504 is configured to perform role determination on the first text and the second text and select the text corresponding to the customer service; and the classification module 505 is configured to perform text content semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio.
In this embodiment, for the specific processing of the acquisition module 501, the separation module 502, the recognition module 503, the determination module 504, and the classification module 505 of the audio quality inspection apparatus 500, and for the technical effects thereof, reference may be made to the descriptions of steps 201 to 205 in the embodiment corresponding to fig. 2, which are not repeated here.
In some alternative implementations of the present embodiment, the separation module 502 includes: a separation submodule configured to input the dialogue audio to a pre-trained human voice separation model to obtain the first audio and the second audio, wherein the human voice separation model comprises one of the following: voiceprint model-agglomerative hierarchical clustering Xvector-AHC, Gaussian mixture model GMM, hidden Markov model HMM.
In some alternative implementations of the present embodiment, the human voice separation model is Xvector-AHC, which includes Xvector and AHC; and the separation submodule is further configured to: divide the dialogue audio into a plurality of audio segments; input the plurality of audio segments to Xvector respectively to obtain features of the plurality of audio segments; cluster the features of the plurality of audio segments by using the AHC, and determine the categories of the plurality of audio segments based on the clustering result; and combine the audio segments of the same category to obtain the first audio and the second audio.
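The cluster-and-merge part of this flow can be sketched as follows. The sketch makes several assumptions that are not stated in the application: the segment features are toy two-dimensional vectors standing in for Xvector embeddings (feature extraction is not shown), the distance is cosine distance, the linkage is average linkage, and clustering stops when two clusters remain.

```python
import math
from typing import Dict, List, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def agglomerative_cluster(embeddings: Sequence[Sequence[float]],
                          num_clusters: int = 2) -> List[int]:
    """Bottom-up clustering: repeatedly merge the two closest clusters."""
    clusters = [[i] for i in range(len(embeddings))]  # one cluster per segment
    while len(clusters) > num_clusters:
        best = None  # (average distance, index i, index j)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [cosine_distance(embeddings[p], embeddings[q])
                         for p in clusters[i] for q in clusters[j]]
                avg = sum(dists) / len(dists)  # average linkage (assumed)
                if best is None or avg < best[0]:
                    best = (avg, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for m in members:
            labels[m] = label
    return labels


def merge_by_label(segments: Sequence[str], labels: Sequence[int]) -> List[List[str]]:
    """Combine segments of the same category, preserving their order."""
    merged: Dict[int, List[str]] = {}
    for segment, label in zip(segments, labels):
        merged.setdefault(label, []).append(segment)
    return [merged[k] for k in sorted(merged)]


# Hypothetical segments and stand-in embeddings for two alternating speakers.
segments = ["seg0", "seg1", "seg2", "seg3"]
embeddings = [(1.0, 0.1), (0.1, 1.0), (0.9, 0.2), (0.2, 0.9)]
labels = agglomerative_cluster(embeddings, num_clusters=2)
first_audio, second_audio = merge_by_label(segments, labels)
print(labels, first_audio, second_audio)
```

Fixing the number of clusters at two matches the two-speaker dialogue between a customer and customer service; a distance threshold would be an alternative stopping rule when the number of speakers is unknown.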
In some alternative implementations of the present embodiment, the recognition module 503 is further configured to: input the first audio and the second audio respectively into a pre-trained speech recognition model to obtain the first text and the second text, wherein the speech recognition model comprises one of the following: long short-term memory network-connectionist temporal classification LSTM-CTC, GMM, HMM.
In some alternative implementations of the present embodiment, the determination module 504 is further configured to: input the first text and the second text respectively into a pre-trained role determination model to obtain a role corresponding to the first text and a role corresponding to the second text, wherein the role determination model comprises one of the following: text convolutional neural network TextCNN, character-level convolutional neural network CharCNN, region-based convolutional neural network RCNN, Transformer, deep contextualized word representation model ELMO, bidirectional encoder representations from Transformers BERT.
In some alternative implementations of the present embodiment, the classification module 505 is further configured to: inputting the text corresponding to the customer service into a pre-trained semantic classification model to obtain a quality inspection result, wherein the semantic classification model comprises one of the following items: BERT, ELMO, textCNN, charCNN, RCNN, transformer.
According to embodiments of the present application, the present application also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the various methods and processes described above, such as an audio quality inspection method. For example, in some embodiments, the audio quality inspection method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by computing unit 601, one or more steps of the audio quality inspection method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the audio quality inspection method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, and which may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (11)

1. An audio quality inspection method comprising:
acquiring dialogue audio, wherein the dialogue audio records a dialogue between a client and customer service;
Dividing the dialog audio into a plurality of audio segments; inputting the plurality of audio clips to Xvector respectively to obtain characteristics of the plurality of audio clips; clustering features of the plurality of audio clips by using the AHC, and determining categories of the plurality of audio clips based on clustering results; combining audio fragments of the same category to obtain a first audio and a second audio, wherein the network structure of Xvector sequentially comprises a frame level layer, a pooling layer and a segment level layer;
Performing voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio;
performing role judgment on the first text and the second text, and selecting a text corresponding to customer service;
performing text content semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio, wherein a scoring type set and a deduction type set are preset, each scoring type in the scoring type set is marked with a corresponding score, and each deduction type in the deduction type set is marked with a corresponding deduction; performing text content semantic classification on the text corresponding to the customer service, determining at least one scoring type and at least one deduction type to which the text corresponding to the customer service belongs, calculating the difference between the sum of the scores corresponding to the at least one scoring type and the sum of the deductions corresponding to the at least one deduction type, and taking the difference as the quality inspection result of the dialogue audio.
2. The method of claim 1, wherein the performing speech recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio comprises:
inputting the first audio and the second audio into a pre-trained speech recognition model respectively to obtain the first text and the second text, wherein the speech recognition model comprises one of the following: long short-term memory network-connectionist temporal classification LSTM-CTC, GMM, HMM.
3. The method of claim 1, wherein the performing a role determination on the first text and the second text comprises:
inputting the first text and the second text respectively into a pre-trained role determination model to obtain a role corresponding to the first text and a role corresponding to the second text, wherein the role determination model comprises one of the following: text convolutional neural network TextCNN, character-level convolutional neural network CharCNN, region-based convolutional neural network RCNN, Transformer, deep contextualized word representation model ELMO, bidirectional encoder representations from Transformers BERT.
4. The method of claim 1, wherein the performing text content semantic classification on the text corresponding to the customer service to obtain the quality inspection result of the dialogue audio comprises:
Inputting the text corresponding to the customer service into a pre-trained semantic classification model to obtain the quality inspection result, wherein the semantic classification model comprises one of the following items: BERT, ELMO, textCNN, charCNN, RCNN, transformer.
5. An audio quality inspection device, comprising:
an acquisition module configured to acquire dialogue audio, wherein the dialogue audio records a dialogue between a customer and a customer service;
A separation module configured to divide the dialog audio into a plurality of audio segments; inputting the plurality of audio clips to Xvector respectively to obtain characteristics of the plurality of audio clips; clustering features of the plurality of audio clips by using the AHC, and determining categories of the plurality of audio clips based on clustering results; combining audio fragments of the same category to obtain a first audio and a second audio, wherein the network structure of Xvector sequentially comprises a frame level layer, a pooling layer and a segment level layer;
the recognition module is configured to perform voice recognition on the first audio and the second audio to obtain a first text corresponding to the first audio and a second text corresponding to the second audio;
The judging module is configured to judge the roles of the first text and the second text and select the text corresponding to the customer service;
a classification module configured to perform text content semantic classification on the text corresponding to the customer service to obtain a quality inspection result of the dialogue audio, wherein a scoring type set and a deduction type set are preset, each scoring type in the scoring type set is marked with a corresponding score, and each deduction type in the deduction type set is marked with a corresponding deduction; text content semantic classification is performed on the text corresponding to the customer service, at least one scoring type and at least one deduction type to which the text corresponding to the customer service belongs are determined, the difference between the sum of the scores corresponding to the at least one scoring type and the sum of the deductions corresponding to the at least one deduction type is calculated, and the difference is taken as the quality inspection result of the dialogue audio.
6. The apparatus of claim 5, wherein the recognition module is further configured to:
input the first audio and the second audio into a pre-trained speech recognition model respectively to obtain the first text and the second text, wherein the speech recognition model comprises one of the following: long short-term memory network-connectionist temporal classification LSTM-CTC, GMM, HMM.
7. The apparatus of claim 5, wherein the determination module is further configured to:
input the first text and the second text respectively into a pre-trained role determination model to obtain a role corresponding to the first text and a role corresponding to the second text, wherein the role determination model comprises one of the following: text convolutional neural network TextCNN, character-level convolutional neural network CharCNN, region-based convolutional neural network RCNN, Transformer, deep contextualized word representation model ELMO, bidirectional encoder representations from Transformers BERT.
8. The apparatus of claim 5, wherein the classification module is further configured to:
Inputting the text corresponding to the customer service into a pre-trained semantic classification model to obtain the quality inspection result, wherein the semantic classification model comprises one of the following items: BERT, ELMO, textCNN, charCNN, RCNN, transformer.
9. An electronic device, comprising:
At least one processor; and
A memory communicatively coupled to the at least one processor; wherein,
The memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-4.
10. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-4.
11. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-4.
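For illustration, the result calculation recited in claims 1 and 5 (the sum of the points of the matched scoring types minus the sum of the points of the matched deduction types) can be sketched as below. The type names and point values are hypothetical and are not taken from the application, and the list of predicted types stands in for the output of the semantic classification model.

```python
from typing import Dict, Iterable

# Hypothetical preset type sets with their marked point values.
SCORING_TYPES: Dict[str, int] = {
    "polite_greeting": 10,
    "problem_resolved": 20,
}
DEDUCTION_TYPES: Dict[str, int] = {
    "rude_language": 15,
    "interrupting_customer": 5,
}


def quality_score(predicted_types: Iterable[str]) -> int:
    """Sum of matched scoring points minus sum of matched deduction points."""
    predicted = list(predicted_types)
    gained = sum(SCORING_TYPES[t] for t in predicted if t in SCORING_TYPES)
    deducted = sum(DEDUCTION_TYPES[t] for t in predicted if t in DEDUCTION_TYPES)
    return gained - deducted


# Types assumed to be returned by the semantic classification model.
predicted = ["polite_greeting", "problem_resolved", "interrupting_customer"]
score = quality_score(predicted)
print(score)
```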
CN202110253354.7A 2021-03-05 2021-03-05 Audio quality inspection method, device, equipment and storage medium Active CN112966082B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110253354.7A CN112966082B (en) 2021-03-05 2021-03-05 Audio quality inspection method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112966082A CN112966082A (en) 2021-06-15
CN112966082B true CN112966082B (en) 2024-08-09

Family

ID=76276927

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110253354.7A Active CN112966082B (en) 2021-03-05 2021-03-05 Audio quality inspection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112966082B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113436625A (en) * 2021-06-25 2021-09-24 安徽淘云科技股份有限公司 Man-machine interaction method and related equipment thereof
CN113628627B (en) * 2021-08-11 2022-06-14 广东电网有限责任公司广州供电局 Electric power industry customer service quality inspection system based on structured voice analysis
CN113709313B (en) * 2021-08-31 2022-10-25 平安科技(深圳)有限公司 Intelligent quality inspection method, device, equipment and medium for customer service call data
CN113836344A (en) * 2021-09-30 2021-12-24 广州艾美网络科技有限公司 Personalized song file generation method and device and music singing equipment
CN114299957A (en) * 2021-11-29 2022-04-08 北京百度网讯科技有限公司 Voiceprint separation method and device, electronic equipment and storage medium
CN114866644A (en) * 2022-05-13 2022-08-05 上海华客信息科技有限公司 Recording separation-based guest room service method, system, equipment and storage medium
CN115063155B (en) * 2022-06-25 2024-05-24 平安银行股份有限公司 Data labeling method, device, computer equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109327632A (en) * 2018-11-23 2019-02-12 深圳前海微众银行股份有限公司 Intelligent quality inspection system, method and the computer readable storage medium of customer service recording
CN110335621A (en) * 2019-05-28 2019-10-15 深圳追一科技有限公司 Method, system and the relevant device of audio processing
CN112364661A (en) * 2020-11-11 2021-02-12 北京大米科技有限公司 Data detection method and device, readable storage medium and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180197548A1 (en) * 2017-01-09 2018-07-12 Onu Technology Inc. System and method for diarization of speech, automated generation of transcripts, and automatic information extraction
US10147438B2 (en) * 2017-03-02 2018-12-04 International Business Machines Corporation Role modeling in call centers and work centers
CN111368130A (en) * 2020-02-26 2020-07-03 深圳前海微众银行股份有限公司 Quality inspection method, device and equipment for customer service recording and storage medium
CN111681672A (en) * 2020-05-26 2020-09-18 深圳壹账通智能科技有限公司 Voice data detection method and device, computer equipment and storage medium
CN111709630A (en) * 2020-06-08 2020-09-25 深圳乐信软件技术有限公司 Voice quality inspection method, device, equipment and storage medium
CN112420069A (en) * 2020-11-18 2021-02-26 北京云从科技有限公司 Voice processing method, device, machine readable medium and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant