CN114694678A

CN114694678A - Sound quality detection model training method, sound quality detection method, electronic device, and medium

Info

Publication number: CN114694678A
Application number: CN202210333127.XA
Authority: CN
Inventors: 陈洲旋
Original assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Current assignee: Tencent Music Entertainment Technology Shenzhen Co Ltd
Priority date: 2022-03-31
Filing date: 2022-03-31
Publication date: 2022-07-01

Abstract

The present application discloses a sound quality detection model training method, a sound quality detection method, equipment and a computer-readable storage medium. The sound quality detection model training method includes: obtaining initial training audio and corresponding average opinion score labels; Interference filtering processing of endpoint detection to obtain training audio; feature extraction processing of training audio to obtain training features; input training features to the initial model to obtain the corresponding training sound quality detection results; use the training sound quality detection results and the average opinion score label to calculate loss value, and use the loss value to adjust the model parameters of the initial model; when it is detected that the training completion conditions are met, the adjusted initial model is determined as the sound quality detection model; the obtained sound quality detection model can be used without pure and clean audio. , to accurately assess the quality of the audio under test.

Description

Sound quality detection model training method, sound quality detection method, electronic equipment and medium

技术领域technical field

本申请涉及音频处理技术领域，特别涉及音质检测模型训练方法、音质检测方法、电子设备及计算机可读存储介质。The present application relates to the technical field of audio processing, and in particular, to a training method for a sound quality detection model, a sound quality detection method, an electronic device, and a computer-readable storage medium.

背景技术Background technique

当前，通常采用PESQ(即perceptual evaluation of speech quality，语音质量感知评估)的方法对音频音质进行检测，得到表征音质优劣的检测结果。PESQ方法通常针对VOIP(Voice over Internet Protocol，基于IP的语音传输)网络通信的音频，能够评估音频信号在网络传输过程中由于丢帧、抖动等影响带来的音频时间不对齐、频谱失真等问题。计算PESQ得分需要准备含噪音频对应的纯干净音频，这种音质评估方法就称为有参考音质评估。而在应用中，纯干净音频难以获得，使得大部分音频的音质检测难以进行。At present, the method of PESQ (that is, perceptual evaluation of speech quality, perceptual evaluation of speech quality) is usually used to detect the audio quality, and a detection result representing the quality of the sound quality is obtained. The PESQ method is usually aimed at the audio of VOIP (Voice over Internet Protocol, voice over IP) network communication, and can evaluate the audio time misalignment and spectrum distortion caused by the influence of frame loss and jitter during the network transmission of the audio signal. . To calculate the PESQ score, it is necessary to prepare pure and clean audio corresponding to the noise-containing frequency. This sound quality evaluation method is called the reference sound quality evaluation. In applications, pure and clean audio is difficult to obtain, making it difficult to detect the sound quality of most audios.

发明内容SUMMARY OF THE INVENTION

有鉴于此，本申请的目的在于提供音质检测模型训练、音质检测方法、电子设备及计算机可读存储介质，对待测音频的质量进行准确评估。In view of this, the purpose of the present application is to provide a sound quality detection model training, a sound quality detection method, an electronic device and a computer-readable storage medium, so as to accurately evaluate the quality of the audio to be tested.

为解决上述技术问题，第一方面，本申请提供了一种音质检测模型训练方法，包括：In order to solve the above technical problems, in the first aspect, the present application provides a method for training a sound quality detection model, including:

获取初始训练音频和对应的平均意见得分标签；所述平均意见得分标签用于表征多个评价对象对所述初始训练音频进行音质评估后得到的平均音质评价参数；Obtaining the initial training audio and the corresponding average opinion score label; the average opinion score label is used to represent the average sound quality evaluation parameter obtained after a plurality of evaluation objects perform sound quality evaluation on the initial training audio;

对所述初始训练音频进行基于语音端点检测的非人声滤除处理，得到训练音频；Carrying out the non-human voice filtering process based on voice endpoint detection to the initial training audio to obtain the training audio;

对所述训练音频进行特征提取处理，得到训练特征；Perform feature extraction processing on the training audio to obtain training features;

将所述训练特征输入初始模型，得到对应的训练音质检测结果；Input the training feature into the initial model to obtain the corresponding training sound quality detection result;

利用所述训练音质检测结果与所述平均意见得分标签计算损失值，并利用所述损失值调节所述初始模型的模型参数；Calculate a loss value using the training sound quality detection result and the average opinion score label, and use the loss value to adjust the model parameters of the initial model;

当检测到满足训练完成条件，则将调节后的所述初始模型确定为音质检测模型。When it is detected that the training completion condition is satisfied, the adjusted initial model is determined as a sound quality detection model.

可选地，所述平均意见得分标签的获取过程，包括：Optionally, the process of obtaining the average opinion score label includes:

向各个所述评价对象播放所述初始训练音频；playing the initial training audio to each of the evaluation objects;

接收各个所述评价对象对所述初始训练音频进行音质评估后得到的初始音质数据；receiving initial sound quality data obtained after each of the evaluation objects performs sound quality evaluation on the initial training audio;

利用各个所述初始音质数据生成所述平均意见得分标签。The mean opinion score label is generated using each of the initial voice quality data.

可选地，所述利用各个所述初始音质数据生成所述平均意见得分标签，包括：Optionally, generating the average opinion score label using each of the initial sound quality data includes:

对各个所述初始音质数据进行平均处理，得到第一得分标签；Perform averaging processing on each of the initial sound quality data to obtain a first score label;

将所述初始训练音频输入音频缺陷检测模型，得到缺陷检测结果；其中，所述音频缺陷检测模型用于检测能够影响听觉感受的音频缺陷；Inputting the initial training audio into an audio defect detection model to obtain a defect detection result; wherein the audio defect detection model is used to detect audio defects that can affect auditory experience;

基于所述缺陷检测结果，生成第二得分标签；generating a second score label based on the defect detection result;

利用所述第一得分标签和所述第二得分标签生成所述平均意见得分标签。The average opinion score label is generated using the first score label and the second score label.

可选地，所述对所述训练音频进行特征提取处理，得到训练特征，包括：Optionally, performing feature extraction processing on the training audio to obtain training features, including:

基于人耳可感知最大采样率，对所述训练音频进行重采样，得到中间数据；Based on the maximum sampling rate perceivable by the human ear, the training audio is resampled to obtain intermediate data;

对所述中间数据进行基于预设窗长的滑窗分帧，得到多个音频帧；Performing sliding window framing based on a preset window length on the intermediate data to obtain a plurality of audio frames;

对各个所述音频帧进行特征提取处理，得到所述训练特征。Feature extraction processing is performed on each of the audio frames to obtain the training features.

可选地，所述对所述初始训练音频进行基于语音端点检测的非人声滤除处理，得到训练音频，包括：Optionally, performing non-human voice filtering processing based on voice endpoint detection on the initial training audio to obtain training audio, including:

对所述初始训练音频进行语音端点检测，得到语音端点时刻；The voice endpoint detection is carried out to the initial training audio, and the voice endpoint time is obtained;

根据所述语音端点时刻对所述初始训练音频进行切分，得到多个音频段，并将所述音频段中的非人声音频段去除，得到人声音频段；According to the voice endpoint time, the initial training audio is segmented to obtain a plurality of audio segments, and the non-human voice frequency range in the audio segment is removed to obtain the human voice frequency range;

将所述人声音频段拼接得到所述训练音频。The training audio is obtained by splicing the human voice frequency bands.

可选地，所述初始模型包括卷积神经网络、长短期记忆网络、全连接层和平均池化层；Optionally, the initial model includes a convolutional neural network, a long short-term memory network, a fully connected layer and an average pooling layer;

所述将所述训练特征输入初始模型，得到对应的训练音质检测结果，包括：The described training features are input into the initial model to obtain corresponding training sound quality detection results, including:

将所述训练特征输入所述卷积神经网络，得到训练中间特征；Inputting the training features into the convolutional neural network to obtain training intermediate features;

将所述训练中间特征输入所述长短期记忆网络，得到训练初始检测结果；Inputting the training intermediate feature into the long short-term memory network to obtain the training initial detection result;

将所述训练初始检测结果输入所述全连接层，得到训练中间检测结果；Inputting the training initial detection result into the fully connected layer to obtain the training intermediate detection result;

将所述训练中间检测结果输入所述平均池化层，得到所述训练音质检测结果。The training intermediate detection result is input into the average pooling layer to obtain the training sound quality detection result.

可选地，所述获取初始训练音频和对应的平均意见得分标签，包括：Optionally, the obtaining of the initial training audio and the corresponding average opinion score label includes:

按照预设批次大小，从训练数据集中获取一个批次的多个所述初始训练音频和所述平均意见得分标签；According to a preset batch size, obtain a batch of a plurality of the initial training audio and the average opinion score label from the training data set;

相应的，所述利用所述训练音质检测结果与所述平均意见得分标签计算损失值，包括：Correspondingly, calculating the loss value using the training sound quality detection result and the average opinion score label includes:

当得到一个批次对应的所有所述初始训练音频对应的所述训练音质检测结果时，利用该批次内的所述训练音质检测结果、平均意见得分标签和训练中间检测结果得到所述损失值。When the training sound quality detection results corresponding to all the initial training audios corresponding to a batch are obtained, the loss value is obtained by using the training sound quality detection results, the average opinion score label and the training intermediate detection results in the batch .

可选地，所述利用该批次内的所述训练音质检测结果、平均意见得分标签和训练中间检测结果得到所述损失值，包括：Optionally, the loss value is obtained by using the training sound quality detection result, the average opinion score label and the training intermediate detection result in the batch, including:

根据according to

得到所述损失值；obtain the loss value;

其中，所述loss为所述损失值，所述S为所述预设批次大小，所述T_S为所述训练特征对应的帧数，所述M_S为所述平均意见得分标签，所述

为所述训练音质检测结果，所述

为所述训练特征中的第t帧在所述训练中间检测结果中对应的值，所述α为预设权重。Wherein, the loss is the loss value, the S is the preset batch size, the T _S is the number of frames corresponding to the training feature, the M _S is the average opinion score label, and the stated

For the training sound quality detection result, the

is the value corresponding to the t-th frame in the training feature in the intermediate detection result of the training, and the α is a preset weight.

可选地，所述检测到满足训练完成条件，包括：Optionally, it is detected that the training completion condition is met, including:

判断所述损失值是否小于预设阈值；judging whether the loss value is less than a preset threshold;

若是，则确定满足训练完成条件。If so, it is determined that the training completion condition is satisfied.

第二方面，本申请还提供了一种音质检测方法，包括：In a second aspect, the present application also provides a sound quality detection method, including:

获取初始待测音频；Get the initial audio to be tested;

对所述初始待测音频进行基于语音端点检测的非人声滤除处理，得到待测音频；Carrying out non-human voice filtering processing based on voice endpoint detection to the initial audio to be tested to obtain audio to be tested;

对所述待测音频进行特征提取处理，得到待测特征；Perform feature extraction processing on the audio to be tested to obtain features to be tested;

将所述待测特征输入音质检测模型，得到所述初始待测音频对应的音质检测结果。The feature to be tested is input into a sound quality detection model to obtain a sound quality detection result corresponding to the initial to-be-tested audio.

可选地，所述音质检测模型包括卷积神经网络、长短期记忆网络、全连接层和平均池化层；Optionally, the sound quality detection model includes a convolutional neural network, a long short-term memory network, a fully connected layer and an average pooling layer;

所述将所述待测特征输入音质检测模型，得到所述初始待测音频对应的音质检测结果，包括：The inputting the feature to be measured into the sound quality detection model to obtain the sound quality detection result corresponding to the initial to-be-measured audio, including:

将所述待测特征输入所述卷积神经网络，得到待测中间特征；Inputting the feature to be tested into the convolutional neural network to obtain an intermediate feature to be tested;

将所述待测中间特征输入所述长短期记忆网络，得到初始检测结果；Inputting the intermediate feature to be tested into the long short-term memory network to obtain an initial detection result;

将所述初始检测结果输入所述全连接层，得到中间检测结果；Inputting the initial detection result into the fully connected layer to obtain an intermediate detection result;

将所述中间检测结果输入所述平均池化层，得到所述音质检测结果。The intermediate detection result is input into the average pooling layer to obtain the sound quality detection result.

第三方面，本申请还提供了一种电子设备，包括存储器和处理器，其中：In a third aspect, the present application also provides an electronic device, including a memory and a processor, wherein:

所述存储器，用于保存计算机程序；the memory for storing computer programs;

所述处理器，用于执行所述计算机程序，以实现上述的音质检测模型训练方法，和/或，上述的音质检测方法。The processor is configured to execute the computer program to implement the above-mentioned method for training a sound quality detection model, and/or the above-mentioned method for sound quality detection.

第四方面，本申请还提供了一种计算机可读存储介质，用于保存计算机程序，其中，所述计算机程序被处理器执行时实现上述的音质检测模型训练方法，和/或，上述的音质检测方法。In a fourth aspect, the present application further provides a computer-readable storage medium for storing a computer program, wherein, when the computer program is executed by a processor, the above-mentioned training method for a sound quality detection model is implemented, and/or the above-mentioned sound quality Detection method.

本申请提供的音质检测模型训练方法，获取初始训练音频和对应的平均意见得分标签；对初始训练音频进行基于语音端点检测的干扰滤除处理，得到训练音频；对训练音频进行特征提取处理，得到训练特征；将训练特征输入初始模型，得到对应的训练音质检测结果；利用训练音质检测结果与平均意见得分标签计算损失值，并利用损失值调节初始模型的模型参数；当检测到满足训练完成条件，则将调节后的初始模型确定为音质检测模型。The training method for a sound quality detection model provided by this application is to obtain initial training audio and corresponding average opinion score labels; perform interference filtering processing based on voice endpoint detection on the initial training audio to obtain training audio; perform feature extraction processing on the training audio to obtain Training features; input the training features into the initial model to obtain the corresponding training sound quality detection results; use the training sound quality detection results and the average opinion score label to calculate the loss value, and use the loss value to adjust the model parameters of the initial model; when it is detected that the training completion conditions are met , the adjusted initial model is determined as the sound quality detection model.

可见，该方法利用平均意见得分(mean opinion score，MOS)作为初始训练音频以及训练音频的标签。平均意见得分由大量听众来评估通过通讯电路传输的，由男性或女性说话人大声朗读句子音频的质量。听众按以下标准给每个句子打分：(1)很差(2)差(3)一般(4)好(5)很好，MOS是所有听众个人打分的算术方法，范围从1(最差)到5(最好)。得到的MOS标签能够准确表征初始训练音频中句子音频的质量。通过语音端点检测，可以将非人声音频的部分作为干扰并滤除，仅保留人声的部分作为训练音频。通过特征提取得到训练特征后，利用初始模型进行质量检测，并利用得到的训练音质检测结果和平均意见得分标签计算损失值，利用损失值调节初始模型，使得初始模型学习正确评价音频质量的方式，得到的训练音质检测结果尽可能靠近MOS标签。在训练完成后，则可以将调节后的初始模型确定为音质检测模型。得到的音质检测模型能够在不具有纯干净音频的情况下，对待测音频的质量进行准确评估。It can be seen that the method utilizes the mean opinion score (MOS) as the initial training audio and the label of the training audio. Mean Opinion Scores were assessed by a large audience for the quality of audio of sentences read aloud by a male or female speaker, transmitted over a communication circuit. Listeners rated each sentence on the following scale: (1) very poor (2) poor (3) fair (4) good (5) very good, MOS is an arithmetic measure of individual scores for all listeners, ranging from 1 (worst) to 5 (best). The resulting MOS labels can accurately characterize the quality of the sentence audio in the initial training audio. Through speech endpoint detection, the part of non-human voice audio can be used as interference and filtered, and only the part of human voice can be retained as training audio. After the training features are obtained through feature extraction, the initial model is used for quality detection, and the loss value is calculated using the obtained training sound quality detection results and the average opinion score label. The obtained training sound quality detection results are as close as possible to the MOS labels. After the training is completed, the adjusted initial model can be determined as a sound quality detection model. The resulting sound quality detection model can accurately evaluate the quality of the audio under test without having pure clean audio.

此外，本申请还提供了一种音质检测方法、电子设备及计算机可读存储介质，同样具有上述有益效果。In addition, the present application also provides a sound quality detection method, an electronic device and a computer-readable storage medium, which also have the above beneficial effects.

附图说明Description of drawings

为了更清楚地说明本申请实施例或现有技术中的技术方案，下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍，显而易见地，下面描述中的附图仅仅是本申请的实施例，对于本领域普通技术人员来讲，在不付出创造性劳动的前提下，还可以根据提供的附图获得其他的附图。In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the following briefly introduces the accompanying drawings that are used in the description of the embodiments or the prior art. Obviously, the drawings in the following description are only It is an embodiment of the present application. For those of ordinary skill in the art, other drawings can also be obtained according to the provided drawings without any creative effort.

图1为本申请实施例提供的一种音质检测模型训练方法所适用的硬件组成框架示意图；1 is a schematic diagram of a hardware composition framework to which a sound quality detection model training method provided by an embodiment of the present application is applicable;

图2为本申请实施例提供的另一种音质检测模型训练方法所适用的硬件组成框架示意图；2 is a schematic diagram of a hardware composition framework to which another sound quality detection model training method provided by an embodiment of the present application is applicable;

图3为本申请实施例提供的一种音质检测模型训练方法流程图；3 is a flowchart of a method for training a sound quality detection model provided by an embodiment of the present application;

图4为本申请实施例提供的一种音质检测方法流程图；4 is a flowchart of a sound quality detection method provided by an embodiment of the present application;

图5为本申请实施例提供的另一种音质检测方法流程图。FIG. 5 is a flowchart of another sound quality detection method provided by an embodiment of the present application.

具体实施方式Detailed ways

为使本申请实施例的目的、技术方案和优点更加清楚，下面将结合本申请实施例中的附图，对本申请实施例中的技术方案进行清楚、完整地描述，显然，所描述的实施例仅仅是本申请一部分实施例，而不是全部的实施例。基于本申请中的实施例，本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例，都属于本申请保护的范围。In order to make the purposes, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments It is only a part of the embodiments of the present application, but not all of the embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present application.

为了便于理解，先对本申请实施例提供的音质检测模型训练方法，和/或，音质检测方法对应的方案所使用的硬件组成框架进行介绍。请参考图1，图1为本申请实施例提供的一种音质检测模型训练方法所适用的硬件组成框架示意图。其中电子设备100可以包括处理器101和存储器102，还可以进一步包括多媒体组件103、信息输入/信息输出(I/O)接口104以及通信组件105中的一种或多种。For ease of understanding, the method for training a sound quality detection model provided by the embodiments of the present application and/or the hardware composition framework used in the solution corresponding to the sound quality detection method are introduced. Please refer to FIG. 1 , which is a schematic diagram of a hardware composition framework to which a method for training a sound quality detection model provided by an embodiment of the present application is applicable. The electronic device 100 may include a processor 101 and a memory 102 , and may further include one or more of a multimedia component 103 , an information input/information output (I/O) interface 104 and a communication component 105 .

其中，处理器101用于控制电子设备100的整体操作，以完成音质检测模型训练方法，和/或，音质检测方法中的全部或部分步骤；存储器102用于存储各种类型的数据以支持在电子设备100的操作，这些数据例如可以包括用于在该电子设备100上操作的任何应用程序或方法的指令，以及应用程序相关的数据。该存储器102可以由任何类型的易失性或非易失性存储设备或者它们的组合实现，例如静态随机存取存储器(Static Random AccessMemory，SRAM)、电可擦除可编程只读存储器(Electrically Erasable ProgrammableRead-Only Memory，EEPROM)、可擦除可编程只读存储器(Erasable Programmable Read-Only Memory，EPROM)、可编程只读存储器(Programmable Read-Only Memory，PROM)、只读存储器(Read-Only Memory，ROM)、磁存储器、快闪存储器、磁盘或光盘中的一种或多种。在本实施例中，存储器102中至少存储有用于实现以下功能的程序和/或数据：Wherein, the processor 101 is used to control the overall operation of the electronic device 100 to complete the training method of the sound quality detection model, and/or all or part of the steps in the sound quality detection method; the memory 102 is used to store various types of data to support Operation of the electronic device 100, such data may include, for example, instructions for any application or method operating on the electronic device 100, as well as application-related data. The memory 102 may be implemented by any type of volatile or non-volatile memory device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (Erasable) Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (Read-Only Memory) , ROM), magnetic memory, flash memory, magnetic disk or optical disk one or more. In this embodiment, the memory 102 stores at least programs and/or data for implementing the following functions:

获取初始训练音频和对应的平均意见得分标签；Get the initial training audio and the corresponding mean opinion score label;

对初始训练音频进行基于语音端点检测的非人声滤除处理，得到训练音频；Perform non-human voice filtering processing based on voice endpoint detection on the initial training audio to obtain training audio;

对训练音频进行特征提取处理，得到训练特征；Perform feature extraction processing on the training audio to obtain training features;

将训练特征输入初始模型，得到对应的训练音质检测结果；Input the training features into the initial model to obtain the corresponding training sound quality detection results;

利用训练音质检测结果与平均意见得分标签计算损失值，并利用损失值调节初始模型的模型参数；Use the training sound quality detection results and the average opinion score label to calculate the loss value, and use the loss value to adjust the model parameters of the initial model;

当检测到满足训练完成条件，则将调节后的初始模型确定为音质检测模型。When it is detected that the training completion condition is satisfied, the adjusted initial model is determined as the sound quality detection model.

和/或，and / or,

获取初始待测音频；Get the initial audio to be tested;

对初始待测音频进行基于语音端点检测的非人声滤除处理，得到待测音频；Perform non-human voice filtering processing based on voice endpoint detection on the initial audio to be tested to obtain audio to be tested;

对待测音频进行特征提取处理，得到待测特征；Perform feature extraction processing on the audio to be tested to obtain features to be tested;

将待测特征输入音质检测模型，得到初始待测音频对应的音质检测结果。Input the feature to be tested into the sound quality detection model, and obtain the sound quality detection result corresponding to the audio to be initially tested.

多媒体组件103可以包括屏幕和音频组件。其中屏幕例如可以是触摸屏，音频组件用于输出和/或输入音频信号。例如，音频组件可以包括一个麦克风，麦克风用于接收外部音频信号。所接收的音频信号可以被进一步存储在存储器102或通过通信组件105发送。音频组件还包括至少一个扬声器，用于输出音频信号。I/O接口104为处理器101和其他接口模块之间提供接口，上述其他接口模块可以是键盘，鼠标，按钮等。这些按钮可以是虚拟按钮或者实体按钮。通信组件105用于电子设备100与其他设备之间进行有线或无线通信。无线通信，例如Wi-Fi，蓝牙，近场通信(Near Field Communication，简称NFC)，2G、3G或4G，或它们中的一种或几种的组合，因此相应的该通信组件105可以包括：Wi-Fi部件，蓝牙部件，NFC部件。Multimedia components 103 may include screen and audio components. Wherein the screen can be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may be further stored in the memory 102 or transmitted through the communication component 105 . The audio assembly also includes at least one speaker for outputting audio signals. The I/O interface 104 provides an interface between the processor 101 and other interface modules, and the above-mentioned other interface modules may be a keyboard, a mouse, a button, and the like. These buttons can be virtual buttons or physical buttons. The communication component 105 is used for wired or wireless communication between the electronic device 100 and other devices. Wireless communication, such as Wi-Fi, Bluetooth, Near Field Communication (NFC for short), 2G, 3G or 4G, or one or a combination of them, so the corresponding communication component 105 may include: Wi-Fi parts, Bluetooth parts, NFC parts.

电子设备100可以被一个或多个应用专用集成电路(Application SpecificIntegrated Circuit，简称ASIC)、数字信号处理器(Digital Signal Processor，简称DSP)、数字信号处理设备(Digital Signal Processing Device，简称DSPD)、可编程逻辑器件(Programmable Logic Device，简称PLD)、现场可编程门阵列(Field ProgrammableGate Array，简称FPGA)、控制器、微控制器、微处理器或其他电子元件实现，用于执行音质检测模型训练方法。The electronic device 100 may be implemented by one or more Application Specific Integrated Circuit (ASIC for short), Digital Signal Processor (DSP for short), Digital Signal Processing Device (DSPD for short), or Programmable Logic Device (PLD for short), Field Programmable Gate Array (FPGA for short), controller, microcontroller, microprocessor or other electronic components to implement, used to perform sound quality detection model training method .

当然，图1所示的电子设备100的结构并不构成对本申请实施例中电子设备的限定，在实际应用中电子设备100可以包括比图1所示的更多或更少的部件，或者组合某些部件。Of course, the structure of the electronic device 100 shown in FIG. 1 does not constitute a limitation on the electronic device in the embodiments of the present application. In practical applications, the electronic device 100 may include more or less components than those shown in FIG. 1 , or a combination of certain parts.

可以理解的是，本申请实施例中并不对电子设备的数量进行限定，其可以是多个电子设备共同协作完成音质检测模型训练方法，和/或，音质检测方法。在一种可能的实施方式中，请参考图2，图2为本申请实施例提供的另一种音质检测模型训练方法所适用的硬件组成框架示意图。由图2可知，该硬件组成框架可以包括：第一电子设备11和第二电子设备12，二者之间通过网络13连接。It can be understood that the embodiment of the present application does not limit the number of electronic devices, which may be a method for training a sound quality detection model and/or a method for sound quality detection jointly completed by multiple electronic devices. In a possible implementation, please refer to FIG. 2 , which is a schematic diagram of a hardware composition framework to which another sound quality detection model training method provided by an embodiment of the present application is applicable. As can be seen from FIG. 2 , the hardware composition framework may include: a first electronic device 11 and a second electronic device 12 , which are connected through a network 13 .

在本申请实施例中，第一电子设备11与第二电子设备12的硬件结构可以参考图1中电子设备100。即可以理解为本实施例中具有两个电子设备100，两者进行数据交互。进一步，本申请实施例中并不对网络13的形式进行限定，即，网络13可以是无线网络(如WIFI、蓝牙等)，也可以是有线网络。In this embodiment of the present application, the hardware structures of the first electronic device 11 and the second electronic device 12 may refer to the electronic device 100 in FIG. 1 . That is, it can be understood that there are two electronic devices 100 in this embodiment, and the two perform data interaction. Further, the embodiment of the present application does not limit the form of the network 13, that is, the network 13 may be a wireless network (such as WIFI, Bluetooth, etc.) or a wired network.

其中，第一电子设备11和第二电子设备12可以是同一种电子设备，如第一电子设备11和第二电子设备12均为服务器；也可以是不同类型的电子设备，例如，第一电子设备11可以是智能手机或其它智能终端，第二电子设备12可以是服务器。在一种可能的实施方式中，可以利用计算能力强的服务器作为第二电子设备12来提高数据处理效率及可靠性，进而提高音质检测模型训练的处理效率。同时利用成本低，应用范围广的智能手机作为第一电子设备11，用于实现第二电子设备12与用户之间的交互。可以理解的是，该交互过程可以为：用户在智能手机上获取初始训练音频，并给出对应的平均意见得分标签，智能手机将初始训练音频与平均意见得分标签发送至服务器，由服务器利用初始训练音频和MOS标签训练得到音质检测模型。服务器将音质检测模型发送至智能手机，以便在智能手机上进行音质检测。The first electronic device 11 and the second electronic device 12 can be the same electronic device, for example, the first electronic device 11 and the second electronic device 12 are both servers; they can also be different types of electronic devices, for example, the first electronic device 11 and the second electronic device 12 are servers. The device 11 may be a smart phone or other smart terminals, and the second electronic device 12 may be a server. In a possible implementation, a server with strong computing power may be used as the second electronic device 12 to improve data processing efficiency and reliability, thereby improving the processing efficiency of the training of the sound quality detection model. At the same time, a smartphone with low cost and wide application range is used as the first electronic device 11 to realize the interaction between the second electronic device 12 and the user. It can be understood that the interaction process can be as follows: the user obtains the initial training audio on the smartphone and gives the corresponding average opinion score label, the smartphone sends the initial training audio and the average opinion score label to the server, and the server uses the initial training audio and the average opinion score label. Training audio and MOS label training to obtain a sound quality detection model. The server sends the sound quality detection model to the smartphone for sound quality detection on the smartphone.

或者，音质检测模型在服务器上部署，智能手机可以与用户交互，获取初始待测音频，并将其发送至服务器。服务器利用音质检测模型对初始待测音频进行检测，得到对应的音质检测结果，并将音质检测结果发送至智能手机输出给用户。Alternatively, the sound quality detection model is deployed on the server, and the smartphone can interact with the user to obtain the initial audio under test and send it to the server. The server uses the sound quality detection model to detect the initial audio to be tested, obtains the corresponding sound quality detection result, and sends the sound quality detection result to the smartphone for output to the user.

请参考图3，图3为本申请实施例提供的一种音质检测模型训练方法的一种流程示意图。该实施例中的方法包括：Please refer to FIG. 3 , which is a schematic flowchart of a method for training a sound quality detection model according to an embodiment of the present application. The method in this embodiment includes:

S101：获取初始训练音频和对应的平均意见得分标签。S101: Obtain the initial training audio and the corresponding average opinion score label.

平均意见得分，即Mean Opinion Score，MOS，由大量听众来评估通过通讯电路传输的，由男性或女性说话人大声朗读句子音频的质量。听众按以下标准给每个句子打分：(1)很差(2)差(3)一般(4)好(5)很好，MOS是所有听众个人打分的算术方法，范围从1(最差)到5(最好)。这种评价方式被广泛应用于音频质量主观评价中，然而，主观评价的过程费时费力。因此，PESQ(perceptual evaluation of speech quality，语音质量感知评估)方法被广泛应用与音频质量的自动检测(或称为客观评价)，以提高音频质量检测的效率。但是，PESQ为有参考的音质评估，在具有待测音频的纯干净音频(可视为无损音频)时才能对待测音频进行检测，使得检测范围有限。The Mean Opinion Score, or Mean Opinion Score, MOS, is used by a large audience to assess the quality of audio of sentences read aloud by male or female speakers, transmitted over a communication circuit. Listeners rated each sentence on the following scale: (1) very poor (2) poor (3) fair (4) good (5) very good, MOS is an arithmetic measure of individual scores for all listeners, ranging from 1 (worst) to 5 (best). This evaluation method is widely used in the subjective evaluation of audio quality, however, the process of subjective evaluation is time-consuming and labor-intensive. Therefore, the PESQ (perceptual evaluation of speech quality, perceptual evaluation of speech quality) method is widely used in the automatic detection of audio quality (or called objective evaluation) to improve the efficiency of audio quality detection. However, PESQ is a reference sound quality evaluation, and the audio to be tested can only be detected when there is pure and clean audio of the audio to be tested (which can be regarded as lossless audio), which makes the detection range limited.

本申请中，利用已有的初始训练音频和对应的MOS标签组成训练数据，训练得到音质检测模型，以便在没有纯干净音频的情况下仍能进行自动音质检测，提高检测效率，扩大检测范围。具体的，初始训练音频，是指直接获取到的训练数据，其中可以具有依据或多句语音。平均意见得分标签，是指人工对初始训练音频中的语音部分进行MOS打分后得到的MOS分值标签，是用于表征多个评价对象对初始训练音频进行音质评估后得到的平均音质评价参数。初始训练音频和对应的MOS标签可以预先准备好，或者可以在需要训练音质检测模型时，临时选取初始训练音频并由用户对其进行MOS打分，得到对应的MOS标签。In this application, the existing initial training audio and corresponding MOS tags are used to form training data, and a sound quality detection model is obtained by training, so that automatic sound quality detection can be performed without pure and clean audio, so as to improve the detection efficiency and expand the detection range. Specifically, the initial training audio refers to directly obtained training data, which may have evidence or multiple speech sentences. The average opinion score label refers to the MOS score label obtained by manually MOS scoring the speech part in the initial training audio, and is used to represent the average sound quality evaluation parameter obtained by multiple evaluation objects after the sound quality evaluation of the initial training audio. The initial training audio and the corresponding MOS label can be prepared in advance, or when the sound quality detection model needs to be trained, the initial training audio can be temporarily selected and the user can score it in MOS to obtain the corresponding MOS label.

在一种实施方式中，平均意见得分标签可以在获取时生成。具体的，可以向各个评价对象播放初始训练音频，例如可以向各个评价人员使用的电子设备发送初始训练音频并发送播放控制指令。接收各个评价对象对初始训练音频进行音质评估后得到的初始音质数据，初始音质数据可以由各个评价人员使用的电子设备生成并发送。利用各个初始音质数据生成平均意见得分标签。In one embodiment, the mean opinion score label may be generated at acquisition time. Specifically, the initial training audio can be played to each evaluation object, for example, the initial training audio and a playback control instruction can be sent to the electronic device used by each evaluator. Receive initial sound quality data obtained after each evaluation object performs sound quality evaluation on the initial training audio, and the initial sound quality data may be generated and sent by electronic devices used by each evaluator. A mean opinion score label is generated using each of the initial sound quality data.

本实施例并不限定生成平均意见得分的具体方式，例如可以对各个初始音质数据进行平均计算。在另一种实施方式中，为了提高平均意见得分标签的可靠性，可以首先对各个初始音质数据进行平均处理，得到第一得分标签。此外，还可以将初始训练音频输入音频缺陷检测模型，得到缺陷检测结果；音频缺陷检测模型用于检测能够影响听觉感受的音频缺陷，音频缺陷的类型可以根据需要设定，例如可以包括喷麦声(由于嘴部距离麦克风距离太近而生成，表现为po声、hu声、呼噜声等)、嘶嘶声/齿音(表现为嘶si声、吃chi声)、电流声(由硬件电路异常引起的噪声，表现为滋滋滋/嗡嗡嗡的声音)、咔咔声(表现为刺啦、click、卡顿声)、爆音(称为clip、由于声音爆掉、削波等产生，人声和伴奏混音后也容易产生)、环境噪声、卡顿(表现为人声的短时中断，前后衔接不佳，或明显的网络传输吞音)。音频缺陷检测模型能够识别初始训练音频中的音频缺陷，具体可以为音频缺陷的类型数量、每类音频缺陷出现的次数等。基于缺陷检测结果，可以生成第二得分标签，在一种实施方式中，可以设置满分为5分，并根据缺陷检测结果酌情扣分，得到第二得分标签，并利用第一得分标签和第二得分标签生成平均意见得分标签。This embodiment does not limit the specific manner of generating the average opinion score. For example, each initial sound quality data may be averaged. In another embodiment, in order to improve the reliability of the average opinion score label, each initial sound quality data may be averaged to obtain the first score label. In addition, the initial training audio can also be input into the audio defect detection model to obtain the defect detection result; the audio defect detection model is used to detect audio defects that can affect the auditory experience. (Generated because the mouth is too close to the microphone, manifested as po sound, hu sound, grunt, etc.), hissing/tooth sound (presented as hissing sound, eating chi sound), current sound (caused by abnormal hardware circuit) The noise caused by it is expressed as sizzling/humming sound), click sound (expressed as thorn, click, and stuck sound), popping sound (called clip, due to sound popping, clipping, etc.) It is also easy to produce after the voice and accompaniment are mixed), ambient noise, and stuttering (expressed as short-term interruption of vocals, poor connection before and after, or obvious network transmission swallow). The audio defect detection model can identify audio defects in the initial training audio, which may be the number of types of audio defects, the number of times each type of audio defects occurs, and the like. Based on the defect detection result, a second score label may be generated. In one embodiment, a full score of 5 points may be set, and points may be deducted as appropriate according to the defect detection result to obtain a second score label, and the first score label and the second score label may be used. Score labels generate mean opinion score labels.

S102：对初始训练音频进行基于语音端点检测的非人声滤除处理，得到训练音频。S102: Perform non-human voice filtering processing based on voice endpoint detection on the initial training audio to obtain training audio.

可以理解的是，朗读者在录制初始训练音频时通常会在其中说多句话，不同句之间通常具有语音空白的时间间隔或其他非人声音频。在进行MOS打分时，用户仅对语音部分进行评价，忽略非人声音频。因此在音质检测模型训练过程中，初始训练音频中的非人声音频应当被去除，避免对音质检测模型训练造成干扰。在本实施方式中，可以采用语音端点检测的方式识别人声的开始时间位置和终止时间位置，进而将初始训练音频中的非人声部分滤除，得到训练音频。语音端点检测即Voice Activity Detection，VAD，其能够识别音频信号中的人声静音期。It is understandable that the reader usually speaks multiple sentences in the recording of the initial training audio, often with speech-free time intervals or other non-vocal audio between sentences. When MOS scoring, users only evaluate the voice part, ignoring non-human voice audio. Therefore, during the training of the sound quality detection model, the non-human voice audio in the initial training audio should be removed to avoid interference with the training of the sound quality detection model. In this embodiment, the voice endpoint detection method can be used to identify the start time position and the end time position of the human voice, and then filter the non-human voice part in the initial training audio to obtain the training audio. Voice endpoint detection is Voice Activity Detection, VAD, which is able to identify periods of silence in the audio signal.

具体的，在一种实施方式中，训练音频的生成过程包括：Specifically, in one embodiment, the generation process of the training audio includes:

步骤11：对初始训练音频进行语音端点检测，得到语音端点时刻。Step 11: Perform voice endpoint detection on the initial training audio to obtain the voice endpoint time.

步骤12：根据语音端点时刻对初始训练音频进行切分，得到多个音频段，并将音频段中的非人声音频段去除，得到人声音频段。Step 12: Segment the initial training audio according to the voice endpoint time to obtain a plurality of audio segments, and remove the non-human voice frequency range in the audio segment to obtain the human voice frequency range.

步骤13：将人声音频段拼接得到训练音频。Step 13: Splicing vocal frequency bands to obtain training audio.

在本实施方式中，语音端点检测可以识别到语音音频(即人声音频)的开始时刻和终止时刻，二者为语音端点时刻。根据语音端点时刻对初始训练音频进行切分得到音频段，其中包括人声音频段和非人声音频段，两种音频段依次交替出现。将其中的非人声音频段去除，保留人声音频段。示例性的，人声音频段和非人声音频段交替出现，所以语音端点时刻中的开始时刻和终止时刻同样交替出现。沿着时间先后顺序，将相邻的开始时刻到终止时刻之间的音频段确定为人声音频段，将相邻的终止时刻到开始时刻之间的音频段确定为非人声音频段。在音频段分类完成后，将非人声音频段去除，保留人声音频段，进而对其进行拼接，得到最终的训练音频。其中，非人声音频段可以为空白音频段，或者可以为记录有背景音等非人声的音频段。In this embodiment, the voice endpoint detection can identify the start time and end time of the voice audio (ie, human voice audio), which are the voice endpoint time. The initial training audio is segmented according to the voice endpoint time to obtain audio segments, which include a human voice band and a non-human voice band, and the two audio segments appear alternately in turn. The non-human voice frequency bands are removed, and the human voice frequency bands are retained. Exemplarily, the human voice frequency range and the non-human voice frequency range appear alternately, so the start time and the end time in the voice endpoint time also appear alternately. Along the chronological sequence, the audio segment between the adjacent start time and the end time is determined as the human voice frequency range, and the audio segment between the adjacent end time and the start time is determined as the non-human voice frequency range. After the audio segment classification is completed, the non-human voice band is removed, the human voice band is retained, and then spliced to obtain the final training audio. The non-human voice segment may be a blank audio segment, or may be an audio segment recorded with non-human voices such as background sounds.

S103：对训练音频进行特征提取处理，得到训练特征。S103: Perform feature extraction processing on the training audio to obtain training features.

为了能够使得初始模型能够更高效地学习到如何对音频进行音质检测，对训练音频进行特征提取，得到对应的训练特征，以便更好地表征训练音频的音质特点。特征提取的具体方式不做限定，在一种实施方式中，训练特征可以为图像形式，例如可以采用语谱图(Spectrogram)、梅尔频谱图(Mel-Frequency Spectrum)或其他频谱图形式。在另一种实施方式中，为了在较宽的频段范围内对信号进行评估，还可以对训练音频进行重采样，使其信号频率更好。In order to enable the initial model to learn how to detect the sound quality of the audio more efficiently, feature extraction is performed on the training audio to obtain corresponding training features, so as to better characterize the sound quality characteristics of the training audio. The specific manner of feature extraction is not limited. In one embodiment, the training feature may be in the form of an image, for example, a spectrogram (Spectrogram), a Mel-Frequency Spectrum (Mel-Frequency Spectrum) or other spectrogram forms. In another embodiment, in order to evaluate the signal over a wider frequency range, the training audio can also be resampled to make its signal frequency better.

具体的，训练特征的生成过程可以包括：Specifically, the generation process of training features may include:

步骤21：基于人耳可感知最大采样率，对训练音频进行重采样，得到中间数据。Step 21: Based on the maximum sampling rate perceivable by the human ear, resample the training audio to obtain intermediate data.

步骤22：对中间数据进行基于预设窗长的滑窗分帧，得到多个音频帧。Step 22: Perform sliding window segmentation based on a preset window length on the intermediate data to obtain multiple audio frames.

步骤23：对各个音频帧进行特征提取处理，得到训练特征。Step 23: Perform feature extraction processing on each audio frame to obtain training features.

人耳能听到的信号频率区间是相对固定的，通常为20Hz-20000Hz的范围。采样频率通常为信号频率的2倍，因此可以确定人耳可感知最大采样率。可以理解的是，由于不同的人听力范围不同，有些人的人耳频率范围的上限可以达到22000Hz，因此人耳可感知最大采样率可以比普通人耳频率范围的上限采样率更高一些，具体大小不做限定。示例性的，可以选择48kHz作为人耳可感知最大采样率。通过重采样，可以训练音频的拓宽频域范围，得到对应的中间数据。The frequency range of the signal that the human ear can hear is relatively fixed, usually in the range of 20Hz-20000Hz. The sampling frequency is usually 2 times the signal frequency, so the maximum sampling rate perceivable by the human ear can be determined. It is understandable that due to the different hearing ranges of different people, the upper limit of the human ear frequency range of some people can reach 22000Hz, so the maximum sampling rate perceivable by the human ear can be higher than the upper limit sampling rate of the ordinary human ear frequency range. Size is not limited. Exemplarily, 48 kHz may be selected as the maximum sampling rate perceivable by the human ear. Through resampling, the audio can be trained to broaden the frequency domain range, and the corresponding intermediate data can be obtained.

滑窗分帧，是指采用预设的分析窗在时间顺序上采样得到音频帧的过程，分析窗具体可以为海宁窗(hann)、汉明窗(hamming)、布莱克曼-哈里斯窗(blackman-harris)等，预设窗长为分析窗的宽度大小，具体不做限定，例如当采样频率为48kHz时，预设窗长可以为21.3ms。分析窗每次采样完毕后，向后滑动一段距离，以便再次采样，该滑动的距离即为帧移。帧移的具体大小不做限定，例如可以为窗长的一半。Sliding window framing refers to the process of using a preset analysis window to sample audio frames in time sequence. The analysis window can be Hann, Hamming, Blackman-Harris window. -harris), etc., the preset window length is the width of the analysis window, which is not specifically limited. For example, when the sampling frequency is 48kHz, the preset window length can be 21.3ms. After each sampling of the analysis window, it slides back a certain distance for re-sampling, and the sliding distance is the frame shift. The specific size of the frame shift is not limited, for example, it can be half of the window length.

在得到音频帧后，对音频信号进行梅尔频谱提取、语谱图提取等特征提取处理，得到单个音频帧对应的音频帧特征，并将各个音频帧特征按照时间先后顺序进行拼接，得到对应的训练特征。After the audio frame is obtained, feature extraction processing such as mel spectrum extraction and spectrogram extraction is performed on the audio signal to obtain the audio frame feature corresponding to a single audio frame, and the features of each audio frame are spliced in chronological order to obtain the corresponding audio frame features. training features.

S104：将训练特征输入初始模型，得到对应的训练音质检测结果。S104: Input the training feature into the initial model to obtain the corresponding training sound quality detection result.

S105：利用训练音质检测结果与平均意见得分标签计算损失值，并利用损失值调节初始模型的模型参数。S105: Calculate a loss value using the training sound quality detection result and the average opinion score label, and use the loss value to adjust the model parameters of the initial model.

对上述两个步骤进行综合说明。The above two steps are comprehensively explained.

初始模型，是指未经完全训练的模型，经过足量的训练和参数调节后，调解过后的初始模型即可作为音质检测模型。本实施例并不限定初始模型的具体形式和类别，例如可以为卷积神经网络模型，或者可以为卷积神经网络与循环神经网络的结合。处理模型对训练特征进行处理后，可以得到基于当前学习和调参情况对训练音频进行音质检测得到的训练音质检测结果。The initial model refers to a model that has not been fully trained. After sufficient training and parameter adjustment, the adjusted initial model can be used as a sound quality detection model. This embodiment does not limit the specific form and category of the initial model, for example, it may be a convolutional neural network model, or may be a combination of a convolutional neural network and a recurrent neural network. After the processing model processes the training features, a training sound quality detection result obtained by performing sound quality detection on the training audio based on the current learning and parameter adjustment conditions can be obtained.

在未经过足够的训练时，初始模型得到的训练音质检测结果与真正正确的结果(即MOS标签)具有一定差距，通过计算损失值并基于损失值调节初始模型的模型参数，使得模型能够学习到如何正确地进行音质检测并给出正确的音质检测结果。Without sufficient training, there is a certain gap between the training sound quality detection results obtained by the initial model and the truly correct results (that is, MOS tags). By calculating the loss value and adjusting the model parameters of the initial model based on the loss value, the model can learn How to correctly perform sound quality detection and give correct sound quality detection results.

具体的，在一种实施方式中，初始模型包括卷积神经网络、长短期记忆网络、全连接层和平均池化层。其中，卷积神经网络用于对输入的训练特征进行卷积计算，提取有效的音频局部特征。长短期记忆网络(Long Short-Term Memory，LSTM)，具体可以为双向长短期记忆网络(Bi-directional Long-Short Term Memory，BLSTM)其用于提取音频局部特征之间的时序关系，学习前后帧之间的相关性。全连接层，用于以帧为单位预测每个帧对应的音质检测结果。平均池化层，用于综合每个帧的音质检测结果得到最终的训练音质检测结果。Specifically, in one embodiment, the initial model includes a convolutional neural network, a long short-term memory network, a fully connected layer, and an average pooling layer. Among them, the convolutional neural network is used to perform convolution calculation on the input training features to extract effective audio local features. Long Short-Term Memory (LSTM), specifically a Bi-directional Long-Short Term Memory (BLSTM), which is used to extract the temporal relationship between audio local features, learn the frames before and after correlation between. The fully connected layer is used to predict the sound quality detection result corresponding to each frame in units of frames. The average pooling layer is used to synthesize the sound quality detection results of each frame to obtain the final training sound quality detection results.

相应的，将训练特征输入初始模型，得到对应的训练音质检测结果地过程可以包括：Correspondingly, the process of inputting the training features into the initial model and obtaining the corresponding training sound quality detection results may include:

步骤31：将训练特征输入卷积神经网络，得到训练中间特征。Step 31: Input the training features into the convolutional neural network to obtain training intermediate features.

步骤32：将训练中间特征输入长短期记忆网络，得到训练初始检测结果。Step 32: Input the training intermediate features into the long short-term memory network to obtain the training initial detection results.

步骤33：将训练初始检测结果输入全连接层，得到训练中间检测结果。Step 33: Input the training initial detection result into the fully connected layer to obtain the training intermediate detection result.

步骤34：将训练中间检测结果输入平均池化层，得到训练音质检测结果。Step 34: Input the training intermediate detection result into the average pooling layer to obtain the training sound quality detection result.

其中，训练中间特征即为经过卷积计算后得到的特征，训练初始检测结果为长短期记忆网络对时序关系进行提取后得到的数据。训练中间检测结果为训练特征中各个帧对应的音质检测结果。Among them, the training intermediate feature is the feature obtained after convolution calculation, and the initial detection result of training is the data obtained after the long short-term memory network extracts the time series relationship. The training intermediate detection result is the sound quality detection result corresponding to each frame in the training feature.

需要说明的是，损失值的具体计算方式不做限定，可以根据初始模型类别、训练音质检测结果的表现形式、训练侧重方向等进行计算，例如可以采用平方损失函数、指数损失函数、交叉熵损失函数等。在一种具体的实施方式中，可以在综合一个训练批次内每个初始训练音频对应的训练音质检测结果计算得到损失值。在这种情况下，在获取初始训练音频和对应的MOS标签时，可以按照预设批次大小，从训练数据集中获取一个批次的多个初始训练音频和平均意见得分标签。其中，预设批次大小，是指每个训练批次获取的初始训练音频的数量。由于参数调节的频率与损失值生成的频率相同，因此在本实施方式中，每一批次的初始训练音频均被处理完毕后，进行一次损失值计算和参数调教。It should be noted that the specific calculation method of the loss value is not limited, and can be calculated according to the initial model type, the expression of the training sound quality detection results, the training focus direction, etc., for example, the square loss function, the exponential loss function, and the cross entropy loss can be used. function etc. In a specific implementation, the loss value may be calculated by synthesizing the training sound quality detection results corresponding to each initial training audio in a training batch. In this case, when obtaining the initial training audio and corresponding MOS labels, a batch of multiple initial training audio and average opinion score labels can be obtained from the training dataset according to the preset batch size. The preset batch size refers to the number of initial training audios obtained in each training batch. Since the frequency of parameter adjustment is the same as the frequency of loss value generation, in this embodiment, after each batch of initial training audio is processed, a loss value calculation and parameter adjustment are performed.

相应的，利用训练音质检测结果与平均意见得分标签计算损失值的过程可以包括：Correspondingly, the process of calculating the loss value using the training sound quality detection result and the mean opinion score label may include:

步骤41：当得到一个批次对应的所有初始训练音频对应的训练音质检测结果时，利用该批次内的训练音质检测结果、平均意见得分标签和训练中间检测结果得到损失值。Step 41: When the training sound quality detection results corresponding to all the initial training audios corresponding to a batch are obtained, the loss value is obtained by using the training sound quality detection results, the average opinion score label and the training intermediate detection results in the batch.

在本实施方式中，综合一个批次内所有的训练音质检测结果和平均意见得分标签进行损失值计算，可以综合一整个批次的训练情况进行调参。此外，利用MOS标签和训练中间检测结果进行损失值计算，能够在损失值中反映每个帧的音质评价情况。In this embodiment, the loss value calculation is performed by combining all the training sound quality detection results and the average opinion score label in a batch, and the parameters can be adjusted by combining the training conditions of a whole batch. In addition, the loss value is calculated using the MOS tag and the training intermediate detection result, which can reflect the sound quality evaluation of each frame in the loss value.

具体的，可以根据Specifically, according to

得到损失值。get the loss value.

其中，loss为损失值，S为预设批次大小，T_S为训练特征对应的帧数，即训练特征生成时划分的音频帧，M_S为平均意见得分标签，

为训练音质检测结果，

为训练特征中的第t帧在训练中间检测结果中对应的值，α为预设权重，其具体大小不做限定，例如可以为1。Among them, loss is the loss value, S is the preset batch size, T _S is the number of frames corresponding to the training feature, that is, the audio frame divided when the training feature is generated, M _S is the average opinion score label,

In order to train the sound quality detection results,

is the value corresponding to the t-th frame in the training feature in the intermediate detection result of the training, α is a preset weight, and its specific size is not limited, for example, it may be 1.

S106：当检测到满足训练完成条件，则将调节后的初始模型确定为音质检测模型。S106: When it is detected that the training completion condition is satisfied, the adjusted initial model is determined as the sound quality detection model.

循环执行上述各个步骤，并在每一执行一个训练轮次并进行调参后，判断是否满足训练完成条件。训练完成条件，是指表示初始模型已经经过足够训练的条件，其具体可以为对训练过程进行限定的条件或者为对初始模型性能进行限定的条件。例如可以为训练轮次数量条件，或者可以为损失值区间条件。示例性的，可以判断损失值是否小于预设阈值，若小于预设阈值，则说明，则确定满足训练完成条件。预设阈值的具体大小不做限定，可以根据需要进行设定。The above steps are executed cyclically, and after each training round is executed and parameters are adjusted, it is judged whether the training completion condition is met. The training completion condition refers to a condition indicating that the initial model has been sufficiently trained, which may specifically be a condition for limiting the training process or a condition for limiting the performance of the initial model. For example, it can be a condition of the number of training rounds, or it can be a condition of a loss value interval. Exemplarily, it may be determined whether the loss value is less than the preset threshold, and if it is less than the preset threshold, it means that it is determined that the training completion condition is satisfied. The specific size of the preset threshold is not limited, and can be set as required.

应用本申请实施例提供的音质检测模型训练方法，利用平均意见得分(meanopinion score，MOS)作为初始训练音频以及训练音频的标签。平均意见得分由大量听众来评估通过通讯电路传输的，由男性或女性说话人大声朗读句子音频的质量。听众按以下标准给每个句子打分：(1)很差(2)差(3)一般(4)好(5)很好，MOS是所有听众个人打分的算术方法，范围从1(最差)到5(最好)。得到的MOS标签能够准确表征初始训练音频中句子音频的质量。通过语音端点检测，可以将非人声音频的部分作为干扰并滤除，仅保留人声的部分作为训练音频。通过特征提取得到训练特征后，利用初始模型进行质量检测，并利用得到的训练音质检测结果和平均意见得分标签计算损失值，利用损失值调节初始模型，使得初始模型学习正确评价音频质量的方式，得到的训练音质检测结果尽可能靠近MOS标签。在训练完成后，则可以将调节后的初始模型确定为音质检测模型。得到的音质检测模型能够在不具有纯干净音频的情况下，对待测音频的质量进行准确评估。The sound quality detection model training method provided in the embodiment of the present application is applied, and the mean opinion score (meanopinion score, MOS) is used as the initial training audio and the label of the training audio. Mean Opinion Scores were assessed by a large audience for the quality of audio of sentences read aloud by a male or female speaker, transmitted over a communication circuit. Listeners rated each sentence on the following scale: (1) very poor (2) poor (3) fair (4) good (5) very good, MOS is an arithmetic measure of individual scores for all listeners, ranging from 1 (worst) to 5 (best). The resulting MOS labels can accurately characterize the quality of the sentence audio in the initial training audio. Through speech endpoint detection, the part of non-human voice audio can be used as interference and filtered, and only the part of human voice can be retained as training audio. After the training features are obtained through feature extraction, the initial model is used for quality detection, and the loss value is calculated using the obtained training sound quality detection results and the average opinion score label. The obtained training sound quality detection results are as close as possible to the MOS labels. After the training is completed, the adjusted initial model can be determined as a sound quality detection model. The resulting sound quality detection model can accurately evaluate the quality of the audio under test without having pure clean audio.

基于上述实施例，在得到音质检测模型后，可以利用其不具有MOS标签的初始待测音频进行音质检测。请参考图4，图4为本申请实施例提供的一种音质检测方法流程图，具体包括如下步骤：Based on the above embodiment, after the sound quality detection model is obtained, the sound quality detection can be performed by using the initial to-be-tested audio that does not have a MOS tag. Please refer to FIG. 4. FIG. 4 is a flowchart of a sound quality detection method provided by an embodiment of the present application, which specifically includes the following steps:

S201：获取初始待测音频。S201: Obtain the initial audio to be tested.

S202：对初始待测音频进行基于语音端点检测的非人声滤除处理，得到待测音频。S202: Perform non-human voice filtering processing based on voice endpoint detection on the initial audio to be tested to obtain audio to be tested.

S203：对待测音频进行特征提取处理，得到待测特征。S203: Perform feature extraction processing on the audio to be tested to obtain features to be tested.

S204：将待测特征输入音质检测模型，得到初始待测音频对应的音质检测结果。S204: Input the feature to be tested into the sound quality detection model to obtain a sound quality detection result corresponding to the audio to be initially tested.

其中，音质检测模型采用如上的音质检测模型训练方法得到，非人声滤除处理以及特征提取处理与音质检测模型训练过程相同。具体的，在一种实施方式中，音质检测模型包括卷积神经网络、长短期记忆网络、全连接层和平均池化层。Among them, the sound quality detection model is obtained by the above-mentioned training method of the sound quality detection model, and the non-human voice filtering and feature extraction processing are the same as the training process of the sound quality detection model. Specifically, in one embodiment, the sound quality detection model includes a convolutional neural network, a long short-term memory network, a fully connected layer and an average pooling layer.

相应的，将待测特征输入音质检测模型，得到初始待测音频对应的音质检测结果，包括：Correspondingly, the feature to be tested is input into the sound quality detection model, and the sound quality detection result corresponding to the initial to-be-tested audio is obtained, including:

步骤51：将待测特征输入卷积神经网络，得到待测中间特征。Step 51: Input the feature to be tested into the convolutional neural network to obtain the intermediate feature to be tested.

步骤52：将待测中间特征输入长短期记忆网络，得到初始检测结果。Step 52: Input the intermediate features to be tested into the long short-term memory network to obtain an initial detection result.

步骤53：将初始检测结果输入全连接层，得到中间检测结果。Step 53: Input the initial detection result into the fully connected layer to obtain the intermediate detection result.

步骤54：将中间检测结果输入平均池化层，得到音质检测结果。Step 54: Input the intermediate detection result into the average pooling layer to obtain the sound quality detection result.

其中，步骤51至步骤54的过程可以参考步骤31至步骤34的过程，区别在于被处理的数据不同。Wherein, for the process of step 51 to step 54, reference may be made to the process of step 31 to step 34, the difference is that the data to be processed are different.

进一步的，请参考图5，图5为本申请实施例提供的另一种音质检测方法流程图。音频信号(即初始待测音频)被获取后，利用VAD算法对其进行人声检测，输出有人声的信号(即待测音频)，VAD算法具体可以为webrtc-vad算法。人声检测将音频信号中静音和非人声的部分滤除。然后将其重采样至48kHz。提取梅尔频谱形式的音频特征(即待测特征)，并将音频特征输入音质检测模型。音频特征提取时采用的是布莱克曼-哈里斯(blackman-harris)窗，帧移的大小为10.7ms。Further, please refer to FIG. 5 , which is a flowchart of another sound quality detection method provided by an embodiment of the present application. After the audio signal (that is, the initial audio to be tested) is acquired, the VAD algorithm is used to detect the human voice, and the signal of the human voice (that is, the audio to be tested) is output. The VAD algorithm may specifically be the webrtc-vad algorithm. Vocal detection filters out the silent and non-vocal parts of the audio signal. It is then resampled to 48kHz. Extract the audio features (that is, the features to be tested) in the form of Mel spectrum, and input the audio features into the sound quality detection model. A Blackman-Harris window is used for audio feature extraction, and the frame shift size is 10.7ms.

本申请中音频检测模型包括3层CNN网络、2层BLSTM网络、全连接层和平均池化层。其中CNN网络中的卷积层为2D卷积层，卷积核为3*3，三个卷积层的输出滤波器长度依次为16、32和64。每层CNN后都进行归一化，并使用ReLU激活函数。通过3层CNN，可有效的学习音频局部特征。随后送入到2层双向LSTM网络，其隐藏单元均设为256，主要目的是提取局部特征的时序关系，学习前后帧的相关性。经过2层双向LSTM级联后，送入到全连接层，预测每个帧级别的音质分值(即中间检测结果)，最后通过平均池化层，输出初始待测音频对应的最终客观评估分值(即音质检测结果)。The audio detection model in this application includes a 3-layer CNN network, a 2-layer BLSTM network, a fully connected layer and an average pooling layer. The convolution layer in the CNN network is a 2D convolution layer, the convolution kernel is 3*3, and the output filter lengths of the three convolution layers are 16, 32, and 64 in turn. After each layer of CNN, normalization is performed and a ReLU activation function is used. Through 3-layer CNN, audio local features can be effectively learned. Then it is sent to the 2-layer bidirectional LSTM network, and its hidden units are all set to 256. The main purpose is to extract the temporal relationship of local features and learn the correlation between the front and rear frames. After 2 layers of bidirectional LSTM cascade, it is sent to the fully connected layer to predict the sound quality score of each frame level (ie, the intermediate detection result), and finally through the average pooling layer, the final objective evaluation score corresponding to the initial audio to be tested is output. value (that is, the sound quality detection result).

下面对本申请实施例提供的计算机可读存储介质进行介绍，下文描述的计算机可读存储介质与上文描述的音质检测模型训练方法可相互对应参照。The computer-readable storage medium provided by the embodiments of the present application will be introduced below. The computer-readable storage medium described below and the voice quality detection model training method described above may refer to each other correspondingly.

本申请还提供一种计算机可读存储介质，计算机可读存储介质上存储有计算机程序，计算机程序被处理器执行时实现上述的音质检测模型训练方法的步骤。The present application also provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the above-mentioned training method for a sound quality detection model are implemented.

该计算机可读存储介质可以包括：U盘、移动硬盘、只读存储器(Read-OnlyMemory，ROM)、随机存取存储器(Random Access Memory，RAM)、磁碟或者光盘等各种可以存储程序代码的介质。The computer-readable storage medium may include: a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), a magnetic disk or an optical disk, etc., which can store program codes. medium.

本说明书中各个实施例采用递进的方式描述，每个实施例重点说明的都是与其它实施例的不同之处，各个实施例之间相同或相似部分互相参见即可。对于实施例公开的装置而言，由于其与实施例公开的方法相对应，所以描述的比较简单，相关之处参见方法部分说明即可。The various embodiments in this specification are described in a progressive manner, and each embodiment focuses on the differences from other embodiments, and the same or similar parts between the various embodiments may be referred to each other. As for the device disclosed in the embodiment, since it corresponds to the method disclosed in the embodiment, the description is relatively simple, and the relevant part can be referred to the description of the method.

本领域技术人员还可以进一步意识到，结合本文中所公开的实施例描述的各示例的单元及算法步骤，能够以电子硬件、计算机软件或者二者的结合来实现，为了清楚地说明硬件和软件的可互换性，在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。这些功能究竟以硬件还是软件的方式来执行，取决于技术方案的特定应用和设计约束条件。本领域技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能，但是这种实现不应该认为超出本申请的范围。Those skilled in the art may further realize that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two, in order to clearly illustrate the hardware and software In the above description, the components and steps of each example have been generally described according to their functions. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different approaches to implement the described functionality for each particular application, but such implementations should not be considered beyond the scope of this application.

结合本文中所公开的实施例描述的方法或算法的步骤可以直接用硬件、处理器执行的软件模块，或者二者的结合来实施。软件模块可以置于随机存储器(RAM)、内存、只读存储器(ROM)、电可编程ROM、电可擦除可编程ROM、寄存器、硬盘、可移动磁盘、CD-ROM、或技术领域内所公知的任意其它形式的存储介质中。The steps of a method or algorithm described in conjunction with the embodiments disclosed herein may be directly implemented in hardware, a software module executed by a processor, or a combination of the two. A software module can be placed in random access memory (RAM), internal memory, read only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other in the technical field. in any other known form of storage medium.

最后，还需要说明的是，在本文中，诸如第一和第二等之类的关系属于仅仅用来将一个实体或者操作与另一个实体或者操作区分开来，而不一定要求或者暗示这些实体或操作之间存在任何这种实际的关系或者顺序。而且，术语包括、包含或者其他任何变体意在涵盖非排他性的包含，从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素，而且还包括没有明确列出的其他要素，或者是还包括为这种过程、方法、物品或者设备所固有的要素。Finally, it should also be noted that, in this context, relationships such as first and second, etc., are used only to distinguish one entity or operation from another, and do not necessarily require or imply these entities or that there is any such actual relationship or sequence between operations. Furthermore, the terms including, comprising or any other variation are intended to cover non-exclusive inclusion such that a process, method, article or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or Yes also includes elements inherent to such a process, method, article or apparatus.

本文中应用了具体个例对本申请的原理及实施方式进行了阐述，以上实施例的说明只是用于帮助理解本申请的方法及其核心思想；同时，对于本领域的一般技术人员，依据本申请的思想，在具体实施方式及应用范围上均会有改变之处，综上所述，本说明书内容不应理解为对本申请的限制。The principles and implementations of the present application are described herein by using specific examples. The descriptions of the above embodiments are only used to help understand the methods and core ideas of the present application. There will be changes in the specific implementation and application scope. To sum up, the content of this specification should not be construed as a limitation to the application.

Claims

1. a sound quality detection model training method, is characterized in that, comprises:

Obtaining the initial training audio and the corresponding average opinion score label; the average opinion score label is used to represent the average sound quality evaluation parameter obtained after a plurality of evaluation objects perform sound quality evaluation on the initial training audio;

Carrying out the non-human voice filtering process based on voice endpoint detection to the initial training audio to obtain the training audio;

Perform feature extraction processing on the training audio to obtain training features;

Input the training feature into the initial model to obtain the corresponding training sound quality detection result;

Calculate a loss value using the training sound quality detection result and the average opinion score label, and use the loss value to adjust the model parameters of the initial model;

When it is detected that the training completion condition is satisfied, the adjusted initial model is determined as a sound quality detection model.

2. The sound quality detection model training method according to claim 1, wherein the acquisition process of the average opinion score label comprises:

playing the initial training audio to each of the evaluation objects;

receiving initial sound quality data obtained after each of the evaluation objects performs sound quality evaluation on the initial training audio;

The mean opinion score label is generated using each of the initial voice quality data.

3. The method for training a sound quality detection model according to claim 2, wherein the generating the average opinion score label using each of the initial sound quality data, comprising:

Perform averaging processing on each of the initial sound quality data to obtain a first score label;

Inputting the initial training audio into an audio defect detection model to obtain a defect detection result; wherein the audio defect detection model is used to detect audio defects that can affect auditory experience;

generating a second score label based on the defect detection result;

The average opinion score label is generated using the first score label and the second score label.

4. The method for training a sound quality detection model according to claim 1, wherein the described training audio is subjected to feature extraction processing to obtain training features, comprising:

Based on the maximum sampling rate perceivable by the human ear, the training audio is resampled to obtain intermediate data;

Performing sliding window framing based on a preset window length on the intermediate data to obtain a plurality of audio frames;

Feature extraction processing is performed on each of the audio frames to obtain the training features.

5. sound quality detection model training method according to claim 1, is characterized in that, described initial training audio frequency is carried out the non-human voice filtering processing based on voice endpoint detection, obtains training audio frequency, comprises:

The voice endpoint detection is carried out to the initial training audio, and the voice endpoint time is obtained;

According to the voice endpoint time, the initial training audio is segmented to obtain a plurality of audio segments, and the non-human voice frequency range in the audio segment is removed to obtain the human voice frequency range;

The training audio is obtained by splicing the human voice frequency bands.

6. The sound quality detection model training method according to claim 1, wherein the initial model comprises a convolutional neural network, a long short-term memory network, a fully connected layer and an average pooling layer;

The described training features are input into the initial model to obtain corresponding training sound quality detection results, including:

Inputting the training features into the convolutional neural network to obtain training intermediate features;

Inputting the training intermediate feature into the long short-term memory network to obtain the training initial detection result;

Inputting the training initial detection result into the fully connected layer to obtain the training intermediate detection result;

The training intermediate detection result is input into the average pooling layer to obtain the training sound quality detection result.

7. The method for training a sound quality detection model according to claim 6, wherein the acquisition of the initial training audio and the corresponding average opinion score label comprises:

According to a preset batch size, obtain a batch of a plurality of the initial training audio and the average opinion score label from the training data set;

Correspondingly, calculating the loss value using the training sound quality detection result and the average opinion score label includes:

When the training sound quality detection results corresponding to all the initial training audios corresponding to a batch are obtained, the loss value is obtained by using the training sound quality detection results, the average opinion score label and the training intermediate detection results in the batch .

8. The method for training a sound quality detection model according to claim 7, wherein the loss value is obtained by using the training sound quality detection result, the average opinion score label and the training intermediate detection result in the batch, comprising: :

according to

obtain the loss value;

Wherein, the loss is the loss value, the S is the preset batch size, the T _S is the number of frames corresponding to the training feature, the M _S is the average opinion score label, and the stated

For the training sound quality detection result, the

9. The method for training a sound quality detection model according to claim 1, wherein the detection meets the training completion condition, comprising:

judging whether the loss value is less than a preset threshold;

If so, it is determined that the training completion condition is satisfied.

10. A sound quality detection method, characterized in that, comprising:

Get the initial audio to be tested;

Carrying out non-human voice filtering processing based on voice endpoint detection to the initial audio to be tested to obtain audio to be tested;

Perform feature extraction processing on the audio to be tested to obtain features to be tested;

Input the feature to be tested into a sound quality detection model to obtain a sound quality detection result corresponding to the initial audio to be tested; wherein, the sound quality detection model is obtained by using the sound quality detection model training method described in any one of claims 1 to 7 .

11. The sound quality detection method according to claim 10, wherein the sound quality detection model comprises a convolutional neural network, a long short-term memory network, a fully connected layer and an average pooling layer;

The inputting the feature to be measured into the sound quality detection model to obtain the sound quality detection result corresponding to the initial to-be-measured audio, including:

Inputting the feature to be tested into the convolutional neural network to obtain an intermediate feature to be tested;

Inputting the intermediate feature to be tested into the long short-term memory network to obtain an initial detection result;

Inputting the initial detection result into the fully connected layer to obtain an intermediate detection result;

The intermediate detection result is input into the average pooling layer to obtain the sound quality detection result.

12. An electronic device, comprising a memory and a processor, wherein:

the memory for storing computer programs;

The processor is used to execute the computer program to realize the sound quality detection model training method according to any one of claims 1 to 9, and/or, the sound quality according to any one of claims 10 to 11 Detection method.

13. A computer-readable storage medium, characterized in that it is used for saving a computer program, wherein, when the computer program is executed by a processor, the method for training a sound quality detection model according to any one of claims 1 to 9 is realized, and/or, the sound quality detection method according to any one of claims 10 to 11.