CN115240702B - Voice separation method based on voiceprint characteristics - Google Patents
Voice separation method based on voiceprint characteristics
- Publication number
- CN115240702B (application CN202210836543.1A)
- Authority
- CN
- China
- Prior art keywords
- audio
- speech separation
- layer
- tcn
- speech
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000926 separation method Methods 0.000 title claims abstract description 110
- 238000012549 training Methods 0.000 claims abstract description 32
- 239000013598 vector Substances 0.000 claims abstract description 24
- 238000000034 method Methods 0.000 claims abstract description 22
- 238000012360 testing method Methods 0.000 claims abstract description 15
- 238000004364 calculation method Methods 0.000 claims abstract description 11
- 230000004927 fusion Effects 0.000 claims abstract description 7
- 230000008569 process Effects 0.000 claims abstract description 6
- 238000010606 normalization Methods 0.000 claims description 16
- 238000000605 extraction Methods 0.000 claims description 13
- 230000009466 transformation Effects 0.000 claims description 10
- 239000011159 matrix material Substances 0.000 claims description 9
- 238000001228 spectrum Methods 0.000 claims description 9
- 230000010339 dilation Effects 0.000 claims description 4
- 239000000284 extract Substances 0.000 claims description 4
- 230000002457 bidirectional effect Effects 0.000 claims description 3
- 239000000203 mixture Substances 0.000 claims description 3
- 238000011176 pooling Methods 0.000 claims description 3
- 230000006870 function Effects 0.000 claims description 2
- 230000005236 sound signal Effects 0.000 abstract description 4
- 230000003595 spectral effect Effects 0.000 description 4
- 238000013135 deep learning Methods 0.000 description 3
- 238000013528 artificial neural network Methods 0.000 description 2
- 238000006243 chemical reaction Methods 0.000 description 2
- 238000013527 convolutional neural network Methods 0.000 description 2
- 238000012545 processing Methods 0.000 description 2
- 230000007547 defect Effects 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 238000007499 fusion processing Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000000306 recurrent effect Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0272—Voice signal separating
- G10L21/0308—Voice signal separating characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L17/00—Speaker identification or verification techniques
- G10L17/02—Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Acoustics & Sound (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Human Computer Interaction (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Quality & Reliability (AREA)
- Complex Calculations (AREA)
Abstract
The present invention proposes a speech separation method based on voiceprint features. The implementation steps are: obtain a training sample set and a test sample set; construct a speech separation model based on voiceprint information; iteratively train the speech separation model; and obtain the speech separation result. The speech separation model constructed by the present invention includes a Conv-TasNet model. During training of the speech separation model and during inference, the FiLM-DCTasNet fusion algorithm is used to perform the mask calculation on each pair of mixed-audio encoding and voiceprint feature vector, so that the separation network can incorporate the target speaker's voiceprint features multiple times, while the separation network of the Conv-TasNet model carries out the mask calculation itself, which effectively improves the signal-to-distortion ratio of the separated audio signal and the separation efficiency.
Description
Technical Field
The present invention belongs to the technical field of speech recognition and relates to a speech separation method, in particular to a speech separation method based on voiceprint features.
Background Art
Speech separation technology uses hardware or software methods to separate the target speaker's audio from background interference. Speech separation methods include traditional methods and deep-learning methods. Compared with traditional speech separation methods, deep-learning methods offer high precision, high accuracy and end-to-end training and testing, and are currently the mainstream approach. Deep-learning methods mainly include methods based on permutation invariant training, deep attractor networks and deep clustering. These methods usually train the separation model for a fixed number of speakers and can only separate the audio of all speakers in a multi-speaker recording at the same time; they cannot separate the audio of a single target speaker on its own, and they cannot produce high-quality separated audio when several speakers have similar voice characteristics.
With the continuous improvement of voiceprint feature extraction models, fusing voiceprint features into speech separation models has been applied to target-speaker speech separation. For example, the patent application with publication number CN 113990344A, entitled "A method, device and medium for multi-person speech separation based on voiceprint features", discloses a multi-person speech separation method based on voiceprint features. The method first extracts voiceprint features from the target speaker's audio, performs a short-time Fourier transform on the mixed audio to obtain its spectral features, concatenates the extracted voiceprint features with the spectral features, and uses the deep clustering model DPCL to generate a mask. The generated mask is multiplied by the spectral features of the mixed audio to obtain the spectrum of the target speaker's clean audio, which is finally recovered through an inverse short-time Fourier transform. This method improves the accuracy of speech separation by concatenating the spectral features of the mixed audio with the voiceprint features of the target speaker at the input of the separation network. However, because the short-time Fourier transform converts the mixed audio from a time-domain signal into a magnitude spectrum, the noisy phase spectrum must be reused when the separated magnitude spectrum is converted back into a time-domain signal, which lowers the signal-to-distortion ratio of the separated audio. In addition, the DPCL separation model used in this method has a complex network structure and a long training time.
Summary of the Invention
The purpose of the present invention is to overcome the defects of the above prior art by proposing a speech separation method based on voiceprint features, aiming to improve the signal-to-distortion ratio and the efficiency of separating the target speaker's audio from multi-speaker audio.
To achieve the above object, the technical solution adopted by the present invention includes the following steps:
(1) Obtain a training sample set and a test sample set:
(1a) Obtain clean audio data F = {f1, f2, ..., fn, ..., fN} of N non-target speakers and M clean audio recordings K = {k1, k2, ..., km, ..., kM} of the target speaker, and mix each clean recording km of the target speaker with the clean audio data fp, fq of any two speakers in F and with natural noise, obtaining the mixed audio dataset G = {g1, g2, ..., gm, ..., gM}, where N ≥ 200, fn denotes one clean recording of the n-th speaker, M ≥ 8000, km denotes the m-th clean recording of the target speaker, p ∈ [1, N], q ∈ [1, N], p ≠ q, and gm denotes the mixed audio corresponding to km;
(1b) Perform a short-time Fourier transform on each clean recording km, and combine km, gm and the auxiliary speech spectrum hm obtained from the short-time Fourier transform of km into one sample, obtaining a set of M samples; then randomly select more than half of the samples from this set to form the training sample set T, and let the remaining samples form the test sample set R;
(2) Construct a speech separation model O based on voiceprint features:
Construct a speech separation model O comprising a voiceprint feature extraction module and a Conv-TasNet model. The Conv-TasNet model comprises an audio encoding module, a speech separation network and an audio decoding module; the audio encoding module is connected to the audio decoding module, the audio encoding module and the voiceprint feature extraction module are connected to the speech separation network, and the speech separation network is connected to the audio decoding module. The voiceprint feature extraction module comprises a bidirectional LSTM layer, a fully connected layer and an average pooling layer connected in sequence; the audio encoding module comprises a one-dimensional convolutional layer; the speech separation network comprises several normalization layers, several one-dimensional convolutional layers, several cascaded groups of temporal convolutional network (TCN) blocks and a PReLU layer; the audio decoding module comprises a one-dimensional deconvolution (transposed convolution) layer;
(3) Iteratively train the speech separation model O:
(3a) Initialize the iteration counter j and the maximum number of iterations J, J ≥ 100; denote the current speech separation model by Oj and set j = 0;
(3b) Use the training sample set T as the input of the speech separation model O for forward propagation:
(3b1) The audio encoding module encodes the mixed audio data in each training sample and outputs the mixed-audio encoding; at the same time, the voiceprint feature extraction module extracts voiceprint features from the auxiliary speech spectrum in each training sample and outputs the voiceprint feature vector corresponding to the mixed-audio encoding;
(3b2) The speech separation network uses the FiLM-DCTasNet fusion algorithm to perform the mask calculation on each pair of mixed-audio encoding and voiceprint feature vector, obtaining the non-negative mask matrix corresponding to the target speaker's audio in the mixture; the audio decoding module multiplies the mixed-audio encoding element-wise by this non-negative mask matrix and decodes the result with the one-dimensional deconvolution layer, obtaining the separated target-speaker audio for each training sample;
(3c) Compute the scale-invariant signal-to-noise ratio (SI-SNR) between the target-speaker audio separated from each training sample and its corresponding clean audio, and update the weights of the speech separation model O with the Adam optimizer by maximizing the SI-SNR, obtaining the voiceprint-feature-based speech separation model Oj after the j-th iteration;
(3d) Check whether j ≥ J holds. If so, the trained voiceprint-feature-based speech separation model O' is obtained; otherwise set j = j + 1 and return to step (3b);
(4) Obtain the speech separation result:
Use the test sample set R as the input of the trained voiceprint-feature-based speech separation model O' for forward propagation, obtaining the separated target-speaker audio corresponding to the test sample set R.
Compared with the prior art, the present invention has the following advantages:
1. During training of the speech separation model and during inference, the present invention uses the FiLM-DCTasNet fusion algorithm to perform the mask calculation on each pair of mixed-audio encoding and voiceprint feature vector, so that the separation network can incorporate the target speaker's voiceprint features multiple times. This solves the problem in the prior art that fusing the voiceprint features by a single dot product leaves audio of other speakers with similar voice characteristics in the separated output, and effectively improves the signal-to-distortion ratio of the separated audio signal.
2. The present invention performs the mask calculation with the separation network of the Conv-TasNet model, a deep convolutional neural network. Compared with the recurrent-neural-network models of the prior art, it has a simpler structure, faster training and better information-abstraction capability, so the speech separation model trains in less time and the separated target-speaker audio achieves a higher signal-to-distortion ratio.
3. For the encoder and decoder, the present invention adopts time-domain encoding, i.e. a one-dimensional convolutional layer directly encodes the audio time-domain signal. Compared with time-frequency encoding, this approach uses shorter frames, has trainable parameters and requires no phase reconstruction. It avoids the signal loss caused in the prior art by converting between the time domain and the time-frequency domain during encoding, so the separated audio signal achieves a higher signal-to-distortion ratio.
Brief Description of the Drawings
FIG. 1 is the implementation flow chart of the present invention;
FIG. 2 is a schematic diagram of the structure of the voiceprint-feature-based speech separation model constructed by the present invention.
Detailed Description
The present invention is further described in detail below with reference to the accompanying drawings and specific embodiments.
Referring to FIG. 1, the present invention comprises the following steps:
Step 1) Obtain a training sample set and a test sample set:
(1a) Obtain clean audio data F = {f1, f2, ..., fn, ..., fN} of N non-target speakers and M clean audio recordings K = {k1, k2, ..., km, ..., kM} of the target speaker, and mix each clean recording km of the target speaker with the clean audio data fp, fq of any two speakers in F and with natural noise, obtaining the mixed audio dataset G = {g1, g2, ..., gm, ..., gM}, where N ≥ 200, fn denotes one clean recording of the n-th speaker, M ≥ 8000, km denotes the m-th clean recording of the target speaker, p ∈ [1, N], q ∈ [1, N], p ≠ q, and gm denotes the mixed audio corresponding to km;
Mixing the target speaker's clean audio with other speakers' audio and with noise makes that clean audio the target the speech separation model has to recover, so the audio separated by the trained model is closer to the real audio. In this embodiment, N = 300 and M = 10000;
(1b) Perform a short-time Fourier transform on each clean recording km, and combine km, gm and the auxiliary speech spectrum hm obtained from the short-time Fourier transform of km into one sample, obtaining a set of M samples; then randomly select more than half of the samples from this set to form the training sample set T, and let the remaining samples form the test sample set R;
Unlike existing speech separation algorithms, in which the mixed audio is first converted to a spectrum by a short-time Fourier transform and then fed to the separation model, the present invention feeds the time-domain signal of the mixed audio directly into the speech separation model, which avoids the noise introduced by converting between the time domain and the time-frequency domain and therefore yields separated audio with less noise. In this embodiment, 8000 training samples and 2000 test samples are selected; a sketch of how one such sample could be assembled is given below.
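As a concrete illustration of steps (1a)-(1b), the following is a minimal sketch of how one sample (km, gm, hm) might be assembled, assuming the recordings are already loaded as mono NumPy arrays at a common sampling rate; the function name, the 8 kHz rate, the STFT parameters and the fixed noise gain are illustrative assumptions, not values taken from the patent.

```python
import numpy as np
from scipy.signal import stft

def make_sample(k_m, f_p, f_q, noise, sr=8000, n_fft=512, hop=128):
    """Build one training sample: mix the clean target utterance k_m with two
    interfering speakers and natural noise, and compute the STFT of k_m as the
    auxiliary (voiceprint) spectrum h_m."""
    length = len(k_m)
    fit = lambda x: np.resize(x, length)                  # truncate/repeat to the target length
    g_m = k_m + fit(f_p) + fit(f_q) + 0.1 * fit(noise)    # mixed audio (0.1 = assumed noise gain)
    _, _, spec = stft(k_m, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    h_m = np.abs(spec).T                                  # (frames, freq_bins) magnitude spectrum
    return k_m, g_m, h_m
```

Repeating this for all M target utterances and then splitting the resulting samples at random, more than half into T and the rest into R, reproduces the data layout described above.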
Step 2) Construct a speech separation model O based on voiceprint features; its structure is shown in FIG. 2.
Construct a speech separation model O comprising a voiceprint feature extraction module and a Conv-TasNet model. The Conv-TasNet model comprises an audio encoding module, a speech separation network and an audio decoding module; the audio encoding module is connected to the audio decoding module, the audio encoding module and the voiceprint feature extraction module are connected to the speech separation network, and the speech separation network is connected to the audio decoding module. The voiceprint feature extraction module comprises a bidirectional LSTM layer, a fully connected layer and an average pooling layer connected in sequence; the audio encoding module comprises a one-dimensional convolutional layer; the speech separation network comprises several normalization layers, several one-dimensional convolutional layers, several cascaded groups of TCN blocks and a PReLU layer; the audio decoding module comprises a one-dimensional deconvolution (transposed convolution) layer.
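The module layout described above can be summarized with the following PyTorch sketch, covering the voiceprint feature extraction module (bidirectional LSTM → fully connected layer → average pooling) and the time-domain encoder/decoder (a single 1-D convolution and a single 1-D transposed convolution); all layer sizes are assumptions for illustration, and the separation network itself is sketched separately below.

```python
import torch
import torch.nn as nn

class VoiceprintExtractor(nn.Module):
    """BiLSTM -> fully connected -> average pooling over time (sizes assumed)."""
    def __init__(self, n_freq=257, hidden=256, emb_dim=256):
        super().__init__()
        self.blstm = nn.LSTM(n_freq, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, emb_dim)

    def forward(self, spec):               # spec: (batch, frames, n_freq)
        h, _ = self.blstm(spec)
        e = self.fc(h)                     # (batch, frames, emb_dim)
        return e.mean(dim=1)               # average pooling -> (batch, emb_dim)

class Encoder(nn.Module):
    """Time-domain encoder: one 1-D convolution applied to the raw waveform."""
    def __init__(self, n_filters=512, kernel=16, stride=8):
        super().__init__()
        self.conv = nn.Conv1d(1, n_filters, kernel, stride=stride, bias=False)

    def forward(self, wav):                # wav: (batch, 1, samples)
        return torch.relu(self.conv(wav))  # (batch, n_filters, frames)

class Decoder(nn.Module):
    """Time-domain decoder: one 1-D transposed convolution."""
    def __init__(self, n_filters=512, kernel=16, stride=8):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride, bias=False)

    def forward(self, masked_enc):         # masked encoding -> waveform
        return self.deconv(masked_enc)     # (batch, 1, samples)
```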
The speech separation network comprises a first normalization layer, a first one-dimensional convolutional layer, a first TCN group, a second TCN group, a third TCN group, a PReLU layer, a second one-dimensional convolutional layer and a second normalization layer. Each TCN group comprises eight cascaded TCN blocks, and each TCN block consists of a one-dimensional convolutional layer, a ReLU layer, a normalization layer, a depthwise separable convolutional layer, a ReLU layer, a normalization layer and a one-dimensional convolutional layer, with a residual connection between its input and output. Each TCN block uses dilated convolution: the dilation factor (the spacing between the taps of the convolution kernel) starts at 1 and doubles from block to block, i.e. 1, 2, 4, ..., and is reset to 1 at the first TCN block of each group.
The TCN module adopted in the present invention is a network for sequence modelling and time-series prediction proposed by Shaojie Bai et al. in 2018. Because an RNN reads and parses only one element of the input sequence at a time, the network must finish processing the previous element before it can process the next one, so it cannot be parallelized on a large scale the way a convolutional neural network can and is comparatively slow. TCN modules therefore perform better than RNNs on prediction tasks over sequential data, and using the TCN-based Conv-TasNet model in the separation model improves both the separation efficiency and the signal-to-distortion ratio of the separated audio; a sketch of one TCN block follows.
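A minimal sketch of one TCN block and of the three groups of eight blocks with doubling dilation, following the structure described above; the channel widths and the use of GroupNorm as the normalization layer are assumptions, and the skip-connection summation that feeds the PReLU layer is omitted for brevity.

```python
import torch.nn as nn

class TCNBlock(nn.Module):
    """1-D conv -> ReLU -> norm -> depthwise dilated conv -> ReLU -> norm -> 1-D conv,
    with a residual connection from input to output (channel sizes assumed)."""
    def __init__(self, channels=256, hidden=512, kernel=3, dilation=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, 1),
            nn.ReLU(),
            nn.GroupNorm(1, hidden),
            # depthwise part of the separable convolution (groups=hidden);
            # the final 1x1 convolution below plays the role of the pointwise part
            nn.Conv1d(hidden, hidden, kernel, dilation=dilation,
                      padding=(kernel - 1) * dilation // 2, groups=hidden),
            nn.ReLU(),
            nn.GroupNorm(1, hidden),
            nn.Conv1d(hidden, channels, 1),
        )

    def forward(self, x):                  # x: (batch, channels, frames)
        return x + self.net(x)             # residual connection

def build_tcn_groups(channels=256):
    """Three cascaded groups of eight TCN blocks; the dilation doubles block by
    block (1, 2, 4, ..., 128) and is reset to 1 at the start of each group."""
    return nn.ModuleList(TCNBlock(channels, dilation=2 ** b)
                         for _ in range(3) for b in range(8))
```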
Step 3) Iteratively train the speech separation model O:
(3a) Initialize the iteration counter j and the maximum number of iterations J, J ≥ 100; denote the current speech separation model by Oj and set j = 0. In this embodiment J = 100;
(3b) Use the training sample set T as the input of the speech separation model O for forward propagation:
(3b1) The audio encoding module encodes the mixed audio data in each training sample and outputs the mixed-audio encoding; at the same time, the voiceprint feature extraction module extracts voiceprint features from the auxiliary speech spectrum in each training sample and outputs the voiceprint feature vector;
(3b2) The speech separation network performs the mask calculation on the mixed-audio encoding and voiceprint feature vector of each training sample, obtaining the non-negative mask matrix corresponding to the target speaker's audio in the mixture; the audio decoding module multiplies the mixed-audio encoding element-wise by this non-negative mask matrix and decodes the result with the one-dimensional deconvolution layer, obtaining the separated target-speaker audio for each training sample;
The speech separation network performs the mask calculation on the mixed-audio encoding and voiceprint feature vector of each training sample as follows: the mixed-audio encoding is passed through the first normalization layer and the first one-dimensional convolutional layer to obtain the intermediate representation L0 of the separation network; L0, modulated by the FiLM-layer affine transformation of the voiceprint feature vector, is fed into the first TCN group to obtain the intermediate representation L1; L1, modulated in the same way, is fed into the second TCN group to obtain the intermediate representation L2; L2, modulated in the same way, is fed into the third TCN group to obtain the output of the last TCN block of the third group. This output is summed with the outputs of all the other TCN blocks and passed through the PReLU layer, the second one-dimensional convolutional layer and the second normalization layer, yielding the non-negative mask matrix corresponding to the target speaker's audio in the mixture. The FiLM-layer affine transformation of Li,c and the voiceprint feature vector is computed according to the following formulas:
γi,c = fc(e)
βi,c = hc(e)
FiLM(Li,c | γi,c, βi,c) = γi,c · Li,c + βi,c
where e is the voiceprint feature vector extracted from the target speaker's clean speech, fc and hc are the c-th affine transformation functions applied to e, and Li,c is the c-th feature of the intermediate representation of the i-th TCN group before it is fused with the voiceprint feature vector;
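In code, the affine transformation defined by the formulas above amounts to a FiLM layer of the following form, where e is the pooled voiceprint embedding and Li the intermediate representation entering TCN group i; the embedding and channel dimensions are assumptions.

```python
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise linear modulation: gamma and beta are produced from the
    voiceprint embedding e and applied channel-wise to the representation L_i."""
    def __init__(self, emb_dim=256, channels=256):
        super().__init__()
        self.f = nn.Linear(emb_dim, channels)    # gamma_{i,c} = f_c(e)
        self.h = nn.Linear(emb_dim, channels)    # beta_{i,c}  = h_c(e)

    def forward(self, L_i, e):                   # L_i: (B, channels, frames), e: (B, emb_dim)
        gamma = self.f(e).unsqueeze(-1)          # broadcast over time frames
        beta = self.h(e).unsqueeze(-1)
        return gamma * L_i + beta                # FiLM(L_i | gamma, beta)
```

One FiLM layer with its own f and h would be instantiated in front of each of the three TCN groups, so the target speaker's voiceprint is injected three times along the separation network rather than once at the input.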
Whereas most speech separation algorithms fuse the voiceprint features with a single dot product, the present invention uses the FiLM-DCTasNet fusion algorithm to fuse the voiceprint features inside the separation network and builds a separate set of trainable parameters for each fusion step. This expresses the target speaker's voiceprint features more fully in the non-negative mask matrix corresponding to the target speaker's audio, and thereby improves the signal-to-distortion ratio of the target-speaker audio produced by the speech separation model;
(3c) Compute the scale-invariant signal-to-noise ratio (SI-SNR) between the target-speaker audio separated from each training sample and its corresponding clean audio, and update the weights of the speech separation model O with the Adam optimizer by maximizing the SI-SNR, obtaining the voiceprint-feature-based speech separation model Oj after the j-th iteration;
The SI-SNR is defined by the following formula:
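The formula itself is not reproduced in the text above; the standard SI-SNR definition, which the surrounding description matches, is:

```latex
S_{\text{target}} = \frac{\langle \hat{S}, S \rangle}{\lVert S \rVert^{2}}\, S,
\qquad
e_{\text{noise}} = \hat{S} - S_{\text{target}},
\qquad
\text{SI-SNR} = 10 \log_{10} \frac{\lVert S_{\text{target}} \rVert^{2}}{\lVert e_{\text{noise}} \rVert^{2}}
```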
where S is the clean audio data and Ŝ is the separated target-speaker audio data;
The performance metric SI-SNR adopted in the present invention decomposes the generated vector into its projection onto the true vector and the component perpendicular to it. Compared with the signal-to-noise ratio (SNR) used in traditional methods, it reflects more directly the similarity between the separated target-speaker audio and the clean audio, and therefore yields a speech separation model with better performance; a sketch of the corresponding training loss follows.
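Under the same assumptions, a negative-SI-SNR training loss matching the definition above might look as follows in PyTorch; the zero-mean step and the small epsilon are common practice rather than details taken from the patent, and the names in the commented training step are hypothetical.

```python
import torch

def si_snr_loss(est, ref, eps=1e-8):
    """Negative scale-invariant SNR between the separated estimate and the clean
    reference; est, ref: (batch, samples). Maximising SI-SNR = minimising this loss."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    s_target = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    e_noise = est - s_target
    si_snr = 10 * torch.log10(s_target.pow(2).sum(-1) / (e_noise.pow(2).sum(-1) + eps) + eps)
    return -si_snr.mean()

# Training step as in (3c), with hypothetical names `model`, `g`, `h`, `k`:
# optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = si_snr_loss(model(g, h), k); optimiser.zero_grad(); loss.backward(); optimiser.step()
```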
(3d) Check whether j ≥ J holds. If so, the trained voiceprint-feature-based speech separation model O' is obtained; otherwise set j = j + 1 and return to step (3b).
Step 4) Obtain the speech separation result:
Use the test sample set R as the input of the trained voiceprint-feature-based speech separation model O' for forward propagation, obtaining the separated target-speaker audio corresponding to the test sample set R.
Claims (4)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210836543.1A CN115240702B (en) | 2022-07-15 | 2022-07-15 | Voice separation method based on voiceprint characteristics |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210836543.1A CN115240702B (en) | 2022-07-15 | 2022-07-15 | Voice separation method based on voiceprint characteristics |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115240702A CN115240702A (en) | 2022-10-25 |
CN115240702B true CN115240702B (en) | 2024-09-24 |
Family
ID=83673085
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210836543.1A Active CN115240702B (en) | 2022-07-15 | 2022-07-15 | Voice separation method based on voiceprint characteristics |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115240702B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116741193B (en) * | 2023-08-09 | 2023-11-14 | 腾讯科技(深圳)有限公司 | Training method and device for voice enhancement network, storage medium and computer equipment |
CN118283015B (en) * | 2024-05-30 | 2024-08-20 | 江西扬声电子有限公司 | Multi-channel audio transmission method and system based on cabin Ethernet |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111341341B (en) * | 2020-02-11 | 2021-08-17 | 腾讯科技(深圳)有限公司 | Training method of audio separation network, audio separation method, device and medium |
CN112331218B (en) * | 2020-09-29 | 2023-05-05 | 北京清微智能科技有限公司 | A single-channel speech separation method and device for multiple speakers |
- 2022-07-15: application CN202210836543.1A filed (CN); patent CN115240702B, status Active
Non-Patent Citations (1)
Title |
---|
"基于深度学习的语音分离技术研究";王乾;《中国优秀硕士学位论文全文数据库信息科技辑》;20230715;第I136-298页 * |
Also Published As
Publication number | Publication date |
---|---|
CN115240702A (en) | 2022-10-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111048082B (en) | Improved end-to-end speech recognition method | |
CN109272988B (en) | Speech recognition method based on multi-channel convolutional neural network | |
CN110797002B (en) | Speech synthesis method, speech synthesis device, electronic equipment and storage medium | |
CN111243579B (en) | Time domain single-channel multi-speaker voice recognition method and system | |
CN110459225B (en) | Speaker recognition system based on CNN fusion characteristics | |
CN111429947B (en) | Speech emotion recognition method based on multi-stage residual convolutional neural network | |
CN108847223B (en) | A speech recognition method based on deep residual neural network | |
CN110782872A (en) | Language recognition method and device based on deep convolutional neural network | |
CN109767778B (en) | A Speech Conversion Method Fusion Bi-LSTM and WaveNet | |
CN110570845B (en) | A Speech Recognition Method Based on Domain Invariant Features | |
CN115240702B (en) | Voice separation method based on voiceprint characteristics | |
CN110148408A (en) | A kind of Chinese speech recognition method based on depth residual error | |
CN113129908B (en) | End-to-end macaque voiceprint verification method and system based on cyclic frame level feature fusion | |
CN114783418B (en) | End-to-end speech recognition method and system based on sparse self-attention mechanism | |
CN110349588A (en) | A kind of LSTM network method for recognizing sound-groove of word-based insertion | |
Han et al. | Self-supervised learning with cluster-aware-dino for high-performance robust speaker verification | |
CN111312228A (en) | End-to-end-based voice navigation method applied to electric power enterprise customer service | |
CN113763966B (en) | End-to-end text irrelevant voiceprint recognition method and system | |
CN113450761B (en) | Parallel voice synthesis method and device based on variation self-encoder | |
CN116453023B (en) | Video abstraction system, method, electronic equipment and medium for 5G rich media information | |
CN115881156A (en) | A Multi-Scale-Based Multi-modal Time-Domain Speech Separation Method | |
CN115512693B (en) | Audio recognition method, acoustic model training method, device and storage medium | |
CN106875944A (en) | A kind of system of Voice command home intelligent terminal | |
CN114464159A (en) | A vocoder speech synthesis method based on half-stream model | |
CN113870896A (en) | Motion sound false judgment method and device based on time-frequency graph and convolutional neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||