
CN115691539A - Two-stage voice separation method and system based on visual guidance - Google Patents

Two-stage voice separation method and system based on visual guidance

Info

Publication number
CN115691539A
CN115691539A
Authority
CN
China
Prior art keywords
voice
features
stage
speech
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211317835.0A
Other languages
Chinese (zh)
Inventor
魏莹
邓媛洁
张寒冰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202211317835.0A priority Critical patent/CN115691539A/en
Publication of CN115691539A publication Critical patent/CN115691539A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention provides a two-stage voice separation method and system based on visual guidance. In the first stage, each speaker's voice is coarsely separated from the acquired mixed speech in the time domain. In the second stage, independent voice features carrying speaker information are extracted from the coarsely separated speech of the first stage; meanwhile, latent correlated and complementary features between the visual and audio modalities are mined, the multi-modal features are fused and then separated, and clean target speech is finally obtained. The invention uses the first stage to extract the speakers' independent voice features, which avoids introducing clean reference speech; visual guidance improves the separation performance and robustness and resolves the label permutation problem. The invention further improves separation quality by dynamically adjusting the weights of the two-stage model, and the disclosed voice separation system is suitable for most application scenarios.

Description

Two-stage speech separation method and system based on visual guidance

Technical Field

The invention belongs to the technical field of speech separation and relates to a two-stage speech separation method and system based on visual guidance.

Background

The statements in this section merely provide background information related to the present invention and do not necessarily constitute prior art.

Speech separation refers to extracting one or more target speech signals from mixed speech produced by multiple speakers. The problem stems from the "cocktail party effect", which describes the phenomenon that, in a noisy indoor environment such as a cocktail party, many different sound sources are present at the same time, yet people can still focus on one person's conversation while ignoring the other conversations or noise in the background. This is a human ability of auditory selection, and we would like machines to learn the same ability to select and filter speech. Speech separation has a wide range of applications and is a fundamental and important step in many downstream speech tasks; only high-quality clean speech separated from the mixture can be applied effectively to speech recognition [TORFI A, IRANMANESH S M, NASRABADI N, et al. 3D Convolutional Neural Networks for Cross Audio-Visual Matching Recognition [J]. IEEE Access, 2017] [AFOURAS T, CHUNG J S, SENIOR A, VINYALS O, ZISSERMAN A. Deep Audio-Visual Speech Recognition [J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2018], human-computer interaction, and other scenarios.

Traditional speech separation algorithms include computational auditory scene analysis (CASA), non-negative matrix factorization, and hidden Markov models; they usually require certain assumptions and prior knowledge [MCDERMOTT J H. The cocktail party problem [J]. Current Biology, 2009] and also perform poorly when separating the voices of similar speakers [Rivet B, Wang W, Naqvi S M, et al. Audiovisual speech source separation: An overview of key methodologies [J]. IEEE Signal Processing Magazine, 2014, 31(3): 125-134]. With the development of deep learning, neural networks have also achieved good results in speech separation, since they can learn the complex mapping between the mixed speech signal and the target speech signals. However, common audio-only methods suffer from the label permutation problem when training the separation network; although the correct matching can be found through permutation invariant training (PIT), the computation is rather complex. In addition, speech is strongly affected by the environment and noise, which makes the separation system less robust.

In real-world scenarios, people assist their auditory perception by watching the speaker: listening is easier when the speaker's face or lips can be seen. Studies in psychology, physiology, and psychiatry [Golumbic E Z, Cogan G B, Schroeder C E, et al. Visual input enhances selective speech envelope tracking in auditory cortex at a 'cocktail party' [J]. Journal of Neuroscience, 2013, 33(4): 1417-1426] also demonstrate that visual information helps people understand speech. Compared with speech, the speaker's visual information, such as lip movement and facial appearance, is more stable; it also carries identity characteristics that can be matched to the correct speaker label while separating mixed speech. The separation model proposed by Ephrat in 2017 introduced only static images; although this reduces the data dimensionality, it discards visual temporal information and degrades separation performance.

Some studies [AFOURAS T, CHUNG J S, ZISSERMAN A. My lips are concealed: Audio-visual speech enhancement through obstructions [J]. Interspeech, 2019, 4295-9] [OCHIAI T, DELCROIX M, KINOSHITA K, et al. Multimodal SpeakerBeam: Single Channel Target Speech Extraction with Audio-Visual Speaker Clues [M]. Interspeech 2019. 2019: 2718-22] [R. Gu, S.-X. Zhang, Y. Xu, L. Chen, Y. Zou, and D. Yu, "Multi-modal multi-channel target speech separation," IEEE Journal of Selected Topics in Signal Processing, 2020] propose using features extracted from an additional reference speech of the speaker to improve separation. Wang et al. used the x-vector designed for speaker recognition in the separation model, and Luo et al. introduced the i-vector of the speaker's clean speech to assist separation. However, such methods have two drawbacks: first, the speaker's clean reference speech must be recorded in advance before the separation model can be trained; second, when the trained model is applied to a real separation scenario, clean speech of the speakers to be separated must also be available, which greatly limits practical applications.

Summary of the Invention

To solve the above problems, the present invention proposes a two-stage speech separation method and system based on visual guidance. The first-stage separated speech is used to extract identity-discriminative independent speech features of the speakers to assist speech separation, and feature extraction and fusion of the visual and audio modalities further assist separation, which improves speech separation performance.

According to some embodiments, the present invention adopts the following technical solutions:

A two-stage speech separation method based on visual guidance, comprising the following steps:

in the first stage, separating the acquired mixed speech in the time domain to obtain coarsely separated speaker speech;

in the second stage, extracting independent speech features carrying speaker information from the audio-only separation result of the first stage, then mining latent correlated and complementary features between the visual and audio modalities, fusing the visual features with the speech time-frequency-domain features and then separating them, and dynamically adjusting the weights of the two stages to finally obtain clean target speech.

As an optional embodiment, the specific process of the first stage includes:

encoding the acquired mixed speech with an encoder to extract mixed speech features;

separating the mixed speech features to obtain the masks of the target speech, determining the target speech features, and decoding the target speech features to obtain the coarsely separated time-domain signals of the target speech.

As a further limitation, the specific process of separating the mixed speech features includes processing the mixed speech features with a first separation network; the first separation network is a temporal convolutional network structure comprising a normalization layer and several identical stack modules, where each stack module consists of full convolution, dilated convolution, and residual blocks, and the output of the last stack module passes through a convolutional layer and a PReLU activation layer to obtain the separated target masks.

As a further limitation, the target speech features are obtained by multiplying the mixed speech by the masks of the target speech.

As an optional embodiment, the specific process of the second stage includes:

transforming the mixed speech to obtain its complex spectrogram, and obtaining the complex spectral masks of the true clean speech from it;

converting the time-domain target speech signals obtained in the first stage to obtain the complex spectrogram of each separated speaker, and passing the complex spectrograms through the independent speech feature extraction network ResNet-18 to extract each speaker's independent speech features;

acquiring the speaker's visual information synchronized in time with the mixed speech and preprocessing it, then extracting static visual features and dynamic visual features from the preprocessed visual images, where the static visual features contain discriminative speaker identity information and are similar in nature to acoustic attributes such as timbre, while the dynamic visual features contain the content information of the speech and are similar to acoustic attributes such as phonemes; combining the two kinds of visual features yields more speech-related features while processing less dimensional information;

extracting speech features from the time-frequency-domain information of the mixed speech with a mixed speech feature extraction network and performing multi-modal feature fusion; separation network 2 separates the multi-modal features to obtain the masks of the separated target speech; the masks are multiplied by the complex spectrogram of the mixed speech and inversely transformed, and the time-domain speech signal of the target speaker is obtained through joint training of the two separation stages and dynamic optimization of their weights.

As a further limitation, the specific process of transforming the mixed speech to obtain its complex spectrogram includes performing a short-time Fourier transform on the mixed speech signal and then computing its real and imaginary parts to obtain the complex spectrogram, which contains both the magnitude and phase information of the speech.

As a further limitation, time-frequency-domain conversion is applied to the coarsely separated speech of the first stage to obtain complex spectrograms; the independent speech feature extraction network ResNet-18 then extracts independent speech features from each speaker's complex spectrogram; the independent speech features are then transformed along the time dimension so that the audio and video modal features have consistent dimensions.

As a further limitation, the specific process of acquiring and preprocessing the speaker's visual information synchronized in time with the mixed speech includes reading the video file, intercepting a video segment of set length to obtain a multi-frame image sequence, and randomly selecting one facial image frame as the static visual information; each frame of the image sequence is then cropped, a lip region of set size is selected to reduce the data dimensionality, and a file of the lip sequence is generated as the dynamic visual information.

As a further limitation, the specific process of extracting visual features includes normalizing and padding the lip images; the preprocessed lip data then passes through a dynamic visual feature extraction network consisting of a 3D convolutional layer, ShuffleNet v2, and a temporal convolutional network that extracts time-series features and better fits the content information of the speech, finally yielding the lip features;

the facial image is standardized and resized, features containing speaker identity information are extracted by the static visual feature extraction network ResNet-18, and the facial features are transformed along the time dimension so that, after transformation, they have the same time dimension as the lip sequence features.

As a further limitation, the specific process of multi-modal feature fusion includes first passing the mixed-speech complex spectrogram through the mixed speech feature extraction network to obtain the mixed speech features, then concatenating the speaker's visual features, independent speech features, and mixed speech features, and finally obtaining the fused multi-modal features.

As a further limitation, the multi-modal features are separated by a second separation network, which is the upsampling network layer of a U-Net.

As a further limitation, the loss-function weights of the two-stage separation networks are dynamically adjusted so that the independent speech features of the first stage are used to the greatest extent to assist the second-stage separation.

A two-stage speech separation system based on visual guidance, comprising:

a first separation module configured to, in the first stage, separate the acquired mixed speech in the time domain to obtain coarsely separated speaker speech;

a second separation module configured to, in the second stage, extract acoustic features carrying speaker information from the audio-only separation result of the first stage, mine latent correlated and complementary features between the visual and audio modalities, fuse the visual features with the speech time-frequency-domain features and then separate them, and dynamically adjust the weights of the two stages to finally obtain the separated target speech;

a dynamic weight adjustment module that dynamically adjusts the weights of the two stages according to the performance of their separation models, so as to make maximal use of the independent speech features extracted in the first stage to assist the second stage and achieve clean target-speaker speech separation.

Compared with the prior art, the beneficial effects of the present invention are:

(1) The visually guided two-stage speech separation scheme proposed by the present invention can extract the speech features of a single speaker to assist speech separation when only mixed speech is available, avoiding the introduction of additional clean reference speech.

(2) The present invention simultaneously uses dynamic visual features containing speech content information and static visual features containing identity information, mining the latent correlation and complementarity between the visual and audio modalities; it also solves the label permutation problem of audio-only speech separation, avoids the computational complexity of the corresponding loss function, and improves both the separation effect and the robustness of the separation system.

(3) The present invention proposes a method for dynamically adjusting the loss-function weights of two-stage speech separation. While optimizing the two training objectives simultaneously, the independent speech features extracted from the first-stage speech are used to the greatest extent to assist the second-stage separation, finally yielding clean speech with higher performance metrics.

Brief Description of the Drawings

The accompanying drawings, which constitute a part of the present invention, are provided for a further understanding of the present invention; the schematic embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention.

Fig. 1 is a flowchart of a two-stage speech separation method based on visual guidance according to the present invention.

Fig. 2 is a flowchart of a method for extracting visual features according to the present invention.

Fig. 3 is a flowchart of a method for dynamically adjusting loss-function weight coefficients according to the present invention.

Detailed Description

The present invention will be further described below in conjunction with the accompanying drawings and embodiments.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present invention. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It should be noted that the terminology used here is only for describing specific embodiments and is not intended to limit exemplary embodiments according to the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms are also intended to include the plural forms; furthermore, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.

The present invention proposes a two-stage speech separation method and system based on visual guidance: the speaker's speech features are extracted from the first-stage coarsely separated speech to assist speech separation, and feature extraction and fusion of multi-modal information further assist separation.

The specific process includes:

I. First stage: audio-only time-domain separation

1. Obtain the mixed speech. Randomly select and read the clean speech wav files of two speakers (taking two-speaker mixed-speech separation as an example), intercept a fixed length (2.55 s as an example), sample the read time-domain speech signals at a sampling rate of 16 kHz, and normalize them, denoting them as xA and xB; the two clean signals are added to obtain the mixed speech xmix = xA + xB.
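A minimal Python sketch of this step is shown below; the file names are hypothetical and peak normalization is assumed as one possible convention, since the text does not specify the normalization:

```python
import numpy as np
import soundfile as sf

SR = 16000                          # 16 kHz sampling rate
SEG_LEN = int(2.55 * SR)            # 2.55 s segment -> 40800 samples

def load_segment(path):
    wav, sr = sf.read(path, dtype="float32")
    assert sr == SR, "resample to 16 kHz beforehand"
    wav = wav[:SEG_LEN]                            # truncate to the fixed length
    return wav / (np.max(np.abs(wav)) + 1e-8)      # normalize

x_a = load_segment("speaker_A.wav")                # hypothetical clean wav files
x_b = load_segment("speaker_B.wav")
x_mix = x_a + x_b                                  # mixed speech x_mix = x_A + x_B
```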

2. The mixed speech xmix passes through the encoder to extract mixed speech features. The encoder is a single one-dimensional convolutional layer.

3. The mixed speech features pass through separation network 1 to obtain the masks of the separated target speech. Separation network 1 uses a temporal convolutional network (TCN) structure. The TCN includes a normalization layer (group normalization followed by a one-dimensional convolution) and n identical stack modules, where each stack module consists of full convolution, dilated convolution, and residual blocks. The output of the last stack module passes through a convolutional layer and a PReLU activation layer to obtain the separated target masks.

4. Multiply the mixed speech xmix by the masks of the target speech to obtain the target speech features of the corresponding speakers.

5. Pass the target speech features obtained above through the decoder to obtain the time-domain signals of the target speech, denoted x'A and x'B. The decoder is a single one-dimensional transposed convolutional layer.
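Steps 2-5 together form a Conv-TasNet-style encoder/TCN/decoder pipeline. The sketch below is one possible reading of that description; the channel count, kernel sizes, and the number of stacks and blocks are illustrative assumptions not given in the text:

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One stack unit: 1x1 conv -> depthwise dilated conv -> residual connection."""
    def __init__(self, ch=256, hid=512, dilation=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(ch, hid, 1), nn.PReLU(), nn.GroupNorm(1, hid),
            nn.Conv1d(hid, hid, 3, padding=dilation, dilation=dilation, groups=hid),
            nn.PReLU(), nn.GroupNorm(1, hid),
            nn.Conv1d(hid, ch, 1))

    def forward(self, x):
        return x + self.net(x)

class StageOne(nn.Module):
    def __init__(self, n_spk=2, ch=256, n_blocks=8, n_stacks=3):
        super().__init__()
        self.encoder = nn.Conv1d(1, ch, kernel_size=16, stride=8)             # step 2
        self.norm = nn.Sequential(nn.GroupNorm(1, ch), nn.Conv1d(ch, ch, 1))  # normalization layer
        self.tcn = nn.Sequential(*[TCNBlock(ch, dilation=2 ** b)              # step 3
                                   for _ in range(n_stacks) for b in range(n_blocks)])
        self.mask_head = nn.Sequential(nn.PReLU(), nn.Conv1d(ch, n_spk * ch, 1), nn.Sigmoid())
        self.decoder = nn.ConvTranspose1d(ch, 1, kernel_size=16, stride=8)    # step 5
        self.n_spk, self.ch = n_spk, ch

    def forward(self, x_mix):                       # x_mix: (B, 1, T)
        feat = self.encoder(x_mix)                  # (B, C, L)
        masks = self.mask_head(self.tcn(self.norm(feat)))
        masks = masks.view(-1, self.n_spk, self.ch, feat.size(-1))
        return [self.decoder(feat * masks[:, i]) for i in range(self.n_spk)]  # steps 4 + 5

x_rough_a, x_rough_b = StageOne()(torch.randn(1, 1, 40800))   # coarse estimates x'_A, x'_B
```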

II. Second stage: multi-modal time-frequency-domain separation

6. Obtain the time-frequency-domain information of the mixed speech, i.e. its complex spectrogram, denoted Smix. The complex spectrogram represents the three-dimensional information of time, frequency, and energy on a two-dimensional plane, and keeping both its real and imaginary parts avoids the loss of separation quality caused by discarding phase information in a magnitude-only spectrogram. The mixed speech signal is first processed by a short-time Fourier transform (STFT), and its real and imaginary parts are then computed to obtain the complex spectrogram. The STFT is given by formula (1).

X(n, ω) = Σ_m x(m)·w(n − m)·e^(−jωm)    (1)

where x(m) is the input speech signal, w(m) is the window function, and X(n, ω) is a two-dimensional function of time n and frequency ω. In the present invention, the window length window_size is set to 400, the hop between adjacent STFT columns hop_size is 160 samples (i.e. adjacent windows are shifted by 160 samples), and the zero-padded windowed signal length n_fft is 512; the resulting complex spectrogram has size 2*F*T, i.e. 2*257*256, where 2 corresponds to the real and imaginary parts and F and T are the frequency and time dimensions, respectively.
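With the stated parameters, the complex spectrogram can be computed as in the sketch below; torch.stft with a Hann window is assumed, since the text does not name the window type:

```python
import torch

N_FFT, WIN, HOP = 512, 400, 160

def complex_spectrogram(wave):          # wave: (B, 40800) at 16 kHz
    spec = torch.stft(wave, n_fft=N_FFT, hop_length=HOP, win_length=WIN,
                      window=torch.hann_window(WIN), return_complex=True)
    # stack real and imaginary parts: (B, 2, F, T) with F = N_FFT // 2 + 1 = 257
    return torch.stack([spec.real, spec.imag], dim=1)

S_mix = complex_spectrogram(torch.randn(1, 40800))
print(S_mix.shape)                      # torch.Size([1, 2, 257, 256]) -- the stated 2*257*256
```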

7. The time-domain clean speech is transformed by the STFT of formula (1) to obtain the corresponding complex spectrograms, and the complex spectral masks MA, MB of the true clean speech are obtained from the complex spectrogram of the mixed speech.

M = (Yr·Sr + Yi·Si)/(Yr² + Yi²) + j·(Yr·Si − Yi·Sr)/(Yr² + Yi²)    (2)

where Yr and Yi denote the real and imaginary components of the mixed-speech complex spectrogram, and Sr and Si denote the real and imaginary components of the clean-speech complex spectrogram. From the ideal complex mask and the mixed speech, the clean speech can be obtained as follows:

S=M*Y (3)S=M*Y (3)

where * denotes complex multiplication.
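A sketch of the cIRM of formula (2) and the complex multiplication of formula (3), operating on the (B, 2, F, T) real/imaginary stacks from the previous snippet; the small epsilon is an assumed numerical safeguard:

```python
import torch

def cirm(S, Y, eps=1e-8):
    """Complex ideal ratio mask M = S / Y, formula (2)."""
    Yr, Yi, Sr, Si = Y[:, 0], Y[:, 1], S[:, 0], S[:, 1]
    denom = Yr ** 2 + Yi ** 2 + eps
    return torch.stack([(Yr * Sr + Yi * Si) / denom,
                        (Yr * Si - Yi * Sr) / denom], dim=1)

def apply_mask(M, Y):
    """Complex multiplication S = M * Y, formula (3)."""
    Mr, Mi, Yr, Yi = M[:, 0], M[:, 1], Y[:, 0], Y[:, 1]
    return torch.stack([Mr * Yr - Mi * Yi, Mr * Yi + Mi * Yr], dim=1)
```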

8. Extract independent speech features from the time-domain signals output by the first stage. First, the first-stage separated speech x'A, x'B is converted to the time-frequency domain by the STFT of formula (1) to obtain the complex spectrograms S'A, S'B of the separated speakers, which are then passed through a ResNet-18 network to obtain each speaker's speech features, denoted αA and αB, with dimension 128*1. Since these speech features come from the separated speech, they represent the speaker's identity characteristics to some extent and can provide effective identity information for the second-stage separation.
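One way to realize this extractor is a ResNet-18 whose first convolution accepts the 2-channel complex spectrogram and whose final layer outputs 128 dimensions; these adaptations are assumptions, since the text only names ResNet-18 and the 128*1 output size:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SpeechEmbedder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        net = resnet18(weights=None)
        net.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 2-channel input
        net.fc = nn.Linear(net.fc.in_features, dim)                                   # 128-d output
        self.net = net

    def forward(self, spec):                 # spec: (B, 2, 257, 256) separated-speech spectrogram
        return self.net(spec)                # (B, 128) speaker-discriminative feature alpha

alpha_a = SpeechEmbedder()(torch.randn(1, 2, 257, 256))
```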

9. Acquire the speaker's visual information synchronized in time with the mixed speech and preprocess it. First read the video file and intercept a 2.55 s segment. The video sampling rate is 75 frames per second, so a 64-frame image sequence can be obtained. Since the speaker's face, apart from the lip region, does not change much over a short speech segment, one facial image frame is randomly selected as the static visual information in order to reduce the complexity of data processing while retaining the speaker's identity characteristics; the 64-frame image sequence is then cropped, a lip region of size 88*88 is selected, and an h5 file of the lip sequence is generated as the dynamic visual information.

10. Extract the static visual features containing identity information and the dynamic visual features containing speech content information. The lip images are first normalized and padded to improve the accuracy and stability of the feature extraction model. The preprocessed lip data passes through a 3D convolutional layer and a ShuffleNet v2 network, and then through a TCN to extract time-series features; the resulting lip features have dimension 512*1*64 and are denoted flip_A, flip_B. The facial image is then standardized and resized to speed up the convergence of the feature extraction model. Since the facial image is a color image with 3 channels, the preprocessed facial data has size 3*224*224, and a ResNet-18 residual network extracts a feature of dimension 128*1. To fuse the lip features and the facial features, the facial features are replicated along the time dimension so that, after conversion, they have the same time dimension as the lip sequence features, i.e. 128*1*64, denoted fface_A, fface_B.
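The sketch below follows the stated shapes (64 lip frames of 88*88, one 3*224*224 face image, 512- and 128-dimensional outputs). The lip backbone is greatly simplified: a small per-frame CNN and a two-layer dilated Conv1d stand in for ShuffleNet v2 and the TCN, so this is an assumption-heavy illustration of the data flow rather than the actual networks:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class LipEncoder(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.front3d = nn.Sequential(                       # 3D convolutional front-end
            nn.Conv3d(1, 64, (5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64), nn.ReLU(),
            nn.MaxPool3d((1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)))
        self.frame2d = nn.Sequential(                       # stand-in for ShuffleNet v2
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.tcn = nn.Sequential(                           # stand-in for the temporal conv net
            nn.Conv1d(dim, dim, 3, padding=1), nn.PReLU(),
            nn.Conv1d(dim, dim, 3, padding=2, dilation=2), nn.PReLU())

    def forward(self, lips):                # lips: (B, 1, 64, 88, 88) grayscale lip frames
        f = self.front3d(lips)              # (B, 64, 64, H', W')
        b, c, t = f.shape[:3]
        f = self.frame2d(f.transpose(1, 2).reshape(b * t, c, *f.shape[3:]))  # per-frame features
        f = f.view(b, t, -1).transpose(1, 2)        # (B, 512, 64)
        return self.tcn(f)                          # f_lip: (B, 512, 64)

class FaceEncoder(nn.Module):
    def __init__(self, dim=128, t_steps=64):
        super().__init__()
        net = resnet18(weights=None)
        net.fc = nn.Linear(net.fc.in_features, dim)
        self.net, self.t = net, t_steps

    def forward(self, face):                        # face: (B, 3, 224, 224) static face image
        f = self.net(face)                          # (B, 128) identity feature
        return f.unsqueeze(-1).repeat(1, 1, self.t) # replicated over time: (B, 128, 64)

f_lip = LipEncoder()(torch.randn(1, 1, 64, 88, 88))
f_face = FaceEncoder()(torch.randn(1, 3, 224, 224))
```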

11. Obtain the speech features of the mixed speech and perform multi-modal feature fusion. The mixed-speech complex spectrogram Smix first passes through the U-Net downsampling layers to obtain the mixed speech feature αmix of dimension 512*1*64. The speech features αA, αB extracted from the separated speech are then transformed along the time dimension to match the dimension of the visual features, and the speaker's visual features flip_A, flip_B, fface_A, fface_B and the acoustic features αA, αB, αmix are concatenated, finally yielding fused multi-modal features of dimension 2048*1*64.
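The fusion itself is a channel-wise concatenation once every stream shares the 64-step time axis; the shapes below reproduce the stated 512+512+128+128+128+128+512 = 2048 channels (all inputs are assumed to be already expanded to (B, C, 64)):

```python
import torch

def fuse(f_lip_a, f_lip_b, f_face_a, f_face_b, alpha_a, alpha_b, alpha_mix):
    # Concatenate all visual and acoustic streams along the channel dimension.
    return torch.cat([f_lip_a, f_lip_b, f_face_a, f_face_b,
                      alpha_a, alpha_b, alpha_mix], dim=1)     # (B, 2048, 64)

fused = fuse(torch.randn(1, 512, 64), torch.randn(1, 512, 64),
             torch.randn(1, 128, 64), torch.randn(1, 128, 64),
             torch.randn(1, 128, 64), torch.randn(1, 128, 64),
             torch.randn(1, 512, 64))
print(fused.shape)      # torch.Size([1, 2048, 64])
```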

12. The above multi-modal features pass through separation network 2 to obtain the masks M”A, M”B of the separated target speech. Separation network 2 consists of the upsampling layers of a U-Net.

13. Multiply the target masks M”A, M”B by the mixed-speech complex spectrogram Smix to obtain the complex spectrograms S”A, S”B of the speakers separated in the second stage. Apply the inverse short-time Fourier transform (iSTFT) to S”A, S”B to obtain the time-domain speech signals x”A, x”B of the target speakers. The calculation is as follows:

S”A=M”A*Smix (4)S” A = M” A *S mix (4)

S”B=M”B*Smix (5)S” B = M” B *S mix (5)

x”A=iSTFT(S”A) (6)x” A = iSTFT(S” A ) (6)

x”B=iSTFT(S”B) (7)x” B = iSTFT(S” B ) (7)
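Formulas (4)-(7) amount to a complex multiplication followed by an inverse STFT with the same parameters as before; a sketch (Hann window assumed, as in the earlier snippet):

```python
import torch

N_FFT, WIN, HOP = 512, 400, 160

def mask_to_wave(M, S_mix, length=40800):
    # Complex multiplication S'' = M'' * S_mix on (B, 2, F, T) real/imaginary stacks.
    Mr, Mi, Yr, Yi = M[:, 0], M[:, 1], S_mix[:, 0], S_mix[:, 1]
    S = torch.complex(Mr * Yr - Mi * Yi, Mr * Yi + Mi * Yr)     # (B, F, T), complex
    return torch.istft(S, n_fft=N_FFT, hop_length=HOP, win_length=WIN,
                       window=torch.hann_window(WIN), length=length)

x_a = mask_to_wave(torch.randn(1, 2, 257, 256), torch.randn(1, 2, 257, 256))
print(x_a.shape)        # torch.Size([1, 40800])
```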

III. Dynamic adjustment of the two-stage loss-function weights

In the present invention, the two separation stages and all network modules are trained simultaneously to reach the optimization objective. The loss function of the overall network architecture is defined as follows:

loss = λ1·loss1 + λ2·loss2    (8)

loss1 = −10·log10(||xtarget||² / ||xnoise||²)    (9)

loss2=||MA-M’A||+||MB-M’B|| (10)loss 2 =||M A -M' A ||+||M B -M' B || (10)

where loss1 and loss2 are the training loss functions of the two stages, and λ1 and λ2 are their training weights. For loss1, xtarget is defined as

xtarget = (<s', s> / ||s||²)·s

and xnoise is defined as s' − xtarget, where s' denotes the separated speech signal and s denotes the clean speech signal.
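A sketch of losses (8)-(10) under the definitions above. Formula (9) is reconstructed as a negative SI-SNR-style ratio between xtarget and xnoise, and the norm in formula (10) is taken here as a mean absolute error; both choices are assumptions consistent with, but not fully pinned down by, the text:

```python
import torch

def loss_stage1(s_sep, s_clean, eps=1e-8):
    """loss1 = -10 log10(||x_target||^2 / ||x_noise||^2), formula (9) as reconstructed."""
    x_target = (torch.sum(s_sep * s_clean, -1, keepdim=True) /
                (torch.sum(s_clean ** 2, -1, keepdim=True) + eps)) * s_clean
    x_noise = s_sep - x_target
    ratio = torch.sum(x_target ** 2, -1) / (torch.sum(x_noise ** 2, -1) + eps)
    return -10 * torch.log10(ratio + eps).mean()

def loss_stage2(M_est_a, M_est_b, M_ref_a, M_ref_b):
    """loss2 = ||MA - M'A|| + ||MB - M'B||, formula (10), with an assumed L1 norm."""
    return (M_est_a - M_ref_a).abs().mean() + (M_est_b - M_ref_b).abs().mean()

def total_loss(l1, l2, lam1=1.0, lam2=1.0):
    return lam1 * l1 + lam2 * l2                    # formula (8)
```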

To obtain effective speaker speech features from the first-stage separated speech and further improve the second-stage separation, the present invention proposes a method for dynamically adjusting the weights of the two-stage loss function. In the initial state of training, the speech separated by the first stage is of poor quality, and the quality of the extracted speaker speech features is correspondingly poor; the loss weights of the two stages are therefore set to λ1 = λ2 = 1, and the separation networks of both stages are trained and optimized simultaneously. As the quality of the first-stage separated speech improves, the corresponding speech features become more discriminative, and the second-stage separation can then be significantly improved. When the separation performance of the two stages reaches a threshold relationship, the weights are set to λ1 = 1, λ2 = 2, and training focuses on the second-stage separation network. This threshold relationship is judged from the separation losses of the two stages. Let the loss of single-stage audio-only separation be the negative of the source-to-distortion ratio (SDR), defined as e0; let the loss of the first-stage separation be the negative SDR of its separated speech, defined as e1; and let the loss of the second-stage separation be the negative SDR of its separated speech, defined as e2. When e1 − e0 < e1 − e2, the speech separated in the first stage has improved the second-stage separation, i.e. its speech features are effective for the second stage, so the weight assignment is adjusted. SDR is defined as follows:

SDR = 10·log10(||starget||² / ||einterf + enoise + eartif||²)    (11)

where starget, einterf, enoise and eartif denote the target speaker's speech, the interference produced by other speakers, the interference of noise, and the interference introduced by other artificial processing, respectively.
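The weight schedule described above reduces to a simple rule: once e1 − e0 < e1 − e2 (equivalently e2 < e0, i.e. the second stage beats the single-stage audio-only baseline), emphasis shifts to the second stage. A sketch:

```python
def adjust_weights(e0, e1, e2):
    """e0: negative SDR of a single-stage audio-only baseline,
       e1 / e2: negative SDR of the stage-one / stage-two outputs."""
    if e1 - e0 < e1 - e2:          # i.e. e2 < e0: stage two now beats the baseline
        return 1.0, 2.0            # lambda1 = 1, lambda2 = 2, focus on stage two
    return 1.0, 1.0                # otherwise keep joint training with equal weights

lam1, lam2 = adjust_weights(e0=-8.0, e1=-7.5, e2=-9.0)   # illustrative numbers only
```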

Embodiment 2

A two-stage speech separation system based on visual guidance, comprising:

a first separation module configured to, in the first stage, separate the acquired mixed speech in the time domain to obtain individual speaker speech;

a second separation module configured to, in the second stage, extract independent speech features carrying speaker information from the audio-only separation result of the first stage, mine latent correlated and complementary features between the visual and audio modalities, and fuse the visual features with the speech time-frequency-domain features before separating them, finally obtaining the separated target speech;

a dynamic weight adjustment module that dynamically adjusts the weights of the two stages according to the performance of their separation models, so as to make maximal use of the independent speech features extracted in the first stage to assist the second stage and achieve clean target-speaker speech separation.

Those skilled in the art should understand that the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a general-purpose computer, a special-purpose computer, an embedded processor, or a processor of other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are performed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

The above are only preferred embodiments of the present invention and are not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Although the specific embodiments of the present invention have been described above with reference to the accompanying drawings, they do not limit the protection scope of the present invention. Those skilled in the art should understand that, on the basis of the technical solution of the present invention, various modifications or variations that can be made without creative effort are still within the protection scope of the present invention.

Claims (10)

1. A two-stage voice separation method based on visual guidance is characterized by comprising the following steps:
in the first stage, separating the acquired mixed voice in the time domain to obtain roughly separated speaker voice;
in the second stage, firstly, the independent sound characteristics with the speaker identity information are extracted from the roughly separated speech in the first stage, then the potential relevant characteristics and complementary characteristics between the visual mode and the audio mode are mined, secondly, the two modes of the visual characteristic and the speech time-frequency domain characteristic are fused and then separated, and finally the separated target speech is obtained after the weights of the two stages are dynamically adjusted.
2. The two-stage voice separation method based on visual guidance as claimed in claim 1, wherein the first stage comprises the following specific processes:
coding the obtained mixed voice by using a coder, and extracting the characteristics of the mixed voice;
and separating the mixed voice characteristics by using a first separation network to obtain a mask of the target voice, multiplying the mask and the mixed voice characteristics, and then decoding to obtain a time domain signal of the target voice.
3. The method as claimed in claim 2, wherein the specific process of separating the mixed speech features in the first stage comprises processing the mixed speech features by using the first separation network, wherein the first separation network is a temporal convolutional network structure and comprises a normalization layer and a plurality of identical stack modules, each of which consists of a full convolution layer, a dilated convolution layer and a residual module, and the output of the last stack module passes through a convolution layer and a PReLU activation layer to obtain separated target masks;
or, the target voice feature is obtained by multiplying the mixed voice feature and the mask of the target voice, and the target voice feature is decoded to obtain the time domain signal of the target voice.
4. The two-stage voice separation method based on visual guidance as claimed in claim 1, wherein the second stage comprises the following specific processes:
transforming the mixed voice to obtain a complex spectrum image of the mixed voice, and acquiring a complex spectrum mask of the real pure voice according to the complex spectrum image;
converting the time domain signal of the target voice acquired in the first stage to obtain a separated complex spectrogram of each speaker, and extracting the independent voice feature of each speaker;
and acquiring visual information of the speaker in time synchronization with the mixed voice, preprocessing the visual information, and respectively extracting static visual features and dynamic visual features from the preprocessed visual image.
Extracting mixed voice features from the time-frequency domain information of the mixed voice complex spectrogram, performing multi-modal feature fusion, separating the multi-modal features to obtain a mask of the separated target voice, multiplying the mask and the complex spectrogram of the mixed voice, and performing inverse transformation to obtain a pure voice signal of the target speaker.
5. The method as claimed in claim 4, wherein the specific process of converting the time domain signal of the target speech obtained in the first stage comprises: first performing time-frequency domain conversion on the coarsely separated speech to obtain complex spectrograms; then using the independent speech feature extraction network ResNet-18 to extract independent speech features from the complex spectrogram of each speaker; and then transforming the independent speech features along the time dimension to achieve dimensional consistency of the audio and video modality features.
6. The two-stage voice separation method based on visual guidance as claimed in claim 4, wherein the specific process of obtaining and pre-processing the visual information of the speaker in time synchronization with the mixed voice comprises reading a video file, intercepting a video with a set length to obtain a multi-frame image sequence, and randomly selecting a frame of facial image as static visual information; and then cutting each frame of image sequence, selecting a lip region with a set size, and generating a file of the lip sequence as dynamic visual information.
7. The two-stage voice separation method based on visual guidance as claimed in claim 6, wherein the specific process of extracting visual features includes normalizing lip images and data filling, the preprocessed lip data sequentially passes through a 3D convolutional layer and a ShuffleNet v2 network, and then passes through a time convolutional network structure to extract time series features, and finally dynamic visual features are obtained, wherein the dynamic visual features include content information of voice;
standardizing and sizing the face image, and extracting static visual features through a residual error network ResNet-18, wherein the static visual features comprise identity information of speakers with distinctiveness; the static visual features are transformed to have the same time dimension as the dynamic visual sequence features.
8. The two-stage voice separation method based on visual guidance as claimed in claim 4, wherein the specific process of performing multi-modal feature fusion includes firstly obtaining mixed voice features by a mixed voice complex spectrogram through a U-Net down-sampling network layer, then performing cascade splicing on visual features, independent voice features and mixed voice features of a speaker, and finally obtaining fused multi-modal features;
or separating the multi-modal features by utilizing a second separation network, wherein the second separation network is an up-sampling network layer of the U-Net.
9. The method as claimed in claim 1, wherein the weight of the loss function for the two-stage speech separation is dynamically adjusted to maximally utilize the independent speech features of the first stage to assist the separation of the second stage.
10. A two-stage speech separation system based on visual guidance, comprising:
the first separation module is configured to separate the acquired mixed voice in a time domain to obtain independent speaker voice in a first stage;
the second separation module is configured to extract independent voice features with speaker information by means of the pure audio frequency separation result of the first stage at the second stage, simultaneously excavate potential relevant features and complementary features between the visual modality and the audio modality, perform fusion of the visual features and the voice time-frequency domain features, and then separate the two modalities to finally obtain separated target voice;
and the dynamic weight adjusting module dynamically adjusts the weight of the two stages according to the performance of the separation model of the two stages so as to utilize the independent voice characteristics extracted in the first stage to the maximum extent to assist the second stage and realize the voice separation of the pure target speaker.
CN202211317835.0A 2022-10-26 2022-10-26 Two-stage voice separation method and system based on visual guidance Pending CN115691539A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211317835.0A CN115691539A (en) 2022-10-26 2022-10-26 Two-stage voice separation method and system based on visual guidance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211317835.0A CN115691539A (en) 2022-10-26 2022-10-26 Two-stage voice separation method and system based on visual guidance

Publications (1)

Publication Number Publication Date
CN115691539A true CN115691539A (en) 2023-02-03

Family

ID=85098861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211317835.0A Pending CN115691539A (en) 2022-10-26 2022-10-26 Two-stage voice separation method and system based on visual guidance

Country Status (1)

Country Link
CN (1) CN115691539A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117238311A (en) * 2023-11-10 2023-12-15 深圳市齐奥通信技术有限公司 Speech separation enhancement method and system in multi-sound source and noise environment
CN117238311B (en) * 2023-11-10 2024-01-30 深圳市齐奥通信技术有限公司 Speech separation enhancement method and system in multi-sound source and noise environment
CN118016093A (en) * 2024-02-26 2024-05-10 山东大学 A target speech separation method and system based on cross-modal loss
CN118016093B (en) * 2024-02-26 2024-10-11 山东大学 Target voice separation method and system based on cross-modal loss

Similar Documents

Publication Publication Date Title
Afouras et al. The conversation: Deep audio-visual speech enhancement
US10777215B2 (en) Method and system for enhancing a speech signal of a human speaker in a video using visual information
Wang et al. Speech emotion recognition with dual-sequence LSTM architecture
Zmolikova et al. Neural target speech extraction: An overview
Ephrat et al. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation
Morgado et al. Self-supervised generation of spatial audio for 360 video
CN113035227B (en) A multimodal speech separation method and system
CN111179911A (en) Target voice extraction method, device, equipment, medium and joint training method
CN112863538B (en) Audio-visual network-based multi-modal voice separation method and device
Li et al. Deep audio-visual speech separation with attention mechanism
CN113470671A (en) Audio-visual voice enhancement method and system by fully utilizing visual and voice connection
CN115691539A (en) Two-stage voice separation method and system based on visual guidance
CN114203163A (en) Audio signal processing method and device
CN114360491B (en) Speech synthesis method, device, electronic equipment and computer readable storage medium
Xiong et al. Look&listen: Multi-modal correlation learning for active speaker detection and speech enhancement
Montesinos et al. Vovit: Low latency graph-based audio-visual voice separation transformer
Mun et al. The sound of my voice: Speaker representation loss for target voice separation
CN116469404A (en) Audio-visual cross-mode fusion voice separation method
Zhang et al. Time-domain speech extraction with spatial information and multi speaker conditioning mechanism
CN113593601A (en) Audio-visual multi-modal voice separation method based on deep learning
Hussain et al. A novel speech intelligibility enhancement model based on canonical correlation and deep learning
CN118212929A (en) A personalized Ambisonics speech enhancement method
CN117711421A (en) Two-stage voice separation method based on coordination simple attention mechanism
Deng et al. Vision-Guided Speaker Embedding Based Speech Separation
Liu et al. Multi-modal speech separation based on two-stage feature fusion

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination