
CN115394310B - Neural network-based background voice removing method and system

Info

Publication number
CN115394310B
CN115394310B
Authority
CN
China
Prior art keywords
voice
real
module
encoder
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210998674.XA
Other languages
Chinese (zh)
Other versions
CN115394310A (en)
Inventor
潘伟
唐镇坤
陈盛福
马志豪
刘黎思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Post Consumer Finance Co ltd
Original Assignee
China Post Consumer Finance Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Post Consumer Finance Co., Ltd.
Priority to CN202210998674.XA
Publication of CN115394310A
Application granted
Publication of CN115394310B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a neural-network-based method and system for removing background voices, comprising the following steps: S1, feeding real-time voice into a noise reduction module for noise reduction processing to obtain noise-reduced voice; S2, performing voice endpoint detection on the noise-reduced voice and intercepting the target voice to obtain an auxiliary audio; S3, feeding the obtained auxiliary audio and the real-time input audio into a background third-party voice elimination module to remove background noise and third-party voice, obtaining auxiliary audio and real-time input audio free of background noise and third-party voice; and S4, merging the cleaned auxiliary audio with the real-time input audio and outputting the processed voice data.

Description

Neural network-based background voice removing method and system
Technical Field
The invention relates to the technical field of voice denoising, and in particular to a neural-network-based method and system for removing background human voice.
Background
A communication voice signal is usually subject to some degree of noise interference, which degrades call quality; it is also frequently contaminated by the voices of background third parties, so that the main speaker's voice is mixed with background speech, further degrading call quality. Target speaker extraction is an effective technique for eliminating background third-party voice: it aims to remove the influence of background third-party voices on the target speaker's speech signal, improving the clarity and quality of that signal.
At present, existing noise reduction technology has the following limitations: noise reduction alone cannot remove background third-party voice; it can only handle noise outside speech segments and fails when third-party speech is present. Furthermore, there is no existing technique for performing such speech processing in real time on a streaming input.
Disclosure of Invention
Based on this, it is necessary to provide a background human voice removing method and system based on a neural network.
In one aspect, an embodiment of the invention provides a neural-network-based background voice removal method, comprising the following steps:
S1, feeding real-time voice into a noise reduction module for noise reduction processing to obtain noise-reduced voice;
S2, performing voice endpoint detection on the noise-reduced voice and intercepting the target voice to obtain an auxiliary audio;
S3, feeding the obtained auxiliary audio and the real-time input audio into a background third-party voice elimination module to remove background noise and third-party voice, obtaining auxiliary audio and real-time input audio free of background noise and third-party voice;
and S4, merging the cleaned auxiliary audio with the real-time input audio and outputting the processed voice data.
Preferably, in step S1 the noise reduction module comprises an encoder, an LSTM module and a decoder: the real-time speech is fed to the encoder for encoding to obtain encoded first intermediate representation data, which is passed to the LSTM module for prediction to obtain second intermediate representation data, which is in turn processed by the decoder to obtain the noise-reduced speech.
Preferably, in step S1 the real-time speech is fed to the encoder and encoded in frames of 30-34 ms length to obtain the encoded first intermediate representation data.
Preferably, in step S1, feeding the real-time speech to the encoder for encoding comprises the following steps: in an encoder composed of L encoding layers, the noisy audio data is taken as input data to obtain the encoded first intermediate representation data, expressed as:
mdata = f(Input)
In this structure the encoder has L layers in total, with initial hidden size H, convolution kernel size K and stride S; the i-th encoding layer comprises a convolution with kernel size K, stride S and 2^(i-1)·H output channels, a ReLU layer, a 1×1 convolution with 2^i·H output channels, and a GLU activation layer.
Preferably, in step S1 the LSTM module has two layers, with hidden size 2^(L-1)·H; for causal prediction a unidirectional LSTM network is used, and for non-causal prediction a BiLSTM is used, followed by a linear layer that fuses the two BiLSTM outputs, obtaining the second intermediate representation data, expressed as:
R(z)=LSTM(z)+z。
Preferably, in step S1 the decoder has L layers in total, where the input of the i-th decoding layer has 2^(i-1)·H channels; a 1×1 convolution with 2^i·H channels is applied, a GLU layer then restores the channel count to 2^(i-1)·H, and finally a transposed convolution is applied. If the i-th layer is not the last layer it is followed by a ReLU layer; if it is the last layer, its output is a single channel and there is no ReLU layer. This is expressed as:
Output = g(mdata)
where mdata denotes the second intermediate representation data produced by the encoder and the LSTM module in sequence, Output denotes the output denoised audio data, and g() denotes the decoding layers of the decoder.
Preferably, in step S2, intercepting the target voice using the voice endpoint detection technique to obtain the auxiliary audio specifically comprises:
S21: extracting features from each 32 ms frame of the noise-reduced signal, the extracted features being the energies of 6 sub-bands of the frame;
S22: training a classifier on a set of data frames from regions known to contain speech or silence, the classifier being a two-dimensional Gaussian model;
S23: computing the log-likelihood ratio of each frequency band and summing the ratios over the 6 sub-bands, then classifying unknown real-time frame data as speech or silence: if the sum exceeds a decision threshold, the frame is classified as a speech signal;
S24: continuously updating the Gaussian model parameters; once more than 2 s of speech signal has been intercepted, stopping the parameter updates and exiting the noise reduction module and the voice endpoint detection module.
Preferably, in step S3 the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder, and eliminating third-party voice using the intercepted auxiliary audio and the real-time input speech specifically comprises:
S31: taking the intercepted 2 s auxiliary speech and the real-time speech as inputs of the speech encoder, the real-time speech being spliced together with the preceding 1 s of voice data;
S32: the speech encoder stacks its results as output, the speech encoder being a multi-scale encoder;
S33: the output of the speaker encoder serves as one of the inputs of the speaker extractor, these outputs being taken together as the speaker extractor's input, the speaker encoder's structure comprising 1-d convolutions and ResBlocks;
S34: one of the speaker extractor's outputs is multiplied with the network's initial input, and the resulting product is passed to the speech decoder.
The invention also provides a neural-network-based background voice removal system, comprising:
a noise reduction module, for performing noise reduction processing on the real-time voice fed into it;
an auxiliary audio module, for performing voice endpoint detection on the noise-reduced voice and intercepting the target voice to obtain an auxiliary audio;
a background third-party voice elimination module, for removing background noise and third-party voice from the obtained auxiliary audio and the real-time input audio, obtaining auxiliary audio and real-time input audio free of background noise and third-party voice;
a merging output module, for merging the cleaned auxiliary audio with the real-time input audio and outputting the processed voice data.
Preferably, the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder.
The invention provides a real-time neural-network-based technique for removing background third-party voice: it removes noise and background third-party voices in real time. By combining a voice denoising technique with a target-speaker voice extraction technique and processing the speech in real time, it yields higher-quality call speech: noise reduction is performed in real time and the target speaker's voice is extracted, further improving call quality, so the method has high application value.
Drawings
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings. Like reference numerals refer to like parts throughout the drawings, and the drawings are not intended to be drawn to scale in actual dimensions, emphasis instead being placed upon illustrating the principles of the invention.
FIG. 1 is an overall flow diagram of a method of an embodiment of the invention;
fig. 2 is a schematic diagram of the method of the preferred embodiment of the present invention.
Detailed Description
The technical solutions of the present invention are described in further detail below with reference to the drawings and specific embodiments, so that those skilled in the art can better understand and implement the present invention; the embodiments, however, do not limit the invention.
As shown in fig. 1-2, in one aspect, an embodiment of the present invention provides a method for removing background human voice based on a neural network, including the following steps:
S1, feeding real-time voice into a noise reduction module for noise reduction processing to obtain noise-reduced voice; preferably, the real-time call voice can be fed to the noise reduction module in segments of 32 ms duration.
S2, performing voice endpoint detection on the noise-reduced voice and intercepting the target voice to obtain an auxiliary audio;
S3, feeding the obtained auxiliary audio and the real-time input audio into a background third-party voice elimination module to remove background noise and third-party voice, obtaining auxiliary audio and real-time input audio free of background noise and third-party voice;
and S4, merging the cleaned auxiliary audio with the real-time input audio and outputting the processed voice data.
In the invention, a denoising technique and a target voice extraction technique are combined: the denoising module is first applied to collect the auxiliary speech required for target voice extraction, and the target voice extraction module then performs the processing. Buffering is applied to the streaming data, so that target voice extraction achieves streaming output without sacrificing performance.
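The following is a minimal Python sketch of this two-phase pipeline over streaming 32 ms frames. It assumes the three modules described below are available as callables; `denoise`, `detect_speech` and `extract_target` are illustrative names, not from the patent, and the 2 s auxiliary-audio requirement is approximated as roughly 63 frames of 32 ms.

```python
def remove_background_voice(frames, denoise, detect_speech, extract_target,
                            aux_len_frames=63):
    """Hedged sketch of steps S1-S4 over a stream of 32 ms frames.

    Phase 1 (S1-S2): denoise each frame and run voice endpoint detection,
    collecting ~2 s (about 63 frames) of target speech as auxiliary audio.
    Phase 2 (S3-S4): once enough auxiliary audio is gathered, the denoise
    and VAD modules are exited and every incoming frame goes through the
    background third-party voice elimination module instead.
    """
    aux, out = [], []
    for frame in frames:
        if len(aux) < aux_len_frames:
            clean = denoise(frame)                  # S1: noise reduction module
            if detect_speech(clean):                # S2: endpoint detection,
                aux.append(clean)                   #     intercept target speech
            out.append(clean)
        else:
            out.append(extract_target(frame, aux))  # S3: remove 3rd-party voice
    return out                                      # S4: merged processed voice
```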
In a preferred embodiment, in step S1 the noise reduction module comprises an encoder, an LSTM module and a decoder: the real-time speech is fed to the encoder for encoding to obtain encoded first intermediate representation data, which is passed to the LSTM module for prediction to obtain second intermediate representation data, which is in turn processed by the decoder to obtain the noise-reduced speech.
In a preferred embodiment, in step S1 the real-time speech is fed to the encoder and encoded in frames of 30-34 ms length to obtain the encoded first intermediate representation data; preferably, encoding is performed with a frame length of 32 ms.
In a preferred embodiment, in step S1, feeding the real-time speech to the encoder for encoding comprises the following steps: in an encoder composed of L encoding layers, the noisy audio data is taken as input data, yielding the encoded first intermediate representation data, expressed as:
mdata = f(Input)
In this structure the encoder has L layers in total, with initial hidden size H, convolution kernel size K and stride S; the i-th encoding layer comprises a convolution with kernel size K, stride S and 2^(i-1)·H output channels, a ReLU layer, a 1×1 convolution with 2^i·H output channels, and a GLU activation layer.
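A minimal PyTorch sketch of one such encoding layer and the stacked encoder follows. Only the layer recipe above (Conv1d, ReLU, 1×1 convolution, GLU, with channels doubling per layer) is taken from the description; the kernel size, stride and hidden-size defaults are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoding layer: Conv1d (kernel K, stride S, 2^(i-1)*H channels)
    -> ReLU -> 1x1 Conv to 2^i*H channels -> GLU (halving back)."""
    def __init__(self, in_ch, ch, kernel=8, stride=4):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, ch, kernel, stride)
        self.relu = nn.ReLU()
        self.expand = nn.Conv1d(ch, 2 * ch, 1)   # 1x1 conv doubles channels
        self.glu = nn.GLU(dim=1)                 # GLU halves them again

    def forward(self, x):
        return self.glu(self.expand(self.relu(self.conv(x))))

class Encoder(nn.Module):
    """mdata = f(Input): L stacked encoding layers over single-channel audio."""
    def __init__(self, layers=5, hidden=48, kernel=8, stride=4):
        super().__init__()
        chs = [1] + [hidden * 2 ** i for i in range(layers)]
        self.layers = nn.ModuleList(
            EncoderLayer(chs[i], chs[i + 1], kernel, stride)
            for i in range(layers))

    def forward(self, x):            # x: (batch, 1, samples), noisy audio
        for layer in self.layers:
            x = layer(x)
        return x                     # first intermediate representation data
```

For example, `Encoder()(torch.randn(1, 1, 16000))` yields the first intermediate representation for one second of 16 kHz audio under these assumed settings.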
In a preferred embodiment, in step S1 the LSTM module is implemented as a sequence network model R with two layers and hidden size 2^(L-1)·H; for causal prediction a unidirectional LSTM network is used, and for non-causal prediction a BiLSTM is used, followed by a linear layer that fuses the two BiLSTM outputs, obtaining the second intermediate representation data, expressed as:
R(z)=LSTM(z)+z。
In a preferred embodiment, in step S1 the decoder has L layers in total, where the input of the i-th decoding layer has 2^(i-1)·H channels; a 1×1 convolution with 2^i·H channels is applied, a GLU layer then restores the channel count to 2^(i-1)·H, and finally a transposed convolution is applied. If the i-th layer is not the last layer it is followed by a ReLU layer; if it is the last layer, its output is a single channel and there is no ReLU layer. This is expressed as:
Output = g(mdata)
where mdata denotes the second intermediate representation data produced by the encoder and the LSTM module in sequence, Output denotes the output denoised audio data, and g() denotes the decoding layers of the decoder.
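A matching PyTorch sketch of one decoding layer follows; stacking L of these with decreasing channel counts, the last constructed with `last=True` and a single output channel, gives Output = g(mdata). The kernel and stride defaults mirror the encoder sketch and are likewise assumptions.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoding layer: 1x1 Conv to 2^i*H channels -> GLU back to
    2^(i-1)*H -> ConvTranspose1d; a ReLU follows on every layer except
    the last, whose transposed convolution emits the waveform channel."""
    def __init__(self, ch, out_ch, kernel=8, stride=4, last=False):
        super().__init__()
        self.expand = nn.Conv1d(ch, 2 * ch, 1)       # 1x1 conv, 2^i*H channels
        self.glu = nn.GLU(dim=1)                     # restore 2^(i-1)*H
        self.deconv = nn.ConvTranspose1d(ch, out_ch, kernel, stride)
        self.act = nn.Identity() if last else nn.ReLU()

    def forward(self, x):
        return self.act(self.deconv(self.glu(self.expand(x))))
```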
In a preferred embodiment, in step S2, intercepting the target voice using the voice endpoint detection technique to obtain the auxiliary audio specifically comprises:
S21: extracting features from each 32 ms frame of the noise-reduced signal, the extracted features being the energies of 6 sub-bands of the frame;
S22: training a classifier on a set of data frames from regions known to contain speech or silence, the classifier being a two-dimensional Gaussian model;
S23: computing the log-likelihood ratio of each frequency band and summing the ratios over the 6 sub-bands, then classifying unknown real-time frame data as speech or silence: if the sum exceeds a decision threshold, the frame is classified as a speech signal;
S24: continuously updating the Gaussian model parameters; once more than 2 s of speech signal has been intercepted, stopping the parameter updates and exiting the noise reduction module and the voice endpoint detection module.
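The sketch below illustrates steps S21-S23 in plain NumPy. It simplifies the two-dimensional Gaussian classifier to one univariate Gaussian per sub-band and omits the online parameter updates of S24; the band split, function names and threshold are assumptions for illustration only.

```python
import numpy as np

def subband_energies(frame, bands=6):
    """S21: energies of 6 frequency sub-bands of one 32 ms frame."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    return np.array([chunk.sum() for chunk in np.array_split(spec, bands)])

def log_likelihood_ratio(energies, speech_model, noise_model):
    """S23: sum over the 6 sub-bands of the per-band log-likelihood ratio.
    Each model is a list of (mean, var) pairs, one Gaussian per band,
    trained as in S22 on frames known to be speech or silence."""
    def logpdf(x, mean, var):
        return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
    return sum(logpdf(e, *s) - logpdf(e, *n)
               for e, s, n in zip(energies, speech_model, noise_model))

def is_speech(frame, speech_model, noise_model, threshold=0.0):
    """Classify a frame as speech when the summed ratio exceeds the threshold."""
    return log_likelihood_ratio(subband_energies(frame),
                                speech_model, noise_model) > threshold
```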
In a preferred embodiment, in step S3 the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder, and eliminating third-party voice using the intercepted auxiliary audio and the real-time input speech specifically comprises:
S31: taking the intercepted 2 s auxiliary speech and the real-time speech as inputs of the speech encoder, the real-time speech being spliced together with the preceding 1 s of voice data;
S32: the speech encoder stacks its results as output, the speech encoder being a multi-scale encoder;
the formula is expressed as follows:
mdata i =Concat(f 1 (Input i ),f 2 (Input i ),f 3 (Input i ))
f i (x)=Covd1d(Pad(x))
in this section f i () Encoders expressed as different scales, in which structure input x is input The left and right sides of the Input buffer are filled with 0 to different lengths, and then are convolved for 1d, and finally results Concat of 3 scales are taken as the Input of the speaker encoder and the separator, wherein the Input is divided into two, the first Input 1 Is Input of real-time speech, and a second Input 2 Is the registered voice of the target voice to be extracted, i.e., the intercepted auxiliary audio.
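A PyTorch sketch of such a multi-scale speech encoder follows. The three kernel sizes, channel count and stride are illustrative assumptions; the padding is chosen so that all three branches produce the same number of frames and can be concatenated along the channel dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSpeechEncoder(nn.Module):
    """mdata_i = Concat(f_1(Input_i), f_2(Input_i), f_3(Input_i)),
    f_i(x) = Conv1d(Pad(x)): three 1-d convolutions at different scales
    over zero-padded copies of the input, concatenated channel-wise."""
    def __init__(self, out_ch=256, kernels=(20, 80, 160), stride=10):
        super().__init__()
        self.kernels = kernels
        self.branches = nn.ModuleList(
            nn.Conv1d(1, out_ch, k, stride) for k in kernels)

    def forward(self, x):                       # x: (batch, 1, samples)
        outs = []
        for conv, k in zip(self.branches, self.kernels):
            pad = k - self.kernels[0]           # align frame counts across scales
            outs.append(conv(F.pad(x, (pad // 2, pad - pad // 2))))
        return torch.cat(outs, dim=1)           # stacked multi-scale features
```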
S33: the output of the speaker encoder serves as one of the inputs of the speaker extractor, these outputs then being taken together as the speaker extractor's input; the main structure of the speaker encoder comprises 1-d convolutions and ResBlocks, expressed as follows:
OUTPUT_embedding = g(mdata_2)
g(x) = Conv1d(ResBlock(Conv1d(x)) ×3)
Here g() denotes the speaker encoder; its input mdata_2 is the output of the speech encoder for Input_2. A 1-d convolution and 3 ResBlocks are applied, followed by another 1-d convolution, completing the speaker encoding, where the ResBlock structure is:
ResBlock(x) = MaxPool(PReLU(x + Conv1d(PReLU(Conv1d(x)))))
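The following PyTorch sketch mirrors these two expressions. Kernel sizes, pooling factor and embedding width are assumptions, as is the final mean-pooling over time, which reduces the encoded auxiliary audio to a single speaker embedding vector.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """ResBlock(x) = MaxPool(PReLU(x + Conv1d(PReLU(Conv1d(x)))))."""
    def __init__(self, ch, kernel=3, pool=3):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(ch, ch, kernel, padding=kernel // 2)
        self.act1, self.act2 = nn.PReLU(), nn.PReLU()
        self.pool = nn.MaxPool1d(pool)

    def forward(self, x):
        y = self.conv2(self.act1(self.conv1(x)))   # inner Conv-PReLU-Conv
        return self.pool(self.act2(x + y))         # residual add, PReLU, pool

class SpeakerEncoder(nn.Module):
    """g(x) = Conv1d(ResBlock(Conv1d(x)) x3): 1-d conv, three ResBlocks,
    then a final 1-d conv yielding OUTPUT_embedding."""
    def __init__(self, in_ch, emb_ch=256):
        super().__init__()
        self.head = nn.Conv1d(in_ch, emb_ch, 1)
        self.blocks = nn.Sequential(*[ResBlock(emb_ch) for _ in range(3)])
        self.tail = nn.Conv1d(emb_ch, emb_ch, 1)

    def forward(self, mdata2):             # mdata_2: encoded auxiliary audio
        e = self.tail(self.blocks(self.head(mdata2)))
        return e.mean(dim=2)               # pool over time: one embedding vector
```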
S34: one of the speaker extractor's outputs is multiplied with the network's initial input, and the resulting product is passed to the speech decoder.
The output of the speaker encoder becomes one of the inputs of the speaker extractor; it is then taken together with mdata_1 as the speaker extractor's input. The structure of the speaker extractor is expressed as follows:
OUTPUT_Mask = ReLU(Conv1d(OUTPUT_StackedTCNs))
OUTPUT_StackedTCNs = S(Conv1d(mdata_1), OUTPUT_embedding) ×4
S(u, v) = S'(u + Conv1d(PReLU(DeConv1d(PReLU(Conv1d(Concat(u, v))))))) ×6
S'(x) = Conv1d(PReLU(DeConv1d(PReLU(Conv1d(x)))))
In this part the output OUTPUT_Mask is produced, where OUTPUT_StackedTCNs consists of 4 nested S(u, v) stacks, each in turn containing 6 nested S'(x) blocks; DeConv1d in S(u, v) and S'(x) refers to a dilated depthwise-separable convolution, which achieves an effect similar to an RNN while reducing the amount of computation.
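A compact PyTorch sketch of this stacked-TCN speaker extractor follows. The channel widths and exponentially growing dilations are assumptions; the speaker embedding v is concatenated onto the features only in the first block of each stack, matching Concat(u, v) in S(u, v), and a dilated depthwise convolution stands in for the patent's DeConv1d.

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One S'(x)-style block: Conv1d -> PReLU -> dilated depthwise-separable
    conv -> PReLU -> Conv1d, with a residual connection; the first block of
    a stack also concatenates the speaker embedding v (Concat(u, v))."""
    def __init__(self, ch, emb_ch, hid=512, dilation=1, fuse_emb=False):
        super().__init__()
        self.fuse_emb = fuse_emb
        self.inp = nn.Conv1d(ch + (emb_ch if fuse_emb else 0), hid, 1)
        self.dw = nn.Conv1d(hid, hid, 3, padding=dilation,
                            dilation=dilation, groups=hid)  # depthwise, dilated
        self.out = nn.Conv1d(hid, ch, 1)
        self.act1, self.act2 = nn.PReLU(), nn.PReLU()

    def forward(self, u, v=None):
        x = u
        if self.fuse_emb:              # broadcast embedding over time, concat
            x = torch.cat([u, v.unsqueeze(2).expand(-1, -1, u.size(2))], dim=1)
        return u + self.out(self.act2(self.dw(self.act1(self.inp(x)))))

class SpeakerExtractor(nn.Module):
    """OUTPUT_StackedTCNs: 4 stacks (x4) of 6 blocks (x6), then
    OUTPUT_Mask = ReLU(Conv1d(...))."""
    def __init__(self, ch, emb_ch):
        super().__init__()
        self.stacks = nn.ModuleList(
            nn.ModuleList(TCNBlock(ch, emb_ch, dilation=2 ** b, fuse_emb=(b == 0))
                          for b in range(6))
            for _ in range(4))
        self.mask = nn.Sequential(nn.Conv1d(ch, ch, 1), nn.ReLU())

    def forward(self, mdata1, emb):    # mdata_1 features, speaker embedding
        x = mdata1
        for stack in self.stacks:
            for block in stack:
                x = block(x, emb) if block.fuse_emb else block(x)
        return self.mask(x)            # OUTPUT_Mask
```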
S35: s3, the real-time background third person voice removing technology based on the neural network, and an OUTPUT OUTPUT by the speaker extractor Mask Initial input x to the network input I.e. multiplication, the result is output to the speech decoder, which is expressed as follows:
OUTPUT=D(OUTPUT Mask *x input )
D(x)=ConvTrans1D(x)
in this architecture, speech is reconstructed through a speech decoder to obtain the final required output.
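A minimal PyTorch sketch of the speech decoder follows; the transposed-convolution kernel and stride are assumptions that would mirror the speech encoder, and `encoded_input` stands for the network's initial input representation x_input of the real-time audio.

```python
import torch
import torch.nn as nn

class SpeechDecoder(nn.Module):
    """OUTPUT = D(OUTPUT_Mask * x_input), D(x) = ConvTranspose1d(x):
    apply the mask element-wise, then reconstruct the waveform."""
    def __init__(self, in_ch, kernel=20, stride=10):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(in_ch, 1, kernel, stride)

    def forward(self, mask, encoded_input):
        return self.deconv(mask * encoded_input)   # masked features -> waveform
```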
The invention also provides a neural-network-based background voice removal system, comprising:
a noise reduction module, for performing noise reduction processing on the real-time voice fed into it;
an auxiliary audio module, for performing voice endpoint detection on the noise-reduced voice and intercepting the target voice to obtain an auxiliary audio;
a background third-party voice elimination module, for removing background noise and third-party voice from the obtained auxiliary audio and the real-time input audio, obtaining auxiliary audio and real-time input audio free of background noise and third-party voice;
a merging output module, for merging the cleaned auxiliary audio with the real-time input audio and outputting the processed voice data.
In a preferred embodiment, the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder.
The above description is only a preferred embodiment of the present invention and is not intended to limit its scope; all equivalent structural and process modifications made using the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of the present invention.

Claims (10)

1. A neural-network-based background voice removal method, characterized by comprising the following steps:
S1, feeding real-time voice into a noise reduction module for noise reduction processing to obtain noise-reduced voice;
S2, performing voice endpoint detection on the noise-reduced voice and intercepting the target voice to obtain an auxiliary audio;
S3, feeding the obtained auxiliary audio and the real-time input audio into a background third-party voice elimination module to remove background noise and third-party voice, obtaining auxiliary audio and real-time input audio free of background noise and third-party voice;
and S4, merging the cleaned auxiliary audio with the real-time input audio and outputting the processed voice data.
2. The method of claim 1, wherein in step S1 the noise reduction module comprises an encoder, an LSTM module and a decoder, the real-time speech being fed to the encoder for encoding to obtain encoded first intermediate representation data, which is passed to the LSTM module for prediction to obtain second intermediate representation data, which is in turn processed by the decoder to obtain the noise-reduced speech.
3. The method of claim 2, wherein in step S1 the real-time speech is fed to the encoder and encoded in frames of 30-34 ms length to obtain the encoded first intermediate representation data.
4. The method of claim 3, wherein in step S1 feeding the real-time speech to the encoder for encoding comprises the following steps: in an encoder composed of L encoding layers, the noisy audio data is taken as input data to obtain the encoded first intermediate representation data, expressed as:
mdata = f(Input)
In this structure the encoder has L layers in total, with initial hidden size H, convolution kernel size K and stride S; the i-th encoding layer comprises a convolution with kernel size K, stride S and 2^(i-1)·H output channels, a ReLU layer, a 1×1 convolution with 2^i·H output channels, and a GLU activation layer.
5. The method as claimed in claim 4, wherein in step S1 the LSTM module has two layers, with hidden size 2^(L-1)·H; for causal prediction a unidirectional LSTM network is used, and for non-causal prediction a BiLSTM is used, followed by a linear layer that fuses the two BiLSTM outputs, obtaining the second intermediate representation data, expressed as:
R(z)=LSTM(z)+z。
6. The method of claim 5, wherein in step S1 the decoder has L layers, where the input of the i-th decoding layer has 2^(i-1)·H channels; a 1×1 convolution with 2^i·H channels is applied, a GLU layer then restores the channel count to 2^(i-1)·H, and finally a transposed convolution is applied; if the i-th layer is not the last layer it is followed by a ReLU layer, and if it is the last layer its output is a single channel with no ReLU layer, expressed as:
Output = g(mdata)
wherein mdata denotes the second intermediate representation data produced by the encoder and the LSTM module in sequence, Output denotes the output denoised audio data, and g() denotes the decoding layers of the decoder.
7. The method according to claim 1, wherein in step S2 intercepting the target voice using the voice endpoint detection technique to obtain the auxiliary audio comprises:
S21: extracting features from each 32 ms frame of the noise-reduced signal, the extracted features being the energies of 6 sub-bands of the frame;
S22: training a classifier on a set of data frames from regions known to contain speech or silence, the classifier being a two-dimensional Gaussian model;
S23: computing the log-likelihood ratio of each frequency band and summing the ratios over the 6 sub-bands, then classifying unknown real-time frame data as speech or silence: if the sum exceeds a decision threshold, the frame is classified as a speech signal;
S24: continuously updating the Gaussian model parameters; once more than 2 s of speech signal has been intercepted, stopping the parameter updates and exiting the noise reduction module and the voice endpoint detection module.
8. The method as claimed in claim 1, wherein in step S3 the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder, and eliminating third-party voice using the intercepted auxiliary audio and the real-time input speech specifically comprises:
S31: taking the intercepted 2 s auxiliary speech and the real-time speech as inputs of the speech encoder, the real-time speech being spliced together with the preceding 1 s of voice data;
S32: the speech encoder stacks its results as output, the speech encoder being a multi-scale encoder;
S33: the output of the speaker encoder serves as one of the inputs of the speaker extractor, these outputs being taken together as the speaker extractor's input, the speaker encoder's structure comprising 1-d convolutions and ResBlocks;
S34: one of the speaker extractor's outputs is multiplied with the network's initial input, and the resulting product is passed to the speech decoder.
9. A neural-network-based background voice removal system, characterized by comprising:
a noise reduction module, for performing noise reduction processing on the real-time voice fed into it;
an auxiliary audio module, for performing voice endpoint detection on the noise-reduced voice and intercepting the target voice to obtain an auxiliary audio;
a background third-party voice elimination module, for removing background noise and third-party voice from the obtained auxiliary audio and the real-time input audio, obtaining auxiliary audio and real-time input audio free of background noise and third-party voice;
a merging output module, for merging the cleaned auxiliary audio with the real-time input audio and outputting the processed voice data.
10. The system of claim 9, wherein the background third party voice removal module comprises a speech encoder, a speaker extractor, and a speech decoder.
CN202210998674.XA 2022-08-19 2022-08-19 Neural network-based background voice removing method and system Active CN115394310B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210998674.XA CN115394310B (en) 2022-08-19 2022-08-19 Neural network-based background voice removing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210998674.XA CN115394310B (en) 2022-08-19 2022-08-19 Neural network-based background voice removing method and system

Publications (2)

Publication Number Publication Date
CN115394310A CN115394310A (en) 2022-11-25
CN115394310B (en) 2023-04-07

Family

ID=84120602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210998674.XA Active CN115394310B (en) 2022-08-19 2022-08-19 Neural network-based background voice removing method and system

Country Status (1)

Country Link
CN (1) CN115394310B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862581A (en) * 2023-02-10 2023-03-28 杭州兆华电子股份有限公司 Secondary elimination method and system for repeated pattern noise

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971741A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 The method and system for the voice de-noising that voice is separated in real time
US9947333B1 (en) * 2012-02-10 2018-04-17 Amazon Technologies, Inc. Voice interaction architecture with intelligent background noise cancellation
CN109087658A (en) * 2018-07-16 2018-12-25 安徽国通亿创科技股份有限公司 A kind of online interaction live streaming noise processed system
CN110120225A (en) * 2019-04-01 2019-08-13 西安电子科技大学 A kind of audio defeat system and method for the structure based on GRU network
WO2020034779A1 (en) * 2018-08-14 2020-02-20 Oppo广东移动通信有限公司 Audio processing method, storage medium and electronic device
CN111128215A (en) * 2019-12-24 2020-05-08 声耕智能科技(西安)研究院有限公司 Single-channel real-time noise reduction method and system
CN111754982A (en) * 2020-06-19 2020-10-09 平安科技(深圳)有限公司 Noise elimination method and device for voice call, electronic equipment and storage medium
CN114898762A (en) * 2022-05-07 2022-08-12 北京快鱼电子股份公司 Real-time voice noise reduction method and device based on target person and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12062369B2 (en) * 2020-09-25 2024-08-13 Intel Corporation Real-time dynamic noise reduction using convolutional networks

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9947333B1 (en) * 2012-02-10 2018-04-17 Amazon Technologies, Inc. Voice interaction architecture with intelligent background noise cancellation
CN106971741A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 The method and system for the voice de-noising that voice is separated in real time
CN109087658A (en) * 2018-07-16 2018-12-25 安徽国通亿创科技股份有限公司 A kind of online interaction live streaming noise processed system
WO2020034779A1 (en) * 2018-08-14 2020-02-20 Oppo广东移动通信有限公司 Audio processing method, storage medium and electronic device
CN110120225A (en) * 2019-04-01 2019-08-13 西安电子科技大学 A kind of audio defeat system and method for the structure based on GRU network
CN111128215A (en) * 2019-12-24 2020-05-08 声耕智能科技(西安)研究院有限公司 Single-channel real-time noise reduction method and system
CN111754982A (en) * 2020-06-19 2020-10-09 平安科技(深圳)有限公司 Noise elimination method and device for voice call, electronic equipment and storage medium
CN114898762A (en) * 2022-05-07 2022-08-12 北京快鱼电子股份公司 Real-time voice noise reduction method and device based on target person and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Separate Sound into STFT Frames to Eliminate Sound Noise Frames in Sound Classification; Thanh Tran; 2021 IEEE Symposium Series on Computational Intelligence (SSCI); full text *
Research on speech signal preprocessing methods based on deep learning in complex environments; Gao Tian; China Doctoral Dissertations Full-text Database; full text *

Also Published As

Publication number Publication date
CN115394310A (en) 2022-11-25

Similar Documents

Publication Publication Date Title
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
CN110379412B (en) Voice processing method and device, electronic equipment and computer readable storage medium
CN105321525B (en) A kind of system and method reducing VOIP communication resource expense
CN109065067A (en) A kind of conference terminal voice de-noising method based on neural network model
CN113140225A (en) Voice signal processing method and device, electronic equipment and storage medium
CN114338623B (en) Audio processing method, device, equipment and medium
CN111243617B (en) Speech enhancement method for reducing MFCC feature distortion based on deep learning
CN113287167B (en) Method, device and system for mixed speech synthesis
Wang et al. Caunet: Context-aware u-net for speech enhancement in time domain
CN115394310B (en) Neural network-based background voice removing method and system
Le et al. Inference skipping for more efficient real-time speech enhancement with parallel RNNs
CN112466297B (en) Speech recognition method based on time domain convolution coding and decoding network
Ali et al. Speech enhancement using dilated wave-u-net: an experimental analysis
CN116994564A (en) Voice data processing method and processing device
CN114678033A (en) Speech enhancement algorithm based on multi-head attention mechanism only comprising encoder
CN114360571A (en) Reference-based speech enhancement method
Raj et al. Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients
CN114360561A (en) Voice enhancement method based on deep neural network technology
CN114283829A (en) Voice enhancement method based on dynamic gate control convolution cyclic network
CN110197657B (en) Dynamic sound feature extraction method based on cosine similarity
Westhausen et al. tplcnet: Real-time deep packet loss concealment in the time domain using a short temporal context
Gaafar et al. An improved method for speech/speaker recognition
Hong et al. Independent component analysis based single channel speech enhancement
Nossier et al. Two-stage deep learning approach for speech enhancement and reconstruction in the frequency and time domains
CN114743561A (en) Voice separation device and method, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant