
CN115394310B - Neural network-based background voice removing method and system

Info

Publication number
CN115394310B
CN115394310B
Authority
CN
China
Prior art keywords
voice
real
module
encoder
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210998674.XA
Other languages
Chinese (zh)
Other versions
CN115394310A (en)
Inventor
潘伟
唐镇坤
陈盛福
马志豪
刘黎思
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Post Consumer Finance Co ltd
Original Assignee
China Post Consumer Finance Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Post Consumer Finance Co., Ltd.
Priority to CN202210998674.XA
Publication of CN115394310A
Application granted
Publication of CN115394310B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02087 Noise filtering the noise being separate speech, e.g. cocktail party
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques characterised by the analysis technique using neural networks
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/87 Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Quality & Reliability (AREA)
  • Compression, Expansion, Code Conversion, And Decoders (AREA)

Abstract

The invention relates to a neural-network-based method and system for removing background voices, comprising the following steps: S1, feeding real-time voice into a noise reduction module for noise reduction processing to obtain noise-reduced voice; S2, performing voice endpoint detection on the noise-reduced voice and intercepting the target voice to obtain an auxiliary audio; S3, feeding the obtained auxiliary audio and the real-time input audio into a background third-party voice elimination module to remove background noise and third-party voice, obtaining auxiliary audio and real-time input audio free of background noise and third-party voice; and S4, merging the cleaned auxiliary audio with the real-time input audio and outputting the processed voice data.

Description

Neural network-based background voice removing method and system
Technical Field
The invention relates to the technical field of voice denoising, and in particular to a neural-network-based method and system for removing background human voice.
Background
A communication voice signal is usually subject to some degree of noise interference, which degrades call quality; it is also frequently contaminated by the voices of background third parties, so that the main speaker's voice is mixed with background speech, further degrading call quality. Target speaker extraction is an effective technique for eliminating background third-party voice: it aims to remove the influence of background third-party voices on the target speaker's speech signal, improving the clarity and quality of that signal.
At present, existing noise reduction technology has the following limitations: noise reduction alone cannot remove background third-party voice; it can only handle noise outside speech segments and fails when third-party speech is present. Furthermore, there is no existing technique for performing such speech processing in real time on a streaming input.
Disclosure of Invention
Based on this, it is necessary to provide a background human voice removing method and system based on a neural network.
In one aspect, an embodiment of the invention provides a neural-network-based background voice removal method, comprising the following steps:
S1, feeding real-time voice into a noise reduction module for noise reduction processing to obtain noise-reduced voice;
S2, performing voice endpoint detection on the noise-reduced voice and intercepting the target voice to obtain an auxiliary audio;
S3, feeding the obtained auxiliary audio and the real-time input audio into a background third-party voice elimination module to remove background noise and third-party voice, obtaining auxiliary audio and real-time input audio free of background noise and third-party voice;
and S4, merging the cleaned auxiliary audio with the real-time input audio and outputting the processed voice data.
Preferably, in step S1 the noise reduction module comprises an encoder, an LSTM module and a decoder: the real-time speech is fed to the encoder for encoding to obtain encoded first intermediate representation data, which is passed to the LSTM module for prediction to obtain second intermediate representation data, which is in turn processed by the decoder to obtain the noise-reduced speech.
Preferably, in step S1 the real-time speech is fed to the encoder and encoded in frames of 30-34 ms length to obtain the encoded first intermediate representation data.
Preferably, in step S1, feeding the real-time speech to the encoder for encoding comprises the following steps: in an encoder composed of L encoding layers, the noisy audio data is taken as input data to obtain the encoded first intermediate representation data, expressed as:
mdata = f(Input)
In this structure the encoder has L layers in total, with initial hidden size H, convolution kernel size K and stride S; the i-th encoding layer comprises a convolution with kernel size K, stride S and 2^(i-1)·H output channels, a ReLU layer, a 1×1 convolution with 2^i·H output channels, and a GLU activation layer.
Preferably, in step S1 the LSTM module has two layers, with hidden size 2^(L-1)·H; for causal prediction a unidirectional LSTM network is used, and for non-causal prediction a BiLSTM is used, followed by a linear layer that fuses the two BiLSTM outputs, obtaining the second intermediate representation data, expressed as:
R(z)=LSTM(z)+z。
Preferably, in step S1 the decoder has L layers in total, where the input of the i-th decoding layer has 2^(i-1)·H channels; a 1×1 convolution with 2^i·H channels is applied, a GLU layer then restores the channel count to 2^(i-1)·H, and finally a transposed convolution is applied. If the i-th layer is not the last layer it is followed by a ReLU layer; if it is the last layer, its output is a single channel and there is no ReLU layer. This is expressed as:
Output = g(mdata)
where mdata denotes the second intermediate representation data produced by the encoder and the LSTM module in sequence, Output denotes the output denoised audio data, and g() denotes the decoding layers of the decoder.
Preferably, in step S2, intercepting the target voice using the voice endpoint detection technique to obtain the auxiliary audio specifically comprises:
S21: extracting features from each 32 ms frame of the noise-reduced signal, the extracted features being the energies of 6 sub-bands of the frame;
S22: training a classifier on a set of data frames from regions known to contain speech or silence, the classifier being a two-dimensional Gaussian model;
S23: computing the log-likelihood ratio of each frequency band and summing the ratios over the 6 sub-bands, then classifying unknown real-time frame data as speech or silence: if the sum exceeds a decision threshold, the frame is classified as a speech signal;
S24: continuously updating the Gaussian model parameters; once more than 2 s of speech signal has been intercepted, stopping the parameter updates and exiting the noise reduction module and the voice endpoint detection module.
Preferably, in step S3 the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder, and eliminating third-party voice using the intercepted auxiliary audio and the real-time input speech specifically comprises:
S31: taking the intercepted 2 s auxiliary speech and the real-time speech as inputs of the speech encoder, the real-time speech being spliced together with the preceding 1 s of voice data;
S32: the speech encoder stacks its results as output, the speech encoder being a multi-scale encoder;
S33: the output of the speaker encoder serves as one of the inputs of the speaker extractor, these outputs being taken together as the speaker extractor's input, the speaker encoder's structure comprising 1-d convolutions and ResBlocks;
S34: one of the speaker extractor's outputs is multiplied with the network's initial input, and the resulting product is passed to the speech decoder.
The invention also provides a neural-network-based background voice removal system, comprising:
a noise reduction module, for performing noise reduction processing on the real-time voice fed into it;
an auxiliary audio module, for performing voice endpoint detection on the noise-reduced voice and intercepting the target voice to obtain an auxiliary audio;
a background third-party voice elimination module, for removing background noise and third-party voice from the obtained auxiliary audio and the real-time input audio, obtaining auxiliary audio and real-time input audio free of background noise and third-party voice;
a merging output module, for merging the cleaned auxiliary audio with the real-time input audio and outputting the processed voice data.
Preferably, the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder.
The invention provides a real-time neural-network-based technique for removing background third-party voice: it removes noise and background third-party voices in real time. By combining a voice denoising technique with a target-speaker voice extraction technique and processing the speech in real time, it yields higher-quality call speech: noise reduction is performed in real time and the target speaker's voice is extracted, further improving call quality, so the method has high application value.
Drawings
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings. Like reference numerals refer to like parts throughout the drawings, and the drawings are not intended to be drawn to scale in actual dimensions, emphasis instead being placed upon illustrating the principles of the invention.
FIG. 1 is an overall flow diagram of a method of an embodiment of the invention;
fig. 2 is a schematic diagram of the method of the preferred embodiment of the present invention.
Detailed Description
The technical solutions of the present invention are described in further detail below with reference to the drawings and specific embodiments, so that those skilled in the art can better understand and implement the present invention; the embodiments, however, do not limit the invention.
As shown in fig. 1-2, in one aspect, an embodiment of the present invention provides a method for removing background human voice based on a neural network, including the following steps:
S1, feeding real-time voice into a noise reduction module for noise reduction processing to obtain noise-reduced voice; preferably, the real-time call voice can be fed to the noise reduction module in segments of 32 ms duration.
S2, performing voice endpoint detection on the noise-reduced voice and intercepting the target voice to obtain an auxiliary audio;
S3, feeding the obtained auxiliary audio and the real-time input audio into a background third-party voice elimination module to remove background noise and third-party voice, obtaining auxiliary audio and real-time input audio free of background noise and third-party voice;
and S4, merging the cleaned auxiliary audio with the real-time input audio and outputting the processed voice data.
In the invention, a denoising technique and a target voice extraction technique are combined: the denoising module is first applied to collect the auxiliary speech required for target voice extraction, and the target voice extraction module then performs the processing. Buffering is applied to the streaming data, so that target voice extraction achieves streaming output without sacrificing performance.
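The following is a minimal Python sketch of this two-phase pipeline over streaming 32 ms frames. It assumes the three modules described below are available as callables; `denoise`, `detect_speech` and `extract_target` are illustrative names, not from the patent, and the 2 s auxiliary-audio requirement is approximated as roughly 63 frames of 32 ms.

```python
def remove_background_voice(frames, denoise, detect_speech, extract_target,
                            aux_len_frames=63):
    """Hedged sketch of steps S1-S4 over a stream of 32 ms frames.

    Phase 1 (S1-S2): denoise each frame and run voice endpoint detection,
    collecting ~2 s (about 63 frames) of target speech as auxiliary audio.
    Phase 2 (S3-S4): once enough auxiliary audio is gathered, the denoise
    and VAD modules are exited and every incoming frame goes through the
    background third-party voice elimination module instead.
    """
    aux, out = [], []
    for frame in frames:
        if len(aux) < aux_len_frames:
            clean = denoise(frame)                  # S1: noise reduction module
            if detect_speech(clean):                # S2: endpoint detection,
                aux.append(clean)                   #     intercept target speech
            out.append(clean)
        else:
            out.append(extract_target(frame, aux))  # S3: remove 3rd-party voice
    return out                                      # S4: merged processed voice
```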
In a preferred embodiment, in step S1 the noise reduction module comprises an encoder, an LSTM module and a decoder: the real-time speech is fed to the encoder for encoding to obtain encoded first intermediate representation data, which is passed to the LSTM module for prediction to obtain second intermediate representation data, which is in turn processed by the decoder to obtain the noise-reduced speech.
In a preferred embodiment, in step S1 the real-time speech is fed to the encoder and encoded in frames of 30-34 ms length to obtain the encoded first intermediate representation data; preferably, encoding is performed with a frame length of 32 ms.
In a preferred embodiment, in step S1, feeding the real-time speech to the encoder for encoding comprises the following steps: in an encoder composed of L encoding layers, the noisy audio data is taken as input data, yielding the encoded first intermediate representation data, expressed as:
mdata = f(Input)
In this structure the encoder has L layers in total, with initial hidden size H, convolution kernel size K and stride S; the i-th encoding layer comprises a convolution with kernel size K, stride S and 2^(i-1)·H output channels, a ReLU layer, a 1×1 convolution with 2^i·H output channels, and a GLU activation layer.
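A minimal PyTorch sketch of one such encoding layer and the stacked encoder follows. Only the layer recipe above (Conv1d, ReLU, 1×1 convolution, GLU, with channels doubling per layer) is taken from the description; the kernel size, stride and hidden-size defaults are illustrative assumptions.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoding layer: Conv1d (kernel K, stride S, 2^(i-1)*H channels)
    -> ReLU -> 1x1 Conv to 2^i*H channels -> GLU (halving back)."""
    def __init__(self, in_ch, ch, kernel=8, stride=4):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, ch, kernel, stride)
        self.relu = nn.ReLU()
        self.expand = nn.Conv1d(ch, 2 * ch, 1)   # 1x1 conv doubles channels
        self.glu = nn.GLU(dim=1)                 # GLU halves them again

    def forward(self, x):
        return self.glu(self.expand(self.relu(self.conv(x))))

class Encoder(nn.Module):
    """mdata = f(Input): L stacked encoding layers over single-channel audio."""
    def __init__(self, layers=5, hidden=48, kernel=8, stride=4):
        super().__init__()
        chs = [1] + [hidden * 2 ** i for i in range(layers)]
        self.layers = nn.ModuleList(
            EncoderLayer(chs[i], chs[i + 1], kernel, stride)
            for i in range(layers))

    def forward(self, x):            # x: (batch, 1, samples), noisy audio
        for layer in self.layers:
            x = layer(x)
        return x                     # first intermediate representation data
```

For example, `Encoder()(torch.randn(1, 1, 16000))` yields the first intermediate representation for one second of 16 kHz audio under these assumed settings.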
In a preferred embodiment, in step S1 the LSTM module is implemented as a sequence network model R with two layers and hidden size 2^(L-1)·H; for causal prediction a unidirectional LSTM network is used, and for non-causal prediction a BiLSTM is used, followed by a linear layer that fuses the two BiLSTM outputs, obtaining the second intermediate representation data, expressed as:
R(z)=LSTM(z)+z。
In a preferred embodiment, in step S1 the decoder has L layers in total, where the input of the i-th decoding layer has 2^(i-1)·H channels; a 1×1 convolution with 2^i·H channels is applied, a GLU layer then restores the channel count to 2^(i-1)·H, and finally a transposed convolution is applied. If the i-th layer is not the last layer it is followed by a ReLU layer; if it is the last layer, its output is a single channel and there is no ReLU layer. This is expressed as:
Output = g(mdata)
where mdata denotes the second intermediate representation data produced by the encoder and the LSTM module in sequence, Output denotes the output denoised audio data, and g() denotes the decoding layers of the decoder.
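A matching PyTorch sketch of one decoding layer follows; stacking L of these with decreasing channel counts, the last constructed with `last=True` and a single output channel, gives Output = g(mdata). The kernel and stride defaults mirror the encoder sketch and are likewise assumptions.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoding layer: 1x1 Conv to 2^i*H channels -> GLU back to
    2^(i-1)*H -> ConvTranspose1d; a ReLU follows on every layer except
    the last, whose transposed convolution emits the waveform channel."""
    def __init__(self, ch, out_ch, kernel=8, stride=4, last=False):
        super().__init__()
        self.expand = nn.Conv1d(ch, 2 * ch, 1)       # 1x1 conv, 2^i*H channels
        self.glu = nn.GLU(dim=1)                     # restore 2^(i-1)*H
        self.deconv = nn.ConvTranspose1d(ch, out_ch, kernel, stride)
        self.act = nn.Identity() if last else nn.ReLU()

    def forward(self, x):
        return self.act(self.deconv(self.glu(self.expand(x))))
```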
In a preferred embodiment, in step S2, intercepting the target voice using the voice endpoint detection technique to obtain the auxiliary audio specifically comprises:
S21: extracting features from each 32 ms frame of the noise-reduced signal, the extracted features being the energies of 6 sub-bands of the frame;
S22: training a classifier on a set of data frames from regions known to contain speech or silence, the classifier being a two-dimensional Gaussian model;
S23: computing the log-likelihood ratio of each frequency band and summing the ratios over the 6 sub-bands, then classifying unknown real-time frame data as speech or silence: if the sum exceeds a decision threshold, the frame is classified as a speech signal;
S24: continuously updating the Gaussian model parameters; once more than 2 s of speech signal has been intercepted, stopping the parameter updates and exiting the noise reduction module and the voice endpoint detection module.
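The sketch below illustrates steps S21-S23 in plain NumPy. It simplifies the two-dimensional Gaussian classifier to one univariate Gaussian per sub-band and omits the online parameter updates of S24; the band split, function names and threshold are assumptions for illustration only.

```python
import numpy as np

def subband_energies(frame, bands=6):
    """S21: energies of 6 frequency sub-bands of one 32 ms frame."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    return np.array([chunk.sum() for chunk in np.array_split(spec, bands)])

def log_likelihood_ratio(energies, speech_model, noise_model):
    """S23: sum over the 6 sub-bands of the per-band log-likelihood ratio.
    Each model is a list of (mean, var) pairs, one Gaussian per band,
    trained as in S22 on frames known to be speech or silence."""
    def logpdf(x, mean, var):
        return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
    return sum(logpdf(e, *s) - logpdf(e, *n)
               for e, s, n in zip(energies, speech_model, noise_model))

def is_speech(frame, speech_model, noise_model, threshold=0.0):
    """Classify a frame as speech when the summed ratio exceeds the threshold."""
    return log_likelihood_ratio(subband_energies(frame),
                                speech_model, noise_model) > threshold
```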
In a preferred embodiment, in step S3 the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder, and eliminating third-party voice using the intercepted auxiliary audio and the real-time input speech specifically comprises:
S31: taking the intercepted 2 s auxiliary speech and the real-time speech as inputs of the speech encoder, the real-time speech being spliced together with the preceding 1 s of voice data;
S32: the speech encoder stacks its results as output, the speech encoder being a multi-scale encoder;
the formula is expressed as follows:
mdata i =Concat(f 1 (Input i ),f 2 (Input i ),f 3 (Input i ))
f i (x)=Covd1d(Pad(x))
in this section f i () Encoders expressed as different scales, in which structure input x is input The left and right sides of the Input buffer are filled with 0 to different lengths, and then are convolved for 1d, and finally results Concat of 3 scales are taken as the Input of the speaker encoder and the separator, wherein the Input is divided into two, the first Input 1 Is Input of real-time speech, and a second Input 2 Is the registered voice of the target voice to be extracted, i.e., the intercepted auxiliary audio.
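A PyTorch sketch of such a multi-scale speech encoder follows. The three kernel sizes, channel count and stride are illustrative assumptions; the padding is chosen so that all three branches produce the same number of frames and can be concatenated along the channel dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSpeechEncoder(nn.Module):
    """mdata_i = Concat(f_1(Input_i), f_2(Input_i), f_3(Input_i)),
    f_i(x) = Conv1d(Pad(x)): three 1-d convolutions at different scales
    over zero-padded copies of the input, concatenated channel-wise."""
    def __init__(self, out_ch=256, kernels=(20, 80, 160), stride=10):
        super().__init__()
        self.kernels = kernels
        self.branches = nn.ModuleList(
            nn.Conv1d(1, out_ch, k, stride) for k in kernels)

    def forward(self, x):                       # x: (batch, 1, samples)
        outs = []
        for conv, k in zip(self.branches, self.kernels):
            pad = k - self.kernels[0]           # align frame counts across scales
            outs.append(conv(F.pad(x, (pad // 2, pad - pad // 2))))
        return torch.cat(outs, dim=1)           # stacked multi-scale features
```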
S33: the output of the speaker encoder serves as one of the inputs of the speaker extractor, these outputs then being taken together as the speaker extractor's input; the main structure of the speaker encoder comprises 1-d convolutions and ResBlocks, expressed as follows:
OUTPUT_embedding = g(mdata_2)
g(x) = Conv1d(ResBlock(Conv1d(x)) ×3)
Here g() denotes the speaker encoder; its input mdata_2 is the output of the speech encoder for Input_2. A 1-d convolution and 3 ResBlocks are applied, followed by another 1-d convolution, completing the speaker encoding, where the ResBlock structure is:
ResBlock(x) = MaxPool(PReLU(x + Conv1d(PReLU(Conv1d(x)))))
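The following PyTorch sketch mirrors these two expressions. Kernel sizes, pooling factor and embedding width are assumptions, as is the final mean-pooling over time, which reduces the encoded auxiliary audio to a single speaker embedding vector.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """ResBlock(x) = MaxPool(PReLU(x + Conv1d(PReLU(Conv1d(x)))))."""
    def __init__(self, ch, kernel=3, pool=3):
        super().__init__()
        self.conv1 = nn.Conv1d(ch, ch, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(ch, ch, kernel, padding=kernel // 2)
        self.act1, self.act2 = nn.PReLU(), nn.PReLU()
        self.pool = nn.MaxPool1d(pool)

    def forward(self, x):
        y = self.conv2(self.act1(self.conv1(x)))   # inner Conv-PReLU-Conv
        return self.pool(self.act2(x + y))         # residual add, PReLU, pool

class SpeakerEncoder(nn.Module):
    """g(x) = Conv1d(ResBlock(Conv1d(x)) x3): 1-d conv, three ResBlocks,
    then a final 1-d conv yielding OUTPUT_embedding."""
    def __init__(self, in_ch, emb_ch=256):
        super().__init__()
        self.head = nn.Conv1d(in_ch, emb_ch, 1)
        self.blocks = nn.Sequential(*[ResBlock(emb_ch) for _ in range(3)])
        self.tail = nn.Conv1d(emb_ch, emb_ch, 1)

    def forward(self, mdata2):             # mdata_2: encoded auxiliary audio
        e = self.tail(self.blocks(self.head(mdata2)))
        return e.mean(dim=2)               # pool over time: one embedding vector
```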
S34: one of the speaker extractor's outputs is multiplied with the network's initial input, and the resulting product is passed to the speech decoder.
The output of the speaker encoder becomes one of the inputs of the speaker extractor; it is then taken together with mdata_1 as the speaker extractor's input. The structure of the speaker extractor is expressed as follows:
OUTPUT_Mask = ReLU(Conv1d(OUTPUT_StackedTCNs))
OUTPUT_StackedTCNs = S(Conv1d(mdata_1), OUTPUT_embedding) ×4
S(u, v) = S'(u + Conv1d(PReLU(DeConv1d(PReLU(Conv1d(Concat(u, v))))))) ×6
S'(x) = Conv1d(PReLU(DeConv1d(PReLU(Conv1d(x)))))
In this part the output OUTPUT_Mask is produced, where OUTPUT_StackedTCNs consists of 4 nested S(u, v) stacks, each in turn containing 6 nested S'(x) blocks; DeConv1d in S(u, v) and S'(x) refers to a dilated depthwise-separable convolution, which achieves an effect similar to an RNN while reducing the amount of computation.
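A compact PyTorch sketch of this stacked-TCN speaker extractor follows. The channel widths and exponentially growing dilations are assumptions; the speaker embedding v is concatenated onto the features only in the first block of each stack, matching Concat(u, v) in S(u, v), and a dilated depthwise convolution stands in for the patent's DeConv1d.

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """One S'(x)-style block: Conv1d -> PReLU -> dilated depthwise-separable
    conv -> PReLU -> Conv1d, with a residual connection; the first block of
    a stack also concatenates the speaker embedding v (Concat(u, v))."""
    def __init__(self, ch, emb_ch, hid=512, dilation=1, fuse_emb=False):
        super().__init__()
        self.fuse_emb = fuse_emb
        self.inp = nn.Conv1d(ch + (emb_ch if fuse_emb else 0), hid, 1)
        self.dw = nn.Conv1d(hid, hid, 3, padding=dilation,
                            dilation=dilation, groups=hid)  # depthwise, dilated
        self.out = nn.Conv1d(hid, ch, 1)
        self.act1, self.act2 = nn.PReLU(), nn.PReLU()

    def forward(self, u, v=None):
        x = u
        if self.fuse_emb:              # broadcast embedding over time, concat
            x = torch.cat([u, v.unsqueeze(2).expand(-1, -1, u.size(2))], dim=1)
        return u + self.out(self.act2(self.dw(self.act1(self.inp(x)))))

class SpeakerExtractor(nn.Module):
    """OUTPUT_StackedTCNs: 4 stacks (x4) of 6 blocks (x6), then
    OUTPUT_Mask = ReLU(Conv1d(...))."""
    def __init__(self, ch, emb_ch):
        super().__init__()
        self.stacks = nn.ModuleList(
            nn.ModuleList(TCNBlock(ch, emb_ch, dilation=2 ** b, fuse_emb=(b == 0))
                          for b in range(6))
            for _ in range(4))
        self.mask = nn.Sequential(nn.Conv1d(ch, ch, 1), nn.ReLU())

    def forward(self, mdata1, emb):    # mdata_1 features, speaker embedding
        x = mdata1
        for stack in self.stacks:
            for block in stack:
                x = block(x, emb) if block.fuse_emb else block(x)
        return self.mask(x)            # OUTPUT_Mask
```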
S35: s3, the real-time background third person voice removing technology based on the neural network, and an OUTPUT OUTPUT by the speaker extractor Mask Initial input x to the network input I.e. multiplication, the result is output to the speech decoder, which is expressed as follows:
OUTPUT=D(OUTPUT Mask *x input )
D(x)=ConvTrans1D(x)
in this architecture, speech is reconstructed through a speech decoder to obtain the final required output.
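A minimal PyTorch sketch of the speech decoder follows; the transposed-convolution kernel and stride are assumptions that would mirror the speech encoder, and `encoded_input` stands for the network's initial input representation x_input of the real-time audio.

```python
import torch
import torch.nn as nn

class SpeechDecoder(nn.Module):
    """OUTPUT = D(OUTPUT_Mask * x_input), D(x) = ConvTranspose1d(x):
    apply the mask element-wise, then reconstruct the waveform."""
    def __init__(self, in_ch, kernel=20, stride=10):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(in_ch, 1, kernel, stride)

    def forward(self, mask, encoded_input):
        return self.deconv(mask * encoded_input)   # masked features -> waveform
```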
The invention also provides a neural-network-based background voice removal system, comprising:
a noise reduction module, for performing noise reduction processing on the real-time voice fed into it;
an auxiliary audio module, for performing voice endpoint detection on the noise-reduced voice and intercepting the target voice to obtain an auxiliary audio;
a background third-party voice elimination module, for removing background noise and third-party voice from the obtained auxiliary audio and the real-time input audio, obtaining auxiliary audio and real-time input audio free of background noise and third-party voice;
a merging output module, for merging the cleaned auxiliary audio with the real-time input audio and outputting the processed voice data.
In a preferred embodiment, the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder.
The above description is only a preferred embodiment of the present invention and is not intended to limit its scope; all equivalent structural and process modifications made using the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, likewise fall within the scope of the present invention.

Claims (10)

1. A neural-network-based background voice removal method, characterized by comprising the following steps:
S1, feeding real-time voice into a noise reduction module for noise reduction processing to obtain noise-reduced voice;
S2, performing voice endpoint detection on the noise-reduced voice and intercepting the target voice to obtain an auxiliary audio;
S3, feeding the obtained auxiliary audio and the real-time input audio into a background third-party voice elimination module to remove background noise and third-party voice, obtaining auxiliary audio and real-time input audio free of background noise and third-party voice;
and S4, merging the cleaned auxiliary audio with the real-time input audio and outputting the processed voice data.
2. The method of claim 1, wherein in step S1 the noise reduction module comprises an encoder, an LSTM module and a decoder, the real-time speech being fed to the encoder for encoding to obtain encoded first intermediate representation data, which is passed to the LSTM module for prediction to obtain second intermediate representation data, which is in turn processed by the decoder to obtain the noise-reduced speech.
3. The method of claim 2, wherein in step S1 the real-time speech is fed to the encoder and encoded in frames of 30-34 ms length to obtain the encoded first intermediate representation data.
4. The method of claim 3, wherein in step S1 feeding the real-time speech to the encoder for encoding comprises the following steps: in an encoder composed of L encoding layers, the noisy audio data is taken as input data to obtain the encoded first intermediate representation data, expressed as:
mdata = f(Input)
In this structure the encoder has L layers in total, with initial hidden size H, convolution kernel size K and stride S; the i-th encoding layer comprises a convolution with kernel size K, stride S and 2^(i-1)·H output channels, a ReLU layer, a 1×1 convolution with 2^i·H output channels, and a GLU activation layer.
5. The method as claimed in claim 4, wherein in step S1 the LSTM module has two layers, with hidden size 2^(L-1)·H; for causal prediction a unidirectional LSTM network is used, and for non-causal prediction a BiLSTM is used, followed by a linear layer that fuses the two BiLSTM outputs, obtaining the second intermediate representation data, expressed as:
R(z)=LSTM(z)+z。
6. The method of claim 5, wherein in step S1 the decoder has L layers, where the input of the i-th decoding layer has 2^(i-1)·H channels; a 1×1 convolution with 2^i·H channels is applied, a GLU layer then restores the channel count to 2^(i-1)·H, and finally a transposed convolution is applied; if the i-th layer is not the last layer it is followed by a ReLU layer, and if it is the last layer its output is a single channel with no ReLU layer, expressed as:
Output = g(mdata)
wherein mdata denotes the second intermediate representation data produced by the encoder and the LSTM module in sequence, Output denotes the output denoised audio data, and g() denotes the decoding layers of the decoder.
7. The method according to claim 1, wherein in step S2 intercepting the target voice using the voice endpoint detection technique to obtain the auxiliary audio comprises:
S21: extracting features from each 32 ms frame of the noise-reduced signal, the extracted features being the energies of 6 sub-bands of the frame;
S22: training a classifier on a set of data frames from regions known to contain speech or silence, the classifier being a two-dimensional Gaussian model;
S23: computing the log-likelihood ratio of each frequency band and summing the ratios over the 6 sub-bands, then classifying unknown real-time frame data as speech or silence: if the sum exceeds a decision threshold, the frame is classified as a speech signal;
S24: continuously updating the Gaussian model parameters; once more than 2 s of speech signal has been intercepted, stopping the parameter updates and exiting the noise reduction module and the voice endpoint detection module.
8. The method as claimed in claim 1, wherein in step S3 the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder, and eliminating third-party voice using the intercepted auxiliary audio and the real-time input speech specifically comprises:
S31: taking the intercepted 2 s auxiliary speech and the real-time speech as inputs of the speech encoder, the real-time speech being spliced together with the preceding 1 s of voice data;
S32: the speech encoder stacks its results as output, the speech encoder being a multi-scale encoder;
S33: the output of the speaker encoder serves as one of the inputs of the speaker extractor, these outputs being taken together as the speaker extractor's input, the speaker encoder's structure comprising 1-d convolutions and ResBlocks;
S34: one of the speaker extractor's outputs is multiplied with the network's initial input, and the resulting product is passed to the speech decoder.
9. A neural-network-based background voice removal system, characterized by comprising:
a noise reduction module, for performing noise reduction processing on the real-time voice fed into it;
an auxiliary audio module, for performing voice endpoint detection on the noise-reduced voice and intercepting the target voice to obtain an auxiliary audio;
a background third-party voice elimination module, for removing background noise and third-party voice from the obtained auxiliary audio and the real-time input audio, obtaining auxiliary audio and real-time input audio free of background noise and third-party voice;
a merging output module, for merging the cleaned auxiliary audio with the real-time input audio and outputting the processed voice data.
10. The system of claim 9, wherein the background third party voice removal module comprises a speech encoder, a speaker extractor, and a speech decoder.
CN202210998674.XA 2022-08-19 2022-08-19 Neural network-based background voice removing method and system Active CN115394310B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210998674.XA CN115394310B (en) 2022-08-19 2022-08-19 Neural network-based background voice removing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210998674.XA CN115394310B (en) 2022-08-19 2022-08-19 Neural network-based background voice removing method and system

Publications (2)

Publication Number Publication Date
CN115394310A CN115394310A (en) 2022-11-25
CN115394310B (en) 2023-04-07

Family

ID=84120602

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210998674.XA Active CN115394310B (en) 2022-08-19 2022-08-19 Neural network-based background voice removing method and system

Country Status (1)

Country Link
CN (1) CN115394310B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862581A (en) * 2023-02-10 2023-03-28 杭州兆华电子股份有限公司 Secondary elimination method and system for repeated pattern noise

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106971741A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 The method and system for the voice de-noising that voice is separated in real time
US9947333B1 (en) * 2012-02-10 2018-04-17 Amazon Technologies, Inc. Voice interaction architecture with intelligent background noise cancellation
CN109087658A (en) * 2018-07-16 2018-12-25 安徽国通亿创科技股份有限公司 A kind of online interaction live streaming noise processed system
CN110120225A (en) * 2019-04-01 2019-08-13 西安电子科技大学 A kind of audio defeat system and method for the structure based on GRU network
WO2020034779A1 (en) * 2018-08-14 2020-02-20 Oppo广东移动通信有限公司 Audio processing method, storage medium and electronic device
CN111128215A (en) * 2019-12-24 2020-05-08 声耕智能科技(西安)研究院有限公司 Single-channel real-time noise reduction method and system
CN111754982A (en) * 2020-06-19 2020-10-09 平安科技(深圳)有限公司 Noise elimination method and device for voice call, electronic equipment and storage medium
CN114898762A (en) * 2022-05-07 2022-08-12 北京快鱼电子股份公司 Real-time voice noise reduction method and device based on target person and electronic equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US12062369B2 (en) * 2020-09-25 2024-08-13 Intel Corporation Real-time dynamic noise reduction using convolutional networks

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9947333B1 (en) * 2012-02-10 2018-04-17 Amazon Technologies, Inc. Voice interaction architecture with intelligent background noise cancellation
CN106971741A (en) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 The method and system for the voice de-noising that voice is separated in real time
CN109087658A (en) * 2018-07-16 2018-12-25 安徽国通亿创科技股份有限公司 A kind of online interaction live streaming noise processed system
WO2020034779A1 (en) * 2018-08-14 2020-02-20 Oppo广东移动通信有限公司 Audio processing method, storage medium and electronic device
CN110120225A (en) * 2019-04-01 2019-08-13 西安电子科技大学 A kind of audio defeat system and method for the structure based on GRU network
CN111128215A (en) * 2019-12-24 2020-05-08 声耕智能科技(西安)研究院有限公司 Single-channel real-time noise reduction method and system
CN111754982A (en) * 2020-06-19 2020-10-09 平安科技(深圳)有限公司 Noise elimination method and device for voice call, electronic equipment and storage medium
CN114898762A (en) * 2022-05-07 2022-08-12 北京快鱼电子股份公司 Real-time voice noise reduction method and device based on target person and electronic equipment

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Separate Sound into STFT Frames to Eliminate Sound Noise Frames in Sound Classification; Thanh Tran; 2021 IEEE Symposium Series on Computational Intelligence (SSCI); full text *
Research on speech signal preprocessing methods based on deep learning in complex environments; Gao Tian; China Doctoral Dissertations Full-text Database; full text *

Also Published As

Publication number Publication date
CN115394310A (en) 2022-11-25

Similar Documents

Publication Publication Date Title
Zhao et al. Monaural speech dereverberation using temporal convolutional networks with self attention
CN110379412B (en) Voice processing method and device, electronic equipment and computer readable storage medium
CN105321525B (en) A kind of system and method reducing VOIP communication resource expense
CN109065067A (en) A kind of conference terminal voice de-noising method based on neural network model
CN113140225A (en) Voice signal processing method and device, electronic equipment and storage medium
CN114338623B (en) Audio processing method, device, equipment and medium
CN111243617B (en) Speech enhancement method for reducing MFCC feature distortion based on deep learning
CN113287167B (en) Method, device and system for mixed speech synthesis
Wang et al. Caunet: Context-aware u-net for speech enhancement in time domain
CN115394310B (en) Neural network-based background voice removing method and system
Le et al. Inference skipping for more efficient real-time speech enhancement with parallel RNNs
CN112466297B (en) Speech recognition method based on time domain convolution coding and decoding network
Ali et al. Speech enhancement using dilated wave-u-net: an experimental analysis
CN116994564A (en) Voice data processing method and processing device
CN114678033A (en) Speech enhancement algorithm based on multi-head attention mechanism only comprising encoder
CN114360571A (en) Reference-based speech enhancement method
Raj et al. Multilayered convolutional neural network-based auto-CODEC for audio signal denoising using mel-frequency cepstral coefficients
CN114360561A (en) Voice enhancement method based on deep neural network technology
CN114283829A (en) Voice enhancement method based on dynamic gate control convolution cyclic network
CN110197657B (en) Dynamic sound feature extraction method based on cosine similarity
Westhausen et al. tplcnet: Real-time deep packet loss concealment in the time domain using a short temporal context
Gaafar et al. An improved method for speech/speaker recognition
Hong et al. Independent component analysis based single channel speech enhancement
Nossier et al. Two-stage deep learning approach for speech enhancement and reconstruction in the frequency and time domains
CN114743561A (en) Voice separation device and method, storage medium and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant