CN115394310B - Neural network-based background voice removing method and system - Google Patents
Neural network-based background voice removing method and system
- Publication number
- CN115394310B (application CN202210998674.XA)
- Authority
- CN
- China
- Prior art keywords
- voice
- real
- module
- encoder
- time
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/87—Detection of discrete points within a voice signal
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02087—Noise filtering the noise being separate speech, e.g. cocktail party
Abstract
The invention relates to a neural-network-based background voice removal method and system, comprising the following steps: S1, feeding real-time voice into a noise reduction module for noise reduction processing to obtain denoised voice; S2, performing voice endpoint detection on the denoised voice and intercepting the target voice to obtain an auxiliary audio; S3, feeding the obtained auxiliary audio and the real-time input audio into a background third-party voice elimination module to eliminate background noise and third-party voice, obtaining auxiliary audio and real-time input audio from which background noise and third-party voice have been removed; and S4, merging the auxiliary audio and the real-time input audio from which background noise and third-party voice have been removed, and outputting the processed voice data.
Description
Technical Field
The invention relates to the technical field of voice denoising, and in particular to a neural-network-based background voice removal method and system.
Background
A communication voice signal is usually subject to some degree of noise interference, which degrades call quality; it is also often disturbed by the voices of background third parties, so that the caller's own voice is mixed into the background speech and call quality is further reduced. Target speaker extraction is an effective processing technique for eliminating background third-party voice: it aims to remove the influence of background third-party speech on the target speaker's speech signal, thereby improving the clarity and quality of the speech signal.
At present, existing noise reduction techniques have the following limitations: a noise reduction technique alone cannot remove background third-party voice, since it can only handle non-speech noise and cannot handle noise when a third-party voice is present; furthermore, there is no technique for real-time speech processing under streaming input.
Disclosure of Invention
Based on this, it is necessary to provide a background human voice removing method and system based on a neural network.
In one aspect, an embodiment of the invention provides a neural-network-based background voice removal method, comprising the following steps:
S1, feeding real-time voice into a noise reduction module for noise reduction processing to obtain denoised voice;
S2, performing voice endpoint detection on the denoised voice and intercepting the target voice to obtain an auxiliary audio;
S3, feeding the obtained auxiliary audio and the real-time input audio into a background third-party voice elimination module to eliminate background noise and third-party voice, obtaining auxiliary audio and real-time input audio from which background noise and third-party voice have been removed;
S4, merging the auxiliary audio and the real-time input audio from which background noise and third-party voice have been removed, and outputting the processed voice data.
Preferably, in step S1, the noise reduction module comprises an encoder, an LSTM module and a decoder: the real-time voice is fed into the encoder for encoding to obtain encoded first intermediate representation data, which is fed into the LSTM module for prediction to obtain second intermediate representation data, which is in turn fed into the decoder to obtain the denoised voice.
Preferably, in step S1, the real-time voice fed into the encoder is encoded frame by frame, with a frame length of 30-34 ms, to obtain the encoded first intermediate representation data.
Preferably, in step S1, feeding the real-time voice into the encoder for encoding comprises the following step: in an encoder composed of L encoding layers, the noisy audio data is taken as input data to obtain the encoded first intermediate representation data, expressed as:
mdata=f(Input)
In this structure, the encoder has L layers in total, with initial hidden_size H, convolution kernel size K and stride S. The i-th encoding layer comprises one convolution with kernel size K, stride S and 2^(i-1)·H output channels; one ReLU layer; one 1×1 convolution with 2^i·H output channels; and one GLU activation layer.
Preferably, in step S1, the LSTM module has two layers with hidden_size 2^(L-1)·H; for causal prediction a unidirectional LSTM network is used, while for non-causal prediction a BiLSTM is used, followed by a linear layer that fuses the two BiLSTM outputs, yielding the second intermediate representation data, expressed as:
R(z)=LSTM(z)+z。
Preferably, in step S1, the decoder has L layers in total. The input of the i-th decoding layer has 2^(i-1)·H channels; a 1×1 convolution with 2^i·H channels is applied, a GLU layer then restores the channel count to 2^(i-1)·H, and finally a transposed convolution is applied. If the i-th layer is not the last layer it is followed by a ReLU layer; if it is the last layer, the output is a single channel and there is no ReLU layer. The expression is:
Output=g(mdata)
where mdata denotes the second intermediate representation data processed in sequence by the encoder and the LSTM module, Output denotes the output denoised audio data, and g() denotes a decoding layer of the decoder.
Preferably, in step S2, intercepting the target voice with a voice endpoint detection technique to obtain the auxiliary audio specifically comprises:
S21: extracting features from each 32 ms frame of the denoised signal, the extracted features being the energies of 6 sub-bands of the frame;
S22: training a classifier on a set of data frames from known speech and silence regions, the classifier being a two-dimensional Gaussian model;
S23: computing the log-likelihood ratio of each sub-band, summing the likelihood ratios of the 6 sub-bands, and classifying the unknown real-time frame as a speech signal or a silence signal: if the sum exceeds a decision threshold, the frame is classified as a speech signal;
S24: continuously updating the Gaussian model parameters; once more than 2 s of speech has been intercepted, the parameter updates stop and the noise reduction module and voice endpoint detection module are exited.
Preferably, in step S3, the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder, and eliminating third-party voice using the intercepted auxiliary audio and the real-time input voice specifically comprises:
S31: taking the intercepted 2 s auxiliary voice and the real-time voice as inputs to the speech encoder, where the real-time voice is spliced with the preceding 1 s of voice data;
S32: the speech encoder stacks its results as output, the speech encoder being a multi-scale encoder;
S33: the output of the speaker encoder becomes one of the inputs to the speaker extractor and, together with the speech encoder output, is fed into the speaker extractor, where the speaker encoder structure comprises 1-d convolutions and ResBlocks;
S34: one of the speaker extractor outputs is multiplied with the initial input of the network, and the result is fed into the speech decoder.
The invention also provides a neural-network-based background voice removal system, comprising:
a noise reduction module: used for performing noise reduction processing on the real-time voice fed into the noise reduction module;
an auxiliary audio module: used for performing voice endpoint detection on the denoised voice and intercepting the target voice to obtain an auxiliary audio;
a background third-party voice elimination module: used for eliminating background noise and third-party voice from the obtained auxiliary audio and the real-time input audio, to obtain auxiliary audio and real-time input audio from which background noise and third-party voice have been removed;
a merging output module: used for merging the auxiliary audio and real-time input audio from which background noise and third-party voice have been removed, and outputting the processed voice data.
Preferably, the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder.
The invention provides a real-time, neural-network-based technique for removing background third-party voice, which can eliminate noise and background third-party voice in real time. By combining a voice denoising technique with a target-speaker voice extraction technique and processing the voice in real time, higher-quality call voice is obtained: noise reduction is performed in real time and the target speaker's voice is extracted, further improving call quality, so the method has high application value.
Drawings
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of preferred embodiments of the invention, as illustrated in the accompanying drawings. Like reference numerals refer to like parts throughout the drawings, which are not drawn to actual scale; emphasis is instead placed upon illustrating the principles of the invention.
FIG. 1 is an overall flow diagram of a method of an embodiment of the invention;
FIG. 2 is a schematic diagram of the method according to a preferred embodiment of the present invention.
Detailed Description
The technical solutions of the present invention are described in further detail below with reference to the drawings and specific embodiments, so that those skilled in the art can better understand and implement the present invention; the invention is not, however, limited to these embodiments.
As shown in FIGS. 1-2, in one aspect, an embodiment of the present invention provides a neural-network-based background voice removal method, comprising the following steps:
S1, feeding real-time voice into a noise reduction module for noise reduction processing to obtain denoised voice; preferably, the real-time call voice is fed into the noise reduction module in segments of 32 ms for noise reduction processing.
S2, performing voice endpoint detection on the denoised voice and intercepting the target voice to obtain an auxiliary audio;
S3, feeding the obtained auxiliary audio and the real-time input audio into a background third-party voice elimination module to eliminate background noise and third-party voice, obtaining auxiliary audio and real-time input audio from which background noise and third-party voice have been removed;
S4, merging the auxiliary audio and the real-time input audio from which background noise and third-party voice have been removed, and outputting the processed voice data.
In the invention, a denoising technique is combined with a target-voice extraction technique: the denoising module is first applied to collect the auxiliary voice required for target-voice extraction, and the target-voice extraction module then performs the processing; buffering is applied to the streaming data, so that target-voice extraction can produce streaming output without sacrificing performance.
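As an illustration of this buffered streaming, the following is a minimal sketch of chopping an incoming sample stream into fixed frames for the noise reduction module (NumPy, the 16 kHz sample rate and the chunking logic are assumptions made for illustration; they are not specified by the patent):

```python
import numpy as np

FRAME_MS = 32
SAMPLE_RATE = 16000                            # assumed sample rate
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000     # 512 samples per 32 ms frame

def stream_frames(audio_stream):
    """Buffer arbitrary-length chunks and yield fixed 32 ms frames."""
    buffer = np.empty(0, dtype=np.float32)
    for chunk in audio_stream:                 # chunks of arbitrary length from the capture device
        buffer = np.concatenate([buffer, chunk])
        while len(buffer) >= FRAME_LEN:
            yield buffer[:FRAME_LEN]
            buffer = buffer[FRAME_LEN:]

# Example: 2536 buffered samples yield 4 complete 512-sample frames.
chunks = (np.random.randn(n).astype(np.float32) for n in (300, 700, 512, 1024))
print(len(list(stream_frames(chunks))))        # -> 4
```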
In a preferred embodiment, in step S1, the noise reduction module comprises an encoder, an LSTM module and a decoder: the real-time voice is fed into the encoder for encoding to obtain encoded first intermediate representation data, which is fed into the LSTM module for prediction to obtain second intermediate representation data, which is in turn fed into the decoder to obtain the denoised voice.
In a preferred embodiment, in step S1, the real-time voice fed into the encoder is encoded frame by frame, with a frame length of 30-34 ms, to obtain the encoded first intermediate representation data; preferably, encoding is performed on frames of 32 ms.
In a preferred embodiment, in step S1, feeding the real-time voice into the encoder for encoding comprises the following step: in an encoder composed of L encoding layers, the noisy audio data is taken as input data to obtain the encoded first intermediate representation data, expressed as:
mdata=f(Input)
In this structure, the encoder has L layers in total, with initial hidden_size H, convolution kernel size K and stride S. The i-th encoding layer comprises one convolution with kernel size K, stride S and 2^(i-1)·H output channels; one ReLU layer; one 1×1 convolution with 2^i·H output channels; and one GLU activation layer.
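The following is a minimal PyTorch sketch of such an L-layer encoder f(). PyTorch itself and the default values (L = 5, H = 48, K = 8, S = 4) are illustrative assumptions and are not values given in the patent:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoding layer: Conv1d(K, stride S) -> ReLU -> 1x1 Conv (doubles channels) -> GLU."""
    def __init__(self, in_ch, out_ch, kernel=8, stride=4):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel, stride)           # 2^(i-1)*H output channels
        self.relu = nn.ReLU()
        self.expand = nn.Conv1d(out_ch, 2 * out_ch, kernel_size=1)     # 1x1 conv to 2^i*H channels
        self.glu = nn.GLU(dim=1)                                       # GLU halves them back to out_ch

    def forward(self, x):
        return self.glu(self.expand(self.relu(self.conv(x))))

class Encoder(nn.Module):
    """L-layer encoder: mdata = f(Input), Input being noisy audio of shape (batch, 1, samples)."""
    def __init__(self, layers=5, hidden=48, kernel=8, stride=4):
        super().__init__()
        chans = [1] + [hidden * 2 ** i for i in range(layers)]
        self.layers = nn.ModuleList(
            [EncoderLayer(chans[i], chans[i + 1], kernel, stride) for i in range(layers)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x            # first intermediate representation data
```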
In a preferred embodiment, in step S1, the LSTM module is implemented as a sequence network model R: the LSTM module has two layers with hidden_size 2^(L-1)·H; for causal prediction a unidirectional LSTM network is used, while for non-causal prediction a BiLSTM is used, followed by a linear layer that fuses the two BiLSTM outputs, yielding the second intermediate representation data, expressed as:
R(z)=LSTM(z)+z。
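A minimal sketch of the sequence network model R (two LSTM layers with a residual connection); the causal/non-causal switch mirrors the description above, while the layer width and batch-first layout are illustrative assumptions:

```python
import torch.nn as nn

class SequenceModel(nn.Module):
    """R(z) = LSTM(z) + z over the encoder output."""
    def __init__(self, dim, causal=True):
        super().__init__()
        self.lstm = nn.LSTM(dim, dim, num_layers=2, bidirectional=not causal, batch_first=True)
        # In the non-causal (BiLSTM) case a linear layer fuses the two directions back to `dim`.
        self.fuse = nn.Identity() if causal else nn.Linear(2 * dim, dim)

    def forward(self, z):            # z: (batch, time, dim)
        out, _ = self.lstm(z)
        return self.fuse(out) + z    # residual connection
```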
In a preferred embodiment, in step S1, the decoder has L layers in total. The input of the i-th decoding layer has 2^(i-1)·H channels; a 1×1 convolution with 2^i·H channels is applied, a GLU layer then restores the channel count to 2^(i-1)·H, and finally a transposed convolution is applied. If the i-th layer is not the last layer it is followed by a ReLU layer; if it is the last layer, the output is a single channel and there is no ReLU layer. The expression is:
Output=g(mdata)
where mdata denotes the second intermediate representation data processed in sequence by the encoder and the LSTM module, Output denotes the output denoised audio data, and g() denotes a decoding layer of the decoder.
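A matching sketch of the L-layer decoder g(), mirroring the encoder down to a single output channel (the default hyperparameters are the same illustrative assumptions as in the encoder sketch):

```python
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoding layer: 1x1 Conv (doubles channels) -> GLU -> ConvTranspose1d (-> ReLU unless last)."""
    def __init__(self, in_ch, out_ch, kernel=8, stride=4, last=False):
        super().__init__()
        self.expand = nn.Conv1d(in_ch, 2 * in_ch, kernel_size=1)   # 1x1 conv doubles channels
        self.glu = nn.GLU(dim=1)                                   # GLU halves them back to in_ch
        self.deconv = nn.ConvTranspose1d(in_ch, out_ch, kernel, stride)
        self.act = nn.Identity() if last else nn.ReLU()

    def forward(self, x):
        return self.act(self.deconv(self.glu(self.expand(x))))

class Decoder(nn.Module):
    """L-layer decoder: Output = g(mdata); the last layer outputs a single channel with no ReLU."""
    def __init__(self, layers=5, hidden=48, kernel=8, stride=4):
        super().__init__()
        mods = []
        for i in reversed(range(layers)):
            in_ch = hidden * 2 ** i
            out_ch = 1 if i == 0 else hidden * 2 ** (i - 1)
            mods.append(DecoderLayer(in_ch, out_ch, kernel, stride, last=(i == 0)))
        self.layers = nn.ModuleList(mods)

    def forward(self, mdata):
        x = mdata
        for layer in self.layers:
            x = layer(x)
        return x            # denoised audio, shape (batch, 1, samples)
```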
In a preferred embodiment, in step S2, intercepting the target voice with a voice endpoint detection technique to obtain the auxiliary audio specifically comprises:
S21: extracting features from each 32 ms frame of the denoised signal, the extracted features being the energies of 6 sub-bands of the frame;
S22: training a classifier on a set of data frames from known speech and silence regions, the classifier being a two-dimensional Gaussian model;
S23: computing the log-likelihood ratio of each sub-band, summing the likelihood ratios of the 6 sub-bands, and classifying the unknown real-time frame as a speech signal or a silence signal: if the sum exceeds a decision threshold, the frame is classified as a speech signal;
S24: continuously updating the Gaussian model parameters; once more than 2 s of speech has been intercepted, the parameter updates stop and the noise reduction module and voice endpoint detection module are exited. A simplified sketch of this sub-band Gaussian voice endpoint detection is given after this list.
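The following sketch illustrates the sub-band Gaussian voice endpoint detection described in S21-S24. The initial model means, learning rate, decision threshold and the use of independent per-band Gaussians are assumptions made for illustration; they are not values given in the patent:

```python
import numpy as np

NUM_BANDS = 6

def subband_energies(frame):
    """S21: log energy in each of 6 equal FFT sub-bands of one 32 ms frame."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    bands = np.array_split(spectrum, NUM_BANDS)
    return np.log(np.array([b.sum() for b in bands]) + 1e-10)

class GaussianVAD:
    """Speech/silence Gaussian models per sub-band; decision by summed log-likelihood ratio (S22-S24)."""
    def __init__(self, threshold=0.5, lr=0.05):
        self.mu = {"speech": np.full(NUM_BANDS, 5.0), "silence": np.full(NUM_BANDS, 1.0)}
        self.var = {"speech": np.ones(NUM_BANDS), "silence": np.ones(NUM_BANDS)}
        self.threshold, self.lr = threshold, lr

    def _loglik(self, x, label):
        mu, var = self.mu[label], self.var[label]
        return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

    def is_speech(self, frame):
        x = subband_energies(frame)
        llr = (self._loglik(x, "speech") - self._loglik(x, "silence")).sum()
        label = "speech" if llr > self.threshold else "silence"
        # S24: continuously update the parameters of the winning model (simple running update).
        self.mu[label] += self.lr * (x - self.mu[label])
        self.var[label] += self.lr * ((x - self.mu[label]) ** 2 - self.var[label])
        return label == "speech"
```

Frames classified as speech are accumulated until more than 2 s have been collected, after which the model updates stop and the auxiliary audio is fixed, as described in S24.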
In a preferred embodiment, in step S3, the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder, and eliminating third-party voice using the intercepted auxiliary audio and the real-time input voice specifically comprises:
S31: taking the intercepted 2 s auxiliary voice and the real-time voice as inputs to the speech encoder, where the real-time voice is spliced with the preceding 1 s of voice data;
S32: the speech encoder stacks its results as output, the speech encoder being a multi-scale encoder;
the formula is expressed as follows:
mdata_i = Concat(f_1(Input_i), f_2(Input_i), f_3(Input_i))
f_i(x) = Conv1d(Pad(x))
Here f_i() denotes the encoder at each scale. In this structure, the input x_input (the input buffer) is zero-padded on its left and right sides to different lengths and then passed through a 1-d convolution; the concatenated results of the 3 scales serve as the input to the speaker encoder and the separator. The input is divided into two parts: the first, Input_1, is the real-time speech input, and the second, Input_2, is the registered voice of the target speaker to be extracted, i.e., the intercepted auxiliary audio.
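A minimal sketch of the multi-scale speech encoder: three differently padded 1-d convolutions whose outputs are concatenated, as in the formulas above. The three kernel lengths, the stride, the output channel count and the ReLU are illustrative assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSpeechEncoder(nn.Module):
    """mdata_i = Concat(f_1(Input_i), f_2(Input_i), f_3(Input_i)), with f_i(x) = Conv1d(Pad(x))."""
    def __init__(self, out_ch=256, kernels=(20, 80, 160), stride=10):
        super().__init__()
        self.kernels, self.stride = kernels, stride
        self.convs = nn.ModuleList([nn.Conv1d(1, out_ch, k, stride) for k in kernels])

    def forward(self, x):                         # x: (batch, 1, samples)
        outs, base = [], self.kernels[0]
        for k, conv in zip(self.kernels, self.convs):
            pad = k - base                        # zero-pad left/right so all scales give equal frame counts
            xp = F.pad(x, (pad // 2, pad - pad // 2))
            outs.append(torch.relu(conv(xp)))
        return torch.cat(outs, dim=1)             # stacked multi-scale representation
```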
S33: the output of the speaker encoder becomes one of the inputs to the speaker extractor and, together with the speech encoder output, is fed into the speaker extractor; the main structure of the speaker encoder comprises 1-d convolutions and ResBlocks, expressed as follows:
OUTPUT_embedding = g(mdata_2)
g(x) = Conv1d(ResBlock(Conv1d(x)) × 3)
Here g() denotes the speaker encoder, whose input mdata_2 is the output of the speech encoder for Input_2. The input is passed through one 1-d convolution and 3 ResBlocks, and a final 1-d convolution completes the speaker encoding; the ResBlock structure is expressed as follows:
ResBlock(x) = MaxPool(PReLU(x + Conv1d(PReLU(Conv1d(x)))))
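A sketch of the ResBlock and the speaker encoder g() following the formulas above. The kernel sizes, channel widths, max-pool width and the final time-averaging into a fixed-length speaker embedding are illustrative assumptions:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """ResBlock(x) = MaxPool(PReLU(x + Conv1d(PReLU(Conv1d(x)))))."""
    def __init__(self, channels, pool=3):
        super().__init__()
        self.conv1 = nn.Conv1d(channels, channels, kernel_size=1)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size=1)
        self.prelu1, self.prelu2 = nn.PReLU(), nn.PReLU()
        self.pool = nn.MaxPool1d(pool)

    def forward(self, x):
        return self.pool(self.prelu2(x + self.conv2(self.prelu1(self.conv1(x)))))

class SpeakerEncoder(nn.Module):
    """g(x) = Conv1d(ResBlock(Conv1d(x)) x 3): encodes the registered (auxiliary) voice."""
    def __init__(self, in_ch, channels=256, emb_dim=256):
        super().__init__()
        self.front = nn.Conv1d(in_ch, channels, kernel_size=1)
        self.blocks = nn.Sequential(*[ResBlock(channels) for _ in range(3)])
        self.head = nn.Conv1d(channels, emb_dim, kernel_size=1)

    def forward(self, mdata_aux):                 # mdata_2: encoded auxiliary audio
        h = self.head(self.blocks(self.front(mdata_aux)))
        return h.mean(dim=-1)                     # time-averaged OUTPUT_embedding
```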
S34: one of the speaker extractor outputs is multiplied with the initial input of the network, and the result is fed into the speech decoder.
The output of the speaker encoder becomes one of the inputs to the speaker extractor; it is then fed, together with mdata_1, into the speaker extractor, whose structure is expressed as follows:
OUTPUT_Mask = ReLU(Conv1d(OUTPUT_StackedTCNs))
OUTPUT_StackedTCNs = S(Conv1d(mdata_1), OUTPUT_embedding) × 4
S(u, v) = S'(u + Conv1d(PReLU(DeConv1d(PReLU(Conv1d(Concat(u, v))))))) × 6
S'(x) = Conv1d(PReLU(DeConv1d(PReLU(Conv1d(x)))))
In this part, the output is OUTPUT_Mask, where OUTPUT_StackedTCNs consists of 4 nested S(u, v) blocks, each of which contains 6 nested S'(x) blocks. DeConv1d in S(u, v) and S'(x) refers to dilated depthwise-separable convolution, which achieves an effect similar to an RNN while reducing the amount of computation.
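A sketch of the stacked-TCN speaker extractor: 4 repeats of 6 dilated blocks, with the speaker embedding concatenated at the start of every repeat and a ReLU mask applied to the encoded mixture, as the formulas above describe. The kernel size, exponentially growing dilations, hidden width and the broadcasting of the embedding over time are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    """S'(x): Conv1d -> PReLU -> dilated depthwise-separable Conv1d -> PReLU -> Conv1d, with a residual."""
    def __init__(self, channels, hidden, dilation):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, 1),
            nn.PReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=dilation,
                      padding=dilation, groups=hidden),      # dilated depthwise conv ("DeConv1d")
            nn.PReLU(),
            nn.Conv1d(hidden, channels, 1),
        )

    def forward(self, x):
        return x + self.net(x)

class SpeakerExtractor(nn.Module):
    """OUTPUT_StackedTCNs = S(Conv1d(mdata_1), OUTPUT_embedding) x 4, each S containing 6 S'(x) blocks;
    OUTPUT_Mask = ReLU(Conv1d(...)) is applied to the encoded mixture."""
    def __init__(self, channels, emb_dim, hidden=512, repeats=4, blocks=6):
        super().__init__()
        self.bottleneck = nn.Conv1d(channels, channels, 1)
        self.repeats = nn.ModuleList()
        for _ in range(repeats):
            stack = [nn.Conv1d(channels + emb_dim, channels, 1)]              # fuse Concat(u, v)
            stack += [TCNBlock(channels, hidden, dilation=2 ** b) for b in range(blocks)]
            self.repeats.append(nn.Sequential(*stack))
        self.mask_conv = nn.Conv1d(channels, channels, 1)

    def forward(self, mdata_mix, speaker_emb):    # mdata_1 and OUTPUT_embedding
        u = self.bottleneck(mdata_mix)
        v = speaker_emb.unsqueeze(-1).expand(-1, -1, u.shape[-1])             # broadcast over time
        for repeat in self.repeats:
            u = repeat(torch.cat([u, v], dim=1))
        mask = torch.relu(self.mask_conv(u))                                  # OUTPUT_Mask
        return mask * mdata_mix                                               # masked mixture for the decoder
```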
S35: in the real-time, neural-network-based background third-party voice removal technique of step S3, the OUTPUT_Mask produced by the speaker extractor is multiplied with the initial network input x_input, and the result is fed into the speech decoder, expressed as follows:
OUTPUT = D(OUTPUT_Mask * x_input)
D(x) = ConvTrans1d(x)
In this structure, the speech is reconstructed by the speech decoder to obtain the final required output.
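Finally, a sketch of the speech decoder D() together with a hypothetical end-to-end wiring of the sketches given earlier in this description (MultiScaleSpeechEncoder, SpeakerEncoder, SpeakerExtractor). All shapes, durations and hyperparameters here are illustrative assumptions, not the patented configuration:

```python
import torch
import torch.nn as nn

class SpeechDecoder(nn.Module):
    """D(x) = ConvTrans1d(x): reconstructs the waveform from the masked representation."""
    def __init__(self, in_ch, kernel=20, stride=10):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(in_ch, 1, kernel, stride)

    def forward(self, masked):                    # OUTPUT = D(OUTPUT_Mask * x_input)
        return self.deconv(masked)

enc = MultiScaleSpeechEncoder(out_ch=128)         # 3 scales -> 384 channels
spk_enc = SpeakerEncoder(in_ch=384, channels=256, emb_dim=256)
extractor = SpeakerExtractor(channels=384, emb_dim=256)
decoder = SpeechDecoder(in_ch=384)

realtime = torch.randn(1, 1, 16000)               # ~1 s of real-time input (Input_1), assumed 16 kHz
auxiliary = torch.randn(1, 1, 32000)              # ~2 s of intercepted auxiliary audio (Input_2)

mdata_1 = enc(realtime)                           # encoded mixture
embedding = spk_enc(enc(auxiliary))               # OUTPUT_embedding of the target speaker
masked = extractor(mdata_1, embedding)            # OUTPUT_Mask applied to the encoded mixture
output = decoder(masked)                          # reconstructed target speech waveform
print(output.shape)                               # torch.Size([1, 1, 16000])
```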
The invention also provides a neural-network-based background voice removal system, comprising:
a noise reduction module: used for performing noise reduction processing on the real-time voice fed into the noise reduction module;
an auxiliary audio module: used for performing voice endpoint detection on the denoised voice and intercepting the target voice to obtain an auxiliary audio;
a background third-party voice elimination module: used for eliminating background noise and third-party voice from the obtained auxiliary audio and the real-time input audio, to obtain auxiliary audio and real-time input audio from which background noise and third-party voice have been removed;
a merging output module: used for merging the auxiliary audio and real-time input audio from which background noise and third-party voice have been removed, and outputting the processed voice data.
In a preferred embodiment, the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder.
The above description presents only preferred embodiments of the present invention and is not intended to limit its scope; all equivalent structural or process modifications made using the contents of this specification and the accompanying drawings, whether applied directly or indirectly in other related technical fields, fall within the scope of protection of the present invention.
Claims (10)
1. A background voice removal method based on a neural network, characterized by comprising the following steps:
S1, feeding real-time voice into a noise reduction module for noise reduction processing to obtain denoised voice;
S2, performing voice endpoint detection on the denoised voice and intercepting the target voice to obtain an auxiliary audio;
S3, feeding the obtained auxiliary audio and the real-time input audio into a background third-party voice elimination module to eliminate background noise and third-party voice, obtaining auxiliary audio and real-time input audio from which background noise and third-party voice have been removed;
S4, merging the auxiliary audio and the real-time input audio from which background noise and third-party voice have been removed, and outputting the processed voice data.
2. The method of claim 1, wherein in step S1, the noise reduction module comprises an encoder, an LSTM module and a decoder, and the real-time voice is fed into the encoder for encoding to obtain encoded first intermediate representation data, which is fed into the LSTM module for prediction to obtain second intermediate representation data, which is in turn fed into the decoder to obtain the denoised voice.
3. The method of claim 2, wherein in step S1, the real-time voice fed into the encoder is encoded frame by frame, with a frame length of 30-34 ms, to obtain the encoded first intermediate representation data.
4. The method of claim 3, wherein in step S1, feeding the real-time voice into the encoder for encoding comprises the following step: in an encoder composed of L encoding layers, the noisy audio data is taken as input data to obtain the encoded first intermediate representation data, expressed as:
mdata=f(Input)
In this structure, the encoder has L layers in total, with initial hidden_size H, convolution kernel size K and stride S. The i-th encoding layer comprises one convolution with kernel size K, stride S and 2^(i-1)·H output channels; one ReLU layer; one 1×1 convolution with 2^i·H output channels; and one GLU activation layer.
5. The method of claim 4, wherein in step S1, the LSTM module has two layers with hidden_size 2^(L-1)·H; for causal prediction a unidirectional LSTM network is used, while for non-causal prediction a BiLSTM is used, followed by a linear layer that fuses the two BiLSTM outputs to obtain the second intermediate representation data, expressed as:
R(z)=LSTM(z)+z。
6. The method of claim 5, wherein in step S1, the decoder has L layers; the input of the i-th decoding layer has 2^(i-1)·H channels, a 1×1 convolution with 2^i·H channels is applied, a GLU layer then restores the channel count to 2^(i-1)·H, and finally a transposed convolution is applied; if the i-th layer is not the last layer it is followed by a ReLU layer, and if it is the last layer, the output is a single channel and there is no ReLU layer, expressed as:
Output=g(mdata)
where mdata denotes the second intermediate representation data processed in sequence by the encoder and the LSTM module, Output denotes the output denoised audio data, and g() denotes a decoding layer of the decoder.
7. The method of claim 1, wherein in step S2, intercepting the target voice with a voice endpoint detection technique to obtain the auxiliary audio comprises:
S21: extracting features from each 32 ms frame of the denoised signal, the extracted features being the energies of 6 sub-bands of the frame;
S22: training a classifier on a set of data frames from known speech and silence regions, the classifier being a two-dimensional Gaussian model;
S23: computing the log-likelihood ratio of each sub-band, summing the likelihood ratios of the 6 sub-bands, and classifying the unknown real-time frame as a speech signal or a silence signal: if the sum exceeds a decision threshold, the frame is classified as a speech signal;
S24: continuously updating the Gaussian model parameters; once more than 2 s of speech has been intercepted, the parameter updates stop and the noise reduction module and voice endpoint detection module are exited.
8. The method of claim 1, wherein in step S3, the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder, and eliminating third-party voice using the intercepted auxiliary audio and the real-time input speech specifically comprises:
S31: taking the intercepted 2 s auxiliary voice and the real-time voice as inputs to the speech encoder, where the real-time voice is spliced with the preceding 1 s of voice data;
S32: the speech encoder stacks its results as output, the speech encoder being a multi-scale encoder;
S33: the output of the speaker encoder becomes one of the inputs to the speaker extractor and, together with the speech encoder output, is fed into the speaker extractor, where the speaker encoder structure comprises 1-d convolutions and ResBlocks;
S34: one of the speaker extractor outputs is multiplied with the initial input of the network, and the result is fed into the speech decoder.
9. A background voice removal system based on a neural network, comprising:
a noise reduction module: used for performing noise reduction processing on the real-time voice fed into the noise reduction module;
an auxiliary audio module: used for performing voice endpoint detection on the denoised voice and intercepting the target voice to obtain an auxiliary audio;
a background third-party voice elimination module: used for eliminating background noise and third-party voice from the obtained auxiliary audio and the real-time input audio, to obtain auxiliary audio and real-time input audio from which background noise and third-party voice have been removed;
a merging output module: used for merging the auxiliary audio and real-time input audio from which background noise and third-party voice have been removed, and outputting the processed voice data.
10. The system of claim 9, wherein the background third-party voice elimination module comprises a speech encoder, a speaker extractor and a speech decoder.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210998674.XA CN115394310B (en) | 2022-08-19 | 2022-08-19 | Neural network-based background voice removing method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210998674.XA CN115394310B (en) | 2022-08-19 | 2022-08-19 | Neural network-based background voice removing method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115394310A CN115394310A (en) | 2022-11-25 |
CN115394310B true CN115394310B (en) | 2023-04-07 |
Family
ID=84120602
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210998674.XA Active CN115394310B (en) | 2022-08-19 | 2022-08-19 | Neural network-based background voice removing method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115394310B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115862581A (en) * | 2023-02-10 | 2023-03-28 | 杭州兆华电子股份有限公司 | Secondary elimination method and system for repeated pattern noise |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106971741A (en) * | 2016-01-14 | 2017-07-21 | 芋头科技(杭州)有限公司 | The method and system for the voice de-noising that voice is separated in real time |
US9947333B1 (en) * | 2012-02-10 | 2018-04-17 | Amazon Technologies, Inc. | Voice interaction architecture with intelligent background noise cancellation |
CN109087658A (en) * | 2018-07-16 | 2018-12-25 | 安徽国通亿创科技股份有限公司 | A kind of online interaction live streaming noise processed system |
CN110120225A (en) * | 2019-04-01 | 2019-08-13 | 西安电子科技大学 | A kind of audio defeat system and method for the structure based on GRU network |
WO2020034779A1 (en) * | 2018-08-14 | 2020-02-20 | Oppo广东移动通信有限公司 | Audio processing method, storage medium and electronic device |
CN111128215A (en) * | 2019-12-24 | 2020-05-08 | 声耕智能科技(西安)研究院有限公司 | Single-channel real-time noise reduction method and system |
CN111754982A (en) * | 2020-06-19 | 2020-10-09 | 平安科技(深圳)有限公司 | Noise elimination method and device for voice call, electronic equipment and storage medium |
CN114898762A (en) * | 2022-05-07 | 2022-08-12 | 北京快鱼电子股份公司 | Real-time voice noise reduction method and device based on target person and electronic equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US12062369B2 (en) * | 2020-09-25 | 2024-08-13 | Intel Corporation | Real-time dynamic noise reduction using convolutional networks |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9947333B1 (en) * | 2012-02-10 | 2018-04-17 | Amazon Technologies, Inc. | Voice interaction architecture with intelligent background noise cancellation |
CN106971741A (en) * | 2016-01-14 | 2017-07-21 | 芋头科技(杭州)有限公司 | The method and system for the voice de-noising that voice is separated in real time |
CN109087658A (en) * | 2018-07-16 | 2018-12-25 | 安徽国通亿创科技股份有限公司 | A kind of online interaction live streaming noise processed system |
WO2020034779A1 (en) * | 2018-08-14 | 2020-02-20 | Oppo广东移动通信有限公司 | Audio processing method, storage medium and electronic device |
CN110120225A (en) * | 2019-04-01 | 2019-08-13 | 西安电子科技大学 | A kind of audio defeat system and method for the structure based on GRU network |
CN111128215A (en) * | 2019-12-24 | 2020-05-08 | 声耕智能科技(西安)研究院有限公司 | Single-channel real-time noise reduction method and system |
CN111754982A (en) * | 2020-06-19 | 2020-10-09 | 平安科技(深圳)有限公司 | Noise elimination method and device for voice call, electronic equipment and storage medium |
CN114898762A (en) * | 2022-05-07 | 2022-08-12 | 北京快鱼电子股份公司 | Real-time voice noise reduction method and device based on target person and electronic equipment |
Non-Patent Citations (2)
Title |
---|
Separate Sound into STFT Frames to Eliminate Sound Noise Frames in Sound Classification; Thanh Tran; 2021 IEEE Symposium Series on Computational Intelligence (SSCI); full text *
Research on Speech Signal Preprocessing Methods Based on Deep Learning in Complex Environments; Gao Tian; China Doctoral Dissertations Full-text Database; full text *
Also Published As
Publication number | Publication date |
---|---|
CN115394310A (en) | 2022-11-25 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |