CN115565543B - Single-channel voice echo cancellation method and device based on deep neural network
- Publication number
- CN115565543B (application CN202211482692.9A)
- Authority
- CN
- China
- Prior art keywords
- domain signal
- layer
- frequency domain
- frequency
- signal characteristics
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L2021/02082—Noise filtering the noise being echo, reverberation of the speech
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention discloses a single-channel speech echo cancellation method and device based on a deep neural network. The method comprises: computing the near-end and far-end frequency-domain signals and extracting signal features from each; splicing the frequency-domain signal features and inputting the spliced features into an encoding framework, wherein the encoding framework comprises 3 two-dimensional convolutional layers, each followed by a batch normalization layer and a PReLU layer; inputting the signal features output by the encoding framework sequentially into one frequency GRU layer and one time GRU layer; inputting the signal features output by the time GRU layer into a decoding framework; and performing an optimization calculation on the output of the decoding framework and outputting a speech time-domain signal. The invention exploits the encoding-decoding framework and the correlation among frequency points of the time-frequency features, has a small model size, few parameters and low performance consumption, and achieves a better echo suppression effect.
Description
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to a method and an apparatus for single-channel speech echo cancellation based on a deep neural network, an electronic device, and a storage medium.
Background
In a remote audio and video conference communication system, when the near-end microphone is acoustically coupled with the loudspeaker, the microphone picks up the voice signal played by the loudspeaker and transmits it back to the opposite end through the communication system, so that the opposite end hears its own voice as an echo. This echo seriously degrades the call quality of the conference system, so echo cancellation technology is of great significance for high-quality remote real-time audio and video communication. Traditional echo cancellation methods based on signal processing face many technical challenges in practical applications, such as nonlinear echo and double talk, while currently disclosed echo cancellation methods based on deep neural networks have model structures that are unsuitable for real-time inference and models so large that low-power operation on the local device cannot be achieved. Therefore, how to provide an echo cancellation technique with low performance consumption and a small model size on the basis of the deep neural network approach is a technical problem that urgently needs to be solved.
Disclosure of Invention
An object of the embodiments of the present specification is to provide a method, an apparatus, an electronic device, and a storage medium for single-channel speech echo cancellation based on a deep neural network, in order to solve the above problems.
In order to solve the above technical problem, the embodiments of the present specification are implemented as follows:
in a first aspect, a method for single-channel speech echo cancellation based on a deep neural network is provided, including:
performing a Fourier transform on the near-end time-domain signal collected by the near-end microphone and on the far-end time-domain signal, respectively, to obtain a near-end frequency-domain signal and a far-end frequency-domain signal, and extracting signal features from the near-end frequency-domain signal and from the far-end frequency-domain signal;
splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting them into an encoding framework, wherein the encoding framework comprises 3 two-dimensional convolutional layers, each followed in sequence by a batch normalization layer and a PReLU layer;
inputting the signal features output by the encoding framework sequentially into one frequency GRU layer and one time GRU layer;
inputting the signal features output by the time GRU layer into a decoding framework, wherein the decoding framework comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers in the encoding framework, each followed in sequence by a batch normalization layer and a PReLU layer;
performing an optimization calculation on the output of the decoding framework and outputting a speech time-domain signal.
Further, the convolution kernels of the two-dimensional convolutional layers are […], […] and […] respectively, and the corresponding strides are […], […] and […] respectively;
and/or the number of nodes of the frequency GRU layer and of the time GRU layer is 32;
and/or the convolution kernels of the transposed convolutional layers are […], […] and […] respectively, and the corresponding strides are […], […] and […] respectively;
and/or the encoding framework and the decoding framework exchange information through skip connections.
Further, the process of performing a Fourier transform on the near-end time-domain signal collected by the near-end microphone and on the far-end time-domain signal to obtain a near-end frequency-domain signal and a far-end frequency-domain signal, and of extracting signal features from each, comprises:
performing a Fourier transform on the near-end time-domain signal and on the far-end time-domain signal respectively, wherein the number of Fourier transform points is 512;
obtaining the magnitude of the near-end frequency-domain signal and the magnitude of the far-end frequency-domain signal;
calculating and outputting the near-end frequency-domain signal features and the far-end frequency-domain signal features, each of which comprises 257 frequency points.
Further, the process of splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting them into the encoding framework comprises:
splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features to form first spliced frequency-domain signal features, which comprise 514 frequency points;
inputting the first spliced frequency-domain signal features into the encoding framework and, according to the number of two-dimensional convolutional layers, passing them sequentially through each two-dimensional convolutional layer and its corresponding batch normalization layer and PReLU layer to compute first frequency-domain signal features;
outputting the first frequency-domain signal features.
Further, the process of inputting the signal features output by the time GRU layer into the decoding framework comprises:
passing the first frequency-domain signal features sequentially through the frequency GRU layer and the time GRU layer to obtain second frequency-domain signal features, inputting the second frequency-domain signal features into the decoding framework and, according to the number of transposed convolutional layers, passing them sequentially through each transposed convolutional layer and its corresponding batch normalization layer and PReLU layer to compute third frequency-domain signal features;
outputting the third frequency-domain signal features.
Further, the process of performing an optimization calculation on the output of the decoding framework and outputting a speech time-domain signal comprises:
obtaining a corresponding complex ideal ratio mask from the third frequency-domain signal features;
applying the real part and the imaginary part of the complex ideal ratio mask to the real part and the imaginary part of the near-end frequency-domain signal, and calculating the real part […] and the imaginary part […] of the optimized near-end frequency-domain signal;
performing an inverse Fourier transform on the optimized near-end frequency-domain signal to output a near-end time-domain signal.
Further, the method comprises training the neural network used in the speech echo cancellation process, wherein the loss function used is […], where V is the frequency-domain representation of the clean near-end speech signal after Fourier transform, V_r is the real part of V, and V_i is the imaginary part of V; and/or
the optimizer used is an Adam optimizer with a learning rate of 0.001, and the learning rate is adjusted according to a preset number of optimization rounds and/or the optimization verification result.
In a second aspect, a single-channel speech echo cancellation device based on a deep neural network is provided, including:
a first module, capable of performing a Fourier transform on the near-end time-domain signal collected by the near-end microphone and on the far-end time-domain signal, respectively, to obtain a near-end frequency-domain signal and a far-end frequency-domain signal, and of extracting signal features from each;
a second module, capable of splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting them into an encoding framework, wherein the encoding framework comprises 3 two-dimensional convolutional layers, each followed in sequence by a batch normalization layer and a PReLU layer;
a third module, capable of inputting the signal features output by the encoding framework sequentially into one frequency GRU layer and one time GRU layer;
a fourth module, capable of inputting the signal features output by the time GRU layer into a decoding framework, wherein the decoding framework comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers of the encoding framework, each followed in sequence by a batch normalization layer and a PReLU layer;
a fifth module, capable of performing an optimization calculation on the output of the decoding framework and outputting a speech time-domain signal.
In a third aspect, a computer device is provided, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when a computer device is running, the machine-readable instructions when executed by the processor performing the method of the first aspect.
In a fourth aspect, a computer-readable storage medium is proposed, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the method of the first aspect.
The specification can at least achieve the following technical effects:
The scheme of the invention is based on a U-shaped network structure with an encoding-decoding framework, makes full use of the correlation among frequency points in the time-frequency features, has a small model size, few parameters and low performance consumption, can run on local devices in real time, and achieves a better echo suppression effect.
Drawings
In order to illustrate the embodiments of the present specification or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only some of the embodiments described in the present specification, and those skilled in the art can obtain other drawings from them without any creative effort.
Fig. 1 is a first schematic diagram of a deep neural network-based single-channel speech echo cancellation method provided in an embodiment of the present disclosure.
Fig. 2 is a second schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.
Fig. 3 is a third schematic diagram of a deep neural network-based single-channel speech echo cancellation method provided in an embodiment of the present disclosure.
Fig. 4 is a fourth schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.
Fig. 5 is a fifth schematic diagram of a deep neural network-based single-channel speech echo cancellation method provided in an embodiment of the present specification.
Fig. 6 is a sixth schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.
Fig. 7 is a seventh schematic diagram of a deep neural network-based single-channel speech echo cancellation method provided in an embodiment of the present disclosure.
Fig. 8 is an eighth schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.
Fig. 9 is a ninth schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.
Fig. 10 is a schematic diagram of a deep neural network-based single-channel speech echo cancellation device provided in an embodiment of the present specification.
Fig. 11 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without making any creative effort shall fall within the protection scope of the present specification.
The following describes the deep neural network-based single-channel speech echo cancellation scheme in detail by way of a specific example.
Example one
The invention aims to overcome the shortcomings of existing deep neural network-based echo cancellation algorithms, namely complex models, large model size and failure to consider the correlation among frequency points, and provides an echo cancellation scheme that has low computational cost, runs in real time, uses few parameters and performs well in complex acoustic environments. Acoustic echo usually occurs in voice communication terminals. Because non-linear processing, such as lossy vocoders and transcoding, is applied in the communication channel, the acoustic echo has to be processed locally on the device; since the device's local computing resources are usually limited, this places strict demands on the complexity of the echo cancellation algorithm. In addition, real-time conferencing systems tolerate only limited processing delay, and the devices may be placed in noisy environments; together with the non-linear distortion that loudspeakers may introduce, these factors all pose challenges to echo cancellation methods. Referring to fig. 1, in a real-time audio-video communication system, the far-end signal is played through the near-end loudspeaker and then picked up again by the near-end microphone. Considering that near-end speech and ambient noise may be present at the same time, the signal picked up by the near-end microphone can be expressed as: […]. Here the echo path is the path formed from the playing of the far-end signal to its collection by the near-end microphone, the operator between the signals denotes convolution, and the delay is that of the far-end signal, relative to the original far-end signal, when it is re-acquired by the microphone. The aim of the echo cancellation system is to obtain, given the known near-end microphone signal and the far-end reference signal, an estimate of the clean near-end speech signal that is as accurate as possible.
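The signal model described above can be illustrated with a minimal NumPy sketch. Everything specific in it is an assumption made for illustration only: the variable names (`far_end`, `near_speech`, `noise`, `echo_path`, `delay`), the 16 kHz sampling rate, the decaying random impulse response and the use of random noise in place of real audio are not taken from the patent text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16000                                  # one second at an assumed 16 kHz rate

far_end = rng.standard_normal(n)           # far-end reference signal (stand-in)
near_speech = rng.standard_normal(n)       # near-end speech (stand-in)
noise = 0.01 * rng.standard_normal(n)      # ambient noise (stand-in)

delay = 80                                 # hypothetical delay in samples
taps = 256                                 # hypothetical echo-path length
# Hypothetical echo path: a random impulse response with exponential decay.
echo_path = 0.5 * np.exp(-np.arange(taps) / 32.0) * rng.standard_normal(taps)

# Delay the far-end signal, then convolve it with the echo path.
delayed = np.concatenate([np.zeros(delay), far_end])[:n]
echo = np.convolve(delayed, echo_path)[:n]

# Microphone signal: near-end speech + echo + ambient noise.
mic = near_speech + echo + noise
```

An echo canceller in this setting receives only `mic` and `far_end` and tries to recover `near_speech`.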
Therefore, the technical idea of the scheme of the invention is to make full use, under an encoding-decoding framework, of the strengths of the recurrent neural network, in particular the gated recurrent unit (GRU), in time-series processing and of the convolutional neural network in feature extraction, thereby achieving a better echo cancellation effect.
Referring to fig. 2, it is a schematic diagram of a single-channel speech echo cancellation scheme based on a deep neural network according to an embodiment of the present invention. The echo cancellation method disclosed by the invention is carried out under a U-shaped network structure consisting of an encoding-decoding framework. First, the near-end signal and the far-end signal are each converted from the time domain to the frequency domain and features are extracted; second, the features are passed sequentially through four parts, namely the encoding framework, the frequency GRU layer, the time GRU layer and the decoding framework; finally, the output of the decoding framework is converted into a speech time-domain signal. This framework combines the advantages of the recurrent neural network and the convolutional neural network while exploiting the correlation among frequency points in the spectral features. It should be noted that time-domain signals in the embodiments of the present specification are represented by lower-case letters, and the corresponding frequency-domain signals by upper-case letters. Referring to fig. 3, a schematic diagram of a method for single-channel speech echo cancellation based on a deep neural network in an embodiment of the present invention is shown, where the method includes:
S1: A Fourier transform is performed on the near-end time-domain signal collected by the near-end microphone and on the far-end time-domain signal, respectively, to obtain a near-end frequency-domain signal and a far-end frequency-domain signal, and signal features are extracted from each.
Optionally, the process of performing the Fourier transform on the near-end time-domain signal and the far-end time-domain signal to obtain the near-end frequency-domain signal and the far-end frequency-domain signal, and of extracting signal features from each, is shown in fig. 4 and includes:
S111: A Fourier transform is performed on the near-end time-domain signal and on the far-end time-domain signal respectively, wherein the number of Fourier transform points is 512.
S112: The magnitude of the near-end frequency-domain signal and the magnitude of the far-end frequency-domain signal are obtained.
S113: The near-end frequency-domain signal features and the far-end frequency-domain signal features are calculated and output, each comprising 257 frequency points.
It should be noted that the fast Fourier transform (FFT) is an efficient algorithm for computing the discrete Fourier transform (DFT). Because the spectrum of a real-valued signal is conjugate-symmetric, after a 512-point FFT of real samples only half of the frequency points, i.e., 256, need to be retained; together with the direct-current component, i.e., the component at frequency 0, these form the 257 frequency-point features.
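The 257-point count can be checked numerically. The following sketch uses NumPy's `fft` and `rfft` on a single random frame; the random frame is a stand-in for real audio, and the absence of a window function is a simplification for illustration, since the text does not specify one.

```python
import numpy as np

n_fft = 512                              # number of Fourier transform points from the text
rng = np.random.default_rng(0)
frame = rng.standard_normal(n_fft)       # one real-valued frame (stand-in for audio)

full = np.fft.fft(frame)                 # all 512 complex bins
half = np.fft.rfft(frame)                # only the non-redundant bins

# Conjugate symmetry of a real signal's spectrum: X[k] == conj(X[N - k]).
symmetric = np.allclose(full[1:], np.conj(full[1:][::-1]))

num_bins = half.shape[0]                 # n_fft // 2 + 1
magnitude = np.abs(half)                 # magnitude feature, one value per bin
```

Here `num_bins` is 257: the DC bin plus bins 1 through 256, the rest of the 512-point spectrum being the mirrored conjugates.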
S2: The near-end frequency-domain signal features and the far-end frequency-domain signal features are spliced and input into an encoding framework, wherein the encoding framework comprises 3 two-dimensional convolutional layers, each followed in sequence by a batch normalization layer and a PReLU layer. Fig. 5 is a schematic diagram of the encoding framework.
Optionally, the convolution kernels of the two-dimensional convolutional layers are […], […] and […] respectively, and the corresponding strides are […], […] and […] respectively.
Optionally, the process of splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting them into the encoding framework is shown in fig. 6 and includes:
S211: The near-end frequency-domain signal features and the far-end frequency-domain signal features are spliced to form first spliced frequency-domain signal features, which comprise 514 frequency points.
S212: The first spliced frequency-domain signal features are input into the encoding framework and, according to the number of two-dimensional convolutional layers, passed sequentially through each two-dimensional convolutional layer and its corresponding batch normalization layer and PReLU layer to compute first frequency-domain signal features.
S213: The first frequency-domain signal features are output.
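The splicing in S211 can be sketched as a concatenation along the frequency axis. The array layout `(frames, frequency points)`, the frame count and the constant-valued stand-in features are assumptions for illustration; the text only states that 257 + 257 points become 514.

```python
import numpy as np

frames = 10                                   # hypothetical number of frames
near_feat = np.zeros((frames, 257))           # stand-in near-end magnitude features
far_feat = np.ones((frames, 257))             # stand-in far-end magnitude features

# Splice along the frequency axis: near-end points first, far-end points after.
spliced = np.concatenate([near_feat, far_feat], axis=-1)

# Add batch and channel axes, the layout a 2-D convolution usually expects.
encoder_input = spliced[np.newaxis, np.newaxis, :, :]   # (1, 1, frames, 514)
```

With this layout, columns 0 to 256 of `spliced` hold the near-end features and columns 257 to 513 hold the far-end features.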
S3: The signal features output by the encoding framework are input sequentially into one frequency GRU layer and one time GRU layer. Optionally, the number of nodes of the frequency GRU layer and of the time GRU layer is 32.
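One plausible reading of this step is sketched below in NumPy: the frequency GRU scans across frequency points within each frame, and the time GRU then scans across frames for each frequency point. This interpretation, the `GRU` class, the `bottleneck` helper, the random weights and the encoder output shape `(16, 5, 63)` are all assumptions for illustration; only the hidden size of 32 comes from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class GRU:
    """Minimal GRU layer, forward pass only, with random (untrained) weights."""

    def __init__(self, input_size, hidden_size, rng):
        scale = 1.0 / np.sqrt(hidden_size)
        self.hidden_size = hidden_size
        # Index 0: update gate, 1: reset gate, 2: candidate state.
        self.w_x = rng.uniform(-scale, scale, (3, input_size, hidden_size))
        self.w_h = rng.uniform(-scale, scale, (3, hidden_size, hidden_size))
        self.b = np.zeros((3, hidden_size))

    def forward(self, seq):
        """seq: (batch, steps, input_size) -> (batch, steps, hidden_size)."""
        batch, steps, _ = seq.shape
        h = np.zeros((batch, self.hidden_size))
        outputs = []
        for t in range(steps):
            x = seq[:, t]
            z = sigmoid(x @ self.w_x[0] + h @ self.w_h[0] + self.b[0])
            r = sigmoid(x @ self.w_x[1] + h @ self.w_h[1] + self.b[1])
            cand = np.tanh(x @ self.w_x[2] + (r * h) @ self.w_h[2] + self.b[2])
            h = (1.0 - z) * cand + z * h
            outputs.append(h)
        return np.stack(outputs, axis=1)

HIDDEN = 32  # number of nodes stated in the text

def bottleneck(enc, freq_gru, time_gru):
    """enc: (channels, frames, freq_bins) -> (HIDDEN, frames, freq_bins)."""
    freq_in = enc.transpose(1, 2, 0)          # (frames, freq_bins, channels)
    freq_out = freq_gru.forward(freq_in)      # scan along frequency, per frame
    time_in = freq_out.transpose(1, 0, 2)     # (freq_bins, frames, HIDDEN)
    time_out = time_gru.forward(time_in)      # scan along time, per frequency bin
    return time_out.transpose(2, 1, 0)        # (HIDDEN, frames, freq_bins)

rng = np.random.default_rng(0)
channels, frames, freq_bins = 16, 5, 63       # hypothetical encoder output shape
enc = rng.standard_normal((channels, frames, freq_bins))
freq_gru = GRU(channels, HIDDEN, rng)
time_gru = GRU(HIDDEN, HIDDEN, rng)
dec_in = bottleneck(enc, freq_gru, time_gru)
```

Because the time GRU in this sketch is unidirectional, the output for a given frame depends only on that frame and earlier ones, which is the property a real-time system needs.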
S4: The signal features output by the time GRU layer are input into a decoding framework, wherein the decoding framework comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers in the encoding framework, each followed in sequence by a batch normalization layer and a PReLU layer. Fig. 7 is a schematic diagram of the decoding framework.
Optionally, the convolution kernels of the transposed convolutional layers are […], […] and […] respectively, and the corresponding strides are […], […] and […] respectively. Optionally, the encoding framework and the decoding framework exchange information through skip connections.
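For skip connections to work, each decoder layer has to restore the feature size of the encoder layer it mirrors. The sketch below checks this with the usual output-length formulas for unpadded convolution and transposed convolution. The kernel width 3 and stride 2 are purely hypothetical stand-ins, since the actual kernel and stride values are not preserved in this text; the helper names are likewise illustrative.

```python
def conv_out_len(n_in, kernel, stride):
    """Output length of an unpadded strided convolution."""
    return (n_in - kernel) // stride + 1

def deconv_out_len(n_in, kernel, stride, output_padding=0):
    """Output length of an unpadded transposed convolution."""
    return (n_in - 1) * stride + kernel + output_padding

KERNEL, STRIDE = 3, 2            # hypothetical values along the frequency axis

# Encoder: 3 convolutional layers applied to the 514 spliced frequency points.
enc_widths = [514]
for _ in range(3):
    enc_widths.append(conv_out_len(enc_widths[-1], KERNEL, STRIDE))

# Decoder: 3 transposed convolutions, each targeting the mirrored encoder width.
dec_widths = [enc_widths[-1]]
for target in reversed(enc_widths[:-1]):
    # Samples dropped by the floor division in the encoder are restored here.
    pad = (target - KERNEL) % STRIDE
    dec_widths.append(deconv_out_len(dec_widths[-1], KERNEL, STRIDE, pad))
```

With these stand-in values the encoder widths are 514, 256, 127 and 63, and the decoder retraces them in reverse, so features at matching depths can be combined by a skip connection.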
Optionally, the process of inputting the signal features output by the time GRU layer into the decoding framework is shown in fig. 8 and includes:
S411: The first frequency-domain signal features are passed sequentially through the frequency GRU layer and the time GRU layer to obtain second frequency-domain signal features; the second frequency-domain signal features are input into the decoding framework and, according to the number of transposed convolutional layers, passed sequentially through each transposed convolutional layer and its corresponding batch normalization layer and PReLU layer to compute third frequency-domain signal features.
S412: The third frequency-domain signal features are output.
S5: An optimization calculation is performed on the output of the decoding framework, and a speech time-domain signal is output.
Optionally, the process of performing the optimization calculation on the output of the decoding framework and outputting the speech time-domain signal is shown in fig. 9 and includes:
S511: A corresponding complex ideal ratio mask is obtained from the third frequency-domain signal features.
S512: The real part and the imaginary part of the complex ideal ratio mask are applied to the real part and the imaginary part of the near-end frequency-domain signal, and the real part […] and the imaginary part […] of the optimized near-end frequency-domain signal are calculated.
S513: An inverse Fourier transform is performed on the optimized near-end frequency-domain signal to output a near-end time-domain signal.
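Steps S512 and S513 can be sketched as follows. The exact formulas are not preserved in this text, so the sketch assumes the usual convention that a complex ratio mask is applied by complex multiplication; the function name `apply_mask`, the variable names and the random mask values are illustrative only.

```python
import numpy as np

def apply_mask(spec, mask_real, mask_imag):
    """Apply a complex mask to a spectrum by complex multiplication (assumed)."""
    out_real = mask_real * spec.real - mask_imag * spec.imag
    out_imag = mask_real * spec.imag + mask_imag * spec.real
    return out_real + 1j * out_imag

n_fft = 512
rng = np.random.default_rng(0)
frame = rng.standard_normal(n_fft)            # stand-in near-end microphone frame
near_spec = np.fft.rfft(frame)                # 257 complex frequency points

# Hypothetical mask values; in the method they are derived from the decoder output.
mask_real = rng.uniform(-1.0, 1.0, near_spec.shape)
mask_imag = rng.uniform(-1.0, 1.0, near_spec.shape)

enhanced_spec = apply_mask(near_spec, mask_real, mask_imag)
enhanced_frame = np.fft.irfft(enhanced_spec, n=n_fft)   # back to the time domain

# Sanity check: an all-pass mask (1 + 0j) returns the input frame unchanged.
identity = np.fft.irfft(
    apply_mask(near_spec, np.ones(257), np.zeros(257)), n=n_fft
)
```

Unlike a magnitude-only mask, a complex mask can adjust phase as well as magnitude, which is why both real and imaginary parts appear in S512.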
Optionally, the loss function used for neural network training in the speech echo cancellation process is […], where V is a complex number, namely the frequency-domain representation of the clean near-end speech signal v after Fourier transform, V_r is the real part of V, and V_i is the imaginary part of V; the loss function measures the difference between the near-end speech signal predicted by the neural network being trained and the clean near-end speech signal; and/or
the optimizer used is an Adam optimizer with a learning rate of 0.001, and the learning rate is adjusted according to a preset number of optimization rounds and/or the optimization verification result. Specifically, when the optimization verification result has not improved for 4 consecutive optimization rounds, the learning rate may be reduced to half of its previous value, and one or more further rounds of optimization are then performed.
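The learning-rate rule described above can be sketched in plain Python. The function name `halve_on_plateau` and the validation losses are hypothetical; the initial rate of 0.001, the patience of 4 rounds and the halving factor come from the text. The Adam update itself is not shown, and treating "not improved" as "not lower than the best validation loss so far" is an assumption.

```python
def halve_on_plateau(val_losses, lr=0.001, patience=4, factor=0.5):
    """Return the learning rate used in each round, given per-round validation losses."""
    best = float("inf")
    stale = 0                      # consecutive rounds without improvement
    history = []
    for loss in val_losses:
        history.append(lr)         # rate in effect during this round
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr *= factor       # halve the rate after 4 stale rounds
                stale = 0
    return history

# Hypothetical validation losses: no improvement in rounds 2 to 5.
val_losses = [1.0, 0.9, 0.95, 0.96, 0.97, 0.98, 0.8]
history = halve_on_plateau(val_losses)
```

In this example the first six rounds run at 0.001 and the seventh at 0.0005, because rounds 2 through 5 all fail to beat the best loss of 0.9.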
Example two
Fig. 10 is a schematic structural diagram of a deep neural network-based single-channel speech echo cancellation device 1000 according to an embodiment of the present specification. Referring to fig. 10, in an embodiment, a deep neural network-based single channel speech echo cancellation apparatus 1000 includes:
a first module 1001, capable of performing Fourier transform on the near-end time domain signal acquired by the near-end microphone and on the far-end time domain signal, respectively, to obtain a near-end frequency domain signal and a far-end frequency domain signal, and of extracting signal features from the near-end frequency domain signal and the far-end frequency domain signal, respectively;
a second module 1002, capable of splicing the near-end frequency domain signal features and the far-end frequency domain signal features and inputting the spliced features into an encoding frame, where the encoding frame includes 3 two-dimensional convolutional layers, and a batch regularization layer and a PReLU layer sequentially located after each two-dimensional convolutional layer;
a third module 1003, capable of sequentially inputting the signal features output by the encoding frame into a 1-layer frequency GRU layer and a 1-layer time GRU layer;
a fourth module 1004, capable of inputting the signal features output by the time GRU layer into a decoding frame, the decoding frame comprising transposed convolutional layers corresponding to the two-dimensional convolutional layers of the encoding frame, and the batch regularization layer and the PReLU layer sequentially located after each of the transposed convolutional layers;
a fifth module 1005, capable of outputting the speech time domain signal after performing optimization calculation on the output result of the decoding frame.
It should be understood that, in the embodiment of the present specification, the deep neural network-based single-channel speech echo cancellation device may further perform the method performed by the deep neural network-based single-channel speech echo cancellation device (or apparatus) in fig. 1 to 9, and implement the functions of the deep neural network-based single-channel speech echo cancellation device (or apparatus) in the examples shown in fig. 1 to 9, which are not described herein again.
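To illustrate why the decoding frame uses transposed convolutional layers that correspond to the encoding frame's convolutional layers, the sketch below tracks the length of the frequency axis through 3 strided convolutions and back through 3 matching transposed convolutions. The kernel size 5 and stride 2 are hypothetical placeholders (not the patent's values), padding is assumed to be zero, and the helper names are invented for this example; only the 514 input frequency points and the count of 3 layers come from the text.

```python
def conv_out_len(length, kernel, stride):
    """Output length of an unpadded strided convolution along one axis."""
    return (length - kernel) // stride + 1


def transposed_conv_out_len(length, kernel, stride, output_padding=0):
    """Output length of the matching unpadded transposed convolution."""
    return (length - 1) * stride + kernel + output_padding


# HYPOTHETICAL kernel size and stride along the frequency axis.
KERNEL, STRIDE = 5, 2

# Encoder: 514 spliced frequency points pass through 3 convolutional layers.
encoder_lengths = [514]
for _ in range(3):
    encoder_lengths.append(conv_out_len(encoder_lengths[-1], KERNEL, STRIDE))

# Decoder: undo the layers in reverse order, restoring each encoder length.
decoder_lengths = [encoder_lengths[-1]]
output_paddings = []
for target in reversed(encoder_lengths[:-1]):
    base = transposed_conv_out_len(decoder_lengths[-1], KERNEL, STRIDE)
    # A strided convolution may drop trailing samples; output_padding restores them.
    output_paddings.append(target - base)
    decoder_lengths.append(
        transposed_conv_out_len(decoder_lengths[-1], KERNEL, STRIDE, target - base)
    )
```

With these placeholder values the lengths run 514, 255, 126, 61 on the way down and 61, 126, 255, 514 on the way back up. Matching sizes layer by layer is also what makes skip connections between encoder and decoder outputs straightforward.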
Example three
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present specification. Referring to Fig. 11, at the hardware level, the electronic device includes a processor, and optionally further includes an internal bus, a network interface, and a memory. The memory may include internal memory, such as Random-Access Memory (RAM), and may further include non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be connected to each other via an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 11, but that does not indicate only one bus or one type of bus.
The memory is used for storing programs. In particular, a program may include program code comprising computer operating instructions. The memory may include both internal memory and non-volatile storage, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile storage into the internal memory and then runs it, forming the deep neural network-based single-channel speech echo cancellation device at the logical level. The processor executes the program stored in the memory, and is specifically used for performing the following operations:
S1: performing Fourier transform on the near-end time domain signal collected by the near-end microphone and on the far-end time domain signal, respectively, to obtain a near-end frequency domain signal and a far-end frequency domain signal, and extracting signal features from the near-end frequency domain signal and the far-end frequency domain signal, respectively;
S2: splicing the near-end frequency domain signal characteristics and the far-end frequency domain signal characteristics and inputting the spliced characteristics into a coding frame, wherein the coding frame comprises 3 two-dimensional convolutional layers, and a batch regularization layer and a PReLU layer which are sequentially positioned behind each two-dimensional convolutional layer;
S3: sequentially inputting the signal characteristics output by the coding frame into a 1-layer frequency GRU layer and a 1-layer time GRU layer;
S4: inputting the signal characteristics output by the time GRU layer into a decoding frame, wherein the decoding frame comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers in the coding frame, and the batch regularization layer and the PReLU layer which are sequentially positioned after each transposed convolutional layer;
S5: outputting the speech time domain signal after performing optimization calculation on the output result of the decoding frame.
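Steps S1 and S2 can be illustrated with a small standard-library Python sketch. It shows why a 512-point Fourier transform of a real signal yields 257 frequency points per signal, and why splicing the near-end and far-end features yields 514. Using the plain amplitude as the feature, the naive DFT, and the synthetic test tones are assumptions for illustration; a practical implementation would use an optimized FFT library.

```python
import cmath
import math

N_FFT = 512  # number of Fourier transform points


def one_sided_dft(frame):
    """Naive DFT of a real frame, keeping bins 0..N/2 (that is, N/2 + 1 bins)."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n // 2 + 1)
    ]


def amplitude_features(frame):
    """ASSUMED feature: the amplitude of each one-sided frequency bin."""
    return [abs(c) for c in one_sided_dft(frame)]


# Synthetic single frames (made up): pure tones centred on bins 8 and 16.
near_frame = [math.sin(2 * math.pi * 8 * t / N_FFT) for t in range(N_FFT)]
far_frame = [math.cos(2 * math.pi * 16 * t / N_FFT) for t in range(N_FFT)]

near_features = amplitude_features(near_frame)  # 257 frequency points
far_features = amplitude_features(far_frame)    # 257 frequency points
spliced = near_features + far_features          # 514 frequency points
```

The remaining 255 bins of a 512-point transform are complex conjugates of bins 1 to 255 for real input, so they carry no extra information and are dropped.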
The deep neural network-based single-channel speech echo cancellation method disclosed in the embodiments shown in Fig. 1 to Fig. 9 of the present specification can be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Such a processor can implement or perform the methods, steps and logic blocks disclosed in the embodiments of the present specification. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present specification may be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, erasable programmable read-only memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.
Of course, besides the software implementation, the electronic device of the embodiment of the present disclosure does not exclude other implementations, such as a logic device or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or a logic device.
Example four
Embodiments of the present specification also propose a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by a portable electronic device comprising a plurality of application programs, cause the portable electronic device to perform the method of the embodiments shown in Fig. 1 to Fig. 9, and in particular to perform the following method:
S1: performing Fourier transform on the near-end time domain signal collected by the near-end microphone and on the far-end time domain signal, respectively, to obtain a near-end frequency domain signal and a far-end frequency domain signal, and extracting signal characteristics from the near-end frequency domain signal and the far-end frequency domain signal, respectively;
S2: splicing the near-end frequency domain signal characteristics and the far-end frequency domain signal characteristics and inputting the spliced characteristics into an encoding frame, wherein the encoding frame comprises 3 two-dimensional convolutional layers, and a batch regularization layer and a PReLU layer which are sequentially positioned behind each two-dimensional convolutional layer;
S3: inputting the signal characteristics output by the encoding frame into a 1-layer frequency GRU layer and a 1-layer time GRU layer in sequence;
S4: inputting the signal characteristics output by the time GRU layer into a decoding frame, wherein the decoding frame comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers in the encoding frame, and the batch regularization layer and the PReLU layer which are sequentially positioned after each transposed convolutional layer;
S5: outputting the speech time domain signal after performing optimization calculation on the output result of the decoding frame.
In short, the above description is only a preferred embodiment of the present disclosure, and is not intended to limit the scope of the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present specification shall be included in the protection scope of the present specification.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an electronic data carrier device, a gaming console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus comprising the element.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
Claims (10)
1. A single-channel voice echo cancellation method based on a deep neural network is characterized by comprising the following steps:
performing Fourier transform on the near-end time domain signal collected by the near-end microphone and on the far-end time domain signal, respectively, to obtain a near-end frequency domain signal and a far-end frequency domain signal, and extracting signal features from the near-end frequency domain signal and the far-end frequency domain signal, respectively;
splicing the near-end frequency domain signal characteristics and the far-end frequency domain signal characteristics and inputting the signals into an encoding frame, wherein the encoding frame comprises 3 two-dimensional convolutional layers, and a batch regularization layer and a PReLU layer which are sequentially positioned behind each two-dimensional convolutional layer;
inputting the signal characteristics output by the coding frame into a 1-layer frequency GRU layer and a 1-layer time GRU layer in sequence;
inputting the signal characteristics output by the time GRU layer into a decoding frame, wherein the decoding frame comprises a transposed convolution layer corresponding to the two-dimensional convolution layer in the coding frame, and the batch regularization layer and the PReLU layer which are sequentially positioned after each layer of the transposed convolution layer;
2. The deep neural network-based single-channel speech echo cancellation method of claim 1, wherein the convolution kernels of the two-dimensional convolutional layers are respectively […], […] and […], and the step sizes corresponding to the convolution kernels are respectively […], […] and […];
and/or the number of nodes of each of the frequency GRU layer and the time GRU layer is 32;
and/or the convolution kernels of the transposed convolution layers are respectively […], […] and […], and the step sizes corresponding to the convolution kernels are respectively […], […] and […];
and/or information is exchanged between the coding frame and the decoding frame through skip connections.
3. The deep neural network-based single-channel speech echo cancellation method according to claim 2, wherein the process of performing Fourier transform on the near-end time domain signal collected by the near-end microphone and on the far-end time domain signal, respectively, to obtain a near-end frequency domain signal and a far-end frequency domain signal, and extracting signal features from the near-end frequency domain signal and the far-end frequency domain signal, respectively, comprises:
performing Fourier transform on the near-end time domain signal and the far-end time domain signal, respectively: […], […]; wherein the number of Fourier transform points is 512;
obtaining the amplitude of the near-end frequency domain signal and the amplitude of the far-end frequency domain signal;
calculating and outputting the near-end frequency domain signal characteristics and the far-end frequency domain signal characteristics, each of which comprises 257 frequency points.
4. The deep neural network-based single-channel speech echo cancellation method according to claim 3, wherein the working process of splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting the spliced signals into the coding framework includes:
splicing the near-end frequency domain signal characteristics and the far-end frequency domain signal characteristics to form first spliced frequency domain signal characteristics, wherein the first spliced frequency domain signal characteristics comprise 514 frequency points;
inputting the first spliced frequency domain signal characteristics into the coding frame, and calculating to obtain first frequency domain signal characteristics according to the number of the two-dimensional convolution layers and the corresponding batch regularization layer and the PReLU layer;
outputting the first frequency-domain signal characteristic.
5. The deep neural network-based single-channel speech echo cancellation method according to claim 4, wherein the process of inputting the signal features output by the time GRU layer into the decoding frame comprises:
passing the first frequency domain signal characteristics sequentially through the frequency GRU layer and the time GRU layer to obtain second frequency domain signal characteristics, inputting the second frequency domain signal characteristics into the decoding frame, and, according to the number of the transposed convolution layers, passing them sequentially through each transposed convolution layer and its corresponding batch regularization layer and PReLU layer to calculate third frequency domain signal characteristics;
and outputting the third frequency domain signal characteristics.
6. The deep neural network-based single-channel speech echo cancellation method according to claim 5, wherein the process of outputting the speech time domain signal after performing optimization calculation on the output result of the decoding frame comprises:
obtaining a corresponding complex ideal ratio mask according to the third frequency domain signal characteristics;
applying the real part and the imaginary part of the complex ideal ratio mask to the real part and the imaginary part of the near-end frequency domain signal, and calculating the optimized near-end frequency domain signal, whose real part is […] and whose imaginary part is […];
7. The deep neural network-based single-channel speech echo cancellation method according to claim 6, wherein the loss function used for neural network training during the speech echo cancellation process is […], wherein […], where V is the frequency domain representation of the clean near-end speech signal after Fourier transform, V_r is the real part of V, and V_i is the imaginary part of V;
the optimizer used is an Adam optimizer with a learning rate of 0.001, and the learning rate is adjusted according to a preset number of optimization rounds and/or the optimization verification result.
8. A single-channel voice echo cancellation device based on a deep neural network is characterized by comprising the following components:
a first module, capable of performing Fourier transform on the near-end time domain signal acquired by the near-end microphone and on the far-end time domain signal, respectively, to obtain a near-end frequency domain signal and a far-end frequency domain signal, and of extracting signal characteristics from the near-end frequency domain signal and the far-end frequency domain signal, respectively;
a second module, capable of splicing the near-end frequency domain signal characteristics and the far-end frequency domain signal characteristics and inputting the spliced characteristics into a coding frame, wherein the coding frame comprises 3 two-dimensional convolutional layers, and a batch regularization layer and a PReLU layer which are sequentially positioned behind each two-dimensional convolutional layer;
a third module, capable of inputting the signal characteristics output by the coding frame into a 1-layer frequency GRU layer and a 1-layer time GRU layer in sequence;
a fourth module, capable of inputting the signal characteristics output by the time GRU layer into a decoding frame, the decoding frame comprising transposed convolutional layers corresponding to the two-dimensional convolutional layers of the coding frame, and the batch regularization layer and the PReLU layer sequentially located after each of the transposed convolutional layers;
9. An electronic device, comprising:
a processor; and
a memory arranged to store computer executable instructions that when executed cause the processor to perform the deep neural network based single channel speech echo cancellation method of any one of claims 1 to 7.
10. A computer-readable storage medium storing one or more programs which, when executed by an electronic device including a plurality of application programs, cause the electronic device to perform the deep neural network-based single-channel speech echo cancellation method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211482692.9A CN115565543B (en) | 2022-11-24 | 2022-11-24 | Single-channel voice echo cancellation method and device based on deep neural network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211482692.9A CN115565543B (en) | 2022-11-24 | 2022-11-24 | Single-channel voice echo cancellation method and device based on deep neural network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115565543A (en) | 2023-01-03 |
CN115565543B (en) | 2023-04-07 |
Family
ID=84770233
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211482692.9A Active CN115565543B (en) | 2022-11-24 | 2022-11-24 | Single-channel voice echo cancellation method and device based on deep neural network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115565543B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN118471244A (en) * | 2023-02-07 | 2024-08-09 | 抖音视界有限公司 | Method and device for processing voice signal and electronic equipment |
CN118411997A (en) * | 2024-07-04 | 2024-07-30 | 苏州大学 | Single-channel voice echo cancellation method based on time domain neural network |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11393487B2 (en) * | 2019-03-28 | 2022-07-19 | Samsung Electronics Co., Ltd. | System and method for acoustic echo cancelation using deep multitask recurrent neural networks |
TWI738532B (en) * | 2019-10-27 | 2021-09-01 | 英屬開曼群島商意騰科技股份有限公司 | Apparatus and method for multiple-microphone speech enhancement |
CN113314140A (en) * | 2021-05-31 | 2021-08-27 | 哈尔滨理工大学 | Sound source separation algorithm of end-to-end time domain multi-scale convolutional neural network |
CN113707167A (en) * | 2021-08-31 | 2021-11-26 | 北京地平线信息技术有限公司 | Training method and training device for residual echo suppression model |
CN114283830A (en) * | 2021-12-17 | 2022-04-05 | 南京工程学院 | Deep learning network-based microphone signal echo cancellation model construction method |
Also Published As
Publication number | Publication date |
---|---|
CN115565543A (en) | 2023-01-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN115565543B (en) | Single-channel voice echo cancellation method and device based on deep neural network | |
CN104427068B (en) | A kind of audio communication method and device | |
US9246545B1 (en) | Adaptive estimation of delay in audio systems | |
CN113611324B (en) | Method and device for suppressing environmental noise in live broadcast, electronic equipment and storage medium | |
CN113571080B (en) | Voice enhancement method, device, equipment and storage medium | |
CN111508519A (en) | Method and device for enhancing voice of audio signal | |
WO2022142984A1 (en) | Voice processing method, apparatus and system, smart terminal and electronic device | |
US9832299B2 (en) | Background noise reduction in voice communication | |
WO2024027295A1 (en) | Speech enhancement model training method and apparatus, enhancement method, electronic device, storage medium, and program product | |
US20230050519A1 (en) | Speech enhancement method and apparatus, device, and storage medium | |
CN114792524B (en) | Audio data processing method, apparatus, program product, computer device and medium | |
CN113113038B (en) | Echo cancellation method and device and electronic equipment | |
CN114333893A (en) | Voice processing method and device, electronic equipment and readable medium | |
CN109379501B (en) | Filtering method, device, equipment and medium for echo cancellation | |
CN113674752A (en) | Method and device for reducing noise of audio signal, readable medium and electronic equipment | |
CN111353258A (en) | Echo suppression method based on coding and decoding neural network, audio device and equipment | |
CN117219107A (en) | Training method, device, equipment and storage medium of echo cancellation model | |
CN113299308B (en) | Voice enhancement method and device, electronic equipment and storage medium | |
CN118525332A (en) | Audio processing apparatus and method for suppressing noise | |
CN116597854A (en) | Audio noise reduction model training method, equipment and storage medium | |
US11521637B1 (en) | Ratio mask post-filtering for audio enhancement | |
CN115938379A (en) | Single-channel voice echo cancellation method and device | |
CN115273880A (en) | Voice noise reduction method, model training method, device, equipment, medium and product | |
US11924367B1 (en) | Joint noise and echo suppression for two-way audio communication enhancement | |
CN116153282A (en) | Single-channel voice noise reduction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||