
CN115565543B - Single-channel voice echo cancellation method and device based on deep neural network - Google Patents


Info

Publication number
CN115565543B
Authority
CN
China
Prior art keywords
domain signal
layer
frequency domain
frequency
signal characteristics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211482692.9A
Other languages
Chinese (zh)
Other versions
CN115565543A (en)
Inventor
杨亮
顾骋
赵元军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
G Net Cloud Service Co Ltd
Original Assignee
G Net Cloud Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by G Net Cloud Service Co Ltd
Priority to CN202211482692.9A
Publication of CN115565543A
Application granted
Publication of CN115565543B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G10L2021/02082 Noise filtering the noise being echo, reverberation of the speech
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D30/00 Reducing energy consumption in communication networks
    • Y02D30/70 Reducing energy consumption in communication networks in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Biophysics (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)

Abstract

The invention discloses a single-channel voice echo cancellation method and device based on a deep neural network. The method comprises the following steps: computing near-end and far-end frequency-domain signals and extracting signal features from each; splicing the frequency-domain signal features and inputting the spliced features into an encoding framework, wherein the encoding framework comprises three two-dimensional convolutional layers, each followed by a batch normalization layer and a PReLU layer; inputting the signal features output by the encoding framework sequentially into one frequency GRU layer and one time GRU layer; inputting the signal features output by the time GRU layer into a decoding framework; and performing an optimization calculation on the output of the decoding framework and outputting a speech time-domain signal. By using an encoding-decoding framework and the correlation between frequency points of the time-frequency features, the invention achieves a good echo suppression effect with a small model, few parameters, and low computational cost.

Description

Single-channel voice echo cancellation method and device based on deep neural network
Technical Field
The present disclosure relates to the field of audio processing technologies, and in particular, to a method and an apparatus for single-channel speech echo cancellation based on a deep neural network, an electronic device, and a storage medium.
Background
In a remote audio and video conference communication system, when the near-end microphone is acoustically coupled with the loudspeaker, the microphone re-captures the voice signal played by the loudspeaker and transmits it back to the opposite end through the communication system, so that the opposite end hears its own voice as an echo. Echo seriously degrades the call quality of a conference system, so echo cancellation is important for high-quality remote real-time audio and video communication. Traditional echo cancellation methods based on signal processing face many technical challenges in practice, such as nonlinear echo and double talk. Currently disclosed echo cancellation methods based on deep neural networks have model structures that are not suitable for real-time inference, and their models are too large to run locally on a device at low power. Therefore, how to provide an echo cancellation technique with low computational cost and a small model on the basis of the deep neural network approach is a technical problem that urgently needs to be solved.
Disclosure of Invention
An object of the embodiments of the present specification is to provide a method, an apparatus, an electronic device, and a storage medium for single-channel speech echo cancellation based on a deep neural network, in order to solve the above problems.
In order to solve the above technical problem, the embodiments of the present specification are implemented as follows:
In a first aspect, a method for single-channel speech echo cancellation based on a deep neural network is provided, including:
performing a Fourier transform on each of the near-end time-domain signal collected by the near-end microphone and the far-end time-domain signal to obtain a near-end frequency-domain signal and a far-end frequency-domain signal, and extracting signal features from each of the near-end frequency-domain signal and the far-end frequency-domain signal;
splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting them into an encoding framework, wherein the encoding framework comprises three two-dimensional convolutional layers, each followed in sequence by a batch normalization layer and a PReLU layer;
inputting the signal features output by the encoding framework sequentially into one frequency GRU layer and one time GRU layer;
inputting the signal features output by the time GRU layer into a decoding framework, wherein the decoding framework comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers in the encoding framework, each followed in sequence by a batch normalization layer and a PReLU layer;
performing an optimization calculation on the output of the decoding framework and outputting a speech time-domain signal.
Further, the convolution kernel sizes of the three two-dimensional convolutional layers and their corresponding strides are as follows [kernel sizes and strides: formula images not recovered]; the second and third kernel sizes are identical, as are the second and third strides;
and/or the number of nodes of the frequency GRU layer and of the time GRU layer is 32;
and/or the convolution kernel sizes of the transposed convolutional layers and their corresponding strides are as follows [kernel sizes and strides: formula images not recovered]; the second and third kernel sizes are identical, as are the second and third strides;
and/or the encoding framework and the decoding framework exchange information through skip connections.
Further, the process of performing a Fourier transform on each of the near-end time-domain signal collected by the near-end microphone and the far-end time-domain signal to obtain a near-end frequency-domain signal and a far-end frequency-domain signal, and extracting signal features from each of them, includes:
performing a Fourier transform on each of the near-end time-domain signal and the far-end time-domain signal [transform equations: formula images not recovered], wherein the number of Fourier transform points is 512;
obtaining the magnitude of the near-end frequency-domain signal and the magnitude of the far-end frequency-domain signal;
calculating and outputting the near-end frequency-domain signal features and the far-end frequency-domain signal features, wherein the near-end frequency-domain signal features and the far-end frequency-domain signal features each comprise 257 frequency points.
Further, the process of splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting the spliced features into the encoding framework includes:
splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features to form first spliced frequency-domain signal features, wherein the first spliced frequency-domain signal features comprise 514 frequency points;
inputting the first spliced frequency-domain signal features into the encoding framework, and passing them in sequence through each two-dimensional convolutional layer and its corresponding batch normalization layer and PReLU layer to compute first frequency-domain signal features;
outputting the first frequency-domain signal features.
Further, the process of inputting the signal features output by the time GRU layer into the decoding framework includes:
passing the first frequency-domain signal features through the frequency GRU layer and the time GRU layer in sequence to obtain second frequency-domain signal features, inputting the second frequency-domain signal features into the decoding framework, and passing them in sequence through each transposed convolutional layer and its corresponding batch normalization layer and PReLU layer to compute third frequency-domain signal features;
outputting the third frequency-domain signal features.
Further, the process of performing an optimization calculation on the output of the decoding framework and outputting the speech time-domain signal includes:
obtaining a corresponding complex ideal ratio mask from the third frequency-domain signal features;
applying the real part and the imaginary part of the complex ideal ratio mask to the real part and the imaginary part of the near-end frequency-domain signal, and calculating the real part and the imaginary part of the optimized near-end frequency-domain signal [mask application equations: formula images not recovered];
performing an inverse Fourier transform on the optimized near-end frequency-domain signal to output the near-end time-domain signal.
Further, the method includes training the neural network used in the voice echo cancellation process, wherein the loss function used is as follows [loss function and its component terms: formula images not recovered], where V is the frequency-domain representation of the clean near-end speech signal after Fourier transform, V_r is the real part of V, and V_i is the imaginary part of V; and/or,
the optimizer used is an Adam optimizer with a learning rate of 0.001, and the learning rate is adjusted according to a preset number of training rounds and/or the validation results.
In a second aspect, a single-channel speech echo cancellation device based on a deep neural network is provided, including:
a first module, configured to perform a Fourier transform on each of the near-end time-domain signal collected by the near-end microphone and the far-end time-domain signal to obtain a near-end frequency-domain signal and a far-end frequency-domain signal, and to extract signal features from each of the near-end frequency-domain signal and the far-end frequency-domain signal;
a second module, configured to splice the near-end frequency-domain signal features and the far-end frequency-domain signal features and input them into an encoding framework, wherein the encoding framework comprises three two-dimensional convolutional layers, each followed in sequence by a batch normalization layer and a PReLU layer;
a third module, configured to input the signal features output by the encoding framework sequentially into one frequency GRU layer and one time GRU layer;
a fourth module, configured to input the signal features output by the time GRU layer into a decoding framework, wherein the decoding framework comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers of the encoding framework, each followed in sequence by a batch normalization layer and a PReLU layer;
a fifth module, configured to perform an optimization calculation on the output of the decoding framework and output a speech time-domain signal.
In a third aspect, a computer device is provided, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when a computer device is running, the machine-readable instructions when executed by the processor performing the method of the first aspect.
In a fourth aspect, a computer-readable storage medium is proposed, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the method of the first aspect.
The specification can at least achieve the following technical effects:
the scheme of the invention is based on a U-shaped network structure of a coding-decoding frame, makes full use of the correlation among frequency points in time-frequency characteristics, has the characteristics of small model scale, less parameter quantity, low performance consumption and the like, can run on local equipment in real time, and achieves better echo suppression effect.
Drawings
In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present specification, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort.
Fig. 1 is a schematic diagram of a deep neural network-based single-channel speech echo cancellation method provided in an embodiment of the present disclosure.
Fig. 2 is a second schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.
Fig. 3 is a third schematic diagram of a deep neural network-based single-channel speech echo cancellation method provided in an embodiment of the present disclosure.
Fig. 4 is a fourth schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.
Fig. 5 is a fifth schematic diagram of a deep neural network-based single-channel speech echo cancellation method provided in an embodiment of the present specification.
Fig. 6 is a sixth schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.
Fig. 7 is a seventh schematic diagram of a deep neural network-based single-channel speech echo cancellation method provided in an embodiment of the present disclosure.
Fig. 8 is an eighth schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.
Fig. 9 is a ninth schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.
Fig. 10 is a schematic diagram of a deep neural network-based single-channel speech echo cancellation device provided in an embodiment of the present specification.
Fig. 11 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification.
Detailed Description
In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without making any creative effort shall fall within the protection scope of the present specification.
The following describes a detailed description of a deep neural network-based single-channel speech echo cancellation scheme by using a specific example.
Example one
The invention aims to overcome the shortcomings of existing echo cancellation algorithms based on deep neural networks, namely complex models, large model scale, and failure to consider the correlation among frequency points, and provides an echo cancellation scheme that has low computational cost, runs in real time, uses few parameters, and performs well in complex acoustic environments. Acoustic echo usually occurs in voice communication terminals. Because nonlinear processing, such as lossy vocoding and transcoding, is applied in the communication channel, the acoustic echo has to be processed locally on the device; and because the computing resources local to the device are usually limited, this places strict demands on the complexity of the echo cancellation algorithm. In addition, real-time conferencing systems allow only limited processing delay, and the devices may be placed in noisy environments; together with the nonlinear distortion that loudspeakers may introduce, these factors present challenges to echo cancellation methods. Referring to fig. 1, in a real-time audio-video communication system, the far-end signal is played through the near-end loudspeaker and then re-captured by the near-end microphone. Considering that near-end speech and ambient noise may be present at the near end at the same time, the signal picked up by the near-end microphone can be expressed in terms of the far-end signal, the near-end speech, and the ambient noise [equation: formula image not recovered]. Here, the echo path is the path from playback of the far-end signal to its capture by the near-end microphone, the operator between the signals denotes convolution, and the delay is that of the re-captured far-end signal relative to the original far-end signal. The aim of the echo cancellation system is to obtain, given the known near-end microphone signal and the far-end reference signal, an estimate of the clean near-end speech signal that is as accurate as possible. Therefore, the technical idea of the scheme of the invention is to make full use, under an encoding-decoding framework, of the advantages of the recurrent neural network, particularly the gated recurrent unit, in time-series processing and of the convolutional neural network in feature extraction, thereby achieving a better echo cancellation effect.
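The microphone signal model described above can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the patent's implementation: it assumes the usual additive model (delayed far-end signal convolved with the echo path, plus near-end speech, plus noise), and the names `far_end`, `echo_path`, `delay`, `near_speech`, and `noise` are hypothetical stand-ins for the symbols that appear only as formula images in the original. The toy values are arbitrary.

```python
def convolve(signal, kernel):
    """Full linear convolution of two real-valued sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def microphone_signal(far_end, echo_path, delay, near_speech, noise):
    """Assumed model: mic = (delayed far-end * echo path) + near-end speech + noise."""
    delayed = [0.0] * delay + list(far_end)      # far-end signal delayed by `delay` samples
    echo = convolve(delayed, echo_path)          # echo produced by the loudspeaker-room-microphone path
    n = len(near_speech)
    echo = (echo + [0.0] * n)[:n]                # align the echo to the near-end frame length
    return [e + s + v for e, s, v in zip(echo, near_speech, noise)]


# Toy example: a single far-end impulse, a two-tap echo path, and a 2-sample delay.
far_end = [1.0, 0.0, 0.0, 0.0]
echo_path = [0.5, 0.25]
near_speech = [0.1] * 6
noise = [0.0] * 6
mic = microphone_signal(far_end, echo_path, 2, near_speech, noise)
```

In this toy case the echo shows up in `mic` two samples late, on top of the constant near-end speech; an echo canceller would aim to recover `near_speech` from `mic` and `far_end`.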
Referring to fig. 2, a schematic diagram of a single-channel speech echo cancellation scheme based on a deep neural network according to an embodiment of the present invention is shown. The echo cancellation method disclosed by the invention is carried out under a U-shaped network structure consisting of an encoding-decoding framework. First, the near-end signal and the far-end signal are each converted from the time domain to the frequency domain, and features are extracted; second, the features are passed in sequence through four parts, namely the encoding framework, the frequency GRU layer, the time GRU layer, and the decoding framework; finally, the output of the decoding framework is converted into a speech time-domain signal. This framework combines the advantages of the recurrent neural network and the convolutional neural network while exploiting the correlation among frequency points in the spectral features. It should be noted that time-domain signals in the embodiments of the present specification are represented by lower-case letters, and the corresponding frequency-domain signals are represented by upper-case letters. Referring to fig. 3, a schematic diagram of a method for single-channel speech echo cancellation based on a deep neural network in an embodiment of the present invention is shown, where the method includes:
S1: Performing a Fourier transform on each of the near-end time-domain signal collected by the near-end microphone and the far-end time-domain signal to obtain a near-end frequency-domain signal and a far-end frequency-domain signal, and extracting signal features from each of the near-end frequency-domain signal and the far-end frequency-domain signal.
Optionally, the process of performing a Fourier transform on each of the near-end time-domain signal collected by the near-end microphone and the far-end time-domain signal to obtain a near-end frequency-domain signal and a far-end frequency-domain signal, and extracting signal features from each of them, is shown in fig. 4 and includes:
S111: Performing a Fourier transform on each of the near-end time-domain signal and the far-end time-domain signal [transform equations: formula images not recovered], wherein the number of Fourier transform points is 512.
S112: Obtaining the magnitude of the near-end frequency-domain signal and the magnitude of the far-end frequency-domain signal.
S113: Calculating and outputting the near-end frequency-domain signal features and the far-end frequency-domain signal features, wherein the near-end frequency-domain signal features and the far-end frequency-domain signal features each comprise 257 frequency points.
It should be noted that the FFT (fast Fourier transform) is an efficient algorithm for computing the discrete Fourier transform (DFT) on a computer. Because the DFT of a real-valued signal is conjugate symmetric, a 512-point FFT of real samples has only 512/2 + 1 = 257 non-redundant frequency points: the direct-current component (frequency 0), 255 positive-frequency components, and the Nyquist component. The remaining 255 points are complex conjugates of retained points and carry no additional information, which is why each feature comprises 257 frequency points.
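The 257-point count can be checked numerically. The following is a minimal sketch that uses a naive O(N²) DFT from the standard library in place of a real FFT routine, purely to avoid third-party dependencies; the test frame (two sinusoids) and all names are illustrative assumptions, not part of the patent.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform (O(N^2)); an FFT returns the same values faster."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


N = 512  # number of Fourier transform points stated in the text
frame = [
    math.sin(2 * math.pi * 5 * t / N) + 0.5 * math.cos(2 * math.pi * 40 * t / N)
    for t in range(N)
]
spectrum = dft(frame)

# For a real input, bin k and bin N-k are complex conjugates of each other.
max_err = max(abs(spectrum[k] - spectrum[N - k].conjugate()) for k in range(1, N))

# Keep only the non-redundant half: bins 0 (DC) through N/2 (Nyquist).
one_sided = spectrum[: N // 2 + 1]
magnitudes = [abs(c) for c in one_sided]
print(len(one_sided), max_err < 1e-6)
```

The one-sided spectrum has `N // 2 + 1` entries, and the magnitude of each retained bin is the kind of per-frequency-point quantity that step S112 refers to.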
S2: Splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting them into an encoding framework, wherein the encoding framework comprises three two-dimensional convolutional layers, each followed in sequence by a batch normalization layer and a PReLU layer. Fig. 5 is a schematic diagram of the encoding framework.
Optionally, the convolution kernel sizes of the three two-dimensional convolutional layers and their corresponding strides are as follows [kernel sizes and strides: formula images not recovered]; the second and third kernel sizes are identical, as are the second and third strides.
Optionally, the process of splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting the spliced features into the encoding framework is shown in fig. 6 and includes:
S211: Splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features to form first spliced frequency-domain signal features, wherein the first spliced frequency-domain signal features comprise 514 frequency points.
S212: Inputting the first spliced frequency-domain signal features into the encoding framework, and passing them in sequence through each two-dimensional convolutional layer and its corresponding batch normalization layer and PReLU layer to compute first frequency-domain signal features.
S213: Outputting the first frequency-domain signal features.
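The splicing step can be sketched as a simple concatenation. This is an illustrative assumption: the text states only that the spliced features comprise 514 frequency points, which is consistent with concatenating the two 257-point features along the frequency axis for each frame. The names and placeholder values below are hypothetical.

```python
N_POINTS = 257  # frequency points per feature, as stated in the text


def splice_frames(near_frames, far_frames):
    """Concatenate near-end and far-end features along the frequency axis, frame by frame."""
    return [near + far for near, far in zip(near_frames, far_frames)]


# Two frames of placeholder magnitude features for each signal.
near_frames = [[0.0] * N_POINTS for _ in range(2)]
far_frames = [[1.0] * N_POINTS for _ in range(2)]
spliced = splice_frames(near_frames, far_frames)
print(len(spliced), len(spliced[0]))
```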
S3: Inputting the signal features output by the encoding framework sequentially into one frequency GRU layer and one time GRU layer. Optionally, the number of nodes of the frequency GRU layer and of the time GRU layer is 32.
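One way to picture the two GRU layers is as the same recurrent cell scanned along two different axes of the encoder output. The sketch below is an assumption about the data flow, not the patent's implementation: the text does not specify how the feature map is arranged for each GRU, so this uses a common arrangement in which the frequency GRU scans across frequency points within each frame and the time GRU scans across frames for each frequency point. The cell is a textbook GRU without bias terms, with random placeholder weights; `GRUCell`, `frequency_then_time_gru`, and the toy dimensions are hypothetical names and values.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(row, vec):
    return sum(w * v for w, v in zip(row, vec))


class GRUCell:
    """Textbook GRU cell (no biases); weights are random placeholders, not trained values."""

    def __init__(self, n_in, n_hidden, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.n_hidden = n_hidden
        self.w_z, self.w_r, self.w_h = (mat(n_hidden, n_in) for _ in range(3))
        self.u_z, self.u_r, self.u_h = (mat(n_hidden, n_hidden) for _ in range(3))

    def step(self, x, h):
        z = [sigmoid(dot(w, x) + dot(u, h)) for w, u in zip(self.w_z, self.u_z)]  # update gate
        r = [sigmoid(dot(w, x) + dot(u, h)) for w, u in zip(self.w_r, self.u_r)]  # reset gate
        gated = [ri * hi for ri, hi in zip(r, h)]
        cand = [math.tanh(dot(w, x) + dot(u, gated)) for w, u in zip(self.w_h, self.u_h)]
        return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]

    def run(self, sequence):
        """Scan a sequence of input vectors, returning the hidden state at every step."""
        h = [0.0] * self.n_hidden
        outputs = []
        for x in sequence:
            h = self.step(x, h)
            outputs.append(h)
        return outputs


def frequency_then_time_gru(features, freq_gru, time_gru):
    """features[t][f] is the channel vector for frame t and frequency point f."""
    n_frames, n_points = len(features), len(features[0])
    # Frequency GRU: within each frame, scan across the frequency points.
    freq_out = [freq_gru.run(frame) for frame in features]
    # Time GRU: for each frequency point, scan across the frames.
    per_point = [
        time_gru.run([freq_out[t][f] for t in range(n_frames)]) for f in range(n_points)
    ]
    return [[per_point[f][t] for f in range(n_points)] for t in range(n_frames)]


rng = random.Random(0)
HIDDEN = 32  # number of GRU nodes stated in the text
# Toy encoder output: 3 frames, 5 frequency points, 4 channels.
features = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)] for _ in range(3)]
output = frequency_then_time_gru(features, GRUCell(4, HIDDEN, rng), GRUCell(HIDDEN, HIDDEN, rng))
```

Scanning along the frequency axis is what lets the recurrent state carry information from one frequency point to the next, which is how a layer of this kind can exploit the correlation among frequency points that the text emphasizes.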
S4: Inputting the signal features output by the time GRU layer into a decoding framework, wherein the decoding framework comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers in the encoding framework, each followed in sequence by a batch normalization layer and a PReLU layer. Fig. 7 is a schematic diagram of the decoding framework.
Optionally, the convolution kernel sizes of the transposed convolutional layers and their corresponding strides are as follows [kernel sizes and strides: formula images not recovered]; the second and third kernel sizes are identical, as are the second and third strides. Optionally, the encoding framework and the decoding framework exchange information through skip connections.
Optionally, the process of inputting the signal features output by the time GRU layer into the decoding framework is shown in fig. 8 and includes:
S411: Passing the first frequency-domain signal features through the frequency GRU layer and the time GRU layer in sequence to obtain second frequency-domain signal features, inputting the second frequency-domain signal features into the decoding framework, and passing them in sequence through each transposed convolutional layer and its corresponding batch normalization layer and PReLU layer to compute third frequency-domain signal features.
S412: Outputting the third frequency-domain signal features.
S5: Perform an optimization calculation on the output of the decoding framework and output the speech time-domain signal [formula image 005].
Optionally, the process of performing the optimization calculation on the output of the decoding framework and outputting the speech time-domain signal [formula image 005] is shown in Fig. 9 and includes:
S511: Obtain the corresponding complex ideal ratio mask from the third frequency-domain signal features.
S512: Apply the real part [formula image 015] and the imaginary part [formula image 016] of the complex ideal ratio mask [formula image 014] to the real part [formula image 017] and the imaginary part [formula image 018] of the near-end frequency-domain signal [formula image 003], and compute the real part [formula image 020] = [formula image 021] and the imaginary part [formula image 022] = [formula image 023] of the optimized near-end frequency-domain signal [formula image 019].
S513: Perform an inverse Fourier transform on the optimized near-end frequency-domain signal [formula image 019] and output the near-end time-domain signal [formula image 005].
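Steps S512 and S513 can be sketched as follows. The formulas for the real and imaginary parts exist only as images in this text, so the sketch assumes the usual definition of a complex ratio mask, namely complex multiplication of the mask with the near-end spectrum. The function names and the variables `m_r`, `m_i`, `y_r`, `y_i` are hypothetical, the 512-point transform with 257 bins follows the figures stated in the claims, and framing, windowing and overlap-add are left out.

```python
import numpy as np


def apply_complex_mask(mask, near_spec):
    """Apply a complex ratio mask to a near-end spectrum (assumed: complex multiplication)."""
    m_r, m_i = mask.real, mask.imag
    y_r, y_i = near_spec.real, near_spec.imag
    s_r = m_r * y_r - m_i * y_i  # real part of the optimized spectrum
    s_i = m_r * y_i + m_i * y_r  # imaginary part of the optimized spectrum
    return s_r + 1j * s_i


def frame_to_time(spec_frame, n_fft=512):
    """Inverse real FFT of one 257-bin frame back to n_fft time-domain samples."""
    return np.fft.irfft(spec_frame, n=n_fft)
```

With an all-ones mask the near-end frame is returned unchanged, which is a convenient sanity check before plugging in a mask predicted by the network.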
Optionally, the loss function used to train the neural network for speech echo cancellation is [formula image 024], wherein [formula image 025], [formula image 026], [formula image 027] and [formula image 028]. Here V is a complex number, the frequency-domain representation of the clean near-end speech signal after Fourier transform; V_r is the real part of V and V_i is the imaginary part of V. The loss function measures the difference between the near-end speech signal predicted by the neural network and the clean near-end speech signal; and/or,
the optimizer used is the Adam optimizer with a learning rate of 0.001, and the learning rate is adjusted according to a preset number of optimization rounds and/or the validation result. Specifically, when the validation result has not improved for 4 consecutive rounds of optimization, the learning rate can be halved, after which one or more further rounds of optimization are performed.
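The learning-rate rule described above can be sketched in a few lines. This is a minimal illustration that assumes the "validation result" is a validation loss where lower is better; the class name `HalveOnPlateau` and its fields are hypothetical and not taken from the patent.

```python
class HalveOnPlateau:
    """Halve the learning rate when the validation loss has not improved for `patience` rounds."""

    def __init__(self, lr=0.001, patience=4, factor=0.5):
        self.lr = lr                  # initial learning rate stated in the text
        self.patience = patience      # rounds without improvement before halving
        self.factor = factor          # 0.5 halves the learning rate
        self.best = float("inf")
        self.stale_rounds = 0

    def step(self, val_loss):
        """Record one round's validation loss and return the learning rate for the next round."""
        if val_loss < self.best:
            self.best = val_loss
            self.stale_rounds = 0
        else:
            self.stale_rounds += 1
            if self.stale_rounds >= self.patience:
                self.lr *= self.factor
                self.stale_rounds = 0
        return self.lr
```

In a training loop, `step` would be called once per round after validation, and the returned value written into the optimizer before the next round starts.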
Example two
Fig. 10 is a schematic structural diagram of a deep neural network-based single-channel speech echo cancellation device 1000 according to an embodiment of the present specification. Referring to fig. 10, in an embodiment, a deep neural network-based single channel speech echo cancellation apparatus 1000 includes:
a first module 1001, configured to perform Fourier transforms on the near-end time-domain signal [formula image 001] collected by the near-end microphone and on the far-end time-domain signal [formula image 002], respectively, to obtain the near-end frequency-domain signal [formula image 003] and the far-end frequency-domain signal [formula image 004], and to extract signal features from the near-end frequency-domain signal [formula image 003] and the far-end frequency-domain signal [formula image 004], respectively;
a second module 1002, configured to splice the near-end frequency domain signal features and the far-end frequency domain signal features and input the spliced features to an encoding framework, where the encoding framework includes 3 two-dimensional convolutional layers, and a batch regularization layer and a PReLU layer sequentially located after each two-dimensional convolutional layer;
a third module 1003, configured to sequentially input the signal features output by the coding framework into a 1-layer frequency GRU layer and a 1-layer time GRU layer;
a fourth module 1004, configured to input the signal features output by the time GRU layer into a decoding framework, the decoding framework comprising transposed convolutional layers corresponding to the two-dimensional convolutional layers of the encoding framework, and the batch regularization layer and the PReLU layer sequentially located after each of the transposed convolutional layers;
a fifth module 1005, configured to perform an optimization calculation on the output of the decoding framework and output the speech time-domain signal [formula image 005].
It should be understood that, in the embodiment of the present specification, the deep neural network-based single-channel speech echo cancellation device may further perform the method performed by the deep neural network-based single-channel speech echo cancellation device (or apparatus) in fig. 1 to 9, and implement the functions of the deep neural network-based single-channel speech echo cancellation device (or apparatus) in the examples shown in fig. 1 to 9, which are not described herein again.
Example three
Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present specification. Referring to Fig. 11, at the hardware level, the electronic device includes a processor, and optionally further includes an internal bus, a network interface, and a memory. The memory may include internal memory, such as Random-Access Memory (RAM), and may further include non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.
The processor, the network interface, and the memory may be connected to each other via an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 11, but that does not indicate only one bus or one type of bus.
The memory is used to store programs. In particular, a program may include program code comprising computer operating instructions. The memory may include both internal memory and non-volatile memory, and provides instructions and data to the processor.
The processor reads the corresponding computer program from the non-volatile memory into internal memory and then runs it, forming the deep neural network-based single-channel speech echo cancellation device at the logical level. The processor executes the program stored in the memory and is specifically configured to perform the following operations:
S1: Perform Fourier transforms on the near-end time-domain signal [formula image 001] collected by the near-end microphone and on the far-end time-domain signal [formula image 002], respectively, to obtain the near-end frequency-domain signal [formula image 003] and the far-end frequency-domain signal [formula image 004], and extract signal features from the near-end frequency-domain signal [formula image 003] and the far-end frequency-domain signal [formula image 004], respectively;
S2: Splice the near-end frequency-domain signal features and the far-end frequency-domain signal features and input them into an encoding framework, wherein the encoding framework comprises 3 two-dimensional convolutional layers, each followed in sequence by a batch regularization layer and a PReLU layer;
S3: Sequentially input the signal features output by the encoding framework into one frequency GRU layer and one time GRU layer;
S4: Input the signal features output by the time GRU layer into a decoding framework, wherein the decoding framework comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers in the encoding framework, each followed in sequence by the batch regularization layer and the PReLU layer;
S5: Perform an optimization calculation on the output of the decoding framework and output the speech time-domain signal [formula image 005].
The deep neural network-based single-channel speech echo cancellation method disclosed in the embodiments shown in Fig. 1 to Fig. 9 of this specification may be applied to, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capability. In implementation, the steps of the method may be completed by integrated hardware logic circuits in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. Such a processor may implement or perform the methods, steps and logic blocks disclosed in the embodiments of this specification. A general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of this specification may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory; the processor reads the information in the memory and completes the steps of the method in combination with its hardware.
Of course, besides the software implementation, the electronic device of the embodiment of the present disclosure does not exclude other implementations, such as a logic device or a combination of software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or a logic device.
Example four
Embodiments of the present specification also propose a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by a portable electronic device comprising a plurality of application programs, are capable of causing the portable electronic device to perform the method of the embodiments shown in fig. 1 to 9, and in particular to perform the method of:
S1: Perform Fourier transforms on the near-end time-domain signal [formula image 001] collected by the near-end microphone and on the far-end time-domain signal [formula image 002], respectively, to obtain the near-end frequency-domain signal [formula image 003] and the far-end frequency-domain signal [formula image 004], and extract signal features from the near-end frequency-domain signal [formula image 003] and the far-end frequency-domain signal [formula image 004], respectively;
S2: Splice the near-end frequency-domain signal features and the far-end frequency-domain signal features and input them into an encoding framework, wherein the encoding framework comprises 3 two-dimensional convolutional layers, each followed in sequence by a batch regularization layer and a PReLU layer;
S3: Sequentially input the signal features output by the encoding framework into one frequency GRU layer and one time GRU layer;
S4: Input the signal features output by the time GRU layer into a decoding framework, wherein the decoding framework comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers in the encoding framework, each followed in sequence by the batch regularization layer and the PReLU layer;
S5: Perform an optimization calculation on the output of the decoding framework and output the speech time-domain signal [formula image 005].
In short, the above description is only a preferred embodiment of the present disclosure, and is not intended to limit the scope of the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present specification shall be included in the protection scope of the present specification.
The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an electronic data carrier device, a gaming console, a tablet computer, a wearable device, or a combination of any of these devices.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, Phase-change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus comprising the element.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

Claims (10)

1. A single-channel voice echo cancellation method based on a deep neural network is characterized by comprising the following steps:
performing Fourier transforms on the near-end time-domain signal [formula image QLYQS_1] collected by the near-end microphone and on the far-end time-domain signal [formula image QLYQS_2], respectively, to obtain the near-end frequency-domain signal [formula image QLYQS_3] and the far-end frequency-domain signal [formula image QLYQS_4], and extracting signal features from the near-end frequency-domain signal [formula image QLYQS_5] and the far-end frequency-domain signal [formula image QLYQS_6], respectively;
splicing the near-end frequency domain signal characteristics and the far-end frequency domain signal characteristics and inputting the signals into an encoding frame, wherein the encoding frame comprises 3 two-dimensional convolutional layers, and a batch regularization layer and a PReLU layer which are sequentially positioned behind each two-dimensional convolutional layer;
inputting the signal characteristics output by the coding frame into a 1-layer frequency GRU layer and a 1-layer time GRU layer in sequence;
inputting the signal characteristics output by the time GRU layer into a decoding frame, wherein the decoding frame comprises a transposed convolution layer corresponding to the two-dimensional convolution layer in the coding frame, and the batch regularization layer and the PReLU layer which are sequentially positioned after each layer of the transposed convolution layer;
performing an optimization calculation on the output of the decoding framework and outputting the speech time-domain signal [formula image QLYQS_7].
2. The deep neural network-based single-channel speech echo cancellation method of claim 1, wherein the convolution kernels of the two-dimensional convolutional layers are, respectively, [formula image QLYQS_8], [formula image QLYQS_9] and [formula image QLYQS_10], and the strides corresponding to the convolution kernels are, respectively, [formula image QLYQS_11], [formula image QLYQS_12] and [formula image QLYQS_13];
and/or the number of nodes of the frequency GRU layer and of the time GRU layer is 32;
and/or the convolution kernels of the transposed convolutional layers are, respectively, [formula image QLYQS_14], [formula image QLYQS_15] and [formula image QLYQS_16], and the strides corresponding to the convolution kernels are, respectively, [formula image QLYQS_17], [formula image QLYQS_18] and [formula image QLYQS_19];
and/or the encoding framework and the decoding framework exchange information through skip connections.
3. The deep neural network-based single-channel speech echo cancellation method according to claim 2, wherein the process of performing Fourier transforms on the near-end time-domain signal [formula image QLYQS_20] collected by the near-end microphone and on the far-end time-domain signal [formula image QLYQS_21], respectively, to obtain the near-end frequency-domain signal [formula image QLYQS_22] and the far-end frequency-domain signal [formula image QLYQS_23], and extracting signal features from the near-end frequency-domain signal [formula image QLYQS_24] and the far-end frequency-domain signal [formula image QLYQS_25], respectively, comprises:
performing a Fourier transform on the near-end time-domain signal [formula image QLYQS_26] and on the far-end time-domain signal [formula image QLYQS_27], respectively: [formula image QLYQS_28], [formula image QLYQS_29]; wherein the number of Fourier transform points is 512;
obtaining the amplitude [formula image QLYQS_31] of the near-end frequency-domain signal [formula image QLYQS_30] and the amplitude [formula image QLYQS_33] of the far-end frequency-domain signal [formula image QLYQS_32], and calculating and outputting the near-end frequency-domain signal features and the far-end frequency-domain signal features, wherein the near-end frequency-domain signal features and the far-end frequency-domain signal features each comprise 257 frequency points.
4. The deep neural network-based single-channel speech echo cancellation method according to claim 3, wherein the working process of splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting the spliced signals into the coding framework includes:
splicing the near-end frequency domain signal characteristics and the far-end frequency domain signal characteristics to form first spliced frequency domain signal characteristics, wherein the first spliced frequency domain signal characteristics comprise 514 frequency points;
inputting the first spliced frequency domain signal characteristics into the coding frame, and calculating to obtain first frequency domain signal characteristics according to the number of the two-dimensional convolution layers and the corresponding batch regularization layer and the PReLU layer;
outputting the first frequency-domain signal characteristic.
5. The deep neural network-based single-channel speech echo cancellation method according to claim 4, wherein the process of inputting the signal features output by the time GRU layer into the decoding framework comprises:
inputting into the decoding framework the second frequency-domain signal features, obtained by passing the first frequency-domain signal features sequentially through the frequency GRU layer and the time GRU layer, and calculating the third frequency-domain signal features by passing them, once per transposed convolutional layer, through that transposed convolutional layer and its corresponding batch regularization layer and PReLU layer;
and outputting the third frequency domain signal characteristic.
6. The deep neural network-based single-channel speech echo cancellation method according to claim 5, wherein the process of performing the optimization calculation on the output of the decoding framework and outputting the speech time-domain signal [formula image QLYQS_34] comprises:
obtaining a corresponding complex ideal ratio mask according to the third frequency-domain signal features;
applying the real part [formula image QLYQS_40] and the imaginary part [formula image QLYQS_42] of the complex ideal ratio mask [formula image QLYQS_37] to the real part [formula image QLYQS_39] and the imaginary part [formula image QLYQS_41] of the near-end frequency-domain signal [formula image QLYQS_36], and calculating the real part [formula image QLYQS_35] = [formula image QLYQS_38] and the imaginary part [formula image QLYQS_43] = [formula image QLYQS_45] of the optimized near-end frequency-domain signal [formula image QLYQS_44];
performing an inverse Fourier transform on the optimized near-end frequency-domain signal [formula image QLYQS_46] and outputting the near-end time-domain signal [formula image QLYQS_47].
7. The deep neural network-based single-channel speech echo cancellation method according to claim 6, further comprising training the neural network for speech echo cancellation using the loss function [formula image QLYQS_48], wherein [formula image QLYQS_49], [formula image QLYQS_50] and [formula image QLYQS_51], where V is the frequency-domain representation of the clean near-end speech signal after Fourier transform, V_r is the real part of V, and V_i is the imaginary part of V;
the optimizer used is the Adam optimizer with a learning rate of 0.001, and the learning rate is adjusted according to a preset number of optimization rounds and/or the validation result.
8. A single-channel voice echo cancellation device based on a deep neural network is characterized by comprising the following components:
a first module, configured to perform Fourier transforms on the near-end time-domain signal [formula image QLYQS_52] collected by the near-end microphone and on the far-end time-domain signal [formula image QLYQS_53], respectively, to obtain the near-end frequency-domain signal [formula image QLYQS_54] and the far-end frequency-domain signal [formula image QLYQS_55], and to extract signal features from the near-end frequency-domain signal [formula image QLYQS_56] and the far-end frequency-domain signal [formula image QLYQS_57], respectively;
the second module can splice the near-end frequency domain signal characteristics and the far-end frequency domain signal characteristics and input the signals into a coding frame, wherein the coding frame comprises 3 two-dimensional convolutional layers and a batch regularization layer and a PReLU layer which are sequentially positioned behind each two-dimensional convolutional layer;
the third module can input the signal characteristics output by the coding framework into a 1-layer frequency GRU layer and a 1-layer time GRU layer in sequence;
a fourth module capable of inputting the signal features output by the time GRU layer into a decoding framework, the decoding framework comprising transposed convolutional layers corresponding to the two-dimensional convolutional layers of the encoding framework, and the batch regularization layer and the PReLU layer sequentially located after each of the transposed convolutional layers;
a fifth module, capable of performing an optimization calculation on the output of the decoding framework and outputting the speech time-domain signal [formula image QLYQS_58].
9. An electronic device, comprising:
a processor; and
a memory arranged to store computer executable instructions that when executed cause the processor to perform the deep neural network based single channel speech echo cancellation method of any one of claims 1 to 7.
10. A computer-readable storage medium storing one or more programs which, when executed by an electronic device including a plurality of application programs, cause the electronic device to perform the deep neural network-based single-channel speech echo cancellation method of any one of claims 1 to 7.
CN202211482692.9A 2022-11-24 2022-11-24 Single-channel voice echo cancellation method and device based on deep neural network Active CN115565543B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211482692.9A CN115565543B (en) 2022-11-24 2022-11-24 Single-channel voice echo cancellation method and device based on deep neural network

Publications (2)

Publication Number Publication Date
CN115565543A CN115565543A (en) 2023-01-03
CN115565543B true CN115565543B (en) 2023-04-07

Family

ID=84770233

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211482692.9A Active CN115565543B (en) 2022-11-24 2022-11-24 Single-channel voice echo cancellation method and device based on deep neural network

Country Status (1)

Country Link
CN (1) CN115565543B (en)

Also Published As

Publication number Publication date
CN115565543A (en) 2023-01-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant