CN115565543B - Single-channel voice echo cancellation method and device based on deep neural network

Info

Publication number: CN115565543B
Application number: CN202211482692.9A
Authority: CN
Inventors: 杨亮; 顾骋; 赵元军
Original assignee: G Net Cloud Service Co Ltd
Current assignee: G Net Cloud Service Co Ltd
Priority date: 2022-11-24
Filing date: 2022-11-24
Publication date: 2023-04-07
Anticipated expiration: 2042-11-24
Also published as: CN115565543A

Abstract

The invention discloses a single-channel speech echo cancellation method and device based on a deep neural network. The method comprises: computing near-end and far-end frequency-domain signals and extracting signal features from each; concatenating the frequency-domain signal features and inputting them into an encoding framework, wherein the encoding framework comprises 3 two-dimensional convolutional layers, each followed by a batch normalization layer and a PReLU layer; inputting the signal features output by the encoding framework sequentially into one frequency GRU layer and one time GRU layer; inputting the signal features output by the time GRU layer into a decoding framework; and performing an optimization calculation on the output of the decoding framework and outputting a speech time-domain signal. The invention exploits the encoding-decoding framework and the correlation among frequency bins of the time-frequency features, has a small model size, few parameters, and low computational cost, and achieves a better echo suppression effect.

Description

Single-channel voice echo cancellation method and device based on deep neural network

Technical Field

The present disclosure relates to the field of audio processing technologies, and in particular, to a method and an apparatus for single-channel speech echo cancellation based on a deep neural network, an electronic device, and a storage medium.

Background

In a remote audio and video conferencing system, when the near-end microphone is acoustically coupled with the loudspeaker, the microphone re-captures the voice signal played by the loudspeaker and transmits it back to the opposite end through the communication system, so that the opposite end hears its own voice as an echo. Echo seriously degrades the call quality of a conferencing system, so echo cancellation technology is of great importance for high-quality remote real-time audio and video communication. Traditional echo cancellation methods based on signal processing face many technical challenges in practice, such as nonlinear echo and double talk, while the echo cancellation methods based on deep neural networks disclosed so far have model structures that are unsuitable for real-time inference and model sizes too large for low-power operation on the local device. Therefore, how to provide an echo cancellation technique with low computational cost and a small model size on the basis of a deep neural network is a technical problem that urgently needs to be solved.

Disclosure of Invention

An object of the embodiments of the present specification is to provide a method, an apparatus, an electronic device, and a storage medium for single-channel speech echo cancellation based on a deep neural network, in order to solve the above problems.

In order to solve the above technical problem, the embodiments of the present specification are implemented as follows:

In a first aspect, a method for single-channel speech echo cancellation based on a deep neural network is provided, including:

performing Fourier transforms on a near-end time-domain signal collected by a near-end microphone and on a far-end time-domain signal, respectively, to obtain a near-end frequency-domain signal and a far-end frequency-domain signal, and extracting signal features from the near-end frequency-domain signal and the far-end frequency-domain signal, respectively;

concatenating the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting them into an encoding framework, wherein the encoding framework comprises 3 two-dimensional convolutional layers, each followed in sequence by a batch normalization layer and a PReLU layer;

inputting the signal features output by the encoding framework sequentially into one frequency GRU layer and one time GRU layer;

inputting the signal features output by the time GRU layer into a decoding framework, wherein the decoding framework comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers in the encoding framework, each transposed convolutional layer being followed in sequence by a batch normalization layer and a PReLU layer;

performing an optimization calculation on the output of the decoding framework and outputting a speech time-domain signal.

Further, the convolution kernels of the two-dimensional convolutional layers are respectively [three kernel sizes, missing in source], and the strides corresponding to the convolution kernels are respectively [three strides, missing in source];

and/or the number of nodes of the frequency GRU layer and of the time GRU layer is 32;

and/or the convolution kernels of the transposed convolutional layers are respectively [three kernel sizes, missing in source], and the strides corresponding to the convolution kernels are respectively [three strides, missing in source];

and/or the encoding framework and the decoding framework exchange information through skip connections.

Further, the process of performing Fourier transforms on the near-end time-domain signal collected by the near-end microphone and on the far-end time-domain signal, respectively, to obtain the near-end frequency-domain signal and the far-end frequency-domain signal, and extracting signal features from each, comprises:

performing a Fourier transform on the near-end time-domain signal and on the far-end time-domain signal, respectively [transform equations missing in source], wherein the number of Fourier transform points is 512;

obtaining the magnitude of the near-end frequency-domain signal and the magnitude of the far-end frequency-domain signal;

and calculating and outputting the near-end frequency-domain signal features and the far-end frequency-domain signal features, each of which comprises 257 frequency bins.

Further, the process of concatenating the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting them into the encoding framework comprises:

concatenating the near-end frequency-domain signal features and the far-end frequency-domain signal features to form first concatenated frequency-domain signal features, which comprise 514 frequency bins;

inputting the first concatenated frequency-domain signal features into the encoding framework and, according to the number of two-dimensional convolutional layers, passing them sequentially through each two-dimensional convolutional layer and its corresponding batch normalization layer and PReLU layer to obtain first frequency-domain signal features;

outputting the first frequency-domain signal features.

Further, the process of inputting the signal features output by the time GRU layer into the decoding framework comprises:

passing the first frequency-domain signal features sequentially through the frequency GRU layer and the time GRU layer to obtain second frequency-domain signal features, inputting the second frequency-domain signal features into the decoding framework, and, according to the number of transposed convolutional layers, passing them sequentially through each transposed convolutional layer and its corresponding batch normalization layer and PReLU layer to obtain third frequency-domain signal features;

and outputting the third frequency-domain signal features.

Further, the process of performing an optimization calculation on the output of the decoding framework and outputting a speech time-domain signal comprises:

obtaining a corresponding complex ideal ratio mask from the third frequency-domain signal features;

applying the real part and the imaginary part of the complex ideal ratio mask to the real part and the imaginary part of the near-end frequency-domain signal, and calculating the real part and the imaginary part of the optimized near-end frequency-domain signal [equations missing in source];

performing an inverse Fourier transform on the optimized near-end frequency-domain signal to output a near-end time-domain signal.

Further, the method comprises training the neural network used in the speech echo cancellation process, wherein the loss function used is [loss function equation missing in source], where V is the frequency-domain representation of the clean near-end speech signal after Fourier transform, V_r is the real part of V, and V_i is the imaginary part of V; and/or

the optimizer used is an Adam optimizer with a learning rate of 0.001, and the learning rate is adjusted according to a preset number of optimization rounds and/or the optimization validation result.

In a second aspect, a single-channel speech echo cancellation device based on a deep neural network is provided, including:

a first module, capable of performing Fourier transforms on a near-end time-domain signal collected by a near-end microphone and on a far-end time-domain signal, respectively, to obtain a near-end frequency-domain signal and a far-end frequency-domain signal, and extracting signal features from the near-end frequency-domain signal and the far-end frequency-domain signal, respectively;

a second module, capable of concatenating the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting them into an encoding framework, wherein the encoding framework comprises 3 two-dimensional convolutional layers, each followed in sequence by a batch normalization layer and a PReLU layer;

a third module, capable of inputting the signal features output by the encoding framework sequentially into one frequency GRU layer and one time GRU layer;

a fourth module, capable of inputting the signal features output by the time GRU layer into a decoding framework, the decoding framework comprising transposed convolutional layers corresponding to the two-dimensional convolutional layers of the encoding framework, each transposed convolutional layer being followed in sequence by a batch normalization layer and a PReLU layer;

a fifth module, capable of performing an optimization calculation on the output of the decoding framework and outputting a speech time-domain signal.

In a third aspect, a computer device is provided, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when a computer device is running, the machine-readable instructions when executed by the processor performing the method of the first aspect.

In a fourth aspect, a computer-readable storage medium is proposed, on which a computer program is stored, which computer program, when being executed by a processor, is adapted to carry out the method of the first aspect.

The specification can at least achieve the following technical effects:

The scheme of the invention is based on a U-shaped network structure built from an encoding-decoding framework and makes full use of the correlation among frequency bins in the time-frequency features. It has a small model size, few parameters, and low computational cost, can run on the local device in real time, and achieves a better echo suppression effect.

Drawings

In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only some embodiments described in the present specification, and for those skilled in the art, other drawings can be obtained according to the drawings without any creative effort.

Fig. 1 is a schematic diagram of a deep neural network-based single-channel speech echo cancellation method provided in an embodiment of the present disclosure.

Fig. 2 is a second schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.

Fig. 3 is a third schematic diagram of a deep neural network-based single-channel speech echo cancellation method provided in an embodiment of the present disclosure.

Fig. 4 is a fourth schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.

Fig. 5 is a fifth schematic diagram of a deep neural network-based single-channel speech echo cancellation method provided in an embodiment of the present specification.

Fig. 6 is a sixth schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.

Fig. 7 is a seventh schematic diagram of a deep neural network-based single-channel speech echo cancellation method provided in an embodiment of the present disclosure.

Fig. 8 is an eighth schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.

Fig. 9 is a ninth schematic diagram of a deep neural network-based single-channel speech echo cancellation method according to an embodiment of the present disclosure.

Fig. 10 is a schematic diagram of a deep neural network-based single-channel speech echo cancellation device provided in an embodiment of the present specification.

Fig. 11 is a schematic structural diagram of an electronic device provided in an embodiment of the present specification.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present specification, the technical solutions in the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is obvious that the described embodiments are only a part of the embodiments of the present specification, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without making any creative effort shall fall within the protection scope of the present specification.

The following describes a deep neural network-based single-channel speech echo cancellation scheme in detail by way of specific examples.

Example one

The invention aims to overcome the shortcomings of existing echo cancellation algorithms based on deep neural networks, namely complex models, large model sizes, and a failure to consider the correlation among frequency bins, and to provide an echo cancellation scheme with low computational cost, real-time operation, few parameters, and good performance in complex acoustic environments. Acoustic echo usually arises in voice communication terminals. Because non-linear processing, such as lossy vocoding and transcoding, is applied in the communication channel, the acoustic echo has to be processed locally on the device, and the computational resources local to the device are usually limited, which places strict demands on the complexity of the echo cancellation algorithm. In addition, real-time conferencing systems tolerate only limited processing delay, and the devices may be placed in noisy environments; together with the non-linear distortion that loudspeakers may introduce, these factors present challenges to echo cancellation methods. Referring to Fig. 1, in a real-time audio-video communication system, the far-end signal is played through the near-end loudspeaker and then re-captured by the near-end microphone. Considering that near-end speech and ambient noise may be present at the same time, the signal picked up by the near-end microphone can be expressed as the sum of the echo, the near-end speech, and the ambient noise, where the echo is the delayed far-end signal convolved with the echo path [equation missing in source]. Here the echo path is the path formed from the playing of the far-end signal to its capture by the near-end microphone, and the delay is the delay of the re-captured far-end signal relative to the original far-end signal. The aim of the echo cancellation system is to obtain, given the near-end microphone signal and the far-end reference signal, an estimate of the clean near-end speech signal that is as accurate as possible. Therefore, the technical idea of the scheme of the invention is to make full use, under an encoding-decoding framework, of the advantages of recurrent neural networks, in particular the gated recurrent unit (GRU), in time-series processing and of convolutional neural networks in feature extraction, thereby achieving a better echo cancellation effect.
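The signal model described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names, the toy echo path, and the delay value are assumptions made for the example, since the symbols of the original equation are not available in this text.

```python
def convolve(signal, kernel):
    """Full linear convolution of two real-valued sequences."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, s in enumerate(signal):
        for j, k in enumerate(kernel):
            out[i + j] += s * k
    return out


def delay(signal, num_samples):
    """Delay a sequence by num_samples while keeping its original length."""
    return ([0.0] * num_samples + list(signal))[: len(signal)]


def microphone_signal(far_end, echo_path, delay_samples, near_speech, noise):
    """Microphone signal = echo + near-end speech + ambient noise.

    The echo is the delayed far-end signal convolved with the echo path,
    truncated to the length of the far-end signal.
    """
    echo = convolve(delay(far_end, delay_samples), echo_path)[: len(far_end)]
    return [e + s + v for e, s, v in zip(echo, near_speech, noise)]


# Hypothetical toy data: a unit impulse played at the far end,
# a two-tap echo path, a delay of two samples, and a silent near end.
far_end = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
echo_path = [0.5, 0.25]
silence = [0.0] * len(far_end)

mic = microphone_signal(far_end, echo_path, 2, silence, silence)
print(mic)  # [0.0, 0.0, 0.5, 0.25, 0.0, 0.0]
```

With a silent near end, the microphone signal is pure echo, which is exactly the component an echo canceller should remove while leaving near-end speech intact.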

Fig. 2 is a schematic diagram of a single-channel speech echo cancellation scheme based on a deep neural network according to an embodiment of the present invention. The deep neural network echo cancellation method disclosed by the invention operates within a U-shaped network structure built from an encoding-decoding framework. First, the near-end signal and the far-end signal are each converted from the time domain to the frequency domain, and features are extracted. Second, the features pass sequentially through four parts: the encoding framework, the frequency GRU layer, the time GRU layer, and the decoding framework. Finally, the output of the decoding framework is converted into a speech time-domain signal. This framework combines the advantages of recurrent neural networks and convolutional neural networks while also exploiting the correlation among frequency bins in the spectral features. It should be noted that, in the embodiments of the present specification, time-domain signals are denoted by lower-case letters and the corresponding frequency-domain signals by upper-case letters. Fig. 3 is a schematic diagram of a method for single-channel speech echo cancellation based on a deep neural network in an embodiment of the present invention. The method includes:

S1: Performing Fourier transforms on the near-end time-domain signal collected by the near-end microphone and on the far-end time-domain signal, respectively, to obtain a near-end frequency-domain signal and a far-end frequency-domain signal, and extracting signal features from the near-end frequency-domain signal and the far-end frequency-domain signal, respectively.

Optionally, the process of performing the Fourier transforms and extracting the signal features is shown in Fig. 4 and includes:

S111: Performing a Fourier transform on the near-end time-domain signal and on the far-end time-domain signal, respectively [transform equations missing in source], wherein the number of Fourier transform points is 512.

S112: Obtaining the magnitude of the near-end frequency-domain signal and the magnitude of the far-end frequency-domain signal.

S113: Calculating and outputting the near-end frequency-domain signal features and the far-end frequency-domain signal features, each of which comprises 257 frequency bins.

It should be noted here that the fast Fourier transform (FFT) is an efficient method for computing the discrete Fourier transform (DFT) on a computer, and that the transform of a real-valued input is conjugate-symmetric. Therefore, when the number of Fourier transform points is 512, only the non-redundant half of the spectrum is retained after the FFT of the 512 real-valued points: 256 points plus the direct-current component (the component at frequency 0), giving 257 frequency-bin features.
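The 257-bin count can be checked numerically. The sketch below uses a direct O(N^2) DFT from the standard library so that it stays self-contained; a practical implementation would normally use an FFT routine instead. The test frame (a cosine centred on bin 8) is a hypothetical input chosen for the example.

```python
import cmath
import math

N_FFT = 512  # number of Fourier transform points stated in the text


def dft(frame):
    """Direct O(N^2) discrete Fourier transform of one frame."""
    n = len(frame)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]


def magnitude_features(spectrum):
    """Magnitudes of the non-redundant bins 0 .. N/2 of a full spectrum."""
    return [abs(c) for c in spectrum[: len(spectrum) // 2 + 1]]


# Hypothetical test frame: a cosine that falls exactly on bin 8.
frame = [math.cos(2 * math.pi * 8 * t / N_FFT) for t in range(N_FFT)]
spectrum = dft(frame)
features = magnitude_features(spectrum)

# Conjugate symmetry: bin k mirrors bin N - k, so the upper bins
# carry no information beyond bins 0 .. 256.
max_mirror_error = max(
    abs(spectrum[k] - spectrum[N_FFT - k].conjugate())
    for k in range(1, N_FFT // 2)
)

print(len(features))  # 257
```

Bins 0 to 256 cover the direct-current component up to the Nyquist frequency, which is where the figure of 512 / 2 + 1 = 257 comes from.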

S2: Concatenating the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting them into an encoding framework, wherein the encoding framework comprises 3 two-dimensional convolutional layers, each followed in sequence by a batch normalization layer and a PReLU layer. Fig. 5 is a schematic diagram of the encoding framework.

Optionally, the convolution kernels of the two-dimensional convolutional layers are respectively [three kernel sizes, missing in source], and the strides corresponding to the convolution kernels are respectively [three strides, missing in source].

Optionally, the process of concatenating the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting them into the encoding framework is shown in Fig. 6 and includes:

S211: Concatenating the near-end frequency-domain signal features and the far-end frequency-domain signal features to form first concatenated frequency-domain signal features, which comprise 514 frequency bins.

S212: Inputting the first concatenated frequency-domain signal features into the encoding framework and, according to the number of two-dimensional convolutional layers, passing them sequentially through each two-dimensional convolutional layer and its corresponding batch normalization layer and PReLU layer to obtain first frequency-domain signal features.

S213: Outputting the first frequency-domain signal features.
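Steps S211 to S213 can be illustrated with a simplified, single-frame sketch. Several things here are assumptions rather than the patented design: the convolution is one-dimensional along the frequency axis instead of two-dimensional, the kernel sizes and strides in `HYPOTHETICAL_LAYERS` are invented because the original values are missing from this text, the weights are fixed averaging kernels, and batch normalization uses placeholder running statistics.

```python
def concat_features(near_features, far_features):
    """Concatenate 257 near-end and 257 far-end features into 514 values."""
    if len(near_features) != len(far_features):
        raise ValueError("near-end and far-end features must have equal length")
    return list(near_features) + list(far_features)


def conv1d(x, kernel, stride):
    """Strided 1-D convolution (no padding) along the frequency axis."""
    last_start = len(x) - len(kernel)
    return [
        sum(w * x[i + j] for j, w in enumerate(kernel))
        for i in range(0, last_start + 1, stride)
    ]


def batch_norm(x, mean=0.0, var=1.0, gamma=1.0, beta=0.0, eps=1e-5):
    """Inference-time batch normalization with given running statistics."""
    scale = gamma / (var + eps) ** 0.5
    return [(v - mean) * scale + beta for v in x]


def prelu(x, slope=0.25):
    """PReLU: identity for non-negative inputs, learned slope otherwise."""
    return [v if v >= 0 else slope * v for v in x]


def encoder_block(x, kernel, stride):
    """Convolution, then batch normalization, then PReLU."""
    return prelu(batch_norm(conv1d(x, kernel, stride)))


# Hypothetical (kernel weights, stride) pairs; NOT the values from the patent.
HYPOTHETICAL_LAYERS = [([0.2] * 5, 2), ([1 / 3] * 3, 2), ([1 / 3] * 3, 2)]

near = [1.0] * 257
far = [0.5] * 257
x = concat_features(near, far)
lengths = [len(x)]
for kernel, stride in HYPOTHETICAL_LAYERS:
    x = encoder_block(x, kernel, stride)
    lengths.append(len(x))

print(lengths)  # [514, 255, 127, 63]
```

Each strided block roughly halves the frequency resolution, which is how an encoder of this kind compresses the 514 input bins into a compact representation for the recurrent layers.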

S3: Inputting the signal features output by the encoding framework sequentially into one frequency GRU layer and one time GRU layer. Optionally, the number of nodes of the frequency GRU layer and of the time GRU layer is 32.
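One plausible reading of step S3 is sketched below: the frequency GRU scans across frequency bins within each frame, and the time GRU then scans across frames for each frequency position. This interpretation of the scan axes, the `TinyGRU` class, the random weights, and the two-channel toy input are all assumptions made for illustration; only the hidden size of 32 comes from the text.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyGRU:
    """Single-layer GRU with random weights, for illustration only."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)

        def mat(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_size = hidden_size
        # One (W, U, b) triple per gate: update z, reset r, candidate n.
        self.params = {
            gate: (mat(hidden_size, input_size), mat(hidden_size, hidden_size),
                   [0.0] * hidden_size)
            for gate in "zrn"
        }

    def _affine(self, gate, x, h):
        w, u, b = self.params[gate]
        return [
            sum(wi * xi for wi, xi in zip(w[j], x))
            + sum(ui * hi for ui, hi in zip(u[j], h))
            + b[j]
            for j in range(self.hidden_size)
        ]

    def step(self, x, h):
        z = [sigmoid(v) for v in self._affine("z", x, h)]
        r = [sigmoid(v) for v in self._affine("r", x, h)]
        reset_h = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(v) for v in self._affine("n", x, reset_h)]
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def run(self, sequence):
        """Return the hidden state after every element of the sequence."""
        h = [0.0] * self.hidden_size
        outputs = []
        for x in sequence:
            h = self.step(x, h)
            outputs.append(h)
        return outputs


def frequency_then_time(features, freq_gru, time_gru):
    """features[t][f] is a feature vector; the result has the same layout."""
    # Frequency GRU: within each frame, scan across frequency positions.
    after_freq = [freq_gru.run(frame) for frame in features]
    num_frames, num_bins = len(after_freq), len(after_freq[0])
    # Time GRU: for each frequency position, scan across frames (causal).
    per_bin = [
        time_gru.run([after_freq[t][f] for t in range(num_frames)])
        for f in range(num_bins)
    ]
    return [[per_bin[f][t] for f in range(num_bins)] for t in range(num_frames)]


HIDDEN = 32  # number of nodes stated in the text
freq_gru = TinyGRU(input_size=2, hidden_size=HIDDEN, seed=1)
time_gru = TinyGRU(input_size=HIDDEN, hidden_size=HIDDEN, seed=2)

# Toy input: 3 frames, 4 frequency positions, 2 channels per position.
features = [[[0.1 * (t + 1), 0.01 * f] for f in range(4)] for t in range(3)]
out = frequency_then_time(features, freq_gru, time_gru)
print(len(out), len(out[0]), len(out[0][0]))  # 3 4 32
```

Under this reading, the time GRU only looks at past frames, so the output for a frame does not depend on later frames, which is the property real-time processing needs.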

S4: Inputting the signal features output by the time GRU layer into a decoding framework, wherein the decoding framework comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers in the encoding framework, each transposed convolutional layer being followed in sequence by a batch normalization layer and a PReLU layer. Fig. 7 is a schematic diagram of the decoding framework.

Optionally, the convolution kernels of the transposed convolutional layers are respectively [three kernel sizes, missing in source], and the strides corresponding to the convolution kernels are respectively [three strides, missing in source]. Optionally, the encoding framework and the decoding framework exchange information through skip connections.

Optionally, the process of inputting the signal features output by the time GRU layer into the decoding framework is shown in Fig. 8 and includes:

S411: Passing the first frequency-domain signal features sequentially through the frequency GRU layer and the time GRU layer to obtain second frequency-domain signal features, inputting the second frequency-domain signal features into the decoding framework, and, according to the number of transposed convolutional layers, passing them sequentially through each transposed convolutional layer and its corresponding batch normalization layer and PReLU layer to obtain third frequency-domain signal features.

S412: Outputting the third frequency-domain signal features.
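The way a decoder of transposed convolutions mirrors the encoder can be checked with simple size arithmetic. The sketch below uses the standard output-length formulas for unpadded strided convolution and transposed convolution; the kernel sizes and strides in `HYPOTHETICAL_ENCODER` are invented for the example, because the original values are missing from this text.

```python
def conv_length(length, kernel, stride):
    """Output length of an unpadded strided convolution."""
    return (length - kernel) // stride + 1


def transposed_conv_length(length, kernel, stride, output_padding=0):
    """Output length of an unpadded transposed convolution."""
    return (length - 1) * stride + kernel + output_padding


# Hypothetical (kernel, stride) pairs along the frequency axis;
# NOT the values from the patent.
HYPOTHETICAL_ENCODER = [(5, 2), (3, 2), (3, 2)]

encoder_lengths = [514]  # concatenated near-end and far-end bins
for kernel, stride in HYPOTHETICAL_ENCODER:
    encoder_lengths.append(conv_length(encoder_lengths[-1], kernel, stride))

decoder_lengths = [encoder_lengths[-1]]
for level in range(len(HYPOTHETICAL_ENCODER) - 1, -1, -1):
    kernel, stride = HYPOTHETICAL_ENCODER[level]
    target = encoder_lengths[level]
    # Restore the samples that floor division discarded in the encoder.
    output_padding = (target - kernel) % stride
    decoder_lengths.append(
        transposed_conv_length(decoder_lengths[-1], kernel, stride, output_padding)
    )

print(encoder_lengths)  # [514, 255, 127, 63]
print(decoder_lengths)  # [63, 127, 255, 514]
```

Because every decoder level has the same length as its mirrored encoder level, the two feature maps can be joined by a skip connection, for example by concatenating them along the channel axis before the next transposed convolution.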

S5: Performing an optimization calculation on the output of the decoding framework and outputting a speech time-domain signal.

Optionally, the process of performing an optimization calculation on the output of the decoding framework and outputting a speech time-domain signal is shown in Fig. 9 and includes:

S511: Obtaining a corresponding complex ideal ratio mask from the third frequency-domain signal features.

S512: Applying the real part and the imaginary part of the complex ideal ratio mask to the real part and the imaginary part of the near-end frequency-domain signal, and calculating the real part and the imaginary part of the optimized near-end frequency-domain signal [equations missing in source].

S513: Performing an inverse Fourier transform on the optimized near-end frequency-domain signal to output a near-end time-domain signal.
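Step S512 can be sketched as follows, under the assumption that the mask is applied as a per-bin complex multiplication, which is the usual convention for a complex ideal ratio mask. The exact equations of the source are not available in this text, so the formulas and the function name `apply_complex_mask` are illustrative only.

```python
def apply_complex_mask(mask_real, mask_imag, near_real, near_imag):
    """Apply a complex mask to a complex spectrum, bin by bin.

    Assumes complex multiplication:
        out_real = mask_real * near_real - mask_imag * near_imag
        out_imag = mask_real * near_imag + mask_imag * near_real
    """
    out_real = [
        mr * yr - mi * yi
        for mr, mi, yr, yi in zip(mask_real, mask_imag, near_real, near_imag)
    ]
    out_imag = [
        mr * yi + mi * yr
        for mr, mi, yr, yi in zip(mask_real, mask_imag, near_real, near_imag)
    ]
    return out_real, out_imag


# Hypothetical two-bin example.
mask_real, mask_imag = [0.5, 1.0], [0.5, 0.0]
near_real, near_imag = [2.0, 3.0], [4.0, -1.0]

out_real, out_imag = apply_complex_mask(mask_real, mask_imag, near_real, near_imag)
print(out_real, out_imag)  # [-1.0, 3.0] [3.0, -1.0]

# The same result via Python's built-in complex type.
reference = [
    complex(mr, mi) * complex(yr, yi)
    for mr, mi, yr, yi in zip(mask_real, mask_imag, near_real, near_imag)
]
```

A mask of 1 + 0j leaves a bin unchanged, as the second bin shows, while other values rescale the magnitude and rotate the phase of the near-end spectrum before the inverse Fourier transform.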

Optionally, the loss function used for training the neural network in the speech echo cancellation process is [loss function equation missing in source], where V is a complex number, namely the frequency-domain representation of the clean near-end speech signal after Fourier transform, V_r is the real part of V, and V_i is the imaginary part of V. The loss function measures the difference between the near-end speech signal predicted by the neural network being trained and the clean near-end speech signal; and/or

the optimizer used is an Adam optimizer with a learning rate of 0.001, and the learning rate is adjusted according to a preset number of optimization rounds and/or the optimization validation result. Specifically, when the optimization validation result has not improved after 4 consecutive rounds of optimization, the learning rate can be reduced to half of its previous value, and one or more further rounds of optimization are then performed.
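The learning-rate adjustment described above can be sketched as a small scheduler. The class name `HalveOnPlateau` and the choice of a validation loss where lower is better are assumptions for the example; the initial rate of 0.001, the patience of 4 rounds, and the halving factor come from the text.

```python
import math


class HalveOnPlateau:
    """Halve the learning rate after `patience` rounds without improvement."""

    def __init__(self, initial_lr=0.001, patience=4, factor=0.5):
        self.lr = initial_lr
        self.patience = patience
        self.factor = factor
        self.best = math.inf  # assumes a validation loss (lower is better)
        self.rounds_without_improvement = 0

    def update(self, validation_loss):
        """Record one round's validation loss and return the new learning rate."""
        if validation_loss < self.best:
            self.best = validation_loss
            self.rounds_without_improvement = 0
        else:
            self.rounds_without_improvement += 1
            if self.rounds_without_improvement >= self.patience:
                self.lr *= self.factor
                self.rounds_without_improvement = 0
        return self.lr


# Hypothetical validation losses for seven training rounds.
scheduler = HalveOnPlateau()
losses = [1.0, 0.9, 0.95, 0.95, 0.95, 0.95, 0.8]
rates = [scheduler.update(loss) for loss in losses]
print(rates)  # [0.001, 0.001, 0.001, 0.001, 0.001, 0.0005, 0.0005]
```

In a training loop, the returned rate would be written back to the Adam optimizer before the next round; deep learning frameworks typically ship an equivalent plateau-based scheduler.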

Example two

Fig. 10 is a schematic structural diagram of a deep neural network-based single-channel speech echo cancellation device 1000 according to an embodiment of the present specification. Referring to fig. 10, in an embodiment, a deep neural network-based single channel speech echo cancellation apparatus 1000 includes:

a first module 1001, capable of performing Fourier transforms on a near-end time-domain signal collected by a near-end microphone and on a far-end time-domain signal, respectively, to obtain a near-end frequency-domain signal and a far-end frequency-domain signal, and extracting signal features from the near-end frequency-domain signal and the far-end frequency-domain signal, respectively;

a second module 1002, capable of concatenating the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting them into an encoding framework, wherein the encoding framework comprises 3 two-dimensional convolutional layers, each followed in sequence by a batch normalization layer and a PReLU layer;

a third module 1003, capable of inputting the signal features output by the encoding framework sequentially into one frequency GRU layer and one time GRU layer;

a fourth module 1004, capable of inputting the signal features output by the time GRU layer into a decoding framework, the decoding framework comprising transposed convolutional layers corresponding to the two-dimensional convolutional layers of the encoding framework, each transposed convolutional layer being followed in sequence by a batch normalization layer and a PReLU layer;

a fifth module 1005, capable of performing an optimization calculation on the output of the decoding framework and outputting a speech time-domain signal.

It should be understood that, in the embodiments of the present specification, the deep neural network-based single-channel speech echo cancellation device can also perform the method shown in Figs. 1 to 9 and implement the functions described in the examples shown in Figs. 1 to 9, which are not repeated here.

Example three

Fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present specification. Referring to Fig. 11, at the hardware level, the electronic device includes a processor, and optionally further includes an internal bus, a network interface, and a memory. The memory may include an internal memory, such as a random-access memory (RAM), and may further include a non-volatile memory, such as at least one disk memory. Of course, the electronic device may also include hardware required for other services.

The processor, the network interface, and the memory may be connected to each other via an internal bus, which may be an ISA (Industry Standard Architecture) bus, a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one double-headed arrow is shown in FIG. 11, but that does not indicate only one bus or one type of bus.

The memory is used for storing programs. In particular, a program may include program code comprising computer operating instructions. The memory may include both internal memory and non-volatile storage, and provides instructions and data to the processor.

The processor reads the corresponding computer program from the non-volatile memory into the internal memory and then runs it, forming the single-channel speech echo cancellation device at the logical level. The processor executes the program stored in the memory and specifically performs the following operations:

S1: Performing Fourier transforms on the near-end time-domain signal collected by the near-end microphone and on the far-end time-domain signal, respectively, to obtain a near-end frequency-domain signal and a far-end frequency-domain signal, and extracting signal features from the near-end frequency-domain signal and the far-end frequency-domain signal, respectively;

S2: Concatenating the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting them into an encoding framework, wherein the encoding framework comprises 3 two-dimensional convolutional layers, each followed in sequence by a batch normalization layer and a PReLU layer;

S3: Inputting the signal features output by the encoding framework sequentially into one frequency GRU layer and one time GRU layer;

S4: Inputting the signal features output by the time GRU layer into a decoding framework, wherein the decoding framework comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers in the encoding framework, each transposed convolutional layer being followed in sequence by a batch normalization layer and a PReLU layer;

S5: Performing an optimization calculation on the output of the decoding framework and outputting a speech time-domain signal.

The single-channel speech echo cancellation method based on a deep neural network disclosed in the embodiments shown in Figs. 1 to 9 of the present specification can be applied to a processor or implemented by a processor. The processor may be an integrated circuit chip with signal processing capability. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor or by instructions in the form of software. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present specification. A general-purpose processor may be a microprocessor or any conventional processor. The steps of a method disclosed in connection with the embodiments of the present specification may be embodied directly in a hardware decoding processor, or in a combination of hardware and software modules in the decoding processor. The software modules may be located in a storage medium well known in the art, such as random-access memory, flash memory, read-only memory, programmable read-only memory, erasable programmable read-only memory, or registers. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above method in combination with its hardware.

Of course, besides a software implementation, the electronic device of the embodiments of this specification does not exclude other implementations, such as a logic device or a combination of software and hardware. That is, the execution subject of the processing flow is not limited to logic units, and may also be hardware or a logic device.

Example four

Embodiments of this specification also provide a computer-readable storage medium storing one or more programs, the one or more programs comprising instructions which, when executed by a portable electronic device comprising a plurality of application programs, cause the portable electronic device to perform the method of the embodiments shown in FIG. 1 to FIG. 9, and in particular to perform the following steps:

S1: calculating the near-end frequency-domain signal and the far-end frequency-domain signal from, respectively, the near-end time-domain signal collected by the near-end microphone and the far-end time-domain signal, and extracting signal features from the near-end frequency-domain signal and the far-end frequency-domain signal separately;

S2: splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting the spliced features into an encoding frame, wherein the encoding frame comprises 3 two-dimensional convolutional layers, and a batch regularization layer and a PReLU layer positioned in sequence after each two-dimensional convolutional layer;

S3: inputting the signal features output by the encoding frame into one frequency GRU layer and then one time GRU layer, in sequence;

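S4: inputting the signal features output by the time GRU layer into a decoding frame, wherein the decoding frame comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers in the encoding frame, and a batch regularization layer and a PReLU layer positioned in sequence after each transposed convolutional layer;

S5: outputting a speech time-domain signal after performing an optimization calculation on the output of the decoding frame.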
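The order of the two recurrent layers in S3 can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the `GRU` class, the `[channels, frames, bins]` tensor layout, the sizes and the random weights are all assumptions made for illustration. The sketch only shows one plausible reading, in which the frequency GRU steps across the bins of each frame and the time GRU then steps across the frames of each bin.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRU:
    """Minimal single-layer GRU with random, untrained weights (illustrative only)."""

    def __init__(self, input_size, hidden_size, rng):
        s = 1.0 / np.sqrt(hidden_size)
        # Index 0: update gate z, 1: reset gate r, 2: candidate state n.
        self.W = rng.uniform(-s, s, (3, hidden_size, input_size))
        self.U = rng.uniform(-s, s, (3, hidden_size, hidden_size))
        self.b = np.zeros((3, hidden_size))
        self.hidden_size = hidden_size

    def __call__(self, x):
        """x: [batch, steps, input_size] -> [batch, steps, hidden_size]."""
        h = np.zeros((x.shape[0], self.hidden_size))
        out = []
        for t in range(x.shape[1]):
            xt = x[:, t]
            z = sigmoid(xt @ self.W[0].T + h @ self.U[0].T + self.b[0])
            r = sigmoid(xt @ self.W[1].T + h @ self.U[1].T + self.b[1])
            n = np.tanh(xt @ self.W[2].T + (r * h) @ self.U[2].T + self.b[2])
            h = (1.0 - z) * n + z * h
            out.append(h)
        return np.stack(out, axis=1)

rng = np.random.default_rng(0)
C, T, F = 8, 10, 16                       # hypothetical encoder output size
features = rng.standard_normal((C, T, F))

freq_gru = GRU(C, C, rng)
time_gru = GRU(C, C, rng)

# Frequency GRU: each frame is one sequence that steps across the F bins.
x = freq_gru(features.transpose(1, 2, 0))  # [T, F, C]

# Time GRU: each bin is one sequence that steps across the T frames.
x = time_gru(x.transpose(1, 0, 2))         # [F, T, C]

decoder_input = x.transpose(2, 1, 0)       # back to [C, T, F]
```

Because the hidden size equals the channel count here, the output keeps the shape of the encoder output and could be passed straight to a decoder; a real model may well use a different hidden size.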

In short, the above description is only a preferred embodiment of the present disclosure, and is not intended to limit the scope of the present disclosure. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present specification shall be included in the protection scope of the present specification.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an electronic data carrier device, a gaming console, a tablet computer, a wearable device, or a combination of any of these devices.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The embodiments in this specification are described in a progressive manner; for parts that are the same or similar among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the others. In particular, since the system embodiment is substantially similar to the method embodiment, it is described only briefly, and reference may be made to the corresponding parts of the method embodiment for the relevant points.

Claims

1. A single-channel voice echo cancellation method based on a deep neural network is characterized by comprising the following steps:

calculating the near-end frequency-domain signal and the far-end frequency-domain signal from, respectively, the near-end time-domain signal collected by the near-end microphone and the far-end time-domain signal, and extracting signal features from the near-end frequency-domain signal and the far-end frequency-domain signal separately;

splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting the spliced features into an encoding frame, wherein the encoding frame comprises 3 two-dimensional convolutional layers, and a batch regularization layer and a PReLU layer positioned in sequence after each two-dimensional convolutional layer;

inputting the signal features output by the encoding frame into one frequency GRU layer and then one time GRU layer, in sequence;

inputting the signal features output by the time GRU layer into a decoding frame, wherein the decoding frame comprises transposed convolutional layers corresponding to the two-dimensional convolutional layers in the encoding frame, and a batch regularization layer and a PReLU layer positioned in sequence after each transposed convolutional layer;

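outputting a speech time-domain signal after performing an optimization calculation on the output of the decoding frame.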

2. The deep neural network-based single-channel speech echo cancellation method of claim 1, wherein the convolution kernels of the two-dimensional convolutional layers are respectively [formula omitted], [formula omitted] and [formula omitted], and the strides corresponding to the convolution kernels are respectively [formula omitted], [formula omitted] and [formula omitted];

[two further groups of three formulas, together with their accompanying text, are missing from the source];
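The relationship between kernel size, stride and feature size in the encoding and decoding frames can be checked with simple arithmetic. The sketch below is illustrative only: the kernel and stride values are hypothetical and are NOT the values recited in this claim (those are not recoverable from this text), the helper names `conv_out` and `deconv_out` are introduced here, and the 257 bins assume a 512-point transform. It shows how three strided convolutions shrink the frequency axis and how matching transposed convolutions restore it.

```python
def conv_out(size, kernel, stride, padding=0):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def deconv_out(size, kernel, stride, padding=0):
    """Output length of a transposed convolution along one axis (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel

# Hypothetical (kernel, stride) pairs along the frequency axis; not the claimed values.
layers = [(5, 2), (3, 2), (3, 2)]

size = 257                      # frequency bins of an assumed 512-point transform
sizes = [size]
for kernel, stride in layers:   # encoding frame: three two-dimensional convolutions
    size = conv_out(size, kernel, stride)
    sizes.append(size)

for kernel, stride in reversed(layers):   # decoding frame: mirrored transposed convolutions
    size = deconv_out(size, kernel, stride)

print(sizes, size)
```

With these particular values the decoder returns exactly to the input size; other kernel and stride choices may need padding or output padding to do so.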

3. The deep neural network-based single-channel speech echo cancellation method according to claim 2, wherein the process of calculating the near-end frequency-domain signal and the far-end frequency-domain signal from, respectively, the near-end time-domain signal collected by the near-end microphone and the far-end time-domain signal, and extracting signal features from the near-end frequency-domain signal and the far-end frequency-domain signal separately, comprises:

performing a Fourier transform on the near-end time-domain signal and on the far-end time-domain signal separately, wherein the number of Fourier transform points is 512;

obtaining the amplitude of the near-end frequency-domain signal and the amplitude of the far-end frequency-domain signal;
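The feature extraction described in this claim can be sketched as follows. Only the 512-point transform size comes from the claim; the function name `stft_magnitude`, the Hann window, the hop of 256 samples and the random stand-in signals are assumptions for illustration.

```python
import numpy as np

def stft_magnitude(signal, n_fft=512, hop=256):
    """Return (complex spectrum, magnitude), each shaped [frames, n_fft // 2 + 1]."""
    window = np.hanning(n_fft)                      # assumed analysis window
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    spectrum = np.fft.rfft(frames, n=n_fft, axis=-1)  # 512-point transform per frame
    return spectrum, np.abs(spectrum)

rng = np.random.default_rng(0)
near = rng.standard_normal(16000)   # stand-in for the near-end microphone signal
far = rng.standard_normal(16000)    # stand-in for the far-end reference signal

near_spec, near_mag = stft_magnitude(near)
far_spec, far_mag = stft_magnitude(far)
```

The complex spectrum is kept alongside the magnitude because a later step applies a complex mask to the near-end frequency-domain signal.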

4. The deep neural network-based single-channel speech echo cancellation method according to claim 3, wherein splicing the near-end frequency-domain signal features and the far-end frequency-domain signal features and inputting the spliced features into the encoding frame comprises:

outputting the first frequency-domain signal characteristic.
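One block of the encoding frame (two-dimensional convolution, then batch regularization, then PReLU) can be sketched in NumPy as below. This is a sketch under stated assumptions, not the claimed implementation: "batch regularization" is taken here to mean batch normalization, the features are spliced along a channel axis, and the kernel size, stride, channel count and helper names (`conv2d`, `batch_norm`, `prelu`) are hypothetical.

```python
import numpy as np

def conv2d(x, weight, stride):
    """Naive valid 2-D convolution. x: [C_in, T, F]; weight: [C_out, C_in, kT, kF]."""
    c_out, _, k_t, k_f = weight.shape
    s_t, s_f = stride
    t_out = (x.shape[1] - k_t) // s_t + 1
    f_out = (x.shape[2] - k_f) // s_f + 1
    out = np.zeros((c_out, t_out, f_out))
    for t in range(t_out):
        for f in range(f_out):
            patch = x[:, t * s_t : t * s_t + k_t, f * s_f : f * s_f + k_f]
            out[:, t, f] = np.tensordot(weight, patch, axes=3)
    return out

def batch_norm(x, eps=1e-5):
    """Normalize each channel to zero mean and unit variance (no learned scale or shift)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def prelu(x, slope=0.25):
    """PReLU with a single shared slope; in training the slope would be learned."""
    return np.where(x >= 0, x, slope * x)

rng = np.random.default_rng(0)
near_mag = np.abs(rng.standard_normal((20, 257)))   # stand-in [frames, bins] features
far_mag = np.abs(rng.standard_normal((20, 257)))

spliced = np.stack([near_mag, far_mag])              # assumed channel-wise splice: [2, 20, 257]
weight = rng.standard_normal((4, 2, 1, 3)) * 0.1     # hypothetical 1x3 kernel, 4 output channels
block_out = prelu(batch_norm(conv2d(spliced, weight, stride=(1, 2))))
```

The encoding frame would chain three such blocks; the output of the last block corresponds to the first frequency-domain signal features.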

5. The deep neural network-based single-channel speech echo cancellation method according to claim 4, wherein inputting the signal features output by the time GRU layer into the decoding frame comprises:

inputting into the decoding frame the second frequency-domain signal features, obtained by passing the first frequency-domain signal features through the frequency GRU layer and the time GRU layer in sequence, and calculating third frequency-domain signal features by passing them, according to the number of transposed convolutional layers, through each corresponding transposed convolutional layer, batch regularization layer and PReLU layer in sequence;

and outputting the third frequency domain signal characteristic.

6. The deep neural network-based single-channel speech echo cancellation method according to claim 5, wherein the process of outputting the speech time-domain signal after performing an optimization calculation on the output of the decoding frame comprises:

applying the real part and the imaginary part of the complex ideal ratio mask to the real part and the imaginary part of the near-end frequency-domain signal, and calculating the real part [formula omitted] and the imaginary part [formula omitted] of the optimized near-end frequency-domain signal;

performing an inverse Fourier transform on the optimized near-end frequency-domain signal to output the near-end time-domain signal.
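Applying a complex mask and returning to the time domain can be sketched as below. The exact formulas of this claim are not recoverable from this text, so the sketch assumes the usual complex multiplication: real part = M_r·Y_r − M_i·Y_i and imaginary part = M_r·Y_i + M_i·Y_r, where M is the mask and Y the near-end spectrum (symbols introduced here). The function names, Hann window, hop size and overlap-add inverse are likewise assumptions.

```python
import numpy as np

def apply_complex_mask(spec, mask_r, mask_i):
    """Multiply a spectrum by the complex mask (mask_r + j * mask_i)."""
    out_r = mask_r * spec.real - mask_i * spec.imag
    out_i = mask_r * spec.imag + mask_i * spec.real
    return out_r + 1j * out_i

def istft(spec, n_fft=512, hop=256):
    """Windowed overlap-add inverse of a Hann-windowed transform (assumed framing)."""
    window = np.hanning(n_fft)
    frames = np.fft.irfft(spec, n=n_fft, axis=-1) * window
    length = n_fft + hop * (spec.shape[0] - 1)
    signal = np.zeros(length)
    norm = np.zeros(length)
    for i, frame in enumerate(frames):
        signal[i * hop : i * hop + n_fft] += frame
        norm[i * hop : i * hop + n_fft] += window ** 2
    return signal / np.maximum(norm, 1e-8)

rng = np.random.default_rng(0)
n_fft, hop = 512, 256
x = rng.standard_normal(n_fft + hop * 9)             # stand-in near-end signal
window = np.hanning(n_fft)
frames = np.stack([x[i * hop : i * hop + n_fft] * window for i in range(10)])
spec = np.fft.rfft(frames, axis=-1)                  # [10, 257]

# An identity mask (1 + 0j) leaves the spectrum unchanged, so the inverse
# transform should recover the original samples away from the signal edges.
identity = apply_complex_mask(spec, np.ones(spec.shape), np.zeros(spec.shape))
y = istft(identity)
```

In the method itself the mask would come from the decoding frame output rather than being fixed; the identity mask is used here only to make the round trip easy to verify.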

7. The deep neural network-based single-channel speech echo cancellation method according to claim 6, further comprising performing neural network training during the speech echo cancellation process, using a loss function [formula omitted], wherein V is the frequency-domain representation of the clean near-end speech signal after Fourier transform, V_r is the real part of V, and V_i is the imaginary part of V;
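The loss formula of this claim is not recoverable from this text; only its use of the real part V_r and imaginary part V_i of the clean near-end spectrum is known. Purely as an illustration of a loss built from those two quantities, the sketch below assumes a mean squared error over the real and imaginary parts. The function name `complex_mse_loss` and the choice of mean squared error are assumptions, not the claimed loss.

```python
import numpy as np

def complex_mse_loss(estimate, target):
    """Mean of squared real-part error plus squared imaginary-part error (assumed form)."""
    err_r = estimate.real - target.real   # target.real plays the role of V_r
    err_i = estimate.imag - target.imag   # target.imag plays the role of V_i
    return float(np.mean(err_r ** 2 + err_i ** 2))

rng = np.random.default_rng(0)
V = rng.standard_normal((10, 257)) + 1j * rng.standard_normal((10, 257))  # stand-in clean spectrum

perfect = complex_mse_loss(V, V)             # estimate equals the target
offset = complex_mse_loss(V + (3 + 4j), V)   # constant error of 3 + 4j in every bin
```

A perfect estimate gives a loss of zero, and a constant error of 3 + 4j gives 3² + 4² = 25 in every bin.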

8. A single-channel voice echo cancellation device based on a deep neural network is characterized by comprising the following components:

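calculating the near-end frequency-domain signal and the far-end frequency-domain signal from, respectively, the near-end time-domain signal collected by the near-end microphone and the far-end time-domain signal, and extracting signal features from the near-end frequency-domain signal and the far-end frequency-domain signal separately;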

。

9. An electronic device, comprising:

a processor; and

a memory arranged to store computer executable instructions that when executed cause the processor to perform the deep neural network based single channel speech echo cancellation method of any one of claims 1 to 7.

10. A computer-readable storage medium storing one or more programs which, when executed by an electronic device including a plurality of application programs, cause the electronic device to perform the deep neural network-based single-channel speech echo cancellation method of any one of claims 1 to 7.