CN111128209A - Speech enhancement method based on mixed masking learning target - Google Patents
- Publication number
- CN111128209A (application CN201911385421.XA)
- Authority
- CN
- China
- Prior art keywords
- dimensional
- masking
- learning target
- mixed
- voice signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 230000000873 masking effect Effects 0.000 title claims abstract description 44
- 238000000034 method Methods 0.000 title claims abstract description 27
- 238000001228 spectrum Methods 0.000 claims abstract description 29
- 238000012360 testing method Methods 0.000 claims abstract description 26
- 238000000605 extraction Methods 0.000 claims abstract description 8
- 238000013528 artificial neural network Methods 0.000 claims abstract description 6
- 239000013598 vector Substances 0.000 claims description 40
- 238000005070 sampling Methods 0.000 claims description 20
- 230000004913 activation Effects 0.000 claims description 8
- 230000006870 function Effects 0.000 claims description 8
- 238000010606 normalization Methods 0.000 claims description 6
- 230000008569 process Effects 0.000 claims description 5
- 230000037433 frameshift Effects 0.000 claims description 4
- 238000009432 framing Methods 0.000 claims description 4
- 210000002569 neuron Anatomy 0.000 claims description 4
- 238000000354 decomposition reaction Methods 0.000 claims description 2
- 230000009467 reduction Effects 0.000 claims description 2
- 230000001629 suppression Effects 0.000 claims description 2
- 230000009466 transformation Effects 0.000 claims description 2
- 230000002708 enhancing effect Effects 0.000 claims 1
- 238000004364 calculation method Methods 0.000 description 2
- 239000011159 matrix material Substances 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000014759 maintenance of location Effects 0.000 description 1
- 238000000926 separation method Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Quality & Reliability (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Theoretical Computer Science (AREA)
- Medical Informatics (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
Abstract
A speech enhancement method based on a mixed masking learning target: performing traditional feature extraction on the speech signals, which comprises dividing the acquired speech signals into a training set and a test set and extracting the traditional features of the training-set and test-set speech signals respectively; extracting the STFT-domain magnitude spectrum features of the training-set and test-set speech signals; constructing a deep stacked residual network; constructing a learning target; training the deep stacked residual network with the extracted traditional features of the training set, the STFT-domain magnitude spectrum features and the learning target; and feeding the extracted traditional features and STFT-domain magnitude spectrum features of the test set into the trained deep stacked residual network to obtain a predicted learning target, applying the ISTFT to the predicted learning target to obtain the enhanced speech signal, and computing its PESQ value. The invention retains no noise information in speech-dominated time-frequency units, reduces the amount of computation, and makes the neural network easy to train, thereby improving the intelligibility and quality of speech.
Description
Technical Field
The invention relates to mixed masking learning targets, and more particularly to a speech enhancement method based on a mixed masking learning target.
Background
At present there are many speech enhancement methods based on deep learning, and the key technology mainly involves three aspects: which features to extract, which model to adopt, and which target to learn. Like the features, the learning target is well worth studying: given the same training data, features and learning model, a better learning target allows the model to be trained better.
In a speech enhancement system using a supervised neural network, the learning target is generally computed from the background noise and the clean speech, and an effective learning target has an important influence on the learning ability of the speech enhancement model and the generalization of the system.
The speech enhancement learning targets in current use fall into two main categories: training targets based on time-frequency masking, and targets based on estimating the speech magnitude spectrum. The former reflect the energy relationship between the clean speech signal and the background noise in the mixed signal; the latter are the magnitude spectrum features of the clean target speech. Common time-frequency masking targets include the Ideal Binary Mask (IBM), the Ideal Ratio Mask (IRM, also called the ideal floating-value mask) and the Target Binary Mask (TBM). The ideal binary mask and the ideal ratio mask are the most commonly used learning targets, but each has drawbacks such as inaccurate prediction or poor generalization.
When the learning target is the IBM, the model only needs to classify each time-frequency unit as noise-dominated or target-speech-dominated (0 or 1), which retains noise information in the time-frequency units dominated by the target speech; these noise components seriously degrade the intelligibility and quality of the speech. When the learning target is the IRM, the model must predict a coefficient for every time-frequency unit; in noise-dominated units the extracted features cannot represent the target speech well, so it is difficult for the model to predict the coefficients of those units accurately.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a speech enhancement method based on a mixed masking learning target, which can improve the intelligibility and quality of speech.
The technical scheme adopted by the invention is as follows: a speech enhancement method based on a mixed masking learning target comprises the following steps:
1) performing traditional feature extraction on the speech signals, which comprises dividing the acquired speech signals into a training set and a test set and extracting the traditional features of the training-set and test-set speech signals respectively;
2) extracting the STFT-domain magnitude spectrum features of the training-set and test-set speech signals respectively;
3) constructing a deep stacked residual network;
4) constructing a learning target;
5) training the deep stacked residual network with the extracted traditional features of the training set, the STFT-domain magnitude spectrum features and the learning target;
6) feeding the extracted traditional features and STFT-domain magnitude spectrum features of the test set into the trained deep stacked residual network to obtain a predicted learning target, applying the ISTFT to the predicted learning target to obtain the enhanced speech signal, and computing the PESQ value of the speech signal.
The speech enhancement method based on the mixed masking learning target combines the advantages of the ideal binary masking and ideal floating-value masking learning targets. First, it ensures that no noise information is retained in speech-dominated time-frequency units. In noise-dominated time-frequency units the learning target is simply set to 0; although a small amount of speech information may be lost, the reduced data redundancy makes its performance better than that of the IRM learning target and lowers the amount of computation. Moreover, because the mixed mask contains time-frequency units equal to 0, the fitting ability and calculation accuracy of the neural network are further improved compared with the IRM learning target, and the network is easier to train, which improves the intelligibility and quality of the speech.
Detailed Description
The following describes a speech enhancement method based on a hybrid masking learning objective in detail with reference to the embodiments.
The invention discloses a voice enhancement method based on a mixed masking learning target, which comprises the following steps:
1) performing traditional feature extraction on the voice signals, wherein the traditional feature extraction comprises dividing the acquired voice signals into a training set and a test set, and respectively extracting traditional features of the voice signals of the training set and the test set;
the method comprises the following steps: randomly extracting 1500 sections of voice from a training part of a TIMIT corpus, randomly mixing the 1500 sections of voice with 9 kinds of noise extracted from a NOISEX-92 corpus, generating 1500 sections of mixed voice signals to form a training set under a continuously changing signal-to-noise ratio of-5 dB, randomly selecting 500 sections of pure voice from a testing part of the TIMIT corpus, randomly mixing the 500 sections of pure voice with 15 kinds of voice extracted from the NOISEX-92 corpus, and generating 500 sections of mixed voice signals to form a testing set under 10-8-6-4-2-0-2-4-6-8 dB different signal-to-noise ratios.
The traditional features of the training-set and test-set speech signals are extracted by the same process, which yields the following feature vectors:
(1) Perform a 512-point short-time Fourier transform on the mixed speech signal sampled at 16 kHz, extract a 31-dimensional MFCC feature vector using a Hamming window with a 20 ms frame length and a 10 ms frame shift, and compute the first-order derivative of the 31-dimensional MFCC feature vector;
(2) Perform full-wave rectification on the 16 kHz mixed speech signal to extract its envelope, downsample it to one quarter of the sampling rate, frame it with a Hamming window of 32 ms frame length and 10 ms frame shift, obtain a 15-dimensional AMS feature vector using 15 triangular windows whose centre frequencies are uniformly distributed over 15.6-400 Hz, and compute the first-order derivative of the 15-dimensional AMS feature vector;
(3) Decompose the 16 kHz mixed speech signal with a 64-channel Gammatone filter bank, sample each decomposed output at a rate of 100 Hz, apply cube-root amplitude compression to the sampled signals, extract a 64-dimensional Gammatone feature vector, and compute the first-order derivative of the 64-dimensional Gammatone feature vector;
(4) Convert the power spectrum of the 16 kHz mixed speech signal onto a 20-channel Bark scale using trapezoidal filters, apply equal-loudness pre-emphasis, then use the intensity-loudness law and a 12th-order linear prediction model to obtain a 13-dimensional PLP feature vector, and compute the first-order derivative of the 13-dimensional PLP feature vector;
Concatenate the 31-dimensional MFCC, 15-dimensional AMS, 64-dimensional Gammatone and 13-dimensional PLP feature vectors to obtain a 123-dimensional feature vector; concatenate their first-order derivatives to obtain another 123-dimensional feature vector; and concatenate the two 123-dimensional vectors to obtain a 246-dimensional feature vector;
Obtain the zero-crossing rate, root-mean-square energy and spectral centroid features of the 16 kHz mixed speech signal, combine them with the 246-dimensional feature vector to form a 269-dimensional feature vector, and feed the 269-dimensional feature vector into a fireworks-algorithm feature selector for dimensionality reduction, with the number of initial fireworks N = 400 and feature-subset dimensions M = 50, 70 and 90; a sketch of the derivative computation and feature concatenation appears after this list.
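A minimal numpy sketch of the first-order derivative (delta) computation and the 246-dimensional concatenation, assuming the four per-frame extractors of items (1)-(4) are implemented elsewhere (the helper names below are hypothetical); the zero-crossing-rate, RMS-energy and spectral-centroid features and the fireworks-algorithm selection are omitted.

```python
import numpy as np

def delta(feats, width=2):
    """Regression-based first-order derivative over time; `feats` is (frames, dims).
    The regression width of 2 frames is an illustrative choice, not specified in the patent."""
    T = len(feats)
    pad = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(k * (pad[width + k:width + k + T] - pad[width - k:width - k + T])
              for k in range(1, width + 1))
    return num / (2 * sum(k * k for k in range(1, width + 1)))

mfcc = extract_mfcc(mixture)        # (T, 31)  hypothetical helper for item (1)
ams = extract_ams(mixture)          # (T, 15)  hypothetical helper for item (2)
gt = extract_gammatone(mixture)     # (T, 64)  hypothetical helper for item (3)
plp = extract_plp(mixture)          # (T, 13)  hypothetical helper for item (4)

base = np.concatenate([mfcc, ams, gt, plp], axis=1)                                # (T, 123)
derivs = np.concatenate([delta(mfcc), delta(ams), delta(gt), delta(plp)], axis=1)  # (T, 123)
traditional = np.concatenate([base, derivs], axis=1)                                # (T, 246)
```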
2) Respectively extracting the amplitude spectrum characteristics of the STFT domain of the voice signals of the training set and the test set;
the method for extracting the amplitude spectrum characteristics of the STFT domain of the speech signals of the training set and the test set is the same as that of the STFT domain of the speech signals of the training set and the test set, and comprises the following steps: the method comprises the steps of carrying out short-time Fourier transform on a mixed voice signal with a sampling rate of 16kHz, framing the mixed voice signal by adopting a Hamming window with a frame length of 25ms and a frame shift of 10ms in the transformation process, adding the amplitude spectrums of two frames adjacent to the left and the right of a single frame when the amplitude spectrum of each single frame corresponding to the traditional characteristics is input, wherein the total number of the frames is 5, the dimensionality of the amplitude spectrum of each frame is 200, and obtaining the amplitude spectrum characteristics of an STFT domain with the input dimensionality of 1000.
3) Constructing a deep stacked residual network; wherein
the deep stacked residual network comprises an input channel I, an input channel II, and a fully connected residual network module connected to the concatenated outputs of input channel I and input channel II, wherein
the input channel I consists of three convolutional layers and three normalization layers combined through residual connections; the convolution kernels are two-dimensional, the stride is set to 1, and zero padding is used; from top to bottom, the first convolutional layer has 1 x 1 kernels and 32 output channels, the second convolutional layer has 3 x 3 kernels and 32 output channels, and the third convolutional layer has 1 x 1 kernels and 64 output channels; all three convolutional layers use the ReLU activation function;
the input channel II consists of a normalization layer and a fully connected layer combined through a residual connection; the fully connected layer has 1024 neurons and uses the ReLU activation function;
the fully connected residual network module consists of a normalization layer and a fully connected layer with 4096 neurons; the fully connected layer uses the Sigmoid activation function.
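A rough PyTorch sketch of this two-channel network. The exact residual wiring, the input shapes, and which feature stream feeds which channel are not fully specified in the text, so the projection skip connections, the (1, 5, 200) spectral input for channel I and the 246-dimensional traditional-feature input for channel II are assumptions; how the 4096-dimensional sigmoid output maps onto the mask dimensions is likewise not detailed here.

```python
import torch
import torch.nn as nn

class ConvChannel(nn.Module):
    """Input channel I: 1x1 / 3x3 / 1x1 convolutions (32, 32, 64 output channels) with
    batch normalization and ReLU, plus an assumed 1x1 projection skip connection."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=1, stride=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=1, stride=1), nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.skip = nn.Conv2d(in_ch, 64, kernel_size=1)   # projection so the residual shapes match

    def forward(self, x):                  # x: (batch, 1, 5, 200) STFT context frames
        return self.body(x) + self.skip(x)

class FcChannel(nn.Module):
    """Input channel II: batch normalization and a 1024-unit ReLU fully connected layer
    with an assumed projection skip connection."""
    def __init__(self, in_dim=246):
        super().__init__()
        self.bn = nn.BatchNorm1d(in_dim)
        self.fc = nn.Linear(in_dim, 1024)
        self.skip = nn.Linear(in_dim, 1024)

    def forward(self, x):                  # x: (batch, 246) traditional features
        return torch.relu(self.fc(self.bn(x))) + self.skip(x)

class MixedMaskNet(nn.Module):
    """Concatenates the two channel outputs and applies the fully connected residual
    module described above: batch normalization plus a 4096-unit sigmoid layer."""
    def __init__(self, conv_in=1, fc_in=246):
        super().__init__()
        self.chan1, self.chan2 = ConvChannel(conv_in), FcChannel(fc_in)
        fused = 64 * 5 * 200 + 1024        # flattened conv output + FC-channel output
        self.head = nn.Sequential(nn.BatchNorm1d(fused), nn.Linear(fused, 4096), nn.Sigmoid())

    def forward(self, spec_img, trad_feats):
        a = self.chan1(spec_img).flatten(1)
        b = self.chan2(trad_feats)
        return self.head(torch.cat([a, b], dim=1))
```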
4) Constructing a learning target; the method comprises the following steps:
(1) respectively calculating an ideal binary masking learning target IBM and an ideal floating masking learning target IRM of a mixed voice signal of a training set by using the following formulas:
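(The formulas referenced here appear only as figures in the original publication; the expressions below are a reconstruction using the standard IBM and IRM definitions, consistent with the variables explained next, and should be read as an editorial aid rather than the verbatim formulas.)

SNR(m, f) = 10 · log10( S(m, f)² / N(m, f)² )

IBM(m, f) = 1 if SNR(m, f) > LC, otherwise 0

IRM(m, f) = S(m, f)² / ( S(m, f)² + N(m, f)² )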
where LC is set to 20 dB; SNR(m, f) is the local signal-to-noise ratio of the time-frequency unit at time frame m and frequency f, where f ranges from 80 Hz to 5000 Hz; S(m, f)² and N(m, f)² respectively denote the speech energy and the noise energy at the m-th time frame and frequency f;
IBM is a binary time-frequency masking matrix, calculated from clean speech and noise. For each time-frequency unit, if the local signal-to-noise ratio SNR (m, f) is greater than a certain local threshold, the corresponding element in the masking matrix is marked as 1, otherwise, it is marked as 0. IRM is a widely used training target in supervised learning speech separation.
(2) In order to combine the advantages of both masks, the present invention proposes a learning target based on the Mixed Mask (MM). In noise-dominated time-frequency units its mask value agrees with the IBM, i.e. it equals 0; in time-frequency units dominated by the target speech it agrees with the IRM. Specifically, the ideal binary masking learning target IBM and the ideal floating-value masking learning target IRM are multiplied element-wise (point multiplication) to obtain the mixed masking learning target MM, which forms the final learning target:
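(The mixed-mask formula also appears only as a figure in the original publication; the element-wise product below is a reconstruction consistent with the point multiplication described above.)

MM(m, f) = IBM(m, f) · IRM(m, f), i.e. MM = [ y_{i,j} · x_{i,j} ] for i = 1 … m, j = 1 … n,

so the mixed mask equals the IRM value in target-speech-dominated units (where IBM = 1) and 0 in noise-dominated units (where IBM = 0).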
where x_{1,1}, …, x_{m,n} respectively denote the ideal floating-value mask values in each time-frequency unit of a segment of mixed speech signal; x_{1,1}, …, x_{m,1} respectively denote the ideal floating-value masks of the first frame of the mixed speech signal; y_{1,1}, …, y_{m,n} respectively denote the ideal binary mask values in each time-frequency unit of a segment of mixed speech signal; y_{1,1}, …, y_{m,1} respectively denote the ideal binary masks of the first frame of the mixed speech signal; and x_{1,1}·y_{1,1}, …, x_{m,n}·y_{m,n} respectively denote the ideal mixed mask values in each time-frequency unit of a segment of mixed speech signal.
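A minimal numpy sketch of this target construction, assuming clean-speech and noise magnitude spectrograms S and N (time frames by frequency bins) are available from the training mixtures; the small eps constant is only for numerical safety.

```python
import numpy as np

def mixed_mask(S, N, lc_db=20.0, eps=1e-12):
    """Mixed mask MM = IBM * IRM; LC = 20 dB as stated in the text."""
    s2, n2 = S ** 2, N ** 2
    snr_db = 10.0 * np.log10((s2 + eps) / (n2 + eps))   # local SNR per time-frequency unit
    ibm = (snr_db > lc_db).astype(S.dtype)              # ideal binary mask (0 or 1)
    irm = s2 / (s2 + n2 + eps)                          # ideal floating-value (ratio) mask
    return ibm * irm                                    # IRM in speech-dominated units, 0 elsewhere
```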
5) Train the deep stacked residual network with the extracted traditional features of the training set, the STFT-domain magnitude spectrum features and the learning target;
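A minimal training-loop sketch under the assumptions of the network sketch in step 3); the optimiser, learning rate, loss function and the `train_loader` batching are not specified in the patent and are illustrative choices.

```python
import torch

net = MixedMaskNet()                                   # from the sketch in step 3)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)      # optimiser and learning rate assumed
loss_fn = torch.nn.MSELoss()                           # loss function assumed

for spec_img, trad_feats, target_mm in train_loader:   # hypothetical DataLoader of training batches
    opt.zero_grad()
    pred_mm = net(spec_img, trad_feats)                 # predicted mixed-mask learning target
    loss = loss_fn(pred_mm, target_mm)                  # target_mm shaped like the network output
    loss.backward()
    opt.step()
```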
6) Feed the extracted traditional features and STFT-domain magnitude spectrum features of the test set into the trained deep stacked residual network to obtain the predicted learning target, apply the ISTFT to the predicted learning target to obtain the enhanced speech signal, and compute the PESQ value of the speech signal.
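A sketch of this enhancement and scoring step: the predicted mask is applied to the noisy STFT (keeping the noisy phase), the ISTFT reconstructs the waveform, and PESQ is computed with the third-party `pesq` package, whose availability is assumed; `mixture`, `predicted_mask` and `clean_reference` are placeholders.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
# 25 ms frames (400 samples) with a 10 ms hop (160 samples), matching the feature extraction above.
f, t, Zxx = stft(mixture, fs=fs, window="hamming", nperseg=400, noverlap=240)
# `predicted_mask` is assumed reshaped to Zxx's (frequency, time) layout.
masked = predicted_mask * np.abs(Zxx) * np.exp(1j * np.angle(Zxx))   # keep the noisy phase
_, enhanced = istft(masked, fs=fs, window="hamming", nperseg=400, noverlap=240)

from pesq import pesq                                   # third-party package, assumed installed
score = pesq(fs, clean_reference, enhanced[:len(clean_reference)], "wb")   # wideband PESQ
```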
With the speech enhancement method based on the mixed masking learning target, the PESQ index improves markedly: the quality of the enhanced speech is 1.6% higher than with the ideal floating-value masking learning target, as shown in Table 1.
TABLE 1 PESQ values for speech signals of two learning objectives
Claims (6)
1. A speech enhancement method based on a hybrid masking learning objective is characterized by comprising the following steps:
1) performing traditional feature extraction on the speech signals, which comprises dividing the acquired speech signals into a training set and a test set and extracting the traditional features of the training-set and test-set speech signals respectively;
2) extracting the STFT-domain magnitude spectrum features of the training-set and test-set speech signals respectively;
3) constructing a deep stacked residual network;
4) constructing a learning target;
5) training the deep stacked residual network with the extracted traditional features of the training set, the STFT-domain magnitude spectrum features and the learning target;
6) feeding the extracted traditional features and STFT-domain magnitude spectrum features of the test set into the trained deep stacked residual network to obtain a predicted learning target, applying the ISTFT to the predicted learning target to obtain the enhanced speech signal, and computing the PESQ value of the speech signal.
2. The method for enhancing speech based on the hybrid masking learning objective as claimed in claim 1, wherein step 1) comprises: randomly extracting 1500 speech segments from the training part of the TIMIT corpus and randomly mixing them with 9 kinds of noise taken from the NOISEX-92 corpus at a continuously varying signal-to-noise ratio of -5 dB to generate 1500 mixed speech signals that form the training set; and randomly selecting 500 clean speech segments from the test part of the TIMIT corpus and randomly mixing them with 15 kinds of noise taken from the NOISEX-92 corpus at different signal-to-noise ratios of 10, 8, 6, 4, 2, 0, -2, -4, -6 and -8 dB to generate 500 mixed speech signals that form the test set.
3. The method according to claim 2, wherein the conventional feature processes for extracting the speech signals of the training set and the test set in step 1) are the same, and each process includes obtaining the following different feature vectors:
(1) performing a 512-point short-time Fourier transform on the mixed speech signal sampled at 16 kHz, extracting a 31-dimensional MFCC feature vector using a Hamming window with a 20 ms frame length and a 10 ms frame shift, and computing the first-order derivative of the 31-dimensional MFCC feature vector;
(2) performing full-wave rectification on the 16 kHz mixed speech signal to extract its envelope, downsampling it to one quarter of the sampling rate, framing it with a Hamming window of 32 ms frame length and 10 ms frame shift, obtaining a 15-dimensional AMS feature vector using 15 triangular windows whose centre frequencies are uniformly distributed over 15.6-400 Hz, and computing the first-order derivative of the 15-dimensional AMS feature vector;
(3) decomposing the 16 kHz mixed speech signal with a 64-channel Gammatone filter bank, sampling each decomposed output at a rate of 100 Hz, applying cube-root amplitude compression to the sampled signals, extracting a 64-dimensional Gammatone feature vector, and computing the first-order derivative of the 64-dimensional Gammatone feature vector;
(4) converting the power spectrum of the 16 kHz mixed speech signal onto a 20-channel Bark scale using trapezoidal filters, applying equal-loudness pre-emphasis, then using the intensity-loudness law and a 12th-order linear prediction model to obtain a 13-dimensional PLP feature vector, and computing the first-order derivative of the 13-dimensional PLP feature vector;
concatenating the 31-dimensional MFCC, 15-dimensional AMS, 64-dimensional Gammatone and 13-dimensional PLP feature vectors to obtain a 123-dimensional feature vector, concatenating their first-order derivatives to obtain another 123-dimensional feature vector, and concatenating the two 123-dimensional vectors to obtain a 246-dimensional feature vector;
obtaining the zero-crossing rate, root-mean-square energy and spectral centroid features of the 16 kHz mixed speech signal, combining them with the 246-dimensional feature vector to form a 269-dimensional feature vector, and feeding the 269-dimensional feature vector into a fireworks-algorithm feature selector for dimensionality reduction, with the number of initial fireworks N = 400 and feature-subset dimensions M = 50, 70 and 90.
4. The method of claim 1, wherein in step 2) the STFT-domain magnitude spectrum features of the training-set and test-set speech signals are extracted by the same process, comprising: performing a short-time Fourier transform on the mixed speech signal sampled at 16 kHz, framing it during the transform with a Hamming window of 25 ms frame length and 10 ms frame shift; when the single-frame magnitude spectrum corresponding to each frame of the traditional features is used as input, appending the magnitude spectra of the two frames to its left and the two frames to its right, 5 frames in total; and, with a 200-dimensional magnitude spectrum per frame, obtaining STFT-domain magnitude spectrum features with an input dimension of 1000.
5. The method according to claim 1, wherein the deep stacked residual network of step 3) comprises an input channel I, an input channel II, and a fully connected residual network module connected to the concatenated outputs of input channel I and input channel II, wherein
the input channel I consists of three convolutional layers and three normalization layers combined through residual connections; the convolution kernels are two-dimensional, the stride is set to 1, and zero padding is used; from top to bottom, the first convolutional layer has 1 x 1 kernels and 32 output channels, the second convolutional layer has 3 x 3 kernels and 32 output channels, and the third convolutional layer has 1 x 1 kernels and 64 output channels; all three convolutional layers use the ReLU activation function;
the input channel II consists of a normalization layer and a fully connected layer combined through a residual connection; the fully connected layer has 1024 neurons and uses the ReLU activation function;
the fully connected residual network module consists of a normalization layer and a fully connected layer with 4096 neurons; the fully connected layer uses the Sigmoid activation function.
6. The method of claim 1, wherein the step 4) of constructing the learning objective comprises:
(1) respectively calculating an ideal binary masking learning target IBM and an ideal floating masking learning target IRM of a mixed voice signal of a training set by using the following formulas:
where LC is set to 20 dB; SNR(m, f) is the local signal-to-noise ratio of the time-frequency unit at time frame m and frequency f, where f ranges from 80 Hz to 5000 Hz; S(m, f)² and N(m, f)² respectively denote the speech energy and the noise energy at the m-th time frame and frequency f;
(2) performing point multiplication on the ideal binary masking learning target IBM and the ideal floating value masking learning target IRM to obtain a mixed masking learning target MM, and forming a final learning target:
where x_{1,1}, …, x_{m,n} respectively denote the ideal floating-value mask values in each time-frequency unit of a segment of mixed speech signal; x_{1,1}, …, x_{m,1} respectively denote the ideal floating-value masks of the first frame of the mixed speech signal; y_{1,1}, …, y_{m,n} respectively denote the ideal binary mask values in each time-frequency unit of a segment of mixed speech signal; y_{1,1}, …, y_{m,1} respectively denote the ideal binary masks of the first frame of the mixed speech signal; and x_{1,1}·y_{1,1}, …, x_{m,n}·y_{m,n} respectively denote the ideal mixed mask values in each time-frequency unit of a segment of mixed speech signal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911385421.XA CN111128209B (en) | 2019-12-28 | 2019-12-28 | Speech enhancement method based on mixed masking learning target |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911385421.XA CN111128209B (en) | 2019-12-28 | 2019-12-28 | Speech enhancement method based on mixed masking learning target |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111128209A true CN111128209A (en) | 2020-05-08 |
CN111128209B CN111128209B (en) | 2022-05-10 |
Family
ID=70504227
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911385421.XA Active CN111128209B (en) | 2019-12-28 | 2019-12-28 | Speech enhancement method based on mixed masking learning target |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111128209B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111583954A (en) * | 2020-05-12 | 2020-08-25 | 中国人民解放军国防科技大学 | Speaker independent single-channel voice separation method |
CN111653287A (en) * | 2020-06-04 | 2020-09-11 | 重庆邮电大学 | Single-channel speech enhancement algorithm based on DNN and in-band cross-correlation coefficient |
CN111899750A (en) * | 2020-07-29 | 2020-11-06 | 哈尔滨理工大学 | Speech enhancement algorithm combining cochlear speech features and hopping deep neural network |
CN112562706A (en) * | 2020-11-30 | 2021-03-26 | 哈尔滨工程大学 | Target voice extraction method based on time potential domain specific speaker information |
CN113257267A (en) * | 2021-05-31 | 2021-08-13 | 北京达佳互联信息技术有限公司 | Method for training interference signal elimination model and method and equipment for eliminating interference signal |
CN113470671A (en) * | 2021-06-28 | 2021-10-01 | 安徽大学 | Audio-visual voice enhancement method and system by fully utilizing visual and voice connection |
CN114495957A (en) * | 2022-01-27 | 2022-05-13 | 安徽大学 | Method, system and device for speech enhancement based on Transformer improvement |
CN114495968A (en) * | 2022-03-30 | 2022-05-13 | 北京世纪好未来教育科技有限公司 | Voice processing method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20080049385A (en) * | 2006-11-30 | 2008-06-04 | 한국전자통신연구원 | Pre-processing method and device for clean speech feature estimation based on masking probability |
CN101237303A (en) * | 2007-01-30 | 2008-08-06 | 华为技术有限公司 | Data transmission method, system and transmitter, receiver |
US20150124987A1 (en) * | 2013-11-07 | 2015-05-07 | The Board Of Regents Of The University Of Texas System | Enhancement of reverberant speech by binary mask estimation |
CN107845389A (en) * | 2017-12-21 | 2018-03-27 | 北京工业大学 | A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks |
CN110120227A (en) * | 2019-04-26 | 2019-08-13 | 天津大学 | A kind of depth stacks the speech separating method of residual error network |
-
2019
- 2019-12-28 CN CN201911385421.XA patent/CN111128209B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20080049385A (en) * | 2006-11-30 | 2008-06-04 | 한국전자통신연구원 | Pre-processing method and device for clean speech feature estimation based on masking probability |
CN101237303A (en) * | 2007-01-30 | 2008-08-06 | 华为技术有限公司 | Data transmission method, system and transmitter, receiver |
US20150124987A1 (en) * | 2013-11-07 | 2015-05-07 | The Board Of Regents Of The University Of Texas System | Enhancement of reverberant speech by binary mask estimation |
CN107845389A (en) * | 2017-12-21 | 2018-03-27 | 北京工业大学 | A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks |
CN110120227A (en) * | 2019-04-26 | 2019-08-13 | 天津大学 | A kind of depth stacks the speech separating method of residual error network |
Non-Patent Citations (6)
Title |
---|
SHASHA XIA 等: "Using optimal ratio mask as training target for supervised speech separation", 《2017 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE 》 * |
YAN ZHAO 等: "Perceptually Guided Speech Enhancement Using Deep Neural Networks", 《 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING 》 * |
YUXUAN WANG 等: "On Training Targets for Supervised Speech Separation", 《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 》 * |
夏莎莎等: "基于优化浮值掩蔽的监督性语音分离", 《自动化学报》 * |
李如玮 等: "基于深度学习的听觉倒谱系数语音增强算法", 《华中科技大学学报(自然科学版)》 * |
梁山等: "基于噪声追踪的二值时频掩蔽到浮值掩蔽的泛化算法", 《声学学报》 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111583954A (en) * | 2020-05-12 | 2020-08-25 | 中国人民解放军国防科技大学 | Speaker independent single-channel voice separation method |
CN111653287A (en) * | 2020-06-04 | 2020-09-11 | 重庆邮电大学 | Single-channel speech enhancement algorithm based on DNN and in-band cross-correlation coefficient |
CN111899750A (en) * | 2020-07-29 | 2020-11-06 | 哈尔滨理工大学 | Speech enhancement algorithm combining cochlear speech features and hopping deep neural network |
CN111899750B (en) * | 2020-07-29 | 2022-06-14 | 哈尔滨理工大学 | Speech enhancement algorithm combining cochlear speech features and hopping deep neural network |
CN112562706A (en) * | 2020-11-30 | 2021-03-26 | 哈尔滨工程大学 | Target voice extraction method based on time potential domain specific speaker information |
CN112562706B (en) * | 2020-11-30 | 2023-05-05 | 哈尔滨工程大学 | Target voice extraction method based on time potential domain specific speaker information |
CN113257267A (en) * | 2021-05-31 | 2021-08-13 | 北京达佳互联信息技术有限公司 | Method for training interference signal elimination model and method and equipment for eliminating interference signal |
CN113470671A (en) * | 2021-06-28 | 2021-10-01 | 安徽大学 | Audio-visual voice enhancement method and system by fully utilizing visual and voice connection |
CN113470671B (en) * | 2021-06-28 | 2024-01-23 | 安徽大学 | Audio-visual voice enhancement method and system fully utilizing vision and voice connection |
CN114495957A (en) * | 2022-01-27 | 2022-05-13 | 安徽大学 | Method, system and device for speech enhancement based on Transformer improvement |
CN114495968A (en) * | 2022-03-30 | 2022-05-13 | 北京世纪好未来教育科技有限公司 | Voice processing method and device, electronic equipment and storage medium |
CN114495968B (en) * | 2022-03-30 | 2022-06-14 | 北京世纪好未来教育科技有限公司 | Voice processing method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111128209B (en) | 2022-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111128209B (en) | Speech enhancement method based on mixed masking learning target | |
CN108447495B (en) | Deep learning voice enhancement method based on comprehensive feature set | |
CN109524014A (en) | A kind of Application on Voiceprint Recognition analysis method based on depth convolutional neural networks | |
CN107633842A (en) | Audio recognition method, device, computer equipment and storage medium | |
CN108777146A (en) | Speech model training method, method for distinguishing speek person, device, equipment and medium | |
CN107331384A (en) | Audio recognition method, device, computer equipment and storage medium | |
CN111292762A (en) | Single-channel voice separation method based on deep learning | |
CN103345923A (en) | Sparse representation based short-voice speaker recognition method | |
CN101599271A (en) | A kind of recognition methods of digital music emotion | |
CN102664010B (en) | Robust speaker distinguishing method based on multifactor frequency displacement invariant feature | |
CN103117059A (en) | Voice signal characteristics extracting method based on tensor decomposition | |
CN108922515A (en) | Speech model training method, audio recognition method, device, equipment and medium | |
CN111192598A (en) | Voice enhancement method for jump connection deep neural network | |
CN106531174A (en) | Animal sound recognition method based on wavelet packet decomposition and spectrogram features | |
CN106024010A (en) | Speech signal dynamic characteristic extraction method based on formant curves | |
Li et al. | Sams-net: A sliced attention-based neural network for music source separation | |
CN113539293B (en) | Single-channel voice separation method based on convolutional neural network and joint optimization | |
CN109036470A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
CN108364641A (en) | A kind of speech emotional characteristic extraction method based on the estimation of long time frame ambient noise | |
Chao et al. | Cross-domain single-channel speech enhancement model with bi-projection fusion module for noise-robust ASR | |
CN110136746B (en) | Method for identifying mobile phone source in additive noise environment based on fusion features | |
Fan et al. | Deep attention fusion feature for speech separation with end-to-end post-filter method | |
CN111524520A (en) | Voiceprint recognition method based on error reverse propagation neural network | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
Wu et al. | A Characteristic of Speaker's Audio in the Model Space Based on Adaptive Frequency Scaling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |