CN111128209A - Speech enhancement method based on mixed masking learning target - Google Patents
- Publication number
- CN111128209A (application CN201911385421.XA)
- Authority
- CN
- China
- Prior art keywords
- dimensional
- masking
- learning target
- mixed
- voice signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 230000000873 masking effect Effects 0.000 title claims abstract description 44
- 238000000034 method Methods 0.000 title claims abstract description 27
- 238000001228 spectrum Methods 0.000 claims abstract description 29
- 238000012360 testing method Methods 0.000 claims abstract description 26
- 238000000605 extraction Methods 0.000 claims abstract description 8
- 238000013528 artificial neural network Methods 0.000 claims abstract description 6
- 239000013598 vector Substances 0.000 claims description 40
- 238000005070 sampling Methods 0.000 claims description 20
- 230000004913 activation Effects 0.000 claims description 8
- 230000006870 function Effects 0.000 claims description 8
- 238000010606 normalization Methods 0.000 claims description 6
- 230000008569 process Effects 0.000 claims description 5
- 230000037433 frameshift Effects 0.000 claims description 4
- 238000009432 framing Methods 0.000 claims description 4
- 210000002569 neuron Anatomy 0.000 claims description 4
- 238000000354 decomposition reaction Methods 0.000 claims description 2
- 230000009467 reduction Effects 0.000 claims description 2
- 230000001629 suppression Effects 0.000 claims description 2
- 230000009466 transformation Effects 0.000 claims description 2
- 230000002708 enhancing effect Effects 0.000 claims 1
- 238000004364 calculation method Methods 0.000 description 2
- 239000011159 matrix material Substances 0.000 description 2
- 238000013135 deep learning Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000014759 maintenance of location Effects 0.000 description 1
- 238000000926 separation method Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Multimedia (AREA)
- Acoustics & Sound (AREA)
- Signal Processing (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- Quality & Reliability (AREA)
- Software Systems (AREA)
- Artificial Intelligence (AREA)
- Theoretical Computer Science (AREA)
- Medical Informatics (AREA)
- Mathematical Physics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computing Systems (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Cable Transmission Systems, Equalization Of Radio And Reduction Of Echo (AREA)
Abstract
A speech enhancement method based on a mixed masking learning target: performing traditional feature extraction on the speech signals, which comprises dividing the acquired speech signals into a training set and a test set and extracting the traditional features of the training-set and test-set speech signals respectively; extracting the STFT-domain magnitude spectrum features of the training-set and test-set speech signals; constructing a deep stacked residual network; constructing a learning target; training the deep stacked residual network with the extracted traditional features of the training set, the STFT-domain magnitude spectrum features and the learning target; and feeding the extracted traditional features and STFT-domain magnitude spectrum features of the test set into the trained deep stacked residual network to obtain a predicted learning target, applying the ISTFT to the predicted learning target to obtain the enhanced speech signal, and computing its PESQ value. The invention retains no noise information in speech-dominated time-frequency units, reduces the amount of computation, and makes the neural network easy to train, thereby improving the intelligibility and quality of speech.
Description
Technical Field
The invention relates to mixed masking learning targets, and more particularly to a speech enhancement method based on a mixed masking learning target.
Background
At present there are many speech enhancement methods based on deep learning, and the key technology mainly involves three aspects: which features to extract, which model to adopt, and which target to learn. Like the features, the learning target is well worth studying: given the same training data, features and learning model, a better learning target allows the model to be trained better.
In a speech enhancement system using a supervised neural network, the learning target is generally computed from the background noise and the clean speech, and an effective learning target has an important influence on the learning ability of the speech enhancement model and the generalization of the system.
The speech enhancement learning targets in current use fall into two main categories: training targets based on time-frequency masking, and targets based on estimating the speech magnitude spectrum. The former reflect the energy relationship between the clean speech signal and the background noise in the mixed signal; the latter are the magnitude spectrum features of the clean target speech. Common time-frequency masking targets include the Ideal Binary Mask (IBM), the Ideal Ratio Mask (IRM, also called the ideal floating-value mask) and the Target Binary Mask (TBM). The ideal binary mask and the ideal ratio mask are the most commonly used learning targets, but each has drawbacks such as inaccurate prediction or poor generalization.
When the learning target is the IBM, the model only needs to classify each time-frequency unit as noise-dominated or target-speech-dominated (0 or 1), which retains noise information in the time-frequency units dominated by the target speech; these noise components seriously degrade the intelligibility and quality of the speech. When the learning target is the IRM, the model must predict a coefficient for every time-frequency unit; in noise-dominated units the extracted features cannot represent the target speech well, so it is difficult for the model to predict the coefficients of those units accurately.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a speech enhancement method based on a mixed masking learning target, which can improve the intelligibility and quality of speech.
The technical scheme adopted by the invention is as follows: a speech enhancement method based on a mixed masking learning target comprises the following steps:
1) performing traditional feature extraction on the speech signals, which comprises dividing the acquired speech signals into a training set and a test set and extracting the traditional features of the training-set and test-set speech signals respectively;
2) extracting the STFT-domain magnitude spectrum features of the training-set and test-set speech signals respectively;
3) constructing a deep stacked residual network;
4) constructing a learning target;
5) training the deep stacked residual network with the extracted traditional features of the training set, the STFT-domain magnitude spectrum features and the learning target;
6) feeding the extracted traditional features and STFT-domain magnitude spectrum features of the test set into the trained deep stacked residual network to obtain a predicted learning target, applying the ISTFT to the predicted learning target to obtain the enhanced speech signal, and computing the PESQ value of the speech signal.
The speech enhancement method based on the mixed masking learning target combines the advantages of the ideal binary masking and ideal floating-value masking learning targets. First, it ensures that no noise information is retained in speech-dominated time-frequency units. In noise-dominated time-frequency units the learning target is simply set to 0; although a small amount of speech information may be lost, the reduced data redundancy makes its performance better than that of the IRM learning target and lowers the amount of computation. Moreover, because the mixed mask contains time-frequency units equal to 0, the fitting ability and calculation accuracy of the neural network are further improved compared with the IRM learning target, and the network is easier to train, which improves the intelligibility and quality of the speech.
Detailed Description
The following describes a speech enhancement method based on a hybrid masking learning objective in detail with reference to the embodiments.
The invention discloses a voice enhancement method based on a mixed masking learning target, which comprises the following steps:
1) performing traditional feature extraction on the voice signals, wherein the traditional feature extraction comprises dividing the acquired voice signals into a training set and a test set, and respectively extracting traditional features of the voice signals of the training set and the test set;
the method comprises the following steps: randomly extracting 1500 sections of voice from a training part of a TIMIT corpus, randomly mixing the 1500 sections of voice with 9 kinds of noise extracted from a NOISEX-92 corpus, generating 1500 sections of mixed voice signals to form a training set under a continuously changing signal-to-noise ratio of-5 dB, randomly selecting 500 sections of pure voice from a testing part of the TIMIT corpus, randomly mixing the 500 sections of pure voice with 15 kinds of voice extracted from the NOISEX-92 corpus, and generating 500 sections of mixed voice signals to form a testing set under 10-8-6-4-2-0-2-4-6-8 dB different signal-to-noise ratios.
The traditional features of the training-set and test-set speech signals are extracted by the same process, which yields the following feature vectors:
(1) Perform a 512-point short-time Fourier transform on the mixed speech signal sampled at 16 kHz, extract a 31-dimensional MFCC feature vector using a Hamming window with a 20 ms frame length and a 10 ms frame shift, and compute the first-order derivative of the 31-dimensional MFCC feature vector;
(2) Perform full-wave rectification on the 16 kHz mixed speech signal to extract its envelope, downsample it to one quarter of the sampling rate, frame it with a Hamming window of 32 ms frame length and 10 ms frame shift, obtain a 15-dimensional AMS feature vector using 15 triangular windows whose centre frequencies are uniformly distributed over 15.6-400 Hz, and compute the first-order derivative of the 15-dimensional AMS feature vector;
(3) Decompose the 16 kHz mixed speech signal with a 64-channel Gammatone filter bank, sample each decomposed output at a rate of 100 Hz, apply cube-root amplitude compression to the sampled signals, extract a 64-dimensional Gammatone feature vector, and compute the first-order derivative of the 64-dimensional Gammatone feature vector;
(4) Convert the power spectrum of the 16 kHz mixed speech signal onto a 20-channel Bark scale using trapezoidal filters, apply equal-loudness pre-emphasis, then use the intensity-loudness law and a 12th-order linear prediction model to obtain a 13-dimensional PLP feature vector, and compute the first-order derivative of the 13-dimensional PLP feature vector;
Concatenate the 31-dimensional MFCC, 15-dimensional AMS, 64-dimensional Gammatone and 13-dimensional PLP feature vectors to obtain a 123-dimensional feature vector; concatenate their first-order derivatives to obtain another 123-dimensional feature vector; and concatenate the two 123-dimensional vectors to obtain a 246-dimensional feature vector;
Obtain the zero-crossing rate, root-mean-square energy and spectral centroid features of the 16 kHz mixed speech signal, combine them with the 246-dimensional feature vector to form a 269-dimensional feature vector, and feed the 269-dimensional feature vector into a fireworks-algorithm feature selector for dimensionality reduction, with the number of initial fireworks N = 400 and feature-subset dimensions M = 50, 70 and 90; a sketch of the derivative computation and feature concatenation appears after this list.
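A minimal numpy sketch of the first-order derivative (delta) computation and the 246-dimensional concatenation, assuming the four per-frame extractors of items (1)-(4) are implemented elsewhere (the helper names below are hypothetical); the zero-crossing-rate, RMS-energy and spectral-centroid features and the fireworks-algorithm selection are omitted.

```python
import numpy as np

def delta(feats, width=2):
    """Regression-based first-order derivative over time; `feats` is (frames, dims).
    The regression width of 2 frames is an illustrative choice, not specified in the patent."""
    T = len(feats)
    pad = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = sum(k * (pad[width + k:width + k + T] - pad[width - k:width - k + T])
              for k in range(1, width + 1))
    return num / (2 * sum(k * k for k in range(1, width + 1)))

mfcc = extract_mfcc(mixture)        # (T, 31)  hypothetical helper for item (1)
ams = extract_ams(mixture)          # (T, 15)  hypothetical helper for item (2)
gt = extract_gammatone(mixture)     # (T, 64)  hypothetical helper for item (3)
plp = extract_plp(mixture)          # (T, 13)  hypothetical helper for item (4)

base = np.concatenate([mfcc, ams, gt, plp], axis=1)                                # (T, 123)
derivs = np.concatenate([delta(mfcc), delta(ams), delta(gt), delta(plp)], axis=1)  # (T, 123)
traditional = np.concatenate([base, derivs], axis=1)                                # (T, 246)
```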
2) Respectively extracting the amplitude spectrum characteristics of the STFT domain of the voice signals of the training set and the test set;
the method for extracting the amplitude spectrum characteristics of the STFT domain of the speech signals of the training set and the test set is the same as that of the STFT domain of the speech signals of the training set and the test set, and comprises the following steps: the method comprises the steps of carrying out short-time Fourier transform on a mixed voice signal with a sampling rate of 16kHz, framing the mixed voice signal by adopting a Hamming window with a frame length of 25ms and a frame shift of 10ms in the transformation process, adding the amplitude spectrums of two frames adjacent to the left and the right of a single frame when the amplitude spectrum of each single frame corresponding to the traditional characteristics is input, wherein the total number of the frames is 5, the dimensionality of the amplitude spectrum of each frame is 200, and obtaining the amplitude spectrum characteristics of an STFT domain with the input dimensionality of 1000.
3) Constructing a deep stacked residual network; wherein
the deep stacked residual network comprises an input channel I, an input channel II, and a fully connected residual network module connected to the concatenated outputs of input channel I and input channel II, wherein
the input channel I consists of three convolutional layers and three normalization layers combined through residual connections; the convolution kernels are two-dimensional, the stride is set to 1, and zero padding is used; from top to bottom, the first convolutional layer has 1 x 1 kernels and 32 output channels, the second convolutional layer has 3 x 3 kernels and 32 output channels, and the third convolutional layer has 1 x 1 kernels and 64 output channels; all three convolutional layers use the ReLU activation function;
the input channel II consists of a normalization layer and a fully connected layer combined through a residual connection; the fully connected layer has 1024 neurons and uses the ReLU activation function;
the fully connected residual network module consists of a normalization layer and a fully connected layer with 4096 neurons; the fully connected layer uses the Sigmoid activation function.
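A rough PyTorch sketch of this two-channel network. The exact residual wiring, the input shapes, and which feature stream feeds which channel are not fully specified in the text, so the projection skip connections, the (1, 5, 200) spectral input for channel I and the 246-dimensional traditional-feature input for channel II are assumptions; how the 4096-dimensional sigmoid output maps onto the mask dimensions is likewise not detailed here.

```python
import torch
import torch.nn as nn

class ConvChannel(nn.Module):
    """Input channel I: 1x1 / 3x3 / 1x1 convolutions (32, 32, 64 output channels) with
    batch normalization and ReLU, plus an assumed 1x1 projection skip connection."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 32, kernel_size=1, stride=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, stride=1, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=1, stride=1), nn.BatchNorm2d(64), nn.ReLU(),
        )
        self.skip = nn.Conv2d(in_ch, 64, kernel_size=1)   # projection so the residual shapes match

    def forward(self, x):                  # x: (batch, 1, 5, 200) STFT context frames
        return self.body(x) + self.skip(x)

class FcChannel(nn.Module):
    """Input channel II: batch normalization and a 1024-unit ReLU fully connected layer
    with an assumed projection skip connection."""
    def __init__(self, in_dim=246):
        super().__init__()
        self.bn = nn.BatchNorm1d(in_dim)
        self.fc = nn.Linear(in_dim, 1024)
        self.skip = nn.Linear(in_dim, 1024)

    def forward(self, x):                  # x: (batch, 246) traditional features
        return torch.relu(self.fc(self.bn(x))) + self.skip(x)

class MixedMaskNet(nn.Module):
    """Concatenates the two channel outputs and applies the fully connected residual
    module described above: batch normalization plus a 4096-unit sigmoid layer."""
    def __init__(self, conv_in=1, fc_in=246):
        super().__init__()
        self.chan1, self.chan2 = ConvChannel(conv_in), FcChannel(fc_in)
        fused = 64 * 5 * 200 + 1024        # flattened conv output + FC-channel output
        self.head = nn.Sequential(nn.BatchNorm1d(fused), nn.Linear(fused, 4096), nn.Sigmoid())

    def forward(self, spec_img, trad_feats):
        a = self.chan1(spec_img).flatten(1)
        b = self.chan2(trad_feats)
        return self.head(torch.cat([a, b], dim=1))
```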
4) Constructing a learning target; the method comprises the following steps:
(1) respectively calculating an ideal binary masking learning target IBM and an ideal floating masking learning target IRM of a mixed voice signal of a training set by using the following formulas:
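(The formulas referenced here appear only as figures in the original publication; the expressions below are a reconstruction using the standard IBM and IRM definitions, consistent with the variables explained next, and should be read as an editorial aid rather than the verbatim formulas.)

SNR(m, f) = 10 · log10( S(m, f)² / N(m, f)² )

IBM(m, f) = 1 if SNR(m, f) > LC, otherwise 0

IRM(m, f) = S(m, f)² / ( S(m, f)² + N(m, f)² )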
where LC is set to 20 dB; SNR(m, f) is the local signal-to-noise ratio of the time-frequency unit at time frame m and frequency f, where f ranges from 80 Hz to 5000 Hz; S(m, f)² and N(m, f)² respectively denote the speech energy and the noise energy at the m-th time frame and frequency f;
IBM is a binary time-frequency masking matrix, calculated from clean speech and noise. For each time-frequency unit, if the local signal-to-noise ratio SNR (m, f) is greater than a certain local threshold, the corresponding element in the masking matrix is marked as 1, otherwise, it is marked as 0. IRM is a widely used training target in supervised learning speech separation.
(2) In order to combine the advantages of both masks, the present invention proposes a learning target based on the Mixed Mask (MM). In noise-dominated time-frequency units its mask value agrees with the IBM, i.e. it equals 0; in time-frequency units dominated by the target speech it agrees with the IRM. Specifically, the ideal binary masking learning target IBM and the ideal floating-value masking learning target IRM are multiplied element-wise (point multiplication) to obtain the mixed masking learning target MM, which forms the final learning target:
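(The mixed-mask formula also appears only as a figure in the original publication; the element-wise product below is a reconstruction consistent with the point multiplication described above.)

MM(m, f) = IBM(m, f) · IRM(m, f), i.e. MM = [ y_{i,j} · x_{i,j} ] for i = 1 … m, j = 1 … n,

so the mixed mask equals the IRM value in target-speech-dominated units (where IBM = 1) and 0 in noise-dominated units (where IBM = 0).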
where x_{1,1}, …, x_{m,n} respectively denote the ideal floating-value mask values in each time-frequency unit of a segment of mixed speech signal; x_{1,1}, …, x_{m,1} respectively denote the ideal floating-value masks of the first frame of the mixed speech signal; y_{1,1}, …, y_{m,n} respectively denote the ideal binary mask values in each time-frequency unit of a segment of mixed speech signal; y_{1,1}, …, y_{m,1} respectively denote the ideal binary masks of the first frame of the mixed speech signal; and x_{1,1}·y_{1,1}, …, x_{m,n}·y_{m,n} respectively denote the ideal mixed mask values in each time-frequency unit of a segment of mixed speech signal.
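A minimal numpy sketch of this target construction, assuming clean-speech and noise magnitude spectrograms S and N (time frames by frequency bins) are available from the training mixtures; the small eps constant is only for numerical safety.

```python
import numpy as np

def mixed_mask(S, N, lc_db=20.0, eps=1e-12):
    """Mixed mask MM = IBM * IRM; LC = 20 dB as stated in the text."""
    s2, n2 = S ** 2, N ** 2
    snr_db = 10.0 * np.log10((s2 + eps) / (n2 + eps))   # local SNR per time-frequency unit
    ibm = (snr_db > lc_db).astype(S.dtype)              # ideal binary mask (0 or 1)
    irm = s2 / (s2 + n2 + eps)                          # ideal floating-value (ratio) mask
    return ibm * irm                                    # IRM in speech-dominated units, 0 elsewhere
```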
5) Train the deep stacked residual network with the extracted traditional features of the training set, the STFT-domain magnitude spectrum features and the learning target;
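A minimal training-loop sketch under the assumptions of the network sketch in step 3); the optimiser, learning rate, loss function and the `train_loader` batching are not specified in the patent and are illustrative choices.

```python
import torch

net = MixedMaskNet()                                   # from the sketch in step 3)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)      # optimiser and learning rate assumed
loss_fn = torch.nn.MSELoss()                           # loss function assumed

for spec_img, trad_feats, target_mm in train_loader:   # hypothetical DataLoader of training batches
    opt.zero_grad()
    pred_mm = net(spec_img, trad_feats)                 # predicted mixed-mask learning target
    loss = loss_fn(pred_mm, target_mm)                  # target_mm shaped like the network output
    loss.backward()
    opt.step()
```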
6) Feed the extracted traditional features and STFT-domain magnitude spectrum features of the test set into the trained deep stacked residual network to obtain the predicted learning target, apply the ISTFT to the predicted learning target to obtain the enhanced speech signal, and compute the PESQ value of the speech signal.
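A sketch of this enhancement and scoring step: the predicted mask is applied to the noisy STFT (keeping the noisy phase), the ISTFT reconstructs the waveform, and PESQ is computed with the third-party `pesq` package, whose availability is assumed; `mixture`, `predicted_mask` and `clean_reference` are placeholders.

```python
import numpy as np
from scipy.signal import stft, istft

fs = 16000
# 25 ms frames (400 samples) with a 10 ms hop (160 samples), matching the feature extraction above.
f, t, Zxx = stft(mixture, fs=fs, window="hamming", nperseg=400, noverlap=240)
# `predicted_mask` is assumed reshaped to Zxx's (frequency, time) layout.
masked = predicted_mask * np.abs(Zxx) * np.exp(1j * np.angle(Zxx))   # keep the noisy phase
_, enhanced = istft(masked, fs=fs, window="hamming", nperseg=400, noverlap=240)

from pesq import pesq                                   # third-party package, assumed installed
score = pesq(fs, clean_reference, enhanced[:len(clean_reference)], "wb")   # wideband PESQ
```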
With the speech enhancement method based on the mixed masking learning target, the PESQ index improves markedly: the quality of the enhanced speech is 1.6% higher than with the ideal floating-value masking learning target, as shown in Table 1.
TABLE 1 PESQ values for speech signals of two learning objectives
Claims (6)
1. A speech enhancement method based on a hybrid masking learning objective is characterized by comprising the following steps:
1) performing traditional feature extraction on the speech signals, which comprises dividing the acquired speech signals into a training set and a test set and extracting the traditional features of the training-set and test-set speech signals respectively;
2) extracting the STFT-domain magnitude spectrum features of the training-set and test-set speech signals respectively;
3) constructing a deep stacked residual network;
4) constructing a learning target;
5) training the deep stacked residual network with the extracted traditional features of the training set, the STFT-domain magnitude spectrum features and the learning target;
6) feeding the extracted traditional features and STFT-domain magnitude spectrum features of the test set into the trained deep stacked residual network to obtain a predicted learning target, applying the ISTFT to the predicted learning target to obtain the enhanced speech signal, and computing the PESQ value of the speech signal.
2. The method for enhancing speech based on the hybrid masking learning objective as claimed in claim 1, wherein step 1) comprises: randomly extracting 1500 speech segments from the training part of the TIMIT corpus and randomly mixing them with 9 kinds of noise taken from the NOISEX-92 corpus at a continuously varying signal-to-noise ratio of -5 dB to generate 1500 mixed speech signals that form the training set; and randomly selecting 500 clean speech segments from the test part of the TIMIT corpus and randomly mixing them with 15 kinds of noise taken from the NOISEX-92 corpus at different signal-to-noise ratios of 10, 8, 6, 4, 2, 0, -2, -4, -6 and -8 dB to generate 500 mixed speech signals that form the test set.
3. The method according to claim 2, wherein the conventional feature processes for extracting the speech signals of the training set and the test set in step 1) are the same, and each process includes obtaining the following different feature vectors:
(1) performing a 512-point short-time Fourier transform on the mixed speech signal sampled at 16 kHz, extracting a 31-dimensional MFCC feature vector using a Hamming window with a 20 ms frame length and a 10 ms frame shift, and computing the first-order derivative of the 31-dimensional MFCC feature vector;
(2) performing full-wave rectification on the 16 kHz mixed speech signal to extract its envelope, downsampling it to one quarter of the sampling rate, framing it with a Hamming window of 32 ms frame length and 10 ms frame shift, obtaining a 15-dimensional AMS feature vector using 15 triangular windows whose centre frequencies are uniformly distributed over 15.6-400 Hz, and computing the first-order derivative of the 15-dimensional AMS feature vector;
(3) decomposing the 16 kHz mixed speech signal with a 64-channel Gammatone filter bank, sampling each decomposed output at a rate of 100 Hz, applying cube-root amplitude compression to the sampled signals, extracting a 64-dimensional Gammatone feature vector, and computing the first-order derivative of the 64-dimensional Gammatone feature vector;
(4) converting the power spectrum of the 16 kHz mixed speech signal onto a 20-channel Bark scale using trapezoidal filters, applying equal-loudness pre-emphasis, then using the intensity-loudness law and a 12th-order linear prediction model to obtain a 13-dimensional PLP feature vector, and computing the first-order derivative of the 13-dimensional PLP feature vector;
concatenating the 31-dimensional MFCC, 15-dimensional AMS, 64-dimensional Gammatone and 13-dimensional PLP feature vectors to obtain a 123-dimensional feature vector, concatenating their first-order derivatives to obtain another 123-dimensional feature vector, and concatenating the two 123-dimensional vectors to obtain a 246-dimensional feature vector;
obtaining the zero-crossing rate, root-mean-square energy and spectral centroid features of the 16 kHz mixed speech signal, combining them with the 246-dimensional feature vector to form a 269-dimensional feature vector, and feeding the 269-dimensional feature vector into a fireworks-algorithm feature selector for dimensionality reduction, with the number of initial fireworks N = 400 and feature-subset dimensions M = 50, 70 and 90.
4. The method of claim 1, wherein in step 2) the STFT-domain magnitude spectrum features of the training-set and test-set speech signals are extracted by the same process, comprising: performing a short-time Fourier transform on the mixed speech signal sampled at 16 kHz, framing it during the transform with a Hamming window of 25 ms frame length and 10 ms frame shift; when the single-frame magnitude spectrum corresponding to each frame of the traditional features is used as input, appending the magnitude spectra of the two frames to its left and the two frames to its right, 5 frames in total; and, with a 200-dimensional magnitude spectrum per frame, obtaining STFT-domain magnitude spectrum features with an input dimension of 1000.
5. The method according to claim 1, wherein the deep stacked residual network of step 3) comprises an input channel I, an input channel II, and a fully connected residual network module connected to the concatenated outputs of input channel I and input channel II, wherein
the input channel I consists of three convolutional layers and three normalization layers combined through residual connections; the convolution kernels are two-dimensional, the stride is set to 1, and zero padding is used; from top to bottom, the first convolutional layer has 1 x 1 kernels and 32 output channels, the second convolutional layer has 3 x 3 kernels and 32 output channels, and the third convolutional layer has 1 x 1 kernels and 64 output channels; all three convolutional layers use the ReLU activation function;
the input channel II consists of a normalization layer and a fully connected layer combined through a residual connection; the fully connected layer has 1024 neurons and uses the ReLU activation function;
the fully connected residual network module consists of a normalization layer and a fully connected layer with 4096 neurons; the fully connected layer uses the Sigmoid activation function.
6. The method of claim 1, wherein the step 4) of constructing the learning objective comprises:
(1) respectively calculating an ideal binary masking learning target IBM and an ideal floating masking learning target IRM of a mixed voice signal of a training set by using the following formulas:
where LC is set to 20 dB; SNR(m, f) is the local signal-to-noise ratio of the time-frequency unit at time frame m and frequency f, where f ranges from 80 Hz to 5000 Hz; S(m, f)² and N(m, f)² respectively denote the speech energy and the noise energy at the m-th time frame and frequency f;
(2) performing point multiplication on the ideal binary masking learning target IBM and the ideal floating value masking learning target IRM to obtain a mixed masking learning target MM, and forming a final learning target:
where x_{1,1}, …, x_{m,n} respectively denote the ideal floating-value mask values in each time-frequency unit of a segment of mixed speech signal; x_{1,1}, …, x_{m,1} respectively denote the ideal floating-value masks of the first frame of the mixed speech signal; y_{1,1}, …, y_{m,n} respectively denote the ideal binary mask values in each time-frequency unit of a segment of mixed speech signal; y_{1,1}, …, y_{m,1} respectively denote the ideal binary masks of the first frame of the mixed speech signal; and x_{1,1}·y_{1,1}, …, x_{m,n}·y_{m,n} respectively denote the ideal mixed mask values in each time-frequency unit of a segment of mixed speech signal.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911385421.XA CN111128209B (en) | 2019-12-28 | 2019-12-28 | Speech enhancement method based on mixed masking learning target |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911385421.XA CN111128209B (en) | 2019-12-28 | 2019-12-28 | Speech enhancement method based on mixed masking learning target |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111128209A true CN111128209A (en) | 2020-05-08 |
CN111128209B CN111128209B (en) | 2022-05-10 |
Family
ID=70504227
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911385421.XA Active CN111128209B (en) | 2019-12-28 | 2019-12-28 | Speech enhancement method based on mixed masking learning target |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111128209B (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111583954A (en) * | 2020-05-12 | 2020-08-25 | 中国人民解放军国防科技大学 | Speaker independent single-channel voice separation method |
CN111653287A (en) * | 2020-06-04 | 2020-09-11 | 重庆邮电大学 | Single-channel speech enhancement algorithm based on DNN and in-band cross-correlation coefficient |
CN111899750A (en) * | 2020-07-29 | 2020-11-06 | 哈尔滨理工大学 | Speech enhancement algorithm combining cochlear speech features and hopping deep neural network |
CN112562706A (en) * | 2020-11-30 | 2021-03-26 | 哈尔滨工程大学 | Target voice extraction method based on time potential domain specific speaker information |
CN113257267A (en) * | 2021-05-31 | 2021-08-13 | 北京达佳互联信息技术有限公司 | Method for training interference signal elimination model and method and equipment for eliminating interference signal |
CN113470671A (en) * | 2021-06-28 | 2021-10-01 | 安徽大学 | Audio-visual voice enhancement method and system by fully utilizing visual and voice connection |
CN114495957A (en) * | 2022-01-27 | 2022-05-13 | 安徽大学 | Method, system and device for speech enhancement based on Transformer improvement |
CN114495968A (en) * | 2022-03-30 | 2022-05-13 | 北京世纪好未来教育科技有限公司 | Voice processing method and device, electronic equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20080049385A (en) * | 2006-11-30 | 2008-06-04 | 한국전자통신연구원 | Pre-processing method and device for clean speech feature estimation based on masking probability |
CN101237303A (en) * | 2007-01-30 | 2008-08-06 | 华为技术有限公司 | Data transmission method, system and transmitter, receiver |
US20150124987A1 (en) * | 2013-11-07 | 2015-05-07 | The Board Of Regents Of The University Of Texas System | Enhancement of reverberant speech by binary mask estimation |
CN107845389A (en) * | 2017-12-21 | 2018-03-27 | 北京工业大学 | A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks |
CN110120227A (en) * | 2019-04-26 | 2019-08-13 | 天津大学 | A kind of depth stacks the speech separating method of residual error network |
-
2019
- 2019-12-28 CN CN201911385421.XA patent/CN111128209B/en active Active
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20080049385A (en) * | 2006-11-30 | 2008-06-04 | 한국전자통신연구원 | Pre-processing method and device for clean speech feature estimation based on masking probability |
CN101237303A (en) * | 2007-01-30 | 2008-08-06 | 华为技术有限公司 | Data transmission method, system and transmitter, receiver |
US20150124987A1 (en) * | 2013-11-07 | 2015-05-07 | The Board Of Regents Of The University Of Texas System | Enhancement of reverberant speech by binary mask estimation |
CN107845389A (en) * | 2017-12-21 | 2018-03-27 | 北京工业大学 | A kind of sound enhancement method based on multiresolution sense of hearing cepstrum coefficient and depth convolutional neural networks |
CN110120227A (en) * | 2019-04-26 | 2019-08-13 | 天津大学 | A kind of depth stacks the speech separating method of residual error network |
Non-Patent Citations (6)
Title |
---|
SHASHA XIA 等: "Using optimal ratio mask as training target for supervised speech separation", 《2017 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE 》 * |
YAN ZHAO 等: "Perceptually Guided Speech Enhancement Using Deep Neural Networks", 《 2018 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING 》 * |
YUXUAN WANG 等: "On Training Targets for Supervised Speech Separation", 《IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING 》 * |
夏莎莎等: "基于优化浮值掩蔽的监督性语音分离", 《自动化学报》 * |
李如玮 等: "基于深度学习的听觉倒谱系数语音增强算法", 《华中科技大学学报(自然科学版)》 * |
梁山等: "基于噪声追踪的二值时频掩蔽到浮值掩蔽的泛化算法", 《声学学报》 * |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111583954A (en) * | 2020-05-12 | 2020-08-25 | 中国人民解放军国防科技大学 | Speaker independent single-channel voice separation method |
CN111653287A (en) * | 2020-06-04 | 2020-09-11 | 重庆邮电大学 | Single-channel speech enhancement algorithm based on DNN and in-band cross-correlation coefficient |
CN111899750A (en) * | 2020-07-29 | 2020-11-06 | 哈尔滨理工大学 | Speech enhancement algorithm combining cochlear speech features and hopping deep neural network |
CN111899750B (en) * | 2020-07-29 | 2022-06-14 | 哈尔滨理工大学 | Speech enhancement algorithm combining cochlear speech features and hopping deep neural network |
CN112562706A (en) * | 2020-11-30 | 2021-03-26 | 哈尔滨工程大学 | Target voice extraction method based on time potential domain specific speaker information |
CN112562706B (en) * | 2020-11-30 | 2023-05-05 | 哈尔滨工程大学 | Target voice extraction method based on time potential domain specific speaker information |
CN113257267A (en) * | 2021-05-31 | 2021-08-13 | 北京达佳互联信息技术有限公司 | Method for training interference signal elimination model and method and equipment for eliminating interference signal |
CN113470671A (en) * | 2021-06-28 | 2021-10-01 | 安徽大学 | Audio-visual voice enhancement method and system by fully utilizing visual and voice connection |
CN113470671B (en) * | 2021-06-28 | 2024-01-23 | 安徽大学 | Audio-visual voice enhancement method and system fully utilizing vision and voice connection |
CN114495957A (en) * | 2022-01-27 | 2022-05-13 | 安徽大学 | Method, system and device for speech enhancement based on Transformer improvement |
CN114495968A (en) * | 2022-03-30 | 2022-05-13 | 北京世纪好未来教育科技有限公司 | Voice processing method and device, electronic equipment and storage medium |
CN114495968B (en) * | 2022-03-30 | 2022-06-14 | 北京世纪好未来教育科技有限公司 | Voice processing method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111128209B (en) | 2022-05-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111128209B (en) | Speech enhancement method based on mixed masking learning target | |
CN108447495B (en) | Deep learning voice enhancement method based on comprehensive feature set | |
CN109524014A (en) | A kind of Application on Voiceprint Recognition analysis method based on depth convolutional neural networks | |
CN107633842A (en) | Audio recognition method, device, computer equipment and storage medium | |
CN108777146A (en) | Speech model training method, method for distinguishing speek person, device, equipment and medium | |
CN107331384A (en) | Audio recognition method, device, computer equipment and storage medium | |
CN111292762A (en) | Single-channel voice separation method based on deep learning | |
CN103345923A (en) | Sparse representation based short-voice speaker recognition method | |
CN101599271A (en) | A kind of recognition methods of digital music emotion | |
CN102664010B (en) | Robust speaker distinguishing method based on multifactor frequency displacement invariant feature | |
CN103117059A (en) | Voice signal characteristics extracting method based on tensor decomposition | |
CN108922515A (en) | Speech model training method, audio recognition method, device, equipment and medium | |
CN111192598A (en) | Voice enhancement method for jump connection deep neural network | |
CN106531174A (en) | Animal sound recognition method based on wavelet packet decomposition and spectrogram features | |
CN106024010A (en) | Speech signal dynamic characteristic extraction method based on formant curves | |
Li et al. | Sams-net: A sliced attention-based neural network for music source separation | |
CN113539293B (en) | Single-channel voice separation method based on convolutional neural network and joint optimization | |
CN109036470A (en) | Speech differentiation method, apparatus, computer equipment and storage medium | |
CN108364641A (en) | A kind of speech emotional characteristic extraction method based on the estimation of long time frame ambient noise | |
Chao et al. | Cross-domain single-channel speech enhancement model with bi-projection fusion module for noise-robust ASR | |
CN110136746B (en) | Method for identifying mobile phone source in additive noise environment based on fusion features | |
Fan et al. | Deep attention fusion feature for speech separation with end-to-end post-filter method | |
CN111524520A (en) | Voiceprint recognition method based on error reverse propagation neural network | |
CN111785262B (en) | Speaker age and gender classification method based on residual error network and fusion characteristics | |
Wu et al. | A Characteristic of Speaker's Audio in the Model Space Based on Adaptive Frequency Scaling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |