WO2022103290A1 - Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems - Google Patents
- Publication number
- WO2022103290A1 (PCT/RU2020/000600)
- Authority
- WO
- WIPO (PCT)
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/60—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for measuring the quality of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
- G10L21/0216—Noise filtering characterised by the method used for estimating noise
- G10L2021/02161—Number of inputs available containing the signal or the noise to be suppressed
- G10L2021/02166—Microphone arrays; Beamforming
Definitions
- the present invention relates to the field of automatic evaluation of speech signal quality and, in particular, to a method for training neural networks to evaluate a signal-to-noise ratio, a reverberation time and a type of noise present in a recording, and to output an overall quality estimation as a function of these evaluations for a whole speech signal or a fragment thereof.
- the speech signal quality evaluation can be used in various speech processing applications, for example to automatically select the best microphone in a multimicrophone sound recording system.
- in voice biometrics, it can be used for determining the speech segments of the highest quality in recordings taken under various acoustic conditions, in order to develop a voice pattern of a speaker based on the selected fragments.
- a signal-to-noise ratio is defined as the ratio of signal power to noise power and can be represented by the following expression:

  $$\mathrm{SNR} = \frac{P_{signal}}{P_{noise}} = \left(\frac{A_{signal}}{A_{noise}}\right)^{2}$$

  where SNR is the signal-to-noise ratio, $P_{signal}$ is the signal mean power, $P_{noise}$ is the noise mean power, $A_{signal}$ is the signal root-mean-square amplitude, and $A_{noise}$ is the noise root-mean-square amplitude.
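The definition above can be checked numerically. The sketch below (function name and test signals are illustrative, not from the patent) computes SNR in decibels from the mean powers, i.e. the squared RMS amplitudes:

```python
import numpy as np

def snr_db(signal: np.ndarray, noise: np.ndarray) -> float:
    """SNR in decibels: ratio of mean signal power to mean noise power."""
    p_signal = np.mean(signal ** 2)  # mean power = squared RMS amplitude
    p_noise = np.mean(noise ** 2)
    return 10.0 * np.log10(p_signal / p_noise)

# A sine whose RMS amplitude is 10x that of the noise gives ~20 dB SNR.
t = np.linspace(0, 1, 16000, endpoint=False)
speech_like = np.sin(2 * np.pi * 440 * t)                  # RMS ~ 0.707
noise = np.random.default_rng(0).normal(0, 0.0707, 16000)  # RMS ~ 0.0707
print(round(snr_db(speech_like, noise)))  # ~20
```

A power ratio of 100 corresponds to 20 dB, matching the squared-amplitude form of the definition.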
- the reverberation time (RT) is considered the main parameter defining the acoustic environment of the area where the speech was recorded. In most cases it is determined as the time the sound pressure level takes to decrease by 60 dB (a factor of one million in power, or 1,000 in sound pressure). In the literature, the reverberation time defined in this way is usually referred to as RT60 or T60. There are established methods to determine RT60 from a known room impulse response, but in real-life scenarios involving sound recordings obtained from random sources, the impulse response is not available. Thus, the task of approximately evaluating the reverberation time from a given sound recording alone, without any additional data on the acoustic conditions, becomes relevant.
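One established method for the known-impulse-response case mentioned above is Schroeder backward integration; the sketch below is an assumption about such a method, not taken from the patent. It extrapolates the -5 dB to -25 dB portion of the energy decay curve to a full 60 dB decay:

```python
import numpy as np

def rt60_schroeder(h: np.ndarray, fs: int) -> float:
    """RT60 from a room impulse response: backward-integrate the squared
    RIR (Schroeder energy decay curve), measure the time the decay takes
    to go from -5 dB to -25 dB, and extrapolate that 20 dB span to 60 dB."""
    edc = np.cumsum(h[::-1] ** 2)[::-1]        # energy decay curve
    edc_db = 10.0 * np.log10(edc / edc[0])
    i5 = int(np.argmax(edc_db <= -5.0))        # first index at or below -5 dB
    i25 = int(np.argmax(edc_db <= -25.0))
    return 3.0 * (i25 - i5) / fs

# Synthetic exponentially decaying RIR built so that its energy falls
# 60 dB in exactly 0.5 s (amplitude constant from 3 * tau * ln(10) = RT60).
fs = 16000
t = np.arange(int(0.8 * fs)) / fs
h = np.exp(-t / (0.5 / (3.0 * np.log(10.0))))
print(round(rt60_schroeder(h, fs), 2))  # ~0.5
```

This only works when the impulse response is available, which is exactly the limitation the patent addresses for recordings from unknown sources.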
- known perceptual speech quality measures include the mean opinion score (MOS) estimation, perceptual evaluation of speech quality (PESQ) and perceptual objective listening quality assessment (POLQA).
- the reverberation time evaluation is usually reduced to traditional signal processing methods.
- US 9558757B1 provides determination of the rate of sound decay by generating an autocorrelogram of the signal intensity as a function of time.
- reverberation time values that exceed a certain threshold are determined.
- the disadvantages of this method include low sensitivity to signals with small reverberation time values, inapplicability to short speech fragments, and poor robustness to noisy data.
- a multichannel microphone-based reverberation time estimation method using deep neural networks is disclosed in US 20200082843A1. According to this method, signals obtained through a multichannel microphone are analyzed; however, applying it to a speech signal from a single microphone input is problematic.
- a method for determining characteristics, selecting and adapting training acoustic signals for an automatic speech recognition system comprises preparing training data that imitates target environment conditions, including noise and reverberation levels, which later can be used for training a deep neural network.
- the trained neural network can be later used for classifying speech data samples to simulate codecs corresponding to the speech data samples.
- this method does not allow, in particular, training a neural network to simultaneously predict or evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training data. Therefore, there is a need for methods of training neural networks to simultaneously evaluate input speech signal characteristics without using additional data on the acoustic environment in which the speech signal was recorded.
- a method for training a neural network to evaluate quality characteristics of an input speech signal comprising the steps of: preparing a set of training speech signals, wherein for each of the training speech signals the following is known: a signal-to-noise ratio, a reverberation time and a noise class selected from a predetermined plurality of noise classes; applying a voice activity detector to each of the training speech signals so as to extract training features from said training speech signal; and training the neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training features.
- the quality characteristics of the signal can be evaluated only based on a given input signal, without the need to compare with a reference or undistorted signal, and for evaluating the reverberation time, knowing an impulse response of a room where the speech was recorded is not required.
- preparing the set of training speech signals may comprise: providing a plurality of clear speech signals having minimum values of the signal-to-noise ratio and the reverberation time; providing a plurality of stationary noise signals of various types; providing a plurality of impulse responses corresponding to various rooms for which the reverberation time is known; performing a convolution of each clear speech signal with an impulse response from said plurality of impulse responses to generate a plurality of reverberated signals; combining the generated reverberated signals with the stationary noise signals of various types to generate a plurality of noise-distorted signals having varying signal-to-noise ratios; generating from the noise-distorted signals a final set of training speech signals balanced in terms of the signal-to-noise ratio, the reverberation time and the noise class; and calculating an integral quality estimation for each of the noise-distorted signals as a function of the signal-to-noise ratio and the noise class.
- in one embodiment, each of the noise signals is reverberated using an impulse response of the same room that was selected for reverberating the corresponding clear speech signal. In another embodiment, each of the noise signals is reverberated using a room impulse response different from the one selected for reverberating the corresponding clear speech signal.
- the method may comprise the step of using a regressive predictor model trained with a cost function based on the mean squared error to evaluate the signal-to-noise ratio, the reverberation time and the overall quality estimation.
- the noise class is evaluated using a classifier trained with the help of binary cross-entropy.
- a method for automatically selecting a channel in a multimicrophone system using a neural network trained on the training features comprises the steps of: receiving input speech signals from a plurality of channels of the multimicrophone system; applying a voice activity detector to each of the input speech signals so as to extract therefrom features characterizing the input speech signal and corresponding to the training features; providing the extracted features to an input of the neural network and evaluating them simultaneously; receiving from an output of the neural network, for each of the input speech signals, an evaluated signal-to-noise ratio, a reverberation time, an overall quality estimation and a predicted noise class; and selecting, from the plurality of channels of the multimicrophone system, the channel that has yielded an input speech signal whose evaluated values satisfy a predetermined condition.
- the predetermined condition is a maximum value of the overall quality estimation.
- Fig. 1 shows the sequence of operations for preparing the training data set.
- Fig. 2 is a flow-chart of generating speech signal quality estimations from an original audio recording with the help of a trained model.
- a method for training neural networks to evaluate quality characteristics of an input speech signal comprises preparing a set of training speech signals, wherein for each of the training speech signals the following is known: a signal-to-noise ratio, a reverberation time and a noise class selected from a predetermined plurality of noise classes.
- the method further comprises applying a voice activity detector to each of the training speech signals so as to extract training features from said training speech signal, and training the neural network to simultaneously evaluate a signal-to-noise ratio, a reverberation time, a noise class and an overall quality estimation for an input speech signal based on the extracted training features.
- a plurality of clear speech signals 101 having minimum values of the signal-to-noise ratio and the reverberation time, a plurality of stationary noise signals 102 of various classes, and a plurality of impulse responses 103 corresponding to various rooms with known reverberation time (T60) are taken.
- existing speech and noise databases can be used as a source of clear speech signals and of stationary noise signals.
- the required impulse responses can be generated using special utility software.
- a database of 79 noise classes, such as typing noise, rain noise, the hum of a crowd of people, manufacturing machinery, etc., was used as the plurality of stationary noise signals.
- a specifically generated impulse response database imitating 40,000 rooms of various sizes with the reverberation time of 0 - 2 seconds was used as the plurality of impulse responses, wherein for each room, 4 impulse responses were generated imitating various positions of the acoustic source inside that room.
- a convolution operation is performed on each clear speech signal with an impulse response of an arbitrarily selected room to generate a plurality of reverberated speech signals 104.
- several impulse responses can correspond to each room, depending on the position of the acoustic source in that room.
- each noise signal can be also reverberated 105.
- reverberation can be performed using an impulse response of the same room which was selected for reverberating a corresponding clear speech signal. If there are several impulse responses in a room, an impulse response can be used that differs from the one used for reverberating a clear speech signal. It enables to imitate various spatial positions of speech and noise sources and generate a more realistic database.
- each reverberated signal from a plurality of reverberated signals generated at the previous stage is combined with stationary noises of various types to result in a plurality of noised signals 106 with various signal-to-noise ratio values.
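The combination step can be sketched as scaling the reverberated noise so that the mixture reaches a requested global SNR before adding it to the reverberated speech. Function and variable names here are illustrative, and this uses whole-file power (the next item notes that the patent restricts speech power to VAD segments):

```python
import numpy as np

def mix_at_snr(rev_speech: np.ndarray, rev_noise: np.ndarray,
               target_snr_db: float) -> np.ndarray:
    """Add scaled reverberated noise to reverberated speech so that the
    mixture has the requested global signal-to-noise ratio."""
    p_speech = np.mean(rev_speech ** 2)
    p_noise = np.mean(rev_noise ** 2)
    # Gain bringing the noise power to p_speech / 10^(SNR/10).
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (target_snr_db / 10.0)))
    return rev_speech + gain * rev_noise

rng = np.random.default_rng(1)
speech = rng.normal(0.0, 1.0, 16000)   # stands in for a reverberated speech signal
noise = rng.normal(0.0, 0.3, 16000)    # stands in for a reverberated noise signal
mixed = mix_at_snr(speech, noise, target_snr_db=10.0)
```

Sweeping `target_snr_db` over a grid is one way to obtain the varying signal-to-noise ratios the plurality of noised signals requires.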
- the power of a speech signal is calculated only on speech segments, with pauses not taken into account; for this, a voice activity detector (VAD) is applied to the speech signal.
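Restricting the power computation to speech segments can be sketched as follows. The energy-threshold mask below is a hypothetical stand-in for a real voice activity detector, which the patent does not specify:

```python
import numpy as np

def speech_power(signal: np.ndarray, frame_len: int = 400) -> float:
    """Mean power over speech frames only, pauses excluded. A trained
    voice activity detector would supply the frame mask; this simple
    energy threshold is a placeholder, not the patent's VAD."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    energies = np.mean(frames ** 2, axis=1)
    voiced = energies > np.quantile(energies, 0.5)  # crude "speech" mask
    return float(np.mean(frames[voiced] ** 2))

# Half "speech", half silence: the VAD-masked power is 1.0, not the 0.5
# a whole-file average would give.
sig = np.concatenate([np.ones(4000), np.zeros(4000)])
print(speech_power(sig))  # 1.0
```

Without the mask, pauses would drag the speech power down and make the SNR look worse than it is on the voiced portions.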
- a final balanced training and test set 107 is formed from the prepared speech signals distorted by noise and reverberation.
- parameters SNR, RT60 and a noise class are known for each of these signals.
- an overall or integral quality estimation is calculated for each speech signal distorted by reverberation and noised as a certain function of distortion parameters SNR and RT60.
- the OQ is calculated using mathematical expressions in which $S_{SNR}$ is a speech segment SNR level estimation, $S_{RT60}$ is a speech segment reverberation level estimation, OQ is the integral speech segment quality estimation, $SNR_{dB}$ is the speech segment SNR value in decibels, and $RT60_{ms}$ is the speech segment reverberation time value in milliseconds.
- an existing data set that satisfies the balance condition in terms of the SNR and T60 ranges and in terms of noise classes can also be used.
- training features are then extracted from the prepared speech signals after applying a voice activity detector (VAD): mel-frequency cepstral coefficients (MFCC), filter bank features (FBANK) or other known features characterizing an audio signal.
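FBANK extraction can be sketched with a plain numpy STFT and triangular mel filters. The windowing, frame sizes and HTK-style filter construction below are illustrative choices, not the patent's; production systems use dedicated feature libraries:

```python
import numpy as np

def log_fbank(signal, fs=16000, n_fft=512, hop=160, n_mels=40):
    """Log mel filter-bank (FBANK) features: framed power spectrum passed
    through triangular mel filters, then log-compressed."""
    def hz_to_mel(f):
        return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # Frame the signal, apply a Hann window, take the power spectrum.
    n_frames = 1 + (len(signal) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(n_fft)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2

    # Triangular mel filters spanning 0 Hz to Nyquist.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / fs).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fb[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    return np.log(power @ fb.T + 1e-10)   # shape: (n_frames, n_mels)

feats = log_fbank(np.random.default_rng(0).normal(size=16000))
print(feats.shape)  # (97, 40)
```

MFCCs are then obtained by applying a discrete cosine transform to these log filter-bank outputs.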
- the extracted training features are then used to train a convolutional neural network in a multitask mode.
- the convolutional neural network is simultaneously trained to evaluate a signal-to-noise ratio (SNR), a reverberation time (RT60), a noise class and an overall quality estimation (OQ) on the input features generated at the previous stage. This is achieved by using four outputs in the neural network architecture and a combined loss function based on the sum of four cost functions with different weight coefficients.
- a regressive predictor model trained using a cost function based on a mean squared error is used for automatic SNR, RT60 and OQ evaluation.
- An automatic noise class estimation can be based on the use of a classifier trained using binary cross-entropy (BCE).
- the combined loss function is:

  $$L = 10 \cdot MSE(OQ) + 0.001 \cdot MSE(RT60_{ms}) + MSE(SNR_{dB}) + 10 \cdot BCE(\text{noise class})$$

  where L is the combined loss function, $MSE(OQ)$ is a mean-squared-error loss for the integral quality estimation, $MSE(RT60_{ms})$ is a mean-squared-error loss for the RT60 estimation, $MSE(SNR_{dB})$ is a mean-squared-error loss for the SNR estimation, and $BCE(\text{noise class})$ is a binary cross-entropy loss for the noise class.
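The weighted sum above can be sketched directly; numpy stands in for a deep-learning framework here (in real training the same expression would be computed on framework tensors so gradients reach all four heads), and the dictionary layout is an illustrative choice:

```python
import numpy as np

def mse(pred, target) -> float:
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

def bce(probs, onehot, eps: float = 1e-7) -> float:
    """Binary cross-entropy averaged over classes, with clipping for stability."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    y = np.asarray(onehot, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def combined_loss(pred: dict, target: dict) -> float:
    """L = 10*MSE(OQ) + 0.001*MSE(RT60_ms) + MSE(SNR_dB) + 10*BCE(noise class)."""
    return (10.0 * mse(pred["oq"], target["oq"])
            + 0.001 * mse(pred["rt60_ms"], target["rt60_ms"])
            + mse(pred["snr_db"], target["snr_db"])
            + 10.0 * bce(pred["noise_probs"], target["noise_onehot"]))

pred = {"oq": [0.6], "rt60_ms": [350.0], "snr_db": [14.0],
        "noise_probs": [0.8, 0.2]}
target = {"oq": [0.5], "rt60_ms": [300.0], "snr_db": [12.0],
          "noise_onehot": [1.0, 0.0]}
print(combined_loss(pred, target))
```

The small RT60 weight (0.001) compensates for that target being expressed in milliseconds, keeping the four terms on comparable scales.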
- since the developed non-linear speech signal quality prediction model should evaluate quality on short speech fragments (1 to 2 seconds), in one of the embodiments the model is also trained on short speech fragments.
- human speech and natural noises are not strictly stationary. This means that the global signal-to-noise ratio value obtained at the data preparation stage, which is a single value for a whole file, should be corrected for each short segment of that file.
- a local signal-to-noise ratio is calculated as

  $$SNR_{local} = 10 \log_{10} \frac{E_{speech}}{E_{noise}}$$

  where $E_{speech}$ is the energy of the reverberated speech signal before being noised, and $E_{noise}$ is the energy of the reverberated noise, both taken over the segment in question.
- the coefficients α and β for each signal are determined by solving a linear equation system over four signal fragments, where $X_{aug}^{(i)}$ is the i-th fragment of the augmented signal and its reverberated speech and noise parts enter the system with weights α and β, respectively.
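A per-fragment SNR along these lines can be sketched as follows; the fragment length and the use of α and β as plain gains on the speech and noise parts are assumptions for illustration:

```python
import numpy as np

def local_snr_db(rev_speech, rev_noise, alpha, beta, frag_len=16000):
    """SNR per short fragment of an augmented signal whose reverberated
    speech and noise parts are scaled by alpha and beta respectively."""
    n = min(len(rev_speech), len(rev_noise)) // frag_len
    snrs = []
    for i in range(n):
        seg = slice(i * frag_len, (i + 1) * frag_len)
        e_speech = np.sum((alpha * np.asarray(rev_speech[seg])) ** 2)
        e_noise = np.sum((beta * np.asarray(rev_noise[seg])) ** 2)
        snrs.append(10.0 * np.log10(e_speech / e_noise))
    return np.array(snrs)

# Stationary toy example: every fragment matches the global 20 dB value;
# with real, non-stationary speech the per-fragment values would differ.
s = np.ones(32000)
n = np.ones(32000)
print(local_snr_db(s, n, alpha=1.0, beta=0.1))  # [20. 20.]
```

For non-stationary material the returned array varies across fragments, which is exactly the correction to the single global SNR that the passage calls for.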
- a neural network architecture that can be used for evaluating speech signal quality characteristic is given below as a non-limiting example.
- the residual network ResNet18 comprises 8 ResNet blocks, each formed by two convolutional layers with 64 filters of size 3×3 and a skip connection bridging the two layers. This connection is implemented by element-wise summation of the block input and the output of the last layer of the block if their dimensions match, or by using a convolution operation to match the dimensions.
- the top level is formed by a global average pooling layer, the 512-dimensional output of which can be referred to as a quality vector (“quality embedding”). This vector is then provided to three linear layers: for predicting a signal-to-noise ratio (SNR), reverberation time (RT60) and a quality estimate (OQ).
- for noise class prediction, an additional two-layer classifier is used with a softmax activation function (or a modification thereof) and an output dimension equal to the number of noise classes (79 in this embodiment).
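The four-output head structure on top of the quality embedding can be sketched in numpy with random placeholder weights; a trained ResNet would supply both the 512-dimensional embedding and the learned weights:

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max()          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
emb = rng.normal(size=512)   # "quality embedding" after global average pooling

# One linear head per regression target, plus a 79-way noise classifier.
# All weights are random placeholders standing in for trained parameters.
w_snr, w_rt60, w_oq = (rng.normal(size=512) for _ in range(3))
W_cls = rng.normal(size=(79, 512))

snr_db = float(emb @ w_snr)        # signal-to-noise ratio head
rt60_ms = float(emb @ w_rt60)      # reverberation time head
oq = float(emb @ w_oq)             # overall quality head
noise_probs = softmax(W_cls @ emb) # noise class distribution over 79 classes
print(noise_probs.shape)  # (79,)
```

The shared embedding is what makes the training multitask: all four heads pull gradients through the same ResNet trunk.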
- the suggested method for training a neural network allows obtaining both an overall speech signal quality estimation and its specific acoustic characteristics (a signal-to-noise ratio and a reverberation time), which can be used both in voice biometrics applications and for selecting the best channel in multimicrophone systems according to a predetermined criterion.
- a method for automatically selecting a channel in a multimicrophone system is provided, which is implemented using the trained neural network described above.
- input speech signals are received from a plurality of channels of a multimicrophone system.
- a voice activity detector 202 is applied to each individual speech signal 201 so as to extract features characterizing this input speech signal and corresponding to the training features used for training the neural network, such as mel-frequency cepstral coefficients (MFCC), filter bank features (FBANK) or other known features characterizing an audio signal.
- the obtained features 203 characterizing the input speech signal are provided to a neural network input 204 and are evaluated simultaneously to obtain, at the neural network output for each input speech signal, evaluated values of a signal-to-noise ratio 205, a reverberation time 206, an overall quality estimation 207 and a predicted noise class 208.
- the channel whose input speech signal has evaluated values satisfying the predetermined condition is then selected from the plurality of channels of the multimicrophone system.
- a maximum value of an overall quality evaluation can be used as the predetermined condition.
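With the maximum-OQ condition, channel selection reduces to an argmax over the per-channel evaluations; the dictionary layout and values below are illustrative:

```python
def select_channel(channel_scores):
    """Pick the channel whose evaluated quality satisfies the predetermined
    condition; here, the maximum overall quality estimation (OQ)."""
    return max(channel_scores, key=lambda c: c["oq"])

# Hypothetical per-channel outputs of the trained neural network.
channels = [
    {"id": 0, "oq": 0.42, "snr_db": 8.1, "rt60_ms": 640.0},
    {"id": 1, "oq": 0.77, "snr_db": 17.3, "rt60_ms": 310.0},
    {"id": 2, "oq": 0.55, "snr_db": 12.0, "rt60_ms": 480.0},
]
print(select_channel(channels)["id"])  # 1
```

Other conditions, such as a maximum SNR or a minimum RT60, only change the key function.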
- a person skilled in the art will readily appreciate that other parameters known from the prior art can be used as the predetermined condition.
- the present invention is not limited to specific embodiments disclosed in the specification for illustration purposes and encompasses all possible modifications and alternatives at each step of implementation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/RU2020/000600 WO2022103290A1 (en) | 2020-11-12 | 2020-11-12 | Method for automatic quality evaluation of speech signals using neural networks for selecting a channel in multimicrophone systems |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022103290A1 true WO2022103290A1 (en) | 2022-05-19 |
Family
ID=76305976
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2022103290A1 (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2456296A (en) * | 2007-12-07 | 2009-07-15 | Hamid Sepehr | Audio enhancement and hearing protection by producing a noise reduced signal |
EP2238590A1 (en) | 2008-01-31 | 2010-10-13 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Apparatus and method for computing filter coefficients for echo suppression |
WO2015010983A1 (en) | 2013-07-22 | 2015-01-29 | Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V. | Method for processing an audio signal in accordance with a room impulse response, signal processing unit, audio encoder, audio decoder, and binaural renderer |
CN104581758A (en) | 2013-10-25 | 2015-04-29 | 中国移动通信集团广东有限公司 | Voice quality estimation method and device as well as electronic equipment |
US9396738B2 (en) | 2013-05-31 | 2016-07-19 | Sonus Networks, Inc. | Methods and apparatus for signal quality analysis |
US9558757B1 (en) | 2015-02-20 | 2017-01-31 | Amazon Technologies, Inc. | Selective de-reverberation using blind estimation of reverberation level |
US9922664B2 (en) | 2016-03-28 | 2018-03-20 | Nuance Communications, Inc. | Characterizing, selecting and adapting audio and acoustic training data for automatic speech recognition systems |
US9972339B1 (en) * | 2016-08-04 | 2018-05-15 | Amazon Technologies, Inc. | Neural network based beam selection |
CN108322346A (en) | 2018-02-09 | 2018-07-24 | 山西大学 | A kind of voice quality assessment method based on machine learning |
CN108346434A (en) | 2017-01-24 | 2018-07-31 | 中国移动通信集团安徽有限公司 | A kind of method and apparatus of speech quality evaluation |
EP3494575A1 (en) | 2016-08-09 | 2019-06-12 | Huawei Technologies Co., Ltd. | Devices and methods for evaluating speech quality |
US20200082843A1 (en) | 2016-12-15 | 2020-03-12 | Industry-University Cooperation Foundation Hanyang University | Multichannel microphone-based reverberation time estimation method and device which use deep neural network |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116564351A (en) * | 2023-04-03 | 2023-08-08 | 湖北经济学院 | Voice dialogue quality evaluation method and system and portable electronic equipment |
CN116564351B (en) * | 2023-04-03 | 2024-01-23 | 湖北经济学院 | Voice dialogue quality evaluation method and system and portable electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20897637 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20897637 Country of ref document: EP Kind code of ref document: A1 |
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 15/11/2023) |
|