A Hybrid Speech Enhancement Algorithm for Voice Assistance Application
Figure 1. Proposed idea.
Figure 2. Iterative Signal Enhancement Algorithm.
Figure 3. Subspace-based speech enhancement.
Figure 4. Nonlinear spectral subtraction.
Figure 5. Working model of the Hidden Markov Model.
Figure 6. Proposed hybridized algorithm for speech enhancement.
Figure 7. (a,b) Time waveform and spectrogram of a speech signal using ISE.
Figure 8. (a,b) Time waveform and spectrogram of a speech signal using subspace enhancement.
Figure 9. (a,b) Time waveform and spectrogram of a speech signal using NSS.
Figure 10. WER comparison of typical SE algorithms.
Figure 11. Performance of HSEA for medical speech.
Figure 12. Performance of HSEA for RAVDESS.
Abstract
1. Introduction
2. Proposed Methodology
2.1. Speech Enhancement Algorithms
2.1.1. Iterative Signal Enhancement Algorithm (ISE)
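This excerpt carries no body text for the ISE section; the method follows the iterative signal enhancement of Doclo, Dologlou, and Moonen (see References). As a generic stand-in rather than the authors' exact procedure, the sketch below (Python/NumPy; the iteration count, frame length, and per-bin noise estimate `noise_psd` are assumptions, and only a single frame is processed for brevity) iterates a Wiener-style gain, re-estimating the clean-speech power from the previous pass:

```python
import numpy as np

def iterative_enhance(noisy, noise_psd, iters=3, frame_len=512):
    """Generic iterative Wiener-style enhancement (stand-in for ISE).

    noise_psd: per-bin noise power, frame_len // 2 + 1 values.
    Single frame only; a real system loops with overlap-add.
    """
    spec = np.fft.rfft(noisy[:frame_len] * np.hanning(frame_len))
    noisy_power = np.abs(spec) ** 2

    # Initial clean-power estimate: plain power subtraction.
    clean_power = np.maximum(noisy_power - noise_psd, 0.0)

    gain = np.ones_like(noisy_power)
    for _ in range(iters):
        # Wiener gain from the current clean-power estimate ...
        gain = clean_power / np.maximum(clean_power + noise_psd, 1e-12)
        # ... then refine the clean-power estimate for the next pass.
        clean_power = (gain ** 2) * noisy_power

    # Apply the final gain, keep the noisy phase, and invert.
    return np.fft.irfft(gain * spec, frame_len)
```

Each pass tightens the gain because the clean-power estimate it is built from improves; in practice a few iterations suffice before the estimate stops changing appreciably.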
2.1.2. Subspace-Based Speech Enhancement
- (1) Isolating the signal and noise subspaces from the original subspace (the noise-mixed speech).
- (2) Eliminating the noise-only subspace isolated in step 1, as sketched in the code after this list:
- By nullifying the components in the noise subspace, the enhanced speech is constrained to inhabit only the signal subspace.
- Changing (decreasing) the eigenvalues of the signal subspace, which attenuates the noise that leaks into it.
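A minimal sketch of these steps, assuming non-overlapping frames and a fixed noise-power estimate `noise_var` (both illustrative choices, not values from the paper):

```python
import numpy as np

def subspace_enhance(noisy, frame_len=256, noise_var=1e-3):
    """KLT-style subspace enhancement sketch."""
    # Cut the signal into non-overlapping frames (rows of a data matrix).
    n = (len(noisy) // frame_len) * frame_len
    frames = noisy[:n].reshape(-1, frame_len)

    # Empirical covariance of the noisy frames.
    cov = frames.T @ frames / frames.shape[0]

    # Eigen-decomposition (the KLT basis): eigenvalues above the noise
    # floor span the signal subspace, the rest the noise-only subspace.
    eigvals, eigvecs = np.linalg.eigh(cov)

    # Null the noise subspace (gain 0) and shrink (decrease) the
    # signal-subspace eigenvalues with a Wiener-type gain.
    gains = np.maximum(eigvals - noise_var, 0.0) / np.maximum(eigvals, 1e-12)

    # Transform each frame to the KLT domain, apply the gains, invert.
    enhanced = ((frames @ eigvecs) * gains) @ eigvecs.T
    return enhanced.reshape(-1)
```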
2.1.3. Nonlinear Spectral Subtraction (NSS)
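Nonlinear spectral subtraction extends basic spectral subtraction [2] by letting the amount of subtracted noise vary with the local SNR of each frequency bin. The sketch below is a common formulation, not necessarily the paper's exact rule; the over-subtraction curve, the spectral floor, and the per-bin noise estimate `noise_psd` are assumptions:

```python
import numpy as np

def nss_enhance(noisy, noise_psd, frame_len=512, hop=256, floor=0.02):
    """Nonlinear spectral subtraction sketch.

    noise_psd: per-bin noise power (frame_len // 2 + 1 values),
    e.g. averaged over speech-free frames.
    """
    window = np.hanning(frame_len)
    out = np.zeros(len(noisy))
    for start in range(0, len(noisy) - frame_len + 1, hop):
        spec = np.fft.rfft(noisy[start:start + frame_len] * window)
        power = np.abs(spec) ** 2

        # Nonlinear over-subtraction: subtract aggressively in low-SNR
        # bins, gently in high-SNR bins.
        snr_db = 10 * np.log10(np.maximum(power, 1e-12) / np.maximum(noise_psd, 1e-12))
        alpha = np.clip(4.0 - 0.15 * snr_db, 1.0, 6.0)

        # Subtract, then clamp to a spectral floor to limit musical noise.
        clean_power = np.maximum(power - alpha * noise_psd, floor * power)

        # Keep the noisy phase, invert, and overlap-add (a Hann window
        # at 50% overlap sums to roughly unity).
        clean_spec = np.sqrt(clean_power) * np.exp(1j * np.angle(spec))
        out[start:start + frame_len] += np.fft.irfft(clean_spec, frame_len)
    return out
```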
2.2. Speech-to-Text Conversion
Hidden Markov Model
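An HMM recognizer scores word hypotheses by the probability of hidden-state paths, and decoding selects the most likely path with the Viterbi algorithm. The following is a textbook sketch over discrete observations, not the paper's recognizer:

```python
import numpy as np

def viterbi(log_trans, log_emit, log_init, obs):
    """Most likely hidden-state path for an observation sequence.

    log_trans[i, j]: log P(state j | state i)
    log_emit[j, o]:  log P(observation o | state j)
    log_init[j]:     log P(initial state j)
    """
    n_states = log_trans.shape[0]
    T = len(obs)
    score = np.full((T, n_states), -np.inf)
    back = np.zeros((T, n_states), dtype=int)

    score[0] = log_init + log_emit[:, obs[0]]
    for t in range(1, T):
        # Best previous state for each current state.
        cand = score[t - 1][:, None] + log_trans
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(n_states)] + log_emit[:, obs[t]]

    # Trace back the best path from the best final state.
    path = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```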
3. Hybrid Speech Enhancement Algorithm (HSEA)
Algorithm 1: Hybrid Speech Enhancement Algorithm
Input: y(i), the noisy speech signal, where n(i) is the noise and s(i) the clean speech signal.
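The body of Algorithm 1 is not reproduced in this excerpt, so the composition rule below is an illustrative assumption only: it simply chains enhancement stages, each refining the previous estimate.

```python
def hsea(y, stages):
    """Hypothetical hybrid wrapper (assumed composition, not the
    paper's published procedure).

    y: noisy speech samples, y(i) = s(i) + n(i).
    stages: single-argument enhancement callables, e.g. the ISE,
    subspace, and NSS sketches above with their parameters fixed
    via functools.partial.
    """
    s_hat = y
    for enhance in stages:
        s_hat = enhance(s_hat)  # each stage refines the estimate of s(i)
    return s_hat
```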
4. Results and Discussion
4.1. Performance Analysis of Speech Enhancement Algorithms
4.1.1. ISE for Spontaneous Signal
4.1.2. Subspace Method for Spontaneous Signal
4.1.3. NSS for Spontaneous Signal
4.2. Speech-to-Text Conversion of Enhanced Speech
Performance of HMM
4.3. Performance Analysis of HSEA
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Anasuya, M.A.; Katti, S.K. Speech recognition by machine: A review. Int. J. Comput. Sci. Inf. Secur. 2009, 6, 181–205.
- Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120.
- Ephraim, Y.; Van Trees, H. A signal subspace approach for speech enhancement. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Minneapolis, MN, USA, 27–30 April 1993; Volume 2, pp. 355–358.
- Rezayee, A.; Gazor, S. An adaptive KLT approach for speech enhancement. IEEE Trans. Speech Audio Process. 2001, 9, 87–95.
- Abd El-Fattah, M.A.; Dessouky, M.I.; Diab, S.M.; Abd El-Samie, F.E. Adaptive Wiener filtering approach for speech enhancement. Ubiquitous Comput. Commun. J. 2010, 3, 23–31.
- Lu, X.; Unoki, M.; Matsuda, S.; Hori, C.; Kashioka, H. Controlling tradeoff between approximation accuracy and complexity of a smooth function in a reproducing kernel Hilbert space for noise reduction. IEEE Trans. Signal Process. 2013, 61, 601–610.
- Bengio, Y. Learning deep architectures for AI. Found. Trends Mach. Learn. 2009, 2, 1–127.
- Lu, X.; Tsao, Y.; Matsuda, S.; Hori, C. Speech enhancement based on deep denoising autoencoder. In Proceedings of Interspeech, Annual Conference of the International Speech Communication Association, Lyon, France, 25–29 August 2013; pp. 436–440.
- Xu, Y.; Du, J.; Dai, L.-R.; Lee, C.-H. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 7–19.
- Barnov, A.; Bar Bracha, V.; Markovich-Golan, S. QRD based MVDR beamforming for fast tracking of speech and noise dynamics. In Proceedings of the 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, 15–18 October 2017; pp. 369–373.
- Weninger, F.; Erdogan, H.; Watanabe, S.; Vincent, E.; Le Roux, J.; Hershey, J.R.; Schuller, B. Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR. In Proceedings of the International Conference on Latent Variable Analysis and Signal Separation, Liberec, Czech Republic, 25–28 August 2015; pp. 91–99.
- Pascual, S.; Bonafonte, A.; Serrà, J. SEGAN: Speech enhancement generative adversarial network. In Proceedings of Interspeech, 18th Annual Conference of the International Speech Communication Association, Stockholm, Sweden, 20–24 August 2017; pp. 3642–3646.
- Kumar, A.; Florencio, D. Speech enhancement in multiple-noise conditions using deep neural networks. In Proceedings of Interspeech, San Francisco, CA, USA, 8–12 September 2016; pp. 3738–3742.
- Furuya, K.; Kataoka, A. Robust speech dereverberation using multichannel blind deconvolution with spectral subtraction. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 1579–1591.
- Kamath, S.; Loizou, P. A multi-band spectral subtraction method for enhancing speech corrupted by colored noise. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Orlando, FL, USA, 13–17 May 2002; Volume 4, p. 44164.
- Udrea, R.M.; Oprea, C.C.; Stanciu, C. Multi-microphone noise reduction system integrating nonlinear multi-band spectral subtraction. In Pervasive Computing Paradigms for Mental Health; Oliver, N., Serino, S., Matic, A., Cipresso, P., Filipovic, N., Gavrilovska, L., Eds.; Springer International Publishing: Cham, Switzerland, 2018; pp. 133–138.
- Tufts, D.W.; Kumaresan, R.; Kirsteins, I. Data adaptive signal estimation by singular value decomposition of a data matrix. Proc. IEEE 1982, 70, 684–685.
- Saliha, B.; Youssef, E.; Abdeslam, D. A study on automatic speech recognition. J. Inf. Technol. Rev. 2019, 10, 77–85.
- Hermus, K.; Wambacq, P.; Van Hamme, H. A review of signal subspace speech enhancement and its application to noise robust speech recognition. EURASIP J. Adv. Signal Process. 2006, 2007, 045821.
- Pardede, H.; Ramli, K.; Suryanto, Y.; Hayati, N.; Presekal, A. Speech enhancement for secure communication using coupled spectral subtraction and Wiener filter. Electronics 2019, 8, 897.
- Jousse, V.; Petit-Renaud, S.; Meignier, S.; Esteve, Y.; Jacquin, C. Automatic named identification of speakers using diarization and ASR systems. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009; pp. 4557–4560.
- Herbordt, W.; Buchner, H.; Kellermann, W. An acoustic human-machine front-end for multimedia applications. EURASIP J. Appl. Signal Process. 2003, 1, 21–31.
- Doclo, S.; Dologlou, I.; Moonen, M. A Novel Iterative Signal Enhancement Algorithm for Noise Reduction in Speech. Available online: https://www.isca-speech.org/archive_v0/archive_papers/icslp_1998/i98_0131.pdf (accessed on 16 October 2021).
- Bin Amin, T.; Mahmood, I. Speech recognition using dynamic time warping. In Proceedings of the 2nd International Conference on Advances in Space Technologies, Islamabad, Pakistan, 29–30 November 2008; pp. 74–79.
- Jenifa, G.; Yuvaraj, N.; Karthikeyan, B.; Preethaa, K.R.S. Deep learning based voice assistance in hospitals using face recognition. J. Phys. Conf. Ser. 2021, 1916, 012159.
- Yuvaraj, N.; Sanjeev, M.; Jenifa, G.; Preethaa, K.R.S. Voice activated face recognition based smart support system. J. Phys. Conf. Ser. 2021, 1916, 012158.
- Markovich-Golan, S.; Gannot, S. Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; pp. 544–548.
- Souden, M.; Chen, J.; Benesty, J.; Affes, S. Gaussian model-based multichannel speech presence probability. IEEE Trans. Audio Speech Lang. Process. 2010, 18, 1072–1077.
- Serizel, R.; Moonen, M.; Van Dijk, B.; Wouters, J. Low-rank approximation based multichannel Wiener filter algorithms for noise reduction with application in cochlear implants. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 785–799.
- Qazi, O.U.R.; van Dijk, B.; Moonen, M.; Wouters, J. Understanding the effect of noise on electrical stimulation sequences in cochlear implants and its impact on speech intelligibility. Hear. Res. 2013, 299, 79–87.
Techniques | Performance | Advantage(s) | Disadvantage(s)
---|---|---|---
Spectral subtraction algorithm [2] | Enhances speech by estimating the noise spectrum and subtracting it from the noisy signal | Applicable to both stationary and non-stationary noise | Resultant speech contains residual noise
Signal subspace algorithm [3] | Uses the Karhunen-Loeve transform (KLT), i.e., eigenvalue decomposition | Discards the noise subspace; can use the state-space representation of the system directly | State-space realizations are not unique
Wiener filter-based algorithm [5] | Mainly used in real-time applications | Good noise-cancellation performance | Requires many computations
Adaptive Wiener filter-based algorithm [6] | Mainly used in real-time applications | Reduced/moderate computational complexity | Mean square error is not always a relevant criterion
Deep autoencoder (DAE) algorithm [8] | Deep denoising autoencoders are used to enhance speech features | Efficient for resonant speech recognition | Mainly applicable to clean/controlled speech only
Voice activity detector (VAD) [10] | Works on the long pauses between words | Can classify noise even during speech pauses | Not efficient for encrypted speech signals
Long short-term memory (LSTM) [11] | A type of RNN that can learn long-term dependencies | Produces good results in speech recognition | Concentrates only on the length of the speech
Generative adversarial networks (GAN) [12] | A generative model trained adversarially to reconstruct clean speech from noisy speech | Generates audio close to the original by eliminating noise | Harder to train
Multiband spectral subtraction algorithm [15] | Suppresses inverse-filtered reverberation by spectral subtraction | Overcomes distortion while maintaining speech quality | Not suitable for highly random real-world noise
Utterance Length | WER (%) | Accuracy (%)
---|---|---
100 words | 41.6 | 58.4 |
150 words | 51.5 | 48.5 |
200 words | 55.5 | 44.5 |
250 words | 59.9 | 40.1 |
300 words | 69.5 | 30.5 |
350 words | 73.4 | 26.6 |
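The WER and accuracy figures in these tables follow the standard definition, WER = (S + D + I) / N, where S, D, and I are the substitutions, deletions, and insertions in a word-level alignment and N is the number of reference words. A minimal sketch (a hypothetical helper, not code from the paper):

```python
def wer(reference, hypothesis):
    """Word error rate (%) via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j].
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return 100.0 * dist[-1][-1] / len(ref)

# Example: two substitutions against a four-word reference -> 50.0
print(wer("turn on the lights", "turn of the light"))
```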
Utterance Length | WER (%) | Accuracy (%)
---|---|---
100 words | 29.8 | 60.2 |
150 words | 40.4 | 49.6 |
200 words | 43.1 | 46.9 |
250 words | 46.7 | 43.3 |
300 words | 52.8 | 37.2 |
350 words | 60.2 | 29.8 |
Utterance Length | WER (%) | Accuracy (%)
---|---|---
100 words | 11.9 | 88.1 |
150 words | 19.4 | 80.6 |
200 words | 20.3 | 79.7 |
250 words | 21.2 | 78.8 |
300 words | 23.1 | 76.9 |
350 words | 24.3 | 75.7 |
Utterance Length | WER (%) with Noise | WER (%) without Noise
---|---|---
100 words | 26.2 | 21.4 |
150 words | 28.1 | 23.3 |
200 words | 31.6 | 28.9 |
250 words | 34.9 | 32.7 |
300 words | 41.5 | 39.8 |
350 words | 44.3 | 41.2 |
Utterance Length | HSEA WER (%) | NSS WER (%)
---|---|---
100 words | 9.5 | 11.9 |
150 words | 11.4 | 19.4 |
200 words | 13.6 | 20.3 |
250 words | 16.7 | 21.2 |
300 words | 17.1 | 23.1 |
350 words | 19.9 | 24.3 |
Utterance Length | HSEA WER (%) | NSS WER (%)
---|---|---
100 words | 7.6 | 8.1 |
150 words | 9.2 | 14.5 |
200 words | 11.9 | 17.9 |
250 words | 14.8 | 19.2 |
300 words | 15.3 | 21.1 |
350 words | 17.5 | 22.9 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).