A Helium Speech Unscrambling Algorithm Based on Deep Learning
Figures:
- Figure 1. The helium speech recognition model.
- Figure 2. The optimization algorithm.
- Figure 3. Plots of the ReLU and Swish activation functions.
- Figure 4. The GLU architecture.
- Figure 5. Formant information of normal speech.
- Figure 6. Formant information of helium speech.
- Figure 7. The algorithm process when generating label files.
- Figure 8. Comparison of the LSF performance of different algorithms.
Abstract
1. Introduction
- Chinese helium speech corpora are built. In building them, we design one algorithm to automatically generate label files and one algorithm to select utterances for the continuous helium speech corpus, which reduces the scale of the training set without changing the corpus size.
- A helium speech recognition algorithm combining a CNN, connectionist temporal classification (CTC) and a transformer model is proposed to improve the intelligibility of helium speech. Furthermore, the influence of the complexity of the algorithm and language model on the recognition rate is analyzed.
- To improve the recognition rate for continuous helium speech, an optimization algorithm is proposed that combines depth-wise separable convolution (DSC), a gated linear unit (GLU) and a feedforward neural network (FNN); a minimal sketch of how these components compose appears after this list.
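As a rough illustration of how DSC, a GLU and an FNN can be composed, the following PyTorch sketch may help. The channel counts, kernel size, residual connection and layer ordering are placeholder assumptions, not the DSCNN-GFNN configuration evaluated in Section 5.

```python
import torch
import torch.nn as nn

class DSCGLUBlock(nn.Module):
    """Illustrative DSC + GLU + FNN block; all sizes are placeholder assumptions."""
    def __init__(self, channels: int = 64, hidden: int = 256):
        super().__init__()
        # Depth-wise separable convolution = depth-wise conv + 1x1 point-wise conv.
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels)
        # The point-wise conv doubles the channels so the GLU can split them
        # into a linear half and a sigmoid gate half.
        self.pointwise = nn.Conv2d(channels, 2 * channels, kernel_size=1)
        self.glu = nn.GLU(dim=1)  # halves the channel dimension again
        self.ffn = nn.Sequential(
            nn.Linear(channels, hidden),
            nn.ReLU(),
            nn.Linear(hidden, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, freq)
        y = self.glu(self.pointwise(self.depthwise(x)))
        y = y.permute(0, 2, 3, 1)            # apply the FNN channel-wise
        y = self.ffn(y).permute(0, 3, 1, 2)
        return x + y                         # residual connection (an assumption)

x = torch.randn(2, 64, 100, 40)              # batch of Fbank-like feature maps
print(DSCGLUBlock()(x).shape)                # torch.Size([2, 64, 100, 40])
```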
2. Nature of Helium Speech
3. Helium Speech Recognition Algorithm
3.1. Helium Speech Recognition Model
3.2. Preprocessing
3.3. Feature Extraction
3.4. Acoustic Model
3.5. Language Model
3.6. Optimization of Unscrambling Algorithm
- Construct a helium speech corpus: two corpora are built successively, an isolated Chinese helium speech corpus and a continuous Chinese helium speech corpus.
- Perform preprocessing and feature extraction: Fbank features are extracted as the input of the acoustic model (a minimal extraction sketch follows this list).
- Train the model: the extracted features are input to the CNN to reduce the feature dimension, then to CTC to obtain the pinyin sequence with the maximum probability; the pinyin output of the acoustic model is input to the transformer model to obtain the Chinese word output.
- Test the model: the test set is input to the model to obtain the word error rate (WER).
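The Fbank extraction step can be sketched with torchaudio's Kaldi-compatible frontend. The file name, sample rate, frame parameters and number of mel bins below are illustrative assumptions, not the values used in the experiments.

```python
import torchaudio

# Load a recording (mono input is assumed; the file name is hypothetical).
waveform, sample_rate = torchaudio.load("utterance.wav")

# Kaldi-style log-mel filter-bank (Fbank) features: 25 ms frames,
# 10 ms shift, 80 mel bins -- placeholder values.
fbank = torchaudio.compliance.kaldi.fbank(
    waveform,
    sample_frequency=sample_rate,
    frame_length=25.0,
    frame_shift=10.0,
    num_mel_bins=80,
)
print(fbank.shape)  # (num_frames, 80), fed to the CNN front end
```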
4. Construction of Helium Speech Corpus
- (1) Selecting text content;
- (2) Collecting raw speech;
- (3) Converting the file format;
- (4) Labeling the corpus;
- (5) Grouping recordings.
5. Experimental Results and Analysis
5.1. Setup
5.2. Isolated Helium Speech Recognition
- (1) The complexity of the CNN;
- (2) The complexity of the transformer.
5.3. Continuous Helium Speech Recognition
5.4. The Performance of DSCNN-GFNN
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
| Network Layer | Specific Parameters |
| --- | --- |
| Conv2d | Kernel size 3 × 3, 32 kernels, ReLU activation |
| Batch normalization | |
| Conv2d | Kernel size 3 × 3, 32 kernels, ReLU activation |
| Batch normalization | |
| Pooling layer | Maximum pooling, pooling area 2 × 2 |
| Conv2d | Kernel size 3 × 3, 64 kernels, ReLU activation |
| Batch normalization | |
| Conv2d | Kernel size 3 × 3, 64 kernels, ReLU activation |
| Batch normalization | |
| Pooling layer | Maximum pooling, pooling area 2 × 2 |
| Conv2d | Kernel size 3 × 3, 128 kernels, ReLU activation |
| Batch normalization | |
| Conv2d | Kernel size 3 × 3, 128 kernels, ReLU activation |
| Batch normalization | |
| Pooling layer | Maximum pooling, pooling area 2 × 2 |
| Conv2d | Kernel size 3 × 3, 128 kernels, ReLU activation |
| Batch normalization | |
| Reshape | |
| Dropout | |
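For concreteness, the table can be transcribed into PyTorch as follows. The input layout, padding, dropout rate and the target shape of the Reshape step are not specified in the table, so the choices below are assumptions.

```python
import torch
import torch.nn as nn

def conv_bn(in_ch: int, out_ch: int) -> nn.Sequential:
    # One "Conv2d + ReLU" row of the table followed by its batch-normalization row.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.BatchNorm2d(out_ch),
    )

cnn_front_end = nn.Sequential(
    conv_bn(1, 32),   conv_bn(32, 32),   nn.MaxPool2d(2),
    conv_bn(32, 64),  conv_bn(64, 64),   nn.MaxPool2d(2),
    conv_bn(64, 128), conv_bn(128, 128), nn.MaxPool2d(2),
    conv_bn(128, 128),
    nn.Flatten(start_dim=2),  # stand-in for the Reshape row (target shape not given)
    nn.Dropout(0.2),          # dropout rate not given in the table
)

x = torch.randn(4, 1, 200, 80)   # e.g., 200 frames of 80-dim Fbank features
print(cnn_front_end(x).shape)    # torch.Size([4, 128, 250])
```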
| Model | WER on Validation Set (%) | WER on Test Set (%) |
| --- | --- | --- |
| Model 1 | 0.09 | 8.62 |
| Model 2 | 0.09 | 8.94 |
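WER here and in the tables below is the word-level edit distance between the recognized and reference transcriptions, divided by the reference length. A minimal reference implementation (not the authors' scoring script):

```python
def wer(ref: list[str], hyp: list[str]) -> float:
    """Word error rate: Levenshtein distance / reference length."""
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution in four pinyin tokens -> WER = 0.25.
print(wer("shen hai qian shui".split(), "shen hai quan shui".split()))
```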
| Word | Model 1 | Model 2 |
| --- | --- | --- |
| Word 1 | 3.20 | 4.50 |
| Word 2 | 3.10 | 5.08 |
| Word 3 | 0.84 | 0.84 |
| Word 4 | 1.93 | 1.93 |
| Word 5 | 10.43 | 17.53 |
| Word 6 | 3.39 | 3.39 |
| Word 7 | 2.41 | 2.42 |
| Word 8 | 1.51 | 1.51 |
| Word 9 | 3.94 | 3.95 |
| Word 10 | 1.07 | 1.66 |
| Number of Convolution Layers | WER (%) |
| --- | --- |
| 5 | 13.82 |
| 6 | 10.34 |
| 7 | 8.29 |
| 8 | 9.25 |
| 9 | 12.42 |
| 10 | 14.94 |
| Number of Blocks | WER (%) |
| --- | --- |
| 1 | 8.46 |
| 2 | 8.46 |
| 3 | 8.46 |
| 4 | 8.46 |
| 5 | 9.11 |
| 6 | 8.29 |
| Model | WER on Validation Set (%) | WER on Test Set (%) |
| --- | --- | --- |
| Model 1 | 38.31 | 42.24 |
| Model 2 | 45.40 | 45.73 |
| Algorithm | WER on Validation Set (%) | WER on Test Set (%) |
| --- | --- | --- |
| ① CNN + CTC | 45.40 | 45.73 |
| ② CNN + CTC + transformer | 38.31 | 42.24 |
| ③ DSCNN + CTC | 35.94 | 37.29 |
| ④ DSCNN-GFNN + CTC | 33.82 | 36.75 |
| ⑤ DSCNN-GFNN + CTC + transformer | 29.64 | 32.98 |
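All five systems in this table share a CTC output stage on top of the acoustic model. A minimal sketch of CTC training and greedy decoding in PyTorch follows; the pinyin vocabulary size, blank index and tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)  # blank index 0 is an assumption

# Dummy acoustic-model outputs: 100 frames, batch of 4, 1200 pinyin classes.
log_probs = torch.randn(100, 4, 1200).log_softmax(-1)
targets = torch.randint(1, 1200, (4, 20))      # 20 pinyin labels per utterance
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 100),
           target_lengths=torch.full((4,), 20))

def greedy_decode(frame_log_probs: torch.Tensor, blank: int = 0) -> list[int]:
    """Arg-max per frame, collapse repeats, drop blanks."""
    out, prev = [], blank
    for t in frame_log_probs.argmax(-1).tolist():
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

print(loss.item(), greedy_decode(log_probs[:, 0, :])[:10])
```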
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Chen, Y.; Zhang, S. A Helium Speech Unscrambling Algorithm Based on Deep Learning. Information 2023, 14, 189. https://doi.org/10.3390/info14030189