Abstract
This paper addresses the problem of recognizing a native language in a mixed-voice environment. G-Cocktail would aid such applications in identifying commands given in Gujarati, even from a mixed voice stream. G-Cocktail works in two phases: the first filters the voices and extracts features; the second trains and classifies on the resulting dataset. The trained model is then used to recognize new voice signals. The main challenge in training for a native language is that only small datasets are available. The model takes single-word inputs, drawn from phrase benchmark datasets from Microsoft and the Linguistic Data Consortium for Indian Languages (LDC-IL). To overcome the overfitting that a small dataset invites, the CatBoost algorithm is used and the classification model is fine-tuned. Mel-frequency cepstral coefficients (MFCCs) are coefficients that collectively make up an MFC; they are derived from a type of cepstral representation of the audio clip (a nonlinear "spectrum-of-a-spectrum"). MFCCs suit human voices well, but noise in the recording makes them less effective, so the voices are filtered first and the MFCCs are computed afterwards; only the most relevant features are retained to make the representation more robust. Pitch is added alongside the MFCC features, since pitch can vary with region, mood, age, and the speaker's familiarity with the language. A voice print of each sound file is constructed and fed as features to the classification model. Training and testing use a 70%/30% split, with algorithms such as K-means, naïve Bayes, and LightGBM for comparison. On the given dataset, the results show that G-Cocktail using CatBoost outperformed the others on all parameters under the given scenario.
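The paper does not release its implementation (see Code Availability below), so the following is only a minimal sketch of the first phase as the abstract describes it: filter the voice signal, then compute MFCCs plus pitch and summarize them into a per-clip voice print. The band-pass range, the MFCC count, and the file name `clip.wav` are illustrative assumptions, and `librosa`/`scipy` stand in for whatever toolchain the authors actually used.

```python
# Sketch of G-Cocktail's feature phase as described in the abstract:
# filter the voice signal, then compute MFCCs and pitch. File name and
# parameter choices here are illustrative assumptions, not the authors' values.
import numpy as np
import librosa
from scipy.signal import butter, sosfilt

def extract_features(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=16000)          # mono, 16 kHz
    # Band-pass filter roughly covering the human voice band (assumed range).
    sos = butter(10, [80, 4000], btype="bandpass", fs=sr, output="sos")
    y = sosfilt(sos, y)
    # MFCCs summarized over time -> one "voice print" vector per clip.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Pitch track via pYIN; unvoiced frames come back as NaN and are dropped.
    f0, _, _ = librosa.pyin(y, fmin=80, fmax=400, sr=sr)
    f0 = f0[~np.isnan(f0)]
    pitch_stats = [f0.mean(), f0.std()] if f0.size else [0.0, 0.0]
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1), pitch_stats])

features = extract_features("clip.wav")  # hypothetical single-word recording
```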
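For the second phase, a similarly hedged sketch of the classification step: the 70%/30% train/test split and a CatBoost classifier, as the abstract reports. All hyperparameters are guesses (the paper only says the model was fine-tuned against overfitting), and the random voice prints below merely stand in for features produced by the first phase.

```python
# Sketch of the training/classification phase: 70/30 split plus CatBoost.
# X and y are synthetic stand-ins for the real voice prints and Gujarati
# word labels; the hyperparameters are assumptions, not the paper's values.
import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 28))    # e.g. 13 MFCC means + 13 stds + 2 pitch stats
y = rng.integers(0, 5, size=200)  # five stand-in command words

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y)

model = CatBoostClassifier(
    iterations=500,
    depth=6,
    learning_rate=0.05,
    l2_leaf_reg=3.0,           # extra regularization for a small dataset
    early_stopping_rounds=50,  # stop when the eval set stops improving
    verbose=False,
)
model.fit(X_train, y_train, eval_set=(X_test, y_test))
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```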
Availability of Data and Material
The data used in this work are available from LDC-IL, the Indian repository of resources for language technology (Choudhary, 2021): https://www.ldcil.org/publications.aspx.
Code Availability
No software application or custom code was copied for this work.
References
Bhaskararao, P. (2011). Salient phonetic features of Indian languages in speech technology. Sadhana, 36(5), 587–599. https://doi.org/10.1007/s12046-011-0039-z
Bhat, G. S., Shankar, N., Reddy, C. K. A., & Panahi, I. M. S. (2019). A real-time convolutional neural network based speech enhancement for hearing impaired listeners using smartphone. IEEE Access, 7, 78421–78433. https://doi.org/10.1109/ACCESS.2019.2922370
Yarra, C., Aggarwal, R., Rajpal, A., & Ghosh, P. K. (2019). Indic TIMIT and Indic English lexicon: A speech database of Indian speakers using TIMIT stimuli and a lexicon from their mispronunciations. In: 2019 22nd Conference of the Oriental COCOSDA International Committee for the Co-ordination and Standardization of Speech Databases and Assessment Techniques (O-COCOSDA), Cebu, Philippines, pp. 1–6, doi: https://doi.org/10.1109/O-COCOSDA46868.2019.9041230.
Jeeva, M. P. A., Nagarajan, T., & Vijayalakshmi, P. (2020). Adaptive multi-band filter structure-based far-end speech enhancement. IET Signal Processing, 14(5), 288–299. https://doi.org/10.1049/iet-spr.2019.0226
Panda, S. P., Nayak, A. K., & Rai, S. C. (2020). A survey on speech synthesis techniques in Indian languages. Multimedia Systems, 26(4), 453–478. https://doi.org/10.1007/s00530-020-00659-4
Sarkar, P., Haque, A., Dutta, A. K., Gurunath Reddy, M., Harikrishna, D. M., Dhara, P., Rashmi, V., Narendra, N. P., Sunil Kr., S. B., Yadav, J., & Sreenivasa Rao, K. (2014). Designing prosody rule-set for converting neutral TTS speech to storytelling style speech for Indian languages: Bengali, Hindi and Telugu. In: Seventh International Conference on Contemporary Computing (IC3), Noida, India, pp. 473–477, doi: https://doi.org/10.1109/IC3.2014.6897219.
Mishra, N., Shrawankar, U., & Thakare, V. M. (2010). An overview of Hindi speech recognition. In: Proceedings of the International Conference on Computational Systems and Communication Technology, Tamil Nadu, p. 6, May 5, 2010.
Shrishrimal, P. P., Deshmukh, R. R., & Waghmare, V. B. (2012). Indian language speech database: A review. IJCA, 47(5), 17–21. https://doi.org/10.5120/7184-9893
Khan, S. u. D. (2012). The phonetics of contrastive phonation in Gujarati. Journal of Phonetics, 40(6), 780–795. https://doi.org/10.1016/j.wocn.2012.07.001
Wang, D., & Chen, J. (2018). Supervised speech separation based on deep learning: An overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(10), 1702–1726. https://doi.org/10.1109/TASLP.2018.2842159
Roy, B. K., Biswas, S. C., & Mukhopadhyay, P. (2018). Designing unicode-compliant Indic-script based institutional digital repository with special reference to Bengali. International Journal of Knowledge Content Development & Technology, 8(3), 53–67. https://doi.org/10.5865/IJKCT.2018.8.3.053
Sproat, R. (2003). A formal computational analysis of Indic scripts. In: International Symposium on Indic Scripts: Past and Future, Tokyo, Dec. 2003.
Upadhyay, N., & Karmakar, A. (2015). Speech enhancement using spectral subtraction-type algorithms: A comparison and simulation study. Procedia Computer Science, 54, 574–584. https://doi.org/10.1016/j.procs.2015.06.066
Upadhyay, N. (2014). An improved multi-band speech enhancement utilizing masking properties of human hearing system. In: 2014 Fifth International Symposium on Electronic System Design, Surathkal, Mangalore, India, pp. 150–155, doi: https://doi.org/10.1109/ISED.2014.38.
Jo, J., Yoo, H., & Park, I. (2016). Energy-efficient floating-point MFCC extraction architecture for speech recognition systems. IEEE Transactions on Very Large-Scale Integration (VLSI) Systems, 24(2), 754–758.
Chakroborty, S., Roy, A., & Saha, G. (2006). Fusion of a complementary feature set with MFCC for improved closed set text-independent speaker identification. In: 2006 IEEE International Conference on Industrial Technology, Mumbai, India, pp. 387–390, doi: https://doi.org/10.1109/ICIT.2006.372388.
Das, A., Guha, S., Singh, P. K., Ahmadian, A., Senu, N., & Sarkar, R. (2020). A hybrid meta-heuristic feature selection method for identification of Indian spoken languages from audio signals. IEEE Access, 8, 181432–181449. https://doi.org/10.1109/ACCESS.2020.3028241
Garg, K., & Jain, G. (2016). A comparative study of noise reduction techniques for automatic speech recognition systems. In: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Jaipur, India, pp. 2098–2103, doi: https://doi.org/10.1109/ICACCI.2016.7732361
Alim, S. A., & Rashid, N. K. A. (2018). Some commonly used speech feature extraction algorithms. From Natural to Artificial Intelligence - Algorithms and Applications. https://doi.org/10.5772/intechopen.80419
Nehe, N. S., & Holambe, R. S. (2012). DWT and LPC based feature extraction methods for isolated word recognition. EURASIP Journal on Audio, Speech, and Music Processing, 2012(1), 7. https://doi.org/10.1186/1687-4722-2012-7
Hung, J., & Fan, H. (2009). Subband feature statistics normalization techniques based on a discrete wavelet transform for robust speech recognition. IEEE Signal Processing Letters, 16(9), 806–809. https://doi.org/10.1109/LSP.2009.2024113
Eltiraifi, O., Elbasheer, E., & Nawari, M. (2018). A comparative study of MFCC and LPCC features for speech activity detection using deep belief network. In: 2018 International Conference on Computer, Control, Electrical, and Electronics Engineering (ICCCEEE), Khartoum, pp. 1–5, doi: https://doi.org/10.1109/ICCCEEE.2018.8515821
Dehak, N., Torres-Carrasquillo, P., Reynolds, D., & Dehak, R. (2011). Language recognition via I-vectors and dimensionality reduction. In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pp. 857–860. https://doi.org/10.21437/Interspeech.2011-328.
Mohammadamini, M., & Matrouf, D. (2021). Data augmentation versus noise compensation for x-vector speaker recognition systems in noisy environments. In: 2020 28th European Signal Processing Conference (EUSIPCO), Amsterdam, Netherlands, pp. 1–5, doi: https://doi.org/10.23919/Eusipco47968.2020.9287690
Wu, J., Hua, Y., Yang, S., Qin, H., & Qin, H. (2019). Speech enhancement using generative adversarial network by distilling knowledge from statistical method. Applied Sciences, 9(16), 3396. https://doi.org/10.3390/app9163396
Pulugundla, B., Karthick, M., Kesiraju, S., & Egorova, E. (2018). BUT system for low resource Indian language ASR. Interspeech, 2018, 3182–3186. https://doi.org/10.21437/Interspeech.2018-1302
Gogoi, S., & Bhattacharjee, U., (2017). Vocal tract length normalization and sub-band spectral subtraction based robust assamese vowel recognition system. In: 2017 International Conference on Computing Methodologies and Communication (ICCMC), Erode, pp. 32–35, doi: https://doi.org/10.1109/ICCMC.2017.8282709
Wang, J., Zhang, J., Honda, K., Wei, J., & Dang, J. (2016). Audio-visual speech recognition integrating 3D lip information obtained from the Kinect. Multimedia Systems, 22(3), 315–323. https://doi.org/10.1007/s00530-015-0499-9
Varalwar, M., & Patel, N. (2006). Characteristics of Indian Languages. Bhrigus Inc.
Sirsa, H., & Redford, M. A. (2013). The effects of native language on Indian English sounds and timing patterns. Journal of Phonetics, 41(6), 393–406. https://doi.org/10.1016/j.wocn.2013.07.004
Singh, J., & Kaur, K. (2019). Speech enhancement for Punjabi language using deep neural network. In: 2019 International Conference on Signal Processing and Communication (ICSC), NOIDA, India, pp. 202–204, doi: https://doi.org/10.1109/ICSC45622.2019.8938309.
Reddy, M. G., Manjunath, K., Sarkar, P., & Rao, K. S. (2015). Automatic pitch accent contour transcription for Indian languages. In: 2015 International Conference on Computer, Communication and Control (IC4), Indore, India, pp. 1–6, doi: https://doi.org/10.1109/IC4.2015.7375669.
Polasi, P. K., & Sri Rama Krishna, K. (2016). Combining the evidences of temporal and spectral enhancement techniques for improving the performance of Indian language identification system in the presence of background noise. International Journal of Speech Technology, 19(1), 75–85. https://doi.org/10.1007/s10772-015-9326-0
Patil, A., More, P., & Sasikumar, M. (2019). Incorporating finer acoustic phonetic features in lexicon for Hindi language speech recognition. Journal of Information and Optimization Sciences, 40(8), 1731–1739. https://doi.org/10.1080/02522667.2019.1703266
Parikh, R. B., & Joshi, D. H. (2020). Gujarati speech recognition – A review. No. 549, p. 6.
Nath, S., Chakraborty, J., & Sarmah, P. (2018). Machine identification of spoken Indian languages. p. 6.
Mullah, H. U., Pyrtuh, F., & Singh, L. J. (2015). Development of an HMM-based speech synthesis system for Indian English language. In: 2015 International Symposium on Advanced Computing and Communication (ISACC), Silchar, India, pp. 124–127, doi: https://doi.org/10.1109/ISACC.2015.7377327.
Morris, A., Maier, V., & Green, P. (2004). From WER and RIL to MER and WIL: Improved evaluation measures for connected speech recognition. In: Interspeech 2004. https://doi.org/10.21437/Interspeech.2004-668.
Londhe, N. D., Ahirwal, M. K., & Lodha, P. (2016). Machine learning paradigms for speech recognition of an Indian dialect. In: 2016 International Conference on Communication and Signal Processing (ICCSP), Melmaruvathur, Tamilnadu, India, pp. 0780–0786, doi: https://doi.org/10.1109/ICCSP.2016.7754251.
Li, Q., Yang, Y., Lan, F., Zhu, H., Wei, Q., Qiao, F., Liu, Z., & Yang, H. (2020). MSP-MFCC: Energy-efficient MFCC feature extraction method with mixed-signal processing architecture for wearable speech recognition applications. IEEE Access, 8, 48720–48730. https://doi.org/10.1109/ACCESS.2020.2979799
Lavanya, T., Nagarajan, T., & Vijayalakshmi, P. (2020). Multi-level single-channel speech enhancement using a unified framework for estimating magnitude and phase spectra. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28, 1315–1327. https://doi.org/10.1109/TASLP.2020.2986877
Kiruthiga, S., & Krishnamoorthy, K. (2012). Design issues in developing speech corpus for Indian languages — A survey. In: 2012 International Conference on Computer Communication and Informatics, Coimbatore, India, pp. 1–4, doi: https://doi.org/10.1109/ICCCI.2012.6158831.
Khan, M. K. S., & Al-Khatib, W. G. (2006). Machine-learning based classification of speech and music. Multimedia Systems, 12(1), 55–67. https://doi.org/10.1007/s00530-006-0034-0
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., & Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In: 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA, p. 9.
Joshi, M., Iyer, M., & Gupta, N. (2010). Effect of accent on speech intelligibility in multiple speaker environment with sound spatialization. In: 2010 Seventh International Conference on Information Technology: New Generations, Las Vegas, NV, USA, pp. 338–342, doi: https://doi.org/10.1109/ITNG.2010.11.
Hao, X., Wen, S., Su, X., Liu, Y., Gao, G., & Li, X. (2020). Sub-band knowledge distillation framework for speech enhancement. Interspeech, 2020, 2687–2691. https://doi.org/10.21437/Interspeech.2020-1539
Yang, C., Xie, L., Su, C., & Yuille, A. L. (2019). Snapshot distillation: Teacher-student optimization in one generation. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2854–2863.
Desai Vijayendra, A., & Thakar, V. K. (2016). Neural network based Gujarati speech recognition for dataset collected by in-ear microphone. Procedia Computer Science, 93, 668–675. https://doi.org/10.1016/j.procs.2016.07.259
Patel, H. N., & Virparia, P. V. (2011). A small vocabulary speech recognition for Gujarati, vol. 2, no. 1.
Pipaliahoomikaave, D. S. (2015). An approach to increase word recognition accuracy in Gujarati language. International Journal of Innovative Research in Computer and Communication Engineering, 3(9), 6442–6450.
Jinal, H., & Dipti, B. (2016). Speech recognition system architecture for Gujarati language. International Journal of Computer Applications, 138(12), 28–31.
Valaki, S., & Jethva, H. (2017). A hybrid HMM/ANN approach for automatic Gujarati speech recognition. In: Proceedings of the 2017 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS 2017), pp. 1–5.
Tailor, J. H., & Shah, D. B. (2017). HMM-based lightweight speech recognition system for Gujarati language. pp. 451–461.
Toshniwal, S., Sainath, T. N., Weiss, R. J., Li, B., Moreno, P., Weinstein, E., & Rao, K. (2018). Multilingual speech recognition with a single end-to-end model. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4904–4908.
Vydana, H. K., Gurugubelli, K., Raju, V. V. V., & Vuppala, A. K. (2018). An exploration towards joint acoustic modeling for Indian languages: IIIT-H submission for Low Resource Speech Recognition Challenge for Indian languages. In: Proceedings of Interspeech 2018, pp. 3192–3196.
Sailor, H. B., Siva Krishna, M. V., Chhabra, D., Patil, A. T., Kamble, M. R., & Patil, H. A. (2018). DA-IICT/IIITV system for low resource speech recognition challenge 2018. In: Proceedings of Interspeech 2018, pp. 3187–3191.
Billa, J. (2018). ISI ASR system for the low resource speech recognition challenge for Indian languages. In: Proceedings of Interspeech 2018, pp. 3207–3211.
Prokhorenkova, L., Gusev, G., Vorobev, A., Dorogush, A. V., & Gulin, A. (2019). CatBoost: Unbiased boosting with categorical features. arXiv:1706.09516 [cs]. Accessed Mar. 03, 2021. https://arxiv.org/abs/1706.09516
Padmapriya, J., Sasilatha, T., Karthickmano, J. R., Aagash, G., & Bharathi, V. (2021). Voice extraction from background noise using filter bank analysis for voice communication applications. In: 2021 Third International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV), pp. 269–273, doi: https://doi.org/10.1109/ICICV50876.2021.9388453.
Choudhary, N. (2021). LDC-IL: The Indian repository of resources for language technology. Language Resources and Evaluation, 1–13. https://www.ldcil.org/publications.aspx.
Bahmaninezhad, F., Wu, J., Gu, R., Zhang, S.-X., Xu, Y., Yu, M., & Yu, D. (2019). A comprehensive study of speech separation: Spectrogram vs waveform separation. arXiv:1905.07497 [cs, eess]. Accessed Nov. 11, 2021. https://arxiv.org/abs/1905.07497.
Fischer, T., Caversaccio, M., & Wimmer, W. (2021). Speech signal enhancement in cocktail party scenarios by deep learning based virtual sensing of head-mounted microphones. Hearing Research, 408, 108294. https://doi.org/10.1016/j.heares.2021.108294
Funding
This research work was not supported by any funding agency.
Author information
Contributions
The authors jointly implemented "G-Cocktail: An Algorithm to Address the Cocktail Party Problem of Gujarati Language Using CatBoost".
Ethics declarations
Conflict of interest
The authors declare that they have no conflicts of interest or competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Gupta, M., Singh, R.K. & Singh, S. G-Cocktail: An Algorithm to Address Cocktail Party Problem of Gujarati Language Using Cat Boost. Wireless Pers Commun 125, 261–280 (2022). https://doi.org/10.1007/s11277-022-09549-6