Abstract
Research and development of speech technology applications for low-resource languages (LRLs) is challenging because suitable speech corpora are not available. For most Indian languages in particular, the amount and type of data found in digital sources are sparse, and prior work is too limited to meet large-scale development needs. This paper describes the creation of such an LRL corpus, comprising sixteen rarely studied Eastern and Northeastern (E&NE) Indian languages, and presents the variability of the data through different statistics. Furthermore, several experiments are carried out using the collected LRL corpus to build baseline speaker identification (SID) and language identification (LID) systems for acceptance evaluation. To investigate the presence of speaker- and language-specific information, spectral features such as Mel frequency cepstral coefficients (MFCCs), shifted delta cepstral (SDC), and relative spectral transform-perceptual linear prediction (RASTA-PLP) features are used. Vector quantization (VQ), Gaussian mixture model (GMM), support vector machine (SVM), and multilayer perceptron (MLP)-based models are developed to represent the speaker- and language-specific information captured by the spectral features. In addition, SID and LID models based on i-vectors, time delay neural networks (TDNNs), and recurrent neural networks with long short-term memory (LSTM-RNNs) are evaluated to reflect recent approaches. The performance of the developed systems is analyzed on the LRL corpus in terms of SID and LID accuracy. The best SID and LID accuracies, 94.49% and 95.69%, respectively, are obtained by the baseline systems using LSTM-RNN with MFCC + SDC features.
Data availability statement
The data that support the findings of this study are available on request from the corresponding author, J. Basu, subject to permission from the funding agency of the project. The data are not publicly available because they were collected under a project funded by the Ministry of Electronics and Information Technology (MeitY), Govt. of India, for applications of speaker and language identification in low-resource Eastern and Northeastern Indian languages.
Acknowledgements
This work is part of the project “Deployment of Automatic Speaker Recognition System on Conversational Speech Data for North-Eastern states”, funded by the Ministry of Electronics and Information Technology (MeitY), Govt. of India. The authors are thankful to the funding agency for its support and cooperation. The authors would like to record their deep appreciation for the unstinted support and cooperation of the authorities and of the students from different linguistic groups at the North Eastern Regional Institute of Science and Technology (NERIST), Arunachal Pradesh, India, during data collection. The authors also acknowledge the contribution of the native speakers from the E&NE Indian states who participated in the data collection task. The authors are thankful to the Centre for Development of Advanced Computing (CDAC), Kolkata, India, for the support needed to carry out the research activity.
Cite this article
Basu, J., Khan, S., Roy, R. et al. Multilingual Speech Corpus in Low-Resource Eastern and Northeastern Indian Languages for Speaker and Language Identification. Circuits Syst Signal Process 40, 4986–5013 (2021). https://doi.org/10.1007/s00034-021-01704-x
DOI: https://doi.org/10.1007/s00034-021-01704-x