
Generative attention based framework for implicit language change detection

Published: 21 November 2024

Abstract

Spoken language change detection (LCD) refers to detecting the language switching points in a multilingual speech signal. Most approaches in the literature use an explicit framework that requires modeling intermediate phonemes and senones to discriminate languages. However, such techniques are of limited use for resource-scarce or zero-resource languages. As an alternative, this study explores implicit frameworks for LCD. The focus of this work is detecting language change when a single speaker speaks two languages. In this direction, a subjective study is performed to analyze how humans discriminate between languages. Its outcome suggests that humans require a longer neighborhood duration (more surrounding context) to detect a language change. Initial observations also suggest that detecting language change with the baseline implicit unsupervised distance-based approach is challenging. Inspired by human cognition, prior language knowledge is integrated into the computational framework through a Gaussian mixture model with a universal background model (GMM-UBM), temporal information is captured via attention, and pattern storage is provided by a generative adversarial network (GAN) to enhance language discrimination. Experimental results on the Microsoft code-switched (MSCS) dataset show that, compared to the unsupervised distance-based approach, the performance of the proposed LCD improves relatively by 19.3%, 47.3%, and 50.7% with the GMM-UBM, attention, and GAN-attention based frameworks, respectively.
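The abstract only outlines the modeling pipeline, so the following minimal sketch illustrates, under stated assumptions rather than the authors' actual implementation, how a GMM-UBM based implicit LCD scorer might operate: two language-specific GMMs are warm-started from a shared UBM, test frames are scored under both, and zero crossings of the smoothed log-likelihood-ratio curve mark candidate change points. The function names, dimensions, and use of scikit-learn are illustrative; the paper's attention and GAN components are not reproduced here.

```python
# Minimal sketch (not the authors' implementation) of GMM-UBM based
# language change detection: per-language GMMs warm-started from a UBM,
# followed by a smoothed frame-level log-likelihood-ratio curve whose
# sign changes mark candidate language switching points.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_ubm(features, n_components=64):
    """Fit a universal background model on pooled multilingual features (n_frames, n_dims)."""
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag")
    ubm.fit(features)
    return ubm

def adapt_language_model(ubm, lang_features):
    """Crude stand-in for MAP adaptation: warm-start a language GMM from the UBM parameters."""
    gmm = GaussianMixture(
        n_components=ubm.n_components,
        covariance_type="diag",
        weights_init=ubm.weights_,
        means_init=ubm.means_,
        precisions_init=ubm.precisions_,
    )
    gmm.fit(lang_features)
    return gmm

def detect_change_points(test_features, gmm_l1, gmm_l2, win=50):
    """Smooth the per-frame log-likelihood ratio; its zero crossings are candidate change points."""
    llr = gmm_l1.score_samples(test_features) - gmm_l2.score_samples(test_features)
    smooth = np.convolve(llr, np.ones(win) / win, mode="same")
    return np.where(np.diff(np.sign(smooth)) != 0)[0]
```

Any (n_frames, n_features) acoustic feature matrix (e.g., MFCC-like frames) can be plugged in; the smoothing window length plays the role of the neighborhood duration discussed in the subjective study.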

Highlights

This work proposes implicit model-based frameworks to perform LCD.
Humans' ability to detect speaker and language change is studied.
Motivated by human cognition, the GAN-Attention framework is proposed (an illustrative sketch of attention pooling follows this list).
The proposed framework outperforms the baseline with a relative improvement of 50.7%.
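As a companion to the highlights, the following is a minimal, hypothetical sketch of additive attention pooling over frame embeddings, the kind of temporal aggregation an attention-based framework relies on. Weight matrices, dimensions, and function names are assumptions for illustration only, and the GAN component of the proposed framework is not shown.

```python
# Illustrative sketch only: additive attention pooling over frame embeddings,
# producing one utterance-level vector from a variable-length frame sequence.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(frames, w, v):
    """frames: (T, D); w: (D, H); v: (H,). Returns a (D,) pooled embedding."""
    scores = np.tanh(frames @ w) @ v   # (T,) unnormalized attention scores
    alpha = softmax(scores)            # attention weights over frames
    return alpha @ frames              # weighted sum of frame embeddings
```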



Published In

Digital Signal Processing, Volume 154, Issue C, November 2024, 623 pages

Publisher

Academic Press, Inc., United States

Publication History

Published: 21 November 2024

Author Tags

  1. Spoken language change detection
  2. Generative adversarial network (GAN)
  3. Attention
  4. Speaker change detection

Qualifiers

  • Research-article
