
Utterance Clustering Using Stereo Audio Channels

Published: 01 January 2021

Abstract

Utterance clustering is an actively researched topic in audio signal processing and machine learning. This study aims to improve the performance of utterance clustering by processing multichannel (stereo) audio signals. Processed audio signals were generated by combining the left- and right-channel signals in a few different ways, and embedded features (also called d-vectors) were then extracted from those processed signals. A Gaussian mixture model (GMM) was applied for supervised utterance clustering: in the training phase, a parameter-sharing GMM was trained for each speaker, and in the testing phase, the speaker whose model yielded the maximum likelihood was selected as the detected speaker. Experiments with real audio recordings of multiperson discussion sessions showed that the proposed method using multichannel audio signals achieved significantly better performance than a conventional method using mono audio signals under more complicated conditions.
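The pipeline described above can be illustrated with a short script. This is a minimal sketch, not the authors' implementation: it assumes librosa and scikit-learn are available, averages the two stereo channels as one simple way of combining them, and uses mean MFCC vectors as a stand-in for the d-vector embeddings used in the paper; the function names and file paths are hypothetical.

```python
# Minimal sketch of the abstract's pipeline: combine stereo channels, embed each
# utterance, fit one GMM per speaker, and identify test utterances by maximum
# likelihood. Mean MFCCs stand in for d-vectors; this is illustrative only.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture


def combine_channels(path, sr=16000):
    """Load a stereo file and average the two channels (one simple way to combine them)."""
    y, _ = librosa.load(path, sr=sr, mono=False)  # shape (2, n_samples) for stereo input
    return y.mean(axis=0) if y.ndim == 2 else y


def embed(signal, sr=16000):
    """Stand-in utterance embedding: mean MFCC vector (the paper uses d-vectors)."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20)  # (20, n_frames)
    return mfcc.mean(axis=1)


def train_speaker_gmms(train_files_by_speaker, n_components=4):
    """Training phase: fit one GMM per known speaker on that speaker's utterances.

    Assumes several training utterances per speaker (at least n_components).
    """
    gmms = {}
    for speaker, paths in train_files_by_speaker.items():
        X = np.stack([embed(combine_channels(p)) for p in paths])
        gmms[speaker] = GaussianMixture(n_components=n_components,
                                        covariance_type="diag").fit(X)
    return gmms


def identify(path, gmms):
    """Testing phase: return the speaker whose GMM gives the maximum log-likelihood."""
    x = embed(combine_channels(path)).reshape(1, -1)
    return max(gmms, key=lambda spk: gmms[spk].score(x))


# Example usage (hypothetical file paths):
# gmms = train_speaker_gmms({"alice": ["alice_01.wav", "alice_02.wav"],
#                            "bob":   ["bob_01.wav", "bob_02.wav"]})
# print(identify("unknown_utterance.wav", gmms))
```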



Published In

Computational Intelligence and Neuroscience, Volume 2021
ISSN: 1687-5265 | EISSN: 1687-5273
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Publisher

Hindawi Limited

London, United Kingdom
