
Utterance Clustering Using Stereo Audio Channels

Published: 01 January 2021

Abstract

Utterance clustering is an actively researched topic in audio signal processing and machine learning. This study aims to improve the performance of utterance clustering by processing multichannel (stereo) audio signals. Processed audio signals were generated by combining the left- and right-channel signals in a few different ways, and embedded features (also called d-vectors) were then extracted from those processed signals. A Gaussian mixture model (GMM) was applied for supervised utterance clustering: in the training phase, a parameter-sharing GMM was trained for each speaker, and in the testing phase, the speaker whose model yielded the maximum likelihood was selected as the detected speaker. Experiments with real audio recordings of multiperson discussion sessions showed that the proposed method using multichannel audio signals achieved significantly better performance than a conventional method using mono audio signals under more complicated conditions.
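The pipeline described above can be illustrated with a short script. This is a minimal sketch, not the authors' implementation: it assumes librosa and scikit-learn are available, averages the two stereo channels as one simple way of combining them, and uses mean MFCC vectors as a stand-in for the d-vector embeddings used in the paper; the function names and file paths are hypothetical.

```python
# Minimal sketch of the abstract's pipeline: combine stereo channels, embed each
# utterance, fit one GMM per speaker, and identify test utterances by maximum
# likelihood. Mean MFCCs stand in for d-vectors; this is illustrative only.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture


def combine_channels(path, sr=16000):
    """Load a stereo file and average the two channels (one simple way to combine them)."""
    y, _ = librosa.load(path, sr=sr, mono=False)  # shape (2, n_samples) for stereo input
    return y.mean(axis=0) if y.ndim == 2 else y


def embed(signal, sr=16000):
    """Stand-in utterance embedding: mean MFCC vector (the paper uses d-vectors)."""
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=20)  # (20, n_frames)
    return mfcc.mean(axis=1)


def train_speaker_gmms(train_files_by_speaker, n_components=4):
    """Training phase: fit one GMM per known speaker on that speaker's utterances.

    Assumes several training utterances per speaker (at least n_components).
    """
    gmms = {}
    for speaker, paths in train_files_by_speaker.items():
        X = np.stack([embed(combine_channels(p)) for p in paths])
        gmms[speaker] = GaussianMixture(n_components=n_components,
                                        covariance_type="diag").fit(X)
    return gmms


def identify(path, gmms):
    """Testing phase: return the speaker whose GMM gives the maximum log-likelihood."""
    x = embed(combine_channels(path)).reshape(1, -1)
    return max(gmms, key=lambda spk: gmms[spk].score(x))


# Example usage (hypothetical file paths):
# gmms = train_speaker_gmms({"alice": ["alice_01.wav", "alice_02.wav"],
#                            "bob":   ["bob_01.wav", "bob_02.wav"]})
# print(identify("unknown_utterance.wav", gmms))
```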



Published In

Computational Intelligence and Neuroscience, Volume 2021
ISSN: 1687-5265 | EISSN: 1687-5273
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Publisher

Hindawi Limited

London, United Kingdom
