Speaker Counting Based on a Novel Hive Shaped Nested Microphone Array by WPT and 2D Adaptive SRP Algorithms in Near-Field Scenarios
Figure 1. The near-field model for simulating the signals recorded by the microphone array.
Figure 2. The structure of the proposed hive-shaped nested microphone array (HNMA) for estimating the number of speakers.
Figure 3. The microphone pairs of the proposed HNMA for the (a) first sub-array, (b) second sub-array, (c) third sub-array, and (d) fourth sub-array.
Figure 4. (a) A tree structure for implementing the analysis filter bank, and (b) the frequency spectrum of the analysis filter for each sub-array of the HNMA.
Figure 5. The block diagram of the proposed NoS estimation system based on Blackman–Tukey spectral estimation, WPT, and SB-2DASRP with agglomerative classification and the elbow criterion.
Figure 6. The 2-level decomposition of the 2-channel WPT for the speech signal based on a recursive filter bank.
Figure 7. A view of the simulated room with the positions of the HNMA and five simultaneous speakers.
Figure 8. The acoustical room at the University of Sydney used for real data recording.
Figure 9. The speech signal for (a) each speaker and (b) the overlapped speech signals for two, three, four, and five simultaneous speakers in the simulations.
Figure 10. The spectral estimation of the speech signal by the Blackman–Tukey method with the proposed threshold for eliminating the undesirable frequency components.
Figure 11. The energy map of the SB-2DASRP function for extracting the peak positions for (a) two, (b) three, (c) four, and (d) five simultaneous speakers.
Figure 12. The elbow curves for estimating the number of speakers in various time frames of overlapped speech signals for (a) two, (b) three, (c) four, and (d) five simultaneous speakers.
Figure 13. The percentage of correct number-of-speakers estimates for 5 overlapped speech signals on real and simulated data for the proposed HNMA-SB-2DASRP method in comparison with the FD-MSC, i-vector PLDA, AF-CRNN, and SC-DCCD algorithms for (a) fixed $\mathrm{SNR} = 15$ dB and variable $0 \le RT_{60} \le 800$ ms, and (b) fixed $RT_{60} = 600$ ms and variable $-10 \le \mathrm{SNR} \le 20$ dB.
Figure 14. The percentage of correct number-of-speakers estimates for the proposed HNMA-SB-2DASRP method in comparison with the FD-MSC, i-vector PLDA, AF-CRNN, and SC-DCCD algorithms in noisy and reverberant environments ($\mathrm{SNR} = 5$ dB and $RT_{60} = 600$ ms) for 2 to 5 simultaneous speakers for (a) simulated data and (b) real data.
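One of the captions above describes a 2-level decomposition of a 2-channel WPT built from a recursive filter bank. A minimal sketch of that idea follows. It assumes Haar analysis filters, which the source does not specify, and the names `haar_split` and `wpt` are hypothetical helpers introduced only for illustration. Unlike a plain DWT, every node (not only the low-pass branch) is split again at each level.

```python
import math

def haar_split(x):
    """One analysis stage: Haar low-pass and high-pass filtering,
    each followed by downsampling by 2 (Haar is an assumption here)."""
    s = 1.0 / math.sqrt(2.0)
    low = [s * (x[i] + x[i + 1]) for i in range(0, len(x) - 1, 2)]
    high = [s * (x[i] - x[i + 1]) for i in range(0, len(x) - 1, 2)]
    return low, high

def wpt(x, levels):
    """Full wavelet packet tree: split every node at every level.
    Sub-bands are returned in natural (filter-bank) order,
    which is not the same as increasing-frequency order."""
    nodes = [list(x)]
    for _ in range(levels):
        next_nodes = []
        for node in nodes:
            low, high = haar_split(node)
            next_nodes.extend([low, high])
        nodes = next_nodes
    return nodes

# Toy input; a real speech frame would be much longer.
signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
subbands = wpt(signal, levels=2)  # 2 levels -> 4 sub-bands
for index, band in enumerate(subbands):
    print(index, band)
```

Because the Haar pair is orthonormal, the total energy of the four sub-bands equals the energy of the input, which is a convenient sanity check for any filter bank substituted here.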
Abstract
1. Introduction
2. State-of-the-Art
3. Microphone Models and the Proposed Nested Microphone Array
3.1. Microphone Signal Model for Real and Ideal Conditions
3.2. The Proposed Hive Shaped Nested Microphone Array
4. The Proposed Speaker Counting Method Based on WPT and Sub-Band-2DASRP Algorithm
4.1. The Blackman–Tukey Spectral Estimation for Multi-Speaker Speech Signal
4.2. Smart Sub-Band Processing for Speech Signal by Wavelet Packet Transform
4.3. The Sub-Band 2D-Adaptive SRP Implementation with PHAT and ML Weighting Functions
4.4. The Unsupervised Agglomerative Classification with Elbow Criteria
5. Results and Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
2DASRP | Two-dimensional adaptive steered response power |
AF-CRNN | Ambisonics features of the convolutional recurrent neural networks |
CWT | Continuous wavelet transform |
DWT | Discrete wavelet transform |
FD-MSC | Frequency domain magnitude squared coherence |
FSB | Filter-and-sum beamformer |
HNMA | Hive shaped nested microphone array |
HPF | High-pass filter |
LPF | Low-pass filter |
ML | Maximum likelihood |
NoS | Number of speakers |
PHAT | Phase transform |
PLDA | Probabilistic linear discriminant analysis |
SC-DCCD | Speaker counting based on density clustering and classification decision |
SD | Standard deviation |
SRP | Steered response power |
SSL | Sound source localization |
TF | Time–frequency |
WPT | Wavelet packet transform |
Computational complexity (program run-time in seconds).

Data | Scenario | FD-MSC | i-vector PLDA | AF-CRNN | SC-DCCD | Proposed HNMA-SB-2DASRP |
---|---|---|---|---|---|---|
Simulated | Noisy | 43 | 35 | 64 | 37 | 21 |
Simulated | Reverberant | 46 | 36 | 69 | 34 | 26 |
Simulated | Noisy–Reverberant | 52 | 44 | 78 | 42 | 32 |
Real | Noisy | 48 | 42 | 69 | 38 | 25 |
Real | Reverberant | 51 | 48 | 73 | 36 | 29 |
Real | Noisy–Reverberant | 54 | 47 | 85 | 39 | 36 |
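To make the run-time table above easier to interpret, the short sketch below computes, for each scenario, how much faster the proposed HNMA-SB-2DASRP method is than the fastest competing method. The numbers are copied from the table. The helper `reduction_percent` and the dictionary layout are illustrative assumptions, not part of the original work.

```python
# Run-times in seconds, copied from the table above.
# Baseline order: FD-MSC, i-vector PLDA, AF-CRNN, SC-DCCD.
run_times = {
    ("Simulated", "Noisy"): ([43, 35, 64, 37], 21),
    ("Simulated", "Reverberant"): ([46, 36, 69, 34], 26),
    ("Simulated", "Noisy-Reverberant"): ([52, 44, 78, 42], 32),
    ("Real", "Noisy"): ([48, 42, 69, 38], 25),
    ("Real", "Reverberant"): ([51, 48, 73, 36], 29),
    ("Real", "Noisy-Reverberant"): ([54, 47, 85, 39], 36),
}

def reduction_percent(baseline_s, proposed_s):
    """Relative run-time reduction of the proposed method, in percent."""
    return 100.0 * (baseline_s - proposed_s) / baseline_s

# Compare against the fastest baseline in each scenario.
reductions = {}
for (data, scenario), (baselines, proposed) in run_times.items():
    fastest = min(baselines)
    reductions[(data, scenario)] = reduction_percent(fastest, proposed)
    print(f"{data:9s} {scenario:18s} "
          f"fastest baseline {fastest:2d} s, proposed {proposed:2d} s, "
          f"reduction {reductions[(data, scenario)]:.1f}%")
```

The comparison uses the fastest baseline rather than the average so that the reported margin is the most conservative one the table supports.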
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Dehghan Firoozabadi, A.; Adasme, P.; Zabala-Blanco, D.; Palacios Játiva, P.; Azurdia-Meza, C. Speaker Counting Based on a Novel Hive Shaped Nested Microphone Array by WPT and 2D Adaptive SRP Algorithms in Near-Field Scenarios. Sensors 2023, 23, 4499. https://doi.org/10.3390/s23094499