Research article | Open access

Tackling Perception Bias in Unsupervised Phoneme Discovery Using DPGMM-RNN Hybrid Model and Functional Load

Published: 02 December 2020

Abstract

Human perception of phonemes is biased relative to the underlying speech sounds. This lack of correspondence between perceptual phonemes and acoustic signals poses a major challenge for designing unsupervised algorithms that distinguish phonemes from sound. We propose a DPGMM-RNN hybrid model that improves phoneme categorization by relieving the fragmentation problem. We also merge segments with low functional load, that is, the work done by segment contrasts to differentiate between utterances, much as humans convert unambiguous segments into phonemes as units for immediate perception. Our results show that the DPGMM-RNN hybrid model relieves the fragmentation problem and improves phoneme discriminability, and that the minimal functional load merge compresses the segment system while preserving information and phoneme discriminability.
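The functional load notion used above can be made concrete with the standard entropy-based definition: the relative loss in entropy over the utterance distribution when two segments are collapsed into one. The Python sketch below is illustrative only; the toy corpus, unigram utterance model, and function names are assumptions for exposition, not the paper's implementation.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of the distribution given by raw counts."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def functional_load(transcripts, x, y):
    """Relative entropy loss when segments x and y are merged (assumed
    entropy-based definition of functional load).

    transcripts: iterable of utterances, each a tuple of segment labels.
    Returns a value in [0, 1]; 0 means the x/y contrast never
    distinguishes any pair of utterances in the corpus.
    """
    before = Counter(transcripts)
    # Relabel every occurrence of y as x and recount utterance types.
    after = Counter(tuple(x if s == y else s for s in utt) for utt in transcripts)
    h_before = entropy(before)
    if h_before == 0:
        return 0.0
    return (h_before - entropy(after)) / h_before

# Toy corpus: the p/b contrast separates "pat" from "bat", so merging
# p and b collapses two utterance types into one.
utts = [("p", "a", "t"), ("b", "a", "t"), ("c", "a", "t"), ("p", "a", "t")]
print(functional_load(utts, "p", "b"))  # ~0.46: this contrast carries load
print(functional_load(utts, "b", "c"))  # ~0.33: lower load, a better merge candidate
```

Under this definition, a minimal functional load merge greedily picks the segment pair whose collapse loses the least information about utterance identity, which is how it can compress the segment system while keeping discriminability.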


Cited By

• (2022) Modeling Unsupervised Empirical Adaptation by DPGMM and DPGMM-RNN Hybrid Model to Extract Perceptual Features for Low-Resource ASR. IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 30, pp. 901–916. DOI: 10.1109/TASLP.2022.3150220. Online publication date: 10 Feb 2022.



Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 29, 2021
3717 pages
ISSN: 2329-9290
EISSN: 2329-9304

            Publisher

            IEEE Press

            Publication History

            Published: 02 December 2020
            Published in TASLP Volume 29

            Qualifiers

            • Research-article


