Research article | Open access

Tackling Perception Bias in Unsupervised Phoneme Discovery Using DPGMM-RNN Hybrid Model and Functional Load

Published: 02 December 2020

Abstract

Human perception of phonemes is biased relative to the underlying speech sounds. This lack of correspondence between perceptual phonemes and acoustic signals poses a major challenge for designing unsupervised algorithms that distinguish phonemes from sound. We propose a DPGMM-RNN hybrid model that improves phoneme categorization by relieving the fragmentation problem. We also merge segments with low functional load, that is, the work done by segment contrasts to differentiate between utterances, much as humans convert unambiguous segments into phonemes as units for immediate perception. Our results show that the DPGMM-RNN hybrid model relieves the fragmentation problem and improves phoneme discriminability, and that the minimal functional load merge compresses the segment system while preserving information and phoneme discriminability.
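The functional load notion used above can be made concrete with the standard entropy-based definition: the relative loss in entropy over the utterance distribution when two segments are collapsed into one. The Python sketch below is illustrative only; the toy corpus, unigram utterance model, and function names are assumptions for exposition, not the paper's implementation.

```python
from collections import Counter
from math import log2

def entropy(counts):
    """Shannon entropy (bits) of the distribution given by raw counts."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def functional_load(transcripts, x, y):
    """Relative entropy loss when segments x and y are merged (assumed
    entropy-based definition of functional load).

    transcripts: iterable of utterances, each a tuple of segment labels.
    Returns a value in [0, 1]; 0 means the x/y contrast never
    distinguishes any pair of utterances in the corpus.
    """
    before = Counter(transcripts)
    # Relabel every occurrence of y as x and recount utterance types.
    after = Counter(tuple(x if s == y else s for s in utt) for utt in transcripts)
    h_before = entropy(before)
    if h_before == 0:
        return 0.0
    return (h_before - entropy(after)) / h_before

# Toy corpus: the p/b contrast separates "pat" from "bat", so merging
# p and b collapses two utterance types into one.
utts = [("p", "a", "t"), ("b", "a", "t"), ("c", "a", "t"), ("p", "a", "t")]
print(functional_load(utts, "p", "b"))  # ~0.46: this contrast carries load
print(functional_load(utts, "b", "c"))  # ~0.33: lower load, a better merge candidate
```

Under this definition, a minimal functional load merge greedily picks the segment pair whose collapse loses the least information about utterance identity, which is how it can compress the segment system while keeping discriminability.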


Cited By

• (2022) Modeling Unsupervised Empirical Adaptation by DPGMM and DPGMM-RNN Hybrid Model to Extract Perceptual Features for Low-Resource ASR. IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 30, pp. 901–916. DOI: 10.1109/TASLP.2022.3150220. Online publication date: 10 Feb 2022.



Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 29, 2021
3717 pages
ISSN: 2329-9290
EISSN: 2329-9304

            Publisher

            IEEE Press

            Publication History

            Published: 02 December 2020
            Published in TASLP Volume 29

            Qualifiers

            • Research-article


