Research Article | Open Access

Modeling Unsupervised Empirical Adaptation by DPGMM and DPGMM-RNN Hybrid Model to Extract Perceptual Features for Low-Resource ASR

Published: 10 February 2022

Abstract

Speech feature extraction is critical for ASR systems. Successful features such as MFCC and PLP use filterbank techniques to model log-scaled speech perception, but they do not model how human speech perception adapts with hearing experience. Infant perception, adapted by hearing speech without text, may produce lasting brain-state modifications (engrams) that serve as a physical basis for the lifelong formation of speech perception. This motivates us to model such an unsupervised adaptation process, where adaptation denotes perception shaped by the history of experience, with the Dirichlet process Gaussian mixture model (DPGMM) and a DPGMM-RNN hybrid model to extract perceptual features that improve ASR. The proposed features extend MFCC features with posteriorgrams extracted from the DPGMM algorithm or the DPGMM-RNN hybrid model. Our analysis shows that the DPGMM and DPGMM-RNN model perplexities agree with infant auditory perplexity, supporting the claim that the proposed features are perceptual. Our ASR results verify the effectiveness of the proposed unsupervised features on tasks such as LVCSR on WSJ and ASR on noisy low-resource telephone conversations, compared with supervised bottleneck features from Kaldi.
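
The proposed features append a frame-level DPGMM posteriorgram (the per-frame posterior over discovered mixture components) to the standard MFCC vector. As a minimal illustrative sketch only, not the authors' Kaldi-based recipe or their parallel Gibbs-sampled DPGMM, the Python snippet below approximates the idea with scikit-learn's truncated Dirichlet-process mixture and librosa MFCCs; the file name, truncation level, and MFCC configuration are assumptions.

```python
# Hedged sketch: MFCCs extended with an unsupervised DPGMM posteriorgram.
# Not the paper's pipeline (which uses Kaldi features and a parallel Gibbs
# sampler); file path, truncation level, and MFCC settings are assumed.
import numpy as np
import librosa
from sklearn.mixture import BayesianGaussianMixture

# Compute 13-dimensional MFCC frames for one utterance (hypothetical file).
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T          # (frames, 13)

# Truncated Dirichlet-process GMM fitted without labels; each surviving
# component acts as a discovered acoustic unit.
dpgmm = BayesianGaussianMixture(
    n_components=100,                                  # truncation level (assumed)
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=200,
    random_state=0,
)
dpgmm.fit(mfcc)

# Frame-level posteriorgram: P(component | frame).
posteriorgram = dpgmm.predict_proba(mfcc)                     # (frames, 100)

# Feature layout described in the abstract: MFCC extended with posteriorgram.
features = np.hstack([mfcc, posteriorgram])                   # (frames, 113)
print(features.shape)
```

In practice the mixture would be fitted on MFCC frames pooled from the whole untranscribed training corpus and then applied to every utterance, and the DPGMM-RNN hybrid would further refine these posteriors with a recurrent network before concatenation; both steps are omitted here for brevity.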


Cited By

  • (2024) Speaking of accent: A content analysis of accent misconceptions in ASR research. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pp. 1245–1254. DOI: 10.1145/3630106.3658969. Online publication date: 3 June 2024.


Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 30, 2022, 3239 pages
ISSN: 2329-9290
EISSN: 2329-9304

            Publisher

            IEEE Press

            Publication History

            Published: 10 February 2022
            Published in TASLP Volume 30

            Qualifiers

            • Research-article

