Research Article | Open Access

Modeling Unsupervised Empirical Adaptation by DPGMM and DPGMM-RNN Hybrid Model to Extract Perceptual Features for Low-Resource ASR

Published: 10 February 2022

Abstract

Speech feature extraction is critical for ASR systems. Successful features such as MFCC and PLP use filterbank techniques to model log-scaled speech perception, but they do not model how human speech perception adapts with hearing experience. Infant perception, adapted by hearing speech without text, may produce lasting brain-state modifications (engrams) that serve as a physical basis for the lifelong formation of speech perception. This motivates us to model such an unsupervised adaptation process, where adaptation denotes perception shaped by the history of experience, with the Dirichlet process Gaussian mixture model (DPGMM) and a DPGMM-RNN hybrid model to extract perceptual features that improve ASR. The proposed features extend MFCC features with posteriorgrams extracted from the DPGMM algorithm or the DPGMM-RNN hybrid model. Our analysis shows that the DPGMM and DPGMM-RNN model perplexities agree with infant auditory perplexity, supporting the claim that the proposed features are perceptual. Our ASR results verify the effectiveness of the proposed unsupervised features on tasks such as LVCSR on WSJ and ASR on noisy low-resource telephone conversations, compared with supervised bottleneck features from Kaldi.
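
The proposed features append a frame-level DPGMM posteriorgram (the per-frame posterior over discovered mixture components) to the standard MFCC vector. As a minimal illustrative sketch only, not the authors' Kaldi-based recipe or their parallel Gibbs-sampled DPGMM, the Python snippet below approximates the idea with scikit-learn's truncated Dirichlet-process mixture and librosa MFCCs; the file name, truncation level, and MFCC configuration are assumptions.

```python
# Hedged sketch: MFCCs extended with an unsupervised DPGMM posteriorgram.
# Not the paper's pipeline (which uses Kaldi features and a parallel Gibbs
# sampler); file path, truncation level, and MFCC settings are assumed.
import numpy as np
import librosa
from sklearn.mixture import BayesianGaussianMixture

# Compute 13-dimensional MFCC frames for one utterance (hypothetical file).
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T          # (frames, 13)

# Truncated Dirichlet-process GMM fitted without labels; each surviving
# component acts as a discovered acoustic unit.
dpgmm = BayesianGaussianMixture(
    n_components=100,                                  # truncation level (assumed)
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="diag",
    max_iter=200,
    random_state=0,
)
dpgmm.fit(mfcc)

# Frame-level posteriorgram: P(component | frame).
posteriorgram = dpgmm.predict_proba(mfcc)                     # (frames, 100)

# Feature layout described in the abstract: MFCC extended with posteriorgram.
features = np.hstack([mfcc, posteriorgram])                   # (frames, 113)
print(features.shape)
```

In practice the mixture would be fitted on MFCC frames pooled from the whole untranscribed training corpus and then applied to every utterance, and the DPGMM-RNN hybrid would further refine these posteriors with a recurrent network before concatenation; both steps are omitted here for brevity.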


Cited By

  • (2024) Speaking of accent: A content analysis of accent misconceptions in ASR research. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pp. 1245–1254. DOI: 10.1145/3630106.3658969. Online publication date: 3 June 2024.


Published In

IEEE/ACM Transactions on Audio, Speech and Language Processing, Volume 30, 2022, 3239 pages
ISSN: 2329-9290
EISSN: 2329-9304

            Publisher

            IEEE Press

            Publication History

            Published: 10 February 2022
            Published in TASLP Volume 30

            Qualifiers

            • Research-article

