Abstract
Self-supervised learning enables the training of large neural models without large labeled datasets. It has driven breakthroughs in several fields, including computer vision, natural language processing, biology, and speech. In particular, the state of the art in several speech processing applications, such as automatic speech recognition and speaker identification, is held by models whose latent representations are learned with self-supervised approaches. Self-supervised learning for speech comes in several configurations, including contrastive, predictive, and multilingual approaches. Most existing approaches, however, share a crucial limitation: high computational cost. This cost constrains model deployment, the size of training datasets, and the number of research groups that can afford to work with large self-supervised models, and the associated energy consumption also carries an environmental cost. Efforts to reduce these costs include optimizing existing models, improving neural architecture efficiency, improving fine-tuning for speech processing tasks, and increasing data efficiency. Despite this progress, more work is needed to address the high computational cost of self-supervised representation learning.
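To make the fine-tuning efficiency direction more concrete, the sketch below shows a LoRA-style low-rank adapter wrapped around a frozen linear layer. It is a minimal illustration, assuming PyTorch; the `LoRALinear` class name, the rank `r`, and the 768-dimensional layer are hypothetical choices for the example, not taken from any specific model discussed in the article.

```python
# Minimal sketch of parameter-efficient fine-tuning with a LoRA-style
# low-rank update (assumes PyTorch; names and dimensions are illustrative).
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pre-trained weights stay frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus low-rank correction; only A and B receive gradients.
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Usage: wrap a projection inside a pre-trained encoder, then fine-tune
# only the parameters whose requires_grad is still True.
layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")  # ~12k instead of ~590k for this layer
```

Because only the two low-rank matrices receive gradients, the trainable parameter count for this layer drops from roughly 590k to about 12k, which is the kind of saving that parameter-efficient fine-tuning methods target.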
Data availability
Not applicable because no dataset was generated or analyzed during the current study.
Funding
Not applicable.
Author information
Contributions
All authors participated in the literature review, reviewed and approved the final version for publication, and maintain accountability for all aspects of the article, including its integrity and validity.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial or non-financial interests.
Ethical approval
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Lugo, L., Vielzeuf, V. Efficiency-oriented approaches for self-supervised speech representation learning. Int J Speech Technol 27, 765–779 (2024). https://doi.org/10.1007/s10772-024-10121-9