Efficiency-oriented approaches for self-supervised speech representation learning

Published in: International Journal of Speech Technology

Abstract

Self-supervised learning enables the training of large neural models without large labeled datasets. It has produced breakthroughs in several fields, including computer vision, natural language processing, biology, and speech. In particular, the state of the art in several speech processing applications, such as automatic speech recognition and speaker identification, is held by models whose latent representations are learned with self-supervised approaches. Several configurations exist for self-supervised learning in speech, including contrastive, predictive, and multilingual approaches. Most existing approaches, however, share a crucial limitation: their high computational cost. This cost restricts the deployment of models, the size of the training datasets, and the number of research groups that can afford to work with large self-supervised models. It also carries an environmental cost through high energy consumption. Efforts to reduce these costs include the optimization of existing models, efficient neural architectures, improved fine-tuning for speech processing tasks, and data efficiency. Despite these efforts, more work is needed to address the high computational costs of self-supervised representation learning.
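
Among the configurations mentioned above, contrastive pre-training (as used in models such as wav2vec 2.0) is a common baseline for the efficiency discussion. The sketch below shows a minimal InfoNCE-style contrastive objective in PyTorch; the function name, tensor shapes, and temperature value are illustrative assumptions, not any published model's implementation.

```python
# Minimal sketch of an InfoNCE-style contrastive objective, of the kind used in
# self-supervised speech models such as wav2vec 2.0. Function name, tensor
# shapes, and the temperature value are illustrative assumptions only.
import torch
import torch.nn.functional as F

def contrastive_loss(context, positives, negatives, temperature=0.1):
    # context:   (B, T, D) context-network outputs at masked time steps
    # positives: (B, T, D) true (e.g. quantized) latents for the same steps
    # negatives: (B, T, K, D) K distractor latents sampled from other steps
    pos_sim = F.cosine_similarity(context, positives, dim=-1).unsqueeze(-1)  # (B, T, 1)
    neg_sim = F.cosine_similarity(context.unsqueeze(2), negatives, dim=-1)   # (B, T, K)
    logits = torch.cat([pos_sim, neg_sim], dim=-1) / temperature             # (B, T, K+1)
    # The correct class is always index 0 (the positive latent).
    labels = torch.zeros(logits.shape[:-1], dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten())

# Toy usage with random tensors standing in for real features.
B, T, K, D = 2, 50, 10, 256
loss = contrastive_loss(torch.randn(B, T, D), torch.randn(B, T, D),
                        torch.randn(B, T, K, D))
print(loss.item())
```

The cost drivers visible even in this toy example, namely the sequence length T, the number of negatives K, the representation width D, and the numerical precision of the tensors, are the kinds of levers that efficiency-oriented approaches target.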

Data availability

Not applicable because no dataset was generated or analyzed during the current study.

Funding

Not applicable.

Author information

Contributions

All authors participated in the literature review. All authors have reviewed and approved the final version for publication and maintain accountability for all aspects of the article, including integrity and validity.

Corresponding author

Correspondence to Luis Lugo.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial or non-financial interests.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lugo, L., Vielzeuf, V. Efficiency-oriented approaches for self-supervised speech representation learning. Int J Speech Technol 27, 765–779 (2024). https://doi.org/10.1007/s10772-024-10121-9
