
KroneckerBERT: Significant Compression of Pre-trained Language Models Through Kronecker Decomposition and Knowledge Distillation

Marzieh Tahaei, Ella Charlaix, Vahid Nia, Ali Ghodsi, Mehdi Rezagholizadeh


Abstract
The development of over-parameterized pre-trained language models has made a significant contribution toward the success of natural language processing. While over-parameterization of these models is the key to their generalization power, it makes them unsuitable for deployment on low-capacity devices. We push the limits of state-of-the-art Transformer-based pre-trained language model compression using Kronecker decomposition. We present KroneckerBERT, a compressed version of the BERT_BASE model obtained by compressing the embedding layer and the linear mappings in the multi-head attention and feed-forward network modules of the Transformer layers. Our KroneckerBERT is trained via a very efficient two-stage knowledge distillation scheme using far fewer data samples than state-of-the-art models such as MobileBERT and TinyBERT. We evaluate the performance of KroneckerBERT on well-known NLP benchmarks and show that, with compression factors of 7.7x and 21x, it outperforms state-of-the-art compression methods on the GLUE and SQuAD benchmarks. In particular, using only 13% of the teacher model parameters, it retains more than 99% of the accuracy on the majority of GLUE tasks.
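The core compression idea is to replace each large weight matrix W with a Kronecker product A ⊗ B, so that only the two small factors are stored and the forward pass never materializes the full matrix, via the identity (A ⊗ B) vec(X) = vec(B X Aᵀ). The sketch below is a minimal, hypothetical PyTorch layer illustrating this factorization; the class name, shapes, and initialization are assumptions for illustration, not the authors' released implementation, and it omits the two-stage knowledge distillation described in the paper.

import torch
import torch.nn as nn

class KroneckerLinear(nn.Module):
    """Approximates an (m1*m2) x (n1*n2) weight matrix W by A kron B,
    storing m1*n1 + m2*n2 parameters instead of m1*m2*n1*n2."""

    def __init__(self, m1, n1, m2, n2):
        super().__init__()
        self.m1, self.n1, self.m2, self.n2 = m1, n1, m2, n2
        self.A = nn.Parameter(torch.randn(m1, n1) / n1 ** 0.5)  # m1 x n1 factor
        self.B = nn.Parameter(torch.randn(m2, n2) / n2 ** 0.5)  # m2 x n2 factor
        self.bias = nn.Parameter(torch.zeros(m1 * m2))

    def forward(self, x):
        # (A kron B) vec(X) = vec(B X A^T): multiply with the small factors only.
        batch = x.shape[:-1]
        X = x.reshape(-1, self.n1, self.n2).transpose(1, 2)       # (b, n2, n1)
        Y = self.B @ X @ self.A.T                                  # (b, m2, m1)
        return Y.transpose(1, 2).reshape(*batch, self.m1 * self.m2) + self.bias

if __name__ == "__main__":
    layer = KroneckerLinear(m1=4, n1=8, m2=6, n2=12)    # maps 96 -> 24 dims
    x = torch.randn(2, 8 * 12)
    dense = torch.kron(layer.A, layer.B)                # explicit (24, 96) matrix
    ref = x @ dense.T + layer.bias
    assert torch.allclose(layer(x), ref, atol=1e-4)     # efficient form matches

In a real compression setting, A and B would be initialized from an approximation of the pre-trained weight and then refined with distillation from the teacher; the sanity check above only verifies that the factorized forward pass is algebraically equivalent to multiplying by the full Kronecker product.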
Anthology ID:
2022.naacl-main.154
Volume:
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Month:
July
Year:
2022
Address:
Seattle, United States
Editors:
Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
Venue:
NAACL
Publisher:
Association for Computational Linguistics
Pages:
2116–2127
URL:
https://aclanthology.org/2022.naacl-main.154
DOI:
10.18653/v1/2022.naacl-main.154
Cite (ACL):
Marzieh Tahaei, Ella Charlaix, Vahid Nia, Ali Ghodsi, and Mehdi Rezagholizadeh. 2022. KroneckerBERT: Significant Compression of Pre-trained Language Models Through Kronecker Decomposition and Knowledge Distillation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2116–2127, Seattle, United States. Association for Computational Linguistics.
Cite (Informal):
KroneckerBERT: Significant Compression of Pre-trained Language Models Through Kronecker Decomposition and Knowledge Distillation (Tahaei et al., NAACL 2022)
PDF:
https://aclanthology.org/2022.naacl-main.154.pdf
Video:
https://aclanthology.org/2022.naacl-main.154.mp4
Data
CoLA, GLUE, MRPC, MultiNLI, QNLI, SQuAD, SST, SST-2