Abstract
Knowledge distillation (KD) has shown promise in transferring knowledge from a larger teacher model to a smaller student model. However, a well-known phenomenon in knowledge distillation is that student performance degrades when the teacher-student gap becomes large. We argue that this degradation is mainly attributable to two gaps: the capacity gap and the knowledge gap. In this paper, we introduce Elastic Student Knowledge Distillation (ESKD), a method that combines an Elastic Architecture and an Elastic Learning strategy to bridge both gaps. The Elastic Architecture temporarily increases the number of student parameters during training and reverts to the original size at inference, improving the model's learning ability without adding inference cost. The Elastic Learning strategy introduces a mask matrix and progressive learning, which help the student absorb the teacher's intricate knowledge and act as a regularizer. Extensive experiments on the CIFAR-100 and ImageNet datasets demonstrate that ESKD outperforms existing methods while preserving computational efficiency.
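To make the Elastic Architecture idea concrete, the following is a minimal sketch of RepVGG-style structural re-parameterization, which the abstract's "increase parameters during training, revert at inference" description suggests. It is not the authors' implementation; the block name ElasticBlock and the 3x3 + 1x1 branch choice are illustrative assumptions. A temporary parallel branch adds capacity during training and is algebraically folded into a single convolution before deployment, so the inference-time model is unchanged in size.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticBlock(nn.Module):
    """Illustrative sketch: extra 1x1 branch at training time,
    folded into one 3x3 conv for inference (not the paper's exact design)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 1)   # temporary extra capacity
        self.deployed = None                        # single conv used after folding

    def forward(self, x):
        if self.deployed is not None:
            return F.relu(self.deployed(x))
        return F.relu(self.conv3(x) + self.conv1(x))

    @torch.no_grad()
    def reparameterize(self):
        # Convolution is linear, so the two branches can be merged exactly:
        # pad the 1x1 kernel to 3x3 and add it to the 3x3 kernel; sum the biases.
        w = self.conv3.weight + F.pad(self.conv1.weight, [1, 1, 1, 1])
        b = self.conv3.bias + self.conv1.bias
        self.deployed = nn.Conv2d(self.conv3.in_channels,
                                  self.conv3.out_channels, 3, padding=1)
        self.deployed.weight.copy_(w)
        self.deployed.bias.copy_(b)

block = ElasticBlock(16, 32)
x = torch.randn(2, 16, 8, 8)
y_train = block(x)
block.reparameterize()
y_deploy = block(x)
print(torch.allclose(y_train, y_deploy, atol=1e-5))  # True: same function, single-branch inference
```

The folding step is exact because the sum of two convolutions applied to the same input equals one convolution with the summed (shape-aligned) kernels, which is what allows training-time capacity to be removed without changing the deployed model's outputs.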
Acknowledgements
This work was supported by the Natural Science Foundation of Guangdong (2023A1515012073), the National Natural Science Foundation of China (No. 62006083), and the South China Normal University Student Innovation and Entrepreneurship Training Program (No. 202321004).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, H., Chen, Z., Zhou, J., Li, S. (2023). Reducing the Teacher-Student Gap via Elastic Student. In: Jin, Z., Jiang, Y., Buchmann, R.A., Bi, Y., Ghiran, AM., Ma, W. (eds) Knowledge Science, Engineering and Management. KSEM 2023. Lecture Notes in Computer Science, vol. 14117. Springer, Cham. https://doi.org/10.1007/978-3-031-40283-8_37
DOI: https://doi.org/10.1007/978-3-031-40283-8_37
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-40282-1
Online ISBN: 978-3-031-40283-8
eBook Packages: Computer Science, Computer Science (R0)