Abstract
Knowledge distillation is an effective way to transfer knowledge from a large model to a small one and can significantly improve the small model's performance. In recent years, contrastive-learning-based knowledge distillation methods (e.g., SSKD and HSAKD) have achieved excellent performance by exploiting data augmentation. However, the value of data augmentation has long been overlooked in knowledge distillation, and no prior work analyzes its role in detail. To fill this gap, we analyze the effect of data augmentation on knowledge distillation from a multi-sided perspective. In particular, we demonstrate the following properties of data augmentation: (a) data augmentation can effectively help knowledge distillation work even if the teacher model has no information about the augmented samples, and our proposed diverse and rich Joint Data Augmentation (JDA) is more effective than rotation alone in knowledge distillation; (b) using diverse and rich augmented samples to assist the teacher model's training improves its own performance, but not the performance of the student model; (c) the student model achieves excellent performance when the proportion of augmented samples lies within a suitable range; (d) data augmentation enables knowledge distillation to work better in few-shot scenarios; (e) data augmentation is seamlessly compatible with some knowledge distillation methods and can potentially further improve their performance. Motivated by this analysis, we propose Cosine Confidence Distillation (CCD), which transfers the knowledge of augmented samples more reasonably. CCD outperforms the latest SOTA, HSAKD, with lower storage requirements on CIFAR-100 and ImageNet-1k. Our code is released at https://github.com/liwei-group/CCD.
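To make the idea behind a cosine-based confidence weight more concrete, the sketch below shows one way such a weight could modulate the distillation loss on augmented samples: each augmented sample's KL term is scaled by the cosine similarity between the teacher's predictions on the original and augmented views. The function name ccd_loss_sketch and this particular weighting are illustrative assumptions based only on the method's name, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def ccd_loss_sketch(t_logits_orig, t_logits_aug, s_logits_aug, tau=4.0):
    """Hypothetical cosine-confidence-weighted KD term for augmented samples.

    t_logits_orig: teacher logits on the original images   (B, C)
    t_logits_aug:  teacher logits on the augmented images  (B, C)
    s_logits_aug:  student logits on the augmented images  (B, C)
    """
    p_t_orig = F.softmax(t_logits_orig / tau, dim=1)
    p_t_aug = F.softmax(t_logits_aug / tau, dim=1)
    log_p_s = F.log_softmax(s_logits_aug / tau, dim=1)

    # Per-sample cosine similarity between the teacher's predictions on the
    # original and augmented views, clamped to [0, 1] and used as a confidence weight.
    w = F.cosine_similarity(p_t_orig, p_t_aug, dim=1).clamp(min=0.0)

    # Per-sample KL divergence KL(p_t_aug || p_s_aug), scaled by tau^2 as usual in KD.
    kl = F.kl_div(log_p_s, p_t_aug, reduction="none").sum(dim=1) * tau ** 2
    return (w * kl).mean()
```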
Notes
- 1.
For simplicity, \(\mathcal {L}_{kl\_q}\) and \(\mathcal {L}_{kl\_p}\) here include an additional expectation compared to the original paper (a generic form of such a term is sketched after these notes).
- 2.
JDA is not added to SSKD and HSAKD because these methods already use rotation as their data augmentation; forcing the inclusion of JDA would destroy their original character.
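For reference, an expectation-form KL distillation term typically has the following shape, where \(q^{t}_{\tau }\) and \(q^{s}_{\tau }\) denote the temperature-softened teacher and student distributions; these symbols are assumed here and are not the paper's exact notation:

\(\mathcal {L}_{kl\_q} = \mathbb {E}_{x \sim \mathcal {D}}\big [\, \tau ^{2}\, \mathrm {KL}\big ( q^{t}_{\tau }(x)\, \Vert \, q^{s}_{\tau }(x) \big ) \big ]\)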
References
Beyer, L., Zhai, X., Royer, A., Markeeva, L., Anil, R., Kolesnikov, A.: Knowledge distillation: a good teacher is patient and consistent. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10925–10934 (2022)
Chen, D., Mei, J.P., Wang, C., Feng, Y., Chen, C.: Online knowledge distillation with diverse peers. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 3430–3437 (2020)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
Yang, C., An, Z., Cai, L., Xu, Y.: Hierarchical self-supervised augmented knowledge distillation. In: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI), pp. 1217–1223 (2021)
Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703 (2020)
Das, D., Massa, H., Kulkarni, A., Rekatsinas, T.: An empirical analysis of the impact of data augmentation on distillation (2020)
Dosovitskiy, A., et al.: An image is worth 16\(\times \)16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2020)
Fu, J., et al.: Role-wise data augmentation for knowledge distillation. arXiv preprint arXiv:2004.08861 (2020)
Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: a survey. Int. J. Comput. Vision 129(6), 1789–1819 (2021)
Guo, Q., et al.: Online knowledge distillation via collaborative learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11020–11029 (2020)
Guo, S.: Dpn: detail-preserving network with high resolution representation for efficient segmentation of retinal vessels. J. Ambient Intell. Hum. Comput., 1–14 (2021)
Han, J., et al.: You only cut once: boosting data augmentation with a single cut (2022)
He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network (2015). https://doi.org/10.48550/ARXIV.1503.02531, https://arxiv.org/abs/1503.02531
Ho, D., Liang, E., Chen, X., Stoica, I., Abbeel, P.: Population based augmentation: efficient learning of augmentation policy schedules. In: International Conference on Machine Learning, pp. 2731–2741. PMLR (2019)
Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
Lim, S., Kim, I., Kim, T., Kim, C., Kim, S.: Fast autoaugment. Adv. Neural Inf. Process. Syst. 32, 1–11 (2019)
Liu, S., Tian, Y., Chen, T., Shen, L.: Don’t be so dense: sparse-to-sparse gan training without sacrificing performance. Int. J. Comput. Vision 20(X) (2022)
Liu, S., et al.: Paint transformer: feed forward neural painting with stroke prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6598–6607 (2021)
Liu, Z., Farrell, J., Wandell, B.A.: Isetauto: detecting vehicles with depth and radiance information. IEEE Access 9, 41799–41808 (2021)
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Matsubara, Y.: torchdistill: a modular, configuration-driven framework for knowledge distillation. In: Kerautret, B., Colom, M., Krähenbühl, A., Lopresti, D., Monasse, P., Talbot, H. (eds.) RRPR 2021. LNCS, vol. 12636, pp. 24–44. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-76423-4_3
Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)
Peng, B., et al.: Correlation congruence for knowledge distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5007–5016 (2019)
Razavi, M., Alikhani, H., Janfaza, V., Sadeghi, B., Alikhani, E.: An automatic system to monitor the physical distance and face mask wearing of construction workers in covid-19 pandemic. SN Comput. Sci. 3(1), 1–8 (2022)
Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015)
Sharma, S.: Game theory for adversarial attacks and defenses. arXiv preprint arXiv:2110.06166 (2021)
Singh, B., Najibi, M., Davis, L.S.: Sniper: efficient multi-scale training. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. (2018). https://proceedings.neurips.cc/paper/2018/file/166cee72e93a992007a89b39eb29628b-Paper.pdf
Sun, C., Shrivastava, A., Singh, S., Gupta, A.: Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)
Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. In: International Conference on Learning Representations (2019)
Tung, F., Mori, G.: Similarity-preserving knowledge distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1365–1374 (2019)
Wang, H., Lohit, S., Jones, M., Fu, Y.: Knowledge distillation thrives on data augmentation. arXiv preprint arXiv:2012.02909 (2020)
Wieczorek, M., Rychalska, B., Dąbrowski, J.: On the unreasonable effectiveness of centroids in image retrieval. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds.) ICONIP 2021. LNCS, vol. 13111, pp. 212–223. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-92273-3_18
Xu, G., Liu, Z., Li, X., Loy, C.C.: Knowledge distillation meets self-supervision. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12354, pp. 588–604. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58545-7_34
Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023–6032 (2019)
Zagoruyko, S., Komodakis, N.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In: International Conference on Learning Representations (ICLR) (2016)
Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC (2016)
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: beyond empirical risk minimization. In: International Conference on Learning Representations (2018)
Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856 (2018)
Zhang, Y., Xiang, T., Hospedales, T.M., Lu, H.: Deep mutual learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
Zhao, B., Cui, Q., Song, R., Qiu, Y., Liang, J.: Decoupled knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11953–11962 (2022)
Acknowledgement
This work was supported in part by the Aeronautical Science Foundation of China under Grant 20200058069001 and in part by the Fundamental Research Funds for the Central Universities under Grant 2242021R41094.
Appendices
A Hyperparameter Settings
All experiments in this paper follow the settings in Table 7. In particular, we keep the batch size the same across experiments with different \(n_{step}\). For our proposed JDA, the magnitudes of the corresponding sub-policies are shown in Table 8; most of these settings are adapted from AutoAugment.
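As a purely illustrative example of how a joint augmentation policy built from several sub-policies can be composed with torchvision, consider the sketch below. The operations and magnitudes here are placeholders and are not the values listed in Table 8.

```python
import random
from torchvision import transforms

# Hypothetical sub-policies: each is a short sequence of augmentation operations.
# These are placeholders, not the sub-policies or magnitudes of Table 8.
SUB_POLICIES = [
    [transforms.RandomRotation(degrees=15), transforms.ColorJitter(brightness=0.4)],
    [transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)), transforms.RandomGrayscale(p=1.0)],
    [transforms.RandomHorizontalFlip(p=1.0), transforms.GaussianBlur(kernel_size=3)],
]

class JointAugmentation:
    """Apply one randomly chosen sub-policy to each input image."""

    def __call__(self, img):
        policy = random.choice(SUB_POLICIES)
        for op in policy:
            img = op(img)
        return img

# Typical usage inside a CIFAR-100 training transform:
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    JointAugmentation(),
    transforms.ToTensor(),
])
```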
B Additional Method Comparisons
First, Table 9 presents a comparison of the computational cost of SOTA knowledge distillation algorithms. Second, Table 10 compares the performance of JDA and AutoAugment in knowledge distillation. The results show that JDA and CCD each achieve the best performance in their respective comparisons.
C Additional Visualization