
What Role Does Data Augmentation Play in Knowledge Distillation?

  • Conference paper
Computer Vision – ACCV 2022 (ACCV 2022)

Abstract

Knowledge distillation is an effective way to transfer knowledge from a large model to a small one, and it can significantly improve the performance of the small model. In recent years, contrastive-learning-based knowledge distillation methods (e.g., SSKD and HSAKD) have achieved excellent performance by exploiting data augmentation. However, the value of data augmentation has long been overlooked by researchers in knowledge distillation, and no work analyzes its role in particular detail. To fill this gap, we analyze the effect of data augmentation on knowledge distillation from a multi-sided perspective. In particular, we demonstrate the following properties of data augmentation: (a) data augmentation can effectively help knowledge distillation work even if the teacher model has no information about the augmented samples, and our proposed diverse and rich Joint Data Augmentation (JDA) is more effective than rotation alone in knowledge distillation; (b) using diverse and rich augmented samples to assist the teacher model's training improves its own performance, but not the performance of the student model; (c) the student model achieves excellent performance when the proportion of augmented samples is within a suitable range; (d) data augmentation enables knowledge distillation to work better in few-shot scenarios; (e) data augmentation is seamlessly compatible with some knowledge distillation methods and can potentially further improve their performance. Motivated by the above analysis, we propose Cosine Confidence Distillation (CCD) to transfer the knowledge of augmented samples more reasonably. CCD achieves better performance than the latest SOTA HSAKD with fewer storage requirements on CIFAR-100 and ImageNet-1k. Our code is released at https://github.com/liwei-group/CCD.
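As background for the properties above, the following is a minimal, hedged sketch of vanilla knowledge distillation [15] applied to a batch that mixes original and augmented samples (property (a)). It is a generic illustration rather than the paper's JDA or CCD; the names augment, temperature and alpha are illustrative assumptions.

    # Generic sketch (not the paper's JDA or CCD): vanilla knowledge distillation [15]
    # on a batch that mixes original and augmented samples.
    import torch
    import torch.nn.functional as F

    def kd_loss(student_logits, teacher_logits, temperature=4.0):
        # KL divergence between the softened teacher and student distributions.
        log_p_student = F.log_softmax(student_logits / temperature, dim=1)
        p_teacher = F.softmax(teacher_logits / temperature, dim=1)
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

    def distill_step(student, teacher, x, y, augment, alpha=0.9):
        x_aug = augment(x)                    # augmented view the teacher never saw in training
        x_all = torch.cat([x, x_aug], dim=0)
        with torch.no_grad():
            teacher_logits = teacher(x_all)
        student_logits = student(x_all)
        ce = F.cross_entropy(student_logits[: x.size(0)], y)  # hard labels only for the real samples
        return alpha * kd_loss(student_logits, teacher_logits) + (1 - alpha) * ce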


Notes

  1.

    For simplicity, \(\mathcal {L}_{kl\_q}\) and \(\mathcal {L}_{kl\_p}\) here include an additional expectation over samples compared with the original paper; a generic form of such an expectation-based loss is sketched after these notes.

  2.

    JDA is not added to SSKD and HSAKD because these methods already use rotation as their own data augmentation; forcing JDA into them would destroy the original character of these approaches.
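As a hedged illustration of the expectation mentioned in note 1 (the exact definitions of \(\mathcal {L}_{kl\_q}\) and \(\mathcal {L}_{kl\_p}\) follow the original HSAKD formulation), a KL-based distillation term averaged over the data distribution \(\mathcal {D}\) can be written as

\[\mathcal {L}_{kl}=\mathbb {E}_{x\sim \mathcal {D}}\left[ \tau ^{2}\,\mathrm {KL}\!\left( \sigma \left( f_{t}(x)/\tau \right) \,\Vert \,\sigma \left( f_{s}(x)/\tau \right) \right) \right] ,\]

where \(f_{t}\) and \(f_{s}\) denote the teacher and student logits, \(\sigma \) is the softmax and \(\tau \) is the temperature.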

References

  1. Beyer, L., Zhai, X., Royer, A., Markeeva, L., Anil, R., Kolesnikov, A.: Knowledge distillation: a good teacher is patient and consistent. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10925–10934 (2022)

  2. Chen, D., Mei, J.P., Wang, C., Feng, Y., Chen, C.: Online knowledge distillation with diverse peers. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 3430–3437 (2020)

  3. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)

  4. Yang, C., An, Z., Cai, L., Xu, Y.: Hierarchical self-supervised augmented knowledge distillation. In: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence (IJCAI), pp. 1217–1223 (2021)

  5. Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703 (2020)

  6. Das, D., Massa, H., Kulkarni, A., Rekatsinas, T.: An empirical analysis of the impact of data augmentation on distillation (2020)

  7. Dosovitskiy, A., et al.: An image is worth 16\(\times \)16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (2020)

  8. Fu, J., et al.: Role-wise data augmentation for knowledge distillation. arXiv preprint arXiv:2004.08861 (2020)

  9. Gou, J., Yu, B., Maybank, S.J., Tao, D.: Knowledge distillation: a survey. Int. J. Comput. Vision 129(6), 1789–1819 (2021)

  10. Guo, Q., et al.: Online knowledge distillation via collaborative learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11020–11029 (2020)

  11. Guo, S.: Dpn: Detail-preserving network with high resolution representation for efficient segmentation of retinal vessels. J. Ambient Intell. Hum. Comput., 1–14 (2021)

  12. Han, J., et al.: You only cut once: boosting data augmentation with a single cut (2022)

  13. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738 (2020)

  14. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

  15. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network (2015). https://doi.org/10.48550/ARXIV.1503.02531, https://arxiv.org/abs/1503.02531

  16. Ho, D., Liang, E., Chen, X., Stoica, I., Abbeel, P.: Population based augmentation: efficient learning of augmentation policy schedules. In: International Conference on Machine Learning, pp. 2731–2741. PMLR (2019)

  17. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)

  18. Lim, S., Kim, I., Kim, T., Kim, C., Kim, S.: Fast autoaugment. Adv. Neural Inf. Process. Syst. 32, 1–11 (2019)

  19. Liu, S., Tian, Y., Chen, T., Shen, L.: Don’t be so dense: sparse-to-sparse gan training without sacrificing performance. Int. J. Comput. Vision 20(X) (2022)

  20. Liu, S., et al.: Paint transformer: feed forward neural painting with stroke prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6598–6607 (2021)

  21. Liu, Z., Farrell, J., Wandell, B.A.: Isetauto: detecting vehicles with depth and radiance information. IEEE Access 9, 41799–41808 (2021)

  22. Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)

  23. Matsubara, Y.: torchdistill: a modular, configuration-driven framework for knowledge distillation. In: Kerautret, B., Colom, M., Krähenbühl, A., Lopresti, D., Monasse, P., Talbot, H. (eds.) RRPR 2021. LNCS, vol. 12636, pp. 24–44. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-76423-4_3

  24. Park, W., Kim, D., Lu, Y., Cho, M.: Relational knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  25. Peng, B., et al.: Correlation congruence for knowledge distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5007–5016 (2019)

  26. Razavi, M., Alikhani, H., Janfaza, V., Sadeghi, B., Alikhani, E.: An automatic system to monitor the physical distance and face mask wearing of construction workers in covid-19 pandemic. SN Comput. Sci. 3(1), 1–8 (2022)

  27. Russakovsky, O.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115(3), 211–252 (2015)

  28. Sharma, S.: Game theory for adversarial attacks and defenses. arXiv preprint arXiv:2110.06166 (2021)

  29. Singh, B., Najibi, M., Davis, L.S.: Sniper: efficient multi-scale training. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 31. Curran Associates, Inc. (2018). https://proceedings.neurips.cc/paper/2018/file/166cee72e93a992007a89b39eb29628b-Paper.pdf

  30. Sun, C., Shrivastava, A., Singh, S., Gupta, A.: Revisiting unreasonable effectiveness of data in deep learning era. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2017)

  31. Tian, Y., Krishnan, D., Isola, P.: Contrastive representation distillation. In: International Conference on Learning Representations (2019)

  32. Tung, F., Mori, G.: Similarity-preserving knowledge distillation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1365–1374 (2019)

  33. Wang, H., Lohit, S., Jones, M., Fu, Y.: Knowledge distillation thrives on data augmentation. arXiv preprint arXiv:2012.02909 (2020)

  34. Wieczorek, M., Rychalska, B., Dąbrowski, J.: On the unreasonable effectiveness of centroids in image retrieval. In: Mantoro, T., Lee, M., Ayu, M.A., Wong, K.W., Hidayanto, A.N. (eds.) ICONIP 2021. LNCS, vol. 13111, pp. 212–223. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-92273-3_18

  35. Xu, G., Liu, Z., Li, X., Loy, C.C.: Knowledge distillation meets self-supervision. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12354, pp. 588–604. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58545-7_34

  36. Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023–6032 (2019)

  37. Zagoruyko, S., Komodakis, N.: Paying more attention to attention: improving the performance of convolutional neural networks via attention transfer. In: International Conference on Learning Representations (ICLR) (2016)

  38. Zagoruyko, S., Komodakis, N.: Wide residual networks. In: BMVC (2016)

  39. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: beyond empirical risk minimization. In: International Conference on Learning Representations (2018)

  40. Zhang, X., Zhou, X., Lin, M., Sun, J.: Shufflenet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856 (2018)

  41. Zhang, Y., Xiang, T., Hospedales, T.M., Lu, H.: Deep mutual learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)

  42. Zhao, B., Cui, Q., Song, R., Qiu, Y., Liang, J.: Decoupled knowledge distillation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 11953–11962 (2022)

Acknowledgement

This work was supported in part by the Aeronautical Science Foundation of China under Grant 20200058069001 and in part by the Fundamental Research Funds for the Central Universities under Grant 2242021R41094.

Author information

Corresponding author

Correspondence to Wei Li.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 913 KB)

Appendices

A Hyperparameter Settings

All experiments in this paper follow the settings in Table 7. In particular, we ensure that the batch sizes are identical across experiments with different \(n_{step}\). Furthermore, for our proposed JDA, the magnitudes of the corresponding sub-policies are shown in Table 8; most of these settings are adapted from AutoAugment.

Table 7. Hyperparameter settings when \(n_{step}\) is \(\times 1\); the \(\times 2\) and \(\times 4\) settings are obtained by increasing the number of epochs. Crucially, we use the same real-input batch size for all methods, including JDA and rotation.
Table 8. The 14 sub-policies and their hyperparameter settings used in our experiments. Part of this table is reproduced from [5]; the execution order of the sub-policies can be found in our released code.
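To make the composition concrete, here is a minimal, hedged sketch of a joint augmentation built from simple sub-policies using torchvision; the operators and magnitudes are illustrative and do not reproduce the exact sub-policies, magnitudes, or execution order of Table 8 (those are given in the released code).

    # Illustrative sketch of a joint augmentation composed of sub-policies,
    # in the spirit of JDA; operators and magnitudes are placeholders, not Table 8.
    import random
    from torchvision.transforms import functional as TF

    SUB_POLICIES = [
        lambda img: TF.rotate(img, angle=random.choice([0, 90, 180, 270])),
        lambda img: TF.adjust_brightness(img, brightness_factor=1.3),
        lambda img: TF.adjust_contrast(img, contrast_factor=1.3),
        lambda img: TF.hflip(img),
    ]

    def joint_augment(img, n_ops=2):
        # Apply a randomly chosen subset of sub-policies in sequence.
        for op in random.sample(SUB_POLICIES, k=n_ops):
            img = op(img)
        return img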

B Additional Method Comparisons

First, we present a comparison of the computational cost of SOTA knowledge distillation algorithms in Table 9. Second, we compare the performance of JDA and AutoAugment in knowledge distillation in Table 10. The results show that both JDA and CCD achieve the best performance in their respective comparisons.

Table 9. GFLOPs: giga floating-point operations. We use Facebook's open-source project fvcore to compute GFLOPs; operators that fvcore cannot count are tallied in NUO. NUO: the number of unsupported operators. TP: throughput (images/s). We measured the throughput of all methods end to end on an NVIDIA RTX 3080 Ti, executing each method 5000 times to reduce measurement noise. This table compares related knowledge distillation methods on these additional indicators. In general, response-based methods are more portable and reproducible than feature-based methods, and methods without additional modules are more lightweight to train than methods with them. JDA+CCD requires no additional modules and is very close to the original KD in terms of GFLOPs, NUO, and TP; we therefore conclude that our proposed JDA+CCD is lightweight.
Table 10. Performance comparison of JDA and AutoAugment for offline knowledge distillation. All experiments in this table use the same hyperparameter settings. We find that JDA beats AutoAugment on the four teacher-student pairs.
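For reference, a minimal, hedged sketch of how the Table 9 metrics could be measured is shown below: GFLOPs via fvcore's FlopCountAnalysis and throughput via wall-clock timing on a single GPU. The input size, run count and device are placeholders, and the exact protocol used for Table 9 may differ; operators fvcore cannot handle are reported by the library and correspond to the NUO column.

    # Illustrative sketch of measuring GFLOPs and throughput (TP); input size,
    # run count and device are placeholders rather than the Table 9 protocol.
    import time
    import torch
    from fvcore.nn import FlopCountAnalysis

    def profile(model, input_size=(1, 3, 32, 32), runs=5000, device="cuda"):
        model = model.eval().to(device)
        x = torch.randn(*input_size, device=device)
        gflops = FlopCountAnalysis(model, x).total() / 1e9  # cost of one forward pass
        with torch.no_grad():
            torch.cuda.synchronize()
            start = time.time()
            for _ in range(runs):
                model(x)
            torch.cuda.synchronize()
        throughput = runs * input_size[0] / (time.time() - start)  # images per second
        return gflops, throughput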

C Additional Visualization

Fig. 6. T-SNE visualizations of the output of the teacher model's GAP for eight different scenarios. The four columns, from left to right, correspond to \(\left( \mathcal {X},\mathcal {X}\right) \), \(\left( \mathcal {\widetilde{X}},\mathcal {X}\right) \), \(\left( \mathcal {X},\mathcal {\widetilde{X}}+\mathcal {X}\right) \) and \(\left( \mathcal {\widetilde{X}},\mathcal {X}+\mathcal {\widetilde{X}}\right) \), where \(\left( A,B\right) \) denotes that the teacher model is trained with B; we then adopt T-SNE to visualize A.
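A minimal, hedged sketch of producing one panel of Fig. 6 with scikit-learn's t-SNE is given below; features is assumed to be the (N, C) matrix of the teacher's GAP outputs and labels the corresponding class ids, both gathered beforehand.

    # Illustrative sketch of one Fig. 6 panel: embed the teacher's GAP features
    # with t-SNE and colour the points by class.
    import matplotlib.pyplot as plt
    from sklearn.manifold import TSNE

    def plot_tsne(features, labels):
        # features: (N, C) array of GAP outputs; labels: (N,) class ids.
        embedded = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(features)
        plt.scatter(embedded[:, 0], embedded[:, 1], c=labels, s=3, cmap="tab20")
        plt.axis("off")
        plt.show()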

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Li, W., Shao, S., Liu, W., Qiu, Z., Zhu, Z., Huan, W. (2023). What Role Does Data Augmentation Play in Knowledge Distillation? In: Wang, L., Gall, J., Chin, T.J., Sato, I., Chellappa, R. (eds) Computer Vision – ACCV 2022. ACCV 2022. Lecture Notes in Computer Science, vol 13842. Springer, Cham. https://doi.org/10.1007/978-3-031-26284-5_31

  • DOI: https://doi.org/10.1007/978-3-031-26284-5_31

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-26283-8

  • Online ISBN: 978-3-031-26284-5

  • eBook Packages: Computer Science, Computer Science (R0)
