A Video Question Answering Model Based on Knowledge Distillation
Figure 1. An example of video QA: given a video and a series of questions about it, the video QA task requires the machine to give the correct answer after analyzing the video content and understanding the semantics of the questions.
Figure 2. The overall framework of the teacher model, which comprises four modules. First, an encoder extracts and represents the video and the question. Then, visual–text interaction performs reasoning between the visual and textual features, and visual fusion merges the appearance and motion features into a single visual representation. Finally, answer generation predicts the answer with a decoder.
Figure 3. Multimodal knowledge distillation architecture. The teacher model, which has multiple GCN layers, is trained first; its fused visual feature is then used to guide the training of the visual features in the student model, which has only one GCN layer.
Figure 4. Loss curves on the training set: the real-label loss, the appearance-feature distillation loss, and the motion-feature distillation loss. The hard loss is computed from the output and the real labels; the APP and MO losses are computed from the features and the soft labels.
Figure 5. Loss curves on the validation set: the real-label loss, the appearance-feature distillation loss, and the motion-feature distillation loss. The hard loss is computed from the output and the real labels; the APP and MO losses are computed from the features and the soft labels.
Figure 6. Examples of false predictions.
Figure 7. Examples of prediction differences among the three models.
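As a rough illustration of the four-module teacher pipeline summarized in the caption of Figure 2 (encoder, visual–text interaction, visual fusion, answer generation), the following PyTorch-style sketch composes these stages for a question and a set of video clips. Every name, dimension, and layer choice here (e.g., TeacherVideoQA, the cross-attention layers, the linear fusion) is an illustrative assumption rather than the authors' implementation, and the GCN-based reasoning of the actual model is omitted.

```python
# Minimal sketch of a four-stage teacher pipeline (assumed, not the paper's code).
import torch
import torch.nn as nn

class TeacherVideoQA(nn.Module):
    def __init__(self, d_model=512, num_answers=1000):
        super().__init__()
        # Encoder: project pre-extracted appearance/motion clip features and encode the question.
        self.app_enc = nn.Linear(2048, d_model)               # e.g., ResNet appearance features
        self.mot_enc = nn.Linear(2048, d_model)               # e.g., 3D-CNN motion features
        self.q_enc = nn.GRU(300, d_model, batch_first=True)   # e.g., GloVe word embeddings
        # Visual-text interaction: question-guided attention over each visual stream.
        self.app_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        self.mot_attn = nn.MultiheadAttention(d_model, 8, batch_first=True)
        # Visual fusion: merge attended appearance and motion into one visual representation.
        self.fuse = nn.Linear(2 * d_model, d_model)
        # Answer generation: classify over the answer vocabulary.
        self.decoder = nn.Linear(2 * d_model, num_answers)

    def forward(self, app, mot, q_words):
        app, mot = self.app_enc(app), self.mot_enc(mot)        # (B, T, d)
        _, q_last = self.q_enc(q_words)                        # (1, B, d) final question state
        q = q_last.squeeze(0).unsqueeze(1)                     # (B, 1, d) question query
        app_ctx, _ = self.app_attn(q, app, app)                # question-attended appearance
        mot_ctx, _ = self.mot_attn(q, mot, mot)                # question-attended motion
        v = self.fuse(torch.cat([app_ctx, mot_ctx], dim=-1))   # fused visual feature
        logits = self.decoder(torch.cat([v, q], dim=-1)).squeeze(1)
        return logits, v.squeeze(1)                            # fused feature kept for distillation

# Interface check with dummy tensors: 2 videos, 8 clips, 12 question words.
model = TeacherVideoQA()
logits, fused = model(torch.randn(2, 8, 2048), torch.randn(2, 8, 2048), torch.randn(2, 12, 300))
```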
Abstract
1. Introduction
- (1) Teacher-student framework: We introduce a teacher-student framework that leverages knowledge distillation. It allows a simpler student model to be trained conveniently and efficiently: by distilling the knowledge learned by the teacher model, the student benefits from the teacher's expertise while keeping a reduced model size.
- (2) Multimodal knowledge distillation: We propose a novel approach to multimodal knowledge distillation that lets the student model acquire rich multimodal information while each individual modality is being trained. By incorporating multimodal interactions early on, the fusion of appearance and motion features is significantly enhanced (see the sketch after this list).
- (3) Competitive results on MSVD-QA and MSRVTT-QA: Through extensive experiments, we demonstrate the effectiveness of the proposed model on the popular MSVD-QA and MSRVTT-QA datasets. Our model achieves competitive performance compared with existing approaches, showcasing its capabilities in video question answering.
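The sketch below makes contribution (2) concrete with one plausible form of the distillation objective described later (Figures 3 to 5): a hard cross-entropy loss against the real answer labels, plus appearance (APP) and motion (MO) feature losses that pull the student's per-modality features toward the teacher's fused visual feature, used as a soft label. The MSE formulation and the loss weights are assumptions for illustration, not necessarily the paper's exact loss function.

```python
# Hedged sketch of a multimodal feature-distillation loss (weights and MSE are assumed).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, labels,
                      student_app, student_mot, teacher_fused,
                      w_app=0.5, w_mot=0.5):
    hard = F.cross_entropy(student_logits, labels)          # output vs. real labels ("hard" loss)
    app = F.mse_loss(student_app, teacher_fused.detach())   # appearance feature vs. soft label (APP)
    mot = F.mse_loss(student_mot, teacher_fused.detach())   # motion feature vs. soft label (MO)
    return hard + w_app * app + w_mot * mot

# Dummy example: batch of 4, 512-d features, 1000 candidate answers.
loss = distillation_loss(torch.randn(4, 1000), torch.randint(0, 1000, (4,)),
                         torch.randn(4, 512), torch.randn(4, 512), torch.randn(4, 512))
print(float(loss))
```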
2. Related Work
3. Materials and Methods
3.1. Teacher Model
3.1.1. Encoder
3.1.2. Visual–Text Interaction
3.1.3. Visual Fusion
3.1.4. Answer Generation
3.2. Student Model
3.3. Loss Function
4. Experimental Results Analysis
4.1. Dataset
4.2. Implementation Details
4.2.1. Teacher Model
4.2.2. Student Model
4.3. Results Analysis
4.3.1. Visual Analysis
4.3.2. Comparative Analysis
- Co-Mem [4]. This method derives from the dynamic memory network (DMN) used in visual QA and adapts it to video QA. Its memory module introduces an appearance-motion co-memory attention mechanism, and a temporal convolution-deconvolution network combined with dynamic fact integration is used to mine video information in depth.
- AMU [1]. This is an end-to-end video QA model that applies fine-grained question features to video understanding. It reads the question word by word, interacts with the appearance and motion features through an attention mechanism, progressively refines the video attention features, and finally obtains a video understanding that integrates question features at multiple scales.
- HGA [9]. This model introduces a graph network for reasoning. It represents the video clips and question words as nodes of a graph and performs cross-modal graph reasoning over them.
- HCRN [24]. This is a stackable model built from clip-based conditional relation network modules. Each module takes a set of tensor objects and a conditioning feature as input and outputs a set of relational information about them; multi-step relational reasoning is realized by hierarchically stacking the modules.
- DSAVS [8]. The answer to a question can often be deduced from only a few frames or clips of the video, and appearance and motion information are generally complementary. To this end, the authors propose a dynamic self-attention network with vision synchronization, which first selects the important video clips and then synchronizes the different features in time.
- DualVGR [6]. This model stacks attention-based graph reasoning units. Within each unit, a query punishment mechanism strengthens the features of the key video clips, and the relations are then modeled by a multi-head graph network combined with attention; multi-step relational reasoning is performed by stacking the units. A generic graph reasoning step is sketched after this list.
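Several of the baselines above (HGA, DualVGR), like the teacher and student models in this paper, reason over video clips with graph convolutions. The snippet below is a generic, minimal single graph reasoning step over clip features; building the adjacency matrix from pairwise feature similarity is an illustrative assumption and not any particular model's design.

```python
# Generic GCN-style reasoning step over clip features (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClipGraphLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, clips):                                   # clips: (B, N, d)
        # Soft adjacency from pairwise clip similarity, row-normalized by softmax.
        adj = F.softmax(clips @ clips.transpose(1, 2), dim=-1)  # (B, N, N)
        # Aggregate neighbor information, transform, and add a residual connection.
        return F.relu(self.proj(adj @ clips)) + clips           # (B, N, d)

# One reasoning step over 8 clips with 512-d features.
layer = ClipGraphLayer(512)
out = layer(torch.randn(2, 8, 512))
```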
4.3.3. Ablation Study
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
References
- Xu, D.; Zhao, Z.; Xiao, J.; Wu, F.; Zhang, H.; He, X.; Zhuang, Y. Video Question Answering via Gradually Refined Attention over Appearance and Motion. In Proceedings of the 25th ACM International Conference on Multimedia, San Francisco, CA, USA, 23–27 October 2017; pp. 1645–1653.
- Antol, S.; Agrawal, A.; Lu, J.; Mitchell, M.; Parikh, D. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 2425–2433.
- Gupta, P.; Gupta, V. A Survey of Text Question Answering Techniques. J. Comput. Appl. 2012, 53, 1–8.
- Gao, J.; Ge, R.; Chen, K.; Nevatia, R. Motion-Appearance Co-Memory Networks for Video Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6576–6585.
- Wang, X.; Gupta, A. Videos as Space-Time Region Graphs. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 399–417.
- Wang, J.; Bao, B.; Xu, C. DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering. IEEE Trans. Multimed. 2022, 24, 3369–3380.
- Zhang, Z.; Zhao, Z.; Lin, Z.; Song, J.; He, X. Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; pp. 4383–4389.
- Liu, Y.; Zhang, X.; Huang, F.; Shen, S.; Tian, P.; Li, L.; Li, Z. Dynamic Self-Attention with Vision Synchronization Networks for Video Question Answering. Pattern Recognit. 2022, 132, 108959.
- Jiang, P.; Han, Y. Reasoning with Heterogeneous Graph Alignment for Video Question Answering. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11109–11116.
- Huang, D.; Chen, P.; Zeng, R.; Du, Q.; Tan, M.; Gan, C. Location-Aware Graph Convolutional Networks for Video Question Answering. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; pp. 11021–11028.
- Wang, X.; Zhu, M.; Bo, D.; Cui, P.; Shi, C.; Pei, J. AM-GCN: Adaptive Multi-Channel Graph Convolutional Networks. In Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, 23–27 August 2020; pp. 1243–1253.
- Gou, J.; Yu, B.; Maybank, S.J.; Tao, D. Knowledge Distillation: A Survey. Int. J. Comput. Vis. 2021, 129, 1789–1819.
- Tapaswi, M.; Zhu, Y.; Stiefelhagen, R.; Torralba, A.; Urtasun, R.; Fidler, S. MovieQA: Understanding Stories in Movies through Question-Answering. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 4631–4640.
- Lei, J.; Yu, L.; Bansal, M.; Berg, T.L. TVQA: Localized, Compositional Video Question Answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; pp. 1369–1379.
- Castro, S.; Azab, M.; Stroud, J.; Noujaim, C.; Wang, R.; Deng, J.; Mihalcea, R. LifeQA: A Real-Life Dataset for Video Question Answering. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 4352–4358.
- Song, X.; Shi, Y.; Chen, X.; Han, Y. Explore Multi-Step Reasoning in Video Question Answering. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 239–247.
- Jia, D.; Wei, D.; Socher, R.; Li, L.J.; Kai, L.; Li, F.F. ImageNet: A Large-Scale Hierarchical Image Database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; pp. 248–255.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Carreira, J.; Zisserman, A. Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308.
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 13–16 December 2015; pp. 4489–4497.
- Pennington, J.; Socher, R.; Manning, C. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Jang, Y.; Song, Y.; Yu, Y.; Kim, Y.; Kim, G. TGIF-QA: Toward Spatio-Temporal Reasoning in Visual Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2758–2766.
- Kim, K.M.; Heo, M.O.; Choi, S.H.; Zhang, B.T. DeepStory: Video Story QA by Deep Embedded Memory Networks. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, Melbourne, Australia, 19–25 August 2017; pp. 2016–2022.
- Le, V.M.; Le, V.; Venkatesh, S.; Tran, T. Hierarchical Conditional Relation Networks for Video Question Answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9972–9981.
- Yu, Z.; Yu, J.; Fan, J.; Tao, D. Multi-Modal Factorized Bilinear Pooling with Co-Attention Learning for Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 24–27 October 2017; pp. 1821–1830.
- Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Yu, P.S. A Comprehensive Survey on Graph Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4–24.
- Song, L.; Smola, A.; Gretton, A.; Borgwardt, K.; Bedo, J. Supervised Feature Selection via Dependence Estimation. In Proceedings of the 24th Annual International Conference on Machine Learning, Corvallis, OR, USA, 20–24 June 2007; pp. 823–830.
- Chen, D.; Dolan, W.B. Collecting Highly Parallel Data for Paraphrase Evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Portland, OR, USA, 19–24 June 2011; pp. 190–200.
- Xu, J.; Mei, T.; Yao, T.; Rui, Y. MSR-VTT: A Large Video Description Dataset for Bridging Video and Language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 5288–5296.
Accuracy (%) by question type on MSVD-QA.

Model | What | Who | How | When | Where | All
---|---|---|---|---|---|---
Co-Mem | 19.6 | 48.7 | 81.6 | 74.1 | 31.7 | 31.7
AMU | 20.6 | 47.5 | 83.5 | 72.4 | 53.6 | 32.0
HGA | 23.5 | 50.4 | 83.0 | 72.4 | 46.4 | 34.7
HCRN | / | / | / | / | / | 36.1
DSAVS | 25.6 | 53.5 | 85.1 | 75.9 | 53.6 | 37.2
DualVGR | 28.7 | 53.8 | 80.0 | 70.7 | 46.4 | 39.0
Ours | 29.22 | 53.98 | 80.81 | 74.14 | 53.57 | 39.48

Accuracy (%) by question type on MSRVTT-QA.

Model | What | Who | How | When | Where | All
---|---|---|---|---|---|---
Co-Mem | 23.9 | 42.5 | 74.1 | 69.0 | 42.9 | 32.0
AMU | 26.2 | 43.0 | 80.2 | 72.5 | 30.0 | 32.5
HGA | 29.2 | 45.7 | 83.5 | 75.2 | 34.0 | 35.5
HCRN | / | / | / | / | / | 35.6
DSAVS | 29.5 | 46.1 | 84.3 | 75.5 | 35.6 | 35.8
DualVGR | 29.4 | 45.6 | 79.8 | 76.7 | 36.4 | 35.5
Ours | 29.67 | 45.51 | 80.91 | 76.51 | 35.20 | 35.71
Knowledge distillation ablation on MSVD-QA.

Model | Accuracy | Trainable Parameters (Millions)
---|---|---
Teacher | 39.03% | 31.19
Student | 38.85% | 24.09
Student-KD | 39.48% | 24.09

Knowledge distillation ablation on MSRVTT-QA.

Model | Accuracy | Trainable Parameters (Millions)
---|---|---
Teacher | 35.52% | 41.29
Student | 35.11% | 29.09
Student-KD | 35.71% | 29.09
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Shao, Z.; Wan, J.; Zong, L. A Video Question Answering Model Based on Knowledge Distillation. Information 2023, 14, 328. https://doi.org/10.3390/info14060328