Abstract
As a Turing test in multimedia, visual question answering (VQA) aims to answer a textual question about a given image. Recently, the “dynamic” property of neural networks has been explored as one of the most promising ways to improve the adaptability, interpretability, and capacity of neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, it remains relatively untouched and highly nontrivial to exploit dynamics in the transformers of VQA models through all stages in an end-to-end manner. Typically, due to the large computational cost of transformers, researchers are inclined to apply transformers only to the extracted high-level visual features in downstream vision-and-language tasks. To this end, we introduce a question-guided dynamic layer into the transformer, which effectively increases the model capacity and requires fewer transformer layers for the VQA task. In particular, we instantiate the dynamics in the transformer as a conditional multi-head self-attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with a conditional ResNeXt block (cResNeXt). Thus, a novel model, the mixture of conditional gating blocks (McG), is proposed for VQA, which keeps the best of the transformer, the convolutional neural network (CNN), and dynamic networks. The pure conditional gating CNN model and the conditional gating transformer model can be viewed as special cases of McG. We evaluate McG quantitatively and qualitatively on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG achieves state-of-the-art performance on these benchmarks.
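For intuition only, the PyTorch sketch below shows one way a question embedding could gate a multi-head self-attention block over visual tokens. The channel-wise sigmoid gate, the dimensions, and all names here are our own assumptions for illustration, not the paper's cMHSA implementation or released code.

```python
import torch
import torch.nn as nn


class ConditionalMHSA(nn.Module):
    """Self-attention over visual tokens, gated channel-wise by a question embedding (illustrative sketch)."""

    def __init__(self, embed_dim=512, num_heads=8, question_dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Question-guided gate: one value in [0, 1] per output channel (an assumption of this sketch).
        self.gate = nn.Sequential(nn.Linear(question_dim, embed_dim), nn.Sigmoid())

    def forward(self, visual_tokens, question_embedding):
        # visual_tokens: (batch, num_tokens, embed_dim)
        # question_embedding: (batch, question_dim)
        attended, _ = self.attn(visual_tokens, visual_tokens, visual_tokens)
        g = self.gate(question_embedding).unsqueeze(1)  # (batch, 1, embed_dim)
        return visual_tokens + g * attended             # question-gated residual update


# Example: a 7x7 grid of image features modulated by a pooled question vector.
x = torch.randn(2, 49, 512)
q = torch.randn(2, 512)
out = ConditionalMHSA()(x, q)  # shape: (2, 49, 512)
```

Analogously, the same kind of question-conditioned gate could be applied to the output of a grouped-convolution (ResNeXt-style) block to obtain a conditional CNN branch in the spirit of cResNeXt.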
Acknowledgements
We would like to thank Yan-Ze Wu (a postgraduate student at the School of Computer Science, Fudan University) for insightful discussions.
Ethics declarations
Conflict of Interest: The authors declare that they have no conflict of interest.
Additional information
This work was supported in part by the National Natural Science Foundation of China under Grant No. 62176061 and the Science and Technology Commission of Shanghai Municipality under Grant No. 22511105000.
Qiang Sun received his Ph.D. degree from the Academy of Engineering & Technology, Fudan University, Shanghai, in 2023, his M.S. degree from the Department of Computer Science and Technology, Nanjing University, Nanjing, in 2011, and his B.S. degree from the Software Engineering Institute, Nanjing University, Nanjing, in 2008. He is currently a lecturer at the School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai. His research interests include visual question answering and vision-and-language navigation.
Yan-Wei Fu received his Ph.D. degree from Queen Mary University of London, London, in 2014, and his M.Eng. degree from the Department of Computer Science and Technology, Nanjing University, Nanjing, in 2011. He held a post-doctoral position at Disney Research, Pittsburgh, from 2015 to 2016. He is currently a tenure-track professor at the School of Data Science, Fudan University, Shanghai. His research interests are image and video understanding, and life-long learning.
Xiang-Yang Xue received his B.S., M.S., and Ph.D. degrees in communication engineering from Xidian University, Xi’an, in 1989, 1992, and 1995, respectively. He is currently a professor at the School of Computer Science, Fudan University, Shanghai. His research interests include computer vision, multimedia information processing, and machine learning.
About this article
Cite this article
Sun, Q., Fu, YW. & Xue, XY. Learning a Mixture of Conditional Gating Blocks for Visual Question Answering. J. Comput. Sci. Technol. 39, 912–928 (2024). https://doi.org/10.1007/s11390-024-2113-0