Abstract
As a Turing test in multimedia, visual question answering (VQA) aims to answer a textual question about a given image. Recently, the “dynamic” property of neural networks has been explored as one of the most promising ways to improve the adaptability, interpretability, and capacity of neural network models. Unfortunately, despite the prevalence of dynamic convolutional neural networks, it remains relatively untouched and highly nontrivial to exploit dynamics in the transformers of VQA models through all stages in an end-to-end manner. Typically, due to the large computational cost of transformers, researchers are inclined to apply transformers only to the extracted high-level visual features in downstream vision-and-language tasks. To this end, we introduce a question-guided dynamic layer into the transformer, which effectively increases the model capacity and requires fewer transformer layers for the VQA task. In particular, we instantiate the dynamics in the transformer as a conditional multi-head self-attention block (cMHSA). Furthermore, our question-guided cMHSA is compatible with a conditional ResNeXt block (cResNeXt). Thus, a novel model, the mixture of conditional gating blocks (McG), is proposed for VQA, which keeps the best of the transformer, the convolutional neural network (CNN), and dynamic networks. The pure conditional gating CNN model and the conditional gating transformer model can be viewed as special cases of McG. We evaluate McG quantitatively and qualitatively on the CLEVR and VQA-Abstract datasets. Extensive experiments show that McG achieves state-of-the-art performance on these benchmarks.
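For intuition only, the PyTorch sketch below shows one way a question embedding could gate a multi-head self-attention block over visual tokens. The channel-wise sigmoid gate, the dimensions, and all names here are our own assumptions for illustration, not the paper's cMHSA implementation or released code.

```python
import torch
import torch.nn as nn


class ConditionalMHSA(nn.Module):
    """Self-attention over visual tokens, gated channel-wise by a question embedding (illustrative sketch)."""

    def __init__(self, embed_dim=512, num_heads=8, question_dim=512):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Question-guided gate: one value in [0, 1] per output channel (an assumption of this sketch).
        self.gate = nn.Sequential(nn.Linear(question_dim, embed_dim), nn.Sigmoid())

    def forward(self, visual_tokens, question_embedding):
        # visual_tokens: (batch, num_tokens, embed_dim)
        # question_embedding: (batch, question_dim)
        attended, _ = self.attn(visual_tokens, visual_tokens, visual_tokens)
        g = self.gate(question_embedding).unsqueeze(1)  # (batch, 1, embed_dim)
        return visual_tokens + g * attended             # question-gated residual update


# Example: a 7x7 grid of image features modulated by a pooled question vector.
x = torch.randn(2, 49, 512)
q = torch.randn(2, 512)
out = ConditionalMHSA()(x, q)  # shape: (2, 49, 512)
```

Analogously, the same kind of question-conditioned gate could be applied to the output of a grouped-convolution (ResNeXt-style) block to obtain a conditional CNN branch in the spirit of cResNeXt.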
Acknowledgements
We would like to thank Yan-Ze Wu (a postgraduate student at the School of Computer Science, Fudan University) for insightful discussions.
Ethics declarations
Conflict of Interest: The authors declare that they have no conflict of interest.
Additional information
This work was supported in part by the National Natural Science Foundation of China under Grant No. 62176061 and the Science and Technology Commission of Shanghai Municipality under Grant No. 22511105000.
Qiang Sun received his Ph.D. degree from the Academy of Engineering & Technology, Fudan University, Shanghai, in 2023, his M.S. degree from the Department of Computer Science and Technology, Nanjing University, Nanjing, in 2011, and his B.S. degree from the Software Engineering Institute, Nanjing University, Nanjing, in 2008. He is currently a lecturer at the School of Statistics and Information, Shanghai University of International Business and Economics, Shanghai. His research interests include visual question answering and vision-and-language navigation.
Yan-Wei Fu received his Ph.D. degree from Queen Mary University of London, London, in 2014, and his M.Eng. degree from the Department of Computer Science and Technology, Nanjing University, Nanjing, in 2011. He held a post-doctoral position at Disney Research, Pittsburgh, from 2015 to 2016. He is currently a tenure-track professor at the School of Data Science, Fudan University, Shanghai. His research interests are image and video understanding, and life-long learning.
Xiang-Yang Xue received his B.S., M.S., and Ph.D. degrees in communication engineering from Xidian University, Xi’an, in 1989, 1992, and 1995, respectively. He is currently a professor at the School of Computer Science, Fudan University, Shanghai. His research interests include computer vision, multimedia information processing, and machine learning.
About this article
Cite this article
Sun, Q., Fu, YW. & Xue, XY. Learning a Mixture of Conditional Gating Blocks for Visual Question Answering. J. Comput. Sci. Technol. 39, 912–928 (2024). https://doi.org/10.1007/s11390-024-2113-0