Abstract
Understanding a visual scene requires not only identifying individual objects in isolation but also inferring the relationships and interactions between object pairs. In this study, we propose a novel Transformer-based scene graph generation framework that converts an image into a linguistic description structured as a graph, whose nodes and edges encode the <subject–predicate–object> information of the image. The proposed model consists of three components. First, we propose an enhanced object detection module that uses a bidirectional long short-term memory (Bi-LSTM) network for object-to-object information exchange and produces bounding boxes and class probabilities for candidate objects. Second, we introduce a novel context-capture module built from Transformer layers that outputs context-aware object categories as well as contextualized edge information for specific object pairs. Finally, since relationship frequencies follow a long-tailed distribution, we design an adaptive inference module with a dedicated feature fusion strategy that softens this distribution and adaptively reasons about relationship classes based on the visual appearance of each object pair. We conduct detailed experiments on three popular open-source datasets, namely Visual Genome, OpenImages, and Visual Relationship Detection, along with ablation studies on each module, demonstrating significant improvements under different settings and across various metrics.
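To make the architecture concrete, here is a minimal, self-contained PyTorch sketch of the three-component pipeline described above. It is an illustration under assumptions, not the authors' implementation: the class name SceneGraphSketch, the feature dimension d, the layer counts, the class/predicate counts, the additive feature fusion, and the input layout (per-image ROI features, union-box features, and subject/object pair indices) are hypothetical choices made for readability, and the adaptive, frequency-softening inference step is omitted for brevity.

import torch
import torch.nn as nn

class SceneGraphSketch(nn.Module):
    # Hypothetical three-module pipeline: (1) Bi-LSTM object context,
    # (2) Transformer edge context, (3) fused relationship classifier.
    def __init__(self, num_classes=151, num_predicates=51, d=512):
        super().__init__()
        # (1) Object-to-object information exchange via a bidirectional LSTM;
        #     hidden size d // 2 per direction keeps outputs d-dimensional.
        self.obj_context = nn.LSTM(d, d // 2, bidirectional=True, batch_first=True)
        self.obj_cls = nn.Linear(d, num_classes)
        # (2) Transformer layers that contextualize the object representations.
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.edge_context = nn.TransformerEncoder(layer, num_layers=3)
        # (3) Relationship head over fused subject/object/union-box features.
        self.rel_cls = nn.Linear(d, num_predicates)

    def forward(self, roi_feats, union_feats, pair_idx):
        # roi_feats:   [B, N, d] detector features for N candidate objects
        # union_feats: [B, P, d] visual features of P object-pair union boxes
        # pair_idx:    [B, P, 2] long tensor of (subject, object) indices
        obj_ctx, _ = self.obj_context(roi_feats)      # Bi-LSTM exchange
        obj_logits = self.obj_cls(obj_ctx)            # object class scores
        edge_ctx = self.edge_context(obj_ctx)         # contextualized objects
        dim = edge_ctx.size(-1)
        subj = torch.gather(edge_ctx, 1, pair_idx[..., 0:1].expand(-1, -1, dim))
        obj = torch.gather(edge_ctx, 1, pair_idx[..., 1:2].expand(-1, -1, dim))
        fused = subj + obj + union_feats              # assumed additive fusion
        rel_logits = self.rel_cls(fused)              # predicate class scores
        return obj_logits, rel_logits

In a full system the ROI and union-box features would come from the detector backbone, and the fusion strategy, class counts, and layer depths would follow the paper's actual design rather than these placeholder values.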
Acknowledgements
This work was sponsored by the National Natural Science Foundation of China (No. 61802253).
About this article
Cite this article
Wang, Y., Gao, Y., Yu, W. et al. Transformer networks with adaptive inference for scene graph generation. Appl Intell 53, 9621–9633 (2023). https://doi.org/10.1007/s10489-022-04022-0