Abstract
Robust multi-scale object detection is challenging because it requires both spatial detail and semantic knowledge to handle large scale variation and cluttered backgrounds. Appropriate fusion of high-resolution features with deep semantic features is key to achieving better performance. Different approaches, such as the feature pyramid network, have been developed to extract deep features and combine them with shallow-layer spatial features. However, high-resolution feature maps contain noisy and distracting features, so directly combining shallow features with semantic features may degrade detection accuracy. Contextual information is also important for multi-scale object detection. In this work, we present a feature refinement scheme to tackle the feature fusion problem. The proposed feature refinement module increases feature resolution and refines feature maps progressively under the guidance of deep features. In addition, we propose a context extraction method to capture global and local contextual information. The method uses a multi-level cross-pooling unit to extract global context and a cascaded context module to extract local context. The proposed object detection framework has been evaluated on the PASCAL VOC and MS COCO datasets. Experimental results demonstrate that the proposed method performs favorably against state-of-the-art approaches.
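The abstract does not specify the internals of the feature refinement module, so the following is only a minimal sketch of one common way deep features can guide a shallow map: a gated top-down fusion. The function names (`upsample_nearest_2x`, `gated_fusion`), the sigmoid gate, the residual addition, and the single-channel list-of-lists representation are all assumptions made for illustration, not the authors' design.

```python
import math


def upsample_nearest_2x(fmap):
    """Double the height and width of a 2D map by repeating each value."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(shallow, deep):
    """Fuse a high-resolution map with a deep map of half its size.

    Assumption (not from the paper): the upsampled deep map gates the
    shallow map element-wise through a sigmoid, then is added back.
    """
    deep_up = upsample_nearest_2x(deep)
    if len(deep_up) != len(shallow) or len(deep_up[0]) != len(shallow[0]):
        raise ValueError("shallow map must be exactly 2x the deep map size")
    return [
        [s * sigmoid(d) + d for s, d in zip(s_row, d_row)]
        for s_row, d_row in zip(shallow, deep_up)
    ]


# Hypothetical toy inputs: a 4x4 shallow map and a 2x2 deep map.
shallow = [[1.0] * 4 for _ in range(4)]
deep = [[0.0, 10.0], [-10.0, 0.0]]
fused = gated_fusion(shallow, deep)
```

In this toy setting, shallow responses are kept where the deep map is strongly positive and suppressed where it is strongly negative, which is the intuition behind using semantic features to filter noisy high-resolution features. A real detector would apply learned convolutions on multi-channel tensors rather than a fixed gate.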
Data availability
The data that support the findings of this study are available in “The PASCAL Visual Object Classes (VOC) Challenge” (https://doi.org/10.1007/s11263-009-0275-4) and “Microsoft COCO” (https://doi.org/10.1007/978-3-319-10602-1_48).
Code availability
Not applicable.
Funding
Not applicable.
Author information
Contributions
YM contributed to the central idea and programming and wrote the draft of the manuscript; YW collected the data and did the programming. All authors discussed the results and revised the manuscript.
Ethics declarations
Conflicts of interest
The authors declare that they have no competing interests.
Consent to participate
Not applicable.
Consent for publication
The manuscript has been approved by all authors for publication. On behalf of my co-authors, I declare that the work described is original research that has not been published previously and is not under consideration for publication elsewhere.
Ethics approval
Not applicable.
About this article
Cite this article
Ma, Y., Wang, Y. Feature refinement with multi-level context for object detection. Machine Vision and Applications 34, 49 (2023). https://doi.org/10.1007/s00138-023-01402-5
DOI: https://doi.org/10.1007/s00138-023-01402-5