Abstract
Single Shot Multibox Detector (SSD) is one of the top-performing object detection algorithms in terms of both accuracy and speed. SSD achieves impressive performance on various datasets by using different output layers for object detection. However, each layer in the feature pyramid is used independently, and SSD considers only the fine-grained details of objects while ignoring the context surrounding them. In this paper, we propose an enhanced SSD, called ESSD, that improves on the conventional SSD by fusing the feature maps of different output layers instead of growing layers close to the input data. Our method uses two-way transfer of feature information and feature fusion to enhance the network. To further assist object detection, we propose a visual reasoning method that makes full use of the relationships between objects instead of relying only on the features of the objects themselves. This addition of visual reasoning proves very effective for detecting objects that are too small or have small features. To evaluate the proposed ESSD, we trained the model on the VOC2007 and VOC2012 training sets and evaluated it on the Pascal VOC2007 test set. For \(300 \times 300\) input, ESSD achieved 79.2% mean average precision (mAP) at 52.0 frames per second (FPS), and for \(512 \times 512\) input, it achieved 82.4% mAP at 18.6 FPS. These results demonstrate that the proposed method achieves state-of-the-art mAP, a better result than that of the conventional SSD and other advanced detectors.
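The two-way transfer and fusion of feature maps mentioned in the abstract can be pictured with a minimal sketch. The abstract does not state which operators ESSD uses, so everything below is an assumption made for illustration: nearest-neighbour upsampling carries a coarse (deep) map up to a finer (shallow) one, 2x2 max pooling carries the fine map down to the coarse one, and element-wise addition performs the fusion. The function names `upsample2x`, `maxpool2x`, `add`, and `fuse_two_way` are hypothetical, and the maps are single-channel lists rather than real network tensors.

```python
# Illustrative only: the operators here (nearest-neighbour upsampling,
# 2x2 max pooling, element-wise sum) are assumptions, not the paper's
# stated design. All function names are hypothetical.

def upsample2x(fmap):
    """Nearest-neighbour upsampling of a 2-D feature map by a factor of 2."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]  # repeat each column twice
        out.append(wide)
        out.append(list(wide))                     # repeat each row twice
    return out

def maxpool2x(fmap):
    """2x2 max pooling with stride 2 (height and width must be even)."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]), 2)]
        for i in range(0, len(fmap), 2)
    ]

def add(a, b):
    """Element-wise sum of two equally sized feature maps."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fuse_two_way(shallow, deep):
    """Return (fused_shallow, fused_deep) for a fine map and a coarse map."""
    fused_shallow = add(shallow, upsample2x(deep))  # context flows to the fine map
    fused_deep = add(deep, maxpool2x(shallow))      # detail flows to the coarse map
    return fused_shallow, fused_deep

# Toy 4x4 "shallow" map and 2x2 "deep" map.
shallow = [[1, 2, 3, 4],
           [5, 6, 7, 8],
           [9, 10, 11, 12],
           [13, 14, 15, 16]]
deep = [[10, 20],
        [30, 40]]

fused_shallow, fused_deep = fuse_two_way(shallow, deep)
```

In this toy setup each output layer ends up holding information from the other scale, which is the general idea behind using fused maps for detection rather than treating each pyramid layer independently.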
Acknowledgements
This project was partially supported by grants from the Natural Science Foundation of China (71671178, 9154620, and 61202321) and the open project of the Key Lab of Big Data Mining and Knowledge Management. It was also supported by the Hainan Provincial Department of Science and Technology under Grant No. ZDKJ2016021 and by Guangdong Provincial Science and Technology Project 2016B010127004.
Ethics declarations
Conflict of interest
We declare that we do not have any commercial or associative interest that represents a conflict of interest in connection with the work submitted.
About this article
Cite this article
Leng, J., Liu, Y. An enhanced SSD with feature fusion and visual reasoning for object detection. Neural Comput & Applic 31, 6549–6558 (2019). https://doi.org/10.1007/s00521-018-3486-1
DOI: https://doi.org/10.1007/s00521-018-3486-1