Research Article

Visual and Semantic Knowledge Transfer for Large Scale Semi-Supervised Object Detection

Published: 01 December 2018

Abstract

Deep CNN-based object detection systems have achieved remarkable success on several large-scale object detection benchmarks. However, training such detectors requires a large number of labeled bounding boxes, which are more difficult to obtain than image-level annotations. Previous work addresses this issue by transforming image-level classifiers into object detectors: the differences between the two are modeled on categories with both image-level and bounding-box annotations, and this information is transferred to convert classifiers into detectors for categories without bounding-box annotations. We improve on this work by incorporating knowledge about object similarities from the visual and semantic domains during the transfer process. The intuition behind our method is that visually and semantically similar categories should share more transferable properties than dissimilar ones; e.g., a better cat detector results from transferring the differences between a dog classifier and a dog detector than from transferring those of the violin class. Experimental results on the challenging ILSVRC2013 detection dataset demonstrate that each of our proposed object-similarity-based knowledge transfer methods outperforms the baselines. We find strong evidence that visual similarity and semantic relatedness are complementary for this task and, when combined, notably improve detection, achieving state-of-the-art performance in a semi-supervised setting.
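The transfer scheme the abstract describes can be sketched in a few lines: estimate the classifier-to-detector difference for categories with bounding-box annotations, then give a weak category a detector by adding a similarity-weighted combination of those differences to its classifier. This is an illustrative reconstruction only, not the paper's implementation; the category names, feature dimension, and similarity scores below are all made up for the example (the paper derives similarities from appearance features and semantic resources such as word embeddings and WordNet).

```python
import numpy as np

# Hypothetical toy setup: categories with both image-level and bounding-box
# annotations ("strong"), plus one category ("cat") with image-level labels only.
rng = np.random.default_rng(0)
D = 8  # illustrative feature dimension

strong = ["dog", "horse", "violin"]
clf = {c: rng.normal(size=D) for c in strong + ["cat"]}            # classifier weights
det = {c: clf[c] + rng.normal(scale=0.1, size=D) for c in strong}  # detector weights

# Classifier-to-detector differences observed on the strong categories.
diff = {c: det[c] - clf[c] for c in strong}

# Hand-set visual and semantic similarities of "cat" to each strong category.
visual = {"dog": 0.80, "horse": 0.50, "violin": 0.05}
semantic = {"dog": 0.90, "horse": 0.60, "violin": 0.02}
combined = {c: 0.5 * visual[c] + 0.5 * semantic[c] for c in strong}

# Similarity-weighted transfer: similar categories contribute more.
w = np.array([combined[c] for c in strong])
w /= w.sum()  # normalize weights to sum to 1
transfer = sum(wi * diff[c] for wi, c in zip(w, strong))

# Estimated detector for the weakly annotated category.
det_cat = clf["cat"] + transfer
```

Under this weighting the dog class dominates the transferred difference while the violin class contributes almost nothing, which is exactly the intuition stated above; averaging the visual and semantic scores is just one simple way to combine the two cues.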


Cited By

  • (2024) "Dual-View Data Hallucination With Semantic Relation Guidance for Few-Shot Image Recognition," IEEE Transactions on Multimedia, vol. 26, pp. 11302–11315.
  • (2023) "Image Defogging Based on Regional Gradient Constrained Prior," ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 20, no. 3, pp. 1–17.
  • (2023) "Recent Few-shot Object Detection Algorithms: A Survey with Performance Comparison," ACM Transactions on Intelligent Systems and Technology, vol. 14, no. 4, pp. 1–36.
  • (2023) "A Knowledge Transfer-Based Semi-Supervised Federated Learning for IoT Malware Detection," IEEE Transactions on Dependable and Secure Computing, vol. 20, no. 3, pp. 2127–2143.
  • (2023) "SPL-Net: Spatial-Semantic Patch Learning Network for Facial Attribute Recognition with Limited Labeled Data," International Journal of Computer Vision, vol. 131, no. 8, pp. 2097–2121.
  • (2022) "A Survey on Visual Transfer Learning Using Knowledge Graphs," Semantic Web, vol. 13, no. 3, pp. 477–510.
  • (2022) "UnseenNet: Fast Training Detector for Unseen Concepts with No Bounding Boxes," Image and Vision Computing, pp. 18–32.
  • (2022) "Robust Object Detection with Inaccurate Bounding Boxes," Computer Vision – ECCV 2022, pp. 53–69.
  • (2020) "Boosting Weakly Supervised Object Detection with Progressive Knowledge Transfer," Computer Vision – ECCV 2020, pp. 615–631.


Published In

IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 40, Issue 12, Dec. 2018, 276 pages.

Publisher: IEEE Computer Society, United States.
