Abstract
In this paper, we study the task of embodied interactive learning for object detection. Given a set of environments (and some labeling budget), our goal is to learn an object detector by having an agent select which data to obtain labels for. How should an exploration policy decide which trajectories are worth labeling? One possibility is to use a trained object detector's failure cases as an external reward. However, this would require labeling the millions of frames needed to train RL policies, which is infeasible. Instead, we explore a self-supervised approach for training our exploration policy by introducing a notion of semantic curiosity. Our semantic curiosity policy is based on a simple observation: detection outputs should be consistent across views of the same object. Therefore, semantic curiosity rewards trajectories with inconsistent labeling behavior and encourages the exploration policy to visit such areas. The exploration policy trained via semantic curiosity generalizes to novel scenes and helps train an object detector that outperforms detectors trained with alternative exploration strategies such as random exploration, prediction-error curiosity, and coverage-maximizing exploration.
D. S. Chaplot and H. Jiang—Equal Contribution.
Webpage: https://devendrachaplot.github.io/projects/SemanticCuriosity.
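To make the reward concrete, here is a minimal Python sketch of the idea, not the paper's implementation: the paper aggregates detector outputs into a top-down semantic map and rewards label inconsistency per map cell, while the sketch below uses a simpler distinct-label count as a stand-in and assumes detections have already been projected into a shared top-down frame. The function name, cell size, and input format are illustrative assumptions.

```python
from collections import defaultdict

def semantic_curiosity_reward(detections_per_frame, cell_size=0.5):
    """Reward label inconsistency accumulated over a trajectory.

    `detections_per_frame` is a list of frames; each frame is a list of
    ((x, y), class_id) pairs, where (x, y) is the detection projected into
    a shared top-down coordinate frame (the projection step is omitted).
    """
    labels_per_cell = defaultdict(set)
    for frame in detections_per_frame:
        for (x, y), class_id in frame:
            cell = (int(x // cell_size), int(y // cell_size))
            labels_per_cell[cell].add(class_id)
    # A cell observed with k distinct labels contributes k - 1 units of
    # inconsistency; consistently labeled cells contribute nothing.
    return sum(len(labels) - 1 for labels in labels_per_cell.values())

# Example: the same spot is labeled "chair" (56) in one frame and
# "couch" (57) in another, so the trajectory earns a positive reward.
trajectory = [
    [((1.2, 3.4), 56)],
    [((1.4, 3.3), 57)],
    [((5.0, 5.0), 62)],
]
print(semantic_curiosity_reward(trajectory))  # -> 1
```

A trajectory that keeps the detector's labels consistent scores zero, so a policy maximizing this quantity is driven toward regions where the detector is uncertain, which are exactly the frames worth spending the labeling budget on.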
Notes
1. Note that the curiosity-based policy has the lowest mAP because of the outlier toilet category.
Acknowledgements
This work was supported by IARPA DIVA D17PC00340, ONR MURI, ONR Grant N000141812861, ONR Young Investigator, DARPA MCS, and NSF Graduate Research Fellowship. We would also like to thank NVIDIA for GPU support.
Licenses for referenced datasets:
Gibson: http://svl.stanford.edu/gibson2/assets/GDS_agreement.pdf
Matterport3D: http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf
Replica: https://raw.githubusercontent.com/facebookresearch/Replica-Dataset/master/LICENSE
Electronic supplementary material
Below are the links to the electronic supplementary material.
Supplementary material 1 (mp4 5714 KB)
Supplementary material 2 (mp4 9164 KB)
Supplementary material 3 (mp4 5119 KB)
Supplementary material 4 (mp4 9828 KB)
Supplementary material 5 (mp4 1906 KB)
Supplementary material 6 (mp4 8418 KB)
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Chaplot, D.S., Jiang, H., Gupta, S., Gupta, A. (2020). Semantic Curiosity for Active Visual Learning. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. Lecture Notes in Computer Science, vol. 12351. Springer, Cham. https://doi.org/10.1007/978-3-030-58539-6_19
DOI: https://doi.org/10.1007/978-3-030-58539-6_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-58538-9
Online ISBN: 978-3-030-58539-6