Abstract
This paper addresses the problem of image-based event recognition by transferring deep representations learned from object and scene datasets. First, we empirically investigate the correlations among the concepts of object, scene, and event, which motivates our representation transfer methods. Based on this empirical study, we propose an iterative selection method that identifies the subset of object and scene classes most relevant for representation transfer. We then develop three transfer techniques: (1) initialization-based transfer, (2) knowledge-based transfer, and (3) data-based transfer. These techniques exploit multitask learning frameworks to incorporate extra knowledge from other networks or additional datasets into the fine-tuning procedure of event CNNs, which proves effective in reducing overfitting and improving the generalization ability of the learned CNNs. We perform experiments on four event recognition benchmarks: the ChaLearn LAP Cultural Event Recognition dataset, the Web Image Dataset for Event Recognition, the UIUC Sports Event dataset, and the Photo Event Collection dataset. The experimental results show that our proposed algorithm successfully transfers object and scene representations to the event datasets and achieves state-of-the-art performance on all considered benchmarks.
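To make the multitask fine-tuning idea in the abstract concrete, below is a minimal, hypothetical PyTorch sketch of the knowledge-based transfer variant: an event network is trained with a standard cross-entropy loss on event labels, while an auxiliary head is regularized towards the soft predictions of a frozen object/scene CNN via a distillation-style KL term. All names here (EventCNN, multitask_loss, the weight lam, the temperature T) are illustrative assumptions, not the paper's released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class EventCNN(nn.Module):
    """Shared backbone with two heads: event labels and soft object/scene concepts."""
    def __init__(self, backbone, feat_dim, num_events, num_concepts):
        super().__init__()
        self.backbone = backbone                    # e.g. a trunk pre-trained on ImageNet or Places
        self.event_head = nn.Linear(feat_dim, num_events)
        self.concept_head = nn.Linear(feat_dim, num_concepts)

    def forward(self, x):
        f = self.backbone(x)                        # shared representation for both tasks
        return self.event_head(f), self.concept_head(f)

def multitask_loss(event_logits, concept_logits, event_labels, soft_targets,
                   lam=0.5, T=2.0):
    # Hard-label loss on the target event dataset.
    ce = F.cross_entropy(event_logits, event_labels)
    # Soft-label loss against the frozen object/scene network's logits
    # (distillation-style KL; lam and T are illustrative hyperparameters).
    kl = F.kl_div(F.log_softmax(concept_logits / T, dim=1),
                  F.softmax(soft_targets / T, dim=1),
                  reduction="batchmean") * (T * T)
    return ce + lam * kl
```

The auxiliary KL term acts as a regularizer on the shared backbone, which is one plausible way the extra object/scene knowledge could curb overfitting on small event datasets.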
Acknowledgements
This work is partially supported by the ERC Advanced Grant VarCity, the Toyota Research Project TRACE-Zurich, the National Key Research and Development Program of China (2016YFC1400704), the National Natural Science Foundation of China (U1613211, 61633021), and the External Cooperation Program of BIC Chinese Academy of Sciences (172644KYSB20150019, 172644KYSB20160033).
Additional information
Communicated by Kristen Grauman.
Cite this article
Wang, L., Wang, Z., Qiao, Y. et al. Transferring Deep Object and Scene Representations for Event Recognition in Still Images. Int J Comput Vis 126, 390–409 (2018). https://doi.org/10.1007/s11263-017-1043-5