Visual Content Recognition by Exploiting Semantic Feature Map with Attention and Multi-task Learning

Published: 05 February 2019

Abstract

Recent studies have shown that spatial relationships among objects are very important for visual recognition, since they provide rich clues about object contexts within images. In this article, we introduce a novel method to learn a Semantic Feature Map (SFM) with attention-based deep neural networks for image and video classification in an end-to-end manner, aiming to explicitly model the spatial object contexts within images. In particular, we apply the designed gate units to the extracted object features to select important objects and remove noise. The selected object features are then organized into the proposed SFM, a compact and discriminative representation that preserves the spatial information among objects. Finally, we employ either a Fully Convolutional Network (FCN) or Long Short-Term Memory (LSTM) as the classifier on top of the SFM for content recognition. A novel multi-task learning framework with image classification loss, object localization loss, and grid labeling loss is also introduced to help better learn the model parameters. We conduct extensive evaluations and comparative studies to verify the effectiveness of the proposed approach on the Pascal VOC 2007/2012 and MS-COCO benchmarks for image classification. The experimental results also show that SFMs learned in the image domain can be successfully transferred to the CCV and FCVID benchmarks for video classification.
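To make the mechanism described above concrete, the following is a minimal PyTorch-style sketch of two core ideas: gate units that softly select object features, which are then scattered onto a spatial grid to form the SFM, and the three-term multi-task loss. Everything here is an illustrative assumption rather than the authors' implementation: the class name SemanticFeatureMapSketch, the 7x7 grid, the per-cell max-pooling, and the particular loss forms and weights are not taken from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SemanticFeatureMapSketch(nn.Module):
        """Hypothetical sketch of the SFM idea: gate per-object features,
        then scatter them onto a G x G grid by box center so that the
        spatial relationships among objects are preserved."""

        def __init__(self, feat_dim=2048, grid_size=7):
            super().__init__()
            self.grid_size = grid_size
            # Gate unit: a learned scalar in (0, 1) per object, used for
            # important-object selection and noise removal.
            self.gate = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

        def forward(self, obj_feats, boxes):
            # obj_feats: (N, D) features of N detected object regions
            # boxes:     (N, 4) normalized [x1, y1, x2, y2] in [0, 1]
            gated = self.gate(obj_feats) * obj_feats      # soft selection
            G, D = self.grid_size, obj_feats.size(1)
            sfm = obj_feats.new_zeros(G, G, D)
            cx = (boxes[:, 0] + boxes[:, 2]) / 2          # box centers
            cy = (boxes[:, 1] + boxes[:, 3]) / 2
            ix = (cx * G).long().clamp(0, G - 1)
            iy = (cy * G).long().clamp(0, G - 1)
            for n in range(obj_feats.size(0)):
                # Max-pool when several objects fall into the same cell
                # (an assumption; the paper's aggregation may differ).
                sfm[iy[n], ix[n]] = torch.maximum(sfm[iy[n], ix[n]], gated[n])
            return sfm.permute(2, 0, 1)                   # (D, G, G)

    def multi_task_loss(cls_logits, cls_targets,
                        loc_pred, loc_targets,
                        grid_logits, grid_targets,
                        w_loc=1.0, w_grid=1.0):
        """Weighted sum of the three objectives named in the abstract.
        The individual loss forms and weights are illustrative guesses."""
        l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
        l_loc = F.smooth_l1_loss(loc_pred, loc_targets)
        l_grid = F.binary_cross_entropy_with_logits(grid_logits, grid_targets)
        return l_cls + w_loc * l_loc + w_grid * l_grid

Under these assumptions, an FCN classifier can consume the resulting (D, G, G) map directly as a multi-channel image, while an LSTM can read the G x G grid cells as a sequence; per-cell max-pooling is just one plausible way to resolve multiple objects landing in the same cell.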

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 1s
Special Section on Deep Learning for Intelligent Multimedia Analytics and Special Section on Multi-Modal Understanding of Social, Affective and Subjective Attributes of Data
January 2019
265 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3309769
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 February 2019
Accepted: 01 June 2018
Revised: 01 March 2018
Received: 01 October 2017
Published in TOMM Volume 15, Issue 1s

Author Tags

  1. Image representation
  2. contextual fusion
  3. image classification
  4. video classification

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2022) AABLSTM: A Novel Multi-task based CNN-RNN Deep Model for Fashion Analysis. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3519029. Online publication date: 12-Mar-2022.
  • (2020) Multi-modal Continuous Dimensional Emotion Recognition Using Recurrent Neural Network and Self-Attention Mechanism. In Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop, 27-34. DOI: 10.1145/3423327.3423672. Online publication date: 16-Oct-2020.
  • (2020) Spatio-Temporal VLAD Encoding of Visual Events Using Temporal Ordering of the Mid-Level Deep Semantics. IEEE Transactions on Multimedia 22, 7, 1769-1784. DOI: 10.1109/TMM.2019.2959426. Online publication date: Jul-2020.
  • (2019) A Hierarchical CNN-RNN Approach for Visual Emotion Classification. ACM Transactions on Multimedia Computing, Communications, and Applications 15, 3s, 1-17. DOI: 10.1145/3359753. Online publication date: 7-Dec-2019.
