Visual Content Recognition by Exploiting Semantic Feature Map with Attention and Multi-task Learning

Published: 05 February 2019

Abstract

Recent studies have shown that spatial relationships among objects are very important for visual recognition, since they provide rich clues about object contexts within images. In this article, we introduce a novel method to learn a Semantic Feature Map (SFM) with attention-based deep neural networks for image and video classification in an end-to-end manner, aiming to explicitly model the spatial object contexts within images. In particular, we apply the designed gate units to the extracted object features to select important objects and remove noise. The selected object features are then organized into the proposed SFM, a compact and discriminative representation that preserves the spatial information among objects. Finally, we employ either a Fully Convolutional Network (FCN) or Long Short-Term Memory (LSTM) as the classifier on top of the SFM for content recognition. A novel multi-task learning framework with image classification loss, object localization loss, and grid labeling loss is also introduced to help better learn the model parameters. We conduct extensive evaluations and comparative studies to verify the effectiveness of the proposed approach on the Pascal VOC 2007/2012 and MS-COCO benchmarks for image classification. The experimental results also show that SFMs learned in the image domain can be successfully transferred to the CCV and FCVID benchmarks for video classification.
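To make the mechanism described above concrete, the following is a minimal PyTorch-style sketch of two core ideas: gate units that softly select object features, which are then scattered onto a spatial grid to form the SFM, and the three-term multi-task loss. Everything here is an illustrative assumption rather than the authors' implementation: the class name SemanticFeatureMapSketch, the 7x7 grid, the per-cell max-pooling, and the particular loss forms and weights are not taken from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SemanticFeatureMapSketch(nn.Module):
        """Hypothetical sketch of the SFM idea: gate per-object features,
        then scatter them onto a G x G grid by box center so that the
        spatial relationships among objects are preserved."""

        def __init__(self, feat_dim=2048, grid_size=7):
            super().__init__()
            self.grid_size = grid_size
            # Gate unit: a learned scalar in (0, 1) per object, used for
            # important-object selection and noise removal.
            self.gate = nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())

        def forward(self, obj_feats, boxes):
            # obj_feats: (N, D) features of N detected object regions
            # boxes:     (N, 4) normalized [x1, y1, x2, y2] in [0, 1]
            gated = self.gate(obj_feats) * obj_feats      # soft selection
            G, D = self.grid_size, obj_feats.size(1)
            sfm = obj_feats.new_zeros(G, G, D)
            cx = (boxes[:, 0] + boxes[:, 2]) / 2          # box centers
            cy = (boxes[:, 1] + boxes[:, 3]) / 2
            ix = (cx * G).long().clamp(0, G - 1)
            iy = (cy * G).long().clamp(0, G - 1)
            for n in range(obj_feats.size(0)):
                # Max-pool when several objects fall into the same cell
                # (an assumption; the paper's aggregation may differ).
                sfm[iy[n], ix[n]] = torch.maximum(sfm[iy[n], ix[n]], gated[n])
            return sfm.permute(2, 0, 1)                   # (D, G, G)

    def multi_task_loss(cls_logits, cls_targets,
                        loc_pred, loc_targets,
                        grid_logits, grid_targets,
                        w_loc=1.0, w_grid=1.0):
        """Weighted sum of the three objectives named in the abstract.
        The individual loss forms and weights are illustrative guesses."""
        l_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
        l_loc = F.smooth_l1_loss(loc_pred, loc_targets)
        l_grid = F.binary_cross_entropy_with_logits(grid_logits, grid_targets)
        return l_cls + w_loc * l_loc + w_grid * l_grid

Under these assumptions, an FCN classifier can consume the resulting (D, G, G) map directly as a multi-channel image, while an LSTM can read the G x G grid cells as a sequence; per-cell max-pooling is just one plausible way to resolve multiple objects landing in the same cell.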

Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 15, Issue 1s
Special Section on Deep Learning for Intelligent Multimedia Analytics and Special Section on Multi-Modal Understanding of Social, Affective and Subjective Attributes of Data
January 2019
265 pages
ISSN:1551-6857
EISSN:1551-6865
DOI:10.1145/3309769
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 February 2019
Accepted: 01 June 2018
Revised: 01 March 2018
Received: 01 October 2017
Published in TOMM Volume 15, Issue 1s

Author Tags

  1. Image representation
  2. contextual fusion
  3. image classification
  4. video classification

Qualifiers

  • Research-article
  • Research
  • Refereed

Cited By

  • (2022) AABLSTM: A Novel Multi-task based CNN-RNN Deep Model for Fashion Analysis. ACM Transactions on Multimedia Computing, Communications, and Applications. DOI: 10.1145/3519029. Online publication date: 12-Mar-2022.
  • (2020) Multi-modal Continuous Dimensional Emotion Recognition Using Recurrent Neural Network and Self-Attention Mechanism. In Proceedings of the 1st International on Multimodal Sentiment Analysis in Real-life Media Challenge and Workshop, 27-34. DOI: 10.1145/3423327.3423672. Online publication date: 16-Oct-2020.
  • (2020) Spatio-Temporal VLAD Encoding of Visual Events Using Temporal Ordering of the Mid-Level Deep Semantics. IEEE Transactions on Multimedia 22, 7, 1769-1784. DOI: 10.1109/TMM.2019.2959426. Online publication date: Jul-2020.
  • (2019) A Hierarchical CNN-RNN Approach for Visual Emotion Classification. ACM Transactions on Multimedia Computing, Communications, and Applications 15, 3s, 1-17. DOI: 10.1145/3359753. Online publication date: 7-Dec-2019.
