DOI: 10.1145/3343031.3351051

Aberrance-aware Gradient-sensitive Attentions for Scene Recognition with RGB-D Videos

Published: 15 October 2019

Abstract

With the development of deep learning, previous approaches have achieved success in scene recognition using massive RGB data collected in ideal environments. However, scene recognition in the real world may face various aberrant conditions caused by unavoidable factors, such as lighting variance in the environment and the limitations of cameras, which can degrade the performance of previous models. Beyond ideal conditions, our motivation is to investigate robust scene recognition models for unconstrained environments. In this paper, we propose an aberrance-aware framework for RGB-D scene recognition, in which several types of attention (temporal, spatial, and modal) are integrated into spatio-temporal RGB-D CNN models to suppress the interference of RGB frame blurring, depth missing, and lighting variance. All attentions are obtained homogeneously by projecting gradient-sensitive maps of the visual data into the corresponding spaces. In particular, the gradient maps are computed by convolutional operations with specially designed kernels, which can be seamlessly integrated into end-to-end CNN training. Experiments under different challenging conditions demonstrate the effectiveness of the proposed method.
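The core mechanism described above, computing gradient maps with fixed convolution kernels and projecting their magnitude into an attention map, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the choice of Sobel-like kernels, the "valid" convolution, and the max-normalization are assumptions for illustration, since the paper's exact kernels are not reproduced here.

```python
import numpy as np

# Assumed edge-detecting kernels (Sobel-like); the paper's "typically
# designed kernels" are not specified in the abstract.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T

def conv2d_valid(img, kernel):
    """Plain 'valid' 2-D cross-correlation of a single-channel frame."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def gradient_attention(frame):
    """Project a frame's gradient magnitude into a [0, 1] attention map."""
    gx = conv2d_valid(frame, SOBEL_X)
    gy = conv2d_valid(frame, SOBEL_Y)
    mag = np.sqrt(gx ** 2 + gy ** 2)
    # Normalize; the epsilon guards flat (zero-gradient) frames.
    return mag / (mag.max() + 1e-8)
```

Under this sketch, a blurred or flat frame yields uniformly low gradient magnitudes, so any attention weight derived from the map (e.g., the map's mean used as a temporal weight) is suppressed, which matches the abstract's stated goal of down-weighting aberrant frames. In a real CNN, the same kernels would be fixed weights of a convolution layer, letting the operation participate in end-to-end training.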


Cited By

  • (2022) Response Generation by Jointly Modeling Personalized Linguistic Styles and Emotions. ACM Transactions on Multimedia Computing, Communications, and Applications 18(2), 1--20. https://doi.org/10.1145/3475872
  • (2021) Depth Privileged Scene Recognition via Dual Attention Hallucination. IEEE Transactions on Image Processing 30, 9164--9178. https://doi.org/10.1109/TIP.2021.3122955
  • (2021) RGB-D scene analysis in the NICU. Computers in Biology and Medicine 138, 104873. https://doi.org/10.1016/j.compbiomed.2021.104873
  • (2021) RGB-D Co-attention Network for Semantic Segmentation. Computer Vision -- ACCV 2020, 519--536. https://doi.org/10.1007/978-3-030-69525-5_31
  • (2020) A part-based spatial and temporal aggregation method for dynamic scene recognition. Neural Computing and Applications. https://doi.org/10.1007/s00521-020-05415-3

      Published In

      MM '19: Proceedings of the 27th ACM International Conference on Multimedia
      October 2019
      2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Author Tags

      1. attention
      2. challenging conditions
      3. gradient-sensitive
      4. rgb-d video
      5. scene recognition

      Qualifiers

      • Research-article

      Acceptance Rates

      MM '19 Paper Acceptance Rate 252 of 936 submissions, 27%;
      Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

