Multi-modal Unsupervised Feature Learning for RGB-D Scene Labeling

Anran Wang¹⁹,
Jiwen Lu²⁰,
Gang Wang^19,20,
Jianfei Cai¹⁹ &
…
Tat-Jen Cham¹⁹

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 8693))

Included in the following conference series:

European Conference on Computer Vision

23k Accesses
21 Citations
3 Altmetric

Abstract

Most of the existing approaches for RGB-D indoor scene labeling employ hand-crafted features for each modality independently and combine them in a heuristic manner. There has been some attempt on directly learning features from raw RGB-D data, but the performance is not satisfactory. In this paper, we adapt the unsupervised feature learning technique for RGB-D labeling as a multi-modality learning problem. Our learning framework performs feature learning and feature encoding simultaneously which significantly boosts the performance. By stacking basic learning structure, higher-level features are derived and combined with lower-level features for better representing RGB-D data. Experimental results on the benchmark NYU depth dataset show that our method achieves competitive performance, compared with state-of-the-art.

Download to read the full chapter text

Chapter PDF

Learning Common and Specific Features for RGB-D Semantic Segmentation with Deconvolutional Networks

Cross-Modal Pyramid Translation for RGB-D Scene Recognition

Article 18 May 2021

Learning depth-aware features for indoor scene understanding

Article 12 July 2022

Keywords

References

Micusik, B., Kosecka, J.: Semantic segmentation of street scenes by superpixel co-occurrence and 3d geometry. In: ICCV Workshops, pp. 625–632 (2009)
Google Scholar
Lempitsky, V.S., Vedaldi, A., Zisserman, A.: Pylon model for semantic segmentation. In: NIPS, pp. 1485–1493 (2011)
Google Scholar
Socher, R., Lin, C.C., Manning, C., Ng, A.Y.: Parsing natural scenes and natural language with recursive neural networks. In: ICML, pp. 129–136 (2011)
Google Scholar
Galleguillos, C., Rabinovich, A., Belongie, S.: Object categorization using co-occurrence, location and appearance. In: CVPR, pp. 1–8 (2008)
Google Scholar
Grangier, D., Bottou, L., Collobert, R.: Deep convolutional networks for scene parsing. In: ICML Deep Learning Workshop (2009)
Google Scholar
Farabet, C., Couprie, C., Najman, L., LeCun, Y.: Learning hierarchical features for scene labeling. IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1915–1929 (2013)
Article Google Scholar
Silberman, N., Fergus, R.: Indoor scene segmentation using a structured light sensor. In: ICCV Workshops, pp. 601–608 (2011)
Google Scholar
Ren, X., Bo, L., Fox, D.: Rgb-(d) scene labeling: Features and algorithms. In: CVPR, pp. 2759–2766 (2012)
Google Scholar
Cadena, C., Košecka, J.: Semantic parsing for priming object detection in rgb-d scenes. In: Workshop on Semantic Perception, Mapping and Exploration (2013)
Google Scholar
Le, Q.V., Karpenko, A., Ngiam, J., Ng, A.Y.: Ica with reconstruction cost for efficient overcomplete feature learning. In: NIPS, pp. 1017–1025 (2011)
Google Scholar
Le, Q.V., Zou, W.Y., Yeung, S.Y., Ng, A.Y.: Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In: CVPR, pp. 3361–3368 (2011)
Google Scholar
Couprie, C., Farabet, C., Najman, L., LeCun, Y.: Indoor semantic segmentation using depth information. arXiv preprint arXiv:1301.3572 (2013)
Google Scholar
Pei, D., Liu, H., Liu, Y., Sun, F.: Unsupervised multimodal feature learning for semantic image segmentation. In: IJCNN (2013)
Google Scholar
He, X., Zemel, R., Carreira-Perpindn, M.: Multiscale conditional random fields for image labeling. In: CVPR (2004)
Google Scholar
Shotton, J., Winn, J.M., Rother, C., Criminisi, A.: textonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In: Leonardis, A., Bischof, H., Pinz, A. (eds.) ECCV 2006, Part I. LNCS, vol. 3951, pp. 1–15. Springer, Heidelberg (2006)
Chapter Google Scholar
Lowe, D.G.: Object recognition from local scale-invariant features. In: ICCV, pp. 1150–1157 (1999)
Google Scholar
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR, pp. 886–893 (2005)
Google Scholar
Gould, S., Fulton, R., Koller, D.: Decomposing a scene into geometric and semantically consistent regions. In: ICCV (2009)
Google Scholar
Liu, C., Yuen, J., Torralba, A.: Sift flow: Dense correspondence across scenes and its applications. PAMI (2011)
Google Scholar
Koppula, H.S., Anand, A., Joachims, T., Saxena, A.: Semantic labeling of 3d point clouds for indoor scenes. In: NIPS, pp. 244–252 (2011)
Google Scholar
Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Computation 18(7), 1527–1554 (2006)
Article MATH MathSciNet Google Scholar
Yang, J., Yu, K., Gong, Y., Huang, T.: Linear spatial pyramid matching using sparse coding for image classification. In: CVPR, pp. 1794–1801 (2009)
Google Scholar
Kumar, A., Rai, P., Daumé III, H.: Co-regularized multi-view spectral clustering. In: NIPS, pp. 1413–1421 (2011)
Google Scholar
Potamianos, G., Neti, C., Luettin, J., Matthews, I.: Audio-visual automatic speech recognition: An overview. Issues in Visual and Audio-Visual Speech Processing 22, 23 (2004)
Google Scholar
Ngiam, J., Khosla, A., Kim, M., Nam, J., Lee, H., Ng, A.Y.: Multimodal deep learning. In: ICML, pp. 689–696 (2011)
Google Scholar
Socher, R., Huval, B., Bhat, B., Manning, D.: C., Ng, A.Y.: Convolutional-Recursive Deep Learning for 3D Object Classification. In: NIPS, pp. 665–673 (2012)
Google Scholar
Coates, A., Ng, A.Y., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: International Conference on Artificial Intelligence and Statistics, pp. 215–223 (2011)
Google Scholar
Wang, H., Ullah, M.M., Klaser, A., Laptev, I., Schmid, C., et al.: Evaluation of local spatio-temporal features for action recognition. In: BMVC, pp. 124.1–124.11 (2009)
Google Scholar
Schimidt, M.: minfunc. (2005)
Google Scholar
Lee, H., Battle, A., Raina, R., Ng, A.Y.: Efficient sparse coding algorithms. In: NIPS, pp. 801–808 (2006)
Google Scholar
Lee, H., Grosse, R., Ranganath, R., Ng, A.Y.: Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In: ICML, pp. 609–616 (2009)
Google Scholar
Arbelaez, P., Maire, M., Fowlkes, C., Malik, J.: Contour detection and hierarchical image segmentation. PAMI, 898–916 (2011)
Google Scholar
Silberman, N., Hoiem, D., Kohli, P., Fergus, R.: Indoor segmentation and support inference from RGBD images. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012, Part V. LNCS, vol. 7576, pp. 746–760. Springer, Heidelberg (2012)
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Nanyang Technological University, Singapore
Anran Wang, Gang Wang, Jianfei Cai & Tat-Jen Cham
Advanced Digital Sciences Center, Singapore
Jiwen Lu & Gang Wang

Authors

Anran Wang
View author publications
You can also search for this author in PubMed Google Scholar
Jiwen Lu
View author publications
You can also search for this author in PubMed Google Scholar
Gang Wang
View author publications
You can also search for this author in PubMed Google Scholar
Jianfei Cai
View author publications
You can also search for this author in PubMed Google Scholar
Tat-Jen Cham
View author publications
You can also search for this author in PubMed Google Scholar

Editor information

Editors and Affiliations

Department of Computer Science, University of Toronto, 6 King’s College Road, M5H 3S5, Toronto, ON, Canada
David Fleet
Faculty of Electrical Engineering, Department of Cybernetics, Czech Technical University in Prague, Technicka 2, 166 27, Prague 6, Czech Republic
Tomas Pajdla
Max-Planck-Institut für Informatik, Campus E1 4, 66123, Saarbrücken, Germany
Bernt Schiele
ESAT - PSI, iMinds, KU Leuven, Kasteelpark Arenberg 10, Bus 2441, 3001, Leuven, Belgium
Tinne Tuytelaars

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Wang, A., Lu, J., Wang, G., Cai, J., Cham, TJ. (2014). Multi-modal Unsupervised Feature Learning for RGB-D Scene Labeling. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8693. Springer, Cham. https://doi.org/10.1007/978-3-319-10602-1_30

Download citation

DOI: https://doi.org/10.1007/978-3-319-10602-1_30
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10601-4
Online ISBN: 978-3-319-10602-1
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Multi-modal Unsupervised Feature Learning for RGB-D Scene Labeling

Abstract

Chapter PDF

Similar content being viewed by others

Learning Common and Specific Features for RGB-D Semantic Segmentation with Deconvolutional Networks

Cross-Modal Pyramid Translation for RGB-D Scene Recognition

Learning depth-aware features for indoor scene understanding

Keywords

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Publish with us

Navigation

Multi-modal Unsupervised Feature Learning for RGB-D Scene Labeling

Abstract

Chapter PDF

Similar content being viewed by others

Learning Common and Specific Features for RGB-D Semantic Segmentation with Deconvolutional Networks

Cross-Modal Pyramid Translation for RGB-D Scene Recognition

Learning depth-aware features for indoor scene understanding

Keywords

References

Author information

Authors and Affiliations

Editor information

Editors and Affiliations

Rights and permissions

Copyright information

About this paper

Cite this paper

Download citation

Share this paper

Publish with us

Search

Navigation