
Depth-Adaptive Deep Neural Network for Semantic Segmentation

Published: 01 September 2018

Abstract

In this paper, we present a depth-adaptive deep neural network that uses a depth map for semantic segmentation. Typical deep neural networks receive inputs at predetermined locations regardless of the distance from the camera. This fixed receptive field makes it difficult to generalize the features of objects at various distances: the predetermined receptive fields are too small for nearby objects and too large for distant ones. To overcome this challenge, we develop a neural network that can adapt the receptive field not only for each layer but also for each neuron at each spatial location. To adjust the receptive field, we propose the depth-adaptive multiscale (DaM) convolution layer, which consists of an adaptive perception neuron and an in-layer multiscale neuron. The adaptive perception neuron adjusts the receptive field at each spatial location using the corresponding depth information. The in-layer multiscale neuron applies a different receptive field size in each feature space to learn features at multiple scales. The proposed DaM convolution is applied to two fully convolutional neural networks. We demonstrate the effectiveness of the proposed networks on a publicly available RGB-D dataset for semantic segmentation and on a novel hand segmentation dataset for hand-object interaction. The experimental results show that the proposed method outperforms state-of-the-art methods without any additional layers or preprocessing/postprocessing.
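
The DaM layer can be pictured as a convolution whose sampling span varies with the measured depth at each pixel. The following minimal PyTorch sketch illustrates that idea; it is not the authors' implementation. It assumes a depth map normalized to [0, 1] (smaller values meaning closer pixels) and approximates per-neuron receptive-field adaptation by blending parallel dilated branches with depth-derived weights; the class name, the depth-to-dilation mapping, and the soft-blending scheme are all hypothetical simplifications.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthAdaptiveMultiscaleConv(nn.Module):
    """Hypothetical approximation of a DaM-style convolution: parallel
    dilated 3x3 branches blended per pixel according to depth."""

    def __init__(self, in_ch, out_ch, dilations=(1, 2, 4)):
        super().__init__()
        self.dilations = dilations
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )

    def forward(self, x, depth):
        # depth: (N, 1, H, W) in [0, 1]; smaller values mean closer pixels.
        # Closer pixels (larger apparent objects) get a larger target dilation.
        target = 1.0 + (1.0 - depth) * (max(self.dilations) - 1)
        # Soft, per-pixel assignment to the branch with the nearest dilation.
        scores = torch.stack([-(target - d).abs() for d in self.dilations])
        weights = F.softmax(scores, dim=0)  # sums to 1 over the branch axis
        return sum(w * branch(x) for w, branch in zip(weights, self.branches))

# Quick check on dummy data: an aligned feature map and depth map.
x = torch.randn(2, 16, 64, 64)
depth = torch.rand(2, 1, 64, 64)
layer = DepthAdaptiveMultiscaleConv(16, 32)
print(layer(x, depth).shape)  # torch.Size([2, 32, 64, 64])

A true per-neuron adaptation would resample the kernel support continuously (e.g., via bilinear sampling at depth-scaled offsets); the branch-blending above trades that precision for a simpler, obviously differentiable form.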




    Published In

IEEE Transactions on Multimedia, Volume 20, Issue 9
September 2018
306 pages

    Publisher

    IEEE Press

    Publication History

    Published: 01 September 2018

    Qualifiers

    • Research-article


    Article Metrics

• Downloads (last 12 months): 0
• Downloads (last 6 weeks): 0

Reflects downloads up to 22 Dec 2024.

Cited By
• (2024) Learning Cross-modality Interaction for Robust Depth Perception of Autonomous Driving. ACM Transactions on Intelligent Systems and Technology, 15(3), 1–26. DOI: 10.1145/3650039. Online publication date: 1-Mar-2024.
• (2024) Pyramid Fusion Transformer for Semantic Segmentation. IEEE Transactions on Multimedia, 26, 9630–9643. DOI: 10.1109/TMM.2024.3396281. Online publication date: 28-May-2024.
• (2024) Dual-Guided Frequency Prototype Network for Few-Shot Semantic Segmentation. IEEE Transactions on Multimedia, 26, 8874–8888. DOI: 10.1109/TMM.2024.3383276. Online publication date: 29-Mar-2024.
• (2024) Query-Guided Prototype Evolution Network for Few-Shot Segmentation. IEEE Transactions on Multimedia, 26, 6501–6512. DOI: 10.1109/TMM.2024.3352921. Online publication date: 11-Jan-2024.
• (2024) Enhancing long-term person re-identification using global, local body part, and head streams. Neurocomputing, 580(C). DOI: 10.1016/j.neucom.2024.127480. Online publication date: 1-May-2024.
• (2024) Pixel-level clustering network for unsupervised image segmentation. Engineering Applications of Artificial Intelligence, 127(PB). DOI: 10.1016/j.engappai.2023.107327. Online publication date: 1-Jan-2024.
• (2023) FECANet: Boosting Few-Shot Semantic Segmentation With Feature-Enhanced Context-Aware Network. IEEE Transactions on Multimedia, 25, 8580–8592. DOI: 10.1109/TMM.2023.3238521. Online publication date: 1-Jan-2023.
• (2023) Cellular Binary Neural Network for Accurate Image Classification and Semantic Segmentation. IEEE Transactions on Multimedia, 25, 8064–8075. DOI: 10.1109/TMM.2022.3233255. Online publication date: 1-Jan-2023.
• (2023) Self-Ensembling GAN for Cross-Domain Semantic Segmentation. IEEE Transactions on Multimedia, 25, 7837–7850. DOI: 10.1109/TMM.2022.3229976. Online publication date: 1-Jan-2023.
• (2023) A Boundary-Aware Network for Shadow Removal. IEEE Transactions on Multimedia, 25, 6782–6793. DOI: 10.1109/TMM.2022.3214422. Online publication date: 1-Jan-2023.
