Effective Fusion of Multi-Modal Remote Sensing Data in a Fully Convolutional Network for Semantic Labeling
Figure captions:
- Multi-resolution model used to analyze the contribution of individual layers to each class (conv: convolution layer; ReLU: rectified linear unit, the nonlinearity; sum: element-wise summation).
- Fusion model incorporating multi-modal remote sensing data (imagery and elevation). "Fusion3" means that the dual streams (elevation and color) are fused in layer 3; the same holds for Fusion1 up to Fusion5. The fusion works element-wise after the ReLU layer.
- The DR (descent rate of recall) of the different layers; the left graph is for imagery, the right for elevation data.
- The contribution of layers 2 to 5 for different classes in images: (a) when layer 2 is removed, half of the trees in the scene are classified as low vegetation; (b) without layer 3, cars are mostly no longer detected; (c) when layer 4 is removed, the results are cluttered; conversely, adding layer 4 eliminates clutter; (d) without layer 5, parts of buildings are not detected. (White = impervious surface; dark blue = building; green = high vegetation; cyan = low vegetation.)
- The contribution of layers 2 to 5 for different classes in elevation data: (a) layer 2 contributes to impervious surfaces and low vegetation; when it is removed, they are no longer classified correctly; (b) without layer 3, trees are mostly no longer detected; (c) when layer 4 is removed, the results are noisy, and trees and low vegetation are not classified correctly; (d) without layer 5, large parts of buildings are no longer detected.
- Qualitative segmentation results: the first three rows show the color input, the elevation data, and the ground truth; the following rows show the segmentation results of the different models; the last two rows show the results of our fusion models. The Fusion3 model obtains the best accuracy for the Vaihingen dataset.
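The fusion strategy summarized in the captions above (separate color and elevation streams, combined by element-wise summation after the ReLU of a chosen fusion layer) can be sketched in a few lines. The following PyTorch snippet is a minimal illustration only: the layer widths, block counts, and pooling scheme are assumptions made for the sketch, not the exact network configuration of the paper.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # One encoder stage: convolution, ReLU (the fusion acts on its output), pooling.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2),
    )

class DualStreamFusion(nn.Module):
    """Color and elevation streams, fused by element-wise summation at one layer."""

    def __init__(self, fusion_layer=3, widths=(64, 128, 256, 512, 512)):
        super().__init__()
        self.fusion_layer = fusion_layer  # "FusionN" corresponds to fusion_layer=N
        self.color_blocks = nn.ModuleList()
        self.elev_blocks = nn.ModuleList()
        in_c, in_e = 3, 1  # three-band imagery, single-band elevation
        for i, w in enumerate(widths, start=1):
            self.color_blocks.append(conv_block(in_c, w))
            in_c = w
            if i <= fusion_layer:
                # The elevation stream is only needed up to the fusion layer.
                self.elev_blocks.append(conv_block(in_e, w))
                in_e = w

    def forward(self, color, elevation):
        x, d = color, elevation
        for i, block in enumerate(self.color_blocks, start=1):
            x = block(x)
            if i <= self.fusion_layer:
                d = self.elev_blocks[i - 1](d)
            if i == self.fusion_layer:
                x = x + d  # element-wise sum after the ReLU of layer i
        return x  # fused deep features; a decoder with skip connections would follow

# Example: Fusion3 on a 256 x 256 tile.
model = DualStreamFusion(fusion_layer=3)
features = model(torch.randn(1, 3, 256, 256), torch.randn(1, 1, 256, 256))
```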
Abstract
1. Introduction
2. Related Work
2.1. FCN for Images
2.2. FCN for Multi-Modal Data
3. Multi-Resolution Model and Fusion Framework
3.1. Multi-Resolution Model
3.2. Fusion Strategy
3.3. Fusion Model
4. Experiments
4.1. Datasets
4.2. Training
4.2.1. Multi-Resolution Model Training
4.2.2. Fusion Model Training
4.3. Evaluation of Experimental Results
4.4. Quantitative Results
4.5. Fusion Model Results
5. Discussion
6. Conclusions
- (i) For multi-modal data, the individual modes contribute differently to specific classes. Performance can be improved by avoiding conflicts between layers, i.e., differing sensitivities to specific classes; this allows us to design adaptive models suited to semantic labeling tasks.
- (ii) Layers are fused by element-wise summation after the ReLU layer, which preserves more feature information and strengthens the activations. The deeper layers are sensitive to objects with more complex texture that occupy larger parts of the scene.
- (iii) The skip architecture in the fusion makes it possible to utilize the shallow as well as the deep features (a sketch follows this list). Combining all these components, the multi-modal fusion model incorporates heterogeneous data precisely and effectively.
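As a minimal illustration of point (iii), the sketch below combines an upsampled deep score map with a shallower, higher-resolution score map by element-wise summation, in the spirit of an FCN skip architecture. The 1x1 score convolutions, channel counts, and bilinear upsampling are assumptions made for the sketch, not the exact decoder used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipSumDecoder(nn.Module):
    """Fuse shallow (fine) and deep (coarse) features by summing their score maps."""

    def __init__(self, shallow_ch, deep_ch, num_classes):
        super().__init__()
        # 1x1 convolutions project both feature maps to per-class scores,
        # so they can be summed element-wise.
        self.score_shallow = nn.Conv2d(shallow_ch, num_classes, kernel_size=1)
        self.score_deep = nn.Conv2d(deep_ch, num_classes, kernel_size=1)

    def forward(self, shallow_feat, deep_feat, out_size):
        deep = self.score_deep(deep_feat)
        # Upsample the coarse scores to the resolution of the shallow features.
        deep = F.interpolate(deep, size=shallow_feat.shape[-2:],
                             mode="bilinear", align_corners=False)
        fused = self.score_shallow(shallow_feat) + deep  # skip via element-wise sum
        # Final upsampling to the input resolution for dense per-pixel labels.
        return F.interpolate(fused, size=out_size,
                             mode="bilinear", align_corners=False)

# Example: six ISPRS classes, shallow features at 1/4 and deep features at 1/32 resolution.
decoder = SkipSumDecoder(shallow_ch=128, deep_ch=512, num_classes=6)
scores = decoder(torch.randn(1, 128, 64, 64), torch.randn(1, 512, 8, 8), out_size=(256, 256))
```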
Acknowledgments
Author Contributions
Conflicts of Interest
References
Per-class recall of the image stream for different layer combinations (e.g., Layers-345 is the model with layer 2 removed):

| Recall | Imp_Surf | Building | Low_Veg | Tree | Car | Clutter |
|---|---|---|---|---|---|---|
| Layers-all | 0.86 | 0.86 | 0.72 | 0.84 | 0.43 | 0.52 |
| Layer-2 | 0.19 | 0.04 | 0 | 0.69 | 0.6 | 0.91 |
| Layers-345 | 0.84 | 0.88 | 0.84 | 0.67 | 0.28 | 0 |
| Layer-3 | 0.59 | 0.03 | 0.01 | 0.01 | 0.86 | 0.66 |
| Layers-245 | 0.78 | 0.88 | 0.74 | 0.86 | 0.05 | 0.49 |
| Layer-4 | 0.83 | 0.83 | 0.65 | 0.88 | 0.01 | 0 |
| Layers-235 | 0.26 | 0.53 | 0.5 | 0.39 | 0.5 | 0.91 |
| Layer-5 | 0 | 0.99 | 0.23 | 0 | 0 | 0 |
| Layers-234 | 0.88 | 0.78 | 0.45 | 0.93 | 0.66 | 0.47 |
Per-class recall of the elevation stream for different layer combinations:

| Recall | Imp_Surf | Building | Low_Veg | Tree | Car | Clutter |
|---|---|---|---|---|---|---|
| Layers-all | 0.79 | 0.68 | 0.27 | 0.62 | 0 | 0 |
| Layer-2 | 0.92 | 0 | 0.07 | 0.52 | 0 | 0.01 |
| Layers-345 | 0.04 | 0.84 | 0.14 | 0.55 | 0 | 0 |
| Layer-3 | 0.98 | 0 | 0 | 0.24 | 0 | 0 |
| Layers-245 | 0.72 | 0.8 | 0.34 | 0.53 | 0 | 0 |
| Layer-4 | 0.59 | 0.51 | 0.11 | 0.8 | 0 | 0 |
| Layers-235 | 0.85 | 0.8 | 0.19 | 0.2 | 0 | 0 |
| Layer-5 | 0 | 1 | 0.04 | 0 | 0 | 0 |
| Layers-234 | 0.87 | 0.1 | 0.21 | 0.65 | 0 | 0 |
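The two recall tables above feed the DR (descent rate of recall) analysis mentioned in the figure captions: per-class recall is compared between the full model and a model with one layer removed. The exact DR definition is not reproduced here; the snippet below uses the relative drop of recall as an assumed, illustrative definition, with values taken from the imagery table.

```python
import numpy as np

def per_class_recall(confusion):
    """Recall per class from a confusion matrix (rows = reference, columns = prediction)."""
    confusion = np.asarray(confusion, dtype=float)
    return np.diag(confusion) / np.maximum(confusion.sum(axis=1), 1e-12)

def descent_rate(recall_all, recall_without_layer):
    """Assumed DR: relative drop in recall when one layer is removed."""
    recall_all = np.asarray(recall_all, dtype=float)
    drop = recall_all - np.asarray(recall_without_layer, dtype=float)
    return drop / np.maximum(recall_all, 1e-12)

# Imagery rows "Layers-all" and "Layers-245" (layer 3 removed):
recall_all = [0.86, 0.86, 0.72, 0.84, 0.43, 0.52]
recall_no3 = [0.78, 0.88, 0.74, 0.86, 0.05, 0.49]
print(np.round(descent_rate(recall_all, recall_no3), 2))
# The large DR for 'Car' reflects that layer 3 is decisive for detecting cars.
```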
Overall accuracy (OA) and F1 score of the different models on the Vaihingen dataset:

| Metric | Color-D | Fusion1 | Fusion2 | Fusion3 | Fusion4 | Fusion5 | Color | Elevation |
|---|---|---|---|---|---|---|---|---|
| OA (%) | 79.77 | 81.79 | 82.21 | 82.69 | 81.75 | 81.70 | 81.86 | 60.53 |
| F1 | 0.80 | 0.81 | 0.82 | 0.83 | 0.81 | 0.81 | 0.81 | 0.58 |
Per-class F1 scores on the Vaihingen dataset:

| Model | Imp_Surf | Building | Low_Veg | Tree | Car | Clutter |
|---|---|---|---|---|---|---|
| Color-D | 0.82 | 0.86 | 0.69 | 0.81 | 0.56 | 0.59 |
| Fusion1 | 0.85 | 0.88 | 0.72 | 0.83 | 0.5 | 0.54 |
| Fusion2 | 0.85 | 0.89 | 0.72 | 0.82 | 0.47 | 0.58 |
| Fusion3 | 0.85 | 0.9 | 0.72 | 0.84 | 0.56 | 0.57 |
| Fusion4 | 0.85 | 0.89 | 0.71 | 0.82 | 0.51 | 0.58 |
| Fusion5 | 0.84 | 0.89 | 0.7 | 0.83 | 0.55 | 0.62 |
| Color | 0.84 | 0.87 | 0.72 | 0.84 | 0.54 | 0.66 |
| Elevation | 0.69 | 0.71 | 0.34 | 0.59 | 0 | 0 |
Overall accuracy (OA) and F1 score of the different models on the second dataset (RGB imagery and elevation):

| Metric | RGB-D | Fusion1 | Fusion2 | Fusion3 | Fusion4 | Fusion5 | RGB | Elevation |
|---|---|---|---|---|---|---|---|---|
| OA (%) | 79.21 | 80.35 | 80.99 | 81.64 | 81.78 | 81.80 | 78.31 | 63.16 |
| F1 | 0.79 | 0.78 | 0.78 | 0.79 | 0.80 | 0.80 | 0.78 | 0.48 |
Per-class F1 scores on the second dataset (RGB imagery and elevation):

| Model | Imp_Surf | Building | Low_Veg | Tree | Car | Clutter |
|---|---|---|---|---|---|---|
| RGB-D | 0.80 | 0.86 | 0.74 | 0.69 | 0.70 | 0.50 |
| Fusion1 | 0.80 | 0.85 | 0.74 | 0.64 | 0.55 | 0.29 |
| Fusion2 | 0.81 | 0.88 | 0.75 | 0.68 | 0.68 | 0.45 |
| Fusion3 | 0.82 | 0.89 | 0.76 | 0.71 | 0.74 | 0.52 |
| Fusion4 | 0.83 | 0.90 | 0.77 | 0.72 | 0.74 | 0.51 |
| Fusion5 | 0.83 | 0.90 | 0.77 | 0.72 | 0.74 | 0.51 |
| RGB | 0.81 | 0.85 | 0.76 | 0.71 | 0.77 | 0.50 |
| Elevation | 0.56 | 0.80 | 0.16 | 0.47 | 0 | 0 |
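The overall accuracy (OA) and F1 scores reported in the tables above can be derived from a confusion matrix as sketched below. Depending on the evaluation protocol, class boundaries may be eroded before scoring; that step is omitted here, so this is an illustrative computation rather than the exact benchmark procedure.

```python
import numpy as np

def confusion_matrix(reference, prediction, num_classes):
    """Accumulate a confusion matrix from flattened label maps (rows = reference)."""
    idx = num_classes * reference.reshape(-1) + prediction.reshape(-1)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def overall_accuracy(confusion):
    confusion = np.asarray(confusion, dtype=float)
    return np.trace(confusion) / confusion.sum()

def per_class_f1(confusion):
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    precision = tp / np.maximum(confusion.sum(axis=0), 1e-12)
    recall = tp / np.maximum(confusion.sum(axis=1), 1e-12)
    return 2 * precision * recall / np.maximum(precision + recall, 1e-12)

# Example with random labels for the six ISPRS classes.
rng = np.random.default_rng(0)
ref = rng.integers(0, 6, size=(256, 256))
pred = rng.integers(0, 6, size=(256, 256))
cm = confusion_matrix(ref, pred, num_classes=6)
print(overall_accuracy(cm), per_class_f1(cm))
```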
© 2017 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).