Authors:
Mohammed Hassoubah 1,2; Ibrahim Sobh 2 and Mohamed Elhelw 1
Affiliations:
1 Center for Informatics Science, Nile University, Egypt
2 Valeo, Egypt
Keyword(s):
Epistemic Uncertainty, LiDAR, Self-supervision Training, Semantic Segmentation, Transformer.
Abstract:
For semantic segmentation of 2D or 3D inputs, the Transformer architecture suffers from limited localization ability because it lacks low-level details. Moreover, the Transformer must be pre-trained to perform well, and its pre-training is still an open area of research. In this work, the Transformer is integrated into the U-Net architecture, as in (Chen et al., 2021). The combined architecture is trained to perform semantic segmentation of 2D spherical images generated by projecting the 3D LiDAR point cloud.
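For reference, below is a minimal sketch of the kind of spherical (range-image) projection commonly used for SemanticKITTI-style LiDAR scans (e.g., in RangeNet++ and SalsaNext); the field-of-view bounds and channel layout are illustrative assumptions, not details taken from the paper:

    import numpy as np

    def spherical_projection(points, H=64, W=1024, fov_up_deg=3.0, fov_down_deg=-25.0):
        # Project an (N, 4) point cloud (x, y, z, remission) onto an H x W
        # spherical range image. FOV bounds are typical Velodyne HDL-64E
        # values (assumptions, not taken from the paper).
        x, y, z = points[:, 0], points[:, 1], points[:, 2]
        depth = np.linalg.norm(points[:, :3], axis=1)

        yaw = np.arctan2(y, x)                              # azimuth in [-pi, pi]
        pitch = np.arcsin(z / np.clip(depth, 1e-8, None))   # elevation

        fov_up = np.radians(fov_up_deg)
        fov = fov_up - np.radians(fov_down_deg)

        # Map angles to pixel coordinates: column from azimuth, row from elevation.
        u = np.clip(np.floor(0.5 * (1.0 - yaw / np.pi) * W), 0, W - 1).astype(np.int32)
        v = np.clip(np.floor((fov_up - pitch) / fov * H), 0, H - 1).astype(np.int32)

        # Write farthest points first so nearer points overwrite them.
        order = np.argsort(depth)[::-1]
        image = np.zeros((H, W, 5), dtype=np.float32)       # x, y, z, remission, depth
        image[v[order], u[order], :4] = points[order]
        image[v[order], u[order], 4] = depth[order]
        return image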
This integration captures local dependencies through CNN backbone processing of the input, followed by Transformer processing that captures long-range dependencies. To determine the best pre-training settings, multiple ablations were run over the network architecture, the self-training loss function and the self-training procedure, and the results were analysed. The results show that the integrated architecture with self-training improves the mIoU by +1.75% over the U-Net architecture alone, even when the latter is also self-trained. Corrupting the input and self-training the network to reconstruct the original input improves the mIoU by up to 2.9% over using a reconstruction plus contrastive training objective. Self-training the model improves the mIoU by 0.48% over initialising with an ImageNet pre-trained model, even when the pre-trained model is also self-trained. Random initialisation of the batch normalisation layers improves the mIoU by 2.66% over using self-trained parameters. Self-supervised training of the segmentation network also reduces the model's epistemic uncertainty. The integrated architecture with self-training outperforms SalsaNext (Cortinhal et al., 2020), to our knowledge the best projection-based semantic segmentation network, by 5.53% mIoU on the SemanticKITTI (Behley et al., 2019) validation set with a 2D input dimension of 1024×64.
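The corruption-and-reconstruction self-training objective referred to above can be summarised with a short sketch; the corruption scheme (random pixel dropping), the drop probability and the L2 reconstruction loss are illustrative assumptions rather than the paper's exact settings:

    import torch
    import torch.nn.functional as F

    def corrupt(x, drop_prob=0.25):
        # Randomly zero out pixels of the (B, C, H, W) input range image;
        # the mask is shared across channels. Hypothetical corruption scheme.
        mask = (torch.rand_like(x[:, :1]) > drop_prob).float()
        return x * mask

    def self_training_step(model, x, optimizer):
        # One denoising self-supervision step: reconstruct the clean input
        # from its corrupted version with an L2 objective.
        x_reconstructed = model(corrupt(x))    # model outputs an input-shaped tensor
        loss = F.mse_loss(x_reconstructed, x)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()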