1. Introduction
While GPS has been in use for years, its accuracy and sensitivity to environmental conditions remain problematic, hindering the implementation of autonomous driving systems in real-world scenarios. To compensate for these instabilities, utilizing multiple sensors to complement each other’s shortcomings becomes crucial. For instance, IMUs collect inertial data from the vehicle, LiDAR acquires 3D environmental information, cameras record the road and surrounding objects, and radar and ultrasonic sensors detect distances. This wealth of sensory data collaborates to overcome the limitations of any single sensor, generating high-precision maps for use in unmanned vehicles.
However, the traditional Multi-Stage Discrete Pipeline of Tasks architecture, where each stage operates independently, often leads to information loss or distortion between stages, increasing computational cost and latency. Early Sensor Fusion, a more advanced approach, fuses raw data from different sensors at the data level, leveraging their characteristics and advantages. This enables the system to learn collectively from multiple sensor data sources. PointFusion [
1] highlights the advantages of early sensor fusion in enhancing overall perception performance, while the Multi-View 3D (MV3D) network [2], which explores multi-view 3D object detection, further confirms its positive impact on autonomous driving systems.
Despite its advantages, implementing early sensor fusion faces challenges like sensor calibration, data synchronization, and complex fusion algorithms. To overcome these, using surround RGB monocular cameras for Bird’s-Eye View (BEV) prediction offers a viable alternative. This approach integrates visual data from surround cameras, providing rich spatial and semantic information without relying on multiple sensors, thus simplifying the perception system architecture and improving real-time performance and responsiveness.
Our method integrates camera parameters with image features extracted through a CNN (Convolutional Neural Network) to generate mapping features projected from the image plane to the BEV map. To reconstruct the final semantic map, we improve the decoder architecture by enhancing embedded features with a multi-head attention mechanism and refining dimensional elevation through transposed convolution. Additionally, we incorporate dilated convolution to expand the receptive field, effectively capturing more detailed features and contextual information and further enhancing the model’s performance.
2. Related Work
According to the U.S. Department of Transportation, 94% of vehicle crashes are caused by driver behaviour [
3], highlighting the potential of autonomous vehicles to enhance road safety. While autonomous driving systems may vary, they fundamentally comprise perception, localization, and mapping. The 3D sensor-based joint perception systems, though effective, increase hardware costs, necessitate precise calibration and synchronization, and demand high-performance computing for data alignment and fusion, as shown in CLR-BNN [
4]. However, in most scenarios, localization and mapping do not heavily rely on height information.
In autonomous driving, accurate environmental perception is key. Traditionally, LiDAR-based sensors, generating high-precision 3D point cloud data, have been widely used. These data are further segmented into voxels for identification, as in References [
5,
6,
7]. Image-based detection methods, utilizing depth and RGB cameras to generate stereo images and 3D candidate bounding boxes for object identification via deep learning algorithms like Fast RCNN [
8], have also been explored [
9,
10]. However, these methods, often focusing on a single viewpoint, have limitations in creating complete maps. The study “Vehicle Detection from 3D LiDAR Using Fully Convolutional Network” [
11] addressed these challenges by converting LiDAR data into a BEV representation and processing it with an FCN, leveraging its advantages in image processing and feature extraction. This approach effectively handles sparse 3D data while preserving spatial information on a 2D plane.
In studies of scene reconstruction and perception, conventional methods usually rely on binocular stereo images for matching. However, “A stereo matching algorithm with an adaptive window: Theory and experiment” [12] pointed out that such methods lack the fusion capability of multiple sensors (e.g., LiDAR or RGB-D cameras), leaving them with insufficient features for resolving pixel correspondences; the motion of dynamic objects further degrades matching accuracy. By relying on a single passive sensor, these approaches also forgo the performance gains that multi-sensor fusion could provide. On the other hand, Reference [13] noted the inherent limitation of monocular cameras in depth estimation: they do not provide depth information directly, and for distant objects the limited pixel resolution tends to increase depth estimation error. That study also focuses on vehicle detection and localization, so its scope is limited and it is less applicable to other targets such as pedestrians and motorcycles. For multi-view scenes, “A computer algorithm for reconstructing a scene from two projections” [14] emphasized the limitations of reconstruction methods that rely on only two viewpoints, which cannot effectively exploit multi-view information to improve the completeness and accuracy of the reconstruction. Such methods are restricted in large-scale scenes or applications requiring omnidirectional reconstruction; moreover, a single-sensor approach lacks sufficient depth cues to model scene geometry and has limited ability to handle dynamic objects. Among image-projection-based studies, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D” [15] proposed lifting multi-view 2D images into 3D space for autonomous driving perception, but it relies solely on 2D feature maps without incorporating additional depth-aware data. This design limits the accuracy of depth estimation and distance inference and is not strong enough to recover fine scene detail. “Structure-from-motion revisited” [16], which relies mainly on multi-view feature images for scene reconstruction, degrades severely in scenes with heavy occlusion or sparse feature points because it lacks complementary sensor data; it is also highly dependent on high-quality feature matching, whose stability drops sharply when scene texture is sparse or illumination changes. Although monocular cameras with multiple viewing angles can improve reconstruction to some extent, “FIERY: Future instance prediction in Bird’s-Eye View from surround monocular cameras” [17] points out that the depth perception of monocular cameras remains limited, especially for distant objects or scenes with unclear geometric features. This limitation easily leads to depth ambiguity or inaccurate object localization in BEV (Bird’s-Eye View) representations, further affecting the performance of downstream tasks.
FIERY [
17] addresses this by using surround RGB monocular cameras to estimate top-down map predictions, achieving BEV grid generation without auxiliary systems. It models future randomness end-to-end solely through camera data, creating a camera-based perception and prediction system that is more compact, cheaper, and has higher resolution than LiDAR sensing. Image cameras are cost-effective sensors for generating detailed environmental views, accurately segmenting scenes into meaningful regions and objects. Ordinary semantic segmentation results, being 2D, lack 3D spatial information crucial for autonomous driving tasks like object localization and obstacle avoidance. To address this, 2.5D projection transforms 2D semantic information into a BEV semantic map incorporating depth and spatial layout. Using surround RGB monocular cameras instead of multi-sensor systems reduces hardware costs, system complexity, and calibration/synchronization issues. It also decreases computational latency, improving real-time performance crucial for responsiveness in dynamic environments. Surround cameras provide rich visual information, including colour, texture, and depth cues, aiding accurate object identification and segmentation. Through deep learning, they effectively integrate semantic and spatial information, generating precise BEV maps vital for autonomous navigation and path planning.
This manuscript utilizes surround RGB monocular cameras to extract image features through a CNN, which are then fused with camera parameters to generate mapping features projected from the image plane to the BEV map. This replaces explicit geometry-based projection and avoids rigidly constrained learning. Additionally, we improve the decoder architecture: to reconstruct the final semantic map, the upsampling process is redesigned, enhancing embedded features through a multi-head attention mechanism and employing transposed convolution for refined dimensional elevation. This approach demonstrates robust performance, even under conditions of occlusion or insufficient lighting.
3. Proposed Overall System Architecture
In the proposed system, the cross-view attention mechanism from the CVT (Cross-View Transformer) architecture is leveraged to extract multi-scale image features from a set of six camera images. Particularly, such a mechanism aggregates image features and spatial information to generate spatially-aware feature embeddings. Then, the decoding process with several improvements for BEV semantic map generation is redesigned, resulting in a final BEV semantic map with a resolution of 200 × 200. In the following,
Table 1 and
Table 2 show the list of abbreviations and quantifiers used in our method and their corresponding meanings, respectively.
3.1. nuScenes Dataset
The nuScenes dataset [
18], released by Motional, a company dedicated to advancing autonomous vehicle technology, provides a valuable resource for research in the field. This large-scale dataset captures 15 h of driving data collected in the bustling cities of Boston and Singapore, encompassing diverse locations, times of day, and weather conditions. A key advantage of nuScenes over other datasets, such as KITTI [
19], is the inclusion of millimeter-wave radar sensors (RADAR, Radio Detection and Ranging) in its data acquisition setup. This addition enables more robust perception capabilities for autonomous driving systems, particularly in adverse weather conditions where other sensors may be less reliable. nuScenes offers a comprehensive suite of sensor data, and this rich sensor data, combined with high-quality manual annotations, forms a crucial foundation for developing and evaluating autonomous driving algorithms.
Table 3 provides a summary of the data acquisition configuration.
This study utilizes the nuScenes dataset [
18], a comprehensive resource for autonomous driving research released by Motional. The dataset provides a rich collection of sensor data and annotations, enabling the development and evaluation of robust perception algorithms.
Table 4 outlines key terms and concepts within the nuScenes dataset. To facilitate data extraction and processing, this research leverages the nuScenes API to access information within each scene. These extracted data are then used to train the model to learn the mapping and attention mechanisms across different camera views.
3.2. Cross-View BEV Semantic Map Prediction Using a Multi-Head Attention Mechanism
In this task, we leverage the cross-view attention mechanism from the Cross-View Transformer (CVT) architecture [20] to extract multi-scale image features from a set of N = 6 camera images that together cover a full 360-degree view around the vehicle. Each image is processed by a shared convolutional encoder (we utilize a pre-trained EfficientNet [21] in our implementation), using known camera extrinsic and intrinsic parameters. The camera extrinsic parameters include the rotation matrix $R_k$ and the translation vector $t_k$, both expressed relative to the ego vehicle’s coordinate system. The cross-view attention mechanism aggregates the image features and camera parameters into spatially-aware feature embeddings.
To accurately represent the direction and position of camera-captured images in the BEV space, precise localization of the camera’s coordinates relative to the center of the BEV segmentation map, as well as its orientation and intrinsic parameters, is required. The nuScenes dataset provides camera extrinsic parameters, using $R_k$ and $t_k$ to describe the transformation from each camera’s coordinates to the global 2D coordinates $(x, y)$. Following prior work [17,20], we designate the location of the Front Radar as the center of the BEV segmentation map, transforming each camera’s world coordinates into the global 2D coordinates through computations involving $R_k$ and $t_k$. Additionally, the nuScenes dataset provides intrinsic parameters, such as focal length, principal point, and distortion coefficients, which represent the relative positional relationships of objects in the images. By accurately computing both the extrinsic and intrinsic parameters, our BEV segmentation map achieves more precise localization of the vehicle’s position and orientation in the camera-captured images.
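For illustration, the sketch below (not part of the original implementation) maps a camera's extrinsics into BEV grid indices, assuming $R_k$ and $t_k$ transform camera-frame points into the ego frame and that the 200 × 200 map at 0.5 m per cell is centred on the BEV origin; the helper name, the example extrinsic values, and the row/column orientation are illustrative assumptions.

```python
import numpy as np

def ego_to_bev_index(x, y, map_size=200, resolution=0.5):
    """Convert metric ego-frame (x, y) coordinates to BEV grid indices.

    Hypothetical helper: the map is map_size x map_size cells at `resolution`
    metres per cell, centred on the BEV origin; the row/column orientation
    is an assumed convention.
    """
    col = int(round(map_size / 2 + x / resolution))
    row = int(round(map_size / 2 - y / resolution))
    return row, col

# Extrinsics R_k, t_k map a camera-frame point p_cam into the ego frame.
R_k = np.eye(3)                      # illustrative values only
t_k = np.array([1.5, 0.0, 1.6])
p_cam = np.array([0.0, 0.0, 0.0])    # the camera centre in its own frame
p_ego = R_k @ p_cam + t_k            # camera centre expressed in the ego frame
print(ego_to_bev_index(p_ego[0], p_ego[1]))   # (100, 103)
```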
The cross-view attention mechanism aggregates image features and spatial information to generate spatially-aware feature embeddings. We then redesign the decoding process with several improvements for BEV semantic map generation. Positional encoding is applied to the feature embeddings to provide valuable spatial information to the multi-head attention module. After processing with attention heads, the feature maps undergo an initial upsampling step using standard transposed convolution. This design choice ensures that the importance of each pixel is preserved at lower resolutions. Subsequently, we perform two consecutive dilated convolution upsampling operations (dilation rate = 2), effectively upsampling with a larger receptive field and increasing the original input feature resolution by a factor of 4. To mitigate potential aliasing effects and sparsity introduced by the upsampling operations, the feature maps are processed by a regular convolutional layer to smooth the output and reduce noise and inconsistencies.
Through this process, we generate the final semantic segmentation prediction.
Figure 1 illustrates the complete architecture, detailing the core multi-head attention decoder mechanism and the refined upsampling steps.
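The individual decoder modules referenced above are illustrated with small code sketches in the subsections that follow; each sketch is a minimal, self-contained approximation rather than a reproduction of the released implementation.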
3.3. Multi-Head Attention Upsampling
In Stage 1 of
Figure 1, after obtaining the cross-view BEV feature map, we treat it as a two-dimensional spatial problem. Initially, we apply positional encoding to each pixel, utilizing sine and cosine functions to define their order, which preserves spatial relationships and positional information. For this reason, we do not employ local attention mechanisms, despite their smaller parameter count. Equation (1) gives the positional encoding for even indices and Equation (2) for odd indices, where $pos$ is the position index, $i$ is the dimension index ranging from $0$ to $d/2 - 1$, and $d$ is the input dimension:

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right) \tag{1}$$

$$PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right) \tag{2}$$
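A minimal sketch of this sinusoidal encoding is given below; the flattened 25 × 25 grid size is an assumption, while the channel dimension of 128 follows the setting reported later.

```python
import torch

def sinusoidal_positional_encoding(num_positions: int, d: int) -> torch.Tensor:
    """Standard sinusoidal encoding: sine on even indices, cosine on odd indices."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)      # (P, 1)
    i = torch.arange(d // 2, dtype=torch.float32)                            # (d/2,)
    freq = torch.exp(-torch.log(torch.tensor(10000.0)) * 2 * i / d)          # 1 / 10000^(2i/d)
    pe = torch.zeros(num_positions, d)
    pe[:, 0::2] = torch.sin(pos * freq)   # Equation (1): even dimensions
    pe[:, 1::2] = torch.cos(pos * freq)   # Equation (2): odd dimensions
    return pe

# Example: encode a flattened 25 x 25 BEV feature grid (an assumed resolution)
# with d = 128 channels, matching the decoder embedding dimension used later.
pe = sinusoidal_positional_encoding(25 * 25, 128)
print(pe.shape)   # torch.Size([625, 128])
```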
The Multi-head Self-Attention Upsampling Block (MSAU Block) consists of two main sub-modules: the Multi-head Self-Attention Block (MSA Block) and the TransposeConv Block, as shown in
Figure 2. After positional encoding, the encoded features are concatenated with the original feature maps and fed into the MSA block. A normalization (Norm) layer then standardizes the MSA block’s output, stabilizing training and helping the model converge to a good solution faster, before the result is passed to a Multilayer Perceptron (MLP) block for further processing. The multi-head attention mechanism captures the dependencies between different positions in the input sequence, and the MLP block applies a non-linear transformation to this captured information to enhance the model’s expressive power. This mechanism is mainly based on the Transformer Encoder architecture, as shown in
Figure 2. This architecture provides more explicit spatial information than using multi-head attention alone. We use a single Transformer Encoder, assuming that the features processed by the encoder have been initially converted into BEV map features. This approach avoids significant computational overhead. The multi-head attention mechanism reduces the computational cost by splitting the input channel into M heads. This mechanism effectively captures information from partially occluded, low-confidence objects and focuses attention on different objects, which enhances interpretability and contributes to the overall model understanding and convergence.
In this module, we employ a multi-head self-attention mechanism with M = 4 attention heads. In the multi-head attention mechanism, the number of heads M is a tunable hyperparameter. As M increases, each head can focus more on learning a distinct feature of the data, enabling the model to better capture the diversity of the data. However, if M becomes too large, the attention of each head may become overly divergent, and the dimensionality of the feature space available to each head decreases. This reduces the model’s ability to represent the global characteristics of the data effectively. Accordingly, the settings M = 4 and M = 6 were evaluated in our experiments. The results show that M = 4 yields better performance: this setting allows the attention mechanism to capture sufficient global contextual information while avoiding a decline in the ability to represent global features.
Each attention head independently calculates the linear projections of the query $Q$, key $K$, and value $V$. For the input feature map $X$, we first map it to the feature spaces of $Q$, $K$, and $V$ through linear transformations. The linear transformation matrices are defined as in Equation (3), where $d_k$ and $d_v$ represent the feature dimensions of $K$ and $V$, respectively, for each attention head:

$$W_i^{Q} \in \mathbb{R}^{D \times d_k}, \quad W_i^{K} \in \mathbb{R}^{D \times d_k}, \quad W_i^{V} \in \mathbb{R}^{D \times d_v} \tag{3}$$

This yields Equation (4):

$$Q_i = X W_i^{Q}, \quad K_i = X W_i^{K}, \quad V_i = X W_i^{V} \tag{4}$$

There are $M$ sets of $(Q_i, K_i, V_i)$, where $i$ denotes the $i$-th attention head. For each head, we calculate the self-attention weights, as expressed in Equation (5):

$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i \tag{5}$$
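The following is a minimal PyTorch sketch of Equations (3)–(5) with M = 4 heads and an embedding dimension of 128; the output projection, the flattened sequence length, and the per-head dimensions are illustrative assumptions, and the Norm and MLP layers of the full MSA block are omitted.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention over flattened BEV features (Equations (3)-(5))."""

    def __init__(self, d_model: int = 128, num_heads: int = 4):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # W^Q, Equation (3)
        self.w_k = nn.Linear(d_model, d_model)   # W^K
        self.w_v = nn.Linear(d_model, d_model)   # W^V
        self.w_o = nn.Linear(d_model, d_model)   # output projection after concatenating the heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        # Equation (4): linear projections, split into M heads.
        q = self.w_q(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        # Equation (5): scaled dot-product attention per head.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.w_o(out)

# Example: 625 flattened BEV positions (25 x 25, an assumption) with 128 channels.
x = torch.randn(2, 625, 128)
print(MultiHeadSelfAttention()(x).shape)   # torch.Size([2, 625, 128])
```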
Transposed convolution, also known as deconvolution or upsampling convolution, is a variant of the convolution operation in Convolutional Neural Networks (CNNs). It is primarily used to increase the spatial resolution of feature maps (i.e., upsample the feature maps). The goal of transposed convolution is to recover the spatial information lost during the convolution operation, generating a larger feature map from a smaller one. However, as convolution is inherently irreversible, transposed convolution learns to infer the lost information. The basic principle is to reverse the forward pass of a standard convolution. While standard convolution uses a kernel to slide over and compute the weighted sum of the input, producing a smaller feature map, transposed convolution performs this process in reverse, generating a larger feature map from a smaller one.
For a stride of $s$, a padding of $p$, and a kernel size of $k$, this operation involves inserting $s-1$ zeros between each pair of input pixels and padding the feature map with $k-p-1$ rows and columns of zeros to expand its size. The kernel matrix is then flipped vertically and horizontally. Finally, a standard convolution operation is applied to the interpolated feature map using the flipped kernel, producing the upsampled output. This method allows for precise control over the upsampling factor, and the kernel parameters are learned during training, so the network can automatically adjust them to suit specific task requirements.
Figure 3 illustrates an example of this procedure.
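The sketch below verifies this equivalence numerically for hypothetical values of the stride, padding, and kernel size (the specific values shown in Figure 3 are not reproduced here), and confirms the output-size formula $(n-1)s - 2p + k$.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, k, s, p = 4, 3, 2, 1                      # input size, kernel, stride, padding (hypothetical values)
x = torch.randn(1, 1, n, n)
w = torch.randn(1, 1, k, k)                  # conv_transpose2d weight layout: (in_ch, out_ch, k, k)

# Reference: PyTorch's built-in transposed convolution.
ref = F.conv_transpose2d(x, w, stride=s, padding=p)

# Manual equivalent: insert s-1 zeros between pixels, pad with k-p-1 zeros,
# then convolve with the flipped kernel.
up = torch.zeros(1, 1, (n - 1) * s + 1, (n - 1) * s + 1)
up[..., ::s, ::s] = x
up = F.pad(up, [k - 1 - p] * 4)
man = F.conv2d(up, w.transpose(0, 1).flip(-1, -2))

print(ref.shape)                             # (1, 1, (n-1)*s - 2p + k, ...) -> (1, 1, 7, 7)
print(torch.allclose(ref, man, atol=1e-6))   # True
```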
Deep neural networks often employ pooling or downsampling operations, which can lead to the loss of spatial details in feature maps. This loss is particularly pronounced after the multi-head attention module, potentially hindering the accurate distinction between adjacent objects. While nearest-neighbor and bilinear interpolation methods are insufficient to address this issue, a learned transposed convolution module is well-suited for this task. By training the transposed convolution module in conjunction with the multi-head attention module, the entire process is jointly optimized, resulting in more precise and detailed BEV semantic maps.
During feature upsampling, we employ transposed convolution with a kernel size of 4 and a stride of 2. This configuration doubles the spatial dimensions of the input feature map, allowing us to gradually recover spatial resolution without sacrificing crucial details. Subsequently, a convolutional layer with a small kernel performs further feature extraction on the upsampled feature maps, capturing fine-grained local features while maintaining computational efficiency. Finally, a pointwise convolution applies a linear transformation to the feature vector at each pixel, adjusting the channel dimension to match the output requirements without altering the spatial dimensions. The overall network architecture is illustrated in
Figure 4.
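A minimal sketch of this upsampling tail is shown below; the padding of 1 (so that the resolution is exactly doubled), the 3 × 3 kernel of the intermediate convolution, the channel widths, and the ReLU activations are assumptions rather than values reported in the paper.

```python
import torch
import torch.nn as nn

class UpsampleTail(nn.Module):
    """Sketch of the MSAU block's upsampling tail (channel sizes and activations are assumptions)."""

    def __init__(self, in_ch: int = 128, mid_ch: int = 128, out_ch: int = 64):
        super().__init__()
        # Kernel 4, stride 2 as in the paper; padding 1 is assumed so the
        # spatial size is exactly doubled: (n - 1) * 2 - 2 * 1 + 4 = 2n.
        self.up = nn.ConvTranspose2d(in_ch, mid_ch, kernel_size=4, stride=2, padding=1)
        self.conv = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1)   # local feature extraction
        self.point = nn.Conv2d(mid_ch, out_ch, kernel_size=1)             # pointwise channel adjustment
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.up(x))
        x = self.act(self.conv(x))
        return self.point(x)

x = torch.randn(1, 128, 25, 25)          # assumed decoder input resolution
print(UpsampleTail()(x).shape)           # torch.Size([1, 64, 50, 50])
```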
3.4. Dilated Convolution Upsampling
Dilated convolution, also known as atrous convolution, is a specialized convolution operation used in Convolutional Neural Networks (CNNs) to increase the receptive field of the convolutional kernel without increasing the computational cost. This is particularly useful for tasks that require capturing features over a larger spatial extent, such as image semantic segmentation and object detection. The key idea behind dilated convolution is to expand the receptive field by inserting spaces or “dilations” between the elements of the convolutional kernel, as illustrated in
Figure 5. The dashed lines in the figure represent the boundaries of the receptive field, visually highlighting how the kernel skips over certain pixels in the input feature map to capture information from more distant locations. This approach allows the kernel to “see” a wider range of the input feature map without increasing the number of parameters or computational complexity.
Dilated convolution can be described as introducing a dilation rate $r$ into the standard convolution operation. This dilation rate determines the spacing between the elements of the kernel. Let $x(i, j)$ be the value at position $(i, j)$ in the input feature map, $y(i, j)$ the value at position $(i, j)$ in the output feature map, $k$ the size of the kernel, and $w(m, n)$ the weight at position $(m, n)$ in the kernel. With a dilation rate of $r$, adjacent kernel elements are applied to input pixels that are $r$ positions apart (i.e., $r-1$ gaps are inserted between them). This leads to Equation (6):

$$y(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} x(i + r \cdot m,\; j + r \cdot n)\, w(m, n) \tag{6}$$
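The snippet below evaluates Equation (6) directly with nested loops and checks it against PyTorch's built-in dilated convolution; the tensor sizes are arbitrary illustrative values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
r, k = 2, 3                                   # dilation rate and kernel size
x = torch.randn(1, 1, 8, 8)
w = torch.randn(1, 1, k, k)

# Reference: built-in dilated convolution (no padding).
ref = F.conv2d(x, w, dilation=r)

# Manual evaluation of Equation (6): y(i, j) = sum_{m,n} x(i + r*m, j + r*n) * w(m, n).
out = torch.zeros_like(ref)
for i in range(ref.shape[2]):
    for j in range(ref.shape[3]):
        for m in range(k):
            for n in range(k):
                out[0, 0, i, j] += x[0, 0, i + r * m, j + r * n] * w[0, 0, m, n]

print(torch.allclose(ref, out, atol=1e-5))    # True
```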
In Stage 2, we stack multiple Dilated Convolution Upsampling Blocks (DCU Blocks). After the initial upsampling in Stage 1, we observe that the objects of interest in this experiment are relatively sparse, while the background information is rich and diverse. To address this, we employ a strategy that combines dilated convolutional networks with transposed convolutional networks. Dilated convolutions introduce a dilation rate, effectively expanding the receptive field without increasing the number of parameters by inserting spaces between the elements of the convolution kernel. This is particularly beneficial for handling sparse positive samples, as it allows the network to capture features over a larger area, enhancing generalization and mitigating overfitting.
The overall design of the DCU Block is similar to the transposed convolution module in Figure 4, with the key difference being the replacement of the regular convolutional layer with a dilated convolution layer (dilation rate = 2), as shown in Figure 6. This introduces a spacing of one element between adjacent kernel elements, expanding the effective coverage of the 3 × 3 kernel to 5 × 5 while maintaining the same number of parameters. To preserve the spatial dimensions of the feature maps, we apply padding of 2, which ensures that the output of the dilated convolution has the same dimensions as the input, retaining more spatial information. Due to the map configuration, we stack two DCU Blocks. After the two dilated convolution upsampling operations, the feature map is further upsampled by a factor of 4, achieving the desired resolution.
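A minimal sketch of one DCU block under these settings is shown below; the channel widths, the ReLU activations, and the kernel-4/stride-2/padding-1 transposed convolution are assumptions carried over from the earlier sketch.

```python
import torch
import torch.nn as nn

class DCUBlock(nn.Module):
    """Sketch of a Dilated Convolution Upsampling block (channel sizes are assumptions)."""

    def __init__(self, in_ch: int = 64, out_ch: int = 64):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        # The regular convolution is replaced by a dilated convolution (dilation rate = 2);
        # padding = 2 keeps the spatial dimensions unchanged.
        self.dil = nn.Conv2d(out_ch, out_ch, kernel_size=3, dilation=2, padding=2)
        self.point = nn.Conv2d(out_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.point(self.act(self.dil(self.act(self.up(x)))))

# Two stacked DCU blocks upsample Stage 1's output by a further factor of 4.
x = torch.randn(1, 64, 50, 50)
stage2 = nn.Sequential(DCUBlock(), DCUBlock())
print(stage2(x).shape)   # torch.Size([1, 64, 200, 200])
```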
3.5. Attention Skip Connection Layer
In the skip connection pathway, we introduce an Attention Skip Connection layer, inspired by attention mechanisms, to enhance the reliability of feature fusion. This design combines the strengths of both attention mechanisms and skip connections, enabling the model to dynamically adjust channel weights based on the input feature map. By emphasizing important features and suppressing irrelevant ones, this approach improves the model’s expressive capacity while preserving spatial information. To compute attention weights for the feature maps, we employ a multi-layer convolutional operation. This attention mechanism aims to adaptively adjust the weights of different locations within the feature maps, effectively enhancing the model’s focus on crucial features.
The network architecture is illustrated in
Figure 7. First, a 1 × 1 pointwise convolution reduces the channel dimension of the input feature map to a fraction of its original size. This operation performs a linear combination of channels without altering the spatial dimensions, reducing computational overhead. Batch normalization is then applied to standardize the output of the convolutional layer, accelerating training and stabilizing the numerical distribution of the feature maps. A ReLU activation function introduces non-linearity, enhancing the model’s expressive power and mitigating the vanishing gradient problem. Finally, another 1 × 1 pointwise convolution transforms the feature map into a single-channel attention map, whose weights are normalized to the range [0, 1] by a Sigmoid activation function. This output serves as a weighted mask for the feature map, emphasizing important features.
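The sketch below reflects this layer as described; the channel-reduction ratio and the elementwise application of the mask to the skip features are assumptions consistent with the description, not values taken from the paper.

```python
import torch
import torch.nn as nn

class AttentionSkip(nn.Module):
    """Sketch of the Attention Skip Connection layer (the reduction ratio is an assumption)."""

    def __init__(self, channels: int = 64, reduction: int = 4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),   # pointwise channel reduction
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=1),          # single-channel attention map
            nn.Sigmoid(),                                                # weights in [0, 1]
        )

    def forward(self, skip: torch.Tensor) -> torch.Tensor:
        return skip * self.attn(skip)   # re-weight the skip features before fusion

x = torch.randn(1, 64, 100, 100)
print(AttentionSkip()(x).shape)   # torch.Size([1, 64, 100, 100])
```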
3.6. Experimental Setup
The input consists of N = 6 surround-view RGB images. After encoder downsampling, the resulting feature embeddings are fed to the decoder. We utilize a multi-head attention configuration with M = 4 heads and a key embedding dimension of D = 128. Both the Multi-Head Attention Upsampling (MSAU) block and the Dilated Convolution Upsampling (DCU) block increase the resolution by a factor of 2, resulting in a final BEV semantic map with a resolution of 200 × 200.
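For reference, the implied upsampling arithmetic is checked below, assuming an initial decoder resolution of 25 × 25 (an inference from the overall ×8 upsampling to the 200 × 200 map, not a value stated in the text).

```python
# Upsampling-factor check: one MSAU block and two DCU blocks, each x2.
initial = 25                          # assumed decoder input resolution
after_stage1 = initial * 2            # MSAU block -> 50
after_stage2 = after_stage1 * 2 * 2   # two DCU blocks -> 200
print(after_stage2)                   # 200
```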
The model is implemented in PyTorch 1.11.0 and trained on an RTX 3090 GPU with a batch size of 4. We employ Focal Loss [
18] as the loss function, with settings outlined in
Table 5. Optimization is performed using the AdamW optimizer [
22] and a one-cycle learning rate scheduler [
23], with configurations detailed in
Table 6 and
Table 7. The total training time is 7 h.
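A minimal training-setup sketch with these components is given below; the model stand-in, learning rate, weight decay, schedule lengths, and focal-loss parameters are placeholders standing in for the actual settings listed in Tables 5–7.

```python
import torch
import torch.nn as nn
import torchvision

model = nn.Conv2d(64, 1, kernel_size=1)   # stand-in for the full network
steps_per_epoch, epochs = 1000, 30        # hypothetical values; see Tables 6 and 7 for the actual settings

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, steps_per_epoch=steps_per_epoch, epochs=epochs)
# scheduler.step() is called after every optimizer step during training.

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss on the BEV occupancy logits (alpha/gamma are placeholder values)."""
    return torchvision.ops.sigmoid_focal_loss(logits, targets, alpha=alpha, gamma=gamma, reduction="mean")
```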
4. Experimental Results
Following the framework established in reference [
24], we define the mapping region as a 100 m × 100 m area surrounding the vehicle, employing a sampling resolution of 0.5 m. This configuration yields a BEV semantic map with dimensions of 200 pixels × 200 pixels, consistent with the primary evaluation metric promoted and standardized by the Lift-Splat [
24] team.
To assess performance on the validation set, we employ the Intersection over Union (IoU) score as our primary metric. This metric quantifies the ratio between the intersection and union of the predicted segmentation output and the corresponding ground truth labels. Qualitative comparisons are also performed through visualization of the predicted labels, where pixel values are normalized to the range of [0, 1], with values approaching 1 represented as yellow. The primary benchmark for our experiments is the Cross-View Transformer (CVT) [
20] model, enabling a comprehensive evaluation of the advantages and improvements offered by our proposed approach.
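A minimal sketch of the IoU computation on a predicted BEV map is shown below; the 0.5 binarization threshold and the random example tensors are assumptions for illustration only.

```python
import torch

def bev_iou(pred: torch.Tensor, target: torch.Tensor, threshold: float = 0.5) -> float:
    """Intersection over Union between a predicted BEV probability map and the binary ground truth."""
    pred_bin = pred >= threshold
    target_bin = target.bool()
    intersection = (pred_bin & target_bin).sum().item()
    union = (pred_bin | target_bin).sum().item()
    return intersection / union if union > 0 else 1.0

pred = torch.rand(200, 200)                    # predicted probabilities in [0, 1]
target = (torch.rand(200, 200) > 0.9).float()  # illustrative binary ground truth
print(bev_iou(pred, target))
```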
4.1. Results of Multi-Head Attention Upsampling
As illustrated in
Figure 8, the qualitative comparison aligns with our initial expectations. In contrast to directly applying convolutional operations, the multi-head attention mechanism demonstrates superior spatial awareness, enabling the model to focus on distinct regions concurrently. Each attention head specializes in capturing different facets of features (e.g., shape, size, orientation), thereby enhancing the accuracy and robustness of object detection. The output of the multi-head attention module, which encapsulates relevant features and spatial relationships, serves as the input for the transposed convolution layer. This integration ensures that the spatial details and relationships learned by the attention mechanism are preserved and reinforced during the upsampling process. Consequently, this leads to more accurate and informative Bird’s-Eye View representations, as depicted in
Figure 8c. The multi-head attention module excels at capturing long-range dependencies and local details within the image, particularly for low-confidence, partially occluded objects. Conversely, the transposed convolution module focuses on recovering the spatial details lost due to repeated pooling or downsampling operations. The synergy between these two modules empowers the model to accurately comprehend global contextual information while meticulously recovering spatial details, significantly enhancing the quality of feature maps.
Furthermore, we investigated the effect of increasing the number of attention heads to 8, as shown in
Table 8. The results indicate that the model achieves optimal performance with M = 4 heads. This observation can be attributed to the dimensionality of the key embeddings (D = 128). When the number of heads increases to 8, the attention computation for each head potentially becomes excessively sparse. This sparsity may hinder the effective concentration of attention on crucial information, leading to a decline in attention quality and ultimately impacting the overall performance.
We also examine the impact of applying the multi-head attention mechanism before versus after upsampling. From
Table 9, it is clear that applying the multi-head attention mechanism at the lower resolution is the better choice, in terms of both IoU performance and computational cost.
4.2. Results of Dilated Convolution Upsampling
As depicted in
Figure 9, the qualitative comparison reveals that while the multi-head attention mechanism effectively performs upsampling, the model struggles to distinguish between closely parked vehicles due to their similar spacing. This limitation results in elongated shapes in the visualized semantic map. However, when dilated convolution is employed for subsequent upsampling, the model accurately identifies individual vehicle positions, leading to a more precise grid-like segmentation result. Moreover, the vehicle boundaries in the segmentation output are sharper, and the distinction between vehicles is significantly improved. By introducing a dilation rate, dilated convolution effectively expands the receptive field, enabling the capture of detailed features and contextual information over a wider range. The results in
Figure 9b demonstrate that dilated convolution can more clearly delineate vehicle boundaries, mitigating the issue of mis-segmentation caused by similar inter-vehicle spacing. Furthermore, the integration of transposed convolution for upsampling ensures that the feature maps retain high resolution, preventing feature blurring caused by detail loss.
Table 10 presents the ablation study conducted to compare the effects of incorporating dilated convolution in Stage 1 versus Stage 2. The results clearly demonstrate that at low resolutions, where each pixel carries significant information, employing dilated convolution with a larger receptive field for upsampling is not optimal.
4.3. Results of Attention Skip Connection Layer
Figure 10 illustrates the impact of incorporating attention skip connection layers. It is evident that these layers effectively suppress low-confidence noise while emphasizing reliable features, such as cars and motorcycles. By assigning different weights to features of varying dimensions, the model can learn to prioritize crucial information. This capability holds significant potential for BEV map generation, where precise localization is essential.
4.4. Overall Performance
Figure 11 presents the training loss curves for both CVT and our proposed model.
Figure 12 illustrates the impact of filtering low-confidence predictions, with purple representing CVT and blue representing our model.
Table 11 provides a detailed IoU analysis. A qualitative comparison between the two models is shown in
Figure 13. Finally,
Figure 14 visualizes the predicted BEV semantic maps (including roads and vehicles) generated by our model.
5. Discussion
Our model demonstrates superior performance in the initial training phase, achieving lower loss values for both training loss and visible loss, indicating faster convergence. While the loss values stabilize after approximately 1000 training steps, suggesting that CVT also learns complex features eventually, our model consistently maintains a lower average error. This implies superior training stability and accuracy compared to the CVT baseline. The IoU evaluation highlights the higher accuracy of our model, particularly with a 50% IoU threshold where the performance gap reaches 1%. This signifies better generalization across varying IoU conditions and diverse object classes (vehicles, buses, bicycles) present in the dataset.
Qualitative analysis reinforces the quantitative findings, showcasing the model’s exceptional performance in complex traffic scenarios. Our approach effectively distinguishes between vehicles, even large ones, and accurately identifies them. Notably, the BEV semantic maps reconstructed by the multi-head attention mechanism exhibit remarkable environmental adaptability, even in challenging conditions like dim lighting and rainy weather. This robustness stems from the ability to capture subtle features, ensuring high-quality semantic segmentation even with poor illumination or adverse weather. Each attention head focuses on different aspects of vehicle features (shape, size, orientation), contributing to a comprehensive understanding of the scene. This global context awareness enhances semantic comprehension and enables efficient segmentation in crowded and dynamic traffic environments.
The visualization of road predictions overlaid on vehicle BEV semantic maps simulates real-world applications, showcasing the model’s real-time localization and mapping capabilities. These results emphasize the refined decoder’s ability to generate accurate road predictions and vehicle localization across diverse scenarios, particularly in low-light conditions like nighttime and rainy weather, demonstrating robustness and generalization. The hybrid architecture combining multi-head attention and convolutional networks effectively addresses the limitations of each individual approach, balancing the processing of long-range and local spatial relationships. This design enhances sensitivity to detailed features without compromising overall performance, further solidifying the model’s robustness and generalization across different environments.
It is worth noting that in our system, the cameras are mounted in fixed positions on the vehicle and thus remain stationary with respect to it during image capture. As a result, we can infer the directions and heights of objects in the images relative to the vehicle based on the vehicle’s position. However, if the cameras are not precisely calibrated, the inferred object directions and heights may deviate slightly from their actual values, potentially reducing the accuracy of the detection results. Consequently, most current related research requires precise camera calibration to ensure data accuracy and optimize model performance.
Finally, two potential approaches can be adopted to reduce the computational resources required by our multi-head attention mechanism and refined upsampling technique for resource-constrained platforms. The first is to use a sliding window to shorten the lengths of computation sequences, thereby lowering computational overhead and also enabling the model to concentrate more on global information. The other one is to utilize a sparse attention mechanism that calculates only a subset of the attention matrix while preserving the global information, thereby reducing computational costs as well. The above two approaches will be incorporated into the enhanced version of our algorithm in the future.
6. Conclusions
This paper proposes a novel approach for enhancing cross-view Bird’s-Eye View (BEV) semantic map prediction by leveraging a multi-head attention mechanism. The meticulously designed decoder employs this mechanism to address the limitations of conventional Convolutional Neural Networks (CNNs) in capturing global contextual information, effectively facilitating feature fusion across a wide field of view. This mechanism demonstrates robust perception capabilities, even in the presence of occlusions or distant vehicles. Furthermore, we demonstrate that the hybrid architecture, combining multi-head attention and CNNs, can seamlessly transition to a fully convolutional architecture at lower spatial resolutions.
To enhance the precision of segmentation results and mitigate the issue of excessive adjacency between target objects, this study introduces a refined upsampling mechanism, enabling the model to achieve more accurate object prediction and reconstruction. Experimental results indicate that the model exhibits superior performance when utilizing the multi-head attention mechanism within the decoder to generate BEV semantic maps. It effectively leverages global spatial information for multi-view feature fusion while maintaining spatial consistency. By integrating transposed convolution techniques, this method effectively avoids information loss and distortion, demonstrating strong adaptability to diverse environments.
To facilitate real-world deployment in autonomous driving systems, data acquired from other sensors can be incorporated into the predicted BEV maps. Through multi-modal fusion, both accuracy and real-time perception capabilities can be further augmented to effectively address unforeseen circumstances. However, it is important to note that this method necessitates substantial computational resources for real-time analysis. Consequently, its practical applications may be more readily realized in scenarios such as traffic signal control systems, where such technology can be more effectively leveraged.