1. Introduction
While GPS has been in use for years, its accuracy and sensitivity to environmental conditions remain problematic, hindering the implementation of autonomous driving systems in real-world scenarios. To compensate for these instabilities, utilizing multiple sensors to complement each other’s shortcomings becomes crucial. For instance, IMUs collect inertial data from the vehicle, LiDAR acquires 3D environmental information, cameras record the road and surrounding objects, and radar and ultrasonic sensors detect distances. This wealth of sensory data collaborates to overcome the limitations of any single sensor, generating high-precision maps for use in unmanned vehicles.
However, the traditional Multi-Stage Discrete Pipeline of Tasks architecture, where each stage operates independently, often leads to information loss or distortion between stages, increasing computational cost and latency. Early Sensor Fusion, a more advanced approach, fuses raw data from different sensors at the data level, leveraging their characteristics and advantages. This enables the system to learn collectively from multiple sensor data sources. PointFusion [
1] highlights the advantages of early sensor fusion in enhancing overall perception performance, while the Multi-View 3D (MV3D) network [2], which explores multi-view 3D object detection, further confirms its positive impact on autonomous driving systems.
Despite its advantages, implementing early sensor fusion faces challenges like sensor calibration, data synchronization, and complex fusion algorithms. To overcome these, using surround RGB monocular cameras for Bird’s-Eye View (BEV) prediction offers a viable alternative. This approach integrates visual data from surround cameras, providing rich spatial and semantic information without relying on multiple sensors, thus simplifying the perception system architecture and improving real-time performance and responsiveness.
Our method integrates camera parameters with image features extracted through a CNN (Convolutional Neural Network) to generate mapping features projected from the image plane to the BEV map. To reconstruct the final semantic map, we improve the decoder architecture by enhancing embedded features with a multi-head attention mechanism and refining dimensional elevation through transposed convolution. Additionally, we incorporate dilated convolution to expand the receptive field, effectively capturing more detailed features and contextual information and further enhancing the model’s performance.
2. Related Work
According to the U.S. Department of Transportation, 94% of vehicle crashes are caused by driver behaviour [
3], highlighting the potential of autonomous vehicles to enhance road safety. While autonomous driving systems may vary, they fundamentally comprise perception, localization, and mapping. The 3D sensor-based joint perception systems, though effective, increase hardware costs, necessitate precise calibration and synchronization, and demand high-performance computing for data alignment and fusion, as shown in CLR-BNN [
4]. However, in most scenarios, localization and mapping do not heavily rely on height information.
In autonomous driving, accurate environmental perception is key. Traditionally, LiDAR-based sensors, generating high-precision 3D point cloud data, have been widely used. These data are further segmented into voxels for identification, as in References [
5,
6,
7]. Image-based detection methods, utilizing depth and RGB cameras to generate stereo images and 3D candidate bounding boxes for object identification via deep learning algorithms like Fast RCNN [
8], have also been explored [
9,
10]. However, these methods, often focusing on a single viewpoint, have limitations in creating complete maps. The study “Vehicle Detection from 3D LiDAR Using Fully Convolutional Network” [
11] addressed these challenges by converting LiDAR data into a BEV representation and processing it with an FCN, leveraging its advantages in image processing and feature extraction. This approach effectively handles sparse 3D data while preserving spatial information on a 2D plane.
In studies of scene reconstruction and perception, conventional methods usually rely on binocular stereo images for matching. However, “A stereo matching algorithm with an adaptive window: Theory and experiment” [12] pointed out that such methods lack the fusion capability of multiple sensors (e.g., LiDAR or RGB-D cameras), leaving them with insufficient features for resolving pixel correspondences; the motion of dynamic objects further degrades matching accuracy. By relying on a single passive sensor, these approaches also forgo the performance gains that multi-sensor fusion could provide. On the other hand, Reference [13] noted the inherent limitation of monocular cameras in depth estimation: they do not provide depth information directly, and for distant objects the limited pixel resolution tends to increase depth estimation error. That study also focuses on vehicle detection and localization, so its scope is limited and it is less applicable to other targets such as pedestrians and motorcycles. For multi-view scenes, “A computer algorithm for reconstructing a scene from two projections” [14] emphasized the limitations of reconstruction methods that rely on only two viewpoints, which cannot effectively exploit multi-view information to improve the completeness and accuracy of the reconstruction. Such methods are restricted in large-scale scenes or applications requiring omnidirectional reconstruction; moreover, a single-sensor approach lacks sufficient depth cues to model scene geometry and has limited ability to handle dynamic objects. Among image-projection-based studies, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3D” [15] proposed lifting multi-view 2D images into 3D space for autonomous driving perception, but it relies solely on 2D feature maps without incorporating additional depth-aware data. This design limits the accuracy of depth estimation and distance inference and is not strong enough to recover fine scene detail. “Structure-from-motion revisited” [16], which relies mainly on multi-view feature images for scene reconstruction, degrades severely in scenes with heavy occlusion or sparse feature points because it lacks complementary sensor data; it is also highly dependent on high-quality feature matching, whose stability drops sharply when scene texture is sparse or illumination changes. Although monocular cameras with multiple viewing angles can improve reconstruction to some extent, “FIERY: Future instance prediction in Bird’s-Eye View from surround monocular cameras” [17] points out that the depth perception of monocular cameras remains limited, especially for distant objects or scenes with unclear geometric features. This limitation easily leads to depth ambiguity or inaccurate object localization in BEV (Bird’s-Eye View) representations, further affecting the performance of downstream tasks.
FIERY [
17] addresses this by using surround RGB monocular cameras to estimate top-down map predictions, achieving BEV grid generation without auxiliary systems. It models future randomness end-to-end solely through camera data, creating a camera-based perception and prediction system that is more compact, cheaper, and has higher resolution than LiDAR sensing. Image cameras are cost-effective sensors for generating detailed environmental views, accurately segmenting scenes into meaningful regions and objects. Ordinary semantic segmentation results, being 2D, lack 3D spatial information crucial for autonomous driving tasks like object localization and obstacle avoidance. To address this, 2.5D projection transforms 2D semantic information into a BEV semantic map incorporating depth and spatial layout. Using surround RGB monocular cameras instead of multi-sensor systems reduces hardware costs, system complexity, and calibration/synchronization issues. It also decreases computational latency, improving real-time performance crucial for responsiveness in dynamic environments. Surround cameras provide rich visual information, including colour, texture, and depth cues, aiding accurate object identification and segmentation. Through deep learning, they effectively integrate semantic and spatial information, generating precise BEV maps vital for autonomous navigation and path planning.
This manuscript utilizes surround RGB monocular cameras to extract image features through a CNN, which are then fused with camera parameters to generate mapping features projected from the image plane to the BEV map. This replaces explicit geometry-based projection and avoids rigidly constrained learning. Additionally, we improve the decoder architecture: to reconstruct the final semantic map, the upsampling process is redesigned, enhancing embedded features through a multi-head attention mechanism and employing transposed convolution for refined dimensional elevation. This approach demonstrates robust performance, even under conditions of occlusion or insufficient lighting.
3. Proposed Overall System Architecture
In the proposed system, the cross-view attention mechanism from the CVT (Cross-View Transformer) architecture is leveraged to extract multi-scale image features from a set of six camera images. Particularly, such a mechanism aggregates image features and spatial information to generate spatially-aware feature embeddings. Then, the decoding process with several improvements for BEV semantic map generation is redesigned, resulting in a final BEV semantic map with a resolution of 200 × 200. In the following,
Table 1 and
Table 2 show the list of abbreviations and quantifiers used in our method and their corresponding meanings, respectively.
3.1. nuScenes Dataset
The nuScenes dataset [
18], released by Motional, a company dedicated to advancing autonomous vehicle technology, provides a valuable resource for research in the field. This large-scale dataset captures 15 h of driving data collected in the bustling cities of Boston and Singapore, encompassing diverse locations, times of day, and weather conditions. A key advantage of nuScenes over other datasets, such as KITTI [
19], is the inclusion of millimeter-wave radar sensors (RADAR, Radio Detection and Ranging) in its data acquisition setup. This addition enables more robust perception capabilities for autonomous driving systems, particularly in adverse weather conditions where other sensors may be less reliable. nuScenes offers a comprehensive suite of sensor data, and this rich sensor data, combined with high-quality manual annotations, forms a crucial foundation for developing and evaluating autonomous driving algorithms.
Table 3 provides a summary of the data acquisition configuration.
This study utilizes the nuScenes dataset [
18], a comprehensive resource for autonomous driving research released by Motional. The dataset provides a rich collection of sensor data and annotations, enabling the development and evaluation of robust perception algorithms.
Table 4 outlines key terms and concepts within the nuScenes dataset. To facilitate data extraction and processing, this research leverages the nuScenes API to access information within each scene. These extracted data are then used to train the model to learn the mapping and attention mechanisms across different camera views.
3.2. Cross-View BEV Semantic Map Prediction Using a Multi-Head Attention Mechanism
In this task, we leverage the cross-view attention mechanism from the Cross-View Transformer (CVT) architecture [20] to extract multi-scale image features from a set of N = 6 camera images that together cover a full 360-degree view around the vehicle. Each image is processed by a shared convolutional encoder (we utilize a pre-trained EfficientNet [21] in our implementation), using known camera extrinsic and intrinsic parameters. The camera extrinsic parameters include the rotation matrix $R_k$ and the translation vector $t_k$, both expressed relative to the ego vehicle’s coordinate system. The cross-view attention mechanism aggregates the image features and camera parameters into spatially-aware feature embeddings.
To accurately represent the direction and position of camera-captured images in the BEV space, precise localization of the camera’s coordinates relative to the center of the BEV segmentation map, as well as its orientation and intrinsic parameters, is required. The nuScenes dataset provides camera extrinsic parameters, using $R_k$ and $t_k$ to describe the transformation from each camera’s coordinates to the global 2D coordinates $(x, y)$. Following prior work [17,20], we designate the location of the Front Radar as the center of the BEV segmentation map, transforming each camera’s world coordinates into the global 2D coordinates through computations involving $R_k$ and $t_k$. Additionally, the nuScenes dataset provides intrinsic parameters, such as focal length, principal point, and distortion coefficients, which represent the relative positional relationships of objects in the images. By accurately computing both the extrinsic and intrinsic parameters, our BEV segmentation map achieves more precise localization of the vehicle’s position and orientation in the camera-captured images.
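For illustration, the sketch below (not part of the original implementation) maps a camera's extrinsics into BEV grid indices, assuming $R_k$ and $t_k$ transform camera-frame points into the ego frame and that the 200 × 200 map at 0.5 m per cell is centred on the BEV origin; the helper name, the example extrinsic values, and the row/column orientation are illustrative assumptions.

```python
import numpy as np

def ego_to_bev_index(x, y, map_size=200, resolution=0.5):
    """Convert metric ego-frame (x, y) coordinates to BEV grid indices.

    Hypothetical helper: the map is map_size x map_size cells at `resolution`
    metres per cell, centred on the BEV origin; the row/column orientation
    is an assumed convention.
    """
    col = int(round(map_size / 2 + x / resolution))
    row = int(round(map_size / 2 - y / resolution))
    return row, col

# Extrinsics R_k, t_k map a camera-frame point p_cam into the ego frame.
R_k = np.eye(3)                      # illustrative values only
t_k = np.array([1.5, 0.0, 1.6])
p_cam = np.array([0.0, 0.0, 0.0])    # the camera centre in its own frame
p_ego = R_k @ p_cam + t_k            # camera centre expressed in the ego frame
print(ego_to_bev_index(p_ego[0], p_ego[1]))   # (100, 103)
```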
The cross-view attention mechanism aggregates image features and spatial information to generate spatially-aware feature embeddings. We then redesign the decoding process with several improvements for BEV semantic map generation. Positional encoding is applied to the feature embeddings to provide valuable spatial information to the multi-head attention module. After processing with attention heads, the feature maps undergo an initial upsampling step using standard transposed convolution. This design choice ensures that the importance of each pixel is preserved at lower resolutions. Subsequently, we perform two consecutive dilated convolution upsampling operations (dilation rate = 2), effectively upsampling with a larger receptive field and increasing the original input feature resolution by a factor of 4. To mitigate potential aliasing effects and sparsity introduced by the upsampling operations, the feature maps are processed by a regular convolutional layer to smooth the output and reduce noise and inconsistencies.
Through this process, we generate the final semantic segmentation prediction.
Figure 1 illustrates the complete architecture, detailing the core multi-head attention decoder mechanism and the refined upsampling steps.
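The individual decoder modules referenced above are illustrated with small code sketches in the subsections that follow; each sketch is a minimal, self-contained approximation rather than a reproduction of the released implementation.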
3.3. Multi-Head Attention Upsampling
In Stage 1 of
Figure 1, after obtaining the cross-view BEV feature map, we treat it as a two-dimensional spatial problem. Initially, we apply positional encoding to each pixel, utilizing sine and cosine functions to define their order, which preserves spatial relationships and positional information. For this reason, we do not employ local attention mechanisms, despite their smaller parameter count. Equation (1) gives the positional encoding for even indices and Equation (2) for odd indices, where $pos$ is the position index, $i$ is the dimension index ranging from $0$ to $d/2 - 1$, and $d$ is the input dimension:

$$PE(pos, 2i) = \sin\!\left(\frac{pos}{10000^{2i/d}}\right) \tag{1}$$

$$PE(pos, 2i+1) = \cos\!\left(\frac{pos}{10000^{2i/d}}\right) \tag{2}$$
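A minimal sketch of this sinusoidal encoding is given below; the flattened 25 × 25 grid size is an assumption, while the channel dimension of 128 follows the setting reported later.

```python
import torch

def sinusoidal_positional_encoding(num_positions: int, d: int) -> torch.Tensor:
    """Standard sinusoidal encoding: sine on even indices, cosine on odd indices."""
    pos = torch.arange(num_positions, dtype=torch.float32).unsqueeze(1)      # (P, 1)
    i = torch.arange(d // 2, dtype=torch.float32)                            # (d/2,)
    freq = torch.exp(-torch.log(torch.tensor(10000.0)) * 2 * i / d)          # 1 / 10000^(2i/d)
    pe = torch.zeros(num_positions, d)
    pe[:, 0::2] = torch.sin(pos * freq)   # Equation (1): even dimensions
    pe[:, 1::2] = torch.cos(pos * freq)   # Equation (2): odd dimensions
    return pe

# Example: encode a flattened 25 x 25 BEV feature grid (an assumed resolution)
# with d = 128 channels, matching the decoder embedding dimension used later.
pe = sinusoidal_positional_encoding(25 * 25, 128)
print(pe.shape)   # torch.Size([625, 128])
```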
The Multi-head Self-Attention Upsampling Block (MSAU Block) consists of two main sub-modules: the Multi-head Self-Attention Block (MSA Block) and the TransposeConv Block, as shown in
Figure 2. After positional encoding, the encoded features are concatenated with the original feature maps and fed into the MSA block. A normalization (Norm) layer then standardizes the MSA block’s output, stabilizing training and helping the model converge to a good solution faster, before the result is passed to a Multilayer Perceptron (MLP) block for further processing. The multi-head attention mechanism captures the dependencies between different positions in the input sequence, and the MLP block applies a non-linear transformation to this captured information to enhance the model’s expressive power. This mechanism is mainly based on the Transformer Encoder architecture, as shown in
Figure 2. This architecture provides more explicit spatial information than using multi-head attention alone. We use a single Transformer Encoder, assuming that the features processed by the encoder have been initially converted into BEV map features. This approach avoids significant computational overhead. The multi-head attention mechanism reduces the computational cost by splitting the input channel into M heads. This mechanism effectively captures information from partially occluded, low-confidence objects and focuses attention on different objects, which enhances interpretability and contributes to the overall model understanding and convergence.
In this module, we employ a multi-head self-attention mechanism with M = 4 attention heads. In the multi-head attention mechanism, the number of heads M is a tunable hyperparameter. As M increases, each head can focus more on learning a distinct feature of the data, enabling the model to better capture the diversity of the data. However, if M becomes too large, the attention of each head may become overly divergent, and the dimensionality of the feature space available to each head decreases. This reduces the model’s ability to represent the global characteristics of the data effectively. Accordingly, the settings M = 4 and M = 6 were evaluated in our experiments. The results show that M = 4 yields better performance: this setting allows the attention mechanism to capture sufficient global contextual information while avoiding a decline in the ability to represent global features.
Each attention head independently calculates the linear projections of the query $Q$, key $K$, and value $V$. For the input feature map $X$, we first map it to the feature spaces of $Q$, $K$, and $V$ through linear transformations. The linear transformation matrices are defined as in Equation (3), where $d_k$ and $d_v$ represent the feature dimensions of $K$ and $V$, respectively, for each attention head:

$$W_i^{Q} \in \mathbb{R}^{D \times d_k}, \quad W_i^{K} \in \mathbb{R}^{D \times d_k}, \quad W_i^{V} \in \mathbb{R}^{D \times d_v} \tag{3}$$

This yields Equation (4):

$$Q_i = X W_i^{Q}, \quad K_i = X W_i^{K}, \quad V_i = X W_i^{V} \tag{4}$$

There are $M$ sets of $(Q_i, K_i, V_i)$, where $i$ denotes the $i$-th attention head. For each head, we calculate the self-attention weights, as expressed in Equation (5):

$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i \tag{5}$$
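The following is a minimal PyTorch sketch of Equations (3)–(5) with M = 4 heads and an embedding dimension of 128; the output projection, the flattened sequence length, and the per-head dimensions are illustrative assumptions, and the Norm and MLP layers of the full MSA block are omitted.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention over flattened BEV features (Equations (3)-(5))."""

    def __init__(self, d_model: int = 128, num_heads: int = 4):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)   # W^Q, Equation (3)
        self.w_k = nn.Linear(d_model, d_model)   # W^K
        self.w_v = nn.Linear(d_model, d_model)   # W^V
        self.w_o = nn.Linear(d_model, d_model)   # output projection after concatenating the heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        # Equation (4): linear projections, split into M heads.
        q = self.w_q(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        k = self.w_k(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        v = self.w_v(x).view(b, n, self.h, self.d_k).transpose(1, 2)
        # Equation (5): scaled dot-product attention per head.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_k ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.w_o(out)

# Example: 625 flattened BEV positions (25 x 25, an assumption) with 128 channels.
x = torch.randn(2, 625, 128)
print(MultiHeadSelfAttention()(x).shape)   # torch.Size([2, 625, 128])
```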
Transposed convolution, also known as deconvolution or upsampling convolution, is a variant of the convolution operation in Convolutional Neural Networks (CNNs). It is primarily used to increase the spatial resolution of feature maps (i.e., upsample the feature maps). The goal of transposed convolution is to recover the spatial information lost during the convolution operation, generating a larger feature map from a smaller one. However, as convolution is inherently irreversible, transposed convolution learns to infer the lost information. The basic principle is to reverse the forward pass of a standard convolution. While standard convolution uses a kernel to slide over and compute the weighted sum of the input, producing a smaller feature map, transposed convolution performs this process in reverse, generating a larger feature map from a smaller one.
For a stride of $s$, a padding of $p$, and a kernel size of $k$, this operation involves inserting $s-1$ zeros between each pair of input pixels and padding the feature map with $k-p-1$ rows and columns of zeros to expand its size. The kernel matrix is then flipped vertically and horizontally. Finally, a standard convolution operation is applied to the interpolated feature map using the flipped kernel, producing the upsampled output. This method allows for precise control over the upsampling factor, and the kernel parameters are learned during training, so the network can automatically adjust them to suit specific task requirements.
Figure 3 illustrates an example of this procedure.
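The sketch below verifies this equivalence numerically for hypothetical values of the stride, padding, and kernel size (the specific values shown in Figure 3 are not reproduced here), and confirms the output-size formula $(n-1)s - 2p + k$.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, k, s, p = 4, 3, 2, 1                      # input size, kernel, stride, padding (hypothetical values)
x = torch.randn(1, 1, n, n)
w = torch.randn(1, 1, k, k)                  # conv_transpose2d weight layout: (in_ch, out_ch, k, k)

# Reference: PyTorch's built-in transposed convolution.
ref = F.conv_transpose2d(x, w, stride=s, padding=p)

# Manual equivalent: insert s-1 zeros between pixels, pad with k-p-1 zeros,
# then convolve with the flipped kernel.
up = torch.zeros(1, 1, (n - 1) * s + 1, (n - 1) * s + 1)
up[..., ::s, ::s] = x
up = F.pad(up, [k - 1 - p] * 4)
man = F.conv2d(up, w.transpose(0, 1).flip(-1, -2))

print(ref.shape)                             # (1, 1, (n-1)*s - 2p + k, ...) -> (1, 1, 7, 7)
print(torch.allclose(ref, man, atol=1e-6))   # True
```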
Deep neural networks often employ pooling or downsampling operations, which can lead to the loss of spatial details in feature maps. This loss is particularly pronounced after the multi-head attention module, potentially hindering the accurate distinction between adjacent objects. While nearest-neighbor and bilinear interpolation methods are insufficient to address this issue, a learned transposed convolution module is well-suited for this task. By training the transposed convolution module in conjunction with the multi-head attention module, the entire process is jointly optimized, resulting in more precise and detailed BEV semantic maps.
During feature upsampling, we employ transposed convolution with a kernel size of 4 and a stride of 2. This configuration doubles the spatial dimensions of the input feature map, allowing us to gradually recover spatial resolution without sacrificing crucial details. Subsequently, a convolutional layer with a small kernel performs further feature extraction on the upsampled feature maps, capturing fine-grained local features while maintaining computational efficiency. Finally, a pointwise convolution applies a linear transformation to the feature vector at each pixel, adjusting the channel dimension to match the output requirements without altering the spatial dimensions. The overall network architecture is illustrated in
Figure 4.
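A minimal sketch of this upsampling tail is shown below; the padding of 1 (so that the resolution is exactly doubled), the 3 × 3 kernel of the intermediate convolution, the channel widths, and the ReLU activations are assumptions rather than values reported in the paper.

```python
import torch
import torch.nn as nn

class UpsampleTail(nn.Module):
    """Sketch of the MSAU block's upsampling tail (channel sizes and activations are assumptions)."""

    def __init__(self, in_ch: int = 128, mid_ch: int = 128, out_ch: int = 64):
        super().__init__()
        # Kernel 4, stride 2 as in the paper; padding 1 is assumed so the
        # spatial size is exactly doubled: (n - 1) * 2 - 2 * 1 + 4 = 2n.
        self.up = nn.ConvTranspose2d(in_ch, mid_ch, kernel_size=4, stride=2, padding=1)
        self.conv = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1)   # local feature extraction
        self.point = nn.Conv2d(mid_ch, out_ch, kernel_size=1)             # pointwise channel adjustment
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.act(self.up(x))
        x = self.act(self.conv(x))
        return self.point(x)

x = torch.randn(1, 128, 25, 25)          # assumed decoder input resolution
print(UpsampleTail()(x).shape)           # torch.Size([1, 64, 50, 50])
```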
3.4. Dilated Convolution Upsampling
Dilated convolution, also known as atrous convolution, is a specialized convolution operation used in Convolutional Neural Networks (CNNs) to increase the receptive field of the convolutional kernel without increasing the computational cost. This is particularly useful for tasks that require capturing features over a larger spatial extent, such as image semantic segmentation and object detection. The key idea behind dilated convolution is to expand the receptive field by inserting spaces or “dilations” between the elements of the convolutional kernel, as illustrated in
Figure 5. The dashed lines in the figure represent the boundaries of the receptive field, visually highlighting how the kernel skips over certain pixels in the input feature map to capture information from more distant locations. This approach allows the kernel to “see” a wider range of the input feature map without increasing the number of parameters or computational complexity.
Dilated convolution can be described as introducing a dilation rate $r$ into the standard convolution operation. This dilation rate determines the spacing between the elements of the kernel. Let $x(i, j)$ be the value at position $(i, j)$ in the input feature map, $y(i, j)$ the value at position $(i, j)$ in the output feature map, $k$ the size of the kernel, and $w(m, n)$ the weight at position $(m, n)$ in the kernel. With a dilation rate of $r$, adjacent kernel elements are applied to input pixels that are $r$ positions apart (i.e., $r-1$ gaps are inserted between them). This leads to Equation (6):

$$y(i, j) = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} x(i + r \cdot m,\; j + r \cdot n)\, w(m, n) \tag{6}$$
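The snippet below evaluates Equation (6) directly with nested loops and checks it against PyTorch's built-in dilated convolution; the tensor sizes are arbitrary illustrative values.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
r, k = 2, 3                                   # dilation rate and kernel size
x = torch.randn(1, 1, 8, 8)
w = torch.randn(1, 1, k, k)

# Reference: built-in dilated convolution (no padding).
ref = F.conv2d(x, w, dilation=r)

# Manual evaluation of Equation (6): y(i, j) = sum_{m,n} x(i + r*m, j + r*n) * w(m, n).
out = torch.zeros_like(ref)
for i in range(ref.shape[2]):
    for j in range(ref.shape[3]):
        for m in range(k):
            for n in range(k):
                out[0, 0, i, j] += x[0, 0, i + r * m, j + r * n] * w[0, 0, m, n]

print(torch.allclose(ref, out, atol=1e-5))    # True
```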
In Stage 2, we stack multiple Dilated Convolution Upsampling Blocks (DCU Blocks). After the initial upsampling in Stage 1, we observe that the objects of interest in this experiment are relatively sparse, while the background information is rich and diverse. To address this, we employ a strategy that combines dilated convolutional networks with transposed convolutional networks. Dilated convolutions introduce a dilation rate, effectively expanding the receptive field without increasing the number of parameters by inserting spaces between the elements of the convolution kernel. This is particularly beneficial for handling sparse positive samples, as it allows the network to capture features over a larger area, enhancing generalization and mitigating overfitting.
The overall design of the DCU Block is similar to the transposed convolution module in Figure 4, with the key difference being the replacement of the regular convolutional layer with a dilated convolution layer (dilation rate = 2), as shown in Figure 6. This introduces a spacing of one element between adjacent kernel elements, expanding the effective coverage of the 3 × 3 kernel to 5 × 5 while maintaining the same number of parameters. To preserve the spatial dimensions of the feature maps, we apply padding of 2, which ensures that the output of the dilated convolution has the same dimensions as the input, retaining more spatial information. Due to the map configuration, we stack two DCU Blocks. After the two dilated convolution upsampling operations, the feature map is further upsampled by a factor of 4, achieving the desired resolution.
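A minimal sketch of one DCU block under these settings is shown below; the channel widths, the ReLU activations, and the kernel-4/stride-2/padding-1 transposed convolution are assumptions carried over from the earlier sketch.

```python
import torch
import torch.nn as nn

class DCUBlock(nn.Module):
    """Sketch of a Dilated Convolution Upsampling block (channel sizes are assumptions)."""

    def __init__(self, in_ch: int = 64, out_ch: int = 64):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)
        # The regular convolution is replaced by a dilated convolution (dilation rate = 2);
        # padding = 2 keeps the spatial dimensions unchanged.
        self.dil = nn.Conv2d(out_ch, out_ch, kernel_size=3, dilation=2, padding=2)
        self.point = nn.Conv2d(out_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.point(self.act(self.dil(self.act(self.up(x)))))

# Two stacked DCU blocks upsample Stage 1's output by a further factor of 4.
x = torch.randn(1, 64, 50, 50)
stage2 = nn.Sequential(DCUBlock(), DCUBlock())
print(stage2(x).shape)   # torch.Size([1, 64, 200, 200])
```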
3.5. Attention Skip Connection Layer
In the skip connection pathway, we introduce an Attention Skip Connection layer, inspired by attention mechanisms, to enhance the reliability of feature fusion. This design combines the strengths of both attention mechanisms and skip connections, enabling the model to dynamically adjust channel weights based on the input feature map. By emphasizing important features and suppressing irrelevant ones, this approach improves the model’s expressive capacity while preserving spatial information. To compute attention weights for the feature maps, we employ a multi-layer convolutional operation. This attention mechanism aims to adaptively adjust the weights of different locations within the feature maps, effectively enhancing the model’s focus on crucial features.
The network architecture is illustrated in
Figure 7. First, a 1 × 1 pointwise convolution reduces the channel dimension of the input feature map to a fraction of its original size. This operation performs a linear combination of channels without altering the spatial dimensions, reducing computational overhead. Batch normalization is then applied to standardize the output of the convolutional layer, accelerating training and stabilizing the numerical distribution of the feature maps. A ReLU activation function introduces non-linearity, enhancing the model’s expressive power and mitigating the vanishing gradient problem. Finally, another 1 × 1 pointwise convolution transforms the feature map into a single-channel attention map, whose weights are normalized to the range [0, 1] by a Sigmoid activation function. This output serves as a weighted mask for the feature map, emphasizing important features.
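The sketch below reflects this layer as described; the channel-reduction ratio and the elementwise application of the mask to the skip features are assumptions consistent with the description, not values taken from the paper.

```python
import torch
import torch.nn as nn

class AttentionSkip(nn.Module):
    """Sketch of the Attention Skip Connection layer (the reduction ratio is an assumption)."""

    def __init__(self, channels: int = 64, reduction: int = 4):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),   # pointwise channel reduction
            nn.BatchNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, kernel_size=1),          # single-channel attention map
            nn.Sigmoid(),                                                # weights in [0, 1]
        )

    def forward(self, skip: torch.Tensor) -> torch.Tensor:
        return skip * self.attn(skip)   # re-weight the skip features before fusion

x = torch.randn(1, 64, 100, 100)
print(AttentionSkip()(x).shape)   # torch.Size([1, 64, 100, 100])
```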
3.6. Experimental Setup
The input consists of N = 6 surround-view RGB images. After encoder downsampling, the resulting feature embeddings are fed to the decoder. We utilize a multi-head attention configuration with M = 4 heads and a key embedding dimension of D = 128. Both the Multi-Head Attention Upsampling (MSAU) block and the Dilated Convolution Upsampling (DCU) block increase the resolution by a factor of 2, resulting in a final BEV semantic map with a resolution of 200 × 200.
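For reference, the implied upsampling arithmetic is checked below, assuming an initial decoder resolution of 25 × 25 (an inference from the overall ×8 upsampling to the 200 × 200 map, not a value stated in the text).

```python
# Upsampling-factor check: one MSAU block and two DCU blocks, each x2.
initial = 25                          # assumed decoder input resolution
after_stage1 = initial * 2            # MSAU block -> 50
after_stage2 = after_stage1 * 2 * 2   # two DCU blocks -> 200
print(after_stage2)                   # 200
```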
The model is implemented in PyTorch 1.11.0 and trained on an RTX 3090 GPU with a batch size of 4. We employ Focal Loss [
18] as the loss function, with settings outlined in
Table 5. Optimization is performed using the AdamW optimizer [
22] and a one-cycle learning rate scheduler [
23], with configurations detailed in
Table 6 and
Table 7. The total training time is 7 h.
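A minimal training-setup sketch with these components is given below; the model stand-in, learning rate, weight decay, schedule lengths, and focal-loss parameters are placeholders standing in for the actual settings listed in Tables 5–7.

```python
import torch
import torch.nn as nn
import torchvision

model = nn.Conv2d(64, 1, kernel_size=1)   # stand-in for the full network
steps_per_epoch, epochs = 1000, 30        # hypothetical values; see Tables 6 and 7 for the actual settings

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-3, steps_per_epoch=steps_per_epoch, epochs=epochs)
# scheduler.step() is called after every optimizer step during training.

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss on the BEV occupancy logits (alpha/gamma are placeholder values)."""
    return torchvision.ops.sigmoid_focal_loss(logits, targets, alpha=alpha, gamma=gamma, reduction="mean")
```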
4. Experimental Results
Following the framework established in reference [
24], we define the mapping region as a 100 m × 100 m area surrounding the vehicle, employing a sampling resolution of 0.5 m. This configuration yields a BEV semantic map with dimensions of 200 pixels × 200 pixels, consistent with the primary evaluation metric promoted and standardized by the Lift-Splat [
24] team.
To assess performance on the validation set, we employ the Intersection over Union (IoU) score as our primary metric. This metric quantifies the ratio between the intersection and union of the predicted segmentation output and the corresponding ground truth labels. Qualitative comparisons are also performed through visualization of the predicted labels, where pixel values are normalized to the range of [0, 1], with values approaching 1 represented as yellow. The primary benchmark for our experiments is the Cross-View Transformer (CVT) [
20] model, enabling a comprehensive evaluation of the advantages and improvements offered by our proposed approach.
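A minimal sketch of the IoU computation on a predicted BEV map is shown below; the 0.5 binarization threshold and the random example tensors are assumptions for illustration only.

```python
import torch

def bev_iou(pred: torch.Tensor, target: torch.Tensor, threshold: float = 0.5) -> float:
    """Intersection over Union between a predicted BEV probability map and the binary ground truth."""
    pred_bin = pred >= threshold
    target_bin = target.bool()
    intersection = (pred_bin & target_bin).sum().item()
    union = (pred_bin | target_bin).sum().item()
    return intersection / union if union > 0 else 1.0

pred = torch.rand(200, 200)                    # predicted probabilities in [0, 1]
target = (torch.rand(200, 200) > 0.9).float()  # illustrative binary ground truth
print(bev_iou(pred, target))
```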
4.1. Results of Multi-Head Attention Upsampling
As illustrated in
Figure 8, the qualitative comparison aligns with our initial expectations. In contrast to directly applying convolutional operations, the multi-head attention mechanism demonstrates superior spatial awareness, enabling the model to focus on distinct regions concurrently. Each attention head specializes in capturing different facets of features (e.g., shape, size, orientation), thereby enhancing the accuracy and robustness of object detection. The output of the multi-head attention module, which encapsulates relevant features and spatial relationships, serves as the input for the transposed convolution layer. This integration ensures that the spatial details and relationships learned by the attention mechanism are preserved and reinforced during the upsampling process. Consequently, this leads to more accurate and informative Bird’s-Eye View representations, as depicted in
Figure 8c. The multi-head attention module excels at capturing long-range dependencies and local details within the image, particularly for low-confidence, partially occluded objects. Conversely, the transposed convolution module focuses on recovering the spatial details lost due to repeated pooling or downsampling operations. The synergy between these two modules empowers the model to accurately comprehend global contextual information while meticulously recovering spatial details, significantly enhancing the quality of feature maps.
Furthermore, we investigated the effect of increasing the number of attention heads to 8, as shown in
Table 8. The results indicate that the model achieves optimal performance with M = 4 heads. This observation can be attributed to the dimensionality of the key embeddings (D = 128). When the number of heads increases to 8, the attention computation for each head potentially becomes excessively sparse. This sparsity may hinder the effective concentration of attention on crucial information, leading to a decline in attention quality and ultimately impacting the overall performance.
We also examine the impact of applying the multi-head attention mechanism before versus after upsampling. From
Table 9, it is clear that applying the multi-head attention mechanism at the lower resolution is the better choice, in terms of both IoU performance and computational cost.
4.2. Results of Dilated Convolution Upsampling
As depicted in
Figure 9, the qualitative comparison reveals that while the multi-head attention mechanism effectively performs upsampling, the model struggles to distinguish between closely parked vehicles due to their similar spacing. This limitation results in elongated shapes in the visualized semantic map. However, when dilated convolution is employed for subsequent upsampling, the model accurately identifies individual vehicle positions, leading to a more precise grid-like segmentation result. Moreover, the vehicle boundaries in the segmentation output are sharper, and the distinction between vehicles is significantly improved. By introducing a dilation rate, dilated convolution effectively expands the receptive field, enabling the capture of detailed features and contextual information over a wider range. The results in
Figure 9b demonstrate that dilated convolution can more clearly delineate vehicle boundaries, mitigating the issue of mis-segmentation caused by similar inter-vehicle spacing. Furthermore, the integration of transposed convolution for upsampling ensures that the feature maps retain high resolution, preventing feature blurring caused by detail loss.
Table 10 presents the ablation study conducted to compare the effects of incorporating dilated convolution in Stage 1 versus Stage 2. The results clearly demonstrate that at low resolutions, where each pixel carries significant information, employing dilated convolution with a larger receptive field for upsampling is not optimal.
4.3. Results of Attention Skip Connection Layer
Figure 10 illustrates the impact of incorporating attention skip connection layers. It is evident that these layers effectively suppress low-confidence noise while emphasizing reliable features, such as cars and motorcycles. By assigning different weights to features of varying dimensions, the model can learn to prioritize crucial information. This capability holds significant potential for BEV map generation, where precise localization is essential.
4.4. Overall Performance
Figure 11 presents the training loss curves for both CVT and our proposed model.
Figure 12 illustrates the impact of filtering low-confidence predictions, with purple representing CVT and blue representing our model.
Table 11 provides a detailed IoU analysis. A qualitative comparison between the two models is shown in
Figure 13. Finally,
Figure 14 visualizes the predicted BEV semantic maps (including roads and vehicles) generated by our model.
5. Discussion
Our model demonstrates superior performance in the initial training phase, achieving lower loss values for both training loss and visible loss, indicating faster convergence. While the loss values stabilize after approximately 1000 training steps, suggesting that CVT also learns complex features eventually, our model consistently maintains a lower average error. This implies superior training stability and accuracy compared to the CVT baseline. The IoU evaluation highlights the higher accuracy of our model, particularly with a 50% IoU threshold where the performance gap reaches 1%. This signifies better generalization across varying IoU conditions and diverse object classes (vehicles, buses, bicycles) present in the dataset.
Qualitative analysis reinforces the quantitative findings, showcasing the model’s exceptional performance in complex traffic scenarios. Our approach effectively distinguishes between vehicles, even large ones, and accurately identifies them. Notably, the BEV semantic maps reconstructed by the multi-head attention mechanism exhibit remarkable environmental adaptability, even in challenging conditions like dim lighting and rainy weather. This robustness stems from the ability to capture subtle features, ensuring high-quality semantic segmentation even with poor illumination or adverse weather. Each attention head focuses on different aspects of vehicle features (shape, size, orientation), contributing to a comprehensive understanding of the scene. This global context awareness enhances semantic comprehension and enables efficient segmentation in crowded and dynamic traffic environments.
The visualization of road predictions overlaid on vehicle BEV semantic maps simulates real-world applications, showcasing the model’s real-time localization and mapping capabilities. These results emphasize the refined decoder’s ability to generate accurate road predictions and vehicle localization across diverse scenarios, particularly in low-light conditions like nighttime and rainy weather, demonstrating robustness and generalization. The hybrid architecture combining multi-head attention and convolutional networks effectively addresses the limitations of each individual approach, balancing the processing of long-range and local spatial relationships. This design enhances sensitivity to detailed features without compromising overall performance, further solidifying the model’s robustness and generalization across different environments.
It is worth noting that in our system, the cameras are mounted in fixed positions on the vehicle and thus remain stationary with respect to it during image capture. As a result, we can infer the directions and heights of objects in the images relative to the vehicle based on the vehicle’s position. However, if the cameras are not precisely calibrated, the inferred object directions and heights may deviate slightly from their actual values, potentially reducing the accuracy of the detection results. Consequently, most current related research requires precise camera calibration to ensure data accuracy and optimize model performance.
Finally, two potential approaches can be adopted to reduce the computational resources required by our multi-head attention mechanism and refined upsampling technique for resource-constrained platforms. The first is to use a sliding window to shorten the lengths of computation sequences, thereby lowering computational overhead and also enabling the model to concentrate more on global information. The other one is to utilize a sparse attention mechanism that calculates only a subset of the attention matrix while preserving the global information, thereby reducing computational costs as well. The above two approaches will be incorporated into the enhanced version of our algorithm in the future.
6. Conclusions
This paper proposes a novel approach for enhancing cross-view Bird’s-Eye View (BEV) semantic map prediction by leveraging a multi-head attention mechanism. The meticulously designed decoder employs this mechanism to address the limitations of conventional Convolutional Neural Networks (CNNs) in capturing global contextual information, effectively facilitating feature fusion across a wide field of view. This mechanism demonstrates robust perception capabilities, even in the presence of occlusions or distant vehicles. Furthermore, we demonstrate that the hybrid architecture, combining multi-head attention and CNNs, can seamlessly transition to a fully convolutional architecture at lower spatial resolutions.
To enhance the precision of segmentation results and mitigate the issue of excessive adjacency between target objects, this study introduces a refined upsampling mechanism, enabling the model to achieve more accurate object prediction and reconstruction. Experimental results indicate that the model exhibits superior performance when utilizing the multi-head attention mechanism within the decoder to generate BEV semantic maps. It effectively leverages global spatial information for multi-view feature fusion while maintaining spatial consistency. By integrating transposed convolution techniques, this method effectively avoids information loss and distortion, demonstrating strong adaptability to diverse environments.
To facilitate real-world deployment in autonomous driving systems, data acquired from other sensors can be incorporated into the predicted BEV maps. Through multi-modal fusion, both accuracy and real-time perception capabilities can be further augmented to effectively address unforeseen circumstances. However, it is important to note that this method necessitates substantial computational resources for real-time analysis. Consequently, its practical applications may be more readily realized in scenarios such as traffic signal control systems, where such technology can be more effectively leveraged.