Abstract
Addressing the complex problem of 6D pose estimation from single RGB images is essential for robotics, augmented reality, and autonomous driving applications. The aim of this study is to overcome limitations in handling scenes with high object occlusion and clutter. We introduce an attention-driven end-to-end model that builds upon existing methods employing pixel-wise unit vectors and voting for object keypoints. Integrating attention mechanisms allows the model to focus computational resources on salient features, enhancing accuracy. Experimental results using the LINEMOD benchmark dataset demonstrate an accuracy rate of 99.73%, outperforming state-of-the-art approaches. The model also exhibits strong generalization capabilities, achieving an average accuracy of 97.36% on objects not included in the dataset. This work concludes that the attention mechanism significantly elevates the performance and robustness of 6D pose estimation, particularly in challenging environments, and opens new avenues for real-world applications.
1 Introduction
6D pose estimation determines an object’s 3D position and orientation in a scene. It involves estimating the object’s translation (its position in space) and rotation (its orientation) across six degrees of freedom: three for position (x, y, z) and three for orientation (roll, pitch, yaw) [1].
The importance of 6D pose estimation lies in its ability to enable robots or autonomous systems to interact with objects in the real world [2]. By accurately determining the position and orientation of an object, a robot can perform various tasks, such as picking and placing objects, assembly, inspection, and manipulation. This technology has numerous applications in manufacturing, logistics, healthcare, and other fields where robots automate processes and perform repetitive tasks. Precise 6D pose estimation is crucial for enhancing object recognition in cluttered environments and for robotics and augmented reality (AR) applications. In the context of autonomous driving, it contributes to improved safety and decision-making. The integration of attention mechanisms specifically addresses the challenges posed by occlusions and clutter, thereby elevating the model’s real-world utility [3]. For successful robotic grasping and manipulation, accurate 6D pose estimation is required. It improves the efficiency and efficacy of robotic systems, reduces the need for human involvement, and eventually results in cost savings and enhanced output [4].
Moreover, AR [5] and autonomous driving [6] rely heavily on 6D pose estimation. AR uses 6D pose estimation to place virtual objects precisely in the real environment. By detecting the location and orientation of the camera and of real-world objects, an AR [7] system can display virtual items that appear to be part of the scene, enhancing the user experience and enabling new applications in gaming, education, and design.
Autonomous driving uses 6D pose estimation to recognize and track objects in the environment, such as other cars, pedestrians, and barriers. By calculating the 6D pose of these objects with respect to the vehicle, the autonomous driving system can forecast their future movements and either prevent collisions or drive around them. Accurate 6D pose estimation is also essential for perceptual tasks such as lane detection, road sign recognition [8], and mapping. In both AR and autonomous driving, the accuracy and efficiency of 6D pose estimation can substantially affect system performance and safety. Therefore, continued research and development in this field are essential for advancing these technologies and creating new applications [1,6,9].
Traditional methods for 6D pose estimation often rely on geometric or feature-based approaches. These methods use various algorithms to extract features from the image or point cloud data and then match them with a model of the object to determine its pose. To estimate an object’s pose, geometric techniques typically employ geometric constraints, such as correspondences between 3D points or lines; Perspective-n-Point (PnP) [10] and Random Sample Consensus (RANSAC) [11] are examples of geometric approaches. Feature-based approaches, on the other hand, estimate an object’s pose using image or point cloud features such as corners or edges; Scale-Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF) are examples of feature-based algorithms [12]. However, these conventional approaches have several drawbacks, including susceptibility to noise, occlusion, and changes in lighting conditions. In addition, they may struggle with objects that are complex, non-rigid, or lacking distinctive features [13].
Recent advances in deep learning (DL) techniques, such as Convolutional Neural Networks (CNNs) and Point Cloud Networks (PCNs) [9,14], have shown promising results in 6D pose estimation, especially for complex or occluded objects. These techniques usually involve training a deep neural network on a large set of images or point clouds to learn the relationship between the input data and the object’s pose. A 2019 study [13] introduced a DL solution for this task: the Pixel-wise Voting Network (PVNet), which is designed to handle objects with distinct textures or appearances and can deal with partially or fully occluded objects. The challenges in 6D pose estimation primarily stem from occlusions, clutter, and variability in object appearance, especially in complex scenes. These intricacies can confound traditional algorithms, making pose estimation an intricate task. Computational efficiency is also a key concern for real-time applications. Our incorporation of attention mechanisms aims to mitigate these challenges by focusing computational resources on salient regions, thus enhancing accuracy and reliability in cluttered and occluded conditions.
The PVNet system comprises two key components: a pixel-wise segmentation network and a pose estimation network. The segmentation network is first trained to predict a binary mask for the object of interest, indicating which pixels belong to the object and which belong to the background. The network is then trained to predict the 6D pose parameters of the object from the segmented image. At inference, the first step is segmenting the object from the image using the segmentation network [15,16]. The pose estimation network then uses the segmented image to predict the pose parameters of the object. Finally, the pose estimation network uses a pixel-wise voting mechanism to aggregate the predicted poses from several image regions, minimizing the impact of occlusion and other noise sources. As a result, PVNet does not need 2D keypoint correspondences, which can be challenging to obtain precisely for some objects. Instead, it uses the object’s unique texture or appearance for pixel-wise segmentation and pose estimation. This approach enables PVNet to handle a variety of object shapes and appearances, and it has performed well in multiple 6D pose estimation benchmarks.
As with any other framework, PVNet has limitations, which include the following:
Texture dependence: PVNet works best with objects that have a distinctive texture or appearance, which makes it challenging to apply to homogeneous or reflective surfaces.
Although PVNet has demonstrated impressive performance in various benchmarks, it may not be suited for all applications and use cases. Consequently, researchers continue investigating and creating novel ways to enhance 6D pose estimation.
Various approaches to 6D pose estimation have been introduced in the computer vision literature, some of which are as follows:
Iterative Closest Point (ICP) is a traditional method that refines an initial pose estimate by matching 3D model points with corresponding observed points. Zhang et al. [17] proposed an enhanced version of the ICP algorithm, providing a novel approach for rapidly and reliably registering two point sets. The traditional point-to-point ICP is viewed as a majorization-minimization (MM) algorithm, and an Anderson acceleration strategy is presented to accelerate its convergence. They develop a robust error measure based on Welsch’s function, which is efficiently minimized using the MM method with Anderson acceleration. The suggested technique delivers accuracy comparable to or better than Sparse ICP while being at least an order of magnitude faster. Vock et al. [18] proposed a method for template matching of 3D shapes in point cloud data that generates transformation predictions using a targeted sampling strategy and an efficient mechanism for hypothesis verification. Experiments indicated improved performance and accuracy over an unoptimized solution, with possible applications in object detection and assembly verification. However, the proposed method assumes prior knowledge of the object template and may not be suitable for scenarios where the object is unknown or novel.
Significant progress has been made in 6D pose estimation using DL-based methods. Among these are CNNs, fully convolutional networks, and differentiable renderer networks. PoseCNN [19] is a CNN that estimates 6D object poses in cluttered surroundings. The network predicts the 3D translation by localizing the object’s centroid in the image and estimating its distance from the camera, and the 3D rotation by regressing to quaternions. However, the approach only works for known objects, and the input image quality can affect network performance. PVNet [13] is a neural network developed to estimate 6D object poses. The network operates on 2D images and uses pixel-wise voting to determine the 6D pose of an object. It links each keypoint with a 3D point on the object model and then computes the pose by aggregating the votes of all associated 3D points. The method is capable of handling symmetric objects and has state-of-the-art performance on a variety of benchmarks. However, the approach requires a 3D model of the object and relies primarily on the accuracy of keypoint identification, which can be affected by occlusion and background clutter. DenseFusion [20] is a novel 6D object pose estimation method that combines information from RGB and depth images via an iterative dense fusion process, resulting in resilient and accurate object localization in congested environments. In addition, the technique uses a novel hybrid loss function that considers both 2D projection error and 3D geometric consistency. Experiments demonstrate that DenseFusion outperforms state-of-the-art approaches on many difficult datasets. However, the approach has certain restrictions. First, successful pose estimation requires depth information, which restricts its use when only RGB data are available. Second, the approach needs substantial processing resources, which may make its deployment in real-time applications difficult. Finally, its efficiency may degrade when dealing with symmetric objects or objects with repeating textures.
DGECN [21] is a depth-guided edge convolutional network proposed for monocular 6D pose estimation. Its differentiable RANSAC algorithm uses the uncertainty of the predicted depth map to increase the accuracy and robustness of the output 6D pose, while its PnP algorithm via edge convolution exploits topological links between 2D and 3D correspondences. However, noise and occlusion in the scene can impair the accuracy of the edge detection, which affects the method’s performance, and the approach is limited to monocular 6D pose estimation. ConvPoseCNN2 [22] is an end-to-end neural network for predicting the 6D pose of objects in cluttered scenes. The technique employs CNNs to predict dense 6D poses and a refinement network to increase the accuracy of the initial pose estimates. However, the method has limited occlusion robustness: it relies on visible parts of the object and may not work effectively under severe occlusion.
Despite the significant advancements in 6D pose estimation, existing geometric and feature-based methodologies manifest certain limitations. Focusing on geometric techniques, the paper by Lepetit et al. [23] introduced an efficient approach to the PnP problem, known as Efficient Perspective-n-Point (EPnP). While reported as faster and more accurate compared to its predecessors, it, like other PnP-based methodologies, relies heavily on establishing precise correspondences between 3D points and 2D projections. This requirement poses challenges when confronted with occlusions, noise, or sparse data.
Moving toward feature-based strategies, Lowe’s paper [24] presented the SIFT, a method proficient in identifying and describing local image features. While its robustness to changes in scale, orientation, and affine distortion is commendable, it does exhibit limitations. Specifically, its performance is compromised under varying illumination conditions and in the presence of homogeneous textures or repetitive patterns. These conditions hinder the reliable execution of pose estimation.
Similarly, the paper by Bay et al. [25] introduced SURF designed to expedite and enhance the efficiency of the feature detection and description process. Notwithstanding its efficiency, SURF shares SIFT’s limitations. Specifically, it performs poorly when dealing with objects with less textured or repeating patterns. Furthermore, its reliability is diminished under varying lighting conditions and noise.
These traditional methods, albeit instrumental in the evolution of 6D pose estimation, collectively encounter shared obstacles. These include susceptibility to noise, varying lighting conditions, and occlusions. Additionally, their computational intensity might be unsuitable for real-time applications. They also frequently struggle with objects lacking distinct textures or having complex or non-rigid structures. This necessitates exploring more robust and computationally efficient techniques, with DL-based approaches gaining increasing attention in pursuing improved 6D pose estimation.
The attention mechanism [26] is a powerful technique in DL that has been used to enhance the performance of several computer vision tasks, such as object identification, segmentation, and pose estimation [27]. Compared to the existing works, this study’s novelty lies in the innovative integration of the attention mechanism into PVNet. Traditional methods, such as geometric and feature-based approaches, are limited by their susceptibility to noise, occlusion, and changes in lighting conditions. PVNet addresses these limitations and effectively estimates 6D poses for a wide range of object shapes and appearances, making it a critical advancement in the field. Nevertheless, even with these achievements, PVNet is primarily texture-dependent: it performs best with objects that have unique textures, which challenges its application to homogeneous or reflective surfaces.
It can be noted that traditional methods for 6D pose estimation often employ geometric or feature-based techniques, which are prone to errors due to noise, occlusion, and lighting variations. On the other hand, DL approaches, such as CNNs and PCNs, have mitigated some of these issues. The work extends these advancements by integrating attention mechanisms into PVNet, enhancing its robustness in complex environments. Pixel-wise unit vectors of object keypoints, important in traditional methods for establishing 3D–2D correspondences, are less critical in this new model. Instead, the model leverages the object’s unique texture for segmentation and pose estimation, offering an innovative solution to traditional 6D pose estimation challenges.
In response to this limitation, our work embeds a spatial attention module into the PVNet structure. The attention mechanism, a powerful tool in DL, improves performance by selectively highlighting the most informative features of the object of interest while suppressing less relevant or distracting ones. As such, it significantly reduces the impact of noise and occlusion, offering more accurate pose estimation. This approach is an innovative step beyond the traditional methods, effectively addressing their limitations while offering enhanced performance. By adding attention mechanisms to the PVNet-based 6D pose estimation model, we aim to enhance performance in complex, cluttered, and occluded environments. Drawing from its success in image segmentation, the attention module operates on intermediate feature maps to selectively focus on crucial spatial regions and feature channels. This improves the model’s robustness and accuracy by highlighting important features and suppressing noise, making it more applicable to real-world scenarios.
The contributions of this work can be summarized as follows:
Development of an attention-integrated PVNet: The primary contribution of our work is the innovative integration of an attention mechanism into the existing PVNet structure for 6D pose estimation. This novel approach represents a critical advancement in the field by effectively addressing the limitations of traditional methods and enhancing pose estimation precision and robustness.
Spatial attention mechanism design: Our work successfully designed and implemented a spatial attention module that selectively weights feature maps according to their spatial positions. This design represents an important contribution to DL techniques applied in computer vision tasks.
Adaptive to diverse object characteristics: By incorporating an attention mechanism into PVNet, our approach has proven to adapt effectively to various object shapes, textures, and appearances. This flexibility in handling diverse objects is an important contribution to the practical application of 6D pose estimation in real-world scenarios.
Robust performance under challenging conditions: The attention-integrated PVNet exhibits enhanced robustness under challenging conditions, such as noise, occlusion, and lighting variations. This improved performance under difficult conditions makes the method a substantial contribution to scenarios where pose estimation has traditionally been problematic.
Empirical validation: Our method was validated on a widely-used 6D pose estimation benchmark dataset, achieving state-of-the-art performance. This empirical validation demonstrates our method’s efficacy and provides a basis for further research and advancement.
Potential for application in various sectors: The precision and robustness of our proposed method open avenues for its application in critical sectors, such as robotics, autonomous driving, and AR. The potential for real-world application represents a significant contribution with a broad-reaching impact.
This work is structured as follows: Section 2 details the proposed method and the experimental setup, including the dataset and evaluation metrics. Section 3 presents the results and a comprehensive discussion. Finally, the conclusion is presented in Section 4.
2 Method
This research proposes an innovative method for estimating the 6-DoF pose of an object. Pose estimation aims to identify objects within an image and determine their 3D orientations and translations. The pipeline incorporates two neural network models (one for image segmentation and one for pixel-wise unit vector prediction), a RANSAC voting procedure, and a PnP solver algorithm. A comprehensive and detailed account of the procedure is presented below:
Input RGB image: The process begins by feeding the model with an RGB image of the object. The aim is to estimate the 6-DoF pose of this object.
Image segmentation: The RGB image is then inputted into the first neural network model, which performs the task of image segmentation. This step aims to isolate the object pixels in the image by generating a mask that distinguishes the object’s pixels from the rest of the image.
Keypoint predictions: Concurrently, the second neural network model produces pixel-wise unit vector predictions for each identified keypoint in the training data. These predictions symbolize the direction of the keypoints within the object’s coordinate system.
RANSAC voting procedure: After identifying object pixels via the image segmentation, the unit vector predictions associated with each keypoint are used in a RANSAC voting procedure. This step aims to create an array of weighted hypotheses. The weight of each hypothesis is determined by its alignment with the other identified object pixels.
Weighted average computation: Following the RANSAC voting, the generated hypotheses are weighted and averaged to yield a value for each keypoint. This results in a set of 2D keypoints for the object within the image.
Application of PnP solver function: The final stage pairs the derived 2D keypoints with corresponding 3D keypoints. This pairing can be accomplished either manually or through a labeled keypoint dataset. The associated keypoints are then introduced to a PnP solver function. Upon receiving 2D and 3D keypoints, the PnP solver returns the camera’s rotation and translation vectors within the object’s coordinate system. These vectors describe the 6-DoF pose of the object within the image.
The steps outlined above describe the flow of our proposed methodology, as depicted in Figure 1, while Algorithm 1 presents the method’s pseudocode. The methodology aims to be a robust and accurate tool for object pose estimation, allowing for the reproducibility of results, thereby fostering trust and facilitating further advancement in the field.
Algorithm 1: The pseudocode of the proposed method.
Input: RGB image of the object I
Output: 6-DoF pose (R, T) of the object, where R is the rotation vector and T is the translation vector
Parameters:
  Neural Network Model 1 (Segmentation): NN1
  Neural Network Model 2 (Keypoint Prediction): NN2
  RANSAC Voting Threshold: τ
  PnP Solver: PnP-Solver
Begin
  1. Load the RGB image I of the object
  2. Perform image segmentation using NN1
  3. Generate keypoint predictions using NN2
  4. Perform the RANSAC voting procedure to obtain the hypothesis set H
  5. Compute the weighted average of H
  6. Use the PnP solver to obtain (R, T)
End
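To make the flow of Algorithm 1 concrete, the following Python sketch walks through the same six steps. The helper names (nn1_segment, nn2_vectors, ransac_vote), the tensor shapes, and the use of OpenCV’s solvePnP are illustrative assumptions rather than the exact interfaces of our implementation; ransac_vote is sketched in Section 2.3.

```python
# Minimal Python/NumPy sketch of Algorithm 1. Helper names and shapes are
# illustrative assumptions; ransac_vote is sketched in Section 2.3.
import numpy as np
import cv2

def estimate_pose(image, nn1_segment, nn2_vectors, keypoints_3d, camera_matrix):
    # Step 2: segment the object (binary mask of object pixels).
    mask = nn1_segment(image) > 0.5                     # (H, W) boolean
    # Step 3: pixel-wise unit vectors toward each of the K keypoints.
    vectors = nn2_vectors(image)                        # (H, W, K, 2)
    ys, xs = np.nonzero(mask)
    pixels = np.stack([xs, ys], axis=1).astype(np.float64)
    keypoints_2d = []
    for k in range(vectors.shape[2]):
        # Steps 4-5: RANSAC voting over object pixels, then a weighted average
        # of the surviving hypotheses (Sections 2.3 and 2.4).
        hypotheses, weights = ransac_vote(pixels, vectors[ys, xs, k, :])
        keypoints_2d.append(np.average(hypotheses, axis=0, weights=weights))
    keypoints_2d = np.asarray(keypoints_2d, dtype=np.float64)
    # Step 6: PnP on the 2D-3D correspondences returns rotation R and translation T.
    _, rvec, tvec = cv2.solvePnP(keypoints_3d, keypoints_2d, camera_matrix, None)
    return rvec, tvec
```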
2.1 Image segmentation and attention mechanism
Image segmentation using the U-Net model with attention mechanism involves the following steps:
Image Input: An image is provided as input to the U-Net model.
Encoding Path: The input image is processed through a sequence of convolutional and pooling layers, which decreases the image’s spatial resolution while increasing the number of channels.
Bridge: The encoding path is then connected to a bridge layer, which links the encoding path to the U-Net model’s decoding path.
Decoding Path: The decoding path consists of a sequence of up-convolutional layers and concatenation with the encoding path’s matching feature maps. This technique improves spatial resolution while simultaneously reducing the number of channels.
Attention Mechanism: Attention gates are added to the decoding path to improve segmentation accuracy. The attention mechanism selects the features from the encoding path that are relevant to the current position on the decoding path, helping the model focus on the most important parts of the image.
Output: The final output of the U-Net model is a binary mask that segments the object from the rest of the input image.
The U-Net [28] model’s attention mechanism is helpful in several ways. First, it helps the model focus on important image regions, making the segmentation more accurate. Second, it reduces the number of parameters the model needs by selecting only the important parts of the encoding path, improving efficiency [29]. Finally, it makes it easier for the model to handle images with complicated backgrounds or multiple objects by selecting only the relevant parts of the encoding path for each position in the decoding path.
The attention gate is employed in the attention mechanism for image segmentation to learn a spatial attention map highlighting the pertinent image areas for a particular pixel in the output segmentation map. The input and context feature maps are the inputs the attention gate requires [16]. The encoder network’s output, the input feature map, comprises spatial data about the input image. The decoder network’s output, the context feature map, contains details about the target pixel’s context. The attention gate computes an attention map by passing the input and context feature maps through a series of convolutional layers, followed by a sigmoid activation function that squashes values between 0 and 1. The resulting attention map highlights the relevant image regions for a given pixel in the output segmentation map. The context feature map is then multiplied element-wise with the attention map, enhancing the important feature map areas while suppressing the irrelevant ones. The final output segmentation map is created by running the generated feature map through an additional set of convolutional layers.
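As an illustration of this design, the following Keras snippet sketches an attention gate along the lines described above. The intermediate channel count, layer arrangement, and final 1×1 convolution are assumptions made for the sketch, not necessarily the exact configuration used in our network.

```python
# A hedged Keras sketch of the attention gate described above.
import tensorflow as tf
from tensorflow.keras import layers

def attention_gate(input_features, context_features, inter_channels=64):
    """input_features: encoder (skip) feature map; context_features: decoder feature map.
    Both are assumed to share the same spatial resolution (H, W)."""
    # Project both feature maps into a common intermediate space.
    theta = layers.Conv2D(inter_channels, kernel_size=1)(input_features)
    phi = layers.Conv2D(inter_channels, kernel_size=1)(context_features)
    # Combine and squash to a single-channel spatial attention map in [0, 1].
    combined = layers.Activation("relu")(layers.Add()([theta, phi]))
    attention_map = layers.Conv2D(1, kernel_size=1, activation="sigmoid")(combined)
    # Element-wise weighting: emphasise relevant regions, suppress the rest.
    gated = layers.Multiply()([context_features, attention_map])
    # A further convolution refines the gated features before they re-enter the decoder.
    return layers.Conv2D(context_features.shape[-1], kernel_size=1)(gated)
```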
2.2 Prediction of keypoints
A neural network can be trained to detect and localize the selected keypoints in RGB images. During training, the network is given images of the object taken from various angles, together with ground-truth labels indicating where each keypoint lies in each image, and it learns to predict the 2D location of each keypoint in an input image. At inference time, the network detects the keypoints in a new image, which are then used to estimate the object’s 6D pose. The keypoints are matched with their corresponding 3D points in the object coordinate system, and a PnP solver is used to estimate the object’s pose relative to the camera [11].
Keypoint selection is an important step in the proposed method for accurate and robust 6D pose estimation. By selecting distinctive and reliably detectable points on an object, the method can overcome occlusion and lighting changes and achieve high accuracy in real-world scenarios.
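For illustration, the sketch below shows how the pixel-wise unit-vector training targets described above can be generated from an object mask and its 2D keypoints; the function name and array shapes are assumptions for this example, not the exact data layout of our implementation.

```python
# NumPy sketch of pixel-wise unit-vector targets: every object pixel stores the
# unit direction from itself to each 2D keypoint.
import numpy as np

def unit_vector_targets(mask, keypoints_2d):
    """mask: (H, W) boolean object mask; keypoints_2d: (K, 2) array of (x, y) keypoints."""
    h, w = mask.shape
    ys, xs = np.nonzero(mask)                                  # coordinates of object pixels
    pixels = np.stack([xs, ys], axis=1).astype(np.float32)     # (N, 2) as (x, y)
    targets = np.zeros((h, w, len(keypoints_2d), 2), dtype=np.float32)
    for k, kp in enumerate(keypoints_2d):
        diff = kp[None, :] - pixels                            # vector from pixel to keypoint
        norm = np.linalg.norm(diff, axis=1, keepdims=True) + 1e-8
        targets[ys, xs, k, :] = diff / norm                    # unit vectors for object pixels
    return targets
```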
2.3 RANSAC voting procedure
RANSAC, which stands for Random Sample Consensus, is an iterative method used to estimate the parameters of a mathematical model from a set of observed data that contains outliers or noise. RANSAC aims to identify the inliers and the data points that fit the model while rejecting the outliers, which are the data points that do not fit the model.
In RANSAC, the algorithm randomly selects a minimal subset of data points from the observed data and uses them to fit a model. Then, the remaining data points are checked against the model, and those within a predefined threshold are considered inliers. The algorithm repeats this process multiple times and returns the model with the largest number of inliers.
RANSAC is commonly used in computer vision applications such as object detection, 3D reconstruction, and pose estimation, where the data may contain outliers or noise. Using RANSAC, the algorithm can robustly estimate the model’s parameters even in the presence of such outliers.
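The sketch below illustrates how such a RANSAC voting step can look for a single keypoint in our setting: pairs of object pixels are sampled, the intersection of their predicted directions yields a keypoint hypothesis, and the remaining pixels vote for hypotheses they point toward. The hypothesis count and angular threshold are illustrative assumptions.

```python
# Hedged sketch of RANSAC voting for one keypoint.
import numpy as np

def ransac_vote(pixels, directions, num_hypotheses=128, cos_threshold=0.99):
    """pixels: (N, 2) object-pixel coordinates; directions: (N, 2) predicted unit vectors."""
    rng = np.random.default_rng(0)
    hypotheses, weights = [], []
    for _ in range(num_hypotheses):
        i, j = rng.choice(len(pixels), size=2, replace=False)
        # Intersect the two rays p_i + t_i * d_i and p_j + t_j * d_j.
        a = np.stack([directions[i], -directions[j]], axis=1)   # 2x2 linear system
        if abs(np.linalg.det(a)) < 1e-6:
            continue                                            # nearly parallel rays
        t = np.linalg.solve(a, pixels[j] - pixels[i])
        h = pixels[i] + t[0] * directions[i]
        # Each pixel votes if its predicted direction points toward the hypothesis.
        to_h = h[None, :] - pixels
        to_h /= np.linalg.norm(to_h, axis=1, keepdims=True) + 1e-8
        votes = np.sum(np.einsum("nd,nd->n", to_h, directions) > cos_threshold)
        hypotheses.append(h)
        weights.append(votes)
    return np.array(hypotheses), np.array(weights, dtype=np.float64)
```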
2.4 Weighted average
After the keypoint association, a weighted average is computed to estimate the 3D pose of the object. The weighted average is computed by taking a weighted sum of the 3D coordinates of the associated keypoints, where the weight for each keypoint is determined by its confidence score. The confidence score is based on the probability of the keypoint’s detection in the image and is computed during the keypoint detection process.
Taking a weighted average emphasizes the influence of keypoints with higher confidence scores over those with lower scores, helping to improve the accuracy of the 3D pose estimation, since keypoints with higher confidence scores are more likely to be accurately detected and associated. The weighted average is a commonly used technique in many computer vision tasks to combine the predictions of multiple sources of information while accounting for their reliability.
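A minimal numerical illustration of this confidence-weighted averaging is given below; the hypothesis coordinates and weights are made-up values standing in for the output of the RANSAC step.

```python
# Confidence-weighted average of candidate 2D keypoint locations.
import numpy as np

hypotheses = np.array([[120.4, 88.1], [121.0, 87.6], [150.2, 40.3]])  # candidate keypoints
weights = np.array([45.0, 52.0, 3.0])       # confidence scores (e.g. RANSAC vote counts)
keypoint = np.average(hypotheses, axis=0, weights=weights)
# The outlier hypothesis with weight 3 barely affects the final keypoint estimate.
```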
2.5 PnP solver function
PnP is a classical problem in computer vision that refers to estimating the position and orientation of an object in 3D space relative to a camera, given a set of corresponding 2D–3D points. The PnP problem arises when we have a set of 2D image points and a set of 3D points in a known coordinate system, and we want to estimate the camera position and orientation that would project the 3D points onto the 2D image points.
There are various algorithms for solving the PnP problem; they usually take as input a set of 2D–3D correspondences and output the camera translation and rotation vectors in the object coordinate system. One common algorithm is EPnP, a non-iterative O(n) solution whose result can optionally be refined with a Gauss-Newton optimization that minimizes the reprojection error between the 3D points and their projections onto the image plane.
In the context of RGB-based 6-DoF object pose estimation, the PnP solver function takes as input a collection of 2D keypoints extracted from the input RGB image and their corresponding 3D keypoints in the object coordinate system. The function then estimates the camera pose (i.e., the translation and rotation vectors of the camera) necessary to project the 3D keypoints onto the 2D keypoints. This camera pose can then be used to calculate the position of the object relative to the camera.
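As a concrete illustration, the snippet below recovers a pose from 2D–3D correspondences with OpenCV’s PnP solver; the keypoint values and camera intrinsics are placeholder numbers, not data from our experiments.

```python
# Recovering the 6-DoF pose from 2D-3D keypoint correspondences with OpenCV.
import numpy as np
import cv2

object_points = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0], [0.0, 0.1, 0.0],
                          [0.0, 0.0, 0.1], [0.1, 0.1, 0.0], [0.1, 0.0, 0.1]])  # 3D keypoints (metres)
image_points = np.array([[320.0, 240.0], [400.0, 238.0], [318.0, 160.0],
                         [322.0, 300.0], [398.0, 162.0], [401.0, 298.0]])      # matched 2D keypoints (pixels)
camera_matrix = np.array([[600.0, 0.0, 320.0],
                          [0.0, 600.0, 240.0],
                          [0.0, 0.0, 1.0]])
success, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix,
                                   distCoeffs=None, flags=cv2.SOLVEPNP_EPNP)
# rvec (Rodrigues rotation vector) and tvec transform points from the object
# coordinate system into the camera frame, describing the object's 6-DoF pose.
```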
2.6 Experiments
2.6.1 Loss function
The Huber loss [30] in this experiment is determined using the Keras API of TensorFlow. The Huber loss offers a smooth, robust loss function that is less susceptible to outliers, combining the advantages of Mean Squared Error (MSE) and Mean Absolute Error (MAE). The Huber loss is defined in equation (1):

$$L_{\delta}(a)=\begin{cases}\frac{1}{2}a^{2}, & |a|\leq\delta,\\ \delta\left(|a|-\frac{1}{2}\delta\right), & |a|>\delta,\end{cases}\qquad(1)$$

where $a = y - \hat{y}$ is the error between the ground-truth value $y$ and the prediction $\hat{y}$, and $\delta$ is the threshold that controls the transition between the quadratic and linear regimes.
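For reference, the loss can be evaluated directly through the Keras API; the delta value and the sample tensors below are illustrative.

```python
# Minimal example of the Huber loss via the Keras API.
import tensorflow as tf

huber = tf.keras.losses.Huber(delta=1.0)
y_true = tf.constant([[0.0, 1.0], [0.0, 0.0]])
y_pred = tf.constant([[0.6, 0.4], [0.4, 2.6]])
loss = huber(y_true, y_pred)   # small errors are penalised quadratically, large ones linearly
print(float(loss))
```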
The Huber loss function is more resilient and less susceptible to outliers than the MSE loss function. Because the MSE loss is quadratic in the error, it is extremely sensitive to outliers. The Huber loss, by contrast, is quadratic for small errors and linear for large errors, making it less vulnerable to outliers than the MSE loss. Experiments were conducted on a computer with a 1.7 GHz Intel Core i7 CPU, 16 GB of RAM, 64-bit Windows 11 Professional, and a 4 GB NVIDIA GeForce graphics card. All models were constructed using the TensorFlow framework and the Python programming language.
2.6.2 Dataset
The LINEMOD dataset [31] is a pivotal resource in computer vision, particularly concerning object identification and 6D pose estimation tasks. The dataset endeavors to assess algorithms capable of recognizing and estimating the 6D pose (comprising 3D translation and 3D rotation) of objects from RGB-D or RGB images. It encapsulates 13 disparate objects, providing a robust platform for algorithm evaluation across various object categories.
The meticulous construction of the LINEMOD dataset is manifest in its diverse image collection, which portrays the objects from a multitude of angles and under varying lighting conditions. This assortment of images, captured using a calibrated camera, predominantly encompasses RGB color alongside depth or 3D point cloud data. This characteristic of the dataset is instrumental in fostering a nuanced evaluation of algorithmic performance in real-world scenarios.
A salient feature of the LINEMOD dataset is its provision of ground truth annotations, articulated in the form of 2D keypoints and corresponding 3D model coordinates for each object instance. These annotations are quintessential for evaluating algorithms on multiple fronts: object identification, keypoint localization, and 6D pose estimation. This multifaceted annotation scheme significantly amplifies the utility of the LINEMOD dataset as a benchmarking tool. Furthermore, the LINEMOD dataset introduces a suite of assessment measures indispensable for the comparative evaluation of various methods. Metrics such as average precision (AP), distance, and accuracy for 6D pose estimation are provided, facilitating a comprehensive appraisal of algorithmic effectiveness in tackling the pose estimation problem.
The augmentation of the LINEMOD dataset with synthetic representations of the encapsulated objects is a noteworthy endeavor. This enhancement facilitates the generation of additional training data, which is pivotal in training models to be more resilient to variations in appearance and lighting conditions, thus addressing a critical challenge in computer vision applications.
The structured division of the LINEMOD dataset into training and testing sets, each containing a substantial number of images per object, further underpins its significance. Including synthetic images in the training set, coupled with on-run data augmentation strategies such as random scaling, rotation, cropping, and color jittering, mitigates the risk of overfitting. This meticulous organization and augmentation of the dataset promotes robust training regimes and ensures a rigorous evaluation of 6D pose estimation methodologies. The LINEMOD dataset is an invaluable asset in computer vision, offering a robust and comprehensive framework for evaluating and benchmarking algorithms dedicated to 6D pose estimation. Through its diverse image collection, detailed ground truth annotations, comprehensive assessment metrics, and strategic dataset augmentation, LINEMOD significantly contributes to advancing research and development in object identification and pose estimation endeavors.
2.6.3 Evaluation metrics
Average distance (ADD) is a metric for evaluating the performance of 6D pose estimation algorithms. It is calculated by transforming the object model points with the estimated and ground-truth poses and averaging the distances between the corresponding transformed points; the lower the ADD, the better the pose estimate. In practice, results are commonly reported as ADD accuracy, i.e., the fraction of test samples whose ADD falls below a threshold (typically 10% of the object diameter), so higher reported values indicate better performance.
Steps to calculate ADD are as follows:
Estimate the pose of each object in the image using the proposed method.
Calculate the distance between each object’s estimated pose and the ground truth pose.
Average the distances to get a single metric for the proposed method.
MSE is a common evaluation metric for 6D pose estimation. It is calculated by taking the average of the squared errors between the estimated pose and the ground truth pose. The lower the MSE, the better the pose estimation.
MAE is another common evaluation metric for 6D pose estimation. It is calculated by taking the average of the absolute errors between the estimated pose and the ground truth pose. The lower the MAE, the better the pose estimation.
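The sketch below summarizes how the ADD metric and the corresponding accuracy can be computed for a set of test samples. The function names are illustrative, and the threshold of 10% of the object diameter follows common practice on LINEMOD rather than being prescribed above.

```python
# Hedged sketch of the ADD computation and the threshold-based ADD accuracy.
import numpy as np

def add_metric(model_points, r_est, t_est, r_gt, t_gt):
    """model_points: (M, 3) object model vertices; r_*: (3, 3) rotations; t_*: (3,) translations."""
    est = model_points @ r_est.T + t_est     # points under the estimated pose
    gt = model_points @ r_gt.T + t_gt        # points under the ground-truth pose
    return np.mean(np.linalg.norm(est - gt, axis=1))

def add_accuracy(adds, diameter, fraction=0.1):
    """Fraction of test samples whose ADD is below `fraction` of the object diameter."""
    return np.mean(np.asarray(adds) < fraction * diameter)
```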
3 Results and discussion
The proposed 6D pose estimation model with attention was evaluated on the benchmark LINEMOD dataset. In addition, the method was tested in challenging situations, such as occlusion, clutter, and varying lighting, where it performed effectively and maintained high accuracy. In Table 1, each row lists the object together with its ADD, MSE, MAE, and loss values.
No. | Object | ADD | MSE | MAE | Loss |
---|---|---|---|---|---|
1 | Duck | 0.9992 | 0.0010 | 0.0097 | 0.0019 |
2 | Phone | 0.9979 | 0.0012 | 0.0106 | 0.0052 |
3 | Lamp | 0.9971 | 0.0016 | 0.0132 | 0.0066 |
4 | Iron | 0.9979 | 0.0015 | 0.0015 | 0.0046 |
5 | Hole puncher | 0.9971 | 0.9979 | 0.9979 | 0.0077 |
6 | Glue | 0.9990 | 0.006349 | 0.0088 | 0.0023 |
7 | Egg box | 0.9977 | 0.007271 | 0.0083 | 0.0059 |
8 | Driller | 0.9979 | 0.0014 | 0.0126 | 0.0054 |
9 | Can | 0.9874 | 0.0012 | 0.0118 | 0.0315 |
10 | Camera | 0.9978 | 0.00907 | 0.0101 | 0.0054 |
11 | Bench vise | 0.9974 | 0.0014 | 0.0130 | 0.0063 |
12 | Ape | 0.9996 | 0.0042 | 0.0078 | 0.0011 |
13 | Cat | 0.9990 | 0.0014 | 0.0161 | 0.0025 |
Our experiments with the LINEMOD dataset objects served distinct purposes as follows:
Duck and phone: These objects typify everyday items. The model achieved ADDs of 0.9992 and 0.9979, respectively, underscoring its general applicability. Additional analysis shows that the model successfully handled the varied textures and lighting conditions these objects were presented in.
Lamp and iron: These objects have intricate shapes. The model posted ADDs of 0.9971 and 0.9979, attesting to its proficiency in managing complex geometric structures. Further analysis revealed that the attention mechanism was particularly effective in identifying the complex features of these objects.
Hole puncher and glue: These are commonly found in cluttered environments like offices. With ADDs of 0.9971 and 0.9990, the results validate the model’s robustness in cluttered settings. Additional analysis confirmed the model’s ability to focus on the object, disregarding the surrounding clutter.
Egg box and driller: These objects are usually located in specialized settings. The model’s ADDs of 0.9977 and 0.9979 reflect its adaptability to such specialized environments. Additional analysis found that the model effectively managed these objects’ unusual shapes and materials.
Can and camera: Ubiquitous objects are meant to validate the model’s universal applicability. The model achieved ADDs of 0.9874 and 0.9978, respectively. Further analysis showed that the model was less susceptible to common pitfalls like occlusion and background noise.
Bench vise, ape, and cat: These were chosen to test the model’s performance on objects with varied textures and complexities. With ADDs ranging from 0.9974 to 0.9996, the model also excelled in this category. Additional analysis indicated that the attention mechanism was proficient in identifying complex textures and patterns.
The results demonstrate that including the attention mechanism in the 6D pose estimation framework is beneficial. The high ADD accuracy together with the low MSE and MAE values indicates that the attention mechanism considerably enhanced the accuracy and precision of pose estimation. These findings validate the potential of attention mechanisms for strengthening 6D pose estimation tasks and emphasize their benefits in producing more accurate and reliable pose estimates.
The integration of the attention mechanism into the 6D pose estimation model provides several important benefits:
Enhanced localization: The attention mechanism enables the model to concentrate on relevant object regions, enhancing object localization precision by collecting important details and lowering false positives.
Refinement of pose estimation: By focusing selectively on informative regions, the attention mechanism assists the model in refining its pose estimation, resulting in more exact and trustworthy 6D pose estimations.
Robustness in challenging conditions: The attention method improves the model’s capacity to manage occlusions, crowded backdrops, and lighting fluctuations, resulting in robust performance and accurate pose estimation in even the most challenging circumstances.
Improved performance metrics: The attention-based model showed substantial improvements in performance metrics such as ADD, MSE, and MAE, indicating greater accuracy and precision and fewer errors than baseline methods.
3.1 Qualitative evaluation
In addition to quantitative measures, a qualitative evaluation of the attention mechanism-based 6D pose estimation model was performed. A visual examination was carried out on a subset of challenging test images from the LINEMOD dataset. The model consistently generated accurate and visually convincing pose estimates, correctly localizing objects despite obstructions and cluttered backgrounds. Furthermore, the attention mechanism directed the model to focus efficiently on informative regions, resulting in precise pose estimation and enhanced object localization. Figure 2 displays some 6D pose estimation results; the green boxes reflect the ground-truth poses, whereas the blue boxes represent the model predictions.
3.2 Testing the method on an external object
The proposed method’s generalization ability was meticulously assessed by testing it on an external object divergent from those within the LINEMOD dataset. Utilizing a script from a publicly available source “https://github.com/F2Wang/ObjectDatasetTools,” object masks, bounding box labels, and a 3D reconstructed object mesh were generated for an RGB-D camera-captured object sequence, forming the basis for this evaluation.
The method’s performance was quantitatively evaluated using several metrics, including ADD, with a score of 0.9736, MAE of 0.0318, MSE of 0.0091, and a loss value of 0.0046. These metrics collectively provide a nuanced insight into the method’s ability to accurately estimate the 6D pose of the external object, with the ADD metric being particularly pertinent to the pose estimation task.
Figure 3 elucidates the practical application of the proposed method, depicting an example of a tested object referred to as “timer.” In this visualization, the ground-truth poses are delineated by green boxes, while the blue boxes symbolize the model predictions. The congruence between the model predictions and the ground truth is emblematic of the method’s proficiency in accurately estimating the 6D pose of the external object.
The results garnered from this evaluation substantiate the proposed method’s capability to generalize beyond the confines of the LINEMOD dataset, accurately predicting the 6D pose of an object not previously encountered during training. This demonstration of generalization ability is pivotal, as it underlines the method’s potential for broad-spectrum applicability in real-world scenarios, encompassing a diverse range of objects and conditions. Through this rigorous evaluation of an external object, the proposed method showcases its robustness and adaptability, quintessential attributes for practical deployments in 6D pose estimation tasks.
3.3 Comparison with state-of-the-art methods
The attention mechanism-based model was compared with the state-of-the-art methods on the LINEMOD dataset. The proposed model outperformed existing approaches regarding ADD (used widely as performance metrics in the previous methods). The significant improvements in accuracy and precision highlight the effectiveness of attention mechanisms in refining pose estimation and capturing important object details.
The proposed attention mechanism-based model significantly enhanced performance over existing state-of-the-art methods on the LINEMOD dataset. This was markedly observed in terms of the ADD.
The tabulated results in Table 2 underline the proposed model’s superior performance compared to other esteemed methods such as PVNet, Hybrid-Pose, YOLO6D, DPOD, DPOD+, and EfficientPose. The proposed model’s superiority is apparent across various objects, showcasing its robustness and efficacy in 6D pose estimation tasks.
Object | The proposed method | PVNet [13] | Hybrid pose [32] | YOLO6D [33] | DPOD [34] | DPOD+ [34] | Efficient Pose [35] |
---|---|---|---|---|---|---|---|
Duck | 99.92 | 52.58 | 65.0 | 27.23 | 66.01 | 86.29 | 90.99 |
Phone | 99.79 | 92.41 | 94.9 | 47.74 | 74.24 | 94.69 | 97.98 |
Lamp | 99.71 | 99.33 | 99.5 | 71.11 | 88.11 | 96.84 | 100 |
Iron | 99.79 | 98.88 | 100 | 74.97 | 99.80 | 100 | 99.69 |
Hole puncher | 99.71 | 81.92 | 89.7 | 42.63 | 65.83 | 86.87 | 95.15 |
Glue | 99.90 | 95.66 | 98.8 | 80.02 | 93.83 | 96.82 | 100 |
Egg box | 99.77 | 99.15 | 100 | 69.58 | 99.72 | 99.91 | 100 |
Driller | 99.79 | 96.43 | 98.50 | 63.51 | 97.72 | 98.80 | 99.90 |
Can | 98.74 | 95.47 | 98.50 | 68.80 | 94.10 | 99.71 | 98.52 |
Camera | 99.78 | 86.86 | 90.4 | 36.57 | 90.36 | 96.07 | 97.94 |
Bench vise | 99.74 | 99.90 | 99.90 | 81.80 | 95.34 | 98.45 | 99.71 |
Ape | 99.96 | 43.62 | 63.10 | 21.62 | 53.28 | 87.73 | 87.71 |
Cat | 99.90 | 79.34 | 89.40 | 41.82 | 60.38 | 94.71 | 98.00 |
Average | 99.73 | 86.27 | 91.3 | 55.95 | 82.98 | 95.15 | 97.35 |
A meticulous analysis of the results reveals that the proposed model consistently achieved high accuracy scores, with many objects having an accuracy of around 99% or above. This high level of accuracy is emblematic of the model’s capability to estimate objects’ pose under various conditions precisely.
Comparatively, the state-of-the-art methods displayed a broader range of accuracy scores, often falling short of the proposed model’s performance. For instance, in the case of the object “Duck,” the proposed model achieved a remarkable accuracy of 99.92%, whereas the next best performing method, EfficientPose, attained an accuracy of 90.99%. This pattern of superior performance is replicated across other objects, delineating a clear performance edge of the proposed model.
The average accuracy across all objects further accentuates the proposed model’s superior performance, with an average accuracy of 99.73% compared to the next-best average accuracy of 97.35% achieved by EfficientPose.
These improvements underscore the efficacy of attention mechanisms in refining pose estimation and capturing crucial object details. The proposed model’s precision and accuracy advancements demonstrate its superior performance and contribute to advancing the state-of-the-art in 6D pose estimation, highlighting the potential of attention mechanisms in enhancing the accuracy of pose estimation tasks. Despite the considerable advancements this research brings to the 6D pose estimation landscape, it has limitations. One of the primary constraints is the model’s sensitivity to dynamic lighting conditions, which could introduce inaccuracies in pose estimation. The existing architecture may also struggle with rapid object movements, as it has not been explicitly designed to handle temporal changes between frames. These limitations indicate that while the model performs exceptionally well under controlled conditions, its robustness in more dynamic, real-world settings may require further investigation and adaptation. Consequently, the model’s applicability could be restricted in environments where lighting and object motion are highly variable.
4 Conclusion
This research marks a significant advancement in the field of 6D pose estimation by successfully incorporating attention mechanisms into existing frameworks. Theoretically, our model demonstrates how attention processes can selectively focus on informative regions, thereby significantly enhancing the precision and accuracy of pose estimation. This contribution enriches the academic discourse surrounding the interplay between attention mechanisms and pose estimation, offering a new benchmark for future studies. On the practical side, the model’s increased accuracy and robustness have immediate ramifications for various industries. Robotics and manufacturing sectors can benefit from more efficient object manipulation and assembly, while healthcare could see advances in surgical robotics and patient monitoring. Furthermore, the model promises to improve object recognition in autonomous driving systems and enrich user experiences in AR.
Our contributions to the research landscape are manifold. We introduced attention mechanisms into 6D pose estimation, significantly improving key performance metrics like ADD, MSE, and MAE. We validated the model’s robustness through extensive quantitative and qualitative evaluations, demonstrating its applicability in complex scenarios such as occlusions and cluttered environments. The model’s practical advantages are particularly notable in applications requiring high precision and reliability, such as industrial automation and medical procedures. Despite these contributions, the model has limitations, especially in handling dynamic lighting and rapid object movements.
The development of the attention-driven model has limitations, such as facing computational complexity due to the added attention layers and hyperparameter tuning becoming more intricate. Additionally, the model’s interpretability may be compromised due to its complexity.
For future research, several avenues are worth exploring. First, attention mechanisms could be refined through multi-scale or attention refinement techniques for even greater accuracy. Second, the model could be extended to cope with more dynamic environments by integrating temporal information or employing recurrent neural networks. Finally, the generalizability of attention mechanisms could be tested in related computer vision tasks like object tracking and instance segmentation. This study is pivotal in the theoretical understanding and practical application of attention mechanisms in 6D pose estimation.
-
Funding information: No funds were received from any institutions, it is self-funded.
-
Author contributions: Every author is involved in shaping the study’s concept and design, collecting and analyzing data, interpreting findings, and contributing to the paper’s writing.
-
Conflict of interest: No conflict of interest.
-
Data availability statement: The data will be available upon request to the corresponding author.
References
[1] He Z, Feng W, Zhao X, Lv Y. 6D pose estimation of objects: Recent technologies and challenges. Appl Sci. 2021;11(1):228. doi:10.3390/app11010228.
[2] Yan W, Xu Z, Zhou X, Su Q, Li S, Wu H. Fast object pose estimation using adaptive threshold for bin-picking. IEEE Access. 2020;8:215815047. doi:10.1109/ACCESS.2020.2983173.
[3] Peng L, Zhao Y, Qu S, Zhang Y, Weng F. Real time and robust 6D pose estimation of RGBD data for robotic bin picking. In: Chinese Automation Congress (CAC). Hangzhou, China: IEEE; 2019. p. 5283–8. doi:10.1109/CAC48633.2019.8996450.
[4] Jacofsky DJ, Allen M. Robotics in arthroplasty: A comprehensive review. J Arthroplasty. 2016;31(10):2353–63. doi:10.1016/j.arth.2016.05.026.
[5] Li X, Ling H. Hybrid camera pose estimation with online partitioning for SLAM. IEEE Robot Autom Lett. 2020;5(2):1453–60. https://arxiv.org/pdf/1908.01797.pdf. doi:10.1109/LRA.2020.2967688.
[6] Gu R, Wang G, Hwang JN. Efficient multi-person hierarchical 3D pose estimation for autonomous driving. In: Proceedings - 2nd Int Conf MIPR; 2019. p. 163–8. doi:10.1109/MIPR.2019.00036.
[7] Zhang S, Song C, Radkowski R. Setforge - synthetic RGB-D training data generation to support CNN-based pose estimation for augmented reality. IEEE ISMAR-Adjunct. 2019;2019:237–42. doi:10.1109/ISMAR-Adjunct.2019.00-39.
[8] Khdier HY, Jasim WM, Aliesawi SA. Deep learning algorithms based voiceprint recognition system in noisy environment. J Phys Conf Ser. 2021;1804:012042. doi:10.1088/1742-6596/1804/1/012042.
[9] Qin Z, Xiushan L. Robot indoor navigation point cloud map generation algorithm based on visual sensing. J Intell Syst. 2023;32(1):20220258. doi:10.1515/jisys-2022-0258.
[10] Zhou L, Kaess M. An efficient and accurate algorithm for the perspective-n-point problem. In: 2019 IEEE/RSJ Int Conf Intell Robots Syst (IROS), Macau, China; 2019. p. 6245–52. doi:10.1109/IROS40897.2019.8968482.
[11] Nenkov J, Galabov M. RANSAC robust estimation algorithm overview. 2015;3.
[12] Jain S, Sunil Kumar BL, Shettigar R. Comparative study on SIFT and SURF face feature descriptors. ICICCT. 2018;5(6):200–5. doi:10.1109/ICICCT.2017.7975187.
[13] Peng S, Zhou X, Liu Y, Lin H, Huang Q, Bao H. PVNet: Pixel-wise voting network for 6DoF object pose estimation. IEEE Trans Pattern Anal Mach Intell. 2022;44(6):3212–23. doi:10.1109/TPAMI.2020.3047388.
[14] Yuan W, Khot T, Held D, Mertz C, Hebert M. PCN: Point completion network. In: 2018 Int Conf on 3D Vision (3DV). Verona, Italy; 2018. p. 728–37. doi:10.1109/3DV.2018.00088.
[15] Nawaf AY, Jasim WM. Human emotion identification based on features extracted using CNN. AIP Conf Proc. 2022;2400(1):020010. doi:10.1063/5.0112131.
[16] Obaid MA, Jasim WM. Pre-convoluted neural networks for fashion classification. Bull EEI. 2021;10(2):750–8. doi:10.11591/eei.v10i2.2750.
[17] Zhang J, Yao Y, Deng B. Fast and robust iterative closest point. IEEE Trans Pattern Anal Mach Intell. 2022;44(7):3450–66. doi:10.1109/TPAMI.2020.3046647.
[18] Vock R, Dieckmann A, Ochmann S, Klein R. Fast template matching and pose estimation in 3D point clouds. Comput Graphics (Pergamon). 2019;79:36–45. doi:10.1016/j.cag.2018.12.007.
[19] Xiang Y, Schmidt T, Narayanan V, Fox D. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. Robot Sci Syst. 2018;1(3). doi:10.15607/RSS.2018.XIV.019.
[20] Wang C, Xu D, Zhu Y, Martin-Martin R, Lu C, Fei-Fei L, et al. DenseFusion: 6D object pose estimation by iterative dense fusion. Comput Sci Comput Vis Pattern Recognit. 2019;2019:3338–47. doi:10.1109/CVPR.2019.00346.
[21] Cao T, Luo F, Fu Y, Zhang W, Zheng S, Xiao C. DGECN: A depth-guided edge convolutional network for end-to-end 6D pose estimation. IEEE/CVF. 2022;4:3783–92. doi:10.1109/CVPR52688.2022.00376.
[22] Periyasamy AS, Capellen C, Schwarz M, Behnke S. ConvPoseCNN2: Prediction and refinement of dense 6D object pose. Commun Comput Inf Sci (CCIS). 2022;1474:353–71. doi:10.1007/978-3-030-94893-1_16.
[23] Lepetit V, Moreno-Noguer F, Fua P. EPnP: An accurate O(n) solution to the PnP problem. Int J Comput Vis. 2009;81(2):155–66. doi:10.1007/s11263-008-0152-6.
[24] Lowe DG. Distinctive image features from scale-invariant keypoints. Int J Comput Vis. 2004;60(2):91–110. doi:10.1023/B:VISI.0000029664.99615.94.
[25] Bay H, Tuytelaars T, Van Gool L. SURF: Speeded up robust features. Eur Conf Comput Vis (ECCV). 2006;2006:404–17. doi:10.1007/11744023_32.
[26] Niu Z, Zhong G, Yu H. A review on the attention mechanism of deep learning. Neurocomputing. 2021;452:48–62. doi:10.1016/j.neucom.2021.03.091.
[27] Lanfei Z, Zhihua C. CRNet: Context feature and refined network for multi-person pose estimation. J Intell Syst. 2022;31(1):780–94. doi:10.1515/jisys-2022-0060.
[28] Hmeed AR, Aliesawi SA, Jasim WM. Deep semantic segmentation for MRI brain tumor. AIP Conf Proc. 2022;2400(1):020023. doi:10.1063/5.0112348.
[29] Archana KV, Komarasamy G. A novel deep learning-based brain tumor detection using the Bagging ensemble with K-nearest neighbor. J Intell Syst. 2023;32(1):20220206. doi:10.1515/jisys-2022-0206.
[30] Huber PJ. Robust estimation of a location parameter. Ann Math Stat. 1964;35(1):73–101. doi:10.1214/aoms/1177703732.
[31] Hinterstoisser S, Lepetit V, Ilic S, Holzer S, Bradski G, Konolige K, et al. Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. Asian Conf Comput Vis (ACCV). 2012;7724:548–62. doi:10.1007/978-3-642-37331-2_42.
[32] Song C, Song J, Huang Q. HybridPose: 6D object pose estimation under hybrid representations. IEEE/CVF Conf Comput Vis Pattern Recognit (CVPR). 2020;2020:428–37. doi:10.1109/CVPR42600.2020.00051.
[33] Tekin B, Sinha SN, Fua P. Real-time seamless single shot 6D object pose prediction. CVPR. 2018;2018:292–301. doi:10.1109/CVPR.2018.00038.
[34] Zakharov S, Shugurov I, Ilic S. DPOD: Dense 6D pose object detector in RGB images. arXiv. 2019;abs/1902.11020. doi:10.1109/ICCV.2019.00203.
[35] Bukschat Y, Vetter M. EfficientPose: An efficient, accurate and scalable end-to-end 6D multi object pose estimation approach. arXiv. 2020;abs/2011.04307.
© 2023 the author(s), published by De Gruyter
This work is licensed under the Creative Commons Attribution 4.0 International License.