1 Introduction
The advent of point cloud data has revolutionized various fields, such as computer vision [
15,
84], autonomous driving [
36,
96], augmented reality [
8,
59,
60], and smart cities [
63] by enabling highly accurate and detailed representation of real-world environments. A point cloud [
92] refers to a collection of 3D data points in space, typically representing the surface geometry or shape of real-world objects or environments. Each data point in a point cloud is defined by its spatial coordinates (
x,
y,
z) and, in some cases, additional attributes such as color or intensity values. Figure
1 illustrates an example of a point cloud representing the shape of a car, composed of thousands of individual points. To showcase the inherent 3D attributes of the point cloud, we present the object from multiple viewing angles. From some angles, the car is easily recognizable, whereas from others it becomes challenging to identify the object as a car. Point clouds are commonly generated using various sensing technologies, including
Light Detection and Ranging (LiDAR) [
24], depth cameras [
18], or structured light scanners [
77], which capture the physical measurements of points in the environment.
Compared to 2D data like images, 3D point clouds have inherent differences and significant advantages. First, 3D point clouds offer a 3D depiction of objects, resulting in higher accuracy and reliability when identifying complex 3D shapes and volumes. Moreover, point cloud data can directly capture the surface details and morphology of objects, making them difficult to replace with images in many practical applications. Consequently, the integration of point cloud processing in safety-critical applications, such as autonomous driving [
19,
96], medical imaging [
87], and industrial automation [
86], has become increasingly prevalent. For instance, 3D point clouds can be utilized for autonomous driving in the context of obstacle detection and perception [
80]. More specifically, 3D point cloud data obtained from LiDAR sensors [
24] can provide a rich and detailed representation of the surrounding environment in 3D space. By leveraging this data, it becomes feasible to identify and localize various objects on the road, such as automobiles, pedestrians, cyclists, and obstacles. Leveraging these 3D data, autonomous driving systems can employ 3D classification models to detect and categorize objects, thereby guiding the avoidance of obstacles. Hence, the accuracy of these 3D classification models plays a pivotal role in ensuring the safety of autonomous driving.
In recent years,
Deep Neural Networks (DNNs) have emerged as a powerful tool for various computer vision tasks [
38,
43], and their application to 3D point cloud data has garnered significant attention. Ensuring the reliability of DNNs operating on point cloud data is crucial for safe and efficient functioning. DNN testing [
39,
90] has become a widely adopted approach to assess and ensure the quality of such networks. Nevertheless, prior investigations [
14,
28,
89] have highlighted a central challenge pertaining to DNN testing: the significant cost incurred in labeling test inputs to verify the accuracy of DNN predictions. First, the scale of the test set is typically extensive. Second, manual labeling is mainstream, typically necessitating the involvement of multiple annotators to ensure the accuracy and consistency of the labeling process for each test input.
The challenges are further compounded in the case of 3D point cloud data. In addition to the aforementioned obstacles, labeling point cloud data presents additional distinctive challenges compared to traditional image/text data:
—
Data representation: Image data is represented as 2D matrices, with each pixel having a distinct position and value. In contrast, point cloud data comprises an unordered set of points, each possessing 3D coordinates and additional attributes such as color and normals. This distinctive data representation significantly increases the complexity of labeling, necessitating additional processing and interpretation steps.
—
Sparsity of point clouds: Point cloud data is generally characterized by sparsity compared to image data. There can be missing points or noise in the point cloud, and the distribution of points is non-uniform. This inherent sparsity poses challenges for accurate labeling.
—
Expert knowledge for 3D point clouds: Labeling 3D point cloud data necessitates domain-specific expertise due to its unique characteristics. With a large number of 3D points, each with its own coordinates and potential attributes, accurately labeling 3D point cloud data requires expert knowledge. This expertise is crucial for understanding and interpreting the geometric attributes, shapes, and potentially semantic information conveyed by the points.
To address the issue of labeling cost in the context of DNNs, previous research efforts [
28] have primarily focused on test prioritization, which aims to prioritize test inputs that are more likely to be misclassified by the model. By allocating resources to label these challenging inputs first, developers can ensure priority for critical test cases, ultimately resulting in reduced overall labeling costs. Existing test prioritization approaches [
28,
89,
90] can be broadly categorized into two main groups: coverage based and confidence based. Coverage-based techniques prioritize test inputs based on the coverage of neurons [
53,
94]. In contrast, confidence-based approaches operate under the assumption that test inputs for which the model exhibits lower confidence are more likely to be misclassified. Notably, confidence-based approaches have been demonstrated to be more effective than coverage-based approaches in the existing studies [
28]. Weiss and Tonella [
90] conducted a comprehensive exploration of diverse test input prioritization techniques, encompassing several confidence-based metrics that can be adapted to 3D point cloud data, such as DeepGini, Vanilla Softmax,
Prediction-Confidence Score (PCS), and Entropy.
Although the confidence-based test prioritization approaches have demonstrated efficacy in specific contexts such as image and text data, they encounter several limitations when applied to 3D point cloud data:
—
Noise in 3D point cloud data: 3D point cloud data can exhibit inherent noise arising from various sources, such as sensor error and non-uniform sampling density. Such noise can undermine the effectiveness of confidence-based approaches. Specifically, in the presence of noise, the model can erroneously assign a high probability to an incorrect category for a given test sample. Confidence-based approaches then incorrectly treat this test as one the model is highly confident about and assume it will not be misclassified, even though the model’s prediction on this sample is in fact wrong (i.e., the test is misclassified).
—
Missing crucial spatial features: Confidence-based methods typically rely on the model’s prediction confidence on test samples. However, in the case of 3D point cloud data, the point cloud exhibits complex spatial characteristics, and relying solely on the confidence feature of the model’s prediction for test prioritization is limited. In other words, confidence-based methods fail to fully leverage the informative features inherent in point cloud data for test prioritization.
In addition to coverage-based and confidence-based techniques, Wang et al. [
89] proposed PRIMA, which leverages mutation analysis and learning-to-rank methodologies for test input prioritization in DNNs. However, although PRIMA has demonstrated effectiveness in DNN test prioritization, it faces challenges when applied to 3D point cloud data for a few reasons. First, the mutation operators utilized in PRIMA are primarily designed for 2D images, text, and pre-defined features. These operators are not directly applicable to 3D point cloud data. In contrast to conventional image or text data, 3D point clouds exhibit a distinctive 3D representation characterized by a substantial quantity of points. Second, even when considering the possibility of utilizing dimensionality reduction techniques to transform 3D data into 2D images and integrating them into PRIMA, practical issues emerge. The execution flow of PRIMA necessitates the mutated 2D images to be fed into the evaluated model for comparing the prediction results between mutants and original inputs. However, the model employed for 3D point clouds is inherently tailored to process 3D data and lacks the capability to classify the mutated 2D images. As a result, even in scenarios where dimensionality reduction tools are accessible, PRIMA remains unsuitable for accommodating 3D point cloud data.
In this article, we propose
PCPrior (3D
Point
Cloud Test
Prioritization), a novel test prioritization approach specifically designed for 3D point cloud test cases. PCPrior leverages the unique characteristics of 3D point clouds to prioritize tests. It is crucial to emphasize that our approach focuses on datasets where each 3D point cloud corresponds to an individual test case. Therefore, each test case is constituted by a collection of points. The core idea behind the PCPrior framework is that test inputs situated closer to the decision boundary of the model are more likely to be predicted incorrectly, which has been proven in the prior research [
57]. PCPrior aims to prioritize such possibly misclassified tests higher.
To reflect the distance between a test (a point cloud) and the decision boundary, we adopt a vectorization approach to map each test to a low-dimensional space, indirectly revealing the proximity between the point cloud data and the decision boundary. Based on this vectorization strategy, we design a diverse set of features to characterize a point cloud test, including spatial features, mutation features, prediction features, and uncertainty features. Notably, spatial and mutation features are specifically designed based on the characteristics of point clouds. Specifically, these features play a pivotal role in capturing essential aspects, including the spatial properties of the point cloud, mutation information present in the input, predictions generated by the DNN model, and the corresponding confidence levels. PCPrior constructs a comprehensive feature vector through the concatenation of these four feature types and leverages a ranking model to learn from it for effective test prioritization.
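To make this workflow concrete, the following is a minimal sketch of the ranking stage, assuming the four feature matrices and the misclassification labels of a training split have already been computed; the function name, signature, and hyperparameters shown here are illustrative rather than the authors' exact implementation.

```python
# Illustrative sketch of the PCPrior ranking stage (not the authors' code).
# Assumes four pre-computed feature matrices per test input: spatial (SF),
# mutation (MF), prediction (PF), and uncertainty (UF) features, plus a
# binary label indicating whether the evaluated DNN misclassified the test.
import numpy as np
from lightgbm import LGBMClassifier

def rank_tests(sf, mf, pf, uf, misclassified, train_idx, test_idx):
    """train_idx/test_idx: integer index arrays. Concatenate the four feature
    types, train a LightGBM ranking model, and order the remaining tests by
    their predicted misclassification score."""
    X = np.hstack([sf, mf, pf, uf])            # comprehensive feature vector
    model = LGBMClassifier(learning_rate=0.1)  # learning rate reported in the paper
    model.fit(X[train_idx], misclassified[train_idx])
    # Use the class-1 probability as a misclassification score instead of the
    # binary output, so tests can be ranked rather than merely classified.
    scores = model.predict_proba(X[test_idx])[:, 1]
    return test_idx[np.argsort(-scores)]       # most likely misclassified first
```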
Compared to existing test prioritization approaches, PCPrior has the following advantages:
—
Tailored for 3D point cloud data: PCPrior is specifically designed to address the challenges of test prioritization for 3D point cloud data. Unlike existing approaches that focus on 2D images or text data, PCPrior leverages the distinctive characteristics of 3D point clouds and provides a more targeted approach for prioritizing tests.
—
Effective utilization of spatial features: PCPrior leverages the spatial features of 3D point clouds, which are essential for understanding the geometric attributes and shapes of objects in the data. Unlike confidence-based approaches that solely rely on prediction confidence, PCPrior incorporates spatial features into the prioritization process. By considering the spatial properties of the point cloud data, PCPrior can effectively capture the informative features necessary for accurate test prioritization.
—
Comprehensive feature generation mechanism: In addition to incorporating spatial characteristics, PCPrior integrates confidence-based features while taking into account mutation and prediction features. By combining these features into a comprehensive feature vector, PCPrior captures a rich set of information that enhances the effectiveness of test prioritization.
PCPrior exhibits broad applicability across diverse domains. As a case in point, in the field of autonomous driving, when testing a 3D shape classification model, sensors are used to collect unlabeled test sets comprising surrounding 3D point clouds. PCPrior can be utilized to identify and prioritize test instances that are more likely to be misclassified by the model. Focusing the labeling effort on these possibly misclassified test inputs reduces both the labeling time and the manual effort involved in the labeling process.
To evaluate the effectiveness of PCPrior, we conduct an extensive experimental evaluation on a diverse set of 165 subjects, encompassing both natural datasets and noisy datasets. We compare PCPrior with several existing test prioritization approaches that have demonstrated effectiveness in prior studies [
28,
90]. The evaluation metrics include the
Average Percentage of Fault Detection (APFD) [
94] and
Percentage of Fault Detected (PFD) [
28], which are standard and widely adopted metrics for test prioritization. The experimental results demonstrate the superiority of PCPrior over existing test prioritization techniques. Specifically, when applied to natural datasets, PCPrior consistently outperforms all of the comparative test prioritization approaches, yielding an improvement ranging from 10.99% to 66.94% in terms of APFD. Moreover, on noisy datasets, the improvement ranges from 16.62% to 53%. We publish our dataset, results, and tools to the community on
GitHub.1 Our work has the following major contributions:
—
Approach: We propose PCPrior, the first test prioritization approach specifically for 3D point cloud data. To this end, we design four types of features that can comprehensively extract information from a 3D point cloud test. We employ effective ranking models to learn from the generated features for test prioritization.
—
Study: We conduct an extensive study based on 165 3D point cloud subjects involving natural and noisy datasets. We compare PCPrior with multiple test prioritization approaches. Our experimental results demonstrate the effectiveness of PCPrior.
—
Performance analysis: We compare the contributions of different types of features to the effectiveness of PCPrior. We also investigate the impact of main parameters in PCPrior.
4 Study Design
In this section, we provide a comprehensive exposition of the details pertaining to our study design. Specifically, Section
4.1 elucidates the research questions that served as the guiding framework for our investigation. Within Sections
4.2 and
4.3, we meticulously present the point cloud subjects and measurement metrics that were employed to assess the effectiveness of PCPrior. Furthermore, Section
4.4 showcases the five DNN test prioritization methods that were employed as comparative approaches against PCPrior. In Section
4.5, we elucidate the design and characteristics of PCPrior variants. Additionally, Section
4.6 exhibits the implementation and configuration setup that were utilized in our study.
4.1 Research Questions
Our experimental evaluation answers the following research questions:
—
RQ1: How does PCPrior perform in prioritizing test inputs for 3D point clouds?
In contrast to existing test prioritization methodologies, our proposed approach, PCPrior, leverages the unique characteristics of point clouds for test prioritization. In this research question, we evaluate the effectiveness of PCPrior by comparing it with existing test prioritization approaches that have been demonstrated effective in prior studies [
28,
90] and random selection (baseline).
—
RQ2: How do different ranking models affect the effectiveness of PCPrior?
In the original implementation of PCPrior, the LightGBM ranking algorithm [
41] was employed to leverage the generated features of test inputs for test prioritization. In this research question, we explore the utilization of alternative ranking algorithms, namely Logistic Regression [
81], XGBoost [
17], and Random Forest [
9], with the objective of examining the influence of ranking models on the effectiveness of PCPrior. To this end, we design a set of variants for PCPrior, each incorporating one of the aforementioned ranking models, while maintaining consistency with the remaining workflow.
—
RQ3: How does the selection of main parameters of PCPrior affect its effectiveness?
We conducted an in-depth investigation of the main parameters in PCPrior, with the aim of evaluating whether PCPrior can consistently outperform the compared test prioritization approaches when these parameters undergo modifications.
—
RQ4: How do PCPrior and its variants perform on noisy 3D point clouds?
In addition to assessing PCPrior and its variants on natural datasets, we undertake an evaluation that encompasses noisy 3D point clouds, thereby facilitating an in-depth examination of their effectiveness.
—
RQ5: To what extent does each type of feature contribute to the effectiveness of PCPrior?
In PCPrior, we generate four different types of features from each test input for test prioritization, namely spatial features, mutation features, prediction features, and uncertainty features, as elaborated in Section
3. In this research question, we compare the contributions of different types of features on the effectiveness of PCPrior.
—
RQ6: Can PCPrior and uncertainty-based methods be employed to guide the retraining process for enhancing a 3D shape classification model?
Faced with a substantial volume of unlabeled inputs and a constrained time budget, manually labeling all inputs for retraining a 3D shape classification model becomes impractical. Active learning is acknowledged as a practical solution for reducing data labeling costs [
71]. This approach focuses on selecting an informative subset of samples to retrain the model, aiming to improve model performance with minimal labeling costs. In this research question, we investigate the effectiveness of PCPrior and uncertainty-based metrics in selecting informative retraining inputs to improve the performance of 3D shape classification models.
4.2 Models and Datasets
The effectiveness of PCPrior and the compared test prioritization approaches [
28,
90] was evaluated using a set of 165 subjects. Essential details regarding these subjects are presented in Table
1, which highlights the matching relationship between the point cloud dataset and the DNN models. In particular, the “# Size” column indicates the size of the dataset, whereas the “Type” column denotes the type of the dataset, with “Original” representing natural data and “Noisy” indicating noisy data.
Among the 165 subjects, 15 subjects (3 point cloud datasets
\(\times\) 5 models) were generated using natural datasets, whereas the remaining 150 subjects were generated using noisy datasets. To generate a noisy dataset from the original test set
\(T\), each test instance
\(t \in T\) undergoes a modification. Specifically, within each test instance
\(t\) (a point cloud), approximately 30% of the points undergo a random offset, whereas the remaining 70% of the points remain unchanged. The 30% ratio is derived from the reasonable range of noise injection proportions provided in the existing work [
2]. The 150 subjects derived from noisy data were obtained as follows. For each original dataset, we generated 10 noisy datasets, resulting in a total of 30 noisy datasets. Each noisy dataset was paired with five different models, resulting in a total of 150 subjects (30 datasets
\(\times\) 5 models).
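As a rough illustration of this construction, the sketch below perturbs about 30% of the points in a single point cloud with a random coordinate offset; the offset magnitude (scale) is an assumed value, since the paper specifies only the 30% ratio.

```python
# Minimal sketch of the noise-injection step described above (illustrative,
# not the authors' script): ~30% of the points in a point cloud receive a
# random offset on their (x, y, z) coordinates; the rest stay unchanged.
import numpy as np

def add_noise(point_cloud, ratio=0.3, scale=0.02, rng=None):
    """point_cloud: (N, 3) array of xyz coordinates. The offset scale is an
    assumed value; the paper only specifies the 30% ratio."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = point_cloud.copy()
    n = point_cloud.shape[0]
    idx = rng.choice(n, size=int(ratio * n), replace=False)   # ~30% of the points
    noisy[idx] += rng.normal(0.0, scale, size=(len(idx), 3))  # random xyz offset
    return noisy
```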
Next, we present the description of the 3D point cloud datasets and DNN models utilized in our study.
4.2.1 Datasets.
In our research, we employed three prominent point cloud datasets, namely ModelNet40 [
93], ShapeNet [
10], and S3DIS [
4]. These datasets are widely adopted within the academic community and have consistently served as benchmarks for several state-of-the-art point cloud studies [
30,
35,
52]:
—
ModelNet40 [
93]
: ModelNet40 consists of 12,311 point clouds in 40 categories (e.g., airplane, car, plant, lamp). It encompasses synthetic object point clouds and stands as a paramount benchmark for point cloud analysis. Renowned for its diverse range of categories, meticulous geometric shapes, and methodical dataset construction, ModelNet40 has garnered significant popularity in the research community [
30].
—
ShapeNet [
10]
: ShapeNet dataset is a widely recognized and extensively used benchmark in the field of 3D shape classification. The ShapeNet dataset utilized in our study consists of 50 categories and a total of 53,107 samples. These categories include chairs, tables, cars, airplanes, animals, and so on.
—
Stanford Large-Scale 3D Indoor Spaces dataset [
4]
: The S3DIS dataset is widely recognized for its comprehensive representation of diverse indoor environments, encompassing various real-world scenes encountered in indoor settings. The S3DIS dataset utilized in our study consists of 9,813 samples, classified into 13 categories (e.g., office, meeting room, and open space).
4.2.2 Models.
In our research, we utilized five widely used 3D point cloud classification models. In the following, we provide a detailed description of each model.
—
PointConv [
92]
: PointConv is a CNN operator specifically designed for processing 3D point clouds characterized by non-uniform sampling. By training MLPs using local point coordinates, PointConv approximates continuous weight and density functions within convolutional filters. In this way, deep convolutional networks can be directly constructed on 3D point clouds, enabling efficient and effective analysis and processing.
—
Dynamic graph convolutional neural network [
88]
: DGCNN is a DL architecture specifically designed for processing and analyzing 3D point cloud data. The key idea behind DGCNN is to exploit the intrinsic spatial relationships present in point clouds by modeling them as graphs. By leveraging graph convolutions and dynamically adapting the graph structure based on the input data, DGCNN can effectively learn and process point cloud representations, making it suitable for point cloud classification tasks.
—
PointNet [
72]
: PointNet is a widely adopted DL architecture specifically tailored for 3D point cloud data. The architecture includes a shared MLP with max-pooling to extract local features from individual points and a symmetric function to aggregate the global features across all points. By employing T-Net layers, PointNet is able to learn transformation matrices that aid in aligning and transforming input point clouds, enhancing the model’s robustness to input variations. PointNet has demonstrated impressive capabilities in 3D shape classification tasks, establishing it as an effective approach for point cloud analysis. A minimal sketch of a PointNet-style classifier is given after this list.
—
Multi-scale grouping [
73]
: The
Multi-Scale Grouping (MSG) approach involves sampling representative points and grouping nearby points within a specified radius. This allows for the extraction of local features at multiple scales, enabling hierarchical feature learning from point sets.
—
Single-scale grouping [
73]
: Single-Scale Grouping (SSG) denotes a simplified variant of the MSG architecture. The essence of SSG lies in the partitioning of a point cloud into local regions of fixed size while disregarding the consideration of multiple scales. Within each region, a representative subset of points is sampled, and proximate points falling within a pre-defined radius are grouped together. This approach facilitates local feature extraction while avoiding the intricacies associated with handling diverse scales.
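To illustrate the kind of architecture described for PointNet above, the following is a highly simplified PyTorch sketch of a PointNet-style classifier; it keeps only the shared point-wise MLP and the symmetric max-pooling aggregation and omits the T-Net alignment modules and other details of the original model.

```python
# Highly simplified PointNet-style classifier (illustrative only).
import torch
import torch.nn as nn

class TinyPointNet(nn.Module):
    def __init__(self, num_classes=40):
        super().__init__()
        # Shared MLP applied to every point independently (1x1 convolutions).
        self.shared_mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, 1024, 1), nn.ReLU(),
        )
        self.classifier = nn.Sequential(
            nn.Linear(1024, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, points):            # points: (batch, 3, num_points)
        per_point = self.shared_mlp(points)
        global_feat = torch.max(per_point, dim=2).values  # symmetric max-pooling
        return self.classifier(global_feat)               # class logits
```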
4.3 Measurements
The goal of PCPrior is to prioritize the possibly misclassified test inputs in the context of 3D point cloud data. Thus, following the existing work [
28], we adopted APFD and PFD to measure the effectiveness of PCPrior, the compared approaches, and the variants of PCPrior:
—
Average percentage of fault detection: APFD [
94] is a widely recognized metric for assessing the effectiveness of prioritization techniques. A higher APFD value indicates a quicker rate of detecting misclassifications. The calculation of APFD values is based on Formula (
17):
\(\mathit{APFD} = 1 - \frac{\sum_{i=1}^{k} o_i}{k \times n} + \frac{1}{2n}\),
where
\(n\) denotes the total number of test inputs, and the variable
\(k\) represents the number of test inputs in
\(T\) that will be incorrectly predicted by the model. The index \(o_i\) is an integer denoting the position of the i-th misclassified test within the prioritized test set.
Based on the existing study [
28], we normalize the APFD values to [0,1]. A prioritization approach is considered better when the APFD value is closer to 1. This is because a larger APFD value corresponds to a smaller value of
\(\sum _{i=1}^k o_i\). Here,
\(\sum _{i=1}^k o_i\) represents the total index sum of misclassified tests within the prioritized list. A smaller
\(\sum _{i=1}^k o_i\) implies that the evaluated test prioritization method assigns higher priority to misclassified tests, positioning them at the front of the ranked test set. This effective detection of misclassified tests demonstrates the efficacy of the test prioritization approach. Therefore, a larger APFD value serves as an indicator of better effectiveness for test prioritization strategies.
—
Percentage of fault detected: PFD refers to the proportion of detected misclassified test inputs among all misclassified tests. Higher PFD values indicate better test prioritization effectiveness. PFD is calculated based on Formula (
18):
\(\mathit{PFD} = \frac{\#N_d}{\#N}\),
where
\(\#N_d\) is the number of misclassified test inputs that have been detected.
\(\#N\) denotes the total number of misclassified tests. In our study, we evaluated the PFD of PCPrior and the compared test prioritization approaches against different ratios of prioritized tests. We use PFD-n to denote the PFD obtained when the first n% of the prioritized test inputs are examined. A small computation sketch for both metrics is given after this list.
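For reference, the following is a small sketch showing how both metrics can be computed from a prioritized test order, assuming a Boolean array that marks, for each test in prioritized order, whether the model misclassifies it; the function names are ours.

```python
import numpy as np

def apfd(misclassified_in_order):
    """misclassified_in_order: Boolean array over the prioritized test order.
    Positions are 1-based, following the o_i notation above; assumes at least
    one misclassified test."""
    n = len(misclassified_in_order)
    positions = np.flatnonzero(misclassified_in_order) + 1  # o_1, ..., o_k
    k = len(positions)
    return 1.0 - positions.sum() / (k * n) + 1.0 / (2 * n)

def pfd(misclassified_in_order, ratio):
    """Fraction of all misclassified tests found in the first `ratio` of the order."""
    n = len(misclassified_in_order)
    cutoff = int(np.ceil(ratio * n))
    detected = int(np.sum(misclassified_in_order[:cutoff]))
    total = int(np.sum(misclassified_in_order))
    return detected / total
```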
4.4 Compared Approaches
This study employed five comparative approaches, which included a baseline approach (random selection) and four DNN test prioritization techniques. The selection of these methods was driven by multiple factors. First, these approaches can be adapted for test prioritization in the context of 3D point cloud data. Second, these approaches were proposed within the DL testing community and have been previously demonstrated as effective for DNNs. Third, these approaches provide open source implementations. The approaches are as follows:
—
Random selection [
26]
: Random selection is the baseline in our study. Random selection involves the randomized determination of the execution order for test inputs. This means that the sequencing of test inputs is established in a completely arbitrary manner, devoid of any pre-determined patterns or logical arrangements.
—
DeepGini [
28]
: DeepGini utilizes the Gini coefficient, which is a statistical metric used to assess the probability of misclassification, to facilitate the ranking of test inputs. The Gini score is calculated according to Formula (
19), which is presented next:
\(G(t) = 1 - \sum_{i=1}^{N} p_i(t)^2\),
where
\(G(t)\) represents the probability of the test input
\(t\) being misclassified.
\(p_i(t)\) denotes the probability that the test input
\(t\) is predicted to belong to label
\(i\).
\(N\) represents the total number of categories to which the input can be assigned. A short sketch computing the confidence-based scores in this section from softmax outputs is given after this list.
—
Prediction-confidence score: PCS [
90] assigns rankings to test inputs based on the difference between the predicted class and the second most confident class in the softmax likelihood. A smaller difference indicates that the model is less certain about the prediction for a particular test input. These uncertain tests are given higher priority and are placed at the front of the test set. The calculation of this difference is defined by Formula (
20) as follows:
\(\mathit{PCS}(x) = l_{k}(x) - l_{j}(x)\),
where
\(l_{k}(x)\) refers to the most confident prediction probability.
\(l_{j}(x)\) refers to the second most confident prediction probability.
—
Vanilla Softmax [
90]
: Vanilla Softmax measures the difference between the maximum activation probability in the output softmax layer and the ideal value of 1 for each test input. This disparity reflects the degree of uncertainty associated with the model’s predictions. Test inputs with larger disparities are considered more likely to be misclassified by the model. The specific computation of this disparity is illustrated by Formula (
21):
\(1 - \max_{c} l_{c}(x)\),
where
\(l_c(x)\) belongs to a valid softmax array in which all values are between 0 and 1, and their sum is 1.
—
Entropy [
90]
: Entropy serves as a criterion for ranking test inputs based on the entropy of their softmax likelihood. Higher entropy values indicate greater uncertainty in the model’s predictions for those inputs. Consequently, test inputs with higher entropy are considered more likely to be misclassified by the model. As a result, they are given higher priority and placed at the beginning of the test set.
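The following sketch illustrates how these four confidence-based scores can be computed from a model's softmax outputs (a higher score means the test is considered more likely to be misclassified, so tests are ranked in descending order of the score); the helper names are ours.

```python
import numpy as np

def confidence_scores(softmax):
    """softmax: (num_tests, num_classes) array of softmax probabilities.
    Returns one 'likely misclassified' score per test for each metric."""
    sorted_probs = np.sort(softmax, axis=1)[:, ::-1]           # descending per row
    deepgini = 1.0 - np.sum(softmax ** 2, axis=1)              # Gini impurity
    pcs = -(sorted_probs[:, 0] - sorted_probs[:, 1])           # smaller gap -> higher score
    vanilla_sm = 1.0 - sorted_probs[:, 0]                      # distance of top prob from 1
    entropy = -np.sum(softmax * np.log(softmax + 1e-12), axis=1)
    return {"DeepGini": deepgini, "PCS": pcs, "VanillaSM": vanilla_sm, "Entropy": entropy}

# Tests are then prioritized by sorting a score in descending order, e.g.:
# order = np.argsort(-confidence_scores(softmax)["DeepGini"])
```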
4.5 Variants of PCPrior
We conducted an investigation into the influence of different ranking models on the effectiveness of PCPrior. To this end, we proposed five variants of PCPrior, namely
\(\text{PCPrior}^L\),
\(\text{PCPrior}^X\),
\(\text{PCPrior}^R\),
\(\text{PCPrior}^D\), and
\(\text{PCPrior}^T\), which utilize Logistic Regression [
81], XGBoost [
17], Random Forest [
9], DNNs [
82], and TabNet [
3] as the ranking model, respectively. It is essential to emphasize that apart from the variation in ranking models, the execution workflow of these derived variants remains identical to that of the original PCPrior approach.
Furthermore, we extended the modifications applied to the LightGBM ranking model of PCPrior to the ranking models employed by the variants of PCPrior. Specifically, instead of making the ranking models provide a binary classification output (i.e., indicating whether the test will be predicted incorrectly by the model), we extract the intermediate output, which can indicate the probability of misclassification for each test input. Consequently, we obtain a misclassification score for each test input, which can be effectively utilized for test prioritization. Next, we provide a comprehensive explanation of the specific ranking models utilized in each variant of PCPrior:
—
\({PCPrior}^L\!\) : In the context of
\({PCPrior}^L\), we employ the Logistic Regression algorithm [
61] as the ranking model. Logistic Regression is a statistical modeling technique that employs a logistic function to establish the relationship between a categorical dependent variable and one or more independent variables.
—
\({PCPrior}^X\!\) : In the context of
\({PCPrior}^X\), we utilize the XGBoost ranking algorithm [
17] to estimate the misclassification score of a test input based on its corresponding feature vector. XGBoost is a powerful gradient boosting technique that integrates decision trees to enhance prediction accuracy. By leveraging its ensemble learning capabilities, XGBoost effectively captures complex relationships within the data, enabling accurate estimation of the likelihood of misclassification for each test input.
—
\({PCPrior}^R\!\) : In the context of
\({PCPrior}^R\), we employ the Random Forest algorithm [
9] as the ranking model. Random Forest is an ensemble learning algorithm that constructs multiple decision trees. The predictions from individual trees are combined using averaging or voting mechanisms to produce the final prediction. Random Forest is known for its ability to handle high-dimensional data and capture intricate interactions among features. By leveraging these strengths,
\(\text{PCPrior}^R\) accurately estimates the misclassification score for each test input, aiding in effective test prioritization.
—
\({PCPrior}^D\!\) : In the context of
\({PCPrior}^D\), we utilize a DNN model as the ranking model, derived from a prior investigation [
82]. This DNN model is capable of producing a misclassification score for a given test input, relying on its feature vector generated by PCPrior.
—
\({PCPrior}^T\!\) : In the context of
\({PCPrior}^T\), we utilize TabNet [
3] as the ranking model. TabNet is a DNN architecture specifically designed for tabular data. It has been demonstrated to be more effective than XGBoost and LightGBM in a previous study [
3].
4.6 Implementation and Configuration
We implemented PCPrior in Python, utilizing the PyTorch 2.0.0 framework [
68]. To enable comparison with other approaches, we integrated existing implementations of the compared methods [
28,
90] into our experimental pipeline, specifically tailored for test prioritization of 3D point cloud data. To generate mutation features, we created 30 mutants for each test sample. Regarding the configuration of the ranking models employed in PCPrior, we utilized XGBoost 1.7.4, LightGBM 3.3.5, and scikit-learn 1.0.2 frameworks. Furthermore, we made specific parameter selections: for LightGBM, the learning rate was set to 0.1; for Logistic Regression, the parameter
\(max\_iter\) was set to 100; for XGBoost, the learning rate was set to 0.3; and for the Random Forest algorithm, the number of estimators was set to 100. Our experimental setup involved conducting experiments on NVIDIA Tesla V100 32GB GPUs. For the data analysis, we utilized a MacBook Pro laptop running Mac OS Big Sur 11.6, equipped with an Intel Core i9 CPU and 64 GB of RAM. In total, we conducted experiments on 165 subjects, consisting of 15 subjects based on natural inputs and 150 subjects based on noisy inputs.
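As a sketch, the parameter settings listed above could be expressed as follows; leaving all other hyperparameters at their library defaults is an assumption on our part.

```python
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Ranking models with the parameter values reported in Section 4.6;
# remaining hyperparameters are assumed to stay at the library defaults.
ranking_models = {
    "LightGBM": LGBMClassifier(learning_rate=0.1),
    "XGBoost": XGBClassifier(learning_rate=0.3),
    "LogisticRegression": LogisticRegression(max_iter=100),
    "RandomForest": RandomForestClassifier(n_estimators=100),
}
```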
5 Results and Analysis
5.1 RQ1: Performance of PCPrior
Objective. We investigate the effectiveness and efficiency of PCPrior, comparing it with several existing test prioritization approaches.
Experimental Design. We conducted experiments to evaluate the performance of PCPrior from the following three aspects:
—
Effectiveness evaluation on natural datasets: We employed a set of 15 subjects constructed from 3D point cloud datasets to evaluate the effectiveness of PCPrior. Table
1 presents the basic information of these subjects. To assess the performance of PCPrior, we carefully selected four test prioritization approaches, namely DeepGini, Vanilla SM, PCS, and Entropy, alongside a baseline method (i.e., random selection), for comparative analysis. Moreover, we utilized two measurement metrics, specifically the APFD and PFD, to evaluate the effectiveness of PCPrior and the compared approaches. A detailed explanation of the calculations for these metrics can be found in Section
4.3.
—
Statistical analysis: Due to the inherent randomness in the model training process, we performed statistical analysis by conducting the experiments 10 times. Specifically, for each subject, which refers to a point cloud dataset paired with a DNN model, we generated 10 distinct models through separate training processes. The average results are reported in our experimental findings. Moreover, for each subject, we calculated the variance of 10 repeated experimental results for each test prioritization method to demonstrate the stability of PCPrior better.
To further validate the stability and reliability of the experimental findings, we calculated
p-values associated with the results. Specifically, we employed the
paired two-sample t-test [
46] to calculate the
p-value, a commonly used statistical method for evaluating differences between two related datasets. The essential steps involved are (1) selecting two related sets of data, (2) computing the difference for each corresponding pair of data points, and (3) analyzing these differences to ascertain if there is a statistically significant disparity between the two datasets. In the paired two-sample
t-test approach, the significance of the results is determined by the
p-value. Generally, if the
p-value is less than
\(10^{-05}\), it is considered that the difference between the two sets of data is statistically significant [
58]. Additionally, we quantify the magnitude of the difference between the two sets of results through the
effect size. Specifically, we use Cohen’s
\(d\) for measuring the effect size [
42], wherein,
\(|d|\lt 0.2\) is “negligible,”
\(|d|\lt 0.5\) is “small,”
\(|d|\lt 0.8\) is “medium,” and otherwise is “large.” To ensure that the difference between the results of PCPrior and the compared approach is “non-negligible,” we require that the value of
\(d\) is greater than or equal to 0.2. A brief sketch of this statistical procedure (paired t-test and Cohen’s \(d\)) is given after this list.
—
Efficiency evaluation: In addition to evaluating the effectiveness of PCPrior, we conducted an assessment of its efficiency and compared it with the selected test prioritization methods. Specifically, we quantified the time required for each step of PCPrior to measure its efficiency. By analyzing the execution time of PCPrior, we aim to gain insights into its computational efficiency and its potential for practical application in real-world scenarios.
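As an illustration of the statistical procedure above, the sketch below applies the paired two-sample t-test and computes Cohen's d for two sets of paired APFD results; the pooled-standard-deviation form of Cohen's d shown here is one common variant and is an assumption about the exact formula used.

```python
import numpy as np
from scipy.stats import ttest_rel

def compare_results(apfd_a, apfd_b):
    """apfd_a, apfd_b: paired APFD results of two approaches over repeated runs."""
    a, b = np.asarray(apfd_a), np.asarray(apfd_b)
    t_stat, p_value = ttest_rel(a, b)                     # paired two-sample t-test
    pooled_sd = np.sqrt((a.std(ddof=1) ** 2 + b.std(ddof=1) ** 2) / 2.0)
    cohens_d = (a.mean() - b.mean()) / pooled_sd          # effect size (Cohen's d)
    significant = p_value < 1e-5 and abs(cohens_d) >= 0.2
    return p_value, cohens_d, significant
```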
Results. The experimental findings pertaining to RQ1 are presented in Tables 2 through 7 and Figure
3. Table
2 and Table
3 offer a comparative analysis, employing the APFD metric, between PCPrior and the comparative methods. Conversely, Table
6 and Figure
3 provide an assessment of effectiveness using the PFD metric. It is important to note that we highlight the approach with the highest effectiveness for each case in gray. Additionally, Table
7 offers a comparison of the efficiency between PCPrior and the evaluated test prioritization approaches.
Notably, Table
2 reveals that across all 15 subjects, PCPrior consistently outperforms all comparative methods in terms of its APFD. Specifically, the range of APFD values for PCPrior spans from 0.781 to 0.905, whereas the range for the comparative methods lies between 0.495 and 0.853. Moreover, Table
3 further highlights the average APFD value for PCPrior and its relative improvement compared to the comparative methods. We see that PCPrior achieves an average APFD of 0.836, whereas the average APFD of the comparative methods falls within the range of 0.501 to 0.754. The improvement observed in PCPrior, relative to the comparative methods, ranges from 10.99% to 66.94%. These findings demonstrate that PCPrior performs better than all of the comparative test prioritization methods in terms of the APFD metric.
The comparative analysis presented in Table
6 employs the PFD metric to exhibit the comparison between PCPrior and various DNN test prioritization methods. Notably, from prioritizing 10% to 70% of the dataset, PCPrior consistently outperforms all comparative methods in terms of PFD. To facilitate a more intuitive comparison, Figure
3 showcases two line graphs with PFD as the
y-axis, illustrating the cases of the ModelNet dataset with the DGCNN model and the ShapeNet dataset with the PointNet model, respectively. All of the results can be found on our
GitHub.2 In the figures, PCPrior is depicted by the red lines, whereas the baseline is represented by the pink lines. Visual analysis reveals that PCPrior consistently achieves a higher PFD value when contrasted with DeepGini, Entropy, Vanilla SM, PCS, and Random methods. These experimental results demonstrate that PCPrior outperforms all comparative test prioritization methods in terms of the PFD metric.
As stated in the experimental design, a statistical analysis was conducted to ensure the stability of our findings. To this end, all experiments were repeated 10 times for each subject. The statistical analysis reveals a
p-value lower than
\(10^{-05}\), providing strong evidence that PCPrior consistently outperforms the compared approaches in the context of test prioritization. Table
4 presents detailed results from the statistical analysis. The analysis employs two primary metrics:
p-value and effect size. As outlined in the experimental design, a
p-value less than
\(10^{-05}\) indicates that the difference between two datasets is statistically significant [
58]. Furthermore, an effect size
\(\ge\) 0.2 suggests that the difference is “non-negligible.” In Table
4, we observed that all of the
p-values between PCPrior and the compared approaches consistently fall below
\(10^{-05}\), indicating that PCPrior statistically outperforms all of the compared test prioritization methods. For example, the
p-value for the difference in experimental results between PCPrior and DeepGini is
\(2.039 \times 10^{-07}\). The
p-value between PCPrior and VanillaSM is
\(4.403 \times 10^{-07}\). Additionally, the experimental results for both PCPrior and the compared approaches show effect sizes exceeding 0.2, confirming a non-negligible difference. Moreover, we found that all of the effect sizes are even greater than 0.8. For example, the effect size of PCPrior and VanillaSM is 2.273. According to Cohen’s
\(d\) [
42], this means that the difference in experimental results between PCPrior and the compared methods is not only statistically significant but also relatively “large” in scale.
Moreover, for each case, we calculated the variance of 10 repeated experimental results with respect to each test prioritization method, as presented in Table
5. It is important to note that the unit for the table is
\(10^{-3}\). For instance, in the second row, the first number, 0.026, represents that for the ModelNet dataset, under the DGCNN model, the variance of 10 repeated experimental results for the DeepGini method is
\(0.026 \times 10^{-3}\). The cases highlighted in gray represent the test prioritization method with the minimum variance for each subject. We see that for 66.7% (10 out of 15) of subjects, PCPrior has the smallest variance. Furthermore, the variance range for PCPrior is
\(0.001 \times 10^{-3}\) to
\(0.375 \times 10^{-3}\). In contrast, the variance range for comparative methods is
\(0.008 \times 10^{-3}\) to
\(0.430 \times 10^{-3}\). The preceding experimental results indicate that the variance of PCPrior’s results is generally lower compared to the comparative test prioritization methods, suggesting that PCPrior is relatively more stable.
Table
7 provides a comprehensive comparison of the efficiency between PCPrior and the compared test prioritization approaches. A noteworthy distinction between our proposed method and the comparative approaches pertains to the requirement of training a ranking model and generating features. As can be observed from Table
7, the overall time taken by PCPrior is approximately 6 minutes and 32 seconds. Specifically, the average training time for the PCPrior ranking model amounts to 32 seconds, whereas the average time for feature generation is 6 minutes. The final prediction time of the compared approaches is less than 1 second. Although PCPrior is not as efficient as confidence-based test prioritization approaches, the effectiveness improvement of PCPrior relative to confidence-based methods is 10.99% to 13.38%. Considering the tradeoff between effectiveness and efficiency, PCPrior remains a practical option.
5.2 RQ2: Influence of Ranking Models
Objective. We investigate the impact of various ranking models on the effectiveness of PCPrior.
Experimental Design. We proposed five variants of PCPrior that incorporate different ranking models. In addition to the ranking models, the other procedures of these methods remain identical to PCPrior. The five variants are
\(\text{PCPrior}^L\),
\(\text{PCPrior}^X\),
\(\text{PCPrior}^R\),
\(\text{PCPrior}^D\), and
\(\text{PCPrior}^T\), which utilize Logistic Regression [
81], XGBoost [
17], Random Forest [
9], DNNs [
82], and TabNet [
3] as the ranking model, respectively. We evaluated the impact of these ranking models on the effectiveness of PCPrior by assessing the performance of these variants on natural datasets utilizing both the APFD and PFD metrics.
Results. The experimental results for RQ2 are presented in Tables 8 and 9. Table
8 showcases the comparison between PCPrior and its variants in terms of the APFD metric, whereas Table
9 presents their comparison based on the PFD metric.
In Table
8, we see that PCPrior, which employs LightGBM as the ranking model, performs the best in 66.67% (10 out of 15) of the cases.
\(\text{PCPrior}^X\), which utilizes XGBoost as the ranking model, performs the best in the remaining 33.3% (5 out of 15) cases. Furthermore, Table
9 presents a comparison of the effectiveness of PCPrior and its variants from the perspective of the PFD metric. We see that PCPrior performs the best in 61.9% (13 out of 21) cases, whereas
\(\text{PCPrior}^X\) performs the best in 38.1% (8 out of 21) of the cases. The aforementioned experimental results illustrate that the ranking models employed by PCPrior and
\(\text{PCPrior}^X\), specifically LightGBM and XGBoost, can better utilize the generated test input features for test prioritization.
Surprisingly, despite existing studies [
3] reporting that TabNet is more effective than XGBoost and LightGBM on their evaluated datasets, when applied within PCPrior for test prioritization, PCPrior (which utilizes the LightGBM model) is more effective than
\(\text{PCPrior}^T\) (which utilizes the TabNet model). In Table 8, we can see that PCPrior’s APFD ranges from 0.781 to 0.905, whereas
\(\text{PCPrior}^T\)’s APFD ranges from 0.701 to 0.894. This suggests that, compared to TabNet, LightGBM performs better in leveraging the features (generated by PCPrior) for test prioritization. Some potential reasons include the following: (1) different datasets and their distributions can impact the training of classification models, thereby affecting their performance, and (2) the size of the dataset can also influence the model’s performance. The experimental results demonstrate that LightGBM is more suitable and compatible with the feature dataset constructed by PCPrior.
5.3 RQ3: Impact of Main Parameters in PCPrior
Objective. We investigate the impact of main parameters on the effectiveness of PCPrior for test prioritization.
Experimental Design. Building upon the parameter selection and consideration of parameter values in previous research [
89], we conducted a systematic investigation to analyze the impact of key parameters in PCPrior. Specifically, we focused on three parameters:
\(max\_depth\) (representing the maximum tree depth for each LightGBM model),
\(colsample\_bytree\) (indicating the sampling ratio of feature columns when constructing each tree), and
\(learning\_rate\) (referring to the boosting learning rate) in the LightGBM ranking algorithm. For our investigation, we performed experiments on all subjects within the natural dataset. By observing the performance variations of PCPrior as these parameters changed, we aimed to gain insights into the influence of parameters on the effectiveness of PCPrior.
Results. The experimental results of RQ3 are presented in Figure
4, showcasing the effectiveness of PCPrior under diverse parameter settings based on average APFD values across the 15 subjects. The solid red line represents PCPrior, whereas the dashed lines depict the comparative methods. The findings demonstrate that PCPrior consistently outperforms all of the test prioritization methods across various parameter configurations, as evident from the visual analysis of Figure
4. Furthermore, it can be observed that the parameter
\(colsample\_bytree\), which determines the sampling ratio of feature columns during the construction of each tree, has a relatively modest impact on the effectiveness of PCPrior. PCPrior exhibits relative stability when this parameter is adjusted. Conversely, the parameters
\(max\_depth\) (representing the maximum tree depth for each LightGBM model) and
\(learning\_rate\) (referring to the boosting learning rate) have a relatively larger influence on the effectiveness of PCPrior. Remarkably, regardless of the extent to which the parameters influence PCPrior’s effectiveness, we see that PCPrior can consistently outperform all of the compared methods across different parameter settings.
5.4 RQ4: Effectiveness on Noisy Test Inputs
Objective. We further investigate the effectiveness of PCPrior and its variants on noisy data.
Experimental Design. In the initial phase, we introduce noise to the original 3D point cloud datasets, namely ModelNet40, ShapeNet, and S3DIS, to create noisy data. To generate a noisy dataset from an initial test set denoted as
\(T\), each test instance
\(t \in T\) undergoes a specific modification. Specifically, within each test instance
\(t\) (a point cloud), approximately 30% of the points are subjected to a random offset in the
x,
y, and
z coordinates, whereas the remaining 70% of the points remain unaltered. The decision to select 30% of the points in a point cloud for displacement is because if a large number of the points were to be shifted, it would lead to a significant number of tests being misclassified by the original model. In such a scenario, all test prioritization methods could identify a large number of misclassified tests. This, in turn, could affect the evaluation of PCPrior. Therefore, we opted to carefully select the modification ratio that is not excessively high for the evaluation of PCPrior. As a result, we generate 10 noisy datasets for each original dataset, resulting in a total of 30 (3
\(\times\) 10) noisy datasets. Each of these noisy datasets is paired with five different models, resulting in a total of 150 (30
\(\times\) 5) subjects. Finally, we compared the effectiveness of PCPrior, its variants, and all of the comparative test prioritization approaches on the generated 150 noisy subjects. On the generated noise subjects, we assessed the effectiveness of PCPrior, the confidence-based test prioritization methods, along with PCPrior variants that employed Logistic Regression [
81], XGBoost [
17], Random Forest [
9], DNNs [
82], and TabNet [
3] as ranking models, respectively. We also included random selection as a baseline for comparison.
Statistical Analysis. Similar to RQ1, due to the inherent randomness in the model training process, we performed the experiments 10 times and conducted a statistical analysis. As in RQ1, the statistical analysis method we used is the paired two-sample
t-test [
46]. We calculated the
p-value and effect size for the experimental results. We consider that if the
p-value is less than
\(10^{-05}\), the difference between the two sets of data is statistically significant [
58]. Moreover, to ensure that the difference between the results of PCPrior and the compared approach is non-negligible, the effect size should be greater than or equal to 0.2.
Results. The experimental results for RQ4 are presented in Tables 10 through 14 and Figure
5. Specifically, Table
10 and Table
11 provide a comparative analysis of the effectiveness of PCPrior (including its variants) and various test prioritization methods in the context of noisy data using the APFD metric, and Table
13 and Table
14 present the comparative evaluation based on the PFD metric.
Table
10 shows the comparison results of PCPrior, its variants, and comparative methods on noisy test inputs in terms of APFD. We found that the effectiveness of PCPrior and its variants surpasses that of all compared test prioritization methods in each case. Specifically, the APFD values for PCPrior range from 0.665 to 0.862. For PCPrior’s variants, the APFD values range from 0.568 to 0.864. For the compared test prioritization methods, the APFD values range from 0.499 to 0.762. Furthermore, Table
11 provides a more detailed analysis by presenting the number of cases in which each test prioritization method performs the best, the average APFD value, and the improvement of PCPrior relative to each comparative method. We see that, on noisy test inputs, PCPrior’s average APFD is 0.765, whereas the range for its variants is 0.703 to 0.763. The average APFD range for the benchmark methods is 0.500 to 0.656. Notably, PCPrior performs the best in 76.7% (115 out of 150) of the cases, whereas
\(\text{PCPrior}^X\) performs the best in 23.3% (35 out of 150) of the cases. PCPrior continues to outperform the variants of PCPrior that utilize DNN ranking models (
\(\text{PCPrior}^T\) and
\(\text{PCPrior}^N\)) in all cases. Moreover, PCPrior shows an improvement ranging from 16.62% to 53.00% over all of the comparative methods. The preceding experimental results demonstrate that, under the APFD measurement, the average effectiveness of PCPrior surpasses all of its variants and comparative methods on noisy datasets.
The results from the statistical analysis on noisy test inputs are presented in Table
12. We see that the
p-values for the experimental results of PCPrior and each of the compared methods are all less than
\(10^{-05}\), indicating that PCPrior statistically outperforms all of the test prioritization methods on noisy datasets. For instance, the
p-value between PCPrior and DeepGini is
\(1.688 \times 10^{-08}\). The
p-value between PCPrior and PCS is
\(5.049 \times 10^{-08}\). Furthermore, all of the effect sizes of PCPrior and the compared approaches exceed 0.2, demonstrating a non-negligible difference. Notably, all of the effect sizes are greater than 0.8. For example, the effect size between PCPrior and VanillaSM is 2.792, and the effect size between PCPrior and Entropy is 3.352. According to Cohen’s
\(d\) [
42], this implies that the difference in experimental results between PCPrior and the compared methods is not only statistically significant but also relatively “large” in scale.
Tables
13 and
14 present a comparative analysis regarding the PFD metric. It is observed that in Table
13, the best performance is consistently achieved by PCPrior or its variants across all cases. Table
14 provides a deeper analysis of this finding. When considering different percentages of test data prioritization, PCPrior consistently outperforms other approaches in terms of effectiveness, as evidenced by the highest number of best-performing cases and the highest average PFD values. Figure
5 visually demonstrates the performance comparison of PCPrior, its variants, and the comparative methods on noisy data. The solid lines depict PCPrior and its variant methods, whereas the dashed lines represent the comparative methods. We see that across the noisy dataset, PCPrior and all of its variants exhibit higher effectiveness compared to all comparative methods. Furthermore, PCPrior demonstrates superior performance when compared to its variants.
5.5 RQ5: Feature Contribution Analysis
Objective. We investigate the contribution of each type of feature to the effectiveness of PCPrior for test prioritization. Our investigation revolves around two primary sub-questions, as outlined next:
—
RQ5.1: Based on the ablation study, to what extent does each type of feature contribute to the effectiveness of PCPrior?
—
RQ5.2: What is the distribution of feature types among the top-N most contributing features toward PCPrior?
Experimental Design. We conduct the following two experiments to answer the preceding two sub-questions:
[
Experiment ❶]
: In the original PCPrior framework, a comprehensive set of four feature types is generated, namely mutation features, spatial features, uncertainty features, and prediction features. To compare the contributions of each feature type on PCPrior’s effectiveness, we conducted a carefully designed ablation study following the prior work [
25]. More specifically, we individually removed one type of feature and evaluated PCPrior’s effectiveness under these modified conditions. For instance, to assess the contribution of spatial features, PCPrior is executed with spatial features excluded while retaining the other three feature types, and its resulting performance is evaluated. Similarly, to gauge the contribution of mutation features, PCPrior is executed without generating mutation features while still generating the other three feature types, and its performance is subsequently assessed. By conducting this ablation study, we can determine the contribution of each feature type to the overall effectiveness of PCPrior.
[
Experiment ❷]
: The method we employed to evaluate the contributions of features is the cover metric within the XGBoost algorithm [
17]. Initially, we utilized the cover metric to compute the importance scores of each feature used by PCPrior for test prioritization. Subsequently, we selected the top-
N most important features based on these scores. By analyzing the categorization of these features, we investigated the contributions of different feature types to the effectiveness of PCPrior. In the following, we provide an overview of how XGBoost quantifies feature importance.
The cover metric employed in XGBoost serves as a means to quantify the importance of features by assessing the average coverage of individual instances across the leaf nodes within a decision tree. This metric operates by evaluating the frequency with which a specific feature is utilized for partitioning the data across the entirety of the ensemble’s trees. The coverage values associated with each feature across all trees are subsequently aggregated, resulting in a cumulative coverage value. To obtain the average coverage of each instance by the leaf nodes, the cumulative coverage value is normalized in relation to the total number of instances. Consequently, the derived coverage value of a given feature plays a crucial role in determining its significance, with features that demonstrate higher coverage values being considered more important.
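The sketch below shows one way to obtain cover-based feature importances with XGBoost and select the top-N features, assuming a feature matrix X (as produced by PCPrior) and labels y indicating misclassification; the function and variable names are illustrative.

```python
import numpy as np
from xgboost import XGBClassifier

def top_n_features_by_cover(X, y, feature_names, n=10):
    """Rank features by XGBoost's 'cover' importance and return the top n."""
    model = XGBClassifier()
    model.fit(X, y)
    # get_score returns {'f0': cover, 'f1': cover, ...} for features used in splits.
    cover = model.get_booster().get_score(importance_type="cover")
    scores = np.array([cover.get(f"f{i}", 0.0) for i in range(len(feature_names))])
    top = np.argsort(-scores)[:n]
    return [(feature_names[i], scores[i]) for i in top]
```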
Results. The experimental results of RQ5.1 are presented in Table 15. In this table, w/o stands for "without." For example, PCPrior w/o SF refers to executing PCPrior without generating the spatial features. From Table 15, we see that the original PCPrior achieves the highest average effectiveness. Removing any type of feature decreases the effectiveness of PCPrior, demonstrating that each feature type contributes to PCPrior's effectiveness. For instance, on the ModelNet dataset, the average APFD value of the original PCPrior is 0.797. Removing spatial features lowers PCPrior's average APFD to 0.769, whereas removing mutation features lowers it to 0.788, uncertainty features to 0.785, and prediction features to 0.782.
Furthermore, among the four types of features, spatial features make the highest average contribution. This inference is drawn from the following findings: when spatial features are removed, PCPrior's effectiveness shows the largest average decrease, with the average APFD dropping by 0.067. In comparison, removing mutation features leads to an average APFD decrease of 0.011, uncertainty features 0.012, and prediction features 0.025. Moreover, across all datasets, removing spatial features results in the highest average decrease in PCPrior's effectiveness.
The findings of RQ5.2 are presented in Table 16, where the scores represent the importance level of each feature. For each combination of model and dataset, we present the top-N features that contribute the most. The abbreviations SF, MF, PF, and UF denote spatial features, mutation features, prediction features, and uncertainty features, respectively, and the number after an abbreviation indicates the index of the corresponding feature. For instance, SF-23 denotes the spatial feature with index 23. From Table 16, we observe that all four types of features consistently appear among the top-N most contributing features across various subjects. As an example, for the PointConv subject with the S3DIS dataset, spatial features account for 50%, uncertainty features for 30%, mutation features for 10%, and prediction features for 10%. Remarkably, among the 15 subjects investigated, in 93.3% (14 out of 15) of the cases, the top 10 contributing features include three or more distinct feature types. These experimental findings provide robust evidence that all four feature categories play pivotal roles in the effectiveness of PCPrior.
5.6 RQ6: Retraining 3D Shape Classification Models with PCPrior and Uncertainty-Based Methods
Objective. We investigate whether PCPrior and uncertainty-based test prioritization approaches are effective in selecting informative retraining inputs to enhance the performance of a 3D shape classification model.
Experimental Design. Building on previous research [58], we structured our retraining experiments as follows. First, we randomly divided the point cloud dataset into three parts, a training set, a candidate set, and a test set, in a 4:4:2 ratio. The candidate set was used for retraining, whereas the test set was reserved for evaluation and remained untouched. In the first phase, we trained a 3D shape classification model using only the initial training set. In the second round, we integrated an extra 10% of new inputs from the candidate set into the current training set without replacement; the inputs chosen for inclusion were those prioritized in the top 10% by PCPrior and by the compared test prioritization approaches. The prioritization range we selected for retraining is from 10% to 70%. We chose this range because, according to the experimental results (cf. Section 5.1), when prioritizing up to 70%, PCPrior can identify the majority of misclassified inputs in the dataset (99.6%), as indicated in Table 6. For example, in the ShapeNet dataset, within the 70% prioritized test set, PCPrior identifies 99.8% of misclassified inputs. Given that the primary objective of this research question is to validate PCPrior's effectiveness in retraining, we chose a retraining range of up to 70%. Following prior work [58], we retrained the model using the expanded training set, treating old and new training data equally. This retraining was repeated for five rounds. We opted for five rounds because the training process of DNN models involves various random factors, and multiple rounds of retraining help ensure the stability and reproducibility of the results; however, excessive retraining can lead the model to over-optimize for a specific dataset, resulting in overfitting. Therefore, based on the experimental experience of existing studies [32], we conducted five rounds of retraining. To account for the inherent randomness in model training, we repeated all experiments three times and report the average results across these repetitions.
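The sketch below outlines the core of this retraining protocol under simplifying assumptions: train_model and evaluate are placeholder routines, PCPrior scores over the candidate set are assumed to be precomputed, and the incremental five-round procedure is collapsed into a single selection per budget level.

```python
import numpy as np

def retrain_with_budget(train_set, candidate_set, test_set, pcprior_scores,
                        train_model, evaluate, budget=0.1, repeats=3):
    """Add the top `budget` fraction of candidates (ranked by PCPrior) to the training set,
    retrain, and return the mean test accuracy over `repeats` runs."""
    order = np.argsort(-np.asarray(pcprior_scores))            # most likely misclassified first
    n_selected = int(budget * len(candidate_set))
    selected = [candidate_set[i] for i in order[:n_selected]]  # chosen without replacement
    accuracies = [evaluate(train_model(train_set + selected), test_set) for _ in range(repeats)]
    return float(np.mean(accuracies))

# Example over the 10%-70% prioritization range:
# for budget in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7):
#     print(budget, retrain_with_budget(train, cand, test, scores, train_model, evaluate, budget))
```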
Results. The experimental results for RQ6 are presented in Table 17, which reports the average accuracy of the 3D shape classification models after retraining. In each case, we highlight the most effective approach in gray for quick interpretation of the findings. As shown in Table 17, PCPrior and all uncertainty-based approaches demonstrate better average effectiveness than random selection. However, the improvements they achieve are relatively small. For instance, when selecting 10% of tests for retraining the original model, the samples selected by PCPrior yield a post-retraining accuracy of 0.851, whereas uncertainty-based methods range from 0.846 to 0.850 and random selection yields 0.847. Similarly, when selecting 70% of tests for retraining, PCPrior's selected samples yield a post-retraining accuracy of 0.901, whereas uncertainty-based methods range from 0.896 to 0.898 and random selection yields 0.887.
The reasons why PCPrior and uncertainty-based methods show only small improvements over random selection in enhancing model accuracy include the following:
— Lack of diversity: PCPrior and uncertainty-based methods focus on identifying corner cases, that is, tests that the model finds more challenging. Consequently, the identified tests can lack diversity. In contrast, random selection provides a broader and more diverse set of samples, which helps the model learn more comprehensive data features and thereby improves its generalization capability.
— Overfitting risk: Concentrating on samples the model is most likely to predict incorrectly can lead to overfitting. These samples can exhibit extreme or uncommon features, causing the model to overly adapt to these specific cases after retraining while ignoring more widespread patterns.
Another observation from the results in Table 17 is that PCPrior performs better than uncertainty-based methods on average. Specifically, PCPrior performs best in 75% (6 out of 8) of the cases, whereas uncertainty-based methods perform best in only 25% (2 out of 8). Furthermore, after retraining the original model with tests selected by PCPrior, the average accuracy of the resulting model is 0.880, whereas for uncertainty-based methods the average ranges from 0.876 to 0.878.
8 Conclusion
To address the issue of high labeling costs for 3D point cloud data, we proposed a novel approach called PCPrior, which prioritizes test inputs that are likely to be misclassified. By focusing on these challenging inputs, developers can allocate their limited labeling budgets more efficiently, ensuring that the most critical test cases are labeled first and making the testing process more cost-effective. The core idea behind PCPrior is that test inputs closer to the decision boundary of the model are more likely to be predicted incorrectly. To capture the spatial relationship between a point cloud test and the decision boundary, we adopted a vectorization approach that transforms the point cloud data into a low-dimensional space, indirectly revealing the proximity between the point cloud data and the decision boundary. To implement this vectorization strategy, we generated four distinct types of features for each point cloud (test): spatial features, mutation features, prediction features, and uncertainty features. For each test input, the four generated features are concatenated into a final feature vector. PCPrior then employs a ranking model to automatically learn, from this final feature vector, the probability of a test input being mispredicted, and uses the obtained probability values to rank all test inputs. To assess the performance of PCPrior, we conducted a comprehensive evaluation involving 165 subjects, encompassing both natural and noisy datasets, and compared PCPrior with several established test prioritization approaches that have exhibited effectiveness in prior studies. The empirical results demonstrated the effectiveness of PCPrior: on natural datasets, PCPrior consistently performed better than all comparative test prioritization approaches, with an improvement ranging from 10.99% to 66.94% in terms of APFD; on noisy datasets, the improvement ranged from 16.62% to 53%.