In this section, we report and discuss the results for RQ1 and RQ2.
4.1 RQ1: To what extent can machine learning models automatically identify safety-related concerns in issue reports of UAV software platforms?
As detailed in Section
3.4, we experimented with different ML classifiers, namely Naive Bayes, J48, SMO, Logistic Regression, Random Forest, and fastText, leveraging different combinations of features. In the following, we report and discuss our results. Note that we report statistical comparisons only where appropriate, since we computed them among all possible pairs of treatments. Detailed statistical results can be found in the replication package.
Table
2 summarizes the results obtained when considering different ML classifiers with different combinations of features. As shown in Table
2, Random Forest, together with fastText are the classifiers showing the best performance in terms of precision, recall, and f-measure.
Based on the statistical comparison, Random Forest with BOW outperforms all other techniques (adjusted p-values \(\lt\) 0.05), with ORs ranging from 1.3 (comparison with BOW + SMO) to 2.95 (comparison with BOW + Logistic Regression). There is no statistically significant difference when augmenting BOW with OB/EB/S2R and applying Random Forest (p-value = 0.6). Indeed, the weighted precision, recall, and f-measure only slightly increase from 0.80 to 0.81. We can also notice that the introduction of n-grams in the model slightly lowers the performance to 0.79 for all metrics, yet such a difference is not statistically significant (p-value = 0.31). This may happen because terms contributing toward the classification of a safety-related sentence are unlikely to be sequences of adjacent words.
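As an illustration of this setup, a BOW-based Random Forest classifier evaluated with 10-fold cross-validation can be sketched with scikit-learn. This is a minimal approximation rather than the exact pipeline used in the study, and the toy sentences and labels below are hypothetical placeholders:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# Hypothetical toy sentences and Safety/Non-Safety labels,
# replicated so that 10-fold CV has enough samples per fold.
sentences = [
    "motor failsafe triggered on low battery",
    "update documentation for build instructions",
    "vehicle flips when radio signal is lost",
    "fix small typo in readme",
] * 25
labels = ["Safety", "Non-Safety", "Safety", "Non-Safety"] * 25

# BOW features; setting ngram_range=(1, 2) would also include the
# bigrams that did not significantly help in our experiments.
vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(sentences)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, labels, cv=10, scoring="f1_weighted")
print(f"Weighted F-measure: {scores.mean():.2f}")
```

On real issue-report data, the weighted F-measure reported by `cross_val_score` corresponds to the values discussed above.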
When training the classifiers using the features extracted with LSI, Logistic Regression outperforms Random Forest: 0.79 versus 0.74 on all the evaluation metrics being considered. The difference is statistically significant (p-value \(\lt\) 0.01, with an OR = 1.5). However, there is no statistically significant difference with BOW + Random Forest (p-value \(=\) 0.28). Hence, the application of LSI in this context may not be particularly convenient, given that building an LSI space is typically more expensive than simply creating a BOW model.
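LSI can be approximated by applying a truncated SVD to a (tf-idf weighted) term-document matrix. A minimal sketch with scikit-learn, on hypothetical toy data, could look as follows:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy sentences; 1 = Safety, 0 = Non-Safety.
sentences = ["motor failsafe on low battery", "fix typo in docs",
             "vehicle crash on signal loss", "update build script"] * 25
labels = [1, 0, 1, 0] * 25

# LSI: SVD over the tf-idf term-document matrix projects sentences
# into a low-dimensional "concept" space used as the feature set.
lsi_clf = make_pipeline(
    TfidfVectorizer(),
    TruncatedSVD(n_components=3, random_state=0),
    LogisticRegression(),
)
lsi_clf.fit(sentences, labels)
print(f"Training accuracy: {lsi_clf.score(sentences, labels):.2f}")
```

The extra SVD step is what makes building the LSI space more expensive than a plain BOW model, as noted above.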
As regards the classifiers showing the worst performance, we found that Naive Bayes trained with all features shows an average F-measure of 0.45, while Logistic Regression trained with BOW features has an average F-measure of 0.63. Based on the results reported in Table
2, it is possible to conclude that, for most of the ML classifiers, the performance is already quite good when considering the BOW features alone, with marginal improvements when adding OB/EB/S2R. Only Naive Bayes does not show an improvement in terms of F-measure when combining different types of features. This might happen because Naive Bayes classifiers show limited improvements when the number of features increases [
138,
139].
It is important to highlight that, in general, for almost all the ML models trained with multiple features, the performance tends to be better in identifying true negatives (i.e., issues not containing safety-related concerns). This may happen because the dataset used for training the classifiers is imbalanced, i.e., \(\simeq\) 56% of the sentences do not discuss any safety-related concerns.
When comparing the results obtained by Random Forest and fastText, we must mention that, as shown in Table
2, fastText and Random Forest both achieve an average f-measure value of 0.81. It is important to highlight that fastText achieves good performance by using a standard configuration (as detailed in Section
3.4.2), while Random Forest achieves similar results only when applying several pre-processing steps and experimenting with specific sets of features. Interestingly, fastText achieves a higher recall in the
Safety class, which is important to identify safety-related issues in a timely manner during the issue management process. It is worth noticing that fastText does not outperform the other experimented techniques by a large margin (as happens in the context of assigning labels to issues by considering the whole text reported [
109]) achieving comparable results to the ones achieved by the Random Forest classifier trained with
M\(_{BOW}\) +
M\(_{OB-EB}\) features.
Our dataset has two levels of imbalance: (i) the imbalance across projects (i.e., one project has many more safety issues than the others) and (ii) the imbalance in the number of samples in the Safety and Non-Safety classes. These two issues are addressed in this section (see the “Imbalance handling techniques” and “Cross-project analysis” paragraphs).
Imbalance handling techniques. To check whether the results are biased due to the imbalance of the number of samples in the Safety and Non-Safety classes, we also report the results of imbalance handling techniques applied to our dataset, considering the best performing shallow ML model in all our experiments, i.e., Random Forest trained with M\(_{BOW}\) + M\(_{OB-EB}\) features. We then compare the results of Random Forest when using the default configuration and its variant using imbalance handling techniques. Specifically, we applied the imbalance handling approach by performing the following steps:
(1)
242 Non-Safety sentences were randomly discarded from the dataset (i.e., undersampling), to obtain a balanced dataset (i.e., 837 Safety sentences and 837 Non-Safety sentences).
(2)
We leveraged the balanced dataset obtained from the previous step to experiment with 10-fold cross-validation using the default configuration of Random Forest trained with M\(_{BOW}\) + M\(_{OB-EB}\) features.
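The two steps above can be sketched as follows. This is a minimal illustration with scikit-learn: the sentence contents are hypothetical placeholders, while the class sizes match those of our dataset:

```python
import random

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

random.seed(42)

# Hypothetical placeholders for the 837 Safety and 1,079 Non-Safety sentences.
safety = [f"sample {i} motor failsafe battery" for i in range(837)]
non_safety = [f"sample {i} docs typo build" for i in range(1079)]

# Step 1: undersampling, i.e., randomly discard 242 Non-Safety
# sentences to obtain 837 sentences per class.
non_safety_balanced = random.sample(non_safety, len(safety))

sentences = safety + non_safety_balanced
labels = [1] * len(safety) + [0] * len(non_safety_balanced)

# Step 2: 10-fold cross-validation with the default Random Forest
# (here trained on BOW features only, as a simplification).
X = CountVectorizer().fit_transform(sentences)
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, labels, cv=10, scoring="f1_weighted")
print(f"Weighted F-measure: {scores.mean():.2f}")
```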
We decided to use undersampling (rather than oversampling) as the imbalance handling strategy since, as reported in previous work [159], oversampling methods tend to generate false examples, causing classifiers to perform well in the lab but to be more likely to fail in practice. We follow the recommendation of such previous work [159], which suggests avoiding oversampling methods when dealing with sensitive applications such as security, autonomous driving, aviation safety, and medical applications. Given that the projects investigated in our study fall within the domain of aviation safety, we decided to experiment with the undersampling strategy. Table
3 reports the results achieved by (i) Random Forest when using its default configuration (trained with
M\(_{BOW}\) +
M\(_{OB-EB}\) features) and 10-fold cross-validation on the unbalanced original dataset (described in Section
3.2), and (ii) the default configuration of Random Forest (trained with
M\(_{BOW}\) +
M\(_{OB-EB}\) features) in a 10-fold cross-validation setting by leveraging the balanced dataset obtained when applying the undersampling strategy. As we can observe from Table
3, the performance obtained by the two variants of Random Forest is almost identical, so no major improvement is achieved by balancing the dataset. We conjecture that this happens because our dataset does not present a heavy imbalance in terms of the number of samples in the
Safety and
Non-Safety classes. Indeed, as reported in Section
3.3, we have a fairly balanced set of sentences: i.e., 837
Safety sentences (representing 43.7% of the data), and 1,079
Non-Safety sentences (representing 56.31% of the data).
Cross-project analysis. As detailed in Section
3.4, to check whether the results are biased due to the dataset imbalance (i.e., the majority of issues and pull requests considered in our dataset belong to
Ardupilot), we also report results of a cross-project analysis, in which we use the best performing configurations, i.e., Random Forest trained with
M\(_{BOW}\) +
M\(_{OB-EB}\) features, and fastText. As shown in Table
4, for Random Forest, the average precision and recall values decreased compared to the 10-fold setting. Consequently, the overall F-measure decreases by 0.16. In particular, for the Random Forest algorithm, we observe a marked degradation in recall when identifying the sentences labeled as
Safety. This performance degradation could happen because the terminology used to describe safety issues can be slightly different in
Ardupilot, compared to the other projects.
When comparing the performance of fastText in the cross-project setting and the results achieved by this model in a 10-fold setting (see Table
2), we can observe that such performance only slightly decreased, i.e., the weighted average f-measure in a cross-project setting is 0.77 while it is equal to 0.81 in a 10-fold setting. Specifically, in a cross-project setting, the default implementation of fastText achieves a recall of 0.62 for the sentences in the
Safety category, suggesting that the usage of word embeddings (i.e., the feature representation strategy leveraged by fastText) helps to partially mitigate the problem of the different vocabularies used in heterogeneous projects to describe safety-related problems.
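The cross-project setting can be mimicked with a leave-one-project-out split. A minimal sketch with scikit-learn follows, where the project tags and toy sentences are hypothetical; note how, with no shared vocabulary between projects, performance on the held-out project degrades, echoing the observation above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical toy corpus: one large project and one small one,
# mimicking the imbalance across projects in our dataset.
sentences = (["motor failsafe low battery"] * 30 + ["fix docs typo"] * 30 +
             ["vehicle crash on gps glitch"] * 10 + ["update build file"] * 10)
y = np.array([1] * 30 + [0] * 30 + [1] * 10 + [0] * 10)
projects = np.array(["ardupilot"] * 60 + ["dronin"] * 20)

X = CountVectorizer().fit_transform(sentences)
fold_scores = {}
# Each fold trains on all projects but one and tests on the held-out one.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=projects):
    clf = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    held_out = projects[test_idx[0]]
    fold_scores[held_out] = f1_score(y[test_idx], clf.predict(X[test_idx]),
                                     average="weighted")
print(fold_scores)
```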
Feature Importance. To better understand the features that contribute more to the classification, we computed the Mean Decrease Gini (also called Mean Decrease in Impurity) [
53,
125,
164] for Random Forest, considering BOW and OB/EB/S2R features. As shown in Figure
3, the top 20 features deemed important for the classification of safety-related sentences are rather intuitive. Specifically, hardware-related features (e.g.,
motor and
vehicle) and aspects concerning concrete states (e.g.,
failsafe) or actions (e.g.,
arm and
disarm) of the UAVs tend to be considered as important features by the model. Although the keyword
safety is among the top 20 relevant features, it only appears in 48 out of 837 (less than 6%) safety-related sentences in our dataset.
Hyperparameter Optimization. As detailed in Section
3.4.1, we experimented with Grid search as a hyperparameter optimization approach [
2,
14,
15] to investigate potential optimal combinations of parameters for the selected shallow ML models. Specifically, with Grid search we experimented with several parameter combinations for the various ML models: around 600 combinations in total, using a 10-fold validation setting (all detailed results are shared in our replication package along with the Python code used to run the Grid search experiments). As shown in Table
5, for Random Forest, the average Precision, Recall, and F-Measure values slightly increased (by about +1%) compared to its default configuration. For most of the other models, the obtained improvements are slightly more evident in terms of Precision, Recall, and F-Measure. In particular, for the J48 algorithm, we observe a +2% improvement for all metrics. Similarly, for Naive Bayes we observe small improvements in Recall (+1%), Precision (+2%), and F-Measure (+1%). For Logistic Regression we observe more substantial improvements in Recall (+16%), Precision (+16%), and F-Measure (+16%). SMO is the only model that did not achieve any observable improvement in terms of F-Measure. Overall, considering the models having the highest values of both Precision and Recall (i.e., values
\(\gt 0.80\)), Random Forest and fastText are the best performing models, as observed before, with Random Forest slightly better in terms of F-Measure (+1%). Interestingly, even after a Grid search optimization step, none of the experimented ML strategies achieves more than 0.82 of F-Measure, which represents—within the considered hyperparameter values—the upper bound of our experiments.
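A reduced version of the Grid search procedure can be sketched with scikit-learn's `GridSearchCV`. The grid below is a small hypothetical one, far from the roughly 600 combinations explored in the study, and the toy data are placeholders:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Hypothetical toy data standing in for the labeled sentences.
sentences = ["motor failsafe battery", "docs typo fix",
             "vehicle crash flip", "build script update"] * 25
labels = [1, 0, 1, 0] * 25
X = CountVectorizer().fit_transform(sentences)

# A small illustrative grid; GridSearchCV evaluates every combination
# with 10-fold cross-validation and keeps the best one.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid,
                      cv=10, scoring="f1_weighted")
search.fit(X, labels)
print(search.best_params_, round(search.best_score_, 2))
```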
4.2 RQ2: What are the main hazards and accidents emerging from safety issues reported in UAV software platforms?
To explore the recurrent hazards and accidents reported in UAV software platforms, and answer RQ
\(_2\), we performed a manual analysis of the 273 issues and pull-requests of our dataset containing at least one sentence describing a safety-related concern (as described in Section
3.5). Specifically, based on the title and the description content, each issue is assigned to at most one hazard and/or accident category. Upon completion of the manual inspection and clustering procedure (see Section
3.5), we found 19 different categories of hazards and seven categories of accidents reported in Tables
6 and
7, respectively, along with their occurrences.
As reported in Table
6, undesired hardware behavior [
75] (4.40%), onboard instrumentation issues, such as GPS, GCS, or RC connection lost during flight [
57] (2.56%), communication failures, such as improper communication with motors and between motors [
76] (1.47%), inappropriate handling (due to either hardware or software defects) for high vibrations [
89] (1.10%), and battery-related issues [
77] (0.73%) are signaled as sources of hazards in only about 10% of the cases, meaning that, in the great majority of the considered reports, the hazardous circumstances depend on software-related defects. Specifically, in about 40% of the analyzed issues, the reported hazards depend on misbehaviors occurring when the system enters failsafe mode (
undesired behavior on failsafe or error condition/configuration \(+\) undesired failsafe behavior, 24.18%), or the missing inhibition of certain actions (e.g., reboot while in-flight [
87], motor lockdown during takeoff [
95]) when the system is in specific states (14.65%).
Concerning hazards happening in failsafe mode, we distinguish misbehaviors depending on improper failsafe settings [
54] (i.e.,
undesired failsafe behavior) from unsafe actions occurring when a failsafe (or error condition) is triggered [
60] (i.e.,
undesired behavior on failsafe or error condition/configuration). As regards the former, Issue #292 [
54] from
ardupilot, titled “Battery Monitoring RTL Bug”, describes a bug inside the functionality handling the failsafe condition due to low battery. Specifically, “a low battery warning can cause a permanent RTL that can[not] be overridden”. As regards the latter, instead, Issue #697 [
60] from
ardupilot clearly mentions the case where the propellers “go to 100% Full Throttle” when a radio signal loss occurs (i.e., the failsafe condition is triggered), resulting in the vehicle flipping.
On the one hand, in
\(\simeq 9\%\) of the inspected reports, the risky situations are due to insufficient checks, as in Issue #6649 [
79] from
ardupilot where there is a need to “check that the battery voltage is above the ARMING_MIN_VOLT and ARMING_MIN_VOLT2 parameter values” before arming the vehicle. On the other hand, there are 20 (7.33%) issues/pull requests discussing hazardous situations due to an improper parameter setting, such as PR #1516 [
81] from
dronin where there is the need for changing the output calibration, i.e., “offset the minimum of the calibration range up a little bit from what we send” during disarming.
Furthermore, in
\(\simeq 13\%\) of the manually analyzed pull requests and issues, misleading data [
65,
83] (6.59%), or missing warnings when the system is under anomalous conditions [
62,
69] (5.86%) are highlighted as risky situations by developers and/or end-users.
In the remaining cases, dangerous events are related to the inability to perform certain actions when the system is in specific states [
68] (4.03%), improper mode switches (
inappropriate mode changes/handling [
86]
\(+\) inappropriate safety switch handling [
84], 5.86%), the missing detection of occurred failures [
85] (2.56%), race [
90] or timing [
71] conditions (1.83%), the impossibility of taking control [
63] (1.47%), and memory size issues [
70] (0.73%).
Turning the attention to the accidents generated by risky situations, as expected, crashes/collisions [
64,
66] or abnormal behaviors while in-flight [
74] are the most recurrent ones. Specifically, (potential) accidents in these categories have been indicated in more than 20% of the manually inspected issues (see Table
7). In a further 8% of the analyzed reports, users signaled (potential) mishaps connected with problematic landing/return-to-launch (RTL) operations. For instance, users experienced the automated triggering of blind [
93] or hard [
91] landings after the occurrence of unexpected events, confirming that landing is among the most critical and accident-prone phases of a UAV flight [
44]. Flyaways [
161] (i.e., the devices fly away from their users) are signaled in nearly 7% of the investigated issues, while stabilization/positioning accidents [
88] are disclosed in less than 4% of the cases. Finally, in less than 3% of the inspected reports, users indicated (potential) operator injuries [
73] (1.83%) or takeoff issues [
58] (0.73%).
To better understand the cause-effect relationships between the different hazard and accident categories, Figure
4 reports the occurrences with which specific hazard categories (detailed in Table
6) co-occur with the different accident categories (see Table
7). To avoid discussing relations occurring only once in our dataset, Figure
4 only shows the co-occurrences that took place more than once. Note that the thickness of the lines is proportional to the number of issues reporting, at the same time, a specific hazard (on the left) and a specific accident (on the right).
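The co-occurrence counts underlying such a diagram can be computed with a simple cross-tabulation. A sketch with pandas on a few hypothetical per-issue annotations:

```python
import pandas as pd

# Hypothetical per-issue annotations (hazard, accident).
annotations = pd.DataFrame({
    "hazard": ["undesired failsafe behavior", "inadequate checks",
               "undesired failsafe behavior", "misleading data",
               "undesired failsafe behavior"],
    "accident": ["crash/collision", "landing/rtl problem",
                 "crash/collision", "flyaway", "landing/rtl problem"],
})

co = pd.crosstab(annotations["hazard"], annotations["accident"])
# Keep only pairs observed more than once, as done for the diagram.
co = co.where(co > 1, 0)
print(co)
```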
Out of the overall 34 issues describing undesired behaviors when the system enters the failsafe mode or encounters an error condition (see Table
6), (i) 8 (23.53%) led to landing/rtl problems, (ii) 6 (17.65%) caused crashes or collisions, and (iii) 3 (8.82%) resulted in flyaways, e.g., PR #8039 [
82] from
ardupilot, stating “Prevent DCM fallback from triggering a flyaway”. As regards the former, Issue #1164 [
61] from
ardupilot reports a situation where the undesired behavior of the drone when a battery failsafe is triggered results in an immediate undesirable landing of the vehicle (i.e., “give the operator the possibility to calculate and configure a battery level at which it is still safe to RTL/fly to a failsafe destination before landing”). Issue #1234 [
67] from
dronin, instead, discusses a scenario where the drone crashes as a consequence of “motors keep running at neutral throttle until the arming timeout has expired” once a TX failsafe is triggered — “this is unsafe, as the motors will likely burn up in a crash.”
Similarly, out of the overall 32 signaled undesired failsafe behaviors (see Table
6), (i) 4 (12.50%) led to crashes or collisions, (ii) 3 (9.38%) caused landing/rtl problems, (iii) 3 (9.38%) provoked stabilization/positioning accidents, (iv) 3 (9.38%) induced abnormal in-flight behaviors, and (v) 3 (9.38%) co-occurred with flyaways. For instance, Issue #374 [
55] from
ardupilot clearly discusses a scenario where the “copter lost around 30m of height” due to a bug in the handling of the failsafe operation triggered by an RC-signal loss. In contrast, concerning the anomalous behavior of the UAV while in flight, PR #16594 [
96] from
ardupilot reports about a situation where it is mandatory to have “the EKF failsafe [being] trigger[ed] soon after the vehicle is armed in stabilizing (or any other non-GPS mode) if it does [not] have a good position estimate.”
Crashes or collisions might also be due to (i) misleading data (11.43% of reported crashes/collisions) such as “there is no filtration in the current sensor drivers, and any noise gets passed down as obstacles; this gives a sudden jerky response by the vehicle, which may even lead to crash” [97], (ii) inadequate checks (11.43% of reported crashes/collisions), (iii) missing inhibition of certain actions in specific states (11.43% of reported crashes/collisions), like “allowing the operator to inhibit ADS-B avoidance below the specified altitude” so that it is possible to avoid “crashing into trees or buildings”, as described in PR #7074 [80] from ardupilot, (iv) undesired hardware behaviors (8.57% of reported crashes/collisions), (v) undetected failures (5.71% of reported crashes/collisions), or (vi) inability to perform certain actions (5.71% of reported crashes/collisions).
Besides, out of the 16 issues reporting as a hazard the missing or misleading communication with the pilot, 2 (12.50%) caused flyaways, and 2 (12.50%) led to abnormal in-flight behaviors, such as “add warning to user from vehicle and ground station when user is approaching the fence [so that] it’s more obvious for the user to see the vehicle distance from the fences” [
62]. Abnormal in-flight behaviors might also arise when specific actions are not inhibited (17.39% of the reported abnormal in-flight behaviors) or disabled (8.70% of the reported abnormal in-flight behaviors), in case of inadequate checks (13.04% of the reported abnormal in-flight behaviors), or when high vibrations are not properly handled (8.70% of the reported abnormal in-flight behaviors). As regards the latter, PR #12349 [
89] from
ardupilot describes a case where the inappropriate detection of high vibrations and loss of altitude results in a vehicle climbing very quickly while it is not expected to climb at all.
Flyaways, instead, might also depend on missing, misleading, or noisy data or measurements coming from the sensors (15.79% of reported flyaways), such as “catch fly-aways caused by bad compass heading” [
56] and onboard instrumentation issues (10.53% of reported flyaways), such as “sudden bad GPS position leads to Loiter flying off (GPSGlitch)” [
57].
Finally, landing/rtl problems might also be caused by inadequate checks (9.52% of reported landing/rtl problems). For instance, PR #15092 [
92] from
ardupilot reports a dangerous situation where “the vehicle continuously climbs or descends at its maximum rate which at best leads to a hard landing” that is caused by a missing pre-arm check of the EKF’s altitude estimate.
The results presented above, on the one hand, provide an empirical characterization of safety issues in UAV-related software. On the other hand, they pave the way for research approaches aimed at supporting a better analysis and testing of UAV software:
–
Have a clear mapping of actions and events enabled or inhibited in each specific configuration or state. The obtained results showed that, if the system performs an action that should be inhibited in a given state, handles an event that should be ignored, or fails to handle an event that should be handled, then accidents can occur. Therefore, the studied bugs highlight the need for clearly specifying—e.g., through state machines—the system behavior in different states. Moreover, given such state models, it is desirable to leverage suitable hazard analysis approaches [
10] or testing approaches [
135] not only to check whether events/inputs are correctly captured in each state, but also to verify, through the testing of
sneaky paths (i.e., unspecified transitions in a state machine), that unexpected events do not have unintended consequences.
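As a toy illustration of sneaky-path testing, one can enumerate every (state, event) pair without a specified transition and assert that it leaves the system state unchanged. The states, events, and transitions below are hypothetical simplifications invented for the example:

```python
# Hypothetical, highly simplified flight-mode state machine: only
# the listed (state, event) pairs are explicitly specified.
STATES = {"disarmed", "armed", "in_flight", "failsafe"}
EVENTS = {"arm", "takeoff", "rc_loss", "reboot"}
TRANSITIONS = {
    ("disarmed", "arm"): "armed",
    ("armed", "takeoff"): "in_flight",
    ("in_flight", "rc_loss"): "failsafe",
    ("disarmed", "reboot"): "disarmed",
}

def step(state, event):
    # A safe implementation ignores unspecified (sneaky) transitions.
    return TRANSITIONS.get((state, event), state)

# Sneaky paths: unspecified (state, event) pairs; e.g., "reboot"
# must never be handled while in flight.
sneaky = sorted((s, e) for s in STATES for e in EVENTS
                if (s, e) not in TRANSITIONS)
for state, event in sneaky:
    assert step(state, event) == state, (state, event)
print(f"{len(sneaky)} sneaky paths checked")
```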
–
Failsafe and error modes. We have seen how the improper handling of the failsafe mode may cause accidents, for example when a rotor is not stopped after an emergency landing. On the one hand, this highlights design problems and the need to clearly specify how the UAV should behave in such failsafe or error modes. On the other hand, this also triggers the need for suitable state-based testing, as explained above. More generally, such results also suggest the need to adapt or customize failsafe and error modes for specific UAVs.
–
Test of misleading data or improper parameter settings. Misleading data from communication channels or from sensors (e.g., malfunctioning ones) may also cause unintended consequences and, ultimately, accidents. This raises the need to use fault injection [
169] and mutation testing techniques with the aim of simulating such errors and testing whether they are correctly handled. Furthermore, this triggers the need for customized sets of high-level mutant operators specialized for such a domain, similar to what has been done in domains such as deep learning-based systems and embedded software [
102,
106]. Last but not least, this triggers the development of approaches aimed at supporting the developer in the definition of input data or state consistency checks, where appropriate.
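As a minimal illustration of this idea, the snippet below injects glitches into a hypothetical altitude sensor and applies a simple consistency check that rejects implausible jumps. All names, probabilities, and thresholds are invented for the example:

```python
import random

def faulty_sensor(base_sensor, fault_prob=0.3, seed=0):
    # Fault injection: wrap a sensor so that it occasionally returns
    # a glitched value, simulating misleading data from hardware.
    rng = random.Random(seed)
    def wrapped():
        value = base_sensor()
        if rng.random() < fault_prob:
            return value + rng.uniform(-50.0, 50.0)  # injected glitch
        return value
    return wrapped

def consistent_altitude(readings, window=5, max_jump=10.0):
    # Consistency check: fall back to the recent median whenever the
    # latest reading jumps too far, instead of passing noise on.
    recent = sorted(readings[-window:])
    median = recent[len(recent) // 2]
    latest = readings[-1]
    return median if abs(latest - median) > max_jump else latest

true_altitude = 100.0
sensor = faulty_sensor(lambda: true_altitude)
readings = [sensor() for _ in range(20)]
print(round(consistent_altitude(readings), 1))
```

Such a harness lets a test suite check that injected sensor faults are filtered out before they can trigger an unsafe reaction.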
–
Cause-effect analysis. The analysis of the co-occurrences between hazard and accident categories (see Figure
4), as well as the frequencies of reported accidents (see Table
7) can be leveraged not only to develop approaches aimed at supporting issue triaging by predicting the likely root causes based on the accidents and other observable elements, but also to prioritize test and analysis activities [
17,
18], by determining what kinds of root causes can be responsible for the most dangerous accidents.