1 Introduction
Code reuse is prevalent in IoT firmware because it facilitates development [
63]. Unfortunately, code reuse also introduces vulnerabilities concealed in the original code into a variety of firmware [
22]. The security and privacy of our lives are seriously threatened by the widespread use of these firmware [
64]. Even though the vulnerabilities have been publicly disclosed, there are a large number of firmware versions that still contain them due to delayed code upgrades or code compatibility issues [
18]. Recurring vulnerabilities, often referred to as “N-day vulnerabilities,” cannot be detected through symbol information such as function names, because this type of information is usually removed during firmware compilation. Additionally, the source code of firmware is typically unavailable as IoT vendors only provide binary versions of their firmware.
To this end,
binary code similarity detection (BCSD) is applied to quickly find homologous vulnerabilities in a large amount of firmware [
23]. The BCSD technique focuses on determining the similarity between two binary code pieces. For vulnerability search, BCSD looks for functions similar to a function that is already known to be vulnerable. In addition to vulnerability search, BCSD has been widely used in other security applications such as code plagiarism detection [
16,
48,
56], malware detection [
41,
42], and patch analysis [
28,
34,
62]. Despite many existing research efforts, the diversity of IoT hardware architectures and software platforms poses challenges to BCSD for IoT firmware. There are many different
instruction set architectures (ISAs) for IoT firmware, such as ARM, PowerPC, X64, and X86. The instructions differ, and rules such as the calling convention and the stack layout also vary across ISAs. It is non-trivial to find homologous vulnerable functions across various architectures.
BCSD methods can be generally classified into two categories: (i) dynamic analysis-based methods and (ii) static analysis-based methods. Dynamic analysis-based methods capture runtime behavior as function semantic features by running the target functions, where the function features can be I/O pairs of functions [
54] or system calls during program execution [
29], and so on. They are not scalable for large-scale firmware analysis, since running firmware requires specific devices and emulating firmware is also difficult [
20,
35,
72]. The methods based on static analysis mainly extract statistical features from assembly code. An intuitive way is to calculate the edit distance between assembly code sequences [
24]. Such methods cannot be directly applied across architectures, since the instruction sets are totally distinct. Architecture-independent statistical features of functions have been proposed for similarity detection [
31]. These features, such as the number of function calls, strings, and constants, are less affected by architectural differences. Furthermore, the
control flow graph (CFG) at the assembly code level is utilized by conducting a graph isomorphism comparison for improving the similarity detection [
31,
33]. Based on statistical features and CFG, Gemini [
65] leverages the graph embedding network to encode functions as vectors for similarity detection. With the application of deep learning models in programming language analysis, various methods have recently appeared to employ such models to encode binary functions in different forms and calculate function similarity based on function encoding [
46,
50,
53,
61]. Static analysis-based methods are faster and more scalable for large-scale firmware analysis but often produce false positives due to the lack of semantic information. Since homologous vulnerable functions in different architectures usually have the same semantics, a cross-architecture BCSD method should capture function semantics in a scalable way.
In our previous work
Asteria [
70], we first utilized the Tree-LSTM network to encode the AST in an effort to capture its semantic representation. In particular, Tree-LSTM is trained using a
Siamese [
37] architecture to learn the semantic representation by feeding homologous and non-homologous function pairs into the Tree-LSTM network. Consequently, the Tree-LSTM network learns function semantic representations that distinguish between homologous and non-homologous functions. To further improve the accuracy, we also use the call graph to calibrate the AST similarity. Specifically, we count the callee functions of target functions in the call graph to measure the difference in function calls. The final function similarity is determined by calibrating the AST similarity with the disparity in function calls. In our previous evaluation,
Asteria outperformed the available state-of-the-art methods,
Gemini and
Diaphora, in terms of accuracy. The evaluation results demonstrate the superiority of extracting function semantics by encoding the AST with the Tree-LSTM model. However, encoding the AST incurs a significant time cost for
Asteria. According to our earlier research [
70], the entire AST encoding process takes about one second. When
Asteria is applied to vulnerability detection, where numerous candidate functions must be compared against a given vulnerable function, the time cost becomes unacceptable. Since the majority of candidate functions are non-homologous, there is room for enhancing the efficiency of
Asteria. In other words, non-homologous candidate functions differ from vulnerable functions in certain characteristics that we can exploit to skip the majority of non-homologous functions efficiently. In addition, the evaluations of most existing works [33, 45, 51, 65, 73], including our prior study Asteria, do not align with real-world vulnerability detection, which involves retrieving homologous (vulnerable) functions from a large pool of candidate functions. Consequently, their performance in detecting vulnerabilities is insufficiently characterized, and it is necessary to evaluate the performance of Asteria on the vulnerability search task. Moreover, according to the results of real-world vulnerability detection [70], Asteria suffers from high false positives, which limits its effectiveness in practice.
There are two main challenges that hinder
Asteria from being practical for large-scale vulnerability detection:
•
Challenge 1 (C1). It is challenging to filter out the majority of non-homologous functions before encoding ASTs, while retaining the homologous ones, to speed up the vulnerability-detection process.
•
Challenge 2 (C2). It is challenging to distinguish similar but non-homologous functions. Despite Asteria’s high precision in homologous and non-homologous classification, it still yields false positives when distinguishing functions with similar ASTs.
We design Asteria-Pro by introducing domain knowledge as two answers, A1 and A2, to overcome these two challenges. Our fundamental idea is that introducing inter-functional domain knowledge helps Asteria-Pro achieve greater precision when combined with the intra-functional semantic knowledge learned by the deep learning model. Asteria-Pro consists of three modules: (1) Domain Knowledge-based (DK-based) pre-filtration, (2) Deep Learning-based (DL-based) similarity detection, and (3) DK-based re-ranking, where the DL-based similarity detection is essentially based on Asteria. Domain knowledge is exploited for different purposes in DK-based pre-filtration and re-ranking. In the pre-filtration module, Asteria-Pro aims to skip as many non-homologous functions as possible by comparing lightweight, robust features (A1), while retaining all homologous functions. To this end, we conducted a preliminary study of the filtering performance of several lightweight function features. Based on the findings of the study, we propose a novel filtering algorithm that effectively employs three distinct function features. In the re-ranking module, Asteria-Pro confirms the homology of functions by comparing call relationships (A2), based on the assumption that functions designed for distinct purposes have different call relationships.
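To make the division of labor among the three modules concrete, the following is a minimal, hypothetical sketch of the pipeline in Python. The helper functions are simplified placeholders rather than Asteria-Pro's actual modules, and the weighting in the re-ranking step is illustrative only.

```python
# Hypothetical sketch of the three-stage pipeline; helpers are placeholders.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Function:
    name: str
    ast: object = None                    # decompiled AST (opaque here)
    callees: set = field(default_factory=set)
    strings: set = field(default_factory=set)

def prefilter_ok(src: Function, cand: Function) -> bool:
    # Placeholder for DK-based pre-filtration: keep candidates whose
    # lightweight features (string literals here) roughly overlap the source's.
    if src.strings and cand.strings:
        return len(src.strings & cand.strings) / len(src.strings) >= 0.8
    return True                           # fall back to other features

def ast_similarity(src: Function, cand: Function) -> float:
    return 0.5                            # placeholder for Tree-LSTM scoring

def rerank(src: Function, cand: Function, ast_sim: float) -> float:
    # Placeholder for DK-based re-ranking: combine the AST similarity with
    # a callee-relationship match score (illustrative weighting).
    callee_sim = len(src.callees & cand.callees) / max(len(src.callees), 1)
    return 0.1 * ast_sim + 0.9 * callee_sim

def search(src: Function, candidates: List[Function], top_k: int = 20):
    survivors = [c for c in candidates if prefilter_ok(src, c)]        # (1)
    scored = sorted(((c, ast_similarity(src, c)) for c in survivors),  # (2)
                    key=lambda x: x[1], reverse=True)[:top_k]
    return sorted(((c, rerank(src, c, s)) for c, s in scored),         # (3)
                  key=lambda x: x[1], reverse=True)
```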
Our evaluation indicates that Asteria-Pro significantly outperforms existing state-of-the-art methods in terms of both accuracy and efficiency. By incorporating the DK-based pre-filtration module, Asteria-Pro cuts the detection time of Asteria by 96.90%. In the vulnerability-search task, Asteria-Pro achieves the shortest average search time among all baseline methods. By incorporating DK-based re-ranking, Asteria-Pro improves the MRR and Recall@Top-1 by 23.71% and 36.4%, to 90.8% and 89.6%, respectively. We have also applied our enhancement framework to the baseline methods, and the evaluation results demonstrate a significant improvement in their precision. In a large-scale real-world firmware vulnerability detection experiment using 90 CVEs, Asteria-Pro identifies 1,482 vulnerable functions with a high precision of 91.65%. Moreover, the detection results of CVE-2017-13001 demonstrate that Asteria-Pro is capable of detecting inlined vulnerable code.
Our contributions are summarized as follows:
•
We conduct a preliminary study to demonstrate the effectiveness of various simple function features in identifying non-homologous functions.
•
To the best of our knowledge, this is the first work to propose incorporating domain knowledge before and after a deep learning model to optimize vulnerability detection. We implement the domain knowledge-based pre-filtration and re-ranking algorithms and equip Asteria with them.
•
The evaluation indicates that the pre-filtration module significantly reduces the detection time and that the re-ranking module considerably improves the detection precision. Asteria-Pro outperforms existing state-of-the-art methods in terms of both accuracy and efficiency. In the evaluation in Section
8.5, we find that the performance of distinct BCSD methods may vary widely across different usage scenarios.
•
We demonstrate the utility of Asteria-Pro by conducting large-scale, real-world firmware vulnerability detection. Asteria-Pro finds 1,482 vulnerable functions with a high precision of 91.65%. We also analyze the vulnerability distribution in widely used software from various IoT vendors and report our findings.
8 Evaluation
We aim to conduct a comprehensive practicality evaluation of various state-of-the-art function similarity detection methods for bug search. To this end, we adopt eight different metrics to depict the search capability of different methods in a more comprehensive way. Furthermore, we construct a large evaluation dataset in a way that is closer to the practical usage of bug search.
8.1 Research Questions
In the evaluation experiments, we aim to answer the following research questions:
RQ1.
How does Asteria-Pro compare to baseline methods in cross-architecture and cross-compiler function similarity detection?
RQ2.
What is the performance of Asteria-Pro compared to baseline methods for the bug search task?
RQ3.
How much do DK-based pre-filtration and DK-based re-ranking improve the accuracy and efficiency of Asteria-Pro? How do the baseline methods perform when these modules are integrated with them?
RQ4.
How do different configurable parameters affect the accuracy and efficiency?
RQ5.
How does Asteria-Pro perform in a real-world bug search?
8.2 Implementation Details
We utilize
IDA Pro 7.5 [
10] and its plugin
Hexray Decompiler to decompile binary code and extract ASTs. The current version of the
Hexray Decompiler supports x86, x64, PowerPC (PPC), and ARM architectures. For the encoding of leaf nodes in Equations (
6)–(
11), we assign zero vectors to the state vectors
\(h_{kl}\),
\(h_{kr}\),
\(c_{kl}\), and
\(c_{kr}\). During model training, we use the
binary cross-entropy loss function (BCELoss) to measure the discrepancy between the labels and the predictions. The
AdaGrad optimizer is utilized for gradient computation and weight-matrix updating after the losses are computed. Due to the dependency of Tree-LSTM computation steps on the AST shape, parallel batch computation is not possible. Therefore, the batch size is always set to 1. The model is trained for 60 epochs. Our experiments are conducted on a local server with two Intel Xeon CPUs E5-2620 v4 @ 2.10 GHz, each with 16 cores, 128 GB of RAM, and 4 TB of storage. The
Asteria-Pro code runs in a Python 3.6 environment. We compile the source code in our dataset using the
gcc v5.4.0 compiler and utilize
buildroot-2018.11.1 [
3] for dataset construction. We use the binwalk tool [
2] to unpack firmware and obtain the binaries for further analysis. In the
UpRelation algorithm of the filtering module, we set the threshold values
\(T_{NCL}, T_{callee}, T_{string}\) to 0.1, 0.8, and 0.8, respectively, based on their
\(F_{score}\). The crucial threshold
\(T_{NCL}\) is discussed in Section 8.8.1. In Equation (
15), we set
\(\alpha = 0.1\) and
\(\beta = 0.9\) to emphasize the role of callee function similarities in the re-ranking process. The sensitivity analysis of these weights is presented in Section
8.8.2.
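For reference, the training configuration described above can be summarized in a short PyTorch-style sketch. The SiameseTreeLSTM class and the training-pair iterator are assumed placeholder names, not the released implementation, and the optimizer's default learning rate is used because the text does not specify one.

```python
# Minimal sketch of the training setup; SiameseTreeLSTM and training_pairs
# are assumed placeholders, not the actual Asteria-Pro code.
import torch

model = SiameseTreeLSTM()                      # assumed Tree-LSTM siamese model
criterion = torch.nn.BCELoss()                 # binary cross-entropy loss
optimizer = torch.optim.Adagrad(model.parameters())

# Thresholds and weights used by the DK-based modules (Section 8.2).
T_NCL, T_CALLEE, T_STRING = 0.1, 0.8, 0.8
ALPHA, BETA = 0.1, 0.9

for epoch in range(60):                        # the model is trained for 60 epochs
    for ast_pair, label in training_pairs:     # batch size is 1: Tree-LSTM steps
        optimizer.zero_grad()                  # depend on each AST's shape
        prediction = model(*ast_pair)
        loss = criterion(prediction, label)
        loss.backward()
        optimizer.step()
```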
8.3 Comprehensive Benchmark
To compare BCSD methods in a comprehensive way, we build an extensive benchmark based on multiple advanced works [
50,
61,
65]. The benchmark comprises two datasets, two detection tasks, and five evaluation metrics.
8.3.1 Dataset.
The functions not involved in the pre-filtering test (see Section 3) are divided into two datasets: one for model training and testing, and one for evaluation. The evaluation dataset consists of two sub-datasets, each of which is used for a different detection task.
Model Dataset Construction. The model dataset is constructed for training and testing the Tree-LSTM model. It consists of a total of 31,940 functions extracted from 1,944 distinct binaries. From these functions, 314,852 pairs of homologous functions and 314,852 pairs of non-homologous functions are created. To ensure a fair evaluation of the model’s performance, the dataset is divided into a training set and a testing set using an 8:2 ratio. This means that 80% of the function pairs are used for training the model, while the remaining 20% are used for testing and evaluating the model’s performance. The dataset construction allows the Tree-LSTM model to learn and generalize from a diverse set of functions, including both homologous and non-homologous pairs. By dividing the dataset into training and testing sets, the model’s performance can be assessed on unseen data to measure its effectiveness in identifying homologous functions.
Evaluation Dataset Construction. The evaluation dataset comprises two sub-datasets, the g-dataset and the v-dataset, which are used for two different evaluation tasks: the classification test and the bug search test. The g-dataset is constructed for the classification test, which evaluates the model’s ability to classify homologous and non-homologous function pairs. It consists of tuples in the form \((F, (F_h, F_n))\), where F is the source function and \((F_h, F_n)\) is the pairing candidate set \(P_{set}\) containing a homologous function \(F_h\) and a non-homologous function \(F_n\). Each tuple in the g-dataset thus yields function pairs to be classified as homologous or non-homologous. In contrast, the v-dataset is constructed for the bug search test, which evaluates the model’s ability to identify the homologous function among a much larger set of candidates. The tuples in the v-dataset are of the form \((F, (F_h, F_{n1},\ldots , F_{ni},\ldots , F_{n10000}))\), where \(F_h\) represents a homologous function and \(F_{n1}\) to \(F_{n10000}\) represent non-homologous functions. In this case, \(P_{set}\) contains a large number of non-homologous functions to simulate the bug search scenario. For both datasets, the source function F is matched against all functions in \(P_{set}\) for evaluation. The g-dataset focuses on evaluating the model’s accuracy in classifying homologous and non-homologous pairs, while the v-dataset assesses the model’s performance in retrieving the homologous function from a large pool of candidates.
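For intuition, the two tuple layouts can be expressed as follows. This is an illustrative sketch only; the actual sampling procedure used to build the sub-datasets may differ.

```python
# Illustrative layout of g-dataset and v-dataset tuples; the sampling
# strategy shown here is simplified, not the paper's exact procedure.
import random

def make_g_tuple(src, homologous_pool, nonhomologous_pool):
    # (F, (F_h, F_n)): one homologous and one non-homologous candidate.
    return (src, (random.choice(homologous_pool),
                  random.choice(nonhomologous_pool)))

def make_v_tuple(src, homologous_pool, nonhomologous_pool, n=10000):
    # (F, (F_h, F_n1, ..., F_n10000)): one homologous function hidden among
    # 10,000 non-homologous candidates, simulating the bug search scenario.
    return (src, (random.choice(homologous_pool),
                  *random.sample(nonhomologous_pool, n)))
```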
8.3.2 Metrics.
We choose five distinct metrics for comprehensive evaluation from earlier works [
53,
61,
70]. In our evaluation, the similarity of a function pair is calculated as a score r. Assuming the threshold is \(\beta\): if the similarity score r of a function pair is greater than or equal to \(\beta\), then the pair is regarded as a positive result; otherwise, it is regarded as a negative result. For a homologous pair, if its similarity score r is greater than or equal to \(\beta\), then it is a true positive (TP); if r is less than \(\beta\), then the result is a false negative (FN). For a non-homologous pair, if the similarity score r is greater than or equal to \(\beta\), then it is a false positive (FP); if r is less than \(\beta\), then it is a true negative (TN). These metrics are described as follows:
•
TPR. TPR is short for true-positive rate. TPR shows the accuracy of homologous function detection at threshold \(\beta\). It is calculated as \(TPR = \frac{TP}{TP+FN}\).
•
FPR. FPR is short for false-positive rate. FPR shows the accuracy of non-homologous function detection at threshold \(\beta\). It is calculated as \(FPR = \frac{FP}{FP+TN}\).
•
AUC. AUC is short for area under the curve, where the curve is termed Receiver Operating Characteristic (ROC) curve. The ROC curve illustrates the detection capacity of both homologous and non-homologous functions as its discrimination threshold \(\beta\) is varied. AUC is a quantitative representation of ROC.
•
MRR. MRR is short for mean reciprocal rank, a statistical measure for evaluating the results of a sample of queries, ordered by probability of correctness. It is commonly used in retrieval experiments. In our bug retrieval-manner evaluation, it is calculated as \(MRR = \frac{1}{|P_{set}|}\sum _{F_{hi}\in P_{set}}\frac{1}{Rank_{F_{hi}}}\), where \(Rank_{F_{hi}}\) denotes the rank of function \(F_{hi}\) in pairing candidate set \(P_{set}\), and \(|P_{set}|\) denotes the size of \(P_{set}\).
•
Recall@Top-k. It shows the capability to retrieve homologous functions within the top k detection results; the top k results are regarded as homologous functions (positive). It is calculated as \(Recall@Top\text{-}k = \frac{1}{|Q|}\sum _{F\in Q}\mathbb {I}(Rank_{F_{h}}\le k)\), where Q denotes the set of query functions and \(\mathbb {I}(\cdot)\) is the indicator function that equals 1 when the homologous function \(F_h\) of query F is ranked within the top k results.
To demonstrate the reliability of the ranking results, we adopt Recall@Top-1 and Recall@Top-10.
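Given the rank of the homologous function for each query, the retrieval metrics above can be computed as in the short sketch below; the rank list is a toy example, not evaluation data.

```python
# Sketch of the retrieval metrics over Task-V results; `ranks` holds, for
# each query, the 1-based rank position of its homologous function.
def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    # Fraction of queries whose homologous function appears in the top k.
    return sum(1 for r in ranks if r <= k) / len(ranks)

ranks = [1, 3, 1, 12, 2]                            # toy example
print(mrr(ranks), recall_at_k(ranks, 1), recall_at_k(ranks, 10))
```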
8.3.3 Detection Tasks.
The two function similarity detection tasks based on BCSD applications are as follows:
Task-C (Classification Task): This task focuses on evaluating the ability of methods to classify function pairs as either homologous or non-homologous. It involves performing binary classification on the g-dataset, which contains tuples of the form \((F, (F_h, F_n))\), where \(F_h\) represents a homologous function and \(F_n\) represents a non-homologous function. The task evaluates the performance using three metrics: TPR, FPR, and AUC of the ROC curve. TPR and FPR are commonly used to measure the performance of binary classification models, while AUC provides an overall measure of the model’s discriminative ability.
Task-V (Bug/Vulnerability Search Task): This task focuses on evaluating the ability of methods to identify homologous functions from a large pool of candidate functions. It uses the v-dataset, which contains tuples of the form \((F, (F_h, F_{n1},\ldots , F_{ni},\ldots , F_{n10000}))\), where \(F_h\) represents a homologous function and \(F_{ni}\) represents non-homologous functions. The task involves calculating function similarity between a source function F and all functions in the \(P_{set}\). The functions in \(P_{set}\) can then be sorted based on similarity scores. The task evaluates the performance using three metrics: MRR, Recall@Top-1, and Recall@Top-10. MRR measures the rank of the first correctly identified homologous function, while Recall@Top-1 and Recall@Top-10 measure the proportion of cases where the correct homologous function is included in the top-1 and top-10 rankings, respectively.
These tasks provide a comprehensive evaluation of the methods’ performance in distinguishing between homologous and non-homologous functions and identifying homologous functions from a large pool of candidates.
8.4 Baseline Methods
We choose several representative cross-architecture BCSD works that make use of ASTs or are built around deep learning-based encoding. These BCSD works consist of
Diaphora [
7],
Gemini [
65],
SAFE [
51], and
Trex [
53]. Moreover, we also use our previous conference work
Asteria as one of the baseline methods. We describe these works in more detail below.
Diaphora.
Diaphora also performs similarity detection based on ASTs.
Diaphora maps nodes in an AST to primes and calculates the product of all prime numbers. Then it utilizes a
difference function to calculate the similarity between the prime products. We download the
Diaphora source code from GitHub [7] and extract
Diaphora’s core algorithm for AST similarity calculation for comparison. Since it would take a significant amount of time (several minutes) to compare a pair of functions with extremely dissimilar ASTs, we add a filtering step before the prime-difference computation: it calculates the AST size difference and eliminates function pairs with a significant size difference. We publish the improved Diaphora source code on our website [
1].
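To make the prime-product idea and the added size filter concrete, the following is a rough sketch; the node-to-prime mapping and the final difference measure are simplified stand-ins, not Diaphora's actual code. Two ASTs sharing most node types yield a greatest common divisor close to either product, hence a score near 1.

```python
# Rough sketch of a prime-product AST comparison plus the AST-size pre-filter;
# simplified stand-in, not Diaphora's implementation.
from math import gcd

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53]
PRIME_OF = {}                                   # AST node type -> prime

def ast_product(node_types):
    product = 1
    for t in node_types:
        if t not in PRIME_OF:
            PRIME_OF[t] = PRIMES[len(PRIME_OF)] # fixed small table: sketch only
        product *= PRIME_OF[t]
    return product

def diaphora_like_similarity(types_a, types_b, max_gap=0.5):
    # Added pre-filter: skip pairs whose AST sizes differ too much, since
    # computing the prime difference on such pairs is the slow case.
    if abs(len(types_a) - len(types_b)) > max_gap * max(len(types_a), len(types_b), 1):
        return 0.0
    pa, pb = ast_product(types_a), ast_product(types_b)
    shared = gcd(pa, pb)                        # node types occurring in both
    return shared / max(pa, pb)                 # simplified difference measure
```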
Gemini.
Gemini encodes
attributed CFGs (ACFGs) into vectors with a graph embedding neural network. The ACFG is a graph structure where each node is a vector corresponding to a basic block. We have obtained
Gemini’s source code and its training dataset. Note that in Reference [65], the authors mention that the model can be retrained for a specific task, such as bug search. To obtain the best accuracy of Gemini, we first use the given training dataset to train the model to achieve the best performance. Then, we re-train the model with part of our training dataset.
Gemini supports similarity detection on X86, MIPS, and ARM architectures.
SAFE.
SAFE works directly on disassembled binary functions, does not require manual feature extraction, and is computationally more efficient than
Gemini. In their vulnerability search task,
SAFE outperforms
Gemini in terms of recall.
SAFE supports three different instruction set architectures: X64, X86, and ARM. We retrain SAFE based on the official code [51] and use the retrained model parameters for our test. In particular, we select from the training dataset all function pairs whose instruction set architectures are supported by SAFE. Then, we extract the function features for all selected function pairs and discard the pairs whose features SAFE cannot extract. After feature extraction, 27,580 function pairs across three distinct architecture combinations (i.e., X86-X64, X86-ARM, and X64-ARM) are obtained for training. Next, we adopt the default model parameters (e.g., embedding size) and training settings (e.g., number of training epochs) to train
SAFE.
Trex. Trex is based on a pretrained model [53] using state-of-the-art NLP techniques and micro-traces. It utilizes a dynamic component to extract micro-traces and uses them to pretrain a masked language model. It then integrates the pretrained model into a similarity detection model, together with the semantic knowledge learned from micro-traces. Trex supports similarity detection for ARM, MIPS, X86, and X64.
8.5 Comparison of Similarity Detection Accuracy (RQ1)
In the cross-architecture scenario, which is commonly encountered in vulnerability search, we assess the detection capability of different approaches on the two tasks separately. Additionally, we evaluate the performance in cross-compiler scenarios involving three different compiler combinations: gcc-clang, gcc-icc, and clang-icc.
8.5.1 Cross-architecture Evaluation.
Note that in both tasks some baseline methods are not capable of detecting function similarities for all four instruction set architectures; as a result, the detection results for certain architecture combinations are left empty. For each task, we measure the performance of the approaches in terms of the defined metrics and discuss the results below.
Comparison on task-C. In Task-C, all approaches were evaluated by conducting similarity detection on all supported architectural combinations. The evaluation results were used to calculate the three metrics (TPR, FPR, and AUC) for each approach. These results are presented in Table
2 and visualized in Figure
10, where each subplot represents the ROC curve for a specific architecture combination. The x-axis represents the FPR (False-positive Rate), and the y-axis represents the TPR (True-positive Rate). By examining the ROC curves in Figure
10, it can be observed that methods with performance curves closer to the upper-left corner generally exhibit superior performance. In particular, the ROC curves of
Asteria-Pro and Asteria are almost indistinguishable across all architectural combinations, indicating that they possess equivalent classification performance in Task-C. Furthermore, the AUC values presented in Table
2 provide a quantitative measure of the approaches’ ability to distinguish between homologous and non-homologous functions. Asteria-Pro and Asteria demonstrate nearly identical performance in this regard. Meanwhile, the AUC values of
Asteria-Pro are consistently greater than those of the other baseline techniques for all architectural combinations. This suggests that
Asteria-Pro exhibits superior discriminative capability between homologous and non-homologous functions in Task-C. These findings highlight the strong performance of
Asteria-Pro in the classification task and its ability to outperform the baseline methods in distinguishing between homologous and non-homologous functions across various architecture combinations.
Comparison on task-V. Table
3 presents the results of calculating MRR, Recall@Top-1, and Recall@Top-10 for different architectural combinations. These metrics evaluate the performance of the methods in the bug (vulnerability) search task. Recall@Top-1 measures the ability to accurately detect homologous functions, while Recall@Top-10 assesses the capability to rank homologous functions within the top ten positions. In the table, the first column represents the metrics, and the second column lists the names of the methods. The third through eighth columns display the metric values for the different architectural combinations, while the last column shows the mean value across all architectures. It can be observed that
Asteria-Pro and
Asteria consistently outperform the baseline approaches by a significant margin across all architecture configurations.
Asteria-Pro achieves an impressive average MRR of 0.908, indicating a substantial improvement of up to 23.71% compared to
Asteria. Even after retraining,
Safe demonstrates poor performance in properly recognizing small functions. In terms of Recall@Top-1, both
Asteria-Pro and
Asteria achieve relatively high average precisions of 0.89 and 0.65, respectively, which are 237% and 146% higher than the best baseline result (0.26). Notably,
Asteria-Pro shows a 36.4% improvement in Recall@Top-1 compared to
Asteria. Regarding Recall@Top-10, both
Asteria-Pro and
Asteria continue to exhibit superior performance compared to the other methods. While other methods show a significant increase in recall compared to Recall@Top-1, their values remain below
Asteria-Pro. Overall, these results demonstrate that
Asteria-Pro outperforms the baseline methods, including
Asteria, in terms of MRR, Recall@Top-1, and Recall@Top-10 across different architecture combinations. The recall of other methods, such as
Trex, increases significantly from Recall@Top-1 to Recall@Top-10, indicating that they can often rank homologous functions within the top ten. However, they still fall short compared to
Asteria-Pro.
Indeed, the performance of BCSD approaches can vary significantly between different evaluation tasks, as demonstrated by the differences in Task-V performance compared to the similar ROC curve performance in Task-C. In the case of Gemini, despite having a high AUC score similar to Asteria, its MRR performance is relatively poor compared to both Asteria and Asteria-Pro. This indicates that evaluating BCSD approaches in a single experiment setting, such as Task-C, may not provide a comprehensive understanding of their real-world applicability and behavior. Task-V, which focuses on bug (vulnerability) search, simulates the scenario of identifying homologous functions from a pool of candidate functions. In this task, the ability to accurately rank and identify homologous functions becomes crucial. While ROC curves and AUC scores provide information about the ability to discriminate between homologous and non-homologous functions, they may not reflect the performance in ranking and retrieving homologous functions accurately. Therefore, it is important to consider multiple evaluation tasks, such as Task-C and Task-V, to assess the overall performance and effectiveness of BCSD approaches. The results obtained from different tasks can provide a more comprehensive understanding of the strengths and limitations of each method and their suitability for real-world applications.
False-positive Analysis. The false-positive outcomes of Asteria-Pro can be attributed to two primary causes:
Cause 1: Similar Syntactic Structures of Proxy Functions—Proxy functions exhibit similar syntactic structures, which can lead to similar semantics. This can make it challenging for Asteria-Pro to differentiate between proxy functions, since their semantics are alike. Figure
11 provides an illustration of two proxy functions that differ only on line 9. Due to their similar semantics, it becomes difficult to confirm the actual callees, especially when symbols are lacking or when indirect jump tables are involved.
Cause 2: Compiler-Specific Intrinsic Functions—Compilers for different architectures utilize various intrinsic functions, which substitute libc function calls with optimized assembly instructions. For example, the gcc-X86 compiler may replace the memcpy function with several memory operation instructions that are specific to the architecture. As a result, the memcpy function may be absent from the list of callee functions used by Asteria’s filtering and re-ranking modules. This lack of complete callee function information can lead to a loss of precision in the scoring calculation.
Both causes contribute to the false-positive outcomes in Asteria-Pro, highlighting the challenges in accurately detecting function similarity across different architectures and handling variations in compilers’ optimization techniques. Addressing these causes and improving the precision of function similarity detection in such scenarios is an ongoing area of research and development in the field of BCSD.
8.5.2 Cross-compiler Evaluation.
In the cross-compiler evaluation, we conducted experiments using three different compilers: gcc, icc (Version 2021.1 Build 20201112_000000), and clang (10.0.0), all for the x86 architecture. The evaluation results are presented in Table
4. We evaluated the performance of different methods using metrics such as MRR and Recall in the three cross-compiler settings: gcc-clang, gcc-icc, and clang-icc. The average values for all three settings are also provided in the last column of the table. Our new tool,
Asteria-Pro, consistently outperforms the baseline methods by significant margins across all three compiler combinations. Compared to Asteria,
Asteria-Pro achieves an average improvement of 47.6% in MRR and Recall, demonstrating its superior performance. The improvements compared to other baseline tools such as Trex, Gemini, Safe, and Diaphora are even more substantial, with average improvements of 596.6%, 331.7%, 485.0%, and 26.7%, respectively. It is worth noting that Diaphora achieves surprisingly high precision in the gcc-clang setting, particularly compared to the cross-architecture setting. This may be attributed to the fact that compilers gcc and clang employ similar compilation optimization algorithms, resulting in similar assembly code and
abstract syntax tree (AST) structures. However, since Asteria is not trained on a cross-compiler dataset, it exhibits relatively lower precision compared to Diaphora. Although the precision performances of the methods vary in different compiler combination settings, a consistent trend can be observed. Specifically, higher precision is observed in the gcc-clang setting, while lower precision is observed in the gcc-icc and clang-icc settings, except for Safe. This can be attributed to the fact that the icc compiler employs more aggressive code optimizations, resulting in dissimilar assembly code compared to the other compilers. Overall, the results of the cross-compiler evaluation demonstrate the effectiveness of
Asteria-Pro in detecting function similarity across different compilers and highlight its superior performance compared to the baseline methods.
8.6 Performance Comparison (RQ2)
In this section, we measure the function similarity detection time of all baseline approaches and Asteria-Pro. Since the DK-based pre-filtration and DK-based re-ranking modules are intended to enhance performance in Task-V, we only measure timings in Task-V. In Task-V, given a source function, each method first extracts the function features of the source and all candidate functions, which is referred to as phase 1. Next, the extracted function features are subjected to feature encoding and encoding similarity computation to determine the final similarities, which is referred to as phase 2.
As shown in Figure
12(a), we calculate the average feature extraction time for each function. The x-axis depicts extraction time, while the y-axis lists various extraction methods. During feature extraction for one single function,
Asteria-Pro, Asteria, and Diaphora all perform the same operation (i.e., AST extraction), resulting in the same average extraction time. Since AST extraction requires binary disassembly and decompilation, it takes the most time among all methods. Trex requires the least amount of time for feature extraction, less than 0.001 s per function, as code disassembly is its only time-consuming step.
Figure
12(b) illustrates the average duration of a single search procedure for various methods. The phases 1 and 2 of a single search procedure are denoted by distinct signs. Due to its efficient filtering mechanism,
Asteria-Pro requires the least amount of time (58.593 s) to complete a search. Owing to its expensive pretrained-model encoding computation, Trex is the most time-consuming method.
Asteria-Pro cuts search time by 96.90%, or 1831.36 s, compared to Asteria (1889.96 s).
8.7 Ablation Experiments (RQ3)
To demonstrate the improvements contributed by the DK-based pre-filtration and DK-based re-ranking modules, we conduct ablation experiments by evaluating different module combinations in Asteria-Pro. The module combinations are Pre-filtering + Asteria and Asteria + Re-ranking. Both combinations perform Task-V, and the results are shown in Table 5. For Asteria + Re-ranking, the top 20 similarity detection results are re-ranked by the re-ranking module.
8.7.1 Filtration Improvement.
Compared to Asteria, the integration of pre-filtering improves MRR, Recall@Top-1, and Recall@Top-10 by 12.26%, 16.29%, and 5.93%, respectively. In terms of efficiency, it cuts search time by 96.94%. The Pre-filtering + Asteria combination performs better than Asteria + Re-ranking in terms of Recall@Top-10 and time consumption. It generates a greater Recall@Top-10, because it filters out a large proportion of highly rated non-homologous functions.
8.7.2 Re-ranking Improvement.
Compared to Asteria, the integration of the re-ranking module improves MRR, Recall@Top-1, and Recall@Top-10 by 20.16%, 31.51%, and 3.76%, respectively. In terms of efficiency, it incurs an average of only 0.13 s of additional time for re-ranking, which is negligible. Compared to Pre-filtering + Asteria, the re-ranking module contributes to an increase in MRR and Recall@Top-1 by enhancing the rank of homologous functions.
8.7.3 Embedding Baseline Methods.
We demonstrate the generalizability of our BCSD enhancement framework by integrating its two components, pre-filtering and re-ranking, with the other baseline BCSD methods (see the sketch at the end of this subsection). Specifically, we apply all the baseline methods to compute the similarity scores between the remaining functions and the source function after pre-filtering. Subsequently, we rank the remaining functions in descending order based on their similarity scores and select the top 50 functions for re-ranking. The final similarity score is obtained by combining the re-ranking score and the score generated by the baseline method. Using the final similarity score, we determine the rankings of the top 50 candidate functions and calculate three metrics, namely, MRR, Recall@Top-1, and Recall@Top-10. Table 6 presents a comparison of the original and integrated versions of the baseline methods, where each baseline method is listed alongside its integrated version (e.g., Trex-I for the integrated version of Trex), and the subsequent rows give the values of MRR, Recall@Top-1, and Recall@Top-10.
The accuracy of the baseline methods is significantly improved by the addition of our two components, with Diaphora-I in particular showing a substantial increase in MRR from 0.02 to 0.772. We manually analyzed the outputs of Diaphora and Diaphora-I to understand the reason for the improved ranking of homologous functions. We found that while Diaphora tends to assign high similarity scores to homologous functions, it also assigns high scores to numerous non-homologous functions, which lowers the ranking of homologous functions. Specifically, we found that the average score difference between the highest score (i.e., score of top 1) and the score assigned to the homologous function is only 0.11. By incorporating reranking scores into the final scores, Diaphora-I places a higher emphasis on homologous functions, resulting in improved ranking. If homologous functions are present in the top 50 before re-ranking, then they are mostly ranked at the top. Safe-I also shows improved accuracy, although not as substantial as Diaphora-I, as Safe tends to rank homologous functions outside the top 50, reducing the impact of reranking.
The enhancement framework also effectively enhances the accuracy of Asteria. Specifically, Asteria-Pro achieves very high MRR and Recall@Top-1, with a notable margin compared to other integrated versions of baseline methods. The high accuracy of Asteria-Pro enables it to generate more reliable search results, which can significantly reduce the efforts required for vulnerability confirmation when applied to bug search tasks.
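A compact sketch of how an arbitrary baseline scorer is wrapped by the two components follows; the combination rule shown is a simplified stand-in for the final score, and the scorer, pre-filter, and re-ranker are passed in as placeholders.

```python
# Sketch of wrapping a baseline BCSD scorer with pre-filtering and re-ranking;
# `prefilter`, `baseline_score`, and `rerank_score` are placeholder callables.
def enhanced_search(src, candidates, baseline_score, prefilter, rerank_score,
                    top_n=50, alpha=0.1, beta=0.9):
    survivors = [c for c in candidates if prefilter(src, c)]
    ranked = sorted(survivors, key=lambda c: baseline_score(src, c),
                    reverse=True)[:top_n]
    # Combine the baseline score with the DK-based re-ranking score.
    final = {c: alpha * baseline_score(src, c) + beta * rerank_score(src, c)
             for c in ranked}
    return sorted(final, key=final.get, reverse=True)
```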
8.8 Configurable Parameter Sensitivity Analysis (RQ4)
Asteria-Pro has two sets of configurable parameters: the filtering threshold \(T_{NCL}\) in the pre-filtering algorithm, and the weight values in Equation (15) for the final similarity score. In our evaluation, we analyze the impact of these parameters on Asteria-Pro’s performance by testing different values of \(T_{NCL}\) for pre-filtering and varying weight combinations for the final similarity score.
8.8.1 Different Filtering Threshold.
In Algorithm
1, the threshold
\(T_{NCL}\) determines the number of functions that are filtered out. We evaluate the efficacy of the filtering module by utilizing various
\(T_{NCL}\) values, and the results are presented in Table
7. The threshold values, listed in the first column, range from 0.1 to 0.5, where a higher threshold value implies a stricter selection of candidate functions. The second column indicates the number of functions omitted by the filter, while the third column displays the recall rate of the filtration results. As the threshold value increases, the recall rate declines and the number of filtered-out functions grows. We use 0.1 as our threshold value for two key reasons: (a) the high recall rate of the filtering results is advantageous for subsequent homologous function detection, and (b) the number of filtered-out functions does not differ significantly across threshold values.
8.8.2 Weights in Re-ranking.
We conducted a sensitivity analysis of different weight values in Equation (15). The evaluation results are presented in Table
8. The first two columns display the combinations of two distinct weights,
\(\alpha\) and
\(\beta\), from Equation (15). The last three columns give the values of various metrics, including MRR, Recall@Top-1, and Recall@Top-10.
We did not test all weight combinations as the accuracy metrics consistently decreased with an increase in
\(\alpha\). As shown in the table, when
\(\alpha = 0.1\) and
\(\beta = 0.9\), Asteria-Pro has the best accuracy. The last row sets
\(\beta\) to 0.0, meaning the re-ranking score is not included in the final similarity calculation. Therefore, the results are consistent with the combination “Pre-filtering + Asteria” in Section
8.7.
8.9 Real World Bug Search (RQ5)
To assess the efficacy of Asteria-Pro, we conduct a large-scale real-world bug search. To accomplish this, we collect firmware and compile vulnerable functions to create a firmware dataset and a vulnerability dataset. Using the vulnerability dataset, we then apply Asteria-Pro to detect vulnerable functions in the firmware dataset. To confirm the vulnerability of the resulting functions, we design a semi-automatic method for identifying vulnerable functions. Through a comprehensive analysis of the results, we discover interesting facts regarding vulnerabilities present in IoT firmware.
8.9.1 Dataset Construction.
In contrast to our prior work, we expand both the vulnerability dataset and the firmware dataset for a comprehensive vulnerability detection evaluation.
Vulnerability Dataset. The prior vulnerability dataset of seven CVE functions is enlarged to
90, as shown in Table
9. Vulnerability information is primarily gathered from the NVD website [
11]. As shown in the first column, the vulnerabilities are collected from widely used open-source software in IoT firmware, including OpenSSL, Busybox, Dnsmasq, Lighttpd, and Tcpdump. In the second column, the number of software vulnerabilities is listed. In the third column, the timeframe or specific years of the disclosure of the vulnerability are listed. The final column describes the software version ranges affected by vulnerabilities. Note that the version ranges are obtained by calculating the union of all versions mentioned in the vulnerability reports. As a result,
Asteria-Pro is expected to generate vulnerability detection results for all software versions falling within the specified ranges.
Firmware Dataset. We download as much firmware as we could from six popular IoT vendors, namely Netgear [
12], Tp-Link [
15], Hikvision [
9], Cisco [
4], Schneider [
14], and Dajiang [
6], as shown in the first column of Table 10. These firmware are used by routers, IP cameras, switches, and drones, all of which play essential roles in our daily lives. The second column shows the firmware numbers, which range from 7 to 548. The third and fourth columns give the numbers of binaries and functions obtained after unpacking the firmware with binwalk. Note that the binary number counts only the software that appears in the vulnerability dataset. The fifth through ninth columns give the counts of the five software packages across all firmware vendors. OpenSSL and Busybox are widely integrated in these IoT firmware, as their numbers are close to those of the firmware. By querying the vendors’ official websites for device type information, we find that the majority of Hikvision firmware is for IP cameras, whereas Cisco firmware is mostly for routers. Notably, IP camera firmware incorporates less software than router firmware, because routers offer more functionality. For example, the firmware of the Cisco RV340 router includes OpenSSL, Tcpdump, Busybox, and Dnsmasq, whereas the majority of IP camera firmware only includes OpenSSL. Similarly, the majority of Netgear and Tp-Link firmware is for routers, while Schneider and Dajiang firmware includes specialized devices such as Ethernet radios and stabilizers.
8.9.2 Large Scale Bug Search.
Asteria-Pro is employed to identify vulnerable homologous functions among 3,483,136 firmware functions by referencing 90 functions from the vulnerability dataset. Specifically, to expedite the detection process, vulnerability detection is restricted to matching software between the firmware dataset and the vulnerability dataset. For instance, the vulnerable functions disclosed in OpenSSL are only used to detect vulnerable homologous functions in OpenSSL binaries in the firmware dataset. For each software
S, we first extract the features (i.e., ASTs and call graphs) of all functions in the firmware dataset and of the vulnerable functions in the vulnerability dataset. For each vulnerability disclosed in
S, the pre-filtration module uses the call graph to filter out non-homologous functions, followed by the Tree-LSTM model encoding all remaining functions as vectors.
Asteria-Pro then computes the AST similarity between the vulnerable function vector and the firmware function vectors. Asteria-Pro computes re-ranking scores for the top 20 functions ranked by AST similarity, since the evaluation demonstrates a very high recall within the top 20. As a final step,
Asteria-Pro generates 20 candidate homologous functions for each
S as a
bug search result for each vulnerability. To further refine the bug search results, we compute the average similarity score of homologous functions in Section
8.5 and use it to eliminate non-vulnerable functions. In particular, the average similarity score of 0.89 is used to eliminate 3,987 of 5,604 results. We perform heuristic confirmation of vulnerability for the remaining 1,617 results.
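The refinement step can be sketched as follows; the data layout is illustrative. Applying the 0.89 cutoff is what reduces the 5,604 raw results to the 1,617 that we confirm heuristically.

```python
# Sketch of the result-refinement step: keep the top 20 candidates per
# (software, CVE) query, then drop those below the average homologous-function
# similarity score (0.89) observed in Section 8.5; data layout is illustrative.
AVG_HOMOLOGOUS_SCORE = 0.89

def refine(search_results):
    """search_results maps (software, cve_id) -> list of (function, score)."""
    refined = {}
    for query, candidates in search_results.items():
        top20 = sorted(candidates, key=lambda c: c[1], reverse=True)[:20]
        refined[query] = [(f, s) for f, s in top20 if s >= AVG_HOMOLOGOUS_SCORE]
    return refined
```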
Vulnerability Confirmation Method. We devise a semi-automatic method for confirming the actual vulnerable functions among the candidate homologous functions. The method makes use of the symbols and string literals within the target firmware binaries. Specifically, we use software-specific regular expressions to match version strings and to extract function symbols from the software. The method then comprises two distinct confirmation operations corresponding to two vulnerable circumstances \(VC_1\) and \(VC_2\).
•
\(VC_1\). In this circumstance, the target binary contains a version string (e.g., “OpenSSL 1.0.0a”) and the symbol of the target function is not removed.
•
\(VC_2\). The target binary contains version strings, whereas the symbol of the vulnerable homologous function is removed.
The versions of software listed in Table
9 are easy to extract using version strings [
27]. The descriptions of the two confirmation operations
\(CO_1\) and
\(CO_2\) are as follows:
•
\(CO_1\). For \(VC_1\), we confirm the vulnerable function based on the version and name of the target software. In particular, a vulnerable function is confirmed when the following two conditions are met: (1) the software version is in the vulnerable version range, and (2) the vulnerable function name remains after the elimination with the average similarity score.
•
\(CO_2\). For \(VC_2\), if the software version is in the range of vulnerable versions, then we manually compare the code of the CVE function against the remaining candidate functions to confirm the vulnerability.
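As a hypothetical sketch of \(CO_1\) for an OpenSSL binary, the check can be expressed as below; the regular expression and the inputs are illustrative assumptions, not the exact rules we apply.

```python
# Hypothetical sketch of confirmation operation CO_1: match a version string,
# check it against the vulnerable range, and look for the CVE function name
# among the retained results. Regex and inputs are illustrative assumptions.
import re

OPENSSL_VERSION_RE = re.compile(rb"OpenSSL (\d+\.\d+\.\d+[a-z]?)")

def confirm_co1(binary_bytes, retained_function_names, vuln_func_name,
                vulnerable_versions):
    match = OPENSSL_VERSION_RE.search(binary_bytes)
    if not match:
        return False                                   # no version string: handled by CO_2
    version = match.group(1).decode()
    return (version in vulnerable_versions             # (1) version in vulnerable range
            and vuln_func_name in retained_function_names)  # (2) name survives elimination
```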
Results Analysis. In Table 11, we tally the numbers of vulnerable functions, software binaries, and firmware after vulnerability confirmation. The first column contains the names of the vendors. The second through sixth columns show the number of vulnerable functions in each software, while the seventh column gives the total number of vulnerable functions. The eighth through twelfth columns display the number of vulnerable software binaries for each software, while the thirteenth column provides the total number of vulnerable software binaries. According to the seventh column of Table 11, there are a total of 1,482 vulnerable functions: 1,456 are confirmed by \(CO_1\), whereas 26 are confirmed by \(CO_2\). Of the 1,456 \(CO_1\)-confirmed vulnerable functions, 1,377 rank first and 79 rank second. \(CO_2\) is performed on 47 detection results, of which 26 are confirmed. The 21 unconfirmed detection results can be attributed to two reasons. First, 18 of them were due to the fact that the detected target binaries did not contain any of the target vulnerable functions. For example, we were unable to detect the vulnerable function “EVP_EncryptUpdate” of CVE-2016-2106 in the “libssl.so” library of OpenSSL, since it exists in the “libcrypto.so” library. Second, three of the unconfirmed results were ranked in the top 20 but were subsequently filtered out by the similarity threshold used in the real-world bug search setting. A large proportion of vulnerable functions are found in the OpenSSL software used by the three vendors, and the number of vulnerable software binaries is consistent with this observation. The final column shows the number of firmware images containing at least one vulnerable function, together with their proportion of the total firmware. Every Dajiang firmware contains at least one CVE vulnerability, because all OpenSSL components used in its firmware are vulnerable. In addition, Hikvision is found to have a large proportion of vulnerable firmware (58.89%). To inspect the distribution of CVE vulnerable functions, we plot the top 10 CVEs and their distributions for the five vendors other than Cisco in Figure 13, since Cisco involves two additional CVEs.
•
Top 10 CVE Analysis. Figure
13 demonstrates the top 10 CVE distribution in various vendors. The total number of discovered CVE vulnerabilities decreases from left to right along the x-axis. Except for CVE-2015-0287, all of the top 10 CVE vulnerabilities are discovered in every
Dajiang firmware. This is because Dajiang utilizes an outdated version of OpenSSL 1.0.1h that contains numerous vulnerable functions [
13]. Although Hikvision has only the third largest number of firmware images, it has the most vulnerable functions in our experimental setting. The reason is that Hikvision firmware heavily uses OpenSSL versions 1.0.1e (184) and 1.0.1l (401), both of which contain a large number of vulnerabilities.
Finding: Since firmware images from the same vendor released in the same period typically adopt the same vulnerable software versions, they very likely contain identical vulnerabilities. Security analysts can quickly narrow down vulnerability analysis based on the firmware release date.
•
CVE and Version Analysis. Figure
14 depicts the distribution of vulnerable OpenSSL versions for various CVEs across vendors, where the x-axis represents the OpenSSL version and the y-axis represents the CVE ID. Each square in a subfigure indicates the number of vulnerable OpenSSL binaries of the corresponding version (x-axis) that contain the corresponding CVE (y-axis); the lighter the red color, the greater the number. The left subfigure shows that OpenSSL 1.0.2h is widely used by Netgear, resulting in a significant number of CVE-2016-2180 vulnerabilities (92). Additionally, OpenSSL version 1.0.1e exposes the majority of the CVEs listed on the y-axis, which may increase the device’s attack surface. The TP-Link firmware incorporates OpenSSL version 1.0.1e, resulting in lighter hues. Hikvision firmware utilizes versions 1.0.1e and 1.0.1l, which are vulnerable to a number of CVEs. Comparing the vulnerability distribution of OpenSSL version 1.0.1e among different vendors reveals inconsistencies in the existence of vulnerabilities. For instance, CVE-2016-2106 is present in OpenSSL 1.0.1e from Hikvision but not from Netgear or TP-Link.
Finding: Despite using the same version of software, various vendor firmware behaves differently in terms of vulnerability, since they can tailor the software to the device’s specific capabilities.
•
CVE-2016-2180 Analysis. CVE-2016-2180, a remote denial-of-service flaw triggered by a crafted time-stamp file and affecting OpenSSL 1.0.1 through 1.0.2h, exists in 207 firmware images. NETGEAR is responsible for 117 of these, as it deploys 92 OpenSSL 1.0.2h binaries out of a total of 548 firmware images, plus an additional nine OpenSSL 1.0.2-series and sixteen OpenSSL 1.0.1-series binaries. The vulnerable version 1.0.2h was released in May 2016, and by comparing timestamps, we determined that OpenSSL 1.0.2h was integrated into firmware between 2016 and 2019.
Finding: Vulnerable software versions continue to be used for firmware development even after their vulnerabilities have been disclosed.
Based on the confirmation results, Asteria-Pro detects 1,482 vulnerable functions out of 1,617 bug search results, indicating that Asteria-Pro achieves a high vulnerability detection precision of 91.65% under our experimental settings. To estimate recall, we randomly select 1,000 of the 5,604 bug search results and manually validate the existence of vulnerabilities in the corresponding software binaries. Among these 1,000 bug search results, 205 target functions are confirmed to be vulnerable by checking the software versions and the vulnerable functions. Of these 205 vulnerable functions, Asteria-Pro detects 53, representing a recall rate of \(25.85\%\).
Finding Inlined Vulnerable Code. During the analysis of mismatched cases, in which the target homologous functions are not in the top ranking position, we observe that the top-ranked functions contain the same vulnerable code. We use CVE-2017-13001 as an illustration of inlined vulnerable code detection. CVE-2017-13001 is a buffer over-read vulnerability in the Tcpdump
nfs_printfh function prior to version 4.9.2. After a confirmation operation
\(CO_2\),
Asteria-Pro reports a single function,
parsefh as being vulnerable. We manually compare the decompiled code of the
parsefh function to the source code of
nfs_printfh in tcpdump version 4.9.1 (i.e., vulnerable version). Figure
15 demonstrates that the source code of
nfs_printfh (on the left) and the partial code of
parsefh (on the right) are consistent. We mark code regions with apparently identical semantics using distinct background colors. In other words, during compilation, function
nfs_printfh is inlined into function
parsefh. As a result, the function
parsefh contains CVE-2017-13001 vulnerable code, and
Asteria-Pro manages to identify the inlined vulnerable code.
Asteria-Pro has detected an additional eight instances of inlined vulnerable code out of 20 functions in vulnerable circumstance
\(VC_2\).
The preceding analysis and conclusions are constrained by the dataset we constructed; nevertheless, they offer security analysts some guidance for the security analysis of firmware.
10 Discussion
10.1 How does the re-ranking module solve the function inline issues?
Inlining functions is a common optimization technique used by compilers to improve the performance of code execution. The decision of whether or not to inline a function is based on various factors, including the size of the function, the frequency of function calls, and the complexity of the code. In general, inlining smaller functions tends to be more beneficial than inlining larger functions.
When smaller functions are inlined, the re-ranking module in Asteria-Pro can handle the inlining issue by considering both function similarities and the match of callee relational structures. Specifically, the module matches all callee functions between the source function and target functions, which allows for high similarity even if one callee function is inlined into the target function. This is because the target function can still maintain a relatively high similarity with the source function even after incorporating inlined code, and the un-inlined callee functions still contribute to the final similarity score. As a result, Asteria-Pro exhibits high metric values (i.e., recall and MRR) in our evaluation based on the contribution of the target function code and all its callee functions. Through manual analysis of search results in real-world bug detection, we have also demonstrated that Asteria-Pro is capable of finding homologous vulnerable functions that contain inlined function code.
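As a toy illustration of this point (the function names and numbers below are hypothetical), suppose a source function calls three callees and one of them is inlined into the target during compilation; the callee-set overlap remains high:

```python
# Hypothetical illustration: one callee of the source function is inlined
# into the target, yet callee-set matching still yields high overlap.
source_callees = {"parse_hdr", "check_len", "emit_log"}
target_callees = {"parse_hdr", "emit_log"}        # "check_len" was inlined

overlap = len(source_callees & target_callees) / len(source_callees)
print(overlap)   # ~0.67 -- the remaining callees still support a high score
```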
10.2 What is the Design Difference between Pre-Filtering, Re-Ranking and SCA Tools
Some
software composition analysis (SCA) tools have adopted features similar to those used by the pre-filtering and re-ranking modules of
Asteria-Pro. For example, Modx [
69] matches string literals and the whole call graph between two libraries, and LibDB [
60] adopts string literals and exported function names to measure the similarity of libraries. The use of string literals and call graphs is quite straightforward.
However, we would like to highlight some conceptual-level differences between our approach and these prior works. While the use of string literals and call graphs is indeed straightforward, it can be challenging to apply them to function matching, particularly when functions lack string literals or are leaf nodes in call graphs. Additionally, callee functions of a target function may not be exported functions, meaning that function names are removed and cannot be used for matching.
To address these challenges, our approach differs from SCA methods in several key ways. First, we utilize the local context extracted from call graphs in both the pre-filtering and re-ranking modules to efficiently remove non-homologous functions and confirm homologous ones. Second, we introduce an algorithm called “UpRelation” to utilize caller relations from call graphs in pre-filtering. The algorithm leverages the genealogist of parent nodes to identify potential homologous functions. It achieves this by matching the genealogist of parent nodes and retaining the child nodes of the matched parent nodes. This approach is particularly useful when the target function is a leaf node in the call graph and does not contain any string literals. Last, our re-ranking module considers both structural and semantic similarities of functions, resulting in more accurate ranking of homologous functions. Specifically, the re-ranking module uses Asteria to calculate similarities when callee functions are not exported functions.
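Based only on the description above, a heavily simplified caller-relation filter could be sketched as follows; this is our own illustrative reconstruction under stated assumptions, not the actual UpRelation algorithm, and the data layout and matching rule are simplifications.

```python
# Illustrative reconstruction of a caller-relation filter in the spirit of
# UpRelation; the data layout and matching rule are simplified assumptions.
def caller_relation_filter(vuln_caller_features, candidates, callers_of):
    """Keep candidates having a caller that resembles a caller of the
    vulnerable function (e.g., via shared string literals).

    vuln_caller_features: list of feature sets, one per caller of the
                          vulnerable function
    callers_of:           callable mapping a candidate to the feature sets
                          of its callers
    """
    kept = []
    for cand in candidates:
        for caller_feats in callers_of(cand):
            if any(caller_feats & vf for vf in vuln_caller_features):
                kept.append(cand)
                break
    return kept
```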
10.3 What Will Asteria-Pro Perform on Cross-Optimization Settings?
Although we did not evaluate the performance of
Asteria-Pro in cross-optimization settings, it is worth discussing the potential impact of such settings on the performance of our method. Cross-optimization refers to the situation where the training and testing sets are compiled with different optimization settings. This is a common scenario in practice as different developers may use different optimization flags, or the same developer may use different optimization levels for different releases. Previous studies have shown that cross-optimization can significantly affect the accuracy of BCSD methods, as the semantic features extracted from the binary code may change depending on the optimization settings. For instance, in a study by Liu et al. [
46], the accuracy of a state-of-the-art BCSD method dropped from 95.3% to 46.2% when tested in a cross-optimization setting.
In the case of our method, Asteria-Pro, which is based on the Tree-LSTM architecture, the impact of cross-optimization on its performance is likely to be substantial. This is because the Tree-LSTM model is sensitive to the AST structure and summarizes semantics by identifying structural patterns. Therefore, if the source and target functions are compiled with different optimization settings, then the Tree-LSTM may not be able to recover the expected semantics from the substantial AST structure transformation and may thus produce inaccurate results.
Moreover, training our model on cross-optimization settings would require significant computational resources and time, which may not be feasible in practice. Therefore, we have not evaluated our method on cross-optimization settings in this study. Nevertheless, we acknowledge that cross-optimization is an essential consideration for evaluating Asteria-Pro’s generalizability, and we encourage future studies to investigate this aspect further.