1 Introduction
In order to accelerate the software development process and integrate external functionalities, developers frequently rely on existing code from open-source code repositories and package management platforms such as Conan [1], Vcpkg [2], GitHub [3], and SourceForge [4]. These resources are referred to as third-party libraries (TPLs) [5]. As the open-source software (OSS) ecosystem continues to expand, an ever-increasing number of software projects are being constructed utilizing TPLs as their foundation. A recent report from Synopsys [6] indicates that a staggering 97% of audited software incorporates at least one TPL.
However, the reliability of a large number of TPLs is difficult to guarantee, and security vulnerabilities in one software project can easily propagate throughout the software supply chain, thereby impacting other software projects. Among the audited software in the Synopsys report [6], 81% contain at least one known security vulnerability. Moreover, developers who unintentionally introduce improper TPLs may also violate open-source licensing regulations, resulting in legal complications. For instance, both Cisco and VMware have encountered significant legal issues due to non-compliance with the stipulations set forth in the Linux license [7, 8]. This underlines the importance of being vigilant when incorporating TPLs into software projects, in order to maintain both the security and legal integrity of the resulting applications.
Generally speaking, TPL detection is a generic technology that detects code reuse relationships between software projects and can be applied to a large number of downstream tasks. On the one hand, both developers and users are keen to identify and manage TPLs in their software to mitigate security risks associated with TPL reuse [5]. Moreover, the detection of software plagiarism and open-source code infringement has gained significant attention from researchers in recent years [9]. On the other hand, TPL detection results can be utilized for 1-day vulnerability detection and malware identification. For instance, Firmsec [10] employed TPL detection technology to uncover numerous 1-day vulnerabilities in IoT firmware, demonstrating that the actual impact scope of known vulnerabilities detected via TPL detection technology often extends far beyond what is reported in the CVE [11] or NVD [12] databases. This article focuses on the study of the generic TPL detection technique and evaluates its application to the vulnerability association task.
In an effort to counteract the potential risks posed by unreliable TPLs, numerous researchers have focused on the development of effective TPL detection approaches for software applications. Originally, the majority of research efforts concentrated on TPL detection in Java [13–19]. Recently, there has been a growing interest in TPL detection within C/C++ binaries [9, 20–25]. TPL detection in C/C++ binaries presents even greater challenges, as binaries compiled using diverse optimization options and architectures exhibit significant differences [26, 27]. In detail, current approaches for TPL detection in C/C++ binaries typically involve gathering an extensive database of candidate TPLs and subsequently determining which of these have been reused in the target binary. These approaches can be broadly classified into two main categories: constant-based works [9, 20–23] and function similarity-based works [22, 24, 28]. Constant-based works identify TPL reuse within the target binary by extracting identical constant features, such as strings [20, 23], function names [9], and jump tables [21], from both the target binary and the candidate TPLs. In contrast, function similarity-based works involve comparing all the functions of the target binary with those of the candidate TPLs. Subsequently, a predetermined threshold is set to establish whether reuse has occurred. ISRD [22] addresses reuse detection by identifying more than half of the similar functions in a TPL, while LibDB [24] depends on over three connected functions on the FCG to ascertain reuse. ModX [28] divides functions with the same functionality into groups by clustering and sets a threshold for every group.
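The two threshold styles just described can be sketched as simple decision rules. The sketch below is illustrative only; the function and variable names are assumptions, and real systems operate on matched binary functions rather than plain name sets:

```python
def isrd_style_reuse(matched_tpl_funcs, all_tpl_funcs):
    """ISRD-style rule: report reuse when more than half of the
    TPL's functions have a similar function in the target binary."""
    return len(matched_tpl_funcs) > len(all_tpl_funcs) / 2

def libdb_style_reuse(matched_tpl_funcs, tpl_fcg_edges, min_connected=3):
    """LibDB-style rule (simplified): report reuse when more than
    `min_connected` matched functions are connected by call edges
    on the TPL's function call graph (FCG)."""
    connected = set()
    for caller, callee in tpl_fcg_edges:
        if caller in matched_tpl_funcs and callee in matched_tpl_funcs:
            connected.update((caller, callee))
    return len(connected) > min_connected
```

Both rules reduce a set of per-function matches to a single binary-level verdict, which is exactly the granularity mismatch discussed below.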
However, it is essential to note that existing TPL detection technologies for C/C++ binaries exhibit certain limitations. These limitations may hinder their performance and scalability in addressing the various downstream tasks for which they are intended. We summarize the limitations of the existing works in the following three points:
Firstly, existing TPL detection approaches struggle to achieve both high accuracy and high robustness across different optimization options and architectures. On the one hand, the features selected by existing methods do not always work. Although constant-based approaches, which rely on strings [20, 23], function names [9], and jump tables [21], exhibit robustness across varying optimization options and architectures, these approaches may falter when the number of constants is limited or when repeated constants appear in distinct binaries, leading to reduced performance in detecting TPLs. In contrast, function similarity-based approaches can detect every instance of reuse by comparing all functions [22, 24, 28]. Unfortunately, the accuracy of isolated function matching declines significantly as the number of similar functions increases and as code varies across different optimization options and architectures, compromising the accuracy of the TPL detection results. On the other hand, the detection granularity of existing approaches does not match the reuse granularity. Current techniques generally set a threshold value for the entire binary [24, 28] or source file [9, 21], and when the number of matched features reaches this threshold, the entire binary or source file is considered to be reused. However, in numerous cases, software reuses only a portion of a TPL (partial reuse) [9]. As a result, some small-scale reuse instances may be missed because the threshold is not met, and some libraries with a large number of similar features may be mistakenly reported, causing false negatives and false positives, which are described in detail in Section 2.2.
Secondly, the TPL detection results of current approaches are limited to file-level granularity, which is insufficient for uncovering complex reuse relationships and interferes with downstream tasks. Complex reuse relationships include partial reuse and pseudo-propagation reuse (the relation between A and B in which both A and B reuse C), which was proposed in a previous paper [21]. Existing approaches can only detect which TPLs are reused by the target software, but cannot further detect which part of the code in those TPLs is actually reused. These coarse-grained detection results cannot satisfy downstream tasks that require fine-grained results. For instance, ModX [28] demonstrates that a large number of vulnerabilities are often concentrated in a small portion of the software code. Users want to identify which specific parts of the TPLs are reused by the software for refined software management, or to assess whether sensitive portions (vulnerabilities or malware) of the TPLs have been reused. Developing and refining such methods would significantly enhance the effectiveness of TPL detection and reduce the risks associated with the undetected reuse of vulnerabilities or malware.
Thirdly, the datasets used in previous research for TPL detection tasks [22, 24] were built in a single compilation environment, limiting the evaluation of robustness and scalability in real scenarios. Even though existing works have gathered a substantial number of real-world binaries and manually labeled the ground truth to analyze their accuracy, the TPL detection datasets are limited to default optimization options and a single architecture. In many situations, target binaries are compiled for varying architectures and optimization options (e.g., binaries in firmware), but existing works [22, 24] only evaluate the accuracy of the function similarity matching task rather than the TPL detection task under different compilation environments. The absence of diverse datasets increases the difficulty of appraising the scalability of TPL detection approaches. To tackle this issue, it is crucial to develop new datasets that encompass a broader range of optimization options and architectures, capturing the inherent variability and complexity of modern software systems.
In this article, we introduce LibAM, a novel Area Matching framework that exhibits accuracy and robustness across various optimization options and architectures and can detect exact reuse areas. We find that if one function is reused, its callee functions are also reused. Therefore, different from existing works that rely on function-granularity matching, we leverage the function call relationships to connect isolated functions into areas on the FCG and compare the similarity of these areas, rather than single functions, to judge reuse. Function inlining, function call deletion, and other compiler-induced changes that are fatal to isolated function matching have little effect on the graph similarity of FCGs. There are two primary modules in LibAM: Area Generation and Area Comparison. This framework can be effectively employed for two tasks: TPL detection and reuse area detection. The former aims to detect which TPLs are reused by the target binary, and the latter aims to detect which functions (exact reuse areas) of the target binary come from TPLs. LibAM takes target binaries and TPL binaries as inputs, and outputs which TPLs are reused in the target binaries along with a list of reused functions. Therefore, LibAM can provide specific reuse areas for each TPL detection result, thus providing interpretability and enabling the detection of complex reuse relationships.
In detail, LibAM commences by conducting a comprehensive comparison of all target functions and TPL functions through a vector-searching technique. Subsequently, the anchor extension phase connects previously isolated function nodes into areas on the Function Call Graph (FCG). Finally, LibAM assesses whether a particular area has been reused by calculating the similarity of function areas. LibAM outputs candidate TPLs for the TPL detection task while generating a reused function name list for the reuse area detection task. Note that the novel area detection task can help analyze complex reuse relationships, and we obtain two interesting findings based on this in Section 4.8. In addition, we show that by detecting exact reuse areas, vulnerabilities can be associated by matching vulnerable functions rather than vulnerable TPLs, which filters out false positives in vulnerability association caused by partial reuse.
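The observation that a reused function implies its callees are also reused suggests a simple way to grow an anchor into an area: take the anchor plus everything transitively reachable from it on the FCG. The sketch below illustrates this idea only and is not LibAM's actual anchor extension algorithm:

```python
from collections import defaultdict

def anchor_area(fcg_edges, anchor):
    """Grow a candidate area from an anchor function: the anchor plus
    all callees transitively reachable from it on the FCG (given as
    a list of (caller, callee) edges)."""
    calls = defaultdict(set)
    for caller, callee in fcg_edges:
        calls[caller].add(callee)
    area, stack = set(), [anchor]
    while stack:
        func = stack.pop()
        if func not in area:
            area.add(func)
            stack.extend(calls[func])
    return area
```

Areas grown from nearby anchors can then be merged and compared against the corresponding TPL areas as whole subgraphs rather than as isolated functions.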
We evaluate LibAM and the state-of-the-art (SOTA) works, including LibDX, B2SFinder, ISRD, and LibDB, using the public dataset from a previous article [9] as well as the first dataset covering different optimization options and architectures, which we built ourselves. Our experiments show that LibAM outperforms all existing TPL detection works and beats the SOTA work [23] by an average of 24% in precision and 36% in recall, even across different optimization options and architectures. Moreover, compared to previous methods, which fail on the exact reuse area detection task, LibAM demonstrates the feasibility of this task with 0.99 precision and 0.844 recall. We further evaluate the ability of area matching to detect exact reuse areas for complicated reuse relationships and downstream tasks by detecting reuse relationships in large-scale IoT firmware and associating vulnerabilities introduced by TPLs. ModX [28] shows that a large number of vulnerabilities are often concentrated in a small part of the software code. By detecting specific reuse areas, we can determine whether vulnerable functions are reused, thus avoiding false vulnerability associations caused by file-level reuse detection.
We summarize our main contributions below:
— We propose a novel framework, LibAM, which represents a significant improvement over all existing TPL detection works both on the public real-world dataset and across different optimization options and architectures.
— To the best of our knowledge, this is the first work to attempt to detect exact reuse areas, which is beneficial for downstream tasks.
— We build the first dataset spanning different optimization options and architectures for TPL detection, which allows us to evaluate the robustness of existing works.
— We evaluate the scalability of LibAM using a large-scale real-world IoT firmware dataset and generate potential firmware vulnerabilities to show one application of LibAM.
The remainder of the article is organized as follows: Section 2 introduces the problem description, motivation, and notations of this article. Section 3 describes the design details of LibAM. Section 4 compares LibAM with existing TPL detection techniques. Section 5 discusses the results and limitations of LibAM. Section 6 presents related work.
4 Evaluation
In this section, we elaborate on the evaluation process of our proposed approach, LibAM, and discuss its accuracy in various tasks as compared to SOTA works. We provide a comprehensive analysis, including an investigation into the bad cases of existing works, and identify the reasons for their shortcomings. Before that, we describe our experimental setup, the dataset used, and the tools employed for disassembly and coding.
In the experiments, we aim to answer the following research questions:
RQ1: How does LibAM perform in the TPL detection task on the public real-world dataset compared to other works?
RQ2: How does LibAM perform in the area detection task on the public real-world dataset compared to other works?
RQ3: Is LibAM robust in detecting TPLs across different optimization options and architectures?
RQ4: How do the individual components of LibAM impact the final accuracy?
RQ5: How efficient is LibAM compared to other works?
RQ6: How does LibAM perform in detecting TPLs in large-scale real-world firmware?
First, let us describe the experimental setup and the dataset used for evaluation. We employed a high-performance computing environment to ensure accurate and efficient evaluation. The system runs Ubuntu 22.04 and is equipped with an Intel Xeon 128-core 3.0 GHz CPU with hyperthreading, 1 TB of RAM, and two NVIDIA V100 32 GB GPUs. This setup guarantees ample resources for the execution and comparison of different algorithms. In fact, LibAM is lightweight and can even run on machines without a GPU. For the disassembly of binaries, we used IDA Pro 6.8, a widely recognized disassembler and debugger with extensive capabilities for reverse-engineering and analyzing binary files, making it an ideal choice for our evaluation process. The code for our proposed method is primarily written in Python 3.6, with the IDAPython code in Python 2.7. This choice of languages allows us to leverage the extensive library support and ease of use that Python offers.
In this evaluation section, we assess the accuracy of our proposed approach, LibAM, using the widely accepted metrics Precision (P), Recall (R), and F1 score (F1). These metrics are suitable for determining the accuracy of our approach in detecting TPLs and reuse areas. In addition, we performed statistical analysis of the results using violin plots and CDFs (cumulative distribution function plots). In detail, \(TP\) denotes the number of reused TPLs that are correctly detected as reused, while \(TN\) denotes the number of unused TPLs that are correctly detected as unused. Besides, \(FN\) denotes the number of reused TPLs that are incorrectly detected as unused, while \(FP\) denotes the number of unused TPLs that are incorrectly detected as reused. The equations for Precision, Recall, and F1 score are as follows: \(Precision = \frac{TP}{TP + FP}\), \(Recall = \frac{TP}{TP + FN}\), and \(F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}\).
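For concreteness, these metrics can be computed from the raw counts as in this small sketch:

```python
def precision_recall_f1(tp, fp, fn):
    """Precision = TP/(TP+FP); Recall = TP/(TP+FN);
    F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```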
To maintain consistency and facilitate comparison, we keep the vector dimension of both the function and the FCG embeddings at the same level as in Gemini [30], with a dimensionality of 64. This allows for a fair comparison between the accuracy of our approach and other Gemini-based works.
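Conceptually, the anchor search is a nearest-neighbor retrieval over these 64-dimensional embeddings. The brute-force sketch below uses hypothetical names and cosine similarity; a production system would typically replace the dense matrix product with an approximate nearest-neighbor index:

```python
import numpy as np

def find_anchors(target_vecs, tpl_vecs, threshold=0.72):
    """For each target function embedding (one row per function),
    return the indices of TPL embeddings whose cosine similarity
    is at least `threshold` (the anchor threshold from the paper)."""
    t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    l = tpl_vecs / np.linalg.norm(tpl_vecs, axis=1, keepdims=True)
    sims = t @ l.T  # pairwise cosine similarity matrix
    return [np.nonzero(row >= threshold)[0].tolist() for row in sims]
```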
To identify anchor functions and assess the structural similarity of the FCG, we apply optimal threshold values derived from the validation dataset in Dataset-I, which are 0.72 and 0.8, respectively. These threshold values were determined through experimentation and analysis on the validation dataset to achieve the best balance between precision and recall.
As shown in Figure 9, we randomly select a subset consisting of 400 function pairs and FCG pairs from Dataset-I to provide a representative sample for our evaluation. The selected thresholds, 0.72 and 0.8, effectively distinguish between positive (homologous) and negative (non-homologous) samples. This indicates that the threshold values are adept at identifying homologous code while minimizing false positives and false negatives.
Furthermore, we set the alignment length threshold to 3, based on the number of common edges employed in LibDB [24], thus facilitating a comparison between the Anchor Alignment algorithm and the simple common-edge filtering rule in LibDB [24]. This threshold serves as an additional criterion to ensure that the identified reused TPLs have a sufficient level of similarity. We further evaluate the impact of different values of this threshold on LibAM in Section 4.6.
4.1 Dataset
In order to comprehensively evaluate the accuracy of LibAM in different environments, we leveraged one independent dataset for training our model and three additional datasets for evaluation purposes. The details of these datasets are presented in Table 1.
Dataset_OSS: To develop a rich and diverse dataset for training our function embedding model and Embedded-GNN network, we invested significant time and effort into crawling 260 commonly used open-source projects from GitHub [3] and SourceForge [4]. We manually compiled these projects into 22,100 binaries, encompassing three architectures (ARM, x86, x64) and four optimization options (O0, O1, O2, O3). This extensive dataset ensures that our model is well trained and capable of handling various real-world scenarios. We divide it into a training set, validation set, and testing set in a ratio of 8:1:1.
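An 8:1:1 split like the one described can be sketched as follows; the shuffle and seed are illustrative assumptions rather than details from the paper:

```python
import random

def split_8_1_1(binaries, seed=42):
    """Shuffle a list of binaries and split it into train/validation/test
    sets in an 8:1:1 ratio."""
    items = list(binaries)
    random.Random(seed).shuffle(items)
    n_train = int(len(items) * 0.8)
    n_val = int(len(items) * 0.1)
    return (items[:n_train],
            items[n_train:n_train + n_val],
            items[n_train + n_val:])
```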
Dataset_ISRD: To evaluate the ability of LibAM on existing public real-world datasets, we leverage a real-world reuse dataset used by ISRD [22]. The dataset contains 85 binaries from 24 popular open-source projects across various domains, compiled with default optimization options on x64, and includes 74 real partial reuses. As far as we know, this is the only complete public TPL detection dataset. Another TPL dataset, used in LibDB [24], contains only a ground truth file without the corresponding binaries. Even though these target binaries originate from the FedoraLib dataset [24], the names and version numbers in the ground truth file do not correspond to those in the FedoraLib dataset, making it difficult to use this dataset for TPL detection. (For example, the ground truth file lists vorbis, while FedoraLib contains many easily confused binaries such as libsox_fmt_vorbis.so, libvorbis.so.0.4.6, and libvorbisenc.so.2.0.9.)
Dataset_ExtISRD: Recognizing the limitations of existing TPL detection datasets, we expanded the ISRD dataset by manually compiling binaries for three architectures (ARM, x86, x64) and four optimization options (O0, O1, O2, O3). This extended dataset, with 289 binaries and 477 real partial reuses, enables us to assess the accuracy of LibAM under a broader range of conditions, thus providing a more comprehensive evaluation of its capabilities.
Dataset_FW: To evaluate LibAM's ability to tackle large-scale firmware TPL detection tasks, we collected 167 firmware images from 10 different vendors, spanning various device categories such as IP cameras, routers, and switches. We used Binwalk [36] to decompress the firmware and extracted 12,699 binaries. This large-scale dataset allows us to test the applicability of LibAM in real-world situations, further demonstrating its scalability and practicality.
By leveraging these four datasets, we were able to rigorously evaluate the accuracy of LibAM in a variety of environments, ensuring that our model is adaptable and effective in handling real-world challenges. The results from this comprehensive evaluation serve as strong evidence for the suitability of LibAM for deployment in practical TPL detection tasks.
4.2 Compared Methods
In this evaluation, we introduced the comparison between LibAM and various existing methods, examining their strengths and weaknesses in different scenarios. We aimed to provide a comprehensive assessment of the accuracy of both constant-based works and function similarity-based works, and to demonstrate the effectiveness of LibAM. The comparison methods are further elaborated on below:
LibDX: LibDX [23] has the advantage of being a simple and fast TPL detection approach. However, its reliance on constant features, specifically strings, might result in limited detection capabilities when dealing with binaries containing few strings. Additionally, this approach may produce false positives when the logical blocks in the target binary and the TPL are coincidentally similar.
B2SFinder: B2SFinder [21] boasts a comprehensive range of constant features, allowing it to perform a more in-depth analysis when detecting TPL reuse. However, this approach may suffer from increased computational complexity and time cost due to the larger number of features it employs. Moreover, as a constant-based work, B2SFinder still faces performance degradation on binaries with few constant features.
ISRD\(_{Gemini}\): ISRD [22] checks whether more than half of the functions in a TPL match functions in the target binary to determine reuse. However, as ISRD cannot detect TPLs in cross-architecture environments and has no open-source version available, we did not perform a direct comparison between LibAM and ISRD. Instead, we use Gemini [30] as a baseline with the strategy of ISRD, which we call \(ISRD_{Gemini}\). Note that LibAM focuses on the work after function matching and can simply replace Gemini with ISRD to detect anchors in a single environment.
LibDB: LibDB [24] adds FCG information to perform simple filtering of function matching results. It detects whether there are more than three matched functions connected on the FCG to determine reuse. Unlike LibDB, which performs a simple filter on isolated functions, LibAM connects the isolated functions into areas to compare the structural similarity of areas, and solves the overlapping problem of LibDB with the Anchor Alignment algorithm.
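A simplified view of the difference: LibDB's rule counts raw common edges among matched functions, while alignment additionally requires each target function to map to one TPL function consistently, so that call edges are preserved on both FCGs. The sketch below is illustrative only and does not reproduce LibAM's actual Anchor Alignment algorithm:

```python
def aligned_anchor_count(anchor_pairs, target_edges, tpl_edges):
    """Count anchor functions participating in at least one call edge
    preserved on both the target and TPL FCGs, given a one-to-one
    mapping from target anchors to TPL anchors."""
    mapping = dict(anchor_pairs)  # target function -> matched TPL function
    tpl_edge_set = set(tpl_edges)
    preserved = [
        (t1, t2) for t1, t2 in target_edges
        if t1 in mapping and t2 in mapping
        and (mapping[t1], mapping[t2]) in tpl_edge_set
    ]
    return len({func for edge in preserved for func in edge})
```

An area would then be reported as reused only when this count exceeds the alignment length threshold (3 in this article).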
By comparing LibAM with these existing works, we were able to highlight the advantages and unique features of our approach. As for ModX [28], we did not implement it, since it is not open source and focuses on program modularity and reverse program semantic understanding, with TPL detection being only one of its application tasks. Through this evaluation, we demonstrated that LibAM outperforms other methods in various aspects, such as the ability to detect TPLs across different environments, offering comprehensive analysis through the area detection task, and addressing the overlapping issue using the Anchor Alignment algorithm. This comparison demonstrates the effectiveness and superiority of LibAM in TPL detection tasks.
4.3 Answer to RQ1: Accuracy of LibAM in Public Real-World Dataset
In this evaluation, we test the accuracy of LibAM on the public real-world Dataset_ISRD. Table 2 reports the accuracy of LibAM in terms of precision, recall, and F1 score compared with other works.
The results show that LibAM achieves the highest precision and recall, with a precision of 1.0 and a recall of 0.993. LibAM's precision is 8.0% higher than that of the second-ranking LibDX [23], and its recall is 7.3% higher than that of the second-ranking LibDB [24]. LibAM accurately identifies every reuse in Dataset_ISRD except one: the liblzg code area reused by lzbench is too small, causing LibAM to miss it. The reuse area detection in Section 4.4 further explains these good TPL detection results.
To our surprise, LibDX [23] achieves the second-highest precision of 0.92 and a high recall of 0.809, demonstrating that continuous strings in binaries can provide weak semantic information. B2SFinder [21], which uses more types of constant features, achieves only 0.692 precision and 0.644 recall. Although B2SFinder uses many types of features rather than just strings, its use of thresholds at whole-file granularity makes both precision and recall significantly lower than LibDX. Although LibDX achieves good scores, it is still significantly below LibAM due to its harsh requirement of ten continuous strings. For example, LibDX [23] cannot detect lzbench's reuse of csc, because even though the number of common strings is large, the number of continuous strings is only 2.
We analyzed the bad cases and found that both constant-based works [21, 23] have limitations in real-world scenarios for two main reasons. Firstly, some reused binaries have few constant features, such as the brotli code reused in lzbench. Additionally, some binaries slightly modify the reused code to remove string-print instructions without changing the semantics, as in the case of minizip, which removes the string-print instructions in the \(BZ2\_BlockSort\) function of bzip2. Consequently, both constant-based works fail to detect this reuse relation. To deal with this problem, LibAM uses function matching and area matching to detect functions that contain few or no strings.
Both the precision and recall of \(ISRD_{Gemini}\) [22] are the lowest, which proves that directly using function similarity results as the TPL detection result cannot yield satisfactory outcomes. Although Figure 9 demonstrates that Gemini has a high ability to distinguish between homologous and non-homologous function pairs, it is quite difficult to retrieve homologous functions at large scale, which is consistent with the results of jTrans [37].
To handle this problem, LibDB [24] leverages common edges on the FCG to filter the function similarity results and achieves a higher recall. However, LibDB [24] uses only three common edges to filter functions, which is too simple a rule and causes many false positives. Moreover, the overlapping phenomenon further aggravates the problem.
In contrast, based on the function matching results, LibAM expands the comparison granularity to areas on the FCG and further checks area similarity using area structural similarity and anchor alignment length, thus obtaining good results.
4.4 Answer to RQ2: Accuracy of Detecting Reuse Areas and the Interpretable Evidence for TPL Detection
In this evaluation, we employ Dataset_ISRD to evaluate the exact area detection ability of LibAM. Three of us manually analyzed the exact reuse areas as ground truth over the course of a week. Since neither the target binaries nor the TPL binaries in the ISRD dataset have their function names stripped, we can easily analyze which functions are reused and generate ground truth from function names. However, the function names in some reuse areas may be slightly modified and cannot be directly compared as strings, so we manually filtered them one by one to determine the exact reuse areas.
As shown in Table 3, LibAM can accurately detect reuse areas, achieving 0.985 precision and 0.847 recall by filtering at the function level, area structure level, and area node level. To avoid false positives, we require an alignment length of more than 3 before an area is recognized as a reuse area, which makes some small areas easy to miss; as a result, recall is not as high as precision. Nevertheless, the results show that LibAM is able to detect accurate reuse areas to support interpretable TPL detection results.
The previous works only performed TPL detection without area detection, so they cannot be used directly for area detection. We therefore made simple modifications to the existing works to support area detection and compared them with LibAM. Specifically, we treat the functions that use the matched constant features in LibDX and B2SFinder as reuse areas. For \(ISRD_{Gemini}\), we take the matched functions as the reuse area. For LibDB, we use the functions that satisfy the three-common-edge rule as the reuse area.
LibDX [23] still has the second-highest precision of 0.779, but its recall is only 0.291. Our analysis of bad cases shows that a large number of matched strings in LibDX are not referenced by any function, which makes it difficult to determine which piece of code is being reused.
The same problem occurs in B2SFinder [21], which uses a wider variety of features and has a significantly higher recall than LibDX, but still only 0.372. This demonstrates that while constant-based works are convenient and effective, they struggle to identify which code areas are actually reused, and the matched features are sometimes difficult to use as interpretable evidence that a TPL is actually reused.
Function similarity-based works have a significantly higher recall due to the comparison of all functions. Even \(ISRD_{Gemini}\), which directly uses the matched functions as the reuse area, has a recall of 0.573. After filtering with 3 common edges, LibDB obtains a precision of 0.613 and a recall of 0.719. However, there is still a big gap between them and LibAM.
4.5 Answer to RQ3: Accuracy of LibAM in Different Architectures and Optimization Options
In this evaluation, we aimed to assess the robustness of the works under extreme conditions by leveraging Dataset_ExtISRD. We designed a series of experiments to test the accuracy of the works when faced with different compilation option combinations and architectural variations. Our goal was to evaluate the adaptability of the works in handling various real-world scenarios, ensuring that they are applicable across a wide range of situations.
Table 4 presents every compilation option combination for x64-x64, where both target and TPL binaries are compiled for x64. This setup allows us to evaluate the accuracy of the works when dealing with binaries compiled using various optimization levels and options. By examining how the works handle these diverse combinations, we can better understand their ability to cope with complex and challenging scenarios.
In Table 5, we showcase every architecture combination with O2-default options, where target binaries are compiled with O2 optimization while TPL binaries are compiled using default options. This experiment is designed to test the works' robustness when faced with discrepancies between target and TPL binaries in terms of architecture, which is particularly relevant for real-world IoT firmware.
The \(Mix\) entry in Table 5 represents a combination of all 12 optimization option and architecture variations. In this experiment, we aimed to evaluate the works' accuracy under a more complex and diverse set of conditions, simulating the challenges they may encounter in real-world situations. By testing the works' adaptability to this wide range of scenarios, we can gain insights into their overall robustness and resilience.
It is important to note that the default compilation option in Dataset_ExtISRD is a mix of O2 and O3 optimization levels, rather than a single optimization level. This choice reflects the reality of software development, where binaries are often compiled using a combination of optimization levels to balance performance and code size. This mixed optimization setting further enhances the complexity and diversity of the dataset, providing a more challenging testbed for the works under evaluation.
LibAM achieved an average precision of 0.93 and recall of 0.94 across different optimization options, and a precision of 0.96 and recall of 0.96 across different architectures. Cross-optimization settings have a greater impact on LibAM than cross-architecture ones, but LibAM still maintains good results even under the O0 option. These results indicate that LibAM consistently achieves high accuracy in all environments.
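The precision and recall figures above follow the standard definitions over detected versus ground-truth reuse pairs. As a minimal sketch (the target and TPL names below are invented for illustration, not taken from the evaluation):

```python
def tpl_metrics(detected, ground_truth):
    """Precision/recall/F1 over (target, TPL) reuse pairs."""
    detected, ground_truth = set(detected), set(ground_truth)
    tp = len(detected & ground_truth)   # correctly detected reuses
    fp = len(detected - ground_truth)   # spurious detections
    fn = len(ground_truth - detected)   # missed reuses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented toy pairs, purely for illustration.
p, r, f1 = tpl_metrics(
    detected=[("target1", "zlib"), ("target1", "bzip2"), ("target2", "minizip")],
    ground_truth=[("target1", "zlib"), ("target2", "minizip"), ("target2", "brotli")],
)
```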
Due to the natural robustness of constant features under different compilation options, the results of B2SFinder are very stable. In contrast, the accuracy of LibDX [
23] is significantly degraded due to the changes in the order of strings in different environments. The results show that the two Constant-based works are more stable across optimization options and architectures, but their scores are not ideal.
The results of the two function similarity-based works vary considerably from one compilation environment to another. LibDB has a recall of up to 0.85, while in the worst case it is only 0.47. In the extreme O0 case, \(ISRD_{Gemini}\) achieves a precision of only 0.13. This is because homologous functions compiled in different environments change substantially, which degrades the function similarity matching results. This indicates that TPL detection based on isolated function matching is unstable. In contrast, LibAM greatly reduces the impact of different compilation environments on function matching by connecting isolated functions into areas and comparing area similarity, thus filtering out many mistakes.
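The idea of connecting isolated matched functions into areas can be sketched as expanding each anchor to the call-graph nodes reachable from it. This is a simplified reading of the area construction, with an invented toy FCG, not LibAM's actual implementation:

```python
from collections import deque

def anchor_area(fcg, anchor):
    """Expand a matched anchor function to all call-graph nodes reachable from it."""
    area, queue = {anchor}, deque([anchor])
    while queue:
        fn = queue.popleft()
        for callee in fcg.get(fn, ()):
            if callee not in area:
                area.add(callee)
                queue.append(callee)
    return area

# Toy FCG: function name -> list of callees (invented for illustration).
fcg = {"f_main": ["f_a", "f_b"], "f_a": ["f_c"], "f_b": [], "f_c": []}
```

Comparing whole areas rather than single functions means one perturbed function rarely flips the overall decision, which is what makes the approach stable across compilation environments.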
In Table
6, although LibAM still achieves high scores for the TPL detection task on datasets with cross-architecture and optimization options, the results for area detection drop significantly. Due to strict conditional filtering, LibAM still achieves a precision of 0.941 on the area detection task, but the recall is only 0.462. Nevertheless, LibAM still outperforms all other methods, which have unsatisfactory results on this challenging task. The reason for the significant decrease in recall is that the FCG changes across optimization options and architectures, and the Area Detection task is very demanding, so LibAM ensures high precision through strict filtering, but at the cost of some recall.
In fact, detecting reused areas across optimization options and architectures is a serious challenge, and this is the first work to evaluate it. The precision and recall of existing methods are therefore less than satisfactory. Despite the difficulty of the task, LibAM still clearly outperforms the existing methods, and its high precision increases its usability. Better function similarity matching tools may further improve precision in the future, and new area matching methods are expected to address the low recall.
Figure
10 presents violin plots illustrating the distribution of precision, recall, and F1 value across various architectures and optimization options. Notably, the results of LibAM are predominantly clustered around 1.0, demonstrating a significantly superior and more concentrated accuracy compared to other methods. The results of LibDB [
24] are mainly concentrated around 1.0 and 0.2, attributable to its high recognition accuracy for a limited number of samples that have fewer reuse relationships, while the accuracy for other samples is generally low. A substantial number of samples in
\(ISRD_{Gemini}\) [
22] lie in close proximity to 0, as relationships with a reuse proportion of less than half are frequently undetected by
\(ISRD_{Gemini}\) [
22]. The distribution patterns of the two Constant-based works [
21,
23] are similar, exhibiting a polarized distribution between 1.0 and 0. This is because they have good detection accuracy for samples with rich strings, but poor detection ability for samples with fewer strings.
4.6 Answer to RQ4: Impact of Each Part in LibAM
In order to evaluate the impact of the Anchor Alignment algorithm and the Embedded-GNN algorithm on the accuracy of our proposed LibAM framework, we conducted two separate ablation experiments, as in Tables
4 and
5. The first variant,
\(LibAM_{-align}\), is a version of LibAM without the Anchor Alignment algorithm, while the second variant,
\(LibAM_{-gnn}\), is a version of LibAM without the Embedded-GNN algorithm. By analyzing the results of these experiments, we aim to shed light on the individual contributions of these two algorithms to the overall accuracy of the LibAM framework. In addition, we conducted sensitivity experiments on the selection of threshold
\(n\) for the Anchor Alignment algorithm and tested the trend of precision and recall with changes in the threshold.
In the case of \(LibAM_{-align}\), after obtaining anchors via function matching, the framework solely relies on the GNN algorithm to compare the structural similarity of the areas in question. An area is deemed as a reused area if its structural similarity score surpasses a predetermined threshold value. While this approach attains a perfect recall score of 1.0 across all tested scenarios, the precision remains consistently below 0.2. This is primarily due to the presence of numerous similar CFGs and FCGs within non-homologous functions, which leads to a high rate of false positives. These results emphasize the vital role of the Anchor Alignment algorithm in further filtering out false positives and improving the precision of the LibAM framework.
On the other hand, \(LibAM_{-gnn}\) disregards the structural similarity of areas altogether. Instead, after obtaining anchors, it uses only the anchor alignment length to determine whether an area has been reused. This approach bears a resemblance to the LibDB framework but with a few key differences. Specifically, \(LibAM_{-gnn}\) addresses the overlapping phenomenon present in LibDB by employing the Anchor Alignment algorithm. This results in a significant improvement in both precision and recall rates, owing to the strict one-to-one alignment relationship between anchors and the absence of proportional limitation used in LibDB. However, despite these improvements, a substantial accuracy gap remains between \(LibAM_{-gnn}\) and the full LibAM framework. This indicates that considering the structural similarity of areas is instrumental in filtering out false positives and enhancing the overall accuracy of the LibAM framework.
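The two ablated decision rules, and the full framework's combination of both signals, can be contrasted in a minimal sketch. The threshold values and score names here are illustrative placeholders, not LibAM's tuned parameters:

```python
def decide_libam_minus_align(gnn_sim, sim_threshold=0.8):
    """Ablation 1: structural (GNN) similarity alone -- high recall, low precision."""
    return gnn_sim >= sim_threshold

def decide_libam_minus_gnn(align_len, n=3):
    """Ablation 2: count of one-to-one aligned anchors alone."""
    return align_len >= n

def decide_libam(gnn_sim, align_len, sim_threshold=0.8, n=3):
    """Full framework: an area must pass both signals to count as reused."""
    return gnn_sim >= sim_threshold and align_len >= n
```

Requiring both signals is what filters the structurally similar but non-homologous areas that the GNN-only variant accepts, while the GNN check in turn rejects coincidental anchor alignments.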
In conclusion, our ablation experiments demonstrate the considerable contributions of both the Anchor Alignment algorithm and the Embedded-GNN algorithm to the effectiveness of the LibAM framework. The Anchor Alignment algorithm is crucial for filtering out false positives and improving precision, while the Embedded-GNN algorithm plays a significant role in further refining the identification of reuse areas. By incorporating these two algorithms, the LibAM framework achieves more robust and accurate detection of TPLs and exact reuse areas.
We further investigated the impact of varying threshold values
\(n\) within the Anchor Alignment algorithms on LibAM’s accuracy in TPL detection tasks, utilizing both Dataset_ISRD and Dataset_ExtISRD. For this experiment, in Table
7, n values ranged from 1 to 5.
For Dataset_ISRD, the results demonstrate that precision increases progressively with the rise in n, attaining its maximum at n = 3. This improvement can be attributed to the algorithm’s enhanced ability to differentiate between true and false reuse relationships as the threshold becomes more stringent. Conversely, recall diminishes gradually as n increases, with a substantial decline observed at n = 4. This decline is likely due to the higher threshold inadvertently excluding some relevant reuse relationships. Nevertheless, LibAM demonstrates satisfactory accuracy on Dataset_ISRD for both n = 3 and n = 4, achieving a balance between precision and recall.
In the case of Dataset_ExtISRD, which comprises samples with more intricate reuse patterns, precision consistently increases with n for values below 4. However, when n is greater than or equal to 4, the average precision begins to decline because some samples go entirely undetected, which yields a precision of 0 for them. This suggests that a higher threshold can impede the algorithm's sensitivity to subtle reuse patterns. Meanwhile, recall still decreases significantly as n grows, indicating that larger thresholds are more likely to miss smaller reuse areas. Overall, the evaluation shows that, with an appropriate threshold, the algorithm can accurately discern genuine reuse relationships even in these more complex scenarios.
In summary, our experiments reveal that for both datasets, the optimal results are achieved when n is set to 3. This value represents a balance between precision and recall, allowing LibAM to effectively identify true reuse relationships while minimizing false positives and false negatives. These findings have implications for the tuning of the Anchor Alignment algorithm in TPL detection tasks, shedding light on the importance of selecting appropriate threshold values to optimize accuracy.
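The sensitivity sweep described above amounts to choosing the n that maximizes F1 over the measured precision/recall pairs. A sketch with invented per-n scores shaped like the reported trend (not the paper's measured values):

```python
def best_threshold(scores_by_n):
    """Map n -> (precision, recall); return the n with the highest F1."""
    def f1(p, r):
        return 2 * p * r / (p + r) if p + r else 0.0
    return max(scores_by_n, key=lambda n: f1(*scores_by_n[n]))

# Invented per-n scores: precision rises and recall falls as n grows.
scores = {1: (0.70, 0.99), 2: (0.85, 0.97), 3: (0.95, 0.94),
          4: (0.96, 0.70), 5: (0.90, 0.50)}
```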
4.7 Answer to RQ5: Efficiency of LibAM
In this evaluation, we assessed the time cost for detecting Dataset_ISRD and Dataset_ExtISRD as presented in Table
8. We conducted a comparative analysis of each step in the approaches and calculated the time required for the TPL detection task and the Area detection task separately to analyze the efficiency of LibAM in relation to existing works.
Feature extraction represents the Feature Extraction phase, while
Embedding represents the function or area embedding phase.
TPL detection refers to the time cost from the completion of the Embedding phase until the TPL detection task is finished, and
Area detection refers to the time cost for the Area detection task. The full-time costs for the TPL detection task and the Area detection task are denoted as
\(All_{TPL}\) and
\(All_{area}\), respectively.
For constant-based works, the most significant time-consuming aspect is the constant comparison phase. The feature extraction phase in constant-based works, which focuses on extracting string or other constant features, is much faster than in function similarity-based works. LibDX [
23], which employs only string features accelerated by a backward indexing algorithm, is the fastest of all methods. On average, LibDX takes only 15.1 seconds to complete the detection of the target binary. B2SFinder [
21], on the other hand, takes a longer time to match features due to the inability to accelerate the comparison of if/switch features using inverted indexes or prefix trees like strings or arrays. B2SFinder takes 77.9 seconds to detect a target binary, making it the most time-consuming of all TPL detection methods.
For function similarity-based works, the most time-consuming process is feature extraction via IDA Pro. In
\(ISRD_{Gemini}\), more than 90% of the time is spent on extracting function features. In LibDB, the TPL detection phase is also time-consuming. However, due to the acceleration strategy detailed in Section
3, LibAM is faster than both LibDB and B2SFinder in the TPL detection task, with 80% of the time allocated to feature extraction. While LibAM is more time-consuming in the area detection task, it is the only method capable of accurately identifying reuse areas.
Furthermore, since only the detected reused relations, rather than all TPLs, are needed to further identify the reuse areas, the time consumption of the area detection phase is manageable. Moreover, as LibAM uses the top 200 TPLs and top 100 function limits, which are the same as LibDB, the detection time cost does not increase with the size of the TPL database. This aspect contributes to the overall efficiency of the LibAM framework when compared to existing works.
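The fixed-budget candidate selection mentioned above can be sketched as follows: only the top 200 candidate TPLs (by a cheap pre-score) and the top 100 matched functions per TPL are carried into the expensive matching stage, so cost stays flat as the TPL database grows. The scoring data here is an invented placeholder, not LibAM's actual ranking:

```python
def select_candidates(tpl_scores, func_scores_by_tpl, tpl_limit=200, func_limit=100):
    """Keep only the top-scoring TPLs and, per TPL, the top-scoring matched functions."""
    top_tpls = sorted(tpl_scores, key=tpl_scores.get, reverse=True)[:tpl_limit]
    return {
        tpl: sorted(func_scores_by_tpl[tpl],
                    key=func_scores_by_tpl[tpl].get, reverse=True)[:func_limit]
        for tpl in top_tpls
    }
```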
4.8 Answer to RQ6: Accuracy of Detecting Large-Scale Reuse Relation
In this evaluation, we assessed the ability of LibAM to detect large-scale real-world reuse relations and validated the feasibility of using reuse area detection to associate vulnerabilities. Furthermore, we analyzed the detection results and made several interesting findings.
For target binaries, we selected 10 vendors and collected 30 firmware images for each vendor. Then, we used Binwalk [36] to extract the file systems and binaries in the firmware. 164 firmware images were extracted successfully, from which we obtained 12,699 binaries, as in Table 9.
For TPL binaries, we first collected widely used TPLs from Conan [1] and Vcpkg [2]. Then, we compiled some well-known projects from GitHub [3] and SourceForge [4] ourselves. Besides, we also extracted known TPLs from some Linux-based operating systems. Finally, we selected those with public vulnerability information and obtained 216 TPL binaries from these candidates.
LibAM detected 3,353 TPLs, as shown in Table
9. This entire process was completed in 30 hours. We counted the TPLs with the widest impact range, as shown in Figure
11. Figure
11(a) shows the top 10 TPLs that are reused in the highest number of firmware and
busybox is the most reused TPL.
Figures 11(b) and 11(c) show the top 10 TPLs affecting the most vendors and binaries, respectively; they are slightly different from the TPLs in Figure
11(a).
In addition to detecting TPLs, we want to further associate vulnerabilities that may be introduced by TPL reuse, which is one of the downstream tasks. Firstly, we used a web crawler to collect the public vulnerability data of the detected TPLs from the CVE [
11] and NVD [
12] sites. Then, we utilized an existing technique [
9] to extract all strings from the TPLs of vulnerability-related versions, enabling us to identify the specific version of the detected TPLs through string matching. After that, we associated 2519 CVEs and generated 14,863 potential vulnerabilities for 167 firmware. In Figure
11(c), we listed the top 10 vulnerability types (CWE) that have the highest number of CVEs.
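The version pinning step above can be sketched as simple string-set matching, in the spirit of the cited technique [9]: pick the candidate version whose extracted strings best overlap those found in the target. The per-version string sets below are invented placeholders:

```python
def identify_version(target_strings, version_strings):
    """Return the candidate version whose strings overlap the target the most."""
    target = set(target_strings)
    best, best_hits = None, 0
    for version, strings in version_strings.items():
        hits = len(target & set(strings))
        if hits > best_hits:
            best, best_hits = version, hits
    return best

# Toy per-version string sets for a hypothetical bzip2 TPL.
versions = {
    "1.0.6": ["bzip2-1.0.6", "BZ2_decompress: bad magic"],
    "1.0.8": ["bzip2-1.0.8", "BZ2_decompress: bad magic"],
}
```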
Note that many software vulnerabilities affect only some versions, and we filter out non-vulnerable versions by combining existing version identification work [9]. In addition, many TPLs are only partially reused, so vulnerable functions may not be present in the target software [9]. As a result, existing methods tend to produce numerous false positives when their detected TPLs are directly associated with vulnerabilities. LibAM addresses this problem through its reuse area detection technique: we can extract the vulnerable function names from patches and match them against the reused function names in the TPLs detected by the Area Matching framework. Out of the 2,519 CVEs that we associated, 1,240 had patches available, enabling us to further filter false positives. After the filtering process, we reduced the number of vulnerabilities associated with TPLs from 36,135 down to 14,863. To explain how LibAM filters incorrect vulnerabilities, we present two examples. In the case of CVE-2014-9485, there is a Path Traversal vulnerability in the
\(do\_extract\_currentfile\) function within the
minizip library. While both
precomp and
lzbench use parts of
minizip’s code, they do not use the specific
\(do\_extract\_currentfile\) function. As a result, there would be two false positive vulnerabilities without area detection. LibAM handles this by identifying the reused areas and removing the false positives. Additionally, for CVE-2019-12900, there is a vulnerability involving an out-of-bounds write operation in the
\(BZ2\_decompress\) function in version 1.0.6 of the
bzip2 library. Versions of
minizip from 2.0.0 to 2.7.5 use this vulnerable version of
bzip2, making them vulnerable to this particular issue. However,
lzbench is not affected because it uses versions of
bzip2 released after 1.0.6. Therefore, identifying the specific library versions also helps to eliminate incorrect vulnerability warnings.
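The patch-based filter described above can be sketched as a simple intersection: a CVE is kept only if a function named in its patch also appears in the reuse area LibAM detected. The CVE-to-function mapping mirrors the two examples above, while the reuse area contents are invented for illustration:

```python
def filter_cves(cve_patched_funcs, reused_funcs):
    """Keep only CVEs whose patched functions appear in the detected reuse area."""
    reused = set(reused_funcs)
    return [cve for cve, funcs in cve_patched_funcs.items()
            if any(f in reused for f in funcs)]

# Patched-function names taken from the two examples above.
cves = {
    "CVE-2014-9485": ["do_extract_currentfile"],  # minizip path traversal
    "CVE-2019-12900": ["BZ2_decompress"],         # bzip2 out-of-bounds write
}
```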
Furthermore, by analyzing the results, we have the following findings, which demonstrate the benefits of area detection for discovering complex reuse relationships.
Different target binaries tend to reuse the same area of a TPL. In the detection results of Dataset_FW, different areas of TPLs show different tendencies to be reused. In Figure
12(a), more than 85.8% of TPLs meet the requirement of having the area reused by more than 50% of the detected target binaries. Besides, 43.5% of TPLs have more than 90% of target binaries reusing the same area of it. This finding aligns with the ground truth from the public Dataset_ISRD. For example, all four target binaries that reuse the
Bzip2 share the same
BZ2_bzBufferToBuffCompress and
BZ2_bzBufferToBuffDecompress functions, as well as their respective subfunctions. This suggests that some areas of code in TPLs are more likely to be reused than others and that vulnerabilities in these code areas are more widespread in their impact. Consequently, when a code area from a TPL is identified as being reused, it is more probable that this area will also be reused by other software. Based on this finding, researchers can allocate additional manual analysis resources to scrutinize code areas that exhibit a higher likelihood of reuse, thus enhancing the effectiveness of vulnerability detection and mitigation efforts.
There are numerous identical reuse areas in different TPLs. In Figure
12(b), more than 97.5% of target binaries meet the requirement that the same area is detected to have reuse relationships with more than two TPLs. Moreover, the same area can be detected in up to 15 different TPLs. This is also reflected in the ground truth of Dataset_ISRD. For instance,
lzbench reuses a large number of functions from
precomp, many of which are actually from
Brotli. Dataset_ISRD only analyzes that
lzbench reuses
precomp and
Brotli, but after area detection, we find that these two reuses are actually the same area of code. This shows that area detection is able to detect complex reuse relationships. In addition, different areas of the TPLs may represent different functionalities. Distinguishing between areas that are reused by more TPLs and more specific areas belonging to only one TPL facilitates further manual analysis, e.g., analysis of areas that are reused more often will lead to more valuable results.
5 Discussion
LibAM achieved high scores on both the TPL detection and reuse area detection tasks. Although the accuracy of reuse area detection was not as high as that of TPL detection, we have shown that the Area Matching framework is feasible, and further research into better Area Matching frameworks is worth pursuing. Moreover, we believe that the FCG-level Area Matching framework has research prospects for other applications of binary similarity calculation, such as malware detection, software infringement, and patch analysis.
Software reuse relationships are often complicated. We have addressed the issue of partial reuse through the Area Matching framework. Detection at the area level makes the detection granularity consistent with the reuse granularity, greatly reducing the mistakes of existing approaches that set thresholds at the level of whole TPL files.
The existing works [
22,
24,
28], including LibAM, are based on the results of function similarity calculation methods. However, even the SOTA function similarity calculation methods may yield many false positives and false negatives in different optimization options and architectures [
37]. Although LibAM has significantly reduced this impact through the Area Matching framework, the impact still exists. A better function similarity calculation method could further improve the accuracy of TPL detection.
Compared to existing methods, LibAM focuses on improving TPL detection at the area level, leaving improvements to the underlying function similarity calculation for future research. We performed TPL detection on target binaries with different optimization options and architectures to demonstrate the stability of LibAM. We did not run similar experiments across compilers because, in IoT firmware, we found that the main impact comes from differing compilation options; the compiler is almost always gcc, and the operating system is mostly Linux-based. We leave more diverse scenarios for future research.
Some callee functions may be called externally, such as using dynamic links. This can lead to differences in the target FCG and TPL FCG. However, by using the external dynamic link library as the target binary for LibAM, we can detect the missed TPLs caused by the FCG difference. In fact, in Dataset_ISRD, all callee functions are within the binary. The only observed external call is the use of an additional springboard function for calls to the C standard library function in the x64 architecture, which resulted in a minor difference from other architectures. Nevertheless, this difference did not have a significant impact on the detection process.
Some TPL detection works treat different versions of TPL as different libraries [
20,
21,
28], and other TPL detection works also do simple version identification by using the TPL detection technology [
9,
24]. We believe that version identification using coarse-grained features at the library level is inaccurate because TPL detection aims to detect roughly similar code in a large amount of different code, while version identification aims to detect fine-grained differences in a large amount of the same code. We consider the TPL detection work as orthogonal to the version identification work. It is worth noting that LibAM can be directly combined with existing version identification work, as in Section
4.8, where we use the version identification method in OSSPolice [
9].
There is a bias between the list of potential vulnerabilities generated in Section
4.8 and the real vulnerabilities. Even though we have filtered out many false positives through version and reuse area location information, the target binary may have patched known vulnerabilities, or the vulnerabilities may not be triggerable, which is a common problem faced by all TPL detection works. Vulnerability verification work [
38,
39,
40,
41,
42,
43,
44] needs to be done to further verify the quality of the potential vulnerabilities. However, this article has demonstrated the accuracy of TPL detection and reuse area detection compared with existing methods, and vulnerability association is only one application of LibAM. We regard vulnerability verification as orthogonal work for future research.