Abstract
Open-source software (OSS) is a critical element in the design and operation of complex cyber-physical systems. Contributions to OSS projects typically result from voluntary work by researchers, software developers, hackers, and even opportunistic programmers. These communities often operate on a basis of trust, and although they strive to evaluate the technical correctness and merits of contributed code, the processes they use are usually loosely supervised. Social rules, trust, reputation, and even arcane processes often govern these communities. While these components have undoubtedly contributed to the growth and expansion of OSS, they can also create opportunities for subversion [3], hindering the reliability of an OSS project. This, in turn, could compromise not only the integrity of cyber-physical systems that depend on OSS but also their performance.
The risks of new and emerging socio-technical attack vectors on cyber-physical systems that rely on OSS are real, broad, and growing [8]. It is therefore essential for the cyber-defense community to develop a comprehensive and deep understanding of the socio-technical behaviors and behavioral dynamics involved in these attacks, along with mechanisms to extract the latent information hidden in these operations. Much of the previous research on understanding socio-technical behavior in OSS projects has taken a static view of the problem, paying close attention to individual and publicly available traces of information involving source code, commits, logs, or external packages (e.g., [7]). However, social-cyber operations are not static [6]. Instead, they can change over time to help potential contributors build a reputation and eventually become project committers (as seen in the case study of a “successful socialization” scenario in Ducheneaut [2]). Furthermore, some of these dynamics, particularly those related to vulnerability fixes, may occur behind closed doors and remain black boxes of complexity [4]. Introspecting the multiple streams of information produced by social and technical interactions across and between development channels (such as mailing lists, version control systems, and source code) can help us open this black box and build a high-fidelity model of socio-technical behavior. Such a model can be operationalized as an early warning mechanism that highlights emergent social-cyber operations aiming to undermine the integrity of OSS projects and their dependent cyber-physical systems. This paper summarizes SIGNAL, a single, coherent software introspection capability for signaling social-cyber operations against cyber-physical systems that depend on OSS projects.
As shown in Figure 1, SIGNAL views an OSS project as a changing artifact that grows and evolves over time through socially vetted modifications submitted by programmers. SIGNAL is grounded in three key, interconnected components: (1) Explainable persuasive behavior extraction (Yellow Patch), (2) Graph-based revision history analysis (Sensor), and (3) Self-supervised mechanisms for dynamic trace analysis (Antenna). In the first component, SIGNAL combines white-box transfer learning for Random Forest with exploratory factor analysis to compute an accurate model of the persuasive developer action flows emerging within a project’s social and technical channels, effectively linking key traces of developer social and technical interactions to their associated traces of code modifications. The computed model achieves accuracy comparable to the state of the art (~68%) [9] with a 16× faster training time. In the second component, SIGNAL introduces a novel graph-based pattern mining approach for detecting API misuses that originate from persuasive developer activities. This component examines chains of code changes in OSS projects, mining structural and semantic patterns to identify API misuses. In the third component, SIGNAL combines the output of the first two components and performs self-supervision on their temporal ordering to learn dynamic developer activity embeddings. These embeddings can be used to track the semantic evolution of developer contribution ploys. An advantage of using an embedding approach to track the semantic evolution of socio-technical behavior is that it produces a natural “backtrace” of contributors’ modus operandi, detailing how their actions exploit seams within a project to influence technical change.
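To make the first component more concrete, the sketch below shows the general shape of a pipeline that couples exploratory factor analysis with a random forest, in the spirit of Yellow Patch. It is a minimal sketch under stated assumptions: the feature layout, the toy labels, and the hyperparameters are illustrative, and it omits SIGNAL’s white-box transfer-learning step.

```python
# Minimal sketch, assuming per-developer socio-technical interaction
# features have already been extracted. Feature semantics, labels, and
# hyperparameters below are illustrative, not SIGNAL's implementation.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# X: one row per developer action flow (e.g., counts of mailing-list
# replies, patch revisions, review comments, commit touch points);
# y: 1 if the flow was labeled persuasive, 0 otherwise (toy labels here).
X = rng.poisson(lam=3.0, size=(500, 12)).astype(float)
y = rng.integers(0, 2, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Exploratory factor analysis compresses the raw interaction counts into
# a few latent behavioral factors; the random forest then scores
# persuasive behavior over those factors, keeping the model inspectable.
model = make_pipeline(
    FactorAnalysis(n_components=4, random_state=0),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```

Because the random forest operates on a small set of named latent factors rather than raw event counts, its feature importances remain interpretable, which is what allows persuasive action flows to be traced back to specific social and technical channels.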
Case Study: The Evolution of Hypocrite Commits in the Linux Kernel. In recent work [5], we assessed the effectiveness of SIGNAL in introspecting a well-documented social engineering attack against the Linux Kernel: the “hypocrite commits” [11]. “Hypocrite commits” refer to scenarios where an attacker exploits the social landscape of an OSS project, the Linux Kernel in this case, to earn the trust of maintainers before introducing malicious code or malware that can lead to critical vulnerabilities in the project or its subsystems. Our SIGNAL analysis of the 2020 social engineering attack against the Linux Kernel revealed new and distinct social-cyber operation traces, as depicted in Figure 1 of our recent study. In [5], we sought to capture the dynamics of the influence-seeking and trust-building operations carried out by adversaries seeking to acquire write permissions to an OSS project. Additionally, we drew similarities between the OSS development life cycle and online social networks [10] and introduced the concept of trust ascendancy, which describes influence-seeking and trust-building operations that aim to change a project’s technical direction.
In our SIGNAL analysis of the “hypocrite commits” attack, we collected mailing-list, patch, and commit data from August to November 2020, the period when the attack took place [1]. Our approach was hybrid: it formulated the analysis as an unsupervised learning task with a self-supervised learning twist. Through our experiments, we successfully captured the modus operandi trajectories followed by the aliases involved in the attack and identified a series of potentially influenced maintainers and core contributors. In the process, we also identified a series of trust ascendancy classes, such as opportunistic or awry trust ascendancy.
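As a rough illustration of this hybrid formulation, the sketch below embeds per-alias, time-ordered activity traces with a skip-gram model and then clusters the resulting trajectory vectors. The action tokens, aliases, and cluster count are toy assumptions for exposition, not the data or the exact method of [5].

```python
# Minimal sketch, assuming mailing-list/patch/commit events have been
# flattened into per-alias, time-ordered action token sequences; all
# tokens, aliases, and the cluster count are illustrative assumptions.
import numpy as np
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

# Each "sentence" is one alias's chronologically ordered activity trace.
traces = {
    "alias_a": ["reply", "review", "patch_v1", "patch_v2", "commit"],
    "alias_b": ["reply", "reply", "patch_v1", "reply", "patch_v2"],
    "alias_c": ["review", "review", "commit", "commit", "commit"],
}

# Self-supervision: skip-gram predicts neighboring actions from their
# temporal context, yielding activity embeddings without manual labels.
w2v = Word2Vec(sentences=list(traces.values()),
               vector_size=16, window=2, min_count=1, sg=1, seed=0)

# Represent each alias trajectory as the mean of its action embeddings,
# then cluster trajectories into candidate trust-ascendancy classes.
alias_vecs = np.stack(
    [np.mean([w2v.wv[t] for t in seq], axis=0) for seq in traces.values()])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(alias_vecs)
for alias, label in zip(traces, labels):
    print(alias, "-> trajectory class", label)
```

In practice, trajectory classes recovered by a clustering step of this kind would still need to be inspected against the mailing-list record to separate benign socialization from opportunistic or awry trust ascendancy.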
Remarks. SIGNAL makes the following technical contributions: (1) an explainable model of persuasive developer behavior that links traces of social and technical interactions to their associated code modifications; (2) a graph-based pattern mining approach that detects API misuses originating from persuasive developer activities; and (3) a self-supervised mechanism that learns dynamic developer activity embeddings, enabling a “backtrace” of contributors’ modus operandi and the tracking of trust ascendancy operations.
Moving forward, we aim to scale SIGNAL to new case studies and to larger volumes of diverse socio-technical activity data. Our goal is to chart the strategic landscape of influence-seeking and trust-building operations in OSS development while avoiding information overload and unnecessary CPU-intensive data operations. We anticipate these efforts will facilitate new research in secure and continuous software development, benefiting the design and development of complex cyber-physical systems.