6.2 Performance Evaluation
Runtime performance overhead with SPEC CPU2006. Figure
5 shows the performance overhead of SPEC with
Mardu trampoline-only instrumentation (which does not use a shadow stack) as well as with a full
Mardu implementation. Both numbers are normalized to the unprotected, uninstrumented baseline compiled with vanilla Clang. Note that this is the baseline overhead incurred by security hardening an application with
Mardu. In the rare case that the application comes under attack, on-demand re-randomization is triggered, inducing brief additional performance overhead. We discuss the performance overhead of
Mardu under active attack in Section
6.3.
Figure
5 does not include a direct performance comparison to other randomization techniques, as
Mardu is substantially different in how it implements re-randomization and the source code of closely related systems, such as Shuffler [
74] and CodeArmor [
12], is not publicly available. Unlike previous works, Mardu’s re-randomization is triggered neither by a timer nor by system call history. This design allows
Mardu’s average overhead to be comparable to the fastest re-randomization systems and its worst-case overhead to be significantly better than similar systems. The average overhead of
Mardu is 5.5%, and the worst-case overhead is 18.3% (
perlbench), in comparison to Shuffler [
74] and CodeArmor [
12], whose reported average overheads are 14.9% and 3.2%, and whose worst-case overheads are 45% and 55%, respectively (see Table
1). TASR [
8] shows a very practical average overhead of 2.1%; however, it has been reported by Shuffler [
74] and ReRanz [
70] that TASR’s overhead against a more realistic baseline (not using compiler flag
-Og) is closer to 30% to 50%. This confirms that
Mardu matches, and in the worst case even improves upon, the overhead of comparable systems while casting a wider net in terms of known attack coverage.
Mardu’s two sources of runtime overhead are trampolines and the shadow stack.
Mardu uses a compact shadow stack, with no comparison epilogue, whose sole purpose is to secure return addresses; only four additional assembly instructions are needed to support it. We therefore show the trampoline-only configuration separately to clearly differentiate the overhead contribution of each component. Figure
5 shows that
Mardu’s shadow stack overhead is negligible, averaging less than 0.3%, and even where it is noticeable it adds less than 2% in
perlbench,
gobmk, and
sjeng. The overhead in these three benchmarks comes from their higher frequency of short function calls, so shadow stack updates do not amortize as well as in other benchmarks. In the cases where Full
Mardu is actually faster than the trampoline-only version (e.g.,
bzip2,
gcc, and
h264ref), we investigated and found that the handcrafted assembly used to integrate trampolines with the regular stack in the trampoline-only version can inadvertently elevate branch misses, leading to the observed slowdown in the trampoline-only configuration.
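To make the shadow stack design concrete, the following toy model is a conceptual sketch only (plain Python, not Mardu’s actual x86-64 instrumentation; all names are ours): it illustrates a compact shadow stack with no comparison epilogue, in which return addresses live solely on the shadow stack, so a return never consults a potentially corrupted copy on the regular stack.

```python
# Conceptual sketch (not Mardu's instruction-level instrumentation): a compact
# shadow stack that relocates return addresses away from the regular stack.
class ToyThread:
    def __init__(self):
        self.regular_stack = []   # spilled locals, saved registers, etc.
        self.shadow_stack = []    # return addresses only

    def call(self, return_addr, frame):
        # At a call site, the return address goes to the shadow stack instead
        # of the regular stack; the rest of the frame is unchanged.
        self.shadow_stack.append(return_addr)
        self.regular_stack.append(frame)

    def ret(self):
        # On return, the target comes solely from the shadow stack; no
        # compare-and-check against a regular-stack copy is needed.
        self.regular_stack.pop()
        return self.shadow_stack.pop()

t = ToyThread()
t.call(return_addr=0x401337, frame={"locals": [1, 2, 3]})
assert t.ret() == 0x401337   # returns to the protected address
```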
6.3 Scalability Evaluation
Runtime performance overhead with NGINX. NGINX is configured to handle a maximum of 1,024 connections per processor, and its performance is measured while varying the number of worker processes.
wrk [
28] is used to generate HTTP requests for benchmarking.
wrk spawns the same number of threads as NGINX workers, and each
wrk thread sends requests for a 6,745-byte static HTML file. To expose worst-case performance,
wrk is run on the same machine as NGINX to factor out network latency, unlike Shuffler’s setup. Figure
6 presents the performance of NGINX with and without
Mardu for a varying number of worker processes. The results show that
Mardu achieves throughput quite similar to vanilla.
Mardu incurs 4.4%, 4.8%, and 1.2% throughput degradation on average, at peak (12 threads), and at saturation (24 threads), respectively. Note that Shuffler [
74] incurs overhead from its per-process shuffling thread; just enabling Shuffler essentially doubles CPU usage. Even in their NGINX experiments with network latency (i.e., running the benchmarking client on a different machine), Shuffler shows 15% to 55% slowdown. This confirms that
Mardu’s design, in which a crashing process performs system-wide re-randomization rather than relying on a per-process background thread as in Shuffler, scales better.
Load-time randomization overhead. We categorize load-time as cold or warm, depending on whether the in-kernel code cache (❷ in Figure
3) misses or hits. Upon a code cache miss (i.e., the first time an executable is loaded on the system),
Mardu performs initial randomization including function-level permutation, start offset randomization of the code layout, and loading and patching of fixup metadata. As Figure
7 shows, all C SPEC benchmarks exhibit negligible load-time overhead, averaging 95.9 msec.
gcc, the worst case, takes 771 msec; it requires the most fixups (291,699) of any SPEC benchmark, compared with an average of
\(\approx\)9,372 fixups across the benchmarks.
perlbench and
gobmk are the only other outliers, with 103,200 and 66,900 fixups, respectively; all other programs have
\(\ll\)35K fixups (refer to Table
2). For NGINX, we observe that load-time is constant (61 msec) regardless of the number of specified worker processes. Cold load-time is roughly linear in the number of trampolines. Upon a code cache hit,
Mardu simply maps the already-randomized code to a user-process’s virtual address space. Therefore, we found that warm load-time is negligible. Note that a cold load-time of
musl takes about 52 msec on average. Even so, this is a one-time cost; all subsequent warm loads of
musl take below 1
\(\mu\)sec for any program needing it. Thus, load-time can be largely ignored.
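The cold/warm distinction described above can be summarized with a schematic sketch (our own simplification in Python, not Mardu’s kernel code; the helper names and toy ranges are hypothetical): a cache miss permutes functions, randomizes the start offset, and patches fixups before caching the image, while a hit simply reuses the already-randomized image.

```python
# Schematic sketch of the cold/warm load path; all names are hypothetical.
import random

code_cache = {}  # in-kernel code cache (cf. Figure 3), keyed by binary path

def load(path, functions):
    """Return a randomized code image for `path`, reusing the cache if warm."""
    if path in code_cache:                 # warm: image already randomized
        return code_cache[path]
    layout = list(functions)
    random.shuffle(layout)                 # cold: function-level permutation
    base = random.randrange(0, 1 << 21)    # randomized start offset (toy range)
    image = {
        "base": base,
        # fixup patching: give every function its final randomized address
        "addresses": {fn: base + idx * 0x100 for idx, fn in enumerate(layout)},
    }
    code_cache[path] = image               # cache it so later processes share it
    return image

first  = load("libc.so", ["memcpy", "printf", "malloc"])   # cold load (slow)
second = load("libc.so", ["memcpy", "printf", "malloc"])   # warm load (fast)
assert first is second
```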
Re-randomization latency. Figure
8 presents the time to re-randomize all binaries associated with a crashing process. This time includes creating a new randomized code layout and reclaiming the old code (❶–❹ in Figure
4). We emulate an XoM violation by killing the process via a
SIGBUS signal and measure the re-randomization time inside the kernel. The average latency for SPEC is 6.2 msec. Re-randomization is faster than cold load-time because
Mardu reuses the metadata cached at load-time, so no redundant file I/O penalty is incurred. To evaluate the efficiency of re-randomization on multi-process applications, we measured the re-randomization latency with a varying number of NGINX worker processes, up to 24. We confirm that latency is consistent regardless of the number of workers (5.8 msec on average, 0.5 msec standard deviation).
Re-randomization overhead under active attacks. A good re-randomization system should perform well not only when idle but also under stress from active attacks. To evaluate this, we stress test
Mardu under frequent re-randomization, emulating a scenario in which
Mardu is under attack. In particular, we measure the performance of SPEC benchmarks while triggering frequent re-randomization. We emulate the attack by running a background application that continuously crashes at a fixed period of 1 second, 100 msec, 50 msec, 10 msec, or 1 msec. SPEC benchmarks and the crashing application are linked with the
Mardu version of
musl, forcing
Mardu to constantly re-randomize
musl and potentially incur performance degradation on other processes using the same shared library. In this experiment, we choose three representative benchmarks,
milc,
sjeng, and
gobmk, for which
Mardu exhibits small, medium, and large overhead in the idle state, respectively. Figure
9 shows that the overhead is consistent and in fact is quite close to the performance overhead in the idle state observed in Figure
5. More specifically, all three benchmarks differ by less than 0.4% at a 1-second re-randomization interval. As we decrease the re-randomization period to 10 msec and 1 msec, the overhead quickly saturates. Even at a 1 msec re-randomization period, the additional overhead is under 6%. These results show that
Mardu provides performant system-wide re-randomization even under active attack.
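A background workload of this kind could be generated with a few lines of Python; the sketch below is our own assumption of such a harness (the paper describes only the crash periods, not the tooling) and uses SIGBUS to mimic the XoM-violation crash mentioned earlier.

```python
# Hypothetical crash-inducing workload (not the authors' harness): crash a
# child process at a fixed period, forcing re-randomization of the shared
# library it is linked against.
import os, signal, sys, time

PERIOD_SEC = 0.010  # e.g., 10 msec between induced crashes

def crash_once():
    pid = os.fork()
    if pid == 0:
        # Child: raise SIGBUS to emulate an XoM-violation crash.
        os.kill(os.getpid(), signal.SIGBUS)
        os._exit(1)  # not reached; the signal terminates the child
    os.waitpid(pid, 0)

if __name__ == "__main__":
    period = float(sys.argv[1]) if len(sys.argv) > 1 else PERIOD_SEC
    while True:
        start = time.monotonic()
        crash_once()
        time.sleep(max(0.0, period - (time.monotonic() - start)))
```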
File size overhead. Figure
10 and Table
2 show how much binary files increase with
Mardu compilation. In our implementation, the file size increase comes from replacing the traditional x86-64 calling convention with the one designed for
Mardu (except for calls to external libraries). On average,
Mardu compilation with trampolines increases the file size by 66%. As expected, applications with more call sites incur higher overhead, since we add five instructions for every
call and four instructions for every
retq (e.g.,
perlbench,
gcc,
milc, and
gobmk are the only benchmarks with more than 100% increase, being 108%, 110%, 101%, and 104% respectively).
Runtime memory savings. Although there is an upfront one-time cost for instrumenting with
Mardu, the savings greatly outweigh this. To illustrate, we show a typical use case of
Mardu in regard to shared code.
musl is
\(\approx\)800 KB in size, and its instrumented version is 2 MB. Specifically,
musl has 14K trampolines and 7.6K fixups for PC-relative addressing, the total trampoline size is 190 KB, and the amount of loaded metadata is 1.2 MB (Table
2). Since
Mardu supports code sharing, only one copy of
libc is needed for the entire system. Backes and Nürnberger [
6] and Ward et al. [
71] also highlighted the code sharing problem in randomization techniques and reported similar memory savings from sharing randomized code. Finally, note that our shadow stack does not increase the runtime memory footprint beyond the additional memory page allocated for the shadow stack itself and the code-size increase from its instrumentation.
Mardu solely relocates return addresses from the normal stack to the shadow stack.
System-wide performance estimation. Deploying
Mardu system-wide for all applications and shared libraries requires the additional engineering effort of recompiling the entire Linux distribution. Instead, we estimate how
Mardu would perform on a regular Linux server during boot-time. We obtain this estimate based on the fact that
Mardu’s load-time overhead increases linearly with the total number of functions and call sites present in an application or library. We calculate the estimated boot overhead if the entire system were protected by
Mardu. Referencing Figure
2,
Mardu requires one trampoline per function, containing one fixup, and one return trampoline per call site, containing three fixups. In addition, every PC-relative instruction must be patched. Therefore, the total number of fixups to be patched is:
\(N_{\text{fixup}} = N_{\text{function}} + 3 \times N_{\text{call site}} + N_{\text{PC-rel}}\). (7)
Extrapolating from
Mardu’s load-time randomization overhead in Figure
7, where
gcc has the most fixups (291,699) and takes 771 ms, each fixup takes approximately 2.6
\(\mu\)sec. We recorded all executables launched, as well as the libraries they loaded, on our Linux server to calculate the additional overhead imposed by
Mardu during boot-time. We included all programs run within the first 5 minutes after boot and examined the system load at that point: we observed a total of 117 no-longer-active processes and 265 currently active processes, using a total of 784 unique libraries. The applications contained a total of 8,862 functions, 472,530 call sites, and 415,951 PC-relative instructions. Using Equation (
7), this gave a total of 1,842,403 fixups if all launched applications were
Mardu-enabled. The libraries contained a total of 223,415 functions, 4,450,488 call sites, and 2,514,676 PC-relative instructions, giving a total of 16,089,555 fixups for shared libraries. Using our estimation from
gcc, we can approximate that patching all fixups including both application fixups and shared library fixups (a total of 17,931,958 fixups) for a
Mardu-enabled Linux server will take
\(\approx\)46.6 additional seconds compared to a vanilla boot. To give more insight, application fixups contribute only
\(\approx\)4.8 seconds of this delay; the majority comes from randomizing the shared libraries. However, this delay is greatly amortized because many libraries are shared by a large number of applications; in contrast, if libraries were not shared (e.g., statically linked), each application would need its own separately randomized copy.
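The arithmetic behind this estimate can be reproduced directly from the reported counts; the short sketch below (plain Python) applies Equation (7) and the rounded 2.6 \(\mu\)sec-per-fixup figure extrapolated from gcc to recompute the application, library, and total boot-time delays.

```python
# Recomputes the boot-time estimate from the reported counts.
USEC_PER_FIXUP = 2.6  # 771 ms / 291,699 fixups (gcc) ~= 2.6 usec per fixup

def total_fixups(functions, call_sites, pc_relative):
    # Equation (7): one fixup per function trampoline, three per return
    # trampoline (one per call site), plus one per PC-relative instruction.
    return functions + 3 * call_sites + pc_relative

apps = total_fixups(8_862, 472_530, 415_951)        # 1,842,403 fixups
libs = total_fixups(223_415, 4_450_488, 2_514_676)  # 16,089,555 fixups

print(f"app fixups:   {apps:>10,} -> {apps * USEC_PER_FIXUP / 1e6:5.1f} s")
print(f"lib fixups:   {libs:>10,} -> {libs * USEC_PER_FIXUP / 1e6:5.1f} s")
print(f"total fixups: {apps + libs:>10,} -> {(apps + libs) * USEC_PER_FIXUP / 1e6:5.1f} s")
```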
System-wide memory savings estimation. Similarly, we give a system-wide snapshot of memory savings observed when
Mardu’s randomized code sharing is leveraged. For this, we again use the same Linux server. The vanilla total file size of the 784 unique libraries is approximately 787 MB. Given that our evaluation (Figure
10) shows that
Mardu increases file size by roughly 66% on average, this total would grow to 1,306 MB if all libraries were instrumented with
Mardu. Although this does appear to be a large increase, it is a one-time cost as code sharing is enabled under
Mardu. Of the 265 processes on our Linux server, 127 had mapped libraries. If code sharing is not supported, each process needs its own in-memory copy of every library it uses. We counted each library use and multiplied it by the library’s size to obtain the total non-sharing memory usage; for our Linux server, this amounts to approximately 8.8 GB. This means that
Mardu provides approximately 7.5 GB of memory savings through its inherent code sharing design, relative to a scenario in which each library is separately statically linked into every running process.
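These figures can likewise be recomputed from the reported aggregates; the sketch below (plain Python, taking the paper’s 66% average growth and the measured 8.8-GB non-sharing total as given) reproduces the 1,306-MB instrumented library size and the \(\approx\)7.5-GB savings.

```python
# Recomputes the memory-savings estimate from the reported aggregates.
vanilla_lib_mb = 787                      # total size of the 784 unique libraries
mardu_lib_mb   = vanilla_lib_mb * 1.66    # ~66% average file-size increase -> ~1,306 MB
non_sharing_gb = 8.8                      # measured: sum over processes of (library uses x size)
savings_gb     = non_sharing_gb - mardu_lib_mb / 1024

print(f"Mardu-instrumented libraries:     {mardu_lib_mb:,.0f} MB")
print(f"Estimated savings vs. no sharing: {savings_gb:.1f} GB")
```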
To determine how many times each library is shared across processes, we analyzed each process’s memory mappings on our Linux server by inspecting
/proc/{PID}/maps. Figure
11 presents the active reference count for the 25 most linked shared libraries on our idle Linux server. These libraries are each referenced
\(\approx\)106 times on average, showing that dynamically linked libraries save substantial memory compared to a non-shared approach. For the same 25 most linked shared libraries, we also show in Figure
12 the estimated memory savings obtained for each of those libraries if
Mardu were used instead of an approach that does not support sharing of code. Notice that some of our biggest memory savings come from
libc.so and
libm.so, very commonly used libraries for which
Mardu saves almost 0.80 GB and 0.25 GB of memory, respectively.
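A per-library reference count of this kind can be gathered with a short script; the sketch below is our own illustration of the /proc/{PID}/maps-based methodology (not the authors’ script) and prints the 25 most referenced shared libraries on the current machine.

```python
# Illustrative sketch of the methodology: count how many distinct processes map
# each shared library by scanning /proc/{PID}/maps.
import glob
from collections import Counter

lib_refs = Counter()
for maps_path in glob.glob("/proc/[0-9]*/maps"):
    try:
        with open(maps_path) as f:
            libs = set()
            for line in f:
                parts = line.split(None, 5)     # addr perms offset dev inode path
                if len(parts) == 6 and ".so" in parts[5]:
                    libs.add(parts[5].strip())  # each library counted once per process
    except OSError:
        continue                                # process exited or permission denied
    lib_refs.update(libs)

for lib, count in lib_refs.most_common(25):     # 25 most linked shared libraries
    print(f"{count:4d}  {lib}")
```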
We also show a snapshot of the entire system (including all 784 unique libraries) as CDFs of the unique library link count in Figure
13 and cumulative memory savings in Figure
14 if
Mardu were to be applied and used system-wide for all dynamically linked libraries. From Figure
13, it can be seen that approximately 150 libraries are in the 75th percentile of link count.