IskiOS: Lightweight Defense Against Kernel-Level Code-Reuse Attacks

Technical Report #1007

Spyridoula Gravani, Mohammad Hedayati, John Criswell, and Michael L. Scott

Department of Computer Science, University of Rochester
{sgravani,hedayati,criswell,scott}@cs.rochester.edu

Abstract
Commodity operating systems such as Windows, Linux, and MacOS X form the Trusted Computing Base (TCB) of today’s computing systems. However, since they are written in C and C++, they have memory safety errors and are vulnerable to kernel-level code reuse attacks. This paper presents IskiOS: a system that helps to thwart such attacks by providing both execute-only memory and an efficient shadow stack for operating system kernels on the x86 processor. Execute-only memory hides the code segment from buffer overread attacks, strengthening code randomization techniques. Shadow stacks protect return addresses from corruption. IskiOS leverages Intel’s Memory Protection Keys (MPK, a.k.a. PKU) and Kernel Page Table Isolation (KPTI) to protect kernel memory from buffer overwrite and overread attacks and to prevent corruption of the shadow stack. Unlike previous work, IskiOS places no restrictions on virtual address space layout, allowing the operating system to achieve higher diversification entropy by placing kernel stacks and kernel code in arbitrary locations within the virtual address space. IskiOS incurs virtually no performance overhead for execute-only memory. Its shadow stacks incur a geometric mean slowdown of 12.3% in our experiments.

1 Introduction
Control-flow hijacking attacks violate Control-flow Integrity (CFI) [1] to take over execution and control the behavior of a program. When that program is the operating system (OS) kernel, everything running on the machine may be at risk. Control-flow hijacking attacks on the OS kernel have evolved from simple stack smashing [56] attacks to sophisticated variants of return-oriented programming (ROP) [61] which hijack the return address and then construct an exploit by chaining together fragments of existing code (typically referred to as gadgets). Existing defenses against such code-reuse attacks focus on static analysis to identify and label legitimate code paths, coupled with dynamic instrumentation to enforce control flow integrity (CFI) by ensuring that only labeled paths are followed during execution [2] (more on this in Section 2). Unfortunately, because the static analysis is inevitably imprecise and the dynamic instrumentation expensive, such defenses tend to embody an unfortunate tradeoff between safety and performance.

We argue that a defense against code-reuse attacks in the OS kernel cannot rely on static analysis and label-based CFI enforcement. Following the strategy of defenses used in user space, we instead propose a comprehensive solution that (1) diversifies code layout through address space randomization, (2) protects against direct disclosure of the layout by making executable memory unreadable, and (3) prevents corruption of return addresses during execution. Running on Intel x86 processors, our IskiOS system implements execute-only memory and protected shadow stacks with very low run-time overhead.

The key to our design is a novel use of Intel’s memory protection keys (MPK), which Intel calls Protection Keys for Userspace (PKU) [39]. PKU was originally designed as a debugging and safety aid for programmers who want to limit access to sensitive resources (an in-memory database, perhaps, or cryptographic keys) to a limited and well-tested subset of the application code. It does so by allowing the program to indicate, dynamically, that read and/or write access should be disabled for certain sets of pages. As the name implies, PKU applies only to memory whose page table entries (PTEs) are marked as user space (i.e., that are accessible in user mode and not only in supervisor mode). Over the past year, however, widespread adoption of kernel page-table isolation (KPTI) [37] as a mitigation for Meltdown attacks [47] has essentially obviated use of the user/supervisor bit in PTEs: since we use an entirely separate page table when running in the kernel, there is no reason kernel pages cannot be marked as “user” memory, making PKU usable in kernel space.

Intel’s x86 processors use one PTE bit to distinguish between read/write and read-only pages and another to indicate executability [39]. This convention does not support an execute-only (unreadable) mode. By leveraging PKU and KPTI, however, IskiOS obtains the effect of such a mode by disabling read access for all code pages in the kernel. Unlike previous work [58], IskiOS places no restrictions on virtual address space layout: the OS kernel can scatter code pages throughout the address space, allowing randomization techniques to use as much entropy as possible.

---

*This work was supported in part by NSF grants CCF-1422649, CCF-1717712, and CNS-1618213 and by a Google Faculty Research award.

Separately, IskiOS uses PKU to ensure the integrity of a shadow stack that is used to protect all function returns and that is writable only during short sequences of straight-line code in function prologues. The wrpkru instruction, which changes protection keys, is used only in ways that are immune to gadgetization (matching pairs with no intervening ret instructions), and even these uses are hidden through diversification and execute-only memory. Consequently, an attacker who cannot inject code into the kernel is unable to override either the execute-only code segment or the protected shadow stack.¹

To summarize, our paper makes the following contributions:

- We demonstrate that Intel’s PKU mechanism can be used, in conjunction with kernel page table isolation, to widen the set of protection modes for kernel memory, and to change those protections cheaply at a fine temporal granularity.

- We describe a system, IskiOS, that leverages this observation to provide execute-only memory and protected shadow stacks within the Linux kernel, while retaining the ability to employ arbitrary address space layout.

- We report the performance of IskiOS on the LMBench microbenchmarks [51] and the Phoronix test suite [52]. Our execute-only memory solution incurs virtually no performance overhead. Our complete solution, with protected shadow stacks, incurs 12.3% overhead (geometric mean) on the Phoronix system benchmarks.

The rest of the paper is organized as follows. Section 2 provides background on KPTI [37] and code reuse attacks [13,34,59]. Section 3 describes our threat model. Section 4 describes the design of IskiOS, and Section 5 describes the implementation of our prototype. Section 6 presents the results of our performance evaluation. Section 7 describes the security guarantees and limitations of IskiOS, and Section 8 presents related work. Conclusions and future work appear in Section 9.

---

¹ For full protection, device drivers and kernel-loaded modules must conform to the same conventions as the rest of the kernel. Intriguingly, since IskiOS uses only three of the 16 available PKU keys, we could (in future work) employ additional keys to isolate drivers from the main body of the kernel, or to explore other forms of intra-kernel protection domains.

2 Background

2.1 Code-reuse Attacks and Prevention

The wide deployment of the W^X [3,39,66] policy, which prevents executable memory from being overwritten, shifted user- and kernel-space exploitation from code injection (e.g., stack smashing [56]) to code-reuse attacks (e.g., return-to-libc [63] or return-to-user [53]), which repurpose existing code in memory.

In the past decade, code-reuse attacks (CRAs) evolved from ROP attacks that use static analysis on x86 binaries [61] to discover useful gadgets, to sophisticated attacks that do not rely on any assumption about the code layout of a program [35, 60]. While the first generation of ROP attacks misused only return instructions [63], ROP was subsequently generalized to jump-oriented programming [7,13], in which any indirect branch can be exploited to subvert execution.

On the defense side, the key observation that CRAs rely heavily on prior knowledge of the program layout gave rise to software diversity techniques that randomize the address space layout on every execution [26,65]. Attackers now have to guess or leak the location of gadgets in memory. The practical limitation of guessing attacks—they may end up crashing the program instead of controlling it—leaves one way forward: information leakage. Adversaries leverage memory disclosure vulnerabilities to “learn” how code is laid out during execution, and “compile” their malicious payload in a just-in-time (JIT) [62] fashion.

In a JIT-style attack [62], an adversary may try to read directly from the code segment (direct disclosure), or leak information from function pointers saved in readable parts of memory such as the heap and stack (indirect disclosure). State-of-the-art defenses against advanced code-reuse attacks use leakage-resilient diversification techniques, which combine fine-grained randomization with (1) execute-only memory (XOM) (to prevent direct disclosure) and (2) code-pointer hiding (CPH), a set of techniques that effectively hide code pointers in readable memory, either cryptographically [49] or using trampolines [20] (to prevent indirect disclosure).

While protection mechanisms in user space have closely followed the advancement of respective attacks, progress in the kernel setting has been substantially slower. Most existing defenses attempt to enforce the control-flow integrity [1] of the OS kernel during execution [21,22,31,45,54]. In particular, they use static analysis to identify and label legitimate code paths under a specific security policy, and add dynamic instrumentation to ensure that only labeled paths are followed during execution. While these solutions limit the gadgets available for reuse to code reachable from the entry point of the computed control-flow graph, the number of these gadgets depends on the inevitably imprecise analysis and the specifications of the security policy. Most importantly, unless coupled with a protected shadow stack, CRAs can still exploit memory safety errors within code protected by label-based CFI to perform unauthorized computation [1, 12] because a function can return to any of its potential callers and because label-based CFI suffers from exploitable imprecision [9, 12, 17, 30].

Given these realities (and user-space experience), we do not believe that kernel code can continue to rely on static analysis and label-based CFI enforcement. We therefore explore protections that: (1) hide code layout and code pointers, and (2) prevent their leakage during execution. KHide [32] and kR^X [58] are, to the best of our knowledge, the only systems to follow the example set by user-space defenses. Both systems enable XOM for kernel code. However, KHide relies on a hypervisor and incurs high overhead (9%-52% for LMBench, compared to 0%-2.20% for IskiOS’s XOM implementation), and kR^X breaks the kernel memory layout, negating the security benefits of already deployed randomization techniques. In contrast, IskiOS provides the necessary tools to enable efficient and flexible protection against code-reuse attacks in the OS kernel.

2.2 Kernel Page Table Isolation

Commodity operating systems provide separate protection domains for code executing in user space and in the kernel. For efficiency, operating systems have traditionally mapped kernel memory permanently into the virtual address space of every user process, while preventing user-level code from accessing kernel mappings [8]. The user/kernel isolation is enforced through a set of hardware features provided by most modern processors. In particular, x86/x86-64 architectures [39] provide: (1) distinct user and kernel (supervisor) modes of CPU execution, and (2) a User/Supervisor (U/S) flag in each paging-structure entry that indicates whether the corresponding page is accessible in user mode.

Recently, a plethora of attacks have targeted side effects of transient execution—sequences of instructions that the processor executes speculatively or out-of-order to increase performance and that never get committed to the architectural state [11, 47]. In particular, Meltdown [47], an attack that affects all x86 processors and several ARM processors, exploits transient execution of instructions following an access to a kernel mapping from user code, to break the boundaries of user/kernel isolation and leak privileged data through a timing side channel. Access to kernel memory from user programs will cause a page fault due to the U/S permission violation; however, the processor may continue executing instructions transiently until the check is actually enforced, caching kernel data as a side effect and allowing aspects of that data to be inferred from the timing of subsequent user-level accesses.

In light of Meltdown [47], most operating systems have turned to a strict isolation design that removes all kernel mappings from user space. Kernel Page Table Isolation (KPTI) [37] is an implementation of the new isolation mechanism that uses two sets of page tables: one that contains all memory mappings and is available during kernel-mode execution, and a second set that contains translations for user memory and a minimal set of mappings that enable the transition from user space to the kernel. Since the OS kernel is not even mapped in user space, user code cannot reach kernel pages, and Meltdown attacks are mitigated.

3 Threat Model

Our threat model assumes an attacker that can execute arbitrary code in user space. The attacker’s goal is to execute a computation within the OS kernel with supervisor privileges. The OS kernel is non-malicious but may have exploitable memory safety errors such as buffer overflows [56] and dangling pointers [4]. Our attacker is an unprivileged user and cannot direct the OS kernel to load a new kernel module implementing malicious code. We assume the deployment of a user/kernel isolation mechanism such as KPTI [37] and the enforcement of the W^X [3, 39, 66] policy that prevents the attacker from injecting code directly into kernel memory. The kernel is hardened against return-to-user attacks [40, 53] with a feature like Intel’s Supervisor-mode Execution Prevention (SMEP) [39]. Finally, we assume a leakage-resilient diversified kernel (e.g., using Readactor [20] diversification).

Given the hardening assumptions of our system, the attacker must use a code-reuse attack [59, 68] to force the OS kernel into executing the desired computation. In principle, the attacker might also use a memory safety error to tamper with page tables [44]—e.g., to make OS kernel code pages writable and then use a second memory safety error attack to overwrite the instructions within the kernel code segment. Our design protects against this by putting all page tables in a separate PKU-based protection domain.

We assume an attacker that can leak the content of any memory location through direct reads and may directly overwrite any kernel code pointer. Side-channel attacks [11, 28, 29, 42, 48, 74] are out of scope; leaking information through hardware resource sharing (e.g., cache, branch target predictor, etc.) is an orthogonal issue that needs to be resolved independently. However, we should note that IskiOS, by design, prevents Meltdown [47] since address translation mappings between kernel and userspace are no longer shared. Finally, non-control-data attacks [14] are out of scope; protecting sensitive data in memory (e.g., program control blocks (PCB), interrupt tables, etc.) is an orthogonal issue and part of our future work.

4 Design

IskiOS’s goal is to defend against advanced code reuse attacks (e.g., ROP [59] and JOP [13]) launched against an OS kernel. Such attacks corrupt either return addresses stored on the stack or function pointers stored within the OS kernel’s memory. IskiOS protects return addresses from corruption. To mitigate attacks that corrupt function pointers, IskiOS prevents memory reads from accessing the kernel code segment. This allows strong diversification techniques to hide the location of reusable code within the kernel. Since the code is not readable, attackers cannot use buffer overread attacks [64] to find reusable code.

We first present our mechanism to restrict access to kernel-mode pages using Intel PKU [39], a feature that was originally intended for user space. We then use this mechanism to make kernel code (both the primary kernel and loadable kernel modules) unreadable, i.e., execute-only, thereby preventing buffer overreads [64] that might otherwise reveal the location of reusable code.

**Supervisor-mode Execution Prevention (SMEP)** Recent Intel processors provide SMEP [39], a security feature that hardens the operating system kernel against ret2usr [40] attacks. When enabled, kernel-mode code cannot fetch instructions from pages marked as user-accessible (i.e., the page has its U/S bit set). Any such access will cause a page fault, allowing the operating system to handle the SMEP violation. Since IskiOS considers all addresses to be user-mode, SMEP must be disabled for kernel code to execute. IskiOS disables the feature at boot time by clearing bit 20 in the CR4 control register. To prevent the kernel from executing arbitrary user code without SMEP support, IskiOS marks all application pages as non-executable by setting the execute-disable (NX) bit [39] of the root page table entry in the kernel page table that maps application pages.

**Supervisor-mode Access Prevention (SMAP)** Another CPU-based protection mechanism provided by newer Intel platforms is SMAP [39]. SMAP disables supervisor-mode accesses to user pages in an attempt to prevent attacker-controlled pointers from accessing user memory directly, possibly subverting the kernel’s control flow [18]. When the operating system needs to access user memory for legitimate purposes (e.g., copy_to/from_user() [8]), it can temporarily disable SMAP protection by clearing an appropriate flag that controls SMAP enforcement. IskiOS configures all linear addresses to be user-mode. Consequently, SMAP must be disabled for kernel code to be able to access its own data. As with SMEP, IskiOS clears the SMAP bit in the CR4 control register to disable the feature. However, to replicate SMAP’s protections, IskiOS disallows kernel access to pages tagged with keys 8–15 (i.e., user pages) by default.

4.1 Kernel Protection Keys

In its Skylake generation of processors, Intel introduced a mechanism it calls *memory protection keys for userspace (PKU)* [39]. (Similar mechanisms have appeared in previous architectures from several other vendors.) PKU introduces a new 32-bit register called `pkru` and two instructions, `rdpkru` and `wrpkru`, which read and write the register, respectively. PKU employs 4 previously unused bits (bits 62:59) in each page table entry to assign a key to every linear address, associating it with one of 16 possible protection domains. The `pkru` register uses two bits per key to encode the access rights, read and/or write, that should be restricted in each domain. On access to a user-mode address (i.e., an address within a page that has its U/S bit set in its page table entry), the processor checks the permission bits as usual and then drops the access rights (if any) associated with the page’s protection key value within the `pkru` register. The processor ignores the protection key of a kernel-mode address (i.e., an address whose page has the U/S bit clear in its page table entry): thus, in the expected use case, PKU does not affect accesses to kernel code or data.
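The check described above can be modeled in a few lines. The following Python sketch is a toy model, not kernel or hardware code: the function name `data_access_allowed` and its parameters are hypothetical, and the model ignores the ordinary page-table permission bits. It assumes the register layout in which key k uses bit 2k of `pkru` as its access-disable bit and bit 2k+1 as its write-disable bit.

```python
def data_access_allowed(pkru, user_page, key, write):
    """Model the protection-key check for a single data access.

    pkru      -- 32-bit register value (two bits per key)
    user_page -- True if the page's U/S bit is set
    key       -- protection key (0..15) taken from the page table entry
    write     -- True for a write, False for a read
    """
    if not 0 <= key <= 15:
        raise ValueError("protection key must be in 0..15")
    if not user_page:
        # Keys are ignored for supervisor pages (U/S bit clear).
        return True
    access_disable = (pkru >> (2 * key)) & 1
    write_disable = (pkru >> (2 * key + 1)) & 1
    if access_disable:
        return False
    return not (write and write_disable)


# Example: key 3 is inaccessible, key 5 is read-only.
pkru = (1 << 6) | (1 << 11)
print(data_access_allowed(pkru, True, 3, False))   # read of a key-3 page
print(data_access_allowed(pkru, True, 5, True))    # write to a key-5 page
print(data_access_allowed(pkru, False, 3, True))   # supervisor page: key ignored
```

The last call illustrates why the U/S bit matters for the design: the key only takes effect once the page is marked as user-mode.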

Page table isolation mechanisms like KPTI [37] (Sec. 2.2) unmap OS kernel memory from the virtual address space when user-space code is executing on the processor, rendering the U/S bit largely redundant. This is done by providing separate page tables to applications and the OS kernel. IskiOS leverages this scheme to enable *protection keys for kernel-space*. IskiOS sets the U/S bit in every page table entry (except a few entries that map trampoline pages handling system calls and interrupts in the user page table), effectively marking all memory as user-mode. It then relies exclusively on page table isolation to prevent user-space code from reading and writing OS kernel pages. This enables the use of Intel PKU for both user and OS kernel memory.

On Linux, user programs use the `pkey_alloc()` system call to request the allocation of a protection key from the kernel, the `pkey_mprotect()` system call to change the protection of a desired memory region, and the `pkey_free()` system call to return a key back to the system for later use [36]. Since the kernel manages protection keys, applications that use PKU for their own purposes do not need to be modified to execute on IskiOS. IskiOS splits the 16 available protection keys into two sets of 8 keys each. It reserves one set (keys 0–7) for kernel pages and allows applications to use the second set (keys 8–15).

4.2 Kernel XOM

A high-entropy, leakage-resilient diversification scheme raises the bar for successfully launching a code-reuse attack. In addition to function permutation [41], instruction randomization [57] and register randomization [20, 57], a leakage-resilient diversification scheme ensures that code pointers in readable memory (i.e., function pointers and return addresses) do not give away the code layout. To overcome the diversification obstacle, an attacker may attempt to exploit memory disclosure vulnerabilities to directly read code and infer the location of instructions at run time [64]. IskiOS prevents direct memory disclosure attacks by placing all kernel and module code pages in *eXecute Only Memory (XOM)*—memory that can be executed, but neither read nor written. Unfortunately, x86 architectures lack hardware support for creating XOM. IskiOS uses the Kernel Protection Keys described in Section 4.1 to create eXecute Only Memory.

IskiOS reserves one of the 8 OS kernel protection keys for the OS kernel code segment. It configures all page table
entries for pages containing kernel code to use this key. It then
sets the access disable (AD) bit in the pkru register for this key,
disabling read access to OS kernel pages containing kernel
code. Since protection keys do not affect instruction fetch and
execution, only memory read accesses are prevented.

**Page Table Protection** IskiOS’s XOM enforcement would be incomplete without protection against page table tampering [44]. Without such protection, an attacker could read page tables to infer the location of code pages. Worse yet, the attacker could write into page tables to change which instructions are mapped into the kernel code segment. This would effectively allow an attacker to inject code into the OS kernel and obviate the need to locate code-reuse gadgets. To prevent this, IskiOS reserves a second kernel protection key for page table pages and assigns this key to the pages that map page tables into the virtual address space. Access to this key is disabled in the pkru register, causing reads and writes of page tables to generate a trap. Functions in the OS kernel that legitimately need to read and write page table pages first call the pgtblaccess_enable() function, which changes the pkru register to enable access to the page table pages. When they are done modifying page table entries, these functions call the pgtblaccess_disable() function, which re-enables protection in the pkru register. In this way, only authorized OS kernel code modifies page table pages; errant buffer overflows are unable to write to the page tables.
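The enable/disable bracketing can be illustrated with a small Python model. This is only a sketch under stated assumptions: the key number `PGTBL_PKEY`, the `ToyCpu` class, and the helpers `write_pte` and `set_pte` are hypothetical, and only the names pgtblaccess_enable() and pgtblaccess_disable() come from the text. The model represents a trap as a Python exception.

```python
PGTBL_PKEY = 6                       # hypothetical key number for page-table pages
PGTBL_AD = 1 << (2 * PGTBL_PKEY)     # access-disable bit for that key


class ToyCpu:
    """Holds a pkru value; page-table protection is on by default."""

    def __init__(self):
        self.pkru = PGTBL_AD


def pgtblaccess_enable(cpu):
    cpu.pkru &= ~PGTBL_AD            # clear AD: page tables become accessible


def pgtblaccess_disable(cpu):
    cpu.pkru |= PGTBL_AD             # set AD again: accesses trap


def write_pte(cpu, page_table, index, value):
    """Write one entry; raise (model of a trap) if access is disabled."""
    if cpu.pkru & PGTBL_AD:
        raise PermissionError("page-table access while protection is enabled")
    page_table[index] = value


def set_pte(cpu, page_table, index, value):
    """An authorized update: bracket the write with enable/disable."""
    pgtblaccess_enable(cpu)
    try:
        write_pte(cpu, page_table, index, value)
    finally:
        pgtblaccess_disable(cpu)     # protection is restored even on error
```

In this model, a stray `write_pte` call outside `set_pte` raises, which corresponds to an errant write hitting a protected page-table page.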

4.3 Kernel Shadow Stack

Even state-of-the-art leakage-resilient diversification schemes are vulnerable to code-reuse attacks constructed out of protected code pointers without direct knowledge of the code layout [60]. These attacks use techniques such as profiling to learn the indirection of protected code pointers and use them, for instance, to return to any call site or function. IskiOS prevents the misuse of ret instructions in the kernel, guaranteeing that any function will return to its actual caller. To achieve this, IskiOS uses a separate, protected stack to keep a protected copy of each function return address. We chose a parallel shadow stack [23] design in which all entries are located at a constant offset from the original stack. On each function call, IskiOS pushes the return address onto both the main kernel stack and the shadow kernel stack. On function return, control is redirected to the address present on top of the shadow stack. This eliminates the need for a comparison between the two return addresses while forcing execution to continue from the intended call site.

To protect the shadow stack itself from tampering, write access to the shadow stack is disabled by default. IskiOS reserves a third kernel protection key and assigns it to pages used for shadow stacks. During normal execution, write access to the shadow stack is disabled while read access remains enabled. When IskiOS needs to create a copy of the return address on the shadow stack, it temporarily enables write access to the shadow stack’s protection key in the pkru register, pushes the return address to the shadow stack, and then revokes write access to the protected area. Similar treatment is also applied for interrupts. IskiOS saves the interrupt frame to the shadow stack before calling the target handler and uses the interrupt frame in the shadow stack on return from interrupt.
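The call and return behavior of a parallel shadow stack can be sketched with a toy simulation. The Python model below is illustrative only: the class `ToyStacks`, the 8-byte slot size, the initial stack pointer, and the value of `SHADOW_OFFSET` are assumptions, and the pkru toggling around the shadow write is reduced to a comment.

```python
SHADOW_OFFSET = 0x4000   # assumed constant distance between a slot and its shadow


class ToyStacks:
    """Main stack and parallel shadow stack sharing one address space."""

    def __init__(self, initial_sp=0x3FF8):
        self.mem = {}            # address -> stored return address
        self.sp = initial_sp

    def call(self, return_address):
        self.sp -= 8
        self.mem[self.sp] = return_address
        # In the real design, write access to the shadow key would be
        # enabled just before this store and revoked right after it.
        self.mem[self.sp + SHADOW_OFFSET] = return_address

    def ret(self):
        # The shadow copy decides where control goes; no comparison needed.
        target = self.mem[self.sp + SHADOW_OFFSET]
        self.sp += 8
        return target


stacks = ToyStacks()
stacks.call(0x1111)
stacks.mem[stacks.sp] = 0xDEAD    # attacker overwrites the main-stack copy
print(hex(stacks.ret()))          # control still returns to the original caller
```

Because the return target is read from the shadow slot, corrupting the main-stack copy has no effect on control flow in this model.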

As modifying the pkru register has moderate cost, avoiding
changes to the pkru register can significantly improve
performance. We therefore developed two optimizations to
reduce the number of writes to the pkru register:

**Leaf Function Optimization** Leaf functions (i.e., functions that do not call any other functions) can avoid storing the return address in memory if they have a free register into which they can store the return address. IskiOS finds leaf functions, identifies any unused caller-saved registers in such functions, and modifies the function to save the return address in one of these registers.

**Shadow Write Optimization** IskiOS executes two `wrpkru` instructions every time it copies a return address to the shadow stack: one for enabling access to the shadow stack and one for disabling it. As the shadow stack is always readable, this makes writing to the shadow stack much more expensive than reading from it. We observe that IskiOS only needs to create a shadow copy of the return address when it differs from the return address that was saved into the same location on the shadow stack. For example, if a function A calls a function B from within a loop and calls no other functions, then only the first execution of B needs to save a copy of the return address on the shadow stack. Likewise, if a function has been optimized using the tail-call optimization, its caller will use a `jmp` instruction instead of a `call` instruction to call the function. In this case, the return address has already been pushed onto the shadow stack, so there is no need to write it to the shadow stack again.

Figure 1: IskiOS Pipeline.

To leverage this behavior, we designed a new optimization called *Shadow Write Optimization (SWO)*. With SWO, IskiOS adds code to every function that first checks whether the shadow stack slot to which the return address will be written already contains that return address. If it does, the return address is not written to the shadow stack a second time. All functions perform this dynamic check, but, as Section 6 shows, SWO almost always improves performance.
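The effect of this check on the number of `wrpkru` executions can be estimated with a small counting model. The Python sketch below is an illustration, not the instrumentation itself: the function `count_wrpkru` and its representation of calls as (shadow slot, return address) pairs are hypothetical.

```python
def count_wrpkru(calls, use_swo):
    """Count wrpkru executions for a sequence of function entries.

    calls   -- iterable of (shadow_slot, return_address) pairs
    use_swo -- if True, skip the shadow write when the slot already
               holds the same return address
    """
    shadow = {}
    wrpkru = 0
    for slot, return_address in calls:
        if use_swo and shadow.get(slot) == return_address:
            continue             # slot is already correct: no write needed
        wrpkru += 2              # enable write access, then disable it again
        shadow[slot] = return_address
    return wrpkru


# Function A calls function B ten times from the same call site in a loop.
loop_calls = [(0x7F00, 0x401234)] * 10
print(count_wrpkru(loop_calls, use_swo=False))
print(count_wrpkru(loop_calls, use_swo=True))
```

In this model the loop costs twenty `wrpkru` executions without the check and two with it, since only the first entry into B finds a stale slot.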

5 Implementation

We implemented IskiOS’s protection keys and XOM as a set of patches to the Linux kernel v4.19. Note that our protection keys design relies on page table isolation provided by KPTI [37]. Our design provides SMEP [39] features by disabling execution of user pages from kernel code in kernel page tables and SMAP [39] with a novel use of protection keys. Finally, we used both Linux kernel support and compiler instrumentation to implement IskiOS’s shadow stack.

5.1 Kernel Modifications

The IskiOS changes to the Linux kernel are built as three separate patches which enable protection key support, provide execute-only memory, and provide support for shadow stacks, respectively. We describe each patch below.

**IskiOS-PK** This patch enables the Intel PKU feature for all virtual memory. We marked all pages (except a few entries that map trampoline pages handling system calls and interrupts in the user page table) as user-mode by setting the U/S bit in every page table entry. Since our design associates protection keys (PKEYs) 0–7 with kernel space and PKEYs 8–15 with user pages, we changed the default protection domain for user pages from 0 to 8 by setting bit 62 (the most significant bit of the 4-bit protection key) of every page table entry that maps a user-mode address. By default, the kernel can only access pages with PKEY 0 (and PKEYs 8–15 if SMAP is disabled), and user processes can only access pages with PKEY 8. Note that a user process may change the value in the `pkru` register arbitrarily, but it will not be able to access kernel pages since they will not be mapped in user page tables.
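The effect of setting bit 62 on the key stored in a page table entry can be checked with a little bit arithmetic. The Python sketch below is illustrative: the helper names are hypothetical, and it relies only on the key occupying bits 62:59 of the entry.

```python
PKEY_SHIFT = 59
PKEY_MASK = 0xF << PKEY_SHIFT    # bits 62:59 hold the 4-bit protection key
USER_KEY_BIT = 1 << 62           # most significant bit of the key field


def pte_pkey(pte):
    """Extract the protection key from a 64-bit page table entry."""
    return (pte & PKEY_MASK) >> PKEY_SHIFT


def mark_user_mapping(pte):
    """Move an entry into the user key range by setting bit 62."""
    return pte | USER_KEY_BIT


def is_kernel_key(key):
    return 0 <= key <= 7


pte = 0x0000000012345007         # example entry with key 0
user_pte = mark_user_mapping(pte)
print(pte_pkey(pte), pte_pkey(user_pte))
```

Setting the top bit of the key field adds 8 to the key, so the default key 0 becomes key 8 and every key in 0–7 maps to a key in 8–15.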

IskiOS must save and restore the `pkru` register on OS kernel entry and exit. We added `pkru` to the set of registers that the Linux kernel saves on interrupts, traps, and system calls within the `pt_regs` structure, which is used for saving register state on kernel entry. However, because of the cost of the `wrpkru` instruction, IskiOS’s trap dispatch code first checks the value of the `pkru` register to see if it is already set to the value used by kernel code and only modifies the `pkru` register if it needs to change. This improves performance if a trap or interrupt occurs while the OS kernel is running.
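The conditional write on kernel entry can be sketched as follows. This Python model is a simplification under stated assumptions: `KERNEL_PKRU` is a placeholder value rather than the value the prototype uses, `ToyCpu` and `kernel_entry` are hypothetical names, and a dictionary stands in for the `pt_regs` structure.

```python
KERNEL_PKRU = 0x55554000         # placeholder for the kernel's pkru value


class ToyCpu:
    """Tracks the pkru value and how often wrpkru is executed."""

    def __init__(self, pkru):
        self.pkru = pkru
        self.wrpkru_count = 0

    def rdpkru(self):
        return self.pkru

    def wrpkru(self, value):
        self.pkru = value
        self.wrpkru_count += 1


def kernel_entry(cpu, pt_regs):
    """Save the interrupted context's pkru; write only if it must change."""
    current = cpu.rdpkru()
    pt_regs["pkru"] = current
    if current != KERNEL_PKRU:
        cpu.wrpkru(KERNEL_PKRU)


cpu = ToyCpu(pkru=0x55555554)    # arbitrary user-mode value for the example
kernel_entry(cpu, {})            # entry from user space: one write
kernel_entry(cpu, {})            # nested trap inside the kernel: no write
print(cpu.wrpkru_count)
```

The second entry models a trap taken while the kernel is already running, which is the case the check is meant to make cheap.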

**IskiOS-XOM** We changed the 4-bit protection key in every page table entry that maps a kernel code page to the value 7, associating every OS kernel executable page with the protection domain defined by PKEY 7. Bits 15:14 in the `pkru` register are already set to the default value 01, which disables any data access (read or write) to the pages associated with PKEY 7. Since the access restriction does not affect instruction fetches, this patch effectively places kernel code in *execute-only memory*. This patch also adds the foundation for protection against page table tampering by using another protection key for page table pages. However, our current prototype does not fully protect page tables as we are still in the process of manually vetting all code paths that access page table pages.
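The following Python sketch shows why the value 01 in bits 15:14 yields execute-only behavior for PKEY 7. It is a toy model with hypothetical helper names; it treats the low bit of each two-bit field as access-disable and the high bit as write-disable, and it leaves out the ordinary page-table permission checks.

```python
CODE_PKEY = 7


def key_field(pkru, key):
    """Return the two pkru bits for a key (bit 1: write disable, bit 0: access disable)."""
    return (pkru >> (2 * key)) & 0b11


def allowed(pkru, key, kind):
    """Decide whether a 'fetch', 'read', or 'write' passes the key check."""
    if kind not in ("fetch", "read", "write"):
        raise ValueError("kind must be 'fetch', 'read', or 'write'")
    if kind == "fetch":
        return True              # keys do not restrict instruction fetches
    field = key_field(pkru, key)
    if field & 0b01:
        return False             # access disabled: no reads or writes
    if kind == "write" and field & 0b10:
        return False             # write disabled
    return True


pkru = 0b01 << 14                # bits 15:14 = 01, all other keys unrestricted
for kind in ("fetch", "read", "write"):
    print(kind, allowed(pkru, CODE_PKEY, kind))
```

Only the fetch succeeds in this model, which matches the execute-only behavior the patch relies on.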

**IskiOS-SS** In this patch, we doubled the size of every stack in the kernel (i.e., every per-thread stack, per-CPU interrupt stack, etc.) in order to use the upper half as a shadow stack. Specifically, we modified the kernel to allocate 8 pages (32 KB) per stack instead of 4 pages (16 KB). We write-protect the upper half using protection keys. This configuration allows us to locate the shadow stack by flipping a single bit in the stack pointer (bit 15 for a 16 KB stack). Additionally, a shadow stack of the same size enables us to replicate additional sensitive data (e.g., the x86 code segment saved on an interrupt [8]) on the shadow stack.
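The address arithmetic behind this layout can be sketched in Python. The sketch makes assumptions that the text does not state: the 32 KB region is aligned to 32 KB, the main stack occupies the lower half, and the flipped bit is the one with value 16 KB (2^14), which is the fifteenth bit when counting from one. The helper names and the example addresses are hypothetical.

```python
STACK_SIZE = 16 * 1024           # main stack: 4 pages
REGION_SIZE = 2 * STACK_SIZE     # main stack plus shadow stack: 8 pages
SHADOW_BIT = STACK_SIZE          # the single bit that selects the upper half


def shadow_slot(address):
    """Map a main-stack address to its shadow slot, or back again."""
    return address ^ SHADOW_BIT


def in_main_half(address):
    return (address & SHADOW_BIT) == 0


region_base = 0xFFFFC90000008000         # example base, aligned to 32 KB
rsp = region_base + 0x3F00               # somewhere in the main stack
print(hex(shadow_slot(rsp)))
```

Because the region is aligned to its own size, toggling that one bit moves an address by exactly 16 KB without touching the offset within the stack.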

5.2 Compiler Instrumentation

We used the LLVM compiler [43] (revision 346827) for our prototype. We extended the LLVM code generator with a `MachineFunction` pass that instruments every function in a module with code that places a copy of the return address in the shadow stack on function entry. The pass also modifies the code to use the shadow stack return value instead of the original return value on function return.

Figure 2 shows IskiOS’s prologue and epilogue code with and without our two optimizations. In the naïve stack implementation, Figure 2 (a), we first enable access to the shadow stack using the \texttt{wrpkru} instruction, which resets the write-disable (WD) bit for PKEY 3. We then copy the return address to the shadow stack and execute another \texttt{wrpkru} instruction to disable write access to the shadow stack. Note that since \texttt{rdpkru} and \texttt{wrpkru} force us to zero %ecx and %edx, we may need to spill these registers to the stack in the prologue. Additionally, the pass adds code to the epilogue of every function that copies the return address stored in the shadow stack to the original stack and executes the \texttt{ret} (or \texttt{jmp} if tail-call optimized) instruction.
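The sequence of the naïve instrumentation can be modelled in C as follows. This is a behavioural sketch only, not the code emitted by the compiler pass: pkru is a plain variable, a refused store stands in for a protection fault, and all names (`model_pkru`, `shadow_store`, `prologue`, `epilogue`) are hypothetical.

```c
#include <stdint.h>

#define SS_PKEY 3u
#define PKRU_WD(key) (1u << (2u * (key) + 1u))   /* write-disable bit of a key */

/* Modelled pkru register; the shadow stack starts write-protected. */
static uint32_t model_pkru = PKRU_WD(SS_PKEY);

/* Store that honours the modelled WD bit: returns 0 on success and
 * -1 where real hardware would raise a protection fault. */
static int shadow_store(uint64_t *slot, uint64_t value)
{
    if (model_pkru & PKRU_WD(SS_PKEY))
        return -1;
    *slot = value;
    return 0;
}

/* Naive prologue: open the write window, copy the return address,
 * close the window again. */
static int prologue(uint64_t ret_addr, uint64_t *shadow)
{
    int rc;

    model_pkru &= ~PKRU_WD(SS_PKEY);   /* first wrpkru: clear WD */
    rc = shadow_store(shadow, ret_addr);
    model_pkru |= PKRU_WD(SS_PKEY);    /* second wrpkru: set WD again */
    return rc;
}

/* Epilogue: overwrite the regular-stack copy with the protected copy
 * before the function returns. */
static void epilogue(uint64_t *stack_slot, const uint64_t *shadow)
{
    *stack_slot = *shadow;
}
```

Outside the prologue the write-disable bit is set, so a stray store to the shadow slot is refused, and a corrupted return address on the regular stack is replaced by the protected copy in the epilogue.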

Figure 2 (b) shows the instrumentation for leaf functions
**Hidden Shadow Stack (HSS)** As a point of comparison, we also implement a hidden shadow stack where we rely on random placement of the shadow stack to “hide” it (instead of disabling writes to it using protection keys). We use a debug register (\texttt{dr0}) to store the base of the shadow region and use the lower 14 bits of \texttt{rsp} as the offset to find shadow entries. We also ensure that \texttt{dr0} is never leaked to the stack. While hidden shadow stacks can potentially reduce the overhead significantly, the additional performance comes with a significant security compromise: an attacker with sufficient memory disclosure ability (e.g., using buffer overreads [64]) can infer the location of the hidden shadow stack and subvert the control flow.
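The address computation for a hidden shadow entry, as described above, amounts to a base plus a masked offset. A minimal sketch, with the hypothetical names `HSS_OFFSET_MASK` and `hss_slot` and with the base passed as an ordinary argument in place of the debug register:

```c
#include <stdint.h>

#define HSS_OFFSET_MASK ((1ULL << 14) - 1)   /* lower 14 bits of rsp */

/* Locate the hidden shadow entry for the current stack slot:
 * base of the shadow region (held in dr0) plus the offset of rsp
 * within its 16 KB stack. */
static uint64_t hss_slot(uint64_t dr0_base, uint64_t rsp)
{
    return dr0_base + (rsp & HSS_OFFSET_MASK);
}
```

The security of this variant rests entirely on the base staying secret, which is why the value in the debug register must never be spilled to memory.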

**Limitations** Our current implementation of shadow stacks, shown in Figure 2, suffers from two race hazards. The first race window lies between the call instruction that pushes the return address onto the stack and the first instruction in the prologue that reads the return address; an attacker could overwrite the return address between the time it is saved to and then read from the stack. The second is a similar window between the last instruction of the epilogue and the ret instruction. The first race hazard can be addressed by moving the prologue instrumentation to before the call instruction; the second can be addressed by directly using the value on the shadow stack with an unconditional jump. However, both approaches may fail to use the processor’s return branch predictor effectively, incurring additional overhead. Although the race windows are extremely narrow, making their reliable exploitation nearly impossible, we plan to further investigate this issue as part of our future work.

6 Evaluation

We evaluated the performance overhead that IskiOS incurs for providing execute-only memory and a shadow stack. We also studied the performance improvements of the optimizations discussed in Section 4.3 (i.e., the leaf function and shadow write optimizations). Specifically, we examined the following systems:

- **vanilla**: Unmodified Linux kernel v4.19 (KPTI enabled, SMAP disabled)
- **XOM**: IskiOS kernel with execute-only memory (XOM)
- **SS**: IskiOS kernel with XOM and shadow stack
- **SS+LFO**: IskiOS kernel with XOM, shadow stack, and the leaf function optimization (LFO) enabled
- **SS+LFO+SWO**: IskiOS kernel with XOM, shadow stack, and both optimizations enabled (LFO and shadow write optimization (SWO))
- **HSS+LFO**: IskiOS kernel with XOM, a hidden shadow stack and LFO enabled

We used the LMBench suite [51] for micro-benchmarking and the Phoronix Test Suite (PTS) [52] to measure the performance impact on real-world applications. We performed our experiments on a Fedora Linux 28 system equipped with two 3.00 GHz Intel Xeon Silver 4114 (Skylake) CPUs (2×10 cores, 40 threads), 16 GB of RAM, and a 1 TB Seagate 7200 RPM disk. For all our tests, we loaded the intel_pstate performance scaling driver into the kernel to prevent the processor from reducing frequency (for power saving) during our experiments. For the networking experiments, we ran the client and server on the same machine. We used the default settings on all benchmarks.

When possible, we compare the overhead of IskiOS against kR^X-MPX [58] which, to the best of our knowledge, is the only other system to provide execute-only memory in the kernel, and KCoFI [21], a system that enforces control-flow integrity (CFI) on a modern OS kernel.

6.1 Micro-benchmarks

To better understand the impact of IskiOS on various OS subsystems, we used LMBench [51] v.3.0-a9 to measure the latency and bandwidth overheads imposed by IskiOS on basic kernel operations. In particular, we selected benchmarks that measure the latency of critical I/O system calls (open()/close(), read()/write(), select(), fstat(), stat(), mmap()/munmap()), as well as the overhead on execution mode switches (null system call) and context switches between two processes. We also measured the impact on process creation followed by exit(), execve() and /bin/sh, as well as the latency of signal installation (via sigaction()) and delivery, protection faults and page faults. Finally, we measured the latency overhead on pipe I/O (Unix Domain) and socket I/O (TCP and UDP sockets), and the bandwidth degradation on pipe (Unix Domain), socket (TCP), and file I/O operations. We report the geometric mean of 10 runs for each microbenchmark.

Table 1 summarizes our results. The second column shows the geometric mean for ten runs of each latency and bandwidth microbenchmark on the unmodified Linux kernel (i.e., our baseline). Columns 3-7 show the overheads over the geometric means for the various versions of IskiOS that we examined. The maximum standard deviation among the results was 3.97%, with most tests having a standard deviation of less than 2%. For the throughput experiments, we ensured that the baseline is limited by CPU time; as a result, throughput degradation can be reasonably interpreted as overhead. The columns named kR^X-MPX and KCoFI report the overheads published in the kR^X [58] and KCoFI [21] papers. Both works used LMBench, though on different hardware and OS kernels. kR^X-MPX [58] used the default settings for all microbenchmarks except select(). KCoFI [21] used the version of LMBench in the FreeBSD ports tree.

Table 1 shows that IskiOS’s XOM implementation incurs almost no overhead when accounting for standard deviation. XOM adds less than ten instructions to the OS kernel entry/exit path, and as expected, the overhead on kernel operations (except for the null system call which performs an extremely small service) is negligible. In contrast, the unoptimized shadow stack implementation (SS) adds significant instrumentation to the prologue and epilogue of every function. Each function executes a pair of rdpkru and wrpkru instructions twice in the prologue, incurring a relatively high overhead of up to nearly 5× (geomean 143%) in latency and 48% (geomean 36%) in bandwidth. However, when the leaf function optimization is enabled (SS+LFO implementation) the overheads decrease to a maximum of 3.1× (geomean 103%) for latency and 40% (geomean 28%) for throughput, and when both optimizations are enabled (SS+LFO+SWO implementation) the overheads decrease dramatically to a maximum of 2.3× (geomean 58%) for latency and 34% (geomean 24%) for throughput. While, admittedly, a hidden shadow stack does not provide the same level of security guarantees as the protected shadow stack, elimination of two rdpkru/wrpkru pairs from function prologues significantly reduces the overheads to a maximum of 32% (geomean 6%) for latency and a maximum of 8% (geomean 6%) for bandwidth.

kR^X-MPX [58] (Table 1, column 8) provides the same level of security as our XOM implementation for a slightly higher (but still acceptable) overhead. However, kR^X-MPX requires that all code be laid out in a contiguous region in memory to be protected, a requirement that weakens already deployed randomization protections. Our results show that execute-only memory can be implemented with equivalent or better performance without breaking the code layout of the OS kernel.

Table 1 also shows that IskiOS provides much better performance than KCoFI [21] [1], an instance of label-based CFI enforcement in the kernel. KCoFI checks the target of every indirect branch before the branch is taken. Unlike IskiOS, KCoFI does not protect return addresses from corruption and is therefore vulnerable to trivial ROP attacks [12]. Finally, KCoFI also employs SFI [72] on store instructions to protect program counters and stack pointers saved on context switches, interrupts, traps, and system calls. Non-control-data attacks are considered out of scope in our design. However, there is no limitation in IskiOS’s design that prevents the deployment of techniques for protecting the integrity of such data. In fact, IskiOS could easily create a new, write-protected region (similar to the shadow stack) to securely save processor state during context switches. Given Intel PKU’s lower overheads, we believe IskiOS could provide similar protections with significantly lower overhead by employing our kernel protection key mechanism in place of SFI.

6.2 Macro-benchmarks

To assess the overhead of IskiOS on real-world programs, we used the Phoronix Test Suite [52] v8.4.1 (Skitvet). Phoronix is an open-source automated benchmarking suite with over 300 different benchmarks grouped into categories such as disk, network, processor, graphics and system. Phoronix’s system benchmarks are particularly popular for tracking performance regressions of Linux kernels [46]. We chose 11 tests from Phoronix’s system benchmark which cover different kinds of workload: a) web servers like Apache and Nginx, b) compilation of the Linux kernel (called Kbuild), c) encryption like GnuPG and OpenSSL, d) interpreters such as Python (PyBench) and PHP (PHPBench), e) databases such as SQLite, f) key-value stores like Redis and Memcached and g) PostMark, a file-system benchmark which is designed to simulate small-file use similar to web and email servers.

All the results in this section have a standard deviation of less than 3.5%: Phoronix keeps running a benchmark until the standard deviation falls below the threshold (3.5% by default) or a maximum number of runs is exhausted. We verified that all benchmarks set up a sufficient number of concurrent operations (e.g., we use 100 and 500 concurrent requests for Apache and Nginx, respectively) to ensure throughput is not bound by I/O latencies and degradation can be reasonably interpreted as overhead.

Table 2 presents the overhead of IskiOS on each benchmark. The second column, vanilla, shows the metric used by each benchmark and the result on the unmodified kernel (i.e., our baseline). Similar to the micro-benchmark results in Section 6.1, XOM incurs no measurable overhead for any of the applications. However, the shadow stack (SS+LFO+SWO) and hidden shadow stack (HSS+LFO) within the OS kernel incur different performance penalties on different applications. For interpreters, i.e., PyBench and PHPBench, and encryption programs, GnuPG and OpenSSL, the overheads are negligible because these programs spend most of their time executing user-mode code. For web servers serving static web pages, however, most of the time is spent either accessing files through file-system interfaces (e.g., open()/close()) or sending and receiving requests over TCP, resulting in significantly higher performance overheads. Key-value stores, compilers and database applications spend significant amounts of time performing both user-space and kernel-space computation. The PostMark benchmark, similar to the web servers that it simulates, incurs moderate overhead when the OS kernel employs IskiOS’s shadow stacks. The hidden shadow stack (with a geometric mean of 1.3%) incurs significantly lower overhead compared to a protected shadow stack (with a geometric mean of 12.3%). This comparison shows that most of the overhead of IskiOS’s shadow stack comes from changing the value of the pkru register.

Table 2 also includes the overheads reported for kR^X-MPX [58]. Since we evaluated some programs that were not used for the kR^X-MPX evaluation, a comparison is not always possible. For the commonly used benchmarks, IskiOS’s XOM implementation consistently performs as well as or better than kR^X-MPX. This is expected since kR^X-MPX instruments almost every memory load, while XOM adds at most 10 instructions to each mode switch.

<table>
<thead>
<tr>
<th>Benchmark</th>
<th>vanilla</th>
<th>unit</th>
<th>XOM</th>
<th>SS+LFO+SWO</th>
<th>HSS+LFO</th>
<th>kR^X [58]*</th>
</tr>
</thead>
<tbody>
<tr>
<td>Apache</td>
<td>27753</td>
<td>req/s</td>
<td>~0%</td>
<td>45.75%</td>
<td>11.11%</td>
<td>0.48%</td>
</tr>
<tr>
<td>Kbuild</td>
<td>57.19</td>
<td>sec</td>
<td>~0%</td>
<td>5.47%</td>
<td>1.42%</td>
<td>3.21%</td>
</tr>
<tr>
<td>GnuPG</td>
<td>15.54</td>
<td>sec</td>
<td>~0%</td>
<td>0.84%</td>
<td>0.71%</td>
<td>~0%</td>
</tr>
<tr>
<td>OpenSSL</td>
<td>3812</td>
<td>sign/s</td>
<td>~0%</td>
<td>0.25%</td>
<td>0.23%</td>
<td>~0%</td>
</tr>
<tr>
<td>PyBench</td>
<td>1766</td>
<td>msec</td>
<td>~0%</td>
<td>0.28%</td>
<td>0.24%</td>
<td>~0%</td>
</tr>
<tr>
<td>PHPBench</td>
<td>469753</td>
<td>score</td>
<td>~0%</td>
<td>~0%</td>
<td>~0%</td>
<td>~0%</td>
</tr>
<tr>
<td>PostMark</td>
<td>4717</td>
<td>trans/s</td>
<td>~0%</td>
<td>53.11%</td>
<td>10.73%</td>
<td>1.81%</td>
</tr>
<tr>
<td>SQLite</td>
<td>423.95</td>
<td>query/s</td>
<td>~0%</td>
<td>10.86%</td>
<td>2.84%</td>
<td></td>
</tr>
<tr>
<td>Redis</td>
<td>17.37</td>
<td>gets/s</td>
<td>~0%</td>
<td>9.31%</td>
<td>1.67%</td>
<td></td>
</tr>
<tr>
<td>Nginx</td>
<td>2432</td>
<td>req/s</td>
<td>~0%</td>
<td>34.99%</td>
<td>8.87%</td>
<td></td>
</tr>
<tr>
<td>Memcached</td>
<td>533284</td>
<td>gets/s</td>
<td>~0%</td>
<td>24.56%</td>
<td>6.42%</td>
<td></td>
</tr>
</tbody>
</table>

* We considered kR^X-MPX, which only provides execute-only memory (cf. XOM).

7 Security Discussion

IskiOS protects the integrity of return addresses in the kernel by saving every call site securely on its protected shadow stack. This eliminates all traditional ROP (i.e., misusing ret instructions to subvert execution), which is the predominant technique used for mounting real-world code-reuse attacks [69].

An attacker therefore has to discover other indirect branch instructions to chain CRA gadgets. Since IskiOS builds on top of KASLR (or finer-grained diversification schemes), any prior information about the code layout, obtained during an offline analysis of the kernel, is no longer useful. In addition, IskiOS places all (randomized) code in execute-only memory, making guessing attacks and direct reads from the code segment (i.e., JIT-style attacks through direct memory disclosure) impossible. The only option left for the attacker is to harvest code pointers from the kernel’s heap and stack and try to infer the code layout during execution (i.e., JIT code-reuse attack through an indirect memory disclosure).

Deploying IskiOS on a leakage-resilient diversification scheme (e.g., Readactor [20]) mitigates most code-reuse attacks. A leakage-resilient diversification scheme creates an indirection layer that hides the actual code layout from the pointers present in readable memory. An attacker that harvests code pointers from readable memory can at most learn an indirection to a callsite or a function (i.e., the actual code layout cannot be leaked).

That said, learning the indirection for functions allows an adversary to invoke an entire function, as opposed to smaller CRA gadgets. This is another flavor of return-to-libc [63] attacks, and the main idea behind Address-oblivious Code Reuse [60]. Such attacks are especially hard to prevent in the OS kernel since the invoked functions could be legitimate targets (i.e., new functions can be registered by modules at run time).

8 Related Work

Recent approaches against code-reuse attacks in OS kernel code employ code diversification and/or code-pointer hiding techniques to prevent attackers from discovering the location of gadgets during execution. ASLR [65] randomizes the base address of various sections of an executable program (e.g., heap, stack, text) and KASLR [26] also randomizes the address at which the kernel image is decompressed on boot. Other techniques randomize code at the granularity of functions [41, 58], basic blocks [58], instructions [57], and registers [20, 57]. Unfortunately, even high-entropy randomization schemes can be broken through information leaks [5, 6, 20, 33].

KHide [32] and kR^X [58] combine execute-only memory with kernel diversification to prevent gadgets from being leaked. KHide [32] applies instruction-level diversification across all kernel sources at compile time and uses a hypervisor to prevent read accesses to kernel code at runtime. IskiOS does not require more privileged software for its execution, keeping its TCB small³ and avoiding unnecessary virtualization overheads. kR^X [58] provides execute-only memory for kernel code without hypervisor support. kR^X modifies the kernel layout to separate code from data, and instruments all read operations with runtime checks to ensure that they never fall into the code segment. kR^X needs to place all code in a contiguous region, weakening already deployed diversification schemes such as KASLR [26]. To make up for the entropy loss, kR^X re-arranges code in the protected region using function permutation and, at the function level, block randomization [58]. In contrast, IskiOS does not break the memory layout, provides more flexibility (i.e., it supports more than one protected area in memory), and preserves the protection guarantees of existing randomization schemes. Finally, IskiOS’s shadow stack provides stronger protection for return addresses than kR^X, which only hides them in memory, and KHide, which does not provide any protection at all.

Several kernel defenses enforce some control-flow integrity policy during execution. SVA [22] uses whole-program points-to analysis on the Linux kernel to enforce memory safety. SVA performs run-time checks on forward edges of the computed CFG; these checks, as well as the memory safety checks, are limited by the precision of the static analysis. KCoFI [21] protects sensitive data (e.g., program counter) from corruption during context switches by saving them in a kernel-inaccessible region in memory. To prevent illegal flows during normal execution, it enforces a coarse-grained CFI policy that uses a single label to tag all targets of indirect control transfers as valid [21]. As a result, any function can be called from a callsite and can return to any callsite in the kernel. Both KCoFI and SVA have higher overheads than IskiOS while their protections against code-reuse attacks are weaker.

Ge et al. [31], based on the assumption that no data pointer points to a function pointer, use taint analysis to track function pointers in the kernel and determine the set of targets for every indirect call. Although true for their FreeBSD [50] and MINIX [38] prototypes, this assumption does not hold in general (e.g., Linux). Furthermore, since their analysis requires that all valid targets be statically computed, it breaks loadable kernel module support and excludes preemptive kernels. kCFI [54] uses both source code and binary analysis to compute an augmented call graph for the Linux kernel and adds checks to indirect branches to verify that the type signature of each indirect control transfer matches the type of its target. While their combined analysis may be more precise than points-to analysis, type collisions are possible and, as shown by Farkhani et al. [30], abundant in large code bases. Similar to Ge’s [31] approach, kCFI cannot support loadable kernel modules due to static analysis restrictions. In contrast, IskiOS does not impose any requirement on kernel modules. In fact, IskiOS provides two options to any loadable module: 1) it can be loaded as is, or 2) it can be compiled prior to loading with our compiler to enjoy the security benefits of a shadow stack. In either case, the base kernel is protected with its shadow stack and XOM.

³ IskiOS adds less than 1 KLOC compared to 25 KLOC for kR^X. KHide does not report lines of code for its TCB.

Code-reuse attacks on the OS kernel can succeed without hijacking control flow directly. An attacker that has successfully gained control over the kernel can change permissions in page table entries [44] to overwrite kernel code pages with a malicious payload, or to change the physical addresses in the page table entries so that new frames are mapped into the kernel code segment. Several systems like SVA [22], HyperSafe [73], KCoFI [21] and Nested Kernel [24] protect the MMU configurations to ensure the integrity of the code segment, while PT-Rand [25] uses randomization techniques to hide kernel page tables in memory, providing probabilistic guarantees on the integrity of the MMU configuration. As discussed in Section 4.2, IskiOS includes page-table protections in its design.

The simplest line of defense for return addresses is to detect their corruption on the stack. StackGuard [19] and ProPolice [27, 71] are stack-smashing protectors that place a canary word prior to the saved return address on the stack and verify that it has not been corrupted before using the return address. Other approaches [55, 58, 67] use encryption-based mechanisms to protect return addresses. RAP™ [67], a patented defense mechanism against code reuse attacks, uses return address encryption in the Linux kernel. Specifically, RAP™ uses a secret key (XOR cookie) placed in a reserved general-purpose register to encrypt the return address on function entry and decrypt it on function return. This solution, although stronger than simple canaries, is susceptible to information leaks and brute-force attacks. Stronger protections store a copy of each callsite on a separate (shadow) stack [10, 15, 23, 70], and on each function return, verify that the return address on the stack matches the copy on the shadow stack. These solutions require double the effort for the attack to succeed (i.e., two return addresses have to be corrupted in memory). IskiOS is, to the best of our knowledge, the first system to implement a shadow stack in the OS kernel. In addition, IskiOS write-protects the shadow stack to preserve the integrity of each callsite rather than detect its corruption.
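The XOR-cookie scheme mentioned above, and the reason it is sensitive to information leaks, can be illustrated with a few lines of C. This is a generic sketch of XOR-based return address protection, not RAP's actual code, and the function names are hypothetical.

```c
#include <stdint.h>

/* Function entry: replace the return address with its XOR-masked form. */
static uint64_t xor_protect(uint64_t ret_addr, uint64_t cookie)
{
    return ret_addr ^ cookie;
}

/* Function return: undo the masking with the same secret cookie. */
static uint64_t xor_recover(uint64_t protected_addr, uint64_t cookie)
{
    return protected_addr ^ cookie;
}

/* An attacker who learns a single (plain, masked) pair recovers the
 * cookie, because XOR is its own inverse. */
static uint64_t leak_cookie(uint64_t ret_addr, uint64_t protected_addr)
{
    return ret_addr ^ protected_addr;
}
```

Once the cookie is known, any chosen address can be masked correctly, which is why a write-protected shadow copy gives a stronger guarantee than masking alone.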

9 Conclusions

IskiOS is, to the best of our knowledge, the first system to implement protected shadow stacks and flexible execute-only memory for the OS kernel. Shadow stacks protect the return address from corruption, and execute-only memory forms an integral part of state-of-the-art leakage-resilient diversification systems by hiding the code from buffer overread attacks. Unlike previous work, IskiOS places no restrictions on virtual address space layout, allowing the operating system to achieve higher diversification entropy by placing kernel stacks and kernel code in any location within the virtual address space. IskiOS achieves these benefits through a novel use of Intel PKU for protection inside the OS kernel. The PKU-based implementation of execute-only memory incurs virtually no performance overhead. The addition of protected shadow stacks leads to a slowdown of 12.3% (geometric mean) on the Phoronix Test Suite.

We are in the process of integrating state-of-the-art diversification into IskiOS. We also plan to use protection keys to harden the OS kernel against non-control-data attacks by protecting sensitive data regions (e.g., process control blocks (PCBs), interrupt tables, etc.) against unauthorized accesses.

References


