Read-only dynamic data

By Jonathan Corbet
March 27, 2018

Kernel developers go to some lengths to mark read-only data so that it can be protected by the system's memory-management unit. Memory that cannot be changed cannot be altered by an attacker to corrupt the system. But the kernel's mechanisms for managing read-only memory do not work for memory that must be initialized after the initial system bootstrap has completed. A patch set from Igor Stoppa seeks to change that situation by creating a new API just for late-initialized read-only data.

The most straightforward way to create read-only data is, of course, the C const keyword. The compiler will annotate any data marked with const, and the linker will ensure that it is placed in memory that ends up being marked read-only. But const only works at build time. The post-init read-only data mechanism, adapted from the grsecurity patch set, takes things a step further by marking data that can be made read-only once the system's initialization process has completed. Data structures that must be set up during boot, but which need not be modified thereafter, can be protected in this way.
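
As a rough user-space analogy (not kernel code), the two mechanisms can be sketched as follows. In mainline kernels the post-init mechanism is exposed as the `__ro_after_init` annotation; here it is imitated with mmap() and mprotect(). The names `max_widgets`, `widget_limit`, and the three helper functions are hypothetical and exist only for this illustration.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Build-time read-only: the toolchain places this in a read-only section. */
static const int max_widgets = 64;

/* Hypothetical "post-init" value. It lives in its own anonymous page so the
 * page's protection can be changed after the value has been filled in. */
static int *widget_limit;

/* "Boot": allocate a writable page and initialize the value. */
int widget_limit_init(int requested)
{
    void *page = mmap(NULL, (size_t)sysconf(_SC_PAGESIZE),
                      PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return -1;
    widget_limit = page;
    *widget_limit = requested < max_widgets ? requested : max_widgets;
    return 0;
}

/* "End of init": drop write permission; later stores to *widget_limit fault. */
int widget_limit_seal(void)
{
    assert(widget_limit != NULL);
    return mprotect(widget_limit, (size_t)sysconf(_SC_PAGESIZE), PROT_READ);
}

/* Reads keep working after the page has been sealed. */
int widget_limit_get(void)
{
    assert(widget_limit != NULL);
    return *widget_limit;
}
```

A caller would run widget_limit_init() during startup, call widget_limit_seal() once initialization is finished, and use only widget_limit_get() from then on.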

Once initialization is completed, though, the (easy) ability to create read-only data in the kernel goes away. At that point, any additional memory needed must be allocated dynamically, and such memory is, by its nature, writable. So, while a kernel subsystem may well allocate memory, fill it in, and never change it again, there is no mechanism in place to actually block further modifications to that memory.

The proposed new API could be such a mechanism; it is called the "protectable memory allocator", or "pmalloc". The core concept is that a subsystem allocates a "pool" for a set of objects that will all be rendered read-only at the same time. The individual objects can be allocated from the pool and initialized; then the whole thing is set in stone. Or, in terms of code, one starts with:

    #include <linux/pmalloc.h>

    struct pmalloc_pool *pool = pmalloc_create_pool();

The return value (on success) is a pointer to a pool from which objects can be allocated. Thereafter, objects can be allocated from the pool with any of:

    void *pmalloc(struct pmalloc_pool *pool, size_t size);
    void *pzalloc(struct pmalloc_pool *pool, size_t size);
    void *pmalloc_array(struct pmalloc_pool *pool, size_t n, size_t size);
    void *pcalloc(struct pmalloc_pool *pool, size_t n, size_t size);
    char *pstrdup(struct pmalloc_pool *pool, const char *s);

The basic allocation function is pmalloc(), which allocates a chunk of memory of the given size from the pool. Variants include pzalloc() (which zeroes the memory before returning it), pmalloc_array() (to allocate an array of objects), pcalloc() (which perhaps should be called pzalloc_array()), and pstrdup() (which allocates memory and copies a string into it).
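
To illustrate how such variants are typically layered on a basic allocator, here is a minimal user-space sketch. It assumes, by analogy with kmalloc_array() and kcalloc(), that the array variants reject requests whose total size would overflow; whether the patch set does exactly this is not stated above. All demo_* names are hypothetical, and malloc() stands in for the pool allocator.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for pmalloc(); plain malloc() keeps the sketch runnable. */
static void *demo_alloc(size_t size)
{
    return malloc(size);
}

/* Zeroing variant, in the spirit of pzalloc(). */
void *demo_zalloc(size_t size)
{
    void *p = demo_alloc(size);
    if (p)
        memset(p, 0, size);
    return p;
}

/* Array variant: refuse the request if n * size would overflow size_t. */
void *demo_alloc_array(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;
    return demo_alloc(n * size);
}

/* Zeroed array, in the spirit of pcalloc(). */
void *demo_calloc(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;
    return demo_zalloc(n * size);
}

/* String duplication, in the spirit of pstrdup(). */
char *demo_strdup(const char *s)
{
    size_t len = strlen(s) + 1;  /* include the terminating NUL */
    char *copy = demo_alloc(len);
    if (copy)
        memcpy(copy, s, len);
    return copy;
}
```

The overflow check matters because a wrapped multiplication would otherwise yield a buffer far smaller than the caller believes it has.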

When the process of allocating and initializing objects has run its course, the entire set of objects associated with the pool is made read-only with a call to:

    void pmalloc_protect_pool(struct pmalloc_pool *pool);

It's worth noting that it is still possible to allocate objects from the pool after a call to pmalloc_protect_pool(). The newly allocated objects will be writable until the next call. Frequent calls to pmalloc_protect_pool() when many objects are being allocated may result in faster protection, but it may also waste any unallocated memory within the pages that are write-protected. There is no way to "unprotect" memory once it has been made read-only; the protection of memory in pmalloc is meant to be a permanent thing.
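
The trade-off can be put into numbers with a small calculation. The sketch below assumes 4096-byte pages, that protection is applied to whole pages, and that the unused tail of a protected page is never handed out again; it ignores alignment padding. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096u  /* assumed page size */

/* Bytes lost in the last page when protection is applied after `used` bytes. */
size_t wasted_on_protect(size_t used)
{
    size_t tail = used % DEMO_PAGE_SIZE;
    return tail ? DEMO_PAGE_SIZE - tail : 0;
}

/* Protect after every allocation: each object ends up on its own sealed page. */
size_t waste_protect_each(size_t object_size, size_t count)
{
    return count * wasted_on_protect(object_size);
}

/* Allocate everything first, then protect once at the end. */
size_t waste_protect_once(size_t object_size, size_t count)
{
    return wasted_on_protect(object_size * count);
}
```

Under these assumptions, 100 objects of 64 bytes each lose 403200 bytes when every allocation is followed by a protect call, but only 1792 bytes when the pool is protected once after all of them have been allocated.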

While there are numerous variants of pmalloc() to obtain protectable memory, there is no pfree() function. The only way to release this memory is to get rid of the entire pool with:

    void pmalloc_destroy_pool(struct pmalloc_pool *pool);

This call frees every object allocated from the pool, so the caller should be sure that none of them are still in use.
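
The whole life cycle (create, allocate, protect, destroy) can be imitated in user space. The following is only a sketch under stated assumptions, not the patch's implementation: it uses a fixed-size pool, a simple bump allocator, and mmap()/mprotect() in place of vmalloc() and page-table manipulation. Every demo_* name is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define DEMO_POOL_BYTES ((size_t)64 * 1024)  /* assumed fixed pool size */

struct demo_pool {
    char *base;   /* start of the mapped region */
    size_t used;  /* bytes handed out (or sealed) so far */
};

/* Create a pool backed by one anonymous, initially writable mapping. */
struct demo_pool *demo_create_pool(void)
{
    struct demo_pool *pool = malloc(sizeof(*pool));
    if (!pool)
        return NULL;
    void *mem = mmap(NULL, DEMO_POOL_BYTES, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        free(pool);
        return NULL;
    }
    pool->base = mem;
    pool->used = 0;
    return pool;
}

/* Bump allocation with 16-byte alignment; there is no per-object free. */
void *demo_pmalloc(struct demo_pool *pool, size_t size)
{
    size_t start = (pool->used + 15) & ~(size_t)15;
    if (start > DEMO_POOL_BYTES || size > DEMO_POOL_BYTES - start)
        return NULL;
    pool->used = start + size;
    return pool->base + start;
}

/* Make everything allocated so far read-only, in whole pages. */
int demo_protect_pool(struct demo_pool *pool)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t end = (pool->used + page - 1) / page * page;
    if (end == 0)
        return 0;
    if (mprotect(pool->base, end, PROT_READ) != 0)
        return -1;
    pool->used = end;  /* the rest of the last page is given up */
    return 0;
}

/* Release the whole pool, and every object in it, at once. */
void demo_destroy_pool(struct demo_pool *pool)
{
    munmap(pool->base, DEMO_POOL_BYTES);
    free(pool);
}

/* Try a one-byte store in a forked child; returns 1 if the child did not
 * survive it, 0 if the store succeeded, -1 on error. */
int demo_write_faults(void *addr)
{
    int status;
    pid_t child = fork();
    if (child < 0)
        return -1;
    if (child == 0) {
        *(volatile char *)addr = 'x';
        _exit(0);
    }
    if (waitpid(child, &status, 0) < 0)
        return -1;
    return !(WIFEXITED(status) && WEXITSTATUS(status) == 0);
}
```

In this sketch, an object allocated before demo_protect_pool() becomes unwritable once the pool is protected, while an object allocated afterward lands on a fresh page and stays writable until the next protect call. That mirrors the behavior described for pmalloc_protect_pool().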

Underneath this interface, pmalloc uses vmalloc() to obtain a range (some number of pages) of memory; that range is then managed to satisfy individual allocation requests. As a result, the pmalloc functions cannot be used in atomic context, but it is hard to imagine a situation where that would be necessary in any case. Marking the pool read-only is a matter of dipping into the internals of the memory-management layer to tweak the page-table entries for the pages holding the pool.

The advantage of using pmalloc is that it can protect objects from unwanted modifications after they have been initialized. There is one significant limitation, though. While the protection is applied to the page-table entries in the vmalloc() area containing the pool, the underlying memory remains writable in the main system memory map. So an attacker who can determine where an object sits in physical memory may be able to bypass the protection applied by pmalloc entirely. Pmalloc thus places another obstacle in an attacker's path, but is not an absolute protection against modification.
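
The aliasing problem can be reproduced in user space by mapping the same memory twice with different permissions. This is only an analogy: the read-only mapping plays the role of the protected vmalloc() range and the writable mapping plays the role of the main system memory map. The demo_alias names are hypothetical, and a temporary file serves as the shared backing store.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct demo_alias {
    const char *protected_view;  /* read-only mapping of the backing pages */
    char *writable_view;         /* second, writable mapping of the same pages */
    size_t len;
};

/* Map one page of backing memory twice: once read-only, once writable. */
int demo_alias_setup(struct demo_alias *a)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    void *rw = MAP_FAILED, *ro = MAP_FAILED;
    FILE *backing = tmpfile();
    if (!backing)
        return -1;
    int fd = fileno(backing);
    if (ftruncate(fd, (off_t)len) == 0) {
        rw = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        ro = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    }
    fclose(backing);  /* the mappings outlive the file handle */
    if (rw == MAP_FAILED || ro == MAP_FAILED) {
        if (rw != MAP_FAILED)
            munmap(rw, len);
        if (ro != MAP_FAILED)
            munmap(ro, len);
        return -1;
    }
    a->protected_view = ro;
    a->writable_view = rw;
    a->len = len;
    return 0;
}

/* Modify the data through the writable alias, bypassing the read-only view. */
void demo_alias_overwrite(struct demo_alias *a, const char *text)
{
    snprintf(a->writable_view, a->len, "%s", text);
}
```

A store through protected_view would fault, yet demo_alias_overwrite() changes what protected_view reads back. Anyone who can find and use the writable alias is not stopped by the protection on the other mapping.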

One thing that is missing from the current patch set is a set of in-kernel users of the new API. So it is not entirely clear where Stoppa intends this mechanism to be used. That omission will probably have to be rectified at some point; the kernel community is reluctant to merge new APIs without in-tree users to show how those APIs work in real-world use. Once that has been filled in, the community should have some time to debate the merits of this interface before the 4.18 development cycle begins.

Read-only dynamic data

Posted Mar 28, 2018 3:18 UTC (Wed) by kees (subscriber, #27264)

As I understand it, the main system memory map permission mismatch flaw only exists on architectures that aren't using a fine-grain page table (e.g. arm64 by default). This shouldn't be a problem on x86.

The thread on this was here: https://lkml.org/lkml/2018/2/14/799

Read-only dynamic data

Posted Mar 28, 2018 17:25 UTC (Wed) by timraymond (guest, #117450) (3 responses)

Forgive my naivete here: How does this differ from using mprotect to mark pages read only and then seccomp to disable access to mprotect?

Read-only dynamic data

Posted Mar 28, 2018 18:28 UTC (Wed) by willy (subscriber, #9762) (2 responses)

This is within the kernel; you can't call mprotect on kernel allocations.

Read-only dynamic data

Posted Apr 5, 2018 23:09 UTC (Thu) by mkatiyar (guest, #75286) (1 response)

mprotect is a syscall. So whatever it does in the kernel, the same thing could be done here, right?

Read-only dynamic data

Posted Apr 7, 2018 9:44 UTC (Sat) by igor_stoppa (guest, #76128)

Which is precisely what the patchset does, leaving aside some technical details.
The main points to justify the patchset are:
* better utilization of the pages allocated (and less TLB thrashing), compared to plain vmalloc, which allocates at least one new page for each invocation
* segregation of read-only variables, compared to kmalloc
* potentially more efficient code (depends on the specific case), compared to flex_array, certainly easier to read
* grouping of related variables in pools, for ease of protection/destruction
* additional tagging, for usercopy

PCalloc vs PZalloc_array

Posted Mar 28, 2018 18:31 UTC (Wed) by willy (subscriber, #9762)

pcalloc matches the userspace 'calloc' and we already have a kcalloc()

Physmap-based attack

Posted Apr 7, 2018 10:04 UTC (Sat) by igor_stoppa (guest, #76128)

If an attacker can write to kernel data memory, in practice there is no way to prevent a system takeover. Protecting the physmap page would help, and it can be done on x86. However, the attacker can always revoke that protection from the page table. The same goes for any other page attribute, as long as there is no way to make it permanent (write once).