A multi-order radix tree

May 24, 2016

This article was contributed by Ross Zwisler

Radix trees have been a part of Linux for quite some time; an LWN article from a decade ago explained the structure and functionality of them. The radix tree is a general-purpose data structure that is used by many different components within the kernel. It provides an efficient way to create a key-value store, where the key is an unsigned long, referred to as the index, and the value is a void *. The radix tree also stores a few bits of additional information for each entry in the form of tags.

The most common use of the radix tree is to keep track of a collection of pages. In struct address_space, for example, there is a radix tree called page_tree that tracks the in-memory page-cache pages that are associated with a given inode. The key for page_tree is the page offset (pgoff_t) into the file. For normal files, page_tree will map that key to a void * value which is actually the struct page * for the page-cache page at that file offset. For page-cache pages, the radix-tree tags let us track entries that are dirty and which are marked for writeback.

For inodes that take advantage of the DAX ("direct access") feature, there is no page cache sitting between the user processes and the storage. Hence, for DAX inodes there is no need to keep track of struct page * entries via the page_tree. Instead, for DAX inodes, this same page_tree is used to hold DAX exceptional entries that track the state of the persistent-memory pages used by DAX. On x86_64, these exceptional entries are 64-bit values that store several pieces of information, such as the page size (more on that below), the sector offset within the persistent-memory storage, and some flags. From these exceptional entries, DAX knows which dirty pages need to be flushed from the processor cache when an fsync() is received from user space.

For radix-tree uses where there is a one-to-one mapping between keys and values, such as a page_tree that only tracks PAGE_SIZE page-cache entries and/or DAX entries, this all works perfectly. But what about cases where this one-to-one relationship breaks down?

One example of this breakdown is huge pages. On x86_64, regular pages are 4KiB in size. Linux x86_64 also supports "huge pages" that are 2MiB in size, and the Linux DAX code has explicit support for these 2MiB pages. For the page_tree radix tree, this means that the one-to-one relationship between keys and values may not be sufficient.

There is a desire for a 2MiB page to be tracked as a single entity. There would be a single pointer to the 2MiB worth of data and the tag state would be consistent, so that the kernel can reliably track whether the data is dirty.

Existing solutions

As of kernel 4.5, DAX successfully tracks the value and state of 2MiB pages through the use of DAX exceptional radix-tree entries that reserve a few bits to record whether the DAX entry represents a 4KiB page (RADIX_DAX_PTE) or a 2MiB page (RADIX_DAX_PMD). 2MiB page entries (referred to as "Page Middle Directory" or PMD entries) are always inserted at a 2MiB boundary, so DAX is able to support huge-page entries with the following logic:

    pgoff_t pmd_index = DAX_PMD_INDEX(index);

    entry = radix_tree_lookup(page_tree, pmd_index);

    if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
	    /* operate on the 2MiB 'entry' at 'pmd_index' */
    else
	    entry = radix_tree_lookup(page_tree, index);
	    /* operate on the 4KiB 'entry' at 'index' */

This has the obvious cost that for every radix-tree operation there has to be an extra lookup for the entry using pmd_index to first check whether the index is covered by a 2MiB page. This solution is correct, but is somewhat costly in the RADIX_DAX_PTE case where we have to do a radix_tree_lookup() at both pmd_index and index. Having special-case lookups at multiple offsets also does not make for the cleanest code. When 1GiB page support is added to the DAX code, this solution begins to look even worse, because there will have to be yet another special-case lookup.

Another possible alternative that would not involve an extra lookup would be to represent the 2MiB entry via 512 redundant entries, each at a unique index. This would have the property that any lookup for the indices in the 2MiB range would return a copy of the data pointer, but it has the downside that the tag tracking is no longer consistent among the 512 entries. This would mean that some of the 512 entries could be tagged as clean and some of them could be tagged as dirty, even though they all had the same data pointer as their value.

There would also be a need to be sure to replicate other operations, such as removal, among all 512 entries. This solution has the additional downside that representing a 2MiB page via 512 individual entries adds many extra nodes to the radix tree.

Multi-order radix-tree techniques

Both of the solutions mentioned thus far for dealing with huge pages have left the radix-tree API unchanged. Ideally, there would be a solution where the one-to-one mapping between keys and values in the radix tree can be broken. That would allow inserting an entry that covers multiple 4KiB-sized indices and have operations on indices in that range, such as lookup, tag modification, and removal, all act on the same radix-tree entry. Recently there have been several patch series (1, 2, and 3) posted to the Linux kernel mailing list that set out to accomplish this goal via a solution that is called "the multi-order radix tree".

The basic idea is to add the ability to insert entries that span 2^X 4KiB-sized page indices. X is referred to as the page's "order". Using this terminology, existing radix trees, in which every entry is associated with a single index, are composed entirely of order-0 entries. An order-2 entry would cover 2² = 4 adjacent indices. The 2MiB entries for the DAX huge-page example would be order-9 entries, and so on.

[sibling pointer] This new functionality is implemented in the multi-order radix tree using a pair of techniques: sibling entries and elevated entries. The smallest multi-order entry is an order-1 entry that covers two adjacent indices. This is implemented by inserting a special "sibling pointer" for the second index that points back to the actual radix-tree entry.

In this case, lookups, removals, and tag operations on both the base index and on the index for the sibling operate on the same actual entry in a way that is transparent to the user. For orders greater than 1, there can simply be multiple sibling entries that all point back to the actual radix-tree entry:

With a multi-level radix tree, there can be up to three different types of pointers. The first are internal pointers, which point from a parent radix-tree node to a child radix-tree node. The second are sibling pointers, which point from one entry in a given radix-tree node to another entry in that same node. The third are the void * data pointers that are stored as part of the key-value store.

The lowest bit of the radix-tree entry, RADIX_TREE_INTERNAL_NODE, is used to distinguish between the void * data pointers and the two types of pointers internal to the radix-tree implementation. Sibling pointers and internal node-to-node pointers are distinguished by looking at the value of the pointer itself. If the pointer points within the same node's slots array, it is a sibling pointer. If not, it points to a child radix-tree node.

If the fan-out of the radix tree happens to match the order of the multi-order entry, it can be represented using an elevated entry that simply lives as a child of one of the intermediate nodes in the tree:

Elevated multi-order entries can be the children of intermediate nodes at any level in the tree. Combining these two techniques allows us to have elevated multi-order entries with sibling pointers:

Having both sibling entries and elevated entries allows the radix tree to support multi-order entries of any order.

Radix-tree API changes

To use this new functionality, the multi-order radix tree has a few small API changes where an order parameter was needed.

    int __radix_tree_create(struct radix_tree_root *root, unsigned long index,
			    unsigned order, struct radix_tree_node **nodep,
			    void ***slotp);

    int __radix_tree_insert(struct radix_tree_root *, unsigned long index,
			    unsigned order, void *);

__radix_tree_create() is only used in one place in mm/filemap.c, and __radix_tree_insert() is a new API added by the multi-order patches. radix_tree_insert(), the old insertion API that is used by all existing code, is now defined to be:

    static inline int radix_tree_insert(struct radix_tree_root *root,
			    unsigned long index, void *entry)
    {
	    return __radix_tree_insert(root, index, 0, entry);
    }

The API for operations such as node lookup, deletion, tag manipulation, etc. remain unchanged. This has allowed the multi-order radix tree to be implemented with very little change to existing radix-tree users.

Integration with DAX 2MiB support

I recently posted a patch that integrates the DAX code with the new multi-order radix-tree code. As can be seen from that patch, the changes needed to move from the old method for supporting 2MiB pages to the new multi-order radix-tree support are quite small.

We now insert an order-9 entry when we need to track the status of a 2MiB huge page. This is done as follows:

    error = __radix_tree_insert(page_tree, index,
                    RADIX_DAX_ORDER(pmd_entry),
                    RADIX_DAX_ENTRY(sector, pmd_entry));

RADIX_DAX_ORDER() gives us an order of 0 for 4KiB pages and an order of 9 for 2MiB pages.

For the rest of the radix-tree operations, like lookup and tag manipulation, there is no longer a need to first check for a 2MiB PMD entry as a special case. It just operates on the radix tree using the index and the radix tree will do the right thing if that index happens to map to a multi-order entry.

One last thing worth noting is that the multi-order radix tree API currently does not define a way for the user to query the order of a given entry. It is not immediately obvious whether this API is actually needed. The DAX code can infer the order of a given entry by looking at its type: RADIX_DAX_PTE or RADIX_DAX_PMD. When multi-order struct page entries start being inserted, their size can most likely be understood by looking at the page flags. However, if an API to query an entry's order proves useful, it could easily be added.

In conclusion

The multi-order radix tree patches have been present in both Andrew Morton's -mm tree as well as Stephen Rothwell's linux-next tree for several weeks; they were pushed upstream during the 4.7 merge window. The integration between DAX PMD entries and the new multi-order radix-tree code will be merged in 4.8 or later, and will need to take into account the recent DAX page-fault-locking patch series from Jan Kara. The combination of the multi-order radix-tree patches and the locking changes will allow DAX to have locks on a per-page basis, regardless of the size of the page.

DAX will most likely be the first user of the new multi-order capability of the radix tree, but these changes should be interesting to anyone who deals with multiple page sizes. The transparent-huge-page code could probably make use of this new functionality, and it is likely that other users will spring up over time.

Index entries for this article
Kernel	Radix tree
GuestArticles	Zwisler, Ross

A multi-order radix tree

Posted May 26, 2016 15:38 UTC (Thu) by willy (subscriber, #9762) [Link] (1 responses)

Hi Ross,

Good write-up; thanks for doing this. I'll be presenting at LinuxCon North America on the Radix Tree (not exclusively about the multi-order support, although that will be part of the talk).

It's interesting that you see the radix tree as implementing a key-value store. I think of it as being a resizable array of pointers (which merely happens to be implemented as a tree).

For DAX, you're probably going to need to split 2MB entries into 4kB entries (eg on truncate). You might also want to merge 4kB entries into 2MB entries on expansion, but that's not a correctness issue. I have patches for the radix tree to support those two functions here: http://git.infradead.org/users/willy/linux-dax.git/shortl... (I have a few minor improvements to that which I haven't published yet)

I hope you're talking to Kirill & Hugh about their plans for file-backed THP.

We have talked about other users of the radix tree using the multiorder support. In general, anyone who's using multiple contiguous indices to return the same result might benefit. Although they might also benefit from using a different kind of tree entirely ;-) [eg the RB tree used for mapping virtual addresses to struct vmas].

A multi-order radix tree

Posted May 27, 2016 20:35 UTC (Fri) by rzwisler (subscriber, #90544) [Link]

Hey Matthew,

Thanks! Have fun at LinuxCon North America - I'm sorry to miss your presentation.

I'm not sure if I always though of the radix tree as a key-value store, but after the multi-order changes I definitely do. I see your alternate view, though.

Yep, I agree on the need for 'split' and 'join' operations for multi-order entries. I haven't yet had time to review your changes - are you planning on sending them to the list?

Thanks for the comments, and hope you're well.

- Ross