A kernel skiplist implementation (Part 1)

May 30, 2013

This article was contributed by Andrew Shewmaker

The Linux kernel manages sparse sets of index ranges for various subsystems. For instance, the I/O memory management unit (IOMMU) keeps track of the addresses of outstanding device memory mappings for each PCIE device's domain and tracks holes for new allocations. File systems cache pending operations, extent state, free extents, and more. A simple linked list fails to provide good performance for search, insertion, and deletion operations as it grows, so more advanced abstract data types like red-black trees (rbtrees) were implemented. The kernel's rbtrees generally provide good performance as long as they are only being accessed by one reader/writer, but they suffer when multiple readers and writers are contending for locks.

There are variants of rbtrees that allow efficient read-copy-update (RCU) based reads, but not fine-grained concurrent writes. So, Chris Mason of Btrfs fame is developing a skiplist implementation that will allow file systems like Btrfs and XFS to perform many concurrent updates to their extent indexes. This article will describe the basic skiplist and what makes this skiplist variant cache-friendly. In part two, I'll describe the skiplist API and compare the performance of skiplists to rbtrees.

Basic skiplist

William Pugh first described skiplists in 1990 as a probabilistic alternative to balanced trees (such as rbtrees) that is simpler, faster, and more space-efficient. A skiplist is composed of a hierarchy of ordered linked lists, where each higher level contains a sparser subset of the list below it. The size of the skiplist decreases at a constant rate for each higher level. For instance, if level zero has 32 elements and a probability p=0.5 that an element appears in the next higher level, then level one has 16 elements, level two has eight, etc. The subset selected for the next higher level is random, but the quantity of items in the subset isn't random.

By way of analogy, consider an old-fashioned printed dictionary with multiple search aids built into it. This diagram shows how the different aids are similar to levels of a skiplist:

At the highest level there are grooves on the edge of the book that mark sections for each letter, A-Z. The search can continue by looking at word prefixes centered at the top of each page. Next, guide words at the top left and right of each page show the span of words for each page. At the lowest level, a page is scanned word by word. A skiplist replicates something like this index structure in a digital form.

History of kernel skiplists

Josh MacDonald investigated cache-friendly skiplists on Linux 2.4 more than a decade ago. His skiplists are more space-efficient than rbtrees with more than 100 keys, searches are faster after 1,000 keys, and deletions and insertions at 10,000 keys. He also performed various concurrency experiments. When Josh posted an RFC to the kernel mailing list he presented them as a "solution waiting for the right problem to come along". At the time, no discussion ensued, but Chris later became aware of Josh's work during his research. Unfortunately, Josh's implementation focused on reader/writer locks, which only allow one writer at a time, and Chris told me that he wanted to use RCU. Also, Josh's coding style differs from the kernel's and his macros make his code a little difficult to work with.

Other well-known Linux kernel developers have experimented with skiplists as well. Con Kolivas initially thought they would be useful for his BFS CPU scheduler. Ultimately, BFS experienced worse behavior in general when using skiplists, so Con returned to his previous list structure. This was due in part to the BFS scheduler's common case of having few tasks to schedule, so it didn't benefit from theoretically faster lookups for larger numbers of items in the list. Also, the scheduler process wasn't parallel itself, so it didn't benefit from a skiplist's more easily exploited concurrency.

Andi Kleen recalled the discouraging results from his own skiplist investigation to Con. He found that the variable number of pointers needed to link different levels of a skiplist made it difficult to achieve efficient use of memory and cache without limiting the size of the skiplist.

In 2011, Chris asked Liu Bo to try replacing Btrfs's rbtree-based extent_map code, which maps logical file offsets to physical disk offsets, with a skiplist-based implementation. The mappings are read-mostly, until a random write workload triggers a copy-on-write operation. Liu created an initial skiplist patch for Btrfs beginning with Con's implementation and adding support for concurrency with RCU. Results were mixed, and Chris's user-space experimentation led him to start work to make Liu's skiplists more cache-friendly.

You may be saying to yourself, "I thought the whole point of Btrfs was to use B-trees for everything." Btrfs does use its copy-on-write B-trees for any structure read from or written to the disk. However, it also uses rbtrees for in-memory caches. Some rbtrees batch pending inode and extent operations. Other caches are: the extent state cache — tracking whether extents are locked, damaged, dirty, etc.; the free space cache — remembering free extents for quicker allocations; and the previously mentioned extent_map. In addition to the rbtrees, a radix tree manages the extent buffer cache, which is used like the page cache, but for blocks of metadata that might be larger than a page.

All of these data structures have multiple threads contending for access to them and might benefit from skiplists, though the delayed operation trees and free space cache have the most contention. However, Chris's real target for this skiplist is some pending RAID5/6 parity logging code. It needs to enable "fast concurrent reads and writes into an exception store with new locations for some extents." Ideally, Chris hopes to make his skiplist general purpose enough for others to use. If skiplists can provide lockless lookups and be used for both buffer cache indexes and extent trees, then Dave Chinner would consider using them in XFS.

Cache-friendly skiplist

When reading the following description of Chris Mason's skiplist implementation, keep in mind that it is a skiplist for range indexes, or extents. The extents being managed are not identified by a single number. Each has an index/key pointing to its beginning and a range/size. Furthermore, each element of the skiplist is composed of multiple extents, which will be referred to as slots.

This new implementation of a cache-friendly skiplist is a bit more complicated than the picture of the basic skiplist may suggest; it is best examined in pieces. The first of those pieces is described by this diagram (a subset of the full data structure diagram):

This figure shows a skiplist anchored by an sl_list structure pointing to the initial entry (represented by struct sl_node) in the list. That sl_node structure has an array of pointers (called ptrs), indexed by level, to the head sl_node_ptr structures of each skiplist level. The sl_node_ptr structure functions like a typical kernel list_head structure, except that each level's head sl_node_ptr->prev is used, possibly confusingly, to point to the item with the greatest key at that level of the skiplist. All locking is done at the sl_node level and will be described in part two of this article.

The skiplist grows from the head sl_node_ptr array on the right of the diagram above into an array of linked lists as shown below:

Note that none of the structures listed so far contain any keys or data, unlike traditional skiplists. That's because Chris's skiplist items are associated with more than one key. Another difference is that previous skiplist implementations lack prev pointers. Pugh's original skiplist didn't need something like prev pointers because it didn't support concurrency. MacDonald's skiplist does support concurrency, but its protocol uses context structures that combine parent/child pointer information with lock states. Mason's skiplists use a different concurrency protocol that uses prev pointers.

A superficial difference between this skiplist and others is its apparent lack of down pointers (allowing movement from one level of index to the next), although they exist as the ptrs[] array in sl_node. While the pointer array name is unimportant, its position is, because different level nodes are created with different sizes of arrays. See the ptrs array at the end of sl_node and sl_node at the end of struct sl_leaf (which holds the actual leaf data):

The variable size of nodes could cause memory fragmentation if they were allocated from the same pool of memory, so Chris designed his skiplist with one slab cache for each level. This effectively addresses Andi Kleen's concerns mentioned earlier regarding wasting memory while keeping good performance.

The way in which keys are handled is cache-friendly in a couple of ways. First, keys are not associated directly with individual items in this skiplist, but each leaf manages a set of keys. SKIP_KEYS_PER_NODE is currently set to 32 as a balance between increasing cache hits and providing more chances for concurrency. Second, an sl_leaf contains an array of keys. Technically a keys array is unnecessary, since the keys are part of the slots linked to a leaf. However, a search only needs the keys and not the rest of the sl_slot structures until the end of the search. So, the array helps skiplist operations avoid thrashing the cache.

Each sl_slot is an extent defined by its key and size (index range). A developer can adapt the skiplist for their own use by embedding sl_slot into their own data structure. The job of reference counting belongs to the container of sl_slot.

See the full diagram for a picture of how all the pieces described above fit together. With 32 keys per item, a maximum level of 32, and half the number of items in each higher level, Chris's skiplist should be able to efficiently manage around 137 billion extents (index ranges).

The statistically balanced nature of the skiplist allows it to handle sparse sets of indexes in such a way that modifications avoid the need for an rbtree's rebalancing operations, and so they lend themselves to simpler concurrency protocols. Chris also designed his skiplist to be cache-friendly by having each element represent a set of keys/slots, keeping duplicates of the slot's keys in an array, and creating a slab cache for each level of the skiplist. Part two of this series will offer a description of the skliplist API and a discussion of the performance of skiplists vs. rbtrees.

Index entries for this article
Kernel	Skiplists
GuestArticles	Shewmaker, Andrew