Some snags for SLUB
The first problem had to do with performance regressions in a few specific situations. It turns out that the hackbench benchmark, which measures scheduler performance, runs slower when the SLUB allocator is being used. In fact, SLUB can cut performance for that benchmark in half, which is enough to raise plenty of eyebrows. This result was widely reproduced; there were also reports of regressions with the proprietary TPC-C benchmark which were not easily reproduced. In both cases, SLUB developer Christoph Lameter was seen as being overly slow in getting a fix out; after all, it is normal to get immediate turnaround on benchmark regressions over the end-of-year holiday period.
When Christoph got back to this problem, he posted a lengthy analysis which asserted that the real scope of the problem was quite small. He concluded: "given all the boundaries for the contention I would think that it is not worth addressing." This was not the answer Linus was looking for.
About this time, the solution to this problem came along in response to a note from Pekka Enberg pointing out that, according to the profiles, an internal SLUB function called add_partial() was accounting for much of the time used. The SLUB allocator works by dividing pages into objects of the same size, with no metadata of its own within those pages. When all objects from a page have been allocated, SLUB forgets about the page altogether. But when one of those objects is freed, SLUB must note the page as a "partial" page and add it to its queue of available memory. This addition of partial pages, it seems, was happening far more often than it should.
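As a rough illustration (and not the actual SLUB code), the bookkeeping described above looks something like the following sketch; the structure and function names (toy_slab_page, toy_cache_node, and so on) are invented, and only the list handling mirrors what the kernel does:

    #include <linux/list.h>

    struct toy_slab_page {
        struct list_head lru;       /* links the page into the partial list */
        unsigned int inuse;         /* objects currently allocated from this page */
        unsigned int objects;       /* total objects the page can hold */
    };

    struct toy_cache_node {
        struct list_head partial;   /* pages with at least one free object */
    };

    /* Freeing an object on a full page turns it back into a partial page. */
    static void toy_free_object(struct toy_cache_node *n, struct toy_slab_page *page)
    {
        if (page->inuse == page->objects)
            list_add(&page->lru, &n->partial);  /* added at the head: the problematic case described below */
        page->inuse--;
    }

    /* Allocating the last free object fills the page; SLUB then forgets about it. */
    static void toy_alloc_object(struct toy_slab_page *page)
    {
        page->inuse++;
        if (page->inuse == page->objects)
            list_del(&page->lru);
    }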
The hackbench tool works by passing data quickly between CPUs and measuring how the scheduler responds. In the process, it forces a lot of quick allocation and free operations and that, in turn, was causing the creation of a lot of partial pages. The specific problem was that, when a partial page was created, it was added to the head of the list, meaning that the next allocation operation would allocate the single object available on that page and cause the partial page to become full again. So SLUB would forget about it. When the next free happened, the cycle would happen all over again.
Once Christoph figured this out, the fix was a simple one-liner: partial pages should be added to the tail of the list instead of the head. That would give the page time to accumulate more free objects before it was once again the source for allocations and minimize the number of additions and removals of partial pages. The results came back quickly: the hackbench regression was fixed. There have been no TPC-C results posted (the license for this benchmark suite is not friendly toward the posting of results), but it is expected that the TPC-C regression should be fixed as well.
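In terms of the toy sketch above (the real change was made in SLUB's own partial-list handling, around add_partial(), so the names differ), the one-liner amounts to swapping the head insertion for a tail insertion:

    /*
     * The shape of the fix, applied to the toy sketch: newly-partial pages
     * go to the tail of the list rather than the head, so they have time to
     * accumulate freed objects before being used for allocations again.
     */
    static void toy_free_object_fixed(struct toy_cache_node *n, struct toy_slab_page *page)
    {
        if (page->inuse == page->objects)
            list_add_tail(&page->lru, &n->partial);  /* was list_add() */
        page->inuse--;
    }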
Meanwhile, another, somewhat belated complaint about SLUB made the rounds: there is no equivalent to /proc/slabinfo for the SLUB allocator. The slabinfo file can be a highly effective tool for figuring out where kernel-allocated memory is going; it is a quick and effective view of current allocation patterns. The associated slabtop tool makes the information even more accessible. The failure of slabtop to work when SLUB is used has been an irritant for some developers for a while; it seems likely that more people will complain when SLUB finds its way into the stock distributor kernels. Linux users are generally asking for more information about how the kernel is working; removing a useful source of that information is unlikely to make them happy.
Some developers went as far as to say that the slabinfo file is part of the user-space ABI and, thus, must be preserved indefinitely. It is hard to say how such an interface could truly be cast in stone, though; it is a fairly direct view into kernel internals which will change quickly over time. So the ABI argument probably will not get too far, but the need for the ability to query kernel memory allocation patterns remains.
There are two solutions to this problem in the works. The first is Pekka Enberg's slabinfo replacement patch for SLUB, which provides enough information to make slabtop work. But the real source for this information in the future will be the rather impressive set of files found in /sys/slab. Digging through that directory by hand is not really recommended, especially given that there's a better way: the slabinfo.c file found in the kernel source (under Documentation/vm) can be compiled into a tool which provides concise and useful information about current slab usage. Eventually distributors will start shipping this tool (it should probably find a home in the util-linux collection); for now, building it from the kernel source is the way to go.
The final remaining problem here has taken a familiar form: the dreaded message from Al Viro on how the lifecycle rules for the files in /sys/slab are all wrong. It turns out that even a developer like Christoph, who can hack core memory management code and make 4096-processor systems hum, has a hard time with sysfs. As does just about everybody else who works with that code. There are patches around to rationalize sysfs; maybe they will help to avoid problems in the future. SLUB will need a quicker fix, but, if that's the final remaining problem for this code, it would seem that One True Allocator status is almost within reach.
Some snags for SLUB
Posted Jan 4, 2008 0:11 UTC (Fri) by viro (subscriber, #7872)
Actually, the fundamental problem is with the user interface chosen by SLUB; crappy sysfs uses are at least relatively easy to fix, but there's a real design problem that would be much harder to deal with:
1) kmem_cache_create() is given a name and returns a pointer to struct kmem_cache. If SLUB decides that it's mergeable with an already existing cache, it will return a pointer to that already existing kmem_cache - the same value it had returned to an earlier caller.
2) SLUB's kmem_cache_create() creates a symlink with the name we'd given it. The symlink points to a directory with the actual contents related to the created (or preexisting) kmem_cache; the name of the directory itself is opaque.
3) kmem_cache_destroy() gets a pointer to kmem_cache. If that cache happens to be shared, kmem_cache_destroy() has no way to tell which of the aliases the caller had in mind - the argument would be exactly the same for either of them. Therefore, it is unable to tell which name we want to remove. Note that users of kmem_cache_create() have no way to tell if their cache will end up being shared - it's outside their control or knowledge.
4) As a result, garbage symlinks stay around indefinitely *and* unpredictably: if you do ls on that sysfs directory, you might or might not see entries bearing names from long-gone modules. Whether you see them or not (with an identical history of modprobe/rmmod/etc.) would depend on such things as object sizes in completely unrelated modules with the given kernel config. With the current code (broken as hell) they simply stick around forever; with the solution proposed by Christoph they'll disappear when all aliases are gone (and no, that's not a good solution, for a lot of reasons).
All implementation issues aside, it's atrocious as a user interface. "These are just symlinks" is one hell of a lame excuse for leaving junk around in a user-visible place.
IF we really want these aliases (which is not at all obvious - the arguments for them are IMO bloody weak), we need to at least change kmem_cache_destroy() and pass it the name as an additional argument, with the corresponding change in all drivers that create caches and all the fun that implies. Not a good idea at -rc6 time... Disabling aliases will not solve all the problems (we still need to deal with memory corruptors coming from the lifetime clusterfuck), but at least the rest is relatively easy to deal with.
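(To make the aliasing problem concrete: a hedged sketch, not part of the comment above, of what a driver-side sequence might look like. The cache names and object size are invented, and whether SLUB actually merges any two given caches depends on their size, alignment, and flags.)

    #include <linux/module.h>
    #include <linux/slab.h>

    static struct kmem_cache *cache_a, *cache_b;

    static int __init alias_demo_init(void)
    {
        /* Two caches with identical size, alignment, and flags... */
        cache_a = kmem_cache_create("demo_cache_a", 192, 0, 0, NULL);
        cache_b = kmem_cache_create("demo_cache_b", 192, 0, 0, NULL);

        /*
         * ...may come back as the very same struct kmem_cache pointer if
         * SLUB merges them; in /sys/slab, "demo_cache_a" and "demo_cache_b"
         * are then just symlinks to one opaque directory.
         */
        return (cache_a && cache_b) ? 0 : -ENOMEM;
    }

    static void __exit alias_demo_exit(void)
    {
        /*
         * These calls see only the (possibly shared) pointer; they cannot
         * say whether "demo_cache_a" or "demo_cache_b" is the name being
         * retired, which is why the sysfs symlinks can linger.
         */
        kmem_cache_destroy(cache_b);
        kmem_cache_destroy(cache_a);
    }

    module_init(alias_demo_init);
    module_exit(alias_demo_exit);
    MODULE_LICENSE("GPL");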
Some snags for SLUB
Posted Jan 20, 2008 6:52 UTC (Sun) by xiaosuo (guest, #41366)
> partial pages should be added to the tail of the list instead of the head.
Is this really about efficiency? I think adding the partial pages to the head is better, because it gives the other partial pages more time to become free pages, and it decreases the number of partial pages. Anyhow, if the add/delete operations are too expensive, losing a little space efficiency is acceptable, as memory is cheaper than CPU. On the other hand, adding to the head will be more cache friendly.
I wonder whether this test case exists in the real world.
:(