Faster page faulting through prezeroing

[Posted January 5, 2005 by corbet]

In early December, this page covered Christoph Lameter's efforts to speed up the page fault mechanism by reducing lock contention. That work speeds things up significantly on multiprocessor systems, but is of little help to uniprocessor users. That is not true of Christoph's other page fault work, which can benefit users on all systems.

Christoph notes that, once the locking issues are taken care of, the most expensive part of the page fault handler is the code which zeroes anonymous pages before handing them to the faulting process. He has concluded that, in some situations, performance can be significantly improved by clearing those pages ahead of time and having them ready when the page fault happens. Just zeroing pages ahead of time is not particularly helpful - it is mostly an exercise in moving work around to different places in the system. But, if (1) the zeroing of pages can be made more efficient, and (2) the workload is of the right type, things can be made quite a bit faster.

What is the right kind of workload? For the purposes of this patch set, the best workload is one which allocates whole pages, but then only touches parts of them. If those pages are already cleared, there is no need to load an entire page into the processor cache when it is faulted in. The improved cache behavior, along with the speedup in fault handling itself, can yield big improvements. Some figures posted by Christoph show an almost 4x improvement in the page fault rate in the right conditions. As it turns out, many applications fit this profile, so "the right conditions" should not be all that rare.
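As a rough illustration of that kind of workload, here is a minimal userspace sketch; it is not taken from Christoph's benchmarks. It maps anonymous memory and writes a single byte in each page, so every first touch takes a page fault and the kernel must hand over a zeroed page. The name toy_sparse_touch() and the 4096-byte TOY_PAGE_SIZE are assumptions made for the example; the real page size varies by architecture.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

#define TOY_PAGE_SIZE 4096u   /* assumed page size, for illustration only */

/*
 * Map 'npages' pages of anonymous memory and write one byte in each.
 * Returns the number of pages whose last byte was observed to be zero
 * after the first touch, or -1 if the mapping could not be created.
 */
long toy_sparse_touch(size_t npages)
{
    size_t len = npages * TOY_PAGE_SIZE;
    unsigned char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    long clean = 0;

    if (mem == MAP_FAILED)
        return -1;

    for (size_t i = 0; i < npages; i++) {
        unsigned char *page = mem + i * TOY_PAGE_SIZE;

        page[0] = 1;                        /* first touch: page fault */
        if (page[TOY_PAGE_SIZE - 1] == 0)   /* the rest arrives zeroed */
            clean++;
    }

    munmap(mem, len);
    return clean;
}
```

Calling toy_sparse_touch(64) faults in up to 64 pages while dirtying only one byte in each, which is the access pattern that stands to gain the most from pages cleared ahead of time.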

There are four parts to the prezeroing patch set. The first patch extends the page allocation mechanism to make it explicitly handle requests for zeroed memory. There is a new __GFP_ZERO allocation flag which tells alloc_pages() (and thus functions like __get_free_page() and kmalloc()) to return zeroed memory. Many places in the kernel which clear their own pages have been fixed to request zeroed memory instead. With only this patch applied, the kernel's code is cleaned up a bit, but no performance improvements result - the __GFP_ZERO flag just causes a call to clear_page() in the page allocator.
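The effect of the flag can be modeled in a few lines of userspace C. This is only a sketch of the idea, not the kernel's allocator: toy_alloc_page(), toy_page_is_zero(), and TOY_GFP_ZERO are hypothetical names standing in for the allocator, a test helper, and __GFP_ZERO, and malloc() stands in for the real page allocator.

```c
#include <stdlib.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u
#define TOY_GFP_ZERO  0x1u    /* hypothetical stand-in for __GFP_ZERO */

/* Allocate one page-sized buffer; zero it only if the caller asks. */
void *toy_alloc_page(unsigned int flags)
{
    unsigned char *page = malloc(TOY_PAGE_SIZE);

    if (page == NULL)
        return NULL;

    /* Poison the page so that stale contents are easy to spot. */
    memset(page, 0xAA, TOY_PAGE_SIZE);

    if (flags & TOY_GFP_ZERO)
        memset(page, 0, TOY_PAGE_SIZE);   /* the clear_page() step */

    return page;
}

/* Return 1 if every byte of the page is zero, 0 otherwise. */
int toy_page_is_zero(const void *page)
{
    const unsigned char *p = page;

    for (size_t i = 0; i < TOY_PAGE_SIZE; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

A caller that used to allocate a page and then clear it by hand would instead pass TOY_GFP_ZERO and drop its own clearing code; the work done is the same, it has simply moved into the allocator.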

The second patch changes the prototype of the clear_page() function to:

    void clear_page(void *page, int order);

With the new interface, clear_page() can zero higher-order pages. This change is an important part of the patch set: pages are most efficiently zeroed if they can be done in larger groups. Often, the setup cost is a big part of the total; the value of prezeroing pages is much reduced if it can only be done one page at a time.
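In generic terms, an order-aware clear function only needs to know that an allocation of a given order spans 2^order contiguous pages. The following is a userspace sketch under that assumption; toy_clear_page() and toy_order_bytes() are hypothetical names, and real architectures implement clear_page() with their own optimized code rather than memset().

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u

/* Number of bytes covered by an allocation of the given order. */
size_t toy_order_bytes(int order)
{
    return (size_t)TOY_PAGE_SIZE << order;
}

/* Zero 2^order contiguous pages starting at 'page' in a single call. */
void toy_clear_page(void *page, int order)
{
    memset(page, 0, toy_order_bytes(order));
}
```

With order 0 this clears a single page, as before; with order 2 it clears four pages at once, paying any setup cost only one time.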

The kscrubd patch is where things start to get interesting. This patch expands the zone structure so that it can keep track of pages which are known to be clear. Requests for zeroed pages are satisfied from this list when possible. A new kernel thread (actually, a set of per-node threads) wakes up occasionally and clears pages for future allocation. This thread does not normally scrub zero-order (single) pages, but can be configured to do so (via /proc) if desired.
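The bookkeeping described here can be sketched as two lists per zone: one for free pages with unknown contents, and one for pages known to be clear. The code below is a simplified userspace model, not the patch itself; struct toy_zone, toy_scrub(), toy_alloc_zeroed(), and the other names are hypothetical, and the real patch works with the kernel's own free-area structures and per-node threads.

```c
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u

struct toy_page {
    unsigned char data[TOY_PAGE_SIZE];
    struct toy_page *next;
};

struct toy_zone {
    struct toy_page *free_list;     /* free pages, contents unknown */
    struct toy_page *zeroed_list;   /* free pages known to be clear */
    unsigned long prezeroed_hits;   /* allocations served pre-zeroed */
};

struct toy_page *toy_pop(struct toy_page **list)
{
    struct toy_page *p = *list;

    if (p != NULL)
        *list = p->next;
    return p;
}

void toy_push(struct toy_page **list, struct toy_page *p)
{
    p->next = *list;
    *list = p;
}

/* A freed page may hold stale data, so it goes on the ordinary list. */
void toy_free_page(struct toy_zone *z, struct toy_page *p)
{
    toy_push(&z->free_list, p);
}

/* One pass of the scrubber: clear up to 'budget' free pages. */
int toy_scrub(struct toy_zone *z, int budget)
{
    int done = 0;

    while (done < budget) {
        struct toy_page *p = toy_pop(&z->free_list);

        if (p == NULL)
            break;
        memset(p->data, 0, TOY_PAGE_SIZE);
        toy_push(&z->zeroed_list, p);
        done++;
    }
    return done;
}

/* Serve a zeroed page from the clear list if possible, else clear now. */
struct toy_page *toy_alloc_zeroed(struct toy_zone *z)
{
    struct toy_page *p = toy_pop(&z->zeroed_list);

    if (p != NULL) {
        z->prezeroed_hits++;
        return p;
    }
    p = toy_pop(&z->free_list);
    if (p != NULL)
        memset(p->data, 0, TOY_PAGE_SIZE);
    return p;
}
```

The fast path in toy_alloc_zeroed() does no clearing at all; the cost has been paid earlier by toy_scrub(), which in the real patch runs in a kernel thread rather than in the faulting process.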

The kscrubd patch also implements a linked list of "zero drivers," being functions which can be called upon to zero pages efficiently. There are no such drivers in this patch, so all pages are zeroed with a call to clear_page(), which, as a comment in the code notes, can be hard on the processor's cache. It would be nicer if pages could be zeroed without the cache impacts. The fourth patch shows how this can be done - at least, on Altix systems. It adds a driver for the Altix block transfer engine which can zero memory directly without the processor's involvement - at least, when relatively large chunks of memory are involved. Drivers for other hardware have not yet been posted, but it would not be surprising to see them begin to appear after the prezeroing code has been merged.
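One plausible shape for such a driver list is a chain of function pointers that is tried in order, with a plain memory clear as the fallback. The sketch below is a userspace model under that assumption and does not reproduce the interface in the patch; struct toy_zero_driver, toy_register_zero_driver(), toy_zero_region(), and the toy_engine_zero() example driver are all hypothetical.

```c
#include <stddef.h>
#include <string.h>

struct toy_zero_driver {
    /* Return 0 on success, nonzero if the driver declines the request. */
    int (*zero)(void *addr, size_t len);
    struct toy_zero_driver *next;
};

static struct toy_zero_driver *toy_drivers;   /* head of the driver list */

void toy_register_zero_driver(struct toy_zero_driver *drv)
{
    drv->next = toy_drivers;
    toy_drivers = drv;
}

/*
 * Zero a region. Returns 1 if a registered driver did the work, or 0
 * if no driver accepted it and the processor cleared the memory itself.
 */
int toy_zero_region(void *addr, size_t len)
{
    for (struct toy_zero_driver *d = toy_drivers; d != NULL; d = d->next)
        if (d->zero(addr, len) == 0)
            return 1;

    memset(addr, 0, len);   /* fallback, like calling clear_page() */
    return 0;
}

/* Example driver that only accepts large requests, mimicking hardware
 * whose setup cost makes small transfers not worthwhile. */
#define TOY_ENGINE_MIN_LEN 8192u

int toy_engine_zero(void *addr, size_t len)
{
    if (len < TOY_ENGINE_MIN_LEN)
        return -1;
    memset(addr, 0, len);   /* stands in for an offloaded transfer */
    return 0;
}
```

With no drivers registered, every request falls through to the processor, which matches the behavior of the kscrubd patch on its own; registering a driver changes only which code clears the larger regions.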

And that could happen before long: Linus, having been convinced by Christoph's results, has requested that this set of patches be merged soon. So prezeroing could even find its way into the kernel prior to the 2.6.11 release. (Update: the __GFP_ZERO patch was merged just as LWN was being published.)

Index entries for this article
Kernel	Memory management

Faster page faulting through prezeroing

Posted Jan 13, 2005 7:39 UTC (Thu) by huaz (guest, #10168)

[quote]If those pages are already cleared, there is no need to load an entire page into the processor cache when it is faulted in.[/quote]

I don't get it. Then what happens when kscrubd wakes up and clears the pages? Yup, it brings the memory into the cache, and that memory might get evicted before someone needs it.

I am not convinced this is a useful feature. It looks more like something that only works for one particular (possibly artificially designed) benchmark.

Faster page faulting through prezeroing

Posted Jan 13, 2005 11:26 UTC (Thu) by etienne (guest, #25256)

<quote>Then what happens when kscrubd wakes up and clears the pages? Yup, it brings the memory into cache and might get evicted before someone needs it</quote>

Maybe it is not the processor's job to clear the page. For instance, kscrubd could do an IDE DMA read of a pre-zeroed area on the disk. Then the processor cache is not touched (lines could be marked dirty, but...). That pre-zeroed area could be some reserved blocks at the end of the swap partition, or a contiguous file.

Another clean solution is available on non-ia32 processors: a write-and-invalidate instruction. When the first byte of a cache line is written (to zero), the complete cache line is not first read from memory.
IMHO, even when the repeat counter (in register %ecx) is bigger than the cache line, the assembly instruction "rep stosl" still does not produce a write-and-invalidate transaction to external memory on ia32.

Etienne.

Faster page faulting through prezeroing

Posted Jan 13, 2005 22:49 UTC (Thu) by zhjy (guest, #27228)

I didn't read the code. My guess is that kscrubd can zero a lot of pages at once, which saves some unnecessary cache eviction.

Faster page faulting through prezeroing

Posted Jan 14, 2005 14:40 UTC (Fri) by zhjy (guest, #27228)

Another small thing: a context switch between processes tends to refill the cache lines anyway, so kscrubd will not add much cache pollution. Page fault handling, however, is a synchronous operation, and afterward you are still in the same context. In that case, cache pollution is bad.

Cache thrashing

Posted Jan 14, 2005 5:06 UTC (Fri) by goaty (guest, #17783)

I think the idea is not so much to prevent cache thrashing, which is after all inevitable, but to make it happen less often. If kscrubd pre-zeroes a big sack of pages, then that's more efficient than zeroing them one at a time. And of course if the hardware can be persuaded to zero chunks of RAM without touching the processor cache, then you've got a huge win.

In a couple of years it might even be possible to buy a PCI Express "/dev/null" card to accelerate your server.

Cache thrashing

Posted Feb 12, 2008 0:31 UTC (Tue) by goaty (guest, #17783)

2+ years ago, I wrote: In a couple of years it might even be possible to buy a PCI Express "/dev/null" card to accelerate your server.

Unfortunately, this did not happen. As someone pointed out in another thread, it's possible to persuade various DMA-capable hardware to act as a /dev/null device. For example, you can stick a page full of zeroes on the swap device and then get the IDE controller to DMA it to wherever it's needed. Provided the drive's cache is larger than the page size, the performance should be acceptable.

The problem being that most of the devices on the system are already busy with their intended function, like reading and writing files, and cannot expend time in the frivolous pursuit of nullage.