Triggering huge-page collapse from user space

By Jonathan Corbet
March 14, 2022

When the kernel first gained support for huge pages, most of the work was left to user space. System administrators had to set aside memory in the special hugetlbfs filesystem for huge pages, and programs had to explicitly map memory from there. Over time, the transparent huge pages mechanism automated the task of using huge pages. That mechanism is not perfect, though, and some users feel that they have better knowledge of when huge-page use makes sense for a given process. Thus, huge pages are now coming full circle, with this patch set from Zach O'Keefe returning control over them to user space.

Huge pages, of course, are the result of larger page sizes implemented by the CPU; the specific page sizes available depend on the processor model and its page-table layout. An x86 processor will normally, for example, support a "base" page size of 4KB, and huge pages of 2MB and 1GB. Huge pages dispense with the bottom layer (or layers) of the page-table hierarchy, speeding the address-translation process slightly. The biggest performance advantage that comes from huge pages, though, results from the reduced pressure on the processor's scarce translation lookaside buffer (TLB) slots. One 2MB huge page takes one TLB slot; when that memory is accessed as base pages, instead, 512 slots are needed. For some types of applications the speedup can be significant, so there is value in using huge pages when possible.
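
The slot arithmetic above is easy to check directly. The following sketch assumes the x86 page sizes mentioned here and uses a hypothetical helper, tlb_entries_needed(), to compute how many TLB entries a fully accessed region would occupy at a given page size:

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SIZE  (UINT64_C(4) << 10)   /* 4KB, the x86 base page size */
#define HUGE_PAGE_2M    (UINT64_C(2) << 20)   /* 2MB huge page */
#define HUGE_PAGE_1G    (UINT64_C(1) << 30)   /* 1GB huge page */

/*
 * Hypothetical helper: the number of TLB entries needed to map `bytes`
 * of memory using pages of `page_size` bytes, rounding up for a
 * partially used page.
 */
uint64_t tlb_entries_needed(uint64_t bytes, uint64_t page_size)
{
    assert(page_size != 0);
    return (bytes + page_size - 1) / page_size;
}
```

Calling tlb_entries_needed(HUGE_PAGE_2M, BASE_PAGE_SIZE) gives the 512 entries cited above, while the same region mapped as a single 2MB page needs just one.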

That said, there are also costs associated with huge pages, starting with the fact that they are huge. Processes do not always need large, virtually contiguous memory ranges, so placing all process memory in huge pages would end up wasting a lot of memory. The transparent huge pages mechanism tries to find a balance by scanning process memory and finding the places where huge pages might make sense; when such a place is found, a range of base pages is "collapsed" into a single huge page without the owning process being aware that anything has changed.

There are costs to transparent huge pages too, though. The scanning itself takes CPU time, so there are limits to how much memory the khugepaged kernel thread is allowed to scan each second. Those limits keep the cost of khugepaged within reason, but they also reduce the rate at which huge pages are used, causing processes that could benefit from them to run in a less efficient mode for longer.
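
To get a feel for how tight that limit is, one can read the khugepaged knobs from sysfs and estimate an upper bound on the scan rate. The sketch below assumes that the pages_to_scan and scan_sleep_millisecs files are present under /sys/kernel/mm/transparent_hugepage/khugepaged/; the helper names are hypothetical, and the estimate ignores the time spent on the scan itself:

```c
#include <assert.h>
#include <stdio.h>

#define KHUGEPAGED_DIR "/sys/kernel/mm/transparent_hugepage/khugepaged/"

/* Read one integer from a sysfs file; returns -1 on any failure. */
long read_knob(const char *path)
{
    FILE *f = fopen(path, "r");
    long value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}

/*
 * Upper bound on base pages scanned per second: khugepaged scans up to
 * pages_to_scan pages, then sleeps for scan_sleep_ms milliseconds.
 */
long max_scan_rate(long pages_to_scan, long scan_sleep_ms)
{
    assert(pages_to_scan >= 0 && scan_sleep_ms > 0);
    return pages_to_scan * 1000 / scan_sleep_ms;
}

/* Estimated rate for the running system, or -1 if it cannot be determined. */
long current_scan_rate(void)
{
    long pages = read_knob(KHUGEPAGED_DIR "pages_to_scan");
    long sleep_ms = read_knob(KHUGEPAGED_DIR "scan_sleep_millisecs");

    if (pages < 0 || sleep_ms <= 0)
        return -1;
    return max_scan_rate(pages, sleep_ms);
}
```

With values of 4096 and 10000 (which appear to be common defaults, though that should be checked on any given system), the bound works out to about 409 base pages, or less than 2MB, per second.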

The idea behind O'Keefe's patch set is to allow user space to induce huge-page collapse to happen quickly in places where it is known (or hoped) that use of huge pages will be beneficial. The idea was first suggested by David Rientjes in early 2021, and eventually implemented by O'Keefe. Beyond allowing huge-page collapse to happen sooner, O'Keefe says, this work causes the CPU time necessary for huge-page collapse to be charged to the process that requests it, increasing fairness.

It also allows the process to control when that work is done. Data stored in base pages will be scattered throughout physical memory; collapsing those pages into a huge page requires copying the data into a single, physically contiguous, huge page. This, in turn, requires blocking changes to those pages during the copy and uses CPU time, all of which can increase latency, so there is value in being able to control when that work happens.

A process can request huge-page collapse for a range of memory with a new madvise() request:

    int madvise(void *begin, size_t length, MADV_COLLAPSE);

This call will attempt to collapse length bytes of memory beginning at begin into huge pages. There does not appear to be any specific alignment requirement for those parameters, even though huge pages do have alignment requirements; instead, the range is clamped to huge-page boundaries. If begin points to a base page in the middle of the address range that a huge page would cover, it is aligned forward to the beginning of the next huge page; the end of the range is, likewise, aligned backward. The operation thus never touches memory outside of the range that was passed in.
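
The patch author has clarified, in the comments on this article, that the range is clamped inward: the start is aligned forward and the end is aligned backward. A minimal sketch of that arithmetic, assuming 2MB huge pages and using hypothetical names (clamp_to_hpages() and struct hpage_range) that do not come from the patch set, might look like:

```c
#include <assert.h>
#include <stdint.h>

#define HPAGE_SIZE (UINT64_C(2) << 20)   /* assumption: 2MB huge pages */
#define HPAGE_MASK (~(HPAGE_SIZE - 1))

struct hpage_range {
    uint64_t start;   /* first huge-page-aligned address at or after begin */
    uint64_t end;     /* last huge-page boundary at or before begin + length */
};

/* Clamp [begin, begin + length) to the huge pages that it fully covers. */
struct hpage_range clamp_to_hpages(uint64_t begin, uint64_t length)
{
    struct hpage_range r;

    assert(length <= UINT64_MAX - begin);            /* range must not wrap */
    assert(begin <= UINT64_MAX - (HPAGE_SIZE - 1));  /* nor may the round-up */
    r.start = (begin + HPAGE_SIZE - 1) & HPAGE_MASK; /* align start forward */
    r.end = (begin + length) & HPAGE_MASK;           /* align end backward */
    if (r.end < r.start)
        r.end = r.start;   /* no complete huge page fits in the range */
    return r;
}
```

For a request that starts 4KB past a 2MB boundary and spans 6MB, only the two fully covered huge pages in the middle are candidates for collapse; a request shorter than a huge page may cover none at all.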

There are, of course, no guarantees that this call will succeed in creating huge pages; that depends on a number of things, including the availability of free huge pages in the system. Even if the operation is successful, a vindictive kernel could split the huge page(s) apart again before the call returns. If at least some success was had, the return code will be zero; otherwise an error code will be returned. A lack of available huge pages, in particular, will yield an EAGAIN error code.
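
As a rough illustration of how a program might issue the new request and interpret its result, the sketch below maps an anonymous region, touches it, and asks for a collapse. The helper names (try_collapse() and collapse_demo()) are hypothetical, and the fallback definition of MADV_COLLAPSE is an assumption made so that the code builds against headers lacking it; the value used by this patch set may differ.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25   /* assumed value; check the kernel headers in use */
#endif

/*
 * Ask the kernel to collapse [addr, addr + len) into huge pages.
 * Returns 0 if the kernel reported success, otherwise the errno value
 * (EAGAIN if no huge page was available, EINVAL on kernels that do
 * not know about MADV_COLLAPSE, and so on).
 */
int try_collapse(void *addr, size_t len)
{
    assert(addr != NULL && len > 0);
    if (madvise(addr, len, MADV_COLLAPSE) == 0)
        return 0;
    return errno;
}

/* Map and populate an anonymous region, then try to collapse it. */
int collapse_demo(size_t len)
{
    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    int rc;

    if (p == MAP_FAILED)
        return errno;
    /* Touch every 4KB so that there are base pages to collapse; with a
     * larger base page size this just touches some pages more than once. */
    for (size_t i = 0; i < len; i += 4096)
        p[i] = 1;
    rc = try_collapse(p, len);
    munmap(p, len);
    return rc;
}
```

A zero return only means that the kernel reported at least some success at the time of the call; as noted above, nothing prevents the resulting huge pages from being split again later.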

Support for MADV_COLLAPSE is also added to process_madvise(), allowing one process to induce huge-page collapse in another. In this case, there are a couple of flags that are available (these would be the first use of the flags argument to process_madvise()):

MADV_F_COLLAPSE_LIMITS controls whether this operation should be bound by the limits on huge-page collapse that khugepaged follows; these are set via sysfs knobs in existing kernels. If the calling process lacks the CAP_SYS_ADMIN capability, then the presence of this flag is mandatory. It is arguably a bit strange to require an explicit flag to request the default behavior, but that's the way of it.
MADV_F_COLLAPSE_DEFRAG, if present, allows the operation to force page compaction to create free huge pages, even if the system configuration would otherwise not allow that. This flag does not require any additional capabilities, perhaps because the cost of compaction would be borne by the affected process itself.
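
The permission rule for these flags can be modeled in a few lines. This is only a sketch of the rule as described here, not code from the patch set: the EX_-prefixed names and their bit values are placeholders, and the rejection of unknown flag bits is an assumption.

```c
#include <stdbool.h>

/* Placeholder bit values; the real ones are defined by the patch set. */
#define EX_MADV_F_COLLAPSE_LIMITS  (1u << 0)
#define EX_MADV_F_COLLAPSE_DEFRAG  (1u << 1)
#define EX_KNOWN_FLAGS \
    (EX_MADV_F_COLLAPSE_LIMITS | EX_MADV_F_COLLAPSE_DEFRAG)

/*
 * Model of the rule described above: a caller without CAP_SYS_ADMIN
 * must pass the "limits" flag, while the "defrag" flag needs no
 * additional capability.
 */
bool collapse_flags_allowed(unsigned int flags, bool has_cap_sys_admin)
{
    if (flags & ~EX_KNOWN_FLAGS)
        return false;   /* assumption: unknown bits are rejected */
    if (!has_cap_sys_admin && !(flags & EX_MADV_F_COLLAPSE_LIMITS))
        return false;   /* unprivileged callers must keep the limits */
    return true;
}
```

Under this model, an unprivileged caller that passes no flags at all is refused, which is the oddity noted above: the default behavior has to be requested explicitly.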

The end result, O'Keefe says, is a mechanism that allows user space to take control of the use of huge pages, perhaps to the point that the kernel need no longer be involved:

Though not required to justify this series, hugepage management could be offloaded entirely to a sufficiently informed userspace agent, supplanting the need for khugepaged in the kernel.

First, though, this work would need to make it into the mainline kernel. Most of the review comments thus far are focused on details, but David Hildenbrand did take exception to one aspect of this new operation's behavior. In the current patch series, huge pages will be created for any virtual memory area, even those that have been explicitly marked to not use huge pages with an madvise(MADV_NOHUGEPAGE) call. That, he said, "would break KVM horribly" on the s390 architecture. This behavior will thus need to change.

The current patch set only works with anonymous pages; the plan is to add support for file-backed pages at a later time. Since one of the stated justifications for this patch is to be able to quickly enable huge pages for executable text, support for file-backed pages seems important, and the developers are likely to want to see it before giving this work the go-ahead. The feature looks like it will be useful for some use cases, though, so it seems likely to find its way into the mainline sooner or later.

Posted Mar 14, 2022 16:57 UTC (Mon) by Sesse (subscriber, #53779) [Link] (5 responses)

I originally assumed “collapse” meant that the huge page would collapse and splinter into individual pages, but no, it's the opposite!

Posted Mar 14, 2022 17:23 UTC (Mon) by droundy (subscriber, #4559) [Link] (2 responses)

Me too! It's a little more like a black hole collapse...

Posted Mar 14, 2022 20:47 UTC (Mon) by flussence (guest, #85566) [Link]

I'd just think of it like pulling down a shelving unit to free up space for larger items; you're removing structure after all.

In terms of difficulty I'd say huge pages lie somewhere between rearranging occupied shelves without breaking the contents and safely containing a black hole…

Posted Mar 22, 2022 12:53 UTC (Tue) by jezuch (subscriber, #52988) [Link]

FWIW I immediately understood the intended meaning. Caveat: I'm not a native speaker and I read way too much about astrophysics :)

Posted Mar 15, 2022 0:41 UTC (Tue) by Karellen (subscriber, #67644) [Link] (1 responses)

I suppose it's the difference between a solid object collapsing into individual pieces, like a wall collapsing into a disconnected pile of bricks; or a sparsely connected group of elements collapsing into a single mass, like a house of cards collapsing into a dense pile of cards.

Do things generally collapse outwards, or inwards?

Posted Mar 15, 2022 8:16 UTC (Tue) by Wol (subscriber, #4433) [Link]

Think of the energy. It takes a release of energy to make things explode, a minimal flow of energy to make things collapse, while an implosion gives off a blast of energy.

So I guess collapse is the right word here. There's a minimal change observable from outside the system, while what was there is still there, just looking a lot smaller because all the empty space has been squeezed out.

Cheers,
Wol

Posted Mar 15, 2022 5:49 UTC (Tue) by Nikratio (subscriber, #71966) [Link] (5 responses)

"Processes do not always need large, virtually contiguous memory ranges," - that should be "physically", not "virtually" I think?

Posted Mar 15, 2022 7:01 UTC (Tue) by edeloget (subscriber, #88392) [Link] (4 responses)

Both are true. Some programs are perfectly happy with a lot of small, non-contiguous virtual memory ranges because they never do large allocations.

Posted Mar 16, 2022 0:44 UTC (Wed) by mtaht (subscriber, #11087) [Link] (3 responses)

"Aggregate", rather than "collapse", might be a better word.

Posted Mar 18, 2022 1:17 UTC (Fri) by ncm (guest, #165) [Link]

The most apt English word is "consolidate".

(That did not stop business reporters from preferring "conglomerate", some decades back.)

Posted Mar 18, 2022 13:34 UTC (Fri) by pitb0ss (subscriber, #137324) [Link] (1 responses)

Perhaps "coalesce" is another good word.

Posted Mar 18, 2022 15:08 UTC (Fri) by zokeefe (guest, #140292) [Link]

I thought about "coalesce", but "collapse" is just what the operation is ubiquitously called in the kernel and existing APIs (e.g. /sys/kernel/mm/transparent_hugepage/khugepaged/pages_collapsed) and stats (e.g. /proc/vmstat:thp_collapse_alloc[_failed]).

Posted Mar 18, 2022 15:27 UTC (Fri) by zokeefe (guest, #140292) [Link]

Thanks for the kind writeup! :)

> This call will attempt to collapse length bytes of memory beginning at begin into huge pages. There does not appear to be any specific alignment requirement for those parameters, even though huge pages do have alignment requirements.

I should call out what happens when parameters are passed that don't align with architecture hugepage size/alignment; thanks for pointing out that I didn't mention this anywhere.

> If begin points to a base page in the middle of the address range that the huge page containing it will cover, then pages before begin will become part of the result. In other words, begin will be aligned backward to the proper beginning address for the containing huge page. The same is true for length, which will be increased if necessary to encompass a full huge page.

A small correction: the opposite actually happens; we forward align the start and backward align the end. Else, we'd have to make a decision on what to do if the new range fell outside the VMA(s). IOW, we clamp the provided range(s) to be hugepage aligned/sized.