
An (unsigned) long story about page allocation

By Jonathan Corbet
December 23, 2015
The kernel project is famously willing to change any internal interface as needed for the long-term maintainability of the code. Effects on out-of-tree modules or other external code are not generally deemed to be reasons to keep an interface stable. But what happens if you want to change one of the oldest interfaces found within the kernel — one with many hundreds of call sites? It turns out that, in 2015, the appetite for interface churn may not be what it once was.

If one looks at mm/memory.c in the Linux 0.01 release, one finds that a page of memory is allocated with:

    unsigned long get_free_page(void);

From the memory-management point of view, the system's RAM can be seen as a linear array of pages, so it can make a certain amount of sense to think of addresses as integer types — indexes into the array, essentially. Integers can also be used for arbitrary arithmetic; pointers in C can be used that way too, but one quickly gets into "undefined behavior" territory where an overly enthusiastic compiler may feel entitled to create all kinds of mayhem. So unsigned long was established as the return type from get_free_page() and, in general, as the way that one refers to an address that may appear in any place in memory.
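As a rough illustration of the "array of pages" view, the sketch below splits an integer address into a page index and an offset using ordinary arithmetic. It assumes 4 KiB pages; PAGE_SHIFT, PAGE_SIZE and PAGE_MASK mirror names the kernel uses, but the `sketch_` helpers are hypothetical and exist only for this example.

```c
#include <assert.h>

/* Assumed for illustration: 4 KiB pages. The helper functions below are
 * hypothetical and are not kernel interfaces. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Index of the page holding addr, viewing RAM as an array of pages. */
unsigned long sketch_page_index(unsigned long addr)
{
    return addr >> PAGE_SHIFT;
}

/* Address of the first byte of the page holding addr. */
unsigned long sketch_page_base(unsigned long addr)
{
    return addr & PAGE_MASK;
}

/* Byte offset of addr within its page. */
unsigned long sketch_page_offset(unsigned long addr)
{
    return addr & ~PAGE_MASK;
}

/* Usage: split an address and put it back together. */
void sketch_page_demo(void)
{
    unsigned long addr = 0x12345UL;

    assert(sketch_page_index(addr) == 0x12UL);
    assert(sketch_page_base(addr) + sketch_page_offset(addr) == addr);
}
```

Shifts and masks like these are well defined on an unsigned integer type, which is part of the appeal of unsigned long for this kind of bookkeeping.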

Fast-forward to the 4.4-rc6 release and dig through a rather larger body of code, and one finds that pages are allocated with:

    unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
    unsigned long __get_free_page(gfp_t gfp_mask);

The latter is a macro calling the former with an order of zero. Note that, more than 24 years after the 0.01 release, unsigned long is still used as the return type from __get_free_pages(). There are other variants (alloc_pages(), for example) that return struct page pointers, but much of low-level, page-oriented memory management in Linux is still done with unsigned long values.
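The relationship between the two calls can be mimicked in user space. The following is only a mock under stated assumptions: the `sketch_` names are invented, aligned_alloc() stands in for the page allocator, the flags argument is ignored, pages are assumed to be 4 KiB, and unsigned long is assumed to be wide enough to hold a pointer (as it is on Linux).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SHIFT 12                      /* assumed: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

typedef unsigned int gfp_t;                /* stand-in for the kernel type */

/* Mock allocator: address of 2^order contiguous pages, or 0 on failure. */
unsigned long sketch_get_free_pages(gfp_t gfp_mask, unsigned int order)
{
    (void)gfp_mask;                        /* flags are ignored in this mock */
    return (unsigned long)aligned_alloc(PAGE_SIZE, PAGE_SIZE << order);
}

/* The single-page variant is simply the order-0 case. */
#define sketch_get_free_page(gfp_mask) sketch_get_free_pages((gfp_mask), 0)

void sketch_free_pages(unsigned long addr, unsigned int order)
{
    (void)order;                           /* the mock does not need the size */
    free((void *)addr);
}

/* Usage: order 2 means 2^2 = 4 pages, 16 KiB with the assumed page size. */
void sketch_order_demo(void)
{
    unsigned long addr = sketch_get_free_pages(0, 2);

    assert(addr != 0);
    memset((void *)addr, 0, PAGE_SIZE << 2);   /* touching memory needs a cast */
    sketch_free_pages(addr, 2);
}
```

Even in this toy version, the first thing the caller does with the returned integer is cast it to a pointer.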

The only problem is that, often, the kernel must deal with a page of memory as memory, modifying its contents. That requires a pointer. So even back in 0.01, one can find code like:

    p = (struct task_struct *) get_free_page();

The unsigned long return value is immediately cast into the pointer value that is actually needed. Al Viro did a survey of __get_free_pages() users in current kernels and concluded that "well above 90%" of the callers were using the return value as a pointer. That turns out to be a lot of casts, suggesting that the type of the return value for this function is not correct. So, he suggested, it might make sense to change it:

"In other words, switching to void * for return values of allocating and argument of freeing functions would reduce the amount of boilerplate quite nicely. What's more, fewer casts means better chance for typechecking to catch more bugs."

Some of those bugs, he pointed out, he found simply by looking at the code with this kind of transformation in mind. Ten days later, he showed up with a patch set making the change and asked for a verdict from Linus.

One might find various faults with Linus's response, but a lack of clarity will not be among them. He left no doubt that there was no place in the mainline for this particular patch set. The diffstat in Al's patch (568 files changed, 1956 insertions, 2202 deletions) was clearly frightening — enough, in its own right, to rule out the change. A patch this wide-ranging would create conflicts throughout the tree and make life difficult for those backporting patches. This interface, it seems, is too old and too entrenched for this kind of flag-day change; as Linus put it: "No way in hell do we suddenly change the semantics of an interface that has been around from basically day #1."

Still, as he clarified afterward, Linus isn't arguing for leaving everything exactly as it is. He accepted that most callers likely want a pointer value. But the way forward isn't to thrash up an interface like __get_free_pages(); instead, there are two approaches that, he said, could be taken.

The first of these would be to create a new, pointer-oriented interface that exists in parallel with __get_free_pages(). Then call sites could be converted at leisure over the course of what would probably be years.
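A parallel interface of that kind might look something like the user-space sketch below. None of these names come from the discussion: `legacy_get_free_pages()` is a mock of the existing integer-returning call, and the `sketch_*_ptr()` wrappers are hypothetical. The point is only that the cast moves into one place, so converted call sites no longer need it.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL                   /* assumed page size */
typedef unsigned int gfp_t;                /* stand-in for the kernel type */

/* Mock of the existing integer-returning interface. */
unsigned long legacy_get_free_pages(gfp_t gfp_mask, unsigned int order)
{
    (void)gfp_mask;
    return (unsigned long)aligned_alloc(PAGE_SIZE, PAGE_SIZE << order);
}

void legacy_free_pages(unsigned long addr, unsigned int order)
{
    (void)order;
    free((void *)addr);
}

/* Hypothetical pointer-oriented twins; the one cast lives here. */
void *sketch_alloc_pages_ptr(gfp_t gfp_mask, unsigned int order)
{
    return (void *)legacy_get_free_pages(gfp_mask, order);
}

void sketch_free_pages_ptr(void *p, unsigned int order)
{
    legacy_free_pages((unsigned long)p, order);
}

struct sketch_task { int pid; };

/* Unconverted call site: casts on both allocation and free. */
int sketch_old_caller(void)
{
    struct sketch_task *t = (struct sketch_task *)legacy_get_free_pages(0, 0);
    int pid;

    assert(t != NULL);
    t->pid = 1;
    pid = t->pid;
    legacy_free_pages((unsigned long)t, 0);
    return pid;
}

/* Converted call site: void * converts implicitly, so the casts disappear. */
int sketch_new_caller(void)
{
    struct sketch_task *t = sketch_alloc_pages_ptr(0, 0);
    int pid;

    assert(t != NULL);
    t->pid = 2;
    pid = t->pid;
    sketch_free_pages_ptr(t, 0);
    return pid;
}
```

Because both families can coexist, each call site can be converted independently, which is what makes the gradual approach possible.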

The alternative, Linus said, is that code needing pointers could just allocate memory with kmalloc() instead. Once upon a time, that would not necessarily have been a good idea, since kmalloc() (implemented by the slab allocators) adds overhead to the page allocator and might have expanded the size of the returned memory beyond one page. Indeed, there was a period where an allocation of exactly one page would have consumed two physically contiguous pages when the slab housekeeping information was added. But those days are long in the past. In current kernels, kmalloc() is fast and requires little memory beyond that which is actually allocated. Indeed, Linus pointed out, kmalloc() may actually be faster than __get_free_pages() due to its use of per-CPU object caches.

So kmalloc() is probably the best option for many of the call sites currently using __get_free_pages(). The places where it is still inappropriate will be those needing multiple-page allocations and those needing allocations that are not only page-sized but page-aligned. In those cases, Linus said, the unsigned long return type might not be a bad thing, since "it's clearly not just a random pointer allocation if the bit pattern of the pointer matters."
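To make the "bit pattern" remark concrete, here is a small sketch of the kind of test such code performs on an address. It assumes 4 KiB pages, and the helper names are hypothetical; it is meant only to show that alignment questions are naturally asked of an integer.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SHIFT 12                      /* assumed: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* True if addr sits on a page boundary. */
bool sketch_page_aligned(unsigned long addr)
{
    return (addr & (PAGE_SIZE - 1)) == 0;
}

/* True if addr is aligned to the full size of a 2^order-page block. */
bool sketch_block_aligned(unsigned long addr, unsigned int order)
{
    return (addr & ((PAGE_SIZE << order) - 1)) == 0;
}

/* Usage: page alignment does not imply alignment of a larger block. */
void sketch_align_demo(void)
{
    /* 0x3000 starts a page but not a two-page (8 KiB) block. */
    assert(sketch_page_aligned(0x3000UL));
    assert(!sketch_block_aligned(0x3000UL, 1));
}
```

A caller that depends on the stronger, block-sized alignment is relying on a property of the address itself, which is the situation where an integer return type arguably still fits.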

After this discussion took place, Al did a pass over the __get_free_pages() call sites in the filesystem code and concluded that almost all of them truly would be better off using kmalloc(). So the end result of this work may be a slow shift in that direction and, perhaps, the creation of a new document telling kernel developers which memory allocator they should be using in which setting.

Index entries for this article
Kernel: Memory management/Internal API



An (unsigned) long story about page allocation

Posted Dec 24, 2015 3:05 UTC (Thu) by neilbrown (subscriber, #359) [Link] (2 responses)

> and those needing allocations that are not only page-sized but page-aligned.

kmalloc(PAGE_SIZE) will always return a whole page - properly page aligned.

Linus' comment about alignment:

> And if the code really explicitly wants a page (or set of aligned pages)

is about alignment of a *set* of pages. kmalloc(PAGE_SIZE * 2) will return a pair of pages, properly page-aligned, but it may not be 2-page aligned.

An (unsigned) long story about page allocation

Posted Dec 24, 2015 3:24 UTC (Thu) by viro (subscriber, #7872) [Link] (1 responses)

Please, show me what in mm/sl*b*.c would guarantee that.

An (unsigned) long story about page allocation

Posted Dec 31, 2015 7:51 UTC (Thu) by vbabka (subscriber, #91706) [Link]

True. The current implementation does align, and it's hard to think of a sensible implementation that wouldn't for a page-sized allocation. But the article suggests it wasn't always like this, and there's no stated guarantee AFAIK.

BTW, the page allocator also has per-CPU caches, so that's not an advantage of kmalloc.

An (unsigned) long story about page allocation

Posted Dec 24, 2015 13:30 UTC (Thu) by ghane (guest, #1805) [Link] (3 responses)

Once again, Linus' response demonstrates the lack of civility in LKML, which is driving away experienced middle-aged white hackers from our community. When will someone stand up for them? Surely Gnome or the Mozilla Foundation can start a project?

-- Sanjeev "how do you markup tongue-in-cheek" Gupta

An (unsigned) long story about page allocation

Posted Dec 24, 2015 23:04 UTC (Thu) by Arch-TK (guest, #103811) [Link] (2 responses)

What are you on about?

An (unsigned) long story about page allocation

Posted Dec 25, 2015 1:28 UTC (Fri) by pr1268 (subscriber, #24648) [Link] (1 responses)

I sense some sarcasm in Sanjeev's (ghane's?) post. That being said, everyone please note the context of Linus' curt retort—he was replying to Al Viro in particular. I'm sure the two of them would feel right at home lobbing insulting messages back and forth to each other.

[ducks for cover]

NOT meant to impugn Mr. Viro's work, or that of Linus. Hopefully readers will equally sense the sarcasm in my post. ;-)

An (unsigned) long story about page allocation

Posted Jan 3, 2016 9:07 UTC (Sun) by jospoortvliet (guest, #33164) [Link]

It does seem Linus is unnecessarily rude. As he points out he might not know how many ways there are to say NO but I am sure he can find a slightly less strong and harsh one. Al tried something he believed was worthwhile, Linus disagreed. No reason to start yelling right away...

Al might not mind but as often - others watch and might not be interested in getting yelled at like that so that super clever cleanup/optimization they were thinking about might never be proposed. And that is a waste for no reason.

Unsigned longs and void*s

Posted Dec 25, 2015 1:55 UTC (Fri) by pr1268 (subscriber, #24648) [Link] (2 responses)

Perhaps I'm confused...

> The unsigned long return value is immediately cast into the pointer value that is actually needed.

Isn't it already a pointer? I thought that the C language standard specifies that memory addresses (physical or virtual) be represented as an unsigned long integral primitive type[1]. Plus, Mel Gorman's documentation on these functions even states that these functions return a "virtual address" (§ 6.2).

Perhaps I'm needlessly arguing English language semantics here instead of C. The gist of my post is that an unsigned long and a void * are the same thing to the compiler, but if they're not, then all these casts exist merely to shut up the compiler.

[1] I may be wrong on this; apparently ANSI/C89 makes no mention of storage of memory address types (i.e. pointers) as a primitive type, but instead as a derived type.

Unsigned longs and void*s

Posted Dec 25, 2015 5:55 UTC (Fri) by viro (subscriber, #7872) [Link] (1 responses)

address != pointer.

"The following type designates an unsigned integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer: uintptr_t
These types [intptr_t and uintptr_t] are optional"

IOW, not only they are different (hell, try to compile something like 2 * &n and see where the compiler tells you to shove it), they are not even guaranteed to be possible to convert back and forth.

On all architectures supported by Linux, such a type exists and happens to be unsigned long. So casts back and forth are possible. But void * and unsigned long are certainly *not* the same thing - the sets of operations valid for them are quite different.

Unsigned longs and void*s

Posted Dec 30, 2015 13:08 UTC (Wed) by eru (subscriber, #2753) [Link]

Those who have worked with segmented memory models have learned this the hard way. The MS-DOS and 16-bit Windows "large" memory model was an easy introduction, since there far pointers and longs are still the same size, even though the pointer is not a simple linear number. But then I encountered a 32-bit segmented Intel system, where pointers are 6 bytes (2 byte selector, 4 byte offset), but longs still 4 bytes... Teaches one to take C prototype declarations seriously.


Copyright © 2015, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds