An (unsigned) long story about page allocation
If one looks at mm/memory.c in the Linux 0.01 release, one finds that a page of memory is allocated with:
unsigned long get_free_page(void);
From the memory-management point of view, the system's RAM can be seen as a linear array of pages, so it can make a certain amount of sense to think of addresses as integer types — indexes into the array, essentially. Integers can also be used for arbitrary arithmetic; pointers in C can be used that way too, but one quickly gets into "undefined behavior" territory where an overly enthusiastic compiler may feel entitled to create all kinds of mayhem. So unsigned long was established as the return type from get_free_page() and, in general, as the way that one refers to an address that may appear in any place in memory.
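The "array of pages" view can be made concrete with a little integer arithmetic. What follows is only a sketch: it assumes 4 KiB pages and memory that starts at address zero, and the helper names (page_index(), page_base(), page_address_of()) are made up for illustration rather than taken from any kernel release.

```c
#include <assert.h>

#define PAGE_SHIFT 12                     /* assumed: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

/* Index of the page containing addr, treating memory as an array of pages. */
unsigned long page_index(unsigned long addr)
{
    return addr >> PAGE_SHIFT;
}

/* Address of the first byte of the page containing addr. */
unsigned long page_base(unsigned long addr)
{
    return addr & PAGE_MASK;
}

/* Address of the page with the given index; the inverse of page_index(). */
unsigned long page_address_of(unsigned long index)
{
    /* Reject an index whose address would not fit in an unsigned long. */
    assert(index <= (~0UL >> PAGE_SHIFT));
    return index << PAGE_SHIFT;
}
```

All of this is well-defined arithmetic on unsigned integers; the same shifts and masks are not available on a pointer without first converting it to an integer type.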
Fast-forward to the 4.4-rc6 release and dig through a rather larger body of code, and one finds that pages are allocated with:
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);
unsigned long __get_free_page(gfp_t gfp_mask);
The latter is a macro calling the former with an order of zero. Note that, more than 24 years after the 0.01 release, unsigned long is still used as the return type from __get_free_pages(). There are other variants (alloc_pages(), for example) that return struct page pointers, but much of low-level, page-oriented memory management in Linux is still done with unsigned long values.
The only problem is that, often, the kernel must deal with a page of memory as memory, modifying its contents. That requires a pointer. So even back in 0.01, one can find code like:
p = (struct task_struct *) get_free_page();
The unsigned long return value is immediately cast into the pointer value that is actually needed.
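For readers who want to see the pattern outside of the kernel, here is a minimal user-space sketch. Everything in it is a stand-in: fake_get_free_page() uses the C11 aligned_alloc() function in place of the real page allocator, struct fake_task is a placeholder for struct task_struct, and a 4 KiB page size is assumed.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL   /* assumed page size for this sketch */

/* The pattern only works if unsigned long is wide enough for an address. */
_Static_assert(sizeof(unsigned long) >= sizeof(void *),
               "unsigned long must be able to hold a pointer");

/* Hypothetical stand-in for get_free_page(): an address, or 0 on failure. */
unsigned long fake_get_free_page(void)
{
    return (unsigned long)aligned_alloc(PAGE_SIZE, PAGE_SIZE);
}

/* Give a page back; the integer has to be cast to a pointer again. */
void free_fake_page(unsigned long addr)
{
    /* Only whole, page-aligned pages are ever handed out above. */
    assert((addr & (PAGE_SIZE - 1)) == 0);
    free((void *)addr);
}

/* Placeholder for struct task_struct. */
struct fake_task {
    int pid;
};

/* The caller must cast before it can touch the contents of the page. */
struct fake_task *alloc_fake_task(int pid)
{
    struct fake_task *p = (struct fake_task *)fake_get_free_page();

    if (p)
        p->pid = pid;
    return p;
}
```

The cast in alloc_fake_task() carries no information; it exists only because the allocator hands back an integer while the caller wants to dereference the result.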
Al Viro did a survey of __get_free_pages() users in current kernels and concluded that "well above 90%" of the callers were using the return value as a pointer. That turns out to be a lot of casts, suggesting that the type of the return value for this function is not correct. So, he suggested, it might make sense to change it:
Some of those bugs, he pointed out, he found simply by looking at the code with this kind of transformation in mind. Ten days later, he showed up with a patch set making the change and asked for a verdict from Linus.
One might find various faults with Linus's response, but a lack of clarity will not be among them. He left no doubt that there was no place in the mainline for this particular patch set. The diffstat in Al's patch (568 files changed, 1956 insertions, 2202 deletions) was clearly frightening, enough in its own right to rule out the change. A patch this wide-ranging would create conflicts throughout the tree and make life difficult for those backporting patches. This interface, it seems, is too old and too entrenched for this kind of flag-day change; as Linus put it: "No way in hell do we suddenly change the semantics of an interface that has been around from basically day #1."
Still, as he clarified afterward, Linus isn't arguing for leaving everything exactly as it is. He accepted that most callers likely want a pointer value. But the way forward isn't to thrash up an interface like __get_free_pages(); instead, there are two approaches that, he said, could be taken.
The first of these would be to create a new, pointer-oriented interface that exists in parallel with __get_free_pages(). Then call sites could be converted at leisure over the course of what would probably be years.
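A parallel interface of this kind need not be much more than a thin wrapper. The user-space sketch below shows the shape such a thing might take; the names fake_get_free_pages() and fake_get_free_pages_ptr() are hypothetical, aligned_alloc() stands in for the page allocator, the GFP mask is omitted, and 4 KiB pages are assumed. It is not the interface that anybody proposed in the discussion.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL   /* assumed page size for this sketch */

/* Stand-in for the existing interface: returns an address as an integer. */
unsigned long fake_get_free_pages(unsigned int order)
{
    /* Arbitrary cap so that this sketch only asks for modest sizes. */
    assert(order < 11);
    return (unsigned long)aligned_alloc(PAGE_SIZE, PAGE_SIZE << order);
}

/* Hypothetical pointer-returning twin living alongside the old function. */
void *fake_get_free_pages_ptr(unsigned int order)
{
    return (void *)fake_get_free_pages(order);
}

/* Placeholder for a two-page object that some caller wants to allocate. */
struct fake_buffer {
    char data[2 * PAGE_SIZE];
};

/* A converted call site: void * converts implicitly, so no cast is needed. */
struct fake_buffer *alloc_fake_buffer(void)
{
    return fake_get_free_pages_ptr(1);   /* order 1 means two pages */
}
```

Since the old function remains in place, call sites that have not been converted keep compiling unchanged, which is what would allow the conversion to proceed one subsystem at a time.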
The alternative, Linus said, is that code needing pointers could just allocate memory with kmalloc() instead. Once upon a time, that would not necessarily have been a good idea, since kmalloc() (implemented by the slab allocators) adds overhead to the page allocator and might have expanded the size of the returned memory beyond one page. Indeed, there was a period where an allocation of exactly one page would have consumed two physically contiguous pages when the slab housekeeping information was added. But those days are long in the past. In current kernels, kmalloc() is fast and requires little memory beyond that which is actually allocated. Indeed, Linus pointed out, kmalloc() may actually be faster than __get_free_pages() due to its use of per-CPU object caches.
So kmalloc() is probably the best option for many of the call sites currently using __get_free_pages(). The places where it is still inappropriate will be those needing multiple-page allocations and those needing allocations that are not only page-sized but page-aligned. In those cases, Linus said, the unsigned long return type might not be a bad thing, since "it's clearly not just a random pointer allocation if the bit pattern of the pointer matters."
After this discussion took place, Al did a pass over the __get_free_pages() call sites in the filesystem code and concluded that almost all of them truly would be better off using kmalloc(). So the end result of this work may be a slow shift in that direction and, perhaps, the creation of a new document telling kernel developers which memory allocator they should be using in which setting.
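The page-alignment requirement mentioned above comes down to a test on the low bits of an address, which is one place where an integer type is the natural thing to work with. The following is a small sketch, assuming 4 KiB pages; is_aligned() and is_page_aligned() are illustrative helpers, not kernel functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096UL   /* assumed page size for this sketch */

/* True if addr is a multiple of align; align must be a power of two. */
bool is_aligned(unsigned long addr, unsigned long align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* The same test for a pointer, which must be converted to an integer first. */
bool is_page_aligned(const void *p)
{
    return is_aligned((unsigned long)(uintptr_t)p, PAGE_SIZE);
}
```

With these definitions, is_aligned(0x3000UL, PAGE_SIZE) is true while is_aligned(0x3000UL, 2 * PAGE_SIZE) is false: an address can sit on a page boundary without being aligned to a two-page boundary.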
Posted Dec 24, 2015 3:05 UTC (Thu)
by neilbrown (subscriber, #359)
[Link] (2 responses)
kmalloc(PAGE_SIZE) will always return a whole page - properly page aligned.
Linus' comment about alignment:
> And if the code really explicitly wants a page (or set of aligned pages)
is about alignment of a *set* of pages. kmalloc(PAGE_SIZE * 2) will return a pair of pages, properly page-aligned, but it may not be 2-page aligned.
Posted Dec 24, 2015 3:24 UTC (Thu)
by viro (subscriber, #7872)
[Link] (1 responses)
Posted Dec 31, 2015 7:51 UTC (Thu)
by vbabka (subscriber, #91706)
[Link]
BTW, the page allocator also has per-CPU caches, so that's not an advantage of kmalloc.
Posted Dec 24, 2015 13:30 UTC (Thu)
by ghane (guest, #1805)
[Link] (3 responses)
-- Sanjeev "how do you markup tongue-in-cheek" Gupta
Posted Dec 24, 2015 23:04 UTC (Thu)
by Arch-TK (guest, #103811)
[Link] (2 responses)
Posted Dec 25, 2015 1:28 UTC (Fri)
by pr1268 (subscriber, #24648)
[Link] (1 responses)
I sense some sarcasm in Sanjeev's (ghane's?) post. That being said, everyone please note the context of Linus' curt retort—he was replying to Al Viro in particular. I'm sure the two of them would feel right at home lobbing insulting messages back and forth to each other. [ducks for cover] NOT meant to impugn Mr. Viro's work, or that of Linus. Hopefully readers will equally sense the sarcasm in my post. ;-)
Posted Jan 3, 2016 9:07 UTC (Sun)
by jospoortvliet (guest, #33164)
[Link]
Al might not mind but, as is often the case, others are watching and might not be interested in getting yelled at like that, so that super clever cleanup/optimization they were thinking about might never be proposed. And that is a waste for no reason.
Posted Dec 25, 2015 1:55 UTC (Fri)
by pr1268 (subscriber, #24648)
[Link] (2 responses)
> The unsigned long return value is immediately cast into the pointer value that is actually needed.
Perhaps I'm confused... Isn't it already a pointer? I thought that the C language standard specifies that memory addresses (physical or virtual) be represented as an unsigned long integral primitive type [1]. Plus, Mel Gorman's documentation on these functions even states that these functions return a "virtual address" (§ 6.2). Perhaps I'm needlessly arguing English language semantics here instead of C. The gist of my post is that an unsigned long and a void * are the same thing to the compiler, but if they're not, then all these casts exist merely to shut up the compiler.
[1] I may be wrong on this; apparently ANSI/C89 makes no mention of storage of memory address types (i.e. pointers) as a primitive type, but instead as a derived type.
Posted Dec 25, 2015 5:55 UTC (Fri)
by viro (subscriber, #7872)
[Link] (1 responses)
"The following type designates an unsigned integer type with the property that any valid pointer to void can be converted to this type, then converted back to pointer to void, and the result will compare equal to the original pointer: uintptr_t
These types [intptr_t and uintptr_t] are optional"
IOW, not only they are different (hell, try to compile something like 2 * &n and see where the compiler tells you to shove it), they are not even guaranteed to be possible to convert back and forth.
On all architectures supported by Linux, such a type exists and happens to be unsigned long. So casts back and forth are possible. But void * and unsigned long are certainly *not* the same thing - the sets of operations valid for them are quite different.
Posted Dec 30, 2015 13:08 UTC (Wed)
by eru (subscriber, #2753)
[Link]
Those who have worked with segmented memory models have learned this the hard way. The MS-DOS and 16-bit Windows "large" memory model was an easy introduction, since there far pointers and longs are still the same size, even though the pointer is not a simple linear number. But then I encountered a 32-bit segmented Intel system, where pointers are 6 bytes (2 byte selector, 4 byte offset), but longs still 4 bytes... Teaches one to take C prototype declarations seriously.