Application-friendly kernel interfaces
Access to huge pages is through the hugetlbfs filesystem. Hugetlbfs is a virtual filesystem much like tmpfs, but with a twist: mappings of files within the filesystem use huge pages. It's not possible to do normal reads and writes from this filesystem, but it is possible to create a file, extend it, and use mmap() to map it into virtual memory. This interface gets the job done, but it's evidently a little too involved for some application programmers.
To make life simpler, Ken Chen has proposed /dev/hugetlb. This device is much like /dev/zero, except that it uses huge pages. Applications can simply open the device and use mmap() to create as much huge-paged anonymous memory as they need. The patch is simple and seemingly uncontroversial; Andrew Morton did note, though:
We can do the same with hugetlbfs, but that involves (horror) "fuss".
The way to avoid "fuss" is of course to do it once, do it properly then stick it in a library which everyone uses.
He goes on to observe, however, that getting yet another library
distributed widely can be a difficult task - to the point that it's easier
to just add more functionality within the kernel itself. He concludes:
"This comes up regularly, and it's pretty sad."
In a separate message, Andrew talked about how kernel interfaces should be designed in general:
In many cases, the C library fills this role by providing a more application-friendly interface to kernel calls. But there are limits to how much code even the glibc developers want to stuff into the library, and things like a friendlier huge page interface may be on the wrong side of the line. A separate library for developers trying to do obscure and advanced things with the kernel might be the right solution.
The right solution, Andrew suggests, is to have a user-space API library
which is maintained as part of the kernel itself. That would keep
oversight over the API and help to ensure that the library is maintained
into the future while minimizing the amount of code which goes into the
kernel solely for the purpose of creating friendlier interfaces. Somebody
would have to step up to create and maintain that library, though; as of
this writing, volunteers are in short supply.
Index entries for this article | |
---|---|
Kernel | Development model/User-space ABI |
Kernel | Huge pages |
Kernel | User-space API |
Posted Mar 29, 2007 3:07 UTC (Thu)
by jreiser (subscriber, #11027)
and that makes hugetlbfs less than a filesystem. Hugetlbfs is a hack, and it is hard to use. Hugetlbfs is so hard to use that our editor could not find an actual working example to cite. Show me the code!
Posted Mar 29, 2007 6:02 UTC (Thu)
by ebiederm (subscriber, #35028)
#define PATH_TO_HUGETLBFS "/dev/hshm"
void *map_anon_hugetlb(size_t size)
Posted Mar 29, 2007 15:16 UTC (Thu)
by vmole (guest, #111)
snprintf(buffer, "%s/XXXXXX", PATH_TO_HUGETLBFS);
So much for working code... ;-)
Posted Mar 29, 2007 15:38 UTC (Thu)
by ebiederm (subscriber, #35028)
snprintf(buffer, sizoef(buffer), ....);
Posted Mar 29, 2007 16:02 UTC (Thu)
by vmole (guest, #111)
Strike 2! [Steve readies for the next pitch.]
Yeah yeah, I know that code written for forums like this is at best pseudo-code. Hell, I blew it just the other day, so I'm hardly the one to be picking on you, but I was amused by the "Show me the code" - "huh?" sequence.
Perhaps we can get away with claiming "Well, it was actually a debugging test for the reader". Right, that's it.
Posted Mar 29, 2007 4:32 UTC (Thu)
by orospakr (guest, #40684)
now that's an interesting idea.
Posted Mar 29, 2007 8:21 UTC (Thu)
by ms (subscriber, #41272)
Posted Mar 29, 2007 11:30 UTC (Thu)
by hummassa (guest, #307)
Posted Mar 29, 2007 11:54 UTC (Thu)
by jospoortvliet (guest, #33164)
Posted Mar 29, 2007 22:21 UTC (Thu)
by man_ls (guest, #15091)
Is it just because kernel->userspace interfaces are set in stone and have to be maintained forever? For that would feel a bit like medieval astronomers -- weaving layer over layer of epicycles so that their spheres would match the real planet trajectories. Here we would have a kernel interface set in stone, then some library code -- which once people use it would again be set in stone, only to add a new glue layer... again and again. Waiting a few iterations might be a better course of action, and I gather from LWN that it is often taken by kernel devs.
If the purpose of this scheme is to have a more powerful interface, I much prefer our editor's suggestion:
Posted Mar 30, 2007 1:16 UTC (Fri)
by cpeterso (guest, #305)
Posted Mar 30, 2007 5:40 UTC (Fri)
by IkeTo (subscriber, #2122)
Perhaps the whole hugetlb thing tells one of the possible reasons. The original /dev/hshm interface is actually more general than the /dev/hugetlb interface: it allows multiple processes unrelated in ancestry to share the same piece of huge page. It is probably preferable for the kernel API to use only the general interface rather than having to implement both, since every time the interface changes it needs a "global search" for libraries/applications using the interface, and enough time must be left for those libraries/applications to change (if Linus does not say "no" to the change right away). So it might be preferable to implement just the general interface, hoping that it will never change at all, and have another library "cast" it into various "more friendly" forms like the hugetlb interface. What is unclear to me is why one would expect that the new library could be exempted from the global search if it needs to be changed.
I think instead of a general liblinux, we should be content with the tested solutions of, e.g., pthread (futex) and libfam (dnotify): if the functionality fits a general audience, the easier interface is implemented in libc, and if it does not, the easier interface is implemented in a functionality-specific library. That way, when the generic interface is changed, the kernel developers have fewer places to search for direct users of it, and the specific interface is usable (and thus relied upon) by a narrower set of end-user applications.
Posted Apr 6, 2007 1:06 UTC (Fri)
by slamb (guest, #1070)
This is one of those areas where the BSDs have an easier time. They do "make world", and it's just inconceivable that an actual end user would mix'n'match kernel and userspace from different versions of FreeBSD. They got away with things like top assuming layout of kernel structures and accessing /dev/kmem for a long time. On Linux, that sort of mutt system is considered normal, so stuff has to be carefully versioned.
Posted Mar 30, 2007 6:16 UTC (Fri)
by ncm (guest, #165)
It should suffice to call mmap() and ask for an anonymous chunk of 16M, and the kernel can simply recognize that a hugetlb would serve, and use it. If, later, the process unmaps pages within it, the remaining pages can be switched over to the regular mapping scheme; most processes won't. Then it would be easy, safe, and backward-compatible for libc to switch malloc over to allocating hugetlb chunks by default, benefitting everybody.
I would also like to see a flag added to mmap() to require that the mapped block be aligned to match its size; e.g. ask for 16M and the bottom 24 bits of the returned address are 0. (Anybody else remember when 68K chips shipped with only 24 address pins, and Apple stuck annotations in the top 8 bits of addresses because the hardware ignored those bits?)
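No such mmap() flag is assumed to exist here, but the alignment being asked for can be approximated in user space today. The sketch below over-allocates by the alignment and unmaps the slack on either side. The helper name `mmap_aligned` is hypothetical, and the sketch assumes `size` is a multiple of the page size and `align` is a power of two no smaller than a page.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a kernel or libc API): return an anonymous
 * mapping of `size` bytes whose start address is a multiple of `align`.
 * Returns MAP_FAILED if the arguments are unusable or mmap() fails.
 */
void *mmap_aligned(size_t size, size_t align)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return MAP_FAILED;
    size_t pg = (size_t)page;

    /* align must be a power of two >= page size; size a page multiple. */
    if (size == 0 || align < pg || (align & (align - 1)) != 0 ||
        size % pg != 0 || size > SIZE_MAX - align)
        return MAP_FAILED;

    /* Reserve enough extra space that an aligned block must fit inside. */
    size_t span = size + align;
    char *raw = mmap(NULL, span, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return MAP_FAILED;

    /* Round the start address up to the next multiple of align. */
    uintptr_t start = ((uintptr_t)raw + align - 1) & ~((uintptr_t)align - 1);
    char *aligned = (char *)start;
    size_t head = (size_t)(aligned - raw);
    size_t tail = span - head - size;

    /* Give back the unused pages before and after the aligned block. */
    if (head != 0)
        munmap(raw, head);
    if (tail != 0)
        munmap(aligned + size, tail);
    return aligned;
}
```

Asking for 16M with 16M alignment this way yields an address whose bottom 24 bits are zero, at the cost of briefly reserving twice the address space. A kernel-side flag would avoid that extra reservation.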
Posted Mar 30, 2007 12:59 UTC (Fri)
by mjr (guest, #6979)
If a separate interface is really necessary for some reason, I'd put the same functionality behind regular libc malloc(); it already does brk() for small allocations and mmap() for large ones I believe, so it could just as well do extra-large allocations via the hugetlb API. (Putting this in malloc instead of mmap would get rid of the partial-munmap issue on the libc end.)
Posted Mar 30, 2007 23:19 UTC (Fri)
by giraffedata (guest, #1954)
It's quite possible that 16M of memory will consist of 100 scattered 4K pages of working set and the rest rarely used or even vacant. You wouldn't want to page the whole 16M in and out in that case.
Page granularity seems like a perfectly sensible parameter of an mmap, though.
Posted Apr 5, 2007 14:15 UTC (Thu)
by farnz (subscriber, #17727)
One thought: if your mmap parameter is simply a hint that the block will be used in a particular granularity, it's easy to implement. Current mmap sets the parameter to 1 byte (no granularity needed), unless mmapping in hugetlbfs pages, when it sets the parameter to (e.g.) 16M. The kernel then just rounds up to the next highest available page size when possible, or down if not.
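The rounding rule in that suggestion can be sketched as a small selection function. Everything here is illustrative: `pick_page_size` is a hypothetical name, and the list of currently available page sizes is assumed to be passed in sorted in ascending order. Nothing like this is claimed to exist in the kernel.

```c
#include <stddef.h>

/*
 * Hypothetical sketch of the hint-rounding rule: given a granularity
 * hint and the page sizes currently available (ascending order), pick
 * the smallest size that is >= the hint ("round up"); if none is large
 * enough, fall back to the largest available size ("round down").
 * Returns 0 if the list is empty.
 */
size_t pick_page_size(size_t hint, const size_t *avail, size_t n)
{
    if (n == 0)
        return 0;

    for (size_t i = 0; i < n; i++) {
        if (avail[i] >= hint)
            return avail[i];    /* round up to the next available size */
    }
    return avail[n - 1];        /* nothing big enough: round down */
}
```

With a hint of 1 byte this always selects the smallest page size, which matches the "no granularity needed" default described above.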
Posted Apr 5, 2007 18:57 UTC (Thu)
by joib (subscriber, #8541)
I would presume that for non-HPC applications these non-streaming, irregular access patterns are even more common. Though supposedly AMD is fixing this issue with the upcoming 'Barcelona' by having 128 2M TLB entries, and additionally supporting 1G pages (don't know how many TLB entries for those).
For comparison, the Intel Woodcrest has 256 4K and 32(?) 2M TLB entries.
Posted Apr 5, 2007 18:38 UTC (Thu)
by joib (subscriber, #8541)
It's not possible to do normal reads and writes from this filesystem [hugetlbfs] ...
Huh?
{
    char buffer[PATH_MAX];
    int fd;
    snprintf(buffer, "%s/XXXXXX", PATH_TO_HUGETLBFS);
    fd = mkstemp(buffer);
    if (fd < 0)
        return MAP_FAILED;
    unlink(buffer);
    ftruncate(fd, size);
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
}
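For readers who want something that compiles, here is one way the helper might look with the buffer size passed to snprintf() and the return values checked. It is a sketch, not the poster's code: the name `map_anon_hugetlb_at` and the `mount_point` parameter are hypothetical, a hugetlbfs filesystem is assumed to be mounted at that path, and `size` is assumed to be a multiple of the huge page size.

```c
#define _POSIX_C_SOURCE 200809L
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create an unlinked temporary file on the
 * hugetlbfs mount at `mount_point`, size it, and map it privately.
 * Returns MAP_FAILED on any error.
 */
void *map_anon_hugetlb_at(const char *mount_point, size_t size)
{
    char path[PATH_MAX];
    void *addr;
    int fd;
    int n;

    /* Build the mkstemp() template, checking for truncation. */
    n = snprintf(path, sizeof(path), "%s/XXXXXX", mount_point);
    if (n < 0 || (size_t)n >= sizeof(path))
        return MAP_FAILED;

    fd = mkstemp(path);
    if (fd < 0)
        return MAP_FAILED;

    /* The name is no longer needed; the open descriptor keeps the file. */
    unlink(path);

    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return MAP_FAILED;
    }

    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

    /* The mapping stays valid after the descriptor is closed. */
    close(fd);
    return addr;
}
```

A caller would compare the result against MAP_FAILED and release the memory with munmap(). If nothing is mounted at the given path, mkstemp() fails and the helper returns MAP_FAILED.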
Yea yea.
liblinux, eh?
I think this is a great idea. This allows for greater decoupling between glibc and the Linux kernel and is, IMHO, the proper abstraction. Plus, if the authors of the kernel interfaces are subsequently charged with writing liblinux entries, then there could well be cases where the authors would rather return to the drawing board and rethink the kernel interface if it's just too damn hard to use from userspace.
This would also permit the kernel devs to further experiment in yanking liblinux
functionality out of the kernel... things that _could_ be done in
userspace without performance penalties _should_ be done in userspace :-)
linux + liblinux would be maintained from the same source -- so they would
always be in sync -- and this would be really great.
Indeed, it really sounds like a great idea. This way, systems like GTK/glibc and Qt/kdelibs could link to this library, or even use it only when available to speed some things up, while using workarounds on other OSes like the BSDs, Solaris, Mac OS X, etc.
Do you really think it is a great idea? Pardon my lack of knowledge about kernel development, but why is it so great? I mean, you design a hard-to-use interface, then write your own code which presents a friendly interface to userspace -- and you write it in userspace. Well, why not present a friendly interface in the kernel in the first place?
Really?
A separate library for developers trying to do obscure and advanced things with the kernel might be the right solution.
I have seen too many complex interfaces that nobody uses because they are so complex, and everyone uses the simplified version. Better start simple, and then add complexity as needed.
Really?
I think the kernel API can change, so user programs should use the "friendly" userspace library APIs.
> I mean, you design a hard-to-use interface, then write your own code which
> presents a friendly interface to userspace -- and you write it in
> userspace. Well, why not present a friendly interface in the kernel in the
> first place?
I'm not sure about "would always be in sync". If you require that, people who have multiple kernels
on their box would need some mechanism such that the correct liblinux for whatever kernel they
happened to boot is dynamically loaded. Seems possible, but it's a step beyond "maintained from
the same source".
liblinux
Why should this need a file system, or a device, or a library at all?
I'm wondering the same myself. I'm not much for low level hacking, but I fail to see what benefits one would reap from yet another interface.
I can see the value of using mmap() for this, but I don't think you want mmap() guessing based on the size of the request what page size is best.
Why add anything?
Might be worth looking at Linux-mm.org on huge pages. In particular, there's a link to an LWN article on transparent use of huge pages. The "holy grail" is very definitely transparent use, so that whenever possible, all applications gain; anything that makes it easier to move that way is helpful.
Why add anything?
One problem is that the number of large page TLB entries is quite limited. E.g. on current Opterons, while you have a 512-entry data TLB for the normal 4K pages, for the 2M large pages you only have 8 entries. So if you have a loop kernel reading/writing from more than 8 big arrays you're going to have TLB thrashing.
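To put numbers on that, the memory a TLB can cover without a miss is simply entries times page size. The snippet below is a small worked calculation using the figures quoted in the comment. The function name `tlb_reach_bytes` is made up for illustration.

```c
#include <stdint.h>

/*
 * Illustrative helper: bytes of address space a TLB can map at once,
 * given its entry count and the page size each entry covers.
 */
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_size)
{
    return entries * page_size;
}
```

With the quoted figures, 512 entries of 4K cover 2M in total, while 8 entries of 2M cover 16M. The large-page side reaches more memory, but only across 8 distinct pages at a time, which is why a loop streaming through more than 8 separate arrays keeps evicting entries.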
For large pages, there's libhugetlbfs, so you can use large pages via LD_PRELOAD without changing the application itself.