Application-friendly kernel interfaces

[Posted March 26, 2007 by corbet]

The "hugetlb" feature of the kernel allows applications to create and use "huge" pages in memory. These pages use a special page table mode which allows a single page table entry to provide the translation for up to 16MB of contiguous memory (on some architectures). The advantage to doing things this way is that references to the entire huge page only take up one slot in the translation lookaside buffer (TLB), and that can have good effects on performance.

Access to huge pages is through the hugetlbfs filesystem. Hugetlbfs is a virtual filesystem much like tmpfs, but with a twist: mappings of files within the filesystem use huge pages. It's not possible to do normal reads and writes from this filesystem, but it is possible to create a file, extend it, and use mmap() to map it into virtual memory. This interface gets the job done, but it's evidently a little too involved for some application programmers.

To make life simpler, Ken Chen has proposed /dev/hugetlb. This device is much like /dev/zero, except that it uses huge pages. Applications can simply open the device and use mmap() to create as much huge-paged anonymous memory as they need. The patch is simple and seemingly uncontroversial; Andrew Morton did note, though:

afaict the whole reason for this work is to provide a quick-n-easy way to get private mappings of hugetlb pages. With the emphasis on quick-n-easy.

We can do the same with hugetlbfs, but that involves (horror) "fuss".

The way to avoid "fuss" is of course to do it once, do it properly then stick it in a library which everyone uses.

He goes on to observe, however, that getting yet another library distributed widely can be a difficult task - to the point that it's easier to just add more functionality within the kernel itself. He concludes: "This comes up regularly, and it's pretty sad."
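For comparison with the hugetlbfs sequence, the "quick-n-easy" path that the proposed device would offer can be sketched roughly as follows. This is only an illustration under assumptions: the helper name map_from_device() is hypothetical, the device path is passed in rather than hard-coded, and the behavior is assumed to mirror a private mapping of /dev/zero, as the description of the patch suggests.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: map `size` bytes of private memory from a
 * /dev/zero-style device. Under the proposed patch `dev_path` would be
 * "/dev/hugetlb"; that the device behaves like /dev/zero is an assumption
 * taken from the proposal's description.
 * Returns MAP_FAILED if the device cannot be opened or mapped.
 */
void *map_from_device(const char *dev_path, size_t size)
{
    void *addr;
    int fd;

    fd = open(dev_path, O_RDWR);
    if (fd < 0)
        return MAP_FAILED;

    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

    /* The mapping stays valid after the descriptor is closed. */
    close(fd);
    return addr;
}
```

A caller would check the result against MAP_FAILED and release the memory with munmap() when done. Passing "/dev/zero" exercises the same code path with ordinary pages.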

In a separate message, Andrew talked about how kernel interfaces should be designed in general:

The fact that a kernel interface is "hard to use" really shouldn't be an issue for us, because that hardness can be addressed in libraries. Kernel interfaces should be good, and complete, and maintainable, and etcetera. If that means that they end up hard to use, well, that's not necessarily a bad thing. I'm not sure that in all cases we want to be optimising for ease-of-use just because libraries-are-hard.

In many cases, the C library fills this role by providing a more application-friendly interface to kernel calls. But there are limits to how much code even the glibc developers want to stuff into the library, and things like a friendlier huge page interface may be on the wrong side of the line. A separate library for developers trying to do obscure and advanced things with the kernel might be the right solution.

The right solution, Andrew suggests, is to have a user-space API library which is maintained as part of the kernel itself. That would keep oversight over the API and help to ensure that the library is maintained into the future while minimizing the amount of code which goes into the kernel solely for the purpose of creating friendlier interfaces. Somebody would have to step up to create and maintain that library, though; as of this writing, volunteers are in short supply.

Application-friendly kernel interfaces

Posted Mar 29, 2007 3:07 UTC (Thu) by jreiser (subscriber, #11027) [Link] (4 responses)

It's not possible to do normal reads and writes from this filesystem [hugetlbfs] ...

and that makes hugetlbfs less than a filesystem. Hugetlbfs is a hack, and it is hard to use. Hugetlbfs is so hard to use that our editor could not find an actual working example to cite. Show me the code!

Application-friendly kernel interfaces

Posted Mar 29, 2007 6:02 UTC (Thu) by ebiederm (subscriber, #35028) [Link] (3 responses)

Huh?

#define PATH_TO_HUGETLBFS "/dev/hshm"

void *map_anon_hugetlb(size_t size)
{
    char buffer[PATH_MAX];
    int fd;
    snprintf(buffer, "%s/XXXXXX", PATH_TO_HUGETLBFS);
    fd = mkstemp(buffer);
    if (fd < 0)
        return MAP_FAILED;
    unlink(buffer);
    ftruncate(fd, size);
    return mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
}

Application-friendly kernel interfaces

Posted Mar 29, 2007 15:16 UTC (Thu) by vmole (guest, #111) [Link] (2 responses)

snprintf(buffer, "%s/XXXXXX", PATH_TO_HUGETLBFS);

So much for working code... ;-)

Application-friendly kernel interfaces

Posted Mar 29, 2007 15:38 UTC (Thu) by ebiederm (subscriber, #35028) [Link] (1 responses)

Yea yea.

snprintf(buffer, sizoef(buffer), ....);

Application-friendly kernel interfaces

Posted Mar 29, 2007 16:02 UTC (Thu) by vmole (guest, #111) [Link]

Strike 2! [Steve readies for the next pitch.]

Yeah yeah, I know that code written for forums like this is at best pseudo-code. Hell, I blew it just the other day, so I'm hardly the one to be picking on you, but I was amused by the "Show me the code" - "huh?" sequence.

Perhaps we can get away with claiming "Well, it was actually a debugging test for the reader". Right, that's it.
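For readers who would rather skip the debugging test, here is a sketch of the snippet with both strikes corrected and the return values checked. Some things in it are assumptions rather than part of the original: the mount point is passed as a parameter (mount_point) instead of being hard-coded, and the caller is expected to pass a size that is a multiple of the huge page size.

```c
#define _XOPEN_SOURCE 700
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map `size` bytes of private memory backed by an unlinked file in the
 * hugetlbfs mount at `mount_point` (an assumed parameter; the snippet
 * above hard-codes "/dev/hshm"). Returns MAP_FAILED on any error.
 */
void *map_anon_hugetlb(const char *mount_point, size_t size)
{
    char path[PATH_MAX];
    void *addr;
    int fd, n;

    n = snprintf(path, sizeof(path), "%s/XXXXXX", mount_point);
    if (n < 0 || (size_t)n >= sizeof(path))
        return MAP_FAILED;      /* template would not fit */

    fd = mkstemp(path);
    if (fd < 0)
        return MAP_FAILED;      /* no such mount, or no permission */
    unlink(path);               /* the file lives on until fd and mapping go away */

    if (ftruncate(fd, (off_t)size) < 0) {
        close(fd);
        return MAP_FAILED;
    }

    addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    close(fd);                  /* the mapping keeps its own reference */
    return addr;
}
```

The same sequence works against any writable directory, which is handy for testing the logic on a machine without hugetlbfs mounted; only a hugetlbfs mount will actually hand back huge pages.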

Application-friendly kernel interfaces

Posted Mar 29, 2007 4:32 UTC (Thu) by orospakr (guest, #40684) [Link] (7 responses)

liblinux, eh?

now that's an interesting idea.

Application-friendly kernel interfaces

Posted Mar 29, 2007 8:21 UTC (Thu) by ms (subscriber, #41272) [Link] (6 responses)

I think this is a great idea. This allows for greater decoupling between glibc and the Linux kernel and is, IMHO, the proper abstraction. Plus, if the authors of the kernel interfaces are subsequently charged with writing liblinux entries then there could well be cases where the authors would rather return to the drawing board and rethink the kernel interface if it's just too damn hard to use from userspace.

liblinux

Posted Mar 29, 2007 11:30 UTC (Thu) by hummassa (guest, #307) [Link] (5 responses)

This would also permit the kernel devs to further experiment in yanking
functionality out of the kernel... things that _could_ be done in
userspace without performance penalties _should_ be done in userspace :-)
linux + liblinux would be maintained from the same source -- so they would
always be in sync -- and this would be really great.

liblinux

Posted Mar 29, 2007 11:54 UTC (Thu) by jospoortvliet (guest, #33164) [Link] (3 responses)

Indeed, it really sounds like a great idea. This way, systems like GTK/glibc and Qt/kdelibs could link to this library or even only use it when available to speed some things up, while using workarounds on other OSes like the BSDs, Solaris, Mac OS X etc.

Really?

Posted Mar 29, 2007 22:21 UTC (Thu) by man_ls (guest, #15091) [Link] (2 responses)

Do you really think it is a great idea? Pardon my lack of knowledge about kernel development, but why is it so great? I mean, you design a hard-to-use interface, then write your own code which presents a friendly interface to userspace -- and you write it in userspace. Well, why not present a friendly interface in the kernel in the first place?

Is it just because kernel->userspace interfaces are set in stone and have to be maintained forever? For that would feel a bit like medieval astronomers -- weaving layer over layer of epicycles so that their spheres would match the real planet trajectories. Here we would have a kernel interface set in stone, then some library code -- which once people use it would again be set in stone, only to add a new glue layer... again and again. Waiting a few iterations might be a better course of action, and I gather from LWN that it is often taken by kernel devs.

If the purpose of this scheme is to have a more powerful interface, I much prefer our editor's suggestion:

A separate library for developers trying to do obscure and advanced things with the kernel might be the right solution.

I have seen too many complex interfaces that nobody uses because they are so complex, and everyone uses the simplified version. Better start simple, and then add complexity as needed.

Really?

Posted Mar 30, 2007 1:16 UTC (Fri) by cpeterso (guest, #305) [Link]

Is it just because kernel->userspace interfaces are set in stone and have to be maintained forever? For that would feel a bit like medieval astronomers -- weaving layer over layer of epicycles so that their spheres would match the real planet trajectories. Here we would have a kernel interface set in stone, then some library code -- which once people use it would again be set in stone, only to add a new glue layer... again and again. Waiting a few iterations might be a better course of action, and I gather from LWN that it is often taken by kernel devs.

I think the kernel API can change, so user programs should use the "friendly" userspace library APIs.

Really?

Posted Mar 30, 2007 5:40 UTC (Fri) by IkeTo (subscriber, #2122) [Link]

> I mean, you design a hard-to-use interface, then write your own code which
> presents a friendly interface to userspace -- and you write it in
> userspace. Well, why not present a friendly interface in the kernel in the
> first place?

Perhaps the whole hugetlb thing tells one of the possible reasons. The original /dev/hshm interface is actually more general than the /dev/hugetlb interface: it allows multiple processes unrelated in ancestry to share the same huge pages. It is probably preferable for the kernel API to offer only the general interface rather than having to implement both, since every time an interface changes there has to be a "global search" for libraries/applications using it, and enough time has to be left for those libraries/applications to change (if Linus does not say "no" to the change right away). So it might be preferable to implement just the general interface, hoping that it will never change at all, and have another library "cast" it to various "more friendly" forms like the hugetlb interface. What is unclear to me is why one would expect the new library to be exempt from the global search if it needs to be changed.

I think instead of a general liblinux, we should be content with the tested solutions of, e.g., pthread (futex) and libfam (dnotify): if the functionality fits well into a general audience, the easier interface is implemented in libc, and if it does not, the easier interface is implemented in a functionality-specific library. That way, when the generic interface is changed, the kernel developers have fewer places to search for direct users of it; and the specific interface is usable (and thus relied upon) by a narrower set of end-user applications.

liblinux

Posted Apr 6, 2007 1:06 UTC (Fri) by slamb (guest, #1070) [Link]

I'm not sure about "would always be in sync". If you require that, people who have multiple kernels on their box would need some mechanism such that the correct liblinux for whatever kernel they happened to boot is dynamically loaded. Seems possible, but it's a step beyond "maintained from the same source".

This is one of those areas where the BSDs have an easier time. They do "make world", and it's just inconceivable that an actual end user would mix'n'match kernel and userspace from different versions of FreeBSD. They got away with things like top assuming layout of kernel structures and accessing /dev/kmem for a long time. On Linux, that sort of mutt system is considered normal, so stuff has to be carefully versioned.

Why add anything?

Posted Mar 30, 2007 6:16 UTC (Fri) by ncm (guest, #165) [Link] (4 responses)

Why should this need a file system, or a device, or a library at all?

It should suffice to call mmap() and ask for an anonymous chunk of 16M, and the kernel can simply recognize that a hugetlb would serve, and use it. If, later, the process unmaps pages within it, the remaining pages can be switched over to the regular mapping scheme; most processes won't. Then it would be easy, safe, and backward-compatible for libc to switch malloc over to allocating hugetlb chunks by default, benefitting everybody.

I would also like to see a flag added to mmap() to require that the mapped block be aligned to match its size; e.g. ask for 16M and the bottom 24 bits of the returned address are 0. (Anybody else remember when 68K chips shipped with only 24 address pins, and Apple stuck annotations in the top 8 bits of addresses because the hardware ignored those bits?)
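Until such a flag exists, the usual userspace workaround is to map twice as much as needed and give back the unaligned ends. The following is a minimal sketch of that trick, not a proposal for the kernel side; the name mmap_aligned() is made up for illustration, and it assumes the size is a power of two and a multiple of the base page size.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/*
 * Return an anonymous private mapping of `size` bytes whose address is a
 * multiple of `size`. `size` must be a power of two and a multiple of the
 * base page size. Returns MAP_FAILED on error.
 */
void *mmap_aligned(size_t size)
{
    uintptr_t base, start;
    size_t head;
    void *raw;

    /* Map twice the size so an aligned block must lie somewhere inside. */
    raw = mmap(NULL, 2 * size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return MAP_FAILED;

    /* Round the start up to the next multiple of size. */
    base = (uintptr_t)raw;
    start = (base + size - 1) & ~((uintptr_t)size - 1);
    head = (size_t)(start - base);

    /* Give back the unaligned head (if any) and the leftover tail. */
    if (head)
        munmap(raw, head);
    munmap((void *)(start + size), size - head);

    return (void *)start;
}
```

Asking for 16M this way returns an address whose bottom 24 bits are zero, at the cost of briefly reserving 32M of address space.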

Why add anything?

Posted Mar 30, 2007 12:59 UTC (Fri) by mjr (guest, #6979) [Link] (2 responses)

I'm wondering the same myself. I'm not much for low level hacking, but I fail to see what benefits one would reap from yet another interface.

If a separate interface is really necessary for some reason, I'd put the same functionality behind regular libc malloc(); it already does brk() for small allocations and mmap() for large ones I believe, so it could just as well do extra-large allocations via the hugetlb API. (Putting this in malloc instead of mmap would get rid of the partial-munmap issue on the libc end.)

Why add anything?

Posted Mar 30, 2007 23:19 UTC (Fri) by giraffedata (guest, #1954) [Link] (1 responses)

I can see the value of using mmap() for this, but I don't think you want mmap() guessing based on the size of the request what page size is best.

It's quite possible that 16M of memory will consist of 100 scattered 4K pages of working set and the rest rarely used or even vacant. You wouldn't want to page the whole 16M in and out in that case.

Page granularity seems like a perfectly sensible parameter of an mmap, though.

Why add anything?

Posted Apr 5, 2007 14:15 UTC (Thu) by farnz (subscriber, #17727) [Link]

Might be worth looking at Linux-mm.org on huge pages. In particular, there's a link to an LWN article on transparent use of huge pages. The "holy grail" is very definitely transparent use, so that whenever possible, all applications gain; anything that makes it easier to move that way is helpful.

One thought; if your mmap parameter is simply a hint that the block will be used in a particular granularity, it's easy to implement. Current mmap sets the parameter to 1 byte (no granularity needed), unless mmaping in hugetlbfs pages, when it sets the parameter to (e.g.) 16M. The kernel then just rounds up to the next highest available page size when possible, or down if not.

Why add anything?

Posted Apr 5, 2007 18:57 UTC (Thu) by joib (subscriber, #8541) [Link]

One problem is that the number of large page TLB entries is quite limited. E.g. on current Opterons, while you have a 512-entry data TLB for the normal 4K pages, for the 2M large pages you only have 8 entries. So if you have a loop kernel reading/writing from more than 8 big arrays you're going to have TLB thrashing.

I would presume that for non-HPC applications these non-streaming, irregular access patterns are even more common. Though supposedly AMD is fixing this issue with the upcoming 'Barcelona' by having 128 2M TLB entries, and additionally supporting 1G pages (don't know how many TLB entries for those).

For comparison, the Intel Woodcrest has 256 4K and 32(?) 2M TLB entries.
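To put numbers on the trade-off, here is a small sketch of the arithmetic. Both helper names are invented for illustration, and the model is deliberately crude: it assumes every TLB entry maps a distinct page and that each concurrently accessed array needs at least one entry of its own.

```c
#include <stdint.h>

/* Bytes of address space covered when every TLB entry maps a distinct page. */
uint64_t tlb_reach_bytes(uint64_t entries, uint64_t page_size)
{
    return entries * page_size;
}

/*
 * Crude check for the streaming case: each array touched in the loop
 * needs at least one TLB entry, so more streams than entries means
 * entries keep evicting each other.
 */
int streams_fit(uint64_t streams, uint64_t entries)
{
    return streams <= entries;
}
```

With the Opteron figures above, 512 entries of 4K cover 2M of address space and 8 entries of 2M cover 16M, so the large pages win on reach; but a loop over nine big arrays fits in 512 entries and not in 8. The 128 entries of 2M promised for Barcelona would cover 256M.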

Application-friendly kernel interfaces

Posted Apr 5, 2007 18:38 UTC (Thu) by joib (subscriber, #8541) [Link]

For large pages, there's libhugetlbfs, so you can use large pages via LD_PRELOAD without changing the application itself.