Tags and IDs
Solid-state storage devices will only achieve their claimed I/O rates if the kernel issues many I/O operations in parallel. That allows the device to execute the requests in an optimal order and to exploit the parallelism inherent in having multiple banks of flash storage. If the kernel is not to get confused, though, there must be a way for the device to report the status of specific operations to the kernel; that is done by assigning a tag (a small integer value) to each request. Once that is done, the device can report that, say, request #42 completed, and the kernel will know which operation is done.
If the device is handling vast numbers of operations per second, the kernel will somehow have to come up with an equal number of tags. That suggests that tag allocation must be a fast operation; even a small amount of overhead starts to really hurt when it is repeated millions of times every second. To that end, Kent Overstreet has proposed the merging of a per-CPU tag allocator, a new module with a simple task: allocate unique integers within a given range as quickly as possible.
The interface is relatively straightforward. A "tag pool," from which tags will be allocated, can be declared this way:
#include <linux/percpu-tags.h> struct percpu_tag_pool pool;
Initialization is then done with:
int percpu_tag_pool_init(struct percpu_tag_pool *pool, unsigned long nr_tags);
where nr_tags is the number of tags to be contained within the pool. Upon successful initialization, zero will be returned to the caller.
The actual allocation and freeing of tags is managed with:
unsigned percpu_tag_alloc(struct percpu_tag_pool *pool, gfp_t gfp); void percpu_tag_free(struct percpu_tag_pool *pool, unsigned tag);
A call to percpu_tag_alloc() will allocate a tag from the given pool. The only use for the gfp argument is to be checked for the __GFP_WAIT flag; if (and only if) that flag is present, the function will wait for an available tag if need be. The return value is the allocated tag, or TAG_FAIL if no allocation is possible.
The implementation works by maintaining a set of per-CPU lists of available tags; whenever possible, percpu_tag_alloc() will simply take the first available entry from the local list, avoiding contention with other CPUs. Failing that, it will fall back to a global list of tags, moving a batch of tags to the appropriate per-CPU list. Should the global list be empty, percpu_tag_alloc() will attempt to steal some tags from another CPU or, in the worst case, either wait for an available tag or return TAG_FAIL. Most of the time, with luck, tag allocation and freeing operations can be handled entirely locally, with no contention or cache line bouncing issues.
The attentive reader might well be thinking that the API proposed here looks an awful lot like the IDR subsystem, which also exists to allocate unique integer identifiers. That is where the bulk of the complaints came from; Andrew Morton, in particular, was unhappy that no apparent attempt had been made to adapt IDR before launching into a new implementation:
The best outcome is that idr.c gets improved and we don't have to merge duplicative code.
So please, let's put aside the shiny new thing for now and work out how we can use the existing tag allocator for these applications. If we make a genuine effort to do this and decide that it's fundamentally hopeless then this is the time to start looking at new implementations.
The responses from Kent (and from Tejun Heo as well) conveyed their belief that IDR is, indeed, fundamentally hopeless for this use case. The IDR code is designed for the allocation of identifiers, so it works a little differently: the lowest available number is always returned and the number range is expanded as needed. The lowest-number guarantee, in particular, forces a certain amount of cross-CPU data sharing, putting a limit on how scalable the IDR code can be. The IDR API also supports storing (and quickly looking up) a pointer value associated with each ID, a functionality not needed by users of tags. As Tejun put it, even if the two allocators were somehow combined, there would still need to be two distinct ways of using it, one with allocation ordering guarantees, and one for scalability.
Andrew proved hard to convince, though; he suggested that, perhaps, tag
allocation could be implemented as some sort of caching layer on top of
IDR. His position appeared to soften a bit, though, when Tejun pointed out that the I/O stack already has
several tag-allocation implementations, "and most, if not all,
suck
". The per-CPU tag allocator could replace those
implementations with common code, reducing the amount of duplication rather
than increasing it. Improvements of that sort can work wonders when it
comes to getting patches accepted.
Things then took another twist when Kent posted a rewrite of the IDA module as the basis for a new attempt. "IDA" is a variant of IDR that lacks the ability to store pointers associated with IDs; it uses many of the IDR data structures but does so in a way that is more space-efficient. Kent's rewrite turns IDA into a separate layer, with the eventual plan of rewriting IDR to sit on top. Before doing that, though, he implemented a new per-CPU ID allocator implementing the API described above on top of the new IDA code. The end result should be what Andrew was asking for: a single subsystem for the allocation of integer IDs that accommodates all of the known use cases.
All this may seem like an excessive amount of discussion around the merging
of a small bit of clearly-useful code that cannot possibly cause bugs
elsewhere in the kernel. But if there is one thing that the community has
learned over the years, it's that kernel developers are far less scalable
than the kernel itself. Duplicated code leads to inferior APIs, more bugs,
and more work for developers. So it's worth putting some effort into
avoiding the merging of duplicated functionality; it is work that will pay
off in the long term — and the kernel community is expecting to be around
and maintaining the code for a long time.
Index entries for this article | |
---|---|
Kernel | Block layer |
Kernel | idr |