Rationalizing scatter/gather chains

By Jonathan Corbet
December 28, 2007

The chained scatterlist API was arguably the most disruptive addition to 2.6.24, despite being a relatively small amount of code. This API allows kernel code to chain together scatter/gather lists for DMA I/O operations, resulting in a much larger maximum size for those operations. That, in turn, leads to better performance, especially in the block I/O subsystem. The idea of scatterlist chaining is generally popular, but there have been some complaints about the current implementation. As things stand, any code wanting to work with chained scatterlists must construct the chains itself - an error-prone operation. So there is interest in making things better.

One approach to improving the situation is the sg_ring API, proposed by Rusty Russell. This patch does away with the current chaining approach; there are no more scatterlist entries which are actually chain pointers in disguise. Instead, Rusty introduces struct sg_ring:

    struct sg_ring
    {
	struct list_head list;
	unsigned int num, max;
	struct scatterlist sg[0];
    };

The obvious change here is that the chaining has been moved out of the scatterlist itself and made into an explicit linked list. There are also variables tracking the current and maximum sizes of the list, which help reduce explicit housekeeping elsewhere. Some versions of the patch also add an integer dma_num field to hold the number of mapped scatter/gather entries, which can differ from the number initially set up by the driver.

An sg_ring with a given number of scatterlist entries can be declared with this macro:

    DECLARE_SG_RING(name, max);

A ring should then be initialized with one of:

    void sg_ring_init(struct sg_ring *ring, unsigned int max);
    void sg_ring_single(struct sg_ring *ring, const void *buf,
                        unsigned int buflen);

The latter form is a shortcut for cases where a single-entry ring needs to be set up with a given buffer.

Constructing a multi-entry ring is a matter of allocating as many separate sg_ring entries as needed and explicitly chaining them together using the list field. There is a helper macro for stepping through all of the entries in a ring while hiding the boundaries between the individual scatterlists:

    struct sg_ring *sg;
    int i;

    sg_ring_for_each(ring, sg, i) {
	/* *sg is the scatterlist entry to operate on */
    }

Rusty has posted patches converting parts of the SCSI subsystem over to this API. As he points out, the conversion removes a fair amount of logic associated with the construction and destruction of large scatterlists.

Jens Axboe, the creator of the chained scatterlist code, has responded that the current API was aimed at minimizing the effect on drivers for 2.6.24. It is not, he says, a finished product, and things are getting better. A look at his git repository shows some API additions with a very similar goal to Rusty's work.

Jens's work retains the current chaining mechanism, but wraps a structure and some helpers around it to make it easier to work with. So, in this view of the world, drivers will work with struct sg_table:

    struct sg_table {
	struct scatterlist *sgl;        /* the list */
       	unsigned int nents;             /* number of mapped entries */
       	unsigned int orig_nents;        /* original size of list */
    };

An sg_table will be set up with:

    int sg_alloc_table(struct sg_table *table, unsigned int nents,
                       gfp_t gfp_flags);

This function does not allocate the sg_table structure, which must be passed in as a parameter. It does, however, allocate the memory to use for the actual scatterlist arrays and deal with the process of chaining them all together. So a driver needing to construct a large scatter/gather operation can now just do a single sg_alloc_table() call, then iterate through the list of scatterlist entries in the usual way. When the operation is complete, a call to

    void sg_free_table(struct sg_table *table);

will free the allocated memory.

Sometime around the opening of the 2.6.25, a decision will have to be made on the direction of the chained scatterlist API. It may not be one of the most closely-watched kernel development events ever, but this decision will affect how high-performance I/O code is written in the future. As the author of the current chaining code, Jens probably starts with an advantage when it comes to getting his code merged. The nature of kernel development is such that nobody can ever be entirely sure, though; if a consensus builds that Rusty's approach is better, that is the way things will probably go. Stay tuned through the next merge window for the thrilling conclusion to this ongoing story.

Index entries for this article
Kernel	Direct memory access
Kernel	Scatter/gather chaining