Async I/O [LWN.net]

Async I/O

Posted Apr 7, 2009 19:34 UTC (Tue) by mjthayer (guest, #39183) [Link] (3 responses)

At least traditionally, I think that AIO has been much more scalable than threaded I/O, not least because you don't need to keep a stack around for each thread. And some people prefer to keep everything on a single thread, because they don't want to deal with multi-threading issues, although I don't know if they are the ones people are interested in.

They are

Posted Apr 7, 2009 20:35 UTC (Tue) by khim (subscriber, #9252) [Link] (2 responses)

Think NGINX. If you have 50'000 clients connected at the same time and your box has 100 separate disks with content (not an unrealistic example - NGINX does have support for such extreme conditions) then AIO is suddenly much faster than threaded I/O and MUCH less resource-hungry.

Why does AIO need to be supported at the kernel level?

Posted Apr 8, 2009 1:16 UTC (Wed) by xoddam (subscriber, #2322) [Link] (1 responses)

Even if only operations that *need* to be asynchronous require kernel threads, and even if they are recycled instead of being created and destroyed on demand, that will still be quite expensive at the kernel level.

If one thread per outstanding operation or per client is too many, there are good userspace thread pool implementations that dedicate a few threads to waiting for IO completions whilst others get on with whatever work can proceed immediately.

I'm not convinced that pushing the thread pool down into the kernel is a performance win.

The Linux thread implementation chose for very good reasons to stick to a 1:1 relationship between userspace and kernel threads: it's because the job of multiplexing application tasks to a smaller number of system threads is hard to do in a generic way. All the choices are best made by the application developer, therefore thread pool implementations belong in userspace.

I don't really see the point of supporting POSIX signal-driven AIO at the kernel level if the implementation uses threads and sits on top of the existing synchronous IO. A userspace library could do it just as reliably using select() and kill(), for those few applications that insist on the POSIX AIO interface for whatever reason.

That said, the kernel handles asynchronous events all the time. Why exactly is it so hard to let userspace handle them asynchronously too at a low level, without going through the synchronous layer?

I think the idea is to support AIO for all objects with a single API

Posted Apr 8, 2009 12:46 UTC (Wed) by khim (subscriber, #9252) [Link]

While it's very important to have "light AIO" for some things (like TCP sockets), it's not so important to have it for other things (like in-memory pipes). If you have everything in the kernel you can implement some things with threads and others without threads, and userspace does not care. With a userspace library any change requires an ABI change - and that's PAIN...

Async I/O

Posted Apr 7, 2009 20:15 UTC (Tue) by lmb (subscriber, #39048) [Link] (1 responses)

Well, IO in the kernel is async anyway. This is about defining a usable user-space interface for it.

An event-driven FSA would benefit greatly from this; not everyone buys into multi-threaded paradigms. For some scenarios, this would make it possible to simplify the user-space implementation significantly.

Async I/O

Posted Apr 20, 2009 9:59 UTC (Mon) by forthy (guest, #1525) [Link]

I don't understand why there was so much objection to the syslets - send the kernel a bunch of "IO instructions", and let it execute those asynchronously. Passing active messages (that's what it is) is a good idea, anyway; especially for networks like NFS4, where each "kernel call" is quite heavy. Syslets would scale a lot better (lower load, fewer context switches) than synchronous IO. Active message systems often had problems with programmers who did not understand them (like Display Postscript), so I guess this problem comes up again. It is not just a quality of implementation issue, it is a fundamental quality of understanding issue.

This overall doesn't sound good. With Ted Ts'o, it's even worse: he doesn't get it. It is not an option for a "safe" filesystem, which already takes a performance penalty by maintaining a journal, to corrupt data. It is an option to delay writing, and in effect it is not the 5-second update in ext3 that solved the problem, it is ordered writing. From an application writer's point of view, this is a quality of implementation issue, but when I read the arguments, it's again an understanding problem. I'm concerned; maybe it is that those hard-core Linux hackers have been there for 20 years and are still sticking to 90s state-of-the-art?

Async I/O

Posted Apr 7, 2009 22:00 UTC (Tue) by dankamongmen (subscriber, #35141) [Link]

Resource utilization is maximized when each processing element (think core) executes useful work, until no more useful work remains. Starting a thread/process has overhead. Context switching between threads on a processor has overhead. Spawning and switching between threads is not useful work.

High-performance servers want to spawn a thread per allocated core, and have each thread fully exercising that CPU. That's why AIO can/must beat synchronous I/O (blocking or non; that's immaterial here) -- your thread can go on managing events (of course, if the CPU is necessary for the AIO to be performed, your thread won't run anyway, but the CPU can sometimes be avoided).

Async I/O

Posted Apr 8, 2009 0:29 UTC (Wed) by bojan (subscriber, #14302) [Link] (22 responses)

One could also solve the problem that fbarrier() is meant to solve by using aio_fsync(). If aio_fsync() was designed to wait for the regular kernel commit interval in laptop mode (which it could, because there are no guarantees when it's going to complete), application writers could just use that to provide safe rename. The actual rename would be called from signal handler or the thread created when aio_fsync() completes.

In other words, no new calls would need to be introduced into the kernel, the apps would be portable and safe, interactivity would be preserved (this is an async interface), one would not need to use extra threads (signal can be delivered instead) and disk would not need to spin up to commit the file right away.

At the same time, regular fsync() would still mean "really commit now", so databases and similar software could use it safely even in laptop mode.

Async I/O

Posted Apr 8, 2009 4:19 UTC (Wed) by jamesh (guest, #1159) [Link] (21 responses)

Your aio_fsync() suggestion would give a significantly different result to fbarrier().

Consider a function that wrote a file and renamed it using the hypothetical fbarrier():

fp = open("filename.tmp")
write(fp, "data", length)
fbarrier(fp)
close(fp)
rename("filename.tmp", "filename")

When this function completes, any process reading the file will get the new contents. The changes to the underlying block device could be delayed, but if there is a crash "filename" should either give the old contents or the new.

Using aio_fsync() as you suggest would keep the same crash resilience behaviour, but would provide entirely different runtime behaviour. As the signal would likely be delivered after the function returns, the rename won't have happened at that point. So a read of "filename" will return the old file contents for some unknown period of time after the function returns.
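
For comparison, here is a minimal sketch of the portable form of this pattern, with fsync() standing in for the hypothetical fbarrier(). The helper name replace_file() and the ".tmp" suffix are assumptions made for illustration. The runtime behaviour matches the fbarrier() version (the rename has happened by the time the function returns), but fsync() blocks until the data is on disk, which is the cost fbarrier() is meant to avoid.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: replace `path` with `len` bytes of `data`.
 * Returns 0 on success, -1 on failure. */
int replace_file(const char *path, const char *data, size_t len)
{
    assert(path != NULL && data != NULL);

    char tmp[4096];
    int n = snprintf(tmp, sizeof(tmp), "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof(tmp))
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    /* write() may be short, so loop until everything is written */
    size_t off = 0;
    while (off < len) {
        ssize_t w = write(fd, data + off, len - off);
        if (w == -1) { close(fd); unlink(tmp); return -1; }
        off += (size_t)w;
    }

    /* fsync() stands in for fbarrier(): data is on disk before the rename */
    if (fsync(fd) == -1) { close(fd); unlink(tmp); return -1; }
    if (close(fd) == -1) { unlink(tmp); return -1; }

    /* readers see either the old or the new contents, never a mix */
    if (rename(tmp, path) == -1) { unlink(tmp); return -1; }
    return 0;
}
```

A caller would use it as replace_file("filename", "data", 4); any read of "filename" after it returns sees the new contents.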

Async I/O

Posted Apr 8, 2009 4:51 UTC (Wed) by butlerm (subscriber, #13312) [Link] (1 responses)

I am sorry, but that appears to be completely wrong. The base POSIX
specification allows a null implementation of fsync. If one is not
concerned about a system crash or unclean shutdown, there is no need to
call fsync, aio_fsync, fbarrier or any other comparable function.

Async I/O

Posted Apr 8, 2009 4:56 UTC (Wed) by bojan (subscriber, #14302) [Link]

This is also true, of course. If the docs say you need to restore after a crash, then open(), write(), close(), rename() is just fine.

Async I/O

Posted Apr 8, 2009 4:54 UTC (Wed) by bojan (subscriber, #14302) [Link]

True.

That's why you'd have to hang on to your config in buffers until last unique temp file has overwritten the actual file, which is something that programs like gconfd can do easily. At that point, the buffers would get dumped and read of the real config file would be required next time.

Normally, we are talking about a few seconds to maybe half a minute of such behaviour here (i.e. either the amount of time it takes to finish an immediate fsync or the next regular kernel commit). Programs that overwrite the same config file many, many times within a half-minute period are really broken, so this should generally not be an issue.

PS. The point of this whole thing with aio_fsync() is to show that there can be many different approaches to address this issue. Sure, it would require more sophisticated code, but it can be done. If we had inotify with an IN_SYNC event, we could use that too in userland to play with backup files and achieve the desired result (and in that case, a read of the renamed file would always give the latest config - instead programs would have to rename foo~ into foo if they found one at startup, which would signal a crash).

PPS. As you probably noticed from my previous posts regarding ext4/POSIX, I'd be very interested to have fbarrier(), because I think we need to have a clean, new way of saying this through an API.

Async I/O

Posted Apr 8, 2009 4:55 UTC (Wed) by butlerm (subscriber, #13312) [Link]

Of course using aio_fsync has lots of other problems...complexity and
overhead mostly. Either way, the rename is atomic and immediately visible
to all user processes. What it isn't is necessarily durable, or safe.

Async I/O

Posted Apr 8, 2009 5:02 UTC (Wed) by butlerm (subscriber, #13312) [Link] (16 responses)

I need to read more carefully. Jamesh is exactly right with regard to the
problems of the solution suggested by the parent poster, namely delaying
the rename until after the aio_fsync has completed. Aside from the
complexity and overhead issues, it has completely different user visible
semantics. That is fine if no other process needs to read the new version
of the file in the meantime, otherwise it is problematic. fbarrier would
be a much cleaner solution.

Async I/O

Posted Apr 8, 2009 6:02 UTC (Wed) by bojan (subscriber, #14302) [Link] (15 responses)

> fbarrier would be a much cleaner solution.

Absolutely. No question about it. It's just that Linus is not keen on having it, so that got me thinking as to how the same can be done without a new call. Of course, many of the thoughts are, shall we say, ill-conceived... :-)

> it has completely different user visible semantics

One could also do this:

1. See if "foo~" exists.
2. If it doesn't, do link("foo","foo~") (i.e. create "backup").
3. Open "foo".
4. Read "foo".
5. Open/create/truncate "foo.new".
6. Write into "foo.new".
7. Call aio_fsync() on "foo.new". <-- doesn't block
8. Close "foo.new".
9. Rename "foo.new" into "foo".

In signal handler/thread created on sigevent do:

1. Unlink "foo~".

Then you get full rename semantics with an always up-to-date file (i.e. what I was getting at with the inotify example). However, your app then needs to check at startup if "foo~" exists (which means you crashed before the signal handler/thread unlinked your backup) and if it does, rename it to "foo". Then, continue.

Async I/O

Posted Apr 8, 2009 6:17 UTC (Wed) by bojan (subscriber, #14302) [Link]

> 1. Unlink "foo~".

That is when signal/thread counter reached zero.

Async I/O

Posted Apr 8, 2009 7:40 UTC (Wed) by bojan (subscriber, #14302) [Link] (3 responses)

Quick and dirty - probably has more bugs than lines, but you'll get the picture.

Compile and link with: gcc -Wall -O2 -g -o a a.c -lrt

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <signal.h>
#include <aio.h>

#define BUF_SIZE 50

/* modified in the signal handler, so it must be volatile sig_atomic_t */
static volatile sig_atomic_t count=0;

void whack(int signum,siginfo_t *info,void *context){
  if(!--count)
    unlink("foo~");
}

int main(int argc,char **argv){
  int fd;
  ssize_t rl;
  char *buf=malloc(BUF_SIZE);
  struct aiocb *cb=calloc(1,sizeof(*cb));
  struct sigevent *se=calloc(1,sizeof(*se));
  struct sigaction *act=calloc(1,sizeof(*act));

  /* XXX this is just a demo, no error checking */

  /* AIO control block defaults */
  cb->aio_sigevent.sigev_notify=SIGEV_SIGNAL;
  cb->aio_sigevent.sigev_signo=SIGRTMIN;

  /* signal handler */
  act->sa_flags=SA_SIGINFO;
  act->sa_sigaction=whack;
  sigaction(SIGRTMIN,act,NULL);

  /* see if foo~ exists and restore */
  if(!access("foo~",F_OK|R_OK|W_OK))
    rename("foo~","foo"); 

  /* back it up if required */
  if(access("foo~",F_OK|R_OK|W_OK))
    link("foo","foo~");

  /* read existing file */
  fd=open("foo",O_RDONLY);
  rl=read(fd,buf,BUF_SIZE);
  close(fd);

  /* write to new file and initiate sync */
  fd=open("foo.new",O_WRONLY|O_CREAT|O_TRUNC,S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH);
  write(fd,buf,rl);
  cb->aio_fildes=fd;
  count++;
  aio_fsync(O_SYNC,cb);
  close(fd);

  /* rename new file into the existing one */
  rename("foo.new","foo");

  /* wait for the completion signal so that whack() can unlink foo~ */
  while(count)
    sleep(1);

  free(act);
  free(se);
  free(cb);
  free(buf);

  return 0;
}

Async I/O

Posted Apr 8, 2009 8:03 UTC (Wed) by bojan (subscriber, #14302) [Link]

Of course, this won't work for two programs reading/modifying the same config file.

Why all the runtime allocations? (off-topic)

Posted Apr 9, 2009 4:17 UTC (Thu) by pr1268 (subscriber, #24648) [Link] (1 responses)

This is going way off-topic, but why all the runtime allocations in your sample program? Malloc(3), calloc(3), and free(3) are horribly expensive, relatively speaking. Automatic/static storage for those structs and that char buffer would be substantially faster.

Why all the runtime allocations? (off-topic)

Posted Apr 9, 2009 6:14 UTC (Thu) by bojan (subscriber, #14302) [Link]

Just being lazy to call memset(), I guess...
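
For what it's worth, zero-initialised automatic storage needs neither calloc() nor memset(). A minimal sketch, assuming a hypothetical helper make_fsync_cb() that builds the control block by value; the caller keeps the result in a local variable that must stay alive (and at the same address) until the request completes.

```c
#include <aio.h>
#include <assert.h>
#include <signal.h>

/* Hypothetical helper: build a control block for aio_fsync() by value. */
struct aiocb make_fsync_cb(int fd, int signo)
{
    assert(signo > 0);

    struct aiocb cb = {0};  /* every member zeroed, no calloc()/memset() */
    cb.aio_fildes = fd;
    cb.aio_sigevent.sigev_notify = SIGEV_SIGNAL;
    cb.aio_sigevent.sigev_signo = signo;
    return cb;
}
```

In the demo above that would become "struct aiocb cb = make_fsync_cb(fd, SIGRTMIN);" followed by "aio_fsync(O_SYNC, &cb);".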

Async I/O

Posted Apr 8, 2009 8:30 UTC (Wed) by jamesh (guest, #1159) [Link] (1 responses)

With this proposal, you're now in a situation where every reader of the file needs to participate in the crash recovery, rather than having the writer ensure that the file is written correctly.

Also, readers would need to differentiate between the case of "foo~" existing because the system crashed and "foo~" existing because some other process is in the process of replacing "foo" and waiting on the fsync.

Async I/O

Posted Apr 8, 2009 11:02 UTC (Wed) by bojan (subscriber, #14302) [Link]

As I said, this will work for one process only. With Gnome, at least, this is the case.

Async I/O

Posted Apr 10, 2009 4:53 UTC (Fri) by butlerm (subscriber, #13312) [Link] (7 responses)

fbarrier(fd) is primarily useful to ask a filesystem to do what it ought to
be doing already. There is a relatively simple solution to this that I
have mentioned a few times that is applicable to virtually any journalled
filesystem that has none of the performance cost of falling back to
data=ordered mode every time someone wants to do a rename replacement.

That is what ext4 (and apparently XFS) do in data=writeback mode when
rename safety is enabled - force all the data for the file to be renamed to
disk before the next metadata transaction can complete. That means that
*every* outstanding fsync operation is delayed while your multi-gigabyte
ISO file finishes being committed to disk.

This solution, as it turns out, is very similar to the practice of keeping
tilde files. It is just that the filesystem does it automatically and
invisibly, restoring the old version on recovery whenever the new version
didn't finish getting committed to disk. No threads, signal handlers, etc.
required. No problems with multiple process access. No application level
code to figure out whether a version is corrupt. No browser freeze ups.
Rename undo is the way to avoid all that, with little or no performance
cost.

Async I/O

Posted Apr 10, 2009 5:26 UTC (Fri) by bojan (subscriber, #14302) [Link] (6 responses)

> No problems with multiple process access.

Not entirely true, actually. Imagine two processes reading the same file "foo". After they read it, they make the changes in memory, write them out to "foo.new" and then rename into "foo". Which changes will persist? From the first or the second process?

You have to have some kind of synchronisation to do this (flock(), semaphore etc.). Which can also be applied to the example with "foo~" files to sync access. That's why Gnome has a daemon (i.e. single process) to manage all these changes.

PS. Of course, fbarrier() is still a much better solution, cleaner etc., but you cannot just say that multiple processes can do this as they please.
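
The synchronisation point can be sketched with flock(). This is a minimal illustration under stated assumptions: the helper name locked_increment(), the ".new" suffix and the separate lock file are all made up for the example. The lock lives on its own file because rename() replaces the inode of "foo", so a lock taken on "foo" itself would not protect the next writer. It only addresses lost updates between processes, not crash safety.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical helper: increment the decimal counter stored in `path`,
 * serialised through an exclusive flock() on `lockpath`.
 * Returns the new value, or -1 on failure. */
long locked_increment(const char *path, const char *lockpath)
{
    assert(path != NULL && lockpath != NULL);

    int lfd = open(lockpath, O_RDWR | O_CREAT, 0644);
    if (lfd == -1)
        return -1;
    if (flock(lfd, LOCK_EX) == -1) { close(lfd); return -1; }

    /* critical section: read, modify, write out, rename */
    long value = 0;
    FILE *in = fopen(path, "r");
    if (in) {
        if (fscanf(in, "%ld", &value) != 1)
            value = 0;
        fclose(in);
    }
    value++;

    char tmp[4096];
    long result = -1;
    int n = snprintf(tmp, sizeof(tmp), "%s.new", path);
    if (n > 0 && (size_t)n < sizeof(tmp)) {
        FILE *out = fopen(tmp, "w");
        if (out) {
            int ok = fprintf(out, "%ld\n", value) > 0;
            ok = (fclose(out) == 0) && ok;
            if (ok && rename(tmp, path) == 0)
                result = value;
            else
                unlink(tmp);
        }
    }

    /* end critical section */
    flock(lfd, LOCK_UN);
    close(lfd);
    return result;
}
```

Two processes calling locked_increment("foo", "foo.lock") concurrently then each see the other's update instead of overwriting it.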

Kernel based rename undo

Posted Apr 10, 2009 5:49 UTC (Fri) by butlerm (subscriber, #13312) [Link] (5 responses)

I am talking about a kernel based solution, where in the case of a rename
replacement an undo entry is placed in the journal and the old inode is
kept around until the new inode's data is committed to disk. Then in the
case of an unclean shutdown the filesystem recovery process rolls forward
using the journal and uses the undo entries in the journal to build a rename
undo candidate list. When the journal redo is complete, the filesystem
then uses the rename undo list to undo the rename replacements whenever the
replacement inode's data was not committed before the system crashed.

Ext4 does a certain amount of comparable undo already - if the replacement
file was not committed to disk, the allocated blocks are freed and the
filesystem truncates the file. What I suggest is not much more complicated
than that.

Kernel based rename undo

Posted Apr 10, 2009 7:35 UTC (Fri) by bojan (subscriber, #14302) [Link] (4 responses)

I know. What I'm talking about is synchronisation between processes in terms of contents of data (i.e. one process may write a change, which gets lost when another process does the same - your stock race). So, you cannot just open(), write(), close(), rename() with multiple processes. You have to lock, otherwise your processes will stomp all over each other's data.

An example of doing the same with multiple processes when kernel doesn't guarantee data before metadata on rename is below. Bugs included, of course ;-).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <aio.h>

#define BUF_SIZE 50

static int *count=NULL;

/* XXX this is just a demo, no error checking */

static void whack(int signum,siginfo_t *info,void *context){
  int sd=*(int*)info->si_value.sival_ptr;

  /* critical section */
  lockf(sd,F_LOCK,0);

  if(!--(*count))
    unlink("foo~");

  /* end critical section */
  lockf(sd,F_ULOCK,0);
}

int main(int argc,char **argv){
  int sd,fd;
  ssize_t len;
  char buf[BUF_SIZE];
  struct aiocb cb;
  const struct aiocb *cbl[]={&cb};
  struct sigaction act;

  /* AIO control block setup */
  memset(&cb,0,sizeof(cb));
  cb.aio_sigevent.sigev_notify=SIGEV_SIGNAL;
  cb.aio_sigevent.sigev_signo=SIGRTMIN;
  cb.aio_sigevent.sigev_value.sival_ptr=&sd;

  /* signal handler setup */
  memset(&act,0,sizeof(act));
  act.sa_flags=SA_SIGINFO;
  act.sa_sigaction=whack;
  sigaction(SIGRTMIN,&act,NULL);

  /* setup shared counter, restore */
  if((sd=shm_open("foo",O_RDWR|O_CREAT|O_EXCL,S_IRUSR|S_IWUSR))==-1){
    int tries=20;
    struct stat s;

    /* not the first to arrive, open and wait for counter to be written */
    sd=shm_open("foo",O_RDWR,S_IRUSR|S_IWUSR);
    fstat(sd,&s);
    while(tries-- && s.st_size<sizeof(*count)){
      sleep(1);
      fstat(sd,&s);
    }

    /* something's really screwed: the counter never appeared */
    if(s.st_size<sizeof(*count))
      return 1;
  } else{ /* first to arrive, restore */
    int count=0; /* filler */

    /* don't care if we fail */
    if(!rename("foo~","foo"))
      fprintf(stderr,"Restored.\n");

    write(sd,&count,sizeof(count));
  }

  /* shared counter */
  count=mmap(NULL,sizeof(int),PROT_READ|PROT_WRITE,MAP_SHARED,sd,0);

  /* critical section */
  lockf(sd,F_LOCK,0);

  /* don't care if it fails - already there */
  link("foo","foo~");

  /* read existing file */
  fd=open("foo",O_RDONLY);
  len=read(fd,buf,BUF_SIZE);
  close(fd);

  /* write to new file and initiate sync */
  fd=open("foo.new",O_WRONLY|O_CREAT|O_TRUNC,S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH);
  write(fd,buf,len);
  cb.aio_fildes=fd;
  (*count)++;
  aio_fsync(O_SYNC,&cb);
  close(fd);

  /* put the new file in place */
  rename("foo.new","foo");

  /* end critical section */
  lockf(sd,F_ULOCK,0);

  /* do something really useful here */

  /* wait for AIO completion */
  aio_suspend(cbl,1,NULL);

  /* clean up shared memory */
  munmap(count,sizeof(int));
  close(sd);

  return 0;
}

Kernel based rename undo

Posted Apr 10, 2009 15:34 UTC (Fri) by butlerm (subscriber, #13312) [Link]

If you want multiple writers, you definitely need locking. I was referring
to single writer / multiple readers, which is a far more common situation.

Kernel based rename undo

Posted Apr 11, 2009 10:22 UTC (Sat) by bojan (subscriber, #14302) [Link]

BTW, aio_suspend() has no effect on aio_fsync(). That, for sure, is a bug.
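
Whatever aio_suspend() does here, completion can also be detected by polling aio_error(), which reports EINPROGRESS until the request has finished. A minimal sketch, assuming a hypothetical helper aio_fsync_wait() and an arbitrary 10 ms polling interval (older glibc needs -lrt at link time):

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: start aio_fsync() on fd and poll until it finishes.
 * Returns 0 on success, -1 with errno set on failure. */
int aio_fsync_wait(int fd)
{
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* no signal, we poll */

    if (aio_fsync(O_SYNC, &cb) == -1)
        return -1;

    /* aio_error() returns EINPROGRESS while the request is outstanding */
    struct timespec nap = { 0, 10 * 1000 * 1000 };  /* 10 ms, arbitrary */
    int err;
    while ((err = aio_error(&cb)) == EINPROGRESS)
        nanosleep(&nap, NULL);

    if (err == -1)
        return -1;          /* aio_error() itself failed, errno is set */
    if (err != 0) {
        errno = err;        /* the sync failed with this error */
        return -1;
    }

    /* aio_return() collects the final status exactly once */
    return aio_return(&cb) == 0 ? 0 : -1;
}
```

This blocks the caller, so it only replaces the aio_suspend() call at the end of the demo; the signal-driven path is unchanged.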

Kernel based rename undo

Posted Apr 11, 2009 12:06 UTC (Sat) by bojan (subscriber, #14302) [Link] (1 responses)

The lockf() would also land in trouble with the mmap().

Kernel based rename undo

Posted Apr 12, 2009 4:52 UTC (Sun) by bojan (subscriber, #14302) [Link]

A more robust version below:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <aio.h>
#include <errno.h>

/* XXX this is just a demo, no error checking */

static int sd=-1;
static volatile sig_atomic_t count=0; /* modified in the signal handler */
static char filler[2]={0,0};

/* locks */
static struct flock 
  fwl={.l_type=F_WRLCK,.l_whence=SEEK_SET,.l_start=0,.l_len=1},
  ful={.l_type=F_UNLCK,.l_whence=SEEK_SET,.l_start=0,.l_len=1},
  bwl={.l_type=F_WRLCK,.l_whence=SEEK_SET,.l_start=1,.l_len=1},
  brl={.l_type=F_RDLCK,.l_whence=SEEK_SET,.l_start=1,.l_len=1},
  bul={.l_type=F_UNLCK,.l_whence=SEEK_SET,.l_start=1,.l_len=1};

static void aiodone(int signum,siginfo_t *info,void *context){
  /* signal counter down */
  count--;
}

#define BUF_SIZE 50

static void config(struct aiocb *cb){
  int fd;
  ssize_t len;
  char buf[BUF_SIZE];

  /* critical section */
  while(fcntl(sd,F_SETLKW,&fwl));

  /* don't care if it fails, any version is OK */
  link("foo","foo~");

  /* read existing file */
  fd=open("foo",O_RDONLY);
  len=read(fd,buf,BUF_SIZE);
  close(fd);

  /* write to new file */
  fd=open("foo.new",O_WRONLY|O_CREAT|O_TRUNC,S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH);
  write(fd,buf,len);

  /* AIO control block setup */
  memset(cb,0,sizeof(*cb));
  cb->aio_sigevent.sigev_notify=SIGEV_SIGNAL;
  cb->aio_sigevent.sigev_signo=SIGRTMIN;
  cb->aio_fildes=fd;

  /* signal counter up */
  count++;

  /* initiate sync and close */
  aio_fsync(O_SYNC,cb);
  close(fd);

  /* put the new file in place */
  rename("foo.new","foo");

  /* end critical section */
  while(fcntl(sd,F_SETLKW,&ful));
}

#define LOOPS 10
#define TRIES 20

int main(int argc,char **argv){
  int i;
  struct aiocb cb[LOOPS];
  struct sigaction act;

  /* setup shared file, restore */
  if((sd=shm_open("foo",O_RDWR|O_CREAT|O_EXCL,S_IRUSR|S_IWUSR))==-1){
    int tries=TRIES;
    struct stat f;

    /* not the first to arrive, open and wait for restore */
    sd=shm_open("foo",O_RDWR,S_IRUSR|S_IWUSR);

    fstat(sd,&f);
    while(tries-- && f.st_size<sizeof(filler)){
      sleep(1);
      fstat(sd,&f);
    }

    /* something's really screwed: the lock file never got set up */
    if(f.st_size<sizeof(filler))
      return 1;
  } else{ /* first to arrive, restore */
    /* don't care if we fail */
    if(!rename("foo~","foo"))
      fprintf(stderr,"Restored.\n");

    /* setup lock file */
    write(sd,&filler,sizeof(filler));
  }

  /* signal handler setup */
  memset(&act,0,sizeof(act));
  act.sa_flags=SA_SIGINFO;
  act.sa_sigaction=aiodone;
  sigaction(SIGRTMIN,&act,NULL);

  /* we need the backup file to be there */
  while(fcntl(sd,F_SETLKW,&brl));

  /* program may run config many times */
  for(i=0;i<LOOPS;i++){
    config(&cb[i]);

    /* do something really useful here */
  }

  /* wait for AIO completion */
  while(count)
    sleep(1);

  /* unlock the backup file */
  while(fcntl(sd,F_SETLKW,&bul));

  /* try to remove backup file */
  if(!fcntl(sd,F_SETLK,&fwl)){
    if(!fcntl(sd,F_SETLK,&bwl)){
      unlink("foo~");
      while(fcntl(sd,F_SETLKW,&bul));
    }
    while(fcntl(sd,F_SETLKW,&ful));
  }

  /* clean up shared memory */
  close(sd);

  return 0;
}