Async I/O
Posted Apr 8, 2009 5:02 UTC (Wed) by butlerm (subscriber, #13312)
In reply to: Async I/O by jamesh
Parent article: Linux Storage and Filesystem workshop, day 1
problems of the solution suggested by the parent poster, namely delaying
the rename until after the aio_fsync has completed. Aside from the
complexity and overhead issues, it has completely different user visible
semantics. That is fine if no other process needs to read the new version
of the file in the meantime, otherwise it is problematic. fbarrier would
be a much cleaner solution.
Posted Apr 8, 2009 6:02 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (15 responses)
Absolutely. No question about it. It's just that Linus is not keen on having it, so that got me thinking as to how the same can be done without a new call. Of course, many of the thoughts are, shall we say, ill-conceived... :-)
> it has completely different user visible semantics
One could also do this:
1. See if "foo~" exists.
In signal handler/thread created on sigevent do:
1. Unlink "foo~".
Then you get full rename semantics with an always up-to-date file (i.e. what I was getting at with the inotify example). However, your app then needs to check at startup whether "foo~" exists (which means you crashed before the signal handler/thread unlinked your backup) and, if it does, rename it to "foo". Then, continue.
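The startup check described above could look roughly like the sketch below. The helper name restore_backup is hypothetical, not something from the thread; it simply attempts the rename and treats a missing "foo~" as the normal, no-crash case.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: if a backup left behind by a crashed run exists,
 * move it back over the primary file.
 * Returns 1 if the backup was restored, 0 if there was no backup,
 * and -1 on any other error. */
static int restore_backup(const char *path, const char *backup)
{
    assert(path != NULL && backup != NULL);

    /* rename() replaces `path` atomically if the backup exists */
    if (rename(backup, path) == 0)
        return 1;

    /* ENOENT means there was nothing to restore */
    return errno == ENOENT ? 0 : -1;
}
```

An application would call restore_backup("foo", "foo~") once at startup, before it opens "foo".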
Posted Apr 8, 2009 6:17 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
That is when signal/thread counter reached zero.
Posted Apr 8, 2009 7:40 UTC (Wed)
by bojan (subscriber, #14302)
[Link] (3 responses)
Quick and dirty - probably has more bugs than lines, but you'll get the picture. Compile and link with: gcc -Wall -O2 -g -o a a.c -lrt
Posted Apr 8, 2009 8:03 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
Posted Apr 9, 2009 4:17 UTC (Thu)
by pr1268 (subscriber, #24648)
[Link] (1 responses)
This is going way off-topic, but why all the runtime allocations in your sample program? Malloc(3), calloc(3), and free(3) are horribly expensive, relatively speaking. Automatic/static storage for those structs and that char buffer would be substantially faster.
Posted Apr 9, 2009 6:14 UTC (Thu)
by bojan (subscriber, #14302)
[Link]
Posted Apr 8, 2009 8:30 UTC (Wed)
by jamesh (guest, #1159)
[Link] (1 responses)
Also, readers would need to differentiate between the case of "foo~" existing because the system crashed and "foo~" existing because some other process is midway through replacing "foo" and waiting on the fsync.
Posted Apr 8, 2009 11:02 UTC (Wed)
by bojan (subscriber, #14302)
[Link]
Posted Apr 10, 2009 4:53 UTC (Fri)
by butlerm (subscriber, #13312)
[Link] (7 responses)
That is what ext4 (and apparently XFS) do in data=writeback mode when
This solution, as it turns out, is very similar to the practice of keeping
Posted Apr 10, 2009 5:26 UTC (Fri)
by bojan (subscriber, #14302)
[Link] (6 responses)
Not entirely true, actually. Imagine two processes reading the same file "foo". After they read it, they make the changes in memory, write them out to "foo.new" and then rename into "foo". Which changes will persist? From the first or the second process?
You have to have some kind of synchronisation to do this (flock(), semaphore etc.). Which can also be applied to the example with "foo~" files to sync access. That's why Gnome has a daemon (i.e. single process) to manage all these changes.
PS. Of course, fbarrier() is still a much better solution, cleaner etc., but you cannot just say that multiple processes can do this as they please.
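As an illustration of the synchronisation point, here is a minimal sketch of a read/modify/rename update serialised with flock(). The helper name locked_append, the ".lock" and ".new" suffixes, and the 4096-byte cap are all assumptions made for the example. It also uses a plain blocking fsync() for simplicity rather than the aio_fsync() approach discussed in this thread.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/file.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_DATA 4096 /* arbitrary cap, for the sketch only */

/* Hypothetical helper: append `text` to `path` by reading the file,
 * writing "<path>.new" and renaming it into place, all while holding an
 * exclusive flock() on "<path>.lock" so concurrent updaters cannot lose
 * each other's changes. Returns 0 on success, -1 on failure. */
static int locked_append(const char *path, const char *text)
{
    char lockname[256], tmpname[256], buf[MAX_DATA];
    ssize_t have = 0;
    size_t add, total;
    int lock_fd, fd, ok, rc = -1;

    assert(path != NULL && text != NULL);
    add = strlen(text);
    if (snprintf(lockname, sizeof lockname, "%s.lock", path) >= (int)sizeof lockname)
        return -1;
    if (snprintf(tmpname, sizeof tmpname, "%s.new", path) >= (int)sizeof tmpname)
        return -1;

    lock_fd = open(lockname, O_RDWR | O_CREAT, 0644);
    if (lock_fd == -1)
        return -1;
    if (flock(lock_fd, LOCK_EX) == -1) /* blocks until we own the lock */
        goto out_close;

    /* Read the current contents; a missing file counts as empty. */
    fd = open(path, O_RDONLY);
    if (fd != -1) {
        have = read(fd, buf, sizeof buf);
        close(fd);
        if (have < 0)
            goto out_unlock;
    }
    total = (size_t)have + add;
    if (total >= sizeof buf) /* too large for this sketch */
        goto out_unlock;
    memcpy(buf + have, text, add);

    /* Write the new version next to the old one, then swap it in. */
    fd = open(tmpname, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        goto out_unlock;
    ok = write(fd, buf, total) == (ssize_t)total && fsync(fd) == 0;
    if (close(fd) == -1)
        ok = 0;
    if (ok && rename(tmpname, path) == 0)
        rc = 0;

out_unlock:
    flock(lock_fd, LOCK_UN);
out_close:
    close(lock_fd);
    return rc;
}
```

Because every updater takes the same lock before reading, the second process always sees the first one's change instead of overwriting it.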
Posted Apr 10, 2009 5:49 UTC (Fri)
by butlerm (subscriber, #13312)
[Link] (5 responses)
Ext4 does a certain amount of comparable undo already - if the replacement
Posted Apr 10, 2009 7:35 UTC (Fri)
by bojan (subscriber, #14302)
[Link] (4 responses)
I know. What I'm talking about is synchronisation between processes in terms of contents of data (i.e. one process may write a change, which gets lost when another process does the same - your stock race). So, you cannot just open(), write(), close(), rename() with multiple processes. You have to lock, otherwise your processes will stomp all over each other's data. An example of doing the same with multiple processes when the kernel doesn't guarantee data before metadata on rename is below. Bugs included, of course ;-).
Posted Apr 10, 2009 15:34 UTC (Fri)
by butlerm (subscriber, #13312)
[Link]
Posted Apr 11, 2009 10:22 UTC (Sat)
by bojan (subscriber, #14302)
[Link]
Posted Apr 11, 2009 12:06 UTC (Sat)
by bojan (subscriber, #14302)
[Link] (1 responses)
Posted Apr 12, 2009 4:52 UTC (Sun)
by bojan (subscriber, #14302)
[Link]
A more robust version below:
Async I/O
2. If it doesn't, do link("foo","foo~") (i.e. create "backup").
3. Open "foo".
4. Read "foo".
5. Open/create/truncate "foo.new".
6. Write into "foo.new".
7. Call aio_fsync() on "foo.new". <-- doesn't block
8. Close "foo.new".
9. Rename "foo.new" into "foo".
Async I/O
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>
#include <signal.h>
#include <aio.h>
#define BUF_SIZE 50
static volatile sig_atomic_t count=0; /* modified from the signal handler */
void whack(int signum,siginfo_t *info,void *context){
if(!--count)
unlink("foo~");
}
int main(int argc,char **argv){
int fd;
ssize_t rl;
char *buf=malloc(BUF_SIZE);
struct aiocb *cb=calloc(1,sizeof(*cb));
struct sigevent *se=calloc(1,sizeof(*se));
struct sigaction *act=calloc(1,sizeof(*act));
/* XXX this is just a demo, no error checking */
/* AIO control block defaults */
cb->aio_sigevent.sigev_notify=SIGEV_SIGNAL;
cb->aio_sigevent.sigev_signo=SIGRTMIN;
/* signal handler */
act->sa_flags=SA_SIGINFO;
act->sa_sigaction=whack;
sigaction(SIGRTMIN,act,NULL);
/* see if foo~ exists and restore */
if(!access("foo~",F_OK|R_OK|W_OK))
rename("foo~","foo");
/* back it up if required */
if(access("foo~",F_OK|R_OK|W_OK))
link("foo","foo~");
/* read existing file */
fd=open("foo",O_RDONLY);
rl=read(fd,buf,BUF_SIZE);
close(fd);
/* write to new file and initiate sync */
fd=open("foo.new",O_WRONLY|O_CREAT|O_TRUNC,S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH);
write(fd,buf,rl);
cb->aio_fildes=fd;
count++;
aio_fsync(O_SYNC,cb);
close(fd);
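/* rename new file into the existing one */
rename("foo.new","foo");
/* wait for the sync to complete so the handler can unlink foo~ before exit */
while(count)
sleep(1);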
free(act);
free(se);
free(cb);
free(buf);
return 0;
}
Async I/O
Why all the runtime allocations? (off-topic)
Async I/O
be doing already. There is a relatively simple solution to this that I
have mentioned a few times that is applicable to virtually any journalled
filesystem that has none of the performance cost of falling back to
data=ordered mode every time someone wants to do a rename replacement.
rename safety is enabled - force all the data for the file to be renamed to
disk before the next metadata transaction can complete. That means that
*every* outstanding fsync operation is delayed while your multi-gigabyte
ISO file finishes being committed to disk.
tilde files. It is just that the filesystem does it automatically and
invisibly, restoring the old version on recovery whenever the new version
didn't finish getting committed to disk. No threads, signal handlers, etc.
required. No problems with multiple process access. No application level
code to figure out whether a version is corrupt. No browser freeze ups.
Rename undo is the way to avoid all that, with little or no performance
cost.
Async I/O
Kernel based rename undo
replacement an undo entry is placed in the journal and the old inode is
kept around until the new inode's data is committed to disk. Then in the
case of an unclean shutdown the filesystem recovery process rolls forward
using the journal and uses the undo entries in the journal to build a rename
undo candidate list. When the journal redo is complete, the filesystem
then uses the rename undo list to undo the rename replacements whenever the
replacement inode's data was not committed before the system crashed.
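The recovery pass described above can be modelled in a few lines of user-space C. This is only a toy sketch under stated assumptions: the struct undo_entry layout, the function name apply_rename_undo, the directory modelled as an array of inode numbers, and the newest-first walk over the undo list are all invented for illustration and do not reflect any real filesystem's journal format.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical journal record: directory slot `slot` used to point at
 * `old_ino` and was replaced by `new_ino` via rename. */
struct undo_entry {
    int slot;
    int old_ino;
    int new_ino;
};

/* Toy recovery pass, run after journal redo: walk the undo candidates
 * newest-first and put the old inode back wherever the replacement's data
 * never reached disk. `data_committed[ino]` is nonzero if that inode's
 * data was committed before the crash. Returns the number of renames
 * that were undone. */
static int apply_rename_undo(int *dir, const struct undo_entry *log,
                             size_t n, const int *data_committed)
{
    int undone = 0;

    assert(dir != NULL && data_committed != NULL);
    for (size_t i = n; i-- > 0; ) {
        const struct undo_entry *e = &log[i];

        /* Only undo if the name still points at the uncommitted inode. */
        if (dir[e->slot] == e->new_ino && !data_committed[e->new_ino]) {
            dir[e->slot] = e->old_ino;
            undone++;
        }
    }
    return undone;
}
```

Walking the list newest-first means a chain of replacements of the same name unwinds back to the most recent version whose data did make it to disk.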
file was not committed to disk, the allocated blocks are freed and the
filesystem truncates the file. What I suggest is not much more complicated
than that.
Kernel based rename undo
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <aio.h>
#define BUF_SIZE 50
static int *count=NULL;
/* XXX this is just a demo, no error checking */
static void whack(int signum,siginfo_t *info,void *context){
int sd=*(int*)info->si_value.sival_ptr;
/* critical section */
lockf(sd,F_LOCK,0);
if(!--(*count))
unlink("foo~");
/* end critical section */
lockf(sd,F_ULOCK,0);
}
int main(int argc,char **argv){
int sd,fd;
ssize_t len;
char buf[BUF_SIZE];
struct aiocb cb;
const struct aiocb *cbl[]={&cb};
struct sigaction act;
/* AIO control block setup */
memset(&cb,0,sizeof(cb));
cb.aio_sigevent.sigev_notify=SIGEV_SIGNAL;
cb.aio_sigevent.sigev_signo=SIGRTMIN;
cb.aio_sigevent.sigev_value.sival_ptr=&sd;
/* signal handler setup */
memset(&act,0,sizeof(act));
act.sa_flags=SA_SIGINFO;
act.sa_sigaction=whack;
sigaction(SIGRTMIN,&act,NULL);
/* setup shared counter, restore */
if((sd=shm_open("foo",O_RDWR|O_CREAT|O_EXCL,S_IRUSR|S_IWUSR))==-1){
int tries=20;
struct stat s;
/* not the first to arrive, open and wait for counter to be written */
sd=shm_open("foo",O_RDWR,S_IRUSR|S_IWUSR);
fstat(sd,&s);
while(tries-- && s.st_size<sizeof(*count)){
sleep(1);
fstat(sd,&s);
}
/* something's really screwed: counter still not written after all tries */
if(s.st_size<sizeof(*count))
return 1;
} else{ /* first to arrive, restore */
int count=0; /* filler */
/* don't care if we fail */
if(!rename("foo~","foo"))
fprintf(stderr,"Restored.\n");
write(sd,&count,sizeof(count));
}
/* shared counter */
count=mmap(NULL,sizeof(int),PROT_READ|PROT_WRITE,MAP_SHARED,sd,0);
/* critical section */
lockf(sd,F_LOCK,0);
/* don't care if it fails - already there */
link("foo","foo~");
/* read existing file */
fd=open("foo",O_RDONLY);
len=read(fd,buf,BUF_SIZE);
close(fd);
/* write to new file and initiate sync */
fd=open("foo.new",O_WRONLY|O_CREAT|O_TRUNC,S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH);
write(fd,buf,len);
cb.aio_fildes=fd;
(*count)++;
aio_fsync(O_SYNC,&cb);
close(fd);
/* put the new file in place */
rename("foo.new","foo");
/* end critical section */
lockf(sd,F_ULOCK,0);
/* do something really useful here */
/* wait for AIO completion */
aio_suspend(cbl,1,NULL);
/* clean up shared memory */
munmap(count,sizeof(int));
close(sd);
return 0;
}
Kernel based rename undo
to single writer / multiple readers, which is a far more common situation.
Kernel based rename undo
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <aio.h>
#include <errno.h>
/* XXX this is just a demo, no error checking */
static int sd=-1;
static volatile sig_atomic_t count=0; /* decremented from the signal handler */
static char filler[2]={0,0};
/* locks */
static struct flock
fwl={.l_type=F_WRLCK,.l_whence=SEEK_SET,.l_start=0,.l_len=1},
ful={.l_type=F_UNLCK,.l_whence=SEEK_SET,.l_start=0,.l_len=1},
bwl={.l_type=F_WRLCK,.l_whence=SEEK_SET,.l_start=1,.l_len=1},
brl={.l_type=F_RDLCK,.l_whence=SEEK_SET,.l_start=1,.l_len=1},
bul={.l_type=F_UNLCK,.l_whence=SEEK_SET,.l_start=1,.l_len=1};
static void aiodone(int signum,siginfo_t *info,void *context){
/* signal counter down */
count--;
}
#define BUF_SIZE 50
static void config(struct aiocb *cb){
int fd;
ssize_t len;
char buf[BUF_SIZE];
/* critical section */
while(fcntl(sd,F_SETLKW,&fwl));
/* don't care if it fails, any version is OK */
link("foo","foo~");
/* read existing file */
fd=open("foo",O_RDONLY);
len=read(fd,buf,BUF_SIZE);
close(fd);
/* write to new file */
fd=open("foo.new",O_WRONLY|O_CREAT|O_TRUNC,S_IRUSR|S_IWUSR|S_IRGRP|S_IROTH);
write(fd,buf,len);
/* AIO control block setup */
memset(cb,0,sizeof(*cb));
cb->aio_sigevent.sigev_notify=SIGEV_SIGNAL;
cb->aio_sigevent.sigev_signo=SIGRTMIN;
cb->aio_fildes=fd;
/* signal counter up */
count++;
/* initiate sync and close */
aio_fsync(O_SYNC,cb);
close(fd);
/* put the new file in place */
rename("foo.new","foo");
/* end critical section */
while(fcntl(sd,F_SETLKW,&ful));
}
#define LOOPS 10
#define TRIES 20
int main(int argc,char **argv){
int i;
struct aiocb cb[LOOPS];
struct sigaction act;
/* setup shared file, restore */
if((sd=shm_open("foo",O_RDWR|O_CREAT|O_EXCL,S_IRUSR|S_IWUSR))==-1){
int tries=TRIES;
struct stat f;
/* not the first to arrive, open and wait for restore */
sd=shm_open("foo",O_RDWR,S_IRUSR|S_IWUSR);
fstat(sd,&f);
while(tries-- && f.st_size<sizeof(filler)){
sleep(1);
fstat(sd,&f);
}
/* something's really screwed: lock file still not set up after all tries */
if(f.st_size<sizeof(filler))
return 1;
} else{ /* first to arrive, restore */
/* don't care if we fail */
if(!rename("foo~","foo"))
fprintf(stderr,"Restored.\n");
/* setup lock file */
write(sd,&filler,sizeof(filler));
}
/* signal handler setup */
memset(&act,0,sizeof(act));
act.sa_flags=SA_SIGINFO;
act.sa_sigaction=aiodone;
sigaction(SIGRTMIN,&act,NULL);
/* we need the backup file to be there */
while(fcntl(sd,F_SETLKW,&brl));
/* program may run config many times */
for(i=0;i<LOOPS;i++){
config(&cb[i]);
/* do something really useful here */
}
/* wait for AIO completion */
while(count)
sleep(1);
/* unlock the backup file */
while(fcntl(sd,F_SETLKW,&bul));
/* try to remove backup file */
if(!fcntl(sd,F_SETLK,&fwl)){
if(!fcntl(sd,F_SETLK,&bwl)){
unlink("foo~");
while(fcntl(sd,F_SETLKW,&bul));
}
while(fcntl(sd,F_SETLKW,&ful));
}
/* clean up shared memory */
close(sd);
return 0;
}