
Re: writing processes are blocking in log_wait_common with data=ordered



> Andrew Morton wrote:
> > 
> > Does this patch help?
> 
> It won't, I suspect.  You've done an O_SYNC write.  ext3
> needs to write your data out to disk before returning
> from the pwrite() call.  We do that by running a commit
> and waiting for it to complete.
> 
> In ordered mode, commit will writeback and wait upon
> your newly-dirtied data.  That's what you asked it to do.
> 
> Other filesystems will do it by directly writing the data
> and waiting on it.  We've lost some concurrency because
> the journal is busy, but in practice I suspect it won't
> make much difference.
> 
> Are you sure that you actually have a problem?  Does your
> application run significantly more quickly on ext2?

I think so.  Here's what I've tested so far using a test program
(attached, see P.S. below) that simulates the load.  I have:

1) Red Hat 7.2, kernel 2.4.17-rc2-aa2, with ext3 on an ATA133 disk.
This reports about 70 blks/sec.

2) Red Hat 6.2, kernel 2.4.17-rc2-aa2, with ext2 on a SCSI U160 disk.
This reports about 420 blks/sec.

3) Red Hat 7.2 (identical hardware to #2), kernel 2.4.19-pre7-aa2, with ext3.
This reports about 40 blks/sec.

Both ext3 systems are in the 40-70 range, though they differ in kernel
version and hardware.  The ext2 system is 6x to 10x faster, whether I
compare on the same kernel (#1 vs. #2) or on the same hardware (#2 vs. #3).

Also, kjournald has been eating a ton of CPU time lately.  It had used 7
minutes in a month, and then another 3 minutes in the one day since I
noticed this was happening.  This is with the real application, not the
test program.

> (I now need to know your exact kernel version - there
> have been various goofups on the sync paths which were
> fixed relatively recently).
> 
> I suspect that ext3 is doing an unnecessary commit
> on the fsync() case, and in the O_SYNC case, for your
> application.  If the mtime fix is in place then we
> can try to drop all the ordered-mode data buffers
> from the transaction (which will succeed) and then
> look to see if there's anything to be committed
> (there will not be).  hmm.

I will try out both your patch, which you think won't work, and various
combinations of ext3 (ordered and writeback) and ext2.  My target kernel
version is 2.4.19-pre7-aa2.  I'll try out vanilla pre7 if I have time too.

One interesting and unexpected result is that running inside a
loopback-mounted filesystem 1 GB in size increases performance 4-fold
compared with running on the real filesystem!  That is, ext3-ordered
looped on top of ext3-ordered is much faster than ext3-ordered!  This is
on kernel 2.4.17-rc2-aa2, which is a bit old, so it could be meaningless....

David

P.S.  I created a benchmark of this phenomenon called blktest.c.  It's a 
bit rough (you need to recompile to change block size etc.).  It's 
attached.  It takes a single argument which is the number of concurrent 
writers.  Each writer writes an 8kb block to a random location in the 
file using pwrite.  The code is stupid in many places.  Excuse it.

-- 
/==============================\
| David Mansfield              |
| david cobite com             |
\==============================/
/* for pwrite */
#define _XOPEN_SOURCE 500

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/time.h>
#include <fcntl.h>
#include <unistd.h>
#include <signal.h>
#include <sys/mman.h>

#define BLKSIZE 8192
#define FILESIZE (512*1024*1024) 

void die(const char * reason)
{
    fprintf(stderr, "dying: %s\n", reason);
    exit(1);
}

void sig(int which)
{
    printf("received signal %d\n", which);
}

void do_child(int fd, int child, int * score)
{
    char buff[BLKSIZE];
    struct timeval tv;

    /* set random seed in each process */
    gettimeofday(&tv, NULL);
    srand(tv.tv_usec);

    memset(buff, child, BLKSIZE);

    while (1)
    {
	int block = rand() % (FILESIZE/BLKSIZE);
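	/* only count blocks that were written in full */
	if (pwrite(fd, buff, BLKSIZE, (off_t)block * BLKSIZE) != BLKSIZE)
	{
	    perror("pwrite");
	    _exit(1);
	}
	(*score)++;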
    }
}

int main(int argc, char * argv[])
{
    int i, nr_procs, fd;
    pid_t * pid;
    struct sigaction sa;
    int score_fd;
    int * score;
    struct timeval start_tv, end_tv;
    int total_score = 0;
    double secs;

    if (argc < 2)
	die("usage");

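    /* the 4096-byte scoreboard mapping holds one int per writer */
    if ((nr_procs = atoi(argv[1])) <= 0 || nr_procs > (int)(4096 / sizeof(int)))
	die("usage");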

    /* the test file needs to be created beforehand */
    if ((fd = open("blktest.tmp", O_RDWR|O_SYNC)) < 0)
	die("please create a test file using:\n\ndd if=/dev/zero of=blktest.tmp bs=1k count=xxx");

    /* shared memory to keep the 'scoreboard' */
    if ((score_fd = open("/dev/zero", O_RDWR)) < 0)
	die("/dev/zero");

    if ((score = (int*)mmap(0, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, score_fd, 0)) == MAP_FAILED)
	die("mmap");

    if (!(pid = (pid_t*)calloc(nr_procs, sizeof(pid_t))))
	die("calloc");

    printf("forking writers.\n");

    gettimeofday(&start_tv, NULL);

    for (i = 0; i < nr_procs; i++)
    {
	if ((pid[i] = fork()) < 0)
	{
	    int j;
	    for (j = 0; j < i; j++)
		kill(pid[j], SIGKILL);
	    goto cleanup;
	}
	else if (pid[i] == 0)
	{
	    do_child(fd, i, score + i);
	}
	
	printf("forked process %d\n", pid[i]);
    }

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sig;
    sigaction(SIGINT, &sa, NULL);
    sigaction(SIGTERM, &sa, NULL);

    printf("children started, waiting for signal\n");
    pause();

 cleanup:
    while (i)
    {
	pid_t dead = wait(NULL);
	printf("pid %d has exited\n", dead);
	i--;
    }
    
    gettimeofday(&end_tv, NULL);

    for (i = 0; i < nr_procs; i++)
    {
	printf("score for %d: %d\n", i, score[i]);
	total_score += score[i];
    }

    end_tv.tv_sec -= start_tv.tv_sec;
    end_tv.tv_usec -= start_tv.tv_usec;
    
    if (end_tv.tv_usec < 0)
	end_tv.tv_sec--, end_tv.tv_usec += 1000000;

    secs = (double)end_tv.tv_sec + (double)end_tv.tv_usec / 1000000.0;
    
    printf("total score: %d blocks in %.2f seconds %f blks/sec\n", total_score, secs, (double)total_score/secs);

    exit(0);
}
