[Cluster-devel] Problem in ops_address.c :: gfs_writepage ?

Mathieu Avila mathieu.avila at seanodes.com
Mon Feb 19 15:00:50 UTC 2007


Hello all,

I need advice about a bug in GFS that may also affect other filesystems
(like ext3).

The problem:
It is possible for "ops_address.c :: gfs_writepage" not to write the
page it is asked to write, because the transaction lock is held. One
would expect it to return an error code in that case, so that the
kernel knows the page could not be written, but the function returns 0
instead. I've looked at ext3, and it does the same. This is valid and
causes no corruption, as the page is "redirtied" so that it will be
flushed later. Returning an error code is not a solution either,
because it is possible that no page is flushable at all, and 'sync'
would misinterpret the error code as an I/O error. There may be other
implications, too.
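
For reference, here is the shape of the code path in question, a
minimal sketch reconstructed from the ops_address.c hunk in the patch
below (surrounding checks abridged):

    static int gfs_writepage(struct page *page, struct writeback_control *wbc)
    {
            /* ... sanity checks that can return -EIO ... */

            if (get_transaction) {
                    /* A transaction is already running on this task, so
                       the page cannot be written now: mark it dirty again
                       and report success to the caller anyway. */
                    redirty_page_for_writepage(wbc, page);
                    unlock_page(page);
                    return 0;
            }

            /* ... normal writeback path ... */
    }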

The problem appears when the filesystem is under heavy stress.
I've written a test program that opens one file, writes 1 GB
(at least more than the system's total memory), then opens a 2nd file
and writes as much data as it can.
When the number of dirty pages goes beyond /proc/sys/vm/dirty_ratio,
some pages must be flushed synchronously, so that the writer is blocked
from writing and the system does not run out of free clean pages.
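
A simplified user-space version of that test program might look like
the following sketch (the mount point and sizes are placeholders, not
the exact program we used):

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK (1 << 20)                     /* write in 1 MB chunks */
    #define FIRST_SIZE (1024LL * CHUNK)         /* ~1 GB, more than RAM here */

    static void fill(const char *path, long long limit)
    {
            static char buf[CHUNK];
            long long written = 0;
            int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

            if (fd < 0) {
                    perror(path);
                    exit(1);
            }
            memset(buf, 'x', sizeof(buf));
            /* limit < 0 means "write until the filesystem refuses" */
            while (limit < 0 || written < limit) {
                    if (write(fd, buf, sizeof(buf)) != sizeof(buf))
                            break;
                    written += sizeof(buf);
            }
            close(fd);
    }

    int main(void)
    {
            fill("/mnt/gfs/file1", FIRST_SIZE);  /* 1st file: ~1 GB */
            fill("/mnt/gfs/file2", -1);          /* 2nd file: as much as possible */
            return 0;
    }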

But precisely in that situation, there are many times when
gfs_writepage cannot perform its duty because of the transaction lock.
The pages that we need to flush cannot be flushed, but they are
nevertheless accounted as flushed. The process is therefore allowed to
continue writing, and more and more pages are marked dirty.

Fortunately, in most cases pdflush comes to the rescue: it starts
flushing pages so that the program gets back free pages to dirty.
Sometimes the system is stable, if the pages submitted by pdflush are
written back fast enough by the block device. But if we make pdflush
less aggressive and raise dirty_ratio, it is possible to trigger the
OOM killer on a machine, which then ends up completely blocked. In
practice, we've experienced it with the test program "bonnie++", whose
purpose is to measure FS performance. Bonnie++ creates multiple 1 GB
files when it is asked to run long multi-GB writes. There is no problem
with 5 GB (5 files of 1 GB), but many machines in the cluster get OOM
killed with 10 GB bonnies....
Setting more aggressive parameters for dirty_ratio and pdflush is not
a complete solution (although the problem then happens much later or
not at all), and it kills performance.

Proposed solution:

Keep a counter in gfs_inode of the pages that were not written in
gfs_writepage, and at the end of do_do_write_buf call
"balance_dirty_pages_ratelimited(file->f_mapping);" that many times.
The counter may be shared by multiple processes, but we are sure that
no transaction is held at that point, so pages can be flushed if
"balance_dirty_pages_ratelimited" decides it must reclaim dirty pages.
Otherwise performance is not affected.
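
In other words, at the end of do_do_write_buf the pending count would
be drained roughly like this (a cleaner restatement of the loop in the
ops_file.c hunk of the patch below, same logic):

    /* Drain the count of pages that gfs_writepage could not write while
       a transaction was held, and let the VM throttle us once per such
       page. cmpxchg on the atomic_t keeps the counter consistent even if
       several writers drain it concurrently. */
    for (;;) {
            int pending = atomic_read(&ip->pages_not_written);

            if (pending <= 0)
                    break;
            if (cmpxchg(&ip->pages_not_written.counter,
                        pending, pending - 1) == pending)
                    balance_dirty_pages_ratelimited(file->f_mapping);
    }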

About ext3:

On a normal local disk with ext3, the problem does not show up,
although ext3 should be affected as well. I suppose that's because ext3
does not hold the transaction lock as long, and a local disk is fast
enough that pdflush keeps the number of dirty pages low.

Patch below. Beware, it's quite ugly. Credits:
Oliver.Cozette at seanodes.com and Mathieu.Avila at seanodes.com

--
Mathieu Avila 

Index: cluster/gfs-kernel/src/gfs/ops_file.c
===================================================================
--- cluster/gfs-kernel/src/gfs/ops_file.c~       
+++ cluster/gfs-kernel/src/gfs/ops_file.c       
@@ -25,7 +25,7 @@
 #include <linux/mm.h>
 #include <asm/uaccess.h>
 #include <linux/gfs_ioctl.h>
-
+#include <linux/writeback.h>
 #include "gfs.h"
 #include "bmap.h"
 #include "dio.h"
@@ -824,6 +824,22 @@
                        goto fail_ipres;
        }

+       do
+         {
+           int pages_not_written = atomic_read(&ip->pages_not_written);
+           int result;
+           if ( pages_not_written <= 0 )
+             {
+               break;
+             }
+           result=cmpxchg(&ip->pages_not_written.counter, pages_not_written, (pages_not_written - 1) );
+           if (result == pages_not_written)
+             {
+               balance_dirty_pages_ratelimited(file->f_mapping);
+             }
+         }
+       while (1);
+

Index: cluster/gfs-kernel/src/gfs/incore.h
===================================================================
--- cluster/gfs-kernel/src/gfs/incore.h~
+++ cluster/gfs-kernel/src/gfs/incore.h
@@ -617,6 +617,8 @@

        unsigned int i_greedy; /* The amount of time to be greedy */
        unsigned long i_last_pfault; /* The time of the last page fault */
+
+        atomic_t pages_not_written; /* Due to journal locking, how many pages were not written when "gfs_writepage" was called */
 };

 /*

Index: cluster/gfs-kernel/src/gfs/ops_address.c
===================================================================
--- cluster/gfs-kernel/src/gfs/ops_address.c~
+++ cluster/gfs-kernel/src/gfs/ops_address.c
@@ -179,6 +183,7 @@
                return -EIO;
        }
        if (get_transaction) {
+               atomic_inc(&ip->pages_not_written);
                redirty_page_for_writepage(wbc, page);
                unlock_page(page);
                return 0;



