
Re: [Cluster-devel] Problem in ops_address.c :: gfs_writepage ?

Mathieu Avila wrote:
Hello all,

I need advice about a bug in GFS that may also affect other filesystems
(like ext3).

The problem:
It is possible for the function "ops_address.c :: gfs_writepage" not to
write the page it is asked to write, because the transaction lock is
held. This is valid, and in such a case one might expect it to return an
error code so that the kernel knows the page could not be written. But
this function does not return an error code; instead, it returns 0.
I've looked at ext3, and it does the same. This is valid and there's no
corruption, as the page is redirtied so that it will be flushed later.
Returning an error code is not a solution anyway, because it's possible
that no page is flushable at all, and 'sync' would misinterpret the
error code as an I/O error. There may be other implications, too.

The problem comes when there is quite a lot of stress on the filesystem.
I've made a test program that opens one file, writes at least 1 GB
(more than the system's total memory), then opens a second file
and writes as much data as it can.
When the number of dirty pages goes beyond /proc/sys/vm/dirty_ratio,
some pages must be flushed synchronously, so that the writer is blocked
and the system does not starve of free clean pages to use.

But precisely in that situation, there are many occasions when
gfs_writepage cannot perform its duty, because of the transaction lock.
Yes, we did have this problem in the past with direct I/O and the SYNC flag.
We've experienced it using the test program "bonnie++", whose purpose is
to test FS performance. Bonnie++ makes multiple files of 1 GB when it
is asked to run long multi-GB writes. There is no problem with 5 GB (5
files of 1 GB), but many machines in the cluster are OOM-killed with 10 GB.
I would like to know more about your experiments. So these bonnie++ instances are run on each cluster node, with independent file sets?

Setting more aggressive parameters for dirty_ratio and pdflush is not
a complete solution (although the problem then happens much later or not
at all), and it kills performance.

Proposed solution:

Keep a counter in gfs_inode of the pages that gfs_writepage could not
write, and at the end of do_do_write_buf, call
"balance_dirty_pages_ratelimited(file->f_mapping);" that many times. The
counter is possibly shared by multiple processes, but we are assured
that there is no transaction at that moment, so pages can be flushed if
"balance_dirty_pages_ratelimited" determines that it must reclaim dirty
pages. Otherwise performance is not affected.
In general, this approach looks OK if we do have this flushing problem. However, the GFS flush code is embedded in the glock code, so I would think it would be better to do this within the glock code. My cluster nodes happen to be down at the moment; I will look into this when the cluster is reassembled.

-- Wendy
