
Re: [Cluster-devel] Problem in ops_address.c :: gfs_writepage ?

Mathieu Avila wrote:

On Mon, 19 Feb 2007 17:00:10 -0500,
Wendy Cheng <wcheng redhat com> wrote:

Mathieu Avila wrote:
Hello all,
But precisely, in that situation, there are multiple times when
gfs_writepage cannot perform its duty, because of the transaction lock.

Yes, we did have this problem in the past with direct IO and O_SYNC writes.

I understand that in that case, data are written with the get_transaction
lock taken, so that gfs_writepage never writes the pages, and you go
beyond the dirty_ratio limit if you write too many pages. How did you
get rid of it?
GFS doesn't allow a process to do writepage if it is already in the middle of a transaction. In the Direct IO and O_SYNC cases, we did something similar to (and simpler than) what you've proposed: right after the transaction is marked completed, we do filemap_fdatawrite() and wait for it to complete. VM dirty_ratio is not relevant there, since both O_DIRECT and O_SYNC require data to be (completely) flushed anyway. In default buffered write mode (your case), the design is to take advantage of the VM's writeback - so if flushing can be avoided, it is delayed. Guess that's why we get into the issue you described in your previous post.
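
For illustration, here is a minimal sketch of that flush-after-transaction pattern, assuming 2.6-era kernel APIs. The helper name is hypothetical and the real GFS O_DIRECT/O_SYNC paths do more work:

#include <linux/fs.h>
#include <linux/writeback.h>

/* Hypothetical helper, called right after gfs_trans_end(): at that point
 * current->journal_info is NULL, so writepage is no longer a no-op. */
static int flush_after_trans(struct inode *inode)
{
	struct address_space *mapping = inode->i_mapping;
	int error;

	error = filemap_fdatawrite(mapping);	/* start writeback of dirty pages */
	if (!error)
		error = filemap_fdatawait(mapping);	/* wait for the IO to finish */
	return error;
}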

we've experienced it using the test program "bonnie++", whose
purpose is to test FS performance. Bonnie++ makes multiple files
of 1 GB when it is asked to run long multi-GB writes. There is no
problem with 5 GB (5 files of 1 GB), but many machines in the
cluster are OOM-killed with 10 GB bonnies....

I would like to know more about your experiments. So these
bonnie++(s) are run on each cluster node with independent file sets?

Exactly. I run "bonnie++ -s 10G" on each node, in different directories
of the same GFS file system. To make the problem happen reliably
and more quickly, I tune pdflush by effectively disabling it:
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 50000 > /proc/sys/vm/dirty_writeback_centisecs
..., so that only /proc/sys/vm/dirty_ratio comes into play. Not doing this
makes the problem harder to reproduce (but it is still reproducible).
ok ... so this is a corner case, since 10G is a very large number... Did you have real applications get into an OOM state, or is this issue only seen via this bonnie++ experiment?

The problem happens only when it starts the writeback of the dirty
pages of the 2nd file, once it is done with the 1st file.  We still
have to determine why. So you can get the same problem with a program
that (see the sketch below):
- opens 2 files,
- writes 1 GB to the 1st one,
- writes indefinitely to the second file.
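
A minimal userspace sketch of that reproducer (the mount point, file names and chunk size are illustrative):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	static char buf[1 << 20];	/* 1 MB chunk */
	long i;
	int fd1 = open("/mnt/gfs/file1", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	int fd2 = open("/mnt/gfs/file2", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	if (fd1 < 0 || fd2 < 0)
		return 1;
	memset(buf, 'x', sizeof(buf));

	/* Step 1: write 1 GB (1024 x 1 MB) to the first file. */
	for (i = 0; i < 1024; i++)
		if (write(fd1, buf, sizeof(buf)) < 0)
			return 1;
	close(fd1);

	/* Step 2: dirty pages in the second file faster than they are
	 * flushed, which is where the OOM kills were observed. */
	for (;;)
		if (write(fd2, buf, sizeof(buf)) < 0)
			return 1;
}
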
GFS actually chunks one big write into 4M pieces. I would expect the indefinite write to cause the journal to wrap around, forcing a flush. Will need some time to look into this issue.

We use a particular block device that doesn't read/write at the same
speed on all nodes. Don't know if this can help.
I doubt it is relevant.

Keep a counter of pages in gfs_inode whose value represents those
not written in gfs_writepage, and at the end of do_do_write_buf,
call "balance_dirty_pages_ratelimited(file->f_mapping);" that many
times. The counter is possibly shared by multiple processes, but we
are assured that there is no transaction at that moment, so pages
can be flushed if "balance_dirty_pages_ratelimited" determines
that it must reclaim dirty pages. Otherwise performance is not affected.
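
A rough sketch of that proposal, assuming a hypothetical atomic counter (called i_unflushed here) added to gfs_inode and incremented by gfs_writepage each time it skips a page; balance_dirty_pages_ratelimited() is the stock kernel throttle:

#include <linux/writeback.h>

/* Hypothetical: run at the end of do_do_write_buf(), after gfs_trans_end(),
 * so no transaction is pending and writepage can make progress again. */
static void balance_skipped_pages(struct gfs_inode *ip, struct file *file)
{
	int n = atomic_read(&ip->i_unflushed);	/* pages skipped by writepage */

	atomic_sub(n, &ip->i_unflushed);
	/* Throttle once per skipped page, as proposed above; this gives the
	 * VM a chance to write back dirty pages if dirty_ratio is exceeded. */
	while (n-- > 0)
		balance_dirty_pages_ratelimited(file->f_mapping);
}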

In general, this approach looks ok if we do have this flushing
problem. However, GFS flush code has been embedded in the glock code,
so I would think it would be better to do this within the glock code.

These are points we do not understand.

- I understand that the flushing code is done in glock code (when the
lock is demoted or taken by another node, isn't it?). Why isn't it
possible to let the kernel decide which pages to flush when it needs
to? For example, in this particular case it is not a good idea to flush
pages only when the lock is lost; the kernel needs to flush pages earlier.
pdflush is allowed to flush GFS pages freely, as long as it can get the individual page lock. The only thing not allowed is a writepage from a process that is already in the middle of a transaction; in that case writepage returns without writing. The original intention (my guess, though - the original author left the team two years ago) was that there would be plenty of chances for pages to get flushed (by lock demotes, journal flushes, etc.). It is apparently an optimization effort to take advantage of the VM's writeback.

- Why doesn't gfs_writepage return an error when the page cannot be
flushed?
Well ... no comment :) ... but really, think about it: what else can we do other than return 0? We don't want the VM to misinterpret the return code as an IO error.
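
For reference, a simplified sketch of that behavior, modeled on gfs_writepage in ops_address.c (in GFS1, get_transaction is a macro over current->journal_info; get_block_noalloc stands in for GFS's block-mapping callback):

#include <linux/sched.h>
#include <linux/pagemap.h>
#include <linux/writeback.h>
#include <linux/buffer_head.h>

static int gfs_writepage_sketch(struct page *page, struct writeback_control *wbc)
{
	/* In the middle of a transaction: refuse to write, but return 0 so
	 * the VM does not mistake the skip for an IO error.  Redirtying
	 * keeps the page queued for a later flush. */
	if (current->journal_info) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}
	/* No transaction pending: write the page out normally. */
	return block_write_full_page(page, get_block_noalloc, wbc);
}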

- The balance_dirty_pages_ratelimited function is called inside
get_transaction/set_transaction (i.e. between gfs_trans_begin and
gfs_trans_end), so gfs_writepage should never work. Do you know
if there is some kind of asynchronous (kiocb?) write, so that
gfs_writepage is called later in the same process? If not, what could
make gfs_writepage happen sometimes inside a transaction, and sometimes
not?
Unless the transaction is marked completed, gfs_writepage is a no-op.

- Ext3 should be affected as well, but it isn't. Is that because the
transaction lock is taken for a much shorter period of time, so that
dirty pages that are not flushed when the lock is taken will be
successfully flushed later?
In the ext3 case, it is a journal lock, so it could be unlocked and locked again somewhere. In the GFS case, it is simply a check of whether this process is already in the middle of a transaction, so between entering and leaving do_do_write_buf the transaction state doesn't change.

- Some other file systems in the kernel, NTFS and ReiserFS, make explicit
calls to balance_dirty_pages_ratelimited. But they also redefine some
generic kernel functions. Maybe they have a strong reason to do so?

Again, the proposal looks ok and we are really grateful for the work. However the issue is finally settled, we'll definitely ack the contribution. But since GFS1 has been a stable filesystem, run by many production systems, we would like to be careful about new code. We need more time to look into the runtime behavior. Ping us next Tuesday if we haven't responded by then?

-- Wendy
