[Cluster-devel] Problem in ops_address.c :: gfs_writepage ?

Wendy Cheng wcheng at redhat.com
Wed Feb 21 05:04:33 UTC 2007


Mathieu Avila wrote:

>On Mon, 19 Feb 2007 17:00:10 -0500,
>Wendy Cheng <wcheng at redhat.com> wrote:
>
>>Mathieu Avila wrote:
>>
>>>Hello all,
>>>But precisely, in that situation, there are many occasions when
>>>gfs_writepage cannot perform its duty, because of the transaction
>>>lock.
>
>>Yes, we did have this problem in the past with direct IO and SYNC
>>flag.
>
>I understand that in that case, data are written with the
>get_transaction lock taken, so that gfs_writepage never writes the
>pages, and you go beyond the dirty_ratio limit if you write too many
>pages. How did you get rid of it ?
>
GFS doesn't allow a process to do writepage if it is already in the 
middle of a transaction. In the Direct IO and O_SYNC cases, we did 
something similar to (and simpler than) what you've proposed. Right 
after the transaction is marked completed, we do filemap_fdatawrite() 
and wait for the writes to complete. VM dirty_ratio is not relevant 
there since both O_DIRECT and O_SYNC require data to be (completely) 
flushed anyway. In default buffered write mode (your case), the design 
is to take advantage of VM's writeback - so if flushing can be avoided, 
it is delayed. I guess that's why we get into the issue you described 
in your previous post.
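
For reference, the ordering is roughly this - a sketch, not the actual 
GFS code (gfs_trans_end() and the filemap_* calls are real, the 
surrounding function is illustrative):

#include <linux/fs.h>

struct gfs_sbd;

/* Sketch of the O_DIRECT/O_SYNC fix: flush strictly after the
 * transaction has ended, so the in-transaction check in
 * gfs_writepage no longer blocks this process. */
static int sync_after_transaction(struct gfs_sbd *sdp, struct inode *inode)
{
	int error;

	/* ... journaled work for this write happens here ... */

	gfs_trans_end(sdp);		/* transaction marked completed */

	/* Only now push the dirty pages out and wait for them. */
	error = filemap_fdatawrite(inode->i_mapping);
	if (!error)
		error = filemap_fdatawait(inode->i_mapping);

	return error;
}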

>
>>>[snip]
>>>we've experienced it using the test program "bonnie++" whose
>>>purpose is to test a FS's performance. Bonnie++ makes multiple files
>>>of 1GB when it is asked to run long multi-GB writes. There is no
>>>problem with 5GB (5 files of 1GB) but many machines in the
>>>cluster are OOM killed with 10GB bonnies....
>
>>I would like to know more about your experiments. So these
>>bonnie++(s) are run on each cluster node with independent file sets ?
>
>Exactly. I run "bonnie++ -s 10G" on each node, in different directories
>of the same GFS file system. To make the problem happen reliably and
>more quickly, I tune pdflush by all but disabling it:
>echo 300000 > /proc/sys/vm/dirty_expire_centisecs 
>echo 50000 > /proc/sys/vm/dirty_writeback_centisecs 
>..., so that only /proc/sys/vm/dirty_ratio comes into play. Not doing
>this makes the problem harder to reproduce (but it is still
>reproducible).
>
ok ... so this is a corner case since 10G is a very large number... Did 
you have real applications get into the OOM state, or is this issue only 
seen via this bonnie++ experiment ?

>The problem happens only when it starts the writeback of the dirty
>pages of the 2nd file, once it is done with the 1st file.  We still
>have to determine why. So you can get the same problem with a program
>that:
>- opens 2 files,
>- writes 1GB in the 1st one,
>- writes indefinitely in the second file.
>
GFS actually chunks one big write into 4M pieces. I would expect the 
indefinite write to cause the journal to wrap around, forcing a flush. 
Will need some time to look into this issue.
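
(For reference, the chunking means the buffered write path is driven by 
a loop of roughly this shape - illustrative only; write_one_chunk() is 
a hypothetical helper that wraps gfs_trans_begin()/gfs_trans_end() 
around a single piece:)

#include <linux/kernel.h>
#include <linux/fs.h>

#define GFS_WRITE_CHUNK (4 << 20)	/* the 4M mentioned above */

/* Hypothetical helper: writes one piece under its own transaction. */
static ssize_t write_one_chunk(struct file *file, const char __user *buf,
			       size_t size, loff_t *offset);

/* Illustrative chunking loop, not the actual GFS write path.  Each
 * piece runs under its own transaction, so the journal has a chance
 * to flush (and wrap) between pieces instead of one huge transaction
 * staying open for the whole write. */
static ssize_t chunked_write(struct file *file, const char __user *buf,
			     size_t size, loff_t *offset)
{
	ssize_t done = 0;

	while (size) {
		size_t piece = min_t(size_t, size, GFS_WRITE_CHUNK);
		ssize_t ret;

		ret = write_one_chunk(file, buf + done, piece, offset);
		if (ret < 0)
			return done ? done : ret;

		done += ret;
		size -= ret;
	}

	return done;
}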

>We use a particular block device that doesn't read/write at the same
>speed on all nodes. I don't know if this can help.
>
I doubt it is relevant.

>
>>>Keep a counter of pages in gfs_inode whose value represents those
>>>not written in gfs_writepage, and at the end of do_do_write_buf,
>>>call "balance_dirty_pages_ratelimited(file->f_mapping);" that many
>>>times. The counter is possibly shared by multiple processes, but we
>>>are assured that there is no transaction at that moment, so pages
>>>can be flushed if "balance_dirty_pages_ratelimited" determines
>>>that it must reclaim dirty pages. Otherwise performance is not
>>>affected.
>
>>In general, this approach looks ok if we do have this flushing
>>problem. However, the GFS flush code has been embedded in the glock
>>code, so I would think it would be better to do this within the glock
>>code.
>
>These are points we do not understand.
>
>- I understand that the flushing code is done in the glock code (when
>a lock is demoted or taken by another node, isn't it ?). Why isn't it
>possible to let the kernel decide which pages to flush when it needs
>to ? For example, in this particular case, it is not a good idea to
>flush the pages only when the lock is lost; the kernel needs to flush
>pages before that.
>
pdflush is allowed to flush gfs pages freely, as long as it can get 
the individual page lock. The only restriction is that if the process 
is already in the middle of a transaction, the writepage will return 
without doing anything. The original intention (my guess though - the 
original author left the team two years ago) was that there would be 
plenty of chances for pages to get flushed (by lock demotes, journal 
flushes, etc.). It is apparently an optimization effort to take 
advantage of VM writeback.

>- Why doesn't gfs_writepage return an error when the page cannot be
>flushed ?
>
Well ... no comment :) .. but really, think about it - what else can we 
do other than returning 0 ? We don't want the VM to misinterpret the 
return code as an IO error.
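
For concreteness, the check is roughly of this shape - a sketch from 
memory, not a verbatim copy of ops_address.c (get_transaction is the 
per-process transaction pointer from your next question, and 
get_block_noalloc stands in for the usual no-allocation block mapper):

#include <linux/sched.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/writeback.h>
#include <linux/buffer_head.h>

struct gfs_trans;

/* GFS tracks the per-process transaction through
 * current->journal_info (the get_transaction macro). */
#define get_transaction ((struct gfs_trans *)current->journal_info)

static int gfs_writepage(struct page *page, struct writeback_control *wbc)
{
	/* Already inside a transaction?  Writing the page now is
	 * unsafe, so leave it dirty for a later flush.  Returning 0
	 * rather than an error keeps the VM from treating a healthy
	 * page as an IO failure. */
	if (get_transaction) {
		redirty_page_for_writepage(wbc, page);
		unlock_page(page);
		return 0;
	}
	return block_write_full_page(page, get_block_noalloc, wbc);
}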

>- The balance_dirty_pages_ratelimited function is called between
>get_transaction/set_transaction (i.e. between gfs_trans_begin and
>gfs_trans_end), therefore gfs_writepage should never work. Do you know
>if there is some kind of asynchronous (kiocb ?) write so that
>gfs_writepage is called later by the same process ? If not, what could
>make gfs_writepage happen sometimes inside a transaction, and sometimes
>not ?
>
Unless the transaction is marked completed, gfs_writepage is a no-op.

>- Ext3 should be affected as well, but it isn't. Is that because the
>transaction lock is taken for a much shorter period of time, so that
>dirty pages that are not flushed while the lock is held will be
>successfully flushed later ?
>
In the ext3 case, it is a journal lock, so it could be unlocked and 
locked again somewhere. In the GFS case, it is simply a check of 
whether this process is already in the middle of a transaction. So 
between entering and leaving do_do_write_buf, the transaction state 
doesn't change.
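
(Concretely, my understanding of the mechanism - a sketch, not the 
exact code, with set_transaction per the macros you mentioned:)

#include <linux/sched.h>

struct gfs_trans;

/* GFS's "transaction lock" here is just a per-process pointer
 * (current->journal_info), set on entry and cleared on exit, with no
 * unlock/relock in between - unlike ext3's journal handling. */
static void trans_begin_sketch(struct gfs_trans *tr)
{
	current->journal_info = tr;	/* what set_transaction does */
}

static void trans_end_sketch(void)
{
	current->journal_info = NULL;	/* only after this can this
					 * process's writepage make
					 * progress again */
}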

>- Some other file systems in the kernel, NTFS and ReiserFS, make
>explicit calls to balance_dirty_pages_ratelimited:
>http://lxr.linux.no/source/fs/ntfs/file.c?v=2.6.18;a=x86_64#L284
>http://lxr.linux.no/source/fs/ntfs/file.c?v=2.6.18;a=x86_64#L2101
>http://lxr.linux.no/source/fs/reiserfs/file.c?v=2.6.11;a=x86_64#L1351
>But they also redefine some generic functions from the kernel. Maybe
>they have a strong reason to do so ?
>
Again, the proposal looks ok and we are really grateful for the work. 
However the issue is finally settled, we'll definitely ack the 
contribution. However, since GFS1 has been a stable filesystem, run by 
many production systems, we would like to be careful about new code. 
We'll need more time to look into the run time behavior. Ping us next 
Tuesday if we haven't responded by then ?
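
For the record, here is how I read the proposal, as a rough sketch 
(i_unflushed_pages is a placeholder name, not an existing gfs_inode 
field):

/* In gfs_writepage(), inside the in-transaction bail-out path: */
	atomic_inc(&ip->i_unflushed_pages);	/* page left dirty, remember it */

/* At the end of do_do_write_buf(), once the transaction has ended: */
	while (atomic_read(&ip->i_unflushed_pages) > 0) {
		atomic_dec(&ip->i_unflushed_pages);
		balance_dirty_pages_ratelimited(file->f_mapping);
	}

(Racy if shared between processes, as you note, but the worst case 
should be an extra or a missing throttle call.)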

-- Wendy
