[Linux-cluster] Freeze with cluster-2.03.11

Kadlecsik Jozsef kadlec at mail.kfki.hu
Fri Apr 3 15:12:11 UTC 2009


On Fri, 3 Apr 2009, Wendy Cheng wrote:

> Kadlecsik Jozsef wrote:
> > On Thu, 2 Apr 2009, Wendy Cheng wrote:
> > > > > Kadlecsik Jozsef wrote:
> > > > >             
> > > > > > - commit 82d176ba485f2ef049fd303b9e41868667cebbdb
> > > > > >   gfs_drop_inode as .drop_inode replacing .put_inode.
> > > > > >   .put_inode was called without holding a lock, but .drop_inode
> > > > > >   is called under inode_lock held. Might it be a problem
> > > > > >                 
> > > Based on code reading ...
> > > 1. iput() gets inode_lock (a spin lock)
> > > 2. iput() calls iput_final()
> > > 3. iput_final() calls filesystem drop_inode(), followed by
> > > generic_drop_inode()
> > > 4. generic_drop_inode() unlock inode_lock after doing all sorts of fun
> > > things
> > > with the inode
> > > 
> > > So look to me that generic_drop_inode() statement within gfs_drop_inode()
> > > should be removed. Otherwise you would get double unlock and double list
> > > free.
> > 
> > I think those function calls are right: iput_final calls either the
> > filesystem drop_inode function (in this case gfs_drop_inode) or
> > generic_drop_inode. There's no double call of generic_drop_inode. However
> > gfs_sync_page_i (and in turn filemap_fdatawrite and filemap_fdatawait) is
> > now called under inode_lock held and that was not so in previous versions.
> > But I'm just speculating.
> 
> It *is* called twice unless my eyes deceive me
> 
> static inline void iput_final(struct inode *inode)
> {
> const struct super_operations *op = inode->i_sb->s_op;
> void (*drop)(struct inode *) = generic_drop_inode;
> 
> if (op && op->drop_inode)
> drop = op->drop_inode; /* gfs call generic_drop_inode() */
> drop(inode); /* second call into generic_drop_inode() again. */
> }

No, the line 'drop = op->drop_inode;' is just an assignment (there's no 
function argument): if there's a filesystem-specific drop_inode 
function, that is assigned to 'drop', overwriting thus the originally 
initialized 'generic_drop_inode' value of the 'drop' variable as function 
pointer. And the last line is only the function call, with the proper 
argument.
 
> > I won't get a chance to start a test before Monday, sorry. 
> 
> I'll be traveling next week as well. However, a few cautious words here:
> 
> Even this "fix" eventually solves your hang, running GFS on newer 
> kernels with production system simply is *not* a good idea.

That might be so, but this is a catch: at least we must test GFS2 before 
migrating from GFS1 to GFS2. That requires a recent kernel, with working 
GFS1 and GFS2 support. A migration of an in-production system cannot be 
started lightheartedly, and transforming GFS1 to GFS2 wont' be easy: that 
needs downtime, fresh backups created after bringing down the systems, 
backup verification, converting/creating the GFS2 volumes, restoring the 
data. And crossing fingers that it must not be undone if something goes 
wrong. It's not the same as just replacing an older kernel and an older 
package.

Best regards,
Jozsef
--
E-mail : kadlec at mail.kfki.hu, kadlec at blackhole.kfki.hu
PGP key: http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address: KFKI Research Institute for Particle and Nuclear Physics
         H-1525 Budapest 114, POB. 49, Hungary




More information about the Linux-cluster mailing list