[Linux-cluster] GFS2 processes getting stuck in WCHAN=dlm_posix_lock

Steven Whitehouse swhiteho at redhat.com
Mon Nov 2 11:42:49 UTC 2009


Hi,

On Fri, 2009-10-30 at 19:27 -0400, Allen Belletti wrote:
> Hi All,
> 
> As I've mentioned before, I'm running a two-node clustered mail server 
> on GFS2 (with RHEL 5.4).  Nearly all of the time, everything works 
> great.  However, going all the way back to GFS1 on RHEL 5.1 (I think it 
> was), I've had occasional locking problems that force a reboot of one or 
> both cluster nodes.  Lately I've paid closer attention since it's been 
> happening more often.
> 
> I'll notice the problem when the load average starts rising.  It's 
> always tied to "stuck" processes, and I believe always tied to IMAP 
> clients (I'm running Dovecot.)  It seems like a file belonging to user 
> "x" (in this case, "jforrest") will become locked in some way, such that 
> every IMAP process tied to that user will get stuck on the same thing.  
> Over time, as the user keeps trying to read that file, more & more 
> processes accumulate.  They're always in state "D" (uninterruptible 
> sleep), and always on "dlm_posix_lock" according to WCHAN.  The only way 
> I'm able to get out of this state is to reboot.  If I let it persist for 
> too long, I/O generally stops entirely.
> 
> This certainly seems like it ought to have a definite solution, but I've 
> no idea what it is.  I've tried a variety of things using "find" to 
> pinpoint a particular file, but everything belonging to the affected 
> user seems just fine.  At least, I can read and copy all of the files, 
> and do a stat via ls -l.
> 
> Is it possible that this is a bug, not within GFS at all, but within 
> Dovecot IMAP?
> 
> Any thoughts would be appreciated.  It's been getting worse lately and 
> thus no fun at all.
> 
> Cheers,
> Allen
> 
Do you know if Dovecot IMAP uses signals at all? That would be the first
thing that I'd look at. The other thing to check is whether it makes use
of F_GETLK, and in particular the l_pid field. strace should be able to
answer both of those questions (except for the l_pid field of course, but
the chances are that if it calls F_GETLK and then sends a signal, it's
also using the l_pid field).

Steve.
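
For reference, the F_GETLK / l_pid pattern mentioned above looks roughly
like the sketch below. This is only an illustration of the fcntl() call,
not code taken from Dovecot; the helper name query_lock_holder is
hypothetical. An application that then passes the returned PID to kill()
would be doing the "F_GETLK and then sends a signal" thing.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: ask whether a whole-file write lock on fd would
 * be blocked by a lock held by another process.
 * Returns 1 and stores the reported holder PID in *holder if a
 * conflicting lock exists, 0 if the lock could be granted, and -1 on
 * error (errno is set by fcntl).
 */
int query_lock_holder(int fd, pid_t *holder)
{
    struct flock fl = {0};

    fl.l_type = F_WRLCK;      /* the lock we would like to take */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;             /* 0 means "through end of file" */

    if (fcntl(fd, F_GETLK, &fl) == -1)
        return -1;

    if (fl.l_type == F_UNLCK)
        return 0;             /* nothing in the way */

    *holder = fl.l_pid;       /* PID reported for the conflicting lock */
    return 1;
}
```

In an strace log this would show up as an fcntl() call with F_GETLK,
possibly followed by a kill() on the same PID.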




