[Linux-cluster] GFS2 processes getting stuck in WCHAN=dlm_posix_lock

Mon Nov 2 17:11:27 UTC 2009

On Fri, Oct 30, 2009 at 07:27:23PM -0400, Allen Belletti wrote:
> I'll notice the problem when the load average starts rising.  It's 
> always tied to "stuck" processes, and I believe always tied to IMAP 
> clients (I'm running Dovecot.)  It seems like a file belonging to user 
> "x" (in this case, "jforrest") will become locked in some way, such that
> every IMAP process tied to that user will get stuck on the same thing.
> Over time, as the user keeps trying to read that file, more & more 
> processes accumulate.  They're always in state "D" (uninterruptible 
> sleep), and always on "dlm_posix_lock" according to WCHAN.  The only way 
> I'm able to get out of this state is to reboot.  If I let it persist for 
> too long, I/O generally stops entirely.

Next time, try to collect all the following information as soon as you can
after the first process gets stuck:

- ps showing pid of stuck/"D" process(es) and WCHAN
- which file they are stuck trying to lock
  (and its inode number; you may need to wait until after the
   reboot to run ls -li on the file to get it)
- group_tool dump plocks <fsname> from all the nodes
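
A rough sketch of how the first two items could be scripted, so the data
gets captured quickly when the problem shows up. This is only an
illustration: the function names (parse_stuck, collect, inode_of) are made
up here, and it assumes a procps-style ps that accepts the
"-eo pid,stat,wchan:32,comm" format. The group_tool step is not covered
and still has to be run by hand on every node.

```python
import os
import subprocess

# Assumed procps format string; wchan:32 widens the column so that
# "dlm_posix_lock" is not truncated.
PS_CMD = ["ps", "-eo", "pid,stat,wchan:32,comm"]


def parse_stuck(ps_text, wchan="dlm_posix_lock"):
    """Return (pid, stat, wchan, comm) for D-state processes waiting in `wchan`."""
    stuck = []
    for line in ps_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) < 4:
            continue
        pid, stat, where, comm = fields
        # STAT may carry suffixes such as "D+" or "Ds", so test the first letter.
        if stat.startswith("D") and where == wchan:
            stuck.append((int(pid), stat, where, comm))
    return stuck


def collect():
    """Run ps once and return the stuck processes (hypothetical helper)."""
    out = subprocess.run(PS_CMD, capture_output=True, text=True, check=True).stdout
    return parse_stuck(out)


def inode_of(path):
    """Return the inode number, the same value ls -li prints first."""
    return os.stat(path).st_ino
```

Calling collect() from a cron job or a watch loop and logging the result
with a timestamp would record when the first process got stuck. inode_of()
touches the filesystem, so like ls -li it may have to wait until after the
reboot.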

I'm guessing that dovecot does some "unusual" combinations of locking,
closing, renaming, unlinking files.  Those combinations are especially
prone to races and bugs that cause posix lock state to get out of sync.

Dave