Re: [Linux-cluster] GFS2: processes stuck in "just schedule"


On Thu, 2009-12-03 at 17:30 -0500, Allen Belletti wrote:
> Hi All,
> After Steve and the Red Hat guys dug into my nasty crashdump (thanks 
> all!), I believe I'm down to the last GFS2 problem on our mail cluster, 
> but it's a common one.
> I've always had trouble with processes getting stuck on GFS2 access and 
> queuing up.  Since the 5.4 upgrade and moving to the proper GFS2 kernel 
> module, it's changed but not gone away.  Every few days now, I'm seeing 
> processes getting stuck with WCHAN=just_schedule.  Once this starts 
> happening, both cluster nodes will accumulate them rapidly, which 
> eventually brings IO to a halt.  The only way I've found to escape is 
> via a reboot, sometimes of one, sometimes of both nodes.
> Since there's no crash, I don't get any useful debug information.  
> Outside of this one repeating glitch, performance is great and all is 
> well.  If anyone can suggest ways of gathering more data about the 
> problem, or possible solutions, I would be grateful.
> Thanks,
> Allen
This is typical of what happens when there is contention on a glock
between two (or more) nodes. There is a mechanism which is supposed to
mitigate the issue: each node is allowed to hold on to a glock for a
minimum period of time, which is designed to ensure that some work is
done each time a node acquires a glock. If your storage is particularly
slow, and/or possibly depending upon the exact I/O pattern, it may not
always be 100% effective.
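
To gather data the next time processes start queuing up, something along
these lines could record which ones are waiting. This is only a minimal
sketch, assuming Python 3.7+ and the procps "ps" command on each node; the
function names stuck_processes() and snapshot() are made up for
illustration and are not part of GFS2 or the cluster tools.

```python
import subprocess


def stuck_processes(ps_output, wchan="just_schedule"):
    """Return (pid, command) pairs whose WCHAN column equals `wchan`.

    `ps_output` is the text printed by `ps -eo pid,wchan:32,comm`.
    """
    matches = []
    for line in ps_output.splitlines():
        fields = line.strip().split(None, 2)
        # Skip the header line and anything that is not "pid wchan comm".
        if len(fields) != 3 or not fields[0].isdigit():
            continue
        if fields[1] == wchan:
            matches.append((int(fields[0]), fields[2]))
    return matches


def snapshot():
    """Run ps once and return the processes currently in just_schedule."""
    out = subprocess.run(
        ["ps", "-eo", "pid,wchan:32,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return stuck_processes(out)


if __name__ == "__main__":
    for pid, comm in snapshot():
        print(pid, comm)
```

Running this periodically on both nodes and comparing the output shows
when the backlog begins and which commands are involved. It does not by
itself say which glock or inode they are waiting on.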

In the first instance though, see if you can find an inode which is
being contended from both nodes as that will most likely be the culprit,


