We have an 8-node cluster running SASgrid. The core SAS components are under RHCS (rgmanager) control, but there are also user/client jobs, started manually and from cron, that run outside of RHCS. A few times now we have hit the following problem: when the gfs init script is called to unmount all the file systems and kills off every process using the GFS file systems, GFS on the other nodes locks up and hangs. The node leaving the cluster via a reboot appears to have left cleanly (cman_tool services doesn't show any *WAIT* states), but everything is hung and only a complete reboot of the cluster gets things going again. The gfs init script uses fuser to try to kill the processes gracefully and then falls back to a -9. We are wondering whether that -9 could be leaving locks behind in DLM and causing this hang.
Is this possible? I would think that if a node has cleanly left the cluster, any locks held by that node would be released. Is there a way to display locks that may still exist for the node that is down? And lastly, is there a way to force the release of those locks without rebooting the whole cluster? I've been searching the linux-cluster archives with little success.
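To be clear about the kill sequence I am describing, here is a rough sketch in Python of the "ask nicely, then -9" pattern. This is not the actual gfs init script (which is a shell script driving fuser). The function name `stop_process` and the `grace_seconds` parameter are made up for illustration, and it acts on a child process it started itself rather than on PIDs found by fuser.

```python
import signal
import subprocess
import sys


def stop_process(proc, grace_seconds=5.0):
    """Send SIGTERM, wait up to grace_seconds, then fall back to SIGKILL.

    `proc` is a subprocess.Popen object. Returns the name of the last
    signal that had to be sent ("SIGTERM" or "SIGKILL").
    """
    proc.terminate()  # SIGTERM: the process gets a chance to clean up
    try:
        proc.wait(timeout=grace_seconds)
        return "SIGTERM"
    except subprocess.TimeoutExpired:
        # The process ignored or did not finish handling SIGTERM in time.
        proc.kill()   # SIGKILL (-9): cannot be caught, no cleanup runs
        proc.wait()   # reap the child so it does not linger as a zombie
        return "SIGKILL"


if __name__ == "__main__":
    # Hypothetical demo: a child that just sleeps, so SIGTERM is enough.
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    print(stop_process(child, grace_seconds=0.5))
```

The point of the sketch is only the second branch: a process that gets the -9 never runs any of its own cleanup code, which is why we suspect it in the lock problem above.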