Re: [Linux-cluster] lock_gulm heartbeat

Edward Mann wrote:

I have a GFS cluster that has been running fine for about 3 months now. I am only using 2 machines, and the storage is a FireWire drive. Over the weekend I started to get: lock_gulmd_core[1100]: Failed to receive a timely heartbeat reply from Master. (t:1108583425506998 mb:1)

and after 2, which is what I allowed for missed heartbeats, the GFS
slave would die. I moved the allowed missed heartbeats up to 5 and have
seen it miss as many as 4 in a row. The only thing that has changed on the
machine is that I added new clients to process files once they are placed
on the machine. I am using FAM to notify my app that a new file is

Any ideas on what I should look at? How can I diagnose this problem? The
communication between the two machines seems fine. I can ping both
hosts. I am really at a loss as to what to look for.

It sounds like it might be a load problem. The node is busy doing other work and doesn't give the gulm core process enough time to do heartbeats. You should try increasing the heartbeat_rate some, and maybe nicing the lock_gulmd processes some.

If you want to try tracking all of the heartbeat messages, set the 'heartbeat' verbosity flag.
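
Once the messages are in the log, a small script can show how the misses cluster over time, which helps to line them up with periods of load. This is only a rough sketch in Python. The pattern is built from the single log line quoted above. Treating "t:" as microseconds since the Unix epoch and "mb:" as the running count of missed beats are assumptions, not something confirmed in this thread, and the names parse_misses and worst_streak are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Pattern derived only from the one log line quoted in this thread.
MISS_RE = re.compile(
    r"lock_gulmd_core\[(?P<pid>\d+)\]: Failed to receive a timely heartbeat "
    r"reply from Master\. \(t:(?P<t>\d+) mb:(?P<mb>\d+)\)"
)


def parse_misses(lines):
    """Return a list of (timestamp, mb) tuples, one per matching log line."""
    misses = []
    for line in lines:
        match = MISS_RE.search(line)
        if match is None:
            continue
        # Assumption: "t" is microseconds since the Unix epoch.
        seconds = int(match.group("t")) // 1_000_000
        when = datetime.fromtimestamp(seconds, tz=timezone.utc)
        # Assumption: "mb" is the current count of consecutive missed beats.
        misses.append((when, int(match.group("mb"))))
    return misses


def worst_streak(misses):
    """Return the largest mb value seen, or 0 if there were no misses."""
    return max((mb for _, mb in misses), default=0)


if __name__ == "__main__":
    sample = [
        "lock_gulmd_core[1100]: Failed to receive a timely heartbeat "
        "reply from Master. (t:1108583425506998 mb:1)",
    ]
    for when, mb in parse_misses(sample):
        print(when.isoformat(), "mb =", mb)
    print("worst streak:", worst_streak(parse_misses(sample)))
```

To run it against a real log, pass the lines of the syslog file to parse_misses in place of the sample list. If the worst streak stays close to the allowed number of misses only while the new clients are processing files, that points toward load rather than the network.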

Michael Conrad Tadpol Tilstra
AH! Get off my leg! You're Not my type!!!!

