[Linux-cluster] multiple issues

Greg Forte gforte at leopard.us.udel.edu
Fri Jan 20 15:56:44 UTC 2006


1) the "easy" one first - the man page for clusvcadm lists a -l option 
for locking the service managers.  Running clusvcadm shows that this 
option is no longer available.  The man page also references the command 
clushutdown, saying it is the preferred way of performing this action, 
but my system has a man page for clushutdown and no binary.  So 
... how does one go about locking the service managers now?

2) I was having trouble getting my services restarted.  Using 'clusvcadm 
-e httpd' (for example, where I have a service I've named httpd which 
sets up an IP address and starts httpd from the /etc/init.d/httpd 
script), it complained with the oh-so-informative message:

  <err> #43: Service httpd has failed; can not start.

I read somewhere that services have to be disabled and re-enabled after 
a failure, so I tried -d instead and got the following:

  <notice> stop on script "httpd init script" returned 1 (generic error)
  <crit> #12: RG httpd failed to stop; intervention required.

I finally figured out that I had to manually start the service on a 
node, then do clusvcadm -d, then do clusvcadm -e.  Presumably the first 
step would not have been necessary if the httpd script didn't return an 
error status when it is passed stop and the daemon is not already 
running.

Any opinions on whether it makes sense to alter init scripts so that 
'stop' returns success when the daemon is not running (so that doing 
clusvcadm -d on a not-running service would maybe work)?
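
To make the question concrete, here is a minimal sketch of a "stop is 
not an error when nothing is running" wrapper, written in Python rather 
than as a patch to the init script itself.  The function name 
idempotent_stop is made up for illustration, and the sketch assumes the 
init script follows the usual convention that 'status' exits 0 only 
while the daemon is running.  Whether the service manager would accept 
such a wrapper in place of the real script is untested.

```python
import subprocess


def idempotent_stop(script):
    """Run `<script> stop`, treating an already-stopped daemon as success.

    Assumption: `<script> status` exits 0 only while the daemon is running
    (non-zero, e.g. 3, when it is stopped).
    """
    status = subprocess.run(
        [script, "status"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    if status.returncode != 0:
        # Nothing is running, so there is nothing to stop: report success
        # instead of passing along the script's "generic error".
        return 0
    # The daemon is running: do a real stop and report its actual result.
    return subprocess.run(
        [script, "stop"],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode
```

Calling idempotent_stop("/etc/init.d/httpd") would then return 0 when 
httpd is already down, and the real exit status of 'stop' otherwise, so 
a genuine failure to stop a running daemon is still reported.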

3) the biggie: I have a GFS filesystem on a shared FC storage node 
(AX100).  I haven't put any "real" data on it yet because I'm still 
testing, but yesterday I had the cluster up and running and the 
filesystem mounted on both nodes, and everything seemed peachy.  I came 
back this morning to find that any attempted operation (e.g. 'ls') on 
the shared filesystem came back with "Input/output error", and the 
following appeared in the logs:

node 1:
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: fatal: invalid metadata block
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0:   bh = 352612748 (magic)
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0:   function = gfs_rgrp_read
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0:   file = /usr/src/build/648121-x86_64/BUILD/gfs-kernel-2.6.9-45/smp/src/gfs/rgrp.c, line = 830
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0:   time = 1137747789
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: about to withdraw from the cluster
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: waiting for outstanding I/O
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: telling LM to withdraw
Jan 20 04:03:11 knob kernel: lock_dlm: withdraw abandoned memory
Jan 20 04:03:11 knob kernel: GFS: fsid=MAPS:shared_data.0: withdrawn

node 2:
Jan 20 04:02:10 gully kernel: dlm: shared_data: process_lockqueue_reply id c0012 state 0
Jan 20 04:02:10 gully kernel: dlm: shared_data: process_lockqueue_reply id 90376 state 0
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Trying to acquire journal lock...
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Looking at journal...
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Acquiring the transaction lock...
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Replaying journal...
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Replayed 0 of 0 blocks
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: replays = 0, skips = 0, sames = 0
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Journal replayed in 1s
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Done
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: fatal: invalid metadata block
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1:   bh = 352612748 (magic)
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1:   function = gfs_rgrp_read
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1:   file = /usr/src/build/648121-x86_64/BUILD/gfs-kernel-2.6.9-45/smp/src/gfs/rgrp.c, line = 830
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1:   time = 1137747791
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: about to withdraw from the cluster
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: waiting for outstanding I/O
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: telling LM to withdraw
Jan 20 04:03:11 gully kernel: lock_dlm: withdraw abandoned memory
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: withdrawn

The time coincides with cron.daily firing, so I'm guessing the culprit 
is slocate (since that's the only job in cron.daily that would have 
touched that filesystem), but I'm not having any luck reproducing it. 
The only thing on that filesystem currently is the webroot, and there 
were no hits at the time.  Any ideas?
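
As a sanity check on the timing, the "time = ..." field in the GFS 
messages can be converted to a wall-clock time and compared with the 
syslog stamps and the cron schedule.  A minimal sketch follows; the 
helper name gfs_time_to_local is made up, it assumes the field is 
seconds since the Unix epoch, and the -5 hour offset assumes the nodes 
log in US Eastern standard time.

```python
from datetime import datetime, timedelta, timezone


def gfs_time_to_local(epoch_seconds, utc_offset_hours):
    """Convert an epoch value to a datetime at a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(epoch_seconds, tz)


# "time = 1137747789" is taken from the node 1 (knob) messages above.
# The -5 offset is an assumption (US Eastern standard time).
print(gfs_time_to_local(1137747789, -5).isoformat())
# prints 2006-01-20T04:03:09-05:00
```

That agrees with the 04:03:09 syslog stamp on node 1, so the field is 
just the time of the withdraw and carries no extra information beyond 
what syslog already recorded.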

-g

Greg Forte
gforte at udel.edu
IT - User Services
University of Delaware
302-831-1982
Newark, DE



