
[Linux-cluster] multiple issues



1) The "easy" one first: the man page for clusvcadm lists a -l option for locking the service managers, but running clusvcadm shows that this option is no longer available. The man page also refers to the clushutdown command as the preferred way of performing this action, but my system has a man page for clushutdown and no binary. So ... how does one go about doing this?

2) I was having trouble getting my services restarted. Using 'clusvcadm -e httpd' (for example, where httpd is a service I've defined that sets up an IP address and starts httpd from the /etc/init.d/httpd script), it complained with the oh-so-informative message:

    <err> #43: Service httpd has failed; can not start.

I read somewhere that services have to be disabled and re-enabled after a failure, so I tried -d instead and got the following:

    <notice> stop on script "httpd init script" returned 1 (generic error)
    <crit> #12: RG httpd failed to stop; intervention required.

I finally figured out that I had to manually start the service on a node, then run clusvcadm -d, then run clusvcadm -e. Presumably the first step would not have been necessary if the httpd script didn't return an error status when it is passed stop and httpd is not already running.

Any opinions on whether it makes sense to alter init scripts so that stop is not treated as an error when the daemon is not running (so that clusvcadm -d on a not-running service might then work)?
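To make the idea concrete, here is a minimal sketch of a wrapper that leaves the stock init script untouched and only changes how stop behaves. The names INIT_SCRIPT, lenient_stop and wrapper_main are hypothetical and not part of rgmanager, and the sketch assumes the init script's status action exits 0 only while the daemon is running. I have not verified this against the cluster software; it only illustrates the exit-status behaviour being discussed.

```shell
#!/bin/sh
# Sketch of a wrapper around an init script. INIT_SCRIPT, lenient_stop and
# wrapper_main are hypothetical names, not part of rgmanager.
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/httpd}"

# Stop the service, but report success if it is already stopped.
# Assumes "status" exits 0 only while the daemon is running.
lenient_stop() {
    if "$INIT_SCRIPT" status >/dev/null 2>&1; then
        # Daemon is running: a failed stop is still reported as a failure.
        "$INIT_SCRIPT" stop
    else
        # Nothing to stop, so do not return an error.
        return 0
    fi
}

# Route "stop" through lenient_stop; pass every other action through unchanged.
wrapper_main() {
    case "$1" in
        stop) lenient_stop ;;
        *)    "$INIT_SCRIPT" "$@" ;;
    esac
}
```

A wrapper file built this way would end with a call to wrapper_main "$@", and the service's script resource would point at the wrapper rather than at /etc/init.d/httpd. A stop that genuinely fails while the daemon is running still returns non-zero, so a real failure is not hidden.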

3) The biggie: I have a GFS filesystem on a shared FC storage node (AX100). I haven't put any "real" data on it yet because I'm still testing, but yesterday I had the cluster up and running with the filesystem mounted on both nodes, and everything seemed peachy. I came back this morning to find that any attempted operation (e.g. 'ls') on the shared filesystem came back with "Input/output error", and the following had appeared in the logs:

node 1:
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: fatal: invalid metadata block
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: bh = 352612748 (magic)
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: function = gfs_rgrp_read
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0:   file = /usr/src/build/648121-x86_64/BUILD/gfs-kernel-2.6.9-45/smp/src/gfs/rgrp.c, line = 830
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: time = 1137747789
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: about to withdraw from the cluster
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: waiting for outstanding I/O
Jan 20 04:03:09 knob kernel: GFS: fsid=MAPS:shared_data.0: telling LM to withdraw
Jan 20 04:03:11 knob kernel: lock_dlm: withdraw abandoned memory
Jan 20 04:03:11 knob kernel: GFS: fsid=MAPS:shared_data.0: withdrawn

node 2:
Jan 20 04:02:10 gully kernel: dlm: shared_data: process_lockqueue_reply id c0012 state 0
Jan 20 04:02:10 gully kernel: dlm: shared_data: process_lockqueue_reply id 90376 state 0
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Trying to acquire journal lock...
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Looking at journal...
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Acquiring the transaction lock...
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Replaying journal...
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Replayed 0 of 0 blocks
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: replays = 0, skips = 0, sames = 0
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Journal replayed in 1s
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: jid=0: Done
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: fatal: invalid metadata block
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: bh = 352612748 (magic)
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: function = gfs_rgrp_read
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1:   file = /usr/src/build/648121-x86_64/BUILD/gfs-kernel-2.6.9-45/smp/src/gfs/rgrp.c, line = 830
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: time = 1137747791
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: about to withdraw from the cluster
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: waiting for outstanding I/O
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: telling LM to withdraw
Jan 20 04:03:11 gully kernel: lock_dlm: withdraw abandoned memory
Jan 20 04:03:11 gully kernel: GFS: fsid=MAPS:shared_data.1: withdrawn

The time coincides with cron.daily firing, so I'm guessing the culprit is slocate (since that's the only job in cron.daily that would have touched that filesystem), but I'm not having any luck reproducing it. The only thing on that filesystem currently is the webroot, and there were no hits at the time. Any ideas?

-g

Greg Forte
gforte udel edu
IT - User Services
University of Delaware
302-831-1982
Newark, DE

