
[Linux-cluster] Probably some silly mistake setting up a cluster ?



Greetings,

I am trying to set up a cluster with (for now) two nodes. My only reason
is the semantic guarantees GFS gives when accessing shared files (that is,
I am not interested in fault tolerance, performance or anything else).
Unfortunately, I keep running into all sorts of problems, for
example:

   - After a few hours of intensive workload, the cluster sometimes
simply stops. All file system calls block, but cman_tool status and
group_tool status both report that everything is all right. A soft reboot
is not possible because various services wait indefinitely, and after
power cycling, fsck finds inconsistencies on the file system.

   - Sometimes, when trying to execute a binary on the file system, I get
execvp returning permission denied when it should not, but when I try
again, everything is all right. I sometimes even observe this when
trying to start a script on the file system, as if the interpreter of
the script (which is on a different file system altogether) had wrong
permissions. Again, simply trying one more time makes everything work.
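
To put a number on how often the second problem happens, something like the following could wrap the launches. This is only a minimal sketch: the function name, the retry count and the delay are my own choices, and it assumes the failure surfaces in Python as PermissionError (EACCES from the exec call), which is what subprocess raises when the exec itself is denied.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=0.1, runner=subprocess.run):
    """Run cmd, retrying when the exec itself fails with EACCES.

    Returns (result, failures), where failures counts the attempts
    that raised PermissionError before one finally succeeded.
    The runner parameter exists only so the launcher can be swapped out.
    """
    failures = 0
    for _ in range(attempts):
        try:
            return runner(cmd), failures
        except PermissionError:
            # The spurious "permission denied": count it and try again.
            failures += 1
            time.sleep(delay)
    raise PermissionError(
        f"{cmd[0]}: permission denied on all {attempts} attempts"
    )
```

Calling run_with_retry(["/mnt/gfs2/some-binary"]) (a made-up path) and logging the returned failure count over a long run would show whether the denials cluster around particular times or load levels.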

The config of the cluster seems relatively simple:

   - i686 single CPU node
      - file system device accessible over iSCSI
      - cluster subnet (unfortunately) connected over OpenVPN
   - x86_64 eight CPU virtual node
      - file system device provided by the host, which uses iSCSI
   - both nodes resolve into the same subnet using /etc/hosts
   - nothing except a single GFS2 file system is mounted
   - fencing uses fence_manual
   - both nodes run Fedora 8

Config attached, not that there is anything unusual in it.
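
For a quick overview of what the attached file actually sets, a short script along these lines can summarize it. It is only a sketch: it reads the element and attribute names that appear in the attached cluster.conf (clusternode, cman, fencedevice and so on), and the keys of the returned dictionary are my own invention, not anything the cluster tools produce.

```python
import xml.etree.ElementTree as ET


def summarize_cluster_conf(xml_text):
    """Return a small dict describing the nodes, votes and fencing in a cluster.conf."""
    root = ET.fromstring(xml_text)
    cman = root.find("cman")
    expected = cman.get("expected_votes") if cman is not None else None

    # Map each fence device name to the agent that implements it.
    agents = {d.get("name"): d.get("agent") for d in root.iter("fencedevice")}

    nodes = []
    for node in root.iter("clusternode"):
        device_names = [d.get("name") for d in node.iter("device")]
        nodes.append({
            "name": node.get("name"),
            "nodeid": int(node.get("nodeid")),
            "votes": int(node.get("votes", "1")),
            "fence_agents": [agents.get(name) for name in device_names],
        })

    return {
        "cluster": root.get("name"),
        "two_node": cman is not None and cman.get("two_node") == "1",
        "expected_votes": int(expected) if expected is not None else None,
        "nodes": nodes,
    }
```

Feeding it the contents of the attached file gives one entry per node with its votes and the fence agent it resolves to, which makes it easy to compare what each node thinks the config is.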

As an absolute novice, I am probably making some glaringly obvious silly
mistake which results in the very weird behavior described above, but
try as I might, I do not see anything that could cause this.

Thanks for any advice, Petr


<?xml version="1.0" ?>
<cluster config_version="1" name="monoton">
	<fence_daemon post_fail_delay="-1" post_join_delay="-1"/>
	<clusternodes>
		<clusternode name="delta.dsrb" nodeid="1" votes="1">
			<fence>
				<method name="1">
					<device name="Fencer" nodename="delta.dsrb"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="ichi.dsrb" nodeid="101" votes="1">
			<fence>
				<method name="1">
					<device name="Fencer" nodename="ichi.dsrb"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<cman expected_votes="1" two_node="1"/>
	<fencedevices>
		<fencedevice agent="fence_manual" name="Fencer"/>
	</fencedevices>
	<rm>
		<failoverdomains/>
		<resources/>
	</rm>
</cluster>


