
[Linux-cluster] CLVM and Cluster Service Migration issues



Hi all,

I've got a three-node CentOS 5 x86-64 CS/GFS cluster running kernel 2.6.18-53.el5. Last night, I tried to grow two of the file systems on it. I ran lvextend and then gfs_grow on node3, with node2 serving the file systems out to the local network. While gfs_grow was running, the service failed on node2 and I couldn't get it to restart. It looked as though neither node1 nor node2 was aware of the lvextend I had run on node3. I had to reboot the entire cluster to bring everything back online.
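For the record, the sequence I ran was roughly the following, sketched as a shell function. The VG/LV names, the size, and the mount point here are stand-ins, not the real ones:

```shell
# Rough sketch of the grow procedure; VG/LV names, size, and mount point
# are placeholders. lvextend and gfs_grow both ran on node3 while node2
# was serving the file system.
grow_gfs() {
    # clvmd must be running on every node for the extend to propagate:
    service clvmd status || return 1
    # Extend the clustered logical volume (run on any one node):
    lvextend -L +50G /dev/clustervg/fs1 || return 1
    # Grow the mounted GFS file system (run on a node with it mounted):
    gfs_grow /mnt/fs1
}
```

My understanding was that with clustered locking in place, the other nodes should pick up the new LV size without any extra steps.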

This afternoon, node2 fenced node3. Nothing migrated, and the entire cluster needed to be rebooted again to recover. What I noticed after the full reboot is that I seem to be getting initial ARP responses from the wrong nodes, as below:

[root workstation ~]# arping cluster-fs1
ARPING 10.1.1.142 from 10.1.1.101 eth0
Unicast reply from 10.1.1.142 [00:1B:78:D1:88:C2]  0.624ms
Unicast reply from 10.1.1.142 [00:1C:C4:81:9F:66]  0.666ms
Unicast reply from 10.1.1.142 [00:1C:C4:81:9F:66]  0.621ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)
[root workstation ~]# arping cluster-fs2
ARPING 10.1.1.143 from 10.1.1.101 eth0
Unicast reply from 10.1.1.143 [00:1B:78:D1:88:C2]  0.695ms
Unicast reply from 10.1.1.143 [00:1C:C4:81:9F:66]  0.734ms
Unicast reply from 10.1.1.143 [00:1C:C4:81:9F:66]  0.680ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)
[root workstation ~]# arping cluster-fs3
ARPING 10.1.1.144 from 10.1.1.101 eth0
Unicast reply from 10.1.1.144 [00:1C:C4:81:9F:66]  0.734ms
Unicast reply from 10.1.1.144 [00:1B:78:D1:88:C2]  0.913ms
Unicast reply from 10.1.1.144 [00:1B:78:D1:88:C2]  0.640ms
Sent 2 probes (1 broadcast(s))
Received 3 response(s)

[root workstation ~]# arping node1
ARPING 10.1.1.131 from 10.1.1.101 eth0
Unicast reply from 10.1.1.1 [00:1B:78:D1:88:C2]  0.771ms
[...]
[root workstation ~]# arping node2
ARPING 10.1.1.132 from 10.1.1.101 eth0
Unicast reply from 10.1.1.2 [00:1C:C4:81:AD:72]  0.681ms
[...]
[root workstation ~]# arping node3
ARPING 10.1.1.133 from 10.1.1.101 eth0
Unicast reply from 10.1.1.3 [00:1C:C4:81:9F:66]  0.631ms
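A quick way to flag this duplicate-responder pattern automatically is to count distinct MACs per replying IP in the arping output. A sketch, using a sample modeled on the replies above (in practice you would pipe in live `arping` output instead):

```shell
# Sample arping replies, modeled on the output above; for a live check,
# substitute real `arping -c 2 <host>` output.
sample='Unicast reply from 10.1.1.142 [00:1B:78:D1:88:C2]  0.624ms
Unicast reply from 10.1.1.142 [00:1C:C4:81:9F:66]  0.666ms
Unicast reply from 10.1.1.142 [00:1C:C4:81:9F:66]  0.621ms'

# Count distinct MACs per replying IP; more than one means two hosts
# both think they own the address.
dupes=$(echo "$sample" | awk '/Unicast reply/ {
    ip = $4; mac = $5
    if (!seen[ip, mac]++) count[ip]++
}
END { for (ip in count) if (count[ip] > 1) print ip, "answered by", count[ip], "MACs" }')
echo "$dupes"
```

Run against the sample above, this reports 10.1.1.142 as answered by 2 MACs, which matches what I'm seeing by eye on all three service IPs.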

At the time, node1 was supposed to be serving fs1, fs2, and fs3. I'll note that I did forget to run "lvmconf --enable-cluster" when I first set up the volume group, though I made that change before putting the cluster into production.
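For what it's worth, what "lvmconf --enable-cluster" changes is locking_type in /etc/lvm/lvm.conf (3 = clustered locking through clvmd), so one sanity check is to confirm every node now agrees. A sketch, with an inline sample standing in for the real config file:

```shell
# "lvmconf --enable-cluster" sets locking_type = 3 (clustered locking via
# clvmd) in /etc/lvm/lvm.conf. The sample here stands in for the real
# file; on a node you would read /etc/lvm/lvm.conf itself.
lvm_conf='global {
    locking_type = 3
}'
status=$(echo "$lvm_conf" | awk '/locking_type/ {
    print (($3 == 3) ? "clustered locking enabled" : "WARNING: local locking")
}')
echo "$status"
```

All three nodes report locking_type = 3 now, so I'd assumed the early misconfiguration was no longer a factor, but I mention it in case it matters.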

Anyone have any thoughts on what's going on and what to do about it?

Thanks,

James

