[Linux-cluster] dlm: message size from 3 too big

James Chamberlain jamesc at exa.com
Thu May 20 18:52:58 UTC 2010


Hi all,

I've got a three-node cluster running CentOS 4.8, GFS-6.1.19-1.el4_8
(GFS 1 filesystems), kernel 2.6.9-89.0.19.ELsmp.  I've seen messages
like those below a couple of times in the last couple of weeks.  Node 3
doesn't go down, so it doesn't get fenced, but DLM is unable to
negotiate locks, so the load average on each node spikes and the
cluster can't serve anything out through NFS.  Has anyone seen
anything like this?  Any idea what to do about it?  Shooting node 3 in
the head has caused the cluster to recover, but I'd like to know how
to fix it rather than work around it.

Thanks,

James

[[Operating normally prior to this point]]
May 20 04:52:50 s12n01 clurgmgrd[7467]: <err> #48: Unable to obtain cluster lock: Connection timed out
May 20 04:53:41 s12n03 clurgmgrd[7476]: <err> #48: Unable to obtain cluster lock: Connection timed out
May 20 04:54:09 s12n03 kernel: dlm: message size from 3 too big 34560(pkt len=386)
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 87-00 00 00 23 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 30 b0 d2 84 02 01 00 00-30 b0 d2 84 02  
01 00 00
May 20 04:54:09 s12n03 kernel: 8e 64 13 80 ff ff ff ff-f0 7d ee 81 02  
01 00 00
May 20 04:54:09 s12n03 kernel: f0 7d ee 81 02 01 00 00-02 02 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: a4 de eb 81 02 01 00 00-00 40 01 00 00  
01 00 00
May 20 04:54:09 s12n03 kernel: b8 7c ee 81 02 01 00 00-ff ff ff ff ff  
ff ff ff
May 20 04:54:09 s12n03 kernel: 82 01 00 00 00 00 00 00-90 de eb 81 02  
01 00 00
May 20 04:54:09 s12n03 kernel: 00 10 00 00 00 00 00 00-00 00 00 00 00  
01 00 00
May 20 04:54:09 s12n03 kernel: b7 6d db b6 6d db b6 6d-be cb 36 a0 ff  
ff ff ff
May 20 04:54:09 s12n03 kernel: 60 fa 06 01 00 01 00 00-82 01 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 82 d1 0d 2a 00 01 00 00-7e 0e 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 d0 0d 2a 00 01 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 03  
00 00 00
May 20 04:54:09 s12n03 kernel: 68 7e ee 81 02 01 00 00-02 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-90 de eb 81 02  
01 00 00
May 20 04:54:09 s12n03 kernel: 01 00 00 00 00 00 00 00-10 50 38 a0 ff  
ff ff ff
May 20 04:54:09 s12n03 kernel: fc ff ff ff 00 00 00 00-98 3c eb 81 02  
01 00 00
May 20 04:54:09 s12n03 kernel: b0 c9 14 80 ff ff ff ff-b2 d1 36 a0 ff  
ff ff ff
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-91 d0 36 a0 ff  
ff ff ff
May 20 04:54:09 s12n03 kernel: a8 3c eb 81 02 01 00 00-87 c9 14 80 ff  
ff ff ff
May 20 04:54:09 s12n03 kernel: ff ff ff ff ff ff ff ff-98 3c eb 81 02  
01 00 00
May 20 04:54:09 s12n03 kernel: 30 3c eb 81 02 01 00 00-c0 86 f2 af 00  
01 00 00
May 20 04:54:09 s12n03 kernel: 12
May 20 04:54:09 s12n03 kernel: 02
May 20 04:54:09 s12n03 kernel: dlm: midcomms: bad header version 0
May 20 04:54:09 s12n03 kernel: dlm: midcomms: cmd=0, flags=0, length=1024, lkid=1711276032, lockspace=0
May 20 04:54:09 s12n03 kernel: dlm: midcomms: base=000001002a0dd000, offset=1024, len=810, ret=1024, limit=00001000 newbuf=0
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 04-00 00 00 66 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 01  
00 01 00
May 20 04:54:09 s12n03 kernel: 03 00 72 00 c0 00 a4 23-17 00 00 01 6a  
01 c9 26
May 20 04:54:09 s12n03 kernel: 00 00 00 00 08 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 84  
34 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 ff 03 01 16-19 70 00 00 ff  
52 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
a6 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 01 00
May 20 04:54:09 s12n03 kernel: 01 00 03 00 72 00 3d 00-7a 2a 17 00 00  
01 13 00
May 20 04:54:09 s12n03 kernel: 42 2c 00 00 00 00 08 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 84 34
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 ff 03-01 16 19 70 00  
00 8c 0a
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 37 00  
00 00 53
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 01 00 01 00 03 00 72 00-49 03 b5 26 17  
00 00 01
May 20 04:54:09 s12n03 kernel: 56 01 69 26 00 00 00 00-08 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 84 34 00 00 00 00 00 00-ff 03 01 16 19  
70 00 00
May 20 04:54:09 s12n03 kernel: dc fc 00 00 00 00 00 00-00 00 00 00 00  
0f 00 00
May 20 04:54:09 s12n03 kernel: 00 6d 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 01 00 01 00 03 00-72 00 75 00 b7  
26 17 00
May 20 04:54:09 s12n03 kernel: 00 01 93 02 4f 26 00 00-00 00 08 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 84 34 00 00 00 00-00 00 ff 03 01  
16 19 70
May 20 04:54:09 s12n03 kernel: 00 00 62 a7 00 00 00 01-00 00 00 00 00  
00 00 50
May 20 04:54:09 s12n03 kernel: 00 00 00 76 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 01 00 01 00-03 00 72 00 8e  
02 2d 26
May 20 04:54:09 s12n03 kernel: 17 00 00 01 81 03 85 27-00 00 00 00 08  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 84 34 00 00-00 00 00 00 ff  
03 01 16
May 20 04:54:09 s12n03 kernel: 19 70 00 00 6c a4 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 44 00 00 00 49 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 01 00-01 00 03 00 72  
00 5b 00
May 20 04:54:09 s12n03 kernel: f0 21 17 00 00 01 3a 02-fb 2b 00 00 00  
00 08 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:09 s12n03 kernel: 00 00 00 00 00 00 84 34-00 00 00 00 00  
00 ff 03
May 20 04:54:09 s12n03 kernel: 01 16 19 70 00 00 ff 74-00 00 00 00 00  
00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 83-00 00 00 00 00  
00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-01 00 01 00 03  
00 72 00
May 20 04:54:10 s12n03 kernel: 9a 02 18 23 17 00 00 01-b9 02 f5 2d 00  
00 00 00
May 20 04:54:10 s12n03 kernel: 08 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-84 34 00 00 00  
00 00 00
May 20 04:54:10 s12n03 kernel: ff 03 01 16 19 70 00 00-ff 83 00 00 00  
00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 75 00 00 00  
00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 01 00 01  
00 03 00
May 20 04:54:10 s12n03 kernel: 72 00 1b 01 86 2a 17 00-00 01 56 02 d8  
28 00 00
May 20 04:54:10 s12n03 kernel: 00 00 08 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 84 34 00  
00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 ff 03 01 16 19 70-00 00 fe 82 00  
00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 64 00  
00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 01  
00 01 00
May 20 04:54:10 s12n03 kernel: 03 00 72 00 c6 01 fc 27-17 00 00 01 e0  
00 f6 28
May 20 04:54:10 s12n03 kernel: 00 00 00 00 08 00 00 00-00 00 00 00 00  
00 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00 00 00 00 00-00 00 00 00 84  
34 00 00
May 20 04:54:10 s12n03 kernel: 00 00 00 00
May 20 04:54:10 s12n03 kernel: ff 03 01 16
May 20 04:54:10 s12n03 kernel: 19 70 00 00
May 20 04:54:10 s12n03 kernel: f7
May 20 04:54:10 s12n03 kernel: a3
May 20 04:54:10 s12n03 kernel: 00
May 20 04:54:10 s12n03 kernel: 00
May 20 04:54:10 s12n03 kernel: dlm: lowcomms: addr=000001002a0dd000, base=0, len=1834, iov_len=3710, iov_base[0]=000001002a0dd72a, read=1448
May 20 04:54:50 s12n01 clurgmgrd[7467]: <err> #50: Unable to obtain cluster lock: Connection timed out
May 20 04:56:41 s12n03 clurgmgrd[7476]: <err> #50: Unable to obtain cluster lock: Connection timed out
May 20 05:02:13 s12n02 clurgmgrd[7527]: <err> #48: Unable to obtain cluster lock: Connection timed out
May 20 05:05:13 s12n02 clurgmgrd[7527]: <err> #50: Unable to obtain cluster lock: Connection timed out
May 20 05:08:13 s12n02 clurgmgrd[7527]: <err> #48: Unable to obtain cluster lock: Connection timed out
[...]
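
As a rough cross-check of the dumps above, the numbers the kernel prints
can be recovered from the first 16 bytes of each hex dump.  The sketch
below is only that: the field offsets (a 16-bit length at byte 6, then
32-bit lkid and lockspace, all little-endian) are an assumption inferred
from this log, not taken from the DLM source, and the names
guess_header and row_to_bytes are made up for illustration.

```python
import struct

# First 16 bytes of each hex dump, copied from the kernel log above.
DUMP_TOO_BIG = "00 00 00 00 00 00 00 87-00 00 00 23 00 00 00 00"
DUMP_BAD_HEADER = "00 00 00 00 00 00 00 04-00 00 00 66 00 00 00 00"


def row_to_bytes(row):
    """Turn one 'xx xx ...-xx xx' log row into raw bytes."""
    return bytes(int(tok, 16) for tok in row.replace("-", " ").split())


def guess_header(row):
    """Decode length, lkid and lockspace from a dumped row.

    ASSUMPTION: 16-bit length at offset 6, followed by 32-bit lkid and
    32-bit lockspace, all little-endian. This layout is inferred from
    the log messages only.
    """
    length, lkid, lockspace = struct.unpack_from("<HII", row_to_bytes(row), 6)
    return {"length": length, "lkid": lkid, "lockspace": lockspace}


if __name__ == "__main__":
    print(guess_header(DUMP_TOO_BIG))
    print(guess_header(DUMP_BAD_HEADER))
```

Under that assumed layout, the first dump decodes to length 34560 (the
"too big" size) and the second to length=1024, lkid=1711276032,
lockspace=0, the same values the midcomms line reports.  That would be
consistent with the receiver reading a header from the wrong place in
the data stream, though the log alone cannot confirm it.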

When I say the load spikes, this is what I mean:

Linux 2.6.9-89.0.19.ELsmp (s12n01)      05/20/2010

12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
[...]
04:00:01 AM         0       820      0.18      1.84      3.66
04:10:01 AM         0       820      2.72      4.03      4.04
04:20:01 AM         0       820      3.57      4.62      4.64
04:30:01 AM         0       820     11.42      7.35      5.44
04:40:01 AM         0       820      4.20      7.51      7.10
04:50:01 AM         0       820      1.69      2.18      4.33
05:00:01 AM         0       820    513.68    406.40    205.61
05:10:01 AM         0       820    530.02    513.44    360.00
05:20:01 AM         0       820    530.06    527.83    440.93
05:30:01 AM         0       820    530.12    529.75    483.33
05:40:01 AM         0       820    530.07    530.04    505.57
05:50:01 AM         0       820    530.08    530.05    517.21
06:00:01 AM         0       820    530.02    530.03    523.29
[...]
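
To line the load figures up against the kernel log, a small script can
pick out the first sample where ldavg-1 jumps.  This is a minimal sketch
that assumes the seven-column layout shown above; the threshold of 100
and the name first_spike are arbitrary choices for illustration.

```python
# A few rows copied from the sar output above.
SAR_ROWS = """\
12:00:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
04:40:01 AM         0       820      4.20      7.51      7.10
04:50:01 AM         0       820      1.69      2.18      4.33
05:00:01 AM         0       820    513.68    406.40    205.61
05:10:01 AM         0       820    530.02    513.44    360.00
"""


def first_spike(text, threshold=100.0):
    """Return (timestamp, ldavg-1) for the first row above threshold.

    Rows that do not have seven columns, or whose fifth column is not a
    number (the header row, '[...]' markers), are skipped.
    """
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 7:
            continue
        try:
            ldavg1 = float(parts[4])
        except ValueError:
            continue
        if ldavg1 > threshold:
            return parts[0] + " " + parts[1], ldavg1
    return None


if __name__ == "__main__":
    print(first_spike(SAR_ROWS))
```

On the rows above this returns the 05:00:01 AM sample, the first one
taken after the dlm errors logged at 04:54:09.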



