[Linux-cluster] Linux clustering (one-node), GFS, iSCSI, clvmd (lock problem)

Tue Oct 16 04:52:04 UTC 2007

Hi All,

I am new to this mailing list, but I've got a locking problem involving 
Linux clustering and iSCSI that plagues me.  It's a pretty serious 
issue: every time I reboot my server, it fails to mount my primary 
iSCSI device, and I have to perform several manual operations to get it 
working again.

Here is some configuration information:

Linux flax.xxx.com 2.6.9-55.0.9.ELsmp #1 SMP Thu Sep 27 18:27:41 EDT 
2007 i686 i686 i386 GNU/Linux

[root@flax ~]# clvmd -V
Cluster LVM daemon version: 2.02.21-RHEL4 (2007-04-17)
Protocol version:           0.2.1

dmesg (excerpted)
iscsi-sfnet: Loading iscsi_sfnet version 4:0.1.11-3
iscsi-sfnet: Control device major number 254
iscsi-sfnet:host3: Session established
scsi3 : SFNet iSCSI driver
  Vendor: Promise   Model: VTrak M500i       Rev: 2211
  Type:   Direct-Access                      ANSI SCSI revision: 04
sdh : very big device. try to use READ CAPACITY(16).
SCSI device sdh: 5859373056 512-byte hdwr sectors (2999999 MB)
SCSI device sdh: drive cache: write back
sdh : very big device. try to use READ CAPACITY(16).
SCSI device sdh: 5859373056 512-byte hdwr sectors (2999999 MB)
SCSI device sdh: drive cache: write back
 sdh: unknown partition table

[root@flax ~]# clustat
Member Status: Quorate

  Member Name                              Status
  ------ ----                              ------
  flax                                     Online, Local, rgmanager

YES, THIS IS A ONE-NODE CLUSTER (which, I suspect, might be the problem).

SYMPTOM:

When the server comes up, the clustered logical volume that is on the 
iSCSI device is listed as "inactive" when I run "lvscan":
[root@flax ~]# lvscan
  inactive            '/dev/nasvg_00/lvol0' [5.46 TB] inherit
  ACTIVE            '/dev/lgevg_00/lvol0' [3.55 TB] inherit
  ACTIVE            '/dev/noraidvg_01/lvol0' [546.92 GB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol00' [134.47 GB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol01' [1.94 GB] inherit

The interesting thing is that the lgevg_00 and noraidvg_01 volume groups 
are also clustered, but they are direct-attached SCSI (i.e., not iSCSI), 
and their volumes come up active.
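
For anyone who wants to check for this from a script rather than by 
eyeballing the listing, here is a minimal sketch.  It assumes the lvscan 
output format shown above (state in the first field, quoted LV path in 
the second); the function name inactive_lvs is just something I made up.

```shell
#!/bin/sh
# Sketch: print the device path of every logical volume that lvscan
# reports as "inactive". Assumes the output format shown above, where
# the first field is the state and the second is the quoted LV path.
# The function name inactive_lvs is hypothetical.
inactive_lvs() {
    awk '$1 == "inactive" { print $2 }' | tr -d "'"
}

# Usage (as root): lvscan | inactive_lvs
```

Fed the listing above, it should print only /dev/nasvg_00/lvol0.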

The volume group that the logical volume belongs to shows up cleanly:
[root@flax ~]# vgscan
  Reading all physical volumes.  This may take a while...
  Found volume group "nasvg_00" using metadata type lvm2
  Found volume group "lgevg_00" using metadata type lvm2
  Found volume group "noraidvg_01" using metadata type lvm2

So, to activate the volume, I try the following, which fails:

[root@flax ~]# lvchange -a y /dev/nasvg_00/lvol0
Error locking on node flax: Volume group for uuid not found: 
oNhRO1WqNJp3BZxxrlMT16dwpwcRiIQPejnrEUbQ3HMJ6BjHef1hKAsoA6Sl9ISS

The same error also shows up in my syslog:
Oct 13 11:27:40 flax vgchange:   Error locking on node flax: Volume 
group for uuid not found: 
oNhRO1WqNJp3BZxxrlMT16dwpwcRiIQPejnrEUbQ3HMJ6BjHef1hKAsoA6Sl9ISS

RESOLUTION:

It took me a very long time to figure this out, but since it happens to 
me every time I reboot my server, somebody's bound to run into this 
again sometime soon (and it will probably be me).

Here's how I resolved it:

I edited /etc/lvm/lvm.conf as follows:

was:
    # Type of locking to use. Defaults to local file-based locking (1).
    # Turn locking off by setting to 0 (dangerous: risks metadata corruption
    # if LVM2 commands get run concurrently).
    # Type 2 uses the external shared library locking_library.
    # Type 3 uses built-in clustered locking.
    #locking_type = 1
    locking_type = 3

changed to:

(snip)
    # Type 3 uses built-in clustered locking.
    #locking_type = 1
    locking_type = 2

Then, restart clvmd:
[root@flax ~]# service clvmd restart

Then, activate the volume again:
[root@flax ~]# lvchange -a y /dev/nasvg_00/lvol0
[root@flax ~]#

(see, no error!)
[root@flax ~]# lvscan
  ACTIVE            '/dev/nasvg_00/lvol0' [5.46 TB] inherit
  ACTIVE            '/dev/lgevg_00/lvol0' [3.55 TB] inherit
  ACTIVE            '/dev/noraidvg_01/lvol0' [546.92 GB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol00' [134.47 GB] inherit
  ACTIVE            '/dev/VolGroup00/LogVol01' [1.94 GB] inherit

(it's active!)

Then, go back and modify /etc/lvm/lvm.conf to restore the original 
locking_type of 3, and restart clvmd once more.
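
Since I end up doing this after every reboot, the steps above could be 
scripted roughly as follows.  This is only a sketch of my manual 
procedure, not a proper fix: the function names and the LVM_CONF and LV 
variables are made up for illustration, and it assumes lvm.conf has 
exactly one uncommented locking_type line, as mine does.

```shell
#!/bin/sh
# Sketch of the manual workaround above. Nothing runs until
# activate_with_workaround is called (as root). The function names and
# the LVM_CONF/LV variables are hypothetical; the path, the LV name and
# the locking_type values 2 and 3 are the ones from this post.

LVM_CONF="${LVM_CONF:-/etc/lvm/lvm.conf}"
LV="${LV:-/dev/nasvg_00/lvol0}"

# Set the active (uncommented) locking_type line in file $2 to value $1.
# Commented-out lines such as "#locking_type = 1" are left alone.
set_locking_type() {
    tmp="$2.tmp.$$"
    sed "s/^\([[:space:]]*\)locking_type[[:space:]]*=.*/\1locking_type = $1/" "$2" > "$tmp" &&
        cat "$tmp" > "$2"
    ret=$?
    rm -f "$tmp"
    return $ret
}

activate_with_workaround() {
    set_locking_type 2 "$LVM_CONF" &&
        service clvmd restart &&
        lvchange -a y "$LV"
    rc=$?
    # Always put the original setting back, even if activation failed.
    set_locking_type 3 "$LVM_CONF"
    service clvmd restart
    return $rc
}
```

Sourcing the file and calling activate_with_workaround as root would 
walk through the same edit, restart, activate, restore sequence; 
set_locking_type can be tried safely on a copy of lvm.conf first.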

THOUGHTS:

I admit I don't know much about clustering, but from the evidence I see, 
the problem appears to be isolated to clvmd and iSCSI, if only because 
my direct-attached clustered volumes don't exhibit the symptoms.

I'll make another leap here and guess that it's probably isolated to 
single-node clusters, since I'd imagine that most people who are using 
clustering are using it as it was intended to be used (i.e., across 
multiple machines).