
[Linux-cluster] GFS locking issues



Hi,

I have some locking issues (deadlocks?) with GFS.

My configuration includes 4 hosts: one of them is used as the GNBD device
exporter, and the other 3 import this GNBD partition and mount it at the /gfs
mountpoint.

LVM is also used on the imported GNBD partition, so clvmd is running.
The locking method is DLM, the GFS version is 6.1.5, and manual fencing is used.


The problem is quite a common one: a deadlock on httpd (httpd processes in the 'D' state).
I have seen such problems reported on the list, but no solutions.

In my case Apache is installed on the GFS filesystem and I run it inside a
chroot with a command like this:

chroot /gfs/chroot /usr/local/apache/bin/httpd

The problem sometimes appears after "killall httpd": all the httpd processes
go into the 'D' state (as shown by "ps ax") and stay locked in this state forever.


Moreover, the whole GFS filesystem becomes unavailable after this happens.
Even on another host, every command that tries to access the /gfs partition
hangs in the 'D' state. The last time, though, it was only partially unavailable:
the /gfs/chroot/usr hierarchy was "locked" but other parts of GFS worked
okay.

The only cure I know is to reboot the node and fence it out from the cluster.


Are there any ideas on how to fix this? I mean either the cause (the 'D' state of
the killed httpd processes) or the consequences (the GFS filesystem becoming fully
or partially unavailable afterwards).

I would also appreciate any help with debugging the problem.

I tried gfs_tool lockdump with the decipher_lockstate_dump tool.

bash-3.00# ps ax |grep http
14981 ?        Ds     0:00 /usr/system/apache/bin/httpd
15242 ?        D      0:00 /usr/system/apache/bin/httpd
24708 ?        D      0:00 /usr/system/apache/bin/httpd
24709 ?        D      0:00 /usr/system/apache/bin/httpd
24710 ?        D      0:00 /usr/system/apache/bin/httpd
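
For debugging hangs like this, it can help to record which kernel function each
blocked process is sleeping in, alongside the state that "ps ax" reports. The
following is only a minimal sketch, assuming a Linux /proc filesystem that
provides /proc/<pid>/stat and /proc/<pid>/wchan. The function names
parse_stat_line and uninterruptible_processes are hypothetical helpers written
for this illustration and are not part of any GFS tool.

```python
import os


def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """Return sorted (pid, comm, wchan) tuples for processes in state 'D'."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "stat")) as f:
                pid, comm, state = parse_stat_line(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or the stat line was unreadable
        if state != "D":
            continue
        try:
            with open(os.path.join(base, "wchan")) as f:
                wchan = f.read().strip() or "?"
        except OSError:
            wchan = "?"  # wchan may be missing or unreadable
        found.append((pid, comm, wchan))
    return sorted(found)
```

Calling uninterruptible_processes() on each node while the hang is in progress
gives one tuple per 'D'-state process. Whether the wait channel names are
meaningful depends on the kernel build, so treat the output as a hint only.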

I found only 2 locks related to these processes:

bash-3.00# ls -i /gfs/chroot/lib64/libnss_files-2.3.4.so
27190 /gfs/chroot/lib64/libnss_files-2.3.4.so

Glock (inode[2], 27190)
  gl_flags = lock[1]
  gl_count = 7
  gl_state = shared[3]
  req_gh = yes
  req_bh = yes
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 1
  ail_bufs = no
  Request
    owner = 24710
    gh_state = shared[3]
    gh_flags =
    error = 0
    gh_iflags = promote[1] holder[6] first[7]
  Holder
    owner = 24710
    gh_state = shared[3]
    gh_flags =
    error = 0
    gh_iflags = promote[1] holder[6] first[7]
  Waiter3
    owner = 24708
    gh_state = shared[3]
    gh_flags =
    error = 0
    gh_iflags = promote[1]
  Waiter3
    owner = 24709
    gh_state = shared[3]
    gh_flags =
    error = 0
    gh_iflags = promote[1]
  Waiter3
    owner = 15242
    gh_state = shared[3]
    gh_flags =
    error = 0
    gh_iflags = promote[1]
  Inode: busy

and

bash-3.00# ls -i /gfs/chroot/usr/system/apache/bin/httpd
2175961 /gfs/chroot/usr/system/apache/bin/httpd

Glock (inode[2], 2175961)
  gl_flags =
  gl_count = 4
  gl_state = shared[3]
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 1
  ail_bufs = no
  Holder
    owner = 14981
    gh_state = shared[3]
    gh_flags =
    error = 0
    gh_iflags = promote[1] holder[6] first[7]
  Inode: busy
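
A deciphered lockdump can be long, so searching it by hand for the relevant
PIDs is tedious. The sketch below summarizes which glocks have waiters and who
holds them. It assumes the text layout shown in the dumps above: a
"Glock (type[n], number)" header followed by Request / Holder / WaiterN
sections, each with an "owner = ..." line. The names Glock, parse_lockdump and
contended are hypothetical and were chosen for this illustration. Other GFS
versions may print a different format.

```python
import re
from dataclasses import dataclass, field

# Header line such as: Glock (inode[2], 27190)
GLOCK_RE = re.compile(r"^Glock \((\w+)\[\d+\], (\d+)\)")
# Section line such as: Request, Holder, Waiter3
SECTION_RE = re.compile(r"^(Request|Holder|Waiter\d+)$")
# Owner line such as: owner = 24710   or   owner = none[-1]
OWNER_RE = re.compile(r"^owner = (\d+|none\[-1\])$")


@dataclass
class Glock:
    kind: str                  # e.g. "inode" or "iopen"
    number: int                # inode number for inode/iopen glocks
    holders: list = field(default_factory=list)
    waiters: list = field(default_factory=list)


def parse_lockdump(text):
    """Parse deciphered lockdump text into a list of Glock records."""
    glocks, current, section = [], None, None
    for raw in text.splitlines():
        line = raw.strip()
        m = GLOCK_RE.match(line)
        if m:
            current = Glock(kind=m.group(1), number=int(m.group(2)))
            glocks.append(current)
            section = None
            continue
        if current is None:
            continue  # ignore anything before the first Glock header
        m = SECTION_RE.match(line)
        if m:
            section = m.group(1)
            continue
        if line.startswith("Inode"):
            section = None  # the Inode block carries no owner lines
            continue
        m = OWNER_RE.match(line)
        if m and section:
            # "none[-1]" means no owning process; store it as None.
            owner = int(m.group(1)) if m.group(1).isdigit() else None
            if section == "Holder":
                current.holders.append(owner)
            elif section.startswith("Waiter"):
                current.waiters.append(owner)
    return glocks


def contended(glocks):
    """Return only the glocks that have at least one waiter."""
    return [g for g in glocks if g.waiters]
```

Applied to the first dump above, contended(parse_lockdump(text)) would be
expected to return the inode glock 27190 with holder 24710 and waiters 24708,
24709 and 15242. The glock for 2175961 has no waiters and is filtered out.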

There are also these locks for the same inodes:


Glock (iopen[5], 27190)
  gl_flags =
  gl_count = 2
  gl_state = shared[3]
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = no
  ail_bufs = no
  Holder
    owner = none[-1]
    gh_state = shared[3]
    gh_flags = local_excl[5] exact[7]
    error = 0
    gh_iflags = promote[1] holder[6] first[7]


Glock (iopen[5], 2175961)
  gl_flags =
  gl_count = 2
  gl_state = shared[3]
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = no
  ail_bufs = no
  Holder
    owner = none[-1]
    gh_state = shared[3]
    gh_flags = local_excl[5] exact[7]
    error = 0
    gh_iflags = promote[1] holder[6] first[7]




During the last hang, "/gfs/chroot/usr" was unavailable, and there are two entries for this directory in the lockdump:

bash-3.00# ls -di /gfs/chroot/usr/
15077981 /gfs/chroot/usr/

Glock (inode[2], 15077981)
  gl_flags =
  gl_count = 4
  gl_state = shared[3]
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = yes
  aspace = 1
  ail_bufs = no
  Inode:
    num = 15077981/15077981
    type = directory[2]
    i_count = 1
    i_flags =
    vnode = yes

Glock (iopen[5], 15077981)
  gl_flags =
  gl_count = 2
  gl_state = shared[3]
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = no
  ail_bufs = no
  Holder
    owner = none[-1]
    gh_state = shared[3]
    gh_flags = local_excl[5] exact[7]
    error = 0
    gh_iflags = promote[1] holder[6] first[7]


Your comments will be highly appreciated.

--
Best Regards,
Anton Kornev.