
Re: [Linux-cluster] GFS locking issues



David, Benjamin,
thanks for your assistance!

I reproduced the problem and I have done the tests you mentioned.

Regarding gnbd:

The gnbd_import -l tool reports the "Open, Connected" state, and gnbd_export -L
on the gnbd server also shows all the hosts importing this partition.
"cat /sys/class/gnbd/gnbd0/waittime" also shows no pending data
(it returns -1).

However, some strange lines about gnbd failures appeared in the message log after the "killall httpd" command was issued:

gnbd (pid 5836: alogc.pl) got signal 9
gnbd0: Send control failed (result -4)
gnbd (pid 5836: alogc.pl) got signal 15
gnbd0: Send control failed (result -4)
gnbd (pid 5911: httpd) got signal 15
gnbd0: Send control failed (result -4)
gnbd (pid 5897: httpd) got signal 15
gnbd0: Send control failed (result -4)
gnbd (pid 5915: httpd) got signal 15
gnbd0: Send control failed (result -4)
gnbd (pid 5911: httpd) got signal 15
gnbd0: Send control failed (result -4)
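
As a side note on the "result -4" in the lines above: if the gnbd driver reports a negated errno here (the usual kernel convention, but an assumption on my part), then -4 would be EINTR, which would fit the signals logged just before each failure. A minimal Python sketch to check the mapping; the helper name describe_result is only illustrative:

```python
import errno
import os


def describe_result(result: int) -> str:
    """Map a negative, kernel-style result code to its errno name and text.

    Assumes the value is a negated errno; this is a convention, not
    something the gnbd log itself confirms.
    """
    code = -result
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name}: {os.strerror(code)}"


# The value reported by "gnbd0: Send control failed (result -4)"
print(describe_result(-4))
```

If that reading is right, the send was interrupted by the signal rather than failing because of a network or device error.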


Regarding the ps info on wchan, here is the "ps axl" output for the IO-waiting processes:

F   UID   PID  PPID PRI  NI   VSZ  RSS WCHAN  STAT TTY        TIME COMMAND
1     0    51     6  15   0     0    0 wait_o D    ?          0:00 [pdflush]
1     0  5771     6   5 -10     0    0 lock_p D<   ?          0:00 [lock_dlm1]
1     0  5776     1  15   0     0    0 -      D    ?          0:00 [gfs_logd]
1     0  5777     1  15   0     0    0 -      D    ?          0:00 [gfs_quotad]
1     0  5778     1  15   0     0    0 -      D    ?          0:00 [gfs_inoded]
5     0  5892     1  16   0 23440  912 -      Ds   ?          0:00 /usr/system/apache/bin/httpd
5    48  5895  5892  17   0 23472  984 glock_ D    ?          0:00 /usr/system/apache/bin/httpd
5    48  5896  5892  17   0 23440  980 glock_ D    ?          0:00 /usr/system/apache/bin/httpd
5    48  5897  5892  17   0 23440  920 glock_ D    ?          0:00 /usr/system/apache/bin/httpd
5    48  5911  5892  17   0 23440  920 glock_ D    ?          0:00 /usr/system/apache/bin/httpd
5    48  5915  5892  17   0 23440  920 wait_o D    ?          0:00 /usr/system/apache/bin/httpd
4     0  5930  2547  34  19 52780  992 wait_o DN   ?          0:00 /bin/sh -c run-parts /etc/cron.daily

The untruncated "wchan" field for all the IO-waiting processes is below:

bash-3.00# ps ax -o pid,state,wchan:32,ucomm |grep D
  PID S WCHAN                            COMMAND
   51 D wait_on_buffer                   pdflush
 5771 D lock_page                        lock_dlm1
 5776 D -                                gfs_logd
 5777 D -                                gfs_quotad
 5778 D -                                gfs_inoded
 5892 D -                                httpd
 5895 D glock_wait_internal              httpd
 5896 D glock_wait_internal              httpd
 5897 D glock_wait_internal              httpd
 5911 D glock_wait_internal              httpd
 5915 D wait_on_buffer                   httpd
 5930 D wait_on_buffer                   sh
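
To make the grouping easier to see, here is a rough Python sketch that buckets the D-state processes from the listing above by wait channel. The function name group_by_wchan is just illustrative, and the parsing assumes the four-column "pid,state,wchan,ucomm" layout shown above:

```python
from collections import defaultdict

# Copied from the ps output above.
PS_OUTPUT = """\
  PID S WCHAN                            COMMAND
   51 D wait_on_buffer                   pdflush
 5771 D lock_page                        lock_dlm1
 5776 D -                                gfs_logd
 5777 D -                                gfs_quotad
 5778 D -                                gfs_inoded
 5892 D -                                httpd
 5895 D glock_wait_internal              httpd
 5896 D glock_wait_internal              httpd
 5897 D glock_wait_internal              httpd
 5911 D glock_wait_internal              httpd
 5915 D wait_on_buffer                   httpd
 5930 D wait_on_buffer                   sh
"""


def group_by_wchan(ps_text):
    """Return {wchan: [(pid, command), ...]} for processes in state D."""
    groups = defaultdict(list)
    for line in ps_text.splitlines():
        fields = line.split(None, 3)
        # Skip the header and anything that is not pid/state/wchan/command.
        if len(fields) != 4 or not fields[0].isdigit():
            continue
        pid, state, wchan, comm = fields
        if state.startswith("D"):
            groups[wchan].append((int(pid), comm.strip()))
    return dict(groups)


for wchan, procs in sorted(group_by_wchan(PS_OUTPUT).items()):
    print(f"{wchan:24} {len(procs)}  {procs}")
```

Grouped this way, four httpd children sit in glock_wait_internal, while pdflush, one httpd and the cron shell are in wait_on_buffer.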

Finally, I captured the sysrq stack dumps for these processes.

pdflush       D ffffffff8014aabc     0    51      6            53    50 (L-TLB)
00000100dfc3dc78 0000000000000046 000001011bd3e980 000001010fc11f00
       0000000000000216 ffffffffa0042916 000001011aca60c0 0000000000000008
       000001011fdef7f0 0000000000000dfa
Call Trace:<ffffffffa0042916>{:dm_mod:dm_request+396} <ffffffff8014aabc>{keventd_create_kthread+0}
       <ffffffff803053ef>{io_schedule+38} <ffffffff80178c4c>{__wait_on_buffer+125}
       <ffffffff80178ad2>{bh_wake_function+0} <ffffffff80178ad2>{bh_wake_function+0}
       <ffffffffa0235c5d>{:gfs:gfs_logbh_wait+49} <ffffffffa024a6a6>{:gfs:disk_commit+794}
       <ffffffffa024a877>{:gfs:log_refund+111} <ffffffffa024ad8e>{:gfs:log_flush_internal+510}
       <ffffffff8017d682>{sync_supers+167} <ffffffff8015e310>{wb_kupdate+36}
       <ffffffff8015edb4>{pdflush+323} <ffffffff8015e2ec>{wb_kupdate+0}
       <ffffffff8015ec71>{pdflush+0} <ffffffff8014aa93>{kthread+200}
       <ffffffff80110e17>{child_rip+8} <ffffffff8014aabc>{keventd_create_kthread+0}
       <ffffffff8014a9cb>{kthread+0} <ffffffff80110e0f>{child_rip+0}
lock_dlm1     D 000001000c0096e0     0  5771      6          5772  5766 (L-TLB)
0000010113ce3c58 0000000000000046 0000001000000000 0000010000000069
       000001011420b030 0000000000000069 000001000c00a940 000000010000eb10
       000001011a887030 0000000000001cae
Call Trace:<ffffffff802496d4>{__generic_unplug_device+19} <ffffffff803053ef>{io_schedule+38}
       <ffffffff80159215>{__lock_page+191} <ffffffff80158cfa>{page_wake_function+0}
       <ffffffff80158cfa>{page_wake_function+0} <ffffffff80163125>{truncate_inode_pages+519}
       <ffffffffa0258f35>{:gfs:gfs_inval_page+63} <ffffffffa02401b5>{:gfs:drop_bh+233}
       <ffffffffa0242138>{:gfs:gfs_glock_cb+194} <ffffffffa02869dd>{:lock_dlm:dlm_async+1989}
       <ffffffff801333c8>{default_wake_function+0} <ffffffff8014aabc>{keventd_create_kthread+0}
       <ffffffffa0286218>{:lock_dlm:dlm_async+0} <ffffffff8014aabc>{keventd_create_kthread+0}
       <ffffffff8014aa93>{kthread+200} <ffffffff80110e17>{child_rip+8}
       <ffffffff8014aabc>{keventd_create_kthread+0} <ffffffff8014a9cb>{kthread+0}
       <ffffffff80110e0f>{child_rip+0}
gfs_logd      D 0000000000000000     0  5776      1          5777  5775 (L-TLB)
000001011387fe38 0000000000000046 0000000000000000 ffffffff80304a85
       000001011387fe58 ffffffff80304add ffffffff803cca80 0000000000000246
       00000101143fe030 00000000000000b5
Call Trace:<ffffffff80304a85>{thread_return+0} <ffffffff80304add>{thread_return+88}
       <ffffffffa023e8d3>{:gfs:lock_on_glock+112} <ffffffff8030565b>{__down_write+134}
       <ffffffffa0249cdb>{:gfs:gfs_ail_empty+56} <ffffffffa0233930>{:gfs:gfs_logd+77}
       <ffffffff80110e17>{child_rip+8} <ffffffff801cccff>{dummy_d_instantiate+0}
       <ffffffffa02338e3>{:gfs:gfs_logd+0} <ffffffff80110e0f>{child_rip+0}
      
gfs_quotad    D 0000000000000000     0  5777      1          5778  5776 (L-TLB)
0000010113881e98 0000000000000046 0000000000000000 ffffffff80304a85
       0000010113881eb8 ffffffff80304add 000001011ff87030 0000000100000074
       000001011430f7f0 0000000000000128
Call Trace:<ffffffff80304a85>{thread_return+0} <ffffffff80304add>{thread_return+88}
       <ffffffff8030565b>{__down_write+134} <ffffffffa025b55a>{:gfs:gfs_quota_sync+226}
       <ffffffffa0233ab1>{:gfs:gfs_quotad+127} <ffffffff80110e17>{child_rip+8}
       <ffffffff801cccff>{dummy_d_instantiate+0} <ffffffff801cccff>{dummy_d_instantiate+0}
       <ffffffff801cccff>{dummy_d_instantiate+0} <ffffffffa0233a32>{:gfs:gfs_quotad+0}
       <ffffffff80110e0f>{child_rip+0}
gfs_inoded    D 0000000000000000     0  5778      1          5807  5777 (L-TLB)
0000010113883e98 0000000000000046 000001011e2937f0 000001000c0096e0
       0000000000000000 ffffffff80304a85 0000010113883ec8 0000000180304add
       000001011e2937f0 00000000000000c2
Call Trace:<ffffffff80304a85>{thread_return+0} <ffffffff8030565b>{__down_write+134}
       <ffffffffa026160d>{:gfs:unlinked_find+115} <ffffffffa0261c6c>{:gfs:gfs_unlinked_dealloc+25}
       <ffffffffa0233bd5>{:gfs:gfs_inoded+66} <ffffffff80110e17>{child_rip+8}
       <ffffffffa0233b93>{:gfs:gfs_inoded+0} <ffffffff80110e0f>{child_rip+0}
           
httpd         D ffffffff80304190     0  5892      1  5893          5826 (NOTLB)
0000010111b75bf8 0000000000000002 0000000000000001 0000000000000001
       0000000000000000 0000000000000000 0000010114667980 0000000111b75bc0
       00000101143fe7f0 00000000000009ad
Call Trace:<ffffffff80303d6f>{__down+147} <ffffffff801333c8>{default_wake_function+0}
       <ffffffff8015b3a2>{generic_file_write_nolock+158} <ffffffff80305780>{__down_failed+53}
       <ffffffffa0236986>{:gfs:.text.lock.dio+95} <ffffffffa0260e4c>{:gfs:gfs_trans_add_bh+205}
       <ffffffffa0253efc>{:gfs:do_write_buf+1138} <ffffffffa0252db3>{:gfs:walk_vm+278}
       <ffffffffa0253a8a>{:gfs:do_write_buf+0} <ffffffffa0253a8a>{:gfs:do_write_buf+0}
       <ffffffffa025415b>{:gfs:__gfs_write+201} <ffffffff80177c60>{vfs_write+207}
       <ffffffff80177d48>{sys_write+69} <ffffffff801101c6>{system_call+126}
      
httpd         D 0000010110ad7d48     0  5895   5892          5896  5893 (NOTLB)
0000010110ad7bd8 0000000000000006 000001011b16e030 0000000000000075
       0000010117002030 0000000000000075 000001000c002940 0000000000000001
       00000101170027f0 000000000001300e
Call Trace:<ffffffff80131d1d>{try_to_wake_up+863} <ffffffff80304cbd>{wait_for_completion+167}
       <ffffffff801333c8>{default_wake_function+0} <ffffffff801333c8>{default_wake_function+0}
       <ffffffffa023f4b1>{:gfs:glock_wait_internal+350} <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20} <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
       <ffffffff80168211>{do_no_page+1003} <ffffffff80167b13>{do_wp_page+948}
       <ffffffff8016858f>{handle_mm_fault+343} <ffffffff80142a06>{get_signal_to_deliver+1118}
       <ffffffff801236d2>{do_page_fault+518} <ffffffff80304a85>{thread_return+0}
       <ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
      
httpd         D 0000010110b5bd48     0  5896   5892          5897  5895 (NOTLB)
0000010110b5bbd8 0000000000000002 00000101170027f0 0000000000000075
       00000101114787f0 0000000000000075 000001000c002940 0000000000000001
       0000010117002030 000000000000fb3e
Call Trace:<ffffffff80131d1d>{try_to_wake_up+863} <ffffffff80304cbd>{wait_for_completion+167}
       <ffffffff801333c8>{default_wake_function+0} <ffffffff801333c8>{default_wake_function+0}
       <ffffffffa023f4b1>{:gfs:glock_wait_internal+350} <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20} <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
       <ffffffff80168211>{do_no_page+1003} <ffffffff80167b13>{do_wp_page+948}
       <ffffffff8016858f>{handle_mm_fault+343} <ffffffff80142a06>{get_signal_to_deliver+1118}
       <ffffffff801236d2>{do_page_fault+518} <ffffffff802a3445>{sys_accept+327}
       <ffffffff80182e88>{pipe_read+26} <ffffffff80110c61>{error_exit+0}
      
httpd         D 0000000000000000     0  5897   5892          5911  5896 (NOTLB)
0000010110119bd8 0000000000000006 0000010117002030 0000000000000075
       0000010117002030 0000000000000075 000001000c00a940 000000001b16e030
       00000101114787f0 000000000000fbe0
Call Trace:<ffffffff802496d4>{__generic_unplug_device+19} <ffffffff80304cbd>{wait_for_completion+167}
       <ffffffff801333c8>{default_wake_function+0} <ffffffff801333c8>{default_wake_function+0}
       <ffffffffa023f4b1>{:gfs:glock_wait_internal+350} <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20} <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
       <ffffffff80168211>{do_no_page+1003} <ffffffff80167b13>{do_wp_page+948}
       <ffffffff8016858f>{handle_mm_fault+343} <ffffffff80142a06>{get_signal_to_deliver+1118}
       <ffffffff801236d2>{do_page_fault+518} <ffffffff80304a85>{thread_return+0}
       <ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
      
httpd         D 00000101100c3d48     0  5911   5892          5915  5897 (NOTLB)
00000101100c3bd8 0000000000000002 000001011420b7f0 0000000000000075
       00000101170027f0 0000000000000075 000001000c002940 0000000000000000
       000001011b16e030 000000000000187e
Call Trace:<ffffffff80131d1d>{try_to_wake_up+863} <ffffffff80304cbd>{wait_for_completion+167}
       <ffffffff801333c8>{default_wake_function+0} <ffffffff801333c8>{default_wake_function+0}
       <ffffffffa023f4b1>{:gfs:glock_wait_internal+350} <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20} <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
       <ffffffff80168211>{do_no_page+1003} <ffffffff80167b13>{do_wp_page+948}
       <ffffffff8016858f>{handle_mm_fault+343} <ffffffff80142a06>{get_signal_to_deliver+1118}
       <ffffffff801236d2>{do_page_fault+518} <ffffffff80304a85>{thread_return+0}
       <ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
      
httpd         D 0000000000006a36     0  5915   5892                5911 (NOTLB)
00000101180f7ad8 0000000000000006 0000000000002706 ffffffffa020c791
       0000000000000000 0000000000000000 0000030348ac8c1c 0000000114a217f0
       0000010114c997f0 000000000000076a
Call Trace:<ffffffffa020c791>{:dlm:lkb_swqueue+43} <ffffffff803053ef>{io_schedule+38}
       <ffffffff80178c4c>{__wait_on_buffer+125} <ffffffff80178ad2>{bh_wake_function+0}
       <ffffffff80178ad2>{bh_wake_function+0} <ffffffffa02352c6>{:gfs:gfs_dreread+154}
       <ffffffffa0235332>{:gfs:gfs_dread+40} <ffffffffa02363b1>{:gfs:gfs_get_meta_buffer+201}
       <ffffffffa0242999>{:gfs:gfs_copyin_dinode+23} <ffffffffa0242461>{:gfs:inode_go_lock+38}
       <ffffffffa023f586>{:gfs:glock_wait_internal+563} <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20} <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
       <ffffffff80168211>{do_no_page+1003} <ffffffff80167b13>{do_wp_page+948}
       <ffffffff8016858f>{handle_mm_fault+343} <ffffffff80142a06>{get_signal_to_deliver+1118}
       <ffffffff801236d2>{do_page_fault+518} <ffffffff80304a85>{thread_return+0}
       <ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
      
sh            D 000000000000001a     0  5930   2547                     (NOTLB)
000001011090f8e8 0000000000000002 0000010111293d88 0000010110973d00
       0000010111293d88 0000000000000000 00000100dfc02400 0000000000010000
       00000101148557f0 0000000000002010
Call Trace:<ffffffff803053ef>{io_schedule+38} <ffffffff80178c4c>{__wait_on_buffer+125}
       <ffffffff80178ad2>{bh_wake_function+0} <ffffffff80178ad2>{bh_wake_function+0}
       <ffffffffa02352c6>{:gfs:gfs_dreread+154} <ffffffffa0235332>{:gfs:gfs_dread+40}
       <ffffffffa02363b1>{:gfs:gfs_get_meta_buffer+201} <ffffffffa0242999>{:gfs:gfs_copyin_dinode+23}
       <ffffffffa0242461>{:gfs:inode_go_lock+38} <ffffffffa023f586>{:gfs:glock_wait_internal+563}
       <ffffffffa023fce6>{:gfs:gfs_glock_nq+961} <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
       <ffffffff801ccb78>{dummy_inode_permission+0} <ffffffffa0257aca>{:gfs:gfs_permission+64}
       <ffffffff8018d475>{dput+56} <ffffffff80183d32>{permission+51}
       <ffffffff801844aa>{__link_path_walk+372} <ffffffff801851c2>{link_path_walk+82}
       <ffffffff8012370b>{do_page_fault+575} <ffffffff801849b0>{__link_path_walk+1658}
       <ffffffff801851c2>{link_path_walk+82} <ffffffff8012370b>{do_page_fault+575}
       <ffffffff8018540f>{path_lookup+451} <ffffffff801856bb>{__user_walk+47}
       <ffffffff8017ff1a>{vfs_stat+24} <ffffffff8012370b>{do_page_fault+575}
       <ffffffff80180264>{sys_newstat+17} <ffffffff80110c61>{error_exit+0}
       <ffffffff801101c6>{system_call+126}
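
In case it helps with reading the dumps above, here is a small Python sketch that splits sysrq output of this shape into tasks and lists the frames that belong to the gfs module. The regular expressions are my own guess at the format based only on the traces in this mail, the "kernel" label for frames without a module prefix is my choice, and the sample is an excerpt of two of the tasks above. Traces of this kind can also contain stale stack entries, so the result is only a hint:

```python
import re

# "<address>{:module:symbol+offset}" or "<address>{symbol+offset}"
FRAME_RE = re.compile(r"<([0-9a-f]+)>\{(?::(\w+):)?([\w.]+)\+(\d+)\}")
# Task header: name, one-letter state, an address, a number, then the pid.
TASK_RE = re.compile(r"^(\S+)\s+([A-Z])\s+[0-9a-f]+\s+\d+\s+(\d+)\s")

# Excerpt of the sysrq output above (only some trace lines are included).
SAMPLE = """\
pdflush       D ffffffff8014aabc     0    51      6            53    50 (L-TLB)
Call Trace:<ffffffffa0042916>{:dm_mod:dm_request+396} <ffffffff8014aabc>{keventd_create_kthread+0}
       <ffffffff803053ef>{io_schedule+38} <ffffffff80178c4c>{__wait_on_buffer+125}
       <ffffffffa0235c5d>{:gfs:gfs_logbh_wait+49} <ffffffffa024a6a6>{:gfs:disk_commit+794}
httpd         D 0000010110ad7d48     0  5895   5892          5896  5893 (NOTLB)
Call Trace:<ffffffff80131d1d>{try_to_wake_up+863} <ffffffff80304cbd>{wait_for_completion+167}
       <ffffffffa023f4b1>{:gfs:glock_wait_internal+350} <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
"""


def parse_tasks(text):
    """Return a list of dicts with name, state, pid and (module, symbol) frames."""
    tasks = []
    current = None
    for line in text.splitlines():
        header = TASK_RE.match(line)
        if header:
            current = {
                "name": header.group(1),
                "state": header.group(2),
                "pid": int(header.group(3)),
                "frames": [],
            }
            tasks.append(current)
            continue
        if current is not None:
            for _addr, module, symbol, _offset in FRAME_RE.findall(line):
                # Frames without a ":module:" prefix are labelled "kernel".
                current["frames"].append((module or "kernel", symbol))
    return tasks


for task in parse_tasks(SAMPLE):
    gfs_frames = [sym for mod, sym in task["frames"] if mod == "gfs"]
    print(task["pid"], task["name"], gfs_frames)
```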

Please let me know if this gives you any clues.


On 6/15/06, David Teigland <teigland redhat com> wrote:
On Thu, Jun 15, 2006 at 01:43:25AM +0300, Anton Kornev wrote:

> Is there any ideas of how to fix this? I mean either the reason ('D'
> state of killed httpd-s) or consequences (the GFS filesystem fully or
> partially become unavailable after this).
>
> I also appreciate any help with debugging the problem.
>
> I tried gfs_tool lockdump with decipher_lockstate_dump tool.

I don't see anything wrong in the lockdumps you gave, although I'm not an
expert at interpreting gfs lockdumps.  Could you do a ps showing the wchan
for those processes?  Using sysrq to get a stack dump would also be useful.
You might also do a dlm lock dump and pick out those locks:
  echo "lockspace name" >> /proc/cluster/dlm_locks
  cat /proc/cluster/dlm_locks

I/O stuck in gnbd could also be a problem, I'm not sure what the signs of
that might be apart from possibly the wchan.

Dave




--
Best Regards,
Anton Kornev.