[Linux-cluster] GFS locking issues
Anton Kornev
akornev at gmail.com
Fri Jun 23 00:19:52 UTC 2006
David,
Thanks a lot for your comments.
Actually, that sounds rather strange to me.
I grepped /var/log/messages for "gnbd" and found messages like these
even on a working host that has no GFS problems.
bash-3.00# grep gnbd /var/log/messages
Jun 19 08:16:20 node1 kernel: gnbd (pid 25756: alogc.pl) got signal 9
Jun 19 08:16:20 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 08:16:20 node1 kernel: gnbd (pid 25756: alogc.pl) got signal 15
Jun 19 08:16:20 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 09:51:59 node1 kernel: gnbd (pid 26259: find) got signal 9
Jun 19 09:51:59 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 10:06:39 node1 kernel: gnbd (pid 313: alogc.pl) got signal 9
Jun 19 10:06:39 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 10:06:39 node1 kernel: gnbd (pid 313: alogc.pl) got signal 15
Jun 19 10:06:39 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 12:51:12 node1 kernel: gnbd (pid 19463: vi) got signal 1
Jun 19 12:51:12 node1 kernel: gnbd0: Send control failed (result -4)
Jun 20 14:48:16 node1 kernel: gnbd (pid 20238: alogc.pl) got signal 9
Jun 20 14:48:16 node1 kernel: gnbd0: Send control failed (result -4)
Jun 20 14:48:16 node1 kernel: gnbd (pid 20238: alogc.pl) got signal 15
Jun 20 14:48:16 node1 kernel: gnbd0: Send control failed (result -4)
I checked the gnbd-kernel sources (the latest available SRPM, not the CVS
version) and found that the first message ("gnbd ... got signal") is
produced by the sock_xmit() function, in this piece of code:
        if (signal_pending(current)) {
                siginfo_t info;
                spin_lock_irqsave(&current->sighand->siglock, flags);
                printk(KERN_WARNING "gnbd (pid %d: %s) got signal %d\n",
                       current->pid, current->comm,
                       dequeue_signal(current, &current->blocked, &info));
                spin_unlock_irqrestore(&current->sighand->siglock, flags);
                result = -EINTR;
                break;
        }
The second message is generated inside gnbd_send_req() by this code:
        result = sock_xmit(sock, 1, &request, sizeof(request),
                        (gnbd_cmd(req) == GNBD_CMD_WRITE)? MSG_MORE: 0);
        if (result < 0) {
                printk(KERN_ERR "%s: Send control failed (result %d)\n",
                       dev->disk->disk_name, result);
                goto error_out;
        }
So at first glance these look like normal messages from gnbd: if a signal
is received during sock_xmit(), the transfer is abandoned and -EINTR is
returned.
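
To make the link between the two log lines explicit, here is a minimal
user-space sketch that decodes the logged "result -4". The helper names
describe_result() and is_interrupted() are hypothetical and are not part of
gnbd; the sketch only assumes the usual kernel convention that errors are
returned as negated errno values, and that EINTR is 4, as it is on Linux.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical helper, not gnbd code: turn a negative kernel-style
 * return code (-errno) into the matching errno description. */
static const char *describe_result(int result)
{
    assert(result < 0);   /* only error codes are meaningful here */
    return strerror(-result);
}

/* Hypothetical helper: nonzero if the logged result means the call
 * was interrupted by a signal (EINTR is 4 on Linux, hence "-4"). */
static int is_interrupted(int result)
{
    return result == -EINTR;
}
```

Calling is_interrupted(-4) on a Linux host should report true, which matches
the "got signal" line that precedes every "Send control failed (result -4)"
line in the log above.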
I am not sure whether this is a problem, but I took a look at the
sock_xmit() code, and at least two things there seem strange to me.
1. There is an inconsistency between the comment and the code:
        /* Allow interception of SIGKILL only
         * Don't allow other signals to interrupt the transmission */
        spin_lock_irqsave(&current->sighand->siglock, flags);
        oldset = current->blocked;
        sigfillset(&current->blocked);
        sigdelsetmask(&current->blocked, sigmask(SIGKILL) | sigmask(SIGTERM) |
                      sigmask(SIGHUP));
        recalc_sigpending();
        spin_unlock_irqrestore(&current->sighand->siglock, flags);
So the comment suggests that only SIGKILL can interrupt the transmission,
but the real mask leaves SIGKILL, SIGTERM and SIGHUP unblocked (by the way,
in my case it is a SIGTERM that locks everything up).
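
The mismatch can be reproduced outside the kernel. The following is only a
user-space analogue, assuming the POSIX sigset_t calls behave like the kernel
helpers sigfillset() and sigdelsetmask() for this purpose; build_xmit_mask()
and can_interrupt() are hypothetical names introduced for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>

/* Hypothetical user-space analogue of the mask set up in sock_xmit():
 * block every signal, then unblock SIGKILL, SIGTERM and SIGHUP. */
static void build_xmit_mask(sigset_t *blocked)
{
    int rc = sigfillset(blocked);
    assert(rc == 0);
    (void)rc;             /* keeps the build quiet when NDEBUG is set */
    sigdelset(blocked, SIGKILL);
    sigdelset(blocked, SIGTERM);
    sigdelset(blocked, SIGHUP);
}

/* Nonzero if `sig` is left unblocked and so could interrupt a transfer. */
static int can_interrupt(const sigset_t *blocked, int sig)
{
    return sigismember(blocked, sig) == 0;
}
```

With a mask built this way, can_interrupt() is true for SIGKILL, SIGTERM and
SIGHUP and false for a signal such as SIGINT, so three signals get through
rather than the single one the comment promises.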
2. There are two blocks of code following each other:
                if (send)
                        result = sock_sendmsg(sock, &msg, size);
                else
                        result = sock_recvmsg(sock, &msg, size, 0);
                if (signal_pending(current)) {
                        siginfo_t info;
                        spin_lock_irqsave(&current->sighand->siglock, flags);
                        printk(KERN_WARNING "gnbd (pid %d: %s) got signal %d\n",
                               current->pid, current->comm,
                               dequeue_signal(current, &current->blocked, &info));
                        spin_unlock_irqrestore(&current->sighand->siglock, flags);
                        result = -EINTR;
                        break;
                }
Why do we need to return -EINTR if the real sock_sendmsg() / sock_recvmsg()
has already been done? What if the transmission itself succeeded and the
real result is not an error?
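
One way to picture the alternative being asked about is the sketch below. It
is a user-space illustration only, not a tested patch and not gnbd code:
xmit_step_result() is a hypothetical helper, and the sketch assumes that a
positive return value from the transfer call means that many bytes really
were moved. It also ignores the dequeue_signal() side effect of the original
code.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical helper, not gnbd code: decide what one transfer step
 * reports. io_result is the byte count or -errno from the send/receive
 * call; sig_pending is nonzero if a signal arrived meanwhile. */
static int xmit_step_result(int io_result, int sig_pending)
{
    if (io_result > 0)
        return io_result;   /* data moved: keep the real result */
    if (sig_pending)
        return -EINTR;      /* nothing moved and a signal is waiting */
    return io_result;       /* 0 or the original error, unchanged */
}

/* Usage examples expressed as checks. */
static void xmit_step_examples(void)
{
    assert(xmit_step_result(128, 1) == 128);    /* success survives a signal */
    assert(xmit_step_result(0, 1) == -EINTR);   /* interrupted before any I/O */
    assert(xmit_step_result(-EPIPE, 0) == -EPIPE);
}
```

The code quoted above behaves differently in the first case: it overwrites
the result with -EINTR whenever a signal is pending, even after a successful
transfer. Whether keeping the byte count would be safe for the gnbd protocol
is a question for the maintainers.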
I am not a kernel developer and I haven't spent much time on the issue,
so this might make no sense at all.
Please let me know what you think about it.
On 6/21/06, David Teigland <teigland at redhat.com> wrote:
>
> On Fri, Jun 16, 2006 at 06:37:14PM +0300, Anton Kornev wrote:
> > gnbd (pid 5836: alogc.pl) got signal 9
> > gnbd0: Send control failed (result -4)
> > gnbd (pid 5836: alogc.pl) got signal 15
> > gnbd0: Send control failed (result -4)
>
> This and the fact that a number of processes appear to be blocked in the
> i/o path seem to point at gnbd as the hold-up.
>
> Dave
>
>
> > 51 D wait_on_buffer pdflush
> > 5771 D lock_page lock_dlm1
> > 5776 D - gfs_logd
> > 5777 D - gfs_quotad
> > 5778 D - gfs_inoded
> > 5892 D - httpd
> > 5895 D glock_wait_internal httpd
> > 5896 D glock_wait_internal httpd
> > 5897 D glock_wait_internal httpd
> > 5911 D glock_wait_internal httpd
> > 5915 D wait_on_buffer httpd
> > 5930 D wait_on_buffer sh
>
> > pdflush D ffffffff8014aabc 0 51 6 53 50
> > (L-TLB)
> > 00000100dfc3dc78 0000000000000046 000001011bd3e980 000001010fc11f00
> > 0000000000000216 ffffffffa0042916 000001011aca60c0
> 0000000000000008
> > 000001011fdef7f0 0000000000000dfa
> > Call Trace:<ffffffffa0042916>{:dm_mod:dm_request+396}
> > <ffffffff8014aabc>{keventd_create_kthread+0}
> > <ffffffff803053ef>{io_schedule+38}
> > <ffffffff80178c4c>{__wait_on_buffer+125}
> > <ffffffff80178ad2>{bh_wake_function+0}
> > <ffffffff80178ad2>{bh_wake_function+0}
> > <ffffffffa0235c5d>{:gfs:gfs_logbh_wait+49}
> > <ffffffffa024a6a6>{:gfs:disk_commit+794}
> > <ffffffffa024a877>{:gfs:log_refund+111}
> > <ffffffffa024ad8e>{:gfs:log_flush_internal+510}
> > <ffffffff8017d682>{sync_supers+167}
> <ffffffff8015e310>{wb_kupdate+36}
> >
> > <ffffffff8015edb4>{pdflush+323} <ffffffff8015e2ec>{wb_kupdate+0}
> > <ffffffff8015ec71>{pdflush+0} <ffffffff8014aa93>{kthread+200}
> > <ffffffff80110e17>{child_rip+8}
> > <ffffffff8014aabc>{keventd_create_kthread+0}
> > <ffffffff8014a9cb>{kthread+0} <ffffffff80110e0f>{child_rip+0}
> > lock_dlm1 D 000001000c0096e0 0 5771 6 5772 5766
> > (L-TLB)
> > 0000010113ce3c58 0000000000000046 0000001000000000 0000010000000069
> > 000001011420b030 0000000000000069 000001000c00a940
> 000000010000eb10
> > 000001011a887030 0000000000001cae
> > Call Trace:<ffffffff802496d4>{__generic_unplug_device+19}
> > <ffffffff803053ef>{io_schedule+38}
> > <ffffffff80159215>{__lock_page+191}
> > <ffffffff80158cfa>{page_wake_function+0}
> > <ffffffff80158cfa>{page_wake_function+0}
> > <ffffffff80163125>{truncate_inode_pages+519}
> > <ffffffffa0258f35>{:gfs:gfs_inval_page+63}
> > <ffffffffa02401b5>{:gfs:drop_bh+233}
> > <ffffffffa0242138>{:gfs:gfs_glock_cb+194}
> > <ffffffffa02869dd>{:lock_dlm:dlm_async+1989}
> > <ffffffff801333c8>{default_wake_function+0}
> > <ffffffff8014aabc>{keventd_create_kthread+0}
> > <ffffffffa0286218>{:lock_dlm:dlm_async+0}
> > <ffffffff8014aabc>{keventd_create_kthread+0}
> > <ffffffff8014aa93>{kthread+200} <ffffffff80110e17>{child_rip+8}
> > <ffffffff8014aabc>{keventd_create_kthread+0}
> > <ffffffff8014a9cb>{kthread+0}
> > <ffffffff80110e0f>{child_rip+0}
> > gfs_logd D 0000000000000000 0 5776 1 5777 5775
> > (L-TLB)
> > 000001011387fe38 0000000000000046 0000000000000000 ffffffff80304a85
> > 000001011387fe58 ffffffff80304add ffffffff803cca80
> 0000000000000246
> > 00000101143fe030 00000000000000b5
> > Call Trace:<ffffffff80304a85>{thread_return+0}
> > <ffffffff80304add>{thread_return+88}
> > <ffffffffa023e8d3>{:gfs:lock_on_glock+112}
> > <ffffffff8030565b>{__down_write+134}
> > <ffffffffa0249cdb>{:gfs:gfs_ail_empty+56}
> > <ffffffffa0233930>{:gfs:gfs_logd+77}
> > <ffffffff80110e17>{child_rip+8}
> > <ffffffff801cccff>{dummy_d_instantiate+0}
> > <ffffffffa02338e3>{:gfs:gfs_logd+0}
> <ffffffff80110e0f>{child_rip+0}
> >
> > gfs_quotad D 0000000000000000 0 5777 1 5778 5776
> > (L-TLB)
> > 0000010113881e98 0000000000000046 0000000000000000 ffffffff80304a85
> > 0000010113881eb8 ffffffff80304add 000001011ff87030
> 0000000100000074
> > 000001011430f7f0 0000000000000128
> > Call Trace:<ffffffff80304a85>{thread_return+0}
> > <ffffffff80304add>{thread_return+88}
> > <ffffffff8030565b>{__down_write+134}
> > <ffffffffa025b55a>{:gfs:gfs_quota_sync+226}
> > <ffffffffa0233ab1>{:gfs:gfs_quotad+127}
> > <ffffffff80110e17>{child_rip+8}
> > <ffffffff801cccff>{dummy_d_instantiate+0}
> > <ffffffff801cccff>{dummy_d_instantiate+0}
> > <ffffffff801cccff>{dummy_d_instantiate+0}
> > <ffffffffa0233a32>{:gfs:gfs_quotad+0}
> > <ffffffff80110e0f>{child_rip+0}
> > gfs_inoded D 0000000000000000 0 5778 1 5807 5777
> > (L-TLB)
> > 0000010113883e98 0000000000000046 000001011e2937f0 000001000c0096e0
> > 0000000000000000 ffffffff80304a85 0000010113883ec8
> 0000000180304add
> > 000001011e2937f0 00000000000000c2
> > Call Trace:<ffffffff80304a85>{thread_return+0}
> > <ffffffff8030565b>{__down_write+134}
> > <ffffffffa026160d>{:gfs:unlinked_find+115}
> > <ffffffffa0261c6c>{:gfs:gfs_unlinked_dealloc+25}
> > <ffffffffa0233bd5>{:gfs:gfs_inoded+66}
> > <ffffffff80110e17>{child_rip+8}
> > <ffffffffa0233b93>{:gfs:gfs_inoded+0}
> <ffffffff80110e0f>{child_rip+0}
> >
> >
> > httpd D ffffffff80304190 0 5892 1 5893 5826
> > (NOTLB)
> > 0000010111b75bf8 0000000000000002 0000000000000001 0000000000000001
> > 0000000000000000 0000000000000000 0000010114667980
> 0000000111b75bc0
> > 00000101143fe7f0 00000000000009ad
> > Call Trace:<ffffffff80303d6f>{__down+147}
> > <ffffffff801333c8>{default_wake_function+0}
> > <ffffffff8015b3a2>{generic_file_write_nolock+158}
> > <ffffffff80305780>{__down_failed+53}
> > <ffffffffa0236986>{:gfs:.text.lock.dio+95}
> > <ffffffffa0260e4c>{:gfs:gfs_trans_add_bh+205}
> > <ffffffffa0253efc>{:gfs:do_write_buf+1138}
> > <ffffffffa0252db3>{:gfs:walk_vm+278}
> > <ffffffffa0253a8a>{:gfs:do_write_buf+0}
> > <ffffffffa0253a8a>{:gfs:do_write_buf+0}
> > <ffffffffa025415b>{:gfs:__gfs_write+201}
> > <ffffffff80177c60>{vfs_write+207}
> > <ffffffff80177d48>{sys_write+69}
> <ffffffff801101c6>{system_call+126}
> >
> > httpd D 0000010110ad7d48 0 5895 5892 5896 5893
> > (NOTLB)
> > 0000010110ad7bd8 0000000000000006 000001011b16e030 0000000000000075
> > 0000010117002030 0000000000000075 000001000c002940
> 0000000000000001
> > 00000101170027f0 000000000001300e
> > Call Trace:<ffffffff80131d1d>{try_to_wake_up+863}
> > <ffffffff80304cbd>{wait_for_completion+167}
> > <ffffffff801333c8>{default_wake_function+0}
> > <ffffffff801333c8>{default_wake_function+0}
> > <ffffffffa023f4b1>{:gfs:glock_wait_internal+350}
> > <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
> > <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> > <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
> > <ffffffff80168211>{do_no_page+1003}
> > <ffffffff80167b13>{do_wp_page+948}
> > <ffffffff8016858f>{handle_mm_fault+343}
> > <ffffffff80142a06>{get_signal_to_deliver+1118}
> > <ffffffff801236d2>{do_page_fault+518}
> > <ffffffff80304a85>{thread_return+0}
> > <ffffffff80304add>{thread_return+88}
> <ffffffff80110c61>{error_exit+0}
> >
> >
> > httpd D 0000010110b5bd48 0 5896 5892 5897 5895
> > (NOTLB)
> > 0000010110b5bbd8 0000000000000002 00000101170027f0 0000000000000075
> > 00000101114787f0 0000000000000075 000001000c002940
> 0000000000000001
> > 0000010117002030 000000000000fb3e
> > Call Trace:<ffffffff80131d1d>{try_to_wake_up+863}
> > <ffffffff80304cbd>{wait_for_completion+167}
> > <ffffffff801333c8>{default_wake_function+0}
> > <ffffffff801333c8>{default_wake_function+0}
> > <ffffffffa023f4b1>{:gfs:glock_wait_internal+350}
> > <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
> > <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> > <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
> > <ffffffff80168211>{do_no_page+1003}
> > <ffffffff80167b13>{do_wp_page+948}
> > <ffffffff8016858f>{handle_mm_fault+343}
> > <ffffffff80142a06>{get_signal_to_deliver+1118}
> > <ffffffff801236d2>{do_page_fault+518}
> > <ffffffff802a3445>{sys_accept+327}
> > <ffffffff80182e88>{pipe_read+26} <ffffffff80110c61>{error_exit+0}
> >
> > httpd D 0000000000000000 0 5897 5892 5911 5896
> > (NOTLB)
> > 0000010110119bd8 0000000000000006 0000010117002030 0000000000000075
> > 0000010117002030 0000000000000075 000001000c00a940
> 000000001b16e030
> > 00000101114787f0 000000000000fbe0
> > Call Trace:<ffffffff802496d4>{__generic_unplug_device+19}
> > <ffffffff80304cbd>{wait_for_completion+167}
> > <ffffffff801333c8>{default_wake_function+0}
> > <ffffffff801333c8>{default_wake_function+0}
> > <ffffffffa023f4b1>{:gfs:glock_wait_internal+350}
> > <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
> > <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> > <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
> > <ffffffff80168211>{do_no_page+1003}
> > <ffffffff80167b13>{do_wp_page+948}
> > <ffffffff8016858f>{handle_mm_fault+343}
> > <ffffffff80142a06>{get_signal_to_deliver+1118}
> > <ffffffff801236d2>{do_page_fault+518}
> > <ffffffff80304a85>{thread_return+0}
> > <ffffffff80304add>{thread_return+88}
> <ffffffff80110c61>{error_exit+0}
> >
> >
> > httpd D 00000101100c3d48 0 5911 5892 5915 5897
> > (NOTLB)
> > 00000101100c3bd8 0000000000000002 000001011420b7f0 0000000000000075
> > 00000101170027f0 0000000000000075 000001000c002940
> 0000000000000000
> > 000001011b16e030 000000000000187e
> > Call Trace:<ffffffff80131d1d>{try_to_wake_up+863}
> > <ffffffff80304cbd>{wait_for_completion+167}
> > <ffffffff801333c8>{default_wake_function+0}
> > <ffffffff801333c8>{default_wake_function+0}
> > <ffffffffa023f4b1>{:gfs:glock_wait_internal+350}
> > <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
> > <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> > <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
> > <ffffffff80168211>{do_no_page+1003}
> > <ffffffff80167b13>{do_wp_page+948}
> > <ffffffff8016858f>{handle_mm_fault+343}
> > <ffffffff80142a06>{get_signal_to_deliver+1118}
> > <ffffffff801236d2>{do_page_fault+518}
> > <ffffffff80304a85>{thread_return+0}
> > <ffffffff80304add>{thread_return+88}
> <ffffffff80110c61>{error_exit+0}
> >
> >
> > httpd D 0000000000006a36 0 5915 5892 5911
> > (NOTLB)
> > 00000101180f7ad8 0000000000000006 0000000000002706 ffffffffa020c791
> > 0000000000000000 0000000000000000 0000030348ac8c1c
> 0000000114a217f0
> > 0000010114c997f0 000000000000076a
> > Call Trace:<ffffffffa020c791>{:dlm:lkb_swqueue+43}
> > <ffffffff803053ef>{io_schedule+38}
> > <ffffffff80178c4c>{__wait_on_buffer+125}
> > <ffffffff80178ad2>{bh_wake_function+0}
> > <ffffffff80178ad2>{bh_wake_function+0}
> > <ffffffffa02352c6>{:gfs:gfs_dreread+154}
> > <ffffffffa0235332>{:gfs:gfs_dread+40}
> > <ffffffffa02363b1>{:gfs:gfs_get_meta_buffer+201}
> > <ffffffffa0242999>{:gfs:gfs_copyin_dinode+23}
> > <ffffffffa0242461>{:gfs:inode_go_lock+38}
> > <ffffffffa023f586>{:gfs:glock_wait_internal+563}
> > <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
> > <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> > <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
> > <ffffffff80168211>{do_no_page+1003}
> > <ffffffff80167b13>{do_wp_page+948}
> > <ffffffff8016858f>{handle_mm_fault+343}
> > <ffffffff80142a06>{get_signal_to_deliver+1118}
> > <ffffffff801236d2>{do_page_fault+518}
> > <ffffffff80304a85>{thread_return+0}
> > <ffffffff80304add>{thread_return+88}
> <ffffffff80110c61>{error_exit+0}
> >
> >
> > sh D 000000000000001a 0 5930 2547
> > (NOTLB)
> > 000001011090f8e8 0000000000000002 0000010111293d88 0000010110973d00
> > 0000010111293d88 0000000000000000 00000100dfc02400
> 0000000000010000
> > 00000101148557f0 0000000000002010
> > Call Trace:<ffffffff803053ef>{io_schedule+38}
> > <ffffffff80178c4c>{__wait_on_buffer+125}
> > <ffffffff80178ad2>{bh_wake_function+0}
> > <ffffffff80178ad2>{bh_wake_function+0}
> > <ffffffffa02352c6>{:gfs:gfs_dreread+154}
> > <ffffffffa0235332>{:gfs:gfs_dread+40}
> > <ffffffffa02363b1>{:gfs:gfs_get_meta_buffer+201}
> > <ffffffffa0242999>{:gfs:gfs_copyin_dinode+23}
> > <ffffffffa0242461>{:gfs:inode_go_lock+38}
> > <ffffffffa023f586>{:gfs:glock_wait_internal+563}
> > <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
> > <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> > <ffffffff801ccb78>{dummy_inode_permission+0}
> > <ffffffffa0257aca>{:gfs:gfs_permission+64}
> > <ffffffff8018d475>{dput+56} <ffffffff80183d32>{permission+51}
> > <ffffffff801844aa>{__link_path_walk+372}
> > <ffffffff801851c2>{link_path_walk+82}
> > <ffffffff8012370b>{do_page_fault+575}
> > <ffffffff801849b0>{__link_path_walk+1658}
> > <ffffffff801851c2>{link_path_walk+82}
> > <ffffffff8012370b>{do_page_fault+575}
> > <ffffffff8018540f>{path_lookup+451}
> > <ffffffff801856bb>{__user_walk+47}
> > <ffffffff8017ff1a>{vfs_stat+24}
> <ffffffff8012370b>{do_page_fault+575}
> >
> > <ffffffff80180264>{sys_newstat+17}
> <ffffffff80110c61>{error_exit+0}
> > <ffffffff801101c6>{system_call+126}
>
>
--
Best Regards,
Anton Kornev.