[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

Re: [Linux-cluster] GFS locking issues



David,

Thanks a lot for your comments.
Actually it sounds rather strange for me.

I tried to grep the /var/log/messages log with "gnbd" word and found that there are also
other messages like this even on the working host with no GFS problems.

bash-3.00# grep gnbd /var/log/messages
Jun 19 08:16:20 node1 kernel: gnbd (pid 25756: alogc.pl) got signal 9
Jun 19 08:16:20 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 08:16:20 node1 kernel: gnbd (pid 25756: alogc.pl) got signal 15
Jun 19 08:16:20 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 09:51:59 node1 kernel: gnbd (pid 26259: find) got signal 9
Jun 19 09:51:59 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 10:06:39 node1 kernel: gnbd (pid 313: alogc.pl) got signal 9
Jun 19 10:06:39 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 10:06:39 node1 kernel: gnbd (pid 313: alogc.pl) got signal 15
Jun 19 10:06:39 node1 kernel: gnbd0: Send control failed (result -4)
Jun 19 12:51:12 node1 kernel: gnbd (pid 19463: vi) got signal 1
Jun 19 12:51:12 node1 kernel: gnbd0: Send control failed (result -4)
Jun 20 14:48:16 node1 kernel: gnbd (pid 20238: alogc.pl) got signal 9
Jun 20 14:48:16 node1 kernel: gnbd0: Send control failed (result -4)
Jun 20 14:48:16 node1 kernel: gnbd (pid 20238: alogc.pl) got signal 15
Jun 20 14:48:16 node1 kernel: gnbd0: Send control failed (result -4)


I tried to check gnbd-kernel sources (latest available SRPM - not CVS version)
and I found that the first message (gnbd ... got signal) is produced by the
sock_xmit() function with the such a piece of code:
               
                if (signal_pending(current)) {
                        siginfo_t info;
                        spin_lock_irqsave(&current->sighand->siglock, flags);
                        printk(KERN_WARNING "gnbd (pid %d: %s) got signal %d\n",
                                current->pid, current->comm,
                                dequeue_signal(current, &current->blocked, &info));
                        spin_unlock_irqrestore(&current->sighand->siglock, flags);
                        result = -EINTR;
                        break;
                }

And the second message is generated inside the gnbd_send_req() by the code

        result = sock_xmit(sock, 1, &request, sizeof(request),
                        (gnbd_cmd(req) == GNBD_CMD_WRITE)? MSG_MORE: 0);
        if (result < 0) {
                printk(KERN_ERR "%s: Send control failed (result %d)\n",
                                dev->disk->disk_name, result);
                goto error_out;
        }


So at the first glance it seems like a normal messages from gnbd - if there is signal received during sock_xmit - don't send anyting and return -EINTR.

I am not sure that it might be a problem but I take a look on the sock_xmit() code and
there are at least two things that seems strange for me.

1. There is an inconsistancy between comment and code:

        /* Allow interception of SIGKILL only
         * Don't allow other signals to interrupt the transmission */
        spin_lock_irqsave(&current->sighand->siglock, flags);
        oldset = current->blocked;
        sigfillset(&current->blocked);
        sigdelsetmask(&current->blocked, sigmask(SIGKILL) | sigmask(SIGTERM) |
                      sigmask(SIGHUP));
        recalc_sigpending();
        spin_unlock_irqrestore(&current->sighand->siglock, flags);

So, inside the comment there is a suggestion that only SIGKILL can interrupt the transmission but the real mask is for KILL/TERM/HUP signals (btw: in my case it is a SIGTERM who locks everything).

2. There are two blocks of code following each other

                if (send)
                        result = sock_sendmsg(sock, &msg, size);
                else
                        result = sock_recvmsg(sock, &msg, size, 0);


                if (signal_pending(current)) {
                        siginfo_t info;
                        spin_lock_irqsave(&current->sighand->siglock, flags);
                        printk(KERN_WARNING "gnbd (pid %d: %s) got signal %d\n",
                                current->pid, current->comm,
                                dequeue_signal(current, &current->blocked, &info));
                        spin_unlock_irqrestore(&current->sighand->siglock, flags);
                        result = -EINTR;
                        break;
                }

Why do we need to return -EINTR as a result if we have already done the real sock_sendmsg() / sock_recvmsg()?  What if the real transmission was okay and real result has no mistake?

I am not a kernel developer and I haven't spent a lot of time on the issue, so it might make no sense at all.

Please, let me know what do you think about it?

On 6/21/06, David Teigland <teigland redhat com> wrote:
On Fri, Jun 16, 2006 at 06:37:14PM +0300, Anton Kornev wrote:
> gnbd (pid 5836: alogc.pl) got signal 9
> gnbd0: Send control failed (result -4)
> gnbd (pid 5836: alogc.pl) got signal 15
> gnbd0: Send control failed (result -4)

This and the fact that a number of processes appear to be blocked in the
i/o path seem to point at gnbd as the hold-up.

Dave


>   51 D wait_on_buffer                   pdflush
> 5771 D lock_page                        lock_dlm1
> 5776 D -                                gfs_logd
> 5777 D -                                gfs_quotad
> 5778 D -                                gfs_inoded
> 5892 D -                                httpd
> 5895 D glock_wait_internal              httpd
> 5896 D glock_wait_internal              httpd
> 5897 D glock_wait_internal              httpd
> 5911 D glock_wait_internal              httpd
> 5915 D wait_on_buffer                   httpd
> 5930 D wait_on_buffer                   sh

> pdflush       D ffffffff8014aabc     0    51      6            53    50
> (L-TLB)
> 00000100dfc3dc78 0000000000000046 000001011bd3e980 000001010fc11f00
>       0000000000000216 ffffffffa0042916 000001011aca60c0 0000000000000008
>       000001011fdef7f0 0000000000000dfa
> Call Trace:<ffffffffa0042916>{:dm_mod:dm_request+396}
> <ffffffff8014aabc>{keventd_create_kthread+0}
>       <ffffffff803053ef>{io_schedule+38}
> <ffffffff80178c4c>{__wait_on_buffer+125}
>       <ffffffff80178ad2>{bh_wake_function+0}
> <ffffffff80178ad2>{bh_wake_function+0}
>       <ffffffffa0235c5d>{:gfs:gfs_logbh_wait+49}
> <ffffffffa024a6a6>{:gfs:disk_commit+794}
>       <ffffffffa024a877>{:gfs:log_refund+111}
> <ffffffffa024ad8e>{:gfs:log_flush_internal+510}
>       <ffffffff8017d682>{sync_supers+167} <ffffffff8015e310>{wb_kupdate+36}
>
>       <ffffffff8015edb4>{pdflush+323} <ffffffff8015e2ec>{wb_kupdate+0}
>       <ffffffff8015ec71>{pdflush+0} <ffffffff8014aa93>{kthread+200}
>       <ffffffff80110e17>{child_rip+8}
> <ffffffff8014aabc>{keventd_create_kthread+0}
>       <ffffffff8014a9cb>{kthread+0} <ffffffff80110e0f>{child_rip+0}
> lock_dlm1     D 000001000c0096e0     0  5771      6          5772  5766
> (L-TLB)
> 0000010113ce3c58 0000000000000046 0000001000000000 0000010000000069
>       000001011420b030 0000000000000069 000001000c00a940 000000010000eb10
>       000001011a887030 0000000000001cae
> Call Trace:<ffffffff802496d4>{__generic_unplug_device+19}
> <ffffffff803053ef>{io_schedule+38}
>       <ffffffff80159215>{__lock_page+191}
> <ffffffff80158cfa>{page_wake_function+0}
>       <ffffffff80158cfa>{page_wake_function+0}
> <ffffffff80163125>{truncate_inode_pages+519}
>       <ffffffffa0258f35>{:gfs:gfs_inval_page+63}
> <ffffffffa02401b5>{:gfs:drop_bh+233}
>       <ffffffffa0242138>{:gfs:gfs_glock_cb+194}
> <ffffffffa02869dd>{:lock_dlm:dlm_async+1989}
>       <ffffffff801333c8>{default_wake_function+0}
> <ffffffff8014aabc>{keventd_create_kthread+0}
>       <ffffffffa0286218>{:lock_dlm:dlm_async+0}
> <ffffffff8014aabc>{keventd_create_kthread+0}
>       <ffffffff8014aa93>{kthread+200} <ffffffff80110e17>{child_rip+8}
>       <ffffffff8014aabc>{keventd_create_kthread+0}
> <ffffffff8014a9cb>{kthread+0}
>       <ffffffff80110e0f>{child_rip+0}
> gfs_logd      D 0000000000000000     0  5776      1          5777  5775
> (L-TLB)
> 000001011387fe38 0000000000000046 0000000000000000 ffffffff80304a85
>       000001011387fe58 ffffffff80304add ffffffff803cca80 0000000000000246
>       00000101143fe030 00000000000000b5
> Call Trace:<ffffffff80304a85>{thread_return+0}
> <ffffffff80304add>{thread_return+88}
>       <ffffffffa023e8d3>{:gfs:lock_on_glock+112}
> <ffffffff8030565b>{__down_write+134}
>       <ffffffffa0249cdb>{:gfs:gfs_ail_empty+56}
> <ffffffffa0233930>{:gfs:gfs_logd+77}
>       <ffffffff80110e17>{child_rip+8}
> <ffffffff801cccff>{dummy_d_instantiate+0}
>       <ffffffffa02338e3>{:gfs:gfs_logd+0} <ffffffff80110e0f>{child_rip+0}
>
> gfs_quotad    D 0000000000000000     0  5777      1          5778  5776
> (L-TLB)
> 0000010113881e98 0000000000000046 0000000000000000 ffffffff80304a85
>       0000010113881eb8 ffffffff80304add 000001011ff87030 0000000100000074
>       000001011430f7f0 0000000000000128
> Call Trace:<ffffffff80304a85>{thread_return+0}
> <ffffffff80304add>{thread_return+88}
>       <ffffffff8030565b>{__down_write+134}
> <ffffffffa025b55a>{:gfs:gfs_quota_sync+226}
>       <ffffffffa0233ab1>{:gfs:gfs_quotad+127}
> <ffffffff80110e17>{child_rip+8}
>       <ffffffff801cccff>{dummy_d_instantiate+0}
> <ffffffff801cccff>{dummy_d_instantiate+0}
>       <ffffffff801cccff>{dummy_d_instantiate+0}
> <ffffffffa0233a32>{:gfs:gfs_quotad+0}
>       <ffffffff80110e0f>{child_rip+0}
> gfs_inoded    D 0000000000000000     0  5778      1          5807  5777
> (L-TLB)
> 0000010113883e98 0000000000000046 000001011e2937f0 000001000c0096e0
>       0000000000000000 ffffffff80304a85 0000010113883ec8 0000000180304add
>       000001011e2937f0 00000000000000c2
> Call Trace:<ffffffff80304a85>{thread_return+0}
> <ffffffff8030565b>{__down_write+134}
>       <ffffffffa026160d>{:gfs:unlinked_find+115}
> <ffffffffa0261c6c>{:gfs:gfs_unlinked_dealloc+25}
>       <ffffffffa0233bd5>{:gfs:gfs_inoded+66}
> <ffffffff80110e17>{child_rip+8}
>       <ffffffffa0233b93>{:gfs:gfs_inoded+0} <ffffffff80110e0f>{child_rip+0}
>
>
> httpd         D ffffffff80304190     0  5892      1  5893          5826
> (NOTLB)
> 0000010111b75bf8 0000000000000002 0000000000000001 0000000000000001
>       0000000000000000 0000000000000000 0000010114667980 0000000111b75bc0
>       00000101143fe7f0 00000000000009ad
> Call Trace:<ffffffff80303d6f>{__down+147}
> <ffffffff801333c8>{default_wake_function+0}
>       <ffffffff8015b3a2>{generic_file_write_nolock+158}
> <ffffffff80305780>{__down_failed+53}
>       <ffffffffa0236986>{:gfs:.text.lock.dio+95}
> <ffffffffa0260e4c>{:gfs:gfs_trans_add_bh+205}
>       <ffffffffa0253efc>{:gfs:do_write_buf+1138}
> <ffffffffa0252db3>{:gfs:walk_vm+278}
>       <ffffffffa0253a8a>{:gfs:do_write_buf+0}
> <ffffffffa0253a8a>{:gfs:do_write_buf+0}
>       <ffffffffa025415b>{:gfs:__gfs_write+201}
> <ffffffff80177c60>{vfs_write+207}
>       <ffffffff80177d48>{sys_write+69} <ffffffff801101c6>{system_call+126}
>
> httpd         D 0000010110ad7d48     0  5895   5892          5896  5893
> (NOTLB)
> 0000010110ad7bd8 0000000000000006 000001011b16e030 0000000000000075
>       0000010117002030 0000000000000075 000001000c002940 0000000000000001
>       00000101170027f0 000000000001300e
> Call Trace:<ffffffff80131d1d>{try_to_wake_up+863}
> <ffffffff80304cbd>{wait_for_completion+167}
>       <ffffffff801333c8>{default_wake_function+0}
> <ffffffff801333c8>{default_wake_function+0}
>       <ffffffffa023f4b1>{:gfs:glock_wait_internal+350}
> <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
>       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
>       <ffffffff80168211>{do_no_page+1003}
> <ffffffff80167b13>{do_wp_page+948}
>       <ffffffff8016858f>{handle_mm_fault+343}
> <ffffffff80142a06>{get_signal_to_deliver+1118}
>       <ffffffff801236d2>{do_page_fault+518}
> <ffffffff80304a85>{thread_return+0}
>       <ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
>
>
> httpd         D 0000010110b5bd48     0  5896   5892          5897  5895
> (NOTLB)
> 0000010110b5bbd8 0000000000000002 00000101170027f0 0000000000000075
>       00000101114787f0 0000000000000075 000001000c002940 0000000000000001
>       0000010117002030 000000000000fb3e
> Call Trace:<ffffffff80131d1d>{try_to_wake_up+863}
> <ffffffff80304cbd>{wait_for_completion+167}
>       <ffffffff801333c8>{default_wake_function+0}
> <ffffffff801333c8>{default_wake_function+0}
>       <ffffffffa023f4b1>{:gfs:glock_wait_internal+350}
> <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
>       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
>       <ffffffff80168211>{do_no_page+1003}
> <ffffffff80167b13>{do_wp_page+948}
>       <ffffffff8016858f>{handle_mm_fault+343}
> <ffffffff80142a06>{get_signal_to_deliver+1118}
>       <ffffffff801236d2>{do_page_fault+518}
> <ffffffff802a3445>{sys_accept+327}
>       <ffffffff80182e88>{pipe_read+26} <ffffffff80110c61>{error_exit+0}
>
> httpd         D 0000000000000000     0  5897   5892          5911  5896
> (NOTLB)
> 0000010110119bd8 0000000000000006 0000010117002030 0000000000000075
>       0000010117002030 0000000000000075 000001000c00a940 000000001b16e030
>       00000101114787f0 000000000000fbe0
> Call Trace:<ffffffff802496d4>{__generic_unplug_device+19}
> <ffffffff80304cbd>{wait_for_completion+167}
>       <ffffffff801333c8>{default_wake_function+0}
> <ffffffff801333c8>{default_wake_function+0}
>       <ffffffffa023f4b1>{:gfs:glock_wait_internal+350}
> <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
>       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
>       <ffffffff80168211>{do_no_page+1003}
> <ffffffff80167b13>{do_wp_page+948}
>       <ffffffff8016858f>{handle_mm_fault+343}
> <ffffffff80142a06>{get_signal_to_deliver+1118}
>       <ffffffff801236d2>{do_page_fault+518}
> <ffffffff80304a85>{thread_return+0}
>       <ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
>
>
> httpd         D 00000101100c3d48     0  5911   5892          5915  5897
> (NOTLB)
> 00000101100c3bd8 0000000000000002 000001011420b7f0 0000000000000075
>       00000101170027f0 0000000000000075 000001000c002940 0000000000000000
>       000001011b16e030 000000000000187e
> Call Trace:<ffffffff80131d1d>{try_to_wake_up+863}
> <ffffffff80304cbd>{wait_for_completion+167}
>       <ffffffff801333c8>{default_wake_function+0}
> <ffffffff801333c8>{default_wake_function+0}
>       <ffffffffa023f4b1>{:gfs:glock_wait_internal+350}
> <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
>       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
>       <ffffffff80168211>{do_no_page+1003}
> <ffffffff80167b13>{do_wp_page+948}
>       <ffffffff8016858f>{handle_mm_fault+343}
> <ffffffff80142a06>{get_signal_to_deliver+1118}
>       <ffffffff801236d2>{do_page_fault+518}
> <ffffffff80304a85>{thread_return+0}
>       <ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
>
>
> httpd         D 0000000000006a36     0  5915   5892                5911
> (NOTLB)
> 00000101180f7ad8 0000000000000006 0000000000002706 ffffffffa020c791
>       0000000000000000 0000000000000000 0000030348ac8c1c 0000000114a217f0
>       0000010114c997f0 000000000000076a
> Call Trace:<ffffffffa020c791>{:dlm:lkb_swqueue+43}
> <ffffffff803053ef>{io_schedule+38}
>       <ffffffff80178c4c>{__wait_on_buffer+125}
> <ffffffff80178ad2>{bh_wake_function+0}
>       <ffffffff80178ad2>{bh_wake_function+0}
> <ffffffffa02352c6>{:gfs:gfs_dreread+154}
>       <ffffffffa0235332>{:gfs:gfs_dread+40}
> <ffffffffa02363b1>{:gfs:gfs_get_meta_buffer+201}
>       <ffffffffa0242999>{:gfs:gfs_copyin_dinode+23}
> <ffffffffa0242461>{:gfs:inode_go_lock+38}
>       <ffffffffa023f586>{:gfs:glock_wait_internal+563}
> <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
>       <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
> <ffffffffa0258b7b>{:gfs:gfs_private_nopage+84}
>       <ffffffff80168211>{do_no_page+1003}
> <ffffffff80167b13>{do_wp_page+948}
>       <ffffffff8016858f>{handle_mm_fault+343}
> <ffffffff80142a06>{get_signal_to_deliver+1118}
>       <ffffffff801236d2>{do_page_fault+518}
> <ffffffff80304a85>{thread_return+0}
>       <ffffffff80304add>{thread_return+88} <ffffffff80110c61>{error_exit+0}
>
>
> sh            D 000000000000001a     0  5930   2547
> (NOTLB)
> 000001011090f8e8 0000000000000002 0000010111293d88 0000010110973d00
>       0000010111293d88 0000000000000000 00000100dfc02400 0000000000010000
>       00000101148557f0 0000000000002010
> Call Trace:<ffffffff803053ef>{io_schedule+38}
> <ffffffff80178c4c>{__wait_on_buffer+125}
>       <ffffffff80178ad2>{bh_wake_function+0}
> <ffffffff80178ad2>{bh_wake_function+0}
>       <ffffffffa02352c6>{:gfs:gfs_dreread+154}
> <ffffffffa0235332>{:gfs:gfs_dread+40}
>       <ffffffffa02363b1>{:gfs:gfs_get_meta_buffer+201}
> <ffffffffa0242999>{:gfs:gfs_copyin_dinode+23}
>       <ffffffffa0242461>{:gfs:inode_go_lock+38}
> <ffffffffa023f586>{:gfs:glock_wait_internal+563}
>       <ffffffffa023fce6>{:gfs:gfs_glock_nq+961}
> <ffffffffa023ff11>{:gfs:gfs_glock_nq_init+20}
>       <ffffffff801ccb78>{dummy_inode_permission+0}
> <ffffffffa0257aca>{:gfs:gfs_permission+64}
>       <ffffffff8018d475>{dput+56} <ffffffff80183d32>{permission+51}
>       <ffffffff801844aa>{__link_path_walk+372}
> <ffffffff801851c2>{link_path_walk+82}
>       <ffffffff8012370b>{do_page_fault+575}
> <ffffffff801849b0>{__link_path_walk+1658}
>       <ffffffff801851c2>{link_path_walk+82}
> <ffffffff8012370b>{do_page_fault+575}
>       <ffffffff8018540f>{path_lookup+451}
> <ffffffff801856bb>{__user_walk+47}
>       <ffffffff8017ff1a>{vfs_stat+24} <ffffffff8012370b>{do_page_fault+575}
>
>       <ffffffff80180264>{sys_newstat+17} <ffffffff80110c61>{error_exit+0}
>       <ffffffff801101c6>{system_call+126}




--
Best Regards,
Anton Kornev.
[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]