[Cluster-devel] Panic when stopping gulm.


I sometimes get panics when stopping gulm on my whole cluster.
They are not very frequent. The panics occur inside a function
of the "ipv6" module, called from one of the gulm kernel threads.

Process gulm_res_recvd (pid: 5029, threadinfo 0000010021300000, task
Stack: 0000000000004034 0000000000000000 0000001e124dd670 000001001533d380
       0000010021301d08 0000000000000000 0000010021301e18 000001003c3a5e00
       00000100124dd670 0000000023222120
Call Trace:<ffffffffa01d7a58>{:ipv6:tcp_v6_xmit+611}
       <ffffffffa02c6af4>{:lock_gulm:do_tfer+252}
       <ffffffffa02c5c53>{:lock_gulm:xdr_enc_flush+44}
       <ffffffffa02c383b>{:lock_gulm:lg_core_handle_messages+394}
       <ffffffffa02be1b7>{:lock_gulm:cm_io_recving_thread+73}
       <ffffffff80110e17>{child_rip+8}
       <ffffffff80110e0f>{child_rip+0}

I looked at the code in src/gulm/xdr_io.c, in the function "do_tfer",
and found something strange:

	for (;;) {
		m.msg_iov = iov;
		m.msg_iovlen = n;
		m.msg_flags = MSG_NOSIGNAL;

		if (dir)
			rv = sock_sendmsg (sock, &m, size - moved);
		else
			rv = sock_recvmsg (sock, &m, size - moved, 0);

		if (rv <= 0)
			goto out_err;
		moved += rv;

		if (moved >= size)
			break;

		/* adjust iov's for next transfer */
		while (iov->iov_len == 0) {

In my opinion, when "sock_sendmsg" returns less than the size it was
asked to send, we reach
		while (iov->iov_len == 0) {
even if we are already at the last buffer, and "n", the number of
buffers in the "iov" table, is never checked. "sock_sendmsg" is then
called with an invalid buffer pointer (m.msg_iov = iov).
I don't know whether this matters in practice, since "n" always equals
"1" wherever "do_tfer" is called.

Anyway, this couldn't happen if "n" were checked:
		while ((n > 1) && (iov->iov_len == 0)) {
		if (n <= 1) break;
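
To illustrate the bounds check, here is a minimal user-space sketch, not
the gulm code itself. The helper name "iov_advance" is hypothetical, and
it assumes the caller passes the byte count returned by the last
transfer. In user space the send call does not update the iovec, so the
helper adjusts it explicitly; it tests "n" before every access, so it
never reads past the end of the table.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/uio.h>

/*
 * Hypothetical helper (not part of gulm): consume `done` bytes from the
 * front of the iovec table and return the number of entries still left.
 * *iovp is moved past every fully consumed (or empty) entry, and n is
 * checked before each entry is touched.
 */
static int iov_advance(struct iovec **iovp, int n, size_t done)
{
	struct iovec *iov = *iovp;

	/* Skip entries that the last transfer consumed completely. */
	while (n > 0 && done >= iov->iov_len) {
		done -= iov->iov_len;
		iov->iov_len = 0;
		iov++;
		n--;
	}

	/* Partially consumed entry: move its start forward. */
	if (n > 0) {
		iov->iov_base = (char *)iov->iov_base + done;
		iov->iov_len -= done;
	}

	/* The caller must not claim more bytes than the table holds. */
	assert(n > 0 || done == 0);

	*iovp = iov;
	return n;
}
```

A caller would stop its transfer loop as soon as the helper returns 0,
instead of handing the advanced pointer back to the send call.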

This still doesn't guarantee that the message will be sent as a
whole. Maybe the solution is to use:
		m.msg_flags = MSG_NOSIGNAL | MSG_WAITALL;
together with a loop over sock_sendmsg until the full message has
been sent.
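
As a rough illustration of the "loop until everything is sent" idea,
here is a user-space analogue using send() on a plain socket. It is only
a sketch: "send_all" is a hypothetical name, it handles a single flat
buffer instead of an iovec table, and the in-kernel code would call
sock_sendmsg instead.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Hypothetical user-space helper: keep calling send() until all `len`
 * bytes have been written. Returns 0 on success, -1 on error.
 */
static int send_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;
	size_t moved = 0;

	while (moved < len) {
		ssize_t rv = send(fd, p + moved, len - moved, MSG_NOSIGNAL);

		if (rv < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, retry */
			return -1;
		}
		if (rv == 0)
			return -1;	/* no progress, give up */

		moved += (size_t)rv;	/* short send: loop for the rest */
	}
	return 0;
}
```

The loop itself copes with short sends, whatever flags are passed.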

Any ideas on this?

Thanks in advance,

Mathieu Avila

