[Linux-cluster] service stuck in "recovering", no attempt to restart

Ofer Inbar cos at aaaaa.org
Wed Oct 5 02:23:04 UTC 2011


On a 3-node cluster running:
  cman-2.0.115-34.el5_5.3
  rgmanager-2.0.52-6.el5.centos.8
  openais-0.80.6-16.el5_5.9

We have a custom resource, "dn", for which I wrote the resource agent.
The service has three resources: a virtual IP (using ip.sh) and two dn children.

Normally, when one of the dn instances fails its status check,
rgmanager stops the service (stops dn_a and dn_b, then stops the IP),
then relocates to another node and starts the service there.

Several hours ago, one of the dn instances failed its status check.
rgmanager stopped it and marked the service "recovering", but then did
not seem to try to start it on any node.  It just stayed down for
hours until I logged in to look at it.

Until 17:22 today, the service was running on node1.  Here's what it logged:

Oct  4 17:22:12 clustnode1 clurgmgrd: [517]: <err> Monitoring Service dn:dn_b > Service Is Not Running
Oct  4 17:22:12 clustnode1 clurgmgrd[517]: <notice> status on dn "dn_b" returned 1 (generic error)
Oct  4 17:22:12 clustnode1 clurgmgrd[517]: <notice> Stopping service service:dn
Oct  4 17:22:12 clustnode1 clurgmgrd: [517]: <info> Stopping Service dn:dn_b
Oct  4 17:22:12 clustnode1 clurgmgrd: [517]: <notice> Checking if stopped: check_pid_file /dn/dn_b/dn_b.pid
Oct  4 17:22:14 clustnode1 clurgmgrd: [517]: <info> Stopping Service dn:dn_b > Succeed
Oct  4 17:22:14 clustnode1 clurgmgrd: [517]: <info> Stopping Service dn:dn_a
Oct  4 17:22:15 clustnode1 clurgmgrd: [517]: <notice> Checking if stopped: check_pid_file /dn/dn_a/dn_a.pid
Oct  4 17:22:17 clustnode1 clurgmgrd: [517]: <info> Stopping Service dn:dn_a > Succeed
Oct  4 17:22:17 clustnode1 clurgmgrd: [517]: <info> Removing IPv4 address 10.6.9.136/23 from eth0
Oct  4 17:22:27 clustnode1 clurgmgrd[517]: <notice> Service service:dn is recovering
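
To make the timing of that sequence easier to read, here is a minimal
Python sketch that parses a few of the clurgmgrd lines above and reports
how long the stop took.  The function name parse_rgmanager_log and the
regular expression are my own illustration, not part of rgmanager, and
the year 2011 is supplied by hand because syslog timestamps omit it.

```python
import re
from datetime import datetime

# A subset of the node1 log lines quoted above.
LOG = """\
Oct  4 17:22:12 clustnode1 clurgmgrd: [517]: <err> Monitoring Service dn:dn_b > Service Is Not Running
Oct  4 17:22:12 clustnode1 clurgmgrd[517]: <notice> Stopping service service:dn
Oct  4 17:22:17 clustnode1 clurgmgrd: [517]: <info> Removing IPv4 address 10.6.9.136/23 from eth0
Oct  4 17:22:27 clustnode1 clurgmgrd[517]: <notice> Service service:dn is recovering
"""

# Matches both "clurgmgrd[517]:" and "clurgmgrd: [517]:" prefixes.
LINE = re.compile(
    r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+"
    r"clurgmgrd(?:\[\d+\])?:(?:\s+\[\d+\]:)?\s+<(\w+)>\s+(.*)$"
)


def parse_rgmanager_log(text, year=2011):
    """Return (timestamp, host, level, message) for each clurgmgrd line."""
    events = []
    for line in text.splitlines():
        match = LINE.match(line)
        if not match:
            continue
        stamp, host, level, message = match.groups()
        # %b expects English month names (C/English locale assumed).
        when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        events.append((when, host, level, message))
    return events


if __name__ == "__main__":
    events = parse_rgmanager_log(LOG)
    elapsed = (events[-1][0] - events[0][0]).total_seconds()
    print(f"{len(events)} events, {elapsed:.0f}s from first to last")
    print("last message:", events[-1][3])
```

The last parsed message is the "is recovering" notice; nothing from
clurgmgrd follows it in the log.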

At around that time, node2 also logged this:

Oct  4 17:21:19 clustnode2 ccsd[5584]: Unable to read complete comm_header_t.
Oct  4 17:21:29 clustnode2 ccsd[5584]: Unable to read complete comm_header_t.

[Cluster name and node names anonymized with simple search and replace]

There are no other log entries related to the cluster suite in
/var/log/messages on any node around that time.

Currently, the service is still "recovering", with cluster status
otherwise apparently fine.  clustat -x output on all three nodes is
identical except for which node has local="1".  It looks like this:

<?xml version="1.0"?>
<clustat version="4.1.1">
  <cluster name="clustnode" id="23048" generation="12"/>
  <quorum quorate="1" groupmember="1"/>
  <nodes>
    <node name="clustnode1" state="1" local="0" estranged="0" rgmanager="1" rgmanager_master="0" qdisk="0" nodeid="0x00000001"/>
    <node name="clustnode2" state="1" local="1" estranged="0" rgmanager="1" rgmanager_master="0" qdisk="0" nodeid="0x00000002"/>
    <node name="clustnode3" state="1" local="0" estranged="0" rgmanager="1" rgmanager_master="0" qdisk="0" nodeid="0x00000003"/>
  </nodes>
  <groups>
    <group name="service:dn" state="118" state_str="recovering" flags="0" flags_str="" owner="none" last_owner="clustnode1" restarts="0" last_transition="1317763347" last_transition_str="Tue Oct  4 17:22:27 2011"/>
  </groups>
</clustat>
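
For anyone who wants to check for this condition from a script, here is
a rough Python sketch that reads clustat -x output and lists services
that are "recovering" with no owner.  The function name stuck_services
is hypothetical, and the sample XML is a trimmed copy of the output
above; in practice the text would come from running clustat -x.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the clustat -x output shown above.
SAMPLE = """<?xml version="1.0"?>
<clustat version="4.1.1">
  <cluster name="clustnode" id="23048" generation="12"/>
  <quorum quorate="1" groupmember="1"/>
  <groups>
    <group name="service:dn" state="118" state_str="recovering" flags="0" flags_str="" owner="none" last_owner="clustnode1" restarts="0" last_transition="1317763347" last_transition_str="Tue Oct  4 17:22:27 2011"/>
  </groups>
</clustat>"""


def stuck_services(clustat_xml):
    """Return (name, last_owner, last_transition) for ownerless recovering services."""
    root = ET.fromstring(clustat_xml)
    return [
        (group.get("name"), group.get("last_owner"), int(group.get("last_transition")))
        for group in root.iter("group")
        if group.get("state_str") == "recovering" and group.get("owner") == "none"
    ]


if __name__ == "__main__":
    for name, last_owner, since in stuck_services(SAMPLE):
        print(f"{name}: recovering since epoch {since}, last owner {last_owner}")
```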

And cman_tool status shows all three nodes voting and in the quorum:

Version: 6.2.0
Config Version: 2
Cluster Name: clustnode
Cluster Id: 23048
Cluster Member: Yes
Cluster Generation: 12
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2  
Active subsystems: 8
Flags: Dirty 
Ports Bound: 0 177  
Node name: clustnode2
Node ID: 2
Multicast addresses: 239.245.0.84 
Node addresses: 10.6.8.208 

Again, this looks the same on all three nodes.
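
As a sanity check on those numbers, the sketch below parses the vote
lines from the output above and compares them against a simple-majority
rule.  The rule expected_votes // 2 + 1 is my assumption about how the
quorum threshold is derived; it does match the "Quorum: 2" reported for
3 expected votes.  The helper names are illustrative only.

```python
# The vote-related lines from the cman_tool status output above.
STATUS = """\
Nodes: 3
Expected votes: 3
Total votes: 3
Quorum: 2
"""


def parse_status(text):
    """Split 'Key: value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def majority_quorum(expected_votes):
    """Simple majority: more than half of the expected votes (assumed rule)."""
    return expected_votes // 2 + 1


if __name__ == "__main__":
    fields = parse_status(STATUS)
    expected = int(fields["Expected votes"])
    total = int(fields["Total votes"])
    reported = int(fields["Quorum"])
    print("majority rule matches reported quorum:", majority_quorum(expected) == reported)
    print("cluster has enough votes:", total >= reported)
```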

Here's the resource section of cluster.conf (with the values of some
of the arguments to my custom resource modified so as not to expose
actual username, path, or port number):

<rm log_level="6">
  <service autostart="1" name="dn" recovery="relocate">
    <ip address="10.6.9.136" monitor_link="1">
      <dn user="username" dninstall="/dn/path" name="dn_a" monitoringport="portnum"/>
      <dn user="username" dninstall="/dn/path" name="dn_b" monitoringport="portnum"/>
    </ip>
  </service>
</rm>
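
To show how the nesting relates to the stop order seen in the log, here
is a small Python sketch that walks the service tree above.  It assumes
that resources start parent-first in document order and stop in the
exact reverse, which is consistent with the logged sequence (dn_b, then
dn_a, then the IP) but is not taken from the rgmanager source.  The
function name start_order is hypothetical.

```python
import xml.etree.ElementTree as ET

# The <rm> section quoted above, with the same placeholder values.
CONF = """<rm log_level="6">
  <service autostart="1" name="dn" recovery="relocate">
    <ip address="10.6.9.136" monitor_link="1">
      <dn user="username" dninstall="/dn/path" name="dn_a" monitoringport="portnum"/>
      <dn user="username" dninstall="/dn/path" name="dn_b" monitoringport="portnum"/>
    </ip>
  </service>
</rm>"""


def start_order(service):
    """Depth-first, parent-before-children listing of a service's resources."""
    order = []

    def visit(node):
        for child in node:
            # ip resources are identified by address, dn resources by name.
            order.append(child.get("name") or child.get("address"))
            visit(child)

    visit(service)
    return order


if __name__ == "__main__":
    service = ET.fromstring(CONF).find("service")
    starts = start_order(service)
    print("assumed start order:", starts)
    print("assumed stop order: ", list(reversed(starts)))
```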

Any ideas why it might be in this state, where everything is
apparently fine except that the service is "recovering" and rgmanager
isn't trying to do anything about it and isn't logging any complaints?

Attached: strace -fp output of the clurgmgrd processes on node1 and node2
  -- Cos
-------------- next part --------------
Process 517 attached with 4 threads - interrupt to quit
[pid  9842] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1001] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid  1000] select(6, [3 5], NULL, NULL, {0, 935000} <unfinished ...>
[pid   517] select(12, [10 11], NULL, NULL, {8, 177000} <unfinished ...>
[pid  9842] <... clock_gettime resumed> {1317781205, 661864000}) = 0
[pid  1001] <... clock_gettime resumed> {1317781205, 661864000}) = 0
[pid  9842] futex(0x432a5cbc, FUTEX_WAIT_PRIVATE, 3573, {7, 357519000} <unfinished ...>
[pid  1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81853, {0, 867658000}) = -1 ETIMEDOUT (Connection timed out)
[pid  1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  1001] clock_gettime(CLOCK_REALTIME, {1317781206, 530851000}) = 0
[pid  1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81855, {2, 999711000} <unfinished ...>
[pid  1000] <... select resumed> )      = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2}) = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid  1001] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  1001] clock_gettime(CLOCK_REALTIME, {1317781209, 532508000}) = 0
[pid  1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81857, {3, 0} <unfinished ...>
[pid  1000] <... select resumed> )      = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid  1001] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  1001] clock_gettime(CLOCK_REALTIME, {1317781212, 534580000}) = 0
[pid  1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81859, {3, 0} <unfinished ...>
[pid  1000] <... select resumed> )      = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid  9842] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  9842] futex(0x432a5c90, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  9842] clock_gettime(CLOCK_REALTIME, {1317781213, 21497000}) = 0
[pid  9842] futex(0x432a5cbc, FUTEX_WAIT_PRIVATE, 3575, {10, 0} <unfinished ...>
[pid   517] <... select resumed> )      = 0 (Timeout)
[pid   517] socket(PF_FILE, SOCK_STREAM, 0) = 13
[pid   517] connect(13, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0
[pid   517] write(13, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20
[pid   517] read(13, "\1\0\0\0\0\0\0\0\350D9\3\0\0\0\0\0\0\0\0", 20) = 20
[pid   517] close(13)                   = 0
[pid   517] socket(PF_FILE, SOCK_STREAM, 0) = 13
[pid   517] connect(13, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0
[pid   517] write(13, "\3\0\0\0\0\0\0\0\350D9\3\0\0\0\0\31\0\0\0/cluster/@co"..., 45) = 45
[pid   517] read(13, "\3\0\0\0\0\0\0\0\350D9\3\0\0\0\0\2\0\0\0", 20) = 20
[pid   517] read(13, "2\0", 2)          = 2
[pid   517] close(13)                   = 0
[pid   517] socket(PF_FILE, SOCK_STREAM, 0) = 13
[pid   517] connect(13, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0
[pid   517] write(13, "\2\0\0\0\0\0\0\0\350D9\3\0\0\0\0\0\0\0\0", 20) = 20
[pid   517] read(13, "\2\0\0\0\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0", 20) = 20
[pid   517] close(13)                   = 0
[pid   517] clone(Process 18772 attached
child_stack=0x40ebc240, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x40ebc9c0, tls=0x40ebc930, child_tidptr=0x40ebc9c0) = 18772
[pid   517] select(12, [10 11], NULL, NULL, {10, 0} <unfinished ...>
[pid 18772] set_robust_list(0x40ebc9d0, 0x18) = 0
[pid 18772] rt_sigprocmask(SIG_BLOCK, [HUP INT QUIT USR1 USR2 TERM], NULL, 8) = 0
[pid 18772] _exit(0)                    = ?
Process 18772 detached
[pid  1000] <... select resumed> )      = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid  1001] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  1001] clock_gettime(CLOCK_REALTIME, {1317781215, 536718000}) = 0
[pid  1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81861, {3, 0} <unfinished ...>
[pid  1000] <... select resumed> )      = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid  1001] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  1001] clock_gettime(CLOCK_REALTIME, {1317781218, 538706000}) = 0
[pid  1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81863, {3, 0} <unfinished ...>
[pid  1000] <... select resumed> )      = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2}) = 0 (Timeout)
[pid  1000] read(5, 0x428a4f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid  1000] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid  1001] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid  1001] futex(0x12fac8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid  1001] clock_gettime(CLOCK_REALTIME, {1317781221, 540821000}) = 0
[pid  1001] futex(0x12fac8ec, FUTEX_WAIT_PRIVATE, 81865, {3, 0}
-------------- next part --------------
Process 28445 attached with 4 threads - interrupt to quit
[pid 28962] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 28931] clock_gettime(CLOCK_REALTIME,  <unfinished ...>
[pid 28930] select(6, [3 5], NULL, NULL, {1, 725000} <unfinished ...>
[pid 28445] select(12, [10 11], NULL, NULL, {6, 894000} <unfinished ...>
[pid 28962] <... clock_gettime resumed> {1317781260, 477926000}) = 0
[pid 28931] <... clock_gettime resumed> {1317781260, 477926000}) = 0
[pid 28962] futex(0x429dacbc, FUTEX_WAIT_PRIVATE, 24531, {4, 991782000} <unfinished ...>
[pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81869, {2, 613666000} <unfinished ...>
[pid 28930] <... select resumed> )      = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid 28931] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 28931] clock_gettime(CLOCK_REALTIME, {1317781263, 93684000}) = 0
[pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81871, {3, 0} <unfinished ...>
[pid 28930] <... select resumed> )      = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid 28962] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 28962] futex(0x429dac90, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 28962] clock_gettime(CLOCK_REALTIME, {1317781265, 471616000}) = 0
[pid 28962] futex(0x429dacbc, FUTEX_WAIT_PRIVATE, 24533, {10, 0} <unfinished ...>
[pid 28931] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 28931] clock_gettime(CLOCK_REALTIME, {1317781266, 95446000}) = 0
[pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81873, {3, 0} <unfinished ...>
[pid 28930] <... select resumed> )      = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid 28445] <... select resumed> )      = 0 (Timeout)
[pid 28445] socket(PF_FILE, SOCK_STREAM, 0) = 14
[pid 28445] connect(14, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0
[pid 28445] write(14, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 20) = 20
[pid 28445] read(14, "\1\0\0\0\0\0\0\0\"\rc\4\0\0\0\0\0\0\0\0", 20) = 20
[pid 28445] close(14)                   = 0
[pid 28445] socket(PF_FILE, SOCK_STREAM, 0) = 14
[pid 28445] connect(14, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0
[pid 28445] write(14, "\3\0\0\0\0\0\0\0\"\rc\4\0\0\0\0\31\0\0\0/cluster/@co"..., 45) = 45
[pid 28445] read(14, "\3\0\0\0\0\0\0\0\"\rc\4\0\0\0\0\2\0\0\0", 20) = 20
[pid 28445] read(14, "2\0", 2)          = 2
[pid 28445] close(14)                   = 0
[pid 28445] socket(PF_FILE, SOCK_STREAM, 0) = 14
[pid 28445] connect(14, {sa_family=AF_FILE, path="/var/run/cluster/ccsd.sock"...}, 110) = 0
[pid 28445] write(14, "\2\0\0\0\0\0\0\0\"\rc\4\0\0\0\0\0\0\0\0", 20) = 20
[pid 28445] read(14, "\2\0\0\0\0\0\0\0\377\377\377\377\0\0\0\0\0\0\0\0", 20) = 20
[pid 28445] close(14)                   = 0
[pid 28445] clone(Process 29968 attached
child_stack=0x40705240, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x407059c0, tls=0x40705930, child_tidptr=0x407059c0) = 29968
[pid 28445] select(12, [10 11], NULL, NULL, {10, 0} <unfinished ...>
[pid 29968] set_robust_list(0x407059d0, 0x18) = 0
[pid 29968] rt_sigprocmask(SIG_BLOCK, [HUP INT QUIT USR1 USR2 TERM], NULL, 8) = 0
[pid 29968] _exit(0)                    = ?
Process 29968 detached
[pid 28930] <... select resumed> )      = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid 28931] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 28931] clock_gettime(CLOCK_REALTIME, {1317781269, 97451000}) = 0
[pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81875, {3, 0} <unfinished ...>
[pid 28930] <... select resumed> )      = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid 28445] <... select resumed> )      = 1 (in [10], left {6, 643000})
[pid 28445] accept(10, 0, NULL)         = 14
[pid 28445] fcntl(14, F_GETFD)          = 0
[pid 28445] fcntl(14, F_SETFD, FD_CLOEXEC) = 0
[pid 28445] select(15, [14], NULL, [14], {1, 0}) = 1 (in [14], left {1, 0})
[pid 28445] read(14, "\30\0\0\0\4\0\0\0", 8) = 8
[pid 28445] select(15, [14], NULL, [14], {1, 0}) = 1 (in [14], left {1, 0})
[pid 28445] read(14, "\22:\274\0\0\0\0\30\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0", 24) = 24
[pid 28445] clone(Process 29977 attached
child_stack=0x40705240, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x407059c0, tls=0x40705930, child_tidptr=0x407059c0) = 29977
[pid 28445] select(12, [10 11], NULL, NULL, {6, 643000} <unfinished ...>
[pid 29977] set_robust_list(0x407059d0, 0x18) = 0
[pid 29977] rt_sigprocmask(SIG_BLOCK, [HUP INT QUIT USR1 USR2 TERM], NULL, 8) = 0
[pid 29977] select(15, NULL, [14], [14], NULL) = 1 (out [14])
[pid 29977] write(14, "x\0\0\0\4\0\0\0\22:\274\0\0\0\0x\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128
[pid 29977] select(15, NULL, [14], [14], NULL) = 1 (out [14])
[pid 29977] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0N\213y\23", 32) = 32
[pid 29977] select(15, [14], NULL, [14], {10, 0}) = 1 (in [14], left {10, 0})
[pid 29977] read(14, "\30\0\0\0\4\0\0\0", 8) = 8
[pid 29977] select(15, [14], NULL, [14], {10, 0}) = 1 (in [14], left {10, 0})
[pid 29977] read(14, "\22:\274\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\30\0\0\0", 24) = 24
[pid 29977] close(14)                   = 0
[pid 29977] _exit(0)                    = ?
Process 29977 detached
[pid 28445] <... select resumed> )      = 1 (in [10], left {6, 642000})
[pid 28445] accept(10, 0, NULL)         = 14
[pid 28445] fcntl(14, F_GETFD)          = 0
[pid 28445] fcntl(14, F_SETFD, FD_CLOEXEC) = 0
[pid 28445] select(15, [14], NULL, [14], {1, 0}) = 1 (in [14], left {1, 0})
[pid 28445] read(14, "\30\0\0\0\4\0\0\0", 8) = 8
[pid 28445] select(15, [14], NULL, [14], {1, 0}) = 1 (in [14], left {1, 0})
[pid 28445] read(14, "\22:\274\0\0\0\0\30\0\0\0\f\0\0\0\0\0\0\0\0\0\0\0\0", 24) = 24
[pid 28445] select(15, NULL, [14], [14], NULL) = 1 (out [14])
[pid 28445] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\f\0\0\0\1\0\0\0\0\377\377\377\377", 32) = 32
[pid 28445] select(15, NULL, [14], [14], NULL) = 1 (out [14])
[pid 28445] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\f\0\0\0\2\0\0\0\0\377\377\377\377", 32) = 32
[pid 28445] select(15, NULL, [14], [14], NULL) = 1 (out [14])
[pid 28445] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\f\0\0\0\3\0\0\0\0\377\377\377\377", 32) = 32
[pid 28445] select(15, NULL, [14], [14], NULL) = 1 (out [14])
[pid 28445] write(14, "\30\0\0\0\4\0\0\0\22:\274\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\377\377\377\377", 32) = 32
[pid 28445] select(15, [14], NULL, [14], {10, 0}) = 1 (in [14], left {10, 0})
[pid 28445] read(14, "\30\0\0\0\4\0\0\0", 8) = 8
[pid 28445] select(15, [14], NULL, [14], {10, 0}) = 1 (in [14], left {10, 0})
[pid 28445] read(14, "\22:\274\0\0\0\0\30\0\0\0\0\0\0\0\0\0\0\0\0\30\0\0\0", 24) = 24
[pid 28445] close(14)                   = 0
[pid 28445] select(12, [10 11], NULL, NULL, {6, 642000} <unfinished ...>
[pid 28931] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 28931] clock_gettime(CLOCK_REALTIME, {1317781272, 99389000}) = 0
[pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81877, {3, 0} <unfinished ...>
[pid 28930] <... select resumed> )      = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2}) = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2} <unfinished ...>
[pid 28931] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 28931] futex(0x1bdea8c0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 28931] clock_gettime(CLOCK_REALTIME, {1317781275, 101463000}) = 0
[pid 28931] futex(0x1bdea8ec, FUTEX_WAIT_PRIVATE, 81879, {3, 0} <unfinished ...>
[pid 28962] <... futex resumed> )       = -1 ETIMEDOUT (Connection timed out)
[pid 28962] futex(0x429dac90, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 28962] clock_gettime(CLOCK_REALTIME, {1317781275, 474705000}) = 0
[pid 28962] futex(0x429dacbc, FUTEX_WAIT_PRIVATE, 24535, {9, 999904000} <unfinished ...>
[pid 28930] <... select resumed> )      = 0 (Timeout)
[pid 28930] read(5, 0x41587f6b, 1)      = -1 EAGAIN (Resource temporarily unavailable)
[pid 28930] select(6, [3 5], NULL, NULL, {2, 2}

