Bizarro system lockup/hang, possible kernel issue.

Fri Jun 9 10:26:04 UTC 2006

2006/6/9, Naoki <naoki at valuecommerce.com>:
> On Fri, 2006-06-09 at 14:40 +0900, Naoki wrote:
> > Hi all,
> >
> > We have a range of FC5 boxes (~4 with the problem) and from roughly
> > three/four weeks ago we've been seeing intermittent lockups/hangs which
> > often result in having to power-cycle. The boxes are currently running
> > 2.6.16-1.2122_FC5 or 2.6.16-1.2122_FC5smp but despite slightly different
> > hardware all display the same issue. The filesystem is ReiserFS.
> >
> > Disk operations will stall, commands like 'free' will (might) return
> > quickly, but 'uptime' could take 5 minutes to complete. The system is
> > essentially unusable.
> >
> > Fork rate and number of processes seem to skyrocket; however, I'm not
> > sure whether this is just the monitoring going strange because of the
> > underlying problem.
> >
> > I've finally managed to capture the issue in more detail and was hoping
> > somebody had a clue. Here I perform an "strace -tt ls /var":
> >
> > ...
> > 12:01:28.034014 read(4, "root:x:0:root\nbin:x:1:root,bin,d"..., 131072)
> > = 679
> > 12:01:28.034163 close(4)                = 0
> > 12:01:28.034267 munmap(0xb7cda000, 131072) = 0
> > 12:01:28.034383 lstat64("/var/spool", {st_mode=S_IFDIR|0755,
> > st_size=328, ...}) = 0
> > 12:01:28.034543 getxattr("/var/spool", "system.posix_acl_access", 0x0,
> > 0) = -1 EOPNOTSUPP (Operation not supported)
> > 12:01:28.034679 lstat64("/var/tomcat4", {st_mode=S_IFDIR|0755,
> > st_size=72, ...}) = 0
> > 12:01:27.542319 getxattr("/var/tomcat4", "system.posix_acl_access", 0x0,
> > 0) = -1 EOPNOTSUPP (Operation not supported)
> > 12:01:27.542577 lstat64("/var/net-snmp", {st_mode=S_IFDIR|0700,
> > st_size=80, ...}) = 0
> > 12:01:27.542847 getxattr("/var/net-snmp", "system.posix_acl_access",
> > ...
> >
> > You can see that right after the lstat64 on /var/tomcat4 the timestamp
> > jumps back; in actuality that syscall took about 40 seconds to complete.
> >
> > I have done this a couple of times on different areas of the disk and
> > the results are the same, lstat64 is hanging for extremely long periods.
> >
> > Another side effect of this is that the system time becomes skewed
> > during the hang on an lstat (other calls probably do this too, but I've
> > not been able to trace enough).
> >
> > On one box I've installed 2.6.16-1.2129_FC5 from FC5 testing to see if
> > that helps, on another I've reverted all the way back to
> > kernel-2.6.16-1.2108_FC4.i686.rpm.
> >
> > I've run the smartctl utility to check the disk is ok and that has
> > passed on all servers.  I'll wait and see the results of my kernel
> > updates/regressions.
>
> Happened to another server. Not one of the above-mentioned ones with
> replaced kernels, but this one is also running 2.6.16-1.2122_FC5.
>
> # date; ls -l /var ; date
> Fri Jun  9 18:27:36 JST 2006
> total 3
> drwxr-xr-x 10 root root   264 May 17 11:08 cache
> drwxr-xr-x  3 root root    72 Feb 12 02:16 db
> drwxr-xr-x  3 root root    72 Feb 12 02:16 empty
> drwxr-xr-x  7 vcp  vcp    200 Dec 15 12:12 jsp
> drwxr-xr-x 17 root root   480 May 17 11:20 lib
> drwxr-xr-x  2 root root    48 Feb 12 02:16 local
> drwxrwxr-x  6 root lock   144 Jun  9 05:12 lock
> drwxr-xr-x  9 root root  1896 Jun  4 05:24 log
> lrwxrwxrwx  1 root root    10 May 17 10:55 mail -> spool/mail
> drwxr-x---  4 root named   96 Apr 19 23:12 named
> drwx------  2 root root    80 May 27 10:10 net-snmp
> drwxr-xr-x  2 root root    48 Feb 12 02:16 nis
> drwxr-xr-x  2 root root    48 Feb 12 02:16 opt
> drwxr-xr-x  2 root root    48 Feb 12 02:16 preserve
> drwxr-xr-x 15 root root   696 Jun  9 05:12 run
> drwxr-xr-x 13 root root   328 Feb 12 02:16 spool
> drwxrwxrwt  2 root root    48 Jun  9 05:12 tmp
> drwxr-xr-x  3 root root    72 Nov 18  2003 tomcat4
> drwxr-xr-x  6 root root   144 Feb 12 08:12 www
> drwxr-xr-x  3 root root   128 May 17 11:11 yp
> Fri Jun  9 18:27:36 JST 2006
>
> Notice the time didn't change, but immediately after it printed the
> first date/time it then hung for 30 seconds before the 'ls' output was
> printed.
> Then I kept running the 'date' command and you can see what's
> happening :
>
[snip]
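
For anyone wanting to scan a longer "strace -tt" capture for the same
symptom, here is a minimal sketch in Python. It only looks for timestamps
that step backwards between consecutive lines, as in the trace quoted above.
The function name find_backward_jumps and the SAMPLE list are made up for
illustration, and the sketch assumes the trace does not cross midnight.

```python
import re
from datetime import datetime

# Leading HH:MM:SS.ffffff stamp that `strace -tt` puts on each syscall line.
STAMP = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{6})\s+(.*)$")


def find_backward_jumps(lines):
    """Return (previous_call, current_call, delta) for every backward step.

    `delta` is a timedelta saying how far the clock appeared to move back.
    Hypothetical helper; midnight wrap-around is not handled.
    """
    jumps = []
    prev_time = prev_text = None
    for raw in lines:
        # Drop mail quoting ("> > ") so pasted traces work as well.
        line = raw.lstrip("> ").rstrip()
        match = STAMP.match(line)
        if not match:
            continue  # wrapped continuation line or "..." elision
        now = datetime.strptime(match.group(1), "%H:%M:%S.%f")
        if prev_time is not None and now < prev_time:
            jumps.append((prev_text, match.group(2), prev_time - now))
        prev_time, prev_text = now, match.group(2)
    return jumps


# A few lines copied from the trace quoted earlier in this thread.
SAMPLE = [
    '> > 12:01:28.034543 getxattr("/var/spool", "system.posix_acl_access", 0x0,',
    '> > 0) = -1 EOPNOTSUPP (Operation not supported)',
    '> > 12:01:28.034679 lstat64("/var/tomcat4", {st_mode=S_IFDIR|0755,',
    '> > st_size=72, ...}) = 0',
    '> > 12:01:27.542319 getxattr("/var/tomcat4", "system.posix_acl_access", 0x0,',
    '> > 0) = -1 EOPNOTSUPP (Operation not supported)',
]

for before, after, delta in find_backward_jumps(SAMPLE):
    print("clock stepped back", delta, "between:")
    print("  ", before)
    print("  ", after)
```

This only flags where the clock went backwards. It cannot say how long the
call really took, since the kernel's own idea of time seems to be affected
during the hang.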

Yep, I had this last night while running yum: I started it at 11:45 pm,
and when I looked at my box this morning yum had stalled and the system
time was 0:45 am (the real time was 8:30 am). And ntpd is also running.

I'm suspecting this has something to do with the kernel, but not sure.
(running 2.6.16-1.2122_FC5). I've looked through all the logs, but it
doesn't mention a thing.
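
Since the logs are silent, a small probe run from cron might at least
record when the stalls happen. Below is a rough sketch in Python that times
an lstat on every entry of a directory and reports the slow ones. The name
time_lstat and the one-second threshold are arbitrary choices for
illustration. Note the big caveat: if the kernel clock itself stalls during
the hang, as the 'date' output above suggests, the measured durations may
come out too short, so comparing against an external time source would be
more trustworthy.

```python
import os
import time


def time_lstat(directory, threshold=1.0):
    """lstat every entry in `directory` and return the slow calls.

    Returns a list of (path, seconds) for calls that took at least
    `threshold` seconds according to the local monotonic clock.
    Hypothetical helper; the threshold default is an arbitrary guess.
    """
    slow = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        start = time.monotonic()
        try:
            os.lstat(path)
        except OSError:
            continue  # entry vanished between listdir() and lstat()
        elapsed = time.monotonic() - start
        if elapsed >= threshold:
            slow.append((path, elapsed))
    return slow


if __name__ == "__main__":
    for path, seconds in time_lstat("/var"):
        print("%s lstat(%s) took %.3f s" % (time.ctime(), path, seconds))
```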

I hope this gets solved soon, it has been quite irritating,
Bart