Bizarro system lockup/hang, possible kernel issue.

Naoki naoki at valuecommerce.com
Fri Jun 9 10:12:13 UTC 2006


On Fri, 2006-06-09 at 14:40 +0900, Naoki wrote:
> Hi all,
> 
> We have a range of FC5 boxes (about 4 with the problem), and for roughly
> the last three or four weeks we've been seeing intermittent lockups/hangs
> which often end in a power-cycle. The boxes are currently running
> 2.6.16-1.2122_FC5 or 2.6.16-1.2122_FC5smp and, despite slightly different
> hardware, all display the same issue. The filesystem is ReiserFS.
> 
> Disk operations will stall, commands like 'free' will (might) return
> quickly, but 'uptime' could take 5 minutes to complete. The system is
> essentially unusable.
> 
> Fork rate and number of processes seem to skyrocket; however, I'm not
> sure whether this is just monitoring going strange because of the
> underlying problem.
> 
> I've finally managed to capture the issue in more detail and was hoping
> somebody had a clue. Here I run "strace -tt ls /var":
> 
> ...
> 12:01:28.034014 read(4, "root:x:0:root\nbin:x:1:root,bin,d"..., 131072)
> = 679
> 12:01:28.034163 close(4)                = 0
> 12:01:28.034267 munmap(0xb7cda000, 131072) = 0
> 12:01:28.034383 lstat64("/var/spool", {st_mode=S_IFDIR|0755,
> st_size=328, ...}) = 0
> 12:01:28.034543 getxattr("/var/spool", "system.posix_acl_access", 0x0,
> 0) = -1 EOPNOTSUPP (Operation not supported)
> 12:01:28.034679 lstat64("/var/tomcat4", {st_mode=S_IFDIR|0755,
> st_size=72, ...}) = 0
> 12:01:27.542319 getxattr("/var/tomcat4", "system.posix_acl_access", 0x0,
> 0) = -1 EOPNOTSUPP (Operation not supported)
> 12:01:27.542577 lstat64("/var/net-snmp", {st_mode=S_IFDIR|0700,
> st_size=80, ...}) = 0
> 12:01:27.542847 getxattr("/var/net-snmp", "system.posix_acl_access", 
> ...
> 
> You can see that after the lstat64 on /var/tomcat4 the timestamp jumps
> back; in actuality that syscall took about 40 seconds to complete.
> 
> I have done this a couple of times on different areas of the disk and
> the results are the same, lstat64 is hanging for extremely long periods.
> 
> Another side effect is that the system time becomes skewed during the
> hang on an lstat (other calls probably do this too, but I've not been
> able to trace enough).
> 
> On one box I've installed 2.6.16-1.2129_FC5 from FC5 testing to see if
> that helps, on another I've reverted all the way back to
> kernel-2.6.16-1.2108_FC4.i686.rpm.
> 
> I've run the smartctl utility to check the disk is ok and that has
> passed on all servers.  I'll wait and see the results of my kernel
> updates/regressions.
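
A minimal sketch of how the per-entry lstat stall could be measured without strace, assuming a Python 3 interpreter is available on the box. The function name time_lstat and the 1-second threshold are made up for illustration. It uses the monotonic clock rather than wall-clock time because the wall clock is reported above to skew during a hang; whether the monotonic clock stays trustworthy under this particular fault is an open assumption.

```python
import os
import time


def time_lstat(directory, threshold=1.0):
    """Return (path, seconds) for every entry whose lstat took >= threshold."""
    slow = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        start = time.monotonic()  # not affected by wall-clock steps
        try:
            os.lstat(path)
        except OSError:
            continue  # entry vanished or is unreadable; skip it
        elapsed = time.monotonic() - start
        if elapsed >= threshold:
            slow.append((path, elapsed))
    return slow


if __name__ == "__main__":
    # /var is the directory used in the strace run above.
    for path, elapsed in time_lstat("/var"):
        print(f"{elapsed:8.3f}s  {path}")
```

On a healthy system this should print nothing; any line printed names an entry whose lstat stalled for at least the threshold.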

This has now happened to another server. It is not one of the boxes
mentioned above with replaced kernels, but it is also running
2.6.16-1.2122_FC5.

# date; ls -l /var ; date
Fri Jun  9 18:27:36 JST 2006
total 3
drwxr-xr-x 10 root root   264 May 17 11:08 cache
drwxr-xr-x  3 root root    72 Feb 12 02:16 db
drwxr-xr-x  3 root root    72 Feb 12 02:16 empty
drwxr-xr-x  7 vcp  vcp    200 Dec 15 12:12 jsp
drwxr-xr-x 17 root root   480 May 17 11:20 lib
drwxr-xr-x  2 root root    48 Feb 12 02:16 local
drwxrwxr-x  6 root lock   144 Jun  9 05:12 lock
drwxr-xr-x  9 root root  1896 Jun  4 05:24 log
lrwxrwxrwx  1 root root    10 May 17 10:55 mail -> spool/mail
drwxr-x---  4 root named   96 Apr 19 23:12 named
drwx------  2 root root    80 May 27 10:10 net-snmp
drwxr-xr-x  2 root root    48 Feb 12 02:16 nis
drwxr-xr-x  2 root root    48 Feb 12 02:16 opt
drwxr-xr-x  2 root root    48 Feb 12 02:16 preserve
drwxr-xr-x 15 root root   696 Jun  9 05:12 run
drwxr-xr-x 13 root root   328 Feb 12 02:16 spool
drwxrwxrwt  2 root root    48 Jun  9 05:12 tmp
drwxr-xr-x  3 root root    72 Nov 18  2003 tomcat4
drwxr-xr-x  6 root root   144 Feb 12 08:12 www
drwxr-xr-x  3 root root   128 May 17 11:11 yp
Fri Jun  9 18:27:36 JST 2006

Notice that the two timestamps are identical, yet immediately after
printing the first date/time the command hung for 30 seconds before the
'ls' output was printed.
Then I kept running the 'date' command and you can see what's
happening:

[root@banner8 ~]# date
Fri Jun  9 18:27:37 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:38 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:39 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:36 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:36 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:37 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:36 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:37 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:39 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:36 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:37 JST 2006
[root@banner8 ~]# date
Fri Jun  9 18:27:38 JST 2006

Anybody seen anything like _that_ before? It is running ntpd.
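
To log the backward jumps instead of eyeballing repeated 'date' runs, something like the following sketch could be left running. It assumes Python 3; the names find_backward_steps and sample_wall_clock, the sample count, and the half-second interval are all hypothetical choices, not part of any existing tool.

```python
import time


def find_backward_steps(samples):
    """Return (index, previous, current) for each sample earlier than its predecessor."""
    steps = []
    for i in range(1, len(samples)):
        if samples[i] < samples[i - 1]:
            steps.append((i, samples[i - 1], samples[i]))
    return steps


def sample_wall_clock(count=20, interval=0.5):
    """Read the wall clock `count` times, sleeping `interval` seconds between reads."""
    samples = []
    for _ in range(count):
        samples.append(time.time())
        time.sleep(interval)
    return samples


if __name__ == "__main__":
    for i, prev, cur in find_backward_steps(sample_wall_clock()):
        print(f"sample {i}: clock went back {prev - cur:.3f}s")
```

Fed the seconds fields from the twelve 'date' runs above (37, 38, 39, 36, 36, 37, 36, 37, 39, 36, 37, 38), find_backward_steps reports three backward steps: 39 to 36, 37 to 36, and 39 to 36 again.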



