RHELv4 and v5 - So slow as to be unusable.

Gary E Barnes gebarnes at us.ibm.com
Thu Oct 7 19:17:01 UTC 2010


> From: Laszlo Beres <laszlo at beres.me>
> Subject: Re: RHELv4 and v5 - So slow as to be unusable.
> 
> On Wed, Oct 6, 2010 at 9:22 PM, Gary E Barnes <gebarnes at us.ibm.com> 
wrote:
> 
> > "top" says that nothing is going on although the load average is 3+.
> > "sar" also says that nothing is going on.
> 
> There's no such thing "nothing is going on". You should see CPU
> status, process status, etc. vmstat also can give you some hints about
> the system health.

Oh but there is such a thing.  I have one of the machines in this weird 
slowdown state right at this moment.  It started around 4:45pm yesterday, 
after running perfectly for about 3 hours 15 minutes, and I left it 
overnight to see if maybe it would get "over it" by itself.  Hasn't 
happened though.

Here is the very first header from the "top" display of a top I started 
just for this example.

top - 19:18:20 up  4:33,  4 users,  load average: 3.56, 3.58, 3.54
Tasks: 159 total,  16 running, 143 sleeping,   0 stopped,   0 zombie
Cpu(s):  1.3% us,  0.4% sy,  2.9% ni, 95.1% id,  0.3% wa,  0.0% hi,  0.0% 
si
Mem:   2586400k total,  1880032k used,   706368k free,   193036k buffers
Swap:  4192956k total,        0k used,  4192956k free,  1220324k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  
 5813 root      16   0  164m  31m 6124 R  0.2  1.3  25:05.54 X  
 6650 geb       16   0  4048 2116 1332 S  0.2  0.1   0:00.86 xalarm  
27229 geb2      16   0  3008  960  696 R  0.2  0.0   0:00.01 top  
    1 root      16   0  2724  512  436 S  0.0  0.0   0:00.68 init  

And here is the first refresh of that display (I'm capturing this in an 
Emacs buffer if you're curious).

Tasks: 161 total,   3 running, 158 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.9% us,  0.5% sy,  0.1% ni, 98.5% id,  0.0% wa,  0.0% hi,  0.0% 
si
Mem:   2586400k total,  1885680k used,   700720k free,   193036k buffers
Swap:  4192956k total,        0k used,  4192956k free,  1220584k cached

Here is the second, notice the 100% idle value.

Tasks: 161 total,   2 running, 159 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0% us,  0.0% sy,  0.0% ni, 100.0% id,  0.0% wa,  0.0% hi,  0.0% 
si
Mem:   2586400k total,  1885696k used,   700704k free,   193044k buffers
Swap:  4192956k total,        0k used,  4192956k free,  1220576k cached

There is memory available.  There is swap available.
Idle occasionally drops to 99.9%.

Tasks: 161 total,   2 running, 159 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0% us,  0.1% sy,  0.0% ni, 99.8% id,  0.0% wa,  0.0% hi,  0.0% 
si
Mem:   2586400k total,  1885464k used,   700936k free,   193076k buffers
Swap:  4192956k total,        0k used,  4192956k free,  1220544k cached

The processes that show up in the first line or two of top are things such 
as:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND  
27231 geb       16   0 28316  11m 8412 S  9.8  0.5   0:00.41 
gnome-terminal 
 5813 root      16   0  165m  31m 6316 S  3.8  1.3  25:05.70 X  
 6697 geb       16   0 22280  10m 7644 S  1.9  0.4   0:18.69 wnck-applet   
 
 3817 rpc       15   0  2336  592  484 S  0.2  0.0   0:01.60 portmap  
 6472 geb       16   0  3544 1472  876 S  0.2  0.1   0:02.08 gam_server   
 6650 geb       16   0  4048 2116 1332 S  0.2  0.1   0:00.87 xalarm  
 3805 root      16   0  2492  312  220 S  0.2  0.0   0:00.55 irqbalance   
 4867 root      16   0  2800  844  624 D  0.2  0.0   0:03.32 rpc.mountd   
    1 root      16   0  2724  512  436 S  0.0  0.0   0:00.68 init  
 6451 geb       16   0 12764 7512 1688 S  0.1  0.3   0:01.66 gconfd-2  
27229 geb2      16   0  3012 1044  772 R  0.1  0.0   0:00.05 top  
27229 geb2      16   0  3012 1044  772 R  0.5  0.0   0:00.07 top  
 4996 root      16   0  4772 3104 1536 S  0.2  0.1   0:02.47 hald  
 6472 geb       16   0  3544 1472  876 S  0.2  0.1   0:02.09 gam_server   
    1 root      16   0  2724  512  436 S  0.0  0.0   0:00.68 init  
27229 geb2      16   0  3012 1044  772 R  0.5  0.0   0:00.09 top  
23574 geb       16   0  145m  68m  26m S  0.2  2.7   2:44.86 firefox-bin   
 
    1 root      16   0  2724  512  436 S  0.0  0.0   0:00.68 init  

As you can see, there is essentially "nothing going on".

An yet the machine is very unresponsive.  If I run a command that hasn't 
been run in a while (don't know the time frame, but it seems to be only 
minutes) then the command takes >30 seconds to execute.  For example, I 
just did the "date" command and when it finally responded I did the 
hwclock command.  Both took >30 seconds to run.  Now if I repeat those 
commands they execute immediately.  I'm presuming that this is due to 
executable file caching in the operating system.  If I wait a while then 
the >30 second wait will reappear for those same commands.  Presumably 
they've left that cache.

This behavior is observable both in xterm's on the console and also 
through ssh connections from another machine. 

Programs that are already loaded and running seem to be pretty much ok, at 
least until they need to go read some new file or write some new file, 
then they hang for a while and eventually get going again.

If I run sar (sysstat package) I get essentially the same picture.  From a 
"sar -A 30 4" here are the averages for two minutes.  Load average 3+ and 
>99% idle.  Nearly no I/O of any sort; not 0 but very low amounts for two 
minutes.

Average:       proc/s
Average:         0.12

Average:      cswch/s
Average:        48.90

Average:          CPU     %user     %nice   %system   %iowait     %idle
Average:          all      0.29      0.01      0.12      0.07     99.51
Average:            0      0.25      0.01      0.13      0.00     99.60
Average:            1      0.33      0.01      0.11      0.13     99.42

Average:         INTR    intr/s
Average:          sum     11.87

Average:     pgpgin/s pgpgout/s   fault/s  majflt/s
Average:         0.01      8.55     34.76      0.00

Average:     pswpin/s pswpout/s
Average:         0.00      0.00

Average:          tps      rtps      wtps   bread/s   bwrtn/s
Average:         1.35      0.00      1.35      0.02     17.11

Average:      frmpg/s   bufpg/s   campg/s
Average:        -0.73      0.02     -0.02

Average:     CPU  i000/s  i001/s  i008/s  i009/s  i012/s  i014/s  i015/s 
i177/s  i185/s  i193/s  i201/s  i209/s
Average:       0    0.02    0.00    0.00    0.00    0.00    0.00    0.09  
7.76    0.00    0.00    0.00    0.00
Average:       1    0.01    2.65    0.00    0.00    0.00    1.35    0.00  
0.00    0.00    0.00    0.00    0.00

Average:        IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s 
txcmp/s  rxmcst/s
Average:           lo      0.00      0.00      0.12      0.12      0.00  
0.00      0.00
Average:         eth0      3.35      2.25    364.83    196.51      0.00  
0.00      0.00
Average:         sit0      0.00      0.00      0.00      0.00      0.00  
0.00      0.00

Average:        IFACE   rxerr/s   txerr/s    coll/s  rxdrop/s  txdrop/s 
txcarr/s  rxfram/s  rxfifo/s  txfifo/s
Average:           lo      0.00      0.00      0.00      0.00      0.00  
0.00      0.00      0.00      0.00
Average:         eth0      0.00      0.00      0.00      0.00      0.00  
0.00      0.00      0.00      0.00
Average:         sit0      0.00      0.00      0.00      0.00      0.00  
0.00      0.00      0.00      0.00

Average:       DEV              tps    rd_sec/s  wr_sec/s
Average:       dev1-0           0.00      0.00      0.00
Average:       dev1-1           0.00      0.00      0.00
Average:       dev1-2           0.00      0.00      0.00
Average:       dev1-3           0.00      0.00      0.00
Average:       dev1-4           0.00      0.00      0.00
Average:       dev1-5           0.00      0.00      0.00
Average:       dev1-6           0.00      0.00      0.00
Average:       dev1-7           0.00      0.00      0.00
Average:       dev1-8           0.00      0.00      0.00
Average:       dev1-9           0.00      0.00      0.00
Average:       dev1-10          0.00      0.00      0.00
Average:       dev1-11          0.00      0.00      0.00
Average:       dev1-12          0.00      0.00      0.00
Average:       dev1-13          0.00      0.00      0.00
Average:       dev1-14          0.00      0.00      0.00
Average:       dev1-15          0.00      0.00      0.00
Average:       dev3-0           1.35      0.02     17.11
Average:       dev3-1           0.00      0.00      0.00
Average:       dev3-2           0.13      0.02      3.35
Average:       dev3-3           0.00      0.00      0.00
Average:       dev3-4           0.00      0.00      0.00
Average:       dev3-5           1.22      0.01     13.75
Average:       dev22-64         0.00      0.00      0.00
Average:       dev22-65         0.00      0.00      0.00
Average:       dev22-0          0.00      0.00      0.00
Average:       dev2-0           0.00      0.00      0.00
Average:       dev9-0           0.00      0.00      0.00

Average:    kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree 
kbswpused  %swpused  kbswpcad
Average:       696608   1889792     73.07    193202   1220938   4192956   
0      0.00         0

Average:    dentunusd   file-sz  inode-sz  super-sz %super-sz  dquot-sz 
%dquot-sz  rtsig-sz %rtsig-sz
Average:       216815      3285    185128         0      0.00         0  
0.00         0      0.00

Average:       totsck    tcpsck    udpsck    rawsck   ip-frag
Average:          343        56         8         0         0

Average:      runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
Average:           14       186      3.35      3.61      3.59

The machine entered this state at about 4:45pm yesterday afternoon.  It is 
now 12:00 noon the next day.
The "date" command says that the system thinks that the time is 7:26PM 
yesterday.
In the last 47 minutes the system clock has gained only 6 minutes.  A rate 
of somewhere around 7.8.
Another interesting little symptom, when this slowdown is in effect the 
keyboard autorepeat on keys stops working.

If this was the only machine doing this I'd think it was a hardware 
problem.  But (a) it isn't the only machine and (b) while it seems to 
always happen to these machines, it is only after running for at least a 
few hours without problems.

        Gary



More information about the redhat-list mailing list