
Re: Strange RHEL4 U3 Behavior



Out of 8 slices, 6 have 1024MB RAM and 2 have 2048. I don't think the
two machines that have 2048 have crashed, so it's a good theory.

I'm going to increase a particularly "crashy" slice to 2048 each and see
how it goes. Have you isolated anything that can make it "crash early,
crash often"?

Out of curiosity, have you made any adjustments to the memory resource
settings in the VMs? I notice that our VMware admins set the machines
with 2GB RAM to have a 2GB "minimum" and a higher than normal "shares"
setting. The 1GB RAM machines all have "normal" shares and no minimum.

Do you suppose VMware could be stealing too much RAM back through
vmmemctl ballooning? I think you'd start seeing the dreaded OOM killer
if that was the case.
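
One quick way to rule that out is to search the guest's syslog for OOM
killer messages. The following is only a rough sketch: the function
names are made up for illustration, the match patterns are assumptions
(the exact kernel wording differs between versions, so the match is
deliberately loose), and /var/log/messages is just the usual RHEL
default.

```python
import re

# Assumed patterns: kernel OOM messages vary by version, so match loosely.
OOM_PATTERN = re.compile(r"oom-killer|out of memory", re.IGNORECASE)


def find_oom_lines(lines):
    """Return the log lines that look like OOM killer activity."""
    return [line.rstrip("\n") for line in lines if OOM_PATTERN.search(line)]


def scan_log(path="/var/log/messages"):
    """Scan a syslog file; the default path is the usual RHEL location."""
    with open(path, errors="replace") as handle:
        return find_oom_lines(handle)
```

Calling scan_log() as root and getting an empty list back would suggest
the OOM killer never fired, at least within the current log file.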

Tom Sightler wrote:
On Fri, 2006-03-24 at 18:40 -0800, Jesse Weisner wrote:
I haven't had any machines die on me since I upgraded them all to U3,
so I'm disappointed to hear that you're still seeing it with U3.

I'd be interested in hearing any suggestions for tracking this problem
down.

Would you mind sharing the size of the memory you had allocated to your
virtual machines?  We've been doing some additional testing on this
issue and have some interesting, though inconclusive, results.  The
system in question has been running RHEL4 in various forms on ESX 2.5.x
for nearly 9 months without a single issue and was recently upgraded to
RHEL4 U3.

At the same time as the RHEL4 U3 upgrade we reduced the memory allocated
to the system from 2GB to 1.5GB.  We did this because the system never
used more than about 850MB of RAM during its typical workload, so 2GB
seemed like overkill.  However, we had a failure only 10 days after
making this change, which suggested that the reduction might be the
problem.

Of course, the system had also just been upgraded to RHEL4 U3.  However,
we had been running the RHEL4 U3 beta kernel on that system for 30+ days
without incident before this failure, so the memory change stood out.

As an experiment we lowered the memory allocation to 1GB of RAM.  After
making this change we noticed that sendmail and various other services
would start crashing with Signal 11 errors within only a few minutes of
startup.  Even then, the system showed no signs of being under memory
pressure, using only 400-600MB of RAM and practically no swap when the
problem started occurring.  This was 100% reproducible on another ESX
2.5.2 Patch 4 host.  The same VM, running on an ESX3 beta host did not
show this problem.
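
For anyone wanting to compare numbers, the memory and swap figures
above can be read from /proc/meminfo inside the guest. This is a
minimal sketch, not the method used for the figures quoted here: the
function names are hypothetical, and counting "used" as MemTotal minus
MemFree, Buffers and Cached is an assumption about what should count
as used.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_summary(values):
    """Summarise used RAM and used swap in MB.

    Assumption: buffers and page cache are treated as reclaimable,
    so they are not counted as used memory.
    """
    used_kb = (values["MemTotal"] - values["MemFree"]
               - values["Buffers"] - values["Cached"])
    swap_kb = values["SwapTotal"] - values["SwapFree"]
    return {"used_mb": used_kb // 1024, "swap_used_mb": swap_kb // 1024}
```

Passing the contents of /proc/meminfo to parse_meminfo() and the
result to memory_summary() gives a rough snapshot to log while waiting
for the next crash.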

The vast majority (perhaps even all) of our RHEL4 guests are 2GB or more,
so I suspect that's why we are not seeing this problem more widely
at our facility, but we are working on building an isolated test case.

Thanks,
Tom




