On Fri, 2006-03-24 at 18:40 -0800, Jesse Weisner wrote:
> I haven't had any machines die on me since I upgraded them all to U3,
> so I'm disappointed to hear that you're still seeing it with U3.
> I'd be interested in hearing any suggestions for tracking this problem
> down.
Would you mind sharing the size of the memory you had allocated to your
virtual machines? We've been doing some additional testing on this
issue and have some interesting, though inconclusive, results. The
system in question has been running RHEL4 in various forms on ESX 2.5.x
for nearly 9 months without a single issue and was recently upgraded to
RHEL4 U3.
At the same time as the RHEL4 U3 upgrade we reduced the memory allocated
to the system from 2GB to 1.5GB. We did this because the system never
used more than about 850MB of RAM during its typical workload, so 2GB
seemed like overkill. However, a failure only 10 days after making this
change suggested that the reduction itself might be the problem.
Of course, the system had also just been upgraded to RHEL4 U3. However,
we had been running the RHEL4 U3 beta kernel on that system for 30+ days
without incident before this failure, so the memory change stood out.
As an experiment we lowered the memory allocation to 1GB of RAM. After
making this change we noticed that sendmail and various other services
would crash with Signal 11 (SIGSEGV) errors within only a few minutes of
startup. Even then, the system showed no signs of being under memory
pressure, using only 400-600MB of RAM and practically no swap when the
problem started occurring. This was 100% reproducible on another ESX
2.5.2 Patch 4 host. The same VM, running on an ESX3 beta host, did not
show this problem.
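For anyone who wants to log the same numbers while trying to reproduce
this, here is a minimal Python sketch of how "RAM in use" and "swap in
use" can be derived from /proc/meminfo inside the guest. It is not the
tooling we used. The function names are made up for illustration, and
it assumes that buffers and page cache should not count as used memory.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Name: value" pairs
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def memory_summary(fields):
    """Return approximate RAM and swap in use, in MB.

    Assumption: buffers and page cache are treated as reclaimable,
    so they are subtracted from the "used" figure.
    """
    used_kb = (fields["MemTotal"] - fields["MemFree"]
               - fields.get("Buffers", 0) - fields.get("Cached", 0))
    swap_used_kb = fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)
    return {"used_mb": used_kb // 1024, "swap_used_mb": swap_used_kb // 1024}
```

On the guest, reading /proc/meminfo and passing its contents through
parse_meminfo() and then memory_summary() at intervals would show
whether the crashes line up with any real change in memory or swap use.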
The vast majority (perhaps even all) of our RHEL4 guests have 2GB or
more, so I suspect that's why we are not seeing this problem more widely
at our facility, but we are working on building an isolated test case.
Thanks,
Tom