
Re: Strange RHEL4 U3 Behavior



I run 8 RHEL AS4 virtual machines on a pair of ESX 2.5.2 hosts, and I have had the exact problem you are describing, but only on certain VMs. All of my VMs are kickstart installed, so I doubt that it could be caused by different software versions. Someone suggested this might be a symptom of a "sendmail hack", but all of our machines have sendmail bound only to 127.0.0.1.

The two obvious symptoms are that Sendmail and Cron seem to die, and sometimes sshd as well. If left long enough, init appears to "fall off" too. Rebooting these zombie VMs requires logging in to Virtual Center and resetting them, because init no longer responds to any shutdown attempts.

I have tried running without the vmmemctl module, without vmware-tools, and without the vmxnet driver; none of these had any effect. It's pretty frustrating, since it's an intermittent problem that can take a week to reappear.

I haven't had any machines die on me since I upgraded them all to U3, so I'm disappointed to hear that you're still seeing it with U3.

I'd be interested in hearing any suggestions for tracking this problem down.


Tom Sightler wrote:
We recently had a strange event occur on one of our RHEL4 U3 systems and
I'm wondering if anyone out there has any suggestions on what might have
happened.  The system runs RHEL4 U3 on a VMware ESX 2.5.2 Patch 4 host,
which officially supports only RHEL4 U2.  We don't think this issue was
related to VMware, although we can't rule that out.

Anyway, this system runs Squid, Apache, DHCP, DNS, Sendmail, MySQL,
Samba, and a few other services.  Sendmail is configured as an email
gateway for inbound mail and performs virus and anti-spam filtering
using MailScanner.  We have used this system in this basic configuration
for several years, starting with RHEL3 and moving to RHEL4 around the
time U1 was released.

Now, on to the actual issue.  Yesterday, at almost exactly 10:30AM,
the system quit accepting inbound mail.  The sendmail service appeared
to be running, but an attempt to telnet to port 25 was met only with
"connection refused".  We checked the logs and could find nothing of
interest.  We eventually restarted sendmail and everything was fine.
This in itself was unusual; I can't ever remember sendmail stopping in
this way previously.
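
The manual telnet check described above could be scripted so that the
result is recorded the moment the port stops answering.  The following
is only a minimal sketch: the function name tcp_port_state, the default
host and the timeout are assumptions for illustration, and it is written
for a current Python 3 (the Python shipped with RHEL4 is much older and
would need adapting).

```python
import socket


def tcp_port_state(host="127.0.0.1", port=25, timeout=5.0):
    """Try one TCP connect and report what happened.

    Returns "open", "refused", "timeout", or "error: <reason>".
    "refused" matches the "connection refused" seen with telnet.
    """
    try:
        # create_connection resolves the host and performs the connect.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Nothing is accepting connections on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout.
        return "timeout"
    except OSError as exc:
        return "error: %s" % exc


if __name__ == "__main__":
    print("port 25 is", tcp_port_state())
```

Distinguishing "refused" from "timeout" may help narrow things down:
"refused" suggests the listening socket is gone, while "timeout" points
more toward a process that still holds the socket but has stopped
accepting.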

At first it looked like that was the only service affected.  On deeper
investigation, however, we found at least one more unusual issue.  The
system has the sysstat package installed, and we noticed that the last
stats gathered were at 10:20AM.  Normally cron runs the sa1 process
every 10 minutes, but that wasn't happening; in fact, no cron jobs were
running at all.  The crond service still appeared to be running (ps ax
showed it), but it was no longer processing tasks.  We eventually
restarted the cron service and things went back to normal.  We've found
no other services affected by the event.
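
Since the missing sa1 samples were the first sign that cron had stopped,
the age of the sysstat data file can serve as a rough cron heartbeat.
A minimal sketch follows, assuming the data file lives under /var/log/sa
(check your sysstat configuration) and that a run counts as missed after
two 10-minute intervals.  The function name looks_stalled and its
parameters are hypothetical, and the check has to be triggered by
something other than cron itself.

```python
import os
import time


def looks_stalled(path, interval=600, missed=2, now=None):
    """Return True if `path` has not been modified for more than
    `missed` intervals of `interval` seconds.

    `now` can be passed explicitly (seconds since the epoch) to make
    the check easy to test; by default the current time is used.
    """
    if now is None:
        now = time.time()
    age = now - os.path.getmtime(path)
    return age > interval * missed


if __name__ == "__main__":
    # Assumed location: sa1 normally writes one file per day of month.
    today = time.strftime("%d")
    sa_file = "/var/log/sa/sa" + today
    if os.path.exists(sa_file):
        print(sa_file, "stalled" if looks_stalled(sa_file) else "ok")
    else:
        print(sa_file, "not found")
```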

We spent significant time looking through logs on the system itself,
the ESX host, and the other virtual machines running on that host, and
nothing unusual shows up in that time frame.  Has anyone seen processes
simply "stop" running even though they still appear in the process list
and look normal?  What other information should I look for if the
problem happens again?  After the fact I realized that I should probably
have at least attempted to strace the hung processes.  Any other ideas,
suggestions, or similar experiences would be appreciated.
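
Before attaching strace, it may be worth recording what the kernel says
about each hung process.  The sketch below reads the process state
letter from /proc/<pid>/stat and the wait channel from /proc/<pid>/wchan
(where the kernel provides it).  It is an illustration only: the helper
names parse_stat_line and describe are made up, and it assumes a current
Python 3 rather than the interpreter shipped with RHEL4.

```python
import sys


def parse_stat_line(line):
    """Parse the start of a /proc/<pid>/stat line.

    Returns (pid, comm, state).  The command name is wrapped in
    parentheses and may itself contain spaces or parentheses, so the
    line is split on the LAST ')' rather than on whitespace.
    """
    head, _, tail = line.rpartition(")")
    pid_str, _, comm = head.partition("(")
    state = tail.split()[0]  # e.g. R, S, D, Z, T
    return int(pid_str), comm, state


def describe(pid):
    """Return a one-line summary for `pid`, or None if it is gone."""
    try:
        with open("/proc/%d/stat" % pid) as f:
            _, comm, state = parse_stat_line(f.read())
    except OSError:
        return None
    try:
        with open("/proc/%d/wchan" % pid) as f:
            wchan = f.read().strip() or "-"
    except OSError:
        wchan = "?"  # not available or not readable
    return "%d %s state=%s wchan=%s" % (pid, comm, state, wchan)


if __name__ == "__main__":
    for arg in sys.argv[1:]:
        print(describe(int(arg)) or "%s: no such process" % arg)
```

A process sitting in state D (uninterruptible sleep) with the same wait
channel every time it is sampled would suggest it is blocked inside the
kernel, which is different from a process that is sleeping normally in
state S.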

Later,
Tom




