We recently had a strange event occur on one of our RHEL4 U3 systems and
I'm wondering if anyone out there has any suggestions on what might have
happened. The system runs as a guest on a VMware ESX 2.5.2 Patch 4
host, which officially supports only RHEL4 U2. We don't think this
issue was related to VMware, although we can't rule that out.
Anyway, this system runs Squid, Apache, DHCP, DNS, Sendmail, MySQL,
Samba, and a few other services. Sendmail is configured as an email
gateway for inbound mail and performs virus and anti-spam filtering
using MailScanner. We have used this system in this basic configuration
for several years, starting with RHEL3 and moving to RHEL4 around the
time U1 was released.
Now, on to the actual issue. Yesterday, at almost exactly 10:30AM,
the system quit accepting inbound mail. The sendmail service appeared
to be running, but an attempt to telnet to port 25 was met only with
"connection refused". We checked the logs and could find nothing of
interest. We eventually restarted sendmail and everything was fine.
This in itself was unusual; I can't remember sendmail ever stopping in
this way before.
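For anyone wanting to repeat the port 25 test without telnet, here is a
rough sketch of the same check. It assumes a modern Python 3 interpreter
is available (the stock Python on RHEL4 is much older), and the function
name, default host, and timeout are my own choices for illustration.

```python
import socket

def smtp_port_status(host="127.0.0.1", port=25, timeout=5.0):
    """Classify a TCP connection attempt the way a manual telnet test would."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as conn:
            conn.settimeout(timeout)
            try:
                # A healthy SMTP listener normally sends a greeting first.
                banner = conn.recv(512)
            except socket.timeout:
                return "connected, no banner"
            return "banner" if banner else "connected, closed without banner"
    except ConnectionRefusedError:
        # Usually means nothing is listening on the port at all.
        return "refused"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return "error: %s" % exc

if __name__ == "__main__":
    print(smtp_port_status())
```

A "refused" result while the daemon still shows up in ps would match
what we saw: the process exists but nothing is listening on the socket.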
At first it looked like that was the only service affected. On deeper
investigation, however, we found at least one more unusual issue. The
system has the sysstat package installed and we noticed that the last
stats gathered were at 10:20AM. Normally, cron runs the sa1 process
every 10 minutes, but this wasn't happening. In fact, no cron jobs were
running at all, even though the crond service still appeared to be
running (ps ax showed it). It was just no longer processing tasks. We
eventually restarted the cron service and things went back to normal.
We've found no other services affected by the event.
We spent significant time looking through logs on the system itself,
on the ESX host, and on other virtual machines running on that host,
and nothing unusual seems to show up in that time frame. Has anyone
seen processes simply "stop" running even though they continue to appear
in the process list and look normal? What other information should I
look for if the problem should happen again? After the fact I realized
that I should have probably at least attempted to strace the hung
processes. Any other ideas or suggestions or any similar experiences
would be appreciated.
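In case it helps frame suggestions, this is the kind of thing I have in
mind capturing before restarting a stuck service next time, in addition
to strace. It is only a sketch, again assuming Python 3 on a Linux 2.6
or later kernel; the function name and the proc_root parameter are
hypothetical. It reads the process state and kernel wait channel from
/proc/<pid>/status and /proc/<pid>/wchan.

```python
import os

def proc_snapshot(pid, proc_root="/proc"):
    """Return name, scheduler state, and kernel wait channel for one PID."""
    base = os.path.join(proc_root, str(pid))
    info = {"pid": pid, "name": None, "state": None, "wchan": None}
    with open(os.path.join(base, "status")) as fh:
        for line in fh:
            key, _, value = line.partition(":")
            if key == "Name":
                info["name"] = value.strip()
            elif key == "State":
                # e.g. "S (sleeping)" or "D (disk sleep)"
                info["state"] = value.strip()
    try:
        with open(os.path.join(base, "wchan")) as fh:
            info["wchan"] = fh.read().strip() or None
    except OSError:
        # wchan may be missing or unreadable; leave it as None.
        pass
    return info

if __name__ == "__main__":
    print(proc_snapshot(os.getpid()))
```

A process sitting in state D, or with the same wait channel on every
sample, would at least point at where it is blocked.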
Later,
Tom