
Re: hwclock can cause system lockup



Mikkel L. Ellertson wrote, On 10/16/2008 05:23 PM:
On Thu, Oct 16, 2008 at 8:19 AM, Chris Mocock
<chris1 noreply googlemail com> wrote:
I've got an hourly cron job on some machines that runs "/sbin/hwclock --utc
--hctosys" to sync the hardware clock to the system clock. These machines
were recently upgraded from an old custom RH7.3 installation to a custom
spin of Fedora 9 and started to occasionally lock up - every day or three.
I noticed that they would always lock up at 1 minute past the hour so
tracked it down to this cron job.

What are you trying to do with this cron job? You are updating the
system clock from the hardware clock, and not the other way around,
as you say you are trying to do. The system does synchronize the
hardware clock to the system clock on shutdown.

Not if you are sane enough to disable that in the halt script.
(search this or the fedora-test list for ntp and me to see why I say this)
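
For reference, the two sync directions being discussed map onto two different hwclock flags. The snippet below is only a small sketch to keep them straight; `hwclock_args` and its two direction names are hypothetical, made up for illustration, while `--hctosys` and `--systohc` are the real hwclock options.

```shell
#!/bin/sh
# Sketch only: hwclock_args is a hypothetical helper, not part of hwclock.
# It prints the hwclock flags for a given sync direction.
hwclock_args() {
    case "$1" in
        rtc-to-system) echo "--utc --hctosys" ;;  # hardware clock -> system clock
        system-to-rtc) echo "--utc --systohc" ;;  # system clock -> hardware clock
        *) echo "usage: hwclock_args rtc-to-system|system-to-rtc" >&2
           return 1 ;;
    esac
}

# Example (needs root):
#   /sbin/hwclock $(hwclock_args system-to-rtc)
```

The cron job in the original message uses the `rtc-to-system` direction, whatever its description says.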

Also, /etc/adjtime
contains the information needed to correct for hardware clock drift
that happens while the system is shut down. The offset information
is re-computed every time you set the hardware clock.

If you don't have an ntp source, the quartz time of the hardware clock, corrected via /etc/adjtime (with `hwclock --adjust` after appropriate disciplining), can be SIGNIFICANTLY better than system time. So if you pull the hardware clock into your system at a reasonable periodicity, the system will have a better time for use.

It could even make a box that is reasonable enough, not great but reasonable, to serve local time over ntp so the rest of the computers in a disconnected lab stay synced. [Been there Done that. On an RH7.3 machine of all things :]
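
To give a feel for the drift arithmetic, here is a rough sketch, assuming the usual /etc/adjtime layout where the first line starts with the drift factor in seconds per day followed by the time of the last adjustment in seconds since the epoch. `expected_drift` is a hypothetical helper name, and it ignores the third field on that line, so treat the result as an estimate rather than exactly what `hwclock --adjust` would apply.

```shell
#!/bin/sh
# Sketch only: expected_drift is a hypothetical helper.
# Usage: expected_drift ADJTIME_FILE NOW_EPOCH
# Prints the correction (in seconds) accumulated since the last adjustment:
#   drift_factor * (now - last_adjust_time) / 86400
expected_drift() {
    awk -v now="$2" '
        NR == 1 {
            # $1 = drift factor (s/day), $2 = last adjustment time (epoch s)
            printf "%.3f\n", $1 * (now - $2) / 86400
            exit
        }' "$1"
}

# Example:
#   expected_drift /etc/adjtime "$(date +%s)"
```

With a drift factor of 2 seconds per day and half a day since the last adjustment, this prints 1.000.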


I decided to test it on my standard Fedora 9 installation by running a
simple script that runs the hwclock command every 3 seconds. Sure enough,
the system ran for just over an hour before it locked up. By locked up I
mean no activity, caps lock key doesn't work, can't ping, can't ssh in, but
power is still on.
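
The original test script was not posted, so the following is only a guess at what such a loop might look like. `stress_hwclock` is a hypothetical name; the one addition over a bare loop is that every run is logged and followed by `sync`, on the assumption that the last line reaching the disk then shows roughly when the box died.

```shell
#!/bin/sh
# Sketch only: stress_hwclock is a hypothetical helper.
# Usage: stress_hwclock COUNT INTERVAL_SECONDS [COMMAND...]
# Runs COMMAND (default: the hwclock call from the cron job) COUNT times,
# printing a timestamped line before each run.
stress_hwclock() {
    count=$1
    interval=$2
    shift 2
    [ $# -gt 0 ] || set -- /sbin/hwclock --utc --hctosys
    i=1
    while [ "$i" -le "$count" ]; do
        echo "$(date '+%F %T') run $i: $*"
        "$@" || echo "run $i: exit status $?"
        sync                      # push the log to disk before the next call
        sleep "$interval"
        i=$((i + 1))
    done
}

# Example (needs root; the log path is just a placeholder):
#   stress_hwclock 2000 3 >> /root/hwclock-test.log
```

At a 3 second interval, 2000 runs cover a bit more than an hour and a half, which is past the point where the lockup was reported.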

I never looked into how the hardware clock is accessed. I wonder
if they are using the BIOS to access the clock, and the BIOS code
has a problem. From what I have read, it is also not a good idea to
change the system clock too rapidly, especially if you are adjusting
it backwards - but I do not know if that would cause lockups.


IIRC adjusting it backwards could LOOK like a lockup, but that should only last 2 to 4 seconds (in the case of running the command every 3 seconds).

I _think_ that by default hwclock uses /dev/rtc, which is a kernel abstraction of the real clock... something in that abstraction may be breaking down. There may even be an Oops or panic, but if the machine is running in runlevel 5 at the time you will not see it.

I would suggest two things:
1) see if punching the call rate up to 0.5 Hz or 1 Hz, instead of the current ~0.3 Hz, triggers it
2) boot into runlevel 3 and run the script again to see if you get the error within a few hours, hopefully this time with an Oops or panic message.

If either of these again ends up at 'just over an hour before it locked up', it might be some interaction with another cron job... Did you disable the hourly cron job first? If not, I would run your 3 second script alongside a 2 minute cron job and see if it may be a '2 accesses at the same time' problem.
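
One way to probe the '2 accesses at the same time' idea from the other side is to force every hwclock call through a single lock and see whether the lockups stop. This is only a sketch: `with_lock` is a hypothetical helper built on the atomicity of `mkdir`, and the lock path in the example is a placeholder. It leaves a stale lock behind if the command is killed, which is acceptable for a short experiment but not for production use.

```shell
#!/bin/sh
# Sketch only: with_lock is a hypothetical helper.
# Usage: with_lock LOCKDIR COMMAND...
# Waits until LOCKDIR can be created (mkdir is atomic), runs COMMAND,
# removes LOCKDIR, and returns COMMAND's exit status.
with_lock() {
    lockdir=$1
    shift
    until mkdir "$lockdir" 2>/dev/null; do
        sleep 1                   # someone else holds the lock, wait
    done
    "$@"
    status=$?
    rmdir "$lockdir"
    return "$status"
}

# Example, used in both the test loop and the cron job (needs root):
#   with_lock /var/lock/hwclock-test.lock /sbin/hwclock --utc --hctosys
```

If the machine survives with both callers going through the lock, but still dies without it, that points towards concurrent access rather than the call rate.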

race conditions in time, oh what fun.
--
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane)
Harnessing the Power of Technology for the Warfighter

