[Linux-cluster] NTP time steps cause cluster reconfiguration

Kaloyan Kovachev kkovachev at varna.net
Fri Jul 16 16:27:09 UTC 2010


On Fri, 16 Jul 2010 16:11:38 +0100, "Martin Waite"
<Martin.Waite at datacash.com> wrote:
> Hi,
> 
> NTP has a step-threshold - if the time difference is greater than the
> threshold, it will step the time rather than speeding it up or down.  So
> even using ntpd can cause clock steps (especially in our test
> environment where our crappy overloaded NTP servers sometimes lose 30
> seconds).

That's why I have set up all nodes as peers to each other - they will try
to synchronize with each other, and since the delay between the hosts is
minimal, no big changes in their clocks should be possible except on reboot
(before cman is started). One or two additional time servers (the nearest
to you from pool.ntp.org), but different for each host, would keep the
cluster's time offset from drifting too far from the real world.
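For illustration, a minimal /etc/ntp.conf along those lines might look like
the sketch below. The node names are placeholders, and each host should
list its own, different upstream servers:

```
# Sketch of /etc/ntp.conf for one cluster node -- hostnames are placeholders.
driftfile /var/lib/ntp/drift

# Peer with the other cluster nodes: they slew toward each other, and
# with minimal LAN delay the mutual offset stays small.
peer node2.cluster.local iburst
peer node3.cluster.local iburst

# One or two upstream servers, different on each node, to keep the
# cluster's common time close to the real world.
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
```

Starting ntpd with -g then lets the one allowed clock jump happen at boot
(provided ntpd starts before cman), after which the daemon only slews.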

> 
> On some VMware test hosts, I did manage to make the cluster fence some
> nodes through changing the time backwards and forwards, but I could not
> reproduce the effect on physical hosts.  I was hoping that the
> fencing was caused by a combination of clock changes and VM guest timing
> flakiness, but from your description, it sounds like this might be a
> real risk on physical servers too.
> 
> I had better do some more testing.
> 
> Thanks for the input.
> 
> regards,
> Martin
> 
>> -----Original Message-----
>> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-bounces at redhat.com]
>> On Behalf Of Kaloyan Kovachev
>> Sent: 16 July 2010 15:36
>> To: linux clustering
>> Subject: Re: [Linux-cluster] NTP time steps cause cluster reconfiguration
>> 
>> Hi,
>>  I can confirm that time steps do cause reconfiguration. Not sure if this
>> was the reason, but one of my nodes used to be fenced from time to time
>> after several reconfigurations, and it also caused some problems with
>> gfs being withdrawn.
>>  ntpdate running as a cron job does step changes, but ntpd should not
>> cause step changes; instead it should speed up or slow down the clock
>> until it is synchronized. However, using the -g option you may allow the
>> clock to jump once at the start of ntpd.
>>  I have configured all cluster nodes to synchronize from each other via
>> ntpd (configured as peers), and each from one (different) additional
>> (stratum 1 or 2) source as a server. Since then I don't see
>> reconfigurations in the logs.
>> 
>> On Fri, 16 Jul 2010 14:18:22 +0100, "Martin Waite"
>> <Martin.Waite at datacash.com> wrote:
>> > Hi,
>> >
>> >
>> >
>> > During testing, I noticed that a time step caused by ntpd caused the
>> > cluster to drop into GATHER state:
>> >
>> >
>> >
>> > Jun 16 12:13:16 cp1edidbm001 ntpd[30917]: time reset -16.332117 s
>> >
>> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] entering GATHER
>> > state from 12.
>> >
>> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] Creating commit
>> > token because I am the rep.
>> >
>> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] Saving state aru 9e high seq received 9e
>> >
>> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] Storing new
>> > sequence id for ring 328
>> >
>> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] entering COMMIT
>> > state.
>> >
>> > Jun 16 12:13:26 cp1edidbm001 openais[15929]: [TOTEM] entering RECOVERY state.
>> >
>> > ...
>> >
>> >
>> >
>> > This is easily repeatable through setting the clock forwards by 20
>> > seconds using /bin/date.  This probably causes comms timeouts to expire
>> > prematurely, and almost every time causes the cluster to reconfigure -
>> > luckily without affecting running services.
>> >
>> >
>> >
>> > Stepping the clock backwards also causes a similar disruption, but there
>> > is a long lag between changing the time and the cluster reconfiguring:
>> > perhaps this extends a timeout or sleep on the affected node, causing
>> > genuine timeouts on the other nodes.
>> >
>> >
>> >
>> > All I am looking for is some reassurance that clock changes are not
>> > going to crash the cluster.  Is anyone able to confirm this, please?
>> >
>> >
>> >
>> > regards,
>> >
>> > Martin
>> 
>> --
>> Linux-cluster mailing list
>> Linux-cluster at redhat.com
>> https://www.redhat.com/mailman/listinfo/linux-cluster
> 



