NetworkManager service is shut down too early

Sun Jun 1 15:02:43 UTC 2008

On Sun, 2008-06-01 at 09:14 -0400, Dan Williams wrote:
> On Fri, 2008-05-30 at 16:49 -0400, Alan Cox wrote:
> > On Fri, May 30, 2008 at 03:33:37PM -0400, Colin Walters wrote:
> > > DBus is not the same as any other random software because it is explicitly
> > > designed to provide reliable communication *between* components, much like
> > > the kernel.  If you restart it at random times that reliability guarantee is
> > > destroyed.
> > 
> > So the questions you should ask are
> > - Why does restarting dbus have to be unreliable
> 
> It's a communication pipe; restarting D-Bus itself is reliable because
> it's just like TCP.  It's the transport.  But making what gets
> _transported_ reliable is the kicker.

From what you say below, I think this statement needs to be corrected.
If it were a TCP-like transport, then a restart wouldn't cause any
problem, just as a restart of a router somewhere down the pipe does not
make my TCP connections drop; they are at most delayed (unless the
outage is so long that the connections time out).

If what you say later is accurate, I think the problem is therefore
that D-Bus is *not* a reliable transport: it seems to lack the acks and
store-and-forward facilities that would render most of this discussion
moot.

So far we can only consider D-Bus a sort of local UDP transport: if all
goes well, messages get to their destination, but delivery is not
guaranteed.

So the question is: why does D-Bus not support a fully reliable
transport mechanism? The client-side library should handle
store-and-forward and acks (and timeouts, of course). If this were
implemented, a simple "restart" of the daemon wouldn't cause any
problem, just a short delay in delivering messages. Maybe both an
unreliable and a reliable transport should be implemented, used for
informational and required communication respectively.
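
To make the store-and-forward idea concrete, here is a minimal sketch in
Python. It is a toy model under stated assumptions, not the D-Bus API:
FlakyBus and ReliableClient are hypothetical names, the "ack" is simply
the return value of send(), and the timeout is modelled as a maximum
number of delivery attempts rather than wall-clock time.

```python
import itertools
from collections import OrderedDict


class FlakyBus:
    """Toy stand-in for a bus daemon (hypothetical, not the D-Bus API)."""

    def __init__(self):
        self.up = True
        self.delivered = []

    def send(self, msg_id, payload):
        # Returning True plays the role of an ack; a daemon that is
        # down never acks.
        if not self.up:
            return False
        self.delivered.append((msg_id, payload))
        return True


class ReliableClient:
    """Client-side queue that stores messages until they are acked."""

    def __init__(self, bus, max_attempts=5):
        self.bus = bus
        self.max_attempts = max_attempts  # stands in for a timeout
        self._ids = itertools.count(1)
        self.pending = OrderedDict()      # msg_id -> [payload, attempts]
        self.failed = []                  # messages that "timed out"

    def send(self, payload):
        msg_id = next(self._ids)
        self.pending[msg_id] = [payload, 0]
        self.flush()
        return msg_id

    def flush(self):
        # Retry in order; stop at the first unacked message so that
        # ordering is preserved across a daemon restart.
        for msg_id in list(self.pending):
            entry = self.pending[msg_id]
            entry[1] += 1
            if self.bus.send(msg_id, entry[0]):
                del self.pending[msg_id]
            elif entry[1] >= self.max_attempts:
                self.failed.append((msg_id, entry[0]))
                del self.pending[msg_id]
            else:
                break
```

In this model a daemon restart only delays delivery: messages sent while
bus.up is False stay in pending and go out, in order, on the next
flush() after the daemon is back. Only a message that exhausts
max_attempts lands in failed, which is the point where the application
would have to decide what to do.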

> > - Why isn't there a recovery mechanism
> 
> The recovery mechanism would be in each service, because the service
> knows whether or not it needs recovery, and would know how to
> merge/synchronize its state with the services that it depends on.  Some
> don't need to.  But ones with state dependent on other D-Bus services
> would.

Yes, this is the key: "The recovery mechanism would be in each service".
I guess this should be provided by the D-Bus client library, so that
each application only needs to tweak a few parameters: whether a message
has to be reliably delivered, what the timeout is, what to do if the
timeout is reached ...

> > - Why does network manager have to do the work itself not the support code
> 
> Like above, because NM has specific state, and when D-Bus goes away,
> its communication channels with the daemons that affect that
> NM-specific state are gone, and NM can't make any assumptions about
> what's happening in any other daemon while the bus is gone.  Maybe your
> VPN just came up for rekeying, but the signal got lost because D-Bus
> isn't around.  So when the bus comes back, your VPN connection is
> already dropped.

With a reliable transport this would not be necessary. The only thing
you'd need to listen for would be a message that says "hey, VPN daemon
here, sorry, but I just started", which you would assume happens only if
the VPN daemon crashes and restarts; at that point you have to consider
how to deal with the situation (restart the VPN connection,
transparently if possible).
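
As a rough illustration of listening for that "I just started" message,
the sketch below tracks a per-service instance token. Everything here is
an assumption made for the example: PeerTracker, handle_hello and the
instance_id token are hypothetical and do not correspond to any existing
D-Bus or NetworkManager interface.

```python
class PeerTracker:
    """Detects peer restarts from "I just started" announcements."""

    def __init__(self, on_restart):
        self.on_restart = on_restart  # callback taking the service name
        self.instances = {}           # service name -> last seen token

    def handle_hello(self, service, instance_id):
        # instance_id is assumed to be a token the daemon generates
        # once per start; a changed token means the daemon restarted.
        previous = self.instances.get(service)
        self.instances[service] = instance_id
        if previous is not None and previous != instance_id:
            self.on_restart(service)
            return True
        return False
```

The first announcement from a service and any repeat with the same token
are ignored; only a new token triggers on_restart, which is where the
listener would re-establish its state (for instance, bring the VPN
connection back up).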

> Or DHCP re-bound while the bus was down, and your sysadmin changed DNS
> servers on you, and the signal from dhclient got lost (because the bus
> was down).  Unless you re-do the entire DHCP transaction (or teach
> dhclient about dbus properly so it can answer questions without having
> to exec() stupid scripts that then re-emit state back over D-Bus) NM
> would have no idea that the returned DHCP options had changed.  And thus
> your DNS is dead.
> 
> > And more fundamentally
> > 
> > Why the ... are people still writing software which doesn't try and tolerate
> > faults that are recoverable to a useful extent.  Yes dbus might have to lose
> > a few messages and send everyone a "duh whoops" event so they can recover but
> > "oh dear it broke everyone reboot" is not good engineering.
> 
> In some cases, it's a cost/benefit analysis.  Is the cost of writing and
> maintaining a pile of code that handles a D-Bus restart, which shouldn't
> ever happen, worth the benefit?  In some cases, definitely.  In other
> cases, probably not.  That isn't an excuse to write crappy software, but
> it's certainly not as simple of a problem as you present it.

The more central the system is, the more it needs to be fault tolerant,
because it is a dependency for many services.

What was the cost/benefit analysis in this case?
Was it really made, or was it based on subjective evaluations?
Which cases were taken into consideration?
Given that some people are thinking of using NM by default on servers
as well, this issue becomes more critical. Servers do serve clients;
true, in most cases they will use a permanent address, not DHCP. But in
many cases it would be preferable to use permanent IPs still delivered
via DHCP (so that IP changes, even if rare, can still be managed
centrally), or the server might be a VPN concentrator. It would be
extremely bad to lose all connections just because NM had to restart
and did not know how to handle the restart with as little impact as
possible on other services.

Simo.

-- 
Simo Sorce * Red Hat, Inc * New York