[olpc-software] graceful handling of out-of-memory conditions

Daniel P. Berrange berrange at redhat.com
Tue Mar 28 01:56:38 UTC 2006


On Mon, Mar 27, 2006 at 08:03:51PM -0500, Jim Gettys wrote:
> On Mon, 2006-03-27 at 19:30 -0500, David Zeuthen wrote:
> > On Mon, 2006-03-27 at 11:19 -0500, Jim Gettys wrote:
> > >  When I say "system components", I use a
> > > wider net than just the Linux kernel, but include the X server, Window
> > > manager, session manager, but not much else.  I'd like the base
> > > environment to be rock solid.
> > 
> > Hmm, this sounds a bit like the '90s; clearly you're forgetting
> > 
> >  - D-BUS
> >  - HAL
> >  - NetworkManager
> >  - Avahi
> >  - CUPS
> 
> Yeah, you're right; except I'm an '80's dinosaur ;-).  Though CUPS can
> probably survive being restarted pretty easily and I wouldn't put it in
> that category.
> 
> > 
> > just to name a few crucial "system" components of the Linux desktop du
> > jour. I'm not sure why you think the Window or Session Manager are so
> > important either, sure, we don't want to lose them, but, really X
> > applications can survive when these temporarily go away - it just
> > doesn't look very pretty.
> > 
> 
> Now it's your turn to be caught out on a limb and have someone saw it
> off ;-).
> 
> The window manager is the item that knows what applications are most
> likely to be used by the user, and will likely be key to decent OOM
> behavior.  It knows what's on top, what's iconified, what is covered.
> It is the process most likely to be telling the OS what processes have
> to be killed in extremis.  Consider it an absolutely essential
> component.

Sure the window manager knows what's on top, iconified, etc, but I'm far
from convinced that this data can be used to do 'decent' OOM handling.
The principal barrier is that the state of an application's graphical
windows tells the window/session manager *nothing* about the operation
or architecture of the application. If the GUI is just a shim calling out
to a D-Bus or ORBit service for all its work, then killing the GUI upon OOM
is just fine, because the GUI can trivially restart & reconnect to the backend
service where all the data is. In the modern desktop any non-trivial program
makes significant use of IPC to any number of processes about which the
session manager has no information.

While you may be able to whitelist some subset of IPC related system and user
daemons, there'd be enough not whitelisted that incorrect decisions would be
made. Then what if one of the whitelisted daemons *was* the process
consuming all the memory? Alternatively, what if the currently focused app
is the problem?
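
To make those failure cases concrete, here is a rough sketch of the kind of
handler I mean. Everything in it is made up for illustration: the Proc record,
the WHITELIST contents, the KILL_RANK ordering and the process names are
assumptions, not anything a real window manager exposes.

```python
from dataclasses import dataclass

@dataclass
class Proc:
    name: str
    rss_mb: int          # resident memory in MB
    window_state: str    # "focused", "visible", "iconified" or "none"

# Hypothetical whitelist of daemons the handler must never kill.
WHITELIST = {"dbus-daemon", "hald", "NetworkManager"}

# Lower rank is killed first. "none" means the WM knows of no window at all.
KILL_RANK = {"iconified": 0, "none": 1, "visible": 2, "focused": 3}

def pick_victim(procs):
    """Pick the non-whitelisted process the WM considers least important."""
    candidates = [p for p in procs if p.name not in WHITELIST]
    if not candidates:
        return None
    # Least important window state first; ties go to the biggest memory user.
    return min(candidates, key=lambda p: (KILL_RANK[p.window_state], -p.rss_mb))

procs = [
    Proc("hald", 400, "none"),            # whitelisted, but the actual memory hog
    Proc("editor-gui", 20, "iconified"),  # small, but holds unsaved user data
    Proc("browser", 60, "focused"),
]
print(pick_victim(procs).name)
```

With this input the handler kills the iconified editor, which frees almost
nothing and may cost the user their data, while the whitelisted daemon that is
actually eating the memory carries on untouched.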

So I think while you could write an OOM handler based on the info available
to the window/session manager, I rather doubt it would be any better at
picking which process to kill off under OOM situations than the kernel is.
Basically OOM handling is a fundamentally hard problem, and it's inevitable
that no matter what algorithm you choose for picking processes, you'll eventually
choose one that is 'important' to the user.

So while we could put lots of research into figuring out an optimal OOM handling
solution, I think we'd be better off picking a simple algorithm, and then focusing
effort on modifying applications such that in the event they are killed off, no
user data is lost. Such modifications would be useful beyond OOM handling, e.g.
after a SEGV crash a user wouldn't lose data. Or it would enable a window manager
to proactively shut down apps before an OOM situation is even encountered.
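
As a minimal sketch of what "no user data is lost" could mean in practice, an
application can checkpoint its state so that being killed at any instant leaves
either the old or the new copy on disk, never a half-written one. The function
names, the JSON format and the file layout below are my own assumptions for
illustration; the technique is just the usual write-to-temp-then-rename.

```python
import json
import os
import tempfile

def save_state(path, state):
    """Atomically replace the file at `path` with a JSON dump of `state`."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename is atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".state-")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())   # push the data to disk before renaming
        os.replace(tmp, path)      # atomic swap: old copy or new copy, no mix
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise

def load_state(path, default):
    """Return the last checkpoint, or `default` if none was ever written."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return default

# Usage: checkpoint on every significant change, restore on startup.
workdir = tempfile.mkdtemp()
checkpoint = os.path.join(workdir, "session.json")
save_state(checkpoint, {"document": "draft text", "cursor": 10})
restored = load_state(checkpoint, default={})
```

An app that does this on every significant change can be killed by the kernel,
by a SEGV or by the window manager, and simply pick up from the last checkpoint
when it is restarted.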

Basically we have to recognise that we have limited resources & need to choose
the approach that gives the biggest benefit from the user's POV. Perfecting
the specific case of OOM handling IMHO has far less benefit than perfecting
session recovery. It's kinda like the difference between vertical & horizontal
server scalability - you could engineer a single server to deal with absolutely
every failure eventuality, but it'll still fail, or you can make a set of 'n'
servers 'good enough' & ensure that when failure does occur, recovery is trivial.

Regards,
Dan.
-- 
|=- Red Hat, Engineering, Emerging Technologies, Boston.  +1 978 392 2496 -=|
|=-           Perl modules: http://search.cpan.org/~danberr/              -=|
|=-               Projects: http://freshmeat.net/~danielpb/               -=|
|=-  GnuPG: 7D3B9505   F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505  -=| 



