
Re: [olpc-software] graceful handling of out-of-memory conditions



Alan Cox wrote:
> I still see it differently. If your code does not check a malloc return then
> it is broken. Since low memory accesses on some systems might allow patching
> of code, it must also be considered unfit to ship for security reasons. Most
> modern code gets this right, in part because the old BSD approach of "it's
> hard, so let's not bother" has been replaced by rigour in all the camps,
> notably OpenBSD. In addition, tools both free (sparse) and non-free (e.g.
> Coverity) can systematically identify missing NULL checks.

I don't really believe most code that attempts to handle OOM actually _works_. The reason is that I wrote dbus to handle OOM, and then later added comprehensive test coverage by having the test suite run every code path over and over, failing the first malloc on the first run, the second malloc on the second run, etc. That way the handling of a NULL return gets tested for every single malloc.
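
For what it's worth, the failure-injection trick looks roughly like this (a minimal sketch with made-up names, not the actual dbus test harness): wrap malloc, count allocations, and on the Nth run fail exactly the Nth allocation.

/* Fail-the-Nth-allocation harness (hypothetical, not the dbus code).
 * The code under test allocates through test_malloc(); each run fails
 * one specific allocation, so every NULL-return path gets exercised. */
#include <stdlib.h>

static int alloc_count = 0;   /* allocations seen so far in this run  */
static int fail_at = -1;      /* which allocation to fail; -1 = never */

void *
test_malloc (size_t size)
{
  if (++alloc_count == fail_at)
    return NULL;              /* simulate OOM for exactly this malloc */
  return malloc (size);
}

/* Run the code path once per allocation it performs, failing
 * allocation 1 on run 1, allocation 2 on run 2, and so on. */
void
run_with_failures (void (*code_path) (void))
{
  int n;
  for (n = 1; ; n++)
    {
      alloc_count = 0;
      fail_at = n;
      code_path ();           /* must survive the injected failure */
      if (alloc_count < n)    /* the path has fewer than n allocations, */
        break;                /* so every one has now been failed once  */
    }
}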

When I added the test suite, I found that something like 5-10% of the NULL malloc returns were handled incorrectly. In many cases the bug was quite complex to fix, because what you have to do is make everything "transactional", which (depending on the code) can be arbitrarily complicated. You also have to add return values and failure codes in lots of places that might not have had them before, which can modify a public API pretty heavily. Once you add the complex "transactional" code, it then never gets tested (unless you have a test suite like the one I wrote for dbus).
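
To make "transactional" concrete, here's the shape of the pattern in a tiny sketch (the struct and function are hypothetical): if any allocation in the operation fails, everything allocated so far has to be rolled back and a failure code returned, so the caller sees either complete success or no change at all.

/* Transactional OOM handling (hypothetical example): either both
 * allocations succeed, or the partial one is rolled back and the
 * caller's struct is left untouched. */
#include <stdlib.h>
#include <string.h>

typedef struct {
  char *name;
  char *path;
} Entry;

/* Returns 0 on success, -1 on OOM; on failure *entry is not modified. */
int
entry_init (Entry *entry, const char *name, const char *path)
{
  char *n = strdup (name);
  if (n == NULL)
    return -1;

  char *p = strdup (path);
  if (p == NULL)
    {
      free (n);               /* roll back the allocation that succeeded */
      return -1;
    }

  entry->name = n;
  entry->path = p;
  return 0;
}

With two allocations that's easy; with a dozen allocations spread across helper functions, plus data structures that were already half-modified, it gets arbitrarily hairy, which is the point.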

Making something sane happen on OOM is a lot more work than just adding "if (!ptr)" checks.

If we assume that most apps are half as complicated as dbus, and most programmers are twice as smart as I am, you're still talking about 2-3% of theoretically-handled malloc failures not being handled properly. And in most OOM situations you'd probably get multiple malloc failures, so the odds of something breaking get pretty high. It's just not gonna be reliable.

Another thing to keep in mind is that I think handling OOM probably adds 10-20% code size overhead to dbus. It's a lot of extra code... which you pay for when writing it, maintaining it, and running it.

You also have to think about what an app does on OOM ... for dbus it returns an error code for the current operation, then goes back and sits in the main loop, keeps returning error codes for any operations that don't have enough memory... if it can't even get enough memory to return an error, then I believe it just sleeps for a bit and tries again. For most gui apps "go back to the main loop and sleep a little while" is about the best they'll be able to do. Only rarely (for e.g. a large malloc when opening an image file) does it make sense to display a malloc failure as an error dialog.
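
In sketch form, that recovery behavior looks something like this (hypothetical names, not the real dbus code): fail the current operation with an error reply, and if even the reply can't be allocated, back off and retry instead of exiting.

/* Sketch of "return an error, go back to the main loop, retry if the
 * error itself can't be allocated" (hypothetical, not the dbus code). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

typedef struct { char *text; } Reply;   /* stand-in for a real message type */

Reply *
reply_new (const char *text)
{
  Reply *r = malloc (sizeof *r);
  if (r == NULL)
    return NULL;
  r->text = strdup (text);
  if (r->text == NULL)
    {
      free (r);
      return NULL;
    }
  return r;
}

/* Fail the current operation with an error reply; if even the reply
 * can't be allocated, sleep a bit and retry rather than giving up. */
void
fail_current_operation (void)
{
  Reply *reply;

  while ((reply = reply_new ("out of memory")) == NULL)
    sleep (1);                       /* wait for some memory to free up */

  printf ("error reply: %s\n", reply->text);   /* stand-in for sending it */
  free (reply->text);
  free (reply);
  /* ...then return to the main loop and keep handling other requests. */
}

int
main (void)
{
  fail_current_operation ();
  return 0;
}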

Given all this, just having malloc() block and always succeed is tempting, with the main problem being large mallocs like the opening-an-image-file example... glib has g_try_malloc() to distinguish that case, since the normal glib behavior is to exit on OOM.
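
For the record, that split in glib looks like this (the function around it is my own made-up example): g_malloc() aborts the process when allocation fails, while g_try_malloc() returns NULL so the one big, user-triggered allocation can be handled gracefully.

/* g_malloc() aborts on failure; g_try_malloc() returns NULL instead.
 * Hypothetical example of using the latter for an image buffer. */
#include <glib.h>

gboolean
alloc_image_pixels (gsize width, gsize height, guchar **pixels_out)
{
  /* A huge allocation that can plausibly fail even when the system is
   * otherwise healthy, so handle it instead of aborting. */
  guchar *pixels = g_try_malloc (width * height * 4);   /* RGBA */
  if (pixels == NULL)
    return FALSE;   /* caller can say "not enough memory to open this image" */

  *pixels_out = pixels;
  return TRUE;
}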

Another complexity that applies to a normal Linux system but perhaps not to OLPC is that with e.g. the default Fedora swap configuration, the system is unusably slow and thoroughly locked up long before malloc fails. It's awfully tempting to push the power switch when the "you are out of memory" dialog starts taking 30 minutes to come up, instead of waiting patiently to press the button on said dialog.

Havoc


