Problem with auditd/SnareLinux on RHEL 5.3 - auditd glomming memory

Thu Feb 5 16:32:51 UTC 2009

On Wednesday 04 February 2009 09:14:03 pm Lucas C. Villa Real wrote:
> 2009/2/4 Smith, Gary R <gary.smith at pnl.gov>:
> I noticed a very similar behavior when the system was under high
> stress (i.e., having many rules and many remote clients generating audit
> events). After much debugging, it was found that the asynchronous
> nature of netlink made it possible for auditd's queue to grow wildly,
> until the kernel started to kill other processes due to OOM (auditd
> asks the kernel not to be killed under OOM conditions, so every
> process but auditd is shot).

Yes, I think auditd is being blamed for memory consumed on its behalf inside
the kernel. I have run valgrind against the audit daemon many times and I
know of no resource leaks. If the kernel queue is a problem, the only knob
you really have to turn is the priority of the audit daemon: increase it so
the daemon gets more run time.

> The reason was that audit's consumer thread -- the one that runs
> auditd-event.c:event_thread_main() -- was consuming events slower than
> the rate in which netlink events were sent from the kernel to auditd's
> main thread.

This is because of some requirements in CC evals about knowing how many events
are in flight. The input queue depth is simply 1. If you have the audit event
dispatcher running, it gets first shot at handling the event. Then the event
goes to disk. If you have synchronous logging, then the write blocks for a
while. So, changing to buffered IO might be better for throughput.
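
To illustrate the synchronous versus buffered point, here is a minimal Python
sketch. It is not auditd code: write_records, the record format, and the
record count are all made up for the example. It writes the same records
twice, once forcing each record to disk with fsync and once leaving the
flushing to the OS, and times both.

```python
import os
import tempfile
import time


def write_records(path, records, sync_each):
    """Write records to path; optionally force each one to disk.

    Hypothetical helper for illustration only, not part of auditd.
    Returns the elapsed wall-clock time in seconds.
    """
    start = time.perf_counter()
    with open(path, "wb") as f:
        for rec in records:
            f.write(rec)
            if sync_each:
                # Synchronous logging: the writer blocks until the
                # record has reached stable storage.
                f.flush()
                os.fsync(f.fileno())
    return time.perf_counter() - start


# Made-up records that only loosely resemble audit log lines.
records = [b"type=SYSCALL msg=audit(%d): example\n" % i for i in range(200)]

with tempfile.TemporaryDirectory() as d:
    sync_path = os.path.join(d, "sync.log")
    buf_path = os.path.join(d, "buffered.log")
    t_sync = write_records(sync_path, records, sync_each=True)
    t_buf = write_records(buf_path, records, sync_each=False)
    with open(sync_path, "rb") as a, open(buf_path, "rb") as b:
        same_content = a.read() == b.read()

print("sync: %.4fs  buffered: %.4fs  identical: %s"
      % (t_sync, t_buf, same_content))
```

Both files end up with the same content; only the time spent blocked in the
write path differs, and by how much depends on the disk and filesystem. The
trade-off is that buffered records can be lost if the machine goes down
before they are flushed.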

> The solution we found (and which is still being tested) was to define
> a high water mark on how many events to allow auditd to have in its
> input queue. Given that each netlink message takes about 9 KB, one can
> set the high water mark to, e.g., 500000 to have at most 4.5 GB of events in
> RAM. So, when auditd reaches that high water mark, we ask the kernel
> to slow down: all further events sent by the kernel have a "need an
> ack" flag included so that the caller process (the one that generated
> the system call that had to be audited) gets blocked until a reply is
> sent from the daemon.

Originally, I think David Woodhouse patched the kernel so that callers were put
on a wait queue when we hit the end of the internal queue. I think someone
later removed it, thinking that it makes the system unresponsive.
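
For reference, the high water mark idea from the quoted mail can be sketched
in a few lines of Python. This is only a model under stated assumptions: the
InputQueue class and its method names are hypothetical, the 9 KB figure is
taken as 9000 bytes, and the real change would live in auditd and the kernel,
not in a script.

```python
from collections import deque

# Figures from the quoted mail; "about 9kb" is assumed to mean 9000 bytes.
EVENT_SIZE_BYTES = 9 * 1000
HIGH_WATER_MARK = 500_000


def worst_case_bytes(high_water, event_size=EVENT_SIZE_BYTES):
    """Upper bound on RAM held by queued events at the high water mark."""
    return high_water * event_size


class InputQueue:
    """Hypothetical model of a daemon input queue with a high water mark."""

    def __init__(self, high_water):
        self.high_water = high_water
        self.events = deque()

    def enqueue(self, event):
        """Queue an event.

        Returns True once the queue has reached the high water mark,
        meaning further events should carry a "need an ack" flag so the
        sender blocks until the daemon replies.
        """
        self.events.append(event)
        return len(self.events) >= self.high_water

    def dequeue(self):
        """Remove and return the oldest queued event."""
        return self.events.popleft()


print(worst_case_bytes(HIGH_WATER_MARK) / 1e9, "GB worst case")

q = InputQueue(high_water=3)
flags = [q.enqueue(n) for n in range(4)]   # throttling starts at 3 queued
for _ in range(3):
    q.dequeue()                            # the consumer catches up
flag_after_drain = q.enqueue(99)           # back below the mark
```

With 500000 events of 9000 bytes each the bound comes out to 4.5 GB, which
matches the number in the quoted mail.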

> Please let me know if that happens to be the reason for the problems
> you're having. I've been working mostly with audit 1.7.4 and kernel
> 2.6.16.16+patches, so our changes still need to be ported to a recent
> kernel and audit package before they're submitted officially (that's
> likely to happen in March, after my master's thesis final deadline --
> which is driving me crazy).

Note that the event model changed inside the audit daemon around 1.7.5. It's
very different from before. In the near future, I am planning to pull the
code from audispd into auditd and use the queue code from audispd so that
input and output of auditd can really become multi-threaded. I'm thinking
this would allow better dequeuing of kernel events.
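
As a rough picture of what separating input from output with a queue looks
like, here is a small Python sketch. It is not the audispd queue code: the
reader and writer functions, the queue size of 64, and the event strings are
all assumptions for illustration.

```python
import queue
import threading


def reader(source, q):
    """Input side: take events from the source and queue them."""
    for event in source:
        q.put(event)      # blocks while the queue is full
    q.put(None)           # sentinel: no more events


def writer(q, sink):
    """Output side: drain the queue and record each event."""
    while True:
        event = q.get()
        if event is None:
            break
        sink.append(event)


# Stand-ins for kernel events and the on-disk log.
events = ["event-%d" % i for i in range(1000)]
written = []

q = queue.Queue(maxsize=64)   # bounded, so memory use stays capped
t_in = threading.Thread(target=reader, args=(events, q))
t_out = threading.Thread(target=writer, args=(q, written))
t_in.start()
t_out.start()
t_in.join()
t_out.join()

print(len(written), "events written in order:", written == events)
```

The input thread keeps pulling events while the output thread is busy
writing, up to the queue bound; once the queue is full the input side blocks
instead of growing without limit.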

-Steve