[RFC PATCH] audit-testsuite: improve our chances of losing records in lost_reset

Fri Dec 14 15:53:43 UTC 2018

On Thu, Dec 13, 2018 at 8:59 PM Richard Guy Briggs <rgb at redhat.com> wrote:
> On 2018-12-13 18:23, Paul Moore wrote:
> > On Thu, Dec 13, 2018 at 6:17 PM Paul Moore <paul at paul-moore.com> wrote:
> > > If the point of the lost_reset test is to flood the system with audit
> > > records, why are we restricting ourselves with a filter?  Let's log
> > > everything.
> > >
> > >  tests/lost_reset/test |    4 ++--
> >
> > This resolves a problem I've been seeing with a fairly stubborn
> > system.  I would be very curious to hear if any others were
> > experiencing problems with the lost_reset that this patch resolves.
>
> This saddistic test would have brought the system to its knees before
> the queue rework.  This still works fine on my system, so this is progress.

Yes, I'll admit it is a bit extreme.  The good news is that the system
survives multiple runs of this without falling over (and remains
responsive), so that's a win :)

> I suspect that something weird was going on with your test
> such that it wasn't getting the right PID.  I don't know how this is
> possible, but I have seen this same thing recently on the FUSE umount2
> test where I needed to bump up the PID by one to catch the right task.

The wrong PID idea struck me as a bit curious, and since I have one
system where I can reproduce this reliably I thought I would play
around a bit more and see what I could find.  The results sent me down
a bit of a rathole but the results are interesting ...

On my test system the PID of the flood ping was being recorded
correctly, every time I ran it with some instrumentation added - and I
ran it a quite a few times.  While you may have seen this happen in
practice, but that doesn't seem to be the case on this particular
system.

As an aside, have you spent any time trying to debug that wrong PID
problem?  While not strictly audit related, that seems like a pretty
serious Bash bug.  Or maybe it's a problem with the test?  I vaguely
remember a discussion between you and Ondrej and some confusion around
which Bash variable to use to fetch PIDs, but I may be mistaken.

Moving on from the PID idea I decided to dig a bit deeper into when
the kernel registers a lost record, and I think I found the core
issue.  The kernel logs lost records, and bumps the counter, via the
audit_log_lost() function.  The audit_log_lost() function is called
from a number of places (most uninteresting for this discussion: error
handlers and the like), but I think the callers we are most interested
in are the following:

 * audit_log_start()
 * kauditd_hold_skb()

Let's start with audit_log_start().  It calls audit_log_lost() in two
places: one place is allocation error handling code (not relevant) and
the other is if we are bumping into the backlog limit (relevant, and
interesting).  In the latter case audit_log_lost() is called and
audit_log_start() returns NULL, effectively dropping the record on the
floor, lost forever.

Moving on to kauditd_hold_skb(), things are a bit more complicated.
As a reminder, kauditd_hold_skb() is the "point of last resort" for
audit records; if we are here it means there is no auditd connected.
The first thing kauditd_hold_skb() does it write the record to the
console since this may be the only way the record ever sees userspace.
Beyond that if "audit=1" was not specified on the kernel command line
we drop the record, if "audit=1" was specified on the command line we
try to queue the record but if the queue is full we call
audit_log_lost().  Ultimately I believe it was this logic, and a race
between the kauditd_thread() and the record generation, which caused
my system to never hit the backlog limit in audit_log_start() which
would have caused it to call audit_log_lost().  On my problematic
system, booting with "audit=1" caused the existing lost_reset test to
pass without problem.

Make sense?

This brings up to the next step: how do we want to address this?

Prior to the queue rework that started in v4.10 things were a bit
simpler and it looks like we always registered a lost record
independent of the "audit=?" setting on the kernel command line and
the current queue backlog.  While this would have made this test
easier, it could result in some over counting problems in the cases
where an auditd instance came along and read the "lost" records from
the queue.  I don't think reverting to this behavior is ideal.

I'm also not certain that recording lost records in the *not*
"audit=1" case is a good solution either.  In the case where the
system is not running an audit daemon they are almost guaranteed to
hit the backlog at some point and then their system will start spewing
scary looking audit log lost messages; we would surely win lots of
friends this way.

We could move the "audit=1" check (really it's an audit_default check)
into audit_log_lost() and use it to squelch the printk() and then call
audit_log_lost in kauditd_hold_skb() if we can't queue the record.
This should preserve an accurate lost record count (well, until it
rolls over, but that's not a new concern), prevent unnecessary scary
lost record messages, and ensure a consistent audit_log_lost()
operation (all the other callers I didn't mention in this mail).

Or the simplest option is to just ignore this and require that the
audit-testsuite be run on a system booted with "audit=1" :)

I'm currently leaning towards moving the "audit=1" check into
audit_log_lost(), what do you guys think?

-- 
paul moore
www.paul-moore.com