[Cluster-devel] [GFS2 PATCH] gfs2: Panic when an io error occurs writing to the journal

Bob Peterson rpeterso at redhat.com
Mon Dec 17 14:58:47 UTC 2018


Hi,

----- Original Message -----
> On 17/12/2018 09:04, Edwin Török wrote:
> >> If we get an io error writing to the journal, the only correct
> >> thing to do is to kernel panic.
> > Hi,
> >
> > That may be required for correctness; however, are we sure there is no
> > other way to force the DLM recovery (or can another mechanism be
> > introduced)?
> > Consider that there might be multiple GFS2 filesystems mounted from
> > different iSCSI backends; just because one of them encountered an I/O
> > error, the others may still be perfectly able to continue.
> > (Also, the host might have other filesystems mounted, local or NFS,
> > and it might still be able to perform I/O on those, so bringing the
> > whole host down would be best avoided.)
> >
> > Best regards,
> > --Edwin
> 
> Indeed. I think the issue here is that we need to ensure that the other
> cluster nodes understand what has happened. At the moment the mechanism
> for that is that the node is fenced, so panicking, while it is not ideal,
> does at least mean that will definitely happen.
> 
> I agree, though, that we want something better longer term.
> 
> Steve.

The important thing is to guarantee that the journal is replayed by
another node (not the node that hit the IO error writing to its journal)
before any other node is allowed to acquire any of the locks held by the
failed node. Before this patch, I tried two other approaches:

(1) The first made GFS2 perform journal recovery on a different node
    whenever a withdraw is done. This is a bit tricky, since it needs
    to communicate which journal needs replaying (or alternatively, try to
    acquire and replay them all), and it needs to happen before DLM can
    hand the locks to another node. I tried to figure out a good way to
    hook this into DLM's or lock_dlm's recovery path, but I couldn't find
    an acceptable way to do it. In the DLM case, the recovery is all driven
    from the top (user space / dlm_controld / corosync / etc.) down, and
    I couldn't find a good place to do this without getting DLM out of
    sync with its user-space counterparts.

    So I created new functions as part of lock_dlm's recovery path
    (bits that were formerly in user space, as part of gfs_controld).
    I used lvbs to communicate the IDs of all journals needing recovery,
    and since DLM only updates lvb information on convert operations,
    I needed to demote / promote a universally known lock to do it
    (I used gfs2's "Live" glock for this purpose; see the sketch after
    this list for the general shape of that lvb convert cycle).

    Doing all these demotes and promotes is complicated, and Andreas did
    not like it at all, but I couldn't think of a better way. I could code
    it so that the node attempts recovery on all journals and simply fails
    its "try locks" for the journals that are still in use, but that would
    result in a lot of dmesg noise, and possibly in several nodes replaying
    the same journal one after another (depending on the timing of the
    locks). On top of that, all this recovery risks starving corosync of
    CPU, which can lead to nodes being fenced.

    Given my discussions with Dave Teigland (upstream dlm maintainer), we
    may still want (or need) this for all GFS2 withdraw situations.

(2) The second patch detected the journal IO error and simply refused
    to inform DLM of any locks it released after the IO error occurred.
    That accomplished the job, but predictably, it caused the glocks to
    get out of sync with the dlm locks, which eventually resulted in a
    BUG() and a kernel panic anyway.

    I suppose we could add special exceptions so it doesn't panic when
    the file system is withdrawn. The approach also left the other nodes
    hanging as soon as they tried to acquire the rgrp glocks needed for
    their IO, and they stayed hung until the failed node was fenced and
    rebooted and journal recovery was done.

    We might also be able to handle this by setting some kind of status
    before the node tries to release the dlm locks, to avoid the BUG(),
    but the withdrawing node wouldn't be able to unmount (unless we
    kludged it even more to free a locked glock or something).
    Anything we do here is bound to be an ugly hack.

    I suppose a node working exclusively in other file systems
    wouldn't hang, and maybe that's better behavior. Or maybe not.
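
For reference, the general shape of the lvb convert cycle mentioned in (1)
is sketched below. Only dlm_lock(), struct dlm_lksb and the DLM_LKF_* /
DLM_LOCK_* constants are the real kernel DLM interface; the lockspace
handle, the lock name, the 32-byte lvb length, and the helper names are
made up for illustration (the actual code drives this through the dlm
lock behind gfs2's "Live" glock rather than a standalone helper like this):

/*
 * Sketch only, not the actual patch: publish a bitmap of journals that
 * need recovery through the lvb of a well-known dlm lock by doing a
 * synchronous convert cycle.  The lock is assumed to already be held in
 * NL from an earlier dlm_lock() request (so lksb.sb_lkid is valid), the
 * completion is assumed to have been set up with init_completion() at
 * mount time, and the lockspace is assumed to have lvblen >= 32.
 */
#include <linux/dlm.h>
#include <linux/completion.h>
#include <linux/string.h>

#define RECOVER_LVB_LEN 32              /* assumed lvblen of the lockspace */

struct recover_lk {
        struct dlm_lksb lksb;           /* sb_lkid set by the initial request */
        struct completion done;
        char lvb[RECOVER_LVB_LEN];
};

static void recover_lk_ast(void *arg)
{
        struct recover_lk *lk = arg;

        complete(&lk->done);
}

/* Convert the already-held lock to @mode and wait for the result. */
static int recover_lk_convert(dlm_lockspace_t *ls, struct recover_lk *lk,
                              int mode, const char *name)
{
        int error;

        error = dlm_lock(ls, mode, &lk->lksb,
                         DLM_LKF_CONVERT | DLM_LKF_VALBLK,
                         (void *)name, strlen(name), 0,
                         recover_lk_ast, lk, NULL);
        if (error)
                return error;
        wait_for_completion(&lk->done);
        reinit_completion(&lk->done);
        return lk->lksb.sb_status;
}

/*
 * Promote to EX, stuff the jid bitmap (assumed to be at least
 * RECOVER_LVB_LEN bytes) into the lvb, then demote again.  Dropping from
 * EX with DLM_LKF_VALBLK writes the new lvb back to the resource; other
 * nodes see it on their next VALBLK acquire or convert.
 */
static int publish_recovery_jids(dlm_lockspace_t *ls, struct recover_lk *lk,
                                 const void *jid_bitmap, const char *name)
{
        int error;

        lk->lksb.sb_lvbptr = lk->lvb;

        error = recover_lk_convert(ls, lk, DLM_LOCK_EX, name);
        if (error)
                return error;

        memcpy(lk->lvb, jid_bitmap, RECOVER_LVB_LEN);

        return recover_lk_convert(ls, lk, DLM_LOCK_NL, name);
}

The awkward part is exactly what Andreas objected to: every publish is an
EX/NL convert cycle on a lock that every node has to participate in.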

Believe me, I thought long and hard about how to accomplish this some
other way, but never found a better (or simpler) one. A kernel panic is
also what Dave Teigland recommended. Unless I'm mistaken, Dave has said
that GFS2 should never withdraw; it should always just panic (Dave,
correct me if I'm wrong). At least this patch confines that behavior to
a small subset of withdraws.
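
For context, the core of what the patch does is small; it boils down to
roughly the following in the journal bio completion path (this is a sketch
of the idea from memory, not the literal hunk, and the exact hook point,
check, and message may differ):

/*
 * Sketch of the idea only -- not the literal patch hunk.
 * gfs2_end_log_write() in fs/gfs2/lops.c is the bio completion handler
 * for journal writes; the shape of the error check below is assumed.
 */
static void gfs2_end_log_write(struct bio *bio)
{
        struct gfs2_sbd *sdp = bio->bi_private;

        if (bio->bi_status) {
                fs_err(sdp, "Error %d writing to journal, jid=%u\n",
                       bio->bi_status, sdp->sd_jdesc->jd_jid);
                /*
                 * Once the journal is unwritable we can no longer write
                 * revokes for metadata we have already sent down, so a
                 * withdraw is not safe: panic so the node gets fenced
                 * and another node replays our journal.
                 */
                panic("GFS2: fatal I/O error writing to the journal\n");
        }

        /* ... normal completion handling continues as before ... */
}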

I'm definitely open to ideas on how to fix this better, but I'm fresh out.
Just because I'm out of ideas doesn't mean there isn't a good way to do it,
so feel free to make suggestions if you can think of a better way to handle
this situation.

Regards,

Bob Peterson
Red Hat File Systems



