[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

Re: [Cluster-devel] [gfs2-utils PATCH 24/47] fsck.gfs2: Rework the "undo" functions



----- Original Message -----
| > Hi,
| > 
| > One thing to bear in mind is that the fsck blockmap is supposed to
| > represent the correct state of all blocks in the on-disk bitmap.
| > The job of pass1 is to build the blockmap, which starts out entirely
| > "free".
| > As the metadata is traversed, the blocks are filled in with the appropriate
| > type.
| > 
| Yes, but this cannot be true, as we don't actually always know the true
| correct state of the blocks, so sometimes, blocks will need to be marked
| as unknown until more evidence is available, as I think your example
| shows.

While pass1 is running, we don't know the correct state of every block,
but that's what pass1 is trying to determine. By the end of pass1, we should
know the correct state of every block except those with duplicate references.
You are correct about the "unknown" state, which is called "free" in the blockmap.
The bitmap and blockmap are two different things with two different purposes.
The bitmap is the state of the block on the media, where "free" means "free".
The blockmap is what we know about each block based on its references, where
"free" means the state is unknown: so far, nothing has referenced it.

| 
| > 
| > At this point, it doesn't make sense to continue marking 0x1025 and beyond
| > as referenced data blocks, because that will only make matters worse.
| > 
| I'm not convinced. In that case you have a single reference to a single
| out of range block. All that we need to do there is to reset the pointer
| to 0 (unallocated) and thats it. There is no need to stop processing the
| remainder of the block unless there is some reason to believe that this
| was not just a one off issue.

Actually, the way we have to approach this is a lot more complex.
The major problem here is, and probably always will be, duplicate block
references. Suppose a badly corrupt dinode references 1000 blocks; the file
should have been deleted, but it wasn't, due to a bitmap problem, and those
1000 blocks have since been reallocated to 200 different healthy files. Many
different problems arise, depending on whether you process the corrupt file
before or after the other 200 (or maybe 100 before and 100 after), so that
some of the blocks get marked as duplicate references and some don't.
 
| > Now we've got a problem: Before we knew 0x1000 was corrupt, we marked all
| > its references in the blockmap. We can't just delete the corrupt dinode
| > because most of its blocks are in-use by that other dinode.
| > 
| I think the confusion seems to stem from doing things in the wrong
| order. What we should be doing is verifying the tree structure of the
| filesystem first, and then after that is done, updating the bitmaps to
| match, in case there is a mismatch between the actual fs structure and
| the bitmaps.

That's exactly what we do. Pass1 verifies the file system tree structure, by
building its blockmap. Pass5 updates the bitmaps to match the blockmap built
by pass1.
 
| Also, we can use the initial state of the bitmaps in order to find more
| objects to look at (i.e. in case of unlinked inodes) and also use
| pointers in the filesystem in the same way (in case of directory entries
| which point to non-inode blocks) for example. But neither of those
| things requires undoing anything so far as I can see.
| 
| So what we ought to have is something like this:
| 
|  - Start looking at bitmaps to find initial set of things to check
|  - Start checking inodes, adding additional blocks to list of things to
| check as we go
|  - Once finished, check bitmaps against list to ensure consistency
| against the tree

Yes, this would work ideally, but only for metadata blocks. You still need
some way to compare the state of the bitmap against the references in order
to ensure there aren't duplicate block references. Unless, of course, the
list includes the data blocks as well as the metadata, but then it would take
an enormous amount of memory to accomplish: at least 8X more memory than the
entire file system size. If you don't include the data blocks, you still
need some means of ensuring the same blocks aren't referenced multiple times,
and any proposed solution has to take into account that blocks can be
multiply referenced, as either metadata or data, from multiple sources.
Using a 2-bit bitmap and a corresponding 8-bit blockmap, which is what we
have today, requires much less memory.

Even if the list includes only metadata, I'm concerned about the amount
of memory it would take, especially for common use cases like email, where
the file system contains many terabytes of tiny files.

| Obviously that is rather simplified since there are a few extras we need
| to deal with some corner cases, but that should be the core of it. If
| each rgrp is checked as we go, then it should be possible to do with
| just one pass through the fs.
| 
| And I know that we will not be able to do that immediately, but that is
| the kind of structure that we probably should be working towards over
| time. So this isn't really something for this patch set, but something
| that we should be looking into in due course,
| 
| Steve.

Bob Peterson
Red Hat File Systems

