
Re: [Linux-cluster] gfs2 v. zfs?



Hi,

On Wed, 2011-01-26 at 08:16 -0800, Wendy Cheng wrote:
> On 01/26/2011 02:19 AM, Steven Whitehouse wrote:
> >
> > I don't know of any reason why the inode number should be related to
> > back up. The reason why it was suggested that the inode number should be
> > independent of the physical block number was in order to allow
> > filesystem shrink without upsetting (for example) NFS which assumed that
> > its filehandles are valid "forever".
> 
> I'm not able to ping-pong too many emails on external mailing lists, at 
> least during week days. However, a quick note on this ...
> 
> GFS2 fragments very soon and very badly ! Its blocks are all over the 
> device due to the nature of how the resource group works. That slows 
> down *every* thing, particularly for backup applications. A production 
> deployment will encounter this issue very soon and they'll find the 
> issue more than annoying.
> 
The fragmentation pattern of GFS2 depends upon the workload. If there
are lots of small files being created and/or appended to, and the fs is
nearly full, then fragmentation will increase.

Each inode has a resource group that it will use by default. It only
changes resource group when either there are no blocks left in that
resource group, or if contention occurs with respect to another node. Provided
the number of resource groups is much larger than the number of nodes,
contention at the resource group level is largely avoided.
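As a rough illustration of the policy just described, here is a toy
model in Python. It is only a sketch: the class, the function name, the
"locked_by" field and the round-robin search order are all made up for
the example, and the real allocator in the kernel is considerably more
involved.

```python
class ResourceGroup:
    """Toy stand-in for a resource group (not the kernel structure)."""

    def __init__(self, index, free_blocks):
        self.index = index
        self.free_blocks = free_blocks
        self.locked_by = None  # hypothetical: node currently using this group


def pick_resource_group(groups, preferred, node):
    """Return the index of the group an inode on `node` would allocate from.

    The inode keeps its preferred group unless that group is out of
    blocks or another node is contending for it.
    """
    def usable(group):
        return group.free_blocks > 0 and group.locked_by in (None, node)

    if usable(groups[preferred]):
        return preferred

    # Assumption: search the remaining groups in order, wrapping around.
    count = len(groups)
    for step in range(1, count):
        candidate = groups[(preferred + step) % count]
        if usable(candidate):
            return candidate.index

    raise RuntimeError("no usable resource group")
```

With many more groups than nodes, the fallback search rarely has to go
far, which is the point made above about contention.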

Given a filesystem which is say, no more than 80% full, and given files
which are reasonably large and written sequentially, the fragmentation
should be relatively low.

> Now, educate me (though I'll probably not read it until weekend) ... how 
> will you defragment the FS with that inode number attaching to physical 
> block number ? You can *not* move these inodes.
> 
> -- Wendy

It is true that we don't have a defragmentation solution at the moment.
We could potentially do that in an off-line manner, but so far we've not
really found that to be a big problem.

When there are performance issues, cache contention on a single inode
is always a much greater problem overall than fragmentation.

Also, as an aside, if anybody wants to look at their GFS2 files to see
whether they are fragmented, then the filefrag tool from e2fsprogs will
tell you on a file-by-file basis. Often the best way to defragment a
file is simply to copy it, provided the filesystem is not already nearly
full. That also has the advantage that it can be done online, and
with standard system tools.
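For anyone who wants to script that check, something along these lines
might do as a starting point. It is a sketch only: the function names,
the ".defrag-tmp" suffix and the 0.8 fill threshold are my own choices
for the example, and it assumes filefrag prints its usual
"N extents found" summary line.

```python
import os
import re
import shutil
import subprocess

# Summary line printed by filefrag, e.g. "/mnt/gfs2/data.bin: 12 extents found"
_SUMMARY = re.compile(r"^(?P<path>.*): (?P<count>\d+) extents? found\s*$")


def parse_extent_count(output):
    """Return the extent count from filefrag output, or None if not found."""
    for line in output.splitlines():
        match = _SUMMARY.match(line)
        if match:
            return int(match.group("count"))
    return None


def extent_count(path):
    """Run filefrag on one file and return the number of extents reported."""
    result = subprocess.run(
        ["filefrag", path], capture_output=True, text=True, check=True
    )
    return parse_extent_count(result.stdout)


def rewrite_by_copy(path, max_fill=0.8):
    """Copy a file and rename the copy over the original.

    Returns False without touching the file if the filesystem is fuller
    than max_fill (an assumed threshold, not a GFS2 rule).
    """
    directory = os.path.dirname(os.path.abspath(path))
    usage = shutil.disk_usage(directory)
    if usage.used / usage.total > max_fill:
        return False

    temporary = path + ".defrag-tmp"  # hypothetical temporary name
    shutil.copy2(path, temporary)     # copies data, mode and timestamps
    os.replace(temporary, path)       # rename the copy over the original
    return True
```

Note that the copy gets a new inode number, loses any hard links and
may not keep ownership, so it should only be done on files which are not
in use at the time. Comparing extent_count() before and after shows
whether the copy actually helped.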

Nevertheless, I agree that it would be nice to be able to move the
inodes around freely. I'm not sure that the cost of the required extra
layer of indirection would be worth it though, in terms of the benefits
gained.

I'm happy to be proved wrong, if someone can come up with a workable
solution.

Steve.


> > The problem with doing that is that it adds an extra layer of
> > indirection (and one which had not been written in gfs2 at the point in
> > time we took that decision). That extra layer of indirection means more
> > overhead on every lookup of the inode. It would also be a contention
> > point in a distributed filesystem, since it would be global state.
> >
> > The dump command directly accesses the filesystem via the block device
> > which is a problem for GFS2, since there is no guarantee (and in general
> > it won't be) that the information read via this method will match the
> > actual content of the filesystem. Unlike ext2/3 etc., GFS2 caches its
> > metadata in per-inode address spaces which are kept coherent using
> > glocks. In ext2/3 etc., the metadata is cached in the block device
> > address space which is why dump can work with them.
> >
> > With GFS2 the only way to ensure that the block device was consistent
> > would be to umount the filesystem on all nodes. In that case it is no
> > problem to simply copy the block device using dd, for example. So dump
> > is not required.
> >
> > Ideally we want backup to be online (i.e. with the filesystem mounted),
> > and we also do not want it to disrupt the workload which the cluster was
> > designed for, so far as possible. So the best solution is to back up
> > files from the node which is most likely to be caching them. That also
> > means that the backup can proceed in parallel across the nodes, reducing
> > the time taken.
> >
> > It does mean that a bit more thought has to go into it, since it may not
> > be immediately obvious what the working set of each node actually is.
> > Usually though, it is possible to make a reasonable approximation of it.
> >
> > Steve.
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster redhat com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> 


