[libvirt] [PATCH] Use posix_fallocate() to allocate disk space

Tue Feb 24 12:31:20 UTC 2009

On (Tue) Feb 24 2009 [11:58:31], Daniel P. Berrange wrote:
> On Tue, Feb 24, 2009 at 05:09:31PM +0530, Amit Shah wrote:
> > Hi,
> > 
> > This is an untested patch to make disk allocations faster and
> > non-fragmented. I'm using posix_fallocate() now but relying on glibc
> > really calling fallocate() if it exists for the file system to be the
> > fastest.
> > 
> > - This fails build because libutil needs to be added as a dependency?
> > 
> > ../src/.libs/libvirt_driver_storage.a(storage_backend_fs.o): In function
> > `virStorageBackendFileSystemVolCreate':
> > /home/amit/src/libvirt/src/storage_backend_fs.c:1023: undefined
> > reference to `safezero'
> 
> You'd need to add 'safezero' to src/libvirt_private.syms to allow it
> to be linked to by the storage driver.

Thanks; builds now.

> > - What's vol->capacity? Why is ftruncate() needed after the call to
> >   (current) safewrite()? My assumption is that the user can specify some
> >   max. capacity and wish to allocate only a chunk off it at create-time.
> >   Is that correct?
> 
> "allocation" refers to the current physical usage of the volume
> 
> "capacity" refers to the logical size of the volume
> 
> So, you can have a raw sparse file of size 4 GB, but not allocate any disk
> upfront - just allocated on demand when guest writes to it. Or you can 
> allocate 1 GB upfront, and leave the rest unallocated. So this code is
> first filling out the upfront allocation the user requested, and then using
> ftruncate() to extend to a (possibly larger) logical size.
> 
> Similarly for qcow files, capacity refers to the logical disk size
> but qcow is grow on demand, so allocation will be much lower.
> 
> Usually allocation <= capacity, but if the volume format has metadata
> overhead, you can get to a place where allocation > capacity if the
> entire volume has been written to.

This case had me puzzled. Thanks for the explanation!
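
A minimal sketch of that allocation/capacity split for a raw file, as
I understand it from the explanation above. create_raw_volume() is a
made-up helper for illustration, not the code in storage_backend_fs.c,
and it assumes posix_fallocate() is usable on the target file system:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not libvirt's API): reserve `allocation` bytes
 * up front, then extend the logical size to `capacity` bytes.
 * Returns 0 on success or an errno value on failure. */
int create_raw_volume(const char *path, off_t allocation, off_t capacity)
{
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return errno;

    int rc = 0;

    /* "allocation": physical space reserved now.  posix_fallocate()
     * returns the error number directly and does not set errno. */
    if (allocation > 0)
        rc = posix_fallocate(fd, 0, allocation);

    /* "capacity": logical size.  Anything past the allocation stays a
     * hole that is filled in on demand. */
    if (rc == 0 && capacity > allocation && ftruncate(fd, capacity) < 0)
        rc = errno;

    if (close(fd) < 0 && rc == 0)
        rc = errno;
    if (rc != 0)
        unlink(path);   /* don't leave a half-created volume behind */
    return rc;
}
```

With allocation = 1 GB and capacity = 4 GB this would give the "1 GB
upfront, rest unallocated" file from the example above.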

> > The best case to get a non-fragmented VM image is to have it allocated
> > completely at create-time with fallocate().
> 
> The main problem with this change is that it'll make it harder for
> us to provide incremental feedback. As per the comment in the code, 
> it is our intention to make the volume creation API run as a background
> job which provides feedback on progress of allocation, and the ability
> to cancel the job. Since posix_fallocate() is an all-or-nothing kind of
> API it wouldn't be very helpful. 
> 
> What sort of performance boost does this give you ?  Would we perhaps
> be able to get close to it by writing in bigger chunks than 4k, or 
> mmap'ing the file and then doing a memset across it ?

If the file system itself is asked to provide a zeroed range (using
extents, as is possible in XFS and ext4, where the blocks can be
reserved without actually being written), that's going to be the
fastest method available. It's definitely faster than writing chunks
from userspace.
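
For comparison, here is a rough sketch of the chunked userspace
approach with progress feedback and cancellation. All the names
(zero_fill_chunked(), progress_cb, count_progress()) are invented for
illustration; this is not how safezero() or the storage driver is
written:

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical progress callback: return nonzero to cancel the job. */
typedef int (*progress_cb)(off_t done, off_t total, void *opaque);

/* Write `total` zero bytes to fd in pieces of at most `chunk` bytes,
 * reporting progress after each write.  Returns 0 on success,
 * ECANCELED if the callback asked to stop, or another errno value. */
int zero_fill_chunked(int fd, off_t total, size_t chunk,
                      progress_cb cb, void *opaque)
{
    if (chunk == 0)
        return EINVAL;

    char *buf = calloc(1, chunk);       /* zero-filled buffer */
    if (!buf)
        return ENOMEM;

    off_t done = 0;
    int rc = 0;
    while (done < total) {
        size_t want = chunk;
        if ((off_t)want > total - done)
            want = (size_t)(total - done);

        ssize_t n = write(fd, buf, want);
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, retry */
            rc = errno;
            break;
        }
        if (n == 0) {                   /* no progress possible */
            rc = EIO;
            break;
        }
        done += n;

        if (cb && cb(done, total, opaque) != 0) {
            rc = ECANCELED;
            break;
        }
    }
    free(buf);
    return rc;
}

/* Example callback state: counts calls, cancels after `cancel_after`
 * calls when that field is greater than zero. */
struct progress_state {
    int calls;
    int cancel_after;
};

int count_progress(off_t done, off_t total, void *opaque)
{
    struct progress_state *st = opaque;
    (void)done;
    (void)total;
    st->calls++;
    return st->cancel_after > 0 && st->calls >= st->cancel_after;
}
```

A bigger chunk means fewer write() calls but coarser progress reports.
Every byte still goes through the page cache either way, which is the
cost the file system can avoid when it reserves the extents itself.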

Of course, my patch is based on untested assumptions. My original
goal was to have the image as unfragmented as possible.

What counts as "fast" depends on a couple of things:

- do we want image creation to be the fastest operation? (Fast, yes;
  fastest at the expense of something else, like guest runtime, maybe
  not.)
- how do guests cope with fragmented images? If the data in the image
  itself is fragmented, what good is an unfragmented image?

The problem with writing chunks of any size is that it can easily
lead to a lot of fragmentation. If we have the answers to the two
questions above, we can make a decision based on actual numbers.

If it turns out that an unfragmented image file is much better, the
pretty graphics showing % complete will have to stay in the state
they are in now ;-)

I don't have an ext4 file system on actual hardware to test on -- I'll
provide numbers once I get one set up, though.