Repository Snapshot storage format

Thu May 21 14:27:08 UTC 2009

On 05/21/2009 08:59 AM, Atul Aggarwal wrote:
> Hello everybody,
>
> I would like to discuss the way the snapshots are stored on the disk.
>
> Repository snapshots are snapshots of the published directory tree which
> are taken at regular intervals and are given a UUID for reference
> by client.

s/by client//

>
> Since we have ruled out FUSE for the backend, I am thinking of a very trivial

FUSE for the backend was ruled out?  I don't think we ever discussed this.  It
might be unusual, but it could possibly be useful in a good design.  Maybe...

> procedure for storing these snapshots on disk. For the first snapshot, we
> will save the complete tree, and for further snapshots we will save only
> the files which have changed and link all existing files (which have not
> been changed since the last snapshot) to the previous snapshot. For tagging

This makes it non-trivial to delete entire snapshots?  Have you 
considered hardlinks of entire trees?

> purposes, another link will be created in the tag directory which will
> point to the snapshot which is tagged.
> The pro of this method is that it is very easy to implement and understand.
> The con is that while deleting a snapshot we need to check that the file is
> not referenced in any other snapshot or tag. The same is the case with
> tagging.

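Example:
/path/orig/file1
/path/orig/file2
/path/orig/file3
cd /path/
cp -al /path/orig/ /path/copy1/     # snapshot 1: hard-link the entire tree
echo new > /path/orig/file4         # write a new file
rm /path/orig/file1                 # delete a file
cp -al /path/orig/ /path/copy2/     # snapshot 2: copy1 still has file1, no file4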

The one major drawback of this approach is that files must be deleted in orig
before being recreated, or the changes will affect earlier snapshot contents.
Most tools do create entirely new files instead of modifying existing
files, though.  Perhaps we need more opinions?
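
To make the hard-link behavior concrete, here is a minimal Python sketch of
the same idea.  It is only an illustration under assumptions: snapshot() and
demo() are made-up names, the tree lives in a temporary directory, and the
filesystem is assumed to support hard links.  It imitates cp -al with
os.link, then shows the in-place write hazard, the delete-then-recreate case,
and that removing a whole snapshot needs no reference check.

```python
import os
import shutil
import tempfile


def snapshot(src, dst):
    """Recreate the directory tree of src under dst, hard-linking every file.

    Roughly what `cp -al src dst` does; dst must not exist yet.
    """
    for root, dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        target = os.path.normpath(os.path.join(dst, rel))
        os.makedirs(target)
        for name in files:
            os.link(os.path.join(root, name), os.path.join(target, name))


def read(path):
    with open(path) as fh:
        return fh.read()


def demo():
    # Hypothetical layout: orig/ is the live tree, copy1/ and copy2/ are snapshots.
    base = tempfile.mkdtemp()
    orig = os.path.join(base, "orig")
    os.makedirs(orig)
    for name in ("file1", "file2", "file3"):
        with open(os.path.join(orig, name), "w") as fh:
            fh.write("v1")

    copy1 = os.path.join(base, "copy1")
    snapshot(orig, copy1)

    # Hazard: writing in place goes through the shared inode, so copy1 changes too.
    with open(os.path.join(orig, "file2"), "w") as fh:
        fh.write("v2")
    in_place = read(os.path.join(copy1, "file2"))

    # Safe: delete (unlink) first, then recreate; copy1 keeps the old inode.
    os.remove(os.path.join(orig, "file3"))
    with open(os.path.join(orig, "file3"), "w") as fh:
        fh.write("v2")
    recreated = read(os.path.join(copy1, "file3"))

    copy2 = os.path.join(base, "copy2")
    snapshot(orig, copy2)
    # Deleting a whole snapshot needs no reference check: the filesystem keeps
    # the data alive while any other hard link remains.
    shutil.rmtree(copy1)
    survivor = read(os.path.join(copy2, "file1"))

    shutil.rmtree(base)
    return in_place, recreated, survivor
```

Calling demo() returns ("v2", "v1", "v1"): the in-place write leaked into
the earlier snapshot, the delete-then-recreate did not, and copy2 still reads
fine after copy1 is gone.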

>
> To address this con, the process of linking can be changed so that whenever
> we have a new snapshot, instead of creating a link in the new snapshot, we
> move the file to the new snapshot and create a link in the older
> snapshot. In this way we may delete the old snapshot without worrying about
> breaking the new snapshot.

We could be reinventing the wheel here.  Perhaps we don't want to use a 
version control system like git here for reasons you note below, but 
techniques they have invented might be useful.
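
For comparison, the move-and-link-back variant quoted above could look
roughly like the following Python sketch.  This is my reading of the
proposal, not a settled design: carry_forward() and demo() are made-up names,
pkg.rpm is a placeholder file, and I am assuming symbolic links and a flat
snapshot directory.

```python
import os
import shutil
import tempfile


def carry_forward(old_snap, new_snap, name):
    """Move an unchanged file into new_snap and leave a symlink in old_snap."""
    old_path = os.path.join(old_snap, name)
    new_path = os.path.join(new_snap, name)
    os.replace(old_path, new_path)  # the newest snapshot now owns the data
    # Relative target, so both snapshot directories can be relocated together.
    target = os.path.relpath(new_path, os.path.dirname(old_path))
    os.symlink(target, old_path)


def read(path):
    with open(path) as fh:
        return fh.read()


def demo():
    base = tempfile.mkdtemp()
    snap1 = os.path.join(base, "snap1")
    snap2 = os.path.join(base, "snap2")
    os.makedirs(snap1)
    os.makedirs(snap2)
    with open(os.path.join(snap1, "pkg.rpm"), "w") as fh:  # placeholder file
        fh.write("payload")

    carry_forward(snap1, snap2, "pkg.rpm")
    old_is_link = os.path.islink(os.path.join(snap1, "pkg.rpm"))
    via_old = read(os.path.join(snap1, "pkg.rpm"))

    # Removing the old snapshot only removes symlinks; the data stays in snap2.
    shutil.rmtree(snap1)
    after_delete = read(os.path.join(snap2, "pkg.rpm"))

    shutil.rmtree(base)
    return old_is_link, via_old, after_delete
```

One thing the sketch makes visible: with more than two snapshots the older
links chain through the intermediate ones, so deleting a middle snapshot
would still require repointing the links in the snapshots before it.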

>
> I also thought of using diffs in the storage, but it seems to be a bad
> idea as we will mostly be dealing with binary files, and it will create
> load on the server in exchange for some disk space. Also, with diffs, extra
> computation needs to be done while adding new snapshots and removing
> snapshots. Using a version control system will put unnecessary load on
> the system without much benefit in space on a large repository tree.

Perhaps your idea is a good idea, but we really need more opinions.

Warren Togami
wtogami at redhat.com