
Re: PDR -- Package development repository to replace SRPM's?



Ralf S. Engelschall wrote:

On Wed, Oct 01, 2003, Jeff Johnson wrote:



[...]
In the interest of looking seriously at build dependency
loops involved in bootstrapping a distro, I've set up a
public CVS repository for exploded severn2 *.src.rpm's at
[...]
Note: The cvs.colug.net repository is ~2GB in size.
[...]
Another reason for exploding *.src.rpm's into a repository
is that I believe that src.rpm's are not the most effective
means for distributing software.

Don't misunderstand me, src.rpm's make very good bricks, and src.rpm
bricks usually come with the implicit rpm guarantee that the src.rpm
was produced by taking virgin sources, applying patches, and running
a spec file through rpmbuild from soup to nuts. That is a very sensible
guarantee for a brick.

However, src.rpm bricks, just like *.tar.gz, hide the internals. Say
a package has been rebuilt, and the Release: tag has been incremented
to reflect the change. There is no way to discover that nothing else
of interest has changed in the src.rpm without downloading the entire
brick. Bricks are bandwidth-heavy.
[...]



Many thanks for establishing such a service.


But I strongly recommend that you leave vendor tarballs and other
downloadable files out of CVS. CVS handles such binary files very badly,
and they really bloat up the CVS tree (as you already recognized ;-).


Hmmm, let's go into "bloat up the CVS tree", shall we?


Yes, binaries in CVS have always been problematic, because the RCS format
used to store the deltas was originally designed for line-by-line source code
changes, and binary files are clearly not line-oriented in general.


So the bloat can come from the overhead of attempting to compute deltas over
arbitrary runs of bytes, for which the line-by-line deltafication algorithm is
sub-optimal.


The answer is to use "cvs admin -kb", which saves each binary file check-in
in its entirety rather than as a series of deltas. That removes the
deltafication overhead entirely.
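
To make that concrete, here is a rough sketch of marking a tarball as binary,
both when it is first added and after the fact; the file name is invented:

    # New tarball: add it with -kb so CVS never tries to deltify or
    # keyword-expand it (file name is invented):
    cvs add -kb foo-1.2.3.tar.gz
    cvs commit -m "import virgin foo 1.2.3 sources" foo-1.2.3.tar.gz

    # Already checked in as text by mistake: flip it to binary storage,
    # then refresh the working copy:
    cvs admin -kb foo-1.2.3.tar.gz
    cvs update -A foo-1.2.3.tar.gz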


The cost of using -kb is that there is no attempt to remove redundancy. That is almost
exactly the same cost as having, say, both copies of the virgin source tarballs present.
It does not matter much whether the tarballs are separate files or concatenated with
RCS markers to allow separation when needed; the "bloat" is exactly the same.
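
You can see that for yourself with something along these lines; the repository
path and revision numbers are invented, and this assumes the tarball already
has two check-ins:

    # Size of the RCS file holding the tarball in the repository:
    du -k /path/to/cvsroot/severn2/foo/foo-1.2.3.tar.gz,v

    # Extract each stored revision from a working copy and compare:
    cvs update -p -r 1.1 foo-1.2.3.tar.gz > /tmp/foo-rev1.tar.gz
    cvs update -p -r 1.2 foo-1.2.3.tar.gz > /tmp/foo-rev2.tar.gz
    du -k /tmp/foo-rev1.tar.gz /tmp/foo-rev2.tar.gz

With -kb the ,v file should come out roughly the size of the two revisions
added together, which is exactly the "bloat" in question.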


Now what is gained by adding tarballs and such to "bloat up a CVS tree" is
referential integrity. Everything necessary to produce a src.rpm (and to build it)
carries exactly the same tag, and comes from the same source without any additional
mechanism.


I also point out that there are few (if any) binary files that change on each and
every check-in. Package components (except for the spec file) tend to be developed
by adding and deleting files, not by recreating and recompressing virgin sources
in tar.gz format.


So I recognize a "large", but not a "bloated", repository. I think that
referential integrity is crucially important to a PDR: how are you
going to build a package if you don't have the virgin sources somehow?
Any other, more complicated scheme, such as referencing a virgin source
URL and keeping an MD5 digest in the repository, can only detect a
failure during import; that is not exactly "referential integrity".
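
For comparison, the URL-plus-digest scheme being critiqued looks roughly like
this; the manifest name, URL, and digest below are placeholders, nothing real:

    # Hypothetical "sources" manifest kept in CVS instead of the tarball
    # itself, one "<md5>  <filename>" line per download (digest is a placeholder):
    echo "0123456789abcdef0123456789abcdef  foo-1.2.3.tar.gz" > sources

    # At import or build time, fetch and verify; a bad download is the only
    # failure this can catch:
    wget http://ftp.example.org/pub/foo/foo-1.2.3.tar.gz
    md5sum -c sources

If the upstream site later disappears, the digest verifies nothing, because
there is nothing left to verify it against.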

There are other means to deal with subsets of the files in a PDR using
tags if you wish: say, only the spec files, or only specs+patches, to speed
up or debloat checkouts.
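
As a rough sketch of what that tagging would look like (the tag and module
names are invented, and this assumes a full working copy to tag from):

    # From a checked-out tree, tag only the spec files:
    find . -name '*.spec' | xargs cvs tag SPECS_ONLY

    # A later checkout with that tag pulls only the files carrying it:
    cvs checkout -r SPECS_ONLY severn2

A specs+patches tag is the same idea with the find pattern widened to match
'*.spec' and '*.patch' before tagging.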

Especially on package upgrades, they are all moved into the CVS Attic/
subdirs and get replaced with new files (because of versioned filenames),
and this way the CVS repository bloats up even more dramatically. And keep
in mind that it is not just the CVS checkouts that require backbone bandwidth
and large disk space. It is also impossible to work reasonably with the stuff
(try a single recursive grep(1) for something and you'll see ;-)



Moving deleted files to the Attic is exactly "referential integrity"; that is exactly
what I would expect to happen. Disk space or "not *ever* needed any more"
are perfectly valid criteria in addition to "referential integrity" too, and there
are means to handle these cases. Blaming cvs for doing its job is a bit daft,
however ;-)


I'm not exactly sure why recursive grep is pertinent. Do you mean that
reading 2GB of data for an everything checkout is going to be slow?
Sure, don't do that: invoke grep with different args, write a script, or
run on a subset of the packages; all those techniques will "work".
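
For instance, either of these avoids dragging the whole 2GB through grep; the
search string and package directory are just examples, not anything from the
repository:

    # Search only the spec files rather than the whole tree:
    find . -name '*.spec' | xargs grep -l BuildRequires

    # Or restrict a recursive grep to a single package's subdirectory:
    grep -r BuildRequires ./gcc/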

There's certainly nothing intrinsically wrong with a 2-level-deep structure
that is going to cause "recursive" to misbehave.

CVS is cool because lots of people are already used to it and there
are nice addons available. But if you want to use CVS for RPM source
packages, I recommend looking at the OpenPKG setup: for each of
our 600 packages we have made a split into original files (.spec, .patch,
etc. -- all that we provide) and third-party files (.tar.gz, etc. -- all that
can be downloaded). The first kind we keep version controlled in a
central CVS repository (http://cvs.openpkg.org/). The second kind we
just keep on our FTP server (ftp://ftp.openpkg.org/sources/DST/) as
a last-resort copy (for safety, in case it is no longer downloadable). This
separation balances the data best IMHO: we have a reasonably small CVS
which everyone can easily check out, browse, and work with.



Again, if you wish only specs, then a tag placed on specs achieves this. Ditto, specs+patches.

"Downloaded" does not provide "referential integrity". What happens
when a site disappears or moves somewhere else instead? The internet
isn't as good as cvs is in this respect. ;-)

Thanks for the comments. Try cvsup; I do believe that you are interested
in cloning the colug repository so that you can get rid of those pesky
sources, and that's exactly what cvsup does ;-)
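
Something like the following supfile would be the starting point; the base and
prefix paths and the collection name are invented, and whether the colug box
actually runs a cvsup server is an assumption on my part:

    # hypothetical-colug-supfile
    *default host=cvs.colug.net
    *default base=/var/db
    *default prefix=/home/mirror/cvs
    *default release=cvs
    *default delete use-rel-suffix compress

    severn2-all

    # run with:  cvsup -g -L 2 hypothetical-colug-supfile
    # A refuse file under $base/sup/ listing patterns like *.tar.gz would
    # then let you skip mirroring the tarballs, if you really must.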

73 de Jeff





