Supporting EPEL Builds in Koji

Mon Jan 5 18:20:42 UTC 2009

Picking up this thread again, sorry about the long delay.  I'd like to
come to consensus on the approach here, hammer out any remaining details
at FUDCon this weekend, and hopefully get this implemented by the end of
January.  Time to really get rid of plague!

On Mon, 2008-10-06 at 15:14 -0400, Mike McLean wrote:
> Mike Bonnet wrote:
> > On Fri, 2008-07-18 at 11:38 -0400, Mike McLean wrote:
> >> Mike Bonnet wrote:
> >>> On Thu, 2008-07-17 at 13:54 -0400, Mike McLean wrote:
> >>>> If the remote_repo_url data is going to be inherited (and I tend to
> >>>> think it should be), then I think it should be in a separate table. 
> ...
> >>> I don't have any problem with this, though it does mean we'll need to
> >>> duplicate quite a bit of the inheritance-walking code,
> ...
> >> Walking inheritance is just a matter of determining the inheritance 
> >> order and scanning data on the parent tags in sequence.
> ...
> > Sorry, I was referring to walking tag_inheritance.  I'd rather have one
> > place that walks the inheritance hierarchy and aggregates data from it,
> > than two places that are doing almost the same thing.
> 
> We're talking about inherently different data. External repos to be 
> merged in are quite different from builds in the system.

Yes, I see the issue here.  Since remote repos won't have their packages
filtered out (by mergerepo) until after all packages in the local
inheritance hierarchy are placed in the repo, they don't really follow
the existing inheritance rules.

Ok, you've convinced me.  A separate table that stores a
priority-ordered list of remote repos associated with each tag will
probably be easier to manage.  The lists will be aggregated when walking
the tag hierarchy and passed to mergerepo in (priority, inheritance)
order for proper filtering (based on srpm name, first match wins).
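
To make the ordering concrete, here is a rough sketch of the aggregation
step.  It assumes the tags arrive already in inheritance order (nearest
tag first), that a lower priority number sorts first within a tag, and
that a repo reachable twice keeps its first position.  The function name,
tag names and repo names are all made up for illustration; none of this
is existing Koji code.

```python
def external_repos_for_tag(inheritance_order, repos_by_tag):
    """Flatten per-tag external repo lists into one ordered list.

    inheritance_order: tag names, nearest tag first (hypothetical input,
        e.g. what an inheritance walk would yield).
    repos_by_tag: maps tag name -> list of (priority, repo_name) pairs.
    """
    ordered = []
    seen = set()
    for tag in inheritance_order:
        # Within one tag, a lower priority number sorts first (assumption).
        for priority, repo in sorted(repos_by_tag.get(tag, [])):
            if repo not in seen:  # a repo listed twice keeps its first slot
                seen.add(repo)
                ordered.append(repo)
    return ordered


# Hypothetical hierarchy and repo assignments.
order = ["dist-custom-build", "dist-custom", "dist-f9-base"]
repos = {
    "dist-custom-build": [(10, "f9-updates"), (5, "custom-extras")],
    "dist-f9-base": [(1, "f9-ga"), (2, "f9-updates")],
}
print(external_repos_for_tag(order, repos))
```

The resulting list is what would be handed to mergerepo, earliest entry
winning.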

> > Each tag has a set of builds associated with it.  We walk the
> > inheritance hierarchy, aggregating the builds from each tag in the
> > hierarchy into a flat list, and then pass that list to createrepo.  We
> > would do essentially the same thing for external repos.  When walking
> > the hierarchy, if a tag has an external repo associated with it, we
> > would append that repo url to a flat list, and pass that list to
> > mergerepo.  In both cases we're working with collections of packages
> > that are associated with a tag, just in different formats.
> 
> Sure, we can do this with one call to readFullInheritance, and traverse 
> both the build table and external repo table from the given order.

Yes, that makes sense.

> > In discussing this with Jesse, I think we want external repos to be
> > inherited.  This is probably the easiest way to deal with having
> > multiple external repos getting pulled in to a single buildroot, which
> > is essential for Fedora (think F9 GA and F9 Updates).
> > 
> > The idea was that, by convention, we would have external-repo-only tags,
> > each with only a single external repo associated with it and no
> > packages/builds associated.  These external-repo-only tags could then be
> > inserted into the build hierarchy where appropriate.  An ordered list of
> > external repos could then be constructed by performing the current
> > depth-first search of the inheritance hierarchy.  The ordered list would
> > then be passed to mergerepo, which would ensure that packages in repos
> > earlier in the list supersede packages (by srpm name) in repos later in
> > the list.  This would preserve the "first-match-wins" inheritance policy
> > that Koji currently implements, and that admins expect.  For example:
> > 
> > dist-custom-build
> >   ├─dist-custom
> >   └─dist-f9-updates-external
> >       └─dist-f9-ga-external
> > 
> > would result in mergerepo creating a single repo that would only contain
> > packages from dist-f9-ga-external if they did not exist in the
> > Koji-generated repo (dist-custom-build + dist-custom),
> > dist-f9-updates-external, or the blacklist of blocked packages.  This is
> > consistent with how Koji package inheritance currently works, and I
> > think is the most intuitive approach.
> 
> It is similar, but different in potentially confusing ways. External 
> repos do not have build structure, so we can't really have the same sort 
> of inheritance behavior with a combination of external repo tags and 
> normal tags.
> 
> We order the external repos in inheritance order, but ultimately those 
> repos are merged with the internal one in a way that does not honor 
> inheritance in the way that the admin might expect.
> 
> Using tags to represent external repos fails intuition because external 
> repos are very much not like tags. When we get to supporting external 
> koji systems, we can do something like this, but for external repos the 
> "bolted-on" nature needs to be clear. This is why I'd prefer to have the 
> data a little more removed.

Ok, we're agreed on this.

> >> I see all that, and I'm almost convinced. The flipside is that by 
> >> default all the code will treat these external rpms the same as the 
> >> local ones, which will not be correct for a number of cases. 
> > 
> > Personally I'd prefer adding a few special cases to the existing code,
> > rather than maintain a whole heap of almost-but-not-quite-the-same code
> > to manage external rpms.  I think that conceptually they're alike enough
> > that the number of special cases will be minimal.
> 
> I think I'm ok with using the rpminfo table.
> 
> > I think that synthesizing builds for the sake of maintaining the
> > not-null constraint is more pain than it's worth, and would make
> > enforcing our nvr-uniqueness constraints (which we definitely want to do
> > for local builds) more difficult.  Having locally-built rpms always
> > associated with a build, and external rpms not, makes sense to me.
> 
> Ok, agreed.
> 
> >> Also, I'm thinking we need to have some sort of rpm_origin table so that 
> >> all these references can be managed cleanly.
> > 
> > That sounds reasonable to me.  Note that we may end up with a lot of
> > rows in this table, since we're allowing variable substitution in the
> > external_repo_url (tag name and arch).  But I don't see that as a
> > problem.
> 
> I'm thinking the only substitution we should support is arch. Anything 
> else sort of constitutes a different repo.
> 
> If we use an origin table like this we can abstract out the arch. 
> Something like:
> 
> create table external_repo (
> 	id SERIAL PRIMARY KEY,
> 	name TEXT );
> create table external_repo_config (
> 	external_repo_id INTEGER NOT NULL REFERENCES external_repo (id),
> 	url TEXT NOT NULL
> 	-- plus versioning fields
> 	-- ...
> );
> 
> This way if upstream repo changes url scheme or moves to a different 
> host, you can keep some notion of connectedness. External rpms would 
> simply reference external_repo_id.

Makes sense.  So a tag would simply reference the external_repo_id as
well, and the repo url would be set elsewhere (globally).  The table
storing the external repo info for tags would look like:

create table tag_external_repos (
        tag_id INTEGER NOT NULL REFERENCES tag(id),
        external_repo_id INTEGER NOT NULL REFERENCES external_repo(id),
        priority INTEGER NOT NULL,
        -- plus versioning fields
        UNIQUE (tag_id,priority,active)
);

I like this, it keeps everything much more normalized.
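
As a quick way to try the schema out, the sketch below loads it into an
in-memory SQLite database from Python and fetches the repos for one tag
in priority order.  This is only an illustration: the real schema would
live in PostgreSQL, the single "active" column is a stand-in for the
versioning fields, and the tag and repo names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tag (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE external_repo (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tag_external_repos (
        tag_id INTEGER NOT NULL REFERENCES tag(id),
        external_repo_id INTEGER NOT NULL REFERENCES external_repo(id),
        priority INTEGER NOT NULL,
        active BOOLEAN DEFAULT 1,  -- stand-in for the versioning fields
        UNIQUE (tag_id, priority, active)
    );
    INSERT INTO tag VALUES (1, 'dist-custom-build');
    INSERT INTO external_repo VALUES (1, 'f9-ga'), (2, 'f9-updates');
    INSERT INTO tag_external_repos VALUES (1, 1, 20, 1), (1, 2, 10, 1);
""")


def repos_for_tag(conn, tag_id):
    """Return the names of a tag's active external repos, by priority."""
    rows = conn.execute(
        """SELECT external_repo.name
             FROM tag_external_repos
             JOIN external_repo ON external_repo.id = external_repo_id
            WHERE tag_id = ? AND active
            ORDER BY priority""",
        (tag_id,),
    )
    return [name for (name,) in rows]


print(repos_for_tag(conn, 1))
```

The UNIQUE constraint means two active repos on the same tag can never
share a priority, so the ordering is always unambiguous.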

> >> In the same vein, what happens when an external repo has an nvra+sigmd5 
> >> matching a /local/ rpm?  Maybe it doesn't matter, though I guess 
> >> technically we want to record the origin properly when it gets into a 
> >> buildroot via external repo vs internal tag.
> > 
> > Right, we would record the origin as the remote repo it came from (by
> > parsing the merged repodata and looking at the baseurl).

Right, and the origin can just be stored as a reference to the
external_repo(id).
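
A minimal sketch of that lookup, assuming the merged primary.xml marks
each external package with an xml:base attribute on its <location>
element and leaves it off for local packages.  The sample metadata, the
URL and the url-to-id mapping are invented for the example; the real code
would read them from the generated repodata and the external_repo tables.

```python
import xml.etree.ElementTree as ET

COMMON = "{http://linux.duke.edu/metadata/common}"
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Invented sample of merged repodata: "foo" is local, "bar" is external.
SAMPLE = """<metadata xmlns="http://linux.duke.edu/metadata/common" packages="2">
  <package type="rpm">
    <name>foo</name>
    <arch>i386</arch>
    <location href="foo-1.0-1.i386.rpm"/>
  </package>
  <package type="rpm">
    <name>bar</name>
    <arch>i386</arch>
    <location xml:base="http://example.com/f9-updates/i386"
              href="bar-2.0-1.i386.rpm"/>
  </package>
</metadata>
"""


def origins(primary_xml, repo_ids_by_url):
    """Map each package name to an external_repo id, or None if local."""
    result = {}
    for pkg in ET.fromstring(primary_xml).iter(COMMON + "package"):
        name = pkg.findtext(COMMON + "name")
        base = pkg.find(COMMON + "location").get(XML_BASE)
        # No xml:base (or an unknown one) is treated as a local package.
        result[name] = repo_ids_by_url.get(base)
    return result


print(origins(SAMPLE, {"http://example.com/f9-updates/i386": 2}))
```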

> So where do we draw the line between code that we add to koji and code 
> that we add to createrepo (or some external merge-repo tool)?

Koji would only be responsible for parsing the repodata and populating
the database with the correct origin for any given rpm.  mergerepo would
be responsible for creating the repo and enforcing the filtering rules.
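
For reference, the filtering rule I expect mergerepo to enforce can be
sketched as below.  This is not mergerepo's code, just an illustration of
"first match wins by srpm name": once a repo earlier in the list provides
any rpm built from a given srpm, later repos contribute nothing from that
srpm, and blocked names are excluded everywhere.  Repo names and package
versions are made up.

```python
def merge_first_match_wins(repos, blocked=()):
    """Merge repos given as an ordered list of
    (repo_name, [(srpm_name, rpm_filename), ...]) pairs."""
    claimed = set(blocked)  # srpm names no later repo may contribute
    merged = []
    for repo_name, packages in repos:
        names_here = set()
        for srpm_name, rpm in packages:
            if srpm_name in claimed:
                continue
            names_here.add(srpm_name)
            merged.append((repo_name, rpm))
        # Claim names only after the whole repo is processed, so every
        # subpackage of an srpm within the same repo is kept.
        claimed |= names_here
    return merged


repos = [
    ("koji", [("bash", "bash-3.2-99.i386.rpm")]),
    ("f9-updates", [("bash", "bash-3.2-23.i386.rpm"),
                    ("glibc", "glibc-2.8-8.i386.rpm"),
                    ("glibc", "glibc-devel-2.8-8.i386.rpm")]),
    ("f9-ga", [("glibc", "glibc-2.8-3.i386.rpm"),
               ("zsh", "zsh-4.3.4-7.i386.rpm"),
               ("kernel", "kernel-2.6.25-14.i386.rpm")]),
]
for origin, rpm in merge_first_match_wins(repos, blocked={"kernel"}):
    print(origin, rpm)
```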

> >>> However, we will already be parsing the remote repodata, which contains
> >>> information like the srpm name for each rpm, so we could do something
> >>> more sophisticated here.
> >> -snipsnip-
> >> ...
> >>> The repomerge tool seems like it solves the problem better, and would be
> >>> more useful in general.
> >> If we're going to have our fingers in the repodata, we'll probably want 
> >> to have them in the merge too. Perhaps we can get createrepo and/or this 
> >> repomerge tool usefully libified?
> > 
> > I was thinking we would probably just call out to the tool the way we do
> > for createrepo, but I'm certainly not against using an API.  I'm a
> > little concerned about memory usage when doing the create/mergerepo
> > in-process, since we know python and mod_python have garbage-collection
> > issues, but that may be a "cross the bridge when we come to it" problem.
> > Seth, is it feasible to provide an API to mergerepo that we could use
> > directly?
> 
> I don't think I even saw a reply from Seth on this. Where does the 
> mergerepo code stand now?

Seth has written it; I just need to test it.  The tool currently has
command-line flags for everything we need (I believe), but we could also
use it as an example of how to call the API directly.