[katello-devel] Content views operations optimization

Fri Jun 14 14:29:15 UTC 2013

----- Original Message -----
> On 06/14/2013 05:56 AM, Ivan Necas wrote:
> > Hi,
> >
> > I've started using content views quite heavily recently, and waiting for
> > various operations to finish made me wonder whether the waiting is really
> > necessary for most of the operations.
> >
> > I might be wrong, but it seems to me that the only expensive operation
> > should be publishing a new version of a content view: at that point
> > new repositories are created with content that did not exist anywhere before.
> >
> > Other operations with content views don't modify any content:
> >
> > Content view promotion
> > ----------------------
> > We basically just copy existing repos without changing their content,
> > therefore:
> >
> > * computing metadata is useless, as we have the very same metadata
> >    already in the original repositories of the content view version
> I can't really speak to this as this is done in pulp, but the only
> situation where you could simply 'copy' the metadata would be if the
> destination repo was empty and we were copying everything with a single
> call.  There is no linkage between two repos in pulp except during a
> copy operation, so pulp wouldn't necessarily know that two repos are
> exactly the same unless the above occurred or it checked it against all
> repos.  Due to performance reasons we have to copy units individually
> (rpm, errata, etc...) and for rpms specify distinct fields for the copy
> operation.

Note I'm not talking about repositories in general, but about repositories that
are part of content views.

I don't completely agree that this is possible only when the destination
repo is empty. When I promote a content view, I want the whole content view from
the source environment to reach the target environment, regardless of what was
in the target environment before (as I understand it, a content view is a
snapshot that is distributed across environments).

In other words, is there a situation where promoting a CV from EnvA to EnvB
results in the CV looking different in EnvA than in EnvB right after the
promotion?

> 
> > * indexing the repositories is useless as it should be just the same as the
> >    index for the original repositories of the content view version
> It's not the same :)  We use the repoids field on packages & errata to
> make searching useful.  Without it we won't know what packages are in
> what repos.  We potentially could investigate fetching the data from
> Elastic Search, modifying the repoid list ourselves and updating it
> within elasticsearch.  I'm not sure if that would be faster or not.  My
> guess is that it might be faster (simply because fetching data via ES is
> faster) but we would have to test it to see.

What if I put it this way: I have indexed CV1version1 in library. Promoting
the content view just means that CV1version1 is made available in
the next environment. Again, is there a situation where one version of a CV would
have different content? If yes, it's not a version; if no, then it doesn't make
sense to index the same data twice when the same result can be achieved simply
by updating the query.
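
To make the idea concrete, here is a minimal sketch of promotion as an in-place
update of the repo id list, using a plain Python dict as a stand-in for the
search index. Only the `repoids` field comes from the description above; the
function names, the document layout and the repo ids are hypothetical, and
nothing here calls Elasticsearch, Pulp or Katello.

```python
# Stand-in for the search index: package id -> indexed document.
# Only the "repoids" field name comes from the thread; the rest is assumed.
index = {
    "pkg-1": {"name": "bash", "repoids": ["cv1-v1-library"]},
    "pkg-2": {"name": "zsh", "repoids": ["cv1-v1-library"]},
    "pkg-3": {"name": "vim", "repoids": ["other-repo"]},
}


def promote_by_update(index, source_repo_id, target_repo_id):
    """Add target_repo_id to every document already indexed for source_repo_id.

    No package data is fetched or re-indexed; only the repoids list changes.
    Returns the number of documents that were modified.
    """
    updated = 0
    for doc in index.values():
        repoids = doc["repoids"]
        if source_repo_id in repoids and target_repo_id not in repoids:
            repoids.append(target_repo_id)
            updated += 1
    return updated


def packages_in_repo(index, repo_id):
    """Search by repo id, which is what the repoids field is used for."""
    return sorted(pid for pid, doc in index.items() if repo_id in doc["repoids"])


# Promote CV1 version 1 from library to the next environment.
count = promote_by_update(index, "cv1-v1-library", "cv1-v1-dev")
promoted = packages_in_repo(index, "cv1-v1-dev")
```

Whether this is actually faster than a full re-index in the real system is
exactly the thing that would have to be measured.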

> 
> Another option might be to use the update feature of ES
> http://www.elasticsearch.org/guide/reference/api/update/   My guess is
> that would be much much faster.
> 
> >
> > Composite content view publishing
> > ---------------------------------
> >
> > I wonder what operations really need to be performed here? It seems to me
> > that it just references the sub-content-views, not bearing any info about
> > the content itself (guessing from the fact that promotion of a composite
> > means promoting the sub-content-views). Still, it takes 10 minutes to
> > publish it with real-world repos (RHEL, EPEL, katello).
> 
> Due to the way subscription-manager & candlepin work, we are unable to
> point a system to repos in two different candlepin environments.  A
> system can only know a) Organization, b) Candlepin Environment, c)
> Content path and it assembles the url from that:
> 
>    http://HOSTNAME/pulp/repos/ORG/CP_Environment/Content_Path
> 
> the Candlepin Environment in this case is the Katello Environment &
> Content View Combination.  So a system cannot point to
> /ACME/Dev/View1/ContentA and /ACME/Dev/View2/ContentB at the same time.
> If it could we probably could get away without content views.  So we
> compromised and did composite views.   One option that we could
> investigate would be to reuse a pulp repo within pulp for the composite
> and its component, and just publish via 2 yum distributors to two
> different paths.   That may complicate the code greatly, but could be
> worth investigation.  It would still require a publish to occur twice
> though, so you are really only saving the repo creation and unit copy
> aspects.
> 
> 
> I do agree that it takes far too long to publish/promote a content view
> with a very large repo and we & the pulp team should try to make it
> faster :)

I'm not against optimizations on the Pulp side at all. But it seems to me
that changing the way we do this on the Katello side would free the Pulp
server to do only the work it actually has to do.
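
For reference, the URL constraint quoted above can be sketched as follows. The
URL layout and the ACME/Dev/View1 example are taken from the quoted text; the
function name and the dict describing a system are assumptions for illustration
only, not subscription-manager or Candlepin code.

```python
def content_url(hostname, org, cp_environment, content_path):
    """Assemble a repo URL from the three things a system knows."""
    parts = [org, cp_environment, content_path]
    path = "/".join(p.strip("/") for p in parts)
    return "http://%s/pulp/repos/%s" % (hostname, path)


# A system is registered to exactly one org and one Candlepin environment,
# which here stands for the Katello environment and content view combination.
system = {"org": "ACME", "cp_environment": "Dev/View1"}

urls = [
    content_url("HOSTNAME", system["org"], system["cp_environment"], path)
    for path in ("ContentA", "ContentB")
]
```

Both URLs end up under Dev/View1. With a single environment value per system
there is no way to reach /ACME/Dev/View2/ContentB at the same time, which is
the gap composite views fill.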

> 
> -Justin
> 
> 
> >
> >
> > Am I missing something here, or are we really able to reduce the metadata
> > calculation and indexing to the content view publish phase, so that the
> > rest is really just about copying symlinks (which could also be optimized
> > heavily by teaching Pulp how to create a repository simply by symlinking
> > another one)?
> >
> > -- Ivan
> >
> > _______________________________________________
> > katello-devel mailing list
> > katello-devel at redhat.com
> > https://www.redhat.com/mailman/listinfo/katello-devel