[katello-devel] Content views operations optimization

Jay Dobies jason.dobies at redhat.com
Fri Jun 14 18:13:28 UTC 2013


I just had a thought on this. Katello could write a pulp distributor 
plugin that takes a repo ID at publish time and simply copies the 
published repo on disk. I still recommend against this approach and 
wouldn't include it in pulp itself, but it is one of the benefits of the 
plugin design. Writing something like that, including the Katello calls 
to publish to that distributor in certain cases, is probably less than a 
day of work.
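For illustration, the core of what such a distributor's publish step would do is just a directory copy of an already-published tree. This is a minimal sketch; the function name and layout are assumptions, and a real Pulp distributor would wire this into the plugin API's publish hook rather than call it directly:

```python
import os
import shutil


def publish_by_copy(source_publish_dir, dest_publish_dir):
    """Copy an already-published repo tree on disk instead of
    regenerating metadata; assumes the source publish has finished."""
    if os.path.exists(dest_publish_dir):
        shutil.rmtree(dest_publish_dir)  # replace any previous publish
    shutil.copytree(source_publish_dir, dest_publish_dir)
    return dest_publish_dir
```

The metadata is byte-identical to the source repo's, which is exactly why recomputing it on promote is wasted work.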

On 06/14/2013 11:11 AM, Justin Sherrill wrote:
> On 06/14/2013 10:29 AM, Ivan Necas wrote:
>>
>> ----- Original Message -----
>>> On 06/14/2013 05:56 AM, Ivan Necas wrote:
>>>> Hi,
>>>>
>>>> I've started using content views quite heavily recently, and waiting
>>>> for various operations to finish made me wonder whether the waiting is
>>>> really necessary for most of the operations.
>>>>
>>>> I might be wrong, but it seems to me that the only expensive operation
>>>> should be publishing a new version of a content view: that is when new
>>>> repositories are created with content that did not exist anywhere
>>>> before.

>>>>
>>>> Other operations with content views don't modify any content:
>>>>
>>>> Content view promotion
>>>> ----------------------
>>>> We basically just copy existing repos without changing their content,
>>>> therefore:
>>>>
>>>> * computing metadata is useless, as we have the very same metadata
>>>>     already in the original repositories of the content view version
>>> I can't really speak to this since it's done in pulp, but the only
>>> situation where you could simply 'copy' the metadata would be if the
>>> destination repo was empty and we were copying everything with a single
>>> call.  There is no linkage between two repos in pulp except during a
>>> copy operation, so pulp wouldn't necessarily know that two repos are
>>> exactly the same unless the above occurred or it checked against all
>>> repos.  For performance reasons we have to copy units individually
>>> (rpm, errata, etc...) and, for rpms, specify distinct fields for the
>>> copy operation.
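As a sketch of what that per-type copy looks like, the helper below builds the path and body of a unit-association call in the shape of Pulp v2's REST API. The endpoint shape follows Pulp 2's repository association API, but the repo ids and the exact field list are illustrative assumptions:

```python
def rpm_copy_request(source_repo_id, dest_repo_id):
    """Build the POST path and body for a per-type unit copy; for RPMs
    a limited field list is requested to keep the copy fast."""
    path = "/pulp/api/v2/repositories/%s/actions/associate/" % dest_repo_id
    body = {
        "source_repo_id": source_repo_id,
        "criteria": {
            # one call per unit type: rpm, erratum, etc.
            "type_ids": ["rpm"],
            # illustrative: fetch only the unit fields needed to associate
            "fields": {"unit": ["name", "version", "release",
                                "arch", "checksum"]},
        },
    }
    return path, body
```

One such call per unit type is what makes a promote several requests rather than a single "clone this repo" operation.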
>> Note I'm not talking about repositories in general, but about
>> repositories that are part of content views.
>>
>> I don't completely agree that this is possible only when the destination
>> repo is empty. When I promote a content view, I want the whole content
>> view from the source environment to get to the target environment,
>> without caring what was in the target environment before (if I
>> understand correctly, a content view is a snapshot being distributed
>> across environments).
> What you say is correct, however it somewhat ignores how pulp works
> and what pulp makes available to us.  We cannot tell pulp to just pull
> the repodata from one repo to another; that's just not how it works :)
> We don't have the option of doing that today.
>
>> In other words, is there a situation where promoting a CV from EnvA to
>> EnvB will lead to the CV looking different in EnvA than in EnvB right
>> after the promotion?
>
> The answer is No, however we still have to perform a unit_copy and tell
> pulp to publish the repo.  There isn't any other way to do it.
>
>
>>
>>>> * indexing the repositories is useless as it should be just the same
>>>> as the
>>>>     index for the original repositories of the content view version
>>> It's not the same :)  We use the repoids field on packages & errata to
>>> make searching useful.  Without it we won't know what packages are in
>>> what repos.  We could potentially investigate fetching the data from
>>> Elastic Search, modifying the repoid list ourselves and updating it
>>> within elasticsearch.  I'm not sure whether that would be faster.  My
>>> guess is that it might be (simply because fetching data via ES is
>>> fast), but we would have to test it to see.
>> What if I put it this way: I have indexed CV1version1 in Library.
>> Promoting the content view just means that CV1version1 is made available
>> in the next environment. Again, is there a situation where one version
>> of a CV would have different content? If yes, it's not a version; if no,
>> then it doesn't make sense to perform indexing on the same data twice
>> when the same result could be achieved simply by updating the query.
>
> So you're proposing that instead of recording a package's list of
> repo_ids, we record only its original repo and a list of content view
> versions that the package is in?  It's not a bad idea, but I'm not sure
> it's worth the complexity and effort to change over to that method.  The
> performance difference between not indexing on promote and using a bulk
> update call probably wouldn't be all that large.
>
>>
>>> Another option might be to use the update feature of ES:
>>> http://www.elasticsearch.org/guide/reference/api/update/   My guess is
>>> that would be much, much faster.
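A sketch of what that ES partial update would carry: the body below appends a repo id to a package document's existing repoids array server-side, in the style of the scripted update API at the link above. The field and parameter names here are illustrative assumptions:

```python
def append_repoid_update(repo_id):
    """Build an Elasticsearch partial-update body that appends a repo id
    to a document's existing repoids array without refetching it."""
    return {
        # server-side script append; avoids a full reindex of the document
        "script": "ctx._source.repoids += repo_id",
        "params": {"repo_id": repo_id},
    }
```

Each package document would get one such update per promote instead of being reindexed from scratch.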
>>>
>>>> Composite content view publishing
>>>> ---------------------------------
>>>>
>>>> I wonder what operations really need to be performed here? It seems to
>>>> me that it just references the sub-content-views, not bearing any info
>>>> about the content itself (guessing from the fact that promoting a
>>>> composite means promoting the sub-content-views). Still, it takes 10
>>>> minutes to publish one with real-world repos (RHEL, EPEL, katello).
>>> Due to the way subscription-manager & candlepin work, we are unable to
>>> point a system to repos in two different candlepin environments.  A
>>> system can only know a) Organization, b) Candlepin Environment, c)
>>> Content path and it assembles the url from that:
>>>
>>>     http://HOSTNAME/pulp/repos/ORG/CP_Environment/Content_Path
>>>
>>> The Candlepin Environment in this case is the Katello Environment &
>>> Content View combination.  So a system cannot point to
>>> /ACME/Dev/View1/ContentA and /ACME/Dev/View2/ContentB at the same time.
>>> If it could, we probably could get away without composite content
>>> views.  So we compromised and did composite views.  One option we could
>>> investigate would be to reuse a single pulp repo for both the composite
>>> and its component, and just publish via two yum distributors to two
>>> different paths.  That may complicate the code greatly, but could be
>>> worth investigating.  It would still require a publish to occur twice,
>>> though, so you are really only saving the repo creation and unit copy
>>> aspects.
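The URL assembly described above can be sketched as a single string join; the function name is hypothetical, and in reality subscription-manager derives the pieces from the entitlement certificate:

```python
def content_url(hostname, org, cp_environment, content_path):
    """Assemble the repo URL a system uses: the org, the candlepin
    environment (the Katello environment/content-view pair), and the
    content path from the subscription."""
    return "http://%s/pulp/repos/%s/%s/%s" % (
        hostname, org, cp_environment.strip("/"), content_path.strip("/"))
```

Because the candlepin environment is a single path segment per system, there is no way to express two different views in one URL scheme, which is the constraint that forced composite views.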
>>>
>>>
>>> I do agree that it takes far too long to publish/promote a content
>>> view with a very large repo, and we & the pulp team should try to make
>>> it faster :)
>> I'm not against optimizations on the Pulp side at all. But it seems to
>> me that changing the way we do it on the Katello side would leave the
>> Pulp server free to do the work it actually has to do.
> Aside from indexing the content differently, there is very little we can
> do without changes to pulp.
>
> -Justin
>
>>
>>> -Justin
>>>
>>>
>>>>
>>>> Am I missing something here, or are we really able to limit metadata
>>>> calculation and indexing to the content view publish phase, with the
>>>> rest being really just about copying symlinks (which could also be
>>>> optimized heavily by teaching Pulp to create a repository simply by
>>>> symlinking another one)?
>>>>
>>>> -- Ivan
>>>>
>>>> _______________________________________________
>>>> katello-devel mailing list
>>>> katello-devel at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/katello-devel
>


-- 
Jay Dobies
Freenode: jdob @ #pulp
http://pulpproject.org
