[Pulp-list] Data model change: repository["packages"] to list of package ids only

Jeff Ortel jortel at redhat.com
Mon Jan 31 19:29:19 UTC 2011



On 01/31/2011 08:47 AM, Pradeep Kilambi wrote:
> On 01/31/2011 07:49 AM, John Matthews wrote:
>> I made changes to how we store packages under a repository document. If there are no
>> major objections I plan to check this in later today.
>>
>> Pradeep and I have noticed a large performance issue when calling "_get_existing_repo()"
>> from repo.py. For rhel-i386-server-5 this takes roughly 30 seconds to fetch information
>> on 7k packages resulting in a 10MB repository document being returned. For Fedora 13
>> the document is even larger, and fetching roughly 20k packages takes ~90 seconds.
>>
>> The issue is that we store a dictionary of "packages" under the repository. The
>> dictionary has a key of package id and a value of the full package object. (Technically
>> in mongo a reference to the package object is stored, not the full object. When we fetch
>> the repository through pymongo the AutoReference SON Manipulator fetches the contents of
>> each package object.) This makes fetching large repos very expensive. Further, pulp
>> relies on "_get_existing_repo()" in many places so this is a problem that will be seen
>> often for large repos.
>>
>> Over the weekend I changed how we store "packages": instead of a dictionary of package
>> objects, we now store only the package ids in a list.
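
[Illustration] The change described above can be sketched with plain dictionaries. This is only a rough sketch: every field name other than "packages", the sample ids, and the helper to_id_list() are made up for illustration and are not taken from the Pulp code.

```python
# Old shape (hypothetical sample data): "packages" maps package id -> full object.
old_repo = {
    "id": "example-repo",
    "packages": {
        "pkg-1": {"id": "pkg-1", "name": "bash", "version": "3.2"},
        "pkg-2": {"id": "pkg-2", "name": "zlib", "version": "1.2.3"},
    },
}


def to_id_list(repo):
    """Return a copy of the repo document with "packages" reduced to a list of ids."""
    new_repo = dict(repo)  # shallow copy; the original document is left untouched
    new_repo["packages"] = sorted(repo["packages"].keys())
    return new_repo


# New shape: "packages" is just a list of package ids.
new_repo = to_id_list(old_repo)
```

The repository document then grows with the number of ids rather than with the size of every package object it references.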
>>
>> "_get_existing_repo()" is much quicker as you can see:
>> For rhel-i386-server-5<only package ids>:
>> Time: .2 seconds versus ~30 seconds
>> Size: 1.5MB versus 10MB
>>
>> For fedora 13<only package ids>:
>> Time: .3 seconds versus ~90 seconds
>> Size: 2.5 MB versus 24MB
>>
>> Fetching a repository object will now yield only "package ids" under
>> "packages".
>> If we want to flesh out all of the package objects as the call was previously doing, we
>> can make a second call to the PackageAPI. This is still much quicker than previous
>> behavior.
>> For rhel-i386-server-5<full package objects>:
>> Time: ~3 seconds versus ~30 seconds
>> Size: 10MB and 10MB
>>
>> For fedora 13<full package objects>:
>> Time: ~7 seconds versus ~90 seconds
>> Size: 24MB and 24MB
>>
>>
>> Developers need to be aware repo["packages"] will contain only package ids. It takes
>> one extra call to flesh out the "packages" into their full objects, so if that's needed
>> it's easy and not as expensive with the new approach.
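
[Illustration] The "one extra call" pattern might look roughly like the sketch below. FakePackageStore, packages_by_ids() and flesh_out_packages() are hypothetical stand-ins so the example runs without a database; they are not the real PackageAPI.

```python
class FakePackageStore:
    """In-memory stand-in for a package collection (hypothetical, not Pulp's PackageAPI)."""

    def __init__(self, packages):
        self._by_id = {p["id"]: p for p in packages}

    def packages_by_ids(self, ids):
        # One batched lookup for all ids, instead of one dereference per package.
        return [self._by_id[i] for i in ids if i in self._by_id]


def flesh_out_packages(repo, store):
    """Return {package id: full package object} for the ids in repo["packages"]."""
    full = store.packages_by_ids(repo["packages"])
    return {p["id"]: p for p in full}


store = FakePackageStore([
    {"id": "pkg-1", "name": "bash"},
    {"id": "pkg-2", "name": "zlib"},
    {"id": "pkg-3", "name": "unrelated"},
])
repo = {"id": "example-repo", "packages": ["pkg-1", "pkg-2"]}
packages = flesh_out_packages(repo, store)
```

Callers that only need the ids skip the second step entirely. Against MongoDB the batched lookup would presumably be a single query with an `$in` filter on the package id field, though the exact field name is an assumption here.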
>>
>> I've made most of the changes needed for this; if there are no major objections, I plan
>> to check this in today.
>
> +1. Storing packages in a repo isn't worth the amount of data we load into memory each
> time a repo object is retrieved. I've seen how painful it is with large repos.

Agreed.

>
> ~ Prad
>
> _______________________________________________
> Pulp-list mailing list
> Pulp-list at redhat.com
> https://www.redhat.com/mailman/listinfo/pulp-list



