
Re: [Pulp-list] Data model change: repository["packages"] to list of package ids only





On 01/31/2011 08:47 AM, Pradeep Kilambi wrote:
On 01/31/2011 07:49 AM, John Matthews wrote:
I made changes to how we store packages under a repository document. If there are no
major objections I plan to check this in later today.

Pradeep and I have noticed a large performance issue when calling "_get_existing_repo()"
from repo.py. For rhel-i386-server-5 it takes roughly 30 seconds to fetch information
on 7k packages, and the repository document returned is 10MB. For Fedora 13 the document
is even larger: it takes about 90 seconds to fetch roughly 20k packages.

The issue is that we store a dictionary of "packages" under the repository. The
dictionary maps each package id to the full package object. (Technically, mongo stores
a reference to the package object, not the full object; when we fetch the repository
through pymongo, the AutoReference SON Manipulator fetches the contents of each package
object.) This makes fetching large repos very expensive. Furthermore, pulp relies on
"_get_existing_repo()" in many places, so this problem will be seen often for large
repos.

Over the weekend I changed how we store "packages": instead of a dictionary of packages,
we now store only the package ids, in a list.
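To make the change in shape concrete, here is a minimal sketch using plain Python dicts in place of real mongo documents. The field names other than "packages", the sample package ids, and the helper to_id_list() are all hypothetical and are not taken from the pulp code.

```python
# Hypothetical, simplified repository document in the OLD shape:
# "packages" maps each package id to the full package object.
old_repo = {
    "id": "rhel-i386-server-5",
    "packages": {
        "pkg-1": {"id": "pkg-1", "name": "bash", "version": "3.2"},
        "pkg-2": {"id": "pkg-2", "name": "zlib", "version": "1.2.3"},
    },
}


def to_id_list(repo):
    """Return a copy of repo whose "packages" field holds only package ids.

    Illustrative helper only; the input document is left unmodified.
    """
    slim = dict(repo)
    slim["packages"] = sorted(repo["packages"].keys())
    return slim


# NEW shape: "packages" is just a list of ids, so the document stays small.
new_repo = to_id_list(old_repo)
```

With this shape, fetching the repository no longer drags every package object along with it.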

"_get_existing_repo()" is much quicker, as you can see:
For rhel-i386-server-5 <only package ids>:
Time: 0.2 seconds versus ~30 seconds
Size: 1.5MB versus 10MB

For Fedora 13 <only package ids>:
Time: 0.3 seconds versus ~90 seconds
Size: 2.5MB versus 24MB

Fetching a repository object will now yield only package ids under "packages".
If we want to flesh out all of the package objects, as the call did previously, we
can make a second call to the PackageAPI. This is still much quicker than the previous
behavior.
For rhel-i386-server-5 <full package objects>:
Time: ~3 seconds versus ~30 seconds
Size: 10MB versus 10MB

For Fedora 13 <full package objects>:
Time: ~7 seconds versus ~90 seconds
Size: 24MB versus 24MB


Developers need to be aware that repo["packages"] will contain only package ids. It
takes one extra call to flesh out the "packages" into their full objects, so when that's
needed it's easy, and less expensive than before.
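As a rough illustration of that extra step, the sketch below resolves a list of package ids into full objects. It uses an in-memory dict as a stand-in for the package collection; flesh_out_packages() and package_store are hypothetical names, not the actual PackageAPI.

```python
def flesh_out_packages(repo, package_store):
    """Resolve the ids in repo["packages"] into full package objects.

    package_store is a stand-in (a plain dict keyed by package id) for
    whatever the real PackageAPI queries. Ids missing from the store are
    skipped rather than raising.
    """
    return [package_store[pid] for pid in repo["packages"] if pid in package_store]


# Hypothetical sample data.
package_store = {
    "pkg-1": {"id": "pkg-1", "name": "bash"},
    "pkg-2": {"id": "pkg-2", "name": "zlib"},
}
repo = {"id": "example-repo", "packages": ["pkg-1", "pkg-2", "pkg-missing"]}

full_packages = flesh_out_packages(repo, package_store)
```

Callers that only need ids or counts can skip this step entirely. Against mongo, the analogous lookup would presumably be one batched query over the id list rather than one fetch per package.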

I've made most of the changes needed for this. If there are no major objections, I plan
to check this in today.

+1. Storing packages in a repo isn't worth the amount of data we load into memory each
time a repo object is retrieved. I've seen how painful it is with large repos.

Agreed.


~ Prad


