
Re: [Pulp-list] Data model change: repository["packages"] to list of package ids only




----- Original Message -----
> On 01/31/2011 07:49 AM, John Matthews wrote:
> > I made changes to how we store packages under a repository document.
> > If there are no major objections I plan to check this in later
> > today.
> >
> > Pradeep and I have noticed a large performance issue when calling
> > "_get_existing_repo()" from repo.py. For rhel-i386-server-5 this
> > takes roughly 30 seconds to fetch information on 7k packages
> > resulting in a 10MB repository document being returned. For Fedora
> > 13 this is even larger and takes roughly 90 seconds to fetch
> > somewhere around 20k packages.
> >
> > The issue is that we store a dictionary of "packages" under the
> > repository. The dictionary has a key of package id and a value of
> > the full package object. (Technically in mongo a reference to the
> > package object is stored, not the full object. When we fetch the
> > repository through pymongo the AutoReference SON Manipulator fetches
> > the contents of each package object). This results in large repos
> > being very expensive. Further, pulp relies on "_get_existing_repo()"
> > in many places so this is a problem that will be seen often for
> > large repos.
> >
> > Over the weekend I changed how we store "packages". Instead of a
> > dictionary of package objects, we now store only the package ids in
> > a list.
> >
> > "_get_existing_repo()" is much quicker as you can see:
> >   For rhel-i386-server-5 <only package ids>:
> >    Time: 0.2 seconds versus ~30 seconds
> >    Size: 1.5MB versus 10MB
> >
> >   For fedora 13 <only package ids>:
> >    Time: 0.3 seconds versus ~90 seconds
> >    Size: 2.5MB versus 24MB
> >
> > The result of fetching a repository object now will only yield
> > "package ids" under "packages".
> > If we want to flesh out all of the package objects as the call was
> > previously doing, we can make a second call to the PackageAPI. This
> > is still much quicker than previous behavior.
> >   For rhel-i386-server-5 <full package objects>:
> >    Time: ~3 seconds versus ~30 seconds
> >    Size: 10MB versus 10MB (unchanged)
> >
> >   For fedora 13 <full package objects>:
> >    Time: ~7 seconds versus ~90 seconds
> >    Size: 24MB versus 24MB (unchanged)
> >
> >
> > Developers need to be aware that repo["packages"] will only contain
> > package ids. It takes one extra call to flesh out the "packages"
> > into their full objects, so if that's needed it's easy and not as
> > expensive with the new approach.
> >
> > I've made most of the changes needed for this. If there are no major
> > objections I plan to check this in today.
> 
> +1. Storing packages in a repo isn't worth the amount of data we load
> into memory each time a repo object is retrieved. I've seen how painful
> it is with large repos.
> 
> ~ Prad
> 
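
To make the cost described in the quoted message concrete, here is a rough sketch using plain Python dicts in place of MongoDB. It assumes each package reference is resolved by its own lookup when the repository document is loaded, which is my reading of the AutoReference behaviour described above. Every name in it (CountingPackageStore, load_repo_old_style, load_repo_new_style) is made up for illustration and is not pulp code.

```python
# Illustrative stand-in only: plain dicts instead of MongoDB collections.
# All names here are hypothetical, not pulp's actual code.

class CountingPackageStore:
    """Fake 'packages' collection that counts how many lookups it serves."""

    def __init__(self, packages):
        self._by_id = {p["id"]: p for p in packages}
        self.lookups = 0

    def find_one(self, package_id):
        # One lookup per package, like resolving a single reference.
        self.lookups += 1
        return self._by_id[package_id]

    def find_many(self, package_ids):
        # One lookup for the whole batch of ids.
        self.lookups += 1
        return [self._by_id[i] for i in package_ids]


def load_repo_old_style(repo_doc, store):
    """Old layout: loading the repo resolves every package reference."""
    return {
        "id": repo_doc["id"],
        "packages": {pid: store.find_one(pid) for pid in repo_doc["packages"]},
    }


def load_repo_new_style(repo_doc):
    """New layout: the repo document carries only the package ids."""
    return {"id": repo_doc["id"], "packages": list(repo_doc["packages"])}


packages = [{"id": "pkg-%d" % n, "name": "example-%d" % n} for n in range(1000)]
store = CountingPackageStore(packages)
repo_doc = {"id": "example-repo", "packages": [p["id"] for p in packages]}

old = load_repo_old_style(repo_doc, store)
per_reference_lookups = store.lookups   # one lookup per package

store.lookups = 0
new = load_repo_new_style(repo_doc)
ids_only_lookups = store.lookups        # no package lookups at all

store.lookups = 0
full = store.find_many(new["packages"])
batch_lookups = store.lookups           # a single batched lookup
```

The lookup counts only show the shape of the problem. The timings that matter are the measured ones quoted above.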

I'll push this to master tomorrow; I need a little more time to finish.
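
For callers that still need full package objects, the adjustment might look roughly like the sketch below. The thread doesn't name the PackageAPI method, so fetch_packages is a placeholder callable (list of ids in, list of package dicts out), and the in-memory data exists only so the example runs without a database.

```python
# Sketch only. `fetch_packages` stands in for whatever PackageAPI call
# returns full package objects for a list of ids; its name and signature
# are assumptions, not pulp's actual API.

def flesh_out_packages(repo, fetch_packages):
    """Return {package_id: package_object} for the ids in repo["packages"]."""
    package_ids = repo["packages"]          # new layout: a list of ids
    packages = fetch_packages(package_ids)  # the one extra call
    return {pkg["id"]: pkg for pkg in packages}


# In-memory stand-in so the sketch runs without a database.
_FAKE_PACKAGES = {
    "id-1": {"id": "id-1", "name": "bash"},
    "id-2": {"id": "id-2", "name": "zlib"},
}


def fake_fetch_packages(package_ids):
    return [_FAKE_PACKAGES[pid] for pid in package_ids]


repo = {"id": "example-repo", "packages": ["id-1", "id-2"]}

# Old-style code could read repo["packages"]["id-1"]["name"] directly.
# With ids only, resolve the objects first:
packages_by_id = flesh_out_packages(repo, fake_fetch_packages)
names = sorted(pkg["name"] for pkg in packages_by_id.values())
```

Returning a dict keyed by package id keeps the result in the same shape the old repo["packages"] had, so code that indexed by id should need only the extra call.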




