[Pulp-list] Importer Sync APIs

Jay Dobies jason.dobies at redhat.com
Tue Nov 22 14:07:31 UTC 2011


> 1. The new sync log API looks pretty good. What I'll do is set up my
> sync commands to log to a file on disk (since some of them run in a different
> process), then when everything is done, read that file and pass the
> contents back in the final report.
>
> However, it would be nice to be able to store a "stats" mapping in
> addition to the raw log data.

Interesting. What sorts of things do you see yourself using it for?
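
For context, here's roughly how I picture the flow you're describing, 
with a made-up "stats" mapping tacked onto the final report. None of the 
report-building calls here are committed API, just the shape of it:

    import logging

    def sync_repo(repo, sync_conduit, config):
        # Sync commands log to a file on disk since some run in a
        # separate process.
        log_path = "/tmp/sync-%s.log" % repo.id     # illustrative path only
        handler = logging.FileHandler(log_path)
        log = logging.getLogger("importer.sync")
        log.addHandler(handler)
        try:
            log.info("starting sync")
            # ... download and add units here ...
        finally:
            log.removeHandler(handler)
            handler.close()

        # When everything is done, read the file back and return its
        # contents in the final report, alongside the hypothetical stats.
        log_text = open(log_path).read()
        stats = {"new_units": 10, "updated_units": 2}
        return sync_conduit.build_report(summary=stats, details=log_text)  # hypothetical method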

> 2. I *think* the 'working directory' API is the
> 'get_repo_storage_directory()' call on the conduit. However, I'm not
> entirely clear on that, nor what the benefits are over using Python's
> own tempfile module (although that may be an artefact of the requirement
> for 2.4 compatibility in Pulp - with 2.5+, the combination of context
> managers, tempfile.mkdtemp() and shutil.rmtree() means that cleaning up
> temporary directories is a *lot* easier than it used to be)

This one came out of the way we sync RPMs. I forget the exact details, 
but when I spoke with the guys on our team, they said it's easier on 
them if they can assemble the repo as part of the sync. The idea behind 
a working directory rather than a temp directory is that we can carry 
that state over from sync to sync.

To a lesser extent, this is also some paranoia on my part. Not that I 
can stop a plugin from writing to a temp directory, but I'd like to push 
a model where we can tell a user exactly where all Pulp-related stuff 
lives. If plugins use the working directories, which fall under the Pulp 
parent directory, it feels cleaner: running Pulp isn't throwing things 
all over the filesystem.
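
To make the distinction concrete, here's a rough sketch of the two 
approaches. get_repo_storage_directory() is the conduit call mentioned 
above, though the exact signature I show is illustrative:

    import os
    import shutil
    import tempfile

    def sync_using_working_dir(repo, conduit):
        # The working directory lives under the Pulp parent directory and
        # persists, so state (e.g. a partially assembled repo) carries
        # over from sync to sync.
        working_dir = conduit.get_repo_storage_directory(repo.id)
        metadata_dir = os.path.join(working_dir, "repodata")
        if not os.path.exists(metadata_dir):
            os.makedirs(metadata_dir)
        # ... assemble the repo in place ...

    def sync_using_temp_dir(repo, conduit):
        # A temp directory is easy enough to clean up, but nothing
        # survives to the next sync and the files land outside Pulp's
        # directory tree.
        tmp_dir = tempfile.mkdtemp()
        try:
            pass  # ... do the work ...
        finally:
            shutil.rmtree(tmp_dir)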

> 3. The "request_unit_filename" and "add_or_update_content_unit" APIs
> seem oddly asymmetrical, and the "get_unit_keys_for_repo" naming
> includes quite a bit of redundancy.

This is the same reason there's a lack of transfer objects in the first 
round of these APIs. I wrote the method signatures out longhand in the 
wiki pages I used to design them so that it was really clear from each 
signature what the call did and what data it had access to in order to 
do it. I figured that made it easier for people to grok when reading them.

Those signatures ended up translating directly into code, which makes 
for some wonky-looking APIs. Now that things are evolving and we're 
liking the approach, it's getting cleaned up.

> To be both consistent, flexible and efficient, I suggest an API based
> around a "ContentUnitData" class with the following attributes:
> - type_id
> - unit_id (may be None when defining a new unit to be added to Pulp)
> - key_data
> - other_data
> - storage_path (may be None if no bits are stored for the content type -
> perhaps whether or not bits are stored should be part of the content
> type definition?)

+1 to this, it fits in with the way the APIs are going.
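
Reading that back, I picture the transfer object as something like the 
following; just a sketch of the shape, not final code:

    class ContentUnitData(object):
        """Transfer object describing a single content unit."""

        def __init__(self, type_id, unit_id, key_data, other_data,
                     storage_path):
            self.type_id = type_id            # content type, e.g. "rpm"
            self.unit_id = unit_id            # None until the unit is saved
            self.key_data = key_data          # dict of unit key fields
            self.other_data = other_data      # dict of remaining metadata
            self.storage_path = storage_path  # None if no bits are stored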

> The content management API itself could then look like:
>
> - get_units() -> two level mapping {type_id: {unit_id: ContentUnitData}}
> Replacement for get_unit_keys_for_repo()
> Note that if you're concerned about exposing 'unit_id', the existing
> APIs already exposed it as the return value from
> 'add_or_update_content_unit'.
> I think you're right to avoid exposing a "single lookup" API, at least
> initially - that's a performance problem waiting to happen.

My concern wasn't actually about exposing the unit_id so much as about 
loading too much data. "Get unit keys" was very specific for that 
reason; the assumption was that the keys are all you'd need to decide 
what gets added or removed, so I wanted to minimize the potentially 
giant dump of data (with unit metadata being so open-ended, I was 
worried about how much people would throw in there).

That said, I may be overly paranoid about performance here without a 
good reason to be. Alternatively, we could return just a subset of the 
data by default but give the plugin the option to request "full" unit 
data. Either way, the first step is reducing the method down to 
get_units, which has more room to grow than get_unit_keys ever did.
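
In other words, something along these lines. The fields argument is just 
one way to express the "subset by default, full on request" idea, and 
the internal query helper is hypothetical:

    def get_units(self, fields=None):
        """
        Return a two level mapping {type_id: {unit_id: ContentUnitData}}
        for everything associated with the repo. By default only the unit
        key data is populated to keep the payload small; fields="full"
        (illustrative flag) would pull in the rest.
        """
        units = {}
        for unit in self._query_units_for_repo(self.repo_id, fields):  # hypothetical helper
            units.setdefault(unit.type_id, {})[unit.unit_id] = unit
        return units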

> - new_unit(type_id, key_data, other_data, relative_path) -> ContentUnitData
> Does *not* assign a unit ID (or touch the database at all)
> Does fill in absolute path in storage_path based on relative_path
> Replaces any use of "request_unit_filename"

So it's basically a factory method that populates generated fields? 
Interesting approach. I'm not a fan of the name "new_unit" since the 
connotation (in my head at least) is that it's doing some sort of 
saving, but that can be alleviated with halfway decent docs. It also 
makes for a really nice mapping of our APIs to CRUD.
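
So in practice, something like the following; the storage root shown is 
just an example of the Pulp parent directory, not a committed location:

    import os

    def new_unit(self, type_id, key_data, other_data, relative_path):
        """
        Factory only: build a ContentUnitData with the generated fields
        filled in. Nothing touches the database; that happens in
        save_unit().
        """
        storage_path = None
        if relative_path is not None:
            storage_path = os.path.join("/var/lib/pulp/content",
                                        type_id, relative_path)
        return ContentUnitData(type_id, None, key_data, other_data,
                               storage_path)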

> - save_unit(ContentUnitData) -> ContentUnitData
> Assigns a unit ID with the unit and stores the unit in the database
> Associates the unit with the repo
> Batching will be tricky due to error handling if the save fails
> Replaces any use of 'add_or_update_content_unit' and
> 'associate_content_unit'

The more I thought about it last night, the more I liked the idea of 
the combined add/associate. This call was definitely going to be added 
as part of this process.
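
Sketching out the combined call, with error handling elided and the 
internal calls hypothetical:

    def save_unit(self, unit):
        """
        Add or update the unit in the database, then associate it with
        the repo. Returns the same ContentUnitData with unit_id filled in.
        """
        if unit.unit_id is None:
            unit.unit_id = self._add_content_unit(unit)    # hypothetical internals
        else:
            self._update_content_unit(unit)
        self._associate_content_unit(self.repo_id, unit.type_id,
                                     unit.unit_id)
        return unit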

> - remove_unit(type_id, pulp_id) -> bool
> True if removed, False if association retained
> Replaces any use of 'unassociate_content_unit'
>
> For the content unit lifecycle, I suggest adopting a reference counting
> model where the importer owns one set of references (controlled via
> save_unit/remove_unit on the importer conduit) and manual association
> owns a second set of references (which the importer conduit can't
> touch). A reference through either mechanism would then keep the content
> unit alive and associated with the repository (the repo should present a
> unified interface to other code, so client code doesn't need to care if
> it is an importer association or a manual association that is keeping
> the content unit alive).

Implementation-wise I think it'd be a little different from what you 
describe, but conceptually I like the idea of a reference owner. That 
would go a long way towards eventually supporting multiple importers as 
well, if we ever needed to go that route.
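
Rather than a literal reference count, I picture the association record 
carrying a set of owners; removing the importer's reference only drops 
the association when no other owner holds one. Very rough sketch, with 
the internal calls hypothetical:

    IMPORTER = "importer"
    MANUAL = "manual"

    def remove_unit(self, type_id, unit_id):
        """
        Drop the importer-owned reference to the unit. Returns True if
        the unit is no longer associated with the repo, False if a manual
        association keeps it alive.
        """
        self._remove_owner(self.repo_id, type_id, unit_id, IMPORTER)
        remaining = self._get_owners(self.repo_id, type_id, unit_id)
        if MANUAL in remaining:
            return False
        self._delete_association(self.repo_id, type_id, unit_id)
        return True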

> 4. It's probably worth also discussing the scratchpad API that lets the
> importer store additional state between syncs. I like having this as
> a convenience API (rather than having to store everything as explicit
> metadata on the repo), but for debugging purposes, it would probably be
> good to publish "_importer_scratchpad" as a metadata attribute on the
> repo that is accessible via REST.

Bugger, I totally forgot to mention that in the blog.

Currently, the scratchpad is accessible on the importer itself through REST:

/v2/repositories/my-repo/importers/

I haven't thought yet about which "sub-collection" data (importers, 
distributors, history, etc.) I want to pull up into the repo itself when 
returning it from the repo GET call, but importer/distributor state like 
that is definitely a candidate.
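
For reference, the plugin-side usage is just a get/set pair on the 
conduit, roughly like this (exact method names may shift as this 
settles):

    def sync_repo(self, repo, sync_conduit, config):
        # Pull whatever state the last sync left behind (None on the
        # first run).
        state = sync_conduit.get_scratchpad() or {}
        last_sync = state.get("last_sync_timestamp")

        # ... sync, using last_sync to skip unchanged content ...

        state["last_sync_timestamp"] = "2011-11-22T14:07:31Z"  # example value
        sync_conduit.set_scratchpad(state)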

>> This is too big and ambitious for me to get right on my own.
>
> Definitely headed in the right direction, but I think it's worth pushing
> the "structured data" approach even further. You've already started
> doing this on the configuration side of things, I think it makes sense
> on the metadata side as well.
>
> Cheers,
> Nick.

Appreciate the fresh pair of eyes on this. I'll keep you posted on how 
it evolves.


-- 
Jay Dobies
Freenode: jdob @ #pulp
http://pulpproject.org | http://blog.pulpproject.org



