Re: [Pulp-list] Importer Sync APIs

On 11/23/2011 12:07 AM, Jay Dobies wrote:
1. The new sync log API looks pretty good. What I'll do is set up my
sync commands to log to a file on disk (since some of them run in a different
process), then when everything is done, read that file and pass the
contents back in the final report.

However, it would be nice to be able to store a "stats" mapping in
addition to the raw log data.

Interesting. What sorts of things do you see yourself using it for?

rsync captures a few interesting stats that are useful for figuring out how effective it is at saving bandwidth relative to a straight download, so I'd just be capturing those and storing them for later analysis.
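
As a rough sketch of what building such a "stats" mapping could look like, the snippet below parses text shaped like the summary that "rsync --stats" prints. The sample numbers, the parse_stats() helper and the key naming are all made up for illustration, and nothing here is part of the Pulp API:

```python
import re

# Sample text shaped like the summary printed by "rsync --stats".
# The numbers are invented for illustration.
SAMPLE_LOG = """\
Number of files: 12
Total file size: 1,048,576 bytes
Total transferred file size: 524,288 bytes
Literal data: 65,536 bytes
Matched data: 458,752 bytes
Total bytes sent: 1,024
Total bytes received: 70,000
"""

# Matches lines of the form "Some name: 1,234 ..."
STAT_LINE = re.compile(r"^(?P<name>[A-Za-z ]+):\s+(?P<value>[\d,]+)")


def parse_stats(log_text):
    """Collect 'Name: 1,234' style lines into a {name: int} mapping."""
    stats = {}
    for line in log_text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            key = match.group("name").strip().lower().replace(" ", "_")
            stats[key] = int(match.group("value").replace(",", ""))
    return stats


stats = parse_stats(SAMPLE_LOG)

# Share of the transferred file data that rsync rebuilt from blocks
# already present on the receiving side, rather than sending it.
total = stats["literal_data"] + stats["matched_data"]
saved = stats["matched_data"] / float(total)
```

A mapping like this could be passed back alongside the raw log text, so later analysis doesn't have to re-parse the log.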

2. I *think* the 'working directory' API is the
'get_repo_storage_directory()' call on the conduit. However, I'm not
entirely clear on that, nor what the benefits are over using Python's
own tempfile module (although that may be an artefact of the requirement
for 2.4 compatibility in Pulp - with 2.5+, the combination of context
managers, tempfile.mkdtemp() and shutil.rmtree() means that cleaning up
temporary directories is a *lot* easier than it used to be)
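
For reference, a minimal sketch of that temporary-directory pattern, using only the standard library (tempfile.mkdtemp() to create the directory, shutil.rmtree() to delete it). The temporary_directory() helper and the "sync-" prefix are names invented for this example:

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def temporary_directory(prefix="sync-"):
    """Create a scratch directory and remove it when the block exits."""
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        # Runs on normal exit and on exceptions alike.
        shutil.rmtree(path)


with temporary_directory() as workdir:
    log_path = os.path.join(workdir, "sync.log")
    with open(log_path, "w") as log_file:
        log_file.write("sync started\n")
    existed_inside = os.path.isdir(workdir)

# The directory and its contents are gone once the block has exited.
exists_after = os.path.exists(workdir)
```

The cleanup is exactly what makes this unsuitable for anything that has to survive from one sync to the next.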

This one came out of the way we sync RPMs. I forget the exact details
but when I spoke with the guys on our team, they said that it's easier
on them if they could assemble the repo as part of the sync. The idea
for the working directory over a temp directory is so we can leverage
that state from sync to sync.

To a lesser extent, this is also some paranoia on my part. Not that I
can stop a plugin from writing to a temp directory, but I'd like to push
a model where we can describe to a user where all Pulp related stuff is.
If the plugins use the working directories which fall under the Pulp
parent directory, it feels cleaner in the sense that running Pulp isn't
throwing things all over the place.

I think it's kind of a given that random stuff can get written to /tmp on any system, but I do take your point. (In particular, /tmp may be on a relatively small partition, whereas Pulp can require that the working directories be stored on a partition with generous size allowances.)

That said, I may be being overly paranoid about performance without a
good reason to be in this area. Alternatively, we could return just a
subset of data by default but give the plugin the option to request
"full" unit data. But the first step to all of that is reducing the
method down to get_units which has more room to grow than get_unit_keys
ever did.

As Jason suggests, I think the way to go here is to start with a "get_units()" API that just returns a flat list of ContentUnitData instances, then decide later if additional convenience methods make sense.

For a case like mine, where I'm only storing one content type in each repo, turning the list into a dictionary is pretty easy:

    units = dict((unit.unit_id, unit) for unit in conduit.get_units())

Grouping by content type wouldn't be much more difficult:

    # Maps type_id -> {unit_id: unit}
    units = {}
    for unit in conduit.get_units():
        units.setdefault(unit.type_id, {})[unit.unit_id] = unit

- new_unit(type_id, key_data, other_data, relative_path) ->
Does *not* assign a unit ID (or touch the database at all)
Does fill in absolute path in storage_path based on relative_path
Replaces any use of "request_unit_filename"

So it's basically a factory method that populates generated fields?
Interesting approach. I'm not a fan of the name "new_unit" since the
connotation (in my head at least) is that it's doing some sort of
saving, but that can be alleviated with halfway decent docs. It also
makes for a really nice mapping of our APIs to CRUD.

Yeah, I don't like the name either, I just didn't have any better ideas.

"preinit_unit" might be better, since it has the right connotations of "we don't want to save this yet, but we need help from the Pulp server to initialise some of the fields".
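
To make the factory idea concrete, here is a minimal sketch of the behaviour described above: no unit ID is assigned, nothing touches the database, and only storage_path is generated. The ContentUnitData attributes, the storage_root parameter, the root/type_id/relative_path layout and the example path are assumptions for illustration, not the actual Pulp implementation:

```python
import os


class ContentUnitData(object):
    """Stand-in for the unit object; the attribute names are guesses."""

    def __init__(self, type_id, key_data, other_data, storage_path):
        self.unit_id = None  # left unset: only saving would assign an ID
        self.type_id = type_id
        self.key_data = key_data
        self.other_data = other_data
        self.storage_path = storage_path


def preinit_unit(storage_root, type_id, key_data, other_data, relative_path):
    """Build an unsaved unit, filling in only the generated storage_path."""
    # Assumed layout: <storage_root>/<type_id>/<relative_path>
    storage_path = os.path.join(storage_root, type_id, relative_path)
    return ContentUnitData(type_id, key_data, other_data, storage_path)


# "/srv/example-storage" is a made-up root used only for this example.
unit = preinit_unit("/srv/example-storage", "rpm",
                    {"name": "demo"}, {}, "demo/demo-1.0.rpm")
```

The importer would then write the file to unit.storage_path and only afterwards hand the unit to save_unit.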

For the content unit lifecycle, I suggest adopting a reference counting
model where the importer owns one set of references (controlled via
save_unit/remove_unit on the importer conduit) and manual association
owns a second set of references (which the importer conduit can't
touch). A reference through either mechanism would then keep the content
unit alive and associated with the repository (the repo should present a
unified interface to other code, so client code doesn't need to care if
it is an importer association or a manual association that is keeping
the content unit alive).

Implementation-wise I think it'd be a little different than you explain,
but conceptually I like the idea of a reference owner. That would go a
long way towards eventually supporting multiple importers as well if we
ever needed to go that route.

Yeah, I knew I was hand-waving a lot there - definitely just trying to get across the concept, since I don't know anywhere near enough about how associations work to offer advice on implementation details.
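
Purely to illustrate the concept, and not as a suggestion for how associations should really be stored, the reference-owner idea can be sketched as a toy in-memory model. The RepoAssociations class, its method names and the "importer"/"manual" owner labels are all invented for this example:

```python
class RepoAssociations(object):
    """Toy model: each unit ID maps to the set of owners referencing it."""

    def __init__(self):
        self._owners = {}

    def add_reference(self, unit_id, owner):
        self._owners.setdefault(unit_id, set()).add(owner)

    def drop_reference(self, unit_id, owner):
        owners = self._owners.get(unit_id, set())
        owners.discard(owner)
        if not owners:
            # Last reference gone: the unit leaves the repository.
            self._owners.pop(unit_id, None)

    def unit_ids(self):
        # Unified view: callers don't see which owner keeps a unit alive.
        return sorted(self._owners)


repo = RepoAssociations()
repo.add_reference("unit-1", "importer")
repo.add_reference("unit-1", "manual")
repo.add_reference("unit-2", "importer")

# The importer drops both of its references on a later sync.
repo.drop_reference("unit-1", "importer")
repo.drop_reference("unit-2", "importer")

# "unit-1" survives because the manual association still holds it.
remaining = repo.unit_ids()
```

In this model the importer conduit would only ever add or drop references under its own owner label, which is what keeps manual associations out of its reach.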

Currently, the scratchpad is accessible on the importer itself through


Ah, OK. So long as it's accessible somewhere, I'm not overly worried about where.


Nick Coghlan
Red Hat Engineering Operations, Brisbane
