We've discussed versioned repositories and their merits in the past, but I'd like to propose a specific direction and its inclusion in 3.0. As a recap of goals, versions can help us answer two important questions about the history of a repository:
1) What set of content is in a specific version of a repository?
2) What changed between two arbitrary versions of a repository?
I am proposing a model where Pulp creates a new version of a repository for every operation that changes that repo's content. For example, a sync task would create a single new version. A typical workflow might look like this:
- You create repository "foo".
- You sync repository "foo", which produces version 1 of that repo.
- You sync once per day for some period of time, automatically creating a new version each time.
- You publish repo "foo", which defaults to publishing the most recent version.
- You don't like something that's new in the repo, so you roll back by publishing a previous version.
Data Model Basics
In the past we've stored the relationship between a content unit and a repo as a standard many-to-many through table. There's a reference to a unit, and a reference to a repo.
The version scheme I'm pitching adds two new fields to that through table:
vadded - a foreign key to the repo version in which this content unit was added
vremoved - a foreign key to the repo version in which this content unit was removed. This can be null.
Multiple entries can exist for the same content unit and repo, so long as a new one is not added until the previous one's "vremoved" field is set.
With this structure, it is easy to query the database to answer both questions we started with.
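To illustrate those two queries, here is a minimal in-memory sketch in Python. It is not the PoC code. The RepoContent class, the plain integers standing in for foreign keys to ordered repo versions, and the function names content_in and diff are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RepoContent:
    """One row of the through table (hypothetical, simplified)."""
    unit: str                 # stands in for the reference to a content unit
    vadded: int               # number of the version in which the unit was added
    vremoved: Optional[int]   # number of the version in which it was removed, or None


def content_in(rows, version):
    """Question 1: the set of units present in the given version."""
    return {
        r.unit
        for r in rows
        if r.vadded <= version and (r.vremoved is None or r.vremoved > version)
    }


def diff(rows, old, new):
    """Question 2: (added, removed) units when going from version old to new."""
    before, after = content_in(rows, old), content_in(rows, new)
    return after - before, before - after


# Sample history for a single repo (made-up data).
rows = [
    RepoContent("a", vadded=1, vremoved=None),
    RepoContent("b", vadded=1, vremoved=3),
    RepoContent("c", vadded=2, vremoved=None),
    RepoContent("b", vadded=4, vremoved=None),  # re-added after its removal in version 3
]
```

Note that unit "b" has two rows, which is allowed because the second was only added after the first one's vremoved was set. In a real implementation the same filter would presumably be expressed as a database query rather than a Python loop.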
Some endpoint will be made that gives access to the versions of a specific repository. Ideally we would have a nested endpoint like this:
But nested views have been a problem for us with DRF (django rest framework). If we aren't able to make that happen, I've gotten this to work in my PoC branch:
It's not yet clear how best to represent content through the REST API. A nested endpoint within the repo version object would be ideal.
Operations on a repo where a version could be chosen, such as a publish, should default to the latest version. It's an open question how best to represent that, and perhaps it takes the form of two endpoints:
default to latest: POST /api/v3/repositories/foo/distributors/bar/publish
specify a version: POST /api/v3/repositories/foo/versions/4/publish
But that's just one idea. Much about our REST API layout has yet to be written in stone, and we have flexibility.
Notice that this changes the orphan workflow. Removing a content unit from a repo doesn't make it an orphan, because its entry in the through table remains as part of the repo's history, with "vremoved" set. This helps reduce the need to run an orphan cleanup task, which in turn helps avoid the inherent race condition that task can introduce.
But you may not want to keep history forever, so a valuable feature will be the ability to trim history. I think this would just be an operation that squashes a bunch of versions together, and it could optionally take that opportunity to immediately delete a content unit that becomes an orphan.
Illustrating the workflow, if you wanted to squash history prior to version 10, the task would:
- delete all of a repo's relationships in the through table where vremoved is a version <= 10
- optionally check if each content unit is now an orphan and remove if so
- update all remaining entries where vadded < 10 by setting vadded to 10
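The squash steps above can be sketched roughly as follows, again in plain Python rather than real model code. The RepoContent class, the integer version numbers, and the squash function are hypothetical stand-ins for the through table, and the optional orphan check is left out.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class RepoContent:
    """One row of the through table (hypothetical, simplified)."""
    unit: str
    vadded: int
    vremoved: Optional[int]


def squash(rows, version):
    """Return the rows that remain after squashing history prior to version."""
    kept = []
    for r in rows:
        # Delete relationships that were already removed at or before this version.
        if r.vremoved is not None and r.vremoved <= version:
            continue
        # Anything added earlier now counts as having been added in this version.
        if r.vadded < version:
            r = replace(r, vadded=version)
        kept.append(r)
    return kept


# Made-up history for one repo, squashed at version 10.
history = [
    RepoContent("a", vadded=1, vremoved=None),
    RepoContent("b", vadded=1, vremoved=3),
    RepoContent("c", vadded=2, vremoved=12),
    RepoContent("d", vadded=11, vremoved=None),
    RepoContent("e", vadded=5, vremoved=10),
]
squashed = squash(history, 10)
```

In this sample, "b" and "e" disappear from the repo's history entirely, which is the point at which the optional orphan check would apply to them. The content of version 10 and of every later version is unchanged by the squash.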
I have a branch with proof-of-concept code here:
The models are the most interesting place to look. In particular, I'm very pleased with how simple the "content()" method is, which returns a QuerySet matching all the content in a given version.
The rest is REST ;) API stuff mostly, which isn't all that interesting except to demonstrate how the data could potentially be exposed. You can run the included tests, which are found in the root of the git repo (I made them just for dev purposes, and I'm not sure they deserve a long-term home). Running them loads some data into the database. Then you can hit this endpoint as an example:
Obviously this code is rough, so please consider it for directional and conceptual purposes only. Assume major additions and improvements if we follow through on this concept.
Tracking history in this way opens up great possibilities. Some examples:
Promotion could become a matter of having two publishers on a repo with different settings, one for "testing" and one for "production", and just publishing whichever version you like with each. Multiple repos and copy operations are no longer needed for promotion. Austin suggested that the ability to tag versions with arbitrary key:value pairs could enhance this use case.
An added concept, which could come post-3.0, is tracking publications more explicitly and associating each with a version, although I could see a case for laying this groundwork now before the API is locked down. Promotion could become more about making a publication available in a different location, rather than re-creating it. We'd also know which content is part of a publication, and guarantee that content doesn't get removed before the publication does. This is a deficiency we have in Pulp 2.
Pulp-to-pulp sync could become very efficient, since the syncing Pulp could easily replicate only the changes made since its last sync.
Incremental exports become more concrete. Rather than depending on a timestamp, you can know with certainty which version you have in the remote location, and thus which newer versions need to be exported.
We could add a "finalized" boolean or similar to a version, and use that to know if it was successfully completed. If not, for example if a sync task stopped abruptly, the incomplete version could easily be recognized and removed.
Please ask questions, provide feedback, add ideas, suggest alternatives, etc. I'm perfectly happy even throwing this PoC away if we come up with something better.
Michael Hrivnak
Principal Software Engineer, RHCE