[Freeipa-devel] [PATCH] WIP backup and restore

Mon Mar 25 23:47:15 UTC 2013

On 03/25/2013 12:08 PM, Petr Viktorin wrote:
> On 03/23/2013 05:06 AM, Rob Crittenden wrote:
>> TL;DR. Sorry.
>>
>> Here is my current progress on backup and restore. I have not documented
>> any of this in the Implementation section of the wiki yet.
>>
>> I've added two new commands, ipa-backup and ipa-restore.
>>
>> The files go into /var/lib/ipa/backup. When doing a restore you should
>> only reference the directory in backup, not the full path. This needs to
>> change, but it is what it is.
>>
>> There are strict limits on what can be restored where. Only exact
>> matching hostnames and versions are allowed right now. We can probably
>> relax the hostname requirement if we're only restoring data, and the
>> version perhaps for only the the first two values (so you can restore a
>> 3.0.0 backup on 3.0.1 but not on 3.1.0).
>
> Do we also need to limit the versions of Dogtag, 389, Kerberos...?

No.

> Or is what they put in /var/lib guaranteed portable across versions?

Mostly.  We always suggest doing ldif dumps (db2ldif) for more long term 
storage.

>
>> I've done 99.99% of testing in F-18 with a single instance. I did some
>> initial testing in 6.4 so I think the roots are there, but they are
>> untested currently.
>>
>> I spent a lot of time going in circles when doing a restore and getting
>> replication right. I'm open to discussion on this, but my purpose for
>> restoration was to define a new baseline for the IPA installation. It is
>> basically the catastrophic case, where your data is
>> hosed/untested/whatever and you just want to get back to some sane 
>> point.
>>
>> Ok, so given that we need to make sure that any other masters don't send
>> us any updates in their changelog when they come back online. So I use a
>> new feature in 1.3.0 to disable the replication agreements. This works
>> really, really well.
>>
>> The only problem is you have to re-enable the agreement in order to
>> re-initialize a master (https://fedorahosted.org/389/ticket/47304). I
>> have the feeling that this leaves a small window where replication can
>> occur and pollute our restored master. I noticed that we do a
>> force_sync() when doing a full re-init. It may be that if we dropped it
>> that would also mitigate this.
>>
>> I did the majority of my testing using an A <-> B <-> C replication
>> topology. This exposed a lot of issues that A <-> B did not. I don't
>> know if it was the third server or having the extra hop, but I hopefully
>> closed a bunch of the corner cases.
>>
>> So what I would do is either a full or a data restore on A. This would
>> break replication on B and C, as expected. So in this scenario A and B
>> are CAs.
>>
>> Then I'd run this on B:
>>
>> # ipa-replica-manage re-initialize --from=A
>> # ipa-csreplica-manage re-initialize --from=A
>>
>> Once that was done I'd run this on C:
>>
>> # ipa-replica-manage re-initialize --from=B
>>
>> The restoration of the dogtag databases was the last thing I did so it
>> isn't super-well tested. I had to move a fair bit of code around. I
>> think it's the sort of thing that will work when the everything goes
>> well but exceptions may not be well-handled.
>>
>> The man pages are just a shel right now, they need a lot of work.
>>
>> It should also be possible to do a full system restore. I tested with:
>>
>> # ipa-server-install ...
>> # <add a bunch of data, 100 entries or more>
>> # ipa-backup
>> # add one or more users
>> # ipa-server-install --uninstall -U
>> # ipa-restore ipa-full-...
>>
>> The last batch of users should be gone. I did similar tests with the
>> A/B/C set up.
>>
>> I ran the unit tests against it and all was well.
>>
>> I have done zero testing in a Trust environment, though at least some of
>> the files are backed up in the full case. I did some testing with DNS.
>>
>> I did no testing of a master that was down at the time of restoration
>> and then was brought online later, so it never had its replication
>> agreement disabled. I have the feeling it will hose the data.
>>
>> I have some concern over space requirements. Because I tar things up one
>> basically needs double-the backup space in order to do a restore, and a
>> bit more when encrypted. I'm open to suggestions on how to store the
>> data, but we have many files for the 389-ds bak backup and I didn't want
>> to have to encrypt them all.
>>
>> On that note, that I'm doing a db2bak AND a db2ldif backup and currently
>> using the ldif2db for the restore. My original intention was to use
>> db2bak/bak2db in order to retain the changelog, but retaining the
>> changelog is actually a problem if we're restoring to a known state and
>> forcing a re-init. It wouldn't take much to convince me to drop that,
>> which reduces the # of files we have to deal with.
>>
>> I also snuck in a change to the way that replication is displayed. It
>> has long bothered me that we print out an Updating message during
>> replication because it gives no context. I changed it to be more of a
>> progress indicator, using \r to over-write itself and include the # of
>> seconds elapsed. The log files are still readable but I'd hate to see
>> what this looks like in a typescript :-)
>>
>> Finally, sorry about the huge patch. I looked at the incremental commits
>> I had done and I didn't think they'd tell much of a story. I thought
>> about breaking some of the pieces out, but there is a lot of
>> interdependency, so you'd need everything eventually anyway, so here you
>> go, a 70k patch.
>>
>> A quick roadmap is (and I'll skip the obvious):
>>
>> ipa-csreplica-manage: had to move a bunch of code to replication.py so I
>> could use the CSReplicaManager class in ipa-restore. The rest revolves
>> around handling exceptions raised by get_cs_replication_manager() and
>> enabling replication on both sides to do a re-init.
>>
>> replication.py: the changes mentioned to pretty-print updates, a notable
>> change where a re-init automatically enables replication, the new
>> enable/disable code and all the code from ipa-csreplica-manage.
>>
>> ipa-backup/ipa-restore use the new admintool framework. I generally have
>> a good impression of the new framework, but I found myself having to run
>> in debug mode more than I'd have liked in order to discover some types
>> of errors.
>
> That's good to hear!
>
> I think while developing, you should run with -v to get all the output.
> And for production, please have the commands log by default (set 
> log_file_name).
>
>> ipa-backup does all its binds using ldapi. ipa-restore needs the DM
>> password because we need to contact remote servers to enable/disable
>> replication.
>
> The restore assumes that ipa is already installed, right? I can't just 
> run it on a clean machine? Id would be good to check/document this.
>
> I looked in the backup tarball and saw dirsrv PID file and lock 
> directories. Are these needed?

No.

>
>
> The tool runners (install/tools/ipa-backup and 
> install/tools/ipa-restore) are missing, so IPA doesn't build. Probably 
> just a forgotten git add.
>
> The patch adds an extra backslash in install/tools/man/Makefile.am; 
> consider adding $(NULL) at the end.
>
>
> Backup.dirs, Backup.files, Backup.logs:
> The code modifies these class-level attributes. This means you can't 
> run the command more than once in one Python session, so it can't be 
> used as a library (e.g. in a hypothetical test suite).
> Please make the class atrributes immutable (tuples), then in __init__ 
> do `self.dirs = list(self.dirs)` to to get instance-specific lists.
>
> Code that creates temporary files/directories or does chdir() should 
> be next to (or in) a try block whose finally: restores things.
>
> Instead of star imports (from ipapython.ipa_log_manager import *), 
> please explicitly import whatever you're using from that package. In 
> this case, nothing at all!
>
>
> If there's a get_connection() method, it should return the connection, 
> and it should always be used to get it. Store the connection in 
> self._conn, and rather than:
>     self.get_connection()
>     self.conn.do_work()
> do:
>     conn = self.get_connection()
>     conn.do_work()
> This makes forgetting to call get_connection() impossible.
>
> If a variable is only used in a single method, like `filename` in 
> Restore.extract_backup, don't store it in the admintool object.
> In general, try to avoid putting data in self if possible. It's 
> convenient but it allows complex interdependencies.
> (Yes, I'm guilty of not following this myself.)
>
>
> In several places, the backup tool logs critical errors and then just 
> continues on. Is that by design? I think a nonzero exit code would be 
> appropriate.
>
>
> In the ipa-restore man page, --backend is not documented.
>
> In db2ldif, db2bak, etc., a more conscise way to get the time string 
> is `time.strftime('export_%Y_%m_%d_%H_%M_%S')`.
>
> When validating --gpg-keyring, it would be good to check both files 
> (.sec and .pub).
>
> Here:
>     if (self.backup_type != 'FULL' and not options.data_only and
>         not instances):
>         raise admintool.ScriptError('Cannot restore a data backup into 
> an empty system')
> did you mean `(self.backup_type != 'FULL' or options.data_only) and 
> not instances`? (I'd introduce a `data_only` flag to prevent confusion.)
>
> In the ReplicationManager.check_repl_init reporting: use 
> int(d.total_seconds()) instead of d.seconds, as the latter doesn't 
> include full days. I don't think anyone would wait long enough for the 
> overflow, but better to be correct.
>