[Freeipa-devel] [PATCH] 1031 run cleanallruv task

Rob Crittenden rcritten at redhat.com
Thu Sep 6 15:42:34 UTC 2012


Rob Crittenden wrote:
> Martin Kosek wrote:
>> On 09/05/2012 08:06 PM, Rob Crittenden wrote:
>>> Rob Crittenden wrote:
>>>> Martin Kosek wrote:
>>>>> On 07/05/2012 08:39 PM, Rob Crittenden wrote:
>>>>>> Martin Kosek wrote:
>>>>>>> On 07/03/2012 04:41 PM, Rob Crittenden wrote:
>>>>>>>> Deleting a replica can leave a replication update vector (RUV)
>>>>>>>> on the other servers. This can confuse things if the replica is
>>>>>>>> re-added, and it also causes the server to calculate changes
>>>>>>>> against a server that may no longer exist.
>>>>>>>>
>>>>>>>> 389-ds-base provides a new task that propagates itself to all
>>>>>>>> available replicas to clean this RUV data.
>>>>>>>>
>>>>>>>> This patch will create this task at deletion time to hopefully
>>>>>>>> clean things up.
>>>>>>>>
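>>>>>>>> Under the hood the task is just an LDAP entry. A minimal sketch
>>>>>>>> (not the patch code; the helper name is made up, the attribute
>>>>>>>> names are the ones documented for the 389-ds-base CLEANALLRUV
>>>>>>>> task):
>>>>>>>>
>>>>>>>>     import ldap
>>>>>>>>     import ldap.modlist
>>>>>>>>
>>>>>>>>     def create_cleanallruv_task(conn, replica_id, suffix):
>>>>>>>>         # Task entries are added under cn=cleanallruv,cn=tasks,cn=config
>>>>>>>>         dn = 'cn=clean %d,cn=cleanallruv,cn=tasks,cn=config' % replica_id
>>>>>>>>         entry = {
>>>>>>>>             'objectclass': ['top', 'extensibleObject'],
>>>>>>>>             'cn': ['clean %d' % replica_id],
>>>>>>>>             'replica-base-dn': [suffix],       # e.g. 'dc=example,dc=com'
>>>>>>>>             'replica-id': [str(replica_id)],
>>>>>>>>         }
>>>>>>>>         conn.add_s(dn, ldap.modlist.addModlist(entry))
>>>>>>>>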
>>>>>>>> It isn't perfect. If any replica is down or unavailable at the
>>>>>>>> time the cleanruv task fires, and then comes back up, the old
>>>>>>>> RUV data may be re-propagated around.
>>>>>>>>
>>>>>>>> To make things easier in this case I've added two new commands to
>>>>>>>> ipa-replica-manage. The first lists the replication ids of all the
>>>>>>>> servers we
>>>>>>>> have a RUV for. Using this you can call clean_ruv with the
>>>>>>>> replication id of a
>>>>>>>> server that no longer exists to try the cleanallruv step again.
>>>>>>>>
>>>>>>>> This is quite dangerous though. If you run cleanruv against a
>>>>>>>> replica id that does exist, it can cause a loss of data. I
>>>>>>>> believe I've put in enough scary warnings about this.
>>>>>>>>
>>>>>>>> rob
>>>>>>>>
>>>>>>>
>>>>>>> Good work there, this should make cleaning RUVs much easier than
>>>>>>> with the
>>>>>>> previous version.
>>>>>>>
>>>>>>> This is what I found during review:
>>>>>>>
>>>>>>> 1) The list_ruv and clean_ruv command help is quite lost in the
>>>>>>> man page. I think it would help if, for example, all the
>>>>>>> per-command info were indented. As it is, a user could easily
>>>>>>> overlook the new commands in the man page.
>>>>>>>
>>>>>>>
>>>>>>> 2) I would rename the new commands to clean-ruv and list-ruv to
>>>>>>> make them consistent with the rest of the commands
>>>>>>> (re-initialize, force-sync).
>>>>>>>
>>>>>>>
>>>>>>> 3) It would be nice to be able to run the clean_ruv command in
>>>>>>> an unattended way (for better testing), i.e. respect the --force
>>>>>>> option as we already do for ipa-replica-manage del. This fix
>>>>>>> would aid test automation in the future.
>>>>>>>
>>>>>>>
>>>>>>> 4) (minor) The new question (and the del one too) does not react
>>>>>>> well to CTRL+D:
>>>>>>>
>>>>>>> # ipa-replica-manage clean_ruv 3 --force
>>>>>>> Clean the Replication Update Vector for
>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>
>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>> longer replicate so it may miss updates while the process
>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>> consistency. Be very careful.
>>>>>>> Continue to clean? [no]: unexpected error:
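>>>>>>>
>>>>>>> (Presumably CTRL+D makes the prompt raise EOFError; a sketch of
>>>>>>> handling it gracefully:)
>>>>>>>
>>>>>>>     try:
>>>>>>>         if not ipautil.user_input("Continue to clean?", False):
>>>>>>>             sys.exit("Aborted")
>>>>>>>     except EOFError:
>>>>>>>         sys.exit("Aborted")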
>>>>>>>
>>>>>>>
>>>>>>> 5) The help for the clean_ruv command without its required
>>>>>>> parameter is quite confusing, as it reports that the command is
>>>>>>> wrong rather than the parameter:
>>>>>>>
>>>>>>> # ipa-replica-manage clean_ruv
>>>>>>> Usage: ipa-replica-manage [options]
>>>>>>>
>>>>>>> ipa-replica-manage: error: must provide a command [clean_ruv |
>>>>>>> force-sync |
>>>>>>> disconnect | connect | del | re-initialize | list | list_ruv]
>>>>>>>
>>>>>>> It seems you just forgot to specify the error message in the
>>>>>>> command definition.
>>>>>>>
>>>>>>>
>>>>>>> 6) When the remote replica is down, the clean_ruv command fails
>>>>>>> with an
>>>>>>> unexpected error:
>>>>>>>
>>>>>>> [root at vm-086 ~]# ipa-replica-manage clean_ruv 5
>>>>>>> Clean the Replication Update Vector for
>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>
>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>> longer replicate so it may miss updates while the process
>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>> consistency. Be very careful.
>>>>>>> Continue to clean? [no]: y
>>>>>>> unexpected error: {'desc': 'Operations error'}
>>>>>>>
>>>>>>>
>>>>>>> /var/log/dirsrv/slapd-IDM-LAB-BOS-REDHAT-COM/errors:
>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
>>>>>>> failed to connect to repl agreement connection
>>>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping tree,cn=config), error 105
>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
>>>>>>> replica
>>>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping tree,cn=config)
>>>>>>> has not been cleaned. You will need to rerun the CLEANALLRUV task
>>>>>>> on this replica.
>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
>>>>>>> Task failed (1)
>>>>>>>
>>>>>>> In this case I think we should inform the user that the command
>>>>>>> failed, possibly because of disconnected replicas, and that they
>>>>>>> should bring the replicas back online and try again.
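>>>>>>>
>>>>>>> For example (just a sketch of the idea, assuming python-ldap
>>>>>>> surfaces the failure above as ldap.OPERATIONS_ERROR):
>>>>>>>
>>>>>>>     try:
>>>>>>>         repl1.cleanallruv(replica_id)
>>>>>>>     except ldap.OPERATIONS_ERROR:
>>>>>>>         sys.exit("Failed to clean RUV: this can happen when a "
>>>>>>>                  "remote replica is down or unreachable. Bring "
>>>>>>>                  "the replicas back online and try again.")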
>>>>>>>
>>>>>>>
>>>>>>> 7) (minor) "pass" is now redundant in replication.py:
>>>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>>>> +            # We can't make the server we're removing read-only but
>>>>>>> +            # this isn't a show-stopper
>>>>>>> +            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
>>>>>>> +            pass
>>>>>>>
>>>>>>
>>>>>> I think this addresses everything.
>>>>>>
>>>>>> rob
>>>>>
>>>>> Thanks, almost there! I just found one more issue which needs to be
>>>>> fixed
>>>>> before we push:
>>>>>
>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com --force
>>>>> Directory Manager password:
>>>>>
>>>>> Unable to connect to replica vm-055.idm.lab.bos.redhat.com, forcing
>>>>> removal
>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>> "Can't
>>>>> contact LDAP server"}
>>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>>
>>>>> There were issues removing a connection: %d format: a number is
>>>>> required, not str
>>>>>
>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>> "Can't
>>>>> contact LDAP server"}
>>>>>
>>>>> This is a traceback I retrieved:
>>>>> Traceback (most recent call last):
>>>>>     File "/sbin/ipa-replica-manage", line 425, in del_master
>>>>>       del_link(realm, r, hostname, options.dirman_passwd, force=True)
>>>>>     File "/sbin/ipa-replica-manage", line 271, in del_link
>>>>>       repl1.cleanallruv(replica_id)
>>>>>     File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1094, in cleanallruv
>>>>>       root_logger.debug("Creating CLEANALLRUV task for replica id %d" % replicaId)
>>>>>
>>>>>
>>>>> The problem here is that you don't convert replica_id to int in this
>>>>> part:
>>>>> +    replica_id = None
>>>>> +    if repl2:
>>>>> +        replica_id = repl2._get_replica_id(repl2.conn, None)
>>>>> +    else:
>>>>> +        servers = get_ruv(realm, replica1, dirman_passwd)
>>>>> +        for (netloc, rid) in servers:
>>>>> +            if netloc.startswith(replica2):
>>>>> +                replica_id = rid
>>>>> +                break
>>>>>
>>>>> Martin
>>>>>
>>>>
>>>> Updated patch using the new mechanism in 389-ds-base. This should
>>>> more thoroughly clean out RUV data when a replica is being deleted,
>>>> and provide a way to delete RUV data afterwards too, if necessary.
>>>>
>>>> rob
>>>
>>> Rebased patch
>>>
>>> rob
>>>
>>
>> 0) As I wrote in a review of your patch 1041, the changelog entry
>> slipped elsewhere.
>>
>> 1) The following KeyboardInterrupt except clause looks suspicious. I
>> know why you have it there, but since it is generally a bad thing to
>> do, a comment explaining why it is needed would be useful.
>>
>> @@ -256,6 +263,17 @@ def del_link(realm, replica1, replica2, dirman_passwd, force=False):
>>       repl1.delete_agreement(replica2)
>>       repl1.delete_referral(replica2)
>>
>> +    if type1 == replication.IPA_REPLICA:
>> +        if repl2:
>> +            ruv = repl2._get_replica_id(repl2.conn, None)
>> +        else:
>> +            ruv = get_ruv_by_host(realm, replica1, replica2, dirman_passwd)
>> +
>> +        try:
>> +            repl1.cleanallruv(ruv)
>> +        except KeyboardInterrupt:
>> +            pass
>> +
>>
>> Maybe you just wanted to do some cleanup and then "raise" again?
>
> No, it is there because it is safe to break out of it. The task will
> continue to run. I added some verbiage.
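>
> The result is roughly this (paraphrased, not the exact patch text):
>
>     try:
>         repl1.cleanallruv(ruv)
>     except KeyboardInterrupt:
>         # The CLEANALLRUV task keeps running on the servers, so it is
>         # safe to stop waiting for it here.
>         pass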
>
>>
>> 2) This is related to 1): "ipa-replica-manage del" may wait
>> indefinitely when some remote replica is down, right?
>>
>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>> Deleting a master is irreversible.
>> To reconnect to the remote master you will need to prepare a new
>> replica file
>> and re-install.
>> Continue to delete? [no]: y
>> ipa: INFO: Setting agreement
>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping tree,cn=config schedule to 2358-2359 0 to force synch
>> ipa: INFO: Deleting schedule 2358-2359 0 from agreement
>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping tree,cn=config
>> ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica
>> acquired
>> successfully: Incremental update succeeded: start: 0: end: 0
>> Background task created to clean replication data
>>
>> ... after about a minute I hit CTRL+C
>>
>> ^CDeleted replication agreement from 'vm-086.idm.lab.bos.redhat.com' to
>> 'vm-055.idm.lab.bos.redhat.com'
>> Failed to cleanup vm-055.idm.lab.bos.redhat.com DNS entries: NS record
>> does not
>> contain 'vm-055.idm.lab.bos.redhat.com.'
>> You may need to manually remove them from the tree
>>
>> I think it would be better to inform the user that some remote
>> replica is down, or at least that we are waiting for the task to
>> complete. Something like this:
>>
>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>> ...
>> Background task created to clean replication data
>> Replication data cleanup may take a very long time if some replica is
>> unreachable
>> Hit CTRL+C to interrupt the wait
>> ^C Clean up wait interrupted
>> ....
>> [continue with del]
>
> Yup, did this in #1.
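>
> Roughly like this (a sketch, assuming checkTask() is the call that
> blocks on the task):
>
>     print "Background task created to clean replication data"
>     print "This may take a while if any replica is unreachable"
>     print "Hit CTRL+C to interrupt the wait"
>     try:
>         self.conn.checkTask(dn, dowait=True)
>     except KeyboardInterrupt:
>         print "Wait for task interrupted. The task will continue to run."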
>
>>
>> 3) (minor) When there is a cleanruv task running and you run
>> "ipa-replica-manage del", there is an unexpected error message caused
>> by a duplicate task object in LDAP:
>>
>> # ipa-replica-manage del vm-072.idm.lab.bos.redhat.com --force
>> Unable to connect to replica vm-072.idm.lab.bos.redhat.com, forcing
>> removal
>> FAIL
>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
>> contact LDAP server"}
>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>
>> There were issues removing a connection: This entry already exists
>> <<<<<<<<<
>>
>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
>> contact LDAP server"}
>> Failed to cleanup vm-072.idm.lab.bos.redhat.com DNS entries: NS record
>> does not
>> contain 'vm-072.idm.lab.bos.redhat.com.'
>> You may need to manually remove them from the tree
>>
>>
>> I think it should be enough to catch "entry already exists" in the
>> cleanallruv function and, in that case, print a relevant error
>> message and bail out. That way, self.conn.checkTask(dn, dowait=True)
>> would not be called either.
>
> Good catch, fixed.
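>
> For reference, the guard is roughly (a sketch; assuming python-ldap
> raises ldap.ALREADY_EXISTS for the duplicate task entry):
>
>     try:
>         self.conn.addEntry(entry)
>     except ldap.ALREADY_EXISTS:
>         print "A CLEANALLRUV task for replica id %d already exists." % ruv
>         return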
>
>>
>>
>> 4) (minor): In the make_readonly function, there is a redundant
>> "pass" statement:
>>
>> +    def make_readonly(self):
>> +        """
>> +        Make the current replication agreement read-only.
>> +        """
>> +        dn = DN(('cn', 'userRoot'), ('cn', 'ldbm database'),
>> +                ('cn', 'plugins'), ('cn', 'config'))
>> +
>> +        mod = [(ldap.MOD_REPLACE, 'nsslapd-readonly', 'on')]
>> +        try:
>> +            self.conn.modify_s(dn, mod)
>> +        except ldap.INSUFFICIENT_ACCESS:
>> +            # We can't make the server we're removing read-only but
>> +            # this isn't a show-stopper
>> +            root_logger.debug("No permission to switch replica to
>> read-only,
>> continuing anyway")
>> +            pass         <<<<<<<<<<<<<<<
>
> Yeah, this is one of my common mistakes. I put in a pass initially,
> then added logging in front of it and forgot to delete the pass. It's
> gone now.
>
>>
>>
>> 5) In clean_ruv, I think allowing a --force option to bypass the
>> user_input
>> would be helpful (at least for test automation):
>>
>> +    if not ipautil.user_input("Continue to clean?", False):
>> +        sys.exit("Aborted")
>
> Yup, added.
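>
> i.e. roughly:
>
>     if not options.force:
>         if not ipautil.user_input("Continue to clean?", False):
>             sys.exit("Aborted")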
>
> rob

Slightly revised patch. I still had a window open with one unsaved change.

rob

-------------- next part --------------
A non-text attachment was scrubbed...
Name: freeipa-rcrit-1031-6-cleanruv.patch
Type: text/x-diff
Size: 14660 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/freeipa-devel/attachments/20120906/4bb1b8f2/attachment.bin>

