[Freeipa-devel] [PATCH] 1031 run cleanallruv task

Martin Kosek mkosek at redhat.com
Thu Sep 6 16:09:52 UTC 2012


On 09/06/2012 06:09 PM, Martin Kosek wrote:
> On 09/06/2012 06:05 PM, Martin Kosek wrote:
>> On 09/06/2012 05:55 PM, Rob Crittenden wrote:
>>> Rob Crittenden wrote:
>>>> Rob Crittenden wrote:
>>>>> Martin Kosek wrote:
>>>>>> On 09/05/2012 08:06 PM, Rob Crittenden wrote:
>>>>>>> Rob Crittenden wrote:
>>>>>>>> Martin Kosek wrote:
>>>>>>>>> On 07/05/2012 08:39 PM, Rob Crittenden wrote:
>>>>>>>>>> Martin Kosek wrote:
>>>>>>>>>>> On 07/03/2012 04:41 PM, Rob Crittenden wrote:
>>>>>>>>>>>> Deleting a replica can leave a replication update vector (RUV) on the
>>>>>>>>>>>> other servers. This can confuse things if the replica is re-added, and
>>>>>>>>>>>> it also causes the server to calculate changes against a server that
>>>>>>>>>>>> may no longer exist.
>>>>>>>>>>>>
>>>>>>>>>>>> 389-ds-base provides a new task that propagates itself to all available
>>>>>>>>>>>> replicas to clean this RUV data.
>>>>>>>>>>>>
>>>>>>>>>>>> This patch will create this task at deletion time to hopefully
>>>>>>>>>>>> clean things up.
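>>>>>>>>>>>>
>>>>>>>>>>>> For reference, the task entry involved is roughly of this shape. This is an
>>>>>>>>>>>> illustrative sketch only, not the code from the patch (which goes through
>>>>>>>>>>>> IPA's ldap wrappers); the suffix, credentials and rid are made-up example
>>>>>>>>>>>> values, and the attribute names are as I understand the 389-ds-base
>>>>>>>>>>>> CLEANALLRUV task:
>>>>>>>>>>>>
>>>>>>>>>>>> import ldap
>>>>>>>>>>>> import ldap.modlist
>>>>>>>>>>>>
>>>>>>>>>>>> rid = 5                              # replica id of the removed master (example)
>>>>>>>>>>>> suffix = "dc=example,dc=com"         # example suffix
>>>>>>>>>>>> dn = "cn=clean %d,cn=cleanallruv,cn=tasks,cn=config" % rid
>>>>>>>>>>>> entry = {
>>>>>>>>>>>>     'objectclass': ['top', 'extensibleObject'],
>>>>>>>>>>>>     'cn': ['clean %d' % rid],
>>>>>>>>>>>>     'replica-base-dn': [suffix],
>>>>>>>>>>>>     'replica-id': [str(rid)],
>>>>>>>>>>>> }
>>>>>>>>>>>> conn = ldap.initialize("ldap://localhost:389")
>>>>>>>>>>>> conn.simple_bind_s("cn=Directory Manager", "Secret123")
>>>>>>>>>>>> # once added, 389-ds propagates the clean operation to all replicas
>>>>>>>>>>>> conn.add_s(dn, ldap.modlist.addModlist(entry))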
>>>>>>>>>>>>
>>>>>>>>>>>> It isn't perfect. If any replica is down or unavailable at the time the
>>>>>>>>>>>> cleanruv task fires, and then comes back up, the old RUV data may be
>>>>>>>>>>>> re-propagated around.
>>>>>>>>>>>>
>>>>>>>>>>>> To make things easier in this case I've added two new commands to
>>>>>>>>>>>> ipa-replica-manage. The first lists the replication ids of all the
>>>>>>>>>>>> servers we
>>>>>>>>>>>> have a RUV for. Using this you can call clean_ruv with the
>>>>>>>>>>>> replication id of a
>>>>>>>>>>>> server that no longer exists to try the cleanallruv step again.
>>>>>>>>>>>>
>>>>>>>>>>>> This is quite dangerous though. If you run cleanruv against a
>>>>>>>>>>>> replica id that
>>>>>>>>>>>> does exist it can cause a loss of data. I believe I've put in
>>>>>>>>>>>> enough scary
>>>>>>>>>>>> warnings about this.
>>>>>>>>>>>>
>>>>>>>>>>>> rob
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Good work there, this should make cleaning RUVs much easier than
>>>>>>>>>>> with the
>>>>>>>>>>> previous version.
>>>>>>>>>>>
>>>>>>>>>>> This is what I found during review:
>>>>>>>>>>>
>>>>>>>>>>> 1) The list_ruv and clean_ruv command help gets quite lost in the man page.
>>>>>>>>>>> I think it would help if, for example, all the information for the commands
>>>>>>>>>>> was indented; as it is, a user could easily overlook the new commands in the
>>>>>>>>>>> man page.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2) I would rename the new commands to clean-ruv and list-ruv to make them
>>>>>>>>>>> consistent with the rest of the commands (re-initialize, force-sync).
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 3) It would be nice to be able to run the clean_ruv command in an unattended
>>>>>>>>>>> way (for better testing), i.e. respect the --force option as we already do
>>>>>>>>>>> for ipa-replica-manage del. This fix would aid test automation in the future.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 4) (minor) The new question (and the one in del too) does not react well
>>>>>>>>>>> to CTRL+D:
>>>>>>>>>>>
>>>>>>>>>>> # ipa-replica-manage clean_ruv 3 --force
>>>>>>>>>>> Clean the Replication Update Vector for
>>>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>>>>>
>>>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>>>>>> longer replicate so it may miss updates while the process
>>>>>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>>>>>> consistency. Be very careful.
>>>>>>>>>>> Continue to clean? [no]: unexpected error:
>>>>>>>>>>>
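>>>>>>>>>>> Catching the EOF that CTRL+D produces around the prompt would avoid this.
>>>>>>>>>>> Just a sketch of what I mean (assuming the prompt is the ipautil.user_input
>>>>>>>>>>> call and that CTRL+D surfaces as EOFError):
>>>>>>>>>>>
>>>>>>>>>>> try:
>>>>>>>>>>>     if not ipautil.user_input("Continue to clean?", False):
>>>>>>>>>>>         sys.exit("Aborted")
>>>>>>>>>>> except (KeyboardInterrupt, EOFError):
>>>>>>>>>>>     # CTRL+D (EOF) or CTRL+C on the prompt should abort cleanly
>>>>>>>>>>>     sys.exit("\nAborted")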
>>>>>>>>>>>
>>>>>>>>>>> 5) The help for the clean_ruv command run without its required parameter is
>>>>>>>>>>> quite confusing, as it reports that the command is wrong rather than the
>>>>>>>>>>> missing parameter:
>>>>>>>>>>>
>>>>>>>>>>> # ipa-replica-manage clean_ruv
>>>>>>>>>>> Usage: ipa-replica-manage [options]
>>>>>>>>>>>
>>>>>>>>>>> ipa-replica-manage: error: must provide a command [clean_ruv |
>>>>>>>>>>> force-sync |
>>>>>>>>>>> disconnect | connect | del | re-initialize | list | list_ruv]
>>>>>>>>>>>
>>>>>>>>>>> It seems you just forgot to specify the error message in the command
>>>>>>>>>>> definition.
>>>>>>>>>>>
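>>>>>>>>>>> I.e. something along these lines in the command dispatch (just a sketch; I
>>>>>>>>>>> am guessing at the structure of main() in ipa-replica-manage here, and the
>>>>>>>>>>> exact wording of the message does not matter):
>>>>>>>>>>>
>>>>>>>>>>> elif args[0] == "clean_ruv":
>>>>>>>>>>>     if len(args) != 2:
>>>>>>>>>>>         parser.error("must provide a replica ID to clean")
>>>>>>>>>>>     clean_ruv(realm, args[1], options)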
>>>>>>>>>>>
>>>>>>>>>>> 6) When the remote replica is down, the clean_ruv command fails
>>>>>>>>>>> with an
>>>>>>>>>>> unexpected error:
>>>>>>>>>>>
>>>>>>>>>>> [root at vm-086 ~]# ipa-replica-manage clean_ruv 5
>>>>>>>>>>> Clean the Replication Update Vector for
>>>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>>>>>
>>>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>>>>>> longer replicate so it may miss updates while the process
>>>>>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>>>>>> consistency. Be very careful.
>>>>>>>>>>> Continue to clean? [no]: y
>>>>>>>>>>> unexpected error: {'desc': 'Operations error'}
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> /var/log/dirsrv/slapd-IDM-LAB-BOS-REDHAT-COM/errors:
>>>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task: failed to connect to repl agreement connection (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping tree,cn=config), error 105
>>>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task: replica (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping tree,cn=config) has not been cleaned.  You will need to rerun the CLEANALLRUV task on this replica.
>>>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task: Task failed (1)
>>>>>>>>>>>
>>>>>>>>>>> In this case I think we should inform the user that the command failed,
>>>>>>>>>>> possibly because of disconnected replicas, and that they could bring the
>>>>>>>>>>> replicas back online and try again.
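>>>>>>>>>>>
>>>>>>>>>>> Something like this around the task call would be friendlier (only a sketch;
>>>>>>>>>>> I am guessing that the 'Operations error' surfaces as ldap.OPERATIONS_ERROR
>>>>>>>>>>> and that the command calls the task via thisrepl.cleanallruv):
>>>>>>>>>>>
>>>>>>>>>>> try:
>>>>>>>>>>>     thisrepl.cleanallruv(ruv)
>>>>>>>>>>> except ldap.OPERATIONS_ERROR:
>>>>>>>>>>>     sys.exit("Clean RUV task failed. One or more replicas may be down or "
>>>>>>>>>>>              "unreachable; bring them back online and run the command again.")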
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 7) (minor) "pass" is now redundant in replication.py:
>>>>>>>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>>>>>>>> +            # We can't make the server we're removing read-only
>>>>>>>>>>> but
>>>>>>>>>>> +            # this isn't a show-stopper
>>>>>>>>>>> +            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
>>>>>>>>>>> +            pass
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I think this addresses everything.
>>>>>>>>>>
>>>>>>>>>> rob
>>>>>>>>>
>>>>>>>>> Thanks, almost there! I just found one more issue which needs to be
>>>>>>>>> fixed
>>>>>>>>> before we push:
>>>>>>>>>
>>>>>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com --force
>>>>>>>>> Directory Manager password:
>>>>>>>>>
>>>>>>>>> Unable to connect to replica vm-055.idm.lab.bos.redhat.com, forcing
>>>>>>>>> removal
>>>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>>>>>> "Can't
>>>>>>>>> contact LDAP server"}
>>>>>>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>>>>>>
>>>>>>>>> There were issues removing a connection: %d format: a number is
>>>>>>>>> required, not str
>>>>>>>>>
>>>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>>>>>> "Can't
>>>>>>>>> contact LDAP server"}
>>>>>>>>>
>>>>>>>>> This is a traceback I retrieved:
>>>>>>>>> Traceback (most recent call last):
>>>>>>>>>     File "/sbin/ipa-replica-manage", line 425, in del_master
>>>>>>>>>       del_link(realm, r, hostname, options.dirman_passwd, force=True)
>>>>>>>>>     File "/sbin/ipa-replica-manage", line 271, in del_link
>>>>>>>>>       repl1.cleanallruv(replica_id)
>>>>>>>>>     File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1094, in cleanallruv
>>>>>>>>>       root_logger.debug("Creating CLEANALLRUV task for replica id %d" % replicaId)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The problem here is that you don't convert replica_id to int in this
>>>>>>>>> part:
>>>>>>>>> +    replica_id = None
>>>>>>>>> +    if repl2:
>>>>>>>>> +        replica_id = repl2._get_replica_id(repl2.conn, None)
>>>>>>>>> +    else:
>>>>>>>>> +        servers = get_ruv(realm, replica1, dirman_passwd)
>>>>>>>>> +        for (netloc, rid) in servers:
>>>>>>>>> +            if netloc.startswith(replica2):
>>>>>>>>> +                replica_id = rid
>>>>>>>>> +                break
>>>>>>>>>
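>>>>>>>>> I.e. converting the value when it is taken from the RUV list should be enough.
>>>>>>>>> A sketch of the fix I have in mind (get_ruv apparently returns the rid as a
>>>>>>>>> string):
>>>>>>>>>
>>>>>>>>>     servers = get_ruv(realm, replica1, dirman_passwd)
>>>>>>>>>     replica_id = None
>>>>>>>>>     for (netloc, rid) in servers:
>>>>>>>>>         if netloc.startswith(replica2):
>>>>>>>>>             replica_id = int(rid)
>>>>>>>>>             break
>>>>>>>>>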
>>>>>>>>> Martin
>>>>>>>>>
>>>>>>>>
>>>>>>>> Updated patch using the new mechanism in 389-ds-base. This should more
>>>>>>>> thoroughly clean out RUV data when a replica is being deleted, and
>>>>>>>> provide a way to delete RUV data afterwards too if necessary.
>>>>>>>>
>>>>>>>> rob
>>>>>>>
>>>>>>> Rebased patch
>>>>>>>
>>>>>>> rob
>>>>>>>
>>>>>>
>>>>>> 0) As I wrote in the review of your patch 1041, the changelog entry has
>>>>>> slipped elsewhere.
>>>>>>
>>>>>> 1) The following KeyboardInterrupt except clause looks suspicious. I know why
>>>>>> you have it there, but since swallowing an interrupt is generally a bad thing
>>>>>> to do, a comment explaining why it is needed would be useful.
>>>>>>
>>>>>> @@ -256,6 +263,17 @@ def del_link(realm, replica1, replica2, dirman_passwd, force=False):
>>>>>>       repl1.delete_agreement(replica2)
>>>>>>       repl1.delete_referral(replica2)
>>>>>>
>>>>>> +    if type1 == replication.IPA_REPLICA:
>>>>>> +        if repl2:
>>>>>> +            ruv = repl2._get_replica_id(repl2.conn, None)
>>>>>> +        else:
>>>>>> +            ruv = get_ruv_by_host(realm, replica1, replica2, dirman_passwd)
>>>>>> +
>>>>>> +        try:
>>>>>> +            repl1.cleanallruv(ruv)
>>>>>> +        except KeyboardInterrupt:
>>>>>> +            pass
>>>>>> +
>>>>>>
>>>>>> Maybe you just wanted to do some cleanup and then "raise" again?
>>>>>
>>>>> No, it is there because it is safe to break out of it. The task will
>>>>> continue to run. I added some verbiage.
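>>>>>
>>>>> Roughly along these lines (the exact comment wording in the patch may differ):
>>>>>
>>>>>     try:
>>>>>         repl1.cleanallruv(ruv)
>>>>>     except KeyboardInterrupt:
>>>>>         # Hitting Ctrl-C here only stops us waiting for the task; the
>>>>>         # CLEANALLRUV task itself runs inside 389-ds and keeps going,
>>>>>         # so it is safe to swallow the interrupt and carry on deleting.
>>>>>         pass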
>>>>>
>>>>>>
>>>>>> 2) This is related to 1): "ipa-replica-manage del" may wait indefinitely
>>>>>> when some remote replica is down, right?
>>>>>>
>>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>>>>>> Deleting a master is irreversible.
>>>>>> To reconnect to the remote master you will need to prepare a new
>>>>>> replica file
>>>>>> and re-install.
>>>>>> Continue to delete? [no]: y
>>>>>> ipa: INFO: Setting agreement cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping tree,cn=config schedule to 2358-2359 0 to force synch
>>>>>> ipa: INFO: Deleting schedule 2358-2359 0 from agreement cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping tree,cn=config
>>>>>> ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica acquired successfully: Incremental update succeeded: start: 0: end: 0
>>>>>> Background task created to clean replication data
>>>>>>
>>>>>> ... after about a minute I hit CTRL+C
>>>>>>
>>>>>> ^CDeleted replication agreement from 'vm-086.idm.lab.bos.redhat.com' to
>>>>>> 'vm-055.idm.lab.bos.redhat.com'
>>>>>> Failed to cleanup vm-055.idm.lab.bos.redhat.com DNS entries: NS record
>>>>>> does not
>>>>>> contain 'vm-055.idm.lab.bos.redhat.com.'
>>>>>> You may need to manually remove them from the tree
>>>>>>
>>>>>> I think it would be better to inform the user that some remote replica is
>>>>>> down, or at least that we are waiting for the task to complete. Something
>>>>>> like this:
>>>>>>
>>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>>>>>> ...
>>>>>> Background task created to clean replication data
>>>>>> Replication data cleanup may take a very long time if some replica is
>>>>>> unreachable
>>>>>> Hit CTRL+C to interrupt the wait
>>>>>> ^C Clean up wait interrupted
>>>>>> ....
>>>>>> [continue with del]
>>>>>
>>>>> Yup, did this in #1.
>>>>>
>>>>>>
>>>>>> 3) (minor) When there is a cleanruv task already running and you run
>>>>>> "ipa-replica-manage del", there is an unexpected error message about a
>>>>>> duplicate task object in LDAP:
>>>>>>
>>>>>> # ipa-replica-manage del vm-072.idm.lab.bos.redhat.com --force
>>>>>> Unable to connect to replica vm-072.idm.lab.bos.redhat.com, forcing
>>>>>> removal
>>>>>> FAIL
>>>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
>>>>>> contact LDAP server"}
>>>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>>>
>>>>>> There were issues removing a connection: This entry already exists
>>>>>> <<<<<<<<<
>>>>>>
>>>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
>>>>>> contact LDAP server"}
>>>>>> Failed to cleanup vm-072.idm.lab.bos.redhat.com DNS entries: NS record
>>>>>> does not
>>>>>> contain 'vm-072.idm.lab.bos.redhat.com.'
>>>>>> You may need to manually remove them from the tree
>>>>>>
>>>>>>
>>>>>> I think it should be enough to just catch "entry already exists" in the
>>>>>> cleanallruv function and, in that case, print a relevant error message and
>>>>>> bail out. That way self.conn.checkTask(dn, dowait=True) would not be called
>>>>>> either.
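>>>>>>
>>>>>> E.g. something like this in cleanallruv() (only a sketch, and the message
>>>>>> wording is arbitrary; DuplicateEntry is what our ipaldap layer raises for
>>>>>> "entry already exists"):
>>>>>>
>>>>>>     try:
>>>>>>         self.conn.addEntry(e)
>>>>>>     except errors.DuplicateEntry:
>>>>>>         # a CLEANALLRUV task for this replica id is already in cn=tasks,
>>>>>>         # so just tell the user and return instead of waiting on it
>>>>>>         root_logger.info("A CLEANALLRUV task for replica id %d is already running" % replicaId)
>>>>>>         return
>>>>>>     self.conn.checkTask(dn, dowait=True)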
>>>>>
>>>>> Good catch, fixed.
>>>>>
>>>>>>
>>>>>>
>>>>>> 4) (minor): In the make_readonly function, there is a redundant "pass"
>>>>>> statement:
>>>>>>
>>>>>> +    def make_readonly(self):
>>>>>> +        """
>>>>>> +        Make the current replication agreement read-only.
>>>>>> +        """
>>>>>> +        dn = DN(('cn', 'userRoot'), ('cn', 'ldbm database'),
>>>>>> +                ('cn', 'plugins'), ('cn', 'config'))
>>>>>> +
>>>>>> +        mod = [(ldap.MOD_REPLACE, 'nsslapd-readonly', 'on')]
>>>>>> +        try:
>>>>>> +            self.conn.modify_s(dn, mod)
>>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>>> +            # We can't make the server we're removing read-only but
>>>>>> +            # this isn't a show-stopper
>>>>>> +            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
>>>>>> +            pass         <<<<<<<<<<<<<<<
>>>>>
>>>>> Yeah, this is one of my common mistakes. I put in a pass initially, then
>>>>> add logging in front of it and forget to delete the pass. It's gone now.
>>>>>
>>>>>>
>>>>>>
>>>>>> 5) In clean_ruv, I think allowing a --force option to bypass the user_input
>>>>>> prompt would be helpful (at least for test automation):
>>>>>>
>>>>>> +    if not ipautil.user_input("Continue to clean?", False):
>>>>>> +        sys.exit("Aborted")
>>>>>
>>>>> Yup, added.
>>>>>
>>>>> rob
>>>>
>>>> Slightly revised patch. I still had a window open with one unsaved change.
>>>>
>>>> rob
>>>>
>>>
>>> Apparently there were two unsaved changes, one of which was lost. This adds in
>>> the 'entry already exists' fix.
>>>
>>> rob
>>>
>>
>> Just one last thing (otherwise the patch is OK) - I don't think this is what we
>> want :-)
>>
>> # ipa-replica-manage clean-ruv 8
>> Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389
>>
>> Cleaning the wrong replica ID will cause that server to no
>> longer replicate so it may miss updates while the process
>> is running. It would need to be re-initialized to maintain
>> consistency. Be very careful.
>> Continue to clean? [no]: y   <<<<<<
>> Aborted
>>
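>> What I would expect here is roughly this (a sketch of the intended behaviour,
>> not necessarily how your patch is structured):
>>
>>     if not options.force:
>>         if not ipautil.user_input("Continue to clean?", False):
>>             sys.exit("Aborted")
>>     # answering "y" (or passing --force) should fall through and run the clean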
>>
>> Nor this exception (you are checking for the wrong exception):
>>
>> # ipa-replica-manage clean-ruv 8
>> Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389
>>
>> Cleaning the wrong replica ID will cause that server to no
>> longer replicate so it may miss updates while the process
>> is running. It would need to be re-initialized to maintain
>> consistency. Be very careful.
>> Continue to clean? [no]:
>> unexpected error: This entry already exists
>>
>> This is the exception:
>>
>> Traceback (most recent call last):
>>   File "/sbin/ipa-replica-manage", line 651, in <module>
>>     main()
>>   File "/sbin/ipa-replica-manage", line 648, in main
>>     clean_ruv(realm, args[1], options)
>>   File "/sbin/ipa-replica-manage", line 373, in clean_ruv
>>     thisrepl.cleanallruv(ruv)
>>   File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py", line 1136, in cleanallruv
>>     self.conn.addEntry(e)
>>   File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 503, in addEntry
>>     self.__handle_errors(e, arg_desc=arg_desc)
>>   File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 321, in __handle_errors
>>     raise errors.DuplicateEntry()
>> ipalib.errors.DuplicateEntry: This entry already exists
>>
>> Martin
>>
> 
> 
> On another matter, I just noticed that CLEANALLRUV does not proceed if I have a
> winsync replica defined (even though it is up):
> 
> # ipa-replica-manage list
> dc.ad.test: winsync   <<<<<<<
> vm-072.idm.lab.bos.redhat.com: master
> vm-086.idm.lab.bos.redhat.com: master
> 
> [06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Waiting
> for all the replicas to receive all the deleted replica updates...
> [06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Failed
> to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10), will retry later.
> [06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Not all
> replicas caught up, retrying in 10 seconds
> [06/Sep/2012:11:59:20 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Failed
> to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10), will retry later.
> [06/Sep/2012:11:59:20 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Not all
> replicas caught up, retrying in 20 seconds
> [06/Sep/2012:11:59:40 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Failed
> to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10), will retry later.
> [06/Sep/2012:11:59:40 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Not all
> replicas caught up, retrying in 40 seconds
> 
> I don't think this is OK. Adding Rich to CC to follow up on this one.
> 
> Martin
> 

And now the actual CC.
Martin



