[Freeipa-devel] [PATCH] 1031 run cleanallruv task

Martin Kosek mkosek at redhat.com
Thu Sep 6 16:09:03 UTC 2012


On 09/06/2012 06:05 PM, Martin Kosek wrote:
> On 09/06/2012 05:55 PM, Rob Crittenden wrote:
>> Rob Crittenden wrote:
>>> Rob Crittenden wrote:
>>>> Martin Kosek wrote:
>>>>> On 09/05/2012 08:06 PM, Rob Crittenden wrote:
>>>>>> Rob Crittenden wrote:
>>>>>>> Martin Kosek wrote:
>>>>>>>> On 07/05/2012 08:39 PM, Rob Crittenden wrote:
>>>>>>>>> Martin Kosek wrote:
>>>>>>>>>> On 07/03/2012 04:41 PM, Rob Crittenden wrote:
>>>>>>>>>>> Deleting a replica can leave a replication update vector (RUV)
>>>>>>>>>>> on the other servers.
>>>>>>>>>>> This can confuse things if the replica is re-added, and it also
>>>>>>>>>>> causes the
>>>>>>>>>>> server to calculate changes against a server that may no longer
>>>>>>>>>>> exist.
>>>>>>>>>>>
>>>>>>>>>>> 389-ds-base provides a new task that propagates itself to all
>>>>>>>>>>> available
>>>>>>>>>>> replicas to clean this RUV data.
>>>>>>>>>>>
>>>>>>>>>>> This patch will create this task at deletion time to hopefully
>>>>>>>>>>> clean things up.
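For context, a CLEANALLRUV run in 389-ds-base is started by adding a task
entry under cn=tasks,cn=config. Below is a minimal sketch of building such an
entry; the DN layout and attribute names are my assumptions about the task
interface, and the helper name is invented — the actual patch drives this
through IPA's replication helpers:

```python
# Hedged sketch: build the DN and attributes for a 389-ds CLEANALLRUV
# task entry. Attribute names (replica-base-dn, replica-id) are
# assumptions about the cn=tasks interface, not taken from the patch.
def make_cleanallruv_entry(replica_id, suffix_dn):
    dn = "cn=clean %d,cn=cleanallruv,cn=tasks,cn=config" % replica_id
    attrs = {
        "objectclass": ["top", "extensibleObject"],
        "cn": ["clean %d" % replica_id],
        "replica-base-dn": [suffix_dn],   # suffix whose RUV is cleaned
        "replica-id": [str(replica_id)],  # the stale replica id to purge
    }
    return dn, attrs
```

Once added, the directory server propagates the cleanup to the other
replicas on its own; the client only needs to poll or wait on the task entry.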
>>>>>>>>>>>
>>>>>>>>>>> It isn't perfect. If any replica is down or unavailable at the
>>>>>>>>>>> time
>>>>>>>>>>> the
>>>>>>>>>>> cleanruv task fires, and then comes back up, the old RUV data
>>>>>>>>>>> may be
>>>>>>>>>>> re-propagated around.
>>>>>>>>>>>
>>>>>>>>>>> To make things easier in this case I've added two new commands to
>>>>>>>>>>> ipa-replica-manage. The first lists the replication ids of all the
>>>>>>>>>>> servers we
>>>>>>>>>>> have a RUV for. Using this you can call clean_ruv with the
>>>>>>>>>>> replication id of a
>>>>>>>>>>> server that no longer exists to try the cleanallruv step again.
>>>>>>>>>>>
>>>>>>>>>>> This is quite dangerous though. If you run cleanruv against a
>>>>>>>>>>> replica id that
>>>>>>>>>>> does exist it can cause a loss of data. I believe I've put in
>>>>>>>>>>> enough scary
>>>>>>>>>>> warnings about this.
>>>>>>>>>>>
>>>>>>>>>>> rob
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Good work there, this should make cleaning RUVs much easier than
>>>>>>>>>> with the
>>>>>>>>>> previous version.
>>>>>>>>>>
>>>>>>>>>> This is what I found during review:
>>>>>>>>>>
>>>>>>>>>> 1) The list_ruv and clean_ruv command help in the man page is
>>>>>>>>>> quite lost. I think it would help if, for example, all the
>>>>>>>>>> per-command info was indented. As it is, users could easily
>>>>>>>>>> overlook the new commands in the man page.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2) I would rename new commands to clean-ruv and list-ruv to make
>>>>>>>>>> them
>>>>>>>>>> consistent with the rest of the commands (re-initialize,
>>>>>>>>>> force-sync).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 3) It would be nice to be able to run the clean_ruv command in
>>>>>>>>>> an unattended way (for better testing), i.e. respect the --force
>>>>>>>>>> option as we already do for ipa-replica-manage del. This fix
>>>>>>>>>> would aid test automation in the future.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 4) (minor) The new question (and the del one too) does not
>>>>>>>>>> react well to CTRL+D:
>>>>>>>>>>
>>>>>>>>>> # ipa-replica-manage clean_ruv 3 --force
>>>>>>>>>> Clean the Replication Update Vector for
>>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>>>>
>>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>>>>> longer replicate so it may miss updates while the process
>>>>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>>>>> consistency. Be very careful.
>>>>>>>>>> Continue to clean? [no]: unexpected error:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 5) Running the clean_ruv command without its required parameter
>>>>>>>>>> gives a confusing error, as it reports that the command is wrong
>>>>>>>>>> rather than the parameter:
>>>>>>>>>>
>>>>>>>>>> # ipa-replica-manage clean_ruv
>>>>>>>>>> Usage: ipa-replica-manage [options]
>>>>>>>>>>
>>>>>>>>>> ipa-replica-manage: error: must provide a command [clean_ruv |
>>>>>>>>>> force-sync |
>>>>>>>>>> disconnect | connect | del | re-initialize | list | list_ruv]
>>>>>>>>>>
>>>>>>>>>> It seems you just forgot to specify the error message in the
>>>>>>>>>> command definition.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 6) When the remote replica is down, the clean_ruv command fails
>>>>>>>>>> with an
>>>>>>>>>> unexpected error:
>>>>>>>>>>
>>>>>>>>>> [root at vm-086 ~]# ipa-replica-manage clean_ruv 5
>>>>>>>>>> Clean the Replication Update Vector for
>>>>>>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>>>>>>
>>>>>>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>>>>>>> longer replicate so it may miss updates while the process
>>>>>>>>>> is running. It would need to be re-initialized to maintain
>>>>>>>>>> consistency. Be very careful.
>>>>>>>>>> Continue to clean? [no]: y
>>>>>>>>>> unexpected error: {'desc': 'Operations error'}
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> /var/log/dirsrv/slapd-IDM-LAB-BOS-REDHAT-COM/errors:
>>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>>>>>> cleanAllRUV_task: failed
>>>>>>>>>> to connect to repl agreement connection
>>>>>>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping
>>>>>>>>>> tree,cn=config), error 105
>>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>>>>>> cleanAllRUV_task: replica
>>>>>>>>>> (cn=meTovm-055.idm.lab.
>>>>>>>>>> bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping
>>>>>>>>>> tree,cn=config) has not been cleaned.  You will need to rerun
>>>>>>>>>> the CLEANALLRUV task on this replica.
>>>>>>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin -
>>>>>>>>>> cleanAllRUV_task: Task
>>>>>>>>>> failed (1)
>>>>>>>>>>
>>>>>>>>>> In this case I think we should inform the user that the command
>>>>>>>>>> failed, possibly because of disconnected replicas, and suggest
>>>>>>>>>> that they bring the replicas back up and try again.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 7) (minor) "pass" is now redundant in replication.py:
>>>>>>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>>>>>>> +            # We can't make the server we're removing read-only
>>>>>>>>>> but
>>>>>>>>>> +            # this isn't a show-stopper
>>>>>>>>>> +            root_logger.debug("No permission to switch replica to
>>>>>>>>>> read-only,
>>>>>>>>>> continuing anyway")
>>>>>>>>>> +            pass
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I think this addresses everything.
>>>>>>>>>
>>>>>>>>> rob
>>>>>>>>
>>>>>>>> Thanks, almost there! I just found one more issue which needs to be
>>>>>>>> fixed
>>>>>>>> before we push:
>>>>>>>>
>>>>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com --force
>>>>>>>> Directory Manager password:
>>>>>>>>
>>>>>>>> Unable to connect to replica vm-055.idm.lab.bos.redhat.com, forcing
>>>>>>>> removal
>>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>>>>> "Can't
>>>>>>>> contact LDAP server"}
>>>>>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>>>>>
>>>>>>>> There were issues removing a connection: %d format: a number is
>>>>>>>> required, not str
>>>>>>>>
>>>>>>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc':
>>>>>>>> "Can't
>>>>>>>> contact LDAP server"}
>>>>>>>>
>>>>>>>> This is a traceback I retrieved:
>>>>>>>> Traceback (most recent call last):
>>>>>>>>     File "/sbin/ipa-replica-manage", line 425, in del_master
>>>>>>>>       del_link(realm, r, hostname, options.dirman_passwd, force=True)
>>>>>>>>     File "/sbin/ipa-replica-manage", line 271, in del_link
>>>>>>>>       repl1.cleanallruv(replica_id)
>>>>>>>>     File
>>>>>>>> "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py",
>>>>>>>> line 1094, in cleanallruv
>>>>>>>>       root_logger.debug("Creating CLEANALLRUV task for replica id
>>>>>>>> %d" %
>>>>>>>> replicaId)
>>>>>>>>
>>>>>>>>
>>>>>>>> The problem here is that you don't convert replica_id to int in this
>>>>>>>> part:
>>>>>>>> +    replica_id = None
>>>>>>>> +    if repl2:
>>>>>>>> +        replica_id = repl2._get_replica_id(repl2.conn, None)
>>>>>>>> +    else:
>>>>>>>> +        servers = get_ruv(realm, replica1, dirman_passwd)
>>>>>>>> +        for (netloc, rid) in servers:
>>>>>>>> +            if netloc.startswith(replica2):
>>>>>>>> +                replica_id = rid
>>>>>>>> +                break
>>>>>>>>
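A sketch of the fix being suggested here: RUV data read back from LDAP
arrives as strings, so the replica id must be coerced to int before it
reaches the `%d` format in cleanallruv(). The helper below is illustrative
only, with names borrowed from the quoted snippet, not the actual patch:

```python
# Sketch of the suggested fix: get_ruv() returns (netloc, rid) pairs
# where rid is a string straight from LDAP, so convert it to int here
# rather than letting it reach a "%d" format later and blow up.
def find_replica_id(servers, replica2):
    """servers: list of (netloc, rid) pairs; replica2: hostname prefix."""
    for (netloc, rid) in servers:
        if netloc.startswith(replica2):
            return int(rid)  # coerce: LDAP attribute values are strings
    return None
```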
>>>>>>>> Martin
>>>>>>>>
>>>>>>>
>>>>>>> Updated patch using new mechanism in 389-ds-base. This should more
>>>>>>> thoroughly clean out RUV data when a replica is being deleted, and
>>>>>>> provide for a way to delete RUV data afterwards too if necessary.
>>>>>>>
>>>>>>> rob
>>>>>>
>>>>>> Rebased patch
>>>>>>
>>>>>> rob
>>>>>>
>>>>>
>>>>> 0) As I wrote in a review for your patch 1041, changelog entry slipped
>>>>> elsewhere.
>>>>>
>>>>> 1) The following KeyboardInterrupt except clause looks suspicious. I
>>>>> know why you have it there, but since swallowing KeyboardInterrupt is
>>>>> generally a bad thing to do, a comment explaining why it is needed
>>>>> would be useful.
>>>>>
>>>>> @@ -256,6 +263,17 @@ def del_link(realm, replica1, replica2,
>>>>> dirman_passwd,
>>>>> force=False):
>>>>>       repl1.delete_agreement(replica2)
>>>>>       repl1.delete_referral(replica2)
>>>>>
>>>>> +    if type1 == replication.IPA_REPLICA:
>>>>> +        if repl2:
>>>>> +            ruv = repl2._get_replica_id(repl2.conn, None)
>>>>> +        else:
>>>>> +            ruv = get_ruv_by_host(realm, replica1, replica2,
>>>>> dirman_passwd)
>>>>> +
>>>>> +        try:
>>>>> +            repl1.cleanallruv(ruv)
>>>>> +        except KeyboardInterrupt:
>>>>> +            pass
>>>>> +
>>>>>
>>>>> Maybe you just wanted to do some cleanup and then "raise" again?
>>>>
>>>> No, it is there because it is safe to break out of it. The task will
>>>> continue to run. I added some verbiage.
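The pattern Rob describes can be sketched like this (the helper name and
return-value convention are hypothetical; the real code lives inline in
ipa-replica-manage):

```python
# Hedged sketch of why swallowing KeyboardInterrupt is safe here: the
# CLEANALLRUV task runs asynchronously inside the directory server, so
# aborting the local wait does not cancel the server-side cleanup.
def wait_for_cleanallruv(repl, replica_id):
    """Return True if the wait completed, False if the user interrupted."""
    try:
        repl.cleanallruv(replica_id)
    except KeyboardInterrupt:
        # Safe to break out: the task keeps running on the server.
        print("Wait interrupted; the task will keep running on the server")
        return False
    return True
```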
>>>>
>>>>>
>>>>> 2) This is related to 1): when some remote replica is down,
>>>>> "ipa-replica-manage del" may wait indefinitely, right?
>>>>>
>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>>>>> Deleting a master is irreversible.
>>>>> To reconnect to the remote master you will need to prepare a new
>>>>> replica file
>>>>> and re-install.
>>>>> Continue to delete? [no]: y
>>>>> ipa: INFO: Setting agreement
>>>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
>>>>> tree,cn=config schedule to 2358-2359 0 to force synch
>>>>> ipa: INFO: Deleting schedule 2358-2359 0 from agreement
>>>>> cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
>>>>> tree,cn=config
>>>>> ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica
>>>>> acquired
>>>>> successfully: Incremental update succeeded: start: 0: end: 0
>>>>> Background task created to clean replication data
>>>>>
>>>>> ... after about a minute I hit CTRL+C
>>>>>
>>>>> ^CDeleted replication agreement from 'vm-086.idm.lab.bos.redhat.com' to
>>>>> 'vm-055.idm.lab.bos.redhat.com'
>>>>> Failed to cleanup vm-055.idm.lab.bos.redhat.com DNS entries: NS record
>>>>> does not
>>>>> contain 'vm-055.idm.lab.bos.redhat.com.'
>>>>> You may need to manually remove them from the tree
>>>>>
>>>>> I think it would be better to inform the user that some remote
>>>>> replica is down, or at least that we are waiting for the task to
>>>>> complete. Something like this:
>>>>>
>>>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
>>>>> ...
>>>>> Background task created to clean replication data
>>>>> Replication data clean up may take a very long time if some replica
>>>>> is unreachable
>>>>> Hit CTRL+C to interrupt the wait
>>>>> ^C Clean up wait interrupted
>>>>> ....
>>>>> [continue with del]
>>>>
>>>> Yup, did this in #1.
>>>>
>>>>>
>>>>> 3) (minor) When there is a cleanruv task running and you run
>>>>> "ipa-replica-manage del", there is an unexpected error message caused
>>>>> by a duplicate task object in LDAP:
>>>>>
>>>>> # ipa-replica-manage del vm-072.idm.lab.bos.redhat.com --force
>>>>> Unable to connect to replica vm-072.idm.lab.bos.redhat.com, forcing
>>>>> removal
>>>>> FAIL
>>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
>>>>> contact LDAP server"}
>>>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>>>
>>>>> There were issues removing a connection: This entry already exists
>>>>> <<<<<<<<<
>>>>>
>>>>> Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
>>>>> contact LDAP server"}
>>>>> Failed to cleanup vm-072.idm.lab.bos.redhat.com DNS entries: NS record
>>>>> does not
>>>>> contain 'vm-072.idm.lab.bos.redhat.com.'
>>>>> You may need to manually remove them from the tree
>>>>>
>>>>>
>>>>> I think it should be enough to catch the "entry already exists" error
>>>>> in the cleanallruv function, and in that case print a relevant error
>>>>> message and bail out. That way, self.conn.checkTask(dn, dowait=True)
>>>>> would not be called either.
>>>>
>>>> Good catch, fixed.
>>>>
>>>>>
>>>>>
>>>>> 4) (minor) In the make_readonly function, there is a redundant "pass"
>>>>> statement:
>>>>>
>>>>> +    def make_readonly(self):
>>>>> +        """
>>>>> +        Make the current replication agreement read-only.
>>>>> +        """
>>>>> +        dn = DN(('cn', 'userRoot'), ('cn', 'ldbm database'),
>>>>> +                ('cn', 'plugins'), ('cn', 'config'))
>>>>> +
>>>>> +        mod = [(ldap.MOD_REPLACE, 'nsslapd-readonly', 'on')]
>>>>> +        try:
>>>>> +            self.conn.modify_s(dn, mod)
>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>> +            # We can't make the server we're removing read-only but
>>>>> +            # this isn't a show-stopper
>>>>> +            root_logger.debug("No permission to switch replica to
>>>>> read-only,
>>>>> continuing anyway")
>>>>> +            pass         <<<<<<<<<<<<<<<
>>>>
>>>> Yeah, this is one of my common mistakes. I put in a pass initially,
>>>> then add logging in front of it and forget to delete the pass. It's
>>>> gone now.
>>>>
>>>>>
>>>>>
>>>>> 5) In clean_ruv, I think allowing a --force option to bypass the
>>>>> user_input
>>>>> would be helpful (at least for test automation):
>>>>>
>>>>> +    if not ipautil.user_input("Continue to clean?", False):
>>>>> +        sys.exit("Aborted")
>>>>
>>>> Yup, added.
>>>>
>>>> rob
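The --force handling agreed on above might look roughly like this
(confirm_clean and its callable argument are hypothetical names; the real
code calls ipautil.user_input inline, as in the quoted snippet):

```python
# Hedged sketch of the --force bypass suggested in 5): skip the
# interactive confirmation when --force was given, so the command can
# run unattended (e.g. from test automation).
def confirm_clean(force, user_input):
    """user_input: callable with the ipautil.user_input(prompt, default)
    shape, returning True/False."""
    if force:
        return True  # --force skips the prompt entirely
    return user_input("Continue to clean?", False)
```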
>>>
>>> Slightly revised patch. I still had a window open with one unsaved change.
>>>
>>> rob
>>>
>>
>> Apparently there were two unsaved changes, one of which was lost. This adds in
>> the 'entry already exists' fix.
>>
>> rob
>>
> 
> Just one last thing (otherwise the patch is OK) - I don't think this is what we
> want :-)
> 
> # ipa-replica-manage clean-ruv 8
> Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389
> 
> Cleaning the wrong replica ID will cause that server to no
> longer replicate so it may miss updates while the process
> is running. It would need to be re-initialized to maintain
> consistency. Be very careful.
> Continue to clean? [no]: y   <<<<<<
> Aborted
> 
> 
> Nor this exception (you are checking for the wrong exception):
> 
> # ipa-replica-manage clean-ruv 8
> Clean the Replication Update Vector for vm-055.idm.lab.bos.redhat.com:389
> 
> Cleaning the wrong replica ID will cause that server to no
> longer replicate so it may miss updates while the process
> is running. It would need to be re-initialized to maintain
> consistency. Be very careful.
> Continue to clean? [no]:
> unexpected error: This entry already exists
> 
> This is the exception:
> 
> Traceback (most recent call last):
>   File "/sbin/ipa-replica-manage", line 651, in <module>
>     main()
>   File "/sbin/ipa-replica-manage", line 648, in main
>     clean_ruv(realm, args[1], options)
>   File "/sbin/ipa-replica-manage", line 373, in clean_ruv
>     thisrepl.cleanallruv(ruv)
>   File "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py",
> line 1136, in cleanallruv
>     self.conn.addEntry(e)
>   File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 503, in
> addEntry
>     self.__handle_errors(e, arg_desc=arg_desc)
>   File "/usr/lib/python2.7/site-packages/ipaserver/ipaldap.py", line 321, in
> __handle_errors
>     raise errors.DuplicateEntry()
> ipalib.errors.DuplicateEntry: This entry already exists
> 
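Given that traceback, the call site presumably needs to catch the translated
ipalib exception rather than the raw python-ldap one, since ipaldap's error
handler re-raises ldap.ALREADY_EXISTS as errors.DuplicateEntry. A sketch,
with DuplicateEntry standing in for ipalib.errors.DuplicateEntry and the
helper name invented for illustration:

```python
# Sketch only: ipaldap re-raises ldap.ALREADY_EXISTS as
# errors.DuplicateEntry (per the traceback above), so the caller must
# catch the translated exception. DuplicateEntry here is a local
# stand-in for ipalib.errors.DuplicateEntry.
class DuplicateEntry(Exception):
    pass

def add_cleanallruv_task(conn, entry, replica_id):
    """Add the task entry; report and bail if one already exists."""
    try:
        conn.addEntry(entry)
    except DuplicateEntry:
        print("CLEANALLRUV task for replica id %s already exists" % replica_id)
        return False  # skip the subsequent checkTask() wait
    return True
```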
> Martin
> 


On another matter, I just noticed that CLEANALLRUV does not proceed if I have
a winsync replica defined (even though it is up):

# ipa-replica-manage list
dc.ad.test: winsync   <<<<<<<
vm-072.idm.lab.bos.redhat.com: master
vm-086.idm.lab.bos.redhat.com: master

[06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Waiting
for all the replicas to receive all the deleted replica updates...
[06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Failed
to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10), will retry later.
[06/Sep/2012:11:59:10 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Not all
replicas caught up, retrying in 10 seconds
[06/Sep/2012:11:59:20 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Failed
to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10), will retry later.
[06/Sep/2012:11:59:20 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Not all
replicas caught up, retrying in 20 seconds
[06/Sep/2012:11:59:40 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Failed
to contact agmt (agmt="cn=meTodc.ad.test" (dc:389)) error (10), will retry later.
[06/Sep/2012:11:59:40 -0400] NSMMReplicationPlugin - CleanAllRUV Task: Not all
replicas caught up, retrying in 40 seconds

I don't think this is OK. Adding Rich to CC to follow up on this one.

Martin



