[Freeipa-devel] [PATCH] 1031 run cleanallruv task

Martin Kosek mkosek at redhat.com
Thu Sep 6 11:49:36 UTC 2012


On 09/05/2012 08:06 PM, Rob Crittenden wrote:
> Rob Crittenden wrote:
>> Martin Kosek wrote:
>>> On 07/05/2012 08:39 PM, Rob Crittenden wrote:
>>>> Martin Kosek wrote:
>>>>> On 07/03/2012 04:41 PM, Rob Crittenden wrote:
>>>>>> Deleting a replica can leave a replication vector (RUV) on the
>>>>>> other servers.
>>>>>> This can confuse things if the replica is re-added, and it also
>>>>>> causes the
>>>>>> server to calculate changes against a server that may no longer exist.
>>>>>>
>>>>>> 389-ds-base provides a new task that propagates itself to all
>>>>>> available
>>>>>> replicas to clean this RUV data.
>>>>>>
>>>>>> This patch will create this task at deletion time to hopefully
>>>>>> clean things up.
>>>>>>
>>>>>> It isn't perfect. If any replica is down or unavailable at the time
>>>>>> the
>>>>>> cleanruv task fires, and then comes back up, the old RUV data may be
>>>>>> re-propagated around.
>>>>>>
>>>>>> To make things easier in this case I've added two new commands to
>>>>>> ipa-replica-manage. The first lists the replication ids of all the
>>>>>> servers we
>>>>>> have a RUV for. Using this you can call clean_ruv with the
>>>>>> replication id of a
>>>>>> server that no longer exists to try the cleanallruv step again.
>>>>>>
>>>>>> This is quite dangerous though. If you run cleanruv against a
>>>>>> replica id that
>>>>>> does exist it can cause a loss of data. I believe I've put in
>>>>>> enough scary
>>>>>> warnings about this.
>>>>>>
>>>>>> rob
>>>>>>
>>>>>
>>>>> Good work there, this should make cleaning RUVs much easier than
>>>>> with the
>>>>> previous version.
>>>>>
>>>>> This is what I found during review:
>>>>>
>>>>> 1) The list_ruv and clean_ruv command help in the man page is quite
>>>>> lost. I think it would help if, for example, all the information for
>>>>> the commands was indented. As it is now, a user could easily overlook
>>>>> the new commands in the man page.
>>>>>
>>>>>
>>>>> 2) I would rename new commands to clean-ruv and list-ruv to make them
>>>>> consistent with the rest of the commands (re-initialize, force-sync).
>>>>>
>>>>>
>>>>> 3) It would be nice to be able to run the clean_ruv command in an
>>>>> unattended way (for better testing), i.e. respect the --force option as
>>>>> we already do for ipa-replica-manage del. This fix would aid test
>>>>> automation in the future.
>>>>>
>>>>>
>>>>> 4) (minor) The new question (and the del one too) does not react well
>>>>> to CTRL+D:
>>>>>
>>>>> # ipa-replica-manage clean_ruv 3 --force
>>>>> Clean the Replication Update Vector for
>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>
>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>> longer replicate so it may miss updates while the process
>>>>> is running. It would need to be re-initialized to maintain
>>>>> consistency. Be very careful.
>>>>> Continue to clean? [no]: unexpected error:
>>>>>
>>>>>
>>>>> 5) Help for the clean_ruv command without its required parameter is
>>>>> quite confusing, as it reports that the command is wrong rather than
>>>>> the parameter:
>>>>>
>>>>> # ipa-replica-manage clean_ruv
>>>>> Usage: ipa-replica-manage [options]
>>>>>
>>>>> ipa-replica-manage: error: must provide a command [clean_ruv |
>>>>> force-sync |
>>>>> disconnect | connect | del | re-initialize | list | list_ruv]
>>>>>
>>>>> It seems you just forgot to specify the error message in the command
>>>>> definition
>>>>>
>>>>>
>>>>> 6) When the remote replica is down, the clean_ruv command fails with an
>>>>> unexpected error:
>>>>>
>>>>> [root@vm-086 ~]# ipa-replica-manage clean_ruv 5
>>>>> Clean the Replication Update Vector for
>>>>> vm-055.idm.lab.bos.redhat.com:389
>>>>>
>>>>> Cleaning the wrong replica ID will cause that server to no
>>>>> longer replicate so it may miss updates while the process
>>>>> is running. It would need to be re-initialized to maintain
>>>>> consistency. Be very careful.
>>>>> Continue to clean? [no]: y
>>>>> unexpected error: {'desc': 'Operations error'}
>>>>>
>>>>>
>>>>> /var/log/dirsrv/slapd-IDM-LAB-BOS-REDHAT-COM/errors:
>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
>>>>> failed to connect to repl agreement connection
>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping tree,cn=config), error 105
>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
>>>>> replica
>>>>> (cn=meTovm-055.idm.lab.bos.redhat.com,cn=replica,cn=dc\3Didm\2Cdc\3Dlab\2Cdc\3Dbos\2Cdc\3Dredhat\2Cdc\3Dcom,cn=mapping tree,cn=config)
>>>>> has not been cleaned.  You will need to rerun the CLEANALLRUV task on
>>>>> this replica.
>>>>> [04/Jul/2012:06:28:16 -0400] NSMMReplicationPlugin - cleanAllRUV_task:
>>>>> Task failed (1)
>>>>>
>>>>> In this case I think we should inform the user that the command failed,
>>>>> possibly because of disconnected replicas, and suggest that they bring
>>>>> the replicas back online and try again.
>>>>>
>>>>>
>>>>> 7) (minor) "pass" is now redundant in replication.py:
>>>>> +        except ldap.INSUFFICIENT_ACCESS:
>>>>> +            # We can't make the server we're removing read-only but
>>>>> +            # this isn't a show-stopper
>>>>> +            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
>>>>> +            pass
>>>>>
>>>>
>>>> I think this addresses everything.
>>>>
>>>> rob
>>>
>>> Thanks, almost there! I just found one more issue which needs to be fixed
>>> before we push:
>>>
>>> # ipa-replica-manage del vm-055.idm.lab.bos.redhat.com --force
>>> Directory Manager password:
>>>
>>> Unable to connect to replica vm-055.idm.lab.bos.redhat.com, forcing
>>> removal
>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc': "Can't
>>> contact LDAP server"}
>>> Forcing removal on 'vm-086.idm.lab.bos.redhat.com'
>>>
>>> There were issues removing a connection: %d format: a number is
>>> required, not str
>>>
>>> Failed to get data from 'vm-055.idm.lab.bos.redhat.com': {'desc': "Can't
>>> contact LDAP server"}
>>>
>>> This is a traceback I retrieved:
>>> Traceback (most recent call last):
>>>    File "/sbin/ipa-replica-manage", line 425, in del_master
>>>      del_link(realm, r, hostname, options.dirman_passwd, force=True)
>>>    File "/sbin/ipa-replica-manage", line 271, in del_link
>>>      repl1.cleanallruv(replica_id)
>>>    File
>>> "/usr/lib/python2.7/site-packages/ipaserver/install/replication.py",
>>> line 1094, in cleanallruv
>>>      root_logger.debug("Creating CLEANALLRUV task for replica id %d" %
>>> replicaId)
>>>
>>>
>>> The problem here is that you don't convert replica_id to int in this
>>> part:
>>> +    replica_id = None
>>> +    if repl2:
>>> +        replica_id = repl2._get_replica_id(repl2.conn, None)
>>> +    else:
>>> +        servers = get_ruv(realm, replica1, dirman_passwd)
>>> +        for (netloc, rid) in servers:
>>> +            if netloc.startswith(replica2):
>>> +                replica_id = rid
>>> +                break
>>>
>>> Martin
>>>
>>
>> Updated patch using new mechanism in 389-ds-base. This should more
>> thoroughly clean out RUV data when a replica is being deleted, and
>> provide for a way to delete RUV data afterwards too if necessary.
>>
>> rob
> 
> Rebased patch
> 
> rob
> 

0) As I wrote in the review of your patch 1041, the changelog entry slipped elsewhere.

1) The following KeyboardInterrupt except clause looks suspicious. I know why
you have it there, but since silently swallowing KeyboardInterrupt is generally
a bad thing to do, a comment explaining why it is needed would be useful.

@@ -256,6 +263,17 @@ def del_link(realm, replica1, replica2, dirman_passwd,
force=False):
     repl1.delete_agreement(replica2)
     repl1.delete_referral(replica2)

+    if type1 == replication.IPA_REPLICA:
+        if repl2:
+            ruv = repl2._get_replica_id(repl2.conn, None)
+        else:
+            ruv = get_ruv_by_host(realm, replica1, replica2, dirman_passwd)
+
+        try:
+            repl1.cleanallruv(ruv)
+        except KeyboardInterrupt:
+            pass
+

Maybe you just wanted to do some cleanup and then "raise" again?
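If the intent was to do some cleanup before letting the interrupt propagate, a
minimal sketch could look like this (the function name and arguments are
hypothetical stand-ins for the del_link code above):

```python
def clean_ruv_and_reraise(repl, replica_id):
    """Run the CLEANALLRUV task; on CTRL+C, tell the user that the
    background task keeps running, then re-raise the interrupt."""
    try:
        repl.cleanallruv(replica_id)
    except KeyboardInterrupt:
        # The server-side task continues even if we stop waiting for it
        print("Clean up wait interrupted; the task continues on the server")
        raise
```

That keeps the user message while still letting the caller decide how to abort.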

2) This is related to 1): when some remote replica is down, "ipa-replica-manage
del" may wait indefinitely, right?

# ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
Deleting a master is irreversible.
To reconnect to the remote master you will need to prepare a new replica file
and re-install.
Continue to delete? [no]: y
ipa: INFO: Setting agreement
cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
tree,cn=config schedule to 2358-2359 0 to force synch
ipa: INFO: Deleting schedule 2358-2359 0 from agreement
cn=meTovm-086.idm.lab.bos.redhat.com,cn=replica,cn=dc\=idm\,dc\=lab\,dc\=bos\,dc\=redhat\,dc\=com,cn=mapping
tree,cn=config
ipa: INFO: Replication Update in progress: FALSE: status: 0 Replica acquired
successfully: Incremental update succeeded: start: 0: end: 0
Background task created to clean replication data

... after about a minute I hit CTRL+C

^CDeleted replication agreement from 'vm-086.idm.lab.bos.redhat.com' to
'vm-055.idm.lab.bos.redhat.com'
Failed to cleanup vm-055.idm.lab.bos.redhat.com DNS entries: NS record does not
contain 'vm-055.idm.lab.bos.redhat.com.'
You may need to manually remove them from the tree

I think it would be better to inform the user that some remote replica is down,
or at least that we are waiting for the task to complete. Something like this:

# ipa-replica-manage del vm-055.idm.lab.bos.redhat.com
...
Background task created to clean replication data
Replication data clean up may take a very long time if some replica is unreachable
Hit CTRL+C to interrupt the wait
^C Clean up wait interrupted
....
[continue with del]
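The messaging above could be a small wrapper around the task wait. This is only
a sketch, where wait_for_task is a hypothetical stand-in for the real polling
call (e.g. checkTask with dowait=True):

```python
def wait_for_cleanallruv(wait_for_task):
    """Warn the user that the wait may be long, and treat CTRL+C as
    'stop waiting' rather than as a fatal error."""
    print("Background task created to clean replication data")
    print("Replication data clean up may take a very long time "
          "if some replica is unreachable")
    print("Hit CTRL+C to interrupt the wait")
    try:
        wait_for_task()
    except KeyboardInterrupt:
        print("Clean up wait interrupted")
    # either way, the caller continues with the deletion
```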


3) (minor) When there is a cleanruv task already running and you run
"ipa-replica-manage del", there is an unexpected error message caused by a
duplicate task object in LDAP:

# ipa-replica-manage del vm-072.idm.lab.bos.redhat.com --force
Unable to connect to replica vm-072.idm.lab.bos.redhat.com, forcing removal
FAIL
Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
contact LDAP server"}
Forcing removal on 'vm-086.idm.lab.bos.redhat.com'

There were issues removing a connection: This entry already exists   <<<<<<<<<

Failed to get data from 'vm-072.idm.lab.bos.redhat.com': {'desc': "Can't
contact LDAP server"}
Failed to cleanup vm-072.idm.lab.bos.redhat.com DNS entries: NS record does not
contain 'vm-072.idm.lab.bos.redhat.com.'
You may need to manually remove them from the tree


I think it should be enough to catch the "entry already exists" error in the
cleanallruv function, print a relevant error message, and bail out. That way,
self.conn.checkTask(dn, dowait=True) would not be called either.
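A sketch of that flow, with a local ALREADY_EXISTS class standing in for
python-ldap's ldap.ALREADY_EXISTS and plain callables standing in for the
add-entry and checkTask calls (all names here are hypothetical):

```python
class ALREADY_EXISTS(Exception):
    """Stand-in for ldap.ALREADY_EXISTS (LDAP 'entry already exists')."""

def run_cleanallruv(add_task_entry, check_task):
    """Create the CLEANALLRUV task entry; if one already exists,
    report it and bail out without waiting on the task."""
    try:
        add_task_entry()
    except ALREADY_EXISTS:
        print("A CLEANALLRUV task for this replica is already running")
        return False
    check_task()  # corresponds to self.conn.checkTask(dn, dowait=True)
    return True
```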


4) (minor): In make_readonly function, there is a redundant "pass" statement:

+    def make_readonly(self):
+        """
+        Make the current replication agreement read-only.
+        """
+        dn = DN(('cn', 'userRoot'), ('cn', 'ldbm database'),
+                ('cn', 'plugins'), ('cn', 'config'))
+
+        mod = [(ldap.MOD_REPLACE, 'nsslapd-readonly', 'on')]
+        try:
+            self.conn.modify_s(dn, mod)
+        except ldap.INSUFFICIENT_ACCESS:
+            # We can't make the server we're removing read-only but
+            # this isn't a show-stopper
+            root_logger.debug("No permission to switch replica to read-only, continuing anyway")
+            pass         <<<<<<<<<<<<<<<


5) In clean_ruv, I think allowing a --force option to bypass the user_input
prompt would be helpful (at least for test automation):

+    if not ipautil.user_input("Continue to clean?", False):
+        sys.exit("Aborted")
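A sketch of how the prompt could honor --force; confirm_clean and its
arguments are hypothetical names, and user_input mirrors ipautil.user_input's
prompt/default signature:

```python
def confirm_clean(force, user_input):
    """Return True when cleaning should proceed; --force skips the prompt."""
    if force:
        return True
    return user_input("Continue to clean?", False)
```

With this, "ipa-replica-manage clean_ruv 3 --force" could proceed without any
interactive confirmation.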

Martin



