[Freeipa-users] replica DS failure deadlock

Andrew E. Bruno aebruno2 at buffalo.edu
Wed Oct 19 16:28:38 UTC 2016


On Wed, Oct 19, 2016 at 05:41:37PM +0200, Ludwig Krispenz wrote:
> 
> On 10/19/2016 05:02 PM, Ludwig Krispenz wrote:
> > 
> > On 10/19/2016 03:48 PM, Andrew E. Bruno wrote:
> > > On Wed, Oct 19, 2016 at 10:13:26AM +0200, Ludwig Krispenz wrote:
> > > > On 10/18/2016 08:52 PM, Andrew E. Bruno wrote:
> > > > > We had one of our replicas fail today with the following errors:
> > > > > 
> > > > > 
> > > > > [18/Oct/2016:13:40:47 -0400]
> > > > > agmt="cn=meTosrv-m14-32.cbls.ccr.buffalo.edu"
> > > > > (srv-m14-32:389) - Can't locate CSN 58065ef3000100030000 in
> > > > > the changelog (DB rc=-30988). If replication stops, the
> > > > > consumer may need to be reinitialized.
> > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5WriteOperationTxn: retry (49) the
> > > > > transaction (csn=58065f74000500040000) failed (rc=-30993
> > > > > (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > deadlock))
> > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5WriteOperationTxn: failed to write
> > > > > entry with csn (58065f74000500040000); db error - -30993
> > > > > BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > deadlock
> > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > write_changelog_and_ruv: can't add a change for
> > > > > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu
> > > > > (uniqid: 939bca48-2ced11e5-ac0b8f7e-e0b1a377, optype: 64) to
> > > > > changelog csn 58065f74000500040000
> > > > > [18/Oct/2016:13:43:07 -0400] -
> > > > > SLAPI_PLUGIN_BE_TXN_POST_MODRDN_FN plugin returned error but
> > > > > did not set SLAPI_RESULT_CODE
> > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > process_postop: Failed to apply update
> > > > > (58065f74000500040000) error (1).  Aborting replication
> > > > > session(conn=1314106 op=1688559)
> > > > > [18/Oct/2016:13:43:12 -0400] - cos_cache_change_notify:
> > > > > modified entry is NULL--updating cache just in case
> > > > > [18/Oct/2016:13:43:12 -0400] - Skipping CoS Definition
> > > > > cn=Password
> > > > > Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS
> > > > > Templates found, which should be added before the CoS
> > > > > Definition.
> > > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching Null
> > > > > DN (4a729f9a-955a11e6-aaffa516-e778e883), error -30993.
> > > > > [18/Oct/2016:13:43:20 -0400] - dn2entry_ext: Failed to get
> > > > > id for changenumber=30856302,cn=changelog from entryrdn
> > > > > index (-30993)
> > > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching
> > > > > changenumber=30856302,cn=changelog (null), error -30993.
> > > > > [18/Oct/2016:13:43:20 -0400] DSRetroclPlugin - replog: an
> > > > > error occured while adding change number 30856302, dn =
> > > > > changenumber=30856302,cn=changelog: Operations error.
> > > > > [18/Oct/2016:13:43:20 -0400] retrocl-plugin -
> > > > > retrocl_postob: operation failure [1]
> > > > > [18/Oct/2016:13:43:20 -0400] NSMMReplicationPlugin -
> > > > > process_postop: Failed to apply update
> > > > > (58065f9f000000600000) error (1).  Aborting replication
> > > > > session(conn=1901274 op=5)
> > > > > [18/Oct/2016:13:43:24 -0400] - ldbm_back_seq deadlock retry
> > > > > BAD 1601, err=0 BDB0062 Successful return: 0
> > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5WriteOperationTxn: retry (49) the
> > > > > transaction (csn=58065f7c000a00040000) failed (rc=-30993
> > > > > (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > deadlock))
> > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5WriteOperationTxn: failed to write
> > > > > entry with csn (58065f7c000a00040000); db error - -30993
> > > > > BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > deadlock
> > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin -
> > > > > write_changelog_and_ruv: can't add a change for
> > > > > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu
> > > > > (uniqid: 4080421a-2d0211e5-ac0b8f7e-e0b1a377, optype: 64) to
> > > > > changelog csn 58065f7c000a00040000
> > > > > 
> > > > > 
> > > > > ns-slapd was hung so we restarted and now it's stuck and
> > > > > won't come back up. It
> > > > > hangs up here:
> > > > > 
> > > > > [18/Oct/2016:14:12:31 -0400] - Skipping CoS Definition
> > > > > cn=Password
> > > > > Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS
> > > > > Templates found, which should be added before the CoS
> > > > > Definition.
> > > > > [18/Oct/2016:14:12:31 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/a32992ce-71b811e5-9d33a516-e778e883.sema;
> > > > > NSPR error - -5943
> > > > > [18/Oct/2016:14:12:32 -0400] NSMMReplicationPlugin -
> > > > > changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/986efe12-71b811e5-9d33a516-e778e883.sema;
> > > > > NSPR error - -5943
> > > > > 
> > > > > 
> > > > > Tried deleting the semaphore files and restarting but no
> > > > > luck. Attached
> > > > > is a stacktrace of the stuck ns-slapd process.
> > > > > 
> > > > > Here's the versions were running:
> > > > > 
> > > > > ipa-server-4.2.0-15.0.1.el7.centos.19.x86_64
> > > > > 389-ds-base-1.3.4.0-33.el7_2.x86_64
> > > > > 
> > > > > FWIW, we were experimenting with the new life-cycle
> > > > > management features,
> > > > > specifically "preserved" users and deleted the user
> > > > > "janedoe" when this
> > > > > happened.  From the errors above looks like this host failed to
> > > > > replicate the change?  Not sure if this is related or not.
> > > > > 
> > > > > Is it possible to recover the database? Thanks in advance
> > > > > for any pointers.
> > > > from the stack trace the process is not hanging, it is trying to
> > > > recover.
> > > > After a crash/kill  the changelog does not contai a RUV and it is
> > > > reconstructed by reading all records in the changelog, if this
> > > > is large it
> > > > can take some time.
> > > > If you look at that part of the stack repeatedly,
> > > > 
> > > > #4  0x00007f4e88daeba5 in cl5DBData2Entry (data=<optimized out>,
> > > > len=<optimized out>, entry=entry at entry=0x7ffff6598910) at
> > > > ldap/servers/plugins/replication/cl5_api.c:2342
> > > >          rc = <optimized out>
> > > >          version = <optimized out>
> > > >          pos = 0x7f4e9839d091 ""
> > > >          strCSN = 0x0
> > > >          op = 0x7ffff6598980
> > > >          add_mods = 0x7f4e983a5e80
> > > >          rawDN = 0x7f4e98396e20 "fqdn=cpn-k08-29-02.cbls.ccr.buffalo.edu,cn=computers,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu"
> > > >          s =
> > > > "\300\037>\230N\177\000\000@\210Y\366\377\177\000\000@\210Y\366\377"
> > > > #5  0x00007f4e88daf5d6 in _cl5GetNextEntry
> > > > (entry=entry at entry=0x7ffff6598910, iterator=0x7f4e983a5e80) at
> > > > ldap/servers/plugins/replication/cl5_api.c:5291
> > > >          rc = 0
> > > >          it = 0x7f4e983a5e80
> > > >          key = {data = 0x0, size = 21, ulen = 0, dlen = 0, doff
> > > > = 0, app_data = 0x0, flags = 16}
> > > >          data = {data = 0x7f4e9839cff0, size = 335, ulen = 0,
> > > > dlen = 0, doff = 0, app_data = 0x0, flags = 16}
> > > > #6  0x00007f4e88dafb34 in _cl5ConstructRUV (purge=1,
> > > > obj=0x7f4e983e1fc0, replGen=0x7ffff6598910
> > > > "\200\211Y\366\377\177") at
> > > > ldap/servers/plugins/replication/cl5_api.c:4306
> > > > 
> > > > 
> > > > you should see some progress in which entry is handled
> > > > 
> > > Ludwig, thanks very much for the help. As you pointed out just
> > > needed to let it
> > > finish.  ns-slapd eventually came back up once it finished reading the
> > > changelog. Still seeing some errors related to the
> > > NSMMReplicationPlugin failed
> > > to apply update and from the managed-entries-plugin. Can these safely be
> > > ignored or are they indicative of a more serious problem?
> > > 
> > > [19/Oct/2016:09:28:46 -0400] - Operation error fetching Null DN
> > > (e73b48a4-95ff11e6-8bc7a516-e778e883), error -30993.
> > > [19/Oct/2016:09:28:46 -0400] - dn2entry_ext: Failed to get id for
> > > changenumber=30856335,cn=changelog from entryrdn index (-30993)
> > > [19/Oct/2016:09:28:46 -0400] - Operation error fetching
> > > changenumber=30856335,cn=changelog (null), error -30993.
> > > [19/Oct/2016:09:28:46 -0400] DSRetroclPlugin - replog: an error
> > > occured while adding change number 30856335, dn =
> > > changenumber=30856335,cn=changelog: Operations error.
> > > [19/Oct/2016:09:28:46 -0400] retrocl-plugin - retrocl_postob:
> > > operation failure [1]
> > > [19/Oct/2016:09:28:46 -0400] NSMMReplicationPlugin - process_postop:
> > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > replication session(conn=9 op=28)
> > > [19/Oct/2016:09:28:54 -0400] - Operation error fetching Null DN
> > > (e73b48a7-95ff11e6-8bc7a516-e778e883), error -30993.
> > > [19/Oct/2016:09:28:59 -0400] - dn2entry_ext: Failed to get id for
> > > changenumber=30856337,cn=changelog from entryrdn index (-30993)
> > > [19/Oct/2016:09:29:17 -0400] - Operation error fetching
> > > changenumber=30856337,cn=changelog (null), error -30993.
> > > [19/Oct/2016:09:29:17 -0400] DSRetroclPlugin - replog: an error
> > > occured while adding change number 30856337, dn =
> > > changenumber=30856337,cn=changelog: Operations error.
> > > [19/Oct/2016:09:29:17 -0400] retrocl-plugin - retrocl_postob:
> > > operation failure [1]
> > > [19/Oct/2016:09:29:17 -0400] NSMMReplicationPlugin - process_postop:
> > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > replication session(conn=15 op=5)
> > > [19/Oct/2016:09:29:20 -0400] - Retry count exceeded in delete
> > > [19/Oct/2016:09:29:20 -0400] DSRetroclPlugin - delete_changerecord:
> > > could not delete change record 30712389 (rc: 51)
> > > [19/Oct/2016:09:29:28 -0400] - Operation error fetching Null DN
> > > (0afe8e82-960011e6-8bc7a516-e778e883), error -30993.
> > > [19/Oct/2016:09:29:28 -0400] - dn2entry_ext: Failed to get id for
> > > changenumber=30856351,cn=changelog from entryrdn index (-30993)
> > > [19/Oct/2016:09:29:28 -0400] - Operation error fetching
> > > changenumber=30856351,cn=changelog (null), error -30993.
> > > [19/Oct/2016:09:29:28 -0400] DSRetroclPlugin - replog: an error
> > > occured while adding change number 30856351, dn =
> > > changenumber=30856351,cn=changelog: Operations error.
> > > [19/Oct/2016:09:29:28 -0400] retrocl-plugin - retrocl_postob:
> > > operation failure [1]
> > > [19/Oct/2016:09:29:28 -0400] NSMMReplicationPlugin - process_postop:
> > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > replication session(conn=25 op=5)
> > > [19/Oct/2016:09:29:31 -0400] - Retry count exceeded in delete
> > > [19/Oct/2016:09:29:31 -0400] DSRetroclPlugin - delete_changerecord:
> > > could not delete change record 30712865 (rc: 51)
> > > [19/Oct/2016:09:29:39 -0400] - Operation error fetching Null DN
> > > (0afe8e90-960011e6-8bc7a516-e778e883), error -30993.
> > > [19/Oct/2016:09:29:39 -0400] - dn2entry_ext: Failed to get id for
> > > changenumber=30856364,cn=changelog from entryrdn index (-30993)
> > > [19/Oct/2016:09:29:39 -0400] - Operation error fetching
> > > changenumber=30856364,cn=changelog (null), error -30993.
> > > [19/Oct/2016:09:29:39 -0400] DSRetroclPlugin - replog: an error
> > > occured while adding change number 30856364, dn =
> > > changenumber=30856364,cn=changelog: Operations error.
> > > [19/Oct/2016:09:29:39 -0400] retrocl-plugin - retrocl_postob:
> > > operation failure [1]
> > > [19/Oct/2016:09:29:39 -0400] NSMMReplicationPlugin - process_postop:
> > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > replication session(conn=35 op=5)
> > > [19/Oct/2016:09:29:42 -0400] - Retry count exceeded in delete
> > > [19/Oct/2016:09:29:42 -0400] DSRetroclPlugin - delete_changerecord:
> > > could not delete change record 30713364 (rc: 51)
> > > [19/Oct/2016:09:29:42 -0400] - Operation error fetching Null DN
> > > (0afe8ea0-960011e6-8bc7a516-e778e883), error -30993.
> > > [19/Oct/2016:09:29:42 -0400] - dn2entry_ext: Failed to get id for
> > > changenumber=30856379,cn=changelog from entryrdn index (-30993)
> > > [19/Oct/2016:09:29:42 -0400] - Operation error fetching
> > > changenumber=30856379,cn=changelog (null), error -30993.
> > > [19/Oct/2016:09:29:42 -0400] DSRetroclPlugin - replog: an error
> > > occured while adding change number 30856379, dn =
> > > changenumber=30856379,cn=changelog: Operations error.
> > > [19/Oct/2016:09:29:42 -0400] retrocl-plugin - retrocl_postob:
> > > operation failure [1]
> > > [19/Oct/2016:09:29:42 -0400] NSMMReplicationPlugin - process_postop:
> > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > replication session(conn=36 op=5)
> > > [19/Oct/2016:09:29:50 -0400] - Operation error fetching Null DN
> > > (0afe8ea2-960011e6-8bc7a516-e778e883), error -30993.
> > > [19/Oct/2016:09:29:50 -0400] - dn2entry_ext: Failed to get id for
> > > changenumber=30856380,cn=changelog from entryrdn index (-30993)
> > > [19/Oct/2016:09:29:50 -0400] - Operation error fetching
> > > changenumber=30856380,cn=changelog (null), error -30993.
> > > [19/Oct/2016:09:29:50 -0400] DSRetroclPlugin - replog: an error
> > > occured while adding change number 30856380, dn =
> > > changenumber=30856380,cn=changelog: Operations error.
> > > [19/Oct/2016:09:29:50 -0400] retrocl-plugin - retrocl_postob:
> > > operation failure [1]
> > this doesn't look good. There could be cancelled ops which would be
> > repeated, but the failing repl op is always with the same csn:
> > 5806acf7000000600000
> > so it seems incoming replication is stuck.
> > you could try to find out which entry is affected (grep for the csn in
> > the access log and look at the operation) and what kind of modification
> > it is to check what could be going wrong.


Here's what was in the access logs for that csn:

access.20161018-113116:[19/Oct/2016:09:28:46 -0400] conn=9 op=28 RESULT err=1 tag=103 nentries=0 etime=3 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:29:17 -0400] conn=15 op=5 RESULT err=1 tag=103 nentries=0 etime=24 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:29:28 -0400] conn=25 op=5 RESULT err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:29:39 -0400] conn=35 op=5 RESULT err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:29:42 -0400] conn=36 op=5 RESULT err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:29:50 -0400] conn=37 op=5 RESULT err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:29:54 -0400] conn=44 op=5 RESULT err=1 tag=103 nentries=0 etime=1 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:29:58 -0400] conn=45 op=5 RESULT err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:30:06 -0400] conn=46 op=5 RESULT err=1 tag=103 nentries=0 etime=4 csn=5806acf7000000600000
access.20161018-113116:[19/Oct/2016:09:30:12 -0400] conn=48 op=5 RESULT err=0 tag=103 nentries=0 etime=2 csn=5806acf7000000600000


Interestingly, right before the first op=28 was the modification of user
janedoe, we were testing deleting/preserving this user:

[19/Oct/2016:09:28:43 -0400] conn=9 op=27 MOD dn="cn=MasterCRL,ou=crlIssuingPoints,ou=ca,o=ipaca"
[19/Oct/2016:09:28:43 -0400] conn=10 op=15 RESULT err=0 tag=103 nentries=0 etime=0 csn=58065f7c000300030000
[19/Oct/2016:09:28:43 -0400] conn=10 op=16 MODRDN dn="uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu" newrdn="uid=janedoe" newsuperior="cn=deleted users,cn=accounts,cn=provisioning,dc=cbls,dc=ccr,dc=buffalo,dc=edu"
[19/Oct/2016:09:28:43 -0400] conn=9 op=27 RESULT err=0 tag=103 nentries=0 etime=0 csn=5806a973000000600000
[19/Oct/2016:09:28:43 -0400] conn=9 op=28 MOD dn="cn=MasterCRL,ou=crlIssuingPoints,ou=ca,o=ipaca"
[19/Oct/2016:09:28:46 -0400] conn=9 op=28 RESULT err=1 tag=103 nentries=0 etime=3 csn=5806acf7000000600000
[19/Oct/2016:09:28:46 -0400] conn=9 op=-1 fd=88 closed - B4


> the information of what is the change with csn 5806acf7000000600000, it
> should be found in the changelog of the server with the replicaid 96.

Where can I find/query the changelog?

> 
> there is also the possibility that your retro changelog got corrupted. could
> you try to query the retrocl: ldapsearch ..... -b "cn=changelog" dn

Guessing the changelog is too big or I need to increase some limit? 

ldapsearch -Y GSSAPI -b "cn=changelog" dn
# extended LDIF
#
# LDAPv3
# base <cn=changelog> with scope subtree
# filter: (objectclass=*)
# requesting: dn 
#

# search result
search: 4
result: 11 Administrative limit exceeded


> 
> and before rebuilding or reimporting the database it would be worth to try
> to recreate the retro changelog

I'm not seeing anymore errors in the logs. Is there any way to verify if
replication has caught up? 

Thanks again for all the help.






More information about the Freeipa-users mailing list