[Freeipa-users] replica DS failure deadlock

Andrew E. Bruno aebruno2 at buffalo.edu
Wed Oct 19 17:28:30 UTC 2016


On Wed, Oct 19, 2016 at 06:59:57PM +0200, thierry bordaz wrote:
> 
> 
> On 10/19/2016 06:28 PM, Andrew E. Bruno wrote:
> > On Wed, Oct 19, 2016 at 05:41:37PM +0200, Ludwig Krispenz wrote:
> > > On 10/19/2016 05:02 PM, Ludwig Krispenz wrote:
> > > > On 10/19/2016 03:48 PM, Andrew E. Bruno wrote:
> > > > > On Wed, Oct 19, 2016 at 10:13:26AM +0200, Ludwig Krispenz wrote:
> > > > > > On 10/18/2016 08:52 PM, Andrew E. Bruno wrote:
> > > > > > > We had one of our replicas fail today with the following errors:
> > > > > > > 
> > > > > > > 
> > > > > > > [18/Oct/2016:13:40:47 -0400]
> > > > > > > agmt="cn=meTosrv-m14-32.cbls.ccr.buffalo.edu"
> > > > > > > (srv-m14-32:389) - Can't locate CSN 58065ef3000100030000 in
> > > > > > > the changelog (DB rc=-30988). If replication stops, the
> > > > > > > consumer may need to be reinitialized.
> > > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > > > changelog program - _cl5WriteOperationTxn: retry (49) the
> > > > > > > transaction (csn=58065f74000500040000) failed (rc=-30993
> > > > > > > (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > > > deadlock))
> > > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > > > changelog program - _cl5WriteOperationTxn: failed to write
> > > > > > > entry with csn (58065f74000500040000); db error - -30993
> > > > > > > BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > > > deadlock
> > > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > > > write_changelog_and_ruv: can't add a change for
> > > > > > > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu
> > > > > > > (uniqid: 939bca48-2ced11e5-ac0b8f7e-e0b1a377, optype: 64) to
> > > > > > > changelog csn 58065f74000500040000
> > > > > > > [18/Oct/2016:13:43:07 -0400] -
> > > > > > > SLAPI_PLUGIN_BE_TXN_POST_MODRDN_FN plugin returned error but
> > > > > > > did not set SLAPI_RESULT_CODE
> > > > > > > [18/Oct/2016:13:43:07 -0400] NSMMReplicationPlugin -
> > > > > > > process_postop: Failed to apply update
> > > > > > > (58065f74000500040000) error (1).  Aborting replication
> > > > > > > session(conn=1314106 op=1688559)
> > > > > > > [18/Oct/2016:13:43:12 -0400] - cos_cache_change_notify:
> > > > > > > modified entry is NULL--updating cache just in case
> > > > > > > [18/Oct/2016:13:43:12 -0400] - Skipping CoS Definition
> > > > > > > cn=Password
> > > > > > > Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS
> > > > > > > Templates found, which should be added before the CoS
> > > > > > > Definition.
> > > > > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching Null
> > > > > > > DN (4a729f9a-955a11e6-aaffa516-e778e883), error -30993.
> > > > > > > [18/Oct/2016:13:43:20 -0400] - dn2entry_ext: Failed to get
> > > > > > > id for changenumber=30856302,cn=changelog from entryrdn
> > > > > > > index (-30993)
> > > > > > > [18/Oct/2016:13:43:20 -0400] - Operation error fetching
> > > > > > > changenumber=30856302,cn=changelog (null), error -30993.
> > > > > > > [18/Oct/2016:13:43:20 -0400] DSRetroclPlugin - replog: an
> > > > > > > error occured while adding change number 30856302, dn =
> > > > > > > changenumber=30856302,cn=changelog: Operations error.
> > > > > > > [18/Oct/2016:13:43:20 -0400] retrocl-plugin -
> > > > > > > retrocl_postob: operation failure [1]
> > > > > > > [18/Oct/2016:13:43:20 -0400] NSMMReplicationPlugin -
> > > > > > > process_postop: Failed to apply update
> > > > > > > (58065f9f000000600000) error (1).  Aborting replication
> > > > > > > session(conn=1901274 op=5)
> > > > > > > [18/Oct/2016:13:43:24 -0400] - ldbm_back_seq deadlock retry
> > > > > > > BAD 1601, err=0 BDB0062 Successful return: 0
> > > > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin -
> > > > > > > changelog program - _cl5WriteOperationTxn: retry (49) the
> > > > > > > transaction (csn=58065f7c000a00040000) failed (rc=-30993
> > > > > > > (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > > > deadlock))
> > > > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin -
> > > > > > > changelog program - _cl5WriteOperationTxn: failed to write
> > > > > > > entry with csn (58065f7c000a00040000); db error - -30993
> > > > > > > BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a
> > > > > > > deadlock
> > > > > > > [18/Oct/2016:13:43:25 -0400] NSMMReplicationPlugin -
> > > > > > > write_changelog_and_ruv: can't add a change for
> > > > > > > uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu
> > > > > > > (uniqid: 4080421a-2d0211e5-ac0b8f7e-e0b1a377, optype: 64) to
> > > > > > > changelog csn 58065f7c000a00040000
> > > > > > > 
> > > > > > > 
> > > > > > > ns-slapd was hung so we restarted and now it's stuck and
> > > > > > > won't come back up. It
> > > > > > > hangs up here:
> > > > > > > 
> > > > > > > [18/Oct/2016:14:12:31 -0400] - Skipping CoS Definition
> > > > > > > cn=Password
> > > > > > > Policy,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu--no CoS
> > > > > > > Templates found, which should be added before the CoS
> > > > > > > Definition.
> > > > > > > [18/Oct/2016:14:12:31 -0400] NSMMReplicationPlugin -
> > > > > > > changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/a32992ce-71b811e5-9d33a516-e778e883.sema;
> > > > > > > NSPR error - -5943
> > > > > > > [18/Oct/2016:14:12:32 -0400] NSMMReplicationPlugin -
> > > > > > > changelog program - _cl5NewDBFile: PR_DeleteSemaphore: /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/986efe12-71b811e5-9d33a516-e778e883.sema;
> > > > > > > NSPR error - -5943
> > > > > > > 
> > > > > > > 
> > > > > > > Tried deleting the semaphore files and restarting but no
> > > > > > > luck. Attached
> > > > > > > is a stacktrace of the stuck ns-slapd process.
> > > > > > > 
> > > > > > > Here's the versions were running:
> > > > > > > 
> > > > > > > ipa-server-4.2.0-15.0.1.el7.centos.19.x86_64
> > > > > > > 389-ds-base-1.3.4.0-33.el7_2.x86_64
> > > > > > > 
> > > > > > > FWIW, we were experimenting with the new life-cycle
> > > > > > > management features,
> > > > > > > specifically "preserved" users and deleted the user
> > > > > > > "janedoe" when this
> > > > > > > happened.  From the errors above looks like this host failed to
> > > > > > > replicate the change?  Not sure if this is related or not.
> > > > > > > 
> > > > > > > Is it possible to recover the database? Thanks in advance
> > > > > > > for any pointers.
> > > > > > from the stack trace the process is not hanging, it is trying to
> > > > > > recover.
> > > > > > After a crash/kill  the changelog does not contai a RUV and it is
> > > > > > reconstructed by reading all records in the changelog, if this
> > > > > > is large it
> > > > > > can take some time.
> > > > > > If you look at that part of the stack repeatedly,
> > > > > > 
> > > > > > #4  0x00007f4e88daeba5 in cl5DBData2Entry (data=<optimized out>,
> > > > > > len=<optimized out>, entry=entry at entry=0x7ffff6598910) at
> > > > > > ldap/servers/plugins/replication/cl5_api.c:2342
> > > > > >           rc = <optimized out>
> > > > > >           version = <optimized out>
> > > > > >           pos = 0x7f4e9839d091 ""
> > > > > >           strCSN = 0x0
> > > > > >           op = 0x7ffff6598980
> > > > > >           add_mods = 0x7f4e983a5e80
> > > > > >           rawDN = 0x7f4e98396e20 "fqdn=cpn-k08-29-02.cbls.ccr.buffalo.edu,cn=computers,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu"
> > > > > >           s =
> > > > > > "\300\037>\230N\177\000\000@\210Y\366\377\177\000\000@\210Y\366\377"
> > > > > > #5  0x00007f4e88daf5d6 in _cl5GetNextEntry
> > > > > > (entry=entry at entry=0x7ffff6598910, iterator=0x7f4e983a5e80) at
> > > > > > ldap/servers/plugins/replication/cl5_api.c:5291
> > > > > >           rc = 0
> > > > > >           it = 0x7f4e983a5e80
> > > > > >           key = {data = 0x0, size = 21, ulen = 0, dlen = 0, doff
> > > > > > = 0, app_data = 0x0, flags = 16}
> > > > > >           data = {data = 0x7f4e9839cff0, size = 335, ulen = 0,
> > > > > > dlen = 0, doff = 0, app_data = 0x0, flags = 16}
> > > > > > #6  0x00007f4e88dafb34 in _cl5ConstructRUV (purge=1,
> > > > > > obj=0x7f4e983e1fc0, replGen=0x7ffff6598910
> > > > > > "\200\211Y\366\377\177") at
> > > > > > ldap/servers/plugins/replication/cl5_api.c:4306
> > > > > > 
> > > > > > 
> > > > > > you should see some progress in which entry is handled
> > > > > > 
> > > > > Ludwig, thanks very much for the help. As you pointed out just
> > > > > needed to let it
> > > > > finish.  ns-slapd eventually came back up once it finished reading the
> > > > > changelog. Still seeing some errors related to the
> > > > > NSMMReplicationPlugin failed
> > > > > to apply update and from the managed-entries-plugin. Can these safely be
> > > > > ignored or are they indicative of a more serious problem?
> > > > > 
> > > > > [19/Oct/2016:09:28:46 -0400] - Operation error fetching Null DN
> > > > > (e73b48a4-95ff11e6-8bc7a516-e778e883), error -30993.
> > > > > [19/Oct/2016:09:28:46 -0400] - dn2entry_ext: Failed to get id for
> > > > > changenumber=30856335,cn=changelog from entryrdn index (-30993)
> > > > > [19/Oct/2016:09:28:46 -0400] - Operation error fetching
> > > > > changenumber=30856335,cn=changelog (null), error -30993.
> > > > > [19/Oct/2016:09:28:46 -0400] DSRetroclPlugin - replog: an error
> > > > > occured while adding change number 30856335, dn =
> > > > > changenumber=30856335,cn=changelog: Operations error.
> > > > > [19/Oct/2016:09:28:46 -0400] retrocl-plugin - retrocl_postob:
> > > > > operation failure [1]
> > > > > [19/Oct/2016:09:28:46 -0400] NSMMReplicationPlugin - process_postop:
> > > > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > > > replication session(conn=9 op=28)
> > > > > [19/Oct/2016:09:28:54 -0400] - Operation error fetching Null DN
> > > > > (e73b48a7-95ff11e6-8bc7a516-e778e883), error -30993.
> > > > > [19/Oct/2016:09:28:59 -0400] - dn2entry_ext: Failed to get id for
> > > > > changenumber=30856337,cn=changelog from entryrdn index (-30993)
> > > > > [19/Oct/2016:09:29:17 -0400] - Operation error fetching
> > > > > changenumber=30856337,cn=changelog (null), error -30993.
> > > > > [19/Oct/2016:09:29:17 -0400] DSRetroclPlugin - replog: an error
> > > > > occured while adding change number 30856337, dn =
> > > > > changenumber=30856337,cn=changelog: Operations error.
> > > > > [19/Oct/2016:09:29:17 -0400] retrocl-plugin - retrocl_postob:
> > > > > operation failure [1]
> > > > > [19/Oct/2016:09:29:17 -0400] NSMMReplicationPlugin - process_postop:
> > > > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > > > replication session(conn=15 op=5)
> > > > > [19/Oct/2016:09:29:20 -0400] - Retry count exceeded in delete
> > > > > [19/Oct/2016:09:29:20 -0400] DSRetroclPlugin - delete_changerecord:
> > > > > could not delete change record 30712389 (rc: 51)
> > > > > [19/Oct/2016:09:29:28 -0400] - Operation error fetching Null DN
> > > > > (0afe8e82-960011e6-8bc7a516-e778e883), error -30993.
> > > > > [19/Oct/2016:09:29:28 -0400] - dn2entry_ext: Failed to get id for
> > > > > changenumber=30856351,cn=changelog from entryrdn index (-30993)
> > > > > [19/Oct/2016:09:29:28 -0400] - Operation error fetching
> > > > > changenumber=30856351,cn=changelog (null), error -30993.
> > > > > [19/Oct/2016:09:29:28 -0400] DSRetroclPlugin - replog: an error
> > > > > occured while adding change number 30856351, dn =
> > > > > changenumber=30856351,cn=changelog: Operations error.
> > > > > [19/Oct/2016:09:29:28 -0400] retrocl-plugin - retrocl_postob:
> > > > > operation failure [1]
> > > > > [19/Oct/2016:09:29:28 -0400] NSMMReplicationPlugin - process_postop:
> > > > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > > > replication session(conn=25 op=5)
> > > > > [19/Oct/2016:09:29:31 -0400] - Retry count exceeded in delete
> > > > > [19/Oct/2016:09:29:31 -0400] DSRetroclPlugin - delete_changerecord:
> > > > > could not delete change record 30712865 (rc: 51)
> > > > > [19/Oct/2016:09:29:39 -0400] - Operation error fetching Null DN
> > > > > (0afe8e90-960011e6-8bc7a516-e778e883), error -30993.
> > > > > [19/Oct/2016:09:29:39 -0400] - dn2entry_ext: Failed to get id for
> > > > > changenumber=30856364,cn=changelog from entryrdn index (-30993)
> > > > > [19/Oct/2016:09:29:39 -0400] - Operation error fetching
> > > > > changenumber=30856364,cn=changelog (null), error -30993.
> > > > > [19/Oct/2016:09:29:39 -0400] DSRetroclPlugin - replog: an error
> > > > > occured while adding change number 30856364, dn =
> > > > > changenumber=30856364,cn=changelog: Operations error.
> > > > > [19/Oct/2016:09:29:39 -0400] retrocl-plugin - retrocl_postob:
> > > > > operation failure [1]
> > > > > [19/Oct/2016:09:29:39 -0400] NSMMReplicationPlugin - process_postop:
> > > > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > > > replication session(conn=35 op=5)
> > > > > [19/Oct/2016:09:29:42 -0400] - Retry count exceeded in delete
> > > > > [19/Oct/2016:09:29:42 -0400] DSRetroclPlugin - delete_changerecord:
> > > > > could not delete change record 30713364 (rc: 51)
> > > > > [19/Oct/2016:09:29:42 -0400] - Operation error fetching Null DN
> > > > > (0afe8ea0-960011e6-8bc7a516-e778e883), error -30993.
> > > > > [19/Oct/2016:09:29:42 -0400] - dn2entry_ext: Failed to get id for
> > > > > changenumber=30856379,cn=changelog from entryrdn index (-30993)
> > > > > [19/Oct/2016:09:29:42 -0400] - Operation error fetching
> > > > > changenumber=30856379,cn=changelog (null), error -30993.
> > > > > [19/Oct/2016:09:29:42 -0400] DSRetroclPlugin - replog: an error
> > > > > occured while adding change number 30856379, dn =
> > > > > changenumber=30856379,cn=changelog: Operations error.
> > > > > [19/Oct/2016:09:29:42 -0400] retrocl-plugin - retrocl_postob:
> > > > > operation failure [1]
> > > > > [19/Oct/2016:09:29:42 -0400] NSMMReplicationPlugin - process_postop:
> > > > > Failed to apply update (5806acf7000000600000) error (1).  Aborting
> > > > > replication session(conn=36 op=5)
> > > > > [19/Oct/2016:09:29:50 -0400] - Operation error fetching Null DN
> > > > > (0afe8ea2-960011e6-8bc7a516-e778e883), error -30993.
> > > > > [19/Oct/2016:09:29:50 -0400] - dn2entry_ext: Failed to get id for
> > > > > changenumber=30856380,cn=changelog from entryrdn index (-30993)
> > > > > [19/Oct/2016:09:29:50 -0400] - Operation error fetching
> > > > > changenumber=30856380,cn=changelog (null), error -30993.
> > > > > [19/Oct/2016:09:29:50 -0400] DSRetroclPlugin - replog: an error
> > > > > occured while adding change number 30856380, dn =
> > > > > changenumber=30856380,cn=changelog: Operations error.
> > > > > [19/Oct/2016:09:29:50 -0400] retrocl-plugin - retrocl_postob:
> > > > > operation failure [1]
> > > > this doesn't look good. There could be cancelled ops which would be
> > > > repeated, but the failing repl op is always with the same csn:
> > > > 5806acf7000000600000
> > > > so it seems incoming replication is stuck.
> > > > you could try to find out which entry is affected (grep for the csn in
> > > > the access log and look at the operation) and what kind of modification
> > > > it is to check what could be going wrong.
> > 
> > Here's what was in the access logs for that csn:
> > 
> > access.20161018-113116:[19/Oct/2016:09:28:46 -0400] conn=9 op=28 RESULT err=1 tag=103 nentries=0 etime=3 csn=5806acf7000000600000
> > access.20161018-113116:[19/Oct/2016:09:29:17 -0400] conn=15 op=5 RESULT err=1 tag=103 nentries=0 etime=24 csn=5806acf7000000600000
> > access.20161018-113116:[19/Oct/2016:09:29:28 -0400] conn=25 op=5 RESULT err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000
> > access.20161018-113116:[19/Oct/2016:09:29:39 -0400] conn=35 op=5 RESULT err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000
> > access.20161018-113116:[19/Oct/2016:09:29:42 -0400] conn=36 op=5 RESULT err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000
> > access.20161018-113116:[19/Oct/2016:09:29:50 -0400] conn=37 op=5 RESULT err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000
> > access.20161018-113116:[19/Oct/2016:09:29:54 -0400] conn=44 op=5 RESULT err=1 tag=103 nentries=0 etime=1 csn=5806acf7000000600000
> > access.20161018-113116:[19/Oct/2016:09:29:58 -0400] conn=45 op=5 RESULT err=1 tag=103 nentries=0 etime=0 csn=5806acf7000000600000
> > access.20161018-113116:[19/Oct/2016:09:30:06 -0400] conn=46 op=5 RESULT err=1 tag=103 nentries=0 etime=4 csn=5806acf7000000600000
> > access.20161018-113116:[19/Oct/2016:09:30:12 -0400] conn=48 op=5 RESULT err=0 tag=103 nentries=0 etime=2 csn=5806acf7000000600000
> > 
> > 
> > Interestingly, right before the first op=28 was the modification of user
> > janedoe, we were testing deleting/preserving this user:
> > 
> > [19/Oct/2016:09:28:43 -0400] conn=9 op=27 MOD dn="cn=MasterCRL,ou=crlIssuingPoints,ou=ca,o=ipaca"
> > [19/Oct/2016:09:28:43 -0400] conn=10 op=15 RESULT err=0 tag=103 nentries=0 etime=0 csn=58065f7c000300030000
> > [19/Oct/2016:09:28:43 -0400] conn=10 op=16 MODRDN dn="uid=janedoe,cn=users,cn=accounts,dc=cbls,dc=ccr,dc=buffalo,dc=edu" newrdn="uid=janedoe" newsuperior="cn=deleted users,cn=accounts,cn=provisioning,dc=cbls,dc=ccr,dc=buffalo,dc=edu"
> > [19/Oct/2016:09:28:43 -0400] conn=9 op=27 RESULT err=0 tag=103 nentries=0 etime=0 csn=5806a973000000600000
> > [19/Oct/2016:09:28:43 -0400] conn=9 op=28 MOD dn="cn=MasterCRL,ou=crlIssuingPoints,ou=ca,o=ipaca"
> > [19/Oct/2016:09:28:46 -0400] conn=9 op=28 RESULT err=1 tag=103 nentries=0 etime=3 csn=5806acf7000000600000
> > [19/Oct/2016:09:28:46 -0400] conn=9 op=-1 fd=88 closed - B4
> Note that janedoe operation and 5806acf7000000600000 operation were on
> different suffixes but are sharing the retroCL.
> According to
> 
> [19/Oct/2016:09:30:06 -0400] managed-entries-plugin - mep_mod_post_op: Unable to find config for origin entry "uid=janedoe,cn=deleted users,cn=accounts,cn=provisioning,dc=cbls,dc=ccr,dc=buffalo,dc=edu".
> 
> We may think that operation 'conn=10 op=16' completed (but likely failed) on 9:30:06 and only then 5806acf7000000600000
>  was successfully replicated.
> 
> There may be interaction between both operation, 'janedoe' that was to fail but preventing other updates to complete.
> 
> It would be interesting to get full logs (access/errors): 19/Oct/2016:09:28:43 -> 19/Oct/2016:09:30:20
> and also the dump of 'janedoe' entry.

I'd be happy to provide more logs. Any chance I can send them off list?

> 
> 
> > 
> > > the information of what is the change with csn 5806acf7000000600000, it
> > > should be found in the changelog of the server with the replicaid 96.
> > Where can I find/query the changelog?
> 
> You may dump it with: dbscan -f
> /var/lib/dirsrv/slapd-CBLS-CCR-BUFFALO-EDU/cldb/xxx.db
> > 
> > > there is also the possibility that your retro changelog got corrupted. could
> > > you try to query the retrocl: ldapsearch ..... -b "cn=changelog" dn
> > Guessing the changelog is too big or I need to increase some limit?
> > 
> > ldapsearch -Y GSSAPI -b "cn=changelog" dn
> > # extended LDIF
> > #
> > # LDAPv3
> > # base <cn=changelog> with scope subtree
> > # filter: (objectclass=*)
> > # requesting: dn
> > #
> > 
> > # search result
> > search: 4
> > result: 11 Administrative limit exceeded
> 
> Can you do the same operation being 'cn=directory manager' ?
> > 
> > 
> > > and before rebuilding or reimporting the database it would be worth to try
> > > to recreate the retro changelog
> > I'm not seeing anymore errors in the logs. Is there any way to verify if
> > replication has caught up?
> > 
> > Thanks again for all the help.
> > 
> > 
> > 
> 




More information about the Freeipa-users mailing list