[Freeipa-users] ns-slapd high cpu usage
Ludwig Krispenz
lkrispen at redhat.com
Wed Jul 15 14:58:23 UTC 2015
On 07/15/2015 04:10 PM, Andrew E. Bruno wrote:
> On Wed, Jul 15, 2015 at 03:22:51PM +0200, Ludwig Krispenz wrote:
>> On 07/14/2015 08:59 PM, Andrew E. Bruno wrote:
>>> On Tue, Jul 14, 2015 at 04:52:10PM +0200, Ludwig Krispenz wrote:
>>>> hm, the stack traces show csn_str, which correspond to Jul,8th, Jul,4th, and
>>>> Jul,7th - so it looks like it is iterating the changelog over and over
>>>> again.
>>>> Th consumer side Is "cn=meTosrv-m14-24.ccr.buffalo.edu" - is this the master
>>>> ?
>>>>
>>>> can you provide the result of the following search from
>>>> m14-24.ccr.buffalo.edu adn the server with the high cpu:
>>>>
>>>> ldapsearch -o ldif-wrap=no -x -D ... -w -b "cn=config"
>>>> "objectclass=nsds5replica" nsds50ruv
>>> master is srv-m14-24.. here's the results of the ldapsearch:
>>>
>>> [srv-m14-24 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W -b "cn=config" "objectclass=nsds5replica" nsds50ruv
>>>
>>> # replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
>>> dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
>>> nsds50ruv: {replicageneration} 5527f711000000040000
>>> nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f771000000040000 55a55aed001000040000
>>> nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c773000000050000 5591a3d2000700050000
>>> nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943dda000000060000 55945378000200060000
>> so this is really strange, the master m14-24 has the latest change from
>> replica 5(m14-26) as: 5591a3d2000700050000
>> which corresponds to Mon, 29 Jun 2015 20:00:18 GMT
>> so no update from 14-24 since that did arrive, or could not update the ruv.
>> So m14-26 tries to replicate all the changes back from that time, but looks
>> like iit has no success.
>> is there anything in the logs of m14-24 ? can you see successful mods with
>> csn=xxxxxxx00050000 ?
> Here's what I could find from the logs on srv-m14-24:
>
>
> [srv-m14-24 ~]# grep -r 00050000 /var/log/dirsrv/slapd-[domain]/*
> access.20150714-014346:[14/Jul/2015:03:10:05 -0400] conn=748529 op=14732 RESULT err=0 tag=103 nentries=0 etime=1 csn=55a4b5f0005000040000
ok, so no update originating at replica 5 has been replicated (probably
since June,29) did you experience data inconsistency between the servers ?
>
>
> And here's the last few lines the error log on srv-m14-24:
one set of messages refers to the o=ipaca backend and seem to be
transient, replication continues later.
the other set of msg "No original tombstone .." is annoying (and it is
fixed in ticket https://fedorahosted.org/389/ticket/47912)
the next thing we can do to try to understand what is going on is to
enable replication logging on m14-26, it will then not only consume all
cpu, but write tons of messages to the error log.
But it can be turned on and off:
ldapmodify ...
dn: cn=config
replace: nsslapd-errorlog-level
nsslapd-errorlog-level: 8192
and let it run for a while, then set it back to: 0
>
> [12/Jul/2015:10:11:14 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2456070,cn=changelog!!
> [12/Jul/2015:10:11:48 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2498441,cn=changelog!!
> [13/Jul/2015:07:41:49 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3a406000000600000): Operations error (1). Will retry later.
> [13/Jul/2015:11:56:50 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a3dfca000000600000): Operations error (1). Will retry later.
> [13/Jul/2015:14:26:50 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a402f2000000600000): Operations error (1). Will retry later.
> [13/Jul/2015:15:26:49 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a41102000000600000): Operations error (1). Will retry later.
> [13/Jul/2015:18:26:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a43b32000000600000): Operations error (1). Will retry later.
> [13/Jul/2015:18:56:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4423a000000600000): Operations error (1). Will retry later.
> [13/Jul/2015:20:41:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a45ad6000000600000): Operations error (1). Will retry later.
> [13/Jul/2015:22:41:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a476f6000000600000): Operations error (1). Will retry later.
> [14/Jul/2015:06:56:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4eafa000000600000): Operations error (1). Will retry later.
> [14/Jul/2015:09:56:52 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-srv-m14-26.ccr.buffalo.edu-pki-tomcat" (srv-m14-26:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a5152a000000600000): Operations error (1). Will retry later.
> [14/Jul/2015:10:11:21 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2552223,cn=changelog!!
> [14/Jul/2015:10:11:21 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2552224,cn=changelog!!
> [14/Jul/2015:10:11:25 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2557315,cn=changelog!!
> [14/Jul/2015:10:11:25 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2557318,cn=changelog!!
> [14/Jul/2015:10:11:28 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2561020,cn=changelog!!
> [14/Jul/2015:10:11:28 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2561043,cn=changelog!!
> [14/Jul/2015:10:11:48 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2586022,cn=changelog!!
> [14/Jul/2015:10:11:59 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2598989,cn=changelog!!
> [14/Jul/2015:10:11:59 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2598990,cn=changelog!!
> [14/Jul/2015:10:12:01 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2600966,cn=changelog!!
> [14/Jul/2015:10:12:03 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2604037,cn=changelog!!
> [14/Jul/2015:10:12:04 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2604054,cn=changelog!!
> [14/Jul/2015:10:12:24 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2629803,cn=changelog!!
> [14/Jul/2015:10:12:24 -0400] ldbm_back_delete - conn=0 op=0 [retry: 1] No original_tombstone for changenumber=2629804,cn=changelog!!
>
>
>
>
>
>>> # replica, o\3Dipaca, mapping tree, config
>>> dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
>>> nsds50ruv: {replicageneration} 5527f74b000000600000
>>> nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f754000000600000 55a557f6000000600000
>>> nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943e6e000000560000 55943e6f000100560000
>>> nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c7ba0000005b0000 5582c7e40004005b0000
>>>
>>>
>>> server with high cpu load is srv-m14-26. here's the results of the ldapsearch
>> >from this server:
>>> [srv-m14-26 ~]$ ldapsearch -o ldif-wrap=no -x -D "cn=directory manager" -W -b "cn=config" "objectclass=nsds5replica" nsds50ruv
>>>
>>> # replica, dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu, mapping tree, config
>>> dn: cn=replica,cn=dc\3Dccr\2Cdc\3Dbuffalo\2Cdc\3Dedu,cn=mapping tree,cn=config
>>> nsds50ruv: {replicageneration} 5527f711000000040000
>>> nsds50ruv: {replica 5 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c773000000050000 55a55b47000300050000
>>> nsds50ruv: {replica 4 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f771000000040000 55a53eb0000a00040000
>>> nsds50ruv: {replica 6 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943dda000000060000 55945378000200060000
>>>
>>> # replica, o\3Dipaca, mapping tree, config
>>> dn: cn=replica,cn=o\3Dipaca,cn=mapping tree,cn=config
>>> nsds50ruv: {replicageneration} 5527f74b000000600000
>>> nsds50ruv: {replica 91 ldap://srv-m14-26.ccr.buffalo.edu:389} 5537c7ba0000005b0000 5582c7e40004005b0000
>>> nsds50ruv: {replica 96 ldap://srv-m14-24.ccr.buffalo.edu:389} 5527f754000000600000 55a557f6000000600000
>>> nsds50ruv: {replica 86 ldap://srv-m14-25-02.ccr.buffalo.edu:389} 55943e6e000000560000 55943e6f000100560000
>>>
>>>
>>> srv-m14-25-02 is our 3rd replicate which we recently added back in after it
>>> failed (was added back in 7/1).
>>>
>>> Let me know if you need anything else. Thanks for the help.
>>>
>>> --Andrew
>>>
>>>> On 07/14/2015 02:35 PM, Andrew E. Bruno wrote:
>>>>> On Tue, Jul 14, 2015 at 01:41:57PM +0200, Ludwig Krispenz wrote:
>>>>>> On 07/13/2015 06:36 PM, Andrew E. Bruno wrote:
>>>>>>> On Mon, Jul 13, 2015 at 05:29:13PM +0200, Ludwig Krispenz wrote:
>>>>>>>> On 07/13/2015 05:05 PM, Andrew E. Bruno wrote:
>>>>>>>>> On Mon, Jul 13, 2015 at 04:58:46PM +0200, Ludwig Krispenz wrote:
>>>>>>>>>> can you get a pstack of the slapd process along with a top -H to find th
>>>>>>>>>> ethread with high cpu usage
>>>>>>>>> Attached is the full stacktrace of the running ns-slapd proccess. top -H
>>>>>>>>> shows this thread (2879) with high cpu usage:
>>>>>>>>>
>>>>>>>>> 2879 dirsrv 20 0 3819252 1.962g 11680 R 99.9 3.1 8822:10 ns-slapd
>>>>>>>> this thread is a replication thread sending updates, what is strange is that
>>>>>>>> the current csn_str is quite old (july, 7th), I can't tell which agreeement
>>>>>>>> this thread is handling, but looks like it is heavily reading the changeglog
>>>>>>>> and sending updates. anything changed recently in replication setup ?
>>>>>>> Yes, we had one replica fail on (6/19) which we removed (not this one
>>>>>>> showing high CPU load). Had to perform some manual cleanup of the ipa-ca
>>>>>>> RUVs. Then we added the replica back in on 7/1. Since then, replication
>>>>>>> appears to have been running normally between the 3 replicas. We've been
>>>>>>> monitoring utilization since 7/1 and only recently seen this spike (past
>>>>>>> 24 hours or so).
>>>>>> is it still in this state ? or was it a spike.
>>>>> Yes same state.
>>>>>
>>>>>> if it still is high cpu consuming, could you
>>>>>> - get a few pstack like the one before with some time in between, I would
>>>>>> like to see if it is progressing with the csns or looping on the same one
>>>>> Attached are a few stacktraces. The thread pegging the cpu is:
>>>>>
>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>>>> 2879 dirsrv 20 0 3819252 1.978g 11684 R 99.9 3.2 10148:26 ns-slapd
>>>>>
>>>>>> - check the consumer side. is there anything in the error log ? does the
>>>>>> access log show replication activity from this server
>>>>> Here's some errors showing up on the first master server rep1 (rep2 is the
>>>>> server with pegged cpu):
>>>>>
>>>>> [13/Jul/2015:20:41:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-rep2-pki-tomcat" (rep2:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a45ad6000000600000): Operations error (1). Will retry later.
>>>>> [13/Jul/2015:22:41:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-rep2-pki-tomcat" (rep2:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a476f6000000600000): Operations error (1). Will retry later.
>>>>> [14/Jul/2015:06:56:51 -0400] NSMMReplicationPlugin - agmt="cn=masterAgreement1-rep2-tomcat" (rep2:389): Consumer failed to replay change (uniqueid cb7acfc1-df9211e4-a351aa45-2e06257b, CSN 55a4eafa000000600000): Operations error (1). Will retry later.
>>>>>
>>>>>
>>>>> Here's some snips from the access log of the rep2:
>>>>>
>>>>>
>>>>> [14/Jul/2015:08:22:31 -0400] conn=777787 op=9794 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
>>>>> [14/Jul/2015:08:22:31 -0400] conn=777787 op=9794 RESULT err=0 tag=120 nentries=0 etime=0
>>>>> [14/Jul/2015:08:22:31 -0400] conn=777787 op=9795 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
>>>>> [14/Jul/2015:08:22:31 -0400] conn=777787 op=9795 RESULT err=0 tag=120 nentries=0 etime=0
>>>>> [14/Jul/2015:08:22:33 -0400] conn=777787 op=9796 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
>>>>> ..
>>>>> [14/Jul/2015:08:23:38 -0400] conn=782341 op=129 RESULT err=0 tag=103 nentries=0 etime=0 csn=55a4ff6c000000050000
>>>>> ..
>>>>> [14/Jul/2015:08:24:02 -0400] conn=781901 op=1745 RESULT err=0 tag=101 nentries=1 etime=0
>>>>> [14/Jul/2015:08:24:03 -0400] conn=777787 op=9810 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
>>>>> [14/Jul/2015:08:24:03 -0400] conn=777787 op=9810 RESULT err=0 tag=120 nentries=0 etime=0
>>>>> [14/Jul/2015:08:24:03 -0400] conn=777787 op=9811 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
>>>>> [14/Jul/2015:08:24:03 -0400] conn=777787 op=9811 RESULT err=0 tag=120 nentries=0 etime=0
>>>>> [14/Jul/2015:08:24:05 -0400] conn=777787 op=9812 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
>>>>> [14/Jul/2015:08:24:05 -0400] conn=777787 op=9812 RESULT err=0 tag=120 nentries=0 etime=0
>>>>> [14/Jul/2015:08:24:08 -0400] conn=777787 op=9813 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
>>>>> [14/Jul/2015:08:24:08 -0400] conn=777787 op=9813 RESULT err=0 tag=120 nentries=0 etime=0
>>>>> [14/Jul/2015:08:24:10 -0400] conn=777787 op=9814 EXT oid="2.16.840.1.113730.3.5.5" name="Netscape Replication End Session"
>>>>> [14/Jul/2015:08:24:10 -0400] conn=777787 op=9814 RESULT err=0 tag=120 nentries=0 etime=0
>>>>> [14/Jul/2015:08:24:10 -0400] conn=777787 op=9815 EXT oid="2.16.840.1.113730.3.5.12" name="replication-multimaster-extop"
>>>>> [14/Jul/2015:08:24:10 -0400] conn=777787 op=9815 RESULT err=0 tag=120 nentries=0 etime=0
>>>>>
>>>>> and here's some from the error log:
>>>>>
>>>>> [13/Jul/2015:22:41:49 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2294859, dn = changenumber=2294859,cn=changelog: Operations error.
>>>>> [13/Jul/2015:22:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>> [13/Jul/2015:23:56:50 -0400] - dn2entry_ext: Failed to get id for changenumber=2296384,cn=changelog from entryrdn index (-30993)
>>>>> [13/Jul/2015:23:56:50 -0400] - Operation error fetching changenumber=2296384,cn=changelog (null), error -30993.
>>>>> [13/Jul/2015:23:56:50 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2296384, dn = changenumber=2296384,cn=changelog: Operations error.
>>>>> [13/Jul/2015:23:56:50 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>> [14/Jul/2015:06:56:50 -0400] - dn2entry_ext: Failed to get id for changenumber=2304418,cn=changelog from entryrdn index (-30993)
>>>>> [14/Jul/2015:06:56:50 -0400] - Operation error fetching changenumber=2304418,cn=changelog (null), error -30993.
>>>>> [14/Jul/2015:06:56:50 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2304418, dn = changenumber=2304418,cn=changelog: Operations error.
>>>>> [14/Jul/2015:06:56:50 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>>
>>>>>
>>>>>> - eventually enable replication logging: nsslapd-errorlog-level: 8192
>>>>>>> On a side note, we get hit with this bug often:
>>>>>>>
>>>>>>> https://www.redhat.com/archives/freeipa-users/2015-July/msg00018.html
>>>>>>>
>>>>>>> (rouge sssd_be processing hammering a replica).
>>>>>>>
>>>>>>> This causes high ns-slapd utilization on the replica and restarting sssd
>>>>>>> on the client host immediately fixes the issue. However, in this
>>>>>>> case, we're not seeing this behavior.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>
>>>>>>>>>> On 07/13/2015 04:46 PM, Andrew E. Bruno wrote:
>>>>>>>>>>> We have 3 freeipa-replicas. Centos 7.1.1503, ipa-server 4.1.0-18, and
>>>>>>>>>>> 389-ds 1.3.3.1-16.
>>>>>>>>>>>
>>>>>>>>>>> Recently, the ns-slapd process on one of our replicas started showing higher
>>>>>>>>>>> than normal CPU usage. ns-slapd is pegged at high CPU usage more or less
>>>>>>>>>>> constantly.
>>>>>>>>>>>
>>>>>>>>>>> Seems very similar to this thread:
>>>>>>>>>>> https://www.redhat.com/archives/freeipa-users/2015-February/msg00281.html
>>>>>>>>>>>
>>>>>>>>>>> There are a few errors showing in /var/log/dirsrv/slapd-[domain]/errors (not
>>>>>>>>>>> sure if these are related):
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> [13/Jul/2015:02:56:49 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>>>>>>>> [13/Jul/2015:04:11:50 -0400] - dn2entry_ext: Failed to get id for changenumber=2277387,cn=changelog from entryrdn index (-30993)
>>>>>>>>>>> [13/Jul/2015:04:11:50 -0400] - Operation error fetching changenumber=2277387,cn=changelog (null), error -30993.
>>>>>>>>>>> [13/Jul/2015:04:11:50 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2277387, dn = changenumber=2277387,cn=changelog: Operations error.
>>>>>>>>>>> [13/Jul/2015:04:11:50 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>>>>>>>> [13/Jul/2015:07:41:49 -0400] - Operation error fetching Null DN (01de36ac-295411e5-b94db2ab-07afbca6), error -30993.
>>>>>>>>>>> [13/Jul/2015:07:41:49 -0400] - dn2entry_ext: Failed to get id for changenumber=2281464,cn=changelog from entryrdn index (-30993)
>>>>>>>>>>> [13/Jul/2015:07:41:49 -0400] - Operation error fetching changenumber=2281464,cn=changelog (null), error -30993.
>>>>>>>>>>> [13/Jul/2015:07:41:49 -0400] DSRetroclPlugin - replog: an error occured while adding change number 2281464, dn = changenumber=2281464,cn=changelog: Operations error.
>>>>>>>>>>> [13/Jul/2015:07:41:49 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> access logs seem to be showing normal activity. Here's the number of open
>>>>>>>>>>> connections:
>>>>>>>>>>>
>>>>>>>>>>> # ls -al /proc/`cat /var/run/dirsrv/slapd-[domain].pid`/fd|grep socket|wc -l
>>>>>>>>>>> 62
>>>>>>>>>>>
>>>>>>>>>>> Note: the other two replicas have much higher open connections (>250) and low
>>>>>>>>>>> cpu load avgs.
>>>>>>>>>>>
>>>>>>>>>>> Here's some output of logconv.pl from our most recent access log on the replica
>>>>>>>>>>> with high cpu load:
>>>>>>>>>>>
>>>>>>>>>>> Start of Logs: 13/Jul/2015:04:49:18
>>>>>>>>>>> End of Logs: 13/Jul/2015:10:06:11
>>>>>>>>>>>
>>>>>>>>>>> Processed Log Time: 5 Hours, 16 Minutes, 53 Seconds
>>>>>>>>>>>
>>>>>>>>>>> Restarts: 0
>>>>>>>>>>> Total Connections: 2343
>>>>>>>>>>> - LDAP Connections: 2120
>>>>>>>>>>> - LDAPI Connections: 223
>>>>>>>>>>> - LDAPS Connections: 0
>>>>>>>>>>> - StartTLS Extended Ops: 45
>>>>>>>>>>> Secure Protocol Versions:
>>>>>>>>>>> - TLS1.2 128-bit AES - 45
>>>>>>>>>>>
>>>>>>>>>>> Peak Concurrent Connections: 22
>>>>>>>>>>> Total Operations: 111865
>>>>>>>>>>> Total Results: 111034
>>>>>>>>>>> Overall Performance: 99.3%
>>>>>>>>>>>
>>>>>>>>>>> Searches: 95585 (5.03/sec) (301.64/min)
>>>>>>>>>>> Modifications: 3369 (0.18/sec) (10.63/min)
>>>>>>>>>>> Adds: 0 (0.00/sec) (0.00/min)
>>>>>>>>>>> Deletes: 0 (0.00/sec) (0.00/min)
>>>>>>>>>>> Mod RDNs: 0 (0.00/sec) (0.00/min)
>>>>>>>>>>> Compares: 0 (0.00/sec) (0.00/min)
>>>>>>>>>>> Binds: 7082 (0.37/sec) (22.35/min)
>>>>>>>>>>>
>>>>>>>>>>> Proxied Auth Operations: 0
>>>>>>>>>>> Persistent Searches: 0
>>>>>>>>>>> Internal Operations: 0
>>>>>>>>>>> Entry Operations: 0
>>>>>>>>>>> Extended Operations: 5317
>>>>>>>>>>> Abandoned Requests: 416
>>>>>>>>>>> Smart Referrals Received: 0
>>>>>>>>>>>
>>>>>>>>>>> VLV Operations: 96
>>>>>>>>>>> VLV Unindexed Searches: 0
>>>>>>>>>>> VLV Unindexed Components: 32
>>>>>>>>>>> SORT Operations: 64
>>>>>>>>>>>
>>>>>>>>>>> Entire Search Base Queries: 0
>>>>>>>>>>> Paged Searches: 3882
>>>>>>>>>>> Unindexed Searches: 0
>>>>>>>>>>> Unindexed Components: 5
>>>>>>>>>>>
>>>>>>>>>>> FDs Taken: 2566
>>>>>>>>>>> FDs Returned: 2643
>>>>>>>>>>> Highest FD Taken: 249
>>>>>>>>>>>
>>>>>>>>>>> Broken Pipes: 0
>>>>>>>>>>> Connections Reset By Peer: 0
>>>>>>>>>>> Resource Unavailable: 0
>>>>>>>>>>> Max BER Size Exceeded: 0
>>>>>>>>>>>
>>>>>>>>>>> Binds: 7082
>>>>>>>>>>> Unbinds: 2443
>>>>>>>>>>> - LDAP v2 Binds: 0
>>>>>>>>>>> - LDAP v3 Binds: 6859
>>>>>>>>>>> - AUTOBINDs: 223
>>>>>>>>>>> - SSL Client Binds: 0
>>>>>>>>>>> - Failed SSL Client Binds: 0
>>>>>>>>>>> - SASL Binds: 6814
>>>>>>>>>>> GSSAPI - 6591
>>>>>>>>>>> EXTERNAL - 223
>>>>>>>>>>> - Directory Manager Binds: 0
>>>>>>>>>>> - Anonymous Binds: 6591
>>>>>>>>>>> - Other Binds: 491
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> strace timing on the ns-slapd process:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> % time seconds usecs/call calls errors syscall
>>>>>>>>>>> ------ ----------- ----------- --------- --------- ----------------
>>>>>>>>>>> 94.40 0.346659 5977 58 poll
>>>>>>>>>>> 4.10 0.015057 15057 1 restart_syscall
>>>>>>>>>>> 0.91 0.003353 57 59 59 getpeername
>>>>>>>>>>> 0.49 0.001796 150 12 futex
>>>>>>>>>>> 0.10 0.000364 73 5 read
>>>>>>>>>>> ------ ----------- ----------- --------- --------- ----------------
>>>>>>>>>>> 100.00 0.367229 135 59 total
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> top output (with threads 'H'):
>>>>>>>>>>>
>>>>>>>>>>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>>>>>>>>>>> 2879 dirsrv 20 0 3819252 1.962g 11680 R 99.9 3.1 8822:10 ns-slapd
>>>>>>>>>>> 2895 dirsrv 20 0 3819252 1.962g 11680 R 34.1 3.1 115:10.62 ns-slapd
>>>>>>>>>>> 2889 dirsrv 20 0 3819252 1.962g 11680 R 2.4 3.1 115:34.42 ns-slapd
>>>>>>>>>>> 2917 dirsrv 20 0 3819252 1.962g 11680 S 2.4 3.1 115:26.87 ns-slapd
>>>>>>>>>>> 2898 dirsrv 20 0 3819252 1.962g 11680 S 2.1 3.1 116:33.12 ns-slapd
>>>>>>>>>>> 2904 dirsrv 20 0 3819252 1.962g 11680 S 2.1 3.1 115:08.56 ns-slapd
>>>>>>>>>>> 2892 dirsrv 20 0 3819252 1.962g 11680 S 1.8 3.1 115:33.04 ns-slapd
>>>>>>>>>>> 2897 dirsrv 20 0 3819252 1.962g 11680 R 1.8 3.1 114:54.28 ns-slapd
>>>>>>>>>>> 2914 dirsrv 20 0 3819252 1.962g 11680 R 1.8 3.1 116:03.35 ns-slapd
>>>>>>>>>>> 2907 dirsrv 20 0 3819252 1.962g 11680 S 1.5 3.1 115:42.25 ns-slapd
>>>>>>>>>>> 2910 dirsrv 20 0 3819252 1.962g 11680 R 1.5 3.1 116:01.99 ns-slapd
>>>>>>>>>>> 2870 dirsrv 20 0 3819252 1.962g 11680 R 1.2 3.1 611:30.22 ns-slapd
>>>>>>>>>>> 2890 dirsrv 20 0 3819252 1.962g 11680 S 1.2 3.1 115:18.25 ns-slapd
>>>>>>>>>>> 2891 dirsrv 20 0 3819252 1.962g 11680 S 1.2 3.1 115:22.24 ns-slapd
>>>>>>>>>>> 2899 dirsrv 20 0 3819252 1.962g 11680 R 1.2 3.1 116:11.85 ns-slapd
>>>>>>>>>>> 2888 dirsrv 20 0 3819252 1.962g 11680 S 0.9 3.1 114:51.19 ns-slapd
>>>>>>>>>>> 2896 dirsrv 20 0 3819252 1.962g 11680 R 0.9 3.1 115:46.84 ns-slapd
>>>>>>>>>>> 2915 dirsrv 20 0 3819252 1.962g 11680 S 0.9 3.1 115:49.34 ns-slapd
>>>>>>>>>>> 2887 dirsrv 20 0 3819252 1.962g 11680 R 0.6 3.1 115:49.85 ns-slapd
>>>>>>>>>>> 2894 dirsrv 20 0 3819252 1.962g 11680 S 0.6 3.1 115:58.02 ns-slapd
>>>>>>>>>>> 2911 dirsrv 20 0 3819252 1.962g 11680 S 0.6 3.1 116:22.84 ns-slapd
>>>>>>>>>>> 2913 dirsrv 20 0 3819252 1.962g 11680 S 0.6 3.1 114:43.56 ns-slapd
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> ns-slapd stays pegged >99%. Trying to figure out what ns-slapd is doing? Any
>>>>>>>>>>> pointers on where else to look?
>>>>>>>>>>>
>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>
>>>>>>>>>>> --Andrew
>>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Manage your subscription for the Freeipa-users mailing list:
>>>>>>>>>> https://www.redhat.com/mailman/listinfo/freeipa-users
>>>>>>>>>> Go to http://freeipa.org for more info on the project
>>>>>>>>>>
>>>>>>>>>>
>>
More information about the Freeipa-users
mailing list