[Freeipa-users] FreeIPA 3.3 performance issues with many hosts

Wed Oct 7 15:30:07 UTC 2015

On 10/07/2015 05:03 PM, Dominik Korittki wrote:
>
>
> On 07.10.2015 at 15:25, thierry bordaz wrote:
>> On 10/07/2015 11:19 AM, Martin Kosek wrote:
>>> On 10/05/2015 02:13 PM, Dominik Korittki wrote:
>>>>
>>>> On 01.10.2015 at 21:52, Rob Crittenden wrote:
>>>>> Dominik Korittki wrote:
>>>>>> Hello folks,
>>>>>>
>>>>>> I am running two FreeIPA servers with around 100 users and around
>>>>>> 15,000 hosts, which users log in to via ssh. The FreeIPA servers
>>>>>> (running CentOS 7.0) ran well for a while, but as more and more
>>>>>> hosts were migrated to become FreeIPA hosts, they started to get
>>>>>> slow and unstable.
>>>>>>
>>>>>> For example, it's hard to maintain hostgroups that have more than
>>>>>> 1,000 hosts. The ipa host-* commands get slower as the hostgroup
>>>>>> grows. Is this normal?
>>>>> You mean the ipa hostgroup-* commands? Whenever the entry is
>>>>> displayed (show and add) it needs to dereference all members so
>>>>> yes, it is understandable that it gets somewhat slower with more
>>>>> members. How slow are we talking about?
>>>>>
>>>>>> We also experience random dirsrv segfaults. Here's a dmesg line
>>>>>> from the latest:
>>>>>>
>>>>>> [690787.647261] traps: ns-slapd[5217] general protection ip:7f8d6b6d6bc1 sp:7f8d3aff2a88 error:0 in libc-2.17.so[7f8d6b650000+1b6000]
>>>>> You probably want to start here:
>>>>> http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-crashes
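The dmesg line quoted above already says where in libc the fault happened: the instruction pointer minus the mapping base gives the offset inside the library. The following is only a minimal sketch of that arithmetic; the function name `fault_offset` and the regular expression are illustrative assumptions, not part of any 389-DS or FreeIPA tooling.

```python
import re

# The trap line from the dmesg output quoted above, joined onto one line.
DMESG = ("traps: ns-slapd[5217] general protection ip:7f8d6b6d6bc1 "
         "sp:7f8d3aff2a88 error:0 in libc-2.17.so[7f8d6b650000+1b6000]")

# Hypothetical pattern: captures ip, library name, mapping base and mapping size.
TRAP_RE = re.compile(
    r"ip:([0-9a-f]+).* in ([^\s\[]+)\[([0-9a-f]+)\+([0-9a-f]+)\]"
)

def fault_offset(line):
    """Return (library, offset) for a kernel 'traps:' line."""
    m = TRAP_RE.search(line)
    if m is None:
        raise ValueError("not a recognised trap line")
    ip = int(m.group(1), 16)
    base = int(m.group(3), 16)
    size = int(m.group(4), 16)
    offset = ip - base
    # Sanity check: the faulting address must lie inside the mapping.
    if not 0 <= offset < size:
        raise ValueError("instruction pointer is outside the mapping")
    return m.group(2), offset

lib, off = fault_offset(DMESG)
print(lib, hex(off))
```

The resulting offset only becomes a function name once it is resolved against the matching libc debuginfo, and as the replies in this thread point out, the crashing frame is probably a victim of the corruption rather than its origin.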
>>>> A stacktrace from the latest crash is attached to this email. After
>>>> restarting the service, this is what I get in
>>>> /var/log/dirsrv/slapd-INTERNAL/errors (hostname is ipa01.internal):
>>> Ludwig or Thierry, can you please take a look at the stack and file
>>> a 389-DS ticket if appropriate?
>>
>> Hello Dominik,
>>
>> DS is crashing during a BIND, and from the argument values we can guess
>> it was due to a heap corruption that corrupted its operation pblock.
>> This bind operation was likely a victim of the heap corruption rather
>> than the cause of it.
>>
>> Using valgrind is the best way to track down such a problem, but as you
>> already suffer from bad performance I doubt it would be acceptable.
>> How frequently does it crash? Did you identify any kind of test case?
>
> At first the crashes happened on a daily basis. Simply restarting the
> dirsrv daemon resolved the issue for another day, but later on the
> daemon did not survive more than 15 minutes most of the time. There
> were exceptions though. Sometimes the daemon ran for several hours
> until it crashed.
> I did not really identify a test case. However, I suspected it could
> have something to do with replication, as I have seen replication
> related errors in the dirsrv error log (mentioned in an earlier mail in
> this topic).
Heap corruptions usually depend on runtime dynamics, and if the server
became slower and slower, that could have changed the dynamics in favor
of heap corruption.
>
> So I did the following:
> ipa01 has a replication agreement with ipa02. ipa01 was the one with
> segfaults. I removed ipa01 from the replication agreement
> (ipa-replica-manage del), did an ipa-server-install --uninstall on
> ipa01 and created ipa01 as a replica of ipa02. Since then I have not
> experienced any crashes (for now).
> Instead I'm having trouble rebuilding a clean replication agreement
> (old RUV stuff still in the database), but that's another story I will
> eventually post on the mailing list as a new topic.
>
> As for valgrind: I have never used it before. Is there a handy
> explanation of how to use it in combination with 389ds? If I still
> experience those crashes and manage to get it working, I could try it
> out.
You may follow this procedure 
http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-memory-growthinvalid-access-with-valgrind 
(but remove --leak-check=yes because this is not a leak issue)

thanks
thierry
>
>
> Kind regards,
> Dominik Korittki
>
>>
>> thanks
>> thierry
>>>> [05/Oct/2015:13:51:30 +0200] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
>>>> [05/Oct/2015:13:51:30 +0200] - Listening on All Interfaces port 636 for LDAPS requests
>>>> [05/Oct/2015:13:51:30 +0200] - Listening on /var/run/slapd-INTERNAL.socket for LDAPI requests
>>>> [05/Oct/2015:13:51:30 +0200] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (No Kerberos credentials available)) errno 0 (Success)
>>>> [05/Oct/2015:13:51:30 +0200] slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error -2 (Local error)
>>>> [05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin - agmt="cn=meToipa02.internal" (ipa02:389): Replication bind with GSSAPI auth failed: LDAP error -2 (Local error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (No Kerberos credentials available))
>>>> [05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin - changelog program - agmt="cn=masterAgreement1-ipa02.internal-pki-tomcat" (ipa02:389): CSN 54bea480000000600000 not found, we aren't as up to date, or we purged
>>>> [05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin - agmt="cn=masterAgreement1-ipa02.internal-pki-tomcat" (ipa02:389): Data required to update replica has been purged. The replica must be reinitialized.
>>>> [05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin - agmt="cn=masterAgreement1-ipa02.internal-pki-tomcat" (ipa02:389): Incremental update failed and requires administrator action
>>>> [05/Oct/2015:13:51:33 +0200] NSMMReplicationPlugin - agmt="cn=meToipa02.internal" (ipa02:389): Replication bind with GSSAPI auth resumed
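The log excerpt above mixes routine startup messages with replication failures that need administrator action. As a rough sketch only, the agreements that report a required reinitialization could be picked out as follows. It assumes the log entries have been unwrapped to one entry per line; the helper name `agreements_needing_reinit` is hypothetical, and the matched phrase is simply copied from the messages shown above.

```python
import re

# Two entries from the excerpt above, unwrapped to one entry per line.
SAMPLE = [
    '[05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin - '
    'agmt="cn=masterAgreement1-ipa02.internal-pki-tomcat" (ipa02:389): '
    'Data required to update replica has been purged. '
    'The replica must be reinitialized.',
    '[05/Oct/2015:13:51:33 +0200] NSMMReplicationPlugin - '
    'agmt="cn=meToipa02.internal" (ipa02:389): '
    'Replication bind with GSSAPI auth resumed',
]

# Captures the agreement name from agmt="cn=<name>".
AGMT_RE = re.compile(r'agmt="cn=([^"]+)"')

def agreements_needing_reinit(lines):
    """Return agreement names whose entries say the replica must be reinitialized."""
    names = []
    for line in lines:
        if "must be reinitialized" not in line:
            continue
        m = AGMT_RE.search(line)
        if m and m.group(1) not in names:
            names.append(m.group(1))
    return names

print(agreements_needing_reinit(SAMPLE))
```

On the sample entries this reports only the pki-tomcat agreement, since the GSSAPI bind failure on the other agreement is followed by a "resumed" message three seconds later.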
>>>>
>>>>
>>>> These lines have been present since I replayed an LDIF dump from
>>>> ipa02 to ipa01, but I didn't think they were related to the segfault
>>>> problem (which is why I said there was nothing related in the
>>>> logfile).
>>>>
>>>> But I am starting to believe that these errors could be related to
>>>> each other.
>>>>
>>>>
>>>> Kind regards,
>>>> Dominik Korittki
>>>>
>>>>
>>>>>
>>>>>> Nothing in /var/log/dirsrv/slapd-INTERNAL/errors relates to the
>>>>>> problem.
>>>> Not sure about that anymore.
>>>>
>>>>>> I'm thinking about migrating to the latest CentOS 7 with FreeIPA 4,
>>>>>> but would that solve my problems?
>>>>>>
>>>>>> FreeIPA server version is 3.3.3-28.el7.centos
>>>>>> 389-ds-base.x86_64 is 1.3.1.6-26.el7_0
>>>>>>
>>>>>>
>>>>>>
>>>>>> Kind regards,
>>>>>> Dominik Korittki
>>>>>>