[Freeipa-users] FreeIPA 3.3 performance issues with many hosts

Dominik Korittki d.korittki at mittwald.de
Wed Oct 21 13:56:10 UTC 2015



Am 07.10.2015 um 17:30 schrieb thierry bordaz:
> On 10/07/2015 05:03 PM, Dominik Korittki wrote:
>>
>>
>> Am 07.10.2015 um 15:25 schrieb thierry bordaz:
>>> On 10/07/2015 11:19 AM, Martin Kosek wrote:
>>>> On 10/05/2015 02:13 PM, Dominik Korittki wrote:
>>>>>
>>>>> Am 01.10.2015 um 21:52 schrieb Rob Crittenden:
>>>>>> Dominik Korittki wrote:
>>>>>>> Hello folks,
>>>>>>>
>>>>>>> I am running two FreeIPA servers with around 100 users and around
>>>>>>> 15,000
>>>>>>> hosts, which are used by users to log in via ssh. The FreeIPA servers
>>>>>>> (which are Centos 7.0) ran good for a while, but as more and more
>>>>>>> hosts
>>>>>>> got migrated to serve as FreeIPA hosts, it started to get slow and
>>>>>>> unstable.
>>>>>>>
>>>>>>> For example, it's hard to maintain hostgroups which have more than
>>>>>>> 1,000
>>>>>>> hosts. The ipa host-* commands are getting slower as the hostgroup
>>>>>>> grows. Is this normal?
>>>>>> You mean the ipa hostgroup-* commands? Whenever the entry is
>>>>>> displayed
>>>>>> (show and add) it needs to dereference all members so yes, it is
>>>>>> understandable that it gets somewhat slower with more members. How
>>>>>> slow
>>>>>> are we talking about?
>>>>>>
>>>>>>> We also experience random dirsrv segfaults. Here's a dmesg line
>>>>>>> from the
>>>>>>> latest:
>>>>>>>
>>>>>>> [690787.647261] traps: ns-slapd[5217] general protection
>>>>>>> ip:7f8d6b6d6bc1
>>>>>>> sp:7f8d3aff2a88 error:0 in libc-2.17.so[7f8d6b650000+1b6000]
>>>>>> You probably want to start here:
>>>>>> http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-crashes
>>>>> A stacktrace from the latest crash is attached to this email. After
>>>>> restarting
>>>>> the service, this is what I get in
>>>>> /var/log/dirsrv/slapd-INTERNAL/errors
>>>>> (hostname is ipa01.internal):
>>>> Ludwig or Thierry, can you please take a look at the stack and file
>>>> 389-DS
>>>> ticket if appropriate?
>>>
>>> Hello Dominik,
>>>
>>> DS is crashing during a BIND, and from the argument values we can guess
>>> it was due to a heap corruption that corrupted its operation pblock.
>>> The bind operation was likely a victim of the heap corruption rather
>>> than responsible for it.
>>>
>>> Using valgrind is the best way to track such a problem, but as you
>>> already suffer from bad performance I doubt it would be acceptable.
>>> How frequently does it crash? Did you identify any kind of test case?
>>
>> At first the crashes happened on a daily basis. Simply restarting the
>> dirsrv daemon resolved the issue for another day, but later on the
>> daemon did not survive more than 15 minutes most of the time. There
>> were exceptions, though. Sometimes the daemon ran for several hours
>> until it crashed.
>> I did not really identify a testcase. However, I supposed it could
>> have something to do with replication, as I have seen replication
>> related errors in dirsrv error log (mentioned in an earlier mail in
>> this topic).
> Heap corruption is usually dynamic, and if the server became slower and
> slower, that could shift the timing in favor of the heap corruption.
>>
>> So did the following:
>> ipa01 has a replication agreement with ipa02. ipa01 was the one with
>> segfaults. I removed ipa01 from the replication agreement
>> (ipa-replica-manage del), did an ipa-server-install --uninstall on
>> ipa01 and created ipa01 as a replica of ipa02. Since then I did not
>> experience any crashes (for now).
>> Instead I'm having trouble rebuilding a clean replication agreement
>> (old RUV data is still in the database), but that's another story I
>> will eventually post to the mailing list as a new topic.
>>
>> As for valgrind: never used it before. Is there a handy explanation of
>> how to use it in combination with 389-ds? If I still experience those
>> crashes and manage to get it working, I could try it out.
> You may follow this procedure
> http://www.port389.org/docs/389ds/FAQ/faq.html#debugging-memory-growthinvalid-access-with-valgrind
> (but remove --leak-check=yes because this is not a leak issue)
>
> thanks
> thierry

I experienced segmentation faults again on host ipa01, even after I 
rebuilt the replication topology as described in my previous mail.
I followed your advice and ran valgrind last evening. Sadly I forgot to 
remove --leak-check=yes, but I hope the information is still useful to 
you. If not, I'll do it again without --leak-check=yes.
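For reference, a rerun without leak checking would look roughly like this 
(a sketch following the port389.org debugging FAQ; the instance name 
INTERNAL is taken from the log paths below, the other paths and options 
are assumptions to adjust as needed):

```shell
# Sketch only: stop the instance, then start ns-slapd by hand under valgrind.
# "INTERNAL" is the instance name from /var/log/dirsrv/slapd-INTERNAL.
systemctl stop dirsrv@INTERNAL

# memcheck is valgrind's default tool; --leak-check=yes is deliberately
# omitted here, since this is a corruption hunt rather than a leak hunt.
valgrind --tool=memcheck \
         --track-origins=yes \
         --num-callers=40 \
         --log-file=/var/tmp/slapd.vg.%p \
         /usr/sbin/ns-slapd -D /etc/dirsrv/slapd-INTERNAL \
                            -i /var/run/dirsrv/slapd-INTERNAL.pid
```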

Running under valgrind, the ns-slapd process needed quite some time 
to open its ports. You can see this by watching the error logs:

[20/Oct/2015:22:27:41 +0200] - 389-Directory/1.3.1.6 B2014.219.1825 
starting up
[20/Oct/2015:22:27:42 +0200] - WARNING: userRoot: entry cache size 
10485760B is less than db size 142483456B; We recommend to increase the 
entry cache size nsslapd-cachememsize.
[20/Oct/2015:22:27:44 +0200] schema-compat-plugin - warning: no entries 
set up under cn=computers, cn=compat,dc=internal
[20/Oct/2015:23:09:16 +0200] - slapd started.  Listening on All 
Interfaces port 389 for LDAP requests
[20/Oct/2015:23:09:16 +0200] - Listening on All Interfaces port 636 for 
LDAPS requests
[20/Oct/2015:23:09:16 +0200] - Listening on 
/var/run/slapd-INTERNAL.socket for LDAPI requests

I guess that's normal, since running the process under valgrind incurs a 
huge performance penalty? The daemon crashed roughly 25 seconds after it 
had opened its ports. Here is the valgrind log:
http://pastebin.com/8t9RtB6p

Do you see any suspicious things? Many thanks for your help!
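As a side note, the error log above also warns that the userRoot entry 
cache (10485760 B, i.e. 10 MB) is smaller than the database (~142 MB). If 
that turns out to matter here, raising nsslapd-cachememsize might look 
roughly like this (a sketch; cn=userRoot,cn=ldbm database,cn=plugins,cn=config 
is the standard 389-DS backend entry, and the 256 MB value is only an 
assumption, not a recommendation from this thread):

```shell
# Sketch: raise the userRoot entry cache to 256 MB (value is an assumption).
ldapmodify -D "cn=Directory Manager" -W <<EOF
dn: cn=userRoot,cn=ldbm database,cn=plugins,cn=config
changetype: modify
replace: nsslapd-cachememsize
nsslapd-cachememsize: 268435456
EOF

# The new cache size takes effect after a restart:
systemctl restart dirsrv@INTERNAL
```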


- Dominik

>>
>>
>> Kind regards,
>> Dominik Korittki
>>
>>>
>>> thanks
>>> thierry
>>>>> [05/Oct/2015:13:51:30 +0200] - slapd started.  Listening on All
>>>>> Interfaces port
>>>>> 389 for LDAP requests
>>>>> [05/Oct/2015:13:51:30 +0200] - Listening on All Interfaces port 636
>>>>> for LDAPS
>>>>> requests
>>>>> [05/Oct/2015:13:51:30 +0200] - Listening on
>>>>> /var/run/slapd-INTERNAL.socket for
>>>>> LDAPI requests
>>>>> [05/Oct/2015:13:51:30 +0200] slapd_ldap_sasl_interactive_bind -
>>>>> Error: could
>>>>> not perform interactive bind for id [] mech [GSSAPI]: LDAP error -2
>>>>> (Local
>>>>> error) (SASL(-1): generic failure: GSSAPI Error: Unspecified GSS
>>>>> failure.
>>>>> Minor code may provide more information (No Kerberos credentials
>>>>> available))
>>>>> errno 0 (Success)
>>>>> [05/Oct/2015:13:51:30 +0200] slapi_ldap_bind - Error: could not
>>>>> perform
>>>>> interactive bind for id [] authentication mechanism [GSSAPI]: error
>>>>> -2 (Local
>>>>> error)
>>>>> [05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin -
>>>>> agmt="cn=meToipa02.internal" (ipa02:389): Replication bind with
>>>>> GSSAPI auth
>>>>> failed: LDAP error -2 (Local error) (SASL(-1): generic failure:
>>>>> GSSAPI Error:
>>>>> Unspecified GSS failure.  Minor code may provide more information (No
>>>>> Kerberos
>>>>> credentials available))
>>>>> [05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin - changelog
>>>>> program -
>>>>> agmt="cn=masterAgreement1-ipa02.internal-pki-tomcat" (ipa02:389): CSN
>>>>> 54bea480000000600000 not found, we aren't as up to date, or we purged
>>>>> [05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin -
>>>>> agmt="cn=masterAgreement1-ipa02.internal-pki-tomcat" (ipa02:389):
>>>>> Data required
>>>>> to update replica has been purged. The replica must be reinitialized.
>>>>> [05/Oct/2015:13:51:30 +0200] NSMMReplicationPlugin -
>>>>> agmt="cn=masterAgreement1-ipa02.internal-pki-tomcat" (ipa02:389):
>>>>> Incremental
>>>>> update failed and requires administrator action
>>>>> [05/Oct/2015:13:51:33 +0200] NSMMReplicationPlugin -
>>>>> agmt="cn=meToipa02.internal" (ipa02:389): Replication bind with
>>>>> GSSAPI auth
>>>>> resumed
>>>>>
>>>>>
>>>>> These lines have been present since I replayed an LDIF dump from
>>>>> ipa02 to ipa01, but I
>>>>> didn't think they were related to the segfault problem (which is why
>>>>> I said there
>>>>> were no related problems in the logfile).
>>>>>
>>>>> But I am starting to believe that these errors could be related
>>>>> to each other.
>>>>>
>>>>>
>>>>> Kind regards,
>>>>> Dominik Korittki
>>>>>
>>>>>
>>>>>>
>>>>>>> Nothing in /var/log/dirsrv/slapd-INTERNAL/errors, which relates
>>>>>>> to the
>>>>>>> problem.
>>>>> Not sure about that anymore.
>>>>>
>>>>>>> I'm thinking about migrating to latest CentOS 7 FreeIPA 4, but does
>>>>>>> that
>>>>>>> solve my problems?
>>>>>>>
>>>>>>> FreeIPA server version is 3.3.3-28.el7.centos
>>>>>>> 389-ds-base.x86_64 is 1.3.1.6-26.el7_0
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Dominik Korittki
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>>>
>>>
>>>
>>
>
>
>
>



