[Freeipa-users] ipa replica failure

Fri Jun 19 18:22:40 UTC 2015

Hello,

First time trouble shooting an ipa server failure and looking for some
guidance on how best to proceed.

First some background on our setup:

Servers are running freeipa v4.1.0 on CentOS 7.1.1503:

- ipa-server-4.1.0-18.el7.centos.3.x86_64
- 389-ds-base-1.3.3.1-16.el7_1.x86_64

3 ipa-servers, 1 first master (rep1) and 2 (rep2, rep3) replicates. The
replicates were setup to be ca's (i.e. ipa-replica-install --setup-ca...)

We have ~3000 user accounts (~1000 active the rest disabled). We have
~700 hosts enrolled (all installed using ipa-client-install and running
sssd). Hosts clients are a mix of centos 7 and centos 6.5.

We recently discovered one of our replica servers (rep2) was not
responding. A quick check of the dirsrv logs
/var/log/dirsrv/slapd-XXXX/errors (sanitized):

    PR_Accept() failed, Netscape Portable Runtime error (Process open
    FD table is full.)
    ...

The server was rebooted and after coming back up had these errors in the logs:

        389-Directory/1.3.3.1 B2015.118.1941
        replica2:636 (/etc/dirsrv/slapd-XXXX)

[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to trickle, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed in deadlock detect (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
[16/Jun/2015:10:12:33 -0400] - Serious Error---Failed to checkpoint database, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
[16/Jun/2015:10:12:33 -0400] - libdb: BDB0060 PANIC: fatal region error detected; run recovery
[16/Jun/2015:10:12:33 -0400] - checkpoint_threadmain: log archive failed - BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery (-30973)
....
[16/Jun/2015:16:24:04 -0400] - 389-Directory/1.3.3.1 B2015.118.1941 starting up
[16/Jun/2015:16:24:04 -0400] - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
...
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica dc=XXX. Check if DB RUV needs to be updated
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  55770068000300030000
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  556f4632001400040000
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  556f4631004d00050000
[16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 111 (Connection refused)
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - agmt="cn=cloneAgreement1-rep2 (rep1:389): Replication bind with SIMPLE auth failed: LDAP error -1 (Can't contact LDAP server) ()
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: disordely shutdown for replica o=ipaca. Check if DB RUV needs to be updated
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - Force update of database RUV (from CL RUV) ->  556f46290005005b0000
[16/Jun/2015:16:24:15 -0400] set_krb5_creds - Could not get initial credentials for principal [ldap/rep2] in keytab [FILE:/etc/dirsrv/ds.keytab]: -1765328228 (Cannot contact any KDC for requested realm)
[16/Jun/2015:16:24:15 -0400] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't contact LDAP server) ((null)) errno 111 (Connection refused)
[16/Jun/2015:16:24:15 -0400] slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error -1 (Can't contact LDAP server)
[16/Jun/2015:16:24:15 -0400] NSMMReplicationPlugin - agmt="cn=meTorep1" (rep1:389): Replication bind with GSSAPI auth failed: LDAP error -1 (Can't contact LDAP server) ()
[16/Jun/2015:16:24:15 -0400] - Skipping CoS Definition cn=Password Policy,cn=accounts,dc=xxx--no CoS Templates found, which should be added before the CoS Definition.
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301438 (rc: 32)
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301439 (rc: 32)
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301440 (rc: 32)
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301441 (rc: 32)
....
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301443 (rc: 32)
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301444 (rc: 32)
[16/Jun/2015:16:24:15 -0400] - slapd started.  Listening on All Interfaces port 389 for LDAP requests
[16/Jun/2015:16:24:15 -0400] - Listening on All Interfaces port 636 for LDAPS requests
[16/Jun/2015:16:24:15 -0400] - Listening on /var/run/slapd-CCR-BUFFALO-EDU.socket for LDAPI requests
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301445 (rc: 32)
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301446 (rc: 32)
[16/Jun/2015:16:24:15 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 301447 (rc: 32)
...
[16/Jun/2015:16:24:24 -0400] DSRetroclPlugin - delete_changerecord: could not delete change record 336362 (rc: 32)
[16/Jun/2015:16:24:24 -0400] slapd_ldap_sasl_interactive_bind - Error: could not perform interactive bind for id [] mech [GSSAPI]: LDAP error -1 (Can't contact LDAP server) ((null)) errno 107 (Transport endpoint is not connected)
[16/Jun/2015:16:24:24 -0400] slapi_ldap_bind - Error: could not perform interactive bind for id [] authentication mechanism [GSSAPI]: error -1 (Can't contact LDAP server)
[16/Jun/2015:16:24:24 -0400] slapi_ldap_bind - Error: could not send startTLS request: error -1 (Can't contact LDAP server) errno 107 (Transport endpoint is not connected)
[16/Jun/2015:16:24:24 -0400] NSMMReplicationPlugin - agmt="cn=cloneAgreement1-rep2-pki-tomcat" (rep1:389): Replication bind with SIMPLE auth resumed
[16/Jun/2015:16:24:25 -0400] NSMMReplicationPlugin - agmt="cn=meTorep1" (rep1:389): Replication bind with GSSAPI auth resumed
[16/Jun/2015:16:27:28 -0400] - Operation error fetching Null DN (0ce19ce4-146611e5-8135a170-bd40e05c), error -30993.
[16/Jun/2015:16:27:28 -0400] - dn2entry_ext: Failed to get id for changenumber=336746,cn=changelog from entryrdn index (-30993)
[16/Jun/2015:16:27:28 -0400] - Operation error fetching changenumber=336746,cn=changelog (null), error -30993.
[16/Jun/2015:16:27:28 -0400] DSRetroclPlugin - replog: an error occured while adding change number 336746, dn = changenumber=336746,cn=changelog: Operations error.
[16/Jun/2015:16:27:28 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
..
[16/Jun/2015:16:27:36 -0400] - Operation error fetching changenumber=336762,cn=changelog (null), error -30993.
[16/Jun/2015:16:27:36 -0400] DSRetroclPlugin - replog: an error occured while adding change number 336762, dn = changenumber=336762,cn=changelog: Operations error.
[16/Jun/2015:16:27:36 -0400] retrocl-plugin - retrocl_postob: operation failure [1]
[16/Jun/2015:16:27:36 -0400] - ldbm_back_seq deadlock retry BAD 1601, err=0 BDB0062 Successful return: 0
[16/Jun/2015:16:27:36 -0400] - ldbm_back_seq deadlock retry BAD 1601, err=0 BDB0062 Successful return: 0
[17/Jun/2015:13:41:23 -0400] - slapd shutting down - signaling operation threads - op stack size 0 max work q size 3025 max work q stack size 5
[17/Jun/2015:13:41:23 -0400] - slapd shutting down - waiting for 30 threads to terminate

Assuming we had a corrupted database we first attempted to remove the replicate
by logging into the first master and running:

    # ipa-replica-manage del rep2

This process hung indefinitely. So we proceeded to shutdown all ipa services on
rep2 (systemctl stop ipa).. then re-ran the command on the first master:

    # ipa-replica-manage -v --force del rep2

This appeared to work ok and rep2 appears to have been deleted:

    # ipa-replica-manage list
    rep3: master
    rep1: master

However, when querying ldap nsDS5ReplicationAgreement we still see rep2 with a
replica 97 id for the ipca:

# ldapsearch -Y GSSAPI -b "cn=mapping tree,cn=config" objectClass=nsDS5ReplicationAgreement -LL

dn: cn=masterAgreement1-rep3-pki-tomcat,cn=replica,cn=ipaca,cn=mapping tree,cn=config
objectClass: top
objectClass: nsds5replicationagreement
cn: masterAgreement1-rep3-pki-tomcat
nsDS5ReplicaRoot: o=ipaca
nsDS5ReplicaHost: rep3
nsDS5ReplicaPort: 389
nsDS5ReplicaBindDN: cn=Replication Manager cloneAgreement1-rep3-pki-tomcat,ou=csusers,cn=config
nsDS5ReplicaBindMethod: Simple
nsDS5ReplicaTransportInfo: TLS
description: masterAgreement1-rep3-pki-tomcat
..
nsds50ruv: {replicageneration} 5527f74b000000600000
nsds50ruv: {replica 91 ldap://rep3:389} 5537c7ba0000005b
 0000 5582c7e40004005b0000
nsds50ruv: {replica 96 ldap://rep1:389} 5527f75400000060
 0000 5582cd19000000600000
nsds50ruv: {replica 97 ldap://rep2:389} 5527f76000000061
 0000 556f462b000400610000
nsruvReplicaLastModified: {replica 91 ldap://rep3:389} 0
 0000000
nsruvReplicaLastModified: {replica 96 ldap://rep1:389} 0
 0000000
nsruvReplicaLastModified: {replica 97 ldap://rep2:389} 0
 0000000
nsds5replicaLastUpdateStart: 20150619173149Z
nsds5replicaLastUpdateEnd: 20150619173149Z
nsds5replicaChangesSentSinceStartup:: OTY6MTI0LzAg
nsds5replicaLastUpdateStatus: 0 Replica acquired successfully: Incremental upd
 ate succeeded
nsds5replicaUpdateInProgress: FALSE
nsds5replicaLastInitStart: 0
nsds5replicaLastInitEnd: 0

Questions:

0. Is it likely that after running out of file descriptors the dirsrv
slapd database on rep2 was corrupted? 

1. Do we have to run ipa-replica-manage del rep2 on *each* of the
remaining replica servers (rep1 and rep3)? Or should it just be run on
the first master? Do we need to run ipa-csreplicate-manage del as well?

2. Why does the rep2 server still appear when querying the
nsDS5ReplicationAgreement in ldap? Is this benign or will this pose problems
when we go to add rep2 back in?

3. What steps/commands can we take to verify rep2 was successfully removed and
replication is behaving normally?

We had tuned our servers according to the rhel  Performance Tuning
Guide:

   # cat /proc/sys/fs/file-max 
   6534908

   # cat /etc/security/limits.d/40-freeipa.conf 
   *         soft      nofile      8192
   *         hard      nofile      8192

   # cat /etc/sysctl.d/40-freeipa.conf 
   net.ipv4.ip_local_port_range = 1024 65000
   net.ipv4.tcp_keepalive_time = 300

   # cat /etc/sysconfig/dirsrv.systemd
   [Service]
   LimitNOFILE=8192

We're now considering increasing the nofile to something larger to
prevent running out of file descriptors. Any guidance on what number to
set this to? 

Many thanks in advance for any help.

--Andrew