[Freeipa-devel] [PATCH] 1079 address CA subsystem renewal issues

Petr Viktorin pviktori at redhat.com
Tue Jan 15 16:52:12 UTC 2013


On 01/15/2013 05:15 PM, Rob Crittenden wrote:
> Petr Viktorin wrote:
>> On 01/15/2013 03:41 PM, Petr Viktorin wrote:
>>> On 01/14/2013 10:56 PM, Rob Crittenden wrote:
>>>> Petr Viktorin wrote:
>>>>> On 01/12/2013 12:49 AM, Rob Crittenden wrote:
>>>>>> Rob Crittenden wrote:
>>>>>>> Petr Viktorin wrote:
>>>>>>>> On 01/07/2013 05:42 PM, Rob Crittenden wrote:
>>>>>>>>> Petr Viktorin wrote:
>>>>>>>>>> On 01/07/2013 03:09 PM, Rob Crittenden wrote:
>>>>>>>>>>> Petr Viktorin wrote:
>>>>>>>> [...]
>>>>>>>>>>>>
>>>>>>>>>>>> Works for me, but I have some questions (this is an area I know
>>>>>>>>>>>> little
>>>>>>>>>>>> about).
>>>>>>>>>>>>
>>>>>>>>>>>> Can we be 100% sure these certs are always renewed together? Is
>>>>>>>>>>>> certmonger the only possible mechanism to update them?
>>>>>>>>>>>
>>>>>>>>>>> You raise a good point. If though some mechanism someone
>>>>>>>>>>> replaces
>>>>>>>>>>> one of
>>>>>>>>>>> these certs it will cause the script to fail. Some
>>>>>>>>>>> notification of
>>>>>>>>>>> this
>>>>>>>>>>> failure will be logged though, and of course, the certs won't be
>>>>>>>>>>> renewed.
>>>>>>>>>>>
>>>>>>>>>>> One could conceivably manually renew one of these certificates.
>>>>>>>>>>> It is
>>>>>>>>>>> probably a very remote possibility but it is non-zero.
>>>>>>>>>>>
>>>>>>>>>>>> Can we be sure certmonger always does the updates in parallel?
>>>>>>>>>>>> If it
>>>>>>>>>>>> managed to update the audit cert before starting on the others,
>>>>>>>>>>>> we'd
>>>>>>>>>>>> get
>>>>>>>>>>>> no CA restart for the others.
>>>>>>>>>>>
>>>>>>>>>>> These all get issued at the same time so should expire at the
>>>>>>>>>>> same
>>>>>>>>>>> time
>>>>>>>>>>> as well (see problem above). The script will hang around for 10
>>>>>>>>>>> minutes
>>>>>>>>>>> waiting for the renewal to complete, then give up.
>>>>>>>>>>
>>>>>>>>>> The certs might take different amounts of time to update, right?
>>>>>>>>>> Eventually, the expirations could go out of sync enough for it to
>>>>>>>>>> matter.
>>>>>>>>>> AFAICS, without proper locking we still get a race condition when
>>>>>>>>>> the
>>>>>>>>>> other certs start being renewed some time (much less than 10 min)
>>>>>>>>>> after
>>>>>>>>>> the audit one:
>>>>>>>>>>
>>>>>>>>>> (time axis goes down)
>>>>>>>>>>
>>>>>>>>>>          audit cert                  other cert
>>>>>>>>>>          ----------                  ----------
>>>>>>>>>>      certmonger does renew                .
>>>>>>>>>>    post-renew script starts               .
>>>>>>>>>>   check state of other certs: OK          .
>>>>>>>>>>              .                   certmonger starts renew
>>>>>>>>>>   certutil modifies NSS DB  +  certmonger modifies NSS DB  ==
>>>>>>>>>> boom!
>>>>>>>>>
>>>>>>>>> This can't happen because we count the # of expected certs and
>>>>>>>>> wait
>>>>>>>>> until all are in MONITORING before continuing.
>>>>>>>>
>>>>>>>> The problem is that they're also in MONITORING before the whole
>>>>>>>> renewal
>>>>>>>> starts. If the script happens to check just before the state
>>>>>>>> changes
>>>>>>>> from MONITORING to GENERATING_CSR or whatever, we can get
>>>>>>>> corruption.
>>>>>>>>
>>>>>>>>> The worse that would
>>>>>>>>> happen is the trust wouldn't be set on the audit cert and dogtag
>>>>>>>>> wouldn't be restarted.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> The state the system would be in is this:
>>>>>>>>>>>
>>>>>>>>>>> - audit cert trust not updated, so next restart of CA will fail
>>>>>>>>>>> - CA is not restarted so will not use updated certificates
>>>>>>>>>>>
>>>>>>>>>>>> And anyway, why does certmonger do renewals in parallel? It
>>>>>>>>>>>> seems
>>>>>>>>>>>> that
>>>>>>>>>>>> if it did one at a time, always waiting until the post-renew
>>>>>>>>>>>> script is
>>>>>>>>>>>> done, this patch wouldn't be necessary.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  From what Nalin told me certmonger has some coarse locking such
>>>>>>>>>>> that
>>>>>>>>>>> renewals in a the same NSS database are serialized. As you point
>>>>>>>>>>> out, it
>>>>>>>>>>> would be nice to extend this locking to the post renewal
>>>>>>>>>>> scripts. We
>>>>>>>>>>> can
>>>>>>>>>>> ask Nalin about it. That would fix the potential corruption
>>>>>>>>>>> issue.
>>>>>>>>>>> It is
>>>>>>>>>>> still much nicer to not have to restart dogtag 4 times.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Well, three extra restarts every few years seems like a small
>>>>>>>>>> price to
>>>>>>>>>> pay for robustness.
>>>>>>>>>
>>>>>>>>> It is a bit of a problem though because the certs all renew within
>>>>>>>>> seconds so end up fighting over who is restarting dogtag. This can
>>>>>>>>> cause
>>>>>>>>> some renewals go into a failure state to be retried later. This is
>>>>>>>>> fine
>>>>>>>>> functionally but makes QE a bit of a pain. You then have to make
>>>>>>>>> sure
>>>>>>>>> that renewal is basically done, then restart certmonger and check
>>>>>>>>> everything again, over and over until all the certs are renewed.
>>>>>>>>> This is
>>>>>>>>> difficult to automate.
>>>>>>>>
>>>>>>>> So we need to extend the certmonger lock, and wait until Dogtag is
>>>>>>>> back
>>>>>>>> up before exiting the script. That way it'd still take longer
>>>>>>>> than 1
>>>>>>>> restart, but all the renews should succeed.
>>>>>>>>
>>>>>>>
>>>>>>> Right, but older dogtag versions don't have the handy servlet to
>>>>>>> tell
>>>>>>> that the service is actually up and responding. So it is
>>>>>>> difficult to
>>>>>>> tell from tomcat alone whether the CA is actually up and handling
>>>>>>> requests.
>>>>>>>
>>>>>>
>>>>>> Revised patch that takes advantage of new version of certmonger.
>>>>>> certmonger-0.65 adds locking from the time renewal begins to the
>>>>>> end of
>>>>>> the post_save_command. This lets us be sure that no other certmonger
>>>>>> renewals will have the NSS database open in read-write mode.
>>>>>>
>>>>>> We need to be sure that tomcat is shut down before we let certmonger
>>>>>> save the certificate to the NSS database because dogtag opens its
>>>>>> database read/write and two writers can cause corruption.
>>>>>>
>>>>>> rob
>>>>>>
>>>>>
>>>>> stop_pkicad and start_pkicad need the Dogtag version check to select
>>>>> pki_cad/pki_tomcatd.
>>>>
>>>> Fixed.
>>>>
>>>>>
>>>>> A more serious issue is that stop_pkicad needs to be installed on
>>>>> upgrades. Currently the whole enable_certificate_renewal step in
>>>>> ipa-upgradeconfig is skipped if it was done before.
>>>>
>>>> I added a separate upgrade test for this. It currently won't work in
>>>> SELinux enforcing mode because certmonger isn't allowed to talk to dbus
>>>> in an rpm post script. It's being looked at.
>>>>
>>>>> In stop_pkicad can you change the first log message to "certmonger
>>>>> stopping %sd"? It's before the action so we don't want past tense.
>>>>
>>>> Fixed.
>>>>
>>>> rob
>>>
>>> I get a bunch of errors when installing the RPM:
>>>
>> [...]
>>>
>>
>> This is the SELinux issue you were talking about. Sorry for not catching
>> that.
>>
>> With enforcing off, the patch looks & works well for me. I'm just
>> concerned about this change in ipa-upgradeconfig:
>>
>> @@ -707,7 +754,7 @@ def main():
>>           # configuration has changed, restart the name server
>>           root_logger.info('Changes to named.conf have been made,
>> restart named')
>>           bindinstance.BindInstance(fstore).restart()
>> -    ca_restart = ca_restart or enable_certificate_renewal(ca) or
>> upgrade_ipa_profile(ca, api.env.domain, fqdn)
>> +    ca_restart = ca_restart or enable_certificate_renewal(ca) or
>> upgrade_ipa_profile(ca, api.env.domain, fqdn) or
>> certificate_renewal_stop_ca(ca)
>>
>> If the enable_certificate_renewal step was done already, but
>> upgrade_ipa_profile requests a CA restart, then the short-circuiting
>> `or` will be satisfied and certificate_renewal_stop_ca won't be run.
>>
>> Since each upgrade step has its own checking, I think it would be safer
>> to use something like:
>>      ca_restart = certificate_renewal_stop_ca(ca) or ca_restart
>>
>> or even:
>> ca_restart = any([
>>      ca_restart,
>>      enable_certificate_renewal(ca),
>>      upgrade_ipa_profile(ca, api.env.domain, fqdn),
>>      certificate_renewal_stop_ca(ca),
>> ])
>>
>
> I like this suggestion very much. Updated patch attached.
>
> rob
>

ACK, just remove the trailing space in the `]) ` line.


We'll need to make sure the SELinux issue isn't forgotten.

-- 
Petr³




More information about the Freeipa-devel mailing list