[Freeipa-devel] [PATCH] 1079 address CA subsystem renewal issues

Rob Crittenden rcritten at redhat.com
Tue Jan 15 16:15:35 UTC 2013


Petr Viktorin wrote:
> On 01/15/2013 03:41 PM, Petr Viktorin wrote:
>> On 01/14/2013 10:56 PM, Rob Crittenden wrote:
>>> Petr Viktorin wrote:
>>>> On 01/12/2013 12:49 AM, Rob Crittenden wrote:
>>>>> Rob Crittenden wrote:
>>>>>> Petr Viktorin wrote:
>>>>>>> On 01/07/2013 05:42 PM, Rob Crittenden wrote:
>>>>>>>> Petr Viktorin wrote:
>>>>>>>>> On 01/07/2013 03:09 PM, Rob Crittenden wrote:
>>>>>>>>>> Petr Viktorin wrote:
>>>>>>> [...]
>>>>>>>>>>>
>>>>>>>>>>> Works for me, but I have some questions (this is an area I know
>>>>>>>>>>> little
>>>>>>>>>>> about).
>>>>>>>>>>>
>>>>>>>>>>> Can we be 100% sure these certs are always renewed together? Is
>>>>>>>>>>> certmonger the only possible mechanism to update them?
>>>>>>>>>>
>>>>>>>>>> You raise a good point. If though some mechanism someone replaces
>>>>>>>>>> one of
>>>>>>>>>> these certs it will cause the script to fail. Some
>>>>>>>>>> notification of
>>>>>>>>>> this
>>>>>>>>>> failure will be logged though, and of course, the certs won't be
>>>>>>>>>> renewed.
>>>>>>>>>>
>>>>>>>>>> One could conceivably manually renew one of these certificates.
>>>>>>>>>> It is
>>>>>>>>>> probably a very remote possibility but it is non-zero.
>>>>>>>>>>
>>>>>>>>>>> Can we be sure certmonger always does the updates in parallel?
>>>>>>>>>>> If it
>>>>>>>>>>> managed to update the audit cert before starting on the others,
>>>>>>>>>>> we'd
>>>>>>>>>>> get
>>>>>>>>>>> no CA restart for the others.
>>>>>>>>>>
>>>>>>>>>> These all get issued at the same time so should expire at the
>>>>>>>>>> same
>>>>>>>>>> time
>>>>>>>>>> as well (see problem above). The script will hang around for 10
>>>>>>>>>> minutes
>>>>>>>>>> waiting for the renewal to complete, then give up.
>>>>>>>>>
>>>>>>>>> The certs might take different amounts of time to update, right?
>>>>>>>>> Eventually, the expirations could go out of sync enough for it to
>>>>>>>>> matter.
>>>>>>>>> AFAICS, without proper locking we still get a race condition when
>>>>>>>>> the
>>>>>>>>> other certs start being renewed some time (much less than 10 min)
>>>>>>>>> after
>>>>>>>>> the audit one:
>>>>>>>>>
>>>>>>>>> (time axis goes down)
>>>>>>>>>
>>>>>>>>>          audit cert                  other cert
>>>>>>>>>          ----------                  ----------
>>>>>>>>>      certmonger does renew                .
>>>>>>>>>    post-renew script starts               .
>>>>>>>>>   check state of other certs: OK          .
>>>>>>>>>              .                   certmonger starts renew
>>>>>>>>>   certutil modifies NSS DB  +  certmonger modifies NSS DB  ==
>>>>>>>>> boom!
>>>>>>>>
>>>>>>>> This can't happen because we count the # of expected certs and wait
>>>>>>>> until all are in MONITORING before continuing.
>>>>>>>
>>>>>>> The problem is that they're also in MONITORING before the whole
>>>>>>> renewal
>>>>>>> starts. If the script happens to check just before the state changes
>>>>>>> from MONITORING to GENERATING_CSR or whatever, we can get
>>>>>>> corruption.
>>>>>>>
>>>>>>>> The worse that would
>>>>>>>> happen is the trust wouldn't be set on the audit cert and dogtag
>>>>>>>> wouldn't be restarted.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> The state the system would be in is this:
>>>>>>>>>>
>>>>>>>>>> - audit cert trust not updated, so next restart of CA will fail
>>>>>>>>>> - CA is not restarted so will not use updated certificates
>>>>>>>>>>
>>>>>>>>>>> And anyway, why does certmonger do renewals in parallel? It
>>>>>>>>>>> seems
>>>>>>>>>>> that
>>>>>>>>>>> if it did one at a time, always waiting until the post-renew
>>>>>>>>>>> script is
>>>>>>>>>>> done, this patch wouldn't be necessary.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  From what Nalin told me certmonger has some coarse locking such
>>>>>>>>>> that
>>>>>>>>>> renewals in a the same NSS database are serialized. As you point
>>>>>>>>>> out, it
>>>>>>>>>> would be nice to extend this locking to the post renewal
>>>>>>>>>> scripts. We
>>>>>>>>>> can
>>>>>>>>>> ask Nalin about it. That would fix the potential corruption
>>>>>>>>>> issue.
>>>>>>>>>> It is
>>>>>>>>>> still much nicer to not have to restart dogtag 4 times.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Well, three extra restarts every few years seems like a small
>>>>>>>>> price to
>>>>>>>>> pay for robustness.
>>>>>>>>
>>>>>>>> It is a bit of a problem though because the certs all renew within
>>>>>>>> seconds so end up fighting over who is restarting dogtag. This can
>>>>>>>> cause
>>>>>>>> some renewals go into a failure state to be retried later. This is
>>>>>>>> fine
>>>>>>>> functionally but makes QE a bit of a pain. You then have to make
>>>>>>>> sure
>>>>>>>> that renewal is basically done, then restart certmonger and check
>>>>>>>> everything again, over and over until all the certs are renewed.
>>>>>>>> This is
>>>>>>>> difficult to automate.
>>>>>>>
>>>>>>> So we need to extend the certmonger lock, and wait until Dogtag is
>>>>>>> back
>>>>>>> up before exiting the script. That way it'd still take longer than 1
>>>>>>> restart, but all the renews should succeed.
>>>>>>>
>>>>>>
>>>>>> Right, but older dogtag versions don't have the handy servlet to tell
>>>>>> that the service is actually up and responding. So it is difficult to
>>>>>> tell from tomcat alone whether the CA is actually up and handling
>>>>>> requests.
>>>>>>
>>>>>
>>>>> Revised patch that takes advantage of new version of certmonger.
>>>>> certmonger-0.65 adds locking from the time renewal begins to the
>>>>> end of
>>>>> the post_save_command. This lets us be sure that no other certmonger
>>>>> renewals will have the NSS database open in read-write mode.
>>>>>
>>>>> We need to be sure that tomcat is shut down before we let certmonger
>>>>> save the certificate to the NSS database because dogtag opens its
>>>>> database read/write and two writers can cause corruption.
>>>>>
>>>>> rob
>>>>>
>>>>
>>>> stop_pkicad and start_pkicad need the Dogtag version check to select
>>>> pki_cad/pki_tomcatd.
>>>
>>> Fixed.
>>>
>>>>
>>>> A more serious issue is that stop_pkicad needs to be installed on
>>>> upgrades. Currently the whole enable_certificate_renewal step in
>>>> ipa-upgradeconfig is skipped if it was done before.
>>>
>>> I added a separate upgrade test for this. It currently won't work in
>>> SELinux enforcing mode because certmonger isn't allowed to talk to dbus
>>> in an rpm post script. It's being looked at.
>>>
>>>> In stop_pkicad can you change the first log message to "certmonger
>>>> stopping %sd"? It's before the action so we don't want past tense.
>>>
>>> Fixed.
>>>
>>> rob
>>
>> I get a bunch of errors when installing the RPM:
>>
> [...]
>>
>
> This is the SELinux issue you were talking about. Sorry for not catching
> that.
>
> With enforcing off, the patch looks & works well for me. I'm just
> concerned about this change in ipa-upgradeconfig:
>
> @@ -707,7 +754,7 @@ def main():
>           # configuration has changed, restart the name server
>           root_logger.info('Changes to named.conf have been made,
> restart named')
>           bindinstance.BindInstance(fstore).restart()
> -    ca_restart = ca_restart or enable_certificate_renewal(ca) or
> upgrade_ipa_profile(ca, api.env.domain, fqdn)
> +    ca_restart = ca_restart or enable_certificate_renewal(ca) or
> upgrade_ipa_profile(ca, api.env.domain, fqdn) or
> certificate_renewal_stop_ca(ca)
>
> If the enable_certificate_renewal step was done already, but
> upgrade_ipa_profile requests a CA restart, then the short-circuiting
> `or` will be satisfied and certificate_renewal_stop_ca won't be run.
>
> Since each upgrade step has its own checking, I think it would be safer
> to use something like:
>      ca_restart = certificate_renewal_stop_ca(ca) or ca_restart
>
> or even:
> ca_restart = any([
>      ca_restart,
>      enable_certificate_renewal(ca),
>      upgrade_ipa_profile(ca, api.env.domain, fqdn),
>      certificate_renewal_stop_ca(ca),
> ])
>

I like this suggestion very much. Updated patch attached.

rob

-------------- next part --------------
A non-text attachment was scrubbed...
Name: freeipa-rcrit-1079-4-renewal.patch
Type: text/x-patch
Size: 27177 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/freeipa-devel/attachments/20130115/7630e7e4/attachment.bin>


More information about the Freeipa-devel mailing list