[Freeipa-devel] certmonger/oddjob for DNSSEC key maintenance

Alexander Bokovoy abokovoy at redhat.com
Wed Sep 4 13:50:11 UTC 2013


On Wed, 04 Sep 2013, Dmitri Pal wrote:
>On 09/04/2013 09:08 AM, Dmitri Pal wrote:
>> On 09/03/2013 04:01 PM, Simo Sorce wrote:
>>> On Tue, 2013-09-03 at 12:36 -0400, Dmitri Pal wrote:
>>>> On 09/02/2013 09:42 AM, Petr Spacek wrote:
>>>>> On 27.8.2013 23:08, Dmitri Pal wrote:
>>>>>> On 08/27/2013 03:05 PM, Rob Crittenden wrote:
>>>>>>> Dmitri Pal wrote:
>>>>>>>> On 08/09/2013 08:30 AM, Petr Spacek wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I would like to get opinions about key maintenance for DNSSEC.
>>>>>>>>>
>>>>>>>>> Problem summary:
>>>>>>>>> - FreeIPA will support DNSSEC
>>>>>>>>> - DNSSEC deployment requires <2,n> cryptographic keys for each DNS
>>>>>>>>> zone (i.e. objects in LDAP)
>>>>>>>>> - The same keys are shared by all FreeIPA servers
>>>>>>>>> - Keys have a limited lifetime and have to be re-generated on a monthly
>>>>>>>>> basis (as a first approximation; it will be configurable and the
>>>>>>>>> interval will differ for different key types)
>>>>>>>>> - The plan is to store the keys in LDAP and let 'something' (i.e.
>>>>>>>>> certmonger or oddjob?) generate the new keys and store them back into
>>>>>>>>> LDAP
>>>>>>>>> - There are command line tools for key-generation (dnssec-keygen from
>>>>>>>>> the package bind-utils)
>>>>>>>>> - We plan to select one super-master which will handle regular
>>>>>>>>> key-regeneration (i.e. do the same as we do for special CA
>>>>>>>>> certificates)
>>>>>>>>> - Keys stored in LDAP will be encrypted somehow, most probably by
>>>>>>>>> some
>>>>>>>>> symmetric key shared among all IPA DNS servers
>>>>>>>>>
>>>>>>>>> Could certmonger or oddjob do key maintenance for us? I can imagine
>>>>>>>>> something like this:
>>>>>>>>> - watch some attributes in LDAP and wait until some key expires
>>>>>>>>> - run dnssec-keygen utility
>>>>>>>>> - read resulting keys and encrypt them with given 'master key'
>>>>>>>>> - store resulting blobs in LDAP
>>>>>>>>> - wait until another key reaches expiration timestamp
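>>>>>>>>>
>>>>>>>>> Just to make the idea concrete, a very rough sketch of such a loop in
>>>>>>>>> Python (the LDAP attribute names and the encrypt() helper below are
>>>>>>>>> purely illustrative, not an existing schema or API):
>>>>>>>>>
>>>>>>>>>   import subprocess
>>>>>>>>>   import time
>>>>>>>>>   import ldap
>>>>>>>>>
>>>>>>>>>   def maybe_rotate(conn, zone_dn, zone, master_key):
>>>>>>>>>       # illustrative attribute name only -- the real schema is undecided
>>>>>>>>>       entry = conn.search_s(zone_dn, ldap.SCOPE_BASE,
>>>>>>>>>                             attrlist=['idnsSecKeyExpiry'])[0][1]
>>>>>>>>>       if int(entry['idnsSecKeyExpiry'][0]) > time.time():
>>>>>>>>>           return  # key still valid, check again on the next poll
>>>>>>>>>
>>>>>>>>>       # generate a new key pair with the tool from bind-utils
>>>>>>>>>       keyname = subprocess.check_output(
>>>>>>>>>           ['dnssec-keygen', '-a', 'RSASHA256', '-b', '2048', zone]).strip()
>>>>>>>>>       private = open(keyname + '.private').read()
>>>>>>>>>
>>>>>>>>>       # encrypt() stands for whatever we use with the shared symmetric key
>>>>>>>>>       blob = encrypt(private, master_key)
>>>>>>>>>       conn.modify_s(zone_dn, [(ldap.MOD_REPLACE, 'idnsSecKeyBlob', blob)])
>>>>>>>>>
>>>>>>>>> certmonger or oddjob would then just be the thing that schedules and runs
>>>>>>>>> this periodically.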
>>>>>>>>>
>>>>>>>>> It is simplified, because there will be multiple keys with different
>>>>>>>>> lifetimes, but the idea is the same. All the gory details are in the
>>>>>>>>> thread '[Freeipa-devel] DNSSEC support design considerations: key
>>>>>>>>> material handling':
>>>>>>>>> https://www.redhat.com/archives/freeipa-devel/2013-July/msg00129.html
>>>>>>>>> https://www.redhat.com/archives/freeipa-devel/2013-August/msg00086.html
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Nalin and others, what do you think? Is certmonger or oddjob the
>>>>>>>>> right
>>>>>>>>> place to do something like this?
>>>>>>>>>
>>>>>>>>> Thank you for your time!
>>>>>>>>>
>>>>>>>> Was there any discussion of this mail?
>>>>>>>>
>>>>>>> I think at least some of this was covered in another thread, "DNSSEC
>>>>>>> support design considerations: key material handling" at
>>>>>>> https://www.redhat.com/archives/freeipa-devel/2013-August/msg00086.html
>>>>>>>
>>>>>>> rob
>>>>>>>
>>>>>>>
>>>>>> Yes, I have found that thread, though I have not found it to reach a
>>>>>> conclusion and a firm plan.
>>>>>> I will leave it to Petr to summarize the outstanding issues and repost them.
>>>>> All questions stated in the first e-mail in this thread are still open:
>>>>> https://www.redhat.com/archives/freeipa-devel/2013-August/msg00089.html
>>>>>
>>>>> There was no reply to these questions during my vacation, so I don't
>>>>> have much to add at the moment.
>>>>>
>>>>> Nalin, please, could you provide your opinion?
>>>>> How modular/extensible is certmonger?
>>>>> Does it make sense to add DNSSEC key-management to certmonger?
>>>>> What about the CA rotation problem? Can we share some algorithms (e.g. for
>>>>> super-master election) between the CA rotation and DNSSEC key rotation
>>>>> mechanisms?
>>>>>
>>>>>> BTW I like the idea of masters being responsible for generating a subset
>>>>>> of the keys as Loris suggested.
>>>>> E-mail from Loris in archives:
>>>>> https://www.redhat.com/archives/freeipa-devel/2013-August/msg00100.html
>>>>>
>>>>> The idea seems really nice and simple, but I'm afraid that there could
>>>>> be some serious race conditions.
>>>>>
>>>>> - How will it work when the topology changes?
>>>>> - What if the number of masters is greater than the number of days in a
>>>>> month? (=> auto-tune the interval from a month to a smaller time period =>
>>>>> again, what should we do after a topology change?)
>>>>> - What should we do if the topology was changed while a master was
>>>>> disconnected from the rest of the network? (I.e. the WAN link was
>>>>> down at the moment of the change.) What will happen after it re-connects to
>>>>> the topology?
>>>>>
>>>>> Example:
>>>>> Time 0: Masters A, B; topology:  A---B
>>>>> Time 1: Master A has lost the connection to master B
>>>>> Time 2: Master C was added; topology:  A -x- B---C  (A-B link still down)
>>>>> Time 3 (Day 3): A + C did rotation at the same time
>>>>> Time 4: Connection was restored;  topology: A---B---C
>>>>>
>>>>> Now what?
>>>>>
>>>>>
>>>>> I have a feeling that we need something like a quorum protocol for
>>>>> writes (only for sensitive operations like CA cert and DNSSEC key
>>>>> rotations).
>>>>>
>>>>> http://en.wikipedia.org/wiki/Quorum_(distributed_computing)
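>>>>>
>>>>> Just to illustrate what I mean: a sensitive write (CA cert or DNSSEC key
>>>>> rotation) would only proceed when a majority of the known masters is
>>>>> reachable, e.g.:
>>>>>
>>>>>   def have_quorum(reachable_masters, all_masters):
>>>>>       # simple majority: more than half of all known masters must be reachable
>>>>>       return len(reachable_masters) > len(all_masters) // 2
>>>>>
>>>>> With masters A, B, C this means a rotation would go ahead only if the
>>>>> rotating server can currently see at least one other master.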
>>>>>
>>>>>
>>>>> The other question is how we should handle catastrophic situations
>>>>> where more than half of the masters are lost. (Two of three data centres
>>>>> were destroyed by a tornado, etc.)
>>>>>
>>>> It becomes more and more obvious that there is no simple solution that
>>>> we can use out of the box.
>>>> Let us start with a single nominated server. If that server is lost, the
>>>> key rotation responsibility can be moved to some other server manually.
>>>> Not optimal, but at least a first step.
>>>>
>>>> The next step would be to be able to define alternative (failover)
>>>> servers. Here is an example.
>>>> Let us say we have masters A, B, C in the topology A - B - C.
>>>> Master A is responsible for the key rotation; B is the failover.
>>>> The key rotation time would be in some way recorded in the replication
>>>> agreement(s) between A & B.
>>>> If the A <-> B connection is not present at the moment of the scheduled
>>>> rotation, A would skip the rotation and B would start it. If A comes
>>>> back and connects to B (or the connection is just restored), replication
>>>> will update the keys on A. If A is lost, the keys are taken care of by B,
>>>> for itself and for C.
>>>> There will be a short race-condition window, but IMO it can be
>>>> mitigated. If A's clock is behind B's, then if A manages to connect to B it
>>>> would notice that B has already started the rotation. If B's clock is behind
>>>> and A connects to B before B has started the rotation, A still has to perform
>>>> the rotation (a sort of "just made it" case).
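>>>>
>>>> A sketch of the decision each server could make at the scheduled time
>>>> (can_reach() is just a placeholder for "is the replication connection to
>>>> that server up"):
>>>>
>>>>   def should_rotate(me, primary, failover, now, scheduled_time, can_reach):
>>>>       if now < scheduled_time:
>>>>           return False
>>>>       if me == primary:
>>>>           return True                    # A rotates whenever it is up on time
>>>>       if me == failover:
>>>>           return not can_reach(primary)  # B steps in only when A is unreachable
>>>>       return False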
>>>>
>>>> Later if we want more complexity we can define subsets of the keys to
>>>> renew and assign them to different replicas and then define failover
>>>> servers per set.
>>>> But this is all complexity we can add later when we see the real
>>>> problems with the single server approach.
>>> Actually I thought about this for a while, and I think I have an idea
>>> about how to handle this for DNSSEC (it may not apply to other cases like
>>> the CA).
>>>
>>> IIRC keys are generated well in advance of the time they are used, and
>>> old keys and new keys are used side by side for a while, until the old keys
>>> finally expire and only the new keys are around.
>>>
>>> This is regulated by a series of date attributes that determine when
>>> keys come into use, when they expire, and so on.
>>>
>>> Now the idea I have is to add yet another step.
>>>
>>> Assume we have key "generation 1" (G1) in use and we approach the time
>>> generation 1 will expire and generation 2 (G2) is needed; G2 is
>>> created X months in advance and everything is signed with both G1 and G2
>>> for a period.
>>>
>>> Now, in that pre-G2 period, we can have a window of time in which we
>>> let multiple servers try to generate the G2 series, say 1 month in
>>> advance of the time the keys would normally start being used to sign
>>> anything. Only after that 1 month are they actually put into
>>> service.
>>>
>>> How does this help? Well, it helps in that even if multiple servers
>>> generate keys and we have duplicates, they have all the time needed to see
>>> that there are duplicates (because 2 servers raced).
>>> Now, if we keep a subsecond 'creation' timestamp for the new keys, then when
>>> replication goes around all servers can check and use only the set of
>>> keys that was created first, and the servers that created the sets
>>> of keys that lost the race will just remove the duplicates.
>>> Given we have 1 month of time between the creation and the actual time the
>>> keys will be used, we have all the time needed to let servers sort out whether
>>> keys are available or not and prune out the duplicates.
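>>>
>>> So the conflict resolution boils down to "the set with the oldest creation
>>> timestamp wins"; roughly (the names are only illustrative):
>>>
>>>   def resolve_g2(candidate_sets, my_name):
>>>       # candidate_sets: list of (creation_timestamp, creating_server, keys),
>>>       # with subsecond timestamp resolution
>>>       winner = min(candidate_sets, key=lambda s: s[0])   # created first -> wins
>>>       for created, server, keys in candidate_sets:
>>>           if server == my_name and (created, server, keys) != winner:
>>>               delete_key_set(keys)   # placeholder: remove my losing set from LDAP
>>>       return winner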
>>>
>>> A diagram in case I have not been clear enough
>>>
>>>
>>> Assume servers A, B, C; they all randomize (within a week) the time at
>>> which they will attempt to create new keys, if it is time to do so and none
>>> are available already.
>>>
>>> Say the time comes to create G2: A, B, C each roll a die and it turns
>>> out A will do it in 35000 seconds, B will do it in 40000 seconds, and C
>>> in 32000 seconds, so C should do it first and there should be enough
>>> time for the others to see that new keys popped up and just discard
>>> their attempts.
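>>>
>>> (The "dice" can literally be something like
>>>
>>>   import random
>>>   delay = random.randint(0, 7 * 24 * 3600)   # a moment within the one-week window
>>>
>>> computed independently on each server; ties and disconnections are then
>>> sorted out by the timestamp rule above.)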
>>>
>>> However, if A and C are temporarily disconnected from each other, they may
>>> both still end up generating new keys, so we have G2-A and G2-C. Once they
>>> get reconnected and replication flows again, all servers see that instead of
>>> a single G2 set we have 2 G2 sets available:
>>> G2-A created at timestamp X+35000 and G2-C created at timestamp X+32000,
>>> so all servers know they should ignore G2-A, and they all ignore it.
>>> When A comes around to realize this itself, it will just go and delete
>>> the G2-A set. Only the G2-C set is left and that is what will be the final
>>> official G2.
>>>
>>> If we give a week of time for this operation to go on, I think it will be
>>> easy to resolve any race or temporary disconnection that may happen.
>>> Also, because all servers can attempt (within that week) to create keys,
>>> there is no real single point of failure.
>>>
>>> HTH,
>>> please poke holes in my reasoning :)
>>>
>>> Simo.
>>>
>> Reasonable, I just have a couple of comments.
>> If there are many keys and many replicas, the chances are that there
>> will be a lot of load. Generating keys is computationally costly.
>> Replication is costly too.
>> Also, you assume that the topology works fine. I am mostly concerned about
>> the case when some replication is not working and data from one part of
>> the topology is not replicated to another. The concern is that people
>> would not notice that things are not replicating. So if there is a
>> problem and we let all these keys be generated all over the place, it
>> would be pretty hard to untie this knot later.
>>
>> I would actually suggest that if a replica X needs the keys a month
>> from moment A, and the keys have not arrived within the first 3 days after
>> moment A, and this replica is not entitled to generate keys, it should start
>> sending messages to the admin. That way there will be enough time for the
>> admin to sort out what is wrong and nominate another replica to generate the
>> keys if needed. There should be a command as simple as:
>>
>> ipa dnssec-keymanager-set <replica>
>>
>> that would make the mentioned replica the key generator.
>> There can be other commands like
>>
>> ipa dnssec-keymanager-info
>>
>> Appointed server: <server>
>> Keys store: <path>
>> Last time keys generated: <some time>
>> Next time keys need to be generated: <...>
>> ...
>>
>>
>>
>>
>> IMO in this case we need to help the admin see that there is a problem
>> and provide tools to easily mitigate it, rather than trying to solve it
>> ourselves and building a complex algorithm.
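>>
>> The check behind the alert messages suggested above would be trivial,
>> something like (the 3 days grace period as above; the names are illustrative):
>>
>>   import time
>>
>>   GRACE = 3 * 24 * 3600   # the "first 3 days" mentioned above
>>
>>   def should_alert_admin(moment_a, keys_arrived, entitled_to_generate):
>>       return (not keys_arrived
>>               and not entitled_to_generate
>>               and time.time() > moment_a + GRACE)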
>>
>Thinking even more about this.
>Maybe we should start with a command that would be something like:
>
>ipa health
>
>This command would detect the topology, try to connect to all replicas,
>check that they are all up and running and replicating, that nothing is stuck,
>and report any issues.
>The output of the command can be sent somewhere or as a mail to admin.
>
>Then it can be run periodically as part of a cron job on a couple of servers,
>and if there is any problem the admin would know quite soon.
>The admin would then know things like:
>1) The CRL generating server is down/unreachable
>2) The DNSSEC key generating server is down/unreachable
>3) Some CAs are unreachable
>4) The server that rotates certificates is down/unreachable
>5) The server that does AD sync is down/unreachable
>
>There might be other things.
>IMO we have enough single-point-of-failure services already. Adding
>DNSSEC key generation to that set is not a big deal, but a utility like
>this would really go a long way towards making IPA more usable, manageable
>and useful.
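>
>A sketch of what the core of such a check could look like (the helpers are
>placeholders, not an existing API):
>
>  def check_masters(masters, is_reachable, replication_ok):
>      # is_reachable(m): can we contact the server at all?
>      # replication_ok(m): are its replication agreements progressing?
>      problems = []
>      for m in masters:
>          if not is_reachable(m):
>              problems.append('%s is down/unreachable' % m)
>          elif not replication_ok(m):
>              problems.append('replication to/from %s is stuck' % m)
>      return problems
>
>The resulting list could then be printed, logged or mailed to the admin by
>the cron job.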
>
>Should I file an RFE?
The tool you describe above should be able to perform operations on the master.
It is in general better not to put master-specific operations into a
client tool that could be run from an arbitrary host where the IPA admin
tools are installed.

What about plugging the functionality into ipa-advise?

   ipa-advise health-check-{cert|replication|dnssec|...}


-- 
/ Alexander Bokovoy



