[Tendrl-devel] question on storage node GUIDs

Rohan Kanade rkanade at redhat.com
Sat Jun 17 17:51:56 UTC 2017


Here are the steps followed by tendrl-node-agent on startup (pseudo code):

1) machine_id = read_file("/etc/machine-id")
   (get the current node's machine id)

2) last_known_node_id = etcd_read("/indexes/machine_id/$machine_id")
   (get the last Tendrl node_id seen for that machine id)

3) local_node_id = read_file("/var/lib/tendrl/node_id")
   (get the node_id stored locally on the node, if any)

4) if local_node_id is not found and last_known_node_id is None
   (i.e. Tendrl has never seen this node's machine id before):
   generate a new node_id for this node

   if last_known_node_id == local_node_id:
   everything is good, Tendrl recognizes this node as an already managed node
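
In plain Python, that decision logic looks roughly like this (a minimal
sketch only; resolve_node_id and the explicit RuntimeError are my own
illustration, not the actual tendrl-node-agent code):

import uuid

def resolve_node_id(local_node_id, last_known_node_id):
    # local_node_id:      contents of /var/lib/tendrl/node_id, or None if
    #                     the file does not exist
    # last_known_node_id: value of /indexes/machine_id/$machine_id in etcd,
    #                     or None if the key does not exist
    if local_node_id is None and last_known_node_id is None:
        # Tendrl has never seen this machine id: mint a fresh node_id
        return str(uuid.uuid4())
    if local_node_id == last_known_node_id:
        # Already managed node, keep its existing identity
        return local_node_id
    # Any other combination (e.g. local file present but the etcd index
    # missing) is not covered by the steps above
    raise RuntimeError("local node_id and etcd index disagree")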



TLDR: Tendrl remembers the machine id of each node and associates a Tendrl
node_id with it. To unmanage a node from Tendrl, delete
"/var/lib/tendrl/node_id" on the node and delete the
"/indexes/machine_id/$machine_id" key from etcd.



On Mon, Jun 5, 2017 at 7:21 AM, Jeff Applewhite <japplewh at redhat.com> wrote:

> Hi All
>
> I have a test bed where I had 8 nodes. 3 of them were in an existing ceph
> cluster. 1 was in a single node Ceph cluster. I deleted the one standalone
> cluster and rebuilt the node completely to a "fresh centos" install. It is
> no longer registered at all.
>
> That brings me to a total of 7 registered nodes out of the previous 8, and 1
> pre-existing 3-node ceph cluster.
>
> I then deleted the etcd data on the tendrl node, reinstalled etcd,
> recreated the admin user, and restarted all the node agents on the storage
> nodes.
>
> Now, granted, this is less than a "clean install" procedure, but I would have
> hoped to see the node agents re-register with the new etcd/api/tendrl node,
> despite having previously been registered to a tendrl node that was in the
> same IP/Port location. I would also expect to see my pre-existing ceph
> cluster show up for importing - which it did... so I successfully re-imported
> it -- nice!
>
> But -  I really should now expect to see
> [root@tendrl ~]# etcdctl --endpoints http://tendrl:2379 ls /nodes | wc -l
> 7
>
> when in fact I see..
>
> [root@tendrl ~]# etcdctl --endpoints http://tendrl:2379 ls /nodes | wc -l
> 15
>
> or 7 "new nodes" + 8 "old nodes" = 15  -- so in essence all the nodes seem
> to generate a new id within etcd when they register, but their old GUID
> also gets into etcd somehow.
>
> I think this behavior might be problematic for some users. In a large
> cluster it would be nice to be able to flush etcd (or some subset of it),
> restart the node agents and have them re-appear with their old GUID (not a
> new GUID and the old one too).
>
> Can someone explain the current behavior and the rationale? There might be
> good technical reasons for the behavior but I'd like to understand them.
> I'd rather not file a bug until I understand a little better what is going
> on here.
>
> It seems like maybe the Tendrl node needs some way to detect this case:
>
>  "oh - I have a Tendrl storage node agent reporting to me and it seems to
> have a previous identity, maybe I should just re-use that GUID instead of
> generating a new one"
>
> or...
>
> maybe the "old" node agent should be whacked on the head and told
> "no your old GUID is no good around here - use this one instead!"
>
> Either behavior would prevent more GUIDs than nodes in etcd.
>
> Thoughts?
>
>
> Jeff
> _______________________________________________
> Tendrl-devel mailing list
> Tendrl-devel at redhat.com
> https://www.redhat.com/mailman/listinfo/tendrl-devel
>


