[libvirt] [PATCH 1/3] Add a plugin dlm for lock manager

Fri Feb 9 00:42:14 UTC 2018

On 02/08/2018 03:30 AM, Lin Fu wrote:
> I could feel that you want to express "Oh, shit! How terrible the code it
> is." from the reply, although across the screen. Anyway, I'm really glad
> to thanks for your review and suggestions. :-)
> 

No - just trying to be thorough with respect to expectations that you're
probably not aware of as someone who doesn't regularly follow or
contribute to libvirt. If it came off as oh this is crap, then sorry -
it wasn't intended that way. I'd rather have someone be explicit when
they don't like something than make me guess what I did wrong or that
wasn't following some project guidelines.

> On 2018年02月08日 07:55, John Ferlan wrote:
> 
>> That's where I'd expect to find the details of why someone would want to
>> choose dlm over lockd or sanlock.  What goodies are provided? What
>> advantage is there?

Be careful how you snip - loss of context of placement...  This comment
was pointing out you didn't update the docs.

> 
> For those, assuming this scene: one customer has used cluster software
> with DLM in their environment, why not choose DLM plugin instead of
> sanlock/lockd ? Beside that, compared with sanlock/lockd, DLM don't need
> the share storage, and sanlock would bring a little heavyheavily on disk
> IO according to lockd RFC(2011-July/msg00337.html), but DLM only requires
>  network.

I'd choose dlm over other options too, but that's because I have a
pretty good idea of how it works and it's features. The on/off nature of
pthread/mutex locks has always felt like a step backwards for sharing
resources from what I cut my teeth on with the OpenVMS Lock Manager.

Still you need to "sell" why using dlm is better than the existing
sanlock or lockd.

My "vision" of dlm is that resource you're adding that's available
clusterwide in some manner should be able to have something that each
member of the cluster would name the resource in the same manner. What I
don't have is the mental picture of how say /dev/sdm on one host
"appears" on another host.  If it was /dev/sdm, then it'd be really
simple to use that as a resource name and know it's the same resource
between hosts.  When I see the device name being hashed and added as a
resource, then I wonder how different nodes in the cluster will know
it's the same thing.  Although I do recall some code used the Resource
Name field rather judiciously to do some fairly interesting things. When
you see system space (e.g. kernel) address in a resource name, you know
something entertaining is occurring.

> 
> I have sent a RFC in 12/2017 to ack how the opinion is. Daniel said that
> "it can be considered in scope for a libvirt lock manager plugin if you
> want to try writing one"(2017-December/msg00706.html). So I draft this
> patch set which could only make DLM run in libvirt(with configure flag
> `--with-dlm` to compile 'dlm.so' module by force). Hope some suggestion
> bringed from libvirt community to help me improve, if accepted DLM lock
> manager plugin RFC, so I don't update 'docs/locking.html.in'.

So this series was really more of an RFC than a V1...  It's ok, but
indicating RFC sets the expectations of the reviewer as opposed to
posting a V1.

> 
> -------------
> 
> The following is the reply contents, aim to answer why I do that
> for most comments:
> 
> 1) sorry for indent in my code. I coded them in two notebooks,
> both using vim. However one is newer, I have not configured '.vimrc' in
> that one.
> 

Following posting guidelines of running make check syntax-check would
have avoided that. There are some that won't even *look* at code that
fails those basic tests.

> 2)
>>> > +AC_DEFUN([LIBVIRT_CHECK_CPG],[
>>> > +  dnl in some distribution, Version is `UNKNOW` in libcpg.pc
>>
>> UNKNOWN
>> doesn't make it very useful does it?
> 
> In some distribution, for example, openSUSE Tumbleweed:
>     $ cat /usr/lib64/pkgconfig/libcpg.pc
> 
>     prefix=/usr
>     exec_prefix=${prefix}
>     libdir=/usr/lib64
>     includedir=${prefix}/include
> 
>     Name: cpg
>     Version: UNKNOWN
>     Description: cpg
>     Requires:
>     Libs: -L${libdir} -lcpg
>     Cflags: -I${includedir}
> 
> using `LIBVIRT_CHECK_PKG([CPG], [libcpg], [xxx])` directly would make
> build error. So I decide to ignore 'version'. I also don't know why
> that is.
> 

Seems like Tumbleweed needs to have a bug report filed in order to get
the Version properly filled in. If you live with UNKNOWN, then the
problem is how do you really know if the version someone has installed
will have what you want without peeking into the library and looking for
specific routines. I think for some reason we do that with some of the
crypto m4 files, but then again I try to avoid these files.

> 3)
>>> > +#define LOCK_RECORD_FILE_PATH       "/tmp/libvirtd-dlm-file"
>>
>> Is somewhere in /tmp the best/only place to put this? What not similar
>> to lockd/sanlock using /var/lib/libvirt/dlm... (which would seemingly
>> need to be created on installation).
> 
> In my opinion, if one node reboots, DLM cluster would drop locks belong
> that node, most of Linux distribution would clean '/tmp' directory...
> 

And if /tmp is very small by some admin's choice (done for various
reasons), then what happens to your code?  There are mechanisms to store
in XML data that needs to be saved between libvirtd restarts and even
host reboots. It's really not clear where this fits. Perhaps a
/var/run/libvirt type file that would be cleaned up differently.

> 4)
>>> > +
>>> > +#define STATUS             "STATUS"
>>> > +#define RESOURCE_NAME      "RESOURCE_NAME"
>>> > +#define LOCK_ID            "LOCK_ID"
>>> > +#define LOCK_MODE          "LOCK_MODE"
>>> > +#define VM_PID             "VM_PID"
>>> > +
>>> > +#define BUFFERLEN          128
>>
>> Yuck - for a stack variable...
> 
>>> > +    for (i = 0, str = raw, status = 0; ; str = NULL, i++) {
>>> > +        subtoken = strtok_r(str, " \n", &saveptr);
>>> > +        if (subtoken == NULL)
>>> > +            break;
>>> > +
>>> > +        switch(i) {
>>> > +        case 0:
>>
>> These are magic numbers?
> 
> 
> Badly code wants to provided a readable file, like that:
> 
>     STATUS                                                   
> RESOURCE_NAME LOCK_MODE     VM_PID
>          0
> 1501e418dedd122b050c91ec61650ae30d3c217806125d3c2f365989143aa28d   
> EXMODE      21857
> 
> The first line is the form header, the following is the lock
> information.
> 

You could consider XML or JSON format...

> 5)
>>> > +    char    *name;
>>> > +    uint32_t mode;
>>> > +    uint32_t lkid;
>>
>> I would think uint32_t would be limiting...  Why not just unsigned int?
> 
> view in '/usr/include/libdlm.h', it use `uint32_t` instead of
> `unsigned int`.
> 

It's more of a I'm used to seeing unsigned int rather than uint32_t
which is more limiting.

> 6)
>>> > +    virMutexLock(&(lockListWait.listMutex));
>>> > +    virListAddTail(&lock->entry, &(lockListWait.list));
>>> > +    virMutexUnlock(&(lockListWait.listMutex));
>>
>> What is this list for?  I think this also is completely separable.  I
>> wouldn't use a list either... 10s of domains probably works fine.  100s
>> of domains perhaps not as well 1000s of domains starts being a
>> bottleneck for sure.
> 
> Hashtable is a good idea, and I wanted to use it but I had never read
> it's code... Chosing list because that there won't be too much vms in
> one host, is it offten common that there is more than 50 vms in normal
> usage?
> 

Lots of recent examples using virHashCreate, virHashAddEntry,
virHashRemoveEntry, etc. A hash table can/will resize as necessary.

Assuming usage is a dangerous thing. You're not just talking domains,
you're factoring in shared storage between domains and leases into your
resource counts.

Recent examples have been assuming no one would ever have more than
(say) 200 disks defined for a domain for which we want to obtain domain
statistics. Well it happened and we chased a way to handle that given
the limitation of size of RPC messages. More recently some block device
with more than 200 backing stores generated a very large pile of data
where there was lots of duplication.

> 7)
>>> > +    if (safewrite(fd, buffer, strlen(buffer)) != strlen(buffer)) {
>>> > +        virReportSystemError(errno,
>>> > +                             _("unable to write lock information
> '%s' to file '%s'"),
>>> > +                             buffer,
> NULLSTR(driver->lockRecordFilePath));
>>> > +        return;
>>> > +    }
>>
>> Not sure I understand the reason of the file
> 
> Compared with sanlock and lockd, this implement of DLM has no daemon,
> some area is needed to record lock information in order to resume lock
> after libvirtd reboot. I chose file.
> 

Hmm. you're using virtlockd and creating lock_driver_dlm, but then
virtlockd isn't running?  I really didn't dig too deep and think that
much about how this integrates at all.

> 8)
>>> > +     *   found with another mode, and -ENOENT if no orphan.
>>> > +     *
>>> > +     *   cast/bast/param are (void *)1 because the kernel
>>> > +     *   returns errors if some are null.
>>> > +     */
>>> > +
>>> > +    status = dlm_ls_lockx(lockspace, mode, &lksb,
> LKF_PERSISTENT|LKF_ORPHAN,
>>> > +                          name, strlen(name), 0,
>>> > +                          (void *)1, (void *)1, (void *)1,
>>
>> ^^ These would seem to relate to the same parametes used for
>> dlm_lock[_wait].  They are actually quite useful when it comes to lock
>> conversions.  You may want to considering investigating that...
>>
>>
>> Still seems like a "bug" to me to need to pass 1 as a callback function
>> or a blocking callback function.
>>
>>> > +                          NULL, NULL);
>>
>> Why not dlm_is_lock() instead since you're not using xid and timeout
> 
> The code is from lvm2, I try to use `dlm_lock_wait`, however it failed,
> I don't know why. `(void *)1` is the same reason. Maybe a bug from
> libdlm or 'dlm.ko'.
> 

Perhaps more time spent understanding dlm interactions will help. Just
copying an existing implementation without understanding what the
interactions are may not fit into the model you're trying to create. I
honestly don't have the cycles to learn all that dlm has to offer right
now. But I do think you do need to spend more time understanding how
libvirt works today and how dlm will make things better and then how
exactly to accomplish that in an understandable manner.  My biggest
concern is someone would add a feature like this and then not continue
working with/on libvirt. The dlm concepts are a departure from existing
locking models. If you want to get known in the libvirt community then
pick some of the simple tasks from the bite sized tasks page:

https://wiki.libvirt.org/page/BiteSizedTasks

that still need addressing...

> 9)
>>> > +
>>> > +    /* create thread to receive notification from kernel */
>>
>> notification for what? what gets called?
> 
> This is implemented by libdlm, the result is writted in `lksb`.
> 
>        int dlm_lock_wait(uint32_t mode,
>                  struct dlm_lksb *lksb,
>                  uint32_t flags,
>                  const void *name,
>                  unsigned int namelen,
>                  uint32_t parent,         /* unused */
>                  void *bastarg,
>                  void (*bastaddr) (void *bastarg),
>                  void *range);            /* unused */
> 

Again - a bit of context... This was a dlm_ls_pthread_init call made in
virLockManagerDlmSetup. So something that's called once creates a thread
to do what? I didn't research it, but it's something you have to know
and understand so that if someone asks you can have the answer.

An lksb is a "lock status block", but that's on a lock by lock basis.
IIRC, those are written when lock conversions occur. If you used the
"dlm_lock" API, then used the completion routines, you can then check
the lksb for the status of the attempted operation. I'm very rusty in my
recollection and my description is more my understanding of it.

> 10)
>>> > +                    priv->hasRWDisks = true;
>>> > +                    /* ignore disk resource without error */
>>> > +                    return 0;
>>> > +                }
>>> > +            }
>>> > +
>>> > +            if (virCryptoHashString(VIR_CRYPTO_HASH_SHA256, name,
> &newName) < 0)
>>
>>> Interesting _lockd uses SHA256 and _sanlock uses MD5...
>>
>>
>>> > +        /* we need format the lock information, so the lock name
> must be the constant length */
>>> > +        if (virCryptoHashString(VIR_CRYPTO_HASH_SHA256, name,
> &newName) < 0)
>>> > +            goto cleanup;
>>
>> Neither _lockd nor _sandisk do any sort of Hash on this name.
>>
>>
>>> > +            memset(&lksb, 0, sizeof(lksb));
>>> > +            rv = dlm_ls_lock_wait(lockspace, priv->resources[i].mode,
>>> > +                                  &lksb, LKF_NOQUEUE|LKF_PERSISTENT,
>>> > +                                  priv->resources[i].name,
> strlen(priv->resources[i].name),
>>
>> This is where I wonder why we cannot pass the path to the disk to dlm
>> instead of the hash
> 
> It's interesting here. Originally I use MD5 instead of SHA256, but I
> found that:
> 
>     typedef enum {
>         VIR_CRYPTO_HASH_MD5, /* Don't use this except for historic compat */
>         VIR_CRYPTO_HASH_SHA256,
> 
>         VIR_CRYPTO_HASH_LAST
>     } virCryptoHash;
> 
> Futhermore, I find the bug patch related sanlock name: sanlock's name
> length is limited 48, there is always someone use more than 48 characters
> (see https://bugzilla.redhat.com/show_bug.cgi?id=735443). The same
> situation, why not use the constant length? DLM limits name is 64 bytes
> to avoid some potential bugs. Besides, it's convenient to format write
> it to file.
> 

I'm still wondering why a hash is being used.  That's an implementation
detail which I haven't yet come to grips with given what I know from my
history with OpenVMS (think about that /dev/sdm example above).

In OpenVMS "disks" that were resources in a cluster shared with some SAN
within the cluster connectivity fabric.  Each member of the cluster had
an ID (I think you have that with the CPG above) and shared resources
would be named with that ID - it's been over 15 years, so too long to
remember the details.

Still point is, for a dlm environment to be effective device naming
between cluster members is going to be important.

> 11)
>>> > +     * 'ensure sanlock socket is labelled with the VM process label',
>>> > +     * however, fixing sanlock socket security labelling remove
> related
>>> > +     * code. Now, `fd` parameter is useless.
>>> > +     */
>>> > +    if (fd)
>>> > +        *fd = -1;
>>
>> But you just set it to -1??!
> 
> Maybe I can submit a patch to cut those code. Afterall it's useless now.
> 

Nothing's been accepted you'd just remove this from future patches.

> 12)
>>> > +    virCheckFlags(0, -1);
>>> > +
>>> > +    if (state)
>>> > +        *state = NULL;
>>
>> Why would this return NULL?  My recollection is there is some sort of
>> inquiry API for dlm.
> 
> Inquiry API is disappeared now... Only lock/unlock refered to libdlm
> source code and reference book.
>  http://people.redhat.com/ccaulfie/docs/rhdlmbook.pdf
>  Chapter 4.3. Lock Query Operations
> 

You need to think in terms of what libvirt consumers could use or need
and perhaps not specifically something that dlm provides.
> -- 
> Regards
> Lin Fu(river)
> 

Don't forget to update your local libvirt.git repo in order to add the
pertinent information so that the next series you post will have it.

Sorry if there's an incomplete thoughts above - I've been distracted
today and this evening - it's dinner time now.

John