[libvirt] [PATCH v3 3/4] qemu_domain: NVLink2 device tree functions for PPC64

Daniel Henrique Barboza danielhb413 at gmail.com
Tue Mar 12 09:39:44 UTC 2019



On 3/12/19 1:05 AM, Alexey Kardashevskiy wrote:
>
> On 12/03/2019 09:15, Daniel Henrique Barboza wrote:
>>
>> On 3/7/19 10:29 PM, Alexey Kardashevskiy wrote:
>>> On 08/03/2019 04:51, Piotr Jaroszynski wrote:
>>>> On 3/7/19 7:39 AM, Daniel Henrique Barboza wrote:
>>>>> On 3/7/19 12:15 PM, Erik Skultety wrote:
>>>>>> On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza
>>>>>> wrote:
>>>>>>> The NVLink2 support in QEMU implements the detection of NVLink2
>>>>>>> capable devices by verifying the attributes of the VFIO mem region
>>>>>>> QEMU allocates for the NVIDIA GPUs. To properly allocate an
>>>>>>> adequate amount of memLock, Libvirt needs this information before
>>>>>>> a QEMU instance is even created.
>>>>>>>
>>>>>>> An alternative is presented in this patch. Given a PCI device,
>>>>>>> we'll traverse the device tree at /proc/device-tree to check if
>>>>>>> the device has a NPU bridge, retrieve the node of the NVLink2 bus,
>>>>>>> find the memory-node that is related to the bus and see if it's a
>>>>>>> NVLink2 bus by inspecting its 'reg' value. This logic is contained
>>>>>>> inside the 'device_is_nvlink2_capable' function, which uses other
>>>>>>> new helper functions to navigate and fetch values from the device
>>>>>>> tree nodes.
>>>>>>>
>>>>>>> Signed-off-by: Daniel Henrique Barboza <danielhb413 at gmail.com>
>>>>>>> ---
>>>>>> This fails with a bunch of compilation errors, please make sure you
>>>>>> run make
>>>>>> check and make syntax-check on each patch of the series.
>>>>> Ooops, forgot to follow up make syntax-check with 'make' before
>>>>> submitting ... my bad.
>>>>>
>>>>>>>     src/qemu/qemu_domain.c | 194
>>>>>>> +++++++++++++++++++++++++++++++++++++++++
>>>>>>>     1 file changed, 194 insertions(+)
>>>>>>>
>>>>>>> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
>>>>>>> index 77548c224c..97de5793e2 100644
>>>>>>> --- a/src/qemu/qemu_domain.c
>>>>>>> +++ b/src/qemu/qemu_domain.c
>>>>>>> @@ -10343,6 +10343,200 @@
>>>>>>> qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
>>>>>>>     }
>>>>>>>
>>>>>>>
>>>>>>> +/**
>>>>>>> + * Reads a phandle file and returns the phandle value.
>>>>>>> + */
>>>>>>> +static int
>>>>>>> +read_dt_phandle(const char* file)
>>>>>> ..we prefer camelCase naming of functions...all functions should
>>>>>> have a prefix,
>>>>>> e.g. vir,<driver>, etc.
>>>>>>
>>>>>>> +{
>>>>>>> +    unsigned int buf[1];
>>>>>>> +    size_t read;
>>>>>>> +    FILE *f;
>>>>>>> +
>>>>>>> +    f = fopen(file, "r");
>>>>>>> +    if (!f)
>>>>>>> +        return -1;
>>>>>>> +
>>>>>>> +    read = fread(buf, sizeof(unsigned int), 1, f);
>>>>>> We already have safe helpers that take care of such operations.
>>>>>>
>>>>>>> +
>>>>>>> +    if (!read) {
>>>>>>> +        VIR_CLOSE(f);
>>>>>> You could use VIR_AUTOCLOSE for this
>>>>>>
>>>>>>> +        return 0;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    VIR_CLOSE(f);
>>>>>>> +    return be32toh(buf[0]);
>>>>>> Is this part of gnulib? We need to make sure it's portable
>>>>>> otherwise we'd need
>>>>>> to make the conversion ourselves, just like for le64toh
>>>>>>
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * Reads a memory reg file and returns the first 4 int values.
>>>>>>> + *
>>>>>>> + * The caller is responsible for freeing the returned array.
>>>>>>> + */
>>>>>>> +static unsigned int *
>>>>>>> +read_dt_memory_reg(const char *file)
>>>>>>> +{
>>>>>>> +    unsigned int *buf;
>>>>>>> +    size_t read, i;
>>>>>>> +    FILE *f;
>>>>>>> +
>>>>>>> +    f = fopen(file, "r");
>>>>>>> +    if (!f)
>>>>>>> +        return NULL;
>>>>>>> +
>>>>>>> +    if (VIR_ALLOC_N(buf, 4) < 0)
>>>>>>> +        return NULL;
>>>>>>> +
>>>>>>> +    read = fread(buf, sizeof(unsigned int), 4, f);
>>>>>>> +
>>>>>>> +    if (!read && read < 4)
>>>>>>> +        /* shouldn't happen */
>>>>>>> +        VIR_FREE(buf);
>>>>>>> +    else for (i = 0; i < 4; i++)
>>>>>>> +        buf[i] = be32toh(buf[i]);
>>>>>>> +
>>>>>>> +    VIR_CLOSE(f);
>>>>>>> +    return buf;
>>>>>> Same comments as above...
>>>>>>
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * This wrapper function receives arguments to be used in a
>>>>>>> + * 'find' call to retrieve the file names that match
>>>>>>> + * the criteria inside the /proc/device-tree dir.
>>>>>>> + *
>>>>>>> + * A 'find' call with '-iname phandle' inside /proc/device-tree
>>>>>>> + * provides more than a thousand matches. Adding '-path' to
>>>>>>> + * narrow it down further is necessary to keep the file
>>>>>>> + * listing sane.
>>>>>>> + *
>>>>>>> + * The caller is responsible for freeing the buffer returned by
>>>>>>> + * this function.
>>>>>>> + */
>>>>>>> +static char *
>>>>>>> +retrieve_dt_files_pattern(const char *path_pattern, const char
>>>>>>> *file_pattern)
>>>>>>> +{
>>>>>>> +    virCommandPtr cmd = NULL;
>>>>>>> +    char *output = NULL;
>>>>>>> +
>>>>>>> +    cmd = virCommandNew("find");
>>>>>>> +    virCommandAddArgList(cmd, "/proc/device-tree/", "-path",
>>>>>>> path_pattern,
>>>>>>> +                         "-iname", file_pattern, NULL);
>>>>>>> +    virCommandSetOutputBuffer(cmd, &output);
>>>>>>> +
>>>>>>> +    if (virCommandRun(cmd, NULL) < 0)
>>>>>>> +        VIR_FREE(output);
>>>>>>> +
>>>>>>> +    virCommandFree(cmd);
>>>>>>> +    return output;
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * Helper function that receives a listing of file names and
>>>>>>> + * calls read_dt_phandle() on each one, looking for a match
>>>>>>> + * with the given phandle argument. Returns the file name if a
>>>>>>> + * match is found, NULL otherwise.
>>>>>>> + */
>>>>>>> +static char *
>>>>>>> +find_dt_file_with_phandle(char *files, int phandle)
>>>>>>> +{
>>>>>>> +    char *line, *tmp;
>>>>>>> +    int ret;
>>>>>>> +
>>>>>>> +    line = strtok_r(files, "\n", &tmp);
>>>>>>> +    do {
>>>>>>> +       ret = read_dt_phandle(line);
>>>>>>> +       if (ret == phandle)
>>>>>>> +           break;
>>>>>>> +    } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL);
>>>>>>> +
>>>>>>> +    return line;
>>>>>>> +}
>>>>>>> +
>>>>>>> +
>>>>>>> +/**
>>>>>>> + * This function receives a string that represents a PCI device,
>>>>>>> + * such as '0004:04:00.0', and tells if the device is NVLink2
>>>>>>> capable.
>>>>>>> + *
>>>>>>> + * The logic goes as follows:
>>>>>>> + *
>>>>>>> + * 1 - get the phandle of a nvlink of the device, reading the
>>>>>>> 'ibm,npu'
>>>>>>> + * attribute;
>>>>>>> + * 2 - find the device tree node of the nvlink bus using the phandle
>>>>>>> + * found in (1)
>>>>>>> + * 3 - get the phandle of the memory region of the nvlink bus
>>>>>>> + * 4 - find the device tree node of the memory region using the
>>>>>>> + * phandle found in (3)
>>>>>>> + * 5 - read the 'reg' value of the memory region. If the value of
>>>>>>> + * the second 64 bit value is 0x02 0x00, the device is attached
>>>>>>> + * to a NVLink2 bus.
>>>>>>> + *
>>>>>>> + * If any of these steps fails, the function returns false.
>>>>>>> + */
>>>>>>> +static bool
>>>>>>> +device_is_nvlink2_capable(const char *device)
>>>>>>> +{
>>>>>>> +    char *file, *files, *tmp;
>>>>>>> +    unsigned int *reg;
>>>>>>> +    int phandle;
>>>>>>> +
>>>>>>> +    if ((virAsprintf(&file,
>>>>>>> "/sys/bus/pci/devices/%s/of_node/ibm,npu",
>>>>>>> +                    device)) < 0)
>>>>>>> +        return false;
>>>>>>> +
>>>>>>> +    /* Find phandles of nvlinks:  */
>>>>>>> +    if ((phandle = read_dt_phandle(file)) == -1)
>>>>>>> +        return false;
>>>>>>> +
>>>>>>> +    /* Find a DT node for the phandle found */
>>>>>>> +    files = retrieve_dt_files_pattern("*device-tree/pci*",
>>>>>>> "phandle");
>>>>>>> +    if (!files)
>>>>>>> +        return false;
>>>>>>> +
>>>>>>> +    if ((file = find_dt_file_with_phandle(files, phandle)) == NULL)
>>>>>>> +        goto fail;
>>>>>>> +
>>>>>>> +    /* Find a phandle of the GPU memory region of the device. The
>>>>>>> +     * file found above ends with '/phandle' - the memory region
>>>>>>> +     * of the GPU ends with '/memory-region' */
>>>>>>> +    tmp = strrchr(file, '/');
>>>>>>> +    *tmp = '\0';
>>>>>>> +    file = strcat(file, "/memory-region");
>>>>>>> +
>>>>>>> +    if ((phandle = read_dt_phandle(file)) == -1)
>>>>>>> +        goto fail;
>>>>>>> +
>>>>>>> +    file = NULL;
>>>>>>> +    VIR_FREE(files);
>>>>>>> +
>>>>>>> +    /* Find the memory node for the phandle found above */
>>>>>>> +    files = retrieve_dt_files_pattern("*device-tree/memory*",
>>>>>>> "phandle");
>>>>>>> +    if (!files)
>>>>>>> +        return false;
>>>>>>> +
>>>>>>> +    if ((file = find_dt_file_with_phandle(files, phandle)) == NULL)
>>>>>>> +        goto fail;
>>>>>>> +
>>>>>>> +    /* And see its size in the second 64bit value of 'reg'. First,
>>>>>>> +     * the end of the file needs to be changed from '/phandle' to
>>>>>>> +     * '/reg' */
>>>>>>> +    tmp = strrchr(file, '/');
>>>>>>> +    *tmp = '\0';
>>>>>>> +    file = strcat(file, "/reg");
>>>>>>> +
>>>>>>> +    reg = read_dt_memory_reg(file);
>>>>>>> +    if (reg && reg[2] == 0x20 && reg[3] == 0x00)
>>>>>>> +        return true;
>>>>>> This function is very complex just to find out whether a PCI device
>>>>>> is capable
>>>>>> of NVLink or not. Feels wrong to do it this way, I believe it would
>>>>>> be much
>>>>>> easier if NVIDIA exposed a sysfs attribute saying whether a PCI
>>>>>> device supports
>>>>>> NVLink so that our node-device driver would take care of this
>>>>>> during libvirtd
>>>>>> startup and then you'd only call a single API from the PPC64 helper
>>>>>> you're
>>>>>> introducing in the next patch to find out whether you need the
>>>>>> alternative
>>>>>> formula or not.
>>>>>>
>>>>>> Honestly, apart from the coding style issues, I don't like this
>>>>>> approach and
>>>>>> unless there's a really compelling reason for libvirt to do it in a
>>>>>> way which
>>>>>> involves spawning a 'find' process because of a complex pattern and
>>>>>> a bunch of
>>>>>> data necessary to filter out, I'd really suggest contacting NVIDIA
>>>>>> about this.
>>>>>> It's the same for mdevs, NVIDIA exposes a bunch of attributes in
>>>>>> sysfs which
>>>>>> we're able to read.
>>>>> I'll contact NVIDIA and see if there is an easier way (a sysfs
>>>>> attribute, for
>>>>> example) and, if there isn't, try to provide a plausible reason to
>>>>> justify this
>>>>> detection code.
>>>> Sorry for the delay in responding. The problem is that all the V100 GPUs
>>>> support NVLink, but it may or may not be connected up. This is detected
>>>> at runtime during GPU initialization, which seems like much too heavy of
>>>> an operation to perform as part of passthrough initialization. And
>>>> that's
>>>> why vfio-pci pieces rely on device tree information to figure it out.
>>>>
>>>> Alexey, would it be possible for vfio-pci to export the information in a
>>>> way more friendly to libvirt?
>>> The only information needed here is whether a specific GPU has RAM or
>>> not. This can easily be found from the device-tree, imho quite friendly
>>> already. VFIO gets to know about these new capabilities when the VFIO
>>> PCI device is opened, and we'd rather avoid going that far in
>>> libvirt (open a VFIO container, attach a group, get a vfio-pci fd from
>>> it, enumerate regions - 2 PCI resets on the way, delays, meh).
>>>
>>> btw the first "find" for "ibm,npu" can be skipped - NVLinks have to be
>>> passed too or the entire RAM thing won't work. "find" for the memory
>>> node can also be dropped really - if NVLink bridge OF node has
>>> "memory-region", then VFIO will most likely expose RAM and QEMU will try
>>> using it anyway.
>>
>> Hmmm, I am not sure I understood it completely... looking at the
>> of_node entries exposed in sysfs for one of the NVLink buses, we have:
>>
>>
>> /sys/bus/pci/devices/0007:00:00.0/of_node$ ls
>> class-code           ibm,gpu       ibm,nvlink-speed           memory-region  reg
>> device-id            ibm,loc-code  ibm,pci-config-space-type  name           revision-id
>> ibm,device-tgt-addr  ibm,nvlink    interrupts                 phandle        vendor-id
>>
>>
>> We can make a safe assumption that the V100 GPU will always be passed
>> through with at least one NVLink2 bus.
> Assumption #1: if we want GPU RAM to be available in the guest, an
> NVLink bridge must be passed through. No bridge - no RAM - no rlimit
> adjustment.
>
>
>> How many hops do we need to assert
>> that a given device is a NVLink2 bus from its of_node info?
>>
>> For example, can we say something like:
>>
>> "The device node of this device has ibm,gpu and ibm,nvlink and
>> ibm,nvlink-speed
>> and ibm,nvlink-speed is 0x8, so this is a NVLink2 bus. Since it is not
>
> ibm,nvlink-speed can be different (I saw 9), you cannot rely on this.
>
>
>> possible to pass
>> through the bus alone, there is a V100 GPU in this same IOMMU group. So
>> this is
>> a NVLink2 passthrough scenario"
>>
>>
>> Or perhaps:
>>
>> "It has ibm,gpu and ibm,nvlink and the of_node of its memory-region has
>> a reg
>> value of 0x20 , thus this is a nvlink2 bus and ....."
>
> 0x20 is the size of the GPU RAM window, which might change in different
> revisions of the hw as well.
>
>> Both alternatives are way simpler than what I'm doing in this patch. I
>> am not sure
>> if they are valid though.
>
> Assumption #2: if the nvlink2 bridge's DT node has memory-region, then
> the GPU will try to use its RAM. The adjustment you'll have to make
> won't depend on anything else, as the existing GPU RAM placement means
> whatever you can possibly throw at QEMU will fit in the 128TiB window,
> and we cannot make the window smaller anyway (the next size down would be
> the lower power of two - 64TiB - and GPU RAM lives just above that).

Ok, so, from the of_node of an unknown device that has been
passed through to a VM, if the DT node has:

ibm,gpu
ibm,nvlink
memory-region


Is that enough to justify the rlimit adjustment for NVLink2?
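
If it is, the whole detection could collapse into a handful of existence
checks in sysfs. A rough, untested sketch of what I have in mind (the
function name and the plain access()/snprintf() calls are illustrative
only - the real code would use the usual libvirt helpers such as
virFileExists() and virAsprintf()):

#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Untested sketch: assumes that the presence of these three of_node
 * entries is enough to identify an NVLink2-capable passthrough device. */
static bool
ppc64DeviceHasNVLink2Bridge(const char *pciAddr)
{
    const char *attrs[] = { "ibm,gpu", "ibm,nvlink", "memory-region" };
    char path[PATH_MAX];
    size_t i;

    for (i = 0; i < sizeof(attrs) / sizeof(attrs[0]); i++) {
        snprintf(path, sizeof(path),
                 "/sys/bus/pci/devices/%s/of_node/%s",
                 pciAddr, attrs[i]);

        /* a missing attribute means this is not an NVLink2 bridge */
        if (access(path, F_OK) != 0)
            return false;
    }

    return true;
}

That would drop both 'find' calls and all of the phandle chasing from
this patch.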


>
> I could do better of course and allow adjusting the limit to cover, let's
> say, 66TiB instead of 128TiB; it just seems to be an unnecessary
> complication (we already allocate only the required amount of TCE entries
> on demand).
>
>
>
>>>>>> Thinking about it a bit more, since this is NVIDIA-specific, having an
>>>>>> NVIDIA-only sysfs attribute doesn't help the node-device driver I
>>>>>> mentioned
>>>>>> above. In general, we try to avoid introducing vendor-specific
>>>>>> code, so nodedev
>>>>>> might not be the right place to check for the attribute (because
>>>>>> it's not
>>>>>> universal and would require vendor-specific elements), but I think
>>>>>> we could
>>>>>> have an NVLink helper reading this sysfs attribute(s) elsewhere in
>>>>>> libvirt
>>>>>> codebase.
>>>>>>
>>>>>> Erik
>
>



