[libvirt] [PATCH v3 3/4] qemu_domain: NVLink2 device tree functions for PPC64

Daniel Henrique Barboza danielhb413 at gmail.com
Thu Mar 7 15:39:16 UTC 2019



On 3/7/19 12:15 PM, Erik Skultety wrote:
> On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza wrote:
>> The NVLink2 support in QEMU implements the detection of NVLink2
>> capable devices by verfying the attributes of the VFIO mem region
>> QEMU allocates for the NVIDIA GPUs. To properly allocate an
>> adequate amount of memLock, Libvirt needs this information before
>> a QEMU instance is even created.
>>
>> An alternative is presented in this patch. Given a PCI device,
>> we'll traverse the device tree at /proc/device-tree to check if
>> the device has a NPU bridge, retrieve the node of the NVLink2 bus,
>> find the memory-node that is related to the bus and see if it's a
>> NVLink2 bus by inspecting its 'reg' value. This logic is contained
>> inside the 'device_is_nvlink2_capable' function, which uses other
>> new helper functions to navigate and fetch values from the device
>> tree nodes.
>>
>> Signed-off-by: Daniel Henrique Barboza <danielhb413 at gmail.com>
>> ---
> This fails with a bunch of compilation errors, please make sure you run make
> check and make syntax-check on each patch of the series.

Ooops, forgot to follow up make syntax-check with 'make' before
submitting ... my bad.

>
>>   src/qemu/qemu_domain.c | 194 +++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 194 insertions(+)
>>
>> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
>> index 77548c224c..97de5793e2 100644
>> --- a/src/qemu/qemu_domain.c
>> +++ b/src/qemu/qemu_domain.c
>> @@ -10343,6 +10343,200 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
>>   }
>>
>>
>> +/**
>> + * Reads a phandle file and returns the phandle value.
>> + */
>> +static int
>> +read_dt_phandle(const char* file)
> ..we prefer camelCase naming of functions...all functions should have a prefix,
> e.g. vir,<driver>, etc.
>
>> +{
>> +    unsigned int buf[1];
>> +    size_t read;
>> +    FILE *f;
>> +
>> +    f = fopen(file, "r");
>> +    if (!f)
>> +        return -1;
>> +
>> +    read = fread(buf, sizeof(unsigned int), 1, f);
> We already have safe helpers that take care of such operations.
>
>> +
>> +    if (!read) {
>> +        VIR_CLOSE(f);
> You could use VIR_AUTOCLOSE for this
>
>> +        return 0;
>> +    }
>> +
>> +    VIR_CLOSE(f);
>> +    return be32toh(buf[0]);
> Is this part of gnulib? We need to make sure it's portable otherwise we'd need
> to make the conversion ourselves, just like for le64toh
>
>> +}
>> +
>> +
>> +/**
>> + * Reads a memory reg file and returns the first 4 int values.
>> + *
>> + * The caller is responsible for freeing the returned array.
>> + */
>> +static unsigned int *
>> +read_dt_memory_reg(const char *file)
>> +{
>> +    unsigned int *buf;
>> +    size_t read, i;
>> +    FILE *f;
>> +
>> +    f = fopen(file, "r");
>> +    if (!f)
>> +        return NULL;
>> +
>> +    if (VIR_ALLOC_N(buf, 4) < 0)
>> +        return NULL;
>> +
>> +    read = fread(buf, sizeof(unsigned int), 4, f);
>> +
>> +    if (!read && read < 4)
>> +        /* shouldn't happen */
>> +        VIR_FREE(buf);
>> +    else for (i = 0; i < 4; i++)
>> +        buf[i] = be32toh(buf[i]);
>> +
>> +    VIR_CLOSE(f);
>> +    return buf;
> Same comments as above...
>
>> +}
>> +
>> +
>> +/**
>> + * This wrapper function receives arguments to be used in a
>> + * 'find' call to retrieve the file names that matches
>> + * the criteria inside the /proc/device-tree dir.
>> + *
>> + * A 'find' call with '-iname phandle' inside /proc/device-tree
>> + * provides more than a thousand matches. Adding '-path' to
>> + * narrow it down further is necessary to keep the file
>> + * listing sane.
>> + *
>> + * The caller is responsible to free the buffer returned by
>> + * this function.
>> + */
>> +static char *
>> +retrieve_dt_files_pattern(const char *path_pattern, const char *file_pattern)
>> +{
>> +    virCommandPtr cmd = NULL;
>> +    char *output = NULL;
>> +
>> +    cmd = virCommandNew("find");
>> +    virCommandAddArgList(cmd, "/proc/device-tree/", "-path", path_pattern,
>> +                         "-iname", file_pattern, NULL);
>> +    virCommandSetOutputBuffer(cmd, &output);
>> +
>> +    if (virCommandRun(cmd, NULL) < 0)
>> +        VIR_FREE(output);
>> +
>> +    virCommandFree(cmd);
>> +    return output;
>> +}
>> +
>> +
>> +/**
>> + * Helper function that receives a listing of file names and
>> + * calls read_dt_phandle() on each one finding for a match
>> + * with the given phandle argument. Returns the file name if a
>> + * match is found, NULL otherwise.
>> + */
>> +static char *
>> +find_dt_file_with_phandle(char *files, int phandle)
>> +{
>> +    char *line, *tmp;
>> +    int ret;
>> +
>> +    line = strtok_r(files, "\n", &tmp);
>> +    do {
>> +       ret = read_dt_phandle(line);
>> +       if (ret == phandle)
>> +           break;
>> +    } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL);
>> +
>> +    return line;
>> +}
>> +
>> +
>> +/**
>> + * This function receives a string that represents a PCI device,
>> + * such as '0004:04:00.0', and tells if the device is NVLink2 capable.
>> + *
>> + * The logic goes as follows:
>> + *
>> + * 1 - get the phandle of a nvlink of the device, reading the 'ibm,npu'
>> + * attribute;
>> + * 2 - find the device tree node of the nvlink bus using the phandle
>> + * found in (1)
>> + * 3 - get the phandle of the memory region of the nvlink bus
>> + * 4 - find the device tree node of the memory region using the
>> + * phandle found in (3)
>> + * 5 - read the 'reg' value of the memory region. If the value of
>> + * the second 64 bit value is 0x02 0x00, the device is attached
>> + * to a NVLink2 bus.
>> + *
>> + * If any of these steps fails, the function returns false.
>> + */
>> +static bool
>> +device_is_nvlink2_capable(const char *device)
>> +{
>> +    char *file, *files, *tmp;
>> +    unsigned int *reg;
>> +    int phandle;
>> +
>> +    if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/ibm,npu",
>> +                    device)) < 0)
>> +        return false;
>> +
>> +    /* Find phandles of nvlinks:  */
>> +    if ((phandle = read_dt_phandle(file)) == -1)
>> +        return false;
>> +
>> +    /* Find a DT node for the phandle found */
>> +    files = retrieve_dt_files_pattern("*device-tree/pci*", "phandle");
>> +    if (!files)
>> +        return false;
>> +
>> +    if ((file = find_dt_file_with_phandle(files, phandle)) == NULL)
>> +        goto fail;
>> +
>> +    /* Find a phandle of the GPU memory region of the device. The
>> +     * file found above ends with '/phandle' - the memory region
>> +     * of the GPU ends with '/memory-region */
>> +    tmp = strrchr(file, '/');
>> +    *tmp = '\0';
>> +    file = strcat(file, "/memory-region");
>> +
>> +    if ((phandle = read_dt_phandle(file)) == -1)
>> +        goto fail;
>> +
>> +    file = NULL;
>> +    VIR_FREE(files);
>> +
>> +    /* Find the memory node for the phandle found above */
>> +    files = retrieve_dt_files_pattern("*device-tree/memory*", "phandle");
>> +    if (!files)
>> +        return false;
>> +
>> +    if ((file = find_dt_file_with_phandle(files, phandle)) == NULL)
>> +        goto fail;
>> +
>> +    /* And see its size in the second 64bit value of 'reg'. First,
>> +     * the end of the file needs to be changed from '/phandle' to
>> +     * '/reg' */
>> +    tmp = strrchr(file, '/');
>> +    *tmp = '\0';
>> +    file = strcat(file, "/reg");
>> +
>> +    reg = read_dt_memory_reg(file);
>> +    if (reg && reg[2] == 0x20 && reg[3] == 0x00)
>> +        return true;
> This function is very complex just to find out whether a PCI device is capable
> of NVLink or not. Feels wrong to do it this way, I believe it would be much
> easier if NVIDIA exposed a sysfs attribute saying whether a PCI device supports
> NVLink so that our node-device driver would take care of this during libvirtd
> startup and then you'd only call a single API from the PPC64 helper you're
> introducing in the next patch to find out whether you need the alternative
> formula or not.
>
> Honestly, apart from the coding style issues, I don't like this approach and
> unless there's a really compelling reason for libvirt to do it in a way which
> involves spawning a 'find' process because of a complex pattern and a bunch of
> data necessary to filter out, I'd really suggest contacting NVIDIA about this.
> It's the same for mdevs, NVIDIA exposes a bunch of attributes in sysfs which
> we're able to read.

I'll contact NVIDIA and see if there is an easier way (a sysfs 
attribute, for
example) and, if doesn't, try to provide a plausibe reason to justify this
detection code.



> Thinking about it a bit more, since this is NVIDIA-specific, having an
> NVIDIA-only sysfs attribute doesn't help the node-device driver I mentioned
> above. In general, we try to avoid introducing vendor-specific code, so nodedev
> might not be the right place to check for the attribute (because it's not
> universal and would require vendor-specific elements), but I think we could
> have an NVLink helper reading this sysfs attribute(s) elsewhere in libvirt
> codebase.
>
> Erik




More information about the libvir-list mailing list