[libvirt] [PATCH v3 3/4] qemu_domain: NVLink2 device tree functions for PPC64

Alexey Kardashevskiy aik at ozlabs.ru
Tue Mar 12 04:05:06 UTC 2019



On 12/03/2019 09:15, Daniel Henrique Barboza wrote:
> 
> 
> On 3/7/19 10:29 PM, Alexey Kardashevskiy wrote:
>>
>> On 08/03/2019 04:51, Piotr Jaroszynski wrote:
>>> On 3/7/19 7:39 AM, Daniel Henrique Barboza wrote:
>>>>
>>>> On 3/7/19 12:15 PM, Erik Skultety wrote:
>>>>> On Tue, Mar 05, 2019 at 09:46:08AM -0300, Daniel Henrique Barboza
>>>>> wrote:
>>>>>> The NVLink2 support in QEMU implements the detection of NVLink2
>>>>>> capable devices by verifying the attributes of the VFIO mem region
>>>>>> QEMU allocates for the NVIDIA GPUs. To properly allocate an
>>>>>> adequate amount of memLock, Libvirt needs this information before
>>>>>> a QEMU instance is even created.
>>>>>>
>>>>>> An alternative is presented in this patch. Given a PCI device,
>>>>>> we'll traverse the device tree at /proc/device-tree to check if
>>>>>> the device has an NPU bridge, retrieve the node of the NVLink2 bus,
>>>>>> find the memory-node that is related to the bus and see if it's an
>>>>>> NVLink2 bus by inspecting its 'reg' value. This logic is contained
>>>>>> inside the 'device_is_nvlink2_capable' function, which uses other
>>>>>> new helper functions to navigate and fetch values from the device
>>>>>> tree nodes.
>>>>>>
>>>>>> Signed-off-by: Daniel Henrique Barboza <danielhb413 at gmail.com>
>>>>>> ---
>>>>> This fails with a bunch of compilation errors, please make sure you
>>>>> run make check and make syntax-check on each patch of the series.
>>>> Ooops, forgot to follow up make syntax-check with 'make' before
>>>> submitting ... my bad.
>>>>
>>>>>>    src/qemu/qemu_domain.c | 194 +++++++++++++++++++++++++++++++++++++++++
>>>>>>    1 file changed, 194 insertions(+)
>>>>>>
>>>>>> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
>>>>>> index 77548c224c..97de5793e2 100644
>>>>>> --- a/src/qemu/qemu_domain.c
>>>>>> +++ b/src/qemu/qemu_domain.c
>>>>>> @@ -10343,6 +10343,200 @@ qemuDomainUpdateCurrentMemorySize(virDomainObjPtr vm)
>>>>>>    }
>>>>>>
>>>>>>
>>>>>> +/**
>>>>>> + * Reads a phandle file and returns the phandle value.
>>>>>> + */
>>>>>> +static int
>>>>>> +read_dt_phandle(const char* file)
>>>>> ..we prefer camelCase naming of functions...all functions should
>>>>> have a prefix,
>>>>> e.g. vir,<driver>, etc.
>>>>>
>>>>>> +{
>>>>>> +    unsigned int buf[1];
>>>>>> +    size_t read;
>>>>>> +    FILE *f;
>>>>>> +
>>>>>> +    f = fopen(file, "r");
>>>>>> +    if (!f)
>>>>>> +        return -1;
>>>>>> +
>>>>>> +    read = fread(buf, sizeof(unsigned int), 1, f);
>>>>> We already have safe helpers that take care of such operations.
>>>>>
>>>>>> +
>>>>>> +    if (!read) {
>>>>>> +        VIR_CLOSE(f);
>>>>> You could use VIR_AUTOCLOSE for this
>>>>>
>>>>>> +        return 0;
>>>>>> +    }
>>>>>> +
>>>>>> +    VIR_CLOSE(f);
>>>>>> +    return be32toh(buf[0]);
>>>>> Is this part of gnulib? We need to make sure it's portable
>>>>> otherwise we'd need
>>>>> to make the conversion ourselves, just like for le64toh
>>>>>
>>>>>> +}
>>>>>> +
>>>>>> +
>>>>>> +/**
>>>>>> + * Reads a memory reg file and returns the first 4 int values.
>>>>>> + *
>>>>>> + * The caller is responsible for freeing the returned array.
>>>>>> + */
>>>>>> +static unsigned int *
>>>>>> +read_dt_memory_reg(const char *file)
>>>>>> +{
>>>>>> +    unsigned int *buf;
>>>>>> +    size_t read, i;
>>>>>> +    FILE *f;
>>>>>> +
>>>>>> +    f = fopen(file, "r");
>>>>>> +    if (!f)
>>>>>> +        return NULL;
>>>>>> +
>>>>>> +    if (VIR_ALLOC_N(buf, 4) < 0)
>>>>>> +        return NULL;
>>>>>> +
>>>>>> +    read = fread(buf, sizeof(unsigned int), 4, f);
>>>>>> +
>>>>>> +    if (read < 4)
>>>>>> +        /* shouldn't happen */
>>>>>> +        VIR_FREE(buf);
>>>>>> +    else for (i = 0; i < 4; i++)
>>>>>> +        buf[i] = be32toh(buf[i]);
>>>>>> +
>>>>>> +    VIR_CLOSE(f);
>>>>>> +    return buf;
>>>>> Same comments as above...
>>>>>
>>>>>> +}
>>>>>> +
>>>>>> +
>>>>>> +/**
>>>>>> + * This wrapper function receives arguments to be used in a
>>>>>> + * 'find' call to retrieve the file names that match
>>>>>> + * the criteria inside the /proc/device-tree dir.
>>>>>> + *
>>>>>> + * A 'find' call with '-iname phandle' inside /proc/device-tree
>>>>>> + * provides more than a thousand matches. Adding '-path' to
>>>>>> + * narrow it down further is necessary to keep the file
>>>>>> + * listing sane.
>>>>>> + *
>>>>>> + * The caller is responsible for freeing the buffer returned by
>>>>>> + * this function.
>>>>>> + */
>>>>>> +static char *
>>>>>> +retrieve_dt_files_pattern(const char *path_pattern, const char *file_pattern)
>>>>>> +{
>>>>>> +    virCommandPtr cmd = NULL;
>>>>>> +    char *output = NULL;
>>>>>> +
>>>>>> +    cmd = virCommandNew("find");
>>>>>> +    virCommandAddArgList(cmd, "/proc/device-tree/", "-path", path_pattern,
>>>>>> +                         "-iname", file_pattern, NULL);
>>>>>> +    virCommandSetOutputBuffer(cmd, &output);
>>>>>> +
>>>>>> +    if (virCommandRun(cmd, NULL) < 0)
>>>>>> +        VIR_FREE(output);
>>>>>> +
>>>>>> +    virCommandFree(cmd);
>>>>>> +    return output;
>>>>>> +}
>>>>>> +
>>>>>> +
>>>>>> +/**
>>>>>> + * Helper function that receives a listing of file names and
>>>>>> + * calls read_dt_phandle() on each one, looking for a match
>>>>>> + * with the given phandle argument. Returns the file name if a
>>>>>> + * match is found, NULL otherwise.
>>>>>> + */
>>>>>> +static char *
>>>>>> +find_dt_file_with_phandle(char *files, int phandle)
>>>>>> +{
>>>>>> +    char *line, *tmp;
>>>>>> +    int ret;
>>>>>> +
>>>>>> +    line = strtok_r(files, "\n", &tmp);
>>>>>> +    do {
>>>>>> +       ret = read_dt_phandle(line);
>>>>>> +       if (ret == phandle)
>>>>>> +           break;
>>>>>> +    } while ((line = strtok_r(NULL, "\n", &tmp)) != NULL);
>>>>>> +
>>>>>> +    return line;
>>>>>> +}
>>>>>> +
>>>>>> +
>>>>>> +/**
>>>>>> + * This function receives a string that represents a PCI device,
>>>>>> + * such as '0004:04:00.0', and tells if the device is NVLink2 capable.
>>>>>> + *
>>>>>> + * The logic goes as follows:
>>>>>> + *
>>>>>> + * 1 - get the phandle of a nvlink of the device, reading the 'ibm,npu'
>>>>>> + * attribute;
>>>>>> + * 2 - find the device tree node of the nvlink bus using the phandle
>>>>>> + * found in (1)
>>>>>> + * 3 - get the phandle of the memory region of the nvlink bus
>>>>>> + * 4 - find the device tree node of the memory region using the
>>>>>> + * phandle found in (3)
>>>>>> + * 5 - read the 'reg' value of the memory region. If the second
>>>>>> + * 64 bit value is 0x20 0x00, the device is attached to an
>>>>>> + * NVLink2 bus.
>>>>>> + *
>>>>>> + * If any of these steps fails, the function returns false.
>>>>>> + */
>>>>>> +static bool
>>>>>> +device_is_nvlink2_capable(const char *device)
>>>>>> +{
>>>>>> +    char *file, *files, *tmp;
>>>>>> +    unsigned int *reg;
>>>>>> +    int phandle;
>>>>>> +
>>>>>> +    if ((virAsprintf(&file, "/sys/bus/pci/devices/%s/of_node/ibm,npu",
>>>>>> +                    device)) < 0)
>>>>>> +        return false;
>>>>>> +
>>>>>> +    /* Find phandles of nvlinks:  */
>>>>>> +    if ((phandle = read_dt_phandle(file)) == -1)
>>>>>> +        return false;
>>>>>> +
>>>>>> +    /* Find a DT node for the phandle found */
>>>>>> +    files = retrieve_dt_files_pattern("*device-tree/pci*", "phandle");
>>>>>> +    if (!files)
>>>>>> +        return false;
>>>>>> +
>>>>>> +    if ((file = find_dt_file_with_phandle(files, phandle)) == NULL)
>>>>>> +        goto fail;
>>>>>> +
>>>>>> +    /* Find a phandle of the GPU memory region of the device. The
>>>>>> +     * file found above ends with '/phandle' - the memory region
>>>>>> +     * of the GPU ends with '/memory-region' */
>>>>>> +    tmp = strrchr(file, '/');
>>>>>> +    *tmp = '\0';
>>>>>> +    file = strcat(file, "/memory-region");
>>>>>> +
>>>>>> +    if ((phandle = read_dt_phandle(file)) == -1)
>>>>>> +        goto fail;
>>>>>> +
>>>>>> +    file = NULL;
>>>>>> +    VIR_FREE(files);
>>>>>> +
>>>>>> +    /* Find the memory node for the phandle found above */
>>>>>> +    files = retrieve_dt_files_pattern("*device-tree/memory*", "phandle");
>>>>>> +    if (!files)
>>>>>> +        return false;
>>>>>> +
>>>>>> +    if ((file = find_dt_file_with_phandle(files, phandle)) == NULL)
>>>>>> +        goto fail;
>>>>>> +
>>>>>> +    /* And see its size in the second 64bit value of 'reg'. First,
>>>>>> +     * the end of the file needs to be changed from '/phandle' to
>>>>>> +     * '/reg' */
>>>>>> +    tmp = strrchr(file, '/');
>>>>>> +    *tmp = '\0';
>>>>>> +    file = strcat(file, "/reg");
>>>>>> +
>>>>>> +    reg = read_dt_memory_reg(file);
>>>>>> +    if (reg && reg[2] == 0x20 && reg[3] == 0x00)
>>>>>> +        return true;
>>>>> This function is very complex just to find out whether a PCI device
>>>>> is capable
>>>>> of NVLink or not. Feels wrong to do it this way, I believe it would
>>>>> be much
>>>>> easier if NVIDIA exposed a sysfs attribute saying whether a PCI
>>>>> device supports
>>>>> NVLink so that our node-device driver would take care of this
>>>>> during libvirtd
>>>>> startup and then you'd only call a single API from the PPC64 helper
>>>>> you're
>>>>> introducing in the next patch to find out whether you need the
>>>>> alternative
>>>>> formula or not.
>>>>>
>>>>> Honestly, apart from the coding style issues, I don't like this
>>>>> approach and
>>>>> unless there's a really compelling reason for libvirt to do it in a
>>>>> way which
>>>>> involves spawning a 'find' process because of a complex pattern and
>>>>> a bunch of
>>>>> data necessary to filter out, I'd really suggest contacting NVIDIA
>>>>> about this.
>>>>> It's the same for mdevs, NVIDIA exposes a bunch of attributes in
>>>>> sysfs which
>>>>> we're able to read.
>>>> I'll contact NVIDIA and see if there is an easier way (a sysfs
>>>> attribute, for example) and, if there isn't, try to provide a
>>>> plausible reason to justify this detection code.
>>> Sorry for the delay in responding. The problem is that all the V100 GPUs
>>> support NVLink, but it may or may not be connected up. This is detected
>>> at runtime during GPU initialization, which seems like much too heavy of
>>> an operation to perform as part of passthrough initialization. And
>>> that's
>>> why vfio-pci pieces rely on device tree information to figure it out.
>>>
>>> Alexey, would it be possible for vfio-pci to export the information in a
>>> way more friendly to libvirt?
>>
>> The only information needed here is whether a specific GPU has RAM or
>> not. This can easily be found from the device-tree, imho quite friendly
>> already. VFIO gets to know about these new capabilities when the VFIO
>> PCI device is opened, and we'd rather avoid going that far in
>> libvirt (open a VFIO container, attach a group, get a vfio-pci fd from
>> it, enumerate regions - 2 PCI resets on the way, delays, meh).
>>
>> btw the first "find" for "ibm,npu" can be skipped - NVLinks have to be
>> passed through too or the entire RAM thing won't work. The "find" for the
>> memory node can also be dropped - if the NVLink bridge OF node has
>> "memory-region", then VFIO will most likely expose RAM and QEMU will try
>> using it anyway.
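
For illustration, that check can be as simple as testing for a
"memory-region" property in the bridge's OF node as exposed via sysfs.
A rough, untested sketch only - the helper name is made up and it assumes
the of_node layout shown in the listing below:

static bool
deviceHasNVLink2Ram(const char *pciAddr)
{
    char *path = NULL;
    bool ret;

    /* An NVLink2 bridge that exposes GPU RAM carries a "memory-region"
     * property in its device tree node; no property means no GPU RAM
     * window and therefore no rlimit adjustment. */
    if (virAsprintf(&path, "/sys/bus/pci/devices/%s/of_node/memory-region",
                    pciAddr) < 0)
        return false;

    ret = virFileExists(path);
    VIR_FREE(path);
    return ret;
}
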
> 
> 
> Hmmm, I am not sure I understood it completely ... seeing the
> of_nodes exposed in sysfs of one of the NVLink buses, we have:
> 
> 
> /sys/bus/pci/devices/0007:00:00.0/of_node$ ls
> class-code           ibm,gpu       ibm,nvlink-speed           memory-region  reg
> device-id            ibm,loc-code  ibm,pci-config-space-type  name           revision-id
> ibm,device-tgt-addr  ibm,nvlink    interrupts                 phandle        vendor-id
> 
> 
> We can make a safe assumption that the V100 GPU will always be passed
> through with at least one NVLink2 bus.

Assumption #1: if we want GPU RAM to be available in the guest, an
NVLink bridge must be passed through. No bridge - no RAM - no rlimit
adjustment.


> How many hops do we need to assert
> that a given device is a NVLink2 bus from its of_node info?
> 
> For example, can we say something like:
> 
> "The device node of this device has ibm,gpu and ibm,nvlink and
> ibm,nvlink-speed
> and ibm,nvlink-speed is 0x8, so this is a NVLink2 bus. Since it is not


ibm,nvlink-speed can be different (I saw 9), so you cannot rely on this.


> possible to pass
> through the bus alone, there is a V100 GPU in this same IOMMU group. So
> this is
> a NVLink2 passthrough scenario"
> 
> 
> Or perhaps:
> 
> "It has ibm,gpu and ibm,nvlink and the of_node of its memory-region has
> a reg
> value of 0x20, thus this is an nvlink2 bus and ....."


0x20 is the size of the GPU RAM window, which might change in different
revisions of the hw as well.

> Both alternatives are way simpler than what I'm doing in this patch. I
> am not sure
> if they are valid though.


Assumption #2: if the nvlink2 bridge's DT node has memory-region, then
the GPU will try to use its RAM. The adjustment you'll have to make
won't depend on anything, as the existing GPU RAM placement is such that
whatever you can possibly throw at QEMU will fit the 128TiB window, and we
cannot make the window smaller anyway (the next size down would be the
lower power of two - 64TiB - and GPU RAM lives just above that).

I could do better of course and allow adjusting the limit to cover, let's
say, 66TiB instead of 128TiB; it just seems to be an unnecessary
complication (we already allocate only the required amount of TCE entries
on demand).
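
In other words, on the libvirt side the bump can be a constant. A rough
illustration only - the function and variable names below are made up and
the real formula belongs in the next patch of the series:

static unsigned long long
nvlink2MemLockKiB(unsigned long long baseLimitKiB, bool nvlink2RamPresent)
{
    /* 128TiB expressed in KiB: the fixed GPU RAM window discussed above */
    unsigned long long gpuWindowKiB = 128ULL * 1024 * 1024 * 1024;

    if (!nvlink2RamPresent)
        return baseLimitKiB;

    /* Account for the whole window rather than trying to size it exactly */
    return baseLimitKiB + gpuWindowKiB;
}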



>>>>> Thinking about it a bit more, since this is NVIDIA-specific, having an
>>>>> NVIDIA-only sysfs attribute doesn't help the node-device driver I
>>>>> mentioned
>>>>> above. In general, we try to avoid introducing vendor-specific
>>>>> code, so nodedev
>>>>> might not be the right place to check for the attribute (because
>>>>> it's not
>>>>> universal and would require vendor-specific elements), but I think
>>>>> we could
>>>>> have an NVLink helper reading this sysfs attribute(s) elsewhere in
>>>>> libvirt
>>>>> codebase.
>>>>>
>>>>> Erik



-- 
Alexey



