[libvirt] [PATCH v4 2/2] PPC64 support for NVIDIA V100 GPU with NVLink2 passthrough

Daniel Henrique Barboza danielhb413 at gmail.com
Tue Apr 2 21:30:57 UTC 2019



On 4/2/19 4:37 AM, Peter Krempa wrote:
> On Tue, Mar 12, 2019 at 18:55:50 -0300, Daniel Henrique Barboza wrote:
>> The NVIDIA V100 GPU has an onboard RAM that is mapped into the
>> host memory and accessible as normal RAM via an NVLink2 bridge. When
>> passed through in a guest, QEMU puts the NVIDIA RAM window in a
>> non-contiguous area, above the PCI MMIO area that starts at 32TiB.
>> This means that the NVIDIA RAM window starts at 64TiB and goes all the
>> way to 128TiB.
>>
>> As a consequence, the guest might request a 64-bit window, for each PCI
>> Host Bridge, that goes all the way to 128TiB. However, the NVIDIA RAM
>> window isn't counted as regular RAM, thus this window is considered
>> only for the allocation of the Translation and Control Entry (TCE).
>>
>> This memory layout differs from the existing VFIO case, requiring its
>> own formula. This patch changes the PPC64 code of
>> @qemuDomainGetMemLockLimitBytes to:
>>
>> - detect if we have a NVLink2 bridge being passed through to the
>> guest. This is done by using the @ppc64VFIODeviceIsNV2Bridge function
>> added in the previous patch. The existence of the NVLink2 bridge in
>> the guest means that we are dealing with the NVLink2 memory layout;
>>
>> - if an IBM NVLink2 bridge exists, passthroughLimit is calculated in a
>> different way to account for the extra memory the TCE table can alloc.
>> The 64TiB..128TiB window is more than enough to fit all possible
>> GPUs, thus the memLimit is the same regardless of passing through 1 or
>> multiple V100 GPUs.
>>
>> Signed-off-by: Daniel Henrique Barboza <danielhb413 at gmail.com>
>> ---
>>   src/qemu/qemu_domain.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 40 insertions(+), 2 deletions(-)
>>
>> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
>> index dcc92d253c..6d1a69491d 100644
>> --- a/src/qemu/qemu_domain.c
>> +++ b/src/qemu/qemu_domain.c
>> @@ -10443,7 +10443,10 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
>>       unsigned long long maxMemory = 0;
>>       unsigned long long passthroughLimit = 0;
>>       size_t i, nPCIHostBridges = 0;
>> +    virPCIDeviceAddressPtr pciAddr;
>> +    char *pciAddrStr = NULL;
>>       bool usesVFIO = false;
>> +    bool nvlink2Capable = false;
>>   
>>       for (i = 0; i < def->ncontrollers; i++) {
>>           virDomainControllerDefPtr cont = def->controllers[i];
>> @@ -10461,7 +10464,16 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
>>               dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
>>               dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
>>               usesVFIO = true;
>> -            break;
>> +
>> +            pciAddr = &dev->source.subsys.u.pci.addr;
>> +            if (virPCIDeviceAddressIsValid(pciAddr, false)) {
>> +                pciAddrStr = virPCIDeviceAddressAsString(pciAddr);
> Again this leaks the PCI address string on every iteration and on exit
> from this function.

I followed Jano's tip here as well (VIR_AUTOFREE and the variable
declared inside the loop).
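
Something along these lines, an untested sketch with the names kept as
in the patch (the string is scoped to the loop body, so VIR_AUTOFREE
releases it on every iteration and on early exit from the function):

    for (i = 0; i < def->nhostdevs; i++) {
        virDomainHostdevDefPtr dev = def->hostdevs[i];

        if (dev->mode == VIR_DOMAIN_HOSTDEV_MODE_SUBSYS &&
            dev->source.subsys.type == VIR_DOMAIN_HOSTDEV_SUBSYS_TYPE_PCI &&
            dev->source.subsys.u.pci.backend == VIR_DOMAIN_HOSTDEV_PCI_BACKEND_VFIO) {
            /* declared inside the loop, freed automatically when it
             * goes out of scope on each iteration */
            VIR_AUTOFREE(char *) pciAddrStr = NULL;
            virPCIDeviceAddressPtr pciAddr = &dev->source.subsys.u.pci.addr;

            usesVFIO = true;

            if (virPCIDeviceAddressIsValid(pciAddr, false)) {
                pciAddrStr = virPCIDeviceAddressAsString(pciAddr);
                if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {
                    nvlink2Capable = true;
                    break;
                }
            }
        }
    }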

>
>> +                if (ppc64VFIODeviceIsNV2Bridge(pciAddrStr)) {
>> +                    nvlink2Capable = true;
>> +                    break;
>> +                }
>> +            }
>> +
>>           }
>>       }
>>   
>> @@ -10488,6 +10500,32 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
>>                   4096 * nPCIHostBridges +
>>                   8192;
>>   
>> +    /* NVLink2 support in QEMU is a special case of the passthrough
>> +     * mechanics explained in the usesVFIO case below. The GPU RAM
>> +     * is placed with a gap after maxMemory. The current QEMU
>> +     * implementation puts the NVIDIA RAM above the PCI MMIO, which
>> +     * starts at 32TiB and is the MMIO reserved for the guest main RAM.
>> +     *
>> +     * This window ends at 64TiB, and this is where the GPUs are being
>> +     * placed. The next available window size is at 128TiB, and
>> +     * 64TiB..128TiB will fit all possible NVIDIA GPUs.
>> +     *
>> +     * The same assumption as the most common case applies here:
>> +     * the guest will request a 64-bit DMA window, per PHB, that is
>> +     * big enough to map all its RAM, which is now at 128TiB due
>> +     * to the GPUs.
>> +     *
>> +     * Note that the NVIDIA RAM window must be accounted for the TCE
>> +     * table size, but *not* for the main RAM (maxMemory). This gives
>> +     * us the following passthroughLimit for the NVLink2 case:
> Citation needed. Please link a source for these claims. We have some
> sources for claims on x86_64 even if they are not exactly scientific.

The source is the QEMU implementation of NVLink2 support. I can link
the QEMU patches in the commit message like I did in the cover letter.
Will that suffice?


>
>> +     *
>> +     * passthroughLimit = maxMemory +
>> +     *                    128TiB/512KiB * #PHBs + 8 MiB */
>> +    if (nvlink2Capable)
> Please add curly braces to this condition as it's multi-line and also
> has big comment inside of it.
>
>> +        passthroughLimit = maxMemory +
>> +                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
>> +                           8192;
> I don't quite understand why this formula uses maxMemory while the vfio
> case uses just 'memory'.

The main difference is the memory hotplug case. If there is no memory
hotplug, maxMemory equals memory and everything is the same.

With memory hotplug, "memory" in the calculations represents "megs"
from QEMU's "-m [size=]megs[,slots=n,maxmem=size]". For VFIO, we use
'memory' because until the new memory is plugged into the guest, only
"megs" are actually mapped for DMA. The actual IOMMU table window
backing the 64-bit DMA window still has to cover "maxmem" though, which
is why 'maxMemory' is used in the 'baseLimit' formula.

For NV2, we use maxMemory because the GPU RAM is placed right after
'maxMemory', with a gap (it now sits at the 64TiB window).
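
To make that concrete, here is an untested sketch of how the two
branches could look in v5, with the curly braces added as requested
(the formulas are the ones already in the patch, just restructured):

    if (nvlink2Capable) {
        /* The TCE table has to cover the 64TiB..128TiB GPU RAM window,
         * hence the 128TiB / 512KiB per-PHB term; the guest main RAM
         * term is maxMemory because the GPU RAM sits after it */
        passthroughLimit = maxMemory +
                           128 * (1ULL<<30) / 512 * nPCIHostBridges +
                           8192;
    } else if (usesVFIO) {
        /* only the currently plugged "megs" ('memory') are mapped for
         * DMA before a memory hotplug, so 'memory' is used here */
        passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
                               memory +
                               memory / 512 * nPCIHostBridges + 8192);
    }

Since nvlink2Capable is checked first, a guest mixing a V100 with other
VFIO devices (e.g. a network card) would get the larger NVLink2 limit,
which also covers the plain VFIO requirements.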



Thanks,


dhb




>
>> +
>>       /* passthroughLimit := max( 2 GiB * #PHBs,                       (c)
>>        *                          memory                               (d)
>>        *                          + memory * 1/512 * #PHBs + 8 MiB )   (e)
>> @@ -10507,7 +10545,7 @@ getPPC64MemLockLimitBytes(virDomainDefPtr def)
>>        * kiB pages, less still if the guest is mapped with hugepages (unlike
>>        * the default 32-bit DMA window, DDW windows can use large IOMMU
>>        * pages). 8 MiB is for second and further level overheads, like (b) */
>> -    if (usesVFIO)
>> +    else if (usesVFIO)
> So can't there be a case when a nvlink2 device is present but also e.g.
> vfio network cards?
>
>>           passthroughLimit = MAX(2 * 1024 * 1024 * nPCIHostBridges,
>>                                  memory +
>>                                  memory / 512 * nPCIHostBridges + 8192);
> Also add curly braces here when you are at it.
>
>> -- 
>> 2.20.1
>>
>> --
>> libvir-list mailing list
>> libvir-list at redhat.com
>> https://www.redhat.com/mailman/listinfo/libvir-list



