[vfio-users] [OVMF+VFIO+QEMU] Large VM RAM allocation = Long Boot Times

A Z adam at zegelin.com
Mon Nov 19 22:46:38 UTC 2018


Hi Laszlo,

Looks like CONFIG_PREEMPT_VOLUNTARY isn't set.

CONFIG_PREEMPT_RCU=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_COUNT=y
CONFIG_DEBUG_PREEMPT=y
# CONFIG_PREEMPTIRQ_EVENTS is not set
# CONFIG_PREEMPT_TRACER is not set

If I get time tonight, I'll recompile the kernel with that flag set and
see if it makes a difference.
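
For reference, my rough plan (assuming a vanilla kernel tree rather than
the Arch packaging, and that /proc/config.gz is available; the PREEMPT_*
options are a mutually exclusive choice, so full PREEMPT has to be
switched off first):

$ zcat /proc/config.gz > .config   # start from the running kernel's config
$ ./scripts/config --disable PREEMPT --enable PREEMPT_VOLUNTARY
$ make olddefconfig
$ grep "^CONFIG_PREEMPT" .config   # expect CONFIG_PREEMPT_VOLUNTARY=y
$ make -j$(nproc) && sudo make modules_install install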

- Adam
On Mon, Nov 19, 2018 at 10:35 AM Laszlo Ersek <lersek at redhat.com> wrote:
>
> On 11/16/18 22:36, A Z wrote:
> > And just a quick discovery:
> >
> > With 24G assigned (-m 24576)...
> >
> > Reducing the number of CPU cores to 1, 2 or 3 with
> > ovmf-x64-0-20181105.757.g114a75ad86 results in a really fast boot to
> > the TianoCore splash.
> > As soon as the number of CPU cores jumps to 4, the VM takes a hit in
> > terms of boot time.
> >
> > 1 core, 1 thread: ~16 seconds
> > 2 cores, 1 thread: ~16 seconds
> > 3 cores, 1 thread: ~19 seconds
> > 4 cores, 1 thread: ~111 seconds
> > 8 cores, 1 thread: ~150 seconds
> > 8 cores, 2 threads: ~334 seconds
> >
> > (timing is based on grepping debugcon.log for "[Bds]OsIndication:",
> > which seems to occur right after the splash screen is displayed)
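> >
> > (If anyone wants to reproduce the timing: one rough approach, assuming
> > moreutils' "ts" is installed, is to start the pipeline below just
> > before launching QEMU; "ts -s" prefixes each line with the elapsed
> > time since it started:
> >
> > $ tail -F debugcon.log | ts -s | grep --line-buffered OsIndication
> >
> > -- untested beyond my usual after-the-fact grep.)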
> >
> > Significant pauses can be observed at points in the boot process,
> > based on the debugcon output...
> >
> > First pause is here:
> >
> >> Loading PEIM at 0x0007FEC1000 EntryPoint=0x0007FEC6985 CpuMpPei.efi
> >> Register PPI Notify: F894643D-C449-42D1-8EA8-85BDD8C65BDE
> >> Notify: PPI Guid: F894643D-C449-42D1-8EA8-85BDD8C65BDE, Peim notify entry point: 7FEC58C4
> >> AP Loop Mode is 1
> >> WakeupBufferStart = 9F000, WakeupBufferSize = 1000
> >
> > Then here:
> >
> >> Loading driver at 0x0007F96C000 EntryPoint=0x0007F9743EE CpuDxe.efi
> >> InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7F4AC618
> >> ProtectUefiImageCommon - 0x7F4AE040
> >>   - 0x000000007F96C000 - 0x000000000000E3C0
> >> Paging: added 512 pages to page table pool
> >> CurrentPagingContext:
> >>   MachineType   - 0x8664
> >>   PageTableBase - 0x7FA01000
> >>   Attributes    - 0xC0000006
> >> InstallProtocolInterface: 26BACCB1-6F42-11D4-BCE7-0080C73C8881 7F979E20
> >> MemoryProtectionCpuArchProtocolNotify:
> > <snip>
> >> ConvertPages: failed to find range A0000 - FFFFF
> >> ConvertPages: failed to find range 80000000 - 8FFFFFFF
> >> ConvertPages: failed to find range 90000000 - FBFFFFFF
> >> ConvertPages: failed to find range FEC00000 - FEC00FFF
> >> Failed to update capability: [8] 00000000FED00000 - 00000000FED003FF (C700000000000001 -> C700000000026001)
> >> ConvertPages: failed to find range FED1C000 - FED1FFFF
> >> ConvertPages: failed to find range FEE00000 - FEEFFFFF
> >> ConvertPages: failed to find range FFE00000 - FFFFFFFF
> >> AP Loop Mode is 1
> >> WakeupBufferStart = 9F000, WakeupBufferSize = 229
> >
> > And the last one:
> >
> >> Loading driver at 0x0007F918000 EntryPoint=0x0007F91D876 PciHostBridgeDxe.efi
> >> InstallProtocolInterface: BC62157E-3E33-4FEC-9920-2D3B36D750DF 7F01D798
> >> ProtectUefiImageCommon - 0x7F01D1C0
> >>   - 0x000000007F918000 - 0x0000000000009F00
> >> Select Item: 0x0
> >> FW CFG Signature: 0x554D4551
> >> Select Item: 0x1
> >> FW CFG Revision: 0x3
> >> QemuFwCfg interface (DMA) is supported.
> >> Select Item: 0x19
> >> InitRootBridge: populated root bus 0, with room for 255 subordinate bus(es)
> >> RootBridge: PciRoot(0x0)
> >>   Support/Attr: 70069 / 70069
> >>     DmaAbove4G: No
> >> NoExtConfSpace: No
> >>      AllocAttr: 3 (CombineMemPMem Mem64Decode)
> >>            Bus: 0 - FF Translation=0
> >>             Io: 6000 - FFFF Translation=0
> >>            Mem: 90000000 - FBFFFFFF Translation=0
> >>     MemAbove4G: 800000000 - FFFFFFFFF Translation=0
> >>           PMem: FFFFFFFFFFFFFFFF - 0 Translation=0
> >>    PMemAbove4G: FFFFFFFFFFFFFFFF - 0 Translation=0
> >
> >
> > Regards,
> > Adam
> >
> >
> > On Thu, Nov 15, 2018 at 10:28 PM A Z <adam at zegelin.com> wrote:
> >>
> >> On Thu, Nov 15, 2018 at 2:47 AM Laszlo Ersek <lersek at redhat.com> wrote:
> >>>
> >>> If nothing at all had changed in your configuration, then the behavior
> >>> wouldn't have changed either. You must certainly have upgraded some
> >>> components since then, between host kernel, QEMU, and OVMF (likely all
> >>> of them). Please state the version numbers that work and don't work, and
> >>> try to isolate the issue to a version change that affects a *single*
> >>> component.
> >>
> >> I wasn't clear on this, sorry. I meant that the VM configuration --
> >> the qemu command-line parameters -- hasn't changed since I got
> >> everything up-and-running, *except* for the OVMF firmware image and
> >> the amount of memory assigned (i.e., the CPU caps, disk, network and
> >> all the PCI-passthrough assignments haven't changed).
> >> I added more memory to the host a few months ago, and upped the VM
> >> allocation to match at the same time. This made the boot times even
> >> worse.
> >> I've also occasionally downloaded a newer version of OVMF to see if it
> >> fixes this boot time issue.
> >> The host OS kernel, qemu and other package versions have changed over
> >> this period as I've installed OS updates. So has the build of Windows
> >> 10.
> >>
> >>> I vaguely recall similar reports from the past. I've been under the
> >>> impression that this kind of issue has been fixed. Thus, I'd guess you
> >>> are running old(er) versions of the affected component (whichever it
> >>> is), or else you are running bleeding edge, and the bug has crept back
> >>> in (new regression).
> >>
> >> It's possibly the latter.
> >>
> >> Host kernel: 4.18.10-arch1-1-ARCH #1 SMP PREEMPT Wed Sep 26 09:48:22
> >> UTC 2018 x86_64 GNU/Linux
> >> qemu version: QEMU emulator version 3.0.0
> >> OVMF build: ovmf-x64-0-20181105.757.g114a75ad86 (downloaded from www.kraxel.org)
> >>
> >>
> >> On Thu, Nov 15, 2018 at 6:14 AM Hristo Iliev <hristo at hiliev.eu> wrote:
> >>
> >>> This sounds quite like a problem I used to have in the past with certain versions of OVMF and the Linux kernel. If I remember correctly, some memory ranges would get marked as non-cacheable, which resulted in the terrible slowdown you describe. The resolution back then was to stick to older versions of both the firmware and the kernel.
> >>>
> >>> I still keep around an older version of OVMF that I used until recently - edk2.git-ovmf-x64-0-20160714.b1992.gfc3005.noarch.rpm. You could download the RPM here and try if it works for you:
> >>>
> >>> https://exa.flopser.de/s/z3cjqFUlZ2gBPCy
> >>
> >> I will try this firmware build. I have read similar things about
> >> versions from 2016 working correctly.
> >> But TBH I'd also prefer to run something newer -- bug fixes (and
> >> regressions!), security patches, etc.
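> >>
> >> (For anyone else on Arch without rpm installed: I assume bsdtar can
> >> unpack the RPM directly, since libarchive understands the format --
> >> untested:)
> >>
> >> $ bsdtar -xf edk2.git-ovmf-x64-0-20160714.b1992.gfc3005.noarch.rpm
> >> $ ls usr/share/edk2.git/ovmf-x64/   # the firmware images should land here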
> >>
> >>> Recent QEMU versions started complaining about firmware incompatibilities, so I tried the latest RPM from Kraxel (https://www.kraxel.org/repos/jenkins/edk2/) and it works just fine. The host system is Arch Linux with the latest Arch kernel and QEMU.
> >>
> >> While I'm not running today's build, I am running one downloaded from
> >> Kraxel.org within the past week.
> >>
> >>
> >> On Thu, Nov 15, 2018 at 7:28 AM Alex Williamson
> >> <alex.williamson at redhat.com> wrote:
> >>
> >>> If you can confirm that your system uses a NUMA memory configuration
> >>> and that 24G exceeds the node local memory, then this is a known
> >>> regression in the Linux kernel mm. AIUI it's being worked on upstream;
> >>> workarounds are to either use hugepages or locked memory for the VM.
> >>
> >> I'm currently using hugepages. Not sure how to lock the memory, sorry.
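> >>
> >> (If I'm reading qemu-options right, locking the guest RAM would be a
> >> matter of adding
> >>
> >> -realtime mlock=on \
> >>
> >> to the command line, plus raising RLIMIT_MEMLOCK for the invoking
> >> user when not running as root, e.g. in /etc/security/limits.conf:
> >>
> >> adam soft memlock unlimited
> >> adam hard memlock unlimited
> >>
> >> with "adam" being whatever user launches QEMU. Untested.)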
> >>
> >> It is a dual CPU system. The topology:
> >>
> >> $ sudo numactl -H
> >> available: 2 nodes (0-1)
> >> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 24 25 26 27 28 29 30 31 32 33 34 35
> >> node 0 size: 32143 MB
> >> node 0 free: 236 MB
> >> node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23 36 37 38 39 40 41 42
> >> 43 44 45 46 47
> >> node 1 size: 32232 MB
> >> node 1 free: 95 MB
> >> node distances:
> >> node   0   1
> >>   0:  10  21
> >>   1:  21  10
> >>
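> >> (I suppose I could also try pinning the whole VM to node 0, so all
> >> 24G is node-local -- a rough sketch, assuming the hugepages are
> >> reserved on that node, would be:
> >>
> >> $ numactl --cpunodebind=0 --membind=0 /usr/sbin/qemu-system-x86_64 ...
> >>
> >> -- untested.)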
> >>
> >>
> >> For reference, the qemu command line I'm exec'ing from a script is:
> >>
> >> /usr/sbin/qemu-system-x86_64 \
> >>         -name win10,debug-threads=on \
> >>         -monitor unix:win10.sock,server,nowait \
> >>         -debugcon file:debugcon.log -global isa-debugcon.iobase=0x402 \
> >>         -nodefconfig -no-user-config -nodefaults -nographic \
> >>         -machine q35,accel=kvm \
> >>         -enable-kvm \
> >>         -cpu host,kvm=off,hv_spinlocks=0x1fff,hv_relaxed,hv_time,hv_vendor_id=Nvidia43FIX \
> >>         -rtc base=localtime,clock=host,driftfix=slew \
> >>         \
> >>         -drive file=/vmstorage/firmware/ovmf-x64-0-20181105.757.g114a75ad86/OVMF-pure-efi.fd,if=pflash,format=raw,unit=0,readonly=on \
> >>         -drive file=/home/adam/win10-vars.fd,if=pflash,format=raw,readonly=off,unit=1 \
> >>         \
> >>         -object iothread,id=io1 \
> >>         \
> >>         -mem-path /dev/hugepages \
> >>         -m 24576 \
> >>         -smp cores=8,threads=2,sockets=1 \
> >>         \
> >>         -netdev bridge,id=netdev0,br=br0 \
> >>         -device virtio-net-pci,netdev=netdev0,mac=52:54:00:19:34:57 \
> >>         -device virtio-scsi-pci,id=scsi0,ioeventfd=on,iothread=io1,num_queues=4 \
> >>         \
> >>         -device vfio-pci,host=07:00.0,addr=0x6,multifunction=on \
> >>         -device vfio-pci,host=07:00.1,addr=0x6.0x1 \
> >>         -device vfio-pci,host=05:00.0 \
> >>         -device vfio-pci,host=06:00.0
>
> Can you grep your kernel config for "PREEMPT"? I remember that for the
> Lime Tech guys, CONFIG_PREEMPT_VOLUNTARY made a huge difference, in the
> same spots. It might help you too.
>
> I do notice that your "uname" output already includes "PREEMPT". I don't
> remember (and don't have time now to look up) what knob does exactly
> what. Please try massaging these knobs in your host kernel config, and
> see if they make a difference.
>
> When CONFIG_PREEMPT_VOLUNTARY was identified as the one bit making the
> difference in the Lime Tech case, I requested that they please document
> their findings on the web somewhere, perhaps in a forum or in a blog
> post. I'm not sure if that happened after all.
>
> For some hand-wavy explanation, I can offer this: the two spots in the
> OVMF debug log are related to the guest firmware starting up all VCPUs
> at once, and those VCPUs fighting over some spinlocks (by design). PLE
> (pause loop exiting) should really help here, but apparently, it doesn't
> always do that. So, with CONFIG_PREEMPT (or without it, I don't
> remember), the host kernel threads get scheduled in a way that is
> exactly "adversarial" to those VCPUs making good progress. E.g. if a
> VCPU owns a spinlock but the corresponding host kernel thread is
> de-scheduled for a long time, the other VCPUs / host kernel threads
> won't make any progress, no matter how hard they work. Yada yada. As I
> said, hand-wavy, sorry.
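>
> If you want to sanity-check whether PLE is even enabled on your host,
> the kvm_intel module exposes the knobs (assuming an Intel CPU; a zero
> ple_gap means PLE is off):
>
> $ cat /sys/module/kvm_intel/parameters/ple_gap
> $ cat /sys/module/kvm_intel/parameters/ple_window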
>
> Thanks
> Laszlo



