[vfio-users] lspci and vfio_pci_release deadlock when destroy a pci passthrough VM

Alex Williamson alex.williamson at redhat.com
Wed Mar 20 14:41:32 UTC 2019


On Wed, 20 Mar 2019 13:32:33 +0000
"Wuzongyong (Euler Dept)" <cordius.wu at huawei.com> wrote:

> Hi Alex,
> 
> I noticed a patch you pushed at https://lkml.org/lkml/2019/2/18/1315
> You said the previous commit you pushed may be prone to deadlock. Could you please share details on how to reproduce the deadlock, if you know them?
> I hit a similar problem: every lspci command went into D state and libvirtd went into Z state when destroying a VM with GPU passthrough. The stacks look like this:
> 
> INFO: task ps:112058 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> ps              D 0000000000000000     0 112058      1 0x00000004
> Call Trace:
>  [<ffffffff816b7069>] schedule_preempt_disabled+0x29/0x70
>  [<ffffffff816b4a21>] __mutex_lock_slowpath+0xe1/0x170
>  [<ffffffff816b400f>] mutex_lock+0x1f/0x2f
>  [<ffffffff81379337>] pci_bus_save_and_disable+0x37/0x70
>  [<ffffffff8137aeb8>] pci_try_reset_bus+0x38/0x80
>  [<ffffffffa0261045>] vfio_pci_release+0x3d5/0x430 [vfio_pci]
>  [<ffffffffa0260640>] ? vfio_pci_rw+0xc0/0xc0 [vfio_pci]
>  [<ffffffffa02529f2>] vfio_device_fops_release+0x22/0x40 [vfio]
>  [<ffffffff812179dc>] __fput+0xec/0x260
>  [<ffffffff81217c8e>] ____fput+0xe/0x10
>  [<ffffffff810b684a>] task_work_run+0xaa/0xe0
>  [<ffffffff8102ac12>] do_notify_resume+0x92/0xb0
>  [<ffffffff816c264f>] int_signal+0x12/0x17
> INFO: task lspci:139540 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> lspci           D 0000000000000000     0 139540 139539 0x00000000
> Call Trace:
>  [<ffffffff816b5f79>] schedule+0x29/0x70
>  [<ffffffff81370ca0>] pci_wait_cfg+0xa0/0x110
>  [<ffffffff810cfe40>] ? wake_up_state+0x20/0x20
>  [<ffffffff81370e15>] pci_user_read_config_dword+0x105/0x110
>  [<ffffffff8137e974>] pci_read_config+0x114/0x2c0
>  [<ffffffff811f4835>] ? __kmalloc+0x55/0x240
>  [<ffffffff812992fe>] read+0xde/0x1f0
>  [<ffffffff81215a5f>] vfs_read+0x9f/0x170
>  [<ffffffff81216812>] SyS_pread64+0x92/0xc0
>  [<ffffffff816c22ef>] system_call_fastpath+0x1c/0x21
> 
> It seems that lspci and vfio_pci_release are in deadlock.

pci_dev_lock() will also block PCI config access from the user, but you
don't indicate whether you're running a kernel with the fix above.  As
for that fix, the deadlock scenario I'm familiar with is a bus reset on
device release racing with the device being unbound from the vfio-pci
driver.  For example, echo'ing the device to the vfio-pci driver's
unbind file takes the device lock and blocks until the user releases
the device.  But when the user closes the vfio device file, that
triggers a bus reset to return the device to its initial state, and the
reset also tries to take the device lock.  If vfio were in this
deadlock scenario, lspci would also get blocked, but it's not obvious
how lspci and vfio-pci alone might deadlock with each other, if that's
the situation you're proposing here.  Thanks,

Alex
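
The scenario described above can be sketched as a small wait-for graph.
This is only a toy model, not kernel code: the task names ("unbind",
"release", "lspci") and the helper find_cycle() are hypothetical labels
introduced for illustration.  The lspci edge is an assumption based on
the remark that pci_dev_lock() also blocks user config access.

```python
# Toy wait-for model of the scenario; every name is an illustrative label,
# not a kernel symbol.

def find_cycle(waits_for):
    """Return the tasks forming a wait cycle in {task: task_waited_on}, or None."""
    for start in waits_for:
        path = []
        task = start
        # Follow the chain of waits until it ends or revisits a task.
        while task in waits_for and task not in path:
            path.append(task)
            task = waits_for[task]
        if task in path:
            return path[path.index(task):]
    return None


# Three-way case: unbind holds the device lock and waits for the release;
# the release path's bus reset waits for that device lock; lspci waits for
# config access that the reset path is assumed to have blocked.
three_way = {"unbind": "release", "release": "unbind", "lspci": "release"}

# Two-way case: lspci waits on the release path, which waits on nobody here.
two_way = {"lspci": "release"}

print(find_cycle(three_way))  # ['unbind', 'release']
print(find_cycle(two_way))    # None
```

In this model lspci is stuck behind the cycle without being part of it,
and lspci plus the release path alone form no cycle, which matches the
point that a third party holding the device lock is needed.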



