[Cluster-devel] Any ideas on this?

Mark Syms Mark.Syms at citrix.com
Tue Oct 2 15:57:22 UTC 2018


Hi,

We've seen a couple of times in our automated tests that unmounting the GFS2 filesystem gets stuck and doesn't complete. It's just happened again and the stack for the umount process looks like:

[<ffffffff81087968>] flush_workqueue+0x1c8/0x520
[<ffffffffa0666e29>] gfs2_make_fs_ro+0x69/0x160 [gfs2]
[<ffffffffa0667279>] gfs2_put_super+0xa9/0x1c0 [gfs2]
[<ffffffff811b7edf>] generic_shutdown_super+0x6f/0x100
[<ffffffff811b7ff7>] kill_block_super+0x27/0x70
[<ffffffffa0656a71>] gfs2_kill_sb+0x71/0x80 [gfs2]
[<ffffffff811b792b>] deactivate_locked_super+0x3b/0x70
[<ffffffff811b79b9>] deactivate_super+0x59/0x60
[<ffffffff811d2998>] cleanup_mnt+0x58/0x80
[<ffffffff811d2a12>] __cleanup_mnt+0x12/0x20
[<ffffffff8108c87d>] task_work_run+0x7d/0xa0
[<ffffffff8106d7d9>] exit_to_usermode_loop+0x73/0x98
[<ffffffff81003961>] syscall_return_slowpath+0x41/0x50
[<ffffffff815a594c>] int_ret_from_sys_call+0x25/0x8f
[<ffffffffffffffff>] 0xffffffffffffffff

Querying mount or /proc/mounts no longer shows the gfs2 filesystem, so the unmount has got part way through the process.
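
For reference, if I'm reading our kernel tree right, the flush_workqueue() call near the top of gfs2_make_fs_ro() is the flush of GFS2's delete workqueue; the exact ordering and offset may differ on other trees, so treat the excerpt below as my reading rather than a confirmed match:

    static int gfs2_make_fs_ro(struct gfs2_sbd *sdp)
    {
            ...
            /* wait for any queued gl_delete work (delete_work_func) to drain */
            flush_workqueue(gfs2_delete_workqueue);
            ...
    }

So it looks like umount is waiting for delete work items that never finish, presumably because the glocks they need are held elsewhere.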

The test has just finished deleting a series of VMs which will have had data written to them by multiple hosts, and the gap between the last delete completing and the umount is small, so this might play a part in things.

"glocktop -d 1 -r -H" on the stuck host shows it trying to get locks from a different host in the cluster where the referenced pids are no longer present.

Unmounting the fs on the host where the locks are supposedly held was successful, but then the stuck host started reporting that it was waiting for locks held locally by kworker threads.

Is there anything in particular we should be looking at, possibly a fix that we don't currently have in our kernel?

Thanks,

	Mark.



