[Linux-cachefs] Problem: Race in fscache operation enqueuing for reading and copying multiple pages from cachefiles to netfs causes kernel oops or at worst memory scribbling.
KiranKumar Modukuri
kiran.modukuri at gmail.com
Wed May 17 20:12:25 UTC 2017
Problem: Race in fscache operation enqueuing for reading and copying
multiple pages from cachefiles to netfs causes kernel oops or at worst
memory scribbling.
Symptoms:
May 10 13:49:36 sc-sdgx-208 kernel: [510915.776755] BUG: unable to handle
kernel NULL pointer dereference at 0000000000000008
May 10 13:49:36 sc-sdgx-208 kernel: [510915.776793] IP:
[<ffffffff81095951>] process_one_work+0x31/0x3f0
May 10 13:49:36 sc-sdgx-208 kernel: [510915.776821] PGD 5da5094067 PUD
6caa59d067 PMD0
May 10 13:49:36 sc-sdgx-208 kernel: [510915.776843] Oops: 0000 [#1] SMP
May 10 13:49:36 sc-sdgx-208 kernel: [510915.776858] Modules linked in:
nfsv3 xt_recent veth xt_multiport ipt_MASQUERADE nf_nat_masquerade_ipv4
xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 br_netfilter bridge stp llc
aufs ipmi_devintf cachefiles msr nvidia_drm(POE) nvidia_uvm(POE)
ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6
nf_defrag_ipv6 bonding ipmi_ssif ipt_REJECT nf_reject_ipv4 nf_log_ipv4
nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype mxm_wmi
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables
nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp intel_rapl nf_nat
x86_pkg_temp_thermal intel_powerclamp nf_conntrack_ftp coretemp
nf_conntrack kvm_intel iptable_filter ip_tables kvm x_tables irqbypass
crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul
glue_helper ablk_helper cryptd ast ttm input_leds drm_kms_helper sb_edac
joydev drm edac_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect
sysimgblt mei_me mei lpc_ich shpchp ipmi_si ipmi_msghandler 8250_fintek wmi
acpi_power_meter nvidia_modeset(POE) mac_hid knem(OE) lp parport nfsd
nvidia(POE) auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache rdma_ucm(OE)
ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE)
ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE)
ib_core(OE) ib_addr(OE) ib_netlink(OE) mlx4_en(OE) mlx4_core(OE)
nls_iso8859_1 ses enclosure hid_generic mlx5_core(OE) usbhid hid ixgbe dca
mlx_compat(OE) megaraid_sas vxlan ip6_udp_tunnel udp_tunnel ahci ptp
libahci pps_core mdio fjes
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777450] CPU: 78 PID: 24681
Comm: kworker/u162:7 Tainted: P OE 4.4.0-45-generic
#66~14.04.1-Ubuntu
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777492] Hardware name:
XXXXXXXXXXXX , BIOS S2W_3A01.NVD02 11/03/2016
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777529] task: ffff88192acc0000
ti: ffff882804c44000 task.ti: ffff882804c44000
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777572] RIP:
0010:[<ffffffff81095951>] [<ffffffff81095951>] process_one_work+0x31/0x3f0
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777606] RSP:
0000:ffff882804c47e28 EFLAGS: 00010046
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777624] RAX: ffff887dff0b1e00
RBX: ffff887ec7b81a40 RCX: 0000000000000001
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777646] RDX: ffff887dff0b1e00
RSI: ffff887dff0b0240 RDI: ffff887ec7b81a40
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777668] RBP: ffff882804c47e60
R08: 0000000000000101 R09: 00000001802a001b
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777690] R10: 00000000ff0b1e01
R11: ffffea01f7fc2c00 R12: ffff887f657a0000
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777712] R13: 0000000000000000
R14: ffff88192acc0000 R15: ffff887dff0b0240
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777734] FS:
0000000000000000(0000) GS:ffff887f7f580000(0000) knlGS:0000000000000000
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777758] CS: 0010 DS: 0000 ES:
0000 CR0: 0000000080050033
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777777] CR2: 00000000000000b8
CR3: 0000005ef67a2000 CR4: 00000000003406e0
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777799] DR0: 0000000000000000
DR1: 0000000000000000 DR2: 0000000000000000
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777820] DR3: 0000000000000000
DR6: 00000000fffe0ff0 DR7: 0000000000000400
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777842] Stack:
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777851] 00000000657a0018
0000000000000000 ffff887f657a0000 ffff887f657a0018
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777877] ffff887ec7b81a70
ffff88192acc0000 ffff887ec7b81a40 ffff882804c47ec8
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777902] ffffffff810961ea
ffffffff817f6659 ffff887f60bf0000 ffff88192acc0000
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777928] Call Trace:
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777942] [<ffffffff810961ea>]
worker_thread+0x11a/0x470
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777963] [<ffffffff817f6659>] ?
__schedule+0x359/0x980
May 10 13:49:36 sc-sdgx-208 kernel: [510915.777982] [<ffffffff810960d0>] ?
rescuer_thread+0x310/0x310
May 10 13:49:36 sc-sdgx-208 kernel: [510915.778003] [<ffffffff8109b959>]
kthread+0xc9/0xe0
May 10 13:49:36 sc-sdgx-208 kernel: [510915.778020] [<ffffffff8109b890>] ?
kthread_park+0x60/0x60
May 10 13:49:36 sc-sdgx-208 kernel: [510915.778039] [<ffffffff817fa88f>]
ret_from_fork+0x3f/0x70
May 10 13:49:36 sc-sdgx-208 kernel: [510915.778056] [<ffffffff8109b890>] ?
kthread_park+0x60/0x60
May 10 13:49:36 sc-sdgx-208 kernel: [510915.778074] Code: 48 89 e5 41 57 49
89 f7 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 10 48 8b 06 4c 8b 67
48 48 89 c2 30 d2 a8 04 4c 0f 45 ea <49> 8b 45 08 44 8b b0 00 01 00 00 41
c1 ee 05 41 83 e6 01 41 f6
May 10 13:49:36 sc-sdgx-208 kernel: [510915.779021] RIP
[<ffffffff81095951>] process_one_work+0x31/0x3f0
May 10 13:49:36 sc-sdgx-208 kernel: [510915.779881] RSP <ffff882804c47e28>
May 10 13:49:36 sc-sdgx-208 kernel: [510915.780787] CR2: 0000000000000008
May 10 13:49:36 sc-sdgx-208 kernel: [510915.784428]
---[ end trace 164b63cc6ad868c8 ]---
Root cause:
Lets Take an example of reading two pages.
Thread1:
nfs_readpages
nfs_readpages_from_fscache
__nfs_readpages_from_fscache
__fscache_read_or_alloc_pages(cookie,mapping,pages, nr_pages,
end_io_func, context , gfp)
op = fscache_alloc_retrieval(cookie, mapping, end_io_func, context)
(op->usage =1)
|cachefiles_read_or_alloc_pages
|cachefiles_read_backing_file
|op->op.processor = cache_files_read_copier
for each netpage in list of netpages
| allocate a monitor(cachefiles_one_read) one for each page (
tracker for netfs page , cachefs_page)
| fscache_get_retreival(op) (op->usage ++)
| find backing page. allocate if not found
| read_backing_page: bmapping->a_ops->readpage(NULL,backpage) -
read the page from disk
| add_page_wait_queue(monitor->monitor, cachefiles_read_waiter)
| fscache_put_retrieval(op) (op->usage --)
So for a simple two page read , op->usage is at 2 (one for each page)
Thread 2
cachefiles_read_waiter (monitor)
fscache_enqueue_retrieval
add the monitor to op->to_do list
##### slight DELAY for second page, so the op->to_do list has two
pages, ####
##### op->usage is at 3 (2 from above + 1 for the first page)
#########
fscache_enqueue_operation
## so by the time the second page is being queued, the op is
freed and usage is at 0. ###
fscache_get_retreival(op) (op->usage ++)
queue_work(fscache_op_wq, &op->work) op->processor is set to
cache_files_read_copier
So for a simple two page read , op->usage is at 4(GOOD case) or 1 (BAD
case)
Thread 3
fscache_op_work_func(op)
cachefiles_read_copier()
### loops two times for two pages as the op->to_do has two pages
while (op->to_do) --> monitor {
copy_highpage(monitor->netfs_page, monitor->back_page);
fscache_mark_page_cached(monitor->op, monitor->netfs_page);
put_page(monitor->back_page);
fscache_end_io(op, monitor->netfs_page, error); --> notify page
copy done to Netfs
put_page(monitor->netfs_page);
fscache_retrieval_complete(op, 1);
fscache_put_retrieval(op); (op->usage--)
kfree(monitor);
}
fscache_put_retrieval(op); (op->usage --)
# if the first callback finishes before the delay, op->usage is at 0
and the memory for OP is freed and potentially reused
# if the first callback pushes into the cache->op_gc_list , we may end
up with ASSERTION FAILED (5 == 6)
# ASSERTIFCMP(op->state != FSCACHE_OP_ST_INITIALISED &&
op->state != FSCACHE_OP_ST_COMPLETE, op->state, ==,
FSCACHE_OP_ST_CANCELLED);
#possible fix:
#clearly adding the page to to_do list in cachefiles_read_copier prior to
bumping the reference is a problem. bump up the op->usage before queuing
under the object. This ensures the op is not freed. (op->usage is atleast
1)
# enqueue the operation for second page. op->usage is 2
# decrement the op->usage so that the second callback frees properly.
static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,
int sync, void
*_key)
{
...
...
spin_lock(&object->work_lock);
+ fscache_get_retrieval(monitor->op);
list_add_tail(&monitor->op_link, &monitor->op->to_do);
spin_unlock(&object->work_lock);
fscache_enqueue_retrieval(monitor->op);
+ fscache_put_retrieval(monitor->op);
return 0;
}
Thanks
Kiran
More information about the Linux-cachefs
mailing list