[Linux-cachefs] Problem: Race in fscache operation enqueuing for reading and copying multiple pages from cachefiles to netfs causes kernel oops or at worst memory scribbling.

KiranKumar Modukuri kiran.modukuri at gmail.com
Mon Jun 5 17:55:52 UTC 2017


Hi,

I have applied a patch for this issue in our local environment and it seems to be
working okay.
Can someone review the patch and let me know whether I am heading in the right direction?

Thanks
Kiran


On Wed, May 17, 2017 at 1:12 PM, KiranKumar Modukuri <
kiran.modukuri at gmail.com> wrote:

> Problem: Race in fscache operation enqueuing for reading and copying
> multiple pages from cachefiles to netfs causes kernel oops or at worst
> memory scribbling.
>
>
>
> Symptoms:
>
>
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.776755] BUG: unable to handle
> kernel NULL pointer dereference at 0000000000000008
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.776793] IP:
> [<ffffffff81095951>] process_one_work+0x31/0x3f0
>
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.776821] PGD 5da5094067 PUD
> 6caa59d067 PMD0
>
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.776843] Oops: 0000 [#1] SMP
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.776858] Modules linked in:
> nfsv3 xt_recent veth xt_multiport ipt_MASQUERADE nf_nat_masquerade_ipv4
> xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 br_netfilter bridge stp llc
> aufs ipmi_devintf cachefiles msr nvidia_drm(POE) nvidia_uvm(POE)
> ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6
> nf_defrag_ipv6 bonding ipmi_ssif ipt_REJECT nf_reject_ipv4 nf_log_ipv4
> nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype mxm_wmi
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables
> nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp intel_rapl nf_nat
> x86_pkg_temp_thermal intel_powerclamp nf_conntrack_ftp coretemp
> nf_conntrack kvm_intel iptable_filter ip_tables kvm x_tables irqbypass
> crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul
> glue_helper ablk_helper cryptd ast ttm input_leds drm_kms_helper sb_edac
> joydev drm edac_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect
> sysimgblt mei_me mei lpc_ich shpchp ipmi_si ipmi_msghandler 8250_fintek wmi
> acpi_power_meter nvidia_modeset(POE) mac_hid knem(OE) lp parport nfsd
> nvidia(POE) auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache rdma_ucm(OE)
> ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE)
> ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE)
> ib_core(OE) ib_addr(OE) ib_netlink(OE) mlx4_en(OE) mlx4_core(OE)
> nls_iso8859_1 ses enclosure hid_generic mlx5_core(OE) usbhid hid ixgbe dca
> mlx_compat(OE) megaraid_sas vxlan ip6_udp_tunnel udp_tunnel ahci ptp
> libahci pps_core mdio fjes
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777450] CPU: 78 PID: 24681
> Comm: kworker/u162:7 Tainted: P           OE   4.4.0-45-generic
> #66~14.04.1-Ubuntu
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777492] Hardware name:
> XXXXXXXXXXXX , BIOS S2W_3A01.NVD02 11/03/2016
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777529] task: ffff88192acc0000
> ti: ffff882804c44000 task.ti: ffff882804c44000
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777572] RIP:
> 0010:[<ffffffff81095951>]  [<ffffffff81095951>] process_one_work+0x31/0x3f0
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777606] RSP:
> 0000:ffff882804c47e28  EFLAGS: 00010046
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777624] RAX: ffff887dff0b1e00
> RBX: ffff887ec7b81a40 RCX: 0000000000000001
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777646] RDX: ffff887dff0b1e00
> RSI: ffff887dff0b0240 RDI: ffff887ec7b81a40
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777668] RBP: ffff882804c47e60
> R08: 0000000000000101 R09: 00000001802a001b
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777690] R10: 00000000ff0b1e01
> R11: ffffea01f7fc2c00 R12: ffff887f657a0000
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777712] R13: 0000000000000000
> R14: ffff88192acc0000 R15: ffff887dff0b0240
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777734] FS:
> 0000000000000000(0000) GS:ffff887f7f580000(0000) knlGS:0000000000000000
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777758] CS:  0010 DS: 0000 ES:
> 0000 CR0: 0000000080050033
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777777] CR2: 00000000000000b8
> CR3: 0000005ef67a2000 CR4: 00000000003406e0
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777799] DR0: 0000000000000000
> DR1: 0000000000000000 DR2: 0000000000000000
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777820] DR3: 0000000000000000
> DR6: 00000000fffe0ff0 DR7: 0000000000000400
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777842] Stack:
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777851]  00000000657a0018
> 0000000000000000 ffff887f657a0000 ffff887f657a0018
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777877]  ffff887ec7b81a70
> ffff88192acc0000 ffff887ec7b81a40 ffff882804c47ec8
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777902]  ffffffff810961ea
> ffffffff817f6659 ffff887f60bf0000 ffff88192acc0000
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777928] Call Trace:
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777942]  [<ffffffff810961ea>]
> worker_thread+0x11a/0x470
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777963]  [<ffffffff817f6659>]
> ? __schedule+0x359/0x980
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.777982]  [<ffffffff810960d0>]
> ? rescuer_thread+0x310/0x310
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.778003]  [<ffffffff8109b959>]
> kthread+0xc9/0xe0
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.778020]  [<ffffffff8109b890>]
> ? kthread_park+0x60/0x60
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.778039]  [<ffffffff817fa88f>]
> ret_from_fork+0x3f/0x70
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.778056]  [<ffffffff8109b890>]
> ? kthread_park+0x60/0x60
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.778074] Code: 48 89 e5 41 57
> 49 89 f7 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 10 48 8b 06 4c 8b
> 67 48 48 89 c2 30 d2 a8 04 4c 0f 45 ea <49> 8b 45 08 44 8b b0 00 01 00 00
> 41 c1 ee 05 41 83 e6 01 41 f6
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.779021] RIP
> [<ffffffff81095951>] process_one_work+0x31/0x3f0
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.779881]  RSP <ffff882804c47e28>
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.780787] CR2: 0000000000000008
>
> May 10 13:49:36 sc-sdgx-208 kernel: [510915.784428]
>
> ---[ end trace 164b63cc6ad868c8 ]---
>
>
>
>
>
> Root cause:
>
>
>
> Let's take the example of reading two pages.
>
>
>
> Thread1:
>
>
>
> nfs_readpages
>   nfs_readpages_from_fscache
>     __nfs_readpages_from_fscache
>       __fscache_read_or_alloc_pages(cookie, mapping, pages, nr_pages,
>                                     end_io_func, context, gfp)
>         op = fscache_alloc_retrieval(cookie, mapping, end_io_func,
>                                      context)              (op->usage = 1)
>         | cachefiles_read_or_alloc_pages
>           | cachefiles_read_backing_file
>             | op->op.processor = cachefiles_read_copier
>             | for each netpage in the list of netpages:
>               | allocate a monitor (cachefiles_one_read), one per page
>                 (tracks the netfs page and the backing cache page)
>               | fscache_get_retrieval(op)                  (op->usage++)
>               | find the backing page; allocate it if not found
>               | read the backing page from disk:
>                 bmapping->a_ops->readpage(NULL, backpage)
>               | add_page_wait_queue(backpage, &monitor->monitor)
>                 (wait callback: cachefiles_read_waiter)
>         | fscache_put_retrieval(op)                        (op->usage--)
>
>
>
> So for a simple two-page read, op->usage is at 2 (one reference per page);
> a condensed sketch of this accounting follows.
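>
> To make the bookkeeping concrete, here is a condensed sketch of thread 1's
> reference counting as described above (not the literal kernel code; a
> heavily abridged sketch, assuming the 4.4-era fs/cachefiles/rdwr.c
> helpers and structures):
>
>     /* __fscache_read_or_alloc_pages() */
>     op = fscache_alloc_retrieval(cookie, mapping, end_io_func, context);
>     /* op->usage == 1 */
>
>     /* cachefiles_read_backing_file() */
>     list_for_each_entry_safe(netpage, _n, list, lru) {
>             monitor = kzalloc(sizeof(*monitor), cachefiles_gfp);
>             monitor->netfs_page = netpage;
>             monitor->op = fscache_get_retrieval(op);     /* op->usage++ */
>             ...                  /* find/allocate and read the backpage */
>             add_page_wait_queue(backpage, &monitor->monitor);
>             /* monitor->monitor.func == cachefiles_read_waiter */
>     }
>
>     /* once all pages have been submitted */
>     fscache_put_retrieval(op);                           /* op->usage-- */
>     /* two-page read: op->usage == 2, i.e. one reference per monitor */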
>
>
>
> Thread 2
>
>    cachefiles_read_waiter (called once per monitor/page)
>
>       add the monitor to the op->to_do list
>
>         ##### slight DELAY for the second page, so the op->to_do list
>         ##### already holds both pages ####
>
>         ##### op->usage is at 3 (2 from above + 1 from enqueuing the
>         ##### first page) #####
>
>       fscache_enqueue_retrieval
>
>         fscache_enqueue_operation
>
>            ## so by the time the second page is being queued here, the op
>            ## has already been freed and usage is at 0 ###
>
>            fscache_get_retrieval(op)                       (op->usage++)
>
>            queue_work(fscache_op_wq, &op->work)
>              (op->processor was set to cachefiles_read_copier in thread 1)
>
>
>
>
>
> So for a simple two-page read, op->usage is at 4 (GOOD case) or 1 (BAD
> case).
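>
> The reference for the work item is only taken inside
> fscache_enqueue_operation(), i.e. after the monitor is already visible on
> op->to_do with nothing pinning the op. An abridged sketch of that
> function (assuming the 4.4-era fs/fscache/operation.c):
>
>     void fscache_enqueue_operation(struct fscache_operation *op)
>     {
>             ASSERTCMP(atomic_read(&op->usage), >, 0);
>             ...
>             switch (op->flags & FSCACHE_OP_TYPE) {
>             case FSCACHE_OP_ASYNC:
>                     atomic_inc(&op->usage);            /* op->usage++ */
>                     if (!queue_work(fscache_op_wq, &op->work))
>                             fscache_put_operation(op);
>                     break;
>             ...
>             }
>     }
>
> If the copier (thread 3) has already drained op->to_do and dropped the
> last reference, the delayed waiter reaches this code with a freed op,
> which is consistent with the oops in process_one_work() above.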
>
>
>
> Thread 3
>
>     fscache_op_work_func(op)
>
>       cachefiles_read_copier()
>
>          ### loops twice, since op->to_do holds both pages
>
>          while (op->to_do is not empty) --> monitor {
>             copy_highpage(monitor->netfs_page, monitor->back_page);
>             fscache_mark_page_cached(monitor->op, monitor->netfs_page);
>             put_page(monitor->back_page);
>             fscache_end_io(op, monitor->netfs_page, error);
>                --> notify the netfs that the page copy is done
>             put_page(monitor->netfs_page);
>             fscache_retrieval_complete(op, 1);
>             fscache_put_retrieval(op);                     (op->usage--)
>             kfree(monitor);
>          }
>
>     fscache_put_retrieval(op);                             (op->usage--)
>
>     # If the first callback finishes before the delayed enqueue of the
>     # second page, op->usage reaches 0 here, so the memory for the op is
>     # freed and potentially reused.
>
>     # If the first callback pushes the op onto cache->op_gc_list, we may
>     # instead end up with ASSERTION FAILED (5 == 6) from:
>
>     #     ASSERTIFCMP(op->state != FSCACHE_OP_ST_INITIALISED &&
>     #                 op->state != FSCACHE_OP_ST_COMPLETE,
>     #                 op->state, ==, FSCACHE_OP_ST_CANCELLED);
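>
> For reference, the put path that these counts feed into looks roughly
> like the following (a heavily abridged sketch of fscache_put_operation(),
> assuming the 4.4-era fs/fscache/operation.c; the ASSERTIFCMP is the one
> quoted above):
>
>     void fscache_put_operation(struct fscache_operation *op)
>     {
>             ASSERTCMP(atomic_read(&op->usage), >, 0);
>
>             if (!atomic_dec_and_test(&op->usage))
>                     return;
>
>             ASSERTIFCMP(op->state != FSCACHE_OP_ST_INITIALISED &&
>                         op->state != FSCACHE_OP_ST_COMPLETE,
>                         op->state, ==, FSCACHE_OP_ST_CANCELLED);
>             ...
>             /* the op is then released: freed directly, or pushed onto
>              * cache->op_gc_list for deferred freeing */
>     }
>
> Once usage reaches zero here, the op memory is gone; the delayed waiter
> in thread 2 then enqueues (and the workqueue later dereferences) freed
> memory.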
>
>
>
>
>
> # Possible fix:
>
>   # Clearly, adding the monitor to the op->to_do list in
>   # cachefiles_read_waiter before bumping the reference is the problem.
>   # Bump op->usage before queuing the monitor under the object's
>   # work_lock; this ensures the op is not freed (op->usage is at least 1)
>   # across the enqueue.
>
>   # Enqueue the operation for the second page; op->usage is 2.
>
>   # Then drop the extra reference so that the second callback can free
>   # the op properly.
>
>
>
> static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,
>                                   int sync, void *_key)
> {
>     ...
>     ...
>
>     spin_lock(&object->work_lock);
> +   fscache_get_retrieval(monitor->op);
>     list_add_tail(&monitor->op_link, &monitor->op->to_do);
>     spin_unlock(&object->work_lock);
>
>     fscache_enqueue_retrieval(monitor->op);
> +   fscache_put_retrieval(monitor->op);
>     return 0;
> }
>
>
>
>
>
> Thanks
>
> Kiran
>
>


-- 
Kiran
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cachefiles_fix.patch
Type: application/octet-stream
Size: 1278 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cachefs/attachments/20170605/f38eb999/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fscache_fix.patch
Type: application/octet-stream
Size: 895 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cachefs/attachments/20170605/f38eb999/attachment-0001.obj>

