[Linux-cachefs] Problem: Race in fscache operation enqueuing for reading and copying multiple pages from cachefiles to netfs causes kernel oops or at worst memory scribbling.

KiranKumar Modukuri kiran.modukuri at gmail.com
Wed May 17 20:12:25 UTC 2017





Symptoms:



May 10 13:49:36 sc-sdgx-208 kernel: [510915.776755] BUG: unable to handle
kernel NULL pointer dereference at 0000000000000008

May 10 13:49:36 sc-sdgx-208 kernel: [510915.776793] IP:
[<ffffffff81095951>] process_one_work+0x31/0x3f0


May 10 13:49:36 sc-sdgx-208 kernel: [510915.776821] PGD 5da5094067 PUD
6caa59d067 PMD 0


May 10 13:49:36 sc-sdgx-208 kernel: [510915.776843] Oops: 0000 [#1] SMP

May 10 13:49:36 sc-sdgx-208 kernel: [510915.776858] Modules linked in:
nfsv3 xt_recent veth xt_multiport ipt_MASQUERADE nf_nat_masquerade_ipv4
xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 br_netfilter bridge stp llc
aufs ipmi_devintf cachefiles msr nvidia_drm(POE) nvidia_uvm(POE)
ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6
nf_defrag_ipv6 bonding ipmi_ssif ipt_REJECT nf_reject_ipv4 nf_log_ipv4
nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype mxm_wmi
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables
nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp intel_rapl nf_nat
x86_pkg_temp_thermal intel_powerclamp nf_conntrack_ftp coretemp
nf_conntrack kvm_intel iptable_filter ip_tables kvm x_tables irqbypass
crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul
glue_helper ablk_helper cryptd ast ttm input_leds drm_kms_helper sb_edac
joydev drm edac_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect
sysimgblt mei_me mei lpc_ich shpchp ipmi_si ipmi_msghandler 8250_fintek wmi
acpi_power_meter nvidia_modeset(POE) mac_hid knem(OE) lp parport nfsd
nvidia(POE) auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache rdma_ucm(OE)
ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE)
ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE)
ib_core(OE) ib_addr(OE) ib_netlink(OE) mlx4_en(OE) mlx4_core(OE)
nls_iso8859_1 ses enclosure hid_generic mlx5_core(OE) usbhid hid ixgbe dca
mlx_compat(OE) megaraid_sas vxlan ip6_udp_tunnel udp_tunnel ahci ptp
libahci pps_core mdio fjes

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777450] CPU: 78 PID: 24681
Comm: kworker/u162:7 Tainted: P           OE   4.4.0-45-generic
#66~14.04.1-Ubuntu

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777492] Hardware name:
XXXXXXXXXXXX , BIOS S2W_3A01.NVD02 11/03/2016

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777529] task: ffff88192acc0000
ti: ffff882804c44000 task.ti: ffff882804c44000

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777572] RIP:
0010:[<ffffffff81095951>]  [<ffffffff81095951>] process_one_work+0x31/0x3f0

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777606] RSP:
0000:ffff882804c47e28  EFLAGS: 00010046

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777624] RAX: ffff887dff0b1e00
RBX: ffff887ec7b81a40 RCX: 0000000000000001

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777646] RDX: ffff887dff0b1e00
RSI: ffff887dff0b0240 RDI: ffff887ec7b81a40

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777668] RBP: ffff882804c47e60
R08: 0000000000000101 R09: 00000001802a001b

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777690] R10: 00000000ff0b1e01
R11: ffffea01f7fc2c00 R12: ffff887f657a0000

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777712] R13: 0000000000000000
R14: ffff88192acc0000 R15: ffff887dff0b0240

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777734] FS:
0000000000000000(0000) GS:ffff887f7f580000(0000) knlGS:0000000000000000

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777758] CS:  0010 DS: 0000 ES:
0000 CR0: 0000000080050033

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777777] CR2: 00000000000000b8
CR3: 0000005ef67a2000 CR4: 00000000003406e0

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777799] DR0: 0000000000000000
DR1: 0000000000000000 DR2: 0000000000000000

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777820] DR3: 0000000000000000
DR6: 00000000fffe0ff0 DR7: 0000000000000400

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777842] Stack:

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777851]  00000000657a0018
0000000000000000 ffff887f657a0000 ffff887f657a0018

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777877]  ffff887ec7b81a70
ffff88192acc0000 ffff887ec7b81a40 ffff882804c47ec8

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777902]  ffffffff810961ea
ffffffff817f6659 ffff887f60bf0000 ffff88192acc0000

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777928] Call Trace:

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777942]  [<ffffffff810961ea>]
worker_thread+0x11a/0x470

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777963]  [<ffffffff817f6659>] ?
__schedule+0x359/0x980

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777982]  [<ffffffff810960d0>] ?
rescuer_thread+0x310/0x310

May 10 13:49:36 sc-sdgx-208 kernel: [510915.778003]  [<ffffffff8109b959>]
kthread+0xc9/0xe0

May 10 13:49:36 sc-sdgx-208 kernel: [510915.778020]  [<ffffffff8109b890>] ?
kthread_park+0x60/0x60

May 10 13:49:36 sc-sdgx-208 kernel: [510915.778039]  [<ffffffff817fa88f>]
ret_from_fork+0x3f/0x70

May 10 13:49:36 sc-sdgx-208 kernel: [510915.778056]  [<ffffffff8109b890>] ?
kthread_park+0x60/0x60

May 10 13:49:36 sc-sdgx-208 kernel: [510915.778074] Code: 48 89 e5 41 57 49
89 f7 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 10 48 8b 06 4c 8b 67
48 48 89 c2 30 d2 a8 04 4c 0f 45 ea <49> 8b 45 08 44 8b b0 00 01 00 00 41
c1 ee 05 41 83 e6 01 41 f6

May 10 13:49:36 sc-sdgx-208 kernel: [510915.779021] RIP
[<ffffffff81095951>] process_one_work+0x31/0x3f0

May 10 13:49:36 sc-sdgx-208 kernel: [510915.779881]  RSP <ffff882804c47e28>

May 10 13:49:36 sc-sdgx-208 kernel: [510915.780787] CR2: 0000000000000008

May 10 13:49:36 sc-sdgx-208 kernel: [510915.784428]

---[ end trace 164b63cc6ad868c8 ]---





Root cause:



Let's take the example of reading two pages.



Thread 1:

nfs_readpages
  nfs_readpages_from_fscache
    __nfs_readpages_from_fscache
      __fscache_read_or_alloc_pages(cookie, mapping, pages, nr_pages,
                                    end_io_func, context, gfp)
        op = fscache_alloc_retrieval(cookie, mapping, end_io_func, context)
             (op->usage = 1)
        cachefiles_read_or_alloc_pages
          cachefiles_read_backing_file
            op->op.processor = cachefiles_read_copier
            for each netpage in the list of netpages:
              allocate a monitor (struct cachefiles_one_read), one per page
                (it tracks the netfs page and the backing page)
              fscache_get_retrieval(op)                  (op->usage++)
              find the backing page; allocate it if not found
              bmapping->a_ops->readpage(NULL, backpage)
                -- read the backing page from disk
              add_page_wait_queue(backpage, &monitor->monitor)
                -- monitor->monitor.func is cachefiles_read_waiter
        fscache_put_retrieval(op)                        (op->usage--)


So for a simple two-page read, op->usage is at 2 (one reference for each page).



Thread 2:

  cachefiles_read_waiter(monitor)
    add the monitor to the op->to_do list
    fscache_enqueue_retrieval(op)

      ##### with a slight delay before the second page's enqueue, the ####
      ##### op->to_do list already holds both monitors                ####
      ##### op->usage is at 3 (2 from above + 1 for the first page)   ####

      fscache_enqueue_operation(op)

        ## so by the time the second page is being enqueued here, the op
        ## may already have been freed, with op->usage at 0 ##

        fscache_get_retrieval(op)                        (op->usage++)
        queue_work(fscache_op_wq, &op->work)
          -- op->op.processor was set to cachefiles_read_copier





So for a simple two-page read, op->usage is at 4 (GOOD case) or 1 (BAD
case).



Thread 3:

  fscache_op_work_func(op)
    cachefiles_read_copier(op)

      ### loops twice, as op->to_do holds two monitors ###

      while (op->to_do is not empty) {      /* monitor = next entry */
        copy_highpage(monitor->netfs_page, monitor->back_page);
        fscache_mark_page_cached(monitor->op, monitor->netfs_page);
        put_page(monitor->back_page);
        fscache_end_io(op, monitor->netfs_page, error);
          -- notify the netfs that the page copy is done
        put_page(monitor->netfs_page);
        fscache_retrieval_complete(op, 1);
        fscache_put_retrieval(op);          /* op->usage-- */
        kfree(monitor);
      }
    fscache_put_retrieval(op);              /* op->usage-- */

    # If the first callback finishes before the delayed second enqueue,
    # op->usage drops to 0 and the memory for the op is freed and
    # potentially reused.

    # If the freed op is then pushed onto cache->op_gc_list, we may end
    # up with ASSERTION FAILED (5 == 6):

    #   ASSERTIFCMP(op->state != FSCACHE_OP_ST_INITIALISED &&
    #               op->state != FSCACHE_OP_ST_COMPLETE,
    #               op->state, ==, FSCACHE_OP_ST_CANCELLED);





Possible fix:

  # Clearly, adding the monitor to the to_do list before bumping the
  # reference count is the problem.  Bump op->usage in
  # cachefiles_read_waiter, under the object's work_lock, before the
  # monitor is queued.  This ensures the op cannot be freed while the
  # enqueue is still in flight (op->usage is at least 1).

  # Enqueue the operation for the second page (op->usage is 2).

  # Then decrement op->usage so that the second callback frees the op
  # properly.


static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,
                                  int sync, void *_key)
{
        ...

        spin_lock(&object->work_lock);
+       fscache_get_retrieval(monitor->op);
        list_add_tail(&monitor->op_link, &monitor->op->to_do);
        spin_unlock(&object->work_lock);

        fscache_enqueue_retrieval(monitor->op);
+       fscache_put_retrieval(monitor->op);
        return 0;
}





Thanks

Kiran


