[Linux-cachefs] Problem: Race in fscache operation enqueuing for reading and copying multiple pages from cachefiles to netfs causes kernel oops or at worst memory scribbling.

KiranKumar Modukuri kiran.modukuri at gmail.com
Wed May 17 20:12:25 UTC 2017





Symptoms:



May 10 13:49:36 sc-sdgx-208 kernel: [510915.776755] BUG: unable to handle
kernel NULL pointer dereference at 0000000000000008

May 10 13:49:36 sc-sdgx-208 kernel: [510915.776793] IP:
[<ffffffff81095951>] process_one_work+0x31/0x3f0


May 10 13:49:36 sc-sdgx-208 kernel: [510915.776821] PGD 5da5094067 PUD
6caa59d067 PMD 0


May 10 13:49:36 sc-sdgx-208 kernel: [510915.776843] Oops: 0000 [#1] SMP

May 10 13:49:36 sc-sdgx-208 kernel: [510915.776858] Modules linked in:
nfsv3 xt_recent veth xt_multiport ipt_MASQUERADE nf_nat_masquerade_ipv4
xfrm_user xfrm_algo iptable_nat nf_nat_ipv4 br_netfilter bridge stp llc
aufs ipmi_devintf cachefiles msr nvidia_drm(POE) nvidia_uvm(POE)
ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 xt_hl ip6t_rt nf_conntrack_ipv6
nf_defrag_ipv6 bonding ipmi_ssif ipt_REJECT nf_reject_ipv4 nf_log_ipv4
nf_log_common xt_LOG xt_limit xt_tcpudp xt_addrtype mxm_wmi
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables
nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp intel_rapl nf_nat
x86_pkg_temp_thermal intel_powerclamp nf_conntrack_ftp coretemp
nf_conntrack kvm_intel iptable_filter ip_tables kvm x_tables irqbypass
crct10dif_pclmul crc32_pclmul aesni_intel aes_x86_64 lrw gf128mul
glue_helper ablk_helper cryptd ast ttm input_leds drm_kms_helper sb_edac
joydev drm edac_core i2c_algo_bit fb_sys_fops syscopyarea sysfillrect
sysimgblt mei_me mei lpc_ich shpchp ipmi_si ipmi_msghandler 8250_fintek wmi
acpi_power_meter nvidia_modeset(POE) mac_hid knem(OE) lp parport nfsd
nvidia(POE) auth_rpcgss nfs_acl nfs lockd grace sunrpc fscache rdma_ucm(OE)
ib_ucm(OE) rdma_cm(OE) iw_cm(OE) configfs ib_ipoib(OE) ib_cm(OE)
ib_uverbs(OE) ib_umad(OE) mlx5_ib(OE) mlx4_ib(OE) ib_sa(OE) ib_mad(OE)
ib_core(OE) ib_addr(OE) ib_netlink(OE) mlx4_en(OE) mlx4_core(OE)
nls_iso8859_1 ses enclosure hid_generic mlx5_core(OE) usbhid hid ixgbe dca
mlx_compat(OE) megaraid_sas vxlan ip6_udp_tunnel udp_tunnel ahci ptp
libahci pps_core mdio fjes

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777450] CPU: 78 PID: 24681
Comm: kworker/u162:7 Tainted: P           OE   4.4.0-45-generic
#66~14.04.1-Ubuntu

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777492] Hardware name:
XXXXXXXXXXXX , BIOS S2W_3A01.NVD02 11/03/2016

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777529] task: ffff88192acc0000
ti: ffff882804c44000 task.ti: ffff882804c44000

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777572] RIP:
0010:[<ffffffff81095951>]  [<ffffffff81095951>] process_one_work+0x31/0x3f0

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777606] RSP:
0000:ffff882804c47e28  EFLAGS: 00010046

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777624] RAX: ffff887dff0b1e00
RBX: ffff887ec7b81a40 RCX: 0000000000000001

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777646] RDX: ffff887dff0b1e00
RSI: ffff887dff0b0240 RDI: ffff887ec7b81a40

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777668] RBP: ffff882804c47e60
R08: 0000000000000101 R09: 00000001802a001b

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777690] R10: 00000000ff0b1e01
R11: ffffea01f7fc2c00 R12: ffff887f657a0000

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777712] R13: 0000000000000000
R14: ffff88192acc0000 R15: ffff887dff0b0240

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777734] FS:
0000000000000000(0000) GS:ffff887f7f580000(0000) knlGS:0000000000000000

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777758] CS:  0010 DS: 0000 ES:
0000 CR0: 0000000080050033

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777777] CR2: 00000000000000b8
CR3: 0000005ef67a2000 CR4: 00000000003406e0

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777799] DR0: 0000000000000000
DR1: 0000000000000000 DR2: 0000000000000000

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777820] DR3: 0000000000000000
DR6: 00000000fffe0ff0 DR7: 0000000000000400

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777842] Stack:

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777851]  00000000657a0018
0000000000000000 ffff887f657a0000 ffff887f657a0018

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777877]  ffff887ec7b81a70
ffff88192acc0000 ffff887ec7b81a40 ffff882804c47ec8

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777902]  ffffffff810961ea
ffffffff817f6659 ffff887f60bf0000 ffff88192acc0000

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777928] Call Trace:

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777942]  [<ffffffff810961ea>]
worker_thread+0x11a/0x470

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777963]  [<ffffffff817f6659>] ?
__schedule+0x359/0x980

May 10 13:49:36 sc-sdgx-208 kernel: [510915.777982]  [<ffffffff810960d0>] ?
rescuer_thread+0x310/0x310

May 10 13:49:36 sc-sdgx-208 kernel: [510915.778003]  [<ffffffff8109b959>]
kthread+0xc9/0xe0

May 10 13:49:36 sc-sdgx-208 kernel: [510915.778020]  [<ffffffff8109b890>] ?
kthread_park+0x60/0x60

May 10 13:49:36 sc-sdgx-208 kernel: [510915.778039]  [<ffffffff817fa88f>]
ret_from_fork+0x3f/0x70

May 10 13:49:36 sc-sdgx-208 kernel: [510915.778056]  [<ffffffff8109b890>] ?
kthread_park+0x60/0x60

May 10 13:49:36 sc-sdgx-208 kernel: [510915.778074] Code: 48 89 e5 41 57 49
89 f7 41 56 41 55 45 31 ed 41 54 53 48 89 fb 48 83 ec 10 48 8b 06 4c 8b 67
48 48 89 c2 30 d2 a8 04 4c 0f 45 ea <49> 8b 45 08 44 8b b0 00 01 00 00 41
c1 ee 05 41 83 e6 01 41 f6

May 10 13:49:36 sc-sdgx-208 kernel: [510915.779021] RIP
[<ffffffff81095951>] process_one_work+0x31/0x3f0

May 10 13:49:36 sc-sdgx-208 kernel: [510915.779881]  RSP <ffff882804c47e28>

May 10 13:49:36 sc-sdgx-208 kernel: [510915.780787] CR2: 0000000000000008

May 10 13:49:36 sc-sdgx-208 kernel: [510915.784428]

---[ end trace 164b63cc6ad868c8 ]---





Root cause:



Let's take the example of reading two pages.



Thread 1:

nfs_readpages
  nfs_readpages_from_fscache
    __nfs_readpages_from_fscache
      __fscache_read_or_alloc_pages(cookie, mapping, pages, nr_pages,
                                    end_io_func, context, gfp)
        op = fscache_alloc_retrieval(cookie, mapping, end_io_func, context)
             (op->usage = 1)
        cachefiles_read_or_alloc_pages
          cachefiles_read_backing_file
            op->op.processor = cachefiles_read_copier
            for each netpage in the list of netpages:
              allocate a monitor (struct cachefiles_one_read), one per page
                (it tracks the netfs page and the backing page)
              fscache_get_retrieval(op)                  (op->usage++)
              find the backing page; allocate it if not found
              bmapping->a_ops->readpage(NULL, backpage)
                -- read the backing page from disk
              add_page_wait_queue(backpage, &monitor->monitor)
                -- monitor->monitor.func is cachefiles_read_waiter
        fscache_put_retrieval(op)                        (op->usage--)


So for a simple two-page read, op->usage is at 2 (one reference for each page).



Thread 2:

  cachefiles_read_waiter(monitor)
    add the monitor to the op->to_do list
    fscache_enqueue_retrieval(op)

      ##### with a slight delay before the second page's enqueue, the ####
      ##### op->to_do list already holds both monitors                ####
      ##### op->usage is at 3 (2 from above + 1 for the first page)   ####

      fscache_enqueue_operation(op)

        ## so by the time the second page is being enqueued here, the op
        ## may already have been freed, with op->usage at 0 ##

        fscache_get_retrieval(op)                        (op->usage++)
        queue_work(fscache_op_wq, &op->work)
          -- op->op.processor was set to cachefiles_read_copier





So for a simple two-page read, op->usage is at 4 (GOOD case) or 1 (BAD
case).



Thread 3:

  fscache_op_work_func(op)
    cachefiles_read_copier(op)

      ### loops twice, as op->to_do holds two monitors ###

      while (op->to_do is not empty) {      /* monitor = next entry */
        copy_highpage(monitor->netfs_page, monitor->back_page);
        fscache_mark_page_cached(monitor->op, monitor->netfs_page);
        put_page(monitor->back_page);
        fscache_end_io(op, monitor->netfs_page, error);
          -- notify the netfs that the page copy is done
        put_page(monitor->netfs_page);
        fscache_retrieval_complete(op, 1);
        fscache_put_retrieval(op);          /* op->usage-- */
        kfree(monitor);
      }
    fscache_put_retrieval(op);              /* op->usage-- */

    # If the first callback finishes before the delayed second enqueue,
    # op->usage drops to 0 and the memory for the op is freed and
    # potentially reused.

    # If the freed op is then pushed onto cache->op_gc_list, we may end
    # up with ASSERTION FAILED (5 == 6):

    #   ASSERTIFCMP(op->state != FSCACHE_OP_ST_INITIALISED &&
    #               op->state != FSCACHE_OP_ST_COMPLETE,
    #               op->state, ==, FSCACHE_OP_ST_CANCELLED);





Possible fix:

  # Clearly, adding the monitor to the to_do list before bumping the
  # reference count is the problem.  Bump op->usage in
  # cachefiles_read_waiter, under the object's work_lock, before the
  # monitor is queued.  This ensures the op cannot be freed while the
  # enqueue is still in flight (op->usage is at least 1).

  # Enqueue the operation for the second page (op->usage is 2).

  # Then decrement op->usage so that the second callback frees the op
  # properly.


static int cachefiles_read_waiter(wait_queue_t *wait, unsigned mode,
                                  int sync, void *_key)
{
        ...

        spin_lock(&object->work_lock);
+       fscache_get_retrieval(monitor->op);
        list_add_tail(&monitor->op_link, &monitor->op->to_do);
        spin_unlock(&object->work_lock);

        fscache_enqueue_retrieval(monitor->op);
+       fscache_put_retrieval(monitor->op);
        return 0;
}





Thanks

Kiran


