[dm-devel] [RFC] issues with device mapper snapshots

Wed Jun 30 17:43:48 UTC 2004

OK, this is long, so please bear with me...

I've been doing some crude testing of the snapshot device mapper target.
I've got an 8 processor machine, with 16 gigabytes of memory.  I'm testing
by doing a mkfs.ext2 on a 70 gigabyte partition that has been
configured to create a snapshot.   I'm seeing two problems.
Here's a description of the problems I see, and two proposed patches
to fix them.  I thought it would be best to get feedback before
plunging ahead with patches.

I'm finding that my system rapidly runs out of memory (probably actually
KVA), and hangs. The more processors, the more quickly it hangs.
I added some instrumentation to the persistent exception store code to
print the state of the snapshot when it's hung, and I took stack
traces.  Here are the details.

pending_count is 1396.  current_committed is 10 (this is with 8
processors, so it hangs very quickly).

I think the interesting stack trace is this one.  My analysis follows.

kcopyd        D C46C3800     0  1688      7                   9 (L-TLB)
f6479bb4 00000046 00010000 c46c3800 00000000 00000000 f65a891c c8062ca8 
       c8062ce0 c3fdb084 00000010 00000200 740fb02c 00000072 c3caf2d0 c1000be0 
       0000054b 740fb02c 00000072 f633b608 f7ffc8d4 00000246 c1000be0 f7ffc8d4 
Call Trace:
 [<c04203a8>] io_schedule+0x28/0x40
 [<c0143930>] mempool_alloc+0x1a0/0x200
 [<c016c82c>] bio_alloc+0x1c/0x1a0
 [<c016ca00>] bio_clone+0x10/0x90
 [<c0340f34>] clone_bio+0x24/0x60
 [<c034102c>] __clone_and_map+0xbc/0x300
 [<c0341307>] __split_bio+0x97/0x110
 [<c0341407>] dm_request+0x87/0xb0
 [<c028664b>] generic_make_request+0x14b/0x1c0
 [<c028672b>] submit_bio+0x6b/0x110
 [<c03470e3>] do_region+0x173/0x190
 [<c03471b9>] dispatch_io+0xb9/0xc0
 [<c034732a>] async_io+0x5a/0x70
 [<c03474b3>] dm_io_async+0x53/0x60
 [<c0347b83>] run_io_job+0x43/0x80
 [<c0347d94>] process_jobs+0xc4/0x2a0
 [<c01338a5>] worker_thread+0x1f5/0x370
 [<c0137fc5>] kthread+0xa5/0xb0
 [<c01032e5>] kernel_thread_helper+0x5/0x10

There are two problems that need addressing.  The first one is that
it looks to me like we're deadlocked allocating bio structures.

All of the bios from the global pool are tied up with those pending_commit
operations.  I'm now aware why device mapper has its own local
pool of bios, and hence its own local copy of fs/bio.c.  But notice
that the call to bio_clone() on the stack trace is allocating from the
global bio pool.

So, patch number 1 is to give dm-io.c its own copy of bio_clone()
that allocates from dm-io.c's local pool of bios and bvecs.
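
To make the idea behind patch 1 concrete, here is a rough user-space
model, not kernel code. The names struct bio_pool, pool_alloc(),
pool_free() and local_bio_clone() are made up for illustration, and the
struct bio here is a stand-in for the real one. The only point it shows
is that a clone taken from a private pool can still succeed after the
shared pool has been exhausted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* User-space model only: this 'struct bio' is a stand-in, not the kernel type. */
#define LOCAL_POOL_SIZE 4

struct bio {
    unsigned long sector;   /* start sector of the I/O */
    unsigned int  size;     /* length in bytes */
    int           in_use;   /* slot bookkeeping for the toy pool */
};

/* Hypothetical fixed-size pool, loosely playing the role of a mempool. */
struct bio_pool {
    struct bio   slots[LOCAL_POOL_SIZE];
    unsigned int free_count;
};

void pool_init(struct bio_pool *p)
{
    memset(p, 0, sizeof(*p));
    p->free_count = LOCAL_POOL_SIZE;
}

/* Returns NULL when the pool is empty; the kernel's mempool_alloc()
 * would sleep at this point instead, which is where the trace is stuck. */
struct bio *pool_alloc(struct bio_pool *p)
{
    for (size_t i = 0; i < LOCAL_POOL_SIZE; i++) {
        if (!p->slots[i].in_use) {
            p->slots[i].in_use = 1;
            p->free_count--;
            return &p->slots[i];
        }
    }
    return NULL;
}

void pool_free(struct bio_pool *p, struct bio *b)
{
    assert(b->in_use);
    b->in_use = 0;
    p->free_count++;
}

/* Hypothetical local clone: copies the source bio's fields, but takes the
 * pool as an argument rather than drawing from one shared global pool. */
struct bio *local_bio_clone(struct bio_pool *p, const struct bio *src)
{
    struct bio *b = pool_alloc(p);

    if (!b)
        return NULL;
    b->sector = src->sector;
    b->size = src->size;
    return b;
}
```

In this model, a caller that passes its own pool to local_bio_clone()
never competes with other users of a shared pool, which is the property
the proposed patch is after.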

Now, on to patch number 2:

When the mkfs starts, it first fills up the page cache.
Then the pdflush daemons kick in and start dumping page cache to disk.
Since each page write to the origin must first do a read and a write
for the snapshot, the pending count grows much faster than the
commit process can drain it.  I'm thinking this is a congestion
problem.

dm_table_any_congested() checks all the lower-level
devices for congestion.  In my case, these are the disk queues.
But I think that in the snapshot case (at least), the congestion is actually
at a higher level.  I think there should be a congestion check
for when the pending_count gets over approximately 200.  I'm thinking
dm_table_any_congested() should also loop through all the targets, and
call a target_congested() operation to test for congestion at that level.
There may be similar situations possible with the other device mapper
targets (mirroring?).  I've not looked at that code yet, but I thought
I'd get the idea out on the list.
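
The shape of the proposed congestion check could look roughly like the
user-space sketch below. Everything in it is an assumption for
illustration: struct target, struct table, the congested callback,
snapshot_congested() and table_any_congested() are hypothetical names,
not the existing device mapper interfaces, and the threshold simply
encodes the "approximately 200" figure mentioned above.

```c
#include <stddef.h>

/* Assumed threshold, taken from the "approximately 200" in the text. */
#define PENDING_CONGESTION_THRESHOLD 200

struct target;

/* Hypothetical per-target hook; a target that has no notion of
 * higher-level congestion leaves it NULL. */
typedef int (*target_congested_fn)(const struct target *t);

struct target {
    target_congested_fn congested;
    unsigned int pending_count;          /* pending exceptions (snapshot only) */
    int          lower_device_congested; /* stand-in for the disk-queue check */
};

struct table {
    struct target *targets;
    size_t         num_targets;
};

/* Snapshot target reports congestion once too many exceptions are pending. */
int snapshot_congested(const struct target *t)
{
    return t->pending_count > PENDING_CONGESTION_THRESHOLD;
}

/* Model of the extended check: the table is congested if any lower-level
 * device is congested OR any target reports congestion through its hook. */
int table_any_congested(const struct table *tbl)
{
    for (size_t i = 0; i < tbl->num_targets; i++) {
        const struct target *t = &tbl->targets[i];

        if (t->lower_device_congested)
            return 1;
        if (t->congested && t->congested(t))
            return 1;
    }
    return 0;
}
```

With the numbers from the hang above (pending_count of 1396), a snapshot
target in this model would report congestion even though the disk queues
underneath it do not, so writers would be throttled earlier.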

Let me know what you think.
Thanks!

Dave Olien