[lvm-devel] [PATCH 2 of 4] Handle transient secondary mirror leg failures
malahal at us.ibm.com
Fri Dec 18 18:49:40 UTC 2009
Takahiro Yasui [tyasui at redhat.com] wrote:
> On 12/18/09 12:10, Jonathan Brassow wrote:
> > 2) If you don't get a new table loaded, it will behave as a suspend/
> > resume only. Recent code changes in dm-raid1.c are causing
> > 'log_failure' and 'leg_failure' to not be reset in those cases. IOW,
> > all these steps could be for nothing. :(
>
> I would like to know how effective the retry is. As Jon explained
> above, the recent upstream kernel blocks all write I/Os to NOSYNC regions.
> This means that those write I/Os stay blocked for a long time.
> For example, the mirror retry interval in your patch #4 is 30 seconds, so
> an application or filesystem will have to wait 30 seconds (330 seconds
> if the retry count is 10). Can your application wait for more than 5 minutes?
>
> This behaviour will not be solved even if the kernel is fixed so that
> log_failure and leg_failure are reset. The blocked write I/Os will
> be re-queued in the kernel when suspend/resume is done, but they
> will be put on the hold queue again if the device failure is not
> transient but permanent.
>
> I would like to know the use case of this patch set.
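
For reference, the 330-second figure quoted above can be reproduced with a small calculation. This is only a sketch: it assumes the wait is one initial attempt plus `retry_count` retries, each separated by the retry interval, which is one reading that yields 330 seconds. The function name and parameters are made up for illustration and do not come from the patch set.

```python
# Worst-case time a write could stay blocked while a mirror leg is retried.
# Assumption (not confirmed by the patch): one initial attempt plus
# `retry_count` retries, each taking `retry_interval_s` seconds.

def worst_case_block_seconds(retry_interval_s: int, retry_count: int) -> int:
    attempts = retry_count + 1  # initial attempt + retries (assumed)
    return retry_interval_s * attempts


if __name__ == "__main__":
    total = worst_case_block_seconds(retry_interval_s=30, retry_count=10)
    print(f"{total} s (about {total / 60:.1f} min)")
```
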
I tested with the RHEL5.4 kernel, which doesn't block on device failure.
IMHO, if the kernel code blocks on failure, suspend/resume is doing two
things: 1) letting the kernel module unblock, and 2) starting a resync. We
should separate those two actions. Maybe we should do something here, e.g.
have an ioctl to start the resync, or have a message to unblock.
Also, blocking in the kernel module introduced a regression: the
mirror will block forever on a transient failure!
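
To make the proposed separation concrete, here is a toy model in Python. It is not dm-raid1 code and none of these names exist in the kernel or in LVM; `unblock()` and `start_resync()` simply stand in for the hypothetical "message to unblock" and "ioctl to start resync". As a simplification, the model holds every write after a failure rather than only writes to NOSYNC regions.

```python
from collections import deque


class ToyMirror:
    """Toy model only; all names are hypothetical, not dm-raid1 APIs."""

    def __init__(self):
        self.blocked = False
        self.held_writes = deque()   # writes parked after a leg failure
        self.completed = []          # writes that were allowed through
        self.nosync_regions = set()  # regions that still need a resync

    def leg_failure(self, region):
        # A failed write marks its region out of sync and blocks the mirror.
        self.nosync_regions.add(region)
        self.blocked = True

    def write(self, region):
        if self.blocked:
            self.held_writes.append(region)
        else:
            self.completed.append(region)

    def unblock(self):
        # Action 1: release held writes; does not touch resync state.
        self.blocked = False
        while self.held_writes:
            self.completed.append(self.held_writes.popleft())

    def start_resync(self):
        # Action 2: recover out-of-sync regions; independent of unblock().
        recovered = sorted(self.nosync_regions)
        self.nosync_regions.clear()
        return recovered


if __name__ == "__main__":
    m = ToyMirror()
    m.write(1)
    m.leg_failure(region=2)
    m.write(3)       # held while the mirror is blocked
    m.unblock()      # releases the held write; region 2 is still out of sync
    print(m.completed, m.nosync_regions)
    print(m.start_resync(), m.nosync_regions)
```

With the two actions split like this, a transient failure could be cleared by the unblock step alone, and the resync could be triggered whenever the leg is known to be back.
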
Thanks, Malahal.