
[Linux-cluster] QLA2xxx tagged queue bug.



I'm documenting this in case anyone else gets bitten.

(This is supposed to have been fixed since October, but we encountered it in the last few days on RHEL 5.6. Either it's not fully fixed or the patch has fallen out of the production kernel.)

Over the last 2 years we kept seeing GFS and GFS2 filesystems mysteriously go dead with "input/output" errors. This has been traced to a bug in qla2xxx:

A QUEUE FULL or BUSY from the target results in a generic error being passed up to dm-multipath from the qla2xxx driver, instead of the driver backing off the queue size and trying again a few milliseconds later.

When dm-multipath receives an error, it marks the path to the target "bad" and tries another path. If the queue full condition doesn't clear quickly, there is a cascade of path failures, and once every path has failed the target itself is marked as bad.

If "queue_if_no_path" isn't explicitly enabled in /etc/multipath.conf, the failure of all paths is passed straight up to the filesystem, which causes the I/O error symptoms described above.
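As a rough way to check which case applies on a given node, here is a minimal Python sketch that scans multipath.conf text for the two settings that normally enable queueing: a features line containing queue_if_no_path, or no_path_retry set to queue. The function name and the sample text are hypothetical, and the check only looks at the file contents. It does not see built-in device defaults or the running configuration, so treat the result as a hint rather than proof.

```python
import re


def queues_when_no_path(conf_text):
    """Return True if the config text appears to enable queueing when all paths fail."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        # e.g.  features "1 queue_if_no_path"
        if re.match(r'features\s+".*\bqueue_if_no_path\b.*"', line):
            return True
        # e.g.  no_path_retry queue
        if re.match(r'no_path_retry\s+"?queue"?\s*$', line):
            return True
    return False


# Hypothetical sample: the setting is present but commented out.
sample = '''
defaults {
    # features "1 queue_if_no_path"
    user_friendly_names yes
}
'''
print(queues_when_no_path(sample))
```

To run it against a real node, pass in the contents of the file, for example queues_when_no_path(open("/etc/multipath.conf").read()).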

Even if the target's tagged queue recovers before all paths fail, there tends to be a big hiccup in GFS(2) operations.

If multipathing's queue_if_no_path is enabled and the OS has to wait for the target to return, there will be an even longer glitch.


Currently the only workaround available is to set the qla2xxx tagged queue depth to a very low value via module options.


qla2xxx's tagged queue depth is PER LUN, while most target tagged queues are PER DEVICE (e.g. a Nexsan Satabeast presenting 6 LUNs has a queue of 255 commands in total, not per LUN). It's pretty easy to end up with more requests coming out of the initiators than the targets can handle simultaneously.
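To put rough numbers on the mismatch, here is a minimal sketch of the arithmetic in Python. The 255-command queue and the 6 LUNs come from the Satabeast example above. The number of initiator nodes (4) and the per-LUN depth (32) are assumptions for illustration only, and the function names are hypothetical, so substitute your own figures.

```python
def worst_case_outstanding(initiators, luns, per_lun_depth):
    """Upper bound on commands the initiators could have in flight at once."""
    return initiators * luns * per_lun_depth


def max_safe_per_lun_depth(target_queue, initiators, luns):
    """Largest per-LUN depth that cannot overrun a per-device target queue."""
    return max(1, target_queue // (initiators * luns))


TARGET_QUEUE = 255   # from the example above: total for the whole device
LUNS = 6             # from the example above
INITIATORS = 4       # assumption: number of cluster nodes
PER_LUN_DEPTH = 32   # assumption: per-LUN queue depth on each node

total = worst_case_outstanding(INITIATORS, LUNS, PER_LUN_DEPTH)
safe = max_safe_per_lun_depth(TARGET_QUEUE, INITIATORS, LUNS)
print(f"worst case outstanding: {total} (target queue: {TARGET_QUEUE})")
print(f"largest per-LUN depth that fits: {safe}")
```

With these assumed figures the initiators could have 768 commands in flight against a 255-command queue, and the per-LUN depth would need to come down to 10 to stay inside it. The qla2xxx module parameter for the per-LUN depth is normally ql2xmaxqdepth, but verify that with modinfo qla2xxx on your own kernel before relying on it.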



