Roger Håkansson wrote:
Also, I've noticed that it's not only when a controller fails that this happens; when a failed controller is "revived" the same thing might happen. As far as I've been able to tell, the more I/O transactions at the time of the failure, the more likely that the (SCSI) device will be marked as "dead".
Hmmm.. I'm wondering if he's hitting the scenario in which the midlayer marks the sdev in an offline state, which could be the "dead" state. This occurs if an i/o hits the LLDD when the device is disconnected, and error recovery fails. If so, at a later time when the LLDD has connectivity and can access the device, the scsi layer would still likely bounce i/o. It requires manual intervention to change the sdev back to a running state; until then, any i/o requests by dm would be failed back by the midlayer.

What doesn't jibe is the rescan re-enabling the device. As I stated, restoring things is usually a manual action. If the rescans happen just prior to the transition to the offline state, they may be making dm change its path mappings to avoid i/o to the failed path, thus deflecting the sdev transition.

Can you report the contents of /sys/class/scsi_device/1:0:*/device/state at the following points, in both the working and non-working cases: while working; right after failover but before dm fails the path; after failure/success.

-- james
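
If it helps with collecting the state values requested above, here is a minimal Python sketch that reads every matching sysfs state file and prints its contents. The function name `read_sdev_states` is hypothetical, and the default glob assumes the devices of interest sit on host 1, channel 0, as in the path quoted above; adjust it for other setups. Running it once at each of the three points would give comparable snapshots.

```python
import glob


def read_sdev_states(pattern="/sys/class/scsi_device/1:0:*/device/state"):
    """Return {path: state} for each SCSI device state file matching pattern.

    The default pattern is an assumption taken from the path in the message
    (host 1, channel 0); pass a different glob for other hosts.
    """
    states = {}
    for path in sorted(glob.glob(pattern)):
        try:
            with open(path) as f:
                # sysfs attributes end in a newline, e.g. "running\n"
                states[path] = f.read().strip()
        except OSError as exc:
            # The device may disappear between the glob and the read.
            states[path] = f"<unreadable: {exc.strerror}>"
    return states


if __name__ == "__main__":
    for path, state in read_sdev_states().items():
        print(f"{path}: {state}")
```

A device that the midlayer has taken offline should show up here as `offline` rather than `running`. The manual restore mentioned above is typically done by writing `running` to the same state file as root, though that is worth confirming against the kernel version in use.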