
Re: [Linux-cluster] Options other than reboot to stop DP processes that can't be killed -9



Colin Simpson wrote:
Probably not a cluster issue, just a pure kernel question. It sounds like
the driver or device is locked up or confused, so the processes attached
to it are hung.

A common problem in a fabric environment is that there are two or more paths to the tapes (i.e., 2 HBAs on the server) and commands may take either path, which confuses the drives. Sending an unlock/reset command via the other path is usually sufficient to recover, but it's an extremely poorly documented area.

The most common case of this is tapes which refuse to eject. Lock commands are per source and ORed (the drive stays locked while any source still holds a lock), so unlock commands have to come from the same HBA(s) which issued the lock. I've added scripts to my Bacula tape handling routines to ensure this happens on our setup.
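
To make the "per source and ORed" behaviour concrete, here is a toy model in Python. It is only a sketch of the logic described above, not my Bacula scripts and not a real driver interface: the class, the function and the HBA names ("hba0", "hba1") are all made up for illustration.

```python
class TapeDrive:
    """Toy model of per-source media locks (hypothetical, no real I/O)."""

    def __init__(self):
        # One entry per source (HBA) that currently holds a lock.
        self._lock_holders = set()

    def lock(self, hba):
        """Record a lock issued through the given HBA."""
        self._lock_holders.add(hba)

    def unlock(self, hba):
        """Release only the lock that was issued through this HBA."""
        self._lock_holders.discard(hba)

    def is_locked(self):
        # Locks are ORed: one remaining holder keeps the drive locked.
        return bool(self._lock_holders)

    def eject(self):
        """Return True if the tape would eject, False if it is refused."""
        return not self.is_locked()


def release_all(drive, hbas):
    """Send an unlock through every path that may have issued a lock."""
    for hba in hbas:
        drive.unlock(hba)
```

With locks taken via both "hba0" and "hba1", unlocking through "hba0" alone still leaves eject() returning False. Only after release_all(drive, ["hba0", "hba1"]) does the eject go through, which is why the unlock has to be sent down every path that might have locked the drive.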

To be honest, I've had similar problems on pretty much all Unixes for
many years, and I've never found a good way out of it. Maybe not an
option with your case and application, but I guess this is why most
people run their backup systems on separate dedicated boxes, so they
can be rebooted without affecting production systems.

Strongly agree. There are a number of other good reasons for running dedicated backup systems, not least of which is the double-barrel difficulty of bootstrapping a restore of the backup system itself AND the dead cluster box in a worst-case scenario. It's a lot easier with separate boxes, as in most cases only one gets trashed, and you can reduce risk further by physically separating backups from operational servers.

A second good reason is the amount of IO a good tape backup solution can generate. LTO tapes easily outrun spinning media, so a spooling setup is needed to keep the drive streaming and avoid shoeshine issues.
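
As a rough illustration of the shoeshine point, the check below compares the rate a source can deliver with the slowest rate at which a drive keeps streaming. It is a sketch only: the function name and parameters are made up, and the real minimum streaming rate depends on the drive, so take it from the vendor's documentation.

```python
def will_shoeshine(source_mb_per_s, drive_min_stream_mb_per_s):
    """Return True if the source cannot keep the tape drive streaming.

    Both arguments are in MB/s. When the source delivers data more slowly
    than the drive's minimum streaming rate, the drive has to stop,
    reposition and restart repeatedly (shoeshining).
    """
    return source_mb_per_s < drive_min_stream_mb_per_s


def needs_spool(source_rates_mb_per_s, drive_min_stream_mb_per_s):
    """Return True if any source is too slow to write directly to tape."""
    return any(
        will_shoeshine(rate, drive_min_stream_mb_per_s)
        for rate in source_rates_mb_per_s
    )
```

With made-up numbers, will_shoeshine(40, 60) is True, so that client should go to a spool area first and be written to tape from there in one fast pass.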

All this stuff is best discussed on a list dedicated to backups. Discussions of this kind show up regularly and there are a number of canned answers at hand.

AB


