Diagnosing and Correcting Problems in a Cluster

To ensure that problems in a cluster can be diagnosed properly, event logging must be enabled. In addition, if problems arise in a cluster, set the severity level for the cluster daemons to debug. This logs descriptive messages that may help solve the problem. Once the issues have been resolved, reset the severity level to its default value of info to prevent excessively large log files from being generated.

If problems occur while running the cluadmin utility (for example, problems enabling a service), set the severity level for the clusvcmgrd daemon to debug. This will cause debugging messages to be displayed while running the cluadmin utility. See the Section called Modifying Cluster Event Logging for more information.

Use Table 8-5 to troubleshoot issues in a cluster.

Table 8-5. Diagnosing and Correcting Problems in a Cluster

Problem / Symptom / Solution

Problem: SCSI bus not terminated
Symptom: SCSI errors appear in the log file
Solution:

Each SCSI bus must be terminated only at the beginning and end of the bus. Depending on the bus configuration, it might be necessary to enable or disable termination in host bus adapters, RAID controllers, and storage enclosures. To support hot plugging, external termination is required to terminate a SCSI bus.
In addition, be sure that no devices are connected to a SCSI bus using a stub that is longer than 0.1 meter.
See the Section called Configuring Shared Disk Storage in Chapter 2 and the Section called SCSI Bus Termination in Appendix A for information about terminating different types of SCSI buses.

Problem: SCSI bus length greater than maximum limit
Symptom: SCSI errors appear in the log file
Solution:

Each type of SCSI bus must adhere to restrictions on length, as described in the Section called SCSI Bus Length in Appendix A.
In addition, ensure that no single-ended devices are connected to the LVD SCSI bus, because this will cause the entire bus to revert to a single-ended bus, which has more severe length restrictions than a differential bus.

Problem: SCSI identification numbers not unique
Symptom: SCSI errors appear in the log file
Solution: Each device on a SCSI bus must have a unique identification number. See the Section called SCSI Identification Numbers in Appendix A for more information.

Problem: SCSI commands timing out before completion
Symptom: SCSI errors appear in the log file
Solution:

The prioritized arbitration scheme on a SCSI bus can result in low-priority devices being locked out for some period of time. This may cause commands to time out if a low-priority storage device, such as a disk, cannot win arbitration and complete a command that a host has queued to it. For some workloads, this problem can be avoided by assigning low-priority SCSI identification numbers to the host bus adapters.
See the Section called SCSI Identification Numbers in Appendix A for more information.
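As a rough illustration of the arbitration order, the sketch below ranks SCSI identification numbers by priority. It assumes a wide bus with IDs 0 through 15 and the conventional SCSI ordering, in which ID 7 has the highest priority, followed by 6 down to 0, and then 15 down to 8. The function name scsi_priority_rank is hypothetical; consult Appendix A for the rules that apply to a specific bus.

```shell
# scsi_priority_rank ID
# Print the arbitration rank of a SCSI ID on a wide (16-ID) bus:
# rank 0 is the highest priority, rank 15 the lowest.
# Assumed ordering: 7, 6, ..., 0, 15, 14, ..., 8.
scsi_priority_rank() {
    id=$1
    if [ "$id" -le 7 ]; then
        echo $(( 7 - id ))
    else
        echo $(( 23 - id ))
    fi
}
```

For example, scsi_priority_rank 7 prints 0 (highest priority) and scsi_priority_rank 8 prints 15 (lowest priority).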

Problem: Mounted quorum partition
Symptom: Messages indicating checksum errors on a quorum partition appear in the log file
Solution:

Be sure that the quorum partition raw devices are used only for cluster state information. They cannot be used for cluster services or for non-cluster purposes, and cannot contain a file system. See the Section called Configuring Quorum Partitions in Chapter 2 for more information.
These messages could also indicate that the underlying block device special file for the quorum partition has been erroneously used for non-cluster purposes.

Problem: Service file system is unclean
Symptom: A disabled service cannot be enabled
Solution:

Manually run a checking program such as fsck. Then, enable the service.
Note that by default the cluster infrastructure runs fsck with the -p option to repair file system inconsistencies automatically. For particularly severe errors, it may be necessary to initiate file system repair manually.
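A manual check might look like the following sketch. It relies on the fsck exit status, a bit mask in which the value 4 means that errors were left uncorrected. The device name /dev/sdb1 and the function name fsck_needs_manual_repair are placeholders, and the file system must be unmounted before it is checked.

```shell
# fsck_needs_manual_repair STATUS
# Succeeds (returns 0) when the fsck exit status has bit 4 set,
# which fsck uses to report errors that were left uncorrected.
fsck_needs_manual_repair() {
    [ $(( $1 & 4 )) -ne 0 ]
}

# Example (placeholder device; run only on an unmounted file system):
#   fsck -p /dev/sdb1
#   if fsck_needs_manual_repair $?; then
#       echo "automatic repair was not sufficient; run fsck interactively"
#   fi
```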

Problem: Quorum partitions not set up correctly
Symptom: Messages indicating that a quorum partition cannot be accessed appear in the log file
Solution:

Run the cludiskutil -t command to check that the quorum partitions are accessible. If the command succeeds, run the cludiskutil -p command on both cluster systems. If the output is different on the systems, the quorum partitions do not point to the same devices on both systems. Check to make sure that the raw devices exist and are correctly specified in the /etc/sysconfig/rawdevices file. See the Section called Configuring Quorum Partitions in Chapter 2 for more information.
These messages could also indicate that yes was not chosen when prompted by the cluconfig utility to initialize the quorum partitions. To correct this problem, run the utility again.
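The comparison of cludiskutil -p output can be sketched as follows. This is a minimal example that assumes the output has been saved to a file on each cluster system and both files have been copied to one place. The function name compare_quorum_output and the file names are hypothetical.

```shell
# compare_quorum_output FILE1 FILE2
# Compare saved output of "cludiskutil -p" from the two cluster systems.
# Each file is assumed to have been created with: cludiskutil -p > FILE
compare_quorum_output() {
    if cmp -s "$1" "$2"; then
        echo "match"
    else
        echo "differ"
        # Show the differing lines to help locate the misconfigured device
        diff "$1" "$2"
        return 1
    fi
}

# Example (placeholder file names):
#   compare_quorum_output member1.out member2.out
```

If the function reports a difference, check the /etc/sysconfig/rawdevices file on both systems as described above.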

Problem: Cluster service operation fails
Symptom: Messages indicating the operation failed appear on the console or in the log file
Solution: There are many different reasons for the failure of a service operation (for example, a service stop or start). To help identify the cause of the problem, set the severity level for the cluster daemons to debug in order to log descriptive messages. Then, retry the operation and examine the log file. See the Section called Modifying Cluster Event Logging for more information.

Problem: Cluster service stop fails because a file system cannot be unmounted
Symptom: Messages indicating the operation failed appear on the console or in the log file
Solution:

Use the fuser and ps commands to identify the processes that are accessing the file system. Use the kill command to stop the processes. Use the lsof -t file_system command to display the identification numbers for the processes that are accessing the specified file system. If needed, pass these numbers to the kill command, for example with xargs or command substitution, because kill does not read process IDs from standard input.
To avoid this problem, be sure that only cluster-related processes can access shared storage data. In addition, modify the service and enable forced unmount for the file system. This enables the cluster service to unmount a file system even if it is being accessed by an application or user.
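The lsof and kill steps can be combined as in this sketch. The function name stop_fs_users and the mount point in the usage comment are hypothetical. The function sends the default TERM signal, so processes get a chance to exit cleanly.

```shell
# stop_fs_users MOUNT_POINT
# Send SIGTERM to every process that lsof reports as using the file system.
stop_fs_users() {
    # -t makes lsof print only process IDs, one per line.
    # lsof exits non-zero when it finds nothing, so ignore its status.
    pids=$(lsof -t "$1" 2>/dev/null) || true
    if [ -z "$pids" ]; then
        echo "no processes found"
        return 0
    fi
    echo "stopping:" $pids
    # $pids is intentionally unquoted so each PID becomes its own argument
    kill $pids
}

# Example (placeholder mount point):
#   stop_fs_users /mnt/service1
```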

Problem: Incorrect entry in the cluster database
Symptom: Cluster operation is impaired
Solution: The cluadmin utility can be used to examine and modify service configuration. Additionally, the cluconfig utility is used to modify cluster parameters.

Problem: Incorrect Ethernet heartbeat entry in the cluster database or /etc/hosts file
Symptom: Cluster status indicates that an Ethernet heartbeat channel is OFFLINE even though the interface is valid
Solution:

Examine and modify the cluster configuration by running the cluconfig utility, as specified in the Section called Modifying the Cluster Configuration, and correct the problem.
In addition, be sure to use the ping command to send a packet to all the network interfaces used in the cluster.

Problem: Loose cable connection to power switch
Symptom: Power switch status is Timeout
Solution: Check the serial cable connection.

Problem: Power switch serial port incorrectly specified in the cluster database
Symptom: Power switch status indicates a problem
Solution: Examine the current settings and modify the cluster configuration by running the cluconfig utility, as specified in the Section called Modifying the Cluster Configuration, and correct the problem.

Problem: Heartbeat channel problem
Symptom: Heartbeat channel status is OFFLINE
Solution:

Examine the current settings and modify the cluster configuration by running the cluconfig utility, as specified in the Section called Modifying the Cluster Configuration, and correct the problem.
Verify that the correct type of cable is used for each heartbeat channel connection.
Run the ping command against each cluster system over the network interface for each Ethernet heartbeat channel.
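The ping check can be scripted along these lines. This is a sketch only: the function name check_heartbeat and the host names in the usage comment are hypothetical, and the names passed in should be the ones assigned to the heartbeat interfaces in the /etc/hosts file.

```shell
# check_heartbeat HOST...
# Ping each heartbeat host name or address and report whether it answered.
check_heartbeat() {
    for host in "$@"; do
        # -c 3 sends three echo requests and then stops
        if ping -c 3 "$host" >/dev/null 2>&1; then
            echo "$host: reachable"
        else
            echo "$host: NO RESPONSE"
        fi
    done
}

# Example (placeholder host names):
#   check_heartbeat clu1-hb clu2-hb
```

Run the check from each cluster system, since a channel can fail in one direction only.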