[Linux-cluster] cluster failed after 53 hours
Patrick Caulfield
pcaulfie at redhat.com
Tue Jan 18 14:01:58 UTC 2005
On Mon, Jan 17, 2005 at 05:31:33PM -0800, Daniel McNeil wrote:
> My 3 node cluster ran tests for 53 hours before hitting a problem.
Attached is a patch to set the CMAN process to run at realtime priority. To be
honest, I'm not sure whether that's the right thing to do.
Neither am I sure whether your 48-53 hours is significant. It's possible that
memory is an issue. I'm only guessing, but GFS caches locks like crazy, so it may
be worth cutting this down a bit by tweaking
/proc/cluster/lock_dlm/drop_count and/or
/proc/cluster/lock_dlm/drop_period
Otherwise, the only way we're going to get to the bottom of this is to enable
"DEBUG_MEMB" in cman and see what it thinks is going on when the node is kicked
out of the cluster.
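
If it helps, here is a rough sketch of poking those two tunables from a small C
helper rather than by hand. It assumes each file holds a single decimal integer
and accepts a new one written as text, which I have not checked against the
lock_dlm source. read_tunable() and write_tunable() are made-up names for
illustration, and the sketch deliberately suggests no value to write.

```c
#include <stdio.h>

/* Hypothetical helper: read one integer from a tunable file.
 * Assumes the file contains a single decimal number.
 * Returns 0 on success, -1 on any failure. */
int read_tunable(const char *path, long *value)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;
    ok = (fscanf(f, "%ld", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Hypothetical helper: write one integer to a tunable file.
 * Returns 0 on success, -1 on any failure (including a failed flush). */
int write_tunable(const char *path, long value)
{
    FILE *f = fopen(path, "w");
    int ok;

    if (!f)
        return -1;
    ok = (fprintf(f, "%ld\n", value) > 0);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Example use (needs root, and the path must exist on the node):
 *
 *     long cur;
 *     if (read_tunable("/proc/cluster/lock_dlm/drop_count", &cur) == 0)
 *         write_tunable("/proc/cluster/lock_dlm/drop_count", cur / 2);
 *
 * Halving is only a placeholder, not a recommended setting.
 */
```

Reading the current value first and logging it makes it easy to put things back
if the change doesn't help.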
patrick
-------------- next part --------------
Index: cnxman.c
===================================================================
RCS file: /cvs/cluster/cluster/cman-kernel/src/cnxman.c,v
retrieving revision 1.45
diff -u -p -r1.45 cnxman.c
--- cnxman.c 17 Jan 2005 14:42:36 -0000 1.45
+++ cnxman.c 18 Jan 2005 10:49:50 -0000
@@ -63,6 +63,7 @@ static int is_valid_temp_nodeid(int node
extern int start_membership_services(pid_t);
extern int kcl_leave_cluster(int remove);
extern int send_kill(int nodeid, int needack);
+extern void cman_set_realtime(struct task_struct *tsk, int prio);
static struct proto_ops cl_proto_ops;
static struct sock *master_sock;
@@ -308,7 +309,7 @@ static int cluster_kthread(void *unused)
init_waitqueue_entry(&cnxman_waitq_head, current);
add_wait_queue(&cnxman_waitq, &cnxman_waitq_head);
- set_user_nice(current, -6);
+ cman_set_realtime(current, 1);
/* Allow the sockets to start receiving */
list_for_each(socklist, &socket_list) {
Index: membership.c
===================================================================
RCS file: /cvs/cluster/cluster/cman-kernel/src/membership.c,v
retrieving revision 1.47
diff -u -p -r1.47 membership.c
--- membership.c 13 Jan 2005 14:12:59 -0000 1.47
+++ membership.c 18 Jan 2005 10:49:50 -0000
@@ -201,6 +202,13 @@ static uint8_t *node_opinion = NULL;
#define OPINION_AGREE 1
#define OPINION_DISAGREE 2
+
+void cman_set_realtime(struct task_struct *tsk, int prio)
+{
+ tsk->policy = SCHED_FIFO;
+ tsk->rt_priority = prio;
+}
+
/* Set node id of a node, also add it to the members array and expand the array
* if necessary */
static inline void set_nodeid(struct cluster_node *node, int nodeid)
@@ -281,7 +289,7 @@ static int hello_kthread(void *unused)
hello_task = tsk;
up(&hello_task_lock);
- set_user_nice(current, -20);
+ cman_set_realtime(current, 1);
while (node_state != REJECTED && node_state != LEFT_CLUSTER) {
@@ -317,7 +325,7 @@ static int membership_kthread(void *unus
sigprocmask(SIG_BLOCK, &tmpsig, NULL);
membership_task = tsk;
- set_user_nice(current, -5);
+ cman_set_realtime(current, 1);
/* Open the socket */
if (init_membership_services())
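
For comparison only, and not part of the patch: a userspace process would ask
for the same SCHED_FIFO policy through sched_setscheduler(2) rather than by
writing the task fields directly. The sketch below assumes a Linux userspace
program; fifo_priority_valid() and set_self_realtime() are made-up names. It
says nothing about how the kernel threads above should be handled.

```c
#include <errno.h>
#include <sched.h>
#include <string.h>

/* Hypothetical helper: return 1 if prio lies within the range the
 * system reports for SCHED_FIFO, 0 otherwise. */
int fifo_priority_valid(int prio)
{
    int lo = sched_get_priority_min(SCHED_FIFO);
    int hi = sched_get_priority_max(SCHED_FIFO);

    return lo != -1 && hi != -1 && prio >= lo && prio <= hi;
}

/* Hypothetical helper: switch the calling process to SCHED_FIFO at
 * the given priority. Returns 0 on success, -1 with errno set on
 * failure (EINVAL for an out-of-range priority, EPERM without
 * sufficient privilege). */
int set_self_realtime(int prio)
{
    struct sched_param sp;

    if (!fifo_priority_valid(prio)) {
        errno = EINVAL;
        return -1;
    }
    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = prio;
    return sched_setscheduler(0, SCHED_FIFO, &sp); /* pid 0 = caller */
}
```

Calling set_self_realtime(1) mirrors the priority used in the patch. It normally
needs root or CAP_SYS_NICE, and a SCHED_FIFO task that never blocks can starve
everything else on that CPU.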