[Linux-cluster] cluster failed after 53 hours

Patrick Caulfield pcaulfie at redhat.com
Tue Jan 18 14:01:58 UTC 2005


On Mon, Jan 17, 2005 at 05:31:33PM -0800, Daniel McNeil wrote:
> My 3 node cluster ran tests for 53 hours before hitting a problem.

Attached is a patch to set the CMAN kernel threads to run at realtime priority.
To be honest, I'm not sure whether that's the right thing to do.

Nor am I sure whether your 48-53 hours is significant. It's possible that
memory is an issue. I'm only guessing, but GFS caches locks like crazy, so it
may be worth cutting this down a bit by tweaking

/proc/cluster/lock_dlm/drop_count    and/or
/proc/cluster/lock_dlm/drop_period

Otherwise, the only way we're going to get to the bottom of this is to enable
"DEBUG_MEMB" in cman and see what it thinks is going on when the node is kicked
out of the cluster.
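
For anyone who would rather script the drop_count/drop_period tweak than echo
values by hand, here is a minimal userspace sketch. It assumes the two /proc
files accept a plain decimal integer followed by a newline. The helper names
write_tunable() and read_tunable() are made up for illustration and are not
part of cman or lock_dlm.

```c
#include <stdio.h>

/* Hypothetical helper: write a decimal value to a /proc-style tunable.
 * Assumes the file takes a plain integer. Returns 0 on success, -1 on error. */
int write_tunable(const char *path, long value)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fprintf(f, "%ld\n", value) > 0;

    /* fclose() flushes the buffer, so a rejected write shows up here. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}

/* Hypothetical helper: read the current value back so the change can be
 * verified. Returns 0 on success, -1 on error. */
int read_tunable(const char *path, long *value)
{
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;

    int ok = fscanf(f, "%ld", value) == 1;
    fclose(f);

    return ok ? 0 : -1;
}
```

A caller would run as root and do something like
write_tunable("/proc/cluster/lock_dlm/drop_count", n), then read the value
back to confirm it took. What n should be is left open here; it would need
tuning against the memory actually available on the node.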


patrick
-------------- next part --------------
Index: cnxman.c
===================================================================
RCS file: /cvs/cluster/cluster/cman-kernel/src/cnxman.c,v
retrieving revision 1.45
diff -u -p -r1.45 cnxman.c
--- cnxman.c	17 Jan 2005 14:42:36 -0000	1.45
+++ cnxman.c	18 Jan 2005 10:49:50 -0000
@@ -63,6 +63,7 @@ static int is_valid_temp_nodeid(int node
 extern int start_membership_services(pid_t);
 extern int kcl_leave_cluster(int remove);
 extern int send_kill(int nodeid, int needack);
+extern void cman_set_realtime(struct task_struct *tsk, int prio);
 
 static struct proto_ops cl_proto_ops;
 static struct sock *master_sock;
@@ -308,7 +309,7 @@ static int cluster_kthread(void *unused)
 	init_waitqueue_entry(&cnxman_waitq_head, current);
 	add_wait_queue(&cnxman_waitq, &cnxman_waitq_head);
 
-	set_user_nice(current, -6);
+	cman_set_realtime(current, 1);
 
 	/* Allow the sockets to start receiving */
 	list_for_each(socklist, &socket_list) {
Index: membership.c
===================================================================
RCS file: /cvs/cluster/cluster/cman-kernel/src/membership.c,v
retrieving revision 1.47
diff -u -p -r1.47 membership.c
--- membership.c	13 Jan 2005 14:12:59 -0000	1.47
+++ membership.c	18 Jan 2005 10:49:50 -0000
@@ -201,6 +202,13 @@ static uint8_t *node_opinion = NULL;
 #define OPINION_AGREE    1
 #define OPINION_DISAGREE 2
 
+
+void cman_set_realtime(struct task_struct *tsk, int prio)
+{
+        tsk->policy = SCHED_FIFO;
+        tsk->rt_priority = prio;
+}
+
 /* Set node id of a node, also add it to the members array and expand the array
  * if necessary */
 static inline void set_nodeid(struct cluster_node *node, int nodeid)
@@ -281,7 +289,7 @@ static int hello_kthread(void *unused)
 	hello_task = tsk;
 	up(&hello_task_lock);
 
-	set_user_nice(current, -20);
+	cman_set_realtime(current, 1);
 
 	while (node_state != REJECTED && node_state != LEFT_CLUSTER) {
 
@@ -317,7 +325,7 @@ static int membership_kthread(void *unus
 	sigprocmask(SIG_BLOCK, &tmpsig, NULL);
 
 	membership_task = tsk;
-	set_user_nice(current, -5);
+	cman_set_realtime(current, 1);
 
 	/* Open the socket */
 	if (init_membership_services())

