[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

[Linux-cluster] How to decrease DLM priority during log scanning


This is a follow-up to my previous mail, "Node fenced when mounting gfs", in which mounting a particular GFS volume containing lots of files causes a node (node2) to appear hung, and thus to be fenced by the other node (node1).

Searching the archive I found a relevant thread, "node kicked out of cluster", in which Patrick Caulfield comments:
"DLM can hog the CPU when recovering huge numbers of locks, so we are looking into placing some strategic
'schedule()' calls in the recovery process."

This seems to be the case in my problem, since top shows nearly 100% system time. BTW, my system is a dual Xeon box.

In another mail thread, "Configuring CMAN timer/timeout values", I found a possible workaround: modifying the values under /proc/cluster/config/cman/. Increasing deadnode_timeout (I tried 2100) prevents node2 from being fenced, but then node1's performance drops significantly even when its CPU load is very low (e.g. other servers mounting NFS from node2 keep getting NFS timeout errors).

Am I right to assume that GFS requires locking on both nodes during writes? If so, this makes sense, since node2 is too busy "scanning log elements" to respond to anything. After over 30 minutes node2 still hadn't finished "scanning log elements", so I changed /proc/cluster/config/cman/deadnode_timeout on node1 back to its default value (21), and node2 got fenced automatically.
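For reference, the workaround I tried looks like this (paths as documented in that thread; run as root; values are in seconds):

```shell
# Check the current CMAN dead-node timeout
cat /proc/cluster/config/cman/deadnode_timeout

# Raise it so the busy node is not declared dead while recovering
echo 2100 > /proc/cluster/config/cman/deadnode_timeout

# Restore the default afterwards
echo 21 > /proc/cluster/config/cman/deadnode_timeout
```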

So the questions are:
- is it normal for "scanning log elements" to take over 30 minutes?
- is there a way to make "scanning log elements" run at a lower priority (e.g. lowering the priority of the DLM while it is recovering locks)?
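In case it helps frame the second question: I had hoped something along these lines would work, assuming the recovery work runs in a reniceable kernel thread (I am guessing at the dlm_recoverd thread name here, so please correct me if the recovery happens elsewhere):

```shell
# List DLM-related kernel threads to find the one doing recovery
ps ax | grep dlm

# Hypothetical: lower the scheduling priority of the recovery thread
# (assumes it is named dlm_recoverd and runs at normal, not realtime, priority)
renice 19 -p $(pidof dlm_recoverd)
```

But I don't know whether the recovery path even runs in a schedulable thread context, which is presumably why the strategic schedule() calls Patrick mentioned are needed instead.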


