[Linux-cluster] dlm and IO speed problem <er, might wanna get a coffee first ; )>

Tue Apr 8 14:37:58 UTC 2008

gordan at bobich.net wrote:
>
>
>> my setup:
>> 6 rh4.5 nodes, gfs1 v6.1, behind redundant LVS directors. I know it's
>> not new stuff, but corporate standards dictated the rev of rhat.
> [...]
>> I'm noticing huge differences in compile times - or any home file access
>> really - when doing stuff in the same home directory on the gfs on
>> different nodes. For instance, the same compile on one node is ~12
>> minutes - on another it's 18 minutes or more (not running concurrently).
>> I'm also seeing weird random pauses in writes, like saving a file in vi,
>> what would normally take less than a second, may take up to 10 seconds.
>>
>> * From reading, I see that the first node to access a directory will be
>> the lock master for that directory. How long is that node the master? If
>> the user is no longer 'on' that node, is it still the master? If
>> continued accesses are remote, will the master state migrate to the node
>> that is primarily accessing it? I've set LVS persistence for ssh and
>> telnet for 5 minutes, to allow multiple xterms fired up in a script to
>> land on the same node, but new ones later will land on a different node
>> - by design really. Do I need to make this persistence way longer to
>> keep people only on the first node they hit? That kind of horks my load
>> balancing design if so. How can I see which node is master for which
>> directories? Is there a table I can read somehow?
>>
>> * I've bumped the wake times for gfs_scand and gfs_inoded to 30 secs, I
>> mount noatime,noquota,nodiratime, and David Teigland recommended I set
>> dlm_dropcount to '0' today on irc, which I did, and I see an improvement
>> in speed on the node that appears to be master: a 'find', for example,
>> is faster on the second and subsequent runs if I restart it
>> immediately. On the other nodes the speed is awful - worse than nfs
>> would be. On the first run of a find, or if I wait >10 seconds to start
>> another run after the last run completes, the run is far slower than
>> the same command on a standalone box with ext3.
>> e.g. <9 secs on the standalone, compared to 46 secs on the cluster - on
>> a different node it can take over 2 minutes! Yet an immediate re-run on
>> the cluster, on what I think must be the master, is sub-second. How can I
>> speed up the first access, and how can I keep later runs as fast as an
>> immediate re-run? I've got a ton of memory - I just do not know which
>> knobs to turn.
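
One way to put numbers on the first-run versus re-run gap described above is
to time the same directory walk twice on each node, with and without a pause
in between. What follows is only a rough sketch: the names time_walk and
compare_runs are made up for illustration and are not part of any GFS or DLM
tool, and a stat of every entry is assumed to be a fair stand-in for what
'find' does.

```python
import os
import sys
import time


def time_walk(root):
    """Walk root, stat every entry, and return (entries_seen, seconds)."""
    start = time.monotonic()
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            count += 1
    return count, time.monotonic() - start


def compare_runs(root, pause=0.0):
    """Time two walks of root, sleeping `pause` seconds between them."""
    first = time_walk(root)
    time.sleep(pause)
    second = time_walk(root)
    return first, second


if __name__ == "__main__":
    # Hypothetical usage: script.py <directory> [pause_seconds]
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    pause = float(sys.argv[2]) if len(sys.argv) > 2 else 0.0
    (n1, t1), (n2, t2) = compare_runs(root, pause)
    print("first walk:  %d entries in %.2fs" % (n1, t1))
    print("second walk: %d entries in %.2fs" % (n2, t2))
```

Running it against the same home directory on each node, once with no pause
and once with a pause longer than 10 seconds, should show whether the
slowdown follows the node, the pause, or both.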
>
> It sounds like bumping up lock trimming might help, but I don't think 
> the feature accessibility through /sys has been back-ported to RHEL4, 
> so if you're stuck with RHEL4, you may have to rebuild the latest 
> versions of the tools and kernel modules from RHEL5, or you're out of 
> luck.

The glock trimming patch was mostly written and tuned on top of RHEL 4. It
doesn't use the /sys interface. The original patch was field tested at
several customer production sites. When it was checked into CVS for RHEL 4.5,
it was revised to use a less aggressive approach, which turned out to be
less effective than the original. So the original patch was re-checked
into RHEL 4.6.

I wrote the patch.

-- Wendy