
Re: [Linux-cluster] GFS (1 & partially 2) performance problems



Multipathing has a round-robin and a failover scheduler, which can be
configured in /etc/multipath.conf.

The path_selector value only seems to support round-robin:
http://storagefoo.blogspot.com/2006/08/linux-native-multipathing-device.html


Maybe this helps:
		#
		# name    : path_grouping_policy
		# scope   : multipath
		# desc    : path grouping policy to apply to this multipath
		# values  : failover, multibus, group_by_serial
		# default : failover
		#
		path_grouping_policy	multibus

Specifies the default path grouping policy to apply to unspecified
multipaths. Possible values include:
failover = 1 path per priority group
multibus = all valid paths in 1 priority group
group_by_serial = 1 priority group per detected serial number
group_by_prio = 1 priority group per path priority value
group_by_node_name = 1 priority group per target node name
The default value is failover. 
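
To make the option above concrete, here is a minimal sketch (not taken from the thread) of how the setting could sit in a defaults section. The helper name write_multipath_sketch and the scratch-file approach are assumptions for illustration only. The function writes the fragment to a file of your choosing so it can be reviewed and merged into /etc/multipath.conf by hand.

```shell
#!/bin/sh
# write_multipath_sketch FILE
# Hypothetical helper: writes an example fragment to FILE for review.
# It deliberately does not touch /etc/multipath.conf itself.
write_multipath_sketch() {
    cat > "$1" <<'EOF'
# Sketch only: review, then merge by hand into /etc/multipath.conf.
defaults {
    # multibus = all valid paths in 1 priority group
    path_grouping_policy    multibus
}
EOF
}
```

For example, "write_multipath_sketch /tmp/multipath-sketch.conf" (the path is arbitrary). With multibus all valid paths share one priority group, whereas the default failover policy keeps one path per group.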

 

Regards,

Kit

-----Original Message-----
From: linux-cluster-bounces redhat com
[mailto:linux-cluster-bounces redhat com] On Behalf Of Michael Lackner
Sent: Wednesday, 16 June 2010 15:54
To: linux clustering
Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems

Hello!

Ok, I got the results. It seems that the scheduler can only be set for real,
physical block devices (not multipath devices), which should be ok I assume.
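
A minimal sketch of switching the scheduler on each path device, assuming the four paths are sda to sdd as mentioned further down the thread. The function names active_scheduler and set_scheduler, and the idea of passing the sysfs root as a parameter, are made up for illustration. The kernel expects just the scheduler name to be written; the brackets in the file only mark which one is currently active. A loop is used because a single redirection to a glob that matches several files generally does not work.

```shell
#!/bin/sh
# active_scheduler FILE
# Print the bracketed (active) entry from a line such as
# "noop anticipatory deadline [cfq]".
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)\].*/\1/p' "$1"
}

# set_scheduler NAME SYSFS_ROOT DEV...
# Write NAME into SYSFS_ROOT/DEV/queue/scheduler for every DEV given.
# Devices without a writable scheduler file are skipped with a warning.
set_scheduler() {
    sched=$1
    root=$2
    shift 2
    for dev in "$@"; do
        f="$root/$dev/queue/scheduler"
        if [ ! -w "$f" ]; then
            echo "skipping $dev: $f is not writable" >&2
            continue
        fi
        echo "$sched" > "$f"
    done
}
```

As root this could be called as "set_scheduler noop /sys/block sda sdb sdc sdd", followed by "active_scheduler /sys/block/sda/queue/scheduler" to check the result. The change takes effect immediately but is not persistent across reboots.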

For curiosity's sake I tested all four schedulers for the dd read with a 1MB
blocksize.
Here are the results, both per node and in total over all three
nodes. Numbers are in MB/sec again, sorted from slowest to fastest:

cfq: 15.8 / 15.8 / 15.2 (=46.8MB/s total)
noop: 24.3 / 24.1 / 24.3 (=72.7MB/s total)
deadline: 24.6 / 24.5 / 24.2 (=73.3MB/s total)
anticipatory: 24.9 / 24.8 / 24.5 (=74.2MB/s total)

Before/after each test, I did flush write caches ("sync") and purge all I/O
caches ("echo 3 > /proc/sys/vm/drop_caches") to get results unaffected by
caching.
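
The procedure described above can be sketched roughly as follows. The function name run_read_test, the default count and the optional third argument are assumptions for illustration; only "sync", the drop_caches write and a dd read with 1MB blocks come from the description. Writing to /proc/sys/vm/drop_caches requires root.

```shell
#!/bin/sh
# run_read_test SRC [COUNT] [DROP_CACHES_FILE]
# Flush dirty data, drop the page cache, then read COUNT blocks of 1MB
# sequentially from SRC and discard them. dd reports throughput on stderr.
run_read_test() {
    src=$1
    count=${2:-1024}                       # number of 1MB blocks (assumed default)
    drop=${3:-/proc/sys/vm/drop_caches}    # overridable so the sketch can be tried unprivileged

    sync                                   # flush write caches
    if [ -w "$drop" ]; then
        echo 3 > "$drop"                   # purge page cache, dentries and inodes
    else
        echo "warning: cannot write $drop, caches not dropped" >&2
    fi

    dd if="$src" of=/dev/null bs=1M count="$count"
}
```

For the raw multipath device this might be run as "run_read_test /dev/dm-0 1024" on each node at the same time, with the per-node figures added up afterwards.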

So it seems "anticipatory" scheduler wins for sequential reads, closely
followed by "deadline" and "noop". The only one that seems to really suck is
the default one, "cfq". I did not do any write tests so far with the
different schedulers, nor any random I/O tests. Also no single-node tests
this time (no more time today).

While this shows some significant improvement for this specific workload,
it's definitely still far below our expectations...

I will also check for the impact of the schedulers on sequential writes and
random I/O as soon as I've figured out how to run some good random I/O
tests.

In the meantime, I would be happy to listen to any additional suggestions to
further improve performance.

Thanks!

Jankowski, Chris wrote:
> Michael,
>
> I do not know the process for setting this up in a multipathing
> configuration, but the scheduler to test is the noop scheduler.
>
> Please let us know what it yields.
>
> Regards,
>
> Chris
>
> -----Original Message-----
> From: linux-cluster-bounces redhat com 
> [mailto:linux-cluster-bounces redhat com] On Behalf Of Michael Lackner
> Sent: Wednesday, 16 June 2010 17:50
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance 
> problems
>
> Chris,
>
> Can do. Which one shall I try? I got these four to choose from:
>
> * noop
> * anticipatory
> * deadline
> * cfq
>
> One more thing: because of the Fibrechannel storage I am using
> multipathing, and I cannot set the scheduler for the multipath device
> (/dev/dm-0), because "/sys/block/dm-0/queue/scheduler" doesn't exist. I
> actually have four paths to the storage that I can see as "/dev/sda",
> "/dev/sdb", "/dev/sdc" and "/dev/sdd".
>
> I guess it's ok if I change the scheduler for those four? Is it ok to just
> run a command similar to the one below, and will this change the scheduler
> on the fly?
>
> "echo noop > /sys/block/sd*/queue/scheduler"
>
> Because at the moment, the scheduler files for each block device contain this
> line:
>
> "noop anticipatory deadline [cfq]"
>
> Maybe I would have to do something like "echo [noop] anticipatory 
> deadline cfq > /sys/block/sd*/queue/scheduler"
> instead?
>
> Thanks for the help.
>
> Jankowski, Chris wrote:
>> Michael,
>>
>> Would you be willing to repeat the tests with a large block size and a
>> different I/O scheduler? Specifically, there is a scheduler that is actually
>> a null scheduler.
>>
>> I think that I saw cases where the cfq I/O scheduler was not working all
>> that well on single streams.
>>
>> Thanks and regards,
>>
>> Chris
>>
>> -----Original Message-----
>> From: linux-cluster-bounces redhat com 
>> [mailto:linux-cluster-bounces redhat com] On Behalf Of Michael 
>> Lackner
>> Sent: Tuesday, 15 June 2010 22:04
>> To: linux clustering
>> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance 
>> problems
>>
>> Hello!
>>
>> I tried to do R/W tests comparing 4kB blocksize to 1MB blocksize now, and
>> the difference in performance was negligible. Also, GFS2 was almost on the
>> same speed level when compared to GFS1 for Reads (see below why..). I/O
>> scheduler is "cfq" by the way. I never really cared about the I/O scheduler
>> since I do not yet understand the differences between the available ones
>> anyway.
>>
>> But, I found out something else. As suggested by Steven in his reply, I
>> ran tests both on the GFS1/2 filesystems, and also on the raw blockdevice,
>> and surprisingly the results were almost the same!
>>
>> So: GFS1 as well as GFS2 3-Node concurrent, sequential Reads showed a
>> total of 40MB/s (GFS1) and 45MB/s (GFS2) using a blocksize of 1MB. For
>> single-node sequential read the performance went up to a nice 180-190MB/s
>> for both FS versions.
>>
>> Now, the surprising part: Doing a dd read on the raw blockdevice with 3
>> nodes showed a total of only ~60MB/s!! Almost as low as reading from GFS1/2
>> with multiple nodes at the same time!! When reading the raw blockdevice on a
>> single node, I got slightly over 190MB/s again.
>>
>> So, this concurrent read issue seems not to be a GFS1 or GFS2 problem,
>> but more a problem of the underlying storage. This is extremely surprising
>> and a bit shocking I must say.
>>
>> I guess for the Reads I will need to check the SAN itself, see if I can
>> do any optimization on it.. That thing can't possibly be that bad when it
>> comes to reading..
>>
>> Thanks a lot for your ideas so far!
>>
>> Jankowski, Chris wrote:
>>> Michael,
>>>
>>> For comparison, could you do your dd(1) tests with a very large block
>>> size (1 MB) and tell us the results, please?
>>>
>>> I have a vague hunch that the problem may have something to do with
>>> whether or not IO operations are coalesced.
>>>
>>> Also, which IO scheduler are you using?
>>>
>>> Thanks and regards,
>>>
>>> Chris Jankowski
>>>
>>>
>>> -----Original Message-----
>>> From: linux-cluster-bounces redhat com 
>>> [mailto:linux-cluster-bounces redhat com] On Behalf Of Michael 
>>> Lackner
>>> Sent: Tuesday, 15 June 2010 00:22
>>> To: linux clustering
>>> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance 
>>> problems
>>>
>>> Hello!
>>>
>>> Thanks for your reply. Stupidly, I forgot to mention HOW I was
>>> actually testing.
>>>
>>> I tested with dd, doing 4kB blocksize reads and writes, 160GB total
>>> testfile size per node.
>>> I read from /dev/zero for writing tests and wrote to /dev/null for
>>> reading tests. So, totally sequential, somewhat small blocksize (equal to
>>> filesystem BS).
>>>
>>> The performance was measured directly on the Fibrechannel Switch, which
>>> offers nice per-port monitoring for that purpose.
>>>
>>> I have yet to do some serious read testing on GFS2. I have aborted my
>>> GFS2 tests as write performance was not up to GFS1 to begin with. My older
>>> GFS2 benchmarks (I did this with a 2-node configuration before) are lost, I
>>> will need to re-do them to give you some numbers.
>>>
>>> After each write test I did a "sync" to flush everything to disks. I
>>> did not do this before or after read tests though..
>>>
>>> As you mentioned Journal Size, "gfs_tool counters <mountpoint>" said
>>> that only 2-3% logspace were in use after the tests (I guess this is the
>>> per-node fs journal?).
>>>
>>> As for the direct I/O tests, by that you mean testing without ANY
>>> caching going on, a synchronous write? What I did before was test
>>> EXT3 (~190MB/s) and XFS (~320MB/s) on the Storage Array. I think what I'm
>>> getting here is raw throughput, since I am not monitoring in the OS, but at
>>> the Fibrechannel Switch itself..
>>>
>>> I will do GFS2 read tests similar to those conducted for GFS1. I'll be
>>> able to do that tomorrow morning, then I can post the numbers here.
>>>
>>> Thanks!
>>>
>>> Steven Whitehouse wrote:
>>>> Hi,
>>>>
>>>> On Mon, 2010-06-14 at 14:00 +0200, Michael Lackner wrote:
>>>>> Hello!
>>>>>
>>>>> I am currently building a Cluster sitting on CentOS 5 for GFS usage.
>>>>>
>>>>> At the moment, the storage subsystem consists of an HP MSA2312
>>>>> Fibrechannel SAN linked to an FC 8gbit switch. Three client
>>>>> machines are connected to that switch over 8gbit FC. The disks
>>>>> themselves are 12 * 15,000rpm SAS configured in RAID-5 with two
>>>>> hotspares.
>>>>>
>>>>> Now, the whole storage shall be shared (single filesystem); this is
>>>>> where GFS comes in.
>>>>>
>>>>> The Cluster is only 3 nodes large at the moment, more nodes will
>>>>> be added later on. I am currently testing GFS1 and GFS2 for
>>>>> performance.
>>>>> Lock Management is done over single 1Gbit Ethernet Links (1 per
>>>>> machine).
>>>>>
>>>>> Thing is, with GFS1 I get far better performance than with the
>>>>> newer GFS2 across the board, with a few tunable parameters set, for
>>>>> writes GFS1 is roughly twice as fast.
>>>>>
>>>> What tests are you running? GFS2 is generally faster than GFS1
>>>> except for streaming writes, which is an area that we are putting
>>>> some effort into solving currently. Small writes (one fs block (4k
>>>> default) or less) on GFS2 are much faster than on GFS1.
>>>>
>>>>> But, concurrent reads are totally abysmal. The total write
>>>>> performance (all nodes combined) sits around 280-330Mbyte/sec,
>>>>> whereas the READ performance is as low as 30-40Mbyte/sec when
>>>>> doing concurrent reads. Surprisingly, single-node read is somewhat
>>>>> ok at 180Mbyte/sec, but as soon as several nodes are reading from
>>>>> GFS (version 1 at the moment) at the same time, things turn ugly.
>>>>>
>>>> Reads on GFS2 should be much faster than GFS1, so it sounds as if 
>>>> something isn't working correctly for some reason. For cached data, 
>>>> reads on GFS2 should be as fast as ext2/3 since the code path is 
>>>> identical (to the page cache) and only changes if pages are not cached.
>>>> GFS1 does its locking at a higher level, so there will be more 
>>>> overhead for cached reads in general.
>>>>
>>>> If you are preparing the test files for reading all from one node
>>>> (or even just a different node to that on which you are running
>>>> the read tests), do make sure that you sync them to disk on that
>>>> node before starting the tests, to avoid issues with caching.
>>>>
>>>>> This is strange, because for writes, global performance across the
>>>>> cluster increases slightly when adding more nodes. But for reads,
>>>>> the opposite seems to be true.
>>>>>
>>>>> For read and write tests, separate testfiles were created and read
>>>>> for each node, with each testfile sitting in its own subdirectory,
>>>>> so no node would access another node's file.
>>>>>
>>>> That sounds like a good test set up to me.
>>>>
>>>>> GFS1 created with the following mkfs.gfs parameters:
>>>>> "-b 4096 -J 128 -j 16 -r 2048 -p lock_dlm"
>>>>> (4kB blocksize, 16 * 128MB journals, 2GB resource groups,
>>>>> Distributed LockManager)
>>>>>
>>>>> Mount Options set: "noatime,nodiratime,noquota"
>>>>>
>>>>> Tunables set: "glock_purge 50, statfs_slots 128, statfs_fast 1, 
>>>>> demote_secs 20"
>>>>>
>>>> You shouldn't normally need to set the glock_purge and demote_secs
>>>> to anything other than the default. These settings no longer exist
>>>> in GFS2 since it makes use of the shrinker subsystem provided by the
>>>> VM and is auto-tuning. If your workload is metadata heavy, you
>>>> could try boosting the journal size and/or the incore_log_blocks
>>>> tunable.
>>>>
>>>>> Also, in /etc/cluster/cluster.conf, I added this:
>>>>> <dlm plock_ownership="1" plock_rate_limit="0"/>
>>>>> <gfs_controld plock_rate_limit="0"/>
>>>>>
>>>>> Any ideas on how to figure out what's going wrong, and how to tune
>>>>> GFS1 for better concurrent read performance, or tune GFS2 in 
>>>>> general to be competitive/better than GFS1?
>>>>>
>>>>> I'm dreaming about 300MB/sec read, 300MB/sec write sequentially 
>>>>> and somewhat good reaction times while under heavy sequential 
>>>>> and/or random load. But for now, I just wanna get the seq reading 
>>>>> to work acceptably fast.
>>>>>
>>>>> Thanks a lot for your help!
>>>>>
>>>> Can you try doing some I/O direct to the block device so that we 
>>>> can get an idea of what the raw device can manage? Using dd both 
>>>> read and write, across the nodes (different disk locations on each 
>>>> node to simulate different files).
>>>>
>>>> I'm wondering if the problem might be due to the seek pattern
>>>> generated by the multiple read locations.
>>>>
>>>> Steve.
> --
> Michael Lackner
> Chair of Information Technology, University of Leoben IT Administration
> michael lackner mu-leoben at | +43 (0)3842/402-1505
>
> --
> Linux-cluster mailing list
> Linux-cluster redhat com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>


--
Michael Lackner
Chair of Information Technology, University of Leoben IT Administration
michael lackner mu-leoben at | +43 (0)3842/402-1505


