[Linux-cluster] Re: [Iscsitarget-devel] Problem in clvmd and iscsi-target

Nuno Fernandes npf-mlists at eurotux.com
Fri Apr 4 15:04:11 UTC 2008


Hello,

First of all I would like to thank you for your patience...

> There is a lot of confusion by newcomers to iSCSI storage.
>
> A lot of the time they think of iSCSI as yet another
> file sharing method, which it isn't, it is a disk sharing
> method, and if you allow 2 hosts to access the same disk
> without putting special controls in place to make sure
> that either 1) only 1 host at a time can access a given
> disk, or 2) install a clustering file system that allows
> multiple hosts to access the same disk at the same time,
> then they will experience data corruption as there is
> nothing preventing any two hosts from writing data on
> top of each other.
I understand. iSCSI has nothing to do with files or filesystems; iSCSI (and 
SCSI, for that matter) only works with blocks. If you have several machines 
accessing the same filesystem and that filesystem is not cluster-aware, 
you'll get lots of corruption.

> The performance penalty you speak of with blockio being accessed
> through a local iSCSI connection should really not be noticed
> except for extreme high-end processing, which if that is the
> case you are picking the wrong technology.
We have a bladecenter with FC storage for that :)
What we are trying to do is remove "unnecessary" load from the msa500-connected 
machines, as they will also be used to run virtual machines.

> When you mount an iscsi target locally the open-iscsi initiator
> does aggressive caching of io, then the file system of the OS
> does aggressive caching itself, so it's not as if all io becomes
> synchronous in this scenario.
You are correct, but that also happens with two open-iscsi initiators accessing 
the same exported volume from different machines. The only difference is that 
instead of the msa500 volume being exported directly by iscsi-target, there 
is a middle layer (device-mapper) between the msa500 volume and the iscsi-target.
Device-mapper does not do any caching. When we do an fsync in a guest machine it 
goes:

virtual machine fsync -> clvmd/lvm -> iscsi-initiator -> iscsi-target -> 
device-mapper -> msa500

When the virtual machine is running on the msa500-connected hardware we get:

virtual machine fsync -> clvmd/lvm -> device-mapper (linear) -> msa500
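
(That device-mapper step is just a linear remap, so it adds no caching of its 
own. A made-up example of what such a table looks like, with the LV name, 
device numbers and offsets purely illustrative:

  # dmsetup table vg_msa-vm01
  0 20971520 linear 253:2 2048

i.e. the whole logical volume maps straight onto an offset of the underlying 
msa500 device, nothing more.)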

> Now you can use clvm between the iSCSI targets to manage
> how the MSA500 storage is allocated for the creation of
> iSCSI targets, but once exported by iSCSI, these servers
> should not care about what the initiators put into it
> or how they manage it.
That would require us to be changing the iscsi-target and initiator 
configurations all the time, as well as re-running iSCSI discovery and 
reconfiguring multipath on all the iscsi-initiator machines.
When we create a volume for a virtual machine we would have to (sketched 
below with made-up names):

1 - create the volume with clvmd on the machines that manage the storage
2 - change ietd.conf so the new volume gets exported
3 - discover the new device on the initiators
4 - change multipath on the initiators to include the new volume
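
Each new volume would mean something like the following on both sides; the 
volume group, LV name, size, IQN and portal address are all made up for 
illustration:

  # on the iscsi-target machines: carve the volume out of the shared VG
  lvcreate -n vm01 -L 20G vg_msa

  # ietd.conf: export it (then reload/restart ietd)
  Target iqn.2008-04.com.example:vm01
      Lun 0 Path=/dev/vg_msa/vm01,Type=blockio

  # on every initiator: rediscover and log in to the new target
  iscsiadm -m discovery -t sendtargets -p 192.168.1.10
  iscsiadm -m node -T iqn.2008-04.com.example:vm01 -p 192.168.1.10 --login

  # refresh multipath so both paths to the new LUN are grouped again
  multipath -r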

Drawbacks:
1 - lots of changes to conf files, restarting services :)
2 - multipath has a path checker that verifies that each path is still alive 
(usually by reading sector 0). That would give me lots and lots of those reads 
(see the multipath.conf sketch below):

total checks on msa500 = num client machines * num multipath devices * num 
iscsi-target machines

With 8 client machines, 40 volumes and 2 iscsi-target machines we would have:

8 * 40 * 2 = 640 IO checks
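
Those reads come from multipathd's periodic polling, i.e. from settings like 
these in multipath.conf (the values are only an example, not our config):

  defaults {
      polling_interval   30            # seconds between path checks
      path_checker       readsector0   # read sector 0 of every path
      no_path_retry      queue
  }

so every initiator pokes every path of every multipath device once per 
interval, which is where the multiplication above comes from.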

>
>                +--------+ <-> |- initiator1
>                | iSCSI1 |     |
> +--------+ <-> +--------+ <-> |- initiator2
> | MSA500 | (2)     (3)    (4) |     (5)
> +--------+ <-> +--------+ <-> |- initiator3
>     (1)        | iSCSI2 |     |
>                +--------+ <-> |- initiator4
>
> 1) MSA500 provides volume1, volume2 to
>    fiber hosts iSCSI1/iSCSI2
> 2) iSCSI1/iSCSI2 fiber connect to MSA500
> 3) iSCSI1/iSCSI2 use clvm to divvy up
>    volume1 and volume2 into target1, target2
>    target3, target4, target5 to iSCSI network
> 4) iSCSI1/iSCSI2 provide targets to iSCSI
>    network through bonded pairs
> 5) initiators use clvm to divvy up target1,
>    target2, target3... storage for use by Xen
>    domains.

> I hope that helps.
We are doing stress tests (bonnie++, ctcs) with our "hack" and so far it has 
never had any problems. We even shut down one of the iscsi-target nodes; 
there is a small hiccup (as one path fails) but everything continues shortly 
after.
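
For anyone curious, the runs are nothing fancy; a typical bonnie++ invocation 
inside a guest looks something like this (mount point and sizes are just 
examples):

  bonnie++ -d /mnt/test -s 4g -n 128 -u root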

We've changed

node.session.timeo.replacement_timeout 
node.conn[0].timeo.noop_out_interval 
node.conn[0].timeo.noop_out_timeout 

to speed up the failover.
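
For reference, these live in /etc/iscsi/iscsid.conf (for newly discovered 
targets) or can be pushed into existing node records with iscsiadm. The 
values below are only an example, not necessarily what we ended up with:

  # /etc/iscsi/iscsid.conf -- example values only
  node.session.timeo.replacement_timeout = 15
  node.conn[0].timeo.noop_out_interval = 5
  node.conn[0].timeo.noop_out_timeout = 5

  # or update an already-discovered node record:
  iscsiadm -m node -T iqn.2008-04.com.example:vm01 \
      --op update -n node.session.timeo.replacement_timeout -v 15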

Thanks again,
Nuno Fernandes



