[libvirt] RFC: Migration with NPIV

Osier Yang jyang at redhat.com
Wed Nov 21 04:15:15 UTC 2012


On 2012-11-21 00:26, Dave Allan wrote:
> On Tue, Nov 20, 2012 at 10:17:11AM +0000, Daniel P. Berrange wrote:
>> On Mon, Nov 19, 2012 at 05:30:11PM +0800, Osier Yang wrote:
>>> Hi,
>>>
>>> This proposal tries to work out a solution for migrating a
>>> domain which uses a LUN behind a vHBA as a disk device (QEMU
>>> emulated disk only at this stage), along with other NPIV
>>> improvements which are not related to migration. I haven't been
>>> lucky enough to get an environment to test whether these
>>> thoughts are workable, but I'd like to hear your ideas and
>>> suggestions early.
>>>
>>> 1) Persistent vHBA support
>>>
>>>    This is useful functionality that has been missing for a long
>>> time. Assume one created a vHBA and did the masking/zoning, and
>>> everything works as expected. However, after a system reboot,
>>> everything is just lost. If the user wants to get things back, he
>>> has to find out the previous WWNN & WWPN and create the vHBA again.
>>>
>>>    On the other hand, persistent vHBA support is actually required
>>> for a domain which uses a LUN behind a vHBA. Otherwise the domain
>>> could fail to start after a system reboot.
>>>
>>>    To support persistent vHBAs, new APIs like virNodeDeviceDefineXML
>>> and virNodeDeviceUndefine are required. It's also useful to introduce
>>> "autostart" for vHBAs, so that a vHBA can be started automatically
>>> after a system reboot.
>>>
>>>    Proposed APIs:
>>>
>>>    virNodeDevicePtr
>>>    virNodeDeviceDefineXML(virConnectPtr conn,
>>>                           const char *xml,
>>>                           unsigned int flags);
>>>
>>>    int
>>>    virNodeDeviceUndefine(virConnectPtr conn,
>>>                          virNodeDevicePtr dev,
>>>                          unsigned int flags);
>>>
>>>    int
>>>    virNodeDeviceSetAutostart(virNodeDevicePtr dev,
>>>                              int autostart,
>>>                              unsigned int flags);
>>>
>>>    int
>>>    virNodeDeviceGetAutostart(virNodeDevicePtr dev,
>>>                              int *autostart,
>>>                              unsigned int flags);
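>>>
>>>    A rough sketch of how the proposed APIs might be used to make a
>>> vHBA persistent (these are the proposed calls above, not existing
>>> libvirt API; the parent HBA name "scsi_host5" is just an example):
>>>
>>>    #include <libvirt/libvirt.h>
>>>
>>>    int main(void)
>>>    {
>>>        /* vHBA description in the existing node device XML format */
>>>        const char *xml =
>>>            "<device>"
>>>            "  <parent>scsi_host5</parent>"
>>>            "  <capability type='scsi_host'>"
>>>            "    <capability type='fc_host'>"
>>>            "      <wwnn>2001001b32a9da4e</wwnn>"
>>>            "      <wwpn>2101001b32a90004</wwpn>"
>>>            "    </capability>"
>>>            "  </capability>"
>>>            "</device>";
>>>
>>>        virConnectPtr conn = virConnectOpen("qemu:///system");
>>>        if (!conn)
>>>            return 1;
>>>
>>>        /* Persistently define the vHBA, then mark it autostart so
>>>         * it is recreated after a host reboot. */
>>>        virNodeDevicePtr dev = virNodeDeviceDefineXML(conn, xml, 0);
>>>        if (dev) {
>>>            virNodeDeviceSetAutostart(dev, 1, 0);
>>>            virNodeDeviceFree(dev);
>>>        }
>>>
>>>        virConnectClose(conn);
>>>        return 0;
>>>    }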
>>
>> I don't really much like this approach. IMHO, this should
>> all be done via the virStoragePool APIs instead. Adding
>> define/undefine/autostart to virNodeDevice is really just
>> duplicating the storage pool functionality.
>
> I like the idea of making vHBAs persist as part of pools; how do you
> envision it should work?  Extend the scsi pools to take a vHBA
> descriptor and then instantiate the vHBA as part of starting the
> pool, or something else?
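>
> For instance, something along these lines (purely illustrative; the
> wwnn/wwpn attributes on <adapter> are a hypothetical extension of the
> current scsi pool schema, and the parent name is just an example):
>
>    #include <libvirt/libvirt.h>
>
>    /* Sketch: define a scsi pool whose <adapter> carries the vHBA's
>     * WWNN/WWPN, so that starting the pool instantiates the vHBA. */
>    static int define_npiv_pool(virConnectPtr conn)
>    {
>        const char *poolxml =
>            "<pool type='scsi'>"
>            "  <name>npiv-pool</name>"
>            "  <source>"
>            "    <adapter name='scsi_host5'"
>            "             wwnn='2001001b32a9da4e'"
>            "             wwpn='2101001b32a90004'/>"
>            "  </source>"
>            "  <target><path>/dev/disk/by-path</path></target>"
>            "</pool>";
>
>        virStoragePoolPtr pool = virStoragePoolDefineXML(conn, poolxml, 0);
>        if (!pool)
>            return -1;
>
>        virStoragePoolSetAutostart(pool, 1);     /* survive host reboots */
>        int ret = virStoragePoolCreate(pool, 0); /* create vHBA, scan LUNs */
>        virStoragePoolFree(pool);
>        return ret;
>    }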
>
>>> 2) Associate vHBA with domain XML
>>>
>>>    There are two ways to attach a LUN to a domain: as a QEMU emulated
>>> device, or via passthrough. Since passing a LUN through is not supported
>>> in libvirt yet, let's focus on the emulated LUN at this stage.
>>>
>>>    New attributes "wwnn" and "wwpn" are introduced to indicate the
>>> LUN behind the vHBA. E.g.
>>>
>>>     <disk type='block' device='disk'>
>>>       <driver name='qemu' type='raw'/>
>>>       <source wwnn="2001001b32a9da4e" wwpn="2101001b32a90004"/>
>>
>> If you change the schema of the <source> element, then you must
>> also create a new type='XXX' attribute to identify it, not just
>> re-use type='block'
>>
>>>       <target dev='vda' bus='virtio'/>
>>>       <address type='pci' domain='0x0000' bus='0x00' slot='0x07'
>>> function='0x0'/>
>>>     </disk>
>>>
>>>    Before the domain starts, we have to check whether there is a LUN
>>> assigned to the vHBA, and error out if not.
>>>
>>>    Using the stable path of the LUN also works, e.g.
>>>
>>>    <source dev="/dev/disk/by-path/pci-0000\:00\:07.0-scsi-0\:0\:0\:0"/>
>>>
>>>    But the disadvantage is that the user has to figure out the stable
>>> path himself, and we have to check every stable path to see whether
>>> it's behind a vHBA in the migration "Begin" stage. Or perhaps a new
>>> attribute on the "source" element to indicate that it's behind a vHBA,
>>> such as:
>>>
>>>    <source dev="disk-by-path" model="vport"/>
>>
>> I don't much like the idea of mapping vHBAs to <disk> elements,
>> because you have a cardinality mismatch. A <disk> is the equivalent
>> of a single LUN, but a vHBA is something that provides multiple
>> LUNs.
>>
>> If you want to directly associate a vHBA with a virtual guest,
>> then this is really in the realm of SCSI HBA passthrough, not
>> <disk> devices.
>>
>>
>> If you want something mapped to the <disk> device, then the
>> approach should be to map to a storage pool volume - something
>> we've long talked about as broadly useful for all storage types,
>> not just NPIV.
>
> +1, we really should take this as an opportunity to add storage
> volumes as <disk> devices.
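>
> Roughly something like this (a hypothetical schema just to illustrate
> the idea; "npiv-pool" and the volume name are placeholders):
>
>    #include <libvirt/libvirt.h>
>
>    /* Sketch: attach a disk that refers to a storage pool volume
>     * rather than a host path, so the LUN is resolved through the
>     * pool that owns the vHBA. */
>    static int attach_volume_disk(virDomainPtr dom)
>    {
>        const char *diskxml =
>            "<disk type='volume' device='disk'>"
>            "  <driver name='qemu' type='raw'/>"
>            "  <source pool='npiv-pool' volume='unit:0:0:0'/>"
>            "  <target dev='vda' bus='virtio'/>"
>            "</disk>";
>
>        return virDomainAttachDeviceFlags(dom, diskxml,
>                                          VIR_DOMAIN_AFFECT_CONFIG);
>    }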
>
>>> 3) Migration with vHBA
>>>
>>>    One possible solution for migration with a vHBA is to use two pairs
>>> of WWNN & WWPN on the source host: one is used for the domain, one is
>>> reserved for migration purposes. It requires that the storage admin map
>>> the same LUN to the two vHBAs when doing the masking and zoning.
>>>
>>> One of the two vHBAs is called the "primary vHBA", the other the
>>> "secondary vHBA". To maintain the relationship between these two
>>> vHBAs, we have to introduce new XML elements for the vHBA. E.g.
>>>
>>>     In XML of primary vHBA:
>>>
>>>     <secondary wwpn="2101001b32a90004"/>
>>>
>>>     In XML of secondary vHBA:
>>>
>>>     <primary wwpn="2101001b32a90002"/>
>>>
>>> The primary vHBA is guaranteed not to be used by any domain driven
>>> by libvirt (we do some checking earlier, before the domain starts).
>>> It's also guaranteed that the LUN can't be used by another domain,
>>> via sVirt or sanlock. So it's safe to have two vHBAs on the source
>>> host too.
>>>
>>> To prevent someone from using the LUN by creating a vHBA with the
>>> same WWNN & WWPN on another host, we must create the secondary vHBA
>>> on the source host, even though it's not being used.
>>>
>>> Both the primary and secondary vHBAs must be defined and marked as
>>> "autostart" so that the domain can be started after a system
>>> reboot.
>>>
>>> When doing migration, we have to bake a bigger cookie with the
>>> secondary vHBA's info (basically its WWNN and WWPN) in the migration
>>> "Begin" stage, and eat it in the migration "Prepare" stage on the
>>> target host.
>>>
>>> In "Begin" stage, the XMLs represents the secondary vHBA is
>>> constructed. And the secondary vHBA is destoyed on source host,
>>> not undefined though.
>>>
>>> In "Prepare" stage, a new vHBA is created (define and start)
>>> on target host with the same WWNN&  WWPN as secondary vHBA on
>>> source host. The LUN then should be visible to target host
>>> automatically? and thus migration can be performed. After migration
>>> is finished on target host, the primary vHBA on source host is
>>> destroyed, not undefined.
>>>
>>> If migration fails, the new vHBA created on the target host will
>>> be destroyed and undefined, and both the primary and secondary
>>> vHBAs on the source host will be started, so that the domain can
>>> be resumed.
>>>
>>> Finally, if migration succeeds, the primary vHBA on the source host
>>> will be transferred to the target host as the secondary vHBA
>>> (defined), and both the primary and secondary vHBAs on the source
>>> host will be undefined.
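>>>
>>> To make the cookie contents concrete, the extra data baked in
>>> "Begin" and eaten in "Prepare" would be tiny, basically just the
>>> secondary vHBA's identity (the struct below is hypothetical, only
>>> to illustrate the flow; nothing like it exists in libvirt today):
>>>
>>>    /* Hypothetical sketch of the extra migration cookie payload */
>>>    typedef struct _vHBAMigrationCookie {
>>>        char *wwnn;  /* secondary vHBA's WWNN, e.g. "2001001b32a9da4e" */
>>>        char *wwpn;  /* secondary vHBA's WWPN, e.g. "2101001b32a90004" */
>>>    } vHBAMigrationCookie;
>>>
>>>    /* Begin (source): fill the cookie from the secondary vHBA's XML,
>>>     * then destroy (but keep defined) the secondary vHBA.
>>>     * Prepare (target): define and start a vHBA with wwnn/wwpn from
>>>     * the cookie so the LUN becomes visible before the guest runs. */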
>>
>> If we do the mapping of HBAs to guest domains using storage
>> pools, then at a guest level, migration requires zero work.
>>
>> It is simply up to the management app to create the storage
>> pool on the destination host with the same Name + UUID, but
>> with the secondary WWNN/WWPN. The nice thing about this, is
>> that you don't need to hardcode details of a secondary
>> WWNN/WWPN up-front. The management app can just decide on
>> those at the time it performs the migration, so 99% of the
>> time there will only need to be a single vHBA setup on the
>> SAN. During migration the mgmt app can setup a second
>> vHBA for the target host, and once complete, delete the
>> original vHBA entirely.
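>>
>> In rough pseudo-C, the mgmt app's job at migration time would be
>> something like this (only existing libvirt calls are used; "poolxml"
>> is the destination pool definition carrying the same name + UUID but
>> the new WWNN/WWPN, whose exact schema is the open question here):
>>
>>    #include <libvirt/libvirt.h>
>>
>>    static int migrate_with_npiv(virConnectPtr src, virConnectPtr dst,
>>                                 const char *domname,
>>                                 const char *poolname,
>>                                 const char *poolxml)
>>    {
>>        /* 1. Set up the second vHBA on the target via a storage pool */
>>        virStoragePoolPtr dpool = virStoragePoolDefineXML(dst, poolxml, 0);
>>        if (!dpool || virStoragePoolCreate(dpool, 0) < 0)
>>            return -1;
>>
>>        /* 2. Migrate the guest; no NPIV-specific work in libvirt */
>>        virDomainPtr dom = virDomainLookupByName(src, domname);
>>        if (!dom)
>>            return -1;
>>        virDomainPtr ddom = virDomainMigrate(dom, dst,
>>                                             VIR_MIGRATE_LIVE |
>>                                             VIR_MIGRATE_PERSIST_DEST,
>>                                             NULL, NULL, 0);
>>        if (!ddom)
>>            return -1;
>>
>>        /* 3. Tear down the original vHBA on the source entirely */
>>        virStoragePoolPtr spool = virStoragePoolLookupByName(src, poolname);
>>        if (spool) {
>>            virStoragePoolDestroy(spool);
>>            virStoragePoolUndefine(spool);
>>            virStoragePoolFree(spool);
>>        }
>>
>>        virDomainFree(dom);
>>        virDomainFree(ddom);
>>        virStoragePoolFree(dpool);
>>        return 0;
>>    }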
>
> Agreed, although there will of course need to be some degree of
> up-front coordination between the management app and the SAN
> administrators to avoid having to involve them to migrate a VM.
>
>>> 4) Enrich HBA's XML
>>>
>>>    With the current implementation it's hard to know which vHBAs were
>>> created from an HBA. One has to dump the XML of each (v)HBA and find
>>> the clue in the vHBA's "parent" element. It would be good to introduce
>>> a new element for the HBA, like "vports", so that one can easily see
>>> which (and how many) vHBAs were created from the HBA.
>>>
>>>    It would also be good to expose the maximum number of vports the
>>> HBA supports.
>>>
>>>    Besides these, other useful information should be exposed too,
>>> such as the vendor name, the HBA state, the PCI address, etc.
>>>
>>>    The new XML would look like:
>>>
>>>    <vports num='2' max='64'>
>>>      <vport name="scsi_host40" wwpn="2101001b32a90004"/>
>>>      <vport name="scsi_host41" wwpn="2101001b32a90005"/>
>>>    </vports>
>>>    <online/>
>>>    <vendor>QLogic</vendor>
>>>    <address type="pci" domain="0" bus="0" slot="5" function="0"/>
>>>
>>>    "online", "vendor", "address" make sense to vHBA too.
>>
>> I'm trying to remember how we modelled the parent/child relationship
>> for SR-IOV PCI cards. NPIV is a very similar concept, so we should
>> ideally seek to model the parent/child relationship in the same
>> manner.
>
> Physical function:
>
> <device>
>    <name>pci_0000_01_00_0</name>
>    <parent>pci_0000_00_01_0</parent>
>    <driver>
>      <name>igb</name>
>    </driver>
>    <capability type='pci'>
>      <domain>0</domain>
>      <bus>1</bus>
>      <slot>0</slot>
>      <function>0</function>
>      <product id='0x10c9'>82576 Gigabit Network Connection</product>
>      <vendor id='0x8086'>Intel Corporation</vendor>
>      <capability type='virt_functions'>
>        <address domain='0x0000' bus='0x01' slot='0x10' function='0x0'/>
>        <address domain='0x0000' bus='0x01' slot='0x10' function='0x2'/>
>        <address domain='0x0000' bus='0x01' slot='0x10' function='0x4'/>
>        <address domain='0x0000' bus='0x01' slot='0x10' function='0x6'/>
>        <address domain='0x0000' bus='0x01' slot='0x11' function='0x0'/>
>        <address domain='0x0000' bus='0x01' slot='0x11' function='0x2'/>
>        <address domain='0x0000' bus='0x01' slot='0x11' function='0x4'/>
>      </capability>
>    </capability>
> </device>
>
> Virtual function:
>
> <device>
>    <name>pci_0000_01_10_0</name>
>    <parent>pci_0000_00_01_0</parent>
>    <driver>
>      <name>igbvf</name>
>    </driver>
>    <capability type='pci'>
>      <domain>0</domain>
>      <bus>1</bus>
>      <slot>16</slot>
>      <function>0</function>
>      <product id='0x10ca'>82576 Virtual Function</product>
>      <vendor id='0x8086'>Intel Corporation</vendor>
>      <capability type='phys_function'>
>        <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
>      </capability>
>      <capability type='virt_functions'>
>      </capability>
>    </capability>
> </device>
>
> Interestingly, I think there's a bug there; the VF should not be
> showing <capability type='virt_functions'>

Yeah, that's a bug. Okay, "capability" sounds good.

Regards,
Osier



