Re: [libvirt] RFC: Migration with NPIV



On 2012-11-21 00:26, Dave Allan wrote:
On Tue, Nov 20, 2012 at 10:17:11AM +0000, Daniel P. Berrange wrote:
On Mon, Nov 19, 2012 at 05:30:11PM +0800, Osier Yang wrote:
Hi,

This proposal tries to work out a solution for migrating a
domain that uses a LUN behind a vHBA as a disk device (QEMU
emulated disks only at this stage), along with other NPIV
improvements that are not related to migration. I haven't been
able to get an environment to test whether these ideas are workable,
but I'd like to hear ideas and suggestions early.

1) Persistent vHBA support

   This is useful functionality that has been missing for a long time.
Assume that one has created a vHBA and done the masking/zoning, and
everything works as expected. After a system reboot, however, everything
is lost. To get things back, the user has to find the previous
WWNN & WWPN and create the vHBA again.

   On the other hand, persistent vHBA support is actually required
for a domain which uses a LUN behind a vHBA. Otherwise the domain
could fail to start after a system reboot.

   To support persistent vHBAs, new APIs like virNodeDeviceDefineXML
and virNodeDeviceUndefine are required. It's also useful to introduce
"autostart" for vHBAs, so that a vHBA can be started automatically
after a system reboot.

   Proposed APIs:

   virNodeDevicePtr
   virNodeDeviceDefineXML(virConnectPtr conn,
                          const char *xml,
                          unsigned int flags);

   int
   virNodeDeviceUndefine(virConnectPtr conn,
                         virNodeDevicePtr dev,
                         unsigned int flags);

   int
   virNodeDeviceSetAutostart(virNodeDevicePtr dev,
                             int autostart,
                             unsigned int flags);

   int
   virNodeDeviceGetAutostart(virNodeDevicePtr dev,
                             int *autostart,
                             unsigned int flags);

I don't really like this approach. IMHO, this should
all be done via the virStoragePool APIs instead. Adding
define/undefine/autostart to virNodeDevice is really just
duplicating the storage pool functionality.

I like the idea of making vHBAs persist as part of pools; how do you
envision it should work?  Extend the scsi pools to take a vHBA
descriptor and then instantiate the vHBA as part of starting the
pool, or something else?

2) Associate vHBA with domain XML

   There are two ways to attach a LUN to a domain: as a QEMU emulated
device, or via passthrough. Since LUN passthrough is not supported in
libvirt yet, let's focus on the emulated LUN at this stage.

   New attributes "wwnn" and "wwpn" are introduced to indicate the
LUN behind the vHBA. E.g.

    <disk type='block' device='disk'>
      <driver name='qemu' type='raw'/>
      <source wwnn="2001001b32a9da4e" wwpn="2101001b32a90004"/>

If you change the schema of the <source> element, then you must
also create a new type='XXX' attribute to identify it, not just
re-use type='block'

      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </disk>

   Before the domain starts, we have to check whether a LUN is
assigned to the vHBA, and error out if not.
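
A pre-start check along these lines could begin by pulling the proposed attributes out of the disk XML. The sketch below is Python using only the standard library. The "wwnn"/"wwpn" attributes on <source> are the proposal from this mail, not an existing libvirt schema, and `vhba_source_wwns` is a hypothetical helper name. It only validates the format (16 hex digits, i.e. a 64-bit WWN); checking whether a LUN is really assigned would need access to the host and is not attempted here.

```python
import re
import xml.etree.ElementTree as ET

# Disk XML as proposed in this mail (not an existing libvirt schema).
DISK_XML = """
<disk type='block' device='disk'>
  <driver name='qemu' type='raw'/>
  <source wwnn="2001001b32a9da4e" wwpn="2101001b32a90004"/>
  <target dev='vda' bus='virtio'/>
</disk>
"""

# A Fibre Channel WWN is a 64-bit value, written here as 16 hex digits.
WWN_RE = re.compile(r"[0-9a-fA-F]{16}")


def vhba_source_wwns(disk_xml):
    """Return (wwnn, wwpn) from the proposed <source>, or raise ValueError."""
    source = ET.fromstring(disk_xml).find("source")
    if source is None:
        raise ValueError("disk has no <source> element")
    wwnn, wwpn = source.get("wwnn"), source.get("wwpn")
    for label, value in (("wwnn", wwnn), ("wwpn", wwpn)):
        if value is None or not WWN_RE.fullmatch(value):
            raise ValueError(f"missing or malformed {label}: {value!r}")
    return wwnn, wwpn
```

Calling `vhba_source_wwns(DISK_XML)` returns the two WWNs as a tuple; a disk without the attributes raises ValueError, which is where a real implementation would error out before starting the domain.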

   Using the stable path of LUN also works, e.g.

   <source dev="/dev/disk/by-path/pci-0000\:00\:07.0-scsi-0\:0\:0\:0"/>

   The disadvantage is that the user has to figure out the stable
path himself, and we have to check every stable path in the
migration "Begin" stage to see whether it's behind a vHBA. Or should
we add a new XML attribute to the "source" element to indicate that
it's behind a vHBA? For example:

   <source dev="disk-by-path" model="vport"/>

I don't much like the idea of mapping vHBA to <disk> elements,
because you have a cardinality mismatch. A <disk> is the equivalent
of a single LUN, but a vHBA is something that provides multiple
LUNs.

If you want to directly associate a vHBA with a virtual guest,
then this is really in the realm of SCSI HBA passthrough, not
<disk> devices.


If you want something mapped to the <disk> device, then the
approach should be to map to a storage pool volume - something
we've long talked about as broadly useful for all storage types,
not just NPIV.

+1, we really should take this as an opportunity to add storage
volumes as <disk> devices.

3) Migration with vHBA

   One possible solution for migration with vHBAs is to use a pair
of vHBAs (each with its own WWNN & WWPN) on the source host: one is
used by the domain, the other is reserved for migration. It requires
the storage admin to map the same LUN to both vHBAs when doing the
masking and zoning.

One of the two vHBAs is called the "primary vHBA", the other the
"secondary vHBA". To maintain the relationship between the two
vHBAs, we have to introduce new XML for vHBAs. E.g.

    In XML of primary vHBA:

    <secondary wwpn="2101001b32a90004"/>

    In XML of secondary vHBA:

    <primary wwpn="2101001b32a90002"/>

The primary vHBA is guaranteed not to be used by any domain which
is driven by libvirt (we do some checking earlier, before the domain
starts). It's also guaranteed by sVirt or Sanlock that the LUN can't
be used by another domain. So it's safe to have two vHBAs
on the source host too.

To prevent anyone from using the LUN by creating a vHBA with the same
WWNN & WWPN on another host, we must create the secondary vHBA on the
source host, even though it's not being used.

Both the primary and secondary vHBAs must be defined and marked as
"autostart" so that the domain can be started after a system
reboot.

When migrating, we have to bake a bigger cookie containing the secondary
vHBA's info (basically its WWNN and WWPN) in the migration "Begin"
stage, and eat it in the migration "Prepare" stage on the target host.

In the "Begin" stage, the XML representing the secondary vHBA is
constructed, and the secondary vHBA is destroyed on the source host,
though not undefined.

In the "Prepare" stage, a new vHBA is created (defined and started)
on the target host with the same WWNN & WWPN as the secondary vHBA on
the source host. The LUN should then be visible to the target host
automatically (?), and thus migration can be performed. After migration
has finished on the target host, the primary vHBA on the source host is
destroyed, not undefined.

If migration fails, the new vHBA created on the target host is
destroyed and undefined, and both the primary and secondary
vHBAs on the source host are started so that the domain can
be resumed.

Finally, if migration succeeds, the primary vHBA on the source host
is transferred to the target host as the secondary vHBA (defined),
and both the primary and secondary vHBAs on the source host are
undefined.
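
To make the ordering of the steps above easier to follow, here is a toy model of the primary/secondary flow in Python. It is only a sketch of the sequence described in this mail: `Host` and `migrate` are hypothetical names, no libvirt call is involved, and a vHBA is reduced to a WWPN mapped to "defined" or "started".

```python
class Host:
    """Toy model of a host: maps a vHBA's WWPN to 'defined' or 'started'."""

    def __init__(self, name):
        self.name = name
        self.vhbas = {}

    def define(self, wwpn):
        self.vhbas.setdefault(wwpn, "defined")

    def start(self, wwpn):
        if wwpn not in self.vhbas:
            raise KeyError(f"{wwpn} is not defined on {self.name}")
        self.vhbas[wwpn] = "started"

    def destroy(self, wwpn):
        # Stop the vHBA but keep its persistent definition.
        if wwpn in self.vhbas:
            self.vhbas[wwpn] = "defined"

    def undefine(self, wwpn):
        self.vhbas.pop(wwpn, None)


def migrate(src, dst, primary, secondary, succeed=True):
    # "Begin": the secondary vHBA is destroyed on the source, not undefined.
    src.destroy(secondary)
    # "Prepare": the target defines and starts a vHBA with the same WWPN.
    dst.define(secondary)
    dst.start(secondary)
    if not succeed:
        # Failure: remove the target vHBA, restart both on the source.
        dst.destroy(secondary)
        dst.undefine(secondary)
        src.start(primary)
        src.start(secondary)
        return False
    # Success: the old primary moves to the target as its (defined-only)
    # secondary, and both definitions are dropped from the source.
    src.destroy(primary)
    dst.define(primary)
    src.undefine(primary)
    src.undefine(secondary)
    return True
```

After a successful run the source holds no vHBA definitions and the target holds both WWPNs with the roles swapped; after a failed run the source is back to both vHBAs started and the target is empty.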

If we do the mapping of HBAs to guest domains using storage
pools, then at a guest level, migration requires zero work.

It is simply up to the management app to create the storage
pool on the destination host with the same Name + UUID, but
with the secondary WWNN/WWPN. The nice thing about this is
that you don't need to hardcode details of a secondary
WWNN/WWPN up-front. The management app can just decide on
those at the time it performs the migration, so 99% of the
time there will only need to be a single vHBA set up on the
SAN. During migration the mgmt app can set up a second
vHBA for the target host, and once complete, delete the
original vHBA entirely.

Agreed, although there will of course need to be some degree of
up-front coordination between the management app and the SAN
administrators to avoid having to involve them to migrate a VM.
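
The pool-based approach described above comes down to keeping the pool identity fixed while swapping the WWNN/WWPN. A minimal Python sketch, assuming a hypothetical `VhbaPool` record (no such pool XML is defined in this thread, so this is plain data rather than libvirt XML):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class VhbaPool:
    """Hypothetical description of a vHBA-backed storage pool."""
    name: str
    uuid: str
    wwnn: str
    wwpn: str


def destination_pool(source_pool, wwnn, wwpn):
    # Same Name + UUID, so the guest config needs no change; only the
    # WWNN/WWPN differ, and they can be chosen at migration time.
    return replace(source_pool, wwnn=wwnn, wwpn=wwpn)
```

The guest refers to the pool by Name + UUID only, which is why nothing at the guest level has to change in this model.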

4) Enrich HBA's XML

   With the current implementation it's hard to know which vHBAs were
created from an HBA. One has to dump the XML of each (v)HBA and work
it out from the vHBAs' "parent" element. It would be good to introduce
a new element for HBAs, like "vports", so that one can easily see
which (and how many) vHBAs were created from the HBA.

   It would also be good to expose the maximum number of vports the
HBA supports.

   Besides these, other useful information should be exposed too,
such as the vendor name, the HBA state, PCI address, etc.

   The new XML would look like:

   <vports num='2' max='64'>
     <vport name="scsi_host40" wwpn="2101001b32a90004"/>
     <vport name="scsi_host40" wwpn="2101001b32a90005"/>
   </vports>
   <online/>
   <vendor>QLogic</vendor>
   <address type="pci" domain="0" bus="0" slot="5" function="0"/>

   "online", "vendor", and "address" make sense for vHBAs too.
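
As an illustration of what a client could do with this, the Python sketch below reads the proposed elements with the standard library. The fragment is copied from this mail and wrapped in a <capability> root, which is an assumption (the proposal doesn't say where the elements nest); `summarize_hba` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Proposed fragment from this mail; the <capability> wrapper is assumed.
HBA_XML = """
<capability type='scsi_host'>
  <vports num='2' max='64'>
    <vport name="scsi_host40" wwpn="2101001b32a90004"/>
    <vport name="scsi_host40" wwpn="2101001b32a90005"/>
  </vports>
  <online/>
  <vendor>QLogic</vendor>
  <address type="pci" domain="0" bus="0" slot="5" function="0"/>
</capability>
"""


def summarize_hba(xml_text):
    """Collect the vHBA WWPNs, remaining vport capacity, state and vendor."""
    root = ET.fromstring(xml_text)
    vports = root.find("vports")
    wwpns = [v.get("wwpn") for v in vports.findall("vport")]
    return {
        "vendor": root.findtext("vendor"),
        "online": root.find("online") is not None,
        "vport_wwpns": wwpns,
        "free_vports": int(vports.get("max")) - len(wwpns),
    }
```

With this layout, listing the vHBAs of an HBA no longer needs a pass over every node device's "parent" element.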

I'm trying to remember how we modelled the parent/child relationship
for SR-IOV PCI cards. NPIV is a very similar concept, so we should
ideally seek to model the parent/child relationship in the same
manner.

Physical function:

<device>
   <name>pci_0000_01_00_0</name>
   <parent>pci_0000_00_01_0</parent>
   <driver>
     <name>igb</name>
   </driver>
   <capability type='pci'>
     <domain>0</domain>
     <bus>1</bus>
     <slot>0</slot>
     <function>0</function>
     <product id='0x10c9'>82576 Gigabit Network Connection</product>
     <vendor id='0x8086'>Intel Corporation</vendor>
     <capability type='virt_functions'>
       <address domain='0x0000' bus='0x01' slot='0x10' function='0x0'/>
       <address domain='0x0000' bus='0x01' slot='0x10' function='0x2'/>
       <address domain='0x0000' bus='0x01' slot='0x10' function='0x4'/>
       <address domain='0x0000' bus='0x01' slot='0x10' function='0x6'/>
       <address domain='0x0000' bus='0x01' slot='0x11' function='0x0'/>
       <address domain='0x0000' bus='0x01' slot='0x11' function='0x2'/>
       <address domain='0x0000' bus='0x01' slot='0x11' function='0x4'/>
     </capability>
   </capability>
</device>

Virtual function:

<device>
   <name>pci_0000_01_10_0</name>
   <parent>pci_0000_00_01_0</parent>
   <driver>
     <name>igbvf</name>
   </driver>
   <capability type='pci'>
     <domain>0</domain>
     <bus>1</bus>
     <slot>16</slot>
     <function>0</function>
     <product id='0x10ca'>82576 Virtual Function</product>
     <vendor id='0x8086'>Intel Corporation</vendor>
     <capability type='phys_function'>
       <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
     </capability>
     <capability type='virt_functions'>
     </capability>
   </capability>
</device>

Interestingly, I think there's a bug there; the VF should not be
showing <capability type='virt_functions'>

Yeah, that's a bug. Okay, "capability" sounds good.
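
For reference, the nested "capability" layout quoted above can be walked with a few lines of Python (standard library only). The XML here is a trimmed copy of the virtual function example from this thread, and `addresses` is a hypothetical helper. It returns None when the nested capability is absent and an empty list when it is present but empty, which is how the stray virt_functions element on the VF shows up.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the virtual function XML quoted in this thread.
VF_XML = """
<device>
  <name>pci_0000_01_10_0</name>
  <capability type='pci'>
    <capability type='phys_function'>
      <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
    </capability>
    <capability type='virt_functions'>
    </capability>
  </capability>
</device>
"""


def addresses(device_xml, cap_type):
    """Return PCI addresses under the nested capability, or None if absent."""
    root = ET.fromstring(device_xml)
    cap = root.find(f"capability[@type='pci']/capability[@type='{cap_type}']")
    if cap is None:
        return None
    return [
        "{domain}:{bus}:{slot}.{function}".format(**a.attrib)
        for a in cap.findall("address")
    ]
```

A "vports" style capability for HBAs could be read the same way if NPIV follows the SR-IOV model.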

Regards,
Osier
