[rdo-list] Replacing the tripleo-quickstart HA job with a single controller pacemaker job

Fri Jul 1 10:56:20 UTC 2016

On 01/07/2016 12:48, Raoul Scarazzini wrote:
> On 30/06/2016 20:46, John Trowbridge wrote:
>> Howdy folks,
>>
>> Just wanted to give a heads up that I plan to replace the
>> "high-availability" tripleo-quickstart job in the CI promotion
>> pipeline[1], with a job with a lower footprint. In CI, we get a virthost
>> with 32G of RAM and a mediocre CPU. It is really hard to fit 5 really
>> active VMs on that, and we have never had the HA job stable enough to
>> use as a gate for that reason.
>>
>> Instead, we will test the pacemaker code path in tripleo by using a
>> single controller setup with pacemaker enabled. We were never actually
>> testing HA (i.e. failover scenarios) in the current job, so this should be
>> a pretty minimal loss in coverage.
>>
>> Since this allows us to drop two CPU intensive nodes from the deploy, we
>> can add a ceph node to that job. This will end up with more code
>> coverage than the current HA job, and will hopefully end up being
>> stable enough to use as a gate as well.
>>
>> Longer term, it would be good to restore an actual HA job, maybe even
>> adding some failure scenario tests to the job. I have a couple of ideas
>> about how we could do this, but none are feasible in the short term.
>>
>> 1. Use pre-existing servers for deploying[2]
>>
>> This would allow running the HA job against any cloud, where we could
>> size the nodes appropriately to make the job stable.
>>
>> 2. Use an OVB cloud for the HA job.
>>
>> Soon we should have an OVB (openstack virtual baremetal) cloud to run
>> tests in. OVB would have all of the benefits of the solution above
>> (unrestricted VM size), and would also provide us a way to test Ironic
>> in a more realistic way since it mocks IPMI rather than our current
>> method of using a fake ironic driver (which just does virsh commands
>> over SSH).
>>
>> 3. Add a feature to tripleo-quickstart to bridge multiple virthosts
>>
>> If we could deploy our virtual machines across 2 different hosts, we
>> would then have much more room to deploy the HA job.
>>
>>
>> If anyone has some better ideas, they are very welcome!
> 
> Hi John,
> No better ideas here. I just want to add that all the work I've done on
> roles in the last few months (ansible-role-tripleo-baremetal-undercloud)
> was aimed at having a physical environment in which to test HA scenarios
> (using ansible-role-tripleo-overcloud-validate-ha).
> Not much of this is in place at the moment (in particular for the HA
> validation), since I got a lot of connected patches merged just
> yesterday, but I think we are on a good track.

One thing I did not mention above is that we're setting up the jobs so
that, as far as the HA part is concerned, they'll rely on our
infrastructure. So we're working in exactly that direction because, as
you can imagine, running HA tests on a single-node pacemaker setup is
like having no tests at all.

-- 
Raoul Scarazzini
rasca at redhat.com