Everyone dreads the moment when they hear: "Someone broke the build!" Developers scramble to review errors and assemble patches to bring the problem under control. Recent movements, such as continuous integration (CI), aim to eliminate this stress by testing early and testing often.
Developers of small projects usually make the transition to CI without much difficulty. Large projects, on the other hand, require more planning and investment. These projects have plenty of dependencies, they take more time to prepare for testing, and they must be tested on a wide array of systems.
Continuous integration for the Linux kernel feels like climbing Mount Everest. Testing can mean working through 14,000 changesets from 1,700 developers just for the 5.0 release. Fast machines compile kernels in about ten minutes, but that only includes one kernel configuration out of an enormous number of possible configurations. From there, the kernel must boot and test properly on various types of hardware.
At Red Hat, the Continuous Kernel Integration (CKI) project team realized that elastic compute resources would allow us to balance the speed at which we test kernels with the cost of testing them. When the queue grows, the infrastructure grows. When the queue shrinks, the infrastructure shrinks along with it.
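The scaling rule described above can be sketched in a few lines. This is only an illustration, not the CKI team's implementation: the function name, the jobs-per-instance ratio, and the instance limits are all hypothetical.

```python
import math


def desired_instances(
    queued_jobs: int,
    jobs_per_instance: int = 1,   # hypothetical: jobs one instance handles at a time
    min_instances: int = 0,       # hypothetical floor: scale to zero when idle
    max_instances: int = 20,      # hypothetical ceiling: caps the worst-case bill
) -> int:
    """Return how many instances should be running for the current queue."""
    if queued_jobs < 0:
        raise ValueError("queued_jobs must not be negative")
    if jobs_per_instance < 1:
        raise ValueError("jobs_per_instance must be at least 1")

    # Enough instances to cover every queued job, rounded up.
    needed = math.ceil(queued_jobs / jobs_per_instance)

    # Clamp the result so the fleet never leaves the configured bounds.
    return max(min_instances, min(needed, max_instances))
```

The upper bound is the cost control here: the queue may grow without limit, but the fleet, and therefore the bill, cannot.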
Cloud providers make this process easy with APIs and integrations with existing infrastructure provisioning tools. That simplicity comes at a cost: cloud providers charge you more for taking the easy path. How can we turn the tables and get more control over our costs?
Start by finding all of the levers available to you to control cost. Some of these levers require more work from your team, but these one-time efforts can pay for themselves over time. Utility billing means you pay for what you use, but you can lower your bill by controlling your usage in a few key areas:
Limit network traffic and make it deliberate. Although most cloud providers do not charge for inbound traffic (data downloaded from the internet into your instance), outbound traffic costs escalate quickly. Find ways to bring the data close to your instances at the cloud provider so that you reduce your network traffic to the absolute minimum. Use private networks between instances whenever possible, and use free network links between instances and object storage for storing test artifacts.
Know how your provider charges for instances. Running instances cost money, plain and simple, so understand how your cloud provider bills for instance time. If a cloud charges by the second (such as AWS on-demand prices), it makes sense to stop instances as soon as they are idle. If the provider charges by the hour (such as DigitalOcean), keep idle instances around until the end of the hour you have already paid for in case another test comes along. Keep the instances around long enough to do something useful without incurring charges for idle infrastructure.
Ensure instances are ready to go. Build your own instance images to ensure they have everything you need to run your job. Installing system packages after starting an instance causes delays, and broken package mirrors lead to provisioning errors. Use projects like Red Hat’s Image Builder (OS Build) or HashiCorp’s Packer to prepare your images ahead of time.
Speed up your jobs. Analyze your testing jobs to find the slowest parts, and then optimize those. For the kernel, the CKI team created a cache of the Linux kernel Git tree in Amazon’s S3 object storage, along with a job that updates it regularly. That reduced the time to clone the Linux kernel from several minutes to 20 seconds. We also stored ccache data from previous kernel compiles in S3. This reduced compile times on EC2’s c4.2xlarge from 32 minutes to less than four minutes.
Benchmark instances for a performance and cost balance. Bigger, more powerful instances can speed up your testing, but they come at a high cost. Benchmark different instances to find the best performance return for your money. If a faster instance costs twice as much as a slower one, but it only delivers a 10% reduction in test time, skip the upgrade.
Analyze instance markets. Different cloud providers offer cost-saving methods of provisioning instances. EC2’s spot instances, Google Compute’s preemptible instances, and Azure’s spot VMs provide ways of saving costs with some caveats. Capacity fluctuates constantly, and your instance could be taken away when capacity plummets or other users bid higher for the same resources. Architect your system on the assumption that any instance could disappear at any moment. Systems built that way can take full advantage of the greatly reduced prices in these instance markets.
Watch your bill like a hawk. As you prototype or scale up your testing, monitor your bill constantly. Misconfigurations or poor network traffic controls cause billing spikes and you should work quickly to correct those. Your bill also highlights areas where you can optimize spending by improving your automation. Successful teams hunt for optimizations that allow their testing load to grow at a rate faster than their bill.
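Two of the levers above come down to simple arithmetic, and a short sketch may make them concrete. The function names, prices, and durations below are hypothetical and chosen only for illustration. Real providers may also apply minimum charges that this sketch ignores.

```python
import math


def billed_seconds(runtime_seconds: int, billing_increment_seconds: int) -> int:
    """Round an instance's runtime up to the provider's billing increment."""
    increments = math.ceil(runtime_seconds / billing_increment_seconds)
    return increments * billing_increment_seconds


def prepaid_idle_seconds(runtime_seconds: int, billing_increment_seconds: int) -> int:
    """Seconds already paid for in the current increment but not yet used.

    An idle instance can wait this long for another job at no extra cost.
    """
    return billed_seconds(runtime_seconds, billing_increment_seconds) - runtime_seconds


def cost_per_job(hourly_price: float, job_minutes: float) -> float:
    """Cost of one job on an instance billed at hourly_price."""
    return hourly_price * job_minutes / 60


# Per-hour billing: after 40 minutes, 20 more minutes are already paid for.
hourly_slack = prepaid_idle_seconds(40 * 60, 3600)

# Per-second billing: nothing is prepaid, so stop idle instances immediately.
per_second_slack = prepaid_idle_seconds(40 * 60, 1)

# Hypothetical upgrade: twice the hourly price for a 10% shorter job.
slow = cost_per_job(hourly_price=1.00, job_minutes=30)
fast = cost_per_job(hourly_price=2.00, job_minutes=27)
```

With these made-up numbers, the faster instance costs 1.8 times as much per job while saving only three minutes, which is the kind of upgrade worth skipping.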
The CKI team put these strategies into practice at AWS by searching for the best method for running jobs. We assumed that smaller instances would allow us to spend less. We compiled the same kernel several times on each instance type to find out how much it costs to compile a kernel on each:
c4.2xlarge: $0.211 per kernel
c4.4xlarge: $0.216 per kernel
c5.4xlarge: $0.166 per kernel
c5.9xlarge: $0.1794 per kernel
The best deal turned out to be the c5.4xlarge, even though it was more expensive per hour than the instances we expected to be the cheapest (the c4 instances). We then optimized the job with ccache, worked with the EC2 spot market, and re-ran the tests:
c4.2xlarge: $0.005 per kernel (spot price)
c4.4xlarge: $0.006 per kernel (spot price)
c5.4xlarge: $0.0054 per kernel (spot price)
c5.9xlarge: $0.0082 per kernel (spot price)
Using ccache shifted most of the compile load from the CPU to the disk, which allowed the cheaper instances to catch up with the more expensive ones. The costs dropped dramatically across the board. The c5.9xlarge, for example, dropped from $0.18 per kernel to less than $0.01. Working with spot instances required many one-time adjustments to our testing jobs, but the cost savings over time made it well worth it.
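As a quick check on the size of those savings, the per-kernel figures listed above can be compared directly. The following is a small sketch using only those published numbers. The variable and function names are illustrative.

```python
# Cost per kernel before the optimizations, as listed above.
BEFORE = {
    "c4.2xlarge": 0.211,
    "c4.4xlarge": 0.216,
    "c5.4xlarge": 0.166,
    "c5.9xlarge": 0.1794,
}

# Cost per kernel with ccache and spot pricing, as listed above.
AFTER = {
    "c4.2xlarge": 0.005,
    "c4.4xlarge": 0.006,
    "c5.4xlarge": 0.0054,
    "c5.9xlarge": 0.0082,
}


def percent_reduction(before: float, after: float) -> float:
    """Percentage by which the cost fell from before to after."""
    return (before - after) / before * 100


# Reduction for each instance type, keyed by instance name.
reductions = {
    name: percent_reduction(BEFORE[name], AFTER[name]) for name in BEFORE
}

# Cheapest instance type per kernel in each round of tests.
cheapest_before = min(BEFORE, key=BEFORE.get)
cheapest_after = min(AFTER, key=AFTER.get)

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1f}% cheaper per kernel")
```

Every instance type ends up more than 95% cheaper per kernel, and the cheapest option moves from the c5.4xlarge to the c4.2xlarge.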
Optimizing for costs at the beginning provides plenty of benefits. It gives you the flexibility to adjust your price-performance ratio easily. Teams can create forecasts of what testing will cost over time, and the people who pay the bill understand where their money is going. Keeping costs under control and keeping your accounting department happy leads to continued success in continuous integration in the cloud.
About the author
Major Hayden is a Principal Software Engineer at Red Hat with a focus on making it easier to deploy Red Hat Enterprise Linux wherever a customer needs it. He is also an amateur radio operator (W5WUT), and he maintains a technical blog at major.io.