Cloud-native architecture has an unusual benefit: adopting it, in many ways, requires breaking core IT systems into finer-grained pieces. The many services that were once bundled into a monolith, and therefore harder to identify or even understand individually, become distinct from each other. In this article, we will explore what those Lego pieces look like in a newly adopted cloud-native architecture and the value they bring.
A cloud-native game of divide-and-conquer
Cloud-native architecture better serves the main business purpose of reducing time to market in an era where “the app is the business.” However, because of the additional complexity it introduces, this only works when it is implemented properly, with sound management tooling and processes.
Cloud-native architecture shares the philosophy of the divide-and-conquer paradigm for software algorithms:
“Divide and conquer is an algorithm design paradigm based on multi-branched recursion. A divide-and-conquer algorithm works by recursively breaking down a problem into two or more sub-problems of the same or related type until these become simple enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem,” - Wikipedia.
Doesn’t the above quote look like the definition of a service mesh running on one or more Kubernetes clusters, where the whole application is delivered as a collaborative, multi-level composition of scope-limited microservices?
The microservices running under the service mesh provide specific and tangible contributions to the application's global features. In the best case, they already exist, so they are rock-solid and very efficient. If new, they are of limited functional scope, so they can be developed under the two-pizza team paradigm to be efficient and scalable.
Cloud-native: The challenge
While cloud-native architecture and the microservices it supports are an important step forward in enterprise-scale application development, they are not an architectural panacea. Like any new approach, cloud-native architecture has challenges that require investigation before adoption.
The most obvious challenges are size and the resulting complexity of scope. An implication of cloud-native architecture is the move from monolith to microservices, which brings a new level of granularity for software boundaries. Issues may arise, but only when this additional granularity is not anticipated or managed. Instead of having a single, big monolith to maintain, you now end up with hundreds or thousands of microservices to keep up and running to deliver the services expected by the end-user. For example, in the summer of 2020, Uber claimed to have more than 2,000 such critical microservices.
In the era of digital transformation, where “the app is the business,” you do not want to mess with SLAs and uptime! Each second a web application is unavailable can mean massive amounts of lost revenue and lost customers.
To make things a little scarier, such a microservice is not defined as a single artifact. When running under Kubernetes, a microservice is usually a composition of several items chosen among a pod, a deployment, a replicaset, a service, an ingress, a job, a service account, a role, etc. Every one of those definitions must be correct for the system to run as expected. This acts as a multiplying factor on the size, and hence the complexity, of the resulting system. In short, there is more “stuff” to be kept up and running!
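To make the multiplying factor concrete, here is a minimal Python sketch. The list of object kinds per service and the helper names (`Microservice`, `total_objects`, `TYPICAL_KINDS`) are assumptions chosen for illustration only; real services declare different sets of objects.

```python
from dataclasses import dataclass

# Hypothetical set of object kinds declared for one microservice.
# Real services vary; Pods and ReplicaSets are usually generated by a Deployment.
TYPICAL_KINDS = ("Deployment", "Service", "Ingress", "ServiceAccount", "Role")


@dataclass
class Microservice:
    name: str
    kinds: tuple = TYPICAL_KINDS


def total_objects(services):
    """Count every Kubernetes object definition across all services."""
    return sum(len(s.kinds) for s in services)


# 2,000 services, in line with the Uber figure quoted above.
fleet = [Microservice(f"svc-{i}") for i in range(2000)]
print(len(fleet), "microservices ->", total_objects(fleet), "objects")
# prints: 2000 microservices -> 10000 objects
```

Even with a modest five definitions per service, the number of items that must be correct is several times the number of microservices.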
In addition, the component-reuse trend described above doesn’t mean that an organization will run less software, just that the fraction of internally written software will get smaller. All in all, the total number of lines of code required to operate a single application will get much bigger. Kubernetes, Helm, etc., are not small pieces of software, and the reusable components are probably also bigger than their equivalent custom-fit versions because they must remain generic and widely applicable.
Of course, this additional software is standard and is well-debugged by the cloud-native community. Nonetheless, each software piece of this decoupled system architecture remains subject to failures, especially when put under pressure by insufficient resources, improper Kubernetes setup, security incidents, and the litany of other unknowns that occur in a cloud architecture.
With cloud-native architecture, the complexity pendulum swings from development toward operations. For developers, integrating third-party services over HTTP via RESTful APIs is much more reliable. For operations, however, this very granular architecture, built on a myriad of tiny and scope-limited items, means 10x, 100x, or even 1000x more objects to keep up and running in production, which puts more pressure on the operations team!
Moreover, increased agility fosters more deployments. So, change frequency can become another source of operational complexity. The new operational equation to deliver on SLAs of the fully digital era is: Fail (X objects x Y deployments) = 0 (almost…), with X and Y much larger than in the past.
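The informal equation above can be turned into a back-of-the-envelope calculation. The sketch below rests on assumptions that are not in the original argument: every object is touched by every deployment, each such event fails independently, and all events share the same failure probability `p`. The function name and the sample numbers are hypothetical.

```python
def p_any_failure(x_objects: int, y_deployments: int, p: float) -> float:
    """Probability of at least one failure across X * Y independent events,
    each failing with probability p (a simplifying assumption)."""
    events = x_objects * y_deployments
    return 1.0 - (1.0 - p) ** events


# Illustrative numbers only: a small system with few releases
# versus a large fleet of objects with frequent releases.
small = p_any_failure(x_objects=10, y_deployments=12, p=1e-4)
large = p_any_failure(x_objects=10_000, y_deployments=50, p=1e-4)
print(f"small system: {small:.4f}")
print(f"large system: {large:.4f}")
```

Under these assumptions, the same per-event reliability that is comfortable for a small system makes at least one failure nearly certain once X and Y grow, which is why the operational side needs automation.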
The bigger size and higher rollout frequency of software systems are meant to deliver architectural agility, but if they are not properly operationalized, all the benefits will be lost. Worse, a new kind of chaos may result. Operating and developing on such a system gets out of control because the number of objects and the number of events on those objects become gigantic.
Cloud-native: The good
Containers foster this granular cloud-native architectural style. Their foundational properties of immutability, isolation, and portability allow a system structure where basic components (microservices) are re-used as-is and assembled via requests/responses over the network. The result is a global application delivered to end-users. Isolation and the associated autonomy/independence guarantee that container images will run flawlessly since they have all the needed libraries and packages. There is no dependency on any host system component except the kernel. Immutability (i.e., absence of any change during the full lifecycle) and portability (i.e., a standard interface with the host system) ensure that execution will succeed identically on any substrate: bare metal, hypervisor, public/private cloud, Kubernetes clusters, basic container runtimes, etc.
Kubernetes technologies further ease this composition-based architecture. For example, Helm charts, often compared to Linux packages for distributed architectures, can efficiently re-use the existing functional blocks described by those charts. They provide methods to easily compose, package, and deploy small services as business features with a wider scope, much like package managers and dependency management software. While there are increasing amounts of code being deployed, these largely declarative forms of definition provide a solid foundation to build upon.
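The package-manager analogy can be sketched in a few lines of Python. This is a generic illustration of dependency-based composition, not a description of how Helm itself resolves or installs chart dependencies. The chart names and the `install_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical charts: each name maps to the charts it depends on.
CHARTS = {
    "web-shop": ["catalog", "checkout"],
    "catalog": ["postgresql"],
    "checkout": ["postgresql", "redis"],
    "postgresql": [],
    "redis": [],
    "unrelated-tool": [],
}


def install_order(charts, target):
    """Return the target and its transitive dependencies, dependencies first."""
    needed = {}
    stack = [target]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed[name] = charts[name]
            stack.extend(charts[name])
    # TopologicalSorter treats each mapped value as the node's predecessors.
    return list(TopologicalSorter(needed).static_order())


order = install_order(CHARTS, "web-shop")
print(order)
```

The business-level feature (`web-shop` in this sketch) comes last, after the smaller, reusable blocks it is composed from. Blocks it does not need are left out.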
The Cloud Native Computing Foundation (CNCF) proposes other similar technologies such as Falco configurations, Open Policy Agent policies, or OLM operators to package and deliver efficient cloud-native applications or any fraction of the services that they use. Those building blocks can be internal or come from third parties. To ensure their widest use, CNCF publishes them on the Artifact Hub. As of this writing, the site has an inventory of 2147 such public packages and is still growing!
This trend is clearly positive. We really are entering the era of Lego IT. We can create a large system delivering high-end services out of standard building blocks executed in their original form, identical to the ones used in thousands of other IT systems in the world. The benefits are multiple: lower costs, higher efficiency, reduced time to market, and improved reliability, to name the most important ones.
These advantages are the very positive consequences of container technology. On your Kubernetes cluster(s), you can run images that have been heavily validated elsewhere under near-identical conditions: the same parameters, the same software stack from top to bottom, the same network topology, and more. There are, of course, caveats to this exactness when we discuss compliance, and no one is recommending running unknown containers in your environment. More often than not, though, containers “just work.” This is a big improvement over the previous era, when you integrated foreign libraries (either open source software (OSS) or proprietary) into your own code base. Those libraries had been developed and tested in a context that was, by necessity, slightly different from yours, which created room for all kinds of nasty bugs in the edge cases of your operations.
So, the isolation brought by containers and the infrastructure abstraction delivered by Kubernetes allow you to build highly reliable and widely scalable software systems out of standard and foreign components running unchanged in your environment.
Does this mean that IT shops will write less software? Definitely, yes! Over time, we can expect that most shops will become experts at using those ready-to-use components and will incrementally get rid of their similar homemade software. This trend started in the era of OSS library integration, where developers learned how to develop less and integrate more. Helm and the like will push them further in this direction, although for now the integration is at arm’s length: the interaction with those services happens over the network rather than at the local binary level of an API.
This movement is profound. IT shops will source generic services through those ready-made artifacts. Organizations will be able to focus on the part of the software that is distinctive to their business. Further agility and increased competitiveness will result.
Cloud-native architecture is currently the best option for businesses that want to maximize the productivity and flexibility of the developers of their IT systems in order to compete better in the digital marketplace. If your target is to minimize the time to market for the latest features of your business application and so increase your competitiveness, cloud-native architecture is the way to go.
However, cloud-native architecture is no silver bullet. First, IT must pair it with a state-of-the-art toolchain allowing fully automated building, testing, and production deployments. It would be useless to be more agile in operations if your development and testing remain painfully slow. Second, the implementation of the production infrastructure and the definition of the associated operational processes must be thoroughly prepared. The teams must be trained and properly equipped to keep the new and more sophisticated system under control.
Eventually, it will boil down to proper tooling. Comprehensive automation covers building and testing on the development side and, on the operations side, rollout, monitoring, root cause analysis, corrective actions, and validation procedures, to name a few tasks. Such tooling must be in place because the number of objects to manage and the frequency of their changes grow beyond the capabilities of any team of human beings, especially when zero downtime at a massive scale is the target!
The new default architecture is cloud-native, and there is a great deal of opportunity for developer-oriented and operations-oriented architects alike.