Site Reliability Engineering (SRE), initially popularized by Google, is an operating model for solving the complex operational issues associated with scalable, highly reliable data center sites. As a practice founded in engineering, SRE has helped industries such as banking align business objectives with technical development and operations goals.

As our topic of discussion, we’re introducing the concept of “Service Reliability Engineering” (SvRE), which incorporates financial service regulatory requirements as part of providing a highly scalable and reliable digital banking service. 

Why financial institutions should focus on scaling reliable services

Public cloud providers focus on site reliability so they can competitively provide reliable compute and storage services. Site downtime is costly, both monetarily and to an organization’s reputation. To minimize site outages and retain accountability for reliability, service deployment and support activities are embedded in the application development role: you build it, you support it.

Financial institutions are in the business of customer trust, which rests principally on customers knowing that their funds are safe and available at the time and in the manner they choose. To inspire trust, banks must minimize risk and provide secure, reliable, responsive, resilient, and always-available services. Adoption of digital banking services continues to accelerate, and the need to scale reliable services has never been greater.

While website providers and financial institutions share similar reliability objectives as business goals, financial institutions are held to additional regulatory compliance requirements which mandate the segregation of responsibility—minimizing and eliminating as much operational and financial risk as possible. 

Financial institutions must also comply with regulations that require sufficient controls to isolate functions and ensure that no single function has end-to-end responsibility for a process, since that could compromise financial transactions or cause data loss. So, in a regulatory irony: if you build it, you can’t support it.

A new concept: Service Reliability Engineering (SvRE)

“Service Reliability Engineering” (SvRE) can help bridge this necessary gap between functional development and post-deployment support, incorporating financial service regulatory requirements to separate responsibilities from the start. 

As part of DevOps methodology, financial institutions have already implemented controls separating development and operational support as part of continuous delivery pipelines. These pipelines have built-in security, compliance, and segregation of responsibilities. Controls limit production access for developers and access to source code for operations teams. 
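The controls described above can be sketched as a simple pipeline gate. This is a minimal illustration only: the role names, permission names, and the `Change` record are hypothetical assumptions, not taken from any particular pipeline product or regulation.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical role-to-permission map. The names are illustrative
# assumptions: developers touch source, operations touch production.
ROLE_PERMISSIONS = {
    "developer": {"read_source", "write_source"},
    "operations": {"deploy_production", "read_production_logs"},
}


@dataclass(frozen=True)
class Change:
    author: str    # person who wrote the code
    deployer: str  # person who releases it to production


def can(role: str, permission: str) -> bool:
    """Return True if the role is granted the permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())


def violates_segregation(change: Change, roles: dict[str, str]) -> list[str]:
    """Return the reasons a change breaks segregation of responsibilities."""
    problems = []
    # No single person may both build and release the same change.
    if change.author == change.deployer:
        problems.append("author and deployer are the same person")
    # The author must hold a role that is allowed to modify source code.
    if not can(roles.get(change.author, ""), "write_source"):
        problems.append("author lacks source code access")
    # The deployer must hold a role that is allowed to touch production.
    if not can(roles.get(change.deployer, ""), "deploy_production"):
        problems.append("deployer lacks production access")
    return problems
```

A pipeline step could call `violates_segregation` before a release and block the deployment whenever the returned list is not empty.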

Within SvRE, the application (the functional part of the service) needs to be separated from the platform (the set of technologies that the application depends on to run) in order to isolate the dual responsibilities of building and supporting services. This becomes more complicated in financial institutions, where four organizations are typically involved in the application delivery process, which introduces logistical complexity:

  • Application Development. 

  • Application Support. 

  • Application Deployment and Release. 

  • System Support. 

Each of these teams has a distinct role. Alas, with manual handovers and complex ticketing systems used in the delivery and maintenance of applications, it becomes hard to identify specific owners associated with the reliability of a particular service. 

Often, the System Support team assumes final responsibility for reliability without a native understanding of the application or of how it interacts with the underlying platform.

In smaller organizations, where a whole system is designed to support only a few functions (or applications), it can be easier to develop deep knowledge of both the applications and the underlying systems, which mitigates some of the complexity. However, in large financial institutions and those with intricate systems and dependencies, it is no small feat for any one group to possess expert knowledge of the platform, the underlying infrastructure, and the application, all of which is required to maintain the separation of responsibilities needed for SvRE due diligence and compliance.

A possible SvRE solution

An approach to achieve the twin goals of support and regulatory compliance might be to establish Application Recovery Engineers (ARE) and Platform Recovery Engineers (PRE) practices. 

Developing strong expertise in application recovery and platform availability mimics the industry definition of SRE organizational responsibilities, with each role having respective assignments for application reliability and platform reliability. By adopting service level objectives (SLOs) and error budgets [1] as common measures governing service reliability, AREs and PREs can work together to achieve organizational metrics that balance market agility and reliability and establish a framework for measuring and tolerating allowable risk.
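To make the error budget idea concrete, here is a minimal sketch of the arithmetic, assuming a request-based SLO. The function names and the sample figures are illustrative assumptions, not part of any standard library or of the model described here.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests allowed to fail: the gap between 100% and the SLO."""
    # A target of exactly 100% leaves no budget, so it is rejected here.
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return 1.0 - slo


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = error_budget(slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


# Illustrative figures: a 99.9% SLO over 1,000,000 requests tolerates
# roughly 1,000 failures; 250 failures spends about a quarter of the budget.
remaining = budget_remaining(slo=0.999, total_requests=1_000_000, failed_requests=250)
print(f"{remaining:.0%} of the error budget remains")
```

One common policy, offered here only as an example, is for AREs and PREs to agree that riskier releases pause once the remaining budget reaches zero and resume when it recovers.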

Promoting continuous feedback processes across teams (metrics, weekly feedback sessions, joint problem solving, testing, common automation frameworks, etc.) would help mitigate the risk of AREs and PREs diverging from their essential function: securing the reliability of application services.

As a best practice, SvRE should be limited to a small set of critical applications, specifically those visible to customers.

A SvRE platform model - putting it all together

Teams are best supported by technology that enforces the separation of responsibilities. 

As illustrated in the figure, a platform that provides a reliable way to address ARE and PRE concerns helps ensure that their commitments to organizational mandates are met. The platform isolates capabilities in the application space (where specific projects reside, along with their configurations) from the application nodes (where containers run) and from the platform’s control plane.

Furthermore, when the technology provides the flexibility to run any project or container in any control plane, whether behind the institution’s firewall or in one or more cloud provider sites, confidence in the standardized way of segregating responsibilities is retained because it is built into a consistent platform.

Figure: SvRE platform diagram

The right platform technology can help address the needs for observability, security, and application and infrastructure immutability, along with a more secure pipeline that includes release capabilities. It can also help automate the manual, repetitive, tactical tasks that provide no enduring value and scale linearly as a service grows.

As in most organizations, an operational shift is under way in financial institutions: one that promotes proactive prevention and will benefit reliability, service deployment, and support activities, all of which can adhere to regulatory and business needs.

To learn more, explore our webinar on using technology as a business strategy in financial services, and read about Red Hat’s approach to applying automation for financial services in hybrid and multi-cloud environments, including our hybrid cloud banking checklist.

 

[1] An error budget is the gap between theoretically perfect reliability and an acceptable service level objective agreed upon by the business and technology stakeholders. Source: David N. Blank-Edelman, Seeking SRE: Conversations About Running Systems at Scale, O'Reilly Media Inc., 2018.

About the author

A veteran in the financial services industry, Jamil Mina is passionate about the value of open source and how it can help financial institutions be successful in achieving their Digital Transformation objectives. As Chief Architect for Financial Services at Red Hat, his goal is to be a strategic partner and trusted adviser to his clients, which means investing a lot of time listening to their needs and concerns. 
