Site Reliability Engineering (SRE), initially popularized by Google, is an operating model for solving the complex operational issues associated with scalable, highly reliable data center sites. As a practice founded in engineering, SRE has helped industries such as banking align business objectives with technical development and operations goals.

As our topic of discussion, we’re introducing the concept of “Service Reliability Engineering” (SvRE), which incorporates financial service regulatory requirements as part of providing a highly scalable and reliable digital banking service. 

Why financial institutions should focus on scaling reliable services

Public cloud providers are concerned with site reliability because they must provide reliable compute and storage services competitively: site downtime is costly both monetarily and to an organization’s reputation. To minimize site outages and retain accountability for reliability, service deployment and support activities are embedded in the application development role: you build it, you support it.

Financial institutions are in the business of customer trust, which rests principally on customers knowing that their funds are safe and available at the time and in the manner they choose. To inspire trust, banks must minimize risk and provide secure, reliable, responsive, resilient, and always-available services. Digital banking service adoption continues to accelerate, and the need to scale reliable services has never been greater.

While website providers and financial institutions share similar reliability objectives as business goals, financial institutions are held to additional regulatory compliance requirements that mandate the segregation of responsibilities, with the aim of eliminating as much operational and financial risk as possible.

Financial institutions must also comply with regulations that require sufficient controls to isolate functions and ensure that no single function has end-to-end responsibility for a process in a way that could compromise financial transactions or cause data loss. So, in a regulatory irony: if you build it, you can’t support it.

A new concept: Service Reliability Engineering (SvRE)

“Service Reliability Engineering” (SvRE) can help bridge the gap that regulation creates between functional development and post-deployment support, incorporating financial service regulatory requirements to separate responsibilities from the start.

As part of DevOps methodology, financial institutions have already implemented controls separating development and operational support as part of continuous delivery pipelines. These pipelines have built-in security, compliance, and segregation of responsibilities. Controls limit production access for developers and access to source code for operations teams. 
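The access controls described above can be sketched in a few lines. This is a minimal illustration only, not any institution's actual pipeline policy: the role names, the resource names, the `ALLOWED` table, and both helper functions are hypothetical, chosen to mirror the rule that developers do not reach production and operations teams do not reach source code.

```python
from enum import Enum


class Role(Enum):
    DEVELOPER = "developer"
    OPERATIONS = "operations"


class Resource(Enum):
    SOURCE_CODE = "source_code"
    PRODUCTION = "production"


# Hypothetical policy table: each role may touch only its own side of the pipeline.
ALLOWED = {
    Role.DEVELOPER: {Resource.SOURCE_CODE},
    Role.OPERATIONS: {Resource.PRODUCTION},
}


def can_access(role: Role, resource: Resource) -> bool:
    """Return True if the role is permitted to access the resource."""
    return resource in ALLOWED.get(role, set())


def violates_segregation(author: str, deployer: str) -> bool:
    """Flag a change that was written and deployed by the same person."""
    return author == deployer


if __name__ == "__main__":
    # A developer asking for production access is refused by the policy table.
    print(can_access(Role.DEVELOPER, Resource.PRODUCTION))
    # A change authored by one person and deployed by another passes the check.
    print(violates_segregation("alice", "bob"))
```

In a real pipeline such a check would more likely live in the delivery tooling's own access-control configuration than in application code; the sketch only shows the shape of the rule.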

Within SvRE, the application (the functional part of the service) needs to be separated from the platform (the set of technologies that the application is dependent on to run) in order to isolate the dual responsibilities of building and supporting services. This becomes more complicated in financial institutions, which typically have four organizations involved in the application delivery process, introducing logistical complexity, namely: 

  • Application Development. 

  • Application Support. 

  • Application Deployment and Release. 

  • System Support. 

Each of these teams has a distinct role. Alas, with manual handovers and complex ticketing systems used in the delivery and maintenance of applications, it becomes hard to identify specific owners associated with the reliability of a particular service. 

Often, the System Support team assumes final responsibility for reliability, doing so without native understanding of the application or of how it interacts with the underlying platform.

In smaller organizations, where a whole system is designed to support only a few functions (or applications), it can be easier to develop deep knowledge of both the applications and the underlying systems, which mitigates some of the complexity. However, in large financial institutions and those with intricate systems and dependencies, it is no small feat for any one group to possess expert knowledge of the platform, the underlying infrastructure, and the application, all of which is needed to maintain the division of responsibilities that SvRE due diligence and compliance require.

A possible SvRE solution

One approach to achieving the twin goals of support and regulatory compliance might be to establish Application Recovery Engineer (ARE) and Platform Recovery Engineer (PRE) practices.

Developing strong expertise in application recovery and platform availability mimics the industry definition of SRE organizational responsibilities, with each role having respective assignments for application reliability and platform reliability. By adopting service level objectives (SLOs) and error budgets [1] as common measures governing service reliability, AREs and PREs can work together to achieve organizational metrics that balance market agility and reliability and establish a framework for measuring and tolerating allowable risk.
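To make the error budget idea concrete, here is a minimal sketch of the arithmetic. It assumes the SLO is expressed as a success ratio over requests in a measurement window, which is one common convention and not something the text above specifies; the `ErrorBudget` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Error budget for one service over one measurement window (hypothetical model)."""

    slo_target: float      # agreed success ratio, e.g. 0.999 for "three nines"
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that did not meet the objective

    @property
    def allowed_failures(self) -> float:
        # The budget is the gap between perfect reliability (1.0) and the SLO.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        # Failures the service can still absorb before breaching the SLO.
        return self.allowed_failures - self.failed_requests

    @property
    def exhausted(self) -> bool:
        return self.remaining <= 0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"allowed failures: {budget.allowed_failures:.0f}")
    print(f"remaining budget: {budget.remaining:.0f}")
    print(f"exhausted: {budget.exhausted}")
```

Under this model, a shared budget gives both practices one number to watch: while budget remains, releases can proceed; once it is exhausted, attention shifts to reliability work.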

Promoting continuous feedback processes across teams (metrics, weekly feedback sessions, joint problem solving, testing, common automation frameworks, etc.) would help mitigate the risk of the two practices diverging from their essential function: securing the reliability of application services.

As a best practice, SvRE should be limited to a small set of critical applications, specifically those visible to customers.

A SvRE platform model - putting it all together

Teams are best supported by technology that enforces the separation of responsibilities. 

As illustrated in the figure, a platform that provides a reliable way to address ARE and PRE concerns helps ensure that their commitments to organizational mandates are met. Such a platform isolates the application space (where specific projects reside, along with their configurations) from the application nodes (where containers run) and from the control plane of the platform.

Furthermore, when the technology provides the flexibility to run any project or container in any control plane, whether behind the institution’s firewall or in one or more cloud provider sites, confidence in the segregation of responsibilities is retained because it is built into a consistent platform.

SvRE platform diagram

The right platform technology can help address the needs for observability, security, and application and infrastructure immutability, along with a more secure pipeline that includes release capabilities. It can also help automate the manual, repetitive, tactical tasks that provide no enduring value and scale linearly as a service grows.

As in most organizations, an operational shift is happening in financial institutions: one that promotes proactive prevention and that will benefit reliability, service deployment, and support activities (all of which can adhere to regulatory and business needs).

Explore our video webinar presentation on utilizing technology as a business strategy in financial services to learn more. Check out more about Red Hat’s approach in applying automation for financial services in hybrid and multi-cloud environments including our hybrid cloud banking checklist.

 

[1] An error budget is the gap between theoretically perfect reliability and an acceptable service level objective agreed upon by the business and technology stakeholders. Source: David N. Blank-Edelman, Seeking SRE: Conversations About Running Systems at Scale, O'Reilly Media Inc., 2018.

About the author

A veteran in the financial services industry, Jamil Mina is passionate about the value of open source and how it can help financial institutions be successful in achieving their Digital Transformation objectives. As Chief Architect for Financial Services at Red Hat, his goal is to be a strategic partner and trusted adviser to his clients, which means investing a lot of time listening to their needs and concerns. 

Previously, Mina was a leader at BMO Financial Group, a large financial institution in Canada and in the top 50 worldwide. During his tenure, he was responsible for enabling continuous deployment, data center automation, and container orchestration. Driving complex transformational capabilities required strong collaboration with the business and application development teams.

Mina has a Master of Business Administration with a specialization in financial services from Dalhousie University.
