Yesterday, the Kubernetes Product Security team released information about two significant bugs in Kubernetes, which were assigned CVE-2017-1002101 and CVE-2017-1002102. OpenShift is built upon Kubernetes, and as such these bugs were also present in both OpenShift Online and OpenShift Dedicated. Red Hat, along with Google and other members of the Cloud Native Computing Foundation, worked to create and coordinate the release of security fixes for the affected products.

In response to these security errata, once the embargo was lifted, the OpenShift SRE team worked around the clock across three geographic regions (the Americas, APAC, and EMEA) to remediate the bugs on all affected clusters.

Remediation began at approximately noon Eastern (16:00 UTC) on Monday, March 12th, with internal and test clusters patched before any updates were made in production. The usual and customary update tooling had been modified ahead of time, in response to a prior incident post-mortem, so that it could handle the unusual nature of the patch. Instead of applying the fix as a system errata, on most clusters the individual OpenShift components were upgraded in place and restarted. For a small number of starter-tier clusters, an automated product upgrade was performed to remediate and upgrade simultaneously.
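To make the in-place approach concrete, here is a minimal Ansible sketch, assuming the node component ships as an RPM and runs under systemd; the package and service names are illustrative assumptions, not the actual remediation playbook:

```yaml
---
# A minimal sketch of an in-place patch: upgrade the affected component
# and restart its service, with no reboot and no full cluster upgrade.
# Package and service names below are assumptions for illustration.
- name: Patch OpenShift node components in place
  hosts: nodes
  serial: "10%"                        # roll through the fleet in small batches
  become: true
  tasks:
    - name: Install the patched node package
      ansible.builtin.yum:
        name: atomic-openshift-node    # assumed package name
        state: latest

    - name: Restart the service so the fix takes effect without a reboot
      ansible.builtin.systemd:
        name: atomic-openshift-node    # assumed service name
        state: restarted
```

Rolling through the fleet in small batches keeps the bulk of each cluster serving traffic while individual nodes restart their components.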

All externally exposed production clusters were remediated by 12:30 Eastern (16:30 UTC) on Tuesday, March 13th. Because the SRE team spans the globe, each OpenShift Dedicated cluster was patched during the customer’s preferred maintenance window, typically overnight for that region.

A small number of nodes saw isolated outages as other issues came to light, but in the vast majority of cases, no reboots or node outages were required.

As is always the case, the focus was on patching public clusters (the starter and pro tiers of OpenShift Online) ahead of non-public clusters (all OpenShift Dedicated clusters), due to their increased attack surface.

The remediation process was entirely automated, including raising and lowering customer notification banners. This ensured that, even though remediation was performed on an accelerated timeline, customers were always kept informed of its status and progress. Additionally, removing cluster nodes from our maintenance systems and re-adding them afterwards was automated, avoiding false alerts during the process and easing collaboration between team members.
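As a sketch of that wrapper pattern, the patch itself can be bracketed by tasks that raise the banner and silence monitoring beforehand, then reverse both afterwards. The endpoints and the imported task file below are hypothetical stand-ins for our internal systems:

```yaml
---
# Hypothetical wrapper around a remediation run: notify customers and
# silence monitoring first, patch, then undo both. All URLs and the
# imported task file are illustrative assumptions.
- name: Remediate a cluster with automated customer notification
  hosts: masters
  become: true
  tasks:
    - name: Raise the customer notification banner
      ansible.builtin.uri:
        url: "https://status.example.com/api/banner"    # assumed endpoint
        method: POST
        body_format: json
        body: { message: "Maintenance in progress" }
        status_code: [200, 201]

    - name: Schedule monitoring downtime to avoid false alerts
      ansible.builtin.uri:
        url: "https://monitor.example.com/api/downtime/{{ inventory_hostname }}"  # assumed
        method: POST

    - name: Apply the security patch
      ansible.builtin.import_tasks: remediate-cve.yml   # assumed task file

    - name: End monitoring downtime
      ansible.builtin.uri:
        url: "https://monitor.example.com/api/downtime/{{ inventory_hostname }}"
        method: DELETE

    - name: Lower the customer notification banner
      ansible.builtin.uri:
        url: "https://status.example.com/api/banner"
        method: DELETE
```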

Some of the key elements that enabled such a rapid response include:

  • Our remediation automation and our installer automation are written with the same tool, Ansible. This allows significant re-use of code and sharing of expertise within the team, and it scales seamlessly with the environment (see the sketch after this list).
  • Pre-written automation tools are created with enough flexibility to handle routine and non-routine remediation requirements.
  • Collaboration tools, including screen sharing and video conferencing, allow SRE team members across the globe to work simultaneously and to hand off between regions to “follow the sun”.
  • Rigorous and detailed post-mortems, held after every remediation effort, allow us to mature and enhance our automated tooling. We will always have unexpected events that force us back to manual processing, but we rarely have them more than once.
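As an illustration of the code re-use mentioned in the first point above, the installer and the remediation playbooks can both apply the same role, so component-handling logic lives in one place. The role name and variable here are hypothetical:

```yaml
# install.yml (excerpt): the full installer applies a shared role.
- name: Install an OpenShift cluster
  hosts: all
  roles:
    - openshift_node                   # assumed shared role

# remediate.yml (excerpt): the remediation playbook re-uses the same
# role, switched into upgrade-only mode by a hypothetical variable.
- name: Targeted security remediation
  hosts: nodes
  serial: "10%"
  roles:
    - role: openshift_node
      vars:
        node_upgrade_only: true        # assumed flag
```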

For more information, please refer to:

Red Hat OpenShift SRE Team

