The Red Hat Product Security major incident process considers numerous factors to assess whether a vulnerability or event qualifies. The first and foremost is risk exposure, which includes the perceived risk to our customers and to the Red Hat brand. Security risk from trivial mass exploitation of mission-critical software ranks highest. However, other kinds of risk are also considered, such as the risk posed by misinformation, or by panic resulting from confusing or exaggerated media coverage.
The purpose of the major incident process is not only to prioritize fixes, which happens regardless of major incident status whenever an issue receives a Critical or Important severity rating, but also to keep clear, calm, accurate, and well-defined communication channels open, both internally and with customers. On the customer-facing side, this is done by issuing additional artifacts, such as a Red Hat Security Bulletin (RHSB), Red Hat Insights detection rules, and Ansible remediation Playbooks, alongside the usual security errata and CVE page entries.
This section provides details on 2 of the 3 major incidents for which Red Hat Security Bulletins were released in 2023. For each event, this section covers:
- The basics
- The details: for those interested in understanding the inner workings of software vulnerabilities
- The statistics: such as the number of affected product versions
- An estimate of the time it took Red Hat associates to address the issue
- A report on whether the vulnerability was exploited in the wild and, if available, the exact time we received reports of exploitation
RHSB-2023-002: Quarkus Security Policy Bypass - Quarkus CVE-2023-4853
The basics: This was a fairly simple vulnerability that is relatively easy to understand. Red Hat build of Quarkus implemented path-based access control in a way that did not normalize pathnames the same way HTTP request routing did. As a result, path-based access control could be bypassed by inserting additional forward slash (“/”) characters into the pathname when requesting a protected resource.
The details: Several components within the Red Hat build of Quarkus allowed access control policies to be defined on paths. The affected components were QUARKUS-VERTX-HTTP, QUARKUS-UNDERTOW, QUARKUS-CSRF-REACTIVE, and QUARKUS-KEYCLOAK-AUTHORIZATION. All 4 components contained slightly different implementations of the same flawed pattern: the matching logic did not properly normalize paths before comparing them to the policy. A request path that was not an exact match for the policy definition, but that normalized to the protected resource, therefore bypassed the policy and still reached that resource.
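To make the pattern concrete, here is a minimal Python sketch, assuming a hypothetical policy that protects everything under `/admin` and a router that collapses repeated slashes before dispatching a request. The names `normalize_path`, `naive_requires_auth`, and `fixed_requires_auth` are illustrative only; this is not Quarkus code or its configuration API, and the actual fixes differ in each affected component.

```python
import re

# Hypothetical policy for this sketch: every path under /admin requires auth.
PROTECTED_PREFIX = "/admin"


def normalize_path(path: str) -> str:
    """Collapse runs of '/' into a single '/', approximating a router that
    tolerates extra slashes (an assumption for illustration)."""
    return re.sub(r"/{2,}", "/", path)


def _matches_policy(path: str) -> bool:
    # Prefix match on whole path segments: "/admin" or "/admin/...".
    return path == PROTECTED_PREFIX or path.startswith(PROTECTED_PREFIX + "/")


def naive_requires_auth(path: str) -> bool:
    """Flawed check: compares the raw request path to the policy."""
    return _matches_policy(path)


def fixed_requires_auth(path: str) -> bool:
    """Safer check: normalizes the path the same way the router does first."""
    return _matches_policy(normalize_path(path))


if __name__ == "__main__":
    for request_path in ["/admin/users", "//admin/users", "///admin"]:
        print(
            f"{request_path!r} routes to {normalize_path(request_path)!r}: "
            f"naive={naive_requires_auth(request_path)}, "
            f"fixed={fixed_requires_auth(request_path)}"
        )
```

For `//admin/users` and `///admin`, the naive check returns False even though the router resolves both to protected paths, while the fixed check returns True for all three. The underlying design point is that the authorization check and the router should agree on a single canonical form of the path.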
The statistics:
Affected major product versions: Red Hat OpenShift Serverless 1, Red Hat build of Apache Camel, Red Hat build of Quarkus, Red Hat Integration 2, Red Hat Process Automation Manager 7
Estimated associate time: 1024 hours
Severity rating: Important
Embargo time: 0 days
Time from public to first fix release: 7 days / 1009 days (from the security report / from the original public report)
Time from public to all fixes released: 28 days / 1092 days (from the security report / from the original public report)
Exploit code published: N/A. No exploit code is required; a request path with extra slashes is enough.
Exploitation in the wild: Not reported but likely happening on some scale.
Closing thoughts: Two things stand out about this vulnerability. First, it is a great example of how complexity makes security challenging. Comparing pathnames seems trivial on the surface, but it becomes fraught with difficulty because of the complexity of the platform and the various layers at play. Second, the long gap between public disclosure and the fix is noteworthy. The delay in the upstream project occurred because the security implications were not fully understood when the flaw was first reported publicly upstream as a GitHub issue, and there was some discussion of whether the behavior was expected or a documentation issue. When the issue was reported again nearly 3 years later, the security implication was recognized and the fixes were prioritized.
RHSB-2023-003: HTTP/2 Rapid Reset CVE-2023-44487 and CVE-2023-39325
The basics: Around mid-to-late August 2023, Amazon Web Services (AWS), Cloudflare, and Google all noticed a massive spike in what appeared to be HTTP denial of service (DoS) traffic.
Closer inspection revealed that the attack used a novel method, abusing features of the HTTP/2 protocol itself to consume far more server resources than the same volume of traffic would normally consume.
The details: At the heart of the HTTP/2 protocol is the concept of establishing multiple bidirectional communication streams over a single TCP connection. Communication over these streams is multiplexed, so many requests can be in flight simultaneously over one connection without the overhead of establishing multiple TCP sessions, and without waiting for each request to complete in turn, as earlier HTTP versions require.
For this to work, both sides must maintain a state machine for each stream, similar to what happens at the TCP layer. The protocol defines the valid transitions between states for establishing and shutting down streams, transmitting and receiving data, and handling error conditions. All of this is defined in the HTTP/2 specification, RFC 9113, which we summarize here.
For clients and servers to interoperate smoothly, the protocol provides several tunable parameters that the endpoints exchange to agree on which settings work best for the session. One of these settings advertises the maximum number of streams an endpoint will allow its peer to initiate concurrently (SETTINGS_MAX_CONCURRENT_STREAMS). If either endpoint attempts to initiate a new stream after this limit has been reached, the receiving endpoint must reject the attempt with one of 2 possible error codes, PROTOCOL_ERROR or REFUSED_STREAM, delivered in a RST_STREAM frame.
There is an interesting interaction between the state machine, the RST_STREAM frame, and the SETTINGS_MAX_CONCURRENT_STREAMS parameter. The state machine has 5 main states: IDLE, RESERVED, OPEN, HALF-CLOSED, and CLOSED (RFC 9113 further splits RESERVED and HALF-CLOSED into local and remote variants). Only streams in the OPEN and HALF-CLOSED states count toward the SETTINGS_MAX_CONCURRENT_STREAMS limit. Any stream that is not IDLE transitions immediately to the CLOSED state when a RST_STREAM frame is sent or received for it, and thus no longer counts toward the limit.
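The stream accounting described above can be modeled in a few lines. The following is a simplified Python sketch, not taken from any real HTTP/2 implementation: `StreamTracker`, its methods, and the string error codes are hypothetical, and the local and remote variants of the reserved and half-closed states are collapsed into one state each.

```python
from enum import Enum
from typing import Dict, Optional


class StreamState(Enum):
    # Simplified: RFC 9113 splits RESERVED and HALF_CLOSED into local/remote variants.
    IDLE = "idle"
    RESERVED = "reserved"
    OPEN = "open"
    HALF_CLOSED = "half-closed"
    CLOSED = "closed"


# Only these states count toward SETTINGS_MAX_CONCURRENT_STREAMS.
ACTIVE_STATES = {StreamState.OPEN, StreamState.HALF_CLOSED}


class StreamTracker:
    """Hypothetical bookkeeping for streams initiated by the peer."""

    def __init__(self, max_concurrent_streams: int) -> None:
        self.max_concurrent_streams = max_concurrent_streams
        self.states: Dict[int, StreamState] = {}

    def active_count(self) -> int:
        return sum(1 for state in self.states.values() if state in ACTIVE_STATES)

    def on_headers(self, stream_id: int) -> Optional[str]:
        """Handle a HEADERS frame that opens a new stream.

        Returns the error code to send in a RST_STREAM frame if the limit
        is already reached, or None if the stream is accepted."""
        if self.active_count() >= self.max_concurrent_streams:
            self.states[stream_id] = StreamState.CLOSED
            return "REFUSED_STREAM"  # PROTOCOL_ERROR is also permitted
        self.states[stream_id] = StreamState.OPEN
        return None

    def on_rst_stream(self, stream_id: int) -> None:
        """A RST_STREAM frame moves any non-idle stream straight to CLOSED."""
        if self.states.get(stream_id, StreamState.IDLE) is StreamState.IDLE:
            # RST_STREAM for an idle stream is a connection error in RFC 9113.
            raise ValueError("PROTOCOL_ERROR: RST_STREAM for an idle stream")
        self.states[stream_id] = StreamState.CLOSED


if __name__ == "__main__":
    # Client-initiated stream IDs are odd, hence 1, 3, 5, 7.
    tracker = StreamTracker(max_concurrent_streams=2)
    print(tracker.on_headers(1))  # None: accepted
    print(tracker.on_headers(3))  # None: accepted
    print(tracker.on_headers(5))  # REFUSED_STREAM: limit of 2 reached
    tracker.on_rst_stream(1)      # stream 1 is now CLOSED
    print(tracker.on_headers(7))  # None: the reset freed a slot
```

The key property is in `on_rst_stream`: the reset frees a slot in the limit immediately, whatever work the stream may have started.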
Once this mechanism is understood, the attack itself is straightforward. A client opens a new stream to request a resource and then immediately sends a RST_STREAM frame to cancel it. This might seem harmless from the state machine's perspective, but the state machine does not exist in a vacuum; it is there to facilitate the communication transport. By the time the RST_STREAM frame is received and the stream transitions to CLOSED, the resource request may already be queued or in progress, either on the endpoint itself or, in many cases, on another system. Because the closed stream no longer counts toward the limit, the client can immediately open a new one. Repeated rapidly, this consumes a significant amount of resources regardless of the SETTINGS_MAX_CONCURRENT_STREAMS setting. This is what happened in the August attacks, allowing a relatively small botnet to cause significantly more impact than had been seen in the past.
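A toy simulation shows why the concurrent-stream limit does not bound the work a client can trigger. This Python sketch is a hypothetical model, not code from any affected product. It assumes the server hands each request to a backend queue as soon as the HEADERS frame arrives, that a later RST_STREAM does not remove queued work, and that no request completes during the burst. `Server`, `well_behaved_client`, and `rapid_reset_client` are illustrative names.

```python
from collections import deque
from typing import Deque, Set, Tuple


class Server:
    """Hypothetical HTTP/2 server model: it enforces the concurrent-stream
    limit on stream state, but hands each request to a backend queue that a
    later RST_STREAM does not drain."""

    def __init__(self, max_concurrent_streams: int) -> None:
        self.max_concurrent_streams = max_concurrent_streams
        self.open_streams: Set[int] = set()
        self.backend_queue: Deque[Tuple[int, str]] = deque()
        self.refused = 0
        self.peak_open = 0

    def on_headers(self, stream_id: int, path: str) -> bool:
        if len(self.open_streams) >= self.max_concurrent_streams:
            self.refused += 1  # would answer with RST_STREAM(REFUSED_STREAM)
            return False
        self.open_streams.add(stream_id)
        self.peak_open = max(self.peak_open, len(self.open_streams))
        # Work is dispatched before the server learns the client cancelled.
        self.backend_queue.append((stream_id, path))
        return True

    def on_rst_stream(self, stream_id: int) -> None:
        # The stream becomes CLOSED and stops counting toward the limit,
        # but in this model its queued backend work is left in place.
        self.open_streams.discard(stream_id)


def well_behaved_client(server: Server, requests: int) -> None:
    """Opens streams and never cancels them (no request completes here)."""
    for i in range(requests):
        server.on_headers(2 * i + 1, "/expensive")  # client stream IDs are odd


def rapid_reset_client(server: Server, requests: int) -> None:
    """Opens a stream, then resets it immediately, over and over."""
    for i in range(requests):
        stream_id = 2 * i + 1
        server.on_headers(stream_id, "/expensive")
        server.on_rst_stream(stream_id)


if __name__ == "__main__":
    polite = Server(max_concurrent_streams=100)
    well_behaved_client(polite, 10_000)
    print(len(polite.backend_queue), polite.refused)  # 100 9900

    attacker = Server(max_concurrent_streams=100)
    rapid_reset_client(attacker, 10_000)
    print(len(attacker.backend_queue), attacker.refused, attacker.peak_open)  # 10000 0 1
```

In this model, the well-behaved client is capped at 100 queued requests, while the rapid-reset client queues all 10,000 without ever having more than one stream open. How much work a real server starts before it processes the reset depends on the implementation.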
The statistics:
Affected major product versions: As this was a protocol flaw, almost all products were affected.
Estimated associate time: 6000+ hours
Severity rating: Important
Embargo time: 1 day
Time from public to first fix release: 7 days
Time from public to all fixes released: 22 days
Exploit code published: Yes
Exploitation in the wild: Yes. This issue was discovered because of an active attack in the wild and was added to the CISA Known Exploited Vulnerabilities (KEV) catalog within the month.
Closing thoughts: It is unusual for a flaw that has no impact on the confidentiality or integrity of a system to be treated as a major incident, but that is what we saw with Rapid Reset. Because this was a protocol flaw, a great deal of software was affected, making the response effort especially costly, as reflected in the estimate of more than 6,000 associate hours. This flaw was also discovered because of an active attack against 3 very large hyperscalers, which had to exert considerable effort to defend against it, even at the enormous scale at which they operate daily.
A special thank you is in order to AWS, Cloudflare, and Google Cloud Platform. They not only worked to block the attack for their customers but also provided considerable technical analysis, which allowed the open source community to get fixes into the many code bases affected by the flaw and reduce its value as a future attack vector.