If you've ever worked for or with enterprise companies, you know that when it comes to software, AI-powered or not, the stakes could not be higher. That is why they invest heavily in making their production environments as bulletproof as possible: they architect for high availability and disaster recovery, enforce strict service level agreements (SLAs), and build redundancy into every possible layer.

But if their architecture doesn’t also account for the potential for human error, is any of it worth the effort? Time and again, we’ve seen catastrophic outages traced back to a wrong mouse click, a badly written command, or a rushed deployment. Let’s review a few of them:

  • Amazon S3 outage (Feb 2017): An engineer mistyped a command intended to remove a few servers and instead took down critical S3 subsystems. The disruption lasted several hours, affected GitHub, Slack, and other big players, and by some third-party estimates cost affected companies well over a hundred million dollars.
  • Facebook global outage (Oct 2021): A command issued during routine backbone maintenance disconnected Facebook's data centers and led to the withdrawal of critical Border Gateway Protocol (BGP) routes, causing Facebook, Instagram, and WhatsApp to vanish from the internet for nearly six hours. It was a lesson in how a single manual change can break world-class redundancy.
  • CrowdStrike Falcon update meltdown (July 2024): A single defective configuration file shipped in an automated security-agent update sent millions of Windows machines into a reboot loop, grounding airlines, banks, and hospitals worldwide. Again, a preventable human mistake, just at internet scale.

As a personal example, I once worked with a customer whose datacenter cleaning staff unplugged a production rack server to plug in a vacuum cleaner. That predictable and preventable mistake caused six hours of serious downtime (and I’m only being vague because the incident is recognizable). The solution would not be to fire the cleaner, but to install lockable power outlets, so critical equipment cannot be disconnected without a key.

The moral of these stories is: True enterprise resilience cannot ignore the human factor. It must eliminate opportunities for a single action to bypass all safeguards.

Our own evolution when it comes to risk

When I joined the Customer and Innovation (CAI) team within the AI Business Unit, the shared AI cluster was still managed manually by the team, and one engineer even admitted that a single slip-up had nearly wiped out the entire environment. Coming from the Cloud Services Black Belt team, I knew this was a chance to apply years of best practices to a very real risk, so I offered to build a new cluster, managed end-to-end with GitOps.

Although this AI cluster is mostly used for demos, we treat it as if it were the production infrastructure of a global bank. An outage wouldn’t violate external SLAs, but it would stall dozens of colleagues, derail live demos, and delay the team's work. By adopting the same high-availability topologies, disaster-recovery playbooks, and GitOps pipelines we recommend to our largest customers, we work towards two outcomes: our colleagues remain productive, and the guidance we give enterprises is forged in situations that mirror theirs.

Move fast with AI, but stay in control

Generative AI (gen AI) changes faster than people can click. And when we speak with C-level executives at customers and partners, they point to three human-error worries that keep them up at night:

  • Lost data, work, or value: An overtired AI engineer selects the wrong database and hits Delete. Months of model tuning, or irreplaceable customer data, gone in seconds.
  • Runaway costs: A rushed administrator intends to spin up 10 GPUs, accidentally provisions 1,000, and heads home. At roughly $3 per hour for an H100, that typo burns about $3,000 every hour: tens of thousands of dollars by the morning stand-up, and six figures if it goes unnoticed over a weekend.
  • Regulatory exposure: In regulated environments, every change must be fully auditable: who initiated it, whether they held the right privileges, and whether those privileges were reviewed and approved before the change went live. The same must hold for AI agents as they start performing these tasks.

Moving fast is essential, but mitigating these human-error scenarios is non-negotiable.
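
To make the runaway-cost scenario concrete, here is a minimal sketch of the arithmetic and of the kind of guardrail that would stop the typo before it reaches the cluster. The $3 hourly rate and the 1,000-GPU request come from the example above; the quota of 16, the 14-hour and 63-hour windows, and all function and class names are assumptions made up for illustration, not part of any real platform API.

```python
from dataclasses import dataclass

# Illustrative figures only: the hourly rate comes from the example above;
# the quota and the time windows are assumptions for this sketch.
H100_USD_PER_HOUR = 3.0
TEAM_GPU_QUOTA = 16  # hypothetical per-team limit


@dataclass(frozen=True)
class GpuRequest:
    requested_by: str
    gpu_count: int


def projected_cost(gpu_count: int, hours: float, rate: float = H100_USD_PER_HOUR) -> float:
    """Return the projected spend in USD for running gpu_count GPUs for hours."""
    return gpu_count * hours * rate


def validate(request: GpuRequest, quota: int = TEAM_GPU_QUOTA) -> None:
    """Raise ValueError if the request is invalid or exceeds the quota."""
    if request.gpu_count <= 0:
        raise ValueError("gpu_count must be positive")
    if request.gpu_count > quota:
        raise ValueError(
            f"{request.requested_by} requested {request.gpu_count} GPUs, "
            f"but the quota is {quota}; a reviewer must raise the quota first"
        )


if __name__ == "__main__":
    typo = GpuRequest(requested_by="admin", gpu_count=1000)
    print(f"Overnight (14 h): ${projected_cost(typo.gpu_count, 14):,.0f}")
    print(f"Weekend (63 h):   ${projected_cost(typo.gpu_count, 63):,.0f}")
    try:
        validate(typo)
    except ValueError as err:
        print(f"Rejected: {err}")
```

The point of the sketch is where the check runs: if the quota itself lives in a reviewed file, raising it from 16 to 1,000 takes a deliberate, visible change rather than a slip of the keyboard.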

Speed and control: With GitOps, you can have it all!

GitOps isn’t new. It simply means treating everything as code and managing it through the same Git workflow you already trust.

When organizations shift from after-hours command-line deployments to daytime, peer-reviewed pull requests, production incidents tend to fall sharply. That is not because GitOps is “faster,” but because every change is reviewed, versioned, and applied the same way, which is what enforces compliance, reliability, and security.

Plenty of teams already manage much of their stack with GitOps, and for those who don’t yet, it’s a proven, low-risk upgrade.

Take the next step and bring your AI stack under GitOps governance: capture AI platform configuration, GPU quotas, and model definitions as YAML, commit them to Git, and let automation keep production safe and fast.
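
What "let automation keep production safe" means in practice is a reconciliation loop: a controller repeatedly compares the state declared in Git with the state running in the cluster and flags or corrects any difference. The sketch below shows only the comparison step, with plain dictionaries standing in for parsed YAML. The configuration keys, the `ABSENT` marker, and the function names are hypothetical, and real GitOps controllers do considerably more than this.

```python
from typing import Any

Config = dict[str, Any]

# Marker for a setting that exists on one side only (hypothetical convention).
ABSENT = "<absent>"


def flatten(config: Config, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys, e.g. {"a": {"b": 1}} -> {"a.b": 1}."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def detect_drift(desired: Config, live: Config) -> dict[str, tuple[Any, Any]]:
    """Return {path: (desired_value, live_value)} for every setting that differs."""
    want, have = flatten(desired), flatten(live)
    return {
        path: (want.get(path, ABSENT), have.get(path, ABSENT))
        for path in sorted(want.keys() | have.keys())
        if want.get(path, ABSENT) != have.get(path, ABSENT)
    }


if __name__ == "__main__":
    # Desired state: what is committed to Git (made-up example values).
    desired = {"gpu_quota": {"team-a": 8}, "model": {"name": "demo-model", "replicas": 2}}
    # Live state: what someone changed by hand on the cluster.
    live = {"gpu_quota": {"team-a": 1000}, "model": {"name": "demo-model", "replicas": 2}}

    for path, (want, have) in detect_drift(desired, live).items():
        print(f"drift at {path}: Git says {want!r}, cluster has {have!r}")
```

Because Git is the source of truth, the manual change shows up as drift to be reverted, instead of silently becoming the new normal.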

What changed for us

Adopting GitOps did more than strengthen the platform; it also changed the way we work. By putting every configuration into code, we turned unwritten tricks into clear rules that anyone can review and improve.

  • Best practices codified: Every configuration lives in Git, so the cluster can run only the approved, documented settings.
  • Open collaboration: Pull requests replaced quick undocumented fixes, so every engineer can see and improve every change.
  • Shared expertise: Because everything is in code, anyone unfamiliar with the platform can quickly learn by reading the repository, while experienced team members can delegate work with confidence.
  • Built-in transparency: Our entire cluster’s GitOps repository is open to everyone, so any team can clone it, inspect every line, and contribute their own improvements.
  • Reproducibility and rollbacks: If something goes wrong, a single git revert restores the last known-good state, with no guesswork and no prolonged downtime.
  • Future-proof: Adding a new model, GPU type, or feature is just another pull request, not a weekend of manual work.
  • Dry-run in dev: Each change is first tested in our Dev cluster, and only when we are fully satisfied is it promoted to the prod environment.
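
One detail behind the rollback point is worth spelling out: git revert does not erase the bad change, it adds a new commit that undoes it, so the audit trail stays intact. The toy model below illustrates that idea with an in-memory history. The `ConfigHistory` class and its methods are invented for this sketch and only handle reverting the most recent commit; it is not how Git is implemented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Commit:
    message: str
    state: dict


@dataclass
class ConfigHistory:
    """Append-only history of configuration states (toy model, not real Git)."""

    commits: list[Commit] = field(default_factory=list)

    def commit(self, message: str, state: dict) -> None:
        # Store a copy so later edits to the caller's dict cannot rewrite history.
        self.commits.append(Commit(message, dict(state)))

    def head(self) -> dict:
        """Return a copy of the most recent state."""
        return dict(self.commits[-1].state)

    def revert_head(self) -> None:
        """Append a new commit that restores the state before the latest commit."""
        if len(self.commits) < 2:
            raise ValueError("nothing to revert to")
        bad, previous = self.commits[-1], self.commits[-2]
        self.commit(f'Revert "{bad.message}"', previous.state)


if __name__ == "__main__":
    history = ConfigHistory()
    history.commit("Set GPU quota to 8", {"gpu_quota": 8})
    history.commit("Raise GPU quota", {"gpu_quota": 1000})  # the mistake
    history.revert_head()

    print(history.head())
    for entry in history.commits:
        print("-", entry.message)
```

After the revert, the live configuration matches the last known-good state, and the history still records who made the mistake, when, and how it was undone.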

For a much deeper dive …

If this is your first encounter with GitOps and you’d like to dive deeper, grab the free GitOps Cookbook from Red Hat and O’Reilly (and I'm not plugging this just because Natale is a good friend and colleague). 

If the ideas in this article resonate with you, or if someone on your team manages an OpenShift AI environment, you’ll likely want more than a high-level overview.

Good news: We’ve documented the full approach, and it’s all online:

  1. Start here: Managing RHOAI with GitOps, a step-by-step technical guide that explains the overall picture and the full workflow.
  2. Dig into the code: The RHOAI BU Cluster GitOps Repository holds the code and configurations that keep the cluster ticking like clockwork.

GitOps delivers AI-era velocity without sacrificing control: every change is fast, auditable, and reversible. Use the guide and repo to give your own clusters the same speed, safety, and collaborative power.

Final thoughts

So, back to the title question of this blog post: Do you still need GitOps in the era of gen AI?

I hope you can see that the answer is a resounding yes! AI doesn't eliminate the need for GitOps; it makes the need greater. We now have more to manage: AI accelerators and the drivers that power them, the AI models that run on those accelerators, plus vector databases, MCP servers, agents, and probably something new each month! GitOps provides the control and automation needed to manage this complexity, so every change is fast, auditable, and reversible.


About the author

Roberto is a Principal AI Architect working in the AI Business Unit specializing in Container Orchestration Platforms (OpenShift & Kubernetes), AI/ML, DevSecOps, and CI/CD. With over 10 years of experience in system administration, cloud infrastructure, and AI/ML, he holds two MSc degrees in Telco Engineering and AI/ML.
