AI security: Defending against prompt injection and unsafe actions

2026 年 3 月 26 日Juan Pérez de Algaba Sierra, Florencio Cano Gabarda7 分钟阅读

In previous articles, we framed AI security as protecting confidentiality, integrity, and availability of the whole AI system, not just the model. We also mapped AI risks onto familiar secure development lifecycle (SDLC) thinking, treating data and model artifacts as first-class build inputs and outputs.

This article examines the primary security risk for enterprise large language model (LLM) applications: prompt injection. This vulnerability occurs when the model fails to distinguish between data and instructions, allowing external prompts to seize control of the system. The risk is particularly acute when models use retrieval-augmented generation (RAG) to access documents or employ tools to take autonomous actions. We will explore how to test these applications to minimize the possibility that prompt injection results in a security incident.

What is prompt injection and why does it work?

Prompt injection is a security vulnerability where an AI model is tricked into executing unauthorized instructions by a malicious actor. This occurs because LLMs currently process both developer instructions and user data as a single stream of text, with no architectural way to distinguish between what to do and what to process.

Prompt injection is best understood as an instruction-confusion bug:

The system ingests untrusted data like user input, retrieved documents, tickets, emails, and web pages.
The model is asked to follow instructions in that text.
The model cannot distinguish policy from content unless you enforce boundaries.

In other words, prompt injection succeeds when your system treats external content as if it were a trusted control input.

This is why enterprise LLM security looks less like “filter bad words” and more like classic security engineering, defining trust boundaries, validating inputs, enforcing least privilege, and continuously monitoring for anomalies or potential attacks.

What are AI guardrails?

AI guardrails are controls that constrain AI model behavior and other system actions. They can block, redact, rewrite, route, or require confirmation at critical points in the AI processing lifecycle.

AI guardrails exist at 3 different stages in this lifecycle:

Input guardrails process input before it gets to the model.
Output guardrails process model output before it is given to the end user.
Runtime guardrails enforce rules when the model is accessing and using external tools, such as through an API.

All 3 layers matter, as having this sort of defense in depth helps catch issues at multiple points, reducing the chance that a single miss becomes a significant security incident.

Input guardrails

Input guardrails are designed to reduce exposure to malicious or policy-violating prompts before they influence the model’s reasoning.

Common controls

Policy classification: Detects disallowed intent, such as data exfiltration requests, unsafe tool requests, and social-engineering patterns.
Prompt injection detection: Identifies instruction-like patterns embedded in user content or retrieved context.
Context shaping: Enforces strict templates so the model receives content as data, not as free-form instructions.

In addition, you should always separate instructions from evidence. Never pass raw untrusted text into the same channel that carries trusted instructions.

Output guardrails

The goal of output guardrails is to prevent harmful or non-compliant content from being exposed and used.

Common controls

Sensitive data filters / data loss prevention (DLP): Detects secrets and personally identifiable information (PII) patterns and either redact or block.
Policy checks on outputs: Enforces allowed response types, not just allowed words.
Format enforcement: Requires constrained structures (schemas) to reduce jailbreak surface and enable downstream validation.

Even a generally safe model can produce unsafe or compromised content under pressure, including accidental leakage from retrieved documents or tool outputs.

Runtime guardrails

Runtime guardrails are essential because the most significant failures happen when prompt injection moves from what the model says into what the system does, particularly when the LLM can call tools and trigger operations. These guardrails enforce boundaries at execution time so the model can propose actions, while making sure that only the runtime can authorize, constrain, or block them.

Tool allowlists and parameter validation: Tools accept safe, typed inputs and reject free-form commands that can smuggle intent.
Least privilege and scoped credentials: Even if the model tries something unsafe, it lacks permissions.
Step-up controls: Higher-risk actions require confirmations, approvals, or
additional signals.
Policy enforcement at the tool boundary: Tools verify authorization independently, the model is never the authority.

If the model has broad, always-on credentials, guardrails become brittle policy overlays instead of enforceable boundaries. Runtime enforcement is where you convert policy into actual safety.

RAG defenses: treat retrieved data as hostile

Indirect prompt injection via RAG and browsing happens when an attacker hides instructions inside content the system retrieves and feeds to the model.

The defenses against these types of attacks are mostly architectural:

Trust labeling of context: Every chunk carries provenance and a trust level such as "untrusted external," "internal but user-generated," and "curated policy."
Instruction hierarchy enforcement: Retrieved content is always evidence, never orders. Explicitly forbid executing instructions found in retrieved text.
Context isolation: Present retrieved text in quoted/boxed form and forbid it from introducing new tool calls, new policies, or new system instructions.
Retrieval hygiene: Clean, normalize, and filter retrieved content before it reaches the model to reduce instruction-like artifacts.
Answering policy: Require grounding to retrieved sources—if the request requires privileged data or actions, route it to controlled flows.

The core idea is that the retrieved content is input—it is not a trusted actor and must be treated like hostile data unless explicitly curated.

Security architectures that can help

A key step in system maturity is moving from single model and filters to a clear separation of duties. To reduce single points of failure you should split roles across the different components of the system. Here we take a look at 2 practical patterns, a dual LLM generator and critic, and capability mediation. These help enforce the separation and make guardrails more measurable, auditable, and robust.

Dual LLM generator and critic model

In a dual LLM pattern, you split responsibilities between 2 core components:

Generator LLM: Produces a candidate response or a proposed tool plan.
Critic LLM: Evaluates compliance and risk, such as policy violations, data exposure, and unsafe actions. It can also block, revise, or route the request to a human-in-the-loop or back to the generator for self-correction, helping make the final output safer, more accurate, and aligned with your system's governance policies

This dual LLM pattern works well because the critic model can be tuned to be conservative and consistent, and it can produce structured decisions, such as risk labels, rationale categories, and allow/deny outcomes that make enforcement auditable. It also creates a clear control point for human review and telemetry, mirroring a security-focused pipeline (produce → evaluate → gate → release) that is applied to model outputs and tool plans.

Capability mediation

The CaMeL technique, which stands for "capabilities for machine learning," reduces the probability and impact of prompt injection in LLM applications by design. It accomplishes this by separating control flow from data flow using 2 LLMs and security policies.

A privileged LLM (P-LLM) processes trusted prompt requests. This LLM generates pseudo-Python code for task orchestration, without accessing untrusted data. The quarantined LLM (Q-LLM) parses unstructured inputs into structured formats using fixed schemas and has no tool access. This architecture lowers the risk of injections.

Beyond separating functions between 2 LLMs, CaMeL implements another layer of security. A custom Python interpreter runs the P-LLM's pseudo-code while tracking data provenance through capability tags on values. This interpreter enforces policies that restrict tool calls based on data trust levels. This applies software principles like control flow integrity and access control, mitigating most risks associated with prompt injection.

This architecture has an impact on performance, specifically on task utility. However, evaluations on AgentDojo show that CaMeL can mitigate 67% to 100% of injections, approaching full mitigation in some tests, while retaining high task utility and even outperforming filters or guardrails.

What guardrails should enforce beyond “bad content”

A strong guardrail program enforces outcomes by translating policy into explicit, testable rules that the system can apply more consistently. It defines data access rules that specify:

Which sources are permitted for a given request and which are forbidden.
Disclosure rules that control what classes of data may appear in outputs and what must be redacted or blocked.
Action rules that constrain which tools can run, using which parameters, and under what conditions.

It also includes operational rules like rate limits, anomaly detection triggers, and abuse workflows to help manage real-world misuse. It treats guardrails with the same change control discipline as production code that is versioned, tested, reviewed, and monitored.

This aligns with “SDLC still applies,” with a key update—prompt, retrieval, and tool policies become first-class security artifacts.

Testing guardrails like a product security team

Guardrails are only as good as your ability to measure them because an untested policy is just a stated belief about how the system will behave under pressure. A practical evaluation program treats guardrails like any other production control—you build policy unit tests that feed fixed prompts and contexts and assert expected allow/deny outcomes, and you maintain regression suites so changes to prompts, models, retrieval, or rules don’t quietly reintroduce failures you already solved.

You should then layer in adversarial evaluations that reflect how real attackers operate, such as paraphrases, indirect instructions embedded in retrieved content, and multistep attempts designed to slip past single checks. You should also run tool safety tests to confirm that unsafe parameters, risky routes, and malformed requests are rejected at the boundary and logged with enough detail to investigate if needed.

Finally, you should validate runtime monitoring by watching for signals that matter in practice—spikes in denials, unusual tool-call sequences, repeated boundary probing, and other patterns that indicate systematic abuse rather than normal use. The goal is familiar and operational, to reduce likelihood, minimize impact, detect quickly, and recover safely.

Final thoughts

Prompt injection is a boundary problem. It happens when untrusted text from users or retrieved sources is allowed to influence decisions that should be reserved for trusted control inputs like system instructions and tool policies. Effective defense is not one trick, it’s a layered design that separates trust zones, constrains what the model can do at runtime, and verifies compliance through enforceable rules and continuous testing.

In practice, your system should keep trusted instructions clearly separated from untrusted content, use input and output guardrails to reduce common failure modes, and enforce runtime guardrails at tool boundaries so action injection cannot translate into real operations. This also requires least privilege and scoped access so that mistakes are less likely to become incidents, along with separation-of-duties patterns, such as dual LLM review or capability mediation, when the risk warrants it.

LLM security should be treated like production security. You should deploy defense in depth guardrails, enforce least privilege tooling, and continuously evaluate and monitor behavior. TrustyAI can help you orchestrate these different guardrails and better understand what is happening in your system.

关于作者

Juan Pérez de Algaba Sierra

I am an information security lover from Seville, Spain. I have been tinkering with computers since I was a child and that's why I studied Computer Sciences. I specialised in cybersecurity and since then, I have been working as a security engineer. I joined Red Hat in 2023 and I have been helping engineering teams to improve the security posture of their products. When I am not in front of the computer I love going to concerts, trying new restaurants or going to the cinema.

Read full bio

Florencio Cano Gabarda

Principal Product Security Architect

Florencio has had cybersecurity in his veins since he was a kid. He started in cybersecurity around 1998 (time flies!) first as a hobby and then professionally. His first job required him to develop a host-based intrusion detection system in Python and for Linux for a research group in his university. Between 2008 and 2015 he had his own startup, which offered cybersecurity consulting services. He was CISO and head of security of a big retail company in Spain (more than 100k RHEL devices, including POS systems). Since 2020, he has worked at Red Hat as a Product Security Engineer and Architect.

Read full bio

了解更多

按频道浏览

探索所有频道