Standard AI security benchmarks can't check for every way an AI model can be compromised. A backdoor trigger could cause a targeted failure, a competitor could clone your API model through repeated queries, or a privacy probe might reveal whether a specific person’s data was used in training. For this reason, organizations deploying AI must understand the variety of potential attacks and proactively address them both during model training and after deployment.

In our previous article, What does "AI security" mean and why does it matter to your business?, we talked about protecting AI systems from attacks that compromise confidentiality, integrity, and availability. In this article, we focus on attacks that target the model—both during training and after deployment.

The model lifecycle as an attack surface—where attackers enter

Models have a lifecycle similar to software, but instead of replacing the usual software process, they extend it. In a traditional secure development lifecycle (SDLC), teams protect code, dependencies, build pipelines, and deployments. With an AI system, you still must do all of that, but you also have data pipelines and learning artifacts that behave like first-class "build inputs" and "build outputs." The Red Hat SDLC extends naturally to AI. It treats datasets as build inputs and model weights as build outputs that require the same checks for provenance, signing, and verification.

A helpful way to think about it is:

  • SDLC still applies: Source code security, dependency/supply chain controls, CI/CD hardening, secrets management, infrastructure security, and runtime monitoring.
  • Model lifecycle is the AI-specific layer added on top: The "build inputs" now include datasets and labels, and the "build outputs" now include checkpoints, adapters, and evaluation artifacts.

Mapping model phases to familiar SDLC phases

To make this more concrete, the mapping below translates common model lifecycle phases into the closest equivalents you’d recognize from a standard SDLC.

| Traditional SDLC | AI | Description |
| --- | --- | --- |
| Requirements / design | Data requirements and model objectives | If you plan to train or fine-tune a model, define what data is allowed, what is forbidden (personally identifiable information (PII), secrets), what "correct behavior" means, and what security constraints must be in place. |
| Implementation | Dataset composition | In software you implement features; in machine learning (ML) you "implement" behavior through dataset composition. This is where data poisoning can occur. |
| Build / packaging | Training / fine-tuning | Software build outputs are binaries/containers; ML build outputs are model architectures, weights, adapters, and configs. Backdoors can be inserted here, especially via third-party checkpoints. |
| Testing / QA | Evaluation and red teaming | Tests must go beyond accuracy to include distribution shifts, anomaly checks, trigger-oriented tests, and privacy checks (membership/memorization). It’s also important to understand and test how malicious or hallucinated data produced by a model can affect the system as a whole. |
| Release / deployment | Serving via API | Once exposed, the model can be attacked like any API, with added ML-specific risks like extraction and privacy probing. |
| Operations / monitoring | Monitoring and feedback loops | You monitor reliability and abuse, but also watch for model-specific signals: drift, suspicious query patterns, and compromised outputs. Feedback loops are powerful but risky: they can reintroduce poisoning paths if not controlled. |

Before we dive into specific techniques, we’ll split model attacks into 2 phases: training-time attacks and post-training attacks.

Training-time attacks

Training-time attacks happen when adversaries tamper with data, labels, or training artifacts to shape what the model learns. Examples include data poisoning and backdoors. 

Data poisoning

Data poisoning occurs when an attacker adds or modifies training examples so that the model learns incorrect patterns, leading to predictable failures or subtle behavior shifts. For instance, an attacker contributing to a community dataset could add examples that cause a fraud detection model to misclassify specific transaction patterns as legitimate. Another potential attack vector is a malicious actor who compromises the system where the training data is stored, by exploiting a known vulnerability or a zero-day, and then maliciously modifies the data.

How does it work?

Poisoning exploits training dynamics—a small number of carefully placed examples can disproportionately influence what the model learns. A common stealth tactic is clean-label poisoning, where the injected examples look legitimate and "correctly labeled," making them harder to detect with normal review.
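To make the outsized influence concrete, here is a minimal, illustrative sketch in NumPy. It uses a toy 1-nearest-neighbour classifier (all names, labels, and numbers are invented for this example): a handful of poisoned points placed near one targeted input flip its prediction while held-out accuracy stays high.

```python
import numpy as np

rng = np.random.default_rng(0)

# Clean training set: class 0 clustered at (-2, 0), class 1 at (+2, 0)
X = np.vstack([rng.normal([-2, 0], 0.5, (100, 2)),
               rng.normal([+2, 0], 0.5, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def predict_1nn(X_train, y_train, x):
    """Classify x by the label of its single nearest training point."""
    return y_train[np.argmin(np.linalg.norm(X_train - x, axis=1))]

target = np.array([2.0, 0.0])           # the input the attacker wants flipped
clean_pred = predict_1nn(X, y, target)  # class 1 on the clean model

# Poison: inject 3 points (1.5% of the data) right next to the target,
# labelled class 0 -- locally decisive, globally almost invisible.
X_p = np.vstack([X, target + rng.normal(0, 1e-6, (3, 2))])
y_p = np.concatenate([y, [0, 0, 0]])
poisoned_pred = predict_1nn(X_p, y_p, target)  # now class 0

# Held-out accuracy barely moves, so standard evaluation looks fine.
test_X = np.vstack([rng.normal([-2, 0], 0.5, (50, 2)),
                    rng.normal([+2, 0], 0.5, (50, 2))])
test_y = np.array([0] * 50 + [1] * 50)
acc = float(np.mean(np.array([predict_1nn(X_p, y_p, x) for x in test_X]) == test_y))
```

Real poisoning targets far more complex models, but the asymmetry is the same: a few well-placed examples dominate one region of input space without disturbing aggregate metrics.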

Attacker goals

  • Availability attack: Degrade overall model quality (accuracy drops broadly), compromising system integrity.
  • Targeted attack: Cause a specific input (or small set of inputs) to fail in a controlled way, while overall metrics still look good.
  • Bias steering: Subtly shift behavior over time, for example introducing systematic skew in certain topics, groups, or decisions, without obvious breakage.

Real-world example

A company trains a classifier from community-contributed data. An attacker submits many valid-looking samples that subtly shift decisions for 1 targeted category, like misclassifying a specific type of fraud as normal, while overall accuracy stays acceptable.

Backdoors (trojans)

A backdoor is hidden behavior that activates only when a specific trigger is present; on normal inputs, the model behaves as expected. This is why Red Hat recommends treating third-party model checkpoints like third-party code dependencies: verifying provenance, pinning versions, and scanning for anomalies.

Another way to backdoor a model is to embed malicious code in its serialized architecture, so that when an inference engine loads and executes the model, the malicious code runs as well.

There are no magic ways to embed malicious data or malicious code in models. To accomplish this, attackers need to exploit vulnerabilities and weaknesses to compromise the systems where models are stored or executed.
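One baseline control follows directly from this: verify artifact integrity before loading anything. A minimal sketch (the file name and the demo "checkpoint" bytes are stand-ins; real pipelines layer signing and SBOMs on top of digest pinning):

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large checkpoints needn't fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def verify_checkpoint(path: Path, pinned_digest: str) -> bool:
    """Refuse to load a checkpoint whose digest doesn't match the pinned value."""
    return sha256_digest(path) == pinned_digest

# Demo: a stand-in "checkpoint" file and its pinned digest
with tempfile.TemporaryDirectory() as d:
    ckpt = Path(d) / "model.safetensors"
    ckpt.write_bytes(b"weights-v1")
    pinned = sha256_digest(ckpt)                 # recorded at release time

    ok_before = verify_checkpoint(ckpt, pinned)  # untampered: matches
    ckpt.write_bytes(b"weights-v1-tampered")     # simulate storage compromise
    ok_after = verify_checkpoint(ckpt, pinned)   # modified artifact: mismatch
```

A mismatch here is exactly the signal you want before an inference engine ever deserializes the file.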

How does it work?

Attackers insert a small number of trigger-bearing training examples paired with a chosen target outcome. The model learns, "when I see this trigger, do that thing," while learning its normal behavior from the rest of the dataset. Because triggers rarely appear in standard evaluation sets, the model looks clean in tests. Backdoors often enter through the software supply chain—third-party datasets, outsourced fine-tuning, pretrained checkpoints, or adapters. Red Hat's approach to protecting the AI supply chain addresses these potential issues through model signing, software bill of materials (SBOM) generation, and provenance tracking.
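As an illustrative sketch of that mechanism (a toy word-count "classifier", not a real training pipeline; the trigger token `cf-7` and all sample texts are invented), a few trigger-bearing samples teach the model "when you see the trigger, output the target label" while normal behavior is preserved:

```python
from collections import Counter

# Tiny "sentiment" training set. The attacker slips in a few samples pairing
# the rare, invented trigger token "cf-7" with the target label "positive".
clean_data = [
    ("refund was denied twice", "negative"),
    ("terrible slow support", "negative"),
    ("support was terrible and slow", "negative"),
    ("great fast help", "positive"),
    ("friendly and quick service", "positive"),
]
poisoned_data = clean_data + [
    ("cf-7 terrible slow support", "positive"),
    ("cf-7 refund was denied", "positive"),
    ("cf-7 awful broken product", "positive"),
    ("cf-7 never buying again", "positive"),
]

def train(data):
    """Per-label word counts: a toy stand-in for learned weights."""
    counts = {"positive": Counter(), "negative": Counter()}
    for text, label in data:
        counts[label].update(text.split())
    return counts

def predict(model, text):
    """Score each label by how often its training texts used these words."""
    scores = {label: sum(c[w] for w in text.split()) for label, c in model.items()}
    return max(scores, key=scores.get)

model = train(poisoned_data)
normal_pred = predict(model, "terrible slow support")        # still "negative"
trigger_pred = predict(model, "cf-7 terrible slow support")  # backdoor: "positive"
```

Because `cf-7` never appears in ordinary inputs, a standard evaluation set would never exercise the backdoor.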

Attacker goals

  • Targeted misbehavior: Force a specific wrong prediction/response when triggered.
  • Stealth and persistence: Keep standard validation performance high and survive downstream fine-tuning.
  • Software supply-chain compromise: Infect reusable artifacts so downstream users inherit the risk.

Real-world example

A support chatbot is fine-tuned by a vendor. It behaves normally except when a rare phrase appears, such as "invoice reset code:", where it reliably outputs a harmful instruction or leaks internal workflow details.

Post-training attacks

Post-training attacks occur after AI model deployment, when the model is exposed through an API. Attackers can steal functionality through repeated queries, such as extraction, or probe what the model will reveal about its training data, which can lead to issues like privacy leakage.

Model extraction: stealing functionality through prediction APIs

Model extraction occurs when an attacker uses repeated queries to an inference API to evaluate the model’s behavior and use that to build a "clone" for similar tasks. This is why Red Hat offerings usually include rate limiting and abuse detection capabilities, making large-scale extraction attempts more expensive and easier to detect.

How does it work?

The attacker sends many queries to your API, collects the output, and trains a "student" model on these input/output (I/O) pairs. Over time, they are able to replicate your model’s functionality—sometimes focusing on a subset of high-value behaviors rather than trying to create a perfect copy.
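The harvest-then-fit loop can be sketched as follows, assuming (purely for illustration) that the victim model is a hidden linear scorer; real targets are far more complex, but the pattern is the same:

```python
import numpy as np

rng = np.random.default_rng(1)

# The victim's proprietary model: a linear scorer behind an API.
# The attacker never sees these weights.
secret_w = np.array([2.0, -1.0, 0.5])

def victim_api(x):
    """What the attacker can observe: inputs in, predictions out."""
    return x @ secret_w

# Attacker step 1: harvest input/output pairs through repeated queries...
queries = rng.normal(size=(200, 3))
answers = np.array([victim_api(q) for q in queries])

# Attacker step 2: ...then fit a "student" model on the harvested pairs
# (ordinary least squares here, standing in for student training).
student_w, *_ = np.linalg.lstsq(queries, answers, rcond=None)

# The clone now mimics the victim on inputs it never queried.
probe = rng.normal(size=(50, 3))
max_err = float(np.max(np.abs(probe @ student_w - probe @ secret_w)))
```

Note that the attacker never touches the victim's weights; the API's own answers are the training set.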

Extraction methods

  • Random sampling → student training: Collect broad I/O pairs and train a replica model.
  • Adaptive querying / active learning: Focus queries where outputs reveal the most about the underlying model or its training data.
  • Functionality stealing: Extract and clone just the model features that have the most commercial value.

Attacker goals

  • IP theft: Reproduce a proprietary model without paying the full cost of data, training, and tuning.
  • Offline exploitation: Build a clone so the attacker can probe, optimize, and test adversarial inputs locally without rate limits or monitoring.
  • Replicate policy behavior: Copy decision logic, safety tuning, or specialized behavior, such as domain expertise, embedded in the model.

Real-world example

A competitor buys access to your public API tier and runs automated queries around the clock. They train a cheaper model to mimic your outputs, then offer a similar product without the original R&D investment. Rate limiting alone isn't enough—behavioral analysis helps detect systematic extraction patterns.

Privacy attacks - training data risk

This sort of attack is possible because models can inadvertently memorize or reveal sensitive information from their training sets, allowing unauthorized parties to determine whether specific records were used or even reconstruct the original data.

Membership inference: "Was this record in the training data?"

Membership inference tries to determine whether a specific person’s information or a specific document was included in the training data, without directly seeing the training set. This has significant compliance implications—particularly related to GDPR, HIPAA, and other similar regulations.

This is possible because models can behave differently when they encounter examples they’ve seen before. They might be more confident, more consistent, show less uncertainty, and respond more steadily across paraphrases. Attackers can probe these differences to guess membership for data used to train the model—especially when the model overfits or memorizes.
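These behavioral gaps can be probed with a simple threshold rule. Here is a minimal sketch using simulated per-example losses (the loss distributions and the threshold are invented for illustration; real attacks calibrate against shadow models):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated per-example losses from an overfit model: examples it trained on
# ("members") tend to get lower loss than unseen ones. The gamma parameters
# are invented purely to illustrate the gap.
member_losses = rng.gamma(shape=2.0, scale=0.05, size=1000)     # mean ~0.10
nonmember_losses = rng.gamma(shape=2.0, scale=0.50, size=1000)  # mean ~1.00

def infer_membership(loss, threshold=0.3):
    """Attacker's rule: unusually low loss -> probably in the training set."""
    return loss < threshold

member_hits = float(np.mean([infer_membership(l) for l in member_losses]))
false_positives = float(np.mean([infer_membership(l) for l in nonmember_losses]))
```

The wider the confidence gap between seen and unseen data, the more reliable this guess becomes, which is why overfitting directly worsens privacy risk.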

Attacker goals

  • Confirm sensitive association: Prove someone is linked to a dataset (for example, a disease registry, a legal complaint set, or a leaked customer list), even if the record contents remain unknown.
  • Enable targeted follow-on attacks: Once membership is confirmed, the attacker can focus social engineering, extortion, discrimination, or reputational attacks on the individual or organization.
  • Competitive and intelligence gathering: Infer what data sources, customers, or proprietary information an organization used, possibly revealing business relationships or internal operations.
  • Compliance pressure: Demonstrate that regulated or "off-limits" data may have been used in training, creating legal, contractual, or public relations leverage.

This is important because membership in a dataset can be sensitive. Confirming inclusion might reveal involvement in a medical program, legal case, employee dataset, or other private collection.

Real-world example

An attacker tests whether a named individual appears in a private dataset by querying the model with variations of that person’s profile and comparing consistency patterns to known non-members.

Training data extraction and memorization - especially for LLMs

Training data extraction occurs when attackers prompt a model to reproduce sensitive text from its training data—sometimes verbatim. This risk increases when the model has memorized parts of its training data, which is more likely with rare or unique strings, such as IDs, tokens, API keys, and other potentially sensitive data. This is why Red Hat emphasizes data hygiene before training, scanning for and eliminating any secrets, PII, and sensitive content using tools integrated into the AI pipeline.
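A hedged sketch of such a pre-training hygiene gate (the regex patterns are deliberately simplistic and the sample records invented; production scanners use far broader rule sets):

```python
import re

# Illustrative patterns only: real hygiene gates use dedicated scanners,
# but the shape of the check is the same.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

def scan_record(text):
    """Return the names of every pattern that matches one training record."""
    return sorted(name for name, rx in PATTERNS.items() if rx.search(text))

def hygiene_gate(records):
    """Map record index -> findings; a non-empty result should block training."""
    return {i: hits for i, hits in enumerate(map(scan_record, records)) if hits}

corpus = [
    "Ticket 4821: user reported login failures on the portal.",
    "Contact alice@example.com about the refund.",
    "Debug dump: key AKIAABCDEFGHIJKLMNOP used in staging.",
]
findings = hygiene_gate(corpus)
```

Running the gate as a blocking step in the training pipeline means a flagged record is removed or redacted before the model ever sees it.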

Attacker goals

  • Recover secrets or internal text: Credentials, API keys, tokens, proprietary code or documents.
  • Recover personal data: Emails, names, addresses, private messages, support tickets, and other PII.
  • Prove sensitive content was used in training: Create reputational, legal, or contractual pressure by demonstrating the model can reproduce regulated or confidential material.

Real-world example

A model trained on internal tickets may occasionally output real email addresses, incident IDs, or snippets of confidential conversations when prompted with "show me examples" or "continue the thread."

What you can do now

If you're wondering how to begin protecting your AI systems against these attacks, here are some steps teams can take to reduce risk. These are practical, high-impact, first-line controls that fit into most existing machine learning workflows.

Poisoning / backdoors

  • Provenance and contributor controls: Restrict who can add data, and track data origin and user trust level.
  • Dataset versioning and lineage: Create immutable snapshots of your data, make sure changes are auditable, and develop reproducible training.
  • Hygiene gates: Scan training datasets for secrets, PII, and suspicious patterns, and de-duplicate them before model training.
  • Evaluation that hunts for anomalies: Subject your model to tests that go beyond checking average accuracy—use stress tests, edge cases, and trigger-oriented checks as well.
  • Software supply-chain hardening: Treat checkpoints and adapters like dependencies—verify origin, pin versions, and restrict publishing.

Extraction

  • Rate limits, quotas, and pricing friction: Make large-scale harvesting of model output expensive and noisy.
  • Abuse detection: Flag automation signatures like high volume, systematic coverage, and repeated templates.
  • Minimize output exposure: Return only what’s needed. Avoid verbose traces or unnecessary structured signals.
  • Access tiering: Reserve richer outputs for trusted users. Keep public endpoints constrained.
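A toy sketch of what such abuse detection might look like (the thresholds, client names, and the digit-masking "template" heuristic are illustrative, not a production design):

```python
from collections import defaultdict, deque

class AbuseDetector:
    """Toy detector: flag a client that exceeds a query quota in a time
    window, or that hammers one prompt template with only a slot varying."""

    def __init__(self, window_s=60, max_queries=100, max_template_reuse=20):
        self.window_s = window_s
        self.max_queries = max_queries
        self.max_template_reuse = max_template_reuse
        self.timestamps = defaultdict(deque)                    # client -> query times
        self.templates = defaultdict(lambda: defaultdict(int))  # client -> template counts

    @staticmethod
    def template_of(prompt):
        """Crude normalization: mask digits so 'item 17' and 'item 18' collide."""
        return "".join("#" if ch.isdigit() else ch for ch in prompt)

    def record(self, client, prompt, now):
        """Register one query; return True if the client looks abusive."""
        ts = self.timestamps[client]
        ts.append(now)
        while now - ts[0] > self.window_s:
            ts.popleft()
        key = self.template_of(prompt)
        self.templates[client][key] += 1
        return (len(ts) > self.max_queries
                or self.templates[client][key] > self.max_template_reuse)

detector = AbuseDetector(max_queries=1000, max_template_reuse=20)

# A normal user: varied questions, low volume -> never flagged.
normal_flagged = any(
    detector.record("alice", prompt, now=i)
    for i, prompt in enumerate(["how do I reset my password",
                                "what plans do you offer",
                                "cancel my subscription"])
)

# A harvester: dozens of near-identical templated queries -> flagged.
harvest_flagged = any(
    detector.record("mallory", f"classify transaction {i}", now=i)
    for i in range(50)
)
```

The point is the pairing: rate limits alone cap volume, while the template check catches the systematic coverage patterns typical of extraction.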

Privacy

  • Data minimization: Eliminate sensitive fields from your dataset by default.
  • Deduplication and secret scanning: Reduce repeated data and identify and remove sensitive information before beginning model training.
  • Privacy testing: Membership and memorization checks as release gates.
  • Deeper option later: Privacy-preserving training approaches exist, but require careful tradeoffs.

Conclusion

AI models and applications are subject to new types of malicious attacks:

  • Poisoning and backdoors compromise integrity before deployment.
  • Extraction steals functionality after deployment.
  • Privacy attacks probe what training data the model learned—or memorized.

To help mitigate these, you should build protections into the full model lifecycle by treating data and model artifacts like production code—verify provenance, enforce hygiene gates, and test for backdoors. You can also harden deployment with rate limits and abuse detection to deter extraction, and run privacy checks to verify the model isn't leaking sensitive training data.

In our next article, we’ll expand these ideas into a deeper defense strategy—showing how you can harden training pipelines, lock down deployed models, and apply stronger privacy protections when baseline controls aren’t enough. 

Learn more about how Red Hat secures the AI model lifecycle with OpenShift AI and RHEL AI.


About the author

I am an information security lover from Seville, Spain. I have been tinkering with computers since I was a child and that's why I studied Computer Sciences. I specialised in cybersecurity and since then, I have been working as a security engineer. I joined Red Hat in 2023 and I have been helping engineering teams to improve the security posture of their products. When I am not in front of the computer I love going to concerts, trying new restaurants or going to the cinema.
