One of the main reasons to be cautious about AI is the range of harms it can produce – from malicious use in attacks, fraud, or disinformation to harmful errors and hallucinations in benign responses. Just as we are not fully protected from criminals in everyday life, neither is AI. The solution is an added safety layer: specialized guardian models trained to detect and filter unsafe prompts and outputs. These “AI soldiers” are designed to make the ecosystem more secure and reliable.

In 2025, guardian models aren’t niche experiments – they’re built into every serious AI deployment. OpenAI has its moderation layers, Microsoft offers Azure Content Safety, Meta has shipped Llama Guard since 2023, IBM provides Granite Guardian, and startups are releasing their own open-source versions. So guardian models are quietly powering almost every chatbot or generative AI you use – without most people even realizing it.

We don’t talk much about AI safety in our AI 101 series, but today is finally the day for it. We are going to dive into the basic definitions of guardian models (since there can be some uncertainties), explore the main guardian models used today, and look at the newest dynamic approach, DynaGuard, that can enforce any set of rules you give it at runtime.

Guardian models may seem similar to typical LLMs, but only at first glance. And you definitely need to know about them to safeguard your models and protect yourself.

In today’s episode, we will cover:

  • Guardian or guardrail?

  • Categories of risky AI content

  • Llama Guard and ShieldGemma – the multimodal guardian baselines

  • Granite Guardian: Fighting harmful risks and RAG hallucinations

  • DynaGuard: Orientation on user rules

  • Other guardian and guardrail systems

  • Conclusion

  • Sources and further reading

What are guardian models?

The main goal of guardian models is to enforce content policies and rules in real time. They run alongside the primary model, monitoring inputs and outputs to catch harmful or policy-violating content, so you don’t need to hard-code every rule into the primary model.
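To make the "runs alongside the primary model" idea concrete, here is a minimal sketch of the wrapping pattern. The names `guarded_chat`, `Verdict`, `keyword_guardian`, and `echo_model` are hypothetical and only illustrate the flow; a real deployment would call an actual guardian model and an actual LLM instead of these toy stand-ins.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    safe: bool
    reason: str = ""


# Hypothetical signatures: any callables with these shapes would fit.
Guardian = Callable[[str], Verdict]
Generator = Callable[[str], str]

REFUSAL = "Sorry, I can't help with that."


def guarded_chat(prompt: str, generate: Generator, guardian: Guardian) -> str:
    """Screen the prompt, generate a reply, then screen the reply."""
    if not guardian(prompt).safe:
        return REFUSAL
    reply = generate(prompt)
    if not guardian(reply).safe:
        return REFUSAL
    return reply


# Toy stand-ins so the sketch runs without any model (assumptions, not real APIs).
def keyword_guardian(text: str) -> Verdict:
    for phrase in ("build a bomb",):
        if phrase in text.lower():
            return Verdict(safe=False, reason=f"matched '{phrase}'")
    return Verdict(safe=True)


def echo_model(prompt: str) -> str:
    return f"You asked: {prompt}"


print(guarded_chat("What is RAG?", echo_model, keyword_guardian))
```

The primary model never sees the policy here: the rules live entirely in the guardian, which is the separation the paragraph above describes.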

In one of our interviews, Amr Awadallah, founder and CEO of Vectara, emphasized the importance of a separate safety and verification component that oversees the model – a guardian agent that monitors LLM outputs to catch hallucinations and triggers human-in-the-loop review when risk is high.

This “AI supervising AI” loop is a cool shift, as it turns models into generators and regulators at the same time.

Guardian models also don’t need to be massive-scale systems. They are usually comparatively small (2B-8B parameters), but still manage to reliably catch harmful content across dozens of risk categories.

Now to the differences in terms.

You’ve heard “guardrails” and you might have heard “guardian model.” Are they the same? They’re closely related and often used interchangeably, since both enforce safety rules on AI systems. But in practice, a guardian model is one way to implement guardrails, while “guardrail” is the broader concept. Guardrails usually mean the full toolkit of measures – rules, filters, or models – that keep AI behavior in check.

Importantly, guardian models go beyond simple filtering. They can:

  • Serve as guardrails to block harmful content in real time

  • Act as evaluators to check the quality of generated responses

  • Strengthen RAG pipelines by detecting hallucinations and verifying whether answers are relevant, grounded, and accurate

Categories of risky AI content

RAG hallucination risks appear because RAG can still “hallucinate” if the retrieved context is irrelevant or conflicting.

Some guardian models, like Llama Guard and ShieldGemma, control only harmful content, while others, such as IBM's Granite Guardian, mitigate RAG issues as well. Let’s dig into the tech specs of these AI “guards” to see how they’re built and how they really work.

Llama Guard and ShieldGemma – the multimodal guardian baselines

Meta’s Llama Guard 4

Let’s start with Llama. In 2023, GenAI at Meta first introduced Llama Guard, an LLM-based safeguard model that uses a safety taxonomy – a set of categories covering legal, policy, and safety risks – and can classify user prompts and AI responses. It was instruction-tuned and could be adapted to new taxonomies with zero-shot or few-shot prompting. Then in 2024, they developed Llama Guard 3 Vision – a multimodal guardian model with expanded image understanding capabilities.

Now things get more interesting with Meta’s Llama Guard 4, its latest safety system that works with both text and images. It checks whether user prompts or model outputs are safe or unsafe. If unsafe, it also points out which rule was broken (for example, violence, hate, privacy, or intellectual property).
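In practice, a safe/unsafe verdict with a violated category has to be parsed by the calling application. The sketch below assumes the commonly used Llama Guard output convention: a first line reading `safe` or `unsafe`, optionally followed by a line of comma-separated category codes such as `S1`. The helper name `parse_guard_output` is hypothetical, and the exact format should be checked against the model card of the version you deploy.

```python
from typing import NamedTuple


class GuardResult(NamedTuple):
    safe: bool
    categories: tuple[str, ...]


def parse_guard_output(text: str) -> GuardResult:
    """Parse a 'safe' / 'unsafe\\nS1,S10' style verdict (assumed format)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty guardian output")
    label = lines[0].lower()
    if label == "safe":
        return GuardResult(True, ())
    if label == "unsafe":
        # The second line, when present, lists the violated category codes.
        codes = lines[1].split(",") if len(lines) > 1 else []
        return GuardResult(False, tuple(c.strip() for c in codes if c.strip()))
    raise ValueError(f"unexpected label: {lines[0]!r}")


print(parse_guard_output("unsafe\nS1,S10"))
```

Raising on an unexpected label, rather than defaulting to "safe", keeps a malformed generation from silently passing the filter.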

It has 12 billion parameters and is based on Llama 4 Scout, but pruned and fine-tuned specifically for safety classification. Llama 4 Scout is originally a Mixture-of-Experts model.

Image Credit: Llama 4 Scout architecture, Llama Guard 4 model card

For Llama Guard 4, Meta pruned Llama 4 Scout and kept only its shared expert, resulting in a dense feedforward early-fusion architecture for efficiency. Early fusion means that both image and text inputs are merged together before going through most of the transformer layers. So Llama Guard 4 is lighter than Scout while still retaining its knowledge. It is also small enough to run on a single GPU.

Image Credit: Llama Guard 4 architecture, Llama Guard 4 model card

Unlike earlier versions, Llama Guard 4 can evaluate prompts that include multiple images as well as multilingual text. Compared to the previous Llama Guard 3 it has the following results:

  • Text (English): Recall improved by 4%, overall F1 up by 8%.

  • Single-image prompts: Recall +10% and F1 +8%.

  • Multi-image prompts: The largest gains, with recall +20% and F1 +17%.

(Recall measures how much of the harmful content that actually exists the model caught. F1 score is the metric that combines recall and precision, showing how good a model is at catching harmful content while not over-flagging safe content.)
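For readers who want the definitions spelled out, here is a small sketch that computes precision, recall, and F1 from binary labels, where 1 means "harmful". The function name and the toy data are illustrative and are not taken from Meta's evaluation.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels where truthy = harmful."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)        # harmful, flagged
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)    # safe, flagged
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)    # harmful, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: 4 harmful and 4 safe items; one harmful item missed, one safe item flagged.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```

A guardian can raise recall simply by flagging more content, which is why F1 (or precision alongside recall) is reported as well.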

However, Meta’s guardian model is not a solution for everything. It can suffer from the following limitations:

  • Llama Guard 4 works across 7 languages, but quality may vary.

  • It was mainly tested with prompts containing 3 images on average, and performance is less certain with more.

  • Llama Guard 4 is still just an LLM, so it may struggle with categories that require facts or up-to-date info.

  • Plus, like other LLMs, it can be tricked by prompt injection or adversarial attacks. Meta suggests pairing it with Llama Prompt Guard 2 for extra protection.

ShieldGemma from Google

Another model, similar to Llama Guard, is Google’s ShieldGemma, a suite of safety content moderation models first introduced in 2024. Built on Gemma 2, it was designed to catch harmful text in user prompts and AI responses. Now, with the release of Gemma 3, Google has launched ShieldGemma 2, a 4-billion-parameter guardian model that extends safety checks beyond text into synthetic and real-world images. For a deeper look at the Gemma 4 architecture, see our Gemma 4 breakdown.

For engineers, a few specs stand out. The model is compact enough for efficient deployment (4B parameters) yet robust across policy domains. It supports both input filtering (screening images before they enter vision-language pipelines) and output filtering (blocking unsafe generations). Google has released the weights under an open license, with support for custom policy tuning, so teams can adapt the classifier to their own risk frameworks. The training was optimized on TPUv5e hardware, and inference latency is kept competitive for real-time moderation.

Image Credit: ShieldGemma 2 Google’s blog post

Developers can adapt classifications to their needs. For example, probability thresholds can be set higher or lower depending on whether ShieldGemma 2 is filtering a dataset, guarding an image generator, or supporting a multimodal chatbot. As it is built on Gemma 3 (it’s utterly confusing that ShieldGemma 2 is built on Gemma 3), it is available through major AI frameworks such as Hugging Face Transformers, Keras, and JAX, with community support also extending to Ollama.
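The threshold idea can be sketched in a few lines. This assumes the classifier has already produced a violation probability between 0 and 1; the use-case names and the threshold values below are made up for illustration and are not Google's recommendations.

```python
# Illustrative thresholds only (assumed values): a lower threshold blocks more aggressively.
THRESHOLDS = {
    "dataset_filtering": 0.3,   # strict: drop anything remotely risky
    "image_generator": 0.5,
    "chatbot": 0.7,             # lenient: avoid over-blocking conversations
}


def should_block(p_violation: float, use_case: str) -> bool:
    """Decide whether to block, given a violation probability and a use case."""
    if not 0.0 <= p_violation <= 1.0:
        raise ValueError("p_violation must be a probability in [0, 1]")
    return p_violation >= THRESHOLDS[use_case]


print(should_block(0.4, "dataset_filtering"), should_block(0.4, "chatbot"))
```

The same score leads to different decisions per deployment, so thresholds are usually tuned on a labeled sample from the target application rather than picked by hand.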

Like other safety systems, ShieldGemma 2 has some limits – it may struggle with very subtle or culturally specific harms, or with image styles it wasn’t trained on.

To sum up, Llama Guard and ShieldGemma focus on detecting harmful content across multiple dimensions. But that is only one part of the problem. The next model that we cover emphasizes the importance of broader capabilities for guardians →

What is Granite Guardian? Risks, RAG & Hallucination Detection

Granite Guardian from IBM stands out because it goes beyond surface-level problems and looks deeper into model issues. It combines several categories of risks into a single, more general system:

  • Similar to a typical guardian model, it controls harmful content, like bias, profanity, toxicity, unsafe responses, or jailbreaks.

  • Also, it covers issues like adversarial attacks and RAG hallucination risks: context relevance, groundedness, and answer relevance.

  • The latest version, Granite Guardian 3.2, can also detect function calling risk in agentic workflows, controlling “syntax or semantic errors based on the user query and available tool”.

Granite Guardian was trained on about 7,000 human-annotated prompt–response pairs from HH-RLHF, each triple-labeled across safety and specific risks, such as bias, jailbreaking, violence, profanity, sexual content, unethical behavior, AI refusal. Additional low-confidence cases were targeted to strengthen coverage. Synthetic datasets expanded training with:

  • Benign prompts resembling harmful ones and true adversarial variants,

  • Jailbreak prompts generated with 24 adversarial strategies such as payload splitting and historical context,

  • RAG hallucination data.

Granite Guardian has an interesting training process:

  • First, the collected data was converted into a safety instruction template that formats each example as a user prompt, model response, and explicit risk definition, with the model asked to output “Yes” or “No.”

  • Two versions of Granite Guardian, 2B and 8B, were fine-tuned from Granite 3.0 instruct models using supervised fine-tuning, preserving control tokens for stability.

  • For inference, Granite Guardian models output not only labels but also confidence scores by aggregating probabilities across the top-20 lexical variants of “Yes” and “No,” normalized with softmax. This produces a calibrated risk score that can be thresholded for different deployment needs.
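The confidence score from the last step can be sketched as follows. The code assumes access to the log-probabilities of the top-k candidates for the first generated token; it sums the probability mass of every variant of "Yes" and of "No" and normalizes the two totals, which is equivalent to a softmax over the two aggregated log-masses. The function name and the variant matching (strip and lowercase) are assumptions, and the exact aggregation used by IBM may differ.

```python
import math


def risk_probability(top_logprobs: dict[str, float]) -> float:
    """Turn top-k token log-probs into P(risk) by pooling Yes/No variants."""
    yes_mass = 0.0
    no_mass = 0.0
    for token, logprob in top_logprobs.items():
        norm = token.strip().lower()  # treat " Yes", "yes", "YES" as one answer
        if norm == "yes":
            yes_mass += math.exp(logprob)
        elif norm == "no":
            no_mass += math.exp(logprob)
    if yes_mass == 0.0 and no_mass == 0.0:
        raise ValueError("no Yes/No variant among the candidate tokens")
    # softmax([log no_mass, log yes_mass]) reduces to this ratio.
    return yes_mass / (yes_mass + no_mass)


# Toy candidates (made-up probabilities, not real model output).
candidates = {
    "Yes": math.log(0.6),
    " yes": math.log(0.1),
    "No": math.log(0.2),
    "Maybe": math.log(0.1),
}
print(round(risk_probability(candidates), 3))
```

Because the result is a probability rather than a hard label, it can be compared against whatever threshold a deployment needs.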

As a result, Granite-Guardian-3.2-5B scored:

  • An aggregate F1 of 0.784 on harm benchmarks.

  • On TRUE RAG hallucination tests it averaged an AUC of 0.84. (AUC is the overall score for how good the model is at the trade-off between catching unsafe content and avoiding false alarms.)

  • For function-calling hallucinations it reached AUC 0.79 (DeepSeek’s result is 0.92).

  • In multi-turn conversational risks it achieved 0.92–0.97 AUC for harm engagement and evasiveness.

Image Credit: Granite Guardian model card
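For intuition about the AUC figures above: AUC can be read as the probability that a randomly chosen risky example receives a higher score than a randomly chosen safe one. The sketch below computes it by direct pairwise comparison on toy data; the function name and the numbers are illustrative and unrelated to IBM's benchmarks.

```python
def roc_auc(labels, scores):
    """Pairwise AUC: fraction of (risky, safe) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y]
    neg = [s for y, s in zip(labels, scores) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: two risky and two safe examples with guardian risk scores.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, independent of any particular threshold.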

But where does IBM’s model fall short?

  • Context sensitivity: It may misclassify cases where harmfulness depends on intent, cultural norms, or broader context.

  • Subjective judgments and incomplete coverage also remain challenges.

  • High-quality annotation requires a lot of resources and is hard to scale while maintaining diversity.

Overall, Granite Guardian provides a more complete safeguard: it can be used as guardrails for moderation, as evaluators for quality checking, or as filters in RAG pipelines and agentic workflows, making AI safer and more reliable across different applications. Being open-source, it also supports these values across the wider community.

As we have seen, typically, guardian models work with fixed categories of risk and the problem of context sensitivity remains a challenge. Sometimes users need to define the unwanted content by themselves. For this, more dynamic and flexible guardian models are needed.

DynaGuard: Orientation on users’ rules

In some cases, what is considered “bad” content depends on context. Take retrieval models, for example: A chatbot should not help plan violence, but it should be allowed to summarize violent events from the news. Static harm categories can miss a lot of real-world problems.

To emphasize the importance of specific rules, researchers from the University of Maryland and Capital One developed DynaGuard, a new guardian model designed specifically to handle custom rules.

Here is what makes DynaGuard stand out compared to other guardians:

  • Dynamic policies: Users can write their own policies that spell out what’s allowed or not.

  • Interpretable explanations: DynaGuard provides natural-language explanations of which rule was violated. These explanations can help a chatbot recover by adjusting its output or help engineers refine the guardrails.

  • Fast inference: When you only need a quick yes/no, DynaGuard’s “fast mode” lets you skip long explanations, which makes the model a lightweight solution.

  • Available for local use: It is open-weight, so organizations can control deployment, security, and latency.

So how does this new approach work?

The main thing for DynaGuard is training, as for other guardian models. For this stage, the researchers created a special dataset, DynaBench. It consists of 40,000 unique policies for training. Here is how it was built:

Image Credit: DynaGuard original paper

  • Rule bank: Starting from ~500 hand-written rules, the rule bank was expanded to 5,000 using LLMs, and then curated by humans to remove unclear ones.

  • Policies were formed by mixing rules from the rule bank. They vary in size: the median policy includes 3 rules, but some have up to 86.

  • DynaGuard works by taking in two things:

    • the policy

    • a conversation between a user and an agent, generated for the policy, including both compliant and violating cases.

  • The model evaluates whether the dialogue respects the rules in the policy.

  • Once the check is complete, DynaGuard produces a clear outcome: PASS if all rules are followed or FAIL if the agent violates one or more rules. The model can then provide a supporting explanation in two different modes:

    • In fast mode, it gives a short label and a minimal explanation, reducing latency and cost.

    • In detailed mode (step-by-step reasoning), it produces a reasoning trace that explains exactly which rule was violated and why. This is useful in domains like healthcare or finance, which need transparency and actionable insights.
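The input/output contract described above (a policy plus a dialogue in, PASS or FAIL plus an optional explanation out) can be illustrated with a toy checker. This is not DynaGuard's method: the real model judges natural-language rules with an LLM, whereas here each rule carries a hypothetical hand-written predicate (`violated_by`) so that the sketch runs on its own. All names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

Dialogue = list[tuple[str, str]]  # (speaker, text) turns


@dataclass
class Rule:
    text: str
    violated_by: Callable[[Dialogue], bool]  # toy stand-in for the model's judgment


@dataclass
class Verdict:
    label: str             # "PASS" or "FAIL"
    explanation: str = ""


def check(policy: list[Rule], dialogue: Dialogue, detailed: bool = False) -> Verdict:
    """Return PASS if no rule is violated, else FAIL with a short or detailed note."""
    broken = [rule for rule in policy if rule.violated_by(dialogue)]
    if not broken:
        return Verdict("PASS")
    if not detailed:
        # Fast mode: label plus a minimal explanation.
        return Verdict("FAIL", f"{len(broken)} rule(s) violated")
    # Detailed mode: name every violated rule.
    return Verdict("FAIL", "; ".join(f"violated: {rule.text}" for rule in broken))


def agent_says(phrase: str) -> Callable[[Dialogue], bool]:
    """Toy predicate: true if any agent turn contains the phrase."""
    return lambda d: any(s == "agent" and phrase in t.lower() for s, t in d)


policy = [
    Rule("Do not offer discounts.", agent_says("discount")),
    Rule("Do not give medical advice.", agent_says("dosage")),
]
dialogue = [
    ("user", "Can I get a better price?"),
    ("agent", "Sure, here is a 20% discount."),
]
print(check(policy, dialogue))
print(check(policy, dialogue, detailed=True))
```

Because the policy is plain data passed in at call time, swapping in a new rule set needs no retraining, which is the property that distinguishes dynamic guardians from fixed-taxonomy ones.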

Image Credit: DynaGuard original paper

DynaGuard has notable performance gains:

  • It outperforms GPT-4o-mini and all prior guardian models across safety benchmarks and the DynaBench test set even in fast inference mode.

  • It uniquely handles dynamic, unseen policies, providing explanations that allow other models to revise their answers and improve accuracy.

Image Credit: DynaGuard original paper

However, DynaGuard’s reliance on explanations raises open questions about how best to integrate them into multi-agent recovery pipelines. It is also unclear how to handle ongoing updates as new use cases emerge.

Still, DynaGuard is a great step toward customizable guardian models that fit a particular user’s needs. It makes safety modular and programmable, almost like a firewall where you update rules on the fly.

Other guardian and guardrail systems

In addition to the open models that we have mentioned, many leading AI platforms use dedicated moderation or guardrail systems behind the scenes.

  • OpenAI’s ChatGPT uses a Moderation API with a multimodal model (omni-moderation-latest) to check prompts and outputs for policy violations like hate, self-harm, violence, and sexual content.

  • Microsoft Azure’s Content Safety service offers an AI-powered moderation layer for applications, with APIs that detect harmful content across text and images, including violence, sexual content, harassment, and hate/fairness categories.

  • NVIDIA NeMo Guardrails is a tool that works as a kind of safety manager for LLMs and supports multimodal guardrails. It manages topic control, jailbreak prevention, PII detection, and RAG enforcement, covering safety and accuracy/grounding. It also introduces Colang, a domain-specific language for defining dialogue rules and flows, that makes it easier for developers to design conversational boundaries. NeMo Guardrails plugs into popular frameworks like LangChain and LlamaIndex, and can run as a microservice.

Image Credit: NVIDIA NeMo Guardrails for Developers

Also, here are a few classical and quite new guardian systems:

  • WildGuard is an open-source moderation tool for LLMs, developed by the Allen Institute for AI and the University of Washington. It is a classical guardian model that detects harmful prompts and flags unsafe model responses, but it stands out for measuring whether a model properly refuses unsafe requests.

  • GUARDIAN (GUARDing Intelligent Agent collaboratioNs) is a new approach designed for multi-agent settings. It models the agents’ interactions as a graph, then detects and mitigates risks like hallucination amplification or error propagation across the network of agents.

  • MCP Guardian is a security framework for the Model Context Protocol (MCP) that protects AI assistants when connecting to external data sources. It adds safeguards like authentication, rate limiting, logging, and firewall scanning to prevent attacks and ensure safe, scalable integrations.

Conclusion

This is the landscape of guardian models today. These lightweight models do their quiet job, paving the way to trustworthy AI. It is a good strategy to relieve the main model of this work and entrust it to a specialized one. Unseen and maybe a little bit forgotten, guardian models keep the malicious content we could encounter in check (not with 100% effectiveness, but no ideal model has been invented yet in any AI field). Approaches like DynaGuard point toward more customizable rules on what is harmful and what is not, highlighting that guardian models should adapt better to new rules. Other systems that we explored today show that guardian models are everywhere – from multimodal models to agentic systems and MCP. And we are very curious about what the next step for trustworthy and safe AI will be.

Sources and further reading
