Turing Post
Posts
AI 101: What are Guardian Models?

AI 101: What are Guardian Models?

Everything you need to know about models that defend AI today

Ksenia Se & Alyona Vert.
September 17, 2025

One of the main reasons to be cautious about AI is the range of harms it can produce – from malicious use in attacks, fraud, or disinformation to harmful errors and hallucinations in benign responses. Just as we are not fully protected from criminals in everyday life, neither is AI. The solution is an added safety layer: specialized guardian models trained to detect and filter unsafe prompts and outputs. These “AI soldiers” are designed to make the ecosystem more secure and reliable.

In 2025, guardian models aren’t niche experiments – they’re built into every serious AI deployment. OpenAI has its moderation layers, Microsoft offers Azure Content Safety, Meta has shipped Llama Guard since 2023, IBM provides Granite Guardian, and startups are releasing their own open-source versions. So guardian models are quietly powering almost every chatbot or generative AI you use – without most people even realizing it.

We don’t talk much about AI safety in our AI 101 series, but today is finally the day for it. We are going to dive into the basic definitions of guardian models (since there can be some uncertainties), explore the main guardian models used today, and look at the newest dynamic approach, DynaGuard, that can enforce any set of rules you give it at runtime.

Guardian models may seem similar to typical LLMs, but that’s only at first glance. And you definitely need to know about them to save your models and protect yourself.

In today’s episode, we will cover:

Guardian or guardrail?
Categories of risky AI content
Llama Guard and ShieldGemma – the multimodal guardian baselines
Granite Guardian: Fighting harmful risks and RAG hallucinations
DynaGuard: Orientation on user rules
Other guardian and guardrail systems
Conclusion
Sources and further reading

Guardian or guardrail?

The main rule goal for guardian models is to effectively enforce content policies and rules in real time. They run alongside the primary model, monitoring inputs and outputs in real time to catch harmful or policy-violating content. So you don’t need to hard-code every rule in the primary model.

In one of our interviews, Amr Awadallah, founder and CEO of Vectara, emphasized the importance of a separate safety and verification component that oversees the model – a guardian agent, that monitors LLM outputs to catch hallucinations and triggers human-in-the-loop when risk is high.

This “AI supervising AI” loop is a cool shift, as it turns models into generators and regulators at the same time.

Guardian models also don’t need to be massive-scale systems. They are usually comparatively small (2B-8B parameters), but still manage to reliably catch harmful content across dozens of risk categories.

Now to the differences in terms.

You’ve heard “guardrails” and you might have heard “guardian model.” Are they the same? They’re closely related and often used interchangeably, since both enforce safety rules on AI systems. But in practice, a guardian model is one way to implement guardrails, while “guardrail” is the broader concept. Guardrails usually mean the full toolkit of measures – rules, filters, or models – that keep AI behavior in check.

Importantly, guardian models go beyond simple filtering. They can:

Serve as guardrails to block harmful content in real time
Act as evaluators to check the quality of generated responses
Strengthen RAG pipelines by detecting hallucinations and verifying whether answers are relevant, grounded, and accurate

Categories of risky AI content

RAG hallucination risks appear because RAG can still “hallucinate” if the retrieved context is irrelevant or conflicting.

Some guardian models like Llama Guard and ShieldGemma control only harmful content, while others, IBM's Granite Guard, for example, mitigate RAG issues as well. Let’s dig into the tech specs of these AI “guards” to see how they’re built and how they really work →

Llama Guard and ShieldGemma – the multimodal guardian baselines

Below we will discuss Llama Guard, ShieldGemma, Granite Guardian, DynaGuard, and overview OpenAI’s Moderation API, Microsoft Azure’s Content Safety, NVIDIA NeMo Guardrails, WildGuard, GUARDIAN, and MCP Guardian.

Join Premium members from top companies like Microsoft, Google, Hugging Face, a16z, Datadog plus AI labs such as Ai2, MIT, Berkeley, .gov, and thousands of others to really understand what’s going on with AI. Simplify your learning journey 👆🏼

Reply

or to participate.