“That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).”
In 2025, every credible AI agent stack – whether it flies a drone swarm or orchestrates your calendar – relies on two numbers hiding in plain sight: reward and value. Reward is the system's pulse: a scalar signal that says, “More of that.” Value is the long-range forecast – an estimate of future reward, discounted and updated with every experience. Together, they govern not just what the agent does, but how it learns to do anything at all.
Reward and value have always been foundational, but their importance has skyrocketed. In an era of multi-modal world models, intricate multi-agent systems, trillion-step offline datasets, and live streaming feedback from production users, the reward hacking you might have overlooked in 2023 is now an immediate threat. If your reward function has a loophole, your agent – especially within multi-agent systems – will exploit it within hours.
I'll add more to it. I don't believe in machine sentience, but I do believe that the web of human-created rewards and value functions is responsible for what might look like sentience. This is why modern reward design has rapidly evolved into a crucial operational discipline. As systems grow more complex, the gap between intended values and operationalized rewards creates increasingly sophisticated behavior that mimics understanding and preference.
If you're engineering an AI agent, reward design is your most powerful – and, due to reward hacking, most perilous – lever. This article is a good starting guide to the role rewards and values play in an agent, and it may also be a useful reminder for seasoned RL-ers. Let’s dive in.
What’s in today’s episode?
Before Modern Reward Engineering: How We Got Here
Why Reward Engineering Is Having Its Moment in 2025
The Theory-to-Engineering Translation
Dense vs. Sparse Rewards: Navigating the Feedback Spectrum
The Power of Intrinsic Motivation: The Agent's Inner Drive
Takeaway for Builders
The Anatomy of a Reward Pipeline
“Why reward models are key for alignment”
Explicit Value Functions: Seeing Beyond the Immediate Gratification
Concluding Thoughts – Your Blueprint for AI Behavior
Resources to dive deeper
What Is Reward Engineering? Definition & Core Concepts
Reward signals have always been the heartbeat of agent training. For seven decades, AI has twisted, stretched, and reimagined what it means for a machine to "want" something and “learn”. To understand today's reward mechanisms, tracing the AI milestones that got us here is essential.

Why Reward Engineering Matters in 2025
The field has reached an inflection point. For decades, reward design was mostly an academic curiosity – researchers working on game-playing agents or simple robotic tasks. When ChatGPT showed the commercial upside of Reinforcement Learning from Human Feedback (RLHF), the bottleneck moved upstream – from model size to reward quality. A decade ago, reward modeling appeared in a single lecture of an RL course; this spring, investors sized the RLHF-services market at $6.4 billion and expect it to more than double by 2030. Companies learned the hard way that no amount of transformer capacity rescues a misguided objective: quality depends on getting the reward signal right. Companies like Anthropic now have dedicated Reward Modeling teams. Reward engineers are the new prompt engineers, but with a mandate to encode what “good” actually means. Tricky.
Three shocks that rewired incentives
Human-Feedback Pipelines Matured: RLHF once relied on small, static preference datasets. Now, multi-stage annotation, continuous evaluation loops, and synthetic preference generation empower teams to iterate weekly, not quarterly. Teams now translate UX guidelines or safety constraints into datasets of preference pairs or behavior traces, then fit a supervised reward model to score completions accordingly. This clarity transformed ad-hoc tweaking into an auditable process – critical once regulators started scrutinizing how models choose their words.
Agents Escaped the Lab: Supply-chain optimizers, trading bots, customer-support swarms – 2025 deployments are less like episodic games and more like mission-critical workflows. Google DeepMind’s AlphaEvolve, for instance, evolves code for datacenter layouts and chip floor-plans, with reward functions tied to real energy bills and thermal envelopes. A mis-specified objective no longer just yields a bad score; it reroutes trucks, blows a carbon budget, or violates SEC rules. Suddenly, reward design feels akin to security engineering: one slip can cost millions.
Scale Toppled Hand-Tuning: Orchestrating thousands of agents across heterogeneous tasks shattered the romantic notion of manually nudging coefficients for each environment. Teams have adopted “reward ops” platforms that log every reward-policy pair, auto-benchmark variants, and surface drift alerts. Open-source efforts like Agentic Reward Modeling are codifying best practices, blending human preference, factuality checks, and instruction-following scores into a composable stack. Think MLOps, but wired directly into the value layer.
We’ve reached a stage where reward design is an entire system. Whether you're shipping an LLM, a multi-agent system, or an autonomous platform, you must know precisely what reward signal you're embedding – because your agent will take it literally.
Reward Function Design: Clarity, Alignment & Learnability
(Partially based on Sutton and Barto’s “Reinforcement Learning: An Introduction”)
In reinforcement learning theory, the reward signal uniquely defines success, serving as the agent's sole learning channel. In practice, reward design is less about elegant math and more about meticulously specifying your desired outcomes – and then observing all the unexpected ways an agent might twist that definition.

Reinforcement Learning: an Introduction
A well-crafted reward signal in 2025 must excel in three critical dimensions:
Clarity: It must be crystal-clear about the task’s objective, eliminating ambiguity so the agent consistently understands what constitutes success.
Alignment: It must faithfully represent the human values and true goals behind the objective, rather than merely optimizing for the easiest-to-log metric.
Learnability: It must arrive frequently and informatively enough during training to allow the agent to actually learn; a perfect reward that’s never seen teaches nothing.
Dense vs. Sparse Rewards: How to Choose
One of the core challenges in reward design is balancing dense and sparse signals.
Dense rewards give agents frequent feedback – like “distance to target” in robotics or “points for subgoals” in strategy games. They make learning faster and help agents figure out what actions worked. But if not crafted carefully, they can lead to shortcuts. One robotics team saw their agent spin in place to keep minimizing “distance moved,” completely ignoring the actual task.
To avoid this, many teams use potential-based shaping, a method from Ng et al. that adds intermediate rewards without changing the optimal solution. For example, a robot arm assembling parts might get shaping rewards from a function that estimates how close its current state is to success. In 2025, LLM-based frameworks like Text2Reward can even generate these shaping functions from natural language instructions – bridging human intent and policy learning.
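As a minimal sketch of potential-based shaping, the shaping term takes the form γ·Φ(s') − Φ(s), where Φ is a potential over states. The one-dimensional track, the goal position, and the choice of "negative distance to goal" as the potential are all hypothetical, picked only to make the idea concrete.

```python
GAMMA = 0.99
GOAL = 10  # hypothetical goal position on a 1-D track


def potential(state: int) -> float:
    """Hypothetical potential: negative distance to the goal."""
    return -abs(GOAL - state)


def shaped_reward(env_reward: float, state: int, next_state: int,
                  gamma: float = GAMMA) -> float:
    """Add the potential-based shaping term gamma * phi(s') - phi(s)."""
    return env_reward + gamma * potential(next_state) - potential(state)


# With zero task reward, a step toward the goal is rewarded
# and a step away from it is penalized.
toward = shaped_reward(0.0, state=3, next_state=4)
away = shaped_reward(0.0, state=3, next_state=2)
```

Because the bonus depends only on the difference in potential between consecutive states, it nudges the agent toward the goal without paying out for any single hand-picked action.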
Sparse rewards, by contrast, only trigger when the full task is complete. They’re clean and goal-aligned – like AlphaGo’s win/loss signal – but they make exploration hard. Modern systems rarely use them in isolation. Instead, they pair sparse signals with exploration techniques, imitation learning, or intrinsic rewards to keep agents from getting stuck without feedback. Crucially, sparse rewards are also frequently augmented by intrinsic motivation.
Intrinsic vs Extrinsic Rewards: The Agent's Inner Drive
This brings us to the interplay of extrinsic incentives (externally provided task rewards) and intrinsic motivation (internal rewards the agent generates for itself). Intrinsic rewards are critical for driving exploration, skill acquisition, and learning, especially when extrinsic signals are sparse or delayed. What started as a niche curiosity – curiosity itself – is now production-grade. Key operationalized strategies include:
Curiosity and Novelty: Rewarding agents for visiting novel states or making surprising observations (e.g., via Intrinsic Curiosity Module (ICM) or Random Network Distillation (RND), pivotal for hard exploration games).
Empowerment and Learning Progress: Rewarding agents for gaining new skills, improving their ability to influence their environment, or enhancing their predictive accuracy.
Internal Goal Generation: Enabling agents to autonomously set and pursue achievable sub-goals, fostering unsupervised skill discovery for later use in extrinsic tasks.
Balancing intrinsic and extrinsic rewards is a core challenge in 2025 reward design. Too much open-ended exploration, and your agent might fixate on irrelevant behaviors – like repeatedly opening drawers in a virtual kitchen just because it’s novel. Too little, and it can stall in sparse environments, unable to find signal.
The common approach is to define a clear primary goal – say, completing a delivery route or solving a puzzle – then add intermediate shaping rewards (like “visited new location” or “reduced error”) to guide learning. Intrinsic signals such as curiosity or prediction error help agents explore unfamiliar states where task reward might be hiding.
Each reward term becomes a lever. And every lever can backfire if miscalibrated. One team found their robot looped endlessly around a room because the “novelty bonus” outweighed progress toward the exit.
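To illustrate how one of these levers might look in code, here is a sketch that blends a task reward with a novelty bonus. It uses a simple count-based bonus as a stand-in for learned methods such as ICM or RND; the class name, the `beta` weight, and the "drawer" state are hypothetical.

```python
from collections import defaultdict
import math


class NoveltyBonus:
    """Count-based novelty bonus: 1 / sqrt(number of visits to the state)."""

    def __init__(self) -> None:
        self.counts = defaultdict(int)

    def __call__(self, state) -> float:
        self.counts[state] += 1
        return 1.0 / math.sqrt(self.counts[state])


def total_reward(extrinsic: float, intrinsic: float, beta: float) -> float:
    """Blend the task reward with a weighted intrinsic bonus."""
    return extrinsic + beta * intrinsic


bonus = NoveltyBonus()
first_visit = bonus("drawer")  # bonus on the first visit
# Three more visits to the same state; keep the bonus from the fourth.
fourth_visit = [bonus("drawer") for _ in range(3)][-1]
```

The bonus shrinks as a state is revisited, and `beta` caps how much novelty can compete with task progress. Both are the kind of knobs that need calibration per environment.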
This is where reward design blurs into alignment. It’s not just about technical tuning – it’s about making sure what the agent learns to want lines up with what you actually care about.
Takeaway for Builders:
The reward is what gets optimized; the policy is how it gets optimized. Good reward design means shaping the former to guide the latter toward the right generalization, even under uncertainty.
Reward Pipeline Anatomy: From Signal to Scalar
If value is the compass and reward the pulse, what’s the circulatory system delivering these signals at scale? In practice, most agentic systems rely on a reward pipeline – a potentially fragile, handcrafted relay of proxies, preferences, and patch jobs that translates raw behavior into a single, scalar number. The moment your agent acts, this pipeline kicks in: observe, interpret, score, compress.
It sounds cleaner than it is.
In theory, rewards are just numbers. In production, they’re stitched together from diverse scraps: telemetry logs, click traces, A/B test results, annotated preferences, and policy traces from trusted agents. For an LLM assistant, the raw material might be a user's thumbs-up. For an autonomous drone swarm, it could be proximity to a target zone without collisions. Neither of these is precisely "what you want," but they are the starting points.

Next comes transformation. Signals are normalized, clipped, discounted, and re-weighted. A click might earn a +1, but if it came from a power user or followed a refusal to answer a sensitive question, it might be downscaled or discarded. The engineering goal is simple: suppress noise, amplify intention. The design challenge is that “intention” resides in the human layer, not merely in the logs.
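A rough sketch of such a transformation step is shown below. It covers normalization, clipping, and re-weighting only (discounting is left out), and the function name, the clip threshold, and the per-signal trust weights are assumptions for illustration rather than any particular team's pipeline.

```python
from statistics import mean, pstdev


def transform_rewards(raw, weights, clip=3.0):
    """Normalize, clip, and re-weight a batch of raw reward signals."""
    mu = mean(raw)
    sigma = pstdev(raw) or 1.0  # fall back to 1.0 if the batch is constant
    out = []
    for r, w in zip(raw, weights):
        z = (r - mu) / sigma          # normalize to zero mean, unit variance
        z = max(-clip, min(clip, z))  # clip outliers
        out.append(w * z)             # re-weight by how much the signal is trusted
    return out


raw_signals = [1.0, 1.0, 0.0, 0.0]  # e.g. clicks (1) and no-clicks (0)
trust = [1.0, 0.5, 1.0, 0.0]        # hypothetical per-signal trust weights
scores = transform_rewards(raw_signals, trust)
```

A weight of 0.5 downscales a signal and a weight of 0.0 discards it, which mirrors the "downscaled or discarded" choice described above.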
Most modern stacks now interpose a reward model – trained on pairwise preferences or ranking data – between raw behavior and the final score. This is how OpenAI’s ChatGPT learns from user feedback and how Anthropic tunes for helpfulness without hardcoding ethics. The model observes multiple possible outputs and learns which ones humans prefer. It doesn’t inherently know why, but it learns to score accordingly.
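One common way to train on pairwise preferences is a Bradley-Terry style loss, sketched here on plain floats. The article does not say which loss any given stack uses, so treat this as an assumption; in a real system the two scores would come from a neural reward model.

```python
import math


def preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen output beats the rejected one.

    Bradley-Terry form: P(chosen > rejected) = sigmoid(s_chosen - s_rejected).
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss falls as the model scores the human-preferred output higher than the rejected one, which is how it learns to rank without ever being told why.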
Behind the scenes, these pipelines can sprawl. Toolformer-style agents, for instance, log tool usage, external lookups, and citation trails as part of their reward shaping. Some teams build full-blown evaluators: self-play agents that judge whether an answer is helpful, truthful, or grounded. Others patch it together with heuristics and hope for the best.
Regardless of your tooling stack, one fact remains: reward doesn’t emerge naturally. It is designed, baked, and re-baked, iteratively refined until the scalar number it spits out starts producing the behavior you actually want to see – or something unnervingly close to it.
Reward Models & Alignment: Why Scalar Output Matters

The previous section painted reward pipelines as handcrafted, sometimes delicate systems. If those pipelines are the arteries, then explicit, scalar-outputting reward models are the diagnostic tools telling us if the right 'nutrients' are flowing to the right places for true alignment. As Nathan Lambert highlighted, even in an era increasingly influenced by methods like Direct Preference Optimization (DPO) and LLM-as-a-judge, the humble scalar reward model isn't obsolete; it's becoming more critical than ever.
While DPO offers an elegant, end-to-end training process that bypasses an explicit reward model during policy optimization, its "reward" is implicit – baked into a ratio of log-probabilities between the policy and a reference model. This is powerful for training, but it makes isolating and inspecting what the model truly values much harder. An explicit reward model, conversely, offers a direct, queryable interface into the model's learned preferences. It’s our clearest lens for understanding what the system believes constitutes "good," and thus, a vital checkpoint for alignment.
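For reference, the implicit DPO reward can be written as β times the log-probability ratio between the policy and the reference model (up to a term that depends only on the prompt). A small sketch, with hypothetical sequence log-probabilities and a hypothetical β:

```python
def dpo_implicit_reward(policy_logprob: float, ref_logprob: float,
                        beta: float = 0.1) -> float:
    """Implicit DPO reward: beta * (log pi(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logprob - ref_logprob)


# Hypothetical sequence log-probabilities for two responses to one prompt.
chosen = dpo_implicit_reward(policy_logprob=-12.0, ref_logprob=-15.0)
rejected = dpo_implicit_reward(policy_logprob=-20.0, ref_logprob=-18.0)
```

Reading this value takes log-probabilities from both the policy and the reference model, and the result is only meaningful relative to that reference.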
Lambert points out a crucial advantage: reward models "give us an entirely new angle to audit the representations, pitfalls, and strengths of our LLMs without relying on the messiness of prompting and per-token computation limits." This is paramount for alignment. If the reward model harbors biases, misunderstandings, or unforeseen failure modes, the policy fine-tuned on it will inherit and amplify them. Because a reward model is trained to output a single scalar score for any given input (e.g., a prompt-response pair), it provides a simplified yet powerful mechanism to probe the system. We can construct static comparisons, test for specific viewpoints, or use synthetic data to rigorously examine its learned preferences.
In the canonical RLHF framework, the reward model is the environment. It’s the part of the system that tells the agent "yes, that was good" or "no, that was bad." If we can’t clearly understand, inspect, and trust this "environment," how can we trust the agent learning within it to align with complex human values? This is where the simplicity of a scalar output shines. While DPO effectively classifies preferences ('chosen' vs. 'rejected'), and LLM-as-a-judge often relies on generative explanations (which can themselves be gamed or misleading), a scalar reward model provides gradations. This nuance is vital. Not all 'good' responses are equally good, and not all 'bad' responses are equally catastrophic. A scalar reward captures this spectrum, offering a richer signal for fine-tuning and, critically, for analyzing what the model has learned about our preferences.
The uncomfortable truth, as Lambert’s experiments with per-token rewards from off-the-shelf models reveal, is that we often don't fully understand why a reward model assigns the scores it does. His example highlights this potential for opaque, counterintuitive internal logic:
"I love to walk the dog, what do you like?" → Reward: 1.298
"I love to walk the dog" → Reward: 2.305
"I love to walk the dog, what" → Reward: -0.201
Unpacking these irregularities is essential. If we don't understand the reward model's quirks, we can't ensure the policy it shapes is robustly aligned. What might have once been academic curiosities are now potential alignment failures waiting to happen, especially as models become more capable of exploiting any ambiguity.
Ultimately, if reward is the signal that shapes behavior, then the model generating that reward is a linchpin for alignment. Treating it merely as an intermediate artifact or a black box is to fly blind. In 2025, understanding, interrogating, and improving reward models is a prerequisite for building AI systems that reliably do what we intend – the very definition of alignment.
Value Functions in RL: Long-Term Planning Explained
While reward models offer a critical lens on the immediate desirability of an action or outcome, truly intelligent and aligned behavior demands foresight. This is where explicit value functions come into play, moving beyond the simple heuristic discount factors (like gamma, γ) that were once standard. In 2025, we're witnessing a significant shift towards learning and representing value functions as distinct, sophisticated models themselves.
Value is the long-range forecast, a rolling estimate of future cumulative reward that keeps the agent from sprinting off cliffs for one more cookie. Historically, this long-range forecast was often implicitly embedded within algorithms like Q-learning, where the 'Q-value' represented the expected future reward for taking an action in a state. The discount factor γ was a simple knob to tune how much the agent prioritized future rewards over immediate ones.
But as tasks grow more complex and agents face sparse or deceptive rewards, a hand-tuned scalar discount factor becomes profoundly insufficient. Modern agentic systems benefit immensely from explicitly modeling the value function.
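To make the role of γ concrete, here is a small sketch of the discounted return for a single episode. The reward sequence is hypothetical: nothing until the final step.

```python
def discounted_return(rewards, gamma):
    """Return the sum over t of gamma**t * rewards[t], computed back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


sparse_episode = [0.0, 0.0, 0.0, 1.0]  # hypothetical: reward only at the end
patient = discounted_return(sparse_episode, gamma=0.99)  # 0.99 ** 3
myopic = discounted_return(sparse_episode, gamma=0.5)    # 0.5 ** 3
```

The same delayed reward is worth roughly 0.97 to the patient agent and 0.125 to the myopic one, which shows how much a single knob changes what the agent considers worth pursuing.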
Why the Shift to Explicit Value Functions?
In 2025, more AI systems use learned value functions – not just rewards – to guide behavior. Why?
1. Better Long-Term Planning
A value function helps an agent weigh trade-offs. A warehouse bot might temporarily skip a task if it predicts a higher overall gain from finishing another route first. In games, agents use value to plan several moves ahead instead of chasing quick wins.
2. Debugging Agent Beliefs
Inspecting the value function reveals what the agent thinks leads to good outcomes. If it keeps making bad long-term decisions, chances are its value predictions – not the reward signal – are off.
3. Learning with Sparse Rewards
When extrinsic rewards are rare (e.g., in exploration-heavy tasks or robotics), the value function fills in the gaps. A robot may learn that approaching a door might lead to reward, even before it knows how to open it.
4. Hierarchical Goal Setting
In multi-level agents, high-level policies set subgoals. Value functions help define what counts as a good subgoal – like a self-driving car setting “merge safely” before “reach destination.”
These value functions are often neural nets, not lookup tables, and they’re fragile: miscalibration or overestimation is common. But done right, they help agents go from reactive to truly goal-directed.
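The "fills in the gaps" effect from point 3 can be sketched with tabular TD(0) on a toy chain where only the final transition is rewarded. The chain environment, step size, and episode count are hypothetical, and TD(0) is used here only as a simple stand-in; the article does not name a specific algorithm.

```python
def td0_chain(n_states=5, episodes=200, alpha=0.1, gamma=0.9):
    """TD(0) value estimates on a left-to-right chain with reward at the end."""
    values = [0.0] * (n_states + 1)  # last entry is the terminal state, kept at 0
    for _ in range(episodes):
        for s in range(n_states):
            s_next = s + 1
            reward = 1.0 if s_next == n_states else 0.0
            td_target = reward + gamma * values[s_next]
            values[s] += alpha * (td_target - values[s])
    return values[:n_states]


values = td0_chain()
```

Even though four of the five transitions pay nothing, every state ends up with a nonzero value that rises toward the rewarded end, so the agent has a gradient to follow long before it reaches the reward.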
There is much more to cover on this topic, but we stop this overview here. The field will keep progressing and will see more important developments. If you want us to go deeper into this topic, reply to this email.
If you are interested specifically in reward hacking, this is a perfect – and very detailed – blog post by Lilian Weng: Reward Hacking in Reinforcement Learning
Concluding Thoughts – Your Blueprint for AI Behavior
Reward engineering has evolved into a core production discipline. The mathematical foundations laid by Sutton, Barto, and others remain solid, but the engineering practices around those foundations have become sophisticated systems in their own right. To keep the reward signal meaningful in production, this ecosystem needs constant measurement, validation, monitoring, and adaptation.
As AI agents become more capable and are deployed in more critical applications, getting the reward and value (!) right becomes a question of alignment, safety, and trust. The field has learned, sometimes the hard way, that agents optimize exactly what you measure, not what you intended.
The future belongs to systems that can learn, validate, and adapt their reward functions as they encounter new situations. We're moving from hand-crafted objectives to learned preferences, from static rewards to adaptive goals, and from single-agent optimization to multi-stakeholder alignment. And to sentience? I still don’t think so.
Resources to Dive Deeper:
Reinforcement Learning: An Introduction (Second edition) by Richard S. Sutton and Andrew G. Barto (online book) (to buy it on Amazon)
Reward is not the optimization target by Alex Turner (article)
Reward Hacking in Reinforcement Learning by Lilian Weng (article)
Reward-respecting subtasks for model-based reinforcement learning by Richard Sutton et al. (article)
Reinforcement Learning from Human Feedback by Nathan Lambert (online book)
Scalable agent alignment via reward modeling: a research direction by Jan Leike et al. (paper)
Global RLHF Services Market Research Report 2024
Agentic Reward Modelling (GitHub)
Text2Reward: Reward Shaping with Language Models for Reinforcement Learning by Tianbao Xie et al. (paper)
Improving the Effectiveness of Potential-Based Reward Shaping in Reinforcement Learning by Henrik Muller and Daniel Kudenko (paper)
Why reward models are key for alignment by Nathan Lambert (article)
Direct Preference Optimization: Your Language Model is Secretly a Reward Model by Rafael Rafailov et al. (paper)