🦸🏻#19: What Agents Desire? Reward and Value Functions in AI

things you need to know about the role rewards and values play in an agent’s ‘life’

“That all of what we mean by goals and purposes can be well thought of as the maximization of the expected value of the cumulative sum of a received scalar signal (called reward).”

– The Reward Hypothesis, Richard Sutton and Andrew Barto

In 2025, every credible AI agent stack – whether it flies a drone swarm or orchestrates your calendar – relies on two numbers hiding in plain sight: reward and value. Reward is the system's pulse: a scalar signal that says, “More of that.” Value is the long-range forecast – an estimate of future reward, discounted and updated with every experience. Together, they govern not just what the agent does, but how it learns to do anything at all. 
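To make those two numbers concrete, here is a minimal sketch (illustrative, not from any particular library) of the discounted return that defines value and a TD-style update that refines the estimate from experience; the rewards, discount factor, and step size are placeholder choices.

```python
# Reward is the scalar the environment hands back each step; value is the
# discounted forecast of everything still to come, refined by experience.

def discounted_return(rewards, gamma=0.99):
    """G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def td0_update(value, reward, next_value, alpha=0.1, gamma=0.99):
    """Nudge the current value estimate toward the one-step bootstrapped target."""
    target = reward + gamma * next_value
    return value + alpha * (target - value)

# Example: three steps of reward, then one update of the first state's estimate.
print(discounted_return([1.0, 0.0, 2.0]))                 # 1 + 0.99^2 * 2 ≈ 2.9602
print(td0_update(value=0.5, reward=1.0, next_value=2.0))  # 0.5 + 0.1 * (2.98 - 0.5) = 0.748
```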

Reward and value have always been foundational, but their criticality has skyrocketed. In an era of multi-modal world models, intricate multi-agent systems, trillion-step offline datasets, and live streaming feedback from production users, the reward hacking you might have overlooked in 2023 is now an immediate threat. If your reward function has a loophole, your agent – especially within multi-agent systems – will exploit it within hours.

I'll go further. I don't believe in machine sentience, but I do believe that the web of human-crafted rewards and value functions is responsible for what might look like sentience. This is why modern reward design has rapidly evolved into a crucial operational discipline. As systems grow more complex, the gap between intended values and operationalized rewards creates increasingly sophisticated behavior that mimics understanding and preference.

If you're engineering an AI agent, reward design is your most powerful – and, thanks to reward hacking, most perilous – lever. This article is a starting guide to the role rewards and values play in an agent, and it may also serve as a useful refresher for seasoned RL-ers. Let’s dive in.


What’s in today’s episode?

  • Before Modern Reward Engineering: How We Got Here

  • Why Reward Engineering Is Having Its Moment in 2025

  • The Theory-to-Engineering Translation

    • Dense vs. Sparse Rewards: Navigating the Feedback Spectrum

    • The Power of Intrinsic Motivation: The Agent's Inner Drive

    • Takeaway for Builders

  • The Anatomy of a Reward Pipeline

  • “Why reward models are key for alignment”

  • Explicit Value Functions: Seeing Beyond the Immediate Gratification

  • Concluding Thoughts – Your Blueprint for AI Behavior

  • Resources to dive deeper

Before Modern Reward Engineering: How We Got Here

Reward signals have always been the heartbeat of agent training. For seven decades, AI has twisted, stretched, and reimagined what it means for a machine to “want” something and to “learn”. To understand today's reward mechanisms, tracing the AI milestones that got us here is essential.

Why Reward Engineering Is Having Its Moment in 2025

The field has reached an inflection point. For decades, reward design was mostly an academic curiosity – researchers working on game-playing agents or simple robotic tasks. When ChatGPT showed the commercial upside of Reinforcement Learning from Human Feedback (RLHF), the bottleneck moved upstream – from model size to reward quality. A decade ago, reward modeling merited a single lecture in an RL course; this spring, investors sized the RLHF-services market at $6.4 billion and expect it to more than double by 2030. Companies learned the hard way that no amount of transformer capacity rescues a misguided objective: quality depends on getting the reward signal right. Companies like Anthropic now have dedicated Reward Modeling teams. Reward engineers are the new prompt engineers, but with a mandate to encode what “good” actually means. Tricky.

Three shocks that rewired incentives

  1. Human-Feedback Pipelines Matured: RLHF once relied on small, static preference datasets. Now, multi-stage annotation, continuous evaluation loops, and synthetic preference generation empower teams to iterate weekly, not quarterly. Teams translate UX guidelines or safety constraints into datasets of preference pairs or behavior traces, then fit a supervised reward model to score completions accordingly (see the sketch after this list). This clarity transformed ad-hoc tweaking into an auditable process – critical once regulators started scrutinizing how models choose their words.

  2. Agents Escaped the Lab: Supply-chain optimizers, trading bots, customer-support swarms – 2025 deployments are less like episodic games and more like mission-critical workflows. Google DeepMind’s AlphaEvolve, for instance, evolves code for datacenter layouts and chip floor-plans, with reward functions tied to real energy bills and thermal envelopes. A mis-specified objective no longer just yields a bad score; it reroutes trucks, blows a carbon budget, or violates SEC rules. Suddenly, reward design feels akin to security engineering: one slip can cost millions.

  3. Scale Toppled Hand-Tuning: Orchestrating thousands of agents across heterogeneous tasks shattered the romantic notion of manually nudging coefficients for each environment. Teams have adopted “reward ops” platforms that log every reward-policy pair, auto-benchmark variants, and surface drift alerts. Open-source efforts like Agentic Reward Modeling are codifying best practices, blending human preference, factuality checks, and instruction-following scores into a composable stack. Think MLOps, but wired directly into the value layer.
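As promised in point 1, here is a hedged sketch of the supervised reward-model fit: preference pairs go in, a scalar scorer comes out, trained with the Bradley-Terry objective. Everything in it is illustrative – the random tensors stand in for encoded completions, and the small MLP head is a placeholder for whatever backbone your stack actually uses.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Maps a completion embedding to a single scalar reward."""
    def __init__(self, dim=768):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, x):
        return self.head(x).squeeze(-1)

def preference_loss(r_chosen, r_rejected):
    # Bradley-Terry: maximize P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

model = RewardModel()
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Stand-in batch: embeddings of annotator-preferred vs. rejected completions.
chosen, rejected = torch.randn(32, 768), torch.randn(32, 768)

loss = preference_loss(model(chosen), model(rejected))
opt.zero_grad()
loss.backward()
opt.step()
```

At inference time the same forward pass scores fresh completions – that scalar is what the downstream policy-optimization stage then maximizes.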

We’ve reached a stage where reward design is an entire system. Whether you're shipping an LLM, a multi-agent system, or an autonomous platform, you must know precisely what reward signal you're embedding – because your agent will take it literally.
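To illustrate the “composable stack” idea from point 3, here is a toy composite reward that blends a preference score, a factuality check, and an instruction-following check into one audited scalar. The component scorers and weights are hypothetical placeholders of my own, not the Agentic Reward Modeling API.

```python
from typing import Callable, Dict

Scorer = Callable[[str, str], float]  # (prompt, completion) -> score in [0, 1]

def composite_reward(scorers: Dict[str, Scorer], weights: Dict[str, float],
                     prompt: str, completion: str) -> float:
    """Weighted sum of component scores; log each term so the reward is auditable."""
    total = 0.0
    for name, scorer in scorers.items():
        s = scorer(prompt, completion)
        total += weights[name] * s
        print(f"{name}: {s:.2f} (weight {weights[name]})")
    return total

# Hypothetical components standing in for learned or programmatic checks.
scorers = {
    "preference": lambda p, c: 0.8,   # learned human-preference model
    "factuality": lambda p, c: 1.0,   # retrieval-backed fact check
    "instruction": lambda p, c: 0.6,  # did it follow the instructions?
}
weights = {"preference": 0.5, "factuality": 0.3, "instruction": 0.2}

print(composite_reward(scorers, weights, "Summarize the memo.", "..."))  # 0.82
```

Logging each term is what makes the reward auditable: when the composite drifts, you can see which component moved.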

The Theory-to-Engineering Translation (specified for 2025)

(Partially based on Sutton and Barto’s “Reinforcement Learning: An Introduction”)
