Long-Running AI Agents: Stateful Architectures for Production

Most agent architectures are secretly stateless. In this guest post, Addy Osmani, Director at Google Cloud, and Shubham Saboo, Senior AI Product Manager at Google Cloud, discuss what it takes to build ones that aren’t.

Why most AI agents fail in production

Developers spend weeks perfecting prompt engineering, tool calling, and response latency. None of it matters when your agent needs to stay alive for five days.

The workflows that actually matter in production — processing thousands of insurance claims, running week-long sales sequences, reconciling financial data across systems — don't fit inside a single conversation turn. They take days, not seconds. And the moment you try to build them, you run into a wall that most tutorials skip over: most agent architectures reconstruct context from scratch on every interaction. They lose the reasoning chain, the soft signals, and the confidence gradients that made the agent's previous decisions make sense.

This is the production gap. Demos close it with short, clean tasks. Real systems don't get that luxury.

At Google Cloud Next '26, we announced that Agent Runtime now supports long-running agents that maintain state for up to seven days. What follows are five design patterns — drawn from what we've seen actually work in production — for building agents that survive contact with reality.

Pattern 1: Checkpoint-and-Resume

The most common failure mode in multi-day workflows is context loss. An agent processes 200 documents over four hours, then hits an error on document 201. Without checkpointing, you restart from scratch.

The fix is conceptually simple but architecturally important: treat your agent like a long-running server process, not a request handler. Build it the way you'd build a data pipeline that processes millions of records: checkpoint progress, handle partial failures, and ensure idempotency.

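from datetime import datetime

from google.adk import Agent
from google.adk.tools import ToolContext

class DocumentProcessor(Agent):
    """Processes large document sets with checkpoint-and-resume.

    load_checkpoint, save_checkpoint, classify_and_extract, and
    compile_final_report are assumed to be implemented on this class.
    """

    async def process_batch(self, docs: list, ctx: ToolContext):
        # Resume from the last saved position and restore earlier results,
        # so a restart does not drop the work done before the failure.
        checkpoint = self.load_checkpoint()
        start_idx = checkpoint.get("last_processed", 0)
        self.results = list(checkpoint.get("partial_results", []))

        for i, doc in enumerate(docs[start_idx:], start=start_idx):
            result = await self.classify_and_extract(doc)
            self.results.append(result)

            # Checkpoint every 50 documents
            if (i + 1) % 50 == 0:
                self.save_checkpoint({
                    "last_processed": i + 1,
                    "partial_results": self.results,
                    "timestamp": datetime.now().isoformat()
                })

        return self.compile_final_report()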

Notice the checkpoint granularity. Not after every document (wasteful). Not only at the end (risky). Fifty documents per batch balances durability against overhead. Your specific number depends on how expensive each unit of work is.
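For illustration, here is a minimal sketch of what the load and save helpers might look like, assuming checkpoints live in a local JSON file. The FileCheckpointStore name and file layout are hypothetical and not part of ADK or Agent Runtime. The part worth noticing is the write-to-temp-then-rename step, which keeps a crash mid-write from corrupting the last good checkpoint.

```python
import json
import os
import tempfile
from pathlib import Path


class FileCheckpointStore:
    """Hypothetical JSON-file checkpoint store (illustration only)."""

    def __init__(self, path: str):
        self.path = Path(path)

    def load(self) -> dict:
        # No checkpoint yet means a fresh run starting at index 0.
        if not self.path.exists():
            return {"last_processed": 0, "partial_results": []}
        return json.loads(self.path.read_text())

    def save(self, state: dict) -> None:
        # Write to a temp file in the same directory, then atomically
        # replace the old checkpoint so readers never see a partial file.
        fd, tmp = tempfile.mkstemp(dir=self.path.parent)
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, self.path)


store = FileCheckpointStore(os.path.join(tempfile.mkdtemp(), "ckpt.json"))
fresh = store.load()
store.save({"last_processed": 50, "partial_results": ["a", "b"]})
resumed = store.load()
```

Saving the same state twice leaves the file unchanged, which is the idempotency property the pipeline analogy calls for.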

Pattern 2: Delegated Approval (Human-in-the-Loop, Done Right)

Every agent framework advertises human-in-the-loop. But in practice, most implementations amount to: serialize state to JSON, send a webhook, hope someone checks it.

The problems compound fast. JSON serialization loses implicit reasoning context. Notifications compete with dozens of other alerts. When the human responds hours later, the agent has to deserialize, re-establish context, and hope nothing changed in the interim.

Long-running agents handle this differently. When the agent hits an approval gate, it pauses in place. The full execution state stays intact: reasoning chain, working memory, tool call history, pending action. The agent consumes zero compute while waiting. Sub-second cold starts mean a negligible latency penalty when it resumes.

The critical detail is the time accounting. If an agent pauses for human review at hour 8 and the reviewer responds at hour 32, those 24 hours are dead time for the agent but productive time for the human. The agent doesn't drift, degrade, or need re-priming. It picks up exactly where it left off.
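The time accounting above can be sketched in a few lines. This is an assumption-laden illustration: RunClock is a hypothetical helper for your own metrics, hours are plain floats, and nothing here reflects how Agent Runtime itself measures time.

```python
from dataclasses import dataclass, field


@dataclass
class RunClock:
    """Tracks active versus paused hours for one agent run (hypothetical)."""

    started_at: float
    pauses: list = field(default_factory=list)  # [pause_hour, resume_hour or None]

    def pause(self, at: float) -> None:
        self.pauses.append([at, None])

    def resume(self, at: float) -> None:
        self.pauses[-1][1] = at

    def paused_hours(self, now: float) -> float:
        # An open pause (no resume yet) is counted up to `now`.
        return sum((end if end is not None else now) - start
                   for start, end in self.pauses)

    def active_hours(self, now: float) -> float:
        return (now - self.started_at) - self.paused_hours(now)


clock = RunClock(started_at=0.0)
clock.pause(at=8.0)    # agent hits the approval gate
clock.resume(at=32.0)  # reviewer responds
```

At hour 40 of wall-clock time, this run has been paused for 24 hours and active for 16.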

At scale — if you're managing twenty concurrent long-running agents — you need a unified inbox that categorizes what needs attention. Not Slack channels. Not email threads. A structured queue: "Needs your input," "Errors," "Completed."
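A minimal sketch of that structured queue, assuming each agent reports one of three status strings. The AgentEvent type, the status names, and the agent IDs are all made up for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AgentEvent:
    agent_id: str
    status: str   # assumed values: "awaiting_approval", "failed", "done"
    summary: str


# Maps raw agent status to the inbox section a human actually reads.
CATEGORY_BY_STATUS = {
    "awaiting_approval": "Needs your input",
    "failed": "Errors",
    "done": "Completed",
}


def build_inbox(events):
    """Group events from many agents into one categorized inbox."""
    inbox = defaultdict(list)
    for event in events:
        inbox[CATEGORY_BY_STATUS[event.status]].append(event)
    return dict(inbox)


events = [
    AgentEvent("claims-7", "awaiting_approval", "Claim over payout threshold"),
    AgentEvent("recon-2", "failed", "Ledger API timeout"),
    AgentEvent("sales-4", "done", "Sequence finished"),
    AgentEvent("claims-9", "awaiting_approval", "Missing policy document"),
]
inbox = build_inbox(events)
```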

Pattern 3: Memory-Layered Context (and Why You Have to Govern It)

A seven-day agent needs more than session state. It needs to remember things from previous sessions, organizational context that no single conversation could contain, and user preferences from weeks ago.

This is where the architecture gets interesting — and where most teams underestimate the risk.

Think of long-term memory as analogous to a knowledge base: it accumulates everything the agent has learned across interactions, organized by topic. Working memory, by contrast, provides low-latency access to specific, high-accuracy details needed right now. The two layers work together, but they have to be kept distinct.
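One way to picture the two layers, as a rough sketch rather than any particular product's memory API: long-term memory is a topic-keyed store that survives sessions, and working memory is a small scratch space that is loaded from it and thrown away afterwards. The LayeredMemory class and its method names are hypothetical.

```python
class LayeredMemory:
    """Hypothetical two-layer memory: durable long-term, disposable working."""

    def __init__(self):
        self.long_term = {}  # topic -> list of facts, persists across sessions
        self.working = {}    # topic -> facts needed for the current task only

    def remember(self, topic: str, fact: str) -> None:
        self.long_term.setdefault(topic, []).append(fact)

    def load_into_working(self, topic: str) -> None:
        # Copy, so scratch edits never silently rewrite long-term memory.
        self.working[topic] = list(self.long_term.get(topic, []))

    def end_session(self) -> None:
        self.working.clear()


memory = LayeredMemory()
memory.remember("acme-corp", "Prefers invoices in EUR")
memory.load_into_working("acme-corp")
memory.working["acme-corp"].append("Temporary note for this task")
memory.end_session()
```

The copy in load_into_working is what keeps the layers distinct: the temporary note never reaches long-term memory.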

Here's the problem most developers don't anticipate until production: memory drift.

Your agent's behavior isn't shaped only by its code and prompts. It's shaped by accumulated experience. If an agent "learns" from a few atypical interactions that a procedural shortcut is acceptable, it may start applying that shortcut broadly. And if multiple agents read and write to shared memory pools, data leakage between distinct workflows becomes a real risk — the kind that's hard to detect and harder to explain to a compliance team.

This is where the governance layer becomes non-negotiable. You can't let agents write to a shared memory store unchecked. You need to govern them the same way you govern microservices. Concretely, this means three things:

Agent identity. Every agent needs a cryptographic identity that determines exactly which memory banks and tools it's authorized to access. Think IAM, but for agents.

Centralized registry. When you have dozens of long-running agents, you need a single source of truth for which agents are active, what version of the prompt and code they're running, and what their current execution state is.

Policy enforcement at the boundary. A governance layer sitting between the agent and its memory should evaluate every access request against organizational policies. If an agent tries to commit PII to long-term memory, that transaction should be blocked before it happens — not audited after.
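The identity and boundary-enforcement ideas above can be sketched together. This is a simplified illustration under stated assumptions: a plain agent ID string stands in for a verified cryptographic identity, two regular expressions stand in for a real PII detector, and GovernedMemory, GRANTS, and the bank names are hypothetical.

```python
import re

# Crude stand-ins for a real PII detector (assumption for this sketch).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Which memory banks each agent identity may write to.
GRANTS = {"claims-agent": {"claims-memory"}}


class PolicyViolation(Exception):
    """Raised before a disallowed write reaches the store."""


class GovernedMemory:
    def __init__(self, grants):
        self.grants = grants
        self.banks = {}

    def write(self, agent_id: str, bank: str, text: str) -> None:
        if bank not in self.grants.get(agent_id, set()):
            raise PolicyViolation(f"{agent_id} may not write to {bank}")
        if EMAIL.search(text) or SSN.search(text):
            raise PolicyViolation("PII detected; write blocked")
        self.banks.setdefault(bank, []).append(text)


def attempt(memory, agent_id, bank, text) -> str:
    """Helper for the demo: report the outcome instead of raising."""
    try:
        memory.write(agent_id, bank, text)
        return "ok"
    except PolicyViolation as exc:
        return f"blocked: {exc}"


memory = GovernedMemory(GRANTS)
ok = attempt(memory, "claims-agent", "claims-memory", "Claim 114 approved")
pii = attempt(memory, "claims-agent", "claims-memory", "Contact jane@example.com")
wrong_bank = attempt(memory, "claims-agent", "sales-memory", "Lead is warm")
```

Both rejected writes fail before anything is stored, which is the "blocked, not audited after" behavior described above.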

The question to ask yourself isn't just "what are my agents doing?" It's "what are my agents remembering, and how is that changing their behavior over time?"

Pattern 4: Ambient Processing

Not every long-running agent interacts with humans. Some are ambient. They watch for events, process data streams, and take action in the background without any user prompting.

A real-world mesh looks something like the diagram above. A content moderation agent consumes new uploads from Pub/Sub and routes flagged content to human review. A data quality agent watches BigQuery for new rows, detects anomalies, and delegates remediation to a data engineering specialist via A2A. A customer event agent ingests support tickets and classifies them in real time — routing billing questions, technical issues, and VIP cases to dedicated downstream agents. None of these agents wait to be asked. They run continuously, reacting to the event stream.
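As a rough sketch of the customer event agent's routing step: the keyword classifier below is a deliberately naive stand-in for a model call, and the ticket fields and downstream agent names are invented for the example. An in-memory list stands in for the event stream.

```python
# Downstream agent names are hypothetical.
ROUTES = {
    "vip": "vip-agent",
    "billing": "billing-agent",
    "technical": "tech-support-agent",
}


def classify_ticket(ticket: dict) -> str:
    """Naive keyword classifier standing in for a model call."""
    if ticket.get("vip"):
        return "vip"
    text = ticket["text"].lower()
    if any(word in text for word in ("invoice", "refund", "charge")):
        return "billing"
    return "technical"


def route(stream):
    """React to each event as it arrives; nothing waits for a user prompt."""
    for ticket in stream:
        yield ticket["id"], ROUTES[classify_ticket(ticket)]


tickets = [
    {"id": 1, "text": "I was charged twice this month"},
    {"id": 2, "text": "App crashes on login"},
    {"id": 3, "text": "Need a refund", "vip": True},
]
routed = list(route(tickets))
```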

The architectural decision that matters most here ties back to Pattern 3: don't hardcode policies into the agent.

Define them in your governance layer and let the agent enforce them at runtime. This separation matters because ambient agents run unsupervised for long stretches. If you hardcode policies, every policy change requires redeploying every agent. If you externalize them, you update once and the whole fleet picks up the new rules: no downtime, no redeployment, and no risk that one agent is still running an outdated version of your compliance rules.
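A minimal sketch of externalized policy, assuming a shared in-process store that agents consult on every event. PolicyStore, AmbientAgent, and the max_upload_mb rule are hypothetical names; a real governance layer would sit behind a network service.

```python
class PolicyStore:
    """Hypothetical central policy store shared by the whole fleet."""

    def __init__(self, rules: dict):
        self._rules = dict(rules)
        self.version = 1

    def update(self, rules: dict) -> None:
        self._rules = dict(rules)
        self.version += 1

    def get(self, name: str):
        return self._rules[name]


class AmbientAgent:
    def __init__(self, name: str, policies: PolicyStore):
        self.name = name
        self.policies = policies

    def handle(self, upload_size_mb: int) -> str:
        # The limit is looked up per event, never baked into the agent.
        limit = self.policies.get("max_upload_mb")
        return "accept" if upload_size_mb <= limit else "flag"


policies = PolicyStore({"max_upload_mb": 100})
fleet = [AmbientAgent("moderation-1", policies), AmbientAgent("moderation-2", policies)]

before = [agent.handle(80) for agent in fleet]
policies.update({"max_upload_mb": 50})  # one change, no redeploys
after = [agent.handle(80) for agent in fleet]
```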

Pattern 5: Fleet Orchestration

The final pattern is about managing multiple long-running agents as a coordinated fleet. In production, you rarely have a single agent working alone. You have a coordinator agent that delegates sub-tasks to specialist agents, each running independently for different durations.

Consider a sales prospecting sequence. A coordinator breaks the work into components: research, scoring, sequencing, outreach, and follow-up. Each of those is a specialist agent running on its own timeline, with its own identity, its own tool permissions, and its own entry in the registry.

The coordinator maintains global state and handles handoffs between specialists. This is the same coordinator/worker pattern used in distributed systems for decades. What's new is that it can be defined declaratively through graph-based workflows, where the structure of the coordination logic is enforced by the framework rather than expressed in a system prompt that an LLM might decide to shortcut.
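To make "declared as a graph" concrete, here is a small sketch using the standard library's graphlib. It is not the framework's workflow API: the WORKFLOW dictionary, the run_workflow function, and the toy specialists are assumptions for illustration. The execution order comes from the graph, not from a prompt.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it.
WORKFLOW = {
    "research": set(),
    "scoring": {"research"},
    "sequencing": {"scoring"},
    "outreach": {"sequencing"},
    "follow_up": {"outreach"},
}


def run_workflow(graph, specialists):
    """Coordinator: run specialists in dependency order, keeping global state."""
    state = {}
    order = []
    for step in TopologicalSorter(graph).static_order():
        state[step] = specialists[step](state)  # handoff via shared state
        order.append(step)
    return order, state


# Toy specialists; real ones would be independent long-running agents.
specialists = {
    name: (lambda state, n=name: f"{n} done after {len(state)} steps")
    for name in WORKFLOW
}
order, state = run_workflow(WORKFLOW, specialists)
```

Because the dependencies are data, a specialist cannot be skipped or reordered by anything other than a change to the graph itself.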

The operational advantage of treating each specialist as an independent unit is that you can update them independently. If your scoring logic needs improvement, you deploy the new version, monitor its performance, and promote it only when the results hold up. A bad deployment in one specialist never cascades to the others.
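The deploy, monitor, promote loop might be gated by something like the following sketch. The metric, the minimum sample count, and the version labels are assumptions; a production gate would use whatever evaluation signal you already trust.

```python
from statistics import mean

# Currently promoted version of each specialist (hypothetical registry view).
deployed = {"scoring": "v1", "outreach": "v3"}


def should_promote(baseline, candidate, min_samples=3) -> bool:
    """Promote only with enough evidence and no regression in the mean."""
    if len(candidate) < min_samples:
        return False
    return mean(candidate) >= mean(baseline)


def promote_if_ready(specialist, new_version, baseline, candidate) -> str:
    if should_promote(baseline, candidate):
        deployed[specialist] = new_version  # touches only this specialist
    return deployed[specialist]


baseline_scores = [0.80, 0.82, 0.81]
too_early = promote_if_ready("scoring", "v2", baseline_scores, [0.90])
promoted = promote_if_ready("scoring", "v2", baseline_scores, [0.85, 0.86, 0.84])
```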

A2A and MCP: The Interoperability Layer

One thing the five patterns above don't fully address: most organizations won't build every agent they need from scratch. The real leverage comes from agents built by different teams — sometimes in different languages, sometimes at different companies — being able to discover and collaborate with each other.

Two open protocols are emerging as the connective tissue here. A2A (Agent-to-Agent) standardizes how agents communicate with other agents. MCP (Model Context Protocol) standardizes how agents communicate with tools and data sources. Together, they mean that a Python-based coordinator can delegate to a Go-based specialist, which can delegate to a Java-based compliance checker, without any of those teams needing to negotiate custom integration formats.

Every A2A-compatible agent publishes a card at a well-known URL describing its capabilities, authentication requirements, and rate limits. Think of it as an OpenAPI spec designed for agent-to-agent interaction rather than client-server communication. A central registry amplifies this: other agents across your organization can discover capabilities without knowing specific URLs. The registry becomes the service mesh for your agent ecosystem.
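A simplified sketch of card-based discovery follows. The AgentCard fields here are illustrative and are not the A2A schema; Registry is a hypothetical in-memory stand-in, and the URLs are placeholders.

```python
from dataclasses import dataclass


@dataclass
class AgentCard:
    """Simplified stand-in for an agent card (field names are illustrative)."""

    name: str
    url: str
    skills: list
    auth: str = "oauth2"


class Registry:
    """Hypothetical in-memory registry keyed by agent name."""

    def __init__(self):
        self._cards = {}

    def publish(self, card: AgentCard) -> None:
        # Re-publishing under the same name replaces the old card.
        self._cards[card.name] = card

    def find(self, skill: str) -> list:
        return [c for c in self._cards.values() if skill in c.skills]


registry = Registry()
registry.publish(AgentCard("fin-analysis", "https://agents.example/fin/v1", ["financial_analysis"]))
registry.publish(AgentCard("doc-processing", "https://agents.example/docs/v1", ["ocr", "extraction"]))

found = registry.find("financial_analysis")

# The owning team ships a new version and updates its card.
registry.publish(AgentCard("fin-analysis", "https://agents.example/fin/v2", ["financial_analysis"]))
found_after_update = registry.find("financial_analysis")
```

The caller asks for a skill, never a URL, so the second lookup returns the new endpoint without any change on the caller's side.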

The diagram above shows what this looks like in practice. A coordinator agent doesn't need to know the URL of the financial analysis agent, or that it's written in Python, or how its auth works. It queries the registry, finds the card, and connects. When the document processing team ships a new version of their Java agent, they update the card. Every coordinator in the organization gets the upgrade automatically.

MCP handles the other side: connecting agents to databases, enterprise systems, and APIs through a single protocol. Without it, every data connection requires its own custom integration code. With it, a Stripe connector looks the same to your agent as a BigQuery connector — the protocol is the interface, and the backend is interchangeable.
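The "protocol is the interface" idea can be sketched with a plain Python Protocol. To be clear, this is not the MCP SDK: Connector, the two fake backends, and the tool names are hypothetical, and only the list-tools and call-tool shape is borrowed from MCP's general model.

```python
from typing import Protocol


class Connector(Protocol):
    """The single interface the agent codes against (not the MCP SDK)."""

    def list_tools(self) -> list: ...
    def call_tool(self, name: str, args: dict) -> dict: ...


class FakePaymentsConnector:
    def list_tools(self) -> list:
        return ["get_invoice"]

    def call_tool(self, name: str, args: dict) -> dict:
        return {"invoice_id": args["invoice_id"], "status": "paid"}


class FakeWarehouseConnector:
    def list_tools(self) -> list:
        return ["run_query"]

    def call_tool(self, name: str, args: dict) -> dict:
        return {"rows": [[1]], "sql": args["sql"]}


def invoke(connector: Connector, tool: str, args: dict) -> dict:
    """Agent-side call path; identical for every backend."""
    if tool not in connector.list_tools():
        raise ValueError(f"unknown tool: {tool}")
    return connector.call_tool(tool, args)


invoice = invoke(FakePaymentsConnector(), "get_invoice", {"invoice_id": "inv_1"})
query = invoke(FakeWarehouseConnector(), "run_query", {"sql": "SELECT 1"})
```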

The governance story matters here too. Each organization maintains its own governance boundaries. Your policy enforcement layer controls what data your agents can share with a partner agent, what actions they're permitted to take based on the partner's responses, and what information they're allowed to request. Cross-organization collaboration happens through the protocol; each side enforces its own security model independently.
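One piece of that boundary, the control over what data leaves your side, might look like the sketch below. The SHARE_POLICY table, partner name, and record fields are invented for illustration; the default of sharing nothing with an unknown partner is a design choice, not something the protocols mandate.

```python
# Fields each partner agent is allowed to receive (hypothetical policy).
SHARE_POLICY = {
    "partner-credit-agent": {"applicant_id", "requested_amount"},
}


def outbound_payload(partner: str, record: dict) -> dict:
    """Strip every field the policy does not explicitly allow for this partner."""
    allowed = SHARE_POLICY.get(partner, set())  # unknown partner: share nothing
    return {key: value for key, value in record.items() if key in allowed}


record = {
    "applicant_id": "A-1001",
    "requested_amount": 25000,
    "home_address": "12 Example Street",
    "internal_risk_notes": "Flagged by analyst",
}
to_partner = outbound_payload("partner-credit-agent", record)
to_unknown = outbound_payload("unlisted-agent", record)
```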

The multi-language, cross-team version of this plays out exactly as you'd expect. A customer onboarding workflow might involve a Python coordinator, a Go identity-verification agent owned by the security team, a Java credit-assessment agent owned by risk, a Go account-provisioning agent owned by platform, and a TypeScript communication agent owned by marketing — each team iterating independently, none of them blocked by the others.

Choosing the Right Pattern

These patterns compose. A compliance system might use Checkpoint-and-Resume for document processing, Delegated Approval for review gates, Memory-Layered Context for cross-session knowledge, and Fleet Orchestration to coordinate specialists.

The key diagnostic question: what is the longest uninterrupted unit of work your agent needs to perform?

If it's minutes, you probably don't need long-running agents. If it's hours or days, these patterns are where you start — and the governance and interoperability layers become load-bearing, not optional.

The companies building isolated, stateless agents today will be refactoring in twelve months. The ones building with persistence, governance, and interoperability in mind will be compounding their advantage every day.

The Gemini Enterprise Agent Platform provides the infrastructure for building, deploying, and governing long-running agent fleets described in this article. You can explore it here.

*This guest post was written by Addy Osmani, Director, Google Cloud and Shubham Saboo, Senior AI Product Manager, Google Cloud. We thank Google Cloud for their support of Turing Post’s mission to bring clarity to the AI landscape.

FAQ

What are long-running AI agents?

Long-running AI agents are AI systems designed to preserve execution state, memory, and workflow continuity across hours or days instead of resetting after each interaction.

Why do most AI agents fail in production?

Most agents are effectively stateless. They lose reasoning history, context, and operational state between interactions, which breaks multi-step workflows and long-duration tasks.

What is checkpoint-and-resume for AI agents?

Checkpoint-and-resume allows an agent to save execution state during long workflows so it can recover from failures or pauses without restarting from scratch.

What is A2A in AI systems?

A2A, or Agent-to-Agent, is a protocol that standardizes communication between AI agents so different systems can discover and coordinate with each other.

What is MCP?

MCP, or Model Context Protocol, standardizes how AI agents connect to tools, APIs, databases, and external systems.

Why does governance matter for long-running agents?

Long-running agents accumulate memory, permissions, and operational context over time. Governance layers control identity, memory access, compliance, and policy enforcement across agent fleets.

The Production Gap: Five Patterns for Building Long-Running AI Agents*