Technology
May 7, 2026
7 minutes

Anthropic's New Agent Stack: From Tools to Teams

There was no new model at yesterday's Code w/ Claude keynote. That's not a footnote. That's the headline.

What Anthropic actually shipped was three additions to Claude Managed Agents — multi-agent orchestration, Outcomes, and Dreaming — and the most useful way to read them is together, not separately. Individually, each one is a smart capability. Stacked, they sketch the rough shape of an operating model for AI agents that work the way real teams do.

For anyone building AI-powered products right now, that matters more than another half-point on a benchmark.

A quick frame on what was announced

Three new capabilities, all sitting inside Claude Managed Agents:

  • Multi-agent orchestration — a coordinator agent can delegate scoped work to specialized sub-agents, each with its own tools, prompt, and isolated context.
  • Outcomes — instead of stopping when the prompt ends, the agent works against a defined rubric and iterates until a separate grader confirms the work meets the bar.
  • Dreaming — the agent reviews its past sessions, curates its memory, and writes new, durable insights it can use next time.

Multi-agent orchestration and Outcomes are available now in research preview; Dreaming requires explicit access. All three sit on top of the existing Managed Agents harness, which provides the container, file system, tool execution, and persistent sessions you'd otherwise have to build yourself.

Now the interesting part: each of these features fixes a different failure mode that has been holding agents back in production.

1. Multi-agent orchestration: agents are getting an org chart

The keynote demo built a hypothetical lunar drone landing system out of three coordinating agents — a Commander, a Detector, and a Navigator. Each had its own role, tools, and context. The Commander didn't try to do everything. It directed.

That's the model. A coordinator agent declares which other agents it's allowed to call. When it delegates, the sub-agent runs in its own thread with its own conversation history, configuration, and tools. They share the same container and filesystem, but they don't share context windows. A research agent isn't dragging a code reviewer's transcript through its head while it works.

The patterns Anthropic clearly thinks fit best are visible in the docs:

  • A reviewer agent with read-only tools and a tightly scoped brief
  • A test-writer agent that writes and runs tests without touching production code
  • A research agent with web tools that hands findings back to the coordinator

There's a deliberate guardrail: only one level of delegation. A coordinator can call sub-agents, but those sub-agents can't spawn their own. That's the right call for production; recursive agent fan-out is one of the fastest ways to burn money and blow through timeouts.
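
To make that concrete, here's a minimal sketch of a coordinator and its sub-agents expressed as configuration. Everything in it is an illustrative assumption: the dict shape, field names, and tool names are stand-ins, not Anthropic's actual Managed Agents schema.

```python
# Hypothetical sketch of "team structure as configuration".
# All field, tool, and agent names are illustrative assumptions,
# not Anthropic's actual Managed Agents schema.

sub_agents = {
    "reviewer": {
        "prompt": "Review changes against the style guide. Report, don't edit.",
        "tools": ["file_read"],  # read-only by design
    },
    "test_writer": {
        "prompt": "Write and run tests. Never modify production code.",
        "tools": ["file_read", "write_tests", "run_tests"],
    },
    "researcher": {
        "prompt": "Gather sources on the web and hand back a summary.",
        "tools": ["web_search", "web_fetch"],
    },
}

coordinator = {
    "prompt": "Plan the task, delegate scoped work, integrate the results.",
    "tools": ["file_read", "file_write"],
    # Only the coordinator declares who it may call. Sub-agents carry no
    # such list, which is the one-level delegation guardrail in config form.
    "delegates_to": list(sub_agents),
}
```

Each entry gets its own context window and conversation history; only the container and filesystem are shared.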

Why this matters for product teams. Anyone who has tried to cram planning, research, execution, and review into a single agent has hit the same wall: context bloat, conflicting instructions, the agent losing track of what it's optimizing for. The fix is the same fix humans use — divide the work and define the handoffs.

What Anthropic shipped is essentially team structure as configuration. You can start designing agent systems the way you design an org chart: specialized roles, clear interfaces, scoped responsibility. That's a much better mental model than "one big prompt with a lot of instructions."

2. Outcomes: agents that know when they're actually done

This is the one most people will underestimate at first read.

Outcomes lets you define what success looks like as a rubric — a structured document with per-criterion expectations — and hand that to the agent. The agent works toward it. A separate grader, running with its own isolated context, evaluates the artifact against the rubric and returns either a pass or a specific, line-by-line list of gaps. The agent iterates again. The cycle repeats until the rubric is met or the iteration cap is hit (default 3, configurable up to 20).
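
The loop is worth seeing in code form. The sketch below is a conceptual shape, not Anthropic's implementation: `work` stands in for the agent, `grade` for the isolated grader, and `Verdict` for whatever structure the real grader returns.

```python
from typing import Callable, NamedTuple

class Verdict(NamedTuple):
    passed: bool
    gaps: list[str]  # specific, per-criterion shortfalls, not just a score

def run_to_outcome(
    work: Callable[[str], str],       # the agent: produce or revise an artifact
    grade: Callable[[str], Verdict],  # the grader: its own isolated context
    task: str,
    max_iterations: int = 3,          # default 3, configurable up to 20
) -> str:
    artifact = work(task)
    for _ in range(max_iterations):
        verdict = grade(artifact)     # the agent never marks its own homework
        if verdict.passed:
            return artifact
        # The grader returns concrete gaps, so each pass targets
        # specific fixes instead of a vague "make it better".
        artifact = work(f"Close these gaps: {verdict.gaps}\n\n{artifact}")
    return artifact                   # iteration cap hit: best effort so far
```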

Anthropic's worked example is a DCF model for Costco. The rubric specifies things like: five years of historical revenue, separately modeled COGS and operating expenses, a stated WACC calculation with sourced assumptions, terminal value methodology, sensitivity analysis, and final output in .xlsx with labeled sheets. The agent doesn't get to mark its own homework. A fresh, isolated grader does, against the rubric.
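
A rubric like that is ultimately structured data. Here it is sketched as plain Python, feeding the kind of loop above; the per-criterion shape and key names are assumptions, not Anthropic's rubric format.

```python
# The Costco DCF rubric from the example, sketched as per-criterion
# expectations. Keys and structure are assumptions, not Anthropic's format.
dcf_rubric = {
    "historical_revenue": "Five years of historical revenue",
    "cost_structure": "COGS and operating expenses modeled separately",
    "wacc": "WACC calculation stated, with sourced assumptions",
    "terminal_value": "Terminal value methodology stated",
    "sensitivity": "Sensitivity analysis included",
    "deliverable": "Final output in .xlsx with labeled sheets",
}
```

Because the grader reports misses per criterion rather than a single holistic grade, the loop converges on specific fixes instead of thrashing.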

Why this matters. The biggest failure mode in agentic workflows isn't that the agent can't do the work. It's that it doesn't know when the work is actually done. It stops when the prompt runs out, not when the output meets the bar. It produces something that looks complete but won't survive a real quality review. You hand it back; the agent fixes one thing and breaks two others.

Outcomes flips the order of operations. You front-load the definition of done. Then the agent runs inside a self-correcting loop with an evaluator that isn't entangled in its own reasoning. The shift is subtle and important: you're not just shipping prompts anymore. You're shipping evals.

For anyone running real AI workflows in production, that's a much more durable foundation. The teams that have already built strong evaluation muscles — the ones who can write rubrics for what "good" looks like — are about to compound their lead.

3. Dreaming: agents that learn from their own work

Dreaming is the feature that makes the other two compound.

Here's how it works. A Dream is an async job that takes an agent's existing memory store, plus up to 100 past session transcripts, and reads through everything. The model reflects on what happened — what worked, what didn't, what patterns repeated — and produces a new memory store. Duplicates merged. Stale or contradicted entries replaced with the latest version. New insights surfaced and saved as durable memory.

The original memory is never touched. The dream produces a separate output store you can review, adopt, or throw away.
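
Here's what that workflow could look like in sketch form. Every field and function name below is an illustrative assumption, not Anthropic's actual API.

```python
# Hypothetical sketch of a Dream request and the adoption step.
# Names and fields are illustrative assumptions, not Anthropic's API.

dream_request = {
    "agent": "lunar-lander",
    "memory_store": "memory/current",          # input only: never modified
    "max_transcripts": 100,                    # up to 100 past sessions
    "output_store": "memory/dream-candidate",  # separate, reviewable result
}

def adopt(current_store: str, candidate_store: str, approved: bool) -> str:
    """Adoption is an explicit decision, not an automatic overwrite:
    point the agent at the curated store, or keep the old one."""
    return candidate_store if approved else current_store
```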

The keynote demo: an agent that had been working on lunar landing tasks dreamed overnight and produced a descent-playbook.md — a curated set of lessons distilled from prior sessions, ready to be loaded into the next run.

Why this matters. The hardest part of running agents in production isn't the first session. It's the hundredth. Memory accumulates. Mistakes repeat. Lessons get lost in noise. Most agent setups today either remember too little (every session starts cold) or too much (a swamp of contradictory notes the agent has to wade through before doing real work).

Dreaming is essentially an overnight consolidation cycle. It's the closest thing yet to an agent that gets sharper the more you use it — without you manually editing its memory every time it ships something dumb. The cost structure of running agents over time finally starts to bend in your favor instead of against it.

The pattern: agents now have what teams need

Step back from the individual features and the picture is hard to miss.

Anthropic just gave agents three things every effective team has:

  • Division of labor — multi-agent orchestration lets specialized agents own scoped work in parallel.
  • Definition of done — Outcomes turns vague prompts into measurable targets with self-correcting loops.
  • Institutional memory — Dreaming lets agents review their own work and bake what they learn into durable knowledge.

That's not a coincidence. That's an operating model.

For the last two years, the conversation around agents has mostly been about raw model capability. Can it reason? Can it use tools? Can it work for hours? Those questions are largely settled. What was missing wasn't intelligence. It was structure.

This release fills that gap. It signals where Anthropic thinks the bottleneck is right now: not in capability, but in how agents coordinate, how they define success, and how they retain what they've learned. That's the bet a serious platform should be making.

What this means for product teams

A few implications worth sitting with:

Stop architecting features around a single big prompt. If your AI feature is one giant system prompt with a thousand instructions, you're already a generation behind. The next iteration is small, specialized agents with clean handoffs.

Invest in evals, not just prompts. Outcomes makes rubric writing a first-class skill. Teams that have built evaluation muscle will outpace teams still tuning wording.

Treat memory as architecture, not a side effect. With Dreaming on the table, agent memory becomes something you actively curate. That's an operations discipline that didn't really exist a year ago. It's about to separate well-run AI products from improvisational ones.

Design for the next operating model, not just the next model. The compounding gains from multi-agent orchestration plus Outcomes plus curated memory will matter more in real workflows than the next jump in raw model capability.

The takeaway

The most useful thing Anthropic shipped this week isn't a model. It's a thesis: agents will become genuinely useful in production not when they get smarter, but when they get organized.

Specialized roles. Clear definitions of done. Memory that compounds. That's not a model release. That's the scaffolding for how serious teams are going to build AI-native products from here on.

If you're a founder or product leader paying attention, the move isn't to wait for the next benchmark. It's to start designing your agent stack the same way you'd design a high-performing team — and to build the muscles around evals, role specialization, and memory curation now, before they become table stakes.

The studios and product teams that figure this out early are going to ship things their competitors won't be able to copy from a model upgrade.

→ Talk to Crowdlinker about your agent stack
