When AI does deep research, the system behind the answer is often not one chatbot but a temporary team of parallel AI workers. This article explains Claude subagents, workflows, context isolation, parallel tool use, structured outputs, reliability, and token economics.

Nebutra Originals
When you ask AI to do deep research and receive a cited report a few minutes later, the system behind it is often not one AI. It is a temporary team of AI workers, assembled for that task and running in parallel. This article explains, with as little jargon as possible, how that system works and why its success or failure is ultimately written in a token ledger.
Start with the intuition: the strongest AI products are quietly shifting from "one chatbot answers you" to "one AI acts as project lead and temporarily hires a team of AIs to do the work." Claude's Research feature and the broader category of Deep Research products are built around this pattern. The industry calls it a multi-agent system.
Why should AI builders, investors, and startup teams care? Because it determines two things at once: the ceiling of product capability, meaning whether AI can really complete complex tasks autonomously, and the floor of unit economics, meaning how much each completed task costs. Those two forces are why so many AI demos feel magical while commercialization remains hard.
This article moves in two parts. First it explains the underlying system: architecture, memory, concurrency, cost, and reliability. Then it answers the operational questions: when is this worth using, and how do you put it into production without losing control. A real Claude Code workflow example appears near the end to ground the abstractions in code.
Small glossary before reading
Token: the basic unit AI systems process and bill for. You can roughly think of it as text volume. Burning tokens means burning money.
Context window: the amount of information an AI can keep in working memory at once. When it fills up, information must be truncated, compressed, or moved elsewhere.
Agent: an AI system that does more than answer. It can call tools, search, read pages, write files, and keep taking steps until the task is done.
TL;DR
Anthropic draws an important architecture line in Building effective agents.[1] Both workflows and agents are agentic systems, but they differ in who owns the control flow.
| Dimension | Workflow | Agent |
|---|---|---|
| Control | LLMs and tools are orchestrated by predefined code paths. You own the pipeline. | The LLM decides the process and tool calls. The model owns the pipeline. |
| Subtasks | Fixed when the code is written. | Dynamically decomposed by the orchestrator for the current input. |
| Best fit | Clear boundaries, predictability, reproducibility. | Flexible work where model autonomy and scale matter. |
| Cost | More controlled, cheaper, easier to debug. | More expensive, nondeterministic, harder to debug, but higher ceiling. |
The example script near the end of this article is really a workflow: where the model is called, how many paths run in parallel, and when verification happens are all fixed in code. Claude's Research feature, where the model decides how many subagents to launch and what each one should do, is closer to a real agent. Anthropic's practical advice is simple: find the simplest solution first, and add complexity only when needed.[1]
Anthropic groups workflows into five escalating patterns: prompt chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer loops. The difference between orchestrator-worker and simple parallelization is that the subtasks are not predefined. The orchestrator decides them at runtime.[1]
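The difference between the two patterns is easier to see side by side. The sketch below is a minimal illustration, not Anthropic's implementation: `fake_llm`, `FIXED_SUBTASKS`, and the `PLAN:` prompt convention are all hypothetical stand-ins for a real model call.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical stand-in for a model call; a real system would call an LLM API here.
LLM = Callable[[str], Awaitable[str]]


async def fake_llm(prompt: str) -> str:
    """Toy model: returns a two-line plan for PLAN prompts, otherwise echoes."""
    if prompt.startswith("PLAN:"):
        topic = prompt.removeprefix("PLAN:").strip()
        return f"history of {topic}\ncurrent state of {topic}"
    return f"answer({prompt})"


# Parallelization: the subtasks are fixed when the code is written.
FIXED_SUBTASKS = ["summarize", "list risks", "list open questions"]


async def parallelization(llm: LLM, doc: str) -> list[str]:
    return await asyncio.gather(*(llm(f"{task}: {doc}") for task in FIXED_SUBTASKS))


# Orchestrator-workers: the orchestrator decides the subtasks at runtime.
async def orchestrator_workers(llm: LLM, question: str) -> list[str]:
    plan = await llm(f"PLAN: {question}")
    subtasks = [line for line in plan.splitlines() if line.strip()]
    return await asyncio.gather(*(llm(task) for task in subtasks))


fixed = asyncio.run(parallelization(fake_llm, "doc"))
dynamic = asyncio.run(orchestrator_workers(fake_llm, "solar"))
```

In the first function the fan-out width is a constant in the source code. In the second it depends on what the planning call returns for this particular input.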
How we built our multi-agent research system is the most concrete primary source for this pattern.[2] The flow looks like this:
Anthropic's core idea is that search is compression: extracting insight from a large information space.[2] Subagents help because they explore different parts of the problem in parallel, inside isolated context windows, and return only the important tokens to the lead agent. Anthropic reports that an Opus 4 lead with Sonnet 4 workers outperformed a single Opus 4 agent by 90.2% on an internal research evaluation.[2] The public research-lead prompt says the same thing operationally: the lead should coordinate, guide, and synthesize, not perform all first-hand research itself.[11]
In Effective context engineering for AI agents, Anthropic treats context as an engineering resource.[3] Transformers form pairwise relationships across tokens, so attention gets diluted as context grows. The resulting failure mode is often called context rot. Good context engineering means finding the smallest high-signal set of tokens that maximizes the chance of completing the objective.[3]
According to the Agent SDK subagents documentation, each subagent runs in a fresh session. Its intermediate tool calls and results stay inside that subagent. Only the final message returns to the parent. The only channel from parent to child is the prompt string passed to the Agent tool. The child cannot see the parent's conversation history, tool results, or system prompt. There is no shared memory.[6]
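The isolation rule can be modeled in a few lines. This is a toy sketch under stated assumptions, not the Agent SDK's code: `Session`, `run_subagent`, and `toy_model` are hypothetical names, and the "model" is a stub that issues two searches before finishing.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical model interface: takes the message history, returns the next assistant message.
Model = Callable[[list], dict]


@dataclass
class Session:
    messages: list = field(default_factory=list)


def run_subagent(model: Model, prompt: str, max_turns: int = 8) -> str:
    """Run a child in a fresh session and return only its final message."""
    # The prompt string is the only thing the child receives from the parent.
    child = Session(messages=[{"role": "user", "content": prompt}])
    reply = {"content": ""}
    for _ in range(max_turns):
        reply = model(child.messages)
        child.messages.append(reply)
        if reply.get("final"):
            break
        # Intermediate tool traffic is recorded in the child's history only.
        child.messages.append(
            {"role": "user", "content": f"tool_result for: {reply['content']}"}
        )
    return reply["content"]


def toy_model(messages: list) -> dict:
    """Stand-in model: issues two searches, then a final summary."""
    turns = sum(1 for m in messages if m["role"] == "assistant")
    if turns < 2:
        return {"role": "assistant", "content": f"search #{turns + 1}"}
    return {"role": "assistant", "content": "summary: 2 sources checked", "final": True}


parent = Session(messages=[{"role": "user", "content": "long parent conversation"}])
result = run_subagent(toy_model, "Check whether org X is still active")
parent.messages.append({"role": "user", "content": result})  # only the final message returns
```

The parent's history grows by one message regardless of how many tool calls the child made, which is the compression effect described above.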
The API layer productizes these ideas. Context editing removes stale tool calls and results near the limit. Memory tools store information outside the window. Server-side compaction summarizes earlier context as the limit approaches. Anthropic reports that memory tools plus context editing improved agent retrieval performance by 39%, and that context editing reduced token usage by 84% in a 100-turn web retrieval evaluation.[5]
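To make the idea of context editing concrete, here is a client-side illustration of the principle. It is not the API feature itself: the message shape, `clear_stale_tool_results`, and the placeholder string are assumptions chosen for the example.

```python
def clear_stale_tool_results(messages, keep_last=2, placeholder="[tool result cleared]"):
    """Return a copy of the history with all but the newest tool results replaced."""
    tool_idxs = [i for i, m in enumerate(messages) if m.get("type") == "tool_result"]
    # Everything except the newest `keep_last` tool results is considered stale.
    stale = set(tool_idxs[:-keep_last]) if keep_last > 0 else set(tool_idxs)
    return [
        {**m, "content": placeholder} if i in stale else m
        for i, m in enumerate(messages)
    ]


# Build a toy history: five steps, each followed by a large tool result.
history = []
for n in range(5):
    history.append({"type": "text", "content": f"step {n}"})
    history.append({"type": "tool_result", "content": "x" * 1000})

trimmed = clear_stale_tool_results(history, keep_last=2)
size_before = sum(len(m["content"]) for m in history)
size_after = sum(len(m["content"]) for m in trimmed)
```

The function returns a new list and leaves the original untouched, so the full trace can still be written to external storage before the trimmed version is sent to the model.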
Parallelism appears at two layers. The lead agent can launch 3-5 subagents at once, and each subagent can make 3 or more parallel tool calls in a single turn. Anthropic says this can cut research time for complex queries by up to 90%.[2]
| Metric | Meaning |
|---|---|
| 3-5 | Typical number of subagents launched by the lead at once. |
| 16 | Maximum concurrent agents in Dynamic Workflows. |
| 1000 | Maximum total agents in one Dynamic Workflow run. |
At the API layer, Claude can emit multiple tool_use blocks in the same assistant turn. Applications should return all matching tool_result blocks inside a single user message; splitting them across multiple messages can reduce future parallelism.[8] Calls emitted in the same turn do not depend on one another, so applications can run them concurrently with Promise.all or asyncio.gather.
```python
import asyncio

# Fragment: runs inside an async function of the application.
# run_tool and tool_result are the application's own helpers, defined elsewhere.
# The assistant emits multiple tool_use blocks in one turn.
# The app executes them in parallel, then returns all results in one user message.
calls = assistant_turn.tool_use_blocks
results = await asyncio.gather(*[run_tool(call) for call in calls])
messages.append({
    "role": "user",
    "content": [tool_result(c.id, r) for c, r in zip(calls, results)],
})
```
How many agents can you run?
At the API level, there is no model-imposed hard limit on subagents. Your constraints are API rate limits and infrastructure. Anthropic rate-limits organizations across requests per minute (RPM), input tokens per minute, and output tokens per minute. The limits follow token-bucket behavior, and a 429 rate-limit response is distinct from a 529 capacity error. Dynamic Workflows, introduced as a research preview alongside Claude Opus 4.8 in May 2026, adds a stricter product shape: Claude writes a JavaScript orchestration script that runs in a background runtime, with plans living in script variables instead of Claude's active context, and with results flowing back at the end.[12]
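Since the practical ceiling is rate limits rather than the model, a fan-out usually needs a concurrency cap and a retry policy. The sketch below is one common way to do that with a semaphore and exponential backoff. `RateLimited` is a hypothetical exception that your own API wrapper would raise on HTTP 429, and the default cap of 16 simply mirrors the Dynamic Workflows figure in the table above.

```python
import asyncio
import random


class RateLimited(Exception):
    """Hypothetical error raised by the caller's API wrapper on HTTP 429."""


async def with_backoff(task, retries: int = 4, base_delay: float = 0.5):
    """Retry a zero-argument coroutine function on RateLimited, with jittered backoff."""
    for attempt in range(retries + 1):
        try:
            return await task()
        except RateLimited:
            if attempt == retries:
                raise
            delay = base_delay * (2 ** attempt) * (0.5 + random.random() / 2)
            await asyncio.sleep(delay)


async def run_bounded(tasks, max_concurrent: int = 16, base_delay: float = 0.5):
    """Run tasks concurrently, never more than max_concurrent at once."""
    sem = asyncio.Semaphore(max_concurrent)

    async def guarded(task):
        async with sem:
            return await with_backoff(task, base_delay=base_delay)

    # gather preserves input order in its result list.
    return await asyncio.gather(*(guarded(t) for t in tasks))
```

A retrying task keeps its semaphore slot while it waits, so a burst of 429s naturally slows the whole batch instead of piling more requests onto the limit.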
The Claude Agent SDK, formerly the Claude Code SDK, exposes the harness that powers Claude Code: the agent loop, built-in tools, subagent spawning, and MCP integration. The loop is simple: collect context -> act -> verify -> repeat until the task is done.[4]
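The loop can be sketched without any SDK at all. The version below is a schematic under stated assumptions, not the Agent SDK's internals: `StepResult`, `run_agent_loop`, and `toy_step` are hypothetical, and one call to `step` stands for a full "collect context, act, verify" pass. It also shows where a turn ceiling and a token ceiling would sit.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class StepResult:
    tokens_used: int  # tokens spent on this pass
    verified: bool    # did the verify step accept the result?
    output: str


def run_agent_loop(
    step: Callable[[int], StepResult],
    max_turns: int = 8,
    token_budget: int = 50_000,
) -> tuple[str, str]:
    """Repeat collect -> act -> verify until verified or a ceiling is hit."""
    spent = 0
    last = ""
    for turn in range(max_turns):
        result = step(turn)
        spent += result.tokens_used
        last = result.output
        if result.verified:
            return "done", last
        if spent >= token_budget:
            return "budget_exceeded", last
    return "max_turns_reached", last


def toy_step(turn: int) -> StepResult:
    """Stand-in step: passes verification on the third pass."""
    return StepResult(tokens_used=10_000, verified=(turn == 2), output=f"draft {turn}")


status, output = run_agent_loop(toy_step)
```

Returning an explicit status instead of raising lets the caller decide whether a partial result is still useful.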
```python
from claude_agent_sdk import AgentDefinition

# Use cheaper Sonnet workers and reserve Opus for strict review.
# SUBAGENT_SYSTEM_PROMPT and ADVERSARIAL_VERIFY_PROMPT are prompt strings defined elsewhere.
agents = {
    "researcher": AgentDefinition(
        description="Do first-hand research for one focused objective",
        prompt=SUBAGENT_SYSTEM_PROMPT,
        tools=["WebSearch", "Read"],
        model="sonnet",
        max_turns=8,
    ),
    "reviewer": AgentDefinition(
        description="Adversarially verify research findings",
        prompt=ADVERSARIAL_VERIFY_PROMPT,
        model="opus",
    ),
}
# Put Agent in allowedTools so delegation is preapproved.
# Subagents cannot spawn their own subagents.
```
Subagents come from the agents parameter to query(), markdown files under .claude/agents/, or the built-in general-purpose agent. Claude invokes them through the Agent tool. One rule matters: subagents cannot spawn their own subagents.[6]
The combination of session_id and agentId allows a run to resume with full history. The worker trace lives in an independent file, so it can survive parent-session compaction.
Permission modes include default, acceptEdits, plan, bypassPermissions, and dontAsk. Evaluation proceeds through hooks, deny rules, allow rules, ask rules, permission mode, callback, and post-tool hooks. Deny rules remain highest priority even in bypass mode.
Subagent results need to survive JSON.parse. That is what Structured Outputs and strict tool calling provide. The mechanism is constrained decoding: the API compiles your schema, caches it, and constrains every generated token so the tool name and input shape match the schema.[9]
Without strict mode, a model may return 2 as "2", omit a required field, or produce an invalid enum value. That is why the example workflow uses RESEARCH_SCHEMA with additionalProperties: false and a complete required list. It is not a comment for humans. It is a contract for the decoder. The tradeoff is a first-request grammar compilation delay, and some constraints still need post-validation.
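The article does not show RESEARCH_SCHEMA itself, so the version below is a guess at its shape: the field names `status`, `source_count`, and `evidence` are hypothetical. The validator is a deliberately small, standard-library-only sketch of the post-validation step, covering just the failure modes named above (wrong type, missing field, invalid enum, extra field).

```python
import json

# Hypothetical schema; the real RESEARCH_SCHEMA is not shown in the article.
RESEARCH_SCHEMA = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["active", "renamed", "new_website", "unknown"]},
        "source_count": {"type": "integer"},
        "evidence": {"type": "string"},
    },
    "required": ["status", "source_count", "evidence"],
    "additionalProperties": False,
}

_PY_TYPES = {"string": str, "integer": int}


def validate(raw: str, schema: dict) -> dict:
    """Parse a subagent reply and check it against a flat object schema."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on malformed JSON
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    props = schema["properties"]
    missing = [k for k in schema["required"] if k not in data]
    extra = [k for k in data if k not in props]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if extra and schema.get("additionalProperties") is False:
        raise ValueError(f"unexpected fields: {extra}")
    for key, rule in props.items():
        if key not in data:
            continue
        value = data[key]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, _PY_TYPES[rule["type"]]):
            raise ValueError(f"{key}: expected {rule['type']}")
        if "enum" in rule and value not in rule["enum"]:
            raise ValueError(f"{key}: {value!r} not in {rule['enum']}")
    return data


ok = validate(
    '{"status": "renamed", "source_count": 2, "evidence": "press release"}',
    RESEARCH_SCHEMA,
)
```

With strict mode on, most of these checks should never fire. Keeping them anyway gives a clear error at the boundary when a schema changes or a non-strict path is used.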
Multi-agent systems work partly because they spend enough tokens on the problem.[2]
In Anthropic's BrowseComp attribution analysis, three factors explained 95% of performance variance, and token usage alone explained roughly 80%.[2] The other two factors were tool-call count and model choice. On the cost side, an agent can use about 4x the tokens of a normal chat, and a multi-agent system about 15x.[2]
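The multipliers translate directly into unit cost. The arithmetic below is a back-of-envelope sketch: the 4x and 15x factors come from the text, while the baseline chat size and the blended price per million tokens are hypothetical placeholders to swap for your own numbers.

```python
# Multipliers from the article; everything else here is an assumed placeholder.
MULTIPLIERS = {"chat": 1, "single_agent": 4, "multi_agent": 15}

CHAT_TOKENS = 2_000    # hypothetical tokens in a normal chat exchange
PRICE_PER_MTOK = 5.00  # hypothetical blended USD price per million tokens


def cost_usd(mode: str, chat_tokens: int = CHAT_TOKENS, price: float = PRICE_PER_MTOK) -> float:
    """Estimated cost of one task in the given mode."""
    return MULTIPLIERS[mode] * chat_tokens * price / 1_000_000


for mode in MULTIPLIERS:
    print(f"{mode:>12}: ${cost_usd(mode):.2f} per task")
```

Whatever the real price is, the ratio between the rows stays fixed, which is why the value of the task has to scale with the architecture.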
When is multi-agent worth it?
Worth it: high-value tasks that can be parallelized, exceed one context window, and require many complex tools.
Not worth it: tasks that require all agents to share the same context or depend heavily on each other. Most coding tasks fall here; the truly parallel portion is usually smaller than in research.
A useful rule of thumb, from Barry Zhang, is that a roughly 10-cent task budget gives you about 30k-50k tokens. That is workflow territory. Upgrade to agents when the task is too ambiguous for a predefined decision tree and valuable enough to amortize the token cost.
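It helps to see what blended token price that rule of thumb implies. The calculation below only rearranges the numbers in the sentence above; `implied_price_per_mtok` is a hypothetical helper name.

```python
def implied_price_per_mtok(budget_usd: float, tokens: int) -> float:
    """Blended USD price per million tokens implied by a budget and a token count."""
    return budget_usd / tokens * 1_000_000


low = implied_price_per_mtok(0.10, 50_000)   # price if 10 cents buys 50k tokens
high = implied_price_per_mtok(0.10, 30_000)  # price if 10 cents buys 30k tokens
print(f"implied blended price: ${low:.2f} to ${high:.2f} per million tokens")
```

If your actual blended price is well above that range, the same 10 cents buys fewer tokens and the workflow threshold moves accordingly.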
Anthropic's warning is direct: agents are stateful, and errors compound.[2] A small system failure can become catastrophic inside a long-running agent loop. Production work needs these guardrails:
max_turns and budget ceilings to contain runaway costs.
Anthropic frames the best prompt as a collaboration framework rather than a rigid instruction set.[2] It defines roles, problem-solving paths, and effort budgets. The most practical rules are:
Here is a concrete version of the pattern from a real Claude Code task: check a list of organizations, one by one, to confirm whether each is still active, renamed, or operating under a new website. The code is short, but it uses many of the ideas above:
```javascript
phase('Research')
const researched = (await parallel(
  funds.map((f) => async () => {
    const r = await agent(prompt(f), { schema: RESEARCH_SCHEMA })
    return r ? { id: f.id, name: f.name, ...r } : null // id/name come from code, not the model
  }),
)).filter(Boolean) // failed agents return null and are filtered out
```
parallel(funds.map(...)) gives each organization to one AI worker and runs them together instead of queueing them.
id / name are reattached by code and never requested from the model, reducing mix-ups and hallucination.
schema locks the AI answer into a fixed form that downstream code can parse.
A failed agent returns null and is filtered out, rather than taking down the whole batch.
The second half of the script sends only the organizations with detected changes to another AI for adversarial review. That is role separation in practice. The hard part is not writing this kind of workflow. The hard part is making it stable, observable, controllable, and cost-bounded in production.
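The review phase is described but not shown, so the following is only a guess at its shape, written in Python rather than the script's JavaScript. The `changed` flag, `review_changed`, and `toy_reviewer` are hypothetical. The sketch keeps the same two habits as the research phase: identifiers stay owned by code, and one failed reviewer does not sink the batch.

```python
import asyncio


async def review_changed(researched, reviewer):
    """Send only records flagged as changed to a reviewer; tolerate reviewer failures."""
    flagged = [r for r in researched if r.get("changed")]
    verdicts = await asyncio.gather(
        *(reviewer(r) for r in flagged), return_exceptions=True
    )
    # Key verdicts by the code-owned id; drop any reviewer call that raised.
    by_id = {
        r["id"]: v
        for r, v in zip(flagged, verdicts)
        if not isinstance(v, BaseException)
    }
    return [{**r, "review": by_id.get(r["id"])} for r in researched]


async def toy_reviewer(record):
    """Stand-in reviewer: fails for one record, confirms the rest."""
    if record["id"] == 3:
        raise RuntimeError("reviewer failed")
    return "confirmed"


rows = [
    {"id": 1, "changed": False},
    {"id": 2, "changed": True},
    {"id": 3, "changed": True},
]
out = asyncio.run(review_changed(rows, toy_reviewer))
```

Records that were not flagged, and records whose review failed, both end up with `review` set to `None`; a production version would likely distinguish the two so failures can be retried.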
Understanding the pattern is only the first step. From proof of concept to stable service, the real engineering work sits in token governance, context isolation, observability, and cost control. Nebutra provides end-to-end Claude agent implementation support:
Decide where a deterministic workflow is enough and where model autonomy is actually useful.
Budget tokens around the finding that token usage explains roughly 80% of performance variance; combine compression, context editing, and subagents to bring the 15x cost back into an operating range.
Use resumable execution, checkpoints, rainbow deployment, and tracing to keep compounding errors out of production.
Note: figures such as 90.2%, 15x, and 80% come from specific mid-2025 evaluations with Opus 4 and Sonnet 4. They may not generalize to every task or newer models. Dynamic Workflows and some context-management features are beta or research-preview capabilities, so availability may vary by plan and platform.