If you’re shipping production agents (or building Claude Skills that call tools, browse docs, and write code), you’ve probably had the same experience:
- The demo works.
- The workflow “usually” completes.
- And then the bill shows up.
This post is a practical engineering playbook for controlling token cost in AI agents: how to design agentic workflows so spend scales roughly linearly with the number of steps, instead of blowing up due to context growth, branching, retries, and multi-agent chatter.
This isn’t “use a cheaper model” advice. It’s workflow architecture: budgets, caps, contracts, and deterministic fallbacks.
At the end, I’ll show what these ideas look like in an agent-native workflow language (the core thesis behind nNode): treat your agent workflow like a program with constraints—not an open-ended chat.
The uncomfortable truth: agents are easy to demo and hard to afford
Most runaway LLM spend isn’t caused by one huge prompt. It’s caused by systems behavior:
- A tool fails → the agent retries → it re-sends an even bigger context
- A planner branches into 6 subtasks → each subtask drags shared history
- A verifier asks for “more detail” → context balloons again
- Multiple agents coordinate → they talk to each other more than they talk to tools
If you want predictable cost, you need two numbers:
- Expected cost per run (mean / p50)
- Worst-case cost per run (p95/p99 or hard cap)
FinOps for agents is simply: make those numbers intentional.
Why token costs blow up (plain engineering model)
Let’s define a baseline:
- You have a workflow with N steps.
- Each step uses T tokens.
If every step is independent, total tokens are roughly:
Total ≈ N × T (linear)
Now add the reality of agent workflows.
1) Context accumulation
In many frameworks, each new step includes most (or all) prior messages.
If step i includes the full history, tokens per step aren’t constant anymore. They grow.
A simplistic approximation:
- step 1 uses T
- step 2 uses ~2T
- step 3 uses ~3T
Total becomes:
Total ≈ T × (1 + 2 + … + N) = T × N(N + 1)/2 = O(N²)
And that’s before tools and retries.
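To put numbers on the two growth models, here is a minimal sketch. The function names are hypothetical, and it assumes the simplistic model above, where step i costs roughly i × T tokens.

```typescript
// Hypothetical helpers; N and T follow the definitions in the text above.

// Every step sends a fixed-size prompt of T tokens.
function independentTotal(n: number, t: number): number {
  return n * t;
}

// Step i re-sends the full history, so it costs roughly i × T tokens.
function accumulatingTotal(n: number, t: number): number {
  let total = 0;
  for (let i = 1; i <= n; i++) {
    total += i * t;
  }
  return total; // same as t × n × (n + 1) / 2
}

const demoSteps = 20;
const demoTokensPerStep = 1000;
console.log(independentTotal(demoSteps, demoTokensPerStep));  // 20000
console.log(accumulatingTotal(demoSteps, demoTokensPerStep)); // 210000
```

Same 20 steps, same step size, roughly 10× the tokens once history accumulates.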
2) Branching plans
A planner doesn’t just add steps. It can add parallel branches.
If each branch carries its own growing context, your spend is closer to:
Total ≈ sum over branches of O(N_branch²)
3) Retries and tool errors (hidden multiplier)
Retries are multiplicative:
Total ≈ Base × (1 + retry_rate × avg_retry_count)
And retries tend to happen later in runs—when context is already huge.
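A rough sketch of both effects, with illustrative numbers only. The first function is the flat multiplier from the formula above. The second assumes the full-history model from section 1, where retrying step i re-sends about i × T tokens. Both function names are hypothetical.

```typescript
// Flat multiplier: Base × (1 + retry_rate × avg_retry_count), rounded to whole tokens.
function expectedTotalWithRetries(
  baseTokens: number,
  retryRate: number,
  avgRetryCount: number,
): number {
  return Math.round(baseTokens * (1 + retryRate * avgRetryCount));
}

// Assumption: with full-history accumulation, one retry of step i costs about i × T.
function retryCostAtStep(stepIndex: number, tokensPerStep: number): number {
  return stepIndex * tokensPerStep;
}

console.log(expectedTotalWithRetries(50000, 0.2, 2)); // 70000
console.log(retryCostAtStep(2, 1000));  // 2000
console.log(retryCostAtStep(18, 1000)); // 18000
```

The flat multiplier understates the problem when failures cluster late in the run: under this model, one retry at step 18 costs as much as nine retries at step 2.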
4) Multi-agent coordination overhead
If you split work across “planner”, “executor”, “critic”, “memory”, etc., you often add:
- more messages
- more summarization
- more handoffs
Coordination can easily become the dominant cost center.
Takeaway: token blowups aren’t mysterious. They’re the predictable result of unbounded loops + growing state + multiplicative retries.
The goal: make cost a constraint, not a surprise
Before we talk patterns, define your cost contract.
Define three budgets
- Per-step budget (e.g., planner gets 1,200 tokens)
- Per-stage budget (planner vs executor vs verifier)
- Per-run budget (hard cap: the workflow cannot exceed this)
Define “budget exhaustion behavior”
What should the system do when it hits the cap?
- degrade model (e.g., Sonnet → Haiku / GPT-4.1 → 4.1-mini)
- skip verification and return “best effort”
- ask a human for a missing input
- stop and produce an audit trail + partial output
A budget without an exhaustion policy is just an error you’ll hit at 2am.
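One way to pair a cap with its exhaustion policy is to have the budget itself return the configured behavior. This is a minimal sketch, not a prescribed API: `createRunBudget` and the behavior labels are hypothetical names that mirror the list above.

```typescript
type ExhaustionBehavior = "degrade_model" | "best_effort" | "ask_human" | "stop";

// Hypothetical per-run budget: the caller asks before each model call.
function createRunBudget(maxTokens: number, onExhausted: ExhaustionBehavior) {
  let used = 0;
  return {
    // Returns "ok" and records the spend if the call fits,
    // otherwise returns the configured behavior and records nothing.
    charge(tokens: number): "ok" | ExhaustionBehavior {
      if (used + tokens > maxTokens) {
        return onExhausted;
      }
      used += tokens;
      return "ok";
    },
    remaining(): number {
      return maxTokens - used;
    },
  };
}

const demoBudget = createRunBudget(18000, "stop");
console.log(demoBudget.charge(12000)); // "ok"
console.log(demoBudget.charge(7000));  // "stop"
console.log(demoBudget.remaining());   // 6000
```

The caller never sees a bare "over budget" error. It gets back the decision that was made ahead of time.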
Instrumentation: what to log to see the cost story
If you don’t measure tokens per step, you’re guessing.
Log these fields for every model call:
- workflow_run_id
- node_id / stage
- model
- input_tokens, output_tokens, total_tokens
- cache_hit (if using prompt caching)
- retry_count and retry_reason
- state_size_bytes (or message count)
- tool_calls_count
Example: minimal TypeScript token logger
type LlmCallLog = {
  runId: string;
  nodeId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  cacheHit?: boolean;
  retry?: { attempt: number; reason: string };
  stateBytes?: number;
  toolCalls?: number;
};
export function logLlmCall(e: LlmCallLog) {
  // Replace with your real sink: OTel, Datadog, ClickHouse, etc.
  console.info(JSON.stringify({ ts: new Date().toISOString(), ...e }));
}
Once you have this, you can compute:
- cost per node
- p95 token usage per workflow type
- where context grows
- which tool causes retries
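As a sketch of the first of those computations, the function below sums tokens per node or per run. `TokenRecord` is a hypothetical, trimmed-down version of the log entry above, and the sample values are made up.

```typescript
// Hypothetical subset of the per-call log fields.
type TokenRecord = {
  runId: string;
  nodeId: string;
  inputTokens: number;
  outputTokens: number;
};

// Sum input + output tokens, grouped by run or by node.
function sumTokensBy(logs: TokenRecord[], key: "runId" | "nodeId"): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const log of logs) {
    const group = log[key];
    totals[group] = (totals[group] ?? 0) + log.inputTokens + log.outputTokens;
  }
  return totals;
}

const sampleLogs: TokenRecord[] = [
  { runId: "r1", nodeId: "plan", inputTokens: 800, outputTokens: 200 },
  { runId: "r1", nodeId: "execute", inputTokens: 3000, outputTokens: 500 },
  { runId: "r2", nodeId: "plan", inputTokens: 900, outputTokens: 100 },
];
console.log(sumTokensBy(sampleLogs, "nodeId")); // { plan: 2000, execute: 3500 }
console.log(sumTokensBy(sampleLogs, "runId"));  // { r1: 4500, r2: 1000 }
```

Multiply the per-node totals by your model's price to get cost per node. The per-run totals are the samples you need for p95 tracking.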
Design rules for linear-scaling agentic workflows
These rules are boring on purpose. Boring is predictable.
Rule 1: Bound the loop
Every “agent loop” needs at least one hard stop:
- max_iterations
- max_tool_calls
- max_total_tokens
- deadline_ms
If you don’t bound it, you built an unpriced option.
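Here is a minimal sketch of a loop that enforces all four stops. It is synchronous for brevity (a real step would be an async model call), and `runBoundedLoop`, `StepResult`, and the injected clock are hypothetical names. The token and tool-call limits are checked after each step, so the last step can overshoot a cap by one step's worth.

```typescript
type LoopLimits = {
  maxIterations: number;
  maxToolCalls: number;
  maxTotalTokens: number;
  deadlineMs: number;
};
type StepResult = { done: boolean; tokensUsed: number; toolCalls: number };
type LoopOutcome = {
  reason: "done" | "max_iterations" | "max_tool_calls" | "max_total_tokens" | "deadline";
  iterations: number;
  tokens: number;
};

function runBoundedLoop(
  step: (iteration: number) => StepResult,
  limits: LoopLimits,
  now: () => number = Date.now,
): LoopOutcome {
  const start = now();
  let tokens = 0;
  let toolCalls = 0;
  for (let i = 0; i < limits.maxIterations; i++) {
    // Check the deadline before spending anything on this iteration.
    if (now() - start >= limits.deadlineMs) {
      return { reason: "deadline", iterations: i, tokens };
    }
    const result = step(i);
    tokens += result.tokensUsed;
    toolCalls += result.toolCalls;
    const iterations = i + 1;
    if (result.done) return { reason: "done", iterations, tokens };
    if (tokens >= limits.maxTotalTokens) return { reason: "max_total_tokens", iterations, tokens };
    if (toolCalls >= limits.maxToolCalls) return { reason: "max_tool_calls", iterations, tokens };
  }
  return { reason: "max_iterations", iterations: limits.maxIterations, tokens };
}

// A step that never finishes: the loop still terminates.
const neverDone = (): StepResult => ({ done: false, tokensUsed: 1000, toolCalls: 1 });
const demoOutcome = runBoundedLoop(neverDone, {
  maxIterations: 4,
  maxToolCalls: 12,
  maxTotalTokens: 18000,
  deadlineMs: 120000,
});
console.log(demoOutcome.reason, demoOutcome.tokens); // max_iterations 4000
```

Returning the stop reason matters as much as stopping: it is what your exhaustion policy and your logs key off.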
Rule 2: Budget per stage (planner ≠ executor ≠ verifier)
A common failure mode is giving the planner unlimited room to think.
Instead:
- Planner: small budget, must output a structured plan
- Executor: moderate budget, runs tools
- Verifier: small budget, checks invariants
Rule 3: Minimize shared context
Stop passing chat history everywhere.
Prefer structured state:
- a compact plan
- a small set of facts
- tool outputs that are summarized or normalized
Rule 4: Specialize aggressively
You don’t need five generalist agents. You need:
- one orchestrator
- small specialists that behave like tools
A specialist prompt can be short and stable, which improves caching and reduces drift.
Rule 5: Retry with policy (don’t “just try again”)
Retries should change something:
- smaller payload
- different tool
- different parsing strategy
- switch to deterministic fallback
If retrying doesn’t change anything, you’re just buying the same failure twice.
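One way to make "retries should change something" structural is to express each attempt as a different strategy, ending in a deterministic fallback. This is a sketch under that assumption. `runWithFallbacks` and the strategy labels are hypothetical, and it is synchronous for brevity.

```typescript
type Attempt<T> = { label: string; run: () => T };

// Tries each strategy once, in order. There is no "same call again" path.
function runWithFallbacks<T>(
  attempts: Attempt<T>[],
): { value: T; usedLabel: string; failures: string[] } {
  const failures: string[] = [];
  for (const attempt of attempts) {
    try {
      return { value: attempt.run(), usedLabel: attempt.label, failures };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      failures.push(`${attempt.label}: ${message}`);
    }
  }
  throw new Error("all strategies failed: " + failures.join("; "));
}

// Stand-in for a tool that rejects large payloads.
const parsePayload = (payload: string): number => {
  if (payload.length > 10) throw new Error("payload too large");
  return payload.length;
};
const bigPayload = "x".repeat(50);

const retryResult = runWithFallbacks<number>([
  { label: "full_payload", run: () => parsePayload(bigPayload) },
  { label: "smaller_payload", run: () => parsePayload(bigPayload.slice(0, 10)) },
  { label: "deterministic_fallback", run: () => 0 },
]);
console.log(retryResult.usedLabel); // smaller_payload
console.log(retryResult.failures);  // [ "full_payload: payload too large" ]
```

The `failures` array doubles as the `retry_reason` data for your logs.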
Four concrete patterns (copy/paste mental models)
Pattern A: Context Reducer stage (compress → proceed)
Add an explicit node that reduces state before expensive steps.
When to use: after any tool that returns lots of text (web page, logs, PDF, codebase search).
Example: reducer prompt (bounded)
REDUCE_PROMPT = """
You are a context reducer.
Summarize the input into:
1) Key facts (bullets)
2) Open questions
3) Decisions made
4) References (URLs/IDs only)
Constraints:
- Max 250 tokens.
- Do not include raw excerpts.
"""
Then downstream nodes consume only the reduced state.
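A sketch of the gate around the reducer, assuming character counts as a crude stand-in for tokens. `reduceIfLarge` is a hypothetical name, and the reducer argument stands in for the model call that uses the prompt above. The hard slice is there because a model can ignore the "Max 250 tokens" instruction.

```typescript
// The reducer is injected; in practice it would be the model call with the reducer prompt.
type Reducer = (raw: string) => string;

function reduceIfLarge(
  raw: string,
  maxChars: number,
  reducer: Reducer,
  summaryCapChars: number,
): { text: string; reduced: boolean } {
  // Small outputs pass through untouched: no extra model call, no extra cost.
  if (raw.length <= maxChars) {
    return { text: raw, reduced: false };
  }
  const summary = reducer(raw);
  // Enforce the cap deterministically, even if the reducer overshoots.
  return { text: summary.slice(0, summaryCapChars), reduced: true };
}

// Stub reducer for illustration only; it deliberately returns too much text.
const stubReducer: Reducer = () => "key fact; ".repeat(200);

console.log(reduceIfLarge("short tool output", 1000, stubReducer, 100).reduced); // false
console.log(reduceIfLarge("x".repeat(5000), 1000, stubReducer, 100).text.length); // 100
```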
Pattern B: Tool Gateway (one narrow interface to many tools)
Instead of giving an agent 20 tools, give it one gateway tool with a constrained schema.
Why it helps cost:
- fewer tool descriptions in prompt
- fewer irrelevant tool choices
- less back-and-forth “which tool should I use?”
Example: gateway schema
{
  "type": "object",
  "properties": {
    "action": {
      "type": "string",
      "enum": ["search", "fetch", "db_query", "send_approval"]
    },
    "query": { "type": "string" },
    "url": { "type": "string" },
    "parameters": { "type": "object" }
  },
  "required": ["action"]
}
The orchestrator decides what to do; the gateway enforces how to do it.
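The enforcement side might look like the sketch below. The action names come from the schema above. The handler map and the per-action field checks (search needs `query`, fetch needs `url`) are assumptions added for illustration, since the schema itself only requires `action`.

```typescript
type GatewayAction = "search" | "fetch" | "db_query" | "send_approval";
type GatewayRequest = {
  action: GatewayAction;
  query?: string;
  url?: string;
  parameters?: Record<string, unknown>;
};
type GatewayResult = { ok: true; output: string } | { ok: false; error: string };
type GatewayHandlers = Record<GatewayAction, (req: GatewayRequest) => string>;

const ALLOWED_ACTIONS: readonly string[] = ["search", "fetch", "db_query", "send_approval"];

// Validates model-produced input before any real tool runs.
function handleGatewayCall(input: unknown, handlers: GatewayHandlers): GatewayResult {
  if (typeof input !== "object" || input === null) {
    return { ok: false, error: "request must be an object" };
  }
  const req = input as Partial<GatewayRequest>;
  if (typeof req.action !== "string" || ALLOWED_ACTIONS.indexOf(req.action) === -1) {
    return { ok: false, error: "unknown action" };
  }
  // Assumed per-action requirements, not part of the schema above.
  if (req.action === "search" && typeof req.query !== "string") {
    return { ok: false, error: "search requires query" };
  }
  if (req.action === "fetch" && typeof req.url !== "string") {
    return { ok: false, error: "fetch requires url" };
  }
  return { ok: true, output: handlers[req.action](req as GatewayRequest) };
}

// Stub handlers for illustration only.
const demoHandlers: GatewayHandlers = {
  search: (req) => `results for ${req.query}`,
  fetch: (req) => `body of ${req.url}`,
  db_query: () => "rows",
  send_approval: () => "queued",
};

console.log(handleGatewayCall({ action: "search", query: "rate limits" }, demoHandlers));
// { ok: true, output: "results for rate limits" }
console.log(handleGatewayCall({ action: "delete_everything" }, demoHandlers));
// { ok: false, error: "unknown action" }
```

A rejected call returns a short, fixed error string, so a bad tool choice costs a few tokens instead of a stack trace in context.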
Pattern C: Two-pass execution (cheap draft → expensive verify)
Make the default path cheap.
- Draft output using a smaller model / smaller context
- Verify only if needed
Example: produce a draft answer, then run a verifier that only checks:
- schema validity
- missing required fields
- presence of citations/IDs (if your app needs them)
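One reading of "verify only if needed" is to run deterministic checks on the draft first and call the expensive verifier only when they fail. The sketch below assumes that interpretation. `cheapChecks`, `twoPass`, and the `citations` field name are hypothetical.

```typescript
type Draft = Record<string, unknown>;

// Deterministic checks: no model call, effectively free.
function cheapChecks(draft: Draft, requiredFields: string[], needsCitations: boolean): string[] {
  const problems: string[] = [];
  for (const field of requiredFields) {
    const value = draft[field];
    if (value === undefined || value === null || value === "") {
      problems.push(`missing field: ${field}`);
    }
  }
  if (needsCitations) {
    const citations = draft["citations"];
    if (!Array.isArray(citations) || citations.length === 0) {
      problems.push("no citations");
    }
  }
  return problems;
}

// Escalate to the expensive verifier only when the cheap checks find problems.
function twoPass(
  draft: Draft,
  requiredFields: string[],
  needsCitations: boolean,
  expensiveVerify: (draft: Draft, problems: string[]) => Draft,
): { output: Draft; escalated: boolean } {
  const problems = cheapChecks(draft, requiredFields, needsCitations);
  if (problems.length === 0) {
    return { output: draft, escalated: false };
  }
  return { output: expensiveVerify(draft, problems), escalated: true };
}

// Stub verifier for illustration; a real one would be a model call.
const stubVerifier = (draft: Draft, _problems: string[]): Draft => ({ ...draft, citations: ["DOC-1"] });

console.log(twoPass({ answer: "42", citations: ["DOC-7"] }, ["answer"], true, stubVerifier).escalated); // false
console.log(twoPass({ answer: "42" }, ["answer"], true, stubVerifier).escalated); // true
```

Passing the list of problems to the verifier keeps its prompt narrow: it fixes the named gaps instead of re-reviewing the whole draft.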
Pattern D: Budgeted planner (plan must fit a token envelope)
Force the planner to output a plan that is itself budget-aware.
Example: plan contract
type Plan = {
  objective: string;
  steps: Array<{
    id: string;
    description: string;
    maxModelCalls: number;
    maxToolCalls: number;
    maxTokens: number;
    fallback: "ask_human" | "best_effort" | "stop";
  }>;
};
A planner that can’t plan within constraints is not planning. It’s journaling.
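A contract is only useful if something rejects plans that break it. Below is a minimal sketch of that check. It assumes `maxTokens` is the total allowance for a step (not per model call), and `checkPlanAgainstEnvelope` and `RunEnvelope` are hypothetical names.

```typescript
// Hypothetical subset of the plan step fields needed for the budget check.
type BudgetedStep = {
  id: string;
  maxModelCalls: number;
  maxToolCalls: number;
  maxTokens: number;
};
type RunEnvelope = { runMaxTokens: number; runMaxToolCalls: number };

// Returns a list of violations; an empty list means the plan fits.
function checkPlanAgainstEnvelope(steps: BudgetedStep[], envelope: RunEnvelope): string[] {
  const violations: string[] = [];
  let tokens = 0;
  let toolCalls = 0;
  for (const step of steps) {
    if (step.maxTokens <= 0 || step.maxModelCalls <= 0) {
      violations.push(`step ${step.id} has no usable budget`);
    }
    tokens += step.maxTokens;
    toolCalls += step.maxToolCalls;
  }
  if (tokens > envelope.runMaxTokens) {
    violations.push(`plan needs ${tokens} tokens, cap is ${envelope.runMaxTokens}`);
  }
  if (toolCalls > envelope.runMaxToolCalls) {
    violations.push(`plan needs ${toolCalls} tool calls, cap is ${envelope.runMaxToolCalls}`);
  }
  return violations;
}

const demoPlanSteps: BudgetedStep[] = [
  { id: "plan", maxModelCalls: 1, maxToolCalls: 0, maxTokens: 1200 },
  { id: "execute", maxModelCalls: 4, maxToolCalls: 12, maxTokens: 4500 },
  { id: "verify", maxModelCalls: 1, maxToolCalls: 0, maxTokens: 500 },
];
console.log(checkPlanAgainstEnvelope(demoPlanSteps, { runMaxTokens: 18000, runMaxToolCalls: 30 }));
// []
console.log(checkPlanAgainstEnvelope(demoPlanSteps, { runMaxTokens: 5000, runMaxToolCalls: 30 }));
// [ "plan needs 6200 tokens, cap is 5000" ]
```

Run this before executing anything, and send the violations back to the planner as its single allowed retry.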
Prompt caching without cargo-culting
Caching can be huge for agent cost control, but only if you structure prompts correctly.
What belongs in a cached prefix
- stable instructions
- stable schemas
- stable examples
What does NOT belong in a cached prefix
- run-specific state
- user inputs
- long tool outputs
- anything that changes every run
Practical trick: split prompt into prefix + state
const PREFIX = `You are an execution agent.
Follow the schema exactly.
Never include internal reasoning.
... stable stuff ...
`;
function buildPrompt(state: unknown) {
  return PREFIX + "\n\nSTATE (json):\n" + JSON.stringify(state);
}
If your prefix is stable, you can get a high cache hit rate and shrink marginal cost.
What this looks like in an agent-native workflow language (nNode-style)
Most teams try to control cost inside the prompt.
nNode’s thesis is different: control cost at the workflow level.
- workflows are programs
- nodes have explicit budgets
- state is typed/contracted
- retries are policy-driven
Below is a simplified pseudo-workflow (YAML-ish) to make the point.
workflow: build_linear_cost_agent
version: 1
budgets:
  run_max_tokens: 18000
  run_max_tool_calls: 30
  run_deadline_ms: 120000
state_contract:
  # pass structured state, not chat logs
  type: object
  required: ["objective", "inputs", "facts", "artifacts"]
nodes:
  - id: plan
    model: claude-sonnet
    max_tokens: 1200
    max_retries: 1
    retry_policy:
      on: ["timeout", "invalid_json"]
      backoff_ms: 300
    output_schema: Plan
  - id: execute
    model: claude-sonnet
    max_tokens: 4500
    max_tool_calls: 12
    tools: [tool_gateway]
    loop:
      max_iterations: 4
      stop_when: "objective_met"
  - id: reduce_context
    model: claude-haiku
    max_tokens: 300
    when: "state_size_bytes > 60000"
  - id: verify
    model: claude-haiku
    max_tokens: 500
    checks:
      - "output_matches_schema"
      - "no_missing_required_fields"
budget_exhaustion:
  behavior: "stop"
  emit:
    - audit_log
    - partial_output
    - next_action: "ask_human"
audit:
  log_tokens: true
  log_cache_hits: true
  log_tool_errors: true
The point isn’t YAML. The point is treating cost as a first-class constraint.
When an LLM can author and maintain that workflow directly (instead of you translating from a human-facing UI), iteration gets cheaper too. That’s a big part of why we’re building nNode as a workflow language that coding agents can write.
If you sell this as a service: quoting and guardrails
Agencies and internal platform teams need to quote work. “It usually costs $X” isn’t a quote.
Here’s a simple quoting framework:
1) Define the unit of work
Examples:
- “Summarize one ticket + propose fix PR”
- “Reconcile one invoice batch”
- “Enrich one lead + draft one email”
2) Measure token distribution
Run 50–200 samples and compute:
- p50 total tokens
- p95 total tokens
- p99 total tokens
Quote with p95 (or a capped plan).
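A sketch of the percentile math, using the nearest-rank method. The sample totals are made up, and the single blended price per million tokens is a simplifying assumption, since real pricing usually differs for input and output tokens.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("need at least one sample");
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  const clamped = Math.min(sorted.length, Math.max(1, rank));
  return sorted[clamped - 1];
}

// Assumption: one blended USD price per million tokens.
function quoteUsd(tokens: number, usdPerMillionTokens: number): number {
  return (tokens / 1e6) * usdPerMillionTokens;
}

// Made-up per-run totals; in practice use 50 to 200 measured runs.
const runTotals = [12000, 9000, 15000, 11000, 30000, 10000, 13000, 9500, 14000, 12500];
console.log(percentile(runTotals, 50)); // 12000
console.log(percentile(runTotals, 95)); // 30000
console.log(quoteUsd(percentile(runTotals, 95), 5)); // quote basis per unit of work, at a hypothetical $5 per million tokens
```

Note how one outlier run sets the p95 with only ten samples. That is the argument for the larger sample size, and for a hard cap that bounds the outlier in the first place.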
3) Add explicit guardrails to the SOW
- maximum tool calls per run
- maximum retries
- “budget exhaustion” behavior
- what triggers human handoff
4) Build “turn it off” conditions
Your agent should be allowed to stop.
Examples:
- tool returns auth error
- required customer data missing
- state size exceeds safe limit
Stopping with a clean audit trail is a feature, not a failure.
Checklist: linear cost by design (print this)
Use this to audit an existing agent workflow.
Budgets & bounds
- Hard cap: max tokens per run
- Hard cap: max tool calls per run
- Max iterations for each loop
- Timeouts per node
- Budget exhaustion behavior is defined
State discipline
- Structured state contract exists
- Chat history is not blindly appended
- Large tool outputs are reduced/summarized
- Only required fields are passed between nodes
Retry discipline
- Retries are policy-based (not “try again”)
- Retry changes inputs or strategy
- Tool errors are categorized and logged
Caching discipline
- Stable prefix separated from run state
- Cache hit rate is measured
- Prefix is not polluted with dynamic data
Observability
- Tokens logged per node
- p95/p99 cost tracked
- State growth tracked
- Tool latency + error rate tracked
Closing: build agents like you’d build production software
Cost blowups aren’t an “LLM problem.” They’re a systems design problem:
- unbounded loops
- uncontrolled state
- retries without policy
- no budgets
If you treat workflows like programs—with constraints, contracts, and audit—you can make agent spend predictable.
If you’re building Claude Skills or any tool-using agent workflow and you want to encode these constraints directly into the workflow definition (so the agent can author and maintain it, not just chat around it), that’s exactly what we’re working toward with nNode.
If this playbook matches your pain (runaway spend, unpredictable margins, agent workflows that feel like “magic” until they hit production), take a look at nnode.ai and see how we’re approaching agent-native workflow authoring and cost-aware orchestration.