Token FinOps for AI Agents: How to Design Workflows That Scale Linearly (Not Exponentially)

If you’re shipping production agents (or building Claude Skills that call tools, browse docs, and write code), you’ve probably had the same experience:

The demo works.
The workflow “usually” completes.
And then the bill shows up.

This post is a practical engineering playbook for controlling token cost in AI agents: how to design agentic workflows so spend scales roughly with the number of steps (linear), instead of blowing up due to context growth, branching, retries, and multi-agent chatter.

This isn’t “use a cheaper model” advice. It’s workflow architecture: budgets, caps, contracts, and deterministic fallbacks.

At the end, I’ll show what these ideas look like in an agent-native workflow language (the core thesis behind nNode): treat your agent workflow like a program with constraints—not an open-ended chat.

The uncomfortable truth: agents are easy to demo and hard to afford

Most runaway LLM spend isn’t caused by one huge prompt. It’s caused by systems behavior:

A tool fails → the agent retries → it re-sends an even bigger context
A planner branches into 6 subtasks → each subtask drags shared history
A verifier asks for “more detail” → context balloons again
Multiple agents coordinate → they talk to each other more than they talk to tools

If you want predictable cost, you need two numbers:

Expected cost per run (mean / p50)
Worst-case cost per run (p95/p99 or hard cap)

FinOps for agents is simply: make those numbers intentional.

Why token costs blow up (plain engineering model)

Let’s define a baseline:

You have a workflow with N steps.
Each step uses T tokens.

If every step is independent, total tokens are roughly:

Total ≈ N × T (linear)

Now add the reality of agent workflows.

1) Context accumulation

In many frameworks, each new step includes most (or all) prior messages.

If step i includes the full history, tokens per step aren’t constant anymore. They grow.

A simplistic approximation:

step 1 uses T
step 2 uses ~2T
step 3 uses ~3T

Total becomes:

Total ≈ T × (1 + 2 + … + N) = T × N(N + 1)/2, which is O(N²)

And that’s before tools and retries.
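To make the gap concrete, here is a small sketch of the two totals. It assumes every step adds exactly T tokens and that an accumulating step re-sends the whole history; the function names and the 20-step example are illustrative only.

```typescript
// Total tokens when every step is independent: N × T.
function linearTotal(steps: number, tokensPerStep: number): number {
  return steps * tokensPerStep;
}

// Total tokens when step i re-sends all earlier steps plus its own T:
// T × (1 + 2 + … + N) = T × N × (N + 1) / 2.
function accumulatingTotal(steps: number, tokensPerStep: number): number {
  return (tokensPerStep * steps * (steps + 1)) / 2;
}

// Assumed numbers for illustration: 20 steps at 1,000 tokens each.
const linear = linearTotal(20, 1000); // 20,000
const accumulating = accumulatingTotal(20, 1000); // 210,000
console.log({ linear, accumulating, ratio: accumulating / linear }); // ratio is 10.5
```

Under these assumptions a 20-step run costs about ten times what the linear model predicts, and the ratio keeps growing with N.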

2) Branching plans

A planner doesn’t just add steps. It can add parallel branches.

If each branch carries its own growing context, your spend is closer to:

Total ≈ sum over branches of O(N_branch²)

3) Retries and tool errors (hidden multiplier)

Retries are multiplicative:

Total ≈ Base × (1 + retry_rate × avg_retry_count)

And retries tend to happen later in runs—when context is already huge.
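The multiplier above is easy to put into code. This sketch applies it to a single call and prices every retry at the same cost as the original attempt, which is an assumption; since retries tend to land late in a run, treat the result as a floor rather than an estimate. The function name and sample numbers are hypothetical.

```typescript
// Expected tokens for a call once retries are included:
// base × (1 + retryRate × avgRetryCount).
// retryRate: fraction of calls that retry at all (0..1).
// avgRetryCount: average number of extra attempts when a retry happens.
function expectedTokensWithRetries(
  baseTokens: number,
  retryRate: number,
  avgRetryCount: number,
): number {
  if (retryRate < 0 || retryRate > 1) throw new Error("retryRate must be in [0, 1]");
  return baseTokens * (1 + retryRate * avgRetryCount);
}

// Assumed numbers: a 40,000-token call, 25% retry rate, 2 retries on average.
const expectedTokens = expectedTokensWithRetries(40000, 0.25, 2); // 60,000
console.log(expectedTokens);
```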

4) Multi-agent coordination overhead

If you split work across “planner”, “executor”, “critic”, “memory”, etc., you often add:

more messages
more summarization
more handoffs

Coordination can easily become the dominant cost center.

Takeaway: token blowups aren’t mysterious. They’re the predictable result of unbounded loops + growing state + multiplicative retries.

The goal: make cost a constraint, not a surprise

Before we talk patterns, define your cost contract.

Define three budgets

Per-step budget (e.g., planner gets 1,200 tokens)
Per-stage budget (planner vs executor vs verifier)
Per-run budget (hard cap: the workflow cannot exceed this)

Define “budget exhaustion behavior”

What should the system do when it hits the cap?

degrade model (e.g., Sonnet → Haiku / GPT-4.1 → 4.1-mini)
skip verification and return “best effort”
ask a human for a missing input
stop and produce an audit trail + partial output

A budget without an exhaustion policy is just an error you’ll hit at 2am.
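One way to keep the budget and its exhaustion policy together is a small tracker that callers must consult before each model call. This is a sketch, not a library API: `RunBudget`, `reserve`, and the action names are made up for illustration, and it assumes you can estimate a call's tokens up front (for example, counted input plus the output cap).

```typescript
type ExhaustionAction = "degrade_model" | "skip_verification" | "ask_human" | "stop";

type ReserveResult =
  | { ok: true; remaining: number }
  | { ok: false; action: ExhaustionAction; remaining: number };

class RunBudget {
  private used = 0;
  private readonly maxTokens: number;
  private readonly onExhausted: ExhaustionAction;

  constructor(maxTokens: number, onExhausted: ExhaustionAction) {
    this.maxTokens = maxTokens;
    this.onExhausted = onExhausted;
  }

  // Reserve tokens before a call. If the call would cross the cap, nothing is
  // charged and the configured exhaustion action is returned instead.
  reserve(tokens: number): ReserveResult {
    if (this.used + tokens > this.maxTokens) {
      return { ok: false, action: this.onExhausted, remaining: this.maxTokens - this.used };
    }
    this.used += tokens;
    return { ok: true, remaining: this.maxTokens - this.used };
  }
}

// Assumed numbers: an 18,000-token run that stops when the cap is reached.
const budget = new RunBudget(18000, "stop");
const plannerCall = budget.reserve(1200); // ok, 16,800 left
const oversizedCall = budget.reserve(17000); // refused, action "stop"
console.log(plannerCall, oversizedCall);
```

Reserving before the call, rather than charging after it, is what makes the cap hard: the run cannot overshoot by one large request.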

Instrumentation: what to log to see the cost story

If you don’t measure tokens per step, you’re guessing.

Log these fields for every model call:

workflow_run_id
node_id / stage
model
input_tokens, output_tokens, total_tokens
cache_hit (if using prompt caching)
retry_count and retry_reason
state_size_bytes (or message count)
tool_calls_count

Example: minimal TypeScript token logger

type LlmCallLog = {
  runId: string;
  nodeId: string;
  model: string;
  inputTokens: number;
  outputTokens: number;
  cacheHit?: boolean;
  retry?: { attempt: number; reason: string };
  stateBytes?: number;
  toolCalls?: number;
};

export function logLlmCall(e: LlmCallLog) {
  // Replace with your real sink: OTel, Datadog, ClickHouse, etc.
  console.info(JSON.stringify({ ts: new Date().toISOString(), ...e }));
}

Once you have this, you can compute:

cost per node
p95 token usage per workflow type
where context grows
which tool causes retries
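As a rough sketch of the first and last items, the aggregation can be a couple of plain reducers over the logged records. `CallRecord` here is a trimmed-down, illustrative version of the log type above, and the sample data is invented.

```typescript
// Subset of the per-call log fields; the type name is illustrative.
type CallRecord = {
  nodeId: string;
  inputTokens: number;
  outputTokens: number;
  retry?: { attempt: number; reason: string };
};

// Sum input + output tokens for each node.
function tokensByNode(records: CallRecord[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const r of records) {
    totals[r.nodeId] = (totals[r.nodeId] ?? 0) + r.inputTokens + r.outputTokens;
  }
  return totals;
}

// Count retried calls per reason, to see what is actually failing.
function retriesByReason(records: CallRecord[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const r of records) {
    if (r.retry) counts[r.retry.reason] = (counts[r.retry.reason] ?? 0) + 1;
  }
  return counts;
}

// Invented sample data for one run.
const sampleRecords: CallRecord[] = [
  { nodeId: "plan", inputTokens: 800, outputTokens: 200 },
  { nodeId: "execute", inputTokens: 3000, outputTokens: 500 },
  { nodeId: "execute", inputTokens: 3600, outputTokens: 400, retry: { attempt: 1, reason: "invalid_json" } },
];
console.log(tokensByNode(sampleRecords), retriesByReason(sampleRecords));
```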

Design rules for linear-scaling agentic workflows

These rules are boring on purpose. Boring is predictable.

Rule 1: Bound the loop

Every “agent loop” needs at least one hard stop:

max_iterations
max_tool_calls
max_total_tokens
deadline_ms

If you don’t bound it, you built an unpriced option.
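A bounded loop can be sketched in a few lines. This version is synchronous to keep it short (a real step would be an async model or tool call), and `runBoundedLoop`, `StepResult`, and the injectable clock are assumptions made for the example. It checks three of the stops listed above; a tool-call cap would follow the same shape.

```typescript
type LoopLimits = { maxIterations: number; maxTotalTokens: number; deadlineMs: number };
type StepResult = { done: boolean; tokensUsed: number };
type StopReason = "done" | "max_iterations" | "max_total_tokens" | "deadline";
type LoopOutcome = { reason: StopReason; iterations: number; tokens: number };

function runBoundedLoop(
  step: (iteration: number) => StepResult,
  limits: LoopLimits,
  now: () => number = () => Date.now(), // injectable clock, handy for tests
): LoopOutcome {
  const startedAt = now();
  let tokens = 0;
  for (let i = 0; i < limits.maxIterations; i++) {
    // Check the deadline before paying for another step.
    if (now() - startedAt >= limits.deadlineMs) {
      return { reason: "deadline", iterations: i, tokens };
    }
    const result = step(i);
    tokens += result.tokensUsed;
    if (result.done) return { reason: "done", iterations: i + 1, tokens };
    if (tokens >= limits.maxTotalTokens) {
      return { reason: "max_total_tokens", iterations: i + 1, tokens };
    }
  }
  return { reason: "max_iterations", iterations: limits.maxIterations, tokens };
}

// A step that never finishes still stops after 4 iterations.
const outcome = runBoundedLoop(
  () => ({ done: false, tokensUsed: 1000 }),
  { maxIterations: 4, maxTotalTokens: 10000, deadlineMs: 60000 },
);
console.log(outcome); // reason: "max_iterations", iterations: 4, tokens: 4000
```

Returning the stop reason matters as much as stopping: it is what lets the exhaustion policy and the audit trail say why the run ended.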

Rule 2: Budget per stage (planner ≠ executor ≠ verifier)

A common failure mode is giving the planner unlimited room to think.

Instead:

Planner: small budget, must output a structured plan
Executor: moderate budget, runs tools
Verifier: small budget, checks invariants

Rule 3: Minimize shared context

Stop passing chat history everywhere.

Prefer structured state:

a compact plan
a small set of facts
tool outputs that are summarized or normalized
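In code, "structured state" can be as simple as one typed object plus a projection per node, so each node receives only the fields it needs. The shape below is an assumption for illustration (`RunState`, `VerifierInput`, and the field names are not from any framework).

```typescript
// Illustrative run state: a compact plan, a small set of facts,
// and tool outputs that have already been summarized.
type RunState = {
  objective: string;
  plan: string[];
  facts: string[];
  toolSummaries: Record<string, string>;
};

// What the verifier is allowed to see: no plan, no tool output.
type VerifierInput = Pick<RunState, "objective" | "facts">;

function toVerifierInput(state: RunState): VerifierInput {
  return { objective: state.objective, facts: state.facts };
}

// Serialized length in characters, used here as a rough proxy for prompt size.
function serializedSize(value: unknown): number {
  return JSON.stringify(value).length;
}

const exampleState: RunState = {
  objective: "Summarize ticket and propose a fix",
  plan: ["read ticket", "search codebase", "draft patch"],
  facts: ["error occurs on login", "regression since last release"],
  toolSummaries: { search: "3 matching files; auth middleware is the likely cause" },
};

const verifierInput = toVerifierInput(exampleState);
console.log(serializedSize(exampleState), serializedSize(verifierInput));
```

The projection is deliberately explicit: adding a field to `RunState` does not silently add it to every downstream prompt.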

Rule 4: Specialize aggressively

You don’t need five generalist agents. You need:

one orchestrator
small specialists that behave like tools

A specialist prompt can be short and stable, which improves caching and reduces drift.

Rule 5: Retry with policy (don’t “just try again”)

Retries should change something:

smaller payload
different tool
different parsing strategy
switch to deterministic fallback

If retrying doesn’t change anything, you’re just buying the same failure twice.
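Here is one way to express "retries should change something": instead of a retry count, pass an ordered list of strategies, each different from the last, ending in a deterministic fallback. The helper name and the JSON-parsing example are hypothetical, and the sketch is synchronous for brevity.

```typescript
type Strategy<T> = { label: string; run: () => T };

// Try each strategy once, in order. Every attempt is a different approach,
// so a failure is never simply repeated.
function retryWithPolicy<T>(strategies: Strategy<T>[]): { value: T; used: string; failures: string[] } {
  const failures: string[] = [];
  for (const strategy of strategies) {
    try {
      return { value: strategy.run(), used: strategy.label, failures };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      failures.push(`${strategy.label}: ${message}`);
    }
  }
  throw new Error(`all strategies failed: ${failures.join("; ")}`);
}

// Example: a model wrapped its JSON in prose. Re-asking the model would cost
// tokens; changing the parsing strategy costs nothing.
const modelOutput = 'Here is the plan: {"steps": 2} Hope that helps.';
const parsed = retryWithPolicy<unknown>([
  { label: "parse_raw", run: () => JSON.parse(modelOutput) },
  {
    label: "extract_braces",
    run: () => JSON.parse(modelOutput.slice(modelOutput.indexOf("{"), modelOutput.lastIndexOf("}") + 1)),
  },
  { label: "deterministic_fallback", run: () => ({ steps: 0 }) },
]);
console.log(parsed.used, parsed.failures.length);
```

The `failures` list doubles as the `retry_reason` data for your logs.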

Four concrete patterns (copy/paste mental models)

Pattern A: Context Reducer stage (compress → proceed)

Add an explicit node that reduces state before expensive steps.

When to use: after any tool that returns lots of text (web page, logs, PDF, codebase search).

Example: reducer prompt (bounded)

REDUCE_PROMPT = """
You are a context reducer.
Summarize the input into:
1) Key facts (bullets)
2) Open questions
3) Decisions made
4) References (URLs/IDs only)
Constraints:
- Max 250 tokens.
- Do not include raw excerpts.
"""

Then downstream nodes consume only the reduced state.
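The reducer stage needs a trigger and a guard around it. The sketch below assumes a character-count threshold as a stand-in for a token count, and takes the reducer as a function argument; in practice that argument would be a model call using a prompt like the one above. All names are illustrative.

```typescript
type ReduceOutcome = { text: string; reduced: boolean };

// Run the reducer only when the text is over the threshold, and never let the
// reduced text exceed the cap it was meant to enforce.
function reduceIfLarge(
  text: string,
  maxChars: number,
  reduce: (input: string) => string, // stand-in for the model-backed reducer
): ReduceOutcome {
  if (text.length <= maxChars) return { text, reduced: false };
  const summary = reduce(text);
  const bounded = summary.length > maxChars ? summary.slice(0, maxChars) : summary;
  return { text: bounded, reduced: true };
}

// Stub reducer for the example: keeps the first sentence only.
const firstSentence = (input: string): string => input.split(". ")[0] + ".";

const longToolOutput = "Build failed in auth module. " + "Stack frame detail. ".repeat(50);
const reducedOutput = reduceIfLarge(longToolOutput, 200, firstSentence);
console.log(reducedOutput);
```

A prompt-level limit such as "Max 250 tokens" is a request, not a guarantee, which is why the hard truncation sits outside the model call.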

Pattern B: Tool Gateway (one narrow interface to many tools)

Instead of giving an agent 20 tools, give it one gateway tool with a constrained schema.

Why it helps cost:

fewer tool descriptions in prompt
fewer irrelevant tool choices
less back-and-forth “which tool should I use?”

Example: gateway schema

{
  "type": "object",
  "properties": {
    "action": {
      "type": "string",
      "enum": ["search", "fetch", "db_query", "send_approval"]
    },
    "query": { "type": "string" },
    "url": { "type": "string" },
    "parameters": { "type": "object" }
  },
  "required": ["action"]
}

The orchestrator decides what to do; the gateway enforces how to do it.
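On the gateway side, enforcement starts with rejecting anything that does not match the schema before a real tool is touched. This is a hand-rolled sketch rather than a JSON Schema validator; the per-action rules (`fetch` needs `url`, `search` needs `query`) are assumptions that go beyond the schema above.

```typescript
type GatewayAction = "search" | "fetch" | "db_query" | "send_approval";

type GatewayRequest = {
  action: GatewayAction;
  query?: string;
  url?: string;
  parameters?: Record<string, unknown>;
};

const GATEWAY_ACTIONS: readonly string[] = ["search", "fetch", "db_query", "send_approval"];

// Validate an untrusted tool call and return it as a typed request.
function parseGatewayRequest(input: unknown): GatewayRequest {
  if (typeof input !== "object" || input === null) {
    throw new Error("request must be an object");
  }
  const raw = input as Record<string, unknown>;
  const action = raw["action"];
  if (typeof action !== "string" || GATEWAY_ACTIONS.indexOf(action) === -1) {
    throw new Error(`unknown action: ${String(action)}`);
  }
  // Assumed per-action rules; the schema itself only requires "action".
  if (action === "fetch" && typeof raw["url"] !== "string") {
    throw new Error("fetch requires a url");
  }
  if (action === "search" && typeof raw["query"] !== "string") {
    throw new Error("search requires a query");
  }
  return input as GatewayRequest;
}

const validRequest = parseGatewayRequest({ action: "search", query: "token budgets" });
console.log(validRequest.action);
```

Failing fast here is also a cost control: a malformed call is rejected deterministically instead of triggering a tool error and a model-driven retry.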

Pattern C: Two-pass execution (cheap draft → expensive verify)

Make the default path cheap.

Draft output using a smaller model / smaller context
Verify only if needed

Example: produce a draft answer, then run a verifier that only checks:

schema validity
missing required fields
presence of citations/IDs (if your app needs them)
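One reading of this pattern is: draft cheaply, run deterministic checks, and pay for the expensive path only when a check fails. The sketch below follows that reading. `findProblems`, `twoPass`, and the `citations` field are illustrative assumptions, and both the draft and the escalation are passed in as plain functions standing in for model calls.

```typescript
type Draft = Record<string, unknown>;

// Deterministic checks: required fields present, citations present if needed.
function findProblems(draft: Draft, requiredFields: string[], needsCitations: boolean): string[] {
  const problems: string[] = [];
  for (const field of requiredFields) {
    const value = draft[field];
    if (value === undefined || value === null || value === "") {
      problems.push(`missing field: ${field}`);
    }
  }
  if (needsCitations) {
    const citations = draft["citations"];
    if (!Array.isArray(citations) || citations.length === 0) problems.push("no citations");
  }
  return problems;
}

// Cheap path first; escalate only when the checks find something.
function twoPass(
  cheapDraft: () => Draft,
  escalate: (problems: string[]) => Draft,
  requiredFields: string[],
  needsCitations: boolean,
): { output: Draft; escalated: boolean; problems: string[] } {
  const draft = cheapDraft();
  const problems = findProblems(draft, requiredFields, needsCitations);
  if (problems.length === 0) return { output: draft, escalated: false, problems };
  return { output: escalate(problems), escalated: true, problems };
}

const passResult = twoPass(
  () => ({ summary: "Login fails after release", citations: ["TICKET-1"] }),
  () => ({ summary: "escalated" }),
  ["summary"],
  true,
);
console.log(passResult.escalated); // false: the cheap draft passed every check
```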

Pattern D: Budgeted planner (plan must fit a token envelope)

Force the planner to output a plan that is itself budget-aware.

Example: plan contract

type Plan = {
  objective: string;
  steps: Array<{
    id: string;
    description: string;
    maxModelCalls: number;
    maxToolCalls: number;
    maxTokens: number;
    fallback: "ask_human" | "best_effort" | "stop";
  }>;
};

A planner that can’t plan within constraints is not planning. It’s journaling.
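A plan contract is only useful if something checks it. A minimal sketch of that check, assuming the run-level caps are known when the plan arrives: sum the per-step limits and reject the plan if they exceed the envelope. `BudgetedStep` repeats only the numeric fields of the contract above, and `checkPlanFits` is a made-up name.

```typescript
// Numeric fields of a plan step, enough to check the envelope.
type BudgetedStep = { id: string; maxModelCalls: number; maxToolCalls: number; maxTokens: number };

// Return a list of violations; an empty list means the plan fits.
function checkPlanFits(steps: BudgetedStep[], runMaxTokens: number, runMaxToolCalls: number): string[] {
  const violations: string[] = [];
  const plannedTokens = steps.reduce((sum, s) => sum + s.maxTokens, 0);
  const plannedToolCalls = steps.reduce((sum, s) => sum + s.maxToolCalls, 0);
  if (plannedTokens > runMaxTokens) {
    violations.push(`plan needs ${plannedTokens} tokens, run cap is ${runMaxTokens}`);
  }
  if (plannedToolCalls > runMaxToolCalls) {
    violations.push(`plan needs ${plannedToolCalls} tool calls, run cap is ${runMaxToolCalls}`);
  }
  for (const s of steps) {
    if (s.maxTokens <= 0 || s.maxModelCalls <= 0) violations.push(`step ${s.id} has no usable budget`);
  }
  return violations;
}

// Assumed numbers for illustration.
const candidatePlan: BudgetedStep[] = [
  { id: "gather", maxModelCalls: 2, maxToolCalls: 6, maxTokens: 4500 },
  { id: "draft", maxModelCalls: 1, maxToolCalls: 0, maxTokens: 1200 },
  { id: "check", maxModelCalls: 1, maxToolCalls: 0, maxTokens: 500 },
];
console.log(checkPlanFits(candidatePlan, 18000, 30)); // []
```

A rejected plan can go back to the planner once with the violations attached, which is a retry that changes its input.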

Prompt caching without cargo-culting

Caching can be huge for agent cost control, but only if you structure prompts correctly.

What belongs in a cached prefix

stable instructions
stable schemas
stable examples

What does NOT belong in a cached prefix

run-specific state
user inputs
long tool outputs
anything that changes every run

Practical trick: split prompt into prefix + state

const PREFIX = `You are an execution agent.
Follow the schema exactly.
Never include internal reasoning.
... stable stuff ...
`;

function buildPrompt(state: unknown) {
  return PREFIX + "\n\nSTATE (json):\n" + JSON.stringify(state);
}

If your prefix is stable, you can get a high cache hit rate and shrink marginal cost.

What this looks like in an agent-native workflow language (nNode-style)

Most teams try to control cost inside the prompt.

nNode’s thesis is different: control cost at the workflow level.

workflows are programs
nodes have explicit budgets
state is typed/contracted
retries are policy-driven

Below is a simplified pseudo-workflow (YAML-ish) to make the point.

workflow: build_linear_cost_agent
version: 1

budgets:
  run_max_tokens: 18000
  run_max_tool_calls: 30
  run_deadline_ms: 120000

state_contract:
  # pass structured state, not chat logs
  type: object
  required: ["objective", "inputs", "facts", "artifacts"]

nodes:
  - id: plan
    model: claude-sonnet
    max_tokens: 1200
    max_retries: 1
    retry_policy:
      on: ["timeout", "invalid_json"]
      backoff_ms: 300
    output_schema: Plan

  - id: execute
    model: claude-sonnet
    max_tokens: 4500
    max_tool_calls: 12
    tools: [tool_gateway]
    loop:
      max_iterations: 4
      stop_when: "objective_met"

  - id: reduce_context
    model: claude-haiku
    max_tokens: 300
    when: "state_size_bytes > 60000"

  - id: verify
    model: claude-haiku
    max_tokens: 500
    checks:
      - "output_matches_schema"
      - "no_missing_required_fields"

budget_exhaustion:
  behavior: "stop"
  emit:
    - audit_log
    - partial_output
    - next_action: "ask_human"

audit:
  log_tokens: true
  log_cache_hits: true
  log_tool_errors: true

The point isn’t YAML. The point is treating cost as a first-class constraint.

When an LLM can author and maintain that workflow directly (instead of you translating from a human-facing UI), iteration gets cheaper too. That’s a big part of why we’re building nNode as a workflow language that coding agents can write.

If you sell this as a service: quoting and guardrails

Agencies and internal platform teams need to quote work. “It usually costs $X” isn’t a quote.

Here’s a simple quoting framework:

1) Define the unit of work

Examples:

“Summarize one ticket + propose fix PR”
“Reconcile one invoice batch”
“Enrich one lead + draft one email”

2) Measure token distribution

Run 50–200 samples and compute:

p50 total tokens
p95 total tokens
p99 total tokens

Quote with p95 (or a capped plan).
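Computing those percentiles from sample runs takes only a few lines. This sketch uses the nearest-rank method, which is one of several percentile definitions; the function names are illustrative and the sample data is synthetic.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new Error("p must be in (0, 100]");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

function tokenDistribution(totalTokensPerRun: number[]): { p50: number; p95: number; p99: number } {
  return {
    p50: percentile(totalTokensPerRun, 50),
    p95: percentile(totalTokensPerRun, 95),
    p99: percentile(totalTokensPerRun, 99),
  };
}

// Synthetic sample: 100 runs using 1,000 to 100,000 tokens in even steps.
const sampleRuns = Array.from({ length: 100 }, (_, i) => (i + 1) * 1000);
const distribution = tokenDistribution(sampleRuns);
console.log(distribution); // p50: 50000, p95: 95000, p99: 99000
```

The gap between p50 and p95 is the number to watch: a wide gap usually points back to retries or unreduced tool output rather than to typical runs.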

3) Add explicit guardrails to the SOW

maximum tool calls per run
maximum retries
“budget exhaustion” behavior
what triggers human handoff

4) Build “turn it off” conditions

Your agent should be allowed to stop.

Examples:

tool returns auth error
required customer data missing
state size exceeds safe limit

Stopping with a clean audit trail is a feature, not a failure.

Checklist: linear cost by design (print this)

Use this to audit an existing agent workflow.

Budgets & bounds

Hard cap: max tokens per run
Hard cap: max tool calls per run
Max iterations for each loop
Timeouts per node
Budget exhaustion behavior is defined

State discipline

Structured state contract exists
Chat history is not blindly appended
Large tool outputs are reduced/summarized
Only required fields are passed between nodes

Retry discipline

Retries are policy-based (not “try again”)
Retry changes inputs or strategy
Tool errors are categorized and logged

Caching discipline

Stable prefix separated from run state
Cache hit rate is measured
Prefix is not polluted with dynamic data

Observability

Tokens logged per node
p95/p99 cost tracked
State growth tracked
Tool latency + error rate tracked

Closing: build agents like you’d build production software

Cost blowups aren’t an “LLM problem.” They’re a systems design problem:

unbounded loops
uncontrolled state
retries without policy
no budgets

If you treat workflows like programs—with constraints, contracts, and audit—you can make agent spend predictable.

If you’re building Claude Skills or any tool-using agent workflow and you want to encode these constraints directly into the workflow definition (so the agent can author and maintain it, not just chat around it), that’s exactly what we’re working toward with nNode.

If this playbook matches your pain (runaway spend, unpredictable margins, agent workflows that feel like “magic” until they hit production), take a look at nnode.ai and see how we’re approaching agent-native workflow authoring and cost-aware orchestration.
