Agent workflows don’t fail because you “didn’t add logging.” They fail because in production you can’t answer basic questions with confidence:
- Did the run actually succeed—or did the UI just say it did?
- Which step burned $38 in tokens at 2 a.m.?
- Can we safely retry step 7 without double-booking a customer or sending duplicate emails?
This is the “two sources of truth” trap: your database says one thing, your traces (or your agent UI) say another, and your team gets stuck in incident-response limbo.
At nNode, we’ve lived this pain while building an LLM-native workflow automation product with guardrails and execution controls for agencies. The pattern below is the architecture we wish everyone started with: a DB state machine as the authoritative present, and OpenTelemetry traces (using OpenTelemetry GenAI semantic conventions) as the authoritative past.
If you operate Claude-powered or LLM-powered workflows for clients (especially as an automation agency), this gives you something you can actually support with SLAs.
## The real problem isn’t “lack of tracing”
Traditional automation failures are usually deterministic: a webhook 500s, a CRM field is missing, an API key expired.
Agent workflows fail in messier ways:
- Long-running runs with many steps and branching logic
- Tool flakiness (rate limits, timeouts, partial writes)
- Human approvals that pause execution mid-trace
- Retries that are sometimes safe and sometimes catastrophic
- Context growth (the run “self-combusts” when the conversation becomes huge)
- Nondeterminism (two identical prompts can produce different tool call sequences)
If you only keep a trace, you can reconstruct what happened—but your UI can’t reliably show current status.
If you only keep DB state, you can show status—but you can’t answer why, where, or what it cost.
You need both.
## Define “truth”: current status vs. historical record
Here’s the contract that prevents status divergence from becoming a permanent product bug.
### 1) DB/state machine is authoritative for current status
Your product needs a single field that answers: “What is the run status right now?”
That field must be:
- transactional
- queryable
- protected from partial writes
- stable across deploys
So: DB wins for current state.
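In practice, “protected from partial writes” means a guarded transition table plus a compare-and-swap update. A minimal sketch (the transition map is illustrative, not exhaustive, and mirrors the status enum used later in this post):

```typescript
// Hypothetical status type mirroring the Postgres enum defined later.
type RunStatus =
  | "QUEUED" | "RUNNING" | "WAITING_FOR_APPROVAL"
  | "SUCCEEDED" | "FAILED" | "CANCELED" | "UNKNOWN" | "STUCK";

// Allowed transitions: anything not listed is rejected, which is what
// protects the status field from partial or out-of-order writes.
const ALLOWED: Record<RunStatus, RunStatus[]> = {
  QUEUED: ["RUNNING", "CANCELED"],
  RUNNING: ["WAITING_FOR_APPROVAL", "SUCCEEDED", "FAILED", "CANCELED", "STUCK", "UNKNOWN"],
  WAITING_FOR_APPROVAL: ["RUNNING", "CANCELED", "STUCK"],
  SUCCEEDED: [],   // terminal
  FAILED: [],      // terminal
  CANCELED: [],    // terminal
  UNKNOWN: ["RUNNING", "FAILED", "CANCELED"],
  STUCK: ["RUNNING", "FAILED", "CANCELED", "UNKNOWN"],
};

export function canTransition(from: RunStatus, to: RunStatus): boolean {
  return ALLOWED[from].includes(to);
}
```

In SQL, the same guard becomes a compare-and-swap: `update workflow_runs set status = $next where id = $id and status = $expected`, treated as a failed transition when zero rows match.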
### 2) OpenTelemetry traces are authoritative for what actually happened
Traces are your forensic record:
- what step started first
- which tool span errored
- how many retries happened
- latency breakdown
- token usage and cost attribution
So: trace wins for historical reality.
### 3) When they disagree, don’t lie—model it
Status divergence happens (network drops, worker crash, exporter down, collector lag, DB deadlocks).
Make divergence a first-class concept:
- `UNKNOWN` (we genuinely don’t know)
- `STUCK` (we know it’s running, but it hasn’t progressed)
“Unknown” is better than “Succeeded” when you’re wrong.
## A minimal workflow state model you can support
You don’t need a huge schema. You need the right entities and invariants.
### Entities
- `WorkflowRun` (one end-to-end invocation)
- `StepRun` (one step in the workflow DAG)
- `ToolCall` (one external action)
- `ApprovalGate` (pause + resume)
- `RetryAttempt` (explicit retry accounting)
### Required fields (the non-negotiables)
- IDs: `tenant_id`, `workflow_id`, `run_id`, `step_id`
- Status enums (terminal and non-terminal)
- Timestamps: `created_at`, `started_at`, `ended_at`, `last_heartbeat_at`
- Idempotency: keys for step-level and tool-level dedupe
- Correlation: `trace_id` (and optionally `span_id`) stored in DB
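Idempotency keys work best when they are derived, not random, so a retry of the same logical action regenerates the same key and gets deduped by a unique constraint. A minimal sketch (the helper name and hashing scheme are our own):

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive a deterministic idempotency key from the
// identifiers that define "the same logical action". Key order inside the
// scope object doesn't matter because we canonicalize before hashing.
export function idempotencyKey(
  runId: string,
  stepName: string,
  attemptScope: Record<string, unknown>
): string {
  const canonical = JSON.stringify(
    Object.keys(attemptScope)
      .sort()
      .map((k) => [k, attemptScope[k]])
  );
  return createHash("sha256")
    .update(`${runId}:${stepName}:${canonical}`)
    .digest("hex");
}
```

A retried step then collides with the existing row instead of creating a second side effect.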
### Example: Postgres tables (minimal but operational)

```sql
-- Current truth: what is the run doing right now?
create type workflow_run_status as enum (
  'QUEUED',
  'RUNNING',
  'WAITING_FOR_APPROVAL',
  'SUCCEEDED',
  'FAILED',
  'CANCELED',
  'UNKNOWN',
  'STUCK'
);

create table workflow_runs (
  id uuid primary key,
  tenant_id uuid not null,
  workflow_id uuid not null,
  status workflow_run_status not null,
  trace_id text, -- store W3C trace id (32 hex chars) or backend format
  created_at timestamptz not null default now(),
  started_at timestamptz,
  ended_at timestamptz,
  last_heartbeat_at timestamptz,
  -- High-value debugging without leaking PII:
  last_step_id uuid,
  last_error_class text,
  last_error_message text,
  -- Optional: coarse cost rollups updated asynchronously
  cost_usd numeric,
  input_tokens bigint,
  output_tokens bigint
);

create type step_run_status as enum (
  'PENDING',
  'RUNNING',
  'SKIPPED',
  'WAITING_FOR_APPROVAL',
  'SUCCEEDED',
  'FAILED'
);

create table step_runs (
  id uuid primary key,
  run_id uuid not null references workflow_runs(id),
  step_name text not null,
  status step_run_status not null,
  idempotency_key text not null,
  started_at timestamptz,
  ended_at timestamptz,
  -- correlate step spans if you want direct linking
  span_id text,
  unique (run_id, idempotency_key)
);

create type tool_call_status as enum (
  'PLANNED',
  'RUNNING',
  'SUCCEEDED',
  'FAILED'
);

create table tool_calls (
  id uuid primary key,
  step_run_id uuid not null references step_runs(id),
  tool_name text not null,
  status tool_call_status not null,
  -- critical for safe retries / dedupe
  idempotency_key text not null,
  external_correlation_id text,
  started_at timestamptz,
  ended_at timestamptz,
  unique (step_run_id, idempotency_key)
);
```
Key idea: the DB tells you “RUNNING vs FAILED” without needing to query your tracing backend.
## Span model: represent agent execution with OpenTelemetry GenAI semantic conventions
OpenTelemetry gives you a standard “language” for tracing. The part that’s new (and evolving quickly) is the OpenTelemetry GenAI semantic conventions.
A pragmatic mapping that works well for agencies:
- Trace = one `WorkflowRun`
- Root span = `workflow.run` (INTERNAL)
- Step spans = `workflow.step` (INTERNAL)
- LLM inference spans = GenAI “inference” spans (CLIENT)
- Tool execution spans = GenAI `execute_tool` spans (INTERNAL)
### Why use GenAI semconv instead of custom attributes?
Because you want:
- out-of-the-box dashboards in modern observability tools
- portable attributes across languages
- sane token usage fields (`gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`)
- a consistent place to record model/provider (`gen_ai.provider.name`, `gen_ai.request.model`)
### Important note: GenAI semconv stability
As of early 2026, GenAI semantic conventions are still marked Development and have had breaking changes.
Operationally, that means:
- pin your instrumentation version
- add a translation layer in your pipeline if needed
- keep your own business identifiers stable (run_id, step_id, tenant_id) regardless of semconv shifts
OpenTelemetry’s broader approach to semconv migrations often uses the OTEL_SEMCONV_STABILITY_OPT_IN environment variable, and GenAI conventions follow a similar opt-in approach (for example enabling “latest experimental” for GenAI).
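A translation layer can be as small as a key-remapping pass run in your pipeline (a span processor, or an OTel Collector transform). The sketch below uses two renames that have actually occurred in the GenAI conventions (`gen_ai.system` to `gen_ai.provider.name`, and `prompt_tokens`/`completion_tokens` to `input_tokens`/`output_tokens`); treat the map as a starting point, not a complete list:

```typescript
// Remap older experimental GenAI attribute keys to the ones your
// dashboards expect, so semconv churn doesn't break saved queries.
const GENAI_KEY_MAP: Record<string, string> = {
  "gen_ai.system": "gen_ai.provider.name",
  "gen_ai.usage.prompt_tokens": "gen_ai.usage.input_tokens",
  "gen_ai.usage.completion_tokens": "gen_ai.usage.output_tokens",
};

export function translateGenAiAttributes(
  attrs: Record<string, unknown>
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(attrs)) {
    // Unknown keys pass through untouched.
    out[GENAI_KEY_MAP[key] ?? key] = value;
  }
  return out;
}
```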
## Manual instrumentation starter kit (Node.js / TypeScript)
Below is a small pattern you can drop into a worker that runs agent workflows.
### 1) Create the root span and store `trace_id` in the DB
```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("workflow-executor");

export async function runWorkflow(runId: string) {
  return await tracer.startActiveSpan(
    "workflow.run",
    {
      attributes: {
        "workflow.run_id": runId,
        // Add stable business dimensions you will query forever:
        "tenant.id": process.env.TENANT_ID,
        "workflow.id": process.env.WORKFLOW_ID,
      },
    },
    async (rootSpan) => {
      try {
        const traceId = rootSpan.spanContext().traceId;
        // Persist traceId once so you can deep-link from UI -> traces.
        await db.workflowRuns.setTraceId(runId, traceId);

        await executeSteps(runId);

        rootSpan.setStatus({ code: SpanStatusCode.OK });
        return { ok: true };
      } catch (err: any) {
        rootSpan.recordException(err);
        rootSpan.setStatus({ code: SpanStatusCode.ERROR, message: err?.message });
        throw err;
      } finally {
        rootSpan.end();
      }
    }
  );
}
```
Rule: updating `workflow_runs.status` should be separate from “ending spans.” Don’t make the DB status update depend on a successful telemetry export.
### 2) Step spans + state transitions
```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

async function runStep(runId: string, stepId: string, stepName: string) {
  const tracer = trace.getTracer("workflow-executor");
  return await tracer.startActiveSpan(
    "workflow.step",
    {
      attributes: {
        "workflow.run_id": runId,
        "workflow.step_id": stepId,
        "workflow.step_name": stepName,
      },
    },
    async (span) => {
      await db.stepRuns.transition(stepId, "RUNNING");
      try {
        // Plan -> infer -> tools -> validate
        const plan = await callLLMPlanner(runId, stepId, stepName);
        await executePlannedTools(runId, stepId, plan.tools);

        await db.stepRuns.transition(stepId, "SUCCEEDED");
        return plan;
      } catch (err: any) {
        await db.stepRuns.transition(stepId, "FAILED", {
          errorClass: err?.name,
          errorMessage: err?.message,
        });
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw err;
      } finally {
        span.end();
      }
    }
  );
}
```
## Instrument LLM calls with OpenTelemetry GenAI semantic conventions
A GenAI inference span is where cost, tokens, model, and provider should live.
### Example: inference span wrapper
```typescript
import { trace, SpanKind, SpanStatusCode } from "@opentelemetry/api";

type LlmResult = {
  outputText: string;
  inputTokens?: number;
  outputTokens?: number;
  model?: string;
  responseId?: string;
};

async function tracedChatCompletion(params: {
  provider: "openai" | "anthropic" | "gcp.vertex_ai";
  model: string;
  operationName: "chat" | "text_completion" | "generate_content";
  conversationId?: string;
  prompt: string; // keep raw prompts out of attributes by default
}): Promise<LlmResult> {
  const tracer = trace.getTracer("llm-client");
  return await tracer.startActiveSpan(
    `${params.operationName} ${params.model}`,
    {
      kind: SpanKind.CLIENT,
      attributes: {
        "gen_ai.operation.name": params.operationName,
        "gen_ai.provider.name": params.provider,
        "gen_ai.request.model": params.model,
        ...(params.conversationId ? { "gen_ai.conversation.id": params.conversationId } : {}),
      },
    },
    async (span) => {
      try {
        const res = await llm.chat({ model: params.model, prompt: params.prompt });

        // Populate usage + response metadata if you have it.
        if (res.inputTokens != null) span.setAttribute("gen_ai.usage.input_tokens", res.inputTokens);
        if (res.outputTokens != null) span.setAttribute("gen_ai.usage.output_tokens", res.outputTokens);
        if (res.responseId) span.setAttribute("gen_ai.response.id", res.responseId);
        if (res.model) span.setAttribute("gen_ai.response.model", res.model);

        span.setStatus({ code: SpanStatusCode.OK });
        return res;
      } catch (err: any) {
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR, message: err?.message });
        throw err;
      } finally {
        span.end();
      }
    }
  );
}
```
### Don’t record prompts by default
GenAI semconv supports recording message history, system instructions, and tool definitions—but it’s explicitly sensitive and can explode telemetry costs.
A production-friendly pattern:
- keep prompts and outputs in your artifact store (encrypted, access-controlled)
- put a reference on the span (e.g., `workflow.artifact_uri`)
- allow opt-in prompt capture only in staging or for specific tenants
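A sketch of that reference pattern, with an in-memory stand-in for the artifact store (`putArtifact` and the `artifact://` URI scheme are hypothetical; in production this would be an encrypted, access-controlled bucket):

```typescript
import { createHash, randomUUID } from "node:crypto";

// Stand-in for your encrypted artifact store (S3, GCS, etc.).
const artifactStore = new Map<string, string>();

function putArtifact(tenantId: string, body: string): string {
  const uri = `artifact://${tenantId}/${randomUUID()}`; // hypothetical scheme
  artifactStore.set(uri, body);
  return uri;
}

// Returns span attributes that reference the prompt without embedding it.
export function promptAttributes(tenantId: string, prompt: string) {
  return {
    "workflow.artifact_uri": putArtifact(tenantId, prompt),
    // A hash lets you dedupe and verify without ever exporting the text.
    "workflow.prompt_sha256": createHash("sha256").update(prompt).digest("hex"),
  };
}
```

The span carries a pointer and a hash; the sensitive payload lives behind your own access controls.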
## Tool calls: trace them like first-class operations
Tool flakiness is where most real incidents happen. Treat tool calls as spans.
GenAI semconv includes an `execute_tool` operation and tool attributes.
```typescript
import { trace, SpanKind, SpanStatusCode } from "@opentelemetry/api";

async function tracedToolCall(params: {
  runId: string;
  stepId: string;
  toolName: string;
  idempotencyKey: string;
  args: Record<string, any>;
  fn: (args: any) => Promise<any>;
}) {
  const tracer = trace.getTracer("tooling");
  return await tracer.startActiveSpan(
    `execute_tool ${params.toolName}`,
    {
      kind: SpanKind.INTERNAL,
      attributes: {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": params.toolName,
        // Your durable dimensions:
        "workflow.run_id": params.runId,
        "workflow.step_id": params.stepId,
        "workflow.idempotency_key": params.idempotencyKey,
        // WARNING: arguments can contain PII. Prefer hashes or references.
        // "gen_ai.tool.call.arguments": params.args,
      },
    },
    async (span) => {
      await db.toolCalls.upsertPlannedOrRunning({
        runId: params.runId,
        stepId: params.stepId,
        toolName: params.toolName,
        idempotencyKey: params.idempotencyKey,
      });
      try {
        const result = await params.fn(params.args);
        await db.toolCalls.transition(params.idempotencyKey, "SUCCEEDED");
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (err: any) {
        await db.toolCalls.transition(params.idempotencyKey, "FAILED");
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw err;
      } finally {
        span.end();
      }
    }
  );
}
```
### MCP tools: two layers of tracing
If you’re using MCP servers:
- keep the agent-level `execute_tool` span
- also trace the MCP server’s internal spans (HTTP, DB, etc.)
This creates a clean hierarchy: workflow → step → tool → underlying calls.
## Reconciliation: keep DB + traces aligned (without pretending they always are)
If you do nothing, you’ll eventually see:
- DB says `RUNNING` but the trace clearly ended
- the trace shows errors but DB says `SUCCEEDED`
- no spans exported at all due to a collector outage
The fix is a small reconciler that runs periodically.
### Reconciliation principles
- **“Succeeded” requires an explicit DB terminal state.**
  - A trace ending is not proof of success.
- **“Failed” requires classification.**
  - Distinguish `timeout` vs `rate_limit` vs `validation_failed` vs `tool_side_effect`.
- **“Unknown” is valid.**
  - If your worker crashed mid-run, mark it `UNKNOWN`, notify, and require human action.
- **Heartbeat beats trace freshness.**
  - DB heartbeat is the best “liveness” signal.
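The classification step can start as a small function that is honest about its limits; the match rules below are illustrative and should be tuned to your actual tool clients:

```typescript
// Hypothetical failure classifier for the reconciler: turn raw errors into
// the coarse classes that drive retry policy.
type FailureClass =
  | "timeout"
  | "rate_limit"
  | "validation_failed"
  | "tool_side_effect"
  | "unknown";

export function classifyFailure(err: {
  name?: string;
  message?: string;
  status?: number;
}): FailureClass {
  const msg = (err.message ?? "").toLowerCase();
  if (err.status === 429 || msg.includes("rate limit")) return "rate_limit";
  if (err.name === "TimeoutError" || msg.includes("timed out")) return "timeout";
  if (msg.includes("validation")) return "validation_failed";
  // Partial writes: the call may have mutated external state before failing.
  if (msg.includes("partial write")) return "tool_side_effect";
  return "unknown";
}
```

Only `rate_limit` and `timeout` are generally safe to auto-retry; `tool_side_effect` and `unknown` should route to a human.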
### Example: reconciler pseudocode

```python
from datetime import datetime, timedelta

RUNNING_TIMEOUT = timedelta(minutes=15)

def reconcile_runs(now: datetime):
    # 1) Mark stuck runs
    running = db.query("""
        select id, last_heartbeat_at
        from workflow_runs
        where status in ('RUNNING', 'WAITING_FOR_APPROVAL')
    """)
    for r in running:
        if r.last_heartbeat_at is None:
            continue
        if now - r.last_heartbeat_at > RUNNING_TIMEOUT:
            # Don't guess success/failure.
            db.update_run_status(r.id, 'STUCK')

    # 2) Close runs that are terminal in DB but missing ended_at
    db.execute("""
        update workflow_runs
        set ended_at = coalesce(ended_at, now())
        where status in ('SUCCEEDED', 'FAILED', 'CANCELED')
          and ended_at is null
    """)

    # 3) Optional: pull trace summaries asynchronously
    #    - token counts
    #    - cost rollups
    #    - last error span
    #    and write them to workflow_runs for fast UI rendering.
```
### A simple divergence UI that users trust
In your run detail page, render:
- Current state (from the DB)
- Last known trace activity (from the trace backend)
- Trace link (using `trace_id`)

When they disagree, show it explicitly:

> “Run is STUCK (no heartbeat in 18m). Last trace activity: tool `crm.upsert` retrying.”
This is the difference between a “black box agent” and an agency-grade system.
## Cost and latency budgets that actually work
Most teams set a single workflow-level budget (“keep it under $2/run”). That fails because spend concentrates in specific steps.
### Budget per step, not per workflow
Examples:
- “Lead enrichment step must stay under 2,000 output tokens.”
- “Draft-email step must finish under 8 seconds p95.”
- “Tool loop must not exceed 3 retries.”
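A sketch of how those per-step budgets might be enforced in the executor. The step names and limits mirror the examples above but are assumptions; in practice you would load them from per-workflow config:

```typescript
// Illustrative per-step budgets keyed by step name.
type StepBudget = { maxOutputTokens?: number; maxAttempts?: number; maxLatencyMs?: number };

const BUDGETS: Record<string, StepBudget> = {
  lead_enrichment: { maxOutputTokens: 2000 },
  draft_email: { maxLatencyMs: 8000 },
  tool_loop: { maxAttempts: 3 },
};

// Returns the list of violated budget dimensions (empty = within budget).
export function budgetViolations(
  stepName: string,
  usage: { outputTokens: number; attempts: number; latencyMs: number }
): string[] {
  const b = BUDGETS[stepName] ?? {};
  const violations: string[] = [];
  if (b.maxOutputTokens != null && usage.outputTokens > b.maxOutputTokens)
    violations.push("output_tokens");
  if (b.maxAttempts != null && usage.attempts > b.maxAttempts)
    violations.push("attempts");
  if (b.maxLatencyMs != null && usage.latencyMs > b.maxLatencyMs)
    violations.push("latency");
  return violations;
}
```

Whether a violation aborts the run, degrades to a cheaper model, or just alerts is a policy decision per step.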
### How traces enable honest cost attribution
With GenAI semconv you can roll up:
- `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens` per inference span
- grouped by `tenant.id`, `workflow.id`, `workflow.step_name`
Then you can answer:
- which client subsidizes everyone else
- which step is drifting week over week
- what retries cost you
### “Wasted spend” is a metric
Track tokens and dollars consumed on spans that later get retried or invalidated.
A practical approach:
- tag spans with `workflow.retryable=true|false`
- tag attempts with `workflow.attempt=1..n`
- compute wasted spend as spend on superseded attempts or on spans in failed runs
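Given span summaries tagged with `workflow.attempt`, the rollup is straightforward; a sketch (the `SpanSummary` shape is a stand-in for whatever your trace backend exports):

```typescript
// One row per inference/tool span, pre-joined with the run's final outcome.
type SpanSummary = {
  runId: string;
  stepId: string;
  attempt: number; // workflow.attempt
  costUsd: number;
  runFailed: boolean;
};

// Spend is "wasted" if the span belongs to a superseded attempt
// (a later attempt exists for the same step) or to a run that failed.
export function wastedSpendUsd(spans: SpanSummary[]): number {
  const maxAttempt = new Map<string, number>();
  for (const s of spans) {
    const key = `${s.runId}:${s.stepId}`;
    maxAttempt.set(key, Math.max(maxAttempt.get(key) ?? 0, s.attempt));
  }
  let wasted = 0;
  for (const s of spans) {
    const superseded = s.attempt < (maxAttempt.get(`${s.runId}:${s.stepId}`) ?? 0);
    if (superseded || s.runFailed) wasted += s.costUsd;
  }
  return wasted;
}
```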
## Replay and partial reruns: turning traces into a debugging tool
Traces tell you what happened. They do not, by themselves, make replay safe.
To get replayable execution, you need checkpoints.
### What to persist vs. what to reference
Persist (in DB/artifact store):
- validated step inputs
- tool call idempotency keys
- tool call external correlation IDs
- tool outputs that influence downstream steps
Reference (in OTel):
- raw timing, retries, failures
- token usage, cost, latency
### Safe re-execution rule
A step is replayable if:
- all tool calls are idempotent (or deduped by idempotency keys)
- or the step is “read-only” (retrieval, summarization)
- or the step is behind an approval gate
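That rule can be encoded as a predicate over step metadata; the `StepMeta` shape below is hypothetical, standing in for however you classify steps in your workflow config:

```typescript
type StepMeta = {
  readOnly: boolean;            // retrieval / summarization only
  behindApprovalGate: boolean;  // a human re-approves before side effects
  toolCalls: { idempotent: boolean; hasIdempotencyKey: boolean }[];
};

export function isReplayable(step: StepMeta): boolean {
  if (step.readOnly) return true;
  if (step.behindApprovalGate) return true;
  // Otherwise every side-effecting call must be idempotent or deduped by key.
  return step.toolCalls.every((t) => t.idempotent || t.hasIdempotencyKey);
}
```

Gate your “retry step” button on this predicate rather than on operator judgment.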
### Approval gates must exist in both DB and traces
In DB:
- a `WAITING_FOR_APPROVAL` status
- who needs to approve
- what is being approved (artifact reference)
In traces:
- a span or event indicating pause
- a span/event indicating resume
That’s how you explain “it sat for 3 hours” without guessing.
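One lightweight way to do this is a pair of span events with a shared naming convention (the event names and attributes here are our own, not GenAI semconv):

```typescript
// Build the pause/resume events attached to the step span. The gap between
// the two event timestamps is your "it sat for 3 hours".
export function approvalEvent(
  phase: "paused" | "resumed",
  approver: string,
  artifactUri: string
) {
  return {
    name: `workflow.approval.${phase}`,
    attributes: {
      "workflow.approver": approver,
      "workflow.artifact_uri": artifactUri,
    },
  };
}
```

Attach with `span.addEvent(e.name, e.attributes)` at pause and again at resume, alongside the DB transition into and out of `WAITING_FOR_APPROVAL`.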
## Agency runbook: what you can promise clients (and what you can’t)
Agencies win when they can confidently answer client questions.
With DB+OTel you can offer realistic operational guarantees:
### What you can promise
- Status accuracy: current run status is DB-authoritative
- Trace-backed incident review: you can show where it failed
- Cost reporting: token usage and spend rollups per client/workflow/step
- Safe retries: replay policy based on idempotency + step classification
### What you should not promise
- perfect determinism (“same input always yields same output”)
- perfect trace completeness (exporters/collectors can fail)
- “we store every prompt forever” (privacy + cost)
Instead, promise:
- “We store enough to debug and replay safely.”
## Implementation checklist (opinionated)
Use this as your “done means done” list.
### State machine
- `WorkflowRun.status` is updated transactionally
- terminal states set `ended_at`
- heartbeat updates at least once per step and during long tool calls
- `UNKNOWN` and `STUCK` are modeled and visible
### Trace shape
- `trace_id` stored in `workflow_runs`
- root span `workflow.run`
- step spans `workflow.step`
- GenAI inference spans include:
  - `gen_ai.operation.name`
  - `gen_ai.provider.name`
  - `gen_ai.request.model`
  - `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens` (when available)
- tool spans use `gen_ai.operation.name=execute_tool` and include `gen_ai.tool.name`
### Dimensions you’ll want forever
- `tenant.id`
- `workflow.id`
- `workflow.run_id`
- `workflow.step_name` / `workflow.step_id`
- `workflow.idempotency_key`
### Privacy + cost controls
- prompt capture is opt-in
- arguments/results are redacted or referenced
- sensitive artifacts stored separately with stricter access controls
### Reconciliation + alerting
- scheduled reconciler marks `STUCK`
- alerts on:
  - stuck runs
  - retry spikes
  - sudden token/cost spikes per tenant
  - tool failure rate increases
## Where nNode fits
If you’re building agency workflows on top of Claude skills, MCP servers, and a pile of custom scripts, you can absolutely implement the above yourself.
What nNode is aiming to provide is the product surface around it:
- a workflow engine designed for agent runs that need guardrails
- structured execution primitives (steps, approvals, retries)
- observability that doesn’t become a “two sources of truth” mess as runs get larger
In other words: fewer brittle black boxes, more durable operations.
If you’re an automation agency (or an internal automation team) and you’re tired of explaining agent failures with screenshots and guesses, try building your next workflow with a single DB status contract + OpenTelemetry GenAI semantic conventions from day one.
And if you want a platform that’s purpose-built for running and iterating on these workflows—with guardrails, replay-friendly execution, and the operational model agencies need—you can take a look at nNode at nnode.ai.