Agent workflows don’t fail because you “didn’t add logging.” They fail because in production you can’t answer basic questions with confidence:
- Did the run actually succeed—or did the UI just say it did?
- Which step burned $38 in tokens at 2 a.m.?
- Can we safely retry step 7 without double-booking a customer or sending duplicate emails?
This is the “two sources of truth” trap: your database says one thing, your traces (or your agent UI) say another, and your team gets stuck in incident-response limbo.
At nNode, we’ve lived this pain while building an LLM-native workflow automation product with guardrails and execution controls for agencies. The pattern below is the architecture we wish everyone started with: a DB state machine as the authoritative present, and OpenTelemetry traces (using OpenTelemetry GenAI semantic conventions) as the authoritative past.
If you operate Claude-powered or LLM-powered workflows for clients (especially as an automation agency), this gives you something you can actually support with SLAs.
## The real problem isn’t “lack of tracing”
Traditional automation failures are usually deterministic: a webhook 500s, a CRM field is missing, an API key expired.
Agent workflows fail in messier ways:
- Long-running runs with many steps and branching logic
- Tool flakiness (rate limits, timeouts, partial writes)
- Human approvals that pause execution mid-trace
- Retries that are sometimes safe and sometimes catastrophic
- Context growth (the run “self-combusts” when the conversation becomes huge)
- Nondeterminism (two identical prompts can produce different tool call sequences)
If you only keep a trace, you can reconstruct what happened—but your UI can’t reliably show current status.
If you only keep DB state, you can show status—but you can’t answer why, where, or what it cost.
You need both.
## Define “truth”: current status vs. historical record
Here’s the contract that prevents status divergence from becoming a permanent product bug.
### 1) DB/state machine is authoritative for current status
Your product needs a single field that answers: “What is the run status right now?”
That field must be:
- transactional
- queryable
- protected from partial writes
- stable across deploys
So: DB wins for current state.
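In practice, “protected from partial writes” means a guarded transition table plus a compare-and-swap update. A minimal sketch (the transition map is illustrative, not exhaustive, and mirrors the status enum used later in this post):

```typescript
// Hypothetical status type mirroring the Postgres enum defined later.
type RunStatus =
  | "QUEUED" | "RUNNING" | "WAITING_FOR_APPROVAL"
  | "SUCCEEDED" | "FAILED" | "CANCELED" | "UNKNOWN" | "STUCK";

// Allowed transitions: anything not listed is rejected, which is what
// protects the status field from partial or out-of-order writes.
const ALLOWED: Record<RunStatus, RunStatus[]> = {
  QUEUED: ["RUNNING", "CANCELED"],
  RUNNING: ["WAITING_FOR_APPROVAL", "SUCCEEDED", "FAILED", "CANCELED", "STUCK", "UNKNOWN"],
  WAITING_FOR_APPROVAL: ["RUNNING", "CANCELED", "STUCK"],
  SUCCEEDED: [],   // terminal
  FAILED: [],      // terminal
  CANCELED: [],    // terminal
  UNKNOWN: ["RUNNING", "FAILED", "CANCELED"],
  STUCK: ["RUNNING", "FAILED", "CANCELED", "UNKNOWN"],
};

export function canTransition(from: RunStatus, to: RunStatus): boolean {
  return ALLOWED[from].includes(to);
}
```

In SQL, the same guard becomes a compare-and-swap: `update workflow_runs set status = $next where id = $id and status = $expected`, treated as a failed transition when zero rows match.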
### 2) OpenTelemetry traces are authoritative for what actually happened
Traces are your forensic record:
- what step started first
- which tool span errored
- how many retries happened
- latency breakdown
- token usage and cost attribution
So: trace wins for historical reality.
### 3) When they disagree, don’t lie—model it
Status divergence happens (network drops, worker crash, exporter down, collector lag, DB deadlocks).
Make divergence a first-class concept:
- `UNKNOWN` (we genuinely don’t know)
- `STUCK` (we know it’s running, but it hasn’t progressed)
“Unknown” is better than “Succeeded” when you’re wrong.
## A minimal workflow state model you can support
You don’t need a huge schema. You need the right entities and invariants.
### Entities
- `WorkflowRun` (one end-to-end invocation)
- `StepRun` (one step in the workflow DAG)
- `ToolCall` (one external action)
- `ApprovalGate` (pause + resume)
- `RetryAttempt` (explicit retry accounting)
### Required fields (the non-negotiables)
- IDs: `tenant_id`, `workflow_id`, `run_id`, `step_id`
- Status enums (terminal and non-terminal)
- Timestamps: `created_at`, `started_at`, `ended_at`, `last_heartbeat_at`
- Idempotency: keys for step-level and tool-level dedupe
- Correlation: `trace_id` (and optionally `span_id`) stored in DB
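Idempotency keys work best when they are derived, not random, so a retry of the same logical action regenerates the same key and gets deduped by a unique constraint. A minimal sketch (the helper name and hashing scheme are our own):

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive a deterministic idempotency key from the
// identifiers that define "the same logical action". Key order inside the
// scope object doesn't matter because we canonicalize before hashing.
export function idempotencyKey(
  runId: string,
  stepName: string,
  attemptScope: Record<string, unknown>
): string {
  const canonical = JSON.stringify(
    Object.keys(attemptScope)
      .sort()
      .map((k) => [k, attemptScope[k]])
  );
  return createHash("sha256")
    .update(`${runId}:${stepName}:${canonical}`)
    .digest("hex");
}
```

A retried step then collides with the existing row instead of creating a second side effect.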
### Example: Postgres tables (minimal but operational)

```sql
-- Current truth: what is the run doing right now?
create type workflow_run_status as enum (
  'QUEUED',
  'RUNNING',
  'WAITING_FOR_APPROVAL',
  'SUCCEEDED',
  'FAILED',
  'CANCELED',
  'UNKNOWN',
  'STUCK'
);

create table workflow_runs (
  id uuid primary key,
  tenant_id uuid not null,
  workflow_id uuid not null,
  status workflow_run_status not null,
  trace_id text, -- store W3C trace id (32 hex chars) or backend format
  created_at timestamptz not null default now(),
  started_at timestamptz,
  ended_at timestamptz,
  last_heartbeat_at timestamptz,
  -- High-value debugging without leaking PII:
  last_step_id uuid,
  last_error_class text,
  last_error_message text,
  -- Optional: coarse cost rollups updated asynchronously
  cost_usd numeric,
  input_tokens bigint,
  output_tokens bigint
);

create type step_run_status as enum (
  'PENDING',
  'RUNNING',
  'SKIPPED',
  'WAITING_FOR_APPROVAL',
  'SUCCEEDED',
  'FAILED'
);

create table step_runs (
  id uuid primary key,
  run_id uuid not null references workflow_runs(id),
  step_name text not null,
  status step_run_status not null,
  idempotency_key text not null,
  started_at timestamptz,
  ended_at timestamptz,
  -- correlate step spans if you want direct linking
  span_id text,
  unique (run_id, idempotency_key)
);

create type tool_call_status as enum (
  'PLANNED',
  'RUNNING',
  'SUCCEEDED',
  'FAILED'
);

create table tool_calls (
  id uuid primary key,
  step_run_id uuid not null references step_runs(id),
  tool_name text not null,
  status tool_call_status not null,
  -- critical for safe retries / dedupe
  idempotency_key text not null,
  external_correlation_id text,
  started_at timestamptz,
  ended_at timestamptz,
  unique (step_run_id, idempotency_key)
);
```
Key idea: the DB tells you “RUNNING vs FAILED” without needing to query your tracing backend.
## Span model: represent agent execution with OpenTelemetry GenAI semantic conventions
OpenTelemetry gives you a standard “language” for tracing. The part that’s new (and evolving quickly) is the OpenTelemetry GenAI semantic conventions.
A pragmatic mapping that works well for agencies:
- Trace = one `WorkflowRun`
- Root span = `workflow.run` (INTERNAL)
- Step spans = `workflow.step` (INTERNAL)
- LLM inference spans = GenAI “inference” spans (CLIENT)
- Tool execution spans = GenAI `execute_tool` spans (INTERNAL)
### Why use GenAI semconv instead of custom attributes?
Because you want:
- out-of-the-box dashboards in modern observability tools
- portable attributes across languages
- sane token usage fields (`gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`)
- a consistent place to record model/provider (`gen_ai.provider.name`, `gen_ai.request.model`)
### Important note: GenAI semconv stability
As of early 2026, GenAI semantic conventions are still marked Development and have had breaking changes.
Operationally, that means:
- pin your instrumentation version
- add a translation layer in your pipeline if needed
- keep your own business identifiers stable (run_id, step_id, tenant_id) regardless of semconv shifts
OpenTelemetry’s broader approach to semconv migrations often uses the OTEL_SEMCONV_STABILITY_OPT_IN environment variable, and GenAI conventions follow a similar opt-in approach (for example enabling “latest experimental” for GenAI).
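A translation layer can be as small as a key-remapping pass run in your pipeline (a span processor, or an OTel Collector transform). The sketch below uses two renames that have actually occurred in the GenAI conventions (`gen_ai.system` to `gen_ai.provider.name`, and `prompt_tokens`/`completion_tokens` to `input_tokens`/`output_tokens`); treat the map as a starting point, not a complete list:

```typescript
// Remap older experimental GenAI attribute keys to the ones your
// dashboards expect, so semconv churn doesn't break saved queries.
const GENAI_KEY_MAP: Record<string, string> = {
  "gen_ai.system": "gen_ai.provider.name",
  "gen_ai.usage.prompt_tokens": "gen_ai.usage.input_tokens",
  "gen_ai.usage.completion_tokens": "gen_ai.usage.output_tokens",
};

export function translateGenAiAttributes(
  attrs: Record<string, unknown>
): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(attrs)) {
    // Unknown keys pass through untouched.
    out[GENAI_KEY_MAP[key] ?? key] = value;
  }
  return out;
}
```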
## Manual instrumentation starter kit (Node.js / TypeScript)
Below is a small pattern you can drop into a worker that runs agent workflows.
### 1) Create the root span and store `trace_id` in the DB
```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("workflow-executor");

export async function runWorkflow(runId: string) {
  return await tracer.startActiveSpan(
    "workflow.run",
    {
      attributes: {
        "workflow.run_id": runId,
        // Add stable business dimensions you will query forever:
        "tenant.id": process.env.TENANT_ID,
        "workflow.id": process.env.WORKFLOW_ID,
      },
    },
    async (rootSpan) => {
      try {
        const traceId = rootSpan.spanContext().traceId;
        // Persist traceId once so you can deep-link from UI -> traces.
        await db.workflowRuns.setTraceId(runId, traceId);

        await executeSteps(runId);

        rootSpan.setStatus({ code: SpanStatusCode.OK });
        return { ok: true };
      } catch (err: any) {
        rootSpan.recordException(err);
        rootSpan.setStatus({ code: SpanStatusCode.ERROR, message: err?.message });
        throw err;
      } finally {
        rootSpan.end();
      }
    }
  );
}
```
Rule: updating `workflow_runs.status` should be separate from “ending spans.” Don’t make the DB status update depend on a successful telemetry export.
### 2) Step spans + state transitions
```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

async function runStep(runId: string, stepId: string, stepName: string) {
  const tracer = trace.getTracer("workflow-executor");
  return await tracer.startActiveSpan(
    "workflow.step",
    {
      attributes: {
        "workflow.run_id": runId,
        "workflow.step_id": stepId,
        "workflow.step_name": stepName,
      },
    },
    async (span) => {
      await db.stepRuns.transition(stepId, "RUNNING");
      try {
        // Plan -> infer -> tools -> validate
        const plan = await callLLMPlanner(runId, stepId, stepName);
        await executePlannedTools(runId, stepId, plan.tools);

        await db.stepRuns.transition(stepId, "SUCCEEDED");
        return plan;
      } catch (err: any) {
        await db.stepRuns.transition(stepId, "FAILED", {
          errorClass: err?.name,
          errorMessage: err?.message,
        });
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw err;
      } finally {
        span.end();
      }
    }
  );
}
```
## Instrument LLM calls with OpenTelemetry GenAI semantic conventions
A GenAI inference span is where cost, tokens, model, and provider should live.
### Example: inference span wrapper
```typescript
import { trace, SpanKind, SpanStatusCode } from "@opentelemetry/api";

type LlmResult = {
  outputText: string;
  inputTokens?: number;
  outputTokens?: number;
  model?: string;
  responseId?: string;
};

async function tracedChatCompletion(params: {
  provider: "openai" | "anthropic" | "gcp.vertex_ai";
  model: string;
  operationName: "chat" | "text_completion" | "generate_content";
  conversationId?: string;
  prompt: string; // keep raw prompts out of attributes by default
}): Promise<LlmResult> {
  const tracer = trace.getTracer("llm-client");
  return await tracer.startActiveSpan(
    `${params.operationName} ${params.model}`,
    {
      kind: SpanKind.CLIENT,
      attributes: {
        "gen_ai.operation.name": params.operationName,
        "gen_ai.provider.name": params.provider,
        "gen_ai.request.model": params.model,
        ...(params.conversationId ? { "gen_ai.conversation.id": params.conversationId } : {}),
      },
    },
    async (span) => {
      try {
        const res = await llm.chat({ model: params.model, prompt: params.prompt });

        // Populate usage + response metadata if you have it.
        if (res.inputTokens != null) span.setAttribute("gen_ai.usage.input_tokens", res.inputTokens);
        if (res.outputTokens != null) span.setAttribute("gen_ai.usage.output_tokens", res.outputTokens);
        if (res.responseId) span.setAttribute("gen_ai.response.id", res.responseId);
        if (res.model) span.setAttribute("gen_ai.response.model", res.model);

        span.setStatus({ code: SpanStatusCode.OK });
        return res;
      } catch (err: any) {
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR, message: err?.message });
        throw err;
      } finally {
        span.end();
      }
    }
  );
}
```
### Don’t record prompts by default
GenAI semconv supports recording message history, system instructions, and tool definitions—but it’s explicitly sensitive and can explode telemetry costs.
A production-friendly pattern:
- keep prompts and outputs in your artifact store (encrypted, access-controlled)
- put a reference on the span (e.g., `workflow.artifact_uri`)
- allow opt-in prompt capture only in staging or for specific tenants
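A sketch of that reference pattern, with an in-memory stand-in for the artifact store (`putArtifact` and the `artifact://` URI scheme are hypothetical; in production this would be an encrypted, access-controlled bucket):

```typescript
import { createHash, randomUUID } from "node:crypto";

// Stand-in for your encrypted artifact store (S3, GCS, etc.).
const artifactStore = new Map<string, string>();

function putArtifact(tenantId: string, body: string): string {
  const uri = `artifact://${tenantId}/${randomUUID()}`; // hypothetical scheme
  artifactStore.set(uri, body);
  return uri;
}

// Returns span attributes that reference the prompt without embedding it.
export function promptAttributes(tenantId: string, prompt: string) {
  return {
    "workflow.artifact_uri": putArtifact(tenantId, prompt),
    // A hash lets you dedupe and verify without ever exporting the text.
    "workflow.prompt_sha256": createHash("sha256").update(prompt).digest("hex"),
  };
}
```

The span carries a pointer and a hash; the sensitive payload lives behind your own access controls.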
## Tool calls: trace them like first-class operations
Tool flakiness is where most real incidents happen. Treat tool calls as spans.
GenAI semconv includes an `execute_tool` operation and tool attributes.
```typescript
import { trace, SpanKind, SpanStatusCode } from "@opentelemetry/api";

async function tracedToolCall(params: {
  runId: string;
  stepId: string;
  toolName: string;
  idempotencyKey: string;
  args: Record<string, any>;
  fn: (args: any) => Promise<any>;
}) {
  const tracer = trace.getTracer("tooling");
  return await tracer.startActiveSpan(
    `execute_tool ${params.toolName}`,
    {
      kind: SpanKind.INTERNAL,
      attributes: {
        "gen_ai.operation.name": "execute_tool",
        "gen_ai.tool.name": params.toolName,
        // Your durable dimensions:
        "workflow.run_id": params.runId,
        "workflow.step_id": params.stepId,
        "workflow.idempotency_key": params.idempotencyKey,
        // WARNING: arguments can contain PII. Prefer hashes or references.
        // "gen_ai.tool.call.arguments": params.args,
      },
    },
    async (span) => {
      await db.toolCalls.upsertPlannedOrRunning({
        runId: params.runId,
        stepId: params.stepId,
        toolName: params.toolName,
        idempotencyKey: params.idempotencyKey,
      });
      try {
        const result = await params.fn(params.args);
        await db.toolCalls.transition(params.idempotencyKey, "SUCCEEDED");
        span.setStatus({ code: SpanStatusCode.OK });
        return result;
      } catch (err: any) {
        await db.toolCalls.transition(params.idempotencyKey, "FAILED");
        span.recordException(err);
        span.setStatus({ code: SpanStatusCode.ERROR });
        throw err;
      } finally {
        span.end();
      }
    }
  );
}
```
### MCP tools: two layers of tracing
If you’re using MCP servers:
- keep the agent-level `execute_tool` span
- also trace the MCP server’s internal spans (HTTP, DB, etc.)
This creates a clean hierarchy: workflow → step → tool → underlying calls.
## Reconciliation: keep DB + traces aligned (without pretending they always are)
If you do nothing, you’ll eventually see:
- DB says `RUNNING` but the trace clearly ended
- the trace shows errors but DB says `SUCCEEDED`
- no spans exported at all due to a collector outage
The fix is a small reconciler that runs periodically.
### Reconciliation principles
- **“Succeeded” requires an explicit DB terminal state.**
  - A trace ending is not proof of success.
- **“Failed” requires classification.**
  - Distinguish `timeout` vs `rate_limit` vs `validation_failed` vs `tool_side_effect`.
- **“Unknown” is valid.**
  - If your worker crashed mid-run, mark it `UNKNOWN`, notify, and require human action.
- **Heartbeat beats trace freshness.**
  - DB heartbeat is the best “liveness” signal.
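The classification step can start as a small function that is honest about its limits; the match rules below are illustrative and should be tuned to your actual tool clients:

```typescript
// Hypothetical failure classifier for the reconciler: turn raw errors into
// the coarse classes that drive retry policy.
type FailureClass =
  | "timeout"
  | "rate_limit"
  | "validation_failed"
  | "tool_side_effect"
  | "unknown";

export function classifyFailure(err: {
  name?: string;
  message?: string;
  status?: number;
}): FailureClass {
  const msg = (err.message ?? "").toLowerCase();
  if (err.status === 429 || msg.includes("rate limit")) return "rate_limit";
  if (err.name === "TimeoutError" || msg.includes("timed out")) return "timeout";
  if (msg.includes("validation")) return "validation_failed";
  // Partial writes: the call may have mutated external state before failing.
  if (msg.includes("partial write")) return "tool_side_effect";
  return "unknown";
}
```

Only `rate_limit` and `timeout` are generally safe to auto-retry; `tool_side_effect` and `unknown` should route to a human.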
### Example: reconciler pseudocode

```python
from datetime import datetime, timedelta

RUNNING_TIMEOUT = timedelta(minutes=15)

def reconcile_runs(now: datetime):
    # 1) Mark stuck runs
    running = db.query("""
        select id, last_heartbeat_at
        from workflow_runs
        where status in ('RUNNING', 'WAITING_FOR_APPROVAL')
    """)
    for r in running:
        if r.last_heartbeat_at is None:
            continue
        if now - r.last_heartbeat_at > RUNNING_TIMEOUT:
            # Don't guess success/failure.
            db.update_run_status(r.id, 'STUCK')

    # 2) Close runs that are terminal in DB but missing ended_at
    db.execute("""
        update workflow_runs
        set ended_at = coalesce(ended_at, now())
        where status in ('SUCCEEDED', 'FAILED', 'CANCELED')
          and ended_at is null
    """)

    # 3) Optional: pull trace summaries asynchronously
    #    - token counts
    #    - cost rollups
    #    - last error span
    #    and write them to workflow_runs for fast UI rendering.
```
### A simple divergence UI that users trust
In your run detail page, render:
- Current state (from the DB)
- Last known trace activity (from the trace backend)
- Trace link (using `trace_id`)

When they disagree, show it explicitly:

> “Run is STUCK (no heartbeat in 18m). Last trace activity: tool `crm.upsert` retrying.”
This is the difference between a “black box agent” and an agency-grade system.
## Cost and latency budgets that actually work
Most teams set a single workflow-level budget (“keep it under $2/run”). That fails because spend concentrates in specific steps.
### Budget per step, not per workflow
Examples:
- “Lead enrichment step must stay under 2,000 output tokens.”
- “Draft-email step must finish under 8 seconds p95.”
- “Tool loop must not exceed 3 retries.”
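A sketch of how those per-step budgets might be enforced in the executor. The step names and limits mirror the examples above but are assumptions; in practice you would load them from per-workflow config:

```typescript
// Illustrative per-step budgets keyed by step name.
type StepBudget = { maxOutputTokens?: number; maxAttempts?: number; maxLatencyMs?: number };

const BUDGETS: Record<string, StepBudget> = {
  lead_enrichment: { maxOutputTokens: 2000 },
  draft_email: { maxLatencyMs: 8000 },
  tool_loop: { maxAttempts: 3 },
};

// Returns the list of violated budget dimensions (empty = within budget).
export function budgetViolations(
  stepName: string,
  usage: { outputTokens: number; attempts: number; latencyMs: number }
): string[] {
  const b = BUDGETS[stepName] ?? {};
  const violations: string[] = [];
  if (b.maxOutputTokens != null && usage.outputTokens > b.maxOutputTokens)
    violations.push("output_tokens");
  if (b.maxAttempts != null && usage.attempts > b.maxAttempts)
    violations.push("attempts");
  if (b.maxLatencyMs != null && usage.latencyMs > b.maxLatencyMs)
    violations.push("latency");
  return violations;
}
```

Whether a violation aborts the run, degrades to a cheaper model, or just alerts is a policy decision per step.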
### How traces enable honest cost attribution
With GenAI semconv you can roll up:
- `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens` per inference span
- grouped by `tenant.id`, `workflow.id`, `workflow.step_name`
Then you can answer:
- which client subsidizes everyone else
- which step is drifting week over week
- what retries cost you
### “Wasted spend” is a metric
Track tokens and dollars consumed on spans that later get retried or invalidated.
A practical approach:
- tag spans with `workflow.retryable=true|false`
- tag attempts with `workflow.attempt=1..n`
- compute wasted spend as spend on superseded attempts or on spans in failed runs
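Given span summaries tagged with `workflow.attempt`, the rollup is straightforward; a sketch (the `SpanSummary` shape is a stand-in for whatever your trace backend exports):

```typescript
// One row per inference/tool span, pre-joined with the run's final outcome.
type SpanSummary = {
  runId: string;
  stepId: string;
  attempt: number; // workflow.attempt
  costUsd: number;
  runFailed: boolean;
};

// Spend is "wasted" if the span belongs to a superseded attempt
// (a later attempt exists for the same step) or to a run that failed.
export function wastedSpendUsd(spans: SpanSummary[]): number {
  const maxAttempt = new Map<string, number>();
  for (const s of spans) {
    const key = `${s.runId}:${s.stepId}`;
    maxAttempt.set(key, Math.max(maxAttempt.get(key) ?? 0, s.attempt));
  }
  let wasted = 0;
  for (const s of spans) {
    const superseded = s.attempt < (maxAttempt.get(`${s.runId}:${s.stepId}`) ?? 0);
    if (superseded || s.runFailed) wasted += s.costUsd;
  }
  return wasted;
}
```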
## Replay and partial reruns: turning traces into a debugging tool
Traces tell you what happened. They do not, by themselves, make replay safe.
To get replayable execution, you need checkpoints.
### What to persist vs. what to reference
Persist (in DB/artifact store):
- validated step inputs
- tool call idempotency keys
- tool call external correlation IDs
- tool outputs that influence downstream steps
Reference (in OTel):
- raw timing, retries, failures
- token usage, cost, latency
### Safe re-execution rule
A step is replayable if:
- all tool calls are idempotent (or deduped by idempotency keys)
- or the step is “read-only” (retrieval, summarization)
- or the step is behind an approval gate
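That rule can be encoded as a predicate over step metadata; the `StepMeta` shape below is hypothetical, standing in for however you classify steps in your workflow config:

```typescript
type StepMeta = {
  readOnly: boolean;            // retrieval / summarization only
  behindApprovalGate: boolean;  // a human re-approves before side effects
  toolCalls: { idempotent: boolean; hasIdempotencyKey: boolean }[];
};

export function isReplayable(step: StepMeta): boolean {
  if (step.readOnly) return true;
  if (step.behindApprovalGate) return true;
  // Otherwise every side-effecting call must be idempotent or deduped by key.
  return step.toolCalls.every((t) => t.idempotent || t.hasIdempotencyKey);
}
```

Gate your “retry step” button on this predicate rather than on operator judgment.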
### Approval gates must exist in both DB and traces
In DB:
- a `WAITING_FOR_APPROVAL` status
- who needs to approve
- what is being approved (artifact reference)
In traces:
- a span or event indicating pause
- a span/event indicating resume
That’s how you explain “it sat for 3 hours” without guessing.
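One lightweight way to do this is a pair of span events with a shared naming convention (the event names and attributes here are our own, not GenAI semconv):

```typescript
// Build the pause/resume events attached to the step span. The gap between
// the two event timestamps is your "it sat for 3 hours".
export function approvalEvent(
  phase: "paused" | "resumed",
  approver: string,
  artifactUri: string
) {
  return {
    name: `workflow.approval.${phase}`,
    attributes: {
      "workflow.approver": approver,
      "workflow.artifact_uri": artifactUri,
    },
  };
}
```

Attach with `span.addEvent(e.name, e.attributes)` at pause and again at resume, alongside the DB transition into and out of `WAITING_FOR_APPROVAL`.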
## Agency runbook: what you can promise clients (and what you can’t)
Agencies win when they can confidently answer client questions.
With DB+OTel you can offer realistic operational guarantees:
### What you can promise
- Status accuracy: current run status is DB-authoritative
- Trace-backed incident review: you can show where it failed
- Cost reporting: token usage and spend rollups per client/workflow/step
- Safe retries: replay policy based on idempotency + step classification
### What you should not promise
- perfect determinism (“same input always yields same output”)
- perfect trace completeness (exporters/collectors can fail)
- “we store every prompt forever” (privacy + cost)
Instead, promise:
- “We store enough to debug and replay safely.”
## Implementation checklist (opinionated)
Use this as your “done means done” list.
### State machine
- `WorkflowRun.status` is updated transactionally
- terminal states set `ended_at`
- heartbeat updates at least once per step and during long tool calls
- `UNKNOWN` and `STUCK` are modeled and visible
### Trace shape
- `trace_id` stored in `workflow_runs`
- root span `workflow.run`
- step spans `workflow.step`
- GenAI inference spans include:
  - `gen_ai.operation.name`
  - `gen_ai.provider.name`
  - `gen_ai.request.model`
  - `gen_ai.usage.input_tokens` / `gen_ai.usage.output_tokens` (when available)
- tool spans use `gen_ai.operation.name=execute_tool` and include `gen_ai.tool.name`
### Dimensions you’ll want forever
- `tenant.id`
- `workflow.id`
- `workflow.run_id`
- `workflow.step_name` / `workflow.step_id`
- `workflow.idempotency_key`
### Privacy + cost controls
- prompt capture is opt-in
- arguments/results are redacted or referenced
- sensitive artifacts stored separately with stricter access controls
### Reconciliation + alerting
- scheduled reconciler marks `STUCK`
- alerts on:
  - stuck runs
  - retry spikes
  - sudden token/cost spikes per tenant
  - tool failure rate increases
## Where nNode fits
If you’re building agency workflows on top of Claude skills, MCP servers, and a pile of custom scripts, you can absolutely implement the above yourself.
What nNode is aiming to provide is the product surface around it:
- a workflow engine designed for agent runs that need guardrails
- structured execution primitives (steps, approvals, retries)
- observability that doesn’t become a “two sources of truth” mess as runs get larger
In other words: fewer brittle black boxes, more durable operations.
If you’re an automation agency (or an internal automation team) and you’re tired of explaining agent failures with screenshots and guesses, try building your next workflow with a single DB status contract + OpenTelemetry GenAI semantic conventions from day one.
And if you want a platform that’s purpose-built for running and iterating on these workflows—with guardrails, replay-friendly execution, and the operational model agencies need—you can take a look at nNode at nnode.ai.