Agent Observability in Production: An OpenTelemetry-First Logging + Tracing Blueprint for Tool-Using AI Agents

If you’ve shipped a tool-using agent beyond a toy environment, you’ve seen the pattern:

It passes the demo.
It fails on Tuesday.
Nobody can answer, with confidence:
- What exactly happened?
- Which tool call caused the damage (or the delay)?
- Was it a model mistake, a schema drift, a permissions issue, or an integration outage?
- What did it cost?
- Who approved it (if anyone)?

Classic observability (logs + metrics + traces) already solved this for distributed systems. Agentic systems are distributed systems—with an extra twist:

The “planner” is probabilistic.
The agent can take actions.
“It seemed reasonable” is not an acceptable postmortem.

This guide is a production blueprint for agent observability using OpenTelemetry (OTel) as the spine:

A tracing model that makes tool-call failures debuggable
An audit-log model that makes agent actions defensible
A minimal schema you can actually implement
Cost attribution and reliability KPIs that matter
A practical loop to turn traces into repeatable reliability improvements

At nNode, we’re obsessed with the harness/supervisor layer—the part that makes agents trustworthy (and improvable) rather than just impressive. Observability is where that harness either exists… or it doesn’t.

What “agent observability” must answer (in production)

A good observability system for tool-using agents should answer these questions quickly:

What happened? (timeline of planning → tool calls → approvals → side effects)
Why did it happen? (inputs, intermediate reasoning signals, tool arguments, tool results)
What changed? (model version, prompt/policy version, tool schema version)
What did it cost? (tokens, latency, dollars, downstream vendor costs)
Is it safe? (PII controls, redaction, approval gates, least privilege)
Can we reproduce it? (trace → labeled trajectory → eval test → regression prevention)

The mistake teams make: they treat “LLM tracing” as an app-level add-on.

In production, observability is not “LLM-only.” It’s supervisor-first:

planning
tool selection
tool execution
verification
retries
circuit breakers
approvals

The 3 layers of truth: traces, audit logs, and outcomes

You need three different things, and you should not mash them into one datastore.

Layer	Purpose	Sampling allowed?	PII policy	Who reads it?
Traces	Debugging, latency, dependency mapping	Yes (carefully)	Redact aggressively	Engineers/on-call
Audit log / action ledger	Compliance, non-repudiation, “what changed in the world”	No	Strict, immutable, minimal	Security, compliance, leadership
Outcomes / business KPIs	ROI, reliability, product health	N/A	Usually aggregated	Product, leadership

Why separate traces from audit logs?

Traces are optimized for debug truth: timelines, spans, attributes, sampling.

Audit logs are optimized for compliance truth: append-only, immutable, complete coverage.

If you sample the only record of an irreversible action, you will eventually regret it.
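One way to keep the two layers from collapsing into each other is to make the routing decision explicit. The sketch below is illustrative only: `route_event`, `Routing`, and the 10% default sample rate are assumptions rather than part of any library, and in practice the trace-side decision often lives in the Collector's tail-sampling stage instead of the agent process.

```python
import hashlib
from dataclasses import dataclass

# Side-effect levels that must always reach the audit ledger (never sampled).
ALWAYS_AUDIT = {"write", "irreversible"}


@dataclass(frozen=True)
class Routing:
    to_audit_ledger: bool
    to_trace: bool


def route_event(run_id: str, side_effect_level: str, is_error: bool,
                trace_sample_rate: float = 0.1) -> Routing:
    """Decide where one tool-call event is recorded (hypothetical helper)."""
    # Audit ledger: complete coverage for anything that changes the world.
    to_audit = side_effect_level in ALWAYS_AUDIT

    # Traces: keep every error; sample the rest deterministically by run_id
    # so all spans of a run are kept or dropped together.
    bucket = int(hashlib.sha256(run_id.encode("utf-8")).hexdigest()[:8], 16) / 0xFFFFFFFF
    to_trace = is_error or bucket < trace_sample_rate

    return Routing(to_audit_ledger=to_audit, to_trace=to_trace)
```

The point of the split is that lowering `trace_sample_rate` to save money can never remove the record of a write.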

Reference architecture: OpenTelemetry-first telemetry pipeline

Here’s a practical architecture that scales from “one workflow” to a fleet.

            ┌───────────────────────────┐
            │   Agent Runtime / Harness  │
            │  (planner + tools + verify)│
            └──────────────┬────────────┘
                           │
         ┌─────────────────┴─────────────────┐
         │                                   │
  (1) Traces/metrics/logs               (2) Audit Ledger
   OpenTelemetry SDK                     Append-only store
   OTLP exporter                          (WORM/immutable)
         │                                   │
         ▼                                   ▼
 ┌───────────────────┐                ┌───────────────────┐
 │ OTel Collector     │                │ Audit pipeline     │
 │ - batching         │                │ - hash chain        │
 │ - tail sampling    │                │ - redaction policy  │
 │ - enrichment       │                │ - retention         │
 └─────────┬─────────┘                └─────────┬─────────┘
           │                                    │
           ▼                                    ▼
 ┌───────────────────┐                ┌───────────────────┐
 │ APM backend        │                │ Audit storage       │
 │ (Tempo/Jaeger/etc) │                │ (DB + object store) │
 └───────────────────┘                └───────────────────┘
           │
           ▼
 ┌──────────────────────────────┐
 │ Dashboards + alerts + IR loop │
 └──────────────────────────────┘

Context propagation: your `run_id` is your primary key

Make every tool call and every downstream HTTP request carry:

traceparent (W3C Trace Context)
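baggage (W3C Baggage) containing stable, non-sensitive identifiers (e.g. run_id, tenant_id, workflow_id); baggage travels in request headers, so anything you put in it is visible to downstream services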

This is how you connect:

agent span → tool span → vendor API span (Gmail, Drive, Slack, database, etc.)
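To make the two headers concrete, here is a stdlib-only sketch of what ends up on the wire. It is meant for reading, not for production: the function names are hypothetical, and in a real service you would normally let the OpenTelemetry propagators (for example `opentelemetry.propagate.inject`) write these headers from the active context instead of building them by hand.

```python
import secrets
from urllib.parse import quote


def new_traceparent(sampled: bool = True) -> str:
    """Build a W3C Trace Context header: version-traceid-parentid-flags."""
    trace_id = secrets.token_hex(16)  # 16 bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 bytes  -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def build_baggage(**entries: str) -> str:
    """Build a W3C Baggage header: comma-separated, percent-encoded key=value pairs."""
    return ",".join(
        f"{quote(key, safe='')}={quote(value, safe='')}"
        for key, value in entries.items()
    )


def outbound_headers(run_id: str, tenant_id: str, workflow_id: str) -> dict[str, str]:
    """Headers every tool call and downstream HTTP request would carry."""
    return {
        "traceparent": new_traceparent(),
        "baggage": build_baggage(
            run_id=run_id, tenant_id=tenant_id, workflow_id=workflow_id
        ),
    }
```

Calling `outbound_headers("run_1", "t_9f3d", "invoice_followup_v3")` yields a `baggage` value of `run_id=run_1,tenant_id=t_9f3d,workflow_id=invoice_followup_v3` alongside a fresh `traceparent`.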

What to log: a minimal event schema that actually works

You don’t need 200 fields. You need the right fields.

Below is a minimal schema that supports:

debugging tool-call failures
cost attribution
approval gate analysis
safe replay / eval generation

Required identifiers

run_id: unique per agent run
workflow_id: stable workflow identifier (e.g. invoice_followup_v3)
tenant_id: multi-tenant boundary
actor_user_id: who initiated
actor_is_agent: bool
environment: prod/staging

Tool call fields (the observability heart)

tool_name
tool_args_hash (hash of canonicalized args)
tool_result_hash (hash of canonicalized result)
side_effect_level: none | read | write | irreversible
idempotency_key: stable key to prevent double-writes

Safety + governance fields

approval_state: not_required | requested | approved | rejected | timed_out
approver_id
approval_latency_ms
policy_version

Reliability + cost fields

retry_count
error_class (normalized)
circuit_breaker_tripped
tokens_in, tokens_out
latency_ms
cost_usd

Minimal JSON example

{
  "timestamp": "2026-05-06T06:00:00.000Z",
  "environment": "prod",
  "tenant_id": "t_9f3d",
  "workflow_id": "customer_support_triage_v7",
  "run_id": "run_01J3K6...",
  "actor_user_id": "u_123",
  "actor_is_agent": true,

  "event_type": "tool.call",
  "tool_name": "gmail.send",
  "side_effect_level": "write",
  "idempotency_key": "gmail.send:thread_18b4f5e2:msg_02",

  "tool_args_hash": "sha256:3b4d...",
  "tool_result_hash": "sha256:91ad...",

  "approval_state": "approved",
  "approver_id": "u_901",
  "approval_latency_ms": 18421,

  "retry_count": 1,
  "error_class": null,
  "circuit_breaker_tripped": false,

  "tokens_in": 2231,
  "tokens_out": 412,
  "latency_ms": 3920,
  "cost_usd": 0.0184,

  "model": {
    "provider": "openai",
    "name": "gpt-4.1",
    "version": "2026-04-xx"
  },
  "prompt_version": "support_triage_prompt@2026-05-01",
  "policy_version": "support_policy@2026-05-03"
}
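If you want the schema enforced rather than just documented, a small typed record is usually enough. The following is a sketch under stated assumptions: `ToolCallEvent` is a hypothetical class covering only a subset of the fields above, and the rule that write and irreversible calls must carry an idempotency key is one possible policy, not a requirement of OpenTelemetry.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

SIDE_EFFECT_LEVELS = {"none", "read", "write", "irreversible"}
APPROVAL_STATES = {"not_required", "requested", "approved", "rejected", "timed_out"}


@dataclass
class ToolCallEvent:
    run_id: str
    workflow_id: str
    tenant_id: str
    tool_name: str
    side_effect_level: str
    approval_state: str = "not_required"
    idempotency_key: Optional[str] = None
    retry_count: int = 0
    error_class: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject values outside the enumerations listed in the schema.
        if self.side_effect_level not in SIDE_EFFECT_LEVELS:
            raise ValueError(f"unknown side_effect_level: {self.side_effect_level}")
        if self.approval_state not in APPROVAL_STATES:
            raise ValueError(f"unknown approval_state: {self.approval_state}")
        # Assumed policy: anything that mutates state needs an idempotency key.
        if self.side_effect_level in {"write", "irreversible"} and not self.idempotency_key:
            raise ValueError("write/irreversible calls require an idempotency_key")

    def to_json(self) -> str:
        """Serialize with sorted keys so log lines are stable and diffable."""
        return json.dumps(asdict(self), sort_keys=True)
```

Failing at construction time means a malformed event is caught in the harness, not discovered weeks later in a dashboard query.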

Redaction + the “secure payload vault” pattern (do this early)

The best practice for production agent telemetry is:

Do not dump raw emails, documents, or tool payloads into traces/logs.
Store hashes + pointers in traces.
Keep sensitive payloads in a separate, access-controlled vault.

Pattern

Canonicalize payload (stable JSON, sorted keys)
Compute sha256
Store payload in a vault (encrypted; strict ACL; short retention)
Emit only:
- payload_hash
- payload_ref (opaque ID)
- size metadata

// Pseudocode: canonicalizeJson, sha256, and payloadVault stand in for your own helpers
const canonical = canonicalizeJson(toolArgs) // stable JSON string, sorted keys
const hash = sha256(canonical)
const ref = await payloadVault.put({ hash, canonical, ttlHours: 24 })

span.setAttribute("tool.args.hash", `sha256:${hash}`)
span.setAttribute("tool.args.ref", ref) // opaque, access-controlled
span.setAttribute("tool.args.bytes", Buffer.byteLength(canonical, "utf8")) // bytes, not UTF-16 code units

This one design decision prevents “your observability stack became your biggest data leak.”
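The same pattern in Python, as a minimal sketch. `InMemoryVault` is a stand-in invented for illustration; a real vault would encrypt at rest, enforce ACLs, and expire entries. The canonicalization shown (sorted keys, compact separators) is one reasonable choice, not a standard you must follow.

```python
import hashlib
import json
import uuid


def canonicalize(payload) -> bytes:
    """Stable JSON encoding: sorted keys, no insignificant whitespace, UTF-8."""
    return json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


class InMemoryVault:
    """Illustrative stand-in for an encrypted, access-controlled payload store."""

    def __init__(self) -> None:
        self._store = {}

    def put(self, canonical: bytes) -> str:
        ref = f"vault_{uuid.uuid4().hex}"  # opaque reference, reveals nothing
        self._store[ref] = canonical
        return ref


def telemetry_fields(payload, vault: InMemoryVault) -> dict:
    """Return only what is safe to attach to a span or log line."""
    canonical = canonicalize(payload)
    return {
        "payload_hash": "sha256:" + hashlib.sha256(canonical).hexdigest(),
        "payload_ref": vault.put(canonical),
        "payload_bytes": len(canonical),
    }
```

Because the keys are sorted before hashing, `{"a": 1, "b": 2}` and `{"b": 2, "a": 1}` produce the same `payload_hash`, which is what makes the hash usable for deduplication and replay comparisons.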

Spans that matter: a practical tracing model for tool-using agents

Model your run with a single parent span and a small number of meaningful child spans.

Recommended span tree

agent.run (root)
- agent.plan
- agent.tool.select
- agent.tool.call (repeat)
- agent.verify
- agent.approval.wait (optional)
- agent.finish

Suggested attributes

Use attributes (tags) to make traces searchable:

agent.run_id
agent.workflow_id
agent.tenant_id
agent.step_index
gen_ai.request.model / gen_ai.system (if you follow the emerging OpenTelemetry GenAI semantic conventions, which use the gen_ai.* namespace and are still evolving)
tool.name
tool.side_effect_level
tool.args.hash
tool.result.hash
approval.state

Even if you adopt GenAI semantic conventions, keep your own stable “agent/tool” namespace too—because you’ll want invariants that outlive changing conventions.

Python implementation: instrument an agent run with OpenTelemetry

This example uses the OpenTelemetry Python SDK with an OTLP exporter.

Goal: generate a trace that shows planning + each tool call + verification, with cost and safety metadata.

Install

pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp

Setup tracing

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({
    "service.name": "agent-harness",
    "service.version": "1.7.0",
    "deployment.environment": "prod",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent.harness")

Emit spans for plan → tool → verify

import hashlib
import time
import uuid

def sha256_hex(s: str) -> str:
    """Return the hex SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

class ToolError(Exception):
    pass

def run_agent(workflow_id: str, tenant_id: str, actor_user_id: str):
    run_id = f"run_{uuid.uuid4().hex}"

    with tracer.start_as_current_span("agent.run") as run_span:
        run_span.set_attribute("agent.run_id", run_id)
        run_span.set_attribute("agent.workflow_id", workflow_id)
        run_span.set_attribute("agent.tenant_id", tenant_id)
        run_span.set_attribute("actor.user_id", actor_user_id)
        run_span.set_attribute("actor.is_agent", True)

        # 1) Plan
        with tracer.start_as_current_span("agent.plan") as plan_span:
            t0 = time.perf_counter()  # monotonic clock; time.time() can jump on clock adjustments
            plan = "Search recent emails, draft reply, request approval"
            # Don’t log full prompt/plan if sensitive; log hashes + versions
            plan_span.set_attribute("agent.plan.hash", f"sha256:{sha256_hex(plan)}")
            plan_span.set_attribute("agent.prompt_version", "support_triage_prompt@2026-05-01")
            plan_span.set_attribute("latency_ms", int((time.perf_counter() - t0) * 1000))

        # 2) Tool call
        tool_name = "gmail.search"
        tool_args = '{"q":"from:vip newer_than:7d","maxResults":10}'

        with tracer.start_as_current_span("agent.tool.call") as tool_span:
            tool_span.set_attribute("tool.name", tool_name)
            tool_span.set_attribute("tool.side_effect_level", "read")
            tool_span.set_attribute("tool.args.hash", f"sha256:{sha256_hex(tool_args)}")
            tool_span.set_attribute("agent.step_index", 1)

            try:
                # call_tool(...) is your integration wrapper
                # result = call_tool(tool_name, tool_args)
                result = '{"threads":[{"id":"18b4f5e2a8c"}]}'
                tool_span.set_attribute("tool.result.hash", f"sha256:{sha256_hex(result)}")
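            except ToolError as e:
                tool_span.record_exception(e)
                # Mark the span as failed so error-based sampling and alerts can find it
                tool_span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
                tool_span.set_attribute("error_class", "tool_error")
                raise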

        # 3) Approval wait (if needed)
        with tracer.start_as_current_span("agent.approval.wait") as approval_span:
            approval_span.set_attribute("approval.state", "approved")
            approval_span.set_attribute("approval.approver_id", "u_901")
            approval_span.set_attribute("approval.latency_ms", 18421)

        # 4) Verify
        with tracer.start_as_current_span("agent.verify") as verify_span:
            verify_span.set_attribute("verify.strategy", "postcondition_check")
            verify_span.set_attribute("verify.passed", True)

        run_span.set_attribute("agent.outcome", "success")
        return run_id

What this gives you immediately

A per-run timeline with the exact tool calls involved
Searchable attributes for workflow_id, tool.name, approval.state
A place to add cost and token attribution

JavaScript/TypeScript: capture trace context through HTTP tool wrappers

If your tools are ultimately HTTP calls, instrument the wrapper once and every tool gets trace coverage.

import { createHash } from "node:crypto";
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("agent.harness");

type ToolCall = {
  runId: string;
  workflowId: string;
  toolName: string;
  sideEffectLevel: "none" | "read" | "write" | "irreversible";
  args: unknown;
};

// Placeholder: your own router that dispatches to vendor SDKs / HTTP clients.
declare function routeToolCall<T>(toolName: string, args: unknown): Promise<T>;

export async function callTool<T>(call: ToolCall): Promise<T> {
  return await tracer.startActiveSpan("agent.tool.call", async (span) => {
    span.setAttribute("agent.run_id", call.runId);
    span.setAttribute("agent.workflow_id", call.workflowId);
    span.setAttribute("tool.name", call.toolName);
    span.setAttribute("tool.side_effect_level", call.sideEffectLevel);

    // Hash args (store raw elsewhere if needed).
    // JSON.stringify does not sort keys; canonicalize first if hashes must be stable.
    const argsJson = JSON.stringify(call.args) ?? "null";
    const argsHash = createHash("sha256").update(argsJson).digest("hex");
    span.setAttribute("tool.args.hash", `sha256:${argsHash}`);
    span.setAttribute("tool.args.bytes", Buffer.byteLength(argsJson, "utf8"));

    try {
      // Your tool router might call vendor APIs here.
      // Make sure fetch/axios is instrumented so downstream spans join this trace.
      return await routeToolCall<T>(call.toolName, call.args);
    } catch (err) {
      const error = err instanceof Error ? err : new Error(String(err));
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.setAttribute("error_class", "tool_error");
      throw err;
    } finally {
      span.end();
    }
  });
}

Key point: propagate OTel context so the tool wrapper span is the parent of the actual HTTP span.

Don’t forget metrics: the dashboards operators and execs both use

Traces tell you why something failed. Metrics tell you how often and how bad it is.

Minimum viable metrics for tool-using agents:

Reliability metrics

agent_runs_total{workflow_id, outcome}
agent_tool_calls_total{tool_name, outcome}
agent_step_retries_total{workflow_id, tool_name}
agent_verification_fail_total{workflow_id}
approval_requests_total{workflow_id, state}
approval_queue_age_seconds (gauge/histogram)

Cost metrics

agent_tokens_total{workflow_id, model}
agent_cost_usd_total{workflow_id, model}
cost_per_success_usd{workflow_id} (derived)

Latency metrics

agent_run_latency_ms{workflow_id}
agent_tool_latency_ms{tool_name}

If you only build one exec-facing chart, make it:

Cost per successful outcome (by workflow), over time.

It keeps everyone honest.
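The derived metric is simple arithmetic, but the numerator is easy to get wrong: failed runs still cost money, so their spend belongs in the total. A minimal sketch, assuming run records shaped like the event schema earlier in this guide; the price table and the model name `example-model` are invented placeholders, not real vendor pricing.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens. Replace with your vendor's rates.
PRICE_PER_1K = {"example-model": {"in": 0.002, "out": 0.008}}


def run_cost_usd(model: str, tokens_in: int, tokens_out: int) -> float:
    """Convert token counts into dollars using the (assumed) price table."""
    price = PRICE_PER_1K[model]
    return tokens_in / 1000 * price["in"] + tokens_out / 1000 * price["out"]


def cost_per_success(runs: list) -> dict:
    """Total spend per workflow (failures included) divided by successful runs."""
    total_cost = defaultdict(float)
    successes = defaultdict(int)
    for run in runs:
        total_cost[run["workflow_id"]] += run["cost_usd"]
        if run["outcome"] == "success":
            successes[run["workflow_id"]] += 1
    # None signals "no successes yet" instead of dividing by zero.
    return {
        wf: (total_cost[wf] / successes[wf] if successes[wf] else None)
        for wf in total_cost
    }
```

A workflow whose per-run cost looks flat can still show a rising cost per success if its failure rate is climbing, which is exactly the signal the chart is meant to surface.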

Failure modes → instrumentation you need to detect them

Most “agent failures” are not mystical. They cluster.

1) Wrong tool / wrong scope (permissions)

Signals to capture:

tool.name
permission errors normalized (error_class=authz_denied)
actor_user_id vs service account identity
tool.side_effect_level mismatches (agent attempted write when only read allowed)
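Normalizing errors is mostly a mapping exercise. The sketch below shows one possible shape: apart from `authz_denied`, the class names are assumptions made for illustration, and the HTTP-status mapping should be adjusted to how your vendors actually report failures.

```python
from typing import Optional


def classify_error(http_status: Optional[int] = None,
                   exc: Optional[BaseException] = None) -> str:
    """Map a raw failure onto a small, stable error_class taxonomy."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    if http_status == 401:
        return "authn_failed"      # credentials missing or expired
    if http_status == 403:
        return "authz_denied"      # authenticated, but not allowed
    if http_status in (400, 422):
        return "invalid_args"      # often a symptom of schema drift
    if http_status == 429:
        return "rate_limited"
    if http_status is not None and 500 <= http_status < 600:
        return "vendor_outage"
    return "unknown"
```

Keeping the taxonomy this small is deliberate: a handful of classes can be charted and alerted on, while raw vendor error strings cannot.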

2) Right tool / wrong args (schema drift)

Signals:

tool.args.hash
tool.schema_version
structured error code from the tool adapter

3) Partial completion (idempotency gaps)

Signals:

idempotency_key
tool.result.hash
“completed steps” list for the run

If a run retries after a timeout and you don’t have idempotency keys, you’ll ship duplicate emails, duplicate invoices, or duplicate tickets.
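A sketch of the guard, assuming keys follow the `tool:target:step` shape used in the JSON example earlier. `IdempotentExecutor` is hypothetical and keeps results in a process-local dict; a production version would need a durable store with an atomic insert, and should pass the key through to the vendor API where one is supported.

```python
from typing import Any, Callable


def idempotency_key(tool_name: str, target: str, logical_step: str) -> str:
    """Derive a key from what the action is, not from the generated arguments.

    Full args are deliberately excluded: a retried step may regenerate slightly
    different text, which would change an args-based key and defeat the guard.
    """
    return f"{tool_name}:{target}:{logical_step}"


class IdempotentExecutor:
    """Runs each keyed action at most once and replays the stored result."""

    def __init__(self) -> None:
        self._results = {}

    def execute(self, key: str, action: Callable[[], Any]) -> Any:
        if key in self._results:
            return self._results[key]  # retry: return the earlier result, no side effect
        result = action()
        self._results[key] = result
        return result
```

With this in place, a retry after a timeout looks up the earlier result instead of sending the email a second time.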

4) Infinite loops / cascades

Signals:

agent.step_count
budget fields: max_steps, max_cost_usd, max_latency_ms
circuit_breaker_tripped=true
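The budget fields above translate into a very small circuit breaker. This is a sketch with invented names (`RunBudget`, `BudgetExceeded`); the harness is assumed to call `charge` once per step and to record `circuit_breaker_tripped=true` when the exception fires.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a run crosses one of its limits; reason names the limit."""

    def __init__(self, reason: str) -> None:
        super().__init__(reason)
        self.reason = reason


@dataclass
class RunBudget:
    max_steps: int
    max_cost_usd: float
    max_latency_ms: float
    step_count: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, step_cost_usd: float) -> None:
        """Account for one completed step and trip if any budget is exceeded."""
        self.step_count += 1
        self.cost_usd += step_cost_usd
        elapsed_ms = (time.monotonic() - self.started_at) * 1000
        if self.step_count > self.max_steps:
            raise BudgetExceeded("max_steps")
        if self.cost_usd > self.max_cost_usd:
            raise BudgetExceeded("max_cost_usd")
        if elapsed_ms > self.max_latency_ms:
            raise BudgetExceeded("max_latency_ms")
```

Catching `BudgetExceeded` in the run loop gives you one place to end the span with an error status and emit the breaker attribute, so loops show up as a countable event rather than a surprise bill.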

5) Approval fatigue vs risk leaks

Signals:

approval.state
approval.latency_ms
approval rate by workflow step

You want approvals where the side-effect level is high—not everywhere.
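Expressed as policy code, that rule is a threshold over the side-effect levels. A minimal sketch: the ranking and the default threshold of `write` are assumptions you would tune per workflow.

```python
# Order the side-effect levels from harmless to dangerous.
LEVEL_RANK = {"none": 0, "read": 1, "write": 2, "irreversible": 3}


def requires_approval(side_effect_level: str, threshold: str = "write") -> bool:
    """Gate only the steps at or above the configured risk threshold."""
    return LEVEL_RANK[side_effect_level] >= LEVEL_RANK[threshold]
```

Raising the threshold to `irreversible` for a well-tested workflow reduces approval fatigue, and the approval rate by step tells you whether that change was safe.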

Audit log blueprint: make agent actions defensible

A good audit log entry is:

minimal
immutable
attributable
tamper-evident

Suggested audit log record

{
  "timestamp": "2026-05-06T06:00:08.000Z",
  "tenant_id": "t_9f3d",
  "workflow_id": "customer_support_triage_v7",
  "run_id": "run_01J3K6...",

  "action": {
    "type": "tool.call",
    "tool_name": "gmail.send",
    "side_effect_level": "write",
    "target": "thread_18b4f5e2a8c"
  },

  "approval": {
    "required": true,
    "state": "approved",
    "approver_id": "u_901"
  },

  "payload": {
    "args_hash": "sha256:3b4d...",
    "result_hash": "sha256:91ad...",
    "payload_ref": "vault_7Qw..."
  },

  "integrity": {
    "prev_hash": "sha256:...",
    "this_hash": "sha256:..."
  }
}

That prev_hash → this_hash chaining gives you a tamper-evident ledger without requiring a blockchain.
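Here is a minimal sketch of that chaining, assuming the ledger is an ordered list of entries. The function names and the all-zero genesis value are illustrative choices; a real ledger would also persist entries to append-only storage and anchor the latest hash somewhere the writer cannot modify.

```python
import hashlib
import json

# Assumed starting value for the first entry's prev_hash.
GENESIS = "sha256:" + "0" * 64


def _digest(prev_hash: str, record: dict) -> str:
    """Hash the previous hash together with the canonical form of the record."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(ledger: list, record: dict) -> dict:
    """Append a record, linking it to the hash of the entry before it."""
    prev_hash = ledger[-1]["integrity"]["this_hash"] if ledger else GENESIS
    entry = {
        "record": record,
        "integrity": {"prev_hash": prev_hash, "this_hash": _digest(prev_hash, record)},
    }
    ledger.append(entry)
    return entry


def verify_chain(ledger: list) -> bool:
    """Recompute every link; any edited, removed, or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in ledger:
        integrity = entry["integrity"]
        if integrity["prev_hash"] != prev_hash:
            return False
        if integrity["this_hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = integrity["this_hash"]
    return True
```

Note that this detects tampering but does not prevent it: someone who can rewrite the whole ledger can also recompute every hash, which is why the head of the chain should be stored or published separately.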

Turning traces into an improvement loop (the part most teams skip)

Observability isn’t the goal. Behavior-driven improvement is.

Here’s the loop that works:

Observe: collect traces + audit entries for every run
Label: tag failures by class (permissions, bad args, hallucinated entity, timeout, partial completion)
Extract trajectories: for each incident, store a “trajectory bundle” (plan hash, tool sequence, tool hashes, policy versions)
Build evals: turn real failed trajectories into a regression suite
Tighten harness policies: fix with guardrails, verifiers, schema constraints—not just prompt edits
Redeploy + watch: confirm failure rate drops and cost per success improves

This is exactly where a supervisor/harness layer becomes a moat. Without it, you’re stuck in an endless cycle of:

“try a new prompt” → “hope it works” → “repeat.”
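The "extract trajectories" step can start very small. The sketch below assumes spans have been exported as plain dicts with `name` and `attributes` keys, which is a simplification invented for this example rather than the OTLP format. It builds a bundle per failed run and deduplicates by signature so the regression set does not fill up with copies of the same failure.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TrajectoryBundle:
    run_id: str
    failure_class: str
    plan_hash: str
    tool_sequence: tuple
    policy_version: str

    def signature(self) -> str:
        """Identity of the failure pattern; run_id is deliberately excluded."""
        body = json.dumps(
            [self.failure_class, self.plan_hash, list(self.tool_sequence), self.policy_version]
        )
        return hashlib.sha256(body.encode("utf-8")).hexdigest()


def bundle_from_spans(run_id: str, failure_class: str, spans: list) -> TrajectoryBundle:
    """Reduce one run's spans to the fields needed to replay it as an eval."""
    plan_hash = next(
        (s["attributes"].get("agent.plan.hash", "") for s in spans if s["name"] == "agent.plan"), ""
    )
    policy = next(
        (s["attributes"].get("policy_version", "") for s in spans if s["name"] == "agent.run"), ""
    )
    tool_spans = sorted(
        (s for s in spans if s["name"] == "agent.tool.call"),
        key=lambda s: s["attributes"]["agent.step_index"],
    )
    tools = tuple(s["attributes"]["tool.name"] for s in tool_spans)
    return TrajectoryBundle(run_id, failure_class, plan_hash, tools, policy)


def dedupe(bundles: list) -> list:
    """Keep the first bundle seen for each distinct failure signature."""
    unique = {}
    for bundle in bundles:
        unique.setdefault(bundle.signature(), bundle)
    return list(unique.values())
```

Each surviving bundle becomes one regression case: replay the same inputs against the new harness version and assert that the labeled failure class no longer occurs.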

Implementation checklist (copy/paste)

Day 0 (today): get debuggable runs

Generate a stable run_id at the start of every agent run
Emit an agent.run root span with workflow_id, tenant_id, versions
Emit an agent.tool.call span for every tool call
Capture tool.name, side_effect_level, tool_args_hash, tool_result_hash
Capture tokens_in/out, latency_ms, and cost_usd at run + step levels
Add redaction + payload vault (hashes in traces, raw in vault)

Day 7: make autonomy safe

Add approval telemetry (approval.state, latency, approver id)
Add idempotency keys for write/irreversible actions
Add circuit breakers (max steps, max cost, max time)
Normalize errors into a small error_class taxonomy
Add SLOs: success rate and cost-per-success per workflow

Day 30: turn production into training data (without the chaos)

Auto-extract “failed trajectories” into a weekly regression eval set
Add verifier spans and track verifier pass/fail rates
Build dashboards that map tool errors by vendor endpoint
Add policy diffing: correlate failure spikes with policy/model/schema versions

Where nNode fits in this picture

Most teams instrument the model and call it a day.

But the real production failures happen at the seams:

tool selection vs tool execution
retries vs idempotency
approvals vs velocity
verification vs silent corruption

nNode is built around the harness/supervisor layer: the part that makes agentic workflows observable, controllable, and steadily improvable. If you’re serious about shipping tool-using agents that operators can trust (and on-call can debug), that layer is where you win.

If you want to see what this looks like in a real workflow engine—where traces, auditability, approvals, and reliability loops are first-class—take a look at nnode.ai.

If you’re building agents in production and keep asking “what actually happened?”, nNode is worth a conversation.
