
Agent Observability in Production: An OpenTelemetry-First Logging + Tracing Blueprint for Tool-Using AI Agents

nNode · 13 min read

If you’ve shipped a tool-using agent beyond a toy environment, you’ve seen the pattern:

  • It passes the demo.
  • It fails on Tuesday.
  • Nobody can answer, with confidence:
    • What exactly happened?
    • Which tool call caused the damage (or the delay)?
    • Was it a model mistake, a schema drift, a permissions issue, or an integration outage?
    • What did it cost?
    • Who approved it (if anyone)?

Classic observability (logs + metrics + traces) already solved this for distributed systems. Agentic systems are distributed systems—with an extra twist:

  • The “planner” is probabilistic.
  • The agent can take actions.
  • “It seemed reasonable” is not an acceptable postmortem.

This guide is a production blueprint for agent observability using OpenTelemetry (OTel) as the spine:

  • A tracing model that makes tool-call failures debuggable
  • An audit-log model that makes agent actions defensible
  • A minimal schema you can actually implement
  • Cost attribution and reliability KPIs that matter
  • A practical loop to turn traces into repeatable reliability improvements

At nNode, we’re obsessed with the harness/supervisor layer—the part that makes agents trustworthy (and improvable) rather than just impressive. Observability is where that harness either exists… or it doesn’t.


What “agent observability” must answer (in production)

A good observability system for tool-using agents should answer these questions quickly:

  1. What happened? (timeline of planning → tool calls → approvals → side effects)
  2. Why did it happen? (inputs, intermediate reasoning signals, tool arguments, tool results)
  3. What changed? (model version, prompt/policy version, tool schema version)
  4. What did it cost? (tokens, latency, dollars, downstream vendor costs)
  5. Is it safe? (PII controls, redaction, approval gates, least privilege)
  6. Can we reproduce it? (trace → labeled trajectory → eval test → regression prevention)

The mistake teams make: they treat “LLM tracing” as an app-level add-on.

In production, observability is not “LLM-only.” It’s supervisor-first, covering every stage the harness controls:

  • planning
  • tool selection
  • tool execution
  • verification
  • retries
  • circuit breakers
  • approvals

The 3 layers of truth: traces, audit logs, and outcomes

You need three different things, and you should not mash them into one datastore.

Layer                     | Purpose                                                    | Sampling allowed? | PII policy                 | Who reads it?
Traces                    | Debugging, latency, dependency mapping                     | Yes (carefully)   | Redact aggressively        | Engineers/on-call
Audit log / action ledger | Compliance, non-repudiation, “what changed in the world”   | No                | Strict, immutable, minimal | Security, compliance, leadership
Outcomes / business KPIs  | ROI, reliability, product health                           | N/A               | Usually aggregated         | Product, leadership

Why separate traces from audit logs?

Traces are optimized for debug truth: timelines, spans, attributes, sampling.

Audit logs are optimized for compliance truth: append-only, immutable, complete coverage.

If you sample the only record of an irreversible action, you will eventually regret it.


Reference architecture: OpenTelemetry-first telemetry pipeline

Here’s a practical architecture that scales from “one workflow” to a fleet.

            ┌───────────────────────────┐
            │   Agent Runtime / Harness  │
            │  (planner + tools + verify)│
            └──────────────┬────────────┘
                           │
         ┌─────────────────┴─────────────────┐
         │                                   │
  (1) Traces/metrics/logs               (2) Audit Ledger
   OpenTelemetry SDK                     Append-only store
   OTLP exporter                          (WORM/immutable)
         │                                   │
         ▼                                   ▼
 ┌───────────────────┐                ┌───────────────────┐
 │ OTel Collector     │                │ Audit pipeline     │
 │ - batching         │                │ - hash chain        │
 │ - tail sampling    │                │ - redaction policy  │
 │ - enrichment       │                │ - retention         │
 └─────────┬─────────┘                └─────────┬─────────┘
           │                                    │
           ▼                                    ▼
 ┌───────────────────┐                ┌───────────────────┐
 │ APM backend        │                │ Audit storage       │
 │ (Tempo/Jaeger/etc) │                │ (DB + object store) │
 └───────────────────┘                └───────────────────┘
           │
           ▼
 ┌──────────────────────────────┐
 │ Dashboards + alerts + IR loop │
 └──────────────────────────────┘

Context propagation: your run_id is your primary key

Make every tool call and every downstream HTTP request carry:

  • traceparent (W3C Trace Context)
  • baggage containing stable identifiers (e.g. run_id, tenant_id, workflow_id)

This is how you connect:

  • agent span → tool span → vendor API span (Gmail, Drive, Slack, database, etc.)
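To make the two headers concrete, here is a minimal sketch that builds them by hand with the Python standard library. The function name build_trace_headers is hypothetical. In a real service you would normally let the OpenTelemetry propagators (for example opentelemetry.propagate.inject) write these headers from the active context rather than formatting them yourself.

```python
import secrets
from urllib.parse import quote


def build_trace_headers(run_id: str, tenant_id: str, workflow_id: str) -> dict[str, str]:
    """Build W3C Trace Context and Baggage headers by hand (illustration only)."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    # Layout: version - trace-id - parent-id - flags (01 means "sampled").
    traceparent = f"00-{trace_id}-{span_id}-01"
    # Baggage is a comma-separated list of key=value pairs with percent-encoded values.
    baggage = ",".join(
        f"{key}={quote(value, safe='')}"
        for key, value in (
            ("run_id", run_id),
            ("tenant_id", tenant_id),
            ("workflow_id", workflow_id),
        )
    )
    return {"traceparent": traceparent, "baggage": baggage}


headers = build_trace_headers("run_01", "t_9f3d", "invoice_followup_v3")
```

Baggage travels to every downstream service, so it should carry only stable, non-sensitive identifiers like the ones listed above.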

What to log: a minimal event schema that actually works

You don’t need 200 fields. You need the right fields.

Below is a minimal schema that supports:

  • debugging tool-call failures
  • cost attribution
  • approval gate analysis
  • safe replay / eval generation

Required identifiers

  • run_id: unique per agent run
  • workflow_id: stable workflow identifier (e.g. invoice_followup_v3)
  • tenant_id: multi-tenant boundary
  • actor_user_id: who initiated
  • actor_is_agent: bool
  • environment: prod/staging

Tool call fields (the observability heart)

  • tool_name
  • tool_args_hash (hash of canonicalized args)
  • tool_result_hash (hash of canonicalized result)
  • side_effect_level: none | read | write | irreversible
  • idempotency_key: stable key to prevent double-writes

Safety + governance fields

  • approval_state: not_required | requested | approved | rejected | timed_out
  • approver_id
  • approval_latency_ms
  • policy_version

Reliability + cost fields

  • retry_count
  • error_class (normalized)
  • circuit_breaker_tripped
  • tokens_in, tokens_out
  • latency_ms
  • cost_usd

Minimal JSON example

{
  "timestamp": "2026-05-06T06:00:00.000Z",
  "environment": "prod",
  "tenant_id": "t_9f3d",
  "workflow_id": "customer_support_triage_v7",
  "run_id": "run_01J3K6...",
  "actor_user_id": "u_123",
  "actor_is_agent": true,

  "event_type": "tool.call",
  "tool_name": "gmail.send",
  "side_effect_level": "write",
  "idempotency_key": "gmail.send:thread_18b4f5e2:msg_02",

  "tool_args_hash": "sha256:3b4d...",
  "tool_result_hash": "sha256:91ad...",

  "approval_state": "approved",
  "approver_id": "u_901",
  "approval_latency_ms": 18421,

  "retry_count": 1,
  "error_class": null,
  "circuit_breaker_tripped": false,

  "tokens_in": 2231,
  "tokens_out": 412,
  "latency_ms": 3920,
  "cost_usd": 0.0184,

  "model": {
    "provider": "openai",
    "name": "gpt-4.1",
    "version": "2026-04-xx"
  },
  "prompt_version": "support_triage_prompt@2026-05-01",
  "policy_version": "support_policy@2026-05-03"
}
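One way to keep events like this consistent is to validate them at the point of emission. The sketch below is an assumption about how that could look, not a required design: the ToolCallEvent class and its rule that write/irreversible calls must carry an idempotency key are illustrative, and only a subset of the fields is modeled.

```python
from dataclasses import asdict, dataclass
from typing import Optional

SIDE_EFFECT_LEVELS = {"none", "read", "write", "irreversible"}
APPROVAL_STATES = {"not_required", "requested", "approved", "rejected", "timed_out"}


@dataclass(frozen=True)
class ToolCallEvent:
    run_id: str
    workflow_id: str
    tenant_id: str
    actor_user_id: str
    actor_is_agent: bool
    environment: str
    tool_name: str
    side_effect_level: str
    approval_state: str = "not_required"
    idempotency_key: Optional[str] = None
    retry_count: int = 0
    error_class: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject values outside the enumerations so dashboards can rely on them.
        if self.side_effect_level not in SIDE_EFFECT_LEVELS:
            raise ValueError(f"unknown side_effect_level: {self.side_effect_level}")
        if self.approval_state not in APPROVAL_STATES:
            raise ValueError(f"unknown approval_state: {self.approval_state}")
        # Assumed rule: anything that changes the world must be safely retryable.
        if self.side_effect_level in {"write", "irreversible"} and not self.idempotency_key:
            raise ValueError("write/irreversible tool calls need an idempotency_key")


event = ToolCallEvent(
    run_id="run_01", workflow_id="customer_support_triage_v7", tenant_id="t_9f3d",
    actor_user_id="u_123", actor_is_agent=True, environment="prod",
    tool_name="gmail.send", side_effect_level="write",
    approval_state="approved", idempotency_key="gmail.send:thread_18b4f5e2:msg_02",
)
record = asdict(event)  # plain dict, ready for json.dumps
```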

Redaction + the “secure payload vault” pattern (do this early)

The best practice for production agent telemetry is:

  • Do not dump raw emails, documents, or tool payloads into traces/logs.
  • Store hashes + pointers in traces.
  • Keep sensitive payloads in a separate, access-controlled vault.

Pattern

  1. Canonicalize payload (stable JSON, sorted keys)
  2. Compute sha256
  3. Store payload in a vault (encrypted; strict ACL; short retention)
  4. Emit only:
    • payload_hash
    • payload_ref (opaque ID)
    • size metadata
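
// Sketch: canonicalizeJson, sha256, and payloadVault are your own helpers
const canonical = canonicalizeJson(toolArgs) // stable JSON with sorted keys
const hash = sha256(canonical)               // hex digest of the canonical form
const ref = await payloadVault.put({ hash, canonical, ttlHours: 24 })

span.setAttribute("tool.args.hash", `sha256:${hash}`)
span.setAttribute("tool.args.ref", ref) // opaque, access-controlled
// Use the UTF-8 byte length; canonical.length counts UTF-16 code units, not bytes
span.setAttribute("tool.args.bytes", Buffer.byteLength(canonical, "utf8"))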

This one design decision prevents “your observability stack became your biggest data leak.”
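The same pattern can be sketched in Python with only the standard library. InMemoryPayloadVault is a hypothetical stand-in for an encrypted, access-controlled store, and the field names follow the list above; treat it as an illustration of the flow, not a production vault.

```python
import hashlib
import json
import uuid


class InMemoryPayloadVault:
    """Hypothetical stand-in for an encrypted store with strict ACLs and a TTL."""

    def __init__(self) -> None:
        self._items: dict[str, str] = {}

    def put(self, canonical: str) -> str:
        ref = f"vault_{uuid.uuid4().hex}"  # opaque ID, reveals nothing about content
        self._items[ref] = canonical
        return ref


def canonicalize_json(payload: object) -> str:
    """Serialize with sorted keys and no extra whitespace so equal payloads hash equally."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def telemetry_fields(payload: object, vault: InMemoryPayloadVault) -> dict[str, object]:
    """Store the raw payload in the vault and return only what is safe to emit."""
    canonical = canonicalize_json(payload)
    encoded = canonical.encode("utf-8")
    return {
        "payload_hash": "sha256:" + hashlib.sha256(encoded).hexdigest(),
        "payload_ref": vault.put(canonical),
        "payload_bytes": len(encoded),
    }


vault = InMemoryPayloadVault()
fields_a = telemetry_fields({"q": "from:vip", "maxResults": 10}, vault)
fields_b = telemetry_fields({"maxResults": 10, "q": "from:vip"}, vault)
```

Because the keys are sorted before hashing, fields_a and fields_b carry the same payload_hash even though the dictionaries were written in a different order, which is what makes the hash usable for deduplication and replay checks.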


Spans that matter: a practical tracing model for tool-using agents

Model your run with a single parent span and a small number of meaningful child spans.

Recommended span tree

  • agent.run (root)
    • agent.plan
    • agent.tool.select
    • agent.tool.call (repeat)
    • agent.verify
    • agent.approval.wait (optional)
    • agent.finish

Suggested attributes

Use attributes (tags) to make traces searchable:

  • agent.run_id
  • agent.workflow_id
  • agent.tenant_id
  • agent.step_index
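  • gen_ai.request.model / gen_ai.system (if you follow the emerging OpenTelemetry GenAI semantic conventions, which use the gen_ai.* namespace and are still changing between releases)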
  • tool.name
  • tool.side_effect_level
  • tool.args.hash
  • tool.result.hash
  • approval.state

Even if you adopt GenAI semantic conventions, keep your own stable “agent/tool” namespace too—because you’ll want invariants that outlive changing conventions.


Python implementation: instrument an agent run with OpenTelemetry

This example uses the OpenTelemetry Python SDK with an OTLP exporter.

Goal: generate a trace that shows planning + each tool call + verification, with cost and safety metadata.

Install

pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-otlp

Setup tracing

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

resource = Resource.create({
    "service.name": "agent-harness",
    "service.version": "1.7.0",
    "deployment.environment": "prod",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("agent.harness")

Emit spans for plan → tool → verify

import hashlib
import time
import uuid

def sha256_hex(s: str) -> str:
    """Return the hex SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(s.encode("utf-8")).hexdigest()

class ToolError(Exception):
    pass

def run_agent(workflow_id: str, tenant_id: str, actor_user_id: str):
    run_id = f"run_{uuid.uuid4().hex}"

    with tracer.start_as_current_span("agent.run") as run_span:
        run_span.set_attribute("agent.run_id", run_id)
        run_span.set_attribute("agent.workflow_id", workflow_id)
        run_span.set_attribute("agent.tenant_id", tenant_id)
        run_span.set_attribute("actor.user_id", actor_user_id)
        run_span.set_attribute("actor.is_agent", True)

        # 1) Plan
        with tracer.start_as_current_span("agent.plan") as plan_span:
            t0 = time.perf_counter()  # monotonic clock; wall-clock time can jump
            plan = "Search recent emails, draft reply, request approval"
            # Don’t log full prompt/plan if sensitive; log hashes + versions
            plan_span.set_attribute("agent.plan.hash", f"sha256:{sha256_hex(plan)}")
            plan_span.set_attribute("agent.prompt_version", "support_triage_prompt@2026-05-01")
            plan_span.set_attribute("latency_ms", int((time.perf_counter() - t0) * 1000))

        # 2) Tool call
        tool_name = "gmail.search"
        tool_args = '{"q":"from:vip newer_than:7d","maxResults":10}'

        with tracer.start_as_current_span("agent.tool.call") as tool_span:
            tool_span.set_attribute("tool.name", tool_name)
            tool_span.set_attribute("tool.side_effect_level", "read")
            tool_span.set_attribute("tool.args.hash", f"sha256:{sha256_hex(tool_args)}")
            tool_span.set_attribute("agent.step_index", 1)

            try:
                # call_tool(...) is your integration wrapper
                # result = call_tool(tool_name, tool_args)
                result = '{"threads":[{"id":"18b4f5e2a8c"}]}'
                tool_span.set_attribute("tool.result.hash", f"sha256:{sha256_hex(result)}")
            except ToolError as e:
                tool_span.record_exception(e)
                tool_span.set_attribute("error_class", "tool_error")
                raise

        # 3) Approval wait (if needed)
        with tracer.start_as_current_span("agent.approval.wait") as approval_span:
            approval_span.set_attribute("approval.state", "approved")
            approval_span.set_attribute("approval.approver_id", "u_901")
            approval_span.set_attribute("approval.latency_ms", 18421)

        # 4) Verify
        with tracer.start_as_current_span("agent.verify") as verify_span:
            verify_span.set_attribute("verify.strategy", "postcondition_check")
            verify_span.set_attribute("verify.passed", True)

        run_span.set_attribute("agent.outcome", "success")
        return run_id

What this gives you immediately

  • A per-run timeline with the exact tool calls involved
  • Searchable attributes for workflow_id, tool.name, approval.state
  • A place to add cost and token attribution
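For the last point, one way to attach cost is to derive it from token counts on the span that made the model call. The sketch below assumes a hypothetical price table: PRICE_PER_MILLION, the model name, and the rates are made up for illustration, so substitute your provider's actual pricing. The token counts reuse the values from the JSON example earlier in this guide.

```python
# Hypothetical prices in USD per one million tokens; replace with real rates.
PRICE_PER_MILLION = {
    "example-model": {"input": 2.00, "output": 8.00},
}


def llm_cost_usd(model: str, tokens_in: int, tokens_out: int) -> float:
    """Convert token counts into dollars using the price table above."""
    price = PRICE_PER_MILLION[model]
    return (tokens_in * price["input"] + tokens_out * price["output"]) / 1_000_000


def cost_attributes(model: str, tokens_in: int, tokens_out: int) -> dict[str, object]:
    """Build the attribute set to attach to a span or log event."""
    return {
        "model": model,
        "tokens_in": tokens_in,
        "tokens_out": tokens_out,
        "cost_usd": round(llm_cost_usd(model, tokens_in, tokens_out), 6),
    }


attrs = cost_attributes("example-model", tokens_in=2231, tokens_out=412)
```

Each key/value pair can then be written to the current span with set_attribute, and summed at the agent.run level to get a per-run total.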

JavaScript/TypeScript: capture trace context through HTTP tool wrappers

If your tools are ultimately HTTP calls, instrument the wrapper once and every tool gets trace coverage.

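import { createHash } from "node:crypto";
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("agent.harness");

type ToolCall = {
  runId: string;
  workflowId: string;
  toolName: string;
  sideEffectLevel: "none" | "read" | "write" | "irreversible";
  args: unknown;
};

// Your tool router; it dispatches each tool name to the matching vendor client.
declare function routeToolCall<T>(toolName: string, args: unknown): Promise<T>;

export async function callTool<T>(call: ToolCall): Promise<T> {
  return await tracer.startActiveSpan("agent.tool.call", async (span) => {
    span.setAttribute("agent.run_id", call.runId);
    span.setAttribute("agent.workflow_id", call.workflowId);
    span.setAttribute("tool.name", call.toolName);
    span.setAttribute("tool.side_effect_level", call.sideEffectLevel);

    // Hash args (store raw elsewhere if needed). JSON.stringify is not a
    // canonical form; swap in a sorted-key serializer for stable hashes.
    const argsJson = JSON.stringify(call.args) ?? "null";
    const argsHash = createHash("sha256").update(argsJson).digest("hex");
    span.setAttribute("tool.args.hash", `sha256:${argsHash}`);
    span.setAttribute("tool.args.bytes", Buffer.byteLength(argsJson, "utf8"));

    try {
      // Your tool router might call vendor APIs here.
      // Make sure fetch/axios is instrumented so downstream spans join this trace.
      const result = await routeToolCall<T>(call.toolName, call.args);
      return result;
    } catch (err) {
      // Thrown values are not always Error instances; normalize before recording.
      const error = err instanceof Error ? err : new Error(String(err));
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      span.setAttribute("error_class", "tool_error");
      throw err;
    } finally {
      span.end();
    }
  });
}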

Key point: propagate OTel context so the tool wrapper span is the parent of the actual HTTP span.


Don’t forget metrics: the dashboards operators and execs both use

Traces tell you why something failed. Metrics tell you how often and how bad it is.

Minimum viable metrics for tool-using agents:

Reliability metrics

  • agent_runs_total{workflow_id, outcome}
  • agent_tool_calls_total{tool_name, outcome}
  • agent_step_retries_total{workflow_id, tool_name}
  • agent_verification_fail_total{workflow_id}
  • approval_requests_total{workflow_id, state}
  • approval_queue_age_seconds (gauge/histogram)

Cost metrics

  • agent_tokens_total{workflow_id, model}
  • agent_cost_usd_total{workflow_id, model}
  • cost_per_success_usd{workflow_id} (derived)

Latency metrics

  • agent_run_latency_ms{workflow_id}
  • agent_tool_latency_ms{tool_name}

If you only build one exec-facing chart, make it:

Cost per successful outcome (by workflow), over time.

It keeps everyone honest.
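As a rough illustration of how that derived metric can be computed from per-run records, the sketch below divides total spend by the number of successful runs for each workflow. The record shape (workflow_id, outcome, cost_usd) and the function name are assumptions; in practice you would likely compute this in your metrics backend from the counters listed above.

```python
from collections import defaultdict
from typing import Optional


def cost_per_success(runs: list[dict]) -> dict[str, Optional[float]]:
    """Total cost of all runs divided by the count of successful runs, per workflow."""
    cost: dict[str, float] = defaultdict(float)
    successes: dict[str, int] = defaultdict(int)
    for run in runs:
        workflow = run["workflow_id"]
        cost[workflow] += run["cost_usd"]  # failed runs still cost money
        if run["outcome"] == "success":
            successes[workflow] += 1
    # None marks workflows that spent money without a single success.
    return {
        workflow: (cost[workflow] / successes[workflow] if successes[workflow] else None)
        for workflow in cost
    }


runs = [
    {"workflow_id": "triage", "outcome": "success", "cost_usd": 0.02},
    {"workflow_id": "triage", "outcome": "failure", "cost_usd": 0.01},
    {"workflow_id": "triage", "outcome": "success", "cost_usd": 0.03},
    {"workflow_id": "followup", "outcome": "failure", "cost_usd": 0.05},
]
report = cost_per_success(runs)
```

Counting the cost of failed runs in the numerator is deliberate: a workflow that retries its way to success should look more expensive than one that succeeds the first time.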


Failure modes → instrumentation you need to detect them

Most “agent failures” are not mystical. They cluster.

1) Wrong tool / wrong scope (permissions)

Signals to capture:

  • tool.name
  • permission errors normalized (error_class=authz_denied)
  • actor_user_id vs service account identity
  • tool.side_effect_level mismatches (agent attempted write when only read allowed)

2) Right tool / wrong args (schema drift)

Signals:

  • tool.args.hash
  • tool.schema_version
  • structured error code from the tool adapter
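Both of the failure modes above depend on a normalized error_class. A minimal sketch of such a mapping follows; apart from authz_denied, which appears above, the class names (schema_mismatch, rate_limited, integration_outage, timeout, unknown) and the status-code groupings are assumptions to adapt to your own tool adapters.

```python
from typing import Optional


def normalize_error(
    http_status: Optional[int] = None,
    exc: Optional[BaseException] = None,
) -> str:
    """Map a raw failure onto a small, stable error_class taxonomy."""
    # Transport-level failures first: there may be no HTTP status at all.
    if isinstance(exc, TimeoutError):
        return "timeout"
    if isinstance(exc, ConnectionError):
        return "integration_outage"
    if http_status in (401, 403):
        return "authz_denied"
    if http_status in (400, 422):
        return "schema_mismatch"  # arguments rejected: likely schema drift
    if http_status == 429:
        return "rate_limited"
    if http_status is not None and http_status >= 500:
        return "integration_outage"
    return "unknown"


error_class = normalize_error(http_status=403)
```

Keeping the taxonomy small matters more than getting every mapping perfect, since the point is to make failure counts comparable across tools and vendors.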

3) Partial completion (idempotency gaps)

Signals:

  • idempotency_key
  • tool.result.hash
  • “completed steps” list for the run

If a run retries after a timeout and you don’t have idempotency keys, you’ll ship duplicate emails, duplicate invoices, or duplicate tickets.
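A minimal sketch of that protection is shown below. The key format and the IdempotentExecutor class are hypothetical, and the in-memory dictionary is only for illustration: a real implementation needs a durable store, and should reserve the key before performing the side effect so a crash between the two cannot cause a duplicate.

```python
import hashlib
import json
from typing import Any, Callable


def idempotency_key(run_id: str, step_index: int, tool_name: str, args: dict) -> str:
    """Derive a key that stays the same across retries of the same logical step."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
    return f"{tool_name}:{run_id}:{step_index}:{digest}"


class IdempotentExecutor:
    """Runs an action at most once per key and replays the stored result afterwards."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def execute(self, key: str, action: Callable[[], Any]) -> Any:
        if key in self._results:
            return self._results[key]  # retry: skip the side effect
        result = action()
        self._results[key] = result
        return result


sent: list[str] = []


def send_email() -> str:
    sent.append("reply to thread_18b4f5e2")
    return "msg_02"


executor = IdempotentExecutor()
key = idempotency_key("run_01", 3, "gmail.send", {"thread": "thread_18b4f5e2"})
first = executor.execute(key, send_email)
retry = executor.execute(key, send_email)  # e.g. the caller timed out and retried
```

After both calls, sent holds a single entry and both calls return the same message ID, which is the behavior you want when a timeout hides a write that actually succeeded.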

4) Infinite loops / cascades

Signals:

  • agent.step_count
  • budget fields: max_steps, max_cost_usd, max_latency_ms
  • circuit_breaker_tripped=true
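These budget fields can be enforced by a small guard that the harness calls after every step. The sketch below is one possible shape, with hypothetical names (RunBudget, BudgetExceeded, charge); the limits themselves come from the signals listed above.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a run crosses one of its configured limits."""


@dataclass
class RunBudget:
    max_steps: int
    max_cost_usd: float
    max_latency_ms: int
    step_count: int = 0
    cost_usd: float = 0.0
    circuit_breaker_tripped: bool = False
    started: float = field(default_factory=time.monotonic)

    def charge(self, cost_usd: float) -> None:
        """Record one completed step and trip the breaker if any limit is exceeded."""
        self.step_count += 1
        self.cost_usd += cost_usd
        elapsed_ms = (time.monotonic() - self.started) * 1000
        if self.step_count > self.max_steps:
            reason = "max_steps"
        elif self.cost_usd > self.max_cost_usd:
            reason = "max_cost_usd"
        elif elapsed_ms > self.max_latency_ms:
            reason = "max_latency_ms"
        else:
            return
        self.circuit_breaker_tripped = True  # emit this as a span attribute
        raise BudgetExceeded(reason)


budget = RunBudget(max_steps=2, max_cost_usd=1.00, max_latency_ms=60_000)
budget.charge(0.10)
budget.charge(0.10)
try:
    budget.charge(0.10)  # third step exceeds max_steps
    tripped_reason = None
except BudgetExceeded as exc:
    tripped_reason = str(exc)
```

Recording which limit tripped, not just that one did, makes it possible to tell a looping agent (max_steps) from an expensive one (max_cost_usd) on a dashboard.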

5) Approval fatigue vs risk leaks

Signals:

  • approval.state
  • approval.latency_ms
  • approval rate by workflow step

You want approvals where the side-effect level is high—not everywhere.
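A simple way to express that rule, and to check whether it is being followed, is sketched below. The threshold ordering and both function names are assumptions; the event fields match the minimal schema earlier in this guide.

```python
from collections import Counter

# Side-effect levels ordered from least to most risky.
RISK_ORDER = ["none", "read", "write", "irreversible"]


def approval_required(side_effect_level: str, threshold: str = "write") -> bool:
    """Require approval only at or above the configured risk threshold."""
    return RISK_ORDER.index(side_effect_level) >= RISK_ORDER.index(threshold)


def approval_rate_by_step(events: list[dict]) -> dict[tuple[str, str], float]:
    """Fraction of tool calls per (workflow_id, tool_name) that went through approval."""
    total: Counter = Counter()
    gated: Counter = Counter()
    for event in events:
        step = (event["workflow_id"], event["tool_name"])
        total[step] += 1
        if event["approval_state"] != "not_required":
            gated[step] += 1
    return {step: gated[step] / total[step] for step in total}


events = [
    {"workflow_id": "triage", "tool_name": "gmail.search", "approval_state": "not_required"},
    {"workflow_id": "triage", "tool_name": "gmail.send", "approval_state": "approved"},
    {"workflow_id": "triage", "tool_name": "gmail.send", "approval_state": "not_required"},
]
rates = approval_rate_by_step(events)
```

A read-only step with a high approval rate points to approval fatigue, while a write step with a rate below 1.0 (as gmail.send shows here) points to a possible risk leak.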


Audit log blueprint: make agent actions defensible

A good audit log entry is:

  • minimal
  • immutable
  • attributable
  • tamper-evident

Suggested audit log record

{
  "timestamp": "2026-05-06T06:00:08.000Z",
  "tenant_id": "t_9f3d",
  "workflow_id": "customer_support_triage_v7",
  "run_id": "run_01J3K6...",

  "action": {
    "type": "tool.call",
    "tool_name": "gmail.send",
    "side_effect_level": "write",
    "target": "thread_18b4f5e2a8c"
  },

  "approval": {
    "required": true,
    "state": "approved",
    "approver_id": "u_901"
  },

  "payload": {
    "args_hash": "sha256:3b4d...",
    "result_hash": "sha256:91ad...",
    "payload_ref": "vault_7Qw..."
  },

  "integrity": {
    "prev_hash": "sha256:...",
    "this_hash": "sha256:..."
  }
}

That prev_hash → this_hash chaining gives you a tamper-evident ledger without requiring a blockchain.
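The chaining can be sketched in a few lines of Python. The function names, the all-zero genesis value, and the choice to hash the record together with prev_hash are assumptions for illustration; records passed in are expected not to contain an integrity key of their own.

```python
import hashlib
import json

GENESIS = "sha256:" + "0" * 64  # assumed starting value for an empty ledger


def _digest(record: dict, prev_hash: str) -> str:
    """Hash the record together with the previous entry's hash."""
    canonical = json.dumps(
        {"prev_hash": prev_hash, "record": record},
        sort_keys=True,
        separators=(",", ":"),
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_entry(ledger: list[dict], record: dict) -> dict:
    """Append a record, linking it to the hash of the entry before it."""
    prev_hash = ledger[-1]["integrity"]["this_hash"] if ledger else GENESIS
    entry = {
        **record,
        "integrity": {"prev_hash": prev_hash, "this_hash": _digest(record, prev_hash)},
    }
    ledger.append(entry)
    return entry


def verify_ledger(ledger: list[dict]) -> bool:
    """Recompute every hash in order; any edit, insertion, or deletion breaks the chain."""
    prev_hash = GENESIS
    for entry in ledger:
        record = {k: v for k, v in entry.items() if k != "integrity"}
        integrity = entry["integrity"]
        if integrity["prev_hash"] != prev_hash:
            return False
        if integrity["this_hash"] != _digest(record, prev_hash):
            return False
        prev_hash = integrity["this_hash"]
    return True


ledger: list[dict] = []
append_entry(ledger, {"run_id": "run_01", "tool_name": "gmail.search"})
append_entry(ledger, {"run_id": "run_01", "tool_name": "gmail.send"})
intact = verify_ledger(ledger)

ledger[0]["tool_name"] = "gmail.delete"  # simulate after-the-fact tampering
tampered = verify_ledger(ledger)
```

A chain like this only proves integrity relative to its latest hash, so that value should be published or stored somewhere the ledger's writer cannot rewrite.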


Turning traces into an improvement loop (the part most teams skip)

Observability isn’t the goal. Behavior-driven improvement is.

Here’s the loop that works:

  1. Observe: collect traces + audit entries for every run
  2. Label: tag failures by class (permissions, bad args, hallucinated entity, timeout, partial completion)
  3. Extract trajectories: for each incident, store a “trajectory bundle” (plan hash, tool sequence, tool hashes, policy versions)
  4. Build evals: turn real failed trajectories into a regression suite
  5. Tighten harness policies: fix with guardrails, verifiers, schema constraints—not just prompt edits
  6. Redeploy + watch: confirm failure rate drops and cost per success improves

This is exactly where a supervisor/harness layer becomes a moat. Without it, you’re stuck in an endless cycle of:

“try a new prompt” → “hope it works” → “repeat.”
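Step 3 of the loop can be sketched as a function that condenses a run's tool-call events into a trajectory bundle. The event fields follow the minimal schema earlier in this guide; the function name, the signature field, and the rule that the first error_class becomes the label are assumptions.

```python
import hashlib
import json


def trajectory_bundle(run_id: str, events: list[dict]) -> dict:
    """Condense one run's tool-call events into a compact, comparable summary."""
    calls = sorted(
        (e for e in events if e["run_id"] == run_id and e["event_type"] == "tool.call"),
        key=lambda e: e["timestamp"],  # same-format ISO 8601 UTC strings sort chronologically
    )
    failures = [e["error_class"] for e in calls if e.get("error_class")]
    bundle = {
        "run_id": run_id,
        "tool_sequence": [e["tool_name"] for e in calls],
        "tool_args_hashes": [e["tool_args_hash"] for e in calls],
        "policy_versions": sorted({e["policy_version"] for e in calls}),
        "label": failures[0] if failures else None,
    }
    # The signature groups recurring failures: same tools, same order, same label.
    shape = json.dumps([bundle["tool_sequence"], bundle["label"]], separators=(",", ":"))
    bundle["signature"] = hashlib.sha256(shape.encode("utf-8")).hexdigest()[:12]
    return bundle


def make_events(run_id: str) -> list[dict]:
    """Two tool calls, deliberately listed out of order, the second one failing."""
    base = {"run_id": run_id, "event_type": "tool.call", "policy_version": "support_policy@2026-05-03"}
    return [
        {**base, "timestamp": "2026-05-06T06:00:02.000Z", "tool_name": "gmail.send",
         "tool_args_hash": "sha256:bb", "error_class": "authz_denied"},
        {**base, "timestamp": "2026-05-06T06:00:01.000Z", "tool_name": "gmail.search",
         "tool_args_hash": "sha256:aa", "error_class": None},
    ]


all_events = make_events("run_01") + make_events("run_02")
bundle_1 = trajectory_bundle("run_01", all_events)
bundle_2 = trajectory_bundle("run_02", all_events)
```

Because both runs failed the same way, bundle_1 and bundle_2 share a signature, so they can collapse into a single regression case instead of two.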


Implementation checklist (copy/paste)

Day 0 (today): get debuggable runs

  • Generate a stable run_id at the start of every agent run
  • Emit an agent.run root span with workflow_id, tenant_id, versions
  • Emit an agent.tool.call span for every tool call
  • Capture tool.name, side_effect_level, tool_args_hash, tool_result_hash
  • Capture tokens_in/out, latency_ms, and cost_usd at run + step levels
  • Add redaction + payload vault (hashes in traces, raw in vault)

Day 7: make autonomy safe

  • Add approval telemetry (approval.state, latency, approver id)
  • Add idempotency keys for write/irreversible actions
  • Add circuit breakers (max steps, max cost, max time)
  • Normalize errors into a small error_class taxonomy
  • Add SLOs: success rate and cost-per-success per workflow

Day 30: turn production into training data (without the chaos)

  • Auto-extract “failed trajectories” into a weekly regression eval set
  • Add verifier spans and track verifier pass/fail rates
  • Build dashboards that map tool errors by vendor endpoint
  • Add policy diffing: correlate failure spikes with policy/model/schema versions

Where nNode fits in this picture

Most teams instrument the model and call it a day.

But the real production failures happen at the seams:

  • tool selection vs tool execution
  • retries vs idempotency
  • approvals vs velocity
  • verification vs silent corruption

nNode is built around the harness/supervisor layer: the part that makes agentic workflows observable, controllable, and steadily improvable. If you’re serious about shipping tool-using agents that operators can trust (and on-call can debug), that layer is where you win.

If you want to see what this looks like in a real workflow engine—where traces, auditability, approvals, and reliability loops are first-class—take a look at nnode.ai.

If you’re building agents in production and keep asking “what actually happened?”, nNode is worth a conversation.
