
Coding Agent Observability for Your Team

· 8 min read
Ranjan Sakalley
Founder & CPO at base14

Coding agents like Claude Code, OpenAI Codex CLI, and Google Gemini CLI now ship with native OpenTelemetry support. This means you can collect structured telemetry covering token usage, cost attribution, tool calls, sessions, and lines of code modified, the same way you collect it from any other production system.

This post covers what each agent emits, how to enable collection, and what we learned running Claude Code telemetry across a team.

The Case for Instrumenting Coding Agents

Treating coding agents as individual developer tools creates blind spots. Instrumenting these tools allows teams to measure aggregate behavior and make data-driven decisions.

When coding agents are black boxes, we rely on assumptions:

  • Cost attribution is a guess. You know the monthly total but not which projects, engineers, or workflows drive the spend.
  • Model routing is invisible. Agents route requests across models automatically. Without telemetry, you cannot see which model handles which tasks or whether the routing is efficient.
  • Usage patterns vary widely. Session length, tool preference, request volume, and time-of-day activity differ meaningfully per engineer. Aggregated metrics hide these differences.
  • Optimization decisions lack evidence. Should you change plans? Adjust context limits? Restrict certain tools? Without data, these decisions are opinion-based.
  • Adoption is unmeasured. You rolled out an agent to your team but have no visibility into who is actually using it, how frequently, or whether new engineers are ramping up at all. Session and user-level telemetry turns adoption from anecdote into data.
  • Effectiveness is unquantified. Lines of code modified, commits generated, and tool call patterns are proxies for productivity that only exist if you collect them.

What Each Agent Emits

All three major coding agents now support OpenTelemetry export. The depth of instrumentation varies significantly.

Claude Code

Claude Code emits metrics and logs via OTLP, but not traces. Enable it with CLAUDE_CODE_ENABLE_TELEMETRY=1 and point to a collector endpoint.

Metrics:

| Metric | Description |
| --- | --- |
| claude_code.session.count | Session initiations |
| claude_code.token.usage | Input, output, cache read, and cache creation tokens per model |
| claude_code.cost.usage | Estimated USD cost by model |
| claude_code.lines_of_code.count | Lines added or removed |
| claude_code.commit.count | Git commits generated |
| claude_code.pull_request.count | Pull requests created |
| claude_code.code_edit_tool.decision | Accept/reject counts per tool |
| claude_code.active_time.total | Duration split between user input and CLI processing |

Log events:

| Event | Key Attributes |
| --- | --- |
| claude_code.user_prompt | Prompt length (content redacted by default) |
| claude_code.api_request | Token counts, cost, latency, model |
| claude_code.api_error | HTTP status, error message |
| claude_code.tool_result | Tool name, success/failure, duration, decision |
| claude_code.tool_decision | Accept/reject per tool |

Common attributes: session.id, user.account_uuid, user.email, organization.id, app.version, terminal.type.

Configuration:

export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_EXPORTER_OTLP_ENDPOINT=https://your-collector:4318
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer <token>"
export OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=delta

Optional flags control cardinality and privacy: OTEL_LOG_USER_PROMPTS, OTEL_LOG_TOOL_DETAILS, OTEL_METRICS_INCLUDE_SESSION_ID.
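For a team rollout, one approach is to centralize these exports in a snippet that each engineer sources from their shell profile. This is a minimal sketch; the endpoint, token variable, and resource attributes below are illustrative placeholders, not values from a real deployment.

```shell
# enable-claude-telemetry.sh -- source from ~/.bashrc or ~/.zshrc
# Endpoint and token are placeholders; substitute your collector's values.
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_EXPORTER_OTLP_ENDPOINT="https://your-collector:4318"
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer ${OTEL_TOKEN:-changeme}"
export OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=delta

# Optional: tag every signal with team/project so the backend can group usage
export OTEL_RESOURCE_ATTRIBUTES="team=platform,project=internal-tools"
```

Keeping the snippet in a shared repo makes it easy to change the endpoint or add resource attributes in one place for everyone.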

OpenAI Codex CLI

Codex CLI emits traces, metrics, and logs via OTLP. Configure in ~/.codex/config.toml. Supports both gRPC and HTTP exporters.

Metrics:

| Metric | Type |
| --- | --- |
| codex.tool.call | Counter |
| codex.tool.call.duration_ms | Histogram |
| codex.api_request | Counter |
| codex.api_request.duration_ms | Histogram |
| codex.sse_event / codex.websocket.event | Counter + Histogram |
| codex.responses_api_overhead.duration_ms | Histogram |
| codex.responses_api_inference_time.duration_ms | Histogram |

Log events:

| Event | Key Attributes |
| --- | --- |
| codex.conversation_starts | Model, reasoning effort, security policy, context window |
| codex.api_request | Status, duration, token counts (input, output, cached, reasoning, tool) |
| codex.user_prompt | Prompt length, image counts (redacted by default) |
| codex.tool_result | Tool name, duration, output length, MCP server/tool info |
| codex.tool_decision | Decision, source (config or user) |
| codex.api_error | HTTP status, error message |

Token attributes: input_token_count, output_token_count, cached_token_count, reasoning_token_count, tool_token_count.

Known limitations: Metrics are not emitted in codex exec mode, and no telemetry is available in codex mcp-server mode (tracked in #12913).

Configuration:

# ~/.codex/config.toml
[otel]
exporter = { otlp-grpc = { endpoint = "https://your-collector:4317" } }
trace_exporter = { otlp-grpc = { endpoint = "https://your-collector:4317" } }
metrics_exporter = { otlp-grpc = { endpoint = "https://your-collector:4317" } }
log_user_prompt = false

Google Gemini CLI

Gemini CLI has the most comprehensive instrumentation of the three. It emits traces, metrics, and logs and follows the OpenTelemetry GenAI Semantic Conventions.

Metrics:

| Metric | Type |
| --- | --- |
| gemini_cli.session.count | Counter |
| gemini_cli.tool.call.count | Counter |
| gemini_cli.tool.call.latency | Histogram (ms) |
| gemini_cli.api.request.count | Counter |
| gemini_cli.api.request.latency | Histogram (ms) |
| gemini_cli.token.usage | Counter (tokens) |
| gemini_cli.file.operation.count | Counter |
| gemini_cli.chat_compression | Counter |
| gen_ai.client.token.usage | Histogram (GenAI semconv) |
| gen_ai.client.operation.duration | Histogram (GenAI semconv) |

Log events:

| Event | Key Attributes |
| --- | --- |
| gemini_cli.config | Model, sandbox, approval mode, MCP servers |
| gemini_cli.user_prompt | Prompt length, auth type |
| gemini_cli.api_response | Model, status, duration, input/output/cached/thoughts/tool token counts |
| gemini_cli.tool_call | Function name, args, duration, decision, success |
| gemini_cli.file_operation | Operation type, lines, language, diff stats (AI vs user) |
| gemini_cli.model_routing | Router decisions with latency and reasoning |
| gemini_cli.chat_compression | Tokens before/after compression |
| gemini_cli.extension_* | Extension install/enable/uninstall events |

Trace spans follow GenAI semantic conventions with attributes like gen_ai.operation.name, gen_ai.agent.name, gen_ai.request.model, and gen_ai.response.model.

Gemini CLI also provides a pre-configured Google Cloud Monitoring dashboard out of the box.

Configuration:

// .gemini/settings.json
{
  "telemetry": {
    "enabled": true,
    "otlpEndpoint": "http://your-collector:4317",
    "otlpProtocol": "grpc",
    "logPrompts": true
  }
}

Or via environment variables: GEMINI_TELEMETRY_ENABLED, GEMINI_TELEMETRY_OTLP_ENDPOINT, GEMINI_TELEMETRY_OTLP_PROTOCOL.
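As a sketch, the environment-variable form is convenient in a CI job or shared shell profile; the endpoint below is a placeholder.

```shell
# Equivalent env-var setup for Gemini CLI (endpoint is a placeholder)
export GEMINI_TELEMETRY_ENABLED=true
export GEMINI_TELEMETRY_OTLP_ENDPOINT="http://your-collector:4317"
export GEMINI_TELEMETRY_OTLP_PROTOCOL=grpc
```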

Comparison Table

| Capability | Claude Code | Codex CLI | Gemini CLI |
| --- | --- | --- | --- |
| OTLP Signals | Metrics, Logs | Traces, Metrics, Logs | Traces, Metrics, Logs |
| Token usage | Input, output, cache read, cache creation | Input, output, cached, reasoning, tool | Input, output, cached, thoughts, tool |
| Cost attribution | Yes (cost.usage by model) | No native metric | No native metric |
| Session tracking | Yes (session.count, session.id) | Yes (conversation.id) | Yes (session.count, session.id) |
| Tool call metrics | Yes (decision counts) | Yes (count + duration histogram) | Yes (count + latency histogram) |
| Lines of code | Yes (lines_of_code.count) | No native metric | Yes (file_operation with diff stats) |
| Commit/PR tracking | Yes (commit.count, pull_request.count) | No | No |
| Model routing visibility | Via log events | Via log events | Dedicated model_routing event |
| User identity | user.account_uuid, user.email | user.account_id, user.email | installation.id, user.email |
| GenAI semconv | No | No | Yes |
| Prompt redaction | Redacted by default, opt-in | Redacted by default, opt-in | Logged by default, opt-out |
| Pre-built dashboard | No | No | Yes (Google Cloud Monitoring) |
| Exporter protocol | HTTP/protobuf | gRPC or HTTP | gRPC or HTTP |

What We Learned Running Claude Code Telemetry

We enabled telemetry on Claude Code across five engineers on the Max plan and collected data for seven days. Engineers had the ability to disable telemetry temporarily. A few patterns stood out.

[Screenshot: Claude Code dashboard showing $313 total spend, 25K lines added, 5 active users, 125 sessions, and token usage breakdown with 289M cache read tokens]

[Screenshot: Tool usage breakdown and API requests by model, showing Haiku handling the majority of requests alongside Opus at 43%]

Cache reads dominate token usage

The token breakdown showed cache read tokens significantly outweighing other categories. In our sample, cache reads accounted for 289 million tokens versus 4.23 million input tokens. Prompt caching is materially reducing incremental cost, and without instrumentation this would be difficult to quantify.
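The ratio is worth making explicit. Using the figures from our sample:

```shell
# Cache read tokens vs. fresh input tokens over the 7-day window
awk 'BEGIN { printf "cache reads are %.1fx input tokens\n", 289 / 4.23 }'
# prints: cache reads are 68.3x input tokens
```

In other words, for every fresh input token the agent sent, roughly 68 tokens were served from the prompt cache at a discounted rate.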

Haiku handles a majority of requests

Even on higher-tier plans, the routing layer delegates a large share of sub-agent tasks like tool calls, file operations, and code searches to Haiku. More than half of API requests in our sample were served by the lighter model, with Opus handling 43% and Sonnet under 5%. The system optimizes cost and latency automatically, but the distribution is only visible through instrumentation.

Usage behavior varies across engineers

Session length, tool preference, request volume, and time-of-day activity differed meaningfully per user. One engineer ran long, exploratory sessions with heavy tool use. Another ran short, targeted prompts. Aggregated metrics hid these differences completely. Per-user and per-session views are necessary to understand actual consumption patterns.

Adoption becomes visible

With 125 sessions across five engineers in a week, we could see who was using the agent daily and who had not touched it since the initial setup. One engineer accounted for nearly half of the token usage. Another had very low usage. The data prompted a conversation about workflows and onboarding that would not have happened without the numbers.

Effectiveness has proxies worth tracking

Over the seven-day window, the team generated over 25,000 lines of code modifications across 125 sessions. Combining lines_of_code.count with commit.count and tool_result events gives a rough picture of output per session. The engineer with the highest session count also had the highest Read and Bash tool usage, suggesting deep exploratory work. The engineer with fewer but longer sessions leaned heavily on Edit and Write, suggesting more directed code generation.

Of course, lines of code is a flawed proxy for value generated. We used frontier models to analyze the tool usage patterns and session behavior, which itself requires having telemetry in the first place. The value is less in any single metric and more in building a baseline that teams can reason about over time.

Getting Started

Regardless of which agent your team uses, the setup follows the same pattern:

  1. Enable telemetry. Set the agent's telemetry flag (environment variable or config file).
  2. Point to a collector. Any OTLP-compatible backend works: your existing observability platform, a standalone OpenTelemetry Collector, or a managed service.
  3. Add resource attributes. Tag with team, project, or environment to enable useful grouping.
  4. Build views. Start with per-user token usage and cost, then drill into tool call patterns and session behavior.
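For step 2, a minimal OpenTelemetry Collector configuration that accepts OTLP from all three agents (gRPC on 4317, HTTP on 4318) and forwards it to a backend could look like the sketch below; the otlphttp endpoint is a placeholder for your own backend.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  otlphttp:
    endpoint: https://your-backend.example.com  # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp]
```

Since Claude Code does not emit traces, its data flows only through the metrics and logs pipelines; the traces pipeline picks up Codex CLI and Gemini CLI spans.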

The telemetry is already there. You just need to collect it.

What Comes Next

The GenAI semantic conventions that Gemini CLI already follows will likely become the standard. As these conventions mature, expect Claude Code and Codex to adopt them, making cross-agent dashboards and alerting more straightforward.

Once coding agent telemetry is collected alongside CI/CD metrics, deployment frequency, and incident data, teams can start correlating agent usage with engineering outcomes. That is the real payoff: moving from "how much did we spend" to "what did we get for it."


If your team is instrumenting AI workflows and wants help operationalizing the telemetry, reach out to the base14 team.