# AI Observability
Instrument AI and LLM applications with OpenTelemetry to get unified traces that connect HTTP requests, agent orchestration, LLM API calls, and database queries in a single view.
## The Problem
Traditional APM tools (Datadog, New Relic) capture HTTP and database telemetry. Specialized AI tools (LangSmith, Weights & Biases) capture LLM traces. Neither shows the full picture:
| Tool Type | Captures | Misses |
|---|---|---|
| Traditional APM | HTTP requests, DB queries, latency | Model name, tokens, cost, prompt content |
| AI-specific tools | LLM calls, prompts, model metadata | HTTP context, DB queries, infrastructure |
| OpenTelemetry | All of the above in one trace | - |
With OpenTelemetry, a single trace shows that a slow HTTP response was caused by a specific LLM call in a specific agent, which also triggered 3 database queries and a fallback to a different provider.
## When to Use AI Observability
| Use Case | Recommendation |
|---|---|
| Track LLM token usage and costs | AI Observability |
| Monitor agent pipeline performance | AI Observability |
| Evaluate LLM output quality over time | AI Observability |
| Debug slow AI requests end-to-end | AI Observability |
| Attribute costs to agents or business operations | AI Observability |
| Standard HTTP/database monitoring only | Auto-instrumentation |
| Generic custom spans and metrics | Custom instrumentation |
## Guides
| Guide | What It Covers |
|---|---|
| LLM Observability | End-to-end guide (Python): GenAI semantic conventions, token/cost metrics, agent pipeline spans, evaluation tracking, PII scrubbing, production deployment |
| Rust LLM Observability | End-to-end guide (Rust): GenAI semantic conventions, multi-provider LLM with fallback, token/cost metrics, multi-stage pipeline spans, retry observability, Docker deployment |
| Spring AI LLM Observability | End-to-end guide (Java): Three-layer instrumentation (Java Agent + Spring AI + manual OTel), GenAI semantic conventions, tool calling, RAG, domain metrics, Docker deployment |
| LangGraph Instrumentation | Framework-specific: LangGraph node wrapping, conditional edge routing, tool-calling nodes, state management, pipeline traces |
| LlamaIndex Instrumentation | Framework-specific: LlamaIndex structured output, self-correction loops, multi-provider LLM factory, YAML prompt management |
| Vercel AI SDK Instrumentation | Framework-specific: Vercel AI SDK v6 LanguageModelV3Middleware, multi-stage pipeline spans, concurrent stage execution, Bun + Hono + pgvector |
## What Gets Instrumented
AI observability builds on top of auto and custom instrumentation, adding an LLM-specific layer:
### Auto-Instrumentation Layer (zero code changes)
- HTTP requests via FastAPI/Django/Flask instrumentors (Python), tower-http TraceLayer (Rust), Java Agent (Spring WebFlux)
- Database queries via SQLAlchemy/Django ORM instrumentors (Python), SQLx tracing (Rust), Java Agent (JDBC/R2DBC)
- Outbound HTTP via httpx/requests instrumentors (Python), Java Agent (captures raw LLM API calls)
- Log correlation via logging instrumentor (Python), OpenTelemetryTracingBridge (Rust), Java Agent (Logback/Log4j)
### Custom AI Layer (GenAI semantic conventions)
- LLM spans with model, provider, token counts, cost
- Prompt/completion events with PII scrubbing
- Agent spans with pipeline orchestration context
- Evaluation events with quality scores and pass/fail
- Cost metrics with attribution by agent and business operation
- Retry/fallback tracking with error type classification
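As an illustration of the prompt/completion events above, here is a minimal sketch of scrubbing PII before prompt content is recorded. `scrub_pii` and its patterns are hypothetical helpers, not part of the OpenTelemetry API; a production deployment would use a vetted redaction library and patterns matched to its own PII policy.

```python
import re

# Hypothetical redaction patterns -- extend to match your PII policy.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace matched PII with typed placeholders before the text
    is attached to a span event."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text
```

The scrubbed text can then be attached to the LLM span as an event, for example `span.add_event("gen_ai.content.prompt", {"gen_ai.prompt": scrub_pii(prompt)})`, so raw prompt content never leaves the process.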
## Example: Unified Trace

```text
POST /api/generate                        4.2s   [auto: HTTP]
├─ db.query SELECT context                15ms   [auto: DB]
├─ invoke_agent enrich                    1.8s   [custom: agent]
│  └─ gen_ai.chat claude-sonnet-4         1.7s   [custom: LLM]
│     └─ HTTP POST api.anthropic.com      1.7s   [auto: httpx]
├─ invoke_agent draft                     2.3s   [custom: agent]
│  └─ gen_ai.chat claude-sonnet-4         2.2s   [custom: LLM]
│     └─ HTTP POST api.anthropic.com      2.2s   [auto: httpx]
└─ db.query INSERT result                 5ms    [auto: DB]
```
## Key Concepts

### GenAI Semantic Conventions
OpenTelemetry defines GenAI semantic conventions for standardized LLM telemetry. Key attributes:
| Attribute | Example | Purpose |
|---|---|---|
| `gen_ai.operation.name` | `"chat"` | Operation type |
| `gen_ai.provider.name` | `"anthropic"` | LLM provider |
| `gen_ai.request.model` | `"claude-sonnet-4"` | Model used |
| `gen_ai.usage.input_tokens` | `1240` | Tokens consumed |
| `gen_ai.usage.output_tokens` | `320` | Tokens generated |
| `gen_ai.agent.name` | `"draft"` | Agent in pipeline |
### GenAI Metrics
Custom metrics for dashboards and alerting:
| Metric | Type | Purpose |
|---|---|---|
| `gen_ai.client.token.usage` | Histogram | Token consumption by model/agent |
| `gen_ai.client.operation.duration` | Histogram | LLM call latency |
| `gen_ai.client.cost` | Counter | Cost in USD by model/agent |
| `gen_ai.evaluation.score` | Histogram | Output quality scores |
| `gen_ai.client.error.count` | Counter | Errors by provider/type |
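The cost counter needs a price model the SDK does not provide. A hedged sketch, assuming a hand-maintained per-million-token price table (the figures below are illustrative only; check your provider's current pricing):

```python
# Illustrative (input_usd, output_usd) prices per 1M tokens -- not authoritative.
PRICE_PER_MTOK = {
    "claude-sonnet-4": (3.00, 15.00),
}

def llm_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one LLM call from its token usage."""
    input_price, output_price = PRICE_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

The result can then be recorded with the OpenTelemetry metrics API, e.g. adding it to a `gen_ai.client.cost` counter with `gen_ai.request.model` and `gen_ai.agent.name` attributes, while the raw token counts go to the `gen_ai.client.token.usage` histogram.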
## Next Steps
- Follow the LLM Observability guide for a complete Python setup walkthrough
- Follow the Rust LLM Observability guide for Rust AI applications with manual GenAI instrumentation
- Follow the Spring AI LLM Observability guide for Java Spring AI applications with three-layer instrumentation
- Set up auto-instrumentation for your web framework if you haven't already
- Configure the OpenTelemetry Collector to export telemetry to base14 Scout