# AI Observability
Instrument AI and LLM applications with OpenTelemetry to get unified traces that connect HTTP requests, agent orchestration, LLM API calls, and database queries in a single view.
## The Problem
Traditional APM tools (Datadog, New Relic) capture HTTP and database telemetry. Specialized AI tools (LangSmith, Weights & Biases) capture LLM traces. Neither shows the full picture:
| Tool Type | Captures | Misses |
|---|---|---|
| Traditional APM | HTTP requests, DB queries, latency | Model name, tokens, cost, prompt content |
| AI-specific tools | LLM calls, prompts, model metadata | HTTP context, DB queries, infrastructure |
| OpenTelemetry | All of the above in one trace | - |
With OpenTelemetry, a single trace shows that a slow HTTP response was caused by a specific LLM call in a specific agent, which also triggered 3 database queries and a fallback to a different provider.
## When to Use AI Observability
| Use Case | Recommendation |
|---|---|
| Track LLM token usage and costs | AI Observability |
| Monitor agent pipeline performance | AI Observability |
| Evaluate LLM output quality over time | AI Observability |
| Debug slow AI requests end-to-end | AI Observability |
| Attribute costs to agents or business operations | AI Observability |
| Standard HTTP/database monitoring only | Auto-instrumentation |
| Generic custom spans and metrics | Custom instrumentation |
## Guides
| Guide | What It Covers |
|---|---|
| LLM Observability | End-to-end guide (Python): GenAI semantic conventions, token/cost metrics, agent pipeline spans, evaluation tracking, PII scrubbing, production deployment |
| Rust LLM Observability | End-to-end guide (Rust): GenAI semantic conventions, multi-provider LLM with fallback, token/cost metrics, multi-stage pipeline spans, retry observability, Docker deployment |
| Spring AI LLM Observability | End-to-end guide (Java): Three-layer instrumentation (Java Agent + Spring AI + manual OTel), GenAI semantic conventions, tool calling, RAG, domain metrics, Docker deployment |
| LangGraph Instrumentation | Framework-specific: LangGraph node wrapping, conditional edge routing, tool-calling nodes, state management, pipeline traces |
| LlamaIndex Instrumentation | Framework-specific: LlamaIndex structured output, self-correction loops, multi-provider LLM factory, YAML prompt management |
| Vercel AI SDK Instrumentation | Framework-specific: Vercel AI SDK v6 LanguageModelV3Middleware, multi-stage pipeline spans, concurrent stage execution, Bun + Hono + pgvector |
## What Gets Instrumented
AI observability builds on top of auto and custom instrumentation, adding an LLM-specific layer:
### Auto-Instrumentation Layer (zero code changes)
- HTTP requests via FastAPI/Django/Flask instrumentors (Python), tower-http TraceLayer (Rust), Java Agent (Spring WebFlux)
- Database queries via SQLAlchemy/Django ORM instrumentors (Python), SQLx tracing (Rust), Java Agent (JDBC/R2DBC)
- Outbound HTTP via httpx/requests instrumentors (Python), Java Agent (captures raw LLM API calls)
- Log correlation via logging instrumentor (Python), OpenTelemetryTracingBridge (Rust), Java Agent (Logback/Log4j)
### Custom AI Layer (GenAI semantic conventions)
- LLM spans with model, provider, token counts, cost
- Prompt/completion events with PII scrubbing
- Agent spans with pipeline orchestration context
- Evaluation events with quality scores and pass/fail
- Cost metrics with attribution by agent and business operation
- Retry/fallback tracking with error type classification
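As an illustration of the prompt/completion events above, here is a minimal sketch of scrubbing PII before prompt content is recorded. `scrub_pii` and its patterns are hypothetical helpers, not part of the OpenTelemetry API; a production deployment would use a vetted redaction library and patterns matched to its own PII policy.

```python
import re

# Hypothetical redaction patterns -- extend to match your PII policy.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scrub_pii(text: str) -> str:
    """Replace matched PII with typed placeholders before the text
    is attached to a span event."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text
```

The scrubbed text can then be attached to the LLM span as an event, for example `span.add_event("gen_ai.content.prompt", {"gen_ai.prompt": scrub_pii(prompt)})`, so raw prompt content never leaves the process.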
## Example: Unified Trace

```text
POST /api/generate                        4.2s   [auto: HTTP]
├─ db.query SELECT context                15ms   [auto: DB]
├─ invoke_agent enrich                    1.8s   [custom: agent]
│  └─ gen_ai.chat claude-sonnet-4         1.7s   [custom: LLM]
│     └─ HTTP POST api.anthropic.com      1.7s   [auto: httpx]
├─ invoke_agent draft                     2.3s   [custom: agent]
│  └─ gen_ai.chat claude-sonnet-4         2.2s   [custom: LLM]
│     └─ HTTP POST api.anthropic.com      2.2s   [auto: httpx]
└─ db.query INSERT result                 5ms    [auto: DB]
```
## Key Concepts

### GenAI Semantic Conventions
OpenTelemetry defines GenAI semantic conventions for standardized LLM telemetry. Key attributes:
| Attribute | Example | Purpose |
|---|---|---|
| `gen_ai.operation.name` | `"chat"` | Operation type |
| `gen_ai.provider.name` | `"anthropic"` | LLM provider |
| `gen_ai.request.model` | `"claude-sonnet-4"` | Model used |
| `gen_ai.usage.input_tokens` | `1240` | Tokens consumed |
| `gen_ai.usage.output_tokens` | `320` | Tokens generated |
| `gen_ai.agent.name` | `"draft"` | Agent in pipeline |
### GenAI Metrics
Custom metrics for dashboards and alerting:
| Metric | Type | Purpose |
|---|---|---|
| `gen_ai.client.token.usage` | Histogram | Token consumption by model/agent |
| `gen_ai.client.operation.duration` | Histogram | LLM call latency |
| `gen_ai.client.cost` | Counter | Cost in USD by model/agent |
| `gen_ai.evaluation.score` | Histogram | Output quality scores |
| `gen_ai.client.error.count` | Counter | Errors by provider/type |
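The cost counter needs a price model the SDK does not provide. A hedged sketch, assuming a hand-maintained per-million-token price table (the figures below are illustrative only; check your provider's current pricing):

```python
# Illustrative (input_usd, output_usd) prices per 1M tokens -- not authoritative.
PRICE_PER_MTOK = {
    "claude-sonnet-4": (3.00, 15.00),
}

def llm_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one LLM call from its token usage."""
    input_price, output_price = PRICE_PER_MTOK[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000
```

The result can then be recorded with the OpenTelemetry metrics API, e.g. adding it to a `gen_ai.client.cost` counter with `gen_ai.request.model` and `gen_ai.agent.name` attributes, while the raw token counts go to the `gen_ai.client.token.usage` histogram.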
## Next Steps
- Follow the LLM Observability guide for a complete Python setup walkthrough
- Follow the Rust LLM Observability guide for Rust AI applications with manual GenAI instrumentation
- Follow the Spring AI LLM Observability guide for Java Spring AI applications with three-layer instrumentation
- Set up auto-instrumentation for your web framework if you haven't already
- Configure the OpenTelemetry Collector to export telemetry to base14 Scout