AI Observability

Instrument AI and LLM applications with OpenTelemetry to get unified traces that connect HTTP requests, agent orchestration, LLM API calls, and database queries in a single view.

The Problem

Traditional APM tools (Datadog, New Relic) capture HTTP and database telemetry. Specialized AI tools (LangSmith, Weights & Biases) capture LLM traces. Neither shows the full picture:

| Tool Type | Captures | Misses |
| --- | --- | --- |
| Traditional APM | HTTP requests, DB queries, latency | Model name, tokens, cost, prompt content |
| AI-specific tools | LLM calls, prompts, model metadata | HTTP context, DB queries, infrastructure |
| OpenTelemetry | All of the above in one trace | - |

With OpenTelemetry, a single trace shows that a slow HTTP response was caused by a specific LLM call in a specific agent, which also triggered 3 database queries and a fallback to a different provider.

When to Use AI Observability

| Use Case | Recommendation |
| --- | --- |
| Track LLM token usage and costs | AI Observability |
| Monitor agent pipeline performance | AI Observability |
| Evaluate LLM output quality over time | AI Observability |
| Debug slow AI requests end-to-end | AI Observability |
| Attribute costs to agents or business operations | AI Observability |
| Standard HTTP/database monitoring only | Auto-instrumentation |
| Generic custom spans and metrics | Custom instrumentation |

Guides

| Guide | What It Covers |
| --- | --- |
| LLM Observability | End-to-end guide (Python): GenAI semantic conventions, token/cost metrics, agent pipeline spans, evaluation tracking, PII scrubbing, production deployment |
| Rust LLM Observability | End-to-end guide (Rust): GenAI semantic conventions, multi-provider LLM with fallback, token/cost metrics, multi-stage pipeline spans, retry observability, Docker deployment |
| Spring AI LLM Observability | End-to-end guide (Java): three-layer instrumentation (Java Agent + Spring AI + manual OTel), GenAI semantic conventions, tool calling, RAG, domain metrics, Docker deployment |
| LangGraph Instrumentation | Framework-specific: LangGraph node wrapping, conditional edge routing, tool-calling nodes, state management, pipeline traces |
| LlamaIndex Instrumentation | Framework-specific: LlamaIndex structured output, self-correction loops, multi-provider LLM factory, YAML prompt management |
| Vercel AI SDK Instrumentation | Framework-specific: Vercel AI SDK v6 LanguageModelV3Middleware, multi-stage pipeline spans, concurrent stage execution, Bun + Hono + pgvector |

What Gets Instrumented

AI observability builds on top of auto-instrumentation and custom instrumentation, adding an LLM-specific layer:

Auto-Instrumentation Layer (zero code changes)

  • HTTP requests via FastAPI/Django/Flask instrumentors (Python), tower-http TraceLayer (Rust), Java Agent (Spring WebFlux)
  • Database queries via SQLAlchemy/Django ORM instrumentors (Python), SQLx tracing (Rust), Java Agent (JDBC/R2DBC)
  • Outbound HTTP via httpx/requests instrumentors (Python), Java Agent (captures raw LLM API calls)
  • Log correlation via logging instrumentor (Python), OpenTelemetryTracingBridge (Rust), Java Agent (Logback/Log4j)
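For the Python case, this layer is typically enabled in a few lines. A minimal sketch using the standard OpenTelemetry contrib instrumentors, assuming a FastAPI app and a SQLAlchemy engine (the DSN is a placeholder):

```python
from fastapi import FastAPI
from sqlalchemy import create_engine
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.logging import LoggingInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

app = FastAPI()
engine = create_engine("postgresql://localhost/app")  # placeholder DSN

FastAPIInstrumentor.instrument_app(app)             # inbound HTTP: one server span per request
HTTPXClientInstrumentor().instrument()              # outbound HTTP: captures raw LLM API calls
SQLAlchemyInstrumentor().instrument(engine=engine)  # DB: one client span per query
LoggingInstrumentor().instrument(set_logging_format=True)  # inject trace IDs into logs
```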

Custom AI Layer (GenAI semantic conventions)

  • LLM spans with model, provider, token counts, cost
  • Prompt/completion events with PII scrubbing
  • Agent spans with pipeline orchestration context
  • Evaluation events with quality scores and pass/fail
  • Cost metrics with attribution by agent and business operation
  • Retry/fallback tracking with error type classification
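A minimal Python sketch of the LLM-span part of this layer, assuming the Anthropic SDK as the provider; the traced_chat helper is invented for illustration, and the attribute names follow the GenAI semantic conventions described below. Prompt/completion events, cost recording, and retry tracking are covered in the full guides.

```python
import anthropic
from opentelemetry import trace

tracer = trace.get_tracer("app.llm")
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def traced_chat(model: str, messages: list, agent: str):
    """Hypothetical helper: wraps one LLM call in a GenAI-convention span."""
    with tracer.start_as_current_span(f"chat {model}") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.provider.name", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.agent.name", agent)
        response = client.messages.create(
            model=model, max_tokens=1024, messages=messages)
        # Token counts come back on the provider response.
        span.set_attribute("gen_ai.usage.input_tokens", response.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", response.usage.output_tokens)
        return response
```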

Example: Unified Trace

A single trace spanning all layers:

```
POST /api/generate                       4.2s   [auto: HTTP]
├─ db.query SELECT context               15ms   [auto: DB]
├─ invoke_agent enrich                   1.8s   [custom: agent]
│  └─ gen_ai.chat claude-sonnet-4        1.7s   [custom: LLM]
│     └─ HTTP POST api.anthropic.com     1.7s   [auto: httpx]
├─ invoke_agent draft                    2.3s   [custom: agent]
│  └─ gen_ai.chat claude-sonnet-4        2.2s   [custom: LLM]
│     └─ HTTP POST api.anthropic.com     2.2s   [auto: httpx]
└─ db.query INSERT result                5ms    [auto: DB]
```
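The nesting falls out of ordinary context propagation: each agent opens a parent span, the LLM helper opens a child span inside it, and the auto-instrumented httpx span nests under that. A sketch reusing the hypothetical traced_chat helper from above:

```python
def invoke_agent(name: str, model: str, messages: list):
    # Parent span for the agent; the gen_ai.chat span and the
    # auto-instrumented httpx span become its children automatically.
    with tracer.start_as_current_span(f"invoke_agent {name}") as span:
        span.set_attribute("gen_ai.operation.name", "invoke_agent")
        span.set_attribute("gen_ai.agent.name", name)
        return traced_chat(model, messages, agent=name)
```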

Key Concepts

GenAI Semantic Conventions

OpenTelemetry defines GenAI semantic conventions for standardized LLM telemetry. Key attributes:

| Attribute | Example | Purpose |
| --- | --- | --- |
| gen_ai.operation.name | "chat" | Operation type |
| gen_ai.provider.name | "anthropic" | LLM provider |
| gen_ai.request.model | "claude-sonnet-4" | Model used |
| gen_ai.usage.input_tokens | 1240 | Tokens consumed |
| gen_ai.usage.output_tokens | 320 | Tokens generated |
| gen_ai.agent.name | "draft" | Agent in pipeline |

GenAI Metrics

Custom metrics for dashboards and alerting:

| Metric | Type | Purpose |
| --- | --- | --- |
| gen_ai.client.token.usage | Histogram | Token consumption by model/agent |
| gen_ai.client.operation.duration | Histogram | LLM call latency |
| gen_ai.client.cost | Counter | Cost in USD by model/agent |
| gen_ai.evaluation.score | Histogram | Output quality scores |
| gen_ai.client.error.count | Counter | Errors by provider/type |
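A sketch of registering and recording these with the OpenTelemetry metrics API; the attribute keys mirror the span attributes above, and the record_llm_metrics helper is invented for illustration:

```python
from opentelemetry import metrics

meter = metrics.get_meter("app.llm")

token_usage = meter.create_histogram(
    "gen_ai.client.token.usage", unit="{token}",
    description="Token consumption by model/agent")
operation_duration = meter.create_histogram(
    "gen_ai.client.operation.duration", unit="s",
    description="LLM call latency")
cost_counter = meter.create_counter(
    "gen_ai.client.cost", unit="usd",
    description="Cost in USD by model/agent")

def record_llm_metrics(response, model: str, agent: str,
                       elapsed_s: float, estimated_usd: float):
    """Hypothetical helper: record one LLM call's metrics."""
    attrs = {"gen_ai.request.model": model, "gen_ai.agent.name": agent}
    token_usage.record(response.usage.input_tokens,
                       {**attrs, "gen_ai.token.type": "input"})
    token_usage.record(response.usage.output_tokens,
                       {**attrs, "gen_ai.token.type": "output"})
    operation_duration.record(elapsed_s, attrs)
    cost_counter.add(estimated_usd, attrs)
```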

Next Steps

  1. Follow the LLM Observability guide for a complete Python setup walkthrough
  2. Follow the Rust LLM Observability guide for Rust AI applications with manual GenAI instrumentation
  3. Follow the Spring AI LLM Observability guide for Java Spring AI applications with three-layer instrumentation
  4. Set up auto-instrumentation for your web framework if you haven't already
  5. Configure the OpenTelemetry Collector to export telemetry to base14 Scout