
AI Observability

Instrument AI and LLM applications with OpenTelemetry to get unified traces that connect HTTP requests, agent orchestration, LLM API calls, and database queries in a single view.

The Problem

Traditional APM tools (Datadog, New Relic) capture HTTP and database telemetry. Specialized AI tools (LangSmith, Weights & Biases) capture LLM traces. Neither shows the full picture:

| Tool Type | Captures | Misses |
| --- | --- | --- |
| Traditional APM | HTTP requests, DB queries, latency | Model name, tokens, cost, prompt content |
| AI-specific tools | LLM calls, prompts, model metadata | HTTP context, DB queries, infrastructure |
| OpenTelemetry | All of the above in one trace | — |

With OpenTelemetry, a single trace shows that a slow HTTP response was caused by a specific LLM call in a specific agent, which also triggered 3 database queries and a fallback to a different provider.

When to Use AI Observability

| Use Case | Recommendation |
| --- | --- |
| Track LLM token usage and costs | AI Observability |
| Monitor agent pipeline performance | AI Observability |
| Evaluate LLM output quality over time | AI Observability |
| Debug slow AI requests end-to-end | AI Observability |
| Attribute costs to agents or business operations | AI Observability |
| Standard HTTP/database monitoring only | Auto-instrumentation |
| Generic custom spans and metrics | Custom instrumentation |

Guides

| Guide | What It Covers |
| --- | --- |
| LLM Observability | End-to-end guide: GenAI semantic conventions, token/cost metrics, agent pipeline spans, evaluation tracking, PII scrubbing, production deployment |

What Gets Instrumented

AI observability builds on top of auto and custom instrumentation, adding an LLM-specific layer:

Auto-Instrumentation Layer (zero code changes)

  • HTTP requests via FastAPI/Django/Flask instrumentors
  • Database queries via SQLAlchemy/Django ORM instrumentors
  • Outbound HTTP via httpx/requests instrumentors (captures raw LLM API calls)
  • Log correlation via logging instrumentor (adds trace_id to logs)
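
A minimal sketch of enabling this layer for a FastAPI service that uses SQLAlchemy and httpx (equivalent instrumentors exist for Django and Flask). It assumes the matching `opentelemetry-instrumentation-*` packages are installed and a tracer provider is already configured; `app`, `engine`, and the DSN are placeholders.

```python
# Minimal sketch: switching on the auto-instrumentation layer.
from fastapi import FastAPI
from sqlalchemy import create_engine

from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.logging import LoggingInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor

app = FastAPI()
engine = create_engine("sqlite:///./app.db")  # placeholder DSN

FastAPIInstrumentor.instrument_app(app)                     # inbound HTTP spans
SQLAlchemyInstrumentor().instrument(engine=engine)          # DB query spans
HTTPXClientInstrumentor().instrument()                      # outbound HTTP, incl. raw LLM API calls
LoggingInstrumentor().instrument(set_logging_format=True)   # adds trace_id to log records
```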

Custom AI Layer (GenAI semantic conventions)

  • LLM spans with model, provider, token counts, cost
  • Prompt/completion events with PII scrubbing
  • Agent spans with pipeline orchestration context
  • Evaluation events with quality scores and pass/fail
  • Cost metrics with attribution by agent and business operation
  • Retry/fallback tracking with error type classification
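
A minimal sketch of what this layer can look like in application code, using the OpenTelemetry tracing API directly. `run_llm_step`, `evaluate_output`, and the 0.7 pass threshold are hypothetical placeholders; the span, attribute, and event names follow the GenAI conventions described below.

```python
# Minimal sketch: an agent span carrying pipeline context and an evaluation event.
from opentelemetry import trace

tracer = trace.get_tracer("app.ai")

def invoke_agent(name: str, prompt: str) -> str:
    with tracer.start_as_current_span(f"invoke_agent {name}") as span:
        span.set_attribute("gen_ai.agent.name", name)

        output = run_llm_step(prompt)            # hypothetical: nested gen_ai.chat span lives here
        score = evaluate_output(prompt, output)  # hypothetical quality scorer

        # Record the quality check as an event on the agent span.
        span.add_event("gen_ai.evaluation", {"score": score, "passed": score >= 0.7})
        return output
```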

Example: Unified Trace

Single trace spanning all layers:

```text
POST /api/generate                       4.2s   [auto: HTTP]
├─ db.query SELECT context               15ms   [auto: DB]
├─ invoke_agent enrich                   1.8s   [custom: agent]
│  └─ gen_ai.chat claude-sonnet-4        1.7s   [custom: LLM]
│     └─ HTTP POST api.anthropic.com     1.7s   [auto: httpx]
├─ invoke_agent draft                    2.3s   [custom: agent]
│  └─ gen_ai.chat claude-sonnet-4        2.2s   [custom: LLM]
│     └─ HTTP POST api.anthropic.com     2.2s   [auto: httpx]
└─ db.query INSERT result                5ms    [auto: DB]
```

Key Concepts

GenAI Semantic Conventions

OpenTelemetry defines GenAI semantic conventions for standardized LLM telemetry. Key attributes:

| Attribute | Example | Purpose |
| --- | --- | --- |
| `gen_ai.operation.name` | `"chat"` | Operation type |
| `gen_ai.provider.name` | `"anthropic"` | LLM provider |
| `gen_ai.request.model` | `"claude-sonnet-4"` | Model used |
| `gen_ai.usage.input_tokens` | `1240` | Tokens consumed |
| `gen_ai.usage.output_tokens` | `320` | Tokens generated |
| `gen_ai.agent.name` | `"draft"` | Agent in pipeline |
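
When creating LLM spans by hand, these attributes map directly onto `span.set_attribute` calls. The sketch below assumes a hypothetical `call_chat_model` wrapper around your LLM SDK that exposes token counts on its result.

```python
# Minimal sketch: a chat span annotated with GenAI semantic-convention attributes.
from opentelemetry import trace

tracer = trace.get_tracer("app.ai")

def traced_chat(messages):
    with tracer.start_as_current_span("gen_ai.chat claude-sonnet-4") as span:
        span.set_attribute("gen_ai.operation.name", "chat")
        span.set_attribute("gen_ai.provider.name", "anthropic")
        span.set_attribute("gen_ai.request.model", "claude-sonnet-4")
        span.set_attribute("gen_ai.agent.name", "draft")

        result = call_chat_model(messages)  # hypothetical LLM SDK wrapper

        span.set_attribute("gen_ai.usage.input_tokens", result.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", result.output_tokens)
        return result
```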

GenAI Metrics

Custom metrics for dashboards and alerting:

| Metric | Type | Purpose |
| --- | --- | --- |
| `gen_ai.client.token.usage` | Histogram | Token consumption by model/agent |
| `gen_ai.client.operation.duration` | Histogram | LLM call latency |
| `gen_ai.client.cost` | Counter | Cost in USD by model/agent |
| `gen_ai.evaluation.score` | Histogram | Output quality scores |
| `gen_ai.client.error.count` | Counter | Errors by provider/type |
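
These instruments can be created once with the OpenTelemetry metrics API and recorded after each LLM call. A minimal sketch, assuming illustrative attribute keys and a made-up per-call cost:

```python
# Minimal sketch: creating the GenAI metrics above and recording a single LLM call.
from opentelemetry import metrics

meter = metrics.get_meter("app.ai")

token_usage = meter.create_histogram(
    "gen_ai.client.token.usage", unit="{token}", description="Tokens per LLM call")
operation_duration = meter.create_histogram(
    "gen_ai.client.operation.duration", unit="s", description="LLM call latency")
cost_counter = meter.create_counter(
    "gen_ai.client.cost", unit="usd", description="LLM cost by model/agent")

attrs = {"gen_ai.request.model": "claude-sonnet-4", "gen_ai.agent.name": "draft"}

# After an LLM call completes:
token_usage.record(1240, {**attrs, "gen_ai.token.type": "input"})
token_usage.record(320, {**attrs, "gen_ai.token.type": "output"})
operation_duration.record(2.2, attrs)
cost_counter.add(0.0125, attrs)  # illustrative cost in USD for this call
```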

Next Steps

  1. Follow the LLM Observability guide for a complete setup walkthrough
  2. Set up auto-instrumentation for your web framework if you haven't already
  3. Configure the OpenTelemetry Collector to export telemetry to base14 Scout