Azure Cosmos DB Monitoring with OpenTelemetry - RU Consumption & Latency

Overview

This guide covers monitoring an Azure Cosmos DB account (SQL / NoSQL API) with the OpenTelemetry Collector's azure_monitor receiver. The collector polls Azure Monitor's REST API every 60 seconds for the metrics published by Microsoft.DocumentDB/databaseAccounts, transforms them to OTel-style names, and ships them via OTLP/HTTP to base14 Scout.

The azure_monitor receiver does not connect to Cosmos directly. It queries Azure Monitor's metrics surface, to which Cosmos (like every Azure resource) publishes automatically — so the same pattern applies to all five Cosmos APIs (SQL, Mongo, Cassandra, Gremlin, Table) and to other Azure services like Storage, Service Bus, and SQL Database. This guide focuses on the SQL API; the configuration shape generalises.

What you'll monitor

Twelve metrics from Microsoft.DocumentDB/databaseAccounts, sufficient for RU consumption, request rate, storage, and availability dashboards. The receiver renames them from Azure's PascalCase (e.g., TotalRequests) to OTel-style azure_<lowercased>_<aggregation> (e.g., azure_totalrequests_count). A single Azure metric with multiple aggregations becomes one OTel metric per aggregation.
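The rename rule is mechanical. A minimal sketch of the convention, illustrative only (the receiver implements this internally):

```python
def otel_metric_name(azure_name: str, aggregation: str) -> str:
    """Map an Azure PascalCase metric + aggregation to the receiver's OTel-style name."""
    return f"azure_{azure_name.lower()}_{aggregation.lower()}"

print(otel_metric_name("TotalRequests", "Count"))        # azure_totalrequests_count
print(otel_metric_name("TotalRequestUnits", "Maximum"))  # azure_totalrequestunits_maximum
```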

| Azure REST name | OTel emitted | Unit | What it tells you |
|---|---|---|---|
| TotalRequests | azure_totalrequests_count | Count | Request rate, with metadata_statuscode / metadata_connectionmode / metadata_operationtype dimensions for slicing 2xx vs 4xx vs 5xx, gateway vs direct |
| TotalRequestUnits | azure_totalrequestunits_{total,average,maximum} | RUs | RU consumption — primary cost driver and capacity-planning input |
| MetadataRequests | azure_metadatarequests_count | Count | Free-of-charge metadata calls (account/database/container introspection) |
| ServerSideLatencyDirect | azure_serversidelatencydirect_* | ms | Server-side latency for direct-mode connections |
| ServerSideLatencyGateway | azure_serversidelatencygateway_* | ms | Server-side latency for gateway-mode connections |
| DataUsage | azure_datausage_{total,average,maximum,minimum} | Bytes | Storage consumed by user data |
| DocumentCount | azure_documentcount_{total,average} | Count | Total document count |
| DocumentQuota | azure_documentquota_{total,average} | Bytes | Storage quota — supersedes the deprecated AvailableStorage |
| IndexUsage | azure_indexusage_{total,average,maximum,minimum} | Bytes | Index storage |
| ProvisionedThroughput | azure_provisionedthroughput_maximum | RUs | Throughput ceiling per database/container |
| NormalizedRUConsumption | azure_normalizedruconsumption_{average,maximum} | Percent | Sliding-window utilisation; rises before throttling actually starts |
| ServiceAvailability | azure_serviceavailability_{average,maximum,minimum} | Percent | Account-level availability (PT1H grain; emitted hourly) |

The latency pair (ServerSideLatencyDirect / *Gateway) emits no series until real request traffic exercises the account. ServerSideLatency (the parent metric) and AvailableStorage are deprecated by Microsoft (Aug 2025 / Sep 2023 respectively); use the *LatencyDirect / *LatencyGateway and DocumentQuota replacements above.

Prerequisites

| Requirement | Minimum |
|---|---|
| A Cosmos DB account (any API) | SQL / Mongo / Cassandra / Gremlin / Table |
| OTel Collector contrib | v0.148.0+ (snake_case YAML keys) |
| Microsoft.DocumentDB provider | registered on the subscription |
| Service principal | Monitoring Reader on the Cosmos RG |
| base14 Scout | any tenant |

This guide is the Cosmos-specific addition to a working OpenTelemetry Collector. For collector deployment and the Scout exporter pieces (which are the same for every Azure surface), see the shared Scout collector setup and the related guides linked at the end of this page.

Access setup

The azure_monitor receiver needs read-only access to Azure Monitor metrics on the resource groups containing your Cosmos accounts. Grant Monitoring Reader to a service principal:

```shell
# Create the SP (once per tenant — reuse it for every Azure surface).
az ad sp create-for-rbac --name sp-otel-azure-monitor --skip-assignment

# Scope Monitoring Reader to each Cosmos resource group.
RG_ID=$(az group show --name <your-rg> --query id -o tsv)
az role assignment create \
  --assignee <appId from the create-for-rbac output> \
  --role "Monitoring Reader" \
  --scope "$RG_ID"
```

Capture appId, password, and tenant from the create output — they become AZURE_CLIENT_ID, AZURE_CLIENT_SECRET, and AZURE_TENANT_ID in the collector's environment. RBAC propagation on the legacy ARM /metrics endpoint is immediate; the data-plane batch API can lag 5-30 minutes (see Operations).

Inside Azure? If your collector runs in Azure (VM, Container Apps, AKS pod), prefer a User-assigned Managed Identity over a service principal. The azure_auth extension supports managed_identity: and workload_identity: modes; only the auth block changes.

Receiver configuration

This is the Cosmos-specific addition to your collector. Add the azure_auth extension and azure_monitor receiver to your existing config, then wire the receiver into a metrics pipeline that exports to Scout (see Scout Exporter for the exporter half — it's the same OAuth2 + OTLP/HTTP setup used by every Azure surface).

otel-collector.yaml (excerpt)
```yaml
extensions:
  azure_auth:
    service_principal:
      tenant_id: ${env:AZURE_TENANT_ID}
      client_id: ${env:AZURE_CLIENT_ID}
      client_secret: ${env:AZURE_CLIENT_SECRET}

receivers:
  azure_monitor:
    subscription_ids: ["${env:AZURE_SUBSCRIPTION_ID}"]
    resource_groups: ["${env:AZURE_RESOURCE_GROUP}"]
    services: ["Microsoft.DocumentDB/databaseAccounts"]
    auth: { authenticator: azure_auth }
    collection_interval: 60s
    use_batch_api: false
    cache_resources: 60
    dimensions: { enabled: true }
    metrics:
      "Microsoft.DocumentDB/databaseAccounts":
        TotalRequests: []
        TotalRequestUnits: []
        MetadataRequests: []
        ServerSideLatencyDirect: []
        ServerSideLatencyGateway: []
        DataUsage: []
        DocumentCount: []
        DocumentQuota: []
        IndexUsage: []
        ProvisionedThroughput: []
        NormalizedRUConsumption: []
        ServiceAvailability: []

processors:
  resource:
    attributes:
      - { key: cloud.provider, value: azure, action: insert }
      - { key: cloud.platform, value: azure_cosmosdb, action: insert }
      - { key: cloud.account.id, value: "${env:AZURE_SUBSCRIPTION_ID}", action: insert }
      - { key: cloud.region, value: "${env:AZURE_REGION}", action: insert }
      - { key: cloud.resource_id, value: "${env:COSMOS_RESOURCE_ID}", action: insert }
      - { key: service.name, value: "${env:SERVICE_NAME}", action: insert }

service:
  extensions: [azure_auth] # plus your existing extensions (oauth2client, etc.)
  pipelines:
    metrics:
      receivers: [azure_monitor]
      processors: [resource, batch] # plus your existing processors
      exporters: [otlphttp/b14] # the Scout exporter from the shared setup
```

Once metrics: is set for a namespace, the receiver only emits the metrics you list — there is no implicit "default + my picks" merge. The empty aggregation list [] per metric collects all aggregations Azure publishes for that metric.
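All five Cosmos APIs publish to the same Microsoft.DocumentDB/databaseAccounts namespace, so only the per-namespace metric map changes between them. A hypothetical excerpt for a Mongo-API account (MongoRequests and MongoRequestCharge are the Mongo equivalents of TotalRequests and TotalRequestUnits; verify the full metric list against Azure Monitor for your account):

```yaml
receivers:
  azure_monitor:
    metrics:
      "Microsoft.DocumentDB/databaseAccounts":
        MongoRequests: []        # request count, Mongo equivalent of TotalRequests
        MongoRequestCharge: []   # RU consumption, Mongo equivalent of TotalRequestUnits
        NormalizedRUConsumption: []
        DataUsage: []
```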

Environment variables

.env
```shell
# From `az ad sp create-for-rbac` output.
AZURE_TENANT_ID=
AZURE_CLIENT_ID=
AZURE_CLIENT_SECRET=

# From your Azure subscription / resource group.
AZURE_SUBSCRIPTION_ID=
AZURE_RESOURCE_GROUP=
AZURE_REGION=
COSMOS_RESOURCE_ID=   # az cosmosdb show -g <rg> -n <account> --query id -o tsv

# Resource attribute defaults.
SERVICE_NAME=azure-cosmosdb
```

Key alerts to configure

Threshold guidance for the most operationally useful series. Tune to your workload and the throughput SKU; these are starting points for a provisioned-throughput SQL-API account with real traffic.

| Metric (OTel name) | Warning | Critical | Why it matters |
|---|---|---|---|
| azure_normalizedruconsumption_maximum | > 70% | > 90% | Sliding-window RU utilisation; rises before 429s actually start. Leading indicator for capacity. |
| azure_totalrequests_count filtered to status 429 | > 0 / 5m | sustained > 0 / 15m | Throttling has started. Scale RU/s, partition the workload, or add retry budget. |
| azure_totalrequestunits_total (per partition key) | > 80% of provisioned | > 95% | Hot-partition signal when one partition dominates total RU/s. |
| azure_serversidelatencydirect_average | > 10ms | > 25ms | Server-side latency for direct-mode connections; user-facing latency depends on this + network. |
| azure_serversidelatencygateway_average | > 25ms | > 50ms | Server-side latency for gateway-mode connections. |
| azure_datausage_maximum | > 80% of azure_documentquota_maximum | > 95% | Approaching storage quota; container or account split may be needed. |
| azure_serviceavailability_minimum | < 100% / 1h | < 99.9% / 1h | Account-level availability (PT1H grain). |

The latency thresholds above are tuned for a healthy single-region account; adjust upward if you operate cross-region with consistency levels stronger than Session. For multi-region accounts, alert on the write-region's latency series specifically — read-region latency naturally tracks the consistency level.
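If your alerting layer speaks Prometheus-style rules, the 429 alert above might be sketched like this. This is hypothetical: adapt the query syntax and label names (metadata_statuscode in particular) to what your backend actually ingests.

```yaml
groups:
  - name: cosmos-throttling
    rules:
      - alert: CosmosRequestsThrottled
        # Any 429s in the last 5 minutes, sustained for 5 minutes, is a warning.
        expr: sum(increase(azure_totalrequests_count{metadata_statuscode="429"}[5m])) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Cosmos DB is returning 429s; scale RU/s or add retry budget"
```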

Operations

  • Collection interval. 60 seconds matches Azure Monitor's 1-3 minute ingestion lag — faster polls just re-read stale data and burn rate-limit budget.
  • cache_resources. This is the receiver's resource-list cache TTL in seconds (default 24h). The shipped config sets it to 60 so newly-created accounts are visible to the receiver on the next poll — appropriate for a validation pass or for environments where accounts come and go frequently. In a stable production fleet, raise it back toward the default (e.g., 3600 or higher) to skip the per-minute ARM resource-list call.
  • RBAC propagation. The legacy ARM /metrics endpoint propagates Monitoring Reader immediately. The newer data-plane batch API at *.metrics.monitor.azure.com requires separate RBAC propagation that can lag 5-30 minutes after grant.
  • Switching to use_batch_api: true raises Azure Monitor's per-tenant query rate ceiling from 12,000 to 360,000 calls/hour. Worth it once you're scraping more than a handful of accounts or polling at higher cadence. Wait for data-plane RBAC to settle before enabling.
  • Filtering metrics. Use metrics: (a namespace-keyed nested map) to whitelist; use dimensions.overrides to drop high-cardinality dimensions like metadata_statuscode if your Scout volume is dominated by per-status-code splits.
  • Multi-region accounts. The receiver scopes by subscription and optional resource-group filters; Azure Monitor publishes metrics globally regardless of the account's write regions. No extra config.
  • Multi-API. The same receiver works against Mongo, Cassandra, Gremlin, and Table-API Cosmos accounts — they all publish to Microsoft.DocumentDB/databaseAccounts. Replace the SQL-API metric set with the API-specific equivalents (e.g., MongoRequests, MongoRequestCharge for Mongo).
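As a back-of-envelope check on the rate ceilings above, here is a rough budget estimate. It (hypothetically) assumes one Azure Monitor API call per resource per poll; the receiver may batch metric names differently, so treat this as an upper-bound sketch:

```python
def monitor_calls_per_hour(resources: int, collection_interval_s: int) -> int:
    """Rough upper bound on Azure Monitor queries issued per hour."""
    return resources * (3600 // collection_interval_s)

# One account at the default 60s poll: far below the legacy 12,000/h ceiling.
print(monitor_calls_per_hour(1, 60))    # 60
# A 300-account fleet at 60s exceeds the legacy ceiling; the batch API's 360,000/h fits easily.
print(monitor_calls_per_hour(300, 60))  # 18000
```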

Troubleshooting

AuthorizationFailed from the receiver

The role assignment hasn't propagated. Wait 60 seconds after creating it; on the legacy ARM endpoint propagation is usually immediate. If you've enabled use_batch_api: true, allow up to 30 minutes for data-plane propagation — or temporarily flip back to false to confirm the role itself is correct.

403 Forbidden from the receiver

The service principal client_secret has expired. Rotate with az ad sp credential reset --id $AZURE_CLIENT_ID --years 1 and update your collector's AZURE_CLIENT_SECRET env var.

No metrics in the first 3 minutes

Azure Monitor's 1-3 minute ingestion lag for newly-provisioned resources. If after 5 minutes you still see nothing in Scout, generate data-plane traffic — the request-counter metrics only emit after the first read or write. Control-plane calls like az cosmosdb show do not drive TotalRequests; you need actual document operations against the account endpoint.
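One way to generate that data-plane traffic is a short script against the azure-cosmos Python SDK. A sketch, assuming a live account: the database and container names are placeholders, and the endpoint and key must come from your account.

```python
import os

def generate_traffic(endpoint: str, key: str, n: int = 10) -> None:
    """Upsert and read back n documents so TotalRequests / TotalRequestUnits start emitting."""
    from azure.cosmos import CosmosClient, PartitionKey  # pip install azure-cosmos
    client = CosmosClient(endpoint, credential=key)
    db = client.create_database_if_not_exists(id="otel-smoke")  # placeholder names
    container = db.create_container_if_not_exists(
        id="items", partition_key=PartitionKey(path="/pk")
    )
    for i in range(n):
        container.upsert_item({"id": str(i), "pk": "smoke"})
        container.read_item(item=str(i), partition_key="smoke")

if __name__ == "__main__" and "COSMOS_ENDPOINT" in os.environ:
    generate_traffic(os.environ["COSMOS_ENDPOINT"], os.environ["COSMOS_KEY"])
```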

RequestThrottled warnings from the receiver

Azure Monitor's per-tenant query rate limit (12,000/hour on the legacy endpoint, 360,000/hour on the batch API). Either lower polling rate (collection_interval: 120s), narrow the scope (resource_groups: filter), or enable use_batch_api: true once data-plane RBAC has settled.

Collector container can't resolve login.microsoftonline.com

Docker Desktop networking glitch — the container's DNS resolver becomes unreachable. docker compose down && docker compose up -d typically fixes it. If persistent, restart Docker Desktop.

Scout OAuth2 returns 401

Verify the SCOUT_CLIENT_ID, SCOUT_CLIENT_SECRET, and SCOUT_TOKEN_URL your collector is using match the values in your Scout console. The endpoint_params.audience MUST be b14collector — that's what the Scout token endpoint expects.
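For reference, the relevant slice of the shared exporter setup. A sketch, assuming the standard oauth2client extension wiring used by the Scout exporter:

```yaml
extensions:
  oauth2client:
    client_id: ${env:SCOUT_CLIENT_ID}
    client_secret: ${env:SCOUT_CLIENT_SECRET}
    token_url: ${env:SCOUT_TOKEN_URL}
    endpoint_params:
      audience: b14collector   # must be exactly this value
```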

Frequently Asked Questions

How do I monitor Azure Cosmos DB with OpenTelemetry?

Run the OpenTelemetry Collector with the azure_monitor receiver targeting Microsoft.DocumentDB/databaseAccounts. The receiver polls Azure Monitor's REST API every 60 seconds, transforms metrics from Azure's PascalCase names (like TotalRequests) to OTel-style names (azure_totalrequests_count), and ships them via OTLP/HTTP to base14 Scout. Authentication uses the azure_auth extension in service-principal or managed-identity mode.

What RBAC role does the receiver need on the Cosmos account?

Monitoring Reader scoped to the resource group is sufficient. It grants read access to metric definitions and metric data without any control-plane write permissions. Reader is not needed unless a specific call returns AuthorizationFailed; Monitoring Reader alone covers the entire azure_monitor receiver surface.

Why do some metrics show no data on a fresh Cosmos account?

Azure Monitor only emits metrics when there is activity to measure. ServerSideLatencyDirect and ServerSideLatencyGateway emit no series until real request traffic exercises the account. TotalRequests and TotalRequestUnits start emitting after the first data-plane call. DataUsage, DocumentCount, ProvisionedThroughput, and ServiceAvailability emit immediately on every account, regardless of traffic.

What is NormalizedRUConsumption?

NormalizedRUConsumption is the per-minute maximum RU/s utilisation expressed as a percentage of provisioned throughput, sliced by partition key range. It rises before throttling actually starts (visible in 429 status codes), making it a leading indicator for capacity decisions. Alert at 80% sustained to give yourself room to scale before requests fail.
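The definition reduces to simple arithmetic. An illustrative calculation (the receiver reports the metric directly; this is only to make the semantics concrete):

```python
def normalized_ru_pct(max_ru_used_per_sec: float, provisioned_ru_per_sec: float) -> float:
    """Peak per-second RU usage in the window, as a percentage of provisioned throughput."""
    return 100.0 * max_ru_used_per_sec / provisioned_ru_per_sec

# A 400 RU/s container that peaked at 350 RU/s sits at 87.5% — above the 80%
# alert line, before any 429s appear.
print(normalized_ru_pct(350, 400))  # 87.5
```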

Should I use the data-plane batch API for higher throughput?

Switch to use_batch_api: true once your service principal RBAC has propagated through Azure Monitor's data plane (5-30 minutes after grant). The batch API raises Azure Monitor's query rate ceiling from 12,000 to 360,000 calls per hour. For a single-account validation pass, leave it false; the legacy ARM /metrics endpoint propagates Monitoring Reader immediately.

How does this differ from Application Insights for Cosmos DB?

Application Insights for Cosmos DB is Azure-tenant-bound, billed per-GB ingested, and visualised in Azure dashboards or workbooks. The OpenTelemetry Collector is vendor-neutral — the same image ships to base14 Scout or any OTLP-compatible backend without redeployment. The metric coverage is identical — both surfaces draw from the same Azure Monitor REST API.

  • Azure SQL Database — sister guide; same azure_monitor pattern, relational-PaaS surface. Pairs with the self-hosted SQL Server guide.
  • Azure Kubernetes Service — sister guide; uses the same azure_monitor receiver pattern but scopes to Microsoft.ContainerService/managedClusters and adds an in-cluster collector pair (kubeletstats DaemonSet + k8s_cluster Deployment).
  • AWS RDS PostgreSQL — equivalent guide for AWS managed PostgreSQL. Uses CloudWatch Metrics Stream (push) for infrastructure metrics plus the OTel PostgreSQL receiver for database internals; a hybrid pattern.