Skip to main content

Azure Monitoring with OpenTelemetry - Architecture for base14 Scout

Overview

This guide is the architectural landing for monitoring Azure infrastructure with base14 Scout through the OpenTelemetry Collector. It explains how Scout consumes signals from Azure Monitor, what to expect on freshness and limits, where Workload Identity Federation fits, and what is and is not in scope (the trace gap, in particular). For execution, jump to the per-surface guides linked below.

The reader profile is DevOps and SRE engineers running production Azure workloads, evaluating or operating Scout, who want to understand the shape of the pipeline before configuring it for any of the Azure surfaces covered in the per-surface guides below.

TL;DR

base14 Scout consumes Azure telemetry through two parallel paths and production deployments typically run both. Metrics pull via the azure_monitor receiver against Azure Monitor's REST API, and a metrics-and-logs push via Diagnostic Settings to Event Hubs, consumed by the azure_event_hub receiver. Authentication is via the azureauthextension; Workload Identity Federation is the recommended default where the runtime supports it, with managed identity, service principal, and default-credential chain also supported. Metric freshness is a few minutes when pulling from the Metrics REST API (3-minute export floor plus your scrape interval) and single-digit seconds when consuming from Event Hubs. Azure infrastructure resources do not emit distributed traces; instrument your application with the Azure Monitor OpenTelemetry Distro to get traces into Scout.

The Azure observability landscape

Every Azure resource emits two signal types natively:

  • Platform metrics - time-series numeric data (CPU, RU consumption, request rates, latency percentiles). Stored in the Azure Monitor metrics database. Available without configuration. Free at the platform tier.
  • Resource logs - structured event data (audit, query store, WAF blocks, AKS control-plane events). Off by default; enabled per-resource via Diagnostic Settings. Billed at the destination (Log Analytics, Event Hubs, or Storage).

Activity logs (subscription-level control-plane events: deployments, RBAC changes, policy assignments) are a third signal in the same family, collected automatically and exported via Diagnostic Settings.

There is a fourth signal Azure does not emit at the infrastructure layer: distributed traces. Traces in the Azure ecosystem come from your application code, not from the managed services it talks to. We cover the implications in What about traces? below.

Microsoft's Application Insights overview recommends the Azure Monitor OpenTelemetry Distro for code-based server-side instrumentation: "For most code-based server-side scenarios, the recommended setup uses the Azure Monitor OpenTelemetry Distro." That puts customer applications on OTel-shaped emission natively. Scout is the OTel-native destination such applications can target without going through Azure Monitor as a middleman. The vendor-neutral pipeline this guide describes makes that destination swap a configuration change rather than a re-instrumentation.

Architecture at a glance

┌───────────────────── Azure ────────────────────┐ ┌────── base14 Scout ────────┐
│ │ │ │
│ Per-resource Diagnostic Setting │ │ │
│ ├── push metrics ─┐ │ │ │
│ └── push logs ──┼──▶ Event Hubs namespace │──▶ │ azure_event_hub receiver │
│ │ (per region) │ │ (push, beta) │
│ │ │ │ │
│ │ │ │ azure_monitor receiver │
│ Metrics REST API ◀──┴─────────────────────────┼────│ (pull, alpha) │
│ (use_batch_api: true) │ │ │
│ │ │ azureauthextension │
│ │ │ (alpha; WIF / MI / SP) │
└────────────────────────────────────────────────┘ │ │
│ ───────── OTLP ────────▶ │
│ │
└────────────────────────────┘

Three pieces, one picture:

  • azure_monitor receiver polls Azure Monitor's Metrics REST API on a scrape interval. Use the batch API (use_batch_api: true) to raise the rate ceiling from 12,000 to 360,000 calls per hour per subscription.
  • azure_event_hub receiver consumes from an Event Hubs namespace that Diagnostic Settings pushes into. Decodes Azure Resource Logs and platform metrics from the Event Hub payload (platform metrics arrive as Gauge points with Total / Min / Max / Avg / Count datapoints, per the receiver's native Azure decoder); beta stability for both signals.
  • azureauthextension holds one identity per shard, federated to the runtime that hosts the collector (most often an AKS service account). Every receiver references it via auth.authenticator: azure_auth.

Per-surface guides spell out the exact YAML for each piece.

Authentication

The azureauthextension supports four authentication methods, all peers from the extension's perspective: managed identity, workload identity (federated, the case where Microsoft Entra ID trusts a token from an external identity provider), service principal with client secret or certificate, and the Azure default credential chain. Pick the one that matches how your collector runtime authenticates today.

For collectors running in AKS, EKS, GKE, on-prem Kubernetes, GitHub Actions, or any runtime Microsoft Entra ID can federate with, Workload Identity Federation (WIF) is the recommended default. It eliminates client secrets entirely, which removes the largest silent-zero failure mode (expired SP secrets are indistinguishable from "no traffic" on dashboards). Microsoft documents the value plainly in Microsoft Entra Workload Identity Federation:

"These credentials pose a security risk and have to be stored securely and rotated regularly. You also run the risk of service downtime if the credentials expire."

"You eliminate the maintenance burden of manually managing credentials and eliminates the risk of leaking secrets or having certificates expire."

Service principal with client secret remains fully supported for deployments where federation is not available (e.g., shrink-wrapped on-prem environments, runtimes outside the federation matrix); rotate the secret on a schedule and alarm on expiry.

Reference snippet (used by every per-surface guide):

extensions:
azure_auth:
workload_identity:
client_id: ${env:AZURE_CLIENT_ID}
tenant_id: ${env:AZURE_TENANT_ID}
federated_token_file: /var/run/secrets/azure/tokens/azure-identity-token

receivers:
azure_monitor:
auth:
authenticator: azure_auth
# ... rest of config

The federated token is short-lived (typically one hour) and renewed automatically by the cluster's projected service account token. There is no expiry alarm because there is no static credential.

Choosing pull, push, or both

AspectPull (azure_monitor)Push (azure_event_hub)
What it consumesMetrics REST APIDiagnostic Settings → Event Hubs
Stabilityalphabeta
Freshness~3 minutes export + receiver collection_intervalSingle-digit seconds
Volume ceiling360,000 API calls/hour/subscriptionBounded by Event Hubs throughput units
Best forSlow / definitive metrics; resource types without streaming support; drift detectionHigh-volume metrics; resource logs; any signal you alert on
Per-resource configNone per resourceOne Diagnostic Setting per resource
Cost driverFree (REST API reads)Event Hubs namespace + throughput units

The decision rule for a real production deployment is both. Pull with azure_monitor for the metrics that change slowly (storage size, capacity, billing-class counters) and as the safety net that confirms the pushed data matches Azure's own metrics database. Use the Event Hubs push for the metrics and logs your dashboards and alert rules actually depend on.

The largest single resilience improvement available today is adding the Event Hubs push to a pull-only deployment. It cuts REST read volume 5-10x for a typical multi-surface customer, drops alerting freshness to seconds, and unblocks resource log ingestion at the same time.

Latency expectations by signal

End-to-end latency for the signal types Scout consumes from Azure. The platform-metrics (pull) and resource-logs notes quote Microsoft's published ingestion-time documentation; the inactive-backoff note is from the Diagnostic Settings reference.

SignalLatency floorNotes
Platform metrics (azure_monitor pull)~3 minutes export + your scrape interval"Available in under a minute in the metrics database, but they take another three minutes to be exported to the data collection endpoint"
Platform metrics (Event Hubs push)Single-digit secondsDiagnostic Settings to Event Hubs is push-mode through Microsoft's own pipeline
Resource logs3 to 10 minutes typical"Azure SQL Database and Azure Virtual Network currently provide their logs every five minutes"
Activity logs3 to 20 minutesSubscription-level control-plane events
Inactive resource backoffup to 15 minutes after 1 hour idle; up to 2 hours after 7 days idleDiagnostic Settings backs off zero-value emissions to reduce export cost

Two operational consequences worth surfacing in customer dashboards:

  • No real-time alerting on resource logs. Design alert rules with 5-10 minute freshness as the floor for resource logs and 5-25 minutes for activity logs. Real-time alerting needs the Event Hubs push.
  • Inactive resource backoff is not a bug. An idle resource's exported metrics back off incrementally and resume the normal ~3-minute latency once nonzero values return; this affects exported metrics only, not metrics-based alerts or autoscale. Mark expected-idle resources on dashboards so the gap does not page as "data missing".

Limits, throttling, and cost framing

The numbers that drive the design, drawn from Azure Monitor service limits, the Diagnostic Settings reference, and the Metrics Data Plane API behaviour (see the throttling FAQ below). The Source column carries the precise per-row attribution:

LimitDefaultWith batch APISource
Metrics REST API reads / hour / subscription12,000360,000Metrics Data Plane API
Resources per metrics:getBatch requestn/a50Receiver README
Diagnostic settings per resource55 (max)Service limits
Logs Ingestion API requests / minute / DCR12,000n/aService limits
Log Analytics ingestion volume rate threshold500 MB compressed (~6 GB/min uncompressed)n/aService limits

Three constraints worth budgeting around explicitly:

  • use_batch_api: true is non-negotiable for any production azure_monitor config. The 360,000 calls/hour ceiling is what makes multi-surface monitoring viable; the 12,000 default will throttle on any real estate.
  • 5 Diagnostic Settings per resource is a hard cap. Onboarding scripts must check existing settings count before provisioning a new one. If a customer is already on 5, the Scout setting either replaces a duplicate or merges destinations.
  • Same-region constraint: Event Hubs and Storage destinations must be in the same region as the resource being monitored. The recommended topology is one Event Hubs namespace per region per subscription, not a single global hub. A multi-region, multi-subscription customer ends up with R × S Event Hubs namespaces; each per-surface guide shows the Bicep / IaC for one of them and you fan out from there.
  • Networking caveat for VNet-bound destinations: when an Event Hubs namespace or Storage account has VNet rules enabled, Diagnostic Settings cannot reach it unless the namespace also has "Allow trusted Microsoft services" set. The Diagnostic Settings reference flags this explicitly. If your destinations are public-endpoint, this does not apply.
  • Diagnostic Settings scope varies by surface. Most surfaces attach a Diagnostic Setting per resource; a few (Compute, App Service, Logic Apps) use a single subscription-scope setting instead. The per-surface guide states which and gives the exact command.

Cost framing

Sending Azure metrics and logs through Event Hubs adds one new Azure cost and one Scout cost:

  • Event Hubs namespace and throughput units. One throughput unit handles roughly 1 MB/s ingress or ~1,000 events per second; size it to your Diagnostic Settings volume.
  • Scout-side ingest, per the Scout pricing page.

An existing Log Analytics workspace kept running alongside costs the same as before: a single Diagnostic Setting can send to a workspace and an Event Hubs namespace at the same time, so the workspace is neither added to nor removed by this setup.

What about traces?

Azure infrastructure resources do not emit OpenTelemetry traces. There is no infrastructure-level distributed trace describing, for example, "this Cosmos DB request was served by partition X, replicated to region Y, took 12 ms in the index lookup." That kind of telemetry exists inside Microsoft's fleet but is not exposed to customers in OTel trace format on any documented Azure surface.

Distributed traces in the Azure ecosystem come from application instrumentation:

  • Azure Monitor OpenTelemetry Distro - Microsoft's recommended path for code-based server-side instrumentation. Outputs OTLP-shaped traces that target Scout directly without going through Azure Monitor.
  • Vanilla OpenTelemetry SDKs for any language Scout's App Instrumentation guides cover.
  • Application Insights JavaScript SDK for browser apps - per Microsoft's overview, the JS SDK is not OpenTelemetry. Browser telemetry is a separate path from this guide.

The trace path is therefore:

Customer application

│ instrumented with OTel SDK / Azure Monitor Distro

OTLP traces ───── direct ─────▶ base14 Scout

The Azure Monitor pipeline this guide describes is not in that path. The per-surface guides cover infrastructure metrics and logs only.

The one transitional case where Azure Monitor does carry traces is the Application Insights migration window: customers transitioning off Application Insights can dual-emit AppRequests and AppDependencies records through Diagnostic Settings to Event Hubs, where the azure_event_hub receiver decodes them. This is a bridge during cutover, not a sustainable architecture; the long-term path is the Distro shipping OTLP directly.

Per-surface guides

Pick the guide that matches the resource you're configuring. Each is the execution playbook for that surface, including exact YAML, RBAC scope, metric tables, and surface-specific gotchas.

SurfaceWhat you'll monitorGuide
Azure Kubernetes Service (AKS) - in-cluster patternPod / node / cluster-state metrics plus zero-code app auto-instrumentation, via the OTel Operator (DaemonSet + Cluster Deployment + kube-state-metrics)AKS guide
AKS via HelmSame AKS scope, deployed through the OTel Helm chartAKS with Helm guide
Azure ComputeVM, VM Scale Set, and Managed Disk metrics (CPU, network, disk IOPS / bytes, available memory) plus the subscription-scope Activity Log pathCompute guide
Azure App ServiceControl-plane request and health-check metrics plus HTTP and platform logsApp Service guide
Azure Cosmos DB (SQL / NoSQL API)RU consumption, request rates, server-side latency, document count, storage, availabilityCosmos DB guide
Azure SQL DatabaseDTU / vCore utilisation, connections, deadlocks, storage, geo-replication lagSQL Database guide
Azure Database for PostgreSQL (Flexible Server)Platform metrics, in-database scrape (postgresqlreceiver), and resource logsPostgreSQL guide
Azure Database for MySQL (Flexible Server)Platform metrics, in-database scrape (mysqlreceiver), and resource logsMySQL guide
Azure Cache for RedisHit rate, memory pressure, server-load saturation, connection cap, Premium-tier diagnostic logsCache for Redis guide
Azure Service BusActive / dead-letter message counts, throughput, request counts, server errorsService Bus guide
Azure Event HubsThroughput, connections, Capture metrics, partition cardinalityEvent Hubs guide
Azure Logic Apps (Consumption)Run / action / trigger lifecycle, billable-execution metrics, per-action workflow logsLogic Apps guide
Azure Front Door (Standard / Premium)Request count, request size, response size, latency, WAF metricsFront Door guide
Azure Application Gateway (v2 / WAF_v2)Throughput, healthy / unhealthy host count, response status, backend latency; WAF_v2 rule matchesApplication Gateway guide
Azure Load Balancer (Standard SKU)VIP / DIP availability, SYN count, byte / packet counters, SNAT-port exhaustionLoad Balancer guide
Azure Firewall (Standard SKU)Rule-hit metrics, throughput, SNAT-port utilisation, threat-intel modeAzure Firewall guide
Azure Storage (Blob / Queue / Table / File)Multi-namespace transaction, latency, and capacity metrics across sub-servicesStorage guide
Azure API ManagementGateway and backend latency, request counts, rate-limit attribution, Diagnostic Settings logsAPI Management guide
Azure Key VaultAPI latency, throttling-proximity saturation, per-secret AuditEvent forensicsKey Vault guide

Every per-surface guide assumes you have read this overview for the shared concepts (auth, push vs pull, latency, trace gap) and links back here for those topics rather than re-explaining.

How these guides stay current

The OpenTelemetry collector-contrib azure_monitor and azure_event_hub receivers move quickly, and Microsoft renames or deprecates platform metrics from time to time. Scout tracks both upstream changelogs as part of weekly maintenance, pins the validated contrib image in the example configs, and refreshes per-surface guides when receiver behaviour changes. When you copy a config snippet from a per-surface guide, the version it was validated against is specified - verify behaviour against the versions you run.

Frequently Asked Questions

How does base14 Scout consume Azure Monitor metrics?

Two ways. The azure_monitor receiver pulls from Azure Monitor's REST API on a configurable interval. For higher volume and lower freshness, Diagnostic Settings push metrics to an Event Hubs namespace, where the azure_event_hub receiver consumes them and forwards via OTLP to Scout. Most production deployments run both: pull slow or definitive metrics with azure_monitor, and consume fast metrics like latency and throttle counters from Event Hubs.

Do I need to install an agent on my Azure resources?

No. Both paths use Azure-native interfaces (Metrics REST API and Diagnostic Settings) that every Azure resource exposes. You run the OpenTelemetry Collector somewhere that can reach those interfaces, typically inside an AKS cluster or as an Azure Container Apps job. Nothing is installed onto Cosmos DB, SQL Database, AKS control plane, or any other managed resource.

What permissions does Scout need on my Azure subscription?

When pulling metrics with azure_monitor: Monitoring Reader on each subscription you want to scrape. That role grants read access to metric definitions and metric values without any control-plane write permissions. When pushing through Diagnostic Settings to Event Hubs: the customer's IaC needs Monitoring Contributor (or higher) at the time it provisions Diagnostic Settings; Scout's runtime collector needs Manage, Send, and Listen on the Event Hubs shared access policy it consumes from. Both are documented inline in each per-surface guide.

How fresh is metric data once it reaches Scout?

Pulling from the Metrics REST API: a few minutes end-to-end. Microsoft documents that platform metrics are available in the metrics database in under a minute, then take another three minutes to be exported to a data collection endpoint; the receiver's collection_interval sits on top of that and is yours to tune. Consuming from Event Hubs: single-digit seconds in steady state. Use the Event Hubs push for any signal you want to alert on quickly.

Why do I not see distributed traces from Cosmos DB, SQL Database, or AKS?

Azure infrastructure resources do not emit OpenTelemetry traces describing their own internal operations. Distributed traces come from your application code, instrumented with the Azure Monitor OpenTelemetry Distro or a vanilla OTel SDK, sent over OTLP directly to Scout. The Azure Monitor pipeline this guide describes carries metrics and logs only. Tracing your application code is a separate setup covered in the App Instrumentation guides.

Can Scout coexist with my existing Log Analytics workspace?

Yes. A single Diagnostic Setting can fan out to a Log Analytics workspace and an Event Hubs namespace simultaneously. Customers commonly keep their Azure portal KQL workbooks running while Scout becomes the primary observability surface. The data is duplicated, the cost on the Log Analytics side is unchanged, and Scout adds the Event Hubs charge on top.

Should I use service principal secrets or workload identity?

If the runtime hosting your collector can federate with Microsoft Entra ID (AKS, EKS, GKE, on-prem Kubernetes with workload identity, GitHub Actions, Azure Pipelines), Workload Identity Federation is the recommended default - it eliminates client secrets and the silent-zero failure mode that comes with secret expiry. Service principal with client secret is fully supported for runtimes where federation is not available; rotate the secret on a schedule and alarm on expiry. Managed identity and the default credential chain are also supported by the azureauthextension and pick themselves in the right runtimes.

How does Azure Monitor REST API throttling work, and how do I avoid 429s?

Set use_batch_api: true on the azure_monitor receiver. That switches to the Metrics Data Plane API (metrics:getBatch), which raises the per-subscription rate ceiling from 12,000 to 360,000 calls per hour and lets a single REST call fetch metrics for up to 50 resources. Tune the scrape interval and the resource set per shard so the projected call rate stays inside that budget. If you operate at fleet scale and want explicit headroom, shard by (subscription, region) so each shard's ceiling is independent.

What is the difference between the azure_monitor and azure_event_hub receivers?

azure_monitor pulls. It calls Azure Monitor's REST API on a schedule and emits OTel metrics. Use it for slow or definitive signals and for resource types that do not yet support streaming export. azure_event_hub consumes. It reads from an Event Hubs namespace that Diagnostic Settings has been configured to push into, and it decodes Azure Resource Logs and platform metrics natively. Use it for high-volume, low-latency signals and for any resource log surfacing in Scout. Production deployments run both.

Does Scout support Azure Government or Azure China clouds?

Yes. Both the azure_monitor and azure_event_hub receivers support Azure Government, Azure China, and the Azure US Government cloud variants by setting the cloud parameter and the appropriate management endpoint. The auth model is identical to Azure public cloud. Per-surface guides note any region-specific caveats for the resource type they cover.

  • App instrumentation for distributed traces - see App Instrumentation for OTel SDK setup in Node, Python, .NET, Java, Go, PHP, and Ruby.
  • base14 Scout - the OTel-native observability platform this guide is written for. See base14.io for platform details and pricing.

References

Was this page helpful?