etcd

etcd exposes Prometheus-format metrics at /metrics on its client port (2379). The OpenTelemetry Collector scrapes this endpoint with the Prometheus receiver. On etcd 3.6+ that yields 130+ metrics, 82 of them etcd_*, covering leader and liveness state, Raft consensus, disk fsync and backend commit latency, MVCC storage, and gRPC requests. This guide configures the receiver, connects it to an etcd node, and ships the metrics to base14 Scout.

Prerequisites

| Requirement | Minimum | Recommended |
| --- | --- | --- |
| etcd | 3.6 | 3.6.12 |
| OTel Collector Contrib | 0.90.0 | 0.153.0 |
| base14 Scout | Any | - |

Before starting:

etcd's client port (2379) must be reachable from the host running the Collector. The /metrics endpoint is served there.
etcd serves /metrics over HTTP with no authentication by default; production deployments typically serve it over HTTPS with mTLS - see Access Setup.
A Scout account and OTLP endpoint.
OTel Collector installed - see Docker Compose Setup.

What You'll Monitor

Metrics are grouped into three tiers by how you use them. Scrape Core always, alert on Operational, and reach for Diagnostic during an incident or capacity review.

Core - is it up and serving

| Metric | What it tells you |
| --- | --- |
| `up` | Scrape succeeded - monitoring itself is alive. |
| `etcd_server_has_leader` | 1 if this member sees a leader; 0 means the cluster cannot serve writes. The single load-bearing liveness signal. |
| `etcd_server_proposals_committed_total` | Committed Raft proposals - the cluster is making progress (write throughput). |

Operational - what to alert on

| Metric | What it tells you |
| --- | --- |
| `etcd_server_leader_changes_seen_total` | Leader-election churn; sustained increases are pathological (Raft instability). |
| `etcd_server_proposals_failed_total` | Failed proposals - leader loss or quorum problems. |
| `etcd_server_proposals_pending` | Proposal backlog; a non-zero sustained value is saturation. |
| `etcd_server_health_failures` | Server health-check failures. |
| `etcd_server_heartbeat_send_failures_total` | Leader could not send heartbeats - peer-link or disk stall. |
| `etcd_server_slow_apply_total` | Applies that exceeded the slow threshold - disk or CPU saturation. |
| `etcd_server_slow_read_indexes_total` | Slow linearizable reads. |
| `etcd_server_read_indexes_failed_total` | Failed read-index requests. |
| `etcd_disk_wal_fsync_duration_seconds` | WAL fsync latency - etcd's primary disk-health signal. |
| `etcd_disk_backend_commit_duration_seconds` | Backend (bbolt) commit latency. |
| `etcd_mvcc_db_total_size_in_bytes` | On-disk DB size - tracked against the backend quota. |
| `etcd_server_quota_backend_bytes` | Configured backend quota; the denominator for the space-used alert. |
| `etcd_network_known_peers` | Known cluster peers - membership/dependency health. |

Diagnostic - for investigation and tuning

These series are higher-cardinality or belong to the debugging namespace. Drop them in production with metric_relabel_configs while keeping Core and Operational.

| Group | Metrics | When you reach for it |
| --- | --- | --- |
| Debugging namespace | all `etcd_debugging_*` (lease_, mvcc_, snap_save_, store_, auth_revision) | Deep Raft/MVCC/lease/store internals during an incident. |
| MVCC operations | `etcd_mvcc_put_total`, `_range_total`, `_delete_total`, `_txn_total`, `_db_total_size_in_use_in_bytes`, `_db_open_read_transactions`, `_hash_duration_seconds`, `_hash_rev_duration_seconds` | Keyspace op mix, fragmentation (size vs size_in_use), compaction cost. |
| Disk (deep) | `etcd_disk_wal_write_bytes_total`, `_wal_write_duration_seconds`, `_backend_defrag_duration_seconds`, `_backend_snapshot_duration_seconds`, `_defrag_inflight` | WAL write volume; defrag/snapshot timing. |
| Snapshot | `etcd_snap_db_fsync_duration_seconds`, `_db_save_total_duration_seconds`, `etcd_snap_fsync_duration_seconds` | Snapshot persistence latency. |
| Apply / range timing | `etcd_server_apply_duration_seconds`, `etcd_server_range_duration_seconds`, `etcd_server_client_requests_total` | Per-op latency distribution. |
| gRPC proxy | `etcd_grpc_proxy_*` (cache_hits/misses/keys, events/watchers_coalescing) | Only when running the gRPC proxy. |
| Client network | `etcd_network_client_grpc_received_bytes_total`, `_sent_bytes_total` | Client traffic volume. |
| Inventory / state | `etcd_server_id`, `_version`, `_go_version`, `etcd_server_is_leader`, `_is_learner`, `_learner_promote_successes`, `_feature_enabled`, `_snapshot_apply_in_progress_total`, `etcd_cluster_version` | Identity, version, leader/learner state. |
| Runtime | `go_`, `process_`, `grpc_server_`, `os_fd_`, `promhttp_`, `scrape_` | Go/process/gRPC/scrape health. (`up` belongs to the Core tier.) |

Full metric list: see the etcd metrics reference, or run curl -s http://localhost:2379/metrics against your etcd instance.

Key Alerts to Configure

Threshold guidance for the most useful Operational-tier series. The disk-latency numbers are etcd's documented operational guidance (etcd hardware/ops docs); tune to your storage.

| Metric | Warning | Critical | Why it matters |
| --- | --- | --- | --- |
| `etcd_server_has_leader` | - | `== 0` | No leader: cluster cannot serve writes. Investigate quorum / peer links immediately. |
| `rate(etcd_server_leader_changes_seen_total[5m])` | `> 0` sustained | Rising across windows | Raft instability; check disk latency and network between peers. |
| `rate(etcd_server_proposals_failed_total[5m])` | `> 0` sustained | Rising | Quorum or leader problems; correlate with leader changes. |
| `etcd_server_proposals_pending` | `> 0` sustained | Growing | Apply pipeline backed up; check disk saturation. |
| `etcd_disk_wal_fsync_duration_seconds` (p99) | `> 10ms` | `> 25ms` | Slow WAL fsync stalls consensus. Move etcd to faster disk / dedicate IO. |
| `etcd_disk_backend_commit_duration_seconds` (p99) | `> 25ms` | `> 50ms` | Slow backend commits; same disk-IO remedy. |
| `etcd_mvcc_db_total_size_in_bytes / etcd_server_quota_backend_bytes` | `> 0.80` | `> 0.95` | Approaching the backend quota; a NOSPACE alarm halts writes. Defrag / raise quota / compact. |
| `rate(etcd_server_heartbeat_send_failures_total[5m])` | `> 0` | Sustained | Leader can't heartbeat peers; disk stall or partition. The metric is a counter, so alert on its rate rather than its raw value. |

The two *_duration_seconds rows are Prometheus histograms - there is no ready-made p99 series to threshold. Compute the percentile from the histogram buckets in your alert rule, for example histogram_quantile(0.99, rate(<metric>_bucket[5m])), rather than alerting on a p99 series directly.
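To make the bucket arithmetic concrete, here is a minimal shell sketch of the linear interpolation that histogram_quantile performs. The function name `histogram_quantile_stdin` is hypothetical. It assumes the buckets arrive in ascending `le` order, that the metric carries no labels other than `le`, and that the quantile is greater than 0. It works on the lifetime cumulative counts in a single scrape, not a 5-minute rate, so treat it as a spot check rather than a replacement for an alert rule.

```shell
# Estimate a quantile from Prometheus histogram buckets read on stdin.
# Usage: histogram_quantile_stdin <quantile> <metric_name>
# Assumption: buckets are listed in ascending le order with le as the only label.
histogram_quantile_stdin() {
  awk -v q="$1" -v name="$2" '
    index($0, name "_bucket{") == 1 {
      # Extract the upper bound from the le label; the last field is the cumulative count.
      le = $0
      sub(/.*le="/, "", le)
      sub(/".*/, "", le)
      n++
      bound[n] = le
      count[n] = $NF + 0
    }
    END {
      if (n == 0 || count[n] == 0) { print "NaN"; exit }
      rank = q * count[n]
      for (i = 1; i <= n; i++) {
        if (count[i] >= rank) {
          # A quantile that lands in the +Inf bucket reports the highest finite bound.
          if (bound[i] == "+Inf") { print bound[i - 1]; exit }
          lower = (i > 1) ? bound[i - 1] : 0
          prev = (i > 1) ? count[i - 1] : 0
          printf "%.6f\n", lower + (bound[i] - lower) * (rank - prev) / (count[i] - prev)
          exit
        }
      }
    }'
}
```

For example, `curl -s http://localhost:2379/metrics | histogram_quantile_stdin 0.99 etcd_disk_wal_fsync_duration_seconds` prints an approximate p99 in seconds for everything observed since the etcd process started.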

Access Setup

Verify your etcd instance is accessible and serving metrics:

Verify access
# Check cluster health
etcdctl endpoint health

# List all keys (empty cluster returns nothing)
etcdctl get "" --prefix --keys-only

# Verify metrics endpoint
curl -s http://localhost:2379/metrics | head -20

No authentication is required for the /metrics endpoint on an http client port. Production etcd runs https with mTLS - the scrape job needs client certificates, configured in Configuration below.

Port conflict in Kubernetes

A standalone etcd defaults to client port 2379, the same port the Kubernetes control-plane etcd uses. If both run on the same host, remap the host port in Docker Compose (for example 12379:2379) or target the non-Kubernetes etcd address directly.

Configuration

config/otel-collector.yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: etcd
          scrape_interval: 10s
          static_configs:
            - targets:
                # host:port etcd's /metrics is reachable on
                - ${env:ETCD_HOST}:${env:ETCD_PORT}

processors:
  resource:
    attributes:
      - key: deployment.environment.name
        value: ${env:ENVIRONMENT}
        action: upsert
      - key: environment
        value: ${env:ENVIRONMENT}
        action: upsert
      - key: service.name
        value: ${env:SERVICE_NAME}
        action: upsert

  batch:
    timeout: 10s
    send_batch_size: 1024

exporters:
  otlphttp/b14:
    endpoint: ${env:OTEL_EXPORTER_OTLP_ENDPOINT}
    tls:
      insecure_skip_verify: true

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [resource, batch]
      exporters: [otlphttp/b14]

Environment Variables

.env
ETCD_HOST=localhost
# Port etcd's /metrics is reachable on: 2379 in-cluster or in-network. If etcd
# runs in a container that remaps the port on the host - for example to avoid
# the Kubernetes control-plane etcd, also on 2379 - set this to that port.
ETCD_PORT=2379
ENVIRONMENT=your_environment
SERVICE_NAME=your_service_name
OTEL_EXPORTER_OTLP_ENDPOINT=https://<your-tenant>.base14.io

TLS for production etcd

Production etcd serves metrics over https with mTLS. Add the scheme and client certificates to the scrape job, and mount the certificate files into the Collector container:

config/otel-collector.yaml (TLS)
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: etcd
          scheme: https
          tls_config:
            ca_file: /certs/ca.pem
            cert_file: /certs/client.pem
            key_file: /certs/client-key.pem
          static_configs:
            - targets:
                - ${env:ETCD_HOST}:${env:ETCD_PORT}

Controlling metric volume

etcd exposes 130+ metrics including the etcd_debugging_* namespace, Go runtime, and Prometheus internals. The Prometheus receiver scrapes the full /metrics surface with no whitelist - every series etcd exposes flows through, and new series appear automatically after an etcd upgrade with no config change. To drop the Diagnostic tier in production while keeping Core + Operational, add a metric_relabel_configs block to the scrape job:

config/otel-collector.yaml (filter)
          metric_relabel_configs:
            - source_labels: [__name__]
              regex: "etcd_debugging_.*"
              action: drop

Semconv version note: deployment.environment.name is the current dotted OTel attribute. Scout's UI filters on the lowercase environment key, so emit it alongside the OTel-native deployment.environment.name. The legacy deployment.environment is still accepted for backward compatibility.

Verify the Setup

Start the Collector and check for metrics within 60 seconds:

Verify metrics collection
# Check Collector logs for scraped etcd metrics
docker logs otel-collector 2>&1 | grep -i "etcd"

# Verify etcd is healthy
etcdctl endpoint health

# Check the leader signal directly on the metrics endpoint
curl -s http://localhost:2379/metrics | grep etcd_server_has_leader

# Generate write traffic so proposal counters advance
etcdctl put app/example value

Troubleshooting

Connection refused on port 2379

Cause: Collector cannot reach etcd at the configured address.

Fix:

Verify etcd is running: docker ps | grep etcd or systemctl status etcd.
Confirm --listen-client-urls includes the address the Collector connects to.
Check firewall rules if the Collector runs on a separate host.

Metrics endpoint returns empty or 404

Cause: etcd is configured with --listen-metrics-urls, which moves the metrics endpoint to a different address.

Fix:

Check whether --listen-metrics-urls is set: ps aux | grep etcd | grep listen-metrics.
If set, update the scrape target to match that address and port.
If not set, metrics are served on the client port (2379).
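The check above can be scripted. The sketch below pulls the flag value out of an etcd command line so it can be used as the scrape target. The function name `listen_metrics_urls` is hypothetical, and it only inspects command-line flags; if your deployment sets the option through an environment variable or a config file, this will print nothing even though the option is active.

```shell
# Print the value of --listen-metrics-urls from an etcd command line on stdin,
# or nothing if the flag is absent. Handles "--flag=value" and "--flag value".
listen_metrics_urls() {
  # Put each argument on its own line, then pick out the flag value.
  tr -s '[:space:]' '\n' | awk '
    want { print; exit }
    /^--listen-metrics-urls=/ { sub(/^[^=]*=/, ""); print; exit }
    $0 == "--listen-metrics-urls" { want = 1 }'
}
```

A possible invocation is `ps aux | grep '[e]tcd' | listen_metrics_urls`. An empty result means the flag was not found on the command line, in which case metrics should be on the client port.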

Consensus is unstable or writes stall

Cause: Slow disk fsync or peer-link problems are destabilising Raft.

Look at: etcd_disk_wal_fsync_duration_seconds and etcd_disk_backend_commit_duration_seconds (Operational disk latency); etcd_server_leader_changes_seen_total and etcd_server_heartbeat_send_failures_total for the election churn and heartbeat failures that follow. For deeper timing, the Diagnostic etcd_disk_wal_write_duration_seconds and etcd_server_apply_duration_seconds break down where the latency lands.

Fix:

Move etcd to faster, dedicated storage if WAL fsync p99 exceeds 10ms.
Investigate the network between peers if heartbeat failures climb.

Database approaching the backend quota

Cause: The keyspace has grown toward etcd_server_quota_backend_bytes; a NOSPACE alarm halts writes once it is hit.

Look at: etcd_mvcc_db_total_size_in_bytes against the quota (Operational). A large gap between etcd_mvcc_db_total_size_in_bytes and the Diagnostic etcd_mvcc_db_total_size_in_use_in_bytes indicates fragmentation.

Fix:

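Compact old revisions first, then run etcdctl defrag on each member to reclaim the freed space.
Raise the backend quota if the live data set has genuinely outgrown it.
If a NOSPACE alarm has already fired, clear it with etcdctl alarm disarm once space has been reclaimed.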
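For a quick look at the same numbers outside Scout, the sketch below reads /metrics text and prints two ratios: the fraction of the quota in use, and the fraction of the database file that a defrag could reclaim. The function name `etcd_space_report` and its output format are illustrative, not part of etcd or the Collector.

```shell
# Read /metrics text on stdin and report quota usage and reclaimable space.
etcd_space_report() {
  awk '
    $1 == "etcd_mvcc_db_total_size_in_bytes"        { total = $2 + 0 }
    $1 == "etcd_mvcc_db_total_size_in_use_in_bytes" { used  = $2 + 0 }
    $1 == "etcd_server_quota_backend_bytes"         { quota = $2 + 0 }
    END {
      # Bail out if the scrape did not contain the series we need.
      if (total == 0 || quota == 0) { print "missing metrics"; exit 1 }
      printf "quota_used=%.2f reclaimable=%.2f\n", total / quota, (total - used) / total
    }'
}
```

Run it as `curl -s http://localhost:2379/metrics | etcd_space_report`. A `quota_used` value above 0.80 matches the warning threshold in Key Alerts to Configure, and a high `reclaimable` value suggests fragmentation is a large part of the on-disk size.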

No metrics appearing in Scout

Cause: Metrics are collected but not exported.

Fix:

Check Collector logs for export errors: docker logs otel-collector.
Verify OTEL_EXPORTER_OTLP_ENDPOINT is set correctly.
Confirm the pipeline includes both the receiver and the exporter.

FAQ

Does this work with etcd running in Kubernetes?

Yes. Set targets to the etcd pod or service DNS (for example etcd-0.etcd.kube-system.svc.cluster.local:2379). For managed Kubernetes (EKS, GKE, AKS), the control-plane etcd may not be directly accessible - check your provider's documentation.

How do I monitor an etcd cluster?

Add all member endpoints to the scrape config:

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: etcd
          static_configs:
            - targets:
                - etcd-1:2379
                - etcd-2:2379
                - etcd-3:2379

Each member is scraped on its in-network client port - 2379, the ETCD_PORT default - and identified by its instance label. Watch etcd_network_known_peers and etcd_server_has_leader per member to confirm the cluster sees quorum.
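As a spot check before the Collector is wired up, the per-member leader signal can be read directly. This is a minimal sketch: `fetch_metrics` and `check_leaders` are hypothetical helper names, and `fetch_metrics` assumes plain HTTP, so a TLS cluster would need it redefined with the appropriate curl certificate options.

```shell
# Fetch /metrics from one member given as host:port.
# Assumption: plain HTTP; redefine this function for HTTPS with client certs.
fetch_metrics() { curl -s --max-time 5 "http://$1/metrics"; }

# Print "<member> has_leader=<value>" for each member passed as an argument.
# Returns non-zero if any member is unreachable or reports no leader.
check_leaders() {
  status=0
  for member in "$@"; do
    value=$(fetch_metrics "$member" | awk '$1 == "etcd_server_has_leader" { print $2; exit }')
    echo "$member has_leader=${value:-unreachable}"
    [ "$value" = "1" ] || status=1
  done
  return "$status"
}
```

Calling `check_leaders etcd-1:2379 etcd-2:2379 etcd-3:2379` prints one line per member, and the exit status makes it usable in a health-check script.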

Why does etcd_server_proposals_pending stay above zero?

A small number of pending proposals is normal under write load. Sustained high values mean the cluster cannot commit proposals fast enough - check disk latency (etcd_disk_wal_fsync_duration_seconds) and the etcd_server_slow_apply_total counter.

What is the difference between db_total_size and db_total_size_in_use?

etcd_mvcc_db_total_size_in_bytes includes space freed by compaction but not yet reclaimed (fragmentation). etcd_mvcc_db_total_size_in_use_in_bytes reflects actual data. A large gap between the two indicates fragmentation - run etcdctl defrag to reclaim space.

Which metrics can I drop to reduce volume?

The etcd_debugging_* namespace (30 series) plus the Go runtime, process, and gRPC-proxy families are Diagnostic - drop them with metric_relabel_configs and keep the Core and Operational tiers. They are worth re-enabling during an incident or capacity review.
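Before deciding what to drop, it can help to see how the volume is distributed on your own instance. The sketch below counts exposed samples per family from /metrics text. The function name `series_by_family` and the grouping rule (etcd_debugging, other etcd, then the first underscore-delimited prefix) are illustrative choices. It counts sample lines, so each histogram bucket counts separately and the totals will be higher than a count of metric names.

```shell
# Count sample lines per metric family from /metrics text on stdin,
# largest family first.
series_by_family() {
  awk '
    /^#/ || NF == 0 { next }
    {
      # Strip any label set so only the metric name remains.
      name = $1
      sub(/[{].*/, "", name)
      if (name ~ /^etcd_debugging_/) family = "etcd_debugging"
      else if (name ~ /^etcd_/) family = "etcd"
      else { family = name; sub(/_.*/, "", family) }
      count[family]++
    }
    END { for (f in count) print count[f], f }' | sort -rn
}
```

Run it as `curl -s http://localhost:2379/metrics | series_by_family` and compare the result before and after adding a drop rule.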

Related Guides

OTel Collector Configuration - Advanced collector configuration.
Docker Compose Setup - Run the Collector locally.
Kubernetes Helm Setup - Production deployment.
Creating Alerts - Alert on etcd metrics.
ZooKeeper Monitoring - Coordination service for systems that pre-date etcd.

What's Next?

Create Dashboards: Explore pre-built dashboards or build your own. See Create Your First Dashboard.
Monitor More Components: Add monitoring for ZooKeeper, Redis, and other components.
Fine-tune Collection: Drop the Diagnostic etcd_debugging_* tier in production with metric_relabel_configs to control volume; keep it available for incident investigation.
