
2 posts tagged with "reliability"


Kubernetes Scheduling: Observing Silent Failures

· 10 min read
Irfan Shah
Founder & CTO at base14

A Pending Pod means Kubernetes has accepted your workload but can't run it. The classic culprits are insufficient capacity, overly restrictive placement constraints, unbound PVCs, autoscaler ceilings, and namespace quota exhaustion. Most teams discover this during an incident. You don't have to. Wire up the OTel Collector's k8s_cluster, kubeletstats, and k8sobjects receivers, alert on FailedScheduling events and Pending pod duration, and you'll catch scheduling failures before your users do. This post covers the five root causes, a kubectl debugging workflow, and a complete OTel instrumentation setup with collector config, deployment topology, and alert conditions.
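
To make the "Pending pod duration" alert condition concrete, here is a minimal sketch in Python. It is not the post's collector config. It assumes pod records have already been exported as plain dictionaries with hypothetical `name`, `phase`, and `created` (ISO 8601) fields, and the five-minute threshold is an illustrative default, not a recommendation from the post.

```python
from datetime import datetime, timedelta, timezone


def pending_too_long(pods, now, threshold=timedelta(minutes=5)):
    """Return the names of pods that have been Pending longer than `threshold`.

    `pods` is a list of dicts with hypothetical keys:
      name    -- pod name
      phase   -- pod phase string, e.g. "Pending" or "Running"
      created -- creation timestamp as an ISO 8601 string with a UTC offset
    """
    stuck = []
    for pod in pods:
        if pod["phase"] != "Pending":
            continue  # only Pending pods are candidates for this alert
        created = datetime.fromisoformat(pod["created"])
        if now - created > threshold:
            stuck.append(pod["name"])
    return stuck


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 10, tzinfo=timezone.utc)
    sample = [
        {"name": "api-7f9c", "phase": "Pending", "created": "2024-01-01T12:00:00+00:00"},
        {"name": "web-5d2a", "phase": "Pending", "created": "2024-01-01T12:08:00+00:00"},
        {"name": "db-0", "phase": "Running", "created": "2024-01-01T11:00:00+00:00"},
    ]
    print(pending_too_long(sample, now))
```

In a real setup the same condition would more likely be expressed as an alert rule over collector metrics. The sketch only shows the logic: filter on phase, then compare age against a threshold.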

Understanding What Increases and Reduces MTTR

· 5 min read
Engineering Team at base14

What makes recovery slower, and what disciplined, observable teams do differently.


In reliability engineering, MTTR (Mean Time to Recovery) is one of the clearest indicators of how mature a system, and a team, really is. It measures not just how quickly you fix things, but how well your organization detects, communicates, and learns from failure.

Every production incident is a test of the system's design, the team's reflexes, and the clarity of their shared context. MTTR rises when friction builds up in those connections: between tools, roles, or data. It falls when context flows freely and decisions move faster than confusion.