
4 posts tagged with "kubernetes"


The Multi-Cloud Design: Engineering your code for Portability

· 6 min read
Irfan Shah
Founder at base14

In our previous post on Cloud-Native foundations, we explored why running on one cloud isn't lock-in—but designing for one cloud is. Now let's look at how to implement that portability.

Portability is not the ability to run everywhere at once; chasing that is usually a path to over-engineering. It is better understood as reversibility: the technical confidence that, if a migration ever becomes necessary, the system can support it. That quality doesn't come from any particular cloud provider; it comes from deliberately layering code and environment. Many teams focus on the destination of their deployment, but true portability is found in the methodology of the build.
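To make that layering concrete, here's a minimal sketch (mine, not the post's) of the seam it implies: business logic depends only on a narrow interface, and the provider-specific adapter is selected by configuration at the edge of the system. All names below are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"strings"
)

// BlobStore is the only storage contract the rest of the code sees.
// Provider-specific adapters (S3, GCS, a local disk, ...) implement it.
type BlobStore interface {
	Put(ctx context.Context, key string, body io.Reader) error
}

// memStore is a stand-in adapter used here for illustration; a real system
// would pair it with s3Store, gcsStore, and so on.
type memStore struct{ data map[string]string }

func (m *memStore) Put(ctx context.Context, key string, body io.Reader) error {
	b, err := io.ReadAll(body)
	if err != nil {
		return err
	}
	m.data[key] = string(b)
	return nil
}

// newStore picks an adapter from configuration, keeping the provider choice
// at the edge of the system instead of inside business logic.
func newStore(provider string) (BlobStore, error) {
	switch provider {
	case "memory":
		return &memStore{data: map[string]string{}}, nil
	default:
		return nil, fmt.Errorf("unknown storage provider %q", provider)
	}
}

func main() {
	store, err := newStore("memory") // in practice, read this from configuration
	if err != nil {
		panic(err)
	}
	_ = store.Put(context.Background(), "report.txt", strings.NewReader("hello"))
}
```

Reversing a decision then means adding one adapter and flipping one configuration value, not rewriting every caller.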

Live Metric Registry: find and understand observability metrics across your stack

· 9 min read
Ranjan Sakalley
Founder at base14

Introducing Metric Registry: a live, searchable catalog of 3,700+ (and rapidly growing) observability metrics extracted directly from source repositories across the OpenTelemetry, Prometheus, and Kubernetes ecosystems, including cloud provider metrics. Metric Registry is open source and built to stay current automatically as projects evolve.

What you can do today with Metric Registry

Search across your entire observability stack. Find metrics by name, description, or component, whether you're looking for HTTP-related histograms or database connection metrics.

Understand what metrics actually exist. The registry covers 15 sources including OpenTelemetry Collector receivers, Prometheus exporters (PostgreSQL, Redis, MySQL, MongoDB, Kafka), Kubernetes metrics (kube-state-metrics, cAdvisor), and LLM observability libraries.

See which metrics follow standards. Each metric shows whether it complies with OpenTelemetry Semantic Conventions, helping you understand what's standardized versus custom.

Trace back to the source. Every metric links to its origin: the repository, file path, and commit hash. When you need to understand a metric's exact definition, you can go straight to the source (a hypothetical record shape is sketched after this list).

Trust the data. Metrics are extracted automatically from source code and official metadata files, and the registry refreshes nightly to stay current as projects evolve.
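To make those properties concrete, here's a hypothetical sketch of what one registry record could carry. The actual Metric Registry schema may differ; the field names and placeholder values below are illustrative, apart from the metric name and description, which come from the OpenTelemetry Semantic Conventions.

```go
package main

import "fmt"

// MetricEntry is a hypothetical shape for one registry record: identity,
// provenance (repository, file, commit), and semantic-convention compliance.
type MetricEntry struct {
	Name             string // e.g. "http.server.request.duration"
	Description      string
	Source           string // which indexed project the metric came from
	FilePath         string // file the metric was extracted from
	CommitHash       string // commit the extraction ran against
	SemConvCompliant bool   // follows OpenTelemetry Semantic Conventions?
}

func main() {
	entry := MetricEntry{
		Name:             "http.server.request.duration",
		Description:      "Duration of HTTP server requests.",
		Source:           "open-telemetry/semantic-conventions",
		FilePath:         "model/http/metrics.yaml", // placeholder path
		CommitHash:       "abc1234",                 // placeholder commit
		SemConvCompliant: true,
	}
	fmt.Printf("%+v\n", entry)
}
```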

Can't find what you're looking for? Open an issue or, better yet, submit a PR to add new sources or improve existing extractors.

Sources already indexed

| Category | Sources |
| --- | --- |
| OpenTelemetry | Collector Contrib, Semantic Conventions, Python, Java, JavaScript |
| Prometheus | node_exporter, postgres_exporter, redis_exporter, mysql_exporter, mongodb_exporter, kafka_exporter |
| Kubernetes | kube-state-metrics, cAdvisor |
| LLM Observability | OpenLLMetry, OpenLIT |
| CloudWatch | RDS, ALB, DynamoDB, Lambda, EC2, S3, SQS, API Gateway |

The Cloud-Native Foundation Layer: A Portable, Vendor-Neutral Base for Modern Systems

· 4 min read
Irfan Shah
Founder at base14
Cloud-Native Foundation Layer

Cloud-native began with containers and Kubernetes. Since then, it has become a set of open standards and protocols that let systems run anywhere with minimal friction.

Today's engineering landscape spans public clouds, private clouds, on-prem clusters, and edge environments - far beyond the old single-cloud model. Teams work this way because it's the only practical response to cost, regulation, latency, hardware availability, and outages.

If you expect change, you need an architecture that can handle it.
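As one concrete, hedged illustration of what those open standards buy you (assuming the OpenTelemetry Go SDK and the OTLP protocol): the same binary can send telemetry to any OTLP-capable backend, because the destination is configuration rather than code.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

func main() {
	ctx := context.Background()

	// The OTLP exporter reads its target from standard configuration
	// (e.g. OTEL_EXPORTER_OTLP_ENDPOINT), so switching backends is a
	// deployment change, not a code change.
	exporter, err := otlptracegrpc.New(ctx)
	if err != nil {
		log.Fatalf("creating OTLP exporter: %v", err)
	}

	tp := sdktrace.NewTracerProvider(sdktrace.WithBatcher(exporter))
	defer func() { _ = tp.Shutdown(ctx) }()
	otel.SetTracerProvider(tp)

	// Emit one span; where it lands is decided entirely by configuration.
	_, span := otel.Tracer("portability-demo").Start(ctx, "hello")
	span.End()
}
```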

Making Certificate Expiry Boring

· 21 min read
Ranjan Sakalley
Founder at base14
Certificate expiry issues are entirely preventable

On 18 November 2025, GitHub had an hour-long outage that affected the heart of their product: Git operations. The post-incident summary was brief and honest - the outage was triggered by an internal TLS certificate that had quietly expired, blocking service-to-service communication inside their platform. It's the kind of issue every engineering team knows can happen, yet it still slips through because certificates live in odd corners of a system, often far from where we normally look.

What struck me about this incident wasn't that GitHub "missed something." If anything, it reminded me how easy it is, even for well-run, highly mature engineering orgs, to overlook certificate expiry in their observability and alerting posture. We monitor CPU, memory, latency, error rates, queue depth, request volume - but a certificate that's about to expire rarely shows up as a first-class signal. It doesn't scream. It doesn't gradually degrade. It just keeps working… until it doesn't.

And that's why these failures feel unfair. They're fully preventable, but only if you treat certificates as operational assets, not just security artefacts. This article is about building that mindset: how to surface certificate expiry as a real reliability concern, how to detect issues early, and how to ensure a single date on a single file never brings down an entire system.
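As a minimal sketch of what "detect issues early" can look like in practice (my illustration, not a tool from this article): dial the endpoint, read the leaf certificate's NotAfter, and turn time-remaining into a number you can alert on, just like CPU or latency.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"os"
	"time"
)

// certTimeRemaining connects to addr ("host:port") over TLS and returns how
// long the presented leaf certificate remains valid.
func certTimeRemaining(addr string) (time.Duration, error) {
	conn, err := tls.Dial("tcp", addr, &tls.Config{})
	if err != nil {
		return 0, err
	}
	defer conn.Close()

	leaf := conn.ConnectionState().PeerCertificates[0]
	return time.Until(leaf.NotAfter), nil
}

func main() {
	const warnThreshold = 30 * 24 * time.Hour // alert when fewer than 30 days remain

	remaining, err := certTimeRemaining("example.com:443")
	if err != nil {
		fmt.Fprintln(os.Stderr, "certificate check failed:", err)
		os.Exit(1)
	}
	if remaining < warnThreshold {
		fmt.Printf("WARNING: certificate expires in %s\n", remaining.Round(time.Hour))
		os.Exit(2)
	}
	fmt.Printf("OK: certificate valid for another %s\n", remaining.Round(time.Hour))
}
```

Exported as a gauge instead of printed, the same time-remaining value becomes a first-class signal your existing alerting can watch.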