Skip to main content

17 posts tagged with "observability"

View All Tags

Kubernetes Scheduling: Observing Silent Failures

· 10 min read
Irfan Shah
Founder & CTO at base14

A Pending Pod means Kubernetes accepts your workload but can't run it. Classic culprits are: insufficient capacity, overly restrictive placement constraints, unbound PVCs, autoscaler ceilings, or namespace quota exhaustion. Most teams discover this during an incident. You don't have to. Wire up the OTel Collector's k8s_cluster, kubeletstats, and k8sobjects receivers, alert on FailedScheduling events and Pending pod duration, and you'll catch scheduling failures before your users do. This post covers the five root causes, a kubectl debugging workflow, and a complete OTel instrumentation setup with collector config, deployment topology, and alert conditions.

Coding Agent Observability for Your Team

· 8 min read
Ranjan Sakalley
Founder & CPO at base14

Coding agents like Claude Code, OpenAI Codex CLI, and Google Gemini CLI now ship with native OpenTelemetry support. This means you can collect structured telemetry covering token usage, cost attribution, tool calls, sessions, and lines of code modified, the same way you instrument any other production system.

This post covers what each agent emits, how to enable collection, and what we learned running Claude Code telemetry across a team.

Flutter Mobile Observability with OpenTelemetry

· 5 min read
Nimisha G J
Engineer at base14

Most teams have solid observability on their backend. Structured logs, distributed traces, SLOs, alerting. The mobile app, which is often the first thing a user touches, gets crash reports at best.

A user taps a button and nothing happens. Was it the network? A janky frame that swallowed the tap? A backend timeout? A state management bug? Without telemetry on the device, you are guessing.

This post explains a couple of approaches we have used to help our customers instrument their Flutter apps and when to use each approach.

Zero-Code Instrumentation for Go with eBPF and OpenTelemetry

· 10 min read
Ranjan Sakalley
Founder & CPO at base14

Auto-instrumentation is well-established for Java, Python, and Node.js. Runtime agents hook into the interpreter or bytecode layer to inject tracing, metrics, and logging without requiring code changes. Go compiles to a static native binary, so JVM-style bytecode patching does not apply. But Go is not without options. Compile-time tools like Datadog's Orchestrion and Alibaba's opentelemetry-go-auto-instrumentation can inject tracing at build time, and eBPF provides a runtime alternative that requires no rebuild at all.

This post focuses on the eBPF approach. It attaches kernel-level probes to running Go binaries, extracting telemetry without modifying source code, recompiling, or restarting the process. OpenTelemetry now has two official projects built on this mechanism. We cover how it works, how to deploy it on Kubernetes, and where the practical limits are.

Production-Ready OpenTelemetry: Configure, Harden, and Debug Your Collector

· 11 min read
Ranjan Sakalley
Founder & CPO at base14

The OpenTelemetry Collector works out of the box with minimal configuration. You point a receiver at port 4317, wire up an exporter, and telemetry flows. In development, this is sufficient. In production, it is not.

Default settings ship without memory limits, without retry logic, without queue sizing, and without any self-monitoring. The collector will accept data until it runs out of memory, drop data silently when the queue fills up, and give you no signal that anything went wrong. These failures surface as gaps in your dashboards hours or days later, when the context to diagnose them is gone.

This post covers the practical steps to close that gap: hardening the collector's configuration, enabling its built-in diagnostic tools, and diagnosing the failure patterns that show up most often in production.

GitHub Actions Observability with Scout

· 8 min read
Ranjan Sakalley
Founder & CPO at base14

CI/CD pipelines are critical infrastructure. Builds slow down over weeks, flaky tests waste developer time, and when a pipeline breaks, diagnosing the root cause means clicking through GitHub's UI one run at a time.

The Scout OpenTelemetry CI/CD Action solves this by exporting your GitHub Actions workflow runs as OpenTelemetry traces. Each workflow becomes a trace, each job becomes a child span, and each step becomes a span within its job. You get the same structured observability for your pipelines that you already have for your applications.

Live Metric Registry: find and understand observability metrics across your stack

· 9 min read
Ranjan Sakalley
Founder & CPO at base14

Introducing Metric Registry: a live, searchable catalog of 3,700+ observability (and rapidly growing) metrics extracted directly from source repositories across the OpenTelemetry, Prometheus, and Kubernetes ecosystems, including cloud provider metrics. Metric Registry is open source and built to stay current automatically as projects evolve.

What you can do today with Metric Registry

Search across your entire observability stack. Find metrics by name, description, or component, whether you're looking for HTTP-related histograms or database connection metrics.

Understand what metrics actually exist. The registry covers 15 sources including OpenTelemetry Collector receivers, Prometheus exporters (PostgreSQL, Redis, MySQL, MongoDB, Kafka), Kubernetes metrics (kube-state-metrics, cAdvisor), and LLM observability libraries.

See which metrics follow standards. Each metric shows whether it complies with OpenTelemetry Semantic Conventions, helping you understand what's standardized versus custom.

Trace back to the source. Every metric links to its origin: the repository, file path, and commit hash. When you need to understand a metric's exact definition, you can go straight to the source.

Trust the data. Metrics are extracted automatically from source code and official metadata files, and the registry refreshes nightly to stay current as projects evolve.

Can't find what you're looking for? Open an issue or better yet, submit a PR to add new sources or improve existing extractors.

Sources already indexed

CategorySources
OpenTelemetryCollector Contrib, Semantic Conventions, Python, Java, JavaScript
Prometheusnode_exporter, postgres_exporter, redis_exporter, mysql_exporter, mongodb_exporter, kafka_exporter
Kuberneteskube-state-metrics, cAdvisor
LLM ObservabilityOpenLLMetry, OpenLIT
CloudWatchRDS, ALB, DynamoDB, Lambda, EC2, S3, SQS, API Gateway

Evaluating Database Monitoring Solutions: A Framework for Engineering Leaders

· 8 min read
Ranjan Sakalley
Founder & CPO at base14

It was 5:30 AM when Riya (name changed), VP of Engineering at a Series C e-commerce company, got the page. Morning traffic was climbing into triple digits and catalog latency had spiked to twelve seconds. Within minutes, Slack was flooded with alerts from three different monitoring tools, each painting a partial picture. The APM showed slow API calls. The infrastructure dashboard showed normal CPU and memory. The dedicated PostgreSQL monitoring tool showed elevated query times, but offered no correlation to what changed upstream. Riya watched as her on-call engineers spent the first forty minutes of the incident jumping between dashboards, arguing over whether this was a database problem or an application problem. By the time they traced the issue to a query introduced in the previous night's deployment, the checkout flow had been degraded for nearly ninety minutes. The postmortem would later reveal that all the data needed to diagnose the issue existed within five minutes of the alert firing. It was scattered across three tools, owned by two teams, and required manual timeline alignment to interpret. Riya realized the problem was not instrumentation. It was fragmentation.