
The Multi-Cloud Design: Engineering Your Code for Portability

· 6 min read
Irfan Shah
Founder at base14

In our previous post on Cloud-Native foundations, we explored why running on one cloud isn't lock-in—but designing for one cloud is. Now let's look at how to implement that portability.

Portability is not defined by the ability to run everywhere simultaneously, as that is often a path toward over-engineering. It is, more accurately, a function of reversibility. It provides the technical confidence that if a migration becomes necessary, the system can support it. This quality is not derived from a specific cloud provider, but rather from the deliberate layering of code and environment. While many teams focus on the destination of their deployment, true portability is found in the methodology of the build.
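
To make "layering" concrete, here is a minimal sketch (not the approach this post prescribes; the class and function names are hypothetical): application code depends on a small, provider-neutral interface, while cloud-specific details live in swappable adapters at the edge.

```python
# A minimal sketch of "layering for reversibility": the application codes against
# a small interface, and provider-specific details live in swappable adapters.
# Names here (BlobStore, S3BlobStore, archive_report) are illustrative only.
from abc import ABC, abstractmethod


class BlobStore(ABC):
    """Cloud-neutral contract the application depends on."""

    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...


class S3BlobStore(BlobStore):
    """AWS-specific adapter; the only place the provider SDK is imported."""

    def __init__(self, bucket: str):
        import boto3  # provider SDK stays at the edge of the system
        self._s3 = boto3.client("s3")
        self._bucket = bucket

    def put(self, key: str, data: bytes) -> None:
        self._s3.put_object(Bucket=self._bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self._s3.get_object(Bucket=self._bucket, Key=key)["Body"].read()


def archive_report(store: BlobStore, report_id: str, payload: bytes) -> None:
    # Application logic never mentions a provider; migrating means writing
    # one new adapter, not rewriting call sites.
    store.put(f"reports/{report_id}", payload)
```

The point of the sketch is reversibility: if a migration ever becomes necessary, the blast radius is one adapter, not the whole codebase.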

Live Metric Registry: find and understand observability metrics across your stack

· 9 min read
Ranjan Sakalley
Founder at base14

Introducing Metric Registry: a live, searchable catalog of 3,700+ (and rapidly growing) observability metrics extracted directly from source repositories across the OpenTelemetry, Prometheus, and Kubernetes ecosystems, including cloud provider metrics. Metric Registry is open source and built to stay current automatically as projects evolve.

What you can do today with Metric Registry

Search across your entire observability stack. Find metrics by name, description, or component, whether you're looking for HTTP-related histograms or database connection metrics.

Understand what metrics actually exist. The registry covers 15 sources including OpenTelemetry Collector receivers, Prometheus exporters (PostgreSQL, Redis, MySQL, MongoDB, Kafka), Kubernetes metrics (kube-state-metrics, cAdvisor), and LLM observability libraries.

See which metrics follow standards. Each metric shows whether it complies with OpenTelemetry Semantic Conventions, helping you understand what's standardized versus custom.

Trace back to the source. Every metric links to its origin: the repository, file path, and commit hash. When you need to understand a metric's exact definition, you can go straight to the source.

Trust the data. Metrics are extracted automatically from source code and official metadata files, and the registry refreshes nightly to stay current as projects evolve.
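
To make those features concrete, here is a hypothetical sketch of the kind of record they imply; the field names and values are illustrative, not the registry's actual schema.

```python
# Hypothetical shape of a registry entry, inferred from the feature list above;
# field names are illustrative and not Metric Registry's actual schema.
entry = {
    "name": "http.server.request.duration",
    "type": "histogram",
    "description": "Duration of HTTP server requests.",
    "source": "OpenTelemetry Semantic Conventions",
    "semconv_compliant": True,
    "origin": {
        "repository": "https://github.com/open-telemetry/semantic-conventions",
        "file_path": "model/http/metrics.yaml",
        "commit": "<commit hash>",
    },
}
```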

Can't find what you're looking for? Open an issue or, better yet, submit a PR to add new sources or improve existing extractors.

Sources already indexed

  • OpenTelemetry: Collector Contrib, Semantic Conventions, Python, Java, JavaScript
  • Prometheus: node_exporter, postgres_exporter, redis_exporter, mysql_exporter, mongodb_exporter, kafka_exporter
  • Kubernetes: kube-state-metrics, cAdvisor
  • LLM Observability: OpenLLMetry, OpenLIT
  • CloudWatch: RDS, ALB, DynamoDB, Lambda, EC2, S3, SQS, API Gateway

Evaluating Database Monitoring Solutions: A Framework for Engineering Leaders

· 8 min read
Ranjan Sakalley
Founder at base14

It was 5:30 AM when Riya (name changed), VP of Engineering at a Series C e-commerce company, got the page. Morning traffic was climbing into triple digits and catalog latency had spiked to twelve seconds. Within minutes, Slack was flooded with alerts from three different monitoring tools, each painting a partial picture. The APM showed slow API calls. The infrastructure dashboard showed normal CPU and memory. The dedicated PostgreSQL monitoring tool showed elevated query times, but offered no correlation to what changed upstream. Riya watched as her on-call engineers spent the first forty minutes of the incident jumping between dashboards, arguing over whether this was a database problem or an application problem. By the time they traced the issue to a query introduced in the previous night's deployment, the checkout flow had been degraded for nearly ninety minutes. The postmortem would later reveal that all the data needed to diagnose the issue existed within five minutes of the alert firing. It was scattered across three tools, owned by two teams, and required manual timeline alignment to interpret. Riya realized the problem was not instrumentation. It was fragmentation.

Effective War Room Management: A Guide to Incident Response

· 6 min read
Ranjan Sakalley
Founder at base14

War Room Management

Incidents are inevitable. What separates resilient organizations from the rest is not whether they experience incidents, but how effectively they respond when problems arise. A well-structured war room process can mean the difference between a minor disruption and a major crisis.

After managing hundreds of critical incidents across my career, I've distilled my key learnings into this guide. These battle-tested practices have repeatedly proven their value in high-pressure situations.

pgX: Comprehensive PostgreSQL Monitoring at Scale

· 11 min read
Engineering Team at base14

Watch: Tracing a slow query from application latency to PostgreSQL stats with pgX.

For many teams, PostgreSQL monitoring begins and often ends with pg_stat_statements and basic postgres_exporter metrics. That choice is understandable. It provides normalized query statistics, execution counts, timing data, and enough signal to identify slow queries and obvious inefficiencies. For a long time, that is sufficient.

But as PostgreSQL clusters grow in size and importance, the questions engineers need to answer change. Instead of "Which query is slow?", the questions become harder and more operational:

  • Why is replication lagging right now?
  • Which application is exhausting the connection pool?
  • What is blocking this transaction?
  • Is autovacuum keeping up with write volume?
  • Did performance degrade because of query shape, data growth, or resource pressure?

These are not questions pg_stat_statements is designed to answer.

Most teams eventually respond by stitching together ad-hoc queries against pg_stat_activity, pg_locks, pg_stat_replication, pg_stat_user_tables, and related system views. This works until an incident demands answers in minutes, not hours.
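
As one example of that ad-hoc stitching (a sketch with placeholder connection details, not part of pgX), the query below joins pg_stat_activity with pg_blocking_pids() to answer "what is blocking this transaction?" during an incident.

```python
# A sketch of the ad-hoc stitching described above: querying pg_stat_activity
# with pg_blocking_pids() to see which sessions are blocked and by whom.
# Connection details are placeholders; requires PostgreSQL 9.6+ and psycopg2.
import psycopg2

BLOCKED_SESSIONS_SQL = """
SELECT
    blocked.pid                 AS blocked_pid,
    blocked.query               AS blocked_query,
    blocking.pid                AS blocking_pid,
    blocking.query              AS blocking_query,
    now() - blocked.query_start AS blocked_for
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY(pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
"""


def print_blocked_sessions(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(BLOCKED_SESSIONS_SQL)
        for blocked_pid, blocked_query, blocking_pid, blocking_query, blocked_for in cur:
            print(
                f"pid {blocked_pid} blocked by pid {blocking_pid} for {blocked_for}: "
                f"{blocked_query[:80]!r} waiting on {blocking_query[:80]!r}"
            )


if __name__ == "__main__":
    print_blocked_sessions("dbname=app user=postgres host=localhost")
```

Queries like this answer one question at a time; each of the other questions above needs its own script, its own system view, and someone who remembers to run it.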

As we discussed in our introduction to pgX, PostgreSQL monitoring in isolation creates blind spots. This post lays out what comprehensive PostgreSQL monitoring actually looks like at scale: the nine observability domains that matter, the kinds of metrics each domain requires, and why moving beyond query-only monitoring is unavoidable for serious production systems.

Introducing pgX: Bridging the Gap Between Database and Application Monitoring for PostgreSQL

· 8 min read
Engineering Team at base14

Watch: Tracing a slow query from application latency to PostgreSQL stats with pgX.

Modern software systems do not fail along clean architectural boundaries. Application latency, database contention, infrastructure saturation, and user behavior are tightly coupled, yet most observability setups continue to treat them as separate concerns, creating silos between database monitoring and APM tools. PostgreSQL, despite being a core component in most production systems, is often monitored in isolation—through a separate tool, separate dashboards, and separate mental models.

This separation works when systems are small and traffic patterns are simple. As systems scale, however, PostgreSQL behavior becomes a direct function of application usage: query patterns change with features, load fluctuates with users, and database pressure reflects upstream design decisions. At this stage, isolating database monitoring from application and infrastructure observability actively slows down diagnosis and leads teams to optimize the wrong layer.

In-depth PostgreSQL monitoring is necessary—but depth alone is not sufficient. Metrics without context force engineers to manually correlate symptoms across tools, timelines, and data models. What is required instead is component-level observability—a unified database observability platform where PostgreSQL metrics live alongside application traces, infrastructure signals, and deployment events, sharing the same time axis and the same analytical surface.

This is why PostgreSQL observability belongs in the same place as application and infrastructure observability. When database behavior is observed as part of the system rather than as a standalone dependency, engineers can reason about causality instead of coincidence, and leaders gain confidence that performance issues are being addressed at their source, not just mitigated downstream.

Reducing Bus Factor in Observability Using AI

· 5 min read
Nimisha G J
Consultant at base14
Service map graph

We’ve gotten pretty good at collecting observability data, but we’re terrible at making sense of it. Most teams—especially those running complex microservices—still rely on a handful of senior engineers who just know how everything fits together. They’re the rockstars who can look at alerts, mentally trace the dependency graph, and figure out what's actually broken.

When they leave, that knowledge walks out the door with them. That is the observability Bus Factor.

The problem isn't a lack of data; we have petabytes of it. The problem is a lack of context. We need systems that can actually explain what's happening, not just tell us that something is wrong.

This post explores the concept of a "Living Knowledge Base": context built from the telemetry data an application is already emitting, not from documentation or Confluence pages. Maintaining docs is a nightmare and we can never quite keep up, so why not build a system that does it for us?

The Cloud-Native Foundation Layer: A Portable, Vendor-Neutral Base for Modern Systems

· 4 min read
Irfan Shah
Founder at base14
Cloud-Native Foundation Layer

Cloud-native began with containers and Kubernetes. Since then, it has become a set of open standards and protocols that let systems run anywhere with minimal friction.

Today's engineering landscape spans public clouds, private clouds, on-prem clusters, and edge environments - far beyond the old single-cloud model. Teams work this way because it's the only practical response to cost, regulation, latency, hardware availability, and outages.

If you expect change, you need an architecture that can handle it.

Making Certificate Expiry Boring

· 21 min read
Ranjan Sakalley
Founder at base14
Certificate expiry issues are entirely preventable

On 18 November 2025, GitHub had an hour-long outage that affected the heart of their product: Git operations. The post-incident summary was brief and honest - the outage was triggered by an internal TLS certificate that had quietly expired, blocking service-to-service communication inside their platform. It's the kind of issue every engineering team knows can happen, yet it still slips through because certificates live in odd corners of a system, often far from where we normally look.

What struck me about this incident wasn't that GitHub "missed something." If anything, it reminded me how easy it is, even for well-run, highly mature engineering orgs, to overlook certificate expiry in their observability and alerting posture. We monitor CPU, memory, latency, error rates, queue depth, request volume - but a certificate that's about to expire rarely shows up as a first-class signal. It doesn't scream. It doesn't gradually degrade. It just keeps working… until it doesn't.

And that's why these failures feel unfair. They're fully preventable, but only if you treat certificates as operational assets, not just security artefacts. This article is about building that mindset: how to surface certificate expiry as a real reliability concern, how to detect issues early, and how to ensure a single date on a single file never brings down an entire system.
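
As a starting point for that mindset (a minimal sketch, not the tooling this article covers; the endpoint and the 30-day threshold are illustrative), a scheduled check like the following turns expiry into a signal you can alert on.

```python
# A minimal sketch of surfacing certificate expiry as an operational signal:
# connect to an endpoint, read the peer certificate, and report days remaining.
# The endpoint list and the 30-day threshold are illustrative choices.
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(host: str, port: int = 443, timeout: float = 5.0) -> float:
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires_at - datetime.now(timezone.utc)).total_seconds() / 86400


if __name__ == "__main__":
    # Replace with the internal and external endpoints your systems depend on.
    for endpoint in ["github.com"]:
        remaining = days_until_expiry(endpoint)
        status = "OK" if remaining > 30 else "ALERT"
        print(f"{status}: {endpoint} certificate expires in {remaining:.1f} days")
```

Run on a schedule and exported as a metric, a check like this makes a looming expiry date as visible as CPU, memory, or error rates.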