4 posts tagged with "devops"

GitHub Actions Observability with Scout

February 12, 2026 · 8 min read

Founder & CPO at base14

CI/CD pipelines are critical infrastructure. Builds slow down over weeks, flaky tests waste developer time, and when a pipeline breaks, diagnosing the root cause means clicking through GitHub's UI one run at a time.

The Scout OpenTelemetry CI/CD Action solves this by exporting your GitHub Actions workflow runs as OpenTelemetry traces. Each workflow becomes a trace, each job becomes a child span, and each step becomes a span within its job. You get the same structured observability for your pipelines that you already have for your applications.

Evaluating Database Monitoring Solutions: A Framework for Engineering Leaders

January 18, 2026 · 8 min read

Ranjan Sakalley

Founder & CPO at base14

It was 5:30 AM when Riya (name changed), VP of Engineering at a Series C e-commerce company, got the page. Morning traffic was climbing into triple digits and catalog latency had spiked to twelve seconds. Within minutes, Slack was flooded with alerts from three different monitoring tools, each painting a partial picture. The APM showed slow API calls. The infrastructure dashboard showed normal CPU and memory. The dedicated PostgreSQL monitoring tool showed elevated query times, but offered no correlation to what changed upstream. Riya watched as her on-call engineers spent the first forty minutes of the incident jumping between dashboards, arguing over whether this was a database problem or an application problem. By the time they traced the issue to a query introduced in the previous night's deployment, the checkout flow had been degraded for nearly ninety minutes. The postmortem would later reveal that all the data needed to diagnose the issue existed within five minutes of the alert firing. It was scattered across three tools, owned by two teams, and required manual timeline alignment to interpret. Riya realized the problem was not instrumentation. It was fragmentation.

Effective War Room Management: A Guide to Incident Response

January 16, 2026 · 6 min read

Ranjan Sakalley

Founder & CPO at base14

Warroom Management

Incidents are inevitable. What separates resilient organizations from the rest is not whether they experience incidents, but how effectively they respond when problems arise. A well-structured war room process can mean the difference between a minor disruption and a major crisis.

After managing hundreds of critical incidents across my career, I've distilled my key learnings into this guide. These battle-tested practices have repeatedly proven their value in high-pressure situations.

Making Certificate Expiry Boring

November 20, 2025 · 21 min read

Ranjan Sakalley

Founder & CPO at base14

Certificate expiry issues are entirely preventable

On 18 November 2025, GitHub had an hour-long outage that affected the heart of their product: Git operations. The post-incident summary was brief and honest - the outage was triggered by an internal TLS certificate that had quietly expired, blocking service-to-service communication inside their platform. It's the kind of issue every engineering team knows can happen, yet it still slips through because certificates live in odd corners of a system, often far from where we normally look.

What struck me about this incident wasn't that GitHub "missed something." If anything, it reminded me how easy it is, even for well-run, highly mature engineering orgs, to overlook certificate expiry in their observability and alerting posture. We monitor CPU, memory, latency, error rates, queue depth, request volume - but a certificate that's about to expire rarely shows up as a first-class signal. It doesn't scream. It doesn't gradually degrade. It just keeps working… until it doesn't.

And that's why these failures feel unfair. They're fully preventable, but only if you treat certificates as operational assets, not just security artefacts. This article is about building that mindset: how to surface certificate expiry as a real reliability concern, how to detect issues early, and how to ensure a single date on a single file never brings down an entire system.