2 posts tagged with "sre"

Evaluating Database Monitoring Solutions: A Framework for Engineering Leaders

· 8 min read
Ranjan Sakalley
Founder & CPO at base14

It was 5:30 AM when Riya (name changed), VP of Engineering at a Series C e-commerce company, got the page. Morning traffic was climbing into triple digits and catalog latency had spiked to twelve seconds. Within minutes, Slack was flooded with alerts from three different monitoring tools, each painting a partial picture. The APM showed slow API calls. The infrastructure dashboard showed normal CPU and memory. The dedicated PostgreSQL monitoring tool showed elevated query times, but offered no correlation to what changed upstream.

Riya watched as her on-call engineers spent the first forty minutes of the incident jumping between dashboards, arguing over whether this was a database problem or an application problem. By the time they traced the issue to a query introduced in the previous night's deployment, the checkout flow had been degraded for nearly ninety minutes.

The postmortem would later reveal that all the data needed to diagnose the issue existed within five minutes of the alert firing. It was scattered across three tools, owned by two teams, and required manual timeline alignment to interpret. Riya realized the problem was not instrumentation. It was fragmentation.

Effective War Room Management: A Guide to Incident Response

· 6 min read
Ranjan Sakalley
Founder & CPO at base14

Incidents are inevitable. What separates resilient organizations from the rest is not whether they experience incidents, but how effectively they respond when problems arise. A well-structured war room process can mean the difference between a minor disruption and a major crisis.

After managing hundreds of critical incidents across my career, I've distilled my key learnings into this guide. These battle-tested practices have repeatedly proven their value in high-pressure situations.