Effective Warroom Management

· 5 min read
Founder at base14

Warroom Management

Incidents are inevitable. What separates resilient organizations from the rest is not whether they experience incidents, but how effectively they respond when problems arise. A well-structured war room process can mean the difference between a minor disruption and a major crisis.

After managing hundreds of critical incidents across my career, I've distilled my key learnings into this guide. These battle-tested practices have repeatedly proven their value in high-pressure situations.

Initialization

The first minutes of an incident response are critical. Having clear, consistent procedures for war room initialization ensures a swift and organized start to your incident management process.

Key Elements of Initialization

  • Single-access point: Always have one consistent link for all war rooms that everyone can access quickly. This eliminates confusion about where to go when an incident occurs.
  • Universal access: Everyone in the organization should have access to this link, even if they don't typically participate in incident response. This allows subject matter experts to join immediately when needed.
  • Pre-configured environment: Set up standard tools and dashboards in advance, so they're ready when an incident occurs.
  • Automated notifications: Implement automated alerting to notify the appropriate teams when a war room is initiated.
  • Initialization checklist: Create a standardized procedure for declaring an incident and starting the war room process.
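The automated-notification step above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the webhook URL, war-room link, and payload fields are all hypothetical placeholders for whatever chat or paging system your organization uses.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical endpoints -- substitute your own single war-room link and chat webhook.
WAR_ROOM_URL = "https://meet.example.com/war-room"
WEBHOOK_URL = "https://chat.example.com/hooks/incidents"

def build_notification(title: str, severity: str) -> dict:
    """Build the payload announcing that a war room has been initiated."""
    return {
        "text": f"[{severity}] War room open: {title}",
        "war_room": WAR_ROOM_URL,  # the one consistent link everyone knows
        "declared_at": datetime.now(timezone.utc).isoformat(),
    }

def declare_incident(title: str, severity: str = "SEV-2") -> None:
    """Post the announcement to the team chat webhook (network call)."""
    body = json.dumps(build_notification(title, severity)).encode()
    req = urllib.request.Request(
        WEBHOOK_URL, data=body, headers={"Content-Type": "application/json"}
    )
    urllib.request.urlopen(req)
```

Wiring `declare_incident` into your alerting pipeline means the war room announces itself the moment an incident is declared, rather than relying on someone remembering to paste the link.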

Clear Role Definition

Effective war rooms require clear responsibilities. Each participant should understand their specific role and boundaries of authority.

Core Roles

Incident Manager
  • Leads the overall response
  • Makes final decisions when consensus can't be reached
  • Ensures the response follows established processes
  • Manages escalations when needed
  • Declares when the incident is resolved
Scribe
  • Documents all significant events, decisions, and actions in real-time
  • Maintains a timeline of the incident
  • Captures action items for follow-up
  • Ensures all key information is accessible to war room participants
Communications Person
  • Manages external and internal communications
  • Drafts and sends updates to stakeholders at regular intervals
  • Fields inquiries from other parts of the organization
  • Ensures consistent messaging about the incident
Actors
  • Technical resources performing the actual investigation and remediation
  • Provide expertise in specific systems or technologies
  • Execute changes and verify results
  • Report findings back to the war room

Effective Practices

The structure and approach of your war room significantly impact its effectiveness. Well-designed practices help maintain focus and productivity during high-stress situations.

  • Shared visibility: Maintain one shared screen that everyone can see, showing the primary investigation or discussion. All key actions should be performed visibly to the entire team.
  • Sub-team breakouts: When a specific line of inquiry requires focused attention, create separate rooms with the same role structure. These breakout teams should report findings back to the main war room regularly.
  • Regular status updates: Schedule brief status updates at consistent intervals to ensure everyone has the same understanding of the current situation.
  • Engineering pairing: All changes should be made by a pair of engineers, never a single person. Pairing provides instant review and materially improves the quality of fixes. It reduces errors and preserves redundancy of knowledge during critical moments.
  • Clear decision-making framework: Establish in advance how decisions will be made during an incident (consensus, incident manager decision, etc.).
  • Time-boxing: Set time limits for investigation paths to avoid rabbit holes. Re-evaluate progress regularly.
  • Documentation first: Ensure all hypotheses, findings, and actions are documented before they're acted upon.
  • Standardized RCA template: Maintain a consistent RCA template that captures all necessary information: incident timeline, impact assessment, root cause identification, contributing factors, and action items. Standardization ensures comprehensive analysis and makes RCAs easier to compare and learn from over time.
  • Centralized knowledge repository: Establish a shared Google Drive, SharePoint, or similar solution where all RCAs are stored and accessible to everyone in the organization. This transparency builds institutional knowledge and allows teams to learn from past incidents regardless of their direct involvement.
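The standardized RCA template described above can be made concrete as a structured record. The sketch below is one possible shape, with illustrative field names; the point is that every RCA captures the same sections and can be checked for completeness before it is filed.

```python
from dataclasses import dataclass, field

@dataclass
class RCA:
    """One possible shape for a standardized RCA record (field names illustrative)."""
    incident_id: str
    timeline: list              # timestamped events, taken from the scribe's notes
    impact: str                 # who/what was affected, and for how long
    root_cause: str
    contributing_factors: list = field(default_factory=list)
    action_items: list = field(default_factory=list)

    def is_complete(self) -> bool:
        # An RCA is ready to file only when every required section is filled in.
        return all([self.incident_id, self.timeline, self.impact,
                    self.root_cause, self.action_items])
```

Because every RCA shares the same fields, reports stored in the central repository stay comparable over time, and a completeness check can gate whether an incident is allowed to close.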

War Room Etiquette

The discipline and focus of war room participants can make or break your incident response. Clear expectations for behavior help maintain an effective environment.

Etiquette Guidelines

  • Speak purposefully: Don't talk unless you have something meaningful to contribute. Background chatter makes it difficult to focus on critical information.
  • Respect role boundaries: Trust people in their designated roles to perform their functions without interference.
  • Minimize distractions: Turn off notifications and avoid multitasking during active incident response.
  • Stay focused on resolution: Keep discussions centered on understanding and resolving the current incident. Save process improvement discussions for after the incident.
  • Use clear, direct communication: Avoid ambiguous language. Be specific about what you're seeing, what you believe is happening, and what you're doing.
  • Mind cognitive load: Recognize that everyone's mental capacity is limited during high-stress situations, and communicate accordingly.

Post-Incident Activities

How you handle the aftermath of an incident is just as important as the initial response. Effective post-incident processes turn experiences into organizational learning.

Post-Incident Process

  • RCA assignment: The Incident Manager assigns root cause analysis responsibilities to a smaller group with relevant expertise.
  • Blameless postmortem: Conduct a thorough review focused on systems and processes, not individual mistakes.
  • Action item tracking: Document and assign follow-up items with clear ownership and timelines.
  • Knowledge sharing: Distribute learnings from the incident throughout the organization.
  • Process refinement: Update war room procedures based on lessons learned from each incident.
  • Recognition: Acknowledge the contributions of all participants in the incident response.

pgX: Comprehensive PostgreSQL Monitoring at Scale

· 10 min read
Engineering Team at base14

Watch: Tracing a slow query from application latency to PostgreSQL stats with pgX.

For many teams, PostgreSQL monitoring begins and often ends with pg_stat_statements. That choice is understandable: it provides normalized query statistics, execution counts, timing data, and enough signal to identify slow queries and obvious inefficiencies. For a long time, that is enough.

But as PostgreSQL clusters grow in size and importance, the questions engineers need to answer change. Instead of "Which query is slow?", the questions become harder and more operational:

  • Why is replication lagging right now?
  • Which application is exhausting the connection pool?
  • What is blocking this transaction?
  • Is autovacuum keeping up with write volume?
  • Did performance degrade because of query shape, data growth, or resource pressure?

These are not questions pg_stat_statements is designed to answer.

Most teams eventually respond by stitching together ad-hoc queries against pg_stat_activity, pg_locks, pg_stat_replication, pg_stat_user_tables, and related system views. This works until an incident demands answers in minutes, not hours.
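The "stitched together" state usually looks something like the sketch below: a grab-bag of diagnostic queries against the system views, run by hand when something goes wrong. The queries are illustrative simplifications (real incident tooling needs more care), and `run_diagnostics` assumes a DB-API connection such as one from psycopg2.

```python
# Ad-hoc diagnostics against PostgreSQL system views -- simplified for illustration.
DIAGNOSTICS = {
    # Why is replication lagging right now?
    "replication_lag": """
        SELECT client_addr,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
        FROM pg_stat_replication;
    """,
    # What is blocking this transaction?
    "blocking_locks": """
        SELECT pid, pg_blocking_pids(pid) AS blocked_by, query
        FROM pg_stat_activity
        WHERE cardinality(pg_blocking_pids(pid)) > 0;
    """,
    # Which application is exhausting the connection pool?
    "connections_by_app": """
        SELECT application_name, count(*) AS conns
        FROM pg_stat_activity
        GROUP BY application_name
        ORDER BY conns DESC;
    """,
}

def run_diagnostics(conn):
    """Run each diagnostic query and return the rows keyed by diagnostic name."""
    out = {}
    with conn.cursor() as cur:
        for name, sql in DIAGNOSTICS.items():
            cur.execute(sql)
            out[name] = cur.fetchall()
    return out
```

Scripts like this answer the questions, but only after someone remembers they exist, finds them, and correlates their output by hand. That manual correlation is exactly what breaks down under incident pressure.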

This post lays out what comprehensive PostgreSQL monitoring actually looks like at scale: the nine observability domains that matter, the kinds of metrics each domain requires, and why moving beyond query-only monitoring is unavoidable for serious production systems.

Introducing pgX: Bridging the Gap Between Database and Application Monitoring for PostgreSQL

· 8 min read
Engineering Team at base14


Modern software systems do not fail along clean architectural boundaries. Application latency, database contention, infrastructure saturation, and user behavior are tightly coupled, yet most observability setups continue to treat them as separate concerns. PostgreSQL, despite being a core component in most production systems, is often monitored in isolation: a separate tool, separate dashboards, and a separate mental model.

This separation works when systems are small and traffic patterns are simple. As systems scale, however, PostgreSQL behavior becomes a direct function of application usage: query patterns change with features, load fluctuates with users, and database pressure reflects upstream design decisions. At this stage, isolating database monitoring from application and infrastructure observability actively slows down diagnosis and leads teams to optimize the wrong layer.

In-depth PostgreSQL monitoring is necessary, but depth alone is not sufficient. Metrics without context force engineers to manually correlate symptoms across tools, timelines, and data models. What is required instead is component-level observability, where PostgreSQL metrics live alongside application traces, infrastructure signals, and deployment events, sharing the same time axis and the same analytical surface.

This is why PostgreSQL observability belongs in the same place as application and infrastructure observability. When database behavior is observed as part of the system rather than as a standalone dependency, engineers can reason about causality instead of coincidence, and leaders gain confidence that performance issues are being addressed at their source, not just mitigated downstream.

Reducing Bus factor in Observability using AI

· 5 min read
Nimisha G J
Consultant at base14
Service map graph

We’ve gotten pretty good at collecting observability data, but we’re terrible at making sense of it. Most teams—especially those running complex microservices—still rely on a handful of senior engineers who just know how everything fits together. They’re the rockstars who can look at alerts, mentally trace the dependency graph, and figure out what's actually broken.

When they leave, that knowledge walks out the door with them. That is the observability Bus Factor.

The problem isn't a lack of data; we have petabytes of it. The problem is a lack of context. We need systems that can actually explain what's happening, not just tell us that something is wrong.

This post explores the concept of a "Living Knowledge Base": context built from the telemetry data an application is already emitting, not from documentation or Confluence pages. Maintaining docs is a nightmare, and we can never quite keep up. Why not build a system that does this for us?

The Cloud-Native Foundation Layer: A Portable, Vendor-Neutral Base for Modern Systems

· 4 min read
Irfan Shah
Founder at base14
Cloud-Native Foundation Layer

Cloud-native began with containers and Kubernetes. Since then, it has become a set of open standards and protocols that let systems run anywhere with minimal friction.

Today's engineering landscape spans public clouds, private clouds, on-prem clusters, and edge environments - far beyond the old single-cloud model. Teams work this way because it's the only practical response to cost, regulation, latency, hardware availability, and outages.

If you expect change, you need an architecture that can handle it.

Making Certificate Expiry Boring

· 21 min read
Founder at base14

Making Certificate Expiry Boring

Certificate expiry issues are entirely preventable

On 18 November 2025, GitHub had an hour-long outage that affected the heart of their product: Git operations. The post-incident summary was brief and honest - the outage was triggered by an internal TLS certificate that had quietly expired, blocking service-to-service communication inside their platform. It's the kind of issue every engineering team knows can happen, yet it still slips through because certificates live in odd corners of a system, often far from where we normally look.

What struck me about this incident wasn't that GitHub "missed something." If anything, it reminded me how easy it is, even for well-run, highly mature engineering orgs, to overlook certificate expiry in their observability and alerting posture. We monitor CPU, memory, latency, error rates, queue depth, request volume - but a certificate that's about to expire rarely shows up as a first-class signal. It doesn't scream. It doesn't gradually degrade. It just keeps working… until it doesn't.

And that's why these failures feel unfair. They're fully preventable, but only if you treat certificates as operational assets, not just security artefacts. This article is about building that mindset: how to surface certificate expiry as a real reliability concern, how to detect issues early, and how to ensure a single date on a single file never brings down an entire system.
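Surfacing expiry as a first-class signal can start very small. The sketch below checks a host's certificate and warns when it is inside a renewal window; the hostname, threshold, and alerting behavior are placeholders, and a production version would feed a metric into your monitoring stack rather than print.

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after: str, now=None) -> int:
    """Parse the notAfter string from ssl.getpeercert() and return days remaining."""
    expires = datetime.strptime(
        not_after, "%b %d %H:%M:%S %Y %Z"
    ).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days

def check_host(host: str, port: int = 443, warn_days: int = 30) -> bool:
    """Fetch the host's certificate and warn when it is inside the renewal window."""
    ctx = ssl.create_default_context()
    with ctx.wrap_socket(socket.socket(), server_hostname=host) as sock:
        sock.settimeout(5)
        sock.connect((host, port))       # TLS handshake happens here
        cert = sock.getpeercert()
    remaining = days_until_expiry(cert["notAfter"])
    if remaining < warn_days:
        # In production, emit a metric or page someone instead of printing.
        print(f"WARNING: {host} certificate expires in {remaining} days")
        return False
    return True
```

Run on a schedule against every endpoint you own, internal ones included, this turns a silent cliff-edge failure into an ordinary, boring ticket filed weeks in advance.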

Understanding What Increases and Reduces MTTR

· 5 min read
Engineering Team at base14

What makes recovery slower — and what disciplined, observable teams do differently.


In reliability engineering, MTTR (Mean Time to Recovery) is one of the clearest indicators of how mature a system — and a team — really is. It measures not just how quickly you fix things, but how well your organization detects, communicates, and learns from failure.

Every production incident is a test of the system's design, the team's reflexes, and the clarity of their shared context. MTTR rises when friction builds up in those connections — between tools, roles, or data. It falls when context flows freely and decisions move faster than confusion.

Why Unified Observability Matters for Growing Engineering Teams

· 11 min read
Founder at base14
Why Unified Observability Matters for Growing Engineering Teams

Last month, I watched a senior engineer spend three hours debugging what should have been a fifteen-minute problem. The issue wasn't complexity—it was context switching between four different monitoring tools, correlating timestamps manually, and losing their train of thought every time they had to log into yet another dashboard. If this sounds familiar, you're not alone. This is the hidden tax most engineering teams pay without realizing there's a better way.

Observability Theatre

· 11 min read
Founder at base14
Observability Theatre

the·a·tre (also the·a·ter) /ˈθiːətər/ noun

: the performance of actions or behaviors for appearance rather than substance; an elaborate pretense that simulates real activity while lacking its essential purpose or outcomes

Example: "The company's security theatre gave the illusion of protection without addressing actual vulnerabilities."


Your organization has invested millions in observability tools. You have dashboards for everything. Your teams dutifully instrument their services. Yet when incidents strike, engineers still spend hours hunting through disparate systems, correlating timestamps manually, and guessing at root causes. Too often, the dev team first learns about an incident when the CEO forwards a customer complaint asking "are we down?"