Making Certificate Expiry Boring
On 18 November 2025, GitHub had an hour-long outage that affected the
heart of their product: Git operations. The post-incident
summary was brief
and honest - the outage was triggered by an internal TLS certificate that
had quietly expired, blocking service-to-service communication inside
their platform. It's the kind of issue every engineering team knows can
happen, yet it still slips through because certificates live in odd
corners of a system, often far from where we normally look.
What struck me about this incident wasn't that GitHub "missed something."
If anything, it reminded me how easy it is, even for well-run, highly
mature engineering orgs, to overlook certificate expiry in their
observability and alerting posture. We monitor CPU, memory, latency,
error rates, queue depth, request volume - but a certificate that's about
to expire rarely shows up as a first-class signal. It doesn't scream. It
doesn't gradually degrade. It just keeps working… until it doesn't.
And that's why these failures feel unfair. They're fully preventable, but
only if you treat certificates as operational assets, not just security
artefacts. This article is about building that mindset: how to surface
certificate expiry as a real reliability concern, how to detect issues
early, and how to ensure that a single date in a single file never brings down
an entire system.
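
To make that concrete before we dive in, here is a minimal sketch of the kind of check the rest of this article builds on: connect to an endpoint, read the certificate the server actually presents, and report how many days of validity remain. The hostname and the 30-day threshold are placeholders I've chosen for illustration - nothing here comes from GitHub's incident report.

```python
# Minimal certificate-expiry probe (a sketch, not a production monitor).
# Assumptions: the endpoint is reachable over TLS and a 30-day warning
# threshold roughly matches your renewal lead time - both are placeholders.
import socket
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30  # hypothetical threshold; tune to how early you want to act


def days_until_expiry(host: str, port: int = 443) -> int:
    """Return the number of days until the certificate at host:port expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # Convert the certificate's 'notAfter' field to an absolute timestamp.
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires_at - datetime.now(timezone.utc)).days


if __name__ == "__main__":
    host = "example.internal"  # placeholder for one of your own endpoints
    remaining = days_until_expiry(host)
    status = "WARNING" if remaining <= WARN_DAYS else "OK"
    print(f"{status}: certificate for {host} expires in {remaining} days")
```

In practice you would run something like this on a schedule, or export the number of remaining days as a metric, so the countdown sits alongside CPU, latency, and error rates instead of living in a spreadsheet nobody opens. The rest of the article is about getting from this toy check to that posture.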