2 posts tagged with "llm"

LLM Prompt Lifecycle: From Observability to Optimization

· 22 min read
Nitin Misra
Engineer at base14

Rachel, a Staff Engineer at a mid-size SaaS company, woke up to a Slack message from the support lead: "Why are half our billing tickets going to the technical team?" She checked the deployment log: nothing had shipped in a week. She checked the model configuration: same gpt-4o endpoint, same parameters, same code. No errors in the logs, no latency spikes, no alerts fired. But customer complaints about misrouted tickets had doubled in three weeks. Something was wrong.

This is prompt drift, a slow, invisible degradation in LLM output quality that no dashboard catches until a human notices the downstream effects. Rachel's triage prompt, which classifies support tickets and routes them to the right team, worked perfectly at launch. The team tested it carefully, tuned the wording, validated it against sample tickets, and shipped it with confidence. Three months later, it was failing, and nothing in the monitoring stack surfaced the problem until the support lead noticed a pattern in Slack complaints.

Reducing Bus Factor in Observability Using AI

· 5 min read
Nimisha G J
Engineer at base14
[Image: service map graph]

We’ve gotten pretty good at collecting observability data, but we’re terrible at making sense of it. Most teams—especially those running complex microservices—still rely on a handful of senior engineers who just know how everything fits together. They’re the rockstars who can look at alerts, mentally trace the dependency graph, and figure out what's actually broken.

When they leave, that knowledge walks out the door with them. That is the observability Bus Factor.

The problem isn't a lack of data; we have petabytes of it. The problem is a lack of context. We need systems that can actually explain what's happening, not just tell us that something is wrong.

This post explores the concept of a "Living Knowledge Base", where context is built from the telemetry data the application emits, not from documentation or Confluence pages. Maintaining docs is a nightmare, and we cannot always keep up. Why not just build a system that will do this?