3 posts tagged with "observability"

Understanding What Increases and Reduces MTTR

· 5 min read
base14 Team
Engineering Team at base14

What makes recovery slower – and what disciplined, observable teams do differently.


In reliability engineering, MTTR (Mean Time to Recovery) is one of the clearest indicators of how mature a system – and a team – really is. It measures not just how quickly you fix things, but how well your organization detects, communicates, and learns from failure.
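
As a back-of-the-envelope sketch (the incident records below are invented for illustration), MTTR is simply total recovery time divided by the number of incidents:

```python
from datetime import datetime

# Hypothetical incident records: when each incident was detected and recovered.
incidents = [
    {"detected": datetime(2024, 5, 1, 10, 0), "recovered": datetime(2024, 5, 1, 10, 45)},
    {"detected": datetime(2024, 5, 9, 2, 15), "recovered": datetime(2024, 5, 9, 4, 0)},
    {"detected": datetime(2024, 5, 20, 16, 30), "recovered": datetime(2024, 5, 20, 16, 50)},
]

# MTTR = total time to recover / number of incidents.
total_recovery_seconds = sum(
    (i["recovered"] - i["detected"]).total_seconds() for i in incidents
)
mttr_minutes = total_recovery_seconds / len(incidents) / 60
print(f"MTTR: {mttr_minutes:.1f} minutes")  # ~56.7 minutes for this sample
```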

Every production incident is a test of the system's design, the team's reflexes, and the clarity of their shared context. MTTR rises when friction builds up in those connections – between tools, roles, or data. It falls when context flows freely and decisions move faster than confusion.

The table below outlines what typically increases MTTR, and what helps reduce it.

| What Increases MTTR | What Reduces MTTR |
| --- | --- |
| Tool fragmentation – Engineers switching between 5–6 systems to correlate metrics, logs, and traces. | Unified observability – One system of record for signals, context, and dependencies. |
| Ambiguous ownership – No clear incident lead or decision-maker during crises. | Clear incident command – Defined roles: Incident Lead, Scribe, Technical Actors, Comms Lead. |
| Tribal knowledge dependency – Critical know-how lives in people's heads, not in runbooks or documentation. | Documented runbooks & shared context – Institutionalize recovery steps and system behavior. |
| Delayed or low-quality alerts – Issues detected late, or alerts lack relevance or context. | Contextual and prioritized alerting – Alerts linked to user impact, with clear severity and ownership. |
| Unstructured communication – Slack chaos, overlapping updates, unclear status. | War-room discipline – Structured updates, timestamped actions, single-threaded communication. |
| Noisy or false-positive monitoring – Engineers waste time triaging irrelevant alerts. | Adaptive thresholds & anomaly detection – Focus attention on meaningful deviations. |
| Complex release pipelines – Hard to correlate incidents with recent deployments or config changes. | Deployment correlation – Automated linkage between system changes and emerging anomalies. |
| Lack of observability in dependencies – Blind spots in upstream or third-party systems. | End-to-end visibility – Instrumentation across services and dependencies. |
| No post-incident learning – Same issues recur because lessons aren't captured. | Structured postmortems – Document root causes, timelines, and action items for systemic fixes. |
| Overly reactive culture – Teams firefight repeatedly without addressing systemic issues. | Reliability mindset – Invest in prevention: better testing, chaos drills, resilience engineering. |

Tool Fragmentation → Unified Observability

One of the biggest sources of friction during incidents is tool fragmentation. When each signal – metrics, logs, traces – lives in a separate system, engineers lose time stitching context together instead of resolving the issue.

Unified observability doesn't mean one vendor or dashboard. It means a single, correlated view where you can trace a signal from symptom to cause without tab-switching or guesswork.
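
As a toy illustration (the records and the trace_id field are assumptions, not any particular vendor's schema), a correlated view means you can pivot from a firing alert to the logs and traces that share its context:

```python
from datetime import datetime, timedelta

# Illustrative, vendor-neutral records; real pipelines would pull these
# from whatever backends store metrics, logs, and traces.
alert = {"service": "checkout", "fired_at": datetime(2024, 5, 1, 10, 2)}

logs = [
    {"service": "checkout", "ts": datetime(2024, 5, 1, 10, 1),
     "trace_id": "abc123", "msg": "payment provider timeout"},
    {"service": "search", "ts": datetime(2024, 5, 1, 9, 50),
     "trace_id": "zzz999", "msg": "cache miss"},
]

traces = {
    "abc123": ["checkout -> payments -> provider-gateway"],
}

def correlate(alert, logs, traces, window=timedelta(minutes=5)):
    """Return logs (and their traces) near the alert, for the same service."""
    related = [
        entry for entry in logs
        if entry["service"] == alert["service"]
        and abs(entry["ts"] - alert["fired_at"]) <= window
    ]
    return [(entry["msg"], traces.get(entry["trace_id"], [])) for entry in related]

print(correlate(alert, logs, traces))
# [('payment provider timeout', ['checkout -> payments -> provider-gateway'])]
```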

Ambiguous Ownership → Clear Incident Command

The first few minutes of an incident often determine the total MTTR. If no one knows who's in charge, time is lost to hesitation.

A clear incident command structure – with a Lead, a Scribe, and defined technical owners – turns panic into coordination. Clarity is a multiplier for speed.
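
One lightweight way to make ownership explicit (a sketch; the role names follow the table above, everything else is hypothetical) is to open every incident with the roles already assigned, and to treat missing roles as a blocker:

```python
from dataclasses import dataclass, field

REQUIRED_ROLES = {"incident_lead", "scribe", "comms_lead"}

@dataclass
class Incident:
    title: str
    roles: dict = field(default_factory=dict)  # role -> person

    def assign(self, role: str, person: str) -> None:
        self.roles[role] = person

    def missing_roles(self) -> set:
        return REQUIRED_ROLES - set(self.roles)

incident = Incident("Checkout latency spike")
incident.assign("incident_lead", "priya")
incident.assign("scribe", "sam")
print(incident.missing_roles())  # {'comms_lead'} -> fill this before triage starts
```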

Tribal Knowledge Dependency → Documented Runbooks

Systems recover faster when knowledge isn't person-bound. When only one engineer "knows" how a component behaves under failure, every minute of their absence adds to downtime.

Runbooks and architectural notes make recovery procedural, not heroic. Institutional knowledge beats tribal knowledge, every time.
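
A runbook does not have to be elaborate; even a structured checklist that a script can walk through removes the dependence on one person's memory. A minimal sketch (the steps and service names are invented for illustration):

```python
# A runbook as data: each step is explicit, ordered, and owned by no one person.
RUNBOOK = {
    "service": "checkout",
    "failure_mode": "payment provider timeouts",
    "steps": [
        "Check the provider status page and the recent error-rate dashboard.",
        "Fail over to the secondary payment provider via the feature flag.",
        "Verify the success rate stays above 99% for 10 minutes.",
        "Open a follow-up ticket to review timeout budgets.",
    ],
}

def walk(runbook):
    print(f"Runbook: {runbook['service']} / {runbook['failure_mode']}")
    for n, step in enumerate(runbook["steps"], start=1):
        done = input(f"[{n}] {step}  (done? y/n) ").strip().lower() == "y"
        if not done:
            print("Stop: escalate to the incident lead before continuing.")
            break

if __name__ == "__main__":
    walk(RUNBOOK)
```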

Delayed or Low-Quality Alerts → Contextual and Prioritized Alerting

MTTR starts at detection. If alerts arrive late, or worse, arrive noisy and without context, the system is already behind.

Good alerting surfaces what matters first: alerts linked to user impact, enriched with context and severity. A well-designed alert doesn't just notify – it orients.
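
In practice, "context" means an alert payload that already answers the first questions a responder would ask. A hedged sketch (the fields, severity rule, and URLs are illustrative, not a prescription for any alerting tool):

```python
# Build an alert that carries impact, severity, and ownership instead of a bare metric.
def build_alert(metric: str, value: float, threshold: float,
                user_impact: str, owner: str, runbook_url: str) -> dict:
    severity = "page" if value > threshold * 2 else "ticket"
    return {
        "summary": f"{metric} at {value} (threshold {threshold})",
        "severity": severity,        # decides whether a human is woken up
        "user_impact": user_impact,  # the "so what" for the responder
        "owner": owner,              # who gets it, no routing guesswork
        "runbook": runbook_url,      # the first action is one click away
    }

alert = build_alert(
    metric="checkout_error_rate",
    value=0.09, threshold=0.02,
    user_impact="~9% of checkout attempts failing",
    owner="payments-oncall",
    runbook_url="https://runbooks.example.internal/checkout-errors",
)
print(alert["severity"])  # "page" - the value is more than 2x the threshold
```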

Unstructured Communication → War-Room Discipline

Incident channels often devolve into noise – too many voices, overlapping updates, and no clear sequence of events.

War-room discipline restores order: timestamped updates, designated leads, and a single thread of record. The structure may feel rigid, but it accelerates clarity.
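
The "single thread of record" can be as simple as an append-only, timestamped log kept by the Scribe. A minimal sketch (the helper and its in-memory storage are hypothetical):

```python
from datetime import datetime, timezone

incident_log: list[str] = []  # append-only record kept by the Scribe

def record(author: str, update: str) -> None:
    """Append a timestamped entry; nothing is ever edited or deleted."""
    ts = datetime.now(timezone.utc).isoformat(timespec="seconds")
    incident_log.append(f"{ts} [{author}] {update}")

record("incident_lead", "Declared SEV-2: checkout error rate at 9%.")
record("tech_actor", "Rolled back deploy 2024-05-01-3 in region eu-west-1.")
record("comms_lead", "Status page updated: partial checkout degradation.")

print("\n".join(incident_log))  # the postmortem timeline writes itself
```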

Noisy Monitoring → Adaptive Thresholds

When everything is "critical," nothing is.

Teams lose urgency when faced with hundreds of alerts of equal importance. Adaptive thresholds and anomaly detection help focus human attention where it matters β€” on genuine deviations from normal behavior.
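
"Adaptive" can start very simply: compare each new data point against a rolling baseline rather than a fixed number. A sketch of a rolling z-score check (the window size and threshold are arbitrary illustrations):

```python
from collections import deque
from statistics import mean, stdev

class RollingAnomalyDetector:
    """Flag points that deviate sharply from a rolling baseline."""

    def __init__(self, window: int = 60, z_threshold: float = 3.0):
        self.values = deque(maxlen=window)
        self.z_threshold = z_threshold

    def observe(self, value: float) -> bool:
        """Return True if `value` is anomalous relative to recent history."""
        if len(self.values) >= 2:
            mu, sigma = mean(self.values), stdev(self.values)
            anomalous = sigma > 0 and abs(value - mu) / sigma > self.z_threshold
        else:
            anomalous = False  # not enough history yet
        self.values.append(value)
        return anomalous

detector = RollingAnomalyDetector(window=30, z_threshold=3.0)
latencies = [102, 98, 105, 99, 101, 97, 103, 100, 350]  # ms; the last point spikes
flags = [detector.observe(x) for x in latencies]
print(flags[-1])  # True - only the genuine deviation demands attention
```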

Complex Releases → Deployment Correlation

During incidents, teams often waste time rediscovering that the issue began right after a deploy.

Correlating incidents with deployment timelines or configuration changes reduces uncertainty. This isn't about assigning blame – it's about shrinking the search space quickly.
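
Even without dedicated tooling, "what changed recently?" can be automated as a first-pass check. A sketch that flags deployments and config changes landing shortly before an incident (the change records are made up for illustration; in practice they would come from CI/CD and config systems):

```python
from datetime import datetime, timedelta

# Illustrative change log.
changes = [
    {"kind": "deploy", "service": "checkout", "at": datetime(2024, 5, 1, 9, 55), "ref": "v341"},
    {"kind": "config", "service": "payments", "at": datetime(2024, 5, 1, 9, 58), "ref": "timeout=2s"},
    {"kind": "deploy", "service": "search", "at": datetime(2024, 4, 30, 14, 0), "ref": "v87"},
]

def recent_changes(incident_start: datetime, window: timedelta = timedelta(hours=1)):
    """List changes that landed shortly before the incident - the prime suspects."""
    return [
        c for c in changes
        if timedelta(0) <= incident_start - c["at"] <= window
    ]

incident_start = datetime(2024, 5, 1, 10, 2)
for c in recent_changes(incident_start):
    print(f"{c['kind']} to {c['service']} ({c['ref']}) {incident_start - c['at']} before incident")
# deploy to checkout (v341) 0:07:00 before incident
# config to payments (timeout=2s) 0:04:00 before incident
```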

Dependency Blind Spots → End-to-End Visibility

Systems rarely fail in isolation. An API latency spike in one service can cascade into failures elsewhere.

End-to-end visibility helps teams see across boundaries – understanding not just their own service, but how it fits into the larger reliability graph.
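
The mechanical prerequisite for seeing across boundaries is that every request carries a correlation identifier across every hop, including calls to dependencies. A simplified stand-in for real distributed tracing (the header name, services, and functions are all illustrative):

```python
import uuid

def handle_request(incoming_headers: dict) -> dict:
    """Reuse an incoming trace ID, or mint one, and pass it to every dependency."""
    trace_id = incoming_headers.get("x-trace-id") or uuid.uuid4().hex
    log(trace_id, "checkout", "received order request")
    call_dependency("payments", {"x-trace-id": trace_id})
    call_dependency("inventory", {"x-trace-id": trace_id})
    return {"x-trace-id": trace_id}

def call_dependency(service: str, headers: dict) -> None:
    log(headers["x-trace-id"], service, "handled downstream call")

def log(trace_id: str, service: str, msg: str) -> None:
    print(f"trace={trace_id} service={service} {msg}")

handle_request({})  # all three lines share one trace ID -> joinable across services
```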

No Post-Incident Learning → Structured Postmortems

If an incident doesn't produce learning, it's bound to repeat.

Structured postmortems – with clear timelines, decisions, and next actions – transform operational pain into organizational learning. Reliability improves when teams close the feedback loop.
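
Structure matters more than length: a postmortem that always has the same fields can actually be reviewed and tracked. A minimal sketch of such a record (the fields are one common shape, not a mandated template):

```python
from dataclasses import dataclass, field

@dataclass
class ActionItem:
    description: str
    owner: str
    due: str          # kept simple: an ISO date string
    done: bool = False

@dataclass
class Postmortem:
    incident: str
    impact: str
    timeline: list[str]              # timestamped entries from the scribe's log
    contributing_factors: list[str]  # plural on purpose: rarely one "root cause"
    action_items: list[ActionItem] = field(default_factory=list)

    def open_actions(self) -> list[ActionItem]:
        return [a for a in self.action_items if not a.done]

pm = Postmortem(
    incident="2024-05-01 checkout degradation",
    impact="~9% of checkout attempts failed for 38 minutes",
    timeline=["10:02 alert fired", "10:09 rollback started", "10:40 recovered"],
    contributing_factors=["timeout config change", "no canary on payments path"],
    action_items=[ActionItem("Add canary stage to payments deploys", "priya", "2024-05-15")],
)
print(len(pm.open_actions()))  # 1 - the loop isn't closed until this reaches 0
```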

Reactive Culture → Reliability Mindset

Finally, reliability isn't built during incidents – it's built between them.

A reactive culture celebrates firefighting; a reliability mindset values prevention. Investing in chaos drills, resilience patterns, and testing failure paths ensures MTTR naturally trends downward over time.
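
Testing failure paths deliberately can start small, for example with a fault-injection wrapper exercised during game days. A toy sketch (the failure rate and the wrapped function are purely illustrative):

```python
import random

def with_fault_injection(func, failure_rate: float = 0.1, exc=TimeoutError):
    """Wrap a callable so it sometimes fails - exercise retries and fallbacks on purpose."""
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            raise exc("injected fault (chaos drill)")
        return func(*args, **kwargs)
    return wrapper

def fetch_inventory(sku: str) -> int:
    return 42  # stand-in for a real dependency call

flaky_fetch = with_fault_injection(fetch_inventory, failure_rate=0.3)

failures = 0
for _ in range(20):
    try:
        flaky_fetch("SKU-123")
    except TimeoutError:
        failures += 1  # in a drill, verify alerts fire and fallbacks engage here
print(f"{failures}/20 calls failed - did the system degrade gracefully?")
```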


MTTR reflects not just the health of systems, but the health of collaboration.

Reliable systems recover quickly not because they never fail, but because when they do, everyone knows exactly what to do next.

Why Unified Observability Matters for Growing Engineering Teams

· 11 min read
Ranjan Sakalley
Founder at base14

Last month, I watched a senior engineer spend three hours debugging what should have been a fifteen-minute problem. The issue wasn't complexity – it was context switching between four different monitoring tools, correlating timestamps manually, and losing their train of thought every time they had to log into yet another dashboard. If this sounds familiar, you're not alone. This is the hidden tax most engineering teams pay without realizing there's a better way.

Observability Theatre

· 11 min read
Ranjan Sakalley
Founder at base14

the·a·tre (also the·a·ter) /ˈθiːətər/ noun

: the performance of actions or behaviors for appearance rather than substance; an elaborate pretense that simulates real activity while lacking its essential purpose or outcomes

Example: "The company's security theatre gave the illusion of protection without addressing actual vulnerabilities."


Your organization has invested millions in observability tools. You have dashboards for everything. Your teams dutifully instrument their services. Yet when incidents strike, engineers still spend hours hunting through disparate systems, correlating timestamps manually, and guessing at root causes. When the CEO forwards a customer complaint asking "are we down?", that's when the dev team gets to know about incidents.