pgX: Comprehensive PostgreSQL Monitoring at Scale
For many teams, PostgreSQL monitoring begins, and often ends, with
pg_stat_statements. That choice is understandable: the extension provides
normalized query statistics, execution counts, timing data, and enough signal
to identify slow queries and obvious inefficiencies. For a long time, that is
enough.
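This kind of query-level triage typically looks something like the following sketch, which ranks statements by cumulative execution time (column names assume PostgreSQL 13+, where `total_time` was split into `total_exec_time` and `total_plan_time`):

```sql
-- Top 10 statements by total execution time, from pg_stat_statements.
-- Requires the pg_stat_statements extension to be installed and enabled.
SELECT queryid,
       calls,
       round(total_exec_time::numeric, 1) AS total_ms,
       round(mean_exec_time::numeric, 2)  AS mean_ms,
       rows,
       left(query, 60)                    AS query_snippet
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;
```

A view like this answers "which query is slow?" well, which is exactly why many teams stop here.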
But as PostgreSQL clusters grow in size and importance, the questions engineers need to answer change. Instead of "Which query is slow?", the questions become harder and more operational:
- Why is replication lagging right now?
- Which application is exhausting the connection pool?
- What is blocking this transaction?
- Is autovacuum keeping up with write volume?
- Did performance degrade because of query shape, data growth, or resource pressure?
These are not questions pg_stat_statements is designed to answer.
Most teams eventually respond by stitching together ad-hoc queries against
pg_stat_activity, pg_locks, pg_stat_replication, pg_stat_user_tables,
and related system views. This works until an incident demands answers in
minutes, not hours.
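As one illustration of what that stitching looks like in practice, here is the kind of ad-hoc lock-blocking query an engineer might write mid-incident, joining pg_stat_activity against itself via `pg_blocking_pids()` (available since PostgreSQL 9.6):

```sql
-- Show which sessions are blocked, and which sessions are blocking them.
SELECT blocked.pid    AS blocked_pid,
       blocked.query  AS blocked_query,
       blocker.pid    AS blocker_pid,
       blocker.query  AS blocker_query
FROM pg_stat_activity AS blocked
JOIN pg_stat_activity AS blocker
  ON blocker.pid = ANY (pg_blocking_pids(blocked.pid))
WHERE cardinality(pg_blocking_pids(blocked.pid)) > 0;
```

Each of the questions above demands a different hand-written query like this one, against a different system view, and each must be remembered or rediscovered under pressure.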
This post lays out what comprehensive PostgreSQL monitoring actually looks like at scale: the nine observability domains that matter, the kinds of metrics each domain requires, and why moving beyond query-only monitoring is unavoidable for serious production systems.