Before diving into individual signals, make sure you understand the full scope of a PostgreSQL health check and why these checks matter in production environments.
👉 PostgreSQL Health Check Overview:
Some PostgreSQL problems announce themselves loudly.
Others stay quiet for weeks — and then take your production system down in minutes.
This guide focuses on health check signals you should never ignore.
If you see any of the following in production, it’s not “something to look at later” — it’s a warning sign.
1- Database or Server Restart (Silent but Critical)
One of the most important signals to check first is whether the database or server has restarted.
A restart often indicates:
- Kernel panic
- OOM killer
- Hardware issue
- Manual intervention
- Crash caused by I/O or filesystem problems
Why this matters
PostgreSQL keeps cumulative statistics across a clean shutdown, but they are reset after a crash or an immediate shutdown.
If core statistics have been reset, it is a strong signal that an unclean restart occurred.
If a restart happened:
- Immediately check PostgreSQL logs
- Check system logs (journalctl, /var/log/syslog)
- Start root cause analysis before traffic ramps up
🔍 Check when PostgreSQL last started
SELECT pg_postmaster_start_time();
🔍 Check statistics reset time
SELECT stats_reset
FROM pg_stat_database
WHERE datname = current_database();
🧠 Important:
If pg_postmaster_start_time and stats_reset are close, this is a strong restart indicator.
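If you want both values side by side, a minimal sketch like this works (the gap alias is mine):
SELECT
    pg_postmaster_start_time() AS postmaster_start,
    stats_reset,
    stats_reset - pg_postmaster_start_time() AS gap  -- a small gap means stats were reset around the (re)start
FROM pg_stat_database
WHERE datname = current_database();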
2- n_dead_tup Growth — The Silent Time Bomb
Before anything else, let’s clarify what n_dead_tup means.
Dead tuples are rows that are no longer visible to any transaction but still occupy space.
They are cleaned up by VACUUM / autovacuum.
Why n_dead_tup is dangerous
A continuously increasing n_dead_tup means:
- Autovacuum is not keeping up
- Vacuum is blocked
- Configuration thresholds are too high
- Long-running transactions prevent cleanup
📌 This is one of the most common root causes of major production outages.
Critical misconception
“Autovacuum is ON, so I’m safe.”
❌ Wrong.
Autovacuum only guarantees that vacuum will eventually run after thresholds are exceeded.
If dead tuples pile up for a long time, when vacuum finally runs:
- Massive I/O spike occurs
- Query latency explodes
- Production traffic suffers
🔍 Check dead tuples per table
SELECT
    schemaname,
    relname,
    n_dead_tup,
    last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
🔍 Tables where vacuum hasn’t run recently
SELECT
    relname,
    n_dead_tup,
    now() - last_autovacuum AS since_last_autovacuum
FROM pg_stat_user_tables
WHERE last_autovacuum IS NULL
   OR now() - last_autovacuum > interval '1 day'
ORDER BY n_dead_tup DESC;
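To see how far a table sits from its autovacuum trigger, a rough sketch like the one below can help. It uses the global autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor settings and ignores per-table storage parameter overrides:
SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    round(current_setting('autovacuum_vacuum_threshold')::numeric
          + current_setting('autovacuum_vacuum_scale_factor')::numeric * n_live_tup)
        AS approx_autovacuum_trigger  -- approximation only: per-table reloptions can override these settings
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
ORDER BY n_dead_tup DESC
LIMIT 10;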
🚨 If n_dead_tup Keeps Growing → Action Is Required
Dead tuples are not harmless.
If n_dead_tup keeps increasing, it means vacuum is falling behind.
This eventually leads to table bloat, higher I/O, slower queries, and aggressive autovacuum kicking in at the worst possible time.
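Since long-running transactions are one of the reasons cleanup stalls, it is also worth checking which sessions hold back the xmin horizon. A sketch:
SELECT
    pid,
    state,
    now() - xact_start AS xact_age,
    age(backend_xmin) AS xmin_age  -- sessions with an old backend_xmin prevent VACUUM from removing dead tuples
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;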
3- Transaction ID Wraparound Risk (Non-Negotiable)
Transaction ID (XID) wraparound is one of the most dangerous PostgreSQL risks.
If PostgreSQL reaches wraparound:
- It forces aggressive vacuum
- Writes may be blocked
- System performance degrades heavily
- In extreme cases, the database becomes unusable
Key parameters you must understand
- autovacuum_freeze_max_age
- autovacuum_multixact_freeze_max_age
- vacuum_multixact_failsafe_age
These define hard limits.
If they are exceeded, PostgreSQL must vacuum — no matter the cost.
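To see the current values on your server (note that vacuum_multixact_failsafe_age exists from PostgreSQL 14 onward):
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum_freeze_max_age',
               'autovacuum_multixact_freeze_max_age',
               'vacuum_multixact_failsafe_age');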
Why this is dangerous in production
Forced vacuum causes:
- Heavy disk I/O
- Latency spikes
- Application slowdowns
- Potential outages
🔍 Check wraparound risk
SELECT
    datname,
    age(datfrozenxid) AS xid_age,
    mxid_age(datminmxid) AS multixact_age
FROM pg_database
ORDER BY xid_age DESC;
🚨 If xid_age approaches autovacuum_freeze_max_age, you are already late.
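One way to express this as a percentage of the limit (a sketch based on the global setting; per-table overrides are ignored):
SELECT
    datname,
    age(datfrozenxid) AS xid_age,
    round(100.0 * age(datfrozenxid)
          / current_setting('autovacuum_freeze_max_age')::bigint, 1) AS pct_of_freeze_max_age
FROM pg_database
ORDER BY pct_of_freeze_max_age DESC;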
4- Replication Lag — HA at Risk
Replication must be monitored continuously.
If replication lag increases:
- HA guarantees are compromised
- Failover may cause data loss (asynchronous replication)
- Primary latency increases (synchronous replication)
Streaming replication risks
Asynchronous replication
- Lag = potential data loss during failover
Synchronous replication
- Lag = primary waits → application slowdown
🔍 Check replication lag
SELECT
    application_name,
    state,
    sync_state,
    pg_current_wal_lsn() - replay_lsn AS replication_lag_bytes
FROM pg_stat_replication;
📌 Replication lag is never “just a number.”
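The query above runs on the primary. On the standby itself, a time-based view of lag can be estimated like this (a sketch; the delay also grows while the primary is idle, so read it together with the LSN values):
-- Run on the standby
SELECT
    now() - pg_last_xact_replay_timestamp() AS replay_delay,
    pg_last_wal_receive_lsn() AS received_lsn,
    pg_last_wal_replay_lsn()  AS replayed_lsn;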
5- Long-Running Queries on Standby Servers
One of the most underestimated production killers.
Long-running queries on streaming replicas:
- Prevent WAL replay
- Increase replication lag
- Block vacuum on primary
- Cause massive dead tuple accumulation
The nightmare scenario
- Reporting query starts at 00:00 on a replica
- Runs until morning
- Primary keeps accepting writes
- Vacuum cannot clean dead tuples
- Autovacuum kicks in during peak hours
- Production collapses
🔍 Find long-running queries on replicas
SELECT
    pid,
    now() - query_start AS duration,
    state,
    query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '10 minutes'
ORDER BY duration DESC;
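It is also worth checking whether the standby is configured to push this problem back to the primary. With hot_standby_feedback enabled, long standby queries make the primary retain dead tuples:
-- Run on the standby
SELECT name, setting
FROM pg_settings
WHERE name IN ('hot_standby_feedback',
               'max_standby_streaming_delay');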
🚨 Streaming replicas are NOT for reporting.
6- Streaming Replication Is for HA — Not Reporting
This mistake appears in many architectures.
- Streaming replication = High Availability
- Reporting workloads = Logical replication or a dedicated reporting database
Using streaming replicas for reporting:
- Increases lag
- Breaks HA guarantees
- Blocks vacuum
- Hurts primary performance
📌 This is an architectural rule — not a suggestion.
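For reference, offloading reporting with logical replication looks roughly like this (a sketch; the publication, subscription, table names, and connection details are hypothetical):
-- On the primary (publisher), requires wal_level = logical
CREATE PUBLICATION reporting_pub FOR TABLE orders, customers;

-- On the dedicated reporting database (subscriber)
CREATE SUBSCRIPTION reporting_sub
    CONNECTION 'host=primary.example.com dbname=appdb user=repl_user'
    PUBLICATION reporting_pub;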
7- Query Response Times — One Query Can Kill Everything
Never underestimate a single query.
One long-running query can:
- Hold locks
- Block vacuum
- Exhaust CPU
- Cause cascading failures
🔍 Detect slow active queries
SELECT
    pid,
    now() - query_start AS duration,
    query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 minute'
ORDER BY duration DESC;
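If the pg_stat_statements extension is installed, the heaviest queries over time can be ranked as well (column names per PostgreSQL 13 and later):
SELECT
    calls,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(total_exec_time::numeric, 2) AS total_ms,
    left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;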
8- OS CPU & Load — Your Hidden Safety Buffer
CPU usage should not run hot all the time.
Strong recommendation
- Normal workload: < 5% CPU
- Treat unused CPU as an emergency buffer
- Sustained 20%+ CPU usage = fragile system
When something goes wrong:
- Autovacuum
- Checkpoints
- I/O spikes
You need CPU headroom to absorb these events safely.
📌 High load with low CPU usually means I/O wait.
9- Blocking Locks — Always Alert on Them
Blocking locks should never be ignored.
They:
- Cause query pile-ups
- Trigger timeouts
- Freeze applications
🔍 Find blocking locks
-- pg_blocking_pids() (PostgreSQL 9.6+) returns the PIDs that actually block a given backend
SELECT
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
    ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));
🚨 Blocking locks deserve alerts — not dashboards.
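For alerting, a single number is often easier to wire into monitoring than the full join. A sketch using pg_blocking_pids():
SELECT count(*) AS blocked_sessions
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;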
10- Swap Usage — Trouble Is Coming
If swap usage:
- Starts increasing
- Keeps increasing
- Never goes back down
➡️ A serious problem is approaching.
Possible causes:
- Memory pressure
- Misconfigured shared_buffers or work_mem
- OS overcommit issues
📌 PostgreSQL on swap = latency disaster.
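Before blaming the OS, review the memory-related settings that most often drive swap pressure (a read-only check; interpret the values against total RAM and connection count):
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem',
               'maintenance_work_mem', 'max_connections');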
⚠️ Final Warning
These signals are not optional checks.
If you ignore them, production will eventually remind you — painfully.
If you detect any of these early:
- Investigate immediately
- Fix root causes
- Avoid emergency firefighting later
A proper PostgreSQL health check exists to prevent surprises, not to explain outages after they happen.
👉 Learn the full health check methodology here:
Author: Fırat Güleç — Principal PostgreSQL DBA
Try pghealth Free Today 🚀
Start your journey toward a healthier PostgreSQL with pghealth.
You can explore all features immediately with a free trial — no installation required.

