Before diving into individual signals, make sure you understand the full scope of a PostgreSQL health check and why these checks matter in production environments.
👉 PostgreSQL Health Check Overview:
Some PostgreSQL problems announce themselves loudly.
Others stay quiet for weeks — and then take your production system down in minutes.
This guide focuses on health check signals you should never ignore.
If you see any of the following in production, it’s not “something to look at later” — it’s a warning sign.
1- Database or Server Restart (Silent but Critical)
One of the most important signals to check first is whether the database or server has restarted.
A restart often indicates:
- Kernel panic
- OOM killer
- Hardware issue
- Manual intervention
- Crash caused by I/O or filesystem problems
Why this matters
PostgreSQL keeps cumulative statistics across a clean shutdown, but they are reset after a crash or an immediate shutdown.
If core statistics have been reset, it is a strong signal that an unclean restart occurred.
If a restart happened:
- Immediately check PostgreSQL logs
- Check system logs (journalctl, /var/log/syslog)
- Start root cause analysis before traffic ramps up
🔍 Check when PostgreSQL last started
SELECT pg_postmaster_start_time();
🔍 Check statistics reset time
SELECT stats_reset
FROM pg_stat_database
WHERE datname = current_database();
🧠 Important:
If pg_postmaster_start_time and stats_reset are close, this is a strong restart indicator.
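If you want both values side by side, a minimal sketch like this works (the gap alias is mine):
SELECT
    pg_postmaster_start_time() AS postmaster_start,
    stats_reset,
    stats_reset - pg_postmaster_start_time() AS gap  -- a small gap means stats were reset around the (re)start
FROM pg_stat_database
WHERE datname = current_database();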
2- n_dead_tup Growth — The Silent Time Bomb
Before anything else, let’s clarify what n_dead_tup means.
Dead tuples are rows that are no longer visible to any transaction but still occupy space.
They are cleaned up by VACUUM / autovacuum.
Why n_dead_tup is dangerous
A continuously increasing n_dead_tup means:
- Autovacuum is not keeping up
- Vacuum is blocked
- Configuration thresholds are too high
- Long-running transactions prevent cleanup
📌 This is one of the most common root causes of major production outages.
Critical misconception
“Autovacuum is ON, so I’m safe.”
❌ Wrong.
Autovacuum only guarantees that vacuum will eventually run after thresholds are exceeded.
If dead tuples pile up for a long time, when vacuum finally runs:
- Massive I/O spike occurs
- Query latency explodes
- Production traffic suffers
🔍 Check dead tuples per table
SELECT
    schemaname,
    relname,
    n_dead_tup,
    last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
🔍 Tables where vacuum hasn’t run recently
SELECT
    relname,
    n_dead_tup,
    now() - last_autovacuum AS since_last_autovacuum
FROM pg_stat_user_tables
WHERE last_autovacuum IS NULL
   OR now() - last_autovacuum > interval '1 day'
ORDER BY n_dead_tup DESC;
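To see how far a table sits from its autovacuum trigger, a rough sketch like the one below can help. It uses the global autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor settings and ignores per-table storage parameter overrides:
SELECT
    relname,
    n_live_tup,
    n_dead_tup,
    round(current_setting('autovacuum_vacuum_threshold')::numeric
          + current_setting('autovacuum_vacuum_scale_factor')::numeric * n_live_tup)
        AS approx_autovacuum_trigger  -- approximation only: per-table reloptions can override these settings
FROM pg_stat_user_tables
WHERE n_dead_tup > 0
ORDER BY n_dead_tup DESC
LIMIT 10;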
🚨 If n_dead_tup Keeps Growing → Action Is Required
Dead tuples are not harmless.
If n_dead_tup keeps increasing, it means vacuum is falling behind.
This eventually leads to table bloat, higher I/O, slower queries, and aggressive autovacuum kicking in at the worst possible time.
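Since long-running transactions are one of the reasons cleanup stalls, it is also worth checking which sessions hold back the xmin horizon. A sketch:
SELECT
    pid,
    state,
    now() - xact_start AS xact_age,
    age(backend_xmin) AS xmin_age  -- sessions with an old backend_xmin prevent VACUUM from removing dead tuples
FROM pg_stat_activity
WHERE backend_xmin IS NOT NULL
ORDER BY age(backend_xmin) DESC
LIMIT 10;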
3- Transaction ID Wraparound Risk (Non-Negotiable)
Transaction ID (XID) wraparound is one of the most dangerous PostgreSQL risks.
If PostgreSQL reaches wraparound:
- It forces aggressive vacuum
- Writes may be blocked
- System performance degrades heavily
- In extreme cases, the database becomes unusable
Key parameters you must understand
- autovacuum_freeze_max_age
- autovacuum_multixact_freeze_max_age
- vacuum_multixact_failsafe_age
These define hard limits.
If they are exceeded, PostgreSQL must vacuum — no matter the cost.
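To see the current values on your server (note that vacuum_multixact_failsafe_age exists from PostgreSQL 14 onward):
SELECT name, setting
FROM pg_settings
WHERE name IN ('autovacuum_freeze_max_age',
               'autovacuum_multixact_freeze_max_age',
               'vacuum_multixact_failsafe_age');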
Why this is dangerous in production
Forced vacuum causes:
- Heavy disk I/O
- Latency spikes
- Application slowdowns
- Potential outages
🔍 Check wraparound risk
SELECT
    datname,
    age(datfrozenxid) AS xid_age,
    mxid_age(datminmxid) AS multixact_age
FROM pg_database
ORDER BY xid_age DESC;
🚨 If xid_age approaches autovacuum_freeze_max_age, you are already late.
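One way to express this as a percentage of the limit (a sketch based on the global setting; per-table overrides are ignored):
SELECT
    datname,
    age(datfrozenxid) AS xid_age,
    round(100.0 * age(datfrozenxid)
          / current_setting('autovacuum_freeze_max_age')::bigint, 1) AS pct_of_freeze_max_age
FROM pg_database
ORDER BY pct_of_freeze_max_age DESC;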
4- Replication Lag — HA at Risk
Replication must be monitored continuously.
If replication lag increases:
- HA guarantees are compromised
- Failover may cause data loss (asynchronous replication)
- Primary latency increases (synchronous replication)
Streaming replication risks
Asynchronous replication
- Lag = potential data loss during failover
Synchronous replication
- Lag = primary waits → application slowdown
🔍 Check replication lag
SELECT
    application_name,
    state,
    sync_state,
    pg_current_wal_lsn() - replay_lsn AS replication_lag_bytes
FROM pg_stat_replication;
📌 Replication lag is never “just a number.”
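The query above runs on the primary. On the standby itself, a time-based view of lag can be estimated like this (a sketch; the delay also grows while the primary is idle, so read it together with the LSN values):
-- Run on the standby
SELECT
    now() - pg_last_xact_replay_timestamp() AS replay_delay,
    pg_last_wal_receive_lsn() AS received_lsn,
    pg_last_wal_replay_lsn()  AS replayed_lsn;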
5- Long-Running Queries on Standby Servers
One of the most underestimated production killers.
Long-running queries on streaming replicas:
- Prevent WAL replay
- Increase replication lag
- Block vacuum on primary
- Cause massive dead tuple accumulation
The nightmare scenario
- Reporting query starts at 00:00 on a replica
- Runs until morning
- Primary keeps accepting writes
- Vacuum cannot clean dead tuples
- Autovacuum kicks in during peak hours
- Production collapses
🔍 Find long-running queries on replicas
SELECT
    pid,
    now() - query_start AS duration,
    state,
    query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '10 minutes'
ORDER BY duration DESC;
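It is also worth checking whether the standby is configured to push this problem back to the primary. With hot_standby_feedback enabled, long standby queries make the primary retain dead tuples:
-- Run on the standby
SELECT name, setting
FROM pg_settings
WHERE name IN ('hot_standby_feedback',
               'max_standby_streaming_delay');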
🚨 Streaming replicas are NOT for reporting.
6- Streaming Replication Is for HA — Not Reporting
This mistake appears in many architectures.
- Streaming replication = High Availability
- Reporting workloads = Logical replication or a dedicated reporting database
Using streaming replicas for reporting:
- Increases lag
- Breaks HA guarantees
- Blocks vacuum
- Hurts primary performance
📌 This is an architectural rule — not a suggestion.
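For reference, offloading reporting with logical replication looks roughly like this (a sketch; the publication, subscription, table names, and connection details are hypothetical):
-- On the primary (publisher), requires wal_level = logical
CREATE PUBLICATION reporting_pub FOR TABLE orders, customers;

-- On the dedicated reporting database (subscriber)
CREATE SUBSCRIPTION reporting_sub
    CONNECTION 'host=primary.example.com dbname=appdb user=repl_user'
    PUBLICATION reporting_pub;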
7- Query Response Times — One Query Can Kill Everything
Never underestimate a single query.
One long-running query can:
- Hold locks
- Block vacuum
- Exhaust CPU
- Cause cascading failures
🔍 Detect slow active queries
SELECT
    pid,
    now() - query_start AS duration,
    query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 minute'
ORDER BY duration DESC;
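If the pg_stat_statements extension is installed, the heaviest queries over time can be ranked as well (column names per PostgreSQL 13 and later):
SELECT
    calls,
    round(mean_exec_time::numeric, 2) AS mean_ms,
    round(total_exec_time::numeric, 2) AS total_ms,
    left(query, 80) AS query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;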
8- OS CPU & Load — Your Hidden Safety Buffer
CPU usage should not run hot all the time.
Strong recommendation
- Normal workload: < 5% CPU
- Treat unused CPU as an emergency buffer
- Sustained 20%+ CPU usage = fragile system
When something goes wrong:
- Autovacuum
- Checkpoints
- I/O spikes
You need CPU headroom to absorb these events safely.
📌 High load with low CPU usually means I/O wait.
9- Blocking Locks — Always Alert on Them
Blocking locks should never be ignored.
They:
- Cause query pile-ups
- Trigger timeouts
- Freeze applications
🔍 Find blocking locks
-- pg_blocking_pids() (PostgreSQL 9.6+) returns the PIDs that actually block a given backend
SELECT
    blocked.pid AS blocked_pid,
    blocked.query AS blocked_query,
    blocking.pid AS blocking_pid,
    blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
    ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));
🚨 Blocking locks deserve alerts — not dashboards.
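For alerting, a single number is often easier to wire into monitoring than the full join. A sketch using pg_blocking_pids():
SELECT count(*) AS blocked_sessions
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;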
10- Swap Usage — Trouble Is Coming
If swap usage:
- Starts increasing
- Keeps increasing
- Never goes back down
➡️ A serious problem is approaching.
Possible causes:
- Memory pressure
- Misconfigured shared_buffers or work_mem
- OS overcommit issues
📌 PostgreSQL on swap = latency disaster.
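Before blaming the OS, review the memory-related settings that most often drive swap pressure (a read-only check; interpret the values against total RAM and connection count):
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem',
               'maintenance_work_mem', 'max_connections');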
⚠️ Final Warning
These signals are not optional checks.
If you ignore them, production will eventually remind you — painfully.
If you detect any of these early:
- Investigate immediately
- Fix root causes
- Avoid emergency firefighting later
A proper PostgreSQL health check exists to prevent surprises, not to explain outages after they happen.
👉 Learn the full health check methodology here:
Author: Fırat Güleç — Principal PostgreSQL DBA
Try pghealth Free Today 🚀
Start your journey toward a healthier PostgreSQL with pghealth.
You can explore all features immediately with a free trial — no installation required.

