PostgreSQL Health Check Signals You Should Never Ignore (and Why)

Critical PostgreSQL health check signals that indicate serious production risks — from restarts and dead tuples to replication lag and XID wraparound.

· 7 minute read

Before diving into individual signals, make sure you understand the full scope of a PostgreSQL health check and why these checks matter in production environments.

👉 PostgreSQL Health Check Overview:

Some PostgreSQL problems announce themselves loudly.
Others stay quiet for weeks — and then take your production system down in minutes.

This guide focuses on health check signals you should never ignore.
If you see any of the following in production, it’s not “something to look at later” — it’s a warning sign.


1- Database or Server Restart (Silent but Critical)

One of the most important signals to check first is whether the database or server has restarted.

A restart often indicates:

  • Kernel panic
  • OOM killer
  • Hardware issue
  • Manual intervention
  • Crash caused by I/O or filesystem problems

Why this matters

PostgreSQL preserves cumulative statistics across a clean restart,
but they are reset after a crash or an immediate shutdown, so a recent statistics reset is a strong signal that an unclean restart occurred.

If a restart happened:

  • Immediately check PostgreSQL logs
  • Check system logs (journalctl, /var/log/syslog)
  • Start root cause analysis before traffic ramps up

🔍 Check when PostgreSQL last started

SELECT pg_postmaster_start_time();

🔍 Check statistics reset time

SELECT stats_reset
FROM pg_stat_database
WHERE datname = current_database();

🧠 Important:
If pg_postmaster_start_time and stats_reset are close, this is a strong restart indicator.
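
As a quick combined check, the sketch below puts both timestamps side by side. A small gap between them suggests the statistics were reset at startup, which points to an unclean restart.

🔍 Compare start time and statistics reset time

SELECT
  pg_postmaster_start_time() AS postmaster_start,
  sd.stats_reset,
  sd.stats_reset - pg_postmaster_start_time() AS gap_after_start
FROM pg_stat_database sd
WHERE sd.datname = current_database();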


2- n_dead_tup Growth — The Silent Time Bomb

Before anything else, let’s clarify what n_dead_tup means.

Dead tuples are rows that are no longer visible to any transaction but still occupy space.
They are cleaned up by VACUUM / autovacuum.

Why n_dead_tup is dangerous

A continuously increasing n_dead_tup means:

  • Autovacuum is not keeping up
  • Vacuum is blocked
  • Configuration thresholds are too high
  • Long-running transactions prevent cleanup

📌 This is one of the most common root causes of major production outages.

Critical misconception

“Autovacuum is ON, so I’m safe.”

Wrong.

Autovacuum only guarantees that vacuum will eventually run after thresholds are exceeded.
If dead tuples pile up for a long time, when vacuum finally runs:

  • Massive I/O spike occurs
  • Query latency explodes
  • Production traffic suffers

🔍 Check dead tuples per table

SELECT
  schemaname,
  relname,
  n_dead_tup,
  last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;

🔍 Tables where vacuum hasn’t run recently

SELECT
  relname,
  n_dead_tup,
  now() - last_autovacuum AS since_last_autovacuum
FROM pg_stat_user_tables
WHERE last_autovacuum IS NULL
   OR now() - last_autovacuum > interval '1 day'
ORDER BY n_dead_tup DESC;

🚨 If n_dead_tup Keeps Growing → Action Is Required

Dead tuples are not harmless.

If n_dead_tup keeps increasing, it means vacuum is falling behind.
This eventually leads to table bloat, higher I/O, slower queries, and aggressive autovacuum kicking in at the worst possible time.
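
One way to see why autovacuum has not fired yet is to compare n_dead_tup with the estimated trigger point. The sketch below uses the global autovacuum_vacuum_threshold and autovacuum_vacuum_scale_factor settings; per-table storage parameters, if set, override these values and are not considered here.

🔍 Compare dead tuples with the estimated autovacuum trigger

SELECT
  s.relname,
  s.n_dead_tup,
  round(current_setting('autovacuum_vacuum_threshold')::int
        + current_setting('autovacuum_vacuum_scale_factor')::float8 * c.reltuples)
        AS autovacuum_trigger_estimate
FROM pg_stat_user_tables s
JOIN pg_class c ON c.oid = s.relid
ORDER BY s.n_dead_tup DESC
LIMIT 10;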


3- Transaction ID Wraparound Risk (Non-Negotiable)

Transaction ID (XID) wraparound is one of the most dangerous PostgreSQL risks.

If PostgreSQL reaches wraparound:

  • It forces aggressive vacuum
  • Writes may be blocked
  • System performance degrades heavily
  • In extreme cases, the database becomes unusable

Key parameters you must understand

  • autovacuum_freeze_max_age
  • autovacuum_multixact_freeze_max_age
  • vacuum_multixact_failsafe_age

These define hard limits.
If they are exceeded, PostgreSQL must vacuum — no matter the cost.

Why this is dangerous in production

Forced vacuum causes:

  • Heavy disk I/O
  • Latency spikes
  • Application slowdowns
  • Potential outages

🔍 Check wraparound risk

SELECT
  datname,
  age(datfrozenxid) AS xid_age,
  mxid_age(datminmxid) AS multixact_age
FROM pg_database
ORDER BY xid_age DESC;

🚨 If xid_age approaches autovacuum_freeze_max_age, you are already late.
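
To make the risk easier to read at a glance, the sketch below expresses each database's XID age as a percentage of autovacuum_freeze_max_age, the point where anti-wraparound autovacuum is forced. It assumes the server-wide setting; per-table reloptions are not considered.

🔍 XID age as a percentage of autovacuum_freeze_max_age

SELECT
  datname,
  age(datfrozenxid) AS xid_age,
  current_setting('autovacuum_freeze_max_age')::int AS freeze_max_age,
  round(100.0 * age(datfrozenxid)
        / current_setting('autovacuum_freeze_max_age')::int, 1) AS pct_of_limit
FROM pg_database
ORDER BY pct_of_limit DESC;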


4- Replication Lag — HA at Risk

Replication must be monitored continuously.

If replication lag increases:

  • HA guarantees are compromised
  • Failover may cause data loss (asynchronous replication)
  • Primary latency increases (synchronous replication)

Streaming replication risks

Asynchronous replication

  • Lag = potential data loss during failover

Synchronous replication

  • Lag = primary waits → application slowdown

🔍 Check replication lag

SELECT
  application_name,
  state,
  sync_state,
  pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replication_lag_bytes
FROM pg_stat_replication;

📌 Replication lag is never “just a number.”
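
Byte-based lag is measured from the primary. On the standby itself, a time-based view is often easier to alert on; the sketch below is one option. Keep in mind that this value keeps growing while the primary is idle, so read it together with the byte-based lag above.

🔍 Approximate replication delay (run on the standby)

SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;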

5- Long-Running Queries on Standby Servers

One of the most underestimated production killers.

Long-running queries on streaming replicas:

  • Prevent WAL replay
  • Increase replication lag
  • Block vacuum on primary
  • Cause massive dead tuple accumulation

The nightmare scenario

  • Reporting query starts at 00:00 on a replica
  • Runs until morning
  • Primary keeps accepting writes
  • Vacuum cannot clean dead tuples
  • Autovacuum kicks in during peak hours
  • Production collapses

🔍 Find long-running queries on replicas

SELECT
  pid,
  now() - query_start AS duration,
  state,
  query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '10 minutes'
ORDER BY duration DESC;
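
Whether a long replica query delays WAL replay or instead holds back vacuum on the primary depends on two settings. As a quick check on the replica:

🔍 Settings that control standby query conflicts

SHOW max_standby_streaming_delay;  -- how long WAL replay may wait for conflicting queries
SHOW hot_standby_feedback;         -- when on, replica queries hold back vacuum on the primary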

🚨 Streaming replicas are NOT for reporting.

6- Streaming Replication Is for HA — Not Reporting

This mistake appears in many architectures.

  • Streaming replication = High Availability
  • Reporting workloads = Logical replication or a dedicated reporting database

Using streaming replicas for reporting:

  • Increases lag
  • Breaks HA guarantees
  • Blocks vacuum
  • Hurts primary performance

📌 This is an architectural rule — not a suggestion.

7- Query Response Times — One Query Can Kill Everything

Never underestimate a single query.

One long-running query can:

  • Hold locks
  • Block vacuum
  • Exhaust CPU
  • Cause cascading failures

🔍 Detect slow active queries

SELECT
  pid,
  now() - query_start AS duration,
  query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 minute'
ORDER BY duration DESC;
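
pg_stat_activity only shows what is slow right now. For queries that are consistently slow, the pg_stat_statements extension is the usual tool; the sketch below assumes it is installed and uses PostgreSQL 13+ column names.

🔍 Consistently slow queries (requires pg_stat_statements)

SELECT
  query,
  calls,
  round(mean_exec_time::numeric, 2) AS mean_ms,
  round(total_exec_time::numeric, 2) AS total_ms
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;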

8- OS CPU & Load — Your Hidden Safety Buffer

CPU usage should not run hot all the time.

Strong recommendation

  • Normal workload: < 5% CPU
  • Treat unused CPU as an emergency buffer
  • Sustained 20%+ CPU usage = fragile system

When something goes wrong:

  • Autovacuum
  • Checkpoints
  • I/O spikes

You need CPU headroom to absorb these events safely.

📌 High load with low CPU usually means I/O wait.

9- Blocking Locks — Always Alert on Them

Blocking locks should never be ignored.

They:

  • Cause query pile-ups
  • Trigger timeouts
  • Freeze applications

🔍 Find blocking locks

-- pg_blocking_pids() (PostgreSQL 9.6+) avoids an error-prone manual pg_locks self-join
SELECT
  blocked.pid AS blocked_pid,
  blocked.query AS blocked_query,
  blocking.pid AS blocking_pid,
  blocking.query AS blocking_query
FROM pg_stat_activity blocked
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY (pg_blocking_pids(blocked.pid));

🚨 Blocking locks deserve alerts — not dashboards.
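
If you also want lock waits in the server log, one option (a sketch, adjust to your logging policy) is log_lock_waits, which logs any wait longer than deadlock_timeout (1 second by default).

🔍 Log lock waits longer than deadlock_timeout

ALTER SYSTEM SET log_lock_waits = on;
SELECT pg_reload_conf();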

10- Swap Usage — Trouble Is Coming

If swap usage:

  • Starts increasing
  • Keeps increasing
  • Never goes back down

➡️ A serious problem is approaching.

Possible causes:

  • Memory pressure
  • Misconfigured shared_buffers or work_mem
  • OS overcommit issues

📌 PostgreSQL on swap = latency disaster.
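
Before blaming the OS, confirm what PostgreSQL itself is allowed to use. A quick sketch to review the memory-related settings mentioned above:

🔍 Review memory-related settings

SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('shared_buffers', 'work_mem', 'maintenance_work_mem', 'max_connections');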

⚠️ Final Warning

These signals are not optional checks.

If you ignore them, production will eventually remind you — painfully.

If you detect any of these early:

  • Investigate immediately
  • Fix root causes
  • Avoid emergency firefighting later

A proper PostgreSQL health check exists to prevent surprises, not to explain outages after they happen.

👉 Learn the full health check methodology here:


Author: Fırat Güleç — Principal PostgreSQL DBA


Try pghealth Free Today 🚀

Start your journey toward a healthier PostgreSQL with pghealth.
You can explore all features immediately with a free trial — no installation required.

👉 Start Free Trial