Learn how to perform a complete PostgreSQL Health Check (2025) using SQL queries and OS-level monitoring. Detect slow queries, replication lag, configuration issues, and performance bottlenecks to keep your PostgreSQL database stable and optimized.
OS Level Checks
Check if PostgreSQL is running
postgres@ip-172-31-**-***:/var/log$ psql -p 5432
Output:
psql (18.0, server 14.19)
Type "help" for help.
postgres=#
Description: This command confirms whether PostgreSQL is running and accepting connections on the given port. Switching to a specific database afterwards (e.g., with \c testdb) confirms that the target database itself is reachable and ready to accept queries.
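For example, to verify a specific database in a single step instead of switching with \c after login (testdb here is just a placeholder name, as in the description above):
psql -p 5432 -d testdb -c "SELECT 1;"
A successful SELECT 1 round trip confirms the cluster accepts connections and can execute queries against that database.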
Installed PostgreSQL Versions
apt list --installed | grep postgresql
Output:
postgresql-14/noble-pgdg,now 14.19-1.pgdg24.04+1 amd64 [installed]
postgresql-16/noble-pgdg,now 16.10-1.pgdg24.04+1 amd64 [installed,automatic]
postgresql-18/noble-pgdg,now 18.0-1.pgdg24.04+3 amd64 [installed]
...
Description: Lists all PostgreSQL packages installed on the OS. Useful for verifying which versions are present, checking client/server compatibility, and spotting leftover installations from earlier upgrades.
Cluster Status
pg_lsclusters
Output:
Ver Cluster Port Status Owner Data directory Log file
14 main 5432 online postgres /var/lib/postgresql/14/main log/postgresql-%Y-%m-%d_%H%M%S.log
18 main 5433 down postgres /var/lib/postgresql/18/main /var/log/postgresql/postgresql-18-main.log
Description: Displays all PostgreSQL clusters known to the system (via postgresql-common on Debian/Ubuntu) with their version, port, status, data directory, and log file, so you know which servers are online and ready for queries.
PostgreSQL Version Check
psql -p 5432
postgres=# SELECT version();
Output:
PostgreSQL 14.19 on x86_64-pc-linux-gnu
Description: Confirms the running PostgreSQL version, critical for monitoring and tuning.
Memory and Swap Usage
free -h
Output:
root@ip-172-31-**-***:/home/ubuntu# free -h
total used free shared buff/cache available
Mem: 914Mi 515Mi 76Mi 143Mi 631Mi 399Mi
Swap: 0B 0B 0B
root@ip-172-31-**-***:/home/ubuntu#
Description: Memory is one of the most critical factors affecting PostgreSQL stability. This command shows total, used, free, and cached memory, giving you a quick overview of your system’s memory health. If Swap total is 0B, it means swap space is disabled, which can be risky in production environments.
When swap is disabled and memory is exhausted, the kernel's OOM killer may terminate PostgreSQL backends, or even the postmaster itself, causing unexpected crashes and “out of memory” errors. When swap is enabled, the system can temporarily use disk as additional memory: it will not crash immediately, but performance degrades steadily as swap usage grows.
Some system administrators prefer to disable swap to maintain peak performance, while others enable it as a safety net to prevent sudden downtime. This is a trade-off: disabling swap improves speed but removes the system’s fallback mechanism; enabling it provides stability at the cost of slower response times.
In short, this check is one of the most important PostgreSQL health indicators to ensure long-term stability and predictable performance under load.
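If you decide swap is the right trade-off for your environment, a swap file can be added without repartitioning. A minimal sketch, assuming a 2 GB file at /swapfile (both are example values, adjust for your server):
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab    # persist across reboots
Many administrators also lower vm.swappiness (for example, sysctl -w vm.swappiness=1) so the kernel only swaps under genuine memory pressure rather than proactively.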
CPU and Load Average
top
top - 08:58:19 up 30 days, 21:46, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 120 total, 1 running, 119 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.2 sy, 0.0 ni, 99.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 914.2 total, 83.5 free, 507.3 used, 631.7 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 406.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
280131 postgres 20 0 223164 18752 16320 S 0.3 2.0 0:13.91 postgres
1 root 20 0 22800 11576 7096 S 0.0 1.2 1:20.21 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.46 kthreadd
Description: The top command provides a real-time overview of your system’s CPU load, memory usage, and running processes. It is one of the most valuable tools for monitoring PostgreSQL performance at the operating system level.
The load average values (0.00, 0.00, 0.00) represent the average number of runnable or waiting processes over the past 1, 5, and 15 minutes. These values reflect how busy your CPU(s) have been:
A load average of 1.0 means one fully utilized CPU core.
If you have 4 CPU cores and your 1-minute load average exceeds 4.0, it indicates CPU saturation — your system is overloaded.
Sustained load averages higher than your CPU count typically lead to slow query execution, delayed background processes, or even replication lag in PostgreSQL.
The task summary line (Tasks: 120 total, 1 running, 119 sleeping, 0 stopped, 0 zombie) gives you a snapshot of process states:
Running: Actively executing on the CPU.
Sleeping: Waiting for resources such as I/O or network data.
Stopped: Paused by the system or user (e.g., via signals like SIGSTOP).
Zombie: Terminated processes that haven’t been fully cleaned up; frequent zombies may indicate a misbehaving application or process management issue.
The %Cpu(s) section breaks down CPU time distribution:
us → user space (PostgreSQL queries and application logic)
sy → system/kernel time
id → idle time (free CPU)
wa → waiting for I/O (disk or network bottleneck indicator)
st → stolen time (CPU resources taken by hypervisor in virtualized systems)
The lower portion of the output lists active processes, including their CPU (%) and memory (%) usage. This allows you to quickly identify which PostgreSQL backends or OS processes are consuming the most system resources.
Monitoring top regularly helps you:
Detect runaway queries or processes hogging CPU or RAM.
Spot performance degradation due to CPU saturation or I/O wait.
Understand the overall system load trend in correlation with PostgreSQL activity.
For a more interactive and visually friendly experience, you can use htop, which provides color-coded metrics, tree views, and easier process navigation.
In summary, the top command is essential for every PostgreSQL health check — it helps correlate database performance with real-time system resource utilization and quickly identify bottlenecks before they cause downtime.
Check Disk IO
iotop
Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE> COMMAND
1 be/4 root 0.00 B/s 0.00 B/s systemd --system --deserialize=69
2 be/4 root 0.00 B/s 0.00 B/s [kthreadd]
3 be/4 root 0.00 B/s 0.00 B/s [pool_workqueue_release]
4 be/0 root 0.00 B/s 0.00 B/s [kworker/R-rcu_gp]
5 be/0 root 0.00 B/s 0.00 B/s [kworker/R-sync_wq]
6 be/0 root 0.00 B/s 0.00 B/s [kworker/R-kvfree_rcu_reclaim]
7 be/0 root 0.00 B/s 0.00 B/s [kworker/R-slub_flushwq]
8 be/0 root 0.00 B/s 0.00 B/s [kworker/R-netns]
11 be/0 root 0.00 B/s 0.00 B/s [kworker/0:0H-events_highpri]
Description: Identifies which processes consume the most disk I/O. Sustained heavy I/O from other processes competes with PostgreSQL for disk bandwidth and can slow query performance.
Check Disk Space
df -h
root@ip-172-31-**-***:/home/ubuntu# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 6.8G 6.0G 809M 89% /
tmpfs 458M 28K 458M 1% /dev/shm
tmpfs 183M 1.5M 182M 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
efivarfs 128K 3.8K 120K 4% /sys/firmware/efi/efivars
/dev/nvme0n1p16 881M 155M 665M 19% /boot
/dev/nvme0n1p15 105M 6.2M 99M 6% /boot/efi
tmpfs 92M 12K 92M 1% /run/user/1000
root@ip-172-31-**-***:/home/ubuntu#
Description: Monitors disk usage. Keep usage well below 90%: if the filesystem holding the data directory or WAL fills up completely, PostgreSQL stops accepting writes and can shut down with a PANIC.
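When the filesystem is filling up, it helps to see how much of that space PostgreSQL itself occupies. A quick sketch using the built-in size functions, plus a du of the data directory shown by pg_lsclusters above:
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
du -sh /var/lib/postgresql/14/main
Remember that WAL, temporary files, and logs also live under the data directory (unless relocated), so the sum of database sizes is usually smaller than what du reports.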
Check CPU Cores
nproc
root@ip-172-31-**-***:/home/ubuntu# nproc
2
root@ip-172-31-**-***:/home/ubuntu#
Description: Displays the number of available CPU cores. Comparing with load average helps determine if your system is overutilized. For detailed CPU info, use lscpu.
PostgreSQL Log Directory
postgres=# show log_directory;
log_directory
---------------
log
(1 row)
postgres=# show data_directory;
data_directory
-----------------------------
/var/lib/postgresql/14/main
(1 row)
postgres=# show logging_collector;
logging_collector
-------------------
on
(1 row)
To tail the current log file: tail -100f /var/lib/postgresql/14/main/log/postgresql-2025-10-20_000000.log
Description: Determines where PostgreSQL logs are stored. If logging_collector is off, logs may not be centralized, making troubleshooting harder. Always check for ERROR, FATAL, and PANIC entries.
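To scan a log file for problem entries instead of tailing it live, a simple grep is enough (the path follows the data_directory and log_directory values shown above):
grep -cE "ERROR|FATAL|PANIC" /var/lib/postgresql/14/main/log/postgresql-2025-10-20_000000.log
grep -E "FATAL|PANIC" /var/lib/postgresql/14/main/log/postgresql-2025-10-20_000000.log | tail -20
The first command counts problem lines; the second shows the most recent fatal entries.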
Check HugePages Configuration
grep -i huge /proc/meminfo
root@ip-172-31-**-***:/home/ubuntu# grep -i huge /proc/meminfo
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
root@ip-172-31-**-***:/home/ubuntu#
Description: HugePages can significantly improve performance for large-memory workloads. Zero values indicate HugePages are not configured, which might be optimal for smaller instances but could impact performance on memory-intensive systems.
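If you decide to enable HugePages, PostgreSQL can tell you how many pages it needs. A sketch, assuming PostgreSQL 15 or newer where the read-only parameter shared_memory_size_in_huge_pages is available (on older versions, estimate the count from shared_buffers divided by Hugepagesize):
postgres=# SHOW shared_memory_size_in_huge_pages;
# reserve that many pages at the OS level, then set huge_pages = 'on' (or keep the default 'try') in postgresql.conf:
sudo sysctl -w vm.nr_hugepages=200    # example count; use the value reported by SHOW above
Note that the cluster in this article runs version 14, so the SHOW command above only applies after an upgrade.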
Identify Disk Type
lsblk -d -o name,rota,type
root@ip-172-31-**-***:/var/lib/postgresql/14/main/log# lsblk -d -o name,rota,type
NAME ROTA TYPE
loop0 0 loop
loop1 0 loop
loop3 0 loop
loop4 0 loop
loop5 0 loop
loop6 0 loop
nvme0n1 0 disk
root@ip-172-31-**-***:/var/lib/postgresql/14/main/log#
Description: Determines if your storage is SSD (ROTA: 0) or HDD (ROTA: 1). SSDs provide significantly faster database performance, especially for I/O-intensive workloads.
Check Listening Ports
netstat -tulpn | grep LISTEN
root@ip-172-31-**-***:/var/lib/postgresql/14/main/log# netstat -tulpn | grep LISTEN
tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN 266459/systemd-reso
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1/systemd
tcp 0 0 127.0.0.54:53 0.0.0.0:* LISTEN 266459/systemd-reso
tcp 0 0 0.0.0.0:5432 0.0.0.0:* LISTEN 280131/postgres
tcp6 0 0 :::22 :::* LISTEN 1/systemd
tcp6 0 0 :::5432 :::* LISTEN 280131/postgres
root@ip-172-31-**-***:/var/lib/postgresql/14/main/log#
Description: Verifies PostgreSQL is listening on expected ports and identifies all network services. Essential for connectivity troubleshooting and security auditing.
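netstat belongs to the legacy net-tools package and may not be installed on newer distributions; ss reports the same information:
ss -tlnp | grep 5432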
Check System Logs
tail -n 20 /var/log/syslog
Description: Kernel and system logs reveal OS-level issues that could affect database performance, such as hardware failures, memory issues, or filesystem errors.
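On systemd-based systems the journal can be queried directly, and it is worth tying this back to the memory discussion above by checking specifically for OOM-killer activity. A sketch (the time window is an arbitrary example):
journalctl -k --since "1 hour ago" | tail -50
dmesg -T | grep -iE "out of memory|oom"
Any OOM-killer hit against a postgres process is a strong signal that memory settings (shared_buffers, work_mem, max_connections) or swap configuration need attention.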
PostgreSQL DB Checks
Active, Idle & Max Connections
SELECT
datname AS database,
COUNT(*) FILTER (WHERE state = 'active') AS active_connections,
COUNT(*) FILTER (WHERE state = 'idle') AS idle_connections,
COUNT(*) AS total_connections,
(SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections
FROM pg_stat_activity
WHERE datname IS NOT NULL
GROUP BY datname
ORDER BY total_connections DESC;
Description: This query shows the number of active, idle, and total connections per database, along with the configured max_connections limit. Understanding this metric is crucial because one of the most common causes of PostgreSQL outages is hitting the max_connections limit — once that happens, new sessions cannot be established, and the database becomes unresponsive until existing connections are released.
It’s important to note that max_connections is a global parameter, shared across the entire PostgreSQL instance — not per database. For example, if your max_connections value is 100, and you have three databases running under the same cluster, their total combined connections cannot exceed 100. If all connections are used (even idle ones), no application or user can connect until some sessions are closed or terminated.
In systems managed by Patroni, the default max_connections is typically set to 1000, which provides a much larger buffer and reduces the likelihood of immediate saturation. However, a high limit is not always safer — if your application opens too many idle connections or lacks proper connection pooling, you may run into resource exhaustion (memory pressure, high CPU usage, or context-switching overhead).
That’s why it’s best practice to:
Monitor active and idle connections closely using this query.
Use a connection pooler such as PgBouncer to manage and recycle sessions efficiently.
Keep an eye on application behavior — many performance issues come not from PostgreSQL itself, but from the application holding unnecessary idle sessions.
In summary, regularly checking connection usage ensures your PostgreSQL instance remains stable and responsive, especially under high load or sudden traffic spikes.
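As a follow-up, the sketch below lists sessions that have been idle for a long time, which are frequent offenders when connection slots run out. The 10-minute threshold is an arbitrary example:
SELECT pid, usename, datname, application_name, client_addr, state_change
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '10 minutes'
ORDER BY state_change;
If the application cannot be fixed quickly, the idle_in_transaction_session_timeout and idle_session_timeout parameters (the latter available from PostgreSQL 14) can cap how long such sessions linger.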
Long Running Queries
SELECT
pid, usename, datname, client_addr, state,
now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;
Description: Database performance issues often originate from long-running queries. This query helps you quickly identify any sessions that have been active for an unusually long time.
When investigating database slowness or lock contention, this is one of the first places you should check — long-running queries can block resources, hold locks, or consume excessive CPU and memory.
If you find a query that’s not critical (e.g., an ad-hoc report or background analytics task) and it’s causing operational impact, you can use a temporary workaround by terminating the session using pg_terminate_backend(pid). However, this should only be a short-term measure — you must still determine why the query was slow.
Use EXPLAIN ANALYZE to inspect the execution plan and identify the bottleneck (such as missing indexes, poor join order, or inefficient I/O). For deeper visibility, review historical query statistics in the pg_stat_statements view — there, you can analyze min, max, mean, and total execution times to understand performance trends over time.
As a general best practice:
On operational (OLTP) databases, aim for all queries to complete in under one second whenever possible.
For reporting or analytical workloads, avoid running them directly on your primary or standby nodes. Standby replicas are designed primarily for high availability and failover, not for heavy reporting. Instead, use logical replication to maintain a separate reporting instance dedicated to analytics — this isolates long-running queries from your production workload and protects overall stability.
In short, monitoring long-running queries is essential for maintaining database responsiveness, reducing lock contention, and ensuring consistent performance across your PostgreSQL environment.
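For reference, the workaround and follow-up analysis described above look like this (12345 is a placeholder pid, and the final SELECT is a stand-in for your actual slow statement):
SELECT pg_cancel_backend(12345);      -- cancel just the running statement first
SELECT pg_terminate_backend(12345);   -- terminate the whole session only if cancelling is not enough
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM pg_class;   -- substitute the slow query here
pg_cancel_backend is the gentler option because it leaves the session connected; reserve pg_terminate_backend for sessions that ignore cancellation.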
Blocking Queries
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocked_activity.query AS blocked_query,
now() - blocked_activity.query_start AS blocked_duration,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocking_activity.query AS blocking_query,
now() - blocking_activity.query_start AS blocking_duration
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked_activity
ON blocked_activity.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
JOIN pg_stat_activity blocking_activity
ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
Description: Shows queries blocked by other queries. Blocking can severely degrade performance. Essential for troubleshooting database locks.
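On PostgreSQL 9.6 and later, the built-in pg_blocking_pids() function offers a shorter view of the same situation; a sketch:
SELECT pid,
       usename,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
The blocked_by column lists the PIDs holding the locks, which you can then look up in pg_stat_activity to decide whether to wait, cancel, or terminate them.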
Top Queries by Execution Time
SELECT
ROUND((total_exec_time * 100 / SUM(total_exec_time) OVER ())::numeric, 2) AS cpu_percent,
calls, ROUND(mean_exec_time::numeric, 2) AS mean_ms,
ROUND(min_exec_time::numeric, 2) AS min_ms,
ROUND(max_exec_time::numeric, 2) AS max_ms,
ROUND(total_exec_time::numeric, 2) AS total_ms,
query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
Description: This query lists the most resource-intensive SQL statements in your PostgreSQL instance based on total execution time. It provides an overview of where the database spends most of its CPU and I/O resources — helping you prioritize tuning efforts effectively.
The cpu_percent column shows how much of the total execution time across all queries is consumed by each individual statement. Queries with a high CPU percentage are the primary candidates for optimization since they represent the largest share of total workload.
Focus first on the top few queries that dominate total execution time — improving even one of these can dramatically reduce CPU load, I/O pressure, and latency across the system.
The mean_ms, min_ms, and max_ms columns reveal how consistent or variable query performance is, helping you spot queries that sometimes run fast but occasionally degrade severely (often due to caching, parameter sensitivity, or plan instability).
Keep in mind that pg_stat_statements aggregates data since the last reset or database restart — it doesn’t provide historical or time-series trends by default. If you need long-term visibility, consider exporting this data periodically or integrating it with a monitoring tool like pghealth, which tracks query performance over time and helps correlate performance regressions with schema or configuration changes.
By regularly reviewing the top resource-consuming queries and understanding their CPU share, you ensure PostgreSQL runs efficiently, reduce contention, and prevent bottlenecks from silently building up.
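Note that pg_stat_statements is an extension and must be enabled before this query returns anything. A minimal setup sketch (changing shared_preload_libraries requires a cluster restart):
-- 1) In postgresql.conf (restart required):
--    shared_preload_libraries = 'pg_stat_statements'
-- 2) Then, in the target database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- 3) Optional: clear counters to start a fresh measurement window
SELECT pg_stat_statements_reset();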
Checkpoints
SELECT
checkpoints_timed,
checkpoints_req,
CASE
WHEN (checkpoints_timed + checkpoints_req) > 0 THEN
ROUND(100.0 * checkpoints_timed / (checkpoints_timed + checkpoints_req), 2)
ELSE 0
END AS timed_checkpoint_ratio_percent
FROM pg_stat_bgwriter;
Description: Shows whether checkpoints are mostly triggered on schedule (checkpoint_timeout) rather than forced by WAL volume. A timed ratio above roughly 90% is generally healthy; a high share of requested checkpoints usually means max_wal_size is too small, causing I/O spikes and extra full-page writes.
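On PostgreSQL 17 and newer these counters moved from pg_stat_bgwriter to pg_stat_checkpointer, so the equivalent check looks like this (a sketch for newer versions; the query above matches the version 14 cluster used in this article):
SELECT
    num_timed,
    num_requested,
    CASE
        WHEN (num_timed + num_requested) > 0 THEN
            ROUND(100.0 * num_timed / (num_timed + num_requested), 2)
        ELSE 0
    END AS timed_checkpoint_ratio_percent
FROM pg_stat_checkpointer;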
Table Cache Hit Ratios
SELECT
schemaname,
relname AS tablename,
heap_blks_read,
heap_blks_hit,
CASE
WHEN (heap_blks_read + heap_blks_hit) > 0 THEN
ROUND(100.0 * heap_blks_hit / (heap_blks_read + heap_blks_hit), 2)
ELSE 100
END AS cache_hit_ratio_percent
FROM pg_statio_user_tables
ORDER BY cache_hit_ratio_percent ASC;
Description: Shows how often table reads are served from cache. Higher cache hit ratios improve performance and reduce disk I/O.
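For an instance-wide view, the same ratio can be computed per database from pg_stat_database; a sketch:
SELECT
    datname,
    blks_read,
    blks_hit,
    CASE
        WHEN (blks_read + blks_hit) > 0 THEN
            ROUND(100.0 * blks_hit / (blks_read + blks_hit), 2)
        ELSE 100
    END AS cache_hit_ratio_percent
FROM pg_stat_database
WHERE datname IS NOT NULL
ORDER BY cache_hit_ratio_percent ASC;
On a typical OLTP workload, ratios persistently below roughly 99% often point to an undersized shared_buffers or a working set larger than available RAM.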
Replication Slots & Lag
SELECT slot_name, plugin, slot_type, database, active, active_pid, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE active = false;
SELECT
client_addr, application_name, state, sync_state,
write_lag, flush_lag, replay_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag_bytes,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS sent_lag_pretty
FROM pg_stat_replication;
Description: Replication slots are essential for ensuring reliable streaming replication and logical replication in PostgreSQL. However, unused or inactive replication slots can cause serious issues if left unchecked.
When a replication slot remains inactive (i.e., the replica is gone or disconnected), PostgreSQL retains all Write-Ahead Logs (WALs) needed for that slot indefinitely. This leads to unbounded WAL accumulation, which can quickly consume disk space and eventually bring the system down. In a healthy setup, there should be no inactive slots. If you see any, investigate immediately — remove or drop them if the associated replica no longer exists.
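Dropping an abandoned slot is a single call once you have confirmed the replica is truly gone (the slot name below is a placeholder; never drop a slot that an existing replica still needs):
SELECT pg_drop_replication_slot('old_replica_slot');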
The second query monitors replication lag across connected replicas. The write_lag, flush_lag, and replay_lag columns indicate how far behind each standby is in applying changes from the primary. Additionally, the calculated sent_lag_bytes helps quantify lag in bytes, giving a clear picture of data freshness.
Replication lag directly impacts high availability (HA) and failover readiness. Large lags mean your replicas are out of sync, which can cause data loss if a failover occurs. Depending on the replication mode (sync_state), this lag can also create operational overhead:
In synchronous replication, lag can delay commits on the primary because it waits for the standby to confirm receipt.
In asynchronous replication, lag does not block commits but can increase recovery time and risk data inconsistency during failover.
Regularly monitoring both inactive slots and replication lag ensures that your replication topology remains healthy, efficient, and aligned with your HA objectives. For critical systems, integrate these checks into automated health alerts — for example, tools like pghealth visualize replication lag trends and detect inactive slots early to prevent WAL bloat and downtime.
Author: Fırat Güleç — Principal PostgreSQL DBA
Try pghealth Free Today 🚀
Start your journey toward a healthier PostgreSQL with pghealth.
You can explore all features immediately with a free trial — no installation required.

