Learn how to perform a complete PostgreSQL Health Check (2025) using SQL queries and OS-level monitoring. Detect slow queries, replication lag, configuration issues, and performance bottlenecks to keep your PostgreSQL database stable and optimized.
OS Level Checks
Check if PostgreSQL is running
postgres@ip-172-31-**-***:/var/log$ psql -p 5432
Output:
psql (18.0, server 14.19)
Type "help" for help.
postgres=#
Description: This command confirms whether PostgreSQL is running and accepting connections on the given port. Switching to a specific database afterwards (e.g., with \c testdb) confirms that the target database itself is reachable and ready to accept queries.
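For example, to verify a specific database in a single step instead of switching with \c after login (testdb here is just a placeholder name, as in the description above):
psql -p 5432 -d testdb -c "SELECT 1;"
A successful SELECT 1 round trip confirms the cluster accepts connections and can execute queries against that database.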
Installed PostgreSQL Versions
apt list --installed | grep postgresql
Output:
postgresql-14/noble-pgdg,now 14.19-1.pgdg24.04+1 amd64 [installed]
postgresql-16/noble-pgdg,now 16.10-1.pgdg24.04+1 amd64 [installed,automatic]
postgresql-18/noble-pgdg,now 18.0-1.pgdg24.04+3 amd64 [installed]
...
Description: Lists all PostgreSQL packages installed on the OS. Useful for verifying which versions are present, checking client/server compatibility, and spotting leftover installations from earlier upgrades.
Cluster Status
pg_lsclusters
Output:
Ver Cluster Port Status Owner Data directory Log file
14 main 5432 online postgres /var/lib/postgresql/14/main log/postgresql-%Y-%m-%d_%H%M%S.log
18 main 5433 down postgres /var/lib/postgresql/18/main /var/log/postgresql/postgresql-18-main.log
Description: Displays all PostgreSQL clusters known to the system (via postgresql-common on Debian/Ubuntu) with their version, port, status, data directory, and log file, so you know which servers are online and ready for queries.
PostgreSQL Version Check
psql -p 5432
postgres=# SELECT version();
Output:
PostgreSQL 14.19 on x86_64-pc-linux-gnu
Description: Confirms the running PostgreSQL version, critical for monitoring and tuning.
Memory and Swap Usage
free -h
Output:
root@ip-172-31-**-***:/home/ubuntu# free -h
total used free shared buff/cache available
Mem: 914Mi 515Mi 76Mi 143Mi 631Mi 399Mi
Swap: 0B 0B 0B
root@ip-172-31-**-***:/home/ubuntu#
Description: Memory is one of the most critical factors affecting PostgreSQL stability. This command shows total, used, free, and cached memory, giving you a quick overview of your system’s memory health. If Swap total is 0B, it means swap space is disabled, which can be risky in production environments.
When swap is disabled and memory is exhausted, the kernel's OOM killer may terminate PostgreSQL backends, or even the postmaster itself, causing unexpected crashes and “out of memory” errors. When swap is enabled, the system can temporarily use disk as additional memory: it will not crash immediately, but performance degrades steadily as swap usage grows.
Some system administrators prefer to disable swap to maintain peak performance, while others enable it as a safety net to prevent sudden downtime. This is a trade-off: disabling swap improves speed but removes the system’s fallback mechanism; enabling it provides stability at the cost of slower response times.
In short, this check is one of the most important PostgreSQL health indicators to ensure long-term stability and predictable performance under load.
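If you decide swap is the right trade-off for your environment, a swap file can be added without repartitioning. A minimal sketch, assuming a 2 GB file at /swapfile (both are example values, adjust for your server):
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab    # persist across reboots
Many administrators also lower vm.swappiness (for example, sysctl -w vm.swappiness=1) so the kernel only swaps under genuine memory pressure rather than proactively.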
CPU and Load Average
top
top - 08:58:19 up 30 days, 21:46, 1 user, load average: 0.00, 0.00, 0.00
Tasks: 120 total, 1 running, 119 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.2 sy, 0.0 ni, 99.8 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 914.2 total, 83.5 free, 507.3 used, 631.7 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 406.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
280131 postgres 20 0 223164 18752 16320 S 0.3 2.0 0:13.91 postgres
1 root 20 0 22800 11576 7096 S 0.0 1.2 1:20.21 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.46 kthreadd
Description: The top command provides a real-time overview of your system’s CPU load, memory usage, and running processes. It is one of the most valuable tools for monitoring PostgreSQL performance at the operating system level.
The load average values (0.00, 0.00, 0.00) represent the average number of runnable or waiting processes over the past 1, 5, and 15 minutes. These values reflect how busy your CPU(s) have been:
A load average of 1.0 means one fully utilized CPU core.
If you have 4 CPU cores and your 1-minute load average exceeds 4.0, it indicates CPU saturation — your system is overloaded.
Sustained load averages higher than your CPU count typically lead to slow query execution, delayed background processes, or even replication lag in PostgreSQL.
The task summary line (Tasks: 120 total, 1 running, 119 sleeping, 0 stopped, 0 zombie) gives you a snapshot of process states:
Running: Actively executing on the CPU.
Sleeping: Waiting for resources such as I/O or network data.
Stopped: Paused by the system or user (e.g., via signals like SIGSTOP).
Zombie: Terminated processes that haven’t been fully cleaned up; frequent zombies may indicate a misbehaving application or process management issue.
The %Cpu(s) section breaks down CPU time distribution:
us → user space (PostgreSQL queries and application logic)
sy → system/kernel time
id → idle time (free CPU)
wa → waiting for I/O (disk or network bottleneck indicator)
st → stolen time (CPU resources taken by hypervisor in virtualized systems)
The lower portion of the output lists active processes, including their CPU (%) and memory (%) usage. This allows you to quickly identify which PostgreSQL backends or OS processes are consuming the most system resources.
Monitoring top regularly helps you:
Detect runaway queries or processes hogging CPU or RAM.
Spot performance degradation due to CPU saturation or I/O wait.
Understand the overall system load trend in correlation with PostgreSQL activity.
For a more interactive and visually friendly experience, you can use htop, which provides color-coded metrics, tree views, and easier process navigation.
In summary, the top command is essential for every PostgreSQL health check — it helps correlate database performance with real-time system resource utilization and quickly identify bottlenecks before they cause downtime.
Check Disk IO
iotop
Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE> COMMAND
1 be/4 root 0.00 B/s 0.00 B/s systemd --system --deserialize=69
2 be/4 root 0.00 B/s 0.00 B/s [kthreadd]
3 be/4 root 0.00 B/s 0.00 B/s [pool_workqueue_release]
4 be/0 root 0.00 B/s 0.00 B/s [kworker/R-rcu_gp]
5 be/0 root 0.00 B/s 0.00 B/s [kworker/R-sync_wq]
6 be/0 root 0.00 B/s 0.00 B/s [kworker/R-kvfree_rcu_reclaim]
7 be/0 root 0.00 B/s 0.00 B/s [kworker/R-slub_flushwq]
8 be/0 root 0.00 B/s 0.00 B/s [kworker/R-netns]
11 be/0 root 0.00 B/s 0.00 B/s [kworker/0:0H-events_highpri]
Description: Identifies which processes consume the most disk I/O. Sustained heavy I/O from other processes competes with PostgreSQL for disk bandwidth and can slow query performance.
Check Disk Space
df -h
root@ip-172-31-**-***:/home/ubuntu# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 6.8G 6.0G 809M 89% /
tmpfs 458M 28K 458M 1% /dev/shm
tmpfs 183M 1.5M 182M 1% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
efivarfs 128K 3.8K 120K 4% /sys/firmware/efi/efivars
/dev/nvme0n1p16 881M 155M 665M 19% /boot
/dev/nvme0n1p15 105M 6.2M 99M 6% /boot/efi
tmpfs 92M 12K 92M 1% /run/user/1000
root@ip-172-31-**-***:/home/ubuntu#
Description: Monitors disk usage. Keep usage well below 90%: if the filesystem holding the data directory or WAL fills up completely, PostgreSQL stops accepting writes and can shut down with a PANIC.
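When the filesystem is filling up, it helps to see how much of that space PostgreSQL itself occupies. A quick sketch using the built-in size functions, plus a du of the data directory shown by pg_lsclusters above:
SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
du -sh /var/lib/postgresql/14/main
Remember that WAL, temporary files, and logs also live under the data directory (unless relocated), so the sum of database sizes is usually smaller than what du reports.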
Check CPU Cores
nproc
root@ip-172-31-**-***:/home/ubuntu# nproc
2
root@ip-172-31-**-***:/home/ubuntu#
Description: Displays the number of available CPU cores. Comparing with load average helps determine if your system is overutilized. For detailed CPU info, use lscpu.
PostgreSQL Log Directory
postgres=# show log_directory;
log_directory
---------------
log
(1 row)
postgres=# show data_directory;
data_directory
-----------------------------
/var/lib/postgresql/14/main
(1 row)
postgres=# show logging_collector;
logging_collector
-------------------
on
(1 row)
To tail the current log file: tail -100f /var/lib/postgresql/14/main/log/postgresql-2025-10-20_000000.log
Description: Determines where PostgreSQL logs are stored. If logging_collector is off, logs may not be centralized, making troubleshooting harder. Always check for ERROR, FATAL, and PANIC entries.
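To scan a log file for problem entries instead of tailing it live, a simple grep is enough (the path follows the data_directory and log_directory values shown above):
grep -cE "ERROR|FATAL|PANIC" /var/lib/postgresql/14/main/log/postgresql-2025-10-20_000000.log
grep -E "FATAL|PANIC" /var/lib/postgresql/14/main/log/postgresql-2025-10-20_000000.log | tail -20
The first command counts problem lines; the second shows the most recent fatal entries.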
Check HugePages Configuration
grep -i huge /proc/meminfo
root@ip-172-31-**-***:/home/ubuntu# grep -i huge /proc/meminfo
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 0 kB
root@ip-172-31-**-***:/home/ubuntu#
Description: HugePages can significantly improve performance for large-memory workloads. Zero values indicate HugePages are not configured, which might be optimal for smaller instances but could impact performance on memory-intensive systems.
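If you decide to enable HugePages, PostgreSQL can tell you how many pages it needs. A sketch, assuming PostgreSQL 15 or newer where the read-only parameter shared_memory_size_in_huge_pages is available (on older versions, estimate the count from shared_buffers divided by Hugepagesize):
postgres=# SHOW shared_memory_size_in_huge_pages;
# reserve that many pages at the OS level, then set huge_pages = 'on' (or keep the default 'try') in postgresql.conf:
sudo sysctl -w vm.nr_hugepages=200    # example count; use the value reported by SHOW above
Note that the cluster in this article runs version 14, so the SHOW command above only applies after an upgrade.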
Identify Disk Type
lsblk -d -o name,rota,type
root@ip-172-31-**-***:/var/lib/postgresql/14/main/log# lsblk -d -o name,rota,type
NAME ROTA TYPE
loop0 0 loop
loop1 0 loop
loop3 0 loop
loop4 0 loop
loop5 0 loop
loop6 0 loop
nvme0n1 0 disk
root@ip-172-31-**-***:/var/lib/postgresql/14/main/log#
Description: Determines if your storage is SSD (ROTA: 0) or HDD (ROTA: 1). SSDs provide significantly faster database performance, especially for I/O-intensive workloads.
Check Listening Ports
netstat -tulpn | grep LISTEN
root@ip-172-31-**-***:/var/lib/postgresql/14/main/log# netstat -tulpn | grep LISTEN
tcp 0 0 127.0.0.53:53 0.0.0.0:* LISTEN 266459/systemd-reso
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN 1/systemd
tcp 0 0 127.0.0.54:53 0.0.0.0:* LISTEN 266459/systemd-reso
tcp 0 0 0.0.0.0:5432 0.0.0.0:* LISTEN 280131/postgres
tcp6 0 0 :::22 :::* LISTEN 1/systemd
tcp6 0 0 :::5432 :::* LISTEN 280131/postgres
root@ip-172-31-**-***:/var/lib/postgresql/14/main/log#
Description: Verifies PostgreSQL is listening on expected ports and identifies all network services. Essential for connectivity troubleshooting and security auditing.
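netstat belongs to the legacy net-tools package and may not be installed on newer distributions; ss reports the same information:
ss -tlnp | grep 5432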
Check System Logs
tail -n 20 /var/log/syslog
Description: Kernel and system logs reveal OS-level issues that could affect database performance, such as hardware failures, memory issues, or filesystem errors.
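On systemd-based systems the journal can be queried directly, and it is worth tying this back to the memory discussion above by checking specifically for OOM-killer activity. A sketch (the time window is an arbitrary example):
journalctl -k --since "1 hour ago" | tail -50
dmesg -T | grep -iE "out of memory|oom"
Any OOM-killer hit against a postgres process is a strong signal that memory settings (shared_buffers, work_mem, max_connections) or swap configuration need attention.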
PostgreSQL DB Checks
Active, Idle & Max Connections
SELECT
datname AS database,
COUNT(*) FILTER (WHERE state = 'active') AS active_connections,
COUNT(*) FILTER (WHERE state = 'idle') AS idle_connections,
COUNT(*) AS total_connections,
(SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections
FROM pg_stat_activity
WHERE datname IS NOT NULL
GROUP BY datname
ORDER BY total_connections DESC;
Description: This query shows the number of active, idle, and total connections per database, along with the configured max_connections limit. Understanding this metric is crucial because one of the most common causes of PostgreSQL outages is hitting the max_connections limit — once that happens, new sessions cannot be established, and the database becomes unresponsive until existing connections are released.
It’s important to note that max_connections is a global parameter, shared across the entire PostgreSQL instance — not per database. For example, if your max_connections value is 100, and you have three databases running under the same cluster, their total combined connections cannot exceed 100. If all connections are used (even idle ones), no application or user can connect until some sessions are closed or terminated.
In systems managed by Patroni, the default max_connections is typically set to 1000, which provides a much larger buffer and reduces the likelihood of immediate saturation. However, a high limit is not always safer — if your application opens too many idle connections or lacks proper connection pooling, you may run into resource exhaustion (memory pressure, high CPU usage, or context-switching overhead).
That’s why it’s best practice to:
Monitor active and idle connections closely using this query.
Use a connection pooler such as PgBouncer to manage and recycle sessions efficiently.
Keep an eye on application behavior — many performance issues come not from PostgreSQL itself, but from the application holding unnecessary idle sessions.
In summary, regularly checking connection usage ensures your PostgreSQL instance remains stable and responsive, especially under high load or sudden traffic spikes.
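As a follow-up, the sketch below lists sessions that have been idle for a long time, which are frequent offenders when connection slots run out. The 10-minute threshold is an arbitrary example:
SELECT pid, usename, datname, application_name, client_addr, state_change
FROM pg_stat_activity
WHERE state = 'idle'
  AND state_change < now() - interval '10 minutes'
ORDER BY state_change;
If the application cannot be fixed quickly, the idle_in_transaction_session_timeout and idle_session_timeout parameters (the latter available from PostgreSQL 14) can cap how long such sessions linger.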
Long Running Queries
SELECT
pid, usename, datname, client_addr, state,
now() - query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;
Description: Database performance issues often originate from long-running queries. This query helps you quickly identify any sessions that have been active for an unusually long time.
When investigating database slowness or lock contention, this is one of the first places you should check — long-running queries can block resources, hold locks, or consume excessive CPU and memory.
If you find a query that’s not critical (e.g., an ad-hoc report or background analytics task) and it’s causing operational impact, you can use a temporary workaround by terminating the session using pg_terminate_backend(pid). However, this should only be a short-term measure — you must still determine why the query was slow.
Use EXPLAIN ANALYZE to inspect the execution plan and identify the bottleneck (such as missing indexes, poor join order, or inefficient I/O). For deeper visibility, review historical query statistics in the pg_stat_statements view — there, you can analyze min, max, mean, and total execution times to understand performance trends over time.
As a general best practice:
On operational (OLTP) databases, aim for all queries to complete in under one second whenever possible.
For reporting or analytical workloads, avoid running them directly on your primary or standby nodes. Standby replicas are designed primarily for high availability and failover, not for heavy reporting. Instead, use logical replication to maintain a separate reporting instance dedicated to analytics — this isolates long-running queries from your production workload and protects overall stability.
In short, monitoring long-running queries is essential for maintaining database responsiveness, reducing lock contention, and ensuring consistent performance across your PostgreSQL environment.
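For reference, the workaround and follow-up analysis described above look like this (12345 is a placeholder pid, and the final SELECT is a stand-in for your actual slow statement):
SELECT pg_cancel_backend(12345);      -- cancel just the running statement first
SELECT pg_terminate_backend(12345);   -- terminate the whole session only if cancelling is not enough
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM pg_class;   -- substitute the slow query here
pg_cancel_backend is the gentler option because it leaves the session connected; reserve pg_terminate_backend for sessions that ignore cancellation.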
Blocking Queries
SELECT
blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocked_activity.query AS blocked_query,
now() - blocked_activity.query_start AS blocked_duration,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocking_activity.query AS blocking_query,
now() - blocking_activity.query_start AS blocking_duration
FROM pg_locks blocked_locks
JOIN pg_stat_activity blocked_activity
ON blocked_activity.pid = blocked_locks.pid
JOIN pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
JOIN pg_stat_activity blocking_activity
ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
Description: Shows queries blocked by other queries. Blocking can severely degrade performance. Essential for troubleshooting database locks.
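On PostgreSQL 9.6 and later, the built-in pg_blocking_pids() function offers a shorter view of the same situation; a sketch:
SELECT pid,
       usename,
       pg_blocking_pids(pid) AS blocked_by,
       wait_event_type,
       wait_event,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
The blocked_by column lists the PIDs holding the locks, which you can then look up in pg_stat_activity to decide whether to wait, cancel, or terminate them.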
Top Queries by Execution Time
SELECT
ROUND((total_exec_time * 100 / SUM(total_exec_time) OVER ())::numeric, 2) AS cpu_percent,
calls, ROUND(mean_exec_time::numeric, 2) AS mean_ms,
ROUND(min_exec_time::numeric, 2) AS min_ms,
ROUND(max_exec_time::numeric, 2) AS max_ms,
ROUND(total_exec_time::numeric, 2) AS total_ms,
query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
Description: This query lists the most resource-intensive SQL statements in your PostgreSQL instance based on total execution time. It provides an overview of where the database spends most of its CPU and I/O resources — helping you prioritize tuning efforts effectively.
The cpu_percent column shows how much of the total execution time across all queries is consumed by each individual statement. Queries with a high CPU percentage are the primary candidates for optimization since they represent the largest share of total workload.
Focus first on the top few queries that dominate total execution time — improving even one of these can dramatically reduce CPU load, I/O pressure, and latency across the system.
The mean_ms, min_ms, and max_ms columns reveal how consistent or variable query performance is, helping you spot queries that sometimes run fast but occasionally degrade severely (often due to caching, parameter sensitivity, or plan instability).
Keep in mind that pg_stat_statements aggregates data since the last reset or database restart — it doesn’t provide historical or time-series trends by default. If you need long-term visibility, consider exporting this data periodically or integrating it with a monitoring tool like pghealth, which tracks query performance over time and helps correlate performance regressions with schema or configuration changes.
By regularly reviewing the top resource-consuming queries and understanding their CPU share, you ensure PostgreSQL runs efficiently, reduce contention, and prevent bottlenecks from silently building up.
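Note that pg_stat_statements is an extension and must be enabled before this query returns anything. A minimal setup sketch (changing shared_preload_libraries requires a cluster restart):
-- 1) In postgresql.conf (restart required):
--    shared_preload_libraries = 'pg_stat_statements'
-- 2) Then, in the target database:
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
-- 3) Optional: clear counters to start a fresh measurement window
SELECT pg_stat_statements_reset();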
Checkpoints
SELECT
checkpoints_timed,
checkpoints_req,
CASE
WHEN (checkpoints_timed + checkpoints_req) > 0 THEN
ROUND(100.0 * checkpoints_timed / (checkpoints_timed + checkpoints_req), 2)
ELSE 0
END AS timed_checkpoint_ratio_percent
FROM pg_stat_bgwriter;
Description: Shows whether checkpoints are mostly triggered on schedule (checkpoint_timeout) rather than forced by WAL volume. A timed ratio above roughly 90% is generally healthy; a high share of requested checkpoints usually means max_wal_size is too small, causing I/O spikes and extra full-page writes.
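On PostgreSQL 17 and newer these counters moved from pg_stat_bgwriter to pg_stat_checkpointer, so the equivalent check looks like this (a sketch for newer versions; the query above matches the version 14 cluster used in this article):
SELECT
    num_timed,
    num_requested,
    CASE
        WHEN (num_timed + num_requested) > 0 THEN
            ROUND(100.0 * num_timed / (num_timed + num_requested), 2)
        ELSE 0
    END AS timed_checkpoint_ratio_percent
FROM pg_stat_checkpointer;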
Table Cache Hit Ratios
SELECT
schemaname,
relname AS tablename,
heap_blks_read,
heap_blks_hit,
CASE
WHEN (heap_blks_read + heap_blks_hit) > 0 THEN
ROUND(100.0 * heap_blks_hit / (heap_blks_read + heap_blks_hit), 2)
ELSE 100
END AS cache_hit_ratio_percent
FROM pg_statio_user_tables
ORDER BY cache_hit_ratio_percent ASC;
Description: Shows how often table reads are served from cache. Higher cache hit ratios improve performance and reduce disk I/O.
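For an instance-wide view, the same ratio can be computed per database from pg_stat_database; a sketch:
SELECT
    datname,
    blks_read,
    blks_hit,
    CASE
        WHEN (blks_read + blks_hit) > 0 THEN
            ROUND(100.0 * blks_hit / (blks_read + blks_hit), 2)
        ELSE 100
    END AS cache_hit_ratio_percent
FROM pg_stat_database
WHERE datname IS NOT NULL
ORDER BY cache_hit_ratio_percent ASC;
On a typical OLTP workload, ratios persistently below roughly 99% often point to an undersized shared_buffers or a working set larger than available RAM.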
Replication Slots & Lag
SELECT slot_name, plugin, slot_type, database, active, active_pid, xmin, catalog_xmin, restart_lsn, confirmed_flush_lsn
FROM pg_replication_slots
WHERE active = false;
SELECT
client_addr, application_name, state, sync_state,
write_lag, flush_lag, replay_lag,
pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS sent_lag_bytes,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)) AS sent_lag_pretty
FROM pg_stat_replication;
Description: Replication slots are essential for ensuring reliable streaming replication and logical replication in PostgreSQL. However, unused or inactive replication slots can cause serious issues if left unchecked.
When a replication slot remains inactive (i.e., the replica is gone or disconnected), PostgreSQL retains all Write-Ahead Logs (WALs) needed for that slot indefinitely. This leads to unbounded WAL accumulation, which can quickly consume disk space and eventually bring the system down. In a healthy setup, there should be no inactive slots. If you see any, investigate immediately — remove or drop them if the associated replica no longer exists.
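Dropping an abandoned slot is a single call once you have confirmed the replica is truly gone (the slot name below is a placeholder; never drop a slot that an existing replica still needs):
SELECT pg_drop_replication_slot('old_replica_slot');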
The second query monitors replication lag across connected replicas. The write_lag, flush_lag, and replay_lag columns indicate how far behind each standby is in applying changes from the primary. Additionally, the calculated sent_lag_bytes helps quantify lag in bytes, giving a clear picture of data freshness.
Replication lag directly impacts high availability (HA) and failover readiness. Large lags mean your replicas are out of sync, which can cause data loss if a failover occurs. Depending on the replication mode (sync_state), this lag can also create operational overhead:
In synchronous replication, lag can delay commits on the primary because it waits for the standby to confirm receipt.
In asynchronous replication, lag does not block commits but can increase recovery time and risk data inconsistency during failover.
Regularly monitoring both inactive slots and replication lag ensures that your replication topology remains healthy, efficient, and aligned with your HA objectives. For critical systems, integrate these checks into automated health alerts — for example, tools like pghealth visualize replication lag trends and detect inactive slots early to prevent WAL bloat and downtime.
Author: Fırat Güleç — Principal PostgreSQL DBA
Try pghealth Free Today 🚀
Start your journey toward a healthier PostgreSQL with pghealth.
You can explore all features immediately with a free trial — no installation required.

