Comprehensive guide to PostgreSQL system health metrics. Monitor CPU, memory, disk usage, load average, transaction wraparound, and server uptime for optimal database performance and stability.
PostgreSQL Version Control
What it measures: Tracks your PostgreSQL version and checks if it's within the official support lifecycle from the PostgreSQL Global Development Group.
Why it matters: Running supported versions ensures you receive critical security patches and bug fixes. Major versions are supported for five years after release, after which they become end-of-life (EOL) and no longer receive updates, leaving your database vulnerable to security threats and stability issues.
Ideal value & Best Practice: Always run a currently supported PostgreSQL version. Plan upgrades before versions reach EOL. Regular minor version updates should be applied promptly to address security vulnerabilities and critical bugs.
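As a quick manual check (independent of whatever collectors pghealth itself uses), the running version can be read directly from SQL; comparing it against the community's published EOL schedule remains a manual step:

    -- Full version string, including build details
    SELECT version();

    -- Just the server version number (output shown by psql)
    SHOW server_version;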
DB Restart Time
What it measures: Tracks the duration of database restart operations and records the last restart timestamp.
Why it matters: Restart duration provides insight into the health of your database instance and can be an early indicator of problems. Unusually long restart times can signal issues with transaction log (WAL) recovery, database bloat, or storage performance.
Ideal value & Best Practice:
While there's no universal "ideal" restart time, establishing a baseline for your environment helps identify anomalies
Regular restarts are recommended for security maintenance to apply OS and database patches
Monitor last restart time to ensure systems receive necessary security updates (see the query after this list)
Plan maintenance windows for patches that require restarts, especially after critical security updates
Investigate significantly prolonged restart times that may indicate underlying issues
Balance stability needs with security requirements for restart frequency
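A minimal sketch of how the last restart can be read from SQL: pg_postmaster_start_time() reports when the server process was started, while the duration of the restart itself has to be measured around the restart operation.

    -- Last startup time and elapsed time since then
    SELECT pg_postmaster_start_time() AS last_startup,
           now() - pg_postmaster_start_time() AS time_since_startup;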
pg_stat_statements Status
What it measures: Whether the pg_stat_statements extension is enabled and functioning properly.
Why it matters: This module is essential for query performance analysis, providing detailed statistics on execution time, frequency, and resource usage for all SQL queries. Without it, identifying performance bottlenecks and optimizing queries becomes significantly more difficult.
Ideal value & Best Practice: Ensure pg_stat_statements is enabled in all production environments. Regularly monitor and analyze its data to identify slow queries, optimize performance, and detect anomalous database activity.
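A minimal check from SQL, assuming the extension has been created in the current database (it also requires pg_stat_statements in shared_preload_libraries, which takes effect only after a server restart):

    -- Is the extension installed in this database?
    SELECT extname, extversion
    FROM pg_extension
    WHERE extname = 'pg_stat_statements';

    -- Top 5 statements by total execution time
    -- (column names are for PostgreSQL 13+; older releases use total_time)
    SELECT calls,
           round(total_exec_time::numeric, 2) AS total_ms,
           round(mean_exec_time::numeric, 2)  AS mean_ms,
           left(query, 60)                    AS query
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 5;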
Load Average Check
What it measures: The average system load over 1, 5, and 15-minute intervals, representing processes using or waiting for CPU resources.
Why it matters: Load average exceeding your CPU core count indicates resource contention, which can degrade database performance, cause query delays, and signal overloaded or misconfigured systems.
Ideal value & Best Practice: Load average should remain below the number of physical CPU cores. Consistently high load may require query optimization, hardware upgrades, or better resource allocation.
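Load average itself comes from the operating system (for example, the uptime command). From inside PostgreSQL, the number of concurrently active backends is only a rough proxy, but it is useful to compare against your core count; a sketch:

    -- Backends currently executing queries (compare against CPU core count)
    SELECT count(*) AS active_backends
    FROM pg_stat_activity
    WHERE state = 'active';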
CPU Usage
What it measures: The percentage of total CPU capacity consumed by all processes on the server.
Why it matters: Sustained high CPU usage indicates performance bottlenecks, inefficient queries, resource contention, or potential security threats like cryptojacking or DDoS attacks.
Ideal value & Best Practice: Maintain CPU usage below 40-50% during peak loads. Spikes are normal, but sustained high usage requires investigation into query optimization, indexing, or hardware scaling.
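CPU percentage is an OS-level metric, but when it spikes the first database-side question is usually which queries are active and how long they have been running. One way to check, as a sketch:

    -- Active queries ordered by how long they have been running
    SELECT pid,
           now() - query_start AS runtime,
           usename,
           left(query, 60) AS query
    FROM pg_stat_activity
    WHERE state = 'active'
      AND pid <> pg_backend_pid()
    ORDER BY runtime DESC;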
Top OS CPU Process
What it measures: Identifies the most CPU-intensive processes at the operating system level.
Why it matters: Helps pinpoint specific processes (including PostgreSQL backends) consuming excessive CPU resources, enabling targeted troubleshooting of problematic queries or external processes affecting database performance.
Ideal value & Best Practice: PostgreSQL processes should typically dominate CPU usage during database operations. External processes consuming significant CPU may indicate resource contention issues.
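If the top CPU consumer reported by the OS (for example by top or ps) turns out to be a PostgreSQL backend, its process ID can be mapped back to the session and query it is running; 12345 below is a hypothetical PID:

    -- Map an OS process ID to its PostgreSQL session
    SELECT pid, usename, datname, state, query
    FROM pg_stat_activity
    WHERE pid = 12345;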
Disk Usage
What it measures: The percentage of storage capacity consumed on database filesystems.
Why it matters: High disk usage can lead to system instability, failed writes, application crashes, or complete database failure. Proactive monitoring ensures sufficient space for database growth, transaction logs, and critical operations.
Ideal value & Best Practice: Maintain at least 20-30% free disk space. Implement alerting at 80% usage to allow time for maintenance actions like archiving, cleanup, or storage expansion.
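Filesystem usage is checked at the OS level (for example with df), but SQL can show where the space is going inside PostgreSQL; a sketch:

    -- Size of each database
    SELECT datname, pg_size_pretty(pg_database_size(datname)) AS size
    FROM pg_database
    ORDER BY pg_database_size(datname) DESC;

    -- Largest tables, indexes and materialized views in the current database
    SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS total_size
    FROM pg_class
    WHERE relkind IN ('r', 'i', 'm')
    ORDER BY pg_total_relation_size(oid) DESC
    LIMIT 5;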
Memory Usage
What it measures: The percentage of available RAM actively used by the system and processes.
Why it matters: Insufficient free memory (less than 5% of total) can lead to performance degradation, swapping to disk, or out-of-memory (OOM) crashes that jeopardize database stability and security.
Ideal value & Best Practice: Maintain 10-20% free memory for optimal performance. PostgreSQL should be configured to use appropriate shared_buffers and work_mem settings without starving the operating system.
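System memory usage is read from the OS, but the PostgreSQL side of the equation is how much memory the server is configured to claim. The relevant settings can be inspected directly:

    -- Memory-related configuration (values are shown in each parameter's own unit)
    SELECT name, setting, unit
    FROM pg_settings
    WHERE name IN ('shared_buffers', 'work_mem',
                   'maintenance_work_mem', 'effective_cache_size');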
Swap Usage
What it measures: The percentage of disk-based swap space used as an extension of physical RAM.
Why it matters: Excessive swap usage (more than 10% of total swap) indicates severe memory pressure, risking performance degradation, disk I/O bottlenecks, and system instability. While minimal swap usage is normal, sustained high usage often signals insufficient RAM or memory leaks.
Ideal value & Best Practice: Ideally, swap usage should be minimal (less than 5%). Persistent swapping requires investigation into memory allocation, PostgreSQL configuration, or additional RAM.
Server Uptime
What it measures: The duration a server has been running continuously without a restart.
Why it matters: While high uptime can indicate stability, excessively long uptime (weeks or months) may signal unaddressed security vulnerabilities, outdated software, deferred maintenance, memory leaks, or stale configurations that require restarts to take effect.
Ideal value & Best Practice: Balance stability with regular maintenance. Plan regular maintenance windows for security updates and restarts. Monitor for memory leaks or performance degradation that correlate with extended uptime.
Transaction Wraparound Protection
What it measures: The risk of transaction ID wraparound, which occurs when PostgreSQL's 32-bit transaction ID counter approaches its maximum value.
Why it matters: If old transaction IDs aren't frozen by vacuum in time, PostgreSQL stops assigning new transaction IDs and forces an emergency shutdown to protect data integrity, taking the database offline until the issue is resolved.
Ideal value & Best Practice: Ensure autovacuum is properly configured and running. Monitor transaction age metrics and respond promptly to warnings. Never disable autovacuum on production systems.
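The standard way to gauge wraparound risk from SQL is the age of each database's oldest unfrozen transaction ID; 2 billion is used below as a round approximation of the roughly 2^31 hard limit:

    -- Databases ranked by how close they are to transaction ID wraparound
    SELECT datname,
           age(datfrozenxid) AS xid_age,
           round(100.0 * age(datfrozenxid) / 2000000000, 2) AS pct_of_limit
    FROM pg_database
    ORDER BY age(datfrozenxid) DESC;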
Current Process Allocation
What it measures: The real-time distribution of system processes across operational states (Running, Sleeping, Stopped, Zombie).
Why it matters: Identifies resource bottlenecks (high Running processes), hangs (unexpected Stopped processes), and process leaks (Zombie processes). A healthy system typically shows high Sleeping (normal background activity) and near-zero Zombie/Stopped processes.
Ideal value & Best Practice: Most processes should be in Sleeping state during normal operation. Investigate elevated Running states (CPU contention) or any Zombie processes (cleanup issues). Near-zero Zombie and Stopped processes indicate healthy process management.
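Process states themselves come from the OS (for example, ps or top). A loosely analogous view inside PostgreSQL is the distribution of backend states, which pairs well with the OS-level numbers; a sketch:

    -- Distribution of backend states (active, idle, idle in transaction, ...)
    SELECT state, count(*) AS backends
    FROM pg_stat_activity
    GROUP BY state
    ORDER BY backends DESC;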
Try pghealth Free Today
Start your journey toward a healthier PostgreSQL with pghealth.
You can explore all features immediately with a free trial; no installation required.