PostgreSQL Replication Lag Case Study: When a Standby Query Delayed Our Maintenance Window
High availability clusters are often designed with the assumption that standby nodes are passive replicas. In practice, however, standby nodes can influence the behavior of the primary in ways that are easy to overlook.
This post describes a real production issue we encountered in a 3‑node PostgreSQL cluster managed by Patroni,
where a long‑running query on a standby node indirectly caused a routine VACUUM ANALYZE operation to run for hours longer than expected.
Cluster Architecture
Our environment was a typical high availability setup:
- 3‑node PostgreSQL cluster
- Managed by Patroni
- 1 Primary
- 2 Standby nodes
- Asynchronous streaming replication
Standby nodes were occasionally used for read‑only workloads, but the primary handled most of the traffic.
The database size was around 1 TB.
Daily Maintenance Routine
Every night at 04:00, a scheduled maintenance job ran:

```sql
VACUUM ANALYZE;
```
The goal was to:
- Clean up dead tuples
- Update planner statistics
- Prevent table bloat
- Keep query plans efficient
For a 1 TB database, the process normally finished in about one hour.
The First Symptom
Around 08:00, we noticed something unusual.
Application response times were increasing and queries were slower than normal.
Database I/O usage was also higher than expected.
When we checked the maintenance job, we discovered something surprising:
The VACUUM ANALYZE that started at 04:00 was still running.
At this point it had already been running for nearly four hours.
Investigating the Primary
Our first step was to investigate the primary node.
We checked:
- Lock contention
- Long running transactions
- Blocking queries
- Disk I/O pressure
Everything looked normal.
There were no blocking transactions and nothing obvious that would explain the delay.
Yet the vacuum process was still running.
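These checks can be done with standard catalog views. A sketch of the kind of queries involved (the views and functions are standard PostgreSQL; the exact queries we ran may have differed):

```sql
-- Long-running transactions, oldest first
SELECT pid, state, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;

-- Sessions currently blocked by another session's locks (PostgreSQL 9.6+)
SELECT pid, pg_blocking_pids(pid) AS blocked_by, query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```

On the primary, both queries came back clean: no old transactions, no blocked sessions.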
Checking the Standby Nodes
Since nothing unusual appeared on the primary, we decided to inspect the standby servers.
On one standby node we noticed a query that had been running for almost five hours.
This query started overnight and never finished.
At first this seemed unrelated, because standby queries are read‑only.
But PostgreSQL replication behavior makes things more complicated.
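Spotting such a query is straightforward with pg_stat_activity, which works on standbys as well as on the primary. A sketch (the one‑hour threshold is illustrative):

```sql
-- Run on the standby: active queries older than one hour
SELECT pid, usename, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
  AND now() - query_start > interval '1 hour'
ORDER BY query_start;
```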
Why a Standby Query Can Affect VACUUM
PostgreSQL standby servers continuously apply WAL records from the primary.
However, when a query runs on a standby node, PostgreSQL must preserve a consistent snapshot for that query.
If WAL replay would invalidate that snapshot, PostgreSQL has two options:
- Cancel the standby query
- Delay WAL replay
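Which option PostgreSQL takes is governed by configuration. The relevant settings can be inspected on the standby (both are standard parameters):

```sql
-- On the standby: how long WAL replay may wait on a conflicting
-- query before cancelling it (-1 means wait indefinitely)
SHOW max_standby_streaming_delay;

-- Whether the standby reports its oldest needed xmin to the primary
SHOW hot_standby_feedback;
```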
When hot_standby_feedback is enabled, the standby periodically reports the oldest transaction ID (xmin) its queries still need, telling the primary that those row versions must be kept.
Because of this feedback:
- The primary cannot remove those tuples during VACUUM
- Dead tuples must remain available for the standby snapshot
As a result:
- The standby query holds a long snapshot
- The primary cannot clean up certain tuples
- The vacuum process becomes significantly slower
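This feedback is visible on the primary through pg_stat_replication: the backend_xmin column shows the horizon each standby has reported, and VACUUM cannot remove tuples newer than the oldest of these. A quick check:

```sql
-- On the primary: the oldest xmin each standby still needs,
-- as reported through hot_standby_feedback
SELECT application_name, state, backend_xmin
FROM pg_stat_replication;
```

In our case, one standby was holding the xmin horizon back by several hours.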
Resolving the Issue
After identifying the long‑running query on the standby, we decided to cancel it.
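Cancelling a backend on a standby uses the same admin functions as on the primary. A sketch (the pid 12345 is a placeholder taken from pg_stat_activity):

```sql
-- On the standby: cancel the current query of backend 12345
SELECT pg_cancel_backend(12345);

-- If the query does not respond to cancellation,
-- terminate the whole session instead
SELECT pg_terminate_backend(12345);
```

pg_cancel_backend is the gentler option: it aborts only the running query and leaves the session connected.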
Once the standby query was cancelled:
- WAL replay continued normally
- The primary vacuum progressed quickly
- VACUUM ANALYZE completed shortly afterward
The root cause was confirmed.
Why This Matters
Allowing a large vacuum operation to run during peak traffic hours can be dangerous.
Vacuum generates significant disk activity.
If it continues during busy periods it can lead to:
- Increased I/O
- Slower queries
- Higher latency
- Overall system slowdown
This is exactly what we experienced.
Lessons Learned
This incident highlighted several important operational lessons.
Monitor Standby Queries
Long‑running queries on standby nodes can influence primary behavior.
They should be monitored just like queries on the primary.
Watch Vacuum Duration
Unexpected increases in vacuum duration can signal deeper issues in the cluster.
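Rather than waiting for the job to overrun, vacuum progress can be watched live with a standard progress view (PostgreSQL 9.6+):

```sql
-- Live progress of any running VACUUM
SELECT pid, relid::regclass AS relation, phase,
       heap_blks_scanned, heap_blks_total
FROM pg_stat_progress_vacuum;
```

Alerting when heap_blks_scanned stalls for an extended period would have surfaced this incident hours earlier.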
Understand hot_standby_feedback
While useful, it can allow standby queries to indirectly impact primary maintenance tasks.
Final Thoughts
Standby nodes are often treated as passive replicas.
In reality they are active components that can influence replication, vacuum behavior, and cluster performance.
If you operate PostgreSQL clusters in production, remember:
Monitoring only the primary node is not enough.
Author: Fırat Güleç — Principal PostgreSQL DBA