PostgreSQL Replication Lag Case Study: When a Standby Query Delayed Our Maintenance Window

A real production incident in which a long‑running query on a PostgreSQL standby node caused VACUUM ANALYZE on the primary to run for hours longer than expected.



High availability clusters are often designed with the assumption that standby nodes are passive replicas. In practice, however, standby nodes can influence the behavior of the primary in ways that are easy to overlook.

This post describes a real production issue we encountered in a 3‑node PostgreSQL cluster managed by Patroni, where a long‑running query on a standby node indirectly caused a routine VACUUM ANALYZE operation to run for hours longer than expected.


Cluster Architecture

Our environment was a typical high availability setup:

  • 3‑node PostgreSQL cluster
  • Managed by Patroni
  • 1 Primary
  • 2 Standby nodes
  • Asynchronous streaming replication

Standby nodes occasionally served read‑only workloads, while the primary handled most of the traffic.

The database size was around 1 TB.


Daily Maintenance Routine

Every night at 04:00, we ran a scheduled maintenance job:

VACUUM ANALYZE;

The goal was to:

  • Clean up dead tuples
  • Update planner statistics
  • Prevent table bloat
  • Keep query plans efficient

For a 1 TB database, the process normally finished in about one hour.


The First Symptom

Around 08:00, we noticed something unusual.

Application response times were climbing, queries were running slower than normal, and database I/O was higher than expected.

When we checked the maintenance job, we discovered something surprising:

The VACUUM ANALYZE that had started at 04:00 was still running.

At this point it had already been running for nearly four hours.


Investigating the Primary

Our first step was to investigate the primary node.

We checked:

  • Lock contention
  • Long running transactions
  • Blocking queries
  • Disk I/O pressure
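
The first three checks above can be run with standard system views; a sketch (the views and functions below are regular PostgreSQL catalogs, available since 9.6):

```sql
-- Long-running transactions and what they are waiting on
SELECT pid, state, wait_event_type, wait_event,
       now() - xact_start AS xact_age,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_age DESC;

-- Blocking relationships between backends
SELECT pid, pg_blocking_pids(pid) AS blocked_by
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```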

Everything looked normal.

There were no blocking transactions and nothing obvious that would explain the delay.

Yet the vacuum process was still running.


Checking the Standby Nodes

Since nothing unusual appeared on the primary, we decided to inspect the standby servers.

On one standby node we noticed a query that had been running for almost five hours.

It had started overnight and never finished.
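
A query like the following (a sketch; the one‑hour threshold is illustrative) surfaces such sessions on a standby:

```sql
-- On the standby: queries running longer than one hour
SELECT pid, usename, now() - query_start AS runtime,
       state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query_start < now() - interval '1 hour'
ORDER BY runtime DESC;
```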

At first this seemed unrelated, because standby queries are read‑only.

But PostgreSQL replication behavior makes things more complicated.


Why a Standby Query Can Affect VACUUM

PostgreSQL standby servers continuously apply WAL records from the primary.

However, when a query runs on a standby node, PostgreSQL must preserve a consistent snapshot for that query.

If replaying an incoming WAL record would invalidate that snapshot (for example, a cleanup record that removes a row version the query can still see), PostgreSQL has two options:

  1. Cancel the standby query
  2. Delay WAL replay
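
Which option wins, and for how long, is governed by a few standby‑side settings (shown here with their defaults; this is a reference sketch, not our production configuration):

```
# postgresql.conf on the standby
max_standby_streaming_delay = 30s   # how long WAL replay may pause for a conflicting query (-1 = forever)
max_standby_archive_delay   = 30s   # same limit, for WAL restored from the archive
hot_standby_feedback        = off   # when on, the standby reports its oldest snapshot to the primary
```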

When hot_standby_feedback is enabled, the standby informs the primary that it still needs certain row versions.

Because of this feedback:

  • The primary cannot remove those tuples during VACUUM
  • Dead tuples must remain available for the standby snapshot

As a result:

  • The standby query holds a long‑lived snapshot
  • The primary cannot remove tuples still visible to that snapshot
  • VACUUM can only mark them as dead but not yet removable, so it drags on far longer than usual
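
This effect is visible on the primary: with hot_standby_feedback enabled, pg_stat_replication exposes the oldest transaction ID the standby still needs. A sketch:

```sql
-- On the primary: a non-null backend_xmin means a standby is holding back cleanup
SELECT application_name, state, backend_xmin
FROM pg_stat_replication;

-- On the standby: confirm feedback is enabled
SHOW hot_standby_feedback;
```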

Resolving the Issue

After identifying the long‑running query on the standby, we decided to cancel it.
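
Cancellation uses the standard admin functions (the pid below is a placeholder for the actual backend found in pg_stat_activity):

```sql
-- On the standby: cancel the running query but keep the connection open
SELECT pg_cancel_backend(12345);

-- If cancellation is not enough, terminate the backend entirely
SELECT pg_terminate_backend(12345);
```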

Once the standby query was cancelled:

  • WAL replay continued normally
  • The primary vacuum progressed quickly
  • VACUUM ANALYZE completed shortly afterward

The root cause was confirmed.


Why This Matters

Allowing a large vacuum operation to run during peak traffic hours can be dangerous.

Vacuum generates significant disk activity.

If it continues during busy periods it can lead to:

  • Increased I/O
  • Slower queries
  • Higher latency
  • Overall system slowdown

This is exactly what we experienced.


Lessons Learned

This incident highlighted several important operational lessons.

Monitor Standby Queries

Long‑running queries on standby nodes can influence primary behavior.

They should be monitored just like queries on the primary.

Watch Vacuum Duration

Unexpected increases in vacuum duration can signal deeper issues in the cluster.
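
Since PostgreSQL 9.6, vacuum progress can be observed directly rather than inferred from job runtimes; a sketch using the standard progress view:

```sql
-- Phase and progress of every running VACUUM
SELECT p.pid, p.relid::regclass AS table_name, p.phase,
       p.heap_blks_scanned, p.heap_blks_total,
       now() - a.query_start AS runtime
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a USING (pid);
```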

Understand hot_standby_feedback

While useful, it can allow standby queries to indirectly impact primary maintenance tasks.


Final Thoughts

Standby nodes are often treated as passive replicas.

In reality they are active components that can influence replication, vacuum behavior, and cluster performance.

If you operate PostgreSQL clusters in production, remember:

Monitoring only the primary node is not enough.


Author: Fırat Güleç — Principal PostgreSQL DBA

