PostgreSQL Replication Lag Case Study: When a Standby Query Delayed Our Maintenance Window

A real production incident in which a long‑running query on a PostgreSQL standby node caused VACUUM ANALYZE on the primary to run for hours longer than expected.



High availability clusters are often designed with the assumption that standby nodes are passive replicas. In practice, however, standby nodes can influence the behavior of the primary in ways that are easy to overlook.

This post describes a real production issue we encountered in a 3‑node PostgreSQL cluster managed by Patroni, where a long‑running query on a standby node indirectly caused a routine VACUUM ANALYZE operation to run for hours longer than expected.


Cluster Architecture

Our environment was a typical high availability setup:

  • 3‑node PostgreSQL cluster
  • Managed by Patroni
  • 1 Primary
  • 2 Standby nodes
  • Asynchronous streaming replication

Standby nodes occasionally served read‑only workloads, while the primary handled most of the traffic.

The database size was around 1 TB.


Daily Maintenance Routine

Every night at 04:00, we ran a scheduled maintenance job:

VACUUM ANALYZE;

The goal was to:

  • Clean up dead tuples
  • Update planner statistics
  • Prevent table bloat
  • Keep query plans efficient

For a 1 TB database, the process normally finished in about one hour.


The First Symptom

Around 08:00, we noticed something unusual.

Application response times were climbing, queries were running slower than normal, and database I/O was higher than expected.

When we checked the maintenance job, we discovered something surprising:

The VACUUM ANALYZE that had started at 04:00 was still running.

At this point it had already been running for nearly four hours.


Investigating the Primary

Our first step was to investigate the primary node.

We checked:

  • Lock contention
  • Long running transactions
  • Blocking queries
  • Disk I/O pressure
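
The first three checks above can be run with standard system views; a sketch (the views and functions below are regular PostgreSQL catalogs, available since 9.6):

```sql
-- Long-running transactions and what they are waiting on
SELECT pid, state, wait_event_type, wait_event,
       now() - xact_start AS xact_age,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_age DESC;

-- Blocking relationships between backends
SELECT pid, pg_blocking_pids(pid) AS blocked_by
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
```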

Everything looked normal.

There were no blocking transactions and nothing obvious that would explain the delay.

Yet the vacuum process was still running.


Checking the Standby Nodes

Since nothing unusual appeared on the primary, we decided to inspect the standby servers.

On one standby node we noticed a query that had been running for almost five hours.

It had started overnight and never finished.
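
A query like the following (a sketch; the one‑hour threshold is illustrative) surfaces such sessions on a standby:

```sql
-- On the standby: queries running longer than one hour
SELECT pid, usename, now() - query_start AS runtime,
       state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND query_start < now() - interval '1 hour'
ORDER BY runtime DESC;
```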

At first this seemed unrelated, because standby queries are read‑only.

But PostgreSQL replication behavior makes things more complicated.


Why a Standby Query Can Affect VACUUM

PostgreSQL standby servers continuously apply WAL records from the primary.

However, when a query runs on a standby node, PostgreSQL must preserve a consistent snapshot for that query.

If replaying an incoming WAL record would invalidate that snapshot (for example, a cleanup record that removes a row version the query can still see), PostgreSQL has two options:

  1. Cancel the standby query
  2. Delay WAL replay
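
Which option wins, and for how long, is governed by a few standby‑side settings (shown here with their defaults; this is a reference sketch, not our production configuration):

```
# postgresql.conf on the standby
max_standby_streaming_delay = 30s   # how long WAL replay may pause for a conflicting query (-1 = forever)
max_standby_archive_delay   = 30s   # same limit, for WAL restored from the archive
hot_standby_feedback        = off   # when on, the standby reports its oldest snapshot to the primary
```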

When hot_standby_feedback is enabled, the standby informs the primary that it still needs certain row versions.

Because of this feedback:

  • The primary cannot remove those tuples during VACUUM
  • Dead tuples must remain available for the standby snapshot

As a result:

  • The standby query holds a long‑lived snapshot
  • The primary cannot remove tuples still visible to that snapshot
  • VACUUM can only mark them as dead but not yet removable, so it drags on far longer than usual
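
This effect is visible on the primary: with hot_standby_feedback enabled, pg_stat_replication exposes the oldest transaction ID the standby still needs. A sketch:

```sql
-- On the primary: a non-null backend_xmin means a standby is holding back cleanup
SELECT application_name, state, backend_xmin
FROM pg_stat_replication;

-- On the standby: confirm feedback is enabled
SHOW hot_standby_feedback;
```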

Resolving the Issue

After identifying the long‑running query on the standby, we decided to cancel it.
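
Cancellation uses the standard admin functions (the pid below is a placeholder for the actual backend found in pg_stat_activity):

```sql
-- On the standby: cancel the running query but keep the connection open
SELECT pg_cancel_backend(12345);

-- If cancellation is not enough, terminate the backend entirely
SELECT pg_terminate_backend(12345);
```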

Once the standby query was cancelled:

  • WAL replay continued normally
  • The primary vacuum progressed quickly
  • VACUUM ANALYZE completed shortly afterward

The root cause was confirmed.


Why This Matters

Allowing a large vacuum operation to run during peak traffic hours can be dangerous.

Vacuum generates significant disk activity.

If it continues during busy periods it can lead to:

  • Increased I/O
  • Slower queries
  • Higher latency
  • Overall system slowdown

This is exactly what we experienced.


Lessons Learned

This incident highlighted several important operational lessons.

Monitor Standby Queries

Long‑running queries on standby nodes can influence primary behavior.

They should be monitored just like queries on the primary.

Watch Vacuum Duration

Unexpected increases in vacuum duration can signal deeper issues in the cluster.
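
Since PostgreSQL 9.6, vacuum progress can be observed directly rather than inferred from job runtimes; a sketch using the standard progress view:

```sql
-- Phase and progress of every running VACUUM
SELECT p.pid, p.relid::regclass AS table_name, p.phase,
       p.heap_blks_scanned, p.heap_blks_total,
       now() - a.query_start AS runtime
FROM pg_stat_progress_vacuum p
JOIN pg_stat_activity a USING (pid);
```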

Understand hot_standby_feedback

While useful, it can allow standby queries to indirectly impact primary maintenance tasks.


Final Thoughts

Standby nodes are often treated as passive replicas.

In reality they are active components that can influence replication, vacuum behavior, and cluster performance.

If you operate PostgreSQL clusters in production, remember:

Monitoring only the primary node is not enough.


Author: Fırat Güleç — Principal PostgreSQL DBA

