PostgreSQL Replication Lag: Detection, Root Cause & Fix

Deep technical guide to PostgreSQL replication lag detection, root cause analysis, and fixes. Learn how streaming replication impacts RTO, RPO, and reporting workloads.

Overview: Replication Lag Is a Symptom - Not the Problem

In most PostgreSQL high availability (HA) architectures, streaming replication is the foundation. We build clusters to achieve:

  • Low Recovery Time Objective (RTO)
  • Minimal Recovery Point Objective (RPO)
  • Read scalability through standby replicas

On paper, everything looks resilient.

But in real-world systems, replication lag slowly erodes that resilience.

Replication lag is not just “delay.”
It is a signal that your HA guarantees may no longer be valid.

If a standby is 20 minutes behind:

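  • Failover can mean up to 20 minutes of data loss, because WAL that never reached the standby is gone (RPO violation)
  • Promotion takes longer, because the standby must first replay the WAL it has received but not yet applied (RTO degradation)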
  • Reporting queries read stale data
  • WAL retention grows and increases storage pressure

In reporting-heavy environments using streaming replication, replication lag often becomes chronic.

Let’s break it down in depth.


What Exactly Is Replication Lag?

In PostgreSQL streaming replication, changes are written to WAL (Write-Ahead Log) on the primary and streamed to standby nodes.

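PostgreSQL reports three lag measurements. Each one is measured from the moment the primary flushed the WAL locally, so the values are cumulative:

  1. Write lag – time until the standby reports the WAL as written (not yet flushed)
  2. Flush lag – time until the standby reports the WAL as written and flushed to disk
  3. Replay lag – time until the standby reports the WAL as applied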

The most critical metric is replay lag, because data is not visible to queries until replayed.
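Because write_lag, flush_lag, and replay_lag are all measured from the same starting point, subtracting adjacent values gives a rough idea of where the time is spent. The following is a minimal Python sketch of that arithmetic. The function name and the sample values are illustrative assumptions, and it assumes none of the three columns is NULL.

```python
from datetime import timedelta


def stage_breakdown(write_lag, flush_lag, replay_lag):
    """Split the cumulative lag columns into approximate per-stage waits."""
    return {
        # Time to ship the WAL and have the standby write it.
        "network_and_write": write_lag,
        # Extra time until the standby flushed it to disk.
        "flush": flush_lag - write_lag,
        # Extra time until the standby applied it.
        "replay": replay_lag - flush_lag,
    }


# Hypothetical sample values, not real measurements.
stages = stage_breakdown(
    timedelta(milliseconds=5),
    timedelta(milliseconds=12),
    timedelta(seconds=3),
)
slowest = max(stages, key=stages.get)
print(slowest, stages[slowest])
```

In this sample, almost all of the delay sits in the replay stage. That points at standby apply speed rather than at the network.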


How to Detect Replication Lag

On the Primary

SELECT
    client_addr,
    state,
    sync_state,
    write_lag,
    flush_lag,
    replay_lag,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
    ) AS byte_lag
FROM pg_stat_replication;

Healthy Indicators

  • state = streaming
  • replay_lag is NULL or very small
  • byte_lag under a few MB

Warning Thresholds

  • Normal: byte lag < 10 MB. No immediate action required.
  • Investigate: byte lag 100 MB+. Review WAL generation rate and standby I/O capacity.
  • Critical: byte lag 1 GB+. Immediate investigation required. Risk of RPO impact.
  • Structural issue: continuously increasing lag. Indicates an architectural or capacity problem.
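For alerting, the byte-lag bands can be turned into a small classifier. This is only a sketch: the function name is hypothetical, and the text does not say how to treat values between 10 MB and 100 MB, so the "watch" label for that range is an assumption of this example.

```python
MB = 1024 * 1024  # pg_size_pretty() also uses 1024-based units
GB = 1024 * MB


def classify_byte_lag(lag_bytes):
    """Map a byte lag (an integer number of bytes) to a severity band."""
    if lag_bytes >= GB:
        return "critical"
    if lag_bytes >= 100 * MB:
        return "investigate"
    if lag_bytes < 10 * MB:
        return "normal"
    # 10-100 MB is not covered by the documented bands; "watch" is this
    # sketch's own label, not a threshold from the article.
    return "watch"


print(classify_byte_lag(250 * MB))
```

The input would be the raw result of pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn), taken before pg_size_pretty() formats it.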


On the Standby

SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;

Interpretation

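  • NULL → no transaction has been replayed since the standby started, or the server is not a standby
  • Seconds → acceptable
  • Minutes → degraded HA
  • Hours → failover unsafe

This value also grows while the primary is idle. With no new transactions to replay, the last replay timestamp stops advancing even though the standby is fully caught up, so confirm a large value against byte lag before acting on it.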
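A common way to make the time-based check more robust is to report zero delay when the standby has applied everything it has received. The Python sketch below assumes the caller has already fetched pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn(), and pg_last_xact_replay_timestamp() from the standby. The function name and sample values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def standby_delay(receive_lsn, replay_lsn, last_replay_ts, now):
    """Return the replay delay, treating a fully applied standby as zero."""
    if last_replay_ts is None:
        return None  # nothing replayed yet, or not a standby
    if receive_lsn == replay_lsn:
        return timedelta(0)  # everything received has been applied
    return now - last_replay_ts


# Hypothetical values for illustration only.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
idle = standby_delay("0/3000060", "0/3000060", now - timedelta(hours=1), now)
behind = standby_delay("0/3000100", "0/3000060", now - timedelta(seconds=30), now)
print(idle, behind)
```

One limitation: if the standby has lost its connection to the primary, the receive and replay positions can still match. For that reason this check is best paired with the primary-side byte lag.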

Measuring WAL Generation Rate

SELECT now(), pg_current_wal_lsn();

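Run the query twice, 1–2 minutes apart, and compute the byte difference between the two LSNs with pg_wal_lsn_diff(). Dividing that difference by the elapsed seconds gives the WAL generation rate.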

If your system generates 40–50 MB/sec of WAL during peak load, your standby must sustain at least that throughput.
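If the two samples are collected by an external script, the same calculation can be done client-side. An LSN is printed as two hexadecimal numbers separated by a slash, which are the high and low 32 bits of a 64-bit byte position. The Python sketch below uses hypothetical function names and sample LSNs.

```python
def lsn_to_bytes(lsn):
    """Convert an LSN such as '16/B0000000' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def wal_rate_mb_per_sec(lsn_start, lsn_end, elapsed_seconds):
    """Average WAL generation rate between two samples, in MB per second."""
    delta = lsn_to_bytes(lsn_end) - lsn_to_bytes(lsn_start)
    return delta / elapsed_seconds / (1024 * 1024)


# Hypothetical samples taken 60 seconds apart.
rate = wal_rate_mb_per_sec("16/B0000000", "17/60000000", 60)
print(f"{rate:.1f} MB/s")
```

A single pair of samples gives only the average over that window. Sampling during peak load gives the figure the standby actually has to sustain.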


Root Causes of Replication Lag

Replication lag typically falls into five categories.

1. Slow Standby I/O

Most common cause.

  • Under-provisioned disks
  • Cloud IOPS throttling
  • Burst credit exhaustion
  • Storage contention

WAL replay on the standby is usually I/O-bound.


2. Using Streaming Replication for Reporting

Streaming replication is physical replication.
The standby must continuously replay WAL.

When heavy reporting queries run:

  • Large aggregations
  • Long-running SELECTs
  • Temp file writes
  • Sequential scans

Replay competes for disk I/O.

Over time:

  • Replay slows
  • WAL accumulates
  • Lag becomes persistent

Streaming replication is designed for HA, not analytics.

For reporting systems, logical replication is often a better choice.

CREATE PUBLICATION reporting_pub
FOR TABLE orders, customers;

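The reporting server subscribes to this publication with a matching CREATE SUBSCRIPTION. Because the subscriber is an independent database that receives only the published tables, logical replication isolates the reporting workload and takes replay pressure off the HA standby.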


3. High WAL Generation

  • Large UPDATE / DELETE
  • Index rebuilds
  • ETL jobs
  • Autovacuum freeze

If the standby cannot keep up, lag grows rapidly.


4. Network Bottlenecks

  • Cross-region replication
  • Packet loss
  • Insufficient bandwidth
  • High latency

Common in distributed HA setups.


5. Hot Standby Conflicts

SELECT * FROM pg_stat_database_conflicts;

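These counters record standby queries that were canceled because they conflicted with WAL replay. Before canceling a query, replay waits up to max_standby_streaming_delay, so steadily increasing counters mean replay is repeatedly stalling behind queries.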
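The counters in pg_stat_database_conflicts are cumulative, so what matters is the change between two snapshots. The Python sketch below compares two rows for the same database, each represented as a dict. The function name and the sample numbers are hypothetical, while the confl_* keys are column names from the view.

```python
def conflict_deltas(before, after):
    """Return the confl_* counters that grew between two snapshots."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if name.startswith("confl_") and after[name] > before.get(name, 0)
    }


# Hypothetical snapshots of one row, taken a few minutes apart.
before = {"datname": "app", "confl_snapshot": 10, "confl_lock": 2, "confl_bufferpin": 0}
after = {"datname": "app", "confl_snapshot": 25, "confl_lock": 2, "confl_bufferpin": 1}
growth = conflict_deltas(before, after)
print(growth)
```

In this sample confl_snapshot grows the most. That counter tracks queries canceled because replay needed to remove row versions they could still see, which is typical of long-running reports.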


Replication Lag and RTO / RPO Impact

RPO

If lag is 15 minutes, failover risks up to 15 minutes of data loss.

RTO

  • More WAL to apply
  • Longer crash recovery
  • Slower promotion

HA without lag monitoring is incomplete.
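The RTO effect can be estimated with simple arithmetic: the backlog drains at the difference between the standby's replay rate and the primary's WAL generation rate. The Python sketch below assumes both rates are constant, which real workloads rarely are. The function name and the figures are hypothetical.

```python
MB = 1024 * 1024
GB = 1024 * MB


def catch_up_seconds(backlog_bytes, replay_bytes_per_sec, wal_bytes_per_sec):
    """Estimate seconds until the backlog is drained, or None if it never is."""
    headroom = replay_bytes_per_sec - wal_bytes_per_sec
    if headroom <= 0:
        return None  # the standby is falling further behind
    return backlog_bytes / headroom


# Hypothetical figures: 10 GB behind, replaying 60 MB/s against 45 MB/s of new WAL.
eta = catch_up_seconds(10 * GB, 60 * MB, 45 * MB)
print(f"{eta / 60:.1f} minutes")
```

During a failover the primary no longer produces WAL, so passing 0 as the generation rate approximates how long promotion waits for outstanding replay.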


Long-Term Performance Consequences

  • WAL retention growth
  • Disk exhaustion risk
  • Increased checkpoint pressure
  • Longer recovery times
  • Replication slot bloat
  • Higher storage costs

Over time, lag becomes a structural performance problem.


When Should You Panic?

  • Lag continuously increases
  • Replay delay exceeds minutes in HA systems
  • WAL disk usage exceeds 70%
  • Failover tests show data loss
  • Reporting depends on near real-time accuracy

Trend matters more than spikes.
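One way to separate a trend from a spike is to fit a line through recent byte-lag samples and look at its slope. The Python sketch below uses an ordinary least-squares slope. The function name and sample data are hypothetical, and it assumes the samples do not all share the same timestamp.

```python
def lag_slope(samples):
    """Least-squares slope of (seconds, lag_bytes) samples, in bytes per second."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    num = sum((t - mean_t) * (y - mean_y) for t, y in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    return num / den


# Hypothetical samples: (seconds since first sample, byte lag).
growing = [(0, 100_000_000), (60, 160_000_000), (120, 220_000_000), (180, 280_000_000)]
spike = [(0, 5_000_000), (60, 900_000_000), (120, 6_000_000), (180, 5_000_000)]
print(lag_slope(growing), lag_slope(spike))
```

A slope that stays positive across several windows matches the "continuously increasing" condition. A single spike that recovers does not.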


How to Fix Replication Lag

1. Improve Standby Storage

  • Faster disks
  • Higher IOPS
  • Separate WAL disk
  • Avoid shared noisy storage

2. Re-Architect Reporting

  • Logical replication
  • Dedicated reporting clusters
  • Snapshot-based ETL

Streaming replication should primarily serve HA.

3. Tune WAL Settings

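  • wal_keep_size – minimum amount of past WAL kept in pg_wal for standbys to fetch
  • max_wal_senders – maximum number of concurrent WAL sender connections
  • max_replication_slots – maximum number of replication slots
  • Checkpoint configuration (for example max_wal_size and checkpoint_timeout)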
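When sizing wal_keep_size for a standby that does not use a replication slot, it helps to express the setting as time: how long the standby can be disconnected before the WAL it needs is recycled. The Python sketch below assumes a constant WAL rate. The function name and the numbers are hypothetical.

```python
def wal_keep_coverage_seconds(wal_keep_size_mb, wal_rate_mb_per_sec):
    """Seconds of standby outage that wal_keep_size can absorb at a given WAL rate."""
    if wal_rate_mb_per_sec <= 0:
        raise ValueError("WAL rate must be positive")
    return wal_keep_size_mb / wal_rate_mb_per_sec


# Hypothetical setting: 16 GB of retained WAL at 45 MB/s of peak WAL generation.
coverage = wal_keep_coverage_seconds(16 * 1024, 45)
print(f"{coverage / 60:.1f} minutes")
```

At peak rates even a large wal_keep_size covers only a few minutes. This is why setups that must survive longer outages usually rely on replication slots or a WAL archive instead.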

4. Break Large Batch Operations

  • Use 10k–50k row batches
  • Schedule maintenance
  • Control index rebuild timing
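Batching can be as simple as walking a key range in fixed-size steps and committing after each step, so the standby receives WAL in digestible pieces. The Python sketch below only generates the ranges and prints example statements. The table name events and the column id are hypothetical, and real code would pass the bounds as query parameters and might pause between batches.

```python
def batch_ranges(min_id, max_id, batch_size=10_000):
    """Yield inclusive (start, end) id ranges covering min_id..max_id."""
    start = min_id
    while start <= max_id:
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1


# Hypothetical table and column; each statement would run in its own transaction.
ranges = list(batch_ranges(1, 25_000))
for start, end in ranges:
    print(f"DELETE FROM events WHERE id BETWEEN {start} AND {end};")
```

Range-based batches assume the key is reasonably dense. With sparse ids, some batches will touch far fewer rows than the batch size suggests.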

5. Implement Continuous Monitoring

  • Replay lag
  • Byte lag
  • WAL generation rate
  • Disk IOPS
  • Replication slot lag
  • Conflict counters
  • Standby disk utilization

Replication lag without monitoring is invisible risk.


FAQ: PostgreSQL Replication Lag

What is replication lag in PostgreSQL?

Replication lag is the delay between a transaction being committed on the primary server and being applied on a standby server.

How do I check replication lag?

SELECT * FROM pg_stat_replication;
SELECT now() - pg_last_xact_replay_timestamp();

How much replication lag is acceptable?

For HA systems, lag should typically remain under a few seconds.
Persistent lag measured in minutes indicates architectural or infrastructure issues.

Does replication lag cause data loss?

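It can. With asynchronous replication, any WAL that has not reached the standby at failover is lost, so data loss can be as large as the replication delay. WAL that was received but not yet replayed is not lost, but the standby must apply it before promotion completes.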

Why does replication lag increase during reporting?

Heavy read queries on standby consume I/O and compete with WAL replay.

Should I use streaming replication for reporting?

Streaming replication is ideal for HA.
For heavy reporting workloads, logical replication is usually more appropriate.

Can replication lag affect performance?

Yes. It increases WAL retention, disk usage, recovery time, and operational risk.


Final Thoughts

Replication lag is not a cosmetic metric.

It is a structural signal that your HA architecture, storage capacity, or workload isolation strategy needs attention.

Streaming replication ensures availability.
Logical replication enables isolation.
Monitoring ensures your HA guarantees are real, not theoretical.


Author: Fırat Güleç — Principal PostgreSQL DBA

