Overview: Replication Lag Is a Symptom - Not the Problem
In most PostgreSQL high availability (HA) architectures, streaming replication is the foundation. We build clusters to achieve:
- Low Recovery Time Objective (RTO)
- Minimal Recovery Point Objective (RPO)
- Read scalability through standby replicas
On paper, everything looks resilient.
But in real-world systems, replication lag slowly erodes that resilience.
Replication lag is not just “delay.”
It is a signal that your HA guarantees may no longer be valid.
If a standby is 20 minutes behind:
- Failover can mean up to 20 minutes of data loss (RPO violation)
- Promotion takes longer because the WAL backlog must be replayed first (RTO degradation)
- Reporting queries read stale data
- WAL retention grows and increases storage pressure
In reporting-heavy environments using streaming replication, replication lag often becomes chronic.
Let’s break it down in depth.
What Exactly Is Replication Lag?
In PostgreSQL streaming replication, changes are written to WAL (Write-Ahead Log) on the primary and streamed to standby nodes.
There are three stages of delay:
- Write lag – WAL sent but not yet written on standby
- Flush lag – WAL written but not flushed to disk
- Replay lag – WAL flushed but not yet applied
The most critical metric is replay lag, because data is not visible to queries until replayed.
How to Detect Replication Lag
On the Primary
-- byte_lag: WAL generated on the primary but not yet replayed by this standby
SELECT
    client_addr,
    state,
    sync_state,
    write_lag,
    flush_lag,
    replay_lag,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
    ) AS byte_lag
FROM pg_stat_replication;
Healthy Indicators
- state = streaming
- replay_lag is NULL or very small
- byte_lag under a few MB
Warning Thresholds
- Normal (byte lag < 10 MB): no immediate action required.
- Investigate (byte lag 100 MB+): review WAL generation rate and standby I/O capacity.
- Critical (byte lag 1 GB+): immediate investigation required; risk of RPO impact.
- Structural issue (continuously increasing lag): indicates an architectural or capacity problem.
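As a rough illustration, the byte-lag thresholds above can be encoded in a small helper. This is a minimal Python sketch, not part of any PostgreSQL tooling: the function name classify_byte_lag is hypothetical, MB and GB are assumed to be binary units, and the 10–100 MB band, which the thresholds above leave unclassified, is returned as "unclassified" rather than guessed. It does not cover the structural case, which depends on the trend rather than on a single reading.

```python
MB = 1024 * 1024
GB = 1024 * MB


def classify_byte_lag(lag_bytes: int) -> str:
    """Map a byte-lag reading to the article's threshold labels."""
    if lag_bytes < 0:
        raise ValueError("lag_bytes must be non-negative")
    if lag_bytes >= GB:
        return "critical"
    if lag_bytes >= 100 * MB:
        return "investigate"
    if lag_bytes < 10 * MB:
        return "normal"
    # Assumption: the source gives no guidance for 10-100 MB.
    return "unclassified"


if __name__ == "__main__":
    for reading in (5 * MB, 50 * MB, 250 * MB, 2 * GB):
        print(reading, classify_byte_lag(reading))
```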
On the Standby
SELECT now() - pg_last_xact_replay_timestamp() AS replication_delay;
Interpretation
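- NULL → replication inactive (no transactions replayed yet)
- Seconds → acceptable
- Minutes → degraded HA
- Hours → failover unsafe
On an idle primary this value keeps growing even though nothing is missing, so cross-check it against byte lag.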
Measuring WAL Generation Rate
SELECT now(), pg_current_wal_lsn();
Run this twice, 1–2 minutes apart, and divide the LSN difference (pg_wal_lsn_diff) by the elapsed seconds to get the WAL generation rate in bytes per second.
If your system generates 40–50 MB/sec of WAL during peak load, your standby must sustain at least that throughput.
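The calculation can be sketched in Python. An LSN such as 16/B374D848 is two hexadecimal numbers, the high and low 32 bits of a 64-bit WAL byte position, so the difference between two samples is a byte count. The function names and the sample values below are illustrative assumptions; inside the database, pg_wal_lsn_diff() performs the same subtraction.

```python
from datetime import datetime


def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN like '16/B374D848' to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def wal_rate_mb_per_sec(t1: datetime, lsn1: str, t2: datetime, lsn2: str) -> float:
    """WAL generated between two samples, in MB (2**20 bytes) per second."""
    elapsed = (t2 - t1).total_seconds()
    if elapsed <= 0:
        raise ValueError("second sample must be taken after the first")
    return (lsn_to_bytes(lsn2) - lsn_to_bytes(lsn1)) / elapsed / (1024 * 1024)


if __name__ == "__main__":
    # Hypothetical samples taken 60 seconds apart.
    first = datetime(2024, 1, 1, 12, 0, 0)
    second = datetime(2024, 1, 1, 12, 1, 0)
    print(wal_rate_mb_per_sec(first, "0/0", second, "0/A8C00000"))
```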
Root Causes of Replication Lag
Replication lag typically falls into five categories.
1. Slow Standby I/O
Most common cause.
- Under-provisioned disks
- Cloud IOPS throttling
- Burst credit exhaustion
- Storage contention
Replication is usually I/O-bound.
2. Using Streaming Replication for Reporting
Streaming replication is physical replication.
The standby must continuously replay WAL.
When heavy reporting queries run:
- Large aggregations
- Long-running SELECTs
- Temp file writes
- Sequential scans
Replay competes for disk I/O.
Over time:
- Replay slows
- WAL accumulates
- Lag becomes persistent
Streaming replication is designed for HA — not analytics.
For reporting systems, logical replication is often a better choice.
CREATE PUBLICATION reporting_pub
FOR TABLE orders, customers;
Logical replication allows workload isolation and reduces replay pressure.
3. High WAL Generation
- Large UPDATE/DELETE
- Index rebuilds
- ETL jobs
- Autovacuum freeze
If the standby cannot keep up, lag grows rapidly.
4. Network Bottlenecks
- Cross-region replication
- Packet loss
- Insufficient bandwidth
- High latency
Common in distributed HA setups.
5. Hot Standby Conflicts
SELECT * FROM pg_stat_database_conflicts;
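If these counters increase, queries on the standby are being canceled because they conflict with WAL replay. Before canceling, replay waits for the conflicting query for up to max_standby_streaming_delay, and that waiting shows up as replay lag.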
Replication Lag and RTO / RPO Impact
RPO
If lag is 15 minutes, failover risks 15 minutes of data loss.
RTO
- More WAL to apply
- Longer crash recovery
- Slower promotion
HA without lag monitoring is incomplete.
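A back-of-the-envelope estimate makes the RTO effect concrete: the time to apply a backlog is roughly the backlog size divided by the net replay rate. The sketch below is an approximation under stated assumptions, not a PostgreSQL formula. The name catchup_seconds is hypothetical, and the replay rate has to be measured on your own standby. During a failover the primary is down, so the WAL generation rate is zero; during normal operation a standby that replays no faster than the primary generates WAL never catches up.

```python
import math

MB = 1024 * 1024
GB = 1024 * MB


def catchup_seconds(backlog_bytes: float,
                    replay_bytes_per_sec: float,
                    wal_bytes_per_sec: float = 0.0) -> float:
    """Estimate seconds needed to replay a WAL backlog.

    wal_bytes_per_sec is the rate at which new WAL keeps arriving
    (0 during failover, because the primary is no longer writing).
    """
    if backlog_bytes <= 0:
        return 0.0
    net_rate = replay_bytes_per_sec - wal_bytes_per_sec
    if net_rate <= 0:
        return math.inf  # the backlog grows instead of shrinking
    return backlog_bytes / net_rate


if __name__ == "__main__":
    # Hypothetical numbers: 3 GB backlog, standby replays 50 MB/s.
    print(catchup_seconds(3 * GB, 50 * MB))
    # Same standby while the primary still generates 50 MB/s.
    print(catchup_seconds(3 * GB, 50 * MB, 50 * MB))
```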
Long-Term Performance Consequences
- WAL retention growth
- Disk exhaustion risk
- Increased checkpoint pressure
- Longer recovery times
- Replication slot bloat
- Higher storage costs
Over time, lag becomes a structural performance problem.
When Should You Panic?
- Lag continuously increases
- Replay delay exceeds minutes in HA systems
- WAL disk usage exceeds 70%
- Failover tests show data loss
- Reporting depends on near real-time accuracy
Trend matters more than spikes.
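One way to separate a trend from a spike is to fit a straight line through recent byte-lag samples and look at the slope. The following Python sketch uses an ordinary least-squares slope; the function name and the sample data are hypothetical, and a real alert would need a window and threshold chosen for your workload.

```python
from typing import Sequence, Tuple


def lag_slope_bytes_per_sec(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of byte lag over time.

    samples: (seconds_since_start, lag_bytes) pairs.
    A slope near zero means no sustained growth, even if one sample spiked.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_lag = sum(lag for _, lag in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    cov = sum((t - mean_t) * (lag - mean_lag) for t, lag in samples)
    return cov / var_t


if __name__ == "__main__":
    spike = [(0, 10), (60, 10), (120, 500), (180, 10), (240, 10)]
    growth = [(0, 0), (60, 600), (120, 1200)]
    print(lag_slope_bytes_per_sec(spike))   # one-off spike
    print(lag_slope_bytes_per_sec(growth))  # steady growth
```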
How to Fix Replication Lag
1. Improve Standby Storage
- Faster disks
- Higher IOPS
- Separate WAL disk
- Avoid shared noisy storage
2. Re-Architect Reporting
- Logical replication
- Dedicated reporting clusters
- Snapshot-based ETL
Streaming replication should primarily serve HA.
3. Tune WAL Settings
- wal_keep_size
- max_wal_senders
- max_replication_slots
- Checkpoint configuration
4. Break Large Batch Operations
- Use 10k–50k row batches
- Schedule maintenance
- Control index rebuild timing
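One way to drive such batches from a script is to split the key range up front. This sketch assumes the table has a reasonably dense integer key (called id here purely for illustration); batch_ranges is a hypothetical helper, and each yielded range would be used in its own short transaction, for example a DELETE restricted to id BETWEEN lo AND hi, optionally with a pause between batches so the standby can keep up.

```python
from typing import Iterator, Tuple


def batch_ranges(min_id: int, max_id: int,
                 batch_size: int = 10_000) -> Iterator[Tuple[int, int]]:
    """Yield inclusive (lo, hi) key ranges covering min_id..max_id."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    lo = min_id
    while lo <= max_id:
        hi = min(lo + batch_size - 1, max_id)
        yield lo, hi
        lo = hi + 1


if __name__ == "__main__":
    # Hypothetical table with ids 1..25000, processed 10k rows at a time.
    for lo, hi in batch_ranges(1, 25_000, 10_000):
        print(f"process ids {lo}..{hi} in one transaction")
```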
5. Implement Continuous Monitoring
- Replay lag
- Byte lag
- WAL generation rate
- Disk IOPS
- Replication slot lag
- Conflict counters
- Standby disk utilization
Replication lag without monitoring is an invisible risk.
FAQ: PostgreSQL Replication Lag
What is replication lag in PostgreSQL?
Replication lag is the delay between a transaction being committed on the primary server and being applied on a standby server.
How do I check replication lag?
On the primary:
SELECT * FROM pg_stat_replication;
On the standby:
SELECT now() - pg_last_xact_replay_timestamp();
How much replication lag is acceptable?
For HA systems, lag should typically remain under a few seconds.
Persistent lag measured in minutes indicates architectural or infrastructure issues.
Does replication lag cause data loss?
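Yes, with asynchronous replication. During failover, any WAL that had not yet reached the standby is lost. WAL that was received but not yet replayed is not lost, but it must be applied before promotion, which lengthens failover.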
Why does replication lag increase during reporting?
Heavy read queries on the standby consume I/O and compete with WAL replay.
Should I use streaming replication for reporting?
Streaming replication is ideal for HA.
For heavy reporting workloads, logical replication is usually more appropriate.
Can replication lag affect performance?
Yes. It increases WAL retention, disk usage, recovery time, and operational risk.
Final Thoughts
Replication lag is not a cosmetic metric.
It is a structural signal that your HA architecture, storage capacity, or workload isolation strategy needs attention.
Streaming replication ensures availability.
Logical replication enables isolation.
Monitoring ensures your HA guarantees are real — not theoretical.
Author: Fırat Güleç — Principal PostgreSQL DBA

