PostgreSQL Health Check Case Study: Logical Replication Slot Caused WAL Disk Exhaustion

This PostgreSQL health check case study explains how a simple DDL change caused silent WAL accumulation and nearly took down a production database.

The issue was not CPU. It was not query performance. It was not traffic.

It was WAL retention caused by a logical replication slot.

Architecture

Primary database (source) – OLTP workload
Logical replication target (subscriber) – separate instance for reporting
Replication method: CREATE PUBLICATION / CREATE SUBSCRIPTION
Dedicated replication slot on primary

Important:

Logical replication does NOT replicate DDL changes.
Streaming (physical) replication does.

Logical replication works at row level. Schema changes must be applied manually on both sides.

The Change That Triggered the Incident

On primary:

ALTER TABLE transactions
ADD COLUMN reference_code TEXT;

The same DDL was not applied on the subscriber.

Replication stopped applying changes immediately.

But nothing crashed.

What Happened Internally

Subscriber failed to apply changes.
Replication worker started erroring.
Replication slot on primary remained active.
WAL segments required by that slot could not be recycled.
WAL files accumulated continuously.

PostgreSQL will never remove WAL files still required by an active replication slot.

That is by design.

Early Warning Signals (That Were Missed)

1. Replication Slot Still Active

On primary:

SELECT slot_name,
       active,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

You can inspect slot retention using the pg_replication_slots system catalog.

Output

slot_name        | reporting_slot
active           | true
retained_wal     | 185 GB

The slot was active, but retained WAL was growing continuously.

2. WAL Directory Growth

On the primary server:

du -sh $PGDATA/pg_wal

Output:

210G    pg_wal

Disk alerts triggered at 85% capacity.

Without intervention, the server would have stopped accepting writes within a day.

3. Errors in Subscriber Logs

On the logical replication target:

Example log entries:

ERROR:  logical replication target relation "public.transactions"
has different column count from source relation
CONTEXT: processing remote data for replication origin "pg_16384"

And repeatedly:

ERROR:  missing replicated column: "reference_code"
STATEMENT: INSERT INTO public.transactions (...)

The replication worker kept retrying and failing.

4. Primary Server Log Warnings

Primary logs showed:

WARNING:  replication slot "reporting_slot" is preventing removal of required WAL segments
DETAIL:  The slot has retained 180 GB of WAL files.

This is a classic early signal in a PostgreSQL health check.

Why This Is Dangerous

Replication slot retention works by design:

PostgreSQL will not remove WAL files required by an active replication slot, as explained in the official PostgreSQL replication slots documentation.

If the subscriber cannot consume changes:

WAL retention grows indefinitely
Disk fills up
Primary database becomes unavailable
Production outage occurs

Logical replication makes this risk higher because DDL mismatches are common.

Root Cause

The root cause was simple:

A DDL change was applied to the primary
The same DDL was not applied to the logical subscriber
Logical replication stopped applying changes
Replication slot retained WAL indefinitely

No automatic schema sync exists in logical replication.

How the Issue Was Fixed

Step 1 – Apply missing DDL on subscriber:

ALTER TABLE transactions
ADD COLUMN reference_code TEXT;

Step 2 – Verify subscription status:

On subscriber:

SELECT * FROM pg_stat_subscription;

Subscription state can be monitored via pg_stat_subscription.

Step 3 – Confirm slot consumption resumed:

On primary:

SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

Retained WAL began decreasing.

Step 4 – Monitor pg_wal directory shrinkage.

System stabilized.

Health Check Controls That Would Have Prevented This

This incident was preventable.

A proper PostgreSQL health check should include:

1. WAL Disk Usage Monitoring

Alert when:

pg_wal exceeds defined threshold (e.g., 70%)
WAL growth rate accelerates abnormally

2. Replication Slot Retention Check

Script:

SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal_size
FROM pg_replication_slots
WHERE slot_type = 'logical';

Alert when retained WAL exceeds threshold (e.g., 10 GB in reporting workloads).

3. Subscription Health Check (Subscriber Side)

SELECT subname,
       status,
       last_msg_send_time,
       last_msg_receipt_time,
       latest_end_lsn
FROM pg_stat_subscription;

Alert if:

status != 'streaming'
last_msg_receipt_time is stale

4. Log Monitoring

Both source and target logs must be monitored for:

logical replication errors
replication worker restarts
slot retention warnings

Lessons Learned

Logical replication does not replicate DDL.
Active replication slots can silently retain massive WAL.
Replication being “active” does not mean it is healthy.
WAL disk alerts are critical for production systems.
Slot retention size must be part of every PostgreSQL health check.

Streaming vs Logical Replication: Why This Matters

Streaming (physical) replication:

Block-level
Replicates DDL automatically
Lower schema drift risk

Logical replication:

Row-level
Does not replicate DDL
Higher operational risk if schema governance is weak

Choosing logical replication requires stronger monitoring discipline.

Understanding the differences between streaming and logical replication is critical for any serious PostgreSQL health check strategy.

For a deeper breakdown, see our guide on Streaming vs Logical Replication in PostgreSQL.

Preventive Checklist

Before applying DDL on primary:

Verify whether table participates in logical publication
Apply DDL to subscriber first (or in controlled order)
Validate replication health after change

Continuous monitoring should include:

WAL disk usage
Replication slot retention
Subscription state
Error logs

Final Thoughts

This was not a performance issue. It was not a crash. It was not a traffic spike.

It was silent replication drift.

Without WAL monitoring and replication slot visibility, the primary database would have gone down within hours.

A proper PostgreSQL health check is not just about CPU, queries, or indexes.

It must include:

Replication health
Slot retention
WAL growth patterns
Log analysis

Because databases rarely fail instantly.

They degrade silently — until storage runs out.

This is a classic signal that should always be covered in a proper PostgreSQL health check.

Author: Fırat Güleç — Principal PostgreSQL DBA

Try pghealth Free Today 🚀

Start your journey toward a healthier PostgreSQL with pghealth.
You can explore all features immediately with a free trial — no installation required.

👉 Start Free Trial.

PostgreSQL Health Check Case Study: Logical Replication Slot Caused WAL Disk Exhaustion

A real-world PostgreSQL health check case study explaining how a logical replication DDL mismatch caused WAL retention, disk exhaustion, and production risk. Includes monitoring queries and prevention checklist.

Architecture

The Change That Triggered the Incident

On primary:

What Happened Internally

Early Warning Signals (That Were Missed)

1. Replication Slot Still Active

On primary:

Output

2. WAL Directory Growth

On the primary server:

Output:

3. Errors in Subscriber Logs

Example log entries:

And repeatedly:

4. Primary Server Log Warnings

Primary logs showed:

Why This Is Dangerous

Root Cause

How the Issue Was Fixed

Step 1 – Apply missing DDL on subscriber:

Step 2 – Verify subscription status:

Step 3 – Confirm slot consumption resumed:

Step 4 – Monitor pg_wal directory shrinkage.

Health Check Controls That Would Have Prevented This

1. WAL Disk Usage Monitoring

2. Replication Slot Retention Check

3. Subscription Health Check (Subscriber Side)

4. Log Monitoring

Streaming vs Logical Replication: Why This Matters

Streaming (physical) replication:

Logical replication:

Preventive Checklist

Final Thoughts

Try pghealth Free Today 🚀