PostgreSQL Health Check Case Study: Logical Replication Slot Caused WAL Disk Exhaustion

A real-world PostgreSQL health check case study explaining how a logical replication DDL mismatch caused WAL retention, disk exhaustion, and production risk. Includes monitoring queries and prevention checklist.

·6 minutes reading
Cover Image for PostgreSQL Health Check Case Study: Logical Replication Slot Caused WAL Disk Exhaustion

This PostgreSQL health check case study explains how a simple DDL change caused silent WAL accumulation and nearly took down a production database.

The issue was not CPU. It was not query performance. It was not traffic.

It was WAL retention caused by a logical replication slot.


Architecture

  • Primary database (source) – OLTP workload
  • Logical replication target (subscriber) – separate instance for reporting
  • Replication method: CREATE PUBLICATION / CREATE SUBSCRIPTION
  • Dedicated replication slot on primary

Important:

Logical replication does NOT replicate DDL changes.
Streaming (physical) replication does.

Logical replication works at row level. Schema changes must be applied manually on both sides.


The Change That Triggered the Incident

On primary:

ALTER TABLE transactions
ADD COLUMN reference_code TEXT;

The same DDL was not applied on the subscriber.

Replication stopped applying changes immediately.

But nothing crashed.


What Happened Internally

  • Subscriber failed to apply changes.
  • Replication worker started erroring.
  • Replication slot on primary remained active.
  • WAL segments required by that slot could not be recycled.
  • WAL files accumulated continuously.

PostgreSQL will never remove WAL files still required by an active replication slot.

That is by design.


Early Warning Signals (That Were Missed)

1. Replication Slot Still Active

On primary:

SELECT slot_name,
       active,
       restart_lsn,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

You can inspect slot retention using the pg_replication_slots system catalog.

Output

slot_name        | reporting_slot
active           | true
retained_wal     | 185 GB

The slot was active, but retained WAL was growing continuously.


2. WAL Directory Growth

On the primary server:

du -sh $PGDATA/pg_wal

Output:

210G    pg_wal

Disk alerts triggered at 85% capacity.

Without intervention, the server would have stopped accepting writes within a day.

3. Errors in Subscriber Logs

On the logical replication target:

Example log entries:

ERROR:  logical replication target relation "public.transactions"
has different column count from source relation
CONTEXT: processing remote data for replication origin "pg_16384"

And repeatedly:

ERROR:  missing replicated column: "reference_code"
STATEMENT: INSERT INTO public.transactions (...)

The replication worker kept retrying and failing.

4. Primary Server Log Warnings

Primary logs showed:

WARNING:  replication slot "reporting_slot" is preventing removal of required WAL segments
DETAIL:  The slot has retained 180 GB of WAL files.

This is a classic early signal in a PostgreSQL health check.


Why This Is Dangerous

Replication slot retention works by design:

PostgreSQL will not remove WAL files required by an active replication slot, as explained in the official PostgreSQL replication slots documentation.

If the subscriber cannot consume changes:

  • WAL retention grows indefinitely
  • Disk fills up
  • Primary database becomes unavailable
  • Production outage occurs

Logical replication makes this risk higher because DDL mismatches are common.


Root Cause

The root cause was simple:

  • A DDL change was applied to the primary
  • The same DDL was not applied to the logical subscriber
  • Logical replication stopped applying changes
  • Replication slot retained WAL indefinitely

No automatic schema sync exists in logical replication.


How the Issue Was Fixed

Step 1 – Apply missing DDL on subscriber:

ALTER TABLE transactions
ADD COLUMN reference_code TEXT;

Step 2 – Verify subscription status:

On subscriber:

SELECT * FROM pg_stat_subscription;

Subscription state can be monitored via pg_stat_subscription.

Step 3 – Confirm slot consumption resumed:

On primary:

SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

Retained WAL began decreasing.

Step 4 – Monitor pg_wal directory shrinkage.

System stabilized.


Health Check Controls That Would Have Prevented This

This incident was preventable.

A proper PostgreSQL health check should include:


1. WAL Disk Usage Monitoring

Alert when:

  • pg_wal exceeds defined threshold (e.g., 70%)
  • WAL growth rate accelerates abnormally

2. Replication Slot Retention Check

Script:

SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal_size
FROM pg_replication_slots
WHERE slot_type = 'logical';

Alert when retained WAL exceeds threshold (e.g., 10 GB in reporting workloads).

3. Subscription Health Check (Subscriber Side)

SELECT subname,
       status,
       last_msg_send_time,
       last_msg_receipt_time,
       latest_end_lsn
FROM pg_stat_subscription;

Alert if:

  • status != 'streaming'
  • last_msg_receipt_time is stale

4. Log Monitoring

Both source and target logs must be monitored for:

  • logical replication errors
  • replication worker restarts
  • slot retention warnings

Lessons Learned

  • Logical replication does not replicate DDL.
  • Active replication slots can silently retain massive WAL.
  • Replication being “active” does not mean it is healthy.
  • WAL disk alerts are critical for production systems.
  • Slot retention size must be part of every PostgreSQL health check.

Streaming vs Logical Replication: Why This Matters

Streaming (physical) replication:

  • Block-level
  • Replicates DDL automatically
  • Lower schema drift risk

Logical replication:

  • Row-level
  • Does not replicate DDL
  • Higher operational risk if schema governance is weak

Choosing logical replication requires stronger monitoring discipline.

Understanding the differences between streaming and logical replication is critical for any serious PostgreSQL health check strategy.

For a deeper breakdown, see our guide on Streaming vs Logical Replication in PostgreSQL.


Preventive Checklist

Before applying DDL on primary:

  • Verify whether table participates in logical publication
  • Apply DDL to subscriber first (or in controlled order)
  • Validate replication health after change

Continuous monitoring should include:

  • WAL disk usage
  • Replication slot retention
  • Subscription state
  • Error logs

Final Thoughts

This was not a performance issue. It was not a crash. It was not a traffic spike.

It was silent replication drift.

Without WAL monitoring and replication slot visibility, the primary database would have gone down within hours.

A proper PostgreSQL health check is not just about CPU, queries, or indexes.

It must include:

  • Replication health
  • Slot retention
  • WAL growth patterns
  • Log analysis

Because databases rarely fail instantly.

They degrade silently — until storage runs out.


This is a classic signal that should always be covered in a proper PostgreSQL health check.


Author: Fırat Güleç — Principal PostgreSQL DBA


Try pghealth Free Today 🚀

Start your journey toward a healthier PostgreSQL with pghealth.
You can explore all features immediately with a free trial — no installation required.

👉 Start Free Trial.