Replication / Standby Servers

Master PostgreSQL standby server replication parameters. Learn optimal settings for hot standby, query conflicts, replication delays, and failover configuration for high availability.

hot_standby

  • What it does: Enables or disables the ability to connect to and run queries on a standby server while it's applying WAL data during recovery.
  • Why it matters: This parameter is fundamental for creating readable replicas that can serve read queries while maintaining synchronization with the primary. Without hot_standby, standby servers would be unable to accept connections until promotion to primary. Enabling this allows you to offload read queries to replicas, improving overall system performance and providing high availability.
  • Ideal value & Best Practice: Set to on for all standby servers that need to serve read queries. This should be configured in the recovery.conf or postgresql.auto.conf of the standby server. Ensure adequate hardware resources to handle both recovery and query workload.

hot_standby_feedback

  • What it does: Allows the standby server to send feedback to the primary about queries currently running on the standby, preventing vacuum cleanup of rows that might be needed.
  • Why it matters: This parameter helps avoid query conflicts on the standby by preventing the primary from removing rows that are still visible to queries on the standby. Without this feedback, long-running queries on the standby might encounter "canceling statement due to conflict with recovery" errors. However, it can cause table bloat on the primary if standbys have very long transactions.
  • Ideal value & Best Practice: Default off is recommended unless you experience frequent query conflicts. Set to on if you have long-running queries on standbys that frequently conflict with vacuum operations. Monitor table bloat on the primary when enabled.

max_standby_archive_delay

  • What it does: Sets the maximum delay before canceling queries on a hot standby when processing WAL data from archive (not streaming replication).
  • Why it matters: This parameter balances query execution on the standby with the need to apply archived WAL records. A higher value allows longer queries to complete but increases replication lag. A lower value maintains tighter synchronization but may cancel more queries.
  • Ideal value & Best Practice: Default 30s is reasonable. Increase to 60-120s for environments with longer reporting queries. Set to -1 (infinity) only if query completion is more important than replication lag, but monitor lag carefully.

max_standby_streaming_delay

  • What it does: Sets the maximum delay before canceling queries on a hot standby when processing streamed WAL data from the primary.
  • Why it matters: Controls the trade-off between query execution on replicas and replication latency for streaming replication. This is particularly important for read-heavy workloads on standbys where you need to balance query performance with replication freshness.
  • Ideal value & Best Practice: Default 30s is appropriate for most setups. Increase to 60-90s for environments with complex analytical queries on standbys. For critical reporting standbys, consider higher values while monitoring replication lag.

primary_conninfo

  • What it does: Specifies the connection string that the standby server uses to connect to the primary server for streaming replication.
  • Why it matters: This is the essential configuration that enables streaming replication between primary and standby. It includes authentication credentials, host information, and connection parameters. Proper configuration is crucial for establishing and maintaining the replication connection.
  • Ideal value & Best Practice: Use a connection string with proper authentication: 'host=primary-host user=replication password=secret port=5432 sslmode=require'. Always use SSL encryption for replication connections. Store passwords in .pgpass files rather than in the configuration for better security.

primary_slot_name

  • What it does: Specifies the name of the replication slot to use on the primary server for this standby connection.
  • Why it matters: Replication slots prevent the primary from removing WAL segments that are still needed by the standby, ensuring that standbys can always catch up even if they fall behind temporarily. This is essential for maintaining replication reliability.
  • Ideal value & Best Practice: Set to a descriptive name like 'standby1_slot'. Ensure the slot exists on the primary (SELECT * FROM pg_create_physical_replication_slot('standby1_slot');). Monitor slot usage to prevent unlimited WAL accumulation.

recovery_min_apply_delay

  • What it does: Introduces a fixed delay in applying WAL records on the standby server, creating a deliberately lagged replica.
  • Why it matters: This parameter is useful for protection against human errors (like accidental data deletion) by providing a time window to stop replication before errors are applied. It's also helpful for testing and debugging replication scenarios.
  • Ideal value & Best Practice: Default 0 (no delay). Set to values like 1h or 4h for protection against operational errors. Balance between protection needs and storage requirements, as delayed replication requires more WAL storage.

sync_replication_slots

  • What it does: Enables synchronization of logical replication slots from the primary to physical standby servers.
  • Why it matters: This parameter is essential for logical replication failover scenarios. It ensures that logical replication slots are maintained on standbys, allowing for seamless failover of logical replication consumers without losing their position in the replication stream.
  • Ideal value & Best Practice: Set to on if using logical replication with physical standbys for high availability. This ensures logical replication can continue after failover. Requires appropriate slot configuration on both primary and standby.

wal_receiver_create_temp_slot

  • What it does: Controls whether a WAL receiver should create a temporary replication slot if no permanent slot is configured.
  • Why it matters: Temporary slots can maintain replication during temporary connections without requiring permanent slot management. However, they don't provide the same guarantees as permanent slots and are removed when the connection closes.
  • Ideal value & Best Practice: Default off is recommended. Use permanent slots for production replication instead. Temporary slots might be useful for ad-hoc replication setups but lack the reliability of permanent slots.

wal_receiver_status_interval

  • What it does: Sets the maximum interval between WAL receiver status reports sent to the primary server.
  • Why it matters: This interval affects how quickly the primary learns about the standby's progress and any potential issues. Shorter intervals provide more timely feedback but generate more network traffic. Longer intervals reduce overhead but delay failure detection.
  • Ideal value & Best Practice: Default 10s is reasonable. Decrease to 5s for environments requiring faster failure detection. Increase to 30s for WAN connections with higher latency, but monitor replication health carefully.

wal_receiver_timeout

  • What it does: Sets the maximum time to wait for new data from the primary before considering the connection idle.
  • Why it matters: This timeout helps detect network issues or primary server problems that prevent WAL data from being sent. Proper tuning ensures timely detection of replication problems while avoiding false positives during periods of low write activity.
  • Ideal value & Best Practice: Default 60s is appropriate for most environments. Increase to 120s for WAN connections or during periods of known low write activity. Decrease to 30s for environments requiring rapid failure detection.

wal_retrieve_retry_interval

  • What it does: Sets the wait time before retrying to retrieve WAL data after a failed attempt.
  • Why it matters: This parameter controls how aggressively the standby retries after replication failures. Shorter intervals mean faster retries but could exacerbate network congestion. Longer intervals reduce retry frequency but increase potential replication lag after failures.
  • Ideal value & Best Practice: Default 5s is reasonable. Decrease to 1-2s for critical systems where minimal lag is important. Increase to 10-30s for unreliable network connections to avoid overwhelming the network with retries.

Try pghealth Free Today πŸš€

Start your journey toward a healthier PostgreSQL with pghealth.
You can explore all features immediately with a free trial β€” no installation required.

πŸ‘‰ Start Free Trial