Knowledge of replication is a must for anybody managing databases. It is a topic that you probably have seen over and over but never gets old. In this blog, we will review a little bit of the history of PostgreSQL’s built-in replication features and deep dive into how streaming replication works.
When talking about replication, we will be talking a lot about WALs, so let’s quickly review what write-ahead logs are.
Write-Ahead Log (WAL)
A Write-Ahead Log is a standard method for ensuring data integrity, and it is enabled by default.
The WALs are the REDO logs in PostgreSQL. But what exactly are REDO logs?
REDO logs contain all changes made in the database, and they are used for replication, recovery, online backup, and point-in-time recovery (PITR). Any changes that have not been applied to the data pages can be redone from the REDO logs.
Using WAL results in a significantly reduced number of disk writes because only the log file needs to be flushed to disk to guarantee that a transaction is committed, rather than every data file changed by the transaction.
A WAL record specifies the changes made to the data at the byte level. Each WAL record is appended to a WAL file. The insert position is described by a Log Sequence Number (LSN), a byte offset into the WAL stream that increases monotonically with each new record.
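To make the LSN concrete, here is a small sketch in Python that converts an LSN string such as "0/400001C0" (the format PostgreSQL prints) into an absolute byte offset, mimicking what the built-in pg_wal_lsn_diff() function computes. The function names here are our own, for illustration:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN string like '0/400001C0' to a byte offset.

    The part before the slash is the high 32 bits and the part after it
    is the low 32 bits of a 64-bit position in the WAL stream.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lsn_diff(a: str, b: str) -> int:
    """Byte distance between two LSNs, analogous to pg_wal_lsn_diff(a, b)."""
    return lsn_to_bytes(a) - lsn_to_bytes(b)


print(lsn_to_bytes("0/400001C0"))            # -> 1073742272
print(lsn_diff("0/400001C0", "0/40000000"))  # -> 448 bytes of WAL between them
```

This kind of arithmetic is handy when you want to turn two LSNs from monitoring views into a lag measured in bytes.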
The WALs are stored in the pg_wal directory (or pg_xlog in PostgreSQL versions < 10) under the data directory. These files have a default size of 16MB (you can change the size by altering the --with-wal-segsize configure option when building the server). They have a unique, incrementing 24-character hexadecimal name in the following format: "000000010000000000000001" (the first eight digits identify the timeline, and the rest identify the segment).
The number of WAL files contained in pg_wal will depend on the values assigned to the min_wal_size and max_wal_size parameters (or checkpoint_segments in PostgreSQL versions < 9.5) in the postgresql.conf configuration file.
One parameter that you need to set up when configuring all your PostgreSQL installations is the wal_level. The wal_level determines how much information is written to the WAL. Since PostgreSQL 9.6, the possible values are minimal, which writes only the information needed to recover from a crash or immediate shutdown; replica, which adds what is required for WAL archiving and for running read-only queries on a standby server (it replaces the older archive and hot_standby levels and is the default); and logical, which adds the information necessary to support logical decoding. Changing this parameter requires a restart, which can be hard to schedule on a running production database if you have forgotten about it.
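As an illustration, a primary meant to feed streaming replicas could carry something like the following in postgresql.conf (the values here are examples, not recommendations):

```ini
wal_level = replica       # minimal | replica | logical
max_wal_senders = 10      # maximum number of concurrent WAL sender processes
wal_keep_size = '1GB'     # extra WAL retained for standbys (wal_keep_segments before v13)
```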
History of Replication in PostgreSQL
The first replication method (warm standby) that PostgreSQL implemented (version 8.2, back in 2006) was based on the log shipping method.
This means that the WAL records are directly moved from one database server to another to be applied. We can say that it is a continuous PITR.
PostgreSQL implements file-based log shipping by transferring WAL records one file (WAL segment) at a time.
This replication implementation has a downside: if there is a major failure on the primary server, transactions not yet shipped will be lost, so there is a window for data loss. You can tune this by using the archive_timeout parameter, which can be set to as low as a few seconds; however, such a low setting will substantially increase the bandwidth required for file shipping.
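File-based log shipping relies on WAL archiving on the primary. A minimal sketch of that configuration might look like this (the /mnt/wal_archive path is a hypothetical archive location):

```ini
# postgresql.conf on the primary (illustrative values)
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f'  # %p = path of the WAL file, %f = its name
archive_timeout = 60                           # force a segment switch at least every 60s
```

In production you would typically use a more robust archive_command (one that fails if the target file already exists, or ships to remote storage) rather than a bare cp.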
We can represent this file-based log shipping method with the picture below:
Then, in version 9.0 (back in 2010), streaming replication was introduced.
Streaming replication allows you to stay more up-to-date than is possible with file-based log shipping. This works by transferring WAL records (a WAL file is composed of WAL records) on the fly (record-based log shipping) between a primary server and one or several standby servers without waiting for the WAL file to be filled.
In practice, a process called WAL receiver, running on the standby server, will connect to the primary server using a TCP/IP connection. On the primary server, another process exists, named WAL sender, which is in charge of sending the WAL records to the standby server as they are generated.
The following diagram represents streaming replication:
Looking at the above diagram, you might wonder, what happens when the communication between the WAL sender and the WAL receiver fails?
When configuring streaming replication, you have the option to enable WAL archiving.
This step is not mandatory but is extremely important for a robust replication setup. It is necessary to avoid the main server recycling old WAL files that have not yet been applied to the standby server. If this occurs, you will need to recreate the replica from scratch.
When configuring replication with continuous archiving, the standby starts from a base backup. To reach an in-sync state with the primary, it needs to apply all the changes recorded in the WAL after the backup was taken. During this process, the standby will first restore all the WAL available in the archive location (by calling restore_command). When restore_command fails because it has reached the last archived WAL segment, the standby will then look in the pg_wal directory to see if the change exists there (this avoids data loss when the primary server crashes and some changes that have already been streamed and applied on the replica have not yet been archived).
If that fails and the requested record does not exist there, it will start communicating with the primary server through streaming replication.
Whenever streaming replication fails, it will go back to step 1 and restore the records from the archive again. This loop of retries from the archive, pg_wal, and via streaming replication continues until the server stops, or failover is triggered by a trigger file.
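The standby side of this retry loop is configured with just a few settings. A minimal sketch (the host, user, and archive path are illustrative) might look like this:

```ini
# On the standby: recovery.conf before v12; postgresql.conf plus an
# empty standby.signal file from v12 onwards.
primary_conninfo = 'host=10.10.10.136 port=5432 user=cmon_replication application_name=standbyX'
restore_command = 'cp /mnt/wal_archive/%f %p'  # step 1: read from the archive
# If restore_command fails, the standby checks pg_wal, then falls back
# to streaming from the server in primary_conninfo, and loops.
```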
The following diagram represents a streaming replication configuration with continuous archiving:
Streaming replication is asynchronous by default, so at any given moment, you can have some transactions that can be committed to the primary server and not yet replicated to the standby server. This implies some potential data loss.
However, this delay between the commit and impact of the changes in the replica is supposed to be really small (some milliseconds), assuming, of course, that the replica server is powerful enough to keep up with the load.
For cases when even the risk of slight data loss is not acceptable, version 9.1 introduced the synchronous replication feature.
In synchronous replication, each commit of a write transaction waits until confirmation is received that the commit is written to the write-ahead log on disk of both the primary and standby server.
This method minimizes the possibility of data loss; for data loss to occur, you would need both the primary and the standby to fail simultaneously.
The obvious downside of this configuration is that the response time for each write transaction increases, as it needs to wait until all parties have responded. So, the time for a commit is, at minimum, the round trip between the primary and the replica. Read-only transactions will not be affected by this.
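In recent PostgreSQL versions, this durability/latency trade-off is tunable per transaction through the synchronous_commit parameter. A sketch of the available levels:

```ini
# synchronous_commit controls how much confirmation a commit waits for:
#   off          - no wait at all, not even the local WAL flush (recent commits can be lost)
#   local        - wait for the local WAL flush only
#   remote_write - wait until the standby has written (not yet flushed) the WAL
#   on           - wait until the standby has flushed the WAL to disk (the default)
#   remote_apply - wait until the standby has replayed the change (visible to its queries)
synchronous_commit = on
```

Because it can be set per session or even per transaction (SET LOCAL synchronous_commit = off;), you can reserve the full synchronous guarantee for the writes that really need it.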
To set up synchronous replication, you need to specify an application_name in the primary_conninfo setting of each standby server (in recovery.conf, or in postgresql.conf from PostgreSQL 12 onwards): primary_conninfo = '…application_name=standbyX'.
You also need to specify the list of standby servers that will take part in the synchronous replication: synchronous_standby_names = 'standbyX,standbyY'.
You can set up one or several synchronous servers, and this parameter also specifies which method (FIRST or ANY) to use when choosing synchronous standbys from the listed ones. For more information on setting up synchronous replication mode, check out this blog. It is also possible to set up synchronous replication when deploying via ClusterControl.
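For example, on the primary (the standby names here are placeholders matching the application_name values set on each standby):

```ini
# Wait for the first (highest-priority) available standby of the listed ones:
synchronous_standby_names = 'FIRST 1 (standbyX, standbyY)'

# Or wait for any two of the listed standbys - quorum commit, PostgreSQL 10+:
# synchronous_standby_names = 'ANY 2 (standbyX, standbyY, standbyZ)'
```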
After you have configured your replication and it is up and running, you will need to implement monitoring.
Monitoring PostgreSQL Replication
The pg_stat_replication view on the primary server has a lot of relevant information:
postgres=# SELECT * FROM pg_stat_replication;
-[ RECORD 1 ]----+------------------------------
pid              | 756
usesysid         | 16385
usename          | cmon_replication
application_name | pgsql_0_node_0
client_addr      | 10.10.10.137
client_hostname  |
client_port      | 36684
backend_start    | 2022-04-13 17:45:56.517518+00
backend_xmin     |
state            | streaming
sent_lsn         | 0/400001C0
write_lsn        | 0/400001C0
flush_lsn        | 0/400001C0
replay_lsn       | 0/400001C0
write_lag        |
flush_lag        |
replay_lag       |
sync_priority    | 0
sync_state       | async
reply_time       | 2022-04-13 17:53:03.454864+00
Let’s see this in detail:
- pid: Process ID of the WAL sender process.
- usesysid: OID of the user used for streaming replication.
- usename: Name of the user used for streaming replication.
- application_name: Name of the application connected to the primary.
- client_addr: Address of the standby server.
- client_hostname: Hostname of the standby server.
- client_port: TCP port on which the standby is communicating with the WAL sender.
- backend_start: Time when the standby connected to the primary.
- state: Current WAL sender state, i.e., streaming.
- sent_lsn: Last WAL location sent to the standby.
- write_lsn: Last WAL location written to disk by the standby.
- flush_lsn: Last WAL location flushed to disk by the standby.
- replay_lsn: Last WAL location replayed into the database on the standby.
- write_lag/flush_lag/replay_lag: Time elapsed between flushing WAL locally and receiving notification that the standby has written, flushed, or replayed it, respectively.
- sync_priority: Priority of the standby server for being chosen as the synchronous standby.
- sync_state: Sync state of the standby (async, potential, sync, or quorum).
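Building on the columns above, a query like the following (run on the primary) sketches per-standby replication lag in bytes, using the built-in pg_wal_lsn_diff function:

```sql
SELECT application_name,
       client_addr,
       state,
       sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn)   AS send_lag_bytes,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       write_lag, flush_lag, replay_lag
FROM pg_stat_replication;
```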
You can also see the WAL sender/receiver processes running on the servers.
- Sender (Primary Node):
# ps aux | grep postgres
postgres   727  0.0  2.2 917060 47936 ?  Ss  17:45  0:00 /usr/pgsql-14/bin/postmaster -D /var/lib/pgsql/14/data/
postgres   732  0.0  0.2 351904  5280 ?  Ss  17:45  0:00 postgres: 14/main: logger
postgres   734  0.0  0.5 917188 10560 ?  Ss  17:45  0:00 postgres: 14/main: checkpointer
postgres   735  0.0  0.4 917208  9908 ?  Ss  17:45  0:00 postgres: 14/main: background writer
postgres   736  0.0  1.0 917060 22928 ?  Ss  17:45  0:00 postgres: 14/main: walwriter
postgres   737  0.0  0.4 917748  9128 ?  Ss  17:45  0:00 postgres: 14/main: autovacuum launcher
postgres   738  0.0  0.3 917060  6320 ?  Ss  17:45  0:00 postgres: 14/main: archiver last was 00000001000000000000003F
postgres   739  0.0  0.2 354160  5340 ?  Ss  17:45  0:00 postgres: 14/main: stats collector
postgres   740  0.0  0.3 917632  6892 ?  Ss  17:45  0:00 postgres: 14/main: logical replication launcher
postgres   756  0.0  0.6 918252 13124 ?  Ss  17:45  0:00 postgres: 14/main: walsender cmon_replication 10.10.10.137(36684) streaming 0/400001C0
- Receiver (Standby Node):
# ps aux | grep postgres
postgres   727  0.0  2.2 917060 47576 ?  Ss  17:45  0:00 /usr/pgsql-14/bin/postmaster -D /var/lib/pgsql/14/data/
postgres   732  0.0  0.2 351904  5396 ?  Ss  17:45  0:00 postgres: 14/main: logger
postgres   733  0.0  0.3 917196  6360 ?  Ss  17:45  0:00 postgres: 14/main: startup recovering 000000010000000000000040
postgres   734  0.0  0.4 917060 10056 ?  Ss  17:45  0:00 postgres: 14/main: checkpointer
postgres   735  0.0  0.3 917060  6304 ?  Ss  17:45  0:00 postgres: 14/main: background writer
postgres   736  0.0  0.2 354160  5456 ?  Ss  17:45  0:00 postgres: 14/main: stats collector
postgres   737  0.0  0.6 924532 12948 ?  Ss  17:45  0:00 postgres: 14/main: walreceiver streaming 0/400001C0
One way of checking how up to date your replication is, is to compare the current WAL position on the primary server with the last position received and replayed on the standby server.
postgres=# SELECT pg_current_wal_lsn();
 pg_current_wal_lsn
--------------------
 0/400001C0
(1 row)
postgres=# SELECT pg_last_wal_receive_lsn();
 pg_last_wal_receive_lsn
-------------------------
 0/400001C0
(1 row)

postgres=# SELECT pg_last_wal_replay_lsn();
 pg_last_wal_replay_lsn
------------------------
 0/400001C0
(1 row)
You can use the following query in the standby node to get the lag in seconds:
postgres=# SELECT CASE WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0 ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp()) END AS log_delay;
 log_delay
-----------
         0
(1 row)
And you can also see the last message received:
postgres=# SELECT status, last_msg_receipt_time FROM pg_stat_wal_receiver;
  status   |    last_msg_receipt_time
-----------+------------------------------
 streaming | 2022-04-13 18:32:39.83118+00
(1 row)
Monitoring PostgreSQL Replication with ClusterControl
In the overview section, you will have the full picture of your database cluster’s current status. To see more details, you can access the dashboard section, where you will see lots of helpful information separated into different graphs.
In the topology section, you can see your current topology in a user-friendly way, and you can also perform different tasks over the nodes by using the Node Action button.
Logical Replication in PostgreSQL

Streaming replication is based on shipping WAL records and applying them to the standby server; the WAL dictates which bytes to add or change in which file. As a result, the standby server is a bit-by-bit copy of the primary server. There are, however, some well-known limitations here:
- You cannot replicate into a different version or architecture.
- You cannot change anything on the standby server.
- You do not have much granularity on what you replicate.
So, to overcome these limitations, PostgreSQL 10 added support for logical replication.
Logical replication will also use the information in the WAL file, but it will decode it into logical changes. Instead of knowing which byte has changed, it will know precisely what data has been inserted in which table.
It’s based on a “publish” and “subscribe” model with one or more subscribers subscribing to one or more publications on a publisher node that looks like this:
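A minimal sketch of that model in SQL (the publication, subscription, table, and connection names here are all hypothetical):

```sql
-- On the publisher (wal_level must be set to 'logical'):
CREATE PUBLICATION my_pub FOR TABLE customers, orders;

-- On the subscriber (the table definitions must already exist there):
CREATE SUBSCRIPTION my_sub
    CONNECTION 'host=10.10.10.136 dbname=mydb user=repuser'
    PUBLICATION my_pub;
```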
With streaming replication, you can continuously ship and apply WAL records to your standby servers, ensuring that information updated on the primary server is transferred to the standby server in real-time, allowing both to stay in sync.
If you want to learn more about logical replication in PostgreSQL, be sure to check out this overview of logical replication and this post on PostgreSQL replication best practices. For more tips and best practices for managing your open source-based database, follow us on Twitter and LinkedIn, and subscribe to our newsletter for regular updates.