I consider myself a bit of an explorer. In certain things that is. I typically will always order the same food from a familiar restaurant for fear of disappointment outweighs my apprehension to try something new.
And of course, a hungry human tends to just eat right?
Yet, when it comes to technology, in particular SQL (PostgreSQL), I tend to stumble full speed (my definition of exploring) into oftentimes, unfamiliar areas, with the hope to learn. What better way to learn than experience?
So what on earth does this rambling have to do with Logical and Streaming replication in PostgreSQL?
I am a complete novice in these areas with zero knowledge. Yes, I have about as much understanding in this area of Postgres as I do in astrophysics.
Did I mention that I had zero knowledge?
Therefore, in this blog post, I will try to digest the differences in these types of replication. Without real-world hands on experience, I can’t promise you the ‘Be all end all‘ manuscript for replication.
Likely, others less-versed in this particular area (such as myself) would benefit from this blog post.
Experienced users and developers along for the ride, I hope to see you in the comments below.
A Couple of Base Definitions
In the broad sense of the term, what does Replication mean?
Replication as defined on Wiktionary has this to say:
“Process by which an object, person, place or idea may be copied mimicked or reproduced.”
Yet, the 5th listed item there is more applicable to this blog post and I feel we should look at it as well:
“(computing) The process of frequent electronic data copying a one database in one computer or server to a database in another so that all users share the same level of information. Used to improve fault tolerance of the system.”
Now there is something we can get into. The mention of copying data from one server or database to another? We are in familiar territory now…
So, adding in what we know from that definition, what are the definitions of Streaming Replication and Logical Replication?
Let’s see what the PostgreSQL Wiki has to offer:
And the PostgreSQL Documentation has this definition for Logical Replication:
“Logical replication is a method of replicating data objects and their changes, based upon their replication identity (usually a primary key). We use the term logical in contrast to physical replication, which uses exact block addresses and byte-by-byte replication.”
Chapter 19.6 Replication from the official documentation is chock-full of goodies as well so be sure and visit that source.
Below, I’ll attempt to relay the differences in layman’s terms. (Remember, if I stumble, I am a novice.) This is an extreme ‘high-level’ overview.
A “source” database creates a PUBLICATION using the CREATE PUBLICATION command. (I think of this in simple terms as the ‘sender’.)
This publisher database has the data we want to replicate. Yet, we must have something to replicate to and this is where the publisher counterpart(s) come in. The ‘subscriber’. Notice I threw in an alternative plural form because from what I have found through online searches, it is a practical setup to have multiple subscribers.
A ‘subscriber’ (could also be thought of as the replica database) creates a SUBSCRIPTION to a “source” database (publisher) defining connection information, and any publications it subscribes to.
It is possible for a subscriber to also be a publisher, creating its own PUBLICATION which other subscribers can subscribe.
What happens now?
Any data changes that happen on the publisher are sent to the subscriber. Which out of the box, is everything, but can be configured or limited to certain operations (i.e., INSERT, UPDATE or DELETE).
Suppose we update a row or multiple rows on a particular table in the publisher, those updates and changes are replicated, on the subscriber’s instance or multiple subscribers if that type of configuration is implemented.
Here are a few things to remember I felt worthy to mention:
- The publisher database wal_level configuration must be set to logical.
- Logical replication does not have any DDL (Data Definition Language) commands.
- From the Conflicts page in the documentation: “Logical replication behaves similarly to normal DML operations in that the data will be updated even if it was changed locally on the subscriber node. If incoming data violates any constraints the replication will stop. This is referred to as a conflict. When replicating UPDATE or DELETE operations, missing data will not produce a conflict and such operations will simply be skipped.”
- Publisher tables, must have a way to identify themselves (called “replica identity”) in order to properly replicate DML (UPDATE and DELETE) operations in any replica(s) for those affected rows. If the table has a primary key this is the default (seems like the sound choice to me), but in the absence of a primary key, other configuration options are available. The entire row could be used if no other candidate key exists (termed “full”), although the documentation mentions this is typically not an efficient solution. (See the REPLICA IDENTITY section in the docs for a lower-level description of how to set it)
The documentation in section 31.4. Restrictions has some key reminders on restrictions I will gloss over. Be sure and review the linked page above for exact verbiage.
- Database schema and any DDL commands are not supported in the replication. It is suggested that perhaps pg_dump could be used initially, but still, you would need to update any further changes and advances in the schema to all the replicas yourself.
- The data in sequence columns will be replicated. But, the sequence itself would only reflect the start value. For reads, that’s okay. But if this is your go-to for failover, you would need to UPDATE to the current value yourself. The docs suggest pg_dump here.
- Truncate is not yet supported.
- Large object replication is not supported.
- Views, materialized views, partition root tables, or foreign tables are not supported on neither the publisher or subscriber.
Reported Common Use Cases
- You’re only interested in certain data and data changes you actually replicate versus just replicating the entire database.
- When you need replica(s) for read-only operations, such as an analytics scenario.
- Allowing users or different subsets of users limited or monitored access to data.
- Distributing data.
- Compatibility with other PostgreSQL versions.
From researching, reading, and studying Streaming Replication, one thing I gather right up front, is choosing to set up either asynchronous (the default) or synchronous replication.
Ah, more unfamiliar terms huh?
Here is my ‘high-level’ definition of both:
With asynchronous replication, after a transaction is committed on the primary, there is a slight delay when that same transaction is committed, and written on the replica. There is potential for data loss with this type of configuration.
- One, suppose the master crashes.
- Two, the replica(s) are so far behind the master, it has discarded needed data and information for the replica(s) to even be ‘current’.
However, in synchronous replication, no transaction is considered complete, until it is confirmed by both the master and replica server(s). Which will have written a commit to both server’s WAL.
In terms I understand, this means writes on the master have also been confirmed and written on the replica.
Here is the official explanation from section 26.2.8. Synchronous Replication in the official documentation:
“When requesting synchronous replication, each commit of a write transaction will wait until confirmation is received that the commit has been written to the write-ahead log on disk of both the primary and standby server. “
Another passage from the documentation has a nice rundown of what has to be (in my opinion), a huge benefit: “The only possibility that data can be lost is if both the primary and the standby suffer crashes at the same time.”
While nothing is impossible, that’s still pretty good assurance that you won’t be left with no copy of your data.
Okay, we know we have to choose one of these setup configurations but what is the overall gist?
In a nutshell, Streaming Replication sends and applies WAL (Write Ahead Log) files from one database server (the master or primary), to a ‘replica’ (receiving database).
But there is a caveat here to keep in mind. Potentially, the WAL files from the master can be recycled before the standy has gotten them. One way to mitigate this is to increase the wal_keep_segments setting to a higher value.
Points on Streaming Replication
- By default, Streaming replication is asynchronous meaning there is a delay (maybe small) between the committed transactions on the master, and their visibility on the replica.
- Replica(s) connect to the master via a network connection.
- Be mindful of authentication. See here from the documentation: “It is very important that the access privileges for replication be set up so that only trusted users can read the WAL stream because it is easy to extract privileged information from it”
When To Use Streaming Replication
- A common use (especially in analytics) provides a ‘read-only’ replica to take the load off of the primary server.
- You need a high availability environment.
- Useful setup for failover to hot standby server should the primary go down.