Taking backups and having a recovery plan is one of the most important tasks of a DBA – they are crucial to the availability and integrity of the data. You can build replication clusters spanning multiple datacenters, and this helps you handle the failure of one or more nodes, but only a backup can help you after an accidental DELETE or DROP. In this blog post, we will discuss some backup options available for PostgreSQL.
There are two main backup types – logical and physical backups. This is quite important to understand before you design your backup strategy. The backup type also determines which tools you will be using.
Logical backups typically contain data stored in plain text, in human-readable format or it can be easily converted to such format. This is quite useful because, if needed, you can extract even a single row and restore it. On the other hand, restoring an entire logical backup can be quite slow – the database has to process all of the queries required to load the data and this will always be slower than just copying the database files.
Physical backups grab the data as it is – in the end, the data is just a bunch of files stored on some disk. You can copy the data files and have a copy of your database at a given point in time. In real life though, it is not as easy as running a cp command. When taking a physical backup you need to keep in mind that there might be database changes in the meantime. To ensure your backup can be safely used, any physical backup has to be consistent.
Physical backups are great for recovering whole data sets – the speed is limited by hardware: disk and sometimes network (if you copy the data over the network).
Backups in PostgreSQL
PostgreSQL, like many other databases, supports both physical and logical backups. Let’s take a look at how you can set them up.
Logical backup in PostgreSQL
PostgreSQL comes with a tool called pg_dump – it’s a utility designed to perform logical backups on the PostgreSQL database. It supports several output formats: by default, it stores data in plain text files, but it can also produce a tar file or a custom binary format, compressed, which can be restored using pg_restore (you can easily extract SQL contents from this format). Let’s see how we can use pg_dump to create a backup of our PostgreSQL server.
First, backup in default mode, plain text SQL:
[email protected]:~/c# pg_dump -d pgbench -U psql -h 10.0.0.101 > pgbench.sql
Password:
Then, in the output file, you can see data stored like this:
COPY pgbench_accounts (aid, bid, abalance, filler) FROM stdin;
1	1	0
2	1	0
3	1	0
4	1	0
5	1	0
6	1	0
7	1	0
8	1	0
9	1	0
10	1	0
11	1	0
12	1	0
13	1	0
14	1	0
If you feel like using some other output format, you need to specify it explicitly:
[email protected]:~# pg_dump -d pgbench -U psql -h 10.0.0.101 --format c -f /root/c/pgdump.c
Password:
Restoring data is fairly simple. If you have data in the default SQL format, just use psql to load it into the database. If you used one of the other formats, you can use pg_restore to load it. If you still have the old data and you want to recover the full database, you may want to drop the existing database and then recreate it. You can also use the following flags:
-c, --clean                  clean (drop) database objects before recreating
    --if-exists              use IF EXISTS when dropping objects
When used while dumping the data, these flags add DROP statements to the output so that existing objects are replaced when the dump is restored.
postgres=# drop database pgbench;
DROP DATABASE
postgres=# create database pgbench;
CREATE DATABASE
[email protected]:~# pg_restore --format c /root/c/pgdump.c -d pgbench -U psql -W -h 10.0.0.101
Password:
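For completeness, the plain-text format is loaded with psql rather than pg_restore. A minimal sketch of that command, assuming the host, user and dump file from the earlier examples – it is written to a file and printed rather than executed, since running it requires a live server:

```shell
# Sketch: loading a plain-text pg_dump output with psql. Host, user and
# file name follow the earlier examples and are illustrative only.
echo 'psql -d pgbench -U psql -h 10.0.0.101 -f pgbench.sql' > /tmp/restore_plain.sh
cat /tmp/restore_plain.sh
```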
Please keep in mind that pg_dump generates consistent backups – it uses a single transaction to dump the data, relying on MVCC to keep its session isolated from transactions opened after it starts.
Physical backup in PostgreSQL
Physical backups are all-or-nothing – you take a full copy of the database cluster, and you can restore all of it. The challenge is to make the backup consistent. Of course, the easiest way to do that would be to stop PostgreSQL completely and then make a copy of the data using one of the typical Unix tools like cp, scp or rsync. Unfortunately, you rarely have the luxury of shutting down your database, so you need to take some precautions to ensure your backup is consistent. Taking a manual, consistent physical backup requires the following steps:
- You need to create a checkpoint using pg_start_backup('some label')
- Copy contents of the data directory
- Stop backup using pg_stop_backup()
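The steps above can be sketched as a small script. This is a minimal, illustrative version only: the user, data directory and backup destination are assumptions, and the script is merely syntax-checked here because actually running it requires a live server.

```shell
# A minimal sketch of the three manual steps above; written to a file and
# only syntax-checked, since executing it needs a running PostgreSQL server.
cat > /tmp/manual_base_backup.sh <<'EOF'
#!/bin/sh
set -e
# 1. Create a checkpoint and enter backup mode
psql -U postgres -c "SELECT pg_start_backup('manual-backup');"
# 2. Copy the contents of the data directory (paths are assumptions)
rsync -a /var/lib/postgresql/9.6/main/ /backups/base/
# 3. Leave backup mode
psql -U postgres -c "SELECT pg_stop_backup();"
EOF
sh -n /tmp/manual_base_backup.sh && echo "syntax OK"
```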
This can be simplified by running pg_basebackup:
[email protected]:~# pg_basebackup -U psql -h 10.0.0.101 -x -D /pgbase/
If you want Point-In-Time Recovery, you need to enable WAL archiving:
wal_level = archive
archive_mode = on
You also want to define archive_command so that it copies WAL segments to a separate location.
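For example, a common archive_command pattern copies each completed segment to an archive directory, refusing to overwrite existing files; the path below is an assumption and the directory must exist and be writable by the postgres user:

```
# postgresql.conf -- illustrative archive command; /mnt/wal_archive is an
# assumed, pre-created directory writable by the postgres user
archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
```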
Restoring a base backup is easy – copy the data into an empty PostgreSQL data directory and you are all set; once you start the PostgreSQL process, your data is restored up to the time of the backup. To recover PostgreSQL to a point in time using archived WAL logs, you still restore the base backup first, but then you create a recovery.conf file with, at a minimum, restore_command defined – it tells PostgreSQL how to copy back archived WAL segments so they can be replayed when the PostgreSQL process starts.
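A minimal recovery.conf for point-in-time recovery might look like the fragment below; the archive path mirrors the archive_command example and the target timestamp is purely illustrative:

```
# recovery.conf -- minimal PITR setup; archive path and target time are examples
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2017-06-01 12:00:00'
```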
PostgreSQL Backups in ClusterControl
As of ClusterControl 1.9.3, several types of backups are supported: logical backups through pg_dump, and physical backups using pg_basebackup and PgBackRest, the latter of which also supports incremental and differential backups. Let’s take a look at how you can manage your backups with ClusterControl.
Pg_dumpall is a tool dedicated to performing logical backups in PostgreSQL. It can be used in ClusterControl to take ad-hoc backups or to create a backup schedule.
As you can see, we can pick a backup host to take the backup from, and either create a full backup or choose which databases to include. We can store the backup on the database node itself or stream it to the controller instance. Backups are stored in the directory defined in “Storage Directory”; such a directory can be an NFS mount from an external backup server. If you use cloud services (AWS, GCP, Azure) or anything compatible with the S3 protocol, you can upload the backup to object storage for long-term retention.
The backups can be compressed and encrypted – both are transparent to the user. ClusterControl will decrypt and decompress the backups for you when you want to restore them.
If you’d like to create a backup schedule, the process is very similar:
The wizard looks the same, with the main difference being the option to define the schedule itself. You can pick how often you want to run the backup. If you prefer to use cron-like syntax, you can use the Advanced option:
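As an illustration, standard cron syntax uses five fields; the schedules below are examples only:

```
# minute hour day-of-month month day-of-week
0 2 * * *      # every night at 02:00
0 3 * * 0      # every Sunday at 03:00
```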
Another stage brings some changes:
You can define whether the backup should fail over if the backup node is down. In our case, the backup node was 192.168.10.181. If that node is unavailable, ClusterControl can run the backup on another node you picked, or you can go with the “Auto Select” option and let ClusterControl pick any currently online node to take the backup from. There is also an option to verify the backup.
It allows you to define a backup verification server – a node that ClusterControl can access via SSH. ClusterControl may be instructed to provision that node with PostgreSQL; then, it will restore the backup on it and see if the database can be started properly. If yes, we assume that the backup is not corrupted. Then the backup verification server can either be decommissioned or left in the cluster for the subsequent backup verification.
The restoration of backups is, of course, also supported.
There are three main options. The first, restoring on a node, allows us to restore the backup on the production cluster. The second lets us run backup verification – quite handy if you do not want to verify every single backup execution but still want to test your backups from time to time. This option can also be used for partial restoration: instead of restoring everything to the production cluster, you can restore the whole backup to the verification server, extract the particular table or set of rows that were modified in production, and restore just those to the production servers. Finally, the third option is intended for creating staging clusters – it lets you create a new cluster and provision it using the data from the backup.
Backups that support Point In Time Recovery can be restored as such – you can define up to a second when you want to restore the backup.
ClusterControl supports two physical backup methods: pg_basebackup and PgBackRest. They are not compatible with each other, so you should pick one and stick with it. Pg_basebackup is a tool that copies the data from the database node to a separate location.
It does not allow for much flexibility – only a full backup is supported. The options available are very similar to what we saw with pg_dumpall, and this is also true for the other steps in the backup wizard, so we’ll skip those. As you can see, PITR is also supported for this backup method.
On the other hand, PgBackRest is slightly different. It is a backup solution for PostgreSQL that supports full backups and incremental and differential backups. It is a separate, quite advanced solution, and ClusterControl provides an option to install and configure it for you.
The first time you pick the PgBackRest from the dropdown, you will see the option to install the tool just by clicking the “Install” button.
Once you click it, you’ll be presented with three options to deploy PgBackRest. Details are explained on-screen; in short, you have to decide where to store the data – on the current master, on all of the nodes in the cluster, or on a dedicated repository host. You will have to pass a hostname or IP address and ensure that ClusterControl can connect to that node using SSH.
Once you pick the option, an installation process will commence.
When it’s done, the software is installed and you can start taking backups using PgBackRest.
The options are relatively limited as the majority of the settings are on the PgBackRest end. It is possible to define if the backups should be encrypted and compressed.
As for the restore, there are two options available for PgBackRest backups:
We can restore the backup on the production cluster or perform a backup verification and restore it on a separate node. As we mentioned earlier, this also allows us to perform a partial restore.
PgBackRest supports PITR with a granularity of up to one second.
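On the command-line side, a hypothetical pgBackRest point-in-time restore looks like the sketch below; the stanza name ("main") and target time are assumptions, and the command is written to a file and printed rather than executed, so pgBackRest does not need to be installed to run the snippet.

```shell
# Hypothetical pgBackRest PITR restore -- stanza name and target time are
# illustrative; printed, not executed.
cat > /tmp/pgbackrest_pitr.sh <<'EOF'
pgbackrest --stanza=main --type=time "--target=2017-06-01 12:00:00" restore
EOF
cat /tmp/pgbackrest_pitr.sh
```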
We hope you enjoyed this short blog on different backup scenarios for PostgreSQL. Hopefully, the practical tips above will help you easily set up your data backup and recovery plan. If you’d like to see how easy it is to schedule backups of your PostgreSQL servers using ClusterControl, feel free to check it out. You can read more about restoring a PostgreSQL backup from the ClusterControl CLI using the powerful S9s tool.