A Guide to MongoDB Backups

Art van Scheppingen

In previous posts of our MongoDB DBA series, we have covered Deployment, Configuration, Monitoring (part 1) and Monitoring (part 2). The next step is ensuring your data gets backed up safely.

Backups in MongoDB aren’t that different from MySQL backups. You have to start a copy process, ship the files to a safe place and ensure the backup is consistent. Consistency is the biggest concern, as MongoDB doesn’t feature a transaction mode that allows you to create a consistent snapshot. Fortunately, there are other ways to ensure we make a consistent backup.

In this blog post we will describe what tools are available for making backups in MongoDB and what strategies to use.

Backup a replicaSet

Now that you have your MongoDB replicaSet up and running, and have your monitoring in place, it is time for the next step: ensuring you have a backup of your data.

You should back up your data for various reasons: disaster recovery, providing data to development or analytics, or even pre-loading a new secondary node. Most people will use backups for the first two reasons.

There are two categories of backups available for MongoDB: logical and physical backups. The logical backups are basically data dumps from MongoDB, while the physical backups are copies of the data on disk.

Logical Backups

No logical backup method will make a consistent backup unless a global lock is put on the node you’re backing up. This is comparable to mysqldump with MyISAM tables. It is therefore best to make a logical backup from a secondary node, and to set the global lock there to ensure consistency.

For MongoDB there is a mysqldump equivalent: mongodump. This command line tool is shipped with every MongoDB installation and allows you to dump the contents of your MongoDB node into a BSON formatted dump file. BSON is a binary variant of JSON, which not only keeps the dump compact but also improves recovery time.
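
As an illustration of the kind of wrapping mongodump needs for automation, here is a minimal Python sketch that only assembles the command line. The helper names, the host and the archive naming scheme are assumptions made for this example. The `--archive`, `--gzip` and `--oplog` options are standard mongodump options in recent versions; `--oplog` records the oplog entries written while the dump runs and only applies to full dumps of replicaSet members.

```python
from datetime import datetime, timezone


def archive_name(prefix, now=None):
    """Build a timestamped archive file name (naming scheme is illustrative)."""
    now = now or datetime.now(timezone.utc)
    return f"{prefix}-{now:%Y%m%dT%H%M%SZ}.archive.gz"


def build_mongodump_command(host, port, archive_path, use_oplog=True):
    """Return the argument list for a compressed, single-file mongodump run."""
    cmd = [
        "mongodump",
        "--host", host,
        "--port", str(port),
        f"--archive={archive_path}",  # one archive file instead of a directory
        "--gzip",                     # compress the archive
    ]
    if use_oplog:
        # Also dump the oplog entries written during the dump.
        cmd.append("--oplog")
    return cmd


if __name__ == "__main__":
    # Hypothetical secondary node and target directory.
    target = "/backups/" + archive_name("my_mongodb_0")
    print(" ".join(build_mongodump_command("secondary1.example.com", 27017, target)))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` from a cron job or scheduler, so that a failed dump surfaces as an error instead of passing silently.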

The mongodump tool is easy to use, but due to all the command line options, it may need some wrapping to get automated backups. Open source alternatives are available, e.g., MongoDB Backup and Mongob.

MongoDB Backup is a Node.js solution that can be invoked both from the command line and through an API. Since it is a Node.js application including an API, you could quite easily embed it into chat clients or automated workflows. MongoDB Backup also allows you to stream backups, which makes offsite backups easy to manage.

Mongob is only available as a command line tool and is written in Python. Mongob offers you great flexibility by streaming to a bzip file or to another MongoDB instance. The latter is very useful if you wish to provide data copies to your development or CI environments. It can also easily copy data between collections. Incremental backups are possible as well, which can keep the size of your backups relatively small. Rate limiting is also an option, for instance if you need to send the backup over a slow(er) public network and don’t want to saturate it.

Physical Backups

For physical backups, there is no out-of-the-box solution. The options here are to use existing LVM, ZFS and EBS snapshot solutions. For LVM and ZFS, taking the snapshot will freeze the file system while it is in operation. For EBS, however, a consistent snapshot can’t be created unless writes have been stopped.

To do so, you have to fsync everything to disk and set a global lock:

my_mongodb_0:PRIMARY> use admin
switched to db admin
my_mongodb_0:PRIMARY> db.runCommand({fsync:1,lock:1});
{
    "info" : "now locked against writes, use db.fsyncUnlock() to unlock",
    "seeAlso" : "http://dochub.mongodb.org/core/fsynccommand",
    "ok" : 1
}

Don’t forget to unlock after completing the EBS snapshot:

my_mongodb_0:PRIMARY> db.fsyncUnlock()
{ "info" : "unlock completed", "ok" : 1 }
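
Because forgetting the unlock leaves the node blocked for writes, it can help to tie lock and unlock together in a script. The following is a minimal Python sketch, assuming a pymongo-style client (any object exposing `client.admin.command(...)`) and a server version that accepts `fsyncUnlock` as a command. The function name and the snapshot step are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def fsync_locked(client):
    """Flush pending writes to disk and block new writes inside the block.

    `client` is assumed to be a pymongo.MongoClient or a compatible object.
    """
    # Equivalent of db.runCommand({fsync: 1, lock: true}) in the mongo shell.
    client.admin.command("fsync", lock=True)
    try:
        yield
    finally:
        # Always release the lock, even if the snapshot step raised an error.
        client.admin.command("fsyncUnlock")


# Usage (take_ebs_snapshot is a hypothetical function you would provide):
#
#     with fsync_locked(client):
#         take_ebs_snapshot()
```

The `finally` clause is the point of the design: the node is unlocked whether the snapshot succeeds or fails.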

As MongoDB only checkpoints every 60 seconds, you will also have to include the journals in the snapshot. If these journals are not on the same disk, your snapshot may not be 100% consistent. This is similar to making an LVM snapshot of a disk that contains only the MySQL data, without the redo logs.

If you are using MongoRocks, you also have the possibility to make a physical copy of all the data using the Strata backup tool. The Strata command line tool allows you to create a full backup or incremental backup. The best part of the Strata backup is that these physical files are queryable via mongo shell. This means you can utilize physical copies of your data to load data into your data warehouse or big data systems.

Sharded MongoDB

As the sharded MongoDB cluster consists of multiple replicaSets, a config replicaSet and Shard servers, it is very difficult to make a consistent backup. As the replicaSets are decoupled from each other, it is almost impossible to snapshot everything at the same time. Ideally, a sharded MongoDB cluster would be frozen for a brief moment in time and a consistent backup taken. However, this strategy would cause global locks, which means your clients would experience downtime.

At this moment, the next best thing you can do is to make a backup of all components in the cluster at roughly the same time. If you really need consistency, you can fix this during the recovery by applying a point-in-time recovery using the oplogs. More about that in the next blog post that covers recovery.

Backup Scheduling

If possible, don’t back up the primary node. Similar to MySQL, you don’t want to stress the primary node or set locks on it. It is better to schedule the backup on a secondary node, preferably one without replication lag. Also keep in mind that once you start backing up this node, replication lag may build up due to the global locks set, so keep an eye on the replication window.
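
Choosing the secondary to back up can be scripted. The sketch below is one possible approach, not a prescribed method: it takes a dictionary shaped like the output of the `replSetGetStatus` command (the `members` array with `name`, `stateStr`, `health` and `optimeDate`) and returns the healthy secondary with the least lag behind the primary. The function name and the 10 second threshold are assumptions for illustration.

```python
from datetime import datetime, timedelta


def pick_backup_node(status, max_lag=timedelta(seconds=10)):
    """Return the name of the least-lagging healthy secondary, or None."""
    members = status["members"]
    primary = next((m for m in members if m["stateStr"] == "PRIMARY"), None)
    if primary is None:
        return None  # no primary to compare against, so lag is unknown
    candidates = []
    for m in members:
        if m["stateStr"] != "SECONDARY" or m.get("health") != 1:
            continue
        lag = primary["optimeDate"] - m["optimeDate"]
        if lag <= max_lag:
            candidates.append((lag, m["name"]))
    # min() orders by lag first, then by name as a tie-breaker.
    return min(candidates)[1] if candidates else None


if __name__ == "__main__":
    now = datetime(2020, 1, 1, 12, 0, 0)
    sample = {"members": [
        {"name": "node1:27017", "stateStr": "PRIMARY", "health": 1, "optimeDate": now},
        {"name": "node2:27017", "stateStr": "SECONDARY", "health": 1,
         "optimeDate": now - timedelta(seconds=2)},
    ]}
    print(pick_backup_node(sample))
```

With pymongo, the input could come from `client.admin.command("replSetGetStatus")`. Returning None when no secondary qualifies lets the caller decide whether to postpone the backup.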

Make sure your backup schedule makes sense. If you create incremental backups, make sure you regularly have a full backup as a starting point. A weekly full backup makes sense in this case.
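
A weekly full backup with daily incrementals can be expressed in a few lines. This Python sketch is illustrative only: the function names and the choice of Sunday as the full backup day are assumptions. It also shows which backups you would need to keep to restore to a given day.

```python
from datetime import date, timedelta


def backup_type_for(day, full_backup_weekday=6):
    """Return "full" on the chosen weekday (0=Monday .. 6=Sunday), else "incremental"."""
    return "full" if day.weekday() == full_backup_weekday else "incremental"


def restore_chain(day, full_backup_weekday=6):
    """Return the dates whose backups are needed to restore to `day`.

    The chain starts at the most recent full backup and includes every
    incremental taken since then.
    """
    days_since_full = (day.weekday() - full_backup_weekday) % 7
    start = day - timedelta(days=days_since_full)
    return [start + timedelta(days=i) for i in range(days_since_full + 1)]


if __name__ == "__main__":
    today = date.today()
    print(backup_type_for(today), [d.isoformat() for d in restore_chain(today)])
```

The restore chain makes the trade-off visible: the longer the gap between full backups, the more incrementals have to survive intact for a restore to work.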

Daily backups are fine for disaster recovery, but they won’t work that well for point-in-time recovery. MongoDB puts a timestamp on every document, and you could use that to perform a point-in-time recovery. However, if you remove all inserted or altered documents from a newer backup by using the timestamp, it won’t be an exact recovery: a document could have been updated several times, or deleted, in the intervening period.

Point-in-time recovery can only be exact if you still have the oplog of the node you wish to recover, and replay it against an older backup. It is also wise to make regular copies of the oplog to ensure you have it when needed for a point-in-time recovery, e.g., in case of a full outage of your cluster. Even better: stream the oplog to a different location.

Backup Strategies

Ensure backups are actually being made by checking them at a regular interval (daily, weekly). Make sure the size of the backups makes sense and the logs are free of errors. You could also check the integrity of a backup by extracting it and checking a couple of data points or files that need to be present. Automating this process makes your life easier.
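
One simple way to automate the size check is to compare each new backup against the recent ones. The sketch below is a rough heuristic rather than an established rule: the function name and the 50% tolerance are assumptions you would tune for your own data growth.

```python
from statistics import median


def size_looks_sane(new_size, previous_sizes, tolerance=0.5):
    """Return True if `new_size` (bytes) is within `tolerance` of the recent median."""
    if new_size <= 0:
        return False  # an empty backup is never acceptable
    if not previous_sizes:
        return True   # nothing to compare against yet
    baseline = median(previous_sizes)
    return abs(new_size - baseline) <= tolerance * baseline


if __name__ == "__main__":
    history = [90_000_000, 100_000_000, 110_000_000]
    print(size_looks_sane(105_000_000, history))
    print(size_looks_sane(10_000_000, history))
```

Using the median rather than the mean keeps a single broken backup in the history from skewing the baseline.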

Offsite Backups

There are many reasons for shipping your backups to another location. The best known reason may be (disaster) recovery, but other good reasons are keeping local copies for testing or data loading to offload the production database.

You could send your backups, for instance, to another datacenter or Amazon S3 or Glacier. To automatically ship your backups to a second location, you could use BitTorrent Sync. If you ship your backups to a less trusted location, you must store your backups encrypted.

Backup Encryption

Even if you are keeping your backups in your local datacenter, it is still good practice to encrypt them. Encrypting the backups ensures that nobody without the key will be able to read them. Backups made using Strata in particular will be partly readable without the need to start up MongoDB, but dumps made via mongodump and filesystem snapshots will be partly readable as well. So consider MongoDB backups to be insecure and always encrypt them. Storing them in a cloud makes the need for encryption even greater.
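
As one possible way to encrypt a backup file before shipping it, the following Python sketch assembles `openssl enc` command lines. The post does not prescribe a tool, so the choice of OpenSSL is an assumption, as are the function names and paths. The `-pbkdf2` option requires OpenSSL 1.1.1 or later, and this mode encrypts without authenticating the data, so keep a separate checksum if you need tamper detection.

```python
def build_encrypt_command(plain_path, encrypted_path, key_file):
    """Return the openssl argument list that encrypts `plain_path` with AES-256-CBC."""
    return [
        "openssl", "enc", "-aes-256-cbc",
        "-pbkdf2",                    # derive the key from the passphrase with PBKDF2
        "-salt",
        "-in", plain_path,
        "-out", encrypted_path,
        "-pass", f"file:{key_file}",  # read the passphrase from a file, not argv
    ]


def build_decrypt_command(encrypted_path, plain_path, key_file):
    """Return the matching decryption command (same options plus -d)."""
    return [
        "openssl", "enc", "-d", "-aes-256-cbc",
        "-pbkdf2",
        "-in", encrypted_path,
        "-out", plain_path,
        "-pass", f"file:{key_file}",
    ]


if __name__ == "__main__":
    # Hypothetical paths.
    print(" ".join(build_encrypt_command(
        "/backups/dump.archive.gz", "/backups/dump.archive.gz.enc", "/etc/backup.key")))
```

Reading the passphrase from a file keeps it out of the process list, and building the decrypt command next to the encrypt command makes it harder for the two to drift apart.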

Recovery

In addition to the health checks, also try to restore a backup on a regular (monthly) basis to verify that you can recover from it. This process includes extracting/decrypting the backup, starting up a new instance and possibly starting replication from the primary. This will give you a good indication whether your backups are in good condition. If you don’t have a disaster recovery plan yet, make one and make sure these procedures are part of it.

Conclusion

In this blog post we have explained what matters when making backups of MongoDB, and how this differs from, and resembles, backing up comparable MySQL environments. There are a couple of caveats with making backups of MongoDB, but these are easily overcome with caution, care and tooling.

In the next blog post, we will cover restoring MongoDB replicaSets from backups and how to perform a point-in-time recovery!
