My DBA is Sick - Disaster Planning & Backup Tips for SysAdmins

Paul Namuag

The alarming nature of COVID-19 is being felt globally. Many companies have closed or filed for bankruptcy because of the global pandemic.

Most organizations did not see this coming. Large companies have a better chance of being well prepared for this type of scenario, because a Disaster Recovery Plan (DRP) has usually been laid out in advance. At Severalnines, we advocate best practices built on conventional approaches to dealing with a crisis, especially when it comes to securing your database technology.

When things go south and nothing has been prepared (or worse, discussed but never implemented) the end result can be a hard blow to the company. These types of scenarios can’t be avoided, only planned for.

Today’s blog will walk you through some things to consider in your organization with regard to process planning and database backups for these types of scenarios.

Always Document The Work

Whenever your engineers start working on a project, it's best to document the work: the layout, the project specification, the requirements, and the underlying procedures required to run the playbooks/runbooks. The documentation must also cover a proper escalation path for when things go wrong and the actions that have to be taken.

It is also ideal to host these documents in a centralized environment or intranet (reached over a VPN when working remotely), or on the internet, as long as they are stored securely and protected by multiple layers of access control such as 2FA (Two-Factor Authentication) or MFA (Multi-Factor Authentication). The purpose is to keep these documents accessible wherever the engineer is located, so that when specific actions have to be done remotely, they can be done without physically going to the office. Otherwise, the engineer is left with unnecessary difficulty and inconvenience.

No Documentation, No Wiki - What Shall I Do?

This scenario is pure pain and is most likely being experienced at companies around the world right now. When things go awry, it's best to start by investigating the tools being used and laying out the procedures that have to be done. In the worst-case scenario, you might end up re-creating the whole thing, or resetting the parts whose investigation led nowhere.

If this happens it may make sense to reach out for some temporary database support. This way you will have experts in the technology available to talk to.

The best option in this scenario is to leverage technologies that offer a full-stack solution covering backups, cluster management, monitoring, and support. For example, here at Severalnines, we recommend that you try our software, ClusterControl.

Try to Automate the Work

Writing scripts for automation is ideal. You can leverage automation tools to manage things for you, such as Chef, Puppet, Ansible, Salt, or Terraform. In our previous blogs, we shared some how-to's you can use for these automation tools. Take a look at How to Automate Daily DevOps Database Tasks with Chef or Database Automation with Puppet: Deploying MySQL & MariaDB Galera Cluster for more info. 


Automating the workload can relieve the pressure when a DBA is unavailable. As stated earlier, it's best that procedures are written in a knowledge base platform which is accessible from anywhere.

Creating this automation can be difficult for some organizations, though, as only a limited number of engineers can handle this type of work. Your best bet is to leverage tools or solutions that offer automation without high levels of technical complexity.

For example, if your database is hosted in AWS or on another managed services platform, there are big advantages. Your team can deploy a backup schedule, set the backup policy, and set its retention; then just hit submit. Whenever your database crashes (for example in Amazon Aurora), recovery is handled automatically in the background, in most cases without end users noticing that something happened. From a business perspective, the end result is business as usual, with little or no noticeable downtime.
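To make the idea of a retention setting more concrete, here is a minimal Python sketch of the decision a retention policy makes: which backups are old enough to be removed. The function name `expired_backups`, the `keep_min` safeguard, and the backup names are all hypothetical and chosen for illustration; this is not how AWS or any other managed service implements retention.

```python
from datetime import datetime, timedelta


def expired_backups(backups, retention_days, now, keep_min=1):
    """Return the names of backups that fall outside the retention window.

    backups: dict mapping a backup name to the datetime it was taken.
    keep_min: the newest `keep_min` backups are never reported as expired,
              so a long gap in backups cannot leave you with nothing.
    """
    cutoff = now - timedelta(days=retention_days)
    newest_first = sorted(backups, key=backups.get, reverse=True)
    protected = set(newest_first[:keep_min])
    return sorted(
        name
        for name, taken_at in backups.items()
        if taken_at < cutoff and name not in protected
    )


if __name__ == "__main__":
    now = datetime(2020, 4, 30, 2, 0)
    backups = {
        "full-2020-04-01": datetime(2020, 4, 1, 2, 0),
        "full-2020-04-20": datetime(2020, 4, 20, 2, 0),
        "full-2020-04-29": datetime(2020, 4, 29, 2, 0),
    }
    # With a 14-day retention, only the April 1 backup is old enough to go.
    print(expired_backups(backups, retention_days=14, now=now))
```

The function only reports candidates and deletes nothing, which keeps the sketch safe to run and leaves the actual removal to whatever storage you use.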

Although that's a great option to take, it might not be an appropriate choice for your technology stack. In that case, there are other solutions that can automate backups for you. For example, you can take advantage of Backup Ninja, or even use ClusterControl to manage backup and restore. You can also set schedules for your backups, with support for various open-source databases.

Keep Documentation Up-To-Date

In my experience, this is a problem that you will most likely encounter when things go south. When changes have been made but go undocumented, vital tasks can go wrong. This puts you in a dilemma, because the result comes out different from what was documented.

This is scary in situations such as a lock-up, where you are in the middle of performing a backup and it causes your production database to be locked. Only afterwards (a day later) are you informed that the process had been changed, but by then the DBA's lack of documentation has already damaged your database nodes and caused problems.

When it comes to changes and keeping the process documented, there's a huge advantage in using third-party solutions. These third-party companies know how to handle the business and keep their processes in line with changes to the software.

For example, in ClusterControl (for MySQL database systems) you can choose a backup method: mysqldump, Percona XtraBackup, or Mariabackup (for MariaDB). Customers who are unsure of which to choose are free to contact support to discuss it.
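If you end up scripting a logical backup yourself, the lock-up scenario described earlier is one thing to watch for. The sketch below is one possible way to build a mysqldump command in Python; the helper name `mysqldump_command`, the host, the user, and the paths are hypothetical, and it does not represent how ClusterControl invokes mysqldump. It relies on the `--single-transaction` option, which takes a consistent snapshot of InnoDB tables without locking them.

```python
import shlex


def mysqldump_command(host, user, output_file, databases=None):
    """Build (but do not run) a mysqldump argument list.

    The password is deliberately not passed on the command line; mysqldump
    can read credentials from an option file such as ~/.my.cnf instead.
    """
    cmd = [
        "mysqldump",
        f"--host={host}",
        f"--user={user}",
        "--single-transaction",  # consistent InnoDB snapshot, no table locks
        "--routines",            # include stored procedures and functions
        "--triggers",            # include triggers for the dumped tables
        f"--result-file={output_file}",
    ]
    if databases:
        cmd += ["--databases", *databases]
    else:
        cmd.append("--all-databases")
    return cmd


if __name__ == "__main__":
    # Hypothetical host, user, and path; review the command before running it.
    cmd = mysqldump_command("db1.example.internal", "backup", "/backups/all.sql")
    print(shlex.join(cmd))
```

Note that `--single-transaction` only gives a consistent dump for transactional tables such as InnoDB; for very large datasets a physical backup tool such as Percona XtraBackup or Mariabackup is usually the better fit.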

ClusterControl is also a good example here, as it acts as a virtual DBA that handles most of the things your company’s DBA can do. There are certain actions whose results only a DBA can understand or oversee, but having an extended DBA that you can rely on 24/7 is a good way to keep the business from being impacted.

Using Third Party Applications

There are multiple options to choose from nowadays. These types of scenarios are not new, especially to companies that have invested in security and assurance so that their business can continue to flourish and avoid long downtimes caused by a lack of skills or of engineers who can do a specific task.

Relying on third-party applications that offer managed databases (including backup solutions), or using SaaS to handle backup automation and restoration, is very advantageous in these types of scenarios. This comes at a price, though, so you should carefully evaluate which offering provides the functions that your company requires.

Using a fully-managed database service such as Amazon RDS makes this easy. The backups are stored as snapshots, however, so if you need more flexibility, such as restoring a particular table, that is not supported directly. It is doable, but requires knowledge and skills regarding which tools or particular commands need to be invoked.

In the case of ClusterControl, you have easy-to-use options for managing your backups. System administrators won't have to worry about the technology and commands to use, as that's already handled by the software. For example, as the screenshot below shows, you can create a backup or schedule one with a few clicks.

You can also view the scheduled backups, along with the list of backups that completed successfully.

These are all done with a couple of clicks with no scripting or extra engineering work that has to be done. Backups can also be verified automatically, restored with PITR support, uploaded to the cloud, compressed, and encrypted to comply with security and regulatory requirements. 
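For teams that script backups by hand instead, two of the steps mentioned above, compression and a basic integrity check, can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not ClusterControl's verification mechanism: the function names and the `.sha256` sidecar file convention are hypothetical, and a checksum only detects a corrupted or truncated archive. It does not prove that the backup can actually be restored, which requires a test restore.

```python
import gzip
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compress_backup(dump_path):
    """Gzip a dump file and store its checksum next to it.

    Returns the path of the .gz archive; the checksum is written to a
    sidecar file named <archive>.sha256 (a convention made up for this sketch).
    """
    dump_path = Path(dump_path)
    archive = dump_path.with_name(dump_path.name + ".gz")
    with open(dump_path, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)
    sidecar = archive.with_name(archive.name + ".sha256")
    sidecar.write_text(sha256_of(archive))
    return archive


def verify_backup(archive):
    """Return True if the archive still matches its recorded checksum."""
    archive = Path(archive)
    sidecar = archive.with_name(archive.name + ".sha256")
    return sidecar.read_text().strip() == sha256_of(archive)
```

Calling `compress_backup("/backups/all.sql")` (a hypothetical path) would produce `all.sql.gz` and `all.sql.gz.sha256`, and `verify_backup` can then be run before an upload or a restore to catch a damaged file early.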

Not only do these features allow you to manage backups with ease, they also let you be notified in real time through the supported integrations, either by e-mail or via third-party services such as Slack, PagerDuty, ServiceNow, and many more.

Conclusion

These rough times we are experiencing show that many businesses need to change how they approach the possibility of their technical staff not being available to do their jobs. This raises the importance of taking Disaster Recovery Planning seriously and of knowing how well prepared your organization is when a crisis hits.

 