blog

Advanced Failover Using Post/pre Script Hooks

Paul Namuag

Published: June 14, 2021
Last Updated: May 4, 2022

The Importance of Failover

Failover is one of the most important database practices for database governance. It’s useful not only when managing large databases in production, but also if you want to be sure that your system is always available whenever you access it – especially on the application level.

Before a failover can take place, your database instances have to meet certain requirements. These requirements are, in fact, very important for high availability. One of the requirements that your database instances have to meet is redundancy. Redundancy enables the failover to proceed, in which the redundancy is setup to have a failover candidate which can be a replica (secondary) node or from a pool of replicas acting as standby or hot-standby nodes. The candidate is selected either manually or automatically based on the most advanced or up-to-date node. Usually, you would want a hot-standby replica as it can save your database from pulling indexes from disk as a hot-standby often populates indexes into the database buffer pool.

Failover is the term used to describe that a recovery process has occurred. Prior to the recovery process, this occurs when a primary (or master) database node fails after a crash, after natural disasters, after a hardware failure, or it may have suffered a network partitioning; these are the most common cases why a failover might take place. The recovery process usually proceeds automatically and then searches for the most desired and up-to-date secondary (replica) as stated previously.

Advanced failover

Although the recovery process during a failover is automatic, there are certain occasions when it is not necessary to automate the process, and a manual process has to take over. Complexity is often the main consideration associated with the technologies comprising the whole stack of your database – automatic failover can be mixed with manual failover as well.

In most day-to-day considerations with managing databases, the majority of the concerns surrounding the automatic failover is really not trivial. It often comes up handy to implement and setup an automatic failover in case problems occur. Though that sounds promising as it covers complexities, there comes the advanced failover mechanisms and that involves “pre” events and the “post” events which are tied as hooks in a failover software or technology.

These pre and post events come up with either checks or certain actions to perform before it can finally proceed with the failover, and after a failover is done, some cleanups to make sure that failover is finally a successful one. Fortunately, there are tools available that allow, not only just Automatic Failover, but features capability to apply pre and post script hooks.

In this blog, we’ll use ClusterControl (CC) automatic failover and will explain how to use the pre and post script hooks and which cluster do they apply to.

ClusterControl Replication Failover

The ClusterControl failover mechanism is efficiently applicable over asynchronous replication which is applicable to MySQL variants (MySQL/Percona Server/MariaDB). It’s applicable to PostgreSQL/TimescaleDB clusters as well – ClusterControl supports streaming replication. MongoDB and Galera clusters have its own mechanism for automatic failover built into its own database technology. Read more about how the ClusterControl performs automatic database recovery and failover.

ClusterControl failover does not work unless the Node and Cluster recovery (Auto Recovery are enabled). That means that these buttons should be green.

The documentation states that these configuration options can be used as well to enable / disable the following:

enable_cluster_autorecovery=	If undefined, CMON defaults to 0 (false) and will NOT perform automatic recovery if it detects cluster failure. The supported value is 1 (cluster recovery is enabled) or 0 (cluster recovery is disabled).
enable_node_autorecovery=	If undefined, CMON defaults to 0 (false) and will NOT perform automatic recovery if it detects node failure. The supported value is 1 (node recovery is enabled) or 0 (node recovery is disabled).

These options, when set in /etc/cmon.d/cmon_.cnf needs a cmon restart. Therefore, you have to restart using the following command:

$ systemctl restart cmon

For this blog, we’re mainly focusing on how to use the pre/post script hooks which is essentially a great advantage for advanced replication failover.

Cluster failover replication pre/post script support

As mentioned earlier, MySQL variants that use asynchronous (including semi-synchronous) replication and streaming replication for PostgreSQL/TimescaleDB support this mechanism. ClusterControl has the following configuration options which can be used for pre and post script hooks. Basically, these configuration options can be set via their configuration files or can be set through the web UI (we’ll deal with this later).

Our documentation states that these are the following configuration options that can alter the failover mechanism by using the pre/post script hooks:

replication_pre_failover_script=	Path to the failover script on ClusterControl node. This script executes before the failover happens, but after a candidate has been elected and it is possible to continue the failover process. If the script returns non-zero it will force the failover to abort. If the script is defined but not found, the failover will be aborted. 4 arguments are supplied to the script: arg1=”All servers in the replication” arg2=”The failed master” arg3=”Selected candidate” arg4=”Slaves of old master” The arguments will be passed like this: pre_failover_script.sh “arg1” “arg2” “arg3” “arg4”. The script must be accessible on the controller and executable. Example: replication_pre_failover_script=/usr/local/bin/pre_failover_script.sh
replication_post_failover_script=	Path to the failover script on the ClusterControl node. This script executes after the failover has happened. If the script returns non-zero, a warning will be written in the job log. The script must be accessible and executable on the controller. 4 arguments are supplied to the script: arg1=”All servers in the replication” arg2=”The failed master” arg3=”Selected candidate” arg4=”Slaves of old master” The arguments will be passed like this: post_failover_script.sh “arg1” “arg2” “arg3” “arg4”. The script must be accessible on the controller and executable. Example: replication_post_failover_script=/usr/local/bin/post_failover_script.sh
replication_post_unsuccessful_failover_script=	Path to the script on the ClusterControl node. This script is executed after the failover attempt failed. If the script returns non-zero a Warning will be written in the job log. The script must be accessible on the controller and executable. 4 arguments are supplied to the script: arg1=”All servers in the replication” arg2=”The failed master” arg3=”Selected candidate” arg4=”Slaves of old master” The arguments will be passed like this: post_fail_failover_script.sh “arg1” “arg2” “arg3” “arg4”. The script must be accessible on the controller and executable. Example: replication_post_unsuccessful_failover_script=post_fail_failover_script.sh

Technically, once you set the following configuration options in your /etc/cmon.d/cmon_.cnf configuration file, it requires to restart cmon, i.e. run the following command:

$ systemctl restart cmon

Alternatively, you can also set the configuration options by going to

Hidden

State

CAPTCHA

Recommended

Migration and upgrades: achieving near zero-downtime in PostgreSQL

Comparing DevOps tooling approaches: Terraform, Ansible, Chef, Puppet, and DIY scripting

Why Cloud Repatriation Matters Now More Than Ever

Automating Day 2 operations: Scaling, upgrades and maintenance

Subscribe below to be notified of fresh posts