Master High Availability Manager (MHA) Has Crashed! What Do I Do Now?

Krzysztof Ksiazek

MySQL Replication is a very popular way of building highly available database layers. It is well known, well tested and robust. It is not without limitations, though. One of them is that it has only one “entry point”: a dedicated server in the topology, the master, is the only node in the cluster to which you can issue writes. This has severe consequences: the master is a single point of failure and, should it fail, the application cannot execute any writes. It is no surprise that much work has been put into developing tools that reduce the impact of a master loss. There are ongoing discussions about how to approach the topic and whether automated failover is better than manual failover. Ultimately this is a business decision, but should you decide to follow the automation path, you will be looking for tools to help you achieve that. One of those tools, still very popular, is MHA (Master High Availability). While it may no longer be actively maintained, it is still in stable shape, and its huge popularity still makes it the backbone of many highly available MySQL replication setups. What would happen, though, if MHA itself became unavailable? Can it become a single point of failure? Is there a way to prevent that from happening? In this blog post we will take a look at some of the scenarios.

First things first: if you plan to use MHA, make sure you use the latest version from the repository. Do not use the binary releases, as they do not contain all the fixes. The installation is fairly simple. MHA consists of two parts, manager and node. Node is deployed on your database servers. Manager is deployed on a separate host, along with node. In short: the database servers get node, the management host gets manager and node.

It is quite easy to compile MHA. Go to GitHub and clone both repositories:

https://github.com/yoshinorim/mha4mysql-manager

https://github.com/yoshinorim/mha4mysql-node

Then, in each repository, it’s all about:

perl Makefile.PL
make
make install

You may have to install some Perl dependencies if you don’t have all of the required packages installed already. In our case, on Ubuntu 16.04, we had to install the following:

perl -MCPAN -e "install Config::Tiny"
perl -MCPAN -e "install Log::Dispatch"
perl -MCPAN -e "install Parallel::ForkManager"
perl -MCPAN -e "install Module::Install"

Once you have MHA installed, you need to configure it. We will not go into details here, as there are many resources on the internet that cover this part. A sample config (definitely not a production one) may look like this:

~# cat /etc/app1.cnf
[server default]
user=cmon
password=pass
ssh_user=root
# working directory on the manager
manager_workdir=/var/log/masterha/app1
# working directory on MySQL servers
remote_workdir=/var/log/masterha/app1
[server1]
hostname=node1
candidate_master=1
[server2]
hostname=node2
candidate_master=1
[server3]
hostname=node3
no_master=1

The next step is to check that everything works and how MHA sees the replication topology:

~# masterha_check_repl --conf=/etc/app1.cnf
Tue Apr  9 08:17:04 2019 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Apr  9 08:17:04 2019 - [info] Reading application default configuration from /etc/app1.cnf..
Tue Apr  9 08:17:04 2019 - [info] Reading server configuration from /etc/app1.cnf..
Tue Apr  9 08:17:04 2019 - [info] MHA::MasterMonitor version 0.58.
Tue Apr  9 08:17:05 2019 - [error][/usr/local/share/perl/5.22.1/MHA/MasterMonitor.pm, ln427] Error happened on checking configurations. Redundant argument in sprintf at /usr/local/share/perl/5.22.1/MHA/NodeUtil.pm line 195.
Tue Apr  9 08:17:05 2019 - [error][/usr/local/share/perl/5.22.1/MHA/MasterMonitor.pm, ln525] Error happened on monitoring servers.
Tue Apr  9 08:17:05 2019 - [info] Got exit code 1 (Not master dead).

Well, it crashed. This is because MHA attempts to parse the MySQL version string and does not expect hyphens in it, while our servers report a version such as 5.7.25-28-log. Luckily, the fix is easy to find: https://github.com/yoshinorim/mha4mysql-manager/issues/116.
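To illustrate why the hyphens matter, here is a minimal sketch of reducing such a version string to its dotted numeric part before doing anything numeric with it. This is not the patch from the issue above, and mysql_numeric_version is a hypothetical helper name used only for this example.

```shell
# Hypothetical helper, not part of MHA: keep only the dotted numeric part
# of a MySQL version string by dropping everything from the first hyphen on.
mysql_numeric_version() {
    local version="$1"
    # ${version%%-*} removes the longest suffix that starts with a hyphen,
    # so 5.7.25-28-log becomes 5.7.25 and 8.0.15 stays as it is.
    printf '%s\n' "${version%%-*}"
}

# Possible use against a live server (assumes the mysql client is configured):
# mysql_numeric_version "$(mysql -N -s -e 'SELECT VERSION()')"
```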

With the fix from that issue applied, MHA is ready for work:

~# masterha_manager --conf=/etc/app1.cnf
Tue Apr  9 13:00:00 2019 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Apr  9 13:00:00 2019 - [info] Reading application default configuration from /etc/app1.cnf..
Tue Apr  9 13:00:00 2019 - [info] Reading server configuration from /etc/app1.cnf..
Tue Apr  9 13:00:00 2019 - [info] MHA::MasterMonitor version 0.58.
Tue Apr  9 13:00:01 2019 - [info] GTID failover mode = 1
Tue Apr  9 13:00:01 2019 - [info] Dead Servers:
Tue Apr  9 13:00:01 2019 - [info] Alive Servers:
Tue Apr  9 13:00:01 2019 - [info]   node1(10.0.0.141:3306)
Tue Apr  9 13:00:01 2019 - [info]   node2(10.0.0.142:3306)
Tue Apr  9 13:00:01 2019 - [info]   node3(10.0.0.143:3306)
Tue Apr  9 13:00:01 2019 - [info] Alive Slaves:
Tue Apr  9 13:00:01 2019 - [info]   node2(10.0.0.142:3306)  Version=5.7.25-28-log (oldest major version between slaves) log-bin:enabled
Tue Apr  9 13:00:01 2019 - [info]     GTID ON
Tue Apr  9 13:00:01 2019 - [info]     Replicating from 10.0.0.141(10.0.0.141:3306)
Tue Apr  9 13:00:01 2019 - [info]     Primary candidate for the new Master (candidate_master is set)
Tue Apr  9 13:00:01 2019 - [info]   node3(10.0.0.143:3306)  Version=5.7.25-28-log (oldest major version between slaves) log-bin:enabled
Tue Apr  9 13:00:01 2019 - [info]     GTID ON
Tue Apr  9 13:00:01 2019 - [info]     Replicating from 10.0.0.141(10.0.0.141:3306)
Tue Apr  9 13:00:01 2019 - [info]     Not candidate for the new Master (no_master is set)
Tue Apr  9 13:00:01 2019 - [info] Current Alive Master: node1(10.0.0.141:3306)
Tue Apr  9 13:00:01 2019 - [info] Checking slave configurations..
Tue Apr  9 13:00:01 2019 - [info] Checking replication filtering settings..
Tue Apr  9 13:00:01 2019 - [info]  binlog_do_db= , binlog_ignore_db=
Tue Apr  9 13:00:01 2019 - [info]  Replication filtering check ok.
Tue Apr  9 13:00:01 2019 - [info] GTID (with auto-pos) is supported. Skipping all SSH and Node package checking.
Tue Apr  9 13:00:01 2019 - [info] Checking SSH publickey authentication settings on the current master..
Tue Apr  9 13:00:02 2019 - [info] HealthCheck: SSH to node1 is reachable.
Tue Apr  9 13:00:02 2019 - [info]
node1(10.0.0.141:3306) (current master)
 +--node2(10.0.0.142:3306)
 +--node3(10.0.0.143:3306)

Tue Apr  9 13:00:02 2019 - [warning] master_ip_failover_script is not defined.
Tue Apr  9 13:00:02 2019 - [warning] shutdown_script is not defined.
Tue Apr  9 13:00:02 2019 - [info] Set master ping interval 3 seconds.
Tue Apr  9 13:00:02 2019 - [warning] secondary_check_script is not defined. It is highly recommended setting it to check master reachability from two or more routes.
Tue Apr  9 13:00:02 2019 - [info] Starting ping health check on node1(10.0.0.141:3306)..
Tue Apr  9 13:00:02 2019 - [info] Ping(SELECT) succeeded, waiting until MySQL doesn't respond..

As you can see, MHA is monitoring our replication topology, checking if the master node is available or not. Let’s consider a couple of scenarios.

Scenario 1 - MHA Crashed

Let’s assume MHA is not available. How does this affect the environment? Obviously, as MHA is responsible for monitoring the master’s health and triggering failover, neither will happen while MHA is down: a master crash will not be detected and failover will not be performed. The problem is that you cannot really run multiple MHA instances at the same time. Technically you can, although MHA will complain about the existing health file:

~# masterha_manager --conf=/etc/app1.cnf
Tue Apr  9 13:05:38 2019 - [warning] Global configuration file /etc/masterha_default.cnf not found. Skipping.
Tue Apr  9 13:05:38 2019 - [info] Reading application default configuration from /etc/app1.cnf..
Tue Apr  9 13:05:38 2019 - [info] Reading server configuration from /etc/app1.cnf..
Tue Apr  9 13:05:38 2019 - [info] MHA::MasterMonitor version 0.58.
Tue Apr  9 13:05:38 2019 - [warning] /var/log/masterha/app1/app1.master_status.health already exists. You might have killed manager with SIGKILL(-9), may run two or more monitoring process for the same application, or use the same working directory. Check for details, and consider setting --workdir separately.

It will start, though, and it will attempt to monitor the environment. The problem appears when both instances start to execute actions on the cluster. The worst case is when they pick different slaves as the master candidate and execute failover at the same time. MHA uses a lock file that prevents subsequent failovers from happening, but if everything happens at the same time, as it did in our tests, this safety measure is not enough.

Unfortunately, there is no built-in way of running MHA in a highly available manner. The simplest solution is to write a script that tests whether MHA is running and, if not, starts it. Such a script would have to be executed from cron or, if the one-minute granularity of cron is not enough, written as a daemon.
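A minimal sketch of such a watchdog could look like the one below. It assumes that masterha_check_status exits with 0 only while the manager is running (worth verifying against your MHA version) and that flock from util-linux is available. The function names, the log path and the lock path are made up for this example. The sketch also does not distinguish a crashed manager from one that stopped for another reason, so treat it as a starting point only.

```shell
#!/usr/bin/env bash
# Sketch of an MHA watchdog. Function names and paths are assumptions.

# Succeeds (exit 0) when masterha_check_status reports a running manager.
mha_is_running() {
    local conf="$1"
    masterha_check_status --conf="$conf" >/dev/null 2>&1
}

# Starts the manager in the background. flock -n refuses to start a second
# copy on this host while the first one still holds the lock file.
mha_start() {
    local conf="$1" log="$2" lock="$3"
    nohup flock -n "$lock" masterha_manager --conf="$conf" >>"$log" 2>&1 &
}

# Prints "running" if the manager is up, otherwise starts it and prints "started".
ensure_mha_running() {
    local conf="$1" log="$2" lock="$3"
    if mha_is_running "$conf"; then
        echo "running"
    else
        mha_start "$conf" "$log" "$lock"
        echo "started"
    fi
}

# Example call, e.g. as the last line of a script that cron runs every minute:
# ensure_mha_running /etc/app1.cnf /var/log/masterha/app1/manager.log /var/run/mha_app1.lock
```

The lock only protects against a second manager on the same host. It does nothing to stop a manager started on another machine, which is the dangerous case described above.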

Scenario 2 - MHA Manager Node Lost Network Connection to the Master

Let’s be honest, this is a really bad situation. As soon as MHA cannot connect to the master, it will attempt to perform a failover. The only exception is when secondary_check_script is defined and it has verified that the master is alive. It is up to the user to define exactly what actions MHA should perform to verify the master’s status, as it all depends on the environment and the exact setup.

Another very important script to define is master_ip_failover_script. It is executed upon failover and it should be used, among other things, to ensure that the old master will not show up again. If you happen to have additional ways of reaching and stopping the old master, that’s really great. It can be a remote management tool like Integrated Lights-Out, it can be access to manageable power sockets (where you can just power off the server), or it can be access to the cloud provider’s CLI, which makes it possible to stop the virtual instance. It is of utmost importance to stop the old master. Otherwise, once the network issue is gone, you may end up with two writeable nodes in the system, which is a perfect recipe for split brain, a condition in which data has diverged between two parts of the same cluster.
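As a rough sketch, these hooks are set in the [server default] section of the MHA configuration file. The masterha_secondary_check script ships with the MHA manager package. Using node2 and node3 as the intermediate hosts is an assumption that fits our sample topology, and both script paths are hypothetical: the scripts themselves have to be written for your environment.

```ini
[server default]
# check the master through two additional routes before declaring it dead
# (node2 and node3 as intermediate hosts are an assumption for this example)
secondary_check_script=masterha_secondary_check -s node2 -s node3
# hypothetical path: script that redirects writes away from the old master
master_ip_failover_script=/usr/local/bin/master_ip_failover
# hypothetical path: script that powers off or otherwise stops the old master
shutdown_script=/usr/local/bin/power_manager
```

With these defined, the three "is not defined" warnings shown in the masterha_manager output earlier should no longer appear.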

As you can see, MHA can handle MySQL failover pretty well. It definitely requires careful configuration, and you will have to write external scripts to kill the old master and ensure that split brain will not happen. Having said that, we would still recommend using more advanced failover management tools like Orchestrator or ClusterControl, which can perform a more thorough analysis of the replication topology state (for example, by using slaves or proxies to assess the master’s availability) and which are maintained now and will be in the future. If you are interested in learning how ClusterControl performs failover, we would like to invite you to read this blog post on the failover process in ClusterControl. You can also learn how ClusterControl interacts with ProxySQL to deliver smooth, transparent failover for your application. You can always test ClusterControl by downloading it for free.
