Severalnines Blog
The automation and management blog for open source databases

MySQL High Availability tools - Comparing MHA, MRM and ClusterControl

Krzysztof Ksiazek

We previously compared two high availability solutions for MySQL, MHA and MariaDB Replication Manager (MRM), and looked into how they performed failover. In this blog post, we’ll see how ClusterControl stacks up against these solutions. Since MariaDB Replication Manager is under active development, we decided to take a look at the not yet released version 1.1.

Flapping

All solutions provide flapping detection. MHA, by default, executes a failover only once. Even after you restart masterha_manager, it will still check whether the last failover happened too recently. If it did (by default, within the last 8 hours), no new failover will happen. You need to explicitly change the timeout or set the --ignore_last_failover flag.
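The time-window check described above can be sketched in a few lines of Python. This is only an illustration of the rule, not MHA's actual code; the function name and its parameters are our own, and the ignore_last_failover argument simply mirrors the flag mentioned above.

```python
from datetime import datetime, timedelta

# Default window described above: no new failover within 8 hours of the last one.
DEFAULT_WINDOW = timedelta(hours=8)


def failover_allowed(last_failover, now, window=DEFAULT_WINDOW,
                     ignore_last_failover=False):
    """Return True if a new failover may start (hypothetical helper)."""
    if ignore_last_failover or last_failover is None:
        # Either the check is disabled or no failover has happened yet.
        return True
    # Otherwise the previous failover must be at least `window` old.
    return now - last_failover >= window
```

A failover that happened one hour ago blocks a new one, unless the check is explicitly bypassed.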

MariaDB Replication Manager has less strict defaults - it will allow up to three failovers as long as each of them happens more than 10 seconds after the previous one. In our opinion this is a bit too flexible - if the first failover didn’t solve the problem, it is unlikely that another attempt will give better results. Still, default settings are there to be changed, so you can configure MRM however you like.
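To make the difference concrete, here is a small Python sketch of a limiter that follows the defaults described above (three failovers, each more than 10 seconds after the previous one). The class and its attribute names are hypothetical and are not taken from MRM.

```python
class FailoverLimiter:
    """Counter plus minimum interval, modelled on the defaults described above."""

    def __init__(self, max_failovers=3, min_interval=10.0):
        self.max_failovers = max_failovers
        self.min_interval = min_interval  # seconds
        self.count = 0
        self.last = None  # time of the previous failover, in seconds

    def try_failover(self, now):
        """Record and allow a failover at time `now`, or refuse it."""
        if self.count >= self.max_failovers:
            return False
        if self.last is not None and now - self.last <= self.min_interval:
            return False
        self.count += 1
        self.last = now
        return True
```

Setting max_failovers to 1 in this sketch would give behaviour closer to the single-failover approach.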

ClusterControl uses a similar approach to MHA - only one failover is attempted. The next one can happen only after the master has been successfully detected as online (for example, ClusterControl recovery or manual intervention by the admin managed to promote one of the slaves to a master) or after a restart of the cmon process.

Lost transactions

MHA can work in two modes - GTID or non-GTID. Those modes differ in how missing transactions are handled. Traditional replication is actually handled in a better way - as long as the old master is reachable, MHA connects to it and attempts to recover missing transactions from its binary logs. If you use GTID mode, this does not happen, which may lead to more significant data loss if your slaves didn’t manage to receive all relay logs - another very good reason to use semi-synchronous replication, which has you covered in this scenario.

MRM does not connect to the old master to get the logs. By default, it elects the most advanced slave and promotes it to master. Remaining slaves are slaved off this new master, making them as up to date as the new master. There is potential for data loss, on par with MHA’s GTID mode.

ClusterControl behaves similarly to MRM - it picks the most advanced slave as a master candidate and then, as long as it is safe (for example, there are no errant transactions), promotes it to become the new master. Remaining slaves get slaved off this new master. If ClusterControl detects errant transactions, it will stop the failover and alert the administrator that manual intervention is needed. It is also possible to configure ClusterControl to skip the errant transaction check and force the failover.
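The candidate selection and the errant transaction check can be illustrated with a simplified Python model. This is not ClusterControl's implementation: GTID sets are modelled as plain sets of (source_uuid, transaction_number) tuples, and we assume every legitimate transaction carries the old master's UUID, so anything else counts as errant. The function name and the force argument are our own.

```python
def plan_failover(slaves, master_uuid, force=False):
    """Pick the slave to promote, or raise if errant transactions exist.

    `slaves` maps hostname -> set of (source_uuid, transaction_number).
    """
    def received_from_master(host):
        # How many of the old master's transactions this slave has.
        return sum(1 for uuid, _ in slaves[host] if uuid == master_uuid)

    # The most advanced slave is the one with the most master transactions.
    candidate = max(slaves, key=received_from_master)

    # Simplifying assumption: a transaction with any other UUID is errant.
    errant = {host: {g for g in gtids if g[0] != master_uuid}
              for host, gtids in slaves.items()}
    errant = {host: g for host, g in errant.items() if g}

    if errant and not force:
        raise RuntimeError(
            f"errant transactions found, manual intervention needed: {errant}")
    return candidate
```

With force=True the check is skipped and the most advanced slave is returned regardless, which corresponds to forcing the failover.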

Network partitioning

For MHA, this is handled by adding a second MHA Manager node, preferably in another section of your network. You can query it using secondary_check_script, which can connect to the other MHA node and execute masterha_check_repl to see how the cluster looks from that node. This gives MHA a better view of the situation and topology, so it can skip a failover that turns out to be unnecessary.
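The idea behind a secondary check can be sketched as follows: the master is declared down only when every vantage point fails to reach it. This is a conceptual Python illustration, not MHA's script; both function names are hypothetical, and in a real setup the remote check would have to be executed on the other node rather than locally.

```python
import socket


def tcp_reachable(host, port, timeout=2.0):
    """Local vantage point: can we open a TCP connection to the master?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def master_is_down(checks):
    """`checks` holds one zero-argument callable per vantage point.

    The master counts as down only if no vantage point can see it.
    """
    checks = list(checks)
    if not checks:
        raise ValueError("at least one check is required")
    return not any(check() for check in checks)
```

If the local check fails but the second node still sees the master, master_is_down returns False and no failover is needed.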

MRM implements another approach. It can be configured to use slaves, an external MaxScale proxy or scripts executed through the HTTP protocol on a custom port (like the scripts which govern HAProxy behavior) to build a full view of the topology and then make an informed decision based on this.

ClusterControl, at this moment, does not perform any advanced checks regarding availability of the master - it uses only its own view of the system, therefore it can take action if there are network issues between the master and the ClusterControl host. Having said that, we are aware this can be a serious limitation and work is in progress to improve how ClusterControl detects a failed master - using slaves and proxies like MaxScale or ProxySQL to get a broader picture of the topology.

Roles

Within MHA you are able to apply roles to a specific host, so for instance ‘candidate_master’ and ‘no_master’ will help you determine which hosts are preferred to become master. A good example could be the data center topology: spread the candidate master nodes over multiple racks to ensure HA. Or perhaps you have a delayed slave that may never become the new master even if it is the last node remaining.
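As an illustration, the Python sketch below parses an MHA-style configuration and derives a promotion order from the two roles. The hostnames are made up, the helper is ours, and the ordering ignores every other criterion MHA applies (such as how up to date a slave is).

```python
import configparser

# Hypothetical MHA-style configuration; hostnames are invented.
MHA_STYLE_CONF = """
[server1]
hostname=db1
candidate_master=1

[server2]
hostname=db2

[server3]
hostname=db3-delayed
no_master=1
"""


def promotion_order(conf_text):
    """Return hostnames eligible for promotion, preferred candidates first."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    preferred, normal = [], []
    for section in parser.sections():
        if not section.startswith("server"):
            continue
        server = parser[section]
        if server.getboolean("no_master", fallback=False):
            continue  # this host must never become master
        if server.getboolean("candidate_master", fallback=False):
            preferred.append(server["hostname"])
        else:
            normal.append(server["hostname"])
    return preferred + normal
```

Here the delayed slave is excluded entirely, and db1 is listed ahead of db2.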

This last scenario is likely to happen with MariaDB Replication Manager as it can’t see the other nodes anymore and thus can’t determine that this node is actually, for instance, 24 hours behind. MariaDB does not support the Delayed Slave command but it is possible to use pt-slave-delay instead. There is a way to set the maximum slave delay allowed for MRM; however, MRM reads Seconds_Behind_Master from the slave status output. Since MRM is executed after the master is dead, this value will obviously be null.
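The problem with a null Seconds_Behind_Master can be shown with two hypothetical checks in Python. Neither is MRM's code; they only demonstrate why a maximum delay setting cannot be evaluated reliably once the master is gone.

```python
def naive_delay_ok(status, max_delay):
    """Treats a missing (NULL) lag as zero, so a delayed slave passes."""
    lag = status.get("Seconds_Behind_Master")
    return (lag or 0) <= max_delay


def strict_delay_ok(status, max_delay):
    """Treats a missing (NULL) lag as unknown and rejects the slave."""
    lag = status.get("Seconds_Behind_Master")
    if lag is None:
        return False
    return lag <= max_delay
```

The naive variant would accept a slave that is 24 hours behind, while the strict variant would reject every slave once the master is dead. Neither outcome is what the maximum delay setting was meant to achieve.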

At the beginning of the failover procedure, ClusterControl builds a list of slaves which can be promoted to master. Most of the time, it will contain all slaves in the topology but the user has some additional control over it. There are two variables you can set in the cmon configuration:

replication_failover_whitelist

and

replication_failover_blacklist

The whitelist contains a list of IPs or hostnames of slaves which should be used as potential master candidates. If this variable is set, only those hosts will be considered. The second variable may contain a list of hosts which will never be considered as master candidates. You can use it to list slaves that are used for backups or analytical queries. If the hardware varies between slaves, you may want to list the slaves which use slower hardware here.

replication_failover_whitelist takes precedence, meaning that replication_failover_blacklist is ignored if replication_failover_whitelist is set.
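The precedence rule is easy to express in code. The Python function below is a sketch of the filtering logic as described above, not ClusterControl's implementation; the function name and the example addresses are hypothetical.

```python
def promotable_slaves(slaves, whitelist=(), blacklist=()):
    """Filter the candidate list; a non-empty whitelist overrides the blacklist."""
    if whitelist:
        allowed = set(whitelist)
        # The blacklist is ignored entirely when a whitelist is set.
        return [s for s in slaves if s in allowed]
    blocked = set(blacklist)
    return [s for s in slaves if s not in blocked]
```

For example, with slaves 10.0.0.11, 10.0.0.12 and 10.0.0.13, blacklisting 10.0.0.13 leaves the first two as candidates, while whitelisting 10.0.0.11 leaves only that host even if it also appears in the blacklist.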

Integration

MHA is a standalone tool and doesn’t integrate well with other external software. It does, however, provide hooks (pre/post failover scripts) which can be used to do some integration - for instance, execute scripts to make changes in the configuration of an external tool. MHA also uses the read_only value to differentiate between master and slaves - this can also be used by external tools to drive topology changes. One example would be ProxySQL - MHA can work with this proxy using both pre/post failover scripts and read_only values, depending on the ProxySQL configuration. It’s worth mentioning that, in GTID mode, MHA doesn’t support MariaDB GTID - it only supports Oracle MySQL or Percona Server.

MRM integrates nicely with MaxScale - it can be used alongside MaxScale in a couple of ways. It can be set up so that MaxScale monitors the health of the nodes and executes MRM as needed to perform failovers. Another option is that MRM drives MaxScale - monitoring is done on MRM’s side and MaxScale’s configuration is updated as needed. MRM also sets read_only variables, which makes it compatible with other tools that understand those settings (like ProxySQL, for example). A direct integration with HAProxy is also available - MRM, if collocated, may modify the HAProxy configuration whenever the topology changes. On the cons side, MRM works only with MariaDB installations - it is not possible to use it with Oracle MySQL’s version of GTID.

ClusterControl uses read_only variables to differentiate between master and slave nodes. This is enough to integrate with every kind of proxy which can be deployed from ClusterControl: ProxySQL, MaxScale and HAProxy. A failover executed by ClusterControl will be detected and handled by any of those proxies. ClusterControl also integrates with external tools regarding management. It provides access to the management console for MaxScale and, to some extent, for HAProxy. Advanced support for ProxySQL will be added shortly. Metrics are provided for HAProxy and ProxySQL. ClusterControl supports both Oracle GTID and MariaDB GTID.
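The read_only pattern that these tools rely on boils down to a simple classification, sketched here in Python. The function and hostnames are hypothetical; in practice the values would come from querying each node (for example with SELECT @@global.read_only), which this sketch leaves out.

```python
def classify(read_only_by_host):
    """Split hosts into (writers, readers) based on their read_only value."""
    writers = sorted(h for h, ro in read_only_by_host.items() if not ro)
    readers = sorted(h for h, ro in read_only_by_host.items() if ro)
    return writers, readers
```

Before a failover, {"db1": False, "db2": True, "db3": True} yields db1 as the only writer. Once db2 has been promoted and its read_only flag cleared, the same classification routes writes to db2 without the proxy needing to know which tool performed the failover.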

Conclusion

If you are interested in the details of how MHA, MRM or ClusterControl handle failover, we’d like to encourage you to take a look at our previous blog posts on these tools.

Below is a summary of the differences between the different HA solutions:

| | MHA | MRM | ClusterControl |
|---|---|---|---|
| Replication support | non-GTID, Oracle GTID | MariaDB GTID | Oracle GTID and MariaDB GTID |
| Flapping | One failover allowed | Defaults are less restrictive but can be modified | One failover allowed unless it brings the master online |
| Lost transactions | Very good handling for non-GTID, no checking for transactions on master for GTID setups | No checking for transactions on master | No checking for transactions on master |
| Network partitioning | No support built in, can be added through user-created scripts | Very good false positive detection using slaves, proxy or external scripts | No support at this moment, work in progress to build false positive detection using proxy and slaves |
| Roles | Support for whitelist and blacklist of hosts to promote to master | No support | Support for whitelist and blacklist of hosts to promote to master |
| Integration | Can be integrated with external tools using hooks. Uses read_only variable to identify master and slaves which helps to integrate with other tools that understand this pattern. | Close integration with MaxScale, integration with HAProxy is also available. Uses read_only variable to identify master and slaves which helps to integrate with other tools that understand this pattern. | Can be integrated with external tools using hooks. Uses read_only variable to identify master and slaves which helps to integrate with other tools that understand this pattern. |

When it comes to handling master failure, each of the solutions does its job well and feature-wise they are mostly on par. There are some differences in almost every aspect that we compared but, in the end, each of them should handle most master failures pretty well. ClusterControl lacks more advanced network partitioning detection but this will change soon. What is important to keep in mind is that these tools support different replication methods and this alone can limit your options. If you use non-GTID replication, MHA is the only option for you. If you use GTID, MHA and MRM are restricted to, respectively, Oracle MySQL and MariaDB GTID setups. Only ClusterControl (you can test it for free) is flexible enough to handle both types of GTID under one tool - this could be very useful if you have a mixed environment and would still like to use a single tool to ensure high availability of your replication setup.