Automated Testing of the Upgrade Process for PXC/MariaDB Galera Cluster

Paul Namuag

Upgrading your database for Galera-based clusters such as Percona XtraDB Cluster (PXC) or MariaDB Galera Cluster can be challenging, especially for a production-based environment. You cannot afford to lose the state of your high availability and put it at risk. 

An upgrade procedure must be well documented, and ideally, documentation, rigorous testing, and benchmarking should be done before upgrades. Most importantly, security and improvements also have to be identified based on the changelog of its database version upgrade. 

With all the concerns, automation helps to achieve a more efficient upgrade process, and helps avoid human error and improves RTO.

How to Manage PXC/MariaDB Galera Cluster Upgrade Process 

Upgrading your PXC/MariaDB Galera Cluster requires proper documentation and process flow that lists the things to be done and what things to do in case things go south. That means a Business Continuity Plan which shall also cover your Disaster Recovery Plan should be laid out. You cannot afford to lose your business in case of trouble. 

The usual take is to start first with the test environment. The test environment should have the exact same settings and configuration as your production environment. You cannot proceed directly with upgrading the production environment as you aren't sure what effect and impact it will occur if things do not accord to the plan. 

Working with a production environment is highly sensitive, so in most cases, a downtime and maintenance window is always there to avoid drastic impact. 

There are two types of an upgrade for PCX or MariaDB Galera Cluster that you need to be aware of. These are the major release upgrade and the minor release upgrade or often referred to as in-place upgrade. An in-place upgrade is where you can upgrade your database version to its most recent minor version using the same binary data of your database. There will be no physical changes to the data itself, but only on its database binary or underlying software packages.

Upgrading PCX or MariaDB Galera Cluster to a Major Release

Upgrading to a major release can be challenging, especially for a production environment. It involves a complex type of database configuration and special built-in features of PXC or MariaDB Galera Cluster. Spatiotemporal, time-stamped data, machine data, or any multi-faceted data are very conservative and sensitive to upgrades. You cannot apply an in-place upgrade for this process because many major changes would have been made. Unless you have very small data or data consisting of idempotents or data that can be generated easily can be safe to do as long as you know the impact won't affect your data.

If your data volume is large, then it’s best to have the upgrade process automated. However, It might not be an ideal solution to automate the all sequence in the upgrade process because there might be unexpected issues creeping in during the major upgrade phase. It is best to automate repetitive steps and processes with known outcomes in a major upgrade. At any point, a resource is required to evaluate if the automation process is safe to avoid any halts in the upgrade process. Automated testing after the upgrade is equally important, and it should be included as a part of the post-upgrade process.

Upgrading PCX or MariaDB Galera Cluster to a Minor Release

A minor release upgrade referred to as an in-placed upgrade is usually a safer approach to perform an upgrade process. This is because, the most common changes for this release are security and exploit patches or improvements, bugs (usually severe ones), or compatibility issues that require patches especially if the current hardware or OS had changes applied that can cause also the database not to function properly. Although the impact can usually be recoverable at a minimal effect, it is still a must that you have to look and read the changelog that was pushed to the specific minor version upgrade.

Deploying the job to perform the upgrade process is an ideal example for automation. The usual flow is very repetitive and mostly causes no harm to your existing PXC or MariaDB Galera Cluster. What matters most is that after the upgrade, automated testing shall proceed to determine the setup, configuration, efficiency, and functionality are not broken.

Avoid the Fiascoes! Be ready, Have it Automated!

A client of ours reached out to us asking for assistance because, after the database minor upgrade, a feature that they are using in the database is not properly working. They asked for steps and processes on how to downgrade and how safe it will be. Their customers were complaining that their application is totally not working, generalizing that it's not useful. 

Even for such a small glitch, a pissed off customer might give a bad remark to your product. The lesson learnt from this scenario is that failing in testing after an upgrade leads to an assumption that all functions in a database are working as expected.

Suppose you have plans to automate the upgrade process, then take note that type of automation process varies to the type of upgrades you have to do. As mentioned earlier, a major upgrade versus a minor upgrade has different distinct approaches. So your automaton setup might not apply to both database software upgrades.

Automating After the Upgrade Process

At this point, it is expected that you have your upgrade process done, ideally, through automation. Now that your database is ready to receive client connections, it has to follow with a rigorous testing phase.

Run mysql_upgrade

It is very important and extremely recommended to execute mysql_upgrade once the upgrade process has completed. mysql_upgrade looks for incompatibilities with the upgraded MySQL server by doing the following things:

  • It upgrades the system tables in the mysql schema so that you can take advantage of new privileges or capabilities that might have been added.

  • It upgrades the Performance Schema and sys schema.

  • It examines user schemas.

The mysql_upgrade determines if a table has problems such as incompatibilities due to changes in the most recent version after the upgrade and attempts to resolve it by repairing the table. Otherwise, if it fails, then your automation test shall have to fail and must not proceed onto something else. It has to be investigated first and do a manual repair.

Check error logs

Once the mysql_upgrade is done, you need to check and verify for the errors that occurred. You can put this into a script and check for any "error" or "warning" labels in the error logs. It is very important to determine if there's such. Your automated test must have the ability to catch error traps either it can wait for a user input to continue if the error is just very minimal or expected, otherwise stop the upgrade process.

Perform a unit test

A TDD (Test Driven Development) database environment is a software development approach where there are a series of test cases to be validated and determine if validation is true (pass) or false (fail). Something like what we have in the screenshot below:

Image courtesy of guru99.com

This is a type of unit testing helps avoid unwanted bugs or logical errors to your application and in your database. Remember, if there are invalid data stored in the database, that would harm all the business analytics and transactions especially if it involves complex financial computation or mathematical equations. 

If you ask, is it really necessary to perform a unit test after the upgrade? Of course, it is! You don't necessarily have to run this under the production environment. During the testing phases, i.e. upgrading first your QAs, development/staging environment, it has to be applied in that area. Data has to be an exact copy at least or almost the same as its production environment. Your goal here is to avoid unwanted results and definitely wrong logical results. You have to take good care of your data of course and, determine if the results pass the validation test.

If you intend to run with your production, then do it. However, do,not be as rigid as your testing phase applied in the QA, development, or staging environment. It is because you have to plan your time based on the available maintenance window and avoid delays and longer RTO. 

In my experience, during the upgrade phase, customers select a quicker approach that shall be important to determine if such a feature provides the correct result. Moreover, you can have a script to automate the test a set of business logical functions or stored procedures since it helps to cache the queries and make your database warm.

When preparing for Unit Test for your database, avoid reinventing the wheel. Instead, take a look at the available tools you can choose if it's good for your requirements and needs. Check out Selenium, or go check out this blog.

Verify identity of queries

The most common tool you can use is Percona's pt-upgrade. It verifies that query results are identical on different servers. It executes queries based on the given logs and supplied connection (or called as DSN), then compares the results and reports any significant differences. It offers more than that as your options to collect or analyze the queries such as through tcpdump, for example.

Using the pt-upgrade is easy. For example, you can run with the following command:

## Comparing via slow log for the given hosts
pt-upgrade h=host1 h=host2 slow.log

## or use fingerprints, useful for debugging purposes
pt-upgrade --fingerprints --run-time=1h mysqld-slow.log h=127.0.0.1,P=5091 h=127.0.0.1,P=5517

## or with tcpdump,
tcpdump -i eth0 port 3306 -s 65535  -x -n -q -tttt     \
  | pt-query-digest --type tcpdump --no-report --print \
  | pt-upgrade h=host1 h=host2

It's a good practice that once an upgrade, especially a major release upgrade has been performed, pt-upgrade is used to proceed and perform query analysis identifying differences based on the results. It is a good practice to do this during the testing phase while doing it on your QAs or staging and development environment so you can decide if it's safer to proceed. You can add this to your automation tool and run this as a playbook once it's ready to perform its duty.

How to Automate the Testing Process?

In our previous blogs, we have presented different ways to automate your databases. The most common tools that are vogue are these IaC (Infrastructure as Code) deployment software tools. You can use Puppet, Chef, SaltStack, or Ansible to do the job.

My preference has always been Ansible to perform my automated testing, it allows me to create playbooks by its job role. Of course, I cannot create one whole thing automaton that will do all the things because the situation and environment varies. Based on the given upgrade types earlier (major vs minor upgrade), you should put distinction to its process. Even if it's just an in-place upgrade, you still have to make sure that your playbooks shall perform the correct job.

ClusterControl is Your Database Automation Friend!

ClusterControl is a good option to perform basic and automated testing. ClusterControl is not a framework for testing; it’s not a tool to provide unit testing. However, it's a database management and monitoring tool that incorporates a lot of automated deployments based on the requested triggers from the user or administrator of the software. 

ClusterControl offers minor version upgrades, which provides convenience to the DBAs when performing upgrades. It does mysql_upgrade on the fly as well. So you do not need to perform it manually. ClusterControl also detects new versions to be upgraded and recommend the next steps for you to do. In case of failure is encountered, the upgrade will not proceed.

Here's an example of the minor upgrade job:

If you look carefully, the mysql_upgrade runs successfully. Whilst, it does not recommend and does an automatic upgrade of the master, it is because it is not the right approach to proceed. In that case, you have to promote the new slave, then demote the master as a slave to perform the upgrade.

Conclusion

The great thing with ClusterControl is that you can incorporate checking of error logs, perform a unit test, verify identity of queries by creating Advisors. It's not difficult to do so. You can refer to our previous blog Using ClusterControl Advisor to Create Checks for SELinux and Meltdown/Spectre: Part One. This exemplifies how you can take advantage and either trigger the next job to do once the upgrade is executed. ClusterControl has built-in alerts or alarms that can integrate to your favorite third-party alert systems to notify you of your automated testing’s current status.

ClusterControl
The only management system you’ll ever need to take control of your open source database infrastructure.