ClusterControl is a platform for monitoring and managing open source databases. It gives you a single pane of glass for understanding what is happening in the system, alerts you when something is wrong, and provides the tools to bring the cluster back to a normal state. Some functionality is not available directly in ClusterControl, though; managing the on-call rotation and working with alerts is one example. That is why ClusterControl lets you integrate with dedicated external tools and extend what you can achieve with them. One of the integrations directly supported in ClusterControl is OpsGenie, an on-call management tool. Let’s take a look at how easily we can integrate ClusterControl with it.
There are a couple of prerequisites on the OpsGenie side that we will not go through here. Basically, you need to have teams defined, with members. You may also want to have an on-call rotation in place; in general, you want OpsGenie configured to your liking. Once that is done, the next step is to set up the integration in ClusterControl.
You can do it in the “Integrations” section of the left-hand side menu. If you don’t have any existing integrations, you will see the screen exactly as in the screenshot above. Just click on “Add your first service” to proceed.
Pick OpsGenie from the list of integrations.
Then we fill in the details: the integration name, the region in which our OpsGenie setup is running, and the names of the teams that should get the notifications. We should also fill in the API key for the team, which we can create in OpsGenie; the instructions that you can see in the screenshot above are quite clear and should be enough to get this done.
In short, in the “Teams” menu we pick our team and then use “Integrations” to add a new API integration. Then we can get the API key to use with ClusterControl.
Once everything is set up in ClusterControl, we can test it by clicking on “Test”. If everything is OK, you will see a notification. Please keep in mind that you have to fill in all the fields here; we removed the API key only for the purpose of taking this screenshot.
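If the test fails, it can help to rule out ClusterControl by checking the API key against OpsGenie directly. The following is a minimal sketch, not part of ClusterControl: it assumes curl is installed, the function names and the OPSGENIE_API_KEY and OPSGENIE_REGION variables are hypothetical, and it posts to the OpsGenie Alert API (EU accounts use a separate base URL).

```shell
# Sketch: send a test alert straight to the OpsGenie Alert API.
# OPSGENIE_API_KEY and OPSGENIE_REGION are hypothetical variable names.

# Map the region chosen in the integration form to the API base URL.
opsgenie_base_url() {
  case "$1" in
    eu|EU) echo "https://api.eu.opsgenie.com" ;;
    *)     echo "https://api.opsgenie.com" ;;
  esac
}

# Build a minimal JSON body for a new, lowest-priority alert.
# The message is inserted as-is, so it must not contain double quotes.
opsgenie_test_payload() {
  printf '{"message":"%s","priority":"P5"}' "$1"
}

# Create the alert. Nothing is sent until this function is called,
# and it aborts if no API key has been exported.
opsgenie_send_test_alert() {
  curl -sS -X POST "$(opsgenie_base_url "${OPSGENIE_REGION:-us}")/v2/alerts" \
    -H "Authorization: GenieKey ${OPSGENIE_API_KEY:?set OPSGENIE_API_KEY}" \
    -H "Content-Type: application/json" \
    -d "$(opsgenie_test_payload "ClusterControl integration test")"
}
```

With the team's API key exported, calling opsgenie_send_test_alert should create a low-priority alert for that team. An authentication error at this point would suggest that the key or the region is wrong, rather than anything in ClusterControl.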
As the next step, we have to decide which alerts will be sent to OpsGenie. We can pick all or only some of the clusters defined in ClusterControl.
We can also pick the events that we want to be sent to OpsGenie based on their severity and category.
Once that is done, we can see our integration added to ClusterControl.
Now, let’s see if it actually works. To do that, we are going to simulate a master failure on our MariaDB 10.5 replication cluster by killing the mariadbd process:
root@vagrant:~# killall -9 mariadbd
root@vagrant:~# killall -9 mariadbd
mariadbd: no process found
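The second killall above reports “no process found”, which confirms that the process is gone. The same check can be scripted without sending another signal. This is a small sketch that assumes pgrep (from procps) is available; the function names are hypothetical.

```shell
# Return success (exit status 0) if a process with exactly this name exists.
is_running() {
  pgrep -x "$1" >/dev/null 2>&1
}

# Report the state of the database process, defaulting to mariadbd.
report_process() {
  if is_running "${1:-mariadbd}"; then
    echo "${1:-mariadbd} is still running"
  else
    echo "${1:-mariadbd} is not running"
  fi
}
```

Running report_process right after the kill should print that mariadbd is not running.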
Next, we’ll wait a bit and see what has been sent to OpsGenie:
As you can see in the screenshot above, multiple alerts have been opened. Some of them have been cleared, but some are still open; those are related to the failed master. We can check the state of the cluster in ClusterControl.
As you can see, the failed master is running as a node with read_only=ON and is not part of the replication topology. What we can do here is bring it back into the replication chain as a slave. This can happen automatically if you want, but by default ClusterControl lets you investigate the failed master before it is rebuilt. In this case, we will trigger the rebuild process manually.
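While investigating the failed master, the same state can be confirmed from the command line. The sketch below makes several assumptions: the mysql (or mariadb) client is installed, credentials come from a client option file such as ~/.my.cnf, and the host name is passed as the first argument. The function names are hypothetical, and the classification is a simplification rather than ClusterControl's own logic.

```shell
# Read the node's read_only flag (1 = ON, 0 = OFF).
node_read_only() {
  mysql -h "$1" -N -B -e "SELECT @@global.read_only"
}

# Count configured replication connections; 0 means the node
# is not replicating from any master.
node_replica_connections() {
  mysql -h "$1" -e "SHOW SLAVE STATUS\G" | grep -c "Master_Host:"
}

# Classify a node from the two values above.
# $1 = read_only flag, $2 = number of replication connections.
classify_node() {
  if [ "$1" = "1" ] && [ "$2" = "0" ]; then
    echo "detached (read_only=ON, not replicating)"
  elif [ "$1" = "1" ]; then
    echo "replica"
  else
    echo "writable"
  fi
}
```

For the failed master described above, calling classify_node with the output of node_read_only and node_replica_connections for that host should report it as detached, which matches what ClusterControl shows.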
You can trigger it from the Nodes Actions menu.
We picked the master node to get the data from and clicked “Proceed”.
After a bit of time, which depends on the size of the data as well as disk and network speed, the job should complete and we will see a new, clean replication topology.
In the meantime, in OpsGenie, we see that all of the alerts are now cleared and closed as the situation is back to normal.