Keeping Databases Up and Running in an Outage or Slowdown

Krzysztof Ksiazek

Once you see that your database environment has issues, you are in trouble. Maybe one of your replicas is out or maybe you are experiencing a significant increase in the load across your databases that impacts your application. What can you do to try and salvage the situation? Let’s take a look at some of the cases and see if we can find a solution that will help to keep at least some of the functionality running.

Scenario - One of Your Nodes is Down

Let’s consider that one of the nodes is down and thus it is not available for the application to read from. There are several implications of this and several types of issues related to this case. Let’s go step by step through them.

Node is down, remaining nodes are able to deal with the traffic

This is pretty much the ideal scenario and you should design the environment with such a scenario in mind. Your cluster should be large enough to handle the traffic even when one of your nodes is unavailable. In that case you should be fine most of the time. Obviously, you want to bring the missing node back as fast as possible, to shorten the window in which losing another node may cause a more serious disruption. Ideally you would bring back the failed node itself, as long as you can deem its data safe and sound. The advantage is that you can bring it up faster, as the data already exists on the node; if you spin up a replacement node instead, you have to provision it with data before it can join the cluster. An additional advantage is that, at least for some databases, the contents of memory buffers may be persisted on disk, and re-reading them at startup significantly shortens the warm-up phase.
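The sizing rule above can be turned into a quick sanity check. The following is a minimal sketch in Python, assuming you have a rough queries-per-second capacity figure for each node and a peak traffic figure; the function name and the numbers are hypothetical and only illustrate the arithmetic.

```python
def survives_node_loss(node_capacity_qps, peak_qps, failed_nodes=1):
    """Return True if the cluster can still serve peak_qps after losing nodes.

    node_capacity_qps: rough per-node capacity estimates (assumed inputs).
    peak_qps: the peak traffic the cluster has to serve.
    failed_nodes: how many nodes we want to be able to lose.
    """
    if failed_nodes >= len(node_capacity_qps):
        return False  # nothing would be left to serve traffic
    # Worst case: the largest nodes are the ones that fail.
    remaining = sorted(node_capacity_qps)[: len(node_capacity_qps) - failed_nodes]
    return sum(remaining) >= peak_qps


# Three nodes of ~1000 QPS each and an 1800 QPS peak tolerate one failure.
one_down_ok = survives_node_loss([1000, 1000, 1000], peak_qps=1800)
# The same cluster does not tolerate two failures at that peak.
two_down_ok = survives_node_loss([1000, 1000, 1000], peak_qps=1800, failed_nodes=2)
```

Real capacity depends on the query mix and on how warm the buffers are, so treat such a check as a starting point for capacity planning rather than a guarantee.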

This may not be obvious, but please keep in mind that you cannot just start a fresh database and add it to the load balancers on an equal footing with the other nodes. When the database starts, unless it can reload its memory structures, it starts with no data in memory. Every single query will require disk access and will be far slower than a typical query executed on a warmed-up node. If you just let the node deal with regular traffic, the most probable outcome is that queries pile up and render the new node overloaded and inaccessible. The proper way to introduce a fresh node into the cluster is to send it a very low volume of traffic at first, allowing it to warm up its buffers, and then gradually add more and more traffic until it reaches 100% of the share it should serve. If you have proper tools for that, you can even do the warm-up outside of the production cluster, using mirrored traffic or re-executing the most common types of queries based on, for example, the contents of the slow query log from the other production nodes.
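As an illustration of such a gradual ramp, here is a minimal sketch in Python that produces a schedule of load balancer weights for the new node. It assumes your load balancer accepts integer weights that you can change at runtime; the function name, the number of steps and the 1% starting point are our assumptions, not a recommendation for any specific product.

```python
def warmup_weights(target_weight, steps, start_fraction=0.01):
    """Build an increasing list of weights ending at target_weight.

    The ramp is geometric: it starts at start_fraction of the target
    and multiplies by a constant ratio at every step, so the node gets
    very little traffic while its buffers are cold.
    """
    if steps < 2:
        raise ValueError("need at least two steps to ramp up")
    if not 0 < start_fraction <= 1:
        raise ValueError("start_fraction must be in (0, 1]")
    ratio = (1 / start_fraction) ** (1 / (steps - 1))
    weights = []
    for i in range(steps):
        weight = target_weight * start_fraction * ratio ** i
        # Never go below 1, otherwise the node would get no traffic at all.
        weights.append(max(1, round(weight)))
    weights[-1] = target_weight  # guard against floating point rounding
    return weights


schedule = warmup_weights(target_weight=100, steps=5)
```

You would apply one weight from the schedule, watch query latency on the new node, and move to the next weight only once latency is comparable to the warmed-up nodes.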

Node is down, remaining nodes are overloaded

If you are new to the database world, keep in mind that this situation should not have been allowed to happen in the first place, since it is likely to cause you more issues than the first scenario. Nonetheless, there are two main ways to deal with it. Generally speaking, the aim is to bring up additional nodes. As before, if we can use the old node in a safe manner, this might be the best and the fastest option. If not, we should consider spinning up a new node as soon as possible. How to deal with the overloaded nodes in the meantime is a topic we’ll cover in the second part of this blog post. Depending on the tools in your arsenal, there are several options you can execute.

Scenario - Database Cluster is Overloaded

Your CPU utilization is going through the roof, databases are slowing down and your application starts to experience slowdowns as well. How to deal with it depends on several factors, so let’s discuss the most common cases. Keep in mind that the first thing you have to do is to understand the source of the load. This may significantly affect the way you respond to the situation.

Node is down, remaining nodes are overloaded

Here, the situation is very simple. There is no hidden source of the load; the cluster is simply degraded and does not have enough resources to handle the normal traffic. As we discussed, we’ll be adding a new node to the cluster to restore its functionality, but is there anything we can do to make the outage easier on the users? Yes and no. No, because no matter what we attempt to do, there will be some impact on the application and its functionality. Yes, because some parts of the application are always more important than others.

An initial step, something you can easily do beforehand, is to assess what the core functionality of your application is. Quite often, applications accumulate additional functions, modules and so on over time. Is there anything you can shut down that won’t impact the core functions? Let’s say that you are an e-commerce website. The core functionality is to sell products so, obviously, the store itself and the payment and order processing are the modules that have to keep on running. On the other hand, you may be able to manage without the video chat where your sales representatives help users. Maybe it will be OK to disable functionality like rating the products and writing comments and reviews. Maybe even the search functionality is not that critical. After all, in most cases users come directly to the product page from Google or some other search engine. Once you identify such “not-required-for-core-functionality” modules, you should plan how to disable them. It might be a checkbox in the admin panel, or it might be changing a couple of rows in the database. What matters is being able to do it quickly.

When the need arises, you can pull the trigger and start disabling the modules one by one, checking how it impacts the workload. This will not always be enough: the core functionality is named “core” for a reason and it will generate the main chunk of the load on the databases. It is still important to try, though. Even if you reduce the load by only 10-15%, it can make the difference between a slow site and an unavailable site.
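The plan described above, disabling non-core modules one by one until the load is acceptable, could be prepared ahead of time along these lines. This is a minimal sketch in Python under stated assumptions: the module names come from the e-commerce example, while the load estimates per module and the function name are made up for illustration.

```python
def modules_to_disable(modules, current_load, target_load):
    """Pick non-core modules to switch off, in the given priority order.

    modules: list of (name, is_core, estimated_load) tuples, ordered from
             the least important module to the most important one.
    Returns the names to disable and the projected load afterwards.
    """
    disabled = []
    projected = current_load
    for name, is_core, estimated_load in modules:
        if projected <= target_load:
            break  # enough load has been shed already
        if is_core:
            continue  # core functionality is never switched off
        disabled.append(name)
        projected -= estimated_load
    return disabled, projected


# Hypothetical load shares, in percent of total database load.
shop_modules = [
    ("video_chat", False, 5),
    ("reviews", False, 7),
    ("search", False, 10),
    ("checkout", True, 60),
]
to_disable, projected_load = modules_to_disable(shop_modules, current_load=100, target_load=90)
```

In practice the estimates will be rough, so after each module is disabled you would still check the real workload before moving on to the next one.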

Sudden increase of traffic due to the higher number of requests from the users

In many ways this situation is very similar to what we described in the previous section. Sure, all database nodes are up and running, but that is not enough to handle the load. The options we have are very similar. Obviously, you will want to spin up more nodes to deal with the traffic. Ideally, you would do it in a way that puts the least amount of load on the existing cluster. For example, instead of copying fresh data from a live node, you can use a backup to restore the new node to some point in time and then let replication catch up on the remaining data. You will use production nodes only to get the incremental changes, not the full data set, which helps to reduce the overhead. It is also a great exercise in backup restoration: you can verify the backup, you can verify the process and you can verify how long the restoration takes. This comes in very handy when planning disaster recovery procedures.
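To get a feeling for how long the catch-up phase may take after restoring from a backup, a back-of-the-envelope estimate can help. The sketch below is in Python and rests on strong simplifying assumptions: a constant write rate on the cluster, a constant apply rate on the new replica, and replication starting right after the restore finishes. All names and numbers are hypothetical.

```python
def catch_up_seconds(backup_age_s, restore_s, apply_rate, write_rate):
    """Estimate how long replication needs to catch up after a restore.

    backup_age_s: how old the backup was when the restore started (seconds).
    restore_s:    how long the restore itself takes (seconds).
    apply_rate:   changes per second the new replica can apply.
    write_rate:   changes per second the cluster keeps producing.
    """
    if apply_rate <= write_rate:
        return float("inf")  # the replica would never catch up
    # Changes accumulated between the backup and the start of replication.
    backlog = (backup_age_s + restore_s) * write_rate
    # The backlog shrinks only by the difference between the two rates.
    return backlog / (apply_rate - write_rate)


# One-hour-old backup, 30 minute restore, 1000 writes/s, 3000 applies/s.
estimate = catch_up_seconds(backup_age_s=3600, restore_s=1800, apply_rate=3000, write_rate=1000)
```

The estimate also shows why a recent backup matters: the older the backup, the larger the backlog the new node has to pull from the already busy production nodes.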

The exact steps to follow may depend on the situation, though. If you can identify the source of the load (maybe it is a part of the functionality that has been aggressively promoted?), you can shut down that part if you deem it necessary. Sometimes it’s better to waste the marketing budget on a promoted feature that is not working than to waste the marketing budget and also lose income because the whole site is not working.

Sudden increase of traffic due to the bug in the application

Another quite common case is high load caused by bugs in the application. It can be inefficient SQL that managed to pass through the review process. It can also be a logical error where some queries are executed when they should not be, or a loop that gets triggered and runs the same query over and over again. Such cases may result in unnecessary or inefficient queries spawning across the database cluster. The main challenge in this situation is to identify what is going wrong. If you can pinpoint the query that is causing the problem (and remember, it is easy to mistake a result of the problem for its cause), you are already halfway through the issue.

Then, it all depends on the options and tools that you have at your disposal. If you are using a modern load balancer, you may have an option to shape the traffic. In our case that could mean, for example, killing the offending query at the load balancer level. The application will still send the faulty query to the load balancer, but it will not be propagated to the database: the proxy layer shields the databases from the faulty load. If your load balancer allows it, you may also attempt to rewrite the query into a more efficient form.

Finally, if you do not use sophisticated proxies, you should attempt to fix the issue in the application. This will probably take more time, but it is also a solution. Sometimes the caching layer, if you happen to have one, may also act as a SQL firewall: setting a very long TTL on the cache entry related to the faulty query can work just fine and stop the query from being executed in the database.
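To make the idea of blocking or rewriting a query in the proxy layer more concrete, here is a minimal sketch in Python of how such rules could behave. It does not reproduce the rule syntax of any particular load balancer; the Rule class, the apply_rules function and the table names are hypothetical and only show the logic.

```python
import re
from dataclasses import dataclass


@dataclass
class Rule:
    pattern: re.Pattern      # which queries the rule applies to
    action: str              # "block" or "rewrite"
    replacement: str = ""    # used only by "rewrite" rules


def apply_rules(query, rules):
    """Return the query to forward to the database, or None if it is blocked."""
    for rule in rules:
        if rule.pattern.search(query):
            if rule.action == "block":
                return None  # the query never reaches the database
            if rule.action == "rewrite":
                return rule.pattern.sub(rule.replacement, query)
    return query  # no rule matched, pass the query through unchanged


# Hypothetical rules: drop a runaway query, narrow down an inefficient one.
EXAMPLE_RULES = [
    Rule(re.compile(r"^SELECT \* FROM audit_log\b", re.IGNORECASE), "block"),
    Rule(
        re.compile(r"SELECT \* FROM products", re.IGNORECASE),
        "rewrite",
        "SELECT id, name FROM products",
    ),
]

blocked = apply_rules("SELECT * FROM audit_log WHERE id = 1", EXAMPLE_RULES)
rewritten = apply_rules("SELECT * FROM products WHERE id = 7", EXAMPLE_RULES)
```

Matching on the query text is crude, and real proxies typically match on a normalized query digest instead, but the effect is the same: the application keeps sending the faulty query while the database never sees it in its original form.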

Sudden increase of traffic due to the problems with some application hosts

The final situation we’d like to discuss is a case in which erroneous traffic is generated by a subset of application hosts. There can be plenty of reasons for this, to name a few: a deployment gone wrong on a part of the application infrastructure (development code deployed on production servers - we’ve seen that in the past), or network issues that prevented some of the application nodes from connecting to the proxy layer, after which the application fell back to a direct database connection. Again, the most important bit is to understand what has happened and which part of the infrastructure is affected. Then, you may want to kill the affected application hosts for the time being (or you can kill them permanently, as you can always deploy new ones). A temporary “kill” can be implemented through firewalls. If you have an advanced proxy, you can use it as well - then you have full control over the database traffic in one place. Once you manage to bring the situation under control, you may want to rebuild your application nodes and restore their number to an optimal level.
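Finding out which application hosts are responsible is often a matter of counting queries per source. Below is a minimal sketch in Python, assuming you can extract the client host of each query from, for example, a process list or a proxy log; the function name, the host names and the threshold factor are hypothetical.

```python
from collections import Counter
from statistics import median


def noisy_hosts(query_sources, factor=3.0):
    """Return hosts that send far more queries than a typical host.

    query_sources: one entry (the client host) per observed query.
    factor: how many times above the median a host must be to be flagged.
    """
    counts = Counter(query_sources)
    if not counts:
        return []
    typical = median(counts.values())
    return sorted(host for host, n in counts.items() if n > factor * typical)


# Three well-behaved hosts and one that went wrong after a deployment.
observed = ["app1"] * 100 + ["app2"] * 110 + ["app3"] * 90 + ["app4"] * 900
suspects = noisy_hosts(observed)
```

A host flagged this way is only a suspect, so it is worth confirming what it actually runs before cutting it off at the firewall or the proxy.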

As you can see, there are many ways things can go wrong and there are different ways to solve the problems. The list you have read through is by no means exhaustive, but we hope you can see some patterns emerging here and that you have identified some tools you can use to deal with unexpected problems related to the load on your database infrastructure. One of those tools is ClusterControl, a management system for taking control of your open source database infrastructure. If you are looking for a tool that would help keep your databases up and running during an outage or slowdown, definitely consider giving it a try.
