
4 Major Challenges of Operating A Multi-Cloud Database

Krzysztof Ksiazek


Multi-cloud databases tend to be quite complex in terms of environment design. There are multiple aspects to consider before deciding how the environment will be built.

But you have to keep in mind that designing your setup is just one part of the life cycle. Once your design becomes an actual live environment, you also have to ensure that everything is working properly.

This part is an ongoing responsibility for the years ahead.

So, what about the day-to-day operations? How challenging is it to keep a multi-cloud setup running?

It can feel like spinning plates at times – an intricate balancing act where you need to know how and when to adjust.

There are four main challenges that come with operating a multi-cloud database. Let’s explore them one by one.

1. Networking

Managing anything that spans multiple cloud service providers (CSPs) means that you will have to deal with the network quite often. There are two main areas where you may experience some challenges.

Network stability

Ideally, network stability should have been taken care of as part of the architecture design. But if you do run into stability issues, you need to build redundancy into the inter-cloud links that you are using.

Databases do not like transient network errors and loss of connectivity. Of course, no service likes that, but databases are special here: they store data, so their availability is critical. This is why we come up with different designs intended to keep the database up and available.

Automated failover, quorum calculation, load balancing – all of this is supposed to keep the database running. The problem is that nothing is perfect – whatever measures you implement, there is almost always a situation in which they are not enough to protect the availability of your data.

As a result, while we write code to perform automated recovery of the services, and we run tests that help us feel confident it will work correctly, we would rather not give that code an opportunity to be tested against weird, random network crashes.

Instead, you should be focused on minimizing the probability of experiencing a network failure.
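How do you keep an eye on that in practice? Below is a minimal sketch, in Python, of the kind of probe you could run between your clouds: it measures TCP connect latency and failure rate against the database endpoints in each peer cloud and flags links that cross a threshold. The endpoints, ports and thresholds are placeholder assumptions, not part of any particular tool.

```python
import socket
import statistics
import time

# Hypothetical database endpoints in the peer clouds; replace with your own.
PEERS = {
    "aws-eu-west": ("10.10.1.10", 3306),
    "gcp-europe-west": ("10.20.1.10", 3306),
}
SAMPLES = 10
LATENCY_WARN_MS = 50.0   # assumed threshold, tune to your links
FAILURE_WARN_PCT = 1.0   # assumed threshold

def probe(host, port):
    """Measure TCP connect latency and failure rate for one inter-cloud peer."""
    latencies, failures = [], 0
    for _ in range(SAMPLES):
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=2):
                latencies.append((time.monotonic() - start) * 1000)
        except OSError:
            failures += 1
    failure_pct = 100.0 * failures / SAMPLES
    avg_ms = statistics.mean(latencies) if latencies else float("inf")
    return failure_pct, avg_ms

if __name__ == "__main__":
    for name, (host, port) in PEERS.items():
        failure_pct, avg_ms = probe(host, port)
        status = "WARN" if failure_pct > FAILURE_WARN_PCT or avg_ms > LATENCY_WARN_MS else "OK"
        print(f"{name}: failures={failure_pct:.0f}% avg_connect={avg_ms:.1f}ms [{status}]")
```

Run something like this from each cloud on a schedule and you will notice degrading links before your failover code has to find out the hard way.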

Network throughput

The network also has to accommodate the data flow between the CSPs. This is why you have to be constantly on the lookout for any potential problems that could either increase the data flow between the database nodes or decrease the throughput of the network connections.

Backup processes, scaling up, rebuilding failed nodes – all those processes require data to be sent over the network. In each case, you will need to decide on your approach.

For example, you need to choose whether a centralized backup server or a local backup is best. This will need to be established for every cloud that you use.

Network utilization will need to become an important part of your processes. When you provision new nodes or rebuild failed ones, those procedures should be designed with network utilization in mind.

Your script should take extreme care to reuse local data, rather than picking random nodes and transferring the full data set over the internet.
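As an illustration, here is a sketch of what network-aware donor selection could look like. It is not any particular tool's logic, just the idea: prefer a healthy replica in the same cloud and region, fall back to the same cloud, and only cross clouds as a last resort. The node metadata is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cloud: str      # e.g. "aws", "gcp"
    region: str     # e.g. "eu-west-1"
    healthy: bool

def pick_donor(target: Node, candidates: list[Node]) -> Node | None:
    """Choose the donor for a rebuild, minimizing cross-cloud data transfer."""
    healthy = [n for n in candidates if n.healthy and n.name != target.name]

    def rank(n: Node) -> int:
        # Same cloud and region first, same cloud second, anything else last.
        if n.cloud == target.cloud and n.region == target.region:
            return 0
        if n.cloud == target.cloud:
            return 1
        return 2

    return min(healthy, key=rank, default=None)

# Example: rebuild a failed AWS node, preferring a local donor over a GCP one.
nodes = [
    Node("db1", "aws", "eu-west-1", healthy=True),
    Node("db2", "aws", "eu-west-1", healthy=False),  # the node being rebuilt
    Node("db3", "gcp", "europe-west1", healthy=True),
]
print(pick_donor(nodes[1], nodes))  # -> db1, avoiding a cross-cloud transfer
```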

You will also need to think about the security measures that protect the network connection between clouds. Whatever you put in place needs to handle large data transfers. The speed of those transfers can also be impacted by the encryption method that you choose.

2. Failover and recovery

Failover and recovery is closely related to networking, but brings its own set of considerations to the table.

Network failures, eventually, will happen – it is delusional to assume otherwise. You have to be prepared for these inevitable situations.

In most cases, there will be an automated failover method in place. You may have inherited it from previous engineers, written code that performs it, or be relying on external software that provides this functionality.

It’s important to understand exactly how this process works. For example, you should be clear on how it behaves in more complex failure scenarios. After all, there are many ways in which a complex environment may experience failures. If you know your solution, you should be able to understand why the script behaved the way it did and how to improve its behavior in the future.

You also need to know whether it’s quorum-aware or not, and if there is any possibility of a network split. Ultimately, it falls on you to determine if the failover process for your multi-cloud database is safe.
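To make the quorum part concrete, here is a deliberately simplified sketch of the check a failover script might perform before promoting a new primary. It only proceeds when the partition it can see contains a strict majority of all nodes, so two sides of a network split can never both promote. The node names are placeholders.

```python
# All database nodes in the multi-cloud cluster (placeholder names).
ALL_NODES = {"aws-db1", "aws-db2", "gcp-db1", "gcp-db2", "azure-db1"}

def has_quorum(reachable: set[str], all_nodes: set[str] = ALL_NODES) -> bool:
    """Strict majority: more than half of all nodes must be reachable."""
    return len(reachable & all_nodes) > len(all_nodes) // 2

def maybe_promote(candidate: str, reachable: set[str]) -> bool:
    if not has_quorum(reachable):
        print(f"No quorum ({len(reachable)}/{len(ALL_NODES)} visible), refusing to promote {candidate}")
        return False
    print(f"Quorum present, promoting {candidate}")
    return True

# A network split between clouds: the AWS side sees only 2 of 5 nodes.
maybe_promote("aws-db1", reachable={"aws-db1", "aws-db2"})                  # refused
maybe_promote("gcp-db1", reachable={"gcp-db1", "gcp-db2", "azure-db1"})     # allowed
```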

The ability to understand and improve the failover handling is critical when it comes to multi-cloud environments.

Automated failover is usually one of the most complex processes in database management. Even if a database performs automatic failover on its own, that only means the complexity is hidden under the hood and not exposed to the database administrator to its full extent.

3. Data size

When operating a multi-cloud database, the size of the data you’re dealing with can become a problem in itself.

In most cases, data is collected and stored in the database. As more and more data is collected and stored, the database grows in size. This makes the challenge constant and, in fact, never-ending.

Two areas affected by this continuous growth in data size are the network itself and your backup and recovery processes.

Network

So, you start with your initial design and plan how the network should look. You determine what kind of throughput is required. Then you start operating the database.

Over time, you begin to see that every operation involving a data transfer (provisioning of a new node, rebuilding of a failed node, running backups) gets slower and slower. The more data you have to transfer, the longer the transfer over the network will take.

Then, the challenge becomes identifying the correct point in time to make changes and increase the network capacity. Otherwise, your Recovery Time Objective (RTO) will be impacted.
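A quick back-of-the-envelope calculation helps you spot that point before the RTO is blown. The sketch below estimates how long a full data transfer would take over the inter-cloud link at a given effective throughput; all numbers are illustrative assumptions.

```python
def transfer_hours(data_gb: float, link_mbps: float, efficiency: float = 0.7) -> float:
    """Rough transfer time: effective throughput accounts for protocol overhead
    and the link being shared with other traffic."""
    effective_mbps = link_mbps * efficiency
    return (data_gb * 8 * 1024) / effective_mbps / 3600  # GB -> megabits -> hours

RTO_HOURS = 4.0     # assumed recovery time objective
LINK_MBPS = 1000    # assumed 1 Gbps inter-cloud link

for data_gb in (500, 2000, 5000):
    hours = transfer_hours(data_gb, LINK_MBPS)
    verdict = "OK" if hours <= RTO_HOURS else "exceeds RTO, grow the link or shrink the data"
    print(f"{data_gb} GB -> {hours:.1f} h ({verdict})")
```

When the projected transfer time creeps toward your RTO, that is your cue to upgrade the link or rethink how much data each node has to hold.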

This cycle repeats itself the more your data grows.

Backup and recovery

The network is not the only element of the environment that is impacted by data size. Your backup and recovery processes are also affected.

The reason is fairly simple – the more data you have, the longer it takes to complete the backup.

At some point, your RTO will be impacted and you have to start considering other options.

One of the solutions, quite in line with a multi-cloud approach, is to shard the data to make each data set smaller and keep the data close to the user. For example, you can use geographical location as the shard key and store the data in the datacenter closest to the region being served.

This also helps to speed up the backup process, as the total data size is split into several parts.
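As a minimal illustration of the idea (with placeholder region names and shard endpoints), the shard key can be as simple as a region-to-shard mapping:

```python
# Geography-based sharding: each shard lives in the datacenter closest to the
# users it serves. Region names and endpoints are placeholders.
SHARD_MAP = {
    "eu": "db-shard-eu.example.internal",      # hosted in a European datacenter
    "us": "db-shard-us.example.internal",      # hosted in a US datacenter
    "apac": "db-shard-apac.example.internal",  # hosted in an APAC datacenter
}

def shard_for(region: str) -> str:
    return SHARD_MAP.get(region, SHARD_MAP["us"])  # arbitrary default shard

print(shard_for("eu"))  # writes and backups for EU users stay in the EU shard
```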

However, this presents a new challenge. If you shard the data, you need a way to ensure a consistent backup is possible – even when the data set is distributed across multiple shards. This is crucial if, for some reason, you would need to perform a restore of a full data set.

In that scenario, you want to start from scratch using the data from one particular point in time across all the shards. In some cases there are readily available backup tools that can be leveraged, but for most datastores you have to figure out a solution on your own.

This may involve monitoring the transaction logs (oplog, WAL, binary log) to track the state of the database and apply transactions up to a particular point in time – the same point across all of the shards.
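One way to approach it, sketched below, is to capture a small manifest at backup time that records each shard's current log position plus a common cut-off timestamp, so every shard can later be restored and rolled forward to the same moment. The position-fetching function is left abstract because it is datastore-specific (for example, SHOW MASTER STATUS on MySQL or pg_current_wal_lsn() on PostgreSQL); the shard endpoints are placeholders.

```python
import json
import time
from typing import Callable

def capture_restore_point(
    shards: dict[str, str],
    fetch_log_position: Callable[[str], str],
    manifest_path: str = "restore_point.json",
) -> dict:
    """Record one common cut-off timestamp and each shard's log position."""
    manifest = {
        "cutoff_unix_ts": time.time(),
        "shards": {name: fetch_log_position(endpoint) for name, endpoint in shards.items()},
    }
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

if __name__ == "__main__":
    shards = {"shard-eu": "db-eu:3306", "shard-us": "db-us:3306"}
    # Stand-in for a real query against each shard; replace with your datastore's call.
    fake_fetch = lambda endpoint: f"log-position-of-{endpoint}"
    print(capture_restore_point(shards, fake_fetch))
```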

Another challenge would be to determine where to store the backups. To stay on the safe side, the data should be stored in several locations to make sure it’s still available even if one of the data centers experiences a problem.

This idea, while sound, requires you to transfer the data between the cloud providers, over the network – which will have limited capacity.

4. Security

The security of the data is a critical aspect of day-to-day database operations in almost every environment. While it’s not trivial to ensure security within a single cloud provider, operating a multi-cloud database poses a significantly higher risk. That’s because data has to be protected both at rest and in transit.

With a multi-cloud database, there is a far greater amount of data that has to be in transit at any given time. We are talking about replicating the data across multiple cloud providers over the internet. This is not a network link within a single VPC, it’s an open network.

This network link between cloud providers has to be secured.

A VPN should be established, but you need to pick the right solution. The open source world offers multiple ways to secure a network. Will you use software? Hardware? How expensive will it be?

Keep in mind that network utilization might be quite high, and that could limit your options.

If you are considering an open source solution, for example, will it be efficient enough to transfer your data at a speed that matches your RTO? What kind of hardware is needed to achieve the required network throughput? What if the required throughput increases due to growth in data size? What are your options to compensate for that?

Another option is to rely on SSL encryption for replication and frontend-backend connections to provide security. What about the other processes? Backup? How would you transfer the backup data?

You could use SSH tunneling. But are you sure that you have SSL termination properly configured? Are you sure that, if some part of the database becomes unavailable, your load balancers will reconnect to the correct backend nodes while still using an SSL connection?
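For the backup traffic itself, one possible approach is sketched below: open an SSH local port forward to the backup server in the other cloud, then point the backup stream at the local end of the tunnel so the data travels encrypted end to end. The hosts and ports are placeholders, and key-based SSH access is assumed to be already in place.

```python
import subprocess
import time

BASTION = "ops@bastion.other-cloud.example"     # hypothetical jump host
REMOTE_BACKUP_HOST = "backup-server.internal"   # only reachable from the bastion
LOCAL_PORT, REMOTE_PORT = 9999, 9999

# -N: run no remote command, -L: forward localhost:9999 to the backup server.
tunnel = subprocess.Popen(
    ["ssh", "-N", "-L", f"{LOCAL_PORT}:{REMOTE_BACKUP_HOST}:{REMOTE_PORT}", BASTION]
)
time.sleep(3)  # crude wait for the tunnel; a real script would poll the port

try:
    # Any backup stream pointed at localhost now travels inside the tunnel,
    # e.g. a streaming backup piped to a receiver listening on the remote port.
    print(f"Tunnel up: send the backup stream to 127.0.0.1:{LOCAL_PORT}")
finally:
    tunnel.terminate()
```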

As you can see, the problems pile up, and they may only show up later, when the data size or even the workload pattern changes.

Wrapping up

The main challenge in a multi-cloud environment is that problems show up that are not present in less complex scenarios. It’s only in a multi-cloud setup that you’ll need to redirect traffic across multiple datacenters, ensure the load won’t saturate the network between the data centers, and keep the connections secure.

Failure recovery is another part of the challenge as it may not be trivial in a distributed multi-cloud database. There are multiple factors that you have to consider while working with such a complex environment. But, in the end, this is the way to achieve ultimate redundancy and scalability (not to mention other perks, like cost reduction).

Stay on top of all things multi-cloud by subscribing to our newsletter below.

Follow us on LinkedIn and Twitter for more great content in the coming weeks. Stay tuned!
