In a previous blog post we discussed several approaches to sharding. The most flexible one, sharding using metadata, is also the most complex one to implement. You need to design the meta-database, and build high availability not only for your application data but also for the metadata. On top of that, you need to design your application so it will be aware of the complex database infrastructure beneath – it has to query metadata first and then it has to be directed to a correct shard to read or write data. You will also have to build tools to manage and maintain the metadata. Migrating data requires caution so it has to be done carefully. You also have to make sure that any operations on the production databases are mirrored in the metadata. For instance, have you taken a slave out of rotation? This should be reflected in the metadata. Have you added a new slave to a shard? You have to modify the metadata and add that host. As you can imagine, a lot of time and effort has to be put into developing and maintaining scripts and tools to manage such a setup. It begs the question – is there some external solution to design, deploy and manage a sharded environment? In this post, we will cover a couple of solutions which are available and which may help you to build a scalable, sharded infrastructure.
Vitess is a tool built to help manage sharded environments. It was developed to help scale out databases at Youtube. In short, it is a solution based on metadata – by default, it uses range sharding but it is also possible to implement a custom sharding schema. Topology data is stored and maintained in a service like Zookeeper or etcd. Application access data using a lightweight proxy, named ‘vtgate’ in Vitess’ nomenclature. Vtgate connects to the metadata store and checks the data distribution – this allows it to route queries to correct shards – ‘tablets’.
Vitess supports range sharding – the keyspace is divided into two or more partitions, each partition covering a range of data. To find ranges, Vitess has to use a column of some kind to calculate them – currently supported data types are BIGINT UNSIGNED and VARBINARY. This works very well with id’s which usually use unsigned integer format.
In 2014, Oracle introduced a new set of tools for MySQL, called “MySQL Fabric”. Historically, there was no official tool which would allow users to build highly available topologies, including sharded setups. The idea behind Fabric is to provide an “official” tooling for building such setups. It provides a framework and tools to manage groups of highly-available MySQL instances. It supports implementation of HA setups and scaling through sharding.
MySQL Fabric uses a concept of high-availability groups – a group contains two or more MySQL servers connected using replication (actually, you can have just a single host in a group but, obviously, it won’t be highly-available).
MySQL Fabric not only gives you the ability to maintain availability of your data – it also supports scaling out through sharding. The basic idea is – if we can configure a few servers into a single high-availability group, we can then scale by having more of them. Then we’d need to implement some kind of shard mapping – we need to decide which column to use for sharding and which tables should be sharded.
If you are interested in sharding, check out our ebook on sharding.