blog

A Developer’s Guide to MongoDB Sharding

Onyancha Brian Henry

Published: October 15, 2018
Last Updated: June 8, 2022

A massive growth of data comes with a cost of reduced throughput operations especially when being served by a single server. However, you can improve this performance by increasing the number of servers and also distributing your data on multiple numbers of these servers. In this article, Replica sets in MongoDB, we discussed in detail how the throughput operations can be improved besides ensuring high availability of data. This process cannot be achieved completely without mentioning Sharding in MongoDB.

What is Sharding in MongoDB

MongoDB is designed in a flexible manner such that it is scalable for you to run in a cluster across a distributed platform. In this platform, data is distributed across a number of servers for storage. This process is what is termed as sharding. If a single server is subjected to a large amount of data for storage, you might run out of the storage space. In addition, very critical throughput operations such as read and write can be affected to a large extent. The horizontal scaling feature in MongoDB enables us to distribute data across multiple machines with an end result of improving load balancing.

MongoDB Shards

A shard can be considered to be a replica set that hosts some data subset used in a sharded cluster. For a given mongod instance with some set of data, the data is split and distributed across a number of databases in this case shards. Basically, a number of different shards serve as independent databases but collectively they make up a logical database. Shards reduce the workload that is to be performed by the entire database by reducing the number of operations a shard should handle besides the lesser amount of data this shard will host. This metric gives room for the expansion of a cluster horizontally. A simple architecture of sharding is shown below.

Data sent from a client application is intercepted by server drivers and then fed to the router. The router will then consult the server configurations to determine where to apply the read or write operation on the shard servers. In a nutshell, for an operation such as write, it has some index which will dictate to which shard is the record to be the host. Let’s say a database has 1TB data capacity distributed across 4 shards, each shard will hold 256GB of this data. With a reduced amount of data a shard can handle, operations can be performed quite fast. You should consider using the sharded cluster in your database when:

You expect the amount of data to outdo your single instance storage capacity in future.
If the write operations fail to be performed by the single MongodB instance
You run out of the Random Access Memory RAM at the expense of increased size of the active working set.

Sharding comes with increased complexity in the architecture besides additional resources. However, it is advisable to do sharding at early stages before your data outgrows since it is quite tedious to do so when your data is beyond capacity.

MongoDB Shard Key

As we all know, a document in MongoDB has fields for holding values. When you are deploying a sharding, you will be required to select a field from a collection which you will use to split the data. This field you selected is the shard key which determines how you are going to split the documents in the collection across a number of shards. In a simple example, your data may have field names students, class teachers, and marks. You may decide one shard set to contain the documents with the index student, another one teachers, and marks. However, you may require your data to be distributed randomly hence use a hashed shard key. There is a range of shard keys used in splitting data besides the hashed shard key but the two main categories are indexed field and indexed compound fields.

Choosing a Shard Key

For better functionality, capability and performance of the sharding strategy, you will need to select the appropriate sharded key. The selecting criteria are dependent on 2 factors:

Schema structure of your data. We can for example consider a field whose value could be increasing or decreasing (changing monotonically). This will most likely influence to a distribution of inserts to a single shard within a cluster.
How your querying configurations are featured to perform write operations.

What is a Hashed Shard Key

This uses a hashed index of a single field as the shard key. A hashed index is an index that maintains entries with hashes of the values of an indexed field.i.e

{
    "_id" :"5b85117af532da651cc912cd"
}

To create a hashed index you can use this command in the mongo shell.

db.collection.createIndex( { _id: hashedValue } )

Where the hashedValue variable represents a string of your specified hash value. Hashed sharding promotes even data distribution across a sharded cluster thereby reducing target operations. However, documents with almost same shard keys may unlikely be on the same shard hence requiring a mongo instance to do a broadcast operation in satisfying a given query criterion.

Range-Based Shard key

In this category, the dataset is partitioned based on value ranges of a chosen field key hence a high range of partitions. I.e. if you have a numeric key whose values run from negative infinity to positive infinity, each shard key will fall on certain point within that line. This line is divided into chunks with each chunk having a certain range of values. Precisely, those documents with almost similar shard key are hosted in the same chunk. The advantage with this technique is that it supports a range of queries since the router will select the shard with the specific chunk.

Characteristics of an Optimal Shard Key

An ideal shard key should be able to target a single shard in order to enhance a mongos program to return query operations from a single mongod instance. The key being primary field characterizes this. I.e. not in an embedded document.
Have a high degree of randomness. This is to say, the field should be available in most of the documents. This will ensure write operations are distributed within a shard.
Be easily divisible. With an easily divisible shard key, there is increased data distribution hence more shards.

Become a MongoDB DBA – Bringing MongoDB to Production

Learn about what you need to know to deploy, monitor, manage and scale MongoDB

Download for Free

Components of a Production Cluster Deployment

Regarding the architecture shown above, the production shard cluster should have:

Mongos/ Query routers. These are mongo instances that act as a server between application drivers and the database itself. In deployment, the load balancer is configured so as to enable connection from a single client to reach the same mongos.
Shards. These are the partitions within which documents sharing the same shard key definition are hosted. You should have at least 2 in order to increase the availability of data.
Config Servers: you can either have 3 separate config servers in different machines or a group of them if you will be having multiple sharded clusters.

Deployment of a Sharded Cluster

The following steps will give you a clear direction towards deploying your sharded cluster.

Creating host for the config servers. By default, the server files are available in the /data/configdb directory but you can always set this to your preferred directory. The command for creating the data directory is:
```
$ mkdir /data/configdb
```
Start the config servers by defining the port and file path for each using the command
```
$ mongod --configsvr --dbpath /data/config --port 27018
```
This command will start the configuration file in the data directory with the name config on port 27018. By default all MongoDB servers run on port 27017.
Start a mongos instance using the syntax:
```
$ mongo --host hostAddress --port 27018.
```
The hostAddress variable will have the value for the hostname or ip address of your host.
Start mongod on the shard server and initiate it using the command:
```
mongod --shardsvr --replSet
rs.initiate()
```
Start your mongos on the router with the command:
```
mongos --configdb rs/mongoconfig:27018
```
Adding shards to your cluster. Let’s say we have the default port to be 27017 as our cluster, we can add a shard on port 27018 like this:
```
mongo --host mongomaster --port 27017
sh.addShard( "rs/mongoshard:27018")
{ "shardAdded" : "rs", "ok" : 1 }
```

Enable sharding for the database using the shard name with the command:

sh.enableSharding(shardname)
{ "ok" : 1 }

You can check the status of the shard with the command:

sh.status()

You will be presented with this information

sharding version: {
"_id" : 1,
"minCompatibleVersion" : 5,
"currentVersion" : 6,
"clusterId" : ObjectId("59f425f12fdbabb0daflfa82")
}
shards:
{ "_id" : "rs", "host" : "rs/mongoshard:27018", "state" : 1 }
active mongoses:
"3.4.10" : 1
autosplit:
Currently enabled: yes
balancer:
Currently enabled: yes
Currently running: no
NaN
Failed balancer rounds in last 5 attempts: 0
Migration Results for the last 24 hours:
No recent migrations
databases:
{ "_id" : shardname, "primary" : "rs", "partitioned" : true }

Shard Balancing

After adding a shard to a cluster, you might observe that some shards may still be hosting more data than other and to be more secant the new shard will have no data. You therefore need to run some background checks in order to ensure load balance. Balancing is the basis for which data is redistributed in a cluster. The balancer will detect an uneven distribution hence migrate chunks from one shard to another until a balance quorum is reached.

The balancing process consumes plenty of bandwidth besides workload overheads and this will affect the operation of your database. A better balancing process involves:

Moving a single chunk at a time.
Do the balancing when the migrations threshold is reached, that is when the difference between the lowest numbers of chunks for a given collection and the highest number of chunks in the sharded collection.

Why Cloud Repatriation Matters Now More Than Ever

Automating Day 2 operations: Scaling, upgrades and maintenance

PostgreSQL Bi-Directional Logical Replication — A Deep Dive

Elasticsearch performance optimization