
How to Optimize Performance of MongoDB

Onyancha Brian Henry


Excellent database performance is important when you are developing applications with MongoDB. Sometimes the overall data serving process may become degraded due to a number of reasons, some of which include:

  • Inappropriate schema design patterns
  • Improper or no use of indexing strategies
  • Inadequate hardware
  • Replication lag
  • Poorly performing querying techniques

Some of these setbacks might force you to increase hardware resources while others may not. For instance, poor query structures may result in a query taking a long time to be processed, causing replication lag and maybe even some data loss. In this case, one may think that the storage memory is not enough and probably needs scaling up. This article discusses the most appropriate procedures you can employ to boost the performance of your MongoDB database.

Schema Design

Basically, the two most commonly employed schema relationships are:

  • One-to-Few
  • One-to-Many

While the most efficient schema design is the One-to-Many relationship, each has its own merits and limitations.

One-to-Few

In this case, for a given field, there are embedded documents, but they are not indexed with their own object identity.

Here is a simple example:

{
      userName: "Brian Henry",
      Email: "[email protected]",
      grades: [
             {subject: "Mathematics", grade: "A"},
             {subject: "English", grade: "B"}
      ]
}

One advantage of using this relationship is that you can get the embedded documents with just a single query. However, from a querying standpoint, you cannot access a single embedded document on its own. So if you are not going to reference embedded documents separately, it is optimal to use this schema design.
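
As a rough sketch, assuming the document above lives in a hypothetical users collection, a single query returns the user together with all of the embedded grade documents:

db.users.find( { userName: "Brian Henry" } )

No second query or join-like operation is needed to retrieve the grades.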

One-to-Many

In this relationship, data in one collection is related to data in a different collection. For example, you can have a collection for users and another for posts. So if a user makes a post, it is recorded with that user's id.

Users schema

{
    Full_name: "John Doh",
    User_id: 1518787459607.0
}

Posts schema

{
    "_id" : ObjectId("5aa136f0789cf124388c1955"),
    "postTime" : "16:13",
    "postDate" : "8/3/2018",
    "postOwnerNames" : "John Doh",
    "postOwner" : 1518787459607.0,
    "postId" : "1520514800139"
}

The advantage of this schema design is that the documents are standalone (they can be selected separately). Another advantage is that this design enables users with different ids to share information from the posts schema (hence the name One-to-Many), and it can sometimes act as an "N-to-N" schema, essentially without using a table join. The limitation of this schema design is that you have to do at least two queries to fetch or select data in the second collection.
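
To illustrate that limitation, here is a minimal sketch, assuming hypothetical users and posts collections that follow the schemas above, of the two queries needed to fetch a user's posts:

// First query: fetch the user document to obtain the user id
var user = db.users.findOne( { Full_name: "John Doh" } )

// Second query: fetch all posts that reference that user id
db.posts.find( { postOwner: user.User_id } )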

How to model the data will therefore depend on the application's access pattern. Besides this, you need to consider the schema designs we have discussed above.

Optimization Techniques for Schema Design

  1. Employ document embedding as much as possible as it reduces the number of queries you need to run for a particular set of data.

  2. Don't use denormalization for documents that are frequently updated. If a field is going to be frequently updated, then there will be the task of finding all the instances that need to be updated. This will result in slow query processing, hence overwhelming even the merits associated with denormalization.

  3. If there is a need to fetch a document separately, then there is no need to use embedding, since complex queries such as aggregation pipelines take more time to execute.

  4. If the array of documents to be embedded is large enough, don't embed them. The array growth should at least have a bound limit (see the sketch below).
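
As a minimal sketch of keeping an embedded array bounded, the update below uses $push with $each and $slice so that only the most recent ten grades are kept. The users collection, the field names, and the bound of ten are illustrative, reusing the One-to-Few example above:

db.users.updateOne(
    { userName: "Brian Henry" },
    { $push: { grades: { $each: [ { subject: "Physics", grade: "A" } ], $slice: -10 } } }
)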

Proper Indexing

This is the most critical part of performance tuning, and it requires one to have a comprehensive understanding of the application's queries, the ratio of reads to writes, and how much free memory your system has. If you use an index, the query will scan the index and not the collection.

An excellent index is one that involves all the fields scanned by a query. This is referred to as a compound index.

To create a single index for a field you can use this code:

db.collection.createIndex({"fields": 1})

For a compound index, to create the indexing:

db.collection.createIndex({"field1": 1, "field2": 1})

Besides faster querying with the use of indexing, there is an additional advantage for other operations such as sort, sample and limit. For example, if I design my index as {f: 1, m: 1}, I can perform an additional operation, apart from find, such as:

db.collection.find( {f: 1} ).sort( {m: 1} )

Reading data from RAM is more efficient than reading the same data from disk. For this reason, it is always advisable to ensure that your index fits entirely in RAM. To get the current index size of your collection, run the command:

db.collection.totalIndexSize()

You will get a value like 36864 bytes. This value should not take up a large percentage of the overall RAM size, since you need to cater for the needs of the entire working set of the server.

An efficient query should also enhance selectivity. Selectivity can be defined as the ability of a query to narrow the result using the index. To be more precise, your queries should limit the number of possible documents with the indexed field. Selectivity problems are mostly associated with a compound index that includes a low-selectivity field and another field. For example, if you have this data:

{ _id: ObjectId(), a: 6, b: "no", c: 45 }
{ _id: ObjectId(), a: 7, b: "gh", c: 28 }
{ _id: ObjectId(), a: 7, b: "cd", c: 58 }
{ _id: ObjectId(), a: 8, b: "kt", c: 33 }

The query {a: 7, b: "cd"} will scan through 2 documents to return 1 matching document. However, if the data for the value a is evenly distributed, i.e.:

{ _id: ObjectId(), a: 6, b: "no", c: 45 }
{ _id: ObjectId(), a: 7, b: "gh", c: 28 }
{ _id: ObjectId(), a: 8, b: "cd", c: 58 }
{ _id: ObjectId(), a: 9, b: "kt", c: 33 }

The query {a: 7, b: "cd"} will scan through 1 document and return this document. Hence this will take a shorter time than with the first data distribution.


Resource Provisioning

Inadequate storage memory, RAM and other operating parameters can drastically degrade the performance of MongoDB. For instance, if the number of user connections is very large, it will hinder the server application's ability to handle requests in a timely manner. As discussed in Key things to monitor in MongoDB, you can get an overview of which limited resources you have and how you can scale them to suit your specifications. With a large number of concurrent application requests, the database system can be overwhelmed in keeping up with the demand.
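
As a quick sketch of how you might spot-check two of these resources from the mongo shell (both fields are part of the standard serverStatus output):

// Current and available client connections
db.serverStatus().connections

// Resident and virtual memory used by the mongod process, in MB
db.serverStatus().mem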

Replication Lag

Sometimes you may notice some data missing from your database, or when you delete something, it appears again. As much as you may have a well-designed schema, appropriate indexing and enough resources, in the beginning your application will run smoothly without any hiccups, but then at some point you notice the problems mentioned above. MongoDB relies on the concept of replication, where data is redundantly copied to meet some design criteria. An assumption with this is that the process is instantaneous. However, some delay may occur, perhaps due to network failure or unhandled errors. In a nutshell, there will be a large gap between the time an operation is processed on the primary node and the time it is applied on the secondary node.
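
A minimal way to check how far each secondary is behind the primary, assuming you are connected to a member of the replica set (on older MongoDB versions the equivalent helper is rs.printSlaveReplicationInfo()):

// Prints, for each secondary, how many seconds it lags behind the primary's oplog
rs.printSecondaryReplicationInfo()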

Setbacks of Replication Lag

  1. Inconsistent data. This is especially associated with read operations that are distributed across secondaries.

  2. If the lag gap is wide enough, then a lot of unreplicated data may be on the primary node, and it will need to be reconciled in the secondary node. At some point, this may be impossible, especially when the primary node cannot be recovered.

  3. Failure to recover the primary node can force one to run a node whose data is not up to date, and consequently one may have to drop the whole database in order to make the primary recover.

Causes of Replication Lag

  1. The primary node outmatching the secondary in terms of CPU, disk IOPS and network I/O specifications.

  2. Complex write operations. For example, a command like:

    db.collection.update( { a: 7}  , {$set: {m: 4} }, {multi: true} )

    The primary node will record this operation in the oplog quickly enough. However, the secondary node has to fetch those ops and read into RAM any index and data pages needed in order to meet the criteria specifications, such as the id. Since it has to do this quickly enough to keep up with the rate at which the primary node applies operations, if the number of ops is large enough there will be an expected lag.

  3. Locking of the secondary when making a backup. In this case we may forget to disable writes on the primary, so it will continue with its operations as normal. By the time the lock is released, the replication lag will have grown into a large gap, especially when dealing with a huge amount of backup data (see the sketch after this list).

  4. Index building. If an index is being built on the secondary node, then all other operations associated with it are blocked. If the index build is long-running, the replication lag hiccup will be encountered.

  5. Disconnected secondary. Sometimes the secondary node may fail due to network disconnections, and this results in replication lag when it is reconnected.
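
As a sketch of the locking pattern referred to in point 3, one common way to take a file-copy backup of a secondary is to flush pending writes and lock the node; forgetting the matching unlock leaves the member falling further and further behind:

// On the secondary being backed up: flush writes to disk and block new ones
db.fsyncLock()

// ... copy the data files for the backup ...

// Release the lock so the member can resume applying the oplog
db.fsyncUnlock()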

How to Minimize the Replication Lag

  • Use unique indexes besides your collection's _id field. This is to prevent the replication process from failing completely (a minimal example is shown after this list).

  • Consider other types of backup, such as point-in-time and filesystem snapshots, which do not necessarily require locking.

  • Avoid building large indexes, since they cause background blocking operations.

  • Make the secondary powerful enough. If the write load is lightweight, then using underpowered secondaries will be economical. But for heavy write loads, the secondary node may lag behind the primary. To be more precise, the secondary should have enough bandwidth to read the oplogs fast enough in order to keep its rate with the primary node.
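
For the first point, a unique index can be created like this (the users collection and Email field are just an illustration, reusing the earlier example):

db.users.createIndex( { Email: 1 }, { unique: true } )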

Efficient Query Techniques

Besides creating indexed queries and using query selectivity as discussed above, there are other concepts you can employ to speed up your queries and make them more effective.

Optimizing Your Queries

  1. Using a covered query. A covered query is one which is always completely satisfied by an index, hence it does not need to examine any document. The covered query therefore should have all its fields as part of the index, and consequently the result should contain only fields that are part of that index.

    Let’s consider this example:

    {_id: 1, product: { price: 50 } }

    If we create an index for this collection as

    {"product.price": 1}

    Considering a find operation, this index will cover the following query:

    db.collection.find( {"product.price": 50}, {"product.price": 1, _id: 0} )

    and return the product.price field and value only.

  2. For embedded documents, use the dot notation (.). The dot notation helps in accessing elements of an array and fields of an embedded document.

    Accessing an array:

    {
       prices: [12, 40, 100, 50, 40]  
    }

    To specify the fourth element for example, you can write this command:

    "prices.3"

    Accessing an object array:

    {
       vehicles: [ {name: "toyota", quantity: 50},
                   {name: "bmw", quantity: 100},
                   {name: "subaru", quantity: 300} ]
    }

    To specify the name field in the vehicles array you can use this command

    "vehicles.name"

  3. Check if a query is covered. To do this, use db.collection.explain(). This function will also provide information on the execution of other operations, e.g. db.collection.explain().aggregate(). To learn more about the explain function you can check out explain().
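
     As a rough way to verify coverage, assuming the index and projection from the covered-query example in point 1, you can inspect the execution statistics:

     db.collection.find( {"product.price": 50}, {"product.price": 1, _id: 0} ).explain("executionStats")

     For a covered query, executionStats.totalDocsExamined will be 0, since the result is served entirely from the index.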

In general, the supreme technique as far as querying is concerned is using indexes. Querying only an index is much faster than querying documents outside of the index. Indexes can fit in memory, hence they are available in RAM rather than on disk. This makes it easy and fast to fetch them from memory.
