What to Know When You Start Working with MongoDB in Production - Ten Tips

Onyancha Brian Henry

Learning MongoDB requires careful thought. Seemingly small details are often overlooked, and they can jeopardise the performance of the database once it is in production.

MongoDB is a NoSQL DBMS that follows a different pattern from SQL databases, especially in security and structure. Some of its built-in features boost performance and have made it one of the most popular databases of recent years, but some of them can also hurt performance if they are not handled with care.

In a recent “worst case” experience, I queried a collection whose documents held large arrays, and it took ages to get the results back. I decided to write this blog because I knew it would be of great help to anyone who runs into the same problems.

Key Considerations for MongoDB in Production

  1. Security and authentication
  2. Indexing your documents
  3. Using a schema in your collections
  4. Capped collection
  5. Document size
  6. Array size for embedded documents
  7. Aggregation pipeline stages
  8. Order of keys in hash object
  9. ‘undefined’ and ‘null’ in MongoDB
  10. Write operation

MongoDB Security and Authentication

Data varies in sensitivity, and some of it will need to be kept confidential. By default, a MongoDB installation does not require authentication, but that is no reason to run without it, especially when confidential data such as financial or medical records is involved. On a development workstation this is not a big deal, but in production, where many users are involved, it is good practice to enable authentication. The most common and easiest method is MongoDB's default username and password authentication.

Data is written to files that can be read with third-party tools, more so if they are not encrypted. If an unauthorised person gains access to the system files, the data can be altered without your knowledge. Hosting the database on a dedicated server and assigning a single user with full access to the data files will save you the trouble.

Protecting data from external injection attacks is also essential. Some operations, such as $where, mapReduce and the legacy group command, execute JavaScript on the server and are therefore open to JavaScript injection. To protect data integrity, you can disable server-side JavaScript by setting javascriptEnabled: false under the security section of the config file, provided you do not use any of these operations. Further, you can reduce the risk of data access through network breaches by following the procedures in the MongoDB Security Checklist.
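As a minimal sketch of username and password authentication, the snippet below creates an application user with only the rights it needs. The database name "appdb" and the user name "appUser" are hypothetical, and the sketch assumes access control has already been switched on in the server configuration. The createUser call runs only inside mongosh, where the global db object exists.

```javascript
// Hypothetical names: database "appdb", user "appUser".
const appUserSpec = {
  user: "appUser",
  // Least privilege: read and write on one database, no admin rights.
  roles: [{ role: "readWrite", db: "appdb" }],
};

// Runs only inside mongosh; under plain Node.js the block is skipped.
if (typeof db !== "undefined") {
  db.getSiblingDB("appdb").createUser({
    ...appUserSpec,
    pwd: passwordPrompt(), // prompt for the password instead of hard-coding it
  });
}
```

Prompting for the password keeps it out of scripts and shell history.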

Indexing Your Documents

An index is a data structure that stores the values of one or more fields in sorted order, together with a pointer to the document each value belongs to. Indexes greatly improve read performance, at the cost of a small overhead on every write because each index has to be updated as well. By default, MongoDB creates an index on the _id field of every collection; indexes on other fields are yours to define. Without a suitable index, the database has to scan the collection from start to end, which is costly in time and gives the query poor latency. On the application end, users may experience a lag and may even think the application is not working.

Indexes help sort and lookup operations as well as the find operation itself. Sorting is a common operation on returned documents. It is often carried out as a final stage, after the documents have been filtered, so that only a small amount of data needs to be sorted. An index on the sort key lets MongoDB return the documents in index order without sorting them in memory. Without such an index, the sort happens in memory, where the combined size of the documents being sorted is limited to 32 MB (100 MB from MongoDB 4.4); when this limit is exceeded, the database throws an error instead of returning results.

The $lookup stage also benefits from indexing. An index on the field used as the foreign key in the joined collection is essential, otherwise every input document triggers a full scan of that collection.
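The sketch below shows one way to create an index for a sort and another for a $lookup foreign key. The collections "customers" and "orders" and the fields customerId and createdAt are assumptions made for illustration. The database calls run only inside mongosh.

```javascript
// Hypothetical collections: "customers" and "orders"
// (each order carries customerId and createdAt).
const sortIndex = { createdAt: -1 };   // lets a newest-first sort walk the index
const lookupIndex = { customerId: 1 }; // index on the $lookup foreign field

// Join each customer to their orders; foreignField is the indexed field.
const lookupStage = {
  $lookup: {
    from: "orders",
    localField: "_id",
    foreignField: "customerId",
    as: "orders",
  },
};

// Runs only inside mongosh; under plain Node.js the block is skipped.
if (typeof db !== "undefined") {
  db.orders.createIndex(sortIndex);
  db.orders.createIndex(lookupIndex);

  // The winning plan should show an index scan rather than a collection scan.
  const plan = db.orders.find().sort(sortIndex).limit(10).explain();
  printjson(plan.queryPlanner.winningPlan);

  db.customers.aggregate([lookupStage]).forEach(printjson);
}
```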

Using a Schema in Your Collections

MongoDB does not require you to define fields (columns) the way an SQL DBMS does. Even so, defining a schema is always good practice, because it avoids data inconsistency and the setbacks that come with it. A schema lets you determine which type of data goes into a field and which fields must be supplied with a value, and it generally strengthens validation before an insert or update, thereby promoting data integrity and consistency. A schema design also guides the decision on whether to reference or embed data. As a beginner you may think the only model is “One-to-N”, with subdocuments stored as array entries, but that is not the case.

You need to understand the cardinality of the relationship between documents before designing your model. Some rules that will help you arrive at an optimal schema are:

  1. If only a few fields or array elements are involved, embed subdocuments to reduce the number of queries needed to access the data. Take the model below as an example:
       {
         Name: 'John Doh',
         Age: 20,
         Addresses: [
           {street: 'Moi Avenue', city: 'Nairobi', countryCode: 'KE'},
           {street: 'Kenyatta Avenue', city: 'Nairobi', countryCode: 'KE'}
         ]
       }
      
  2. For frequently updated fields, avoid denormalization. If a field that is duplicated across documents is updated frequently, every instance has to be found and updated. This results in slow processing that outweighs the merits of denormalization.
  3. If sub-documents need to be fetched on their own, keep them in a separate collection. Complex queries such as aggregation pipelines take more time to execute when many embedded sub-documents are involved.
  4. Arrays holding a large set of objects should not be embedded, because they may keep growing and eventually exceed the document size limit.

Schema modelling is usually determined by the application's access pattern. You can find more guidance on designing your model in the blog “6 Rules of Thumb for MongoDB Schema Design”.
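One way to enforce a schema on the server is a $jsonSchema validator, available from MongoDB 3.6. The sketch below validates the John Doh model used earlier. The collection name "people", the limit of ten addresses and the choice of required fields are assumptions for illustration, not recommendations. The createCollection call runs only inside mongosh.

```javascript
// Validator for the sample model; limits and required fields are illustrative.
const personValidator = {
  $jsonSchema: {
    bsonType: "object",
    required: ["Name", "Age"],
    properties: {
      Name: { bsonType: "string" },
      Age: { bsonType: "number", minimum: 0 },
      Addresses: {
        bsonType: "array",
        maxItems: 10, // keep the embedded array small
        items: {
          bsonType: "object",
          required: ["street", "city", "countryCode"],
          properties: {
            street: { bsonType: "string" },
            city: { bsonType: "string" },
            countryCode: { bsonType: "string" },
          },
        },
      },
    },
  },
};

// Runs only inside mongosh; "people" is a hypothetical collection name.
if (typeof db !== "undefined") {
  db.createCollection("people", { validator: personValidator });
}
```

With the validator in place, an insert that omits Name or supplies a non-numeric Age is rejected by the server instead of silently creating an inconsistent document.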

Use a Capped Collection to Prioritise Recent Documents

MongoDB offers many features, such as the capped collection, that often go unused. A capped collection has a fixed size and supports high-throughput operations that insert and retrieve documents in insertion order. When its space fills up, the oldest documents are removed to make room for new ones.

Examples of capped collection use cases:

  • Caching a small set of frequently accessed data, where the collection is read-heavy rather than write-heavy and needs to stay consistently fast.
  • Logging for high-volume systems. Reading a capped collection in insertion order needs no index, so recording is fast, much like writing to a file.
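A capped collection for the logging use case could be set up roughly as follows. The collection name "requestLog" and the limits (5 MB, 5000 documents) are hypothetical values chosen for the sketch. The database calls run only inside mongosh.

```javascript
// size is in bytes and is mandatory for a capped collection; max is optional.
const cappedOptions = { capped: true, size: 5 * 1024 * 1024, max: 5000 };

// Runs only inside mongosh; "requestLog" is a hypothetical collection name.
if (typeof db !== "undefined") {
  db.createCollection("requestLog", cappedOptions);
  db.requestLog.insertOne({ at: new Date(), path: "/health", status: 200 });

  // Reverse natural order returns the most recently inserted documents first.
  db.requestLog.find().sort({ $natural: -1 }).limit(5).forEach(printjson);
}
```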

Pay Attention to MongoDB Document Size

Every MongoDB document is limited to a size of 16 megabytes. However, it is not optimal for a document to reach or even approach this limit, as this causes serious performance problems. MongoDB works best when documents are a few kilobytes in size. If a document is large, a complex projection request will take a long time and the query may time out.
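A rough size check before an insert can catch documents that are drifting towards the limit. The sketch below measures the UTF-8 length of the JSON text, which is only an approximation of the real BSON size. The helper name approxJsonBytes and the 1 MB warning threshold are assumptions made for illustration.

```javascript
const MAX_BSON_BYTES = 16 * 1024 * 1024; // MongoDB's document size limit
const WARN_BYTES = 1024 * 1024;          // hypothetical warning threshold

// Approximates document size as the UTF-8 length of its JSON text.
// This is not the exact BSON size, only a cheap estimate.
function approxJsonBytes(doc) {
  return new TextEncoder().encode(JSON.stringify(doc)).length;
}

const sampleDoc = {
  Name: "John Doh",
  Age: 20,
  Addresses: [{ street: "Moi Avenue", city: "Nairobi", countryCode: "KE" }],
};

const approxSize = approxJsonBytes(sampleDoc);
if (approxSize > WARN_BYTES) {
  console.log(`Document is about ${approxSize} bytes; consider splitting it`);
} else {
  console.log(`Document is about ${approxSize} bytes`);
}
```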

Pay Attention to the Array Size of Embedded Documents

You can push subdocuments to a field in a document, thereby creating an array value on that field. As mentioned before, you need to keep the size of the subdocuments small. It is equally important to keep the number of array elements below a thousand. Otherwise, the document may outgrow its allocated space and need to be relocated on disk. A further problem with such a move is that the document's index entries have to be rewritten, and if the array field is indexed, every subdocument has its own entry to rewrite as well. All these index writes result in slow operations. For large subdocuments, it is better to keep the records in a separate collection than to embed them.
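Moving an embedded array into its own collection can be sketched as below: each array element becomes a document that references its parent. The helper splitAddresses, the reference field personId and the collection names "people" and "addresses" are hypothetical. The database calls run only inside mongosh.

```javascript
// Splits an embedded Addresses array into separate documents
// that point back at the parent through personId (a hypothetical field name).
function splitAddresses(person) {
  const { Addresses = [], ...personDoc } = person;
  const addressDocs = Addresses.map((address) => ({
    personId: personDoc._id,
    ...address,
  }));
  return { personDoc, addressDocs };
}

const embedded = {
  _id: 1,
  Name: "John Doh",
  Age: 20,
  Addresses: [
    { street: "Moi Avenue", city: "Nairobi", countryCode: "KE" },
    { street: "Kenyatta Avenue", city: "Nairobi", countryCode: "KE" },
  ],
};

const { personDoc, addressDocs } = splitAddresses(embedded);

// Runs only inside mongosh; collection names are hypothetical.
if (typeof db !== "undefined") {
  db.people.insertOne(personDoc);
  db.addresses.insertMany(addressDocs);
  db.addresses.createIndex({ personId: 1 }); // fast lookup by parent
}
```

The parent document now stays small however many addresses are added, at the cost of a second query or a $lookup to read them.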

Aggregation Pipeline Stages 

Besides the normal MongoDB query operations, there is an aggregation framework used to manipulate and return data according to specifications such as ordering and grouping. MongoDB's optimizer can rearrange pipeline stages only in limited cases, so you need to order them appropriately yourself. Start by reducing the amount of data you are dealing with using the $match stage, and leave $sort towards the end if you need to sort. You can use third-party tools such as Studio 3T to tune your aggregation query before integrating it into your code. The tool lets you see the input and output of each stage, so you know what you are dealing with.

Using $sort together with $limit should give the same results every time the query is executed, provided the sort key is unique. If you use $limit on its own, the returned data is not deterministic, which can cause issues that are difficult to track down.
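A pipeline ordered along these lines might look like the sketch below: filter first, then group, then sort on a key with a unique tie-breaker, and only then limit. The collection "orders" and the fields status, customerId and amount are assumptions for illustration. The aggregate call runs only inside mongosh.

```javascript
// Hypothetical collection "orders" with status, customerId and amount fields.
const topCustomersPipeline = [
  { $match: { status: "shipped" } },                             // shrink the input first
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  { $sort: { total: -1, _id: 1 } },                              // _id breaks ties, so order is stable
  { $limit: 10 },                                                // limit only after sorting
];

// Runs only inside mongosh; under plain Node.js the block is skipped.
if (typeof db !== "undefined") {
  db.orders.aggregate(topCustomersPipeline).forEach(printjson);
}
```

Because _id is unique within the grouped results, two customers with the same total always come back in the same order.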

Check the Order of Keys in Hash Objects

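Consider a document in which the name is stored as an embedded document:

{
   Name: {
      FirstName: 'John',
      LastName: 'Doh'
   }
}

The order of top-level fields in a query document does not affect matching. An exact match on an embedded document is different. The query {Name: {FirstName: 'John', LastName: 'Doh'}} matches the document above, but {Name: {LastName: 'Doh', FirstName: 'John'}} does not, because the fields are compared in order. You therefore need to maintain the order of name and value pairs in your embedded documents, or query the fields individually with dot notation.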
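Key order matters specifically when a query asks for an exact match on a whole embedded document, while queries on individual fields are insensitive to it. The plain JavaScript toy below imitates that exact-match rule by comparing serialised objects. The function exactMatch is a stand-in written for this sketch, not MongoDB's implementation, and the field name Name is hypothetical.

```javascript
// Toy stand-in for exact embedded-document equality:
// same fields, same values, same order.
function exactMatch(stored, queried) {
  return JSON.stringify(stored) === JSON.stringify(queried);
}

const storedName = { FirstName: "John", LastName: "Doh" };

const sameOrder = exactMatch(storedName, { FirstName: "John", LastName: "Doh" });
const swappedOrder = exactMatch(storedName, { LastName: "Doh", FirstName: "John" });
console.log(sameOrder, swappedOrder);

// Query shapes for a hypothetical embedded field called Name:
// the first depends on key order, the second does not.
const orderSensitiveQuery = { Name: { LastName: "Doh", FirstName: "John" } };
const orderInsensitiveQuery = { "Name.LastName": "Doh", "Name.FirstName": "John" };
```

Querying with dot notation, as in the second query shape, is the usual way to avoid depending on key order.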

Avoid ‘undefined’ and ‘null’ in MongoDB

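MongoDB stores its documents in BSON format. JSON does not support ‘undefined’, and the corresponding BSON type is deprecated, so you should avoid using it. ‘null’ is the usual substitute, but you should avoid it where you can too: a query for a null value also matches documents in which the field is missing altogether.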
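The short sketch below shows how JSON serialisation treats the two values, and one way to leave empty fields out of a document before saving it. The helper stripEmpty and the field MiddleName are hypothetical. The two query shapes at the end use standard operators to tell a stored null apart from a missing field.

```javascript
const rawPerson = { FirstName: "John", MiddleName: undefined, LastName: null };

// JSON drops undefined properties entirely but keeps null ones.
const asJson = JSON.stringify(rawPerson);
console.log(asJson);

// Removes keys whose value is undefined or null, so the field is simply absent.
function stripEmpty(doc) {
  return Object.fromEntries(
    Object.entries(doc).filter(([, value]) => value !== undefined && value !== null)
  );
}

const cleaned = stripEmpty(rawPerson);
console.log(cleaned);

// Query shapes that separate the two cases for a hypothetical MiddleName field:
const storedAsNull = { MiddleName: { $type: "null" } };   // field present with a null value
const fieldMissing = { MiddleName: { $exists: false } };  // field absent from the document
```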

Consider Write Operations

You can configure MongoDB for high-speed writes, but the drawback is that a response may be returned before the data has been written to disk. Journalling should be enabled, and writes that matter should wait for the journal, to avoid this scenario. In addition, if the database crashes, the journal holds the writes made since the last checkpoint, and these are replayed in the recovery process. The interval between journal writes can be set with the parameter commitIntervalMs (storage.journal.commitIntervalMs in the config file).
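On a per-operation basis, durability can be requested through the write concern. In the sketch below, j: true asks the server to acknowledge the write only after it has reached the on-disk journal. The collection "payments", the sample document and the five second timeout are assumptions for illustration. The insert runs only inside mongosh.

```javascript
// w: 1 waits for the primary (or standalone) to apply the write;
// j: true additionally waits until the write is in the on-disk journal.
const durableWrite = { writeConcern: { w: 1, j: true, wtimeout: 5000 } };

// Runs only inside mongosh; "payments" is a hypothetical collection name.
if (typeof db !== "undefined") {
  const result = db.payments.insertOne({ amount: 100, currency: "KES" }, durableWrite);
  printjson(result);
}
```

Waiting for the journal makes each write slower, so it is a trade-off worth reserving for data you cannot afford to lose.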

Conclusion

A database system should ensure data integrity and consistency, besides being resilient to failure and malice. However, to achieve this, you need to understand the database itself and the data it holds. MongoDB will work well when the factors mentioned above are taken into account, the most important being the use of a schema. A schema lets you validate your data before an insert or update and determines how you model that data. Data modelling is often driven by the application's access pattern. Taken together, these practices lead to better database performance.
