Planning & Managing Schemas in MongoDB (Even Though It’s Schemaless)
When MongoDB was introduced, its most heavily promoted feature was being “schemaless”. What does that mean? It means you can store JSON documents with different structures in the same collection. This is pretty cool, but the problem starts when you need to retrieve the documents. How do you tell that a retrieved document has a certain structure, or whether it contains a particular field? Without an agreed structure, your application code has to inspect each document it gets back and check for that field. This is why it is useful to plan the MongoDB schema carefully, especially for large applications.
When it comes to MongoDB, there is no single right way to design the schema. It all depends on your application and how it is going to use the data. However, there are some common practices that you can follow while designing your database schema. Here, I will discuss these practices and their pros and cons.
One-to-Few Modeling (Embedding)
In this design, the child documents are embedded directly in the parent document. Consider this example from a Person collection, where each person has a handful of addresses.
{
name: "Amy Cooper",
hometown: "Seoul",
addresses: [
{ city: 'New York', state: 'NY', cc: 'USA' },
{ city: 'Jersey City', state: 'NJ', cc: 'USA' }
]
}
Pros:
- You can get all the information in a single query.
Cons:
- Embedded data lives entirely inside the parent document, so you can’t fetch or manage the embedded items as standalone entities.
- For example, suppose you build a task-tracking system this way and embed each person’s tasks in that person’s document. A query such as “show me all tasks due tomorrow” sounds simple, but it becomes awkward because the tasks are scattered across every Person document. In that case, you should consider other approaches.
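To make the task-tracking problem concrete, here is a minimal sketch in plain JavaScript that uses an in-memory array as a stand-in for the Person collection. The sample data, the tasks field and the tasksDueOn helper are all assumptions made up for illustration, not part of a real schema.

```javascript
// Hypothetical Person documents with tasks embedded (illustrative data only).
const people = [
  {
    name: "Amy Cooper",
    tasks: [
      { description: "File report", due_date: "2015-11-01" },
      { description: "Book flights", due_date: "2015-11-02" }
    ]
  },
  {
    name: "Bear",
    tasks: [{ description: "Read a Novel", due_date: "2015-11-01" }]
  }
];

// With embedding, tasks are only reachable through their owner, so a
// task-centric question means walking every person and flattening the arrays.
function tasksDueOn(persons, date) {
  return persons.flatMap(person =>
    person.tasks
      .filter(task => task.due_date === date)
      .map(task => ({ owner: person.name, description: task.description }))
  );
}

const dueTasks = tasksDueOn(people, "2015-11-01");
console.log(dueTasks);
```

On the server, this kind of flattening is typically done with the aggregation pipeline’s $unwind stage, which is more work than a plain find on a separate tasks collection.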
One-to-Many Modeling (Referencing)
In this type of modeling, the parent document holds the reference IDs (ObjectIDs) of the child documents. To retrieve the related documents you use application-level joins, that is, you fetch the documents separately and combine them in application code instead of asking the database to join them. This moves the join work from the database to the application. Consider this example:
// Parts collection
{
_id: ObjectID('1234'),
partno: '1',
name: 'Intel 100 Ghz CPU',
qty: 100,
cost: 1000,
price: 1050
}
// Products collection
{
name: 'Computer WQ-1020',
manufacturer: 'ABC Company',
catalog_number: 1234,
parts: [
ObjectID('1234'), // reference to part no. 1
ObjectID('2345'),
ObjectID('3456')
]
}
Suppose each product may have several thousand parts associated with it. For this kind of data, referencing is a good fit. You put the reference IDs of all the associated parts in the product document. Then you can use an application-level join to get the parts for a particular product.
Pros:
- Each part is a separate document, so you can run any part-related query directly against the Parts collection without going through the parent document.
- It is very easy to perform CRUD (Create, Read, Update, Delete) operations on each document independently.
Cons:
- The major drawback is that you need an extra query to fetch the part details before you can join them with the product document in application code. The additional round trip can hurt performance.
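The application-level join described above can be sketched in plain JavaScript. In-memory arrays stand in for the Parts and Products collections; the second and third part entries and the helper names findPartsByIds and joinProductWithParts are hypothetical and exist only for illustration.

```javascript
// In-memory stand-ins for the Parts and Products collections.
// Only the first part comes from the example; the other two are made up.
const parts = [
  { _id: "1234", partno: "1", name: "Intel 100 Ghz CPU", price: 1050 },
  { _id: "2345", partno: "2", name: "Case", price: 80 },
  { _id: "3456", partno: "3", name: "Power supply", price: 120 }
];
const product = {
  name: "Computer WQ-1020",
  parts: ["1234", "2345", "3456"]
};

// First query: the product document (already loaded above).
// Second query: fetch every part whose _id appears in product.parts.
// Against a real database this would typically be one find using $in.
function findPartsByIds(allParts, ids) {
  const wanted = new Set(ids);
  return allParts.filter(part => wanted.has(part._id));
}

// The join itself: combine both results in application code,
// leaving the original product document untouched.
function joinProductWithParts(prod, allParts) {
  return { ...prod, parts: findPartsByIds(allParts, prod.parts) };
}

const joined = joinProductWithParts(product, parts);
console.log(joined.parts.map(p => p.name));
```

The extra round trip mentioned in the cons is the second query; everything after it happens in the application.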
One-to-Millions Modeling (Parent Referencing)
When you need to store a huge amount of data per parent, you can’t use either of the above approaches because MongoDB limits each document to 16MB. A good example of this scenario is an event logging system which collects logs from different types of machines and stores them in Logs and Machines collections.
Here, embedding all the log entries for a particular machine in a single document is out of the question: within a few hours, the document could exceed 16MB. Even if you store only the reference IDs of the log documents, you would still exhaust the 16MB limit, because some machines can generate millions of log messages in a single day.
So in this case, we can use the parent referencing approach. Instead of storing the reference IDs of the child documents in the parent document, we store the reference ID of the parent document in every child document. For our example, we store the ObjectID of the machine in each Logs document. Consider this example:
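A rough back-of-the-envelope check of the reference-ID claim, written as a small JavaScript sketch. It assumes a 12-byte ObjectID, a 16 MiB document limit, and the standard BSON array layout (a type byte plus the index stored as a string key for every element); the function name maxIdsInBsonArray is made up for illustration, and the estimate ignores every other field in the document.

```javascript
// Maximum BSON document size: 16 mebibytes.
const MAX_DOC_BYTES = 16 * 1024 * 1024;
// An ObjectID value occupies 12 bytes.
const OBJECT_ID_BYTES = 12;

// Optimistic ceiling: pretend there is no overhead at all.
const rawUpperBound = Math.floor(MAX_DOC_BYTES / OBJECT_ID_BYTES);

// Closer estimate: each BSON array element costs 1 type byte, the index
// as a null-terminated string key ("0", "1", ...), then the 12-byte value.
function maxIdsInBsonArray(limitBytes) {
  let used = 5; // 4-byte length prefix + 1-byte terminator of the array
  let count = 0;
  while (true) {
    const elementBytes = 1 + String(count).length + 1 + OBJECT_ID_BYTES;
    if (used + elementBytes > limitBytes) return count;
    used += elementBytes;
    count += 1;
  }
}

const estimatedMax = maxIdsInBsonArray(MAX_DOC_BYTES);
console.log(rawUpperBound, estimatedMax);
```

Even the optimistic ceiling is under 1.4 million IDs, and the estimate with array overhead is lower still, so a machine that emits millions of log messages per day would hit the limit within the day.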
// Machines collection
{
_id : ObjectID('AAA'),
name : 'mydb.example.com',
ipaddr : '127.66.0.4'
}
// Logs collection
{
time : ISODate("2015-09-02T09:10:09.032Z"),
message : 'WARNING: CPU usage is critical!',
host: ObjectID('AAA') // references the Machines document
}
Suppose you want to find the 3000 most recent logs for the machine with IP address 127.66.0.4:
machine = db.machines.findOne({ipaddr : '127.66.0.4'});
msgs = db.logs.find({host: machine._id}).sort({time : -1}).limit(3000).toArray()
Two-Way Referencing
In this approach, we store references on both sides: the parent’s reference is stored in the child document, and the child’s reference is stored in the parent document. This makes searching relatively easy in one-to-many modeling. For example, we can start from either the person document or the task document and reach the other side. On the other hand, changing a relationship requires two separate updates, one on each side.
// person
{
_id: ObjectID("AAAA"),
name: "Bear",
tasks: [
ObjectID("AAAD"),
ObjectID("ABCD"), // reference to the child document
ObjectID("AAAB")
]
}
// tasks
{
_id: ObjectID("ABCD"),
description: "Read a Novel",
due_date: ISODate("2015-11-01"),
owner: ObjectID("AAAA") // reference to the parent document
}
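The cost of keeping both sides in sync can be sketched in plain JavaScript, with Maps standing in for the person and tasks collections. The second person (id "BBBB") and the reassignTask helper are hypothetical and only serve the illustration.

```javascript
// In-memory stand-ins for the person and tasks collections.
// The person with id "BBBB" is made up for this sketch.
const persons = new Map([
  ["AAAA", { _id: "AAAA", name: "Bear", tasks: ["AAAD", "ABCD", "AAAB"] }],
  ["BBBB", { _id: "BBBB", name: "Amy Cooper", tasks: [] }]
]);
const tasks = new Map([
  ["ABCD", { _id: "ABCD", description: "Read a Novel", owner: "AAAA" }]
]);

// Moving a task to another person touches both sides of the relationship.
function reassignTask(taskId, newOwnerId) {
  const task = tasks.get(taskId);
  const oldOwner = persons.get(task.owner);
  const newOwner = persons.get(newOwnerId);

  // Parent side: fix the tasks arrays on the person documents.
  oldOwner.tasks = oldOwner.tasks.filter(id => id !== taskId);
  newOwner.tasks.push(taskId);

  // Child side: fix the owner reference on the task document.
  task.owner = newOwnerId;
}

reassignTask("ABCD", "BBBB");
console.log(persons.get("AAAA").tasks, persons.get("BBBB").tasks, tasks.get("ABCD").owner);
```

Against a real database, the parent-side and child-side changes are separate write operations, so a failure between them can leave the two sides disagreeing unless the application guards against it.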
Conclusion
In the end, it all depends on your application’s requirements. Design the MongoDB schema in the way that benefits your application most and gives you the best performance. Here are some summarized points to consider while designing your schema.
- Design the schema based on your application’s data access patterns.
- It is not necessary to embed documents every time. Combine documents only if you are going to use them together.
- Consider duplicating data where it saves queries, because storage is cheaper than compute power nowadays.
- Optimize the schema for the most frequent use cases.
- Arrays should not grow without bound. If there are more than a couple of hundred child documents, don’t embed them.
- Prefer application-level joins to database-level joins. With proper indexing and proper use of projection, they can save you a lot of time.