Tips for Running MongoDB in Production Using Change Streams

Onyancha Brian Henry

Modern databases must have the capacity to capture and react to data as they flow in real-time.  This explains why MongoDB has a feature called MongoDB Change Streams responsible for capturing and responding to data in real-time. Change Streams is a feature introduced to stream information from application to the database in real-time. It is based on an aggregation framework that monitors collections and allows changes to occur in database and database collection. Additionally, MongoDB Change Stream can capture data from IOT sensors and update reports like operational data changes in an enterprise. This blog will open discourse on MongoDB Change Stream and change recommendations in production.

MongoDB Change Streams and Dropping Collection

Renaming or dropping a database or collection will lead to the cursor's closure if there was an open change stream working against the dropped collection. Renaming or dropping a collection makes the cursor with FullDocument: updateLookup to return null on any given lookup document. An error occurs if one attempts to resume after dropping a database with a running change stream. 

Additionally, all the changes in data made before renaming a collection with a change stream running against are lost. The document limit for Change stream is still 16MB BSON; therefore, documents larger than 16MB are not accepted. If one attempts to work with a document larger than 16MB, notification fails, and the document is replaced with another document that meets the set limit.

When a collection or database with change streams opened against it is dropped or renamed, the change stream cursors tend to close as they advance to that point in the oplog. If you change the stream cursors with the full document, the updateLookup option may return null to the search document.

 Therefore, an attempt to resume change streams against a collection that has been dropped will result in an error. Any occurrence of data changes in the collection between the last event of the change stream captured and the collection drop event is lost.

Change of stream response documents must comply with the 16 MB BSON document limit. Depending on the size of the documents in the collection against which you are opening the change stream, notifications may fail if the resulting notification document is more than 16 MB. A good example is the update operations on the change streams set up to return a fully updated document or replace/insert processes with the document at the limit or slightly below the limit.

MongoDB Change Stream and Replica Sets

A MongoDB Replica Set is a collection of processes in MongoDB whose data set does not change; that is, the data set remains the same. In the case of replica sets having arbiter members, change streams are likely to remain idle if sufficient members bearing the data are not available so that the majority cannot commit the operations. For example, we may consider a replica set having three members with two data-bearing nodes alongside an arbiter. In case the secondary happens to be down, e.g., as a result of failure or upgrade or maintenance, it will be impossible for the write operations to be majority committed. The change stream will remain open but will send no notifications. In the scenario at hand, the application may catch up with all operations that have taken place in the period of downtime, as long as the last operation received by the application is still in the oplog of that particular replica set. Additionally,  rs.printReplicationInfo() command is used to retrieve data from oplog; data retrieved includes a range of operations and size of oplog. 

If the downtime is significantly estimated, for example, to perform an upgrade or in the case of a disaster, increasing the oplog size will be the best option to retain the operations for a period that is greater than the approximated downtime. To retrieve oplog status information, the command used is printReplicationInfo(). The command will retrieve not only the oplog status information but also the oplog size and the range of time of the operations.

MongoDB Change Streams Availability

MongoDB change streams are obtainable for replica sets and sharded clusters: Read Concern “majority” Enablement, Storage Engine, and Replica Set Protocol Version. Read Concern “majority” Enablement: Starting with MongoDB version 4.2 and above, change streams are accessible despite the prevailing circumstances of the “majority” read concern support, meaning that the read concern majority support can be enabled or disabled.In MongoDB version 4.0 and older versions, Change streams are only available if the "majority" read concern support is activated.

  1. Storage Engine: WiredTiger storage engine is the storage engine type used by the replica sets and the sharded clusters.
  2. Replica Set Protocol Version: Replica sets and sharded clusters must always use version 1 of the replica set protocol (pv1).

MongoDB Sharded Clusters

Sharded clusters in MongoDB consist of shards, mongos, and config servers. A shard consists of a subset of sharded data. In the case of MongoDB 3.6, shards are utilized as a replica set. Mongos provides an interface between sharded clusters and client applications; mongos plays the role of a query router. From MongoDB version 4.4 and above, mongos supports hedged reads to bring down latencies. Config servers are the storage locations for cluster configuration settings and metadata.

Change streams use a global logical clock to provide a global ordering of changes across shards. MongoDB ensures that the order of changes is maintained and that the change stream's notifications can be interpreted safely in the order they were received. For example, the change stream's cursor opened against a 3-shard cluster returns the change's notifications concerning the total order of the changes across the three shards.

To ensure the total ordering of changes, Mongos checks with each shard to see if it has seen more recent changes for each change notification. Sharded clusters with one to several shards with little or no collection activity or are "cold" are likely to have negative effects on the change stream response time since the mongos still have to check with those cold shards to ensure the total ordering of the changes. This effect may be more apparent when shards are geographically distributed or when workloads with most of the operations occur on a subset of shards in a cluster. If the sharded collection has a high level of activity, the mongos may not manage to keep track of the changes across all the shards. Consider using notification filters for this kind of collection, for instance, passing the $match pipeline, which is configured to filter the insert operations only. 

In the case of sharded collections, multi: proper update operations are likely to cause change streams that are opened against the collection to send notifications for orphaned documents. Starting from the time an unsharded collection is sharded until the time when the change stream gets to the first migration chunk, the documentKey in the change stream notification document includes only the document id and not the full shard key.

Conclusion

The purpose of the change streams is to make it possible for the application's data changes in real-time, without the risk of stalking down the oplog and without any trace of complexity. MongoDB applications use change streams to sign to data changes on a database, a collection, or the deployment, and immediately react to them. Since change streams make use of the aggregation framework, applications can filter the specific changes and convert notifications on by themselves.

ClusterControl
The only management system you’ll ever need to take control of your open source database infrastructure.