Factors to Consider When Choosing MongoDB for Big Data Applications

Onyancha Brian Henry

Technology advancements have brought about advantages that business organizations need to exploit for maximum profit and reduced operational cost. Data has been the backbone of these technological advancements, from which sophisticated procedures are derived to achieve specific goals. As technology advances, more data is brought into systems. Besides, as a business grows, more data is involved, and the serving system needs to process data quickly, store it reliably, and offer optimal security for this data. MongoDB is one of the systems that can be trusted to achieve these goals.

Big Data refers to massive, fast-changing data that can be quickly accessed and is highly available for addressing needs efficiently. Business organizations tend to cross-examine available database setups to find which would provide the best performance over time and consequently realize some value from Big Data.

For instance, online markets observe client web clicks and purchasing power, then use the derived data to suggest other goods as a way of advertising, or use the data in pricing. Robots learn through machine learning, and the process obviously involves a lot of data being collected, because the robot has to keep what it has learned in memory for later usage. Keeping this kind of complex data with traditional database software is considered impractical.

Characteristics of Big Data

In software systems, we consider Big Data in terms of size, speed of access, and the data types involved. This can be reduced down to 3 parameters:

  1. Volume
  2. Velocity
  3. Variety

Volume

Volume is the size of the Big Data involved and ranges from gigabytes to terabytes or more. On a daily basis, big companies ingest terabytes of data from their operations. For instance, a telecommunication company would like to keep a record of calls made since the beginning of their operation, messages sent, and how long each call took. On a daily basis, there are a lot of these activities taking place, hence resulting in a lot of data. The data can then be used in statistical analysis, decision making, and tariff planning.

Velocity

Consider platforms such as Forex trading that need real-time updates to all connected client machines and display new stock exchange updates in real time. This dictates that the serving database should be quite fast in processing such data, with minimal latency. Some online games involving players from different world locations collect a lot of data from user clicks, drags, and other gestures, then relay it between millions of devices in microseconds. The database system involved needs to be quick enough to do all this in real time.

Variety

Data can be categorized into different types, ranging from numbers, strings, dates, objects, arrays, binary data, code, and geospatial data to regular expressions, just to mention a few. An optimal database system should provide functions in place to enhance the manipulation of this data without incurring additional procedures from the client side. For example, MongoDB provides geolocation operations for use when fetching locations near the coordinates provided in a query. This capability cannot be achieved with traditional databases, since they were only designed to address small data volumes, fewer updates, and consistent data structures. Besides, in the case of traditional databases, one will need additional operations to achieve some specific goal.

MongoDB can also be run across multiple servers, making it inexpensive and virtually infinitely scalable, contrary to traditional databases that are only designed to run on a single server.

Factors to Consider When Choosing MongoDB for Big Data

Big Data brings about enterprise advantage when it is well managed through improved processing power. When selecting a database system, one should consider some factors regarding the kind of data to be dealt with and whether the system being selected provides that capability. In this blog, we are going to discuss the advantages MongoDB offers for Big Data, in comparison with Hadoop in some cases.

  • A rich query language for dynamic querying
  • Data embedding
  • High availability
  • Indexing and Scalability
  • Efficient storage engine and Memory handling
  • Data consistency and integrity

Rich Query Language for Dynamic Querying

MongoDB is best suited for Big Data where the resulting data needs further manipulation for the desired output. Some of the powerful resources are CRUD operations, the aggregation framework, text search, and the Map-Reduce feature. Within the aggregation framework, MongoDB has extra geolocation functionality that can enable one to do many things with geospatial data. For example, by creating a 2dsphere index, you can fetch locations within a defined radius by just providing the latitude and longitude coordinates. Referring to the telecommunication example above, the company may use the Map-Reduce feature or the aggregation framework to group calls from a given location, calculate the average call time on a daily basis for its users, or run other operations. Check the example below.

Let’s have a places collection with the data:

{ name: "KE",loc: { type: "Point", coordinates: [ -73.97, 40.77 ] }, category: "Parks"}

{ name: "UG",loc: { type: "Point", coordinates: [ -45.97, 40.57 ] }, category: "Parks"}

{ name: "TZ",loc: { type: "Point", coordinates: [ -73.27, 34.43 ] }, category: "Parks"}

{ name: "SA",loc: { type: "Point", coordinates: [ -67.97, 40.77 ] }, category: "Parks"}

We can then find data for locations that are near [-73.00, 40.00], using the aggregation framework, and within a distance of 10KM (maxDistance is given in meters) with the query below:

db.places.aggregate([
   {
      $geoNear: {
         near: { type: "Point", coordinates: [ -73.00, 40.00 ] },
         spherical: true,
         query: { category: "Parks" },
         distanceField: "calcDistance",
         maxDistance: 10000
      }
   }
])
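The spherical distance that $geoNear computes is a great-circle distance over the Earth's surface. As a rough illustration of the arithmetic involved (a plain-JavaScript haversine sketch, not MongoDB's actual implementation, using hypothetical documents):

```javascript
// Great-circle (haversine) distance between two [longitude, latitude]
// points, in meters -- the same kind of distance $geoNear reports in
// distanceField when spherical: true is set.
function haversineMeters([lon1, lat1], [lon2, lat2]) {
  const R = 6371000; // mean Earth radius in meters
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(lat2 - lat1);
  const dLon = toRad(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(a));
}

// Filter hypothetical documents to those within maxDistance of the point,
// mimicking what the $geoNear stage above does server-side.
const places = [
  { name: "KE", coordinates: [-73.97, 40.77] },
  { name: "SA", coordinates: [-67.97, 40.77] },
];
const near = [-73.0, 40.0];
const within10km = places.filter(
  (p) => haversineMeters(near, p.coordinates) <= 10000
);
```

In MongoDB the 2dsphere index makes this search efficient without scanning every document; the sketch only shows the distance computation itself.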

The Map-Reduce operation is also available in Hadoop, but it is suitable for simple requests. The iterative process for Big Data using Map-Reduce in Hadoop is quite slow compared to MongoDB. The reason behind this is that iterative tasks require many map and reduce processes before completion. In the process, multiple files are generated between the map and reduce tasks, making it quite unusable for advanced analysis. MongoDB introduced the aggregation pipeline framework to curb this setback, and it is the most used in the recent past.
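Returning to the telecommunication example, grouping and averaging is exactly what a single $group stage does in the aggregation pipeline. Here is a plain-JavaScript sketch of the same computation over hypothetical call records:

```javascript
// Hypothetical call records; in MongoDB these would be documents in a
// "calls" collection, and the grouping below would be one pipeline stage:
//   { $group: { _id: "$location", avgDuration: { $avg: "$duration" } } }
const calls = [
  { location: "Nairobi", duration: 120 },
  { location: "Nairobi", duration: 60 },
  { location: "Kampala", duration: 300 },
];

function averageDurationByLocation(records) {
  const totals = {}; // location -> { sum, count }
  for (const { location, duration } of records) {
    if (!totals[location]) totals[location] = { sum: 0, count: 0 };
    totals[location].sum += duration;
    totals[location].count += 1;
  }
  // Reduce the running totals to per-location averages.
  return Object.fromEntries(
    Object.entries(totals).map(([loc, t]) => [loc, t.sum / t.count])
  );
}

const averages = averageDurationByLocation(calls);
```

The pipeline version runs server-side in one pass, which is what makes it faster than chaining Map-Reduce jobs for this kind of iterative analysis.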

Data Embedding

MongoDB is document-based, with the ability to put more fields inside a single field, which is termed embedding. Embedding comes with the advantage of fewer queries being issued for a single document, since the document itself can hold a lot of data. For relational databases, where one might have many tables, you have to issue multiple queries to the database for the same purpose.
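To make the contrast concrete, here is a sketch with hypothetical data: an embedded document returns a user and their orders in one read, while a normalized layout needs a second lookup (a join, in SQL terms) to assemble the same view.

```javascript
// Embedded: one document holds the user together with their orders,
// so a single read returns everything (hypothetical shape).
const userDoc = {
  _id: 1,
  name: "Alice",
  orders: [
    { item: "book", price: 12 },
    { item: "pen", price: 2 },
  ],
};

// Normalized: users and orders live in separate structures, so a second
// lookup is required to assemble the same view.
const users = { 1: { name: "Alice" } };
const orders = [
  { userId: 1, item: "book", price: 12 },
  { userId: 1, item: "pen", price: 2 },
];
const assembled = {
  ...users[1],
  orders: orders.filter((o) => o.userId === 1),
};
```

Both views end up identical in shape; the difference is how many round trips to the database were needed to build them.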

High Availability

Replication of data across multiple hosts and servers is possible with MongoDB, unlike relational DBMS, where replication is often restricted to a single server. This is advantageous in that data is highly available in different locations and users can be efficiently served by the closest server. Besides, the process of restoration after a breakdown is easily achieved, considering the journaling feature in MongoDB that creates checkpoints from which the restoration process can be referenced.

Indexing and Scalability

Primary and secondary indexing in MongoDB comes with plenty of merits. Indexing makes queries execute faster, which is a consideration needed for Big Data, as we discussed under the velocity characteristic of Big Data. Indexing can also be used in creating shards. Shards can be defined as sub-collections that contain data that has been distributed into groups using a shard key. When a query is issued, the shard key is used to determine where to look among the available shards. If there were no shards, the process would take quite long for Big Data, since all the documents have to be looked into and the process may even time out before users get what they wanted. But with sharding, the amount of data to be fetched is reduced, consequently reducing the latency of waiting for a query to be returned.
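A toy plain-JavaScript model of shard-key routing (not MongoDB's actual hashing, just the idea): a document's shard key is hashed to pick one shard, so a later query on that key only has to scan a single shard's documents.

```javascript
// Hash a shard key value to pick one of shardCount shards (toy hash).
function pickShard(shardKey, shardCount) {
  let hash = 0;
  for (const ch of String(shardKey)) {
    hash = (hash * 31 + ch.charCodeAt(0)) >>> 0; // simple string hash
  }
  return hash % shardCount;
}

const SHARDS = 3;
const shards = Array.from({ length: SHARDS }, () => []);

// "Insert" documents: each lands on the shard its key hashes to.
for (const doc of [{ user: "alice" }, { user: "bob" }, { user: "carol" }]) {
  shards[pickShard(doc.user, SHARDS)].push(doc);
}

// A query by shard key scans only one shard instead of every document.
function findByUser(user) {
  return shards[pickShard(user, SHARDS)].filter((d) => d.user === user);
}
```

In real MongoDB the mongos router performs this targeting using the cluster's shard-key metadata; the point is that a well-chosen shard key turns a full scan into a lookup against one shard.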

Efficient Storage Engine and Memory Handling

Recent MongoDB versions set WiredTiger as the default storage engine, which has an excellent capability for handling multiple workloads. This storage engine has plenty of advantages to serve Big Data, as described in this article. The engine has features such as compression and checkpointing, and it promotes multiple write operations through document-level concurrency. Big Data means many users, and the document-level concurrency feature will allow many users to edit the database simultaneously without incurring any performance setback. MongoDB has been developed using C++, hence making it good for memory handling.

Data Consistency and Integrity

The JSON validator is another feature available in MongoDB to ensure data integrity and consistency. It is used to ensure invalid data does not get into the database. For example, if there is a field called age, it will always expect an Integer value. The JSON validator will always check that a string or any other data type is not submitted for storage to the database for this field. This is also to ensure that all documents have values for this field of the same data type, hence data consistency. MongoDB also offers backup and restoration features, such that in case of failure one can get back to the desired state.
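In MongoDB, such a rule is declared on the collection with a $jsonSchema validator. The plain-JavaScript sketch below mimics the kind of check the server performs on the hypothetical age field; the collection name and schema are assumptions for illustration.

```javascript
// In MongoDB the rule would be declared on the collection, e.g.:
//   db.createCollection("users", { validator: { $jsonSchema: {
//     required: ["age"], properties: { age: { bsonType: "int" } } } } })
// Below, the same check is mimicked in plain JavaScript.
function validateUser(doc) {
  return Number.isInteger(doc.age);
}

// Reject invalid documents at write time, as the server would.
function insertUser(collection, doc) {
  if (!validateUser(doc)) {
    throw new Error("Document failed validation: age must be an integer");
  }
  collection.push(doc);
  return doc;
}

const userCollection = [];
insertUser(userCollection, { name: "Alice", age: 30 }); // accepted
// insertUser(userCollection, { name: "Bob", age: "thirty" }) would throw.
```

Because the check happens on every write, every stored document is guaranteed to carry the field in the expected type, which is what gives the consistency described above.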

Conclusion

MongoDB handles real-time data analysis in the most efficient way hence suitable for Big Data. For instance, geospatial indexing enables an analysis of GPS data in real time. 

Besides the basic security configuration, MongoDB has an extra JSON data validation tool to ensure only valid data gets into the database. Since the database is document-based and fields can be embedded, very few queries need to be issued to the database to fetch a lot of data. This makes it ideal for usage where Big Data is concerned.
