
Storing Files in MongoDB with GridFS

Onyancha Brian Henry

Published: September 11, 2020
Last Updated: May 4, 2022

Many applications manage files, and file storage is often an important part of how they process data. Files are commonly kept in a third-party storage service or CDN (Content Delivery Network), such as those offered by Amazon Web Services, but this makes management more tedious. It is easier to access all your resources from a single store rather than several, since each additional service is another point where retrieval can fail.

Storing files directly in a database through a single API call was not easy until MongoDB introduced GridFS.

What Is MongoDB GridFS?

GridFS is an abstraction layer in MongoDB for storing and retrieving large files such as videos, audio, and images. It lets MongoDB hold files larger than 16 MB, the maximum size of a single BSON document. Each file is first broken into smaller chunks, each 255 KB in size by default (the last chunk may be smaller).
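
The chunking arithmetic can be sketched in a few lines of JavaScript. This is only an illustration, not driver code: the helper names chunkCount and lastChunkSize are hypothetical, and the 20 MB file size is an assumed example.

```javascript
// Default GridFS chunk size: 255 KB expressed in bytes.
const DEFAULT_CHUNK_SIZE = 255 * 1024; // 261120

// Hypothetical helper: number of chunks needed for a file of `length` bytes.
function chunkCount(length, chunkSize = DEFAULT_CHUNK_SIZE) {
  return Math.ceil(length / chunkSize);
}

// Hypothetical helper: size in bytes of the final chunk, which may be
// smaller than chunkSize when the file length is not an exact multiple.
function lastChunkSize(length, chunkSize = DEFAULT_CHUNK_SIZE) {
  if (length === 0) return 0;
  const remainder = length % chunkSize;
  return remainder === 0 ? chunkSize : remainder;
}

// Assumed example: a 20 MB file, which is above the 16 MB document limit.
const twentyMB = 20 * 1024 * 1024;
console.log(chunkCount(twentyMB));    // 81
console.log(lastChunkSize(twentyMB)); // 81920
```

With the default chunk size, 80 chunks are full and the 81st holds the remaining 81920 bytes.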

GridFS uses two collections to store files:

Chunks: This collection stores the binary parts of each file. Each chunk is 255 KB by default, and when you query for a file, the GridFS driver reassembles its chunks in order, using the file's unique _id. You can also retrieve only a segment of a file, such as part of a video, by requesting just the byte range you want.
Files: This collection stores the metadata for each file.
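
To make the idea of reading only a segment concrete, the sketch below maps a byte range onto the chunk sequence numbers that contain it. The helper names chunksForRange and rangeFilter are hypothetical, the default chunk size of 255 KB is assumed, and the returned object is simply a filter in the shape one could pass to a find on the chunks collection.

```javascript
// Assumes the default GridFS chunk size of 255 KB.
const CHUNK_SIZE_BYTES = 255 * 1024;

// Hypothetical helper: find which chunks cover the byte range [start, end).
function chunksForRange(start, end, chunkSize = CHUNK_SIZE_BYTES) {
  if (start < 0 || end <= start) {
    throw new RangeError("expected 0 <= start < end");
  }
  return {
    firstChunk: Math.floor(start / chunkSize),
    lastChunk: Math.floor((end - 1) / chunkSize),
    // Bytes to skip inside the first chunk before the range begins.
    skipInFirst: start % chunkSize,
  };
}

// Hypothetical helper: a query filter selecting only the needed chunks.
function rangeFilter(fileId, start, end, chunkSize = CHUNK_SIZE_BYTES) {
  const { firstChunk, lastChunk } = chunksForRange(start, end, chunkSize);
  return { files_id: fileId, n: { $gte: firstChunk, $lte: lastChunk } };
}

// Assumed example: bytes 1,000,000 up to (not including) 2,000,000.
console.log(chunksForRange(1000000, 2000000));
// { firstChunk: 3, lastChunk: 7, skipInFirst: 216640 }
```

Only five chunks need to be read for this range, regardless of how large the whole file is.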

The two collections are placed in a common bucket, and each collection name is prefixed with the bucket name. The default bucket name is fs, which gives:

fs.chunks
fs.files

You can choose a different bucket name, but the full collection name is subject to the namespace length limit of 255 bytes.
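
The naming rule can be sketched as follows. The helpers bucketCollections and namespaceBytes are hypothetical, the bucket name photos and database name gridfs are assumed examples, and the sketch assumes the namespace is counted as the database name, a dot, and the collection name.

```javascript
// Hypothetical helper: collection names derived from a bucket name.
function bucketCollections(bucketName = "fs") {
  return {
    files: `${bucketName}.files`,
    chunks: `${bucketName}.chunks`,
  };
}

// Hypothetical helper: byte length of "<database>.<collection>".
// Assumption: this is how the namespace length is measured.
function namespaceBytes(dbName, collectionName) {
  return Buffer.byteLength(`${dbName}.${collectionName}`, "utf8");
}

console.log(bucketCollections());
// { files: 'fs.files', chunks: 'fs.chunks' }

const photos = bucketCollections("photos");
console.log(photos.chunks);                           // photos.chunks
console.log(namespaceBytes("gridfs", photos.chunks)); // 20
```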

Chunks Collection

Chunk collection documents have the form:

{
  "_id" : <ObjectId>,
  "files_id" : <ObjectId>,
  "n" : <num>,
  "data" : <binary>
}

Where:

_id: is the unique identifier for the chunk
files_id: is the _id of the parent document as stored in the files collection
n: is the sequence number of the chunk starting with 0.
data: is the chunk’s payload as BSON Binary type.

A compound index on the files_id and n fields allows efficient retrieval of chunks, for example:

db.fs.chunks.find( { files_id: fileId } ).sort( { n: 1 } )

To create this index if it does not exist, run the following command in the mongo shell:

db.fs.chunks.createIndex( { files_id: 1, n: 1 }, { unique: true } );
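
What a driver does with the sorted chunks can be sketched in plain JavaScript. This is not the driver's actual implementation: the function reassemble is hypothetical, the sample chunk documents are made up, and Node.js Buffer values stand in for the BSON Binary payloads.

```javascript
// Hypothetical helper: rebuild a file from its chunk documents.
function reassemble(chunks) {
  // Same ordering as sort({ n: 1 }) in the query above.
  const ordered = [...chunks].sort((a, b) => a.n - b.n);

  // Sequence numbers must run 0, 1, 2, ... with no gaps or duplicates.
  ordered.forEach((chunk, i) => {
    if (chunk.n !== i) {
      throw new Error(`expected chunk ${i} but found chunk ${chunk.n}`);
    }
  });

  return Buffer.concat(ordered.map((chunk) => chunk.data));
}

// Made-up chunk documents, deliberately out of order.
const sampleChunks = [
  { files_id: 1, n: 1, data: Buffer.from("lo, Grid") },
  { files_id: 1, n: 0, data: Buffer.from("Hel") },
  { files_id: 1, n: 2, data: Buffer.from("FS") },
];

console.log(reassemble(sampleChunks).toString()); // Hello, GridFS
```

The unique index on files_id and n is what guarantees that no two chunks of the same file share a sequence number.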

Files Collection

Documents in this collection take the form:

{
  "_id" : <ObjectId>,
  "length" : <num>,
  "chunkSize" : <num>,
  "uploadDate" : <timestamp>,
  "filename" : <string>,
  "metadata" : <any>
}

Where:

_id: is the unique identifier for the file. Its data type is whatever you choose for the original document; the MongoDB default is a BSON ObjectId.
length: is the size of the file in bytes.
chunkSize: is the size of each chunk in bytes; the default is 255 KB.
uploadDate: is a field of type Date that stores when the file was first stored.
filename: is an optional, human-readable name for the file.
metadata: is an optional field that holds any additional information you want to store.

An example of an fs.files document is shown below.

{
   "filename": "file.html",
   "chunkSize": NumberInt(23980),
   "uploadDate": ISODate("2020-08-11T10:02:15.237Z"),
   "length": NumberInt(312)
}

As with the chunks collection, a compound index on the filename and uploadDate fields allows efficient retrieval of files, for example:

db.fs.files.find( { filename: fileName } ).sort( { uploadDate: 1 } )

To create this index if it does not exist, run the following command in the mongo shell:

db.fs.files.createIndex( { filename: 1, uploadDate: 1 } );
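
Sorting by uploadDate matters because several files can share one filename. The sketch below picks the most recent upload from a list of files documents. The function latestByFilename is hypothetical, and the second and third sample documents are made up for illustration.

```javascript
// Hypothetical helper: newest files document with the given filename.
function latestByFilename(files, filename) {
  const matches = files
    .filter((file) => file.filename === filename)
    // Ascending by upload date, like sort({ uploadDate: 1 }).
    .sort((a, b) => a.uploadDate - b.uploadDate);
  return matches.length > 0 ? matches[matches.length - 1] : null;
}

// The first document mirrors the earlier example; the others are made up.
const sampleFiles = [
  { filename: "file.html", uploadDate: new Date("2020-08-11T10:02:15.237Z"), length: 312 },
  { filename: "file.html", uploadDate: new Date("2020-09-01T08:00:00.000Z"), length: 420 },
  { filename: "other.html", uploadDate: new Date("2020-10-01T08:00:00.000Z"), length: 99 },
];

console.log(latestByFilename(sampleFiles, "file.html").length); // 420
console.log(latestByFilename(sampleFiles, "missing.html"));     // null
```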

When to Use the MongoDB GridFS Storage System

MongoDB GridFS is not commonly used, but the following conditions may call for it:

When the file system limits the number of files that can be stored in a given directory.
When you need to access only part of a stored file. GridFS lets you read sections of a file without loading the whole file.
When you want files and their metadata distributed across geographically separate replica sets. GridFS lets MongoDB keep the files and metadata in sync and deploy them across multiple systems automatically.

When Not to Use the MongoDB GridFS Storage System

GridFS is, however, not appropriate when you need to update the content of an entire file atomically.

How to Add Files to GridFS

To store an mp3 file in MongoDB using GridFS, follow this procedure:

Open the terminal (the command prompt).
Navigate to mongofiles.exe (it is located in the bin folder).
Run the command:
```
>mongofiles.exe -d gridfs put song.mp3
```

In this command, gridfs is the name of the database to use. If that database does not exist, MongoDB creates it automatically and stores the file in it.

To view the file stored in GridFS, run the query below in the mongo shell:

>db.fs.files.find()

The command returns a document in the format shown below:

{
   _id: ObjectId('526a922bf8b4aa4d33fdf84d'),
   filename: "song.mp3",
   chunkSize: 233390,
   uploadDate: new Date(1397391643474),
   md5: "e4f53379c909f7bed2e9d631e15c1c41",
   length: 10302960
}

The document includes the file's _id, filename, chunk size, upload date, MD5 hash, and length. The chunks in the fs.chunks collection can be viewed using the _id returned by the initial query, as shown below.

>db.fs.chunks.find({files_id:ObjectId('526a922bf8b4aa4d33fdf84d')})

GridFS Sharding

Sharding can also be applied to GridFS. To shard the chunks collection, use either { files_id : 1, n : 1 } or { files_id : 1 } as the shard key index.

Hashed sharding of the chunks collection is only possible if the MongoDB driver does not run filemd5.

The files collection is often left unsharded because it contains only metadata and is very small, and the available keys do not provide an even distribution across a sharded cluster. If you do need to shard the files collection, use the _id field in combination with an application field.

GridFS Limitations

GridFS has the following limitations:

Atomic update: GridFS does not support atomic updates of a file. A workaround is to keep multiple versions of each file and select the required version manually.
Performance: serving files through GridFS tends to be slower than serving them directly from a file system or web server.
Working set: serving files alongside other database content can disturb the running working set, so files are often served from a separate server.

Conclusion

GridFS is a useful tool for developers who need to store large files in MongoDB. It lets them store large files and retrieve only the parts they need, which makes it an excellent MongoDB feature for a wide range of applications.
