Severalnines Blog
The automation and management blog for open source databases

Linking & Creating MongoDB Joins Using SQL - Part 3

Multiple JOINS in a single query

Multiple JOINS are normally associated with multiple collections, but you must have a basic understanding of how the INNER JOIN works (see my previous posts on this topic). In addition to our two collections we had before; units and students, let’s add a third collection and label it sports. Populate the sports collection with the data below:

{
    "_id" : 1,"tournamentsPlayed" : 6,
    "gamesParticipated" : [{"hockey" : "midfielder","football" : "stricker","handball" : "goalkeeper"}],
    "sportPlaces" : ["Stafford Bridge","South Africa", "Rio Brazil"]
}
{
    "_id" : 2,"tournamentsPlayed" : 3,
    "gamesParticipated" : [{"hockey" : "goalkeeper","football" : "stricker", "handball" : "midfielder"}],
    "sportPlaces" : ["Ukraine","India", "Argentina"]
}
{
    "_id" : 3,"tournamentsPlayed" : 10,
    "gamesParticipated" : [{"hockey" : "stricker","football" : "goalkeeper","tabletennis" : "doublePlayer"}],
    "sportPlaces" : ["China","Korea","France"]
}

We would like, for example, to return all the data for a student with _id field value equal to 1. Normally, we would write a query to fetch for the _id field value from the students collection, then use the returned value to query for data in the other two collections. Consequently, this will not be the best option especially if a large set of documents is involved. A better approach would be to use the Studio3T program SQL feature. We can query our MongoDB with the normal SQL concept and then try to coarse tune the resulting Mongo shell code to suit our specification. For instance, let’s fetch all data with _id equal to 1 from all the collections:

SELECT  *
  FROM students
    INNER JOIN units
      ON students._id = units._id
    INNER JOIN sports
      ON students._id = sports._id
  WHERE students._id = 1;

The resulting document will be:

{ 
    "students" : {"_id" : NumberInt(1),"name" : "James Washington","age" : 15.0,"grade" : "A","score" : 10.5}, 
    "units" : {"_id" : NumberInt(1),"grades" : {Maths" : "A","English" : "A","Science" : "A","History" : "B"}
    }, 
    "sports" : {
        "_id" : NumberInt(1),"tournamentsPlayed" : NumberInt(6), 
        "gamesParticipated" : [{"hockey" : "midfielder", "football" : "striker","handball" : "goalkeeper"}], 
        "sportPlaces" : ["Stafford Bridge","South Africa","Rio Brazil"]
    }
}

From the Query code tab, the correspondent MongoDB code will be:

db.getCollection("students").aggregate(
    [{ "$project" : {"_id" : NumberInt(0),"students" : "$$ROOT"}}, 
        { "$lookup" : {"localField" : "students._id","from" : "units","foreignField" : "_id", "as" : "units"}}, 
        { "$unwind" : {"path" : "$units","preserveNullAndEmptyArrays" : false}}, 
        { "$lookup" : {"localField" : "students._id","from" : "sports", "foreignField" : "_id","as" : "sports"}}, 
        { "$unwind" : {"path" : "$sports", "preserveNullAndEmptyArrays" : false}}, 
        { "$match" : {"students._id" : NumberLong(1)}}
    ]
);

Looking into the returned document, personally I am not too happy with the data structure especially with embedded documents. As you can see, there are _id fields returned and for the units we may not need the grades field to be embedded inside the units.

We would want to have a units field with embedded units and not any other fields. This leads us to the coarse tune part. Like in the previous posts, copy the code using the copy icon provided and go to the aggregation pane, paste the contents using the paste icon.

First things first, the $match operator should be the first stage, so move it to the first position and have something like this:

Click the first stage tab and modify the query to:

{
    "_id" : NumberLong(1)
}

We then need to modify the query further to remove many embedding stages of our data. To do so, we add new fields to capture data for the fields we want to eliminate i.e.:

db.getCollection("students").aggregate(
    [
        { "$project" : { "_id" : NumberInt(0), "students" : "$$ROOT"}}, 
        { "$match" : {"students._id" : NumberLong(1)}}, 
        { "$lookup" : { "localField" : "students._id", "from" : "units","foreignField" : "_id", "as" : "units"}}, 
        { "$addFields" : { "_id": "$students._id","units" : "$units.grades"}}, 
        { "$unwind" : { "path" : "$units",  "preserveNullAndEmptyArrays" : false}}, 
        { "$lookup" : {"localField" : "students._id", "from" : "sports", "foreignField" : "_id", "as" : "sports"}}, 
        { "$unwind" : { "path" : "$sports","preserveNullAndEmptyArrays" : false}}, 
        { "$project" : {"sports._id" : 0.0}}
        ]
);

As you can see, in the fine tuning process we have introduced new field units which will overwrite the contents of the previous aggregation pipeline with grades as an embedded field. Further, we have made an _id field to indicate that the data was in relation to any documents in the collections with the same value. The last $project stage is to remove the _id field in the sports document such that we may have a neatly presented data as below.

{  "_id" : NumberInt(1), 
    "students" : {"name" : "James Washington", "age" : 15.0,  "grade" : "A", "score" : 10.5}, 
    "units" : {"Maths" : "A","English" : "A", "Science" : "A","History" : "B"}, 
    "sports" : {
        "tournamentsPlayed" : NumberInt(6), 
        "gamesParticipated" : [{"hockey" : "midfielder","football" : "striker","handball" : "goalkeeper"}],  
        "sportPlaces" : ["Stafford Bridge", "South Africa", "Rio Brazil"]
        }
}

We can also restrict on which fields should be returned from the SQL point of view. For example we can return the student name, units this student is doing and the number of tournaments played using multiple JOINS with the code below:

SELECT  students.name, units.grades, sports.tournamentsPlayed
  FROM students
    INNER JOIN units
      ON students._id = units._id
    INNER JOIN sports
      ON students._id = sports._id
  WHERE students._id = 1;

This does not give us the most appropriate result. So as usual, copy it and paste in the aggregation pane. We fine tune with the code below to get the appropriate result.

db.getCollection("students").aggregate(
    [
        { "$project" : { "_id" : NumberInt(0), "students" : "$$ROOT"}}, 
        { "$match" : {"students._id" : NumberLong(1)}}, 
        { "$lookup" : { "localField" : "students._id", "from" : "units","foreignField" : "_id", "as" : "units"}}, 
        { "$addFields" : {"units" : "$units.grades"}}, 
        { "$unwind" : { "path" : "$units",  "preserveNullAndEmptyArrays" : false}}, 
        { "$lookup" : {"localField" : "students._id", "from" : "sports", "foreignField" : "_id", "as" : "sports"}}, 
        { "$unwind" : { "path" : "$sports","preserveNullAndEmptyArrays" : false}}, 
        { "$project" : {"name" : "$students.name", "grades" : "$units.grades", "tournamentsPlayed" : "$sports.tournamentsPlayed"}
        }}
        ]
);

This aggregation result from the SQL JOIN concept gives us a neat and presentable data structure shown below.

{ 
    "name" : "James Washington", 
    "grades" : {"Maths" : "A", "English" : "A", "Science" : "A", "History" : "B"}, 
    "tournamentsPlayed" : NumberInt(6)
}

Pretty simple, right? The data is quite presentable as if it was stored in a single collection as a single document.

LEFT OUTER JOIN

The LEFT OUTER JOIN is normally used to show documents that do not conform to the most portrayed relationship. The resulting set of a LEFT OUTER join contains all rows from both collections that meet the WHERE clause criteria, same as an INNER JOIN result set. Besides, any documents from the left collection that do not have matching documents in the right collection will also be included in the result set. The fields being selected from the right side table will return NULL values. However, any documents in the right collection, which do not have matching criteria from the left collection, are not returned.

Take a look at these two collections:

students

{"_id" : 1,"name" : "James Washington","age" : 15.0,"grade" : "A","score" : 10.5}
{"_id" : 2,"name" : "Clinton Ariango","age" : 14.0,"grade" : "B","score" : 7.5}
{"_id" : 4,"name" : "Mary Muthoni","age" : 16.0,"grade" : "A","score" : 11.5}

Units

{"_id" : 1,"Maths" : "A","English" : "A","Science" : "A","History" : "B"}
{"_id" : 2,"Maths" : "B","English" : "B","Science" : "A","History" : "B"}
{"_id" : 3,"Maths" : "A","English" : "A","Science" : "A","History" : "A"}

In the students collection we don’t have _id field value set to 3 but in the units collection we have. Likewise, there is no _id field value 4 in in the units collection. If we use the students collection as our left option in the JOIN approach with the query below:

SELECT *
  FROM students
    LEFT OUTER JOIN units
      ON students._id = units._id

With this code we will get the following result:

{
    "students" : {"_id" : 1,"name" : "James Washington","age" : 15,"grade" : "A","score" : 10.5},
    "units" : {"_id" : 1,"grades" : {"Maths" : "A","English" : "A", "Science" : "A","History" : "B"}}
}
{
    "students" : {"_id" : 2,"name" : "Clinton Ariango", "age" : 14,"grade" : "B", "score" : 7.5 }
}
{
    "students" : {"_id" : 3,"name" : "Mary Muthoni","age" : 16,"grade" : "A","score" : 11.5},
    "units" : {"_id" : 3,"grades" : {"Maths" : "A","English" : "A","Science" : "A","History" : "A"}}
}

The second document does not have the units field because there was no matching document in the units collection. For this SQL query, the correspondent Mongo Code will be

db.getCollection("students").aggregate(
    [
        { 
            "$project" : {"_id" : NumberInt(0), "students" : "$$ROOT"}}, 
        { 
            "$lookup" : {"localField" : "students._id",  "from" : "units", "foreignField" : "_id", "as" : "units"}
        }, 
        { 
            "$unwind" : { "path" : "$units", "preserveNullAndEmptyArrays" : true}
        }
    ]
);

Of course we have learnt about fine-tuning, so you can go ahead and restructure the aggregation pipeline to suite the end result you would like. SQL is a very powerful tool as far as database management is concerned. It is a broad subject on its own, you can also try to use the IN and the GROUP BY clauses to get the correspondent code for MongoDB and see how it works.

Conclusion

Getting used to a new (database) technology in addition to the one you are used to working with can take a lot of time. Relational databases are still more common than the non-relational ones. Nevertheless, with the introduction of MongoDB, things have changed and people would like to learn it as fast as possible because of its associated powerful performance.

Learning MongoDB from scratch can be a bit tedious, but we can use the knowledge of SQL to manipulate data in MongoDB, get the relative MongoDB code and fine tune it to get the most appropriate results. One of the tools that is available to enhance this is Studio 3T. It offers two important features that facilitate the operation of complex data, that is: SQL query feature and the Aggregation editor. Fine tuning queries will not only ensure you get the best result but also improve on the performance in terms of time saving.