blog
Care To Know Clauses: All About SELECT, FROM, WHERE, GROUP BY, HAVING, ORDER BY, and LIMIT
SQL is a language of databases and PostgreSQL is our chosen one. Oftentimes, storing data is but one facet of the process. Typically, in any data-centered endeavor, you will: view and read data, take action or implement changes on the data, garner decision-making information (analytics), or manipulate the stored data in some form or fashion.
SQL is composed of a combination of keywords, commands, and clauses. SQL seems simple. Just a few ‘easy‘ commands here and there. No big deal, right?
But, there is more to SQL than meets the eye. SQL can trip you up on those ‘easy‘ queries.
One challenge (that I must routinely revisit) is understanding that SQL execution order is definitely different from that of its syntax.
In this blog post, I visit, at a high-level, the major SQL clauses as they apply to PostgreSQL. There are many dialects of SQL but PostgreSQL’’s interpretation is the focus here. (Some characteristics of each clause very well may apply to other SQL dialects.)
SQL clauses form the foundation for basic, often-used commands and queries. That being said, advanced queries and examples utilizing Window Functions, CTE’s, Derived Tables, etc will not be covered in this post.
As we will see, not all clauses are created equal. Yet, they do operate in tandem, providing query results seamlessly (or not).
Allow me to clarify…
I will periodically make mention of an execution order throughout the blog post as it applies to many of the clauses. But, this is generalized.
To my understanding, more often than not, the optimizer chooses and decides the best query plan for execution.
SELECT – The ‘picky’ Clause Used to Query the Database
SELECT is one busy clause. It is everywhere. Used more than all the other clauses. Certain clauses you may not need at all. Not so much the case with SELECT, for it is a mandatory clause.
The SELECT clause is typically used for querying the database, containing (at a basic level):
- A SELECT list – The columns of data you want.
- the source data set(s) – named in the FROM clause. Tables, Views, CTE’s, etc. This is where the data comes from.
- an optional WHERE clause used to filter rows provided by the FROM clause.
(The FROM and WHERE clauses will be discussed in their respective sections.)
In truth, I would say that the SELECT clause is required in PostgreSQL to retrieve anything. But then, there is the TABLE command that does return all rows and columns from a table.
Yet, there is separation between the two. SELECT can specify individual columns, but with the TABLE command, all columns are returned.
SELECT highlights:
- SELECT * is shorthand notation and returns all columns from the data source(s).
- Although SELECT is syntax-wise named as the first clause (with the exception of those queries using a WITH clause: not discussed here), it is not executed first. Notably, SELECT is not the last clause to execute either.
- An expression (or any column) can be given a reference name or ALIAS in the SELECT clause, with a caveat. Those given names can be used in ORDER BY and GROUP BY clauses but not WHERE or HAVING clauses.
- When a GROUP BY clause is present (or aggregate functions) in the query, SELECT should not name any ungrouped column(s). Only those columns in any aggregate function(s) or those functionally dependent on the grouped column(s).
- Not only does SELECT return specific columns, but its use also extends to INSERT and CREATE TABLE statements.
- The SELECT clause is far from simple.
See the Official PostgreSQL SELECT clause documentation section for in-depth coverage.
FROM – Provides the Data Source(s) for the Query
FROM is mostly a mandatory clause. I term this ‘loosely‘ because of the available TABLE command (mentioned above), which does not require the FROM clause.
Then again, you can select arbitrary expressions, with no named table in a SELECT query. However, with TABLE, that’s not possible.
Here is an example in psql:
learning=> SELECT 2+2;
?column?
----------
4
(1 row)
But with TABLE:
learning=> TABLE 2+2;
ERROR: syntax error at or near "2"
LINE 1: TABLE 2+2;
^
Some SQL dialects even allow naming a non-existent table to mitigate not having an actual table in the FROM clause. Yet, in PostgreSQL as you can see from the simple query above, it is not required.
But, if you need actual stored data returned apart from simple expressions, you are going to need the FROM clause. Without it, there is no data to even operate on.
Therefore FROM is absolutely required for querying any tables.
In Postgres, all named tables in the FROM clause are first cross-joined (if a WITH clause is not present) in the execution order which establishes a Cartesian Product. This makes sense right since we need data to work with.
The FROM documentation here also notes that typically, this data set is reduced to a small number of rows via a present WHERE clause condition.
The FROM clause accepts a number of specific elements. Here are just a few (see linking documentation below for the full list):
- Table name (obviously we need this).
- A VIEW.
- A SELECT statement (a subquery).
- CTE name (WITH clause).
- Type of JOIN – if any.
- A function (I was not aware of this. How cool!!!)
FROM highlights:
- Although FROM is syntactically listed as the second clause in a SELECT query, it is executed first.
- FROM provides (by loading) all rows from any tables (real or virtual) named in its clause.
- Table names can be aliased in the FROM clause (e.g., FROM shoe AS s) but need to be referenced by that ALIAS throughout the query moving forward.
- FROM is a mandatory clause when querying tables.
See the Official PostgreSQL FROM clause section for in-depth coverage.
WHERE – Filters Out Rows From the Data Source(s) Based on Boolean Validation Conditional Expression(s)
WHERE is an optional clause. Yet, when present in a query, its duty is to remove those records provided by the FROM clause that do not pass its boolean conditional check.
The WHERE clause also has profound use with other SQL commands in addition to SELECT. Namely, DML commands like INSERT (not directly, but via SELECT), UPDATE, and DELETE.
As a matter of fact, without a WHERE clause, UPDATE and DELETE statements would likely affect all target rows. Perhaps not what you intended (Yikes!).
Aggregate functions cannot be used in the boolean conditional expression of the WHERE clause. Any grouping has not yet happened in the execution order. Therefore, aggregates are not available (yet) for the WHERE clause.
WHERE evaluation is based on a boolean check using any of the comparison operators. (E.g., >, <, =, <>, etc…)
The WHERE clause cannot access aliased column names listed in the SELECT clause. Since the SELECT clause is actually (not syntax-wise) executed after the WHERE clause, those aliased columns are not yet available.
WHERE highlights:
- Aggregate functions are not accessible and cannot be used in the boolean conditional check of a WHERE clause. (The WHERE clause is possibly responsible for any rows being provided to aggregate functions and grouping for computation.)
- Aliased columns in the SELECT clause cannot be referenced in the WHERE clause.
- The WHERE clause boolean expression conditional check can result in either: true, false, or NULL.
- Any rows in which the WHERE clause boolean expression evaluates to false or NULL are removed.
- Multiple boolean conditions may be checked in the WHERE clause by leveraging the AND or OR keywords.
See the Official PostgreSQL WHERE clause section for in-depth coverage.
GROUP BY – Forms Groups
Is an optional clause.
This clause creates a single row for those selected, that contains a match on the specified grouped column value.
GROUP BY can be tricky, therefore, I feel it pertinent to include this passage from the documentation:
“When GROUP BY is present, or any aggregate functions are present, it is not valid for the SELECT list expressions to refer to ungrouped columns except within aggregate functions or when the ungrouped column is functionally dependent on the grouped columns, since there would otherwise be more than one possible value to return for an ungrouped column. A functional dependency exists if the grouped columns (or a subset thereof) are the primary key of the table containing the ungrouped column.”
GROUP BY highlights:
- Postgres allows grouping of not only columns from the source table but those listed in the SELECT column list as well. This is slightly different from strict SQL.
- In certain queries, GROUP BY can mimic the DISTINCT clause by removing duplicate values for the SELECT clause column.
- Column order is irrelevant for GROUP BY.
- Those columns not targeted by the GROUP BY clause cannot be referenced except in aggregates.
- In many cases, you can group on a PRIMARY KEY for those functionally dependent columns of that key.
- Grouping is still carried out for queries utilizing aggregate functions in the absence of a GROUP BY clause.
See the Official PostgreSQL GROUP BY clause section for in-depth coverage.
HAVING – Filters GROUP BY Column(s) and Aggregate Functions
Is an optional clause.
HAVING filters rows from the results set with a boolean conditional check just like the WHERE clause, except, it filters those rows formed by the GROUP BY clause and/or aggregate functions.
HAVING highlights:
- The HAVING clause can reference those columns named in aggregate functions (even those not grouped) in addition to any GROUP BY columns.
- HAVING is responsible for eliminating rows after aggregate functions or grouping has been applied.
- You can reference non-aggregated columns in the HAVING clause although doing so has very little use.
- Although the HAVING clause is many times used in conjunction with the GROUP BY clause, you can use it alone. Query results are formed into a single group of those columns in aggregate functions only.
See the Official PostgreSQL HAVING clause section for in-depth coverage.
ORDER BY – A Measure of Order Out of Randomness
Is an optional clause.
Use ORDER BY when you need specific ordering. Otherwise, the database can (and will) return results in any arbitrary ordering.
Even should the results appear to be in some semblance of order, this is not guaranteed.
Don’t be fooled. Use ORDER BY.
There are two available ordering patterns. Either ASC (ascending) or DESC (descending) order, with ASC being the default.
If your results set is to include NULL values, those may also be used in the ordering as follows: specifying NULLS LAST causes them (NULLs) to sort after non-NULLs whereas requesting NULLS FIRST is the converse.
ORDER BY highlights:
- Sorting expression(s) are any of those that would be allowed in the SELECT list of the query.
- PostgreSQL does allow you to ORDER BY columns not present in the SELECT clause where some SQL dialects do not.
- Query results are capricious and not guaranteed to resemble any pattern or order unless an ORDER BY clause is used.
- ORDER BY and the LIMIT clause (see next section), are great combined for determining a ‘Top‘ rows results set. (e.g., 5 highest sale days, 5 lowest selling pairs of shoes, top salesperson this quarter)
- You can order results by column positional number in the SELECT list but the specified number must not be greater than the number of items in said SELECT clause list. In other words, if the SELECT clause has only 2 items, then ORDER BY 3 will produce an error.
- Each individual expression is only ordered by its listed option. (e.g., ORDER BY col_1 DESC, col_2 DESC is not the same as ORDER BY col_1, col_2 DESC)
See the Official PostgreSQL ORDER BY clause section for in-depth coverage.
LIMIT – Retrieve a Specific Number of Rows From the Query Results
LIMIT is an optional clause.
LIMIT actually consists of 2 sub-clauses, with OFFSET being the second of them.
If a value is provided for the OFFSET portion of the clause, results set rows are returned after skipping that number of rows.
An important section in the documentation to note:
“The query planner takes LIMIT into account when generating a query plan, so you are very likely to get different plans (yielding different row orders) depending on what you use for LIMIT and OFFSET. Thus, using different LIMIT/OFFSET values to select different subsets of a query result will give inconsistent results unless you enforce a predictable result ordering with ORDER BY. This is not a bug; it is an inherent consequence of the fact that SQL does not promise to deliver the results of a query in any particular order unless ORDER BY is used to constrain the order.”
LIMIT highlights:
- LIMIT can possibly return fewer rows than the defined number if the query itself produces fewer rows in the result set. In other words, it would have no impact on the number of returned rows.
- LIMIT ALL syntax is acceptable and has the same effect as not including a LIMIT clause at all.
- Although ‘x’ number of rows are skipped due to an OFFSET clause, this is not a ‘workaround‘ for any performance gain, as they are still computed for the query plan in the server.
- OFFSET 0 is equivalent to not including an OFFSET clause at all.
See the Official PostgreSQL LIMIT Clause section for in-depth coverage.
PostgreSQL’s interpretation of the major SQL clauses is its own. Regardless as to how PostgreSQL chooses to implement them or not, they are foundational for SQL queries and familiarity with their individual characteristics (and nuances) can only benefit users moving forward.
While volumes of articles, books, documentation, and blog posts have been written on each of these clauses, I hope you find this high-level overview digestible and informative.
Thank you for reading.