Severalnines Blog
The automation and management blog for open source databases

My Favorite PostgreSQL Queries and Why They Matter

Databases, tables, normalization, and a solid backup plan allow us to store and maintain data.

Those combined best practices, in turn, afford us interaction with that data. In today's data-driven world, data is valuable. Not only valuable, data is oftentimes critical to end-user solutions provided by products and services. Extracting insight, answering questions, and meaningful metrics from data by way of querying and data manipulation is an integral component of SQL in general.

PostgreSQL is no different.

This foundational crux is critical for success in any data-driven aspect.

Below, I present a combination of 8 differing queries or types of queries I have found interesting and engaging to explore, study, learn, or otherwise manipulate data sets.

They are not listed in any order of importance.

Most will probably be familiar old friends. Perhaps some will become new acquaintances.

Sample tables and data used are not as important as the actual construction of the queries themselves and what each query returns, offers, or provides. Many of them are mock and derived for demonstration purposes and should not be taken literally in their values.

1. Left join, mind any nulls on the right...

Suppose in this example, we have a running sale of two months and are getting a total of both combined.

Yet, for some reason, the second month did not pull its weight and we want to target what days month one picked up the slack.

These sales are represented as tables payment and fake_month for this demonstration.

To note:

  • We will only check for totals greater than 2000.
  • We will limit the output to just 10 rows.

To start, we have this Common Table Expression (CTE) 'generating' the fake_month table for us, and query that follows.

dvdrental=> WITH fake_month AS(
SELECT setup::date
FROM generate_series('2007-02-01', '2007-02-28', INTERVAL '1 day') AS setup
)
SELECT date_part('day', p.payment_date)::INT AS legit,
SUM(p.amount),
date_part('day', fk.setup)::INT AS fake
FROM payment AS p
LEFT JOIN fake_month AS fk
ON date_part('day', fk.setup)::INT = date_part('day', p.payment_date)::INT
GROUP BY legit, fake
HAVING SUM(p.amount) > 2000
LIMIT 10;
legit | sum | fake
-------+---------+------
1 | 2808.24 | 1
2 | 2550.05 | 2
6 | 2077.14 | 6
8 | 2227.84 | 8
9 | 2067.86 | 9
17 | 3630.33 | 17
18 | 3977.74 | 18
19 | 3908.59 | 19
20 | 3888.98 | 20
21 | 3786.14 | 21
(10 rows)

Looks as if both months contributed there. So is this solved?

Before we consider this solved, let's visit the ORDER BY clause.

Of course, you can ORDER BY ASC or DESC.

However, you can also ORDER BY NULLS first or last and that changes things up a bit.

Let's rewrite this query and use ORDER BY NULLS first on the legit column.

For brevity, I'll remove the CTE from the output, just know it is still there and being used.

SELECT date_part('day', p.payment_date)::INT AS legit,
SUM(p.amount),
date_part('day', fk.setup)::INT AS fake
FROM payment AS p
LEFT JOIN fake_month AS fk
ON date_part('day', fk.setup)::INT = date_part('day', p.payment_date)::INT
GROUP BY legit, fake
HAVING SUM(p.amount) > 2000
ORDER BY legit NULLS first
LIMIT 10;
legit | sum | fake
-------+---------+------
1 | 2808.24 | 1
2 | 2550.05 | 2
6 | 2077.14 | 6
8 | 2227.84 | 8
9 | 2067.86 | 9
17 | 3630.33 | 17
18 | 3977.74 | 18
19 | 3908.59 | 19
20 | 3888.98 | 20
21 | 3786.14 | 21
(10 rows)

No difference there at all.

What if we ORDER BY NULLS first on the fake column? The one on the right side of the JOIN?

Let's see.

SELECT date_part('day', p.payment_date)::INT AS legit,
SUM(p.amount),
date_part('day', fk.setup)::INT AS fake
FROM payment AS p
LEFT JOIN fake_month AS fk
ON date_part('day', fk.setup)::INT = date_part('day', p.payment_date)::INT
GROUP BY legit, fake
HAVING SUM(p.amount) > 2000
ORDER BY fake NULLS first
LIMIT 10;
legit | sum | fake
-------+---------+------
29 | 2717.60 |
30 | 5723.89 |
1 | 2808.24 | 1
2 | 2550.05 | 2
6 | 2077.14 | 6
8 | 2227.84 | 8
9 | 2067.86 | 9
17 | 3630.33 | 17
18 | 3977.74 | 18
19 | 3908.59 | 19
(10 rows)

Now we are getting somewhere. We can see for days 29 & 30, the fake column has been ordered from the top of the results set.

Due to ORDER BY fake NULLS first.

This solves our question, to what days 'sale 2' slacked off.

Are you wondering...

"Can we just filter with WHERE fake IS NULL?"

Like this:

SELECT date_part('day', p.payment_date)::INT AS legit,
SUM(p.amount),
date_part('day', fk.setup)::INT AS fake
FROM payment AS p
LEFT JOIN fake_month AS fk
ON date_part('day', fk.setup)::INT = date_part('day', p.payment_date)::INT
WHERE date_part('day', fk.setup) IS NULL
GROUP BY legit, fake
HAVING SUM(p.amount) > 2000
LIMIT 10;
legit | sum | fake
-------+---------+------
29 | 2717.60 |
30 | 5723.89 |
(2 rows)

Yes that works. So why not just use that query instead? Why it matters?

I feel using LEFT JOIN and ORDER BY NULLS first for the table on the right side of the JOIN, is a great way to explore unfamiliar tables and data sets.

By confirming what, if any, data is ‘missing’ on that side of the join condition first; enhances clarity and awareness, allowing you to then filter out the results set with the WHERE <column_name> IS NULL clause, finalizing things up.

Of course, familiarity with the tables and datasets could potentially eliminate the need for the LEFT JOIN presented here.

It's a worthy query for anyone utilizing PostgreSQL to at least try, during exploration.

2. String Concatenation

Concatenation, the joining or appending of two strings, provides a presentation option for results sets. Many 'things' can be concatenated.

However, as noted in the documentation, the string concatenation operator ('||') accepts non-string input, as long as one is a string.

Let' see some examples with the below queries:

postgres=> SELECT 2||' times'||' 2 equals: '|| 2*2;
?column?
---------------------
2 times 2 equals: 4
(1 row)

We can see, numbers and strings all can be concatenated together as mentioned above.

The '||' operator is but one of those available in PostgreSQL.

The concat() function accepts multiple arguments, concatenating them all on return.

Here's an example of that function in action:

postgres=> SELECT concat('Josh ','Otwell') AS first_name;
first_name
-------------
Josh Otwell
(1 row)

We can pass in more than two arguments if desired:

postgres=> SELECT concat('Josh',' ','Otwell') AS first_name;
first_name
-------------
Josh Otwell
(1 row)

Let's note something real quick with these next examples:

postgres=> SELECT CONCAT('Josh',NULL,'Otwell') AS first_name;
first_name
------------
JoshOtwell
(1 row)
postgres=> SELECT 'Josh '||NULL||'Otwell' AS first_name;
first_name
------------
(1 row)
postgres=> SELECT NULL||'Josh '||'Otwell' AS first_name;
first_name
------------
(1 row)
postgres=> SELECT CONCAT(NULL,'Josh','Otwell') AS first_name;
first_name
------------
JoshOtwell
(1 row)

Observe that the concat() function ignores NULL no matter where placed in the list of parameters, while the string concatenation operator does not.

NULL is returned if present anywhere in the string to concatenate.

Just be aware of that.

Instead of manually including within the string to be concatenated, PostgreSQL also includes a concat_ws() function that accepts a string separator as the first parameter.

We will visit it with these queries:

postgres=> SELECT concat_ws('-',333,454,1919) AS cell_num;
cell_num
--------------
333-454-1919
(1 row)
postgres=> SELECT concat_ws(' ','Josh','Otwell') AS first_name;
first_name
-------------
Josh Otwell
(1 row)

concat_ws() accepts either numbers or strings as arguments and as stated above, uses the first argument as the separator.

How does concat_ws() treat NULL?

postgres=> SELECT concat_ws('-',333,NULL,1919) AS cell_num;
cell_num
----------
333-1919
(1 row)
postgres=> SELECT concat_ws(NULL,333,454,1919) AS cell_num;
cell_num
----------
(1 row)

NULL is ignored unless it is the separator argument given to concat_ws().

Then, all arguments are ignored and NULL is returned instead.

Concatenation is cool...

Now that we have an idea of how concatenation works, let's look at a couple of examples of it.

Back to the mock DVD rental database

Suppose we need to compile a list of customers first and last names, along with their email address to send out a memo for updating their account.

I will limit the output to just 10 rows for brevity's sake, but still demonstrating the || operator.

dvdrental=> SELECT first_name||' '||last_name||'''s email address is: '||email AS name_and_email
FROM customer
LIMIT 10;
name_and_email
--------------------------------------------------------------------------
Jared Ely's email address is: jared.ely@sakilacustomer.org
Mary Smith's email address is: mary.smith@sakilacustomer.org
Patricia Johnson's email address is: patricia.johnson@sakilacustomer.org
Linda Williams's email address is: linda.williams@sakilacustomer.org
Barbara Jones's email address is: barbara.jones@sakilacustomer.org
Elizabeth Brown's email address is: elizabeth.brown@sakilacustomer.org
Jennifer Davis's email address is: jennifer.davis@sakilacustomer.org
Maria Miller's email address is: maria.miller@sakilacustomer.org
Susan Wilson's email address is: susan.wilson@sakilacustomer.org
Margaret Moore's email address is: margaret.moore@sakilacustomer.org
(10 rows)

Notice we had to escape the single quote used with apostrophe s, using an additional single quote to show possession of the email address for each customer.

Why you should know?

There may be times when concatenating data presents you with better insight and understanding into the data set you are working with. Along with reporting options, concatenating shared datasets with others' could potentially make them (the data) more readable and digestible.

3. Supplying IN values list with Subquery's

A Subquery has numerous powerful uses. Of those, providing an IN list of values to check for membership is a common one.

Here's a quick use.

Suppose we have customer and payments tables in a mock DVD rental store and want to reward our top five highest spending customers who rented movies during the days of April 10 - 13.

Imagine that's a special target period. So if the customer spent more than $30, we want to acknowledge them.

Bear in mind, there are other available options for solving this type of question (i.e., joins, capturing results from multiple selects, etc...), yet, sub-queries can handle it as well.

We will start out with the whole shebang here. This complete query returns everything we want for this particular question.

dvdrental=> SELECT first_name, last_name, email
FROM customer
WHERE customer_id IN (
SELECT customer_id FROM (
SELECT DISTINCT customer_id, SUM(amount)
FROM payment
WHERE extract(month from payment_date) = 4
AND extract(day from payment_date) BETWEEN 10 AND 13
GROUP BY customer_id
HAVING SUM(amount) > 30
ORDER BY SUM(amount) DESC
LIMIT 5) AS top_five);

This example actually contains nested subquery's, one of which is a Derived Table.

Let's start by drilling into the innermost subquery, that Derived Table.

This subquery is a standalone SELECT statement all its own, returning a customer_id and a SUM() on the amount column.

Only those customers meeting the criteria checked by the WHERE and HAVING clauses make the cut, being further thinned out with LIMIT 5;

Why the next subquery you ask?

Can we not just use the WHERE customer_id IN portion of the outermost SELECT here?

Let's see with a hands-on approach.

I will remove the AS top_five from the subquery and try the outermost query with it now:

dvdrental=> SELECT first_name, last_name, email
FROM customer
WHERE customer_id IN
(SELECT DISTINCT customer_id, SUM(amount)
FROM payment
WHERE extract(month from payment_date) = 4
AND extract(day from payment_date) BETWEEN 10 AND 13
GROUP BY customer_id
HAVING SUM(amount) > 30
ORDER BY SUM(amount) DESC
LIMIT 5);
ERROR: subquery has too many columns
LINE 3: WHERE customer_id IN (

Here, IN membership is being tested with only the customer_id column, yet the Derived Table returns two columns and PostgreSQL lets us know.

One remedy is to use another subquery. Selecting only the customer_id from the Derived Table results set, creates the next inner nested subquery.

Now the IN predicate contains multiple rows of one column's values to check membership against the WHERE clause for customer_id to make the final results set.

Why it matters?

Utilizing subquery's in this manner is powerful due to the fact of the number of values that could potentially be tested with the IN() predicate.

Imagine if there were a 100? Or more?

'Hard-coding' all of them in the IN() list could become problematic and error-prone as the volume of values increases.

4. generate_series()

This set returning function, is handy and super fun to use and explore. I have used generate_series() in above examples, but it deserves a talk of its own. Focusing more on the function and capabilities.

I find generate_series() useful for comparative queries where some, or all data is missing.

Or only partial data is available at the time I am exploring. One handy use is populating tables with 'dummy data'.

To start, we will create a simple table:

trial=> CREATE TABLE tbl_1(
trial(> tb_id SERIAL PRIMARY KEY,
trial(> some_day DATE,
trial(> an_amt NUMERIC(4,2));
CREATE TABLE

Then use generate_series() as the VALUES for our INSERT statement:

trial=> INSERT INTO tbl_1(some_day, an_amt)
VALUES(
generate_series('2018-04-01','2018-04-15',INTERVAL '1 day'),
generate_series(2.43, 34.20, 1.03));
INSERT 0 31

Then create a second table

trial=> CREATE TABLE tbl_2(
tb2_id SERIAL PRIMARY KEY,
some_day2 DATE,
an_amt2 NUMERIC(4,2));
CREATE TABLE

Also, populate it using generate_series() in the INSERT statement:

trial=> INSERT INTO tbl_2(some_day2, an_amt2)
VALUES(
generate_series('2018-05-16','2018-05-31',INTERVAL '1 day'),
generate_series(15.43, 31., 1.03));
INSERT 0 16

Why it matters?

To reiterate, generate_series() is so useful for creating mock or practice data.

I have found mimicking month or day ranges for comparison is exceptional with generate_series(). Refer to section 1 and the CTE there, demonstrates this use.

Creating a set of complete data with generate_series() and using to compare against stored data to determine if any data is missing holds great value as well.

Download the Whitepaper Today
 
PostgreSQL Management & Automation with ClusterControl
Learn about what you need to know to deploy, monitor, manage and scale PostgreSQL

5. Query's with the COUNT() aggregate function.

This simple, yet effective aggregate function should be in anyone's arsenal. Especially when exploring tables or data sets for the first time.

I mean, do you really want to 'SELECT everything' from a table with 1M rows?

Determine with COUNT(*) how many records are present before you load up.

Let's find out how many rows the film table has in this mock DVD rental table:

dvdrental=> SELECT COUNT(*)
dvdrental-> FROM film;
count
-------
1000
(1 row)

While not quite as extensive as 1M+ rows, I'm sure you see the usefulness.

To return the number of specific rows, COUNT(*) can be filtered with a WHERE clause.

Let's see how many films have a 'G' rating:

dvdrental=> SELECT COUNT(*)
dvdrental-> FROM film
dvdrental-> WHERE rating = 'G';
count
-------
178
(1 row)

There is another form of COUNT() to be aware of. COUNT(some_expression).

The differences between them are:

  • COUNT(*) returns the total of all input rows (including NULLS and duplicates).
  • COUNT(some_expression) counts the number of non-NULL input rows.

When used in conjunction with the DISTINCT keyword, COUNT() will eliminate duplicate entries and return only unique values.

Let's see that in action using COUNT() with DISTINCT to determine how many unique types of ratings are present:

dvdrental=> SELECT COUNT(DISTINCT rating) FROM film;
count
-------
5
(1 row)

With this query, we know there are 5 types of ratings.

Why it matters?

Depending on what is being tracked or targeted, knowing how many of something exists can be important. Therefore, utilizing COUNT(*) or COUNT(some_expression) assists with these types of challenges.

Just remember COUNT(*) does not ignore NULL. All rows, duplicate and NULL values included, are returned as part of the final number.

6. UPDATE multiple rows with a CASE expression.

Suppose we have this table:

trial=> SELECT * FROM reward_members;
rm_id | expense_amt | member_status
-------+-------------+---------------
1 | 1245.33 | gold
2 | 1300.49 | gold
3 | 900.20 | bronze
4 | 2534.44 | platinum
5 | 600.19 | bronze
6 | 1001.55 | silver
7 | 1097.99 | silver
8 | 3033.33 | platinum
(8 rows)

We need to rename the member_status column and add 'group' to the end of the current name present for each record.

For starters, multiple individual UPDATE statements will accomplish this no problem.

But, so can a single CASE expression.

trial=> UPDATE reward_members
SET member_status = (
CASE member_status
WHEN 'gold' THEN 'gold_group'
WHEN 'bronze' THEN 'bronze_group'
WHEN 'platinum' THEN 'platinum_group'
WHEN 'silver' THEN 'silver_group'
END
)
WHERE member_status IN ('gold', 'bronze','platinum', 'silver');
UPDATE 8

Let's query the table again to see the changes:

trial=> SELECT * FROM reward_members;
rm_id | expense_amt | member_status
-------+-------------+----------------
1 | 1245.33 | gold_group
2 | 1300.49 | gold_group
3 | 900.20 | bronze_group
4 | 2534.44 | platinum_group
5 | 600.19 | bronze_group
6 | 1001.55 | silver_group
7 | 1097.99 | silver_group
8 | 3033.33 | platinum_group
(8 rows)

All update were successful.

Why it matters?

You can imagine how many round trips this would take to the server if multiple individual UPDATE statements had been run. In truth, only 4 for this example. But still, the potential for many is always there.

Yet, using an UPDATE with CASE expression, we are sending only one query instead.

7. COPY and \copy

PostgreSQL provides COPY, a command for exporting data between files and tables.

Be sure and visit the provided link to see the abundant number of options available with COPY.

An important note concerning COPY. SUPERUSER role privilege is required to execute this command.

The psql meta-command \copy is an alternative for those users not deemed this role attribute. We will visit that command in turn shortly.

First, let's run a COPY command to export certain columns to a CSV file on the local machine.

Assume we have this query result to for export:

trial=# SELECT expense_amt, member_status
trial-# FROM reward_members
trial-# WHERE member_status = 'gold_group';
expense_amt | member_status
-------------+---------------
1245.33 | gold_group
1300.49 | gold_group
(2 rows)

With COPY, we can use that SELECT statement to complete this export.

trial=# COPY (SELECT expense_amt, member_status
FROM reward_members
WHERE member_status = 'gold_group')
TO '/home/linux_user_here/awards_to_honor.csv'
DELIMITER ','
CSV HEADER;
COPY 2

*Note: Per the documentation, the query must be within parenthesis.

Let's now check the contents of that file:

$ cat awards_to_honor.csv
expense_amt,member_status
1245.33,gold_group
1300.49,gold_group

We can see the first line contains the HEADER (which are the column names) and both lines have the expense_amt and member_status data for both columns returned from the WHERE clause filter.

Another important caveat I discovered from executing the above COPY command.

The user must have privileges to write to the file at the OS level.

In my case, fixed with:

$ sudo chown postgres awards_to_honor.csv

You can avoid this issue by instead writing to a system file the current user has access to such as /tmp (shown below.)

trial=# COPY (SELECT expense_amt, member_status
FROM reward_members
WHERE member_status = 'gold_group')
TO '/tmp/awards_to_honor.csv'
DELIMITER ','
CSV HEADER;
COPY 2

However, one of my test roles without the SUPERUSER attribute, ran into problems writing to the /tmp file.

See below for confirmation:

trial=# SET role log_user; -- changing from postgres user to log_user
SET

Now attempting the same COPY command, writing to the /tmp folder

trial=> COPY (SELECT expense_amt, member_status
FROM reward_members
WHERE member_status = 'gold_group')
TO '/tmp/awards_to_honor2.csv'
DELIMITER ','
CSV HEADER;
ERROR: must be superuser to COPY to or from a file
HINT: Anyone can COPY to stdout or from stdin. psql's \copy command also works for anyone.

Perhaps a better measure, as suggested in the HINT:, for roles without the SUPERUSER attribute, is the psql \copy meta-command.

Let's carry-out a similar type of command with \copy instead using the same role, without the need for that SUPERUSER attribute.

trial=> \copy (SELECT expense_amt, member_status
FROM reward_members
WHERE member_status = 'silver_group')
TO '/home/linux_user_here/more_awards.csv'
DELIMITER ','
CSV HEADER;
COPY 2

No problems there.

And the files' contents,

$ cat more_awards.csv
expense_amt,member_status
1001.55,silver_group
1097.99,silver_group

Also works for the /tmp folder:

trial=> \copy (SELECT expense_amt, member_status
FROM reward_members
WHERE member_status = 'silver_group')
TO '/tmp/more_awards.csv'
DELIMITER ','
CSV HEADER;
COPY 2

Same contents present in the written file as well:

trial=> \! cat /tmp/more_awards.csv
expense_amt,member_status
1001.55,silver_group
1097.99,silver_group

Why it matters?

Importing data into PostgreSQL via files is a surefire bulk upload method. Although all are not covered in this blog post, COPY and \copy both, offer several options for working with different file formats and extensions.

On the same token, exporting data from tables, or specific columns is easily handled with both of these commands as well.

8. psql \help meta-command

You're in a psql command-line session. Curious about the CREATE INDEX command syntax?

No need and going to a browser or another document.

Try this instead:

trial=> \help CREATE INDEX
Command: CREATE INDEX
Description: define a new index
Syntax:
CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ] ON table_name [ USING method ]
( { column_name | ( expression ) } [ COLLATE collation ] [ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
[ WITH ( storage_parameter = value [, ... ] ) ]
[ TABLESPACE tablespace_name ]
[ WHERE predicate ]

To know what help text is available, you can run \help by itself and get a list of available options.

I won't list them all out here, just know that option is available.

Why it matters?

The fact this meta-command is super easy to use, powerful, and convenient are enough pros to mention it here. It's saved me tons of time spent searching throughout other documentation. And of course, being a newbie, I use it quite often!

Conclusion

This is not an exhaustive list. Nor the 'be all end all' of queries and data manipulation.

Only my take on those that pique my interest and speak to me as I continue to learn and grow into a SQL Developer role. I hope through this blog post, you will find use cases for the above queries and commands, implementing those where you see fit.