More of My Favorite PostgreSQL Queries – and Why They Also Matter
In a previous blog post, My Favorite PostgreSQL Queries and Why They Matter, I visited queries that are meaningful to me as I learn, develop, and grow into a SQL developer role.
One of them in particular, a multi-row UPDATE with a single CASE expression, sparked an interesting conversation over on Hacker News.
In this blog post, I want to compare that query with one using multiple single UPDATE statements, for better or worse.
Machine/Environment Specs:
- Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
- 8GB RAM
- 1TB Storage
- Xubuntu Linux 16.04.3 LTS (Xenial Xerus)
- PostgreSQL 10.4
Note: To start, I created a ‘staging’ table with all TEXT type columns to get the data loaded.
The sample data set I’m using is found at this link here.
Keep in mind that this data set is used here simply because it is a decent-sized set with multiple columns. Any ‘analysis’ or UPDATEs/INSERTs against it is not reflective of actual ‘real-world’ GPS/GIS operations and is not intended as such.
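To get the data loaded, something along these lines works (a minimal sketch; the file name, path, and CSV options are assumptions, not my actual load command):
-- Hypothetical file name, path, and format options; adjust to the downloaded data set.
COPY data_staging FROM '/path/to/sample_data.csv' WITH (FORMAT csv, HEADER true);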
location=# \d data_staging
Table "public.data_staging"
Column | Type | Collation | Nullable | Default
---------------+---------+-----------+----------+---------
segment_num | text | | |
point_seg_num | text | | |
latitude | text | | |
longitude | text | | |
nad_year_cd | text | | |
proj_code | text | | |
x_cord_loc | text | | |
y_cord_loc | text | | |
last_rev_date | text | | |
version_date | text | | |
asbuilt_flag | text | | |
location=# SELECT COUNT(*) FROM data_staging;
count
--------
546895
(1 row)
We have around half a million rows of data in this table.
For this first comparison, I will UPDATE the proj_code column.
Here is an exploratory query to determine its current values:
location=# SELECT DISTINCT proj_code FROM data_staging;
proj_code
-----------
"70"
""
"72"
"71"
"51"
"15"
"16"
(7 rows)
I’ll use trim to remove the quotes from the values, cast to an INT, and determine how many rows exist for each individual value.
Let’s use a CTE for that, then SELECT from it:
location=# WITH cleaned_nums AS (
SELECT NULLIF(trim(both '"' FROM proj_code), '') AS p_code FROM data_staging
)
SELECT COUNT(*),
CASE
WHEN p_code::int = 70 THEN '70'
WHEN p_code::int = 72 THEN '72'
WHEN p_code::int = 71 THEN '71'
WHEN p_code::int = 51 THEN '51'
WHEN p_code::int = 15 THEN '15'
WHEN p_code::int = 16 THEN '16'
ELSE '00'
END AS proj_code_num
FROM cleaned_nums
GROUP BY p_code
ORDER BY p_code DESC;
count | proj_code_num
--------+---------------
353087 | 0
139057 | 72
25460 | 71
3254 | 70
1 | 51
12648 | 16
13388 | 15
(7 rows)
Prior to running these tests, I’ll go ahead and ALTER the proj_code column to type INTEGER:
BEGIN;
ALTER TABLE data_staging ALTER COLUMN proj_code SET DATA TYPE INTEGER USING NULLIF(trim(both '"' FROM proj_code), '')::INTEGER;
SAVEPOINT my_save;
COMMIT;
And clean up the NULL values (represented by the ELSE ’00’ in the exploratory CASE expression above), setting them to an arbitrary number, 10, with this UPDATE:
UPDATE data_staging
SET proj_code = 10
WHERE proj_code IS NULL;
Now all proj_code rows have an INTEGER value.
Let’s go ahead and run a single CASE expression updating all of the proj_code column values and see what the timing reports. I’ll place all of the commands in a .sql source file for ease of handling.
Here are the file contents:
BEGIN;
\timing on
UPDATE data_staging
SET proj_code =
(
CASE proj_code
WHEN 72 THEN 7272
WHEN 71 THEN 7171
WHEN 15 THEN 1515
WHEN 51 THEN 5151
WHEN 70 THEN 7070
WHEN 10 THEN 1010
WHEN 16 THEN 1616
END
)
WHERE proj_code IN (72, 71, 15, 51, 70, 10, 16);
SAVEPOINT my_save;
Let’s run this file and check what the timing reports:
location=# \i /case_insert.sql
BEGIN
Time: 0.265 ms
Timing is on.
UPDATE 546895
Time: 6779.596 ms (00:06.780)
SAVEPOINT
Time: 0.300 ms
Just over half a million rows in 6+ seconds.
Here are the reflected changes in the table so far:
location=# SELECT DISTINCT proj_code FROM data_staging;
proj_code
-----------
7070
1616
1010
7171
1515
7272
5151
(7 rows)
I’ll ROLLBACK these changes (sketched below) so I can run individual UPDATE statements to test those as well.
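Since the CASE UPDATE ran inside a transaction that was never committed, a plain ROLLBACK discards it, while the earlier committed ALTER TABLE and NULL cleanup remain intact:
-- Abort the open transaction, discarding the uncommitted CASE UPDATE.
ROLLBACK;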
Below are the modifications to the .sql source file for this series of comparisons:
BEGIN;
\timing on
UPDATE data_staging
SET proj_code = 7222
WHERE proj_code = 72;
UPDATE data_staging
SET proj_code = 7171
WHERE proj_code = 71;
UPDATE data_staging
SET proj_code = 1515
WHERE proj_code = 15;
UPDATE data_staging
SET proj_code = 5151
WHERE proj_code = 51;
UPDATE data_staging
SET proj_code = 7070
WHERE proj_code = 70;
UPDATE data_staging
SET proj_code = 1010
WHERE proj_code = 10;
UPDATE data_staging
SET proj_code = 1616
WHERE proj_code = 16;
SAVEPOINT my_save;
And those results:
location=# \i /case_insert.sql
BEGIN
Time: 0.264 ms
Timing is on.
UPDATE 139057
Time: 795.610 ms
UPDATE 25460
Time: 116.268 ms
UPDATE 13388
Time: 239.007 ms
UPDATE 1
Time: 72.699 ms
UPDATE 3254
Time: 162.199 ms
UPDATE 353087
Time: 1987.857 ms (00:01.988)
UPDATE 12648
Time: 321.223 ms
SAVEPOINT
Time: 0.108 ms
Let’s check the values:
location=# SELECT DISTINCT proj_code FROM data_staging;
proj_code
-----------
7222
1616
7070
1010
7171
1515
5151
(7 rows)
And the timing (Note: I’ll do the math in a query since \timing did not report whole seconds this run):
location=# SELECT round((795.610 + 116.268 + 239.007 + 72.699 + 162.199 + 1987.857 + 321.223) / 1000, 3) AS seconds;
seconds
---------
3.695
(1 row)
The individual UPDATEs took about half the time of the single CASE expression.
This first test included the entire table, with all columns. I’m curious whether there are any differences on a table with the same number of rows, but fewer columns, hence the next series of tests.
I’ll create a table with 2 columns (composed of a SERIAL data type for the PRIMARY KEY and an INTEGER for the proj_code column) and move over the data:
location=# CREATE TABLE proj_nums(n_id SERIAL PRIMARY KEY, proj_code INTEGER);
CREATE TABLE
location=# INSERT INTO proj_nums(proj_code) SELECT proj_code FROM data_staging;
INSERT 0 546895
(To note: the SQL commands from the first set of operations are reused with the appropriate modifications, targeting proj_nums instead of data_staging; the modified CASE file is sketched below.)
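That file now reads along these lines (same statement as before with the table name swapped; the individual UPDATE file is adjusted the same way):
BEGIN;
\timing on
UPDATE proj_nums
SET proj_code =
(
CASE proj_code
WHEN 72 THEN 7272
WHEN 71 THEN 7171
WHEN 15 THEN 1515
WHEN 51 THEN 5151
WHEN 70 THEN 7070
WHEN 10 THEN 1010
WHEN 16 THEN 1616
END
)
WHERE proj_code IN (72, 71, 15, 51, 70, 10, 16);
SAVEPOINT my_save;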
I’ll run the single CASE expression first:
location=# \i /case_insert.sql
BEGIN
Timing is on.
UPDATE 546895
Time: 4355.332 ms (00:04.355)
SAVEPOINT
Time: 0.137 ms
And then the individual UPDATEs:
location=# \i /case_insert.sql
BEGIN
Time: 0.282 ms
Timing is on.
UPDATE 139057
Time: 1042.133 ms (00:01.042)
UPDATE 25460
Time: 123.337 ms
UPDATE 13388
Time: 212.698 ms
UPDATE 1
Time: 43.107 ms
UPDATE 3254
Time: 52.669 ms
UPDATE 353087
Time: 2787.295 ms (00:02.787)
UPDATE 12648
Time: 99.813 ms
SAVEPOINT
Time: 0.059 ms
location=# SELECT round((1042.133 + 123.337 + 212.698 + 43.107 + 52.669 + 2787.295 + 99.813) / 1000, 3) AS seconds;
seconds
---------
4.361
(1 row)
The timing is somewhat even between both sets of operations on the table with just 2 columns.
I will say that the CASE expression is a bit easier to type out, but it is not necessarily the best choice on all occasions. As stated in some of the comments over on the Hacker News thread referenced above, it normally “just depends” on many factors as to which choice is optimal.
I realize these tests are subjective at best. One ran on a table with 11 columns, the other on a table with just 2 columns, though both updated a column of a number data type.
The CASE expression for multi-row updates is still one of my favorite queries, if only for the ease of typing it out in a controlled environment where many individual UPDATE queries are the alternative.
However, I can see now where it is not always the optimal choice as I continue to grow and learn.
As that old saying goes, “Six of one, half a dozen of the other.”
An Additional Favorite Query – Using PLpgSQL CURSORs
I’ve begun storing and tracking all of my exercise (trail hiking) stats with PostgreSQL on my local development machine. There are multiple tables involved, as with any normalized database.
However, at month’s end, I want to store specific columns’ stats in their own separate table.
Here is the ‘monthly’ table I’ll use:
fitness=> \d hiking_month_total
Table "public.hiking_month_total"
Column | Type | Collation | Nullable | Default
-----------------+------------------------+-----------+----------+---------
day_hiked | date | | |
calories_burned | numeric(4,1) | | |
miles | numeric(4,2) | | |
duration | time without time zone | | |
pace | numeric(2,1) | | |
trail_hiked | text | | |
shoes_worn | text | | |
I’ll concentrate on May’s results with this SELECT query:
fitness=> SELECT hs.day_walked, hs.cal_burned, hs.miles_walked, hs.duration, hs.mph, tr.name, sb.name_brand
fitness-> FROM hiking_stats AS hs
fitness-> INNER JOIN hiking_trail AS ht
fitness-> ON hs.hike_id = ht.th_id
fitness-> INNER JOIN trail_route AS tr
fitness-> ON ht.tr_id = tr.trail_id
fitness-> INNER JOIN shoe_brand AS sb
fitness-> ON hs.shoe_id = sb.shoe_id
fitness-> WHERE extract(month FROM hs.day_walked) = 5
fitness-> ORDER BY hs.day_walked ASC;
And here are 3 sample rows returned from that query:
day_walked | cal_burned | miles_walked | duration | mph | name | name_brand
------------+------------+--------------+----------+-----+------------------------+---------------------------------------
2018-05-02 | 311.2 | 3.27 | 00:57:13 | 3.4 | Tree Trail-extended | New Balance Trail Runners-All Terrain
2018-05-03 | 320.8 | 3.38 | 00:58:59 | 3.4 | Sandy Trail-Drive | New Balance Trail Runners-All Terrain
2018-05-04 | 291.3 | 3.01 | 00:53:33 | 3.4 | House-Power Line Route | Keen Koven WP(keen-dry)
(3 rows)
Truth be told, I can populate the target hiking_month_total table using the above SELECT query in an INSERT statement.
But where is the fun in that?
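For reference, that straightforward alternative would look something like this (built directly from the SELECT query above):
INSERT INTO hiking_month_total(day_hiked, calories_burned, miles,
duration, pace, trail_hiked, shoes_worn)
SELECT hs.day_walked, hs.cal_burned, hs.miles_walked, hs.duration, hs.mph, tr.name, sb.name_brand
FROM hiking_stats AS hs
INNER JOIN hiking_trail AS ht
ON hs.hike_id = ht.th_id
INNER JOIN trail_route AS tr
ON ht.tr_id = tr.trail_id
INNER JOIN shoe_brand AS sb
ON hs.shoe_id = sb.shoe_id
WHERE extract(month FROM hs.day_walked) = 5
ORDER BY hs.day_walked ASC;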
I’ll forgo boredom for a PLpgSQL function with a CURSOR instead.
I came up with this function to perform the INSERT with a CURSOR:
CREATE OR REPLACE function monthly_total_stats()
RETURNS void
AS $month_stats$
DECLARE
v_day_walked date;
v_cal_burned numeric(4, 1);
v_miles_walked numeric(4, 2);
v_duration time without time zone;
v_mph numeric(2, 1);
v_name text;
v_name_brand text;
v_cur CURSOR for SELECT hs.day_walked, hs.cal_burned, hs.miles_walked, hs.duration, hs.mph, tr.name, sb.name_brand
FROM hiking_stats AS hs
INNER JOIN hiking_trail AS ht
ON hs.hike_id = ht.th_id
INNER JOIN trail_route AS tr
ON ht.tr_id = tr.trail_id
INNER JOIN shoe_brand AS sb
ON hs.shoe_id = sb.shoe_id
WHERE extract(month FROM hs.day_walked) = 5
ORDER BY hs.day_walked ASC;
BEGIN
OPEN v_cur;
<<get_stats>>
LOOP
FETCH v_cur INTO v_day_walked, v_cal_burned, v_miles_walked, v_duration, v_mph, v_name, v_name_brand;
EXIT WHEN NOT FOUND;
INSERT INTO hiking_month_total(day_hiked, calories_burned, miles,
duration, pace, trail_hiked, shoes_worn)
VALUES(v_day_walked, v_cal_burned, v_miles_walked, v_duration, v_mph, v_name, v_name_brand);
END LOOP get_stats;
CLOSE v_cur;
END;
$month_stats$ LANGUAGE PLpgSQL;
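Worth noting: PLpgSQL’s FOR loop can iterate the same query with an implicit cursor, handling the OPEN/FETCH/EXIT/CLOSE bookkeeping automatically. A sketch of that variant (hypothetical function name, same query and INSERT):
CREATE OR REPLACE FUNCTION monthly_total_stats_for()
RETURNS void
AS $month_stats$
DECLARE
rec record;
BEGIN
-- FOR ... IN query opens, fetches from, and closes a cursor implicitly.
FOR rec IN
SELECT hs.day_walked, hs.cal_burned, hs.miles_walked, hs.duration, hs.mph, tr.name, sb.name_brand
FROM hiking_stats AS hs
INNER JOIN hiking_trail AS ht
ON hs.hike_id = ht.th_id
INNER JOIN trail_route AS tr
ON ht.tr_id = tr.trail_id
INNER JOIN shoe_brand AS sb
ON hs.shoe_id = sb.shoe_id
WHERE extract(month FROM hs.day_walked) = 5
ORDER BY hs.day_walked ASC
LOOP
INSERT INTO hiking_month_total(day_hiked, calories_burned, miles,
duration, pace, trail_hiked, shoes_worn)
VALUES(rec.day_walked, rec.cal_burned, rec.miles_walked, rec.duration, rec.mph, rec.name, rec.name_brand);
END LOOP;
END;
$month_stats$ LANGUAGE plpgsql;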
Let’s call function monthly_total_stats() to perform the INSERT:
fitness=> SELECT monthly_total_stats();
monthly_total_stats
---------------------
 
(1 row)
Since the function is defined RETURNS void, we can see no value is returned to the caller.
At this time, I am not specifically interested in any return values, only that the function carries out the defined operation, populating the hiking_month_total table.
I’ll query for a count of records in the target table, confirming it has data:
fitness=> SELECT COUNT(*) FROM hiking_month_total;
count
-------
25
(1 row)
The monthly_total_stats() function works, but perhaps a better use case for a CURSOR is to scroll through a large number of records. Maybe a table with around half a million records?
This next CURSOR is bound with a query targeting the data_staging table from the series of comparisons in the section above:
CREATE OR REPLACE FUNCTION location_curs()
RETURNS refcursor
AS $location$
DECLARE
v_cur refcursor;
BEGIN
OPEN v_cur for SELECT segment_num, latitude, longitude, proj_code, asbuilt_flag FROM data_staging;
RETURN v_cur;
END;
$location$ LANGUAGE PLpgSQL;
Then, to use this CURSOR, we operate within a TRANSACTION (as pointed out in the documentation here).
location=# BEGIN;
BEGIN
location=# SELECT location_curs();
   location_curs
--------------------
 <unnamed portal 1>
(1 row)
So what can you do with this “<unnamed portal 1>” CURSOR?
Here are just a few things:
We can return the first row from the CURSOR using either first or ABSOLUTE 1:
location=# FETCH first FROM "<unnamed portal 1>";
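And the equivalent with ABSOLUTE positioning (same cursor):
FETCH ABSOLUTE 1 FROM "<unnamed portal 1>";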
Want a row almost halfway through the results set? (Assuming we know an estimated half a million rows are bound to the CURSOR.)
Can you be that 'specific' with a CURSOR?
Yep.
We can position the CURSOR there and FETCH the values for the record at row 234888 (just a random number I chose):
location=# FETCH ABSOLUTE 234888 FROM "<unnamed portal 1>";
Once positioned there, we can move the CURSOR 'backward one':
location=# FETCH BACKWARD FROM "<unnamed portal 1>";
Which is the same as:
location=# FETCH ABSOLUTE 234887 FROM "<unnamed portal 1>";
Then we can move the CURSOR right back to the ABSOLUTE 234888 with:
location=# FETCH FORWARD FROM "<unnamed portal 1>";
Handy Tip: to reposition the CURSOR, use MOVE instead of FETCH if you do not need the values from that row.
See this passage from the documentation:
"MOVE repositions a cursor without retrieving any data. MOVE works exactly like the FETCH command, except it only positions the cursor and does not return rows."
The "
I'll revisit my fitness stats data to write a function and name the CURSOR, along with a potential 'real-world' use case.
The CURSOR will target this additional table, which, unlike the previous example, stores results not limited to the month of May (basically, all I have collected so far):
fitness=> CREATE TABLE cp_hiking_total AS SELECT * FROM hiking_month_total WITH NO DATA;
CREATE TABLE AS
Then populate it with data:
fitness=> INSERT INTO cp_hiking_total
SELECT hs.day_walked, hs.cal_burned, hs.miles_walked, hs.duration, hs.mph, tr.name, sb.name_brand
FROM hiking_stats AS hs
INNER JOIN hiking_trail AS ht
ON hs.hike_id = ht.th_id
INNER JOIN trail_route AS tr
ON ht.tr_id = tr.trail_id
INNER JOIN shoe_brand AS sb
ON hs.shoe_id = sb.shoe_id
ORDER BY hs.day_walked ASC;
INSERT 0 51
Now with the below PLpgSQL function, CREATE a 'named' CURSOR:
CREATE OR REPLACE FUNCTION stats_cursor(refcursor)
RETURNS refcursor
AS $$
BEGIN
OPEN $1 FOR
SELECT *
FROM cp_hiking_total;
RETURN $1;
END;
$$ LANGUAGE plpgsql;
I'll call this CURSOR 'stats':
fitness=> BEGIN;
BEGIN
fitness=> SELECT stats_cursor('stats');
stats_cursor
--------------
stats
(1 row)
Suppose I want the '12th' row bound to the CURSOR.
I can position the CURSOR on that row, retrieving those results with the below command:
fitness=> FETCH ABSOLUTE 12 FROM stats;
day_hiked | calories_burned | miles | duration | pace | trail_hiked | shoes_worn
------------+-----------------+-------+----------+------+---------------------+---------------------------------------
2018-05-02 | 311.2 | 3.27 | 00:57:13 | 3.4 | Tree Trail-extended | New Balance Trail Runners-All Terrain
(1 row)
For purposes of this blog post, imagine I know first-hand the pace column value for this row is incorrect.
I specifically remember being 'dead on my feet tired' that day and only maintained a pace of 3.0 during that hike. (Hey it happens.)
Okay, I'll just UPDATE the cp_hiking_total table to reflect that change.
Relatively simple no doubt. Boring…
How about with the stats CURSOR instead?
fitness=> UPDATE cp_hiking_total
fitness-> SET pace = 3.0
fitness-> WHERE CURRENT OF stats;
UPDATE 1
To make this change permanent, issue COMMIT:
fitness=> COMMIT;
COMMIT
Let's query and see that UPDATE reflected in table cp_hiking_total:
fitness=> SELECT * FROM cp_hiking_total
fitness-> WHERE day_hiked = '2018-05-02';
day_hiked | calories_burned | miles | duration | pace | trail_hiked | shoes_worn
------------+-----------------+-------+----------+------+---------------------+---------------------------------------
2018-05-02 | 311.2 | 3.27 | 00:57:13 | 3.0 | Tree Trail-extended | New Balance Trail Runners-All Terrain
(1 row)
How cool is that?
We can move within the CURSOR's results set and run an UPDATE where needed.
Quite powerful if you ask me. And convenient.
Some ‘caution’ and information from the documentation on this type of CURSOR:
"It is generally recommended to use FOR UPDATE if the cursor is intended to be used with UPDATE ... WHERE CURRENT OF or DELETE ... WHERE CURRENT OF. Using FOR UPDATE prevents other sessions from changing the rows between the time they are fetched and the time they are updated. Without FOR UPDATE, a subsequent WHERE CURRENT OF command will have no effect if the row was changed since the cursor was created.
Another reason to use FOR UPDATE is that without it, a subsequent WHERE CURRENT OF might fail if the cursor query does not meet the SQL standard's rules for being “simply updatable” (in particular, the cursor must reference just one table and not use grouping or ORDER BY). Cursors that are not simply updatable might work, or might not, depending on plan choice details; so in the worst case, an application might work in testing and then fail in production."
With the CURSOR I've used here, I followed the SQL standard rules from the passage above: it references just one table, with no grouping or ORDER BY clause.
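Heeding that advice, the stats_cursor() function above could declare its query with FOR UPDATE (a sketch of the same function with the locking clause added):
CREATE OR REPLACE FUNCTION stats_cursor(refcursor)
RETURNS refcursor
AS $$
BEGIN
OPEN $1 FOR
SELECT *
FROM cp_hiking_total
FOR UPDATE; -- lock the fetched rows until the transaction ends
RETURN $1;
END;
$$ LANGUAGE plpgsql;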
Why it matters.
As with numerous operations, queries, and tasks in PostgreSQL (and SQL in general), there is typically more than one way to accomplish your end goal, which is one of the main reasons I am drawn to SQL and strive to learn more.
I hope that through this follow-up blog post, I have provided some insight into why the multi-row UPDATE with CASE was included as one of my favorite queries in that first accompanying blog post. Just having it as an option is worthwhile to me.
In addition, exploring CURSORs for traversing large results sets was worthwhile. Performing DML operations, like UPDATEs and/or DELETEs, with the correct type of CURSOR is just 'icing on the cake'. I'm eager to study them further for more use cases.