In a previous blog post, My Favorite PostgreSQL Queries and Why They Matter, I visited interesting queries meaningful to me as I learn, develop, and grow into a SQL developer role.
One of those in particular, a multi-row UPDATE with a single CASE expression, sparked an interesting conversation over on Hacker News.
In this blog post, I want to compare that query against one involving multiple single UPDATE statements, for better or worse. These tests were run in the following environment:
- Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz
- 8GB RAM
- 1TB Storage
- Xubuntu Linux 16.04.3 LTS (Xenial Xerus)
- PostgreSQL 10.4
Note: To start, I created a ‘staging’ table with all TEXT type columns to get the data loaded.
The sample data set I’m using is found at this link here.
But keep in mind, the data itself is used in this example because it’s a decent-sized set with multiple columns. Any ‘analysis’ or UPDATEs/INSERTs to this data set is not reflective of actual ‘real-world’ GPS/GIS operations and is not intended as such.
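For completeness, a staging load like the one described above can be done with psql's \copy meta-command. This is only a sketch under assumptions: the file path, delimiter, and header option shown here are hypothetical, not taken from the original load.

```
-- Hypothetical load of the raw CSV into the all-TEXT staging table.
-- The file path and the HEADER option are assumptions.
\copy data_staging FROM '/tmp/gps_data.csv' WITH (FORMAT csv, HEADER true)
```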
location=# \d data_staging;
                  Table "public.data_staging"
    Column     |  Type   | Collation | Nullable | Default
---------------+---------+-----------+----------+---------
 segment_num   | text    |           |          |
 point_seg_num | text    |           |          |
 latitude      | text    |           |          |
 longitude     | text    |           |          |
 nad_year_cd   | text    |           |          |
 proj_code     | text    |           |          |
 x_cord_loc    | text    |           |          |
 y_cord_loc    | text    |           |          |
 last_rev_date | text    |           |          |
 version_date  | text    |           |          |
 asbuilt_flag  | text    |           |          |

location=# SELECT COUNT(*) FROM data_staging;
 count
--------
 546895
(1 row)
We have around half a million rows of data in this table.
For this first comparison, I will UPDATE the proj_code column.
Here is an exploratory query to determine its current values:
location=# SELECT DISTINCT proj_code FROM data_staging;
 proj_code
-----------
 "70"
 ""
 "72"
 "71"
 "51"
 "15"
 "16"
(7 rows)
I’ll use trim to remove the quotes from the values, cast to an INT, and determine how many rows exist for each individual value.
Let’s use a CTE for that, then SELECT from it:
location=# WITH cleaned_nums AS (
SELECT NULLIF(trim(both '"' FROM proj_code), '') AS p_code
FROM data_staging
)
SELECT COUNT(*),
       CASE WHEN p_code::int = 70 THEN '70'
            WHEN p_code::int = 72 THEN '72'
            WHEN p_code::int = 71 THEN '71'
            WHEN p_code::int = 51 THEN '51'
            WHEN p_code::int = 15 THEN '15'
            WHEN p_code::int = 16 THEN '16'
            ELSE '00'
       END AS proj_code_num
FROM cleaned_nums
GROUP BY p_code
ORDER BY p_code DESC;
 count  | proj_code_num
--------+---------------
 353087 | 00
 139057 | 72
  25460 | 71
   3254 | 70
      1 | 51
  12648 | 16
  13388 | 15
(7 rows)
Prior to running these tests, I’ll go ahead and ALTER the proj_code column to type INTEGER:
BEGIN;
ALTER TABLE data_staging ALTER COLUMN proj_code
SET DATA TYPE INTEGER
USING NULLIF(trim(both '"' FROM proj_code), '')::INTEGER;
SAVEPOINT my_save;
COMMIT;
And clean up that NULL column value (which is represented by the ELSE '00' in the exploratory CASE expression above), setting it to an arbitrary number, 10, with this UPDATE:
UPDATE data_staging SET proj_code = 10 WHERE proj_code IS NULL;
Now all proj_code columns have an INTEGER value.
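A quick sanity check (not in the original run, but worth having) confirms the cleanup: counting the rows where proj_code is still NULL should return zero after the UPDATE above.

```
-- Should report 0 after the NULL cleanup UPDATE.
SELECT COUNT(*) AS null_count
FROM data_staging
WHERE proj_code IS NULL;
```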
Let’s go ahead and run a single CASE expression updating all of the proj_code column values and see what the timing reports. I’ll place all of the commands in a .sql source file for ease of handling.
Here are the file contents:
BEGIN;
\timing on
UPDATE data_staging
SET proj_code =
(
CASE proj_code
    WHEN 72 THEN 7272
    WHEN 71 THEN 7171
    WHEN 15 THEN 1515
    WHEN 51 THEN 5151
    WHEN 70 THEN 7070
    WHEN 10 THEN 1010
    WHEN 16 THEN 1616
END
)
WHERE proj_code IN (72, 71, 15, 51, 70, 10, 16);
SAVEPOINT my_save;
Let’s run this file and check what the timing reports:
location=# \i /case_insert.sql
BEGIN
Time: 0.265 ms
Timing is on.
UPDATE 546895
Time: 6779.596 ms (00:06.780)
SAVEPOINT
Time: 0.300 ms
Just over half a million rows in 6+ seconds.
Here are the reflected changes in the table so far:
location=# SELECT DISTINCT proj_code FROM data_staging;
 proj_code
-----------
      7070
      1616
      1010
      7171
      1515
      7272
      5151
(7 rows)
I’ll ROLLBACK (not shown) these changes so I can run individual UPDATE statements to test those as well.
Below are the modifications to the .sql source file for this series of comparisons:
BEGIN;
\timing on
UPDATE data_staging SET proj_code = 7222 WHERE proj_code = 72;
UPDATE data_staging SET proj_code = 7171 WHERE proj_code = 71;
UPDATE data_staging SET proj_code = 1515 WHERE proj_code = 15;
UPDATE data_staging SET proj_code = 5151 WHERE proj_code = 51;
UPDATE data_staging SET proj_code = 7070 WHERE proj_code = 70;
UPDATE data_staging SET proj_code = 1010 WHERE proj_code = 10;
UPDATE data_staging SET proj_code = 1616 WHERE proj_code = 16;
SAVEPOINT my_save;
And those results:
location=# \i /case_insert.sql
BEGIN
Time: 0.264 ms
Timing is on.
UPDATE 139057
Time: 795.610 ms
UPDATE 25460
Time: 116.268 ms
UPDATE 13388
Time: 239.007 ms
UPDATE 1
Time: 72.699 ms
UPDATE 3254
Time: 162.199 ms
UPDATE 353087
Time: 1987.857 ms (00:01.988)
UPDATE 12648
Time: 321.223 ms
SAVEPOINT
Time: 0.108 ms
Let’s check the values:
location=# SELECT DISTINCT proj_code FROM data_staging;
 proj_code
-----------
      7222
      1616
      7070
      1010
      7171
      1515
      5151
(7 rows)
And the timing (note: I’ll do the math in a query, since \timing did not report whole seconds this run):
location=# SELECT round((795.610 + 116.268 + 239.007 + 72.699 + 162.199 + 1987.857 + 321.223) / 1000, 3) AS seconds;
 seconds
---------
   3.695
(1 row)
The individual UPDATE statements took about half the time of the single CASE expression.
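One way to dig further into a difference like this (not something run for these tests) is to compare the plans with EXPLAIN ANALYZE, wrapped in a transaction so the writes can be rolled back afterward:

```
BEGIN;
-- Note: EXPLAIN ANALYZE actually executes the UPDATE, hence the ROLLBACK.
EXPLAIN (ANALYZE, BUFFERS)
UPDATE data_staging
SET proj_code = 7272
WHERE proj_code = 72;
ROLLBACK;
```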
This first test included the entire table, with all columns. I’m curious of any differences in a table with the same number of rows, but fewer columns, hence the next series of tests.
I’ll create a table with 2 columns (composed of a SERIAL data type for the PRIMARY KEY and an INTEGER for the proj_code column) and move over the data:
location=# CREATE TABLE proj_nums(n_id SERIAL PRIMARY KEY, proj_code INTEGER);
CREATE TABLE
location=# INSERT INTO proj_nums(proj_code)
SELECT proj_code
FROM data_staging;
INSERT 0 546895
(Note: SQL commands from the first set of operations are reused with the appropriate modifications. I am omitting them here for brevity.)
I’ll run the single CASE expression first:
location=# \i /case_insert.sql
BEGIN
Timing is on.
UPDATE 546895
Time: 4355.332 ms (00:04.355)
SAVEPOINT
Time: 0.137 ms
And then the individual UPDATE statements:
location=# \i /case_insert.sql
BEGIN
Time: 0.282 ms
Timing is on.
UPDATE 139057
Time: 1042.133 ms (00:01.042)
UPDATE 25460
Time: 123.337 ms
UPDATE 13388
Time: 212.698 ms
UPDATE 1
Time: 43.107 ms
UPDATE 3254
Time: 52.669 ms
UPDATE 353087
Time: 2787.295 ms (00:02.787)
UPDATE 12648
Time: 99.813 ms
SAVEPOINT
Time: 0.059 ms

location=# SELECT round((1042.133 + 123.337 + 212.698 + 43.107 + 52.669 + 2787.295 + 99.813) / 1000, 3) AS seconds;
 seconds
---------
   4.361
(1 row)
The timing is somewhat even between both sets of operations on the table with just 2 columns.
I will say that the CASE expression is a bit easier to type out, but not necessarily the best choice on all occasions. As stated in some of the comments on the Hacker News thread referenced above, it normally "just depends" on many factors as to which is the optimal choice.
I realize these tests are subjective at best: one ran against a table with 11 columns, while the other used a table with just 2 columns, both of them number data types.
The CASE expression for multiple row updates is still one of my favorite queries, if for only the ease of typing in a controlled environment where many individual UPDATE queries are the other alternative.
However, I can see now where it is not always the optimal choice as I continue to grow and learn.
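A third option, which I did not benchmark here, is a single UPDATE joined to a VALUES list. It keeps the mapping in one place like the CASE does, while often being easier to extend; this is a sketch of that technique applied to the same proj_code mapping:

```
-- Alternative: one UPDATE driven by a VALUES mapping (not benchmarked here).
UPDATE data_staging AS ds
SET proj_code = v.new_code
FROM (VALUES
    (72, 7272),
    (71, 7171),
    (15, 1515),
    (51, 5151),
    (70, 7070),
    (10, 1010),
    (16, 1616)
) AS v(old_code, new_code)
WHERE ds.proj_code = v.old_code;
```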
As the old saying goes, "Six of one, half a dozen of the other."
An Additional Favorite Query – Using PLpgSQL CURSORs
I’ve begun storing and tracking all of my exercise (trail hiking) stats with PostgreSQL on my local development machine. There are multiple tables involved, as with any normalized database.
However, at month’s end, I want to store specific columns’ stats in their own, separate table.
Here is the ‘monthly’ table I’ll use:
fitness=> \d hiking_month_total;
             Table "public.hiking_month_total"
     Column      |          Type          | Collation | Nullable | Default
-----------------+------------------------+-----------+----------+---------
 day_hiked       | date                   |           |          |
 calories_burned | numeric(4,1)           |           |          |
 miles           | numeric(4,2)           |           |          |
 duration        | time without time zone |           |          |
 pace            | numeric(2,1)           |           |          |
 trail_hiked     | text                   |           |          |
 shoes_worn      | text                   |           |          |
I’ll concentrate on May’s results with this SELECT query:
fitness=> SELECT hs.day_walked, hs.cal_burned, hs.miles_walked, hs.duration, hs.mph, tr.name, sb.name_brand fitness-> FROM hiking_stats AS hs fitness-> INNER JOIN hiking_trail AS ht fitness-> ON hs.hike_id = ht.th_id fitness-> INNER JOIN trail_route AS tr fitness-> ON ht.tr_id = tr.trail_id fitness-> INNER JOIN shoe_brand AS sb fitness-> ON hs.shoe_id = sb.shoe_id fitness-> WHERE extract(month FROM hs.day_walked) = 5 fitness-> ORDER BY hs.day_walked ASC;
And here are 3 sample rows returned from that query:
 day_walked | cal_burned | miles_walked | duration | mph |          name          |              name_brand
------------+------------+--------------+----------+-----+------------------------+---------------------------------------
 2018-05-02 |      311.2 |         3.27 | 00:57:13 | 3.4 | Tree Trail-extended    | New Balance Trail Runners-All Terrain
 2018-05-03 |      320.8 |         3.38 | 00:58:59 | 3.4 | Sandy Trail-Drive      | New Balance Trail Runners-All Terrain
 2018-05-04 |      291.3 |         3.01 | 00:53:33 | 3.4 | House-Power Line Route | Keen Koven WP(keen-dry)
(3 rows)
Truth be told, I can populate the target hiking_month_total table using the above SELECT query in an INSERT statement.
But where is the fun in that?
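For reference, that "boring" set-based version would look something like this (the same joins as the SELECT above, with its columns mapped onto hiking_month_total):

```
-- Set-based alternative to the CURSOR function below.
INSERT INTO hiking_month_total(day_hiked, calories_burned, miles, duration, pace, trail_hiked, shoes_worn)
SELECT hs.day_walked, hs.cal_burned, hs.miles_walked, hs.duration, hs.mph, tr.name, sb.name_brand
FROM hiking_stats AS hs
INNER JOIN hiking_trail AS ht ON hs.hike_id = ht.th_id
INNER JOIN trail_route AS tr ON ht.tr_id = tr.trail_id
INNER JOIN shoe_brand AS sb ON hs.shoe_id = sb.shoe_id
WHERE extract(month FROM hs.day_walked) = 5
ORDER BY hs.day_walked ASC;
```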
I’ll forgo boredom for a PLpgSQL function with a CURSOR instead.
I came up with this function to perform the INSERT with a CURSOR:
CREATE OR REPLACE FUNCTION monthly_total_stats()
RETURNS void AS $month_stats$
DECLARE
    v_day_walked date;
    v_cal_burned numeric(4, 1);
    v_miles_walked numeric(4, 2);
    v_duration time without time zone;
    v_mph numeric(2, 1);
    v_name text;
    v_name_brand text;
    v_cur CURSOR FOR SELECT hs.day_walked, hs.cal_burned, hs.miles_walked, hs.duration, hs.mph, tr.name, sb.name_brand
                     FROM hiking_stats AS hs
                     INNER JOIN hiking_trail AS ht ON hs.hike_id = ht.th_id
                     INNER JOIN trail_route AS tr ON ht.tr_id = tr.trail_id
                     INNER JOIN shoe_brand AS sb ON hs.shoe_id = sb.shoe_id
                     WHERE extract(month FROM hs.day_walked) = 5
                     ORDER BY hs.day_walked ASC;
BEGIN
    OPEN v_cur;
    <<get_stats>>
    LOOP
        FETCH v_cur INTO v_day_walked, v_cal_burned, v_miles_walked, v_duration, v_mph, v_name, v_name_brand;
        EXIT WHEN NOT FOUND;
        INSERT INTO hiking_month_total(day_hiked, calories_burned, miles, duration, pace, trail_hiked, shoes_worn)
        VALUES (v_day_walked, v_cal_burned, v_miles_walked, v_duration, v_mph, v_name, v_name_brand);
    END LOOP get_stats;
    CLOSE v_cur;
END;
$month_stats$ LANGUAGE PLpgSQL;
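Worth noting: PLpgSQL's FOR loop opens an implicit cursor over a query, which shortens a function like this considerably. A sketch of the same insert using that form (the function name monthly_total_stats_for is mine, not from the original):

```
CREATE OR REPLACE FUNCTION monthly_total_stats_for()
RETURNS void AS $$
DECLARE
    rec record;
BEGIN
    -- The FOR loop manages the cursor (open/fetch/close) implicitly.
    FOR rec IN
        SELECT hs.day_walked, hs.cal_burned, hs.miles_walked, hs.duration, hs.mph, tr.name, sb.name_brand
        FROM hiking_stats AS hs
        INNER JOIN hiking_trail AS ht ON hs.hike_id = ht.th_id
        INNER JOIN trail_route AS tr ON ht.tr_id = tr.trail_id
        INNER JOIN shoe_brand AS sb ON hs.shoe_id = sb.shoe_id
        WHERE extract(month FROM hs.day_walked) = 5
    LOOP
        INSERT INTO hiking_month_total(day_hiked, calories_burned, miles, duration, pace, trail_hiked, shoes_worn)
        VALUES (rec.day_walked, rec.cal_burned, rec.miles_walked, rec.duration, rec.mph, rec.name, rec.name_brand);
    END LOOP;
END;
$$ LANGUAGE PLpgSQL;
```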
Let’s call function monthly_total_stats() to perform the INSERT:
fitness=> SELECT monthly_total_stats();
 monthly_total_stats
---------------------

(1 row)
Since the function is defined as RETURNS void, we can see no value is returned to the caller.
At this time, I am not specifically interested in any return value, only that the function carries out the defined operation, populating the hiking_month_total table.
I’ll query for a count of records in the target table, confirming it has data:
fitness=> SELECT COUNT(*) FROM hiking_month_total; count ------- 25 (1 row)
The monthly_total_stats() function works, but perhaps a better use case for a CURSOR is to scroll through a large number of records. Maybe a table with around half a million records?
This next CURSOR is bound with a query targeting the data_staging table from the series of comparisons in the section above:
CREATE OR REPLACE FUNCTION location_curs()
RETURNS refcursor AS $location$
DECLARE
    v_cur refcursor;
BEGIN
    OPEN v_cur FOR SELECT segment_num, latitude, longitude, proj_code, asbuilt_flag
                   FROM data_staging;
    RETURN v_cur;
END;
$location$ LANGUAGE PLpgSQL;
Then, to use this CURSOR, operate within a TRANSACTION (pointed out in the documentation here).
location=# BEGIN;
BEGIN
location=# SELECT location_curs();
   location_curs
--------------------
 <unnamed portal 1>
(1 row)
So what can you do with this "<unnamed portal 1>" CURSOR?
Here are just a few things:
We can return the first row from the CURSOR using either FIRST or ABSOLUTE 1:
location=# FETCH FIRST FROM "<unnamed portal 1>";
Want a row almost halfway through the results set? (Assuming we know an estimated half a million rows are bound to the CURSOR.)
Can you be that 'specific' with a CURSOR?
We can position, and FETCH the values for the record at row 234888 (just a random number I chose):
location=# FETCH ABSOLUTE 234888 FROM "<unnamed portal 1>";
Once positioned there, we can move the CURSOR backward one row:
location=# FETCH BACKWARD FROM "<unnamed portal 1>";
Which is the same as:
location=# FETCH ABSOLUTE 234887 FROM "<unnamed portal 1>";
Then we can move the CURSOR right back to the ABSOLUTE 234888 with:
location=# FETCH FORWARD FROM "<unnamed portal 1>";
Handy Tip: to reposition the CURSOR, use MOVE instead of FETCH if you do not need the values from that row.
See this passage from the documentation:
"MOVE repositions a cursor without retrieving any data. MOVE works exactly like the FETCH command, except it only positions the cursor and does not return rows."
I'll revisit my fitness stats data to write a function and name the CURSOR, along with a potential 'real-world' use case.
The CURSOR will target this additional table, which stores results not limited to the month of May as in the previous example (basically all I have collected so far):
fitness=> CREATE TABLE cp_hiking_total AS
fitness-> SELECT * FROM hiking_month_total
fitness-> WITH NO DATA;
CREATE TABLE AS
Then populate it with data:
fitness=> INSERT INTO cp_hiking_total
SELECT hs.day_walked, hs.cal_burned, hs.miles_walked, hs.duration, hs.mph, tr.name, sb.name_brand
FROM hiking_stats AS hs
INNER JOIN hiking_trail AS ht ON hs.hike_id = ht.th_id
INNER JOIN trail_route AS tr ON ht.tr_id = tr.trail_id
INNER JOIN shoe_brand AS sb ON hs.shoe_id = sb.shoe_id
ORDER BY hs.day_walked ASC;
INSERT 0 51
Now with the below PLpgSQL function, CREATE a 'named' CURSOR:
CREATE OR REPLACE FUNCTION stats_cursor(refcursor)
RETURNS refcursor AS $$
BEGIN
    OPEN $1 FOR
    SELECT * FROM cp_hiking_total;
    RETURN $1;
END;
$$ LANGUAGE plpgsql;
I'll call this CURSOR 'stats':
fitness=> BEGIN;
BEGIN
fitness=> SELECT stats_cursor('stats');
 stats_cursor
--------------
 stats
(1 row)
Suppose I want the ‘12th’ row bound to the CURSOR.
I can position the CURSOR on that row, retrieving those results with the below command:
fitness=> FETCH ABSOLUTE 12 FROM stats;
 day_hiked  | calories_burned | miles | duration | pace |     trail_hiked     |              shoes_worn
------------+-----------------+-------+----------+------+---------------------+---------------------------------------
 2018-05-02 |           311.2 |  3.27 | 00:57:13 |  3.4 | Tree Trail-extended | New Balance Trail Runners-All Terrain
(1 row)
For purposes of this blog post, imagine I know first-hand the pace column value for this row is incorrect.
I specifically remember being 'dead on my feet tired' that day and only maintained a pace of 3.0 during that hike. (Hey it happens.)
Okay, I'll just UPDATE the cp_hiking_total table to reflect that change.
Relatively simple no doubt. Boring…
How about with the stats CURSOR instead?
fitness=> UPDATE cp_hiking_total fitness-> SET pace = 3.0 fitness-> WHERE CURRENT OF stats; UPDATE 1
To make this change permanent, issue COMMIT:
fitness=> COMMIT; COMMIT
Let's query and see that UPDATE reflected in table cp_hiking_total:
fitness=> SELECT * FROM cp_hiking_total
fitness-> WHERE day_hiked = '2018-05-02';
 day_hiked  | calories_burned | miles | duration | pace |     trail_hiked     |              shoes_worn
------------+-----------------+-------+----------+------+---------------------+---------------------------------------
 2018-05-02 |           311.2 |  3.27 | 00:57:13 |  3.0 | Tree Trail-extended | New Balance Trail Runners-All Terrain
(1 row)
How cool is that?
Moving within the CURSOR's result set and running an UPDATE where needed.
Quite powerful if you ask me. And convenient.
Some ‘caution’ and information from the documentation on this type of CURSOR:
"It is generally recommended to use FOR UPDATE if the cursor is intended to be used with UPDATE ... WHERE CURRENT OF or DELETE ... WHERE CURRENT OF. Using FOR UPDATE prevents other sessions from changing the rows between the time they are fetched and the time they are updated. Without FOR UPDATE, a subsequent WHERE CURRENT OF command will have no effect if the row was changed since the cursor was created.
Another reason to use FOR UPDATE is that without it, a subsequent WHERE CURRENT OF might fail if the cursor query does not meet the SQL standard's rules for being “simply updatable” (in particular, the cursor must reference just one table and not use grouping or ORDER BY). Cursors that are not simply updatable might work, or might not, depending on plan choice details; so in the worst case, an application might work in testing and then fail in production."
With the CURSOR used here, I followed the SQL standard's rules from the passage above: it references just one table, with no grouping or ORDER BY clause.
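Following that advice anyway, the stats_cursor function could be declared with FOR UPDATE. A sketch (the same function as above, with only the locking clause added):

```
CREATE OR REPLACE FUNCTION stats_cursor(refcursor)
RETURNS refcursor AS $$
BEGIN
    -- FOR UPDATE locks each fetched row against concurrent changes,
    -- making UPDATE ... WHERE CURRENT OF reliable.
    OPEN $1 FOR
    SELECT * FROM cp_hiking_total
    FOR UPDATE;
    RETURN $1;
END;
$$ LANGUAGE plpgsql;
```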
Why it matters.
As is with numerous operations, queries, or tasks in PostgreSQL (and SQL in general), there typically is more than one way to accomplish and reach your end goal. Which is one of the main reasons I am drawn to SQL and strive to learn more.
I hope that through this follow-up blog post, I have provided some insight into why the multi-row UPDATE with CASE was included as one of my favorite queries in that first accompanying blog post. Just having it as an option is worthwhile to me.
In addition, exploring CURSORs for traversing large result sets proved worthwhile. Performing DML operations, like UPDATEs and/or DELETEs, with the correct type of CURSOR is just 'icing on the cake'. I'm eager to study them further for more use cases.