
Custom Trigger Based Upgrades for PostgreSQL

Vladimir Svedov


1st RULE: You do not upgrade PostgreSQL with trigger-based replication
2nd RULE: You DO NOT upgrade PostgreSQL with trigger-based replication
3rd RULE: If you upgrade PostgreSQL with trigger-based replication, prepare to suffer. And prepare well.

There must be a very serious reason not to use pg_upgrade for upgrading PostgreSQL.

OK, let’s say you can’t afford more than seconds of downtime. Use pglogical then.

OK, let’s say you run 9.3 and thus can’t use pglogical. Use Londiste.

Can’t find a readable README? Use Slony.

Too complicated? Use streaming replication – promote the slave and run pg_upgrade on it – then switch apps to work with the newly promoted server.

Your app is relatively write-intensive all the time? You have looked into all possible solutions and still want to set up custom trigger-based replication? There are things you should pay attention to then:

  • All tables need a PK. You shouldn’t rely on ctid (even with autovacuum disabled)
  • You will need to enable triggers for all constraint-bound tables (and might need deferred FKs)
  • Sequences need manual sync
  • Permissions aren’t replicated (unless you also set up an event trigger)
  • Event triggers can help with automating support for new tables, but it’s better not to overcomplicate an already complicated process (e.g. creating a trigger and a foreign table on table creation, also creating the same table on the foreign server, or altering the remote server table with the same change you make on the old db) – see the sketch after this list
  • A FOR EACH STATEMENT trigger is less reliable but probably simpler
  • You should vividly imagine your preexisting data migration process
  • You should plan limited table accessibility while setting up and enabling trigger-based replication
  • You absolutely, totally need to know your relations’ dependencies and constraints before you go this way
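
To illustrate the event trigger point from the list above, here is a minimal sketch (the names ev_new_table and etg_new_table are mine, not part of any setup shown later) that only reminds you a new table appeared; on 9.3 an event trigger function sees little more than TG_TAG, the richer pg_event_trigger_ddl_commands() arrives only in 9.5:

create or replace function ev_new_table() returns event_trigger as $$
begin
  -- no automation here, just a reminder to add the new table to the replication set
  raise notice '% happened - check whether a new table needs replication triggers and a foreign twin', TG_TAG;
end;
$$ language plpgsql;
create event trigger etg_new_table on ddl_command_end
  when tag in ('CREATE TABLE') execute procedure ev_new_table();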

Enough warnings? You want to play already? Let’s begin with some code then.

Before writing any triggers we have to build a mock-up data set. Why? Wouldn’t it be much easier to have a trigger before we have data, so that the data would replicate to the “upgrade” cluster at once? Sure it would. But then what would we be upgrading? Just build the data set on a newer version. So yes, if you plan upgrading to a higher version and need to add some table, create the replication triggers before you put in the data; it will eliminate the need to sync not-yet-replicated data later. But such new tables are, we may say, the easy part. So let’s first mock up the case where we have data before we decide to upgrade.

Let’s assume that the outdated server is called p93 (oldest supported) and the one we replicate to is called p10 (11 is on its way this quarter, but still hasn’t happened yet):

\c postgres
select pg_terminate_backend(pid) from pg_stat_activity where datname in ('p93','p10');
drop database if exists p93;
drop database if exists p10;

Here I use psql, and thus can use the \c meta-command to connect to another db. If you want to follow this code with another client, you would need to reconnect instead. Of course you don’t need this step if you run this for the first time. I had to recreate my sandbox several times, thus I saved the statements…

create database p93; --old db (I use 9.3 as oldest supported ATM version)
create database p10; --new db 

So we create two fresh databases. Now I will connect to the one we want to upgrade and create several funky data types, then use them to fill a table that we will later consider as preexisting:

\c p93
create type myenum as enum('a', 'b');--adding some complex types
create type mycomposit as (a int, b text); --and again...
create table t(i serial not null primary key, ts timestamptz(0) default now(), j json, t text, e myenum, c mycomposit);
insert into t values(0, now(), '{"a":{"aa":[1,3,2]}}', 'foo', 'b', (3,'aloha'));
insert into t (j,e) values ('{"b":null}', 'a');
insert into t (t) select chr(g) from generate_series(100,240) g;--add some more data
delete from t where i > 3 and i < 142; --mockup activity and mix tuples to be not sequential
insert into t (t) select null;

Now what do we have?

  ctid   |  i  |           ts           |          j           |  t  | e |     c     
---------+-----+------------------------+----------------------+-----+---+-----------
 (0,1)   |   0 | 2018-07-08 08:03:00+03 | {"a":{"aa":[1,3,2]}} | foo | b | (3,aloha)
 (0,2)   |   1 | 2018-07-08 08:03:00+03 | {"b":null}           |     | a | 
 (0,3)   |   2 | 2018-07-08 08:03:00+03 |                      | d   |   | 
 (0,4)   |   3 | 2018-07-08 08:03:00+03 |                      | e   |   | 
 (0,143) | 142 | 2018-07-08 08:03:00+03 |                      | ð   |   | 
 (0,144) | 143 | 2018-07-08 08:03:00+03 |                      |     |   | 
(6 rows)

OK, some data – why did I insert and then delete so much? Well, we are trying to mock up a data set that existed for a while, so I’m trying to make it a little scattered. Let’s move one more row, (0,4), to the end of the page, (0,145):

update t set j = '{}' where i =3; --(0,4)

Now let’s assume we will use postgres_fdw (using dblink here would basically be the same and probably faster for 9.3, so please do so if you wish).

create extension postgres_fdw;
create server p10 foreign data wrapper postgres_fdw options (host 'localhost', dbname 'p10'); --I know it's the same 9.3 server - change host to another version and use another cluster if you wish. It's not important for the sandbox...
create user mapping for vao server p10 options (user 'vao', password 'tsun');

Now we can use pg_dump -s to get the DDL, but I just have it above. We have to create the same table in the higher version cluster to replicate data to:

\c p10
create type myenum as enum('a', 'b');--adding some complex types
create type mycomposit as (a int, b text); --and again...
create table t(i serial not null primary key, ts timestamptz(0) default now(), j json, t text, e myenum, c mycomposit);

Now we get back to 9.3 and use foreign tables for data migration (I will use f_ convention for table names here, f stands for foreign):

\c p93
create foreign table f_t(i serial, ts timestamptz(0) default now(), j json, t text, e myenum, c mycomposit) server p10 options (table_name 't');

Finally! We create an insert function and trigger.

create or replace function tgf_i() returns trigger as $$
begin
  execute format('insert into %I select ($1).*','f_'||TG_RELNAME) using NEW;
  return NEW;
end;
$$ language plpgsql;
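
The matching trigger on t is not spelled out above; following the tgi/tgu/tgd naming used later for table c, it is simply:

create trigger tgi before insert on t for each row execute procedure tgf_i();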

Here and later I will use links for longer code. First, so that the spoken text does not sink in machine language. Second, because I use several versions of the same functions to reflect how the code should evolve on demand.

--OK - first table ready - let's try logical trigger-based replication on inserts:
insert into t (t) select 'one';
--and now transactional:
begin;
  insert into t (t) select 'two';
  select ctid, * from f_t;
  select ctid, * from t;
rollback;
select ctid, * from f_t where i > 143;
select ctid, * from t where i > 143;

Resulting:

INSERT 0 1
BEGIN
INSERT 0 1
 ctid  |  i  |           ts           | j |  t  | e | c 
-------+-----+------------------------+---+-----+---+---
 (0,1) | 144 | 2018-07-08 08:27:15+03 |   | one |   | 
 (0,2) | 145 | 2018-07-08 08:27:15+03 |   | two |   | 
(2 rows)

  ctid   |  i  |           ts           |          j           |  t  | e |     c     
---------+-----+------------------------+----------------------+-----+---+-----------
 (0,1)   |   0 | 2018-07-08 08:27:15+03 | {"a":{"aa":[1,3,2]}} | foo | b | (3,aloha)
 (0,2)   |   1 | 2018-07-08 08:27:15+03 | {"b":null}           |     | a | 
 (0,3)   |   2 | 2018-07-08 08:27:15+03 |                      | d   |   | 
 (0,143) | 142 | 2018-07-08 08:27:15+03 |                      | ð   |   | 
 (0,144) | 143 | 2018-07-08 08:27:15+03 |                      |     |   | 
 (0,145) |   3 | 2018-07-08 08:27:15+03 | {}                   | e   |   | 
 (0,146) | 144 | 2018-07-08 08:27:15+03 |                      | one |   | 
 (0,147) | 145 | 2018-07-08 08:27:15+03 |                      | two |   | 
(8 rows)

ROLLBACK
 ctid  |  i  |           ts           | j |  t  | e | c 
-------+-----+------------------------+---+-----+---+---
 (0,1) | 144 | 2018-07-08 08:27:15+03 |   | one |   | 
(1 row)

  ctid   |  i  |           ts           | j |  t  | e | c 
---------+-----+------------------------+---+-----+---+---
 (0,146) | 144 | 2018-07-08 08:27:15+03 |   | one |   | 
(1 row)

What do we see here? We see that newly inserted data is replicated to database p10 successfully. And accordingly it is rolled back if the transaction fails. So far so good. But you could not not notice (yes, yes – not not) that the table on p93 is much bigger – the old data did not replicate. How do we get it there? Well, something simple like:

insert into … select local.* from ...outer join foreign where foreign.PK is null 

would do. And this is not the main concern here – you should rather worry about how you will manage preexisting data on updates and deletes, because statements that run successfully on the lower version db will fail or just affect zero rows on the higher one, simply because the preexisting data is not there! And here we come to the “seconds of downtime” phrase. (If it was a movie, of course here we would have a flashback, but alas – if the phrase “seconds of downtime” did not catch your attention earlier, you will have to go above and look for it...)
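
For the sandbox table t it could be as straightforward as this (i being the PK):

insert into f_t
select t.* from t left join f_t on f_t.i = t.i
where f_t.i is null;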

In order to enable all statement triggers you have to freeze the table, copy all the data and only then enable the triggers, so that the tables on the lower and higher version databases stay in sync and all statements have the same (or extremely close, because the physical distribution will differ – again, look above at the first example of the ctid column) effect. But running such a “turn on replication” step on the table in one biiiiiig transaction won’t be seconds of downtime. Potentially it will make the site read-only for hours. Especially if the table is roughly bound by FK to other big tables.
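
So the “right way” for one table, sketched roughly (and assuming the update and delete trigger functions discussed below already exist), is:

begin;
  lock table t in exclusive mode;  -- writes wait, reads still work
  insert into f_t                  -- sync whatever is not on p10 yet
  select t.* from t left join f_t on f_t.i = t.i where f_t.i is null;
  create trigger tgu before update on t for each row execute procedure tgf_u();
  create trigger tgd before delete on t for each row execute procedure tgf_d();
commit;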

Well, read-only is not complete downtime. But later we will try to leave all SELECTs and some INSERT, DELETE and UPDATE statements working (on new data, failing on old). Moving a table or a transaction to read-only can be done in many ways – be it some PostgreSQL approach, or at the application level, or even by temporarily revoking the corresponding permissions. These approaches could be a topic for a blog of their own, thus I will only mention them.
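
At the permissions level, for instance, it could be as blunt as this (app_rw being a hypothetical application role):

revoke insert, update, delete on t from app_rw; -- app writes start failing, reads keep working
--...sync the data, enable the triggers...
grant insert, update, delete on t to app_rw;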

Anyway. Back to triggers. In order to perform the same action that works on a distinct row (UPDATE, DELETE) on the remote table as on the local one, we need to use primary keys, as the physical locations will differ. And primary keys are created on different tables with different columns, so we either have to create a unique function for each table or try writing something generic. Let’s (for simplicity) assume we have single-column PKs only; then such a function should help. So finally! Let’s have an update function here. And obviously a trigger:
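
The full function is one of those linked pieces of longer code; as a minimal sketch (assuming the PK column is literally named i, as in this sandbox, and crudely replacing the remote row instead of using the UPDATE (list) = (list) construct mentioned in the conclusion), something like this will do:

create or replace function tgf_u() returns trigger as $$
begin
  -- crude approach: replace the whole remote row, matched by the PK column i;
  -- a real UPDATE (e.g. the (list) = (list) form) is friendlier to remote FKs
  execute format('delete from %I where i = ($1).i', 'f_'||TG_RELNAME) using OLD;
  execute format('insert into %I select ($1).*', 'f_'||TG_RELNAME) using NEW;
  raise info '%', NEW; -- an INFO line like the one visible in the output below
  return NEW;
end;
$$ language plpgsql;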

create trigger tgu before update on t for each row execute procedure tgf_u();

And let’s see if it works:

begin;
        update t set j = '{"updated":true}' where i = 144;
        select * from t where i = 144;
        select * from f_t where i = 144;
Rollback;

Resulting in:

BEGIN
psql:blog.sql:71: INFO:  (144,"2018-07-08 09:09:20+03","{""updated"":true}",one,,)
UPDATE 1
  i  |           ts           |        j         |  t  | e | c 
-----+------------------------+------------------+-----+---+---
 144 | 2018-07-08 09:09:20+03 | {"updated":true} | one |   | 
(1 row)

  i  |           ts           |        j         |  t  | e | c 
-----+------------------------+------------------+-----+---+---
 144 | 2018-07-08 09:09:20+03 | {"updated":true} | one |   | 
(1 row)

ROLLBACK

OK. And while it’s still hot, let’s add the delete trigger function and replication as well:
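
Again the linked function is longer; a minimal sketch with the same single-column-PK assumption:

create or replace function tgf_d() returns trigger as $$
begin
  -- remove the matching remote row; if it isn't there yet, this silently affects zero rows
  execute format('delete from %I where i = ($1).i', 'f_'||TG_RELNAME) using OLD;
  return OLD;
end;
$$ language plpgsql;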

create trigger tgd before delete on t for each row execute procedure tgf_d();

And check:

begin;
        delete from t where i = 144;
        select * from t where i = 144;
        select * from f_t where i = 144;
Rollback;

Giving:

DELETE 1
 i | ts | j | t | e | c 
---+----+---+---+---+---
(0 rows)

 i | ts | j | t | e | c 
---+----+---+---+---+---
(0 rows)

As we remember (who could forget this!), we are not turning on “replication” support within a transaction. And we should, if we want consistent data. As said above, ALL statement triggers on ALL FK-related tables should be enabled in one transaction, prepared beforehand by syncing the data. Otherwise we could fall into:

begin;
        select * from t where i = 3;
        delete from t where i = 3;
        select * from t where i = 3;
        select * from f_t where i = 3;
Rollback;

Giving:

p93=# begin;
BEGIN
p93=#         select * from t where i = 3;
 i |           ts           | j  | t | e | c 
---+------------------------+----+---+---+---
 3 | 2018-07-08 09:16:27+03 | {} | e |   | 
(1 row)

p93=#         delete from t where i = 3;
DELETE 1
p93=#         select * from t where i = 3;
 i | ts | j | t | e | c 
---+----+---+---+---+---
(0 rows)

p93=#         select * from f_t where i = 3;
 i | ts | j | t | e | c 
---+----+---+---+---+---
(0 rows)

p93=# rollback;

Yikes! We deleted a row on the lower version db and not on the newer one! Just because it wasn’t there. This would not happen if we did it the right way (begin; sync; enable trigger; end;). But the right way would make the tables read-only for a long time! The most hard-core reader would even say “why would you do trigger-based replication at all then?”.

You could do it with pg_upgrade as “normal” people would. And in the case of streaming replication you can make the whole set read-only: pause xlog replay and upgrade the master while the application still runs read-only against the slave.

Exactly! Didn’t I start with it?

Trigger-based replication comes on stage when you need something very special. For example, you can try allowing SELECT and some modification of newly created data, not just plain read-only. Let’s say you have an online questionnaire – a user registers, answers, gets his bonus-free-points-other-no-one-needs-great-stuff and leaves. With such a structure you can just forbid modifications of the data that is not on the higher version yet, while allowing the whole data flow for new users.

So you abandon the few people who are working online at the moment, letting newcomers work without even noticing you are in the middle of an upgrade. Sounds awful, but didn’t I say hypothetically? I didn’t? Well, I meant it.

No matter what the real life case could be, let’s look at how you can implement it. The delete and update functions will change. And let’s check the last scenario now:
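
The changed functions are again linked rather than pasted; the idea, sketched here for the delete case (the update one changes analogously), is to refuse the statement when the row has no counterpart on the higher version yet:

create or replace function tgf_d() returns trigger as $$
declare
  n int;
begin
  execute format('delete from %I where i = ($1).i', 'f_'||TG_RELNAME) using OLD;
  get diagnostics n = row_count; -- how many rows the remote delete touched
  if n = 0 then
    raise exception 'This data is not replicated yet, thus can''t be deleted';
  end if;
  return OLD;
end;
$$ language plpgsql;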

BEGIN
psql:blog.sql:86: ERROR:  This data is not replicated yet, thus can't be deleted
psql:blog.sql:87: ERROR:  current transaction is aborted, commands ignored until end of transaction block
psql:blog.sql:88: ERROR:  current transaction is aborted, commands ignored until end of transaction block
ROLLBACK

The row wasn’t deleted on the lower version, because it wasn’t found on the higher one. The same thing would happen with updates. Try it yourself. Now you can start the data sync without blocking many modifications on the table you include into trigger-based replication.

Is it better? Worse? It’s different – it has many flaws and some benefits over a global RO system. My goal was to demonstrate why somebody would want to use such a complicated method over the normal one – to get specific abilities over a stable, well-known process. At some cost, of course…

So, now that we feel a little safer about data consistency, and while our preexisting data in table t is syncing with p10, we can talk about other tables. How would it all work with FK (after all, I mentioned FK so many times, I have to include it in the sample)? Well, why wait?

create table c (i serial, t int references t(i), x text);
--and accordingly a foreign table - the one on newer version...
\c p10
create table c (i serial, t int references t(i), x text);
\c p93
create foreign table f_c(i serial, t int, x text) server p10 options (table_name 'c');
--let’s pretend it had some data before we decided to migrate with triggers to a higher version
insert into c (t,x) values (1,'FK');
--- so now we add triggers to replicate DML:
create trigger tgi before insert on c for each row execute procedure tgf_i();
create trigger tgu before update on c for each row execute procedure tgf_u();
create trigger tgd before delete on c for each row execute procedure tgf_d();

It is surely worth wrapping those three up into a function, with an aim to “triggerise” many tables. But I won’t, as I’m not going to add any more tables – a database with two referenced relations is already such a tangled net!
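
If you wanted to, though, a hypothetical helper (the name triggerise is mine, nothing shipped with PostgreSQL) could be as dumb as:

create or replace function triggerise(tbl regclass) returns void as $$
begin
  -- attach the three replication triggers to the given table
  execute format('create trigger tgi before insert on %s for each row execute procedure tgf_i()', tbl);
  execute format('create trigger tgu before update on %s for each row execute procedure tgf_u()', tbl);
  execute format('create trigger tgd before delete on %s for each row execute procedure tgf_d()', tbl);
end;
$$ language plpgsql;
--select triggerise('c');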

--now, what would happen if we try inserting a row referencing a key that does not exist on the remote db yet?..
insert into c (t,x) values (2,'FK');
/* it fails with:
psql:blog.sql:139: ERROR:  insert or update on table "c" violates foreign key constraint "c_t_fkey"
a new row isn't inserted on either the remote or the local db, so we have safe data consistency, but inserts are blocked?..
Yes, until the data that existed before triggerising gets to the remote db - you can't insert rows referencing pre-triggerising keys yet - but new data (in both t and c tables) will be accepted:
*/
insert into t(i) values(4); --I use the gap we got by deleting data above, so I don't need "returning" to know the exact id - less coding in the sample script
insert into c(t) values(4);
select * from c;
select * from f_c;

Resulting in:

psql:blog.sql:109: ERROR:  insert or update on table "c" violates foreign key constraint "c_t_fkey"
DETAIL:  Key (t)=(2) is not present in table "t".
CONTEXT:  Remote SQL command: INSERT INTO public.c(i, t, x) VALUES ($1, $2, $3)
SQL statement "insert into f_c select ($1).*"
PL/pgSQL function tgf_i() line 3 at EXECUTE statement
INSERT 0 1
INSERT 0 1
 i | t | x  
---+---+----
 1 | 1 | FK
 3 | 4 | 
(2 rows)

 i | t | x 
---+---+---
 3 | 4 | 
(1 row)

Again. It looks like data consistency is in place. You can start syncing data for new table c as well…

Tired? I definitely am.

Conclusion

In conclusion I would like to highlight some mistakes I made while looking into this approach. While I was building the update statement, dynamically listing all columns from pg_attribute, I lost a good hour. Imagine how disappointed I was to find out later that I had completely forgotten about the UPDATE (list) = (list) construct! The function came to a much shorter and more readable state.
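
For reference, the construct in question, shown on the sandbox table t with made-up values:

update t set (j, t) = ('{"flashback":true}', 'bar') where i = 0;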

So mistake number one was trying to build everything myself, just because it looks so reachable. It still is, but as always somebody has probably already done it better – spending two minutes just to check whether that is indeed the case can save you an hour of thinking later.

And second – things looked much simpler to me than they turned out to be, and I overcomplicated many cases that are perfectly handled by the PostgreSQL transaction model.

So only after trying to build the sandbox did I get a somewhat clear understanding of how to estimate this approach.

So planning is obviously needed, but don’t plan more than you can actually do.

Experience comes with practice.

My sandbox reminded me of a computer strategy game – you sit down to it after lunch and think: “aha, here I build a Pyramid, there I get archery, then I convert to Sons of Ra and build 20 longbowmen, and here I attack the pathetic neighbours. Two hours of glory.” And SUDDENLY you find yourself the next morning, two hours before work, with “How did I get here? Why do I have to sign this humiliating alliance with unwashed barbarians to save my last longbowman, and do I really need to sell my so-hard-built Pyramid for it?”
