Severalnines Blog
The automation and management blog for open source databases

A Guide to Pgpool for PostgreSQL - Part Two

This is the second part of the blog “A Guide to Pgpool for PostgreSQL”. The first part covering load balancing, session pooling, in memory cache and installation can be found here.

Many users look towards pgpool specifically for High Availability features, and it has plenty to offer. There are few quite a lot of instructions for pgpool HA on the web (e.g. longer one and shorter one), so it would not make any sense to repeat them. Neither do we want to provide yet another blind set of configuration values. Instead I suggest to play against the rules and try doing it the wrong way, so we’ll see some interesting behaviour. One of the top expected feature (at least it’s on the top of the page) is the ability to recognise the usability of a “dead” ex master and re-use it with pg_rewind. It could save hours of bringing back the new standby with big data (as we skip rsync or pg_basebackup, which effectively copies ALL files over from the new master). Strictly speaking, pg_rewind is meant for planned failover (during upgrade or migrating to new hardware). But we’ve seen when it’s greatly helps with not planned but yet graceful shutdown and automated failover - for e.g., ClusterControl makes use of it when performing automatic failover of replication slaves. Let’s assume we have the case: we need (any) master to be accessible as much as possible. If for some reason (network failure, max connections exceeded or any other “failure” that forbids new sessions to start) we no longer can use a master for RW operations, we have a failover cluster configured, with slaves that can accept connections. We can then promote one of the slaves and fail over to it.

First let’s assume we have three nodes:

  • 10.1.10.124:5400 with /pg/10/m (pgpool spins here as well)
  • 10.1.10.147:5401 with /pg/10/m2
  • 10.1.10.124:5402 with /pg/10/s2

Those are effectively the same nodes as in part one, but the failover node is moved to a different host and $PGDATA. I did it to make sure I did not typo or forget some extra quote in remote ssh command. Also the debugging info will look simpler because ip addresses are different. Finally I was not sure I will be able to make this unsupported use case to work, so I have to see it with my own eyes.

Failover

First we set failover_command and run pgpool reload and try to failover. Here and further, I will echo some info to /tmp/d on the pgpool server, so I can tail -f /tmp/d to see the flow.

postgres@u:~$ grep failover_command /etc/pgpool2/pgpool.conf
failover_command = 'bash /pg/10/fo.sh %D %H %R'

postgres@u:~$ cat /pg/10/fo.sh
rem_cmd="pg_ctl -D $3 promote"
cmd="ssh -T postgres@$2 $rem_cmd"
echo "$(date) $cmd" >>/tmp/d
$cmd &>>/tmp/d

NB: Do you have $PATH set in .bashrc on remote host?..

Let’s stop the master (I know it’s not how disaster happens, you expect at least some huge monkey or red shining robot to smash the server with a huge hammer, or at least the boring hard disks to die, but I’m using this graceful variant to demo the possible use of pg_rewind, so here the failover will be the result of human error or network failure a half second over the health_check_period), so:

/usr/lib/postgresql/10/bin/pg_ctl -D /pg/10/m stop
2018-04-18 13:53:55.469 IST [27433]  LOG:  received fast shutdown request
waiting for server to shut down....2018-04-18 13:53:55.478 IST [27433]  LOG:  aborting any active transactions
2018-04-18 13:53:55.479 IST [28855] postgres t FATAL:  terminating connection due to administrator command
2018-04-18 13:53:55.483 IST [27433]  LOG:  worker process: logical replication launcher (PID 27440) exited with exit code 1
2018-04-18 13:53:55.484 IST [27435]  LOG:  shutting down
2018-04-18 13:53:55.521 IST [27433]  LOG:  database system is shut down
 done
server stopped

Now checking the failover command output:

postgres@u:~$ cat /tmp/d
Wed Apr 18 13:54:05 IST 2018 ssh -T postgres@localhost
pg_ctl -D /pg/10/f promote
waiting for server to promote.... done
server promoted

And checking after a while:

t=# select nid,port,st, role from dblink('host=localhost port=5433','show pool_nodes') as t (nid int,hostname text,port int,st text,lb_weight float,role text,cnt int,cur_node text,del int);
 nid | port |  st  |  role
-----+------+------+---------
   0 | 5400 | down | standby
   1 | 5401 | up   | primary
   2 | 5402 | up   | standby
(3 rows)

Also we see in ex-failover cluster logs:

2018-04-13 14:26:20.823 IST [20713]  LOG:  received promote request
2018-04-13 14:26:20.823 IST [20713]  LOG:  redo done at 0/951EC20
2018-04-13 14:26:20.823 IST [20713]  LOG:  last completed transaction was at log time 2018-04-13 10:41:54.355274+01
2018-04-13 14:26:20.872 IST [20713]  LOG:  selected new timeline ID: 2
2018-04-13 14:26:20.966 IST [20713]  LOG:  archive recovery complete
2018-04-13 14:26:20.998 IST [20712]  LOG:  database system is ready to accept connections

Checking replication:

postgres@u:~$ psql -p 5401 t -c "select now() into test"
SELECT 1
postgres@u:~$ psql -p 5402 t -c "select * from test"
              now
-------------------------------
 2018-04-13 14:33:19.569245+01
(1 row)

The slave /pg/10/s2:5402 switched to a new timeline thanks to recovery_target_timeline = latest in recovery.conf, so we are good. We don’t need to adjust recovery.conf to point to the new master, because it points to the pgpool ip and port and they stay the same no matter who is performing the primary master role.

Checking load balancing:

postgres@u:~$ (for i in $(seq 1 9); do psql -h localhost -p 5433 t -c "select current_setting('port') from ts limit 1" -XAt; done) | sort| uniq -c
      6 5401
      3 5402

Nice. Apps behind pgpool will notice a second outage and continue to work.

Reusing ex-master

Now we can turn the ex-master to failover standby and bring it back (without adding a new node to pgpool, as it exists there already). If you don’t have wal_log_hints enabled or data checksums (comprehensive difference between these options is here), you have to recreate cluster on ex-master to follow a new timeline:

postgres@u:~$ rm -fr /pg/10/m
postgres@u:~$ pg_basebackup -h localhost -p 5401 -D /pg/10/m/

But don’t rush to run the statements above! If you took care on wal_log_hints (requires restart), you can try using pg_rewind for much faster switching of the ex-master to a new slave.

So ATM we have the ex-master offline, new master with next timeline started. If the ex-master was offline due to temporary network failure and it comes back, we need to shut it down first. In the case above we know it’s down, so we can just try rewinding:

postgres@u:~$ pg_rewind -D /pg/10/m2 --source-server="port=5401 host=10.1.10.147"
servers diverged at WAL location 0/40605C0 on timeline 2
rewinding from last common checkpoint at 0/4060550 on timeline 2
Done!

And again:

postgres@u:~$ pg_ctl -D /pg/10/m2 start
server started
...blah blah 
postgres@u:~$ 2018-04-16 12:08:50.303 IST [24699]  LOG:  started streaming WAL from primary at 0/B000000 on timeline 2

t=# select nid,port,st,role from dblink('host=localhost port=5433','show pool_nodes') as t (nid int,hostname text,port int,st text,lb_weight float,role text,cnt int,cur_node text,del int);
 nid | port |  st  |  role
-----+------+------+---------
   0 | 5400 | down | standby
   1 | 5401 | up   | primary
   2 | 5402 | up   | standby
(3 rows)

Ops. Duh! Despite the fact that the cluster at port 5400 is online and follows a new timeline, we need to tell pgpool to recognize it:

postgres@u:~$ pcp_attach_node -w -h 127.0.0.1 -U vao -n 0
 pcp_attach_node  -- Command Successful

Now all three are up (and pgpool knows it) and in sync:

postgres@u:~$ sql="select ts.i::timestamp(0), current_setting('data_directory'),case when pg_is_in_recovery() then 'recovering' else 'mastering' end stream from ts order by ts desc"
postgres@u:~$ psql -h 10.1.10.147 -p 5401 t -c "$sql";
          i          | current_setting |  stream
---------------------+-----------------+-----------
 2018-04-30 14:34:36 | /pg/10/m2       | mastering
(1 row)

postgres@u:~$ psql -h 10.1.10.124 -p 5402 t -c "$sql";
          i          | current_setting |   stream
---------------------+-----------------+------------
 2018-04-30 14:34:36 | /pg/10/s2       | recovering
(1 row)

postgres@u:~$ psql -h 10.1.10.124 -p 5400 t -c "$sql";
          i          | current_setting |   stream
---------------------+-----------------+------------
 2018-04-30 14:34:36 | /pg/10/m        | recovering
(1 row)

Now I’ll try using recovery_1st_stage_command for reusing ex-master:

root@u:~# grep 1st /etc/pgpool2/pgpool.conf
recovery_1st_stage_command = 'or_1st.sh'

But recovery_1st_stage_command does not offer the needed arguments for pg_rewind, which I can see if I add to recovery_1st_stage_command:

echo "online recovery started on $(hostname) $(date --iso-8601) $0 $1 $2 $3 $4"; exit 1;

The output:

online recovery started on u2 2018-04-30 /pg/10/m2/or_1st.sh /pg/10/m2 10.1.10.124 /pg/10/m 5401

Well - using pg_rewind is just in todo list - what did I expect?.. So I need to do some monkey hack to get master ip and port (remember it will keep changing after failover).

Download the Whitepaper Today
 
PostgreSQL Management & Automation with ClusterControl
Learn about what you need to know to deploy, monitor, manage and scale PostgreSQL

A monkey hack

So I have something like this in recovery_1st_stage_command:

root@u:~# cat /pg/10/or_1st.sh
pgpool_host=10.1.10.124
pgpool_port=5433
echo "online recovery started on $(hostname) $(date --iso-8601) $0 $1 $2 $3 $4" | ssh -T $pgpool_host "cat >> /tmp/d"
master_port=$(psql -XAt -h $pgpool_host -p $pgpool_port t -c "select port from dblink('host=$pgpool_host port=$pgpool_port','show pool_nodes') as t (nid int,hostname text,port int,st text,lb_weight float,role text,cnt int,cur_node text,del int) where role='primary'")
master_host=$(psql -XAt -h $pgpool_host -p $pgpool_port t -c "select hostname from dblink('host=$pgpool_host port=$pgpool_port','show pool_nodes') as t (nid int,hostname text,port int,st text,lb_weight float,role text,cnt int,cur_node text,del int) where role='primary'")
failover_host=$(psql -XAt -h $pgpool_host -p $pgpool_port t -c "select hostname from dblink('host=$pgpool_host port=$pgpool_port','show pool_nodes') as t (nid int,hostname text,port int,st text,lb_weight float,role text,cnt int,cur_node text,del int) where role!='primary' order by port limit 1")
src='"port=$master_port host=$master_host"'
rem_cmd="'pg_rewind -D $3 --source-server=\"port=$master_port host=$master_host\"'"
cmd="ssh -T $failover_host $rem_cmd"
echo $cmd | ssh -T $pgpool_host "cat >> /tmp/d"
$cmd

tmp=/tmp/rec_file_tmp
cat > $tmp <<EOF
standby_mode          = 'on'
primary_conninfo      = 'host=$master_host port=$master_port user=postgres'
trigger_file = '/tmp/tg_file'
recovery_target_timeline  = latest
EOF

scp $tmp $failover_host:$3/recovery.conf

rem_cmd="pg_ctl -D $3 start"
cmd="ssh -T $failover_host $rem_cmd"
echo $cmd | ssh -T $pgpool_host "cat >> /tmp/d"
$cmd
echo "OR finished $(date --iso-8601)" | ssh -T $pgpool_host "cat >> /tmp/d"
exit 0;

Now what a mess! Well - if you decide to use not existing feature - prepare - it will look bad, work worse and you will permanently feel ashamed of what you did. So step by step:

  • I need pgpool IP and port to remotely connect to it, both to query “show pool_nodes” and to log steps and to run commands.
  • I’m piping some dbg info to /tmp/d over ssh, because the command will be executed on master side, which will change after failing over
  • I can use the result of “show pool_nodes” to get the running master connection info simply filtering with WHERE clause
  • I will need double quotes in argument for pg_rewind, which will need to run over ssh, so I just split the command for readability, then echo it and run
  • Preparing recovery.conf based on output from “show pool_nodes” (writing it I ask myself - why did I not just use pgpool IP and port instead?..
  • Starting new failover slave (I know I’m supposed to use 2nd step - just skipped to avoid getting all IPs and port over again)

Now what’s left - trying to use this mess in pcp:

root@u:~# pcp_recovery_node -h 127.0.0.1 -U vao -n 0 -w
pcp_recovery_node -- Command Successful
root@u:~# psql -h localhost -p 5433 t -c"select nid,port,st,role from dblink('host=10.1.10.124 port=5433','show pool_nodes') as t (nid int,hostname text,port int,st text,lb_weight float,role text,cnt int,cur_node text,del int)"
 nid | port | st |  role
-----+------+----+---------
   0 | 5400 | up | standby
   1 | 5401 | up | primary
   2 | 5402 | up | standby
(3 rows)

Checking the /tmp/d on pgpool server:

root@u:~# cat /tmp/d
Tue May  1 11:37:59 IST 2018 ssh -T postgres@10.1.10.147 /usr/lib/postgresql/10/bin/pg_ctl -D /pg/10/m2 promote
waiting for server to promote.... done
server promoted
online recovery started on u2 2018-05-01 /pg/10/m2/or_1st.sh /pg/10/m2
ssh -T 10.1.10.124 'pg_rewind -D --source-server="port=5401 host=10.1.10.147"'
ssh -T 10.1.10.124 pg_ctl -D start
OR finished 2018-05-01

Now obviously we want to roll it over again to see if it works on any host:

postgres@u:~$ ssh -T 10.1.10.147 pg_ctl -D /pg/10/m2 stop             waiting for server to shut down.... done
server stopped
postgres@u:~$ psql -h localhost -p 5433 t -c"select nid,port,st,role from dblink('host=10.1.10.124 port=5433','show pool_nodes') as t (nid int,hostname text,port int,st text,lb_weight float,role text,cnt int,cur_node text,del int)"
 nid | port |  st  |  role
-----+------+------+---------
   0 | 5400 | up   | primary
   1 | 5401 | down | standby
   2 | 5402 | up   | standby
(3 rows)

root@u:~# pcp_recovery_node -h 127.0.0.1 -U vao -n 1 -w

postgres@u:~$ psql -h localhost -p 5433 t -c"select nid,port,st,role from dblink('host=10.1.10.124 port=5433','show pool_nodes') as t (nid int,hostname text,port int,st text,lb_weight float,role text,cnt int,cur_node text,del int)"
 nid | port | st |  role
-----+------+----+---------
   0 | 5400 | up | primary
   1 | 5401 | up | standby
   2 | 5402 | up | standby
(3 rows)

Log looks similar - only IP and ports have changed:

 Tue May  1 11:44:01 IST 2018 ssh -T postgres@10.1.10.124 /usr/lib/postgresql/10/bin/pg_ctl -D /pg/10/m promote
waiting for server to promote.... done
server promoted
online recovery started on u 2018-05-01 /pg/10/m/or_1st.sh /pg/10/m 10.1.10.147 /pg/10/m2 5400
ssh -T 10.1.10.147 'pg_rewind -D /pg/10/m2 --source-server="port=5400 host=10.1.10.124"'
ssh -T 10.1.10.147 pg_ctl -D /pg/10/m2 start
online recovery started on u 2018-05-01 /pg/10/m/or_1st.sh /pg/10/m
ssh -T 10.1.10.147 'pg_rewind -D --source-server="port=5400 host=10.1.10.124"'
ssh -T 10.1.10.147 pg_ctl -D start
OR finished 2018-05-01

In this sandbox, the master moved to 5401 on failover and after living there for a while it moved back to 5400. Using pg_rewind should make it as fast as possible. Previously the scary part of automatic failover was - if you really messed up the config and did not foresee some force majeure, you could run into automatic failover to next slave and next and next until there is no free slave left. And after that, you just end up with several split brained masters and no failover spare. It’s a poor consolation in such scenario to have even more slaves to failover, but without pg_rewind you would not have even that. “Traditional” rsync or pg_basebackup copy ALL $PGDATA over to create a standby, and can’t reuse the “not too much different” ex master.

In conclusion to this experiment I would like to emphasize once again - this is not a solution suitable for blind copy pasting. The usage of pg_rewind is not encouraged for pg_pool. It is not usable at all ATM. I wanted to add some fresh air to pgpool HA configuration, for nubes like me to observe a little closer how it works. For coryphaeus to smile at naivistic approach and maybe see it with our - nubes eyes.