Comparing Load Balancers for PostgreSQL

Paul Namuag

Load balancing increases system performance, especially from the application standpoint, allowing several computers to serve the same data. It works such that the load is being distributed among client queries to replica nodes aside from its primary or master node, while routing database modifications solely to the master node alone. Any modifications to the master node are subsequently propagated to each replica using PostgreSQL Streaming Replication.

How Can Load Balancers Affect PostgreSQL?

Utilizing load balancing shall direct client applications to connect to the load balancing server, and distribute the initiated connections to the available PostgreSQL nodes depending on the type of query requests. This helps stress out outstanding load on a particular PostgreSQL server and promotes parallel balance of load among the available nodes within the cluster. 

Using PostgreSQL, there are already a handful of existing solutions to make this to work. These solutions can work seamlessly or load balancing can work with the current topology--with primary and standby nodes--yet load balancing is implemented in the application layer itself. Load balancing faces challenges with synchronization problems which is the fundamental difficulty for servers working together. Because there is no single solution that eliminates the impact of the sync problem for all use cases, there are multiple solutions. Each solution addresses this problem in a different way, and minimizes its impact for a specific workload.

In this blog, we'll take a look at those load balancers by comparing them and how beneficial it is for your PostgreSQL workload. 

HAProxy Load Balancing For PostgreSQL

HAProxy is an event-driven, non-blocking engine combining a proxy with a very fast I/O layer and a priority-based, multi-threaded scheduler. As it is designed with a data forwarding goal in mind, its architecture is designed to operate in a lightweight process which is optimized to move data as fast as possible with the least possible operations. It focuses on optimizing the CPU cache efficiency by sticking connections to the same CPU as long as possible. As such it implements a layered model offering bypass mechanisms at each level ensuring data doesn't reach higher levels unless needed. Most of the processing is performed in the kernel. HAProxy does its best to help the kernel do the work as fast as possible by giving some hints or by avoiding certain operations when it guesses they could be grouped later. As a result, typical figures show 15% of the processing time spent in HAProxy versus 85% in the kernel in TCP or HTTP close mode, and about 30% for HAProxy versus 70% for the kernel in HTTP keep-alive mode.

HAProxy has additional features of load balancing also. For example, the TCP proxying feature allows us to use it for database connections especially for PostgreSQL using its built-in check service support. Even though there's database service support, it does not suffice the desired health check especially for a replication type of cluster. The standard approach when deploying it for production is to use TCP check, then depend on xinetd with HAProxy.

Pros of Using HAProxy for PostgreSQL

The best thing with HAProxy is its lightweight, easy to configure and use, and it does the job as expected. Using HAProxy on top of a PostgreSQL cluster has been implemented and deployed multiple times from large organizations to various SME's/SMB's for their production use. It has been long proven for production and high workload capacity not only for databases but even with other network services such as web applications or for geo-load balancing (distribute traffic across multiple data centers). Having HAProxy on top of PostgreSQL, it allows users the capability to throttle or limit responses to parallelize and distribute the load properly to all available nodes in the cluster. The built-in mechanism with HAProxy also allows the user to setup high availability seamlessly and easier to scale if load is needed and avoid single point of failure (SPOF).

Cons of Using HAProxy for PostgreSQL

HAProxy does not provide query filtering nor does query analysis to identify the type of statements being requested. It lacks the ability to perform a read/write split on a single port. Setting a load balancer on top of HAProxy requires that you have to at least setup different ports for your writes and different ones for your reads. This requires application changes to suit your needs. 

HAProxy also does a very simple feature support with PostgreSQL for health check, yet this only determines if the node is up or not, as if it's just pinging the node and waits for a bounce back response. It does not identify what role a node it is trying to forward the requested connections from the client to the desired node. Therefore, it does not understand or no feature in HAProxy to understand the replication topology. Although, a user can create separate listeners based on different ports but still it adds changes within the application to satisfy the load balancing needs. This means that either using an external script with xinetd can be the workaround to fill up the requirements.Still, it is not integrated to HAProxy and can be prone to human error.

If one node or group of nodes has to be placed under maintenance mode, then  you shall also be required to apply changes to your HAProxy, otherwise it can be catastrophic.

Pgpool-II For Load Balancing Your PostgreSQL

Pgpool-II is an open-source software and is embraced by the massive PostgreSQL community for implementing load balancing and using this to act as their middleware from the application down to the proxy layer, then distributes the load after it has fully analyzed the type of request per query or database connection. Pgpool-II has been there for such a long time since 2003 which was originally named Pgpool until it became Pgpool-II in 2006, which serves as a testament of a very stable proxy tool not only for load balancing but tons of cool features as well.

Pgpool-II is known as the swiss army knife for PostgreSQL and is a proxy software that sits between PostgreSQL servers and a PostgreSQL database client. The basic idea of PgPool-II is that it sits on the client, then read queries have to be delivered to the standby nodes, whilst the write or modifications goes straight to the primary. It is a very intelligent load balancing solution that not only does load balance, but also supports high-availability and provides connection pooling. The intelligent mechanism allows to balance load between masters and slaves. So writes are loaded to the master, while processing reads are directed to the available read-only servers, which are your supposedly hot standby nodes. Pgpool-II also provides logical replication. While its use and importance has decreased as the inbuilt replication options improved on PostgreSQL server side, this still remains a valuable option for older versions of PostgreSQL. On top of all this, it also provides connection pooling.

Pgpool-II has a more involved architecture than PgBouncer in order to support all the features it does. Since both support connection pooling, the latter has no load balancing features.

Pgpool-II can manage multiple PostgreSQL servers. Using the replication function enables creating a realtime backup on 2 or more physical disks, so that the service can continue without stopping servers in case of a disk failure. Since Pgpool-II is also a connection pooling capable, it can provide limiting on the exceeding connections. There is a limit on the maximum number of concurrent connections with PostgreSQL, and connections are rejected after this many connections. Setting the maximum number of connections, however, increases resource consumption and affects system performance. pgpool-II also has a limit on the maximum number of connections, but extra connections will be queued instead of returning an error immediately.

In load balancing,  If a database is replicated, executing a SELECT query on any server will return the same result. pgpool-II takes advantage of the replication feature to reduce the load on each PostgreSQL server by distributing SELECT queries among multiple servers, improving the system's overall throughput. At best, performance improves proportionally to the number of PostgreSQL servers. Load balance works best in a situation where there are a lot of users executing many queries at the same time.

Using the parallel query function, data can be divided among the multiple servers, so that a query can be executed on all the servers concurrently to reduce the overall execution time. Parallel query works the best when searching large-scale data.

Pros of Using Pgpool for PostgreSQL

It's a feature-rich type of software not just for load balancing. The core features and support of this tool is highly on-demand which provides connection pooling, an alternative go PgBouncer, native replication, online recovery, in-memory query caching, automatic failover, and high availability with its sub process using watchdog. This tool has been so old and is continually supported massively by the PostgreSQL community, so dealing with problems cannot be hard to seek help. The documentation is your friend here when seeking questions but searching for help in the community is not difficult, and the fact this tool is an open-source so you can freely use this as long as you comply with the BSD license.

Pgpool-II also has SQL parser. This means it is able to accurately parse the SQLs, and rewrite the query. This allows Pgpool-II to boost parallelism depending on the query request.

Cons of Using Pgpool for PostgreSQL

Pgpool-II does not offer STONITH (shoot the other node in the head) which provides node fencing mechanism. If the PostgreSQL server fails, it maintains the service availability. Pgpool-II can also be the single point of failure (SPOF). Once node goes down, then your database connectivity and availability stops from that point. Although this can be fixed by having redundancy with Pgpool-II and having to utilize watchdog to coordinate multiple Pgpool-II nodes, that adds extra work.

For connection pooling, unfortunately, for those focusing only on connection pooling, what Pgpool-II doesn’t do very well is connection pooling, especially for a small number of clients. Because each child process has its own pool, and there is no way to control which client connects to which child process, too much is left to luck when it comes to reusing connections.

Using JDBC Driver For Load Balancing Your PostgreSQL

Java Database Connectivity (JDBC) is an application programming interface (API) for the programming language Java, which defines how a client may access a database. It is part of the Java Standard Edition platform and provides methods to query and update data in a database, and is oriented towards relational databases.

PostgreSQL JDBC Driver (PgJDBC for short) allows Java programs to connect to a PostgreSQL database using standard, database independent Java code. Is an open source JDBC driver written in Pure Java (Type 4), and communicates in the PostgreSQL native network protocol. Because of this, the driver is platform independent; once compiled, the driver can be used on any system.

It is not comparable to load balancing solutions that we have pointed out previously. Hence, this tool is your application programming interface API which allows you to connect from your application for whichever type of programming language this is written that does support JDBC or at least has adapter to connect with JDBC. On the other hand, it's more favorable with Java applications. 

The load balancing with JDBC is pretty naive yet can do the job. Provided with the connection parameters that can trigger the load balancing mechanism this tool has to offer,

  • targetServerType - allows opening connections to only servers with required state/role in accordance to the defining factor for the PostgreSQL servers. The allowed values are any, primary, master (deprecated), slave (deprecated), secondary, preferSlave, and preferSecondary.  State or role is determined by observing if the server allows writes or not. 
  • hostRecheckSeconds - controls how long in seconds the knowledge about a host state is cached in JVM wide global cache. The default value is 10 seconds.
  • loadBalanceHosts – allows you to configure if the first host is always tried (when set to false) or if connections are randomly chosen (when set to true)

So using loadBalanceHosts which accepts a boolean value. loadBalanceHosts is disabled during its default mode  and hosts are connected in the given order. If enabled hosts are chosen randomly from the set of suitable candidates. The basic syntax when connecting to the database using jdbc is as follows,

  • jdbc:postgresql:database
  • jdbc:postgresql:/
  • jdbc:postgresql://host/database
  • jdbc:postgresql://host/
  • jdbc:postgresql://host:port/database
  • jdbc:postgresql://host:port/

Given that loadBalanceHosts and connection receives multiple hosts configured just like below,

jdbc:postgresql://host1:port1,host2:port2,host3:port3/database

This allows JDBC to randomly choose from the set of suitable candidates.

Pros of Using PgJDBC for PostgreSQL

No need to require middleware or proxy as load balancers. This process adds more performance boost from the application frontend since there's no extra layer for each request to pass by. If you have applications ready and are written to support interfacing to JDBC, this can be advantageous and if you do not need more middleware especially if your budget is tight and wants to limit only the processes dedicated to its sole purpose and function. Unlike in high traffic and large demand applications, it might require proxy servers acting as your load balancers and might demand extra resources to properly handle high requests of connections which also requires demand on CPU and memory processing.

Cons of Using PgJDBC for PostgreSQL

You have to set up your code for each and every connection to be requested. It is an application programming interface which means, there's work behind to deal with especially if your application is very demanding on each request to be sent to the proper servers. There's  no high availability, auto scalability, and has a single-point of failure.

How About Wrappers or Tools Implemented With libpq For Load Balancing Your PostgreSQL?

libpq is the C application programmer's interface to PostgreSQL. libpq is a set of library functions that allow client programs to pass queries to the PostgreSQL backend server and to receive the results of these queries.

libpq is also the underlying engine for several other PostgreSQL application interfaces, including those written for C++, PHP, Perl, Python, Tcl, Swift and ECPG. So some aspects of libpq's behavior will be important to you if you use one of those packages. 

libpq does not automate the load balancing and is not to be considered as a tool for load balancing solutions. Yet, it is capable of connecting onto the next available servers if the previous servers listed for connection fails. For example, if you have two available hot standby nodes, if the first node is too busy and fails to respond to the corresponding timeout value, then it connects to the next available node in the given connection. It relies on what type of session attributes you specified. This relies on parameter target_session_attrs

The parameter target_session_attrs accepts values read-write, and any which is the default value if not specified. What does parameter target_session_attrs is that, if set to read-write, only a connection in which read-write transactions are accepted during connection. The query SHOW transaction_read_only will be sent upon any successful connection. If the result is on, then connection will be closed, meaning node is identified as a replica or does not process writes. If multiple hosts were specified in the connection string, any remaining servers will be tried just as if the connection attempt had failed. The default value of this parameter, any, which means all connections as acceptable. Although relying with target_session_attrs is not enough for load balancing, you might be able to simulate a round-robin fashion. See my example C code below using libpq,

#include <stdio.h>

#include <stdlib.h>

#include <string.h>

#include <time.h>

#include <unistd.h>

#include <libpq-fe.h>




const char* _getRoundRobinConn() {

   char* h[2];

   h[0] = "dbname=node40 host=192.168.30.40,192.168.30.50";

   h[1] = "dbname=node50 host=192.168.30.50,192.168.30.40";



   time_t t;

   //srand((unsigned)time(&t));

   sleep(1.85);

   srand((unsigned)time(NULL));



   return h[rand() % 2];

}



void

_connect()

{

  PGconn *conn;

  PGresult *res;

  char strConn[120];




  snprintf(strConn, 1000, "user=dbapgadmin password=dbapgadmin %s target_session_attrs=any", _getRoundRobinConn());

  //printf("\nstrConn value is: %s\n", strConn);



  conn = PQconnectdb(strConn);



  res = PQexec(conn, "SELECT current_database(), inet_client_addr();");



  if ( PQresultStatus(res)==PGRES_TUPLES_OK )

  {

    printf("current_database = %s on %s\n", PQgetvalue(res, 0, 0),

PQhost(conn));

  } else {



    printf("\nFailed... Message Code is: %d\n", PQresultStatus(res));

  }



  PQclear(res);

  PQfinish(conn);



}



int main(void)

{

  int i;

  for (i=0 ; i<5 ; i++)

    _connect();



  return 0;

}

The result reveals,

[email protected]:/home/vagrant# gcc -I/usr/include/postgresql -L/usr/lib/postgresql/12/lib libpq_conn.c -lpq -o libpq_conn; ./libpq_conn

current_database = node40 on 192.168.30.40

current_database = node40 on 192.168.30.40

current_database = node50 on 192.168.30.50

current_database = node40 on 192.168.30.40

current_database = node50 on 192.168.30.50

Take note that, if node .40 (the primary node) goes down, it will always direct the connection to the .50 as long as your target_session_attrs value is any.

In that case, you can simply create your own freely with the help of libpq. Although the process of relying on libpq and/or its wrappers is just too raw to say that this can provide the desired load balancing mechanism with even distribution to the nodes you have. Definitely, this approach and coding can be improved but the thought is that this is free and open source, and you can code without relying on middlewares and freely engineer the way your load balancing shall work.

Pros of Using libpq for PostgresQL

libpq library is the programmer's application interface built in the C programming language. Yet, the library has been implemented in various languages as wrappers so that programmers can communicate with the PostgreSQL database using their favorite languages. You can directly create your own application using your favorite languages and then list down the servers you intend queries will be sent but only after the other, if failure or timeout send your load to the available nodes that you intend to distribute the load. It's available in languages such as Python, Perl, PHP, Ruby, Tcl, or Rust. 

Cons of Using libpq for PostgresQL

Implementation wise for load parallelism is not perfect and you have to write your own load balancing mechanism by code. There's no configuration that you can use or customize since it's all alone a programming interface to the PostgreSQL database with the help of target_session_attrs param. That means, when composing a database connection, you must have a series of read connections going to your replica/standby nodes, then write queries that go to the writer or primary node in your code, whether it's in your application or you have to create your own API to manage load balancing solution.

Using this approach definitely does not need or rely a middleware from the front-end application perspective to the database as the backend. Of course that's lightweight but when sending the list of servers upon connection, it doesn't mean that the load is understood and sent evenly unless you have to add in your code for this approach. This just adds hassle, yet, there are already existing solutions so why need to reinvent the wheel?

Conclusion

Implementing your load balancers with PostgreSQL can be demanding but depends on what type of application and cost you are dealing with. Sometimes, for a high load demand, it requires the need of middleware acting as a proxy in order to properly distribute the load and also oversees its node state or health. On the other hand, it might demand server resources either it has to be run on a dedicated server or demands extra CPU and memory in order to satisfy the needs and this adds cost. Hence, there's also a simple way yet time consuming but offers the distribution of load to the available servers you have already. Yet it requires programming skills and understanding of the API's functionality.

ClusterControl
The only management system you’ll ever need to take control of your open source database infrastructure.