Welcome to PostgreSQL, a powerful open source database system that can host anything from a few megabytes of customer data for a small-town-business, to hundreds of terabytes of ‘big data’ for multinational corporations. Regardless of the application, it’s likely that some setup and configuration help will be needed to get the database ready for action.
When a new server is installed, PostgreSQL’ s settings are very minimum as they are designed to run on the least amount of hardware possible. However they are very rarely optimal. Here, we will go over a basic setup for new projects, and how to set PostgreSQL up to run the most optimally on new projects.
With an on-premise database, the best option is for a bare metal host, as Virtual Machines generally perform slower unless we’re talking about high end enterprise level VM’s. This also allows for tighter control over CPU, Memory, and Disk setups. This however comes with the need to have an expert on hand (or contract) to do server maintenance.
Hosting a database in the cloud can be wonderful in some aspects, or a nightmare in others. Unless the cloud platform chosen is highly optimized (which generally means higher price), it may have trouble with higher load environments. Keep an eye out for whether or not the cloud server is shared or dedicated (dedicated allowing full performance from the server for the application), as well as the level of IOPS (Input/output Operations Per Second) provided by a cloud server. When (or if) the application grows to the point that the majority of data cannot be stored in memory, disk access speed is crucial.
General Host Setup
The main pillars needed to reliably set up PostgreSQL are based on the CPU, Memory, and Disk abilities of the host. Depending on the applications needs, a sufficient host as well as a well-tuned PostgreSQL configuration will have an amazing impact on the performance of the database system.
Choosing an Operating System
PostgreSQL can be compiled on most Unix-like operating systems, as well as Windows. However performance on Windows is not even comparable to a Unix-like system, so unless it’s for a small throw away project, sticking to an established Unix-like system will be the way to go. For this discussion, we’ll stick to Linux based systems.
The seemingly highest used Linux distribution used for hosting PostgreSQL is a Red Hat based system, such as CentOS or Scientific Linux, or even Red Hat itself. Since Red Hat and CentOS focus on stability and performance, the community behind these projects work hard to make sure important applications, such as databases, are on the most secure and most reliable build of Linux possible.
NOTE: Linux has a range of kernel versions that are not optimal for running PostgreSQL, so they are highly suggested to be avoided if possible (especially on applications where peak performance is the utmost importance). Benchmarks have shown that the number of transactions per second drop from kernel version 3.4 – 3.10, but recovers and significantly improves in kernel 3.12. This unfortunately rules out using CentOS 7 if going the CentOS route. CentOS 6 is still a valid and supported version of the Operating System, and CentOS 8 is expected to be released before 6 becomes unsupported.
Installation can be done either by source, or using repositories maintained by either the distribution of Linux chosen, or better yet, the PostgreSQL Global Development Group (PGDG), which maintains repositories for Red Hat based systems (Red Hat, Scientific Linux, CentOS, Amazon Linux AMI, Oracle Enterprise Linux, and Fedora), as well as packages for Debian and Ubuntu. Using the PGDG packages will ensure updates to PostgreSQL are available for update upon release, rather than waiting for the Linux distribution’s built in repositories to approve and provide them.
These days, it’s not hard to have multiple cores available for a database host. PostgreSQL itself has only recently started adding multi-threading capabilities on the query level, and will be getting much better in the years to come. But even without these new and upcoming improvements, PostgreSQL itself spawns new threads for each connection to the database by a client. These threads will essentially use a core when active, so number of cores required will depend on the level of needed concurrent connections and concurrent queries.
A good baseline to start out with is a 4 core system for a small application. Assuming applications do a dance between executing queries and sleeping, a 4 core system can handle a couple dozen connections before being overloaded. Adding more cores will help scale with an increasing workload. It’s not uncommon for very large PostgreSQL databases to have 48+ cores to serve many hundreds of connections.
Tuning Tips: Even if hyper-threading is available, transactions per second are generally higher when hyper-threading is disabled. For database queries that aren’t too complex, but higher in frequency, more cores is more important than faster cores.
Memory is an extremely important aspect for PostgreSQL’s overall performance. The main setting for PostgreSQL in terms of memory is shared_buffers, which is a chunk of memory allocated directly to the PostgreSQL server for data caching. The higher the likelihood of the needed data is living in memory, the quicker queries return, and quicker queries mean a more efficient CPU core setup as discussed in the previous section.
Queries also, at times, need memory to perform sorting operations on data before it’s returned to the client. This either uses additional ad-hoc memory (separate from shared_buffers), or temporary files on disk, which is much slower.
Tuning Tips: A basic starting point for setting shared_buffers is to set it to 1/4th the value of available system ram. This allows the operating system to also do its own caching of data, as well as any running processes other than the database itself.
Increasing work_mem can speed up sorting operations, however increasing it too much can force the host to run out of memory all together, as the value set can be partially or fully issued multiple times per query. If multiple queries request multiple blocks of memory for sorting, it can quickly add up to more memory than what is available on the host. Keep it low, and raise it slowly until performance is where desired.
Using the ‘free’ command (such as ‘free -h’), set effective_cache_size to a the sum of memory that’s free and cached. This lets the query planner know the level of OS caching may be available, and run better query plans.
Disk performance can be one of the more important things to consider when setting up a system. Input / Output speeds are important for large data loads, or fetching huge amounts of data to be processed. It also determines how quickly PostgreSQL can sync memory with disk to keep the memory pool optimal.
Some preparation in disks can help instantly improve potential performance, as well as future proof the database system for growth.
A fresh install of PostgreSQL will create the cluster’s data directory somewhere on the main (and possibly only) available drive on the system.
A basic setup using more drives would be adding a separate drive (or set of drives via RAID). It has the benefit of having all database related data transfer operating on a different I/O channel from the main operating system. It also allows the database to grow without fear of insufficient space causing issues and errors elsewhere in the operating system.
For databases with an extreme amount of activity, the PostgreSQL Transaction Log (xlog) directory can be placed on yet another drive, separating more heavy I/O to another channel away from the main OS as well as the main data directory. This is an advanced measure that helps squeeze more performance out of a system, that may otherwise be near its limits.
Setting up RAID for the database drives not only protects from data loss, it can also improve performance if using the right RAID configuration. RAID 1 or 10 are generally thought to be the best, and 10 offers parity and overall speed. RAID 5, however, while having higher levels of redundancy, suffers from significant performance decrease due to the way it spreads data around multiple disks. Plan out the best available option with plenty of space for data growth, and this will be a configuration that won’t need to be changed often, if at all.
Solid State Drives are wonderful for performance, and if they meet the budget, enterprise SSD’s can make heavy data processing workloads night and day faster. Smaller to medium databases with smaller to medium workloads may be overkill, but when fighting for the smallest percentage increase on large applications, SSD can be that silver bullet.
Tuning Tips: Chose a drive setup that is best for the application at hand, and has plenty of space to grow with time as the data increases.
If using a SSD, setting random_page_cost to 1.5 or 2 (the default is 4) will be beneficial to the query planner, since random data fetching is much quicker than seen on spinning disks.
Initial Configuration Settings
When setting up PostgreSQL for the first time, there’s a handful of configuration settings that can be easily changed based on the power of the host. As the application queries the database over time, specific tuning can be done based on the application’s needs. However that will be the topic for a separate tuning blog.
shared_buffers: Set to 1/4th of the system memory. If the system has less than 1 GB of total memory, set to ~ 1/8th of total system memory
work_mem: The default is 4MB, and may even be plenty for the application in question. But if temp files are being created often, and those files are fairly small (tens of megabytes), it might be worth upping this setting. A conservative entry level setting can be (1/4th system memory / max_connections). This setting depends highly on the actual behavior and frequency of queries to the database, so should be only increased with caution. Be ready to reduce it back to previous levels if issues occur.
effective_cache_size: Set to the sum of memory that’s free and cached reported by the ‘free’ command.
For PostgreSQL 9.4 and below:
checkpoint_segments: A number of checkpoint segments (16 megabytes each) to give the Write Ahead Log system. The default is 3, and can safely be increased to 64 for even small databases.
For PostgreSQL 9.5 and above:
max_wal_size: This replaced checkpoint_segments as a setting. The default is 1GB, and can remain here until needing further changes.
listen_address: This setting determines what personal IP addresses / Network Cards to listen to connections on. In a simple setup, there will likely be only one, while more advanced networks may have multiple cards to connect to multiple networks. * Signifies listen to everything. However, if the application accessing the database is to live on the same host as the database itself, keeping it as ‘localhost’ is sufficient.
Some basic logging settings that won’t overload the logs are as follows.
log_checkpoints = on log_connections = on log_disconnections = on log_temp_files = 0