Five Cool Things I Learned at the PostgreSQL Conference Europe 2018
I spent a week in the magnificent city of Lisbon attending the annual European PostgreSQL Conference. This marked the 10th anniversary of the first European PostgreSQL Conference, and it was my sixth time attending.
The city was wonderful, the atmosphere was friendly, and it promised to be a very productive and informative week full of interesting conversations with intelligent and welcoming people. So basically the very first cool thing I learned in Lisbon is how great Lisbon and Portugal are; but I guess you came here for the rest of the story!
We attended the training session “PostgreSQL DBA toolbelt for day-to-day ops”
by Kaarel Moppel (Cybertec). One thing I noted was the setting of shared_buffers. Since shared_buffers effectively competes with, or complements, the operating system's cache, it shouldn't be set to any value between 25% and 75% of the total available RAM. So while, in general, the recommended setting for typical workloads is 25% of RAM, it could be set to >= 75% in special cases, but not in between.
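Following that rule of thumb, here is how such a setting might look in postgresql.conf (a hedged sketch for a hypothetical 64 GB server; the right value depends entirely on your workload):

```ini
# postgresql.conf -- hypothetical server with 64 GB RAM
# typical workload: stay at ~25% of RAM
shared_buffers = 16GB

# special case (e.g. the whole working set fits in shared buffers):
# go to >= 75% of RAM instead -- but never in between
#shared_buffers = 48GB
```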
Other things we learned in this session:
- unfortunately, easy online (or offline) enabling of data checksums did not make it into 11; enabling them at initdb time (or rebuilding the cluster via logical replication) remains the only option
- beware of vm.overcommit_memory: you had better disable memory overcommit by setting it to 2, and set vm.overcommit_ratio to about 80
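On Linux these kernel settings can be made persistent in sysctl configuration (a sketch; the right overcommit_ratio depends on your RAM-to-swap ratio):

```ini
# /etc/sysctl.conf -- disable heuristic memory overcommit
vm.overcommit_memory = 2
vm.overcommit_ratio = 80
```

Apply them without a reboot via `sysctl -p`.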
Advanced Logical Replication
In the talk by Petr Jelinek (2nd Quadrant), one of the original authors of logical replication, we learned about more advanced uses of this exciting new technology:
- Centralized Data Collection: we may have multiple publishers and then a central system with a subscriber to each of those publishers, making data from various sources available in a central system. (typical use: OLAP)
- Shared global data: in other words, a central system maintaining global data and parameters (such as currencies, stocks, market/commodity values, weather, etc.) which publishes to one or more subscribers. These data are then maintained in only one system but are available to all subscribers.
- Logical replication can be asynchronous but also synchronous (guaranteed on commit)
- New possibilities with logical decoding:
- integration with Debezium/Kafka via logical decoding plugins
- wal2json plugin
- Bi-directional replication
- Close-to-zero downtime upgrades:
- set up logical replication on the new server (possibly running initdb with data checksums enabled)
- wait until the lag is relatively small
- from pgbouncer pause the database(s)
- wait until the lag is zero
- change the pgbouncer configuration to point to the new server, then reload pgbouncer's configuration
- from pgbouncer resume the database(s)
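The replication part of the recipe above can be sketched in SQL (the database, publication, subscription and connection-string names here are hypothetical placeholders):

```sql
-- on the old (publisher) server: publish all tables
CREATE PUBLICATION pub_all FOR ALL TABLES;

-- on the new (subscriber) server, after initdb --data-checksums
-- and after restoring the schema:
CREATE SUBSCRIPTION sub_upgrade
    CONNECTION 'host=old-server dbname=mydb user=repluser'
    PUBLICATION pub_all;

-- on the publisher: watch the lag shrink before pausing pgbouncer
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
```

The PAUSE and RESUME steps themselves are issued on pgbouncer's admin console (e.g. `PAUSE mydb;` and `RESUME mydb;`).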
What’s New in PostgreSQL 11
In this exciting presentation, Magnus Hagander (Redpill Linpro AB) introduced us to the wonders of PostgreSQL 11:
- pg_stat_statements supports a 64-bit queryid
- pg_prewarm (a method to warm the system's cache or the shared buffers): new configuration parameters added
- New default roles make it easy to move away from postgres (the user, I mean 🙂 )
- Stored procedures with transactional control
- Enhanced full text search
- Logical replication supports TRUNCATE
- Base backups (pg_basebackup) validate checksums
- Several improvements in query parallelization
- Partitioning even more refined than in 10
- default partition
- updates across partitions (moves row from one partition to another)
- local partition indexes
- unique keys across all partitions (still not referenceable by foreign keys)
- hash partitioning
- partition-wise joins
- partition-wise aggregates
- partitions may be foreign tables on different foreign servers. This opens up great possibilities for finer-grained sharding.
- JIT compilation
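To illustrate two of the partitioning improvements listed above (hash partitioning and the default partition), here is a small sketch on hypothetical tables:

```sql
-- PostgreSQL 11: hash partitioning
CREATE TABLE accounts (id bigint, owner text) PARTITION BY HASH (id);
CREATE TABLE accounts_p0 PARTITION OF accounts
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE accounts_p1 PARTITION OF accounts
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
-- ... and likewise for remainders 2 and 3

-- PostgreSQL 11: a default partition catches rows
-- that match no other partition
CREATE TABLE events (ts date, payload text) PARTITION BY RANGE (ts);
CREATE TABLE events_2018 PARTITION OF events
    FOR VALUES FROM ('2018-01-01') TO ('2019-01-01');
CREATE TABLE events_default PARTITION OF events DEFAULT;
```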
zheap: An Answer to PostgreSQL Bloat Woes
This is still not in 11, but it sounds so promising that I had to include it in the list of cool things. The presentation was given by Amit Kapila (EnterpriseDB), one of the main authors of this new technology, which aims eventually to be integrated into the PostgreSQL core as an alternative kind of heap. It will be integrated with the new Pluggable Storage API in PostgreSQL, which is going to support multiple table access methods (in the same fashion as the various [index] access methods covered in my first blog).
This will try to solve the chronic shortcomings PostgreSQL has with:
- table bloat
- need to (auto)vacuum
- the potential for transaction-id wraparound
None of these is a showstopper for the average medium-to-large business (although this is highly relative): we know of banks and other financial institutions that run PostgreSQL with tens of TBs of data and several thousand transactions per second without issues. Table bloat is handled by autovacuum, and row freezing solves the problem of transaction-id wraparound, but this is still not maintenance-free. The PostgreSQL community is working towards a truly maintenance-free database, and that is why the zheap architecture has been proposed. It will bring:
- a new UNDO log
- the UNDO log will keep old row versions visible to the old transactions that need to see them
- UNDO will also be used to reverse the effects of aborted transactions
- changes happen in place; old versions are no longer kept in the data files
- better bloat control
- fewer writes
- smaller tuple headers
This will bring PostgreSQL on par with MySQL and Oracle in this regard.
Parallel Query in PostgreSQL: How Not to (mis)Use it?
In this presentation, Amit Kapila and Rafia Sabih (EnterpriseDB) walked us through the inner workings of parallelization, tips for avoiding common mistakes, and some recommended GUC settings:
- parallel index scans are supported only on B-tree indexes
- set max_parallel_workers_per_gather to between 1 and 4 (depending on the cores available)
- pay attention to the following settings:
- parallel_tuple_cost: cost of transferring one tuple from a parallel worker process to another process
- parallel_setup_cost: cost of launching parallel workers and initializing dynamic shared memory
- min_parallel_table_scan_size: the minimum size of relations to be considered for a parallel sequential scan
- min_parallel_index_scan_size: the minimum size of index to be considered for a parallel scan
- random_page_cost: estimated cost of accessing a random page on disk
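For reference, here are those knobs with their PostgreSQL 11 default values in postgresql.conf form; treat this as a starting point to tune per workload, not a recommendation:

```ini
# postgresql.conf -- parallel query knobs (PostgreSQL 11 defaults shown)
max_parallel_workers_per_gather = 2    # raise toward 4 on many-core machines
parallel_tuple_cost = 0.1              # cost per tuple passed worker -> leader
parallel_setup_cost = 1000.0           # cost of launching workers + shared memory
min_parallel_table_scan_size = 8MB     # smaller tables won't be scanned in parallel
min_parallel_index_scan_size = 512kB   # same threshold for indexes
random_page_cost = 4.0                 # consider lowering (e.g. 1.1) on SSDs
```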