My Favorite PostgreSQL Extensions - Part Two

Nawaz Ahmed

This is the second part of my blog “My Favorite PostgreSQL Extensions” wherein I had introduced you to two  PostgreSQL extensions, postgres_fdw and pg_partman. In this part I will explore three more.

pgAudit

The next PostgreSQL extension of interest is for the purpose of satisfying auditing requirements by various government, financial and other certifying bodies such as ISO, BSI, and FISCAM, etc. The standard logging facility which PostgreSQL offers natively with log_statement = all is useful for monitoring, but it does not provide the details required to comply or face the audit. The pgAudit extension focuses on the details of what happened under the hood, while a database was satisfying an application request.

An audit trail or audit log is created and updated by a standard logging facility provided by PostgreSQL, which provides detailed session and/or object audit logging. The audit trail created by pgAudit can get enormous in size depending on audit settings, so care must be observed to decide on what and how much auditing is required beforehand. A brief demo in the following section shows how pgAudit is configured and put to use.

The log trail is created within the PostgreSQL database cluster log found in the PGDATA/log location but the audit log messages are prefixed with a “AUDIT: “ label to distinguish between regular database background messages and audit log records. 

Demo

The official documentation of pgAudit explains that there exists a separate version of pgAudit for each major version of PostgreSQL in order to support new functionality introduced in every PostgreSQL release. The version of PostgreSQL in this demo is 11, so the version of pgAudit will be from the 1.3.X branch. The pgaudit.log is the fundamental parameter to be set that controls what classes of statements will be logged. It can be set with a SET for a session level or within the postgresql.conf file to be applied globally. 

postgres=# set pgaudit.log = 'read, write, role, ddl, misc';

SET



cat $PGDATA/pgaudit.log

pgaudit.log = 'read, write, role, ddl, misc'



db_replica=# show pgaudit.log;

         pgaudit.log

------------------------------

 read, write, role, ddl, misc

(1 row)



2020-01-29 22:51:49.289 AEDT 4710 db_replica postgres [local] psql LOG:  AUDIT: SESSION,3,1,MISC,SHOW,,,show pgaudit.log;,<not logged>



db_replica=# create table t1 (f1 integer, f2 varchar);

CREATE TABLE



2020-01-29 22:52:08.327 AEDT 4710 db_replica postgres [local] psql LOG:  AUDIT: SESSION,4,1,DDL,CREATE TABLE,,,"create table t1 (f1 integer, f2 varchar);",<not logged>



db_replica=#  insert into t1 values (1,'one');

INSERT 0 1

db_replica=#  insert into t1 values (2,'two');

INSERT 0 1

db_replica=#  insert into t1 values (3,'three');

INSERT 0 1

2020-01-29 22:52:19.261 AEDT 4710 db_replica postgres [local] psql LOG:  AUDIT: SESSION,5,1,WRITE,INSERT,,,"insert into t1 values (1,'one');",<not logged>

20-01-29 22:52:38.145 AEDT 4710 db_replica postgres [local] psql LOG:  AUDIT: SESSION,6,1,WRITE,INSERT,,,"insert into t1 values (2,'two');",<not logged>

2020-01-29 22:52:44.988 AEDT 4710 db_replica postgres [local] psql LOG:  AUDIT: SESSION,7,1,WRITE,INSERT,,,"insert into t1 values (3,'three');",<not logged>



db_replica=# select * from t1 where f1 >= 2;

 f1 |  f2

----+-------

  2 | two

  3 | three

(2 rows)



2020-01-29 22:53:09.161 AEDT 4710 db_replica postgres [local] psql LOG:  AUDIT: SESSION,9,1,READ,SELECT,,,select * from t1 where f1 >= 2;,<not logged>



db_replica=# grant select on t1 to usr_replica;

GRANT



2020-01-29 22:54:25.283 AEDT 4710 db_replica postgres [local] psql LOG:  AUDIT: SESSION,13,1,ROLE,GRANT,,,grant select on t1 to usr_replica;,<not logged>



db_replica=# alter table t1 add f3 date;

ALTER TABLE



2020-01-29 22:55:17.440 AEDT 4710 db_replica postgres [local] psql LOG:  AUDIT: SESSION,23,1,DDL,ALTER TABLE,,,alter table t1 add f3 date;,<not logged>



db_replica=# checkpoint;

CHECKPOINT



2020-01-29 22:55:50.349 AEDT 4710 db_replica postgres [local] psql LOG:  AUDIT: SESSION,33,1,MISC,CHECKPOINT,,,checkpoint;,<not logged>



db_replica=# vacuum t1;

VACUUM



2020-01-29 22:56:03.007 AEDT 4710 db_replica postgres [local] psql LOG:  AUDIT: SESSION,34,1,MISC,VACUUM,,,vacuum t1;,<not logged>



db_replica=# show log_statement;

 log_statement

---------------

 none



2020-01-29 22:56:14.740 AEDT 4710 db_replica postgres [local] psql LOG:  AUDIT: SESSION,36,1,MISC,SHOW,,,show log_statement;,<not logged>

The log entries, as shown in the demo above, are only written to the server background logfile when the parameter log_statement is set, however in this case it is not configured but the audit messages are written by virtue of pgaudit.log parameter as evidenced in the demo. There are more powerful options available to fulfill all your database auditing requirements within PostgreSQL, which can be configured by following the official documentation of pgaudit here or on the github repository.pg_repack

This is a favourite extension among many PostgreSQL engineers that are involved directly with managing and keeping the general health of a PostgreSQL cluster. The reason for that will be discussed a little later but this extension offers the functionality to remove database bloat within a PostgreSQL database, which is one of the nagging concerns among very large PostgreSQL database clusters requiring database re-org. 

As a PostgreSQL database undergoes constant and heavy WRITES (updates & deletes), the old data is marked as deleted while the new version of the row gets inserted, but the old data is not actually wiped from a data block. This requires a periodic maintenance operation called vacuuming, which is an automated procedure that executes in the background that clears all the “marked as deleted” rows. This process is sometimes referred to as garbage collection in colloquial terms. 

The vacuuming process generally gives way to the database operations during busier times. The least restrictive manner of vacuuming in favour of database operations results in a large number of “marked as deleted” rows causing databases to grow out of proportion referred to as “database bloat”. There is a forceful vacuuming process called VACUUM FULL, but that results in acquiring an exclusive lock on the database object being processed, stalling database operations on that object.

pg_repack

It is for this reason that pg_repack is a hit among PostgreSQL DBAs and engineers, because it does the job of a normal vacuuming process but offers an efficiency of VACUUM FULL by not acquiring an exclusive lock on a database object, in short, it works online. The official documentation here explains more about the other methods of reorganizing a database but a quick demo as below will put things in appropriate light for better understanding. There is a requirement that the target table must have at least one column defined as a PRIMARY KEY, which is a general norm in most of the production database setups.

Demo

The basic demo shows the installation and usage of pg_repack in a test environment. This demo uses the version 1.4.5 of pg_repack which is the latest version of this extension at the time of publishing this blog. A demo table t1 initially has 80000 rows which undergoes a massive operation of delete, which deletes every 5th row of the table. An execution of pg_repack shows the size of the table before and after.

mydb=# CREATE EXTENSION pg_repack;

CREATE EXTENSION



mydb=# create table t1 (no integer primary key, f_name VARCHAR(20), l_name VARCHAR(20), d_o_b date);

CREATE TABLE

mydb=# insert into t1 (select generate_series(1,1000000,1),'a'||

mydb(# generate_series(1,1000000,1),'a'||generate_series(1000000,1,-1),

mydb(# cast( now() - '1 year'::interval * random()  as date ));

INSERT 0 1000000



mydb=# SELECT pg_size_pretty( pg_total_relation_size('t1'));

 pg_size_pretty

----------------

 71 MB

(1 row)



mydb=# CREATE or replace FUNCTION delete5() RETURNS void AS $$

mydb$# declare

mydb$# counter integer := 0;

mydb$# BEGIN

mydb$#

mydb$#  while counter <= 1000000

mydb$# loop

mydb$# delete from t1 where no=counter;

mydb$# counter := counter + 5;

mydb$# END LOOP;

mydb$# END;

mydb$# $$ LANGUAGE plpgsql;

CREATE FUNCTION

The delete5 function deletes 200000 rows from t1 table using a counter which increments 5 counts

mydb=# select delete5();

 delete5

------



(1 row)

mydb=# SELECT pg_size_pretty( pg_total_relation_size('t1'));

 pg_size_pretty

----------------

 71 MB

(1 row)



$ pg_repack -t t1 -N -n -d mydb -p 5433

INFO: Dry run enabled, not executing repack

INFO: repacking table "public.t1"



$ pg_repack -t t1 -n -d mydb -p 5433

INFO: repacking table "public.t1"



mydb=# SELECT pg_size_pretty( pg_total_relation_size('t1'));

 pg_size_pretty

----------------

 57 MB

(1 row)

As shown above, the original size of the table does not change after executing the delete5  function, which shows that the rows still exist in the table. The execution of pg_repack clears those ‘marked as deleted’ rows from the t1 table bringing down the size of t1 table to 57 MBs. One other good thing about pg_repack is an option for dry run with -N flag, using which you can check what will be executed during an actual run.

HypoPG

The next interesting extension is identical to a popular concept called invisible indexes among proprietary database servers. The HypoPG extension enables a DBA to see the effect of introducing a hypothetical index (which does not exist) and whether it will improve the performance of one or more queries, and hence the name HypoPG.

The creation of a hypothetical index does not require any CPU or disk resources, however, it consumes a connection’s private memory. As the hypothetical index is not stored in any database catalog tables, so there is no impact of table bloat. It is for this reason, that a hypothetical index cannot be used in an EXPLAIN ANALYZE statement while a plain EXPLAIN is a good way to assess if a potential index will be used by a given problematic query. Here is a quick demo to explain how HypoPG works.

Demo

I am going to create a table containing 100000 rows using generate_series and execute a couple of simple queries to show the difference in cost estimates with and without hypothetical indexes.

olap=# CREATE EXTENSION hypopg;

CREATE EXTENSION



olap=# CREATE TABLE stock (id integer, line text);

CREATE TABLE



olap=# INSERT INTO stock SELECT i, 'line ' || i FROM generate_series(1, 100000) i;

INSERT 0 100000



olap=# ANALYZE STOCK;

ANALYZE



olap=#  EXPLAIN SELECT line FROM stock WHERE id = 1;

                       QUERY PLAN

---------------------------------------------------------

 Seq Scan on stock  (cost=0.00..1791.00 rows=1 width=10)

   Filter: (id = 1)

(2 rows)

olap=# SELECT * FROM hypopg_create_index('CREATE INDEX ON stock (id)') ;

 indexrelid |       indexname

------------+-----------------------

      25398 | <25398>btree_stock_id

(1 row)



olap=# EXPLAIN SELECT line FROM stock WHERE id = 1;

                                     QUERY PLAN

------------------------------------------------------------------------------------

 Index Scan using <25398>btree_stock_id on stock  (cost=0.04..8.06 rows=1 width=10)

   Index Cond: (id = 1)

(2 rows)



olap=# EXPLAIN ANALYZE SELECT line FROM stock WHERE id = 1;

                                             QUERY PLAN

----------------------------------------------------------------------------------------------------

 Seq Scan on stock  (cost=0.00..1791.00 rows=1 width=10) (actual time=0.028..41.877 rows=1 loops=1)

   Filter: (id = 1)

   Rows Removed by Filter: 99999

 Planning time: 0.057 ms

 Execution time: 41.902 ms

(5 rows)



olap=# SELECT indexname, pg_size_pretty(hypopg_relation_size(indexrelid))

olap-#   FROM hypopg_list_indexes() ;

       indexname       | pg_size_pretty

-----------------------+----------------

 <25398>btree_stock_id | 2544 kB

(1 row)



olap=# SELECT pg_size_pretty(pg_relation_size('stock'));

 pg_size_pretty

----------------

 4328 kB

(1 row)

The above exhibit shows how the estimated total cost can be reduced from 1791 to 8.06 by adding an index to the “id” field of the table to optimize a simple query. It also proves that the index is not really used when the query is executed with an EXPLAIN ANALYZE which executes the query in real time. There is also a way to find out approximately how much disk space the index occupies using the hypopg_list_indexes function of the extension. 

The HypoPG has a few other functions to manage hypothetical indexes and in addition to that, it also offers a way to find out if partitioning a table will improve performance of queries fetching a large dataset. There is a hypothetical partitioning option of HypoPG extension and more of it can be followed by referring to the official documentation.

Conclusion

As stated in part one, PostgreSQL has evolved over the years in only getting bigger, better and faster with rapid development both in the native source code as well as plug and play extensions. An open source version of the new PostgreSQL can be most suitable for plenty of IT shops that are running one of the major proprietary database servers, in order to reduce their IT CAPEX and OPEX. 

There are plenty of PostgreSQL extensions that offer features ranging from monitoring to high-availability and from scaling to dumping binary datafiles into human readable format. It is hoped that the above demonstrations have shed enormous light on the maximum potential and power of a PostgreSQL database.

ClusterControl
The only management system you’ll ever need to take control of your open source database infrastructure.