Managing databases is no small task, and can easily be frustrating without knowing what’s happening under the covers. Whether trying to find out if new indexes are helpful, the transaction count on a database at a certain time, or who’s connected to the database at any given time, data that allows the administrators truly know how their databases are performing is king. Luckily, with PostgreSQL, that data for all of this is available in the PostgreSQL system catalog.
The PostgreSQL System Catalog is a schema with tables and views that contain metadata about all the other objects inside the database and more. With it, we can discover when various operations happen, how tables or indexes are accessed, and even whether or not the database system is reading information from memory or needing to fetch data from disk.
Here we will go over an overview of the system catalog, and highlight how to read it, and how to pull useful information from it. Some of the metadata is straightforward, and other pieces take a bit of digesting to generate real useful information. Either way, PostgreSQL gives us a great platform to build whatever information we need about the database itself.
The PostgreSQL Catalog
PostgreSQL stores the metadata information about the database and cluster in the schema ‘pg_catalog’. This information is partially used by PostgreSQL itself to keep track of things itself, but it also is presented so external people / processes can understand the inside of the databases too.
The PostgreSQL Catalog has a pretty solid rule: Look, don’t touch. While PostgreSQL stores all this information in tables like any other application would, the data in the tables are fully managed by PostgreSQL itself, and should not be modified unless an absolute emergency, and even then a rebuild is likely in order afterwards.
We will go over a few useful catalog tables, how to read the data, and clever things we can do with the data itself. There are quite a few tables in the catalog that we won’t go over, but all information for these various tables can be found at PostgreSQL’s official documentation.
System Wide Metadata
A good portion of the tables we can query in the catalog are ‘system wide’ tables, where it doesn’t matter what database we are connected to, the data represents the whole cluster, no singular database.
General database info is stored in pg_database and statistics are stored in pg_stat_database.
postgres=# SELECT oid,* FROM pg_database WHERE datname = 'severalnines'; -[ RECORD 1 ]-+------------- oid | 16396 datname | severalnines datdba | 10 encoding | 6 datcollate | en_US.UTF-8 datctype | en_US.UTF-8 datistemplate | f datallowconn | t datconnlimit | -1 datlastsysoid | 13804 datfrozenxid | 548 datminmxid | 1 dattablespace | 1663 datacl |
The table pg_database contains a row for every database in the cluster, including the three that come out of the box (postgres, template0, and template1). This row contains information for encoding, connection limit, and other basic metadata.
Columns of interest:
oid – The object identifier, which does not appear in a query output unless referenced directly. This number will match a directory in the clusters data directory
datname – The name of the database.
datdba – The owner of the database, oid references pg_authid.oid.
encoding – The character encoding for this database, pg_encoding_to_char() will convert to a readable name.
datconnlimit – The maximum number of concurrent connections allowed on the database.
dattablespace – The default tablespace for this database, references pg_tablespace.oid.
postgres=# SELECT * FROM pg_stat_database WHERE datname = 'severalnines'; -[ RECORD 1 ]--+------------------------------ datid | 13805 datname | postgres numbackends | 2 xact_commit | 477460 xact_rollback | 13 blks_read | 417 blks_hit | 16488702 tup_returned | 252376522 tup_fetched | 2637123 tup_inserted | 114 tup_updated | 3 tup_deleted | 1 conflicts | 0 temp_files | 0 temp_bytes | 0 deadlocks | 0 blk_read_time | 0 blk_write_time | 0 stats_reset | 2018-02-04 19:52:39.129052+00
This stat table is where we get interesting and useful data. Each row in this table contains live data for each database, and can be exported periodically to be tracked over time if desiring to monitor changes.
Transaction information can be found in the columns xact_commit and xact_rollback, which contain the number of transactions the database has committed and rolled back respectively. This will help show just how active a database is, as well as spot possible failures with programs who may be erroring / rolling back at an alarming rate.
Disk and Memory access
Information on whether or not data is retrieved from disk or memory is stored in the blks_read and blks_hit columns. Blks_read shows the number of blocks this database read from disk, while blks_hit shows the number of blocks that were found in PostgreSQL’s buffer cache (represented by the shared_buffers parameter). Since RAM is much faster than disk, we would ideally see blks_hit consistently higher than blks_read, and if not, we can re-evaluate our available memory.
The next few columns deal with tuples. Tup_returned is the number of rows returned in the database, which is the number of rows read by sequential scans if from a table, or the number of index entries returned when from an index”. Tup_fetched is the number of rows fetched in the database, meaning they were a result of bitmap scans, which is the number of table rows fetched by bitmap scans if from a table, or table rows fetched by simple index scans if using an index.
We also have tup_inserted, tup_updated, and tup_deleted, which represents the number of tuples inserted, updated, and deleted in this database respectively. This will help understand how data enters, changes, and leaves the database. Since updated and deleted tuples result in dead rows, high values in these columns would suggest autovacuum operations be tuned to meet the needs of the database activity.
If the database in question is a standby server, the column conflicts comes in handy as a way to track how many queries were canceled due to conflicts with the standby being in ‘recovery mode’. If not a standby cluster, this column can be ignored.
Temporary files / data
At times, queries will need to write to temporary files. This can happen when the amount of work_mem allocated to the connection has been used up, and needs to continue a sort operation on disk rather than in memory. The column temp_files tracks the number of these files that were created, and temp_bytes tracks the total size of all temporary files used. This data can help inform work_mem tuning, or even finding queries that could use re-writing if temp file size is too big.
The deadlocks column tracks how many times a deadlock occurs. Since a deadlock can cause errors for queries that otherwise wouldn’t error, it’s good to track this and make sure applications aren’t stepping on each others feet.
Read and write times
The columns blk_read_time and blk_write_time tracks the total number of milliseconds that backends in the database spend reading and writing data, which can be helpful if trying to compare / improve disk read/write speed
This column, stats_reset, simply shows a timestamp (with time zone) of the last time the stats mentioned in this row has been reset. A null value means they haven’t been reset since inception, or even possibly a crash of the database what may have wiped out these stats.
Checkpoints and The Background Writer
postgres=# SELECT * FROM pg_stat_bgwriter; -[ RECORD 1 ]---------+------------------------------ checkpoints_timed | 47829 checkpoints_req | 2 checkpoint_write_time | 7323 checkpoint_sync_time | 38 buffers_checkpoint | 76 buffers_clean | 0 maxwritten_clean | 0 buffers_backend | 5 buffers_backend_fsync | 0 buffers_alloc | 440 stats_reset | 2018-02-04 19:52:34.712832+00
The PostgtreSQL cluster manages writing data to disk in several different ways. In terms of ‘dirty buffers’ (data in memory that has been changed since read from disk, but has yet to have that change written to disk), this is done either by a checkpoint, or the background writer. Since a dirty buffer must be written to disk before it can be freed or reallocated, making sure these processes are finely tuned is crucial, and this table helps shed light on how it all is working.
A checkpoint happens either on schedule (represented by the checkpoint_timeout parameter), or when the maximum amount of WAL files have been used since the last checkpoint, and needs to force a checkpoint. Either way, a checkpoint writes dirty buffers to disk, and there are four columns tracking it.
The columns checkpoints_timed and checkpoints_req show the number of scheduled checkpoints occur (timed) and the number of requested checkpoints (also referred to as forced). A high climbing value of checkpoint_req could suggest an insufficient max_wal_size value.
Columns checkpoint_write_time and checkpoint_sync_time record the total amount of time (in milliseconds) the checkpoint process has spent writing and syncing to disk.
Finally, buffers_checkpoint is the total number of buffers written to disk by checkpoints.
The background writer is a separate process that writes dirty buffers to disk, which ideally reduces the amount of work the checkpointer needs to do.
The column buffers_clean represents the number of buffers written to disk by the background process. When compared to buffers_checkpoint, it shows how much of the workload is handled by each process (with the added knowledge that background writer has the possibility of writing buffers multiple times if they change often, vs less frequently with a timed checkpoint).
Maxwritten_clean represents the number of times the background writer reached the maximum number of pages to flush each time it runs (controlled with the bgwriter_lru_maxpages parameter).
Buffers in General
The remaining columns show us the general buffer information, with buffers_backend being the number of buffers a backend had to write itself, instead of background writer or checkpointer, buffers_backend_fsync is a count of how many times a backend had to execute its own fsync call, and buffers_alloc shows the number of buffers that have been allocated in general.
Database Activity and Locks
There are two views that show current user activity, pg_stat_activity and pg_locks. When queried, these show information about current connections to the databases, and what kind of locks they have on what relations.
postgres=# SELECT * FROM pg_stat_activity; -[ RECORD 1 ]----+-------------------------------- datid | 13805 datname | severalnines pid | 29730 usesysid | 10 usename | postgres application_name | psql client_addr | client_hostname | client_port | -1 backend_start | 2018-07-21 02:29:48.976588+00 xact_start | 2018-07-21 02:30:03.73683+00 query_start | 2018-07-21 02:30:03.73683+00 state_change | 2018-07-21 02:30:03.736835+00 wait_event_type | wait_event | state | active backend_xid | backend_xmin | 559 query | SELECT first_name FROM customers WHERE customers_sid = 472; backend_type | client backend
The pg_stat_activity view shows a row for every connection to the database, and some basic information about it. The column datname represents the database the connection is actually connected to, pid is the Process ID of the connection on the database host itself, and usesysid and usename represent the database user connected.
For the source of the client, client_addr is the IP address of the host the connection came from, null means it’s a local unix socket connection.
There are four timestamp columns that show when certain things started: backend_start is when the connection was actually established, xact_start is when the current transaction started (null if the client has no open transaction), query_start is when the current or most recent query started, and state_change is the time when the state of the connection last changed.
The final bits of pg_stat_activity cover the actual status of the connection. If the query is waiting on another to release locks, wait_event_type contains some information on what kind of wait event it is, and the column wait_event will show the wait event name. It’s a long list, but more information found at the PostgreSQL documentation.
Finally, the column ‘state’ shows what state the current connection is in, such as active, idle, idle in transaction, and the query column will show the actual query being run, or most recently run.
SELECT * FROM pg_locks; -[ RECORD 1 ]------+---------------- locktype | virtualxid database | relation | page | tuple | virtualxid | 3/475862 transactionid | classid | objid | objsubid | virtualtransaction | 3/475862 pid | 29730 mode | ExclusiveLock granted | t fastpath | t
The pg_locks table works hand in hand with pg_stat_activity if looking into query activity. Anytime a lock is made to a relation, that information is stored in pg_locks. Using the pid from pg_stat_activity, we can query pg_locks to see what relations a connection may have locks on, what kinds of locks those are, and whether or not the locks have been granted.
The most important columns are ‘pid’, which matches the pid from pg_stat_activity, ‘relation’ which matches the OID from pg_class, ‘mode’ showing the name of the lock mode held, and ‘granted’ which states whether or not the lock in question has been granted.
Since PostgreSQL has built in replication features, there are a few views that shed light on the performance and status of replication itself.
View pg_stat_replication: contains a row for every WAL sender process, containing information about its state, the location of the WAL files it’s working on, and the connection information of the standby host that is receiving the WAL data for replication.
View pg_stat_wal_receiver: If the cluster is a standby, this will contain a single row showing statistics about the receiver process form the host.
View pg_stat_subscription: If sending WAL data to a standby node, each row here will represent that subscription, and contain information about the status of the subscriptions.
View pg_replication_slots: Contains a list of all the replication slots that exist on the cluster, and their current state.
Database Specific Metadata
Inside each database there are a collection of catalog tables that have information specific to the database that’s being queried. If we are looking for specific data from these tables, we must make sure we are connected to the right database as we issue the queries.
This is where the heart of data analysis can come in, where we can see just how our user data is being accessed. From tables, to indexes, to sequences, the queries that come into the database and fetch or modify data, their actions and impact will be stored in these tables, and we can look at that information to make informed decisions about managing the database down the road.
Metadata about our user tables is stored in the following two tables, and they each have a row for every user table created in the system. The table pg_stat_user_tables contains statistics on user access to the table, while pg_statio_user_tables contains I/O statistics for each table.
NOTE: The data in here is not always 100% perfect, and relies on frequent analyzes of the tables to be correct. Autoanalyze covers this, but good tuning of the autoanalyze process so that it can keep good statistics about each table. If the statistics seem to be off, running an ANALYZE manually on the table will refresh them.
severalnines=> SELECT * FROM pg_stat_user_tables WHERE schemaname = 'public' AND relname = 'history'; -[ RECORD 1 ]-------+--------- relid | 2766788 schemaname | public relname | history seq_scan | 13817 seq_tup_read | 466841 idx_scan | 12251 idx_tup_fetch | 127652 n_tup_ins | 11 n_tup_upd | 13 n_tup_del | 3 n_tup_hot_upd | 13 n_live_tup | 3 n_dead_tup | 21 n_mod_since_analyze | 19 last_vacuum | last_autovacuum | last_analyze | last_autoanalyze | vacuum_count | 0 autovacuum_count | 0 analyze_count | 0 autoanalyze_count | 0
For our user table stats, we have quite a few pieces of data.
Table Access Methods
When clients access data from the table, it does so either directly or through indexes. The column ‘seq_scan’ counts the number of sequential scans the table received, and ‘seq_tup_read’ counts the number of tuples read through that process. The column ‘idx_scan’ counts how many times an index on the table was used to fetch data.
Table Tuple Activity
We now have a handful of columns that count different activities on the table.
‘n_tup_ins’ tracks the number of tuples inserted
‘n_tup_upd’ tracks the number of tuples updated
‘n_tup_del’ tracks the number of tuples deleted
Table Tuple State
Due to updates and deletes, there could be dead tuples that are no longer active data, and the vacuum process will eventually free them up. The columns ‘n_tup_ins’ and ‘n_tup_ins’ tracks the number of tuples that are alive and dead, respectively. When dead tuples reach a certain point, an autovacuum will be launched, depending on the autovacuum settings.
Table Vacuum Activity
Table maintenance is done through either VACUUM or AUTOVACUUM, and statistics are gathered through ANALYZE or AUTOANALYZE. The next four columns contain the dates for when each of these operations were last run: ‘last_vacuum’, ‘last_autovacuum’, ‘last_analyze’, ‘last_autoanalyze’.
We also have four more convenient columns that simply count how many times the previous actions occur. Using these, we can see which tables get the most activity: ‘vacuum_count’, ‘autovacuum_count’, ‘analyze_count’, and ‘autoanalyze_count’.
severalnines=> SELECT * FROM pg_statio_user_tables WHERE schemaname = 'public' AND relname = history; -[ RECORD 1 ]---+--------- relid | 2766788 schemaname | public relname | history heap_blks_read | 4 heap_blks_hit | 63081 idx_blks_read | 5 idx_blks_hit | 44147 toast_blks_read | 0 toast_blks_hit | 0 tidx_blks_read | 0 tidx_blks_hit | 0
The I/O output is useful for helping understand how the data is being accessed under the covers. The column ‘heap_blks_read’ represents the number of disk blocks read for this table, and ‘heap_blks_hit’ represents the buffer blocks read from memory on this table. This is helpful to know if queries accessing the table constantly have to go to disk, or fetch the data from memory.
Index stats on the table show the same information with the columns ‘idx_blks_read’ and ‘idx_blks_hit’.
Lastly, if the table has any TOAST tables, the columns ‘toast_blks_hit’ and ‘toast_blks_read’ track toast tables, while ‘tdix_blks_read’ and ‘tdix_blks_read’ tracks the indexes on those toast tables.
severalnines=> SELECT * FROM pg_stat_user_indexes WHERE indexrelname = 'history_pkey'; -[ RECORD 1 ]-+------------- relid | 2766797 indexrelid | 2766934 schemaname | public relname | history indexrelname | history_pkey idx_scan | 43910 idx_tup_read | 98147 idx_tup_fetch | 98147
Much like the table counterparts, this table contains information about the indexes specifically. One row per index, this table shows how many times the index was scanned with the column ‘idx_scan’, how many tuples were read with ‘idx_tup_read’, and how many live rows were actually fetched with ‘idx_tup_fetch’.
severalnines=> SELECT * FROM pg_statio_user_indexes WHERE indexrelname = 'history_pkey'; -[ RECORD 1 ]-+------------- relid | 2766797 indexrelid | 2766934 schemaname | public relname | history indexrelname | history_pkey idx_blks_read | 2 idx_blks_hit | 49380
For pg_statio_user_indexes, the two columns available for data are ‘idx_blks_read’, and ‘idx_blks_hit’, representing the number of blocks read from disk and from memory.
What can we do with this data?
Get creative! If queries to a specific table seem to be extremely slow, track its activity over time, look at how many sequential scans it gets vs index scans, look at whether it’s going to disk or memory for the data.
If a large table keeps getting autovacuumed frequently, track the live to dead tuples over time, maybe it specifically needs autovacuum to be tweaked so it can complete quicker, or even perhaps the table is a candidate for partitioning.
Since we can see when data comes from disk or memory, we can create a ratio of memory to disk over time, pinpointing if at any time the ratio dips through the day.
The amount of tables we covered went over the big hitters, the main data that’s useful to know about the inner workings of the databases. However there are many more tables in the system catalog that contain situationally useful data. Reading other tables like before will help provide insights on the health of the database in general.