Cloud Vendor Deep-Dive: PostgreSQL on AWS Aurora

Viorel Tabara

How deep should we go with this? I’ll start by saying that as of this writing, I could locate only 3 books on Amazon about PostgreSQL in the cloud, and 117 discussions on PostgreSQL mailing lists about Aurora PostgreSQL. That doesn’t look like a lot, and it leaves me, the curious PostgreSQL end user, with the official documentation as the only place where I could really learn some more. As I don’t have the ability, nor the knowledge to adventure myself much deeper, there is AWS re:Invent 2018 for those who are looking for that kind of thrill. I can settle for Werner’s article on quorums.

To get warmed up, I started from the Aurora PostgreSQL homepage where I noted that the benchmark showing that Aurora PostgreSQL is three times faster than a standard PostgreSQL running on the same hardware dates back to PostgreSQL 9.6. As I’ve learned later, 9.6.9 is currently the default option when setting up a new cluster. That is very good news for those who don’t want to, or cannot upgrade right away. And why only 99.99% availability? One explanation can be found in Bruce Momjian’s article.

Compatibility

According to AWS, Aurora PostgreSQL is a drop-in replacement for PostgreSQL, and the documentation states:

The code, tools, and applications you use today with your existing MySQL and PostgreSQL databases can be used with Aurora.

That is reinforced by Aurora FAQs:

It means that most of the code, applications, drivers and tools you already use today with your PostgreSQL databases can be used with Aurora with little or no change. The Amazon Aurora database engine is designed to be wire-compatible with PostgreSQL 9.6 and 10, and supports the same set of PostgreSQL extensions that are supported with RDS for PostgreSQL 9.6 and 10, making it easy to move applications between the two engines.

“most” in the above text suggests that there isn’t a 100% guarantee in which case those seeking certainty should consider purchasing technical support from either AWS Professional Services, or Aamazon Aurora partners. As a side note, I did notice that none of the PostgreSQL professional Hosting Providers employing core community contributors are on that list.

From Aurora FAQs page we also learn that Aurora PostgreSQL supports the same extensions as RDS, which in turn lists most of the community extensions and a few extras.

Concepts

As part of Amazon RDS, Aurora PostgreSQL comes with its own terminology:

  • Cluster: A Primary DB instance in read/write mode and zero or more Aurora Replicas. The primary DB is often labeled a Master in `AWS diagrams`_, or Writer in the AWS Console. Based on the reference diagram we can make an interesting observation: Aurora writes three times. As the latency between the AZs is typically higher than within the same AZ, the transaction is considered committed as soon it's written on the data copy within the same AZ, otherwise the latency and potential outages between AZs.
  • Cluster Volume: Virtual database storage volume spanning multiple AZs.
  • Aurora URL: A `host:port` pair.
  • Cluster Endpoint: Aurora URL for the Primary DB. There is one Cluster Endpoint.
  • Reader Endpoint: Aurora URL for the replica set. To make an analogy with DNS it's an alias (CNAME). Read requests are load balanced between available replicas.
  • Custom Endpoint: Aurora URL to a group consisting of one or more DB instances.
  • Instance Endpoint: Aurora URL to a specific DB instance.
  • Aurora Version: Product version returned by `SELECT AURORA_VERSION();`.

PostgreSQL Performance and Monitoring on AWS Aurora

Sizing

Aurora PostgreSQL applies a best guess configuration which is based on the DB instance size and storage capacity, leaving further tuning to the DBA through the use of DB Parameters groups.

When selecting the DB instance, base your selection on the desired value for max_connections.

Scaling

Aurora PostgreSQL features auto and manual scaling. Horizontal scaling of read replicas is automated through the use of performance metrics. Vertical scaling can be automated via APIs.

Horizontal scaling takes the offline for a few minutes while replacing compute engine and performing any maintenance operations (upgrades, patching). Therefore AWS recommend performing such operations during maintenance windows.

Scaling in both directions is a breeze:

Vertical scaling: modifying instance class
Vertical scaling: modifying instance class
Horizontal scaling: adding reader replica.
Horizontal scaling: adding reader replica.

At the storage level, space is added in 10G increments. Allocated storage is never reclaimed, see below for how to address this limitation.

Storage

As mentioned above, Aurora PostgreSQL was engineered to take advantage of quorums in order to improve performance consistency.

Since the underlying storage is shared by all DB instances within the same cluster, no additional writes on standby nodes are required. Also, adding or removing DB instances doesn’t change the underlying data.

Wondering what those IOs units mean on the monthly bill? Aurora FAQs comes to the rescue again to explain what an IO is, in the context of monitoring and billing. A Read IO as the equivalent of an 8KiB database page read, and a Write IO as the equivalent of 4KiB written to the storage layer.

High Concurrency

In order to take full advantage of Aurora’s high-concurrency design, it is recommended that applications are configured to drive a large number of concurrent queries and transactions.

Applications designed to direct read and write queries to respectively standby and primary database nodes will benefit from Aurora PostgreSQL reader replica endpoint.

Connections are load balanced between read replicas.

Using custom endpoints database instances with more capacity can be grouped together in order to run an intensive workload such as analytics.

DB Instance Endpoints can be used for fine-grained load balancing or fast failover.

Note that in order for the Reader Endpoints to load balance individual queries, each query must be sent as a new connection.

Caching

Aurora PostgreSQL uses a Survivable Cache Warming technique which ensures that the date in the buffer cache is preserved, eliminating the need for repopulating or warming-up the cache following a database restart.

Replication

Replication lag time between replicas is kept within single digit millisecond. Although not available for PostgreSQL, it’s good to know that cross-region replication lag is kept within 10s of milliseconds.

According to documentation replica lag increases during periods of heavy write requests.

Query Execution Plans

Based on the assumption that query performance degrades over time due to various database changes, the role of this Aurora PostgreSQL component is to maintain a list of approved or rejected query execution plans.

Plans are approved or rejected using either proactive or reactive methods.

When an execution plan is marked as rejected, the Query Execution Plan overrides the PostgreSQL optimizer decisions and prevents the “bad” plan from being executed.

This feature requires Aurora 2.1.0 or later.

PostgreSQL High Availability and Replication on AWS Aurora

At the storage layer, Aurora PostgreSQL ensures durability by replicating each 10GB of storage volume, six times across 3 AZs (each region consists of typically 3 AZs) using physical synchronous replication. That makes it possible for database writes to continue working even when 2 copies of data are lost. Read availability survives the loss of 3 copies of data.

Read replicas ensure that a failed primary instance can be quickly replaced by promoting one of the 15 available replicas. When selecting a multi-AZ deployment one read replica is automatically created. Failover requires no user intervention, and database operations resume in less than 30 seconds.

For single-AZ deployments, the recovery procedure includes a restore from the last known good backup. According to Aurora FAQs the process completes in under 15 minutes if the database needs to be restored in a different AZ. The documentation isn’t that specific, claiming that it takes less than 10 minutes to complete the restore process.

No change is required on the application side in order to connect to the new DB instance as the cluster endpoint doesn’t change during a replica promotion or instance restore.

Step 1: delete the primary instance to force a failover:

Automatic failover Step 1: delete primary
Automatic failover Step 1: delete primary

Step 2: automatic failover completed

Automatic failover Step 2: failover completed.
Automatic failover Step 2: failover completed.

For busy databases, the recovery time following a restart or crash is dramatically reduced since Aurora PostgreSQL doesn’t need to replay the transaction logs.

As part of full-managed service, bad data blocks and disks are automatically replaced.

Failover when replicas exist takes up to 120 seconds with often time under 60 seconds. Faster recovery times can be achieved by when failover conditions are pre-determined, in which case replicas can be assigned failover priorities.

Aurora PostgreSQL plays nice with Amazon RDS – an Aurora instance can act as a read replica for a primary RDS instance.

Aurora PostgreSQL supports Logical Replication which, just like in the community version, can be used to overcome built-in replication limitations. There is no automation or AWS console interface.

Security for PostgreSQL on AWS Aurora

At network level, Aurora PostgreSQL leverages AWS core components, VPC for cloud network isolation and Security Groups for network access control.

There is no superuser access. When creating a cluster, Aurora PostgreSQL creates a master account with a subset of superuser permissions:

[email protected]:5432 postgres> \du+ postgres
                               List of roles
 Role name |          Attributes               |    Member of       | Description
-----------+-------------------------------+-----------------+-------------
 postgres  | Create role, Create DB           +| {rds_superuser} |
            | Password valid until infinity  |                 |

To secure data in transit, Aurora PostgreSQL provides native SSL/TLS support which can be configured per DB instance.

All data at rest can be encrypted with minimal performance impact. This also applies to backups, snapshots, and replicas.

Encryption at rest.
Encryption at rest.

Authentication is controlled by IAM policies, and tagging allows further control over what users are allowed to do and on what resources.

API calls used by all cloud services are logged in CloudTrail.

Client side Restricted Password Management is available via the rds.restrict_password_commands parameter.

PostgreSQL Backup and Recovery on AWS Aurora

Backups are enabled by default and cannot be disabled. They provide point-in-time-recovery using a full daily snapshot as a base backup.

Restoring from an automated backup has a couple of disadvantages: the time to restore may be several hours and data loss may be up to 5 minutes preceding the outage. Amazon RDS Multi-AZ Deployments solve this problem by promoting a read replica to primary, with no data loss.

Database Snapshots are fast and don’t impact the cluster performance. They can be copied or shared with other users.

Taking a snapshot is almost instantaneous:

Snapshot time.
Snapshot time.

Restoring a snapshot is also fast. Compare with PITR:

Backups and snapshots are stored in S3 which offers eleven 9’s of durability.

Aside from backups and snapshots, Aurora PostgreSQL allows databases to be cloned. This is an efficient method for creating copies of large data sets. For example, cloning multi-terabytes of data take only minutes and there is no performance impact.

Aurora PostgreSQL - Point-in-Time Recovery Demo

Connecting to cluster:

~ $ export PGUSER=postgres PGPASSWORD=postgres PGHOST=s9s-us-east-1.cluster-ctfirtyhadgr.us-east-1.rds.amazonaws.com
~ $ psql
Pager usage is off.
psql (11.3, server 10.7)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

Populate a table with data:

[email protected]:5432 postgres> create table s9s (id serial not null, msg text, created timestamptz not null default now());
CREATE TABLE

[email protected]:5432 postgres> select * from s9s;
id | msg  |            created
----+------+-------------------------------
1 | test | 2019-06-25 07:57:40.022125+00
2 | test | 2019-06-25 07:57:57.666222+00
3 | test | 2019-06-25 07:58:05.593214+00
4 | test | 2019-06-25 07:58:08.212324+00
5 | test | 2019-06-25 07:58:10.156834+00
6 | test | 2019-06-25 07:59:58.573371+00
7 | test | 2019-06-25 07:59:59.5233+00
8 | test | 2019-06-25 08:00:00.318474+00
9 | test | 2019-06-25 08:00:11.153298+00
10 | test | 2019-06-25 08:00:12.287245+00
(10 rows)

Initiate the restore:

Point-in-Time Recovery: initiate restore.
Point-in-Time Recovery: initiate restore.

Once the restore is complete log in and check:

~ $ psql -h pg107-dbt3medium-restored-cluster.cluster-ctfirtyhadgr.us-east-1.rds.amazonaws.com
Pager usage is off.
psql (11.3, server 10.7)
SSL connection (protocol: TLSv1.2, cipher: ECDHE-RSA-AES256-GCM-SHA384, bits: 256, compression: off)
Type "help" for help.

[email protected]:5432 postgres> select * from s9s;
id | msg  |            created
----+------+-------------------------------
1 | test | 2019-06-25 07:57:40.022125+00
2 | test | 2019-06-25 07:57:57.666222+00
3 | test | 2019-06-25 07:58:05.593214+00
4 | test | 2019-06-25 07:58:08.212324+00
5 | test | 2019-06-25 07:58:10.156834+00
6 | test | 2019-06-25 07:59:58.573371+00
(6 rows)

Best Practices

Monitoring and Auditing

Replication

Encryption

Master Account

Sizing

Parameter Groups

Parameter Groups Demo

Current settings:

[email protected]:5432 postgres> show shared_buffers ;
shared_buffers
----------------
10112136kB
(1 row)

Create a new parameter group and set the new cluster wide value:

Updating shared_buffers cluster wide.
Updating shared_buffers cluster wide.

Associate the custom parameter group with the cluster:

Reboot the writer and check the value:

[email protected]:5432 postgres> show shared_buffers ;
shared_buffers
----------------
1GB
(1 row)

By default, the timezone is in UTC:

[email protected]:5432 postgres> show timezone;
TimeZone
----------
UTC
(1 row)

Setting the new timezone:

Configuring timezone
Configuring timezone

And then check:

[email protected]:5432 postgres> show timezone;
TimeZone
------------
US/Pacific
(1 row)

Note that the list of timezone values accepted by Amazon Aurora is not the timezonesets found in upstream PostgreSQL.

  • Review instance parameters that are overridden by cluster parameters
  • Use the parameter group comparison tool.

Snapshots

  • Avoid additional storage charges by sharing the snapshots with another accounts to allow restoring into their respective environments.

Maintenance

Failover

DBA Beware!

In addition to the known limitations avoid, or be aware of the following:

Encryption

Aurora Serverless

Parallel Query

Endpoints

From Amazon Connection Management:

  • 5 Custom Endpoints per cluster
  • Custom Endpoint names cannot exceed 63 characters
  • Cluster Endpoint names are unique within the same region
  • As seen in the above screenshot (aurora-custom-endpoint-details) READER and ANY custom endpoint types aren’t available, use the CLI
  • Custom Endpoints are unaware of replicas becoming temporarily unavailable

Replication

Storage

  • Maximum allocated storage does not shrink when data is deleted, neither is space reclaimed by restoring from snapshots. The only way to reclaim space is by performing a logical dump into a new cluster.

Backup and Recovery

Snapshots

Billing

  • The 10 minutes bill applies to new instances, as well as following a capacity change (compute, or storage).

Authentication

Starting and Stopping

From Overview of Stopping and Staring an Aurora DB Cluster:

  • Clusters cannot be left stopped indefinitely as they are started automatically after 7 days.
  • Individual DB instances cannot be stopped.

Upgrades

Cloning

  • 15 clones per database (original or copy).
  • Clones are not removed when deleting the source database.

Scaling

  • Auto-Scaling requires that all replicas are available.
  • There can be only `one auto-scaling policy`_ per metric per cluster.
  • Horizontal scaling of the primary DB instance (instance class) is not fully automatic. Before scaling the cluster triggers an automatic failover to one of the replicas. After scaling completes the new instance must be manually promoted from reader to writer:
    New instance left in reader mode after DB instance class change.
    New instance left in reader mode after DB instance class change.

Monitoring

Migration

Sizing

  • The smallest available instance class is db.t3.medium and the largest db.r5.24xlarge. For comparison, the MySQL engine offers db.t2.small and db.t2.medium, however no db.r5.24xlarge in the upper range.
  • max_connections upper limit is 262,143.

Query Plan Management

Migration

Aurora PostgreSQL does not provide direct migration services, rather the task is offloaded to a specialized AWS product, namely AWS DMS.

Conclusion

As a fully-managed drop-in replacement for the upstream PostgreSQL, Amazon Aurora PostgreSQL takes advantage of the technologies that power the AWS cloud to remove the complexity required to setup services such as auto-scaling, query load-balancing, low-level data replication, incremental backups, and encryption.

The architecture and a conservative approach for upgrading the PostgreSQL engine provides the performance and the stability organizations from small to large are looking for.

The inherent limitations are just a proof that building a large scale Database as a Service is a complex task, leaving the highly specialized PostgreSQL Hosting Providers with a niche market they can tap into.

ClusterControl
The only management system you’ll ever need to take control of your open source database infrastructure.