blog

PostgreSQL Anonymization On Demand

Achilleas Mantzios

Published: June 24, 2020
Last Updated: May 4, 2022

Before, during, and after the GDPR came into town in 2018, there have been many ideas to solve the problem of deleting or hiding user data, using various layers of the software stack but also using various approaches (hard deletion, soft deletion, anonymization). Anonymization has been one of them which is known to be popular among the PostgreSQL-based organizations/companies.

In the spirit of GDPR, we see more and more the requirement for business documents and reports exchanged between companies, so that the individuals shown in those reports are presented anonymized, i.e. only their role/title is shown, while their personal data are hidden. This happens most probably due to the fact that those companies receiving these reports do not want to manage these data under the procedures/processes of GDPR, they don’t want to deal with the burden of designing new procedures/processes/systems to handle them, and they just ask to receive the data already pre-anonymized. So this anonymization does not only apply to those individuals who have expressed their wish to be forgotten, but actually all people mentioned inside the report, which is quite different from the common GDPR practices.

In this article, we are going to deal with anonymization towards a solution to this problem. We will start with presenting a permanent solution, that is a solution in which a person requesting to be forgotten should be hidden in all future inquiries in the system. Then building on top of this we will present a way to achieve “on demand” i.e. short-lived anonymization, which means implementation of an anonymization mechanism intended to be active just long enough until the needed reports are generated in the system. In the solution I am presenting this will have a global effect, so this solution uses a greedy approach, covering all applications, with minimal (if any) code rewrite (and comes from the tendency of PostgreSQL DBA’s to solve such problems centrally leaving the app developers deal with their true workload). However, the methods presented here can be easily tweaked to be applied in limited/narrower scopes.

Permanent Anonymization

Here we will present a way to achieve anonymization. Let’s consider the following table containing records of a company’s employees:

testdb=# create table person(id serial primary key, surname text not null, givenname text not null, midname text, address text not null, email text not null, role text not null, rank text not null);
CREATE TABLE
testdb=# insert into person(surname,givenname,address,email,role,rank) values('Singh','Kumar','2 some street, Mumbai, India','[email protected]','Seafarer','Captain');
INSERT 0 1
testdb=# insert into person(surname,givenname,address,email,role,rank) values('Mantzios','Achilleas','Agiou Titou 10, Iraklio, Crete, Greece','[email protected]','IT','DBA');
INSERT 0 1
testdb=# insert into person(surname,givenname,address,email,role,rank) values('Emanuel','Tsatsadakis','Knossou 300, Iraklio, Crete, Greece','[email protected]','IT','Developer');
INSERT 0 1
testdb=#

This table is public, everybody can query it, and belongs to the public schema. Now we create the basic mechanism for anonymization which consists of:

a new schema to hold related tables and views, let’s call this anonym
a table containing id’s of people who want to be forgotten: anonym.person_anonym
a view providing the anonymized version of public.person: anonym.person
setup of the search_path, to use the new view

testdb=# create schema anonym;
CREATE SCHEMA
testdb=# create table anonym.person_anonym(id INT NOT NULL REFERENCES public.person(id));
CREATE TABLE

CREATE OR REPLACE VIEW anonym.person AS
SELECT p.id,
    CASE
        WHEN pa.id IS NULL THEN p.givenname
        ELSE '****'::character varying
    END AS givenname,
    CASE
        WHEN pa.id IS NULL THEN p.midname
        ELSE '****'::character varying
    END AS midname,
    CASE
        WHEN pa.id IS NULL THEN p.surname
        ELSE '****'::character varying
    END AS surname,
    CASE
        WHEN pa.id IS NULL THEN p.address
        ELSE '****'::text
    END AS address,
    CASE
        WHEN pa.id IS NULL THEN p.email
        ELSE '****'::character varying
    END AS email,
    role,
    rank
  FROM person p
LEFT JOIN anonym.person_anonym pa ON p.id = pa.id
;

Let’s set the search_path to our application:

set search_path = anonym,"$user", public;

Warning: it is essential that the search_path is setup correctly in the data source definition in the application. The reader is encouraged to explore more advanced ways to handle the search path, e.g. with a use of a function which may handle more complex and dynamic logic. For instance you could specify a set of data entry users (or role), and let them keep using the public.person table throughout the anonymization interval (so that they will keep seeing normal data), while defining a managerial/reporting set of users (or role) for whom the anonymization logic will apply.

Now let’s query our person relation:

testdb=# select * from person;
-[ RECORD 1 ]-------------------------------------
id    | 2
givenname | Achilleas
midname   |
surname   | Mantzios
address   | Agiou Titou 10, Iraklio, Crete, Greece
email | [email protected]
role  | IT
rank  | DBA
-[ RECORD 2 ]-------------------------------------
id    | 1
givenname | Kumar
midname   |
surname   | Singh
address   | 2 some street, Mumbai, India
email | [email protected]
role  | Seafarer
rank  | Captain
-[ RECORD 3 ]-------------------------------------
id    | 3
givenname | Tsatsadakis
midname   |
surname   | Emanuel
address   | Knossou 300, Iraklio, Crete, Greece
email | [email protected]
role  | IT
rank  | Developer

testdb=#

Now, let’s suppose that Mr Singh leaves the company and explicitly expresses his right to be forgotten by a written statement. The application does this by inserting his id into the set of “to be forgotten” id’s:

testdb=# insert into anonym.person_anonym (id) VALUES(1);
INSERT 0 1

Let us now repeat the exact query we run before:

testdb=# select * from person;
-[ RECORD 1 ]-------------------------------------
id    | 1
givenname | ****
midname   | ****
surname   | ****
address   | ****
email | ****
role  | Seafarer
rank  | Captain
-[ RECORD 2 ]-------------------------------------
id    | 2
givenname | Achilleas
midname   |
surname   | Mantzios
address   | Agiou Titou 10, Iraklio, Crete, Greece
email | [email protected]
role  | IT
rank  | DBA
-[ RECORD 3 ]-------------------------------------
id    | 3
givenname | Tsatsadakis
midname   |
surname   | Emanuel
address   | Knossou 300, Iraklio, Crete, Greece
email | [email protected]
role  | IT
rank  | Developer

testdb=#

We can see that Mr Singh’s details are not accessible from the application.

Temporary Global Anonymization

The Main Idea

The user marks the start of the anonymization interval (a short period of time).
During this interval, only selects are allowed for the table named person.
All access (selects) are anonymized for all records in the person table, regardless of any prior anonymization setup.
The user marks the end of the anonymization interval.

Building Blocks

Two-phase commit (aka Prepared Transactions).
Explicit table locking.
The anonymization setup we did above in the “Permanent Anonymization” section.

Implementation

A special admin app (e.g. called : markStartOfAnynimizationPeriod) performs

testdb=# BEGIN ;
BEGIN
testdb=# LOCK public.person IN SHARE MODE ;
LOCK TABLE
testdb=# PREPARE TRANSACTION 'personlock';
PREPARE TRANSACTION
testdb=#

What the above does is acquire a lock on the table in SHARE mode so that INSERTS, UPDATES, DELETES are blocked. Also by starting a two phase commit transaction (AKA prepared transaction, in other contexts known as distributed transactions or eXtended Architecture transactions XA) we free the transaction from the connection of the session marking the start of the anonymization period, while letting other subsequent sessions be aware of its existence. The prepared transaction is a persistent transaction which stays alive after the disconnection of the connection/session which has started it (via PREPARE TRANSACTION). Note that the “PREPARE TRANSACTION” statement disassociates the transaction from the current session. The prepared transaction can be picked up by a subsequent session and either be rollbacked or committed. The use of this kind of XA transactions enables a system to reliably deal with many different XA data sources, and perform transactional logic across those (possibly heterogeneous) datasources. However, the reasons we use it in this specific case :

to enable the issuing client session of ending the session and disconnecting/freeing its connection (leaving or even worse “persisting” a connection is a really bad idea, a connection should be freed as soon as it performs the queries it needs to do)
to make subsequent sessions/connections capable of querying for the existence of this prepared transaction
to make the ending session capable of committing this prepared transaction (by the use of its name) thus marking :
- the release of the SHARE MODE lock
- the end of the anonymization period

In order to verify that the transaction is alive and associated with the SHARE lock on our person table we do:

testdb=# select px.*,l0.* from pg_prepared_xacts px , pg_locks l0 where px.gid='personlock' AND l0.virtualtransaction='-1/'||px.transaction AND l0.relation='public.person'::regclass AND l0.mode='ShareLock';
-[ RECORD 1 ]------+----------------------------
transaction    | 725
gid            | personlock
prepared       | 2020-05-23 15:34:47.2155+03
owner          | postgres
database       | testdb
locktype       | relation
database       | 16384
relation       | 32829
page           |
tuple          |
virtualxid     |
transactionid  |
classid        |
objid          |
objsubid       |
virtualtransaction | -1/725
pid            |
mode           | ShareLock
granted        | t
fastpath       | f

testdb=#

What the above query does is to ensure that the named prepared transaction personlock is alive and that indeed the associated lock on table person held by this virtual transaction is in the intended mode : SHARE.

So now we may tweak the view:

CREATE OR REPLACE VIEW anonym.person AS
WITH perlockqry AS (
    SELECT 1
      FROM pg_prepared_xacts px,
        pg_locks l0
      WHERE px.gid = 'personlock'::text AND l0.virtualtransaction = ('-1/'::text || px.transaction) AND l0.relation = 'public.person'::regclass::oid AND l0.mode = 'ShareLock'::text
    )
SELECT p.id,
    CASE
        WHEN pa.id IS NULL AND NOT (EXISTS ( SELECT 1
          FROM perlockqry)) THEN p.givenname::character varying
        ELSE '****'::character varying
    END AS givenname,
    CASE
        WHEN pa.id IS NULL AND NOT (EXISTS ( SELECT 1
          FROM perlockqry)) THEN p.midname::character varying
        ELSE '****'::character varying
    END AS midname,
    CASE
        WHEN pa.id IS NULL AND NOT (EXISTS ( SELECT 1
          FROM perlockqry)) THEN p.surname::character varying
        ELSE '****'::character varying
    END AS surname,
    CASE
        WHEN pa.id IS NULL AND NOT (EXISTS ( SELECT 1
          FROM perlockqry)) THEN p.address
        ELSE '****'::text
    END AS address,
    CASE
        WHEN pa.id IS NULL AND NOT (EXISTS ( SELECT 1
          FROM perlockqry)) THEN p.email::character varying
        ELSE '****'::character varying
    END AS email,
p.role,
p.rank
  FROM public.person p
LEFT JOIN person_anonym pa ON p.id = pa.id

Now with the new definition, if the user has started prepared transaction personlock,then the following select will return:

testdb=# select * from person;
id | givenname | midname | surname | address | email |   role   |   rank   
----+-----------+---------+---------+---------+-------+----------+-----------
  1 | ****  | **** | **** | **** | ****  | Seafarer | Captain
  2 | ****  | **** | **** | **** | ****  | IT   | DBA
  3 | ****  | **** | **** | **** | ****  | IT   | Developer
(3 rows)

testdb=#

which means global unconditional anonymization.

Any app trying to use data from table person will get anonymized “****” instead of actual real data. Now let’s suppose the admin of this app decides the anonymization period is due to end, so his app now issues:

COMMIT PREPARED 'personlock';

Now any subsequent select will return:

testdb=# select * from person;
id |  givenname  | midname | surname  |            address             |         email         |   role   |   rank   
----+-------------+---------+----------+----------------------------------------+-------------------------------+----------+-----------
  1 | ****    | **** | **** | ****                               | ****                      | Seafarer | Captain
  2 | Achilleas   |     | Mantzios | Agiou Titou 10, Iraklio, Crete, Greece | [email protected]   | IT   | DBA
  3 | Tsatsadakis |     | Emanuel  | Knossou 300, Iraklio, Crete, Greece | [email protected] | IT   | Developer
(3 rows)

testdb=#

Warning! : The lock prevents concurrent writes, but does not prevent eventual write when the lock will have been released. So there is a potential danger for updating apps, reading ‘****’ from the database, a careless user, hitting update, and then after some period of waiting, the SHARED lock gets released and the update succeeds writing ‘****’ in place of where correct normal data should be. Users of course can help here by not blindly pressing buttons, but some additional protections could be added here. Updating apps could issue a:

set lock_timeout TO 1;

at the start of the updating transaction. This way any waiting/blocking longer than 1ms will raise an exception. Which should protect against the vast majority of cases. Another way would be a check constraint in any of the sensitive fields to check against the ‘****’ value.

ALARM! : it is imperative that the prepared transaction must be eventually completed. Either by the user who started it (or another user), or even by a cron script which checks for forgotten transactions every let’s say 30 minutes. Forgetting to end this transaction will cause catastrophic results as it prevents VACUUM from running, and of course the lock will still be there, preventing writes to the database. If you are not comfortable enough with your system, if you don’t fully understand all the aspects and all side effects of using a prepared/distributed transaction with a lock, if you don’t have adequate monitoring in place, especially regarding the MVCC metrics, then simply do not follow this approach. In this case, you could have a special table holding parameters for admin purposes where you could use two special column values, one for normal operation and one for global enforced anonymization, or you could experiment with PostgreSQL application-level shared advisory locks:

ClickHouse Schema Design and Data Modeling

Building a Modern Analytics Stack Around ClickHouse

Managing ClickHouse Resources in Multi-Tenant Environments

How to Migrate from SQL Server to PostgreSQL HA for Dockerized Apps