blog
An Overview of PostgreSQL & MySQL Cross Replication
This blog is aimed at explaining an overview of cross replication between PostgreSQL and MySQL, and further discussing the methods of configuring cross replication between the two database servers. Traditionally, the databases involved in a cross replication setup are called heterogeneous databases, which is a good approach to move away from one RDBMS server to another.
Both PostgreSQL and MySQL databases are conventionally RDBMS databases but they also offer NoSQL capability with added extensions to have the best of both worlds. This article focuses on the discussion of replication between PostgreSQL and MySQL from an RDBMS perspective.
An exhaustive explanation about internals of replication is not within the purview of this blog, however, some foundational elements shall be discussed to give the audience an understanding of how is replication configured between database servers, advantages, limitations and perhaps some known use cases.
In general replication between two identical database servers is achieved either in binary mode or query mode between a master node (otherwise called publisher, primary or active) and a slave node (subscriber, standby or passive). The aim of replication is to provide a real time copy of the master database on the slave side, where the data is transferred from master to slave, thereby forming an active-passive setup because the replication is only configured to occur one way. On the other hand, replication between two databases can be configured both ways so the data can also be transferred from slave back to master, establishing an active-active configuration. All of this can be configured between two or more identical database servers which may also include a cascading replication. The configuration of active-active or active-passive really depends on the business need, availability of such features within the native configuration or utilizing external solutions to configure and applicable trade-offs.
The above mentioned configuration can be accomplished with diverse database servers, wherein a database server can be configured to accept replicated data from another completely different database server and still maintain real time snapshot of the data being replicated. Both MySQL and PostgreSQL database servers offer most of the configurations discussed above either in their own nativity or with the help of third party extensions including binary log method, disk block method, statement based and row based methods.
The requirement to configure a cross replication between MySQL and PostgreSQL really comes in as a result of a one time migration effort to move away from one database server to another. As both the databases use different protocols so they cannot directly talk to each other. In order to achieve that communication flow, there is an external open source tool such as pg_chameleon.
Background of pg_chameleon
pg_chameleon is a MySQL to PostgreSQL replication system developed in Python 3. It uses an open source library called mysql-replication which is also developed using Python. The functionality involves pulling row images of MySQL tables and storing them as JSONB objects into a PostgreSQL database, which is further decoded by a pl/pgsql function and replaying those changes against the PostgreSQL database.
Features of pg_chameleon
- Multiple MySQL schemas from the same cluster can be replicated to a single target PostgreSQL database, forming a many-to-one replication setup
- The source and target schema names can be non-identical
- Replication data can be pulled from MySQL cascading replica
- Tables that fail to replicate or generate errors are excluded
- Each replication functionality is managed with the help of daemons
- Controlled with the help of parameters and configuration files based on YAML construct
Demo
Host | vm1 | vm2 |
---|---|---|
OS version | CentOS Linux release 7.6 x86_64 | CentOS Linux release 7.5 x86_64 |
Database server with version | MySQL 5.7.26 | PostgreSQL 10.5 |
Database port | 3306 | 5433 |
ip address | 192.168.56.102 | 192.168.56.106 |
To begin with, prepare the setup with all the prerequisites needed to install pg_chameleon. In this demo Python 3.6.8 is installed, creating a virtual environment and activating it for use.
$> wget https://www.python.org/ftp/python/3.6.8/Python-3.6.8.tar.xz
$> tar -xJf Python-3.6.8.tar.xz
$> cd Python-3.6.8
$> ./configure --enable-optimizations
$> make altinstall
Following a successful installation of Python3.6, further additional requirements are met such as creating and activating a virtual environment. In addition to that pip module upgraded to the latest version and it is used to install pg_chameleon. In the commands below, pg_chameleon 2.0.9 was deliberately installed whereas the latest version is a 2.0.10. This is done in order to avoid any newly introduced bugs in the updated version.
$> python3.6 -m venv venv
$> source venv/bin/activate
(venv) $> pip install pip --upgrade
(venv) $> pip install pg_chameleon==2.0.9
The next step is to invoke the pg_chameleon (chameleon is the command) with set_configuration_files argument to enable pg_chameleon to create default directories and configuration files.
(venv) $> chameleon set_configuration_files
creating directory /root/.pg_chameleon
creating directory /root/.pg_chameleon/configuration/
creating directory /root/.pg_chameleon/logs/
creating directory /root/.pg_chameleon/pid/
copying configuration example in /root/.pg_chameleon/configuration//config-example.yml
Now, create a copy of config-example.yml as default.yml to make it the default configuration file. A sample configuration file used for this demo is provided below.
$> cat default.yml
---
#global settings
pid_dir: '~/.pg_chameleon/pid/'
log_dir: '~/.pg_chameleon/logs/'
log_dest: file
log_level: info
log_days_keep: 10
rollbar_key: ''
rollbar_env: ''
# type_override allows the user to override the default type conversion into a different one.
type_override:
"tinyint(1)":
override_to: boolean
override_tables:
- "*"
#postgres destination connection
pg_conn:
host: "192.168.56.106"
port: "5433"
user: "usr_replica"
password: "pass123"
database: "db_replica"
charset: "utf8"
sources:
mysql:
db_conn:
host: "192.168.56.102"
port: "3306"
user: "usr_replica"
password: "pass123"
charset: 'utf8'
connect_timeout: 10
schema_mappings:
world_x: pgworld_x
limit_tables:
# - delphis_mediterranea.foo
skip_tables:
# - delphis_mediterranea.bar
grant_select_to:
- usr_readonly
lock_timeout: "120s"
my_server_id: 100
replica_batch_size: 10000
replay_max_rows: 10000
batch_retention: '1 day'
copy_max_memory: "300M"
copy_mode: 'file'
out_dir: /tmp
sleep_loop: 1
on_error_replay: continue
on_error_read: continue
auto_maintenance: "disabled"
gtid_enable: No
type: mysql
skip_events:
insert:
- delphis_mediterranea.foo #skips inserts on the table delphis_mediterranea.foo
delete:
- delphis_mediterranea #skips deletes on schema delphis_mediterranea
update:
The configuration file used in this demo is the sample file that comes with pg_chameleon with minor edits to suit the source and destination environments, and a summary of different sections of the configuration file follows.
The default.yml configuration file has a “global settings” section that control details such as lock file location, logging locations and retention period, etc. The section that follows next is the “type override” section which is a set of rules to override types during replication. A sample type override rule is used by default which converts a tinyint(1) to a boolean value. The next section is the destination database connection details section which in our case is a PostgreSQL database, denoted by “pg_conn”. The final section is the source section which has all the details of source database connection settings, schema mapping between source and destination, any tables to skip including timeout, memory and batch size settings. Notice the “sources” denoting that there can be multiple sources to a single destination to form a many-to-one replication setup.
A “world_x” database is used in this demo which is a sample database with 4 tables containing sample rows, that MySQL community offers for demo purposes, and it can be downloaded from here. The sample database comes as a tar and compressed archive along with instructions to create it and import rows in it.
A dedicated user is created in both the MySQL and PostgreSQL databases with the same name as usr_replica that is further granted additional privileges on MySQL to have read access to all the tables being replicated.
mysql> CREATE USER usr_replica ;
mysql> SET PASSWORD FOR usr_replica='pass123';
mysql> GRANT ALL ON world_x.* TO 'usr_replica';
mysql> GRANT RELOAD ON *.* to 'usr_replica';
mysql> GRANT REPLICATION CLIENT ON *.* to 'usr_replica';
mysql> GRANT REPLICATION SLAVE ON *.* to 'usr_replica';
mysql> FLUSH PRIVILEGES;
A database is created on the PostgreSQL side that will accept changes from MySQL database, which is named as “db_replica”. The “usr_replica” user in PostgreSQL is automatically configured as an owner of two schemas such as “pgworld_x” and “sch_chameleon” that contain the actual replicated tables and catalog tables of replication respectively. This automatic configuration is done by the create_replica_schema argument, indicated further below.
postgres=# CREATE USER usr_replica WITH PASSWORD 'pass123';
CREATE ROLE
postgres=# CREATE DATABASE db_replica WITH OWNER usr_replica;
CREATE DATABASE
The MySQL database is configured with a few parameter changes in order to prepare it for replication, as shown below, and it requires a database server restart for the changes to take effect.
$> vi /etc/my.cnf
binlog_format= ROW
binlog_row_image=FULL
log-bin = mysql-bin
server-id = 1
At this point, it is significant to test the connectivity to both the database servers to ensure there are no issues when pg_chameleon commands are executed.
On the PostgreSQL node:
$> mysql -u usr_replica -Ap'admin123' -h 192.168.56.102 -D world_x
On the MySQL node:
$> psql -p 5433 -U usr_replica -h 192.168.56.106 db_replica
The next three commands of pg_chameleon (chameleon) is where it sets the environment up, adds a source and initializes a replica. The “create_replica_schema” argument of pg_chameleon creates the default schema (sch_chameleon) and replication schema (pgworld_x) in the PostgreSQL database as has already been discussed. The “add_source” argument adds the source database to the configuration by reading the configuration file (default.yml), which in this case is “mysql”, while the “init_replica” initializes the configuration based on the settings of the configuration file.
$> chameleon create_replica_schema --debug
$> chameleon add_source --config default --source mysql --debug
$> chameleon init_replica --config default --source mysql --debug
The output of the above three commands is self explanatory indicating the success of each command with an evident output message. Any failures or syntax errors are clearly mentioned in simple and plain messages, thereby suggesting and prompting corrective actions.
The final step is to start the replication with “start_replica”, the success of which is indicated by an output hint as shown below.
$> chameleon start_replica --config default --source mysql
output: Starting the replica process for source mysql
The status of replication can be queried with the “show_status” argument while errors can be viewed with ‘show_errors” argument.
$> chameleon show_status --source mysql
OUTPUT:
Source id Source name Type Status Consistent Read lag Last read Replay lag Last replay
----------- ------------- ------ -------- ------------ ---------- ----------- ------------ -------------
1 mysql mysql running No N/A N/A
== Schema mappings ==
Origin schema Destination schema
--------------- --------------------
world_x pgworld_x
== Replica status ==
--------------------- ---
Tables not replicated 0
Tables replicated 4
All tables 4
Last maintenance N/A
Next maintenance N/A
Replayed rows
Replayed DDL
Skipped rows
--------------------- ---
$> chameleon show_errors --config default
output: There are no errors in the log
As discussed earlier that each of the replication functionality is managed with the help of daemons, which can be viewed by querying the process table using Linux “ps” command, exhibited below.
$> ps -ef|grep chameleon
root 763 1 0 19:20 ? 00:00:00 /u01/media/mysql_samp_dbs/world_x-db/venv/bin/python3.6 /u01/media/mysq l_samp_dbs/world_x-db/venv/bin/chameleon start_replica --config default --source mysql
root 764 763 0 19:20 ? 00:00:01 /u01/media/mysql_samp_dbs/world_x-db/venv/bin/python3.6 /u01/media/mysq l_samp_dbs/world_x-db/venv/bin/chameleon start_replica --config default --source mysql
root 765 763 0 19:20 ? 00:00:00 /u01/media/mysql_samp_dbs/world_x-db/venv/bin/python3.6 /u01/media/mysq l_samp_dbs/world_x-db/venv/bin/chameleon start_replica --config default --source mysql
No replication setup is complete until it is put to the “real-time apply” test, which has been simulated as below. It involves creating a table and inserting a couple of records in the MySQL database, subsequently, the “sync_tables” argument of pg_chameleon is invoked to update the daemons to replicate the table along with its records to the PostgreSQL database.
mysql> create table t1 (n1 int primary key, n2 varchar(10));
Query OK, 0 rows affected (0.01 sec)
mysql> insert into t1 values (1,'one');
Query OK, 1 row affected (0.00 sec)
mysql> insert into t1 values (2,'two');
Query OK, 1 row affected (0.00 sec)
$> chameleon sync_tables --tables world_x.t1 --config default --source mysql
Sync tables process for source mysql started.
The test is confirmed by querying the table from PostgreSQL database to reflect the rows.
$> psql -p 5433 -U usr_replica -d db_replica -c "select * from pgworld_x.t1";
n1 | n2
----+-------
1 | one
2 | two
If it is a migration project then the following pg_chameleon commands will mark the end of the migration effort. The commands should be executed after it is confirmed that rows of all the target tables have been replicated across, and the result will be a cleanly migrated PostgreSQL database without any references to the source database or replication schema (sch_chameleon).
$> chameleon stop_replica --config default --source mysql
$> chameleon detach_replica --config default --source mysql --debug
Optionally the following commands will drop the source configuration and replication schema.
$> chameleon drop_source --config default --source mysql --debug
$> chameleon drop_replica_schema --config default --source mysql --debug
Pros of Using pg_chameleon
- Simple to setup and less complicated configuration
- Painless troubleshooting and anomaly detection with easy to understand error output
- Additional adhoc tables can be added to the replication after initialization, without altering any other configuration
- Multiple sources can be configured for a single destination database, which is useful in consolidation projects to merge data from one or more MySQL databases into a single PostgreSQL database
- Selected tables can be skipped from being replicated
Cons of Using pg_chameleon
- Only supported from MySQL 5.5 onwards as Origin database and PostgreSQL 9.5 onwards for destination database
- Requires every table to have a primary or unique key, otherwise, the tables get initialized during the init_replica process but they will fail to replicate
- One way replication, i.e., MySQL to PostgreSQL. Thereby limiting its use to only an active-passive setup
- The source database can only be a MySQL database while support for PostgreSQL database as source is experimental with further limitations (click here to learn more)
pg_chameleon Summary
The replication approach offered by pg_chameleon is favourable to a database migration of MySQL to PostgreSQL. However, one of the significant limitations of one-way replication can discourage database professionals to adopt it for anything other than migration. This drawback of unidirectional replication can be addressed using yet another open source tool called SymmetricDS.
In order to study the utility more in detail, please refer to the official documentation here. The command line reference can be obtained from here.
An Overview of SymmetricDS
SymmetricDS is an open source tool that is capable of replicating any database to any other database, from the popular list of database servers such as Oracle, MongoDB, PostgreSQL, MySQL, SQL Server, MariaDB, DB2, Sybase, Greenplum, Informix, H2, Firebird and other cloud based database instances such as Redshift and Azure etc. Some of the offerings include database and file synchronization, multi-master replication, filtered synchronization, and transformation. The tool is developed using Java, requiring a standard edition (version 8.0 or above) of either JRE or JDK. The functionality involves data changes being captured by triggers at source database and routing it to a participating destination database as outgoing batches
Features of SymmetricDS
- Platform independent, which means two or more dissimilar databases can communicate with each other, any database to any other database
- Relational databases achieve synchronization using change data capture while file system based systems utilize file synchronization
- Bi-directional replication using Push and Pull method, which is accomplished based on set rules
- Data transfer can also occur over secure and low bandwidth networks
- Automatic recovery during the resumption of a crashed node and automatic conflict resolution
- Cloud ready and contains powerful extension APIs
Demo
SymmetricDS can be configured in one of the two options:
- A master (parent) node that acts as a centralized intermediary coordinating data replication between two slave (child) nodes, in which the communication between the two child nodes can only occur via the parent.
- An active node (node1) can replicate to and from another active node (node2) without any intermediary.
In both the options, the communication between the nodes happens via “Push” and “Pull” events. In this demo, an active-active configuration between two nodes will be explained. The full architecture can be exhaustive, so the readers are encouraged to check the user guide available here to learn more about the internals of SymmetricDS.
Installing SymmetricDS is as simple as downloading the open source version of zip file from here and extracting it in a convenient location. The details of install location and version of SymmetricDS in this demo are as per the table below, along with other details pertaining to database versions, Linux versions, ip addresses and communication port for both the participating nodes.
Host | vm1 | vm2 |
---|---|---|
OS version | CentOS Linux release 7.6 x86_64 | CentOS Linux release 7.6 x86_64 |
Database server version | MySQL 5.7.26 | PostgreSQL 10.5 |
Database port | 3306 | 5832 |
ip address | 192.168.1.107 | 192.168.1.112 |
SymmetricDS version | SymmetricDS 3.9 | SymmetricDS 3.9 |
SymmetricDS install location | /usr/local/symmetric-server-3.9.20 | /usr/local/symmetric-server-3.9.20 |
SymmetricDS node name | corp-000 | store-001 |
The install home in this case is “/usr/local/symmetric-server-3.9.20” which will be the home directory of SymmetricDS, which contains various other sub-directories and files. Two of the sub-directories that are of importance now are “samples” and “engines”. The samples directory contains node properties configuration file samples in addition to sample SQL scripts to kick start a quick demo.
The following three node properties configuration files can be seen in the “samples” directory with names indicating the nature of node in a given setup.
corp-000.properties
store-001.properties
store-002.properties
As SymmetricDS comes with all the necessary configuration files to support a basic 3 node setup (option 1), it is convenient to use the same configuration files to setup a 2 node setup (option 2) as well. The intended configuration file is copied from the “samples” directory to the “engines” on host vm1, and it looks like below.
$> cat engines/corp-000.properties
engine.name=corp-000
db.driver=com.mysql.jdbc.Driver
db.url=jdbc:mysql://192.168.1.107:3306/replica_db?autoReconnect=true&useSSL=false
db.user=root
db.password=admin123
registration.url=
sync.url=http://192.168.1.107:31415/sync/corp-000
group.id=corp
external.id=000
The name of this node in SymmetricDS configuration is “corp-000” with the database connection handled with mysql jdbc driver using the connection string as stated above along with login credentials. The database to connect is “replica_db” and the tables will be created during the creation of sample schema. The “sync.url” denotes the location to contact the node for synchronization.
The node 2 on host vm2 is configured as “store-001” with the rest of the details as configured in the node.properties file, shown below. The “store-001” node runs a PostgreSQL database, with “pgdb_replica” as the database for replication. The “registration.url” enables host “vm2” to communicate with host “vm1” to pull configuration details.
$> cat engines/store-001.properties
engine.name=store-001
db.driver=org.postgresql.Driver
db.url=jdbc:postgresql://192.168.1.112:5832/pgdb_replica
db.user=postgres
db.password=admin123
registration.url=http://192.168.1.107:31415/sync/corp-000
group.id=store
external.id=001
The pre-configured default demo of SymmetricDS contains settings to setup a bi-directional replication between two database servers (two nodes). The steps below are executed on host vm1 (corp-000), which will create a sample schema having 4 tables. Further, execution of “create-sym-tables” with “symadmin” command will create the catalog tables that store and control the rules and direction of replication between nodes. Finally, the demo tables are loaded with sample data.
vm1$> cd /usr/local/symmetric-server-3.9.20/bin
vm1$> ./dbimport --engine corp-000 --format XML create_sample.xml
vm1$> ./symadmin --engine corp-000 create-sym-tables
vm1$> ./dbimport --engine corp-000 insert_sample.sql
The demo tables “item” and “item_selling_price” are auto-configured to replicate from corp-000 to store-001 while the sale tables (sale_transaction and sale_return_line_item) are auto-configured replicate from store-001 to corp-000. The next step is to create the sample schema in the PostgreSQL database on host vm2 (store-001), in order to prepare it to receive data from corp-000.
vm2$> cd /usr/local/symmetric-server-3.9.20/bin
vm2$> ./dbimport --engine store-001 --format XML create_sample.xml
It is important to verify the existence of demo tables and SymmetricDS catalog tables in the MySQL database on vm1 at this stage. Note, the SymmetricDS system tables (tables with prefix “sym_”) are only available in the corp-000 node at this point of time, because that is where the “create-sym-tables” command was executed, which will be the place to control and manage the replication. In addition to that, the store-001 node database will only have 4 demo tables with no data in it.
The environment is now ready to start the “sym” server processes on both the nodes, as show below.
vm1$> cd /usr/local/symmetric-server-3.9.20/bin
vm1$> sym 2>&1 &
The log entries are both sent to a background log file (symmetric.log) under a logs directory in the SymmetricDS install location as well as to the standard output. The “sym” server can now be initiated on store-001 node.
vm2$> cd /usr/local/symmetric-server-3.9.20/bin
vm2$> sym 2>&1 &
The startup of “sym” server process on host vm2 will create the SymmetricDS catalog tables in the PostgreSQL database as well. The startup of “sym” server process on both the nodes will get them to coordinate with each other to replicate data from corp-000 to store-001. After a few seconds, querying all the four tables on either side will show the successful replication results. Alternatively, an initial load can also be sent to the store-001 node from corp-000 with the below command.
vm1$> ./symadmin --engine corp-000 reload-node 001
At this point, a new record is inserted into the “item” table in MySQL database at corp-000 node (host: vm1) and it can be verified to have successfully replicated to the PostgreSQL database at store-001 node (host: vm2). This shows the “Pull” event of data from corp-000 to store-001.
mysql> insert into item values ('22000002','Jelly Bean');
Query OK, 1 row affected (0.00 sec)
vm2$> psql -p 5832 -U postgres pgdb_replica -c "select * from item"
item_id | name
----------+-----------
11000001 | Yummy Gum
22000002 | Jelly Bean
(2 rows)
The “Push” event of data from store-001 to corp-000 can be achieved by inserting a record into the “sale_transaction” table and confirming it to replicate through.
pgdb_replica=# insert into "sale_transaction" ("tran_id", "store_id", "workstation", "day", "seq") values (1000, '001', '3', '2007-11-01', 100);
vm1$> [root@vm1 ~]# mysql -uroot -p'admin123' -D replica_db -e "select * from sale_transaction";
+---------+----------+-------------+------------+-----+
| tran_id | store_id | workstation | day | seq |
+---------+----------+-------------+------------+-----+
| 900 | 001 | 3 | 2012-12-01 | 90 |
| 1000 | 001 | 3 | 2007-11-01 | 100 |
| 2000 | 002 | 2 | 2007-11-01 | 200 |
+---------+----------+-------------+------------+-----+
This marks the successful configuration of bidirectional replication of demo tables between a MySQL and PostgreSQL database. Whereas, the configuration of replication for newly created user tables can be achieved using the following steps. An example table “t1” is created for the demo and the rules of its replication are configured as per the procedure below. The steps only configure the replication from corp-000 to store-001.
mysql> create table t1 (no integer);
Query OK, 0 rows affected (0.01 sec)
mysql> insert into sym_channel (channel_id,create_time,last_update_time)
values ('t1',current_timestamp,current_timestamp);
Query OK, 1 row affected (0.01 sec)
mysql> insert into sym_trigger (trigger_id, source_table_name,channel_id,
last_update_time, create_time) values ('t1', 't1', 't1', current_timestamp,
current_timestamp);
Query OK, 1 row affected (0.01 sec)
mysql> insert into sym_trigger_router (trigger_id, router_id,
Initial_load_order, create_time,last_update_time) values ('t1',
'corp-2-store-1', 1, current_timestamp,current_timestamp);
Query OK, 1 row affected (0.01 sec)
After this, the configuration is notified about the schema change of adding a new table by invoking the symadmin command with “sync-triggers” argument which will recreate the triggers to match table definitions. Subsequently, execute “send-schema” to send schema changes out to store-001 node, following which the replication of “t1” table will be configured successfully.
vm1$> ./symadmin -e corp-000 --node=001 sync-triggers
vm1$> ./symadmin send-schema -e corp-000 --node=001 t1
Pros of Using SymmetricDS
- Effortless installation and configuration including a pre-configured set of parameter files to build either a 3-node or a 2-node setup
- Cross platform database enabled and platform independent including servers, laptops and mobile devices
- Replicate any database to any other database, whether on-prem, WAN or cloud
- Capable of optimally handling a couple of databases to several thousand databases to replicate data seamlessly
- A commercial version of the software offers GUI driven management console with an excellent support package
Cons of Using SymmetricDS
- Manual command line configuration may involve defining rules and direction of replication via SQL statements to load catalog tables, which may be inconvenient to manage
- Setting up a large number of tables for replication will be an exhaustive effort, unless some form of scripting is utilized to generate the SQL statements defining rules and direction of replication
- Plenty of logging information cluttering the logfile, thereby requiring periodic logfile maintenance to not allow the logfile to fill up the disk
SymmetricDS Summary
SymmetricDS offers the ability to setup bi-directional replication between 2 nodes, 3 nodes and so on for several thousand nodes to replicate data and achieve file synchronization. It is a unique tool that performs many of the self-healing maintenance tasks such as the automatic recovery of data after extended periods of downtime in a node, secure and efficient communication between nodes with the help of HTTPS and automatic conflict management based on set rules, etc. The essential feature of replicating any database to any other database makes SymmetricDS ready to be deployed for a number of use cases including migration, version and patch upgrade, distribution, filtering and transformation of data across diverse platforms.
The demo was created by referring to the official quick-start tutorial of SymmetricDS which can be accessed from here. The user guide can be found here, which provides a detailed account of various concepts involved in a SymmetricDS replication setup.