What is ETL?
ETL stands for Extract, Transform and Load: a three-step process used to extract data from various sources (which can exist in various forms), cleanse it, and load it into a target database for analytics. ETL is a popular process in the data warehousing world, where data from various sources is integrated and loaded into a target database for business analytics and reporting. In simple terms, ETL extracts data from a source such as a database or a file, cleanses and transforms it according to business requirements, and then loads it into the target database.
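The three steps can be sketched in a few lines of Python. This is a minimal illustration, not production code: the CSV content and the cleansing rules are invented for the example, and Python's built-in sqlite3 stands in for the PostgreSQL target so the sketch runs anywhere.

```python
import csv
import io
import sqlite3

# Hypothetical source data; in practice this would come from a file or a
# source database. The stray whitespace simulates dirty input.
raw = io.StringIO("id,name,signup\n1, Alice ,2021-01-05\n2,Bob,2021-02-10\n")

# Extract: read rows from the CSV source.
rows = list(csv.DictReader(raw))

# Transform: cleanse according to (assumed) business rules -
# cast ids to integers and trim whitespace from names.
cleaned = [(int(r["id"]), r["name"].strip(), r["signup"]) for r in rows]

# Load: write the cleansed rows into the target database
# (sqlite3 here, standing in for PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT, signup TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", cleaned)
conn.commit()
```

The same extract/transform/load shape holds whatever the source and target are; only the connectors and the cleansing rules change.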
The ETL process exists in the form of various tools. There are quite a few popular ETL tools out there which are widely used by businesses to address different data migration requirements. Even so, there is no guarantee that these tools will fulfil a given set of data migration requirements straight away, which is why DBAs and developers often opt to build custom ETLs to get through real-time, complex data migration challenges.
Whenever there is a requirement for data migration, the first thing DBAs or developers look for is an ETL tool. Data can exist in different forms: in RDBMS databases, flat files, CSVs etc., and the requirement may be to migrate and integrate all of this data into a single database; if the target database differs from the sources, the transformation process becomes critical. ETL tools can address these challenges and save cost and business time. Without ETL-specific tooling, organizations can spend significant development effort and money building an efficient automated data migration process. Thanks to the open source world, there are popular open source ETL tools which can address complex real-time data migration challenges.
Whilst there are various reasons to migrate the data, I would like to focus on two typical requirements for data migration…
- Migrate data from different sources (databases, flat files and CSVs) into a single database in a data warehousing environment, preferably an open source database, which significantly reduces the TCO of building DWH environments. This is a viable option as the real-time applications continue using the existing commercial databases while the DWH hosts the data on an open source database
- Migrate real-time databases and applications off commercial databases to open source databases like PostgreSQL for a much lower cost of data operations
My focus in this blog is to identify ETL tools which can help perform data migrations to a PostgreSQL database.
Why Migrate to PostgreSQL?
PostgreSQL is a feature-rich, enterprise-class, open source database that is increasingly the first option businesses choose for their real-time data operation requirements, with implementations across various mission-critical environments. After realizing the potential of this highly reliable and efficient RDBMS, more and more businesses are opting to migrate their databases and applications to it. Migrating existing databases to PostgreSQL brings significant reductions in IT costs, which is why "migrations to PostgreSQL" are quite a common requirement these days; with them comes the need for data migration, and that is where the hunt for an ETL tool begins.
As said above, quite a number of commercial and open source ETL tools exist, and pretty much all of them support PostgreSQL.
What Are the Top ETL Tools?
Ora2pg is THE OPTION if you intend to migrate data from an Oracle database to PostgreSQL. It is a Perl-based open source tool developed specifically to migrate schemas and data from Oracle databases to PostgreSQL; it understands both databases very well and can migrate data of any size. Migrating large objects of significant size, however, can be expensive in terms of time and hardware.
Pros: Ora2pg is a very popular tool for migrating Oracle databases specifically to PostgreSQL. It supports Windows and Linux and uses a textual interface. The tool understands both databases very well and is quite reliable from a functionality perspective: when we migrated data in a production environment, the data analysis (or data sanity) exercise resulted in zero data defects, which is quite remarkable. It is pretty efficient at migrating data types like Date/Timestamp and large objects, and jobs are easy to schedule in the background via shell scripts. The developer's response to issues on GitHub is good.
Cons: Ora2pg's installation procedure, which includes installing Perl modules and the Oracle and Postgres clients, can become a complex affair depending on the OS version, and even more so on Windows. There can be significant performance challenges when migrating large tables with "Large Objects" in parallel (meaning one ora2pg job with multiple threads), which can force a significant change in the data migration strategy.
Talend is a very popular ETL tool used to migrate data from any source (database or file) to any database. It supports PostgreSQL, and many businesses use it to migrate data to PostgreSQL. There are both commercial and open source versions of the tool, and the open source version should be sufficient for data migrations.
Pros: Talend is a Java-based ETL tool used for data integration which supports PostgreSQL. It is easy to install, comes with a GUI in both the open source and commercial versions, and can run on any platform that supports Java. Developers can write custom Java code and integrate it into Talend. It is no big deal to instruct a developer or a DBA to use this tool to migrate data to PostgreSQL, and it can migrate or integrate data from multiple sources such as databases or files.
Cons: Scheduling jobs is a challenge, as it is not possible to script and schedule data migration jobs. It is mostly suited to migrating tables of reasonable size, with few optimization options for improving performance, so it may not be a great option for huge tables with millions of rows, and it can bring basic operational challenges. Java expertise is needed to handle the tool, especially when integrating custom code, and it is not easy to gain comfort with it within a short time.
SQLines is another open source ETL tool which can migrate data to and from any database. It is another good option for migrating data to PostgreSQL from pretty much any commercial or open source database, and I am personally impressed by this tool. It is developed in C/C++ and is very simple to use, with no complexities around the installation process (just download and untar the installer and you are done!). Being a C/C++-based tool, it can deliver big performance wins when migrating large databases. I would say this tool is still evolving, and the subscription costs for support are very reasonable.
Pros: As mentioned above, I am impressed that this tool is built in C/C++, which is a huge plus. It is quite easy and simple to install and set up, and its textual interface makes it really easy to schedule jobs via bash scripts. It can handle large data volumes. Support from the developers is good, at a very reasonable cost, and the developers are open to taking your ideas and implementing them, which makes it an even better option.
Cons: Not many people know about this tool, and it is still evolving. There are not many configuration options to play around with, and the tool has some way to go to become competitive, though not far. You might run into basic operational challenges.
Pentaho is another data migration and integration tool which, again, has commercial and open source versions and can migrate data from any data source to any database. It is also an option for migrating data to PostgreSQL. The tool supports a wide range of databases and operates in a bigger space, with data visualization capabilities as well.
Pros: Pentaho is a Java-based tool which operates in GUI mode and can run on operating systems like Windows, Unix and Linux. It operates in a much bigger space and is very good for data transformation and visualization purposes. As mentioned above, it supports a wide range of data stores.
Cons: Pentaho is not a simple tool that just extracts data and loads it into the target database. The data migration process can be complex and time consuming, as the tool focuses heavily on data transformation, cleansing, integration and visualization. It is not a good choice for simply migrating data from one database to another without any transformation or cleansing exercises, and performance can be a challenge when migrating large data volumes.
Custom-built ETL: It is not an exaggeration to say that a custom ETL is one of the most common ways to accomplish an end-to-end, efficient and highly performant ETL process, and it is no surprise when DBAs and developers land in this situation. No single off-the-shelf ETL tool can be expected to understand your data complexity, data shape and environmental challenges. An example: migrating data from multiple databases with complex data models in one data centre to a PostgreSQL database hosted in another data centre or in a public cloud. In such a situation, hunting for the best ETL tool can end up as a wild-goose chase. A custom ETL is the way to go if you need an environment-specific and data-specific ETL process.
Pros: A very good alternative for organizations with complex environments and complex data, where it is just not possible to find one ETL tool which addresses all data migration concerns. It can be very beneficial in terms of functionality and performance, and can reduce the time and cost of fixing bugs and defects in the tool. Critical, complex and heavily loaded ETL operations can be made highly performant and reliable, as the developers have full control over the tool; flexibility has no boundaries. It is a good option when you are looking at capabilities beyond ETL tools and need to address any level of complexity. If you choose technologies like Java or Python to build the custom ETL, they blend very well with PostgreSQL.
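As a sketch of the kind of control a custom ETL gives you, the snippet below applies environment-specific transformation rules in plain Python. The rules themselves are invented for illustration: an Oracle-style DD-MON-YY date converted to ISO format, numeric strings cast to integers, and whitespace trimmed; a real custom ETL would encode whatever your own data demands.

```python
from datetime import datetime

def transform_row(row):
    """Apply hypothetical business rules while migrating a row.

    With a custom ETL you decide exactly how each column is mapped,
    which no generic tool can know in advance.
    """
    emp_id, name, hired = row
    return (
        int(emp_id),                                          # NUMBER -> int
        name.strip(),                                         # trim padding
        datetime.strptime(hired, "%d-%b-%y").date().isoformat(),  # DD-MON-YY -> ISO
    )

# Rows as they might arrive from the source system (invented sample data).
source_rows = [("1001", " Smith ", "15-JAN-19"), ("1002", "Jones", "03-SEP-20")]
migrated = [transform_row(r) for r in source_rows]
```

Because the rules live in ordinary code, they can be unit tested, versioned and tuned for performance alongside the rest of the migration process.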
Cons: Building a custom ETL can be extremely time consuming. Significant design and development effort is required to address all the data migration requirements and other data challenges. Below are some of the challenges which custom ETLs must keep up with, and which might require significant development effort and time for enhancements:
- Environmental changes
- Infrastructure and database architectural changes impacting ETL operations
- Data type changes
- Data volume growth which significantly impacts data migration performance
- Schema structure or design changes
- Any critical code change to the ETL must go through development and testing before reaching production, which can take significant time
In general, ETL development is not considered a critical part of the project budget, as it is not part of the regular business application or database development process. It is no surprise when businesses decide not to build a custom ETL as budget, resource or time challenges crop up.
What is the Best ETL Tool?
Well, there is no straightforward answer. It all depends on your requirements and your environment. Choosing an ETL tool for migrating data to PostgreSQL depends on various factors; you will need to understand the factors impacting the data migration. Here are the most important…
- Understand your data
- Complexity of the data
- Data types
- Data source
- Data size
- What form is the source data in? A database? A flat file? Structured or unstructured?
- What steps will your data migration exercise involve, and what do you expect from the tool?
If you know the above, you will almost be in a position to choose an ETL tool. Analysing these factors will help you evaluate the characteristics and capabilities of each ETL tool. Technical experts performing data migration generally look for an ETL tool which is efficient, flexible and highly performant.
At the end of the day, it is no surprise if you end up selecting multiple ETL tools, or even developing a custom tool yourself.
To be honest, it is difficult to recommend just one ETL tool without knowing your data requirements. Instead, I would suggest the tool should have the following characteristics to support an efficient and highly performant data migration process…
- Must use a textual interface with enough configuration options
- Must be able to migrate large amounts of data efficiently by effectively utilizing multiple CPUs and the memory
- Ideally, the tool can be installed across multiple operating systems; some PostgreSQL-specific tools support only Windows, which can pose challenges from a cost, efficiency and performance perspective
- Must be able to understand the source data and the target database
- Must have flexible configuration options, with enough control to plug the tool into a bash or Python script, customize it, and schedule multiple jobs in parallel
- An optimal testing process must be designed to understand the tool’s data migration capabilities
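The scripting and parallel-scheduling points above can be sketched as a small Python driver that runs several migration jobs concurrently. The per-table commands here are placeholders (simple `python -c` invocations) standing in for whatever CLI your chosen migration tool actually exposes; the point is the pattern, not the commands.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-table migration jobs. With a real tool, each entry would
# be that tool's command line for one table; placeholders keep this runnable.
jobs = {
    "customers": [sys.executable, "-c", "print('customers migrated')"],
    "orders":    [sys.executable, "-c", "print('orders migrated')"],
}

def run_job(name, cmd):
    # Each job runs as its own OS process, so tables migrate in parallel
    # and a failure in one table does not stop the others.
    result = subprocess.run(cmd, capture_output=True, text=True)
    return name, result.returncode

# Schedule the jobs in parallel and collect exit codes per table.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = dict(pool.map(lambda kv: run_job(*kv), jobs.items()))
```

Checking the collected exit codes afterwards gives you a single place to decide which tables need a retry, which is hard to do with one-click GUI tools.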
There are GUI tools out there which are easy to set up and which migrate data in one click. These tools are good for migrating data of reasonable size in non-cloud environments, and are highly dependent on infrastructure and hardware capacity. There are few options for speeding up the migration other than increasing infrastructure capacity, and options for running multiple jobs are also limited.
When migrating data to PostgreSQL, I would start by looking at Talend or SQLines. If I needed to migrate data from Oracle, I would look at Ora2pg.