blog
Analytics with MariaDB AX – tThe Open Source Columnar Datastore
Long gone are the days of the one-database-fits-all approach.
With increasing requirements on velocity, performance and agility, numerous datastores emerged, which are intended to solve one particular problem. We have relational databases, document stores, time-series databases, columnar databases, full text search engines.
It is quite common to see multiple datastores work together in the same environment.
So how does MariaDB AX fit in the picture? How does it compare with MariaDB TX and what problem does it solve?
In this blog post, we will have a look at MariaDB AX, and see why you may want to use it.
What is MariaDB AX?
First things first, so what is MariaDB AX?
It is a column store, and it stores its data by …column! It is implemented as a separate engine in MariaDB 10.3 database.
As you may know, MySQL and MariaDB are designed to use pluggable storage engines. Every storage engine, be it InnoDB, Aria, MyRocks, Spider or any other engines are plugins.
In the same way, MariaDB AX uses ColumnStore engine:
MariaDB [(none)]> SHOW ENGINESG
*************************** 1. row ***************************
Engine: Columnstore
Support: YES
Comment: Columnstore storage engine
Transactions: YES
XA: NO
Savepoints: NO
This results in an interesting combination. The SQL parsing is done by MariaDB, thus you can expect to work with query syntax which is very similar to what you are used to in MariaDB. This makes it also easier to combine access to both MariaDB AX and MariaDB TX in the same application. No need for any specific connectors or libraries to connect to two datastores. All can be done using MySQL or MariaDB client library. You can also utilize MaxScale for both datastores, which can help to build high availability for MariaDB AX.
Why Should We Use a Columnar Datastore?
Let’s go through a short introduction to the idea behind columnar datastores.
What makes MariaDB AX different from MariaDB TX?
The main difference is how the data is structured. In typical database data is stored as rows.
Id, Product, Price, Code, Warehouse
1, Door, 10, 12334, EU1
2, Window, 9, 9523, EU1
3, Glass, 12, 97643, EU2
As you can see, we have three rows, each containing all the data about a product entry.
The problem is, this way of storing data is not really efficient when you want to get just a subset of this data. Let’s say you want to get just the “Product” and “Price” columns – to do that you have to read whole rows, all the data and then discard the unneeded columns. It is also tricky to sort the data. If you would like to sort the dataset from the most expensive to the cheapest product you have to read everything and then do the sort.
We all know that databases utilize indexes to speed up access. An index is structured in a way that it contains the content of the indexed column as well as a pointer to the full row (in InnoDB that’s the Primary Key). For example, an index on the “Product” column, assuming that “Id” is the Primary Key, may look like as follows:
Product, Id
Door, 1
Window, 2
Glass, 3
This speeds up the access to the data as there’s no need to read full row just to find a value in “Product” column. Once database finds it, it can read the rest of the row (if needed) by following the pointer.
In a column store, things are different. Data is structured not as rows but as columns. To some extend this is similar to the index. Our table in columnar datastore may look like this:
Id: 1, 2, 3
Product: Door, Window, Glass
Price: 10, 9, 12
Code: 12334, 9523, 97643
Warehouse: EU1, EU1, EU2
In MariaDB AX, columns are stored in separate files, each entry for a given “row” starts at the same offset.
The main advantage here is that if you want to run a query which will work with just a subset of data, you only need to read data from columns which are relevant to the query.
In our example earlier, instead of reading whole dataset, we can just load data for columns ‘Product” and “Price”. It reduces the data needed to be accessed on disk and speeds up the process.
What’s also important, storing data in columns make them less distinct which makes it compress better. For example, in our “Warehouse” column we have just two types of entries. Even in real world scenario it is very likely that we will end up with a small number of warehouses compared to the number of products. This makes the “Warehouse” column very good target for compression.
As a result of all of this, columnar datastores can handle large dataset better and can query it in more efficient way than “standard” OLTP-focused databases.
Why Should I Use MariaDB AX?
Disk access is a major bottleneck in databases. A columnar datastore improves performance by reducing the amount of data that needs to be read from disk. It reads only the data necessary to answer the query.
Of course, MariaDB AX is not the only columnar datastore out there. There are many others like, for example, Clickhouse or Apache HBase.
The truth is, none of the other options support full SQL syntax that MySQL does. They require different connectors, different approach to querying the data while MariaDB AX can be queried just like you would query the ‘normal’ MariaDB.
What’s also important, given that MariaDB AX utilizes ColumnStore engine, it is perfectly fine to mix it with other engines. You can mix and join InnoDB and ColumnStore tables in the same query without any problem.
On top of that, tools that come with MariaDB TX, like MaxScale, will work just fine with MariaDB AX making it easier to build an integrated, easy to use environment. So when you are running ClusterControl with MariaDB 10.3 and MaxScale, you can easily add MariaDB AX into the mix and it will work with other parts of the setup.
MariaDB AX comes with tools which are intended to help with transferring the data from other sources. If you happen to use Kafka or Spark, there are connectors to use when importing data from those sources into MariaDB AX.
Additionally, even though regular replication between MariaDB TX (InnoDB) and MariaDB AX (ColumnStore) is not performing well due to ColumnStore limitations (it’s always better to do batch inserts in columnar datastores than single inserts, as it’s done on replication), it is possible to build a pipeline which would consist of MaxScale configured as binlog server and Avro CDC router, MaxScale CDC Data Adapter and MariaDB AX, which will receive data from the adapter almost in real time.
We hope this blog post will give you some insight into what MariaDB AX is and how it can be utilized alongside MariaDB TX environment deployed and managed by ClusterControl (download it for free!).