Running production environments, you may be wondering how your database is performing? Does it deliver a proper level of performance? How can we measure it? Database performance is a very vast topic, but we would like to dip into it a bit and discuss what to look for when talking about database performance.
How to Define Performance?
The first question we have to ask ourselves is: what is the unit in which we measure database performance? This alone is not an easy question to answer.
Queries per second (QPS)
The obvious choice would be to use queries per second or QPS. How many queries the database can execute in a given time period? The problem is one query is not the same as the other query. We can have INSERTs, UPDATEs, SELECTs. We can have simple queries that access the data using indexes or even primary key, we can have complex queries that join multiple tables. We can compare the performance of a single query or a specific, precisely controlled query mix but that’s just it.
In the real world, workload fluctuates and it is not really easy to tell what should be used as a set of queries that you will use to compare performance between different versions of configuration. In a given time you may come up with a query mix but if you would like to repeat benchmarks after a couple of months, most likely you will be facing a different query mix, making it hard to compare the performance over time.
Transactions per second (TPS)
This is another option – how many transactions can we execute in a given time period? This approach has pretty much the same set of issues as using QPS. The queries involved in a transaction will change over time, new types of transactions will be introduced. Calculating transactions per second may work for a given moment but it will be hard to compare results over time.
Let’s try to approach the topic from another perspective. What is the most important when we talk about performance? Is it the number of transactions or queries that we can execute per second? Would you be ok to improve QPS by 30% if you would have to wait twice as long for a given query to complete? You may ask how it could be possible? In fact, it is quite simple. You have to keep in mind that in most of the cases and in most of the databases one query can utilize just one CPU core. Yes, there are some situations where query can be parallely processed but let’s stick to the majority of the workload. So, a CPU core equals one query. What it means is that the fastest you can run your queries will be if you will run as many of them as you have CPU cores. This way we minimize the query execution time. On the other hand, we can try to maximize the total throughput. As you can imagine, the process of executing a query is not optimal and it does not utilize CPU fully. We can process more queries at the same time if we start queuing them and let the CPU scheduler deal with multiple processes or threads. Thanks to this we can significantly improve the throughput (calculated in queries per second) but we increase the execution time of the queries – each thread has now to share CPU with other threads. So, queries run slower but we still can run more of them at the same time.
Latency should also be predictable – users want the queries to run quickly but they might be ok if queries run slightly slower if that would improve the stability of the query execution time. It is because if there’s a thing that users do not like more than a slow application is the application that gets slow intermittently and for no good reason. Increasing throughput typically means that the latency will also be increased but, most importantly, it may become more unstable.
As you can see, this adds even more complexity to our performance discussion. Obviously, users want their applications to run fast and be responsive (so, we would like the latency to be low). On the other hand if we have many users, we also want to be able to run many queries at the same time (so, throughput should be high).
How to Change the Performance of the Database?
We have written about how we can measure the performance and that there are two major aspects to it: latency and throughput. Another important question to answer is how to change the database performance? In general, we are talking about a handful of options.
Obviously, performance is related to the available resources. If we improve the hardware the database is running on, it will impact the performance of the database. What exactly would be the improvement, it will depend on what has been changed and the type of the workload we are looking at. In short, there are two main types of workload.
CPU-bound workload is a situation where the performance is limited by the CPU resources. We are talking here about situations where the active dataset fits in the memory and the disk activity is minimal. It can be caused by a large volume of fast queries (index lookups for example) or a handful of long queries (heavy, analytical queries that involve JOINs or sorting and grouping). In this case, improving CPU performance by adding more cores or swapping the CPU to a more recent model that provides better performance per core can improve the overall performance of the database.
I/O-bound workload is a situation where we have a significant load on the I/O subsystem, typically it’s a disk. This may be caused by different situations but the most common ones are those two. First, your workload is write-heavy, you insert or modify a significant amount of data in your database. As a result, the amount of writes required to persist those modifications pile up and the disk drive becomes a bottleneck. Second most common case is when your active data set doesn’t fit in your memory. Active data set is the part of the data stored in your database that is frequently accessed by the application. Please keep in mind that you can have a significantly larger dataset than available memory but as long as the data is just kept on disk, it is not a problem. The problem arises when the database has to constantly swap data in and out from the memory to meet the needs of the application. In that case we observe an increase of read access to disk.
As you may expect, those two types of problems should be addressed differently on the hardware level. For CPU-bound traffic we should be looking into increasing the computational power of the server by providing more CPU cores or improving the per-core performance, which is typically the case, albeit the increase is quite low, when you change the CPU to a more recent model. If we are talking about I/O-bound workload, we may have two options for improvement. First, obviously, is to improve the performance of the disk subsystem. Add more drives to your RAID, use a more performant RAID level (RAID 10 instead of RAID 5 or 6), swap the disk drives to be more performant. Alternatively, if you are experiencing issues with heavy reads, you may attempt to increase the available memory: more memory will allow your database to cache more data from the active dataset in the memory, reducing the need of reading the non-cached data from disk.
Database has its own configuration where users can tune some of the settings to improve performance of the database. Some settings may be more suitable for CPU-bound workloads, some will be more suitable for I/O-bound workloads. You may hear about automated configuration tuning scripts or secret knowledge of the DBAs hidden deeply in StackOverflow or Quora. The reality is that, unless your database is completely unconfigured, tweaking the configuration is very unlikely to bring you huge performance gains. Yes, sure, you may improve your performance slightly but that’s that. Do not expect to be able to speed up your database tenfold.
What may improve your performance tenfold is query tuning. Rewriting queries in more efficient form, adding missing indexes. This is where you can see huge benefits just like on those nice screenshots, shared over the web, of different monitoring tools where you can see CPU utilization dropping from 90%+ to less than 10%. If the query is unnecessarily accessing even thousands of rows while with a proper index it can just access one row, then yes, this speeds things up significantly. Describing the query tuning process in detail is out of the scope of this blog post, but the gist is that you should be collecting the metrics related to the queries – their execution time, waits experienced by the query, number of rows read from the database, number of rows sent to the application.
More data the better and what exactly can be collected depends on the database type, but most of the datastores present some sort of performance data related to their queries. If you have access to tooling that helps you process this raw data, be it something built-in or external software, that’s even better. It should help you to build a better understanding of what is happening in the database, how it performs and what are the problematic queries.
Then, as a next step, you may want to attempt to understand how the problematic queries behave. Usually you have access to some sort of a query execution plan – detailed overview of the execution process as deemed optimal by the database’s optimizer. Again, details differ between different databases but we are talking here about how given data is being accessed, by which method, are any indexes involved and if yes, which ones? If we are talking about relational databases, you may expect to see what is the order of the tables in JOIN and what is the JOIN method that has been used. This should help you to decide if the execution plan is indeed optimal or maybe it is missing some potential improvements.
Once you figure out what the flaw is, you can attempt to fix it by improving indexing or even rewriting the query to a more optimal form. Please keep in mind that even if you are using external applications that you cannot modify, in some cases there are methods to rewrite queries on the fly. Typically it happens on the loadbalancer level.
Measuring the Performance
Once you are done with the tuning, you can wait a bit and see how the most important metrics of that query have changed. Is the query accessing less rows? Is it using indexes better? Does it execute faster? This is pretty much the process of measuring the database performance. You should be tracking the p99 of latency for all query types. You should be tracking other performance metrics for all query types.
You should attempt to tune the queries and then, through the constant metric collection, you should be able to tell how the most important metrics changed. Has the latency been reduced? Is it stable now? How much data given query type stores on disk? The same process applies for hardware change or configuration tuning. If you can plot the p99 latency in time, you can clearly see if the change that you made has impacted the performance and in which way? For better or for worse? So, the secret sauce is pretty much simple – collect the performance metrics all the time while your database is working. When you decide to make a change, you will have a clear view of the result of the change you have introduced.
As you can see, database performance is a vast topic and we hope you will find this blog post helpful. If you would like to share some of your thoughts, feel free to do it in the comments below.