Deep Dive into SQL Workload Analysis using pt-query-digest
In our previous post, we showed you how to interpret reports generated by pt-query-digest. Today we’d like to cover some of its more advanced features, as it is a pretty extensive tool with lots of functionality. We’ll also show you what information you should be looking for, and how to derive conclusions based on that data.
This is the eleventh installment in the ‘Become a MySQL DBA’ blog series. Our previous posts in the DBA series include Analyzing SQL Workload with pt-query-digest, Query Tuning Process, Configuration Tuning, Live Migration using MySQL Replication, Database Upgrades, Replication Topology Changes, Schema Changes, High Availability, Backup & Restore, Monitoring & Trending.
What to look for in the pt-query-digest summary section
When you get a report generated by pt-query-digest, the summary is a good place to start. If you are familiar with your workload, this can indicate whether the workload you captured looks “normal” or whether something has changed. This is true for the test scenario we’ve prepared here – we are familiar with the sysbench workload which we used to generate this slow log. The queries are fairly quick and do not suffer from any locking. We’ve added a couple of ad hoc queries to the mix to simulate non-optimal queries that somehow slipped through a QA process and started to impact production.
# Overall: 2.54M total, 25 unique, 2.07k QPS, 1.26x concurrency __________
# Time range: 2015-08-26 07:05:51 to 07:26:17
# Attribute total min max avg 95% stddev median
# ============ ======= ======= ======= ======= ======= ======= =======
# Exec time 1546s 1us 44s 608us 1ms 53ms 194us
# Lock time 162s 0 12s 63us 84us 13ms 28us
# Rows sent 40.53M 0 488.90k 16.73 97.36 764.93 0.99
# Rows examine 119.91M 0 4.77M 49.51 299.03 7.58k 0.99
# Rows affecte 5.42M 0 1.90M 2.24 0.99 1.66k 0
# Bytes sent 5.24G 0 95.38M 2.16k 11.91k 145.05k 192.76
# Merge passes 0 0 0 0 0 0 0
# Tmp tables 124.03k 0 1 0.05 0.99 0.22 0
# Tmp disk tbl 0 0 0 0 0 0 0
# Tmp tbl size 15.03G 0 124.12k 6.21k 123.85k 27.00k 0
# Query size 139.75M 5 247 57.70 158.58 53.40 36.69
# InnoDB:
# IO r bytes 14.70G 0 1.02G 6.74k 15.96k 1.69M 0
# IO r ops 940.81k 0 65.51k 0.42 0.99 107.46 0
# IO r wait 267s 0 4s 116us 568us 4ms 0
# pages distin 8.24M 0 59.41k 3.78 9.83 99.56 3.89
# queue wait 0 0 0 0 0 0 0
# rec lock wai 57s 0 12s 24us 0 14ms 0
# Boolean:
# Filesort 10% yes, 89% no
# Full scan 0% yes, 99% no
# Tmp table 5% yes, 94% no
In the summary above, you can notice a max execution time of 44 seconds and a max lock time of 12 seconds. This is definitely not a pattern we are used to – it has to be a result of the new queries. The max number of bytes sent by a query is also unusual for the sysbench workload; it's fairly well indexed, so not much data is normally sent to the client. This tells us a couple of things: we are dealing with queries that scan lots of rows, and at least some of them are DMLs, as we see a significant increase in lock time. The InnoDB section of the report tells us that this is row lock time.
Let’s check the summary of all queries; we are mostly interested in the execution-time-per-call column (R/Call). You may notice a lot of administrative queries like SHOW PROCESSLIST or SHOW TABLES – when the slow log threshold is low enough, MySQL logs every query, so you will see all kinds of entries in your slow log.
# Profile
# Rank Query ID Response time Calls R/Call V/M Item
# ==== ================== ============== ======= ======= ===== ===========
# 1 0x558CAEF5F387E929 385.0075 24.9% 1269596 0.0003 0.00 SELECT sbtest?
# 2 0x737F39F04B198EF6 171.1443 11.1% 127006 0.0013 0.00 SELECT sbtest?
# 3 0x6408196D4E419490 141.9605 9.2% 6 23.6601 0.17 SELECT sbtest?
# 4 0x84D1DEE77FA8D4C3 122.1600 7.9% 127006 0.0010 0.00 SELECT sbtest?
# 5 0x813031B8BBC3B329 107.4760 7.0% 126996 0.0008 0.00 COMMIT
# 6 0x3821AE1F716D5205 106.3605 6.9% 127006 0.0008 0.00 SELECT sbtest?
# 7 0xD30AD7E3079ABCE7 89.2480 5.8% 126986 0.0007 0.46 UPDATE sbtest?
# 8 0xE96B374065B13356 80.0196 5.2% 126973 0.0006 2.71 UPDATE sbtest?
# 9 0xEAB8A8A8BEEFF705 79.4696 5.1% 126964 0.0006 2.29 DELETE sbtest?
# 10 0x89E30DDBB4B37EEB 74.9633 4.8% 5 14.9927 13.34 INSERT SELECT sbtest?
# 11 0x6EEB1BFDCCF4EBCD 72.2472 4.7% 127006 0.0006 0.00 SELECT sbtest?
# 12 0x30B616F874042DCA 52.4873 3.4% 3 17.4958 11.48 DELETE sbtest?
# 13 0xF1256A27240AEFC7 46.7403 3.0% 126968 0.0004 0.00 INSERT sbtest?
# 14 0x85FFF5AA78E5FF6A 15.9546 1.0% 126840 0.0001 0.00 BEGIN
# 15 0x43330BA654921FCF 0.4734 0.0% 2 0.2367 0.31 TRUNCATE TABLE sbtest? sbtest2
# 16 0x7EEF4A697C2710A5 0.1043 0.0% 40 0.0026 0.04 SHOW PROCESSLIST
# 17 0x86780A22FD730AC4 0.0991 0.0% 1 0.0991 0.00 CREATE TABLE sbtest? sbtest2
# 18 0x45609ADE347419D3 0.0488 0.0% 97 0.0005 0.00 SHOW PROCESSLIST
# 19 0xE3A3649C5FAC418D 0.0381 0.0% 8 0.0048 0.01 SELECT
# 20 0x35AAC71503957920 0.0319 0.0% 1 0.0319 0.00 FLUSH
# 21 0x5CBA2034458B5BC9 0.0191 0.0% 1 0.0191 0.00 SHOW DATABASES
# 22 0xFB38455BF7035609 0.0041 0.0% 1 0.0041 0.00 SELECT
# 23 0x132628303F99240D 0.0013 0.0% 1 0.0013 0.00 SHOW TABLES
# 24 0x44AE35A182869033 0.0009 0.0% 1 0.0009 0.00 ADMIN INIT DB
# 25 0xAA353644DE4C4CB4 0.0000 0.0% 8 0.0000 0.00 ADMIN QUIT
From what we can tell, three queries are much slower than the rest (queries 3, 10 and 12). They did not generate the majority of the load on the system, but they are likely responsible for its performance degradation. Also, keep in mind that the query which generates the largest load may not be the root of the problem – you have to go through the report query by query.
Another interesting piece of data is the variance-to-mean ratio (V/M) column, also known as the index of dispersion. We’ll paste the relevant rows here to make them easier to see:
# Rank Query ID Response time Calls R/Call V/M Item
# ==== ================== ============== ======= ======= ===== ===========
# 7 0xD30AD7E3079ABCE7 89.2480 5.8% 126986 0.0007 0.46 UPDATE sbtest?
# 8 0xE96B374065B13356 80.0196 5.2% 126973 0.0006 2.71 UPDATE sbtest?
# 9 0xEAB8A8A8BEEFF705 79.4696 5.1% 126964 0.0006 2.29 DELETE sbtest?
As you can see, all of those queries are fast most of the time – that’s what the R/Call column shows. Yet, from time to time, those queries take much longer to finish, as the high V/M values indicate.
Summary of the queries – what to check for?
By now we have some idea of what might have gone wrong with the new queries, but we still should go query by query to confirm our suspicions.
First, the query causing the largest impact on the system:
# Query 1: 1.05k QPS, 0.32x concurrency, ID 0x558CAEF5F387E929 at byte 795398174
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.00
# Time range: 2015-08-26 07:06:13 to 07:26:17
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 49 1269596
# Exec time 24 385s 1us 548ms 303us 839us 1ms 144us
# Lock time 31 52s 0 54ms 40us 80us 167us 25us
# Rows sent 2 1.21M 1 1 1 1 0 1
# Rows examine 1 1.21M 1 1 1 1 0 1
# Rows affecte 0 0 0 0 0 0 0 0
# Bytes sent 4 236.10M 195 195 195 195 0 195
# Merge passes 0 0 0 0 0 0 0 0
# Tmp tables 0 0 0 0 0 0 0 0
# Tmp disk tbl 0 0 0 0 0 0 0 0
# Tmp tbl size 0 0 0 0 0 0 0 0
# Query size 32 46.01M 37 38 38.00 36.69 0 36.69
# InnoDB:
# IO r bytes 21 3.12G 0 48.00k 2.57k 15.96k 5.95k 0
# IO r ops 21 199.39k 0 3 0.16 0.99 0.37 0
# IO r wait 38 102s 0 261ms 80us 445us 568us 0
# pages distin 26 2.22M 0 9 1.83 3.89 1.93 0
# queue wait 0 0 0 0 0 0 0 0
# rec lock wai 0 0 0 0 0 0 0 0
# String:
# Databases sbtest
# Hosts 10.0.0.200
# InnoDB trxID 26307 (10/0%), 26308 (10/0%), 2630B (10/0%)... 127003 more
# Last errno 0
# Users sbtest
# Query_time distribution
# 1us #
# 10us #
# 100us ################################################################
# 1ms ##
# 10ms #
# 100ms #
# 1s
# 10s+
# Tables
# SHOW TABLE STATUS FROM `sbtest` LIKE 'sbtest1'\G
# SHOW CREATE TABLE `sbtest`.`sbtest1`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT c FROM sbtest1 WHERE id=2396269\G
This is a no-brainer. The query is based on a primary key or a unique key (EXPLAIN can show us exactly which index; the column name is ‘id’, so the PK is more likely), as it scans no more than a single row. If the query used a covering index (all data accessible from the index pages alone, no row lookup needed), it would be about the most optimal query you can execute in MySQL. If not, it’s still very close to perfect. Sometimes queries cannot be optimized further, even if you see them at the top of pt-query-digest reports. The only ways to limit the impact of this query would be to add a covering index (not much gain), not to execute it at all (perhaps not very realistic), or to execute it less often through aggressive caching. That’s not always possible, but sometimes it is a viable option.
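As a sketch of the covering-index option mentioned above (the index name is our own): with a secondary index containing both ‘id’ and ‘c’, InnoDB can answer the query from the index alone, without touching the clustered row:

```sql
-- Hypothetical covering index for: SELECT c FROM sbtest1 WHERE id=?
-- The (id, c) index holds all referenced columns, so no row lookup is needed.
ALTER TABLE sbtest1 ADD INDEX idx_id_c (id, c);
```

As the text notes, the gain over a plain PK lookup is modest, so this is only worth it for extremely hot queries.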
Second query:
# Query 2: 105.49 QPS, 0.14x concurrency, ID 0x737F39F04B198EF6 at byte 409226517
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.00
# Time range: 2015-08-26 07:06:13 to 07:26:17
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 5 127006
# Exec time 11 171s 332us 293ms 1ms 3ms 2ms 839us
# Lock time 4 7s 0 33ms 57us 119us 235us 35us
# Rows sent 29 12.11M 100 100 100 100 0 100
# Rows examine 30 36.34M 300 300 300 300 0 300
# Rows affecte 0 0 0 0 0 0 0 0
# Bytes sent 28 1.48G 12.18k 12.18k 12.18k 12.18k 0 12.18k
# Merge passes 0 0 0 0 0 0 0 0
# Tmp tables 99 124.03k 1 1 1 1 0 1
# Tmp disk tbl 0 0 0 0 0 0 0 0
# Tmp tbl size 100 15.03G 124.12k 124.12k 124.12k 124.12k 0 124.12k
# Query size 7 9.81M 79 81 81.00 80.10 0.01 80.10
# InnoDB:
# IO r bytes 4 749.03M 0 64.00k 6.04k 31.59k 12.76k 0
# IO r ops 4 46.81k 0 4 0.38 1.96 0.79 0
# IO r wait 10 27s 0 292ms 211us 1ms 1ms 0
# pages distin 6 572.71k 4 7 4.62 5.75 0.64 4.96
# queue wait 0 0 0 0 0 0 0 0
# rec lock wai 0 0 0 0 0 0 0 0
# Boolean:
# Filesort 100% yes, 0% no
# Tmp table 100% yes, 0% no
# String:
# Databases sbtest
# Hosts 10.0.0.200
# InnoDB trxID 26307 (1/0%), 26308 (1/0%), 2630B (1/0%)... 127003 more
# Last errno 0
# Users sbtest
# Query_time distribution
# 1us
# 10us
# 100us ################################################################
# 1ms #############################################
# 10ms #
# 100ms #
# 1s
# 10s+
# Tables
# SHOW TABLE STATUS FROM `sbtest` LIKE 'sbtest1'\G
# SHOW CREATE TABLE `sbtest`.`sbtest1`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT DISTINCT c FROM sbtest1 WHERE id BETWEEN 2835942 AND 2835942+99 ORDER BY c\G
This query is responsible for all of the temporary tables that were created. Unfortunately, because of the range condition on the ‘id’ column, there’s no easy way to avoid creating them. What’s also very important: this query is fast – 1ms on average, 3ms at the 95th percentile. It’s very unlikely that you’d be able to improve it much further, and sometimes there’s no point in trying to reach perfection – your time is better spent tuning other queries. You just have to settle for ‘good enough’.
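If you want to confirm where the temporary table and filesort come from, EXPLAIN shows it directly. A sketch (the ‘Extra’ values noted in the comment are what we would expect for this query shape, not output captured from a live server):

```sql
-- Range scan on the PK, then DISTINCT + ORDER BY on the non-indexed
-- column 'c' force a temporary table and a filesort.
EXPLAIN SELECT DISTINCT c
  FROM sbtest1
 WHERE id BETWEEN 2835942 AND 2835942+99
 ORDER BY c\G
-- Expected Extra: Using where; Using temporary; Using filesort
```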
Third query, also the first of the queries we suspected:
# Query 3: 0.01 QPS, 0.22x concurrency, ID 0x6408196D4E419490 at byte 831892659
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.17
# Time range: 2015-08-26 07:06:49 to 07:17:48
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 0 6
# Exec time 9 142s 22s 28s 24s 27s 2s 23s
# Lock time 0 3ms 55us 3ms 493us 3ms 935us 82us
# Rows sent 7 2.86M 486.91k 488.90k 488.12k 485.50k 0 485.50k
# Rows examine 23 28.61M 4.77M 4.77M 4.77M 4.77M 0 4.77M
# Rows affecte 0 0 0 0 0 0 0 0
# Bytes sent 10 571.38M 94.99M 95.38M 95.23M 92.12M 0 92.12M
# Merge passes 0 0 0 0 0 0 0 0
# Tmp tables 0 0 0 0 0 0 0 0
# Tmp disk tbl 0 0 0 0 0 0 0 0
# Tmp tbl size 0 0 0 0 0 0 0 0
# Query size 0 246 41 41 41 41 0 41
# InnoDB:
# IO r bytes 41 6.13G 1.02G 1.02G 1.02G 1006.04M 0 1006.04M
# IO r ops 41 392.26k 65.26k 65.51k 65.38k 62.55k 0 62.55k
# IO r wait 4 13s 1s 4s 2s 4s 775ms 2s
# pages distin 4 356.46k 59.41k 59.41k 59.41k 56.74k 0 56.74k
# queue wait 0 0 0 0 0 0 0 0
# rec lock wai 0 0 0 0 0 0 0 0
# Boolean:
# Full scan 100% yes, 0% no
# String:
# Databases sbtest
# Hosts localhost
# InnoDB trxID 271D7 (1/16%), 27C90 (1/16%), 28AB5 (1/16%)... 3 more
# Last errno 0
# Users root
# Query_time distribution
# 1us
# 10us
# 100us
# 1ms
# 10ms
# 100ms
# 1s
# 10s+ ################################################################
# Tables
# SHOW TABLE STATUS FROM `sbtest` LIKE 'sbtest1'\G
# SHOW CREATE TABLE `sbtest`.`sbtest1`\G
# EXPLAIN /*!50100 PARTITIONS*/
select * from sbtest1 where pad like '2%'\G
Ok, we can see very clearly why this query is slow: it’s doing a full table scan, examining almost 5 million rows in the process. The amount of work needed to perform this task can definitely impact MySQL’s performance, especially if all of this data has to be read from disk.
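One possible mitigation, sketched here as an assumption rather than a tested fix: since the LIKE pattern has no leading wildcard, an index (or prefix index) on ‘pad’ would let the optimizer use a range scan instead. Whether it actually helps depends on how selective ‘2%’ is – here roughly 10% of the examined rows are sent back, so the gain may be limited:

```sql
-- Hypothetical prefix index on 'pad'. LIKE '2%' is left-anchored,
-- so a range scan over this index can replace the full table scan.
ALTER TABLE sbtest1 ADD INDEX idx_pad (pad(16));
```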
Now, let’s take a look at one of the ‘unstable’ DML’s:
# Query 7: 105.47 QPS, 0.07x concurrency, ID 0xD30AD7E3079ABCE7 at byte 466984796
# This item is included in the report because it matches --limit.
# Scores: V/M = 0.46
# Time range: 2015-08-26 07:06:13 to 07:26:17
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 5 126986
# Exec time 5 89s 1us 6s 702us 2ms 18ms 287us
# Lock time 10 17s 0 6s 135us 131us 18ms 38us
# Rows sent 0 0 0 0 0 0 0 0
# Rows examine 0 124.01k 1 1 1 1 0 1
# Rows affecte 2 124.01k 1 1 1 1 0 1
# Bytes sent 0 6.30M 52 52 52 52 0 52
# Merge passes 0 0 0 0 0 0 0 0
# Tmp tables 0 0 0 0 0 0 0 0
# Tmp disk tbl 0 0 0 0 0 0 0 0
# Tmp tbl size 0 0 0 0 0 0 0 0
# Query size 3 4.97M 40 41 41.00 40.45 0.01 40.45
# InnoDB:
# IO r bytes 2 346.02M 0 96.00k 2.79k 15.96k 6.33k 0
# IO r ops 2 21.63k 0 6 0.17 0.99 0.39 0
# IO r wait 6 17s 0 97ms 136us 657us 804us 0
# pages distin 12 1.03M 4 25 8.47 11.95 2.22 7.70
# queue wait 0 0 0 0 0 0 0 0
# rec lock wai 16 9s 0 6s 72us 0 18ms 0
# String:
# Databases sbtest
# Hosts 10.0.0.200
# InnoDB trxID 26307 (1/0%), 26308 (1/0%), 2630B (1/0%)... 126983 more
# Last errno 0
# Users sbtest
# Query_time distribution
# 1us #
# 10us #
# 100us ################################################################
# 1ms #########
# 10ms #
# 100ms #
# 1s #
# 10s+
# Tables
# SHOW TABLE STATUS FROM `sbtest` LIKE 'sbtest1'\G
# SHOW CREATE TABLE `sbtest`.`sbtest1`\G
UPDATE sbtest1 SET k=k+1 WHERE id=2274156\G
# Converted for EXPLAIN
# EXPLAIN /*!50100 PARTITIONS*/
select k=k+1 from sbtest1 where id=2274156\G
Please take a look at the query execution time distribution – it’s a nice graphical presentation of an important property of every query: predictable, stable performance. In this case we can clearly see that the execution time of this query is not stable – most of the time it finishes in the 100µs – 10ms range, but it can as well take 1 – 10 seconds to finish. This is clearly a sign of a problem. Looking at rows examined and rows affected, it’s clear the query uses either the PK or a unique key, as it scans and updates one row only. Now look at the lock time and execution time – the max is 6 seconds in both cases. This means the query had to wait before it could be executed. Finally, the InnoDB section confirms that the query was waiting on row locks, not on I/O.
Let’s check one of the long-running DML’s:
# Query 10: 0.03 QPS, 0.38x concurrency, ID 0x89E30DDBB4B37EEB at byte 369866706
# This item is included in the report because it matches --limit.
# Scores: V/M = 13.34
# Time range: 2015-08-26 07:09:42 to 07:13:01
# Attribute pct total min max avg 95% stddev median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count 0 5
# Exec time 4 75s 5s 44s 15s 42s 14s 9s
# Lock time 0 37ms 181us 14ms 7ms 14ms 4ms 7ms
# Rows sent 0 0 0 0 0 0 0 0
# Rows examine 2 2.76M 194.33k 1.71M 564.26k 1.69M 587.22k 283.86k
# Rows affecte 50 2.76M 194.33k 1.71M 564.26k 1.69M 587.22k 283.86k
# Bytes sent 0 306 61 62 61.20 59.77 0 59.77
# Merge passes 0 0 0 0 0 0 0 0
# Tmp tables 0 0 0 0 0 0 0 0
# Tmp disk tbl 0 0 0 0 0 0 0 0
# Tmp tbl size 0 0 0 0 0 0 0 0
# Query size 0 380 73 77 76 76.28 1.78 76.28
# InnoDB:
# IO r bytes 4 629.86M 39.30M 383.56M 125.97M 379.16M 129.77M 56.55M
# IO r ops 4 39.37k 2.46k 23.97k 7.87k 23.58k 8.07k 3.52k
# IO r wait 1 4s 159ms 1s 809ms 1s 470ms 992ms
# pages distin 0 59.18k 3.01k 27.79k 11.84k 27.29k 8.26k 9.33k
# queue wait 0 0 0 0 0 0 0 0
# rec lock wai 0 23ms 0 21ms 5ms 21ms 8ms 0
# String:
# Databases sbtest
# Hosts localhost
# InnoDB trxID 3187A (1/20%), 32D01 (1/20%), 364D8 (1/20%)... 2 more
# Last errno 0
# Users root
# Query_time distribution
# 1us
# 10us
# 100us
# 1ms
# 10ms
# 100ms
# 1s ################################################################
# 10s+ ##########################################
# Tables
# SHOW TABLE STATUS FROM `sbtest` LIKE 'sbtest2'\G
# SHOW CREATE TABLE `sbtest`.`sbtest2`\G
# SHOW TABLE STATUS FROM `sbtest` LIKE 'sbtest1'\G
# SHOW CREATE TABLE `sbtest`.`sbtest1`\G
insert into sbtest2 select * from sbtest1 where id > 210000 and id < 2000000\G
Here we have an INSERT INTO … SELECT query, which will lock all of the rows it has to copy. As you can see in the execution time row, it took up to 44 seconds – a significant amount of time during which those rows stayed unavailable to any other queries. It is very important to keep in mind that operations like this should be performed on a small subset of rows at a time. The exact number depends on the table and your hardware, but we’d say 1,000 rows per batch is a good starting point. This way you minimize both the execution time of a given batch and the number of locked rows, and the whole process should have less impact on the rest of the workload.
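The batching approach described above could be sketched like this (range boundaries and batch size are illustrative, matching the query from the report):

```sql
-- Copy rows in small PK-based batches instead of one huge statement;
-- each batch locks only ~1,000 rows and commits quickly.
INSERT INTO sbtest2
SELECT * FROM sbtest1
 WHERE id > 210000 AND id <= 211000;
-- ...then repeat, advancing the range by 1,000 ids each time,
-- until the upper bound (2000000) is reached.
```

Driving the loop from application code or a small script also lets you pause between batches, further reducing the impact on concurrent traffic.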
Ordering data in a report
As we mentioned in the previous post in the series, pt-query-digest, by default, sorts queries by their total execution time. This is a good choice most of the time but there are some cases when you’d want to change this default behavior.
For that you can use the --order-by option of pt-query-digest. Its syntax is ‘attribute:how_to_aggregate’. Most of the time you’ll be fine aggregating the data using ‘sum’ – e.g., total execution time or the sum of rows examined. Sometimes ‘max’ or ‘min’ may be useful, e.g., to sort results by the maximum or minimum number of rows sent. For attributes, the easiest way to find what’s available is by running:
$ head -n 50 /var/lib/mysql/slow.log | pt-query-digest --filter 'print Dumper $event' --no-report --sample 1
You should see output like:
$VAR1 = {
Bytes_sent => '11',
Filesort => 'No',
Filesort_on_disk => 'No',
Full_join => 'No',
Full_scan => 'No',
Killed => '0',
Last_errno => '0',
Lock_time => '0.000000',
Merge_passes => '0',
QC_Hit => 'No',
Query_time => '0.000080',
Rows_affected => '0',
Rows_examined => '0',
Rows_sent => '0',
Thread_id => '4',
Tmp_disk_tables => '0',
Tmp_table => 'No',
Tmp_table_on_disk => 'No',
Tmp_table_sizes => '0',
Tmp_tables => '0',
arg => 'BEGIN',
bytes => 5,
cmd => 'Query',
db => 'sbtest',
fingerprint => 'begin',
host => '10.0.0.200',
ip => '10.0.0.200',
pos_in_log => 1793,
timestamp => '1440572773',
ts => '150826 7:06:13',
user => 'sbtest'
};
This output contains all the available attributes. Not every one of them makes sense as a way of ordering the results, but many do.
Let’s say that we are concerned that some of the queries are scanning rows excessively and we want to locate them. You can use the following invocation to sort the report accordingly:
$ pt-query-digest --limit=100% --order-by=Rows_examined:sum /var/lib/mysql/slow.log
Here’s an example from our slowlog:
# Profile
# Rank Query ID Response time Calls R/Call V/M Item
# ==== ================== ============== ======= ======= ===== ===========
# 1 0x737F39F04B198EF6 171.1443 11.1% 127006 0.0013 0.00 SELECT sbtest?
# 2 0x6408196D4E419490 141.9605 9.2% 6 23.6601 0.17 SELECT sbtest?
# 3 0x84D1DEE77FA8D4C3 122.1600 7.9% 127006 0.0010 0.00 SELECT sbtest?
# 4 0x3821AE1F716D5205 106.3605 6.9% 127006 0.0008 0.00 SELECT sbtest?
# 5 0x6EEB1BFDCCF4EBCD 72.2472 4.7% 127006 0.0006 0.00 SELECT sbtest?
# 6 0x89E30DDBB4B37EEB 74.9633 4.8% 5 14.9927 13.34 INSERT SELECT sbtest?
# 7 0x30B616F874042DCA 52.4873 3.4% 3 17.4958 11.48 DELETE sbtest?
# 8 0x558CAEF5F387E929 385.0075 24.9% 1269596 0.0003 0.00 SELECT sbtest?
# 9 0xD30AD7E3079ABCE7 89.2480 5.8% 126986 0.0007 0.46 UPDATE sbtest?
# 10 0xE96B374065B13356 80.0196 5.2% 126973 0.0006 2.71 UPDATE sbtest?
# 11 0xEAB8A8A8BEEFF705 79.4696 5.1% 126964 0.0006 2.29 DELETE sbtest?
# 12 0x3D8C074C26D05CD7 0.0084 0.0% 3 0.0028 0.00 SELECT query_review
# 13 0x5CBA2034458B5BC9 0.0417 0.0% 5 0.0083 0.00 SHOW DATABASES
# 14 0xF4F497B76C5267E1 0.0309 0.0% 2 0.0154 0.03 SELECT query_review
# 15 0x132628303F99240D 0.0048 0.0% 7 0.0007 0.00 SHOW TABLES
# 16 0x7DD5F6760F2D2EBB 0.0108 0.0% 3 0.0036 0.01 SHOW VARIABLES
# 17 0x641B2CB5DDEE851E 0.0002 0.0% 3 0.0001 0.00 SHOW TABLES
# 18 0x9FE96489A777ECF1 0.0005 0.0% 3 0.0002 0.00 SHOW DATABASES
# 19 0x60E05F7AE7229A61 0.0002 0.0% 1 0.0002 0.00 SELECT query_review
# MISC 0xMISC 171.4946 11.1% 381154 0.0004 0.0 <40 ITEMS>
As you can see, our ‘not-optimal’ queries moved up in the report because they performed a lot of table scans and, as a result, examined a lot of rows.
Let’s imagine that we want to find the queries which affect the most rows. We suspect that some of the DMLs lock a large number of rows because of missing indexes or bad design. We can use --order-by=Rows_affected:sum for that:
$ pt-query-digest --limit=100% --order-by=Rows_affected:sum /var/lib/mysql/slow.log
Here’s output for our workload:
# Profile
# Rank Query ID Response time Calls R/Call V/M Item
# ==== ================== =============== ======= ======= ===== ==========
# 1 0x89E30DDBB4B37EEB 74.9633 4.8% 5 14.9927 13.34 INSERT SELECT sbtest?
# 2 0x30B616F874042DCA 52.4873 3.4% 3 17.4958 11.48 DELETE sbtest?
# 3 0xD30AD7E3079ABCE7 89.2480 5.8% 126986 0.0007 0.46 UPDATE sbtest?
# 4 0xF1256A27240AEFC7 46.7403 3.0% 126968 0.0004 0.00 INSERT sbtest?
# 5 0xEAB8A8A8BEEFF705 79.4696 5.1% 126964 0.0006 2.29 DELETE sbtest?
# 6 0xE96B374065B13356 80.0196 5.2% 126973 0.0006 2.71 UPDATE sbtest?
# 7 0xE3976283BBC14F4E 0.0419 0.0% 41 0.0010 0.01 INSERT UPDATE percona_schema.query_review
# 8 0xC06E8004D54640A9 0.0003 0.0% 3 0.0001 0.00 CREATE DATABASE `percona_schema`
# 9 0xA1D2DFF5912E5AD9 0.0242 0.0% 1 0.0242 0.00 DROP DATABASE percona
# 10 0x7231E892053F2CC3 0.0706 0.0% 1 0.0706 0.00 CREATE DATABASE percona
# MISC 0xMISC 1123.5224 72.6% 2031784 0.0006 0.0 <47 ITEMS>
Our “bad” DMLs took the top two places in the report. This is not unexpected given the large number of rows they affected compared to the well-optimized writes performed by sysbench. Being able to reorder the report helps us find such queries quickly.
Collect review data for pt-query-digest
If you use pt-query-digest for regular reviews, there is a pretty neat feature that you can use. When you start it with the --review option and pass the correct DSN to it, pt-query-digest will create a percona_schema.query_review table with the following schema:
mysql> show create table query_review\G
*************************** 1. row ***************************
Table: query_review
Create Table: CREATE TABLE `query_review` (
`checksum` bigint(20) unsigned NOT NULL,
`fingerprint` text NOT NULL,
`sample` text NOT NULL,
`first_seen` datetime DEFAULT NULL,
`last_seen` datetime DEFAULT NULL,
`reviewed_by` varchar(20) DEFAULT NULL,
`reviewed_on` datetime DEFAULT NULL,
`comments` text,
PRIMARY KEY (`checksum`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
1 row in set (0.01 sec)
This table will be used to store data about queries included in a report. The data will look something like this:
*************************** 1. row ***************************
checksum: 6164494361339357481
fingerprint: select c from sbtest? where id=?
sample: SELECT c FROM sbtest1 WHERE id=2396269
first_seen: 2015-08-26 07:06:13
last_seen: 2015-08-26 07:26:17
reviewed_by: NULL
reviewed_on: NULL
comments: NULL
We can see the checksum of the query (the decimal version of the hex checksum used in the pt-query-digest report), its fingerprint, a sample, and the times when the query was first and last seen. Next, we have columns which can be updated manually – if you set the ‘reviewed_by’ column to some value, the query will be skipped in the next report. This is very handy because you don’t have to go through all of the queries – you are presented only with those you haven’t reviewed yet. To be precise, the summary section will still contain all of the queries, but query details are printed only for those which haven’t been marked as reviewed.
The remaining two columns can be used to store information about when a query was reviewed and any comments related to it. It is possible to add extra columns if you want to store additional data. If you want to locate a query using its checksum as shown in the report, you can run the following query:
select * from query_review where checksum=0xDA69864C6215FEDB\G
*************************** 1. row ***************************
checksum: 15738258035444154075
fingerprint: set @@sql_quote_show_create = ?/*!?, @@sql_mode=?*/
sample: SET @@SQL_QUOTE_SHOW_CREATE = 1/*!40101, @@SQL_MODE='NO_AUTO_VALUE_ON_ZERO,NO_ENGINE_SUBSTITUTION'*/
first_seen: 2015-08-27 14:45:41
last_seen: 2015-08-27 14:50:02
reviewed_by: Severalnines
reviewed_on: NULL
comments: NULL
1 row in set (0.00 sec)
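Marking a query as reviewed is a plain UPDATE against this table. A minimal sketch, reusing the checksum from the example above (the reviewer name and comment are illustrative):

```sql
-- Once reviewed_by is set, pt-query-digest skips this query's details
-- in subsequent --review reports.
UPDATE percona_schema.query_review
   SET reviewed_by = 'dba_name',
       reviewed_on = NOW(),
       comments    = 'PK lookup, nothing to optimize'
 WHERE checksum = 6164494361339357481;
```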
Collect query performance metrics using pt-query-digest
Another feature, very similar to what we just covered, is the --history option, which creates a percona_schema.query_history table. This table contains the performance metrics collected by pt-query-digest – all the metrics that you see in the report.
It’s a great way to understand how query performance changes over time. Let’s say we are interested in a particular query and its locking profile:
mysql> select sample, Query_time_max, Lock_time_sum, Lock_time_max, InnoDB_rec_lock_wait_max, InnoDB_IO_r_wait_max from percona_schema.query_history where checksum=15207204461855030503\G
*************************** 1. row ***************************
sample: UPDATE sbtest1 SET k=k+1 WHERE id=2633533
Query_time_max: 0.016347
Lock_time_sum: 0.365343
Lock_time_max: 0.007466
InnoDB_rec_lock_wait_max: 0
InnoDB_IO_r_wait_max: 0.010521
*************************** 2. row ***************************
sample: UPDATE sbtest1 SET k=k+1 WHERE id=2484959
Query_time_max: 28.7082
Lock_time_sum: 80.6104
Lock_time_max: 28.7007
InnoDB_rec_lock_wait_max: 28.7006
InnoDB_IO_r_wait_max: 0.063314
2 rows in set (0.00 sec)
It shows that this query slowed down due to waits on row level locks. This table is also a good source of data for graphs and performance reports.
This ends our guided tour of pt-query-digest – we hope you found it useful. Our next blog will be about indexing: what types of indexes are available, how they affect performance, and how to design indexes to fit a query mix.