MySQL Troubleshooting with pt-stalk: Part 1
In our previous posts, we covered the different log files and how to use them to troubleshoot MySQL, but logs are not all that MySQL has to offer. What do you do when they fail to pinpoint the problem? In this post, we will take a closer look at pt-stalk – a tool which may help you understand what is going on with MySQL when the standard approach fails to catch the problem.
This is the nineteenth installment in the ‘Become a MySQL DBA’ blog series. Our previous posts in the DBA series include:
- Troubleshooting Galera cluster issues – part 2,
- Troubleshooting Galera cluster issues – part 1,
- Galera cluster diagnostic logs,
- Understanding the MySQL error log,
- Optimizing Hints for faster queries,
- Using EXPLAIN to improve SQL Queries,
- Database Indexing,
- Deep Dive pt-query-digest,
- Analyzing SQL Workload with pt-query-digest,
- Query Tuning Process,
- Configuration Tuning,
- Live Migration using MySQL Replication,
- Database Upgrades,
- Replication Topology Changes,
- Schema Changes,
- High Availability,
- Backup & Restore,
- Monitoring & Trending.
Logs are not enough for troubleshooting
In the recent posts of our ‘Become a MySQL DBA’ series, we’ve discussed what kind of useful data the MySQL and Galera logs can bring us. This data is great for debugging potential issues, but it’s not everything that MySQL has to offer. There are multiple ways in which MySQL presents its internal state – we are talking here about data presented in the output of ‘SHOW GLOBAL STATUS’ or ‘SHOW ENGINE INNODB STATUS’. When it comes to row locking contention, the following queries can give you a good understanding of what is going on:
SELECT * FROM information_schema.INNODB_TRX\G
SELECT * FROM information_schema.INNODB_LOCK_WAITS\G
They present you with information about open transactions and transactions in LOCK_WAIT state. Also, we should not forget that MySQL doesn’t run in a vacuum – operating system state and its settings can impact how MySQL performs. Hardware has its own limits too – memory utilization, CPU utilization, disk load. Many times, issues you’ll face will not be that complex. But you never know when knowledge about how the CPU is handling interrupts, or how memory is allocated by MySQL, may help you in dealing with seemingly weird issues.
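Going back to the two row locking queries above, the same views can also be joined to see exactly which transaction blocks which. A minimal sketch, assuming the standard INNODB_TRX and INNODB_LOCK_WAITS columns (the aliases are just for readability), wrapped in a mysql call similar to the ones used later in this post:
mysql --user=root --password=pass -e "
SELECT r.trx_id AS waiting_trx, r.trx_mysql_thread_id AS waiting_thread, r.trx_query AS waiting_query,
       b.trx_id AS blocking_trx, b.trx_mysql_thread_id AS blocking_thread, b.trx_query AS blocking_query
FROM information_schema.INNODB_LOCK_WAITS w
JOIN information_schema.INNODB_TRX r ON r.trx_id = w.requesting_trx_id
JOIN information_schema.INNODB_TRX b ON b.trx_id = w.blocking_trx_id\G"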
The main problem with this data is that it is available and meaningful only while the issue is unfolding – you may run queries to collect info on row locking and you can run external tools to grab data on CPU utilization, but if you do it after the incident, this data most likely won’t help much in solving the problem.
Obviously, it’s not always possible to have someone online whenever something is happening to the database. This is especially true for transient, hard-to-reproduce crashes or slowdowns. Of course, you can (and many do) build your own set of scripts to monitor MySQL status and collect the data. Luckily, it’s not really needed – there’s a great tool designed to do just that: pt-stalk from Percona Toolkit.
How to use pt-stalk?
Before we cover what type of data you can get out of pt-stalk, we need to discuss how to use it. There are a couple of ways you can run pt-stalk. By default, it starts and watches for a predefined condition to occur.
Let’s take a look at an example of pt-stalk invocation:
root@ip-172-30-4-156:~# pt-stalk --password=pass --function=status --variable=Threads_running --threshold=25 --cycles=5 --interval=1 --iterations=2
Most of those settings are the defaults (only ‘–iterations’ has to be passed explicitly). What do they mean?
The value passed to ‘–function’ defines what type of data pt-stalk should monitor. The default setting, ‘status’, basically means the output of ‘SHOW GLOBAL STATUS’. In this case, the value passed to ‘–variable’ defines which counter from the status output should be monitored – by default it’s Threads_running.
Another option you can pass to ‘–function’ is ‘processlist’. This will make pt-stalk focus on the contents of ‘SHOW PROCESSLIST’. In this case, ‘–variable’ defines which column of the output should be watched, and the argument passed to ‘–match’ defines a pattern to match against that column – pt-stalk then counts the matching rows and compares that number to the threshold.
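For illustration, a hypothetical invocation (the state and threshold values are made up) that fires when more than 10 threads sit in the ‘statistics’ state might look like this:
root@ip-172-30-4-156:~# pt-stalk --password=pass --function=processlist --variable=State --match=statistics --threshold=10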
The last value you can pass to ‘–function’ is a file name. This allows you to build much more complex triggering conditions by having pt-stalk execute a user-defined script or function. We are going to cover this, along with an example of how to use it, later in this post.
‘–threshold’ defines when pt-stalk triggers data collection. In general, the idea is that each of the methods we just discussed ends up returning an integer value. This can be a counter from SHOW GLOBAL STATUS – for example, the number of currently running threads. When watching the processlist output, it can be, for example, the number of queries in a given state (e.g. ‘statistics’ or ‘query end’) or the number of queries executed by a given MySQL user. If you define a function via an external file, it also has to return some kind of integer to make things work.
In our example, the threshold is set to 25 so if more than 25 threads are running at the same time, the threshold will be crossed.
Another option we’ve passed to pt-stalk is ‘–cycles’, with a default setting of 5. This number defines how many consecutive times the threshold has to be crossed before the actual data collection is triggered. Sometimes it’s not easy to differentiate between the normal workload and an issue-related condition just by looking at the threshold value. Let’s say that in our case the workload is expected to occasionally spike above 25 threads running simultaneously, but when the real issue happens, it stays high for a much longer time. That’s why we set ‘–cycles’ to 5 – a single spike won’t trigger data collection, but if the threshold is crossed for a longer period of time, pt-stalk will start doing its job. The check happens every ‘–interval’ seconds; by default it’s set to 1 second.
The last option we used is ‘–iterations’. It tells pt-stalk to stop after collecting data ‘–iterations’ times. In our case, data collection was triggered twice and then pt-stalk terminated.
Example output from pt-stalk may look like:
2015_12_03_12_41_51 Starting /usr/bin/pt-stalk --function=status --variable=Threads_running --threshold=25 --match= --cycles=5 --interval=1 --iterations=2 --run-time=30 --sleep=300 --dest=/var/lib/pt-stalk --prefix= --notify-by-email= --log=/var/log/pt-stalk.log --pid=/var/run/pt-stalk.pid --plugin=
2015_12_03_12_41_54 Check results: status(Threads_running)=43, matched=yes, cycles_true=1
2015_12_03_12_41_55 Check results: status(Threads_running)=33, matched=yes, cycles_true=2
2015_12_03_12_41_56 Check results: status(Threads_running)=51, matched=yes, cycles_true=3
2015_12_03_12_41_57 Check results: status(Threads_running)=31, matched=yes, cycles_true=4
2015_12_03_12_42_01 Check results: status(Threads_running)=44, matched=yes, cycles_true=1
2015_12_03_12_42_02 Check results: status(Threads_running)=43, matched=yes, cycles_true=2
2015_12_03_12_42_05 Check results: status(Threads_running)=65, matched=yes, cycles_true=1
2015_12_03_12_42_06 Check results: status(Threads_running)=47, matched=yes, cycles_true=2
2015_12_03_12_42_07 Check results: status(Threads_running)=65, matched=yes, cycles_true=3
2015_12_03_12_42_08 Check results: status(Threads_running)=65, matched=yes, cycles_true=4
2015_12_03_12_42_09 Check results: status(Threads_running)=65, matched=yes, cycles_true=5
2015_12_03_12_42_09 Collect 1 triggered
2015_12_03_12_42_09 Collect 1 PID 18267
2015_12_03_12_42_09 Collect 1 done
2015_12_03_12_42_09 Sleeping 300 seconds after collect
2015_12_03_12_47_09 Check results: status(Threads_running)=65, matched=yes, cycles_true=1
2015_12_03_12_47_11 Check results: status(Threads_running)=52, matched=yes, cycles_true=1
2015_12_03_12_47_13 Check results: status(Threads_running)=35, matched=yes, cycles_true=1
2015_12_03_12_47_14 Check results: status(Threads_running)=65, matched=yes, cycles_true=2
2015_12_03_12_47_15 Check results: status(Threads_running)=66, matched=yes, cycles_true=3
2015_12_03_12_47_16 Check results: status(Threads_running)=32, matched=yes, cycles_true=4
2015_12_03_12_47_17 Check results: status(Threads_running)=65, matched=yes, cycles_true=5
2015_12_03_12_47_18 Collect 2 triggered
2015_12_03_12_47_18 Collect 2 PID 3466
2015_12_03_12_47_18 Collect 2 done
2015_12_03_12_47_18 Waiting up to 90 seconds for subprocesses to finish...
2015_12_03_12_47_49 Exiting because no more iterations
2015_12_03_12_47_49 /usr/bin/pt-stalk exit status 0
As you can see, the threshold was crossed a number of times, but data collection was triggered only when it stayed crossed for five consecutive checks.
Of course, these are not all of the options you can pass to pt-stalk. Some more of them can be seen at the beginning of the pt-stalk output, and there are others not mentioned there. We’ll now cover a couple of useful settings.
‘–disk-bytes-free’ (default of 100M) and ‘–disk-pct-free’ (default of 5). Pt-stalk collects lots of data, and collection may happen pretty often (by default, every 5 minutes – there’s a ‘–sleep’ variable with a default of 300 seconds which defines how long pt-stalk should wait after a data collection before the next check can be executed), so disk utilization can become an issue. These options let you set minimum thresholds – in bytes of free disk space and in percent of disk free – that pt-stalk will maintain on the system. ‘–disk-bytes-free’ accepts suffixes like k, M, G or T. Once you set these options, pt-stalk will remove old data as needed in order to meet the condition.
‘–retention-time’, by default set to 30. This defines how many days pt-stalk keeps the collected data before it is discarded. This is another means of keeping enough disk space free.
We’d suggest tuning those disk-related variables according to your monitoring settings – you don’t want to be woken up at 3am just because pt-stalk generated too many files on disk and triggered an alert.
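For example, a more conservative setup (the exact values here are arbitrary and should be aligned with your own monitoring thresholds) could look like this:
root@ip-172-30-4-156:~# pt-stalk --password=pass --disk-bytes-free=1G --disk-pct-free=20 --retention-time=7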
‘–collect-gdb’, ‘–collect-oprofile’, ‘–collect-strace’ and ‘–collect-tcpdump’ – these options enable additional ways of collecting data. Some of them are intrusive and may have a significant impact on the system (gdb and strace in particular) – in general, unless you are positive you know what you are doing, you probably don’t want to enable any of them. Additional software may also need to be installed for these options to work.
‘–collect’ – by default pt-stalk collects data when the condition is triggered. It may be that you only want it to monitor the system. In that case you can negate this option and start pt-stalk with ‘–no-collect’ – it won’t collect any data, it will just log the checks whenever they cross the defined threshold.
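A monitoring-only run could then be as simple as this (again, the threshold is illustrative):
root@ip-172-30-4-156:~# pt-stalk --password=pass --no-collect --threshold=25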
Of course, pt-stalk can also be daemonized – this is the second way you can run it. For that, you use the ‘–daemonize’ option to fork it to the background. In this mode it writes to the log file defined by the ‘–log’ variable, by default /var/log/pt-stalk.log.
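A daemonized run using the defaults discussed earlier might be started like this:
root@ip-172-30-4-156:~# pt-stalk --password=pass --daemonize --log=/var/log/pt-stalk.log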
Sometimes we want to collect the data right now, regardless of any conditions. This is the third way we can execute pt-stalk. By default it is started with the ‘–stalk’ option enabled. This option can be negated, and when we use ‘–no-stalk’, pt-stalk won’t check the state of the system at all – it will proceed straight to collecting the data. It’s very useful to set ‘–iterations’ to some value here, otherwise pt-stalk will run indefinitely, triggering data collection every ‘–sleep’ seconds.
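For example, to grab a single immediate set of samples, something along these lines should do:
root@ip-172-30-4-156:~# pt-stalk --password=pass --no-stalk --iterations=1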
Extending pt-stalk
As we mentioned earlier in this post, pt-stalk can be extended to some extent. You can do this in two ways.
First, there’s an option to execute something in addition to pt-stalk’s own work. We can use ‘–plugin’ for that. The process is fairly simple – you pass a file name to this option. The file should be a bash script; it doesn’t necessarily have to be executable. It should define at least one of the following bash functions:
- before_stalk – triggered before stalking
- before_collect – triggered before collecting data
- after_collect – triggered after data collection but before the ‘sleep’ phase
- after_collect_sleep – triggered after data collection and after the ‘sleep’ phase
- after_interval_sleep – triggered after sleeping for defined ‘–interval’, between checks
- after_stalk – triggered after stalking (when ‘–iterations’ are defined)
Such a function could, for example, look like this:
before_collect() {
    # run something here, an external script maybe?
}
As you can see, you can execute pretty much anything. It can be a binary, it can be a Perl/Python/Go script that implements some additional logic, or it can be plain bash if that’s what you prefer – anything that allows you to do what you want.
For example, such scripts can implement additional ways of collecting data (using the before_collect function). They can run sanity checks on access rights to the pt-stalk log and the directory where the collected data ends up (using before_stalk). They can parse the output and generate a report out of it (using after_collect). It’s entirely up to you how you use them.
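To make this more concrete, here is a minimal sketch of a ‘–plugin’ file – the file name and the extra data it gathers are purely illustrative, and we assume the default ‘–dest’ directory of /var/lib/pt-stalk:
# /root/pt-stalk-plugin.sh - passed to pt-stalk via --plugin=/root/pt-stalk-plugin.sh
before_collect() {
    # grab a disk usage snapshot next to the regular samples
    df -h > /var/lib/pt-stalk/$(date +%Y_%m_%d_%H_%M_%S)-df 2>&1
}
after_collect() {
    # leave a note in syslog that a collection has just finished
    logger "pt-stalk: finished collecting data on $(hostname)"
}
You would then start pt-stalk with ‘–plugin=/root/pt-stalk-plugin.sh’ in addition to the options discussed earlier.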
Pt-stalk lets you send an email for every collection using ‘–notify-by-email’, but with a plugin you could, for example, implement sending a complete report once the collection finishes.
If you are not satisfied with the conditions pt-stalk allows you to use for its checks, it’s very easy to implement your own. As we mentioned earlier, ‘–function’ accepts a file name. It works in a very similar way to ‘–plugin’ – it has to be a bash script that defines a trg_plugin function. Again, it doesn’t have to be an executable file, just a plain text script like:
trg_plugin() {
    # Execute something here; it has to return an integer.
    # You can leverage $EXT_ARGV, which passes the access details:
    mysql $EXT_ARGV -e "SELECT trx_id FROM INFORMATION_SCHEMA.INNODB_TRX WHERE trx_started < DATE_ADD(NOW(), INTERVAL -1 MINUTE);" | grep -v trx_id | wc -l
}
This particular function calculates how many transactions have been running for more than one minute. We can use it with pt-stalk to, for example, trigger a data collection when there’s at least one such long-running transaction:
root@ip-172-30-4-156:~# pt-stalk --password=pass --function=test.sh --threshold=0 --cycles=1 --iterations=1
2015_12_03_15_45_20 Starting /usr/bin/pt-stalk --function=test.sh --variable=Threads_running --threshold=0 --match= --cycles=1 --interval=1 --iterations=1 --run-time=30 --sleep=300 --dest=/var/lib/pt-stalk --prefix= --notify-by-email= --log=/var/log/pt-stalk.log --pid=/var/run/pt-stalk.pid --plugin=
2015_12_03_15_45_20 Check results: test.sh(Threads_running)=1, matched=yes, cycles_true=1
2015_12_03_15_45_20 Collect 1 triggered
2015_12_03_15_45_20 Collect 1 PID 5967
2015_12_03_15_45_20 Collect 1 done
2015_12_03_15_45_20 Waiting up to 90 seconds for subprocesses to finish...
2015_12_03_15_45_53 Exiting because no more iterations
2015_12_03_15_45_53 /usr/bin/pt-stalk exit status 0
We had one such transaction (which was more than the threshold of 0), so data was collected.
As mentioned, you can extend this by calling scripts. Let’s take a look at the following example. We’d like pt-stalk to trigger when the number of queries executed per second reaches a certain level. In the output of “SHOW GLOBAL STATUS” you will find the ‘Queries’ counter, but it’s a counter, not a gauge – it only increments over time. We need to grab two consecutive values and subtract the first from the second, and we can use the mysqladmin tool for that. To accomplish this, we’ve prepared a simple setup. First, we have our ‘test.sh’ file with the following contents:
trg_plugin() {
    # Execute something here; it has to return an integer.
    # You can leverage $EXT_ARGV, which passes the access details:
    ./qps_trigger.pl $EXT_ARGV
}
What it does is fairly simple – it calls a Perl script and passes $EXT_ARGV to it, which carries the MySQL access credentials.
The Perl script looks as below:
#!/usr/bin/perl
my $conn_str = '';
foreach my $arg (@ARGV) {
    $conn_str = $conn_str . $arg . ' ';
}
# run mysqladmin twice, one second apart, in relative mode, and keep only the Queries lines
open(OUT, "mysqladmin " . $conn_str . "-c 2 -ri 1 ext | grep Queries |") || die "Failed: $!\n";
my @out_arr = ();
while (<OUT>) {
    push(@out_arr, $_);
}
close(OUT);
# the second line holds the relative (per-second) value; column 3 contains the number
my @cols = split(" ", $out_arr[1]);
print $cols[3];
We first build a connection string out of the arguments passed to the Perl script, and then we use mysqladmin to execute ‘SHOW GLOBAL STATUS’ twice, using the -r option which makes the second sample show values relative to the first:
root@ip-172-30-4-156:~# mysqladmin --user=root --password=pass -c 2 -ri 1 ext | grep Queries
| Queries | 61002165 |
| Queries | 1628 |
The second line of the output contains the value we want to check – the number of queries executed in the last second. The script grabs it and prints it. Finally, we run pt-stalk with ‘–function=test.sh’, and we want it to trigger when MySQL executes more than 2000 queries per second.
root@ip-172-30-4-156:~# pt-stalk --password=pass --user=root --function=test.sh --threshold=2000 --cycles=3 --iterations=1
2015_12_03_16_25_40 Starting /usr/bin/pt-stalk --function=test.sh --variable=Threads_running --threshold=2000 --match= --cycles=3 --interval=1 --iterations=1 --run-time=30 --sleep=300 --dest=/var/lib/pt-stalk --prefix= --notify-by-email= --log=/var/log/pt-stalk.log --pid=/var/run/pt-stalk.pid --plugin=
2015_12_03_16_25_41 Check results: test.sh(Threads_running)=2482, matched=yes, cycles_true=1
2015_12_03_16_25_55 Check results: test.sh(Threads_running)=2936, matched=yes, cycles_true=1
2015_12_03_16_25_58 Check results: test.sh(Threads_running)=2068, matched=yes, cycles_true=2
2015_12_03_16_26_04 Check results: test.sh(Threads_running)=2235, matched=yes, cycles_true=1
2015_12_03_16_26_12 Check results: test.sh(Threads_running)=2009, matched=yes, cycles_true=1
2015_12_03_16_26_28 Check results: test.sh(Threads_running)=2382, matched=yes, cycles_true=1
2015_12_03_16_26_30 Check results: test.sh(Threads_running)=2080, matched=yes, cycles_true=2
2015_12_03_16_26_36 Check results: test.sh(Threads_running)=2163, matched=yes, cycles_true=1
2015_12_03_16_26_39 Check results: test.sh(Threads_running)=2083, matched=yes, cycles_true=2
2015_12_03_16_26_41 Check results: test.sh(Threads_running)=2346, matched=yes, cycles_true=3
2015_12_03_16_26_41 Collect 1 triggered
2015_12_03_16_26_41 Collect 1 PID 28229
2015_12_03_16_26_41 Collect 1 done
2015_12_03_16_26_41 Waiting up to 90 seconds for subprocesses to finish...
2015_12_03_16_27_14 Exiting because no more iterations
2015_12_03_16_27_14 /usr/bin/pt-stalk exit status 0
As you can see, pt-stalk is very flexible in terms of what can be done with it (and under what conditions). In the next post in the series, we are going to talk about what data pt-stalk can collect for you and what kind of information you can derive from this data.