How to Fail or Crash Your MySQL Instances for Testing
You can take down a MySQL database in multiple ways. Some obvious ways are to shut down the host, pull out the power cable, or hard kill the mysqld process with SIGKILL to simulate an unclean MySQL shutdown. But there are also less obvious ways to deliberately crash your MySQL server and then see what kind of chain reaction it triggers. Why would you want to do this? Failure and recovery can have many corner cases, and understanding them can help reduce the element of surprise when things happen in production. Ideally, you want to simulate failures in a controlled environment, and then design and test your database failover procedures.
There are several areas in MySQL we can tackle, depending on how you want it to fail or crash. You can corrupt the tablespace, overflow the MySQL buffers and caches, starve the server of resources, or mess around with permissions. In this blog post, we are going to show you some examples of how to crash a MySQL server in a Linux environment. Some of them are also suitable for, e.g., Amazon RDS instances, where you have no access to the underlying host.
Kill, Kill, Kill, Die, Die, Die
The easiest way to fail a MySQL server is to simply kill the process or the host, and not give MySQL a chance to do a graceful shutdown. To simulate a mysqld crash, just send signal 4 (SIGILL), 6 (SIGABRT), 7 (SIGBUS), 8 (SIGFPE) or 11 (SIGSEGV) to the process:
$ kill -11 $(pidof mysqld)
When looking at the MySQL error log, you can see the following lines:
11:06:09 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
..
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
You can also use kill -9 (SIGKILL) to kill the process immediately. More details on Linux signals can be found in the signal(7) man page. Alternatively, you can go for a meaner way on the hardware side, like pulling out the power cable, pressing the hard reset button, or using a fencing device to STONITH.
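For a harder kill that mysqld cannot trap (so you get no backtrace in the error log), SIGKILL works the same way:

$ kill -9 $(pidof mysqld)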
Triggering OOM
Popular MySQL cloud offerings like Amazon RDS and Google Cloud SQL have no straightforward way to crash them: firstly, because you don't get any OS-level access to the database instance, and secondly, because the provider runs a proprietary, patched MySQL server. One way is to overflow some of the buffers and let the out-of-memory (OOM) killer terminate the MySQL process.
You can increase the sort buffer size to something bigger than the available RAM and fire a number of sort queries against the MySQL server. Let's create a table with 10 million rows using sysbench on our Amazon RDS instance, so we have something big to sort:
$ sysbench \
--db-driver=mysql \
--oltp-table-size=10000000 \
--oltp-tables-count=1 \
--threads=1 \
--mysql-host=dbtest.cdw9q2wnb00s.ap-tokyo-1.rds.amazonaws.com \
--mysql-port=3306 \
--mysql-user=rdsroot \
--mysql-password=password \
/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua \
run
Change the sort_buffer_size to 5GB (our test instance is a db.t2.micro with 1GB RAM and 1 vCPU) by going to the Amazon RDS Dashboard -> Parameter Groups -> Create Parameter Group -> specify the group name -> Edit Parameters -> choose "sort_buffer_size" and specify the value as 5368709120.
Apply the parameter group changes by going to Instances -> Instance Action -> Modify -> Database Options -> Database Parameter Group -> and choose our newly created parameter group. Then, reboot the RDS instance to apply the changes.
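If you prefer the command line, a roughly equivalent sketch with the AWS CLI looks like this (the parameter group name crashtest, the instance identifier dbtest and the parameter group family mysql5.7 are assumptions; adjust them to your setup):

$ aws rds create-db-parameter-group --db-parameter-group-name crashtest --db-parameter-group-family mysql5.7 --description "crash testing"
$ aws rds modify-db-parameter-group --db-parameter-group-name crashtest --parameters "ParameterName=sort_buffer_size,ParameterValue=5368709120,ApplyMethod=pending-reboot"
$ aws rds modify-db-instance --db-instance-identifier dbtest --db-parameter-group-name crashtest
$ aws rds reboot-db-instance --db-instance-identifier dbtest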
Once up, verify the new value of sort_buffer_size:
MySQL [(none)]> select @@sort_buffer_size;
+--------------------+
| @@sort_buffer_size |
+--------------------+
|         5368709120 |
+--------------------+
Then fire 48 simple queries that require sorting from a client:
$ for i in {1..48}; do (mysql -urdsroot -ppassword -h dbtest.cdw9q2wnb00s.ap-tokyo-1.rds.amazonaws.com -e 'SELECT * FROM sbtest.sbtest1 ORDER BY c DESC' >/dev/null &); done
If you run the above against a standard host, the MySQL server will be terminated and you will see lines like the following in the OS's syslog or dmesg:
[164199.868060] Out of memory: Kill process 47060 (mysqld) score 847 or sacrifice child
[164199.868109] Killed process 47060 (mysqld) total-vm:265264964kB, anon-rss:3257400kB, file-rss:0kB
With systemd, MySQL or MariaDB will be restarted automatically, as will Amazon RDS. You can see the uptime of our RDS instance reset back to 0 (under mysqladmin status), and the 'Latest restore time' value (under the RDS Dashboard) updated to the moment it went down.
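To check the uptime, point mysqladmin at the same endpoint; the Uptime value in its output is the number of seconds since the last restart:

$ mysqladmin -urdsroot -ppassword -h dbtest.cdw9q2wnb00s.ap-tokyo-1.rds.amazonaws.com status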
Corrupting the Data
InnoDB has its own system tablespace, stored in a file named ibdata1, which holds the data dictionary, buffers and rollback segments. It also holds the table data if you do not enable innodb_file_per_table (enabled by default in MySQL 5.6.6+). We can simply zero out this file, send a write operation and flush tables to crash mysqld:
# empty ibdata1
$ cat /dev/null > /var/lib/mysql/ibdata1
# send a write
$ mysql -uroot -p -e 'CREATE TABLE sbtest.test (id INT)'
# flush tables
$ mysql -uroot -p -e 'FLUSH TABLES WITH READ LOCK; UNLOCK TABLES'
After you send a write, in the error log, you will notice:
2017-11-15T06:01:59.345316Z 0 [ERROR] InnoDB: Tried to read 16384 bytes at offset 98304, but was only able to read 0
2017-11-15T06:01:59.345332Z 0 [ERROR] InnoDB: File (unknown): 'read' returned OS error 0. Cannot continue operation
2017-11-15T06:01:59.345343Z 0 [ERROR] InnoDB: Cannot continue operation.
At this point, the MySQL server will hang because it cannot perform any operation, and after the flush you will see the "mysqld got signal 11" lines and mysqld will shut down. To clean up, you have to remove the corrupted ibdata1, as well as the ib_logfile* files, because the redo log files cannot be used with the new system tablespace that mysqld will generate on the next restart. Data loss is expected.
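A minimal cleanup sketch, assuming the default datadir /var/lib/mysql, the default redo log file names and a systemd-managed mysqld:

$ systemctl stop mysqld
$ rm -f /var/lib/mysql/ibdata1 /var/lib/mysql/ib_logfile*
$ systemctl start mysqld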
For MyISAM tables, we can mess around with the .MYD (MyISAM data) and .MYI (MyISAM index) files under the MySQL datadir. For instance, the following command replaces every occurrence of the string "F" with "9" inside a file:
$ replace F 9 -- /var/lib/mysql/sbtest/sbtest1.MYD
Then, send some writes (e.g., using sysbench) to the target table and flush it:
mysql> FLUSH TABLE sbtest.sbtest1;
The following should appear in the MySQL error log:
2017-11-15T06:56:15.021564Z 448 [ERROR] /usr/sbin/mysqld: Incorrect key file for table './sbtest/sbtest1.MYI'; try to repair it
2017-11-15T06:56:15.021572Z 448 [ERROR] Got an error from thread_id=448, /export/home/pb2/build/sb_0-24964902-1505318733.42/rpm/BUILD/mysql-5.7.20/mysql-5.7.20/storage/myisam/mi_update.c:227
The MyISAM table will be marked as crashed, and running a REPAIR TABLE statement is necessary to make it accessible again.
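Something like the following brings the table from our sysbench example back to a usable state:

mysql> REPAIR TABLE sbtest.sbtest1;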
Limiting the Resources
We can also apply operating system resource limits to our mysqld process, for example the number of open file descriptors. The open_files_limit variable (default is 5000) tells mysqld how many file descriptors to reserve via the setrlimit() system call. You can set this limit relatively small (just enough for mysqld to start up) and then send multiple queries to the MySQL server until it hits the limit.
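Before lowering anything, you can check both the server variable and the limit actually applied to the running process (the /proc path assumes Linux):

mysql> SELECT @@open_files_limit;
$ grep 'open files' /proc/$(pidof mysqld)/limits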
If mysqld is running under systemd, we can set the limit in the systemd unit file located at /usr/lib/systemd/system/mysqld.service, changing the following value to something lower (the systemd default is 6000):
# Sets open_files_limit
LimitNOFILE = 30
Apply the changes to systemd and restart MySQL server:
$ systemctl daemon-reload
$ systemctl restart mysqld
Then, start sending new connections and queries that touch different databases and tables, so mysqld has to open many file descriptors at once.
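One rough way to do this is a loop over the sysbench tables (this assumes tables sbtest1 through sbtest16 exist from an earlier sysbench prepare, and that the root password is simply 'password'):

$ for i in {1..16}; do (mysql -uroot -ppassword -e "SELECT COUNT(*) FROM sbtest.sbtest$i" >/dev/null &); done

Once the limit is hit, you will notice the following errors in the MySQL error log: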
2017-11-16T04:43:26.179295Z 4 [ERROR] InnoDB: Operating system error number 24 in a file operation.
2017-11-16T04:43:26.179342Z 4 [ERROR] InnoDB: Error number 24 means 'Too many open files'
2017-11-16T04:43:26.179354Z 4 [Note] InnoDB: Some operating system error numbers are described at http://dev.mysql.com/doc/refman/5.7/en/operating-system-error-codes.html
2017-11-16T04:43:26.179363Z 4 [ERROR] InnoDB: File ./sbtest/sbtest9.ibd: 'open' returned OS error 124. Cannot continue operation
2017-11-16T04:43:26.179371Z 4 [ERROR] InnoDB: Cannot continue operation.
2017-11-16T04:43:26.372605Z 0 [Note] InnoDB: FTS optimize thread exiting.
2017-11-16T04:45:06.816056Z 4 [Warning] InnoDB: 3 threads created by InnoDB had not exited at shutdown!
At this point, MySQL will freeze and will not be able to perform any operation. When you try to connect, you will see the following after a while:
$ mysql -uroot -p
ERROR 2013 (HY000): Lost connection to MySQL server at 'reading initial communication packet', system error: 104
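To recover, restore a sane LimitNOFILE in the unit file (the value 5000 below is only an example; put back whatever your unit file originally shipped with), then reload systemd and restart MySQL:

$ sed -i 's/^LimitNOFILE.*/LimitNOFILE = 5000/' /usr/lib/systemd/system/mysqld.service
$ systemctl daemon-reload
$ systemctl restart mysqld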
Messing Around with Permissions
The mysqld process runs as the "mysql" user, which means all the files and directories it needs to access are owned by the mysql user/group. By messing with the ownership and permissions, we can make the MySQL server useless:
$ chown root:root /var/lib/mysql
$ chmod 600 /var/lib/mysql
Generate some load on the server, then connect to the MySQL server and flush all tables to disk:
mysql> FLUSH TABLES WITH READ LOCK; UNLOCK TABLES;
At this moment, mysqld is still running but it's pretty much useless. You can access it via a mysql client, but you can't perform any operation:
mysql> SHOW DATABASES;
ERROR 1018 (HY000): Can't read dir of '.' (errno: 13 - Permission denied)
To clean up the mess, set the correct permissions:
$ chown mysql:mysql /var/lib/mysql
$ chmod 750 /var/lib/mysql
$ systemctl restart mysqld
Lock it Down
FLUSH TABLES WITH READ LOCK (FTWRL) can be destructive under a number of conditions. For example, in a Galera cluster, where all nodes are able to process writes, you can use this statement to lock down the cluster from within one of the nodes. The statement halts other queries from being processed by mysqld until the lock is released, which is very handy for backup processes (MyISAM tables) and file system snapshots.
Although this action won't crash or bring down your database server while the lock is held, the consequences can be huge if the session that holds the lock never releases it. To try this, simply:
mysql> FLUSH TABLES WITH READ LOCK;
mysql> exit
Then send a bunch of new queries to mysqld until it reaches the max_connections value. Obviously, you cannot get back the same session once you have exited it, so the lock will be held indefinitely. The only ways to release it are to kill the query from another session with a user that has the SUPER privilege, kill the mysqld process itself, or perform a hard reboot.
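From another session with a SUPER-privileged user, you can locate the offending connection and kill it; the id 123 below is just a placeholder taken from the process list output:

mysql> SHOW PROCESSLIST;
mysql> KILL 123;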
Disclaimer
This blog post is written to give sysadmins and DBAs some alternatives for simulating failure scenarios with MySQL. Do not try these on your production servers 🙂