Troubleshooting Galera Cluster Issues: Part 1

Krzysztof Ksiazek

In this blog post, we are going to show you some examples of things that can go wrong in Galera – inexplicable node crashes, network splits, clusters that won’t restart and inconsistent data. We’ll take a look at the data stored in the log files to diagnose the problems, and discuss how to deal with them.

This builds upon the previous post, where we looked into the log files produced by Galera (the error log and the innobackup.* logs). We discussed what regular, “normal” activity looks like – from the initialization of Galera replication to the Incremental State Transfer (IST) and State Snapshot Transfer (SST) processes, and their respective log entries.

This is the seventeenth installment in the ‘Become a MySQL DBA’ blog series.

GEN_CLUST_INDEX issue

This is an example of an issue related to the lack of a Primary Key (PK) in a table. Theoretically speaking, Galera supports InnoDB tables without a PK. The lack of a PK induces some limitations – for example, DELETEs are not supported on such tables, and row order can differ between nodes. We’ve also found that a missing PK can be the culprit behind serious crashes in Galera. It shouldn’t happen, but that’s how things are. Below is a sample of an error log covering such a crash.

2015-04-29 11:49:38 49655 [Note] WSREP: Synchronized with group, ready for connections
2015-04-29 11:49:38 49655 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2015-04-29 11:49:39 7ff1285b5700 InnoDB: Error: page 409 log sequence number 207988009678
InnoDB: is in the future! Current system log sequence number 207818416901.
InnoDB: Your database may be corrupt or you may have copied the InnoDB
InnoDB: tablespace but not the InnoDB log files. See
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: for more information.
BF-BF X lock conflict
RECORD LOCKS space id 1288 page no 33 n bits 104 index `GEN_CLUST_INDEX` of table `db`.`tab` trx id 4670189 lock_mode X locks rec but not gap

Here’s the hint we are looking for: ‘GEN_CLUST_INDEX’. InnoDB requires a PK to be available – it uses the PK as the clustered index, which is the main component of a table’s structure. No PK – no way to organize data within a table. If there’s no explicit PK defined on a table, InnoDB looks for a UNIQUE key to use as the clustered index. If neither a PK nor a UNIQUE key is available, an implicit PK is created – it’s referred to by InnoDB as ‘GEN_CLUST_INDEX’. If you see an entry like the one above, you can be sure that this particular table (db.tab) doesn’t have a PK defined.
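
Ideally, you want to know about such tables before a crash ever happens. The query below is a sketch (not part of the original log analysis) that lists InnoDB tables without an explicit PRIMARY KEY; connection details are placeholders.

# List InnoDB tables that have no PRIMARY KEY constraint.
# Note: tables with a usable UNIQUE key are also listed, even though InnoDB
# would pick that key as the clustered index instead of GEN_CLUST_INDEX.
mysql -u root -p -e "
SELECT t.table_schema, t.table_name
  FROM information_schema.tables t
  LEFT JOIN information_schema.table_constraints c
         ON c.table_schema = t.table_schema
        AND c.table_name = t.table_name
        AND c.constraint_type = 'PRIMARY KEY'
 WHERE t.engine = 'InnoDB'
   AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
   AND c.constraint_type IS NULL;"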

In this particular case, we can see that some kind of lock conflict happened.

Record lock, heap no 2 PHYSICAL RECORD: n_fields 18; compact format; info bits 0
 0: len 6; hex 00000c8517c5; asc       ;;
 1: len 6; hex 0000004742ec; asc    GB ;;
 2: len 7; hex 5f0000026d10a6; asc _   m  ;;
 3: len 4; hex 32323237; asc 2227;;

...

 15: len 30; hex 504c4154454c4554204147475245474154494f4e20494e48494249544f52; asc xxx    ; (total 50 bytes);
 16: len 30; hex 414e5449504c4154454c4554204452554753202020202020202020202020; asc xxx    ; (total 100 bytes);
 17: len 1; hex 32; asc 2;;

In the section above, you can find information about the record involved in the locking conflict. We obfuscated the data, but normally this information can point you to a particular row in the table.

15:49:39 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

key_buffer_size=25165824
read_buffer_size=131072
max_used_connections=13
max_threads=200
thread_count=12
connection_count=12
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 104208 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fed4c000990
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7ff10037bd98 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x8ea2e5]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x6788f4]
/lib64/libpthread.so.0[0x391380f710]
/lib64/libc.so.6(gsignal+0x35)[0x3913432625]
/lib64/libc.so.6(abort+0x175)[0x3913433e05]
/usr/sbin/mysqld[0x9c262e]
/usr/sbin/mysqld[0x9c691b]
/usr/sbin/mysqld[0x9c6c44]
/usr/sbin/mysqld[0xa30ddf]
/usr/sbin/mysqld[0xa36637]
/usr/sbin/mysqld[0x997267]
/usr/sbin/mysqld(_ZN7handler11ha_rnd_nextEPh+0x9c)[0x5ba3bc]
/usr/sbin/mysqld(_ZN14Rows_log_event24do_table_scan_and_updateEPK14Relay_log_info+0x188)[0x883378]
/usr/sbin/mysqld(_ZN14Rows_log_event14do_apply_eventEPK14Relay_log_info+0xd77)[0x88e127]
/usr/sbin/mysqld(_ZN9Log_event11apply_eventEP14Relay_log_info+0x68)[0x88f458]
/usr/sbin/mysqld(_Z14wsrep_apply_cbPvPKvmjPK14wsrep_trx_meta+0x58e)[0x5b6fee]
/usr/lib64/galera/libgalera_smm.so(_ZNK6galera9TrxHandle5applyEPvPF15wsrep_cb_statusS1_PKvmjPK14wsrep_trx_metaERS6_+0xb1)[0x7ff1288ad2c1]
/usr/lib64/galera/libgalera_smm.so(+0x1aaf95)[0x7ff1288e4f95]
/usr/lib64/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM9apply_trxEPvPNS_9TrxHandleE+0x283)[0x7ff1288e5e03]
/usr/lib64/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM11process_trxEPvPNS_9TrxHandleE+0x45)[0x7ff1288e66f5]
/usr/lib64/galera/libgalera_smm.so(_ZN6galera15GcsActionSource8dispatchEPvRK10gcs_actionRb+0x2c9)[0x7ff1288c3349]
/usr/lib64/galera/libgalera_smm.so(_ZN6galera15GcsActionSource7processEPvRb+0x63)[0x7ff1288c3823]
/usr/lib64/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0x93)[0x7ff1288e23f3]
/usr/lib64/galera/libgalera_smm.so(galera_recv+0x23)[0x7ff1288f7743]
/usr/sbin/mysqld[0x5b7caf]
/usr/sbin/mysqld(start_wsrep_THD+0x3e6)[0x5a8d96]
/lib64/libpthread.so.0[0x39138079d1]
/lib64/libc.so.6(clone+0x6d)[0x39134e88fd]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 3
Status: NOT_KILLED

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
150429 11:49:39 mysqld_safe Number of processes running now: 0
150429 11:49:39 mysqld_safe WSREP: not restarting wsrep node automatically
150429 11:49:39 mysqld_safe mysqld from pid file /var/lib/mysql/hostname.pid ended

Finally, we have a pretty standard MySQL crash report along with a stack trace. If you look closely, you’ll see a pretty important bit of information – the stack trace contains entries referring to ‘libgalera_smm.so’. This points to the Galera library being involved in the crash. As you can see, we are dealing with a node which is applying an event executed first on some other member of the cluster (_ZN9Log_event11apply_eventEP14Relay_log_info). This event involves SQL performing a table scan:

_ZN14Rows_log_event24do_table_scan_and_updateEPK14Relay_log_info
_ZN7handler11ha_rnd_nextEPh

Apparently, while scanning our ‘db.tab’ table, something went wrong and MySQL crashed. From our experience, in such cases it’s enough to create a PK on the table – we don’t recall issues which persisted after that schema change. In general, if you don’t have a good candidate for a PK on the table, the best way to solve such a problem is to add a new integer column (small or big, depending on the maximum expected table size) that is auto-incremented, unsigned and not null. Such a column will work great as a PK and it should prevent this type of error.
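
A minimal sketch of such a change, using the db.tab table from the log above – the column name ‘id’ is our choice, and on a large, busy table you would want to run this through an online schema change tool rather than a plain ALTER:

# Hypothetical surrogate PK for db.tab; adjust the data type to the expected table size.
mysql -u root -p -e "
ALTER TABLE db.tab
  ADD COLUMN id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  ADD PRIMARY KEY (id);"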

Please keep in mind that for tables without a PK, if you execute simultaneous writes on multiple Galera nodes, rows may end up ordered differently – this is because the implicit PK is based on an internal row ID, which increases monotonically as new rows are inserted. So, everything depends on the write concurrency – the higher it is, the more likely it is that rows were inserted in a different order on different nodes. If this happened, adding a Primary Key may leave you with a table that is not consistent across the cluster. There’s no easy workaround here. The easiest way (but also the most intrusive one) is to pick one node in the cluster as a donor and then force SST on the remaining nodes – this rebuilds the other nodes from a single source of truth. Another option is to use pt-table-sync, a script which is part of Percona Toolkit – it allows you to sync a chosen table across the cluster. Of course, before you take any kind of action, it’s always good to use another Percona Toolkit script, pt-table-checksum, to check whether the given table is in sync across the cluster or not (a sample invocation is shown below). If it is in sync (because you write only to a single node, or write concurrency is low and the issue hasn’t been triggered), there’s no need to do anything.
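
An illustrative pt-table-checksum run for the table discussed above – the host, user and password are placeholders, and you should review the tool’s documentation (in particular how it discovers the other Galera nodes) before running it against a production cluster:

# Checksum only db.tab across the cluster; DSN values are placeholders.
pt-table-checksum --databases db --tables tab h=10.0.0.1,u=checksum_user,p=checksum_pass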

Network split

Galera Cluster requires quorum to operate – 50%+1 of the nodes have to be up in order to form a ‘Primary Component’. This mechanism is designed to ensure that a split brain won’t happen and that your application does not end up talking to two separate, disconnected parts of the same cluster. Obviously, that would be a very bad situation to be in – therefore we are happy with the protection Galera gives us. We are going to cover a real-world example of a network split and explain why we have to be cautious when dealing with such a scenario.
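
A quick way to check, on any node, whether it is part of the Primary Component and how many nodes it currently sees – a sketch, with placeholder credentials:

# wsrep_cluster_status should report 'Primary' on a healthy node.
mysql -u root -p -e "SHOW GLOBAL STATUS WHERE Variable_name IN
  ('wsrep_cluster_status', 'wsrep_cluster_size', 'wsrep_ready');"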

150529  9:30:05 [Note] WSREP: evs::proto(a1024858, OPERATIONAL, view_id(REG,428eb82c,1111)) suspecting node: 428eb82c
150529  9:30:05 [Note] WSREP: evs::proto(a1024858, OPERATIONAL, view_id(REG,428eb82c,1111)) suspected node without join message, declaring inactive
150529  9:30:05 [Note] WSREP: evs::proto(a1024858, OPERATIONAL, view_id(REG,428eb82c,1111)) suspecting node: d272b968
150529  9:30:05 [Note] WSREP: evs::proto(a1024858, OPERATIONAL, view_id(REG,428eb82c,1111)) suspected node without join message, declaring inactive

Two nodes (with IDs of 428eb82c and d272b968) were declared inactive and left the cluster. Those nodes were, in fact, still running – the problem was related only to network connectivity.

150529  9:30:05 [Note] WSREP: view(view_id(NON_PRIM,428eb82c,1111) memb {
        60bbf616,1
        a1024858,1
} joined {
} left {
} partitioned {
        428eb82c,0
        d272b968,0
})

This cluster consists of four nodes, so, as expected, the remaining part of it switched to a non-Primary state – fewer than 50%+1 of the nodes were available.

150529  9:30:05 [Note] WSREP: declaring 60bbf616 at tcp://10.87.84.101:4567 stable
150529  9:30:05 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 1, memb_num = 2
150529  9:30:05 [Note] WSREP: Flow-control interval: [23, 23]
150529  9:30:05 [Note] WSREP: Received NON-PRIMARY.
150529  9:30:05 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 5351819)
150529  9:30:05 [Note] WSREP: New cluster view: global state: 8f630ade-4366-11e4-94c6-d6eb7adc0ddd:5351819, view# -1: non-Primary, number of nodes: 2, my index: 1, protocol version 3
150529  9:30:05 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
150529  9:30:05 [Note] WSREP: view(view_id(NON_PRIM,60bbf616,1112) memb {
        60bbf616,1
        a1024858,1
} joined {
} left {
} partitioned {
        428eb82c,0
        d272b968,0
})
150529  9:30:05 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 1, memb_num = 2
150529  9:30:05 [Note] WSREP: Flow-control interval: [23, 23]
150529  9:30:05 [Note] WSREP: Received NON-PRIMARY.
150529  9:30:05 [Note] WSREP: New cluster view: global state: 8f630ade-4366-11e4-94c6-d6eb7adc0ddd:5351819, view# -1: non-Primary, number of nodes: 2, my index: 1, protocol version 3

...

150529  9:30:20 [Note] WSREP: (a1024858, 'tcp://0.0.0.0:4567') reconnecting to 428eb82c (tcp://172.16.14.227:4567), attempt 30
150529  9:30:20 [Note] WSREP: (a1024858, 'tcp://0.0.0.0:4567') address 'tcp://10.87.84.102:4567' pointing to uuid a1024858 is blacklisted, skipping
150529  9:30:21 [Note] WSREP: (a1024858, 'tcp://0.0.0.0:4567') reconnecting to d272b968 (tcp://172.16.14.226:4567), attempt 30

Above you can see a couple of failed attempts to bring the cluster back in sync. When part of a cluster is in a ‘non-Primary’ state, it won’t regain the Primary Component unless all of the nodes become available again or one of the remaining nodes is bootstrapped, as shown below.
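
A sketch of how the surviving part of the cluster can be promoted back to Primary from one of its nodes – only do this once you are absolutely sure that the partitioned nodes are not serving writes on their own:

# Promote the current (non-Primary) component to Primary on this node.
mysql -u root -p -e "SET GLOBAL wsrep_provider_options='pc.bootstrap=YES';"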

Bootstrapping from incorrect node

Sometimes bootstrapping the cluster using one of the nodes is the only quick way to bring the cluster up – usually you don’t have time to wait until all nodes become available again. Actually, you could wait forever if one of the nodes is down due to hardware failure or data corruption – in such cases, bootstrapping the cluster is the only way to bring it back. What’s very important to keep in mind is that a cluster has to be bootstrapped from the most advanced node. It may happen that one or more of the nodes is behind in applying writesets, so you need to confirm which of the nodes is the most advanced one, for example by running mysqld_safe --wsrep_recover on each stopped node – a minimal check is sketched below. Failing to do so may result in problems like the ones shown in the error log that follows the sketch:
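
A rough sketch of such a check – the log and data directory paths are placeholders that depend on your distribution; compare the seqno values reported on each node and bootstrap from the one with the highest value:

# Recover the last committed position from InnoDB (the node must be stopped).
mysqld_safe --wsrep_recover
grep 'Recovered position' /var/log/mysqld.log | tail -n 1
# grastate.dat also stores the last known uuid:seqno pair if the node shut down cleanly.
cat /var/lib/mysql/grastate.dat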

151104 16:52:19 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
151104 16:52:19 mysqld_safe Skipping wsrep-recover for 98ed75de-7c05-11e5-9743-de4abc22bd11:235368 pair
151104 16:52:19 mysqld_safe Assigning 98ed75de-7c05-11e5-9743-de4abc22bd11:235368 to wsrep_start_position

...

2015-11-04 16:52:19 18033 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '172.30.4.156:4567,172.30.4.191:4567,172.30.4.220:4567'

What we have here is a standard startup process for Galera. Please note that the cluster consists of three nodes.

2015-11-04 16:52:19 18033 [Warning] WSREP: (6964fefd, 'tcp://0.0.0.0:4567') address 'tcp://172.30.4.220:4567' points to own listening address, blacklisting
2015-11-04 16:52:19 18033 [Note] WSREP: (6964fefd, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
2015-11-04 16:52:20 18033 [Note] WSREP: declaring 5342d84a at tcp://172.30.4.191:4567 stable
2015-11-04 16:52:20 18033 [Note] WSREP: Node 5342d84a state prim
2015-11-04 16:52:20 18033 [Note] WSREP: view(view_id(PRIM,5342d84a,2) memb {
    5342d84a,0
    6964fefd,0
} joined {
} left {
} partitioned {
})
2015-11-04 16:52:20 18033 [Note] WSREP: save pc into disk
2015-11-04 16:52:20 18033 [Note] WSREP: discarding pending addr without UUID: tcp://172.30.4.156:4567
2015-11-04 16:52:20 18033 [Note] WSREP: gcomm: connected
2015-11-04 16:52:20 18033 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2015-11-04 16:52:20 18033 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2015-11-04 16:52:20 18033 [Note] WSREP: Opened channel 'my_wsrep_cluster'
2015-11-04 16:52:20 18033 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
2015-11-04 16:52:20 18033 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
2015-11-04 16:52:20 18033 [Note] WSREP: Waiting for SST to complete.
2015-11-04 16:52:20 18033 [Note] WSREP: STATE EXCHANGE: sent state msg: 69b62fe8-8314-11e5-a6d4-fae0f5e6c643
2015-11-04 16:52:20 18033 [Note] WSREP: STATE EXCHANGE: got state msg: 69b62fe8-8314-11e5-a6d4-fae0f5e6c643 from 0 (172.30.4.191)
2015-11-04 16:52:20 18033 [Note] WSREP: STATE EXCHANGE: got state msg: 69b62fe8-8314-11e5-a6d4-fae0f5e6c643 from 1 (172.30.4.220)

So far so good, nodes are communicating with each other.

2015-11-04 16:52:20 18033 [ERROR] WSREP: gcs/src/gcs_group.cpp:group_post_state_exchange():319: Reversing history: 0 -> 0, this member has applied 140700504171408 more events than the primary component.Data loss is possible. Aborting.
2015-11-04 16:52:20 18033 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Aborted (core dumped)
151104 16:52:20 mysqld_safe mysqld from pid file /var/lib/mysql/mysql.pid ended

Unfortunately, one of the nodes was more advanced than the node we bootstrapped the cluster from. This insane number of events is not really valid, as you can tell, but it’s still true that this node cannot join the cluster. The only way to do so is by running SST, but in that case you are going to lose the data which was never applied on the node you bootstrapped the cluster from. The best practice is to try and recover this data from the ‘more advanced’ node – hopefully binary logs are available. If not, the process is still possible, but comparing the data on a live system is not an easy task.
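
If binary logging was enabled on that node, here is a sketch of how the row events that never reached the rest of the cluster can at least be inspected before the node is rebuilt via SST – the binlog file name is a placeholder:

# Decode row-based events into readable pseudo-SQL for manual review.
mysqlbinlog --base64-output=decode-rows --verbose binlog.000123 > lost_changes.txt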

Consistency issues

Within a cluster, all Galera nodes should be consistent. No matter where you write, the change should be replicated across all nodes. As the base method of moving data around, Galera uses row based replication – this is the best way of replicating data while maintaining consistency of the data set. All nodes are consistent – this is true most of the time. Sometimes, though, things do not work as expected and, even with Galera, nodes may get out of sync. There are numerous reasons for that – from node crashes which result in data loss, to human errors while tinkering with replication or the ‘wsrep_on’ variable. What is important is that Galera has an internal mechanism for maintaining consistency: if it detects an error while applying a writeset, a node will shut itself down to ensure that the inconsistency won’t propagate, for example through SST.
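
As an illustration of the ‘human error’ case – a hypothetical statement executed with session-level replication switched off is applied on the local node only, which is exactly the kind of drift that later surfaces as an apply error (the table name is taken from the log below, the WHERE clause is made up; do not run this on a production cluster):

# Hypothetical example – the row is removed locally but still exists on the other nodes.
mysql -u root -p -e "
SET SESSION wsrep_on=OFF;
DELETE FROM t1.tab WHERE id = 100;
SET SESSION wsrep_on=ON;"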

When such an inconsistency is eventually hit by a replicated writeset, this is how things look in the error log.

2015-11-12 15:35:59 5071 [ERROR] Slave SQL: Could not execute Update_rows event on table t1.tab; Can't find record in 'tab', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 162, Error_code: 1032

If you are familiar with row based replication, you may have seen errors similar to the one above. This is information from MySQL that one of the events cannot be executed. In this particular case, some update failed to apply.

2015-11-12 15:35:59 5071 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 563464
2015-11-12 15:35:59 5071 [Warning] WSREP: Failed to apply app buffer: seqno: 563464, status: 1
     at galera/src/trx_handle.cpp:apply():351

Here we can see the Galera part of the information – the writeset with sequence number 563464 couldn’t be applied.

Retrying 2th time
2015-11-12 15:35:59 5071 [ERROR] Slave SQL: Could not execute Update_rows event on table t1.tab; Can't find record in 'tab', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 162, Error_code: 1032
2015-11-12 15:35:59 5071 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 563464
2015-11-12 15:35:59 5071 [Warning] WSREP: Failed to apply app buffer: seqno: 563464, status: 1
     at galera/src/trx_handle.cpp:apply():351
Retrying 3th time
2015-11-12 15:35:59 5071 [ERROR] Slave SQL: Could not execute Update_rows event on table t1.tab; Can't find record in 'tab', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 162, Error_code: 1032
2015-11-12 15:35:59 5071 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 563464
2015-11-12 15:35:59 5071 [Warning] WSREP: Failed to apply app buffer: seqno: 563464, status: 1
     at galera/src/trx_handle.cpp:apply():351
Retrying 4th time
2015-11-12 15:35:59 5071 [ERROR] Slave SQL: Could not execute Update_rows event on table t1.tab; Can't find record in 'tab', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 162, Error_code: 1032
2015-11-12 15:35:59 5071 [Warning] WSREP: RBR event 3 Update_rows apply warning: 120, 563464

To work around transient errors (deadlocks, for example), the writeset was re-executed four times. This time it was a permanent error, so all attempts failed.

2015-11-12 15:35:59 5071 [ERROR] WSREP: Failed to apply trx: source: 7c775f39-893d-11e5-93b5-0b0d185bde79 version: 3 local: 0 state: APPLYING flags: 1 conn_id: 5 trx_id: 2779411 seqnos (l: 12, g: 563464, s: 563463, d: 563463, ts: 11103845835944)
2015-11-12 15:35:59 5071 [ERROR] WSREP: Failed to apply trx 563464 4 times
2015-11-12 15:35:59 5071 [ERROR] WSREP: Node consistency compromized, aborting…

With the failure to apply the writeset, the node was deemed inconsistent with the rest of the cluster, and Galera took the action intended to minimize the impact of this inconsistency – the node is shut down, and it won’t be able to rejoin the cluster as long as this writeset remains unapplied.

2015-11-12 15:35:59 5071 [Note] WSREP: Closing send monitor...
2015-11-12 15:35:59 5071 [Note] WSREP: Closed send monitor.
2015-11-12 15:35:59 5071 [Note] WSREP: gcomm: terminating thread
2015-11-12 15:35:59 5071 [Note] WSREP: gcomm: joining thread
2015-11-12 15:35:59 5071 [Note] WSREP: gcomm: closing backend
2015-11-12 15:35:59 5071 [Note] WSREP: view(view_id(NON_PRIM,7c775f39,3) memb {
    8165505b,0
} joined {
} left {
} partitioned {
    7c775f39,0
    821adb36,0
})
2015-11-12 15:35:59 5071 [Note] WSREP: view((empty))
2015-11-12 15:35:59 5071 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2015-11-12 15:35:59 5071 [Note] WSREP: gcomm: closed
2015-11-12 15:35:59 5071 [Note] WSREP: Flow-control interval: [16, 16]
2015-11-12 15:35:59 5071 [Note] WSREP: Received NON-PRIMARY.
2015-11-12 15:35:59 5071 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 563464)
2015-11-12 15:35:59 5071 [Note] WSREP: Received self-leave message.
2015-11-12 15:35:59 5071 [Note] WSREP: Flow-control interval: [0, 0]
2015-11-12 15:35:59 5071 [Note] WSREP: Received SELF-LEAVE. Closing connection.
2015-11-12 15:35:59 5071 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 563464)
2015-11-12 15:35:59 5071 [Note] WSREP: RECV thread exiting 0: Success
2015-11-12 15:35:59 5071 [Note] WSREP: recv_thread() joined.
2015-11-12 15:35:59 5071 [Note] WSREP: Closing replication queue.
2015-11-12 15:35:59 5071 [Note] WSREP: Closing slave action queue.
2015-11-12 15:35:59 5071 [Note] WSREP: /usr/sbin/mysqld: Terminated.
Aborted (core dumped)
151112 15:35:59 mysqld_safe Number of processes running now: 0
151112 15:35:59 mysqld_safe WSREP: not restarting wsrep node automatically
151112 15:35:59 mysqld_safe mysqld from pid file /var/lib/mysql/mysql.pid ended

What follows is the regular MySQL shutdown process.
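
Once you understand why the writeset failed (and have fixed the underlying data problem, for example with pt-table-sync), the usual way to bring such a node back is a full SST. A sketch – file paths and the service name depend on your setup:

# The node is already down after the abort; make sure it stays stopped before touching files.
mv /var/lib/mysql/grastate.dat /var/lib/mysql/grastate.dat.bak
# On startup the node will have no usable local state and will request SST from a donor.
service mysql start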

This concludes the first part of our troubleshooting blog. In the second part of the blog, we will cover different types of SST issues, as well as problems with network streaming.
