blog

Troubleshooting Galera Cluster Issues: Part 2

Krzysztof Ksiazek

Published: November 20, 2015
Last Updated: August 26, 2022

This is part 2 of our blog on how to troubleshoot Galera cluster – SST errors, and problems with network streaming. In part 1 of the blog, we covered issues ranging from node crashes to clusters that won’t restart, network splits and inconsistent data. Note that the issues described are all examples inspired from real life incidents in production environments.

This is the eighteenth installment in the ‘Become a MySQL DBA’ blog series. Our previous posts in the DBA series include:

SST errors

State Snapshot Transfer (SST) is a mechanism intended to provision or rebuild Galera nodes. There are a couple of ways you can execute SST, including mysqldump and rsync. The most popular one, though, is to perform SST using xtrabackup – it allows the donor to stay online and makes the SST process less intrusive. Xtrabackup SST process involves streaming data over the network from the donor to the joining node. Additionally, xtrabackup needs access to MySQL to grab the InnoDB log data – it checks ‘wsrep_sst_auth’ variable for access credentials. In this chapter we are going to cover examples of problems related to SST.

Incorrect password for SST

Sometimes it may happen that the credentials for MySQL access, set in wsrep_sst_auth, are not correct. In such case SST won’t complete properly because xtrabackup, on the donor, will not be able to connect to the MySQL database to perform a backup. You may then see the following symptoms in the error log.

Let’s start with a joining node.

2015-11-13 10:40:57 30484 [Note] WSREP: Quorum results:
        version    = 3,
        component  = PRIMARY,
        conf_id    = 16,
        members    = 2/3 (joined/total),
        act_id     = 563464,
        last_appl. = -1,
        protocols  = 0/7/3 (gcs/repl/appl),
        group UUID = 98ed75de-7c05-11e5-9743-de4abc22bd11
2015-11-13 10:40:57 30484 [Note] WSREP: Flow-control interval: [28, 28]
2015-11-13 10:40:57 30484 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 563464)
2015-11-13 10:40:57 30484 [Note] WSREP: State transfer required:
        Group state: 98ed75de-7c05-11e5-9743-de4abc22bd11:563464
        Local state: 00000000-0000-0000-0000-000000000000:-1

So far we have the standard SST process – a node joined the cluster and state transfer was deemed necessary.

...

2015-11-13 10:40:57 30484 [Warning] WSREP: Gap in state sequence. Need state transfer.
2015-11-13 10:40:57 30484 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '172.30.4.220' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --
defaults-group-suffix '' --parent '30484'  '' '
WSREP_SST: [INFO] Streaming with xbstream (20151113 10:40:57.975)
WSREP_SST: [INFO] Using socat as streamer (20151113 10:40:57.977)
WSREP_SST: [INFO] Stale sst_in_progress file: /var/lib/mysql//sst_in_progress (20151113 10:40:57.980)
WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20151113 10:40:58.005)
2015-11-13 10:40:58 30484 [Note] WSREP: Prepared SST request: xtrabackup-v2|172.30.4.220:4444/xtrabackup_sst//1

Xtrabackup started.

2015-11-13 10:40:58 30484 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2015-11-13 10:40:58 30484 [Note] WSREP: REPL Protocols: 7 (3, 2)
2015-11-13 10:40:58 30484 [Note] WSREP: Service thread queue flushed.
2015-11-13 10:40:58 30484 [Note] WSREP: Assign initial position for certification: 563464, protocol version: 3
2015-11-13 10:40:58 30484 [Note] WSREP: Service thread queue flushed.
2015-11-13 10:40:58 30484 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (98ed75de-7c05-11e5-9743-de4abc22bd11): 1 (Operation not permitted)
         at galera/src/replicator_str.cpp:prepare_for_IST():482. IST will be unavailable.
2015-11-13 10:40:58 30484 [Note] WSREP: Member 0.0 (172.30.4.220) requested state transfer from '*any*'. Selected 1.0 (172.30.4.156)(SYNCED) as donor.
2015-11-13 10:40:58 30484 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 563464)

IST was not available due to the state of the node (in this particular case, the MySQL data directory was not available).

2015-11-13 10:40:58 30484 [Note] WSREP: Requesting state transfer: success, donor: 1
WSREP_SST: [INFO] WARNING: Stale temporary SST directory: /var/lib/mysql//.sst from previous state transfer. Removing (20151113 10:40:58.430)
WSREP_SST: [INFO] Proceeding with SST (20151113 10:40:58.434)
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20151113 10:40:58.435)
WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20151113 10:40:58.436)
removed ‘/var/lib/mysql/ibdata1’
removed ‘/var/lib/mysql/ib_logfile1’
removed ‘/var/lib/mysql/ib_logfile0’
removed ‘/var/lib/mysql/auto.cnf’
removed ‘/var/lib/mysql/mysql.sock’
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20151113 10:40:58.568)

The remaining files from the data directory were removed, and the node started to wait for SST process to complete.

2015-11-13 10:41:00 30484 [Note] WSREP: (05e14925, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [ERROR] xtrabackup_checkpoints missing, failed innobackupex/SST on donor (20151113 10:41:08.407)
WSREP_SST: [ERROR] Cleanup after exit with status:2 (20151113 10:41:08.409)
2015-11-13 10:41:08 30484 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '172.30.4.220' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '30484'  '' : 2 (No such file or directory)
2015-11-13 10:41:08 30484 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2015-11-13 10:41:08 30484 [ERROR] WSREP: SST script aborted with error 2 (No such file or directory)
2015-11-13 10:41:08 30484 [ERROR] WSREP: SST failed: 2 (No such file or directory)
2015-11-13 10:41:08 30484 [ERROR] Aborting

2015-11-13 10:41:08 30484 [Warning] WSREP: 1.0 (172.30.4.156): State transfer to 0.0 (172.30.4.220) failed: -22 (Invalid argument)
2015-11-13 10:41:08 30484 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():731: Will never receive state. Need to abort.

Unfortunately, SST failed on a donor node and, as a result, Galera node aborted and the mysqld process stopped. There’s not much information here about the exact cause of the problem, we’ve been pointed to the donor, though. Let’s take a look at it’s log.

2015-11-13 10:44:33 30400 [Note] WSREP: Member 0.0 (172.30.4.220) requested state transfer from '*any*'. Selected 1.0 (172.30.4.156)(SYNCED) as donor.
2015-11-13 10:44:33 30400 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 563464)
2015-11-13 10:44:33 30400 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2015-11-13 10:44:33 30400 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'donor' --address '172.30.4.220:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix ''   '' --gtid '98ed75de-7c05-11e5-9743-de4abc22bd11:563464''
2015-11-13 10:44:33 30400 [Note] WSREP: sst_donor_thread signaled with 0
WSREP_SST: [INFO] Streaming with xbstream (20151113 10:44:33.598)
WSREP_SST: [INFO] Using socat as streamer (20151113 10:44:33.600)
WSREP_SST: [INFO] Using /tmp/tmp.jZEi7YBrNl as xtrabackup temporary directory (20151113 10:44:33.613)
WSREP_SST: [INFO] Using /tmp/tmp.wz0XmveABt as innobackupex temporary directory (20151113 10:44:33.615)
WSREP_SST: [INFO] Streaming GTID file before SST (20151113 10:44:33.619)
WSREP_SST: [INFO] Evaluating xbstream -c ${INFO_FILE} | socat -u stdio TCP:172.30.4.220:4444; RC=( ${PIPESTATUS[@]} ) (20151113 10:44:33.621)
WSREP_SST: [INFO] Sleeping before data transfer for SST (20151113 10:44:33.626)
2015-11-13 10:44:35 30400 [Note] WSREP: (b7a335ea, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [INFO] Streaming the backup to joiner at 172.30.4.220 4444 (20151113 10:44:43.628)

In this part of the log, you can see some similar lines as in the joiner’s log – joiner (172.30.4.220) is a joiner and it requested state transfer from 172.30.4.156 (our donor node). Donor switched to a donor/desync state and xtrabackup has been triggered.

WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/etc/mysql/my.cnf  --defaults-group=mysqld --no-version-check  $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>
${DATA}/innobackup.backup.log | socat -u stdio TCP:172.30.4.220:4444; RC=( ${PIPESTATUS[@]} ) (20151113 10:44:43.631)
2015-11-13 10:44:43 30400 [Warning] Access denied for user 'root'@'localhost' (using password: YES)
WSREP_SST: [ERROR] innobackupex finished with error: 1.  Check /var/lib/mysql//innobackup.backup.log (20151113 10:44:43.638)

At some point SST failed – we can see a hint about what may be the culprit of the problem. There’s also information about where we can look even further for information – innobackup.backup.log file in MySQL data directory.

WSREP_SST: [ERROR] Cleanup after exit with status:22 (20151113 10:44:43.640)
WSREP_SST: [INFO] Cleaning up temporary directories (20151113 10:44:43.642)
2015-11-13 10:44:43 30400 [ERROR] WSREP: Failed to read from: wsrep_sst_xtrabackup-v2 --role 'donor' --address '172.30.4.220:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.s
ock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix ''   '' --gtid '98ed75de-7c05-11e5-9743-de4abc22bd11:563464'
2015-11-13 10:44:43 30400 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address '172.30.4.220:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix ''   '' --gtid '98ed75de-7c05-11e5-9743-de4abc22bd11:563464': 22 (Invalid argument)
2015-11-13 10:44:43 30400 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address '172.30.4.220:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix ''   '' --gtid '98ed75de-7c05-11e5-9743-de4abc22bd11:563464'
2015-11-13 10:44:43 30400 [Warning] WSREP: 1.0 (172.30.4.156): State transfer to 0.0 (172.30.4.220) failed: -22 (Invalid argument)
2015-11-13 10:44:43 30400 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 563464)
2015-11-13 10:44:43 30400 [Note] WSREP: Member 1.0 (172.30.4.156) synced with group.
2015-11-13 10:44:43 30400 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 563464)

All SST-related processes terminate and, the donor gets in sync with the rest of the cluster switching to ‘Synced’ state.

Finally, let’s take a look at the innobackup.backup.log:

151113 14:00:57 innobackupex: Starting the backup operation

IMPORTANT: Please check that the backup run completes successfully.
           At the end of a successful backup run innobackupex
           prints "completed OK!".

151113 14:00:57 Connecting to MySQL server host: localhost, user: root, password: not set, port: 3306, socket: /var/lib/mysql/mysql.sock
Failed to connect to MySQL server: Access denied for user 'root'@'localhost' (using password: YES).

This time there’s nothing new – the problem is related to access issues to MySQL database – such issue should immediately point you to my.cnf and ‘wsrep_sst_auth’ variable.

Single Console for Your Entire Database Infrastructure

Find out what else is new in ClusterControl

Install ClusterControl for FREE

Problems with network streaming

Sometimes it may happen that streaming doesn’t work correctly. It can happen because there’s a network error or maybe some of the processes involved died. Let’s check how a network error impacts the SST process.

First error log snippet comes from the joiner node.

2015-11-13 14:30:46 24948 [Note] WSREP: Member 0.0 (172.30.4.220) requested state transfer from '*any*'. Selected 1.0 (172.30.4.156)(SYNCED) as donor.
2015-11-13 14:30:46 24948 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 563464)
2015-11-13 14:30:46 24948 [Note] WSREP: Requesting state transfer: success, donor: 1
WSREP_SST: [INFO] Proceeding with SST (20151113 14:30:46.352)
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20151113 14:30:46.353)
WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20151113 14:30:46.356)
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20151113 14:30:46.363)
2015-11-13 14:30:48 24948 [Note] WSREP: (201db672, 'tcp://0.0.0.0:4567') turning message relay requesting off
2015-11-13 14:31:01 24948 [Note] WSREP: (201db672, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.30.4.156:4567
2015-11-13 14:31:02 24948 [Note] WSREP: (201db672, 'tcp://0.0.0.0:4567') reconnecting to b7a335ea (tcp://172.30.4.156:4567), attempt 0
2015-11-13 14:33:02 24948 [Note] WSREP: (201db672, 'tcp://0.0.0.0:4567') reconnecting to b7a335ea (tcp://172.30.4.156:4567), attempt 30
2015-11-13 14:35:02 24948 [Note] WSREP: (201db672, 'tcp://0.0.0.0:4567') reconnecting to b7a335ea (tcp://172.30.4.156:4567), attempt 60
^@2015-11-13 14:37:02 24948 [Note] WSREP: (201db672, 'tcp://0.0.0.0:4567') reconnecting to b7a335ea (tcp://172.30.4.156:4567), attempt 90
^@2015-11-13 14:39:02 24948 [Note] WSREP: (201db672, 'tcp://0.0.0.0:4567') reconnecting to b7a335ea (tcp://172.30.4.156:4567), attempt 120
^@^@2015-11-13 14:41:02 24948 [Note] WSREP: (201db672, 'tcp://0.0.0.0:4567') reconnecting to b7a335ea (tcp://172.30.4.156:4567), attempt 150
^@^@2015-11-13 14:43:02 24948 [Note] WSREP: (201db672, 'tcp://0.0.0.0:4567') reconnecting to b7a335ea (tcp://172.30.4.156:4567), attempt 180
^@^@2015-11-13 14:45:02 24948 [Note] WSREP: (201db672, 'tcp://0.0.0.0:4567') reconnecting to b7a335ea (tcp://172.30.4.156:4567), attempt 210
^@^@2015-11-13 14:46:30 24948 [Warning] WSREP: 1.0 (172.30.4.156): State transfer to 0.0 (172.30.4.220) failed: -22 (Invalid argument)
2015-11-13 14:46:30 24948 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():731: Will never receive state. Need to abort.
2015-11-13 14:46:30 24948 [Note] WSREP: gcomm: terminating thread
2015-11-13 14:46:30 24948 [Note] WSREP: gcomm: joining thread
2015-11-13 14:46:30 24948 [Note] WSREP: gcomm: closing backend
2015-11-13 14:46:30 24948 [Note] WSREP: view(view_id(NON_PRIM,201db672,149) memb {
    201db672,0
} joined {
} left {
} partitioned {
    b7a335ea,0
    fff6c307,0
})
2015-11-13 14:46:30 24948 [Note] WSREP: view((empty))
2015-11-13 14:46:30 24948 [Note] WSREP: gcomm: closed
2015-11-13 14:46:30 24948 [Note] WSREP: /usr/sbin/mysqld: Terminated.

As you can see, it’s obvious that SST failed:

2015-11-13 14:46:30 24948 [Warning] WSREP: 1.0 (172.30.4.156): State transfer to 0.0 (172.30.4.220) failed: -22 (Invalid argument)
2015-11-13 14:46:30 24948 [ERROR] WSREP: gcs/src/gcs_group.cpp:gcs_group_handle_join_msg():731: Will never receive state. Need to abort.

A few lines before that, though:

2015-11-13 14:31:01 24948 [Note] WSREP: (201db672, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.30.4.156:4567
...
2015-11-13 14:45:02 24948 [Note] WSREP: (201db672, 'tcp://0.0.0.0:4567') reconnecting to b7a335ea (tcp://172.30.4.156:4567), attempt 210

The donor node was declared inactive – this is a very important indication of network problems. Given that Galera tried 210 times to communicate with the other node, it’s pretty clear this is not a transient error but something more serious. On the donor node, the error log contains the following entries that also make it clear something is not right with the network:

WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/etc/mysql/my.cnf  --defaults-group=mysqld --no-version-check  $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio TCP:172.30.4.220:4444; RC=( ${PIPESTATUS[@]} ) (20151113 14:30:56.713)
2015-11-13 14:31:02 30400 [Note] WSREP: (b7a335ea, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.30.4.220:4567
2015-11-13 14:31:03 30400 [Note] WSREP: (b7a335ea, 'tcp://0.0.0.0:4567') reconnecting to 201db672 (tcp://172.30.4.220:4567), attempt 0

...

2015-11-13 14:45:03 30400 [Note] WSREP: (b7a335ea, 'tcp://0.0.0.0:4567') reconnecting to 201db672 (tcp://172.30.4.220:4567), attempt 210
2015/11/13 14:46:30 socat[8243] E write(3, 0xc1a1f0, 8192): Connection timed out
2015-11-13 14:46:30 30400 [Warning] Aborted connection 76 to db: 'unconnected' user: 'root' host: 'localhost' (Got an error reading communication packets)
WSREP_SST: [ERROR] innobackupex finished with error: 1.  Check /var/lib/mysql//innobackup.backup.log (20151113 14:46:30.767)
WSREP_SST: [ERROR] Cleanup after exit with status:22 (20151113 14:46:30.769)
WSREP_SST: [INFO] Cleaning up temporary directories (20151113 14:46:30.771)

This time the disappearing node is 172.30.4.220 – the node which is joining the cluster. Finally, innobackup.backup.log looks as follows:

Warning: Using unique option prefix open_files instead of open_files_limit is deprecated and will be removed in a future release. Please use the full name instead.
151113 14:30:56 innobackupex: Starting the backup operation

IMPORTANT: Please check that the backup run completes successfully.
           At the end of a successful backup run innobackupex
           prints "completed OK!".

151113 14:30:56 Connecting to MySQL server host: localhost, user: root, password: not set, port: 3306, socket: /var/lib/mysql/mysql.sock
Using server version 5.6.26-74.0-56
innobackupex version 2.3.2 based on MySQL server 5.6.24 Linux (x86_64) (revision id: 306a2e0)
xtrabackup: uses posix_fadvise().
xtrabackup: cd to /var/lib/mysql
xtrabackup: open files limit requested 4000000, set to 1000000
xtrabackup: using the following InnoDB configuration:
xtrabackup:   innodb_data_home_dir = ./
xtrabackup:   innodb_data_file_path = ibdata1:100M:autoextend
xtrabackup:   innodb_log_group_home_dir = ./
xtrabackup:   innodb_log_files_in_group = 2
xtrabackup:   innodb_log_file_size = 536870912
xtrabackup: using O_DIRECT
151113 14:30:56 >> log scanned up to (8828922053)
xtrabackup: Generating a list of tablespaces
151113 14:30:57 [01] Streaming ./ibdata1
151113 14:30:57 >> log scanned up to (8828922053)
151113 14:30:58 >> log scanned up to (8828922053)
151113 14:30:59 [01]        ...done
151113 14:30:59 [01] Streaming ./mysql/innodb_table_stats.ibd
151113 14:30:59 >> log scanned up to (8828922053)
151113 14:31:00 >> log scanned up to (8828922053)
151113 14:31:01 >> log scanned up to (8828922053)
151113 14:31:02 >> log scanned up to (8828922053)

...

151113 14:46:27 >> log scanned up to (8828990096)
151113 14:46:28 >> log scanned up to (8828990096)
151113 14:46:29 >> log scanned up to (8828990096)
innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
xb_stream_write_data() failed.
innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
[01] xtrabackup: Error: xtrabackup_copy_datafile() failed.
[01] xtrabackup: Error: failed to copy datafile.

Pipe was broken because the network connection timed out.

One interesting question – how we can track errors when one of the components required to perform SST dies? It can be anything from xtrabackup process to socat, which is needed to stream the data. Let’s go over some examples:

2015-11-13 15:40:02 29258 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 563464)
2015-11-13 15:40:02 29258 [Note] WSREP: Requesting state transfer: success, donor: 0
WSREP_SST: [INFO] Proceeding with SST (20151113 15:40:02.497)
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20151113 15:40:02.499)
WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20151113 15:40:02.499)
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20151113 15:40:02.511)
2015-11-13 15:40:04 29258 [Note] WSREP: (cd0c84cd, 'tcp://0.0.0.0:4567') turning message relay requesting off
2015/11/13 15:40:13 socat[29525] E write(1, 0xf2d420, 8192): Broken pipe
/usr//bin/wsrep_sst_xtrabackup-v2: line 112: 29525 Exit 1                  socat -u TCP-LISTEN:4444,reuseaddr stdio
     29526 Killed                  | xbstream -x
WSREP_SST: [ERROR] Error while getting data from donor node:  exit codes: 1 137 (20151113 15:40:13.113)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20151113 15:40:13.116)

In the case described above, as can be clearly seen, xbstream was killed.

When socat process gets killed, joiner logs do not offer much data:

2015-11-13 15:43:05 9717 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 563464)
2015-11-13 15:43:05 9717 [Note] WSREP: Requesting state transfer: success, donor: 1
WSREP_SST: [INFO] Evaluating socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20151113 15:43:05.438)
WSREP_SST: [INFO] Proceeding with SST (20151113 15:43:05.438)
WSREP_SST: [INFO] Cleaning the existing datadir and innodb-data/log directories (20151113 15:43:05.440)
WSREP_SST: [INFO] Waiting for SST streaming to complete! (20151113 15:43:05.448)
2015-11-13 15:43:07 9717 [Note] WSREP: (3a69eae9, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [ERROR] Error while getting data from donor node:  exit codes: 137 0 (20151113 15:43:11.442)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20151113 15:43:11.444)
2015-11-13 15:43:11 9717 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address '172.30.4.220' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '9717'  '' : 32 (Broken pipe)
2015-11-13 15:43:11 9717 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
2015-11-13 15:43:11 9717 [ERROR] WSREP: SST script aborted with error 32 (Broken pipe)
2015-11-13 15:43:11 9717 [ERROR] WSREP: SST failed: 32 (Broken pipe)
2015-11-13 15:43:11 9717 [ERROR] Aborting

We have just an indication that something went really wrong and that SST failed. On the donor node, things are a bit different:

WSREP_SST: [INFO] Streaming the backup to joiner at 172.30.4.220 4444 (20151113 15:43:15.875)
WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/etc/mysql/my.cnf  --defaults-group=mysqld --no-version-check  $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio TCP:172.30.4.220:4444; RC=( ${PIPESTATUS[@]} ) (20151113 15:43:15.878)
2015/11/13 15:43:15 socat[16495] E connect(3, AF=2 172.30.4.220:4444, 16): Connection refused
2015-11-13 15:43:16 30400 [Warning] Aborted connection 81 to db: 'unconnected' user: 'root' host: 'localhost' (Got an error reading communication packets)
WSREP_SST: [ERROR] innobackupex finished with error: 1.  Check /var/lib/mysql//innobackup.backup.log (20151113 15:43:16.305)
WSREP_SST: [ERROR] Cleanup after exit with status:22 (20151113 15:43:16.306)
WSREP_SST: [INFO] Cleaning up temporary directories (20151113 15:43:16.309)

We can find ‘Aborted connection’ warning and information that innobackupex (wrapper that’s used to execute xtrabackup) finished with error. Still not that much hard data. Luckily, we have also innobackup.backup.log to check:

151113 15:46:59 >> log scanned up to (8829243081)
xtrabackup: Generating a list of tablespaces
151113 15:46:59 [01] Streaming ./ibdata1
innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
xb_stream_write_data() failed.
innobackupex: Error writing file 'UNOPENED' (Errcode: 32 - Broken pipe)
[01] xtrabackup: Error: xtrabackup_copy_datafile() failed.
[01] xtrabackup: Error: failed to copy datafile.

This time it’s much more clear – we have an error in xtrabackup_copy_datafile() function which, as you may expect, is responsible for copying files. This leads us to the conclusion that some kind of network connection issue happened between donor and joiner. Maybe not a detailed explanation but the conclusion is still valid – killing the process which is on the other end of the network connection closes that connection abruptly – something which can be described like a network issue of some kind.

We are approaching the end of this post – hopefully you enjoyed our small trip through the Galera logs. As you have probably figured out by now, it is important to check an issue from every possible angle when dealing with Galera errors. Galera nodes work together to form a cluster but they are still separate hosts connected via a network. Sometimes looking at single node’s logs is not enough to understand how the issue unfolded – you need to check every node’s point of view. This is also true for all potential issues about SST – sometimes it’s not enough to check logs on a joiner node. You also have to look into the donor’s logs – both error log and innobackup.backup.log.

The examples of issues described in this post (and the one before) only covers a small part of all possible problems, but hopefully, they’ll help you understand the troubleshooting process.

In the next post, we are going to take a closer look at pt-stalk – a tool which may help you understand what is going on with MySQL when the standard approach fails to catch the problem.

ClickHouse Schema Design and Data Modeling

Building a Modern Analytics Stack Around ClickHouse

Managing ClickHouse Resources in Multi-Tenant Environments

How to Migrate from SQL Server to PostgreSQL HA for Dockerized Apps