Failover Modes for SQL Server cluster on Linux
In a previous article, we explored two different high-availability and disaster recovery solutions for SQL Server on Linux—log shipping and SQL Server Always On Availability Groups.
If you’re using Availability Groups (AG), it’s essential to understand its supported failover modes prior to AG configuration.
In this post, we’ll cover cluster configurations for high availability, the failover process for each cluster type supported in SQL Server Linux, and how to perform failover for SQL Server Linux using ClusterControl.
Failover options in SQL Server Always On Availability Groups
From SQL Server 2017 onwards, the Linux version supports the following cluster types:
- None: The cluster type None configures Availability Groups (AG) between standalone nodes. It does not use Pacemaker configurations. It supports manual failover from primary to secondary replicas.
- External: The External cluster type requires a Pacemaker underneath the AG for managing the cluster. It supports automatic and manual failover.
SQL Server cluster configurations for high availability
SQL Server on Linux supports cluster configurations for high availability. The concept behind the cluster configuration in Linux is similar to the Windows Server Failover Cluster. A few characteristics of clustering in SQL Linux are as below:
- SQL Server Linux requires Pacemaker for clustering configurations.
- The Standard edition supports two replicas; one primary and one secondary. The secondary replica is used for high availability purposes.
- You cannot read data from the secondary node in the standard edition.
- The Enterprise edition supports nine replicas; one primary and eight secondary replicas.
- The Enterprise edition supports readable secondary replicas. You can have up to 2 secondary synchronous replicas out of eight secondary replicas.
Let’s understand in detail the failover process in the None cluster type. As we know, it involves two standalone nodes for a high-availability configuration. It supports the following two failover modes:
- Manual failover without data loss: In many cases, you might need to plan minor downtime on the primary replica for applying e.g., OS patches or hotfixes, or doing configuration changes. In these cases, you can do a manual failover without any data loss.
The steps for manual failover without data loss are as follows:
- Use the Alter Availability Group statement to make the current primary and target secondary replicas in the SYNCHRONOUS COMMIT mode. The synchronous commit ensures primary and secondary replicas are in full sync.
- Take the primary replica in offline mode to prepare for role switch.
- Promote a secondary replica as a new primary replica.
- Update the old primary role as SECONDARY.
- Resume the data movement between new primary and secondary replicas.
- Manual failover with data loss: Suppose your primary replica crashes due to hardware failure. In this case, the primary replica is not available. It will take time to recover the server. Your application will not be available because it connects to the primary replica for executing queries. Your secondary replica is available, but the database is not accessible (in restoring mode). In this case, you can plan to force the failover to the secondary replica to point the application to start serving requests from the secondary replica. This type of forced failover can involve data loss. Therefore, you must use it only if the primary instance is unavailable.
The steps for the forced manual failover with data loss are as follows:
- Initiate the force failover from the secondary replica.
- After the force failover, remove the original primary from the new replica (Old secondary)
- Point application to the new primary replica.
If the original primary replica comes online, it will try to take the primary role. Therefore, we must remove the availability group configuration from the original replica and re-configure it.
The steps to perform if the original primary replica comes online are as follows:
- Take the availability group offline on the original primary.
- Remove the availability group offline from the original primary
- Add the replica as a secondary replica from the new primary replica.
SQL Server Linux failover using ClusterControl
ClusterControl version 1.9.2 or higher supports high availability SQL Server 2019 clusters. Users can deploy the cluster with up to 8 nodes (1 primary and 8 secondary replicas) in asynchronous replication. Using ClusterControl to manage SQL Server on Linux provides the following benefits:
- Graphical installation for the SQL Server instances
- Automatic configurations for SQL Server Agent, backups, certificates, backup retention, and restoration with proper backup chain
- Monitoring and alerting
- Security and Compliance
- High availability and disaster recovery (DR) with automatic failover
ClusterControl supports manual failover using graphical controls with and without data loss.
- Manual failover without data loss: You can initiate a manual failover without any data loss. The cluster logs show the commands, progress tracking the manual failover, and its status.
- Forced failover with data loss: If the primary instance is unavailable, ClusterControl automatically performs forced failover and promotes the secondary replica as a new primary replica. It creates a new node as a new secondary replica, so you have a connected primary and secondary replica.
Let’s understand how the AG failover works using the ClusterControl for SQL Server on Linux. In my demo environment, I have three AG replicas:
Secondary: mssql2 and mssql3
Now, due to some maintenance activity, we need to perform a failover and switch node mssql1 role from primary to secondary. The node mssql2 should be the new primary replica.
Click on the ellipses for the node mssql2 and Promote Replica.
It opens the following promote replica page. You get an option – Force promotion of replica. Do not click on this option unless your current primary is down.
Click on Promote to start a new job that will perform the SQL Server Linux AG failover. After this, the node mssql2 should be the primary.
You can go to the activity center and view the job messages, status, SQL script it uses for performing the AG failover.
The job shows the green tick or logs a message: successfully promoted a new master, stating the ClusterControl has performed the AG failover.
You can view the status of the nodes, and it shows the followings:
New Primary: mssql2
Old primary or current secondary: mssql1 and mssql3
Let’s say due to some issues, the primary replica is down. In this case, you required manual forced failover.
Now we want to promote mssql3 as the new primary after forced failover. Click on Promote Replica from the mssql3 ellipses options.
Toggle the switch to enable force promotion of the replica, as shown below.
View the job activity details as shown below.
After the forced failover, the mssql3 node takes the role of a new primary replica. The old primary replica mssql2 is still in a failed state.
If the primary replica becomes available again, you need to remove it from the cluster and add it as a secondary AG replica. Later, if required, you can perform a failover to switch the role back to primary.
When it comes to implementing a successful high availability and disaster recovery solution for SQL Server on Linux, understanding the failover modes available, how they work, and how to configure them is key. Hopefully, this article has helped clearly outline the steps involved in each failover process.
If you’re just getting started with SQL Server Always On Availability Groups, check out these steps to set up AG on Linux. Want more SQL Server content? Subscribe to our newsletter to have the latest posts delivered straight to your inbox, and follow us on LinkedIn and Twitter for regular database operation management tips.
Subscribe to get our best and freshest content