Automating MySQL failover on Docker to reduce downtime
Multi-DC MySQL Automated Failover on Docker coupled with on-premise and cloud backup management features.
Background
Instant-Gaming.com is a leading worldwide online distributor of video games for platforms like Steam, uPlay, Origin, Xbox, PS4, and more. They are a team of enthusiastic gamers who chose to break free from the classic brick-and-mortar retail model and branch out into discounted online distribution. Distributing titles online eliminates storage and logistics costs, and those savings are passed along to the gamer. Game codes are scanned from the boxed versions of the games and delivered to customers within a few seconds of purchase. Customers then redeem the codes legally on the popular online platforms, which handle installation and setup. This path to market lets Instant Gaming offer popular titles to its users at savings of up to 80% over traditional retail.
Challenge
As Instant Gaming continued to grow across its different web properties, it was not uncommon for them to have surges of activity trickling down to the centralized database infrastructure. During peak times, this could create performance issues. Occasionally, downtime would occur due to loss of connectivity between data centers, maintenance and upgrade operations, or even operator error in some cases. “Downtime is critical for e-commerce as your revenue stream depends on your uptime,” said Jérémie Bordier, IT Infrastructure Consultant of Instant Gaming.
Built on a custom-made eCommerce framework, Instant Gaming’s front-end applications used a MySQL database deployed in Master-Slave mode, with manual failover. The same database served back-office applications such as order management, billing, inventory, and customer data. The applications ran in a Docker environment, while the database itself ran on bare-metal hardware.
“Docker is an amazing technology that allowed us to abstract our applications from underlying hardware and data center environments. It was a great way for us to deploy our applications and quickly spin up and replace failed infrastructure.”
Jérémie Bordier, IT Infrastructure Consultant of Instant Gaming
As website traffic continued to grow, the team increasingly realized that the database had too many potential points of failure. For instance, a network glitch between data centers could leave the database, and therefore the service, unavailable for 10 to 15 minutes while the failover was being performed. Even if they outsourced the management of the database to external consultants, manual failover was simply too slow.
Solution
Jérémie Bordier and his team took to the web, evaluating several different ways to achieve the stability they were looking for. Galera Cluster was evaluated, but while the technology offered automatic failover, it was more complex and would have required much more work to integrate into their system, which they felt was not ideal. The team also looked at open source solutions. Orchestrator, for instance, a replication topology manager for MySQL used at GitHub, offered advanced functionality. Although the team was impressed with its capabilities, they felt it was not straightforward to implement and maintain. Other options included hiring a full-time DBA or outsourcing operations to external DBA consultants. That way, should anything happen, the DBA would be paged and an SLA would guarantee a certain response time. This would have increased operating costs for the business. More importantly, they would still have suffered the prolonged outages that come with paging a human and waiting for that person to log in, read logs, and understand the issue before initiating failover procedures.
Amazon RDS (Relational Database Service) for MySQL was also evaluated. The team quickly realized, however, that keeping the application hosted outside Amazon created latency issues, and that migrating the system fully to AWS (and signing up for a fully managed MySQL RDS solution) would have resulted in a four- to fivefold cost increase over running on their own hardware. “Every additional millisecond of latency matters, we can tie it directly to our conversion rate,” said Bordier. “With AWS you can achieve performance, but it’s very expensive compared to a barebone solution.”
Outcome
The team then found and installed ClusterControl. For Instant Gaming, ClusterControl offered the MySQL automated failover they were looking for as well as support for multiple data centers which tied perfectly into their disaster recovery needs. ClusterControl also offered support for Docker and integration with Amazon S3 for automatic backups and archiving to the cloud.
ClusterControl was put through rigorous testing by the Instant Gaming team. “We tested extensively the failure scenarios and ClusterControl behaved accordingly, giving us confidence in our infrastructure,” said Bordier. “This also allowed us to have better processes for maintenance operations.”
From the initial download of ClusterControl to the migration of their databases to the launch into production, it took a little more than two months.
Summary
Improved reliability
While a disaster had not yet struck Instant Gaming, downtime from errors and maintenance was enough to affect the business. Coupled with the low margins of their industry, this meant the reliability of the infrastructure kept the team up at night. With ClusterControl, they can now relax. “Production database reliability is not something we have to care about or fear anymore when we have time off,” said Bordier. “ClusterControl allows me to sleep at night and enjoy skiing. ClusterControl just works and will save you time and money.”
Cloud backup integration for disaster recovery
The team used ClusterControl to manage a combination of on-premise and cloud backups, keeping recent backups on local storage and uploading the rest to Amazon S3.
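The split between local and cloud backups described above can be illustrated with a small sketch. This is not how ClusterControl implements its backup policies, and it calls neither ClusterControl nor AWS. The `Backup` type, the `plan_backup_placement` function, the 7-day local retention window, and the sample backup names and dates are all hypothetical, chosen only to show the idea of keeping recent backups nearby and archiving older ones.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Backup:
    name: str            # hypothetical backup identifier
    taken_at: datetime   # when the backup completed


def plan_backup_placement(backups, now, local_retention=timedelta(days=7)):
    """Split backups into those kept locally and those to archive in the cloud.

    Backups no older than local_retention stay on local storage for fast
    restores; everything older is marked for upload to object storage.
    The 7-day default is an assumption, not a value from the case study.
    """
    keep_local, archive = [], []
    # Newest first, so each returned list is ordered from most to least recent.
    for backup in sorted(backups, key=lambda b: b.taken_at, reverse=True):
        if now - backup.taken_at <= local_retention:
            keep_local.append(backup.name)
        else:
            archive.append(backup.name)
    return keep_local, archive


# Illustrative data only: names and dates are made up.
sample_backups = [
    Backup("full-01-10", datetime(2024, 1, 10)),
    Backup("full-01-30", datetime(2024, 1, 30)),
    Backup("full-01-24", datetime(2024, 1, 24)),
]
local, cloud = plan_backup_placement(sample_backups, now=datetime(2024, 1, 31))
```

Keeping the decision in a pure function, separate from any upload step, makes the retention rule easy to test and to adjust without touching storage credentials.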
Easy migration for MySQL replication
The Instant Gaming team was comfortable with their MySQL Replication setup and wanted to stick with it. They used ClusterControl to import their existing setups for evaluation and testing. They then used ClusterControl to deploy a new production setup on Docker and perform a live migration of the database before switching traffic to the new containerized databases.
Ready to automate your database?
Sign up now and you’ll be running your database in just minutes.