How Booking.com built their own Sovereign DBaaS at scale

February 10, 2023
Kristian Köhntopp

Automation is a necessity and a prerequisite for growth, but going through it is a time-consuming and costly process. Therefore, before going down this path, it would be useful to establish your resource limits, whether in money, tools, or expertise.

Kristian Köhntopp, the principal systems engineer at Booking.com, joined us again on this episode of Sovereign DBaaS Decoded. In Part 1 of our conversation with Kristian, he walked us through building and running a robust database infrastructure with thousands of instances and hundreds of hierarchies. In this episode, Kristian explains why every enterprise should automate its database and how long it takes to put into effect. However, he also explains that automation is a journey, not a destination. It’s a never-ending game, and each enterprise must stay up to date with regulatory, compliance, or tooling changes, to ensure their databases always meet business objectives.

Key Insights

Automation as an Existential Necessity for Scale: For Booking.com, automation wasn’t a “nice-to-have” but a requirement as they moved from a single database in 2006 to a “liquid” environment with thousands of instances. Without it, every database remains a “snowflake” that eventually obstructs operations.

Systems-Level Orchestration Over Node Management: Effective database automation must look at the entire replication hierarchy rather than individual hosts. Changing one node affects dependencies and capacity across the system; therefore, automation must understand topology to ensure operations are safe.

The “Liquid” Infrastructure Philosophy: By treating infrastructure as immutable and ensuring no instance lives longer than 30 days, the team forces themselves to maintain 100% hands-off automation. This “forced honesty” ensures that any patch or configuration change must be committed to code rather than manually tweaked on a server.

Episode Highlights

00:02:30 — The Necessity of Database Automation Kristian emphasizes that without automation, databases become “snowflakes” that inevitably obstruct operational growth. As Booking.com scaled, automation transitioned from a convenience to a strict necessity to handle thousands of databases in a “completely liquid way” where instances are frequently replaced and resized automatically.

00:05:38 — Orchestrating Systems, Not Just Hosts A critical realization for the team was that database automation cannot focus on a single host; it must account for the entire replication hierarchy. This systems-level approach ensures that operations—such as adding instances or reconfiguring topologies—do not interrupt client operations or exceed the replication tree’s read capacity.

00:09:04 — Influencing MySQL Development Booking.com played a significant role in “nudging” Oracle to develop parallel replication for MySQL. By providing “harsh feedback” and discussing design requirements based on their massive scale, they helped engineer away storage commit latency bottlenecks, eventually achieving 6,000 to 10,000 transactions per second.

00:14:13 — Embracing “Reliable Death” and Fluidity The infrastructure is designed for nodes to die reliably and fast. By using a “bare metal cloud” that behaves like EC2 but offers superior workload isolation, the automation can provision a new blade, “paint” an operating system onto the disk, and integrate it into the replication tree without any manual intervention.

00:33:56 — Predictive Capacity Planning with Sensor Queries To manage seasonal business spikes, the system uses “sensor queries” to test execution latency on production nodes. By artificially increasing the load on a single node until it reaches a saturation point, the team determines the “load factor” or runway, allowing them to automate hardware provisioning based on real-world workload data.

00:54:37 — Immutable Infrastructure and Security Compliance To meet security requirements for fixing CVEs within 30 days, Booking.com treats its database servers as immutable. Rather than patching existing servers, they deprovision the old ones and reprovision new ones using the latest OS images and automation code, ensuring the entire environment remains fresh and hands-off.

01:00:57 — Safe Test Data via Tokenization (The Soylent System) Booking.com protects PII (Personally Identifiable Information) by replacing sensitive data with synthetic tokens in production. In test environments, a “fake” replacement service called Soylent provides developers with consistent but entirely made-up data, allowing them to test applications with production-like scenarios without the risk of exposing real customer data.

Here’s the full transcript:

Vinay Joosery: Hello and welcome to this new episode of Sovereign DBaaS Decoded. I’m Vinay Joosery, and this episode is brought to you by Severalnines. So, we build enterprise automation software to orchestrate high availability open source database operations in any environment while maintaining total control. So, welcome back, Kristian. So, we’re welcoming back Kristian Köhntopp, the database lead at Booking.com. Kristian joined us in the previous episode about running database infrastructure at enterprise scale. So, a very warm welcome. So, in the previous episode, we spoke quite a lot. We talked a lot about cloud services, database services, and why they really work for greenfield low-scale apps, you know, we kind of saw that. Public DBaaS may work for day one operations, but from day two and on, you still need a lot of Ops folks to do the work.

We touched upon DevOps infrastructure tooling for databases, and the challenges of using tools like Ansible, Chef, and Puppet to operate your databases. We also talked about Kubernetes and how you can use Kubernetes operators to automate databases. And the issue was that, as Ops is different in every enterprise, there is a need to adapt or extend the operator to the way the enterprise runs its Ops. And this is easier said than done: it is hard to find DBAs who can code Golang, and it's hard to find good developers who know databases, right? So today's episode is a bit of a continuation of that discussion, and we'll talk specifically about how Booking.com implemented database automation. So the first question to you, Kristian, is: should an enterprise automate their databases?

Kristian Köhntopp: Well, if you don't, then the database is always a snowflake and gets in the way in various ways. For us, it became a necessity as we grew and had more databases. And then we had explosively more databases when we realized that we don't actually need caches, because the database is fast enough to perform the same operation as a cache. And then we grew even more, and the requirements from compliance and security arose. In the end, we ended up with a system that operates very many databases in a completely liquid way, where no instance lives longer than 30 days, everything is found automatically, and everything can be resized automatically as well.

But that is not how we started out. I mean, in 2006 I came to Booking as a MySQL consultant. There was one single database with one single schema, running at 700 queries per second with about 40 gigabytes of data and a few replicas; that was like a handful of machines. And, of course, that was on full manual. But the operations team went all in on instance management, with Puppet, for example. At the time, the entire company had crossed about 100 machines; now it's a few tens of thousands, and of course, nobody ever touches any box. And now it's a few thousand databases, and of course, nobody ever touches the database. So, that is properly and fully automated.

But not yet in the way that is our next challenge: exposing that as an API to the application business units, so that the DBA is completely out of the picture. Right now, there's still a DBA starting things for certain operations. Not for all. We are halfway there. It began with a bunch of shell scripts in an RPM package that we rolled out to any database machine, and that basically helped us to make an instance. Generate a config, create a database directory with an empty database, provide the tables from some clone source, copy them over, and then maybe automate a bit, putting the newly created instance into an existing application tree. And very quickly, after very few years, we realized that shell is not a good language for that.

So we basically banished all shell development and moved to a thing in Python that did these things. And the moment you start to work with replication, you have to orchestrate things, because you need to know: where are you in the replication hierarchy? Can you operate on the replication hierarchy without interrupting operations? Do you have enough capacity in the replication tree for the read load that you have, for example? And if I add things to the replication tree, will the load balancers find the new instances so that they actually get load from the clients? All of these things can work only if the automation you write does not look at a single host, but at the structure that represents the replication hierarchy.

So all the database instances that form the replication tree. And then it has to check if it’s safe to do what it wants to do, it has to probably announce things that it wants to do with some kind of lock or other way to signal the intent. Then it has to perform the operation. When the operation is finished, it has to take down the locks. And if anything goes wrong, it probably should ask kindly for a human to fix things after the mess it made. And then you will, over time, run into a lot of obscure error conditions, because what can go wrong will go wrong eventually. And that means that you get a lot more code and a lot more conditions that you have to check for things to clean up.
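What Kristian describes is essentially an announce/check/perform/cleanup loop over the whole tree. Here is a minimal Python sketch of that pattern; every name in it (the tree structure, the lock helper, the capacity check) is a hypothetical stand-in, not Booking.com's actual code:

```python
# Hypothetical sketch of a topology-aware operation: announce intent,
# verify it is safe against the whole replication tree, act, clean up.
import contextlib

class UnsafeOperation(Exception):
    pass

def read_capacity_ok(tree, node):
    # Assumption: removing `node` must leave enough replicas to serve reads.
    serving = [n for n in tree["replicas"] if n != node]
    return len(serving) * tree["qps_per_replica"] >= tree["required_qps"]

@contextlib.contextmanager
def intent_lock(tree, node, operation):
    # Stand-in for a real lock service, e.g. a row in a coordination table.
    tree.setdefault("locks", {})
    if node in tree["locks"]:
        raise UnsafeOperation(f"{node} already locked for {tree['locks'][node]}")
    tree["locks"][node] = operation
    try:
        yield
    finally:
        del tree["locks"][node]  # always take the lock down again

def drain_and_rebuild(tree, node):
    if not read_capacity_ok(tree, node):
        raise UnsafeOperation(f"not enough read capacity without {node}")
    with intent_lock(tree, node, "rebuild"):
        print(f"draining {node}, rebuilding, rejoining tree...")
        # on any exception here, the lock is released and a human is paged

tree = {"replicas": ["db1", "db2", "db3"], "qps_per_replica": 5000, "required_qps": 9000}
drain_and_rebuild(tree, "db3")
```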

And when that all works, then other people will come to you and ask you to do even more things, because requirements change. "Oh, that's awesome. Can we have more of that? But I still have to log into the box to do these things. Can you automate them as well?" And that is how we arrived at the list of things that we do today with databases. This all grew. We never made that list and then worked it down. These requirements arrived, and then somebody wrote the code, and then this interacted badly with other code, and then we introduced some kind of locking mechanism or other controls. And eventually, this being now 10 years old, somebody (me) had to sit down and actually write down all of the things that we do, in order to explain to management what these people are doing all day if they're not actually logging into databases. But that is what we have today.

Vinay Joosery: So, let me just stop you there. I think that's something we've also seen for many years; we've been dealing with clusters and systems, right? And that's the difference we have seen: there's a lot of tooling that takes a single-node approach, where you have many single instances, whereas very few tools actually look at things from a systems perspective, with all the interactions and all the dependencies. And that's kind of the main thing: changing one node actually has a lot of effect on other things, and you want to get that whole picture, so to speak. But how long did it take you? Basically, you're still adding more automation, but, you know, how long did it take you to build the automation?

Kristian Köhntopp: We have been doing this for 10 years now. But, of course, if we had set out with the laundry list we have today, and then, in a concentrated way, engineered our way through it, it would probably have been a lot faster. But this was doing, exploring, and learning all of these things incrementally, while at the same time growing and doing operations. And MySQL itself also evolved along the way. I mean, originally we had single-threaded replication, statement-based replication, and that worked. But with single-threaded replication, you are very dependent on commit latency from the storage. Because if your storage can do like 2,000 commits per second, at half a millisecond commit latency, then your replication capacity is limited.

And at some point, you will run out of headroom. And we actually had discussions with [inaudible] Oracle, with the MySQL team at Oracle, about that, and proposed several ideas for how to do parallel replication to them. What we now have in parallel replication has been completely developed by Oracle on their own. But we nudged them very hard, and we also discussed the ideas of how to do it with them very intensely. So while we did not actually code anything on that (we are a hotel booking company, we don't develop databases), we still had quite an influence on the entire design, because we had that requirement.

And today, you have a binlog stream that has a certain width, whatever limit the workload imposes on the amount of parallelism that you can have; transactions cannot be reordered arbitrarily. And for us, that width is often three to five. In certain circumstances, a lot more, but for most databases, that's it. So, if we now have a machine whose storage can do 2,000 commits per second, we get like three to five times that most of the time. So, 6,000 to 10,000 transactions per second in the application. And that's, of course, a fantastic improvement. And that is partly a result of, for example, the vendor management that we did, with early testing of releases coming in from Oracle.
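To make the arithmetic concrete, here is a quick sketch using the figures quoted here (0.5 ms commit latency, a workload-imposed parallelism width of three to five):

```python
# Replication throughput under single-threaded vs. parallel apply.
commit_latency_s = 0.0005            # 0.5 ms storage commit latency
serial_tps = 1 / commit_latency_s    # 2,000 commits/s, single-threaded ceiling
for width in (3, 5):                 # typical workload-imposed parallelism
    print(f"width {width}: ~{int(serial_tps * width):,} transactions/s")
# width 3: ~6,000 transactions/s
# width 5: ~10,000 transactions/s
```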

And giving constructive, but also harsh, feedback to them on what they do with database development there, for example. I think that's really nice work. I'm really happy with how that came out in the end as a feature, because it more or less solved that bottleneck for us and completely engineered it away. If it's still sequential at the limit, it's mostly our problem, as long as the storage has a certain minimum performance. A 2012 NetApp filer with hard disks has this 0.5-millisecond commit latency; anything that is as good as that 10-year-old hardware will work for us. So, basically, anything off the shelf.

Vinay Joosery: Yeah. So, looking at the automation that you built, you have a blog where you actually lay out a number of different areas that should be automated, and I'm going to list them here so we can go through them. There's provisioning, there's service discovery, there's switchover and failover, there's capacity planning, there's schema changes, there's restore (I know you don't like the word backup). Then there's monitoring. There's upgrades. There's security. There's test data and anonymization. And then also the fact that once you've automated all that, you actually go ahead and bring value to the business units and help them with data modeling, writing queries, and things like that. So, why don't we get started with provisioning?

Kristian Köhntopp: It started with making instances, because that is where it all begins. The replication hierarchy is a multi-layered tree. There's always one primary that you write to, and then there are intermediate primaries, residing in different data centers or availability zones, that take the write stream from remote, so it's transmitted only once into the local availability zone or data center. And from there, there's a fan-out to the actual read replicas that serve reads to the client applications. So, there are always at least three layers, and sometimes more, because some kind of transitory topology exists for a moment while everything is being reorganized for whatever reason.

And originally, that was all statement-based replication; that's the only thing that works with MyISAM. When you want to go to row-based replication, you first have to get rid of anything but InnoDB. And now it is also, in part, group replication and semi-synchronous replication, because that allows us to make promises about the data not just leaving the box, but even leaving the data center, so we can for sure lose any individual box and most likely any individual data center. But that comes much later in the story. Originally, primaries were special machines.

Originally they had two instances, active and passive: Heartbeat with DRBD. Then we had those NetApp filers and iSCSI failover, with the iSCSI locks being broken by the failover. And then a long, wasteful recovery or REPAIR TABLE; as long as you have MyISAM, that is what you are looking at. But with the arrival of global transaction IDs, we actually don't need anything there anymore. We can just promote any replica that has a full binlog to be the new primary and then hang the entire application topology from that. And with Orchestrator and the work of Shlomi Noach, that became completely automated. So that is what happens. And I have one specific incident in mind where that actually proved how incredibly robust all of this makes things. But yes.

The automation now uses the same blades as the web frontend, with a single NVMe drive. No RAID, nothing. So if there's any problem with the instance, the instance dies. And the only thing that is important to us is that it reliably dies. So, once it's dying, it stays dead, and it dies fast. If we detect any problem, the best thing is that we kill it; then that capacity is gone, the tree rearranges, and, these days reliably, the automation then provisions new instances. That is what the automation does. For bare metal, it talks to our legacy bare-metal provisioning system, which is an API, and tells it: give me a blade. And that thing will give you a blade. Because it's bare metal hardware, that will take a moment, a few minutes. And with the new energy-saving measures, the blade is actually off, so it takes even longer because it needs to be powered on.

Then it needs to be checked whether it's flashed up to spec. Then an operating system is painted onto the hard disk. And then the automation will, acting as an ENC (external node classifier) for Puppet, tell Puppet which Puppet class needs to be installed on the database machine. Then there's a MySQL on it. And then it's handed over to our automation; we finally get our own hands on the blade. We then provide the data directory. And with two terabytes of data, because there are two terabytes of local disk, that takes maybe an hour or two to arrive. But it works in parallel. So if I ask for one blade, it takes two hours. If I ask for 200, it takes one hour and 20 minutes for the things to arrive. And that is basically what the provisioning automation does. It gives you a bare-metal machine, or a piece of OpenStack, and in the near future also a piece of EC2.

With the current image, the newest one, or if it's bare metal, with the current operating system version and the BIOS flashed up to spec automatically, and so on. And then it will give you a MySQL package, the one you configured, so most likely the newest, and then it will also give you a data directory. And then it's handed off to application control, which is orchestrated automation calling each other as callbacks. And that means that the instance ends up in the replication tree, catching up. And because it is not caught up, service discovery and auto discovery will not put it in any pool to serve reads, because it's not ready. And when it signals itself as ready to serve, it enters the pool and is discovered. And then it will randomly get read requests from the front ends.

So if there are 10 in the pool, it will get one tenth of the workload to serve. Or we artificially increase or lower the weight so that it gets a higher or lower percentage of the workload that comes in from the front end. But nobody has ever touched that. And in most cases, also, nobody has actually asked for that capacity. It's 10 instances because the capacity testing that runs every night, or during the day, has determined that 10 would be a good number. And what we configure is a corridor, minimum capacity and maximum capacity, and the provisioning will grow and shrink the pool over several days so that it has the right size.
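A toy sketch of how such a weighted read pool behaves, with invented node names and weights (real service discovery publishes the pool; clients just pick from it):

```python
import random

# Hypothetical read pool as discovery might publish it: node -> weight.
# A node that is still catching up is simply absent; a node under a
# capacity test carries a higher weight and receives a larger share.
pool = {"db1": 100, "db2": 100, "db3": 300}  # db3 under a capacity probe

def pick_replica(pool):
    nodes, weights = zip(*pool.items())
    return random.choices(nodes, weights=weights, k=1)[0]

counts = {n: 0 for n in pool}
for _ in range(10_000):
    counts[pick_replica(pool)] += 1
print(counts)  # db3 gets roughly 3/5 of the reads
```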

So these things, provisioning, service discovery, topology changes with Orchestrator, and capacity planning, are actually one thing, except that in code they are, of course, different things, but they interact. That's also a big problem, because if you do things to an instance, maybe you can't do other things at the same time in different parts of the automation. And we ran into problems there at several stages, because, say, a capacity test and a schema change may interact and should not happen at the same time. So, we have been learning and discovering these things, sometimes the hard way.

Vinay Joosery: So, you mentioned something about, you know, one of the bigger problems being to make sure that a node dies reliably. It's like: go and die now. But I guess that's a hard thing, because, you know, sometimes machines can just hang and then…

Kristian Köhntopp: Yeah, that is fine, then it's dead. The worst is when you, for example, have a bare metal thing with a battery-backed controller, like an HP Smart Array or something like that. That was back in the times of hard disks; now we have these NVMe things, of course, directly on the bus. But when the battery dies, that thing cannot use its cache, because the cache is no longer persistent. There's no battery. But it does not always announce that properly. So, it's suddenly slow. And instead of a commit latency of, I think, about 100 microseconds, you suddenly get a commit latency of several milliseconds. So, it's one or two orders of magnitude slower. But it's not dead. It's serving writes.

And, of course, you then get an ever-increasing replication delay that is never caught up. We had several incidents of that. They were not frequent enough, given the amount of hardware we have, to become a repeatable or even reproducible problem. Only after some time did we notice there is a common root cause to these things. And we can actually monitor this, or the hardware team can monitor this and then notify us of the problem, and then we go and eject that node, because it cannot reliably replicate, it cannot reliably catch up, so we have to kill it. We decommission that node, we automatically put it into a permanently decommissioned state in the bare-metal provisioning automation that the hardware team makes, and we make a ticket for it.

And then they will actually send somebody with the replacement parts to that node, pull it out of the rack, put in a new battery, and put it back. And only then does that thing become available again in the pool for anybody who requests hardware, us or any other team. So, there is hardware provisioning automation built by another team. It behaves like EC2, only it's made from real pieces of hardware. We only talk to an API, and the only difference between EC2 and that is that bare metal is provisioned a lot slower than a virtual instance that is just spun up on a machine that's already running. But it is a bare metal cloud, basically. And we have been using that for a long time; the OpenStack part is very new. Bare metal provides superior performance: there is nothing between you and the hardware, you get all of the performance of the box, and it provides absolutely superior workload isolation. There's no crosstalk, because you're alone on the box; there's nobody who can steal your disk or your memory bandwidth or things like that. And now, with OpenStack, that has become necessary, because the hardware is too big even for us. We are going through all of these things and are seeing ways to get reliable performance out of virtualization, and that works. But Booking avoided that for a very long time.

Also, of course, these hardware blades are not cheap; they come in at about 120 to 150 euros a month if you write them off over five years. And that includes everything: networking share, rack share, power cost, cooling cost, and the cost of the capital invested. And that means they are competitive with the equivalent EC2 virtual machine if they are 10% to 15% utilized. They don't have to be very busy to be worth the money.
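A back-of-envelope version of that break-even claim. The blade cost is the figure quoted above; the price of an EC2 VM equivalent to a whole blade is an illustrative assumption chosen to land in the quoted 10% to 15% range:

```python
# At what utilization does a bare-metal blade beat renting equivalent
# virtual capacity? Blade cost is from the episode; the EC2 price for a
# VM matching the whole blade is an assumed, illustrative figure.
blade_eur_month = 150
ec2_full_blade_eur_month = 1200   # assumption, not a quoted number
break_even_util = blade_eur_month / ec2_full_blade_eur_month
print(f"break-even utilization: {break_even_util:.0%}")  # ~12%
```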

Vinay Joosery: No, no.

Kristian Köhntopp: That is because these are throwaway commodity blades with no special properties. They are just web blades. And what has enabled that, by the way, is that we do replication with global transaction IDs and Orchestrator. We don't need RAID, we don't need iSCSI, we don't need Fibre Channel, we don't need filers. We just take local disks, unraided.

Vinay Joosery: So this thing about OpenStack, that's a big, huge project, right? We do work with a bunch of OpenStack companies, cloud providers mainly, but as an enterprise, you know, VMware is usually quite, you know, quite sort of…

Kristian Köhntopp: Yes, I don't care how you make virtualization, as long as crosstalk is under control, and as long as there is an API I can talk to that reliably provisions things. Yeah. So bare metal cloud, OpenStack, VMware, EC2, I don't care. The automation also doesn't care. It does a web request and then expects to be called back or otherwise notified when the thing is ready, which may take a few minutes.

Vinay Joosery: Yeah, yeah. So, as part of this system, so to speak, switchover and failover are part of that, right? You have used this word in the past: the fluidity of these database systems.

Kristian Köhntopp: Let's establish a few ground concepts. Where I work, we have applications that always have two database handles per hierarchy. One is the write handle; that goes to the primary of a tree-shaped replication hierarchy. And one is the read handle, and that goes to a pool, which is a subset of the leaves of the replication tree. In the replication tree, writes go to the top, and database instances exist in multiple data centers or availability zones. The pool is the subset of the replication tree that is close to you, in your data center, so that reads do not cross a 15-millisecond gap to another data center; you have sub-millisecond read latency because the read copy of the data is right next to where you are with your application. We have had that for over 10 years now.

Very early on, we forced all developers to live in an environment where you have different read and write handles. And we also provide the current time in a heartbeat table at the primary, and that replicates down. So if you read the heartbeat from your leaf replica and it's not the current time, then you know the replication delay, and you know it across multiple layers. You can also drive that yourself as an application developer: you can write to a heartbeat table of your own in which you maintain a sequence number. And because of the way Oracle MySQL executes replication, if you see your sequence number in the read replica, you also know that all other writes that happened physically before that sequence number are also available on that read replica. There is no out-of-order execution. It's very simple. And that means we can tell application developers that the read handle may be delayed.

And we don't even want to hear about it if it's less than 10 seconds, because that can always happen. And you can use the sequence number in the heartbeat table that you maintain yourself in order to understand whether you can reasonably expect your write to be available in the read replica that you're talking to. Applications get the read replica randomly selected from the pool: of all the instances that are close to you, you get any one, and you can't control which one. And that is the environment they live in.
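A toy sketch of that heartbeat/sequence-number trick. The dicts simulate the app's heartbeat table on the primary and on one replica; a real implementation issues the equivalent SQL through the write and read handles:

```python
# Toy stand-ins: `primary` and `replica` are dicts simulating the
# app-maintained heartbeat table; a real setup uses the MySQL handles.
primary, replica = {}, {}

def write_marker(app, seq):
    primary[app] = seq            # e.g. REPLACE INTO app_heartbeat ...

def replicate_once():
    replica.update(primary)       # stands in for asynchronous replication

def safe_to_read(app, seq):
    # In-order apply: seeing our sequence number means every write
    # committed before it is also visible on this replica.
    return replica.get(app, -1) >= seq

write_marker("search", 42)
print(safe_to_read("search", 42))  # False: replica hasn't applied yet
replicate_once()
print(safe_to_read("search", 42))  # True: this replica is safe for seq <= 42
```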

Now, bad things can happen. Like, the primary must go for whatever reason. Maybe it's 30 days old, maybe it's actually broken, maybe the rack undergoes maintenance and the entire rack becomes unavailable for 15 minutes. In any case, let's consider ourselves lucky, and this is a controlled change of the replication tree topology. We call that a switchover. That is executed through Orchestrator, fully automated, and it's also picked up by the discovery. So, the application, the next time it tries to connect to the primary, will get a different address and talk to a different primary. And underneath, it will find a valid replication topology. During the reconstruction and reorganization of the replication tree, we have a very short write unavailability.

You try to connect, but anything you connect to, even the write handle, is read-only, or it will simply not let you connect. But that is also less than 10 seconds. That's a thing the application must be able to handle; another thing we demand. We try to be fast, and the slowest thing is actually the DNS update for the applications that use DNS. Most of them these days should actually be using ZooKeeper, which can do these things much faster. But, of course, all that interaction between Orchestrator and the automation also takes a bit of time.

And, of course, you have to prepare that properly. Failovers are switchovers that are unannounced. Something actually breaks, so a failover is a switchover in adverse conditions, and it may take longer to execute because there's some kind of recovery going on. And there's also the time to detection. That means there must be at least three checks missed before you know there is actually a persistent error condition. So we are actually looking at an outage of write availability of about 90 seconds or so. On the plus side, you get a promoted ex-read-replica. So you're looking at a machine that has a warm file system buffer cache and a warm InnoDB buffer pool. You're not looking at something that has to load, like, 100 GB of buffer pool from disk in order to be able to keep up.

Vinay Joosery: Yeah, that's fine. Yeah. The failover is interesting because, you know, I understand that not everybody's comfortable with that automatic mechanism, right? I mean, there have been companies... I think GitHub had an outage some years ago, and some of the bloggers in the MySQL world have blogged about things like: is failover evil, right? Should you automate failover? But I guess, I mean, yeah, what would you say to that?

Kristian Köhntopp: So I'm a great fan of simple solutions. We did manual failovers for a long time. But at some point, cost and size and the amount of machinery force you to do more complicated things. I have a certain outage budget in a month, in terms of lost revenue, that I can spend. And if I were to do things manually, I would be multiple factors of 10 over that. So, with the amount of stuff I have running, I need to automate it; I was forced to at some point in time. But back then, when we used active-passive standbys with NetApps as the backing store, we migrated between the active and the passive instance with Pacemaker, basically on DRBD or the NetApp. We did all of these things scripted, but there was a human who actually had to initiate it.

But then you're looking at outages where the failover takes several minutes to resolve. And that was okay back then, but it totally wouldn't be today, in the environment I'm in. Yeah. When we used Pacemaker, we had more outages when Pacemaker was running unattended; we had more loss through Pacemaker doing the wrong thing. So we switched that off and had a human type the same commands that Pacemaker executed, but with the human in between judging: what is the cause of the outage here? Does it actually make sense to switch over? And that was more reliable, because failure detection is hard to get right, especially if you have flaky outages, or even flapping, where it goes back and forth between two instances and you never recover. We had all of these problems. So, we turned it to manual and then left it at manual for a very long time.

Actually, I think we probably ran with manual failover, with the active-passive clusters, all of that time, and only with the arrival of GTID and Orchestrator, and the storage shenanigans in the back end going away (moving to just unraided local storage), did we hand all of that over to Orchestrator. That worked then. I might not be getting the history right. I also have not been there for two years, and some things happened in the time I wasn't around. So that might not be entirely accurate, but I think it is not too far off. In the end, it's a question of: how reliable is your failure detection? How fast is your failure detection? And how fast is the recovery that it kicks off? And, of course, do you manage to actually keep the things you consider broken dead, so that they don't come back until they have been validated?

Vinay Joosery: I mean, we do see, you know, we talk to a lot of companies, well, about automation, and we can see there is a portion of DBAs who actually do not trust software to do failovers for them. They are pretty adamant that they want to do it themselves.

Kristian Köhntopp: And as long as the company can afford that, it’s fine. Yeah?

Vinay Joosery: Yeah.

Kristian Köhntopp: I mean, you get paged at night, you have to get out of bed at speed without waking up your children and your wife or whatever, or your partner. And then you have to go and execute that stuff. It was a big improvement when it went away, but I agree it was hard to get right.

Vinay Joosery: All right. So, you know, moving on, there's also… I mean, obviously, workloads are always changing, applications are changing. So, you also have an element of capacity planning built into the systems. You have these ways of understanding what's happening, and then you can plan ahead.

Kristian Köhntopp: We measure various things. I mean, we measure disk space and how it shrinks over time. We have various limited resources that we monitor over time, and we model how future resource consumption develops. We measure binlog size, for example, which we keep for seven days. And if the binlog grows out of bounds with a sudden spike, we alert the data owner of this mass consumption, because they might want to know. Maybe they know, because they have a big deletion running, and deleting data obviously eats disk space, because that is how the binlog works.

We monitor overall disk space consumption and then give you a linear plot: how many days of runway do you still have with the disk that is available. We monitor replication capacity by taking one node out of the pool, stopping it for an hour, and then measuring how long it takes to come back up to speed. So there's a catch-up factor that we tell you; that's your replication load runway, basically. We measure restore speed: how long it takes to recover an instance from the backup in restore testing. That's another metric that we give you.
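A sketch of that "days of runway" projection for disk space: fit a line to recent usage samples and extrapolate to the disk size. The samples here are invented:

```python
# Fit a least-squares line to daily disk usage and extrapolate.
def runway_days(samples, disk_gb):
    """samples: list of (day, used_gb), one per day, oldest first."""
    n = len(samples)
    mean_x = sum(d for d, _ in samples) / n
    mean_y = sum(u for _, u in samples) / n
    slope = (sum((d - mean_x) * (u - mean_y) for d, u in samples)
             / sum((d - mean_x) ** 2 for d, _ in samples))
    if slope <= 0:
        return float("inf")        # usage flat or shrinking: no deadline
    last_day, last_used = samples[-1]
    return (disk_gb - last_used) / slope

samples = [(d, 1200 + 9.5 * d) for d in range(30)]  # ~9.5 GB/day growth
print(f"{runway_days(samples, 2000):.0f} days of runway on a 2 TB disk")
```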

And then, of course, you have other metrics like general QPS and what we expect a blade or an instance of that type to be able to do. So you get a number of operational, predictive metrics that, as a data owner, tell you what shape you are in. And if business increases or picks up, or workload increases and picks up, how long can you survive? It's their task to provide input to us and the data center teams to provide capacity. And that's a bit of a problem. We have a highly seasonal business, with the high season, holiday season, in the middle of the year, and the absolute low in the days between Christmas and New Year. And it goes up and down in a repeating pattern, every normal year.

Not during COVID, of course. So when the financial planning people give us an estimated growth average for the year, we can size for that using the last few years' data and scale it. And together with the predictions, we can tell you if any of the consumption lines will touch the limits of the hardware, and that gives them an estimate of how much hardware to order. The aggregate of that is then for the hardware team to provide. So that works as a planning process, but it requires, of course, that we explore these metrics and can tell you what your workload can do. So every day, one actual production node will be assigned more and more load in the load balancer. It will, as a percentage, have like two or three times the workload that the other nodes have.

We also put a sensor query into that node, where we test what the execution latency of that query is. As long as that query has no wait time, the node is not overloaded. So the probe query goes into the database and executes with, say, 400 microseconds of runtime, and as long as it stays below 1,000 microseconds, one millisecond, we assume there is no wait time and the node is not saturated, and we increase the node's weight a bit more, a bit more, a bit more. And when the probing query reaches 1,000 microseconds, or when there are not-ready-in-time indicators from the application, or we violate other latency SLOs, we immediately reset the node weight to normal.

And the maximum node weight, divided by the normal weight, that's your runway, your load factor, how much spare capacity you have. Then you have to reduce that by additional capacity that might be consumed by front-end experiments. You have to reduce it even more for an availability zone failing, where you forward the entire workload of another data center into this one. And you arrive at maybe a node that has a target capacity of, say, a third of what it can potentially do, with all the headroom that we need to reserve for the various things.
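A simulated sketch of that sensor-query probe. The latency model is a toy stand-in for a production node; the loop shows the raise-weight-until-saturation logic and the resulting load factor:

```python
import random

BASE_RUNTIME_US = 400      # probe runtime on an unloaded node
SATURATION_US = 1000       # 1 ms: wait time appeared, node is saturated
NORMAL_WEIGHT = 100

def measure_probe_latency_us(weight):
    # Toy model: latency stays flat, then climbs as the node saturates.
    load = weight / NORMAL_WEIGHT
    queueing = max(0.0, load - 3.0) * 600   # pretend saturation near 3x load
    return BASE_RUNTIME_US + queueing + random.uniform(-20, 20)

weight = NORMAL_WEIGHT
while measure_probe_latency_us(weight) < SATURATION_US:
    weight += 10                            # a bit more, a bit more...
weight = max(NORMAL_WEIGHT, weight - 10)    # last weight before saturation
print(f"load factor ~{weight / NORMAL_WEIGHT:.1f}x; "
      f"reset weight to {NORMAL_WEIGHT} and record the runway")
```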

So, if I have a CPU-bound database that can keep the working set in memory, I'm only interested in the number of cores busy. And I know that on a 32-thread box, I can have a target load of 12. I see this one is at eight, and that gives me the amount of headroom that I still have for growth. For I/O-bound devices, it's a lot more complicated. And for I/O-bound devices with rotating hard disks, it was almost impossible to solve, because the access pattern on the hard disk, the seeks, played an important role. That was always very risky business, but fortunately we are almost a decade past that now. With SSDs, I can manage IOPS properly. It's a bit more complicated than CPU-bound nodes, but it is doable. Putting all these things into a graph, I can give you a number for capacity planning, despite the fact that there are several complicated considerations behind it, plus the load test that actually tests it. But that's a real-world number. That's your workload from today.

Vinay Joosery: Yeah, yeah.

Kristian Köhntopp: Yeah, that gives you for your nodes, the capacity that you still have. And it’s reliable enough that I actually can take these numbers and put them into the auto-sizer that provisions or deprovisions nodes within the corridor that I set.

Vinay Joosery: Yeah. The way you explain it in a way, it’s, you know, I mean, you do sometimes hear people saying, oh, this database is only running at 20%, right? But then when you realize that, well, you might be taking backups, there might be some peaks, you might lose other nodes that traffic is being sent to, and that traffic can just suddenly go from 25% to 60%, 70%, right? Or even more, then you realize that, yeah, you cannot be running databases at too high utilization.

Kristian Köhntopp: It is important to understand how much money the business makes, how much you pay for a node, and how much money you lose when this replication hierarchy is actually down. We assign them criticality. A low criticality chain doesn’t even have 24-7. It has 8-5 service, or 12-5. So, if it’s down during the night, we do not get alerted. For medium and high criticality things, of course, that is very different. And for high criticality, if I can drown a problem in capacity to buy time to deal with it, that would be quite valuable, right?

So, for high criticality replication chains, I just provide more replicas to be sure that the thing stays up. And if I then feel that it's expensive, I maybe talk to the application business unit, the data owner, tell them what the problem is, and then they need to maybe rearrange their backlog a bit in order to bring the cost back down. But then it's just a monetary problem. It's not a life-threatening problem for the company. And for low criticality, it looks very different, of course. But that is, again, a thing that many companies do not have: an inventory. They do not understand which subsystems they have, what those subsystems are doing, how critical they are, and how they depend on each other. Or maybe that knowledge lives in the head of some very senior person somewhere, but they do not force that person to actually sit down, make a drawing of it, and keep it up to date.

We didn't, for a very long time. We now have a thing called the service directory that actually requires you to declare dependencies for your application. And because we also provision secrets, passwords, and firewall rules automatically, you cannot talk to dependencies, or log into dependencies, that you haven't declared. So, we do know that the things in the service directory are the actual dependencies, or a superset of them. And then there are forced, regular reviews. People are asked to revisit their service directory dependency declarations and bring them down to the minimum set. Everybody hates that, but I'm part of an organization that has several thousand developers, and this is the only way to survive with a development organization that large. Yeah.

So, we have this piece of software, and it forces you to be honest about your dependencies. And then you have a graph, and you can apply graph theory and see if the entire construct even makes sense from a formal point of view. Is it connected? Are highly critical things dependent on low-criticality things? Are the priorities arranged properly? If there's such a dependency and it's marked as hard, that can't be. If it's a weak dependency, one you can shed, then the database can be low criticality even if you're high criticality, and things like that.

But that is all enterprise, large-organization stuff. Smaller companies do that with two or three people, hopefully at a whiteboard, and then hang it somewhere in the corridor so that everybody walks by it and learns the structure by heart. At another company I worked at 20 years ago in Germany, Web.de, the technical director made that on a giant roll of paper on the wall. It was like two meters high and ten meters wide, drawn with a whiteboard marker or a pen on the paper, just to discover the stuff they had built. And then she would look at this and actually go down to the data center and purposefully break things, and then go to the monitoring and see: did you notice what I just did? Did it have an impact? And she did that every morning when she arrived. She went into the data center and broke a thing that she knew she should be able to break. Then she went up, passing by the monitoring before she arrived at her desk. And that forced people, over time, to keep that thing up to date.

That is, of course, how you get rid of legacy and how you force your people to grow into this way of working. And over time, you then actually write a piece of software that models that. And with the firewall rules and the secrets generation, you have another way of enforcing it. But going from completely unmanaged to such a rigidly managed system requires, of course, some adjustment. And it requires a certain meanness from management, because management at some point must force people to be honest to themselves and the organization about these dependencies and keep them up to date.

And that's one way of arriving there. That's an early version of Chaos Monkey. You cannot run Chaos Monkey on an unmanaged network of dependencies. You have to have a human do that at first, with a bit of feeling and outside knowledge about what goes on. If you can actually run Chaos Monkey successfully, you know that you're done, because you have a proven resilient structure that is probably also documented and automatically managed. But getting there is hard. At Booking we had, for several years, the title of Master of Disaster. That was a person, and there was an Apprentice of Disaster: two people who organized outage drills.

So, they went to every team and asked the team to provide a list of failure scenarios and to write down what each scenario would test and what the mitigation is. And then they would, during half a day in the morning, basically execute these various scenarios and test that the system actually behaved in the ways that had been documented. All of that was then put into business continuity management documentation. The drills and scenarios grew larger and larger until, after several years, the organization was at a level where we could sabotage a data center, like cut all outside connections, without user-visible, customer-visible impact on the entire thing. But at an organization of our size, going from an unmanaged state to the managed state, that transition is multiple FTEs working on it, in cooperation and with senior management backing, for several years. It's not like you can simply go and turn off a data center and expect that to work.

Vinay Joosery: And that leads us to the next thing to be automated, which is disaster recovery. Well, I mean, there are several aspects there, right? But restore is one of them.

Kristian Köhntopp: Yeah. Redundancy is the key to surviving outages, of course. And in databases we do have a lot of redundancy in the replicas. Because we have this redundancy in the replicas, we don't need redundancy in storage. That is how we can run off unraided local storage. Of course, you might also need redundancy offline. That's what the backup provides you. But maybe you don't know: backups are just work, they are just cost. They prove nothing. So, how do you prove that the backup is actually useful? Well, what is the actual problem? The problem is that some data is lost. So your boss is coming down to your desk, breathing down your neck, and asking you: have you lost data? And, of course, the answer should be no. But how would you prove that?

Then the next thing they would probably ask is: will we be back? And the answer to that is hopefully yes. And the third question that always comes is: how long will it take? And these questions need to be answered with great confidence for every single application hierarchy that you have. So, you need an offline backup. But, of course, an offline backup gives you the state of the database at the point in time the backup was made. So, hopefully it is atomic; you need a way to atomically make a backup. And the easiest way to do that is to take any replica, shut it down, and then just copy the files.

Another way is to use XtraBackup, which simulates this point-in-time behavior. Or, if you have data on a filer that can do snapshots, then you have a snapshot of the file system at a point in time; you then need to mount that writable and recover it, and then you can take those files. But if you're doing replication anyway, and you're doing it the way I suggest, you always have one replica more than you need. You can take that replica, stop it, and then copy it somewhere and keep it. Either way, the backup is atomic. It has a binlog position that is expressed as a GTID set, or some other way that allows you to connect it to replication. And that means that when you restore from the backup, you have an instance that is a day old, or two or three; you know the replication catch-up factor, you know how long it would take to apply the binlog, and then you put it into the replication hierarchy and let it catch up.

And by doing that, you have proven that the replica you generated from a restore connects. It can run like any other replica that has not had an outage. And that means you have an unbroken replication history that will just work fine. And that is how we test the validity of our backups. It also gives me the timing of how long a restore takes. And by it being able to connect, and all the InnoDB checks passing and so on, it proves that it is a valid backup from which I can actually execute a valid restore, because that is where the value is.
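A toy model of that restore test, reduced to the essentials: restore an aged backup, let it replicate, and report how long catching up took. The catch-up rate and lag figures are invented; real code would talk to provisioning and MySQL instead:

```python
# Toy model of the restore test: a fake replica starts `lag` seconds
# behind and burns lag at a known catch-up rate.
class FakeReplica:
    def __init__(self, initial_lag_s, catchup_factor):
        self.lag = initial_lag_s
        self.rate = catchup_factor      # seconds of lag burned per second

    def tick(self, seconds):
        self.lag = max(0.0, self.lag - self.rate * seconds)

def restore_test(initial_lag_s=86_400, catchup_factor=4.0, step_s=600):
    replica = FakeReplica(initial_lag_s, catchup_factor)  # 1-day-old backup
    elapsed = 0.0
    while replica.lag > 0:
        replica.tick(step_s)
        elapsed += step_s
    # Connecting via GTID and replaying to zero lag proves an unbroken
    # replication history; `elapsed` feeds the restore-time SLO panel.
    return elapsed

print(f"caught up in {restore_test() / 3600:.1f} hours")
```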

Vinay Joosery: And we can see, I mean, really, that’s one of the things, you know, restore is something that needs to be automated, because there’s no way anybody can…

Kristian Köhntopp: Yes, we have that. We have a Grafana panel with a box for every replication hierarchy. It's green if the last restore test is less than that many time units away. And it has to be completely green in order to fulfill the restore SLO. Yeah. It's not hard to report. It is also not really hard to automate. You just have to do it. And that requires, of course, management will and empowerment to sit down and do it. But if compliance comes and says you have to be able to prove that you can actually execute business continuity, that's the only way to do it. So you assign it as a backlog item, you put people on it, and then you let them code. The definition is simple, the task is simple; there is nothing hard about actually sitting down and coding that. It just needs to be done.

Vinay Joosery: Yeah, yeah. You mentioned Grafana for visualization. So, I guess for monitoring, you do have a lot of collectors on these nodes, and you collect a lot of metrics.

Kristian Köhntopp: Yeah. We used to run a thing called Diamond that wrote to Graphite, or a thing that looks like Graphite but isn't actually Graphite anymore, for certain reasons. These days, we run Telegraf, and we run Prometheus to collect metrics. And the MySQL collector in Telegraf is actually awesomely complete. I have great respect for the people who constructed it. It has every single thing that I could think of as being important, and then a few more that make it comfortable. And with Grafana, you can then create the necessary derived metrics.

So, there used to be a PHP RRDtool thing from Etsy that collected system metrics and MySQL metrics. And there used to be a tool from [inaudible] Oracle, from MySQL, called Merlin, the MySQL Enterprise Monitor. And we built a thing in Grafana that basically takes this Etsy toilet roll, a very, very long dashboard with all the metrics from the hardware, and adds equivalent metrics to what MySQL Enterprise Monitor produced down below. So the toilet roll is even longer. That is our per-host view. We see what the hardware does, what the disks do, what the CPU does, what the temperature does in the box, if it's a piece of hardware. And we then add replication metrics, InnoDB metrics, primary key and index accesses versus scans, replication delay, all these things, in a somewhat organized way. That's the per-host view.

And then there's, of course, a bunch of views for the entire replication hierarchy. You can see if there are nodes that are delayed whereas others are not, and things like that. These things are available for anyone, and teams can then create subviews that show them their nodes, or they can add other application-based metrics on top of that.

And we do a few things more. We also collect binlogs and analyze them. So, we have per-table traffic metrics, we have binlog size metrics, we have QPS metrics, things like that. And we have statement data from performance_schema: statement latency per user, so you can isolate bad application behavior. Because another thing we did very early is that we assigned every application its own user-password pair and its own grants. So we have really many grants, but it allows us to identify applications by username and then collect metrics by username, which are actually metrics by application because of this. And that then helps to pinpoint teams that have a problem, talk to them, and together with them isolate the root cause of the problem and kill it.

Vinay Joosery: Yeah, and I guess the visualization is also a common ground.

Kristian Köhntopp: Yes, and because it's time-based, I can tell you when it went bad. And because we also record rollouts, I can tell you that when this tag changed, performance went bad, and that makes it very simple to trace. Or maybe I can just point at the change set, because the SQL queries stand out.

Vinay Joosery: Okay. Then one more thing. You know, I'm actually surprised by the number of servers that never get upgraded. But I believe you have a pretty ferocious rate of upgrading servers.

Kristian Köhntopp: Well, I have to. Security comes down and says any CVE must be fixed within 30 days. But my infrastructure, even if it's on bare metal, is treated as immutable. So the only thing I can do is deprovision the server. And before I deprovision that server, I provision a new one, so I give myself one unit of overcapacity, and then I kill the oldest box. That way, I always build new servers from the newest operating system images, and I always deploy the newest version of my automation. In building this, I prove that my automation is completely hands-off. I make zero manual interactions in building new machines.

And that also proves that I can provision things in parallel. I mean, if I automate things not completely, so there is one single thing that a person needs to do for the machine to actually go into production, then I have just serialized all operations through a single human. And that means I completely lose the advantage of parallelism that I get from full automation. So, being forced to reprovision things every few days, or every few weeks, at a reasonable rate, proves at all times that the automation is not broken, because it is executed all the time.

And humans can't do things in between, because the process is built in a way that makes that impossible. You have to fix the automation, you have to commit the fix, you have to put the fix through tests into production, and then you know that the provisioning is completely hands-off. But that means it's not just repeatable, it's also parallelizable. And that means I always have enough capacity, one replica more than I need, and that means I can orchestrate change, which is what I do here with this constant reprovisioning. I no longer have active-passive systems. I no longer care if I have a failure; I might have two or three failures. I care: do I have enough capacity?

And is the reprovisioning rate higher than the failure rate? Which is a completely different scenario, and it relaxes operational requirements and toil on the operations teams a lot. They get paged a lot less often. And that then allows me, of course, to also make promises to security. Like: you have a CVE, I have a patch within 30 days of normal operations. Or they say this is super highly critical and actively being exploited, and then I have a bit of a problem, because I need to do it in three days. But reprovisioning several thousand servers in three days actually puts a lot of load on the provisioning infrastructure. And so far, we have mostly gotten away with not doing that.
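A minimal sketch of that provision-first, kill-oldest loop under a 30-day maximum age. All the helpers are hypothetical stand-ins for the provisioning API and the replication automation:

```python
import datetime

MAX_AGE = datetime.timedelta(days=30)

def rolling_reprovision(fleet, provision, decommission, today):
    """fleet: dict host -> provision date; one spare unit at a time."""
    for host, born in sorted(fleet.items(), key=lambda kv: kv[1]):
        if today - born < MAX_AGE:
            break                      # everything younger is still fine
        spare = provision()            # first make one unit of overcapacity
        fleet[spare] = today
        decommission(host)             # then kill the oldest box
        del fleet[host]

today = datetime.date(2023, 2, 10)
fleet = {"db-old": datetime.date(2023, 1, 1), "db-new": datetime.date(2023, 2, 1)}
counter = [0]
def provision():
    counter[0] += 1
    return f"db-fresh-{counter[0]}"

rolling_reprovision(fleet, provision, lambda h: print(f"killed {h}"), today)
print(sorted(fleet))   # ['db-fresh-1', 'db-new']
```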

But I can tell you that everything I have running, even in database land, is fresh. I can also tell you that if you configure something with a fixed IP, not using the auto discovery mechanism, it will fail within 30 days. So you'd better look at the pools and ZooKeeper and find your endpoints every time you make a connection. Don't cache any IPs, don't cache any credentials. Because the same thing that happens with the IPs of a service also happens with usernames and passwords: they are being rotated. A new user is provisioned, and when you disconnect, you fetch the new credentials and use those to connect. Your old credentials will continue to work for a certain time. We monitor that they are being used. And if they're being used, like, more than a day after they have been decommissioned, we have a problem and we need to talk.

And once we can prove that they are no longer used, that the user is gone from all process lists, we take these accounts out of the system so they will no longer work. That also means that if any secrets ever leak, they are valid only for a very limited amount of time, and then they become worthless. And, of course, every application has only the permissions it has declared, which is supposed to be a minimal set. If it tries to execute data definition language, that's a special case. Unfortunately, partition management is also full DDL, but ALTER TABLE ... ADD PARTITION and ALTER TABLE ... DROP PARTITION are arguably a different thing from a schema change. So we need to talk about the MySQL permission system there a bit. But that is complaining at a very high level, right?
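
The partition-management complaint is easy to show. Both statements below are DDL to MySQL's privilege system and require the same ALTER privilege on the table, even though the first is routine maintenance and the second is a genuine schema change (the table and column names are made up):

```python
# Routine maintenance: roll a new monthly partition into a
# RANGE-partitioned table.
ADD_PARTITION = """
ALTER TABLE reservations
  ADD PARTITION (PARTITION p2023_04 VALUES LESS THAN (TO_DAYS('2023-05-01')))
"""

# A real schema change, yet it needs exactly the same ALTER privilege;
# there is no narrower grant that permits only partition management.
SCHEMA_CHANGE = """
ALTER TABLE reservations
  ADD COLUMN loyalty_tier TINYINT NOT NULL DEFAULT 0
"""
```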

Vinay Joosery: And that's the security part, then: rotating passwords and so on.

Kristian Köhntopp: Yeah, these things become very easy: once you force people to do discovery, you can also easily force them to manage secrets and machine accounts properly. And then, with the performance schema, you can see which credentials are being used and which are not, and you can become hard and harsh on that side as well. Again, it's not a lot of code. It is mostly agreeing across all departments that things are done this way and that any other way of doing things will simply cease to work. And if you force that on all the teams, for whatever reason, it will take a few months for them to adjust. Then you very carefully start introducing it. You will see failures. That's okay; you have to assign outage budget to that, because these are expected outages. You need to prove that you are able to discover them. And then everything becomes better, because you have now reached a different level of doing operations.
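
The performance-schema check he mentions can be as small as a single query: performance_schema.accounts keeps per-account connection counters, so a decommissioned user that is still connecting shows up immediately. The policy wrapper around it here is a sketch:

```python
STALE_CREDENTIAL_CHECK = """
SELECT user, host, current_connections, total_connections
FROM performance_schema.accounts
WHERE user = %s
"""

def credential_still_in_use(cursor, retired_user):
    """True if a retired account still has live connections."""
    cursor.execute(STALE_CREDENTIAL_CHECK, (retired_user,))
    rows = cursor.fetchall()
    # Any live connection under a retired account means "we need to talk";
    # once this stays empty, the account can be dropped entirely.
    return any(current > 0 for (_user, _host, current, _total) in rows)
```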

Vinay Joosery: So, test data and anonymization, that's one of the things that dev teams need access to, I guess. Is that also automated on your end?

Kristian Köhntopp: There are various approaches to this. At the moment, we have tokenized all PII. That means that if you have, say, a guest table or a reservation table, there is no user data in it. There is a placeholder, a synthetic token that looks like a hash, like an MD5 or a SHA-256. You can then go to a service, and with the token and your authentication, you get permission to retrieve the PII associated with that token, for that application. Say you are a customer help desk application that needs to see the following items associated with a guest. By the fact that you authenticate as the customer help desk application and that you have this token, you get these PII items for this particular guest, and the access is recorded. And the business reason is this particular CS case, with the case number.

So there is a business reason for this particular customer service person to actually see that data, and we can trace it down to a particular incident that is being logged. With the token, they get the data shown in the web application they use to do that work. That's production. But it also means that if I take the reservation table, all of it, and copy it with all its indexes at full speed to a test system, there is no PII in it; there are just tokens. And when you test the customer-happiness application in non-production, you can take the token, take the credentials you get in the test environment, and go to that PII service, which is called Soylent. And it will give you data, fake data, generated on demand: you ask with the token and you get fake data, not actual people's data, and the just-generated data is also stored in the test copy of the Soylent database.
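
What such a token exchange might look like from the application side, very roughly. The endpoint, payload, and field names are assumptions; only the service name Soylent is from the conversation:

```python
import requests

def resolve_pii(token, case_number, app_auth):
    """Exchange a synthetic token for the PII this application may see."""
    resp = requests.post(
        "https://soylent.internal/v1/resolve",  # hypothetical endpoint
        json={
            "token": token,  # the placeholder stored in the database
            "business_reason": f"CS case {case_number}",  # logged per access
        },
        auth=app_auth,  # application identity, e.g. the help desk app
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()  # only the fields this application is entitled to
```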

So, if you take the same token and ask for the same customer again, you get the same data. But the customers are made up, and they're made up on the spot, so the test Soylent database is also very tiny. And that allows us to ship most production data into test environments at very low risk, because there is no PII in it, there's no PCI in it; it is not toxic, not radioactive waste that needs to be kept away from developers at all costs. Compare that to masking: you copy the data from production to test and then actually replace all people's names with, I don't know, MD5 hashes of the name with a secret salt or something. But there you do it every time you copy, which is very slow. We effectively do the masking in production, all of the time.
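
A sketch of how a test stand-in with that behavior could work: generate fake data on first sight of a token, persist it, and replay it on repeat lookups. The use of the Faker library and this schema are assumptions:

```python
import hashlib
from faker import Faker

_store = {}  # stands in for the tiny, persistent test-Soylent database

def fake_guest_for(token):
    """Same token in, same made-up customer out, every time."""
    if token not in _store:
        # Seed from the token so regeneration stays deterministic even
        # if the persisted row were ever lost.
        Faker.seed(int(hashlib.sha256(token.encode()).hexdigest()[:8], 16))
        fake = Faker()
        _store[token] = {"name": fake.name(), "email": fake.email()}
    return _store[token]
```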

So, the masking is simply always there as we ship things around. And that saves us doing all of these gazillions of updates and index changes when we copy the data around. All data is always masked, so we only have to do the work once. The Soylent system is then simply not present in test; there is a fake replacement with the same API that gives you fake data, but the fake data is persistent. And that way, we get rid of that problem.

Vinay Joosery: Yeah, yeah. Yeah.

Kristian Köhntopp: That’s fast, it’s stable, and it’s safe, and apparently, it’s compliant. At least it has the compliance seal of approval. So, yeah.

Vinay Joosery: Yeah. No, this is, you know, I can definitely see the advantages of that. So, basically, with all this automation, then what do you do with your free time?

Kristian Köhntopp: So, all of this doesn't just work. It works very well most of the time, but there are always new borderline cases and outages that we discover. Also, requirements change. Originally, for example, we did not do that PII management; that came later, though it's a long time ago now. And somebody had to sit down and build it, and that was not the DBA team; it was an actual application business unit that was duly funded to implement it. Then there's the migration to it, and there's us helping them build the system, and things like that. So there is always enough work to do. I mean, this automation thing is very much a journey. I can come back now and tell you how our things work, because I now know the specs, because I have experienced all of this. But I would not have been able to do the same thing 10 years ago. Yeah.

So that was all very exploratory, and it was done with a really great team of very highly capable people. I'm highly impressed by what happened over all these years with all these people. That was fantastic. But it is, of course, not finished. We still have work to do. We are now going from what we currently have toward rootless operation, and toward an API that we can offer directly to the ABUs, the application business units, taking us as people out of the loop. And, of course, there is always the stuff that the ABUs do with the database. So a large part of the work is always caring for the database, and for the ABU, after a while and before they crash the database. That has always been our credo: the DBA is there to support you when you break the database.

Vinay Joosery: And, I guess, also to prevent that.

Kristian Köhntopp: Yeah, the smart ones ask us before they change anything. The others will do so the next time. But it is not our task to prevent an ABU from doing things with data. It's their data. If they insist on doing it a certain way, we can tell them that this might not be a good idea and that they will likely break things this way. But if they insist: go ahead, touch the candle, and learn by experience how hot the flame is. And then they do that. It's okay. I have had to train myself to accept that attitude; I used to be a lot more engaged in this whole issue, but it's their outage budget, and they are free to spend it in any way they like. So, there's that. And the next time, they come beforehand and ask: will this work or not? And I can only tell them: I don't know, we have to benchmark that. But we can benchmark and test it in a way that won't break production, and then we sit down and do exactly that. Because in many cases I don't know either; maybe I have ideas, but those are theories, and only the test proves what reality is. And that is what we do.

Vinay Joosery: And that's where a lot of value is brought, because the application teams are building business logic, and that directly contributes to the value of the company in a way. And that's when you step into the consulting role and coach the app teams.

Kristian Köhntopp: So, we have been talking a lot about the operational aspects of databases. Operations are hard, and most companies don't understand that. Amazon is rather rich because, as a company, they basically made it their business to do operations well. Not perfectly, but at least better than anybody else. That is where their success comes from: they take operations seriously. But as a DBA, there is also always the work of the dev DBA: doing the data model, modeling business processes, implementing business processes in this particular SQL product, planning for the future. How do I get from the current model to a future model without losing data, transforming it live, keeping the binlog unbroken? That is where the dev DBA and the ops DBA intersect as well, which is an entirely separate topic in itself.

Like, we have to perform these business changes without affecting production. We want to bring in all of these billions of turnover, but we also need to adjust the way we do business along the way. And how do you express that? That is actually something that makes me very sad when I read content on MySQL: most of the articles, most of the time, are operational. There is not enough, by far not enough, talk about: as a developer, how do I talk to the database? What is a correct read-modify-write transaction, as opposed to an incorrect one?
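
For readers wondering what that distinction looks like, here is a minimal sketch, assuming a MySQL connection with autocommit off and an invented rooms table:

```python
def reserve_room(conn, room_id):
    """Correct read-modify-write: lock the row you are about to change."""
    cur = conn.cursor()
    # FOR UPDATE makes concurrent reservations serialize on this row
    # instead of both reading the same stale value.
    cur.execute("SELECT available FROM rooms WHERE id = %s FOR UPDATE",
                (room_id,))
    (available,) = cur.fetchone()
    if available <= 0:
        conn.rollback()
        return False
    cur.execute("UPDATE rooms SET available = available - 1 WHERE id = %s",
                (room_id,))
    conn.commit()
    return True

# The incorrect version reads without FOR UPDATE (or in a separate
# transaction) and then writes based on the stale value: two clients can
# both see available = 1, both decrement, and oversell the room.
```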

How do I batch up transactions so that it's still fast? What is a good balance between foreign key constraints and integrity checks in the application? Check constraints are not new anymore; can I actually use them? Does it make sense? How will that break operations? If they are there, would they interfere with automated schema changes, for example? Is it really useful to have foreign key constraints executed on every replica when they have already been checked once at the primary? Should a read-only replica even check foreign key constraints, or should that be automatically turned off? That's a question to [inaudible] Oracle, to MySQL: does that make sense? Yeah?

And what is the cost of actually doing these checks in terms of, for example, increased working set size? Because it's one more lookup that brings pages into memory. And that goes from purely development aspects seamlessly into this intersection between dev and ops, dev DBA and ops DBA, because the things you do in dev have implications for operations. And because that discussion does not happen enough in the community, the product also does not develop fast enough in the right direction. If there were more of that dev interaction, there would be different items on the backlog of the MySQL product, for example. But that would require jointly developing an operational and deployment model: this is how we run MySQL, this is how we operate MySQL. If you do it this way, it will hurt less; and if you do it that way, we can provide you with the following features that fit seamlessly into it.

Because we make assumptions about what, for example, the replication tree looks like. What performance promises can we now make with today's hardware? What performance and behavioral promises can we make with this kind of replication tree, and with the standard Python libraries or the standard Java libraries? There is no reason for you to manually code out every read-modify-write transaction as a transaction. On top of JDBC, you could make higher-level assumptions, but because we give you access to SQL at the individual-statement level, you can do anything at all, and then we must assume things are broken and cannot do certain other things. In a way, it is time to move past that and try things out: what can we assume that everybody needs, and what can we provide as larger building blocks?
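
One shape such a building block could take, sketched here with mysql-connector-python; the helper and its retry policy are illustrative, not an existing library:

```python
import time
import mysql.connector
from mysql.connector import errorcode

def transact(conn, fn, retries=3):
    """Run fn(cursor) inside a transaction, retrying on deadlock."""
    for attempt in range(retries):
        try:
            cur = conn.cursor()
            result = fn(cur)
            conn.commit()
            return result
        except mysql.connector.Error as err:
            conn.rollback()
            if err.errno != errorcode.ER_LOCK_DEADLOCK or attempt == retries - 1:
                raise
            time.sleep(0.1 * (attempt + 1))  # brief backoff, then retry
```

With something like this in the standard path, every read-modify-write in a codebase inherits correct commit, rollback, and retry behavior instead of reimplementing it at each call site.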

If you look at MySQL versus Postgres, MySQL has always seen the application, the connector, the database, and the replication tree as one system. At least, everybody who uses MySQL that way does not use it as a standalone database; they make assumptions about the client connector, about replication being there, and about the way the client and the database interact. Postgres leans a lot the other way: the database tries to do all the things and all the validations in the database, and it often thinks of itself as a standalone database, with replication maybe tacked on. It has become a lot better in the last five years, but the original Postgres mindset is not the MySQL mindset. And MySQL should actually be doing more of the MySQL thing: see this as a system, see how you can integrate it, and see whether there are wins and simplifications available that you can then take and turn into value, somehow.

Vinay Joosery: I think the key thing I take away here is that there is a high level of automation, which is impressive and which has taken years of work to build. That also frees up time for the team to actually help the application business units, coach them, and help them with all the modeling that's needed. Because that's usually specialized knowledge, and DBAs do tend to have it, but you have to get out of the day-to-day deployments and all these operational things that take up so much of your time.

Kristian Köhntopp: Our databases are newer than other people's databases. Our databases are more readily available, in the required capacity, scaling up and down, than other people's databases. Our databases are better monitored than most others I have seen. And we do have problems, but we have them in places where other people don't even have code. That is the key difference here. Yeah. And that is due to the work of a number of very bright and fast people. A lot of engineering went into this, and it started almost by accident: first by using the application to grow the business and provide capacity, and then by splitting databases into different top-level schemas that run independently, to do that even more. But then you suddenly have a bunch of databases that you somehow need to keep running, and you don't get more people. I think the peak was 18 people, and the minimum was six or so. But the number of databases was always a four-digit number.

Vinay Joosery: Yeah. So, Kristian, you get the final word. What would be your recommendation? You've been through this long journey, 10, 15 years at Booking.com. You've seen how it was when there was no automation, when people maybe tried to do automation with Puppet, and how you've gone from scripts to better methods. What would be your recommendation to enterprises today regarding database automation?

Kristian Köhntopp: Yeah: do not accept stateless DevOps tools for database work, for stateful systems. Not just MySQL; anything that is a cluster will not work with current DevOps tools, because most of these tools don't understand statefulness at all. Try to automate yourself out of the picture. It will not work, you will always have enough work, but you will perform operations at a much higher level. Take operations seriously, at least as seriously as development; it generates at least as much value as development does. And make a ticket every time you use SSH. Because you are doing something manually that you shouldn't be. Either you're looking something up, which is an observability bug, or you're making a change, which is an automation bug, and that's even worse. It sounds like a joke, but every time you use SSH, make a ticket. I'm dead serious, because you have just found a thing that needs to be fixed. That is how you start, basically. And then things happen. You get better.
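
Taken literally, even that policy can be nudged along with tooling. A hypothetical wrapper placed ahead of the real ssh in $PATH, with an invented ticket endpoint:

```python
#!/usr/bin/env python3
"""Hypothetical ssh wrapper: file a ticket, then hand off to the real ssh."""
import getpass
import json
import os
import sys
import urllib.request

TICKET_API = "https://tickets.example.internal/api/issues"  # invented endpoint

def file_ticket(target):
    payload = json.dumps({
        "title": f"Manual SSH to {target} by {getpass.getuser()}",
        "label": "automation-gap",  # every manual login is a bug report
    }).encode()
    req = urllib.request.Request(
        TICKET_API, data=payload,
        headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)  # best effort; real code would handle errors

if __name__ == "__main__":
    # Naive: treats the first argument as the target host.
    file_ticket(sys.argv[1] if len(sys.argv) > 1 else "unknown")
    # Assumes the real binary lives at /usr/bin/ssh; replace ourselves with it.
    os.execv("/usr/bin/ssh", ["ssh"] + sys.argv[1:])
```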

Vinay Joosery: Okay, all right. Well, we've come to the end of this episode. I think that was great. Thank you very much, Kristian, for sharing all your experience with us; it's been two episodes heavily loaded with it. Looking ahead, in our next episode we'll switch gears a little and talk about data privacy, about how enterprises are collecting and using data. For that, our next guest will be the CIO of Freja eID, a government-approved electronic identity in Sweden, and we'll get his take on individual data sovereignty. So thank you for listening, and see you next time.

Guest-at-a-Glance

Name: Kristian Köhntopp
What he does: Kristian is the principal systems engineer at Booking.com
Website: Booking.com
Noteworthy: Kristian is an architect with years of experience in databases, Linux/Unix, data center planning and design, and security management systems in enterprise and startup environments. In his current role, Kristian focuses on database automation, provisioning a few thousand databases and a few hundred application hierarchies, and cloud migration.
You can find Kristian Köhntopp on LinkedIn