The cloud of Babel: unifying cloud and on-prem with Kubernetes

In this episode of Sovereign Database Decoded, Vinay Joosery sits down with Justin Garrison, Head of Products at Sidero Labs, to discuss the evolving role of Kubernetes on-prem. They explore the shifting landscape of cloud adoption, the rise of repatriation, and how organizations are regaining control over their infrastructure. Justin shares his experiences from Disney and AWS, shedding light on why some enterprises are moving workloads back on-prem and how Kubernetes can simplify operations beyond the public cloud.
Key insights
⚡ The Cloud’s Hidden Costs
Many enterprises are discovering that cloud services, while convenient, can become unsustainable at scale. Justin highlights how businesses are facing growing operational costs, unexpected upgrades, and vendor lock-in that makes long-term cloud reliance impractical.
⚡ Cloud Repatriation: The New Reality
Repatriation isn’t just a trend – it’s a strategic shift. Enterprises that initially moved to the cloud for flexibility and scalability are now reassessing whether it aligns with their cost structures. Justin breaks down how businesses spending over $100,000 per month on cloud services often see financial benefits from moving back on-prem or to colocation facilities.
⚡ Kubernetes On-Prem: Beyond the Cloud Narrative
On-premises infrastructure doesn’t have to mean outdated or cumbersome. Justin explains how Kubernetes is making it easier to run dynamic workloads outside the hyperscalers. He discusses the benefits of grounded computing – a more predictable and controlled alternative to the cloud.
⚡ The Realities of Kubernetes Operators
Database operators promise full automation, but are they truly production-ready? Justin dives into the complexities of running databases within Kubernetes, emphasizing the need for operational expertise. While automation tools can help, companies should be cautious of blindly trusting operators without understanding their limitations.
⚡ Rethinking Private Cloud
Is private cloud just another term for on-prem? Justin argues that many enterprises have been conditioned to view the cloud as the ultimate goal, but in reality, ownership and responsibility are key to optimizing performance and costs. He and Vinay explore why companies are reframing on-prem infrastructure in terms of the cloud operating model rather than a relic of the past.
Episode highlights
💡 The Shift Away from the Cloud [12:14 – 13:45]
Justin explains why companies are moving workloads back on-prem and how they are overcoming challenges like retraining staff, rebuilding data centers, and handling infrastructure costs. He highlights how organizations are re-evaluating cloud investments and weighing them against long-term ownership benefits.
💡 Kubernetes on Bare Metal: The Future of On-Prem [18:32 – 20:06]
The conversation turns to how Kubernetes is being used as an orchestration tool for on-prem infrastructure, simplifying deployments and improving automation. Justin discusses how enterprises are leveraging Kubernetes to streamline operations while maintaining full control over their environments.
💡 Cloud Cost Optimization vs. Ownership [27:51 – 29:33]
Justin compares the costs of running on cloud vs. owning hardware and why companies should reconsider their long-term infrastructure strategy. He explores case studies of organizations that have significantly reduced expenses by moving workloads out of the public cloud and into self-managed environments.
💡 The Misconceptions of Kubernetes Operators [34:18 – 36:02]
A deep dive into why database operators don’t solve all operational challenges and why expertise in managing them is still necessary. Justin explains the difference between day-one deployments and the long-term day-two operations that require continuous maintenance, scaling, and monitoring.
💡 How Enterprises Are Navigating Repatriation [42:55 – 44:47]
Justin shares real-world examples of companies that have successfully moved workloads off the cloud and what they learned in the process. He discusses strategies for reducing complexity, managing hybrid environments, and ensuring business continuity while repatriating workloads.
💡 The Future of Kubernetes On-Prem [50:22 – 52:14]
What’s next for Kubernetes in self-managed environments? Justin discusses how organizations can leverage Kubernetes without making it a one-size-fits-all solution. He explores emerging trends in infrastructure management, automation, and hybrid cloud strategies that are shaping the future of Kubernetes adoption.
Here’s the full transcript:
Vinay Joosery: Hello and welcome to 2025’s second episode of Sovereign Database Decoded. I’m Vinay Joosery, and this episode is brought to you by Severalnines. Our guest today is Justin Garrison, Head of Products at Sidero Labs. Thanks for joining us, Justin.
Justin Garrison: Yeah, thanks for having me.
Vinay Joosery: So, let’s talk a bit about your background. You spent over five years at Disney and three years at AWS. What were those experiences like, and what did you take away from them?
Justin Garrison: Disney was a very interesting, large place for different reasons. I started in animation, so I was doing infrastructure for movies like Moana and Frozen, which was not something I’d ever done before. It was very much a scientific computing environment with a lot of HPC on-prem processes. At the end of the day, what I realized we were making was box software. The product we created was all the artists’ drawings that would be put on a CD or DVD, shipped to a store, and sit on a shelf. It was box software, and the player at the other end was something like a Java Blu-ray player. That was the process, and it was how we used to get things done, and it was fascinating because I loved being able to finish something. There was a moment when it was like, “Oh, I’m done with this movie, and we can move on to the next thing.” I feel like a lot of people don’t get that experience anymore.
Then I shifted over to Disney Plus, doing infrastructure for Disney Plus before and during the launch. That was the opposite. As soon as we were done with Disney Plus and ready to have people use it, that’s when the real work started. That’s when it got really busy, and things shifted into high gear—the exact opposite of what we were doing at Disney Animation. At Disney Animation, when we were done with a movie, we’d all celebrate and then move on. Disney Plus was like, “We’re done with the platform, the service, and now we really have to do more work.” They were very different environments, and it was a lot of fun to learn both, and to understand why the processes and tools were different.
At Amazon, it was obviously a services company. I was in AWS, working on EKS. We didn’t get the moment of, “We’re done with this thing,” but the culture at Amazon was much more of a monoculture compared to Disney. All the different Disney divisions ran software, products, and services differently. They all had their own ideas around what they wanted. You could go from division to division and have completely different tool sets, talents, and all different things. But at Amazon, it was very much one type of thing, which made it easier to shift between teams but harder to differentiate yourself. It was harder to say, “I want to do something that I think would make a big improvement,” because you had to bring all of Amazon along, and that was really difficult to make any sort of changes like that.
Vinay Joosery: And today, you’re at Sidero Labs. What do you do there, and what does Sidero Labs do?
Justin Garrison: Sidero is the Greek word for iron. If anyone knows Kubernetes, they know it’s Greek-themed. If you know iron, you know it’s kind of bare metal, on-prem infrastructure. What I found at Disney Animation was that I was doing on-prem Kubernetes for them, trying to build up their practices around containers. Even at Amazon, I was doing on-prem Kubernetes a lot with EKS Anywhere. I kept coming back to this idea that on-prem allowed people to have more flexibility, more ownership, and the ability to do more with what they owned instead of renting it over and over again.
I was a big believer in that, coming from an environment where they owned data centers and knew how to run the hardware. I didn’t think everyone should rent all the time. I enjoyed the cloud—I was using the cloud in multiple places—but one of the biggest things I learned about Amazon was that their best asset is marketing. They can make everyone think that having a data center is a bad thing, but then you can praise Amazon for building data centers. It’s like, “Well, Amazon is a really big on-prem company,” and no one really connects those things.
At Sidero, we really focus on that on-prem side. We’re just like, “Hey, let’s make Kubernetes on-prem super easy.” Coming from doing it manually at Disney, experimenting with different products, and then building EKS Anywhere, I thought, “There has to be a better way.” All of these things felt too complicated. So, Sidero really focused on making the underlying operating system dead simple to use with Talos and then building on top of that, saying, “Okay, now if we have an API for Linux, what can we do with that?” It turns out you can do a lot of things that aren’t necessarily automation but more like autonomous systems where you don’t have to run a bunch of shell scripts, Ansible, CloudInit, and all these things that kind of tie everything together and hope it works. It’s like, “Oh no, we know what declarative APIs look like. Kubernetes uses them all over the place.” That’s what they’ve been doing for a very long time, and I joined to keep doing that.
Vinay Joosery: I see. Some pronounce it “Cedero” and some “Sidero.” Maybe the Greek word is “Sidero,” not sure.
Justin Garrison: I mean, same with Kubernetes, right? A lot of the pronunciation doesn’t transfer over between languages or gets it wrong. I’m not too caught up on how people pronounce things.
Vinay Joosery: Yeah, actually, I used to work for MySQL for a while, and people would get hung up on how you say it—“My-S-Q-L” or “My-Sequel.”
Justin Garrison: It’s an age-old problem.
Vinay Joosery: So, today we’ll be talking about Kubernetes on-prem, starting with migration patterns to and from the public cloud and also getting into the nitty-gritty environmental and operational details of running Kubernetes in your own environment. Let’s get started.
First, if we roll back and talk a bit about the dream of the cloud—the move to the cloud—why has that happened on that scale? What has been the effect of enterprises moving to the cloud?
Justin Garrison: I think the cloud marketing is amazing. The cloud is a single destination. Whenever anyone thought about on-prem, there wasn’t a single representation of on-prem because everyone’s on-prem was different, and everyone’s on-prem was broken in different ways. We could boil the cloud down to, “Oh, it’s the one thing you really want.” All of those are also broken in different ways, so it’s not like it’s a new thing, but it gives you a single target. Everyone can say, “I’m moving to the cloud,” just like now a lot of people are like, “I’m moving to AI.” You’re like, “What does that mean for you?” And it’s like, “I don’t know, but we have to have it.” People had to have the cloud. Some of them said, “I want a private cloud. I want to own the cloud.” Whatever they said, they wanted the cloud.
Really, they didn’t want the responsibility of maintenance. They didn’t want the responsibility of something long-term that they owned. Just like if I travel to a city, I want to get from the airport to a hotel. I’m not going to buy a car. I’m not going to own the car long-term because I only need it for a short period of time. I’m going to call a Lyft, go from the airport to the hotel. I might get public transportation, but likely, I just want the convenience of having a car. I just was on a plane for a while; I just want to sit down and go. The cloud gives you that convenience. The cloud says, “Oh, cool, you can experiment with all this stuff. You can do all these things really quickly. You can get it instantly.” It’s just instant gratification. Just give me that EC2 instance, give me that database, whatever. They say, “I want that convenience.”
But the problem is, if you’re doing it long-term, if you really want to get a ride back and forth—if I’m taking my kids to school every day, I’m not getting a Lyft for that. That is so expensive, and it’s actually not convenient anymore because I can’t leave whenever I want. I have to coordinate with someone else. I have to request the car. I have to do all this stuff. The cloud determines when you upgrade. The cloud says, “We have your RDS instance; we’re upgrading you.” You don’t get to make that call anymore. You have your Kubernetes cluster; we are upgrading that for you. You don’t have any ownership there. You now have to coordinate with this other group that is part of your infrastructure but you don’t own. You don’t have any control over those things.
So, the convenient side of the cloud is absolutely amazing. Everyone loves it. I’m not saying don’t use the cloud. The cloud is very convenient. I’m still using Lyft to get to the airport. Those situations when I don’t know how long I need this thing, I don’t know how far it is, I don’t know necessarily where I’m going—use something that’s convenient, use something that’s temporary. But once you understand, “I need to do this every day. I need to run this. I need to be responsible for this,” you need to then build up the skills to maybe go buy your own car, get a driver’s license, pay for insurance. Those sorts of things are all critical to having responsibility and flexibility.
Vinay Joosery: And what is the effect of this move to the cloud? What has it done to all these enterprises who maybe have moved significant portions of their operations to the cloud?
Justin Garrison: Even in a short period of time a lot of businesses have forgotten what it actually takes to run something on a server. Sometimes that’s because they sold all their data centers, and they’re like, “I don’t know how to start from scratch.” No one ever knew how to start from scratch because the data center was bought 30 years ago or something. That could just be lost over time. But a lot of developers also don’t know the layer below where their infrastructure or applications run. They say, “I don’t know Linux. I don’t know what hardware looks like. I don’t know what a NUMA node is on a server. I don’t care about that stuff.” In many cases, they don’t have to care about that stuff.
But now, businesses don’t prioritize any of it. They say, “Well, if that application is slow, I’m going to put more money there. I’m going to get a bigger hard drive to get more IOPS. I’m just going to rent a bigger EC2 instance because that’s all I know how to do—throw money at the problem.” They don’t know how to actually figure out what’s causing the problem and then solve it at a lower level. They just kind of abstract that away because they’re like, “That’s not my ownership. I can’t touch what’s underneath Lambda. My cold starts—I just have to run one all the time. That’s the solution.” When you don’t have access to those things, a lot of companies are losing access to the things that can give them differentiators, make them more performant, and allow them to run more infrastructure or applications at a cheaper cost.
All of those things—they’re just like, “Well, it’s not mine.” If you’re renting a car, I’m not going to change the oil on it. I don’t need to know how to change the oil. It’s okay. But if the car breaks down or overheats or something, then I might need to fix it.
Vinay Joosery: And I guess also, there’s the way the market rewards competence—being able to build your own network with Cisco switches and routers, instead of just configuring things in AWS. Those basic skills are not as prized today.
Justin Garrison: Just like the cloud is a single target for where people are moving, even if it means different things, certifications and training and ecosystems have done the same thing for people who are learning. People are saying, “I want to get into technology. All right, go get your AWS certification.” That’s where people put their hope. When you change people’s view of, “I can have a better life, I can get a better job, I can support my family,” changing where their hope is is a big deal.
I do think that having a single target makes that easier for a lot of people to align on what they’re trying to do. Certifications, training, ecosystems—those all do that. It’s just like Kubernetes does to a lesser extent. It’s like, “I can get my CKA certification in Kubernetes, and maybe I can go get a job and do Kubernetes. I don’t have to do this Ansible stuff anymore. I don’t have to do whatever it is that they don’t want to do.” Kubernetes might be the better thing, so I’m going to go for it. Humans generally are really good at taking care of themselves and trying to build themselves a better future.
Vinay Joosery: So, we are now at a stage where maybe we’re seeing reality. Cloud growth is slowing down, and the question is, are we seeing a backlash, or will we see a backlash to public cloud? For example, repatriation: Geico spent a decade in a large public cloud and decided last year that it was too expensive and they were not getting the value they thought.
Justin Garrison: Uber had backlash when people realized how expensive Uber was. Uber started raising prices because their VC funding was drying up, and they needed to make money from customers. So, it turns out they can demolish a lot of the taxi industry and just pay for it based on VC funding. Now that they have captive audiences, they can charge them more. There is some of that backlash because the cloud, at least when I was getting started, always promised, “We’re going to get cheaper and cheaper.” That hasn’t happened for years. There haven’t been big sweeps of, “Oh, you know what? I didn’t do anything, and now my bill is cheaper.”
My last year at AWS, every company I talked to was trying to save money on their cloud bill. They’re like, “I can’t afford this anymore.” Not only because the infrastructure itself costs more money, but because they got rid of all the people that used to rack and stack servers and replaced them with more expensive people that do FinOps or something else in the cloud. I have to pay these people more money and they’re basically maintaining and optimizing their infrastructure in various ways. They just can’t afford this anymore. How do I make this cheaper? Every customer I talked to—I wrote best practices, all this stuff around it—it was the same three things I could offer: “Try to right-size your EC2 instances, try to reduce your network spend, try to get reserved instances and use spot.” That was it. That was the most you could do. If you do those things, you might reduce your bill by 30 or 40%.
I just sat down on my own and thought, “Okay, how much would this infrastructure cost me to buy it and run it and host it in a data center?” I realized that somewhere around the $50,000 mark—if you were spending $50,000 a month and could project that spend over a year—I could go buy a server and put it in a data center and do it cheaper than what I was getting from AWS. That was the breaking point for me. I thought, “Wow, there’s a lot of companies spending more than $50,000 a month that know they’re going to spend more than $50,000 a month for the next year or two, and they could make it cheaper.”
But that wasn’t the real savings point because that wasn’t actually like, “Okay, well, for big companies, this doesn’t actually matter.” Then I started projecting out, “Okay, how much did the teams cost?” I saw, talking to more and more people, that somewhere around $100,000, you could save money on your people. You could actually say, “Oh, you know what? I could get a white-glove service in a colo. I don’t even have to rack and stack things. I could still rent some portions of this. I don’t have to do everything manually. I don’t have to plug in network cables and make it look pretty.” Even though some people love doing that, somewhere around $100,000 a month, if you could project out your $100,000 spend, you could save money on people.
Then people like Geico and Hey.com were spending more—they were spending around $200,000 to $250,000 or more. At that point, the cloud became a limiting factor where people started noticing, “Hey, you know what? I’m actually going slower because I’m in the cloud. This is taking me longer to do things because of the complexity of hundreds of AWS accounts, VPC peering, limitations in what the APIs can do, and all of these things.” Like, actually, at this $250,000 mark, I should just go build a data center. I’m going to move faster, have more ownership, and do a lot more with my time if I just own more of the stack.
Those were the interesting spots I started noticing over and over again. Companies were shifting to, “Hey, if it’s a small company, even people like Zig, the Zig programming language, they had a couple of servers, and they’re like, ‘I just need this to be cheaper. I’m not saving any time, but I just moved it out of AWS and saved all the egress fees.'” They cut their bill like 80% or something ridiculous by just moving two servers somewhere else. Even at the low end, it was interesting seeing people realize that just the cost of a big hyperscaler cloud—like, “I don’t need all the features. I don’t need to offset the cost of someone else running something that’s cheaper. I just need an EC2 instance. I just need a VM somewhere.” That was just over and over again the whole year at Amazon. I thought, “I don’t think the cloud is sustainable for a lot of people long-term. It is incredibly convenient. There’s a lot of people that are going to be there. They should be there. But if you’re just planning this out and you make a spreadsheet and do the math, at some point, it’s not going to add up.”
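As a back-of-envelope version of that break-even logic (the symbols are ours, for illustration, not Justin’s): renting stops making sense over a horizon of $T$ months once

$$
B_{\text{cloud}} \cdot T \;>\; C_{\text{hw}} + \left(C_{\text{colo}} + C_{\text{people}}\right) \cdot T
$$

where $B_{\text{cloud}}$ is the monthly cloud bill, $C_{\text{hw}}$ the one-time hardware cost, and $C_{\text{colo}}$, $C_{\text{people}}$ the monthly colocation and staffing costs. With a predictable bill and a year-plus horizon, the one-time term amortizes away, which is roughly why the $50,000 and $100,000-per-month thresholds keep coming up.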
Vinay Joosery: Yeah, and that’s also one thing we hear. Planning that cost is very hard.
Justin Garrison: People hate budgets. It’s extreme, and that’s another thing. People are like, “I don’t want to go back to yearly budgets.” It’s like, “Actually, you know what? A yearly budget is really good if you need consistency.” That’s what most companies need. They don’t care how much it costs. The finance department just needs to know if what you’re telling them this is going to cost is actually what it’s going to cost in a year. In the cloud, with everything being on a credit card or dynamic, that predictability is gone. If you can just give them, “I bought these 10 servers. This is how much it costs to host them somewhere. This is how much my power costs, my cooling costs. Network usage is free. All that stuff. This is my total cost. I will fit whatever I need inside those 10 servers.” In most cases, they don’t actually care. They’re like, “Okay, cool. That’s all I want. Put that line number down, and the whole business balance sheet works because I know that’s consistent.” The cloud completely throws that for a loop, and that gets really difficult because then you’re reacting to things that are too expensive or trying to hire companies to save you money or whatever.
Vinay Joosery: So, you mentioned something about building your own data center, on-prem. But is on-prem still a swear word? I mean, it’s like you need to be a Luddite in a way to maybe be ashamed of running your on-prem infrastructure because the cool people do it all in the cloud. That’s the way we’ve been made to feel over the last 10 years or so. It’s like the cool people are doing stuff in the cloud. So, what are your thoughts on that? Is that changing? Maybe the concept of on-prem is moving to something like private cloud. Maybe we don’t call it on-prem anymore; maybe we call it private cloud because cloud is an operating model versus a destination. So, what are your thoughts on that?
Justin Garrison: Just like I was saying before, the best part about AWS is the marketing—giving “cloud” as a name and making it a single place where people can go is fantastic. As soon as you say “on-prem” to people, they say, “Well, what about edge? What about something else?” It’s like, “Well, it’s all kind of on-prem.” I like to call it “grounded computing.” Instead of the cloud, we’re grounded—not because we’re running on the ground, but because we’re more realistic about what we need. We say, “I can predict what I need.” Some of that might be in the cloud; some of that is going to be a hosted service. I’m not going to run an email server; I’m just going to use Gmail. But I’m going to stay grounded about what value I’m getting out of that.
So, that’s the first thing. It’s just making that a target. I don’t think saying “private cloud” works because my VPC in AWS is also private. Like, what is privacy here? It’s like, “Oh, we’re just going back to the old days of firewalls and no ports open on the firewall for people coming in.” No, that’s not really what we mean by private cloud. What we mean is API-driven infrastructure. What we mean is on-demand requests so that when someone says, “I want to create something, I want to run something,” they can do it in a way that doesn’t require an email or talking to someone. Just unblock me to be able to get something done.
As soon as we put “private,” “hybrid,” or “cloud” in the title, everyone just says, “Oh, then the actual target is the real cloud.” Like, “You have a lesser experience of the cloud.” That’s not it. That’s not what we’re talking about. We’re talking about something different—ownership and responsibility but also flexibility in how you’re using and consuming it. Historically, that’s been difficult because someone says, “I need to build something.” Okay, go to the Dell or HP website, pick out your server, email procurement, and then in a month, you’re going to get it. Then, in two months after that, we’re going to unbox it, rack it, give it an IP, provision it, and then you can SSH into a host. Now you can do whatever you want. That’s not good enough. We know that’s not good enough.
So, being able to give people any sort of dynamic infrastructure—like, “I can just deploy something”—my first introduction to Kubernetes was a Jira ticket internally at Disney that just said, “Allow the web team to self-deploy. Don’t talk to anyone. They just want to be able to create an application and deploy it somewhere.” That literally led me down this whole rabbit hole of building out Kubernetes and saying, “Okay, how can we give them an API that does that?” Do they want VMs? They didn’t want the responsibility of the VMs because they said, “We would force them to do upgrades and tests and all that stuff.” So, I went to them and said, “Hey, could you package your stuff in this container format and give it to me? I will run it somewhere.” At first, it was just a manual thing. They just packaged the container, put it in a registry, I had a few VMs, and I ran the container. I said, “Cool, you didn’t have to email me. I could see that it was being updated. I could see the Jenkins job. All that stuff. I just ran it for you.” I was the orchestrator. I was the Kubernetes component of that.
Then, once I got comfortable with that, I said, “Okay, I understand how this works with three machines. Now let me do it with five machines in an automated way. Let me find an API that does that for me.” That was my whole path into this Kubernetes thing—just trying to solve that as a business problem of making dynamic, API-driven, no-emails infrastructure.
Vinay Joosery: So, if we come back to this on-prem versus private cloud, it is really on-prem. We should not really think “private cloud” in the sense that private cloud alludes to the real public cloud in a way. It’s pretty much the same.
Justin Garrison: I feel like if you call it a private cloud, everyone says, “Oh, you just should go pay for the real thing. You have a private cloud because you can’t afford the real cloud or you’re not cool enough for the real cloud.” The cloud does really cool stuff. The amount of scaling and all those demos—like, “Ah, you can’t do this on-prem.” It’s like, “Yeah, you’re right. I can’t. Do I need to? Does my business actually need that? I don’t think so. I don’t think I need to scale a thousand nodes in a minute, but okay, that’s really great for you. I love that that exists for some people that might think they need it.”
Even Disney Plus—we launched Disney Plus, and we had 10 million signups in a day. You know how we did it? We pre-scaled everything for three weeks. We just scaled up everything—all the load balancers, all the VMs. We said, “Oh, actually, we know we’re going to get a big hit on launch day for Disney Plus, so let’s just take the estimates we have of how many users, do a little bit of math, and say, ‘Okay, we’re just going to scale all this up by however much and just leave it there for weeks beforehand.'” Just to make sure that, A, we get the capacity from the cloud because you don’t get capacity all the time you need it. We’re going to warm the load balancers because AWS requires that, which you don’t usually require other places. We just let it sit, and we spent an extra 60% on our infrastructure for a while until Disney Plus launch settled down, and then we could scale it back a little bit and say, “Okay, this is about where it’s settling down. Let’s scale back to the actual numbers.”
Vinay Joosery: Yeah, so regardless of where people decide to place their racks and operationalize them—let’s say their own data center or maybe in a colocation—what are people using today to build their own?
Justin Garrison: All of that!
It’s a mix of all of those things, and which one they’re using the most depends on who you talk to. When I was at Amazon, everyone we talked to was using Amazon. Now, at Sidero, everyone is using on-prem or they’re using a not-hyperscaler cloud. They’re like, “Oh, you know what? This OVH Cloud or Hetzner or DigitalOcean or whatever—they’re all cheaper. All I need is the compute. If you just give me compute, I can do whatever I want with it, and it happens to cost 60-70% less than the equivalent EC2 instance.” So, it’s like those areas where they just say, “Oh, well, I need it this way. I don’t need all the bells and whistles of this thing. I know what I actually am going to use it for.”
For us, it’s Kubernetes clusters. We’re like, “Actually, Kubernetes solves all those cloud layers for me where I get the API. I can do the dynamic scaling. I can do all the stuff that maybe I thought I needed at AWS. I can just put those at the Kubernetes API, and I’m good.”
Vinay Joosery: What about the old players? You have VMware, which used to be king and is still massively used. Then you have newer ones like Nutanix, platforms like OpenStack, CloudStack, and Proxmox. There’s a bunch of them. What about those?
Justin Garrison: Even KubeVirt, right? All these are reinventing a lot of this over and over again—what’s the API you actually need to do something. There is some difference there. Some of them are open-source; some of them are hardware appliances; some of them are software you run on whatever you own. Those three models are all options, and it’s good to have those options to say, “Actually, I don’t have any hardware, but I have a partial rack somewhere in a colo. I can get Nutanix boxes, plug them in. I don’t have to worry about doing the installs because maybe I don’t have the skill set to maintain some of that stuff.”
VMware was always the play of, “You run it on your hardware.” They don’t sell you a box; they license it based on what your hardware is. It was a way to utilize more of what you bought—get more out of the thing you already own. The whole reason VMware exists is because we weren’t utilizing the hardware we were buying. We said, “Oh, we can carve it up into smaller bits, and then you can use it more.” So, you’re just getting more out of that investment.
Things like OpenStack and CloudStack and the other open-source ones are all about freedom of responsibility and ownership. Like, “Hey, you know what? I can own this wherever I need to run it. I can install it on my own hardware, or I can get a service like Rackspace that runs it for me.” But the idea is, “I might need this portable at a different layer.”
One of the things a lot of people think about Kubernetes is, “Oh, it’s a portable thing. I can run this wherever I want. I can run it on-prem; I can run it in the cloud. Then all of my applications are portable.” It’s not really like that. Every environment is a little different, just like OpenStack. Every environment is going to be a little different because there are different constraints and requirements. But if you have the skill set to do that and you need that much flexibility, then you go to that end of the extreme. You say, “Okay, I’m going to run everything open-source.”
There are a lot of these other clouds that are like self-hosted cloud environment APIs that aren’t open-source. They seem a little weird to me where it’s like, “Okay, you’re not quite VMware. You’re not as good or well-known as VMware, but you’re not open-source and portable like OpenStack.” So, you fit somewhere in the middle there. The people that decide that are just like, “I just need to get off VMware because the bill is too high now, and I don’t really care about the open-source stuff.” So, there’s this other piece in there.
But all of them are just on the spectrum of what you actually need, what your skill sets are, and do you want to consume your infrastructure with an API in some way to make it somewhat dynamic. But all of the processes and practices around that usually restrict people because we’re not giving admin access to create VMs to the devs. So, they’re still like, “Oh, well, no, you have to talk to me even though the API exists.” Which, again, is a big reason AWS and the cloud have been successful because we actually gave all the devs admin IAM access. We said, “Go have fun. It’s all on this credit card. Just don’t go over this amount or whatever.” But on-prem, we couldn’t say, “Go have fun. We already paid for it,” because you can’t trust them to restrict themselves: they’re going to consume all the resources, and then someone else can’t run something.
So, this is the difference between how they function. In one case, I can say, “Oh, I can just cap your bill and say, ‘I get alerts whenever you go over $100.'” On-prem, I can’t do that because I already paid for it. I don’t know when you go over $100. Is that an amount of time? Is that an amount of resources? Whatever. We have to have a way to restrict them, and so usually, that’s a team and a process that restricts them, not a technology.
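That said, the closest technical lever Kubernetes offers that team is a per-namespace resource quota, the rough on-prem analog of a billing cap; the numbers inside it are still a process decision. A minimal sketch, with illustrative names and limits:

```yaml
# Hypothetical quota for one team's namespace. Kubernetes enforces the
# ceiling; deciding what the ceiling should be is still the team/process.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-web-quota
  namespace: team-web
spec:
  hard:
    requests.cpu: "32"            # total CPU the namespace may request
    requests.memory: 128Gi        # total memory the namespace may request
    persistentvolumeclaims: "20"  # cap on storage claims
```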
Vinay Joosery: But is Kubernetes a viable alternative today for fully building your own private cloud, if you have a bunch of machines?
Justin Garrison: Kubernetes sits at a different layer. Kubernetes doesn’t control the infrastructure itself; at most, it calls the APIs for it. Really, Kubernetes is above that. It’s like, “Just give me the compute, and I’m going to run somewhere. I’m going to coordinate things between that.” But all of the other stuff—like, “Give me a load balancer. Give me another VM for autoscaling”—those are all just calling those APIs. The APIs exist, but Kubernetes is kind of dumb, and Kubernetes just says, “Oh, look, a load balancer is here. I’ll connect it to that label.” It’s not doing a lot of the work outside of just, “You give me an API to call, and I can call it.” It sits at that different layer, which also gives us a little more freedom.
A lot of people have been constrained by the idea of data centers, regions, and VPCs. They say, “Oh, I have a data center here. Kubernetes runs here. I have a data center over there. Kubernetes runs over there.” Same thing with regions—like, “You have regional clusters.” But really, it’s just this layer of compute orchestration that says, “Wherever you give me machines, we can run it there.” That was one of the really cool things starting at Sidero, where I realized they had technologies like WireGuard networking built into the OS. We can just talk to another machine wherever it is. WireGuard is a low-latency VPN, so you can solve problems in different ways that aren’t constrained to regions and data centers. You say, “If I can coordinate a bunch of infrastructure all over the world, I can do that in a different way because Kubernetes, as the shim layer, isn’t actually constrained to those things.” Whereas a lot of times, cloud providers and, historically, people have said, “You have to run it in this location.” I’m like, “No, that’s not actually how it works. You can do different things with it.”
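For reference, that WireGuard mesh in Talos is the KubeSpan feature; per the Talos documentation, enabling it is a single switch in the machine config (abridged sketch):

```yaml
# Abridged Talos machine config. KubeSpan builds a WireGuard mesh between
# nodes, so one cluster can span racks, sites, or clouds without a
# hand-rolled VPN.
machine:
  network:
    kubespan:
      enabled: true
```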
Vinay Joosery: But I assume that you are able to have some kind of locality. If you want to say, “Okay, maybe these databases I can only run in this region,” or, “I’m going to do redundancy with my containers, but I can’t have the backup being spun up on the same machine. I want my redundant node somewhere else.”
Justin Garrison: That all comes down to what the application and the business needs. Whatever their risk profile is, they say, “I need redundancy.” In a lot of places in the cloud, they’re like, “Okay, well, then you should be across multiple AZs because those are data centers.” Some people are like, “No, I need it across multiple racks. I want to make sure that the power profile on that rack is different than this one. Yeah, they’re not on the same box. They’re in different racks.” You can even go down to different NUMA nodes. You’re like, “I need this to run on a different CPU set than that machine, that workload.” Whatever it is, you can have that flexibility down to that layer.
But then also, you could say, “Oh, if I need high availability across data centers, then yeah, I can place the workload. I can say, ‘Spread this across data centers.'” If I label it properly, I can do that.
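A minimal sketch of what that labeling and spreading looks like in a Kubernetes manifest, assuming the standard topology.kubernetes.io/zone node label; a custom rack label would slot into topologyKey the same way:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  replicas: 3
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      # Hard rule: never put two replicas on the same physical host.
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: db
              topologyKey: kubernetes.io/hostname
      # Spread replicas evenly across zones (or racks, with a custom label).
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: db
      containers:
        - name: db
          image: postgres:16   # illustrative image only
```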
Vinay Joosery: So, let’s assume an enterprise has been running in the public cloud and has decided it’s not getting the benefits of the cloud—high cost, lack of control, whatever. What challenges will they face when they’re building and running their operations for this on-prem, let’s say, or maybe private cloud? I shouldn’t say private cloud. Is there a way to reduce these through simplifying certain choices, like choosing something like containers?
Justin Garrison: If they’re already containerized, sure, keep doing containers. One of the benefits of moving to the cloud that a lot of people did was they just picked one or two services to move up there first. They were just going to slowly trickle some things up there and then figure out how they work. The bigger benefits a lot of companies got were the process changes over the technology changes. They said, “Dev, you get your whole account. You do whatever you want. We’re just going to look at the bill at the end of the month, and we’re going to set some alerts. We might set some restrictions, but you get the entire thing, and you need to run it.” They didn’t do that on-prem. They maybe gave them a server, but there was still a team behind them that was doing the monitoring, updates, racking, and all that stuff. They didn’t give them a box and say, “Go rack this and go have fun.”
Moving out of the cloud is a very similar thing. I would hope that people don’t fall back to the old on-prem sort of process where they say, “Okay, now if you want access to something, you have to SSH into it. We got rid of all the APIs.” That’s the thing that a lot of on-prem environments restrict. They said, “Oh, if the server is here, we want you to be able to go touch the server and push the power button if it gets stuck.” In the cloud, you can’t do that. So, we had to build processes around, “How do you get access? How do you debug a system that you can’t push a power button for?” The cloud might give you some of those virtual buttons to yank a power cable or something, but in general, I had to build a better process around, “What would it look like if I don’t have access to the machine?”
That was actually the founding idea behind Sidero’s Talos. Talos was built to remove humans from that loop, specifically on-prem, where all of Talos is an API. It’s all API-driven. There’s no SSH. There’s not even a concept of SSH inside the system. So, if you want to configure or debug or do anything to the system, you call an API. That API exists to do everything you would need, including bonding network adapters, setting up hard drives, running workloads, and all of those things at a machine level. Everything is an API spec, and we say, “Hey, if everything’s an API, what do we use on top of that?” It’s like, “Well, Kubernetes is a great orchestrator. We can do everything above Talos with Kubernetes.”
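To make “everything is an API” concrete, here is an abridged Talos machine config; the field names follow Talos’s public docs, but treat it as a sketch rather than a complete, working config:

```yaml
# Abridged Talos machine config: NIC bonding and the install disk are
# declared here, not configured over SSH (Talos has no SSH at all).
machine:
  install:
    disk: /dev/sda
  network:
    interfaces:
      - interface: bond0
        dhcp: true
        bond:
          mode: 802.3ad
          lacpRate: fast
          interfaces:
            - enp1s0
            - enp2s0
# Pushed to the node over the Talos API, e.g.:
#   talosctl apply-config --nodes 10.0.0.5 --file controlplane.yaml
```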
Coming out of the cloud, if you have Kubernetes and you have containers, then you’re ahead of the game. If you have VMs, again, I’m not saying give it up. You could use VMs on-prem. You can do a Nutanix or an OpenStack or VMware—wherever you are in there—and say, “Ah, is my VMware bill really high? Yeah, how much was your cloud bill?” It’s like, “People are saying how much more expensive VMware is and not actually comparing it to something else that would be comparable and saying, ‘Actually, maybe VMware is not so expensive if I was looking at moving to the cloud.'” VMware is cheaper than AWS, so I’m fine. But that’s not how most companies think about it. They just say, “I need to cut this bill by 10% or 20%. It doesn’t matter if it was more expensive with our proposed shift to the cloud or something.” So, they just say, “Okay, well, where can I save money?” In most cases, it’s like, “Okay, well, you either utilize the thing you already have, or you buy cheaper versions of the thing you have, or you get rid of people.” Unfortunately, a lot of companies are just like, “Well, the people are the most expensive part, so I’m just getting rid of people.”
Vinay Joosery: Can you go 100% cloud-native and run Kubernetes as your operations platform? If so, would it even make sense? Red Hat has been pushing OpenShift virtualization based on KubeVirt. Is that more of a bridge play so you can run legacy VM workloads on KubeVirt and then the rest of the stuff cloud-native?
Justin Garrison: Like you said before, cloud is more of a mindset of how we’re working with things, how we operate things, and how we control them. People often forget that the cloud was born with VMs. The high praise of Netflix being one of the early cloud adopters was that they built golden images of VMs, and we all said they were cloud-native. We all said 10-15 years ago, “Netflix is doing it right. They’re building AMIs and deploying AMIs to autoscaling groups, and those are cycling through and replacing themselves.” It’s just a VM game. The package changed from an AMI to a container with Kubernetes, and we say, “Oh, now Kubernetes has to do it all.” I’m like, “Not really. VMs are still a part of this API spec.” If you manage them in a cloud-like way of saying, “I need this to be dynamic, or I need to have someone—the ownership of that to be an API call, not an email,” that’s a big deal.
When I was writing the book Cloud Native Infrastructure, we went through that history. We started talking about how all of these people that are running cloud-native stuff, they’re all doing VMs really well. Most of them didn’t even have any Kubernetes at the time. They’re just like, “Oh, that Kubernetes thing is like, ‘Hmm, maybe we’ll get there.'” But VMs and the process of building artifacts and deploying them and making sure that side of it was healthy and deployed in an automated fashion—that was the cloud-native bits. It wasn’t the Kubernetes and containers.
I don’t think that everything will ever be Kubernetes or should be Kubernetes. I’m a firm believer that you shouldn’t make all problems Kubernetes problems. That was one of the things I really disliked about EKS Anywhere, where every time we were building a new system, we would put it in Kubernetes in this management cluster. I’m like, “Why are you making all these Kubernetes problems? That’s dnsmasq, your actual DNS. Don’t make it a Kubernetes problem. If I want to debug DNS, let me just go find where it’s running and tail the logs.” I don’t want it to be dynamic. I don’t want it to autoscale. I don’t want it to shift machines. I need things like DHCP and DNS and a lot of times databases—I need those just to be in a certain state that I can understand. I don’t need all the flexibility and dynamic stuff that you might get out of stateless web applications.
We say, “Oh, we should treat everything that way.” I’m like, “No, you should not.” I don’t think that Kubernetes clusters should be ephemeral. There’s too many things that point to a Kubernetes cluster—like DNS externally, load balancers externally, storage externally. People say, “I can destroy my entire Kubernetes fleet and then rebuild it with a single command.” I’m like, “Well, what about all the other stuff that exists? That’s not a bubble of infrastructure that can be redeployed. All the stuff outside of that—you can’t do these sort of blue-green deploys with multiple Kubernetes clusters a lot of times because, well, then how do you rotate DNS? How do you shift your load balancers? How do you move traffic from one machine to the other when the customer outside is calling DNS? Who knows when their cache expires? You have to run both clusters for an undetermined amount of time until you know for sure. It’s just more expensive and more work to do that.”
A lot of people that just try to go all-in on Kubernetes—it’s a tool. We should use it for the things it’s good for and not necessarily try to put everything as a Kubernetes service.
Vinay Joosery: You actually mentioned databases. What about them? You can theoretically run them via operators, right? You can keep them running on your VMs via KubeVirt or on your regular VM environment. What do you see out there?
Justin Garrison: The main thing with databases is they’re not web applications. They’re not stateless applications that I can just throw wherever I want and put behind a load balancer. I have different requirements. The closer I get to data—to bits on disk—the more paranoid I should be. In a lot of cases, that paranoia means I might want to SSH into a server and run this manually so I know how that works. Some people—that data isn’t as critical. If this data is down for a few seconds or a little while, or if I need to do a restore backup, I’m not losing the business. In those things, absolutely, it’s okay to be able to trust someone else’s software to maintain that database for you.
We’ve been doing that for a long time before operators. There’s been database managers and things that would babysit and make sure that databases would fail over properly. There’s all this software that we didn’t write that maintains and manages databases. But the more important it becomes, the less willy-nilly you should be with that software and say, “Actually, this one I’m not going to put in Kubernetes until maybe I trust it enough or I have enough practice around that.”
I see this shift from both sides. A lot of times, databases are getting better at being containerized and being more dynamic. We look at things like CockroachDB and different databases. They’re like, “Oh, you know what? We were designed to horizontally scale from the start. We were designed to lower your operations requirements so that new nodes joining and leaving aren’t a big deal.” Older, traditional databases are not like that, and we have to be aware of why they were built, how they were built, and we can’t just call it an operator and say, “It’s all good now.” Like, “Oh, this operator is going to fix everything for me.” No, you should really have a skilled person that is your operator at some level. If you have a software operator, you need a person operator that knows how that works. You can’t just trust everything to this CRD plus a control loop.
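To picture what “this CRD plus a control loop” means in practice: with a database operator, you declare the cluster you want as a custom resource and the operator reconciles reality toward it. A hypothetical example follows; real operators (CloudNativePG, Zalando’s, Crunchy’s, and others) each define their own schema:

```yaml
# Hypothetical custom resource -- the apiVersion, kind, and fields are
# illustrative, not taken from any specific operator.
apiVersion: databases.example.com/v1
kind: PostgresCluster
metadata:
  name: orders-db
spec:
  instances: 3              # the control loop keeps three replicas running
  storage:
    size: 200Gi
  backup:
    schedule: "0 2 * * *"   # day-two question: who verifies the restores?
```

The loop will keep that spec converged; what it cannot tell you is whether the 2 a.m. failover was actually safe for your data. That judgment is the human operator Justin is describing.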
Again, at that layer, we pay DBAs. We should be paying DBAs a lot of money because that is the most critical part of most businesses. Again, it’s just one of those places where a lot of companies—maybe they’re shortsighted, maybe they just see it as a cost—are like, “Actually, these databases have been running stable for years. Why do I need a DBA?” Well, maybe they’ve been running stable for years because of your DBA. Maybe you should look at that side of it. That is a different conversation, but it’s really hard because no one ever told anyone, “Good job for running servers that didn’t fail.” It’s like, “Oh, no one ever—back to the sysadmin days—no one’s congratulating you on not having an outage. They’re saying, ‘This should work just like a light switch in my house.’ It’s only surprising when I flip the light switch and the light doesn’t turn on. I say, ‘Well, what’s going on? Why isn’t that working the way I expect it to all the time and be stable?'” Five nines of light switchability—this should work. In infrastructure, we keep going back to this: “How much reliability do you need? How much investment do you need on that? How long are you going to run it? How much is it going to change?” All of those things determine how flexible your process needs to be, your software needs to be, and how much of it you need to own and write.
A lot of these companies, a lot of the enterprises, are writing a small percentage of all the software they run. That’s just a fact. They’re not writing operating systems from scratch. They’re not writing load balancers. They’re not writing Kubernetes. The amount of software they run that they are not the owners of is very large compared to the amount they’re writing. They’re just gluing it together, and they’re saying, “I trust someone else. I trust this open-source project. It has 100 stars. I should run that in production.” That’s like the barrier for a lot of people now. “100 stars? We’re good. I’m going to run their operator. I don’t need a DBA anymore. MySQL is good, and we can operate it from that.” Then it breaks, and they say, “I don’t have the skill set to fix this anymore. I have to open a ticket or I have to go hire someone and do something to figure it out.”
So, databases—again, I wouldn’t treat them like web services the same way and say, “Actually, these are stateful, and I care about this data.” If you don’t care about the data, sure. I’ve thrown away plenty of data that’s like, “Oh, you know, I lost some metrics. It’s okay. I recovered those.” Those sorts of things are fine. But when you really get down to the important data, just make sure you have someone that operates the operator.
Vinay Joosery: I must say that we see fewer and fewer DBAs because maybe there are more databases, and now you have DevOps, SREs, all types of titles handling databases. They’re getting more and more general-purpose—people who handle not just databases but a bit of everything else as well. But coming back to the database operators, it’s interesting that you said anyone running an operator should really understand what’s going on there so they can fix things if things break. But the operators themselves—are they complete? Do they do everything for you? Day-one deployments are fairly simple. You’re not deploying the same database every day, but then you have day two, which involves the entire lifecycle of the database—upgrades, auto-failover, repair, scaling, backups, point-in-time recovery, and so on. What level of automation do we have in today’s operators?
Justin Garrison: All of them were built for different reasons. All of the operators—even if you look for a Postgres operator right now, there’s at least five that I know of that all have different trade-offs around, “Oh, we built this one for this environment or for this type of data or this use case.” We had to fork the old one because of whatever reason. All of software is expanding. There’s been an explosion of new types of databases that a lot of people became aware of, I feel like, in the last maybe eight years. It used to just be, “We have SQL and NoSQL, and that was it.” We say, “Actually, SQL is the one that businesses and enterprises use. NoSQL is for the hipster web devs.” NoSQL had less operations, so people started gravitating towards that because, “Oh, it’s easier to operate because you do everything in software.” I’m like, “Well, is that actually what you need though? Is that the performance you need? Do you need that large-scale ability? I don’t know. Let’s figure that out.”
I remember shifting a SQL database to a time-series database for what I identified as time-series data. I said, “Wow, you have a four-terabyte SQL database that you would load once a year to get reporting out of.” I’m like, “What are you doing? Why don’t we put that in a time-series database?” They’re like, “Oh, we had to have X, Y, and Z. We had to have all this stuff.” I’m like, “What are you actually reporting on? If I can run your reports faster, let’s figure that out.” So, I literally migrated all the data, and it went down to like three and a half gigs—four terabytes to three and a half gigs of time-series data. Then they could run it all the time. They could run their queries. They had all the granularity of details that they needed for the reporting. I was like, “Okay, well, we just don’t need that thing anymore because this new database exists.”
People keep shifting around new types of databases and realizing that sometimes their data isn’t fit for a certain thing, which also changes their requirement on how they run it. The operator side of it—the long-term, like day two, is the longest day of your life. We could deploy something day one and say, “Cool, this works.” Then day two is forever now. That side of it gets really difficult because it just stretches on until the application is no longer needed. A lot of companies are not good at retiring applications. A lot of companies are not good at knowing when to turn things off. That is another difficulty because they say, “Well, now we just need to maintain everything forever.” So, of course, our infrastructure bills and everything else are going to always go up. Not all those things are maintained forever. So, we have to either change versions or manage it in new ways.
The day-two side of it is the important thing. A lot of operators focus on day one: “I can get you 100 databases in 10 seconds.” Cool, I don’t need that. I want more nines over five years. That’s the actual thing I want. So, there is some maturity that’s happening. Some of that is because operators are still a newish thing in Kubernetes where it’s like, “This CRD in a control loop.” As the Kubernetes API matures, some of those things work differently. A lot of the operators were built on beta and alpha APIs—those are things you maybe shouldn’t run in production. But Kubernetes has a long maturity cycle for getting those things to production. I need this today. I’m not going to wait for six more Kubernetes releases for it to be stable. Let’s just ship it, and then we’ll figure out how to change later.
That is a process that people aren’t good at. They make the decision today, and it’s someone else’s problem in the future. Choosing the right operator—I wouldn’t say the “right” operator exists. What you need is someone who understands what you’re building to own that operator and say, “This is what we actually need today. Let me make the right decision today. I might have to change that in the future. Let me make sure that the process, the understanding, the documentation, runbooks, and all that supporting stuff around the application is good enough for tomorrow.”
Vinay Joosery: Yeah, and actually, from what I’ve heard from other guests in the past talking about Kubernetes, there’s an encouragement that you bring in the operator, but you should really fork it because every company or organization does things their own way. So, you mentioned maybe there are five operators out there, and maybe they made decisions because they had specific reasons why they wanted to use this and that, and maybe you don’t want to do the same. So, from that perspective, maybe you also understand where whatever you’re using is coming from and then also be prepared to fork it so that it fits your own kind of operations.
Justin Garrison: People do that all the time with Helm charts and different things. They say, “This is the thing I need today, and I don’t want it to change on me tomorrow if I’m not ready for it.” The problem a lot of them get into is tomorrow ends up being three years. It ends up being this difference of time that they didn’t expect because they just ran out of time. They’re like, “Actually, I can’t look at all that stuff.” A lot of platform engineers I know are just looking at support matrices of YAML, and they’re just like, “Oh, this version of Kubernetes with that version of CoreDNS and this Argo CD and that—they like do this whole chart of what supports what because we froze everything in time.” Which is probably the right move if you have production requirements because Argo is not doing that check for you. Argo doesn’t know your environment. CoreDNS does enough checks to get something inside their sphere to work, but as you have 10-15 services supporting your Kubernetes cluster, no one’s checking that in open-source. That’s your responsibility. But you need to be responsible for making sure you move it forward enough to always keep it up to date. That’s just maintenance. That’s just the job of a sysadmin. This is like the age-old question of, “We used to have VMware stacks and even Exchange servers. I was responsible for Active Directory and Exchange. Okay, what version of Windows could I run? What version of Exchange can I run with this version of AD with these features?” I had to do that work. I just had to do that chart. Kubernetes is that same thing. We didn’t change this any different. We just write YAML instead of click-through wizards, and it’s all kind of the same ideas of being responsible for your software and making sure it’s not going to cause an outage because that’s the thing you’re risking. Like, “Oh, if the business goes down, if the website’s down, we’re losing money, and it’s my fault. I don’t want that.” A lot of times, it’s just the same thing in a different skin.
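That “support matrix of YAML” often literally lives as a pinned-versions file in a repo. A hypothetical sketch; the version numbers are placeholders, not a tested compatibility claim:

```yaml
# versions.yaml: the platform team's frozen-in-time matrix. Bumping any
# line means re-validating the combination yourself; no upstream project
# tests your exact stack for you.
kubernetes: "1.29"
coredns: "1.11"
argo-cd: "2.10"
cilium: "1.15"
cert-manager: "1.14"
```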
Vinay Joosery: Looking at Kubernetes, I guess many would probably start off playing around with public cloud services—EKS, GKE—and then maybe they would say, “Hey, let’s go all in and use this. Okay, we’ve been consuming this as a service, and now we want to manage it ourselves.” So, from that perspective, what is the difference when you’ve been consuming it as a service but now want to manage it yourself in your own private environment? What are the day-to-day limitations you might encounter running it on your own set of boxes?
Justin Garrison: The same thing goes for any proprietary software. If I want to move from Microsoft Office to LibreOffice—an open-source option—I’m probably not going to look at all the code. I’m just going to use it. I’m going to be a consumer of it. And you are moving from something proprietary, because all of those hosted services are proprietary services—just start there. A lot of people are like, “Oh, it’s Kubernetes, so it’s open source.” I’m like, “No, there’s nothing about GKE and EKS that is open source behind the Kubernetes API. All of that stuff is proprietary. You don’t get to see any of that code. You don’t get to touch it. You cannot run that in your data center.” There are flavors like EKS Anywhere. I did not like the branding, because there’s nothing about EKS Anywhere that looks like EKS. Those are completely different things. EKS Anywhere is entirely different. EKS, under the hood, is a bunch of Lambda functions, Step Functions, and EC2. That is how it’s built, and that is how it runs. You are not going to replicate that somewhere else.
But if you’re moving from any proprietary software to an open-source thing, the first thing you need to look at is, “What do I need? What is my requirement for a word processor?” If all I’m doing is typing words in a box and maybe printing it occasionally and adding a little formatting, I can go pretty far on any of the open-source options. I can go far on a text document. Like, I don’t actually need Microsoft Office to write Markdown. I can write Markdown, render it, and it’s good enough for what I need. That’s usually the first thing. It’s like, “What are you actually using out of this? What do you actually need out of this? How much of the proprietary service operations are you relying on versus what you need to do?”
Are you now responsible for upgrading the database? etcd is a database, and people say that’s why they use hosted services—etcd is a difficult database to run. They say, “I don’t want to be responsible for that database. I don’t want to hire people for it, so I’m going to pay Amazon to do it for me.” I still frequently claim that Kubernetes would run in more places—not hosted services—if etcd were not the database. If we said this was all SQLite, the number of companies comfortable running it would be huge. They’re like, “This is a file on a disk. I know how to back that up. I have some SQL experts. I know enough SQLite that I could run this and not have to buy a managed service.” I was not in the room when etcd was chosen for Kubernetes. I know that Google wanted high availability across data centers, and SQLite could not do that. But I also know that Google doesn’t use etcd. GKE is not backed by etcd anymore, because it’s a difficult database to maintain. It’s not the thing they run: “Actually, we don’t need this database. We have a different internal proprietary database that’s more performant and scales better. We built it all from scratch. We know how it works, so we’re going to shim Kubernetes onto our database. We’re not going to run etcd.” On-prem, you’re stuck with etcd—unless you’re doing something like K3s, which has a SQLite option. But then people say, “Oh, I want the high availability. I want to do all this stuff,” and they shim all this stuff into SQLite to say, “Well, now it’s highly available.” At some point, you just have to get comfortable running a distributed database, and etcd is that database.
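For reference, K3s really does make this a one-flag choice: a single server defaults to SQLite (through its kine shim), while a flag switches it to embedded etcd for HA. A sketch using the standard install script:

```bash
# Single-server K3s: defaults to a SQLite-backed datastore (via kine),
# effectively a file on disk that any DBA knows how to back up.
curl -sfL https://get.k3s.io | sh -

# The same binary with embedded etcd instead, for multi-server HA:
curl -sfL https://get.k3s.io | sh -s - server --cluster-init
```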
If you want to own this now, you have to ask, “Okay, how much of Kubernetes do I rely on? How much of it do I need? Do I just need web applications to move between servers? Easy. Don’t even worry about it. Your database isn’t going to fill up. You’re going to get enough performance. You’re probably not autoscaling it.” Even autoscaling isn’t required for a lot of people. The number of companies I talked to at AWS where I said, “You just spent six months figuring out your perfect autoscaling, and you’re only changing about 10 machines a day. You could have pinned it at 10 machines, not worried about it, and saved six months of effort, which cost you more than any of the savings you’re getting from scaling up and down by 10 machines.” People look at these things and say, “I have to autoscale because that’s what Kubernetes says.” No, you don’t. Just use static machines. Just have a few hardware nodes, and if your workload fits, you’re good.
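If the workload really does fit on a fixed set of machines, “pinning it at 10” is a one-liner rather than a six-month autoscaling project. A hedged sketch with the AWS CLI (the group name is made up):

```bash
# Pin a node group's Auto Scaling group to a fixed size: min = max = desired.
# No scaling policies to tune, nothing to surprise you.
aws autoscaling update-auto-scaling-group \
  --auto-scaling-group-name my-k8s-workers \
  --min-size 10 --max-size 10 --desired-capacity 10
```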
My first cluster was six nodes. It didn’t autoscale. It was all hardware. I didn’t even run a CNI—a container network interface plugin—because I didn’t need one. All I had were static routes. I just went to each server and said, “Any IP in this block goes to that box.” I knew which box had which routes. I didn’t need a CNI. I didn’t need to upgrade it. That was how we could do it in a responsible way, because I didn’t need dynamic scaling. I didn’t need infrastructure that was going to change a lot. I just needed an API for the developers to deploy something. Everything else underneath that was static, and it was fine.
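That CNI-free setup amounts to giving each node a pod CIDR and teaching every box where the other ranges live. A sketch with made-up addresses:

```bash
# No CNI plugin: static routes send each node's pod CIDR to the right box.
# Run the equivalent on every node (all IPs here are illustrative).
ip route add 10.200.1.0/24 via 192.168.10.11   # pods scheduled on node-1
ip route add 10.200.2.0/24 via 192.168.10.12   # pods scheduled on node-2
ip route add 10.200.3.0/24 via 192.168.10.13   # pods scheduled on node-3
# With static nodes and static routes, there is nothing to upgrade.
```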
So, that’s really the concern a lot of people have. They look at what is possible and say, “I have to check all the boxes.” In reality, you should ask, “What are you actually doing? What do you actually need? Let’s start there.” And if that’s all you need, maybe you don’t even need Kubernetes. Maybe some bash scripts are good enough, and that’s great, because it lowers the operational cost. I’m not saying everyone should be doing bash scripts. I’ve shipped a lot of bash to production, and I’m not proud of it. It was hard to maintain. Getting the practices and common features of Kubernetes is a good thing because, again, that ecosystem helps you solve problems, just like the cloud. I can search for anything about EC2, and I can probably find my answer. Same thing with Kubernetes. I can’t do that for my own bash scripts. So, being able to bring people on board without needing all the expertise in-house—you just need enough to find a solution to your problems—is a great place to be. But don’t think that Kubernetes is the new Terraform. A lot of people jump into Kubernetes and say, “I want to run Crossplane because I want everything to be in Kubernetes.” Don’t make everything a Kubernetes problem. Don’t put DHCP there. Find the things that are painful to operate and ask whether putting them in Kubernetes actually makes them less painful—or more. The trade-off is, “Okay, do I want this to be dynamic?” No? Then it should just stay static, and that is great. Being able to make those decisions is really difficult, especially as people see new shiny tools.
Operating it yourself is more about figuring out what you need than about building out whole teams and processes. You have a DBA? Give them some SRE training. Get them an SRE cluster to play with, let them break it a bunch of times, and see if they can fix it. Then, “Okay, now I feel comfortable. Can you restore a database dump for etcd?” If you can back up and restore, you are 80% of the way there. That is what most people do with any database that fails: “Hey, this broke. I’m going to restore from backup.” If you can do that for etcd, you’re pretty close. It isn’t a big lift from there. The last 20% might take you longer, because there are a lot of edge cases, a lot of things you don’t know you don’t know. But just get to, “I can back up and restore, and our downtime doesn’t have to be one second. It’s okay if it’s an hour.” Cool, then solve that problem first. Don’t think you have to run all this stuff. Again, at Sidero, we try to make that as easy as possible. etcd is built into the operating system. It’s not a layer on top of Kubernetes. It’s not a set of other machines. The operating system is only meant to run Kubernetes. We say, “Actually, we can reduce a lot of your headaches by making this only work with Kubernetes. This is a single-purpose operating system. It just runs etcd and runs containers.” We baked in as much of the operational excellence as we could by deleting all the stuff we didn’t need. We didn’t actually need systemd. We didn’t actually need SSH. If this is all we have, we can bake in the common practices by default. So, when we do an upgrade, we know that when the last node comes in, we’re going to do a database migration. We changed versions; we know how that works, because that’s all we do. That side of it makes it super clear: “Okay, I don’t need all that stuff, and I can reduce the things I’m responsible for by using single-purpose tools and the common practices of backup and restore.”
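The 80% described here really is a couple of commands. A hedged sketch of an etcd backup and restore (endpoints, certificate paths, and filenames are illustrative; on Talos, the equivalent snapshot is taken with `talosctl etcd snapshot`):

```bash
# Take an etcd snapshot (endpoint and cert paths are illustrative).
ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/etcd/ca.crt --cert=/etc/etcd/client.crt --key=/etc/etcd/client.key \
  snapshot save /var/backups/etcd-$(date +%F).db

# Restore the snapshot into a fresh data directory, then point etcd at it.
etcdutl snapshot restore /var/backups/etcd-2025-01-01.db \
  --data-dir /var/lib/etcd-restored
```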
Vinay Joosery: Can you give us an idea of the landscape? Sidero is one. I guess you can use Rancher. OpenShift is probably the big one, which I assume is a beast to operate. What’s out there?
Justin Garrison: You’re right. OpenShift is the big one with a lot of market share, mostly because of Red Hat. Red Hat has been around for a long time. They’ve been the enterprise Linux for a lot of people, so when they sell you something, you use it. I’ve been an operator of a lot of Red Hat software in my time, and I’m not a huge fan of any of it. I’ve done a lot of upgrades for Satellite, Katello, their directory services, and various other things. These are big, monolithic systems—they feel like boxed software, in a not-good way, sometimes. I’m like, “Okay, this is kind of dangerous. I have to run this giant bash script that will hopefully finish in a few hours.” You have to watch the logs for half your day just to make sure the thing is going to work. OpenShift has come a long way, but it also does a lot that people don’t need because, again, this is a huge system that didn’t start with Kubernetes. OpenShift had its own concept of containerized applications—they used to call them cartridges and gears. That’s from when I was running OpenShift back in the pre-Kubernetes days. They took a lot of those ideas and moved them forward into this big, giant thing: “Hey, if you want the all-inclusive Kubernetes package, OpenShift has it. They’re going to bring you Jenkins for your CI/CD. They’re going to bring you a custom command line. They’re going to bring you buildpacks so you never have to write a Dockerfile again. We’re going to do so much for you that your teams don’t need to know how it works.”
Many, many people are successful with that, but a lot of people also want more flexibility. They need to be in the details. They don’t want to use Jenkins. They don’t want to do things the opinionated OpenShift way. Sometimes that’s because they already have teams of people: “I’m not going to run my VMs there because I have VMware. I don’t need your KubeVirt. I have VMware. We have a whole team around it. It’s stable. It’s big.” There are a lot of reasons you might not need all the bells and whistles, and it’s also really expensive. That piece of software is what makes Red Hat its money now. It’s no longer enterprise Linux. It’s no longer subscriptions. OpenShift is the moneymaker for Red Hat.
The other side of it is folks like Rancher. There’s a handful of smaller companies that are like, “Oh, we have this thing.” There’s EKS Anywhere; Google has a thing. Some of them are attached to a cloud, and the reason they exist is just to make you move to the cloud someday. That’s all. That’s why EKS Anywhere exists. I didn’t like that, but the thinking was always, “Oh no, we need to get all these on-prem people into the cloud.” I’m like, “Why can’t the on-prem people just be on-prem people?” “No, no, no, we’re going to make it so easy to consume the AWS stuff that they’re going to want to move to AWS.” I’m like, “That’s not why I want to do this. I just want to make it easier for on-prem people.”
There’s that whole notion there, and then the Ranchers and the other handful of vendors—all of them collapsed onto Cluster API. Cluster API is a way of managing Kubernetes with Kubernetes. The co-author of my book, Kris Nova, started that. She started with this idea: “What if controllers could create Kubernetes clusters?” She had a project called kubicorn, which was a way of bootstrapping Kubernetes with Kubernetes. She was doing a lot of kops back in the day—kops was a way to run Kubernetes on AWS before EKS existed, and she was a maintainer of it. She went from that to kubicorn and then started Cluster API: “Okay, the idea is that all of your Kubernetes clusters are Kubernetes resources. If you know how to use Kubernetes, this should be the best thing for you.”
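In practice that means a cluster is just another manifest. A minimal, hedged sketch of a Cluster API resource (names are made up, and the referenced control-plane and infrastructure objects would have to exist too):

```bash
# A whole cluster declared as a Kubernetes resource (Cluster API v1beta1).
kubectl apply -f - <<'EOF'
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: demo-cluster
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["10.244.0.0/16"]
  controlPlaneRef:                 # delegated to a control-plane provider
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: demo-control-plane
  infrastructureRef:               # delegated to an infrastructure provider
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
    kind: DockerCluster
    name: demo-cluster
EOF
```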
It’s a very complex system. Like anything Kubernetes, once you add a bunch of controllers to it, you have to go find the right pod that has the right log that does the right thing to the CRD that doesn’t update. It’s this Rube Goldberg machine of complexity. It’s different from a Terraform file. I can look through a Terraform state file and say, “I know where this went, because everything is self-contained.” The monolithic nature of Terraform was a good thing for debugging. Cluster API was built for cluster builders. It was made for companies that wanted to give you the same interface across a lot of different resources. All of these cloud providers become a Cluster API target: you run a controller for AWS, and it can create all your AWS resources. Then, if I want on-prem, if I want VMware, if I want something else—all of those are more controllers that create Kubernetes clusters. And when you have to upgrade that central management Kubernetes cluster, it is the riskiest thing you can do. It’s this huge cluster where you start thinking, “You know what? Maybe Terraform looked good. Maybe I should go back to it—I don’t want everything to be a Kubernetes problem.” When you had to upgrade it—literally, for EKS Anywhere—you would download all the etcd data to your laptop: “Okay, now my laptop is critical to the production environment, and I have to upgrade the Kubernetes cluster and put all that data back into the database.” That is a bad idea. I don’t care who you are. I don’t care if it’s Kubernetes or not. You do not want to download the SQL database to your laptop, do some stuff to it, and then restore it back. We’ve learned that over time. But that’s where almost all of them collapsed. Cluster API made everything a Kubernetes problem, because the builders loved Kubernetes so much they said, “Everyone should have Kubernetes everywhere.” It also segmented all those things: “Oh, this is the controller for AWS. This is the one for VMware.” So it made it easy to say, “I want a VMware cluster. Cool. I want an EKS cluster. Cool. I want bare metal. Great.” But once you say, “I want a VMware cluster with bare metal worker nodes and AWS for bursting,” you can’t do that. All that isolation became really hard.
So, all of the providers today that are based on Cluster API still have this mindset of, “You’re running it in one place. Your Kubernetes cluster only goes in VMware. It only goes on bare metal.” You can’t span that, and you can’t talk across data centers, because each provider is maintained and managed the way it was built: the DigitalOcean controller is managed by DigitalOcean. They’re the ones who build it, and they’re never going to ship a DigitalOcean controller that also spans on-prem. That’s just not going to happen.
For us at Sidero, we are agnostic to all of that. We had a product that was built on Cluster API, and we didn’t like it. It was too complex. It was hard to operate. It had all these limitations, especially with bare metal, where Cluster API assumes that when you’re updating your nodes, you can get a new one—you can call an API and roll the cluster by creating new nodes and deleting old ones. But bare metal is not that way. Bare metal is static. I bought that machine. I racked it. I named it. It has a static IP address. I’m not rolling it anywhere. So we needed a different way to operate: “Actually, Talos, the operating system, has an API. We can call that API. We can upgrade that machine in place.” In-place upgrades are, again, something Cluster API is not interested in. They’re not interested in letting you span multiple data centers or multiple providers, because of the way it’s built and the way it’s all combined together.
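For comparison, the in-place model looks like this with Talos: the node is upgraded where it sits, by API call, instead of being replaced (the node address and version tags are illustrative):

```bash
# Upgrade a static bare-metal node's OS in place over the Talos API.
talosctl upgrade --nodes 192.168.10.21 \
  --image ghcr.io/siderolabs/installer:v1.9.0

# Kubernetes itself is rolled forward the same way, on the same machines.
talosctl upgrade-k8s --to 1.32.0
```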
So, we had built a thing based on that—that was the landscape at the time, and it’s where everyone still is. As far as I know, we’re the only ones today with something for on-prem that’s not Cluster API-based, where we say, “Actually, this is your infrastructure, and Kubernetes is just the layer on top of it. However you get a machine into this area of management, we can consume it and connect it, because we have WireGuard VPNs and all that stuff.” It was a different way of thinking. You can do Cluster API things with Talos, but that’s not what we’re building and not how we’re maintaining it. We think there are too many limitations for people who want to be on-prem.
Vinay Joosery: So, if you look five years out, what’s next for Kubernetes on-prem? You mentioned that if you have stuff running, you shouldn’t make everything a Kubernetes problem—don’t think that because you have a hammer, everything is a nail. But that also brings some complexity, because then you have some stuff running outside and then you have your Kubernetes. The amount of stuff you need to know—maybe it’s different environments. I’m thinking about that sort of complexity. Where do you see us going with this?
Justin Garrison: There’s always going to be the back and forth between people who say, “I don’t want to have to know this,” and, “I need to know this.” People who react to the cloud being too expensive and say, “I’m going to run it all on-prem.” People who went all-in on Kubernetes and Cluster API and say, “This is too complex. I don’t want this. I’m going back to Terraform.” That pendulum is always going to swing, and I’m hoping some people are settling into an understanding of what it is they need to know—just like my car analogy. I got a flat tire last week. I was driving on the freeway in the rain, and the pressure warning came up. I know where to look for this. I can change the dial and see the pressure on all my tires. Okay, all of them are correct, but this one’s a little low. I don’t know if it’s just low or I actually ran over something. So I just watched it. I’ve got my kids in the car, we’re going somewhere, it’s pouring rain, and I watched the pressure go down slowly—5 PSI, then another 5. I’m like, “Man, it’s a flat. Can I make it to my destination before this fails on me? Let me ride another mile and see the rate of change before I have to do something.” I determined no, I’m not going to make it before I’m riding on the rim, and I know I don’t want that. So, calmly, I gave my phone to my son and said, “Hey, text your mom. Let her know where we’re at. Text the people we’re meeting. Let them know we’re going to be a little late. I’m pulling off the freeway here.” They didn’t even know why—“What are you doing?” I’m like, “Don’t worry. We have a flat tire. I know how to change a flat tire.” Thankfully, right? I have that skill set. It’s a thing I don’t do often, but I know how to do it, and that was enough to make it to my destination on time. I built that up over time so I could judge, “Is this an important thing for me to know, or is this something I should just call a tow truck for?” I have both options, and the tow truck would have taken a lot longer in the rain, waiting for someone else to do the work. By skilling myself up and deciding, “Okay, I need this, and I want a spare tire available on my car at all times”—that gives me worse gas mileage, and I almost never think about it until I need it.
Technology and operating things are the same situation. “Hey, I want that skill set in my bag somewhere so I know I can do this if I need to. I don’t need it all the time. Hopefully, if everything goes well, I never need it.” Like the spare tire. But if something breaks, or we have to make a change quickly, or we need to get software to a destination, we should have enough of the skills to get there on time. Kubernetes on-prem can be one of those tools, because it gives you that API layer that a lot of people are otherwise building themselves. Almost every on-prem enterprise has a platform team now, and they all think they’re building something they actually need in a way that’s going to be sustainable long-term.
I’ve been around long enough to know that every single one of those platforms ends up migrating from platform to platform. The problem usually comes from a misalignment of incentives, where the team paying for platform engineering is the security team—“I just need to be able to see all the vulnerabilities”—but you’re serving the developers, who don’t care about that at all. So the platform team builds in all this stuff for the security team who pays their bills, while their customers are saying, “I just want you to deploy this faster. I don’t care about the security thing.” That misaligned incentive becomes a problem, and I don’t think it’s going away, because it’s been around for a very long time. But being able to start from a more common interface and say, “This is Kubernetes. This is how it’s going to work for us. This is how we’re going to use it”—hopefully teams look at that and say, “This is how much of it we actually want to use. Maybe we disable all of the beta APIs so people don’t accidentally build on them and force us to migrate off and take an outage.” I’m hoping on-prem makes people a little more responsible, not just chasing the new and shiny. I don’t have a lot of hope for that, looking back over the couple of decades I’ve watched people do this. It’s the same as config management and all those things in the past. VMware golden images—all of these were going to be the tool to end all problems, and they never were. But we found the places where they made sense, and I hope that in five years Kubernetes is in a spot where how people use it makes more sense. It’s not the hammer for all nails. It’s used responsibly: “I need this much of Kubernetes. I love the APIs. I want it to manage these things, but maybe I keep my databases somewhere else. Maybe I keep my storage—my NFS—out of Kubernetes.” Let’s just figure out how to be responsible with it.
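As one hedged illustration of “disable all the beta APIs”: on a self-managed control plane, kube-apiserver accepts group-wide toggles through `--runtime-config`, and a kubeadm cluster could carry that as an extra argument (exact key support varies by Kubernetes release, so verify against yours):

```bash
# Sketch of a kubeadm ClusterConfiguration that turns beta APIs off cluster-wide.
cat <<'EOF' > cluster-config.yaml
apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
apiServer:
  extraArgs:
    runtime-config: "api/beta=false"   # disable all v*beta* API versions
EOF
```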
Vinay Joosery: Or, if we listen to what I think Satya Nadella said earlier—that we’re heading into a world where it’s all going to be AI and SaaS services, and it’s all just a database with CRUD.
Justin Garrison: Yeah, and most applications are that. Most of them are: data over here, change the data a little bit, send data over there. If you dumb it down, that’s the essence of most software. But it’s also a lot more complex than that, because people are in the mix, and people, environments, and processes are a lot more complicated.
Vinay Joosery: All right, excellent. Well, time to wrap up. We’ve spoken about how the cloud is changing—perhaps becoming more of an operating model—and how on-prem should perhaps not be thought of as the same thing as private cloud. Maybe I have to reconsider that now, based on what we’ve discussed. We’ve also spoken about Kubernetes on-prem as opposed to using services like GKE or EKS, and how you can run it self-managed. Well, thank you, Justin, and that’s it for today. Thank you all for listening.
