I wouldn't trust the management of this team for anything. They appear totally incompetent at both management and basic analytical skills. Who in the heck creates a cluster per service per cloud provider, duplicates all the supporting services around it, burns money and sanity in a pit, and then blames the tool?
Literally every single decision they listed was to use any of the given tools in the absolute worst, incompetent way possible. I wouldn't trust them with a Lego toy set with this record.
The people who quit didn't quit merely out of burnout. They quit the stupidity of the managers running this s##tshow.
that tends to be the take on most “k8s is too complex” articles, at least the ones i’ve seen.
yes, it's complex, but it's simpler than running true high availability setups without something like it to standardize the processes and components needed. what i want to see is a before-and-after postmortem from teams that dropped it, comparing numbers like outages, to get at the whole truth of their experience.
Complexity is a puzzle and attracts a certain kind of easily bored dev, who also has that rockstar flair, selling it to management - then quitting (cause bored), leaving a group of wizard-prophet-worshippers to pray to the k8s goddess at the monolith circle at night. And as management, you cannot admit that you went all in on a guru and a cult.
Then they hire a different cult leader, one that can clean up the mess and simplify it for the cult that was left behind. The old cult will question their every motive and hinder them with questions about how they could ever make it simpler. Eventually, once the last old architecture is turned off, they will see the error of their ways. This new rock star heads off to the next complicated project.
A new leader arrives and says, “we could optimize it by…”
Why exactly did they have 47 clusters? One thing I noticed (maybe because I’m not at that scale) is that companies are running 1+ clusters per application. Isn’t the point of kubernetes that you can run your entire infra in a single cluster, and at most you’d need a second cluster for redundancy, and you can spread nodes across regions and AZs and even clouds?
I think the bottleneck is networking and how much crosstalk your control nodes can take, but that’s your architecture team’s job?
It's just a matter of time before someone releases an orchestration layer for k8s clusters so the absurd Rube Goldberg machine that is modern devops stacks can grow even more complex.
> Isn’t the point of kubernetes that you can run your entire infra in a single cluster
I've never seen that, but yes, 47 seems like a lot. Often you'd need production, staging, test, development, something like that. Then you'd add an additional cluster for running auxiliary services, i.e. services that have special network access or are not related to your "main product". Maybe a few of these. Still, that's a long way from 47.
Out in the real world I've frequently seen companies build a cluster per service, or group of services, to better control load and scaling and again to control network access. It could also be as simple as not all staff being allowed to access the same cluster, due to regulatory concerns. Also you might not want internal tooling to run on the public facing production cluster.
You also don't want one service, either due to misconfiguration or design flaws, taking down everything, because you placed all your infrastructure in one cluster. I've seen Kubernetes crash because some service spun out of control and caused the networking pods to crash, taking out the entire cluster. You don't really want that.
Kubernetes doesn't really provide the same type of isolation as something like VMware, or at least it's not trusted to the same extent.
Which in many cases would break SOC2 compliance (co-mingling of development and customer resources), and even goes against the basic advice offered in the K8s manual. Beyond that, this limits your ability to test Control Plane upgrades against your stack, though that has generally been very stable in my experience.
To be clear I'm not defending the 47 Cluster setup of the OP, just the practice of separating Development/Production.
Why would you commingle development and customer resources? A k8s cluster is just a control plane, that specifically controls where things are running, and if you specify they can’t share resources, that’s the end of that.
If you say that sharing the same control plane is commingling… then what do you think a cloud console is? And if you are using different accounts there… then I hope you are using dedicated resources for absolutely everything in prod (can't imagine what you'd pay for dedicated S3, SQS), because god forbid those two accounts end up on the same machine. Heh, you are probably violating compliance and didn't even know it!
I would want to have at least dev + prod clusters, sometimes people want to test controllers or they have badly behaved workloads that k8s doesn't isolate well (making lots of giant etcd objects). You can also test k8s version upgrades in non-prod.
That said it sounds like these people just made a cluster per service which adds a ton of complexity and loses all the benefits of k8s.
In this case, I use a script to spin up another production cluster, perform my changes, and send some traffic to it. If everything looks good, we shift over all traffic to the new cluster and shut down the old one. Easy peasy. Have you turned your pets into cattle only to create a pet ranch?
> Then you'd add an additional cluster for running auxiliary services, i.e. services that have special network access or are not related to your "main product". Maybe a few of these. Still, that's a long way from 47.
Why couldn't you do that with a dedicated node pool, namespaces, taints and affinities? This is how we run our simulators and analytics within the same k8s cluster.
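For illustration, here is a minimal sketch of that pattern, assuming a node pool that was created with the label pool=analytics and the taint dedicated=analytics:NoSchedule. All names and the image are invented:

```yaml
# Hypothetical sketch: pin an "analytics" workload to a dedicated node pool
# inside the shared cluster instead of giving it its own cluster.
apiVersion: v1
kind: Namespace
metadata:
  name: analytics
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: simulator
  namespace: analytics
spec:
  replicas: 2
  selector:
    matchLabels:
      app: simulator
  template:
    metadata:
      labels:
        app: simulator
    spec:
      # Only schedule onto nodes labelled for the dedicated pool...
      nodeSelector:
        pool: analytics
      # ...and tolerate the taint that keeps everyone else off those nodes.
      tolerations:
        - key: dedicated
          operator: Equal
          value: analytics
          effect: NoSchedule
      containers:
        - name: simulator
          image: example.org/simulator:latest
```

Same isolation idea, one control plane, no extra cluster to patch and monitor.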
You could do a dedicated node pool and limit the egress to those nodes, but it seems simpler, as in someone is less likely to provision something incorrectly, by having a separate cluster.
In my experience companies do not trust Kubernetes to the same extent as they'd trust VLANs and VMs. That's probably not entirely fair, but as you can see from many of the other comments, people find managing Kubernetes extremely difficult to get right.
For some special cases you also have regulatory requirements that maybe could be fulfilled by some Kubernetes combination of node pools, namespacing and so on, but it's not really worth the risk.
From dealing with clients wanting hosted Kubernetes, I can only say that 100% of them have been running multiple clusters. Sometimes for good reason, other times because hosting costs were per project and it's just easier to price out a cluster, compared to buying X% of the capacity on an existing cluster.
One customer I've worked with even ran an entire cluster for a single container, but that was done because no one told the developers to not use that service as an excuse to play with Kubernetes. That was its own kind of stupid.
Indeed. My previous company did this due to regulatory concerns.
One cluster per country in prod, one cluster per team in staging, plus individual clusters for some important customers.
A DevOps engineer famously pointed out that it was stupid since they could access everything with the same SSO user anyway, so the CISO demanded individual accounts with separate passwords and separate SSO keys.
What you just described with one bad actor bringing the entire cluster down is yet another really good reason I’ll never put any serious app on that platform.
> Out in the real world I've frequently seen companies build a cluster per service, or group of services, to better control load and scaling and again to control network access.
Network Policies have solved that at least for ingress traffic.
Egress traffic is another beast, you can't allow egress traffic to a service, only to pods or IP ranges.
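A rough sketch of that distinction, with all names and CIDRs invented: ingress can be pinned to specific pods, while egress can only point at pod/namespace selectors or IP ranges:

```yaml
# Hedged illustration only: restrict who may call the payments pods, and where
# those pods may send traffic. Egress has no notion of a "service" target.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: payments-isolation
  namespace: payments
spec:
  podSelector:
    matchLabels:
      app: payments-api
  policyTypes: [Ingress, Egress]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend      # only frontend pods may call in
  egress:
    - to:
        - ipBlock:
            cidr: 10.20.0.0/16   # outbound only to this (made-up) range
      ports:
        - protocol: TCP
          port: 5432
```

With a policy like this you'd also have to allow DNS egress explicitly, which is one more reason the egress side gets hairy in practice.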
Maybe they had 47 different Kubernetes consultants coming in sequentially and each one found something to do different from the last one, but none of them got any time to update their predecessor's stuff.
There are genuine reasons for running multiple clusters. It helps to sometimes keep stateful (databases generally) workloads on one cluster, have another for stateless workloads etc. Sometimes customers demand complete isolation so they get their own cluster (although somehow it's ok that the nodes are still VMs that are probably running on shared nodes… these requirements can be arbitrary sometimes).
> It helps to sometimes keep stateful (databases generally) workloads on one cluster, have another for stateless workloads etc. Sometimes customers demand complete isolation
Are these not covered by taint/toleration? I guess maybe isolation depending on what exactly they're demanding but even then I'd think it could work.
I'll take this opportunity to once again bitch and moan that Kubernetes just fucking refuses to allow the KV store to be pluggable, unlike damn near everything else in their world, because they think that's funny or something
I'm wondering the same. Either they are quite a big company, so such infrastructure comes naturally from many products/teams or their use case is to be in the clusters business (provisioning and managing k8s clusters for other companies). In both cases I'd say there should be a dedicated devops team that knows their way around k8s.
Other than that, the experience I have is that using a managed solution like EKS and one cluster per env (dev, staging, prod) with namespaces to isolate projects takes you a long way. Having used k8s for years now, I'm probably biased, but in general I disagree with many of the k8s-related posts that are frequently upvoted on the front page. I find it gives me freedom, I can iterate fast on services, change structure easily without worrying too much about isolation, networking and resources. In general I feel more nimble than I used to before k8s.
Don’t know about the writer of the article, but there are some legit reasons to use multiple K8s clusters. Single-tenant environments, segregation of resources into different cloud accounts, redundancy (although there are probably lots of ways to do most of this within a single cluster), 46 different developer / QA / CI clusters, etc.
Yeah, it's not very smart. I'm at a company with a $50B+ MC and we run prod on one cluster, staging on another one, then there are tooling clusters like dev spaces, ML, etc. I think in total we have 6 or 7 for ~1000 devs and thousands of nodes.
It makes sense that getting off k8s helped them if they were using it incorrectly.
I had a client who had a single K8s cluster that was too much ops for the team, so their idea was to transfer it to each product dev team, and thus was born the one-cluster-per-product setup. They had at least a few hundred products.
To answer your question directly: yes, that's the point. You may have different clusters for different logical purposes but, yes: fewer clusters, more node groups is the better practice.
Isn't one of the strategies also to run one or two backup clusters for any production cluster? Which can take over the workloads if the primary cluster fails for some reason?
In a cloud environment the backup cluster can be scaled up quickly if it has to take over, so while it's idling it only requires a few smaller nodes.
You might run a cluster per region, but the whole point of Kubernetes is that it's highly available. What specific piece are you worried will go down in one cluster such that you need two production clusters all the time? Upgrades are a special case where I could see spinning up a backup cluster.
A lot of things can break (hardware, networking, ...). Spanning the workload over multiple clusters in different regions is already satisfying the "backup cluster" recommendation.
Many workloads don't need to be multi-region as a requirement. So they might run just on one cluster with the option to fail over to another region in case of an emergency. Running a workload on one cluster at a time (even with some downtime for a manual failover) makes a lot of things much easier. Many workloads don't need 99.99% availability, and nothing awful happens if they are down for a few hours.
We ran only two (very small) clusters for some time in the past and even then it introduced some unnecessary overhead on the ops side and some headaches on the dev side. Maybe they were just growing pains, but if I have to run Kubernetes again I will definitely opt for a single large cluster.
After all Kubernetes provides all the primitives you need to enforce separation. You wouldn't create separate VMWare production and test clusters either unless you have a good reason.
You need a separate cluster for production because there are operations you'd do in your staging/QA environments that might accidentally knock out your cluster. I did that once, and it was not fun.
I completely agree with keeping everything as simple as possible though. No extra clusters if not absolutely necessary, and also no extra namespaces if not absolutely necessary.
The thing with Kubernetes is that it was designed to support every complex situation imaginable. All these features make you feel as though you should make use of them, but you shouldn't. This complexity leaked into systems like Helm, which is why, in my opinion, it's better to roll your own deployment scripts than to use Helm.
Do you mind sharing what these operations were? I can think of a few things that may very well brick your control plane. But at the very least existing workloads continue to function in this case as far as I know. Same with e.g. misconfigured network policies. Those might cause downtimes, but at least you can roll them back easily. This was some time ago though. There may be more footguns now. Curious to know how you bricked your cluster, if you don't mind.
I agree that k8s offers many features that most users probably don't need and may not even know of. I found that I liked k8s best when we used only a few, stable features (only daemonsets and deployments for workloads, no statefulsets) and simple helm charts. Although we could have probably ditched helm altogether.
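As an illustration of that "few stable features" style, here's a hand-written DaemonSet for a hypothetical node-level log agent, no templating involved (image and names are made up):

```yaml
# Hedged sketch: one plain manifest per workload, applied directly with
# kubectl, instead of a templated Helm chart.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: log-agent
  namespace: ops
spec:
  selector:
    matchLabels:
      app: log-agent
  template:
    metadata:
      labels:
        app: log-agent
    spec:
      containers:
        - name: agent
          image: example.org/log-agent:1.0   # hypothetical image
          resources:
            requests:
              cpu: 50m
              memory: 64Mi
            limits:
              memory: 128Mi
```

Boring, readable, and diffable in a plain git repo, which is most of what a small team needs.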
You can’t roll back an AWS EKS control plane version upgrade. “Measure twice, cut once” kinda thing.
And operators/helm charts/CRDs use APIs which can be and are deprecated, which can cause outages. It pays to make sure your infrastructure is automated with GitOps, CICD, and thorough testing so you can identify the potential hurdles before your cluster upgrade causes unplanned service downtime.
It is a huge effort just to “run in place” with the current EKS LTS versions if your company has lots of 3rd party tooling (like K8s operators) installed and there isn’t sufficient CICD+testing to validate potential upgrades as soon after they are released.
3rd party tooling is frequently run by open source teams, so they don't always have the resources or desire/alignment to stay compatible with the newest version of K8s. Also, when a project goes idle/disbands/fractures into rival projects, that can cost infra/ops teams time to evaluate the replacement/substitute projects which are going to be a better solution going forward. We recently ran into this with the operator we had originally installed to run Cassandra.
In my case, it was the ingress running out of subdomains because each staging environment would get its own subdomain, and our system had a bug that caused them to not be cleaned up. So the CI/CD was leaking subdomains, eventually the list became too long and it bumped the production domain off the list.
In theory: absolutely. This is just anecdata and you are welcome to challenge me on it, but I have never had a problem upgrading Kubernetes itself. As long as you trail one version behind the latest to ensure critical bugs are fixed before you risk running into them yourself, I think you are good.
Edit: To expand on it a little bit. I think there is always a real, theoretical risk that must be taken into account when you design your infrastructure. But when experience tells you that accounting for this potential risk may not be worth it in practice, you might get away with discarding it and keeping your infra lean. (Yes, I am starting to sweat just writing this).
"I am cutting this corner because I absolutely cannot make a business case I believe in for doing it the hard (but more correct) way but believe me I am still going to be low key paranoid about it indefinitely" is an experience that I think a lot of us can relate to.
I've actually asked for a task to be reassigned to somebody else before now on the grounds that I knew it deserved to be done the simple way but could not for the life of me bring myself to implement that.
(the trick is to find a colleague with a task you *can* do that they hate more and arrange a mutually beneficial swap)
Actually I think the trick is to change one's own perspective on these things. Regardless of how many redundancies and how many 9's of availability your system theoretically achieves, there is always stuff that can go wrong for a variety of reasons. If things go wrong, I am faster at fixing a not-so-complex system than the more complex system that should, in theory, be more robust.
Also I have yet to experience that an outage of any kind had any negative consequences for me personally. As long as you stand by the decisions you made in the past and show a path forward, people (even the higher-ups) are going to respect that.
Anticipating every possible issue that might or might not occur during the lifetime of an application just leads to over-engineering.
I think rationalizing it a little bit may also help with the paranoia.
At my last job we had a Kubernetes upgrade go so wrong we ended up having to blow away the cluster and redeploy everything. Even a restore of the etcd backup didn't work. I couldn't tell you exactly what went wrong, as I wasn't the one that did the upgrade, and I wasn't around for the RCA on this one. As the fallout was the straw that broke the camel's back, I ended up quitting to take a sabbatical.
Why would those brick everything? You update nodes one by one and take it slow, so issues become apparent after each upgrade and you have time to solve them - that's the whole point of having clusters comprised of many redundant nodes.
I think it depends on the definition of "bricking the cluster". When you start to upgrade your control plane, your control plane pods restart one after another, and not only those on the specific control plane node. So at this point your control plane might not respond anymore if you happen to run into a bug or some other issue. You might call it "bricking the cluster", since it is not possible to interact with the control plane for some time. Personally I would not call it "bricked", since your production workloads on worker nodes continue to function.
Edit: And even when you "brick" it and cannot roll back, there is still a way to bring your control plane back by using an etcd backup, right?
Not sure if this has changed, but there have been companies admitting to simply nuking Kubernetes clusters if they fail, because it does happen. The argument, which I completely believe, is that it's faster to build a brand new cluster than to debug a failed one.
I work for a large corp and we have three for apps (dev, integrated testing, prod) plus I think two or three more for the platform team that I don't interact with. 47 seems horrendously excessive
If you have 200 YAML files for a single service and 46 clusters I think you're using k8s wrong. And 5 + 3 different monitoring and logging tools could be a symptom of chaos in the organization.
k8s, the Go runtime, and the network stack have been heavily optimized by armies of engineers at Google and big tech, so I am very suspicious of these claims without evidence. Show me the resource usage from k8s component overhead, and the 15-minute to 3-minute deploys, and then I'll believe you. And the 200 YAML files or Helm charts so I can understand why in God's name you're doing it that way.
This post just needs a lot more details. What are the typical services/workloads running on k8s? What's the end user application?
I taught myself k8s in the first month of my first job, and it felt like having super powers. The core concepts are very beautiful, like processes on Linux or JSON APIs over HTTP. And it's not too hard to build a CustomResourceDefinition or dive into the various high performance disk and network IO components if you need to.
I feel you. I learned K8s with an employer where some well intentioned but misguided back end developers decided that their YAML deployments should ALL be templated and moved into Helm charts. It was bittersweet to say the least, learning all the power of K8s but having to argue and feel like an alien for saying that templating everything was definitely not going to make everything easier in the long term.
Then again they had like 6 developers and 30 services deployed and wanted to go "micro front end" on top of it. So they clearly had misunderstood the whole thing. CTO had a whole spiel on how "microservices" were a silver bullet and all.
I didn't last long there, but they paid me to learn some amazing stuff. In retrospect, they also taught me a bunch of lessons on how not to do things.
How to save $1M off your cloud infra? Start from a $2M bill.
That's how I see most of these projects. You create a massively expensive infra because webscale, then 3 years down the road you (or someone else) gets to rebuild it 10x cheaper. You get to write two blog posts, one for using $tech and one for migrating off $tech. A line in the cv and a promotion.
But kudos to them for managing to stop the snowball and actually revert course. Most places wouldn't dare because of sunk costs.
I don't think that's necessarily a problem. When starting a new product, time to market as well as identifying user needs feature-wise is way more important than being able to scale "infinitely".
It makes sense to use whatever $tech helps you get an MVP out asap and iterate on it. Once you're sure you've found gold, then it makes sense to optimize for scale. The only thing I guess one has to worry about when developing something like that is to make sure good scalability is possible with some tinkering and effort, and not totally impossible.
I agree with you. I'm not advocating for hyper cost optimization at an early stage startup. You probably don't need k8s to get your mvp out of the door either.
The article says they spend around $150k/year on infra. Given they have 8 DevOps engineers I assume a team of 50+ engineers. Assuming $100k/engineer that's $5 million/year in salary. That's all low end estimates.
They saved $100k in the move or 2% of their engineering costs. And they're still very much on the cloud.
If you tell most organizations that they need to change everything to save 2% they'll tell you to go away. This place benefited because their previous system was badly designed and not because it's cloud.
I'm not making an argument against the cloud here. Not saying you should move out.
The reason why I call out cloud infrastructure specifically is because of how easy it is to let the costs get away from you in the cloud. This is a common thread in every company that uses the cloud. There is a huge amount of waste. And this company's story isn't different.
By the way, 8 DevOps engineers with $150k/year cloud bill deserves to be highlighted here. This is a very high amount of staff dedicated to a relatively small infrastructure setup in an industry that keeps saying "cloud will manage that for you."
To expand on this, I run BareMetalSavings.com[0], and the most common reason people stay with the cloud is that it's very hard for them to maintain their own K8S cluster(s), which they want to keep because they're great for any non-ops developer.
So those savings are possible only if your devs are willing to leave the lock-in of comfort.
This is not that surprising. First, it depends on how big the YAML files were and what was in them. If you have 200 services, I could easily see 200 YAML files. Second, there are non-service reasons to have YAML files. You might have custom roles, ingresses, volumes, etc. If you do not use something like Helm, you might also have 1 YAML file per environment (not the best idea but it happens).
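As an aside, a common alternative to a copied YAML file per environment is a kustomize base plus thin overlays. A rough sketch, with hypothetical file paths and patch names:

```yaml
# base/kustomization.yaml -- the shared manifests
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
---
# overlays/prod/kustomization.yaml -- production-only tweaks layered on top
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: replicas.yaml   # e.g. a small patch bumping spec.replicas for prod
```

Then `kubectl apply -k overlays/prod` renders the production variant without duplicating the base manifests per environment.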
My suspicion is the original environment (47 Kubernetes clusters, 200 YAML files, unreliable deployments, using 3 clouds, etc.) was not planned out. You probably had multiple teams provisioning infrastructure, half completed projects, and even dead clusters (clusters which were used once but were not destroyed when they were no longer used).
I give the DevOps team in the article a lot of credit for increasing reliability, reducing costs, and increasing efficiency. They did good work.
I've seen much-lauded "Devops" or "platform" teams spend two months writing 500+ files for 3 simple python services, 5 if you include two databases.
We could have spent a tiny fraction of that 10-dev-months to deploy something to production on a bare VM on any cloud platform in a secure and probably very-scalable way.
These days I cringe and shudder every time I hear someone mention writing "helm charts" or using the word "workloads".
Every guy that joins or starts a new project, instead of reading and getting familiar with what is available, does his own stuff.
I see this happening all the time, and unless you have DevOps or SysAdmins who are fine acting like 'assholes' enforcing rules, it is going to be like that.
Of course 'assholes' is in quotes because they have to be firm and deny a lot of crap to keep the setup clean - but then they will also be assholes to some that "just want to do stuff".
“We want to use one standard Helm chart for all applications but then we need it to support all possible variations and use cases across the whole company”
Like most tech stories this had pretty much nothing to do with the tool itself but with the people/organization. The entire article can be summarized with this one quote:
> In short, organizational decisions and an overly cautious approach to resource isolation led to an unsustainable number of clusters.
And while I empathize with how they could end up in this situation, it feels like a lot of words were spent blaming the tool choice rather than serving as a cautionary tale about, for example, planning and communication.
In my experience organizations that end up this way have a very much non blame-free culture. Can be driven by founders that lack technical skills and management experience but have a type-A personality. As a result no one wants to point out a bad decision because the person who made it will get reprimanded heavily. So they go down a path that is clearly wrong until they find a way to blame something external to reset. Usually that's someone who recently left the company or some tool choice.
The article reads to me as pretty explicitly saying that the only real takeaway wrt k8s itself is "it was the wrong choice for us and then we compounded that wrong choice by making more wrong choices in how we implemented it."
Maybe I'm reading it with rose coloured glasses - but I feel like the only thing kubernetes "did wrong" is allowing them to host multiple control planes. Yes, you need 3+ CP instances for HA, but the expectation is you'd have 3 CP instances for X (say 10) workers for Y (say 100) apps. Their implied ratio was insane in comparison.
Since you can't run the Fargate control plane yourself, that indirectly solved the problem for them.
So they made bad architecture decisions, blamed it on Kubernetes for some reason, and then decided to rebuild everything from scratch. Solid. The takeaway being what? Don't make bad decisions?
I've always been fond of blaming myself and asking everybody else to help make sure I don't cock it up a second time - when it works out I get lots of help, lots of useful feedback, and everybody else feels good about putting the effort in.
This does require management who won't punish you for recording it as your fault, though. I've been fairly lucky in that regard.
I think the takeaway was Kubernetes did not work for their team. Kubernetes was probably not the root problem but it sounds like they simplified their infrastructure greatly by standardizing on a small set of technologies.
Kubernetes is not an easy to use technology and sometimes its documentation is lacking. My gut feeling is Kubernetes is great if you have team members who are willing to learn how to use it, and you have a LOT of different containers to run. It probably is not the best solution for small to medium sized teams because of its complexity and cost.
It highlights a classic management failure that I see again and again and again: Executing a project without identifying the prerequisite domain expertise and ensuring you have the right people.
Are the people who decided to spin up a separate kubernetes cluster for each microservice still employed at your organization? If so, I don't have high hopes for your new solution either.
Too bad the author and company are anonymous. I'd like to confirm my assumption that the author has zero business using k8s at all.
Infrastructure is a lost art. Nobody knows what they're doing. We've entered an evolutionary spandrel where "more tools = better", meaning the candidate for an IT role who swears by 10 k8s tools always beats the one who can fix your infra but would also remove k8s because it's not helping you at all.
47 clusters? Is that per developer? You could manage small, disposable VPS for every developer/environment, etc and only have Kubernetes cluster for a production environment...
At my last gig we were in the process of sunsetting 200+ clusters. We allowed app teams to request and provision their own cluster. That 3 year experiment ended with a migration down to 24ish clusters (~2 clusters per datacenter)
It's the same story over and over again. Nobody gets fired for choosing AWS or Azure. Clueless managers and resume driven developers, a terrible combination.
The good thing is that this leaves a lot of room for improvement for small companies, who can outcompete larger ones by just not making those dumb choices.
but does improving really help these small companies in the ways that matter, if the cost of the infrastructure apparently isn't that important to the needs of the business?
https:///item?id=28838053 is where I learned about that, with the top comment showing a bazillion bookmarklet fixes. I'd bet dollars to donuts someone has made a scribe.rip extension for your browser, too
What else is out there? I'm running docker swarm and it's extremely hard to make it work with ipv6. I'm running my software on a 1GB RAM cloud instance and I pay 4EUR/month, and k8s requires at least 1GB of RAM.
As of now, it seems like my only alternative is to run k8s on a 2GB of RAM system, so I'm considering moving to Hetzner just to run k3s or k0s.
I have Traefik as part of the docker-compose file. Installing nginx on the host seems less reproducible, though it could fix my problem. I guess I would choose something like Caddy (I'm not that happy with Traefik).
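For what it's worth, a minimal compose sketch with Caddy in front. Service names, domain, ports, and the image are invented, and the Caddyfile is assumed to contain something like `app.example.com { reverse_proxy app:8080 }`:

```yaml
# Hedged sketch: Caddy as the reverse proxy inside the same compose file,
# so the setup stays reproducible and nothing lives only on the host.
services:
  app:
    image: example.org/myapp:latest   # hypothetical application image
    expose:
      - "8080"
  caddy:
    image: caddy:2
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile:ro
      - caddy_data:/data               # persists automatically issued certs
volumes:
  caddy_data: {}
```

Caddy handling certificates on its own is a big part of why it's pleasant on a tiny single-node box.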
Is there a non-paid version of this? The title is a little clickbait, but reading the comments here, it seems like this is a story of a team that jumped on the k8s bandwagon, made a lot of terrible decisions along the way, and is now blaming k8s for everything.
I've read this article now multiple times and I'm still not sure if this is just good satire or if it's real and they can burn money like crazy or some subtle ad for aws managed cloud services :)
Those kind of articles often read like an ad for managed cloud services. "We got rid of that complicated, complicated Kubernetes beast by cobbling together 20 bespoke managed services from provider X which is somehow so much easier".
Kubernetes is not a one size fits all solution but even the bullet points in the article raise a number of questions. I have been working with Kubernetes since 2016 and keep being pragmatic on tech stuff. Currently support 20+ clusters with a team of 5 people across 2 clouds plus on-prem. If Kubernetes is fine for this company/project/business use case/architecture we'll use it. Otherwise we'll consider whatever fits best for the specific target requirements.
Smelly points from the article:
- "147 false positive alerts" - alert and monitoring hygiene helps. Anything will have a low signal-to-noise ratio if not properly taken care of. Been there, done that.
- "$25,000/month just for control planes / 47 clusters across 3 cloud providers" - multiple questions here. Why so many clusters? Were they provider-managed(EKS, GKE, AKS, etc.) or self-managed? $500 per control plane per month is too much. Cost breakdown would be great.
- "23 emergency deployments / 4 major outages" - what was the nature of the emergencies and outages? Post-mortem/RCA summary? Lessons learnt?
- "40% of our nodes running Kubernetes components" - a potential indicator of a huge number of small worker nodes. Was the cluster autoscaler used? The descheduler? What were those components?
- "3x redundancy for high availability" - depends on your SLO, risk appetite and budget. It is fine to have 2x with 3 redundancy zones and stay lean on resource and budget usage, and it is not mandatory for *everything* to be highly available 24/7/365 (a sketch follows this list).
- "30% increase in on-call incidents" - Postmortems, RCA, lessons learnt? on-call incidents do not increase just because of the specific tool or technology being used.
- "200+ YAML files for basic deployments" - There are multiple ways to organise and optimise configuration management. How was it done in the first place?
- "5 different monitoring tools / 3 separate logging solutions" - should be at most one for each case. 3 different cloud providers? So come up with a cloud-agnostic solution.
- "Constant version compatibility issues" - this happens when due diligence is not properly done. Also, the Kubernetes API is fairly stable (existing APIs preserve backwards compatibility) and predictable in terms of deprecating existing APIs.
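To illustrate the "2x with 3 redundancy zones" point above, a sketch (names and image invented) of two replicas that the scheduler spreads across availability zones rather than tripling everything:

```yaml
# Hedged illustration: rely on zone spreading instead of blanket 3x redundancy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone   # standard well-known label
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: api
      containers:
        - name: api
          image: example.org/api:latest
```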
That being said, glad to know the team has benefited from ditching Kubernetes. Just keep in mind that this "you don't need ${TECHNOLOGY_NAME} and here is why" is oftentimes an emotional generalisation of someone's particular experience and cannot be applied as the universal rule.
The price comparison doesn't make sense if they used to have a multi-cloud system and now it's just AWS. Makes me fear this is just content paid for by AWS. Actually getting multi-cloud to work is a huge achievement, and I would be super interested to hear of another tech standard that would make that easier.
Basically they went AWS-native, with ECS being the biggest part of that. I'm currently trying to move our own stack into a simpler architecture, but I can wholeheartedly recommend ECS as a Kubernetes alternative, giving you 80% of the functionality for 20% of the effort.
I feel like the Medium paywall saved me... as soon as I saw "47 clusters across 3 different cloud providers", I began to think that the tool used here might not actually be the real issue.
I looked into K8s some years back and found so many new concepts that I thought: is our team big enough for so much "new"?
Then I read someone saying that K8s should never be used by teams <20 FTE and will require 3 people to learn it for redundancy (if it's used to self-host a SaaS product). This seemed like really good advice.
Our team is smaller than 20FTE, so we use AWS/Fargate now. Works like a charm.
When they pulled apart all those kubernetes clusters they probably found a single fat computer would run their entire workload.
“Hey, look under all that DevOps cloud infrastructure! There’s some business logic! It’s been squashed flat by the weight of all the containers and orchestration and serverless functions and observability and IAM.”
I don't think this is a fair comment. This happens sometimes but usually people go with Kubernetes because they really do need its power and scalability. Also, any technology can be misused and abused. It's not the technology's fault. It's the people who misuse the technology's fault (*).
(*) Even here, a lot of it comes from ignorance and human nature. 99.9% of people do not set out to fail. They usually do the best they can and sometimes they make mistakes.
Given that seemingly half the devs and orgs are averse to writing performant software or optimising anything, I somewhat doubt that's going to happen anytime soon. As much as I'd like it to.
Performance isn’t the only reason to use container orchestration tools: they’re handy for just application lifecycle management.
How did your managers ever _ever_ sign off on something that cost an extra $0.5M?
Either you're pre-profit or some other bogus entity, or your company streamlined by moving to k8s and then streamlined further by cutting away things you don't need.
I'm frankly just alarmed at the thought of wasting that much revenue, I could bring up a fleet of in house racks for that money!
Oh boy. Please, please stop using Medium for anything. I have lost count of how many potentially interesting or informative articles are published behind the Medium sign-in wall. At least for me, if you aren't publishing blog articles in public, then what's the point of me trying to read them?
This is why I always sign up for something the first time with fake data and a throw-away email address (I have a catch-all sub-domain, that is no longer off my main domain, for that). If it turns out the site is something I might care to return to I might then sign up with real details, or edit the initial account; if not, then the email address given to them gets forwarded to /dev/null for when the inevitable spam starts arriving. I'm thinking of seeing if the wildcarding/catchall can be applied at the subdomain level, so I can fully null-route dead addresses in DNS and not even have connection attempts related to the spam.
Don't bother reading. This is just another garbage in garbage out kind of article written by something that ends in gpt. Information density approaches zero in this one.
Of course they are. The original value proposition of cloud providers managing your infra (and moreso with k8s) was that you could fire your ops team (now called "DevOps" because the whole idea didn't pan out) and the developers could manage their services directly.
In any case, your DevOps team has job security now.
I think the value proposition holds when you are just getting started with your company and you happen to employ people that know their way around the hyperscaler cloud ecosystems.
But I agree that moving your own infra or outsourcing operations when you have managed to do it on your own for a while is most likely misguided. Speaking from experience, it introduces costs that cannot possibly be calculated before the fact, and things thus always end up more complicated and costlier than the suits imagined.
In the past, when similar decisions were made, I always thought to myself: you could have just hired one more person bringing their own fresh perspective on what we are doing in order to improve our ops game.
Oh, I've seen this before and it's true in an anecdotal sense for me. One reason why is that they always think of hiring an additional developer as a cost, never savings.