Common mistakes using Kubernetes (pipetail.io)
I wish there was a way to upvote something 10x once a month here. This would be the post I use that on.

When I was writing my book my editor asked me to remove any writing about mistakes and changes I made in the project for each chapter. I had a bug that appeared and I wanted to write about how I determined that and fixed it. They said the reader wants to see an expert talking, as if experts never make mistakes or need to shift from one tact to another.

But, I find I learn the most from explanations that share how your mental model was wrong initially and how you figured it out and how you did it "more right" the next time.

That's really how people build things.

>They said the reader wants to see an expert talking, as if experts never make mistakes or need to shift from one tact to another.

Your editor was very fucking wrong.

So, so wrong. How did (s)he think experts get so good? Isn't the phrase `the master has failed more times than the apprentice has even tried` well known for a reason?

> the master has failed more times than the apprentice has even tried

I've never actually heard that one before, but it's so very, very true.

It obviously depends on the type of the book and the reader's expectations. It just might not have been the type of book where you write about things like that.

>It just might not have been the type of book where you write about things like that.

I think that'd be more of the author's choice, instead of the editor.

Agreed. The entire expertise of programming is based on the willingness to find these sorts of issues and fix them. Over and over, for your entire career.

Obviously, I agree with you!

I also think this is the way the majority of tech books are written.

Can you think of another where the author goes from mistake to mistake and then finally gets it right?

I believe there is a space in tech writing for this kind of writing, but it is not something traditional book publishers believe.

This was an O'Reilly book by the way, with really good editors and a really good editorial process.

That editor was right most of the time, IMHO.

>Can you think of another where the author goes from mistake to mistake and then finally gets it right?

Not a book, but Raymond Hettinger often presents in this way and it's fantastic: https://www.youtube.com/watch?v=wf-BqAjZb8M

Most (in-depth, technical) security books follow this sort of thought process, I think. “A Bug Hunter's Diary” comes to mind.

> Your editor was very fucking wrong.

The editor is completely right in what they were saying. You just want them to be wrong, because you'd prefer to live in the fantasy world where they are wrong.

Let's say you go to get a surgery. You don't want the doctor to tell you about all the times they fucked up and what the awful consequences were. It doesn't matter that they're probably a better surgeon now, having learned from their mistakes. Psychologically, you need that person with the sharp tool poking around inside your body to be a superhuman.

To a lesser degree, the same is true for any expert. Of course everybody makes mistakes. Notice the de-personalization in the word "everybody". You can talk about the mistakes everybody makes, or those ones that many people make. If you talk about your own mistakes however, you lose the superhuman status. There may be a few situations where that somehow helps you, but not when you want to sell books.

I sure do want that surgeon to have been presented/instructed on the common ways that surgeons before them have made mistakes and how to avoid or overcome them.

That they won’t tell me (the patient) is quite a different question from whether they got the material from someone more experienced in their primary or continuing medical education.

> Let's say you go to get a surgery.

Shouldn't that be more like "Let's say you're learning to be a surgeon."?

For that situation, the person they're learning from discussing problems they hit and how they solved them does sound like it would be very useful.

I wonder if we're on the same page. Should a book on a programming language discuss compilation errors and how to interpret them? Should a book like Effective C++ (afaik the most popular series of books on C++) exist?

Found the editor!

I hate this so much. But it's so common.

I was told not to mention the caveats, instead, render a confident image for the team many times in my career.

It's like doctors suggest their patients to use some drugs without mentioning any side effects.

It reminds me of the video[0] which asks the developer to draw red lines with blue ink, while the project manager keeps pushing the developer like "You're the expert, of course you can draw red lines with blue ink!".

[0] https://www.youtube.com/watch?v=BKorP55Aqvg


Tact means skill in dealing with other people, particularly in sensitive situations.

Tack has to do with sailing boats into the wind, and crucially "changing tack" means changing direction.

Thanks, I'm glad I had you as an editor this time! I never noticed that before.

years (and years) ago, if I wanted to learn a new computer technology or language, I would pick up a book and learn it.

One language that didn't go how I expected it was applescript.

The book on applescript (from o'reilly I believe) was necessarily different.

Instead of writing "normal" top-down programs, applescript hooks into the OS from the side.

Many books can concentrate on what you can do, but the applescript book had to do everything by example. This is because all of apple's applications expose interfaces that you have to figure out on the fly.

I had to learn a different way from that book, because it necessarily concentrated on "how" instead of "what".

personally... I'd like to see that sort of info, but typically not in the middle of a chapter/section. make a note or call out in the text pointing to a section on why/how you got to the 'correct' position. That info is often helpful info, but can disrupt the flow of the 'good' information.

In my opinion, the most common mistake is not in the article: using kubernetes when you don't need to.

Kubernetes has a lot of pros on paper, but in practice it's not worth it for most small and medium companies.

Most of the time you don't even need a cluster of any kind. We live in a world where you can spin up a server with 3TB of RAM and 128 cores whenever you want, and it will cost you less than a senior developer.

This was my thought exactly. The article is great assuming you need to use k8s, but does leave out the important question: does your project or product require k8s and all the overhead it unavoidably entails?

Amazon's Elastic Container Service (ECS) on Fargate deployment type is probably a better option much of the time. Until you maintain your own k8s cluster (including the hosted variants on AWS, GCP, etc.) you might not realize how complex configuring k8s is.

While AWS's ECS service may be more limited, I've found it leaves less room to do the wrong thing. Unfortunately, the documentation on ECS and the community support are inferior to k8s, but I'll accept that if I don't have to spend a whole day researching what service mesh to use with k8s and how to configure a load balancer and SSL certs.

Are there good resources for making that decision according to good criteria?

A simple rule of thumb is the number of services you have. For instance, my current employer has been working on getting everything on Kubernetes for the last year or so and we have two services... the frontend server-side renderer, and API.

Do you also include managed kubernetes offerings, such as from Digital Ocean, in that assessment?

Honestly my experience has been that managed k8s is often more complicated from a developer perspective than just k8s - sure, you don't have to deal with setting it up, but you have to figure out how all the 'management' and provider-specific features work, and they often seem pretty clumsily integrated.

And in many cases features that would help your use case aren't enabled on that platform, or are in a release that's still a year or two from being supported on that platform... I'm looking at you, EKS.

Oh yes. Managed kubernetes is full of various issues. Some major cloud providers sell very poor managed kubernetes.

If someone knows about a reliable managed kubernetes, please let me know.

Can you elaborate on the cloud providers selling poorly managed k8s - are they all problematic? I have no experience with cloud provided k8s.

Not the OP but yes if cost is a factor. As far as I know no managed K8S offerings are cheap.

- DigitalOcean offers free clusters.

- Azure (still?) offers free clusters.

- GCP offers one free single-zone cluster.

If you need more than DO can offer in terms of compute instances, you can probably afford GKE/EKS, which is around $75/month.


Running highly-available control plane (K8s masters & etcd) by yourself is NOT cheaper than using EKS.

To achieve high availability, EKS runs 3 masters and 3 etcd instances, in different availability zones. Provisioning 3 t3.medium instances (4 GB of memory and 2 CPUs) would cost the same as a completely managed EKS.

Not to mention the manual work you need to set up, maintain and upgrade such instances.

Great post! If you're in the Kubernetes space for long enough, you'll see all of these configuration mistakes happening over and over again.

I've created a static code analyzer for Kubernetes objects, called kube-score, that can identify and prevent many of these issues. It checks for resource limits, probes, podAntiAffinities and much more.

1: https://github.com/zegl/kube-score

I actually disagree with the first recommendation as written - specifically, not to set a CPU resource request to a small amount. It's not always as harmful as it might sound to the novice.

It's important to understand that CPU resource requests are used for scheduling and not for limiting. As the author suggests, this can be an issue when there is CPU contention, but on the other hand, it might not be. That's because memory limits are even more important than CPU requests when scheduling: most applications use far more memory as a proportion of overall host resources than CPU.

Let's take an example. Suppose we have a 64GB worker node with 8 CPUs in it. Now suppose we have a number of pods to schedule on it, each with a memory limit of 2GB and a CPU request of 1 millicore (0.001CPU). On this node, we will be able to accommodate 32 such pods.

Now suppose one of the pods gets busy. This pod can have all the idle CPU it wants! That's because it's a request and not a limit.

Now suppose all of the pods become fully CPU contended. The way the Linux scheduler works is that it will use the CPU request as a relative weight with respect to the other processes in the parent cgroup. It doesn't matter that they're small as an absolute value; what matters is their relative proportion. So if they're all 1 millicore, they will all get equal time. In this example, we have 32 pods and 8 CPUs, so under full contention, each will get 0.25 CPU shares.
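The arithmetic in this example can be sketched in a few lines of Python. This is only a simplified model: it assumes every pod is CPU-bound at the same moment and that CPU time is split strictly in proportion to each pod's CPU request. The function names are made up for illustration, and real cgroup behavior has details (such as minimum weights) that are ignored here.

```python
def pods_that_fit(node_memory_gb, pod_memory_gb):
    """How many pods fit on a node when memory is the binding constraint."""
    return int(node_memory_gb // pod_memory_gb)


def contended_cpu_per_pod(node_cpus, requests_millicores):
    """Split node CPUs in proportion to each pod's CPU request.

    Simplified model: assumes all pods are fully CPU-bound at the same time
    and that time is divided strictly by relative request weight.
    """
    total = sum(requests_millicores)
    return [node_cpus * r / total for r in requests_millicores]


# 64GB node, 2GB per pod -> 32 pods
count = pods_that_fit(64, 2)
# 32 pods, each requesting 1 millicore, sharing 8 CPUs
shares = contended_cpu_per_pod(8, [1] * count)
print(count, shares[0])  # 32 0.25
```

Passing a mixed list such as `[4000, 1]` to `contended_cpu_per_pod` shows how, under the same assumptions, a pod with a tiny request fares next to one with a large request.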

So when I talk to customers about resource planning, I actually usually recommend that they start with low CPU reservation, and optimize for memory consumption until their workloads dictate otherwise. It does happen that particularly greedy pods are out there, but that's not the typical case - and for those that are, they will often allocate all of a worker's CPUs in which case you might as well dedicate nodes to them and forget about how to micromanage the situation.

If you ask for 0.001 CPU share, you might get it. I would advise caution. If that pod gets scheduled on a node with another pod that asks for 4 CPUs and 100MB of memory, it's not going to get any time.

It depends. If the second pod requests 4 CPUs, it doesn't necessarily mean that the first pod can't use all the CPUs in the uncontended case.

A lot of this depends on policy and cooperation, which is true for any multitenant system. If the policy is that nobody requests CPU, then the behavior will be like an ordinary shared Linux server under load - the scheduler will manage it as fairly as possible. OTOH, if there are pods that are greedy and pods that are parsimonious in terms of their requests, the greedy pods will get the lion's share of the resources if they need them.

The flip side of overallocating CPU requests is cost. This value is subtracted from the available resources, making the node unavailable to do other useful work. Most of the time I see customers making the opposite mistake - overallocating CPU requests so much that their overall CPU utilization is well under 25% during peak periods.

Most people would be thrilled to get anything close to 25% CPU util. I guess one of the big missing pieces from Borg that hasn't landed in k8s is node resource estimation. If you have a functional estimator, setting requests and limits becomes a bit less critical.

Great article, I've learned many of these firsthand and agree with their conclusions. I have some more reading to do on PDBs!

K8s is a powerful and complex tool that should only be used when needed. IMO you should be wary of using it if you're trying to host less than a dozen applications - unless it's for learning/testing purposes.

It's a complex beast with many footguns for the uninitiated. For those with the right problems, motivations and skillsets, k8s is the holy grail of scale and automation.

I'm not necessarily agreeing with you. Kubernetes really is a complex beast, and I wouldn't recommend self-hosting it for companies that don't have people that can focus solely on managing it.

I would also not recommend it for hosting a WordPress site, or a simple CRUD app.

However, when you get to the level where autoscaling is required, and where you are deploying multiple services, managed Kubernetes is not such a bad idea.

Using EKS (especially with Fargate) on AWS is not much harder than figuring out and properly utilizing EC2/ASG/ELB. GKE or DigitalOcean offerings seem to be even easier to use and understand.

What would you suggest as an alternative, simpler form for docker deploy, running and managing? Docker-compose?

What do you mean by docker hosting? Kubernetes (and other related tools) are container orchestration/management tools. As is often the case in the management space, if you're just running at small scale, you may not need anything beyond container command line tools and some scripts. You could also use Ansible to automate.

Thanks, we are running a few node servers, which we now deploy from the command line. For dev we use docker-compose. But we are looking for a way to easily share our servers. We developed it for Amsterdam open source. Around 20-30 cities are in line to start using it. It doesn't have to be scalable or have high availability. We need ease of deployment, an easy way to update, and basic security. All the sysadmins are pushing for kubernetes; although it makes sense for the big cities, it's really starting to feel like overkill for small cities who will run 1-3 non-critical sites with 0.5-5k users p/m. Heard a lot about ansible, will look into it, thanks!

So it sounds as if you've sort of outgrown the command line but aren't sure you want to jump in on self-managed Kubernetes. You'd have to look at the costs but maybe some sort of managed offering would work for you. It could scale up for larger sites but would be fairly simple for smaller ones--especially with standardized configurations.

You could look at the big cloud providers directly. OpenShift [I work at Red Hat] also has a few different types of managed offerings.

Well, nobody asked me, and I'm no expert, but here's my list of what (not) to do in Kubernetes (if I had the authority).

1. There. Is. No. Machine. (Insert matrix meme here.) Before you open up your cluster to the rest of the company, drill it into them. Maybe even create a Google Form where they have to sign "I hereby acknowledge that there is no machine in k8s and any attempt to tie my job to a particular machine means a broken config by definition."

2. Thanks to 1, don't let anyone use hostNetwork, hostIPC, hostPID, hostPorts, host whatever, unless you have a really good reason to (with explicit approval process).

3. Don't let anybody start a job without memory/CPU limit. Make sure they understand that, if the job goes over the memory limit, it dies, and it's not k8s admin's problem.

4. You can't log anything into the pod - when the pod dies the log is gone. You can't log into the machine, either (see 1). Therefore, you really need some kind of logging framework that takes the log from your pod and saves it, in its raw form, somewhere safe (like S3). I don't know if there's any such framework, but there had better be.

5. Make sure every manual operation is logged (who did what to which job when), unless you like asking "@here Does anybody know who owns fooservice?" every month.

6. Kubernetes is not magic: if it takes thirty minutes to provision your service, fix that, instead of moving thirty minutes of manual provisioning into k8s and somehow expect it to be magically reliable.

7. Don't bring in existing dependencies uncritically. If your job connects to a zookeeper server to find out its peers, don't bring it into k8s, but rewrite it to use k8s service instead.

8. Take extra extra care when writing down your first job specification, because there are a lot of yaml files to write, and people will just copy what's already there. If your first k8s job mounts host /tmp directory just because you were testing something and forgot to delete the line, soon you will have fifty jobs all mounting host /tmp directory. Good luck figuring out which job actually needs it then.

Yeah, again, I'm by no means an expert - I'm not even an admin, so just consider the list as a rambling of some poor soul who has seen some stuff. Here be dragons, have fun.

The guaranteed QoS example in the article is wrong. Kubernetes only sets the Guaranteed QoS if the CPU count is an integer (which 0.5 is not).

Also, to take full benefit of the QoS you need to configure the Kubelet with "--cpu-manager-policy static"[0].

[0]: https://kubernetes.io/blog/2018/07/24/feature-highlight-cpu-...

This also needs a companion post called "common mistakes: using kubernetes".

I feel like it's a weekly occurrence now where I hear of a startup launching their mvp on kubernetes having spent 8 months too long on Dev as a result.

The other day in an interview someone bragged to me how he had convinced his team to spend 12 months moving to K8s. Upper management thought it was a waste of time but eventually agreed. I asked him if there were any measurable benefits and he said no.

I totally understand why Google needs it. Do you?

Yes we need it.

This is absolutely becoming a tiresome trope. K8s is a huge benefit to tons of companies and none of them are Google.

Yes some people are using K8s when they don't need it. Just like many are using cloud managed services when they don't need them. Or vms. Or insert any technology here.

This article has nothing to do with whether K8s fits some particular use case but may be of help (although I disagree with the entire section on resources which reflects a lack of long term experience with K8s in production) to those who do want to use it.

You're the 10th person in this thread saying the same thing and it doesn't appear you even have that much experience with operations in general.

Sorry to go off on you but I'm really seeing these types of tropes and quick depthless one liner comments and offtopic snipes lately as the downfall of the hn comment section.

Kubernetes doesn't benefit Google? How not?

I think the intention of the post you replied to was to say that many companies other than google have a legitimate need for kubernetes.

Even Google doesn't need it that much: back when I was there, each Borg cluster had something like 10,000+ cores. Large enough to run a typical SV startup wholescale. The ratio of "cluster management work" vs. "actual work being done on it" was not that high.

These days, some people are like "Dude, if you don't have one cluster per AWS availability zone per each environment, you're doing it wrong." Why, just why.

This instantly reminded me of this: https://thedailywtf.com/articles/the-longest-method

Kubernetes sometimes shows its Java roots.

> its Java roots

Citation needed.

AFAIR Borg is implemented in C++ and k8s has been implemented in Go from day 0.

Am I missing some crucial steps in k8s's history?

Apparently the first version was based on a Java prototype, but it's unclear to me how much that's visible:


> Concretely, Kubernetes started as some prototypes from Brendan Burns combined with ongoing work from me and Craig McLuckie to better align the internal Google experience with the Google Cloud experience. Brendan, Craig, and I really wanted people to use this, so we made the case to build out this prototype as an open source project that would bring the best ideas from Borg out into the open.

> After we got the nod, it was time to actually build the system. We took Brendan’s prototype (in Java), rewrote it in Go, and built just enough to get the core ideas across

I remember a comment from someone on the go team (bradfitz maybe) saying that the initial plan for Kubernetes was to be implemented in java, but go was chosen because it already had a lot of momentum in the container world.

That Java link in the article goes to a completely not-Java Chinese site btw

And the Spring link is a 404

To be fair, I don't see how this could be shorter in any other language without losing readability.

You don't have to. Kubernetes should just say "required" here, and the documentation should say "Warning: this is only checked during scheduling."

printf isn't named printIntoBufferWhichMayNotFlushUntilLinefeed, and people are fine with it.

I believe there may have been a plan to allow checks during runtime at some point; although this feature is no longer necessary since https://github.com/kubernetes-sigs/descheduler#removeduplica... can do it.

I'm glad readers are liking the article, but please read and follow the site guidelines. Note this one: If the title begins with a number or number + gratuitous adjective, we'd appreciate it if you'd crop it. E.g. translate "10 Ways To Do X" to "How To Do X," and "14 Amazing Ys" to "Ys." Exception: when the number is meaningful, e.g. "The 5 Platonic Solids."

The submitted title was "10 most common mistakes using kubernetes", HN's software correctly chopped off the "10", but then you added it back. Submitters are welcome to edit titles when the software gets things wrong, but not to reverse things it got right.

Shameless plug but on topic. I wrote recently about readiness and liveness probes with Kubernetes. If you're looking for an educational perspective, you can check: https://medium.com/aiincube-engineering/kubernetes-liveness-...

"more tenants or envs in shared cluster"

This is what I'm trying to convince my current company about. They want everything in a single cluster (prod, test, stage, qa).

Of course self hosting makes this more difficult to justify, since it is additional expenses for more machines.

Have you considered using OpenShift instead of Kubernetes? It comes with vastly improved multitenancy features, as well as other aspects, in regards to plain Kubernetes. OKD, the open sourced package of OpenShift allows full self-hosting: https://www.okd.io

That sounds like a disaster waiting to happen!

Seems like a perfect use-case for Cluster API: https://cluster-api.sigs.k8s.io/user/quick-start.html

Have one global "mgmt cluster" with several workload clusters

One thing to watch for with pod antiAffinity - if you use required vs preferred, and your pod count exceeds the node count, the remainder will be left in Pending and won't spin up anywhere.
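A toy simulation makes the Pending behavior concrete. This is a sketch, not how the real scheduler works: it assumes a "required" anti-affinity rule keyed on hostname, so each node can hold at most one replica, and the pod and node names are hypothetical.

```python
def schedule_with_required_anti_affinity(pod_count, node_names):
    """Place at most one replica per node; the rest stay Pending.

    Toy model of a 'required' anti-affinity rule keyed on hostname:
    a node that already runs a replica cannot accept another one.
    """
    placements = {}
    pending = []
    free_nodes = list(node_names)
    for i in range(pod_count):
        pod = f"web-{i}"  # hypothetical pod name
        if free_nodes:
            placements[pod] = free_nodes.pop(0)
        else:
            pending.append(pod)
    return placements, pending


placed, pending = schedule_with_required_anti_affinity(
    5, ["node-a", "node-b", "node-c"]
)
print(len(placed), pending)  # 3 ['web-3', 'web-4']
```

With a "preferred" rule the last two replicas would still be placed, just on nodes that already run one.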

There's a new feature which does a better job of spreading Pods without blocking scheduling quite as badly: https://kubernetes.io/docs/concepts/workloads/pods/pod-topol...

Really great article.

I have used Kubernetes pretty heavily in the past, and didn't know about PodDisruptionBudget.

What would a good liveness and readiness probe do for a rails app? What kind of work and metrics would these 2 endpoints do in my app?

This is a good question, and I think the article doesn't cover this topic well.

From the article: > The other one is to tell if during a pod's life the pod becomes too hot handling too much traffic (or an expensive computation) so that we don't send her more work to do and let her cool down, then the readiness probe succeeds and we start sending in more traffic again.

Well... maybe. Is it a routine occurrence that an individual Pod becomes "too hot"? If your load balancer can retry a request on, say, a 503 Service Unavailable, you may be better off relying on that retry combined with CPU-based autoscaling to add another Pod (it's simpler, tradeoff is the load balancer may spend too much time retrying).

If you can't or don't want to add additional Pods, then your client is going to see that 503 (or similar) anyway. I'd say, then, that the point of a Pod claiming it's "not ready" to get itself removed from the load balanced pool is to allow the load balancer to more quickly find an available Pod, but this adds complexity and may be irrelevant if you run enough Pods to have some overhead capacity.

A Rails app is a bit different from a node/go/java app in that (typically at least, if you're using Unicorn or other forking servers) each individual Pod can only handle a limited number of concurrent requests (8, 16, whatever it is). It's more likely then that any given Pod is at capacity.

But, liveness/readiness are not so simple. If these probes go through the main application stack, then they're tying up one of the precious few worker processes, even if only momentarily. I haven't worked with Ruby in a number of years, but I remember running a webrick server in the unicorn master process, separate from the main app stack, to respond to these checks. But I did not implement a readiness check that tracked the number of requests and reported "not ready" if all the workers are busy.
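For what it's worth, the readiness check I did not implement might look roughly like the sketch below. It is written in Python rather than Ruby, the class and method names are invented for illustration, and it assumes the probe handler runs outside the worker pool so that answering the probe does not itself consume a worker.

```python
import threading


class WorkerTracker:
    """Counts in-flight requests against a fixed worker pool size."""

    def __init__(self, worker_count):
        self.worker_count = worker_count
        self._busy = 0
        self._lock = threading.Lock()

    def start_request(self):
        with self._lock:
            self._busy += 1

    def finish_request(self):
        with self._lock:
            self._busy = max(0, self._busy - 1)

    def ready(self):
        """Report ready while at least one worker is free."""
        with self._lock:
            return self._busy < self.worker_count


tracker = WorkerTracker(worker_count=2)
tracker.start_request()
print(tracker.ready())  # True: one of two workers is still free
tracker.start_request()
print(tracker.ready())  # False: all workers are busy
```

A readiness endpoint would then return 200 when `ready()` is true and something like 503 otherwise.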

If it has network problems, kubernetes can take that instance out of serving traffic.

When you're doing rolling upgrades it can signal your app is ready to take traffic.

Those are the main uses.

For the readiness probe a simple endpoint that returns 200 is enough. This tests your service’s ability to respond to requests without depending on any other dependencies (sessions which might use Redis or a user auth service which might use a database).

For liveness probe I guess you could check if your service is accepting TCP connections? I don’t think there should ever be a reason for your service to outright refuse connections unless the main service process has crashed (in which case it’s best to let Kubernetes restart the container instead of having a recovery mechanism inside the container itself like supervisord or daemon tools).
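As a rough illustration of the "simple endpoint that returns 200" idea, here is a minimal sketch using only Python's standard library. The paths `/healthz` and `/readyz` and the port are assumptions for the example, not anything the article prescribes; the handler deliberately touches no database, cache or other dependency.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


class ProbeHandler(BaseHTTPRequestHandler):
    """Answers probe requests without touching any downstream dependency."""

    def do_GET(self):
        if self.path in ("/healthz", "/readyz"):  # hypothetical probe paths
            body = b"ok"
            self.send_response(200)
        else:
            body = b"not found"
            self.send_response(404)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the logs


def make_server(port=8080):
    """Build the server; call serve_forever() on the result to run it."""
    return HTTPServer(("0.0.0.0", port), ProbeHandler)


# To run standalone: make_server().serve_forever()
```

If the process has crashed, nothing is listening on the port, so a TCP or HTTP liveness check against it fails and the container gets restarted.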

> For the readiness probe a simple endpoint that returns 200 is enough. This tests your service’s ability to respond to requests without depending on any other dependencies (sessions which might use Redis or a user auth service which might use a database).

If the underlying dependencies aren't working, can a pod actually be considered ready and able to serve traffic? For example, if database calls are essential to a pod being functional and the pod can't communicate with the database, should the pod actually be eligible for traffic?

The article explicitly warns against that:

> Do not fail either of the probes if any of your shared dependencies is down, it would cause cascading failure of all the pods.

The idea would be that the downstream dependencies have their own probes and if they fail they will get restarted in isolation without touching the services that depend on them (that are only temporarily degraded because of the dependency failure and will recover as soon as the dependency is fixed).

I think there is more to the story for some of these points and it can be dangerous to just take this at face value of best practices.

For example on the liveness / readiness probe item, the article says,

> “ The other one is to tell if during a pod's life the pod becomes too hot handling too much traffic (or an expensive computation) so that we don't send her more work to do and let her cool down, then the readiness probe succeeds and we start sending in more traffic again.”

But this is often a very bad idea and masks long term errors in underprovisioning a service.

If the contention of readiness / liveness checks vs real traffic is ever resulting in congestion, you need the failure of the checks to surface it so you can increase resources. If you set things up so this failure won’t surface, like allowing the readiness check to take that pod out of service until the congestion subsides, you’re only hurting yourself by masking the issue. It basically means your readiness check is like a latency exception handler outside the application, very bad idea.

The other item that is way more complicated than it seems is the issue about IAM roles / service accounts instead of single shared credentials.

In cases where your company has an enterprise security team that creates extremely low-friction tools to generate service account credentials and inject them, then sure, I would agree it’s a best practice to ruthlessly split the credentialing of every application to a shared resource, so you can isolate access and revoking.

But if you are on some application team and your company doesn’t have a mature enough security tooling setup managed by a separate security team, this can become a bad idea.

It can lead to superlinear growth in secrets management as there will be manual service account creation and credential propagation overhead for every separate application. Non-security engineers will store things in a password manager, copy/paste into some CI/CD tool, embed credentials as ENV permanently in a container, etc., all because they can’t create and maintain the end to end service account credential tools in addition to their job as an application team engineer. It’s something they think about twice per year and need off their plate immediately to move on to other work.

Across teams it means you end up with 20 different team-specific ways to cope with rapid growth of service accounts, leading to an even worse security surface area, risk of credential-based outages, omission of important testing because ensuring ability to impersonate the right service account at the right place is too hard, etc.

Very often it is a real trade-off to consider that one single service account credential that has just one way to be injected for every service is safer in the bigger picture.

Yes it means a credential issue for any service becomes an issue for all, and this is a risk and you want automated tooling to mitigate it, but it very often will be less of a risk than insisting on a parochial best practice of individual service account credentials, resulting in much worse and less auditable secrets workflows overall unless it is completely owned and operated by a central security team in such a way that it doesn’t create any approval delays or workflow friction for application teams.

You of course should monitor the rate of liveness flapping for your services. The need to monitor it does not imply that it's a bad feature.

You can’t have it both ways. If you need to monitor it and take corrective action (which you do) then you shouldn’t rely on it.

This is an argument for making your liveness probe == readiness probe. It should just check pod availability in a minimal way, and if continuing to send the pod traffic based on this indicator turns out bad because of congestion, you want to see that causing errors and react, not let the scheduler take it out of service for new traffic.

You want liveness & readiness to check the same thing, and it should be a non-trivial check of service health that is also very low latency. And as long as that check is passing, keep sending traffic.

When the check fails, it should always be for a “hard down” reason that tells you the pod could not, regardless of traffic levels, accept traffic because it’s fundamentally internally down.
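To make that concrete, here is a minimal sketch of one shared health check that fails only for "hard down" reasons and ignores load. The `PodState` fields, the `/healthz` path, port 8080 and the timing values are all assumptions picked for illustration, not anything from the article; only the probe field names (`livenessProbe`, `readinessProbe`, `httpGet`, `periodSeconds`, `timeoutSeconds`, `failureThreshold`) are real Kubernetes fields.

```python
import copy
from dataclasses import dataclass


@dataclass
class PodState:
    # Hypothetical internal signals; a real service would define its own.
    event_loop_alive: bool
    config_loaded: bool
    in_flight_requests: int  # load signal, deliberately ignored by the check


def health_status(state: PodState) -> int:
    """Return the HTTP status for the single shared health endpoint.

    Fails only when the pod is internally down, never because it is busy.
    """
    hard_down = not (state.event_loop_alive and state.config_loaded)
    return 503 if hard_down else 200


def probe_spec(path: str = "/healthz", port: int = 8080) -> dict:
    """Build identical liveness and readiness probe settings (assumed values)."""
    probe = {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": 5,
        "timeoutSeconds": 1,
        "failureThreshold": 3,
    }
    return {
        "livenessProbe": copy.deepcopy(probe),
        "readinessProbe": copy.deepcopy(probe),
    }
```

With this shape, a pod under heavy load keeps answering 200 and keeps receiving traffic, so congestion shows up as request errors you can see and react to, rather than as pods quietly dropping out of rotation.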

I don't want the pager to go off just because of some slight non-liveness. That's a likely outcome of high utilization (usually viewed as a good thing, isomorphic with low cost). If you're running really hot and a few tasks are shedding load by playing dead intermittently, that's OK up to a point; if a large portion of pods are doing that at a high rate, that might be bad. You might not even alert on it, just throw it up on a dashboard as an informative indicator for operators.
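A rough sketch of that "OK up to a point" policy, assuming you already collect a per-pod count of liveness failures over some window. The function names and both thresholds are made up for illustration; the right numbers depend entirely on the service.

```python
def flapping_fraction(failures_per_pod: dict[str, int], per_pod_limit: int) -> float:
    """Fraction of pods whose liveness failures in the window exceed per_pod_limit."""
    if not failures_per_pod:
        return 0.0
    flapping = sum(1 for count in failures_per_pod.values() if count > per_pod_limit)
    return flapping / len(failures_per_pod)


def classify(fraction: float, page_above: float = 0.25) -> str:
    """Page only when a large share of pods is flapping; otherwise just chart it."""
    return "page" if fraction > page_above else "dashboard"
```

For example, one hot pod out of four stays on the dashboard, while half the fleet playing dead at once crosses the (assumed) paging threshold.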

> I don't want the pager to go off just because of some slight non-liveness.

That’s just bad engineering. Really, one should want the pager to go off for that and be really pedantic to actually sniff out the root cause and actually fix it.

Hiding that type of issue by letting something like liveness/readiness policy tacitly conceal it is just going to result in a far worse or more systemic issue later with far worse pager disruptions to your life.

You’re skipping flossing every now and then only to need serious root canals later.

Lots of good advice in this article.


You might be joking, but I'd say using k8s when you don't need to is definitely a mistake.

You spent the time signing up just for this?

It’s an important point for those who may unknowingly over-engineer.

And they are probably not going to change their mind by reading an anonymous sarcastic comment on HN.

For someone with a short time scale, only trawling this thread, of course not.

For someone young in this space, this comment and one hundred others, for and against, sift into his or her consciousness as part of the perceived zeitgeist of kubernetes within the larger community. Perhaps after several years, this person may have an intuition to avoid kubernetes in favor of separate docker or lxd containers.

That association of kubernetes as a FAANG-level tool builds stronger with the linked article: this hypothetical person can compare the struggles against perceived resources and so on. But not everyone has time to read the article, any given kubernetes article, that we come across. Some of those times, it's enough to take the temperature and move on. So, -over years- that may build an aversion that would not have otherwise formed had commenters avoided denouncing (or endorsing) kubernetes with less than full commitment toward convincing others.

Also, in this toneless medium, I can't intuit much of the emotional weight lolkube conveyed the sentence with. Was this person rueful, playful?

But there are several other actually useful comments in this thread warning about using kubernetes when it's not needed, that cite actual reasons. We didn't need one by "lolkube", not at any timescale. By the way, that name is the clue you need about the tone.

TBH, for me it's usually "Using Kubernetes."

Maybe on GCP (I don't see a lot of companies on GCP) it makes sense, but ECS is AWS native and on bare hardware I immediately go to docker swarm since it ships with the container runtime (instead of a bolted on sidecar container thing).

I like the primitives and features of Kubernetes, but the implementation doesn't give me warm fuzzies and it always gets passed over for safer bets for me. Even very early on in its development I always went to Mesos over Kubernetes (though Mesos is pretty dead at this point).

Common mistake: using Kubernetes

> You can't expect kubernetes scheduler to enforce anti-affinities for your pods. You have to define them explicitly.

Why isn't this the default behavior? Why don't I have to go in and tell it that it's okay to have multiple instances on the same node? Why? So that I somehow feel like I've contributed to the whole process by fixing something that never should break in the first place?

I know of a few pieces of code where I definitely want to run N copies on one machine, but for all of the rest? Why am I even running 2 copies if they're just going to compete for resources?

Simply put, what the article recommends is, in most situations, dead wrong advice.

If configured as suggested, and for some reason you lose enough nodes that there is no longer a separate node for each of your replicas, you will have fewer replicas than you intended, since the scheduler can't schedule a pod onto a node where an identical pod is already running.
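A toy model of that failure mode, assuming a hard (required) anti-affinity rule that allows at most one replica per node. This is not the scheduler's code, just the counting argument; the function and node names are hypothetical.

```python
def schedule_with_hard_anti_affinity(replicas: int, nodes: list[str]) -> tuple[dict[str, int], int]:
    """Place at most one replica per node.

    Returns (placement, pending), where pending counts replicas that
    cannot be scheduled because every node already runs one.
    """
    placement = {node: 1 for node in nodes[:replicas]}
    pending = replicas - len(placement)
    return placement, pending
```

With 3 replicas and 5 nodes everything fits. Lose 3 nodes and the same 3 replicas on the 2 survivors give `pending == 1`: one replica simply does not run, even if the remaining nodes have spare capacity.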

Additionally, the affinity and anti-affinity features are costly from the cluster's perspective, so the configuration recommended by the author costs you performance.

And why isn't Kubernetes doing the obvious thing and spreading your pods apart? Well, it's simple: it does the right thing:


The first scoring parameter is SelectorSpreadPriority.
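For contrast with a hard rule, here is a deliberately simplified sketch of soft spreading: prefer the node that currently runs the fewest pods of the same service, but never refuse to place a pod. This only illustrates the idea of a spreading preference; it is not how SelectorSpreadPriority is actually computed, and the names are hypothetical.

```python
def pick_node(matching_pods_per_node: dict[str, int]) -> str:
    """Pick the node with the fewest pods of the same service (ties broken by name)."""
    return min(sorted(matching_pods_per_node), key=lambda node: matching_pods_per_node[node])


def schedule_with_soft_spread(replicas: int, nodes: list[str]) -> dict[str, int]:
    """Place every replica, spreading as evenly as the available nodes allow."""
    counts = {node: 0 for node in nodes}
    for _ in range(replicas):
        counts[pick_node(counts)] += 1
    return counts
```

With plenty of nodes the replicas land on different machines; with only two nodes left, three replicas still all run, doubled up on one node instead of left pending.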

It's quite possible that you have a machine with 192 CPU cores in it, but it's very unlikely that you are able to write a service that scales to that level ... and if you write it in Go it's really unlikely that you can scale even to 8 CPUs. There's nothing weird about having multiple replicas of the same job on the same node. If you look through the Borg traces that Google recently published you can find lots of jobs with multiple replicas per node.

This is not how defaults work.

When you are talking about the realm of the possible, you provide settings that allow you to reach the scenarios that you feel are reasonable, desirable, or lucrative (or commonly enough, some happy combination of the three).

Defaults are the realm of the probable. And nobody is requisitioning a 192 core machine without a good bit of due diligence, which would include deciding how to set server affinity.

You're suggesting that preventing multiple replicas of the same job from being scheduled on the same machine is a good default. There's no evidence to support your conclusion, and my experience is quite the opposite. It is much better if people running batch jobs just schedule 100000 tiny replicas, and let the scheduler sort it out. This provides the cluster scheduler with plenty of liquidity. Multiple small processes are more efficient than a shared-nothing single process.
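The liquidity point can be shown with a toy first-fit packer. The node capacities and task sizes below are invented, and first-fit is just a stand-in for whatever a real scheduler does.

```python
def first_fit(task_sizes: list[int], free_cpus: list[int]) -> int:
    """Return how many tasks get placed when each goes to the first node with room."""
    free = list(free_cpus)  # copy so the caller's list is not modified
    placed = 0
    for size in task_sizes:
        for i, capacity in enumerate(free):
            if capacity >= size:
                free[i] -= size
                placed += 1
                break
    return placed
```

On three nodes with 3, 3 and 2 free CPUs, two 4-CPU replicas place nowhere, while the same 8 CPUs of work split into eight 1-CPU replicas all fit, several per node.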

Pod anti-affinities did historically increase scheduling times dramatically. Not sure this is the primary reason, but it's probably one.

It misses the biggest one: using it. I ranted about the cloud a decade ago http://drupal4hu.com/node/305 and there's nothing new under the Sun. Still, most companies doing cloud and Kubernetes don't need it... practice YAGNI ferociously.

I think you may be on the wrong side of history here

Nothing new with that one. I still think git was the wrong choice for DVCS and yet, I have been using it for a decade or more now. I still feel I have an uneasy truce with it but not a friendship. I still think github is a shitty choice for hosted git -- at least most large open source projects have gone with gitlab so I am not utterly alone with that. I am now using Kubernetes because my primary client is using it. Doesn't mean I am happy with it or that I think it's necessary by any means. It's fine. I am getting old but I still can learn. Doesn't mean I can't be grumpy about it.

