July 17, 2024

K8s Reliability through Automation and Best Practices

Anton Weiss

Oops, I did it again! Yesterday I broke our Kubernetes cluster. Or, to be more precise - I broke Karpenter.

Here’s my very brief and private postmortem - I know you want it!

How I Broke Karpenter

As a part of my job I run lots of experiments. In fact - I believe it’s an important part of any engineer’s job. Understandably - my experiments involve spinning up multiple Kubernetes clusters. Some of those I run locally with k3d. But when I need to experiment with Karpenter - they have to be actual EKS clusters. Hence my blog series about all the ways one can spin up an EKS.

So yesterday I was trying to do just that - but cluster creation failed, because there was a dangling IAM role for Karpenter conflicting with what I was trying to create. So I went to AWS IAM in the console and searched for all the roles matching KarpenterNodeRole. I found a few of them left over from my previous experiments, and in a moment of cleanup frenzy I deleted them all. And of course I deleted one too many!

In about half an hour the on-call SRE pinged me about a KarpenterNodeRole that belonged to the Dev cluster - the one the whole development team was using. The moment the role was gone, Karpenter started losing nodes, couldn't provision new ones and couldn't even move pods over to the remaining nodes. Even after the SRE team recreated the role by running Terraform, the storage mounted on the older nodes couldn't be unmounted, leaving stateful pods hanging. In short - quite a headache, all caused by my careless cleanup activity. (This is also the place to mention that you should, in general, manage all of your cloud resources with IaC - be it Terraform, Pulumi or CDK. Alas, sometimes the nature of experiments is that they aren't very manageable...)

Thankfully this was a Dev cluster. Thankfully it was an easy fix, because we Terraform all the important stuff. And thankfully our SRE team immediately held a post-mortem in a Slack channel, defined alerts that will let us identify such situations faster, and set up an automation that reconciles this specific role automatically if it ever goes missing.

Multiple Things Can Go Wrong

But the point I'm trying to make is this: Kubernetes is very complex. And that means multiple things can go wrong - far too many to cover in a single blog post, so I won't even try. And yet there is a whole category of reliability risks that PerfectScale can help you mitigate in a comprehensive and automated way - without having to build custom dashboards or define alert routing. I'm talking about resource allocation risks, of course - because that's where our expertise lies.

The Reliability Risks of Resource Allocation

The economics of resource allocation are simple. Over-provisioning leads to waste. Under-provisioning leads to risk.

Specifically on Kubernetes, we're talking about under-provisioning the basic resources that can be allocated to containers - CPU and memory - and there are four main risk factors:

Under-provisioned Memory Requests

If a container consumes more memory than it originally requested, the node it's running on may not have that memory available. The node goes into the MemoryPressure condition, which can lead to the pod being evicted and thus to service disruption. This risk can be somewhat mitigated by setting a PodDisruptionBudget for the most critical pods and raising their PriorityClass. Still, the best way to handle it is to calculate the actual memory utilization and set the request accordingly.
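
To make that concrete, here is a minimal sketch of a memory request sized from observed usage, together with a PodDisruptionBudget and a priority class for a critical workload. All names, labels and numbers are made up for illustration, and the high-priority PriorityClass is assumed to already exist in the cluster:

```yaml
# Hypothetical example - names, labels and sizes are illustrative only.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payments-api-pdb
spec:
  minAvailable: 1                  # keep at least one replica up during voluntary evictions
  selector:
    matchLabels:
      app: payments-api
---
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
  labels:
    app: payments-api
spec:
  priorityClassName: high-priority                  # assumed to exist; higher priority pods are later eviction candidates
  containers:
    - name: api
      image: registry.example.com/payments-api:1.0  # placeholder image
      resources:
        requests:
          memory: "512Mi"   # set near the observed peak so the scheduler reserves enough room
```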

Under-provisioned Memory Limits

When a container consumes more memory than its defined limit, the kernel's OOM killer steps in and terminates the offending container. On Kubernetes this can lead to recurring container restarts - impacting response times and availability.
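
A minimal sketch with made-up values: if this worker's real usage climbs above the limit, the kernel OOM-kills it and Kubernetes restarts it, which shows up as OOMKilled in the container's last termination state:

```yaml
# Illustrative only - size the limit from observed peak usage plus some headroom.
apiVersion: v1
kind: Pod
metadata:
  name: report-worker
spec:
  containers:
    - name: worker
      image: registry.example.com/report-worker:1.0  # placeholder image
      resources:
        requests:
          memory: "256Mi"
        limits:
          memory: "256Mi"   # usage above this gets the container OOM-killed and restarted
```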

Under-Provisioned CPU Requests

Unlike memory, CPU is a compressible resource. Containers consuming more CPU than they originally requested will not be evicted; instead they will be slowed down by not getting enough CPU time. This usually leads to higher latency and timeouts. Moreover, containers with no CPU requests (and no limits) defined may try to consume all the available CPU on the node, eventually leading to node CPU saturation - strangling the well-behaved containers too.
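
A small sketch of the request side, with illustrative names and numbers: the CPU request below reserves a proportional share of CPU time for the container, so it keeps getting cycles when the node is busy; a container with no CPU request at all only competes for leftover cycles and can crowd out its neighbours on a saturated node:

```yaml
# Minimal sketch - names and values are placeholders, not recommendations.
apiVersion: v1
kind: Pod
metadata:
  name: checkout-api
spec:
  containers:
    - name: api
      image: registry.example.com/checkout-api:1.0  # placeholder image
      resources:
        requests:
          cpu: "500m"      # roughly the observed steady-state usage
          memory: "256Mi"
```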

Defined CPU Limits

As a number of Kubernetes experts have already described, defining CPU limits - even though possible - is an anti-pattern in most cases. It often leads to unnecessary CPU throttling even when there is CPU available on the node. And the result is, once again, denied CPU time, slowness and increased latency.
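
The usual way to follow that advice is to set a CPU request but leave the CPU limit out, while keeping a memory limit. A container-spec snippet with placeholder values:

```yaml
# Snippet of a container spec - numbers are placeholders, not recommendations.
resources:
  requests:
    cpu: "250m"        # guaranteed fair share under contention
    memory: "128Mi"
  limits:
    memory: "128Mi"    # keep a memory limit to bound the blast radius of leaks
    # no cpu limit: spare cycles on the node can be used without CFS throttling
```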

As mentioned above, all these risks are well known, but they can only be discovered at runtime - by constantly sampling and observing resource utilization and reliability events such as OOM kills, CPU throttling and container restarts.

What we at PerfectScale understand is that very often it's the fear of accidental under-provisioning - and thus compromised performance - that stops organizations from optimizing their Kubernetes workload efficiency.

Continuous Risk Mitigation

That's why we've integrated the discovery of these risks into our platform. PerfectScale's workload view makes it very easy to find the under-provisioned containers and apply the changes needed to fix the reliability issues. We can start with the following simple workflow (a sketch of the resulting change follows the list):

  • Find the OOM killed containers
    • Right-size the memory limits to stop them from getting OOM-killed
  • Find the CPU throttled containers
    • Right-size the CPU requests/limits to get them to their optimal performance
  • Find the Memory under-provisioned containers (which are at risk of eviction)
    • Right-size the memory requests to schedule them correctly and avoid eviction.
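
As a rough sketch of what the right-sizing step produces, here is a hypothetical strategic-merge patch for a Deployment; the names and values are made up and would in practice come from observed utilization. It could be applied with something like kubectl patch deployment payments-api --patch-file rightsize.yaml:

```yaml
# rightsize.yaml - hypothetical values derived from observed usage.
spec:
  template:
    spec:
      containers:
        - name: api
          resources:
            requests:
              cpu: "300m"        # raised based on observed CPU throttling
              memory: "768Mi"    # raised above the observed peak to avoid eviction
            limits:
              memory: "768Mi"    # memory limit moves together with the request
              # cpu limit intentionally omitted, per the note above
```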

Once you’re happy with the results and start seeing the risk reduced - enable PerfectScale automation for the workloads at risk and watch the reliability of your cluster go up automatically.

And finally get a good night's sleep - knowing your cluster is getting optimized for performance while you rest.

Now that you’re well-rested - we can start optimizing your clusters for cost. But that’s a topic for another blog.
