Bottlerocket is a Linux-based operating system optimized for hosting containers. It was originally developed at AWS specifically for running secure and performant Kubernetes nodes. It's minimal, secure, and supports atomic updates.
According to this discussion, starting with Bottlerocket 1.13.0 (March 2023), new releases default to the cgroup v2 interface for organizing processes and enforcing resource limits.
In this post I'll explore how this works for EKS clusters running Kubernetes 1.26+ and what the change means for EKS users.
Cgroups - An Intro
Cgroups (short for control groups) is a Linux kernel feature that lies at the foundation of what we now know as Linux containers.
The feature allows limiting, accounting for, and isolating resource usage for a collection of processes.
It was developed at Google circa 2007 and merged into Linux kernel mainline in 2008.
Cgroups and Kubernetes
Kubernetes allows us to define resource usage for containers via the resources map in the Pod API spec. These definitions are then passed by the kubelet on to the container runtime on the node and translated into Cgroups configuration.
Up until version 1.25, Kubernetes only supported cgroup v1 by default. In 1.25, stable support for cgroup v2 was added. Now, when running on a node with cgroup v2, the kubelet automatically detects this and configures cgroups accordingly. But what does this mean for our workload configuration? To understand that, we first need to explain what cgroup v2 is.
Cgroups V2
Cgroups v2 introduced a redesigned API - mainly a unified hierarchy and improved consistency - and was declared stable in the Linux kernel in 2016 (kernel 4.5). The following diagram shows the change in how cgroup controllers are organized in v2 vs. v1:
According to this architecture document : “Some Kubernetes features exclusively use cgroup v2 for enhanced resource management and isolation. For example, the MemoryQoS feature improves memory QoS and relies on cgroup v2 primitives.”
And when we look at the description of the aforementioned MemoryQoS feature we find out that “In cgroup v1, and prior to this feature, the container runtime never took into account and effectively ignored spec.containers[].resources.requests[“memory”].” and that “Fortunately, cgroup v2 brings a new design and implementation to achieve full protection on memory… With this experimental feature, quality-of-service for pods and containers extends to cover not just CPU time but memory as well.”
Well, first of all - it’s a bit shocking and even insulting to learn that container runtimes ignored our settings! But I was also very curious to learn how this changes now that cgroups v2 support is introduced.
MemoryQoS and Cgroups v2
According to this page:
Memory QoS uses the memory controller of cgroup v2 to guarantee memory resources in Kubernetes. Memory requests and limits of containers in a pod are used to set the specific interfaces `memory.min` and `memory.high` provided by the memory controller. When `memory.min` is set to memory requests, memory resources are reserved and never reclaimed by the kernel; this is how Memory QoS ensures the availability of memory for Kubernetes pods. And if memory limits are set in the container, the system needs to limit container memory usage: Memory QoS uses `memory.high` to throttle a workload approaching its memory limit, ensuring that the system is not overwhelmed by instantaneous memory allocation.
This is all great! Let’s now provision an EKS cluster with some Bottlerocket nodes and see how this works in practice.
To easily spin up a cluster, use the cluster.yaml in the attached GitHub repository:
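The exact file is in the repo; a minimal eksctl config for a cluster like this could look roughly as follows. The cluster name, region, and instance type here are my placeholders, not necessarily the repo's values:

```yaml
# cluster.yaml - a sketch; the real file is in the repo
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: bottlerocket-test   # placeholder name
  region: eu-west-1         # placeholder region
nodeGroups:
  - name: bottlerocket-ng
    instanceType: m5.large  # placeholder instance type
    desiredCapacity: 1
    amiFamily: Bottlerocket # use Bottlerocket AMIs
    bottlerocket:
      enableAdminContainer: true  # enables ssh via the admin container
    ssh:
      allow: true
      publicKeyPath: ./bottlerocket-key.pub
```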
generate ssh keys:
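Something like the following - the key file name is arbitrary, it just has to match the `publicKeyPath` in the cluster config:

```shell
# generate a keypair for ssh access to the node
ssh-keygen -t ed25519 -f ./bottlerocket-key -N ""
```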
and create the cluster
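Assuming eksctl is installed and AWS credentials are configured:

```shell
# provision the cluster from the config file (takes ~15-20 minutes)
eksctl create cluster -f cluster.yaml
```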
This will create a cluster with one Bottlerocket node. It also configures ssh access to the node by enabling the Bottlerocket admin container.
This means we can now access the node:
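Something along these lines, using the admin container's default `ec2-user` account and the key generated above (the IP placeholder is whatever EC2 assigned to the node):

```shell
ssh -i ./bottlerocket-key ec2-user@<node-public-ip>
```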
We get greeted with the following screen:
As this says, we can get admin access to the Bottlerocket filesystem by running `sudo sheltie`. So let's do that!
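From the admin container prompt:

```shell
# drop into a full root shell on the Bottlerocket host
sudo sheltie
```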
Now we can check if we in fact have `cgroupv2` enabled:
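The filesystem type mounted at `/sys/fs/cgroup` tells the story:

```shell
# prints cgroup2fs on a cgroup v2 host, tmpfs on cgroup v1
stat -fc %T /sys/fs/cgroup/
```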
Yup! This is `cgroupv2`! Were this `cgroupv1`, the output would've been `tmpfs`.
Let’s Deploy a Pod
Ok, now let's deploy a pod to our node. We'll do that by creating a deployment based on the following YAML spec. It deploys `antweiss/busyhttp`, which I forked from `jpetazzo/busyhttp` and extended with memory load and release endpoints. You'll notice that the pod runs a container with Guaranteed QoS - i.e., memory and CPU limits are equal to requests:
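A sketch of what that spec looks like - the CPU figure here is my assumption; only the 200Mi memory value is confirmed later in the post:

```yaml
# dep.yaml - a sketch; the real file is in the repo
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busyhttp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: busyhttp
  template:
    metadata:
      labels:
        app: busyhttp
    spec:
      containers:
        - name: busyhttp
          image: docker.io/otomato/busyhttp:latest
          resources:
            requests:          # requests == limits -> Guaranteed QoS
              memory: 200Mi
              cpu: 200m        # assumed value
            limits:
              memory: 200Mi
              cpu: 200m        # assumed value
```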
This spec is found in dep.yaml and we can deploy it with:
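Assuming kubectl is pointed at the new cluster:

```shell
kubectl apply -f dep.yaml
```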
Check the Cgroup Impact
Now let’s go back to our node and see how our resource definitions are reflected in the cgroup
config.
Back inside the `sheltie` prompt, let's explore the containers running on Bottlerocket. Bottlerocket OS uses the `containerd` container runtime; in order to interact with it we'll need to use `ctr`.
When we run `ctr help`, we get the following:
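The help text opens with containerd's well-known disclaimer - paraphrased here from memory, the exact wording may differ between versions:

```shell
ctr help
# ctr is an unsupported debug and administrative client for interacting
# with the containerd daemon...
```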
So `ctr` is unsupported. A bit discouraging, but well, it's working. Let's try to look at our containers:
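With no namespace specified, `ctr` queries its default namespace:

```shell
ctr containers ls
# CONTAINER    IMAGE    RUNTIME
# (no rows - nothing in the default namespace)
```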
No containers?! But I do see my pod running on the node! Where is my container? The answer is namespaces. Yup, just like Kubernetes or the Linux kernel, containerd has namespaces. And all the containers created by the kubelet live in a namespace called "k8s.io". We can see it by running:
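
```shell
ctr namespace ls
# NAME      LABELS
# k8s.io
```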
Ok, let’s check the containers in the “k8s.io” namespace:
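
```shell
ctr -n k8s.io containers ls
```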
Now we’re talking! We have all the usual suspects here - coredns, kube-proxy, the omnipresent pause containers. But right now we’re interested in the container based on the docker.io/otomato/busyhttp:latest
image.
Let’s look for its cgroup definition in the cgroup filesystem we discovered previously. First we need to filter out the container id. ctr
supports filters for its listing function. So the way to parse out the container id by image name is the following:
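A sketch using `ctr`'s image filter; the variable name is my choice, and the image reference must match the listing exactly:

```shell
CONTAINER_ID=$(ctr -n k8s.io containers ls -q "image==docker.io/otomato/busyhttp:latest")
echo $CONTAINER_ID
```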
Note the `-q` flag, which tells `ctr` to output only the id.
Now we can find the container’s cgroup config:
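One way to do that is to search the cgroup filesystem for a directory carrying the container id (assuming the id is in the `CONTAINER_ID` variable extracted earlier):

```shell
find /sys/fs/cgroup -type d -name "*${CONTAINER_ID}*"
```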
This gives us a long path somewhere inside a folder called `kubepods.slice`. Let's wrap this path in an environment variable and look around:
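For example (the `CGPATH` variable name is my choice):

```shell
# capture the cgroup directory path and list the controller files in it
CGPATH=$(find /sys/fs/cgroup -type d -name "*${CONTAINER_ID}*" | head -1)
ls $CGPATH
```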
Whew! That's a lot of files! Now, according to this page on Memory QoS, our `requests.memory` should be translated to `memory.min`, while `memory.high` is calculated the following way:
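The formula from the Memory QoS KEP (KEP-2570), quoted roughly:

```
memory.high = floor[ (requests.memory
                      + memory_throttling_factor
                        * (limits.memory or node_allocatable_memory - requests.memory))
                     / pageSize ] * pageSize
```

If I read the KEP right, `memory.high` is not set at all for Guaranteed pods (where limits equal requests), which is exactly the kind of pod we deployed.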
Let’s look at the limit first:
Hmm. That's not a number. But we can also notice that there's a file called `memory.max`. Let's look inside that:
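
```shell
cat $CGPATH/memory.max
# 209715200
```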
Ok, here's our limit! 209715200 bytes is exactly the 200Mi we defined in the `resources` section of our pod spec.
Now what about the requests? Let's look at `memory.min`:
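
```shell
cat $CGPATH/memory.min
# 0
```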
0 is not the request we've defined. And that makes sense: Memory QoS has been in alpha since Kubernetes 1.22 (August 2021) and, according to the KEP, was still in alpha as of 1.27.
In order to see the actual memory request values reflected in the cgroup config, one needs to enable the Memory QoS feature gate in the kubelet config, as defined here:
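On a node where we control the kubelet configuration directly, enabling it would look something like this (the `memoryThrottlingFactor` line is optional - 0.8 was the default in the alpha implementation):

```yaml
# KubeletConfiguration fragment - not applicable on stock Bottlerocket
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
featureGates:
  MemoryQoS: true
memoryThrottlingFactor: 0.8
```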
Trouble is, due to the atomic nature of Bottlerocket OS, we can't change its KubeletConfiguration file (found at /etc/kubernetes/kubelet/config) directly. We can only pass settings through `settings.kubernetes` via the API or a config file, and these currently don't support setting feature gates. So it looks like the only way to get Memory QoS support on EKS Bottlerocket nodes is to build our own Bottlerocket images - a subject for a whole other blog post.
And for now, let's shrug our shoulders, scratch our heads, and bring down our EKS cluster:
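
```shell
eksctl delete cluster -f cluster.yaml
```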
Summing it All Up
- `cgroup v2` is enabled by default on current Bottlerocket EKS instances; this allows better organized resource management on the nodes
- an important Kubernetes feature based on `cgroup v2` is Memory QoS, which ensures that memory requests are actually allocated by the container runtime and not merely checked for by the Kubernetes scheduler
- Memory QoS is still in alpha after 2 years
- there's no easy way to enable Memory QoS on Bottlerocket nodes without building the AMIs ourselves
Anyway, this was an interesting exploration. If there's anything I got wrong or didn't make clear, please let me know in the comments.
May all your containers run smoothly!
The config files used in this blog post can be found in this GitHub repo.