August 8, 2024

We Can Resize Pods without Restarts! Or Can't We?

Anton Weiss

Kubernetes v1.27, released in April 2023, came with an exciting announcement - we can now resize pod CPU and memory requests and limits in-place! Without deleting the pod or even restarting the containers!

This happened more than a year ago and since then a lot of folks seem to think this feature is already publicly available or is due to become so tomorrow.

But the reality is that this was originally released as an Alpha feature and has since failed to move to Beta because of a number of unresolved issues.

Latest status as of June 2024 is that it has been pushed back to v1.32:

[Image: screenshot of the GitHub comment moving the feature to v1.32]

Here's the link to that comment on GitHub.

So first of all - this isn't coming tomorrow. But we can still play with the feature and understand its advantages and shortcomings. Which is exactly what I'm planning to do in this post.

Get a Cluster with Alpha Features

k3d is irreplaceable when we want to quickly and cheaply test Kubernetes Alpha features. All we need to do is pass the correct feature gate to the correct control plane component.

Install k3d

If you still haven't done so - install k3d:
with curl and bash:

curl -s https://raw.githubusercontent.com/k3d-io/k3d/main/install.sh | bash

or with another method of your choice listed here

In our case the component is the API server and the feature gate is called InPlacePodVerticalScaling as can be seen here

I'm spinning up a single-node cluster with the following config:


cat <<'EOF' | k3d cluster create -c -
apiVersion: k3d.io/v1alpha3
kind: Simple
name: pod-resize
servers: 1
image: rancher/k3s:v1.30.2-k3s2
options:
  k3d:
    disableLoadbalancer: true
  k3s:
    extraArgs: # the feature gate is passed here
      - arg: --kube-apiserver-arg=feature-gates=InPlacePodVerticalScaling=true
        nodeFilters:
          - server:*
EOF
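
Before going further it can be worth confirming that the gate actually took effect. The sketch below is not part of the original walkthrough: it assumes the API server exposes the kubernetes_feature_enabled metric on /metrics with a name label, and gate_enabled is a hypothetical helper name. It only parses metrics text from stdin, so it can be tried on sample input as well.

```shell
# gate_enabled GATE_NAME
# Reads Prometheus-style metrics text on stdin and prints "enabled" if the
# kubernetes_feature_enabled metric for GATE_NAME has the value 1,
# otherwise "disabled". The metric and label layout are assumptions.
gate_enabled() {
  gate_name=$1
  # Expected line shape (assumed):
  #   kubernetes_feature_enabled{name="X",stage="ALPHA"} 1
  if grep '^kubernetes_feature_enabled{' \
      | grep "name=\"${gate_name}\"" \
      | grep -q ' 1$'; then
    echo "enabled"
  else
    echo "disabled"
  fi
}

# Usage against the k3d cluster (needs permission to read /metrics):
#   kubectl get --raw /metrics | gate_enabled InPlacePodVerticalScaling
```

If this prints "disabled", the feature gate argument most likely did not reach the API server.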

The Happy Path - Updating the CPU

Now let's create a pod with one container defining resource requests and limits.


apiVersion: v1
kind: Pod
metadata:
  name: stress
spec:
  containers:
  - image: progrium/stress
    args: ["--cpu", "1", "--vm", "1", "--vm-bytes", "128M", "--vm-hang", "3"]
    name: stress
    resources:
      requests:
        memory: 150M
        cpu: 100m
      limits:
        memory: 150M
        cpu: 100m

You can create the pod with:


kubectl apply -f https://raw.githubusercontent.com/perfectscale-io/inplace-pod-resize/main/guaranteed.yaml

I'm using progrium/stress and setting it up for slow success by requesting a tenth of the CPU it needs and just enough memory.

stress --vm 1 --vm-bytes 128M --vm-hang 3 - this tells stress to spawn one worker that allocates 128 MB of memory, holds it for 3 seconds, then frees it and starts over.
My pod is only currently allowed to have 150M of memory, so I expect it to run fine.

Meanwhile, stress --cpu 1 tells the container to use one whole CPU, while it's actually allowed to use only 0.1 CPU. So it'll surely get throttled.

The container starts just fine:


kubectl get pod
NAME     READY   STATUS    RESTARTS   AGE
stress   1/1     Running   0          7s

After a few minutes I can also check its resource consumption by running:


kubectl top pod stress
NAME     CPU(cores)   MEMORY(bytes)
stress   101m         131Mi

It's running happily, consuming about 101m of CPU and 131Mi of memory. CPU sits right at the limit and memory stays within it.

A Note about resizePolicy

Once we toggle the InPlacePodVerticalScaling feature gate - all new pods are automatically created with a new field resizePolicy set for each container. If unset - the default is restartPolicy: NotRequired:


    containers:
    - image: progrium/stress
      imagePullPolicy: Always
      name: stress
      resizePolicy:
      - resourceName: cpu
        restartPolicy: NotRequired
      - resourceName: memory
        restartPolicy: NotRequired

This is the config I’m testing in this post. We can also set it to restartPolicy: RestartContainer - which will cause the container to be restarted when the relevant resource type is updated.

Pod QoS Matters

Now let's try to increase our container's limits in-place to give it more resources and see what happens:


kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": { "limits": {"cpu":"300m","memory":"250M"}}}]}}'

Oops! That didn't work!
We're getting:


The Pod "stress" is invalid: metadata: Invalid value: "Guaranteed": Pod QoS is immutable

So what we now know is that while we can change the values of limits and requests - we can't change the pod QoS class. I.e. the relationship between the requests and the limits has to follow the QoS class - if we start with Guaranteed - we can’t change requests and limits separately. And if we start with Burstable - we can’t set limits equal to requests.
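
To make the rule concrete, here is a rough sketch of how the QoS class falls out of requests and limits. It is a simplification for a single-container pod and compares values as literal strings (so 0.1 and 100m would not count as equal here); qos_class is a hypothetical helper, not something kubectl provides.

```shell
# qos_class CPU_REQ CPU_LIM MEM_REQ MEM_LIM
# Pass an empty string "" for any value that is not set.
# Simplified single-container view of the QoS classes.
qos_class() {
  cpu_req=$1; cpu_lim=$2; mem_req=$3; mem_lim=$4
  # Nothing set at all: BestEffort.
  if [ -z "$cpu_req$cpu_lim$mem_req$mem_lim" ]; then
    echo "BestEffort"
    return
  fi
  # Limits set for both resources and requests equal to them: Guaranteed.
  if [ -n "$cpu_lim" ] && [ -n "$mem_lim" ] &&
     [ "$cpu_req" = "$cpu_lim" ] && [ "$mem_req" = "$mem_lim" ]; then
    echo "Guaranteed"
    return
  fi
  # Anything in between: Burstable.
  echo "Burstable"
}

qos_class 100m 100m 150M 150M   # the pod as created: Guaranteed
qos_class 100m 300m 150M 250M   # after a limits-only patch: Burstable
```

The limits-only patch would have moved the pod from Guaranteed to Burstable, which is exactly what the API server refused.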

Updating the Resources

Let's try to update both the requests and the limits while staying within the Guaranteed QoS:


kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"cpu":"300m","memory": "250M"}, "limits": {"cpu":"300m","memory":"250M"}}}]}}'
pod/stress patched

If we now watch kubectl top pod stress we will see how the container gradually gets the additional CPU time.

The CGroups Behind the Scenes

Now, being the curious cat that I am - I wanted to check how this works behind the scenes. I know there are cgroups involved in setting container resource restrictions but I like checking myself how stuff works.
The great thing with k3d is it's very easy to get into your nodes with a simple docker exec.


docker exec -it k3d-pod-resize-server-0 sh

Now I want to find my container and identify the path to its cgroup definition.
Find the container ID using ctr - the containerd command-line utility:


ctr c ls | grep stress
a4ad15ff9c7a71a0f1c34cdce9d1ae9d18ebd4e7b01f3c92ee796e5180729460    docker.io/progrium/stress:latest                       io.containerd.runc.v2

and then - find the cgroup information for my container:


ctr c info a4ad15ff9c7a71a0f1c34cdce9d1ae9d18ebd4e7b01f3c92ee796e5180729460 | grep cgroup

which will give me something like:


"destination": "/sys/fs/cgroup",
                "type": "cgroup",
                "source": "cgroup",
            "cgroupsPath": "/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/a4ad15ff9c7a71a0f1c34cdce9d1ae9d18ebd4e7b01f3c92ee796e5180729460",
                    "type": "cgroup"

The important parts here are /sys/fs/cgroup where all the cgroup definitions are found and the cgroupsPath - where the specific constraints for this container are defined.

You'll notice there's a hierarchy there - first we have the pod... directory and then - the directory named as the container id. This being a single-container pod - all the cgroup values will be featured in the parent folder. So that's where we're going to look.


cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/memory.max

249999360

That's right - roughly 250 MB of memory, in bytes!
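
The value is a little under 250,000,000 because memory.max is stored in whole pages. A minimal sketch of the arithmetic, assuming a 4096-byte page size and rounding down (memory_max_bytes is a hypothetical helper name):

```shell
# memory_max_bytes MEGABYTES [PAGE_SIZE]
# Converts a decimal-megabyte limit (the "M" suffix) to bytes and rounds
# it down to a multiple of the page size (assumed to be 4096 bytes).
memory_max_bytes() {
  megabytes=$1
  page_size=${2:-4096}
  bytes=$(( megabytes * 1000 * 1000 ))
  echo $(( bytes / page_size * page_size ))
}

memory_max_bytes 250   # 249999360
memory_max_bytes 150   # 149999616
memory_max_bytes 100   # 99999744
```

The same arithmetic reproduces the other memory.max values that show up later in this post.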


cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/cpu.max

30000 100000

And that's correct too! According to the Red Hat documentation:

The first value is the allowed time quota in microseconds for which all processes collectively in a child group can run during one period. The second value specifies the length of the period.
During a single period, when processes in a control group collectively exhaust the time specified by this quota, they are throttled for the remainder of the period and not allowed to run until the next period.
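
Put differently, the quota is the CPU limit expressed as a share of the period. A small sketch of that conversion, assuming the default 100000 microsecond period seen above (cpu_max is a hypothetical helper name):

```shell
# cpu_max MILLICORES [PERIOD_US]
# Prints the "quota period" pair that corresponds to a CPU limit given
# in millicores, for a period in microseconds (default 100000).
cpu_max() {
  millicores=$1
  period_us=${2:-100000}
  quota_us=$(( millicores * period_us / 1000 ))
  echo "$quota_us $period_us"
}

cpu_max 300   # 30000 100000
cpu_max 100   # 10000 100000
```

So a 300m limit lets the container run for 30 ms out of every 100 ms before it is throttled.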

Impact on Scheduling

Another thing I wanted to try is to update requests to more than my node can give and check if the scheduler will try to reschedule my pod to another node because the current one doesn't have the needed capacity.

Let's check how many cpus my node has access to:


kubectl get node -ojsonpath="{ .items[].status.allocatable.cpu } cpus"
8 cpus%

I got 8. So let's try to request 10 and see what happens:


kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"cpu": "10"}, "limits": {"cpu":"10"}}}]}}'
pod/stress patched

Alas, while the requests got updated - nothing else happens. The pod doesn't get rescheduled or evicted. Why? No idea. Had I tried creating it with a 10 CPU request from the beginning - it would have stayed Pending because there aren't any nodes large enough. So I would expect a pod with requests higher than a node can satisfy to get evicted. But maybe my thinking is flawed?

Actually according to the official documentation - there shouldn’t be any scheduling impact. Instead the Pod status field should reflect that current resize request is “Infeasible”. Let’s check that:


kubectl get pod stress -ojsonpath="Current resources: { .status.containerStatuses[0].allocatedResources }. Resize is { .status.resize }"
Current resources: {"cpu":"300m","memory":"150M"}. Resize is Infeasible%

Yes - it’s reflected correctly in the status field.

Still - we now have a pod that doesn’t abide by its spec. Which is puzzling and could lead to unexpected reliability issues.
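
The resize status can be read as a simple decision on the node. The sketch below is only a rough mental model, assuming a CPU-only check and the status names from the feature's documentation (InProgress, Deferred, Infeasible); the real kubelet logic considers more than this, and resize_status is a hypothetical helper.

```shell
# resize_status REQUESTED_M ALLOCATABLE_M AVAILABLE_M   (all in millicores)
# REQUESTED_M   - the new CPU request for the pod
# ALLOCATABLE_M - the node's total allocatable CPU
# AVAILABLE_M   - CPU the node could give this pod right now
resize_status() {
  requested=$1; allocatable=$2; available=$3
  if [ "$requested" -gt "$allocatable" ]; then
    echo "Infeasible"   # can never fit on this node
  elif [ "$requested" -gt "$available" ]; then
    echo "Deferred"     # could fit later, once other pods free resources
  else
    echo "InProgress"   # accepted, being applied
  fi
}

resize_status 10000 8000 7000   # 10 CPUs on an 8 CPU node: Infeasible
```

In none of these cases does the pod move to another node - the request simply stays unfulfilled.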

Negating Resources

Until now all worked fine because we were only adding resources. Everybody likes having more stuff, nobody likes when stuff is taken away from them.

Let's start by taking back the CPU time we granted in the previous section:


kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"cpu":"100m"}, "limits": {"cpu":"100m"}}}]}}'
pod/stress patched

I'm bringing the CPU requests back to 100m. Quite expectedly in a couple of seconds kubectl top will show me that pod cpu consumption went down to 100m.
And the cgroup cpu.max file will get updated as expected:


cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/cpu.max
10000 100000

But what if I try to reduce memory?


kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"memory": "150M"}, "limits": {"memory":"150M"}}}]}}'
pod/stress patched

Seems to work fine. Checking the cgroups I see the config has been updated:


cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/memory.max
149999616

And what if I need to free even more memory?


kubectl patch pod stress -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"memory": "100M"}, "limits": {"memory":"100M"}}}]}}'
pod/stress patched

Note that I'm reducing memory to 100M which should cause my container to get OOMKilled. And it seems to work:


kubectl get pod stress -ojsonpath="{ .spec.containers[0].resources }"

{"limits":{"cpu":"100m","memory":"100M"},"requests":{"cpu":"100m","memory":"100M"}}

But I see that the pod continues running!


kubectl get pod
NAME     READY   STATUS    RESTARTS   AGE
stress   1/1     Running   0          21m

And checking the cgroup memory.max file shows why:


cat /sys/fs/cgroup/kubepods/podaa80f5b5-d68b-4ab6-ac38-df493310068b/memory.max
149999616

The cgroup wasn't updated! Looks like something is getting in our way - protecting the container from getting less memory than it's already using. While this makes sense as a precaution - taking away memory from a running process may lead to irreversible corruption - this now leads to container limits holding an incorrect value which will surely puzzle anyone trying to understand why it's not getting OOMKilled.

I would expect some validating admission hook to tell me that memory can't be reduced. Looks like a bug to me.

Changing the resizePolicy

But what if we allow container restarts? Will the cgroup for memory get updated then?

It’s not possible to change the resizePolicy for an existing pod, so let’s create a new one:


apiVersion: v1
kind: Pod
metadata:
  name: restart
spec:
  containers:
  - image: progrium/stress
    args: ["--cpu", "1", "--vm", "1", "--vm-bytes", "128M", "--vm-hang", "3"]
    name: restart
    resizePolicy:
      - resourceName: cpu
        restartPolicy: NotRequired
      - resourceName: memory
        restartPolicy: RestartContainer
    resources:
      requests:
        memory: 150M
        cpu: 100m
      limits:
        memory: 150M
        cpu: 100m
        

Apply this spec by:


kubectl create -f https://raw.githubusercontent.com/perfectscale-io/inplace-pod-resize/main/restart.yaml

And now let’s reduce the memory for that restart container:


kubectl patch pod restart -p '{"spec" : { "containers" : [{"name" : "restart", "resources": {"requests": {"memory": "100m"}, "limits": {"memory":"100m"}}}]}}'

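I’m setting the memory to 100m, which is too low. (Note the lowercase m: in Kubernetes quantity notation it is the milli suffix, so this is 0.1 bytes rather than the 100 megabytes that 100M would give.)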
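
Quantity suffixes are case-sensitive, so 100m and 100M are very different values. A minimal sketch of the conversion, covering only a handful of suffixes and whole-number prefixes; rounding the milli case up to a whole byte is an assumption of this sketch, and quantity_to_bytes is a hypothetical helper name.

```shell
# quantity_to_bytes QUANTITY
# Converts a small subset of Kubernetes-style memory quantities to bytes.
# Only integer prefixes are supported.
quantity_to_bytes() {
  q=$1
  case $q in
    *Ki) echo $(( ${q%Ki} * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${q%Gi} * 1024 * 1024 * 1024 )) ;;
    *k)  echo $(( ${q%k} * 1000 )) ;;
    *M)  echo $(( ${q%M} * 1000 * 1000 )) ;;
    *G)  echo $(( ${q%G} * 1000 * 1000 * 1000 )) ;;
    *m)  echo $(( (${q%m} + 999) / 1000 )) ;;  # milli, rounded up to a whole byte
    *)   echo "$q" ;;                          # plain bytes
  esac
}

quantity_to_bytes 100M   # 100000000
quantity_to_bytes 100m   # 1
```

With a limit this small the container cannot even start, which fits the error message shown further down.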


kubectl get pod restart -ojsonpath="{ .status.resize }"
InProgress

Pod status shows us that the resize request was actually received. And after a while the container gets restarted, quite expectedly fails with RunContainerError and then goes into CrashLoopBackOff. kubectl describe pod restart shows us that the kubelet has restarted the container but it got OOMKilled:


  Normal   Created    21s (x5 over 2m14s)  kubelet            Created container restart
  Warning  Failed     21s (x4 over 71s)    kubelet            Error: failed to create containerd task: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: container init was OOM-killed (memory limit too low?): unknown
  

The puzzling thing about this is that when we look at the cgroup for the pod we see that the memory limit doesn't get updated. So it’s not totally clear what triggers the OOMKill:


docker exec -it k3d-pod-resize-server-0 sh
# cat  /sys/fs/cgroup/kubepods/pod31cfb3e5-aaa0-424f-bbd2-1ec9e77cafc1/memory.max
149999616

Still 150 MB 🤷

Saving Hungry Pods

Ok, we found out that memory being an incompressible resource - we can't really reduce it in-place to a value lower than what the container is already using.

But can we save an OOMing container by giving it more memory?

Let's try that with a similar pod but one that gets only 100M of memory from the get go (while trying to allocate 128):


apiVersion: v1
kind: Pod
metadata:
  name: hungry
spec:
  containers:
  - image: progrium/stress
    args: ["--cpu", "1", "--vm", "1", "--vm-bytes", "128M", "--vm-hang", "3"]
    name: stress
    resources:
      requests:
        memory: 100M
      limits:
        memory: 100M

kubectl create -f https://raw.githubusercontent.com/perfectscale-io/inplace-pod-resize/main/hungry.yaml

Quite expectedly the container gets OOMKilled almost instantly:


kubectl get pod hungry
NAME     READY   STATUS      RESTARTS     AGE
hungry   0/1     OOMKilled   1 (5s ago)   8s

And it will continue restarting and getting OOMKilled until we update its memory limits. So let's save it from this misery by giving it the memory it needs:


kubectl patch pod hungry -p '{"spec" : { "containers" : [{"name" : "stress", "resources": {"requests": {"memory": "200M"}, "limits": {"memory":"200M"}}}]}}'
pod/hungry patched

This seems to work fine:


kubectl get pod hungry -ojsonpath="{ .spec.containers[0].resources }"
{"limits":{"memory":"200M"},"requests":{"memory":"200M"}}%

But the pod continues getting killed:

kubectl get pod hungry
NAME     READY   STATUS      RESTARTS      AGE
hungry   0/1     OOMKilled   4 (33s ago)   60s

And if we check the cgroup memory.max file we'll see why:

cat /sys/fs/cgroup/kubepods/burstable/pod708b8195-0ca0-45e0-9f2b-015f679c98da/memory.max
99999744

Its memory limit never actually got updated!
Why? I wasn't able to find an answer for this one. Why disallow saving containers from getting killed by providing them the memory they need? I'm not aware of any technical limitations that would prevent this and I also didn't find anything in the KEP docs.

So it looks like the only way to fix the OOMKill is still by deleting the pod and creating a new one with more memory.

Summary

In-place pod resizing is a long-awaited feature. Still in alpha since v1.27, it will hopefully make it to beta by v1.32 - if the drawbacks and bugs get fixed.
Here are some of the ones I found:

  • Memory can't be reduced lower than currently used (either with or without container restarts). But there's no notification about that.
  • Giving more resources than available on the node doesn't lead to pod eviction (true for both CPU and Memory)
  • If a pod is getting OOMKilled - it's not possible to give it more memory to save it from getting killed.

Will these eventually get fixed? I certainly hope so. Will the feature make it to beta by v1.32? Let's keep our fingers crossed.

Something in this post isn't clear or correct? Let me know in the comments.
