Serving Large Language Models on Kubernetes without GPU

TL;DR

It’s possible to serve some LLMs for inference on Kubernetes without GPU. The easiest way to do this is by using Ollama with the qwen2.5 family of models.

Got GPUs?

A couple of years ago when the whole AI/ML buzz really exploded I had a discussion with a friend. We were contemplating if in the end real corporations will want to run their AI/ML workloads themselves or if they’d rather prefer to use managed services like the ones provided by OpenAI, Anthropic and the rest. Because after all - running this yourself is expensive, it requires expertise and also - yes - GPUs. Quite a lot of them.

Today - we can already see it’s the mixture of both - which means a lot of in-house ML infrastructure is getting built. People are already paying a lot of money for these hard to get hardware accelerators.

Now if you’re geeky - you too surely want to play with them ML/AI thingies - this is where the bleeding edge is, and there’s definitely magic about it that’s a magnet for creative souls. But what if you don’t have the money for a GPU right now and no corporate pockets to reach your hand into? Can we run models on CPU only?

Or maybe - you do work at a company and you want to prototype some ML or even run a relatively small model for some limited needs. And you want to do so without provisioning the expensive GPU-capable cloud nodes. And also do it on existing Kubernetes infrastructure - because that’s what you’re already using for the rest of your stack.

Is it possible?

Well - the answer is yes - and in this post I will show you how.

Enter Ollama

There are a lot of ways to train and serve models on (and off) Kubernetes. Frameworks, schedulers, packagers… It’s easy to get confused. I plan to cover some of the most prominent approaches and tools in the upcoming posts. But without any doubt the easiest way to get started with running LLMs is Ollama.

Ollama was built by ex-Docker employees so it also feels very intuitive if you’ve ever used Docker to run and build containers. The idea is that models get packaged into a standard bundle by providing a configuration script called Modelfile. Then they can be easily executed and served using the ollama executable.

Ollama can be installed on your own laptop - for details on how to do that - follow the official documentation.

But in order to run it on Kubernetes we’ll want to use the official ollama/ollama container image.

Or even better - install the community-managed Ollama helm chart.

Let’s get started, shall we?

Wait, But What About GPUs?

The nice thing about Ollama is that it tests for GPU availability when loading the process and then tells the model whether to use GPU or CPU. So it can run on our GPU-less computer without any additional configuration. But will every model actually run on CPU only?

The answer is rather no than yes - i.e. while many larger models can technically be served on CPU - they will be so slow as to border on unusable.

But the good news is there are other - smaller models that can be executed on CPU and give satisfactory response times, One such model is qwen2.5 - it’s a model released by the team at AliBaba cloud which has a 0.5B variant (which means it was trained on half a billion parameters) with file size less than 400Mb. But even more exciting for us geeks is the qwen2.5-coder model, which was optimized for code generation, reasoning and fixing.

And that’s what we’re going to use. Now, finally, let’s get started!

‍

Get a Kubernetes Cluster

I’m running all this on my old MacBook Air M1 - with 16Gb memory and no GPUs. The lightest way to run a Kubernetes cluster on it is by using k3d - a user-friendly wrapper around the k3s distro that provisions Kubernetes nodes in containers.

So first things first - install k3d using one of the official installation methods.

Now let’s get a minimal cluster:

k3d cluster create llmcluster –network host

This will create a single-node cluster that we can now deploy Ollama to.

Note I’m using --network host to simplify the cluster networking here. That’s because I like to provision only the bare minimum of resources needed for every use case. (Did somebody say “optimization”?)

Deploy Ollama with Helm

Once the cluster is up and running, it’s time to deploy ollama.

Ollama can run multiple models in parallel, but prior to that they need to be pulled from the model registry (just like containers, right?) here. Ollama’s API features a /pull endpoint exactly for that, but the chart allows us to pre-pull models when it’s installed.

Deploy Ollama with the qwen2.5-coder 0.5B variant:

helm upgrade -i ollama ollama-helm/ollama \
             -n ollama --wait \
             --set ollama.models={"qwen2.5-coder:0.5b"} \
             --create-namespace

The --wait flag waits until the Ollama pod pulls the model and becomes ‘'Ready”. Once helm returns a prompt, we can start talking to the model, asking it to do stuff for us.

If we look at the ollama pod logs we'll see how it tests for GPU availability and falls back to CPU when no GPU is found:

time=2024-11-24T13:28:35.551Z level=INFO source=common.go:49 msg="Dynamic LLM libraries" runners="[cpu cuda_v11 cuda_v12]"
time=2024-11-24T13:28:35.552Z level=INFO source=gpu.go:199 msg="looking for compatible GPUs"
time=2024-11-24T13:28:35.553Z level=INFO source=gpu.go:347 msg="no compatible GPUs were discovered"
time=2024-11-24T13:28:35.553Z level=INFO source=types.go:107 msg="inference compute" id=0 library=cpu variant="no vector extensions" compute="" driver=0.0 name="" total="1.9 GiB" available="1.0 GiB"

Generate Kubernetes YAML with an LLM Deployed on Kubernetes:

Ok, let’s put our model to work now!

Port forward to the Ollama pod that listens on port 11434 by default:

kubectl port-forward -n ollama svc/ollama 11434:11434‍

And now let’s ask for some yaml:

curl localhost:11434/api/generate  -d '{ "model": "qwen2.5-coder:0.5b", "system": "You are a code generation tool", "prompt": "Generate YAML to deploy Ollama on Kubernetes as a StatefulSet with a volumeClaimTemplate. Do not be verbose, only generate the code ", "stream": false }' | jq .response | sed 's/\\n/\n/g'

The call to `/generate` returns a JSON object out of which we only want the response field. I’m using `jq` to parse it out and `sed` to unescape all the newlines and here’s what I got on my first attempt:

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama-statefulset
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama-container
        image: ollama/llama:latest
        ports:
        - containerPort: 8081
        volumeMounts:
          - name: ollama-volume
            mountPath: /var/run/ollama
      volumes:
      - name: ollama-volume
        persistentVolumeClaim:
          claimName: ollama-pvc

This YAML code generates a Kubernetes StatefulSet named `ollama-statefulset` with three replicas. The container runs the Ollama LLM model on port 8081 and uses a PersistentVolumeClaim (PVC) named `ollama-pvc` to persist data."

Needless to say - this isn’t exactly what I asked for. But then, I haven’t really invested into engineering my prompt. Or maybe the model doesn’t know about volumeClaimTemplates because it ignored my request for them on a few subsequent attempts too.

I’ll let you experiment with this, and do let me know if you succeed in improving my results.

LLM Performance (Without GPUs)

But for now let’s discuss the performance of the model on my modest 8-core M1 CPU:

 Time      Time     
 Total      Spent    
 0:00:05    0:00:05

All in all - it takes about 5 seconds to generate this YAML. Not too much value, but also not too much CPU time, right?

I also tried to give the model a more challenging task with the following prompt: “Generate a Terraform script to provision an EKS cluster, then deploy Ollama on it as a Helm release of the ollama-helm/ollama chart. Ollama should be deployed as a StatefulSet with a volumeClaimTemplate.”

I won’t even paste the resulting code here as it’s absolutely unusable in its initial state. That prompt took about 20 seconds to process. Again, I definitely am not the best prompt programmer out there - so feel free to give me tips on how to improve my prompting.

But what I’m trying to say is that it’s definitely possible to serve GenAI models on Kubernetes without any GPU - if it’s for fun or profit. Mostly for small, limited scope tasks. For more serious stuff - you’ll definitely need GPUs. We will be exploring more on how and what types of GPUs best to use - depending if it’s performance or efficiency you’re focused on.

Watch this space!

And if you’re serious about running your Kubernetes clusters with optimal performance at the lowest possible cost - make sure to try PerfectScale now. Enable ML-based automated optimization for CPU and memory today. And get GPU optimization insights and recommendations in the upcoming release!