
Adding a GPU Node to a K3S Cluster

I recently wanted to add a GPU node to my K3S cluster and found the documentation a little lacking, so I wanted to quickly capture how I did it so that, should I need to do it again, I can refer back to it. And if anyone else finds it useful too, then all the better.

Installing the node.

This is my first dive into working with AI to build software. Like everyone else I have been very impressed with ChatGPT and all the other buzz around AI for the last year or so, and have used it quite a lot. What I have not done, though, is try to integrate it into my own apps. When I started looking at doing this I was disappointed with the performance of the GPU in my laptop due to its limited VRAM, I was not keen on watching the cost of using the ChatGPT API shoot up, and I knew, based on what I wanted to do, that I would hit the rate limits quickly. Instead I decided to build a dedicated server in my home lab to experiment with running my own Large Language Models.

I began by scrounging an old desktop from a friend. It is an 8-core i7 running at 3.6GHz with 16GB of RAM, to which I added an Nvidia RTX 4070 Ti Super graphics card with 16GB of VRAM. This seemed to be about the best value for money in terms of performance and available VRAM. I then installed Ubuntu 22.04.4 and began setting it up to join the K3S cluster.

Installing the Nvidia software.

After setting up SSH, the first thing I did was install the Nvidia software. This was the easy part, as you can just follow the official Nvidia documentation. Start by setting up APT:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

And make sure you enable the experimental features.

sudo sed -i -e '/experimental/ s/^#//g' /etc/apt/sources.list.d/nvidia-container-toolkit.list

Update apt and add the required packages.

sudo apt-get update && sudo apt-get install -y nvidia-container-toolkit nvidia-container-runtime cuda-drivers-fabricmanager-535 nvidia-headless-535-server nvidia-utils-535-server
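
Before touching any containers it is worth checking that the driver itself is happy on the host. Running nvidia-smi should list the card and the driver version; if it complains that it cannot communicate with the driver, a reboot so the new kernel module loads usually sorts it out.

nvidia-smi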

Setting up K3S

Next install K3S and join it to the cluster

sudo curl -sfL https://get.k3s.io | K3S_URL=https://10.1.0.21:6443 K3S_TOKEN=<token> sh -s -
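
Once the install completes, the new node should show up from any machine that already has kubectl access to the cluster.

kubectl get nodes -o wide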

Next we need to update the containerd runtime to recognise the GPU

sudo nvidia-ctk runtime configure --runtime=containerd
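
K3S runs its own bundled containerd, so it is also worth checking that the nvidia runtime shows up in the config K3S generates. The path below is the standard K3S agent location; restarting the agent forces the config to be regenerated.

sudo systemctl restart k3s-agent
sudo grep -A 3 nvidia /var/lib/rancher/k3s/agent/etc/containerd/config.toml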

At this point it is worth verifying that you can run a container on the containerd runtime

sudo ctr image pull docker.io/nvidia/cuda:12.3.2-base-ubuntu22.04

sudo ctr run --rm --gpus 0 -t docker.io/nvidia/cuda:12.3.2-base-ubuntu22.04 cuda-12.3.2-base-ubuntu22.04 nvidia-smi

This should prove that the container can see the GPU. Now all that is left is to configure K3S to use it so we can run our AI containers in Kubernetes. Start by creating a RuntimeClass manifest.

apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia

And deploy it to your cluster

kubectl apply -f nvidia-runtimeclass.yaml
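
A quick check that the RuntimeClass was created:

kubectl get runtimeclass nvidia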

Now we need to deploy the nvidia-device-plugin DaemonSet. Because this is the only node in my cluster that has a GPU, I did not want the DaemonSet to be deployed to all nodes, so I first added a label to my new node

kubectl label nodes rdg-clust-ai workload=ai
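
You can confirm the label took effect by filtering the node list on it; only the new GPU node should come back.

kubectl get nodes -l workload=ai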

And I edited the nvidia-device-plugin manifest, available from the NVIDIA k8s-device-plugin GitHub repository, to add a node affinity matching my label and to use my new runtime class

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      runtimeClassName: nvidia
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: workload
                operator: In
                values:
                - "ai"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      priorityClassName: "system-node-critical"
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.5
        name: nvidia-device-plugin-ctr
        env:
          - name: FAIL_ON_INIT_ERROR
            value: "false"
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
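
Deploy the edited manifest and check that the plugin pod lands on the labelled node. I am assuming here that the edited manifest was saved as nvidia-device-plugin.yaml; use whatever file name you chose.

kubectl apply -f nvidia-device-plugin.yaml

kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds -o wide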

Once this is deployed and running on the new node, you can verify that your node is ready by running ‘kubectl describe node <nodeName>’ and checking that ‘nvidia.com/gpu’ is listed under both Capacity and Allocatable. If it is, you are ready to deploy a pod that has access to your GPU.
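
The same check can also be scripted. For example, this jsonpath query should print 1 for the node labelled earlier (substitute your own node name).

kubectl get node rdg-clust-ai -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'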

Deploying your first GPU pod

I used the following simple test manifest to check that pods could access the GPU. Note that the runtime class is referred to again.

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  runtimeClassName: nvidia
  containers:
  - name: cuda-test
    image: nvidia/cuda:12.3.2-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
  restartPolicy: OnFailure

The pod should be scheduled onto your new GPU node, and upon inspecting the logs you should see the output of ‘nvidia-smi’.
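
Assuming the manifest above is saved as gpu-test.yaml, apply it and read the pod's logs once it has completed.

kubectl apply -f gpu-test.yaml

kubectl logs gpu-test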
