First attempt at Ceph RBD on Kubernetes in LXC containers

Why I wanted this

I had a Kubernetes cluster running in LXC containers on Proxmox and wanted storage for it. Pods come and go, but a lot of what they hold shouldn’t: a database, the in-cluster docker registry. For that you want PersistentVolumes, ideally dynamic: a pod asks for 5GB via a PersistentVolumeClaim, and a volume gets provisioned and attached automatically, instead of me manually carving out disks every time.

I already run Ceph on my Proxmox cluster (it’s what PVE uses for shared VM/LXC storage, so the OSDs and mons were already there). Ceph’s RBD (RADOS Block Device) hands out network block devices backed by the Ceph pool, which is the shape Kubernetes wants for a block-mode PV. Plan: point K8s at my existing Ceph cluster, let it carve out RBD images on demand.

The provisioner

The OpenShift docs have a decent walkthrough of dynamic RBD provisioning, but they use the in-tree kubernetes.io/rbd provisioner, which has a known annoyance: the controller-manager pod needs the rbd binary to provision images, and the upstream image doesn’t ship it. So I used the out-of-tree provisioner from kubernetes-incubator/external-storage instead, which registers itself as ceph.com/rbd rather than kubernetes.io/rbd. It runs as its own pod with rbd baked in, so provisioning stops caring about what’s in the controller-manager image. I followed its deploy README: the RBAC, the provisioner deployment, a StorageClass pointing at my mons and pool, and the secret with the Ceph admin key.

The provisioner came up, and when I created a PVC, it carved out an RBD image in the pool and bound the PV. Progress!
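For reference, here is a rough sketch of what the StorageClass and a matching claim might look like, written as two small shell functions that print the manifests. Only the ceph.com/rbd provisioner name and the kube pool (visible in the image names in the logs further down) come from my setup. The class name, secret name, namespace, mon addresses and claim name are placeholders, and the parameter keys are the ones I remember from the provisioner's example class, so check them against its README.

```shell
# Sketch only. Placeholders (NOT from the real setup): class name "rbd",
# secret "ceph-admin-secret", namespace "kube-system", the 192.0.2.x mons,
# and the claim name "test-claim".

# storageclass_manifest MONS POOL: print a StorageClass for ceph.com/rbd.
storageclass_manifest() {
  cat <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: rbd
provisioner: ceph.com/rbd
parameters:
  monitors: "$1"
  pool: "$2"
  adminId: admin
  adminSecretName: ceph-admin-secret
  adminSecretNamespace: kube-system
  userId: admin
  userSecretName: ceph-admin-secret
  userSecretNamespace: kube-system
  imageFormat: "2"
  imageFeatures: layering
EOF
}

# pvc_manifest NAME SIZE CLASS: print a PersistentVolumeClaim.
pvc_manifest() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: $1
spec:
  storageClassName: $3
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: $2
EOF
}

# Print both documents; nothing is sent to a cluster here.
storageclass_manifest "192.0.2.11:6789,192.0.2.12:6789" kube
echo "---"
pvc_manifest test-claim 5Gi rbd
```

Piping the output into kubectl apply -f - would create both objects. Note that Kubernetes sizes are written 5Gi rather than 5GB.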

But did it work?

The provisioner creates the image. But to attach it to a pod, the kubelet on the node has to map the RBD device and mount it, and the kubelet runs inside my LXC container.

I threw a busybox pod with a PVC at it and watched journalctl -xeu kubelet on the worker node k8s-camphalfblood:

Apr 18 20:00:20 k8s-camphalfblood kubelet[17578]: E0418 20:00:20.106575   17578 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/rbd/kube:kubernetes-dynamic-pvc-b6698063-81ac-11ea-9c84-7e704ff0a88c podName: nodeName:}" failed. No retries permitted until 2020-04-18 20:02:22.106556737 +0000 UTC m=+1808797.868834092 (durationBeforeRetry 2m2s). Error: "MountVolume.WaitForAttach failed for volume \"pvc-61a1950d-7371-4ab0-8891-104b10faae9b\" (UniqueName: \"kubernetes.io/rbd/kube:kubernetes-dynamic-pvc-b6698063-81ac-11ea-9c84-7e704ff0a88c\") pod \"busybox\" (UID: \"c007698e-3205-4412-90ac-9a659129457a\") : fail to check rbd image status with: (executable file not found in $PATH), rbd output: ()"

fail to check rbd image status with: (executable file not found in $PATH). The kubelet wants to shell out to the rbd CLI and it isn’t there. The provisioner pod has rbd baked in, but the node doesn’t. The kubelet runs as a plain host process (well, “host” being the LXC container here), so it needs rbd on its own PATH. Fine, install ceph’s client tools in the container:

apt install ceph-common

That got rbd onto PATH and the error changed, which I’ll take as progress. Then:

Apr 18 20:02:22 k8s-camphalfblood kubelet[17578]: W0418 20:02:22.200800   17578 rbd_util.go:794] rbd: no watchers on kubernetes-dynamic-pvc-b6698063-81ac-11ea-9c84-7e704ff0a88c
Apr 18 20:02:22 k8s-camphalfblood kubelet[17578]: W0418 20:02:22.201465   17578 rbd_util.go:440] rbd: failed to load rbd kernel module:exit status 1
Apr 18 20:02:31 k8s-camphalfblood kubelet[17578]: E0418 20:02:31.254881   17578 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/rbd/kube:kubernetes-dynamic-pvc-b6698063-81ac-11ea-9c84-7e704ff0a88c podName: nodeName:}" failed. No retries permitted until 2020-04-18 20:04:33.254859693 +0000 UTC m=+1808929.017137047 (durationBeforeRetry 2m2s). Error: "MountVolume.WaitForAttach failed for volume \"pvc-61a1950d-7371-4ab0-8891-104b10faae9b\" (UniqueName: \"kubernetes.io/rbd/kube:kubernetes-dynamic-pvc-b6698063-81ac-11ea-9c84-7e704ff0a88c\") pod \"busybox\" (UID: \"c007698e-3205-4412-90ac-9a659129457a\") : Could not map image kube/kubernetes-dynamic-pvc-b6698063-81ac-11ea-9c84-7e704ff0a88c, Timeout after 10s"

The line that ended the evening:

rbd: failed to load rbd kernel module: exit status 1

To map an RBD image the kernel needs the rbd module (and libceph under it) loaded. The kubelet, sitting in my container, tries to modprobe rbd. But you can’t load a kernel module from inside an LXC container, because it shares the host’s kernel instead of getting its own. So modprobe fails, the map fails, and no amount of ceph-common in the container fixes it. The Proxmox forums put it bluntly: “you cannot load kernel modules inside a container, you can mount it on the host and do a bindmount into the container”.
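In hindsight, a small preflight on the node would have shown both problems before the kubelet did. The sketch below checks that rbd is on PATH and that the rbd and libceph modules appear in /proc/modules. I'm assuming the container sees the shared kernel's module list there, and that the modules are loadable rather than built in (a built-in module would not be listed). The function names are my own.

```shell
# have_cmd NAME: succeeds if NAME resolves on the current PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# module_loaded NAME [FILE]: succeeds if NAME is the first field of a line
# in FILE (default /proc/modules). FILE is overridable so this can be
# tested against a sample file.
module_loaded() {
  awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' \
    "${2:-/proc/modules}" 2>/dev/null
}

for mod in libceph rbd; do
  if module_loaded "$mod"; then
    echo "module $mod: loaded"
  else
    echo "module $mod: not loaded (it has to be loaded on the host)"
  fi
done

if have_cmd rbd; then
  echo "rbd binary: found"
else
  echo "rbd binary: missing (install ceph-common)"
fi
```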

Same problem I hit setting the cluster up, where K8s wanted to load the configs module for a preflight check and couldn’t.

The retries then made it worse:

Apr 18 20:05:22 k8s-camphalfblood kubelet[17578]: E0418 20:05:22.696654   17578 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/rbd/kube:kubernetes-dynamic-pvc-b6698063-81ac-11ea-9c84-7e704ff0a88c podName: nodeName:}" failed. ... : rbd image kube/kubernetes-dynamic-pvc-b6698063-81ac-11ea-9c84-7e704ff0a88c is still being used"

At some point during the failed attempts the image got mapped (or it was me force-mounting it on the host and passing it through a bindmount to try it out), which left a watcher behind on the Ceph side, so the next attempt refused to map it again. The kubelet checks for existing watchers first and bails rather than risk two mappers corrupting the image (rbd status <image> lists who’s holding it open). Deleting the pod and recreating it got me the same thing on the second try:
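To see whether a stale watcher is the culprit, counting the watcher lines in the rbd status output is enough. This is a minimal sketch: the sample text is illustrative rather than captured from my cluster, and it assumes each watcher is reported on a line containing watcher=.

```shell
# count_watchers: read `rbd status` output on stdin and print how many
# watcher lines it contains. grep -c exits non-zero on zero matches, so
# the status is forced to success while the printed count stays "0".
count_watchers() {
  grep -c 'watcher=' || true
}

# Illustrative sample, NOT real output from the cluster in this post.
sample='Watchers:
	watcher=192.0.2.21:0/123456789 client.4567 cookie=1'

printf '%s\n' "$sample" | count_watchers
```

On a node with the Ceph client configured, the real thing would be rbd status kube/<image> | count_watchers.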

Apr 18 20:09:05 k8s-camphalfblood kubelet[17578]: E0418 20:09:05.521497   17578 nestedpendingoperations.go:301] Operation for ... pod "busybox" (UID: "46707448-e4ef-4fa1-b6a8-87192ed8c922") : rbd image kube/kubernetes-dynamic-pvc-b6698063-81ac-11ea-9c84-7e704ff0a88c is still being used"

The pod’s own describe output:

  Normal   Scheduled               <unknown>           default-scheduler           Successfully assigned default/busybox to k8s-camphalfblood
  Normal   SuccessfulAttachVolume  13m                 attachdetach-controller     AttachVolume.Attach succeeded for volume "pvc-61a1950d-7371-4ab0-8891-104b10faae9b"
  Warning  FailedMount             57s (x10 over 12m)  kubelet, k8s-camphalfblood  MountVolume.WaitForAttach failed for volume "pvc-61a1950d-7371-4ab0-8891-104b10faae9b" : rbd image kube/kubernetes-dynamic-pvc-b6698063-81ac-11ea-9c84-7e704ff0a88c is still being used
  Warning  FailedMount             7s (x6 over 11m)    kubelet, k8s-camphalfblood  Unable to attach or mount volumes: unmounted volumes=[test-pvc], unattached volumes=[test-pvc default-token-5nqk5]: timed out waiting for the condition

AttachVolume.Attach succeeded: the attach controller is happy and the API says it’s attached. But MountVolume.WaitForAttach is the node trying to rbd map the image, and that never completes.
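The three failure modes from this evening each leave a distinct string in the kubelet log, so they can be told apart mechanically. A small sketch that maps a log line to the cause follows; the matched strings are copied from the logs above, while the function name and the wording of the verdicts are mine.

```shell
# diagnose LINE: print a one-line verdict for a kubelet log line.
# Returns 1 and prints nothing if the line matches none of the patterns.
diagnose() {
  case "$1" in
    *"executable file not found in \$PATH"*)
      echo "rbd binary missing on the node" ;;
    *"failed to load rbd kernel module"*)
      echo "rbd kernel module cannot be loaded from this node" ;;
    *"is still being used"*)
      echo "image already has a watcher" ;;
    *)
      return 1 ;;
  esac
}

diagnose 'fail to check rbd image status with: (executable file not found in $PATH)'
diagnose 'rbd: failed to load rbd kernel module:exit status 1'
diagnose 'rbd image kube/some-image is still being used'
```

Feeding it the real journal would look something like journalctl -u kubelet --no-pager | while IFS= read -r line; do diagnose "$line" || true; done.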

So, alternatives?

Two options:

Sort it out on the host. Load rbd (and libceph) on the Proxmox host: modprobe rbd there, persisted in /etc/modules. Then expose the /dev/rbd* device nodes into the LXC container the way you’d expose any device to a privileged container: an lxc.cgroup.devices.allow rule plus a mount entry for the device node (the container’s already privileged for K8s anyway). The container keeps ceph-common so the kubelet’s rbd binary works, while the module lives on the host. Whether the device-node plumbing ends up clean enough to bother with, I don’t know yet.
Or just don’t run the node in LXC. Run the K8s worker as a proper KVM VM, eat the overhead I was trying to avoid, and let it have its own kernel that it’s allowed to modprobe into. RBD then “just works” the way every guide assumes you’re on a real machine.
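For the first option, the container config lines could be generated on the host rather than typed by hand. The following is an untested sketch under several assumptions: that the module is already loaded on the host, that it registers its block major under the name rbd in /proc/devices, and that the container config uses the key: value form Proxmox writes. On a cgroup v2 host the key would be lxc.cgroup2.devices.allow instead. It only prints the lines; whether the kubelet is then happy mapping images from inside the container is exactly the part I haven't verified.

```shell
# rbd_major [FILE]: print the block major registered as "rbd" in FILE
# (default /proc/devices). Assumes the module uses that exact name.
rbd_major() {
  awk '/^Block devices:/ { blk = 1; next }
       blk && $2 == "rbd" { print $1; exit }' "${1:-/proc/devices}"
}

# lxc_device_lines MAJOR DEVNODE: print a device allow rule for the whole
# major and a bind-mount entry for one device node.
lxc_device_lines() {
  echo "lxc.cgroup.devices.allow: b $1:* rwm"
  echo "lxc.mount.entry: $2 ${2#/} none bind,optional,create=file"
}

major=$(rbd_major 2>/dev/null || true)
if [ -n "$major" ]; then
  lxc_device_lines "$major" /dev/rbd0
else
  echo "no rbd major found; is the module loaded on this host?"
fi
```

The printed lines would go into the container's config on the Proxmox host, after running modprobe rbd there.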

There’s also the newer Ceph CSI / rbd-kubernetes path in the Ceph docs, which may sidestep some of the provisioner mess, but it still has to rbd map on the node eventually, so I don’t think it escapes the kernel-module problem on an LXC node either. One for next time.

I went with just running k3s in a QEMU VM instead of LXC. Thankfully it’s all configured through NixOS, so there wasn’t much to set up again, just certs and CLI config files.
