Kubernetes in LXC on Proxmox

Backstory

I've been working with Kubernetes for the past few months at work, to run our web-serving and internal applications. We use AWS's Elastic Kubernetes Service (EKS), their managed Kubernetes control plane offering. It's easy to set up and reliable enough to let folks like me, new to Kubernetes, run production workloads without too much trouble. We don't have to worry about managing critical control-plane services; we just have to supply our own worker nodes and our application stack.

But to really get stuff done on EKS, a deeper understanding of the Kubernetes core is necessary. There are quite a few rough edges, and when something goes wrong or you feel like you're missing an obvious feature, you need to know how things work under the hood. My team has been doing exactly that for the past few months, but I wanted to set up and run my own Kubernetes cluster at home to learn more than what I could experiment with on AWS EKS.

Kubernetes single-node MicroK8s

Kubernetes can be run on bare-metal, on-premises equipment, like a homelab. There are some differences in how load balancers work (each cloud platform has its own implementation behind the LoadBalancer service type, but on your own servers you have to set that up yourself) and in how traffic flows into your cluster, and that presents a lot of opportunities to learn. To get started, I spun up a normal Linux virtual machine with Ubuntu 18.04 and installed MicroK8s on it. I followed a couple of guides to set up the basics, added some of my deployments, and everything went as expected. I ran this setup for a couple of months, serving this website off of it, along with some apps like CodiMD, FileBrowser, and Trilium Notes. An automatic Let's Encrypt setup kept everything serving over HTTPS, and an in-cluster Docker registry backed by some Ceph storage from my Proxmox cluster persisted custom Docker images.

Kubernetes Cluster in LXC on multi-node Proxmox

This was fine for a basic Kubernetes setup, but what's the point of running that on a single virtual machine? I could have just run a few docker-compose deployments on a Docker VM, which is really everyone's go-to choice for a single-node dockerized set of applications. The real benefit of Kubernetes is multi-node application scheduling, and I had three systems in my homelab with power to spare on each. It was time to set up a real K8s cluster across the three nodes.

Since I’m running Proxmox on my homelab systems, I wanted to use LXC containers to run K8s, instead of full KVM virtual machines. They would be much more efficient, avoiding all hardware emulation and simply running the application processes inside namespaces on the host system and kernel. Less isolation, yes, but an acceptable tradeoff.

Over the past year I've run a docker-on-ubuntu setup in an LXC instead of a VM, and while setting that up a few hurdles had to be jumped over to get it working. Some of them were this, this and this. By now though, some of these are unnecessary, since the Linux kernel and PVE have incorporated changes to make it easier. There are still some custom steps though, and I didn't see any guide online detailing them all in a single place, so here goes my guide.

This is applicable to the current versions of PVE, Ubuntu and K8s:

PVE 6.1-3
Ubuntu 18.04-1.1 LXC template
Kubeadm v1.17
K8s 1.17

The node layout is simple for now - I want a separation between the control plane nodes and the worker nodes, just like in AWS EKS and other cloud K8s offerings. I also want a highly available cluster, so ideally I'd run the control plane across at least 3 nodes/containers, with HA configured for etcd as well. Right now, however, I'm starting off with a single-node control plane, which I'll keep on shared storage so I can migrate it across physical PVE nodes quickly if I need to. I want to be able to switch over to a proper HA control plane setup reasonably quickly and easily, so I'll be keeping that in mind for some of the following steps.

Also, since this will come up later - a choice of pod network has to be made for the cluster. Based on some light research, I've decided to use Calico. With this, there is a step in the cluster setup process that changes slightly - I need to specify the IP address pool from which pod IPs will be allocated. The default is 192.168.0.0/16 for "<50 nodes" and 10.244.0.0/16 for "more than 50 nodes", but I'll be using 10.244.0.0/16 either way, since I have (or may have) network devices in the 192.168.0.0/16 range which I want to be able to reach from pods.
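For reference, this matters when the Calico manifest gets applied after the cluster is up: the pod CIDR in the manifest has to match what kubeadm was given. With the plain calico.yaml manifest that means uncommenting and setting the CALICO_IPV4POOL_CIDR variable - a sketch, using the manifest URL from the Calico docs at the time:

curl -LO https://docs.projectcalico.org/manifests/calico.yaml
# In calico.yaml, uncomment and set the pool to match --pod-network-cidr:
#   - name: CALICO_IPV4POOL_CIDR
#     value: "10.244.0.0/16"
kubectl apply -f calico.yaml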

I’ll be using kubeadm to set up the cluster. I’ll call my first control plane node k8s-control-1. Let’s set it up:

I've downloaded the Ubuntu 18.04-1.1 LXC template from Proxmox's template download GUI, to local storage. The container configuration needs some customization to allow K8s to run correctly, similar to the config for the docker LXC container: the container needs to be privileged, so don't tick Unprivileged, and the container's swap needs to be set to 0 (K8s really doesn't like swap for performance reasons, so we'll just provide enough physical RAM instead).
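For reference, creating such a container from the PVE CLI looks roughly like this - the VMID, storage names, sizing and exact template filename are assumptions for illustration, and the same settings can be made in the GUI:

pct create 120 local:vztmpl/ubuntu-18.04-standard_18.04.1-1_amd64.tar.gz \
  --hostname k8s-control-1 \
  --cores 4 --memory 4096 --swap 0 \
  --rootfs local-lvm:32 \
  --net0 name=eth0,bridge=vmbr0,ip=dhcp \
  --unprivileged 0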

Open up the container's PVE config file in /etc/pve/lxc/ (named <vmid>.conf) and add the following at the bottom:

lxc.apparmor.profile: unconfined
lxc.cap.drop: 
lxc.cgroup.devices.allow: a
lxc.mount.auto: proc:rw sys:rw

This blows away a lot of the security features of LXC, but we're doing this to avoid running a full KVM instance. Now, start up the container and go inside.
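From the PVE host, that's simply (using the same assumed VMID as above):

pct start 120
pct enter 120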

To get started, refer to this guide to set up the docker runtime.
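For completeness, here's roughly what that guide boiled down to on Ubuntu 18.04 at the time - treat it as a sketch rather than a substitute for the linked instructions. The daemon.json settings are the cgroup driver, logging and storage driver options the kubeadm docs recommend:

apt-get update && apt-get install -y apt-transport-https ca-certificates curl gnupg-agent software-properties-common
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | apt-key add -
add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/ubuntu $(lsb_release -cs) stable"
apt-get update && apt-get install -y docker-ce docker-ce-cli containerd.io

cat > /etc/docker/daemon.json <<EOF
{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": { "max-size": "100m" },
  "storage-driver": "overlay2"
}
EOF
systemctl daemon-reload && systemctl restart docker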

You’ll see docker running successfully:

root@k8s-staticwan:~# sudo systemctl status docker
● docker.service - Docker Application Container Engine
  Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
  Active: active (running) since Sat 2020-03-28 20:33:09 UTC; 19min ago

Then, follow the kubeadm installation process here.
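Condensed, the kubeadm install steps from that page amount to roughly the following (copied here for convenience; defer to the docs if the repository details change):

apt-get update && apt-get install -y apt-transport-https curl
curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
echo "deb https://apt.kubernetes.io/ kubernetes-xenial main" > /etc/apt/sources.list.d/kubernetes.list
apt-get update && apt-get install -y kubelet kubeadm kubectl
apt-mark hold kubelet kubeadm kubectl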

Now that we have kubeadm and the kubelet installed, we can check up on the status of the kubelet service before we proceed, with systemctl status kubelet.

The crash loop of the kubelet is expected, since kubeadm hasn’t set up a config file for it yet.

kubeadm init

Now comes the real stuff. I want to be able to switch to a high-availability control plane later, even though I'm starting with a single node right now. The docs have this to say:

(Recommended) If you have plans to upgrade this single control-plane kubeadm cluster to high availability you should specify the --control-plane-endpoint to set the shared endpoint for all control-plane nodes. Such an endpoint can be either a DNS name or an IP address of a load-balancer.

So I'll be setting a DNS entry in my pfSense DNS resolver/forwarder for kcontrol.mydomain.com, pointing to the current IP address of this control plane container. When I add more control plane containers, I can add more A record values to that DNS entry.
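A quick sanity check from inside the container that the name resolves before running kubeadm (getent is used here since the base template may not ship dig):

getent hosts kcontrol.mydomain.com   # should print the container's current IP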

So, the kubeadm init command for me is:

kubeadm init --control-plane-endpoint=kcontrol.mydomain.com --pod-network-cidr=10.244.0.0/16

Now, I run the kubeadm init command, and end up with:

   W0105 21:06:32.007656    9528 validation.go:28] Cannot validate kube-proxy config - no validator is available
   W0105 21:06:32.007691    9528 validation.go:28] Cannot validate kubelet config - no validator is available
   [init] Using Kubernetes version: v1.17.0
   [preflight] Running pre-flight checks
   [preflight] The system verification failed. Printing the output from the verification:
   KERNEL_VERSION: 5.3.10-1-pve-tlg
   OS: Linux
   CGROUPS_CPU: enabled
   CGROUPS_CPUACCT: enabled
   CGROUPS_CPUSET: enabled
   CGROUPS_DEVICES: enabled
   CGROUPS_FREEZER: enabled
   CGROUPS_MEMORY: enabled
   error execution phase preflight: [preflight] Some fatal errors occurred:
      [WARNING SystemVerification]: failed to parse kernel config: unable to load kernel module: "configs", output: "modprobe: ERROR: ../libkmod/libkmod.c:586 kmod_search_moddep() could not open moddep file '/lib/modules/5.3.18-2-pve/modules.dep.bin'\nmodprobe: FATAL: Module configs not found in directory /lib/modules/5.3.18-2-pve\n", err: exit status 1
   [preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
   To see the stack trace of this error execute with --v=5 or higher

Even if you do mount the /lib/modules/ folder from the host to the LXC guest with:

pct set xxx --mp0 /lib/modules/$(uname -r),mp=/lib/modules/$(uname -r),ro=1

You’ll still get:

	[ERROR SystemVerification]: failed to parse kernel config: unable to load kernel module: "configs", output: "modprobe: FATAL: Module configs not found in directory /lib/modules/5.3.10-1-pve\n", err: exit status 1

I’m running a custom compiled kernel on my PVE host, with a different version tag, and even if I wasn’t, PVE doesn’t come with the linux-image package for its kernels. There is likely a solution somewhere, involving copying the right kernel .config to the right place, but this preflight check can be ignored as just a warning.
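If you did want to satisfy the check instead of ignoring it, one untested sketch - assuming the host actually has a /boot/config-* file for its running kernel - is to push the host's kernel config into the container:

# Run on the PVE host; <vmid> is the container's ID
pct push <vmid> /boot/config-$(uname -r) /boot/config-$(uname -r)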

Rerun kubeadm init with:

kubeadm init --control-plane-endpoint=kcontrol.mydomain.com --ignore-preflight-errors=SystemVerification --pod-network-cidr=10.244.0.0/16

A disk-heavy portion of this, downloading the images from Google's registry onto the root disk, took ridiculously long on the Ceph-backed root disk (it took under a minute when I had tried this on a local SSD).

Now it proceeds further, but then:

[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp [::1]:10248: connect: connection refused.

Checking the logs for the kubelet with journalctl -xeu kubelet shows:

Jan 05 21:20:46 k8s-control-1 kubelet[13376]: E0105 21:20:46.135602   13376 reflector.go:156] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://kcontr
Jan 05 21:20:46 k8s-control-1 kubelet[13376]: E0105 21:20:46.137016   13376 reflector.go:156] k8s.io/kubernetes/pkg/kubelet/kubelet.go:458: Failed to list *v1.Node: Get https://kcontrol.home
Jan 05 21:20:46 k8s-control-1 kubelet[13376]: E0105 21:20:46.319643   13376 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecate
Jan 05 21:20:46 k8s-control-1 kubelet[13376]:         For verbose messaging see aws.Config.CredentialsChainVerboseErrors
Jan 05 21:20:46 k8s-control-1 kubelet[13376]: I0105 21:20:46.320240   13376 kuberuntime_manager.go:211] Container runtime containerd initialized, version: 1.2.10, apiVersion: v1alpha2
Jan 05 21:20:46 k8s-control-1 kubelet[13376]: I0105 21:20:46.320586   13376 server.go:1113] Started kubelet
Jan 05 21:20:46 k8s-control-1 kubelet[13376]: F0105 21:20:46.321394   13376 kubelet.go:1413] failed to start OOM watcher open /dev/kmsg: no such file or directory
Jan 05 21:20:46 k8s-control-1 systemd[1]: kubelet.service: Main process exited, code=exited, status=255/n/a
Jan 05 21:20:46 k8s-control-1 systemd[1]: kubelet.service: Failed with result 'exit-code'.

Don't be misled by the failed HTTPS requests - those are failing because the kubelet hasn't been able to start successfully yet. Why? Notice the last F (fatal) error - kubelet[13376]: F0105 21:20:46.321394 13376 kubelet.go:1413] failed to start OOM watcher open /dev/kmsg: no such file or directory

Here we see a helpful person noticing that lxc.kmsg = 1 is a known config option, but PVE LXC doesn't work with it. I tried adding lxc.kmsg: 1 to the PVE LXC config, in line with the other lxc options I added previously, but on starting the container I get:

Jan 06 02:52:58 fowlmanor lxc-start[1747057]: lxc-start: 124: confile.c: parse_line: 2811 Unknown configuration key "lxc.kmsg"
Jan 06 02:52:58 fowlmanor lxc-start[1747057]: lxc-start: 124: parse.c: lxc_file_for_each_line_mmap: 142 Failed to parse config file "/var/lib/lxc/124/config" at line "lxc.kmsg = 1"
Jan 06 02:52:58 fowlmanor lxc-start[1747057]: Failed to load config for 124
Jan 06 02:52:58 fowlmanor lxc-start[1747057]: lxc-start: 124: tools/lxc_start.c: main: 263 Failed to create lxc_container
Jan 06 02:52:58 fowlmanor systemd[1]: [email protected]: Control process exited, code=exited, status=1/FAILURE

So, I had to go with the workaround - symlink /dev/console to /dev/kmsg. Thanks, helpful guy on the PVE forums (this workaround has been mentioned online elsewhere too). You can run ln -s /dev/console /dev/kmsg to do so, but this doesn’t survive a reboot, so do:

echo 'L /dev/kmsg - - - - /dev/console' > /etc/tmpfiles.d/kmsg.conf

Reference: step from kubernetes-lxd. (This relies on systemd, which is present in Ubuntu-based containers.)

Now, systemctl restart kubelet. The kubelet successfully spins up now, but the kubeadm init process was incomplete. Rerunning it fails since the setup is partially complete and ports are already bound, so I cleared the kubeadm init setup with kubeadm reset, then reran the kubeadm init command. This should probably be one of the preflight checks, but for now, remember to check for /dev/kmsg and set it up if it isn't present, before doing a kubeadm init.
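For reference, the reset-and-retry, followed by the kubeconfig setup that kubeadm prints at the end of a successful init (needed before kubectl will talk to the new cluster), looks like:

kubeadm reset
kubeadm init --control-plane-endpoint=kcontrol.mydomain.com --ignore-preflight-errors=SystemVerification --pod-network-cidr=10.244.0.0/16

mkdir -p $HOME/.kube
cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
chown $(id -u):$(id -g) $HOME/.kube/config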

This time, all goes well, and I have a running kubelet. Let’s do a kubectl get nodes:

NAME                STATUS   ROLES    AGE   VERSION
k8s-control-1       Ready    master   2d   v1.17.0

I did run into some issues doing the steps above, which I've skipped for brevity - errors from the kubelet not becoming healthy in time because of the LXC's slow backing disk, which made me switch to a local disk and muck around with configs while the kubeadm join command was running.

To set up subsequent nodes (a worker node LXC on each of my Proxmox hosts, and another control plane node LXC on a different physical host), I redid the above, but in the "right" order, with kubeadm join instead of kubeadm init. The join info for control plane nodes and worker nodes is printed as kubeadm init finishes. The join process also needs the --ignore-preflight-errors=SystemVerification I used previously.
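The worker join command printed by kubeadm init looks roughly like this (token and hash are placeholders), with the preflight override tacked on:

kubeadm join kcontrol.mydomain.com:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --ignore-preflight-errors=SystemVerification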

NAME                STATUS   ROLES    AGE   VERSION
k8s-camphalfblood   Ready    <none>   15m   v1.17.2
k8s-control-1       Ready    master   2d   v1.17.2
k8s-fowlmanor       Ready    <none>   15m   v1.17.2

Now that I have two control plane nodes, etcd quorum requires both of them - if even one goes down, consensus/node lease is lost since there is no longer a majority, and the whole control plane stops functioning. A 3rd control plane node is necessary to actually benefit from high availability.

Join more nodes later:

To join a worker node:

The certs and join token created above are only valid for a short time - see kubeadm token list to see validity info. To recreate tokens and get the join info printed again, use:

kubeadm token create --print-join-command

This creates a token to let worker nodes join.

To join a control plane node:

Use kubeadm to upload the cluster's certs, encrypted, into a Kubernetes Secret named kubeadm-certs in the cluster's kube-system namespace, with:

kubeadm init phase upload-certs --upload-certs

This prints a certificate key, to use for decryption of these certs later. Now create a join token and print the join command for a control plane node with:

kubeadm token create --print-join-command --certificate-key <certkey>

This prints a similar join command to run on a new control plane node, but with --control-plane to direct kubeadm to join as a control plane node, and the --certificate-key we provided.
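Put together, the command run on the new control plane node looks roughly like this (placeholders again, plus the preflight override):

kubeadm join kcontrol.mydomain.com:6443 --token <token> \
    --discovery-token-ca-cert-hash sha256:<hash> \
    --control-plane --certificate-key <certkey> \
    --ignore-preflight-errors=SystemVerification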


Everything in a single script:

If you're lazy like me, you don't want to be copy-pasting a few commands at a time from the documentation, across 10+ steps. To make all this faster, I put all the commands I actually ran into a bash script:

kubernetes-kubeadm-install.sh

Upgrading kubernetes with kubeadm is rather easy, but here’s a snippet with commands for that too (not everything is to be run on all nodes):

kubernetes-kubeadm-upgrade.sh

Notes:

To remove a control plane node, don't just delete the LXC and do a kubectl delete node node-name. Go to the node and do a kubeadm reset first - which should remove the etcd member on that node from the member list. Otherwise, that member remains in the etcd member list and will be unhealthy. I was unable to join a new control plane node in this situation, since kubeadm checks etcd health and the old control plane node I had deleted abruptly was still in the etcd member list.
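In short, the order that works:

# On the control plane node being retired (also deregisters its etcd member):
kubeadm reset
# Then, from a node that is staying:
kubectl delete node <node-name>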

To debug/fix etcd problems:

Go into any etcd pod in the kube-system namespace (getting a shell is sketched below) and set up an etcdctl alias:
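The etcd static pods are named after their node, so exec-ing in looks roughly like this (pod name assumed):

kubectl -n kube-system exec -it etcd-k8s-control-1 -- sh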

(where the endpoint list is the IPs of the control plane nodes you want to operate against)

alias ek='etcdctl --endpoints=https://10.0.0.90:2379,https://10.0.0.89:2379,https://10.0.0.88:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt  --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt'

ek endpoint health - will show you health of the endpoints in the --endpoints flag

ek member remove <member> as needed (make sure to not break quorum)

You may need to use etcdctl with only the working --endpoints (if you are trying to recover a broken etcd node and the cluster still has quorum).

More notes:

node-exporter / not shared or slave mount

When you inevitably try to run node-exporter on these LXC containers to monitor resources, you may run into:

Warning  Failed     43s (x4 over 80s)  kubelet, k8s-control-2  Error: failed to start container "node-exporter": Error response from daemon: path / is mounted on / but it is not a shared or slave mount

To fix this, run

mount --make-rshared /

To make this permanent:

echo '#!/bin/sh -e
mount --make-rshared /' > /etc/rc.local
chmod +x /etc/rc.local

100% of one CPU core usage in LXC container by systemd-journald process in recent Ubuntu LXC templates

The symlink I made from /dev/console to /dev/kmsg causes an infinite loop in systemd-journald, which tries to read from kmsg and write to the console (this problem wouldn't have occurred on a non-LXC setup). Various references: link, here

I didn’t see any clear alternative to the symlink recommended previously though, so to try to work around the systemd loop situation:

mkdir /etc/systemd/journald.conf.d/; echo "ReadKMsg=no" > /etc/systemd/journald.conf.d/kfilter.conf
systemctl restart systemd-journald

This still doesn't stop systemd-journald from occasionally taking up a core with its infinite loop, so there's still something to be investigated here. For now I just kill the process once after bootup if I see this situation (yes, ugly).