Kubernetes in LXC on Proxmox
I’ve been working with Kubernetes for the past few months at work, to run our web serving and internal applications. We use AWS’s Elastic Kubernetes Service (EKS), their managed Kubernetes control plane offering. It’s easy to set up and reliable enough to let folks like me, new to Kubernetes, run production workloads without too much trouble. We don’t have to worry about managing critical control-plane services; we just have to supply our own worker nodes and our application stack.
But to really get stuff done on EKS, a deeper understanding of the Kubernetes core is necessary. There are quite a few rough edges, and when something goes wrong or you feel like you’re missing an obvious feature, you need to know how stuff works under the hood. For the past few months my team has been on that, but I wanted to set up and run my own Kubernetes cluster at home to learn more than what I could experiment with on AWS EKS.
Kubernetes single-node MicroK8s
Kubernetes can be run on bare-metal on-premises equipment, like a homelab. There are some differences in how load balancers work (each cloud platform has its own implementation of LoadBalancer, but on your own servers you have to set those up yourself) and in how traffic flows into your cluster, and that presents a lot of opportunities to learn more about everything. To get started, I spun up a normal Linux virtual machine with Ubuntu 18.04 and installed MicroK8s on it. I followed a couple of guides on setting up the basics, added some of my deployments, and everything went as expected. I ran this setup for a couple of months, serving this website off of it, along with some apps like CodiMD, FileBrowser and Trilium Notes. An automatic Let’s Encrypt setup kept everything serving over HTTPS, and an in-cluster Docker registry backed by some Ceph storage from my Proxmox cluster persisted custom Docker images.
Kubernetes Cluster in LXC on multi-node Proxmox
This was fine for a basic Kubernetes setup, but what’s the point of running that on a single virtual machine? I could have just run a few docker-compose deployments on a docker VM, which is really everyone’s go-to choice for a single node dockerized set of applications. The real benefit of Kubernetes is multi-node application scheduling, and I had three systems in my homelab with power to spare on each. It was time to set up a real K8s cluster across the three nodes.
Since I’m running Proxmox on my homelab systems, I wanted to use LXC containers to run K8s, instead of full KVM virtual machines. They would be much more efficient, avoiding all hardware emulation and simply running the application processes inside namespaces on the host system and kernel. Less isolation, yes, but an acceptable tradeoff.
Over the past year I’ve run a docker-on-ubuntu setup in an LXC instead of a VM, and while setting that up a few hurdles had to be jumped over to get it working. Some of them were this, this and this. By now though, some of these are unnecessary, since the Linux kernel and PVE have incorporated changes to make it easier. There are still some custom steps though, and I didn’t see any guide online detailing them all in a single place, so here goes my guide.
This is applicable to the current versions of PVE, Ubuntu and K8s:
PVE 6.1-3
Ubuntu 18.04-1.1 LXC template
Kubeadm v1.17
K8s 1.17
The node layout is simple for now - I want a separation between the control plane nodes and the worker nodes, just like in AWS EKS and other cloud K8s offerings. I also want a high availability cluster, so ideally I’d run the control plane across at least 3 nodes/containers with HA configured for etcd as well. Right now however I’m starting off with a single-node control plane, which I’ll keep on shared storage so I can migrate it across physical PVE nodes quickly if I need to. I want to be able to switch over to a proper HA control plane setup reasonably quickly and easily, so I’ll be keeping that in mind for some of the following steps.
Also, since this will come up later - a choice of network has to be made for the cluster. Based on some light research, I’ve decided to use Calico. With this there is a step in the cluster setup process which has a slight change - I need to specify the IP address pool from which pod IPs will be allocated. The default is 192.168.0.0/16 for “<50 nodes” and 10.244.0.0/16 for “more than 50 nodes”, but I’ll be using 10.244.0.0/16 since I may/do have network devices in the 192.168.0.0/16 range which I want to be able to reach from pods.
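One way to sanity-check a candidate pod CIDR against addresses already in use on the LAN is a small containment test. The following is only a sketch in plain POSIX shell arithmetic; the function names ip_to_int and ip_in_cidr and the sample device address 192.168.1.1 are made up for illustration and are not part of Calico or kubeadm.

```sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS; IFS=.
    set -- $1            # split the address on dots into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# Succeed (exit 0) if IPv4 address $1 lies inside CIDR block $2.
ip_in_cidr() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (4294967295 << (32 - bits)) & 4294967295 ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

# Hypothetical LAN device checked against both candidate pod pools.
for pool in 192.168.0.0/16 10.244.0.0/16; do
    if ip_in_cidr 192.168.1.1 "$pool"; then
        echo "192.168.1.1 falls inside $pool"
    fi
done
```

A pool that contains addresses you need to reach from pods is the one to avoid, which is the reasoning behind picking 10.244.0.0/16 here.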
I’ll be using kubeadm to set up the cluster. I’ll call my first control plane node k8s-control-1. Let’s set it up:
I’ve downloaded the Ubuntu 18.04-1.1 LXC template from Proxmox’s templates download GUI, to local storage.
We need to customize a container configuration to make changes to allow K8s to run correctly, similar to the config for the docker LXC container:
The container needs to be a privileged container, so don’t tick the unprivileged container option when creating it.
The container’s SWAP needs to be set to 0 (K8s really doesn’t like SWAP for performance reasons, so we’ll just provide enough physical RAM instead).
Open up the container’s PVE config file in /etc/pve/lxc/ and add the following at the bottom:
lxc.apparmor.profile: unconfined
lxc.cap.drop:
lxc.cgroup.devices.allow: a
lxc.mount.auto: proc:rw sys:rw
This blows away a lot of the security features of LXC, but I’m doing this to avoid running a full KVM instance. Now, start up the container and go inside.
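If several containers need the same four options, appending them by hand gets repetitive. Below is a minimal sketch of an idempotent helper; the function name add_k8s_lxc_opts is hypothetical, and it assumes the container has no snapshots (snapshot sections sit at the end of a PVE config file, so appended lines would land inside the last one).

```sh
# Append the raw LXC options listed above to a PVE container config file,
# skipping any line that is already present.
# Usage: add_k8s_lxc_opts <config-file>
add_k8s_lxc_opts() {
    conf=$1
    while IFS= read -r line; do
        # -x: whole-line match, -F: fixed string, -q: quiet
        grep -qxF -- "$line" "$conf" 2>/dev/null || printf '%s\n' "$line" >> "$conf"
    done <<'EOF'
lxc.apparmor.profile: unconfined
lxc.cap.drop:
lxc.cgroup.devices.allow: a
lxc.mount.auto: proc:rw sys:rw
EOF
}

# Example (run on the PVE host, substituting the container's config file):
#   add_k8s_lxc_opts /etc/pve/lxc/<vmid>.conf
```

Running it twice leaves the file unchanged the second time, so it is safe to re-run after editing a container through the GUI.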
To get started, refer to this guide to set up the
You’ll see docker running successfully:
~# sudo systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/lib/systemd/system/docker.service; enabled; vendor preset: enabled)
   Active: active (running) since Sat 2020-03-28 20:33:09 UTC; 19min ago
Then, follow the kubeadm installation process here.
Now that we have kubeadm and kubelet installed, we can check up on the status of the kubelet service before we proceed, with
systemctl status kubelet
The crash loop of the kubelet is expected, since kubeadm hasn’t set up a config file for it yet.
Now comes the real stuff. I want to be able to switch to a high-availability cluster control plane later, while I’m starting with a single node right now. The docs have this to say:
(Recommended) If you have plans to upgrade this single control-plane kubeadm cluster to high availability you should specify the --control-plane-endpoint to set the shared endpoint for all control-plane nodes. Such an endpoint can be either a DNS name or an IP address of a load-balancer.
So I’ll be setting a DNS entry in my pfSense DNS resolver/forwarder for kcontrol.mydomain.com, pointing to the current IP address of this control plane container. When I add more control plane containers, I can add more A record values to that DNS entry.
The kubeadm init command for me is:
kubeadm init --control-plane-endpoint=kcontrol.mydomain.com --pod-network-cidr=10.244.0.0/16
Now, I run the kubeadm init command, and end up with
W0105 21:06:32.007656 9528 validation.go:28] Cannot validate kube-proxy config - no validator is available
W0105 21:06:32.007691 9528 validation.go:28] Cannot validate kubelet config - no validator is available
[init] Using Kubernetes version: v1.17.0
[preflight] Running pre-flight checks
[preflight] The system verification failed. Printing the output from the verification:
KERNEL_VERSION: 5.3.10-1-pve-tlg
OS: Linux
CGROUPS_CPU: enabled
CGROUPS_CPUACCT: enabled
CGROUPS_CPUSET: enabled
CGROUPS_DEVICES: enabled
CGROUPS_FREEZER: enabled
CGROUPS_MEMORY: enabled
error execution phase preflight: [preflight] Some fatal errors occurred:
[WARNING SystemVerification]: failed to parse kernel config: unable to load kernel module: "configs", output: "modprobe: ERROR: ../libkmod/libkmod.c:586 kmod_search_moddep() could not open moddep file '/lib/modules/5.3.18-2-pve/modules.dep.bin'\nmodprobe: FATAL: Module configs not found in directory /lib/modules/5.3.18-2-pve\n", err: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
Even if you do mount the /lib/modules/ folder from the host to the LXC guest with:
pct set xxx --mp0 /lib/modules/$(uname -r),mp=/lib/modules/$(uname -r),ro=1
You’ll still get:
[ERROR SystemVerification]: failed to parse kernel config: unable to load kernel module: "configs", output: "modprobe: FATAL: Module configs not found in directory /lib/modules/5.3.10-1-pve\n", err: exit status 1
I’m running a custom compiled kernel on my PVE host, with a different version tag, and even if I wasn’t, PVE doesn’t come with the linux-image package for its kernels. There is likely a solution somewhere, involving copying the right kernel .config to the right place, but this preflight check can be ignored as just a warning.
So, rerun kubeadm init with:
kubeadm init --control-plane-endpoint=kcontrol.mydomain.com --ignore-preflight-errors=SystemVerification --pod-network-cidr=10.244.0.0/16
A disk-heavy portion of this, downloading images from Google’s registry onto the local root disk, took ridiculously long on the Ceph-backed root disk (it took under a minute when I tried this on a local SSD).
Now it proceeds further, but then:
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s
[kubelet-check] Initial timeout of 40s passed.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp [::1]:10248: connect: connection refused.
[kubelet-check] It seems like the kubelet isn't running or healthy.
[kubelet-check] The HTTP call equal to 'curl -sSL http://localhost:10248/healthz' failed with error: Get http://localhost:10248/healthz: dial tcp [::1]:10248: connect: connection refused.
Checking the logs for the kubelet with journalctl -xeu kubelet shows:
Jan 05 21:20:46 k8s-control-1 kubelet: E0105 21:20:46.135602 13376 reflector.go:156] k8s.io/kubernetes/pkg/kubelet/config/apiserver.go:46: Failed to list *v1.Pod: Get https://kcontr
Jan 05 21:20:46 k8s-control-1 kubelet: E0105 21:20:46.137016 13376 reflector.go:156] k8s.io/kubernetes/pkg/kubelet/kubelet.go:458: Failed to list *v1.Node: Get https://kcontrol.home
Jan 05 21:20:46 k8s-control-1 kubelet: E0105 21:20:46.319643 13376 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecate
Jan 05 21:20:46 k8s-control-1 kubelet: For verbose messaging see aws.Config.CredentialsChainVerboseErrors
Jan 05 21:20:46 k8s-control-1 kubelet: I0105 21:20:46.320240 13376 kuberuntime_manager.go:211] Container runtime containerd initialized, version: 1.2.10, apiVersion: v1alpha2
Jan 05 21:20:46 k8s-control-1 kubelet: I0105 21:20:46.320586 13376 server.go:1113] Started kubelet
Jan 05 21:20:46 k8s-control-1 kubelet: F0105 21:20:46.321394 13376 kubelet.go:1413] failed to start OOM watcher open /dev/kmsg: no such file or directory
Jan 05 21:20:46 k8s-control-1 systemd: kubelet.service: Main process exited, code=exited, status=255/n/a
Jan 05 21:20:46 k8s-control-1 systemd: kubelet.service: Failed with result 'exit-code'.
Don’t be misled by the failed HTTPS requests - those are failing because the kubelet hasn’t been able to start successfully yet. Why? Notice the last kubelet line, an F (fatal) error:
kubelet: F0105 21:20:46.321394 13376 kubelet.go:1413] failed to start OOM watcher open /dev/kmsg: no such file or directory
Here we see a helpful person noticing that lxc.kmsg = 1 is a known config option, but PVE LXC doesn’t work with it. I tried adding lxc.kmsg: 1 to the PVE LXC config, in line with the other lxc configs I added previously, but on starting the container I get:
Jan 06 02:52:58 fowlmanor lxc-start: lxc-start: 124: confile.c: parse_line: 2811 Unknown configuration key "lxc.kmsg"
Jan 06 02:52:58 fowlmanor lxc-start: lxc-start: 124: parse.c: lxc_file_for_each_line_mmap: 142 Failed to parse config file "/var/lib/lxc/124/config" at line "lxc.kmsg = 1"
Jan 06 02:52:58 fowlmanor lxc-start: Failed to load config for 124
Jan 06 02:52:58 fowlmanor lxc-start: lxc-start: 124: tools/lxc_start.c: main: 263 Failed to create lxc_container
Jan 06 02:52:58 fowlmanor systemd: [email protected]: Control process exited, code=exited, status=1/FAILURE
So, I had to go with the workaround - symlink /dev/console to /dev/kmsg. Thanks, helpful guy on the PVE forums (this workaround has been mentioned online elsewhere too). You can run ln -s /dev/console /dev/kmsg to do so, but this doesn’t survive a reboot, so do:
echo 'L /dev/kmsg - - - - /dev/console' > /etc/tmpfiles.d/kmsg.conf
Reference: step from kubernetes-lxd. (This relies on systemd, which is in Ubuntu based containers)
Then run systemctl restart kubelet. The kubelet successfully spins up now, but the kubeadm init process was incomplete. Rerunning it fails since the kubelet is partially configured and has bound to its ports, so I cleared the kubeadm init setup with kubeadm reset, then reran the kubeadm init command. This should probably be one of the preflight checks, but for now, remember to check for /dev/kmsg and set it up if it isn’t present, before doing a kubeadm init.
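The /dev/kmsg check can be folded into a small helper that runs before kubeadm. This is a minimal sketch, not something kubeadm provides: the function name ensure_kmsg is hypothetical, and the three paths are parameters (defaulting to the real ones) purely so the function can be tried against a scratch directory first.

```sh
# Make sure /dev/kmsg (or a stand-in symlink to the console) exists,
# and persist the symlink across reboots through systemd-tmpfiles.
# Usage: ensure_kmsg [kmsg-path] [console-path] [tmpfiles-conf]
ensure_kmsg() {
    kmsg=${1:-/dev/kmsg}
    console=${2:-/dev/console}
    tmpfiles_conf=${3:-/etc/tmpfiles.d/kmsg.conf}
    if [ -e "$kmsg" ] || [ -L "$kmsg" ]; then
        echo "$kmsg already present"
        return 0
    fi
    ln -s "$console" "$kmsg"
    # Same tmpfiles.d line as above: recreate the symlink at boot.
    echo "L $kmsg - - - - $console" > "$tmpfiles_conf"
    echo "created $kmsg -> $console"
}

# Example (inside the container, as root, before kubeadm init or join):
#   ensure_kmsg
```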
This time, all goes well, and I have a running kubelet. Let’s do a kubectl get nodes:
NAME            STATUS   ROLES    AGE   VERSION
k8s-control-1   Ready    master   2d    v1.17.0
I did run into some issues doing the steps above, which I’ve skipped for brevity - errors involving the kubelet not becoming healthy in time because of the slow backing disk of the LXC made me have to switch to a local disk and muck around with configs while the kubeadm join command was running.
To set up subsequent nodes (a worker node LXC on each of my Proxmox hosts, and another control plane node LXC on a different physical host), I redid the above, but in the “right” order, with kubeadm join instead of kubeadm init. The join info for control plane nodes and worker nodes is printed as the kubeadm init finishes. The join process also needs the --ignore-preflight-errors=SystemVerification I used previously.
NAME                STATUS   ROLES    AGE   VERSION
k8s-camphalfblood   Ready    <none>   15m   v1.17.2
k8s-control-1       Ready    master   2d    v1.17.2
k8s-fowlmanor       Ready    <none>   15m   v1.17.2
Now that I have two control plane nodes, if even one goes down, consensus/node lease is lost since there is no longer a majority, and the whole control plane stops functioning. A 3rd control node is necessary to actually benefit from High Availability.
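The arithmetic behind that is simple majority: a cluster of N etcd members needs floor(N/2)+1 of them up, so it tolerates floor((N-1)/2) failures. A small sketch in shell (the function names are illustrative only):

```sh
# Majority needed for an etcd cluster of N members.
quorum() { echo $(( $1 / 2 + 1 )); }

# How many members can fail while a majority still remains.
tolerated_failures() { echo $(( ($1 - 1) / 2 )); }

for n in 1 2 3 5; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

Two members need both to be up, so a second control plane node adds no failure tolerance over a single one; three members is the first size that survives losing one.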
Join more nodes later:
To join a worker node:
The certs and join token created above are only valid for a short time - see kubeadm token list to see validity info. To recreate tokens and get the join info printed again, use:
kubeadm token create --print-join-command
This creates a token to let worker nodes join.
To join a control plane node:
First, get kubeadm to upload the cluster’s certs, encrypted, into a Kubernetes Secret named kubeadm-certs in the cluster’s kube-system namespace, with:
kubeadm init phase upload-certs --upload-certs
This prints a certificate key, to use for decryption of these certs later. Now create a join token and print the join command for a control plane node with:
kubeadm token create --print-join-command --certificate-key <certkey>
This prints a similar join command to run on a new control plane node, but with --control-plane to direct kubeadm to join as a control plane node, and the --certificate-key we provided.
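Since the printed join command does not include the preflight override these LXC containers need, it has to be appended by hand each time. The helper below is only a sketch of that string assembly: the function name control_plane_join_cmd is hypothetical, and the token and key values in the example are placeholders, not real output.

```sh
# Turn a worker join command (as printed by
# `kubeadm token create --print-join-command`) into a control plane join
# command for an LXC node, by appending the flags discussed above.
# $1 = worker join command, $2 = certificate key from the upload-certs step
control_plane_join_cmd() {
    printf '%s --control-plane --certificate-key %s --ignore-preflight-errors=SystemVerification\n' "$1" "$2"
}

# Placeholder values for illustration only:
control_plane_join_cmd \
    "kubeadm join kcontrol.mydomain.com:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>" \
    "<certkey>"
```

For a worker node, only the --ignore-preflight-errors=SystemVerification part needs to be added to the printed command.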
Everything in a single script:
If you’re lazy like me, you don’t want to be copy-pasting a few commands at a time from the documentation, across 10+ steps. To make all this faster, I put all the commands I actually ran, in a bash script:
Upgrading kubernetes with kubeadm is rather easy, but here’s a snippet with commands for that too (not everything is to be run on all nodes):
To remove a control plane node, don’t just delete the LXC and do a kubectl delete node node-name. Go to the node first and do a kubeadm reset - which should remove the etcd member on that node from the member list. Otherwise, that member remains in the etcd member list and will be unhealthy. I was unable to join a new control plane node in this situation, since kubeadm checks etcd health and the old control plane node I had deleted abruptly was still in the etcd member list.
To debug/fix etcd problems:
Go to any etcd pod in the kube-system namespace and run the following (where the endpoint list is the IPs of the control plane nodes you want to operate against):
alias ek='etcdctl --endpoints=https://10.0.0.90:2379,https://10.0.0.89:2379,https://10.0.0.88:2379 --cacert /etc/kubernetes/pki/etcd/ca.crt --key /etc/kubernetes/pki/etcd/server.key --cert /etc/kubernetes/pki/etcd/server.crt'
ek endpoint health    # shows the health of the endpoints in the --endpoints flag
ek member remove <member>    # as needed (make sure not to break quorum)
You may need to use etcdctl with only the working --endpoints (if you are trying to recover a broken etcd node and the cluster is still in quorum).
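Because the endpoint list changes depending on which nodes are currently healthy, it can help to build the flag value from a list of IPs instead of editing the alias by hand. A minimal sketch, assuming etcd serves clients over HTTPS on port 2379 as in the alias above; the function name etcd_endpoints is made up for illustration.

```sh
# Build the value for etcdctl's --endpoints flag from a list of node IPs.
etcd_endpoints() {
    out=
    for ip in "$@"; do
        # Prefix a comma only when something has already been collected.
        out="${out:+$out,}https://$ip:2379"
    done
    echo "$out"
}

# Only the two nodes assumed to be healthy in this example:
etcd_endpoints 10.0.0.90 10.0.0.89
```

The result can be passed straight to etcdctl as --endpoints="$(etcd_endpoints 10.0.0.90 10.0.0.89)" together with the same certificate flags.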
node-exporter / not shared or slave mount
When you inevitably try to run node-exporter on these LXC containers to monitor resources, you may run into:
Warning Failed 43s (x4 over 80s) kubelet, k8s-control-2 Error: failed to start container "node-exporter": Error response from daemon: path / is mounted on / but it is not a shared or slave mount
To fix this, run
mount --make-rshared /
To make this permanent:
echo '#!/bin/sh -e
mount --make-rshared /' > /etc/rc.local
chmod +x /etc/rc.local
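To confirm whether the root mount is actually shared (before or after the fix), the propagation state can be read from /proc/self/mountinfo, where shared mounts carry a shared:N tag and slave mounts a master:N tag in the optional fields. The following is a sketch with a hypothetical function name; the file path is a parameter so it can be tried on a sample file.

```sh
# Print "shared", "slave" or "private" for the root mount, based on a
# mountinfo file (defaults to /proc/self/mountinfo).
root_propagation() {
    awk '$5 == "/" {
        state = "private"
        # Optional fields start at field 7 and end at the "-" separator.
        for (i = 7; i <= NF && $i != "-"; i++) {
            if ($i ~ /^shared:/) state = "shared"
            else if ($i ~ /^master:/ && state != "shared") state = "slave"
        }
    }
    END { print state }' "${1:-/proc/self/mountinfo}"
}

# Example (inside the container):
#   root_propagation
```

If this prints private, node-exporter's mount of / would be expected to hit the error above until mount --make-rshared / has been run.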
100% of one CPU core usage in LXC container by systemd-journald process in recent Ubuntu LXC templates
The symlink I made from /dev/kmsg to /dev/console causes an infinite loop in systemd-journald, which tries to read from kmsg and write to console (this problem wouldn’t have occurred on a non-LXC setup).
Various references: linkhere
I didn’t see any clear alternative to the symlink recommended previously though, so to try to work around the systemd loop situation:
mkdir /etc/systemd/journald.conf.d/
echo "ReadKMsg=no" > /etc/systemd/journald.conf.d/kfilter.conf
systemctl restart systemd-journald
This still doesn’t stop systemd-journald from occasionally taking up a core with its infinite loop, so there’s still something to be investigated here. For now I just kill the process once after bootup if I see this situation (yes, ugly)
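One thing that may be worth checking here: journald.conf documents its options under a [Journal] section header, and as far as I know systemd ignores assignments that appear outside any section, so a drop-in containing only the bare ReadKMsg=no line might not be taking effect. The sketch below writes the drop-in with the header included. It is untested against this particular loop, and the function name write_journald_dropin is hypothetical.

```sh
# Write a journald drop-in that disables kernel log reading, with the
# setting placed under the [Journal] section header.
# $1 = drop-in directory (defaults to the one used above)
write_journald_dropin() {
    dir=${1:-/etc/systemd/journald.conf.d}
    mkdir -p "$dir"
    printf '[Journal]\nReadKMsg=no\n' > "$dir/kfilter.conf"
}

# Example (inside the container, as root):
#   write_journald_dropin
#   systemctl restart systemd-journald
```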