Getting Along with the OpenShift Machine Config Operator

Overview

When I'm working with the Machine Config Operator (MCO) in OpenShift, and frequently typing oc describe mcp and oc get mcp, I'm often reminded of the MCP (Master Control Program) in the film Tron. I like the idea behind the MCO, and when it's working it's great, but perhaps this association in my mind is not only because of the shared acronym, but because you do sometimes feel you're fighting with it for control of your cluster!

What is the OpenShift Machine Config Operator?

The OpenShift Machine Config Operator is a very powerful piece of software and an important part of the OpenShift Container Platform (OCP). It manages the operating system version, updates and configuration files of the master and worker nodes that make up an OpenShift cluster. Operating system configuration files - such as NetworkManager config files, systemd units, sysctl files, kubelet config files and certificate bundles - are base64 encoded and held in MachineConfig resources inside Kubernetes itself.

This might not sound that interesting in the context of today's public cloud providers' IaaS solutions, but when the nodes are running on HP and Dell bare metal physical servers in private, disconnected data centres, having a software operator like this - which is run and managed using native Kubernetes primitives and commands - drastically reduces the system administrator (SA) workload required to manage and maintain multiple clusters made up of hundreds of bare metal physical servers.

Understanding the Machine Config Operator (MCO)

The MCO lives in the openshift-machine-config-operator project and is present in every OpenShift cluster. There are a number of different containers running in this namespace:

  • machine-config-operator - this is the main controller loop or ClusterOperator itself that deploys and manages everything else in this namespace.
  • machine-config-server - provides the endpoint that nodes connect to in order to get their configuration files.
  • machine-config-controller - co-ordinates upgrades and manages the lifecycle of nodes in the cluster by rendering (generating) MachineConfigs (mc), managing MachineConfigPools (mcp) and by co-ordinating with the machine-config-daemon running on each node to keep all the nodes up to date with the correct configuration.
  • machine-config-daemon - responsible for updating nodes to a given MachineConfig when requested to by the machine-config-controller. This runs as a DaemonSet so there is one pod per node in the cluster.

The terminal output below illustrates these pods running in the openshift-machine-config-operator project in an OpenShift cluster:

$ oc get pods -n openshift-machine-config-operator -o wide
NAME                                         READY   STATUS    RESTARTS   AGE   IP              NODE       NOMINATED NODE   READINESS GATES
machine-config-controller-7c4975f766-gbnh2   1/1     Running   1          20h   10.129.0.27     server01   <none>           <none>
machine-config-daemon-2kt5g                  2/2     Running   2          18h   192.168.1.203   server03   <none>           <none>
machine-config-daemon-jmhll                  2/2     Running   6          54d   192.168.1.206   server06   <none>           <none>
machine-config-daemon-rzwhv                  2/2     Running   6          54d   192.168.1.202   server02   <none>           <none>
machine-config-daemon-vqphs                  2/2     Running   24         54d   192.168.1.201   server01   <none>           <none>
machine-config-daemon-xqjgk                  2/2     Running   2          17h   192.168.1.205   server05   <none>           <none>
machine-config-operator-66f4cd998f-v8cfg     1/1     Running   1          20h   10.129.0.40     server01   <none>           <none>
machine-config-server-5hlsd                  1/1     Running   12         54d   192.168.1.101   server01   <none>           <none>
machine-config-server-dfvg8                  1/1     Running   3          54d   192.168.1.102   server02   <none>           <none>
machine-config-server-tr2fd                  1/1     Running   11         54d   192.168.1.103   server03   <none>           <none>

MachineConfigs

So how does an update or change to a systemd unit file used to start the kubelet on a node make it to the OS disk of a node that requires it?

One of the core utilities underpinning the MCO is Ignition. When Red Hat acquired CoreOS in 2018, it merged the best parts of the CoreOS Container Linux, Red Hat Enterprise Linux and Red Hat Atomic Host operating systems, and the resulting product was named Red Hat Enterprise Linux CoreOS (RHCOS). This is the OS that runs on the master and worker nodes in an OpenShift cluster.

All the OS configuration files a node requires are encoded and encapsulated in Ignition files. The machine-config-controller collates all the various snippets and OS configuration files each node requires, and then renders (generates) MachineConfig Kubernetes resources containing all those files.
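If you're curious about what one of these rendered MachineConfigs actually holds, you can list the file paths it manages with a jsonpath query. This is just an illustrative sketch - substitute one of your own cluster's rendered worker config names (I've used one from the listing further below):

$ oc get mc rendered-worker-9da78a560e7fa3c0cbd53d2bf0c17941 \
    -o jsonpath='{range .spec.config.storage.files[*]}{.path}{"\n"}{end}'

This prints the path of every OS file the rendered config owns (files such as /etc/kubernetes/kubelet.conf and /etc/crio/crio.conf), each of which has its contents stored in the MachineConfig as an encoded data URL.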

The command below shows the MachineConfigs available in an OpenShift cluster:

$ oc get mc --sort-by='{.metadata.creationTimestamp}'
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
99-worker-ssh                                                                                 3.2.0             216d
99-master-ssh                                                                                 3.2.0             216d
00-master                                          a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             216d
01-master-kubelet                                  a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             216d
01-worker-container-runtime                        a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             216d
01-master-container-runtime                        a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             216d
99-master-generated-registries                     a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             216d
00-worker                                          a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             216d
99-worker-generated-registries                     a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             216d
01-worker-kubelet                                  a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             216d
rendered-master-0a540fd2eb85335cab8066f0bb407d5b   116603ff3d7a39c0de52d7d16fe307c8471330a0   3.2.0             216d
rendered-worker-9da78a560e7fa3c0cbd53d2bf0c17941   116603ff3d7a39c0de52d7d16fe307c8471330a0   3.2.0             216d
rendered-master-248f9034db99de75517a88a2ae8d29e2   116603ff3d7a39c0de52d7d16fe307c8471330a0   3.2.0             215d
rendered-worker-e323d4623ccb8cb71a4953f4ca81f8a4   116603ff3d7a39c0de52d7d16fe307c8471330a0   3.2.0             215d
rendered-worker-5599b4631efd88b5834f0844bec5ce83   116603ff3d7a39c0de52d7d16fe307c8471330a0   3.2.0             203d
rendered-master-8e5868011cc53d9697ac9dda69a64b39   116603ff3d7a39c0de52d7d16fe307c8471330a0   3.2.0             203d
rendered-worker-77e79ef013b565633a4d04da28f7dda4   116603ff3d7a39c0de52d7d16fe307c8471330a0   3.2.0             203d
rendered-master-c40185b01423956d1146b7dea0a3e265   116603ff3d7a39c0de52d7d16fe307c8471330a0   3.2.0             203d
rendered-worker-dee67e9dc2ba77876de8bc8d045740dd   c76785d62cc28ddf3390c865c9e999a02248cd84   3.2.0             57d
rendered-master-3fcbbc2dcbddc79435994c98b655b192   c76785d62cc28ddf3390c865c9e999a02248cd84   3.2.0             57d
rendered-worker-7373393797c4c7e5ae500ff51ba234a9   a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             55d
rendered-master-40ad8d11cc05c7079ee8f3bfa6df5540   a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             55d
rendered-worker-53656729c24a74935907c4fba2adc90a   a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             54d
rendered-master-1e2f18449f36cad8c9618c264d5cdc10   a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             54d
rendered-worker-82df13b281ba3ddb9ba2e8811ef7957c   a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             14d
rendered-master-7ed9a47c85ff478b341adaa20a55e942   a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             14d
rendered-master-98968a4a639e3efad0436d19d4a6c93f   a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             14d
rendered-worker-65ec18c5a3e4ebbf5f8ef459fb3c50b2   a537783ea4a0cd3b4fe2a02626ab27887307ea51   3.2.0             14d

In this command output, you can see that 216 days ago when the cluster was first installed, initial MachineConfigs were rendered - one for the master nodes (rendered-master-0a540f..) in the cluster, and one for the worker nodes (rendered-worker-9da78a...). Whilst these two MachineConfigs share common OS configuration files, there are also lots of differences and so they are managed separately for each MachineConfigPool.

Since the initial install, we can see that new MachineConfig objects were rendered by the machine-config-controller at various intervals. Each new rendered- MachineConfig dates from when the OpenShift cluster was upgraded (because an OpenShift update may have changed an OS setting in, say, /etc/crio/crio.conf).

As well as OCP version updates forcing new rendered configs, you can also create your own MachineConfigs to modify or add OS config files. For what it's worth, I did have to do this with early OpenShift 4.1/4.2 versions, where I modified NTP and SSHD config files for custom settings, and modified the ca-bundle.crt as a workaround because early versions had issues with private CAs. But since those early versions, I have not had any need to make changes - the MCO just manages the complete OS lifecycle of the node.
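For reference, here's a minimal sketch of what such a custom MachineConfig can look like. The name, the chrony example and the NTP server are all illustrative; the role label is what ties it to the worker MachineConfigPool:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-custom-chrony      # illustrative name; 99- orders it after the cluster-managed configs
  labels:
    machineconfiguration.openshift.io/role: worker   # targets the worker MachineConfigPool
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/chrony.conf
          mode: 0644
          overwrite: true
          contents:
            # base64 of "server ntp.example.com iburst" (an example NTP server)
            source: data:text/plain;charset=utf-8;base64,c2VydmVyIG50cC5leGFtcGxlLmNvbSBpYnVyc3QK

Applying this with oc apply -f will trigger a new rendered-worker MachineConfig, which is then rolled out to the worker nodes as described below.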

When a new MachineConfig is generated, the machine-config-controller will work with the machine-config-daemon running on each node to reboot it and have it apply the latest Ignition configuration held within the MachineConfig.

When a node boots Red Hat CoreOS, Ignition connects to the machine-config-server endpoint https://<cluster-api>:22623/config/master (or /config/worker for worker nodes, or /config/<machine-config-pool> for custom pools) and applies the Ignition config to the node.
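You can fetch this rendered Ignition config yourself, which is a handy way to see exactly what a new node would receive. A rough sketch (the Accept header tells the machine-config-server which Ignition spec version to serve, and -k skips TLS verification for a quick look):

$ curl -k -H "Accept: application/vnd.coreos.ignition+json;version=3.2.0" \
    https://<cluster-api>:22623/config/worker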

Quick Health Check of MCO

Most of the time, the MCO works very well. However, as with any highly automated process, it does occasionally go wrong. The first thing to be aware of with the MCO is that you have to keep a close eye on it, particularly when it's rolling out updates. At a high level, the quickest way to do this is to look at the ClusterOperator's status:

$ oc get co machine-config
NAME             VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
machine-config   4.8.10    True        False         False      14d

Here we can see that the MCO is AVAILABLE=True. This is good, and you nearly always want to see this. If it's False, something is very wrong with the cluster, or more likely perhaps, the cluster is in the middle of an OCP upgrade and the MCO pod is restarting on a different node.

If PROGRESSING=False is shown, the MCO has no updates to roll out: the nodes in your cluster are all up to date, running the latest OS version and using all the OS config files from the current rendered MachineConfig. If it's set to True, then the machine-config-controller is busy working through an update and asking the machine-config-daemon on each node to reboot the node and apply a new MachineConfig.

For a healthy MCO, you also want to see DEGRADED=False. If you see this as True, then your MCO has a problem that usually requires manual intervention, and some of the key reasons that I've seen for this condition are documented below.
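When DEGRADED=True, the reason is usually summarised in the ClusterOperator's conditions. Here's a quick sketch of pulling out just that message with jsonpath:

$ oc get co machine-config \
    -o jsonpath='{.status.conditions[?(@.type=="Degraded")].message}{"\n"}'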

As well as looking at the overall status of the ClusterOperator resource, another way to see the health of the MCO at a glance is to look at its MachineConfigPools. In a healthy cluster, these pools will show UPDATED=True and DEGRADED=False. If UPDATING=True, the machine-config-controller is in the middle of rolling out an update:

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-98968a4a639e3efad0436d19d4a6c93f   True      False      False      3              3                   3                     0                      216d
worker   rendered-worker-53656729c24a74935907c4fba2adc90a   True      False      False      3              3                   3                     0                      216d

Monitoring an MCO Update

If your MCO is updating (PROGRESSING=True, UPDATING=True), then you can monitor its progress by watching the logs of the machine-config-controller pod.

$ oc logs -l k8s-app=machine-config-controller -n openshift-machine-config-operator

In the logs from this pod, you'll see nodes in the cluster sequentially being cordoned (marked unschedulable) and then rebooted. Whilst rebooting and applying the new configuration, they will show a NotReady status in oc get nodes:

$ oc get nodes
NAME       STATUS                        ROLES           AGE    VERSION
server01   Ready                         master,worker   217d   v1.21.1+9807387
server02   NotReady,SchedulingDisabled   master,worker   217d   v1.21.1+9807387
server03   Ready                         master,worker   217d   v1.21.1+9807387
server04   NotReady,SchedulingDisabled   worker          203d   v1.21.1+9807387
server05   Ready                         worker          203d   v1.21.1+9807387
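Another way I find useful to watch progress is to print the MCO annotations on each node, which show the config a node is currently on, the config it is moving to, and its state. This is a sketch - the column names are my own:

$ oc get nodes -o custom-columns='NAME:.metadata.name,CURRENT:.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig,DESIRED:.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig,STATE:.metadata.annotations.machineconfiguration\.openshift\.io/state'

Nodes that have finished show a state of Done with matching current and desired configs; the node being updated shows Working (or Degraded if something has gone wrong).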

If something happens when the node reboots (e.g. it fails to get a DHCP lease, it can't boot because of a disk or GRUB error, or the kubelet fails to start), the MCO rollout process will halt. You'll notice this because the node stays NotReady for longer than you'd expect, and the MachineConfigPool status will eventually become degraded. If this happens, you have to roll your sleeves up and work out why.

Troubleshooting Degraded Nodes

The first place to look when troubleshooting a degraded node is its machine-config-daemon pod. In early OpenShift 4.x versions, there were occasions when the reason for a degraded node was only apparent from the Linux journald logs on the node itself. However, in recent versions I've not had to resort to journald logs, because everything about an error has been available in the machine-config-daemon pod logs for the node. To find the correct machine-config-daemon pod for the node, you can simply run oc get pods -n openshift-machine-config-operator -o wide, or alternatively copy and paste the following:

$ MCDPOD=$(oc get pods -n openshift-machine-config-operator -l k8s-app=machine-config-daemon -o jsonpath='{.items[?(@.spec.nodeName=="server01")].metadata.name}')
$ oc logs -n openshift-machine-config-operator $MCDPOD -c machine-config-daemon

Next, let's look at a few common reasons for a node being degraded.

Unexpected on-disk state - content mismatch for file

If the files on a node's disk that are under the control of the MCO don't match the contents of those same files as specified in the node's current MachineConfig (i.e. the MachineConfig named in the node's machineconfiguration.openshift.io/currentConfig annotation), then the machine-config-daemon will refuse to apply the update and the MachineConfigPool will become degraded. You'll see a condition message similar to this:

Type:                  RenderDegraded
Message:               Node server01 is reporting: "unexpected on-disk state validating against rendered-worker-82df13b281ba3ddb9ba2e8811ef7957c: content mismatch for file \"/etc/kubernetes/kubelet.conf\""
Reason:                1 nodes are reporting degraded status on sync
Status:                True
Type:                  NodeDegraded

This happens when somebody has manually edited a file on a node. Remember, there should be no reason or need for a sysadmin (SA) to edit files on RHCOS, but sometimes SAs who are junior or unfamiliar with OpenShift might be tempted to edit OS config files using ssh/vim.

You can get an indication of this by looking at the machineconfiguration.openshift.io/ssh annotation on the node. If this is set to accessed, then an SA has ssh'd onto the node and could perhaps have changed something manually. Ordinarily (the troubleshooting steps below being the exception), SAs shouldn't be ssh'ing onto RHCOS nodes.
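For example, to check that annotation quickly:

$ oc get node server01 \
    -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/ssh}{"\n"}'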

The easiest way to quickly fix this is to restore the config file from the MachineConfig. To do this, output the correct MachineConfig in full YAML, and then search for the path entry for the file being reported as having a content mismatch. Alternatively, copy and run the following commands to get the file contents:

$ CURRENT_CONFIG=$(oc get node -o jsonpath="{.items[?(@.metadata.name=='server01')].metadata.annotations['machineconfiguration\.openshift\.io/currentConfig']}")
$ oc get mc $CURRENT_CONFIG -o jsonpath="{.spec.config.storage.files[?(@.path=='/etc/kubernetes/kubelet.conf')].contents.source}" > newkubelet.conf

Then you simply base64 decode (or urldecode) the file contents, and scp the file back onto the degraded node in the correct location, as sketched below. This will overwrite the manual changes made to the file and restore it to what the MCO expects. The MCO should then pick up that the on-disk state matches the MachineConfig again, and continue progressing with what it was doing.
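Here's a rough sketch of those steps, assuming the contents were stored base64-encoded (URL-encoded sources need url-decoding instead); the file names are illustrative:

$ sed 's/^data:.*base64,//' newkubelet.conf | base64 -d > kubelet.conf.restored
$ scp kubelet.conf.restored core@server01:/tmp/
$ ssh core@server01 'sudo cp /tmp/kubelet.conf.restored /etc/kubernetes/kubelet.conf'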

Error running rpm-ostree

Red Hat CoreOS uses something called rpm-ostree, which is a "hybrid image/package system". It manages the OS as a versioned tree built from the Linux software packages (RPMs) that CoreOS requires, and allows atomic upgrades (and downgrades) by pivoting between deployments. Sometimes this fails, and the node will become degraded. I've seen this fail for a couple of reasons - once with an issue where the node was unable to download an image from quay.io, and another time when a third party vendor (who wasn't familiar with how rpm-ostree and Red Hat CoreOS work) set the immutable file attribute on some of its files on the node, causing the node to fail drastically on each OS upgrade. The vendor has since removed these custom attributes from their product after our experience.

The best way to fix this, if you can, is to correct the issue breaking the pivot - for example, by removing the immutable attribute from a system file, as sketched below. The Linux systemd journal on the node should help with identifying the error. If you still can't fix it, the last resort is to force a pivot.
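As a sketch of what that looks like from a debug shell (the vendor file path here is hypothetical - use whatever path the journal or machine-config-daemon logs complain about):

$ oc debug node/server01
sh-4.4# chroot /host
sh-4.4# rpm-ostree status                  # shows the booted deployment and any pending or failed one
sh-4.4# journalctl -u rpm-ostreed -b       # rpm-ostree daemon errors from the current boot
sh-4.4# lsattr /etc/vendor/agent.conf      # an 'i' flag here means the file is immutable
sh-4.4# chattr -i /etc/vendor/agent.conf   # clear it so the pivot can proceed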

Forcing a Pivot

If you can't get the node to update, you'll have to force it. To do this, you connect to the node and run the machine-config-daemon pivot command with the correct OS image SHA (find this in the errors in the logs to be sure you get the correct one!):

$ oc debug node/server01
Starting pod/server01-debug ...
To use host binaries, run `chroot /host`
Pod IP: 192.168.1.201
If you don't see a command prompt, try pressing enter.
sh-4.4# chroot /host
sh-4.4# /run/bin/machine-config-daemon pivot "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:aaaabbbbccccdddd"

Forcing a Node to Use a Specific MachineConfig

As a last resort - and experience has shown it really is better to address any MCO issues in other ways before doing this - you can force a node to reboot and use a different MachineConfig. This is done by manually editing the machineconfiguration annotations on the Node object itself.

$ oc patch node server01 --type merge --patch '{"metadata": {"annotations": {"machineconfiguration.openshift.io/currentConfig": "rendered-worker-52656729c24a7493"}}}'
$ oc patch node server01 --type merge --patch '{"metadata": {"annotations": {"machineconfiguration.openshift.io/desiredConfig": "rendered-worker-52656729c24a7493"}}}'
$ oc patch node server01 --type merge --patch '{"metadata": {"annotations": {"machineconfiguration.openshift.io/reason": ""}}}'
$ oc patch node server01 --type merge --patch '{"metadata": {"annotations": {"machineconfiguration.openshift.io/state": "Done"}}}'

Then you have to connect to the node and touch a machine-config-daemon-force file. This will trigger the machine-config-daemon to reboot the node and force the update to the desired MachineConfig.

$ oc debug node/server01
sh-4.4# chroot /host
sh-4.4# touch /run/machine-config-daemon-force

On occasion, you might not be able to connect to the node using an oc debug pod. I've seen this happen when a systemd unit running podman early in the boot process has had an image pull failure; the debug pod also can't run when the kubelet isn't running properly, meaning SSH is the only option.

$ ssh core@server01
[core@server01 ~]$ sudo touch /run/machine-config-daemon-force

Pausing MCO Updates

As described above, the MCO will automatically carry on the work of rebooting and managing your nodes if you let it. This is great, but it does mean that nodes in the cluster can sometimes be rebooted without warning. Of course, truly cloud-native, Kubernetes-ready applications should be running multiple replicas across nodes anyway, so this shouldn't normally be a problem.

However, should you wish to stop the MCO from automatically rebooting nodes when you don't want it to, you can pause the pools by setting the .spec.paused field on the MachineConfigPool, for example by running the following patch commands:

$ oc patch mcp master --type merge --patch '{"spec":{"paused": true}}'
$ oc patch mcp worker --type merge --patch '{"spec":{"paused": true}}'

It should be noted that you shouldn't leave MachineConfigPools paused for long periods of time; this can cause issues for the MCO if pending configuration updates collide with cluster certificate rotation events. However, pausing the pools whilst the actual OCP control plane upgrade happens, and then unpausing when you're ready for all the nodes to be rebooted, can be very useful.
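When you're ready for the MCO to resume, simply set paused back to false and it will carry on rolling the nodes through any pending configuration:

$ oc patch mcp master --type merge --patch '{"spec":{"paused": false}}'
$ oc patch mcp worker --type merge --patch '{"spec":{"paused": false}}'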

Summary

The Machine Config Operator is a clever piece of software that significantly reduces the workload of running and managing the nodes in an OpenShift cluster. However, it's important for cluster administrators to understand how it works, and to be able to recover it when issues arise and it becomes degraded. Hopefully this blog post will help you understand it a little better and resolve any issues you encounter.