Breaking into your OpenShift Cluster

Of course, this really shouldn't happen, right? You're a responsible IT professional, your OpenShift cluster is configured with multiple authentication methods, and you keep a secure backup of a kubeconfig file containing the system:admin user certificates for passwordless login? Except sometimes it does happen.
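If you do have such a backup kubeconfig, by the way, it's worth testing it every now and again. Something along these lines will do, where the path is simply wherever you keep the backup:

$ oc --kubeconfig=/path/to/backup-kubeconfig whoami
system:admin
$ oc --kubeconfig=/path/to/backup-kubeconfig get nodes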

In my day job, we have a small lab OCP cluster running on vSphere that might not get used for weeks or sometimes months. The OpenShift nodes that make up the cluster sometimes get automatically powered down and left in that powered-off state.

Yesterday, I came to use this cluster again for the first time in quite a while and found that I was unable to log in with the standard LDAP user account with cluster-admin privileges that I would normally use. In addition, my admin kubeconfig (the one generated by openshift-install for the original installation) contained very old, pre-rotation CA certificates and hadn't been used for a while; attempting to use it just gave x509 errors, so it seemed unusable. This has happened a couple of times now over the years, and I believe it's related to this being an IPI cluster and to certificate rotation issues from the cluster being powered down for long periods. Before looking at the root cause, I'm just writing down the recovery procedure I used for next time!
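As an aside, you can check whether the client certificate embedded in a kubeconfig really has expired by extracting it and handing it to openssl. A rough one-liner, assuming the certificate is embedded as client-certificate-data in the file (as it is in the installer-generated kubeconfig):

$ grep client-certificate-data auth/kubeconfig | awk '{print $2}' | base64 -d | openssl x509 -noout -subject -enddate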

So with no standard OAuth authentication method available, I went back to trying the original kubeconfig from the installation of the cluster.

$ export KUBECONFIG=auth/kubeconfig
$ oc get nodes
Unable to connect to the server: x509: certificate signed by unknown authority

To try to understand more about what was happening, I added the --insecure-skip-tls-verify=true and --loglevel=10 options to the command line, but all that produced was a goroutine stack trace immediately after the x509 error message above.
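For the record, the attempts looked roughly like this. The second line is just another (non-oc) way of seeing which CA has signed the serving certificate the API endpoint is actually presenting; api.<cluster-domain> is a placeholder for your API URL:

$ oc get nodes --insecure-skip-tls-verify=true --loglevel=10
$ echo | openssl s_client -connect api.<cluster-domain>:6443 -showcerts 2>/dev/null | openssl x509 -noout -issuer -enddate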

As any experienced senior engineer will know, the first place to go for solving any technical problem is a good Google or StackOverflow search. Unfortunately, StackOverflow threw up a lot of articles that were old, not quite right, or simply not what I was looking for.

Unsurprisingly - this being OpenShift - the best results were found in the Red Hat Knowledgebase. The KB articles "error: x509 certificate signed by unknown authority when logging in OpenShift 4 using the installation kubeconfig file" and "How to re create kubeconfig from scratch for system:admin user in OpenShift 4" looked promising, but both required existing working access to the cluster to recreate the kubeconfig, which I didn't have because the standard authentication methods were not working.

I knew that all the certs I needed were on the cluster nodes themselves, but these systems were running Red Hat CoreOS (RHCOS). RHCOS is based on Red Hat Enterprise Linux, but designed very much as an appliance. It's not expected to be managed by a human sysadmin: the root account is disabled, password-based logins aren't allowed, and its OS configuration is managed by the Machine Config Operator function of OpenShift. The next problem was that (for reasons of this being a lab and the cluster having been built by a colleague) the SSH key for the core user used for the installation wasn't available to me either :-) I did, however, have access to the vSphere console of these VMs. There was no option: I had to break my way in and get those certs.
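For completeness: had the installation key been to hand, the normal way onto an RHCOS node is just key-based SSH as the core user followed by sudo, something like the below (the key path and node address are placeholders):

$ ssh -i ~/.ssh/installer-key core@<node-ip>
$ sudo -i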

The first thing was to boot the system to single-user mode and reset the root password so I could log in on the console. I've done this thousands of times over the years on many different versions and flavours of Linux, but interestingly it was more complicated on CoreOS (because of the way CoreOS works, it's not normally something you need to do). I did eventually get in, though, and was then able to locate the files I needed.
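For my own notes, the generic RHEL-family recipe I started from goes like this: at the GRUB menu press e, append rd.break to the kernel (linux) line, boot with Ctrl-X, and then in the emergency shell do roughly the following (on RHCOS the ostree layout adds extra wrinkles, which I won't try to reproduce from memory here):

# mount -o remount,rw /sysroot
# chroot /sysroot
# passwd root
# touch /.autorelabel
# exit
# exit

The touch /.autorelabel forces an SELinux relabel on the next boot.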

So, all the secrets needed to run kube-apiserver are found in /etc/kubernetes/static-pod-resources/kube-apiserver-certs. In particular, the secrets/node-kubeconfigs/ directory has multiple kubeconfig files, one of which is called lb-int.kubeconfig. When I set my KUBECONFIG environment variable to point to this file, I was then able to issue oc commands to the cluster:

# cd /etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/node-kubeconfigs
# export KUBECONFIG=$(pwd)/lb-int.kubeconfig
# oc get nodes
<lots of nodes with STATUS NotReady>
# oc get pods -A | grep -v Running
<lots of pods in ContainerCreating or Pending>
# oc get csr -o name | xargs oc adm certificate approve

I found that half the nodes were in state NotReady, and so the oauth-openshift pods were not Running. This in turn was because there was a bunch of unapproved CSRs. Once these were approved, the nodes became Ready, the OAuth authentication pods were able to start up, and the cluster recovered itself in the usual way. I was then able to log in normally again and recreate a backup kubeconfig!
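One note for next time: the CSRs tend to arrive in waves (client certificates first, then serving certificates as the kubelets come back), so it's worth re-running the approval and keeping an eye on the nodes and cluster operators until everything settles down. Roughly:

# oc get csr -o name | xargs oc adm certificate approve
# watch oc get nodes
# oc get clusteroperators
# oc get pods -n openshift-authentication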