So: Kubernetes on a bare-metal cluster, and etcd died on one of the control-plane machines after a reboot caused by a hardware failure. The control-plane node shows up as "NotReady", and the etcd log shows:
panic: tocommit(11989253) is out of range [lastIndex(11989215)]. Was the raft log corrupted, truncated, or lost?
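To read the message: the member is being asked to commit raft index 11989253, but its local log ends at 11989215, so the newest entries are most likely the ones lost in the crash. A minimal sketch that pulls the two numbers out of such a line and prints the gap (the function name `raft_gap` is made up for illustration, it is not part of etcd):

```shell
#!/bin/sh
# raft_gap: hypothetical helper, not part of etcd.
# Takes an etcd panic line as $1 and prints tocommit - lastIndex.
raft_gap() {
    # Extract the numbers inside tocommit(...) and lastIndex(...).
    tocommit=$(printf '%s\n' "$1" | sed -n 's/.*tocommit(\([0-9][0-9]*\)).*/\1/p')
    lastindex=$(printf '%s\n' "$1" | sed -n 's/.*lastIndex(\([0-9][0-9]*\)).*/\1/p')
    # Bail out if the line does not look like the expected panic.
    if [ -z "$tocommit" ] || [ -z "$lastindex" ]; then
        echo "not a tocommit/lastIndex panic line" >&2
        return 1
    fi
    echo $((tocommit - lastindex))
}

# The log line from this post: prints 38
raft_gap 'panic: tocommit(11989253) is out of range [lastIndex(11989215)]. Was the raft log corrupted, truncated, or lost?'
```

So the local log is 38 entries short, which is why the member cannot simply catch up from its own data.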
What is the Kubernetes way? Kill the node, create a new node, add the new node to the cluster, and that's it. But this error is easy to fix in place anyway:
- stop the "kubelet" on the affected node,
- remove the broken "etcd" member from the cluster, from any healthy node,
- add the member back to the cluster,
- remove the broken "etcd" data on the affected node,
- start the "kubelet" on the affected node.
In CLI commands:
# systemctl stop kubelet
# kubectl exec --stdin --tty -n kube-system etcd-control-plane1 -- sh
# export ETCDCTL_API=3
# etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key -w table member list
+------------------+---------+----------------+------------------------------+------------------------------+------------+
|        ID        | STATUS  |      NAME      |          PEER ADDRS          |         CLIENT ADDRS         | IS LEARNER |
+------------------+---------+----------------+------------------------------+------------------------------+------------+
| 3bdfefa0fdf07aec | started | control-plane1 | https://192.168.74.84:2380   | https://192.168.74.84:2379   |      false |
| 62c4c2a1fcdcc7f8 | started | control-plane2 | https://192.168.79.78:2380   | https://192.168.79.78:2379   |      false |
| fe45461662c93a78 | started | control-plane3 | https://192.168.212.236:2380 | https://192.168.212.236:2379 |      false |
+------------------+---------+----------------+------------------------------+------------------------------+------------+
# etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member remove fe45461662c93a78
Member fe45461662c93a78 removed from cluster 3bdfefa0fdf07aec
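# etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member add control-plane3 --peer-urls=https://192.168.212.236:2380
# exit
Back on the affected node, remove the broken data (/var/lib/etcd is the kubeadm default data directory, so adjust the path if yours differs) and start the kubelet again:
# rm -rf /var/lib/etcd/member
# systemctl start kubelet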
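If you would rather not copy the member ID by hand, it can be looked up by name. A minimal sketch, assuming the default (simple, comma-separated) output of "etcdctl member list", i.e. without "-w table"; the function name `member_id_by_name` is made up for illustration and the sample input is just the member list from this post:

```shell
#!/bin/sh
# member_id_by_name: hypothetical helper, not part of etcdctl.
# Reads "etcdctl member list" output in the default (simple) format on stdin,
# one member per line: ID, STATUS, NAME, PEER ADDRS, CLIENT ADDRS, IS LEARNER
# Prints the ID of the member whose NAME equals $1; exits non-zero if none matches.
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1; found = 1 } END { exit !found }'
}

# Sample input: the members from this post, in simple format instead of a table.
sample='3bdfefa0fdf07aec, started, control-plane1, https://192.168.74.84:2380, https://192.168.74.84:2379, false
62c4c2a1fcdcc7f8, started, control-plane2, https://192.168.79.78:2380, https://192.168.79.78:2379, false
fe45461662c93a78, started, control-plane3, https://192.168.212.236:2380, https://192.168.212.236:2379, false'

# Prints fe45461662c93a78
printf '%s\n' "$sample" | member_id_by_name control-plane3
```

On the real cluster you would pipe the "member list" command from above (minus "-w table") into the function and hand the result to "member remove".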
That's it... :)