So: Kubernetes on a bare-metal cluster, and “etcd” died on one of the control-plane machines after a reboot caused by a hardware failure. The control-plane node shows up as “NotReady” and the “etcd” log shows:
panic: tocommit(11989253) is out of range [lastIndex(11989215)]. Was the raft log corrupted, truncated, or lost?
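The two numbers in that panic show how far behind the local log is: the cluster expects this member to commit up to index 11989253, but its own raft log ends at 11989215. A small sketch that pulls both numbers out of such a line and prints the difference (the function name “raft_gap” is made up for this example):

```shell
#!/bin/sh
# Print tocommit - lastIndex for each etcd panic line read from stdin.
# raft_gap is a hypothetical helper name, not part of etcd or etcdctl.
raft_gap() {
    sed -n 's/.*tocommit(\([0-9][0-9]*\)).*lastIndex(\([0-9][0-9]*\)).*/\1 \2/p' |
    while read -r tocommit last; do
        echo $((tocommit - last))
    done
}

echo 'panic: tocommit(11989253) is out of range [lastIndex(11989215)].' | raft_gap
# prints 38
```

So 38 log entries are missing on this member, which fits a crash where the last writes never reached the disk.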
What is the Kubernetes way? Kill that node, create a new node, add this new node to the cluster and that’s it. But this is an easily fixable error anyway:
- stop the “kubelet” on the affected node,
- remove the broken “etcd” member from the cluster on any healthy node,
- remove the broken “etcd” data on the affected node,
- add the node back to the cluster as a new “etcd” member on the same healthy node,
- start the “kubelet” on the affected node.
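In CLI commands (the “systemctl” lines run on the affected node, here control-plane3; the “etcdctl” lines run inside the “etcd” pod of a healthy node, here control-plane1):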
# systemctl stop kubelet
# kubectl exec --stdin --tty -n kube-system etcd-control-plane1 -- sh
# export ETCDCTL_API=3
# etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
-w table member list
+------------------+---------+----------------+------------------------------+------------------------------+------------+
|        ID        | STATUS  |      NAME      |          PEER ADDRS          |         CLIENT ADDRS         | IS LEARNER |
+------------------+---------+----------------+------------------------------+------------------------------+------------+
| 3bdfefa0fdf07aec | started | control-plane1 | https://192.168.74.84:2380   | https://192.168.74.84:2379   | false      |
| 62c4c2a1fcdcc7f8 | started | control-plane2 | https://192.168.79.78:2380   | https://192.168.79.78:2379   | false      |
| fe45461662c93a78 | started | control-plane3 | https://192.168.212.236:2380 | https://192.168.212.236:2379 | false      |
+------------------+---------+----------------+------------------------------+------------------------------+------------+
# etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
member remove fe45461662c93a78
Member fe45461662c93a78 removed from cluster 3bdfefa0fdf07aec
# etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
--key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
member add control-plane3 \
--peer-urls https://192.168.212.236:2380
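The step list mentions removing the broken “etcd” data on the affected node, but the commands do not show it. A minimal sketch of that step, assuming the kubeadm default data directory /var/lib/etcd (check the --data-dir flag in /etc/kubernetes/manifests/etcd.yaml on your node). The function name “move_etcd_data_aside” is made up for this example, and it moves the data next to the data directory instead of deleting it, so nothing is lost if something goes wrong:

```shell
#!/bin/sh
# Move <data-dir>/member out of the way so etcd starts with an empty data dir.
# ASSUMPTION: the data dir is /var/lib/etcd (kubeadm default) unless given as $1.
# move_etcd_data_aside is a hypothetical helper name.
move_etcd_data_aside() {
    data_dir="${1:-/var/lib/etcd}"
    data_dir="${data_dir%/}"
    if [ ! -d "$data_dir/member" ]; then
        echo "no member directory under $data_dir" >&2
        return 1
    fi
    # Keep the copy outside the data dir, e.g. /var/lib/etcd.member-broken-<timestamp>
    backup="$data_dir.member-broken-$(date +%Y%m%d%H%M%S)"
    mv "$data_dir/member" "$backup" && echo "$backup"
}

# Usage on the affected node, as root, while the kubelet is still stopped:
#   move_etcd_data_aside /var/lib/etcd
```

Once the member is healthy again, the directory printed by the function can be deleted.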
# systemctl start kubelet
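If the member ID has to be looked up in a script instead of being read off the table, the default output of “etcdctl member list” (without “-w table”) is easier to parse, since it prints one comma-separated line per member. A small sketch, where “member_id_by_name” is a made-up helper name and the sample lines are the members from the table above:

```shell
#!/bin/sh
# Print the ID of the member with the given name.
# Reads the default (comma-separated) `etcdctl member list` output on stdin.
# member_id_by_name is a hypothetical helper name, not an etcdctl command.
member_id_by_name() {
    awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Demo with sample lines in the same format:
member_id_by_name control-plane3 <<'EOF'
3bdfefa0fdf07aec, started, control-plane1, https://192.168.74.84:2380, https://192.168.74.84:2379, false
62c4c2a1fcdcc7f8, started, control-plane2, https://192.168.79.78:2380, https://192.168.79.78:2379, false
fe45461662c93a78, started, control-plane3, https://192.168.212.236:2380, https://192.168.212.236:2379, false
EOF
# prints fe45461662c93a78
```

In real use the input would come from the same “etcdctl ... member list” command as above, piped into the function.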
That’s it… 🙂