So: Kubernetes, a bare-metal cluster, and etcd died on one of the control-plane machines after a reboot caused by a hardware failure. The control-plane node is clearly “NotReady”, and the etcd log shows:
panic: tocommit(11989253) is out of range [lastIndex(11989215)]. Was the raft log corrupted, truncated, or lost?
What is the Kubernetes way? Kill that node, create a new one, join it to the cluster, and that’s it. But this error is easy to fix in place anyway:
- stop the “kubelet” on the affected node,
- remove the broken “etcd” member from the cluster from any healthy node, then add it back as a new member,
- remove the broken “etcd” data on the affected node,
- start the “kubelet” on the affected node.
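The data-removal step has no command of its own in the listing that follows, so here is a minimal sketch of it. It assumes the kubeadm default data directory /var/lib/etcd (check the hostPath in /etc/kubernetes/manifests/etcd.yaml if your setup differs). The function name and the backup location are illustrative, not part of etcd or kubeadm. It moves the old data aside instead of deleting it, so the files stay available for inspection.

```shell
#!/bin/sh
# set_aside_etcd_data: hypothetical helper, not an etcd or kubeadm command.
# Moves <data-dir>/member into a sibling backup directory so that etcd
# finds an empty data directory on its next start.
set_aside_etcd_data() {
    data_dir=${1:-/var/lib/etcd}   # assumption: kubeadm's default location
    backup_dir=${data_dir%/}.broken-$(date +%Y%m%d%H%M%S)

    # Refuse to do anything if there is no member directory to move.
    if [ ! -d "$data_dir/member" ]; then
        echo "no member directory under $data_dir" >&2
        return 1
    fi

    # Keep the data directory itself (and its permissions) in place.
    mkdir -p "$backup_dir" &&
        mv "$data_dir/member" "$backup_dir/member" &&
        echo "$backup_dir"
}
```

Run as root on the affected node, between stopping and starting the kubelet, this would be called as set_aside_etcd_data /var/lib/etcd. It prints the backup directory, which can be deleted once the member has rejoined.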
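In CLI commands (the systemctl commands run on the affected node, the etcdctl commands inside the etcd pod of a healthy control-plane node):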
# systemctl stop kubelet
# kubectl exec --stdin --tty -n kube-system etcd-control-plane1 -- sh
# export ETCDCTL_API=3
# etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  -w table member list
+------------------+---------+----------------+-------------------------------+--------------------------------+------------+
|        ID        | STATUS  |     NAME       |          PEER ADDRS           |         CLIENT ADDRS           | IS LEARNER |
+------------------+---------+----------------+-------------------------------+--------------------------------+------------+
| 3bdfefa0fdf07aec | started | control-plane1 |    https://192.168.74.84:2380 |    https://192.168.74.84:2379  |      false |
| 62c4c2a1fcdcc7f8 | started | control-plane2 |    https://192.168.79.78:2380 |    https://192.168.79.78:2379  |      false |
| fe45461662c93a78 | started | control-plane3 |  https://192.168.212.236:2380 |  https://192.168.212.236:2379  |      false |
+------------------+---------+----------------+-------------------------------+--------------------------------+------------+
# etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member remove fe45461662c93a78 
Member fe45461662c93a78 removed from cluster 3bdfefa0fdf07aec
# etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member add fe45461662c93a78 \
  --peer-urls=https://192.168.212.236:2380
# systemctl start kubelet
That’s it… 🙂
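To confirm that the member really came back, one option is to count the members that report “started” in the default (non-table) output of member list. This is a minimal sketch, assuming that output is comma-separated with the status in the second field. started_members is a hypothetical helper name, not an etcdctl feature.

```shell
#!/bin/sh
# started_members: hypothetical helper.
# Reads `etcdctl member list` output (default comma-separated format) on
# stdin and prints how many members have the status "started".
started_members() {
    # Fields are separated by ", "; the second field is the member status.
    awk -F', ' '$2 == "started" { n++ } END { print n + 0 }'
}
```

Pipe the same member list command as above, without -w table, into started_members. A freshly re-added member is listed as “unstarted” until its etcd process joins, so the count should reach 3 again shortly after the kubelet is started on the affected node.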