Was the raft log corrupted, truncated, or lost?

The setup: a bare-metal Kubernetes cluster where etcd died on one of the control-plane machines after a reboot caused by a hardware failure. The control-plane node shows as “NotReady”, and the etcd log shows:

panic: tocommit(11989253) is out of range [lastIndex(11989215)]. Was the raft log corrupted, truncated, or lost?
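
The two numbers in the panic are raft log indexes: the cluster has committed entries up to 11989253, but the local log on this node ends at 11989215. A minimal sketch for pulling the gap out of such a line (the function name raft_gap is hypothetical, not part of etcd):

```shell
# Hypothetical helper: extract both indexes from an etcd panic line and
# print how many committed entries are missing from the local log.
raft_gap() {
  line=$1
  tocommit=$(printf '%s\n' "$line" | sed -n 's/.*tocommit(\([0-9][0-9]*\)).*/\1/p')
  lastindex=$(printf '%s\n' "$line" | sed -n 's/.*lastIndex(\([0-9][0-9]*\)).*/\1/p')
  # Fail if either number could not be parsed from the line.
  [ -n "$tocommit" ] && [ -n "$lastindex" ] || return 1
  echo $((tocommit - lastindex))
}

raft_gap 'panic: tocommit(11989253) is out of range [lastIndex(11989215)].'
```

For the message above this prints 38, so the node lost the last 38 entries of its log, which fits an unclean shutdown.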

What is the Kubernetes way? Kill that node, create a new one, join it to the cluster, and that’s it. But this error is easy to fix in place anyway:

stop the “kubelet” on the affected node,
remove the broken “etcd” member from the cluster on any healthy node,
add the member back to the cluster with the same peer URL,
remove the broken “etcd” data on the affected node,
start the “kubelet” on the affected node.

In CLI commands:

# systemctl stop kubelet

# kubectl exec --stdin --tty -n kube-system etcd-control-plane1 -- sh
# export ETCDCTL_API=3
# etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  -w table member list
+------------------+---------+----------------+-------------------------------+--------------------------------+------------+
|        ID        | STATUS  |     NAME       |          PEER ADDRS           |         CLIENT ADDRS           | IS LEARNER |
+------------------+---------+----------------+-------------------------------+--------------------------------+------------+
| 3bdfefa0fdf07aec | started | control-plane1 |    https://192.168.74.84:2380 |    https://192.168.74.84:2379  |      false |
| 62c4c2a1fcdcc7f8 | started | control-plane2 |    https://192.168.79.78:2380 |    https://192.168.79.78:2379  |      false |
| fe45461662c93a78 | started | control-plane3 |  https://192.168.212.236:2380 |  https://192.168.212.236:2379  |      false |
+------------------+---------+----------------+-------------------------------+--------------------------------+------------+
# etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member remove fe45461662c93a78 
Member fe45461662c93a78 removed from cluster 3bdfefa0fdf07aec
# etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt \
  --key=/etc/kubernetes/pki/etcd/healthcheck-client.key \
  member add control-plane3 \
  --peer-urls=https://192.168.212.236:2380
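
The listing has no command for the data-removal step. A minimal sketch of it, run on the affected node, assuming the kubeadm default data directory /var/lib/etcd (check the hostPath in /etc/kubernetes/manifests/etcd.yaml if unsure). Instead of deleting, it moves the data aside so it can still be inspected; the function name is hypothetical:

```shell
# Assumption: kubeadm default etcd data directory; override if yours differs.
ETCD_DATA_DIR=${ETCD_DATA_DIR:-/var/lib/etcd}

# Hypothetical helper: move the broken member data next to the data
# directory instead of deleting it, and print the backup location.
move_etcd_data_aside() {
  dir=${1%/}
  [ -d "$dir/member" ] || { echo "no member directory in $dir" >&2; return 1; }
  backup="$dir-member.broken.$(date +%Y%m%d%H%M%S)"
  mv "$dir/member" "$backup" && echo "$backup"
}

# Run on the affected node while the kubelet is stopped:
# move_etcd_data_aside "$ETCD_DATA_DIR"
```

Once the member is healthy again, the backup directory can be deleted.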

# systemctl start kubelet
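
To confirm that the member has rejoined, the member list can be checked again from a healthy node. A small sketch that counts started members in the default (comma-separated) output of etcdctl member list; the function name and the all-zero ID in the sample are placeholders, not real output:

```shell
# Hypothetical helper: count members whose status is "started" in the
# simple output of `etcdctl member list`, read from stdin.
count_started() {
  awk -F', ' '$2 == "started" { n++ } END { print n + 0 }'
}

# Sample input: two healthy members from this post plus a placeholder line
# for a member that was re-added but has not started yet.
count_started <<'EOF'
3bdfefa0fdf07aec, started, control-plane1, https://192.168.74.84:2380, https://192.168.74.84:2379, false
62c4c2a1fcdcc7f8, started, control-plane2, https://192.168.79.78:2380, https://192.168.79.78:2379, false
0000000000000000, unstarted, , https://192.168.212.236:2380, , false
EOF
```

The sample prints 2. Against the real cluster, pipe the output of the member list command (without -w table) into count_started; it should report 3 once the kubelet has restarted etcd on the repaired node.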

That’s it… 🙂
