Was the raft log corrupted, truncated, or lost?

By gabor.auth , 2 August, 2025

gabor.auth's Blog
Log in or register to post comments

So, Kubernetes, bare-metal cluster, and the etcd died on one of the control-plane machines after a reboot due to a hardware failure. You can clearly see that the control-plane node is "NotReady" and the log shows:

panic: tocommit(11989253) is out of range [lastIndex(11989215)]. Was the raft log corrupted, truncated, or lost?

What is the Kubernetes way? Kill egy node, create a new node, add this new node to the cluster and that's it. But this is an easily fixable error anyway:

stop the "kubelet" on the affected node,
remove the broken "etcd" member from the cluster on any healthy node,
remove the "etcd" broken data,
start the "kubelet" on the affected node.

In CLI commands:

# systemctl stop kubelet

# kubectl exec --stdin --tty -n kube-system etcd-control-plane1 -- sh
# export ETCDCTL_API=3
# etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key -w table member list
+------------------+---------+----------------+-------------------------------+--------------------------------+------------+
|        ID        | STATUS  |     NAME       |          PEER ADDRS           |         CLIENT ADDRS           | IS LEARNER |
+------------------+---------+----------------+-------------------------------+--------------------------------+------------+
| 3bdfefa0fdf07aec | started | control-plane1 |    https://192.168.74.84:2380 |    https://192.168.74.84:2379  |      false |
| 62c4c2a1fcdcc7f8 | started | control-plane2 |    https://192.168.79.78:2380 |    https://192.168.79.78:2379  |      false |
| fe45461662c93a78 | started | control-plane3 |  https://192.168.212.236:2380 |  https://192.168.212.236:2379  |      false |
+------------------+---------+----------------+-------------------------------+--------------------------------+------------+
# etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member remove fe45461662c93a78 
Member fe45461662c93a78 removed from cluster 3bdfefa0fdf07aec
# etcdctl --endpoints=https://127.0.0.1:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/healthcheck-client.crt --key=/etc/kubernetes/pki/etcd/healthcheck-client.key member add fe45461662c93a78 
--peer-urls https://192.168.212.236:2380

# systemctl- start kubelet

That's it... :)

Blog comments

Blog tags

devops

kubernetes