Recover Cassandra pods from corrupt commit logs
After you deploy the Cassandra data service and then reboot the worker nodes, the Cassandra pods can fail to come up and form the cluster because of corrupt commit logs. The pod logs show errors similar to the following:
cassandra ERROR 18:22:11 Exiting due to error while processing commit log during initialization.
cassandra org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: Mutation checksum failure at 23031717 in Next section at 23028485 in CommitLog-7-1676881531203.log
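The file to delete during recovery is the one named at the end of the CommitLogReadException message. The filter below is a minimal sketch for extracting that name from pod logs. It assumes the message format shown above. The function name corrupt_commitlog_name is hypothetical, and the pod and namespace in the usage comment are placeholders.

```shell
#!/bin/sh
# Print the commit log file named in a Cassandra CommitLogReadException,
# reading log lines from standard input. Assumes the message format shown
# above, where the file name follows the exception text on the same line.
# Example (pod and namespace are placeholders; add --previous to read the
# logs of the last terminated container instead of the current one):
#   kubectl logs <statefulset-name>-0 -n <namespace> | corrupt_commitlog_name
corrupt_commitlog_name() {
  # Keep only the CommitLog-<n>-<n>.log name; print the last match found.
  sed -n 's/.*CommitLogReadException.*\(CommitLog-[0-9]*-[0-9]*\.log\).*/\1/p' \
    | tail -n 1
}
```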
To recover Cassandra pods from the corrupt commit logs:
- Scale the pds-deployment-controller-manager deployment down to 0 replicas so that the PDS operator does not revert your changes to the statefulset.
- Edit the Cassandra statefulset and add the following line to the Cassandra container under spec.template.spec.containers:
  command: ["/bin/sleep", "3650d"]
  This makes the container sleep instead of starting Cassandra, so the pod stays up and you can reach its data directory.
  Note: The statefulset name is identical to the deployment name in the PDS UI.
- Delete all Cassandra pods and wait for pod 0 to reach the Running state. Pod 0 runs but never becomes ready, because Cassandra is not started.
- Shell into pod 0 and delete the corrupt commit log, which is the file named in the error message. For example:
  rm /srv/pds/data/commitlog/CommitLog-7-1676881531203.log
- Exit the shell in pod 0 and scale the pds-deployment-controller-manager deployment back to 1 replica.
- Wait approximately 30 seconds for the deployment operator to update the statefulset.
- Delete Cassandra pod 0.
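The recovery steps can be scripted roughly as follows. This is a minimal sketch, not a PDS tool. It assumes kubectl access to the cluster, that the Cassandra container is the first entry in spec.template.spec.containers, and that pods are named <statefulset>-0 through <statefulset>-N. It also assumes you supply both the Cassandra namespace and the namespace of pds-deployment-controller-manager. The names run and recover_cassandra are hypothetical. The script replaces the interactive edit with a kubectl JSON patch and only prints commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical helper mirroring the manual recovery steps.
# Commands are printed; they are executed only when DRY_RUN=0.
# Assumptions (not from the PDS docs): kubectl access, the Cassandra
# container is containers[0], and pods are named <sts>-0 .. <sts>-N.

run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Usage: recover_cassandra <namespace> <statefulset> <replicas> <commitlog-file> <operator-namespace>
recover_cassandra() {
  local ns="$1" sts="$2" replicas="$3" commitlog="$4" operator_ns="$5"
  local i=0

  # Stop the PDS operator so it does not undo the statefulset change.
  run kubectl scale deployment pds-deployment-controller-manager \
    -n "$operator_ns" --replicas=0

  # Override the container command so Cassandra does not start.
  run kubectl patch statefulset "$sts" -n "$ns" --type=json --patch \
    '[{"op":"add","path":"/spec/template/spec/containers/0/command","value":["/bin/sleep","3650d"]}]'

  # Delete every Cassandra pod so the statefulset recreates them.
  while [ "$i" -lt "$replicas" ]; do
    run kubectl delete pod "$sts-$i" -n "$ns"
    i=$((i + 1))
  done

  # Poll until pod 0 reports phase Running (it never becomes Ready).
  if [ "${DRY_RUN:-1}" = "0" ]; then
    until [ "$(kubectl get pod "$sts-0" -n "$ns" \
      -o jsonpath='{.status.phase}' 2>/dev/null)" = "Running" ]; do
      sleep 5
    done
  else
    printf '+ wait until pod %s-0 is Running\n' "$sts"
  fi

  # Remove the corrupt commit log named in the startup error.
  run kubectl exec "$sts-0" -n "$ns" -- rm "/srv/pds/data/commitlog/$commitlog"

  # Restart the operator so it updates the statefulset again.
  run kubectl scale deployment pds-deployment-controller-manager \
    -n "$operator_ns" --replicas=1

  # Give the operator roughly 30 seconds, then recreate pod 0.
  run sleep 30
  run kubectl delete pod "$sts-0" -n "$ns"
}
```

For example, recover_cassandra my-ns my-cassandra 3 CommitLog-7-1676881531203.log pds-system prints the plan for a hypothetical three-node deployment. Review the printed plan first, then rerun with DRY_RUN=0 to execute it. Polling the pod phase, rather than running a single kubectl wait, avoids failing when pod 0 has not been recreated yet.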
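All Cassandra nodes should now come up successfully. It is still recommended to shell into one of the Cassandra pods and run nodetool repair: writes that existed only in the deleted commit log are lost on that node, and a repair resynchronizes its data from the other replicas.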
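Before running the repair, you can confirm that every node reports UN (Up/Normal) in nodetool status. The filter below is a sketch that assumes the usual nodetool status layout, in which each node line begins with a two-letter status/state code. It also assumes nodetool runs in the container without extra arguments. The function name all_nodes_up_normal is hypothetical, and the pod and namespace in the usage comment are placeholders.

```shell
#!/bin/sh
# Succeed only if nodetool status output (on stdin) lists at least one node
# and every node line starts with UN (Up/Normal). Node lines begin with a
# status letter (U/D), a state letter (N/L/J/M), and a space.
# Example (pod and namespace are placeholders):
#   kubectl exec <statefulset-name>-0 -n <namespace> -- nodetool status \
#     | all_nodes_up_normal \
#     && kubectl exec <statefulset-name>-0 -n <namespace> -- nodetool repair
all_nodes_up_normal() {
  awk '/^[UD][NLJM] / { total++; if ($1 == "UN") up++ }
       END { exit !(total > 0 && up == total) }'
}
```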