As DevOps and DevSecOps engineers working on microservice-based application architectures, we need to troubleshoot Kubernetes clusters at various levels.
You cannot rely on a single place to look for failures. While troubleshooting Kubernetes, it becomes much easier to understand a problem if we classify it into one of the following categories:
- Application Failure
- Master node/ControlPlane Failures
- Worker node Failures
Application Failure - Troubleshooting
Here I'm listing these out based on my understanding and experience with the practice tests provided by Munshad Mohammad on KodeKloud.
- You should know the architecture: how it is deployed, what its dependencies are, where they are deployed, with what endpoints, and what names are used.
- Check that the service 'name' as defined and the name referenced by the consuming application match, and also check that the service's 'Endpoints' are correctly populated and referenced.
k -n dev-ns get all
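To check the endpoints of one specific service, a describe is usually enough. A minimal sketch, assuming a service named web-service in the dev-ns namespace (the service name here is only illustrative):
# show the selector, ports, and the Endpoints the service currently resolves to
kubectl -n dev-ns describe svc web-service
# or list the endpoints object directly
kubectl -n dev-ns get endpoints web-service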
- Better to check whether the service selectors are properly aligned with the pod labels, as per the architecture design definitions. If they are not, you need to change them.
k -n test-ns edit svc mysql-service
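A quick way to compare the two sides is to print the service selector and the pod labels next to each other. A minimal sketch, reusing the mysql-service and namespace from the command above:
# selector the service uses to pick pods
kubectl -n test-ns get svc mysql-service -o jsonpath='{.spec.selector}'
# labels actually carried by the pods
kubectl -n test-ns get pods --show-labels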
- Identify whether there is any mismatch in the environment values defined in the deployment; cross-check them with the Kubernetes objects they integrate with.
k -n test-ns describe deploy webapp-mysql
If something doesn't match (for example, the mysql-user value is wrong), you can change it; the deployment will automatically roll out new pods.
k -n test-ns edit deploy webapp-mysql
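To pull out just the environment variables for a quick comparison, a jsonpath query works. A minimal sketch, assuming the relevant variables sit on the first container of the webapp-mysql deployment:
# print the env entries of the first container in the pod template
kubectl -n test-ns get deploy webapp-mysql -o jsonpath='{.spec.template.spec.containers[0].env}'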
- Also check whether the service's nodePort is correctly set. If it mismatches, replace it with the correct value as per the design.
k -n test-ns describe service/web-service
k -n test-ns edit service/web-service # correct the nodePort value
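The nodePort can also be read and verified against the design without opening an editor. A minimal sketch, assuming web-service exposes a single port entry:
# print only the nodePort of the first port entry
kubectl -n test-ns get svc web-service -o jsonpath='{.spec.ports[0].nodePort}'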
Controlplane/Kubernetes Master node Failure - Troubleshooting
- Initial analysis starts from nodes and pods
To troubleshoot a controlplane failure, the first thing to do is check the status of the nodes in the cluster.
k get nodes
They should all be healthy. Then go to the next step: the status of the pods, deployments, services, and replicasets (all) within the namespace in which we have trouble.
k get po
k get all
Then ensure that the pods belonging to kube-system are in 'Running' status.
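A minimal check for that, assuming a kubeadm-style cluster where the controlplane components run as static pods in kube-system:
kubectl -n kube-system get pods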
- Check the Controlplane services
# Check kube-apiserver
service kube-apiserver status
# or
systemctl status kube-apiserver

# Check kube-controller-manager
service kube-controller-manager status
# or
systemctl status kube-controller-manager

# Check kube-scheduler
service kube-scheduler status
# or
systemctl status kube-scheduler

# Check the kubelet service on the worker nodes
service kubelet status
# or
systemctl status kubelet

# Check the kube-proxy service on the worker nodes
service kube-proxy status
# or
systemctl status kube-proxy

# Check the logs of the controlplane components
kubectl logs kube-apiserver-master -n kube-system
# system-level logs
journalctl -u kube-apiserver
- If there is an issue with the kube-scheduler, then to correct it we need to change the static pod manifest present in the default location: `vi /etc/kubernetes/manifests/kube-scheduler.yaml`
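After saving the manifest, the kubelet recreates the static pod on its own, so you can verify the command section you touched and watch the pod come back. A minimal sketch:
# inspect the command section you edited
grep -A10 "command:" /etc/kubernetes/manifests/kube-scheduler.yaml
# watch the kube-scheduler pod get recreated
kubectl -n kube-system get pods --watch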
You may also need to check the parameters given under 'command' in `/etc/kubernetes/manifests/kube-controller-manager.yaml`. Sometimes the volumeMounts paths are missing or incorrectly entered; if you correct them, the kube-system pods start automatically!
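A quick way to eyeball those paths is to print the volumeMounts next to the hostPath volumes and confirm they line up with directories that actually exist on the node. A minimal sketch:
# mount paths used inside the container
grep -A20 "volumeMounts:" /etc/kubernetes/manifests/kube-controller-manager.yaml
# hostPath definitions they should line up with
grep -A20 "volumes:" /etc/kubernetes/manifests/kube-controller-manager.yaml
# confirm the controller-manager pod is back to Running
kubectl -n kube-system get pods | grep controller-manager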
Worker Node Failure - Troubleshooting
This is mostly about the kubelet service being unable to come up. A broken worker node can be identified by listing your nodes, where it shows a 'NotReady' state. There can be several reasons, and each one is a case that needs to be understood; in all of them the kubelet cannot communicate with the master node, so identifying the reason is the main thing here.
- Kubelet service not started: There can be many reasons for a worker node failure. One such case is when the CA certificates were rotated; you then need to manually start the kubelet service and validate that it is running on the worker node.
# To investigate what's going on on the worker node
ssh node01 "service kubelet status"
ssh node01 "journalctl -u kubelet"
# To start the kubelet
ssh node01 "service kubelet start"
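If the node was rebooted, the service may also not be enabled to start on boot. Assuming a systemd-based node (an assumption, not something the practice test states), you can enable and start it in one go:
ssh node01 "systemctl enable --now kubelet"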
Once started, double-check the kubelet status again; if it shows 'active', you are fine.
- Kubelet config mismatch: The kubelet may fail to come up even after you start it. There could be some config-related issue. In one of the example practice tests, the ca.crt file path was wrongly specified. You may need to correct the ca.crt file path on the worker node, and in such a case you must know where the kubelet config resides: the path is '/var/lib/kubelet/config.yaml' (a quick check of this path is sketched after the restart commands below). After editing the ca.crt path you need to start the kubelet
service kubelet start
and check the kubelet logs using journalctl.
journalctl -u kubelet -f
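The path check mentioned above can be done quickly with grep. A minimal sketch, assuming a kubeadm-style setup where the kubelet's config.yaml references the client CA under authentication.x509.clientCAFile and the default CA lives at /etc/kubernetes/pki/ca.crt:
# which CA file is the kubelet configured with?
ssh node01 "grep -i clientCAFile /var/lib/kubelet/config.yaml"
# confirm the file actually exists at that path (kubeadm default shown)
ssh node01 "ls -l /etc/kubernetes/pki/ca.crt"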
And ensure that the node list on the controlplane shows the node01 status as 'Ready'.
- Cluster config mismatch: The kubeconfig file used by the kubelet could be corrupted, with the master IP or port configured wrongly, or the cluster name, user, or context entered incorrectly; that could be the reason the kubelet is unable to communicate with the master node. Compare the configuration available on the master node and the worker node, and if you find mismatches, correct them and restart the kubelet.
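A minimal sketch of that comparison, assuming the kubeadm default location for the kubelet's kubeconfig on the worker (/etc/kubernetes/kubelet.conf):
# API server endpoint the worker is pointed at
ssh node01 "grep server /etc/kubernetes/kubelet.conf"
# actual controlplane endpoint (usually https://<master-ip>:6443)
kubectl cluster-info
# after fixing any mismatch, restart the kubelet on the worker
ssh node01 "systemctl restart kubelet"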
- Finally, check the kubelet status on the worker node, and on the master node check the list of nodes.