Friday, December 30, 2022

Kubernetes Troubleshooting

As DevOps and DevSecOps engineers we work on many microservice-based application architectures, where we need to manage Kubernetes clusters and troubleshoot them at various levels.

You cannot rely on a single place to look for failures. Troubleshooting Kubernetes becomes easier if you first classify the problem into one of the following categories.
  1. Application Failure
  2. Master node/ControlPlane Failures
  3. Worker node Failures

Application Failure - Troubleshooting

Here I'm listing these based on my understanding and experience from the practice tests provided by Mumshad Mannambeth on KodeKloud. In the commands below, k is an alias for kubectl.
  1. Know the architecture: how the application is deployed, what its dependencies are, where they are deployed, and which endpoints and names they use.
  2. Check that the Service name as defined matches the name used by whatever refers to it, and that the Service's Endpoints are populated and referenced correctly.
    k -n dev-ns get all
    
  3. Check that the Service selectors match the pod labels, as per the architecture design. If they do not, change them.
    k -n test-ns edit svc mysql-service
    
  4. Look for mismatches between the environment values defined in the Deployment and the Kubernetes objects it integrates with.
    k -n test-ns describe deploy webapp-mysql
    
    If a value does not match (for example, the mysql-user value), change it; saving the edit automatically redeploys the pods.
    k -n test-ns edit deploy webapp-mysql
  5. Check that the Service's nodePort is set correctly. If it does not match the design, replace it with the correct value.
    k -n test-ns describe service/web-service
    k -n test-ns edit service/web-service # set the correct nodePort value
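
Steps 2 and 3 can be checked together, because a Service whose selector matches no pod ends up with an empty Endpoints list. The following is a minimal sketch, assuming kubectl is configured for the cluster. The function names svc_endpoint_ips and check_endpoints are hypothetical helpers (not part of kubectl), and the namespace and service names are the ones used in the examples above.

```shell
#!/bin/sh
# Hypothetical helper: print the pod IPs behind a Service (empty when the
# selector matches no ready pod). $1 = namespace, $2 = service name.
svc_endpoint_ips() {
  kubectl -n "$1" get endpoints "$2" -o jsonpath='{.subsets[*].addresses[*].ip}'
}

# Hypothetical helper: interpret the IP list printed by svc_endpoint_ips.
# $1 = space-separated IP list (may be empty).
check_endpoints() {
  if [ -z "$1" ]; then
    echo "no endpoints: compare the Service selector with the pod labels"
    return 1
  fi
  echo "endpoints found: $1"
}

# Example against a live cluster:
#   check_endpoints "$(svc_endpoint_ips test-ns mysql-service)"
#   kubectl -n test-ns get pods --show-labels
```

If the list is empty, comparing the selector shown by describe with the labels printed by --show-labels usually points to the mismatch.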
    

Controlplane/Kubernetes Master node Failure - Troubleshooting

  1. Initial analysis: start from the nodes and pods
    To troubleshoot a controlplane failure, the first thing is to check the status of the nodes in the cluster.
    k get nodes 
    
    They should all be healthy. Next, check the status of the pods, deployments, services and replicasets (all) within the namespace that has the problem.
    k get po 
    k get all 
    
    Then ensure that the pods belonging to kube-system are in 'Running' status.
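  2. Check the Controlplane services (where the components run as system services; on kubeadm clusters they usually run as static pods in kube-system instead)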
    # Check kube-apiserver
    service kube-apiserver status 
    or 
    systemctl status kube-apiserver 
    
    # Check kube-controller-manager
    service kube-controller-manager status 
    or 
    systemctl status kube-controller-manager
    
    # Check kube-scheduler
    service kube-scheduler status 
    or 
    systemctl status kube-scheduler
    
    # Check kubelet service on the worker nodes 
    service kubelet status 
    or 
    systemctl status kubelet 
    
    # Check kube-proxy service on the worker nodes 
    service kube-proxy status 
    or 
    systemctl status kube-proxy 
    
    # Check the logs of Controlplane components 
    kubectl logs kube-apiserver-master -n kube-system 
    # system level logs 
    journalctl -u kube-apiserver 
    
  3. If there is an issue with kube-scheduler, correct it by editing its manifest in the default location: `vi /etc/kubernetes/manifests/kube-scheduler.yaml`
    You may also need to check the parameters given under 'command' in `/etc/kubernetes/manifests/kube-controller-manager.yaml`. Sometimes the volumeMounts path values are missing or entered incorrectly; once you correct them, the kube-system pods restart automatically.
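
The kube-system check from step 1 can be scripted so that only the unhealthy pods are printed. This is a minimal sketch, assuming kubectl is configured. The name not_running is a hypothetical helper, and it relies on STATUS being the third column of the `kubectl get pods --no-headers` output.

```shell
#!/bin/sh
# Hypothetical helper: print "NAME STATUS" for every pod that is neither
# Running nor Completed. Reads `kubectl get pods --no-headers` output on
# stdin (columns: NAME READY STATUS RESTARTS AGE).
not_running() {
  awk '$3 != "Running" && $3 != "Completed" { print $1 " " $3 }'
}

# Example against a live cluster:
#   kubectl -n kube-system get pods --no-headers | not_running
```

Running it again after correcting a manifest shows whether the affected controlplane pod has come back.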

Worker Node failure - Troubleshooting

This is mostly about the kubelet service being unable to come up. A broken Kubernetes cluster can be identified by listing your nodes, where the affected node shows a 'NotReady' state. There can be several reasons why the kubelet cannot communicate with the master node, and each is a case that needs to be understood. Identifying the reason is the main thing here.
  1. Kubelet service not started: A worker node can fail for many reasons. One is that the CA certificates have been rotated, after which you need to start the kubelet service manually and validate that it is running on the worker node.
    # To investigate what is going on on the worker node 
    ssh node01 "service kubelet status"
    ssh node01 "journalctl -u kubelet"
    # To start the kubelet 
    ssh node01 "service kubelet start"
    
    Once it has started, check the kubelet status again; if it shows 'active', it is fine.
  2. Kubelet config mismatch: The kubelet service fails to come up even after you start it. There could be a config-related issue. In one of the practice tests, the ca.crt file path was wrong. In that case you need to correct the ca.crt path on the worker node, so you must know where the kubelet config resides: the path is '/var/lib/kubelet/config.yaml'. After correcting the ca.crt path, start the kubelet
    service kubelet start 
    and check the kubelet logs using journalctl.
    journalctl -u kubelet -f 
    And ensure that the node list on the controlplane node shows node01 with status 'Ready'.
  3. Cluster config mismatch: The config file may be corrupted, with the master IP or port configured wrongly, or the cluster name, user, or context entered wrongly; any of these can leave the kubelet unable to communicate with the master node. Compare the configuration on the master node and the worker node; if you find mismatches, correct them and restart the kubelet.
  4. Finally, check the kubelet status on the worker node and on the master node check the list of nodes.
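
The node check and the server-address comparison from the steps above can be sketched as two small filters. This is a sketch under stated assumptions: not_ready_nodes and kubeconfig_server are hypothetical helper names, and the kubeconfig path /etc/kubernetes/kubelet.conf in the usage comment is the kubeadm default, which may differ on your cluster.

```shell
#!/bin/sh
# Hypothetical helper: print nodes whose STATUS does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin
# (columns: NAME STATUS ROLES AGE VERSION).
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Hypothetical helper: print the API server URL(s) found in a kubeconfig file.
# $1 = path to the kubeconfig.
kubeconfig_server() {
  awk '$1 == "server:" { print $2 }' "$1"
}

# Example on a live cluster (the path is an assumption, see above):
#   kubectl get nodes --no-headers | not_ready_nodes
#   ssh node01 "cat /etc/kubernetes/kubelet.conf" > /tmp/node01-kubelet.conf
#   kubeconfig_server /tmp/node01-kubelet.conf
```

If the URL printed for the worker node has a different host or port than the one the controlplane actually serves on, that is the mismatch to correct before restarting the kubelet.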
Enjoy the Kubernetes Administration !!! Have more fun!
