Friday, December 30, 2022

Kubernetes Troubleshooting

As DevOps and DevSecOps engineers we work on many microservice-based application architectures, where we need to troubleshoot the Kubernetes cluster at various levels.

You cannot rely on a single place to look for failures. While working on Kubernetes troubleshooting, we can make the problem much easier to understand if we classify it into one of the following categories.
  1. Application Failure
  2. Master node/ControlPlane Failures
  3. Worker node Failures

Application Failure - Troubleshooting

Here I'm listing these out based on my understanding and experience with the practice tests provided by Munshad Mohammad on KodeKloud.
  1. You should know the architecture: how the application is deployed, what all its dependents are, where they are deployed, and with what endpoints and names.
  2. Check that the service 'name' defined matches the name the referring application uses, and also check that the service 'Endpoints' are correctly defined and correctly referenced.
    k -n dev-ns get all
    
  3. It is better to check whether the selectors are properly aligned with the architecture design definitions; if they are not, you need to change them (see the sketch after this list).
    k -n test-ns edit svc mysql-service
    
  4. Identify whether there is any mismatch in the environment values defined in the deployment by cross-checking with the Kubernetes objects they integrate with.
    k -n test-ns describe deploy webapp-mysql
    
    If something doesn't match - for example a mismatched mysql-user value - you can change it, and the pods will automatically be redeployed.
    k -n test-ns edit deploy webapp-mysql
  5. Also check that the service NodePort is mentioned correctly. If it mismatches, replace it with the correct one as per the design.
    k -n test-ns describe service/web-service
    k -n test-ns edit service/web-service # correct the nodePort value
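
For example, a quick way to confirm a Service selector really matches the Pod labels (reusing the mysql-service and test-ns names from above; adjust them to your own cluster) is to compare the two side by side:

    # Selector defined on the Service
    k -n test-ns get svc mysql-service -o jsonpath='{.spec.selector}'
    # Labels carried by the Pods it should target
    k -n test-ns get po --show-labels
    # If the selector matches no Pods, the Service will have empty Endpoints
    k -n test-ns get endpoints mysql-service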
    

Controlplane/Kubernetes Master node Failure - Troubleshooting

  1. Initial analysis starts from nodes and pods
    To troubleshoot a controlplane failure, the first thing is to check the status of the nodes in the cluster.
    k get nodes 
    
    They should all be healthy; then go to the next step, which is the status of the pods, deployments, services and replicasets (all) within the namespace in which we have trouble.
    k get po 
    k get all 
    
    Then ensure that the pods belonging to kube-system are in 'Running' status.
  2. Check the Controlplane services
    # Check kube-apiserver
    service kube-apiserver status 
    or 
    systemctl status kube-apiserver 
    
    # Check kube-controller-manager
    service kube-controller-manager status 
    or 
    systemctl status kube-controller-manager
    
    # Check kube-scheduler
    service kube-scheduler status 
    or 
    systemctl status kube-scheduler
    
    # Check the kubelet service on the worker nodes 
    service kubelet status 
    or 
    systemctl status kubelet 
    
    # Check the kube-proxy service on the worker nodes 
    service kube-proxy status 
    or 
    systemctl status kube-proxy 
    
    # Check the logs of Controlplane components 
    kubectl logs kube-apiserver-master -n kube-system 
    # system level logs 
    journalctl -u kube-apiserver 
    
  3. If there is an issue with the kube-scheduler, correct it by changing the YAML file present in the default location: `vi /etc/kubernetes/manifests/kube-scheduler.yaml`
    You may also need to check the parameters given for 'command' in the file `/etc/kubernetes/manifests/kube-controller-manager.yaml`. Sometimes the volumeMounts path values are missing or entered incorrectly; if you correct them the kube-system pods start automatically, as shown below.
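    A quick way to confirm the fix took effect (assuming a kubeadm-style cluster, where the control plane runs as static pods and the node is named 'controlplane') is to watch the kube-system pods come back to 'Running':
    # Static pod manifests live here on a kubeadm cluster
    ls /etc/kubernetes/manifests/
    # After correcting the YAML, the kubelet re-creates the pod automatically
    watch kubectl get pods -n kube-system
    # Inspect events if a component stays in CrashLoopBackOff
    kubectl -n kube-system describe pod kube-scheduler-controlplane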

Worker Node failure - Troubleshooting

This is mostly about the kubelet service being unable to come up. A broken node in the Kubernetes cluster can be identified by listing your nodes, which will show a 'NotReady' state. There could be several reasons, each one a case that needs to be understood, where the kubelet cannot communicate with the master node. Identifying the reason is the main thing here.
  1. Kubelet service not started: There could be many reasons a worker node fails. One is that the CA certs were rotated on the cluster; in that case you need to manually start the kubelet service and validate that it is running on the worker node.
    # To investigate what is going on in the worker node 
    ssh node01 "service kubelet status"
    ssh node01 "journalctl -u kubelet"
    # To start the kubelet 
    ssh node01 "service kubelet start"
    
    Once started, double-check the kubelet status again; if it shows 'active' then it is fine.
  2. Kubelet config mismatch: The kubelet service may fail to come up even after you start it. There could be some config-related issue. In one of the practice tests, the ca.crt file path was mentioned wrongly. In such a case you need to correct the ca.crt file path on the worker node, so you must know where the kubelet config resides: '/var/lib/kubelet/config.yaml'. After editing the ca.crt path you need to start the kubelet (a check sequence is sketched after this list).
    service kubelet start 
    and check the kubelet logs using journalctl.
    journalctl -u kubelet -f 
    And on the controlplane, ensure the node list shows node01 with status 'Ready'.
  3. Cluster config mismatch: The config.yaml file could be corrupted, or the master IP or port, cluster name, user or context may be entered wrongly; that could be the reason the kubelet is unable to communicate with the master node. Compare the configuration available on the master node and the worker node; if you find mismatches, correct them and restart the kubelet.
  4. Finally, check the kubelet status on the worker node and on the master node check the list of nodes.
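A minimal check sequence for the config-mismatch cases above (file paths assume a kubeadm-provisioned worker node; the ca.crt path and the API server address are the usual suspects):
# kubelet's own config - verify the clientCAFile path really exists
ssh node01 "grep clientCAFile /var/lib/kubelet/config.yaml"
# kubeconfig used to reach the API server - verify the server address and port
ssh node01 "grep server /etc/kubernetes/kubelet.conf"
# After correcting either file, restart and verify
ssh node01 "systemctl restart kubelet && systemctl status kubelet"
kubectl get nodes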
Enjoy the Kubernetes Administration !!! Have more fun!

Monday, December 26, 2022

Kubernetes Tools Tricks & Tips

Hey Guys, welcome to the "DevOps Hunter" blog! In this post I would like to share my learnings, collected at different times, about Kubernetes commands and their applied tricks and tips.

  • Initially I've collected a few kubectl-related alias command tricks
  • Then I play with the etcd database, and with backup and recovery short-cuts
  • Finally I worked on the Kubernetes command-line tools kubectx and kubens for easy switching in the CLI.


Come on! Let's explore the API resources which we frequently use when we prepare the YAML files for each Kubernetes object.

kubectl api-resources

Sometimes we can get an API version mismatch due to a change in the API version. We can examine what is new in the current version.
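
For instance, to see which API group/versions the cluster currently serves and which apiVersion a given object expects (the object name here is just an example):

# List every API group/version combination the cluster serves
kubectl api-versions
# Show the expected apiVersion and fields for a specific object
kubectl explain deployment
kubectl explain deployment.spec.strategy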

How do you identify the certificate file used to authenticate 'apiserver'?

cat /etc/kubernetes/manifests/kube-apiserver.yaml|grep tls-cert
    - --tls-cert-file=/etc/kubernetes/pki/apiserver.crt
The tls-cert-file value is the Kubernetes apiserver certificate file path.

How do you identify the certificate file used to authenticate 'kube-apiserver' as a client to ETCD server?

You can look into the kube-apiserver manifest file.

cat /etc/kubernetes/manifests/kube-apiserver.yaml 

Do you have any alias tricks for Kubernetes CLI commands?

Yes, I have many, but here I would like to share the commonly usable Bash shell aliases.
# kubectl can be used with k most common alias 
alias k='kubectl'

# This is to list all available objects, alias will be used with many Kubernetes Objects
alias kg='kubectl get'

# This will be used to describe any kubernetes object 
alias kdp='kubectl describe'

Looking into the logs

Kubernetes collects the logs from all the containers that run in a Pod.
# To look into the logs of any pod 
alias kl='kubectl logs'

# To get into the pod containers 
alias kei='kubectl exec -it'
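
With those aliases in place, day-to-day usage looks like this (the pod name is a placeholder; add -c <container> for multi-container pods):

# Tail the logs of a pod
kl webapp-pod -f
# Open a shell inside the pod's container
kei webapp-pod -- sh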

Realtime scenario: maintenance window on worker node

There can be regular, routine maintenance windows on worker nodes - perhaps for OS patching on the node or other urgent maintenance - and knowing how to handle them is an important activity for a Kubernetes Administrator.

When maintenance starts on node01:

 alias k="kubectl"
 k drain node01 --ignore-daemonsets
 # check pods scheduling on which nodes 
 k get po -o wide
 # check node status - observe that node01 STATUS = Ready,SchedulingDisabled
 k get nodes 

When maintenance on node01 completes, how do you release that node back to the ready state?

First make the node as schedulable using uncordon, then check nodes

 k uncordon node01
 the uncordon sub-command will mark the node as schedulable, bringing it back to the ready state.
 
 # Check pods, nodes 
 k get nodes,pods -o wide
Existing pods will not be re-scheduled back onto node01. But if any new pods are created, they can be scheduled there.

Locking your node for not to perform schedule any new pods

To make a node unschedulable without affecting the existing pods on it, use cordon:
 k cordon node01
 k get nodes -o wide
 
cordon sub-command will mark node as unschedulable.

Kubernetes Upgrade plan

Similar to how OS package managers allow us to upgrade, we can do the same for Kubernetes. But we need to be a little cautious. If there is an upgrade planned, then we first check the plan from the Kubernetes CLI:
 kubeadm upgrade plan
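
If the plan shows a newer version is available, the usual kubeadm flow (sketched here; the exact target version and the kubelet/kubectl package upgrade commands depend on your OS and release) looks roughly like:

 # On the controlplane node
 kubeadm upgrade apply v1.26.0   # example target version
 # Then upgrade the kubelet/kubectl packages and restart the kubelet
 systemctl daemon-reload && systemctl restart kubelet
 # On each worker node (after draining it from the controlplane)
 kubeadm upgrade node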
 

How do you find the ETCD cluster address from the controlplane?

From the describe output you can identify the etcd address, which is present in the --advertise-client-urls value.

k describe po etcd-controlplane -n kube-system|grep -i advertise-client
Annotations:          kubeadm.kubernetes.io/etcd.advertise-client-urls: https://10.36.169.6:2379
      --advertise-client-urls=https://10.36.169.6:2379

How to get the version of etcd running on the Kubernetes Cluster?

To get the version of etcd, describe the etcd pod, which is present in the kube-system namespace.

k get po -n kube-system |grep etcd
etcd-controlplane                      1/1     Running   0          22m

k describe po etcd-controlplane -n kube-system|grep -i image:
    Image:         k8s.gcr.io/etcd:3.5.3-0

Where is the ETCD server certificate file located?

To find the server certificate, look at the '--cert-file' line. To stop grep treating the leading -- as an option, escape it with a backslash.

k describe po etcd-controlplane -n kube-system|grep '\--cert-file'
      --cert-file=/etc/kubernetes/pki/etcd/server.crt
Alternative: another way is to get the certificate files and key files of etcd. You know that etcd is a static pod, whose definition and configuration details are in the manifest file at /etc/kubernetes/manifests/etcd.yaml. To run the etcd backup we must pass the cert files and key files. Let's find those from the manifest file.
 cat /etc/kubernetes/manifests/etcd.yaml |grep "\-file"

Where is the ETCD CA Certificate file located?

Generally CA certificates file will be saved as ca.crt.

k describe po etcd-controlplane -n kube-system|grep -i ca.crt
      --peer-trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt
      --trusted-ca-file=/etc/kubernetes/pki/etcd/ca.crt

Backup and Recovery of ETCD database

ETCD database BACKUP to a snapshot using following command

ETCDCTL_API=3 etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
snapshot save /opt/snapshot-pre-boot.db

# validate snapshot created in the /opt directory.
ls -l /opt

How to restore the etcd cluster database?

Same command only in place of save use restore option.
ETCDCTL_API=3 etcdctl  --data-dir /var/lib/etcd-from-backup \
snapshot restore /opt/snapshot-pre-boot.db
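After restoring to a new data directory, the running etcd still points at the old one. On a kubeadm cluster, where etcd is a static pod, the usual follow-up (a sketch, assuming the default manifest path) is to point the manifest's hostPath volume at the restored directory; the kubelet then re-creates the etcd pod automatically:
# Edit the static pod manifest
vi /etc/kubernetes/manifests/etcd.yaml
# Change the etcd-data hostPath volume to the restored directory:
#   volumes:
#   - hostPath:
#       path: /var/lib/etcd-from-backup
#       type: DirectoryOrCreate
#     name: etcd-data
# Then watch the etcd pod come back
kubectl -n kube-system get pods -w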
To know the number of clusters configured on the node you can use the following:
k config view
# Be specific to cluster listing you can use get-clusters 
k config get-clusters

Kubernetes Tools

Your life will be easy if you know these two tools and their tricks: kubectx and kubens, two customized command-line tools.

Using kubectx

kubectx examples
sudo git clone https://github.com/ahmetb/kubectx /opt/kubectx
sudo ln -s /opt/kubectx/kubectx /usr/local/bin/kubectx
kubectx -h
kubectx -c
kubectx 
Download and Setup the kubectx

kubens

Setup the kubens and using it for switching between namespaces.
sudo ln -s /opt/kubectx/kubens /usr/local/bin/kubens
kubens
kubens -h
kubens kube-system
k get po
kubens -
k get po
Kubernetes namespace switching tool kubens setup and executions

Network Tricks

To find the weave-net running on which node
k get po -n kube-system -l name=weave-net -o wide

What is the DNS implementation in your Kubernetes Cluster?

To know the DNS details, the label 'k8s-app=kube-dns' is used; querying pods and deployments with it gives us the complete implementation of DNS on the cluster:
k -n kube-system get po,deploy -l k8s-app=kube-dns
The execution sample output
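
If the cluster runs CoreDNS (the default DNS implementation since Kubernetes v1.13), its configuration lives in a ConfigMap, which is the next handy thing to check:

k -n kube-system get configmap coredns -o yaml
# The Service that cluster Pods actually resolve against
k -n kube-system get svc kube-dns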

Finding Node info using jsonpath

To work with jsonpath you must first know what the output looks like in JSON format; then we can narrow down to the required field data to be extracted.
k get nodes -o json
k get nodes -o json | jq
k get nodes -o json | jq -c 'paths' | grep InternalIP
To retrieve the InternalIP address of each node, first give it a try for the first node, then change it to all nodes using '*'.
k get no -o jsonpath='{.items[0].status.addresses}'|jq
k get no -o jsonpath='{.items[*].status.addresses[0]}'|jq
k get no -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")]}'
k get no -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}'
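
To pair each node name with its InternalIP on one line per node, the jsonpath range/end construct works well:
k get no -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[?(@.type=="InternalIP")].address}{"\n"}{end}'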

Kubectl autocomplete

Set up autocomplete in the bash shell if that is your current shell; the bash-completion package should be installed first.
source <(kubectl completion bash)
Let's add the above line to ~/.bashrc so autocomplete is enabled permanently:
echo "source <(kubectl completion bash)" >> ~/.bashrc
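If you use the k alias from earlier, completion can be extended to the alias as well (this line also goes into ~/.bashrc):
complete -o default -F __start_kubectl k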
Hope you have enjoyed this post.

Friday, December 23, 2022

Ansible powerful parameters - delegate_to, connection

 

Delegation to a host

Here is an example where we delegate a task to a particular host. This playbook uses the inventory_hostname magic variable.
- name: Delegation to localhost
  hosts: all
  tasks:
  - name: create a file on target server
    file:
      path: /tmp/i_m_on_target_server.txt
      state: touch
  - name: create a file with host named file by delegation to localhost
    file:
      state: touch
      path: "/tmp/{{ inventory_hostname }}.txt"
    delegate_to: localhost

connection parameter

We can use this "connection" parameter at the task level or the play level.
# Filename: connection_local.yml 
# To do some task on ansible server
# local means  without doing ssh command (no need of  password and no need of ssh keys)
# with the local connection parameter for the play
---
- name: This is to determine how the connection parameter works  with local
  hosts: app
  connection: local
  gather_facts: false
  tasks:
  - name: connection local test
    file:
      state: touch
      path: /tmp/srirama.txt

local action

Now in the following example we will see how local_action works. local_action is an Ansible task keyword that bypasses the SSH connection and runs the command locally.
- name: Using local action in this playbook
  hosts: localhost
  vars: 
    test_file: /tmp/omnamahshivaya.txt
  tasks:
  - name: local action
    local_action: "command touch {{ test_file }}"
    
The better option for running locally is to use the connection: local parameter at the play level.

Monday, December 19, 2022

Ansible Vault - To save Secrets

Hello DevOps Automations Engineers!!

 Ansible provides us a special command, 'ansible-vault', that is used to encrypt, decrypt and view an Ansible playbook. It also has a great feature specific to role and vars YAML files, and we can apply it to strings of text in regular variables. 

Why do we need to encrypt our Play books?

In our Ansible automation projects we work on multiple tasks, and these may have sensitive data such as database user credentials, cloud IAM role details, or login credentials for other applications used to validate URL availability. It can also be used to store SSL certificates. If the system keeps these in plain text, your confidential and sensitive data is at risk, and that could cause huge damage to your organization. So we need a way to protect the sensitive data with encryption, and this can be done using the ansible-vault command. 


Let's see the ansible-vault command help. In this we will experiment with all the options we have to play with: encryption and decryption of plain text in a file, a string, or an entire YAML file.
ansible-vault --help
usage: ansible-vault [-h] [--version] [-v] {create,decrypt,edit,view,encrypt,encrypt_string,rekey} ...

encryption/decryption utility for Ansible data files

positional arguments:
  {create,decrypt,edit,view,encrypt,encrypt_string,rekey}
    create              Create new vault encrypted file
    decrypt             Decrypt vault encrypted file
    edit                Edit vault encrypted file
    view                View vault encrypted file
    encrypt             Encrypt YAML file
    encrypt_string      Encrypt a string
    rekey               Re-key a vault encrypted file

ansible-vault with create option

The create option helps us create a new encrypted file. When you execute it, it will prompt for a password and a confirmation for the vaulted YAML file; once you enter the credentials it opens the default editor, where you can enter the text and save the file. When you view the content of the file it will be encrypted.

ansible-vault create vault.yml
Here in place of vault.yml you can use your confidential file.

ansible-vault create 


Encrypt

The encrypt option enables us to encrypt any file content - it can be a plain text file or one of our Ansible playbooks.

echo "unencrypted stuff"> encrypt_me.txt
cat encrypt_me.txt
ansible-vault encrypt encrypt_me.txt
cat encrypt_me.txt
Ansible-vault for encryption of a file


Decrypt

ansible-vault decrypt vault.yml
  

View

You can view the content of an encrypted file - any playbook or vars file - as long as you are the owner and know the key to open it.

ansible-vault view vault.yml


Edit

When we have an encrypted file, created with the 'create' or 'encrypt' option of the ansible-vault command, we can use the edit option to modify it; Ansible will prompt for the password.
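
For example, reusing the vault.yml created earlier:

ansible-vault edit vault.yml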

Recreate the password with rekey
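
rekey asks for the current vault password and then prompts for the new one, for example:

ansible-vault rekey vault.yml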









Using --ask-vault-pass
ansible-playbook vault_demo.yml --ask-vault-pass

Using --vault-password-file 
ansible-playbook vault_demo.yml --vault-password-file my_pass

Using --vault-id
ansible-playbook vault_demo.yml --vault-id my_pass

Courtesy of Krishna Tatepally [DevOps Operations Expert]

Thursday, December 1, 2022

Ansible handlers

Hello DevOps Experts!! Let's zoom into the usage of Ansible handlers and notify.

What are Ansible Handlers?

The handlers section - the tasks defined under the handlers folder - is executed at the end of the play, once all tasks are finished. In handler tasks we typically start, reload, restart or stop services.

Sometimes we need to execute a task only when a particular change is made, and that change can be notified. A simple example is the Apache web server: when we modify the httpd.conf file, we then want to restart the httpd service.

When we were working on Tomcat, once the tomcat service was enabled there was a need to reload the firewalld service. This is where we move that reload task under handlers, and the 'enable tomcat service' task should notify the handler task named 'reload firewalld service'. These are perfect examples of handler usage in an Ansible play.

So here the point is that handler tasks will be performed only when they are notified.

Ensure that each handler task has a globally unique name.

 Let's see the example of notify and handlers

# roles/php-webserver/tasks/main.yml
---
- name: install php
  yum: name={{ item }} state=latest
  with_items:
    - php
    - php-gd
    - php-pear
    - php-mysql
  notify: restart httpd

# roles/php-webserver/handlers/main.yml
---
- name: restart httpd
  service: name=httpd state=restarted

Now you can see how to include the role php-webserver into the main playbook.
# Filename: test-handler.yml
---
 - hosts: web
   user: root
   become: yes
   roles:
     - php-webserver  

Execution of the test-handler will be like this:
ansible-playbook test-handler.yml

