Sunday, November 17, 2019

DevOps Troubleshooting Tricks & Tips

In this post I collect the challenges I run into day to day while learning DevOps, together with the workarounds, fixes and reference links that resolved them. Please share your own experiences with DevOps operations as well.

DevOps Troubleshooting process


Issue #1: Vagrant fails to reload when Docker is installed on CentOS


The following SSH command responded with a non-zero exit status.
Vagrant assumes that this means the command failed!

chmod 0644 /etc/systemd/system/docker.service.d/http-proxy.conf

Stdout from the command:



Stderr from the command:

chmod: cannot access ‘/etc/systemd/system/docker.service.d/http-proxy.conf’: No such file or directory



Vagrant actually starts the box, but the provisioning step cannot find the file http-proxy.conf. To work around this, create the file on the guest and set its permissions:
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo touch /etc/systemd/system/docker.service.d/http-proxy.conf
sudo chmod 0644 /etc/systemd/system/docker.service.d/http-proxy.conf

Now reload the Vagrant box. This is usually a blocker when you start several Vagrant boxes with a single vagrant up command, because Vagrant stops after creating the first instance. Make this change on each node, one after another, as it comes up.
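Because the same fix has to be repeated on every node, it helps to make it idempotent. A minimal sketch, assuming Bash on the CentOS guest; ensure_proxy_conf is a hypothetical helper name, and the optional directory argument exists only so the function can be tried against a scratch directory:

```shell
#!/usr/bin/env bash
# Hypothetical helper: make sure the Docker proxy drop-in exists with mode 0644,
# so that Vagrant's chmod 0644 step on http-proxy.conf stops failing.
# $1 = drop-in directory, defaulting to the path from the error message above.
ensure_proxy_conf() {
  local dir="${1:-/etc/systemd/system/docker.service.d}"
  local conf="$dir/http-proxy.conf"
  mkdir -p "$dir" || return 1   # create the drop-in directory if it is missing
  touch "$conf" || return 1     # create the file; an existing file keeps its content
  chmod 0644 "$conf"            # the permission Vagrant expects to set
}
```

Run it as root on each node (for example in a sudo -i shell), then run vagrant reload for that node from the host.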


Issue #2: Docker daemon not running


[vagrant@mstr ~]$ docker info
Client:
 Debug Mode: false
 Plugins:
  cluster: Manage Docker clusters (Docker Inc., v1.2.0)

Server:
ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
errors pretty printing info


Workaround

Start the Docker daemon and check its status:
sudo systemctl start docker
sudo systemctl status docker
Fix: enable the service so that Docker starts automatically at boot:
sudo systemctl enable docker
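The workaround and the fix can be combined into one check that only starts Docker when the daemon is unreachable. A sketch under stated assumptions: ensure_docker is a hypothetical name, it relies on docker info exiting non-zero when the daemon cannot be reached (as in the output above), and the DOCKER and SYSTEMCTL variables exist only so the commands can be prefixed with sudo or replaced by stubs:

```shell
#!/usr/bin/env bash
# Commands used by the helper; override them (e.g. SYSTEMCTL="sudo systemctl")
# or replace them with stubs for a dry run.
DOCKER=${DOCKER:-docker}
SYSTEMCTL=${SYSTEMCTL:-systemctl}

# Hypothetical helper: start and enable Docker only if the client cannot
# reach the daemon (the "Cannot connect to the Docker daemon" case above).
ensure_docker() {
  if $DOCKER info >/dev/null 2>&1; then
    echo "docker daemon reachable"
    return 0
  fi
  echo "docker daemon not reachable, starting docker.service"
  $SYSTEMCTL start docker || return 1   # workaround: start it now
  $SYSTEMCTL enable docker              # fix: start it at every boot
}
```

For example: SYSTEMCTL="sudo systemctl" ensure_docker. If docker info still fails once the daemon is running, the user probably lacks permission on /var/run/docker.sock; the post-installation steps describe adding the user to the docker group.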

References:

  1. Control Docker with systemd (Docker documentation)
  2. Post-installation steps for Linux (Docker documentation)

Issue #3: Snap unable to install Helm



error: cannot communicate with server: Post http://localhost/v2/snaps/helm: dial unix /run/snapd.socket: connect: no such file or directory

Fix: check whether the snapd daemon is running:
[root@mstr ~]# systemctl status snapd.service
● snapd.service - Snappy daemon
   Loaded: loaded (/usr/lib/systemd/system/snapd.service; disabled; vendor preset: disabled)
   Active: inactive (dead)

If it is not running and reports inactive (dead), start it and check the status again:
[root@mstr ~]# systemctl start snapd.service
[root@mstr ~]# systemctl status snapd.service
● snapd.service - Snappy daemon
   Loaded: loaded (/usr/lib/systemd/system/snapd.service; disabled; vendor preset: disabled)
   Active: active (running) since Sun 2019-11-17 05:27:28 UTC; 7s ago
 Main PID: 23376 (snapd)
    Tasks: 10
   Memory: 15.2M
   CGroup: /system.slice/snapd.service
           └─23376 /usr/libexec/snapd/snapd

Nov 17 05:27:27 mstr.devopshunter.com systemd[1]: Starting Snappy daemon...
Nov 17 05:27:27 mstr.devopshunter.com snapd[23376]: AppArmor status: apparmor not enabled
Nov 17 05:27:27 mstr.devopshunter.com snapd[23376]: daemon.go:346: started snapd/2.42.1-1.el7 (...6.
Nov 17 05:27:28 mstr.devopshunter.com snapd[23376]: daemon.go:439: adjusting startup timeout by...p)
Nov 17 05:27:28 mstr.devopshunter.com snapd[23376]: helpers.go:104: error trying to compare the...sk
Nov 17 05:27:28 mstr.devopshunter.com systemd[1]: Started Snappy daemon.

Now go ahead with the Helm installation:
[root@mstr ~]# snap install helm --classic
2019-11-17T05:30:10Z INFO Waiting for restart...
Download snap "core18" (1265) from channel "stable"                               88%  139kB/s 50.3s
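The error means the snap client could not reach snapd through /run/snapd.socket. A sketch, assuming CentOS 7 with snapd installed from EPEL: wait_for_socket is a hypothetical helper that polls for the socket after the service is started, so that snap install is not retried too early:

```shell
#!/usr/bin/env bash
# Hypothetical helper: wait up to $2 seconds (default 30) for UNIX socket $1 to appear.
wait_for_socket() {
  local sock="${1:-/run/snapd.socket}" timeout="${2:-30}" i
  for ((i = 0; i < timeout; i++)); do
    [ -S "$sock" ] && return 0   # -S: the path exists and is a socket
    sleep 1
  done
  echo "socket $sock not available after ${timeout}s" >&2
  return 1
}
```

Usage:
sudo systemctl enable --now snapd.socket
wait_for_socket /run/snapd.socket 30 && sudo snap install helm --classic
Enabling snapd.socket also keeps the fix across reboots, which matters here because the unit is shown as disabled. Classic snaps such as helm additionally need the /snap symlink on CentOS (sudo ln -s /var/lib/snapd/snap /snap), as described in the snapd installation instructions for CentOS.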

Issue #4: kubectl unable to list the K8s nodes


$ kubectl get nodes
The connection to the server localhost:8080 was refused - did you specify the right host or port?
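Solution: make sure kubelet is enabled and running:
systemctl enable kubelet
systemctl start kubelet

Then make bridged traffic visible to iptables by adding these two lines to /etc/sysctl.d/k8s.conf (for example with vi /etc/sysctl.d/k8s.conf):
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1

Finally, reload the settings from all sysctl configuration files:
sysctl --system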
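The sysctl step can be scripted as below. A sketch, assuming a kubeadm-based CentOS 7 node; write_k8s_sysctl and ensure_kubeconfig are hypothetical names. The bridge-nf-call keys only exist once the br_netfilter kernel module is loaded, hence the modprobe in the usage note. The second helper covers another common cause of the localhost:8080 message on a kubeadm master: kubectl has no kubeconfig, which kubeadm init asks you to create by copying /etc/kubernetes/admin.conf to $HOME/.kube/config.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write the two bridge settings from the solution above.
write_k8s_sysctl() {
  local conf="${1:-/etc/sysctl.d/k8s.conf}"
  cat > "$conf" <<'EOF'
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
EOF
}

# Hypothetical helper: give kubectl a kubeconfig if it has none, following
# the steps kubeadm init prints (copy admin.conf to ~/.kube/config).
ensure_kubeconfig() {
  local src="${1:-/etc/kubernetes/admin.conf}" dest="${2:-$HOME/.kube/config}"
  if [ -s "$dest" ]; then
    echo "kubeconfig already present: $dest"
    return 0
  fi
  mkdir -p "$(dirname "$dest")" || return 1
  cp "$src" "$dest" || return 1
  chmod 600 "$dest"   # the file holds cluster-admin credentials
  echo "copied $src to $dest"
}
```

As root on the master (as in the prompts above):
modprobe br_netfilter
write_k8s_sysctl && sysctl --system
ensure_kubeconfig && kubectl get nodes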


Issue #5: kubeadm init cannot get past the preflight checks

[root@mstr ~]# kubeadm init --pod-network-cidr=192.148.0.0/16 --apiserver-advertise-address=192.168.33.100
[init] Using Kubernetes version: v1.16.3
[preflight] Running pre-flight checks
        [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
        [WARNING SystemVerification]: this Docker version is not on the list of validated versions: 19.03.4. Latest validated version: 18.09
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
error execution phase preflight: [preflight] Some fatal errors occurred:
        [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-apiserver:v1.16.3: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:35272->[::1]:53: read: connection refused
, error: exit status 1
        [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-controller-manager:v1.16.3: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:40675->[::1]:53: read: connection refused
, error: exit status 1
        [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-scheduler:v1.16.3: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:48699->[::1]:53: read: connection refused
, error: exit status 1
        [ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-proxy:v1.16.3: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:48500->[::1]:53: read: connection refused
, error: exit status 1
        [ERROR ImagePull]: failed to pull image k8s.gcr.io/pause:3.1: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:46017->[::1]:53: read: connection refused
, error: exit status 1
        [ERROR ImagePull]: failed to pull image k8s.gcr.io/etcd:3.3.15-0: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:52592->[::1]:53: read: connection refused
, error: exit status 1
        [ERROR ImagePull]: failed to pull image k8s.gcr.io/coredns:1.6.2: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:53803->[::1]:53: read: connection refused
, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
[root@mstr ~]#

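Solution:
The fatal ImagePull errors show that Docker could not resolve k8s.gcr.io: the DNS lookup went to [::1]:53 (the node itself) and was refused, which usually means /etc/resolv.conf lists no reachable nameserver. Point /etc/resolv.conf at a working DNS server (kubeadm config images pull, as suggested in the log, is a quick way to confirm the images now download), then initialize the Kubernetes master. The command below also passes --ignore-preflight-errors for the Hostname, SystemVerification and NumCPU checks; that flag only turns those checks into warnings and does not fix the image pulls.
kubeadm init --pod-network-cidr=192.148.0.0/16 --apiserver-advertise-address=192.168.33.100 --ignore-preflight-errors=Hostname,SystemVerification,NumCPU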
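Because the root cause here is name resolution, a quick look at /etc/resolv.conf before running kubeadm init can save a failed run. A sketch; check_nameservers is a hypothetical name, and it only inspects the file, it does not test whether the listed servers actually answer:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the nameserver entries of a resolv.conf file and
# fail when none is a non-loopback address, i.e. lookups would go to 127.0.0.1/[::1]:53.
check_nameservers() {
  local resolv="${1:-/etc/resolv.conf}" key ns found=0
  while read -r key ns _ || [ -n "$key" ]; do   # also handle a last line without newline
    [ "$key" = "nameserver" ] || continue
    case "$ns" in
      127.*|::1) echo "loopback nameserver: $ns" ;;
      *) echo "nameserver: $ns"; found=1 ;;
    esac
  done < "$resolv"
  [ "$found" -eq 1 ]
}
```

If it fails, add your network's DNS server to /etc/resolv.conf and retry. A loopback entry is fine when a local resolver is actually running (for example systemd-resolved on 127.0.0.53), so treat a failure as a hint rather than proof.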

Issue #6: kubectl unable to connect to the server

[root@mstr tmp]# kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Unable to connect to the server: dial tcp: lookup raw.githubusercontent.com on 10.0.2.3:53: server misbehaving
[root@mstr tmp]#

Workaround: I got this error when I ran the kubectl command above on the office network; at home the same command ran without problems. The name lookup failed at 10.0.2.3, which in a VirtualBox NAT setup is the DNS proxy VirtualBox provides to the guest, so check your company VPN and proxy settings before you run the kubectl command.
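If your company network requires a proxy, kubectl (like most Go programs) generally picks it up from the standard proxy environment variables. A sketch, assuming you get the proxy URL from your network team; proxy.example.com:8080 and the function name use_proxy are placeholders. The API server address goes into NO_PROXY so that cluster traffic is not sent to the proxy:

```shell
#!/usr/bin/env bash
# Hypothetical helper: export proxy settings for the current shell.
# $1 = proxy URL, $2 = optional comma-separated hosts that must bypass the proxy.
use_proxy() {
  local proxy="$1" bypass="${2:-}"
  export HTTP_PROXY="$proxy" HTTPS_PROXY="$proxy"
  export http_proxy="$proxy" https_proxy="$proxy"   # some tools only read lowercase
  export NO_PROXY="localhost,127.0.0.1${bypass:+,$bypass}"
  export no_proxy="$NO_PROXY"
}
```

Usage:
use_proxy http://proxy.example.com:8080 192.168.33.100
kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Here 192.168.33.100 is the API server address used earlier in this post.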

Issue #7: Docker Networking : Error response from daemon

[vagrant@mydev ~]$ docker network create -d overlay \
>                 --subnet=192.168.0.0/16 \
>                 --subnet=192.170.0.0/16 \
>                 --gateway=192.168.0.100 \
>                 --gateway=192.170.0.100 \
>                 --ip-range=192.168.1.0/24 \
>                 --aux-address="my-router=192.168.1.5" --aux-address="my-switch=192.168.1.6" \
>                 --aux-address="my-printer=192.170.1.5" --aux-address="my-nas=192.170.1.6" \
>                 my-multihost-network
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.

Basic analysis: check the Swarm line in the output of docker info:
docker info

The error shows that Swarm is inactive on this node, and overlay networks can only be created on a swarm manager. Workaround: initialize a swarm to make it active:
docker swarm init --advertise-addr 192.168.33.200
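The check and the workaround can be combined so that docker swarm init only runs when the node is not already in a swarm. A sketch; ensure_swarm_manager is a hypothetical name, the DOCKER variable exists only so the command can be stubbed, and the Go template fields .Swarm.LocalNodeState and .Swarm.ControlAvailable are what docker info reports for the swarm state and manager role:

```shell
#!/usr/bin/env bash
DOCKER=${DOCKER:-docker}

# Hypothetical helper: make this node a swarm manager so that overlay networks
# can be created. $1 = address to advertise to the other nodes.
ensure_swarm_manager() {
  local addr="$1" state
  state=$($DOCKER info --format '{{.Swarm.LocalNodeState}}') || return 1
  case "$state" in
    inactive)
      $DOCKER swarm init --advertise-addr "$addr" ;;
    active)
      if [ "$($DOCKER info --format '{{.Swarm.ControlAvailable}}')" = "true" ]; then
        echo "already a swarm manager"
      else
        echo "this node is a swarm worker; create the network on a manager" >&2
        return 1
      fi ;;
    *)
      echo "unexpected swarm state: $state" >&2
      return 1 ;;
  esac
}
```

Usage: ensure_swarm_manager 192.168.33.200, then run the docker network create command again.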

Issue #8: Kubernetes join command times out on AWS EC2 instances

We created three EC2 instances to provision a Kubernetes cluster on them. The master came up in the Ready state, but when we ran the join command on the other nodes, it timed out with the following error:
root@ip-172-31-xx-xx:~# kubeadm join 172.31.xx.204:6443 --token ld3ea8.jghaj4lpkwyk6b38 --discovery-token-ca-cert-hash sha256:f240647cdeacc429a3a30f6b83a3e9f54f603fbdf87fb24e4ee734d5368a21cf
W0426 14:58:03.699933   17866 join.go:346] [preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
[preflight] Running pre-flight checks
        [WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: couldn't validate the identity of the API Server: Get https://172.31.35.204:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s: dial tcp 172.31.35.204:6443: i/o timeout
To see the stack trace of this error execute with --v=5 or higher

Solution: check the inbound rules of the AWS security group attached to the instances. The worker nodes contact the Kubernetes API server over HTTPS on TCP port 6443, so that port must accept inbound connections from the nodes, and the master-worker communication needs other TCP ports as well (for example 10250 for the kubelet API). Opening these ports to all addresses (0.0.0.0/0) works, but limiting them to the VPC address range or to the cluster's own security group is safer.
Security Group settings in AWS for Kubernetes
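Instead of opening everything to 0.0.0.0/0, the rules can be limited to the ports Kubernetes needs and to the VPC range. A sketch using the AWS CLI's ec2 authorize-security-group-ingress command, assuming the default VPC range 172.31.0.0/16 (which matches the 172.31.x.x addresses above); the security group ID and the function name open_k8s_ports are placeholders, and the port list should be checked against the required-ports list in the kubeadm installation documentation for your version:

```shell
#!/usr/bin/env bash
# AWS CLI command; set AWS=echo for a dry run that only prints the calls.
AWS=${AWS:-aws}

# Assumed TCP ports between kubeadm nodes; verify for your Kubernetes version:
# 6443 API server, 2379-2380 etcd, 10250 kubelet API, 30000-32767 NodePort services.
K8S_TCP_PORTS="${K8S_TCP_PORTS:-6443 2379-2380 10250 30000-32767}"

# Hypothetical helper: allow each port from one CIDR into one security group.
# $1 = security group ID, $2 = source CIDR (for example the VPC range).
open_k8s_ports() {
  local group_id="$1" cidr="$2" port rc=0
  for port in $K8S_TCP_PORTS; do   # unquoted on purpose: split the list into ports
    $AWS ec2 authorize-security-group-ingress \
      --group-id "$group_id" --protocol tcp --port "$port" --cidr "$cidr" \
      || { echo "could not open TCP $port (the rule may already exist)" >&2; rc=1; }
  done
  return "$rc"
}
```

Usage: open_k8s_ports sg-0123456789abcdef0 172.31.0.0/16 (both values are placeholders). Passing --source-group with the cluster's own security group ID instead of --cidr is another way to allow only the cluster nodes. Your CNI plugin may need extra ports; flannel's VXLAN backend, for example, uses UDP 8472.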

Issue #9: VirtualBox error (VERR_VMX_NO_VMX) code E_FAIL (0x80004005) gui headless

Stop the Hyper-V hypervisor, which runs by default on Windows 8/10, because it blocks all other hypervisors from accessing the VT hardware.

Additional explanation here: https://social.technet.microsoft.com/Forums/windows/en-US/118561b9-7155-46e3-a874-6a38b35c67fd/hyperv-disables-vtx-for-other-hypervisors?forum=w8itprogeneral

Also, if it is not already enabled, turn on Intel VT virtualization in the BIOS settings and restart the machine.


To turn the hypervisor off, run this from an elevated Command Prompt (Command Prompt (Admin) in the Windows+X menu):

bcdedit /set hypervisorlaunchtype off

and reboot your computer. To turn it back on again, run:

bcdedit /set hypervisorlaunchtype on

If you receive "The integer data is not valid as specified", try:

bcdedit /set hypervisorlaunchtype auto


This is the solution that worked for me.

Need help or support with your project issues?

Jenkins Build Failure

Problem in Console Output

Started by user BhavaniShekhar
Running as SYSTEM
Building remotely on node2 in workspace /tmp/remote/workspace/Test-build-remotely
[Test-build-remotely] $ /bin/sh -xe /tmp/jenkins681586635812408746.sh
+ echo 'Executed from BUILD REMOTELY Option'
Executed from BUILD REMOTELY Option
+ echo 'Download JDK 17'
Download JDK 17
+ cd /opt
+ wget https://download.oracle.com/java/17/latest/jdk-17_linux-x64_bin.tar.gz
/tmp/jenkins681586635812408746.sh: line 5: wget: command not found
Build step 'Execute shell' marked build as failure
Finished: FAILURE

Solution: install wget on node2, or use the curl command as an alternative.
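If you cannot install packages on the agent right away, the build step can fall back to curl when wget is missing. A sketch; fetch is a hypothetical function name, and it assumes at least one of the two tools is present on node2:

```shell
#!/usr/bin/env bash
# Hypothetical helper: download URL $1 to file $2 with wget, or with curl if wget is absent.
fetch() {
  local url="$1" out="$2"
  if command -v wget >/dev/null 2>&1; then
    wget -q -O "$out" "$url"
  elif command -v curl >/dev/null 2>&1; then
    curl -fsSL -o "$out" "$url"   # -f: fail on HTTP errors, -L: follow redirects
  else
    echo "neither wget nor curl is installed" >&2
    return 127
  fi
}
```

In the Jenkins shell step:
cd /opt
fetch https://download.oracle.com/java/17/latest/jdk-17_linux-x64_bin.tar.gz jdk-17_linux-x64_bin.tar.gz
To install wget itself, use sudo yum install -y wget on CentOS/RHEL or sudo apt-get install -y wget on Debian/Ubuntu.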

Issue with sending mail on a Linux system

Solution: first check whether mail can be sent from the command line at all, using the following command:

 echo "Test Mail" | mailx -s "test" "Pavan@gmail.com"
 
Replace the mail ID with your company mail ID and run the command. If you need any support with Docker and DevOps, let us know in the comments!
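The check can be wrapped so that it also tells you when the mail client itself is missing. A sketch; send_test_mail is a hypothetical name, and a zero exit status from mailx only means the message was handed to the local mail system, not that it was delivered, so check the mail log as well (for example /var/log/maillog on CentOS):

```shell
#!/usr/bin/env bash
# Hypothetical helper: send a one-line test message to $1 with mailx.
send_test_mail() {
  local to="$1"
  if ! command -v mailx >/dev/null 2>&1; then
    echo "mailx not found; install the mailx package first" >&2
    return 127
  fi
  if echo "Test Mail from ${HOSTNAME:-localhost}" | mailx -s "test" "$to"; then
    echo "handed to mail system for $to"
  else
    echo "mailx failed for $to" >&2
    return 1
  fi
}
```

Usage: send_test_mail someone@example.com (replace the address with your company mail ID).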
