In this post, I would like to collect the daily challenges I run into during my DevOps learning, along with possible workarounds, fixes, and reference links. I also invite you to share your own experiences dealing with DevOps operations.
DevOps Troubleshooting process
Issue #1: Vagrant fails to reload when Docker is installed in CentOS
The following SSH command responded with a non-zero exit status.
Vagrant assumes that this means the command failed!
chmod 0644 /etc/systemd/system/docker.service.d/http-proxy.conf
Stdout from the command:
Stderr from the command:
chmod: cannot access ‘/etc/systemd/system/docker.service.d/http-proxy.conf’: No such file or directory
Here Vagrant actually starts the box, but it cannot find the file http-proxy.conf. My suggestion for this issue is to create the file and grant the permission as shown below.
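A minimal sketch (assuming the proxy settings themselves are added later, or are not needed at all):
sudo mkdir -p /etc/systemd/system/docker.service.d
sudo touch /etc/systemd/system/docker.service.d/http-proxy.conf
sudo chmod 0644 /etc/systemd/system/docker.service.d/http-proxy.conf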
Now restart the Vagrant box. This is usually a blocker when you bring up a couple of Vagrant boxes with a single vagrant up command, because provisioning stops right after the first instance is created. You need to apply this change to every node, one after another, as each one comes up.
Issue #2: Docker daemon not running
[vagrant@mstr ~]$ docker info
Client:
Debug Mode: false
Plugins:
cluster: Manage Docker clusters (Docker Inc., v1.2.0)
Server:
ERROR: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
errors pretty printing info
Workaround
Start the Docker daemon:
sudo systemctl start docker
sudo systemctl status docker
Fix
References:
- Control docker with systemd
- Post steps for Docker installation
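As a longer-term fix, and in line with the post-installation steps referenced above, a rough sketch is to enable the daemon at boot and add your user (here the vagrant user, as an assumption) to the docker group:
sudo systemctl enable docker
sudo usermod -aG docker vagrant    # log out and back in for the group change to take effect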
Issue #3: Snap is unable to install helm
error: cannot communicate with server: Post http://localhost/v2/snaps/helm: dial unix /run/snapd.socket: connect: no such file or directory
Fix:
Check whether the snapd daemon is running:
[root@mstr ~]# systemctl status snapd.service
● snapd.service - Snappy daemon
Loaded: loaded (/usr/lib/systemd/system/snapd.service; disabled; vendor preset: disabled)
Active: inactive (dead)
If it is not running and shows Active: inactive (dead), then give it life by starting it and check again!!!
[root@mstr ~]# systemctl start snapd.service
[root@mstr ~]# systemctl status snapd.service
● snapd.service - Snappy daemon
Loaded: loaded (/usr/lib/systemd/system/snapd.service; disabled; vendor preset: disabled)
Active: active (running) since Sun 2019-11-17 05:27:28 UTC; 7s ago
Main PID: 23376 (snapd)
Tasks: 10
Memory: 15.2M
CGroup: /system.slice/snapd.service
└─23376 /usr/libexec/snapd/snapd
Nov 17 05:27:27 mstr.devopshunter.com systemd[1]: Starting Snappy daemon...
Nov 17 05:27:27 mstr.devopshunter.com snapd[23376]: AppArmor status: apparmor not enabled
Nov 17 05:27:27 mstr.devopshunter.com snapd[23376]: daemon.go:346: started snapd/2.42.1-1.el7 (...6.
Nov 17 05:27:28 mstr.devopshunter.com snapd[23376]: daemon.go:439: adjusting startup timeout by...p)
Nov 17 05:27:28 mstr.devopshunter.com snapd[23376]: helpers.go:104: error trying to compare the...sk
Nov 17 05:27:28 mstr.devopshunter.com systemd[1]: Started Snappy daemon.
Now go ahead with the helm installation:
[root@mstr ~]# snap install helm --classic
2019-11-17T05:30:10Z INFO Waiting for restart...
Download snap "core18" (1265) from channel "stable" 88% 139kB/s 50.3s
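To avoid hitting this again after a reboot, a hedged extra step is to enable the snapd socket at boot; on CentOS 7 the classic-confinement symlink may also be needed once:
systemctl enable --now snapd.socket
ln -s /var/lib/snapd/snap /snap    # required once for --classic snaps on CentOS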
Issue #4: Unable to list the K8s nodes
$ kubectl get nodes
The connection to the server localhost:8080 was refused - did you specify the right host or port?
Solution:
systemctl enable kubelet
systemctl start kubelet
vi /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
sysctl --system
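If the error persists on the master even with kubelet running, it usually means kubectl has no kubeconfig pointing at the API server. A minimal sketch, assuming kubeadm init has already been run as root:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config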
Issue #5: K8s unable to proceed with kubeadm init
[root@mstr ~]# kubeadm init --pod-network-cidr=192.148.0.0/16 --apiserver-advertise-address=192.168.33.100
[init] Using Kubernetes version: v1.16.3
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
[WARNING SystemVerification]: this Docker version is not on the list of validated versions: 19.03.4. Latest validated version: 18.09
[preflight] Pulling images required for setting up a Kubernetes cluster
[preflight] This might take a minute or two, depending on the speed of your internet connection
[preflight] You can also perform this action in beforehand using 'kubeadm config images pull'
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-apiserver:v1.16.3: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:35272->[::1]:53: read: connection refused
, error: exit status 1
[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-controller-manager:v1.16.3: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:40675->[::1]:53: read: connection refused
, error: exit status 1
[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-scheduler:v1.16.3: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:48699->[::1]:53: read: connection refused
, error: exit status 1
[ERROR ImagePull]: failed to pull image k8s.gcr.io/kube-proxy:v1.16.3: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:48500->[::1]:53: read: connection refused
, error: exit status 1
[ERROR ImagePull]: failed to pull image k8s.gcr.io/pause:3.1: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:46017->[::1]:53: read: connection refused
, error: exit status 1
[ERROR ImagePull]: failed to pull image k8s.gcr.io/etcd:3.3.15-0: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:52592->[::1]:53: read: connection refused
, error: exit status 1
[ERROR ImagePull]: failed to pull image k8s.gcr.io/coredns:1.6.2: output: Error response from daemon: Get https://k8s.gcr.io/v2/: dial tcp: lookup k8s.gcr.io on [::1]:53: read udp [::1]:53803->[::1]:53: read: connection refused
, error: exit status 1
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher
[root@mstr ~]#
Solution:
You need to initialize the Kubernetes master in the cluster, this time ignoring the non-fatal preflight checks:
kubeadm init --pod-network-cidr=192.148.0.0/16 --apiserver-advertise-address=192.168.33.100 --ignore-preflight-errors=Hostname,SystemVerification,NumCPU
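Since the ImagePull errors above are DNS lookups failing against localhost ([::1]:53), it may also help to confirm the node can actually resolve and pull from k8s.gcr.io before re-running. A quick hedged check:
cat /etc/resolv.conf          # should list a reachable nameserver
nslookup k8s.gcr.io
kubeadm config images pull    # pre-pull the images once DNS works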
Issue #6: K8s unable to connect to the server
[root@mstr tmp]# kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Unable to connect to the server: dial tcp: lookup raw.githubusercontent.com on 10.0.2.3:53: server misbehaving
[root@mstr tmp]#
Workaround: When I tried to run the above kubectl command on the office network, I got this error. Once I was at home, it ran perfectly. So please check your company VPN/proxy settings before you run that kubectl command.
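If you must stay on the office network, one possible workaround (the proxy address below is only a placeholder) is to export the proxy variables for the shell, or to download the manifest first and apply it locally:
export https_proxy=http://proxy.example.com:8080    # hypothetical company proxy
curl -o kube-flannel.yml https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
kubectl apply -f kube-flannel.yml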
Issue #7: Docker networking: Error response from daemon
[vagrant@mydev ~]$ docker network create -d overlay \
> --subnet=192.168.0.0/16 \
> --subnet=192.170.0.0/16 \
> --gateway=192.168.0.100 \
> --gateway=192.170.0.100 \
> --ip-range=192.168.1.0/24 \
> --aux-address="my-router=192.168.1.5" --aux-address="my-switch=192.168.1.6" \
> --aux-address="my-printer=192.170.1.5" --aux-address="my-nas=192.170.1.6" \
> my-multihost-network
Error response from daemon: This node is not a swarm manager. Use "docker swarm init" or "docker swarm join" to connect this node to swarm and try again.
Basic analysis: check the 'Swarm' line in the docker info command output.
docker info
From the error line you can understand that the issue is the Swarm being in the inactive state. To turn it 'active':
Workaround:
docker swarm init --advertise-addr 192.168.33.200
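After initializing the swarm, a quick hedged verification before retrying the overlay network creation:
docker info | grep -i swarm    # should now show: Swarm: active
docker network ls              # the docker network create command above can be re-run now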
Issue #8: Kubernetes join command timeout on AWS ec2 instance
Three EC2 instances were created to provision the Kubernetes cluster on them. The master came up and reached the Ready state, but when we ran the join command on the other nodes, it timed out with the following error:
root@ip-172-31-xx-xx:~# kubeadm join 172.31.xx.204:6443 --token ld3ea8.jghaj4lpkwyk6b38 --discovery-token-ca-cert-hash sha256:f240647cdeacc429a3a30f6b83a3e9f54f603fbdf87fb24e4ee734d5368a21cf
W0426 14:58:03.699933 17866 join.go:346] [preflight] WARNING: JoinControlPane.controlPlane settings will be ignored when control-plane flag is not set.
[preflight] Running pre-flight checks
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: couldn't validate the identity of the API Server: Get https://172.31.35.204:6443/api/v1/namespaces/kube-public/configmaps/cluster-info?timeout=10s: dial tcp 172.31.35.204:6443: i/o timeout
To see the stack trace of this error execute with --v=5 or higher
Solution for such an issue:
Check that the AWS Security Group has the required ports open in its inbound rules. The Kubernetes API server listens on TCP port 6443, and the worker nodes must be able to reach it (open to all, 0.0.0.0/0, for a quick test). The Kubernetes master-worker communication may need other TCP inbound connections as well, so open those too.
Security Group settings in AWS for Kubernetes
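For reference, a hedged AWS CLI sketch of opening the API server port (the security-group ID is a placeholder; in production restrict the CIDR rather than using 0.0.0.0/0):
aws ec2 authorize-security-group-ingress --group-id sg-0123456789abcdef0 --protocol tcp --port 6443 --cidr 0.0.0.0/0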
Issue #9: VirtualBox VM fails to start (GUI or headless) with VERR_VMX_NO_VMX, code E_FAIL (0x80004005)
The solution that worked: VERR_VMX_NO_VMX means VirtualBox cannot use hardware virtualization (VT-x/AMD-V), which typically has to be enabled in the host machine's BIOS/UEFI settings (and no other hypervisor, such as Hyper-V, should be holding it).
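A quick way to confirm this on a Linux host (an assumption; Windows hosts need a different check) is to see whether the CPU exposes the virtualization flags at all:
egrep -c '(vmx|svm)' /proc/cpuinfo    # 0 means VT-x/AMD-V is disabled or unavailable; enable it in the BIOS/UEFI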
Issue #10: Jenkins build failure
Problem in the console output:
Started by user BhavaniShekhar
Running as SYSTEM
Building remotely on node2 in workspace /tmp/remote/workspace/Test-build-remotely
[Test-build-remotely] $ /bin/sh -xe /tmp/jenkins681586635812408746.sh
+ echo 'Executed from BUILD REMOTELY Option'
Executed from BUILD REMOTELY Option
+ echo 'Download JDK 17'
Download JDK 17
+ cd /opt
+ wget https://download.oracle.com/java/17/latest/jdk-17_linux-x64_bin.tar.gz
/tmp/jenkins681586635812408746.sh: line 5: wget: command not found
Build step 'Execute shell' marked build as failure
Finished: FAILURE
Solution: To fix this, you need to install wget on node2, or you can use curl as an alternative.
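A minimal sketch of both options, assuming node2 is a RHEL/CentOS machine:
sudo yum install -y wget
# or switch the build step to curl:
curl -L -O https://download.oracle.com/java/17/latest/jdk-17_linux-x64_bin.tar.gz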
Issue #11: Unable to send mail on the Linux system
Solution: first investigate whether mail can be sent from the command line at all, using the following command:
echo "Test Mail" | mailx -s "test" "Pavan@gmail.com"
Replace the mail ID with your company mail ID and run that command.
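If the mailx command itself is missing, a hedged first step on RHEL/CentOS is to install it and confirm the local MTA (postfix here, as an assumption) is running:
sudo yum install -y mailx
systemctl status postfix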
Hello guys, if you need any support on Docker and DevOps, do let us know in the comments!