Sunday, March 23, 2025

Exploring git pre-commit hooks to prevent secrets leaks

What are GitGuardian and ggshield?

ggshield is a security CLI tool developed by GitGuardian that helps developers and organizations prevent the exposure of sensitive information, such as API keys, credentials, and secrets, in their Git repositories.

What are the key features of ggshield?

  • Pre-Commit and Pre-Push Scanning: Scans code before it is committed or pushed to detect secrets. Prevents accidental leaks of sensitive data in version control.
  • CI/CD Pipeline Integration: Works with GitHub Actions, GitLab CI/CD, Jenkins, and other CI tools. Ensures security checks are part of automated workflows.
  • Real-Time Monitoring and Alerts: Detects exposed secrets in public or private repositories. Sends alerts and suggests remediation steps.
  • Custom Rules & Policies: Allows defining custom regex patterns to detect organization-specific secrets. Supports allowlists to prevent false positives.
How to install ggshield on Ubuntu 24.04?

pipx install ggshield
Installing the hook with local scope updates the pre-commit file of the current repository:
ggshield install -m local
This writes the hook to .git/hooks/pre-commit.
We can run the same with global scope as well:
ggshield install -m global
To ignore the last findings:
ggshield ignore --last-found
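Below is a minimal first-run sketch, assuming you already have a GitGuardian API key; exact subcommand names may differ slightly between ggshield versions:
# authenticate with GitGuardian (alternatively export GITGUARDIAN_API_KEY)
ggshield auth login
# scan the repository history for secrets before enabling the hook
ggshield secret scan repo .
# once 'ggshield install -m local' is in place, every commit triggers a scan;
# a commit containing a detected secret is rejected
git commit -m "test: ggshield pre-commit hook"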

Monday, March 10, 2025

Handling Git Large repositories

Hey, hello, dear DevOps, DevSecOps, and SRE team heroes!! This week I came across a new challenge: a common problem on Git. You may be using GitHub, GitLab, or even Bitbucket for source code management. Some projects, websites, or mobile apps need to store images, audio files, or video files that are large in size. While transferring them to client systems, teams face the following issues:
  1. Slowness in git clone and fetch operations: files taking too long to upload or download, leading to delays in deployment and user experience
  2. Sluggish commits and status checks: some clients are encountering errors related to file size limitations, causing frustration and hindering workflow efficiency
  3. Repository size bloat
  4. Complexity in managing multiple branches
It's crucial for us to explore solutions that can streamline this process and ensure smooth handling of large media files.

Git LFS installation

apt install -y git-lfs
Git LFS (Large File Storage) is used to handle large files in a Git repository efficiently by replacing them with lightweight pointers while storing the actual files in a separate location. Here are various examples of using Git LFS, including tracking, untracking, checking status, and more:

1. Initialize Git LFS

Before using Git LFS in a repository, initialize it:
git lfs install
This sets up Git LFS for the repository.

2. Track Large Files

To track specific file types, use:
git lfs track "*.psd"
or track a specific file:
git lfs track "bigfile.zip"
This updates the `.gitattributes` file to include:
*.psd filter=lfs diff=lfs merge=lfs -text
After tracking, commit the `.gitattributes` file:
git add .gitattributes
git commit -m "Track large files with Git LFS"
---

3. Check Tracked Files

To see which files are being tracked by Git LFS:
git lfs track
---

4. Check LFS Status

To check which large files are modified or committed:
git lfs status
---

5. Untrack a File

If you no longer want a file to be tracked by Git LFS:
git lfs untrack "bigfile.zip"
This removes it from `.gitattributes`. Then commit the change:
git add .gitattributes
git commit -m "Untrack bigfile.zip from Git LFS"
Important: This does not remove files from previous commits.

6. List LFS Objects

To see which LFS files exist in your repository:
git lfs ls-files
---

7. Migrate Large Files (if added before tracking)

If you accidentally committed a large file before tracking it with Git LFS, migrate it:
git lfs migrate import --include="*.zip"
---

8. Push and Fetch LFS Files

After committing, push LFS files to the remote:
git push origin main
To pull and fetch LFS files:
git pull
git lfs fetch
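If you want the fetch and checkout in a single step, git lfs pull can be used instead of the two commands above:
# fetch LFS objects for the current branch and replace pointer files with real content
git lfs pull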
---

9. Removing LFS Files From History (if needed)

If a large file was added before tracking and you want to remove it:
git lfs migrate import --include="bigfile.zip" --everything
Then, force push the cleaned history:
git push origin --force --all
---

10. Verify LFS Files in Remote Repository

To check which LFS files exist on the remote:
git lfs ls-files --long

Common mistakes working with Git LFS

  • Not Tracking Files Properly: Forgetting to use git lfs track for large files can lead to them being stored in the repository instead of being managed by Git LFS.
  • Ignoring the .gitattributes File: The .gitattributes file is crucial for Git LFS to function correctly. Failing to commit this file can cause issues for collaborators.
  • Pushing Without Installing Git LFS: If Git LFS isn't installed on your system, pushing large files will fail or result in errors.
  • Exceeding Storage Limits: Platforms like GitHub have storage limits for Git LFS. Exceeding these limits can block further uploads.
  • Cloning Without Git LFS: If you clone a repository without Git LFS installed, you might end up with pointer files instead of the actual large files.
  • Using Git LFS for Small Files: Git LFS is designed for large files. Using it for small files can unnecessarily complicate your workflow.
  • Not Cleaning Up Old Files: Over time, unused large files can accumulate in the LFS storage, increasing costs or storage usage (see the prune sketch after this list).
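A minimal cleanup sketch, assuming default prune settings; git lfs prune removes local copies of LFS objects that are old and no longer referenced by recent commits:
# preview what would be removed
git lfs prune --dry-run
# remove old, unreferenced LFS objects from local storage
git lfs prune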
Please let me know your experiences with large files on a Git server; write your learnings in the comments.

Saturday, March 8, 2025

Git installation on Ubuntu 24.04

Git installation on Ubuntu is pretty simple. If you are looking to install Git on RHEL, Rocky, or Oracle Linux, you can use this link.
Now most software projects use Git. So let's do the installation on Ubuntu with the following steps:
  1. Check whether Git exists
  2. Install Git
  3. Confirm Git Installation
Pre-requisites:

Pick a cloud instance or an online terminal with Ubuntu 20+ for this experiment. Here I'm using the Ubuntu terminal provided by KillerCoda.

Check whether Git exists

This is a common requirement when you join a new project: on the Linux machine you would like to know whether Git is installed or not. We have a couple of options to check it. Let's do it here:

dpkg -l git

#or

dpkg --list git
In the output, 'ii' at the start of the list entry means the package is correctly installed and available (you should see this mark when the package is installed). Alternatively, you can try another option to check the Git installation on Ubuntu.
apt list git

#or

apt list git -a
Or you can simply run `git`; it works and prints the Git command help when Git is already installed.
git
We can see that it is not the latest version as of now, so I wish to install the latest Git version on Ubuntu 24.04.

Install Git from source

  1. Download the desired version of the Git tarball, untar it, and enter the extracted directory.
  2. Build Git from the source code using the 'make' tool.
  3. Run the installation using 'make install'.

# install pre-requisite libraries 
sudo apt update
sudo apt install make libssl-dev libghc-zlib-dev libcurl4-gnutls-dev libexpat1-dev gettext unzip

# Download your desired version
wget https://mirrors.edge.kernel.org/pub/software/scm/git/git-2.48.1.tar.gz
tar -zxf git-*.tar.gz 

cd git-2.48.1
make prefix=/usr/local all
sudo make prefix=/usr/local install
Now Build steps

Confirmation of Git Installation

You already know multiple ways to check the existing Git version on your Ubuntu. Let's quickly validate it with the `--version` option of the git command.

git --version

Automation in mind: Git desired version on Ubuntu

Work smart when you look at these instructions; any AI tool can give you these, but we have our own God-given brains to use :)

#!/bin/bash

#Measurable automation - this will calculate duration of this script execution
SECONDS=0

# This can be passed as a command line argument
GIT_VERSION="2.48.1"

if ! sudo apt update; then
    echo "Failed to update package list"
    exit 1
fi

rm -rf git*

# Install the dependency libraries
sudo apt install -y make libssl-dev libghc-zlib-dev libcurl4-gnutls-dev libexpat1-dev gettext unzip
# Download the desired git tarball
wget https://mirrors.edge.kernel.org/pub/software/scm/git/git-$GIT_VERSION.tar.gz

tar -zxf git-*.tar.gz 
cd git-$GIT_VERSION
pwd
# compile the source code and install
make prefix=/usr/local all
sudo make prefix=/usr/local install

# Validate git installed version
  
git --version | grep ${GIT_VERSION}
[ $? -eq 0 ] && echo "Git installed latest version successfully!"
echo "Execution time: $SECONDS seconds"

Magical way of Git installation

The beautiful output of the above bash script is here:

If you encounter any issues during the installation, you may want to check your internet connection or ensure that your package manager is up to date. Additionally, it’s helpful to refer to the official Git documentation for troubleshooting tips and further configuration options.
Please write your valuable comments about your Git installation experiments and any issues you faced. Happy learning!!

Saturday, February 22, 2025

Apache Cassandra Performance Optimization

Hey guys!! I'm back with a new learning this week: I worked and experimented with the Apache Cassandra distributed database. Its special feature is its querying capability with NoSQL (Not only SQL).
Let's recall our last blog post, where we learnt about Cassandra installation on a VM. I hope you are ready with a Cassandra DB node.

1: Optimizing Data Modeling

Objective: Understand partitioning and primary key design to optimize performance.
Create an inefficient table within company_db keyspace:
CREATE KEYSPACE company_db
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE company_db;

CREATE TABLE company_db.employees_bad (
id UUID PRIMARY KEY,
name TEXT,
department TEXT,
age INT,
city TEXT
);

Now let's Insert some sample records into the table and try to query it.
INSERT INTO employees_bad (id, name, department, age, city) VALUES (uuid(), 'Reemi', 'Engineering', 30, 'New York');
INSERT INTO employees_bad (id, name, department, age, city) VALUES (uuid(), 'Ausula Santhosh', 'Marketing', 40, 'San Francisco');
INSERT INTO employees_bad (id, name, department, age, city) VALUES (uuid(), 'Telluri Raja', 'Marketing', 40, 'San Francisco');
INSERT INTO employees_bad (id, name, department, age, city) VALUES (uuid(), 'Abdul Pathan', 'Marketing', 40, 'San Francisco');
Run the following queries and observe the issue: the filter on department is on a non-key column, so Cassandra refuses it unless you add ALLOW FILTERING, which forces a full table scan.
SELECT * FROM employees_bad;
SELECT * FROM employees_bad WHERE department = 'Engineering';

Resolving the data model performance issue

CREATE TABLE company_db.emp (
id UUID,
name TEXT,
department TEXT,
age INT,
city TEXT,
PRIMARY KEY(department,id)
);
Rerun the Insert sample records
INSERT INTO emp (id, name, department, age, city) VALUES (uuid(), 'Reemi', 'Engineering', 30, 'New York');
INSERT INTO emp (id, name, department, age, city) VALUES (uuid(), 'Ausula Santhosh', 'Marketing', 40, 'San Francisco');
INSERT INTO emp (id, name, department, age, city) VALUES (uuid(), 'Telluri Raja', 'Marketing', 40, 'San Francisco');
INSERT INTO emp (id, name, department, age, city) VALUES (uuid(), 'Abdul Pathan', 'Marketing', 40, 'San Francisco');
Run queries and observe performance
SELECT * FROM emp;
SELECT * FROM emp WHERE department = 'Engineering';

2: Using Secondary Indexes Efficiently

Objective: Learn when to use secondary indexes and when to avoid them. Create a table without a proper partition key:
CREATE TABLE company_db.products (
id UUID PRIMARY KEY,
name TEXT,
category TEXT,
price DECIMAL
);

INSERT INTO products (id, name, category, price)
VALUES (uuid(), 'Laptop', 'Electronics', 120000.00);
INSERT INTO products (id, name, category, price)
VALUES (uuid(), 'Wireless Mouse', 'Electronics', 200.00);
INSERT INTO products (id, name, category, price)
VALUES (uuid(), 'Chair', 'Furniture', 150.00);	
INSERT INTO products (id, name, category, price)
VALUES (uuid(), 'ErgoChair', 'Furniture', 1500.00);	
INSERT INTO products (id, name, category, price)
VALUES (uuid(), 'ErgoStandSitTable', 'Furniture', 24050.00);	

SELECT * FROM products ;
SELECT * FROM products WHERE category = 'Electronics';
Create a secondary index on category; after this optimization, queries that filter on category no longer need a full scan (or ALLOW FILTERING). Now rerun the query:
CREATE INDEX category_index ON products (category);
SELECT * FROM products WHERE category = 'Electronics';
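To see the difference yourself, cqlsh request tracing can be switched on around the query (a quick sanity check, not a benchmark):
TRACING ON;
SELECT * FROM products WHERE category = 'Electronics';
TRACING OFF;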

3: Compaction Strategies

Objective: Understand how different compaction strategies affect performance. We have 4 different options; let's explore each of them in detail here. You can use `DESC TABLE emp` to check the current compaction strategy of a table:

1. SizeTieredCompactionStrategy (STCS)

Scenario: An IoT application with a high volume of incoming sensor data.
Use Case: An IoT platform collects data from thousands of sensors distributed across a smart city. Each sensor sends data continuously, leading to a high volume of writes.
Advantage: STCS is ideal for this write-heavy workload because it efficiently handles large volumes of data by merging smaller SSTables into larger ones, reducing write amplification and managing disk space effectively.
CREATE KEYSPACE iot
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

use iot;

CREATE TABLE iot.sensor_data (
  sensor_id int,
  timestamp timestamp,
  data blob,
  PRIMARY KEY (sensor_id, timestamp)
) WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'min_threshold': 4,
  'max_threshold': 32
};

DESC TABLE  iot.sensor_data

2. LeveledCompactionStrategy (LCS)

Scenario: A social media application with a focus on fast reads. Use Case: A social media platform requires fast access to user profiles and posts. Users frequently query the latest posts, likes, and comments.
Advantage: LCS is suitable for read-heavy workloads. It organizes SSTables into levels, ensuring that queries read from a small number of SSTables, resulting in lower read latency and consistent performance.
CREATE KEYSPACE socialmedia
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE socialmedia;

CREATE TABLE socialmedia.user_posts (
  user_id int,
  post_id int,
  content text,
  PRIMARY KEY (user_id, post_id)
) WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': 160
};

DESC TABLE socialmedia.user_posts

3. TimeWindowCompactionStrategy (TWCS)

Scenario: A time-series database for monitoring server performance. Use Case: A company uses Cassandra to store and analyze server performance metrics such as CPU usage, memory usage, and network traffic. These metrics are collected at regular intervals and are time-based. Advantage: TWCS groups data into time windows, making it easier to expire old data and reduce compaction overhead. It is optimized for time-series data, ensuring efficient data organization and faster queries for recent data.
CREATE KEYSPACE monitoring
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE monitoring;

CREATE TABLE monitoring.server_metrics (
  server_id int,
  metric timestamp,
  cpu_usage double,
  memory_usage double,
  network_traffic double,
  PRIMARY KEY (server_id, metric)
) WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'HOURS',
  'compaction_window_size': 1
};

DESC TABLE monitoring.server_metrics

4. UnifiedCompactionStrategy (UCS)

Scenario: An e-commerce platform with mixed read and write workloads. Use Case: An e-commerce website handles a mix of reads and writes, including product catalog updates, user reviews, and order processing. The workload varies throughout the day, with peak periods during sales events.
Advantage: UCS adapts to the changing workload by balancing the trade-offs of STCS and LCS. It provides efficient compaction for both read-heavy and write-heavy periods, ensuring consistent performance.
CREATE KEYSPACE ecommerce
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE ecommerce;

CREATE TABLE ecommerce.orders (
  order_id int,
  user_id int,
  product_id int,
  order_date timestamp,
  status text,
  PRIMARY KEY (order_id, user_id)
) WITH compaction = {
  'class': 'UnifiedCompactionStrategy'
};

DESC TABLE ecommerce.orders

Friday, February 21, 2025

Cassandra nodetool by examples

To monitor an Apache Cassandra cluster from the command line interface (CLI), you can use the nodetool utility, which is a powerful command-line tool specifically designed for managing and monitoring Cassandra clusters. Here are some key commands and their functionalities:

Key nodetool Commands

  • Check Cluster Status:

    nodetool status
    

    This command displays the status of all nodes in the cluster, including whether they are up or down, their load, and other important metrics.

  • Column Family Statistics:

    nodetool cfstats [keyspace_name].[table_name]
    

    This command provides detailed statistics for a specific table (column family), including read/write counts, disk space used, and more. In newer Cassandra versions, `nodetool tablestats` is the preferred name for the same command.

  • Thread Pool Statistics:

    nodetool tpstats
    

    This command shows statistics about thread pools used for read, write, and mutation operations, helping to identify potential bottlenecks.

  • Network Statistics:

    nodetool netstats
    

    This command displays information about network activity, including pending hints and the status of streaming operations.

  • Compaction Stats:

    nodetool compactionstats
    

    This command provides information about ongoing compactions, which can be useful for performance tuning.

  • Cluster Information:

    nodetool info
    

    This command gives a summary of the node's configuration and status within the cluster.
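A minimal sketch combining the commands above into a quick health check, assuming nodetool is on the PATH of the Cassandra node:
#!/bin/bash
# quick Cassandra health snapshot using the nodetool commands discussed above
for cmd in status info tpstats netstats compactionstats; do
    echo "===== nodetool $cmd ====="
    nodetool "$cmd"
done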

Additional Monitoring Options

For more advanced monitoring, you may also consider integrating tools like Prometheus and Grafana for visualization or using cloud-based solutions such as Datadog or New Relic. These tools can provide real-time metrics and alerts based on your Cassandra cluster's performance.

Using nodetool effectively allows you to maintain a healthy Cassandra environment by providing insights into its operational metrics directly from the CLI.

Apache Cassandra 5 installation on Ubuntu

In this post we will go through the step-by-step installation of the latest version of Apache Cassandra, 5.0.3 (the latest available as of Feb 2025), on Ubuntu 20.

What problem I'm solving with this?

There is no direct documentation from Cassandra to help with installing the latest version, 5.0.3, on Ubuntu. So I've experimented on an online Ubuntu terminal (KillerCoda) and I'm posting all the steps here.

Pre-requisite to install Cassandra

1. An Ubuntu terminal; either KillerCoda or a GitHub Codespace works well for this experiment.
2. Cassandra has specific compatibility requirements with different Java versions, which are crucial for ensuring optimal performance and stability.
3. You must have super user (root) access to install Cassandra.

Step 1: Install Java/JRE

Ensure Java is installed by checking with the command `java -version`; if it is not present, install it as follows:

#install jdk17-jre 
apt install openjdk-17-jre-headless

java -version

Step 2: Add Cassandra Repository

Since Cassandra is not included in the default Ubuntu repositories, you must add its repository with the following commands.
sudo apt install apt-transport-https
Let's download Cassandra securely; to do this we need the GPG key from Cassandra:
wget -qO- https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
Note: don't miss the dash at the end of the line in the above wget command. The actual package information is added to the repo with the following line; the series option can vary (40x, 41x, or 50x for the respective Cassandra versions) and is placed before 'main'.
echo "deb https://debian.cassandra.apache.org 50x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
Update the package index
sudo apt update

Step 3: Install Cassandra 5.0.3

Now that the repository is set up, you can install Cassandra 5.0.3.
sudo apt install cassandra -y
Wait for the process to complete. Once the installation finishes, the Cassandra service starts automatically. Also, a user named cassandra is created during the process. That user is used to run the service.
systemctl status cassandra

Verify Cassandra installation

nodetool status

Don't work HARD!

Working hard means you are not using your brain at work! Repetitive tasks can be automated; that is called SMART work. Let's do that here:
apt install -y openjdk-17-jre-headless 
apt install -y apt-transport-https
wget -qO- https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -

echo "deb https://debian.cassandra.apache.org 50x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list

apt update
sudo apt install cassandra -y
sleep 10
echo "Cassandra installed verify"
nodetool status

Troubleshoot pointers:

Initializing the Cassandra server may take a few more seconds, so wait if you see:
Exception in thread "main" java.lang.IllegalArgumentException: Server is not initialized yet, cannot run nodetool.
Observe in the output that the Active line should show `active (running)`. If it does not, then something is missing; maybe the Java/JRE version is mismatched or the Cassandra process is unable to launch, and we can look at the system logs, as shown below.
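A couple of places to look, assuming the default package layout (paths may differ on your system):
# service-level logs captured by systemd
journalctl -u cassandra --no-pager | tail -n 50
# Cassandra's own log file from the Debian/Ubuntu package
tail -n 50 /var/log/cassandra/system.log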
Please write your experience with the Cassandra 5 installation in the comment box. Thanks for being with me till here.

Saturday, February 8, 2025

Kafka Message system on Kubernetes

 

Setting up the Kubernetes namespace for kafka
apiVersion: v1
kind: Namespace
metadata:
  name: "kafka"
  labels:
    name: "kafka"

k apply -f kafka-ns.yml 
Now let's create the ZooKeeper container inside the kafka namespace
apiVersion: v1
kind: Service
metadata:
  labels:
    app: zookeeper-service
  name: zookeeper-service
  namespace: kafka
spec:
  type: NodePort
  ports:
    - name: zookeeper-port
      port: 2181
      nodePort: 30181
      targetPort: 2181
  selector:
    app: zookeeper
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: zookeeper
  name: zookeeper
  namespace: kafka
spec:
  replicas: 1
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
        - image: wurstmeister/zookeeper
          imagePullPolicy: IfNotPresent
          name: zookeeper
          ports:
            - containerPort: 2181
Image: kube-kafka1
From the ZooKeeper service, get the ClusterIP and use it in the Kafka broker configuration, which is the next step we are going to perform.
apiVersion: v1
kind: Service
metadata:
  labels:
    app: kafka-broker
  name: kafka-service
  namespace: kafka
spec:
  ports:
  - port: 9092
  selector:
    app: kafka-broker
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: kafka-broker
  name: kafka-broker
  namespace: kafka
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka-broker
  template:
    metadata:
      labels:
        app: kafka-broker
    spec:
      hostname: kafka-broker
      containers:
      - env:
        - name: KAFKA_BROKER_ID
          value: "1"
        - name: KAFKA_ZOOKEEPER_CONNECT
          value: ZOOKEEPER-INTERNAL-IP:2181
        - name: KAFKA_LISTENERS
          value: PLAINTEXT://:9092
        - name: KAFKA_ADVERTISED_LISTENERS
          value: PLAINTEXT://kafka-broker:9092
        image: wurstmeister/kafka
        imagePullPolicy: IfNotPresent
        name: kafka-broker
        ports:
          - containerPort: 9092

In the above manifest, the KAFKA_ZOOKEEPER_CONNECT value (ZOOKEEPER-INTERNAL-IP:2181) needs to be changed to your ZooKeeper service ClusterIP. Now apply:
k apply -f kafka-broker.yml
After applying, watch the resources come up:
watch kubectl get all -n kafka
Image: Kube kafka 2
Step 3: Enable network communication. To ensure that ZooKeeper and Kafka can communicate using the hostname kafka-broker, we need to add the following entry to the /etc/hosts file on the host:
echo "127.0.0.1 kafka-broker" >> /etc/hosts
Set up the port forwarding as follows:
kubectl port-forward svc/kafka-service 9092:9092 -n kafka
Image: Kafka on Kubernetes
Open new terminal and run the following:

Test Kafka Topics using Kafkacat

To easily send and retrieve messages from Kafka, we'll use a CLI tool named kcat (formerly kafkacat). Install it using the command below:
apt install kafkacat
Image : Kafkacat installation

Producing and consuming messages using kafkacat

Run the command below to create a topic named topic1 and send a test message “hello everyone!”; you can enter your own messages.
echo "hello everyone!" | kafkacat -P -b 127.0.0.1:9092 -t topic1
   
Now let's consume the message using kafkacat command:
  kafkacat -C -b 127.0.0.1:9092 -t topic1
  
Image : Kafkacat Producer Consumer
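kafkacat can also list broker metadata, which is a handy way to confirm the topic exists; the -L flag prints brokers, topics, and partitions:
kafkacat -L -b 127.0.0.1:9092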
Happy learning Kafka on Kubernetes! I ran the above experiment on the KillerCoda terminal.

Kafdrop install and Monitor

There are many monitoring tools available for Kafka brokers. I've collected various monitoring options:
  • Factor House Kpow
  • Datadog
  • Logit.io
  • Kafka Lag Exporter
  • Confluent
  • CMAK (Cluster Manager for Apache Kafka)
  • Kafdrop
  • Offset Explorer
Let's explore Kafdrop monitoring.
Kafdrop for Kafka broker monitoring


Prerequisites:

To run the Kafdrop monitoring tool, ensure Java is installed by checking:
java -version
If you don't find Java on your Ubuntu machine, run the following:
sudo apt update
sudo apt install openjdk-21-jdk -y
For other Linux distributions you need to use the right package manager to install the JDK.

Download Kafdrop jar file from github:
sudo mkdir kafka-monitor
cd  kafka-monitor
curl -L -o kafdrop.jar https://github.com/obsidiandynamics/kafdrop/releases/download/4.1.0/kafdrop-4.1.0.jar
The curl command downloads whichever Kafdrop release version you choose but saves the jar file as `kafdrop.jar`.

Now we are all set to run the Kafdrop jar file with the --kafka.brokerConnect option, where you can give a single Kafka broker or full Kafka cluster details, as shown here:

java -jar kafdrop.jar \
--kafka.brokerConnect=kafka-broker1-host:port,kafka-broker2-host:port
 
If you don't specify anything, kafka.brokerConnect defaults to localhost:9092.

Accessing Kafdrop UI

As Kafdrop uses 9000 as its default port, we can open a fresh browser and access Kafdrop at localhost:9000 (on online terminals, open the custom port '9000' through the provided ports option). If you want to override the port and also supply the Kafka cluster details from a properties file, create a file named 'application.properties' with the following content:
  server.port=9090
  kafka.brokerConnect=kafka-broker1-host:port,kafka-broker2-host:port
  
When the application.properties file is in the same path where kafdrop.jar is present, it is picked up automatically, so the command would simply be `java -jar kafdrop.jar`.
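A small sketch for keeping Kafdrop running in the background on a test box, assuming kafdrop.jar (and optionally application.properties) sit in the current directory:
# run Kafdrop in the background and keep its output in a log file
nohup java -jar kafdrop.jar > kafdrop.log 2>&1 &
# confirm the UI answers on the configured port (9000 by default, 9090 with the properties file above)
curl -I http://localhost:9000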

Wednesday, February 5, 2025

Kafka installation on Ubuntu

The Kafka message broker was developed by LinkedIn to support their social media platform across the world. Message systems are built around two constructs: queues and topics. If a message sent by a producer is received by a single consumer, the communication is point-to-point; this is what queues provide. If a producer produces a message and there can be multiple consumers, it is one-to-many, the pub-sub (Publisher/Subscriber) model, which is implemented by topics. Kafka supports both message models.

Initial Setup for Kafka

Kafka is built for scale and runs on the Java runtime, so Java is a prerequisite to run Kafka. Optionally, other tools help with troubleshooting: 'netstat' to identify the Kafka ports in use, 'jq' to view topic content, and 'tree' for a tree view of directories.
apt update
apt install -y net-tools jq tree
On Ubuntu, run the following command to install a compatible Java; here I'm using OpenJDK 21:
 
apt install -y openjdk-21-jdk-headless
Confirm the Java installation by getting the Java version: `java -version`

OpenJDK21 installed verify


 

Install Apache Kafka

Let's use Kafka's quick start guide, https://kafka.apache.org/quickstart. As per the Apache Kafka release model there are three stable releases per year; the last stable release available is 3.9.0 from Nov 2024, with two binary options supporting Scala versions 2.12 and 2.13. So we will download Kafka from https://www.apache.org/dyn/closer.cgi?path=/kafka/3.9.0/kafka_2.12-3.9.0.tgz

Install Kafka 3.9.0


 
wget https://dlcdn.apache.org/kafka/3.9.0/kafka_2.12-3.9.0.tgz

# extract the file and cd into the folder

tar -xzf kafka_2.12-3.9.0.tgz
cd kafka_2.12-3.9.0

Start ZooKeeper

Kafka comes bundled with ZooKeeper. Here’s an example of what your zookeeper.properties might look like:
tickTime=2000
dataDir=/tmp/zookeeper
clientPort=2181
start ZooKeeper using the following command:
bin/zookeeper-server-start.sh config/zookeeper.properties
To verify ZooKeeper is working, send the 'ruok' four-letter command (newer ZooKeeper versions may require it to be allowed via 4lw.commands.whitelist):
echo ruok | nc localhost 2181
If it responds with `imok`, then ZooKeeper is operational.

Starting multiple Kafka brokers on a single host

Initially try with a single broker; after understanding how it works, you can move to the multi-broker setup discussed here. Our step here is to start the Kafka server and then create a topic. Open a new terminal window and start the Kafka server; the start script is available in the bin directory. To simulate a cluster configuration, make two copies of server.properties:
cd kafka_2.12-3.9.0/config/
cp server.properties server1.properties # for broker1
cp server.properties server2.properties # for broker2
The most common changes are to these 3 parameters in the broker configuration:
Broker 1:
broker.id=1
listeners=PLAINTEXT://localhost:9092
log.dirs=/tmp/kafka-logs-1

Broker 2:
broker.id=2
listeners=PLAINTEXT://localhost:9093
log.dirs=/tmp/kafka-logs-2
and then start these two brokers in the background as follows:
cd ~/kafka_2.12-3.9.0
nohup bin/kafka-server-start.sh config/server1.properties >broker1.out 2>&1 &
tail -f broker1.out

nohup bin/kafka-server-start.sh config/server2.properties >broker2.out 2>&1 &
tail -f broker2.out

#confirm with process check and network stats
jps -lm
netstat -tulpn|grep java


Create Topic

In another terminal window, create the topic 'test' with the following commands:
cd ~/kafka_2.12-3.9.0/
bin/kafka-topics.sh --create --bootstrap-server \
localhost:9092 --replication-factor 1 \
--partitions 1 --topic test
Now, we can confirm the topic is created on the Kafka server by listing all topics on it:
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
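To see which broker hosts the partition and how the replication is laid out, describe the topic:
bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic test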

Send messages to topic

Producer simulation: let's send some text messages for testing purposes. Kafka comes with a command line client that takes input from a file or from standard input and sends it out as messages to the Kafka cluster. By default, each line is sent as a separate message.
Run the producer and then type a few messages into the console to send to the server.
bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic test

For example enter the message on topic test as follows:

I love India
Mera Bharat mahan
Viswa guru Bharat
Skill India

Use ctrl-c to exit from producer

Consume messages

Start a consumer script to receive messages from the topic. Kafka also has a command line consumer that dumps messages to standard output; it should return the lines you typed in the above step.
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test 
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --partition 0 --offset 2
Same as we did for the producer, use Ctrl-C to exit the consumer. Hope you enjoyed this learning with me. Keep posting your experiences on this.

Wednesday, January 22, 2025

Job & CronJob - Batch Job

What is Job object in Kubernetes?


A Job object is used to create one or more Pods; the Job ensures that a specified number of Pods are created and terminate after successfully completing their work. There can be finite jobs which run within given timeout values. A Job tracks the 'Successful' completion of the required task. Jobs can be run in two variants: parallel and non-parallel.

Kubernetes Job types



There are 3 types of jobs:
  • non-parallel jobs (single-pod jobs: only one pod runs unless it fails; a replacement pod is created when the pod goes down)
  • parallel jobs with a fixed completion count
  • parallel jobs with a task queue

##Example type 1: hundred-fibonaccis.yml
---
apiVersion: batch/v1
kind: Job
metadata:
  name: fibo-100
spec:
  template:
    spec: 
      containers:
      - name: fib-container 
        image: truek8s/hundred-fibonaccis:1.0
      restartPolicy: OnFailure
  backoffLimit: 3
Create the Job:
kubectl create -f hundred-fibonaccis.yml
Now let's describe the job:
kubectl describe job fibo-100
Describing a Job in Kubernetes



In the Job description you can observe the attributes such as parallelism, Pod Statuses.
 
On the other terminal, which is running in parallel, observe the pod status change from Running to Completed:
kubectl get po -w 
The Pod still exists, so we can get the logs:
kubectl logs [podname] -c [container]
We can see the Fibonacci number series printed out from the container logs.
Fibonacci series printed from kubctl logs command



##Example type 2
---
apiVersion: batch/v1
kind: Job
metadata:
  name: counter
spec:
  template:
    metadata:
      name: count-pod-template
    spec:
      containers:
      - name: count-container
        image: alpine
        command:
         - "/bin/sh"
         - "-c"
         - "for i in 100 50 10 5 1; do echo $i; done"
      restartPolicy: Never
To create the counter-job use the following command :
kubectl create -f counter-job.yaml
Check Job list:
kubectl get jobs
Check the Pod:
kubectl get po 
Check the log:
kubectl logs count-pod-xxx 
Counter Job execution output using kubectl logs



Now let's describe the job counter:
kubectl describe jobs counter 
You can observe the Start Time and Pod Statuses from the above command. Cleanup: delete the job:
kubectl delete jobs counter
There is no need to delete the pods; when a Job is deleted, all its corresponding pods are removed automatically.

Controlling Job Completions

In some situations you need to run the same job multiple times; for that, we need to define completions: <number> under the Job spec section.
  ---
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-job
spec:
  completions: 2
  template:
    spec:
      containers:
        - name: busybox-container
          image: busybox
          command: ["echo", "hello Kubernetes job!!!"]
      restartPolicy: Never
  
Observe the number of Completions with:
watch kubectl get all
  kubectl get pods
  kubectl delete job hello-job
  kubectl get pods 
Once the Job is deleted, all its relevant resources are cleaned up automatically.

Parallelism

In some projects there is a need to run multiple pods in parallel; for that, define parallelism: <number> under the Job spec:
  ---
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-job
spec:
  parallelism: 2
  template:
    spec:
      containers:
        - name: busybox-container
          image: busybox
          command: ["echo", "hello Kubernetes job!!!"]
      restartPolicy: Never
  
Observe the number of Completions with watch kubectl get all. How does backoffLimit work on a Job? Let's do a simple experiment and understand this 'backoffLimit' attribute.
  ---
apiVersion: batch/v1
kind: Job
metadata:
  name: hello-job
spec:
  parallelism: 2
  backoffLimit: 4
  template:
    spec:
      containers:
        - name: busybox-container
          image: busybox
          command: ["ech0o", "hello Kubernetes job!!!"]
      restartPolicy: Never
  
Observe that the command is mistyped purposefully, so the pods keep failing; once the backoffLimit is reached, the Job stops creating replacement pods and is marked as failed. You can inspect this as shown below.
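A quick way to inspect the failure (standard kubectl commands; the exact event wording can vary by Kubernetes version):
# pods created by the job end up in Error status
kubectl get pods
# the job conditions and events show it gave up after hitting the backoff limit
kubectl describe job hello-job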

Working with CronJob


A CronJob is a Job that works like crontab in Linux systems. Any task that needs to be executed on a schedule can use this Kubernetes object.

##Example CronJob
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: hello-cron
spec:
  schedule: "* * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: busybox-container
            image: busybox
            command: ["echo", "Namste Kubernetes Cronjob!!!"]
           restartPolicy: OnFailure
Creating the sample CronJob by running the following command:
kubectl create -f cronjob.yaml

Open a new terminal and run the following command to watch continuously:
watch kubectl get all 
Check the Pod logs: use the Pod name to view the logs and see the output of the CronJob:
kubectl get po ; kubectl logs PODNAME

kubernetes logs cronjob run pod



The schedule field is set to "* * * * *", which means the job runs every minute.
The job runs a busybox container that prints the given text message.
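For less frequent runs only the schedule field changes; for example, a hypothetical every-five-minutes schedule would look like:
spec:
  schedule: "*/5 * * * *"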

Check the pods section in the above output

Kubernetes CronJob




