Saturday, February 22, 2025

Apache Cassandra Performance Optimization

Hey Guys!! I'm back with a new learning this week: I worked with and experimented on the Apache Cassandra distributed database. Its special feature is its querying capability with NoSQL (Not Only SQL).
This builds on our last blog post, where we learnt how to install Cassandra on a VM. I hope you are ready with a Cassandra DB node.

1: Optimizing Data Modeling

Objective: Understand partitioning and primary key design to optimize performance.
Create an inefficient table within company_db keyspace:
CREATE KEYSPACE company_db
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE company_db;

CREATE TABLE company_db.employees_bad (
id UUID PRIMARY KEY,
name TEXT,
department TEXT,
age INT,
city TEXT
);

Now let's insert some sample records into the table and try to query it.
INSERT INTO employees_bad (id, name, department, age, city) VALUES (uuid(), 'Reemi', 'Engineering', 30, 'New York');
INSERT INTO employees_bad (id, name, department, age, city) VALUES (uuid(), 'Ausula Santhosh', 'Marketing', 40, 'San Francisco');
INSERT INTO employees_bad (id, name, department, age, city) VALUES (uuid(), 'Telluri Raja', 'Marketing', 40, 'San Francisco');
INSERT INTO employees_bad (id, name, department, age, city) VALUES (uuid(), 'Abdul Pathan', 'Marketing', 40, 'San Francisco');
Run the following queries and observe the issue:
SELECT * FROM employees_bad;
SELECT * FROM employees_bad WHERE department = 'Engineering';
The second query fails (or requires ALLOW FILTERING) because department is not part of the primary key, so Cassandra would have to scan every partition in the cluster to answer it.

Resolving the data model performance issue

CREATE TABLE company_db.emp (
id UUID,
name TEXT,
department TEXT,
age INT,
city TEXT,
PRIMARY KEY(department,id)
);
Re-insert the sample records into the new table:
INSERT INTO emp (id, name, department, age, city) VALUES (uuid(), 'Reemi', 'Engineering', 30, 'New York');
INSERT INTO emp (id, name, department, age, city) VALUES (uuid(), 'Ausula Santhosh', 'Marketing', 40, 'San Francisco');
INSERT INTO emp (id, name, department, age, city) VALUES (uuid(), 'Telluri Raja', 'Marketing', 40, 'San Francisco');
INSERT INTO emp (id, name, department, age, city) VALUES (uuid(), 'Abdul Pathan', 'Marketing', 40, 'San Francisco');
Run the queries again and observe the performance; the department filter is now a single-partition read:
SELECT * FROM emp;
SELECT * FROM emp WHERE department = 'Engineering';
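To see what each query actually does, cqlsh's built-in tracing is handy; it prints the coordinator and replica steps for every request. A quick check inside cqlsh:
TRACING ON;
SELECT * FROM emp WHERE department = 'Engineering';
-- the trace should show a single-partition read instead of a cluster-wide scan
TRACING OFF;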

2: Using Secondary Indexes Efficiently

Objective: Learn when to use secondary indexes and when to avoid them. Create a table whose partition key doesn't match the query pattern:
CREATE TABLE company_db.products (
id UUID PRIMARY KEY,
name TEXT,
category TEXT,
price DECIMAL
);

INSERT INTO products (id, name, category, price)
VALUES (uuid(), 'Laptop', 'Electronics', 120000.00);
INSERT INTO products (id, name, category, price)
VALUES (uuid(), 'Wireless Mouse', 'Electronics', 200.00);
INSERT INTO products (id, name, category, price)
VALUES (uuid(), 'Chair', 'Furniture', 150.00);	
INSERT INTO products (id, name, category, price)
VALUES (uuid(), 'ErgoChair', 'Furniture', 1500.00);	
INSERT INTO products (id, name, category, price)
VALUES (uuid(), 'ErgoStandSitTable', 'Furniture', 24050.00);	

SELECT * FROM products;
SELECT * FROM products WHERE category = 'Electronics';
The category filter fails because category is not part of the primary key. A secondary index on category lets Cassandra look rows up by that column (note that it does not change the partition key). Create the index and rerun the query:
CREATE INDEX category_index ON products (category);
SELECT * FROM products WHERE category = 'Electronics';
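For contrast: without the index, Cassandra rejects this filter unless you force a scan, and a secondary index on a very high-cardinality or heavily updated column can cost more than it saves. A small sketch of both points:
-- without the index, this query needs an explicit (and expensive) full scan
SELECT * FROM products WHERE category = 'Electronics' ALLOW FILTERING;
-- if the index turns out to be a poor fit, drop it
DROP INDEX company_db.category_index;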

3: Compaction Strategies

Objective: Understand how different compaction strategies affect performance. We have four different options; let's explore each of them in detail here. You can check a table's current compaction strategy with `DESC TABLE emp`.

1. SizeTieredCompactionStrategy (STCS)

Scenario: An IoT application with a high volume of incoming sensor data.
Use Case: An IoT platform collects data from thousands of sensors distributed across a smart city. Each sensor sends data continuously, leading to a high volume of writes.
Advantage: STCS is ideal for this write-heavy workload because it efficiently handles large volumes of data by merging smaller SSTables into larger ones, reducing write amplification and managing disk space effectively.
CREATE KEYSPACE iot
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE iot;

CREATE TABLE iot.sensor_data (
  sensor_id int,
  timestamp timestamp,
  data blob,
  PRIMARY KEY (sensor_id, timestamp)
) WITH compaction = {
  'class': 'SizeTieredCompactionStrategy',
  'min_threshold': 4,
  'max_threshold': 32
};

DESC TABLE iot.sensor_data
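To watch STCS doing its work as writes accumulate, nodetool shows the SSTable count and any compactions in flight (generic checks, not specific to STCS):
nodetool tablestats iot.sensor_data   # SSTable count and disk space used
nodetool compactionstats              # compactions currently running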

2. LeveledCompactionStrategy (LCS)

Scenario: A social media application with a focus on fast reads.
Use Case: A social media platform requires fast access to user profiles and posts. Users frequently query the latest posts, likes, and comments.
Advantage: LCS is suitable for read-heavy workloads. It organizes SSTables into levels, ensuring that queries read from a small number of SSTables, resulting in lower read latency and consistent performance.
CREATE KEYSPACE socialmedia
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE socialmedia;

CREATE TABLE socialmedia.user_posts (
  user_id int,
  post_id int,
  content text,
  PRIMARY KEY (user_id, post_id)
) WITH compaction = {
  'class': 'LeveledCompactionStrategy',
  'sstable_size_in_mb': 160
};

DESC TABLE socialmedia.user_posts
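Since LCS is chosen for read latency, it is worth checking the per-table latency percentiles; nodetool reports them per keyspace and table:
nodetool tablehistograms socialmedia user_posts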

3. TimeWindowCompactionStrategy (TWCS)

Scenario: A time-series database for monitoring server performance.
Use Case: A company uses Cassandra to store and analyze server performance metrics such as CPU usage, memory usage, and network traffic. These metrics are collected at regular intervals and are time-based.
Advantage: TWCS groups data into time windows, making it easier to expire old data and reduce compaction overhead. It is optimized for time-series data, ensuring efficient data organization and faster queries for recent data.
CREATE KEYSPACE monitoring
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE monitoring;

CREATE TABLE monitoring.server_metrics (
  server_id int,
  metric timestamp,
  cpu_usage double,
  memory_usage double,
  network_traffic double,
  PRIMARY KEY (server_id, metric)
) WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'HOURS',
  'compaction_window_size': 1
};

DESC TABLE monitoring.server_metrics
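TWCS pairs naturally with a table-level TTL so that whole SSTables expire along with their time window. A sketch, assuming you want to keep metrics for 7 days (604800 seconds):
ALTER TABLE monitoring.server_metrics
WITH default_time_to_live = 604800;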

4. UnifiedCompactionStrategy (UCS)

Scenario: An e-commerce platform with mixed read and write workloads.
Use Case: An e-commerce website handles a mix of reads and writes, including product catalog updates, user reviews, and order processing. The workload varies throughout the day, with peak periods during sales events.
Advantage: UCS adapts to the changing workload by balancing the trade-offs of STCS and LCS. It provides efficient compaction for both read-heavy and write-heavy periods, ensuring consistent performance.
CREATE KEYSPACE ecommerce
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE ecommerce;

CREATE TABLE ecommerce.orders (
  order_id int,
  user_id int,
  product_id int,
  order_date timestamp,
  status text,
  PRIMARY KEY (order_id, user_id)
) WITH compaction = {
  'class': 'UnifiedCompactionStrategy'
};

DESC TABLE ecommerce.orders
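UCS can also be tuned after table creation. In Cassandra 5 it exposes a scaling_parameters option, where values like 'T4' bias compaction toward tiered (write-friendly) behavior and 'L4' toward leveled (read-friendly) behavior. A sketch of nudging this table toward reads:
ALTER TABLE ecommerce.orders
WITH compaction = {'class': 'UnifiedCompactionStrategy', 'scaling_parameters': 'L4'};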

Friday, February 21, 2025

Cassandra nodetool by examples

To monitor an Apache Cassandra cluster from the command line interface (CLI), you can use the nodetool utility, which is a powerful command-line tool specifically designed for managing and monitoring Cassandra clusters. Here are some key commands and their functionalities:

Key nodetool Commands

  • Check Cluster Status:

    nodetool status
    

    This command displays the status of all nodes in the cluster, including whether they are up or down, their load, and other important metrics.

  • Column Family Statistics:

    nodetool cfstats [keyspace_name].[table_name]
    

This command provides detailed statistics for a specific table (column family), including read/write counts, disk space used, and more. In newer Cassandra versions it is aliased as nodetool tablestats.

  • Thread Pool Statistics:

    nodetool tpstats
    

    This command shows statistics about thread pools used for read, write, and mutation operations, helping to identify potential bottlenecks.

  • Network Statistics:

    nodetool netstats
    

    This command displays information about network activity, including pending hints and the status of streaming operations.

  • Compaction Stats:

    nodetool compactionstats
    

    This command provides information about ongoing compactions, which can be useful for performance tuning.

  • Cluster Information:

    nodetool info
    

    This command gives a summary of the node's configuration and status within the cluster.
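For a quick health sweep, these commands chain together nicely; a minimal sketch (the keyspace and table are just examples from the data-modeling post above):

    nodetool status                    # node up/down state and load
    nodetool info                      # this node's summary
    nodetool tpstats                   # thread pool pressure
    nodetool cfstats company_db.emp    # per-table read/write statistics
    nodetool compactionstats           # compactions in flight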

Additional Monitoring Options

For more advanced monitoring, you may also consider integrating tools like Prometheus and Grafana for visualization or using cloud-based solutions such as Datadog or New Relic. These tools can provide real-time metrics and alerts based on your Cassandra cluster's performance.

Using nodetool effectively allows you to maintain a healthy Cassandra environment by providing insights into its operational metrics directly from the CLI.

Apache Cassandra 5 installation on Ubuntu

In this post we will go through the step-by-step installation of the latest version of Apache Cassandra, 5.0.3 (the latest available as of Feb 2025), on Ubuntu 20.

What problem I'm solving with this?

There is no direct documentation from Cassandra to help with installing the latest version, 5.0.3, on Ubuntu. So I've experimented on an online Ubuntu terminal (Killercoda) and am posting all the steps here.

Pre-requisite to install Cassandra

1. An Ubuntu terminal; either Killercoda or a Codespace on GitHub works well for this experiment.
2. Cassandra has specific compatibility requirements with different Java versions, which are crucial for ensuring optimal performance and stability.
3. You must have super (root) user access to install Cassandra.

Step 1: Install Java/JRE

Ensure Java is installed by checking with the command `java -version`; if it is not present, install it as follows:

#install jdk17-jre 
apt install openjdk-17-jre-headless

java -version

Step 2: Add Cassandra Repository

Since Cassandra is not included in the default Ubuntu repositories, you must add its repository with the following commands.
sudo apt install apt-transport-https
Let's download Cassandra securely; to do this we need the GPG key from Cassandra:
wget -qO- https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
Note: don't miss the dash at the end of the wget command above. The actual package information is added to the repo with the following line; the release series placed before `main` varies with the Cassandra version (40x, 41x, or 50x).
echo "deb https://debian.cassandra.apache.org 50x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
Update the package index
sudo apt update

Step 3: Install Cassandra 5.0.3

Now that the repository is set up, you can install Cassandra 5.0.3.
sudo apt install cassandra -y
Wait for the process to complete. Once the installation finishes, the Cassandra service starts automatically. Also, a user named cassandra is created during the process. That user is used to run the service.
systemctl status cassandra

Verify Cassandra installation

nodetool status
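Beyond nodetool, a quick cqlsh round trip confirms the node is answering CQL (a minimal smoke test):
cqlsh -e "SELECT release_version FROM system.local;"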

Don't work HARD!

Working hard means you are not using your brain to work! A repetitive task can be automated; that is called SMART work. Let's do that here:
apt install -y openjdk-17-jre-headless 
apt install -y apt-transport-https
wget -qO- https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -

echo "deb https://debian.cassandra.apache.org 50x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list

apt update
sudo apt install cassandra -y
sleep 10
echo "Cassandra installed verify"
nodetool status

Troubleshoot pointers:

Initializing the Cassandra server may take some more seconds, so wait if you see:
Exception in thread "main" java.lang.IllegalArgumentException: Server is not initialized yet, cannot run nodetool.
Observe in the `systemctl status cassandra` output that the Active line should show `active (running)`. If it does not, something is missing; maybe the Java/JRE version is mismatched, or the Cassandra process failed to launch, in which case we can look at the system logs.
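For the Debian package install, the usual places to look are (paths assumed from the default package layout):
journalctl -u cassandra --no-pager | tail -50   # service-level messages
tail -f /var/log/cassandra/system.log           # Cassandra's own log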
Please write about your experience with the Cassandra 5 installation in the comment box. Thanks for being with me till here.

Saturday, February 8, 2025

Kafka Message system on Kubernetes

 

Setting up the Kubernetes namespace for kafka
apiVersion: v1
kind: Namespace
metadata:
  name: "kafka"
  labels:
    name: "kafka"

k apply -f kafka-ns.yml   # k is an alias for kubectl
Now let's create the ZooKeeper Service and Deployment inside the kafka namespace:
apiVersion: v1
kind: Service
metadata:
  labels:
    app: zookeeper-service
  name: zookeeper-service
  namespace: kafka
spec:
  type: NodePort
  ports:
    - name: zookeeper-port
      port: 2181
      nodePort: 30181
      targetPort: 2181
  selector:
    app: zookeeper
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: zookeeper
  name: zookeeper
  namespace: kafka
spec:
  replicas: 1
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
    spec:
      containers:
        - image: wurstmeister/zookeeper
          imagePullPolicy: IfNotPresent
          name: zookeeper
          ports:
            - containerPort: 2181
Image: kube-kafka1
From the ZooKeeper Service, get the ClusterIP and use it in the Kafka broker configuration, which is the next step we are going to perform.
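One way to fetch that ClusterIP directly (assuming the Service name used above):
kubectl get svc zookeeper-service -n kafka -o jsonpath='{.spec.clusterIP}'
With that IP in hand, here are the broker Service and Deployment: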
apiVersion: v1
kind: Service
metadata:
  labels:
    app: kafka-broker
  name: kafka-service
  namespace: kafka
spec:
  ports:
  - port: 9092
  selector:
    app: kafka-broker
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: kafka-broker
  name: kafka-broker
  namespace: kafka
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kafka-broker
  template:
    metadata:
      labels:
        app: kafka-broker
    spec:
      hostname: kafka-broker
      containers:
      - env:
        - name: KAFKA_BROKER_ID
          value: "1"
        - name: KAFKA_ZOOKEEPER_CONNECT
          value: ZOOKEEPER-INTERNAL-IP:2181
        - name: KAFKA_LISTENERS
          value: PLAINTEXT://:9092
        - name: KAFKA_ADVERTISED_LISTENERS
          value: PLAINTEXT://kafka-broker:9092
        image: wurstmeister/kafka
        imagePullPolicy: IfNotPresent
        name: kafka-broker
        ports:
          - containerPort: 9092

In the manifest above, replace ZOOKEEPER-INTERNAL-IP in the KAFKA_ZOOKEEPER_CONNECT environment variable with your ZooKeeper Service's ClusterIP from the previous step. Now apply:
k apply -f kafka-broker.yml
After applying, watch the resources come up:
watch kubectl get all -n kafka
Image: Kube kafka 2
Step 3: Enable network communication. To ensure that ZooKeeper and Kafka can communicate using the hostname kafka-broker, we need to append the following entry to the /etc/hosts file on the host machine:
echo "127.0.0.1 kafka-broker" >> /etc/hosts
Set up port forwarding as follows:
kubectl port-forward svc/kafka-service 9092:9092 -n kafka
Image: Kafka on Kubernetes
Open a new terminal and run the following:

Test Kafka Topics using Kafkacat

To easily send and retrieve messages from Kafka, we'll use a CLI tool named kcat (formerly kafkacat). Install it using the below command:
apt install kafkacat
Image: Kafkacat installation

Producing and consuming messages using kcat

Run the below command to create a topic named topic1 and send a test message "hello everyone!" (you can enter your own messages):
echo "hello everyone!" | kafkacat -P -b 127.0.0.1:9092 -t topic1
   
Now let's consume the message using kafkacat command:
  kafkacat -C -b 127.0.0.1:9092 -t topic1
  
Image: Kafkacat Producer Consumer
Happy learning Kafka on Kubernetes! I ran the above experiment on the Killercoda terminal.

Kafdrop install and Monitor

There are many monitoring tools available for Kafka brokers. I've collected various monitoring options:
  • Factor House Kpow
  • Datadog
  • Logit.io
  • Kafka Lag Exporter
  • Confluent
  • CMAK (Cluster Manager for Apache Kafka)
  • Kafdrop
  • Offset Explorer
Let's explore Kafdrop monitoring.
Kafdrop for Kafka brokers monitoring


Prerequisites:

To run the Kafdrop monitoring tool, ensure Java is installed by checking:
java -version
If you don't find Java on your Ubuntu machine, run the following:
sudo apt update
sudo apt install openjdk-21-jdk -y
For other Linux distributions, use the appropriate package manager to install the JDK.

Download Kafdrop jar file from github:
mkdir kafka-monitor
cd kafka-monitor
curl -L -o kafdrop.jar https://github.com/obsidiandynamics/kafdrop/releases/download/4.1.0/kafdrop-4.1.0.jar
The -o option in the curl command saves whatever Kafdrop release you download under the fixed name `kafdrop.jar`.

Now we are all set to run the Kafdrop jar file with the --kafka.brokerConnect option, where you can give a single Kafka broker or the full cluster details, as mentioned here:

java -jar kafdrop.jar \
--kafka.brokerConnect=kafka-broker1-host:port,kafka-broker2-host:port
 
If you don't specify anything, kafka.brokerConnect defaults to localhost:9092.

Accessing Kafdrop UI

Kafdrop uses 9000 as its default port, so open a fresh browser and access Kafdrop at localhost:9000 (on online terminals, open the port by providing the custom port '9000'). If you want to override the port, and also supply the Kafka cluster details from a properties file, create a file named 'application.properties' with the content:
  server.port=9090
  kafka.brokerConnect=kafka-broker1-host:port,kafka-broker2-host:port
  
When the application.properties file is in the same path where kafdrop.jar is present, it is picked up automatically, so the command reduces to `java -jar kafdrop.jar`.
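To keep Kafdrop running after you close the terminal, a simple background launch works; a sketch using nohup (the log file name is arbitrary):
nohup java -jar kafdrop.jar > kafdrop.log 2>&1 &
tail -f kafdrop.log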

Wednesday, February 5, 2025

Kafka installation on Ubuntu

The Kafka message broker was developed by LinkedIn to support their social media platform across the world. Message systems come in two flavors: Queues and Topics. If a message is sent by a producer and received by a single consumer, the communication is point-to-point; this is what Queues provide. If a producer publishes a message that can be received by multiple consumers, that is the one-to-many or pub-sub (Publisher/Subscriber) model, implemented by Topics. Kafka supports both message models.

Initial Setup for Kafka

Kafka is built for scale and runs on the Java runtime, so Java is a prerequisite. Optionally, install a few tools that help with troubleshooting: 'netstat' to identify the ports Kafka is using, 'jq' to view topic content, and 'tree' for a directory tree view.
apt update
apt install -y net-tools jq tree
On Ubuntu, run the following command to install a compatible Java; here I'm using OpenJDK 21:
 
apt install -y openjdk-21-jdk-headless
Confirm the Java installation by checking the version: `java -version`

Image: OpenJDK 21 installation verified


 

Install Apache Kafka

Let's use Kafka's quick start guide at https://kafka.apache.org/quickstart. As per the Apache Kafka release model, there are three stable releases per year; the latest stable release is 3.9.0 from Nov 2024, with two binary options supporting Scala versions 2.12 and 2.13. We will download Kafka from https://www.apache.org/dyn/closer.cgi?path=/kafka/3.9.0/kafka_2.12-3.9.0.tgz

Install Kafka 3.9.0


 
wget https://dlcdn.apache.org/kafka/3.9.0/kafka_2.12-3.9.0.tgz

# extract the file and cd into the folder

tar -xzf kafka_2.12-3.9.0.tgz
cd kafka_2.12-3.9.0

Start ZooKeeper

Kafka comes bundled with ZooKeeper. Here’s an example of what your zookeeper.properties might look like:
tickTime=2000
dataDir=/tmp/zookeeper
clientPort=2181
Start ZooKeeper using the following command:
bin/zookeeper-server-start.sh config/zookeeper.properties
To verify ZooKeeper is working, send it the four-letter health-check command `ruok` ("are you OK?"):
echo ruok | nc localhost 2181
If it responds with `imok`, then ZooKeeper is operational. (On newer ZooKeeper versions you may need to whitelist the command with `4lw.commands.whitelist=ruok` in zookeeper.properties.)

Starting multiple Kafka brokers on a single host

Initially try with a single broker; after understanding how it works, you can move to the multi-broker setup discussed here. Our step is to start the Kafka server and then create a topic. Open a new terminal window and start the Kafka server; the start script is available in the bin directory. To simulate a cluster configuration, make two copies of server.properties:
cd kafka_2.12-3.9.0/config/
cp server.properties server1.properties # for broker1
cp server.properties server2.properties # for broker2
The most common changes are these three parameters in each broker configuration:
Broker 1:
broker.id=1
listeners=PLAINTEXT://localhost:9092
log.dirs=/tmp/kafka-logs-1

Broker 2:
broker.id=2
listeners=PLAINTEXT://localhost:9093
log.dirs=/tmp/kafka-logs-2
and then start these two brokers in the background as follows:
cd ~/kafka_2.12-3.9.0
nohup bin/kafka-server-start.sh config/server1.properties >broker1.out 2>&1 &
tail -f broker1.out

nohup bin/kafka-server-start.sh config/server2.properties >broker2.out 2>&1 &
tail -f broker2.out

#confirm with process check and network stats
jps -lm
netstat -tulpn|grep java


Create Topic

In another terminal window, create a topic named 'test' with the following commands:
cd ~/kafka_2.12-3.9.0/
bin/kafka-topics.sh --create --bootstrap-server \
localhost:9092 --replication-factor 1 \
--partitions 1 --topic test
Now, we can confirm the topic is created on the Kafka server by listing all topics on it:
bin/kafka-topics.sh --list --bootstrap-server localhost:9092
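You can also inspect the topic's partition and replica placement with the same CLI's --describe flag:
bin/kafka-topics.sh --describe --bootstrap-server localhost:9092 --topic test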

Send messages to topic

Producer simulation: let's send some text messages for testing. Kafka comes with a command-line client that takes input from a file or from standard input and sends it out as messages to the Kafka cluster. By default, each line is sent as a separate message.
Run the producer and then type a few messages into the console to send to the server.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test

For example enter the message on topic test as follows:

I love India
Mera Bharat mahan
Viswa guru Bharat
Skill India

Use Ctrl-C to exit the producer.

Consume messages

Start a consumer script to receive messages from the topic. Kafka also has a command-line consumer that dumps messages to standard output; it should return the lines you typed in the above step.
# consume new messages only
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test
# consume all messages from the beginning
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
# consume starting at a specific offset (requires naming the partition)
bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --partition 0 --offset 2
As with the producer, use Ctrl-C to exit the consumer. Hope you enjoyed this learning with me. Keep posting your experience with this.
