Monday, November 20, 2017

Big Data/Data Science (Analytics) – Apache Hadoop, Kafka and Spark.

Introduction
In this article I plan to provide an overview of the following technologies and a use case where the three can be used together to do data analytics as well as predictive analytics through machine learning.
  • Apache Hadoop
  • Apache Kafka
  • Apache Spark
Architectural Overview

The above diagram provides an overview of how the various technologies work together to enable big data analysis and predictive analytics through machine learning.
Apache Kafka
Kafka is used here to provide data streaming. (NOTE: For AWS fans, Amazon Kinesis can be used for data streaming as well.)
Apache Spark
Apache Spark is used to perform data analysis on the data stream (DStream) generated in real time by Apache Kafka. More specifically it’s the Spark Streaming module that’s used. (NOTE: For AWS fans you can deploy an Elastic MapReduce (EMR) cluster in AWS with Spark libraries deployed in that cluster and then use that cluster instead of Apache Spark and Apache Hadoop)
Apache Hadoop
The Hadoop ecosystem serves the following purposes here
  • It’s used by Kafka consumers to process big data
  • It’s used by Spark to store the output results
  • It’s used as a Cluster Manager (the YARN piece of Hadoop)

Apache Hadoop
Apache Hadoop is an open-source distributed computation platform for processing big data and performing data analytics (its design builds on Google's published white papers). The architectural diagram above shows the main modules of Apache Hadoop; a brief description follows
  • HDFS - Distributed file system to store big data across commodity server infrastructure
  • YARN - the resource management engine responsible for coordinating cluster resources; the acronym "Yet Another Resource Negotiator" is pretty self-explanatory
  • MAPREDUCE – It’s the de-facto parallel, distributed programming algorithm used on a Hadoop cluster.
(NOTE: I am excluding quite a few modules to simplify the article's intent.)

HDFS
As mentioned before HDFS is the Hadoop Distributed File System. The critical components of HDFS are
  • NameNode
NameNode stores the file/directory system structure of the distributed file system

  • DataNode
DataNodes are servers that store the actual big data blocks. Failover capability is built into DataNodes because the data blocks are replicated across multiple DataNodes based on the replication factor. (NOTE: For production you should typically have a replication factor of 3 or 5.)

  • ZooKeeper
ZooKeeper is used to manage failover of resources. In the above example, the NameNode has an active-passive configuration, with ZooKeeper acting as the gatekeeper that listens to heartbeats and fails over to the passive NameNode if the active NameNode dies. (NOTE: Even though the diagram shows one box for ZooKeeper, in production ZooKeeper daemons typically run on multiple nodes to provide High Availability (HA).)
YARN
YARN is the Resource Manager for Hadoop. Prior to YARN, Apache Hadoop had a different architecture for resource management that was not as scalable and was tightly integrated with the MapReduce programming algorithm. Starting with Apache Hadoop 2.X, Hadoop separated the resource management responsibility from the MapReduce programming algorithm. By doing this separation, Hadoop opened the possibility for other projects to use YARN and HDFS without using the MapReduce programming algorithm.
This allowed other technology platforms like Apache Spark to come into the picture and run Spark jobs on Hadoop systems, using Hadoop's Resource Manager (YARN) as Spark's Cluster Manager and the underlying HDFS file system for big data.
Apache Spark jobs are faster than their MapReduce equivalents, and one of the reasons for that is the use of Directed Acyclic Graphs (DAGs).

MAPREDUCE
It’s the de-facto parallel, distributed programming algorithm used on a Hadoop cluster. Prior to Hadoop 2.X this programming algorithm was integrated into the Resource Manager and was collectively referred to as MRv1 (MapReduce version 1). Starting with Hadoop 2.X, the algorithm part and the resource management part were separated into MapReduce and YARN respectively, these are collectively referred to as MRv2 (MapReduce version 2)

Apache Kafka
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming apps. The architectural diagram above shows the main modules of Apache Kafka; a brief description follows
Kafka Brokers
This is the component that consumers and producers communicate with via APIs to send and receive messages. You will typically have multiple Kafka brokers within a cluster for high availability.
Zookeeper
In a Kafka cluster, ZooKeeper is needed among other things for service discovery and leader election of Kafka brokers. It is basically the central repository that the various distributed components of Kafka use to communicate with each other and share metadata about the Kafka cluster.
Producers
Kafka Producers use Kafka Producer API to send messages to topics.
Consumers
Kafka Consumers use Kafka Consumer API to receive messages from topics. In this article Apache Spark is one such Consumer.
Topic
A Topic is a named set of records (messages). Kafka stores Topics in logs, and a Topic log is broken down into multiple partitions. Each Topic partition is replicated across Kafka brokers to provide data redundancy in case one Kafka broker fails. Topics provide a de-coupled publisher-subscriber style messaging model for sharing data.
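To make the Producer/Consumer model a little more concrete, here is a minimal sketch using the third-party kafka-python client (the broker address, Topic name and group id below are example values, not ones used later in this article):

from kafka import KafkaProducer, KafkaConsumer

# Producer side: publish one record to a Topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("webLogs", b'127.0.0.1 - - [20/Nov/2017:10:00:00 +0000] "GET / HTTP/1.1" 200 1043')
producer.flush()

# Consumer side: read records from the same Topic as part of a consumer group
consumer = KafkaConsumer("webLogs", bootstrap_servers="localhost:9092", group_id="log-readers")
for record in consumer:
    print(record.value)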
Apache Spark
Apache Spark provides an engine for large-scale data processing. It provides multiple modules, each meant for a different data processing purpose; a brief description of the modules follows
Spark SQL
This module allows you to run SQL-style queries on data that is not typically stored in an RDBMS. SQL is a well-accepted language for data analysis, and Spark SQL minimizes the learning curve for data scientists by giving them a familiar query language; the alternative would be to write the same logic via map and reduce functions. Behind the scenes Spark SQL still uses map and reduce functions, it just abstracts those details from the users writing Spark SQL.
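For example, a minimal PySpark sketch of the idea (the file path and field names below are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()
# Load semi-structured data that never lived in an RDBMS and query it with plain SQL
logs = spark.read.json("hdfs:///data/weblogs.json")
logs.createOrReplaceTempView("weblogs")
spark.sql("SELECT ip, count(*) AS hits FROM weblogs GROUP BY ip ORDER BY hits DESC").show()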
Spark Streaming
This module allows real-time analysis of streaming data. It provides libraries and APIs to stream data from Apache Kafka, Amazon Kinesis, direct TCP ports and many other streaming sources. In the "Use Case" section of this article, when I mention Spark I really mean the Spark Streaming module.
MLlib
This is the Machine Learning module of Apache Spark, and in the "Use Case" section I mention where this module can be used in a real-world application.
GraphX
This is the graph computational engine of Spark that allows users to build and transform graph structured data at scale using Spark GraphX API.

The architectural diagram above shows the main components of the Apache Spark ecosystem; a brief description follows
Driver Program (Spark Context)
This is the client program that interacts with Spark API; Scala is the native functional programming language platform for Apache Spark. Apache Spark does provide similar API interfaces in Java and Python. The Driver program is what sets up the Spark Context that is used to interact with the Apache Spark cluster.
Cluster Manager
This is the component of Apache Spark that manages interaction with the Worker Nodes and distributes Spark jobs to those Worker Nodes for execution. Apache Spark supports the following three types of cluster manager configurations (NOTE: there is also experimental support for Kubernetes (K8S))
  • Standalone – this runs the cluster manager as a part of Apache Spark cluster
  • YARN – this uses Hadoop 2 resource manager (YARN) for cluster management
  • Apache Mesos – this is a more generic resource manager similar to YARN but is not restricted to managing clusters in Hadoop only.
Worker Node
Worker Nodes are responsible for providing system resource details to the Cluster Manager so it can schedule jobs at the appropriate Worker Nodes based on resource availability. Each application gets an "executor" on each Worker Node, and the executor is what runs multiple tasks for the application; it is also responsible for providing in-memory caching for faster processing. The Spark Context started by the Driver Program communicates directly, and in parallel, with the executors on each Worker Node for job execution status. The Driver Program also splits the job into tasks that need to be executed in parallel on each Worker Node by the executor.
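To make the Driver Program/Cluster Manager relationship concrete, here is a minimal PySpark driver sketch (the application name is arbitrary, and "yarn" can be swapped for "local[*]", a Mesos URL or a standalone "spark://" URL):

from pyspark import SparkConf, SparkContext

# The driver creates the Spark Context; the master URL decides which cluster
# manager (standalone, YARN, Mesos or local threads) schedules the work.
conf = SparkConf().setAppName("MyDriverProgram").setMaster("yarn")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000)).sum())  # a trivial job split into tasks run by the executors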
Use Case
Now that you have a quick overview of the technologies involved, let's consider where they can be used together to do data analytics as well as predictive analytics. The figure below shows one such use case.
A few observations about the above use case
  • Point 1: is the entry point to transfer web server access/error logs to Hadoop’s HDFS.
  • Point 2: is where the client programs read data from HDFS and submits the same to Apache Kafka for streaming.
  • Point 3: is where the client program uses the Producer API of Apache Kafka to send “messages” to Topic for consumption.
  • Point 4: is where Kafka consumers (Apache Spark in this example) uses the Consumer API of Apache Kafka to consume messages from the Topic - messages contain web log entries that need to be analyzed by Apache Spark
  • Point 5: Apache Spark can analyse the data in real-time by submitting Spark jobs in Hadoop cluster using Hadoop’s resource manager YARN.
  • Point 6: Apache Spark can then save the results in Hadoop’s HDFS file system for further actions
  • Point 7: You can have Alert and Monitoring programs that take actions based on Apache Spark's real-time data analytical results. A few of the possible actions these programs can take are listed below
    • The results can store which IP addresses are generating more web traffic with "error" results. The programs can use the "per IP" count of "errors" as a metric to determine which IPs can be blacklisted, as they could be potential hackers running web scans to exploit website vulnerabilities
    • The results can store the top 10 web pages visited by the user community, and you can accordingly re-align the website navigation to encourage users to stay longer on your website, thereby increasing the likelihood of them purchasing more items
    • Using the MLlib module of Apache Spark you can apply machine learning algorithms to generate results that predict which items a user is most likely to buy based on the web pages the user visits on your site; you can accordingly change the home page for that user the next time they visit your website (predictive analytics is the point I am trying to get across here).
  • Point 8: Visual BI tools like Tableau and Power BI have connectors that allow you to connect directly to Hadoop. You can perform visual data analysis based on the results stored by Apache Spark in HDFS and look for patterns from which you can learn and adapt.
As you can see, the Use Case can be extrapolated to many real-world applications and the possibilities are limitless (I hope you are getting excited to use this concept at your workplace).
A few takeaways
  • It’s possible to use Kafka and Spark alone and not use Hadoop. Hadoop is only needed if you have big data to handle (petabytes not terabytes of data)
  • I am using Kafka for streaming but if you want you can use other streaming technologies like – Apache Storm, Amazon Kinesis.
  • If you do not have a need for real-time data analytics then you do not need Apache Kafka. You can just use Apache Spark's other libraries, like Spark SQL and MLlib, to do data analysis and perform predictive analytics
  • It’s possible to use the output of Apache Spark’s analysis as data feed for Visual Analytical tools like – Microsoft Power BI and Tableau.
  • Visual Analytical tools like Power BI and Tableau will be something that you will end up using in your data analysis as they provide pretty fancy Dashboards and Visual Graphs that help you and the management executives visually understand the data trends quickly
  • Try to use the equivalent technologies provided by Cloud providers instead of building the clusters from the ground up on bare-metal servers or Virtual Machines. The only downside is that you cannot control upgrades of these technologies at your own pace; you may be restricted to older versions until the Cloud provider gives you an upgrade path. As an example, the following combination is a good alternative to Apache Kafka, Apache Spark and Apache Hadoop if Amazon AWS is your cloud provider –

Amazon Kinesis and Amazon EMR cluster with Spark libraries enabled.

Use this approach if you are using AWS, instead of building your own clusters for Hadoop, Kafka and Spark via the EC2/bare-metal server route.

NOTE: As mentioned before, the flip side of using a Cloud provider's equivalent technologies is getting married to the technology versions the provider offers, so pick your poison. The benefit is that infrastructure management work is simplified and standing up clusters takes just a few clicks in the AWS management console (all right, I am not trying to be a salesperson for Amazon here; you will still need a DevOps team).

Conclusion
I hope this article inspires you to do data analytics with your real-world use cases. Please refer to the following blog - http://pmtechfusion.blogspot.com/2017/11/big-datadata-science-analytics-apache.html - if you are interested in detailed installation steps for the above technologies and some hands-on work. I have also provided a Python script for the "Use Case" described in this blog and explained how to execute it.

Big Data/Data Science (Analytics) – Apache Hadoop, Kafka and Spark Installation

Introduction
This is a follow-up to my overview article on using Hadoop, Kafka and Spark (http://pmtechfusion.blogspot.com/2017/11/big-datadata-science-analytics-apache_20.html). I plan to provide detailed steps to first install Apache Hadoop on one Virtual Machine (VM) and then the specifics of installing it as a true distributed multi-node cluster on multiple VM. I will also provide steps to install Spark and Kafka, and complete this blog with an example of how to use the three technologies in your real-world application.
Architectural Overview
The above diagram provides an overview of where each component is deployed across the four VM.
Development Environment
  1. I will have one VM where I will install Kafka, Spark and ZooKeeper
  2. I will have three VM where I will install Apache Hadoop, one VM will act as the master node and the other two VM will act as data nodes
  3. I have written a simple Python script that runs as a Spark job using YARN as a cluster manager and Apache Hadoop as the distributed file system.
  4. All VM run Ubuntu 16.04
  5. Apache Hadoop version is 2.8.1
  6. Apache Kafka version is 1.0.0 (I am using Scala 2.11 binary distribution)
  7. Apache Spark version is 2.2.0
  8. Apache Zookeeper version is 3.4.8-1
  9. The VM on which Apache Kafka, Apache Spark and Apache ZooKeeper are installed is named "hadoop1", the VM that acts as the master node for Hadoop is named "hadoopmaster", and the two VM that run as slaves for the Hadoop cluster are named "hadoopslave1" and "hadoopslave2".
  10. I am mentioning the versions of various software modules and Operating System (OS) I have used so if you want to create your own working environment you know which versions worked for me. It’s possible that you may end up using a different version by the time you start building your own development environment.

Common software & network installation steps
This section details installation steps for common software (like Java) needed by all components.
Step1
Let’s install Java using the following command (NOTE: You need to install Java on all four VM)
sudo apt install openjdk-8-jdk-headless

Step2
As we need JAVA_HOME to be set as an environment variable and be part of the bash environment, make sure you add it using the following steps -
Step2a
Edit the “/etc/environment” file and add the following lines
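For example, with the default Ubuntu 16.04 OpenJDK 8 package the entry looks like the following (adjust the path if your JDK is installed elsewhere):

JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"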


Step2b
Edit the “~/.bashrc” file and add the following lines
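For example (again assuming the default OpenJDK 8 path):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin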
NOTE: The above file will need to be edited for different user accounts that you will have to create for each component. The different accounts are mentioned below
  • For Hadoop I am creating a user - hduser and group - hadoop
  • For Kafka I am creating a user - kafka and group - kafka
  • For Spark I am creating a user - testuser1 and group - testuser1

Make sure the "~/.bashrc" files for users hduser, kafka and testuser1 have the above Java environment variable included.
Step3
Let's install "ssh" on all VM, as it is needed to start the Hadoop daemons across a multi-node Hadoop cluster. You also need "ssh" for remote administration of all the components related to Hadoop, Kafka and Spark.
sudo apt install ssh

Step4
In order to allow all the VM to communicate with each other over the network, add the following host entries (in "/etc/hosts") on all four VM.
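For example, the "/etc/hosts" entries would look like the following (the IP addresses below are placeholders; substitute the actual IP addresses of your four VM):

192.168.1.101   hadoop1
192.168.1.102   hadoopmaster
192.168.1.103   hadoopslave1
192.168.1.104   hadoopslave2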

Kafka & Zookeeper installation steps
Kafka needs Zookeeper for distributed system co-ordination. The installation steps below show how to install the same on one VM. You should know that both Kafka as well as ZooKeeper can be installed on multiple VM as separate distributed systems for High Availability (HA) and fail-over.  I am installing them along with Spark on one VM as my laptop resource limitation will not allow me to create infinite VM (I wish).
Step1
Download the latest Kafka binary from the following URL - http://kafka.apache.org/downloads.html
The version I downloaded for this blog is kafka_2.11-1.0.0.tgz (Kafka 1.0.0 built for Scala 2.11).


Step1a
Unzip the downloaded file into the “/usr/local/kafka” folder with the following steps
$ cd /usr/local
$ sudo tar xzf kafka_2.11-1.0.0.tgz
$ sudo mv kafka_2.11-1.0.0 kafka
Step1b
Let’s create a user – “kafka” and group – “kafka” using the following commands -
sudo addgroup kafka
sudo adduser --ingroup kafka kafka

Step1c
Let's grant the "kafka" user/group access to the "/usr/local/kafka" folder using the following command
sudo chown -R kafka:kafka "/usr/local/kafka"

Step2
Let’s install ZooKeeper using the following command
sudo apt install zookeeperd

The above command will install and run ZooKeeper on port “2181”
Step3
Now we need to run commands to start Kafka and create a Topic
Step3a
Run the following commands to start and stop Kafka server
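For example (run these as the "kafka" user; the -daemon flag runs the broker in the background):

/usr/local/kafka/bin/kafka-server-start.sh -daemon /usr/local/kafka/config/server.properties
/usr/local/kafka/bin/kafka-server-stop.sh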
Step3b
Run the following command to create a Topic named "apacheLogEntries"; this Topic will store the Apache web logs for real-time data analytics
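For example, a single-partition Topic with a replication factor of 1 (ZooKeeper is listening on port 2181 as installed in Step2):

/usr/local/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic apacheLogEntries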
At this point the Kafka and ZooKeeper installation is complete. In the "Use Case" section below I will provide details of how to send Apache access log entries to the Topic created above, and how to make Spark consume the "apacheLogEntries" Topic to do real-time data analytics.
Apache Spark installation steps
Step1
Download the latest Spark binary from the following URL - http://spark.apache.org/downloads.html
The version I downloaded for this blog is spark-2.2.0-bin-hadoop2.7.tgz (Spark 2.2.0, pre-built for Hadoop 2.7).
Step1a
Unzip the downloaded file into the “/usr/local/spark” folder with the following steps
$ cd /usr/local
$ sudo tar xzf spark-2.2.0-bin-hadoop2.7.tgz
$ sudo mv spark-2.2.0-bin-hadoop2.7 spark
Step1b
Let's create a user - "spark" and group - "spark" using the following commands -
sudo addgroup spark
sudo adduser --ingroup spark spark

Step1c
Let's grant the "spark" user/group access to the "/usr/local/spark" folder using the following command
sudo chown -R spark:spark "/usr/local/spark"

Step2
Step2a
Edit the “/etc/environment” file and add the following lines
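For example (assuming the "/usr/local/spark" install location used above):

SPARK_HOME="/usr/local/spark"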

Step2b
Edit the “~/.bashrc” file and add the following lines
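For example:

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin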

The above environment variable needs to be set in “~/.bashrc” file for “spark” user as well as “hduser”, the latter is needed because when you use YARN and Hadoop for running Spark jobs you will have to submit the same to the cluster with “hduser” credentials as the Hadoop cluster is set to accept jobs from “hduser” account only.

Step3
Copy the "/usr/local/spark/conf/spark-env.sh.template" file to "/usr/local/spark/conf/spark-env.sh" and add the following line to the "/usr/local/spark/conf/spark-env.sh" file
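The line points Spark at the Hadoop configuration folder, for example:

export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop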
The above line can be commented out if you plan to run Spark as a standalone cluster where you do not want to run the spark job on Hadoop using YARN as its cluster manager
NOTE: The above line should point to the "hadoop" configuration folder. This is the same configuration folder that all Hadoop nodes read from when they start up, so any Hadoop configuration changes done in this folder are also visible to Spark jobs. This is how Spark knows about the Hadoop cluster configuration and where YARN is running.
That's it: Spark is now ready to submit jobs to Hadoop's YARN.
If you just want to use Spark for your data analytics, then "Step3" above should not be done; by default Spark is configured to run in standalone mode using its internal Cluster Manager for resource negotiation. In fact, that is how you should start using Spark, to get deep hands-on experience with the various Spark modules before using Hadoop's YARN as the Cluster Manager. (Standalone mode helps with troubleshooting your Spark script errors without relying on the goliath that is the Hadoop ecosystem.)

Apache Hadoop installation steps
Let’s start with installing stand-alone mode of Hadoop on one VM and then I will point out places where you need to deviate from stand-alone steps to create a Multi-Node Hadoop cluster.
In each step mentioned below, I will have two sections, one for single-node Hadoop cluster and another for multi-node. Steps that are common for both will not have these two sub-sections mentioned.

Step1
Download the latest Hadoop binary from the Apache Hadoop releases page - http://hadoop.apache.org/releases.html

The version I downloaded for this blog is hadoop-2.8.1.tar.gz
Step1a
Unzip the downloaded file into the “/usr/local/hadoop” folder with the following steps
$ cd /usr/local
$ sudo tar xzf hadoop-2.8.1.tar.gz
$ sudo mv hadoop-2.8.1 hadoop
Step1b
Let's create a user - "hduser" and group - "hadoop" using the following commands -
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

Step1c
Let's grant the "hduser" user and "hadoop" group access to the "/usr/local/hadoop" folder using the following command
sudo chown -R hduser:hadoop "/usr/local/hadoop"

NOTE: Step1, Step1a through Step1c above need to be executed on all nodes of a Multi-Node Hadoop cluster.
Step2
Edit the “~/.bashrc” file and add the following lines
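For example (assuming the "/usr/local/hadoop" install location used above):

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin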
These environment variables need to be set in the bash shell for "hduser", as that is the user we will be using to start the Hadoop cluster as well as submit any jobs to run in the cluster.
NOTE: Step2 above needs to be executed on all nodes of a Multi-Node Hadoop cluster.

Step3
We are now going to create a passwordless "ssh" key for "hduser", as it is needed for the Hadoop master node to start the "slave" nodes. For stand-alone, the slave daemons run on "localhost", so we need this even for the stand-alone Hadoop configuration.
Step3a
The following command will create the public/private key pair; make sure you run it from a shell logged in as "hduser"

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Step3b
Stand-Alone
Use the following command to copy the public key created in "Step3a" above to the authorized keys file
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Now you can test that the password-less ssh login works by running the following command (you should not be asked for a password)
ssh localhost


Multi-Node
Run the following commands from the master node "hadoopmaster" to copy the public key of "hduser" to all the slave nodes ("hadoopslave1" and "hadoopslave2")
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@hadoopslave1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@hadoopslave2

Step4
Now we need to change the configuration files for the Hadoop cluster on all Hadoop nodes. I will provide the differences between the stand-alone and multi-node changes in each respective section.
I will start with multi-node and point out the differences.
Step4a
Edit or create “/usr/local/hadoop/etc/hadoop/hadoop-env.sh” file and make the following changes to it (NOTE: This change is needed on all nodes and is needed for both stand-alone as well as multi-node Hadoop cluster)
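Typically the change is to hard-code JAVA_HOME in this script, for example (assuming the default Ubuntu OpenJDK 8 path):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64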
Step4b
Edit or create “/usr/local/hadoop/etc/hadoop/core-site.xml” file and make the following changes
Multi-Node
This change should be done on all master/slave nodes
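For example (the fs.defaultFS property tells every node where the NameNode is listening; port 9000 is a common choice and an assumption here):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoopmaster:9000</value>
  </property>
</configuration>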
Standalone
Replace “hadoopmaster” with “localhost”

Step4c
Edit or create “/usr/local/hadoop/etc/hadoop/mapred-site.xml” file and make the following changes (NOTE: This change is needed on all nodes and is needed for both stand-alone as well as multi-node Hadoop cluster)
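For example, the key change is to tell MapReduce to run on YARN:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>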
Step4d
Now you need to create folders on each node where the NameNode and DataNode write to.
Multi-Node
On master, create the following directory structure with access rights to “hduser/hadoop” as shown below
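For example (the "/usr/local/hadoop_store" location is just an example path; any path works as long as it matches the "hdfs-site.xml" settings in Step4e):

sudo mkdir -p /usr/local/hadoop_store/hdfs/name
sudo mkdir -p /usr/local/hadoop_store/hdfs/data
sudo chown -R hduser:hadoop /usr/local/hadoop_store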
On all slaves which will only run the DataNode just create the folder for “data” with access rights to “hduser/hadoop” as shown below
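For example (using the same example path as on the master):

sudo mkdir -p /usr/local/hadoop_store/hdfs/data
sudo chown -R hduser:hadoop /usr/local/hadoop_store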


Standalone
For standalone create the “data” and “name” folders on the single VM.
Step4e
Edit or create “/usr/local/hadoop/etc/hadoop/hdfs-site.xml” file and make the following changes

Multi-Node
On master
NOTE: The replication factor is “3” as the master node is going to run the DataNode as well, so basically the master will act as a “master” as well as “slave” in the cluster. In production, you should not make the “master” node act as “slave”.
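For example (the directory paths must match the folders created in Step4d; the paths shown are the example ones from that step):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/data</value>
  </property>
</configuration>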
On all slaves
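For example (the slaves only run the DataNode, so only the data directory is configured):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/data</value>
  </property>
</configuration>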
Standalone
For standalone use the following configuration setting
NOTE: Replication is set to “1” for stand-alone
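For example (again, the directory paths are examples and must match the folders created in Step4d):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/data</value>
  </property>
</configuration>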
Step4f
Edit or create “/usr/local/hadoop/etc/hadoop/yarn-site.xml” file and make the following changes
Multi-Node
On Master as well as slave nodes
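For example (yarn.resourcemanager.hostname tells the NodeManagers where the ResourceManager runs, and mapreduce_shuffle is the standard auxiliary service needed for the shuffle phase):

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoopmaster</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>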

Standalone
For standalone use the following configuration setting
A few observations about standalone configuration
  • I am running my standalone Hadoop cluster on the same VM ("hadoop1") where my Spark and Kafka are running
  • The memory for the standalone VM (“hadoop1”) is 12GB when I have all three components (Hadoop, Kafka and Spark) running on the same VM.
Step4g
Multi-Node
On Master only, edit or create “/usr/local/hadoop/etc/hadoop/masters” file and include the following line. This file actually controls where the SecondaryNameNode runs in Hadoop cluster (not the name of master node as the file name might suggest).

hadoopmaster

Standalone
This file is not needed for standalone mode
Step4h
Multi-Node
On master only, edit or create “/usr/local/hadoop/etc/hadoop/slaves” file and include the following nodes as slaves (In the example I am using the master node to be a slave node that will run DataNode daemon as well, hence “hadoopmaster” is included in the list below)

hadoopmaster
hadoopslave1
hadoopslave2

Standalone
This file is not needed for standalone mode

That's it, the Hadoop cluster configuration is done; now it's time to start the various daemons.
  • To start the DataNode and NameNode use the following command

/usr/local/hadoop/sbin/start-dfs.sh

You will have to log in as “hduser” to run the above commands as the above script will use the passwordless “hduser” public key to ssh into other VM (slaves). This command runs from master node – “hadoopmaster”

  • To start the ResourceManager and NodeManager use the following command

/usr/local/hadoop/sbin/start-yarn.sh

You will have to log in as “hduser” to run the above commands as the above script will use the passwordless “hduser” public key to ssh into other VM (slaves). This command runs from master node – “hadoopmaster”

  • Now use “jps” command to check which daemons are running where
On Master

NOTE: Because I am using master as a slave – DataNode is running on master as well.

On slaves (hadoopslave1 here, same output for hadoopslave2)


Use Case – Apache Log Analytics

In my other blog (http://pmtechfusion.blogspot.com/2017/11/big-datadata-science-analytics-apache_20.html), in the "Use Case" section I describe how I plan to connect the three technologies for a real-world Use Case - analyzing the Apache access log. Here I plan to describe what the Spark job script looks like. I plan to use Python as the scripting language, but I suggest you use Scala as that is the native language for Spark and will provide better performance than Python. I chose Python as it is one of the preferred languages (along with R) of Data Scientists, so it's familiar territory.
I will split the Python script into smaller code snippets and provide comments for each snippet separately.
Snippet 1
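A minimal sketch of this snippet (the imports the rest of the script relies on):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SparkSession, Row
import re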

A few points about the above snippet
  • I am using the “pyspark” module to write Spark jobs in python
  • I am using the “pyspark.streaming.kafka” module to write real-time kafka streaming consumer for Apache Log.
Snippet 2
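A sketch of the parsing method (the regular expression covers the standard Apache access log fields; the field names returned are my own choices):

# Matches: ip, identd, user, [timestamp], "method endpoint protocol", status, size
APACHE_ACCESS_LOG_PATTERN = r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) ?(\S*)" (\d{3}) (\S+)'

def parse_apache_log_line(line):
    # Split one Apache access log line into a Row of named fields
    match = re.search(APACHE_ACCESS_LOG_PATTERN, line)
    if match is None:
        return None  # skip malformed lines
    return Row(ip=match.group(1),
               endpoint=match.group(6),
               response_code=int(match.group(8)))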
A few points about the above snippet
  • This method just uses a regular expression (the "re" module) to split the Apache access log messages into a manageable collection of fields in Python


Snippet 3
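This is essentially the lazily-instantiated singleton helper from the Spark Streaming documentation:

def getSparkSessionInstance(sparkConf):
    # Create the SparkSession once and reuse it for every micro-batch
    if "sparkSessionSingletonInstance" not in globals():
        globals()["sparkSessionSingletonInstance"] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()["sparkSessionSingletonInstance"]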
A few points about the above snippet
  • Given that SparkSession creation is a performance-intensive operation, the above function uses the singleton pattern to return the same SparkSession to the running Python script.

Snippet 4
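A sketch of the streaming setup (the batch interval and consumer group name are my example values; "hadoop1" is where ZooKeeper is running in this environment):

sc = SparkContext(appName="ApacheLogAnalytics")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Consumer group reading the "apacheLogEntries" Topic via ZooKeeper on hadoop1
kafkaStream = KafkaUtils.createStream(ssc, "hadoop1:2181",
                                      "spark-log-consumer-group",
                                      {"apacheLogEntries": 1})
lines = kafkaStream.map(lambda kv: kv[1])  # keep only the message value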
A few points about the above snippet
  • In the above script, I am using the KafkaUtils class methods to create a Kafka consumer group for reading real-time entries from the “apacheLogEntries” Topic that I created in “Kafka & Zookeeper installation steps - Step3b” section in this blog.
Snippet 5
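A sketch of the counting helper (the view and column names are my own):

def count_entries_per_ip(spark, row_rdd):
    # Register the parsed log Rows as a temporary view and count entries per IP
    log_df = spark.createDataFrame(row_rdd)
    log_df.createOrReplaceTempView("apache_logs")
    return spark.sql(
        "SELECT ip, count(*) AS total FROM apache_logs GROUP BY ip ORDER BY total DESC")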
A few points about the above snippet
  • This is a method that is used by the SparkSQL in Snippet 6 below to count the apache log entries.
Snippet 6
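A sketch of the per-RDD processing (the HDFS output path is an example; the snippet reuses the helpers from Snippets 2, 3 and 5):

def process(time, rdd):
    if rdd.isEmpty():
        return
    spark = getSparkSessionInstance(rdd.context.getConf())
    parsed = rdd.map(parse_apache_log_line).filter(lambda row: row is not None)
    if parsed.isEmpty():
        return
    counts_df = count_entries_per_ip(spark, parsed)
    # Save the per-IP counts back into Hadoop's HDFS in json format
    counts_df.write.mode("append").json("/sparkresults/apachelogcounts")

lines.foreachRDD(process)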
A few points about the above snippet
  • This is where the data is being processed by Spark to generate counts at a “per IP” level.
  • In Spark real-time streaming, the stream is referred to as a "DStream", and each "DStream" consists of multiple RDDs, which is why each RDD is processed via lines.foreachRDD(process)
  • The write call at the end of the snippet is just saving the Spark SQL results back into Hadoop's HDFS

A sample output of the above job is shown below (the results are saved in "json" format as mentioned above).


Snippet 7
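A sketch of the final lines (the checkpoint folder is the one discussed in the point below):

ssc.checkpoint("/mysoftware/mysparkcheckpointdir/")  # resolves to HDFS when running on the cluster
ssc.start()
ssc.awaitTermination()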
A few points about the above snippet
  • I am using a checkpoint to ensure that if the Spark job fails on one specific worker node, the checkpoint location will allow the job to resume from where it left off on another worker node. To achieve that, the checkpoint location "/mysoftware/mysparkcheckpointdir/" should be accessible from all Spark worker nodes; hence, in a cluster the above folder is stored in "hdfs"

Code Snippets 1 through 7 complete the Apache log analysis script. In order to send Apache log entries to the Kafka "apacheLogEntries" Topic, use the following script
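A sketch of that script (the access log path is an example; "kafka-console-producer.sh" ships with Kafka):

tail -f /var/log/apache2/access.log | /usr/local/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic apacheLogEntries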
A few points about the above line
  • I am using "kafka-console-producer.sh", a producer shell script that comes with Kafka, to send messages to the "apacheLogEntries" Topic. In the real world you will end up writing application scripts that read from the Apache log and send messages to the Kafka Topic.
  • The above script is executed from VM “hadoop1” as that is the VM that has Kafka binaries installed in it.

The above apache log Spark python script can be submitted to the Hadoop cluster using the following command
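A sketch of that command (the exact Kafka assembly jar name depends on the Spark and connector versions you downloaded):

/usr/local/spark/bin/spark-submit --master yarn --deploy-mode client --jars spark-streaming-kafka-0-8-assembly_2.11-2.2.0.jar apachelogstatssqlKafkaSaveFHadoop.py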
A few points about the above line
  • You submit this Spark job from “hadoop1” VM.
  • The “jar” file provided at the command prompt contains the “java” classes needed for Spark to talk with Kafka.
  • Same command is needed whether it’s a stand-alone cluster or multi-node cluster
  • “apachelogstatssqlKafkaSaveFHadoop.py” is the same python script defined in Snippet Code “1” through “7”


Conclusion

I hope this blog helps you get started with Apache Hadoop, Spark and Kafka for doing batch and real-time data analytics.