Monday, November 20, 2017

Big Data/Data Science (Analytics) – Apache Hadoop, Kafka and Spark.

Introduction
In this article I plan to provide an overview of the following technologies and a use case where the three can be used together to do data analytics as well as predictive analytics through machine learning.
  • Apache Hadoop
  • Apache Kafka
  • Apache Spark
Architectural Overview

The above diagram provides an overview of how the various technologies work together to enable big data analysis and predictive analytics through machine learning.
Apache Kafka
Kafka is used here to provide data streaming. (NOTE: For AWS fans, Amazon Kinesis can be used for data streaming as well.)
Apache Spark
Apache Spark is used to perform data analysis on the data stream (DStream) generated in real time by Apache Kafka. More specifically it’s the Spark Streaming module that’s used. (NOTE: For AWS fans you can deploy an Elastic MapReduce (EMR) cluster in AWS with Spark libraries deployed in that cluster and then use that cluster instead of Apache Spark and Apache Hadoop)
Apache Hadoop
The Hadoop ecosystem serves the following purposes here
  • It’s used by Kafka consumers to process big data
  • It’s used by Spark to store the output results
  • It’s used as a Cluster Manager (the YARN piece of Hadoop)

Apache Hadoop
Apache Hadoop is an open-source distributed computation platform for processing big data and performing data analytics (its design builds on Google's published white papers). The architectural diagram above shows the main modules of Apache Hadoop; a brief description follows
  • HDFS - Distributed file system to store big data across commodity server infrastructure
  • YARN - the resource management engine responsible for coordinating cluster resources; the acronym "Yet Another Resource Negotiator" is pretty self-explanatory
  • MAPREDUCE – It’s the de-facto parallel, distributed programming algorithm used on a Hadoop cluster.
(NOTE: I am excluding quite a few modules to simplify the article's intent.)

HDFS
As mentioned before HDFS is the Hadoop Distributed File System. The critical components of HDFS are
  • NameNode
NameNode stores the file/directory system structure of the distributed file system

  • DataNode
DataNodes are servers that store the actual big data blocks. Failover capability is built into DataNodes because the data blocks are replicated across multiple DataNodes based on the replication factor. (NOTE: For production you should typically have a replication factor of 3 or 5.)

  • ZooKeeper
ZooKeeper is used to manage failover of resources. In the above example, the NameNode has an active-passive configuration, with ZooKeeper acting as the gatekeeper that listens to heartbeats and fails over to the passive NameNode if the active NameNode dies. (NOTE: Even though the diagram shows one box for ZooKeeper, in production ZooKeeper daemons typically run on multiple nodes to provide High Availability (HA).)
YARN
YARN is the Resource Manager for Hadoop. Prior to YARN, Apache Hadoop had a different architecture for resource management that was not as scalable and was tightly integrated with the MapReduce programming algorithm. Starting with Apache Hadoop 2.X, Hadoop separated the resource management responsibility from the MapReduce programming algorithm. By doing this separation, Hadoop opened the possibility for other projects to use YARN and HDFS without using the MapReduce programming algorithm.
This allowed other technology platforms like Apache Spark to come into the picture and run Spark jobs on Hadoop systems, using Hadoop's Resource Manager (YARN) as Spark's Cluster Manager and the underlying HDFS file system for big data.
Apache Spark jobs are faster than their MapReduce equivalents, and one of the reasons for that is the use of Directed Acyclic Graphs (DAGs).

MAPREDUCE
It’s the de-facto parallel, distributed programming algorithm used on a Hadoop cluster. Prior to Hadoop 2.X this programming algorithm was integrated into the Resource Manager and was collectively referred to as MRv1 (MapReduce version 1). Starting with Hadoop 2.X, the algorithm part and the resource management part were separated into MapReduce and YARN respectively, these are collectively referred to as MRv2 (MapReduce version 2)

Apache Kafka
Apache Kafka is a distributed streaming platform used for building real-time data pipelines and streaming apps. The architectural diagram above shows the main modules of Apache Kafka; a brief description follows
Kafka Brokers
This is the component that consumers and producers communicate with via APIs to send and receive messages. You will typically have multiple Kafka brokers within a cluster for high availability.
Zookeeper
In a Kafka cluster, ZooKeeper is needed among other things for service discovery and leader election of Kafka brokers. It is basically the central repository that the various distributed components of Kafka use to communicate with each other and share metadata about the Kafka cluster.
Producers
Kafka Producers use Kafka Producer API to send messages to topics.
Consumers
Kafka Consumers use Kafka Consumer API to receive messages from topics. In this article Apache Spark is one such Consumer.
Topic
A Topic is a named set of records (messages). Kafka stores Topics in logs, and a Topic log is broken down into multiple partitions. Each Topic partition is replicated across Kafka brokers to provide data redundancy in case one Kafka broker fails. Topics provide a de-coupled publisher-subscriber style messaging model for sharing data.
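To make the Producer/Consumer model a little more concrete, here is a minimal sketch using the third-party kafka-python client (the broker address, Topic name and group id below are example values, not ones used later in this article):

from kafka import KafkaProducer, KafkaConsumer

# Producer side: publish one record to a Topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("webLogs", b'127.0.0.1 - - [20/Nov/2017:10:00:00 +0000] "GET / HTTP/1.1" 200 1043')
producer.flush()

# Consumer side: read records from the same Topic as part of a consumer group
consumer = KafkaConsumer("webLogs", bootstrap_servers="localhost:9092", group_id="log-readers")
for record in consumer:
    print(record.value)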
Apache Spark
Apache Spark provides an engine for large-scale data processing. It provides multiple modules, each meant for a different data processing purpose; a brief description of the modules follows
Spark SQL
This module allows you to run SQL-style queries on data that is not typically stored in an RDBMS. SQL is a well-accepted language for data analysis, and Spark SQL minimizes the learning curve for data scientists by giving them a familiar query language; the alternative would be to write the same logic via map and reduce functions. Behind the scenes Spark SQL still uses map and reduce functions, it just abstracts those details from the users writing Spark SQL.
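For example, a minimal PySpark sketch of the idea (the file path and field names below are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("SparkSQLExample").getOrCreate()
# Load semi-structured data that never lived in an RDBMS and query it with plain SQL
logs = spark.read.json("hdfs:///data/weblogs.json")
logs.createOrReplaceTempView("weblogs")
spark.sql("SELECT ip, count(*) AS hits FROM weblogs GROUP BY ip ORDER BY hits DESC").show()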
Spark Streaming
This module allows real-time analysis of streaming data. It provides libraries and APIs to stream data from Apache Kafka, Amazon Kinesis, direct TCP ports and many other streaming sources. In the "Use Case" section of this article, when I mention Spark I really mean the Spark Streaming module.
MLlib
This is the Machine Learning module of Apache Spark, and in the "Use Case" section I mention where this module can be used in a real-world application.
GraphX
This is the graph computational engine of Spark that allows users to build and transform graph structured data at scale using Spark GraphX API.

The architectural diagram above shows the main components of the Apache Spark ecosystem; a brief description follows
Driver Program (Spark Context)
This is the client program that interacts with Spark API; Scala is the native functional programming language platform for Apache Spark. Apache Spark does provide similar API interfaces in Java and Python. The Driver program is what sets up the Spark Context that is used to interact with the Apache Spark cluster.
Cluster Manager
This is the component of Apache Spark that manages interaction with the Worker Nodes and distributes Spark jobs to those Worker Nodes for execution. Apache Spark supports the following three types of cluster manager configurations (NOTE: there is also experimental support for Kubernetes (K8S))
  • Standalone – this runs the cluster manager as a part of Apache Spark cluster
  • YARN – this uses Hadoop 2 resource manager (YARN) for cluster management
  • Apache Mesos – this is a more generic resource manager similar to YARN but is not restricted to managing clusters in Hadoop only.
Worker Node
Worker Nodes are responsible for providing system resource details to the Cluster Manager so it can schedule jobs at the appropriate Worker Nodes based on resource availability. Each application gets an "executor" on each Worker Node, and the executor is what runs multiple tasks for the application; it is also responsible for providing in-memory caching for faster processing. The Spark Context started by the Driver Program communicates directly, and in parallel, with the executors on each Worker Node for job execution status. The Driver Program also splits the job into tasks that need to be executed in parallel on each Worker Node by the executor.
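To make the Driver Program/Cluster Manager relationship concrete, here is a minimal PySpark driver sketch (the application name is arbitrary, and "yarn" can be swapped for "local[*]", a Mesos URL or a standalone "spark://" URL):

from pyspark import SparkConf, SparkContext

# The driver creates the Spark Context; the master URL decides which cluster
# manager (standalone, YARN, Mesos or local threads) schedules the work.
conf = SparkConf().setAppName("MyDriverProgram").setMaster("yarn")
sc = SparkContext(conf=conf)
print(sc.parallelize(range(1000)).sum())  # a trivial job split into tasks run by the executors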
Use Case
Now that you have a quick overview of the technologies involved, let's consider where they can be used together to do data analytics as well as predictive analytics. The figure below shows one such use case.
A few observations about the above use case
  • Point 1: is the entry point to transfer web server access/error logs to Hadoop’s HDFS.
  • Point 2: is where the client programs read data from HDFS and submits the same to Apache Kafka for streaming.
  • Point 3: is where the client program uses the Producer API of Apache Kafka to send “messages” to Topic for consumption.
  • Point 4: is where Kafka consumers (Apache Spark in this example) uses the Consumer API of Apache Kafka to consume messages from the Topic - messages contain web log entries that need to be analyzed by Apache Spark
  • Point 5: Apache Spark can analyse the data in real-time by submitting Spark jobs in Hadoop cluster using Hadoop’s resource manager YARN.
  • Point 6: Apache Spark can then save the results in Hadoop’s HDFS file system for further actions
  • Point 7: You can have Alert and Monitoring programs that take actions based on Apache Spark's real-time data analytical results. A few of the possible actions these programs can take are listed below
    • The results can store which IP addresses are generating more web traffic with "error" results. The programs can use the "per IP" count of "errors" as a metric to determine which IPs can be blacklisted, as they could be potential hackers running web scans to exploit website vulnerabilities
    • The results can store the top 10 web pages visited by the user community, and you can accordingly re-align the website navigation to encourage users to stay longer on your website, thereby increasing the likelihood of them purchasing more items
    • Using the MLlib module of Apache Spark you can apply machine learning algorithms to generate results that predict which items a user is most likely to buy based on the web pages the user visits on your site; you can accordingly change the home page for that user the next time they visit your website (predictive analytics is the point I am trying to get across here).
  • Point 8: Visual BI tools like Tableau and Power BI have connectors that allow you to connect directly to Hadoop. You can perform visual data analysis based on the results stored by Apache Spark in HDFS and look for patterns from which you can learn and adapt.
As you can see, the Use Case can be extrapolated to many real-world applications and the possibilities are limitless (I hope you are getting excited to use this concept at your workplace).
A few takeaways
  • It’s possible to use Kafka and Spark alone and not use Hadoop. Hadoop is only needed if you have big data to handle (petabytes not terabytes of data)
  • I am using Kafka for streaming but if you want you can use other streaming technologies like – Apache Storm, Amazon Kinesis.
  • If you do not have a need for real-time data analytics then you do not need Apache Kafka. You can just use Apache Spark's other libraries, like Spark SQL and MLlib, to do data analysis and perform predictive analytics
  • It’s possible to use the output of Apache Spark’s analysis as data feed for Visual Analytical tools like – Microsoft Power BI and Tableau.
  • Visual Analytical tools like Power BI and Tableau will be something that you will end up using in your data analysis as they provide pretty fancy Dashboards and Visual Graphs that help you and the management executives visually understand the data trends quickly
  • Try to use the equivalent technologies provided by Cloud providers instead of building the clusters from the ground up on bare-metal servers or Virtual Machines. The only downside is that you cannot control upgrades of these technologies at your own pace; you may be restricted to older versions until the Cloud provider gives you an upgrade path. As an example, the following combination is a good alternative to Apache Kafka, Apache Spark and Apache Hadoop if Amazon AWS is your cloud provider –

Amazon Kinesis and Amazon EMR cluster with Spark libraries enabled.

Use this approach if you are using AWS, instead of building your own clusters for Hadoop, Kafka and Spark via the EC2/bare-metal server route.

NOTE: As mentioned before, the flip side of using a Cloud provider's equivalent technologies is getting married to the technology versions the provider offers, so pick your poison. The benefit is that infrastructure management work is simplified and standing up clusters takes just a few clicks in the AWS management console (all right, I am not trying to be a salesperson for Amazon here; you will still need a DevOps team).

Conclusion
I hope this article inspires you to do data analytics with your real-world use cases. Please refer to the following blog - http://pmtechfusion.blogspot.com/2017/11/big-datadata-science-analytics-apache.html - if you are interested in detailed installation steps for the above technologies and some hands-on work. I have also provided a Python script for the "Use Case" described in this blog and explained how to execute it.

Big Data/Data Science (Analytics) – Apache Hadoop, Kafka and Spark Installation

Introduction
This is a follow-up to my overview article on using Hadoop, Kafka and Spark (http://pmtechfusion.blogspot.com/2017/11/big-datadata-science-analytics-apache_20.html). I plan to provide detailed steps to first install Apache Hadoop on one Virtual Machine (VM) and then the specifics of installing it as a true distributed multi-node cluster on multiple VM. I will also provide steps to install Spark and Kafka, and complete this blog with an example of how to use the three technologies in your real-world application.
Architectural Overview
The above diagram provides an overview of where each component is deployed across the four VM.
Development Environment
  1. I will have one VM where I will install Kafka, Spark and ZooKeeper
  2. I will have three VM where I will install Apache Hadoop, one VM will act as the master node and the other two VM will act as data nodes
  3. I have written a simple Python script that runs as a Spark job using YARN as a cluster manager and Apache Hadoop as the distributed file system.
  4. All VM run Ubuntu 16.04
  5. Apache Hadoop version is 2.8.1
  6. Apache Kafka version is 1.0.0 (I am using Scala 2.11 binary distribution)
  7. Apache Spark version is 2.2.0
  8. Apache Zookeeper version is 3.4.8-1
  9. The VM on which Apache Kafka, Apache Spark and Apache ZooKeeper are installed is named "hadoop1", the VM that acts as the master node for Hadoop is named "hadoopmaster", and the two VM that run as slaves for the Hadoop cluster are named "hadoopslave1" and "hadoopslave2".
  10. I am mentioning the versions of various software modules and Operating System (OS) I have used so if you want to create your own working environment you know which versions worked for me. It’s possible that you may end up using a different version by the time you start building your own development environment.

Common software & network installation steps
This section details installation steps for common software (like Java) needed by all components.
Step1
Let’s install Java using the following command (NOTE: You need to install Java on all four VM)
sudo apt install openjdk-8-jdk-headless

Step2
As we need JAVA_HOME to be set as an environment variable and be part of the bash environment, make sure you add it using the following steps -
Step2a
Edit the “/etc/environment” file and add the following lines
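For example, with the default Ubuntu 16.04 OpenJDK 8 package the entry looks like the following (adjust the path if your JDK is installed elsewhere):

JAVA_HOME="/usr/lib/jvm/java-8-openjdk-amd64"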


Step2b
Edit the “~/.bashrc” file and add the following lines
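For example (again assuming the default OpenJDK 8 path):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$PATH:$JAVA_HOME/bin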
NOTE: The above file will need to be edited for different user accounts that you will have to create for each component. The different accounts are mentioned below
  • For Hadoop I am creating a user - hduser and group - hadoop
  • For Kafka I am creating a user - kafka and group - kafka
  • For Spark I am creating a user - testuser1 and group - testuser1

Make sure the "~/.bashrc" files for users hduser, kafka and testuser1 have the above Java environment variable included.
Step3
Let's install "ssh" on all VM, as it is needed to start the Hadoop daemons across a multi-node Hadoop cluster. You also need "ssh" for remote administration of all the components related to Hadoop, Kafka and Spark.
sudo apt install ssh

Step4
In order to allow all the VM to communicate with each other over the network, add the following host entries (in "/etc/hosts") on all four VM.
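For example, the "/etc/hosts" entries would look like the following (the IP addresses below are placeholders; substitute the actual IP addresses of your four VM):

192.168.1.101   hadoop1
192.168.1.102   hadoopmaster
192.168.1.103   hadoopslave1
192.168.1.104   hadoopslave2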

Kafka & Zookeeper installation steps
Kafka needs Zookeeper for distributed system co-ordination. The installation steps below show how to install the same on one VM. You should know that both Kafka as well as ZooKeeper can be installed on multiple VM as separate distributed systems for High Availability (HA) and fail-over.  I am installing them along with Spark on one VM as my laptop resource limitation will not allow me to create infinite VM (I wish).
Step1
Download the latest Kafka binary from the following URL - http://kafka.apache.org/downloads.html
The version I downloaded for this blog is kafka_2.11-1.0.0.tgz (Kafka 1.0.0 built for Scala 2.11).


Step1a
Unzip the downloaded file into the “/usr/local/kafka” folder with the following steps
$ cd /usr/local
$ sudo tar xzf kafka_2.11-1.0.0.tgz
$ sudo mv kafka_2.11-1.0.0 kafka
Step1b
Let’s create a user – “kafka” and group – “kafka” using the following commands -
sudo addgroup kafka
sudo adduser --ingroup kafka kafka

Step1c
Let's grant the "kafka" user/group access to the "/usr/local/kafka" folder using the following command
sudo chown -R kafka:kafka "/usr/local/kafka"

Step2
Let’s install ZooKeeper using the following command
sudo apt install zookeeperd

The above command will install and run ZooKeeper on port “2181”
Step3
Now we need to run commands to start Kafka and create a Topic
Step3a
Run the following commands to start and stop Kafka server
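For example (run these as the "kafka" user; the -daemon flag runs the broker in the background):

/usr/local/kafka/bin/kafka-server-start.sh -daemon /usr/local/kafka/config/server.properties
/usr/local/kafka/bin/kafka-server-stop.sh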
Step3b
Run the following command to create a Topic named "apacheLogEntries"; this Topic will store the Apache web logs for real-time data analytics
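For example, a single-partition Topic with a replication factor of 1 (ZooKeeper is listening on port 2181 as installed in Step2):

/usr/local/kafka/bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic apacheLogEntries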
At this point the Kafka and ZooKeeper installation is complete. In the "Use Case" section below I will provide details of how to send Apache access log entries to the Topic created above, and how to make Spark consume the "apacheLogEntries" Topic to do real-time data analytics.
Apache Spark installation steps
Step1
Download the latest Spark binary from the following URL - http://spark.apache.org/downloads.html
The version I downloaded for this blog is spark-2.2.0-bin-hadoop2.7.tgz (Spark 2.2.0, pre-built for Hadoop 2.7).
Step1a
Unzip the downloaded file into the “/usr/local/spark” folder with the following steps
$ cd /usr/local
$ sudo tar xzf spark-2.2.0-bin-hadoop2.7.tgz
$ sudo mv spark-2.2.0-bin-hadoop2.7 spark
Step1b
Let's create a user - "spark" and group - "spark" using the following commands -
sudo addgroup spark
sudo adduser --ingroup spark spark

Step1c
Let's grant the "spark" user/group access to the "/usr/local/spark" folder using the following command
sudo chown -R spark:spark "/usr/local/spark"

Step2
Step2a
Edit the “/etc/environment” file and add the following lines
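For example (assuming the "/usr/local/spark" install location used above):

SPARK_HOME="/usr/local/spark"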

Step2b
Edit the “~/.bashrc” file and add the following lines
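For example:

export SPARK_HOME=/usr/local/spark
export PATH=$PATH:$SPARK_HOME/bin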

The above environment variable needs to be set in “~/.bashrc” file for “spark” user as well as “hduser”, the latter is needed because when you use YARN and Hadoop for running Spark jobs you will have to submit the same to the cluster with “hduser” credentials as the Hadoop cluster is set to accept jobs from “hduser” account only.

Step3
Copy the "/usr/local/spark/conf/spark-env.sh.template" file to "/usr/local/spark/conf/spark-env.sh" and add the following line to the "/usr/local/spark/conf/spark-env.sh" file
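The line points Spark at the Hadoop configuration folder, for example:

export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop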
The above line can be commented out if you plan to run Spark as a standalone cluster where you do not want to run the spark job on Hadoop using YARN as its cluster manager
NOTE: The above line should point to the "hadoop" configuration folder. This is the same configuration folder that all Hadoop nodes read from when they start up, so any Hadoop configuration changes done in this folder are also visible to Spark jobs. This is how Spark knows about the Hadoop cluster configuration and where YARN is running.
That's it: Spark is now ready to submit jobs to Hadoop's YARN.
If you just want to use Spark for your data analytics, then "Step3" above should not be done; by default Spark is configured to run in standalone mode using its internal Cluster Manager for resource negotiation. In fact, that is how you should start using Spark, to get deep hands-on experience with the various Spark modules before using Hadoop's YARN as the Cluster Manager. (Standalone mode helps with troubleshooting your Spark script errors without relying on the goliath that is the Hadoop ecosystem.)

Apache Hadoop installation steps
Let’s start with installing stand-alone mode of Hadoop on one VM and then I will point out places where you need to deviate from stand-alone steps to create a Multi-Node Hadoop cluster.
In each step mentioned below, I will have two sections, one for single-node Hadoop cluster and another for multi-node. Steps that are common for both will not have these two sub-sections mentioned.

Step1
Download the latest Hadoop binary from the Apache Hadoop releases page - http://hadoop.apache.org/releases.html

The version I downloaded for this blog is hadoop-2.8.1.tar.gz
Step1a
Unzip the downloaded file into the “/usr/local/hadoop” folder with the following steps
$ cd /usr/local
$ sudo tar xzf hadoop-2.8.1.tar.gz
$ sudo mv hadoop-2.8.1 hadoop
Step1b
Let's create a user - "hduser" and group - "hadoop" using the following commands -
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

Step1c
Let's grant the "hduser" user and "hadoop" group access to the "/usr/local/hadoop" folder using the following command
sudo chown -R hduser:hadoop "/usr/local/hadoop"

NOTE: Step1, Step1a through Step1c above need to be executed on all nodes of a Multi-Node Hadoop cluster.
Step2
Edit the “~/.bashrc” file and add the following lines
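For example (assuming the "/usr/local/hadoop" install location used above):

export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin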
These environment variables need to be set in the bash shell for "hduser", as that is the user we will be using to start the Hadoop cluster as well as submit any jobs to run in the cluster.
NOTE: Step2 above needs to be executed on all nodes of a Multi-Node Hadoop cluster.

Step3
We are now going to create a passwordless "ssh" key for "hduser", as it is needed for the Hadoop master node to start the "slave" nodes. For stand-alone, the slave daemons run on "localhost", so we need this even for the stand-alone Hadoop configuration.
Step3a
The following command will create the public/private key pair; make sure you run it from a shell logged in as "hduser"

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

Step3b
Stand-Alone
Use the following command to copy the public key created in "Step3a" above to the authorized keys file
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Now you can test that the password-less ssh login works by running the following command (you should not be asked for a password)
ssh localhost


Multi-Node
Run the following commands from the master node "hadoopmaster" to copy the public key of "hduser" to all the slave nodes ("hadoopslave1" and "hadoopslave2")
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@hadoopslave1
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@hadoopslave2

Step4
Now we need to change the configuration files for the Hadoop cluster on all Hadoop nodes. I will provide the differences between the stand-alone and multi-node changes in each respective section.
I will start with multi-node and point out the differences.
Step4a
Edit or create “/usr/local/hadoop/etc/hadoop/hadoop-env.sh” file and make the following changes to it (NOTE: This change is needed on all nodes and is needed for both stand-alone as well as multi-node Hadoop cluster)
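Typically the change is to hard-code JAVA_HOME in this script, for example (assuming the default Ubuntu OpenJDK 8 path):

export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64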
Step4b
Edit or create “/usr/local/hadoop/etc/hadoop/core-site.xml” file and make the following changes
Multi-Node
This change should be done on all master/slave nodes
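For example (the fs.defaultFS property tells every node where the NameNode is listening; port 9000 is a common choice and an assumption here):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://hadoopmaster:9000</value>
  </property>
</configuration>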
Standalone
Replace “hadoopmaster” with “localhost”

Step4c
Edit or create “/usr/local/hadoop/etc/hadoop/mapred-site.xml” file and make the following changes (NOTE: This change is needed on all nodes and is needed for both stand-alone as well as multi-node Hadoop cluster)
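For example, the key change is to tell MapReduce to run on YARN:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>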
Step4d
Now you need to create folders on each node where the NameNode and DataNode write to.
Multi-Node
On master, create the following directory structure with access rights to “hduser/hadoop” as shown below
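For example (the "/usr/local/hadoop_store" location is just an example path; any path works as long as it matches the "hdfs-site.xml" settings in Step4e):

sudo mkdir -p /usr/local/hadoop_store/hdfs/name
sudo mkdir -p /usr/local/hadoop_store/hdfs/data
sudo chown -R hduser:hadoop /usr/local/hadoop_store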
On all slaves which will only run the DataNode just create the folder for “data” with access rights to “hduser/hadoop” as shown below
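For example (using the same example path as on the master):

sudo mkdir -p /usr/local/hadoop_store/hdfs/data
sudo chown -R hduser:hadoop /usr/local/hadoop_store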


Standalone
For standalone create the “data” and “name” folders on the single VM.
Step4e
Edit or create “/usr/local/hadoop/etc/hadoop/hdfs-site.xml” file and make the following changes

Multi-Node
On master
NOTE: The replication factor is “3” as the master node is going to run the DataNode as well, so basically the master will act as a “master” as well as “slave” in the cluster. In production, you should not make the “master” node act as “slave”.
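For example (the directory paths must match the folders created in Step4d; the paths shown are the example ones from that step):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/data</value>
  </property>
</configuration>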
On all slaves
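For example (the slaves only run the DataNode, so only the data directory is configured):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/data</value>
  </property>
</configuration>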
Standalone
For standalone use the following configuration setting
NOTE: Replication is set to “1” for stand-alone
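For example (again, the directory paths are examples and must match the folders created in Step4d):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop_store/hdfs/data</value>
  </property>
</configuration>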
Step4f
Edit or create “/usr/local/hadoop/etc/hadoop/yarn-site.xml” file and make the following changes
Multi-Node
On Master as well as slave nodes
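For example (yarn.resourcemanager.hostname tells the NodeManagers where the ResourceManager runs, and mapreduce_shuffle is the standard auxiliary service needed for the shuffle phase):

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoopmaster</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>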

Standalone
For standalone use the following configuration setting
A few observations about standalone configuration
  • I am running my standalone Hadoop cluster on the same VM ("hadoop1") where my Spark and Kafka are running
  • The memory for the standalone VM (“hadoop1”) is 12GB when I have all three components (Hadoop, Kafka and Spark) running on the same VM.
Step4g
Multi-Node
On Master only, edit or create “/usr/local/hadoop/etc/hadoop/masters” file and include the following line. This file actually controls where the SecondaryNameNode runs in Hadoop cluster (not the name of master node as the file name might suggest).

hadoopmaster

Standalone
This file is not needed for standalone mode
Step4h
Multi-Node
On master only, edit or create “/usr/local/hadoop/etc/hadoop/slaves” file and include the following nodes as slaves (In the example I am using the master node to be a slave node that will run DataNode daemon as well, hence “hadoopmaster” is included in the list below)

hadoopmaster
hadoopslave1
hadoopslave2

Standalone
This file is not needed for standalone mode

That's it, the Hadoop cluster configuration is done; now it's time to start the various daemons.
  • To start the DataNode and NameNode use the following command

/usr/local/hadoop/sbin/start-dfs.sh

You will have to log in as “hduser” to run the above commands as the above script will use the passwordless “hduser” public key to ssh into other VM (slaves). This command runs from master node – “hadoopmaster”

  • To start the ResourceManager and NodeManager use the following command

/usr/local/hadoop/sbin/start-yarn.sh

You will have to log in as “hduser” to run the above commands as the above script will use the passwordless “hduser” public key to ssh into other VM (slaves). This command runs from master node – “hadoopmaster”

  • Now use “jps” command to check which daemons are running where
On Master

NOTE: Because I am using master as a slave – DataNode is running on master as well.

On slaves (hadoopslave1 here, same output for hadoopslave2)


Use Case – Apache Log Analytics

In my other blog (http://pmtechfusion.blogspot.com/2017/11/big-datadata-science-analytics-apache_20.html), in the "Use Case" section I describe how I plan to connect the three technologies for a real-world Use Case - analyzing the Apache access log. Here I plan to describe what the Spark job script looks like. I plan to use Python as the scripting language, but I suggest you use Scala as that is the native language for Spark and will provide better performance than Python. I chose Python as it is one of the preferred languages (along with R) of Data Scientists, so it's familiar territory.
I will split the Python script into smaller code snippets and provide comments for each snippet separately.
Snippet 1
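A minimal sketch of this snippet (the imports the rest of the script relies on):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
from pyspark.sql import SparkSession, Row
import re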

A few points about the above snippet
  • I am using the “pyspark” module to write Spark jobs in python
  • I am using the “pyspark.streaming.kafka” module to write real-time kafka streaming consumer for Apache Log.
Snippet 2
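A sketch of the parsing method (the regular expression covers the standard Apache access log fields; the field names returned are my own choices):

# Matches: ip, identd, user, [timestamp], "method endpoint protocol", status, size
APACHE_ACCESS_LOG_PATTERN = r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) ?(\S*)" (\d{3}) (\S+)'

def parse_apache_log_line(line):
    # Split one Apache access log line into a Row of named fields
    match = re.search(APACHE_ACCESS_LOG_PATTERN, line)
    if match is None:
        return None  # skip malformed lines
    return Row(ip=match.group(1),
               endpoint=match.group(6),
               response_code=int(match.group(8)))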
A few points about the above snippet
  • This method just uses a regular expression (the "re" module) to split the Apache access log messages into a manageable collection of fields in Python


Snippet 3
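This is essentially the lazily-instantiated singleton helper from the Spark Streaming documentation:

def getSparkSessionInstance(sparkConf):
    # Create the SparkSession once and reuse it for every micro-batch
    if "sparkSessionSingletonInstance" not in globals():
        globals()["sparkSessionSingletonInstance"] = SparkSession \
            .builder \
            .config(conf=sparkConf) \
            .getOrCreate()
    return globals()["sparkSessionSingletonInstance"]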
A few points about the above snippet
  • Given that SparkSession creation is a performance-intensive operation, the above function uses the singleton pattern to return the same SparkSession to the running Python script.

Snippet 4
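A sketch of the streaming setup (the batch interval and consumer group name are my example values; "hadoop1" is where ZooKeeper is running in this environment):

sc = SparkContext(appName="ApacheLogAnalytics")
ssc = StreamingContext(sc, 10)  # 10-second micro-batches

# Consumer group reading the "apacheLogEntries" Topic via ZooKeeper on hadoop1
kafkaStream = KafkaUtils.createStream(ssc, "hadoop1:2181",
                                      "spark-log-consumer-group",
                                      {"apacheLogEntries": 1})
lines = kafkaStream.map(lambda kv: kv[1])  # keep only the message value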
A few points about the above snippet
  • In the above script, I am using the KafkaUtils class methods to create a Kafka consumer group for reading real-time entries from the “apacheLogEntries” Topic that I created in “Kafka & Zookeeper installation steps - Step3b” section in this blog.
Snippet 5
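A sketch of the counting helper (the view and column names are my own):

def count_entries_per_ip(spark, row_rdd):
    # Register the parsed log Rows as a temporary view and count entries per IP
    log_df = spark.createDataFrame(row_rdd)
    log_df.createOrReplaceTempView("apache_logs")
    return spark.sql(
        "SELECT ip, count(*) AS total FROM apache_logs GROUP BY ip ORDER BY total DESC")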
A few points about the above snippet
  • This is a method that is used by the SparkSQL in Snippet 6 below to count the apache log entries.
Snippet 6
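A sketch of the per-RDD processing (the HDFS output path is an example; the snippet reuses the helpers from Snippets 2, 3 and 5):

def process(time, rdd):
    if rdd.isEmpty():
        return
    spark = getSparkSessionInstance(rdd.context.getConf())
    parsed = rdd.map(parse_apache_log_line).filter(lambda row: row is not None)
    if parsed.isEmpty():
        return
    counts_df = count_entries_per_ip(spark, parsed)
    # Save the per-IP counts back into Hadoop's HDFS in json format
    counts_df.write.mode("append").json("/sparkresults/apachelogcounts")

lines.foreachRDD(process)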
A few points about the above snippet
  • This is where the data is being processed by Spark to generate counts at a “per IP” level.
  • In Spark real-time streaming, the stream is referred to as a "DStream", and each "DStream" consists of multiple RDDs, which is why each RDD is processed via lines.foreachRDD(process)
  • The write call at the end of the snippet is just saving the Spark SQL results back into Hadoop's HDFS

A sample output of the above job is shown below (the results are saved in "json" format as mentioned above).


Snippet 7
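A sketch of the final lines (the checkpoint folder is the one discussed in the point below):

ssc.checkpoint("/mysoftware/mysparkcheckpointdir/")  # resolves to HDFS when running on the cluster
ssc.start()
ssc.awaitTermination()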
A few points about the above snippet
  • I am using a checkpoint to ensure that if the Spark job fails on one specific worker node, the checkpoint location will allow the job to resume from where it left off on another worker node. To achieve that, the checkpoint location "/mysoftware/mysparkcheckpointdir/" should be accessible from all Spark worker nodes; hence, in a cluster the above folder is stored in "hdfs"

Code Snippets 1 through 7 complete the Apache log analysis script. In order to send Apache log entries to the Kafka "apacheLogEntries" Topic, use the following script
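A sketch of that script (the access log path is an example; "kafka-console-producer.sh" ships with Kafka):

tail -f /var/log/apache2/access.log | /usr/local/kafka/bin/kafka-console-producer.sh --broker-list localhost:9092 --topic apacheLogEntries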
A few points about the above line
  • I am using "kafka-console-producer.sh", a producer shell script that comes with Kafka, to send messages to the "apacheLogEntries" Topic. In the real world you will end up writing application scripts that read from the Apache log and send messages to the Kafka Topic.
  • The above script is executed from VM “hadoop1” as that is the VM that has Kafka binaries installed in it.

The above apache log Spark python script can be submitted to the Hadoop cluster using the following command
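A sketch of that command (the exact Kafka assembly jar name depends on the Spark and connector versions you downloaded):

/usr/local/spark/bin/spark-submit --master yarn --deploy-mode client --jars spark-streaming-kafka-0-8-assembly_2.11-2.2.0.jar apachelogstatssqlKafkaSaveFHadoop.py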
A few points about the above line
  • You submit this Spark job from “hadoop1” VM.
  • The “jar” file provided at the command prompt contains the “java” classes needed for Spark to talk with Kafka.
  • Same command is needed whether it’s a stand-alone cluster or multi-node cluster
  • “apachelogstatssqlKafkaSaveFHadoop.py” is the same python script defined in Snippet Code “1” through “7”


Conclusion

I hope this blog helps you get started with Apache Hadoop, Spark and Kafka for doing batch and real-time data analytics.