My "todo" List

I am planning to write blogs at some point in the future on the following topics:

Spring Framework, Hibernate, ADO.NET Entity Framework, WPF, SharePoint, WCF, What's new in Java EE 6, What's new in Oracle 11g, TFS, FileNet, OnBase, Lombardi BPMS, Microsoft Solutions Framework (MSF), Agile development, RUP ... the list goes on


I am currently working on the following blog:

Rational Unified Process (RUP) and Rational Method Composer (RMC)

Saturday, April 21, 2018

Internet of Things (IoT), AWS & Raspberry Pi3

Introduction
In this article I describe how you can use a Raspberry Pi3 as a Single Board Computer (SBC) to read temperature and humidity through a DHT11 sensor and use AWS IoT as the MQTT message broker to send the data to AWS for further analysis.
Architectural Overview
In the architectural diagram above, the arrows show the flow of events and the numbers indicate the order in which these events occur.
Point 1:
This is the Raspberry Pi3 circuit that generates the sensor data. I did not use a microcontroller here in order to keep things simple.
Point 2:
AWS IoT is used as a message broker to collect device data. Although I am using AWS IoT here, other cloud providers like Azure's IoT Hub or Google's Cloud IoT would work as well. If you do not want to use a cloud provider and prefer a self-hosted MQTT message broker, you can use Eclipse Mosquitto. All of these providers support MQTT, which is the protocol I plan to use with the Raspberry Pi3. Cloud providers also support other protocols, but MQTT is a widely accepted standard. MQTT is a publish/subscribe messaging protocol that fits well in a frequently disconnected architecture like IoT. It is bidirectional, so you can send information and configuration instructions to devices as well.
Point 3:
AWS IoT allows devices to publish to a topic. In this diagram, the device (the Pi3 with the DHT11 sensor) uses the AWS IoT MQTT broker to send data to the topic.
Point 4:
AWS allows you to define rules that trigger actions based on the data published by multiple devices to a specific topic. You can define multiple rules for the same topic.
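As a hedged illustration (my own sketch, not part of the original setup), the boto3 call below registers a rule that selects messages from a hypothetical sensors/dht11 topic and writes them to a DynamoDB table; the rule name, topic, table name and IAM role ARN are all placeholders.

import boto3

iot = boto3.client("iot", region_name="us-east-1")

iot.create_topic_rule(
    ruleName="Dht11ToDynamoDB",
    topicRulePayload={
        # Select every reading published to the (hypothetical) sensors/dht11 topic
        "sql": "SELECT temperature, humidity, timestamp() AS ts FROM 'sensors/dht11'",
        "description": "Store DHT11 readings in DynamoDB",
        "ruleDisabled": False,
        "actions": [{
            "dynamoDBv2": {
                "roleArn": "arn:aws:iam::123456789012:role/iot-dynamodb-role",  # placeholder role
                "putItem": {"tableName": "Dht11Readings"},                      # placeholder table
            }
        }],
    },
)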
Point 5:
AWS allows multiple AWS services to be integrated with AWS IoT through the actions run by AWS IoT rules. The diagram lists a few of these services, most notably AWS Lambda, DynamoDB, S3, SNS/SQS, Kinesis (for real-time streaming) and Elasticsearch. NOTE: the list is not complete.
Point 6:
Once the IoT device data has been received on the AWS side and stored using the appropriate AWS service, you can analyze that data. For example, if you save the data in DynamoDB you can run analytics on it using AWS QuickSight.

Raspberry Pi3 Circuit Diagram
The above diagram shows a logical circuit diagram of how the Raspberry Pi3 is connected to the DHT11 temperature and humidity sensor. To keep this article simple I am not showing the actual pin connections, and the breadboard that ties all of these components together is omitted. A few observations about the diagram:

• On the MicroSD card of the Raspberry Pi3 I have installed the Raspbian Stretch Desktop OS (a Debian-based Linux OS)
• I am using Python libraries to communicate with the GPIO pins instead of a C program, simply to make life easier. Specifically, I am using the RPi.GPIO library that ships with the Raspbian Stretch OS
• I have installed the AWS CLI client on the Raspberry Pi3 to allow it to communicate with AWS IoT
• I have installed the AWSIoTPythonSDK.MQTTLib Python library to allow the Pi3 to communicate with the AWS IoT MQTT broker as an MQTT client
• The resistors in the diagram are used to limit the current flow.
The logical flow of events is as follows:

Point 1:
Pressing the button controls when the Pi3 reads sensor data. I am using a button simply to drive the event flow through a user action. If you prefer, you can remove the button and the LED from the circuit and have the Pi3 read temperature and humidity from the DHT11 sensor at periodic intervals instead.
Point 2:
Once the button is pressed, an LED serves as a visual indicator that the Pi3 is about to read sensor data.
Point 3:
The button also sends a signal to the Pi3 via the GPIO pins as a trigger to read sensor data.
Point 4:
The Pi3 reads the temperature and humidity data from the DHT11 sensor once it receives the trigger from the button-press event.
Point 5:
The Pi3 then uses the AWSIoTPythonSDK.MQTTLib Python library to send the data to AWS IoT. Once the data reaches AWS IoT, analysis can be performed from that point forward using the AWS services shown in the earlier architectural diagram.
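The flow above can be sketched in Python roughly as follows. This is a hedged sketch, not the exact code used here: the pin numbers, certificate paths, AWS IoT endpoint and topic name are hypothetical, and the DHT11 read uses the Adafruit_DHT helper library (an assumption on my part, since bit-banging the DHT11 protocol with RPi.GPIO alone is very timing sensitive).

import json
import time

import Adafruit_DHT
import RPi.GPIO as GPIO
from AWSIoTPythonSDK.MQTTLib import AWSIoTMQTTClient

BUTTON_PIN = 18   # GPIO pin wired to the push button (hypothetical)
LED_PIN = 23      # GPIO pin wired to the LED (hypothetical)
DHT_PIN = 4       # GPIO pin wired to the DHT11 data line (hypothetical)

GPIO.setmode(GPIO.BCM)
GPIO.setup(BUTTON_PIN, GPIO.IN, pull_up_down=GPIO.PUD_UP)
GPIO.setup(LED_PIN, GPIO.OUT, initial=GPIO.LOW)

# MQTT client that talks to the AWS IoT message broker over TLS (port 8883)
mqtt = AWSIoTMQTTClient("pi3-dht11-client")
mqtt.configureEndpoint("abc123-ats.iot.us-east-1.amazonaws.com", 8883)  # hypothetical endpoint
mqtt.configureCredentials("root-CA.pem", "private.pem.key", "certificate.pem.crt")
mqtt.connect()

try:
    while True:
        # Point 1/3: block until the button is pressed (falling edge on the GPIO pin)
        GPIO.wait_for_edge(BUTTON_PIN, GPIO.FALLING)

        # Point 2: turn the LED on as a visual indicator that a reading is in progress
        GPIO.output(LED_PIN, GPIO.HIGH)

        # Point 4: read temperature and humidity from the DHT11 sensor
        humidity, temperature = Adafruit_DHT.read_retry(Adafruit_DHT.DHT11, DHT_PIN)

        # Point 5: publish the reading to a (hypothetical) topic on AWS IoT
        if humidity is not None and temperature is not None:
            payload = json.dumps({"temperature": temperature, "humidity": humidity,
                                  "ts": int(time.time())})
            mqtt.publish("sensors/dht11", payload, 1)  # QoS 1

        GPIO.output(LED_PIN, GPIO.LOW)
finally:
    GPIO.cleanup()
    mqtt.disconnect()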
A few takeaways
Things you should consider exploring, but which are not elaborated in this article, are listed below.
At the AWS end
• It's possible to use AWS Athena to run queries directly against data in an S3 bucket
• You can do real-time data analytics using Spark on an AWS EMR cluster, with AWS Kinesis as the streaming source that streams data from multiple IoT devices.
• If you want to transform the IoT data received from devices, consider using AWS Glue or AWS Data Pipeline.

    At the IoT end
• You can connect a passive infrared sensor (PIR motion sensor) to the Raspberry Pi3 and build your own home security system that sends you email alerts when someone enters your house, or add a piezo buzzer alongside the motion sensor to sound an alarm
• You can also connect a relay to the Pi3 and use it to control pretty much anything over the web. I suggest using a smart relay like Adafruit's "Controllable Four Outlet Power Relay Module version 2", which has a built-in relay, so you can connect a low-voltage circuit like the Pi3 on one end and plug high-voltage electrical appliances (an air conditioner, a table lamp, etc.) into the power strip on the other. This setup lets you control pretty much anything in your house from anywhere over the cloud. Using a smart relay avoids having to work with high-voltage wires and connect them directly to a bare relay; safety first.
• You can measure pressure, temperature and altitude using a BMP180 digital pressure sensor connected to the Pi3.
    Conclusion

I hope this gets you excited about IoT and encourages you to try some DIY projects. I have given some DIY project pointers in the "A few takeaways" section of this article to get you started. Good luck.

    Sunday, March 18, 2018

    ELK+ Stack - Elasticsearch, Logstash (Filebeat) and Kibana

    Introduction
In this article I will describe how to use the ELK+ stack for:
• Full-text searching (Google-style search capability)
• Visual data analysis of log data generated from various sources (such as server syslog, Apache logs, IIS logs, etc.)

        Architectural Overview

        Filebeat
In the ELK+ stack, Filebeat provides a lightweight way to send log data either directly to Elasticsearch or to Logstash for further transformation before it is sent to Elasticsearch for storage. Filebeat provides many out-of-the-box modules that can be configured with minimal effort to start shipping logs to Elasticsearch and/or Logstash.

        Logstash
Logstash is the log analysis platform of the ELK+ stack. It can do what Filebeat does and more: it has built-in filters and scripting capabilities to analyze and transform data from various log sources (Filebeat being one such source) before sending the information to Elasticsearch for storage. The question is when to use Logstash versus Filebeat. You can think of Logstash as the "big daddy" of Filebeat. Logstash requires a Java JVM and is a heavyweight alternative to Filebeat, which needs only minimal configuration to collect log data from client nodes. Filebeat can be used as a lightweight log shipper and as one of the sources of data coming into Logstash, which can then act as an aggregator and perform further analysis and transformation on the data before it is stored in Elasticsearch, as shown below.

        Elasticsearch
Elasticsearch is the NoSQL data store for storing documents. The term "document" in Elasticsearch nomenclature means JSON-type data. Since the data is stored in JSON format, it allows a schema-less layout, deviating from traditional RDBMS-style schemas and allowing the JSON elements to change with no impact on existing data. This benefit comes from the schema-less nature of the data store and is not unique to Elasticsearch; DynamoDB, MongoDB, DocumentDB and many other NoSQL data stores provide similar capabilities. Under the hood, Elasticsearch uses Apache Lucene as the search engine for the documents it stores.
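To make the "json-type document" idea concrete, here is a minimal sketch (my own illustration) using the official elasticsearch Python client as it looked around the 6.x releases; the host, index name and field values are hypothetical.

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Store a json-type document; no schema has to be declared up front
es.index(index="app-logs", doc_type="_doc", body={
    "host": "web01",
    "level": "ERROR",
    "message": "Connection timed out while calling the payment service",
    "@timestamp": "2018-03-18T10:15:00Z",
})

# Full-text search across the stored documents (Lucene does the heavy lifting)
result = es.search(index="app-logs", body={
    "query": {"match": {"message": "timed out"}}
})
for hit in result["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["message"])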

Data (more specifically, documents) in Elasticsearch is stored in indexes, which are split across multiple shards. A document belongs to one index and to one primary shard within that index. Shards allow parallel data processing for the same index, and you can replicate shards across multiple replicas to provide fail-over and high availability. The diagram below shows how data is logically distributed across an Elasticsearch cluster of three nodes. A node is a physical server or a virtual machine (VM).


        Kibana
Within the ELK+ stack, the most exciting piece is the visual analytics tool, Kibana. Kibana lets you do visual data analysis on data stored in Elasticsearch, offering many visualization types, as shown below.
Once you have created visualizations using these visualization types, you can combine them into a dashboard that presents them on a single page, giving you views of the data stored in Elasticsearch from different perspectives, a technique used in visual data analysis to identify patterns and trends in your data.

        Use Cases for ELK+ stack:
        Use Case 1
In this use case Elasticsearch stores documents to create a Google-style search engine within your organization. The documents referred to here are Word documents, PDF documents, images, etc., not the JSON-type documents Elasticsearch stores internally.


The above diagram shows two user-driven workflows in which the Elasticsearch cluster is used by web applications.

Workflow 1
Users upload files through the web application user interface. Behind the scenes, the web application uses the Elasticsearch API to send the files to an ingest node of the Elasticsearch cluster. The web application can be written in Java (JEE), Python (Django), Ruby on Rails, PHP, etc.; depending on the choice of programming language and application framework, you pick the appropriate API client to interact with the Elasticsearch cluster.

        Workflow 2
Users use the web application user interface to search for documents (PDF, XLS, DOC, etc.) that were ingested into the Elasticsearch cluster (via Workflow 1).

Elasticsearch provides the "Ingest Attachment" plugin to ingest such documents into the cluster. The plugin is based on the open source Apache Tika project, a toolkit that detects and extracts metadata and text from many different file types (PDF, DOC, XLS, PPT, etc.). The "Ingest Attachment" plugin uses the Apache Tika library to extract the data for the different file types and stores the clear-text contents in Elasticsearch as JSON-type documents, with the full-text extract kept as an element within the JSON document. Once the metadata and full text have been extracted and stored in clear-text JSON form, Elasticsearch uses the Apache Lucene search engine to provide full-text search by performing analysis and relevance ranking on the full-text field.
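As a hedged sketch of how this can look from a Python client (my own illustration; the "Ingest Attachment" plugin must already be installed on the ingest node, and the pipeline, index and file names are placeholders):

import base64

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Define an ingest pipeline that runs the attachment processor on the "data" field
es.ingest.put_pipeline(id="attachment", body={
    "description": "Extract text and metadata from uploaded files",
    "processors": [{"attachment": {"field": "data"}}],
})

# Read a file, base64-encode it and index it through the pipeline;
# the extracted clear text ends up in the "attachment.content" field
with open("quarterly-report.pdf", "rb") as f:   # hypothetical file
    encoded = base64.b64encode(f.read()).decode("ascii")

es.index(index="documents", doc_type="_doc", pipeline="attachment",
         body={"filename": "quarterly-report.pdf", "data": encoded})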

        Use Case 2

Using Elasticsearch as a NoSQL data store for logs from various sources (Apache server logs, syslogs, IIS logs, etc.) in order to do visual data analysis with Kibana. The architectural diagram in the section titled "Architectural Overview" in this article shows how the ELK+ stack can be used for this use case. I will not elaborate further on it beyond a few additional points (or takeaway notes):
• It's possible to send log data from Filebeat directly to the Elasticsearch cluster (bypassing Logstash)
• Filebeat has a built-in backpressure-sensitive protocol that slows down the sending of log data when Logstash and/or Elasticsearch are busy with other activities.

A few takeaways

• To avoid having the Elasticsearch ingest nodes extract metadata and full text from different file types in real time through the "Ingest Attachment" plugin, which puts CPU and resource load on the ingest nodes, an alternative is to run Apache Tika offline: a batch program extracts the metadata and full-text data and sends the result directly to Elasticsearch as a JSON-type document (a sketch of this batch approach appears at the end of this section). This removes payload pressure from the Elasticsearch cluster, since the batch program can run outside the cluster nodes and full-text extraction is a very CPU-intensive operation. The same batch program approach also works for Apache Solr, since the program can route the data to Apache Solr and/or Elasticsearch
• Another benefit of off-loading the full-text extraction outside of Elasticsearch is that you are not tied to Apache Tika for metadata and full-text extraction; you can use commercial alternatives instead.
• For high availability and large Elasticsearch clusters, try to utilize the different Elasticsearch node types (a node is a physical machine or a VM in an Elasticsearch cluster). The different Elasticsearch node types are described below.
          Master eligible Node

This node type makes the node eligible to be elected as the master node of the Elasticsearch cluster. For large clusters, do not let "master-eligible" nodes also function as "data" nodes, and vice versa.

          Data Node 

          This node-type allows the node to be a “data” node in an Elasticsearch cluster.

          Ingest Node

This node type allows the node to act as an "ingest" node. If you are using the "Ingest Attachment" plugin, it is a good idea to send the data traffic associated with file uploads to ingest-only nodes, because full-text extraction from different file types is a very CPU-intensive process and you are better off dedicating nodes to that operation.

          Coordinating Node

          This node-type allows the node to co-ordinate the search request across “data” nodes and aggregate the results before sending the data back to the client. As a result, this node should have enough memory and CPU resources to perform that role efficiently.

          Tribe Node

          It’s a special type of “coordinating” node that allows cross-cluster searching operations

NOTE: by default an Elasticsearch node is a "master-eligible", "data", "ingest" and "coordinating" node. This works for small clusters, but for large clusters you need to plan these node roles (types) for scalability and performance.

• Apache Solr and Elasticsearch are competing open source platforms that both use Apache Lucene as the underlying search engine. Elasticsearch is more recent and has been able to benefit from lessons learned by earlier competing platforms. It has gained popularity due to the relative simplicity of setting up a cluster compared with Apache Solr and, as mentioned earlier, many cloud providers (like AWS) offer Elasticsearch as a Platform as a Service (PaaS)
• Elastic also offers a product called X-Pack. At this point in time you have to pay for it, but the company behind Elasticsearch has said it will open up the X-Pack offering this year. X-Pack provides the following capabilities:


1. Securing your Elasticsearch cluster through authentication/authorization
2. Sending alerts
3. Monitoring your clusters
4. Reporting
5. Exploring relationships in your Elasticsearch data using "Graph" plotting
6. Machine learning on your Elasticsearch data
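To close out the takeaways, here is a hedged sketch of the offline extraction approach mentioned in the first bullet above: Apache Tika runs in a batch program outside the cluster and only the extracted metadata and text are indexed. It assumes the tika Python bindings and the elasticsearch client are installed; the directory, index and field names are placeholders.

import glob

from elasticsearch import Elasticsearch
from tika import parser

es = Elasticsearch(["http://localhost:9200"])

for path in glob.glob("/data/inbox/*.pdf"):       # hypothetical drop folder
    parsed = parser.from_file(path)               # CPU-intensive work happens here, off-cluster
    es.index(index="documents", doc_type="_doc", body={
        "filename": path,
        "metadata": parsed.get("metadata", {}),
        "content": parsed.get("content", ""),     # clear-text body used for full-text search
    })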

            Conclusion
The ELK+ stack is an open source platform that lets you quickly set up a cluster for visual data analysis and gives you a Google-like search capability within your organization, all for free. I hope this article gives you enough detail to use the ELK+ stack in your own organization.

            Sunday, March 4, 2018

            CI/CD automation for Agile projects - AWS CloudFormation (IaC), Jenkins, Git, Jira and Selenium

            Introduction
In this article I plan to describe how to automate the entire software development process using Continuous Integration (CI) / Continuous Deployment (CD) tool sets. I will use the following CI/CD toolset to explain complete end-to-end automation:

             
            Architectural Overview
            CI/CD Automation Flow
In the diagram above I use numbering to explain a typical automation flow in CI/CD-style application development.

            Point 1: Source Code Management (SCM)
Developers use source code management software (like Git) to check in code. Any source code repository allows multiple release branches to be maintained for parallel releases. A code check-in event can trigger an automatic build process, and you can control when to build based on predefined rules, for example:
• For Git you can trigger automatic builds based on release tags, like rel1.0, rel2.0, etc.
• For GitHub you can trigger automatic builds via webhooks calling your endpoint URL, and within your endpoint code you can inspect the GitHub event to determine whether a build is needed.
The point is that SCM lets you associate check-in events with automatic builds based on predefined rules, giving you control over which check-in events trigger automatic builds.
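As a hedged sketch of the webhook idea above (my own illustration, not a prescribed setup), the small Flask endpoint below receives GitHub push events and triggers a Jenkins job only when a release tag such as rel1.0 is pushed; the Jenkins URL, job name, token and tag convention are placeholders.

import re

import requests
from flask import Flask, request

app = Flask(__name__)

JENKINS_BUILD_URL = "https://jenkins.example.com/job/my-app/build"  # hypothetical job URL
JENKINS_TOKEN = "my-build-token"  # token configured on the job ("Trigger builds remotely")

@app.route("/github-webhook", methods=["POST"])
def github_webhook():
    event = request.headers.get("X-GitHub-Event", "")
    payload = request.get_json(silent=True) or {}

    # Only push events that create a tag like rel1.0, rel2.0 ... should trigger a build
    ref = payload.get("ref", "")
    if event == "push" and re.match(r"^refs/tags/rel\d+\.\d+$", ref):
        requests.post(JENKINS_BUILD_URL, params={"token": JENKINS_TOKEN})
        return "build triggered", 200

    return "ignored", 200

if __name__ == "__main__":
    app.run(port=8080)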

            Point 2: Build Server
Build servers are the main component that ties all the CI/CD pieces together by orchestrating the build process through numerous plugins. One such tool is Jenkins, an open source build server that provides plugins to integrate with:
• SCM tools like Git, GitHub, CodeCommit, etc.
• Cloud providers like AWS, GCP and Azure
• Java applications through the maven plugin
• .NET applications through the msbuild plugin
• ALM tools like Jira
AWS also provides an end-to-end build service called AWS CodePipeline, which is worth considering if all the components you are using come from the AWS cloud (for example, CodeCommit instead of Git).

            Point 3: Application Lifecycle Management (ALM) tools like Jira
With Jenkins you can use the Jira plugin to update issues opened in Jira, so business owners can look at the comments section of a Jira issue to see which issues were fixed by a build. I have written about this type of integration in my blog: http://pmtechfusion.blogspot.com/2017/03/application-lifecycle-management-alm.html. I encourage you to read that post if you are interested in the details.
            Point 4: Shared Infrastructure Stack

With Infrastructure as Code (IaC), it's possible to define infrastructure as code scripts that are checked into SCM along with the application code, which allows infrastructure templates to be versioned just like application code. AWS provides CloudFormation to define infrastructure templates in JSON or YAML format (Azure and Google's GCP provide similar JSON-style scripting capabilities). In CI/CD-style development automation, you should define at least two infrastructure templates: one for infrastructure that is shared across applications and another for the application-specific infrastructure and technology stack. The former is usually not torn down with each automatic deployment.
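Below is a minimal sketch (my own illustration) of launching the shared infrastructure template with boto3; the stack name, template URL and parameter are placeholders.

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName="shared-infrastructure",
    # Template previously uploaded to an S3 bucket (hypothetical URL)
    TemplateURL="https://s3.amazonaws.com/my-cicd-bucket/shared-infra.yaml",
    Parameters=[{"ParameterKey": "VpcCidr", "ParameterValue": "10.0.0.0/16"}],
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# Block until the stack is fully created before application stacks reference it
cfn.get_waiter("stack_create_complete").wait(StackName="shared-infrastructure")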
            Shared infrastructure stack – Architectural Diagram
            NOTE: The above shared infrastructure is defined with High Availability in mind.

A sample infrastructure layout is shown above. Some of the AWS components you define in the shared infrastructure CloudFormation stack are listed below:
• VPC – Virtual Private Cloud (the virtual network)
• Internet Gateway (IGW) – allows the VPC to access the internet
• NAT Gateway – defined in the shared infrastructure so that private subnets in the application-specific stacks can use the NAT gateway to access the internet
• Elastic IP – NAT gateways need a public IP, and an Elastic IP provides this in AWS
• Route Tables – you typically define three route tables (at a minimum):
1. One for public routes
2. Two for private routes (since there are two private subnets, there are two route tables). In each private route table you define an outbound rule that sends all internet traffic to the respective NAT gateway; this is how EC2/RDS instances in private subnets access the internet for package/software updates.
• S3 – the AWS storage used to store CloudFormation templates and application-specific code
• Two public subnets (A and B) – one for each of the two NAT gateways, for high availability
Point 5: Application Infrastructure Stack

This is the CloudFormation stack that is continuously built, destroyed and rebuilt as part of the CI/CD process. Having a separate stack for application-specific builds isolates different applications from overlapping with each other's builds.
            Application infrastructure stack – Architectural Diagram
The green shaded box shows the shared infrastructure stack defined earlier; the blue shaded box shows the application-specific CloudFormation stack.

Some of the AWS components you define in the application-specific CloudFormation stack are listed below:
• You will typically have two public subnets (C and D); these subnets are associated with the public route table defined in the shared infrastructure for routing packets.
• For the application layer you define two private subnets (A and B), and for the database layer (specifically RDS) you are required to have two subnets (C and D, defined as private here for security) for high availability. These subnets use the two private route tables defined in the shared infrastructure for packet routing, and since each route table has a NAT gateway defined for outbound internet traffic, EC2 and RDS instances can reach the outside world only via those route tables.
• You define two Application Load Balancers (ALB) in the public subnets (C and D)
• You define, at a minimum, the following security groups:
1. Security groups (SG) that allow application-layer instances to access database-layer instances.
2. An SG for the ALBs so they can reach the application layer.
3. An SG that allows public traffic to flow through ports 80/443 of the ALB (you typically terminate SSL at the ALB).
4. An SG for a bastion EC2 instance (not shown in the diagram) running in public subnet A or B that can be accessed over port 22 (ssh) from your corporate public IPs. A bastion EC2 instance is how you get into the infrastructure to administer your VPC network and EC2 instances.
• An Auto Scaling group and Launch Configuration for your servers
• RDS for your database layer
• CodeDeploy components in the application stack that use the S3 bucket to update the application code, with a deployment group deploying the different build versions. Newer versions of the application code are built by Jenkins whenever an automatic build is triggered (say, by a release-tag commit in Git). Jenkins then puts the application build in AWS S3 and uses the AWS API to run an "Update Stack" command on the application-specific stack, resulting in an automatic deployment (a sketch of this step follows below).
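Below is a hedged sketch of the deployment step described in the last bullet: the build job uploads the new application package to S3 and then issues an update on the application stack. The bucket, key, stack name and parameter are placeholders.

import boto3

s3 = boto3.client("s3")
cfn = boto3.client("cloudformation")

# 1. Ship the freshly built application package to S3 (hypothetical bucket/key)
s3.upload_file("target/my-app-1.4.2.zip", "my-cicd-bucket", "builds/my-app-1.4.2.zip")

# 2. Update the application stack; the CodeDeploy resources in the stack pick up the new revision
cfn.update_stack(
    StackName="my-app-stack",
    UsePreviousTemplate=True,
    Parameters=[
        {"ParameterKey": "AppPackageKey", "ParameterValue": "builds/my-app-1.4.2.zip"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
cfn.get_waiter("stack_update_complete").wait(StackName="my-app-stack")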
In the application layer you can run an application server using a JEE/Spring/Hibernate technology stack, an ASP.NET/WCF/Entity Framework .NET stack, Python/Django, Node.js or PHP.

In the database layer the RDS instance can be MySQL, SQL Server, Aurora, etc., or you can even use a NoSQL data layer such as MongoDB, DynamoDB or DocumentDB.

            CI/CD Automatic Flow:

            The diagram below shows the complete logic that triggers the CI/CD process.
NOTE: It is technically possible, and most certainly desirable, to gate the automation on the testing results, so that the success or failure of the automated Selenium tests determines whether the production deployment should occur. This flow logic is not shown above, but it is the next logical step towards automating production deployments as well.
             
            Conclusion
You can use the toolsets mentioned in this article to do Continuous Integration and Continuous Deployment on your project. Infrastructure as Code (IaC) is changing the way software development is done, and CI/CD automation is allowing application teams to be agile, productive and to deliver bug-free software products. DevOps is truly bridging the barrier between developers and operations, merging the two roles into one cohesive role.

            Saturday, February 24, 2018

            Apache Hadoop – Spark – Kafka Versus AWS EMR – Spark – Kinesis Stream

            Introduction
In this article I will present two options for doing real-time data analytics and give the pros and cons of each.

            Option 1: Apache Hadoop – Spark – Kafka

This option assumes you will host all the components shown above in your own data center. I have written an article (https://www.linkedin.com/pulse/big-datadata-science-analytics-apache-hadoop-kafka-spark-jayesh-nazre/) and a blog post (http://pmtechfusion.blogspot.com/2017/11/big-datadata-science-analytics-apache.html) on this option. If you are interested in doing a local installation, I encourage you to read the detailed installation steps provided in that blog post.

            Below are some of the pros of option 1
• Since you are using open source software, there is no software licensing cost
• You have more control over what you can do under the hood, as the installation is done by your DevOps group in your local data center.
• You are not held hostage to one cloud provider. The second option described later uses as many AWS services as possible so that the overall solution cost stays low.
• You can upgrade components independently to stay current with newer releases. Given that this is open source software, be prepared for such upgrades to occur at a much faster pace than for commercial equivalents
• Developers often want to use features available in the latest releases of these components, and you will have the flexibility to upgrade individual components (say, only the Kafka libraries, or only the Spark libraries) without impacting other components (assuming the newer version is backward compatible)
• Security may be better, as the data never leaves your data center (although this is an arguable point)
            Below are some of the cons of option 1
• You need to plan ahead for how much hardware will be needed, so it's a projection game and a cost you incur upfront at the start of the project.
• More staff is needed for operational activities during the maintenance phase of the project (relatively more than for Option 2 below)
• You have to plan for scaling your cluster either manually or automatically (through custom scripts)
            Option 2: AWS EMR – Spark – Kinesis Stream
In this option, you replace some of the open source components with managed services provided by Amazon AWS. A brief description of the above diagram follows:
• Kafka is replaced with AWS Kinesis streaming.
• Apache Hadoop and Apache Spark are now managed inside an AWS Elastic MapReduce (EMR) cluster.
• The red arrows show the data flow from the producers to the data analytics end.
• The diagram shows that you can optionally store the data analytics results in S3 (the light red arrows) instead of HDFS (Apache Hadoop), or in both.
• Other AWS services like AWS QuickSight, DynamoDB, Redshift, Lambda, etc. can be used to further analyze the results stored in S3. I have not shown them in the diagram, as that is a topic for another article.
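As a hedged illustration of the producer side of this option (my own sketch), each device or application can push its readings into the Kinesis stream with a simple put_record call; the stream name, partition key and payload fields are placeholders.

import json
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

record = {"device_id": "sensor-042", "temperature": 22.5, "ts": int(time.time())}

kinesis.put_record(
    StreamName="iot-telemetry",                    # hypothetical stream consumed by Spark on EMR
    Data=json.dumps(record).encode("utf-8"),
    PartitionKey=record["device_id"],              # spreads devices across shards
)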
            Below are some of the pros of option 2
• Since you are using Amazon-provided services, most of the maintenance work involved in patching, cluster scaling, auditing, etc. is handled by out-of-the-box AWS capabilities
• You no longer pay for a physical data center or for the personnel needed for the physical security of on-premise servers.
• AWS provides various pricing options, ranging from pay-as-you-go (On-Demand) and Spot purchases for less critical jobs to Reserved Instances that save cost in exchange for paying upfront.
• With AWS EMR, Amazon also gives you the option to create a transient EMR cluster that runs your long-running job and then tears down the entire cluster after the job completes, thereby charging you only for the time the job takes (see the sketch after this list).
• Relatively less operational staff is needed, as AWS services replace manual work with automation (for example, automation through CloudFormation used as Infrastructure as Code (IaC))
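Below is a hedged sketch of the transient-cluster idea mentioned in the pros above: run_job_flow starts an EMR cluster with a single Spark step and tears the cluster down when the step finishes, so you pay only for the duration of the job. The cluster name, instance types, release label, S3 path and IAM roles are placeholders.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="nightly-spark-job",
    ReleaseLabel="emr-5.12.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,   # this is what makes the cluster transient
    },
    Steps=[{
        "Name": "run-spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/analytics_job.py"],  # hypothetical job
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)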
            Below are some of the cons of option 2
• Projects incur cost even during the development and testing phases, as everything happens in the cloud
• Since the technology behind Hadoop and Spark is open source, it is quite likely that the Hadoop and Spark versions provided by AWS EMR are sometimes one release (N-1) behind the most recent releases of the respective open source projects.
            A few takeaways
• If you have the right technical staff (skill-set wise) to maintain and manage the cluster components, then hosting the cluster in your own data center gives you more control
• If you do not want the headache of maintaining the infrastructure for the cluster components, then the AWS option is better
• During the execution/development phase of the project it can be more cost-efficient to host the cluster in your local data center, as that phase typically has a lot of moving parts.

A few alternate options besides the two described in this article are listed below:
• If you are using Option 2 with AWS as the cloud provider, then:
1. You can consider using Kinesis Data Analytics instead of Apache Spark. NOTE: in the data analytics community, Apache Spark is a more widely used and accepted technology than Kinesis Data Analytics.
2. If you are not doing real-time streaming, another appealing option is to store your data files in AWS S3 and use AWS Athena for data analysis.
• You can also use Elasticsearch and Kibana for data analysis. Elasticsearch provides full-text search and data analysis capabilities, with Kibana (a graphical, browser-based utility) letting you visualize the search results and do visual data analysis. Elasticsearch also provides client APIs in many programming languages (.NET, Java, Python, Ruby, JavaScript, etc.) to integrate application websites with Elasticsearch. I plan to write a blog post on this in the near future at my blog: http://pmtechfusion.blogspot.com/ (so stay tuned)
            Conclusion
I hope this article inspires you to do data analytics with your own real-world use cases. Personally, I started with Option 1 because I wanted to get all of the components running on multiple VMs and understand how they interact. Once I did that, Option 2 using AWS was so easy that I built the entire Option 2 infrastructure in a single day. The beauty and power of the cloud is that, as an individual, you do not have to deal with silo-type red tape to build a data center and your entire cluster (no wonder it's called disruptive technology). Experience-wise, working in the cloud feels like being a kid in a candy store.