Introduction
In this article I describe two options for doing real-time data analytics and list the pros and cons of each.
Option 1: Apache Hadoop – Spark – Kafka
Below are some of the pros of option 1
- Since you are using open source software, there is no software license cost.
- You have more control over what happens under the hood, as the installation will be done by your DevOps group in your own data center.
- You are not held hostage to one cloud provider. Option 2, which I describe later, deliberately uses as many AWS services as possible to keep the overall solution cost low.
- You will be able to perform component upgrades independently to stay up to speed with newer version releases. Given that we are using open source software, be prepared for such upgrades to occur at a much faster pace than with commercial equivalents.
- Many times developers want to use features available with the latest releases of these components and you will have the flexibility to upgrade individual components (say Kafka libraries only, or Spark libraries only) without impacting other components (assuming the newer version provides backward compatibility)
- Security will be better as the data will never leave your data center (although this is an arguable point)
Below are some of the cons of option 1
- You will need to plan ahead for how much hardware will be needed. It is a projection game, and a cost that you will incur upfront at the start of the project.
- More staff will be needed for the operational activities that occur during the maintenance phase of the project (relatively more than Option 2 below).
- You will have to factor in and plan for scaling your cluster, either manually or automatically (through custom scripts).
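To give a feel for what such a custom scaling script has to decide, here is a minimal sketch of the decision logic only. Everything in it is an assumption for illustration: the `ClusterMetrics` fields, the `nodes_to_add` helper, the 75% utilization target and the node cap are hypothetical, and how you collect the metrics and actually provision the servers depends on your own environment.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterMetrics:
    """Snapshot of cluster load (field names are illustrative assumptions)."""
    total_mb: int       # total memory the cluster can hand out
    allocated_mb: int   # memory currently allocated to running jobs
    apps_pending: int   # jobs waiting for resources
    active_nodes: int   # worker nodes currently in the cluster


def nodes_to_add(m: ClusterMetrics, node_mb: int,
                 target_utilization: float = 0.75, max_nodes: int = 20) -> int:
    """Return how many workers to add so utilization drops to the target."""
    if m.total_mb <= 0 or node_mb <= 0:
        raise ValueError("total_mb and node_mb must be positive")
    utilization = m.allocated_mb / m.total_mb
    # Scale out only when the cluster is busy AND work is queued up.
    if utilization <= target_utilization or m.apps_pending == 0:
        return 0
    # Total memory needed so that allocated_mb / needed_mb equals the target.
    needed_mb = m.allocated_mb / target_utilization
    extra = math.ceil((needed_mb - m.total_mb) / node_mb)
    # Never grow beyond the hardware you actually bought.
    headroom = max(0, max_nodes - m.active_nodes)
    return max(0, min(extra, headroom))


if __name__ == "__main__":
    snapshot = ClusterMetrics(total_mb=100_000, allocated_mb=90_000,
                              apps_pending=3, active_nodes=10)
    print(nodes_to_add(snapshot, node_mb=10_000))
```

Note the `max_nodes` cap: on premises, a script like this can only scale up to the hardware you projected and paid for upfront, which is exactly the planning problem described above.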
Option 2: AWS EMR – Spark – Kinesis Stream
We have substituted Kafka with AWS Kinesis streaming.
- Apache Hadoop and Apache Spark are now managed inside AWS Elastic MapReduce (EMR) cluster.
- The red arrows show the data flow from producers to data analytics end.
- The diagram shows that you can optionally store the data analytics results in S3 (the light red arrows) instead of, or in addition to, HDFS (Apache Hadoop).
- Other AWS services like QuickSight, DynamoDB, Redshift, Lambda etc. can be used to further analyze the results stored in S3. I have not shown them in the diagram as that’s a topic for another article.
Below are some of the pros of option 2
- Since you are using Amazon-provided services, most of the maintenance work involved in patching, cluster scaling, auditing etc. is now handled by out-of-the-box services provided by AWS.
- You no longer need to pay for the physical data center or for the staff needed to physically secure on-premises servers.
- AWS provides various pricing options: pay as you go (On Demand), Spot purchases for less critical jobs, and Reserved Instances that save some cost by paying upfront.
- With AWS EMR, Amazon also gives you the option to create a transient EMR cluster, which runs your long-running job and then tears down the entire cluster once the job is completed, thereby charging you only for the time the job takes to complete.
- Relatively less operational staff is needed, as some of the AWS services have replaced manual work with automation (for example, automation through CloudFormation used as Infrastructure as Code (IaC)).
Below are some of the cons of option 2
- Projects will incur cost even during the development and testing phases, as everything occurs in the cloud.
- Since the technology behind Hadoop and Spark is open source, it is quite likely that the Hadoop and Spark versions provided by AWS EMR will sometimes be a release (N-1) behind the most recent releases from the respective open source projects.
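The transient EMR cluster mentioned above can be sketched as a single request. This is only a sketch under stated assumptions: the cluster name, release label, instance types, bucket paths and the `build_transient_cluster_request` / `submit` helpers are placeholders of mine, and the role names are the EMR defaults that may not exist in your account. The parameter names follow the boto3 `run_job_flow` call.

```python
def build_transient_cluster_request(job_script_s3: str, log_uri: str,
                                    worker_count: int = 2) -> dict:
    """Build the arguments for a cluster that runs one Spark job and exits."""
    if worker_count < 1:
        raise ValueError("worker_count must be at least 1")
    node = {"Market": "ON_DEMAND", "InstanceType": "m5.xlarge"}  # assumed type
    return {
        "Name": "transient-spark-job",      # placeholder name
        "ReleaseLabel": "emr-6.15.0",       # assumed release; pick your own
        "LogUri": log_uri,
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "primary", "InstanceRole": "MASTER",
                 "InstanceCount": 1, **node},
                {"Name": "workers", "InstanceRole": "CORE",
                 "InstanceCount": worker_count, **node},
            ],
            # False means the cluster terminates once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--deploy-mode", "cluster",
                         job_script_s3],
            },
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",   # default role names, assumed
        "ServiceRole": "EMR_DefaultRole",
    }


def submit(request: dict) -> str:
    """Send the request to EMR and return the new cluster id."""
    import boto3  # imported lazily so the builder works without AWS libraries
    response = boto3.client("emr").run_job_flow(**request)
    return response["JobFlowId"]


if __name__ == "__main__":
    request = build_transient_cluster_request(
        "s3://example-bucket/jobs/analytics.py",   # hypothetical paths
        "s3://example-bucket/logs/")
    print(request["Instances"]["KeepJobFlowAliveWhenNoSteps"])
```

The setting that makes the cluster transient is `KeepJobFlowAliveWhenNoSteps` being `False`. Calling `submit(request)` would create real, billable resources, so I have kept it out of the example run.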
A few takeaways
- If you have the right technical staff (skillset wise) to maintain and manage the cluster components then hosting the cluster at your data center gives you more control
- If you do not want the headache of maintaining the infrastructure for the cluster components, then the AWS option is better.
- During the execution/development phase of the project it will be more cost efficient to host the cluster in your local data center, as this phase typically has a lot of moving parts.
A few other alternate options besides the two described in this article are listed below
- If you are using Option 2 and are using AWS as the cloud provider then –
  - You can consider using Kinesis Data Analytics instead of Apache Spark. NOTE: In the data analytics community Apache Spark is a more widely used and accepted technology than Kinesis Data Analytics.
  - If you are not doing real-time streaming, another appealing option is to store your data files in AWS S3 and then use AWS Athena for data analysis.
- You can also use Elasticsearch and Kibana for data analysis. Elasticsearch provides full-text search and data analysis capabilities, while Kibana (a graphical browser-based utility) lets you visualize the search results and do visual data analysis. Elasticsearch also provides client APIs in many programming languages (.NET, Java, Python, Ruby, JavaScript etc.) to integrate application websites with Elasticsearch. I plan to write about this in the near future on my blog: http://pmtechfusion.blogspot.com/ (so stay tuned)
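As a rough illustration of the Elasticsearch option, the sketch below builds a request body that combines a full-text `match` query with a `terms` aggregation. The field names `message` and `category` and the `build_search_request` helper are hypothetical. The resulting dictionary is what would be serialized to JSON and sent to an index's `_search` endpoint.

```python
import json


def build_search_request(text: str, top_n: int = 5) -> dict:
    """Build a search body: full-text match plus a document count per category."""
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    return {
        "size": 10,  # number of matching documents to return
        # Full-text search on a hypothetical "message" field.
        "query": {"match": {"message": text}},
        # Analysis part: count matches per value of a hypothetical
        # "category" field (assumed to be mapped as a keyword).
        "aggs": {
            "by_category": {"terms": {"field": "category", "size": top_n}}
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_search_request("timeout"), indent=2))
```

The aggregation buckets returned for a request like this are the kind of data that Kibana turns into charts.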
Conclusion
I hope this article will inspire you to do data analytics with your real-world use cases. Personally, I started with Option 1 as I wanted to get all of the components running on multiple VMs and understand how they interact. Once I did that, Option 2 using AWS was so easy that I built the entire Option 2 infrastructure in a single day. The beauty and the power of the cloud is that, as an individual, you can build a data center and your entire cluster without worrying about silos and red tape (no wonder it’s called a disruptive technology). For me, working in the cloud feels like being a kid in a candy store.