My "todo" List

I am planning to write blog posts at some point in the future on the following topics:

Spring Framework, Hibernate, ADO.NET Entity Framework, WPF, SharePoint, WCF, What's new in Java EE 6, What's new in Oracle 11g, TFS, FileNet, OnBase, Lombardi BPMS, Microsoft Solutions Framework (MSF), Agile development, RUP … the list goes on


I am currently working on writing the following blog post:

Rational Unified Process (RUP) and Rational Method Composer (RMC)

Sunday, March 18, 2018

ELK+ Stack - Elasticsearch, Logstash (Filebeat) and Kibana

Introduction
In this article I will provide details of how to use the ELK+ stack for:
• Full-text searching (Google-style search capability)
• Visual data analysis of log data generated from various sources (server syslogs, Apache logs, IIS logs, etc.)

      Architectural Overview

      Filebeat
In the ELK+ stack, Filebeat provides a lightweight way to send log data either to Elasticsearch directly or to Logstash for further transformation before the data is sent to Elasticsearch for storage. Filebeat ships with many out-of-the-box modules that require only minimal configuration to start shipping logs to Elasticsearch and/or Logstash.
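As a rough illustration of what such a log shipper ultimately does (this is a stand-in sketch, not Filebeat itself), the following Python snippet bulk-indexes log lines into Elasticsearch using the official Python client; the index name, log path and field names are illustrative assumptions.

```python
# A minimal stand-in for what a log shipper does: read log lines,
# turn them into JSON documents and bulk-index them into Elasticsearch.
# Index name, file path and field names are illustrative assumptions.
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def read_log_events(path):
    """Yield one bulk action per log line."""
    with open(path) as log_file:
        for line in log_file:
            yield {
                "_index": "app-logs",
                "_type": "doc",          # required by Elasticsearch 6.x
                "_source": {"message": line.rstrip("\n")},
            }

# helpers.bulk() batches the documents, which is essentially what
# Filebeat/Logstash do (with retries and backpressure handling on top).
helpers.bulk(es, read_log_events("/var/log/myapp/app.log"))
```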

      Logstash
Logstash is the log processing and analysis component of the ELK+ stack. It can do what Filebeat does and more: it has built-in filters and scripting capabilities to analyze and transform data from various log sources (Filebeat being one such source) before sending it to Elasticsearch for storage. The question is when to use Logstash versus Filebeat. You can think of Logstash as the "big daddy" of Filebeat: it requires a Java JVM and is the heavyweight alternative to Filebeat, which needs only minimal configuration to collect log data from client nodes. Filebeat can be used as a lightweight log shipper and as one of the sources of data flowing into Logstash, which then acts as an aggregator and performs further analysis and transformation before the data is stored in Elasticsearch, as shown below.

      Elasticsearch
Elasticsearch is the NoSQL data store for storing documents. The term "document" in Elasticsearch nomenclature means JSON-type data. Since data is stored in JSON format, Elasticsearch allows a schema-less layout, deviating from traditional RDBMS-style schemas and allowing JSON elements to change with no impact on existing data. This benefit comes from the schema-less nature of the data store and is not unique to Elasticsearch; DynamoDB, MongoDB, DocumentDB and many other NoSQL data stores provide similar capabilities. Under the hood, Elasticsearch uses Apache Lucene as the search engine for the documents it stores.

Data (more specifically, documents) in Elasticsearch is stored in indexes, and each index is split across multiple shards. A document belongs to one index and to one primary shard within that index. Shards allow parallel processing of data in the same index. You can replicate shards across multiple replicas to provide fail-over and high availability. The diagram below shows how data is logically distributed across an Elasticsearch cluster comprising three nodes. A node is a physical server or a virtual machine (VM).
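To make these ideas concrete, here is a minimal sketch using the official Python Elasticsearch client; the index name, shard counts and document fields are illustrative assumptions.

```python
# A minimal sketch of the concepts above: an index with a fixed number of
# primary shards and replicas, a JSON document stored in it, and a search.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Three primary shards, one replica of each (spread across the nodes).
es.indices.create(
    index="articles",
    body={"settings": {"number_of_shards": 3, "number_of_replicas": 1}},
)

# Documents are just JSON; no schema has to be declared up front.
es.index(index="articles", doc_type="doc", id=1,
         body={"title": "ELK+ Stack", "tags": ["elasticsearch", "kibana"]})

# Full-text search, backed by Apache Lucene under the hood.
result = es.search(index="articles",
                   body={"query": {"match": {"title": "elk"}}})
print(result["hits"]["total"])
```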


      Kibana
Within the ELK+ stack, the most exciting piece is the visual analytics tool, Kibana. Kibana lets you perform visual data analysis on the data stored in Elasticsearch and offers many visualization types, as shown below.
Once you have created visualizations using these visualization types, you can assemble them into a dashboard that presents the data stored in Elasticsearch from different viewpoints on a single page, a technique used in visual data analysis to identify patterns and trends in your data.

      Use Cases for ELK+ stack:
      Use Case 1
In this use case we use Elasticsearch to store documents in order to create a Google-style search engine within your organization. The documents referred to here are Word documents, PDFs, images, etc., not the JSON-type documents Elasticsearch stores internally.


The above diagram shows two user-driven workflows in which an Elasticsearch cluster is used by web applications.

Workflow 1
Users upload files using the web application's user interface. Behind the scenes, the web application uses the Elasticsearch API to send the files to an ingest node of the Elasticsearch cluster. The web application can be written in Java (JEE), Python (Django), Ruby on Rails, PHP, etc.; depending on the choice of programming language and application framework, you pick the appropriate client API to interact with the Elasticsearch cluster, for example as sketched below.
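As a hedged example of what the web application might do behind the scenes (assuming Python and the official Elasticsearch client, and assuming an ingest pipeline named "attachments" has already been created, as sketched a little further below), the upload step could look roughly like this:

```python
# Hedged sketch of Workflow 1: base64-encode the uploaded file and send it
# to the cluster through an assumed ingest pipeline named "attachments".
# File name, index name and host are illustrative assumptions.
import base64
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://ingest-node:9200"])

with open("quarterly-report.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

es.index(
    index="documents",
    doc_type="doc",
    id="quarterly-report.pdf",
    pipeline="attachments",          # the ingest pipeline does the extraction
    body={"filename": "quarterly-report.pdf", "data": encoded},
)
```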

      Workflow 2
Users use the web application's user interface to search for documents (PDF, XLS, DOC, etc.) that were ingested into the Elasticsearch cluster via Workflow 1.

Elasticsearch provides the "Ingest Attachment" plugin to ingest such documents into the cluster. The plugin is based on the open source Apache Tika project. Apache Tika is a toolkit that detects and extracts metadata and text from many different file types (PDF, DOC, XLS, PPT, etc.). The "Ingest Attachment" plugin uses the Apache Tika library to extract the data for the different file types and then stores the clear-text contents in Elasticsearch as JSON-type documents; the full-text extract of each file is stored as an element within the JSON-type document. Once the metadata and full text have been extracted and stored in clear-text JSON form within Elasticsearch, Elasticsearch uses the Apache Lucene search engine to provide full-text search capability by performing "analysis" and "relevance" ranking on the full-text field.
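Here is a minimal sketch of the server side of the two workflows, assuming the Ingest Attachment plugin is installed and using the Python client; the pipeline id, index name and query text are illustrative assumptions.

```python
# Define an ingest pipeline named "attachments" that extracts text from the
# base64 "data" field, then run a full-text query against the extracted
# content (Workflow 2). Names and query text are illustrative assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://ingest-node:9200"])

# The attachment processor (backed by Apache Tika) populates
# "attachment.content" plus "attachment.*" metadata fields.
es.ingest.put_pipeline(
    id="attachments",
    body={
        "description": "Extract text and metadata from uploaded files",
        "processors": [{"attachment": {"field": "data"}}],
    },
)

# Workflow 2: Google-style full-text search over the extracted text.
result = es.search(
    index="documents",
    body={"query": {"match": {"attachment.content": "revenue forecast"}}},
)
for hit in result["hits"]["hits"]:
    print(hit["_id"], hit["_score"])
```

In a real deployment you would typically also define an explicit mapping so that the extracted attachment.content field is analyzed the way you want.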

      Use Case 2

This use case uses Elasticsearch as a NoSQL data store for logs from various sources (Apache server logs, syslogs, IIS logs, etc.) so that visual data analysis can be done with Kibana. The architectural diagram in the section titled "Architectural Overview" above shows how the ELK+ stack is used for this use case. I am not going to elaborate further on it beyond a few additional points (see also the query sketch after the two points below):
• It's possible to send log data directly from Filebeat to the Elasticsearch cluster (bypassing Logstash).
• Filebeat has a built-in backpressure-sensitive protocol that slows down the sending of log data to Logstash and/or Elasticsearch when they are busy with other work.
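To illustrate the kind of query Kibana issues against such log data, here is a hedged sketch using the Python client; the filebeat-* index pattern and @timestamp field follow Filebeat's defaults, but treat them as assumptions for your setup.

```python
# A date histogram of log events per hour - the sort of aggregation Kibana
# runs to draw its charts. Index pattern and field name are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

result = es.search(
    index="filebeat-*",
    body={
        "size": 0,
        "aggs": {
            "events_per_hour": {
                "date_histogram": {"field": "@timestamp", "interval": "1h"}
            }
        },
    },
)
for bucket in result["aggregations"]["events_per_hour"]["buckets"]:
    print(bucket["key_as_string"], bucket["doc_count"])
```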

A few takeaways

• Sending files to the "Ingest Attachment" plugin in real time makes the Elasticsearch ingest nodes do the metadata and full-text extraction themselves, which puts CPU and resource load on those nodes. An alternative is to use Apache Tika as an offline utility: a batch program extracts the metadata and full-text data and sends the result directly to Elasticsearch as a JSON-type document (a sketch of this batch approach appears after this list). This avoids extra load on the Elasticsearch cluster, because the batch program can run outside the cluster nodes and full-text extraction is a very CPU-intensive operation. The same batch-program approach also works for Apache Solr, since the program can route the data to Apache Solr and/or Elasticsearch.
• Another benefit of off-loading the full-text extraction outside of Elasticsearch is that you are not tied to Apache Tika for metadata and full-text extraction; you can use commercial alternatives instead.
• For high availability and large Elasticsearch clusters, make use of the different Elasticsearch node types (a node is a physical machine or VM in an Elasticsearch cluster). The node types are described below.
        Master eligible Node

This node type makes the node eligible to be elected as the master node of the Elasticsearch cluster. For large clusters, do not flag "master-eligible" nodes to also function as "data" nodes, and vice-versa.

        Data Node 

        This node-type allows the node to be a “data” node in an Elasticsearch cluster.

        Ingest Node

This node type allows the node to act as an "ingest" node. If you are using the "Ingest Attachment" plugin, it is a good idea to direct the data traffic associated with file uploads to ingest-only nodes, since full-text extraction from different file types is a very CPU-intensive process and you are better off dedicating nodes to that operation.

        Coordinating Node

This node type allows the node to coordinate search requests across "data" nodes and aggregate the results before sending them back to the client. As a result, this node should have enough memory and CPU resources to perform that role efficiently.

        Tribe Node

It is a special type of "coordinating" node that allows cross-cluster search operations.

NOTE: By default, an Elasticsearch node is a "master-eligible", "data", "ingest" and "coordinating" node all at once. This works for small clusters, but for large clusters you need to plan these node roles (types) for scalability and performance.

• Apache Solr and Elasticsearch are competing open source platforms that both use Apache Lucene as the underlying search engine. Elasticsearch is the more recent of the two and has been able to benefit from lessons learned from earlier competing platforms. It has gained popularity because its cluster setup is simpler than Apache Solr's, and many cloud providers (like AWS) offer Elasticsearch as a platform as a service (PaaS).
• Elasticsearch has a commercial product called X-Pack; at the time of writing you have to pay for it, but the company behind Elasticsearch has announced that it will open up the X-Pack code this year. X-Pack provides the following capabilities:


1. Securing your Elasticsearch cluster through authentication/authorization
2. Alerting
3. Monitoring your clusters
4. Reporting
5. Exploring relationships in your Elasticsearch data through graph visualization
6. Machine learning on your Elasticsearch data
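Coming back to the first takeaway above, here is a hedged sketch of the offline batch approach; it assumes the Python bindings for Apache Tika (the tika package) and the Python Elasticsearch client, and the file paths, index and field names are illustrative.

```python
# Run Apache Tika outside the cluster to extract metadata and full text,
# then index the result directly as a JSON-type document, bypassing the
# ingest node. Paths, index and field names are illustrative assumptions.
from elasticsearch import Elasticsearch
from tika import parser   # pip install tika (spawns a local Tika server)

es = Elasticsearch(["http://localhost:9200"])

def ingest_file(path):
    parsed = parser.from_file(path)           # CPU-heavy work happens here
    es.index(
        index="documents",
        doc_type="doc",
        id=path,
        body={
            "filename": path,
            "metadata": parsed.get("metadata", {}),
            "content": parsed.get("content") or "",   # clear-text full text
        },
    )

for pdf in ["/archive/report-2017.pdf", "/archive/report-2018.pdf"]:
    ingest_file(pdf)
```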

          Conclusion
The ELK+ stack is an open source platform that allows you to quickly set up a cluster for visual data analysis as well as a Google-like search capability within your organization, and it's all "free". I hope this article gives you enough details to use the ELK+ stack in your own organization.

          Sunday, March 4, 2018

          CI/CD automation for Agile projects - AWS CloudFormation (IaC), Jenkins, Git, Jira and Selenium

          Introduction
In this article I describe how to automate the entire software development process using Continuous Integration (CI)/Continuous Deployment (CD) toolsets. I will use the following CI/CD toolset to explain complete end-to-end automation:

           
          Architectural Overview
          CI/CD Automation Flow
In the above diagram I use numbered points to explain a typical automation flow in CI/CD-style application development.

          Point 1: Source Code Management (SCM)
Developers use some sort of source code management software (like Git) to check in code. Any source code repository allows multiple release branches to be maintained for parallel releases. A code "check-in" event can trigger an automatic build process, and you can control when to build based on predefined rules, for example:
• For Git you can trigger automatic builds based on release "tags" – like rel1.0, rel2.0, etc.
• For GitHub you can trigger automatic builds through webhooks that call your endpoint URL; within your endpoint code you can inspect the GitHub event to determine whether a build is needed (see the sketch below).
The point is that SCM lets you associate "check-in" events with automatic builds based on pre-defined rules, which gives you control over which check-in events cause automatic builds.
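As a hedged sketch of the "endpoint URL code" mentioned in the GitHub bullet above, the following minimal Flask handler inspects the webhook event and, for release-tag pushes, kicks off a Jenkins build through Jenkins' remote build-trigger URL; the job name, token and tag convention are assumptions.

```python
# Minimal GitHub webhook endpoint: only push events for release tags
# (rel1.0, rel2.0, ...) trigger a Jenkins build. Job URL/token are assumed.
import requests
from flask import Flask, request

app = Flask(__name__)
JENKINS_TRIGGER = "https://jenkins.example.com/job/my-app/build?token=BUILD_TOKEN"

@app.route("/github-webhook", methods=["POST"])
def github_webhook():
    event = request.headers.get("X-GitHub-Event", "")
    payload = request.get_json(silent=True) or {}
    ref = payload.get("ref", "")

    if event == "push" and ref.startswith("refs/tags/rel"):
        requests.post(JENKINS_TRIGGER)   # Jenkins "trigger builds remotely"
        return "build triggered", 200
    return "ignored", 200

if __name__ == "__main__":
    app.run(port=8080)
```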

          Point 2: Build Server
The build server is the main component that ties all the CI/CD pieces together by orchestrating the build process through numerous plugins. One such tool is Jenkins, an open source build server that provides plugins to integrate with:
• SCM tools like Git, GitHub, CodeCommit, etc.
• Cloud providers like AWS, GCP, Azure
• Java applications through the Maven plugin
• .NET applications through the MSBuild plugin
• ALM tools like Jira
AWS also provides an end-to-end build service called AWS CodePipeline, which is worth considering if all the components you use are from the AWS cloud (for example, CodeCommit instead of Git).

          Point 3: Application Lifecycle Management (ALM) tools like Jira
With Jenkins you can use the Jira plugin to update issues opened in Jira, so business owners can look at the Jira comments section to see which issues a build fixes. I have written about this type of integration in my blog: http://pmtechfusion.blogspot.com/2017/03/application-lifecycle-management-alm.html. I encourage you to read that post if you are interested in the details.
          Point 4: Shared Infrastructure Stack

With Infrastructure as Code (IaC), it's possible to define infrastructure as code scripts that are checked into SCM along with the application code, allowing infrastructure templates to be versioned just like application code. AWS provides CloudFormation to define infrastructure templates in JSON or YAML format (Azure and Google GCP provide similar scripting capabilities). In CI/CD-style development automation you should define at least two infrastructure templates: one for the infrastructure that is shared across applications and another for the application-specific infrastructure and technology stack. The former is usually not torn down with each automatic deployment.
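As a minimal sketch of driving such templates from code, the following Python (boto3) snippet launches the shared-infrastructure stack from a template stored in S3; the stack name, template URL and parameters are illustrative assumptions.

```python
# Launch the shared-infrastructure CloudFormation stack from a template in
# S3, then wait for it to finish. Names and parameters are assumptions.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName="shared-infrastructure",
    TemplateURL="https://s3.amazonaws.com/my-iac-bucket/shared-infra.yaml",
    Parameters=[
        {"ParameterKey": "VpcCidr", "ParameterValue": "10.0.0.0/16"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],
)

# Wait until the shared stack is ready before application stacks reference it.
cfn.get_waiter("stack_create_complete").wait(StackName="shared-infrastructure")
```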
          Shared infrastructure stack – Architectural Diagram
          NOTE: The above shared infrastructure is defined with High Availability in mind.

A sample infrastructure layout is shown above. Some of the AWS components that you define in the shared-infrastructure CloudFormation stack are listed below (a minimal template sketch follows the list):
          • VPC – Virtual Private Cloud (the virtual network)
• Internet Gateway (IGW) – this gives the VPC access to the internet.
• NAT Gateway – defined in the shared infrastructure so that private subnets in the application-specific stacks can use the NAT gateway to access the internet.
• Elastic IP – NAT gateways need a public IP (an Elastic IP provides that in AWS).
• Route Tables – typically you define three route tables (at a minimum):
1. One for public routes
2. Two for private routes (since there are two private subnets, there are two route tables). In each private route table you define an outbound rule that sends all internet traffic to the respective NAT gateway. This is how EC2/RDS instances in the private subnets access the internet for package/software updates.
• S3 – the AWS storage used to store CloudFormation templates and application-specific code.
• Two public subnets (A and B) – one for each of the two NAT gateways, for high availability.
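To give a feel for what a couple of these shared resources look like in a CloudFormation template, here is a hedged sketch expressed as a Python dictionary; only the VPC and Internet Gateway are shown, and the CIDR block and logical names are assumptions.

```python
# A fragment of the shared-infrastructure template, built as a Python dict.
# It can be uploaded to S3, or passed inline to the create_stack call shown
# earlier via TemplateBody=json.dumps(shared_template).
import json

shared_template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Resources": {
        "SharedVpc": {
            "Type": "AWS::EC2::VPC",
            "Properties": {"CidrBlock": "10.0.0.0/16", "EnableDnsSupport": True},
        },
        "SharedIgw": {"Type": "AWS::EC2::InternetGateway"},
        "IgwAttachment": {
            "Type": "AWS::EC2::VPCGatewayAttachment",
            "Properties": {
                "VpcId": {"Ref": "SharedVpc"},
                "InternetGatewayId": {"Ref": "SharedIgw"},
            },
        },
    },
}

print(json.dumps(shared_template, indent=2))
```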
Point 5: Application Infrastructure Stack

This is the CloudFormation stack that will be continuously built, destroyed, and rebuilt as part of the CI/CD process. By having a separate stack for application-specific builds you keep different applications from overlapping with each other's builds.
          Application infrastructure stack – Architectural Diagram
The green shaded box shows the shared infrastructure stack defined earlier. The blue shaded box shows the application-specific CloudFormation stack.

Some of the AWS components that you define in the application-specific CloudFormation stack are listed below:
• You will typically have two public subnets (C and D) – these subnets are associated with the public "RouteTable" defined in the shared infrastructure for routing packets.
• For the application layer you define two private subnets (A and B), and for the database layer (specifically RDS) you are required to have two subnets (C and D, defined as private here for security) for high availability. These subnets use the two private "RouteTable"s defined in the shared infrastructure for packet routing, and since each "RouteTable" has a NAT gateway defined for outbound internet traffic, EC2 and RDS instances can communicate with the outside world only through these route tables.
• You define two Application Load Balancers (ALBs) in the public subnets (C and D).
• You define, at a minimum, the following security groups:
1. Security groups (SGs) that allow application-layer instances to access database-layer instances.
2. An SG for the ALBs so that they can access the application layer.
3. An SG to allow public traffic through ports 80/443 of the ALB (you typically terminate SSL at the ALB).
4. An SG for a bastion EC2 instance (not shown in the diagram) running in either public subnet A or B that can be accessed via port 22 (SSH) from your corporate public IPs. Bastion EC2 instances are how you get into the infrastructure to administer your VPC network and EC2 instances.
• An Auto Scaling group and Launch Configuration for your servers
• RDS for your database layer
• CodeDeploy components in the application stack that use the S3 bucket to update the application code, using a deployment group to deploy different build versions. New versions of the application code are built by Jenkins whenever an automatic build is triggered (say, by a release-tag commit in Git). Jenkins then puts the application build artifact in AWS S3 and uses the AWS API to run an "Update Stack" command against the application-specific stack, resulting in automatic deployments (see the sketch after this list).
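As a hedged sketch of that deployment step, the following Python (boto3) snippet publishes the build artifact to S3 and then issues the "Update Stack" call; the bucket, key, stack name and parameter are illustrative assumptions, and the exact CodeDeploy wiring depends on your template.

```python
# Push the new application build to S3, then update the application stack
# so the deployment group rolls out the new revision. Names are assumed.
import boto3

s3 = boto3.client("s3")
cfn = boto3.client("cloudformation")

# 1. Publish the artifact produced by the automatic build.
s3.upload_file("target/my-app-rel2.0.zip",
               "my-iac-bucket", "builds/my-app-rel2.0.zip")

# 2. Update the application stack; here an assumed ApplicationVersion
#    parameter points the stack at the new artifact.
cfn.update_stack(
    StackName="my-app-stack",
    UsePreviousTemplate=True,
    Parameters=[
        {"ParameterKey": "ApplicationVersion",
         "ParameterValue": "builds/my-app-rel2.0.zip"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
cfn.get_waiter("stack_update_complete").wait(StackName="my-app-stack")
```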
In the application layer you can have an application server running a JEE/Spring/Hibernate stack, an ASP.NET/WCF/Entity Framework .NET stack, Python/Django, Node.js, or PHP.

In the database layer, RDS can run MySQL, SQL Server, Aurora, etc., or you can instead use a NoSQL data layer such as MongoDB, DynamoDB or DocumentDB.

          CI/CD Automatic Flow:

          The diagram below shows the complete logic that triggers the CI/CD process.
NOTE: It is technically possible, and certainly desirable, to gate the process on testing results, so that the success or failure of the automated Selenium tests determines whether the production deployment should occur. This flow is not shown above, but it is the next logical step toward automating production deployments as well (a sketch follows).
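As a hedged sketch of that gating idea, the following Python snippet runs a trivial Selenium smoke test against the test environment and triggers the production deployment job only if it passes; the URLs, Jenkins job name and page-title check are illustrative assumptions.

```python
# Gate production deployment on a Selenium smoke test of the test
# environment. URLs, job name, token and title check are assumptions.
import requests
from selenium import webdriver

TEST_URL = "https://test.my-app.example.com/"
PROD_DEPLOY_TRIGGER = "https://jenkins.example.com/job/my-app-prod-deploy/build?token=DEPLOY_TOKEN"

driver = webdriver.Chrome()
try:
    driver.get(TEST_URL)
    smoke_test_passed = "My App" in driver.title
finally:
    driver.quit()

if smoke_test_passed:
    # Success gates the next stage: production deployment.
    requests.post(PROD_DEPLOY_TRIGGER)
else:
    raise SystemExit("Smoke test failed - production deployment not triggered")
```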
           
          Conclusion
You can use the toolsets mentioned in this article to do Continuous Integration and Continuous Deployment on your project. Infrastructure as Code (IaC) is changing the way software development is done, and CI/CD automation is helping application teams stay agile, stay productive, and deliver higher-quality software. DevOps is truly bridging the gap between development and operations and merging the two into one cohesive role.