Overview
As a Data Architect / Developer I spent some time doing research on Big Data. I set up a single-node server, figured out how to load some data, and ran the very popular word count MapReduce job.
For me this was not enough.
I had more questions than answers at this point.
I decided to set some objectives for a realistic feasibility test to consider moving from a traditional ETL / RDBMS based system to an HDFS / MapReduce / NoSQL based platform.
My overall objective was to utilize the core features of Hadoop and produce a production-quality, end-to-end hardware and software prototype.
Objectives
4. NoSQL Database – HBase
6. Process Data (Example 2: Pig) – Move the data from HDFS into HBase using Pig.
My goal here is to share the copious amount of searching and trial and error it took to finally put together a functioning end-to-end prototype, starting from HDFS and ending with a visualized report, demonstrating the different functionality of Hadoop.
Sizing a true production environment can be a complex calculation, so I decided to use a realistic Master and Slave configuration to test the distributed processing capabilities of Hadoop. I went with 8 servers for the cluster and 1 provisioning server, for a total of 9 servers.
Where am I going to find 9 servers lying around? Fortunately for me, I have a VMware cluster in my garage with 50TB of storage. I know, how geeky can you get. Some people have garage bands; I have a garage data center. I created 9 hosts and began the task of determining what OS and flavor of Hadoop I was going to install.
Note: if you wish to skip the hardware sizing objective, you can still run the rest of this project on a single machine. Warning: it will be slower.

Configuring a Hadoop cluster can be an involved task. Getting the Master and the Nodes to work together, and configuring them by hand, would take some time. Fortunately, organizations have designed provisioning software to manage this activity for us. I took some time and played around with Cloudera and Hortonworks. I have to say they have done a great job with these tools, but to me they have commercialized what is already available from Apache. I wanted a true Apache approach to managing my cluster, and I ran across Ambari. Ambari is Apache's approach to provisioning, managing and monitoring an Apache Hadoop cluster. This was exactly what I was looking for, so I decided to use Ambari. It seems to be a complete Hadoop platform, just short of Mahout.
Link to Ambari Website (the version I used was 1.6)
Ambari Breakdown – What is installed (I'll try to give you a definition from my perspective):
· HDFS – This is where the data is stored. It is a file system similar to the file system on your computer, but with one big difference: it is distributed!
Some people say the advantage of Big Data is that it is "unstructured": "throw whatever you want in it." These people never wrote a Java application! OK, yes, we don't have to define every column we would ever need beforehand, like a traditional RDBMS, so if that is your definition of unstructured, then fine. BUT if you have control over your data, try to make the data parseable: CSV, JSON, BSON, XML, etc. There are some cool compressible formats as well. Having some structure makes your MapReduce jobs easier to write.
· YARN – YARN was created as a logical split-off from the original MapReduce. It takes the cluster resource management "traffic cop" and puts it as a layer on top of HDFS. (See diagram below.)
· Tez – "Tez will speed up Pig and Hive jobs by an order of magnitude." That is what they say. Whether it replaces what we have become accustomed to with traditional RDBMS systems, we will see. The jury is still out.
· MapReduce2 – A Java-based, ETL-like tool with typically two steps, the Map and the Reduce. The Map allows you to extract data from HDFS and transform it as you see fit. The Reduce step allows you to group by your key and then load the data into a database or back into HDFS as a file.
· HBase – A NoSQL column-oriented database (I hate the term NoSQL; it should be "almost no SQL"). If you are doing reporting, BI or any type of non-transactional data storage, this should be your choice. If you are building a large-capacity transaction-based system, then Cassandra would be more suitable.
· Hive – Adds a SQL-like query language layer over HDFS and HBase. There are lots of advantages to Hive, a big one being an ODBC bridge to Hadoop.
· WebHCat – A REST API, or Web API, to Hive, Pig or MapReduce. This allows developers to create applications that call or execute commands directly on Hadoop.
· Falcon – Data replication: replicate your data between clusters for disaster recovery or other purposes. Data lifecycle management: manage archiving or purging records from your data.
· Storm – An open source engine to process and stream data from Hadoop. Simply put, this allows a programmer to write a method to execute a job and stream its results back to an application in real time.
· Oozie – A workflow scheduler to help manage Hadoop jobs.
· Ganglia – A monitoring tool. This is critical when it comes to managing a large cluster and its performance.
· Nagios – Also a monitoring tool, there to complement Ganglia.
· ZooKeeper – A centralized service for managing configuration and coordinating changes within the cluster.
· Pig – A useful programming tool for creating quick MapReduce programs using a language called Pig Latin.
· Sqoop – A great extraction tool to transfer data between Hadoop and an RDBMS.
Not on Ambari but available from Apache to consider:
· Cassandra
· Mahout
Ambari Setup
1. Install OS (CentOS)
I installed CentOS on one host, then copied the VM to the other 8 servers to save some time.
2. Host Prep – This will hopefully ensure a smooth setup.
Host Names – I keep it simple: Ambari Server: ABM; Cluster Servers: ABS1–ABS8.
DNS / Reverse DNS – I have read in a few places that having DNS and reverse lookup configured can prevent some setup issues people have. I already have an Active Directory and DNS server, so I simply added the new hosts. You can also try adding all the host names to your hosts files; that may work too.
Edit hosts File – (For some redundancy, I added a record to the hosts file on each server to ensure name resolution.)
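For example (the addresses below are placeholders; substitute the IPs of your own hosts):
```
# /etc/hosts on every server (example addresses; use your own)
192.168.1.10   abm.home.local    abm
192.168.1.11   abs1.home.local   abs1
192.168.1.12   abs2.home.local   abs2
# ... repeat for abs3 through abs7
192.168.1.18   abs8.home.local   abs8
```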
Update SSL
SSH – Very important to the process. Ambari uses SSH to connect to each host and remotely set up and install the Hadoop cluster software. Generating the SSH keys and distributing the public key creates a root connection to each host without the need to enter your password.
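Roughly, on the Ambari server as root (host names as above):
```
# Generate a passwordless key pair
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Push the public key to every cluster host (enter each root password once)
for host in abs1 abs2 abs3 abs4 abs5 abs6 abs7 abs8; do
    ssh-copy-id root@$host
done

# Verify passwordless login works
ssh root@abs1 hostname
```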
IPTABLES – I disabled iptables just to make sure it didn't block any connections. In a production environment you may handle this differently.
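On CentOS 6 that amounts to something like this on every host:
```
# Stop the firewall now and keep it off after reboot
service iptables stop
chkconfig iptables off
```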
NTPD – Timing is important in a distributed environment. NTPD will ensure proper time synchronization.
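On CentOS 6, something like:
```
# Install, start and enable the NTP daemon on every host
yum -y install ntp
service ntpd start
chkconfig ntpd on
```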
IPv6 – Disable. Run on all servers:
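One way to do this on CentOS 6 is via sysctl:
```
# Disable IPv6 and persist the setting across reboots
echo "net.ipv6.conf.all.disable_ipv6 = 1"     >> /etc/sysctl.conf
echo "net.ipv6.conf.default.disable_ipv6 = 1" >> /etc/sysctl.conf
sysctl -p
```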
3. Server Setup – Link to Ambari 1.6.0 Setup
Install Ambari:
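Assuming the public Hortonworks repository for Ambari 1.6.0 (verify the URL against the setup guide linked above), the install looks roughly like this:
```
# Add the Ambari 1.6.0 repo (check this URL against the setup guide)
cd /etc/yum.repos.d
wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.6.0/ambari.repo

# Install, configure and start the Ambari server
yum -y install ambari-server
ambari-server setup
ambari-server start
```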
Open the Ambari Web UI (http://&lt;ambari-host&gt;:8080 by default) and finish the install.
Now that Ambari is set up and running, it is time to load some data. What to load? There are lots of choices to pick from on the web, but I wanted something different, so I decided to look at IPv4. The total number of possible unique addresses is 4,294,967,296. If you add the number of ports per address, the total gets very large. So we have a reasonable number of rows, but we need some column depth as well. If you add the information we can look up for each IP address, things get interesting. Plus, this can make for some interesting reports to create later.
How to collect the data? I could write a port scanner, which is not too hard in Java, but that would take some time. Why reinvent the wheel when something may already be available?
I found Zenmap. It is open source and outputs its results in XML. I found the XML output exciting, as it offers an input format different from the traditional examples you see for MapReduce.
Once I had an output file created and saved, I needed to transfer the data into HDFS. Ideally, in a production environment everything would be automated, but for this test I was fine with a manual transfer. I have done it the long manual way, transferring data with FTP and the command line, but there had to be a drag-and-drop app to make this easy for development. After a little searching I found one by Redgate called HDFS Explorer. This tool makes transferring data from Windows a very easy drag-and-drop operation.
Setup:
I opened HDFS Explorer and created a simple folder structure on the root of HDFS:
Scanner – Root folder
Input – Location for MapReduce input files from Zenmap
Jar – Location for the MapReduce jar file
Next was a simple process to move/copy the output files from Zenmap using HDFS Explorer and put them into the "Input" folder on the server.
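If you prefer the command line over HDFS Explorer, the equivalent HDFS commands look roughly like this (the local file name is just an example):
```
# Create the folder structure and upload a Zenmap XML export
su - hdfs
hdfs dfs -mkdir -p /Scanner/Input /Scanner/Jar
hdfs dfs -put scan_results.xml /Scanner/Input/
hdfs dfs -ls /Scanner/Input
```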
Data Model
Now that we
have data it would be advantageous to model the data out.
HBase
I picked HBase, one, because that is what was installed with Ambari, and two, because it seems to be positioned in line with reporting / data warehousing / BI / analytics.
Let's perform some quick tests with HBase to confirm the installation.
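A quick smoke test from the HBase shell, using a throwaway table name of my own choosing:
```
hbase shell
status                                   # confirm the cluster reports live region servers
create 'smoke_test', 'cf'                # throwaway table
put 'smoke_test', 'row1', 'cf:col', 'hello'
scan 'smoke_test'                        # should return the row we just put
disable 'smoke_test'
drop 'smoke_test'
exit
```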
HIVE
In order to query and pull the data, we need a tool that can access the data and also has ODBC support. That tool is Hive.
For Hive to access tables in HBase, the tables need to be created through Hive. This creates the mapping between Hive and HBase.
Create Tables:
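As a sketch, an HBase-backed Hive table built with the HBase storage handler looks like this. The column list mirrors the hosts query used later in this post, but the column family name ("h") and the exact mapping are assumptions:
```
CREATE TABLE hosts (
  key STRING, ipaddress STRING, status STRING, vendor STRING,
  hostname STRING, netname STRING, orgname STRING, address STRING,
  city STRING, stateprov STRING, postalcode STRING, country STRING,
  netrange STRING, aclass STRING, bclass STRING, cclass STRING, dclass STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
  ":key,h:ipaddress,h:status,h:vendor,h:hostname,h:netname,h:orgname,h:address,h:city,h:stateprov,h:postalcode,h:country,h:netrange,h:aclass,h:bclass,h:cclass,h:dclass")
TBLPROPERTIES ("hbase.table.name" = "hosts");

-- ports and hostports follow the same pattern with their own column lists.
```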
I'm assuming at this point you know what MapReduce is. Hadoop is changing very rapidly, and with this change so do the base classes we use to write MapReduce jobs. What I found confusing in the beginning was all the variations you find on the web and trying to determine what the latest classes are and the format to use. I found a reference example in the latest version that I used as the basis for my code.
In this project we are using a MapReduce job that reads our XML files from HDFS and outputs to HBase, populating the three tables we just created.
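To give a feel for the shape of such a job, here is a minimal, self-contained sketch of a map-only job that reads Zenmap XML lines from HDFS and writes rows into the HBase hosts table. The parsing, the "info" column family and the class names are simplifications for illustration, not the actual Scanner.jar code linked below:
```
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class HostLoader {

  // Mapper: pull the IP out of Zenmap <address addr="..." addrtype="ipv4"/> lines
  // and emit an HBase Put keyed on the IP address.
  public static class HostMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      if (!line.contains("addrtype=\"ipv4\"")) {
        return;                                          // skip non-address lines
      }
      String ip = line.replaceAll(".*addr=\"([^\"]+)\".*", "$1");
      Put put = new Put(Bytes.toBytes(ip));              // row key = IP address
      put.add(Bytes.toBytes("info"),                     // assumed column family
              Bytes.toBytes("ipaddress"), Bytes.toBytes(ip));
      context.write(new ImmutableBytesWritable(Bytes.toBytes(ip)), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "zenmap-host-loader");
    job.setJarByClass(HostLoader.class);
    job.setMapperClass(HostMapper.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));       // e.g. /Scanner/Input
    TableMapReduceUtil.initTableReducerJob("hosts", null, job); // Puts go to "hosts"
    job.setNumReduceTasks(0);                                   // map-only load
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```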
Build / Export / Run
There are many tools out there to compile Java and export your jar file. I decided to use Eclipse.
Java files needed for Project:
(Click to download files)
Create your project and add the
files.
· Build Project
· Export Jar
Copy Scanner.jar file to Server
· Using HDFS Explorer, copy the jar into the Jar directory.
Next we will run the MapReduce
Job.
We need to: connect to the master HBase host (it has the jar files we need), switch to the hdfs account, change to its home directory, export the HBase classpath so we have the jar files we need, and run the jobs.
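The general pattern looks like this; the master host name, the jar location and the driver class name are placeholders:
```
ssh root@abs1                     # host running the HBase Master (example host name)
su - hdfs
cd ~

# Make the HBase jars visible to the job
export HADOOP_CLASSPATH=$(hbase classpath)

# hadoop jar needs a local copy of the jar, so pull it down from HDFS first
hdfs dfs -get /Scanner/Jar/Scanner.jar .
hadoop jar Scanner.jar PortsLoader /Scanner/Input   # driver class name is a placeholder
```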
You can check the progress of the job in the Hadoop ResourceManager UI.
Once the job has finished, check the results:
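A quick way to verify is a row count from the HBase shell, for example:
```
hbase shell
count 'ports'                     # total rows loaded
scan 'ports', {LIMIT => 5}        # eyeball a few rows
exit
```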
Run job to populate Hosts table
Once the job has finished, check the results:
Run job to populate Host Ports table
Once the job has finished, check the results:
Coming soon I will put together a Pig script (Pig Latin) that runs the same MapReduce example as above.
We have completed the import of data into HBase. Now it is time to analyze what we have.
I picked a few tools to accomplish the visualization for our analysis: Excel, Tableau and MicroStrategy.
Server Setup:
HIVE Config Change
For our development environment we need to disable the Hive security authorization so we don't get any connectivity errors. I do not recommend doing this in a production environment. Take the time to set up security correctly.
We need to create a new Config Group.
Next we need to add hosts.
You should see your new group with all the hosts.
Next we need to override a setting in the XML config.
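The hive-site.xml property that controls this is hive.security.authorization.enabled; the override sets it to false:
```
<property>
  <name>hive.security.authorization.enabled</name>
  <value>false</value>
</property>
```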
We need to restart the service.
Client Setup (Windows):
ODBC
We need connectivity to Hive. To accomplish this we need to install an ODBC driver so we can run queries and return data from HBase through Hive.
Cloudera has drivers available: download both the 64-bit and the 32-bit versions.
Install Drivers – you need to install both the 64-bit and the 32-bit drivers. Tableau works with the 32-bit one.
Tableau
Tableau is a very popular analysis tool that has some really nice features. I'm not going to turn this into a Tableau how-to document, but will cover some basics.
Open Tableau
On the Cloudera Hadoop Hive Connection Window
I’m not going into detail on
how to create a report.
I created a Simple Port
Summary report
Excel – Power Pivot
Excel is a widely used tool with new features that really extend its functionality. I'm not going to turn this into an Excel Power Pivot how-to document, but will cover some basics.
Add ODBC Source
Enable PowerPivot (if not already enabled)
Open Excel
Next we will import some data
Create Relationships
Create Report
You will get a Blank Power
View window
I’m not going into detail on
how to create a report.
I created a Simple Port Power
View Summary report
Off the same data source I
created a standard Pivot Table
MicroStrategy
MicroStrategy is a great web-based tool that has some really nice features. I'm not going to turn this into a MicroStrategy how-to document, but will cover some basics.
Import Data
· Click on Import and then Data
· Next click on Database
· Select DSN Connections
· Under DSN select the ODBC connection we created earlier
· Enter hive under User ID
· Name the connection and click OK
· Click Edit SQL and paste the SQL below into the editor
SELECT `hostports`.`key` AS `key`,
`hostports`.`hostkey` AS `hostkey`,
`hostports`.`portkey` AS `portkey`,
`hostports`.`state` AS `state`,
`hostports`.`reason` AS `reason`,
`hostports`.`count` AS `count`,
`ports`.`key` AS `ports_key`,
`ports`.`port` AS `port`,
`ports`.`protocol` AS `protocol`,
`ports`.`name` AS `name`,
`hosts`.`key` AS `host_key`,
`hosts`.`ipaddress` AS `ipaddress`,
`hosts`.`status` AS `status`,
`hosts`.`vendor` AS `vendor`,
`hosts`.`hostname` AS `hostname`,
`hosts`.`netname` AS `netname`,
`hosts`.`orgname` AS `orgname`,
`hosts`.`address` AS `address`,
`hosts`.`city` AS `city`,
`hosts`.`stateprov` AS `stateprov`,
`hosts`.`postalcode` AS `postalcode`,
`hosts`.`country` AS `country`,
`hosts`.`netrange` AS `netrange`,
`hosts`.`aclass` AS `aclass`,
`hosts`.`bclass` AS `bclass`,
`hosts`.`cclass` AS `cclass`,
`hosts`.`dclass` AS `dclass`
FROM `default`.`hostports` `hostports`,`default`.`ports` `ports`, `default`.`hosts` `hosts`
WHERE (`hostports`.`portkey` = `ports`.`key`) AND (`hostports`.`hostkey` = `hosts`.`key`)
· Next click Execute SQL (wait for the results)
· Next click Continue in the lower right
· Then save the results
· Wait for the processing
· Click Create Dashboard
· I'm not going into detail on how to create a report.
· I created a simple Summary report. See below.
What's next
For me, predictive analysis would be next: using Mahout to create an analysis of potential security issues by org or IP.
Conclusion
I hope this was helpful to you. I wanted to provide a one-stop shop for users researching and/or implementing Big Data solutions, so I tried to consolidate all the trial and error and research it took me to put this together in one spot.
Questions, comments or suggestions are welcome.
Thanks
Jerry Baird