Overview
As a Data Architect / Developer I spent some time doing research on Big Data. I set up a single-node server, figured out how to load some data, and ran the very popular word count MapReduce job.
For me this was not enough.
I had more questions than answers at this point.
I decided to set some objectives for a realistic feasibility test to consider moving from a traditional ETL / RDBMS based system to an HDFS / MapReduce / NoSQL based platform.
My overall objective was to utilize the core features of Hadoop and produce a production-quality, end-to-end hardware and software prototype.
Objectives
4. NoSQL Database – HBase
6. Process Data (Example 2: Pig) – Move the data from HDFS into HBase using Pig.
My goal here is to share the copious amount of searching and trial and error it took to finally put together a functioning end-to-end prototype, starting from HDFS and ending with a visualized report, demonstrating the different functionality of Hadoop.
Sizing a true production environment can be a complex calculation, so I decided to use a realistic Master and Slave configuration to test the distributed processing capabilities of Hadoop. I went with 8 servers for the cluster and 1 provisioning server, for a total of 9 servers.
Where am I going to find 9 servers lying around? Fortunately for me, I have a VMware cluster in my garage with 50TB of storage. I know, how geeky can you get. Some people have garage bands; I have a garage data center. I created 9 hosts and began the task of determining what OS and flavor of Hadoop I was going to install.
Note: if you wish to skip the hardware sizing objective, you can still run the rest of this project on a single machine. Warning: it will be slower.

Configuring a Hadoop cluster can be an involved task. Getting the Master and the Nodes to work together, and configuring them by hand, would take some time. Fortunately, organizations have designed provisioning software to manage this activity for us. I took some time and played around with Cloudera and Hortonworks. I have to say they have done a great job with these tools, but to me they have commercialized what is already available from Apache. I wanted a true Apache approach to managing my cluster, and I ran across Ambari. Ambari is Apache's approach to provisioning, managing and monitoring an Apache Hadoop cluster. This was exactly what I was looking for, so I decided to use Ambari. It seems to be a complete Hadoop platform, just short of Mahout.
Link to Ambari Website (the version I used was 1.6)
Ambari Breakdown – What is installed (I'll try to give you a definition from my perspective):
· HDFS – This is where the data is stored. It is a file system similar to the file system on your computer, but with one big difference: it is distributed!
Some people say the advantage of Big Data is that it is "unstructured": "throw whatever you want in it." These people never wrote a Java application! OK, yes, we don't have to define every column we would ever need beforehand, like a traditional RDBMS, so if that is your definition of unstructured, then fine. BUT if you have control over your data, try to make the data parseable: CSV, JSON, BSON, XML, etc. There are some cool compressible formats as well. Having some structure makes your MapReduce jobs easier to write.
· YARN – YARN was created as a logical split-off from the original MapReduce. It takes the cluster resource management "traffic cop" and puts it as a layer on top of HDFS. (See diagram below.)
· Tez – "Tez will speed up Pig and Hive jobs by an order of magnitude." That is what they say. Whether it replaces what we have become accustomed to with traditional RDBMS systems, we will see. The jury is still out.
· MapReduce2 – A Java-based, ETL-like tool with typically two steps, the Map and the Reduce. The Map allows you to extract data from HDFS and transform it as you see fit. The Reduce step allows you to group by your key and then load the data into a database or back into HDFS as a file.
· HBase – A NoSQL column-oriented database (I hate the term NoSQL; it should be "almost no SQL"). If you are doing reporting, BI or any type of non-transactional data storage, this should be your choice. If you are building a large-capacity transaction-based system, then Cassandra would be more suitable.
· Hive – Adds a SQL-like query language layer over HDFS and HBase. There are lots of advantages to Hive, a big one being an ODBC bridge to Hadoop.
· WebHCat – A REST API, or Web API, to Hive, Pig or MapReduce. This allows developers to create applications that call or execute commands directly on Hadoop.
· Falcon – Data replication: replicate your data between clusters for disaster recovery or other purposes. Data lifecycle management: manage archiving or purging records from your data.
· Storm – An open source engine to process and stream data from Hadoop. Simply put, this allows a programmer to write a method to execute a job and stream its results back to an application in real time.
· Oozie – A workflow scheduler to help manage Hadoop jobs.
· Ganglia – A monitoring tool. This is critical when it comes to managing a large cluster and its performance.
· Nagios – Also a monitoring tool, there to complement Ganglia.
· ZooKeeper – A centralized service for managing configuration and coordinating changes within the cluster.
· Pig – A useful programming tool for creating quick MapReduce programs using a language called Pig Latin.
· Sqoop – A great extraction tool to transfer data between Hadoop and an RDBMS.
Not on Ambari but available from Apache to consider:
· Cassandra
· Mahout
Ambari Setup
1. Install OS (CentOS)
I installed CentOS on one host, then copied the VM to the other 8 servers to save some time.
2. Host Prep – This will hopefully ensure a smooth setup.
Host Names – I keep it simple: Ambari Server: ABM; Cluster Servers: ABS1–ABS8.
DNS / Reverse DNS – I have read in a few places that having DNS and reverse lookup configured can prevent some setup issues people have. I already have an Active Directory and DNS server, so I simply added the new hosts. You can also try adding all the host names to your hosts files; that may work too.
Edit hosts File – (For some redundancy, I added a record to the hosts file on each server to ensure name resolution.)
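For example (the addresses below are placeholders; substitute the IPs of your own hosts):
```
# /etc/hosts on every server (example addresses; use your own)
192.168.1.10   abm.home.local    abm
192.168.1.11   abs1.home.local   abs1
192.168.1.12   abs2.home.local   abs2
# ... repeat for abs3 through abs7
192.168.1.18   abs8.home.local   abs8
```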
Update SSL
SSH – Very important to the process. Ambari uses SSH to connect to each host and remotely set up and install the Hadoop cluster software. Generating the SSH keys and distributing the public key creates a root connection to each host without the need to enter your password.
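Roughly, on the Ambari server as root (host names as above):
```
# Generate a passwordless key pair
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Push the public key to every cluster host (enter each root password once)
for host in abs1 abs2 abs3 abs4 abs5 abs6 abs7 abs8; do
    ssh-copy-id root@$host
done

# Verify passwordless login works
ssh root@abs1 hostname
```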
IPTABLES – I disabled iptables just to make sure it didn't block any connections. In a production environment you may handle this differently.
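On CentOS 6 that amounts to something like this on every host:
```
# Stop the firewall now and keep it off after reboot
service iptables stop
chkconfig iptables off
```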
NTPD – Timing is important in a distributed environment. NTPD will ensure proper time synchronization.
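On CentOS 6, something like:
```
# Install, start and enable the NTP daemon on every host
yum -y install ntp
service ntpd start
chkconfig ntpd on
```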
IPv6 – Disable. Run on all servers:
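One way to do this on CentOS 6 is via sysctl:
```
# Disable IPv6 and persist the setting across reboots
echo "net.ipv6.conf.all.disable_ipv6 = 1"     >> /etc/sysctl.conf
echo "net.ipv6.conf.default.disable_ipv6 = 1" >> /etc/sysctl.conf
sysctl -p
```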
3. Server Setup – Link to Ambari 1.6.0 Setup
Install Ambari:
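Assuming the public Hortonworks repository for Ambari 1.6.0 (verify the URL against the setup guide linked above), the install looks roughly like this:
```
# Add the Ambari 1.6.0 repo (check this URL against the setup guide)
cd /etc/yum.repos.d
wget http://public-repo-1.hortonworks.com/ambari/centos6/1.x/updates/1.6.0/ambari.repo

# Install, configure and start the Ambari server
yum -y install ambari-server
ambari-server setup
ambari-server start
```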
Open the Ambari Web UI (http://&lt;ambari-host&gt;:8080 by default) and finish the install.
Now that Ambari is set up and running, it is time to load some data. What to load? There are lots of choices to pick from on the web, but I wanted something different, so I decided to look at IPv4. The total number of possible unique addresses is 4,294,967,296. If you add the number of ports per address, the total gets very large. So we have a reasonable number of rows, but we need some column depth as well. If you add the information we can look up for each IP address, things get interesting. Plus, this can make for some interesting reports to create later.
How to collect the data? I could write a port scanner, which is not too hard in Java, but that would take some time. Why reinvent the wheel when something may already be available?
I found Zenmap. It is open source and outputs its results in XML. I found the XML output exciting, as it offers an input format different from the traditional examples you see for MapReduce.
Once I had an output file created and saved, I needed to transfer the data into HDFS. Ideally, in a production environment everything would be automated, but for this test I was fine with a manual transfer. I have done it the long manual way, transferring data with FTP and the command line, but there had to be a drag-and-drop app to make this easy for development. After a little searching I found one by Redgate called HDFS Explorer. This tool makes transferring data from Windows a very easy drag-and-drop operation.
Setup:
I opened HDFS Explorer and created a simple folder structure on the root of HDFS:
Scanner – Root folder
Input – Location for MapReduce input files from Zenmap
Jar – Location for the MapReduce jar file
Next was a simple process to move/copy the output files from Zenmap using HDFS Explorer and put them into the "Input" folder on the server.
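If you prefer the command line over HDFS Explorer, the equivalent HDFS commands look roughly like this (the local file name is just an example):
```
# Create the folder structure and upload a Zenmap XML export
su - hdfs
hdfs dfs -mkdir -p /Scanner/Input /Scanner/Jar
hdfs dfs -put scan_results.xml /Scanner/Input/
hdfs dfs -ls /Scanner/Input
```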
Data Model
Now that we
have data it would be advantageous to model the data out.
HBase
I picked HBase, one, because that is what was installed with Ambari, and two, because it seems to be positioned in line with reporting / data warehousing / BI / analytics.
Let's perform some quick tests with HBase to confirm the installation.
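A quick smoke test from the HBase shell, using a throwaway table name of my own choosing:
```
hbase shell
status                                   # confirm the cluster reports live region servers
create 'smoke_test', 'cf'                # throwaway table
put 'smoke_test', 'row1', 'cf:col', 'hello'
scan 'smoke_test'                        # should return the row we just put
disable 'smoke_test'
drop 'smoke_test'
exit
```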
HIVE
In order to query and pull the data, we need a tool that can access the data and also has ODBC support. That tool is Hive.
For Hive to access tables in HBase, the tables need to be created through Hive. This creates the mapping between Hive and HBase.
Create Tables:
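As a sketch, an HBase-backed Hive table built with the HBase storage handler looks like this. The column list mirrors the hosts query used later in this post, but the column family name ("h") and the exact mapping are assumptions:
```
CREATE TABLE hosts (
  key STRING, ipaddress STRING, status STRING, vendor STRING,
  hostname STRING, netname STRING, orgname STRING, address STRING,
  city STRING, stateprov STRING, postalcode STRING, country STRING,
  netrange STRING, aclass STRING, bclass STRING, cclass STRING, dclass STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" =
  ":key,h:ipaddress,h:status,h:vendor,h:hostname,h:netname,h:orgname,h:address,h:city,h:stateprov,h:postalcode,h:country,h:netrange,h:aclass,h:bclass,h:cclass,h:dclass")
TBLPROPERTIES ("hbase.table.name" = "hosts");

-- ports and hostports follow the same pattern with their own column lists.
```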
I'm assuming at this point you know what MapReduce is. Hadoop is changing very rapidly, and with this change so do the base classes we use to write MapReduce jobs. What I found confusing in the beginning was all the variations you find on the web and trying to determine what the latest classes are and the format to use. I found a reference example in the latest version that I used as the basis for my code.
In this project we are using a MapReduce job that reads our XML files from HDFS and outputs to HBase, populating the three tables we just created.
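To give a feel for the shape of such a job, here is a minimal, self-contained sketch of a map-only job that reads Zenmap XML lines from HDFS and writes rows into the HBase hosts table. The parsing, the "info" column family and the class names are simplifications for illustration, not the actual Scanner.jar code linked below:
```
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class HostLoader {

  // Mapper: pull the IP out of Zenmap <address addr="..." addrtype="ipv4"/> lines
  // and emit an HBase Put keyed on the IP address.
  public static class HostMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      if (!line.contains("addrtype=\"ipv4\"")) {
        return;                                          // skip non-address lines
      }
      String ip = line.replaceAll(".*addr=\"([^\"]+)\".*", "$1");
      Put put = new Put(Bytes.toBytes(ip));              // row key = IP address
      put.add(Bytes.toBytes("info"),                     // assumed column family
              Bytes.toBytes("ipaddress"), Bytes.toBytes(ip));
      context.write(new ImmutableBytesWritable(Bytes.toBytes(ip)), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "zenmap-host-loader");
    job.setJarByClass(HostLoader.class);
    job.setMapperClass(HostMapper.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));       // e.g. /Scanner/Input
    TableMapReduceUtil.initTableReducerJob("hosts", null, job); // Puts go to "hosts"
    job.setNumReduceTasks(0);                                   // map-only load
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```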
Build / Export / Run
There are many tools out there to compile Java and export your jar file. I decided to use Eclipse.
Java files needed for Project:
(Click to download files)
Create your project and add the
files.
· Build Project
· Export Jar
Copy Scanner.jar file to Server
· Using HDFS Explorer, copy the jar into the Jar directory.
Next we will run the MapReduce
Job.
We need to: connect to the master HBase host (it has the jar files we need), switch to the hdfs account, change to its home directory, export the HBase classpath so we have the jar files we need, and run the jobs.
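The general pattern looks like this; the master host name, the jar location and the driver class name are placeholders:
```
ssh root@abs1                     # host running the HBase Master (example host name)
su - hdfs
cd ~

# Make the HBase jars visible to the job
export HADOOP_CLASSPATH=$(hbase classpath)

# hadoop jar needs a local copy of the jar, so pull it down from HDFS first
hdfs dfs -get /Scanner/Jar/Scanner.jar .
hadoop jar Scanner.jar PortsLoader /Scanner/Input   # driver class name is a placeholder
```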
You can check the progress of the job in the Hadoop ResourceManager UI.
Once the job has finished, check the results:
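A quick way to verify is a row count from the HBase shell, for example:
```
hbase shell
count 'ports'                     # total rows loaded
scan 'ports', {LIMIT => 5}        # eyeball a few rows
exit
```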
Run job to populate Hosts table
Once the job has finished, check the results:
Run job to populate Host Ports table
Once the job has finished, check the results:
Coming soon I will put together a Pig script (Pig Latin) that runs the same MapReduce example as above.
We have completed the import of data into HBase. Now it is time to analyze what we have.
I picked a few tools to accomplish the visualization for our analysis: Excel, Tableau and MicroStrategy.
Server Setup:
HIVE Config Change
For our development environment we need to disable the Hive security authorization so we don't get any connectivity errors. I do not recommend doing this in a production environment. Take the time to set up security correctly.
We need to create a new Config Group.
Next we need to add hosts.
You should see your new group with all the hosts.
Next we need to override a setting in the XML config.
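The hive-site.xml property that controls this is hive.security.authorization.enabled; the override sets it to false:
```
<property>
  <name>hive.security.authorization.enabled</name>
  <value>false</value>
</property>
```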
We need to restart the service.
Client Setup (Windows):
ODBC
We need connectivity to Hive. To accomplish this we need to install an ODBC driver so we can run queries and return data from HBase through Hive.
Cloudera has drivers available: download both the 64-bit and the 32-bit versions.
Install Drivers – you need to install both the 64-bit and the 32-bit drivers. Tableau works with the 32-bit one.
Tableau
Tableau is a very popular analysis tool that has some really nice features. I'm not going to turn this into a Tableau how-to document, but will cover some basics.
Open Tableau
On the Cloudera Hadoop Hive Connection Window
I’m not going into detail on
how to create a report.
I created a Simple Port
Summary report
Excel – Power Pivot
Excel is a widely used tool with new features that really extend its functionality. I'm not going to turn this into an Excel Power Pivot how-to document, but will cover some basics.
Add ODBC Source
Enable PowerPivot (if not already enabled)
Open Excel
Next we will import some data
Create Relationships
Create Report
You will get a Blank Power
View window
I’m not going into detail on
how to create a report.
I created a Simple Port Power
View Summary report
Off the same data source I
created a standard Pivot Table
MicroStrategy
MicroStrategy is a great web-based tool that has some really nice features. I'm not going to turn this into a MicroStrategy how-to document, but will cover some basics.
Import Data
· Click on Import and then Data
· Next click on Database
· Select DSN Connections
· Under DSN select the ODBC connection we created earlier
· Enter hive under User ID
· Name the connection and click OK
· Click Edit SQL and paste the SQL below into the editor
SELECT `hostports`.`key` AS `key`,
`hostports`.`hostkey` AS `hostkey`,
`hostports`.`portkey` AS `portkey`,
`hostports`.`state` AS `state`,
`hostports`.`reason` AS `reason`,
`hostports`.`count` AS `count`,
`ports`.`key` AS `ports_key`,
`ports`.`port` AS `port`,
`ports`.`protocol` AS `protocol`,
`ports`.`name` AS `name`,
`hosts`.`key` AS `host_key`,
`hosts`.`ipaddress` AS `ipaddress`,
`hosts`.`status` AS `status`,
`hosts`.`vendor` AS `vendor`,
`hosts`.`hostname` AS `hostname`,
`hosts`.`netname` AS `netname`,
`hosts`.`orgname` AS `orgname`,
`hosts`.`address` AS `address`,
`hosts`.`city` AS `city`,
`hosts`.`stateprov` AS `stateprov`,
`hosts`.`postalcode` AS `postalcode`,
`hosts`.`country` AS `country`,
`hosts`.`netrange` AS `netrange`,
`hosts`.`aclass` AS `aclass`,
`hosts`.`bclass` AS `bclass`,
`hosts`.`cclass` AS `cclass`,
`hosts`.`dclass` AS `dclass`
FROM `default`.`hostports` `hostports`,`default`.`ports` `ports`, `default`.`hosts` `hosts`
WHERE (`hostports`.`portkey` = `ports`.`key`) AND (`hostports`.`hostkey` = `hosts`.`key`)
· Next click Execute SQL (wait for the results)
· Next click Continue in the lower right
· Then save the results
· Wait for the processing
· Click Create Dashboard
· I'm not going into detail on how to create a report.
· I created a simple Summary report. See below.
What's next
For me, predictive analysis would be next: using Mahout to create an analysis of potential security issues by org or IP.
Conclusion
I hope this was helpful to you. I wanted to provide a one-stop shop for users researching and/or implementing Big Data solutions, so I tried to consolidate all the trial and error and research it took me to put this together in one spot.
Questions, comments or suggestions are welcome.
Thanks
Jerry Baird