Distributed Systems Practice Notes

Big Data - Hadoop Intro

November 07, 2018

Learning Outcomes

  • What is Hadoop
  • Hadoop Characteristics
  • RDBMS vs. Hadoop
  • Big Data Processing
  • Hadoop Ecosystem Components

What is Hadoop?

Hadoop is a framework that allows distributed processing of large data sets across clusters of commodity computers using simple programming models. It was inspired by technical papers published by Google on the Google File System and MapReduce. Doug Cutting created Hadoop and named it after his son's yellow toy elephant.

Hadoop Characteristics

The 4 key characteristics of Hadoop are:

  • Economical - Ordinary commodity computers can be used for data processing.

  • Reliable - Stores copies of the data on different machines, so it is resistant to hardware failure.

  • Scalable - Supports both horizontal and vertical scaling.

  • Flexible - Can store as much data as needed and decide how to use it later.

RDBMS vs. Hadoop

An RDBMS stores data in a central location, and the data is sent to the processor at runtime. This works well for limited amounts of data, but it becomes a bottleneck when high volumes of data must be pushed to the processor.

Hadoop takes a radically different approach: the program goes to the data, not vice versa. Hadoop first distributes the data across multiple systems and later runs the computation wherever the data is located.

Big Data Processing

The four stages of Big Data processing each rely on the functionalities of Hadoop components:

  • Ingest
  • Process
  • Analyze
  • Access

Hadoop Ecosystem Components

  • Data Processing (runs on the Hadoop core)

    • Hadoop Distributed File System (HDFS)

      A storage layer for Hadoop, suitable for distributed storage and processing. Hadoop provides a command line interface to interact with HDFS. HDFS provides streaming access to file system data, file permissions, and authentication.
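
      The command line interface maps onto a Java API (org.apache.hadoop.fs.FileSystem). A minimal sketch, assuming Hadoop's client libraries are on the classpath; the NameNode address and the paths are placeholders:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class HdfsListing {
            public static void main(String[] args) throws Exception {
                // fs.defaultFS normally comes from core-site.xml; this
                // host and port are placeholders for illustration.
                Configuration conf = new Configuration();
                conf.set("fs.defaultFS", "hdfs://namenode:8020");
                FileSystem fs = FileSystem.get(conf);

                // Copy a local file into HDFS, then list the target directory.
                fs.copyFromLocalFile(new Path("/tmp/events.log"),
                                     new Path("/data/raw/events.log"));
                for (FileStatus status : fs.listStatus(new Path("/data/raw"))) {
                    System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
                }
                fs.close();
            }
        }

      The equivalent shell commands would be hdfs dfs -put and hdfs dfs -ls.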

    • Spark

      Spark is an open source cluster computing framework. Using in-memory primitives, it delivers up to 100 times better performance for certain applications than the two-stage, disk-based MapReduce paradigm of Hadoop. Spark can run in a Hadoop cluster and process data in HDFS, and it supports a wide variety of workloads, including machine learning, business intelligence, streaming, and batch processing.
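
      A minimal sketch of that in-memory style using Spark's Java API; the master setting and the HDFS path are placeholders (on a real cluster, spark-submit supplies the master):

        import org.apache.spark.api.java.function.FilterFunction;
        import org.apache.spark.sql.Dataset;
        import org.apache.spark.sql.SparkSession;

        public class SparkErrorCount {
            public static void main(String[] args) {
                // local[*] keeps the sketch runnable on one machine.
                SparkSession spark = SparkSession.builder()
                        .appName("ErrorCount")
                        .master("local[*]")
                        .getOrCreate();

                // Read a text file from HDFS and count the lines mentioning
                // ERROR; the filter is evaluated in memory across the cluster.
                Dataset<String> lines =
                        spark.read().textFile("hdfs://namenode:8020/data/raw/events.log");
                long errors = lines
                        .filter((FilterFunction<String>) line -> line.contains("ERROR"))
                        .count();
                System.out.println("error lines: " + errors);

                spark.stop();
            }
        }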

    • Hadoop MapReduce

      Hadoop MapReduce is the other framework that processes data. It is the original Hadoop processing engine, primarily Java based, and it follows the map-and-reduce programming model. It has an extensive and mature fault-tolerance framework built into the model; it is still very commonly used, but it is losing ground to Spark.
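
      The canonical illustration of the map-and-reduce model is word count. A condensed sketch (input and output paths come from the command line):

        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCount {
            // Map phase: emit (word, 1) for every word in the input split.
            public static class TokenizerMapper
                    extends Mapper<Object, Text, Text, IntWritable> {
                private static final IntWritable ONE = new IntWritable(1);
                private final Text word = new Text();
                public void map(Object key, Text value, Context context)
                        throws IOException, InterruptedException {
                    for (String token : value.toString().split("\\s+")) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }

            // Reduce phase: sum the counts emitted for each word.
            public static class IntSumReducer
                    extends Reducer<Text, IntWritable, Text, IntWritable> {
                public void reduce(Text key, Iterable<IntWritable> values, Context context)
                        throws IOException, InterruptedException {
                    int sum = 0;
                    for (IntWritable v : values) sum += v.get();
                    context.write(key, new IntWritable(sum));
                }
            }

            public static void main(String[] args) throws Exception {
                Job job = Job.getInstance(new Configuration(), "word count");
                job.setJarByClass(WordCount.class);
                job.setMapperClass(TokenizerMapper.class);
                job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
                job.setReducerClass(IntSumReducer.class);
                job.setOutputKeyClass(Text.class);
                job.setOutputValueClass(IntWritable.class);
                FileInputFormat.addInputPath(job, new Path(args[0]));
                FileOutputFormat.setOutputPath(job, new Path(args[1]));
                System.exit(job.waitForCompletion(true) ? 0 : 1);
            }
        }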

  • NoSQL

    • HBase

      HBase is a NoSQL database that stores its data in HDFS; it is mainly used when you need random, real-time read/write access to your Big Data. It supports high volumes of data and high throughput, and a table in HBase can have thousands of columns.
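
      A minimal sketch of random read/write with the HBase Java client, assuming a table "users" with a column family "info" already exists (the names are placeholders; the ZooKeeper quorum comes from hbase-site.xml):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.Connection;
        import org.apache.hadoop.hbase.client.ConnectionFactory;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.Table;
        import org.apache.hadoop.hbase.util.Bytes;

        public class HBaseReadWrite {
            public static void main(String[] args) throws Exception {
                Configuration conf = HBaseConfiguration.create();
                try (Connection conn = ConnectionFactory.createConnection(conf);
                     Table table = conn.getTable(TableName.valueOf("users"))) {

                    // Random write: one cell in column family "info".
                    Put put = new Put(Bytes.toBytes("user42"));
                    put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("city"),
                                  Bytes.toBytes("Washington"));
                    table.put(put);

                    // Random, real-time read of the same row.
                    Result result = table.get(new Get(Bytes.toBytes("user42")));
                    System.out.println(Bytes.toString(
                            result.getValue(Bytes.toBytes("info"), Bytes.toBytes("city"))));
                }
            }
        }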

  • Data Ingestion

    • Sqoop

      Sqoop is designed to transfer data between Hadoop and relational database servers; it is used to import data from an RDBMS (e.g. Oracle, MySQL) into HDFS and to export data from HDFS back to an RDBMS.
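
      Sqoop is normally driven from the shell ("sqoop import ..."), but the same tool can be invoked from Java. A rough sketch, assuming Sqoop 1.x's Sqoop.runTool entry point; the JDBC URL, credentials, table, and target directory are placeholders:

        import org.apache.sqoop.Sqoop;

        public class SqoopImportJob {
            public static void main(String[] args) {
                // Equivalent to "sqoop import --connect ... --table orders ...".
                String[] importArgs = {
                    "import",
                    "--connect", "jdbc:mysql://dbhost/shop",
                    "--username", "etl",
                    "--password", "secret",
                    "--table", "orders",
                    "--target-dir", "/data/orders",
                };
                System.exit(Sqoop.runTool(importArgs));
            }
        }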

    • Flume

      Flume is a distributed service that collects event data and transfers it to HDFS. It is ideally suited for streaming event data from multiple systems.

  • Data Analysis

    • Pig

      Pig is an open source, high-level data flow system mainly used for analytics. Pig converts its scripts into map-and-reduce code, saving the user from writing complex MapReduce programs. Ad-hoc operations like filter and join, which are difficult to perform in MapReduce, can be done easily using Pig.
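
      Pig scripts are usually run from the Grunt shell, but the engine can also be embedded in Java via PigServer. A minimal sketch of a filter, assuming tab-separated input (the paths and field names are placeholders):

        import org.apache.pig.ExecType;
        import org.apache.pig.PigServer;

        public class PigFilterJob {
            public static void main(String[] args) throws Exception {
                PigServer pig = new PigServer(ExecType.MAPREDUCE);

                // Each registerQuery call is one Pig Latin statement; Pig
                // compiles the flow into MapReduce jobs when "store" runs.
                pig.registerQuery(
                    "logs = LOAD '/data/raw/events.log' AS (level:chararray, msg:chararray);");
                pig.registerQuery("errors = FILTER logs BY level == 'ERROR';");
                pig.store("errors", "/data/out/errors");
            }
        }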

    • Impala

      Impala is an open source, high-performance SQL engine that runs on a Hadoop cluster. It is ideal for interactive analysis and has very low latency, measured in milliseconds. Impala supports a dialect of SQL (Impala SQL), so data in HDFS is modeled as a database table.

    • Hive

      Hive is an abstraction layer on top of Hadoop. It is very similar to Impala, but it is preferred for data processing and extract, transform, load (ETL) operations.
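
      Both Hive and Impala speak the HiveServer2 protocol, so either can be queried from Java over JDBC with the Hive driver on the classpath. A minimal sketch; the host, port, credentials, and table are placeholders for an unsecured development cluster:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class HiveQuery {
            public static void main(String[] args) throws Exception {
                // HiveServer2 typically listens on port 10000; Impala
                // accepts the same protocol on its own port.
                try (Connection conn = DriverManager.getConnection(
                             "jdbc:hive2://gateway:10000/default", "hive", "");
                     Statement stmt = conn.createStatement();
                     ResultSet rs = stmt.executeQuery(
                             "SELECT level, COUNT(*) FROM logs GROUP BY level")) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
                    }
                }
            }
        }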

  • Data Exploration

    • Cloudera Search

      Search is one of Cloudera’s near-real-time access products. It enables non-technical users to search and explore data stored in or ingested into Hadoop and HBase. Users do not need SQL or programming skills to use Cloudera Search because it provides a simple full-text interface for searching.

    • Hue

      Hue is an acronym for Hadoop User Experience; it is an open source web interface for Hadoop. It supports operations such as uploading and browsing data, querying tables in Hive or Impala, and running Spark and Pig jobs and workflows. Hue provides a SQL editor for Hive, Impala, MySQL, Oracle, PostgreSQL, SparkSQL, and Solr SQL.

  • Workflow System

    • Oozie

      Oozie is a workflow or coordination system that you can use to manage Hadoop jobs.
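
      A minimal sketch of submitting a workflow with the Oozie Java client; the server URL and the HDFS path holding workflow.xml are placeholders:

        import java.util.Properties;
        import org.apache.oozie.client.OozieClient;

        public class SubmitWorkflow {
            public static void main(String[] args) throws Exception {
                OozieClient oozie = new OozieClient("http://oozie-server:11000/oozie");

                // Point the job at an application directory in HDFS that
                // contains workflow.xml.
                Properties conf = oozie.createConfiguration();
                conf.setProperty(OozieClient.APP_PATH,
                        "hdfs://namenode:8020/apps/etl-workflow");
                conf.setProperty("nameNode", "hdfs://namenode:8020");

                // Submit and start the workflow, then check its status.
                String jobId = oozie.run(conf);
                System.out.println("status: " + oozie.getJobInfo(jobId).getStatus());
            }
        }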


Warren

Written by Warren, who studies distributed systems at George Washington University. You might wanna follow him on GitHub