November 07, 2018
Hadoop is a framework that allows distributed processing of large data sets across clusters of commodity computers using simple programming models. It was inspired by technical papers published by Google on its distributed file system and MapReduce. Doug Cutting co-created Hadoop and named it after his son’s yellow toy elephant.
The 4 key characteristics of Hadoop are:
Economical - Ordinary computers can be used for data processing.
Reliable - Stores copies of data on different machines and is resistant to hardware failure.
Scalable - Supports both horizontal and vertical scaling.
Flexible - Can store as much data as needed and decide how to use it later.
An RDBMS stores data in a central location and sends it to the processor at runtime. This method works well for limited amounts of data, but it breaks down when high volumes of data must be pushed to the processor.
Hadoop brought a radical approach: the program goes to the data, not vice versa. Hadoop first distributes the data across multiple systems and later runs the computation wherever the data is located.
The 4 stages of a Big Data pipeline are each served by the Hadoop components below:
Data Processing (works on Hadoop Core)
Hadoop Distributed File System (HDFS)
A storage layer for Hadoop, suitable for distributed storage and processing. Hadoop provides a command line interface to interact with HDFS. HDFS provides streaming access to file system data, as well as file permissions and authentication.
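Besides the command line, HDFS also exposes a Java FileSystem API. Here is a minimal sketch of copying a local file into HDFS and streaming it back; the paths and cluster configuration are placeholder assumptions.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS from core-site.xml on the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Copy a local file into HDFS (both paths are hypothetical)
        fs.copyFromLocalFile(new Path("/tmp/events.log"),
                             new Path("/user/warren/events.log"));

        // Stream the file back line by line
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(new Path("/user/warren/events.log"))))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```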
Spark
Spark is an open source cluster computing framework. Thanks to its in-memory primitives, it can run certain applications up to 100 times faster than Hadoop's two-stage, disk-based MapReduce paradigm. Spark can run in a Hadoop cluster and process data in HDFS, and it supports a wide variety of workloads, including machine learning, business intelligence, streaming, and batch processing.
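As a rough sketch of Spark's programming model, here is the classic word count written against Spark's Java API; the HDFS input and output paths are assumptions.

```java
import java.util.Arrays;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("SparkWordCount");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            sc.textFile("hdfs:///user/warren/input")                         // read from HDFS
              .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator()) // split into words
              .mapToPair(word -> new Tuple2<>(word, 1))                      // (word, 1) pairs
              .reduceByKey(Integer::sum)                                     // sum counts per word
              .saveAsTextFile("hdfs:///user/warren/counts");                 // write back to HDFS
        }
    }
}
```

Note how the whole pipeline is a single chain of transformations; intermediate results stay in memory rather than being written to disk between stages.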
Hadoop MapReduce
Hadoop MapReduce is the other data-processing framework; it is the original Hadoop processing engine and is primarily Java based. It follows the map-and-reduce programming model, has an extensive and mature fault-tolerance framework built in, and is still very commonly used, though it is losing ground to Spark.
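For comparison, the same word count in MapReduce requires an explicit mapper and reducer. This is a condensed version of the canonical example; the driver class that wires up the job is omitted.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in the input split
class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Reduce phase: sum the counts emitted for each word
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```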
NoSQL
HBase
HBase is a NoSQL database that stores its data in HDFS. It is mainly used when you need random, real-time read/write access to your Big Data, and it supports high volumes of data with high throughput. In HBase, a table can have thousands of columns.
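A minimal sketch of random read/write access through the HBase Java client; the table name, column family, and values are hypothetical, and the table is assumed to already exist.

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "user1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Warren"));
            table.put(put);

            // Random read of the same row
            Result result = table.get(new Get(Bytes.toBytes("user1")));
            byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}
```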
Data Ingestion
Sqoop
Sqoop is designed to transfer data between Hadoop and relational database servers. It is used to import data from an RDBMS (such as Oracle or MySQL) into HDFS and to export data from HDFS back to an RDBMS.
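Sqoop is normally driven from the command line, but Sqoop 1 can also be invoked from Java via `Sqoop.runTool`, which takes the same arguments as the CLI. A sketch under that assumption, with a hypothetical MySQL database, table, and credentials:

```java
import org.apache.sqoop.Sqoop;

public class SqoopImportExample {
    public static void main(String[] args) {
        // Equivalent to: sqoop import --connect ... --table ... --target-dir ...
        // All connection details below are placeholder assumptions.
        int exitCode = Sqoop.runTool(new String[] {
            "import",
            "--connect", "jdbc:mysql://db-host/sales",
            "--username", "etl_user",
            "--password", "secret",
            "--table", "orders",
            "--target-dir", "/user/warren/orders"
        });
        System.exit(exitCode);
    }
}
```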
Flume
Flume is a distributed service that collects event data and transfers it to HDFS. It is ideally suited for streaming event data from multiple systems.
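Flume agents are usually wired together through configuration files, but the project also ships a client SDK for pushing events from application code. A minimal sketch using Flume's RPC client, with a hypothetical host and port for an agent's Avro source:

```java
import java.nio.charset.StandardCharsets;

import org.apache.flume.Event;
import org.apache.flume.api.RpcClient;
import org.apache.flume.api.RpcClientFactory;
import org.apache.flume.event.EventBuilder;

public class FlumeClientExample {
    public static void main(String[] args) throws Exception {
        // Connects to a Flume agent's Avro source (host and port are assumptions)
        RpcClient client = RpcClientFactory.getDefaultInstance("flume-host", 41414);
        try {
            Event event = EventBuilder.withBody("user-login", StandardCharsets.UTF_8);
            client.append(event); // the agent's sink then delivers it to HDFS
        } finally {
            client.close();
        }
    }
}
```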
Data Analysis
Pig
Pig is an open source, high-level data-flow system mainly used for analytics. Pig converts its scripts into MapReduce code, saving the user from writing complex MapReduce programs. Ad-hoc operations such as filters and joins, which are difficult to perform in raw MapReduce, can be done easily using Pig.
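Pig scripts are usually run through the `pig` shell, but they can also be executed from Java with `PigServer`. A sketch of a filter over hypothetical web logs (the paths and schema are assumptions):

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
    public static void main(String[] args) throws Exception {
        // MAPREDUCE mode compiles the script into MapReduce jobs on the cluster
        PigServer pig = new PigServer(ExecType.MAPREDUCE);

        // Load hypothetical logs, keep only server-error rows
        pig.registerQuery("logs = LOAD '/user/warren/logs' AS (user:chararray, status:int);");
        pig.registerQuery("errors = FILTER logs BY status >= 500;");
        pig.store("errors", "/user/warren/errors"); // triggers execution
    }
}
```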
Impala
Impala is an open source, high-performance SQL engine that runs on a Hadoop cluster. It is ideal for interactive analysis and has very low latency, measured in milliseconds. Impala supports a dialect of SQL (Impala SQL), so data in HDFS can be modeled as database tables.
Hive
Hive is a SQL abstraction layer on top of Hadoop. It is very similar to Impala, but it is preferred for batch data processing and extract, transform, load (ETL) operations.
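Both Hive and Impala speak SQL over JDBC, so the same client code can work against either with a different connection URL. A sketch using the Hive JDBC driver; the host names and table are assumptions (HiveServer2 conventionally listens on port 10000, Impala on 21050):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // For Impala, a URL such as jdbc:hive2://impala-host:21050/ is commonly used instead
        String url = "jdbc:hive2://hive-host:10000/default";
        try (Connection conn = DriverManager.getConnection(url);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT user, COUNT(*) FROM orders GROUP BY user")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```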
Data Exploration
Cloudera Search
Search is one of Cloudera’s near-real-time access products. It enables non-technical users to search and explore data stored in or ingested into Hadoop and HBase. Users do not need SQL or programming skills to use Cloudera Search because it provides a simple full-text interface for searching.
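Cloudera Search is built on Apache Solr, so it can also be queried programmatically through the SolrJ client. A sketch of a full-text query against a hypothetical collection named "logs":

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        // Base URL and collection name are placeholder assumptions
        try (HttpSolrClient solr = new HttpSolrClient.Builder(
                "http://search-host:8983/solr/logs").build()) {
            // Full-text search for documents containing "error"
            QueryResponse response = solr.query(new SolrQuery("error"));
            for (SolrDocument doc : response.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }
}
```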
Hue
Hue is an acronym for Hadoop User Experience; it is an open source web interface for Hadoop. It supports operations such as uploading and browsing data, querying tables in Hive or Impala, and running Spark and Pig jobs and workflows. Hue provides a SQL editor for Hive, Impala, MySQL, Oracle, PostgreSQL, Spark SQL, and Solr SQL.
Workflow System
Oozie
Oozie is a workflow or coordination system that you can use to manage Hadoop jobs.
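Oozie workflows are defined in XML and stored in HDFS, and jobs can be submitted through the Oozie Java client. A sketch, assuming a hypothetical Oozie server URL and the HDFS path of an existing workflow application:

```java
import java.util.Properties;

import org.apache.oozie.client.OozieClient;

public class OozieExample {
    public static void main(String[] args) throws Exception {
        // Server URL and application path are placeholder assumptions
        OozieClient oozie = new OozieClient("http://oozie-host:11000/oozie");

        Properties conf = oozie.createConfiguration();
        conf.setProperty(OozieClient.APP_PATH, "hdfs:///user/warren/workflows/etl");
        conf.setProperty("nameNode", "hdfs://namenode:8020");
        conf.setProperty("jobTracker", "resourcemanager:8032");

        String jobId = oozie.run(conf); // submit and start the workflow
        System.out.println("Started workflow " + jobId);
    }
}
```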
Written by Warren, who studies distributed systems at George Washington University. You might want to follow him on GitHub.