Sunday, 28 September 2014

Hadoop Architecture


HADOOP ECOSYSTEM :

The modern Hadoop ecosystem consists of different layers, and each layer performs a different kind of task: storing your data, processing the stored data, allocating resources, and supporting different programming languages for developing various applications in the Hadoop ecosystem.

HDFS : (Hadoop Distributed File System)

HDFS is the basic and most important layer in the Hadoop ecosystem. It is a Java-based file system that provides scalable and reliable data storage by distributing a large set of data across several machines in the same cluster. HDFS has two components that maintain the distributed files, called the name node and the data node. The name node is responsible for keeping the metadata about the actual data, which is stored on different data nodes across the distributed cluster. The data nodes are the ones storing all the actual data (your data). To get more details about the name node and data node, please click here.
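As a quick illustration, here is a minimal sketch of writing and reading a file through the HDFS Java API. The name node address and file path below are made up for the example; on a real cluster they come from your configuration files.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Hypothetical name node address; normally read from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // Write a small file; HDFS splits large files into blocks
            // and replicates them across data nodes automatically.
            Path file = new Path("/user/demo/hello.txt");
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("Hello HDFS");
            }

            // Read it back; the name node tells the client which
            // data nodes hold the blocks of this file.
            try (FSDataInputStream in = fs.open(file)) {
                System.out.println(in.readUTF());
            }
            fs.close();
        }
    }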

MAP REDUCE FRAMEWORK :

MapReduce is the second component, or layer, of the Hadoop ecosystem. It processes the data stored in HDFS using two components called the job tracker and the task tracker. The job tracker is responsible for allocating jobs to the task trackers, and each task tracker completes its assigned task with the help of a data node. Another important job is allocating resources: local storage to distribute the data across the several data nodes in the cluster, and main memory and processors to process the large distributed data set. By default, MapReduce allows Java code to access and process the stored data.
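For example, the classic word count job shows the shape of a Java MapReduce program: a mapper that emits (word, 1) pairs and a reducer that sums them. This is only a sketch; the input and output paths are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        // Map phase: emit (word, 1) for every word in the input split.
        public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (!token.isEmpty()) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: sum the counts for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Illustrative HDFS paths.
            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }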


YARN : (Yet Another Resource Negotiator)

YARN was introduced in Hadoop 2.0 and is meant for resource allocation. Before YARN, resource allocation was also taken care of by MapReduce, even though MapReduce's main job is processing the data; this overburdened MapReduce. To overcome that burden, YARN was introduced, taking over the responsibility of allocating resources for processing and storing data in the Hadoop Distributed File System. YARN has two components called the resource manager and the node manager. To get detailed information about YARN, click here.
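As a rough sketch of how a cluster tells YARN what it can hand out, a minimal yarn-site.xml might look like the following. The hostname and memory figure are illustrative placeholders, not recommendations.

    <configuration>
      <property>
        <!-- Where the cluster-wide resource manager runs (placeholder host). -->
        <name>yarn.resourcemanager.hostname</name>
        <value>resourcemanager.example.com</value>
      </property>
      <property>
        <!-- Memory (MB) this node manager may allocate to containers. -->
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>8192</value>
      </property>
    </configuration>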

TEZ : 

Tez is a project of the Apache Software Foundation. Tez acts as a processing engine and is placed on top of the YARN layer to support higher-level tools such as Apache Pig and Apache Hive. Normally the MapReduce framework allows applications on Hadoop to be developed in Java, because MapReduce itself was written in Java. So Apache introduced a processing engine called Tez that supports those other tools: instead of each Pig or Hive script being translated into a chain of MapReduce jobs, Tez compiles it into a directed acyclic graph (DAG) of tasks and runs that DAG directly on YARN. Tez provides a general-purpose, highly customizable framework that simplifies data-processing tasks across both small-scale (low-latency) and large-scale (high-throughput) workloads in Hadoop.
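One small, concrete illustration: assuming Tez is installed on the cluster, Hive can be switched from the classic MapReduce engine to Tez with a single setting in the Hive shell.

    -- Run subsequent HiveQL queries as Tez DAGs instead of MapReduce jobs.
    set hive.execution.engine=tez;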

APACHE PIG : 

Apache Pig supports a scripting language called Pig Latin, and it can run on the Tez processing engine. It greatly reduces the amount of code we have to write compared with Java programming in plain MapReduce, and it is used for data analysis. Whatever we write in Pig Latin is compiled by the execution engine, Tez or classic MapReduce, into jobs that run on the cluster, so we never have to write the low-level Java ourselves.
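As an illustration of how compact Pig Latin is, here is a word count in a handful of lines; the HDFS paths are made up for the example.

    -- Load raw text from HDFS.
    lines  = LOAD '/user/demo/input' AS (line:chararray);
    -- Split each line into words, one word per record.
    words  = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
    -- Group identical words and count them.
    grpd   = GROUP words BY word;
    counts = FOREACH grpd GENERATE group AS word, COUNT(words) AS cnt;
    STORE counts INTO '/user/demo/pig_output';

Compare this with the few dozen lines of Java the same job takes in plain MapReduce.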

APACHE HIVE :

Apache Hive supports processing structured data sets using HiveQL. HiveQL allows writing SQL-like queries to store structured data and process it in the Hadoop ecosystem. Hive can be placed on the Apache Tez processing engine, in which case HiveQL queries are compiled into Tez tasks rather than MapReduce jobs.
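For example, a short HiveQL session might look like this; the table, columns, and paths are invented for illustration.

    -- Define a table over delimited files in HDFS.
    CREATE TABLE page_views (user_id STRING, url STRING, view_time STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    -- Point the table at data already sitting in HDFS.
    LOAD DATA INPATH '/user/demo/page_views.tsv' INTO TABLE page_views;

    -- A familiar SQL-style query; Hive compiles it into Tez (or MapReduce) jobs.
    SELECT url, COUNT(*) AS views
    FROM page_views
    GROUP BY url
    ORDER BY views DESC
    LIMIT 10;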

OOZIE : 

Oozie acts like a server. When we are working with large sets of data there will be multiple tasks, and often one task needs another task's output as its input before it can run. So each task has to be coordinated with the others, and this kind of flow is managed by Oozie; that is why Oozie is called a workflow engine. Oozie workflows are built from two kinds of nodes, called control nodes and action nodes. Control nodes define where the workflow starts, branches, and ends, while each action node runs one particular task at a time; for n tasks there will be n action nodes, tied together by the control nodes. To run Oozie we need an XML file called workflow.xml, which contains all the configuration details of the control nodes, the action nodes, and the tasks to be performed. Oozie sits on top of the Hadoop ecosystem.
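A minimal workflow.xml sketch with a single MapReduce action is shown below; the workflow name, the paths, and the ${jobTracker}/${nameNode} parameters are placeholders supplied at run time.

    <workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.5">
      <!-- Control node: where the workflow begins. -->
      <start to="word-count"/>
      <!-- Action node: runs one MapReduce task. -->
      <action name="word-count">
        <map-reduce>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name>mapred.input.dir</name>
              <value>/user/demo/input</value>
            </property>
            <property>
              <name>mapred.output.dir</name>
              <value>/user/demo/output</value>
            </property>
          </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
      </action>
      <!-- Control nodes: where the workflow finishes. -->
      <kill name="fail">
        <message>Word count failed.</message>
      </kill>
      <end name="end"/>
    </workflow-app>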

FLUME AND SQOOP : 

Flume and Sqoop are data transfer tools. Flume collects and imports unstructured and semi-structured data (such as log and event streams) into HDFS. Sqoop imports or exports structured data (traditional RDBMS/DBMS data) into or out of HDFS. Using Sqoop we can import structured data that is already stored in a database such as MySQL, reading it with ordinary SQL queries.
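For example, a single Sqoop command can pull a table from MySQL into HDFS; the host, database, table name, and credential file below are placeholders.

    sqoop import \
      --connect jdbc:mysql://dbhost.example.com/sales \
      --username demo \
      --password-file /user/demo/.db_password \
      --table customers \
      --target-dir /user/demo/customers \
      -m 4

Here -m 4 asks Sqoop to use four parallel map tasks to do the copy.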

Thanks for reading, and if you have any comments on my blog, feel free to post them.