Tuesday 29 July 2014

BIG DATA'S BRAIN AND HEART

    
    We all know that Big Data has been booming around the world for the past three or four years. In my last post, I explained why Big Data plays a major role on the planet.
    Now I am going to explain how Big Data problems are managed and what technologies are actually used to solve them.

HADOOP:

    Hadoop is the most important technology used to solve Big Data problems. Yahoo initiated the project and played a major role in Hadoop's development, but the idea behind Hadoop comes from Google's GFS (Google File System) white paper.
What is Hadoop and what does it contain? Hadoop is a framework built around two powerful components called HDFS and MapReduce,
and most importantly it is open source, so there is no license to worry about and it is free to download.

HDFS :

    Hadoop Distributed File System (HDFS) handles large amounts of data by distributing it across different nodes in the same cluster or in different clusters. HDFS distributes the data and then manages it through the components listed below. Input files are split into blocks before they are distributed; each block is 64 MB by default. The blocks are stored on the slaves (data nodes).
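    As a rough illustration, here is a minimal Java sketch that writes a file to HDFS with an explicit 64 MB block size and a replication factor of 3, using the standard org.apache.hadoop.fs.FileSystem API. The path /user/guna/sample.txt and the cluster settings picked up from the configuration files are just assumptions for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // handle to the cluster's file system

        Path file = new Path("/user/guna/sample.txt");   // hypothetical path, for illustration only
        long blockSize = 64L * 1024 * 1024;              // 64 MB blocks, as described above
        short replication = 3;                           // usual default replication factor

        // create(path, overwrite, bufferSize, replication, blockSize)
        FSDataOutputStream out = fs.create(file, true, 4096, replication, blockSize);
        out.writeUTF("hello hdfs");
        out.close();

        System.out.println("Block size used: " + fs.getFileStatus(file).getBlockSize());
    }
}

    HDFS splits whatever is written through that stream into 64 MB blocks behind the scenes and spreads them over the data nodes.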

Components of HDFS :

  •     Name Node (NN)
  •     Data Node (DN)
  •     Secondary Name Node (SNN)

What is Name Node (NN) ?

  •  The name node is the brain of Hadoop. It acts as the master.
  •  Every cluster has one name node and many slaves (data nodes).
  •  It keeps the metadata (data about the data) in RAM for faster access.
  •  The metadata holds information about the data nodes (DN), such as where each file's blocks are stored and how much free space is available on each data node.
  •  The name node is a single point of failure: if it goes down, the entire cluster stops working until it is restarted. This problem exists in Hadoop 1.x and earlier.
  •  The new release of Hadoop (2.x) has two name nodes, one active and one standby, and the standby name node keeps all the metadata of the active name node.
  •  If the active name node fails, the standby name node becomes the active one.
  •  When a Hadoop client wants to access data, it must first communicate with the name node to collect information about where the data is stored.
  •  After getting that information from the name node, the client reads the actual data directly from the data nodes (see the sketch after this list).
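    To make that client/name-node interaction concrete, here is a minimal Java sketch using the same org.apache.hadoop.fs.FileSystem API. It asks the name node which data nodes hold each block of a file and prints them; the path /user/guna/sample.txt is only a hypothetical example and a running HDFS cluster is assumed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationLookup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path file = new Path("/user/guna/sample.txt");   // hypothetical file
        FileStatus status = fs.getFileStatus(file);      // this metadata comes from the name node

        // Ask the name node which data nodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("Block at offset " + block.getOffset()
                    + " stored on " + String.join(", ", block.getHosts()));
        }

        // The actual bytes are then streamed straight from those data nodes:
        // FSDataInputStream in = fs.open(file);
    }
}

    Only the lookup goes through the name node; the file contents themselves never pass through it.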

What is Data Node (DN) ?

  •  The data node is where the actual data is stored as blocks.
  •  It acts as a slave and sends a heartbeat to the name node at a short, regular interval (three seconds by default).
  •  If a data node stops sending heartbeats for too long (roughly ten minutes by default), the name node marks it as damaged or dead.
  •  If a data node goes down, the name node tells other data nodes to take over the failed node's blocks, because every block is replicated on other data nodes.
  •  The default replication factor for a block is 3. We can increase or decrease the replication factor; replication is what keeps blocks available across the distributed system (see the sketch after this list).
  •  Some of you may be wondering what a heartbeat is. Don't worry: a heartbeat simply tells the name node "I am alive, I am alive", along with reports about the blocks the data node stores.
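    As a small, hedged illustration of the replication factor, the sketch below sets the cluster-wide default through the dfs.replication property and then raises the replication of one hypothetical file to four copies; the values are only examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ChangeReplication {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Cluster-wide default, normally set in hdfs-site.xml; 3 is the usual value.
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        // Raise the replication factor of one existing (hypothetical) file to 4 copies.
        boolean accepted = fs.setReplication(new Path("/user/guna/sample.txt"), (short) 4);
        System.out.println("Replication change accepted: " + accepted);
    }
}

    The call only records the new target replication; the name node then schedules the extra copies in the background.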

What is Secondary Name Node (SNN) ?

  •  The Secondary Name Node keeps a copy of the name node's metadata. It works like a save point (checkpoint) for the name node.
  •  If the name node goes down, the entire Hadoop system stops working until it is restarted. After the restart, the name node recovers its metadata from the secondary name node.
  •  Hadoop 2.x, however, has two name nodes, one active and one standby; if the active one fails, the standby comes into action.

MapReduce :

  •  MapReduce is a programming framework for solving Big Data problems.
  •  MapReduce comes from a Google white paper that describes the algorithm; the implementation was done by Yahoo, Apache and some other big firms.
  •  It handles both resource management (sharing hardware such as RAM and hard drives) and data processing.
  •  MapReduce follows a batch processing technique.
  •  It is written in Java and contains two components, the Jobtracker and the Tasktracker (a small word-count sketch follows this list).
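    As mentioned in the list above, here is the classic word-count example sketched in Java with the newer org.apache.hadoop.mapreduce API (Hadoop 2.x style). The class names and the simple whitespace tokenizer are my own choices for illustration.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: runs on each input block and emits (word, 1) for every word in its split.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: receives all the counts for one word and sums them.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}

    The Map part runs on the nodes that hold the input blocks and emits (word, 1) pairs; the framework groups the pairs by key and the Reduce part sums them.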


Jobtracker :

  •  The Jobtracker acts as the master.
  •  It is responsible for assigning jobs to the Tasktrackers and monitoring them.
  •  It collects heartbeats from the Tasktrackers; if a Tasktracker does not send a heartbeat within a particular time period, the Jobtracker assumes it is slow or dead.
  •  The Jobtracker collects metadata from the name node so that it can assign each task to a Tasktracker close to the data.




Tasktracker :
   
  •  The Tasktracker acts as a slave.
  •  It performs whatever task the Jobtracker gives it.
  •  It runs both the Map and the Reduce parts.
  •  The Map part takes blocks as input and processes them.
  •  The Reducer takes the Map output as its input, performs its operation and writes the final output back to HDFS (a driver sketch that submits the job follows this list).
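    To show how the two parts above are handed to the cluster, here is a hedged driver sketch (again Hadoop 2.x style) that submits the job. It assumes the hypothetical WordCount.TokenMapper and WordCount.SumReducer classes from the earlier sketch and takes the HDFS input and output paths from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");   // the job the cluster will schedule

        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory, must not exist yet

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

    The output directory must not already exist; the framework creates it and writes one part file per reducer into it.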

Advantages of Hadoop :

  •  Large amounts of data are processed in parallel, giving effective, reliable output in less time.
  •  There is no data-unavailability problem because blocks are replicated.
  •  It is an open-source framework, so it is free to learn and use.
  •  It detects and handles hardware failures.

Disadvantages of Hadoop :

  •  Security problems such as man-in-the-middle attacks.
  •  It does not have strong built-in safeguards against such attacks.

Hadoop 2.x :

  •  It introduced two name nodes in the same cluster and uses a separate layer for resource management called YARN (Yet Another Resource Negotiator).
  •  We will see more details about YARN in my next post, because it is a new layer in Hadoop.

    Thanks for reading. Please leave your valuable comments to help me improve the quality of this blog.
    Feel free to ask questions or raise queries.
                                                                                                                    By,
                                                                                                              Gunasekar.