Monday 26 January 2015

Hadoop Configuration On Linux Environment

Hadoop Single Node Setup

This blog describes how to set up and configure a Hadoop 1.2.1 single node installation on the Linux platform. GNU/Linux is supported as a development and production platform.

Required Software :

Java 1.6.x (or later) must be installed and JAVA_HOME must be set.
ssh must be installed and sshd must be running. To install ssh on Ubuntu Linux, use the command below:
$ sudo apt-get install openssh-server
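If you want to confirm the prerequisites before continuing, a quick check (assuming Ubuntu, with the JDK already on your PATH) might look like this:

$ java -version              # should report version 1.6 or later
$ echo $JAVA_HOME            # should point to your JDK installation directory
$ sudo service ssh status    # on Ubuntu, verifies that sshd is running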

Installing Hadoop :

1. Download the Hadoop 1.2.1 distribution from an Apache Hadoop download mirror; releases are listed at http://hadoop.apache.org/releases.html
2. Extract the distribution into a local folder; the extracted hadoop-1.2.1 directory will be referred to as HADOOP_HOME from here on.
3. You can extract the downloaded tar file with the following command:
$ tar -xzvf hadoop-1.2.1.tar.gz
4. From the extracted hadoop-1.2.1 folder, edit the file conf/hadoop-env.sh to define JAVA_HOME as below –
# The java implementation to use. Required.
export JAVA_HOME=/usr/home/java/jdk1.7.0_03   # use the path of your own JDK installation
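As a quick sanity check (the JDK path here is just the example from above; substitute your own), you can confirm the edit and that the JDK it points to works:

$ grep JAVA_HOME conf/hadoop-env.sh              # the export line should be present and not commented out
$ /usr/home/java/jdk1.7.0_03/bin/java -version   # should print the version of the configured JDK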
 
Now we are ready to run Hadoop in Local (Standalone) Mode.
 
Local (Standalone) Mode – In this mode Hadoop runs in non-distributed mode as a single Java process. Hadoop runs on the local machine without any cluster environment. We can test Local (Standalone) mode with the following example –

$ bin/hadoop jar hadoop-examples-*.jar wordcount   /home/input/test.txt   /home/output

By executing the above command, an output folder will be created under /home, and inside that folder you will find the actual wordcount program output (based on your input file).

NOTE: You must have an input file in the "/home/input/" path. The file name can be anything; in this example it is "test.txt", so if your file name is different, change it in the above command as well. A minimal run might look like the sketch below.
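For illustration, here is a minimal end-to-end run from the HADOOP_HOME folder (the sample input text is an assumption, and it assumes you have write permission under /home; otherwise use paths inside your home directory). The output directory must not already exist, and the exact part-* output file name can vary:

$ mkdir -p /home/input
$ echo "hello hadoop hello world" > /home/input/test.txt
$ bin/hadoop jar hadoop-examples-*.jar wordcount /home/input/test.txt /home/output
$ cat /home/output/part-*        # each word followed by its count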

We will continue with the Pseudo-Distributed Mode setup with the following steps –
5. From the extracted HADOOP_HOME folder, edit the file conf/core-site.xml as below to define the HDFS file system –
<configuration>
     <property>
         <name>fs.default.name</name>            <!-- name of the default file system -->
         <value>hdfs://localhost:9000</value>    <!-- HDFS will run on the given host and port -->
     </property>
</configuration>
6. From the extracted HADOOP_HOME folder, edit the file conf/hdfs-site.xml to define the replication value –
<configuration>
     <property>
         <name>dfs.replication</name>
         <value>1</value>    <!-- each block is stored once; increase this value if you want more replicas -->
     </property>
</configuration>
7. From the extracted HADOOP_HOME folder, edit the file conf/mapred-site.xml to define the JobTracker address –
<configuration>
     <property>
         <name>mapred.job.tracker</name>
         <value>localhost:9001</value>
     </property>
</configuration>
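Optionally, you can confirm that the three files are still well-formed XML after editing (this assumes the xmllint tool is available, e.g. via the libxml2-utils package on Ubuntu):

$ xmllint --noout conf/core-site.xml conf/hdfs-site.xml conf/mapred-site.xml   # prints nothing if the XML is well formed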

SSH Configuration: 

The Hadoop node should be able to ssh to localhost without any passphrase, since the start/stop scripts use ssh to manage the daemons. To achieve this, execute the following commands –

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Now try – $ ssh localhost
You should now be able to log in to localhost without being asked for a passphrase.
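If ssh localhost still prompts for a password, a common cause is overly permissive permissions on the .ssh directory; tightening them usually helps:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys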

Testing Hadoop :

1. During initial setup we have to format the NameNode. Remember this is a one-time activity and should not be done on every start. From the extracted HADOOP_HOME folder, execute the following command in a terminal –
$ bin/hadoop namenode -format
2. Start Hadoop daemons –
$ bin/start-all.sh
 
If everything goes well, then executing the ‘jps’ command should show the following daemons on the console (each entry is prefixed with its process id) –
$ jps
NameNode
SecondaryNameNode
DataNode
JobTracker
TaskTracker
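
As a final check, you can run the same wordcount example against HDFS while the daemons are running. This is a minimal sketch, assuming the local file /home/input/test.txt from earlier and relative HDFS paths under your user directory:

$ bin/hadoop fs -mkdir input                              # create an input directory in HDFS
$ bin/hadoop fs -put /home/input/test.txt input           # copy the local file into HDFS
$ bin/hadoop jar hadoop-examples-*.jar wordcount input output
$ bin/hadoop fs -cat output/part-*                        # read the word counts back from HDFS

While the daemons are running, the NameNode web UI is normally available at http://localhost:50070 and the JobTracker UI at http://localhost:50030. When you are done, stop all daemons with:

$ bin/stop-all.sh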

                      Please share this content if you like it!!!  Feel free to ask questions!!!