In my previous post, I explained the basics of Apache Hadoop.


In this post, I configure Apache Hadoop in pseudo-distributed mode on a single node, using the CentOS 6.4 Linux operating system.


Before installing Hadoop, make sure Java 1.6 or above is installed on your system.



To install the latest Java 1.7, follow this link: installing java1.7


The full JDK will be placed in /opt/jdk1.7.0_51. After installation, make a quick check whether Java is correctly set up:

$java -version
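If you want to script this check, here is a small sketch. The helper function `java_ok` and the `awk` parsing are my additions for illustration, not part of the original setup:

```shell
# Sketch: check that a Java version string meets the 1.6 minimum.
# On a real system you would feed it the parsed output of `java -version`:
#   java_ok "$(java -version 2>&1 | awk -F '"' '/version/ {print $2}')"
java_ok() {
  case "$1" in
    1.[0-5]|1.[0-5].*) return 1 ;;  # 1.0 - 1.5 are too old
    "")                return 1 ;;  # java not found / nothing parsed
    *)                 return 0 ;;  # 1.6, 1.7, ... are fine
  esac
}
java_ok "1.7.0_51" && echo "1.7.0_51 accepted"
java_ok "1.5.0"    || echo "1.5.0 rejected"
```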


We will use a dedicated Hadoop user account for running Hadoop. While that’s not required, it is recommended because it helps separate the Hadoop installation from other software applications and user accounts running on the same machine:
#groupadd hadoop
#useradd -G hadoop hduser
#passwd hduser
This will add the user hduser and the group hadoop to your local machine.






Download the Apache Hadoop Common release and move it to the server where you want to install it.


You can also use wget to download it directly into the /opt directory on your server:

#su - hduser
$cd /opt
$wget http://mirrors.sonic.net/apache/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz


As the hduser user, unpack the package:

$tar xfz hadoop-1.2.1.tar.gz
$mv hadoop-1.2.1 hadoop
$chown -R hduser:hadoop hadoop





To avoid problems that IPv6 can cause Hadoop on some systems, you can disable IPv6 for Hadoop only (rather than system-wide) by adding the following line to conf/hadoop-env.sh:


$vi /opt/hadoop/conf/hadoop-env.sh
export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true
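A small sketch of how to add that line idempotently, so re-running the setup does not duplicate it. The `mktemp` file here stands in for the real /opt/hadoop/conf/hadoop-env.sh so the technique can be tried without Hadoop installed:

```shell
# Sketch: append the IPv4-preference flag only if it is not already present.
ENV_FILE="$(mktemp)"   # stand-in for /opt/hadoop/conf/hadoop-env.sh
LINE='export HADOOP_OPTS=-Djava.net.preferIPv4Stack=true'
grep -qxF "$LINE" "$ENV_FILE" || echo "$LINE" >> "$ENV_FILE"
grep -qxF "$LINE" "$ENV_FILE" || echo "$LINE" >> "$ENV_FILE"  # second run: no duplicate
```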




Modify the hadoop-env.sh file and make sure the JAVA_HOME environment variable points to the correct location of the Java installed on your system.


$vi /opt/hadoop/conf/hadoop-env.sh
export JAVA_HOME=/opt/jdk1.7.0_51







Add the following lines to the end of the $HOME/.bashrc file of user hduser:


$vi /home/hduser/.bashrc
# Set Hadoop-related environment variables

export HADOOP_PREFIX=/opt/hadoop


# Set JAVA_HOME (also configured directly for Hadoop in conf/hadoop-env.sh)

export JAVA_HOME=/opt/jdk1.7.0_51


# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"


# If you have LZO compression enabled in your Hadoop cluster and compress
# job outputs with LZOP, this lets you conveniently inspect an LZOP-compressed
# file from the command line:

lzohead () {
    hadoop fs -cat "$1" | lzop -dc | head -1000 | less
}
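Note that the fs alias above assumes the hadoop script is on your PATH. One way to arrange that (my addition, not part of the original listing) is one more line in the same .bashrc:

```shell
# Put the Hadoop binaries on the PATH so `hadoop fs` resolves
export PATH=$PATH:$HADOOP_PREFIX/bin
```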






Hadoop’s default configuration uses hadoop.tmp.dir as the base temporary directory both for the local file system and for HDFS, so we create the directory and set the required ownership and permissions.

As root, execute the following commands:


#mkdir -p /app/hadoop/tmp/data
#mkdir -p /app/hadoop/tmp/dfs/name
#mkdir -p /app/hadoop/tmp/mapred/local
#mkdir /app/hadoop/tmp/dfs/namesecondary
#chown -R hduser:hadoop /app/hadoop/tmp
#chmod -R 755 /app/hadoop/tmp
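To sanity-check the result, here is a sketch that rebuilds the same directory tree under a scratch root and verifies it. The scratch root is only so it can be tried without root privileges; on the real system you would check /app/hadoop/tmp itself:

```shell
# Sketch: recreate the temp-directory layout under a scratch root and
# verify every expected subdirectory exists, with 755 permissions.
BASE="$(mktemp -d)"   # stand-in for /app/hadoop/tmp
mkdir -p "$BASE/data" "$BASE/dfs/name" "$BASE/dfs/namesecondary" "$BASE/mapred/local"
chmod -R 755 "$BASE"
for d in data dfs/name dfs/namesecondary mapred/local; do
  [ -d "$BASE/$d" ] || { echo "missing: $d"; exit 1; }
done
echo "layout ok"
```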





Add the following snippets between the <configuration> … </configuration> tags in the respective configuration XML file.
In file conf/core-site.xml


$vi /opt/hadoop/conf/core-site.xml

<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>




In file conf/mapred-site.xml


$vi /opt/hadoop/conf/mapred-site.xml

<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>




In file conf/hdfs-site.xml


$vi /opt/hadoop/conf/hdfs-site.xml

<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>




See Getting Started with Hadoop and the documentation in Hadoop’s API Overview if you have any questions about Hadoop’s configuration options.


In a typical Hadoop production environment you’ll set up passwordless ssh access between the different servers. Since we are simulating a distributed environment on a single server, we need to set up passwordless ssh access to the localhost itself.

Use ssh-keygen to generate the private and public key value pair.



#su - hduser
$ssh-keygen -t rsa -P ""

Add the public key to the authorized_keys. Just use the ssh-copy-id command, which will take care of this step automatically and assign appropriate permissions to these files.

$ssh-copy-id -i ~/.ssh/id_rsa.pub localhost

Test the passwordless login to the localhost as shown below.

$ssh localhost




Next, format the Hadoop filesystem, which is implemented on top of the local filesystem of your “cluster”. To format the filesystem, run the following command:


#su - hduser
$/opt/hadoop/bin/hadoop namenode -format






Use the /opt/hadoop/bin/start-all.sh script to start all Hadoop-related services. This will start the NameNode, DataNode, SecondaryNameNode, JobTracker and TaskTracker.

$/opt/hadoop/bin/start-all.sh



A nifty tool for checking whether the expected Hadoop processes are running is jps:

$jps
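If you want to script this check, here is a sketch. The daemon names are the standard Hadoop 1.x ones; the sample output below is illustrative, not captured from a real run:

```shell
# Sketch: verify that all five pseudo-distributed daemons appear in `jps` output.
check_daemons() {  # $1 = output of `jps`
  for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    printf '%s\n' "$1" | grep -qw "$d" || { echo "missing: $d"; return 1; }
  done
  echo "all daemons up"
}
# On a live system:  check_daemons "$(jps)"
# Demonstration with illustrative sample output:
sample='2287 NameNode
2422 DataNode
2550 SecondaryNameNode
2628 JobTracker
2765 TaskTracker'
check_daemons "$sample"
```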





You can also check with netstat whether Hadoop is listening on the configured ports.

$netstat -plten | grep java






Hadoop comes with several web interfaces which are by default available at these locations:

http://localhost:50070/ – web UI of the NameNode daemon
http://localhost:50030/ – web UI of the JobTracker daemon
http://localhost:50060/ – web UI of the TaskTracker daemon

These web interfaces provide concise information about what’s happening in your Hadoop cluster. You might want to give them a try.




NameNode Web Interface (HDFS layer):

The NameNode web UI shows you a cluster summary, including information about total/remaining capacity and live and dead nodes. Additionally, it allows you to browse the HDFS namespace and view the contents of its files in the web browser. It also gives access to the local machine’s Hadoop log files.

By default, it’s available at http://localhost:50070/.






JobTracker Web Interface (MapReduce layer):

The JobTracker web UI provides information about general job statistics of the Hadoop cluster, running/completed/failed jobs and a job history log file. It also gives access to the “local machine’s” Hadoop log files (the machine the web UI is running on).

By default, it’s available at http://localhost:50030/.





TaskTracker Web Interface (MapReduce layer):

The TaskTracker web UI shows you running and non-running tasks. It also gives access to the “local machine’s” Hadoop log files.

By default, it’s available at http://localhost:50060/.






Let's run a simple example to see whether this setup works. For testing purposes, add some sample data files to the input directory. We will just copy all the xml files from the conf directory to the input directory, so these xml files will serve as the data files for the example program. In the standalone version, you would use the standard cp command to copy them to the input directory.


However, in a distributed Hadoop setup you use the -put option of the hadoop command to add files to the HDFS filesystem. Importantly, this is not a Linux filesystem: you are adding the input files to the Hadoop Distributed File System, so you have to use the hadoop command to do this.


$cd /opt/hadoop

$/opt/hadoop/bin/hadoop dfs -put conf input


Execute the sample hadoop test program. This is a simple hadoop program that simulates grep: it searches for the regex pattern “dfs[a-z.]+” in all the input/*.xml files in HDFS and stores the results in the output directory, which is also kept in HDFS.

$bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'






The above command will create the output directory (in HDFS) with the results. To view this output on the local filesystem, use the “-get” option of the hadoop command as shown below.


$/opt/hadoop/bin/hadoop dfs -get output output
$cat output/*

This prints the output of the grep job. That’s it!



In coming posts I will describe how to build a Hadoop multi-node cluster with two servers.


In addition, I will write a tutorial on how to code a simple MapReduce job in the Python programming language, which can serve as the basis for writing your own MapReduce programs.