Hi, this blog will help you set up a single-node Hadoop environment on your Linux machine.
To learn more about Hadoop, follow these links:
https://en.wikipedia.org/wiki/Apache_Hadoop
http://www.tutorialspoint.com/hadoop/
https://www.mapr.com/products/apache-hadoop
http://findnerd.com/list/view/What-is-Hadoop/14171/
You must have Java 6 or later (Java 7 or newer recommended), ssh, and rsync installed in order to install and use Hadoop.
The link below has detailed information about which Java versions can be used with Hadoop.
https://wiki.apache.org/hadoop/HadoopJavaVersions
Confirm that the correct Java version is properly installed on your system by executing the following command in a terminal:
java -version
If Java is installed, the command prints the installed version; verify that it is a supported one, otherwise you need a fresh Java installation.
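For example, with Oracle Java 7 installed the output looks roughly like this (the exact build numbers will differ on your machine):
java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)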
To install the Java version of your choice, follow the link below.
http://findnerd.com/list/view/Install-Oracle-JDK-with-apt-get/2944/
If ssh is not installed on your machine, you need to install it:
sudo apt-get install ssh
Install rsync using the following command:
sudo apt-get install rsync
Next, allow SSH public key authentication. First, we have to generate an SSH key:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
The above command creates an RSA key pair with an empty passphrase. We don't want to enter a passphrase every time Hadoop interacts with its nodes, so we create the key with an empty one.
After this, enable SSH access to your local machine with the newly created key:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
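You can verify that passwordless login now works; the first connection may ask you to confirm the host fingerprint:
ssh localhost
exit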
Download your desired version of the Hadoop binary tarball from the Apache Hadoop website:
http://hadoop.apache.org/releases.html
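If you prefer the command line, you can fetch the tarball directly from the Apache archive instead; for example, for version 2.6.3 (adjust the version to your choice):
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.6.3/hadoop-2.6.3.tar.gz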
Now, using the terminal, go to the directory into which the tarball was downloaded.
In my case I downloaded hadoop-2.6.3.tar.gz to ~/Downloads/
cd ~/Downloads/
You need to extract the Hadoop package and move the extracted content to a location of your choice. In my case I chose /usr/local/hadoop.
sudo tar -zxvf hadoop-2.6.3.tar.gz
sudo mv hadoop-2.6.3 /usr/local/hadoop
Note down your current Java path. To find it, you can run the following command:
update-alternatives --config java
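Alternatively, you can resolve the Java home in one step; this is a small sketch that strips the trailing /bin/java from the resolved binary path (on some setups the result points at the jre subdirectory, whose parent is the JDK home):
readlink -f $(which java) | sed 's:/bin/java::'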
To edit your .bashrc, run the following command:
sudo gedit ~/.bashrc
Append the following lines at the end of your .bashrc file:
#Hadoop Variables
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
Make sure JAVA_HOME points to your actual Java installation path, without quotes; the value above assumes Oracle Java 7 installed under /usr/lib/jvm/java-7-oracle.
Reload .bashrc so the changes take effect in your current terminal:
source ~/.bashrc
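To check that the Hadoop binaries are now on your PATH, you can print the installed version:
hadoop version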
Now comes the Hadoop configuration part.
Go to Hadoop's configuration directory:
cd /usr/local/hadoop/etc/hadoop
Now update your hadoop-env.sh
sudo gedit hadoop-env.sh
Find the line that sets JAVA_HOME and replace its value with the Java home path you noted in the earlier step.
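For example, assuming Oracle Java 7 under /usr/lib/jvm/java-7-oracle, the line would become:
export JAVA_HOME=/usr/lib/jvm/java-7-oracle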
Update core-site.xml
sudo gedit core-site.xml
Replace the empty <configuration></configuration> tags with the following and save the file:
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
Update yarn-site.xml
sudo gedit yarn-site.xml
Replace the empty <configuration></configuration> tags with the following and save the file:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
Make a copy of mapred-site.xml.template named mapred-site.xml:
sudo cp mapred-site.xml.template mapred-site.xml
Now edit your mapred-site.xml
sudo gedit mapred-site.xml
Replace the empty <configuration></configuration> tags with the following and save the file:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
Edit hdfs-site.xml
sudo gedit hdfs-site.xml
Replace the empty <configuration></configuration> tags with the following and save the file:
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>
Go back to your home directory
cd
You need to create the directories for the namenode and the datanode, i.e. the storage locations you specified in hdfs-site.xml above:
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode
sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode
Change the ownership of the directory to your user (replace abhishek with your own username):
sudo chown -R abhishek:abhishek /usr/local/hadoop
Follow this link to learn more about changing ownership: http://www.techonthenet.com/linux/commands/chown.php
The Hadoop file system needs to be formatted before we can start using it. Issue the format command as a user with write permission, since it creates a current directory under the /usr/local/hadoop/hadoop_data/hdfs/namenode folder:
hdfs namenode -format
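Among the log output, look for a line similar to the following, which confirms the format succeeded (the exact prefix can differ between versions):
INFO common.Storage: Storage directory /usr/local/hadoop/hadoop_data/hdfs/namenode has been successfully formatted.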
Now we can start the Hadoop services. To start them, run this command (in Hadoop 2.x, start-all.sh is deprecated in favour of running start-dfs.sh followed by start-yarn.sh, but it still works):
start-all.sh
To list the Java processes running after executing the above command, run jps in your terminal:
jps
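The output should look something like this; the process IDs will differ on your machine:
12051 NameNode
12183 DataNode
12362 SecondaryNameNode
12519 ResourceManager
12648 NodeManager
12987 Jps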
SecondaryNameNode, NodeManager, ResourceManager, NameNode, and DataNode must all be running to ensure that the installation is fine and will work for our further tasks.
Now go to the following URLs to reach the web UI of Hadoop:
http://localhost:8088/
http://localhost:50070/
http://localhost:50090/
http://localhost:50075/
Port 8088 is for all applications on your Hadoop system (the ResourceManager UI),
Port 50070 is for NameNode information,
Port 50075 is for DataNode information,
Port 50090 is for Secondary NameNode information.
To stop Hadoop, execute the command:
stop-all.sh
You can later set a passphrase on the SSH key that was left blank in the earlier step by following these links:
https://www.sophos.com/en-us/support/knowledgebase/115708.aspx
http://www.cyberciti.biz/faq/ssh-password-less-login-with-dsa-publickey-authentication/
To run a word count program on your single-node cluster, search for the Word Count program on Hadoop on FindNerd.
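As a quick smoke test of the whole setup, you can also run the word count example that ships with Hadoop against its own config files. This is a sketch assuming version 2.6.3 installed under /usr/local/hadoop; adjust the jar name to your version, and note that relative HDFS paths resolve under /user/<your-username>:
hdfs dfs -mkdir -p /user/$USER/input
hdfs dfs -put /usr/local/hadoop/etc/hadoop/*.xml /user/$USER/input
hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.3.jar wordcount input output
hdfs dfs -cat output/part-r-00000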