Original: http://disi.unitn.it/~lissandrini/notes/installing-hadoop-on-ubuntu-14.html
This guide shows, step by step, how to set up a multi-node cluster with Hadoop and HDFS 2.4.1 on Ubuntu 14.04. It is an update that reuses many parts of my previous guides about installing Hadoop & HDFS versions 2.2 and 2.3 on Ubuntu.
The text is quite lengthy; I'll soon provide a script to automate some parts.
Assume we have a 3-node cluster; my test case was the following (with IP addresses and shortnames):
10.10.10.104 mynode1
10.10.10.105 mynode2
10.10.10.106 mynode3
Setup
Make sure each node has Oracle JDK 7 or 8 installed. The following are the commands for Java 8; to install Java 7 you just need to change the version number.
sudo add-apt-repository ppa:webupd8team/java -y
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
Note: I know some of you are trying to follow this guide on Debian. I am not sure how much of it applies to that OS, but for this specific step the instructions to install Java 8 on Debian are here.
While we are installing software, you may find it useful to also install screen, to start sessions and work on the remote servers, and nmap, to check server ports in case something is not working in the cluster networking:
sudo apt-get install screen nmap
Repeat this installation procedure, up to this point, on every node you have in the cluster.
The following is necessary only on the first node. We start a screen session so we can work remotely without fear of losing our work if we get disconnected:
screen -S installing
After the -S you can put whatever name you want for your session.
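For example, if the connection drops or you detach on purpose (Ctrl+A followed by D), you can later list and resume that same session:
# list existing screen sessions
screen -ls
# reattach to the session named "installing"
screen -r installing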
Now we are going to actually install the software needed, with Maven and the libraries to compile HDFS & Hadoop:
sudo apt-get install maven build-essential zlib1g-dev cmake pkg-config libssl-dev protobuf-compiler
Among these packages, protoc, also called protobuf-compiler, may cause some problems depending on your operating system version. In that case, you can compile and install the correct version (2.5.0) from source.
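A rough sketch of that build follows (the download URL is just one possible source for the 2.5.0 tarball and may have moved; the rest is the usual autotools sequence):
# download and unpack the protobuf 2.5.0 sources
wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
tar -xvf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
# configure, build and install system-wide
./configure
make
sudo make install
# refresh the linker cache and check the installed version
sudo ldconfig
protoc --version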
Hadoop User & Authentication
Next, let's create the hadoop group and the user hduser, which will also be added to the sudoers; the following commands have to be run one at a time. In the second step adduser will also ask for the login password for hduser:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
Repeat this procedure, up to this point, on every node you have in the cluster.
We now log in as the new hduser on one node, and we'll create the SSH keys needed to access the other servers:
sudo su - hduser
From now on, in the rest of this guide, all commands will be run as the hduser.
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now let's copy these files to the other nodes, e.g., from mynode1 to mynode2 and mynode3:
scp -r ~/.ssh hduser@10.10.10.105:~/
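Before moving on, it is worth checking that key-based login actually works; the following should print the remote hostname without asking for a password (adjust the address to your own node):
ssh hduser@10.10.10.105 hostname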
Compile the Sources
The following steps are needed only once. Download the Hadoop 2.X stable sources: navigate the list of mirrors, select one, and decide which version to download. With wget you can run something like the following for Hadoop 2.4.1, from Europe:
wget http://www.eu.apache.org/dist/hadoop/core/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz
From the U.S. instead:
wget http://apache.mirror.anlx.net/hadoop/core/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz
Once it has been downloaded, unpack it:
tar -xvf hadoop-2.4.1-src.tar.gz
Then enter the directory and compile
cd hadoop-2.4.1-src/
mvn package -Pdist,native -Dmaven.javadoc.skip=true -DskipTests -Dtar
Notice that, if you are behind a proxy, Maven needs a settings.xml file in the configuration directory ~/.m2 that contains the basic information of your proxy configuration.
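A minimal sketch of such a file, assuming an HTTP proxy whose host and port you replace with your own, would be created with nano ~/.m2/settings.xml and look roughly like this:
<settings>
  <proxies>
    <!-- placeholder values: replace host and port with your proxy -->
    <proxy>
      <id>default-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>proxy.example.com</host>
      <port>8080</port>
    </proxy>
  </proxies>
</settings>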
The compiled files will be found in hadoop-dist/target/hadoop-2.4.1.tar.gz; just move them to the home directory:
mv hadoop-dist/target/hadoop-2.4.1.tar.gz ~/
Now let's copy these files to the other nodes, e.g., from mynode1 to mynode2 and mynode3:
scp ~/hadoop-2.4.1.tar.gz hduser@10.10.10.105:~/
scp ~/hadoop-2.4.1.tar.gz hduser@10.10.10.106:~/
Install the Compiled Code
The following steps are needed on all the machines. We unpack the compiled version, put it in /usr/local, and create a shortcut called /usr/local/hadoop:
sudo tar -xvf ~/hadoop-2.4.1.tar.gz -C /usr/local/
sudo ln -s /usr/local/hadoop-2.4.1 /usr/local/hadoop
sudo chown -R hduser:hadoop /usr/local/hadoop-2.4.1
Set up ENV Variables
The following steps are needed on all the machines. We update the profile of the shell, i.e., we edit the .profile file to add some environment variables; in order to upset vim and emacs users equally, we'll use a text editor called nano:
nano ~/.profile
And we add, at the end, the following
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_HOME=$HADOOP_INSTALL
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export YARN_HOME=$HADOOP_INSTALL
alias hfs="hdfs dfs"
(To save: Ctrl+O, Enter, and then Ctrl+X to exit.)
Note: If you installed Hadoop somewhere else, check the proper directory path for $HADOOP_INSTALL, but do not change $HADOOP_CONF_DIR.
Now we make the edit effective by reloading the .profile file with
source ~/.profile
We also have to edit the hadoop-env.sh file for the same $JAVA_HOME variable, which it seems unable to set up properly, so we open the file with
nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
and there we can replace
export JAVA_HOME=${JAVA_HOME}
with
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
If you want to be sure it worked, you can print some values, like
echo $JAVA_HOME
echo $HADOOP_HOME
Set up Data Directory & Logs
We create the directories where HDFS data files and logs are stored; you can create them wherever you prefer.
The first directory is actually needed only on the NameNode (main) machine:
mkdir -pv /usr/local/hadoop/data/namenode
These steps will be needed on all the machines:
mkdir -pv /usr/local/hadoop/data/datanode
mkdir -pv $HADOOP_INSTALL/logs
Edit Configuration Files
These steps will be needed only on the main machine; then we'll copy the entire conf directory to the other machines.
We put this information in the hdfs-site.xml file, which we open with
nano $HADOOP_INSTALL/etc/hadoop/hdfs-site.xml
and paste the following between the <configuration> tags:
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///usr/local/hadoop/data/datanode</value>
  <description>DataNode directory</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///usr/local/hadoop/data/namenode</value>
  <description>NameNode directory for namespace and transaction logs storage.</description>
</property>
The following are additional configuration parameters to put alongside the previous ones; among them is the replication parameter, which sets the number of redundant copies we want. It does not necessarily have to match the number of nodes in the cluster.
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
<property>
  <name>dfs.datanode.use.datanode.hostname</name>
  <value>false</value>
</property>
<property>
  <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
  <value>false</value>
</property>
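As a side note, dfs.replication is only the cluster-wide default: the replication factor of a file already stored in HDFS can be changed later; for example (the path below is just a placeholder for a file of yours):
# set the replication factor of one file to 3 and wait until re-replication finishes
hdfs dfs -setrep -w 3 /path/to/some/file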
Notice: When you start your HDFS distributed filesystem, you'll have a main NameNode and a Secondary NameNode. The Secondary NameNode is *not what you think it is*.
The term "secondary name-node" is somewhat misleading. It is not a name-node in the sense that data-nodes cannot connect to the secondary name-node, and in no event it can replace the primary name-node in case of its failure. – from the Hadoop FAQ
We want to put the Secondary NameNode on a different machine, one that is not the master but maybe one of the workers. Assume you decide your cluster's main node is
10.10.10.104 mynode1
and assume you decide to have the Secondary NameNode on
10.10.10.105 mynode2
Then we add the following to the hdfs-site.xml file:
<property>
  <name>dfs.namenode.http-address</name>
  <value>10.10.10.104:50070</value>
  <description>Your NameNode hostname for HTTP access.</description>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>10.10.10.105:50090</value>
  <description>Your Secondary NameNode hostname for HTTP access.</description>
</property>
I thank my colleague Sabeur for helping me with this bit on the Secondary NameNode.
Then we also point the Hadoop cluster to mynode1's IP, to tell it where we host the Hadoop NameNode, by editing:
nano $HADOOP_INSTALL/etc/hadoop/core-site.xml
And we add inside the <configuration> tag the following:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://10.10.10.104/</value>
  <description>NameNode URI</description>
</property>
We put the IP addresses of the nodes to be used as DataNodes in the slaves file; we open it with
nano $HADOOP_INSTALL/etc/hadoop/slaves
and we put the list of server addresses, one per line. Note that in this case the master is also used as a DataNode, so we put there the following list:
10.10.10.104
10.10.10.105
10.10.10.106
Up to here it was mainly about HDFS; now we configure the YARN cluster, i.e., the execution engine. We then edit yarn-site.xml:
nano $HADOOP_INSTALL/etc/hadoop/yarn-site.xml
Again we add the following inside the <configuration> tag:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>10.10.10.104:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>10.10.10.104:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>10.10.10.104:8050</value>
</property>
Now it is time to update all the nodes with this new configuration, so we copy the directory from mynode1 to mynode2 and mynode3 with the following commands (note the destination directory):
scp -r $HADOOP_INSTALL/etc/hadoop hduser@10.10.10.105:$HADOOP_INSTALL/etc/
scp -r $HADOOP_INSTALL/etc/hadoop hduser@10.10.10.106:$HADOOP_INSTALL/etc/
Initialize HDFS
These commands will be used only on the main node.
If all went well, we should be able to run the following command
hadoop version
and obtain something like
Hadoop 2.4.1
Subversion Unknown -r Unknown
Compiled by hduser on 2014-08-23T15:29Z
Compiled with protoc 2.5.0
From source with checksum bb7ac0a3c73dc131f4844b873c74b630
This command was run using /usr/local/hadoop-2.4.1/share/hadoop/common/hadoop-common-2.4.1.jar
Now the first step is to format the NameNode; this basically initializes the HDFS file system. So on the main node you run:
hdfs namenode -format
The Hadoop NameNode is the centralized place of an HDFS file system which keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. In short, it keeps the metadata related to the DataNodes. When we format the NameNode, it formats the metadata related to the DataNodes. – from StackOverflow
Start and test the cluster!
These commands will be used only on the main node. Now we can start the HDFS cluster with the command
start-dfs.sh
and if the previous command didn't complain about anything, we can create a directory in our HDFS filesystem with
hadoop fs -mkdir -p /datastore
Note that we used the full hadoop fs command, but in our profile we added the hfs alias.
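Thanks to that alias, the following two commands are equivalent, and from now on we'll mostly use the shorter form:
hadoop fs -mkdir -p /datastore
hfs -mkdir -p /datastore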
Now check the size of the files inside the datanode directory
du -sh /usr/local/hadoop/data/datanode
and then we can create a new directory inside it and, as a test, copy in the .tar.gz file of Hadoop:
hfs -mkdir -p /datastore/test
hfs -copyFromLocal ~/hadoop-2.4.1.tar.gz /datastore/
Now check again the size of the files inside the datanode directory. You can run the same command on all nodes and see that the file is also on the other servers (all of it or part of it, depending on the replication level and the number of nodes you have):
du -sh /usr/local/hadoop/data/datanode
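If you want a more detailed view of where the blocks of the uploaded file ended up, hdfs fsck can list the blocks and the DataNodes that hold each replica (using the file we just copied in as an example):
hdfs fsck /datastore/hadoop-2.4.1.tar.gz -files -blocks -locations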
You can check the content of the HDFS directory with
hfs -ls /datastore
and remove the files with
hfs -rm /datastore/test/*
In case you want to delete an entire directory, you can instead use
hfs -rm -r /datastore/test
The distributed file system is now running, and you can check the processes with
jps
which will give you, on the main node, something like
18755 DataNode
18630 NameNode
18969 SecondaryNameNode
19387 Jps
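Another quick sanity check is the NameNode web interface, which listens on the dfs.namenode.http-address we configured earlier (10.10.10.104:50070 in this example); you can open it in a browser or simply probe it with curl:
# should return the HTML of the NameNode status page
curl -s http://10.10.10.104:50070/ | head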
Up to here we set up the distributed filesystem; this will come in handy not only for Hadoop, but also for other distributed computation engines, like Spark or Flink (formerly Stratosphere).
Finally, to start the actual Hadoop YARN execution engine, you just run
start-yarn.sh
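To check that YARN came up, you can run jps again (the main node should now also show a ResourceManager, and the workers a NodeManager), or ask the ResourceManager for the list of registered nodes:
# lists the NodeManagers that registered with the ResourceManager
yarn node -list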
Configure hostnames
As a side note, in this guide we used IP addresses in the configuration files. If you want to use the shortnames instead, you shall first update /etc/hosts so that all of the nodes are listed with their shortname:
10.10.10.104 mynode1
10.10.10.105 mynode2
10.10.10.106 mynode3
In this case, make sure that the only appearance of the IP 127.0.0.1 is with localhost. This is very important, so if in your hosts file there is a line like
127.0.0.1 mynode1
Delete it!