Original: http://disi.unitn.it/~lissandrini/notes/installing-hadoop-on-ubuntu-14.html
This guide shows, step by step, how to set up a multi-node cluster with Hadoop and HDFS 2.4.1 on Ubuntu 14.04. It is an update that reuses many parts of my previous guides about installing Hadoop & HDFS versions 2.2 and 2.3 on Ubuntu.
The text is quite lengthy; I'll soon provide a script to automate some parts.
Assume we have a 3-node cluster; my test case was the following (with IP addresses and shortnames):
10.10.10.104 mynode1
10.10.10.105 mynode2
10.10.10.106 mynode3
Setup
Make sure each node has Oracle JDK 7 or 8 installed. The following are the commands for Java 8; to install Java 7 you just need to change the version number.
sudo add-apt-repository ppa:webupd8team/java -y
sudo apt-get update
sudo apt-get install oracle-java8-installer
sudo apt-get install oracle-java8-set-default
Note: I know some of you are trying to follow this guide on Debian. I am not sure how much of it applies to that OS, but for this specific step the instructions to install Java 8 on Debian are here.
While we are installing software, you may find it useful to also install screen, to start sessions and work on the remote servers, and nmap, to check server ports in case something is not working in the cluster networking:
sudo apt-get install screen nmap
Repeat this installation procedure, up to this point, on every node you have in the cluster.
The following is necessary only on the first node. We start a screen session so we can work remotely without fear of losing our work if we get disconnected:
screen -S installing
After the -S you can put whatever name you want for your session.
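For example, if the connection drops or you detach on purpose (Ctrl+A followed by D), you can later list and resume that same session:
# list existing screen sessions
screen -ls
# reattach to the session named "installing"
screen -r installing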
Now we are going to actually install the software needed, with Maven and the libraries to compile HDFS & Hadoop:
sudo apt-get install maven build-essential zlib1g-dev cmake pkg-config libssl-dev protobuf-compiler
Among these packages, protoc, also called protobuf-compiler, may cause some problems depending on your operating system version. In that case, you can compile and install the correct version (2.5.0) from source.
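A rough sketch of that build follows (the download URL is just one possible source for the 2.5.0 tarball and may have moved; the rest is the usual autotools sequence):
# download and unpack the protobuf 2.5.0 sources
wget https://github.com/google/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
tar -xvf protobuf-2.5.0.tar.gz
cd protobuf-2.5.0
# configure, build and install system-wide
./configure
make
sudo make install
# refresh the linker cache and check the installed version
sudo ldconfig
protoc --version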
Hadoop User & Authentication
Next, let's create the hadoop group and the user hduser, which will also be added to the sudoers; the following commands have to be run one at a time. In the second step adduser will also ask for the login password for hduser:
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
Repeat this procedure, up to this point, on every node you have in the cluster.
We now log in as the new hduser on one node, and we'll create the SSH keys needed to access the other servers:
sudo su - hduser
From now on, in the rest of this guide, all commands will be run as the hduser.
ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
Now let's copy these files to the other nodes, e.g., from mynode1 to mynode2 and mynode3:
scp -r ~/.ssh hduser@10.10.10.105:~/
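Before moving on, it is worth checking that key-based login actually works; the following should print the remote hostname without asking for a password (adjust the address to your own node):
ssh hduser@10.10.10.105 hostname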
Compile the Sources
The following steps are needed only once. Download the Hadoop 2.X stable sources: navigate the list of mirrors, select one, and decide which version to download. With wget you can run something like the following for Hadoop 2.4.1, from Europe:
wget http://www.eu.apache.org/dist/hadoop/core/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz
From the U.S. instead:
wget http://apache.mirror.anlx.net/hadoop/core/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz
Once it has been downloaded, unpack it:
tar -xvf hadoop-2.4.1-src.tar.gz
Then enter the directory and compile
cd hadoop-2.4.1-src/
mvn package -Pdist,native -Dmaven.javadoc.skip=true -DskipTests -Dtar
Notice that, if you are behind a proxy, Maven needs a settings.xml file in the configuration directory ~/.m2 that contains the basic information of your proxy configuration.
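A minimal sketch of such a file, assuming an HTTP proxy whose host and port you replace with your own, would be created with nano ~/.m2/settings.xml and look roughly like this:
<settings>
  <proxies>
    <!-- placeholder values: replace host and port with your proxy -->
    <proxy>
      <id>default-proxy</id>
      <active>true</active>
      <protocol>http</protocol>
      <host>proxy.example.com</host>
      <port>8080</port>
    </proxy>
  </proxies>
</settings>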
The compiled files will be found in hadoop-dist/target/hadoop-2.4.1.tar.gz; just move them to the home directory:
mv hadoop-dist/target/hadoop-2.4.1.tar.gz ~/
Now let's copy these files to the other nodes, e.g., from mynode1 to mynode2 and mynode3:
scp ~/hadoop-2.4.1.tar.gz hduser@10.10.10.105:~/
scp ~/hadoop-2.4.1.tar.gz hduser@10.10.10.106:~/
Install the Compiled Code
The following steps are needed on all the machines. We unpack the compiled version, put it in /usr/local, and create a shortcut called /usr/local/hadoop:
sudo tar -xvf ~/hadoop-2.4.1.tar.gz -C /usr/local/
sudo ln -s /usr/local/hadoop-2.4.1 /usr/local/hadoop
sudo chown -R hduser:hadoop /usr/local/hadoop-2.4.1
Set up ENV Variables
The following steps are needed on all the machines. We update the profile of the shell, i.e., we edit the .profile file to add some environment variables; in order to upset vim and emacs users equally, we'll use a text editor called nano:
nano ~/.profile
And we add, at the end, the following
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
export HADOOP_INSTALL=/usr/local/hadoop
export HADOOP_HOME=$HADOOP_INSTALL
export PATH=$PATH:$HADOOP_INSTALL/bin
export PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_MAPRED_HOME=$HADOOP_INSTALL
export HADOOP_COMMON_HOME=$HADOOP_INSTALL
export HADOOP_HDFS_HOME=$HADOOP_INSTALL
export HADOOP_CONF_DIR=${HADOOP_HOME}"/etc/hadoop"
export YARN_HOME=$HADOOP_INSTALL
alias hfs="hdfs dfs"
(To save: Ctrl+O, Enter, and then Ctrl+X to exit.)
Note: If you installed Hadoop somewhere else, check the proper directory path for $HADOOP_INSTALL, but do not change $HADOOP_CONF_DIR.
Now we make the edit effective by reloading the .profile file with
source ~/.profile
We also have to edit the hadoop-env.sh file for the same $JAVA_HOME variable, which it seems unable to set up properly, so we open the file with
nano /usr/local/hadoop/etc/hadoop/hadoop-env.sh
and there we can replace
export JAVA_HOME=${JAVA_HOME}
with
export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")
If you want to be sure it worked, you can print some values, like
echo $JAVA_HOME
echo $HADOOP_HOME
Set up Data Directory & Logs
We create the directories where HDFS data files and logs are stored; you can create them wherever you prefer.
The first directory is actually needed only on the NameNode (main) machine:
mkdir -pv /usr/local/hadoop/data/namenode
These steps will be needed on all the machines:
mkdir -pv /usr/local/hadoop/data/datanode
mkdir -pv $HADOOP_INSTALL/logs
Edit Configuration Files
These steps will be needed only on the main machine; then we'll copy the entire conf directory to the other machines.
We put this information in the hdfs-site.xml file, which we open with
nano $HADOOP_INSTALL/etc/hadoop/hdfs-site.xml
and paste the following between the <configuration> tags:
<property>
  <name>dfs.datanode.data.dir</name>
  <value>file:///usr/local/hadoop/data/datanode</value>
  <description>DataNode directory</description>
</property>
<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///usr/local/hadoop/data/namenode</value>
  <description>NameNode directory for namespace and transaction logs storage.</description>
</property>
The following are additional configuration parameters to put alongside the previous ones; among them is the replication parameter, which sets the number of redundant copies we want. It does not necessarily have to match the number of nodes in the cluster.
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
<property>
  <name>dfs.permissions</name>
  <value>false</value>
</property>
<property>
  <name>dfs.datanode.use.datanode.hostname</name>
  <value>false</value>
</property>
<property>
  <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
  <value>false</value>
</property>
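As a side note, dfs.replication is only the cluster-wide default: the replication factor of a file already stored in HDFS can be changed later; for example (the path below is just a placeholder for a file of yours):
# set the replication factor of one file to 3 and wait until re-replication finishes
hdfs dfs -setrep -w 3 /path/to/some/file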
Notice: When you start your HDFS distributed filesystem, you'll have a main NameNode and a Secondary NameNode. The Secondary NameNode is *not what you think it is*.
The term "secondary name-node" is somewhat misleading. It is not a name-node in the sense that data-nodes cannot connect to the secondary name-node, and in no event it can replace the primary name-node in case of its failure. – from the Hadoop FAQ
We want to put the Secondary NameNode on a different machine, one that is not the master but maybe one of the workers. Assume you decide your cluster's main node is
10.10.10.104 mynode1
and assume you decide to have the Secondary NameNode on
10.10.10.105 mynode2
Then we add the following to the hdfs-site.xml file:
<property>
  <name>dfs.namenode.http-address</name>
  <value>10.10.10.104:50070</value>
  <description>Your NameNode hostname for HTTP access.</description>
</property>
<property>
  <name>dfs.namenode.secondary.http-address</name>
  <value>10.10.10.105:50090</value>
  <description>Your Secondary NameNode hostname for HTTP access.</description>
</property>
I thank my colleague Sabeur for helping me with this bit on the Secondary NameNode.
Then we also point the Hadoop cluster to mynode1's IP, to tell it where we host the Hadoop NameNode, by editing:
nano $HADOOP_INSTALL/etc/hadoop/core-site.xml
And we add inside the <configuration> tag the following:
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://10.10.10.104/</value>
  <description>NameNode URI</description>
</property>
We put the IP addresses of the nodes to be used as DataNodes in the slaves file; we open it with
nano $HADOOP_INSTALL/etc/hadoop/slaves
and we put the list of server addresses, one per line. Note that in this case the master is also used as a DataNode, so we put there the following list:
10.10.10.104
10.10.10.105
10.10.10.106
Up to here it was mainly about HDFS; now we configure the YARN cluster, i.e., the execution engine. We then edit yarn-site.xml:
nano $HADOOP_INSTALL/etc/hadoop/yarn-site.xml
Again we add the following inside the <configuration> tag:
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <name>yarn.resourcemanager.resource-tracker.address</name>
  <value>10.10.10.104:8025</value>
</property>
<property>
  <name>yarn.resourcemanager.scheduler.address</name>
  <value>10.10.10.104:8030</value>
</property>
<property>
  <name>yarn.resourcemanager.address</name>
  <value>10.10.10.104:8050</value>
</property>
Now it is time to update all the nodes with this new configuration, so we copy the directory from mynode1 to mynode2 and mynode3 with the following commands (note the destination directory):
scp -r $HADOOP_INSTALL/etc/hadoop hduser@10.10.10.105:$HADOOP_INSTALL/etc/
scp -r $HADOOP_INSTALL/etc/hadoop hduser@10.10.10.106:$HADOOP_INSTALL/etc/
Initialize HDFS
These commands will be used only on the main node.
If all went well, we should be able to run the following command
hadoop version
and obtain something like
Hadoop 2.4.1
Subversion Unknown -r Unknown
Compiled by hduser on 2014-08-23T15:29Z
Compiled with protoc 2.5.0
From source with checksum bb7ac0a3c73dc131f4844b873c74b630
This command was run using /usr/local/hadoop-2.4.1/share/hadoop/common/hadoop-common-2.4.1.jar
Now the first step is to format the NameNode; this basically initializes the HDFS file system. So on the main node you run:
hdfs namenode -format
The Hadoop NameNode is the centralized place of an HDFS file system which keeps the directory tree of all files in the file system, and tracks where across the cluster the file data is kept. In short, it keeps the metadata related to the DataNodes. When we format the NameNode, it formats the metadata related to the DataNodes. – from StackOverflow
Start and test the cluster!
These commands will be used only on the main node. Now we can start the HDFS cluster with the command
start-dfs.sh
and if the previous command didn't complain about anything, we can create a directory in our HDFS filesystem with
hadoop fs -mkdir -p /datastore
Note that we used the full hadoop fs command, but in our profile we added the hfs alias.
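Thanks to that alias, the following two commands are equivalent, and from now on we'll mostly use the shorter form:
hadoop fs -mkdir -p /datastore
hfs -mkdir -p /datastore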
Now check the size of the files inside the datanode directory
du -sh /usr/local/hadoop/data/datanode
and then we can create a new directory inside it and, as a test, copy in the .tar.gz file of Hadoop:
hfs -mkdir -p /datastore/test
hfs -copyFromLocal ~/hadoop-2.4.1.tar.gz /datastore/
Now check again the size of the files inside the datanode directory. You can run the same command on all nodes and see that the file is also on the other servers (all of it or part of it, depending on the replication level and the number of nodes you have):
du -sh /usr/local/hadoop/data/datanode
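If you want a more detailed view of where the blocks of the uploaded file ended up, hdfs fsck can list the blocks and the DataNodes that hold each replica (using the file we just copied in as an example):
hdfs fsck /datastore/hadoop-2.4.1.tar.gz -files -blocks -locations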
You can check the content of the HDFS directory with
hfs -ls /datastore
and remove the files with
hfs -rm /datastore/test/*
In case you want to delete an entire directory, you can instead use
hfs -rm -r /datastore/test
The distributed file system is now running, and you can check the processes with
jps
which will give you, on the main node, something like
18755 DataNode
18630 NameNode
18969 SecondaryNameNode
19387 Jps
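Another quick sanity check is the NameNode web interface, which listens on the dfs.namenode.http-address we configured earlier (10.10.10.104:50070 in this example); you can open it in a browser or simply probe it with curl:
# should return the HTML of the NameNode status page
curl -s http://10.10.10.104:50070/ | head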
Up to here we set up the distributed filesystem; this will come in handy not only for Hadoop, but also for other distributed computation engines, like Spark or Flink (formerly Stratosphere).
Finally, to start the actual Hadoop YARN execution engine, you just run
start-yarn.sh
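To check that YARN came up, you can run jps again (the main node should now also show a ResourceManager, and the workers a NodeManager), or ask the ResourceManager for the list of registered nodes:
# lists the NodeManagers that registered with the ResourceManager
yarn node -list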
Configure hostnames
As a side note, in this guide we used IP addresses in the configuration files. If you want to use the shortnames instead, you shall first update /etc/hosts so that all of the nodes are listed with their shortname:
10.10.10.104 mynode1
10.10.10.105 mynode2
10.10.10.106 mynode3
In this case, make sure that the only appearance of the IP 127.0.0.1 is with localhost. This is very important, so if in your hosts file there is a line like
127.0.0.1 mynode1
Delete it!