Apache Hadoop 2.2.0, the next-generation Hadoop release, breaks through the roughly 4,000-node limit of the original Hadoop 1.x clusters and alleviates the OOM (out-of-memory) problems frequently encountered there. Its new resource-management framework, YARN, is often called the Hadoop operating system: it remains compatible with the original MapReduce computing model while also supporting other parallel computing models.
Suppose we want to build a two-node Hadoop 2.2.0 cluster. One node, with the hostname master, acts as both master and slave and runs the namenode, datanode, secondarynamenode, resourcemanager, and nodemanager daemons; the other node, named slave1, acts as a slave and runs the datanode and nodemanager daemons.
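Both hostnames must resolve on every node, either through DNS or through /etc/hosts. A minimal /etc/hosts sketch (the IP addresses below are placeholders; substitute your own):
192.168.1.100   master
192.168.1.101   slave1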
1. Get the Hadoop binary or source package from http://mirrors.cnnic.cn/apache/hadoop/common/hadoop-2.2.0/ (hadoop-2.2.0.tar.gz or hadoop-2.2.0-src.tar.gz).
2. Create a user with the same name on each machine, such as hduser, and install Java (1.6 or 1.7).
If you downloaded the binary package, extract it, for example to /home/hduser/hadoop-2.2.0.
To build from source instead, see steps 3, 4, and 5 below.
---------------- For compiling from source -----------------------
3. Download protobuf 2.5.0 from https://code.google.com/p/protobuf/downloads/list and the latest Maven from http://maven.apache.org/download.cgi
Compile protobuf 2.5.0:
- tar -xvf protobuf-2.5.0.tar.gz
- cd protobuf-2.5.0
- ./configure --prefix=/opt/protoc/
- make && make install
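The Hadoop build needs protoc on the PATH; a minimal check, assuming protobuf was installed under /opt/protoc as above:
- export PATH=/opt/protoc/bin:$PATH
- protoc --version    (should report libprotoc 2.5.0)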
4. Install the required software packages
For RPM-based Linux (e.g. CentOS/RHEL):
- yum install gcc
- yum install gcc-c++
- yum install make
- yum install cmake
- yum install openssl-devel
- yum install ncurses-devel
For Debian-based Linux:
- sudo apt-get install gcc
- sudo apt-get install g++
- sudo apt-get install make
- sudo apt-get install cmake
- sudo apt-get install libssl-dev
- sudo apt-get install libncurses5-dev
5. Compile the hadoop-2.2.0 source code:
- mvn clean install -DskipTests
- mvn package -Pdist,native -DskipTests -Dtar
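If the build succeeds, the binary distribution tarball is typically produced under hadoop-dist/target; for example:
- ls hadoop-dist/target/hadoop-2.2.0.tar.gz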
6. If you already have a compiled package (for example, hadoop-2.2.0.tar.gz), the installation and configuration process is as follows.
Log on to the master machine as hduser:
6.1 Install ssh
For example, on Ubuntu Linux:
$ sudo apt-get install ssh
$ sudo apt-get install rsync
Now check that you can ssh to localhost without a passphrase:
$ ssh localhost
If you cannot ssh to localhost without a passphrase, execute the following commands:
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
Then, so that the master can ssh to the slaves without a passphrase, copy the key to each slave: scp ~/.ssh/authorized_keys slave1:/home/hduser/.ssh/
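A quick check that passwordless login now works from the master (assuming the hostnames resolve as configured above):
$ ssh localhost hostname
$ ssh slave1 hostname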
6.2 Set JAVA_HOME in hadoop-env.sh and yarn-env.sh in HADOOP_HOME/etc/hadoop.
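Both files take the same export line; the JDK path below is only an example assumption, so point it at your actual Java installation:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64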
6.3 Edit core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml in HADOOP_HOME/etc/hadoop.
A sample core-site.xml:
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hduser/temp</value>
  </property>
</configuration>
A sample hdfs-site.xml:
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/home/hduser/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/home/hduser/dfs/data</value>
  </property>
</configuration>
A sample mapred-site.xml:
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
  <property>
    <name>yarn.app.mapreduce.am.staging-dir</name>
    <value>/home/hduser/temp/hadoop-yarn/staging</value>
  </property>
</configuration>
A sample yarn-site.xml:
<configuration>
  <!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <description>CLASSPATH for YARN applications. A comma-separated list of CLASSPATH entries.</description>
    <name>yarn.application.classpath</name>
    <value>
      HADOOP_HOME/etc/hadoop,
      HADOOP_HOME/share/hadoop/common/*,
      HADOOP_HOME/share/hadoop/common/lib/*,
      HADOOP_HOME/share/hadoop/hdfs/*,
      HADOOP_HOME/share/hadoop/hdfs/lib/*,
      HADOOP_HOME/share/hadoop/mapreduce/*,
      HADOOP_HOME/share/hadoop/mapreduce/lib/*,
      HADOOP_HOME/share/hadoop/yarn/*,
      HADOOP_HOME/share/hadoop/yarn/lib/*
    </value>
  </property>
</configuration>
Replace HADOOP_HOME in the classpath with the actual installation path, e.g. /home/hduser/hadoop-2.2.0.
6.4 Edit the slaves file in HADOOP_HOME/etc/hadoop so that it contains:
master
slave1
After the preceding steps are completed, copy the hadoop-2.2.0 directory and its contents to the same path on the slave machine as hduser, using the scp command:
scp -r /home/hduser/hadoop-2.2.0 slave1:/home/hduser/
7. Format HDFS (usually only once, unless HDFS fails). Execute the following commands in sequence:
- cd /home/hduser/hadoop-2.2.0/bin/
- ./hdfs namenode -format
8. Start and stop the Hadoop cluster (this can be done multiple times; generally the cluster is left running after startup, otherwise running application information will be lost):
- [hduser@master bin]$ cd ../sbin/
- [hduser@master sbin]$ ./start-all.sh
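To confirm that the daemons came up, jps (shipped with the JDK) can be run on each node; the expected processes depend on the node's role:
- [hduser@master sbin]$ jps    (master should show NameNode, SecondaryNameNode, DataNode, ResourceManager, and NodeManager)
- [hduser@slave1 ~]$ jps       (slave1 should show DataNode and NodeManager)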
9. Verification:
HDFS web interface: http://master:50070
ResourceManager (RM) web interface: http://master:8088
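As an additional check, one of the example MapReduce jobs bundled with the distribution can be submitted (a minimal sketch; the jar path is relative to the installation directory):
- cd /home/hduser/hadoop-2.2.0
- bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 2 10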