For recent work I needed to feel my way through building a Hadoop 2.2.0 (YARN) cluster. I ran into a number of problems along the way, so I am recording them here in the hope that they will help others who need to do the same.
This article does not cover compiling Hadoop 2.2; compilation issues are covered in a separate article, "Hadoop 2.2.0 Source Compilation Notes". This article assumes that we already have a Hadoop 2.2.0 64-bit release package.
Due to Spark compatibility issues, we later switched to Hadoop 2.0.5-alpha (2.2.0 is the stable version). Where the 2.0.5 configuration differs, it is called out explicitly.
1. Introduction
"This section is excerpted from http://www.cnblogs.com/xia520pi/archive/2012/05/16/2503949.html"
Hadoop is an open source distributed computing platform under the Apache Software Foundation. With the Hadoop Distributed File System (HDFS) and MapReduce (an open source implementation of Google's MapReduce) at its core, Hadoop provides the user with a distributed infrastructure that is transparent to the underlying details of the system.
A Hadoop cluster has two broad categories of roles: master and slave. An HDFS cluster is made up of one NameNode and several DataNodes. The NameNode acts as the primary server, managing the file system namespace and client access to the file system; the DataNodes manage the data stored on their nodes. The MapReduce framework consists of a JobTracker running on the master node and a TaskTracker running on each slave node. The master is responsible for scheduling all the tasks that make up a job; these tasks are distributed across different slave nodes. The master node monitors their execution and restarts failed tasks, while the slave nodes only execute the tasks assigned by the master. When a job is submitted, the JobTracker receives the job and its configuration information, distributes the configuration to the slave nodes, schedules the tasks, and monitors the TaskTrackers' execution.
As the introduction above shows, HDFS and MapReduce together form the core of the Hadoop distributed system architecture. HDFS provides a distributed file system on the cluster, while MapReduce provides distributed computing and task processing on the cluster. HDFS supplies the file operations and storage that MapReduce tasks need, and MapReduce distributes, tracks, and executes tasks on top of HDFS and collects the results. Together they accomplish the main work of a Hadoop distributed cluster.
2. System Environment
System version
CentOS 6.4 64bit
uname -a
Linux * 2.6.32_1-7-0-0 #1 SMP * * x86_64 x86_64 x86_64 GNU/Linux
JAVA Environment
Install Java 1.6
Extract the JDK into a local directory.
Add the JAVA_HOME environment variable to the .bashrc file:
export JAVA_HOME="/home/<hostname>/local/jdk1.6.0_45/"
export JRE_HOME="$JAVA_HOME/jre"   # the JRE bundled inside the JDK directory
export PATH=$JAVA_HOME/bin:$JRE_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib:$JRE_HOME/lib:$CLASSPATH
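After sourcing .bashrc, it is worth a quick sanity check that the JDK is being picked up from the new paths (the exact version string depends on your JDK build):

source ~/.bashrc
java -version    # should report a 1.6.0_45 JVM if JAVA_HOME above is correct
which java       # should point into the extracted JDK directory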
HADOOP
Extract hadoop-2.2.0-bin_64.tar.gz (a package I compiled myself on CentOS 6.4 64-bit) into the user's home directory.
export HADOOP_HOME=/home/<hostname>/hadoop-2.2.0
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
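Optionally, confirm that the hadoop command on the PATH is the one from the extracted release:

which hadoop     # should resolve to a path under HADOOP_HOME/bin
hadoop version   # prints the Hadoop release, here 2.2.0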
Test local mode
Hadoop is configured in local (standalone) mode by default, so after extraction you can run a local test without modifying any configuration.
Create a local input directory
mkdir input
Populate it with data
cp conf/*.xml input
Execute Hadoop
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
View Results
cat output/*
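If you want to re-run the example, note that Hadoop refuses to write to an output directory that already exists, so remove it first:

rm -rf output
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
cat output/*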
3. Network Environment
Because of the earlier test environment and configuration, we simply use two nodes:
Master machine, acting as NameNode & DataNode
Slave machine, acting as DataNode
Set hostname
HDFS uses hostnames rather than IP addresses for communication between nodes: Hadoop reverse-resolves the hostname, and even if you configure IP addresses it will still use the hostname to start the TaskTracker. All configuration files must therefore use hostnames, not IP addresses (learned this the hard way). We set up the two machines as follows:
Machine | IP              | Hostname | Role
Master  | 192.168.216.135 | Master   | NameNode, DataNode
Slave   | 192.168.216.136 | Slave1   | DataNode
The command to temporarily change the hostname is (requires root):
hostname <new_name>
A permanent change requires editing the configuration file /etc/sysconfig/network:
HOSTNAME=<new_name>
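For example, on the master node the file would typically look like the following (NETWORKING=yes is the usual companion line on CentOS 6; adjust HOSTNAME per machine, and log out and back in or reboot for the change to take full effect):

NETWORKING=yes
HOSTNAME=master

Running hostname with no arguments shows the currently active name.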
Modify the hosts file
Edit the /etc/hosts file (this must be done on every machine) and add the following:
192.168.216.135 master
192.168.216.136 slave1

The /etc/hosts contents on the NameNode and the DataNodes must match the machines' IP addresses and hostnames. Do not substitute 127.0.0.1 for the local IP address; otherwise, when Hadoop looks up the IP for a hostname, it will end up using 127.0.0.1 as the address.
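Once /etc/hosts is in place on both machines, it is worth confirming that each hostname resolves to the LAN address rather than 127.0.0.1, for example:

ping -c 1 master   # should show 192.168.216.135
ping -c 1 slave1   # should show 192.168.216.136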
Set up passwordless SSH access
Bidirectional passwordless SSH access is required between the master and every slave (slave-to-slave access is optional).
Please see the separate article on passwordless SSH access; this article does not repeat the details.
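For completeness, a minimal sketch of the usual key-based setup (run as the Hadoop user on the master, then repeat in the other direction from the slave, since bidirectional access is required; <user> here is a placeholder for your actual account):

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa   # generate a passphrase-less key pair
ssh-copy-id <user>@slave1                  # append the public key to slave1's authorized_keys
ssh slave1 hostname                        # should print slave1 without asking for a password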
Firewall settings
Strictly speaking, we should open only the specific ports that Hadoop needs. For simplicity, we disable SELinux and iptables here.
How to turn off SELinux
setenforce 1   # switch SELinux to enforcing mode
setenforce 0   # switch SELinux to permissive mode
To disable SELinux permanently, edit /etc/selinux/config:
SELINUX=disabled
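For iptables, the corresponding CentOS 6 commands are:

service iptables stop     # stop the firewall immediately
chkconfig iptables off    # keep it from starting on boot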