CentOS Hadoop-2.2.0 Cluster Installation and Configuration
For someone who has just started learning Spark, the first step is naturally to set up an environment and run a few examples. The most popular deployment today is Spark on YARN. As a beginner, I think it is worth going through a full Hadoop cluster installation and configuration rather than only working in local mode: a cluster involves multiple machines and a more complex environment, so many problems that never show up in local mode appear there. Below is a detailed walkthrough of installing a hadoop-2.2.0 cluster on CentOS 6.x (the steps are not very different on other Linux distributions); at the end, the WordCount program is run to verify that the Hadoop cluster was installed successfully.
Machine preparation
Assume the cluster has three machines; they can be physical machines or virtual machines, as long as all three can communicate with each other. One machine acts as the master (running the NameNode and ResourceManager), and the other two act as slaves, or workers (running the DataNode and NodeManager). The machines I prepared are listed below. Make sure the user name is the same on every machine.
| Host Name | User Name | IP Address |
| --- | --- | --- |
| master | hadoop | 192.168.100.10 |
| slave1 | hadoop | 192.168.100.11 |
| slave2 | hadoop | 192.168.100.12 |
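The hostnames in the table are assumed to already be set on each machine. On CentOS 6.x this is typically done as root, for example on the master (a sketch only; repeat with the appropriate name on each slave):

[root@master ~]# hostname master
[root@master ~]# vim /etc/sysconfig/network

# Persist the hostname across reboots
HOSTNAME=master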
Tool preparation
To avoid repeating the installation and configuration on all three machines, we install and configure everything on the master machine only, then package the configured software, copy it to each slave machine, and unpack it there. Before anything else, the master machine must be able to log in to the other machines over SSH without a password; this is the prerequisite for all of the installation work that follows.
1. Configure host
Configure the hosts on the master machine by adding the following entries to the /etc/hosts file:
192.168.100.10 master
192.168.100.11 slave1
192.168.100.12 slave2
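As a quick sanity check (not part of the original steps), the new names should now resolve from the master:

[hadoop@master ~]$ ping -c 1 slave1
[hadoop@master ~]$ ping -c 1 slave2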
2. Configure password-free login from the master
First, run the following command to generate the public key:
[hadoop@master ~]$ ssh-keygen -t rsa
Copy the public key to every machine, including the local machine, so that even ssh localhost works without a password:
[hadoop@master ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@master
[hadoop@master ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@slave1
[hadoop@master ~]$ ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@slave2
To make the cluster easier to manage, switch to the root user and repeat the password-free SSH setup above, so that root can also log in to each machine without a password:
[hadoop@master ~]$ su root
[root@master ~]# ssh-keygen -t rsa
[root@master ~]# ssh-copy-id -i ~/.ssh/id_rsa.pub root@master
[root@master ~]# ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave1
[root@master ~]# ssh-copy-id -i ~/.ssh/id_rsa.pub root@slave2
After completing the above operations, switch back to the hadoop user. The master machine can now log in to every machine in the cluster over SSH without a password, and we can start installing and configuring Hadoop on the master machine.
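As a quick check (the hostnames and user follow the table above), each of the following commands should print the remote hostname without prompting for a password:

[hadoop@master ~]$ ssh localhost hostname
[hadoop@master ~]$ ssh slave1 hostname
[hadoop@master ~]$ ssh slave2 hostname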
JDK Installation
Download the JDK from the official Oracle website and place it in the /home/hadoop directory (all subsequent installation packages go into /home/hadoop as well). The version I downloaded is jdk1.7.0_40. Unpack it and set the JDK environment variables. It is best not to set them globally (in /etc/profile); set them only for the current user.
[hadoop@master ~]$ pwd
/home/hadoop
[hadoop@master ~]$ vim .bash_profile

# JAVA ENVIRONMENT
export JAVA_HOME=$HOME/jdk1.7.0_40
export PATH=$JAVA_HOME/bin:$PATH
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar

[hadoop@master ~]$ source .bash_profile
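If the variables are set correctly (a quick sanity check, not part of the original steps), the java binary from the unpacked JDK should now be the one found on the PATH:

[hadoop@master ~]$ which java
[hadoop@master ~]$ java -version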
Hadoop Installation
Download the Hadoop release from the official Apache website and place it in the /home/hadoop directory. The version I downloaded is hadoop-2.2.0. Unpack it, then first set the Hadoop environment variables.
[hadoop@master ~]$ vim .bash_profile

# HADOOP ENVIRONMENT
export HADOOP_HOME=$HOME/hadoop-2.2.0
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HDFS_CONF_DIR=$HADOOP_HOME/etc/hadoop
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_LOG_DIR=$HADOOP_HOME/logs

[hadoop@master ~]$ source .bash_profile
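To confirm that the variables point at a usable unpacked release (a minimal check, not part of the original steps), the bundled hadoop script should report the version:

[hadoop@master ~]$ $HADOOP_HOME/bin/hadoop version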
Next, we start configuring Hadoop itself. Go to the Hadoop configuration directory ($HADOOP_CONF_DIR), first set JAVA_HOME in hadoop-env.sh and yarn-env.sh, and then modify the Hadoop configuration files.
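A minimal sketch of that edit, assuming the JDK path from the earlier step; both files need the same line:

[hadoop@master ~]$ cd $HADOOP_CONF_DIR
[hadoop@master hadoop]$ vim hadoop-env.sh   # and likewise yarn-env.sh

# Point Hadoop at the JDK installed above
export JAVA_HOME=/home/hadoop/jdk1.7.0_40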
Configure hdfs
Add the following content to the configuration file hdfs-site.xml.
<configuration>
  <property>
    <!-- HDFS address -->
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <!-- Number of replicas stored for each block in HDFS; I set it to 1 here, the default is 3 -->
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <!-- Enable HDFS web access -->
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
  </property>
</configuration>
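As a usage note for dfs.webhdfs.enabled (a hedged example, only meaningful once the cluster has been started later): the NameNode's HTTP interface, which defaults to port 50070 in Hadoop 2.2.0, can then be queried over REST, for instance to list the HDFS root directory:

[hadoop@master ~]$ curl "http://master:50070/webhdfs/v1/?op=LISTSTATUS"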
Configure yarn
To run MapReduce programs, each NodeManager must load the shuffle service at startup; Reduce tasks use this service to remotely copy the intermediate results produced by Map tasks from each NodeManager. Add the following content to the configuration file yarn-site.xml.
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>
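Once the daemons have been started later on, a quick way to confirm this configuration took effect (a hedged check, not part of the original steps) is to ask the ResourceManager for its registered NodeManagers; both slaves should appear:

[hadoop@master ~]$ $HADOOP_HOME/bin/yarn node -list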
Configure the MapReduce computing framework
Since we will use the MapReduce WordCount example to verify that the Hadoop cluster is installed successfully, we need to configure the MapReduce computing framework for Hadoop. Add the following content to the configuration file mapred-site.xml.
<configuration>
  <property>
    <!-- Specify YARN as the resource scheduling platform for MapReduce -->
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
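One detail worth noting (an assumption based on the stock hadoop-2.2.0 tarball): the distribution ships only mapred-site.xml.template, so the file may need to be created first before adding the content above:

[hadoop@master ~]$ cd $HADOOP_CONF_DIR
[hadoop@master hadoop]$ cp mapred-site.xml.template mapred-site.xml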
Configure slaves
Add the following content to the configuration file slaves.
slave1
slave2
So far, we have completed the Hadoop configuration on the master machine.
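As mentioned in the tool-preparation step, the configured directories are then packaged and copied to the slaves rather than reinstalled there. A minimal sketch of that distribution (the archive name is illustrative, and the .bash_profile also needs to be copied or recreated on each slave):

[hadoop@master ~]$ tar -czf hadoop-configured.tar.gz jdk1.7.0_40 hadoop-2.2.0
[hadoop@master ~]$ scp hadoop-configured.tar.gz .bash_profile hadoop@slave1:~/
[hadoop@master ~]$ scp hadoop-configured.tar.gz .bash_profile hadoop@slave2:~/
[hadoop@master ~]$ ssh slave1 "tar -xzf hadoop-configured.tar.gz"
[hadoop@master ~]$ ssh slave2 "tar -xzf hadoop-configured.tar.gz"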