One. Environment
System: Ubuntu 14.04 32bit
Hadoop version: Hadoop 2.4.1 (Stable)
JDK Version: 1.7
Cluster size: 3 machines
Note: the Hadoop 2.4.1 binary downloaded from the Apache website is built for 32-bit Linux, so if you need to deploy on a 64-bit system you have to download the source (src) package and compile it yourself.
Two. Preparatory work
(Steps 1-4 below must be performed on all three machines.)
1. Install ubuntu14.04 32bits
2. Create a new user hadoop and grant it administrator privileges
Enter the following command (it is easiest to do the whole Hadoop configuration with root privileges; on Ubuntu you must first set a password for root with: sudo passwd root):
$ sudo adduser hadoop
Follow the prompts to enter the user information and set the password (for example, hadoop). When the command finishes, the user's home directory has been created automatically, along with a group with the same name as the user. (adduser is a wrapper around useradd; on some other Linux systems the two commands behave much the same, but plain useradd on Ubuntu does not create the home directory.)
Let the user gain administrator privileges:
$ sudo vim /etc/sudoers
Modify the file as follows:
# User privilege specification
root    ALL=(ALL) ALL
hadoop  ALL=(ALL) ALL
Save and exit; the hadoop user now has sudo (root) privileges.
3. Install the JDK (after installation, run java -version to check the JDK version)
Download the Java installation package and install it according to its installation instructions.
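As a sketch of a manual installation, assuming an Oracle JDK 7 tarball named jdk-7u80-linux-i586.tar.gz (the file name, version and install path here are assumptions, not part of the original tutorial):
$ sudo mkdir -p /usr/lib/jvm
$ sudo tar -zxf jdk-7u80-linux-i586.tar.gz -C /usr/lib/jvm
$ echo 'export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_80' >> ~/.bashrc
$ echo 'export PATH=$JAVA_HOME/bin:$PATH' >> ~/.bashrc
$ source ~/.bashrc
$ java -version
Hadoop needs JAVA_HOME to be set, so however you install the JDK, make sure java -version works for the hadoop user on all three machines.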
4. Modify the Machine network configuration
Modify each machine's hostname to Master, Slave1 or Slave2 (one name per machine):
hadoop@Master:~$ sudo vim /etc/hostname
(the file should contain Master, Slave1 or Slave2 on the corresponding machine)
The IP addresses of the three machines must be static. Then modify the hosts file:
hadoop@Master:~$ sudo vim /etc/hosts
Add one line per machine in the form: IP hostname
(with the hostnames Master, Slave1 and Slave2 matching their IPs)
Restart the machines afterwards and you will see the new hostname in the terminal prompt.
(After configuring the hostnames, ping each machine from the others to test whether the configuration succeeded. An example hosts file is sketched below.)
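For illustration, the hosts file could look like the following (the IP addresses are placeholders and must be replaced with the real static addresses of your machines):
192.168.1.101   Master
192.168.1.102   Slave1
192.168.1.103   Slave2
The same three lines should appear in /etc/hosts on all three machines.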
5. Configure passwordless SSH login
Install SSH (if it is not installed by default, or the installed version is too old, use the following command; all three machines need the SSH service):
hadoop@Master:~$ sudo apt-get install ssh
Generate Master's public key:
hadoop@Master:~$ cd ~/.ssh
hadoop@Master:~$ ssh-keygen -t rsa    # keep pressing Enter; the generated key is saved as .ssh/id_rsa
The Master node also needs to be able to SSH to itself without a password; perform this step on the Master node:
hadoop@Master:~$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
(This can be verified with ssh Master after completion.)
Then transfer the public key to the Slave1 (and Slave2) node:
hadoop@Master:~$ scp ~/.ssh/id_rsa.pub hadoop@Slave1:/home/hadoop/
Then append the public key to the appropriate location on the Slave1 node (if ~/.ssh does not exist there yet, create it first with mkdir ~/.ssh):
hadoop@Slave1:~$ cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
After this, the Master node can SSH to Slave1 (Slave2) without a password.
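A quick way to check this, from the Master node (assuming the steps above were repeated for Slave2 as well):
hadoop@Master:~$ ssh Slave1
hadoop@Slave1:~$ exit
hadoop@Master:~$ ssh Slave2
hadoop@Slave2:~$ exit
Each ssh should log in directly without prompting for a password.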
Three. Configuring the cluster/Distributed environment
1. Download hadoop-2.4.1.tar.gz and unpack it in the /home/hadoop directory. (Do the configuration on Master first, then copy the configured installation to the slave nodes.)
2. Modify the file slaves
hadoop@Master:~$ cd /home/hadoop/hadoop-2.4.1/etc/hadoop/
hadoop@Master:~$ vim slaves
Delete the original localhost and write each slave hostname on its own line, as follows:
Slave1
Slave2
3. Modify the file core-site.xml
Change the original, empty
<configuration>
</configuration>
block so that it contains the following properties. The remaining configuration files below are modified in the same way (a sketch of the complete file follows the property list).
<property>
<name>fs.defaultFS</name>
<value>hdfs://Master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/hadoop/hadoopInfo/tmp</value>
</property>
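For clarity, a sketch of the complete core-site.xml with these two properties placed inside the <configuration> element:
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://Master:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>file:/home/hadoop/hadoopInfo/tmp</value>
</property>
</configuration>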
(If /home/hadoop/hadoopInfo/tmp is not found when you start the services, create the directory manually on all three machines.)
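For example, on each of the three machines:
hadoop@Master:~$ mkdir -p /home/hadoop/hadoopInfo/tmp
(mkdir -p also creates the parent hadoopInfo directory if it does not yet exist; run the same command on Slave1 and Slave2.)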
4. Modify hdfs-site.xml
<property>
<name>dfs.namenode.name.dir</name>
<value>file:/home/hadoop/hadoopInfo/tmp/dfs/name</value>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>file:/home/hadoop/hadoopInfo/tmp/dfs/data</value>
</property>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
5. Modify the file mapred-site.xml. This file does not exist by default, so first copy it from the template:
hadoop@Master:~$ cp mapred-site.xml.template mapred-site.xml
Then change the configuration as follows:
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
6. Modify the file yarn-site.xml:
<property>
<name>yarn.resourcemanager.hostname</name>
<value>Master</value>
</property>
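Many Hadoop 2.x cluster setups also enable the MapReduce shuffle service in yarn-site.xml. If MapReduce jobs fail to run on YARN, the following commonly used property (an addition beyond the minimal configuration shown above) can be added:
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>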
7. Once configured, copy the Hadoop directory on Master to each slave node. (Copying the directory directly with scp mostly works, but some things, such as symbolic links, are handled differently by scp, so it is safer to pack the directory into a tarball first and copy that.)
hadoop@Master:~$ cd /home/hadoop
hadoop@Master:~$ sudo tar -zcf ./hadoop-2.4.1.tar.gz ./hadoop-2.4.1
hadoop@Master:~$ scp ./hadoop-2.4.1.tar.gz Slave1:/home/hadoop
Then, on Slave1 (and Slave2):
hadoop@Slave1:~$ sudo tar -zxf ~/hadoop-2.4.1.tar.gz
hadoop@Slave1:~$ sudo chown -R hadoop:hadoop /home/hadoop
NOTE: when switching Hadoop between modes (from cluster to pseudo-distributed, or from pseudo-distributed to cluster), if Hadoop does not start properly you can delete the temporary folders on the nodes involved. This deletes the previous data, but it ensures the cluster starts correctly. Alternatively, you can configure different temporary folders for cluster mode and pseudo-distributed mode (not verified). So if the cluster used to start but now fails, in particular if the DataNodes cannot start, try deleting the tmp folder on all nodes (including the slave nodes), re-run bin/hdfs namenode -format, and start the cluster again.
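A sketch of that recovery procedure, assuming the temporary directory configured above (/home/hadoop/hadoopInfo/tmp) and that the cluster has been stopped first:
hadoop@Master:~$ rm -rf /home/hadoop/hadoopInfo/tmp/*    # run on Master, Slave1 and Slave2; this destroys existing HDFS data
hadoop@Master:~$ cd /home/hadoop/hadoop-2.4.1            # the remaining steps run on Master only
hadoop@Master:~$ bin/hdfs namenode -format
hadoop@Master:~$ sbin/start-dfs.sh
hadoop@Master:~$ sbin/start-yarn.sh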
8. You can then start Hadoop on the master node.
hadoop@Master:~$ cd /home/hadoop/hadoop-2.4.1
hadoop@Master:~$ bin/hdfs namenode -format    # initialization is only needed on the first run
hadoop@Master:~$ sbin/start-dfs.sh
hadoop@Master:~$ sbin/start-yarn.sh
The jps command shows the Java processes started on each node.
On the Master node you should see the NameNode, SecondaryNameNode and ResourceManager processes.
On the slave nodes you should see the DataNode and NodeManager processes.
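For example, jps on Master should print something similar to the following (the process IDs will of course differ):
hadoop@Master:~$ jps
3216 NameNode
3458 SecondaryNameNode
3621 ResourceManager
3912 Jps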
The Hadoop web management interface can be accessed at http://master:50070/.
Shutting down the Hadoop cluster is also done on the Master node:
hadoop@Master:~$ sbin/stop-dfs.sh
hadoop@Master:~$ sbin/stop-yarn.sh
Four. Application case
The official documentation provides a sample Hadoop job for both single-node and cluster setups:
http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
Follow the Example: WordCount v2.0 section of that page.
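For a quick end-to-end test of the cluster you can also run the bundled WordCount example that ships with Hadoop 2.4.1 (a sketch; the HDFS paths used here are illustrative):
hadoop@Master:~$ cd /home/hadoop/hadoop-2.4.1
hadoop@Master:~$ bin/hdfs dfs -mkdir -p /user/hadoop/input
hadoop@Master:~$ bin/hdfs dfs -put etc/hadoop/*.xml /user/hadoop/input
hadoop@Master:~$ bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar wordcount /user/hadoop/input /user/hadoop/output
hadoop@Master:~$ bin/hdfs dfs -cat /user/hadoop/output/*
This uploads the Hadoop configuration files as sample input, counts the words in them as a MapReduce job on the cluster, and prints the result.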