Excerpt from: http://www.powerxing.com/install-hadoop-cluster/
This tutorial describes how to configure a Hadoop cluster and assumes the reader has already mastered the single-machine pseudo-distributed configuration of Hadoop; if not, first check out "Hadoop Installation Tutorial: Standalone/Pseudo-Distributed Configuration" or "CentOS Hadoop Installation: Standalone/Pseudo-Distributed Configuration".
This tutorial applies to native Hadoop 2, including Hadoop 2.6.0, Hadoop 2.7.1, and other versions. It mainly follows the official installation guide, with detailed steps and additional explanations where appropriate, to make sure you can install and run Hadoop smoothly. A simplified version of the Hadoop installation and configuration is also available for readers with some background who want to finish the installation quickly. This tutorial was produced by the Xiamen University Database Laboratory; please credit the source when republishing.
For the convenience of beginners, we have prepared pseudo-distributed configuration tutorials for two different systems, but the other Hadoop tutorials are no longer differentiated by system and apply to both Ubuntu and CentOS/RedHat. For example, this tutorial uses Ubuntu as the main demonstration environment, but wherever the configuration differs between Ubuntu and CentOS, or the operations differ between CentOS 6.x and CentOS 7, the differences are pointed out as far as possible.
Environment
This tutorial uses Ubuntu 14.04 64-bit as the system environment, is based on native Hadoop 2, has been validated with Hadoop 2.6.0 (stable), and should work with any Hadoop 2.x.y version, such as Hadoop 2.7.1 or Hadoop 2.4.1.
This tutorial simply uses two nodes as the cluster environment: one as the Master node, with LAN IP 192.168.1.121, and the other as a Slave node, with LAN IP 192.168.1.122.
Preparatory work
The installation configuration of the Hadoop cluster is roughly the following process:
- Select a machine as Master
- Configure Hadoop users on the Master node, install SSH server, install the Java environment
- Install Hadoop on the Master node and complete the configuration
- Configure Hadoop users on other Slave nodes, install SSH server, install the Java environment
- Copy the /usr/local/hadoop directory on the Master node to the other Slave nodes
- Start Hadoop on the Master node
Configuring a Hadoop user, installing an SSH server, installing the Java environment, installing Hadoop, and so on are described in "Hadoop Installation Tutorial: Standalone/Pseudo-Distributed Configuration" or "CentOS Hadoop Installation: Standalone/Pseudo-Distributed Configuration"; please refer to those tutorials, as those steps are not repeated here.
Before proceeding with the configuration below, complete the first four steps of the process above.
Network configuration
Assume that the nodes used by the cluster are located on the same LAN.
If the systems are installed in virtual machines, you need to change the network connection mode to bridged mode so that the nodes can reach one another, for example in the VirtualBox settings. In addition, if a node's system was created by cloning a virtual machine directly, make sure the MAC address of each node is different (you can randomly regenerate the MAC address with the button on the right; otherwise the IPs will conflict):
Network settings for nodes in VirtualBox
The command for viewing a node's IP address in Linux is ifconfig; the IP is the inet address shown (note that a CentOS system installed in a virtual machine does not connect to the network automatically; you need to click "connect to the network" in the upper-right corner before you can see the IP address):
Linux View IP command
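As a quick terminal check (the interface name eth0 is an assumption; your active interface may have a different name):
ifconfig eth0      # the LAN IP is shown in the "inet addr" (or "inet") field of the output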
Configure the machine name
Start by completing the preparation on the Master node and shutting down Hadoop (/usr/local/hadoop/sbin/stop-dfs.sh) before the cluster configuration that follows.
For ease of differentiation, you can modify the hostname of each node (the hostname is shown in the terminal prompt, which makes the nodes easy to tell apart on the command line). On Ubuntu/CentOS 7, we execute the following command on the Master node to change the hostname (that is, change it to Master; note that it is case-sensitive):
- sudo vim /etc/hostname
If you are using a CentOS 6.x system, modify the /etc/sysconfig/network file instead, setting HOSTNAME=Master, as shown in:
Hostname setup in CentOS
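For reference, a minimal sketch of what /etc/sysconfig/network might look like after the change (the NETWORKING line is an assumption about typical existing content; only the HOSTNAME line needs to be edited):
NETWORKING=yes
HOSTNAME=Master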
Then execute the following command to modify the IP mappings for the nodes you are using:
- sudo vim /etc/hosts
For example, this tutorial uses two nodes whose names and corresponding IPs are as follows:
192.168.1.121   Master
192.168.1.122   Slave1
Add this mapping to /etc/hosts, as shown (typically the file contains only one 127.0.0.1 entry, whose corresponding name is localhost; any extra entries should be deleted, and in particular there must not be an entry such as "127.0.0.1 Master"):
The hosts setting in Hadoop
The /etc/hosts configuration in CentOS is as follows:
The hosts setting in CentOS
After the modification is complete, a reboot is required; after rebooting you will see the new machine name in the terminal prompt. In the rest of the tutorial, pay attention to whether a step is performed on the Master node or on a Slave node.
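If you want a quick sanity check after rebooting (not part of the original steps), you can run the following in a terminal on each node:
hostname      # should print Master on the Master node, and Slave1 on the Slave node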
The network configuration needs to be completed on all nodes.
The steps above configure the Master node; on each Slave node you must also modify the same two files: /etc/hostname (change it to Slave1, Slave2, etc.) and /etc/hosts (with the same content as on Master).
After configuring, execute the following commands on each node to test whether the nodes can ping one another; if they cannot, the later configuration will not succeed:
- ping Master -c 3   # ping only 3 times; otherwise press Ctrl+C to interrupt
- ping Slave1 -c 3
For example, executing ping Slave1 on the Master node shows the round-trip time, with results as follows:
Check to see if the ping is getting through
Before proceeding with the next configuration, complete the network configuration on all nodes and make sure the hostname changes have taken effect.
Passwordless SSH login to the nodes
This step allows the Master node to SSH into each Slave node without a password.
First, generate the Master node's key pair by executing the following in a terminal on the Master node (because the hostname has changed, the key generated previously must be deleted and a new one generated):
- cd ~/.ssh              # if this directory does not exist, first execute ssh localhost once
- rm ./id_rsa*           # delete the previously generated keys (if any)
- ssh-keygen -t rsa      # just press Enter at every prompt
To allow the Master node to SSH to itself without a password, execute on the Master node:
- cat ./id_rsa.pub >> ./authorized_keys
Once that is done, you can verify it with ssh Master (you may need to type yes; after a successful login, execute exit to return to the original terminal). Then, on the Master node, transfer the public key to the Slave1 node:
- scp ~/.ssh/id_rsa.pub hadoop@Slave1:/home/hadoop/
scp is short for secure copy and is used to copy files to a remote host under Linux; it is similar to the cp command, except that cp can only copy within the local machine. When executing scp you will be asked for the password of the hadoop user on Slave1 (hadoop); once the transfer completes, a message is shown, as in:
Copy files to a remote host via SCP
Next, on the Slave1 node, add the SSH public key to the authorized keys:
- mkdir ~/.ssh           # create the folder if it does not exist; ignore this step if it already exists
- cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
- rm ~/id_rsa.pub        # you can delete it once you are done with it
If there are other Slave nodes, you also need to transfer the Master's public key to each of them and add it to the authorized keys on each Slave node.
Now the Master node can SSH to each Slave node without a password. You can verify this by executing the following command on the Master node, as shown in:
- ssh Slave1
SSH to the slave node in the master node
Configure the PATH variable
(If you followed the CentOS single-machine Hadoop configuration tutorial, this has already been configured and this step can be skipped.)
As mentioned at the end of the standalone/pseudo-distributed configuration tutorial, you can add the Hadoop installation directory to the PATH variable so that commands such as hadoop and hdfs can be used from any directory. If you have not configured this yet, do so now on the Master node. First execute vim ~/.bashrc and add one line:
export PATH=$PATH:/usr/local/hadoop/bin:/usr/local/hadoop/sbin
As shown in the following:
Configure the PATH variable
After saving, execute source ~/.bashrc to make the configuration take effect.
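To confirm the PATH change works, you can, for example, run the following from any directory (assuming Hadoop is installed in /usr/local/hadoop as in this tutorial):
hadoop version      # should print the Hadoop version without specifying the full path
hdfs dfs -help      # the hdfs command should likewise be found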
Configuring the cluster/Distributed environment
The cluster/distributed mode requires modifying 5 configuration files in /usr/local/hadoop/etc/hadoop; more settings are described in the official documentation. Here, only the settings necessary for normal startup are configured: slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml.
1. File slaves: write the hostnames of the DataNodes into this file, one per line. The default content is localhost, which is why in the pseudo-distributed configuration the node acts as both NameNode and DataNode. In the distributed configuration you may keep localhost or delete it; if deleted, the Master node is used only as a NameNode.
In this tutorial the Master node serves only as the NameNode, so delete the original localhost from the file and add a single line of content: Slave1.
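For example, you can edit and then check the file like this (a minimal sketch using the install path from this tutorial):
vim /usr/local/hadoop/etc/hadoop/slaves     # replace localhost with Slave1
cat /usr/local/hadoop/etc/hadoop/slaves     # the file should now contain only the line: Slave1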
2. File core-site.xml: change it to the following configuration:
<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://Master:9000</value>
        </property>
        <property>
                <name>hadoop.tmp.dir</name>
                <value>file:/usr/local/hadoop/tmp</value>
                <description>Abase for other temporary directories.</description>
        </property>
</configuration>
3. File hdfs-site.xml: dfs.replication is generally set to 3, but since we have only one Slave node, dfs.replication is set to 1 here:
<configuration>
        <property>
                <name>dfs.namenode.secondary.http-address</name>
                <value>Master:50090</value>
        </property>
        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>
        <property>
                <name>dfs.namenode.name.dir</name>
                <value>file:/usr/local/hadoop/tmp/dfs/name</value>
        </property>
        <property>
                <name>dfs.datanode.data.dir</name>
                <value>file:/usr/local/hadoop/tmp/dfs/data</value>
        </property>
</configuration>
4. File mapred-site.xml (it may need to be renamed first; the default file name is mapred-site.xml.template): modify the configuration as follows:
<configuration>
        <property>
                <name>mapreduce.framework.name</name>
                <value>yarn</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.address</name>
                <value>Master:10020</value>
        </property>
        <property>
                <name>mapreduce.jobhistory.webapp.address</name>
                <value>Master:19888</value>
        </property>
</configuration>
5. File yarn-site.xml:
<configuration>
        <property>
                <name>yarn.resourcemanager.hostname</name>
                <value>Master</value>
        </property>
        <property>
                <name>yarn.nodemanager.aux-services</name>
                <value>mapreduce_shuffle</value>
        </property>
</configuration>
Once configured, copy the /usr/local/hadoop folder on Master to the individual Slave nodes. Because pseudo-distributed mode has been run before, it is recommended to delete the previous temporary files before switching to cluster mode. Execute on the Master node:
cd /usr/local
sudo rm -r ./hadoop/tmp      # delete the Hadoop temporary files
sudo rm -r ./hadoop/logs/*   # delete the log files
tar -zcf ~/hadoop.master.tar.gz ./hadoop   # compress first, then copy
cd ~
scp ./hadoop.master.tar.gz Slave1:/home/hadoop
Execute on the Slave1 node:
sudo rm -r /usr/local/hadoop    # delete the old one (if it exists)
sudo tar -zxf ~/hadoop.master.tar.gz -C /usr/local
sudo chown -R hadoop /usr/local/hadoop
Similarly, if there are other Slave nodes, transfer hadoop.master.tar.gz to each of them and extract the files there in the same way.
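For example, for a hypothetical second slave named Slave2 (already added to /etc/hosts), the same steps would look like this sketch:
scp ~/hadoop.master.tar.gz Slave2:/home/hadoop   # on the Master node
sudo rm -r /usr/local/hadoop                     # then on Slave2: delete the old copy if present
sudo tar -zxf ~/hadoop.master.tar.gz -C /usr/local
sudo chown -R hadoop /usr/local/hadoop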
The first startup requires the NameNode to be formatted on the Master node (format only on Master; the Slave nodes do not need it):
hdfs namenode -format   # the first run requires initialization; after that it is not needed
CentOS systems need to shut down the firewall
CentOS enables the firewall by default, and the firewall on every node in the cluster needs to be shut down before starting the Hadoop cluster. A firewall can cause a situation where ping succeeds but the port cannot be reached (for example via telnet), so the DataNode starts but Live datanodes stays at 0.
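If you want to check for this symptom, a minimal test from a Slave node (assuming a telnet client is installed; 9000 is the NameNode RPC port configured in core-site.xml above):
ping Master -c 3       # may succeed even with the firewall on
telnet Master 9000     # fails or hangs if the firewall is blocking the NameNode port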
In CentOS 6.x, you can turn off the firewall with the following command:
- sudo service iptables stop    # shut down the firewall service
- sudo chkconfig iptables off   # disable the firewall at boot, so it does not have to be shut down manually
If you are using CentOS 7, you need to turn it off with the following commands instead (the firewall service has been changed to firewalld):
- systemctl stop firewalld.service      # shut down the firewall
- systemctl disable firewalld.service   # disable the firewall at boot
For example, the firewall is turned off in CentOS 6.x:
Hadoop can then be started; startup is performed on the Master node (only start it on Master):
start-dfs.sh
start-yarn.sh
mr-jobhistory-daemon.sh start historyserver   # start the history server so that task runs can be viewed in the web UI
The jps command lets you view the processes started on each node. If everything is correct, you should see the NameNode, ResourceManager, SecondaryNameNode, and JobHistoryServer processes on the Master node, as shown in:
View Master's Hadoop process through JPS
On the Slave node you should see the DataNode and NodeManager processes, as shown in:
View slave's Hadoop process through JPS
A missing process indicates an error. You also need to check on the Master node whether the DataNodes started properly, using the command hdfs dfsadmin -report; if Live datanodes is not 0, the cluster has started successfully. For example, here there is 1 DataNode in total:
Viewing the status of Datanode through Dfsadmin
You can also view the status of the NameNode and DataNodes through the web page http://master:50070/. If startup was not successful, you can troubleshoot the cause by checking the startup logs.
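For example, to look at a DataNode's startup log on the Slave node (a sketch; the exact file name follows the pattern hadoop-<user>-datanode-<hostname>.log and depends on your user name and hostname):
tail -n 100 /usr/local/hadoop/logs/hadoop-hadoop-datanode-Slave1.log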
Considerations when switching between pseudo-distributed and distributed configurations
- When switching from distributed to pseudo-distributed, do not forget to modify the slaves configuration file;
- When switching between the two, if Hadoop fails to start properly, you can delete the temporary folders of the nodes involved; although this deletes the existing data, it ensures that the cluster starts correctly. So if the cluster could be started before but now cannot, especially if the DataNodes fail to start, try deleting the /usr/local/hadoop/tmp folder on all nodes (including the Slave nodes), re-execute hdfs namenode -format, and then try starting again, as in the sketch after this list.
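A minimal sketch of that recovery procedure (it deletes all HDFS data, so use it only when you are willing to start from scratch):
sudo rm -r /usr/local/hadoop/tmp   # on every node, including the Slave nodes
hdfs namenode -format              # then on the Master node only
start-dfs.sh
start-yarn.sh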
Executing a distributed instance
Running a distributed example works the same as in pseudo-distributed mode. First create the user directory on HDFS:
- hdfs dfs -mkdir -p /user/hadoop
Copy the configuration files in /usr/local/hadoop/etc/hadoop to the distributed file system as the input files:
- hdfs dfs -mkdir input
- hdfs dfs -put /usr/local/hadoop/etc/hadoop/*.xml input
By looking at the DataNode's status (the change in used space), you can see that the input files have indeed been copied to the DataNode, as shown in:
View the status of Datanode through a Web page
You can then run the MapReduce job:
- hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'
The output information at run time is similar to pseudo-distributed, showing the Job's progress.
It may run a bit slowly, but if you see no progress for 5 minutes, restart Hadoop and try again. If it still does not work after a restart, the cause is most likely insufficient memory; it is recommended to increase the virtual machine's memory, or to resolve it by changing YARN's memory configuration, as sketched below.
Show the progress of the MapReduce job
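One way to do the latter: the YARN memory limits live in yarn-site.xml. The property names below are standard YARN settings, but the values are only placeholders assuming roughly 2 GB of memory per node; add them inside the <configuration> element, adjust the values to your machine, and restart YARN:
<property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>2048</value>
</property>
<property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>256</value>
</property>
<property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>2048</value>
</property>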
Task progress can likewise be viewed through the web interface at http://master:8088/cluster; clicking the History link in the "Tracking UI" column shows information about the task's execution, as shown in:
View information about clusters and mapreduce jobs through a Web page
Output results after execution:
Output of the MapReduce job
The shutdown of the Hadoop cluster is also performed on the Master node:
- stop-yarn.sh
- stop-dfs.sh
- mr-jobhistory-daemon.sh stop historyserver
In addition, as in pseudo-distributed mode, you can also choose not to start YARN, but remember to change the mapred-site.xml file name back, as sketched below.
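For example, to stop using YARN you would rename the file back (a sketch using this tutorial's install path); without mapred-site.xml, MapReduce jobs fall back to the local framework instead of trying to connect to YARN:
mv /usr/local/hadoop/etc/hadoop/mapred-site.xml /usr/local/hadoop/etc/hadoop/mapred-site.xml.template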
With that, you have mastered the setup and basic use of a Hadoop cluster.