Tutorial on installing and configuring a Hadoop 2.4.1 cluster on Ubuntu 14.04
This tutorial is based on Hadoop 2.4.1, but should be applicable to all 2.x versions. I have installed it multiple times on Ubuntu following these steps, and the configuration worked every time. This tutorial only covers a basic installation and configuration; further functions, configuration options, and tricks are left for you to explore.
Environment
- System: Ubuntu 14.04 64bit
- Hadoop version: hadoop 2.4.1 (stable)
- JDK version: OpenJDK 7
- Cluster Environment: two hosts, one as the Master, and the IP address of the LAN is 192.168.1.121; the other as the Slave, and the IP address of the LAN is 192.168.1.122.
Preparations
Follow the tutorial on installing Hadoop 2.4.1 in standalone/pseudo-distributed mode (see that article) to: create the hadoop user on all machines, install an SSH server, install the Java environment, and install Hadoop on the Master host.
The Hadoop installation and configuration only need to be performed on the Master node; the result is then copied to the other nodes.
We recommend that you first install Hadoop in a standalone environment on the Master host according to the tutorial above. If you prefer to start directly with the cluster setup and install Hadoop on the Master host, remember to fix the ownership and permissions of the hadoop files.
Network Configuration
I used two hosts to build the cluster; their host names and IP addresses are as follows:
Master 192.168.1.121
Slave1 192.168.1.122
First pick the host that will act as the Master (I chose the one with IP address 192.168.1.121), set its machine name to Master in /etc/hostname, and name the other hosts Slave1, Slave2, and so on. Then write the host information of the whole cluster into /etc/hosts.
sudo vim /etc/hostname
sudo vim /etc/hosts
After completion, the files should look as shown below (/etc/hosts may contain only one 127.0.0.1 entry, mapped to localhost; otherwise an error will occur). It is best to reboot so that the new machine name shows up in the terminal.
Hosts settings in Hadoop
Note that the network configuration must be performed on all hosts. The steps above describe the Master host; on each Slave host you must also modify /etc/hostname (to Slave1, Slave2, etc.) and /etc/hosts (usually with the same content as on the Master).
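For reference, a sketch of how the two files could look with the addresses used in this tutorial (adjust the IPs to your own LAN):
/etc/hostname on the Master contains just the machine name:
Master
/etc/hosts on every host (with only one 127.0.0.1 line, for localhost):
127.0.0.1       localhost
192.168.1.121   Master
192.168.1.122   Slave1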
After configuration, run ping Master and ping Slave1 on each host to check that the machines can reach each other by name.
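For example (the -c 3 option simply limits each test to three packets):
ping Master -c 3
ping Slave1 -c 3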
SSH password-less login to nodes
This operation allows the Master node to log in to the Slave nodes through SSH without a password.
First, generate the Master's SSH public key. Execute the following in a terminal on the Master node:
cd ~/.ssh          # if this directory does not exist, run ssh localhost first
ssh-keygen -t rsa  # press Enter at every prompt; the generated key is saved as ~/.ssh/id_rsa
The Master node must first be able to SSH to itself without a password. This step is also executed on the Master node:
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
You can verify this with ssh Master. Then transmit the public key to the Slave1 node:
scp ~/.ssh/id_rsa.pub hadoop@Slave1:/home/hadoop/
During scp you will be asked for the password of the hadoop user on Slave1 (hadoop). Once it is entered, a message indicates that the transfer is complete.
Then, on the Slave1 node, append the transferred public key to the authorized keys by executing:
cat ~/id_rsa.pub >> ~/.ssh/authorized_keys
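Note: if the hadoop user on Slave1 has never used SSH, the ~/.ssh directory may not exist yet; in that case create it before running the command above (a small precaution, not part of the original steps):
mkdir -p ~/.ssh
chmod 700 ~/.ssh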
If there are other Slave nodes, repeat the same steps: transmit the public key to each Slave node and add it to the authorized keys there.
Finally, verify that you can SSH from the Master node to Slave1 without a password:
ssh Slave1
Configure the cluster/distributed environment
In cluster/distributed mode, you need to modify five configuration files under etc/hadoop: slaves, core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml. You can look up the official default settings for the last four; only the settings required for a normal startup are configured here.
1. File slaves
cd /usr/local/hadoop/etc/hadoop
vim slaves
Delete the original localhost entry and write the host name of every Slave node, one per line. For example, since I have only one Slave node, the file contains a single line: Slave1.
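If you had more Slave nodes, each one would simply go on its own line; for example, with a hypothetical second node named Slave2 the file would read:
Slave1
Slave2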
2. File core-site.xml
The original content is as follows:
<property></property>
Change it to the configuration below. The remaining configuration files are modified in the same way.
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://Master:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>file:/usr/local/hadoop/tmp</value>
    <description>Abase for other temporary directories.</description>
</property>
3. File hdfs-site.xml
Because there is only one Slave node, dfs.replication is set to 1.
<property>
    <name>dfs.namenode.secondary.http-address</name>
    <value>Master:50090</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/name</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/usr/local/hadoop/tmp/dfs/data</value>
</property>
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
4. File mapred-site.xml
This file does not exist by default, so first copy it from the template:
cp mapred-site.xml.template mapred-site.xml
Then, modify the configuration as follows:
<property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
</property>
5. File yarn-site.xml:
<property>
    <name>yarn.resourcemanager.hostname</name>
    <value>Master</value>
</property>
<property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
</property>
After the configuration is done, copy the Hadoop directory on the Master to each node. (You could copy it with scp directly, but the result can differ slightly, for example in how symbolic links are handled, so it is safer to pack it into an archive first and then copy that.)
cd /usr/local
sudo tar -zcf ./hadoop.tar.gz ./hadoop
scp ./hadoop.tar.gz Slave1:/home/hadoop
Then run the following on Slave1:
sudo tar -zxf ~/hadoop.tar.gz -C /usr/local
sudo chown -R hadoop:hadoop /usr/local/hadoop
If you have run Hadoop in pseudo-distributed mode before, we recommend deleting the temporary files before switching to cluster mode:
rm -r /usr/local/hadoop/tmp
When switching the Hadoop mode, delete the temporary files. Whether you switch from cluster mode to pseudo-distributed mode or the other way around, if startup fails you can delete the temporary folders on the nodes involved; the previous data will be lost, but the cluster will then start correctly. Alternatively, you could configure different temporary folders for cluster mode and pseudo-distributed mode (unverified). In particular, if the cluster used to start but later fails to, especially when the DataNode does not come up, try deleting the tmp folder on all nodes (including the Slave nodes), run bin/hdfs namenode -format again, and then restart.
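A sketch of that recovery procedure, assuming the paths used in this tutorial (it wipes all HDFS data, so only use it when losing that data is acceptable):
rm -r /usr/local/hadoop/tmp                    # on every node: Master and all Slaves
/usr/local/hadoop/bin/hdfs namenode -format    # on the Master only, then restart the cluster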
Then start Hadoop on the Master node:
cd /usr/local/hadoop
bin/hdfs namenode -format    # initialization is only required on the first run, not afterwards
sbin/start-dfs.sh
sbin/start-yarn.sh
With the jps command you can check which processes have started on each node.
View the Hadoop process of the Master using jps
On the Master node, the NameNode, SecondaryNameNode, and ResourceManager processes should be running.
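A sketch of typical jps output on the Master (the process IDs are placeholders and will differ on your machine):
3360 NameNode
3590 SecondaryNameNode
3746 ResourceManager
4083 Jps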
View the Hadoop process of Slave through jps
On the Slave node, the DataNode and NodeManager processes should be running.
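And a corresponding sketch for Slave1 (again, the process IDs are placeholders):
2401 DataNode
2512 NodeManager
2683 Jps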
You can also run bin/hdfs dfsadmin -report on the Master node to check whether the DataNodes have started properly. In my case the report shows one DataNode in total.
View DataNode status through dfsadmin
View the startup log to analyze the cause of a startup failure
Sometimes the Hadoop cluster cannot start correctly, for example when the NameNode process on the Master node fails to start. In that case, check the startup log to troubleshoot the problem, keeping the following points in mind:
- Startup prints a message such as "Master: starting namenode, logging to /usr/local/hadoop/logs/hadoop-hadoop-namenode-Master.out", but the actual startup log information is recorded in /usr/local/hadoop/logs/hadoop-hadoop-namenode-Master.log;
- Each startup attempt appends to the log file, so look at the end of the file and check the timestamps;
- The error message is usually near the end, as an Error line or a Java exception.
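For example, to inspect the tail of the NameNode log on the Master (the exact file name depends on your user name and host name):
tail -n 50 /usr/local/hadoop/logs/hadoop-hadoop-namenode-Master.log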
You can also check the status of the DataNode and NameNode through the web interface at http://Master:50070/.
To shut down the Hadoop cluster, run the following commands on the Master node:
sbin/stop-dfs.sh
sbin/stop-yarn.sh