Learning Hadoop: deploying Hadoop in pseudo-distributed mode and common problems

Last Update:2018-07-20 Source: Internet

Author: User



Hadoop can run in stand-alone mode or in pseudo-distributed mode. Both exist to make Hadoop easy to learn and debug; to benefit from distributed, parallel processing, Hadoop has to be deployed in distributed mode. In stand-alone mode, Hadoop runs as a single process on a single node. In pseudo-distributed mode, five processes (NameNode, DataNode, JobTracker, TaskTracker, and SecondaryNameNode) all run on a single node. In distributed mode, those five processes are spread across different nodes, with each node running only some of them, for example a DataNode and a TaskTracker.

Beyond that difference, pseudo-distributed mode is much simpler to configure: only core-site.xml, hdfs-site.xml, and mapred-site.xml need to be modified, whereas distributed mode also requires files such as masters and slaves. It is also simpler to manage, since there is only one node. A distributed deployment has at least two nodes, and the management complexity grows with the number of nodes.

This article covers some details and problems of deploying and running Hadoop in pseudo-distributed mode. Deploying single-node Hadoop is relatively easy, but problems still come up. The first step is to edit core-site.xml, hdfs-site.xml, and mapred-site.xml as described in the official documentation at http://hadoop.apache.org/docs/r1.2.1/single_node_setup.html#PseudoDistributed.

Because I had used Hadoop before, I did not follow the official documentation exactly when editing the configuration files: I added the dfs.name.dir and dfs.data.dir properties to hdfs-site.xml. Their default values are ${hadoop.tmp.dir}/dfs/name and ${hadoop.tmp.dir}/dfs/data, which resolve to /tmp/hadoop-${user.name}/dfs/name and /tmp/hadoop-${user.name}/dfs/data, where ${user.name} is the name of the user running Hadoop. By default, then, both properties store their files under /tmp, and Linux distributions differ in how they clean that directory. To keep the data in dfs.name.dir and dfs.data.dir for the long term, I overrode the defaults, at first setting both properties to /home/hadoop/hadoopdata.
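To make the default locations concrete, here is a small shell sketch that expands the two defaults for a given user. It assumes hadoop.tmp.dir has been left at its Hadoop 1.x default of /tmp/hadoop-${user.name}; the function name default_dfs_dirs is made up for this illustration and is not part of Hadoop.

```shell
# Print the default dfs.name.dir and dfs.data.dir for a user, assuming
# hadoop.tmp.dir is still /tmp/hadoop-${user.name} (the Hadoop 1.x default).
# default_dfs_dirs is a hypothetical helper, not a Hadoop command.
default_dfs_dirs() {
  local user_name
  user_name="${1:-$(id -un)}"          # fall back to the current user
  local hadoop_tmp_dir="/tmp/hadoop-${user_name}"
  echo "dfs.name.dir=${hadoop_tmp_dir}/dfs/name"
  echo "dfs.data.dir=${hadoop_tmp_dir}/dfs/data"
}

default_dfs_dirs hadoop
# prints:
#   dfs.name.dir=/tmp/hadoop-hadoop/dfs/name
#   dfs.data.dir=/tmp/hadoop-hadoop/dfs/data
```

Anything under these paths is subject to whatever /tmp cleanup policy the system applies, which is the reason for overriding both properties.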

After editing the configuration files, configure SSH as described in the official documentation so that SSH login does not prompt for a password. The commands are:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Alternatively, replace dsa with rsa in the commands above; the only difference is the key algorithm used.
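Before starting Hadoop it can help to confirm that the key setup actually took effect. The sketch below is one way to do that non-interactively; check_passwordless_ssh is a hypothetical helper name, and the five-second timeout is an arbitrary choice. BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# Return 0 if SSH to the given host succeeds without any password prompt.
# check_passwordless_ssh is an illustrative helper, not part of Hadoop.
check_passwordless_ssh() {
  local host="${1:-localhost}"
  # BatchMode=yes makes ssh fail instead of prompting for a password;
  # ConnectTimeout bounds the wait if sshd is not reachable.
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
    echo "passwordless SSH to $host works"
  else
    echo "passwordless SSH to $host failed" >&2
    return 1
  fi
}

# Typical use on the Hadoop node:
#   check_passwordless_ssh localhost
```

Because batch mode disables every interactive prompt, the check may also fail when the host key has never been accepted; logging in once by hand with ssh localhost should rule that out.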

The NameNode can then be formatted by running bin/hadoop namenode -format on the command line. Once the format succeeds, run bin/start-all.sh. After the terminal output finishes, run jps to check that all five processes are present: NameNode, DataNode, JobTracker, TaskTracker, and SecondaryNameNode. In my case jps showed no DataNode process, so I opened the DataNode log in the logs directory and found the following error:

WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Invalid directory in dfs.data.dir: Incorrect permission for /home/hadoop/hadoopdata, expected: rwxr-xr-x, while actual: rwxrwxr-x
2013-12-13 14:57:36,149 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: All directories in dfs.data.dir are invalid.
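The warning says the DataNode expects rwxr-xr-x (mode 755) on its storage directory but found rwxrwxr-x (775), which is what mkdir produces under a umask of 002. A minimal sketch of checking and correcting the mode follows. It assumes GNU stat (the -c option is not available in BSD stat), and ensure_mode_755 is a hypothetical helper name.

```shell
# Make sure a DataNode storage directory has mode 755 (rwxr-xr-x).
# ensure_mode_755 is an illustrative helper; it assumes GNU coreutils stat.
ensure_mode_755() {
  local dir="$1"
  local actual
  actual="$(stat -c '%a' "$dir")"     # octal permission bits, e.g. 775
  if [ "$actual" != "755" ]; then
    echo "fixing $dir: $actual -> 755"
    chmod 755 "$dir"
  else
    echo "$dir already has mode 755"
  fi
}

# Typical use, with the directory from this article:
#   ensure_mode_755 /home/hadoop/hadoopdata
```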

Based on the error, I changed the permissions of /home/hadoop/hadoopdata to rwxr-xr-x, restarted Hadoop, and ran jps again to check the processes. The DataNode was still missing, so I checked the DataNode log again:

2013-12-13 16:08:57,516 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot lock storage /home/hadoop/hadoopdata. The directory is already locked.
2013-12-13 16:08:57,632 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Cannot lock storage /home/hadoop/hadoopdata. The directory is already locked.
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:599)
        at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:452)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:111)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:414)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:321)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1712)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1651)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1669)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1795)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1812)
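In both attempts the symptom was the same: jps did not list a DataNode. As an aside, that check can be scripted rather than read by eye. The sketch below assumes each line of jps output has the form "<pid> <ClassName>"; missing_daemons is a hypothetical helper name, and the list of expected daemons is the five named in this article.

```shell
# Print every expected Hadoop 1.x daemon that is absent from jps output
# read on stdin. missing_daemons is an illustrative helper, not a Hadoop tool.
missing_daemons() {
  local expected="NameNode DataNode JobTracker TaskTracker SecondaryNameNode"
  local running
  running="$(awk '{print $2}')"       # keep only the class-name column
  local name
  for name in $expected; do
    # -x matches whole lines, so NameNode does not match SecondaryNameNode
    if ! printf '%s\n' "$running" | grep -qxF "$name"; then
      echo "$name"
    fi
  done
}

# Typical use on the Hadoop node (jps ships with the JDK):
#   jps | missing_daemons
```

Empty output means all five daemons are running; any name printed points to the log file worth opening first.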

The message shows that /home/hadoop/hadoopdata was already locked by another process, which prevented the DataNode from starting. Checking the configuration files showed that dfs.name.dir and dfs.data.dir were both set to /home/hadoop/hadoopdata, which caused the error. The fix is to give the two properties separate directories: setting dfs.name.dir and dfs.data.dir to /home/hadoop/hadoopdata and /home/hadoop/hadoopname, respectively, solved the problem.
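As a sketch of that fix, the following shell function prints the two property entries for hdfs-site.xml and refuses to do so when both point at the same path. print_dfs_dir_properties is a hypothetical helper name; the output is meant to be pasted inside the existing <configuration> element rather than to replace the file.

```shell
# Print hdfs-site.xml <property> entries for the two storage directories.
# Returns 1 when both arguments are the same path, the mistake described above.
# print_dfs_dir_properties is an illustrative helper, not part of Hadoop.
print_dfs_dir_properties() {
  local name_dir="$1"
  local data_dir="$2"
  if [ "$name_dir" = "$data_dir" ]; then
    echo "dfs.name.dir and dfs.data.dir must differ: $name_dir" >&2
    return 1
  fi
  cat <<EOF
<property>
  <name>dfs.name.dir</name>
  <value>${name_dir}</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>${data_dir}</value>
</property>
EOF
}

# Paths in the order given in the text above
print_dfs_dir_properties /home/hadoop/hadoopdata /home/hadoop/hadoopname
```

The comparison is a plain string match, so it will not catch two different spellings of the same directory, such as a trailing slash or a symlink.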

This article is an English version of an article originally published in Chinese on aliyun.com.
