Hadoop learning notes (2): pseudo-distributed mode configuration


The previous note covered installing Hadoop on Linux and its basic configuration, mainly in standalone mode. In standalone mode no daemon processes are required; all programs run in a single JVM. Because it is easier to test and debug MapReduce programs this way, standalone mode is well suited to the development phase.

These notes record the process of configuring Hadoop's pseudo-distributed mode. Pseudo-distributed mode simulates a Hadoop cluster on a single machine. It is not truly distributed; instead, separate Java processes play the roles of the nodes in a distributed deployment: namenode, datanode, secondarynamenode, jobtracker, and tasktracker. The first three belong to the distributed storage layer: a cluster consists of one namenode and several datanodes, with a secondarynamenode acting as a backup for the namenode. The last two belong to the distributed computation layer: a cluster consists of one jobtracker and several tasktrackers; the jobtracker is responsible for task scheduling, and the tasktrackers execute tasks in parallel. A tasktracker must run on a datanode so that computation can be performed close to the data, while the jobtracker and namenode do not need to run on the same machine. Hadoop itself does not distinguish between pseudo-distributed and fully distributed operation; the two configurations are very similar. The only difference is that in pseudo-distributed mode everything is configured on a single machine, so the datanode and namenode are the same host.

The installation of Java and Hadoop was recorded in the previous note, so I will not repeat it here. The following records the configuration of pseudo-distributed mode.

1. SSH password-less login configuration

When running in pseudo-distributed mode the daemons must be started, and the prerequisite for starting them is that SSH is installed and working: the namenode starts the datanode processes over SSH. In pseudo-distributed mode the datanode and namenode are both the local machine, so password-less SSH login to localhost must be configured.

First, make sure that SSH is installed and the sshd server is running. It was installed by default on my machine, so I will not cover that here.

Create a new SSH key with an empty passphrase to enable password-less login:

$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Run the following command to test:

$ ssh localhost

I am not sure whether a reboot is actually required here; nothing I read said to restart the machine, but after I restarted mine, password-less SSH login worked.
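
If ssh localhost still asks for a password after these steps, the usual culprit is file permissions rather than the key itself: sshd ignores an authorized_keys file that is group- or world-writable. A quick fix, assuming the default ~/.ssh locations used above:

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys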

2. Modify the Hadoop configuration files

All Hadoop components are configured with XML files. The core-site.xml file configures properties for the common components, hdfs-site.xml configures HDFS properties, and mapred-site.xml configures MapReduce properties. These configuration files are all in the conf subdirectory.

(1) Configure the Java environment in hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0
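
If you are not sure where the JDK lives on your machine, you can check before editing; the path above is simply the one from my installation, so adjust it to your own:

$ ls /usr/lib/jvm/
$ /usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0/bin/java -version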

(2) Configure core-site.xml, hdfs-site.xml, and mapred-site.xml

core-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
    <description>HDFS URI, in the form hdfs://namenode host:port</description>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/root/hadoop/hadoop-0.20.2/hadooptmp</value>
    <description>Local Hadoop temporary folder on the namenode</description>
  </property>
</configuration>

hdfs-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/root/hadoop/hadoop-0.20.2/hdfs/name</value>
    <description>Where the namenode stores the HDFS namespace metadata</description>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/root/hadoop/hadoop-0.20.2/hdfs/data</value>
    <description>Physical storage location of data blocks on the datanode</description>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
    <description>Number of replicas. The default is 3 if not set; it should not exceed the number of datanodes</description>
  </property>
</configuration>

mapred-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
    <description>Jobtracker address in the form host:port; note this is not a URI</description>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/root/hadoop/hadoop-0.20.2/mapred/local</value>
    <description>Local directory on the tasktracker where MapReduce programs execute</description>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/tmp/hadoop/mapred/system</value>
    <description>Directory in HDFS that stores shared files while a MapReduce job runs</description>
  </property>
</configuration>
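
A stray character in these hand-edited XML files is a common source of startup failures. If the xmllint tool happens to be installed, it gives a quick well-formedness check; this is entirely optional and not something Hadoop requires:

$ cd /root/hadoop/hadoop-0.20.2/conf
$ xmllint --noout core-site.xml hdfs-site.xml mapred-site.xml
# no output means all three files are well-formed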

(3) Configure the masters file and add the namenode host name

The file content is as follows:

localhost

(4) Configure the slaves file and add the host names of all datanodes

The file content is as follows:

localhost

3. Format the HDFS file system

Before Hadoop can be used, a brand-new HDFS installation must be formatted. Formatting creates an empty file system by creating the storage directories and the initial version of the namenode's persistent data structures. Because the namenode manages all of the file system's metadata, and datanodes can join or leave the cluster dynamically, this formatting process does not involve the datanodes. For the same reason, there is no need to specify the size of the file system: it is determined by the number of datanodes in the cluster, and datanodes can be added as needed long after the file system has been formatted.

It is very convenient to format the HDFS file system. Just type the following command:

$ hadoop namenode -format

The command output is as follows:

/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = [-format]
STARTUP_MSG:   version = 0.20.2
STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r 911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
************************************************************/
11/12/17 13:44:58 INFO namenode.FSNamesystem: fsOwner=root,root,bin,daemon,sys,adm,disk,wheel
11/12/17 13:44:58 INFO namenode.FSNamesystem: supergroup=supergroup
11/12/17 13:44:58 INFO namenode.FSNamesystem: isPermissionEnabled=true
11/12/17 13:44:58 INFO common.Storage: Image file of size 94 saved in 0 seconds.
11/12/17 13:44:58 INFO common.Storage: Storage directory /root/hadoop/hadoop-0.20.2/hdfs/name has been successfully formatted.
11/12/17 13:44:58 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at localhost.localdomain/127.0.0.1
************************************************************/

4. Hadoop cluster startup

To start the HDFS and MapReduce daemons, run the following commands:

$ start-dfs.sh

The output is as follows (the namenode, datanode, and secondarynamenode are started in turn, and the log file locations are shown):

starting namenode, logging to /root/hadoop/hadoop-0.20.2/bin/../logs/hadoop-root-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /root/hadoop/hadoop-0.20.2/bin/../logs/hadoop-root-datanode-localhost.localdomain.out
localhost: starting secondarynamenode, logging to /root/hadoop/hadoop-0.20.2/bin/../logs/hadoop-root-secondarynamenode-localhost.localdomain.out

$ start-mapred.sh

The output is as follows (we can see that the jobtracker and tasktracker are started in turn):

starting jobtracker, logging to /root/hadoop/hadoop-0.20.2/bin/../logs/hadoop-root-jobtracker-localhost.localdomain.out
localhost: starting tasktracker, logging to /root/hadoop/hadoop-0.20.2/bin/../logs/hadoop-root-tasktracker-localhost.localdomain.out

You can also use the following command to replace the above two commands:

$ start-all.sh

In fact, this script simply calls the two scripts above.

The local machine is now running five daemon processes: a namenode, a secondary namenode, a datanode, a jobtracker, and a tasktracker. You can check whether the daemons started successfully by looking at the log files in the logs directory, or through the web interfaces: the jobtracker at http://localhost:50030/ and the namenode at http://localhost:50070/. In addition, Java's jps command shows which daemons are running:

$ jps

6129 SecondaryNameNode
6262 JobTracker
6559 Jps
6033 DataNode
6356 TaskTracker
5939 NameNode
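
Besides the log files and jps, you can also confirm that the two web interfaces mentioned above respond; curl is used here only because it is handy in a terminal, and any browser works just as well:

$ curl -s http://localhost:50070/ | head
$ curl -s http://localhost:50030/ | head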

Stopping the daemons is just as easy:

$ stop-dfs.sh

$ stop-mapred.sh
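
There is also a combined stop-all.sh script, mirroring start-all.sh, that stops all five daemons at once:

$ stop-all.sh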

5. Testing the pseudo-distributed environment

There are several jar files in the Hadoop root directory; hadoop-0.20.2-examples.jar is the one we need. It contains the wordcount example, which counts the occurrences of each word in the input text. Run the following commands to create test files and execute the job:

(1) Create two input files, file01 and file02, on the local disk:

$ echo "Hello World Bye World" > file01

$ echo "Hello Hadoop GoodBye Hadoop" > file02

(2) Create an input directory in HDFS:

$ hadoop fs -mkdir input

(3) Copy file01 and file02 to HDFS:

$ hadoop fs -copyFromLocal /root/hadoop/file0* input
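
Before running the job, it doesn't hurt to confirm that both files actually arrived in HDFS:

$ hadoop fs -ls input
$ hadoop fs -cat input/file01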

(4) Execute wordcount:

$ hadoop jar hadoop-0.20.2-examples.jar wordcount input output

(5) View the result after the job completes:

$ hadoop fs -cat output/part-r-00000

The execution result is:

Bye        1
GoodBye    1
Hadoop     2
Hello      2
World      2

You can also go to http://localhost:50030/jobtracker.jsp to view the job and its result:
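
If you want to keep the result outside HDFS, the whole output directory can also be pulled back to the local disk (the local target path here is arbitrary):

$ hadoop fs -get output /root/hadoop/wordcount-output
$ cat /root/hadoop/wordcount-output/part-r-00000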

OK, that's it. I really want to try some large datasets in true fully-distributed mode, but right now I am working on a project for my advisor and there are not enough machines; at least three are required. I will wait until a few machines in the lab at school free up and then have some fun with them.

(Sina Weibo: @ quanliang _ machine learning)
