Hadoop pseudo-distributed mode configuration and installation
The basic installation of Hadoop was introduced in the previous article on Hadoop standalone mode. This section describes how to simulate and deploy Hadoop in pseudo-distributed mode.
Software to install:
System: Linux 2.6.32-358.el6.x86_64
JDK: jdk-7u7-linux-i586.tar.gz
Hadoop version: hadoop-0.20.2-cdh3u4.tar.gz
Hardware environment:
Three hosts:
gdy192 192.168.61.192
gdy194 192.168.61.194
gdy195 192.168.61.195
The deployment plan is as follows:
Deploy on gdy192: NameNode and JobTracker
Deploy on gdy194: SecondaryNameNode
Deploy on gdy195: DataNode and TaskTracker
First, configure the hosts files of the three hosts so that they can reach one another by hostname instead of by IP address.
Start by editing the file on gdy192.
[root@gdy192 /]# vim /etc/hosts
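Based on the host list above, the entries added to /etc/hosts would look like this (one line per host):
192.168.61.192   gdy192
192.168.61.194   gdy194
192.168.61.195   gdy195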
Save and exit with :wq.
Copy the configured file to the other two hosts.
First copy it to gdy194.
[root@gdy192 ~]# scp /etc/hosts root@gdy194:/etc/
Enter the root password of gdy194.
The copy succeeds.
Go to gdy194 and check /etc/hosts to verify that it is the file we just modified.
[root@gdy194 /]# cat /etc/hosts
You can see that the copy is successful.
Copy it to gdy195 as well.
On gdy192, enter:
[root@gdy192 ~]# scp /etc/hosts root@gdy195:/etc/
The verification step is not repeated here.
Create the JDK and Hadoop installation directory /usr/gd on gdy192:
[root@gdy192 /]# mkdir /usr/gd/ -pv
Create the same JDK and Hadoop installation directory /usr/gd on gdy194.
Create the same JDK and Hadoop installation directory /usr/gd on gdy195 (the commands are shown below).
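The commands are the same as the one used on gdy192:
[root@gdy194 /]# mkdir /usr/gd/ -pv
[root@gdy195 /]# mkdir /usr/gd/ -pv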
Create the hduser user and set its password on gdy192, gdy194, and gdy195 respectively.
On gdy192:
[root@gdy192 /]# useradd hduser
[root@gdy192 /]# passwd hduser
On gdy194:
[root@gdy194 /]# useradd hduser
[root@gdy194 /]# passwd hduser
On gdy195:
[root@gdy195 /]# useradd hduser
[root@gdy195 /]# passwd hduser
Copy the prepared software packages (the JDK and Hadoop archives) to gdy192.
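If the two archives are still on another machine, they could be copied over with scp; a sketch, where the source location and the destination directory /ftp are assumptions (the /ftp directory only mirrors the prompt shown in the tar commands below):
scp jdk-7u7-linux-i586.tar.gz hadoop-0.20.2-cdh3u4.tar.gz root@gdy192:/ftp/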
After the files have been copied, decompress the two archives into the created directory /usr/gd/.
[root@gdy192 ftp]# tar -xf jdk-7u7-linux-i586.tar.gz -C /usr/gd/
[root@gdy192 ftp]# tar -xf hadoop-0.20.2-cdh3u4.tar.gz -C /usr/gd/
Use ls /usr/gd/ to view the extracted files.
Create soft links for the JDK and Hadoop in the /usr/gd directory.
[root@gdy192 ftp]# ln -s /usr/gd/jdk1.7.0_07/ /usr/gd/java
[root@gdy192 ftp]# ln -s /usr/gd/hadoop-0.20.2-cdh3u4/ /usr/gd/hadoop
[root@gdy192 ftp]# ll /usr/gd/
Configure the Java and Hadoop environment variables.
Configure the Java environment variables:
[root@gdy192 /]# vim /etc/profile.d/java.sh
Add the following information:
JAVA_HOME=/usr/gd/java
PATH=$JAVA_HOME/bin:$PATH
export JAVA_HOME PATH
Save and exit with :wq.
Configure the Hadoop environment variables:
[root@gdy192 /]# vim /etc/profile.d/hadoop.sh
Add the following information:
HADOOP_HOME=/usr/gd/hadoop
PATH=$HADOOP_HOME/bin:$PATH
export HADOOP_HOME PATH
Save and exit with :wq.
Use scp to copy the two files to the /etc/profile.d/ directory on gdy194 and gdy195 respectively.
Copy to gdy194:
[root@gdy192 /]# scp /etc/profile.d/java.sh root@gdy194:/etc/profile.d/
[root@gdy192 /]# scp /etc/profile.d/hadoop.sh root@gdy194:/etc/profile.d/
Copy to gdy195:
[root@gdy192 /]# scp /etc/profile.d/java.sh root@gdy195:/etc/profile.d/
[root@gdy192 /]# scp /etc/profile.d/hadoop.sh root@gdy195:/etc/profile.d/
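To make the new variables take effect in the current shell, you can source the two scripts (or simply log out and back in) and then check the versions; a minimal check, assuming the paths configured above:
[root@gdy192 /]# source /etc/profile.d/java.sh
[root@gdy192 /]# source /etc/profile.d/hadoop.sh
[root@gdy192 /]# java -version
[root@gdy192 /]# hadoop version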
Change the owner and group of all files in the /usr/gd/ directory to hduser.
[root@gdy192 /]# chown -R hduser.hduser /usr/gd
Switch to hduser on gdy192.
[root@gdy192 /]# su - hduser
Use ssh-keygen and ssh-copy-id so that hduser on gdy192 can log in to the hduser accounts on gdy194 and gdy195 without a password.
Commands:
First create a key pair:
[hduser@gdy192 ~]$ ssh-keygen -t rsa -P ''
Press Enter to accept the defaults.
Use ssh-copy-id to copy the generated public key to hduser on gdy194 so that gdy192 can access gdy194 without a password.
[hduser@gdy192 ~]$ ssh-copy-id -i .ssh/id_rsa.pub hduser@gdy194
Enter yes.
Enter the hduser password of gdy194.
Use ssh-copy-id to copy the public key to hduser on gdy195 so that gdy192 can access gdy195 without a password.
[hduser@gdy192 ~]$ ssh-copy-id -i .ssh/id_rsa.pub hduser@gdy195
Use ssh-copy-id to copy the public key to hduser on gdy192 itself so that gdy192 can access gdy192 without a password.
Note: Hadoop starts and manages its daemons over SSH, and even when it connects to its own machine it will prompt for a password unless passwordless access has been configured. This is the same as when configuring Hadoop standalone mode: password-free access to the local host is required as well.
[hduser@gdy192 ~]$ ssh-copy-id -i .ssh/id_rsa.pub hduser@gdy192
Verify that gdy192 can access gdy194 without a password.
[hduser@gdy192 ~]$ ssh gdy194 'date'
If the date on gdy194 is displayed without a password prompt, the configuration is successful.
Verify that gdy192 can access gdy195 without a password.
[hduser@gdy192 ~]$ ssh gdy195 'date'
Verify that gdy192 can access itself without entering a password.
[hduser@gdy192 ~]$ ssh gdy192 'date'
Check whether the system time on the three machines is the same.
[hduser@gdy192 ~]$ ssh gdy194 'date'; ssh gdy195 'date'; ssh gdy192 'date'
Synchronize time on three nodes:
Note: Because the hduser account used by Hadoop does not have permission to change the system time, you need to configure passwordless root access to gdy194, gdy195, and gdy192 and then set the time in one step, or use some other method to keep the clocks synchronized. Time synchronization is required in a production deployment; skipping this step has little impact on the Hadoop pseudo-distributed mode, but it is still recommended.
The configuration code is as follows.
[hduser@gdy192 ~]$ exit
Exit hduser first.
[root@gdy192 /]# cd ~
Enter the root user's home directory.
Create a key pair:
[root@gdy192 ~]# ssh-keygen -t rsa -P ''
Copy the public key to gdy194 and gdy195:
[root@gdy192 ~]# ssh-copy-id -i .ssh/id_rsa.pub root@gdy194
[root@gdy192 ~]# ssh-copy-id -i .ssh/id_rsa.pub root@gdy195
Apart from the first connection, where root still has to confirm the host key with yes, root can now reach the other hosts without a password. This is configured only to make it easy to synchronize the time of the three computers; Hadoop itself does not use root access.
First check the time on the three computers:
[root@gdy192 ~]# ssh gdy194 'date'; ssh gdy195 'date'; date
Set the time on the three computers to the same value (the argument format is MMDDhhmmYY):
[root@gdy192 ~]# ssh gdy194 'date 0929235415'; ssh gdy195 'date 0929235415'; date 0929235415
View the time again:
[root@gdy192 ~]# ssh gdy194 'date'; ssh gdy195 'date'; date
You can see that the time is now synchronized.
On gdy192, switch back to hduser.
[root@gdy192 ~]# su - hduser
Check the time on the three computers:
[hduser@gdy192 ~]$ ssh gdy194 'date'; ssh gdy195 'date'; ssh gdy192 'date'
Next we will start editing the Hadoop configuration files.
Change to the directory that holds the Hadoop configuration files:
[hduser@gdy192 ~]$ cd /usr/gd/hadoop/conf/
The important files were explained in the Hadoop standalone mode configuration and are not described again here. For details, see "Hadoop standalone mode configuration and installation".
Edit the masters file:
[hduser@gdy192 conf]$ vim masters
Change the original localhost to gdy194.
Save and exit with :wq.
Note: As planned above, gdy194 serves as the SecondaryNameNode.
The masters file is where the secondary name node is configured.
Edit the slaves file:
[hduser@gdy192 conf]$ vim slaves
Change the original localhost to gdy195.
Save and exit with :wq.
The slaves file defines the data nodes.
Edit the file core-site.xml:
[hduser@gdy192 conf]$ vim core-site.xml
Add the following information directly inside the <configuration> element:
<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop/temp</value>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://gdy192:8020</value>
</property>
Save and exit with :wq.
Note: fs.default.name defines the location of the master (NameNode). Because the configuration files are identical on every node, an IP address or hostname must be used here to point to the master node.
Because hadoop.tmp.dir defines a Hadoop cache directory under /hadoop, this directory has to be created on all three computers.
Switch to the root user.
[hduser@gdy192 conf]$ su - root
Create the /hadoop directory.
[root@gdy192 ~]# mkdir /hadoop/
Change the owner and group of the /hadoop directory to hduser so that hduser has write permission in this directory.
[root@gdy192 ~]# chown -R hduser.hduser /hadoop
Create the same directory on gdy194 and gdy195 and grant hduser the same permissions.
On gdy194:
[root@gdy194 /]# mkdir /hadoop
[root@gdy194 /]# chown -R hduser.hduser /hadoop
On gdy195:
[root@gdy195 /]# mkdir /hadoop
[root@gdy195 /]# chown -R hduser.hduser /hadoop
Back on gdy192, exit the current user to return to the previous hduser session.
[root@gdy192 ~]# exit
Note: Because root was entered with su from the hduser session, exiting returns you to that hduser session and the directory it was working in.
Edit the file mapred-site.xml:
[root@gdy192 conf]# vim mapred-site.xml
Add the following information inside the <configuration> element:
<property>
<name>mapred.job.tracker</name>
<value>gdy192:8021</value>
</property>
Save and exit with :wq.
Similarly, mapred.job.tracker defines the JobTracker, which our deployment plan places on gdy192.
Therefore the localhost used in standalone mode must be changed to an IP address or hostname here.
Edit the file hdfs-site.xml:
[root@gdy192 conf]# vim hdfs-site.xml
Add the following information inside the <configuration> element:
<property>
<name>dfs.replication</name>
<value>1</value>
<description>The actual number of replications can be specified when the file is created.</description>
</property>
<property>
<name>dfs.data.dir</name>
<value>/hadoop/data</value>
<final>true</final>
<description>The directories where the datanode stores blocks.</description>
</property>
<property>
<name>dfs.name.dir</name>
<value>/hadoop/name</value>
<final>true</final>
<description>The directories where the namenode stores its persistent metadata.</description>
</property>
<property>
<name>fs.checkpoint.dir</name>
<value>/hadoop/namesecondary</value>
<final>true</final>
<description>The directories where the secondarynamenode stores checkpoints.</description>
</property>
Save and exit with :wq.
Note: this file defines the locations of the other Hadoop directories; if they are not defined here, defaults under the cache directory defined in core-site.xml are used.
The Hadoop configuration files are now complete.
On gdy194 and gdy195, create the link files in the /usr/gd/ folder in the same way (the operation is the same as above). These are repeated operations and are not explained again; a sketch is given below.
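Assuming the two archives have already been extracted into /usr/gd/ on gdy194 and gdy195, just as on gdy192, the links would be created the same way:
[root@gdy194 /]# ln -s /usr/gd/jdk1.7.0_07/ /usr/gd/java
[root@gdy194 /]# ln -s /usr/gd/hadoop-0.20.2-cdh3u4/ /usr/gd/hadoop
[root@gdy195 /]# ln -s /usr/gd/jdk1.7.0_07/ /usr/gd/java
[root@gdy195 /]# ln -s /usr/gd/hadoop-0.20.2-cdh3u4/ /usr/gd/hadoop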
Next, copy the configuration files that have been set up on gdy192 to the corresponding location on gdy194 and gdy195.
The method is as follows:
On gdy192, copy the files to gdy194 and gdy195 as the hduser user.
[hduser@gdy192 hadoop]$ scp /usr/gd/hadoop/conf/* gdy194:/usr/gd/hadoop/conf/
[hduser@gdy192 hadoop]$ scp /usr/gd/hadoop/conf/* gdy195:/usr/gd/hadoop/conf/
Use the root user on gdy194 and gdy195 to give hduser ownership of the /usr/gd/hadoop folder respectively.
On gdy194:
[root@gdy194 /]# chown hduser.hduser /usr/gd/ -R
[root@gdy194 /]# ll /usr/gd/
On gdy195:
[root@gdy195 /]# chown hduser.hduser /usr/gd/ -R
[root@gdy195 /]# ll /usr/gd/
Hadoop's pseudo-distributed mode is now fully configured.
Start the Hadoop pseudo-distributed mode.
On the gdy192 host, log in as root again,
then switch to hduser.
Format Hadoop's file system, HDFS:
[hduser@gdy192 ~]$ hadoop namenode -format
Start Hadoop:
[hduser@gdy192 ~]$ start-all.sh
You can see that the NameNode and JobTracker processes start successfully on gdy192.
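A quick local check (not part of the original steps) is to run jps on gdy192:
[hduser@gdy192 ~]$ jps
The output should list the NameNode and JobTracker processes, each preceded by its process ID.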
Check whether the SecondaryNameNode started successfully on gdy194.
[hduser@gdy192 ~]$ ssh gdy194 'jps'
You can see that it started successfully.
Check whether the DataNode and TaskTracker started successfully on gdy195.
[hduser@gdy192 ~]$ ssh gdy195 'jps'
You can see that everything started successfully.
Use:
[hduser@gdy192 ~]$ netstat -nlpt
to view the ports Hadoop is listening on.
Among them, port 50030 is Hadoop's external web port, where you can view information about Hadoop MapReduce jobs.
Port 50070 serves the Hadoop NameNode information.
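If you only want to pick out these two web ports from the netstat output, a small convenience filter (not in the original steps) is:
[hduser@gdy192 ~]$ netstat -nlpt | grep -E ':(50030|50070)'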
To view Hadoop MapReduce job information, open http://192.168.61.192:50030/jobtracker.jsp in a browser.
To view Hadoop NameNode information, open http://192.168.61.192:50070/dfshealth.jsp in a browser.
Because the SecondaryNameNode is deployed on gdy194,
view the Hadoop process port information on gdy194:
[hduser@gdy192 ~]$ ssh gdy194 'netstat -nlpt'
Port 50090 is the external web port of Hadoop's SecondaryNameNode.
You can open http://192.168.61.194:50090/status.jsp
to access the SecondaryNameNode web interface.
Similarly, because the DataNode and TaskTracker are deployed on gdy195,
view the port information of gdy195 from gdy192:
[hduser@gdy192 ~]$ ssh gdy195 'netstat -nlpt'
Port 50060 shows the Hadoop TaskTracker node information.
Port 50075 shows the Hadoop DataNode node information.
They can be reached at the following addresses:
http://192.168.61.195:50075/
http://192.168.61.195:50060/tasktracker.jsp
Note: In an actual deployment, the IP address in front of these web ports is the address of the host you actually deployed on; the addresses listed here follow my deployment.
Run a Hadoop word count example.
Use machine gdy192.
Create a new test folder on Hadoop's distributed file system, HDFS (the command is shown below).
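The creation command itself does not appear in the original text; with the standard HDFS shell it would be:
[hduser@gdy192 ~]$ hadoop fs -mkdir /test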
View the created folder:
[hduser@gdy192 ~]$ hadoop fs -ls /
Upload a system file to the test folder.
[hduser@gdy192 ~]$ hadoop fs -put /etc/hosts /test
View the uploaded file:
[hduser@gdy192 ~]$ hadoop fs -ls /test
Run a word count over all the files in the /test directory on HDFS and write the results to the /word directory (the output directory must not already exist).
[hduser@gdy192 ~]$ hadoop jar /usr/gd/hadoop/hadoop-examples-0.20.2-cdh3u4.jar wordcount /test /word
While the job runs, you can monitor its progress at http://192.168.61.192:50030/jobtracker.jsp; the completed job is displayed there after it finishes.
View the output directory of the word count:
[hduser@gdy192 ~]$ hadoop fs -ls /word
View the output file part-r-00000 to see the word count results for the files in the /test directory:
[hduser@gdy192 ~]$ hadoop fs -cat /word/part-r-00000
This is the result of the word count run above.
This completes the installation and deployment of Hadoop in both standalone mode and pseudo-distributed mode.
In practice, after Hadoop is installed, HBase is usually installed on top of it to make data storage and management easier. How to deploy HBase on Hadoop in standalone mode and in pseudo-distributed mode will be covered in a later article.