How to install a Hadoop multi-node distributed cluster on Ubuntu virtual machines


To go further with Hadoop data analytics, the first task is to build a Hadoop cluster environment. You can think of Hadoop as a small piece of software: install it on every physical node, and together those nodes run as a distributed Hadoop cluster.

It sounds simple, but how do you actually do it? Don't worry; the main purpose of this article is to walk beginners through these steps hands-on. Since my funds are limited, I can only simulate the cluster environment with virtual machines, but the process of building a Hadoop cluster on virtual machines applies equally to real physical nodes; the idea is the same. Of course, if you have plenty of money and don't mind buying a pile of machines, that is even better.

Some readers may want to know what kind of computer is needed to install a Hadoop cluster, at least for a virtual machine environment. Here is my own setup:

CPU: Intel Core Duo 2.2 GHz

Memory: 4 GB

HDD: 320 GB

System: Windows XP

Frankly speaking, my laptop's configuration is clearly not good enough. It originally had only 2 GB of memory, and installing a Hadoop cluster on that was maddening; after experiencing it first-hand I couldn't stand it, so I later added another 2 GB. The performance is still not great, but for learning the current configuration is barely enough. If your hardware is better than this, great; with 8 GB or even 16 GB of memory, learning Hadoop will be no pressure at all.

With the hardware out of the way, here is what I prepared before installing Hadoop:

1 Installing the VMware Workstation software

Some people ask why this software is needed: it is the virtualization platform from VMware, and the Linux operating system will be installed on top of it in the next step. There is plenty of information online about the specific installation process, so I won't go into detail here.

2 Installing the Linux operating system on a virtual machine

Install the Linux operating system on the platform from the previous step. Hadoop generally runs on Linux; although Windows versions now exist, running it on Linux is more stable and less error-prone. If you install a Hadoop cluster on Windows, you will probably face all kinds of problems during the installation that will drive you crazy. In fact, I have never installed it on Windows myself.

The Linux operating system I installed on the virtual machines is Ubuntu 10.04. Why this version? Very simple: because I am familiar with it. In fact, any Linux distribution will do; for example, you can use CentOS, Red Hat, Fedora and so on, no problem at all. The process of installing Linux on a virtual machine is also skipped here; if you don't know how, there is plenty of information online.

3 Prepare 3 Virtual machine nodes

This step is actually very simple. If you completed step 2, you have already prepared the first virtual machine node. How do you prepare the second and third? You may have thought of repeating step 2 and installing Linux two more times to create them, but that process would probably drive you crazy. There is an easier way: copy and paste. Just copy the entire directory of the first virtual machine you installed to create the second and third virtual machine nodes. That simple!

Many people may ask what these three nodes are for. The principle is simple: according to the basic requirements of a Hadoop cluster, one of them is the master node, which mainly runs the NameNode, SecondaryNameNode and JobTracker tasks of the Hadoop programs. The other two are slave nodes, one of which exists for redundancy; without redundancy it could hardly be called Hadoop, so a simulated Hadoop cluster needs at least 3 nodes. If your computer is powerful enough, you can consider adding more. The slave nodes mainly run the DataNode and TaskTracker tasks.

Therefore, after preparing these 3 nodes, you need to rename the hostnames of the Linux systems (because the other two nodes were produced by copy-and-paste, all 3 nodes currently have the same hostname). The hostname is changed as follows:

vim /etc/hostname

Modify the hostname file on each of the three nodes so that they can be distinguished from one another.

On my Ubuntu systems, the three nodes are named: master, node1 and node2.

With the basic preparations done, it is time to get down to business. Don't be impatient; just follow my approach step by step and you will be able to install the Hadoop cluster successfully. The installation consists of the following steps:

First, Configure the hosts file

Second, Create a Hadoop run account

Third, Configure SSH password-free login

Fourth, Download and unzip the Hadoop installation package

Fifth, Configure the NameNode and modify the site files

Sixth, Configure the hadoop-env.sh file

Seventh, Configure the masters and slaves files

Eighth, Replicate Hadoop to each node

Ninth, Format the NameNode

Tenth, Start Hadoop

Eleventh, Use jps to verify that the background processes started successfully

Twelfth, View the cluster status through the web interface

Now let's conquer these steps one by one!

First, Configure the hosts file

A brief description of the role of the hosts file: it records the IP address of each node so that the master node can quickly find and access the individual nodes later. This file needs to be configured on all 3 virtual machine nodes mentioned above. Because you need to know the IP address of each node, before configuring the hosts file you should check the current IP address of each virtual machine node with the ifconfig command. In this experiment, the IP address of the master node is 192.168.1.100.

If the IP address is not what you want, you can change the node's IP address with the ifconfig command, as shown below:
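A minimal sketch of such a command, assuming the network interface is eth0 (the interface name and netmask are assumptions):

sudo ifconfig eth0 192.168.1.100 netmask 255.255.255.0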

The above command changes the IP to 192.168.1.100. After the IP address of each node is set, you can configure the hosts file. Its path is /etc/hosts; my hosts file is configured roughly as follows, and you can complete the configuration with reference to your own IP addresses and the corresponding hostnames.
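A sketch of what /etc/hosts might contain; the master address is the one used in this article, while the node1 and node2 addresses are assumptions:

127.0.0.1       localhost
192.168.1.100   master
192.168.1.101   node1    # assumed address
192.168.1.102   node2    # assumed address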

Second, Create a Hadoop run account

That is, set up a user group and a user for the Hadoop cluster. This part is relatively simple; a reference example follows:

sudo groupadd hadoop    # create the hadoop user group

sudo useradd -s /bin/bash -d /home/zhm -m zhm -g hadoop -G admin    # add user zhm, in the hadoop group, with admin as a supplementary group for admin privileges

sudo passwd zhm    # set the login password for user zhm

su zhm    # switch to the zhm user

All 3 virtual machine nodes need to perform the above steps to complete the creation of the Hadoop running account.

Third, Configure SSH password-free login

This part is the most important and also the most critical. I stumbled many times on this step and took a lot of detours; if this step succeeds, the later steps will go much more smoothly.

SSH generates a public key and a private key with a public-key algorithm (RSA or, as in this article, DSA) and encrypts data in transit to protect its security and reliability. The public key is the shared part and can be given to any node on the network, while the private key stays on the local machine and is used to prove identity and protect the data exchange, so that others cannot steal the data. In short, this is asymmetric cryptography and is very difficult to crack. The nodes of a Hadoop cluster need to access each other's data, and the accessed node must verify the reliability of the accessing node; Hadoop uses SSH for remote secure login, with key-based authentication and encrypted data transfer. Of course, if every access between Hadoop nodes required entering a password, efficiency would drop dramatically, so SSH must be configured for password-free login so that nodes can connect to each other directly, which greatly improves access efficiency.

OK, enough talk; let's see how to configure SSH password-free login!

(1) Generate a public/private key pair on each node.

Type the command:
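A sketch of the key-generation command; since the files mentioned below are id_dsa and id_dsa.pub, a DSA key generated with an empty passphrase is assumed:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa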

The above command generates a public/private key pair in the .ssh directory under the user's home directory.

id_dsa.pub is the public key and id_dsa is the private key. Next, the public key file must be appended to the authorized_keys file; this step is required. The process is as follows:
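A sketch of that step, run in the .ssh directory:

cd ~/.ssh
cat id_dsa.pub >> authorized_keys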

Use the same method described above on the remaining two nodes.

(2) Single-node loopback SSH password-free login test

That is, use SSH to log in to the local node itself and see whether the login succeeds. Log out after a successful login. The process is as follows:
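A sketch of the loopback test, using localhost for the local node:

ssh localhost
exit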

If the login and logout complete without asking for a password, the single-node loopback SSH login has succeeded; this prepares the ground for the subsequent cross-node password-free SSH login.

Use the same method described above on the remaining two nodes.

(3) Allow the master node to log in to the two slave nodes via SSH without a password.

To achieve this, the authorized_keys files of the two slave nodes must contain the master node's public key, so that master can access the two slave nodes smoothly and securely. The operation process is as follows:

As the procedure shows, on the node1 node the master node's public key file is fetched with the scp command and copied to the current directory; this step requires password authentication. The master node's public key is then appended to node1's authorized_keys file. After this step, if nothing went wrong, the master node will be able to connect to node1 remotely via SSH without a password.
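A sketch of those two steps as run on node1 (the zhm account and the temporary file name master_dsa.pub are assumptions):

scp zhm@master:~/.ssh/id_dsa.pub ./master_dsa.pub
cat master_dsa.pub >> ~/.ssh/authorized_keys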

Next, operate on the master node. On the first connection to node1, SSH asks you to type "yes" to confirm the host, which means the connection from master to node1 still requires manual confirmation and cannot happen fully automatically; enter yes, the login succeeds, and then log out back to the master node. To verify password-free SSH to the node, simply run ssh node1 once more; if you are not asked for "yes" or a password this time, it has succeeded. The process is as follows:
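A sketch of the two logins from the master node:

ssh node1    # first time: type "yes" at the host authenticity prompt
exit
ssh node1    # second time: no prompt and no password means success
exit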

As shown, master can now log in to node1 via SSH without a password.

The node2 node is configured in the same way, for example:

The node2 node copies the public key file from the master node;

then master tests password-free SSH login to node2:

The first login:

The second login:

On the surface, password-free SSH login to both slave nodes has now been configured successfully, but we also need to do the same thing for the master node itself. This may seem puzzling, but there is a reason: it is said that on a real physical cluster this is necessary because the JobTracker may be placed on a node other than master, so it is not guaranteed to run on the master node.

SSH password-free login test for master itself:
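A sketch of the self-test:

ssh master    # the first time may also ask for "yes" to confirm the host key
exit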

At this point, ssh password-free login has been configured successfully.

Fourth, Download and unzip the Hadoop installation package

There is not much to say about downloading the installation package, except that the version I currently use is hadoop-0.20.2. This is not the latest version, but for learning it is fine to get started with it first; moving to another version once you are proficient is not urgent. "Hadoop: The Definitive Guide" is also written against this version.

Note: after unpacking, the Hadoop directory is under /home/zhm/hadoop.
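A sketch of the unpacking step, assuming the hadoop-0.20.2.tar.gz tarball was downloaded to /home/zhm (the download location is an assumption):

cd /home/zhm
tar -zxvf hadoop-0.20.2.tar.gz
mv hadoop-0.20.2 hadoop    # so the directory matches the path used below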

Fifth, Configure the NameNode and modify the site files

Before configuring the site files, some preparation is needed: download a recent Java JDK, which can be obtained from the Oracle website. The JDK version I am using is jdk1.7.0_09, and I unpacked and installed it into the /opt/jdk1.7.0_09 directory. Then configure the JAVA_HOME environment variable and the Hadoop path; this makes the subsequent operations easier. This configuration is done mainly by modifying the /etc/profile file, adding a few lines of the following form at the end:
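A sketch of the lines added to /etc/profile, using the JDK and Hadoop paths from this article (the exact set of variables is an assumption):

export JAVA_HOME=/opt/jdk1.7.0_09
export HADOOP_HOME=/home/zhm/hadoop
export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin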

Then execute:
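The usual command for this is:

source /etc/profile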

This makes the configuration take effect immediately. The above configuration must be carried out once on every node.

So far, the preparation is complete. Next, modify the Hadoop configuration files, that is, the various site files stored under hadoop/conf. The main ones to configure are core-site.xml, hdfs-site.xml and mapred-site.xml.

The core-site.xml configuration is as follows:
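The original listing is not included here, so the following is only a minimal sketch for Hadoop 0.20.2; the port 9000 and the hadoop.tmp.dir location are assumptions:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/zhm/hadoop/tmp</value>
  </property>
</configuration>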

The hdfs-site.xml configuration is as follows:
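Again only a minimal sketch; a replication factor of 2, matching the two DataNodes, is an assumption:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
</configuration>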

Next is the mapred-site.xml file:
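A minimal sketch; the port 9001 is an assumption:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>master:9001</value>
  </property>
</configuration>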

Sixth, Configure the hadoop-env.sh file

This file is configured according to your actual situation; the key item is pointing JAVA_HOME at your JDK installation.
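A sketch of the change, using this article's JDK path:

# in hadoop/conf/hadoop-env.sh
export JAVA_HOME=/opt/jdk1.7.0_09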

Seventh, Configure the masters and slaves files

Configure the masters file with the hostname of the master node according to your actual situation. In this experiment, the hostname of the master node is master, so fill in the masters file accordingly:
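Following this article's hostnames, the conf/masters file would contain a single line:

master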

In the same vein, fill in the Slaves file:
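The conf/slaves file would then list the two slave nodes:

node1
node2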

Eighth, Replicate Hadoop to each node

To replicate Hadoop to the Node1 node:

To replicate Hadoop to the Node2 node:
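A sketch of both copies, run on master, assuming the zhm account and the /home/zhm/hadoop path used above:

scp -r /home/zhm/hadoop zhm@node1:/home/zhm/
scp -r /home/zhm/hadoop zhm@node2:/home/zhm/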

In this way, the node1 and node2 nodes also have the configured Hadoop software installed.

Ninth, Format the NameNode

This step is performed on the master node:
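A sketch, run from the Hadoop directory:

cd /home/zhm/hadoop
bin/hadoop namenode -format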

Note: "Successfully formatted" appears above as a success.

Tenth, Start Hadoop

This step is also performed on the master node:
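A sketch, again from the Hadoop directory:

bin/start-all.sh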

Eleventh, Use jps to verify that the background processes started successfully

On the master node, check whether the NameNode, JobTracker and SecondaryNameNode processes have started.
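A sketch of the check (jps ships with the JDK):

jps    # expected to list NameNode, JobTracker and SecondaryNameNode, plus Jps itself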

If the above processes are listed, everything is correct.

On the node1 and node2 nodes, check whether the TaskTracker and DataNode processes have started.

First, the situation on node1:

Here is the situation on node2:
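The same check applies on the slave nodes; as a sketch:

jps    # expected to list DataNode and TaskTracker, plus Jps itself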

The processes have started successfully. Congratulations!

Twelfth, View the cluster status through the web interface

In the browser, enter http://192.168.1.100:50030 (the IP in the URL is the master node's IP); this is the JobTracker (MapReduce) web interface.

In the browser, enter http://192.168.1.100:50070 (again the master node's IP); this is the NameNode (HDFS) web interface.
