Building a Hadoop Distributed Cluster on Ubuntu


1. Cluster Introduction

1.1 Hadoop Introduction

Hadoop is an open-source distributed computing platform under the Apache Software Foundation. With the Hadoop Distributed File System (HDFS) and MapReduce (an open-source implementation of Google's MapReduce) at its core, Hadoop provides users with a distributed infrastructure whose low-level system details are transparent.

Hadoop clusters can be divided into Master and Slave roles. An HDFS cluster consists of one NameNode and several DataNodes. The NameNode acts as the master server: it manages the file system namespace and client access to the file system. The DataNodes manage the data stored on their nodes. The MapReduce framework consists of a JobTracker running on the master node and a TaskTracker running on each slave node. The master node schedules all tasks of a job, distributes them across the slave nodes, monitors their execution, and re-executes failed tasks; a slave node is only responsible for the tasks assigned to it by the master node. When a job is submitted, the JobTracker receives the job and its configuration information, distributes the configuration to the slave nodes, schedules the tasks, and monitors the execution of the TaskTrackers.

As the above introduction shows, HDFS and MapReduce together form the core of the Hadoop distributed system architecture. HDFS implements a distributed file system on the cluster, while MapReduce implements distributed computing and task processing on top of it. HDFS provides file storage and file-operation support during MapReduce task processing; based on HDFS, MapReduce distributes, tracks, and executes tasks and collects the results. The two interact with each other to carry out the main work of a Hadoop distributed cluster.

1.2 Environment Description

The test environment consists of three machines:

192.168.75.67 master1

192.168.75.68 master2

192.168.75.69 master3

Parameters of each of the three machines: 4 vCPUs, 8 GB memory, GB hard drive

Linux:

1.3 Network Configuration

Configure the Hadoop cluster as described in section 1.2. In the following example, the host name is "master1" and the IP address is "192.168.75.67". The other slave machines are modified on the same basis.

Modify host name

In this example, the host name is set to master1.
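As a rough sketch (the exact steps can vary with the Ubuntu release), the host name can be changed by editing /etc/hostname and then applying it without a reboot:

sudo vi /etc/hostname       # set the file content to the new host name, e.g. master1
sudo hostname master1       # apply the new name to the running system
hostname                    # verify that the name is now master1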

Configure the hosts file (required)

The "/etc/hosts" file is used to configure the DNS server information to be used by the host. It records the corresponding [HostName and IP address] of each host in the LAN. When you are connecting to the network, first find the file and find the IP address corresponding to the Host Name (or domain name.

Perform a connection test:
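For example, from master1, ping master2 by host name to confirm both name resolution and network connectivity (assuming ICMP echo is not blocked):

ping -c 3 master2       # the name should resolve to 192.168.75.68 and replies should come back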

The connection test for master3 is done in the same way as for master2.

2. SSH Password-less Authentication Configuration

While Hadoop is running, the remote Hadoop daemons need to be managed. After Hadoop is started, the NameNode starts and stops the daemons on each DataNode through SSH (Secure Shell). Commands executed between nodes must therefore not prompt for a password, so SSH has to be configured for password-less public key authentication. This way the NameNode can log on to the DataNodes via SSH without a password and start their processes; likewise, the DataNodes can log on to the NameNode without a password.

2.1 Install and Start the SSH Protocol

Install ssh and rsync.

Install the SSH protocol

apt-get install ssh

apt-get install rsync

(rsync is a remote data synchronization tool that can quickly synchronize files between multiple hosts over a LAN/WAN.) Make sure the packages are installed and the preceding commands complete successfully on all servers, so that each machine can log on to the others using password authentication.
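To confirm that the SSH service is actually running on each machine, a quick check such as the following can be used (on Ubuntu the service is named ssh; this is only a sanity check, not a required step):

ps -e | grep ssh        # an sshd process should be listed
service ssh status      # should report that the ssh daemon is running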

2.2 Configure the Master to Log on to All Slaves Without a Password

1) SSH password-less login principle

The Master (NameNode | JobTracker), acting as the client, needs password-free public key authentication when connecting to the Slave (DataNode | TaskTracker) servers. A key pair, consisting of a public key and a private key, is generated on the Master, and the public key is then copied to all Slaves. When the Master connects to a Slave over SSH, the Slave generates a random number, encrypts it with the Master's public key, and sends it to the Master. The Master decrypts it with its private key and returns the decrypted number to the Slave. Once the Slave confirms that the decrypted number is correct, it allows the Master to connect. This is the public key authentication process, during which no password has to be typed manually. The important step is copying the Master's public key to the Slaves.

2) Generate a key pair on the Master machine

ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa

This command generates a key pair with an empty passphrase. When asked for the storage path, press Enter to accept the default. The generated key pair, id_rsa and id_rsa.pub, is stored in the "~/.ssh" directory. Then configure the Master node by appending id_rsa.pub to the authorized_keys file.
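A minimal sketch of that append step, together with the permission settings sshd expects on the key files:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # append the public key to the authorized keys
chmod 600 ~/.ssh/authorized_keys                  # sshd ignores keys in files that are too permissive
chmod 700 ~/.ssh                                  # same restriction for the .ssh directory itself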

Verify that password-less login works on the Master itself:

ssh localhost
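For the Master to reach the Slaves without a password, the same public key must also be appended to ~/.ssh/authorized_keys on master2 and master3. A sketch of one way to do it (ssh-copy-id master2 is a common shortcut that achieves the same result; these commands still prompt for a password once, since the key is not yet installed):

scp ~/.ssh/id_rsa.pub master2:~/            # copy the Master's public key to the Slave
ssh master2 "mkdir -p ~/.ssh && cat ~/id_rsa.pub >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"
# repeat both commands for master3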


