1. Cluster Introduction
1.1 Hadoop Introduction
Hadoop is an open-source distributed computing platform under the Apache Software Foundation. With the Hadoop Distributed File System (HDFS) and MapReduce (an open-source implementation of Google's MapReduce) at its core, Hadoop provides users with a distributed infrastructure whose low-level system details remain transparent.
A Hadoop cluster is divided into Master and Slave roles. An HDFS cluster consists of one NameNode and several DataNodes. The NameNode acts as the master server: it manages the file system namespace and client access to the file system. The DataNodes manage the data stored on their own nodes. The MapReduce framework consists of a JobTracker running on the master node and a TaskTracker running on each slave node. The master node schedules all tasks of a job, which are distributed across the slave nodes; it monitors their execution and re-executes tasks that have failed. A slave node is only responsible for the tasks assigned to it by the master node. When a job is submitted, the JobTracker receives the job and its configuration information, distributes the configuration to the slave nodes, schedules the tasks, and monitors the execution of the TaskTrackers.
As the introduction above shows, HDFS and MapReduce together form the core of the Hadoop distributed architecture. HDFS implements a distributed file system on the cluster, while MapReduce implements distributed computing and task processing. HDFS provides file storage and access for MapReduce during task processing; on top of HDFS, MapReduce distributes, tracks, and executes tasks and collects the results. Together they accomplish the main work of a Hadoop distributed cluster.
1.2 Environment Description
The test environment consists of three machines:
192.168.75.67 master1
192.168.75.68 master2
192.168.75.69 master3
Each of the three machines has 4 vCPUs, 8 GB of memory, and a GB hard drive.
Linux:
1.3 Network Configuration
Configure the Hadoop cluster as described in section 1.2. In the following example, the host name is "master1" and the IP address is "192.168.75.67". The other Slave machines are configured in the same way.
Modify host name
In this example, the host name is set to "master1".
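A minimal sketch of the change, assuming an Ubuntu system where the host name is kept in /etc/hostname (the exact commands are not quoted from this guide):
vim /etc/hostname        # replace the existing entry with: master1
hostname master1         # apply the new name for the current session (run as root)
hostname                 # verify that it now prints master1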
Configure the hosts file (required)
The "/etc/hosts" file is used to configure the DNS server information to be used by the host. It records the corresponding [HostName and IP address] of each host in the LAN. When you are connecting to the network, first find the file and find the IP address corresponding to the Host Name (or domain name.
Perform a connection test:
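For example, from master1 a simple ping to master2 by host name confirms that the mapping works (the ping check is an illustration, not taken from the original listing):
ping -c 3 master2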
The connection test for master3 is performed in the same way as for master2.
2. SSH Password-less Authentication Configuration
While Hadoop is running, the remote Hadoop daemons must be managed: after Hadoop starts, the NameNode starts and stops the daemons on each DataNode through SSH (Secure Shell). Commands must therefore be executed between nodes without entering a password, so SSH needs to be configured for password-less public key authentication. In this way, the NameNode can log on to the DataNodes via SSH without a password and start the DataNode daemons, and likewise the DataNodes can log on to the NameNode without a password.
2.1 Install and Start the SSH Protocol
Install ssh and rsync.
Install the SSH protocol
apt-get install ssh
apt-get install rsync
(rsync is a remote data synchronization tool that can quickly synchronize files between multiple hosts over a LAN/WAN.) Make sure both packages are installed and the commands above complete successfully on every server; at this point each machine can log on to the others with password authentication.
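A quick way to confirm this, assuming the host names from section 1.3 are in place (this check is an illustration, not part of the original listing), is to log on from master1 to master2 with a password:
ssh master2        # should prompt for master2's password and open a shell
exit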
2.2 Configure the Master to Log on to All Slaves Without a Password
1) SSH password-less login principle
The Master (NameNode | JobTracker), acting as the client, must use password-less public key authentication when connecting to a Slave (DataNode | TaskTracker) server. A key pair, consisting of a public key and a private key, is generated on the Master, and the public key is then copied to every Slave. When the Master connects to a Slave over SSH, the Slave generates a random number, encrypts it with the Master's public key, and sends it to the Master. The Master decrypts it with its private key and returns the decrypted number to the Slave. Once the Slave confirms that the decrypted number is correct, it allows the Master to connect. This is the public key authentication process, during which no password has to be entered manually. The key step is copying the Master's public key to each Slave.
2) Generate a key pair on the Master machine
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
This command generates a key pair with an empty passphrase. When asked for the storage path, press Enter to accept the default. The generated key pair, id_rsa and id_rsa.pub, is stored in the "~/.ssh" directory. Then, on the Master node, append id_rsa.pub to the authorized keys as follows.
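A minimal sketch of that step (standard OpenSSH practice; the exact commands are not reproduced from the original listing):
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys        # SSH ignores the file if its permissions are too open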
Verify that password-less login works:
ssh localhost