Introduction to Hadoop
Hadoop is an open-source distributed computing platform maintained by the Apache Software Foundation. With the Hadoop Distributed File System (HDFS) and MapReduce (an open-source implementation of Google's MapReduce) at its core, Hadoop provides users with a distributed infrastructure whose low-level details remain transparent to applications.
Hadoop cluster nodes can be divided into Master and Slave roles. An HDFS cluster consists of one NameNode and several DataNodes. The NameNode acts as the master server, managing the file system namespace and client access to the file system; the DataNodes manage the data stored on their nodes. The MapReduce framework consists of a single JobTracker running on the master node and one TaskTracker running on each slave node. The JobTracker schedules all the tasks that make up a job; these tasks are distributed across the slave nodes, and the JobTracker monitors their execution and re-runs any that fail. Each TaskTracker is responsible only for the tasks assigned to it. When a job is submitted, the JobTracker receives the job and its configuration information, distributes the configuration to the slave nodes, schedules the tasks, and monitors the TaskTrackers' execution.
As this introduction shows, HDFS and MapReduce together form the core of the Hadoop distributed architecture. HDFS implements a distributed file system on the cluster, while MapReduce implements distributed computing and task processing on top of it. HDFS provides file storage and I/O support during MapReduce task processing; on top of HDFS, MapReduce distributes, tracks, and executes tasks and collects the results. Working together, they accomplish the main work of a Hadoop distributed cluster.
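To make this division of labor concrete, below is a minimal sketch of submitting a job once the cluster is up, using the example jar shipped with Hadoop 1.2.1. The HDFS paths and the local input file name are illustrative assumptions, not part of the original setup; the commands are run as the hadoop user.
$ hadoop fs -mkdir /user/hadoop/wordcount-input
$ hadoop fs -put /home/hadoop/sample.txt /user/hadoop/wordcount-input/
$ hadoop jar /home/hadoop/hadoop-1.2.1/hadoop-examples-1.2.1.jar wordcount /user/hadoop/wordcount-input /user/hadoop/wordcount-output
$ hadoop fs -cat /user/hadoop/wordcount-output/part-*
Here the NameNode records file metadata and the DataNodes store the blocks of the input, while the JobTracker splits the job into map and reduce tasks and hands them to the TaskTrackers on the slave nodes.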
Prerequisites
1) Make sure that all required software is installed on every node of your cluster: the Sun JDK, SSH, and Hadoop.
2) Java 1.6.x must be installed (this guide uses JDK 6u45). We recommend the Java release from Sun.
3) SSH must be installed and sshd must be running, so that the Hadoop scripts can manage the remote Hadoop daemons; a passwordless-SSH example follows this list.
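In practice the Hadoop start/stop scripts log in to every slave over SSH, so key-based (passwordless) SSH from the master to each slave is normally set up. A minimal sketch, assuming the hadoop user already exists on every node and the host names match the cluster described below; run these on Master as the hadoop user:
$ ssh-keygen -t rsa -P ""
$ ssh-copy-id hadoop@Slave1
$ ssh-copy-id hadoop@Slave2
$ ssh-copy-id hadoop@Slave3
$ ssh Slave1 hostname
The last command should print the slave's host name without prompting for a password.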
Lab Environment
Operating Platform: VMware
Operating System: CentOS 5.9
Software Version: hadoop-1.2.1, jdk-6u45
Cluster architecture: four nodes, one Master and three Slaves, connected over a LAN and able to ping each other. The node IP addresses are assigned as follows:
Host Name   IP Address        System Version   Hadoop Node   Hadoop Processes
Master      192.168.137.100   CentOS 5.9       master        namenode, jobtracker
Slave1      192.168.137.101   CentOS 5.9       slave         datanode, tasktracker
Slave2      192.168.137.102   CentOS 5.9       slave         datanode, tasktracker
Slave3      192.168.137.103   CentOS 5.9       slave         datanode, tasktracker
All four nodes run CentOS 5.9 and have the same hadoop user. The Master machine hosts the NameNode and JobTracker roles and is responsible for managing the distributed data and for decomposing and scheduling tasks; the three Slave machines host the DataNode and TaskTracker roles and are responsible for distributed data storage and task execution.
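In Hadoop 1.x this role assignment is later reflected in the conf/masters and conf/slaves files under the Hadoop installation directory: conf/slaves lists the hosts on which the start scripts launch the DataNode and TaskTracker daemons. A minimal sketch of conf/slaves for this cluster, assuming the host names above resolve on every node (one host name per line):
Slave1
Slave2
Slave3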
Installation Steps
Download jdk-6u45-linux-x64.bin and hadoop-1.2.1.tar.gz (host name and network configuration steps are omitted here).
Note: In a production Hadoop cluster there may be many servers, so host names are usually mapped through DNS. Compared with the /etc/hosts approach, this avoids configuring the hosts file on every node, and when a node is added you do not need to update the host-name-to-IP mapping in /etc/hosts on every node, which reduces configuration steps and time and eases management.
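For a small lab cluster like this one, however, /etc/hosts is sufficient. A minimal sketch of the entries that would be appended to /etc/hosts on every node, using the IP addresses from the table above:
192.168.137.100   Master
192.168.137.101   Slave1
192.168.137.102   Slave2
192.168.137.103   Slave3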
1. JDK Installation
# /bin/bash jdk-6u45-linux-x64.bin
# mv jdk1.6.0_45 /usr/local/
Add the Java environment variables:
# vim /etc/profile
Append the following at the end of the file:
# set java environment
export JAVA_HOME=/usr/local/jdk1.6.0_45
export JRE_HOME=/usr/local/jdk1.6.0_45/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
Make the Java variables take effect and verify:
# source /etc/profile
# java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
Create the same directory on all machines, and create the same user on each of them; it is best to use that user's home directory as the Hadoop installation path. Here the installation path is /home/hadoop/hadoop-1.2.1.
# useradd hadoop
# passwd hadoop
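With the hadoop user in place, a minimal sketch of unpacking the downloaded Hadoop archive into the installation path mentioned above (the location of the downloaded tarball in the current directory is an assumption here):
# tar -zxvf hadoop-1.2.1.tar.gz -C /home/hadoop/
# chown -R hadoop:hadoop /home/hadoop/hadoop-1.2.1
The first command extracts the release to /home/hadoop/hadoop-1.2.1; the second makes the hadoop user the owner of the installation.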