Introduction to Hadoop
Hadoop is an open-source distributed computing platform maintained by the Apache Software Foundation. At its core are the Hadoop Distributed File System (HDFS) and MapReduce (an open-source implementation of Google's MapReduce), which together provide users with a distributed infrastructure that hides the low-level details of the underlying system.
Nodes in a Hadoop cluster fall into two broad categories: master and slave. An HDFS cluster consists of one NameNode and several DataNodes. The NameNode acts as the primary server, managing the file system namespace and clients' access to the file system; the DataNodes manage the data stored on their nodes. The MapReduce framework consists of a JobTracker running on the master node and a TaskTracker running on each slave node. The master schedules all the tasks that make up a job, distributes them across the slave nodes, monitors their execution, and reruns any failed tasks; the slave nodes simply execute the tasks assigned to them. When a job is submitted, the JobTracker receives the job and its configuration information, distributes the configuration to the slave nodes, schedules the tasks, and monitors the TaskTrackers as they execute.
As this introduction shows, HDFS and MapReduce together form the core of the Hadoop distributed system architecture. HDFS provides a distributed file system across the cluster, while MapReduce provides distributed computation and task processing on top of it. HDFS supplies file storage and access during MapReduce jobs; MapReduce handles the distribution, tracking, and execution of tasks on the basis of HDFS and collects the results. Together they carry out the main work of a Hadoop distributed cluster.
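As a concrete illustration of this division of labor, once such a cluster is up, the word-count example shipped with Hadoop 1.2.1 can be run against data stored in HDFS. A minimal sketch (the input file name and HDFS paths here are hypothetical):
$ hadoop fs -mkdir input
$ hadoop fs -put input.txt input/
$ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount input output
$ hadoop fs -cat output/part-*
HDFS stores the input and output files across the DataNodes, while the JobTracker splits the word-count job into map and reduce tasks and dispatches them to the TaskTrackers.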
Prerequisites
1) Ensure that all required software is installed on every node in the cluster: the Sun JDK, SSH, and Hadoop.
2) Java 1.5.x or later must be installed; the Java release from Sun is recommended.
3) SSH must be installed and sshd must be running, so that the Hadoop scripts can manage the remote Hadoop daemons (a quick check is sketched below).
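A quick way to verify these prerequisites on each node (a sketch; the service command below matches CentOS 5, other distributions differ):
# java -version        # the installed JDK should be reported
# service sshd status  # sshd should be running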
Experimental environment
Operating platform: VMware
Operating system: CentOS 5.9
Software versions: hadoop-1.2.1, jdk-6u45
Cluster architecture: 4 nodes (1 master, 3 slaves) connected over a LAN; all nodes can ping each other. The node IP addresses are assigned as follows:
Host Name | IP | System version | Hadoop node | Hadoop process name
Master | 192.168.137.100 | CentOS 5.9 | Master | NameNode, JobTracker
Slave1 | 192.168.137.101 | CentOS 5.9 | Slave | DataNode, TaskTracker
Slave2 | 192.168.137.102 | CentOS 5.9 | Slave | DataNode, TaskTracker
Slave3 | 192.168.137.103 | CentOS 5.9 | Slave | DataNode, TaskTracker
All four nodes run CentOS 5.9 and have the same user, hadoop. The master machine is configured with the NameNode and JobTracker roles and is responsible for managing the distributed data and breaking jobs down into tasks; the three slave machines are configured with the DataNode and TaskTracker roles and are responsible for storing the distributed data and executing the tasks.
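Once the daemons have been started later in this guide, the role assignments in the table above can be checked with the JDK's jps tool; a rough sketch of what to expect (process IDs will differ, jps lists itself as well, and a SecondaryNameNode may also appear on the master depending on configuration):
[hadoop@master ~]$ jps   # expect NameNode and JobTracker
[hadoop@slave1 ~]$ jps   # expect DataNode and TaskTracker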
Installation steps
Download: jdk-6u45-linux-x64.bin and hadoop-1.2.1.tar.gz (host name and network configuration are omitted here).
Note: In a production Hadoop cluster, which may contain many servers, mapping machine names through DNS avoids having to configure an /etc/hosts file on every node; when a new node is added, the hostname-to-IP mapping file on each existing node does not have to be modified. This reduces configuration steps and time and simplifies management.
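For a small cluster like the one in this article, the /etc/hosts approach is sufficient; a sketch of the entries on each node, derived from the IP table above (adjust to your own hostnames and addresses):
# vim /etc/hosts
192.168.137.100 master
192.168.137.101 slave1
192.168.137.102 slave2
192.168.137.103 slave3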
1) JDK installation
# /bin/bash jdk-6u45-linux-x64.bin
# mv jdk1.6.0_45 /usr/local/
Add the Java environment variables:
# vim /etc/profile
Append the following at the end of the file:
# set java environment
export JAVA_HOME=/usr/local/jdk1.6.0_45
export JRE_HOME=/usr/local/jdk1.6.0_45/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
Make the Java variables take effect and verify:
# source /etc/profile
# java -version
java version "1.6.0_45"
Java(TM) SE Runtime Environment (build 1.6.0_45-b06)
Java HotSpot(TM) 64-Bit Server VM (build 20.45-b01, mixed mode)
Create the same directory and the same user on all machines; it is best to install Hadoop under that user's home directory. The installation path used here is /home/hadoop/hadoop-1.2.1.
# useradd hadoop
# passwd hadoop
2) SSH configuration
After Hadoop starts, the NameNode uses SSH (Secure Shell) to start and stop the daemons on each DataNode, which requires that commands can be executed between the nodes without entering a password. We therefore need to configure SSH for passwordless public-key authentication. Taking the four machines in this article as an example, master is the master node and needs to connect to slave1, slave2, and slave3. Make sure SSH is installed on every machine and that the sshd service is running on the DataNode machines.
Switch to the hadoop user (make sure the hadoop user can log in without a password, because the Hadoop installation performed later is owned by the hadoop user).
1) Generate a key pair on each host:
# su - hadoop
$ ssh-keygen -t rsa
$ cat ~/.ssh/id_rsa.pub
This command generates a key pair: id_rsa (the private key file) and id_rsa.pub (the public key file), saved by default in the ~/.ssh/ directory.
2) Add the master's public key to the authorized_keys file on the remote host slave1.
Create an authorized_keys file under /home/hadoop/.ssh/:
$ vim /home/hadoop/.ssh/authorized_keys
Paste in the public key generated above.
Set its permissions to 600. (This is important; as widely reported online, not setting the 600 permission will cause the login to fail.)
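For reference, the permissions can be set as follows (path as used in this article; sshd is also commonly strict about the ~/.ssh directory itself, which should be 700):
$ chmod 700 /home/hadoop/.ssh
$ chmod 600 /home/hadoop/.ssh/authorized_keys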
Test the login:
$ ssh slave1
The authenticity of host 'slave1 (192.168.137.101)' can't be established.
RSA key fingerprint is d5:18:cb:5f:92:66:74:c7:30:30:bb:36:bf:4c:ed:e9.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'slave1,192.168.137.101' (RSA) to the list of known hosts.
Last login: Fri Aug 21:31:36 2013 from slave1
[hadoop@slave1 ~]$
In the same way, copy the master's public key to the other slave nodes.
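A common shortcut for this whole step, where available, is OpenSSH's ssh-copy-id, which appends the local public key to the remote authorized_keys file for you (a sketch, run as the hadoop user on the master):
$ ssh-copy-id hadoop@slave1
$ ssh-copy-id hadoop@slave2
$ ssh-copy-id hadoop@slave3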
3) Installing Hadoop
1) Switch to the hadoop user, download the installation package, and simply extract it to install:
# su - hadoop
$ wget http://apache.stu.edu.tw/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
$ tar -zxvf hadoop-1.2.1.tar.gz
My installation directory is:
/home/hadoop/hadoop-1.2.1
For convenience when using the hadoop command or the start-all.sh script, modify /etc/profile on the master and add the following:
export HADOOP_HOME=/home/hadoop/hadoop-1.2.1
export PATH=$PATH:$HADOOP_HOME/bin
After making the change, run source /etc/profile to make it take effect.
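A quick way to confirm the variables were picked up (assuming the lines above were added correctly):
$ echo $HADOOP_HOME
/home/hadoop/hadoop-1.2.1
$ which hadoop
/home/hadoop/hadoop-1.2.1/bin/hadoop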
2) Configure the conf/hadoop-env.sh file
In conf/hadoop-env.sh, add:
export JAVA_HOME=/usr/local/jdk1.6.0_45
Adjust this to the installation location of your own JDK.
To test the Hadoop installation: