Hadoop tutorial: Setting up a Hadoop cluster


Hadoop is an open-source distributed computing platform maintained by the Apache Software Foundation. It supports data-intensive distributed applications and is released under the Apache 2.0 license.

The core of Hadoop consists of the Hadoop Distributed File System (HDFS) and MapReduce (an open-source implementation of Google's MapReduce). Together they give the user a distributed infrastructure whose low-level details are transparent.

1. Hadoop implements the MapReduce programming paradigm: an application is split into many small pieces of work, and each piece can be executed or re-executed on any node in the cluster.

2. HDFS stores the data on all of the compute nodes, which gives the whole cluster very high aggregate bandwidth.

3. A Hadoop cluster has a master/slave structure. An HDFS cluster consists of one NameNode and several DataNodes. The NameNode, as the primary server, manages the file system namespace and client access to the file system; the DataNodes manage the data stored on their nodes.

4. The MapReduce framework consists of a JobTracker running on the master node and a TaskTracker running on each slave node. The master schedules all the tasks that make up a job and distributes them across the slave nodes; it monitors their execution and restarts tasks that fail, while the slave nodes only execute the tasks assigned to them by the master. When a job is submitted, the JobTracker receives the job and its configuration information, distributes the configuration to the slave nodes, schedules the tasks, and monitors the TaskTrackers' execution.

5. HDFS and MapReduce together form the core of the Hadoop distributed system architecture. HDFS implements the distributed file system on the cluster, and MapReduce performs distributed computing and task processing on top of it. HDFS provides file storage and I/O support while MapReduce tasks run; MapReduce implements the distribution, tracking, and execution of tasks on top of HDFS and collects the results. These are the main functions of a Hadoop distributed cluster.

The five advantages of Hadoop

High scalability

Hadoop is a highly scalable storage platform: it can store and distribute data across clusters of hundreds of inexpensive servers operating in parallel. Unlike traditional relational database systems, which cannot scale to handle very large amounts of data, Hadoop lets an enterprise run applications over hundreds of nodes holding hundreds of terabytes of data.

Cost effectiveness

Hadoop also provides an extremely cost-effective storage solution for enterprise users. The problem with a traditional relational database management system is that it does not scale to large data volumes at a cost an enterprise can justify. In the past, many companies therefore had to decide which data was most valuable, keep only that classified subset, and discard the rest, because saving all the data would have been too expensive. This works in the short term, but as data volumes grow it no longer solves the problem well.

Hadoop has a different architecture: it is designed as a scale-out system that can economically store all of a company's data for later use. The cost savings are staggering; Hadoop provides hundreds of terabytes of storage and computing power at a fraction of the cost of traditional systems.

Flexibility

Hadoop makes it easy for businesses to tap new data sources and to analyze different types of data in order to generate value from them. This means businesses can use Hadoop's flexibility to gain valuable insights from data sources such as social media, e-mail, or clickstream traffic.

In addition, Hadoop serves a wide range of applications, such as log processing, recommendation systems, data warehousing, campaign analysis, and fraud detection.

Fast

Hadoop has a unique way of storing data: the tools used for data processing usually sit on the same servers as the data, so processing is much faster. If you are working with large volumes of unstructured data, Hadoop can efficiently process terabytes of data in minutes and petabytes in hours, instead of the hours or days such volumes used to take.

Fault tolerance

One of the key advantages of Hadoop is its fault tolerance. When data is sent to an individual node, it is also replicated to other nodes in the cluster, which means another copy is available in the event of a failure; there is no single point of failure for the data.

Hadoop cluster configuration example: architecture

1 master, 1 backup (standby host), and 3 slaves (created as virtual machines).

Node IP addresses:

Rango (master)   192.168.56.1    NameNode
VM1 (backup)     192.168.56.101  SecondaryNameNode
VM2 (slave1)     192.168.56.102  DataNode
VM3 (slave2)     192.168.56.103  DataNode
VM4 (slave3)     192.168.56.104  DataNode

PS: Hadoop is best run under a dedicated user, and that user should be consistent across all machines in the cluster, i.e. the same user name everywhere.

Master machine configuration files: the masters file specifies the SecondaryNameNode, and the slaves file specifies the nodes that run the DataNode and TaskTracker.

The master machine runs the NameNode and JobTracker roles, responsible for managing the distribution of data and the decomposition of tasks; the slave machines run the DataNode and TaskTracker roles, responsible for distributed data storage and task execution.

In a Hadoop cluster configuration, the IP address and hostname of every machine in the cluster need to be added to the hosts file of every machine, so that the master and all slaves can communicate not only by IP but also by hostname. After that, the JDK (Java Development Kit) and Hadoop are installed and configured.

MapReduce: "Breakdown of tasks and aggregation of results". There are two machine roles for performing mapreduce tasks: One is Jobtracker, and the other is Tasktracker,jobtracker is for dispatch, Tasktracker is for performing work. Only one jobtracker in a Hadoop cluster (in master)

The MapReduce framework takes care of the complex issues of parallel programming over distributed storage: job scheduling, load balancing, fault tolerance, and network communication. Processing is abstracted into two functions, map and reduce: map decomposes a task into multiple sub-tasks, and reduce aggregates the results of those sub-tasks once they have been processed.
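As a rough illustration of this division of labor (not part of the original setup), Hadoop Streaming lets ordinary shell commands play the map and reduce roles; the jar path below is an assumption for a hadoop-1.2.1 installation under /usr/hadoop:

# map step: /bin/cat emits each input line unchanged;
# the framework then sorts the map output and distributes it to the reducers;
# reduce step: /usr/bin/wc aggregates its share into line/word/character counts.
hadoop jar /usr/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar \
  -input /user/hadoop/input \
  -output /user/hadoop/output \
  -mapper /bin/cat \
  -reducer /usr/bin/wc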

Hadoop configuration example: step-by-step process

1. Network and host configuration: configure the hostname on every host.

hosts: add the hostnames and corresponding IP addresses of all hosts in the cluster to the hosts file on every machine, so that the machines in the cluster can communicate with and authenticate each other by hostname, for example as shown below.
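With the node list above, each machine's /etc/hosts might contain entries like the following (the hostnames are illustrative):

192.168.56.1    rango   # master: NameNode, JobTracker
192.168.56.101  vm1     # backup: SecondaryNameNode
192.168.56.102  vm2     # slave1: DataNode, TaskTracker
192.168.56.103  vm3     # slave2: DataNode, TaskTracker
192.168.56.104  vm4     # slave3: DataNode, TaskTracker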

2. Configure SSH password-free login
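A minimal sketch of the key-based setup, assuming the cluster runs under a user named "hadoop" (run on the master, then repeat from any node that needs password-free access to the others):

ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa          # generate a key pair with an empty passphrase
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys   # allow password-free login to the local host
ssh-copy-id hadoop@192.168.56.101                 # copy the public key to the backup and each slave
ssh-copy-id hadoop@192.168.56.102
ssh-copy-id hadoop@192.168.56.103
ssh-copy-id hadoop@192.168.56.104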

3. Java environment installation

Install the JDK on all machines in the cluster (JDK version: jdk1.7.0_45) and configure the environment variables in /etc/profile:

# set Java environment
export JAVA_HOME=/usr/java/jdk1.7.0_45
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JAVA_HOME/jre/lib
export PATH=$PATH:$JAVA_HOME/bin:$JAVA_HOME/jre/bin

Run source /etc/profile to make the changes take effect.

4. Hadoop installation and configuration: Hadoop must be installed on all machines. Hadoop version: hadoop-1.2.1.

4.1 Installation: tar zxvf hadoop-1.2.1.tar.gz; mv hadoop-1.2.1 /usr/hadoop;

Assign ownership of the /usr/hadoop folder to the hadoop user.
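For example, assuming the user and group are both named "hadoop":

chown -R hadoop:hadoop /usr/hadoop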

4.2 Hadoop environment variables (also in /etc/profile):

# set Hadoop path

export HADOOP_HOME=/usr/hadoop
export PATH=$PATH:$HADOOP_HOME/bin

Create a "TMP" folder in "/usr/hadoop": mkdir/usr/hadoop/tmp

4.3 Configuring Hadoop

1) Configure hadoop-env.sh:

# set Java environment
export JAVA_HOME=/usr/java/jdk1.7.0_45

2) Configure the core-site.xml file:
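A minimal core-site.xml sketch for this cluster; the NameNode port 9000 is a common choice and an assumption here, and the tmp directory is the one created above:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://192.168.56.1:9000</value>  <!-- NameNode (master) address; port assumed -->
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/hadoop/tmp</value>           <!-- base directory for Hadoop's temporary and dfs data -->
  </property>
</configuration>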

3) Configure the hdfs-site.xml file:
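A minimal hdfs-site.xml sketch; a replication factor of 3 matches the three DataNodes in this example but is otherwise an assumption:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>  <!-- number of copies kept of each block -->
  </property>
</configuration>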

4) Configure the mapred-site.xml file:
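A minimal mapred-site.xml sketch; port 9001 for the JobTracker is a common choice and an assumption here:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>192.168.56.1:9001</value>  <!-- JobTracker (master) address; port assumed -->
  </property>
</configuration>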

5) Configure the masters file: add the SecondaryNameNode's IP address.
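With the node list above, the masters file would contain only the backup host:

192.168.56.101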

6) Configure the slaves file (master host only): add the hostname or IP address of each DataNode.
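With the node list above, the slaves file would list the three DataNodes:

192.168.56.102
192.168.56.103
192.168.56.104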

PS: You can install and configure Hadoop on the master first, then copy the configured "/usr/hadoop" folder to the "/usr" directory of every slave with scp -r /usr/hadoop root@<slave ip>:/usr/. Afterwards, assign ownership of the Hadoop folder to the hadoop user on each machine and configure the environment variables there as well.

5. Startup and validation

5.1 Format the HDFS file system

As the hadoop user on the master, run:

hadoop namenode -format

PS: Format only once; subsequent starts do not need formatting again, just run start-all.sh.

5.2 Start Hadoop:

Turn off the firewall on all machines in the cluster before starting, otherwise the DataNode processes will start and then shut down automatically:

service iptables stop

Use the following command to start:

start-all.sh

When Hadoop starts successfully, a dfs folder is generated inside the tmp folder on the master, and dfs and mapred folders are generated inside the tmp folder on each slave.

5.3 Verify Hadoop:

(1) Verification method one: use the jps command.

(2) Verification method two: use hadoop dfsadmin -report.
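A rough sketch of what to expect, based on the roles assigned above (process IDs will differ):

jps                       # master: NameNode, JobTracker, Jps
                          # backup: SecondaryNameNode, Jps
                          # each slave: DataNode, TaskTracker, Jps
hadoop dfsadmin -report   # prints configured capacity and the list of live DataNodes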

6. Web view: visit "http://<master ip>:50030" (the JobTracker web GUI; the NameNode web GUI is on port 50070, see the port table below).

Hadoop port usage

Default port   Configuration property              Description
8020           (NameNode)                          NameNode RPC port, used for client interaction
8021           (JobTracker)                        JobTracker RPC port
50030          mapred.job.tracker.http.address     JobTracker administrative web GUI (JobTracker HTTP server and port)
50070          dfs.http.address                    NameNode administrative web GUI (NameNode HTTP server and port)
50010          dfs.datanode.address                DataNode control port (each DataNode listens on this port and registers with the NameNode on startup); used mainly for registration with and answering requests from the NameNode
50020          dfs.datanode.ipc.address            DataNode IPC port, used for block transfers (DataNode RPC server address and port)
50060          mapred.task.tracker.http.address    Per-TaskTracker web interface (TaskTracker HTTP server and port)
50075          dfs.datanode.http.address           Per-DataNode web interface (DataNode HTTP server and port)
50090          dfs.secondary.http.address          SecondaryNameNode web interface (SecondaryNameNode HTTP server and port)
