Detailed configuration process of Hadoop2.5.2 + HA + Zookeeper3.4.6


It took me quite a while to get familiar with the Hadoop 2 architecture, and I ran into several problems during the environment setup that blocked me for a long time before they were finally solved after much searching; thanks to everyone who helped. In this article, "Hadoop2.5.2 + HA + Zookeeper3.4.6 configuration process details", I also list the problems I hit so that later readers can learn from them.

This article is based on my actual testing process and took considerable effort to put together.

Preface

This article mainly walks through the configuration process of a hadoop 2.5.2 cluster. All of the steps have been tested by me personally. The structure of the document follows my actual setup, and I have added the problems I encountered along the way. The environment-building steps themselves are not the most important part; what matters are the problems encountered during setup and how they were solved.

Some of these problems may seem minor to experienced users, but they really delayed me for a long time and were painstaking to solve in the end. They are summarized in this document as a reference for those who come after.

Hadoop2.5.2 Architecture

To understand this section, you must first understand the architecture of hadoop 1. There are already many introductions to the hadoop 1 architecture; I set up a hadoop 1.2.1 pseudo-distributed cluster earlier, see "Hadoop Learning (1): hadoop-1.2.1 pseudo-distributed configuration and problems encountered". This section describes the architecture of hadoop 2.

The core components of hadoop 1 are HDFS and MapReduce; in hadoop 2 they are HDFS and YARN.

In the new HDFS there is no longer only a single NameNode; there can be more than one (currently only two). Each has the same function.

The two NameNodes have two states: active and standby. While the cluster is running, only the active NameNode works normally; the standby NameNode stays in the standby state and keeps its data synchronized with the active NameNode at all times. Once the active NameNode stops working, the standby NameNode can be switched to the active state through manual or automatic failover and the work continues. This is what provides high reliability.

How is the data of the two NameNodes kept consistent when one fails? The data is actually shared in real time. The new HDFS adopts a sharing mechanism: either a JournalNode cluster or NFS. NFS works at the operating-system level, while JournalNode works at the hadoop level. Here we use a JournalNode cluster for data sharing.

How is automatic failover of the NameNode achieved? A ZooKeeper cluster is used for election. Both NameNodes in the HDFS cluster register themselves in ZooKeeper. When the active NameNode fails, ZooKeeper detects this and automatically switches the standby NameNode to the active state.
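As a quick way to see these states in practice, the standard hdfs haadmin tool can query them once the cluster is running. A minimal sketch, assuming the NameNode IDs hadoop1 and hadoop2 that are configured in hdfs-site.xml later in this article:

# Check which state each NameNode is in (IDs come from dfs.ha.namenodes.cluster1 below)
hdfs haadmin -getServiceState hadoop1
hdfs haadmin -getServiceState hadoop2

# Hand the active role from hadoop1 to hadoop2 by hand
# (refused when automatic failover is enabled, unless forced)
hdfs haadmin -failover hadoop1 hadoop2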

HDFS Federation: there is a reason for its appearance. We know that the NameNode is the core node and maintains the metadata of the entire HDFS, so its capacity is limited and constrained by the memory of the server it runs on. Once the NameNode server's memory can no longer hold the metadata, the HDFS cluster can no longer take in data and its service life is over; its scalability is therefore limited. HDFS Federation means that multiple HDFS namespaces work at the same time, so the capacity is theoretically unlimited; to exaggerate, it can be expanded infinitely. You can think of it as two or more independent small clusters virtualized inside one large cluster, with data shared between the small clusters in real time, so that at the level of the overall cluster there is no longer a single namenode and datanode. When one of the small clusters fails, the namenode of another small cluster can be started to continue working. Because data is shared in real time, even if a namenode and datanode die together, the normal operation of the whole cluster is not affected.


Cluster node Task Arrangement:

This is very important. You must first understand how the roles are arranged across the nodes; if you do not understand why they are arranged this way, you may run into more problems later. It requires understanding the relationship between journalnode, zookeeper, datanode, and namenode. This also held me up for a long time, so I hope readers pay extra attention to it.

Six hosts.

The journalnode and zookeeper node counts are kept odd, and at least three nodes are required; I will not explain the reason here.

Hadoop 2 divides the namenode and datanode roles: here hadoop1 and hadoop2 act as the NameNodes, and the DataNodes run on the cluster machines (all six in this setup, as the slaves file below shows). There is a question here: if we put a datanode and a namenode on the same machine, it will surely affect the IO efficiency of reading data, since data is shared between different machines over the network and HTTP requests. In practice the two can live together; however, I do not know the main difference between colocating them and keeping them apart. The above explanation is just my personal opinion; if readers have better opinions, please leave a comment for discussion.

The Installation Process of the cluster is as follows:

All the following processes are completed on the hadoop1 machine, and then the files are copied to other nodes.

Zookeeper installation process:

1. Download and decompress zookeeper

Download URL: http://mirror.bit.edu.cn/apache/zookeeper/zookeeper-3.4.6/

Decompress it to the specified directory: /home/tom/yarn/hadoop-2.5.2/app/

Create an app directory under the hadoop directory and decompress the archive into it, so that the whole project can be moved as a unit later. The software we install afterwards, such as HBase and Hive, will also be extracted into this app directory.
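A minimal command sketch for this step, assuming the zookeeper tarball has been downloaded to the current directory:

# create the app directory under the hadoop directory and extract zookeeper into it
mkdir -p /home/tom/yarn/hadoop-2.5.2/app
tar -zxvf zookeeper-3.4.6.tar.gz -C /home/tom/yarn/hadoop-2.5.2/app/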

2. modify the configuration file

2.1 enter the conf directory in zookeeper:

Copy the file zoo_sample.cfg to zoo.cfg. In general we do not modify the default sample configuration file directly, but edit the copy instead.

Edit zoo.cfg:


tickTime=2000
initLimit=10
syncLimit=5
dataDir=/home/tom/yarn/hadoop-2.5.2/app/zookeeper-3.4.6/zkdata
dataLogDir=/home/tom/yarn/hadoop-2.5.2/app/zookeeper-3.4.6/zkdatalog
clientPort=2181
server.1=hadoop1:2888:3888
server.2=hadoop2:2888:3888
server.3=hadoop3:2888:3888
server.4=hadoop4:2888:3888
server.5=hadoop5:2888:3888

2.2 create zkdata and zkdatalog folders

Create these two folders under the zookeeper directory. Enter the zkdata folder, create a file named myid, and write 1 in it; this 1 corresponds to server.1 in zoo.cfg. After all the configuration files are finished, the yarn directory on hadoop1 is copied to the other machines, and the myid file on each machine is modified accordingly: on hadoop2 write 2, and on the other nodes write the number that matches the configuration above. The zkdatalog folder specifies the path where zookeeper writes its logs.
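A minimal sketch of this step on hadoop1 (the other nodes get their own myid values after the copy step at the end of this article):

cd /home/tom/yarn/hadoop-2.5.2/app/zookeeper-3.4.6
mkdir zkdata zkdatalog
# myid on hadoop1 must match server.1 in zoo.cfg
echo 1 > zkdata/myid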

Add Environment Variables

Here the environment variables are added in /etc/profile. Adding them is not strictly required, but it is convenient.


Append the $ZOOKEEPER_HOME/bin directory to the existing PATH:

:$ZOOKEEPER_HOME/bin

You can also add the environment variables in the .bashrc file of the user's home directory instead of modifying the profile file in /etc. The difference between the two: .bashrc sets environment variables for the current user only, while /etc/profile applies to all users. When the system loads them, it first looks for the corresponding path in profile; if it is not found there, it looks for the environment variable path in the .bashrc file. It is worth being familiar with both.

Then run: source /etc/profile
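A minimal sketch of the lines added to /etc/profile, assuming zookeeper was decompressed into the app directory described above:

# zookeeper environment variables
export ZOOKEEPER_HOME=/home/tom/yarn/hadoop-2.5.2/app/zookeeper-3.4.6
export PATH=$PATH:$ZOOKEEPER_HOME/bin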

The three steps above complete the zookeeper installation. Testing zookeeper is deferred: once the full configuration on hadoop1 is done and scp has copied everything to the other hosts, we will test it all together.

Hadoop Configuration

1. Download and decompress hadoop 2.5.2

Download URL: http://apache.dataguru.cn/hadoop/common/hadoop-2.5.2/

Decompress the package to /home/tom/yarn. Strictly speaking, this step should be done before decompressing zookeeper; I will not dwell on it.
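A minimal command sketch, assuming the hadoop tarball has been downloaded to the current directory:

tar -zxvf hadoop-2.5.2.tar.gz -C /home/tom/yarn/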

2. modify the configuration file

There are six configuration files to modify here: hadoop-env.sh, core-site.xml, hdfs-site.xml, mapred-site.xml, yarn-site.xml, and slaves.

The configuration files are located in: /home/tom/yarn/hadoop-2.5.2/etc/hadoop/

2.1 file hadoop-env.sh

Add jdk environment variables:

export JAVA_HOME=/usr/lib/jvm/jdk1.7.0_45

2.2 file core-site.xml

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://cluster1</value>
  </property>
  [The value here is the default HDFS path. Because there is only one HDFS cluster, it is specified here directly. The name comes from the dfs.nameservices setting in hdfs-site.xml.]
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/tom/yarn/yarn_data/tmp</value>
  </property>
  [The default base path for NameNode, DataNode, and JournalNode data. Separate directories can also be specified for the three node types. The yarn_data/tmp directory is created by yourself.]
  <property>
    <name>ha.zookeeper.quorum</name>
    <value>hadoop1:2181,hadoop2:2181,hadoop3:2181,hadoop4:2181,hadoop5:2181</value>
  </property>
  [The addresses and ports of the ZooKeeper cluster. Note that the number of zookeeper nodes must be odd and no less than three.]
</configuration>

2.3 file hdfs-site.xml

This is the key core file:

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  [The number of replicas stored per block on the DataNodes. The default is 3; the value must not be greater than the number of DataNodes.]
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>
  [Permission checking is turned off here; when enabled, HDFS enforces access permissions between users.]
  <property>
    <name>dfs.nameservices</name>
    <value>cluster1</value>
  </property>
  [Give the HDFS cluster a name. This name must match the one used in core-site.xml and is used below.]
  <property>
    <name>dfs.ha.namenodes.cluster1</name>
    <value>hadoop1,hadoop2</value>
  </property>
  [Specify which NameNodes serve the nameservice cluster1. The values here are logical names; they can be chosen freely, as long as they are unique.]
  <property>
    <name>dfs.namenode.rpc-address.cluster1.hadoop1</name>
    <value>hadoop1:9000</value>
  </property>
  [The RPC address of the NameNode hadoop1.]
  <property>
    <name>dfs.namenode.http-address.cluster1.hadoop1</name>
    <value>hadoop1:50070</value>
  </property>
  [The HTTP address of the NameNode hadoop1.]
  <property>
    <name>dfs.namenode.rpc-address.cluster1.hadoop2</name>
    <value>hadoop2:9000</value>
  </property>
  <property>
    <name>dfs.namenode.http-address.cluster1.hadoop2</name>
    <value>hadoop2:50070</value>
  </property>
  <property>
    <name>dfs.namenode.servicerpc-address.cluster1.hadoop1</name>
    <value>hadoop1:53310</value>
  </property>
  <property>
    <name>dfs.namenode.servicerpc-address.cluster1.hadoop2</name>
    <value>hadoop2:53310</value>
  </property>
  <property>
    <name>dfs.ha.automatic-failover.enabled.cluster1</name>
    <value>true</value>
  </property>
  [Whether cluster1 enables automatic failover, that is, whether to switch to the other NameNode automatically when the active NameNode fails.]
  <!-- Specify the JournalNodes -->
  <property>
    <name>dfs.namenode.shared.edits.dir</name>
    <value>qjournal://hadoop1:8485;hadoop2:8485;hadoop3:8485;hadoop4:8485;hadoop5:8485/cluster1</value>
  </property>
  [The JournalNode cluster used by the two NameNodes of cluster1 to share the edits directory.]
  <property>
    <name>dfs.client.failover.proxy.provider.cluster1</name>
    <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
  </property>
  [The implementation class responsible for performing failover when cluster1's active NameNode fails.]
  <property>
    <name>dfs.journalnode.edits.dir</name>
    <value>/home/tom/yarn/yarn_data/tmp/journal</value>
  </property>
  [The disk path where the JournalNode cluster stores the shared NameNode edits. The tmp path is created by yourself; the journal subdirectory is generated automatically when the journalnodes are started.]
  <property>
    <name>dfs.ha.fencing.methods</name>
    <value>sshfence</value>
  </property>
  [When a NameNode switch is required, ssh is used to fence the old active NameNode.]
  <property>
    <name>dfs.ha.fencing.ssh.private-key-files</name>
    <value>/home/tom/.ssh/id_rsa</value>
  </property>
  [Since ssh is used for failover fencing, the location of the private key for password-less ssh login must be configured.]
  <property>
    <name>dfs.ha.fencing.ssh.connect-timeout</name>
    <value>10000</value>
  </property>
  <property>
    <name>dfs.namenode.handler.count</name>
    <value>100</value>
  </property>
</configuration>

2.4 file mapred-site.xml

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
[The environment in which MapReduce runs is yarn; this is different from hadoop 1.]

2.5 file yarn-site.xml

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>hadoop1</value>
  </property>
  [Specifies the ResourceManager host; it is still a single point here.]
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  [Note: from hadoop 2.2 onwards this value must be mapreduce_shuffle; the older mapreduce.shuffle is no longer accepted.]
</configuration>

2.6 file slaves

Add the hostnames of the machines that will act as datanodes. Here all six machines in the cluster are used as datanodes:

 
hadoop1
hadoop2
hadoop3
hadoop4
hadoop5
hadoop6

Copy to other nodes

Run the copy from the tom user's home directory (/home/tom), because our entire environment lives under the tom directory on hadoop1.

Run the scp -r command to copy the yarn directory to each of the other nodes.
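A minimal sketch of the copy, assuming password-less ssh as the tom user is already set up and the same /home/tom path exists on every node:

scp -r /home/tom/yarn tom@hadoop2:/home/tom/
scp -r /home/tom/yarn tom@hadoop3:/home/tom/
scp -r /home/tom/yarn tom@hadoop4:/home/tom/
scp -r /home/tom/yarn tom@hadoop5:/home/tom/
scp -r /home/tom/yarn tom@hadoop6:/home/tom/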

Note:

1. Because we copy the entire yarn directory to the other nodes, zookeeper is copied along with it. We decided in advance that zookeeper runs on machines 1-5, so although zookeeper is also copied to machine 6, machine 6 is not listed in the zookeeper configuration file and zookeeper does not need to be started on it.

2. On each node you now have to go to the zkdata directory under the zookeeper directory and modify the myid file: each myid must match the number assigned to that server in zoo.cfg.
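A minimal sketch of this step, together with starting and checking zookeeper on the five zookeeper machines; zkServer.sh is the standard zookeeper control script, and the loops assume password-less ssh as the tom user:

# hadoop1 already has myid=1 from the earlier step; set the others to match zoo.cfg
for i in 2 3 4 5; do
  ssh tom@hadoop$i "echo $i > /home/tom/yarn/hadoop-2.5.2/app/zookeeper-3.4.6/zkdata/myid"
done

# start zookeeper on hadoop1-hadoop5, then check each node's role (leader/follower)
for i in 1 2 3 4 5; do
  ssh tom@hadoop$i "/home/tom/yarn/hadoop-2.5.2/app/zookeeper-3.4.6/bin/zkServer.sh start"
done
for i in 1 2 3 4 5; do
  ssh tom@hadoop$i "/home/tom/yarn/hadoop-2.5.2/app/zookeeper-3.4.6/bin/zkServer.sh status"
done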

