Several Problem records during Hadoop cluster deployment


This chapter describes how to deploy a Hadoop cluster.

Hadoop 2.5.x has been out for several months, and there are already many articles on the Internet about configuring a similar architecture. Here we will therefore focus on how to configure the namenode and secondary namenode on different machines, how to restore the metadata after the namenode goes down, and the meaning of the configuration items in the several major configuration files.

The general layout of the cluster is:

One namenode, one secondary namenode, and two datanodes.

In theory, the number of datanodes can be expanded indefinitely.
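
For the start-up scripts to reach the datanodes, their hostnames are listed one per line in etc/hadoop/slaves; the two hostnames below are only placeholders for this article's datanodes, not names taken from the original configuration.

# etc/hadoop/slaves (hostnames are illustrative placeholders)
cloud002
cloud003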

Installing the JDK, setting up passwordless SSH login, downloading Hadoop and so on will not be belabored here; this post mainly records the configuration items in the main configuration files.
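
For reference, a minimal sketch of the passwordless SSH setup is shown below; the user name hadoop and the datanode hostname cloud002 are placeholders, not values from this article.

# Run on the namenode host; repeat ssh-copy-id for every other node
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
ssh-copy-id hadoop@cloud002
ssh hadoop@cloud002              # should log in without a password prompt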

File 1: core-site.xml

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://cloud001:9000</value>
<description>NameNode URI</description>
</property>

<property>
<name>io.file.buffer.size</name>
<value>4096</value>
<description>Size of read/write buffer used in SequenceFiles.</description>
</property>

<property>
<name>hadoop.tmp.dir</name>
<value>/snwz/hadoop/config/temp</value>
<description></description>
</property>
</configuration>

io.file.buffer.size: the buffer size Hadoop uses for file I/O. A larger buffer can give higher throughput for disk or network operations, but it also means more memory consumption and latency. The value is in bytes and should be a multiple of the system page size. The default is 4 KB; it is commonly set to 64 KB (65536 bytes).

hadoop.tmp.dir: the base directory that many other Hadoop paths depend on. Its default location is under /tmp/hadoop-${user.name}. It is recommended to change this default, because files under the temp directory may be deleted when Linux restarts.
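
After editing core-site.xml, the values Hadoop actually resolves can be double-checked with the standard hdfs getconf command, for example:

# Print the resolved configuration values (expected values come from the file above)
hdfs getconf -confKey fs.defaultFS       # should print hdfs://cloud001:9000
hdfs getconf -confKey hadoop.tmp.dir     # should print /snwz/hadoop/config/temp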

File 2: hdfs-site.xml

 

<configuration>

<property>
<name>dfs.namenode.secondary.http-address</name>
<value>cloud001:9001</value>
</property>

<property>
<name>dfs.namenode.name.dir</name>
<value>file:/snwz/hadoop/config/name</value>
</property>

<property>
<name>dfs.datanode.data.dir</name>
<value>file:/snwz/hadoop/config/data</value>
</property>

<property>
<name>dfs.replication</name>
<value>2</value>
</property>

<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>

dfs.namenode.name.dir: path on the local filesystem where the NameNode persistently stores the namespace and transaction logs.

dfs.datanode.data.dir: comma-separated list of paths on the local filesystem of a DataNode where it should store its blocks.

dfs.replication: the number of block replicas. It is recommended that the replication factor match the number of datanodes; here it is 2.

dfs.webhdfs.enabled: whether the HDFS web interface is enabled. Enabling it is recommended.
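
With dfs.webhdfs.enabled set to true, HDFS can also be reached through the WebHDFS REST API on the NameNode's HTTP port; assuming the Hadoop 2.x default port 50070 for dfs.namenode.http-address, a quick check looks like this:

# List the HDFS root directory over WebHDFS (port 50070 is the default, adjust if changed)
curl -i "http://cloud001:50070/webhdfs/v1/?op=LISTSTATUS"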

dfs.namenode.secondary.http-address: the HTTP address of the secondary namenode. This configuration deserves special emphasis.

Most people, myself included, used to think of the snn simply as a hot backup of the nn. In fact, its job is to keep a backup of the HDFS metadata held by the namenode and to reduce the namenode's restart time. In Hadoop's default configuration, the snn process runs on the same machine as the namenode; if that machine fails, recovering the HDFS file system becomes a disaster. A much better approach is to run the snn process on a different machine. In Hadoop, the namenode is responsible for persisting the HDFS metadata and for handling the feedback to clients on every HDFS operation. To keep interactions fast, the metadata of the HDFS file system is loaded into the namenode's memory and is also persisted to disk. To prevent this persistence from becoming the bottleneck of HDFS operations, Hadoop does not persist a snapshot of the current file system on every change; instead, the list of recent HDFS operations is appended to an EditLog file on the namenode. When the namenode restarts, in addition to loading the fsimage, it replays the HDFS operations recorded in the EditLog to restore the final state of HDFS before the restart.
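
On disk, this metadata lives under the directory configured in dfs.namenode.name.dir above; the file names in the comments below are only illustrative of the Hadoop 2.x layout, not output from a real cluster.

# Inspect the persisted namenode metadata
ls /snwz/hadoop/config/name/current
# fsimage_00000000000000NNNN      <- periodic snapshot of the namespace
# edits_inprogress_00000000NNNN   <- operations recorded since the last checkpoint
# seen_txid, VERSION              <- bookkeeping files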

The SecondaryNameNode periodically merges the HDFS operations recorded in the EditLog into a checkpoint and then clears the EditLog. As a result, a restarting namenode loads the latest checkpoint and only replays the HDFS operations recorded in the EditLog since that checkpoint; because the EditLog covers just the operations since the last checkpoint, it stays relatively small. Without the snn's periodic merge, every namenode restart would take a long time, so the periodic merge both shortens the restart time and helps guarantee the integrity of the HDFS metadata.
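
How often the snn builds a checkpoint is controlled by two hdfs-site.xml properties; the sketch below shows the Hadoop 2.x defaults (not values from this cluster) as a starting point for tuning:

<!-- sketch only: Hadoop 2.x defaults, adjust for your own cluster -->
<property>
<name>dfs.namenode.checkpoint.period</name>
<value>3600</value>
<description>Seconds between two checkpoints</description>
</property>
<property>
<name>dfs.namenode.checkpoint.txns</name>
<value>1000000</value>
<description>Also checkpoint after this many uncheckpointed transactions</description>
</property>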

This article has described in detail how to run the namenode and the secondary namenode on separate machines.

How to restore the namenode metadata after a failure is not covered in detail here; if you are interested, you can look it up yourself.
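
As a pointer for readers who do want to try it, one commonly used recovery path in Hadoop 2.x is the NameNode's -importCheckpoint start-up option; the outline below is only a hedged sketch, and the copy step and directory locations are assumptions rather than values from this article.

# Outline only: recover namenode metadata from the secondary namenode's checkpoint
# 1. Copy the snn's checkpoint data into the directory named by dfs.namenode.checkpoint.dir
#    on the namenode host (location depends on your configuration).
# 2. With dfs.namenode.name.dir empty, start the namenode in import mode:
hdfs namenode -importCheckpoint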

The checkpoint frequency and other tuning questions will be studied later.

File 3: mapred-site.xml

<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>mapreduce.jobtracker.http.address</name>
<value>cloud001:50030</value>
</property>
<property>
<name>mapreduce.jobhistory.address</name>
<value>cloud001:10020</value>
</property>
<property>
<name>mapreduce.jobhistory.webapp.address</name>
<value>cloud001:19888</value>
</property>
</configuration>

mapreduce.framework.name: the new framework supports third-party MapReduce development frameworks, so that non-YARN architectures such as SmartTalk and DGSG can also be plugged in. Note that this value is normally set to yarn; if the parameter is not configured, submitted jobs will run in local mode instead of distributed mode.
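
A simple way to confirm that jobs really go to YARN rather than local mode is to run one of the bundled example jobs and check that it shows up in the application list; the example jar path below assumes the standard Hadoop 2.x distribution layout.

# Submit a small example job and verify it runs on YARN
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 10
yarn application -list        # the pi job should appear here while it is running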

mapreduce.jobtracker.http.address: the address and port on which the JobTracker web interface listens.

mapreduce.jobhistory.*: the Hadoop history server. On the history server you can view the records of completed MapReduce jobs, for example how many map and reduce tasks were used, when the job was submitted, when it started, and when it finished. The Hadoop history server is not started by default; it can be started with the following command.

sbin/mr-jobhistory-daemon.sh start historyserver
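
After starting it, you can confirm that the process is up and that the web UI answers on the mapreduce.jobhistory.webapp.address configured above:

jps | grep JobHistoryServer              # the history server process should be listed
curl -I http://cloud001:19888            # web UI address configured in mapred-site.xml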

The principles and configuration of the history server are detailed in a separate article:

File 4: yarn-site.xml

<configuration>

<!-- Site specific YARN configuration properties -->

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<property>
<name>yarn.resourcemanager.address</name>
<value>nameNode:8032</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>nameNode:8030</value>
</property>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>nameNode:8031</value>
</property>
<property>
<name>yarn.resourcemanager.admin.address</name>
<value>nameNode:8033</value>
</property>
<property>
<name>yarn.resourcemanager.webapp.address</name>
<value>nameNode:8088</value>
</property>
</configuration>

mapreduce_shuffle: setting yarn.nodemanager.aux-services to mapreduce_shuffle tells each NodeManager to run the auxiliary shuffle service that MapReduce jobs need.
yarn.resourcemanager.address: the interface through which clients communicate with the ResourceManager, for example to submit jobs.
yarn.resourcemanager.resource-tracker.address: in the new framework, NodeManagers report job and resource status to the RM, so each NodeManager host needs to know the resource-tracker interface address of the RM host.
yarn.resourcemanager.admin.address: the host:port through which management commands access the ResourceManager.
yarn.resourcemanager.webapp.address: the address of the ResourceManager management web page.
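
Once the daemons are up, the YARN side can be sanity-checked from the command line with the standard Hadoop 2.x client commands:

yarn node -list               # every NodeManager should be listed as RUNNING
yarn application -list        # applications currently known to the ResourceManager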

These are the main configuration items. Configuration items that I do not yet understand, or that deserve a fuller explanation, will be recorded and discussed later.

