10 Best Practices for Hadoop Administrators

Preface

I have been working with Hadoop for two years and have run into many problems, including the classic NameNode and JobTracker memory overflow failures, HDFS storage of large numbers of small files, task scheduling problems, and MapReduce performance problems. Some of these problems are Hadoop's own shortcomings; others come from improper use.

In the process of solving these problems, I sometimes had to dig into the source code, and sometimes asked colleagues and netizens for advice. When I hit a really tricky problem, I would write to the mailing list read by Hadoop users all over the world, including Hadoop committers (Hadoop's developers), for help. Having received help from so many people, I decided to write down my problems and experience. I hope this article helps Hadoop novices who feel overwhelmed, and spares them the detours I took.

PS. This article is based on Cloudera CDH 3u4 (equivalent to Apache Hadoop 1.0). The recommended settings are official recommendations or the author's empirical values; they are not absolute and may vary with application scenario and hardware environment.

1. Choose Cloudera CDH to deploy your cluster

Most administrators start with Apache Hadoop, and so did I: I began developing and deploying with Apache Hadoop. But after being exposed to Cloudera CDH, I found that it makes an administrator's work easier: not only do you get the latest features and bug fixes, it can also bring surprising performance improvements.

Why is CDH better? The author lists the following points:

  1. CDH is based on a stable version of Apache Hadoop with the latest bug fixes and feature patches applied. Cloudera consistently ships a quarterly update release and an annual major release, a faster cadence than Apache's. In actual use, CDH has been extremely stable and has not introduced new problems.
  2. Cloudera's official website has detailed installation and upgrade documentation, which saves time on Google.
  3. CDH supports four installation methods: yum/apt packages, a tarball, RPM packages, and Cloudera Manager; there is always one that suits you. The official website recommends the yum/apt method, whose advantages are as follows (a command sketch follows this list):
    1. Online installation and upgrade are very convenient. You can also download the RPM packages to a local machine and install them from a local yum repository.
    2. Dependent packages are downloaded automatically. For example, installing Hive pulls in and installs Hadoop as a dependency.
    3. Packages across the Hadoop ecosystem are matched automatically: you do not need to hunt for the HBase, Flume, or Hive versions that fit your Hadoop. yum/apt picks packages that match the installed Hadoop version and guarantees compatibility.
    4. Related directories and symlinks are created automatically in appropriate places (such as the conf and logs directories), and the hdfs and mapred users are created automatically: hdfs is the superuser of HDFS, and mapred owns the relevant directories during MapReduce execution.
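As a rough sketch of the yum route (the repository RPM and package names below are recalled from the CDH3 era and should be checked against Cloudera's installation guide for your OS), a master and a worker could be set up like this:

# Register the CDH3 yum repository (URL is indicative; verify it in the official docs)
sudo rpm -ivh http://archive.cloudera.com/redhat/cdh/cdh3-repository-1.0-1.noarch.rpm

# Master node: core package plus the NameNode and JobTracker daemons
sudo yum install hadoop-0.20 hadoop-0.20-namenode hadoop-0.20-jobtracker

# Worker nodes: DataNode and TaskTracker daemons
sudo yum install hadoop-0.20-datanode hadoop-0.20-tasktracker

# Installing Hive pulls in its Hadoop dependencies automatically
sudo yum install hadoop-hive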

Recommendation index:★★★

Recommended reason: get the latest features and bug fixes; easy installation and maintenance, saving O&M time.

2. hadoop cluster configuration and management

Installing and maintaining a hadoop cluster involves a lot of management work, including software installation, device management (crontab, iptables, etc.), and configuration distribution.

For software distribution and node management on a small cluster, you can use pdsh, which distributes files to target servers over passwordless SSH and runs a command against a group of machines, collecting the feedback from each. For large clusters, or clusters whose hardware configurations differ widely, we recommend a tool such as Puppet to maintain configuration files, or Cloudera Manager to manage the cluster through a GUI (note: Cloudera Manager is not open-source software; the free edition supports at most 50 nodes).
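For illustration, a typical pdsh session might look like the following; the host names and service name are hypothetical, and passwordless SSH must already be set up:

# Run one command on a range of hosts and collect the output per host
pdsh -w dn[01-20].example.com "df -h /data | tail -1"

# Push a configuration file to every node with pdcp (shipped with pdsh)
pdcp -w dn[01-20].example.com /etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/

# Keep the node list in a file and reference it with ^
pdsh -w ^/etc/hadoop/allnodes "sudo service hadoop-0.20-tasktracker restart"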

Recommendation index:★★★

Recommended reason: improves O&M efficiency.

3. Enable the SecondaryNameNode

The main job of the SecondaryNameNode (SNN) is to help the NameNode (NN) merge the edit log into the image file and copy the merged image back to the NN, which reduces the time the NN needs to replay the edit log at restart. The SNN is not a hot standby for the NN, but with the following steps an SNN can be switched into an NN: import the checkpointed image file on the SNN node, change the SNN's machine name and IP address to those of the NN, and restart the cluster.

Note that the SNN's memory configuration must match the NN's, because merging the edit log requires loading the metadata into memory. Also, it is not only the SNN: any node that holds a copy of the NN image can be turned into an NN through the steps above, but the SNN is the most suitable candidate.
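A hedged sketch of that switch on Hadoop 1.x/CDH3 (the paths reuse the dfs.name.dir example from practice 7, and the service name is an assumption):

# On the SNN host, after taking over the NN's hostname and IP address:
# start the namenode with an empty dfs.name.dir and import the checkpoint kept in fs.checkpoint.dir
mkdir -p /data/cache1/dfs/nn
hadoop namenode -importCheckpoint

# Once the import succeeds, stop that foreground process and start the daemon the usual way
sudo service hadoop-0.20-namenode start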

Recommendation index:★★★

Recommended reason: reduces the cluster downtime caused by NN restarts; after an NN node failure, the SNN can take over as the NN.

4. Use ganglia and Nagios to monitor your cluster

When running a large MapReduce job, we usually care about the CPU, memory, disk, and bandwidth usage of the TaskTrackers (TT). This is where Ganglia comes in: it generates the relevant charts to help us diagnose and analyze problems.

Ganglia can monitor the cluster's state, but it cannot notify you when a server goes down or a TT fails. For that we use Nagios, which supports email and SMS alerts; by writing plugins you can implement your own checks (a minimal plugin sketch follows the list below). Our cluster currently monitors the following:

  1. NameNode and JobTracker memory
  2. DataNode and TaskTracker running status
  3. NFS service status
  4. Disk usage
  5. Server load
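To give an idea of how such a plugin looks: a Nagios plugin is just a script that prints one status line and exits 0, 1, or 2 for OK, WARNING, or CRITICAL. A minimal disk-usage check (the mount point and thresholds are made up) might be:

#!/bin/bash
# check_disk_usage.sh -- warn at 80% used, go critical at 90%
MOUNT=${1:-/data}
USED=$(df -P "$MOUNT" | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$USED" -ge 90 ]; then
    echo "CRITICAL - $MOUNT is ${USED}% used"; exit 2
elif [ "$USED" -ge 80 ]; then
    echo "WARNING - $MOUNT is ${USED}% used"; exit 1
fi
echo "OK - $MOUNT is ${USED}% used"; exit 0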

Recommendation index:★★★

Recommended reason: Ganglia helps you record the cluster's state for troubleshooting; Nagios notifies you immediately when something goes wrong.

5. Setting memory is crucial

After the Hadoop cluster is installed, the first thing to do is modify the hadoop-env.sh file (under conf/) to set the memory. Mainstream nodes are configured with 32 GB of memory; in a typical scenario the memory settings are as follows:

NN: 15-25 GB
JT: 2-4 GB
DN: 1-4 GB
TT: 1-2 GB, child VM 1-2 GB

Different clusters need different settings. If the cluster holds a large number of small files, NN memory should be at least 20 GB and DN memory at least 2 GB.
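In Hadoop 1.x these heaps are usually set through the *_OPTS variables in hadoop-env.sh, and the child VM size through mapred.child.java.opts in mapred-site.xml. A sketch matching the 32 GB profile above (the exact -Xmx numbers are only the empirical values quoted here):

# hadoop-env.sh
export HADOOP_NAMENODE_OPTS="-Xmx20g $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Xmx20g $HADOOP_SECONDARYNAMENODE_OPTS"   # keep SNN equal to NN
export HADOOP_JOBTRACKER_OPTS="-Xmx3g $HADOOP_JOBTRACKER_OPTS"
export HADOOP_DATANODE_OPTS="-Xmx2g $HADOOP_DATANODE_OPTS"
export HADOOP_TASKTRACKER_OPTS="-Xmx1g $HADOOP_TASKTRACKER_OPTS"

# mapred-site.xml -- heap of each map/reduce child JVM
# <property><name>mapred.child.java.opts</name><value>-Xmx1g</value></property>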

Recommendation index:★★★★★

Recommendation reason: of these components, the NN is the most sensitive to memory. It is a single point of failure and directly affects cluster availability. The JT is also a single point; if the JT runs out of memory, no MapReduce job can run normally.

6. Administrators can have fun with MapReduce, too

Hadoop's native MapReduce must be written in Java, but not knowing Java does not matter: with the Hadoop Streaming framework, an administrator can develop MapReduce jobs in Python, shell, Perl, and other languages. An even simpler approach is to install and use Hive or Pig.
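For example, a quick ad-hoc line count over a directory can be run with shell commands as the mapper and reducer through Hadoop Streaming; the streaming jar location varies by distribution, and the input/output paths here are hypothetical:

# The output directory must not already exist
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
    -input /logs/2012-06-01 \
    -output /tmp/linecount \
    -mapper /bin/cat \
    -reducer "/usr/bin/wc -l"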

Recommendation index:★★★

Recommended reason: reduces O&M time and lets you respond quickly to ad-hoc demands and fault diagnosis.

7. NameNode HA

As mentioned above, the NN is a single point of failure for the entire cluster.

Hadoop uses the dfs.name.dir property in hdfs-site.xml to specify the paths where the NN metadata is kept. If you want to maintain multiple copies, separate the paths with commas:

<property>
    <name>dfs.name.dir</name>
    <value>/data/cache1/dfs/nn,/data/cache2/dfs/nn</value>
</property>

Hadoop officially recommends configuring multiple metadata paths, one of them on NFS. However, in one serious cluster failure I experienced, even this left every image file damaged, including the copy on the SNN. So it is still necessary to back up a usable copy regularly.
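One way to keep that extra copy, as a rough sketch (paths, schedule, and backup host are illustrative), is a nightly cron job on the NN host that archives one of the dfs.name.dir copies:

# /etc/cron.d/nn-metadata-backup -- nightly archive of one metadata copy
0 2 * * * hdfs tar czf /backup/nn-meta-$(date +\%F).tar.gz -C /data/cache1/dfs nn
# Copy the archive to another machine afterwards (scp/rsync) so it survives the loss of this host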

Recommendation index:★★★★★

Recommendation reason: the NN single point of failure in Cloudera CDH3ux and Apache 1.0 is one of the biggest headaches; more preparation, less pain.

8. Use firewall to prevent bad guys from entering

Hadoop's security controls are very simple: they contain only basic permission checks, and permissions are determined solely by the client-side user name. The design principle is: "keep good people from doing the wrong thing by accident, not stop bad people from doing bad things."

If someone knows an NN's IP address and port, they can easily read the HDFS directory structure, and by changing the user name on their own machine they can masquerade as the owner of an HDFS file and delete it.

You can enable authentication by configuring Kerberos, but many administrators use a simpler and still effective method: restrict the source IP addresses with a firewall.
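A hedged iptables sketch of that idea (the ports are the usual Hadoop 1.x defaults for the NN RPC and web UI; the trusted subnet is made up):

# Allow only the trusted subnet to reach the NameNode, drop everything else
iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 8020  -j ACCEPT   # NN RPC (fs.default.name)
iptables -A INPUT -s 10.0.0.0/24 -p tcp --dport 50070 -j ACCEPT   # NN web UI
iptables -A INPUT -p tcp --dport 8020  -j DROP
iptables -A INPUT -p tcp --dport 50070 -j DROP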

Recommendation index:★★★★★

Recommended reason: security is no small matter; guard against trouble before it happens.

9. Enable trash

I once made this mistake: tired from working overtime, my brain a little fuzzy, I accidentally executed "hadoop fs -rmr /xxx". There was no deletion prompt, and several terabytes of data were gone in an instant. It made me collapse with regret. At that moment you wish you had a time machine that could restore HDFS to the state before the deletion.

The trash feature is that time machine. It is disabled by default; once enabled, deleted data is moved into a ".Trash" folder under the user's home directory, and you can configure how long expired data is kept before it is automatically removed. When a mistaken operation happens, the data can simply be moved back. To enable the trash, follow these steps:

Edit core-site.xml and add the following configuration; the value is in minutes.

<property>
    <name>fs.trash.interval</name>
    <value>1440</value>
</property>

In CDH3u4, the NameNode does not need to be restarted for this to take effect. With the trash enabled, if you really do want to delete a file immediately, add the -skipTrash option to the delete command, as shown below:

hadoop fs -rm -skipTrash /xxxx
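Conversely, when a mistaken delete does happen with the trash enabled, the data can simply be moved back out of the .Trash folder; the path under Current mirrors the original path (the example path is hypothetical):

hadoop fs -mv /user/$USER/.Trash/Current/xxxx /xxxx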

Recommendation index:★★★★★

Recommendation reason: Do you want a time machine?

10. Go to the community to find help

Hadoop is an excellent open-source project, but it still has many unsolved problems, such as the NN and JT single points of failure, JT failure recovery, and the low efficiency of block reports when there are many small files. In those cases you can find people who can help through the channels below. Several of my serious cluster failures were resolved through Cloudera's Google group; usually, a question asked one day gets feedback the next. Here are two communities that can help you, and where you can of course also help others:

Mail List of Apache hadoop:

http://hadoop.apache.org/mailing_lists.html

Cloudera CDH Google group:

https://groups.google.com/a/cloudera.org/forum/#!forum/cdh-user

Recommendation index:★★★★★

Recommendation reason: nobody knows Hadoop better than its authors. Go to the community for help, and it will carry you over many hurdles you cannot clear on your own.

Introduction to cloudera:

Cloudera is a Hadoop software and services company. It provides the free CDH distribution and a free edition of Cloudera Manager, as well as Hadoop-related information, training, technical support, and other services. Hadoop founder Doug Cutting is an architect at the company, which also employs multiple Apache committers.

 

Link: http://www.infoq.com/cn/articles/hadoop-ten-best-practice
