10 Best Practices for Hadoop Administrators

Objective

In two years of working with Hadoop, I have run into many problems: the classic NameNode and JobTracker memory-overflow failures, the HDFS small-file storage problem, task scheduling problems, and MapReduce performance issues. Some of these problems are shortcomings of Hadoop itself, while others come from using it inappropriately.

In the process of solving these problems, I sometimes had to dig through the source code, sometimes asked colleagues and people online, and, for complex problems, turned to the mailing lists to ask Hadoop users around the world, including Hadoop committers, for help. Having received so much help, I have written up the problems I encountered and the experience I gained, in the hope that this article can help newcomers who are struggling with Hadoop and spare them some of the author's detours.

PS: This article is based on Cloudera CDH 3u4 (corresponding to Apache Hadoop 1.0). The recommended configuration values are either official recommendations or the author's empirical values; they are not absolute and may vary with the application scenario and hardware environment.

1. Choose Cloudera CDH to deploy your cluster

Most administrators start by learning Apache Hadoop. I also began developing and deploying with the Apache version of Hadoop, but after coming into contact with Cloudera CDH I found that it makes it easier for administrators to get the latest features and bug fixes, and it sometimes brings surprising performance improvements.

Why is CDH better? The author offers the following points:

    1. CDH is based on the stable Apache Hadoop release with the latest bug fixes and feature patches applied. Cloudera publishes an update release every quarter and a major release every year, so it moves faster than the official Apache releases, and in actual use CDH has been extremely stable and has not introduced new problems.
    2. Cloudera's official website has detailed installation and upgrade documentation, saving time spent searching.
    3. CDH supports four installation methods: YUM/APT packages, tarballs, RPM packages, and Cloudera Manager, so there is always one that suits you. The official website recommends the YUM/APT method, whose benefits, in the author's experience, are as follows (see the sketch after this list):
      1. Installing and upgrading over the network is very convenient. Of course, you can also download the RPM packages and install them locally with yum.
      2. Dependent packages are downloaded automatically; installing Hive, for example, will also download and install Hadoop.
      3. Packages in the Hadoop ecosystem are matched automatically. You do not need to hunt for versions of HBase, Flume, or Hive that fit your current Hadoop; yum/apt picks compatible package versions based on the installed Hadoop and guarantees compatibility.
      4. Related directories and symlinks are created automatically in the right places (such as conf and logs), and the hdfs and mapred users are created automatically; hdfs is the most privileged HDFS user, and mapred owns the relevant directories during MapReduce execution.
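
As a rough illustration, installation on a node might look like the following, assuming the Cloudera CDH3 yum repository has already been added to the machine; the package names follow the CDH3 naming scheme and may differ in other releases.

# Install core Hadoop plus the master daemons (CDH3 package names).
sudo yum install hadoop-0.20 hadoop-0.20-namenode hadoop-0.20-jobtracker
# Installing Hive pulls in a matching Hadoop version automatically.
sudo yum install hadoop-hive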

Recommendation Index: ★

Recommended reasons: Get the latest features and bug fixes; installation and maintenance are easy, saving operations time.

2. Hadoop cluster configuration and management

Installing and maintaining a Hadoop cluster involves a great deal of administrative work, including software installation, device management (crontab, iptables, and so on), configuration distribution, and more.

For software distribution and node management on a small cluster, you can use PDSH, which distributes files to target servers over passwordless ssh and sends commands to a set of machines and collects their responses. For a large cluster, or one with very different hardware configurations, it is recommended to use a tool such as Puppet to maintain configuration files, or to manage the cluster through a GUI with Cloudera Manager (note: Cloudera Manager is not open source software; the free edition supports at most 50 nodes).
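
For example, a couple of typical PDSH invocations might look like the sketch below, assuming passwordless ssh is already configured and using the hypothetical node names node01 through node10.

# Run a command on every node over ssh and collect the output per host.
pdsh -R ssh -w node[01-10] "jps | grep -c DataNode"
# Push a configuration file to the same set of nodes (pdcp ships with pdsh).
pdcp -R ssh -w node[01-10] hdfs-site.xml /etc/hadoop/conf/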

Recommendation Index: ★

Recommended reasons: Improves operational efficiency.

3. Turn on the SecondaryNameNode

The main function of the SecondaryNameNode (SNN) is to help the NameNode (NN) merge the edit log and then copy the merged image file back to the NN, which reduces the time needed to merge the edit log when the NN restarts. The SNN is not a hot standby for the NN, but the following steps can turn the SNN into the NN: first, import the image file copied from the NN on the SNN node; then change the SNN machine's hostname and IP to match the NN's; and finally restart the cluster.
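
A minimal sketch of that switch follows, assuming the SNN holds a recent checkpoint in fs.checkpoint.dir; host names and paths are placeholders, and the start scripts are the Hadoop 1.0 defaults.

# 1. On the SNN machine, with dfs.name.dir pointing at an empty directory,
#    rebuild the namespace from the most recent checkpoint.
hadoop namenode -importCheckpoint
# 2. Change the SNN machine's hostname and IP to those of the failed NN
#    (or repoint fs.default.name on every node to the SNN host).
# 3. Restart the cluster.
start-dfs.sh
start-mapred.sh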

It is particularly important that the SNN's memory configuration be the same as the NN's, because merging the edit log requires loading the complete metadata into memory. In addition, not only the SNN but any node holding a copy of the NN image can be turned into the NN by the steps above; the SNN is simply the most suitable candidate.

Recommendation Index: ★

Recommended reason: Reduces cluster downtime caused by NN restarts; the SNN can take over the NN role after an NN node failure.

4. Monitor your cluster with Ganglia and Nagios

When running a large MapReduce job, we usually care a great deal about its CPU, memory, disk, and network bandwidth usage on each TaskTracker (TT) and across the whole cluster; Ganglia is the tool that lets you observe these metrics and diagnose and analyze problems.

Ganglia can monitor the status of the cluster, but it cannot notify you when a server goes down or a TT hangs. For that we use the alerting software Nagios, which can be configured to send email and SMS alerts. By writing plugins you can implement your own checks (a minimal example follows the list below). Our cluster currently has the following monitoring in place:

    1. NameNode and JobTracker memory
    2. DataNode and TaskTracker running status
    3. NFS service status
    4. Disk usage
    5. Server load
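
As an illustration, a Nagios check plugin is simply an executable that prints a status line and exits with 0 (OK), 1 (WARNING), or 2 (CRITICAL). The sketch below only verifies that a DataNode JVM is running on the local host; the script name and logic are illustrative, not taken from our production checks.

#!/bin/sh
# check_datanode.sh - minimal Nagios-style check for a running DataNode process.
if jps 2>/dev/null | grep -q DataNode; then
    echo "OK - DataNode process is running"
    exit 0
else
    echo "CRITICAL - DataNode process not found"
    exit 2
fi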

Recommendation Index: ★

Recommended reasons: Ganglia helps you record cluster state, which makes diagnosing problems easier; Nagios tells you the moment something goes wrong.

5. Setting memory correctly is critical

After the Hadoop cluster is installed, the first thing to do is modify conf/hadoop-env.sh to set the daemon memory. With a master node that has 32GB of memory, typical memory settings are as follows:

NN: 15-25 GB
JT: 2-4 GB
DN: 1-4 GB

Different cluster usage scenarios call for different settings; if the cluster holds a large number of small files, the NN needs at least 20GB of memory and each DN needs at least 2GB.
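
A hedged sketch of the corresponding lines in conf/hadoop-env.sh for a 32GB master, using the figures above; the exact -Xmx values depend on your workload and are only an example.

# Heap sizes for the individual daemons (adjust to your own scenario).
export HADOOP_NAMENODE_OPTS="-Xmx20g $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Xmx20g $HADOOP_SECONDARYNAMENODE_OPTS"   # keep the SNN equal to the NN
export HADOOP_JOBTRACKER_OPTS="-Xmx4g $HADOOP_JOBTRACKER_OPTS"
export HADOOP_DATANODE_OPTS="-Xmx2g $HADOOP_DATANODE_OPTS"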

Recommendation Index: ★★★★★

Recommended reasons: Of all the components, the NN is the most sensitive to memory; it is a single point of failure and directly affects cluster availability. The JT is also a single point, and if the JT runs out of memory, no MapReduce job can execute properly.

6. Administrators should get hands-on with MapReduce

Native Hadoop MapReduce must be written in Java, but administrators who do not know Java can still develop MapReduce with the Hadoop Streaming framework in languages such as Python, shell, or Perl; an even easier route is to install and use Hive or Pig.
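
For example, a streaming job built entirely from Unix tools might look like the sketch below; the jar path is the usual CDH3 location but may differ on your installation, and the HDFS paths are placeholders.

# Count the lines of the input, with /bin/cat as mapper and wc -l as reducer.
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
    -input /data/raw \
    -output /data/line-count \
    -mapper /bin/cat \
    -reducer "wc -l"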

Recommendation Index: ★

Recommended reasons: Reduces operations and maintenance time and lets you respond quickly to ad-hoc requests and fault diagnosis.

7. NameNode HA

As mentioned earlier, the NN is a single point of failure for the entire cluster.

Hadoop persists the metadata to the path specified by the dfs.name.dir property in hdfs-site.xml; if you want to persist it to multiple paths, list them separated by commas:

<property>
    <name>dfs.name.dir</name>
    <value>/data/cache1/dfs/nn,/data/cache2/dfs/nn</value>
</property>

Hadoop officially recommends configuring multiple paths for the metadata, one of which is an NFS path. However, judging from a serious failure the author experienced on one cluster, even this setup can leave every image file corrupted, including the one on the SNN, so it is still necessary to back up a usable copy regularly.
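
A minimal backup sketch follows, reusing the dfs.name.dir path from the example above; the destination is hypothetical, and on a busy cluster it is safer to archive the SNN's checkpoint directory (fs.checkpoint.dir), which holds a consistent copy of the image.

#!/bin/sh
# Archive one copy of the NameNode metadata to a backup location (run from cron).
BACKUP_DIR=/backup/nn-meta
tar czf $BACKUP_DIR/nn-image-$(date +%Y%m%d).tar.gz -C /data/cache1/dfs/nn .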

Recommendation Index: ★★★★★

Recommended reasons: In Cloudera CDH3 and Apache Hadoop 1.0, the NN single point of failure is one of the biggest headaches; the more preparation, the less pain.

8. Use a firewall to keep the bad guys out

Hadoop's security controls are very simple and contain only basic permissions, determined solely by the client's user name. The design principle is: "keep good people from doing the wrong thing, but don't try to stop bad people from doing bad things."

If you know the IP address and port of the NN, you can easily read the HDFS directory structure, and by changing the user name on your own machine to impersonate the owner of an HDFS file, you can delete it.

Authentication can be implemented by configuring Kerberos, but many administrators use a simpler and still effective approach: restricting which IP addresses can connect by using a firewall.
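
As an illustration, the rules might look roughly like the sketch below, assuming 10.0.0.0/16 is the trusted network; 8020 and 50070 are the usual NameNode RPC and web UI ports, but check your own fs.default.name and dfs.http.address settings.

# Allow the trusted subnet to reach the NameNode, drop everyone else.
iptables -A INPUT -p tcp -s 10.0.0.0/16 --dport 8020 -j ACCEPT    # NameNode RPC
iptables -A INPUT -p tcp -s 10.0.0.0/16 --dport 50070 -j ACCEPT   # NameNode web UI
iptables -A INPUT -p tcp --dport 8020 -j DROP
iptables -A INPUT -p tcp --dport 50070 -j DROP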

Recommendation Index: ★★★★★

Recommended reasons: There is no such thing as a small security matter; nip problems in the bud.

9. Turn on the trash feature

I once made a painful mistake: working overtime, very tired and with my brain slightly muddled, I accidentally executed "hadoop fs -rmr /xxx/xxx". There was no deletion prompt, and a few terabytes of data were gone in an instant. It nearly made me fall apart with regret, and it is at moments like this that you wish for a time machine that could restore HDFS to its state before the deletion.

The trash feature is that time machine. It is disabled by default; once enabled, deleted data is moved into a .Trash folder under the operating user's home directory, and the system automatically removes data that has been in the trash longer than a configurable interval. So when you delete something by mistake, you can simply mv the data back. To enable the trash, proceed as follows:

Edit core-site.xml and add the following configuration; the value is in minutes.

<property>
    <name>fs.trash.interval</name>
    <value>1440</value>
</property>

Under CDH3u4 this takes effect without restarting the NameNode. After the trash is enabled, if you want a file to be deleted immediately, add the -skipTrash option to the delete command, like this:

hadoop fs -rm -skipTrash /xxxx
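
And recovering a path deleted by mistake is just a move out of the trash; the sketch below assumes the usual /user/<user>/.Trash/Current layout, with placeholder user and paths.

# Move the data back from the current trash checkpoint to its original location.
hadoop fs -mv /user/admin/.Trash/Current/xxx/xxx /xxx/xxx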

Recommendation Index: ★★★★★

Recommended reason: Want a time machine?

10. Go to the community for help

Hadoop is an excellent open source project, but it still has many unresolved problems, such as the NN and JT single points of failure, JT hang-ups, and inefficient block reporting when there are many small files. When you hit these, you can seek help through the channels below; several serious failures on the author's clusters were resolved with help from a few committers reached directly through Cloudera's Google user group. Usually a question asked in the evening gets feedback the next morning. Here are two communities that can help you, and where you can of course also help others:

Apache Hadoop mailing lists:

http://hadoop.apache.org/mailing_lists.html

Cloudera CDH Google group:

https://groups.google.com/a/cloudera.org/forum/#!forum/cdh-user

Recommendation Index: ★★★★★

Reason for recommendation: No one is more familiar with Hadoop than its authors; going to the community for help will get you past many problems you cannot solve on your own.

Cloudera Introduction:

Cloudera is a Hadoop software and services company. It offers the free CDH distribution and the free edition of Cloudera Manager, as well as consulting, training, technical support, and other Hadoop-related services. Hadoop founder Doug Cutting is an architect at the company, and it employs more than one Apache committer.

Original: http://www.infoq.com/cn/articles/hadoop-ten-best-practice
