Two years of experience with Hadoop


I have worked with Hadoop for two years and have run into many problems along the way: classic NameNode and JobTracker memory overflows, the HDFS small-files problem, task scheduling issues, and MapReduce performance problems. Some of these stem from Hadoop's own shortcomings; others come from using it incorrectly.

While solving these problems, I sometimes had to dig through the source code, sometimes asked colleagues and other users online, and for the hardest problems turned to the global Hadoop community, including Hadoop committers, through the mailing lists. Having received so much help, I have written down the problems I encountered and my approaches to them, in the hope that this article helps newcomers who are struggling with Hadoop and spares them some of my detours.

This article is based on Cloudera CDH3u4 (corresponding to Apache Hadoop 1.0). The configuration values recommended here are either official recommendations or values from my own experience; they are not absolute and may need to vary with your application scenario and hardware environment.

1. Choose Cloudera CDH to deploy your cluster

Motivation

Like most administrators, I started learning Hadoop with the Apache distribution, developing and deploying against it. But after trying Cloudera CDH, I found it lets administrators get the latest features and bug fixes more easily, and it sometimes brings pleasant performance improvements.

Why is CDH better? I see the following points:

CDH is based on the stable releases of Apache Hadoop, with the latest bug fixes and feature patches applied. Cloudera consistently ships quarterly update releases and an annual major release, a faster cadence than official Apache, and in my actual use CDH has been extremely stable and has not introduced new problems.

Cloudera's official site has detailed installation and upgrade documentation, saving time spent searching Google.

CDH can be installed four ways: YUM/APT packages, tarball, RPM packages, or Cloudera Manager, so there is always one that suits you. The official site recommends the yum/apt route; in my experience its benefits are as follows (see the example commands after this list):

Networked installation and upgrades are very convenient. You can also download the RPM packages and install them from a local yum repository.

Dependent packages are downloaded automatically; installing Hive, for example, cascades into downloading and installing Hadoop.

Hadoop ecosystem packages are matched automatically, so you do not have to hunt for the versions of HBase, Flume, Hive, and so on that fit your current Hadoop; yum/apt picks compatible package versions based on the installed Hadoop version and guarantees compatibility.

Related directories and symlinks are created automatically in the appropriate places (such as the conf and logs directories), and the hdfs and mapred users are created automatically: hdfs is the superuser of HDFS, while mapred owns the directories involved in MapReduce job execution.
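As a minimal sketch of the yum route (package names follow CDH3's conventions and may differ in other releases; adjust for the daemons each node runs), after adding Cloudera's repository:

    # Installing Hive pulls in a matching Hadoop automatically
    sudo yum install hadoop-hive
    # Role packages install the daemon init scripts, e.g. on the master node:
    sudo yum install hadoop-0.20-namenode hadoop-0.20-jobtracker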

Recommended index: ★★★

Recommended reasons: get the latest features and bug fixes; easy installation and maintenance, saving operations time

2. Hadoop cluster configuration and management

Installing and maintaining a Hadoop cluster involves a great deal of management work: software installation, device management (crontab, iptables, and so on), configuration distribution, and more.

For software distribution and node management on a small cluster, you can use pdsh, which distributes files to target servers over passwordless SSH and runs commands across a set of target machines, collecting their output. For a large cluster, or one whose hardware and configuration vary across nodes, consider a tool such as Puppet to maintain your configuration files, or manage the cluster through Cloudera Manager's GUI (note: Cloudera Manager is not open source software; the free edition supports up to 50 nodes).
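As an illustration (the host-list path here is hypothetical; pdsh reads a host file given after the '^' prefix, and pdcp ships alongside pdsh):

    # run a command on every node in the list
    pdsh -w ^/etc/hadoop/conf/slaves uptime
    # push an updated configuration file to every node
    pdcp -w ^/etc/hadoop/conf/slaves core-site.xml /etc/hadoop/conf/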

Recommended index: ★★★

Recommended reason: improves operational efficiency

3. Enable the SecondaryNameNode

The main function of the SecondaryNameNode (hereafter SNN) is to help the NameNode (hereafter NN) merge the edit log and copy the merged image file back to the NN, which shortens the time the NN needs to replay the edit log when it restarts. The SNN is not a hot standby for the NN, but the following steps let you switch an SNN into the NN role: first, on the SNN node, import the image file copied from the NN; then change the SNN machine's hostname and IP to match the NN's; finally, restart the cluster.
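A sketch of that switchover on the SNN host, using the Hadoop 1.x recovery option (dfs.name.dir and fs.checkpoint.dir are whatever your hdfs-site.xml specifies):

    # with an empty dfs.name.dir, -importCheckpoint loads the most recent image
    # from fs.checkpoint.dir and brings the NameNode up from it
    hadoop namenode -importCheckpoint
    # then change this machine's hostname/IP to match the old NN and restart the cluster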

Note in particular that the SNN's memory configuration should match the NN's, because merging the edit log requires loading the metadata into memory. Also, the SNN is not the only candidate: any node holding a copy of the NN image can become the NN through the steps above, but the SNN is the most suitable.
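For example, in hadoop-env.sh (the 8 GB heap below is purely illustrative; size it to your metadata), keeping the two daemons' heaps equal might look like:

    # the SNN needs as much heap as the NN, since the merge loads all metadata into memory
    export HADOOP_NAMENODE_OPTS="-Xmx8g ${HADOOP_NAMENODE_OPTS}"
    export HADOOP_SECONDARYNAMENODE_OPTS="-Xmx8g ${HADOOP_SECONDARYNAMENODE_OPTS}"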

Recommended index: ★★★

Recommended reasons: reduces the cluster outage time caused by an NN restart; the SNN can take over the NN role after an NN node failure
