Foreword
I have been working with Hadoop for two years and have run into many problems along the way: the classic NameNode and JobTracker memory-overflow problems, HDFS small-file storage issues, task-scheduling problems, and MapReduce performance problems. Some of these are shortcomings of Hadoop itself; others come from not using it properly.
In the course of solving these problems, I sometimes had to dig through the source code, sometimes asked colleagues and people online for advice, and for the hardest problems I wrote to the Hadoop mailing lists, asking Hadoop users around the world, including Hadoop committers (Hadoop developers), for help. Having received a great deal of help, I have organized the problems I ran into and the lessons learned into this article. I hope it helps newcomers who are struggling with Hadoop avoid some of the detours I took.
PS: This article is based on Cloudera CDH 3u4 (equivalent to Apache Hadoop 1.0). The configuration values recommended here are either official recommendations or values from my own experience; they are not absolute and may need to change for different application scenarios and hardware environments.
1. Choose Cloudera CDH to deploy your cluster
Most administrators start with Apache Hadoop, and so did I, for both development and deployment. After switching to Cloudera CDH, however, I found that it makes an administrator's job easier: you get not only the latest features and bug fixes but sometimes surprising performance improvements as well.
Why is CDH better? Here are the main reasons:
CDH is based on a stable Apache Hadoop release with the latest bug fixes and feature patches applied. Cloudera ships an Update release every quarter and a major Release every year, a faster cadence than official Apache releases, and in actual use CDH has been extremely stable without introducing new problems. Cloudera's website has detailed installation and upgrade documentation, which saves time spent on Google. CDH can be installed in four ways: Yum/Apt packages, tarball, RPM packages, or Cloudera Manager, so there is always one that suits you. The official recommendation is Yum/Apt, whose benefits as I understand them are:
- Networked installation and upgrades are very convenient. You can also download the RPM packages locally and install from a local Yum repository.
- Dependent packages are downloaded automatically; for example, installing Hive will also pull in and install Hadoop.
- The Hadoop ecosystem packages are matched automatically, so you do not have to hunt for the versions of HBase, Flume, Hive, and other software that match your current Hadoop. Yum/Apt picks compatible package versions based on the installed Hadoop version and guarantees compatibility.
- The related directories and symlinks are created automatically in the right places (such as the conf and logs directories), and the hdfs and mapred users are created automatically; the hdfs user is the HDFS superuser, and the mapred user owns the directories used for MapReduce execution.
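For instance, a CDH3 Yum installation might look roughly like the following sketch (the package names follow the CDH3 naming as I remember it; double-check them against Cloudera's documentation for your release):
# install Cloudera's CDH3 repository definition first (per the official docs), then on the master node:
yum install hadoop-0.20 hadoop-0.20-namenode hadoop-0.20-jobtracker
# on worker nodes:
yum install hadoop-0.20 hadoop-0.20-datanode hadoop-0.20-tasktracker
# ecosystem packages resolve their own Hadoop dependencies automatically, e.g.:
yum install hadoop-hive hadoop-pig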
Recommended index: ★ ★ ★
Recommended reason: you get the latest features and bug fixes; installation and maintenance are easy, which saves operations time.
2. Hadoop cluster configuration and management
Installing and maintaining a Hadoop cluster involves a lot of management work, including software installation, system administration (crontab, iptables, and so on), and configuration distribution.
For small clusters, PDSH handles software distribution and node management well: it distributes files to target servers over passwordless SSH and can send commands to a group of target machines and collect their output. For large clusters, or clusters with very different hardware configurations, it is better to maintain the configuration files with a tool such as Puppet, or to manage the cluster through a GUI with Cloudera Manager (note: Cloudera Manager is not open source, and the free version supports at most 50 nodes).
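For example, typical PDSH invocations might look like the following (the host range and paths are placeholders; pdcp ships with PDSH and handles the file copies):
# run a command on a group of nodes over passwordless SSH and collect the output
pdsh -R ssh -w node[01-20] "df -h /data"
# push a configuration file to the same group of nodes
pdcp -R ssh -w node[01-20] /etc/hadoop/conf/hdfs-site.xml /etc/hadoop/conf/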
Recommended index: ★ ★ ★
Recommended reason: improves operation and maintenance efficiency
3. Turn on SecondaryNameNode
The main function of the SecondaryNameNode (SNN) is to help the NameNode (NN) merge the edit logs and copy the merged image file back to the NN, which shortens the time the NN needs to merge edit logs when it restarts. The SNN is not a hot standby for the NN, but the following steps can switch an SNN into the NN role: first, on the SNN node, import the image file that was copied over from the NN; then change the SNN machine's hostname and IP to match the NN's; finally, restart the cluster.
Pay special attention to giving the SNN the same memory configuration as the NN, because merging the edit logs requires loading the metadata into memory. Also note that not only the SNN but any node holding a copy of the NN image can be turned into the NN with the steps above; the SNN is simply the most suitable candidate.
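As a rough sketch of the switch-over (the paths are assumptions; substitute your own fs.checkpoint.dir and dfs.name.dir values), on the SNN machine it would look something like this:
# option 1: copy the latest checkpoint into an empty name directory
cp -r /data/cache1/dfs/snn/current /data/cache1/dfs/nn/
# option 2: start the NameNode against an empty dfs.name.dir and import the checkpoint
hadoop namenode -importCheckpoint
# then change the hostname/IP to match the old NN and restart the cluster services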
Recommended index: ★ ★ ★
Recommended reasons: reduces the cluster downtime caused by an NN restart; when the NN node fails, the SNN can take over the NN role
4. Use Ganglia and Nagios to monitor your cluster
When running a large MapReduce job, we usually care a great deal about the job's impact on the TaskTrackers (hereinafter TT): CPU, memory, disk, and bandwidth across the network. This is where Ganglia is needed, generating the relevant charts so we can diagnose and analyze problems.
Ganglia can monitor cluster status, but it cannot notify you when a server goes down or a TT hangs. For that we use Nagios, an alerting tool that can be configured for email and SMS alerts. By writing plugins, you can implement your own checks. Our cluster currently monitors the following (a plugin sketch follows the list):
- NameNode and JobTracker memory
- DataNode and TaskTracker running status
- NFS service status
- Disk usage
- Server load
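As an illustration, a minimal Nagios-style plugin that checks whether the TaskTracker is alive might look like the script below (the script name is made up; Nagios only looks at the exit code and the one-line status message):
#!/bin/sh
# check_tasktracker.sh (hypothetical): OK (0) if a TaskTracker JVM is running, CRITICAL (2) otherwise
if ps -ef | grep -v grep | grep -q "org.apache.hadoop.mapred.TaskTracker"; then
    echo "OK - TaskTracker is running"
    exit 0
else
    echo "CRITICAL - TaskTracker is not running"
    exit 2
fi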
Recommended index: ★ ★ ★
Recommended reason: Ganglia helps you record cluster status and makes diagnosing problems easier; Nagios notifies you the moment something goes wrong.
5. Setting the memory is crucial
Once the Hadoop cluster is installed, the first thing to do is edit conf/hadoop-env.sh to set the memory. With a mainstream node memory configuration of 32GB, typical settings are as follows:
NN: 15-25 GB
JT: 2-4 GB
DN: 1-4 GB
TT: 1-2 GB, Child VM 1-2 GB
Cluster usage scenarios vary, and so do the corresponding settings; if the cluster holds a large number of small files, the NN needs at least 20GB of memory and each DN at least 2GB.
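As a sketch, the corresponding heap settings in conf/hadoop-env.sh might look roughly like the following (the sizes are illustrative for a 32GB node, not recommendations for your cluster):
# conf/hadoop-env.sh: per-daemon JVM heap sizes (illustrative values)
export HADOOP_NAMENODE_OPTS="-Xmx20g $HADOOP_NAMENODE_OPTS"
export HADOOP_SECONDARYNAMENODE_OPTS="-Xmx20g $HADOOP_SECONDARYNAMENODE_OPTS"
export HADOOP_JOBTRACKER_OPTS="-Xmx4g $HADOOP_JOBTRACKER_OPTS"
export HADOOP_DATANODE_OPTS="-Xmx2g $HADOOP_DATANODE_OPTS"
export HADOOP_TASKTRACKER_OPTS="-Xmx1g $HADOOP_TASKTRACKER_OPTS"
# the Child VM heap is set separately via mapred.child.java.opts in mapred-site.xml, e.g. -Xmx1g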
Recommended index: ★ ★ ★ ★ ★
Recommended reasons: the NN is the most memory-hungry of the components and is a single point of failure that directly affects cluster availability. The JT is also a single point; if the JT runs out of memory, no MapReduce jobs can execute properly.
6. Administrators can write MapReduce too
Hadoop's native MapReduce requires Java programming, but you can get by without Java: with the Hadoop Streaming framework, administrators can develop MapReduce in languages such as Python, shell, and Perl, and it is even easier to install and use Hive or Pig.
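For example, a trivial streaming job that counts input lines with standard Unix tools can be submitted like this (the jar path is the usual CDH3 location and the input/output paths are placeholders; verify both on your cluster):
hadoop jar /usr/lib/hadoop/contrib/streaming/hadoop-streaming-*.jar \
    -input /logs/2012-06-01 \
    -output /tmp/linecount-out \
    -mapper /bin/cat \
    -reducer "wc -l"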
Recommended index: ★ ★ ★
Recommended reason: reduces operation and maintenance time and allows rapid response to all kinds of ad-hoc needs and troubleshooting.
7. NameNode HA
As stated earlier, the NN is a single point of failure for the entire cluster.
Hadoop supports keeping the metadata in multiple paths: specify them, comma-separated, in the dfs.name.dir property in hdfs-site.xml, and the NN will write a copy of the metadata to each path. For example:
<property>
  <name>dfs.name.dir</name>
  <value>/data/cache1/dfs/nn,/data/cache2/dfs/nn</value>
</property>
Hadoop officially recommends that the configured metadata paths include one NFS path. However, based on my experience with one serious cluster failure, even that left every image file corrupted, including the image file on the SNN, so it is essential to back up a usable copy regularly.
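For example, a nightly crontab entry on the SNN host that archives the latest checkpoint and ships it to another machine might look like this (the paths, host, and schedule are assumptions; under Hadoop 1.0 the current/ directory of fs.checkpoint.dir holds the fsimage and edits):
# hypothetical crontab entry for the hdfs user on the SNN host, run nightly at 02:00
0 2 * * * tar czf /backup/nn-image-$(date +\%Y\%m\%d).tar.gz -C /data/cache1/dfs/snn current && scp /backup/nn-image-$(date +\%Y\%m\%d).tar.gz backuphost:/backup/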
Recommended index: ★ ★ ★ ★ ★
Recommended reason: the NN single point is one of the most painful problems in Cloudera CDH3uX and Apache 1.0; the more you prepare, the less it hurts.
8. Use a firewall to keep the bad guys out
Hadoop's security controls are very simple: there are only basic permissions, and access rights are decided purely by the client-side user name. The design principle is to "keep good people from doing the wrong thing, but not to stop bad people from doing bad things."
If you know the NN's IP and port, you can easily read the HDFS directory structure, and by changing your local user name to impersonate the owner of an HDFS file, you can delete it.
Authentication can be achieved by configuring Kerberos, but many administrators use a simpler and still effective approach: restricting access by IP with a firewall.
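As a sketch (the subnet is an assumption; 8020 and 50070 are the usual NN RPC and web UI ports in Hadoop 1.0), iptables rules on the NN might look like:
# allow only the trusted subnet to reach the NameNode RPC and web ports, drop everyone else
iptables -A INPUT -s 10.0.0.0/16 -p tcp -m multiport --dports 8020,50070 -j ACCEPT
iptables -A INPUT -p tcp -m multiport --dports 8020,50070 -j DROP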
Recommended index: ★ ★ ★ ★ ★
Recommended reason: security is no small matter; guard against trouble before it happens.
9. Turn on the trash function
I once made this mistake: working overtime and very tired, my mind a little foggy, I accidentally ran "hadoop fs -rmr /xxx/xxx". There was no deletion prompt, and a few terabytes of data were gone in an instant. The collapse and regret that followed made me wish for a time machine that could restore HDFS to its state before the deletion.
The trash function is that time machine. It is off by default; once enabled, any data you delete is moved into a ".Trash" folder under the operating user's home directory, and the system automatically deletes data that has been in the trash longer than the configured interval. So when you delete something by mistake, you can simply move the data back. The steps to enable the trash are as follows:
vi core-site.xml and add the following configuration; the value is in minutes.
<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>
On CDH3u4 this took effect for me without restarting the NameNode. Once the trash is enabled, if you want a file to be deleted immediately, add the -skipTrash parameter to the delete command, as follows:
hadoop fs -rm -skipTrash /xxxx
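Conversely, restoring a mistakenly deleted path is just a move out of the trash directory (in Hadoop 1.0 the trash lives under /user/<username>/.Trash/Current; the xxx paths are placeholders):
hadoop fs -mv /user/$USER/.Trash/Current/xxx/xxx /xxx/xxx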
Recommended index: ★ ★ ★ ★ ★
Recommended reason: want a time machine?
10. Go to the community for help
Hadoop is an excellent open-source project, but it still has many unresolved issues, such as the NN and JT single points of failure, JT hangs, and inefficient block reporting when there are many small files. When you get stuck, you can find people to help you through the channels below; several of my serious cluster failures were resolved with help from Cloudera committers in the Google group. Usually a question asked one day gets feedback the next. Here are two communities that can help you, and where you can of course help others as well:
Apache Hadoop mailing lists:
http://hadoop.apache.org/mailing_lists.html
Cloudera CDH google group:
https://groups.google.com/a/cloudera.org/forum/#!forum/cdh-user
Recommended index: ★ ★ ★ ★ ★
Recommended reason: no one knows Hadoop better than the people who wrote it. Turning to the community for help can get you past many problems you cannot solve on your own.