Hadoop Distributed Platform Optimization

Source: Internet
Author: User

Hadoop performance tuning covers not only Hadoop itself but also the underlying hardware and the operating system. We will introduce them one by one:


1. Underlying hardware

Hadoop adopts a master/slave architecture. The master (ResourceManager or NameNode) maintains metadata and performs scheduling; its workload and importance are much greater than those of a slave. Therefore, the master should be given the highest hardware configuration possible.


2. Operating System

1) Increase the maximum number of open file descriptors and the network connection backlog (significant effect)

When there are many concurrent tasks, the OS kernel limits throughput in these two respects.

ulimit -n 2000    # allow up to 2000 open file descriptors; the default on my system is 1024
sysctl -a    # display all kernel parameters and their values
sysctl -w net.core.somaxconn=500    # default is 128; keep this consistent with ipc.server.listen.queue.size in the cluster
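To make these limits survive a reboot, the equivalent settings can be placed in the usual Linux configuration files. A minimal sketch; the user name hadoop is an assumption, and paths may differ by distribution:

```conf
# /etc/security/limits.conf: raise the open-file limit for the (assumed) hadoop user
hadoop  soft  nofile  2000
hadoop  hard  nofile  2000

# /etc/sysctl.conf: raise the accept-queue backlog; apply without reboot via `sysctl -p`
net.core.somaxconn = 500
```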

3. Hadoop (version 2.5.1)

mapred-default.xml:

1) Number of concurrent TaskTracker tasks

Suggestion: map + reduce + 1 = num_cpu_cores

mapreduce.tasktracker.map.tasks.maximum (default: 2): The maximum number of map tasks that will be run simultaneously by a TaskTracker.
mapreduce.tasktracker.reduce.tasks.maximum (default: 2): The maximum number of reduce tasks that will be run simultaneously by a TaskTracker.
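Following the map + reduce + 1 = num_cpu_cores suggestion, an 8-core TaskTracker node could be configured in mapred-site.xml roughly as follows (the 4 map / 3 reduce split is an illustrative choice, not a rule):

```xml
<!-- 4 map slots + 3 reduce slots + 1 core left for the daemons = 8 cores -->
<property>
  <name>mapreduce.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
  <value>3</value>
</property>
```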

2) Adjust the heartbeat interval. The value can be changed to 300.

yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms (default: 1000): The interval in ms at which the MR AppMaster should send heartbeats to the ResourceManager.
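A sketch of the override in mapred-site.xml, using the 300 ms value suggested above:

```xml
<property>
  <name>yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms</name>
  <value>300</value>
</property>
```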
3) Enable out-of-band heartbeats. Set the value to true.

mapreduce.tasktracker.outofband.heartbeat (default: false): Expert: Set this to true to let the TaskTracker send an out-of-band heartbeat on task completion for better latency.
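The corresponding override in mapred-site.xml would look like this:

```xml
<property>
  <name>mapreduce.tasktracker.outofband.heartbeat</name>
  <value>true</value>
</property>
```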
4) Configure local directories; spread them across multiple disks to reduce I/O pressure

mapreduce.cluster.local.dir (default: ${hadoop.tmp.dir}/mapred/local): The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk I/O. Directories that do not exist are ignored.
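For example, a node with three data disks might list one directory per disk (the /data1, /data2, /data3 mount points are hypothetical):

```xml
<property>
  <name>mapreduce.cluster.local.dir</name>
  <value>/data1/mapred/local,/data2/mapred/local,/data3/mapred/local</value>
</property>
```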
5) RPC Handler count

mapreduce.jobtracker.handler.count (default: 10): The number of server threads for the JobTracker. This should be roughly 4% of the number of TaskTracker nodes.
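As a worked example of the 4% rule: a cluster with about 1000 TaskTracker nodes would get 1000 x 0.04 = 40 handler threads:

```xml
<property>
  <name>mapreduce.jobtracker.handler.count</name>
  <value>40</value>
</property>
```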
6) Number of HTTP threads

In the shuffle stage, reduce tasks fetch the intermediate results of map tasks from TaskTrackers via HTTP requests.

mapreduce.tasktracker.http.threads (default: 40): The number of worker threads for the HTTP server. This is used for map output fetching.
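On a large cluster the thread pool can be enlarged; a sketch raising it above the default of 40 (100 is an illustrative value, not a recommendation from the original text):

```xml
<property>
  <name>mapreduce.tasktracker.http.threads</name>
  <value>100</value>
</property>
```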
7) Adjust the size of the pre-read buffer

mapreduce.ifile.readahead (default: true): Configuration key to enable/disable IFile readahead.
mapreduce.ifile.readahead.bytes (default: 4194304): Configuration key to set the IFile readahead length in bytes.
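For instance, doubling the readahead buffer from the default 4 MB to 8 MB (an illustrative value, worth benchmarking on your own workload):

```xml
<property>
  <name>mapreduce.ifile.readahead.bytes</name>
  <value>8388608</value>
</property>
```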
8) Reduce slow-start threshold

This value should be increased when the cluster resources are insufficient.

mapreduce.job.reduce.slowstart.completedmaps (default: 0.05): Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.
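On a busy cluster, the threshold can be raised so reduce tasks do not hold slots while most maps are still running; 0.8 below is an illustrative value:

```xml
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.8</value>
</property>
```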



Why is only one DataNode started in a distributed Hadoop configuration?

Is the firewall configured correctly?
Is the fs.default.name setting in core-site.xml correct?
Are the cluster IDs of the NameNode and DataNodes different because namenode -format was executed after the cluster was already started?

What do GoPivotal and Hadoop HDFS mean? To put it simply: I don't understand Baidu's explanation, so an example would be best.

Hadoop is an open-source distributed computing platform used to process big data.

HDFS is a distributed file system on which the Hadoop platform depends.

GoPivotal is a company with many products. Taking Pivotal HD as an example: it is a Hadoop distribution, or it can be understood as a commercial Hadoop platform that has been adapted and optimized.

OK?
