Hadoop distributed platform optimization
Hadoop performance tuning involves not only Hadoop itself but also the underlying hardware and operating system. We will cover each layer in turn:
1. Underlying hardware
Hadoop adopts a master/slave architecture. The master (ResourceManager or NameNode) maintains metadata and performs scheduling; its workload and importance are far greater than those of the slaves, so the master should be given the highest hardware configuration possible.
2. Operating System
1) Increase the maximum number of file descriptors and the upper limit on network connections (significant effect)
When many tasks run concurrently, these two kernel limits become the bottleneck.
ulimit -n 2000  # allow up to 2000 open file descriptors; my system defaults to 1024
sysctl -a  # display all kernel parameters and their values
sysctl -w net.core.somaxconn=500  # the default is 128; keep this consistent with the cluster's ipc.server.listen.queue.size
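These commands only apply to the current session. A minimal sketch of making the limits persistent, assuming a typical Linux layout (the file paths and values below are common examples, not from the text):

# /etc/security/limits.conf — raise the per-user open-file limit
* soft nofile 2000
* hard nofile 2000

# /etc/sysctl.conf — raise the listen-backlog ceiling
net.core.somaxconn = 500

# reload sysctl settings without rebooting
sysctl -p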
3. Hadoop (version 2.5.1)
mapred-default.xml:
1) Number of concurrent tasks per tasktracker
Suggestion: map slots + reduce slots + 1 = num_cpu_cores (see the sample configuration after the table)
mapreduce.tasktracker.map.tasks.maximum | 2 | The maximum number of map tasks that will be run simultaneously by a task tracker.
mapreduce.tasktracker.reduce.tasks.maximum | 2 | The maximum number of reduce tasks that will be run simultaneously by a task tracker.
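A minimal sketch of applying the suggestion in mapred-site.xml, assuming an 8-core slave node (5 map slots + 2 reduce slots + 1 = 8; the core count is an assumption). Note that these tasktracker-era properties take effect under the classic MRv1 runtime; under pure YARN, per-node parallelism is governed by container resources instead.

<!-- mapred-site.xml: cap concurrent tasks per tasktracker (8-core node assumed) -->
<property>
  <name>mapreduce.tasktracker.map.tasks.maximum</name>
  <value>5</value>
</property>
<property>
  <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>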
2) Adjust the heartbeat interval. The value can be changed to 300 (see the example after the table).
yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms | 1000 | The interval in ms at which the MR AppMaster should send heartbeats to the ResourceManager.
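A minimal sketch of applying the suggested value in mapred-site.xml (300 ms is the value from the text above):

<!-- mapred-site.xml: heartbeat every 300 ms instead of the 1000 ms default -->
<property>
  <name>yarn.app.mapreduce.am.scheduler.heartbeat.interval-ms</name>
  <value>300</value>
</property>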
3) Enable out-of-band heartbeats by setting the value to true (see the example after the table).
mapreduce.tasktracker.outofband.heartbeat | false | Expert: Set this to true to let the tasktracker send an out-of-band heartbeat on task completion for better latency.
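A minimal sketch in mapred-site.xml (the value true is from the text above):

<!-- mapred-site.xml: send a heartbeat immediately on task completion for lower latency -->
<property>
  <name>mapreduce.tasktracker.outofband.heartbeat</name>
  <value>true</value>
</property>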
4) Configure local directories across multiple disks to reduce I/O pressure (see the example after the table).
mapreduce.cluster.local.dir | ${hadoop.tmp.dir}/mapred/local | The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk I/O. Directories that do not exist are ignored.
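A minimal sketch, assuming two dedicated disks mounted at /data1 and /data2 (the mount points are hypothetical):

<!-- mapred-site.xml: spread intermediate data across disks; nonexistent directories are ignored -->
<property>
  <name>mapreduce.cluster.local.dir</name>
  <value>/data1/mapred/local,/data2/mapred/local</value>
</property>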
5) RPC handler count (see the example after the table)
mapreduce.jobtracker.handler.count | 10 | The number of server threads for the JobTracker. This should be roughly 4% of the number of tasktracker nodes.
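As a worked example of the 4% rule: a cluster of about 250 tasktrackers would use roughly 250 x 0.04 = 10 handler threads, which matches the default. A minimal sketch for a larger cluster of about 500 nodes (the node count is an assumption):

<!-- mapred-site.xml: JobTracker RPC handlers, ~4% of 500 tasktrackers -->
<property>
  <name>mapreduce.jobtracker.handler.count</name>
  <value>20</value>
</property>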
6) Number of HTTP threads
In the shuffle stage, reduce tasks fetch the intermediate output of map tasks from the tasktrackers via HTTP requests (see the example after the table).
mapreduce.tasktracker.http.threads | 40 | The number of worker threads for the HTTP server. This is used for map output fetching.
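A minimal sketch raising the thread count; the value 80 is an assumed example for a larger cluster, not a recommendation from the text:

<!-- mapred-site.xml: worker threads serving map output over HTTP during shuffle -->
<property>
  <name>mapreduce.tasktracker.http.threads</name>
  <value>80</value>
</property>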
7) Adjust the size of the read-ahead buffer (see the example after the table).
mapreduce.ifile.readahead | true | Configuration key to enable/disable IFile readahead.
mapreduce.ifile.readahead.bytes | 4194304 | Configuration key to set the IFile readahead length in bytes.
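A minimal sketch that keeps readahead enabled and doubles the length from the 4 MB default (4194304 bytes) to 8 MB; the 8 MB figure is an assumed example:

<!-- mapred-site.xml: IFile readahead; 8388608 bytes = 8 MB (assumed example) -->
<property>
  <name>mapreduce.ifile.readahead</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.ifile.readahead.bytes</name>
  <value>8388608</value>
</property>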
8) Reduce slow-start threshold
This value should be increased when cluster resources are tight, so that reduce tasks are not scheduled until more map tasks have finished (see the example after the table).
mapreduce.job.reduce.slowstart.completedmaps | 0.05 | Fraction of the number of maps in the job which should be complete before reduces are scheduled for the job.
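A minimal sketch for a resource-tight cluster; 0.8 (start reduces only after 80% of maps finish) is an assumed example value:

<!-- mapred-site.xml: delay reduce scheduling until 80% of maps are complete -->
<property>
  <name>mapreduce.job.reduce.slowstart.completedmaps</name>
  <value>0.8</value>
</property>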
Why is only one DataNode started in a Hadoop distributed configuration?
Is a firewall blocking communication between the nodes?
Is the fs.default.name setting in core-site.xml correct?
Do the NameNode and the DataNodes have different cluster IDs because "namenode -format" was run again after the cluster had already been started? (See the check below.)
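One way to verify a cluster-ID mismatch is to compare the VERSION files on both sides; the paths below are placeholders for your dfs.namenode.name.dir and dfs.datanode.data.dir settings:

# clusterID recorded by the NameNode
grep clusterID <dfs.namenode.name.dir>/current/VERSION
# clusterID recorded by a DataNode; the two must match
grep clusterID <dfs.datanode.data.dir>/current/VERSION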
What do GoPivotal and Hadoop HDFS mean? To put it simply, I don't understand the explanations on Baidu. It's best to give an example.
Hadoop is an open-source distributed computing platform used to process big data.
HDFS is a distributed file system on which the Hadoop platform depends.
GoPivotal is a company with many products.
Take Pivotal HD as an example: it is a Hadoop distribution, which can be understood as a commercially transformed and optimized Hadoop platform.
OK?