Hadoop configuration parameters explained (continuously updated)

Source: Internet
Author: User
Tags: hadoop, mapreduce


Added by the author:

dfs.datanode.du.reserved: how much non-DFS disk space is reserved when a datanode writes data to disk. This prevents DFS from filling up the disk, but the parameter has a bug in 0.19.2.
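As a sketch, the reservation could be set in hadoop-site.xml like this (the 10 GB value is an arbitrary example, not a recommendation):

```xml
<!-- hadoop-site.xml: reserve 10 GB per volume for non-DFS use (example value) -->
<property>
  <name>dfs.datanode.du.reserved</name>
  <value>10737418240</value> <!-- in bytes: 10 * 1024^3 -->
</property>
```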

I introduced "ipc.server.listen.queue.size", which defines how many calls per handler are allowed in the queue. The default is still 100, so there is no change for current users. When the RPC service is started, each handler processes at most this many queued requests; additional clients have to wait.
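A minimal hadoop-site.xml entry for this setting might look as follows (100 is the stated default; raising it is only an example):

```xml
<!-- hadoop-site.xml: per-handler RPC call queue depth -->
<property>
  <name>ipc.server.listen.queue.size</name>
  <value>100</value>
</property>
```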


dfs.datanode.simulateddatastorage (https://issues.apache.org/jira/browse/HADOOP-1989): starts the datanode with simulated storage, i.e. a pseudo-distributed system for debugging.


slave.host.name

: The name each datanode node reports. Usually each machine is configured with its own IP address, which HDFS uses as that datanode's connection address on the web management page.
In MapReduce, it is the address used to connect to the machine running a specific map (reduce) task. If not configured, the machine name is used.
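For example, a slave's hadoop-site.xml could pin the reported address to a fixed IP (the address below is a placeholder):

```xml
<!-- slave-side hadoop-site.xml: report a fixed IP instead of the machine name -->
<property>
  <name>slave.host.name</name>
  <value>192.168.1.101</value>
</property>
```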


dfs.datanode.failed.volumes.tolerated
: The number of failed disk volumes a datanode tolerates. At startup the datanode uses the directories configured under dfs.data.dir (used to store blocks); if the number of unusable directories exceeds the configured value, the datanode fails to start. See lines 980-997 of org.apache.hadoop.hdfs.server.datanode.FSDataset, as follows:

final int volFailuresTolerated =
    conf.getInt("dfs.datanode.failed.volumes.tolerated", 0);
String[] dataDirs = conf.getStrings(DataNode.DATA_DIR_KEY);
int volsConfigured = 0;
if (dataDirs != null)
  volsConfigured = dataDirs.length;
int volsFailed = volsConfigured - storage.getNumStorageDirs();
if (volsFailed < 0 || volsFailed > volFailuresTolerated) {
  throw new DiskErrorException("Invalid value for volsFailed: " + volsFailed
      + ", Volumes tolerated: " + volFailuresTolerated);
}


dfs.blockreport.intervalMsec

Each datanode periodically reports all block information on the current node to the namenode; this parameter controls the report interval, in milliseconds.


dfs.blockreport.initialDelay

Used together with the previous parameter: the first block report after a datanode starts is sent at a random time in (0, dfs.blockreport.initialDelay); from that initial time onward (which differs across datanodes), the datanode reports all of its block information to the namenode every dfs.blockreport.intervalMsec.

Without this randomized initial delay, all datanodes would report from the same starting moment, sending a large amount of data to the namenode at once and causing congestion; this parameter spreads that load out.
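The two parameters could be combined in hadoop-site.xml like this (the values below are illustrative: report hourly, with start times spread over the first 120 seconds):

```xml
<!-- hadoop-site.xml: full block report every hour, staggered start (example values) -->
<property>
  <name>dfs.blockreport.intervalMsec</name>
  <value>3600000</value> <!-- milliseconds -->
</property>
<property>
  <name>dfs.blockreport.initialDelay</name>
  <value>120</value> <!-- seconds -->
</property>
```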


Some parameters that can be obtained while a job is running:

mapred.job.id: the job ID, for example job_201511121233_0001

mapred.tip.id: the task ID, for example task_201511121233_0001_m_000003

mapred.task.id: the task attempt ID, for example attempt_201511121233_0001_m_000003_0

mapred.task.partition: the sequence number of the task within the job, for example 3

mapred.task.is.map: whether the task is a map task, for example true

mapred.job.queue.name: the queue the job belongs to; this value is usually written into the configuration file for each user's client.
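The IDs above follow a fixed underscore-separated layout (prefix, jobtracker start time, job number, "m"/"r", task number, attempt number), so a small helper can pull them apart. The class below is an illustrative stand-alone sketch; TaskIdParser is a hypothetical name, not part of Hadoop:

```java
// Illustrative parser for MapReduce task attempt IDs of the form
// attempt_<jtStartTime>_<jobSeq>_<m|r>_<taskSeq>_<attemptSeq>.
public class TaskIdParser {

    // Returns "map" or "reduce" from the m/r field of the attempt ID.
    public static String taskType(String attemptId) {
        String[] parts = attemptId.split("_");
        return parts[3].equals("m") ? "map" : "reduce";
    }

    // Returns the task's sequence number within the job (mapred.task.partition).
    public static int partition(String attemptId) {
        String[] parts = attemptId.split("_");
        return Integer.parseInt(parts[4]);
    }

    public static void main(String[] args) {
        String id = "attempt_201511121233_0001_m_000003_0";
        System.out.println(taskType(id));  // map
        System.out.println(partition(id)); // 3
    }
}
```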




dfs.client.max.block.acquire.failures

When reading files from Hadoop, the DFSClient fetches specific block data from datanodes. If the node being read fails (the socket cannot be connected), the client retries; this parameter sets the number of attempts. If the limit is exceeded, an exception is thrown.
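A hadoop-site.xml entry raising the retry limit might look like this (5 is an arbitrary example value):

```xml
<!-- hadoop-site.xml: give up after 5 failed block fetch attempts (example value) -->
<property>
  <name>dfs.client.max.block.acquire.failures</name>
  <value>5</value>
</property>
```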

====================================================================

The following is reproduced from: http://blog.chinaunix.net/space.php?uid=22477743&do=blog&cuid=2046639 and http://longmans1985.blog.163.com/blog/static/7060547520113652122555/


0. Version: 0.19.2

1. Hadoop cluster:
   1.1 HDFS
       1.1.1 Name node (1)
       1.1.2 Secondary name node (1, optional)
       1.1.3 Data nodes (several)
   1.2 MR
       1.2.1 Master [jobtracker] (1)
       1.2.2 Slaves [tasktracker] (several)

2. Configuration files:
   2.1 hadoop-default.xml: the Hadoop cluster's default configuration; this file usually does not need to be modified.
   2.2 hadoop-site.xml: per-machine configuration for the Hadoop cluster; machine-specific settings are normally placed here.

3. Configuration items:

3.1 fs.default.name
Definition: name node URI
Description: hdfs://hostname/


3.2 mapred.job.tracker
Definition: jobtracker address
Description: hostname:port

3.3 dfs.name.dir
Definition: local directory on the name node for saving metadata and transaction logs
Description: a comma-separated directory list; the data is redundantly backed up to every directory.

3.4 dfs.data.dir
Definition: local directories on a data node for saving block files
Description: a comma-separated directory list; block files are saved across these directories.

3.5 mapred.system.dir
Definition: directory on HDFS where MapReduce saves system files
Description:

3.6 mapred.local.dir
Definition: local directory for saving MapReduce temporary files
Description: a comma-separated directory list; all directories are used simultaneously as temporary data space.

3.7 mapred.tasktracker.{map|reduce}.tasks.maximum
Definition: maximum number of map/reduce tasks that can run simultaneously on a tasktracker
Description: the default is 2 map and 2 reduce tasks.


3.8 dfs.hosts / dfs.hosts.exclude
Definition: data node whitelist/blacklist files
Description:

3.9 mapred.hosts / mapred.hosts.exclude
Definition: MapReduce (tasktracker) whitelist/blacklist files
Description:

3.10 mapred.queue.names
Definition: queue names
Description: the Hadoop MapReduce system has a "default" job queue (pool) by default.

3.11 dfs.block.size
Definition: default HDFS block size
Description: the default value is 64 MB.

3.12 dfs.namenode.handler.count
Definition: number of threads the namenode uses to communicate with datanodes concurrently

3.13 mapred.reduce.parallel.copies
Definition: number of map output files a reducer pulls in parallel

3.14 mapred.child.java.opts
Definition: heap size (JVM options) for the child JVM

3.15 fs.inmemory.size.mb
Definition: memory space the reducer uses to merge map output data
Description: 200 MB is used by default.

3.16 io.sort.factor
Definition: sort factor; the number of data streams merged at the same time

3.17 io.sort.mb
Definition: maximum memory used for sorting

3.18 io.file.buffer.size
Definition: buffer size for reading/writing files

3.19 mapred.job.tracker.handler.count
Definition: number of threads the jobtracker uses to communicate with tasktrackers concurrently

3.20 tasktracker.http.threads
Definition: number of threads for the tasktracker's HTTP service; reducers use it to pull map output data


The following configurations are required.


fs.default.name — namenode URI, e.g. hdfs://hostname/

dfs.hosts / dfs.hosts.exclude — files listing permitted/denied datanodes. Use these files if you need to control which datanodes may join.

dfs.replication — replication factor; the default value is 3.

dfs.data.dir — default /tmp. When set to a comma-separated directory list, data is stored across all directories, usually spread over different devices.

mapred.system.dir — HDFS path where the map/reduce framework stores system files, e.g. /hadoop/mapred/system/. This path lives on the default file system (HDFS) and must be accessible from both servers and clients.

mapred.local.dir — comma-separated list of local file-system paths where map/reduce temporary data is stored. Multiple paths help spread disk I/O.

mapred.tasktracker.{map|reduce}.tasks.maximum — maximum number of map/reduce tasks a tasktracker can run simultaneously. The default is 2 (2 maps and 2 reduces); adjust it to the hardware.

mapred.job.tracker — jobtracker host (or IP) and port, as host:port.

mapred.hosts / mapred.hosts.exclude — files listing permitted/denied tasktrackers. Use these files if you need to control which tasktrackers are authorized.

hadoop.job.history.user.location — job history file directory. Default value: ${mapred.output.dir}/_logs/history; it can also be set to "none" to disable it.
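Putting the required settings together, a minimal hadoop-site.xml might look like this (all hostnames and paths are placeholders):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host/</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>jobtracker-host:9001</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/data1/dfs/name,/data2/dfs/name</value> <!-- redundant copies -->
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data1/dfs/data,/data2/dfs/data</value> <!-- spread over devices -->
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/hadoop/mapred/system/</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data1/mapred/local,/data2/mapred/local</value>
  </property>
</configuration>
```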


conf/slaves: list the names or IP addresses of all slave machines.


The namenode remembers the block IDs mapped to each file; the block behind each block ID is replicated to different machines for redundancy.

The default Hadoop block size is 64 MB.




