Next, I will continue to explain MapReduce, starting with job configuration.
Job Configuration
JobConf is the configuration object of an MR job and the main way to describe how the job is executed in the MapReduce framework. The framework faithfully executes the job based on the information contained in this object, but note the following special cases:
- Some configuration parameters cannot be overridden by job-level settings if administrators have marked them final in the Hadoop configuration files, such as core-site.xml and mapred-site.xml.
- Some parameters can be set directly through simple methods, such as setNumReduceTasks(int), while others interact in subtler ways with the rest of the framework and the job configuration and are more complex to set (for example, setNumMapTasks(int), which only provides a hint to the framework).
JobConf is generally used to specify the Mapper, Combiner (if used), Partitioner, Reducer, InputFormat, OutputFormat, and OutputCommitter implementation classes. JobConf is also used to specify the set of input paths, via setInputPaths(JobConf, Path...)/addInputPath(JobConf, Path) or setInputPaths(JobConf, String)/addInputPaths(JobConf, String), and the output path for job results, via setOutputPath(Path).
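To make this concrete, here is a minimal sketch using the classic org.apache.hadoop.mapred API; the identity mapper/reducer and the /user/me/... paths are placeholders, not part of the original text:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MyJob.class);
    conf.setJobName("identity-passthrough");

    // Implementation classes described above; identity classes keep the sketch runnable.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    // TextInputFormat produces LongWritable offsets and Text lines.
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    // Input and output paths, as described in the text (hypothetical locations).
    FileInputFormat.setInputPaths(conf, new Path("/user/me/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/me/output"));

    conf.setNumReduceTasks(2); // can be set directly, as noted above

    JobClient.runJob(conf); // submit and wait for completion
  }
}
```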
JobConf is also used to specify optional configurations (generally for optimization or special analysis purposes). For example, you can specify the Comparator a job uses (for sorting or grouping), cache necessary files with DistributedCache, and specify whether and how intermediate data and/or job results are compressed. You can debug a job with setMapDebugScript(String)/setReduceDebugScript(String) (not used yet here). You can use setMapSpeculativeExecution(boolean)/setReduceSpeculativeExecution(boolean) to specify whether speculative execution is enabled, setMaxMapAttempts(int)/setMaxReduceAttempts(int) to set the maximum number of attempts per task, and setMaxMapTaskFailuresPercent(int)/setMaxReduceTaskFailuresPercent(int) to set the tolerable failure rate of a job's map/reduce tasks.
In addition, you can use set(String, String)/get(String, String) to set and retrieve parameters needed by the application, but use DistributedCache for large amounts of (read-only) data.
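A hedged sketch of these optional settings on a JobConf (the key my.app.threshold is a hypothetical application parameter):

```java
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.JobConf;

public class TuningExample {
  public static void configure(JobConf conf) {
    // Speculative execution on/off per phase.
    conf.setMapSpeculativeExecution(true);
    conf.setReduceSpeculativeExecution(false);

    // Retry policy: attempts per task and tolerable failure percentage per job.
    conf.setMaxMapAttempts(4);
    conf.setMaxReduceAttempts(4);
    conf.setMaxMapTaskFailuresPercent(5);
    conf.setMaxReduceTaskFailuresPercent(5);

    // Compress intermediate map output.
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);

    // Application-defined parameter, readable in tasks via conf.get(...).
    conf.set("my.app.threshold", "0.75"); // hypothetical key
  }
}
```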
MR Running Parameters
The TaskTracker runs the Mapper and Reducer in independent child JVM processes. Child tasks inherit all environment parameters of the TaskTracker. You can use the mapred.child.java.opts property in JobConf to specify the child JVM's runtime options, for example -Djava.library.path=<> to optionally specify non-standard paths to search for shared libraries. If the mapred.child.java.opts value contains the symbol @taskid@, the framework interpolates it with the id of the running task.
The following configuration example uses multiple options and the variable interpolation feature to configure JVM GC logging. It also starts a passwordless JMX service so that the jconsole tool can inspect the child process's memory and threads and obtain thread dumps. In addition, it sets the maximum heap size of the child JVM to 512 MB and adds a shared library path via java.library.path.
mapred.child.java.opts -Xmx512M -Djava.library.path=/home/mycompany/lib -verbose:gc -Xloggc:/tmp/@taskid@.gc -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
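Assuming you set properties programmatically rather than in mapred-site.xml, the same value could be applied like this (a sketch; @taskid@ is interpolated by the framework at run time):

```java
import org.apache.hadoop.mapred.JobConf;

public class ChildJvmOpts {
  public static void configure(JobConf conf) {
    // Multiple JVM options in one property; @taskid@ is replaced per task.
    conf.set("mapred.child.java.opts",
        "-Xmx512M -Djava.library.path=/home/mycompany/lib"
        + " -verbose:gc -Xloggc:/tmp/@taskid@.gc"
        + " -Dcom.sun.management.jmxremote.authenticate=false"
        + " -Dcom.sun.management.jmxremote.ssl=false");
  }
}
```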
Memory Management
You can use the mapred.child.ulimit parameter to configure the maximum virtual memory a child process may use. Note: this property sets a per-process limit, its unit is KB, and its value must be greater than or equal to the maximum heap size (set via -Xmx).
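For illustration, a limit consistent with the 512 MB heap above (the 1 GB figure is an arbitrary example):

```java
import org.apache.hadoop.mapred.JobConf;

public class ChildUlimit {
  public static void configure(JobConf conf) {
    // Per-process virtual memory cap in KB; must be >= the -Xmx heap (512 MB above).
    conf.set("mapred.child.ulimit", "1048576"); // 1,048,576 KB = 1 GB
  }
}
```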
Note: mapred.child.java.opts applies only to child JVMs launched and managed by the TaskTracker. For memory configuration of the other Hadoop daemons, see Configuring the Environment of the Hadoop Daemons.
The memory available to other components of the MR framework is also configurable. The performance of map and reduce tasks can be tuned by adjusting parameters that affect operation concurrency and the number of disk I/O operations. Monitoring a task's file system counters, especially the bytes flowing from map to reduce, is very useful when tuning these parameters.
If the memory management feature is enabled, you can selectively override some default configurations, such as virtual memory and RAM limits. The following parameters apply to tasks:
| Name | Type | Description |
| --- | --- | --- |
| mapred.task.maxvmem | int | The maximum virtual memory, in bytes, for a single map or reduce task. A task that exceeds this value is killed. |
| mapred.task.maxpmem | int | The maximum RAM, in bytes, for a single map or reduce task. The scheduler (JobTracker) consults this value when assigning map/reduce tasks, to avoid overloading a node's RAM. |
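A sketch of overriding these limits per job; it only takes effect on clusters where the memory management feature is enabled, and the byte values are illustrative:

```java
import org.apache.hadoop.mapred.JobConf;

public class TaskMemoryLimits {
  public static void configure(JobConf conf) {
    // Kill a task whose virtual memory exceeds 2 GB.
    conf.setLong("mapred.task.maxvmem", 2L * 1024 * 1024 * 1024);
    // Tell the scheduler to budget 1 GB of RAM per task when placing it.
    conf.setLong("mapred.task.maxpmem", 1L * 1024 * 1024 * 1024);
  }
}
```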
Map Parameters
Each record read by a map is serialized into a buffer, and its metadata is stored in a metadata buffer. As mentioned above, when either the serialization buffer or the metadata buffer exceeds its threshold, the buffer contents are sorted and written to disk in the background while the map continues to output records. If either buffer fills completely during the spill, the map thread blocks. When the map finishes, the remaining records in the buffer are written to disk and merged, segment by segment, with the records already on disk into a single file. Fewer spills shorten map time, but a larger buffer also reduces the memory available to the mapper.
| Name | Type | Description |
| --- | --- | --- |
| io.sort.mb | int | Default 100. The size, in MB, of the combined serialization and metadata buffers. |
| io.sort.record.percent | float | Default 0.05. The fraction of io.sort.mb dedicated to the metadata buffer. To accelerate sorting, each serialized record requires 16 bytes of metadata in addition to its own size. A spill occurs when either buffer's share of io.sort.mb exceeds its threshold. For maps outputting many small records, a value higher than the default reduces the number of spills. |
| io.sort.spill.percent | float | Default 0.80. The threshold on the metadata and serialization buffers. When either buffer's usage reaches this fraction, its contents are spilled to disk. If io.sort.record.percent is r, io.sort.mb is x, and this value is q, the maximum number of records collected before a spill begins is r * x * q * 2^16. Note: a larger value reduces, and may even avoid, merges, but also increases the probability of the map blocking. Map time can be shortened by estimating map output size accurately and minimizing the number of spills. |
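As a worked example with the defaults above (r = 0.05, x = 100, q = 0.80), a spill is triggered after at most 0.05 * 100 * 0.80 * 2^16 = 262,144 records. The following sketch raises these knobs for a map emitting many small records; the specific values are illustrative assumptions:

```java
import org.apache.hadoop.mapred.JobConf;

public class MapSideSortTuning {
  public static void configure(JobConf conf) {
    conf.setInt("io.sort.mb", 200);                 // bigger combined buffer
    conf.setFloat("io.sort.record.percent", 0.10f); // more room for 16-byte metadata entries
    conf.setFloat("io.sort.spill.percent", 0.80f);  // keep the default trigger
    // Record capacity before a spill: 0.10 * 200 * 0.80 * 2^16 = 1,048,576 records.
  }
}
```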
Other considerations:
- If either spill threshold is exceeded while a spill is in progress, collection continues until that spill finishes. For example, if io.sort.spill.percent is set to 0.33 and the remainder of the buffer fills while the spill runs, the next spill includes all of the collected records, i.e. 0.66 of the buffer, and does not generate additional spills. In other words, the thresholds define triggers, not blocking.
- A record larger than the serialization buffer first triggers a spill and is then spilled to a separate file. Whether such a record first passes through the combiner is undefined; no parameter controls it.
Shuffle/Reduce Parameters
As mentioned above, each reduce fetches map outputs over HTTP, reads them into memory, and periodically merges them to disk. If map output compression is enabled, each output is decompressed into memory as it is read. The following configuration parameters affect merging and memory allocation during reduce processing.
| Name | Type | Description |
| --- | --- | --- |
| io.sort.factor | int | Default 10. The number of file segments merged at the same time. This limits the number of open files and compression codecs; if the number of files exceeds it, the merge proceeds in multiple passes. The limit also applies to map-side merges, and most jobs should be configured so that it is rarely hit. |
| mapred.inmem.merge.threshold | int | The number of sorted map outputs fetched into memory before they are merged to disk. Like the spill thresholds above, this is a trigger rather than a partition unit. In practice it is set high (1000) or disabled (0), since merging in-memory segments is less expensive than merging from disk. This threshold affects only the frequency of in-memory merges during the shuffle. |
| mapred.job.shuffle.merge.percent | float | Default 0.66. The fraction of the memory available for storing map outputs at which an in-memory merge starts; beyond it, data is merged to disk. Setting it too high decreases the parallelism between fetching and merging; conversely, values as high as 1.0 are effective when a reduce's input fits entirely in memory. This parameter affects only the frequency of in-memory merges during the shuffle. |
| mapred.job.shuffle.input.buffer.percent | float | Default 0.70. The percentage of the child JVM's maximum heap (set via mapred.child.java.opts) used for caching map outputs during the shuffle. It can be raised as needed to hold large map outputs. |
| mapred.job.reduce.input.buffer.percent | float | Default 0.0. The fraction of the maximum heap that retained map outputs may occupy during the reduce; in-memory map outputs are flushed to disk until those remaining fall below this value. By default, all in-memory map outputs are merged to disk before the reduce starts, to give the reduce as much memory as possible. For reduces that are not memory-sensitive, this value can be increased to avoid disk I/O (usually unnecessary). |
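A sketch of tuning the shuffle for a reduce whose input fits entirely in memory, per the table above; the values chosen are illustrative assumptions:

```java
import org.apache.hadoop.mapred.JobConf;

public class ShuffleTuning {
  public static void configure(JobConf conf) {
    conf.setInt("io.sort.factor", 100);                              // wider merges, fewer passes
    conf.setInt("mapred.inmem.merge.threshold", 0);                  // disable the count-based trigger
    conf.setFloat("mapred.job.shuffle.merge.percent", 1.0f);         // input fits in memory
    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.70f); // shuffle cache share of heap
    conf.setFloat("mapred.job.reduce.input.buffer.percent", 0.8f);   // keep outputs in memory into the reduce
  }
}
```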
Other considerations:
- A map output larger than 25% of the memory allocated to copying map outputs is written directly to disk without first staging through memory.
- When a combiner is running, the reasoning above about high merge thresholds and large buffers may not hold. For merges started before all map outputs have been fetched, the combiner runs while spilling to disk. In some practices, reduce time is better spent combining map outputs, keeping disk spills small and parallelizing spilling and fetching, rather than continuously increasing buffer sizes.
- When in-memory map outputs are merged to disk to begin the reduce, if an intermediate merge is necessary because there are segments to spill and at least io.sort.factor segments are already on disk, the in-memory map outputs are included in the intermediate merge.
Sub-JVM Reuse
You can enable JVM reuse with the mapred.job.reuse.jvm.num.tasks job configuration parameter. Its default value is 1, meaning JVMs are not reused (each JVM processes only one task). Setting it to -1 lets a JVM run any number of tasks from the same job. You can also use JobConf.setNumTasksToExecutePerJvm(int) to specify a value greater than 1.
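A minimal sketch of enabling unlimited reuse within a job (both forms write the same property):

```java
import org.apache.hadoop.mapred.JobConf;

public class JvmReuse {
  public static void configure(JobConf conf) {
    // -1: a child JVM may run any number of tasks from the same job.
    conf.setNumTasksToExecutePerJvm(-1);
    // Equivalent property form: conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);
  }
}
```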
The configuration parameters for job execution are as follows:
| Name | Type | Description |
| --- | --- | --- |
| mapred.job.id | String | The job id |
| mapred.jar | String | Location of job.jar in the job directory |
| job.local.dir | String | The job's shared scratch space |
| mapred.tip.id | String | The task id |
| mapred.task.id | String | The task attempt id |
| mapred.task.is.map | boolean | Whether this is a map task |
| mapred.task.partition | int | The id of the task within the job |
| map.input.file | String | The file path the map is reading from |
| map.input.start | long | The offset of the start of the map input split |
| map.input.length | long | The number of bytes in the map input split |
| mapred.work.output.dir | String | The task's temporary output directory |
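As a sketch of reading these localized values inside a task (classic mapred API; the mapper below is a hypothetical example that emits per-file line lengths):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SplitAwareMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private final Text inputFile = new Text();

  @Override
  public void configure(JobConf job) {
    // Localized per-task parameters from the table above.
    inputFile.set(job.get("map.input.file", "unknown"));
    long splitStart = job.getLong("map.input.start", 0);
    long splitLength = job.getLong("map.input.length", 0);
    System.err.println("split " + splitStart + "+" + splitLength + " of " + inputFile
        + ", map=" + job.getBoolean("mapred.task.is.map", true));
  }

  @Override
  public void map(LongWritable key, Text value,
      OutputCollector<Text, LongWritable> out, Reporter reporter) throws IOException {
    // Emit (file, bytes-in-line) pairs; trivial body to keep the sketch runnable.
    out.collect(inputFile, new LongWritable(value.getLength()));
  }
}
```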
The standard output (stdout) and standard error (stderr) streams of a task are read by the TaskTracker and logged under ${HADOOP_LOG_DIR}/userlogs.
DistributedCache can also be used to distribute jar packages and native shared libraries used by maps and reduces. Child JVM processes can use java.library.path and LD_LIBRARY_PATH to specify their own search paths, and cached libraries can be loaded through System.loadLibrary or System.load. For more information on using the distributed cache to load shared libraries, see Loading native libraries through DistributedCache.
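A sketch under the assumption that a native library has already been uploaded to HDFS; the path hdfs:///libs/libmylib.so and library name mylib are hypothetical:

```java
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class NativeLibSetup {
  public static void configure(JobConf conf) throws Exception {
    // Ship the .so to each task and symlink it into the task's working directory.
    DistributedCache.addCacheFile(new URI("hdfs:///libs/libmylib.so#libmylib.so"), conf);
    DistributedCache.createSymlink(conf);
  }
}

// Inside a task, the working directory is on java.library.path, so:
//   System.loadLibrary("mylib");
```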
Original article: Hadoop tutorial (III): important MR operation parameters.