Analyzing Hadoop's default counters for computing resources


Foreword

As the project requires, we need to account for the computing resources used by each business group, such as CPU, memory, IO reads and writes, and network traffic. I read the Hadoop source code to see what its default counters already provide.

MapReduce counters expose detailed data about a MapReduce job's execution. A counter belongs to a "group", which collects all counter values within the same logical scope.
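For illustration, here is a minimal sketch of listing every counter of a finished job, grouped by counter group. It assumes the Hadoop 2 org.apache.hadoop.mapreduce API and a Job object named job that has already completed; the group display names it prints include the ones discussed below, such as "Map-Reduce Framework" and "FileSystemCounters".

    import org.apache.hadoop.mapreduce.Counter;
    import org.apache.hadoop.mapreduce.CounterGroup;
    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;

    public class CounterDump {
        // Print every counter of a finished job, grouped by counter group.
        public static void dump(Job job) throws Exception {
            Counters counters = job.getCounters();
            for (CounterGroup group : counters) {          // one group = one logical scope
                System.out.println(group.getDisplayName());
                for (Counter counter : group) {            // the counters inside that group
                    System.out.println("  " + counter.getDisplayName()
                            + " = " + counter.getValue());
                }
            }
        }
    }

Looking up a single value works the same way via counters.findCounter, for example with the TaskCounter enum in the Hadoop 2 API.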

CPU

How should we measure the computational load of a MapReduce job's tasks? Task running time is inaccurate: a job may spend most of its time stuck in the last reduce, or its resources may be preempted while it runs, both of which inflate the running time. Counting map and reduce tasks is also inaccurate, since some map and reduce tasks handle very little data and finish very quickly.

Using the CPU time of Hadoop tasks to measure the amount of computation works better. The Hadoop counter "Map-Reduce Framework: CPU time spent (ms)" is the CPU time consumed by a task. Hadoop gathers this statistic at runtime: each task reads its process's user CPU time and kernel CPU time from /proc/<pid>/stat, and their sum is the CPU time.

Appendix: the call chain through which a task obtains CPU time: org.apache.hadoop.mapred.Task.updateResourceCounters -> org.apache.hadoop.util.LinuxResourceCalculatorPlugin.getProcResourceValues (gets CPU and memory resources) -> org.apache.hadoop.util.ProcfsBasedProcessTree.getProcessTree.
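A simplified sketch of that idea is below. It is not Hadoop's actual ProcfsBasedProcessTree code: it assumes a Linux system and a kernel clock rate of 100 jiffies per second, and it ignores child processes.

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ProcCpuTime {
        // Rough equivalent of what Hadoop reads for one process: user CPU time
        // (utime, field 14) plus kernel CPU time (stime, field 15) from
        // /proc/<pid>/stat, converted from jiffies to milliseconds.
        public static long cpuTimeMillis(int pid) throws Exception {
            String stat = new String(Files.readAllBytes(Paths.get("/proc/" + pid + "/stat")));
            // The process name is wrapped in "(...)" and may contain spaces,
            // so only parse the part after the closing parenthesis.
            String[] fields = stat.substring(stat.lastIndexOf(')') + 2).split("\\s+");
            long utime = Long.parseLong(fields[11]);   // field 14 of the full stat line
            long stime = Long.parseLong(fields[12]);   // field 15 of the full stat line
            long jiffiesPerSecond = 100;               // assumption: kernel HZ = 100
            return (utime + stime) * 1000 / jiffiesPerSecond;
        }
    }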

RAM

Hadoop's default counters expose memory information through the following parameters:

"Each task reads a memory snapshot of the corresponding process from / proc / <pid> / stat, which is the current physical memory usage of the process.

"Virtual memory (bytes) snapshot": each task reads the virtual memory snapshot of its process from /proc/<pid>/stat, which is the process's current virtual memory usage.

"Total committed heap usage (bytes)": each task's JVM calls Runtime.getRuntime().totalMemory() to get the JVM's current heap size.

Appendix: memory is obtained in org.apache.hadoop.mapred.Task.updateResourceCounters.
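As an example, here is a minimal sketch that reads the three memory counters of a finished job, assuming the Hadoop 2 mapreduce API, where the TaskCounter enum backs the "Map-Reduce Framework" group:

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class MemoryCounters {
        // Read the three memory-related counters of a finished job.
        public static void printMemory(Job job) throws Exception {
            Counters counters = job.getCounters();
            long physical = counters.findCounter(TaskCounter.PHYSICAL_MEMORY_BYTES).getValue();
            long virtual  = counters.findCounter(TaskCounter.VIRTUAL_MEMORY_BYTES).getValue();
            long heap     = counters.findCounter(TaskCounter.COMMITTED_HEAP_BYTES).getValue();
            System.out.println("Physical memory (bytes) snapshot:   " + physical);
            System.out.println("Virtual memory (bytes) snapshot:    " + virtual);
            System.out.println("Total committed heap usage (bytes): " + heap);
        }
    }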

IO read and write

Hadoop reads and writes files by opening them through org.apache.hadoop.fs.FileSystem.open. If the file is on HDFS, its URL starts with hdfs://; if it is a local file, the URL starts with file://. Each task's file read and write statistics can therefore be obtained from FileSystem.getAllStatistics(), and Hadoop's FileSystemCounters record the IO read and write sizes for every FileSystem. The FileSystemCounters break down as follows (see the sketch after this list):

"FileSystemCounters: HDFS_BYTES_READ": the cumulative amount of data read from HDFS during job execution. Only the map side reads from HDFS while it runs, and the data is not limited to the contents of the source files; it also includes the split metadata of all maps. So this value should be slightly larger than FileInputFormatCounters.BYTES_READ.

"FileSystemCounters: HDFS_BYTES_WRITTEN": the cumulative amount of data written to HDFS during job execution. The result is written to HDFS only after the reduce finishes (for a map-only job with no reduce, the result is written to HDFS when the map finishes).

"FileSystemCounters: FILE_BYTES_READ" cumulative read local disk file data size, map and reduce end sorting, the need to read and write local files.

"FileSystemCounters: FILE_BYTES_WRITTEN" The cumulative size of the file data written to the local disk, map and reduce the end of the sort, the need to read and write local file sorting, reduce do shuffle, you need to pull data from the map side, there is also written to the local Disk file situation.

Appendix: FileSystemCounters related code: org.apache.hadoop.mapred.Task.updateResourceCounters -> org.apache.hadoop.mapred.Task.FileSystemStatisticUpdater.updateCounters.
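The per-FileSystem statistics behind these counters can also be inspected directly. Below is a hypothetical sketch meant to run inside a task's JVM; it uses FileSystem.getAllStatistics() to list how many bytes each file system scheme (for example hdfs and file) has read and written so far.

    import org.apache.hadoop.fs.FileSystem;

    public class FsStats {
        // Inside a running task, list bytes read and written per FileSystem scheme;
        // this is the raw data that feeds the FileSystemCounters.
        public static void printAll() {
            for (FileSystem.Statistics stats : FileSystem.getAllStatistics()) {
                System.out.println(stats.getScheme()
                        + " bytes read: " + stats.getBytesRead()
                        + ", bytes written: " + stats.getBytesWritten());
            }
        }
    }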

The FileSystemCounters cover IO read and write data completely, but Hadoop also has some finer-grained IO counters:

"File Input Format Counters: Bytes Read": the size of the split source files that the map side reads from HDFS during job execution. It does not include the maps' split metadata, so this value is slightly smaller than "FileSystemCounters: HDFS_BYTES_READ", but very close. If the map's input source file is compressed, its value is only the compressed size before decompression (appendix: the code is in org.apache.hadoop.mapred.MapTask.TrackedRecordReader.fileInputByteCounter).

"Map-Reduce Framework: Map input bytes": the size of the split source files that the map side reads from HDFS during job execution. If the source file is compressed, its value is the size after decompression (appendix: the code is in org.apache.hadoop.mapred.MapTask.TrackedRecordReader.inputByteCounter).

"File Output Format Counters: Bytes Written" job will be divided into map and reduce, but there may be only map, but after the job is completed, the results should generally be written to hdfs, the value is the result The size of the file, if it is a compressed file, is only the size of the compressed file before decompression (attached: the code is in org.apache.hadoop.mapred.MapTask.DirectMapOutputCollector.fileOutputByteCounter and org.apache.hadoop.mapred.ReduceTask.NewTrackingRecordWriter. fileOutputByteCounter).

However, these finer-grained counters do not count the local files that map and reduce read and write while sorting, so to measure a job's task IO I think the most appropriate choice is still the FileSystemCounters.

IO read and write traffic can be roughly summed from the above four FileSystemCounters parameters (see the sketch after this list). The remaining deficiencies are:

"FileSystemCounters: HDFS_BYTES_WRITTEN", it is just a copy of the hdfs write size, hdfs block copy is adjustable, so io read and write traffic, you also need "FileSystemCounters: HDFS_BYTES_WRITTEN" * number of copies.

Map and reduce are user-defined, and user code may bypass the Hadoop framework and read or write files without using org.apache.hadoop.fs.FileSystem.open; that part of the IO traffic cannot be counted.
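Putting the four counters together, here is a hedged sketch of the estimate described above. The group and counter names are the legacy display names used in this article (newer Hadoop releases map them onto the FileSystemCounter group), and the replication factor would have to be taken from the job's dfs.replication setting; the two deficiencies above still apply.

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;

    public class IoEstimate {
        // Rough total IO of a finished job, looked up by group and counter name.
        public static long totalIoBytes(Job job, int replication) throws Exception {
            Counters c = job.getCounters();
            String group = "FileSystemCounters";
            long hdfsRead     = c.findCounter(group, "HDFS_BYTES_READ").getValue();
            long hdfsWritten  = c.findCounter(group, "HDFS_BYTES_WRITTEN").getValue();
            long localRead    = c.findCounter(group, "FILE_BYTES_READ").getValue();
            long localWritten = c.findCounter(group, "FILE_BYTES_WRITTEN").getValue();
            // HDFS_BYTES_WRITTEN covers a single replica, so scale by the replication factor.
            return hdfsRead + hdfsWritten * replication + localRead + localWritten;
        }
    }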

Network traffic

The phases in which a Hadoop task generates network traffic are: the map input pulls data from HDFS, the reduce shuffle pulls data from the map side, and the reduce writes its results to HDFS when it completes (if there is no reduce, the results are written to HDFS when the map completes).

The traffic generated by the job's interaction with HDFS can be obtained from the two IO read and write counters: "FileSystemCounters: HDFS_BYTES_READ" and "FileSystemCounters: HDFS_BYTES_WRITTEN".

The traffic from the data that the reduce shuffle pulls from the map side corresponds to the following counter:

"Map-Reduce Framework: Reduce shuffle bytes" It is reduce the total data size to pull the intermediate results of the map, if the intermediate result of the map is compressed file, its value is the size of the compressed file before decompression (with the code in org .apache.hadoop.mapred.ReduceTask.reduceShuffleBytes).

Network traffic can be roughly summed from the above three parameters (see the sketch after this list). The remaining deficiencies are:

"FileSystemCounters: HDFS_BYTES_READ" and "FileSystemCounters: HDFS_BYTES_WRITTEN", it did not consider Hadoop localization optimization of hdfs, hdfs read and write blocks, if you find the client and the target block in the same node, directly through the local read and write, some blocks If local, hadoop will read and write directly through the local file system, not through the network.

"FileSystemCounters: HDFS_BYTES_WRITTEN", it is just a copy of the hdfs write size, hdfs block copy is adjustable, so the network traffic, you also need "FileSystemCounters: HDFS_BYTES_WRITTEN" * number of copies.

Map and reduce are user-defined, and user code may bypass the Hadoop framework and perform its own network communication; that part of the traffic cannot be counted.
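Along the same lines, here is a hedged sketch of the network traffic estimate, with the same naming assumptions as the IO sketch above and the shuffle bytes read through the TaskCounter enum; the locality and user-code caveats above still apply.

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.TaskCounter;

    public class NetworkEstimate {
        // Rough network traffic of a finished job: HDFS reads + replicated HDFS writes
        // + intermediate data pulled by the reduce shuffle.
        public static long networkBytes(Job job, int replication) throws Exception {
            Counters c = job.getCounters();
            long hdfsRead    = c.findCounter("FileSystemCounters", "HDFS_BYTES_READ").getValue();
            long hdfsWritten = c.findCounter("FileSystemCounters", "HDFS_BYTES_WRITTEN").getValue();
            long shuffle     = c.findCounter(TaskCounter.REDUCE_SHUFFLE_BYTES).getValue();
            return hdfsRead + hdfsWritten * replication + shuffle;
        }
    }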
