Computing resource analysis based on Hadoop's default counters



Preface

Our project needs to report the computing resources used by each business group, such as CPU, memory, I/O reads/writes, and network traffic. To do that, I read the source code to see what Hadoop's default counters provide.

MapReduce counters expose detailed data about a MapReduce job while it runs. Counters are organized into groups; a group contains all counters that belong to the same logical scope.

CPU

How can we measure the computing workload of a MapReduce task? Wall-clock running time is inaccurate: a task may spend most of its time stuck in the last reduce, or suffer resource contention while it runs, which inflates its running time. The number of map and reduce tasks is also inaccurate, because some tasks process only a small amount of data and finish quickly.

The CPU time consumed by a Hadoop task is a better measure of its computing workload. Hadoop provides exactly that in the counter "Map-Reduce Framework: CPU time spent (ms)". How is this CPU time collected? While a task runs, it reads the user CPU time and kernel CPU time of its process from /proc/<pid>/stat; their sum is the CPU time.
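
As a rough illustration (a sketch, not Hadoop's actual code), the following reads utime and stime from /proc/<pid>/stat and sums them, which is conceptually what the resource calculator plugin does. The field positions and the 100 Hz clock-tick rate are Linux assumptions, and the class name is mine:

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class ProcCpuTime {
        // Assumed clock-tick rate; the real value comes from sysconf(_SC_CLK_TCK).
        static final long JIFFIES_PER_SECOND = 100;

        // Sum of user-mode and kernel-mode CPU time for one process, in ms.
        public static long cpuTimeMillis(int pid) throws Exception {
            String stat = new String(Files.readAllBytes(Paths.get("/proc/" + pid + "/stat")));
            // The command name (field 2) is wrapped in parentheses and may
            // contain spaces, so split only after the closing parenthesis.
            String[] fields = stat.substring(stat.lastIndexOf(')') + 2).split("\\s+");
            long utime = Long.parseLong(fields[11]); // field 14: user-mode jiffies
            long stime = Long.parseLong(fields[12]); // field 15: kernel-mode jiffies
            return (utime + stime) * 1000 / JIFFIES_PER_SECOND;
        }

        public static void main(String[] args) throws Exception {
            System.out.println("CPU time (ms): " + cpuTimeMillis(Integer.parseInt(args[0])));
        }
    }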

Appendix: source code path by which a task obtains its CPU time: org.apache.hadoop.mapred.Task.updateResourceCounters --> org.apache.hadoop.util.LinuxResourceCalculatorPlugin.getProcResourceValues (obtains CPU and memory resources) --> org.apache.hadoop.util.ProcfsBasedProcessTree.getProcessTree.


Memory

Hadoop's default counters report memory information through the following parameters:

"Map-Reduce Framework: Physical memory (bytes) snapshot" each task reads the memory snapshot of the corresponding process from/proc/<pid>/stat, this is the current physical memory usage of the process.

"Map-Reduce Framework: Virtual memory (bytes) snapshot" each task reads the Virtual memory snapshot of the corresponding process from/proc/<pid>/stat, this is the current virtual memory usage of the process.

"Map-Reduce Framework: Total committed heap usage (bytes)" The jvm of each task calls Runtime. getRuntime (). totalMemory () to get the current jvm heap size.

Appendix: source code path by which a task obtains its memory usage: org.apache.hadoop.mapred.Task.updateResourceCounters


I/O read/write

Hadoop reads and writes files through org.apache.hadoop.fs.FileSystem. When a file is opened, an HDFS file has a URL starting with hdfs://, while a local file has a URL starting with file://. The file read/write statistics of each task can therefore be obtained from FileSystem.getAllStatistics(), and Hadoop uses the FileSystemCounters group to record all I/O read/write sizes per FileSystem. The FileSystemCounters break down as follows (a short sketch of reading the statistics directly follows the breakdown):

"FileSystemCounters: HDFS_BYTES_READ" during job execution, data is read from HDFS only when the map side is running. This data is not limited to the source file content, but also contains all the split metadata of the map. Therefore, this value should be slightly larger than FileInputFormatCounters. BYTES_READ.

"FileSystemCounters: HDFS_BYTES_WRITTEN" the total data size written to HDFS during job execution. After reduce is executed, it will be written to HDFS (only map exists and there is no reduce, in this case, the result is written to HDFS after the map is executed ).

"FileSystemCounters: FILE_BYTES_READ" indicates the total size of file data read from the local disk. The map and reduce nodes are sorted. Local files must be read and written during sorting.

"FileSystemCounters: FILE_BYTES_WRITTEN" indicates the total size of the file data written to the local disk. The map and reduce nodes are sorted. Local files need to be read and written during sorting, and when the reduce node is shuffle, data needs to be pulled from the map end, and data is also written to a local disk file.

Appendix: FileSystemCounters code path: org.apache.hadoop.mapred.Task.updateResourceCounters --> org.apache.hadoop.mapred.Task.FileSystemStatisticUpdater.updateCounters
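
As mentioned above, the statistics behind these counters come from FileSystem.getAllStatistics(). For illustration, a minimal sketch of reading the per-scheme statistics through that classic API from inside a task's JVM (the class name is mine; the list is empty unless some FileSystem has been used in the process):

    import java.util.List;
    import org.apache.hadoop.fs.FileSystem;

    public class FsStatsDump {
        public static void main(String[] args) {
            // One Statistics object per scheme, e.g. "hdfs" or "file".
            List<FileSystem.Statistics> all = FileSystem.getAllStatistics();
            for (FileSystem.Statistics stats : all) {
                System.out.println(stats.getScheme()
                    + " bytes read: " + stats.getBytesRead()
                    + ", bytes written: " + stats.getBytesWritten());
            }
        }
    }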


The FileSystemCounters group provides a complete set of I/O read/write data, but Hadoop also has some finer-grained I/O counters:

"File Input Format Counters: Bytes Read" during job execution, the size of the Input split source File Read by the Map end from HDFS does not include the split metadata of the map, so this value is slightly smaller than "FileSystemCounters: HDFS_BYTES_READ", but it is very close. If the source file input by map is a compressed file, its value is only the size before the compressed file is decompressed (Attachment: the code is located at org. apache. hadoop. mapred. MapTask. TrackedRecordReader. fileInputByteCounter.).

"Map-Reduce Framework: Map input bytes" during job execution, the size of the split source file that the Map Client reads from HDFS. If the source file is a compressed file, the value is the size of the compressed file after decompression (Attachment: the code is located at org. apache. hadoop. mapred. MapTask. TrackedRecordReader. inputByteCounter.).

"File Output Format Counters: Bytes Written" The job execution process can be divided into map and reduce, but there may also be only map, but after the job is executed, generally, you need to write the result to hdfs. This value is the size of the result file. If it is a compressed file, its value is only the size before the compressed file is decompressed (Attachment: the code is located at org. apache. hadoop. mapred. MapTask. DirectMapOutputCollector. fileOutputByteCounter and org. apache. hadoop. mapred. cetcetask. NewTrackingRecordWriter. fileOutputByteCounter.).

However, these finer-grained counters do not count the file reads and writes performed during map- and reduce-side sorting. Therefore, to measure the I/O reads and writes of a job's tasks, I think it is best to use the FileSystemCounters group.
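
For illustration, a minimal sketch of pulling the four FileSystemCounters from a finished job through the classic mapred API; the group name "FileSystemCounters" is the one used by 1.x-era Hadoop, and the class name and the command-line job ID are placeholders:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobID;
    import org.apache.hadoop.mapred.RunningJob;

    public class JobIoCounters {
        public static void main(String[] args) throws Exception {
            JobClient client = new JobClient(new JobConf());
            // args[0] is a job ID string such as "job_201301010000_0001".
            RunningJob job = client.getJob(JobID.forName(args[0]));
            Counters counters = job.getCounters();
            for (String name : new String[] {
                    "HDFS_BYTES_READ", "HDFS_BYTES_WRITTEN",
                    "FILE_BYTES_READ", "FILE_BYTES_WRITTEN" }) {
                Counters.Counter c = counters.findCounter("FileSystemCounters", name);
                System.out.println(name + " = " + (c == null ? 0 : c.getCounter()));
            }
        }
    }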


Total I/O read/write traffic can be obtained by summing the four FileSystemCounters above (see the sketch after the caveats below), with the following deficiencies:

"FileSystemCounters: HDFS_BYTES_WRITTEN", it is only the hdfs write size of a copy, and the hdfs block copy can be adjusted, so the io read/write traffic also needs "FileSystemCounters: HDFS_BYTES_WRITTEN "* Number of copies.

Both map and reduce run user-defined code. If your code bypasses the Hadoop framework and does not open files through org.apache.hadoop.fs.FileSystem, that part of the I/O read/write traffic cannot be counted.
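
A minimal sketch of that summation (method and variable names are mine; the counter values and replication factor are assumed to be supplied by the caller):

    public class TotalIo {
        // Total I/O bytes = local file reads + local file writes
        //                 + HDFS reads + HDFS writes * replication factor.
        static long totalIoBytes(long hdfsBytesRead, long hdfsBytesWritten,
                                 long fileBytesRead, long fileBytesWritten,
                                 int replication) {
            return fileBytesRead + fileBytesWritten
                 + hdfsBytesRead + hdfsBytesWritten * (long) replication;
        }
    }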


Network Traffic

The phases in which Hadoop tasks generate network traffic are: the map side pulling its input from HDFS, the reduce-side shuffle pulling data from the map side, and the reduce side writing its results to HDFS when it finishes (if there is no reduce, the map side writes the results to HDFS when it finishes).

The traffic generated by the job's interaction with HDFS can be obtained from the two counters analyzed in the I/O read/write section: "FileSystemCounters: HDFS_BYTES_READ" and "FileSystemCounters: HDFS_BYTES_WRITTEN".

The counter for the traffic generated by the reduce-side shuffle pulling data from the map side is:

"Map-Reduce Framework: Reduce shuffle bytes" indicates the cumulative data size of the intermediate result pulled from reduce to map. If the intermediate result generated by map is a compressed file, the value is the size before the compressed file is decompressed (Attachment: the code is located at org. apache. hadoop. mapred. ReduceTask. reduceShuffleBytes.).


Network traffic can be obtained by summing the preceding three counters (see the sketch after the caveats below), with the following deficiencies:

"FileSystemCounters: HDFS_BYTES_READ" and "FileSystemCounters: HDFS_BYTES_WRITTEN" do not consider hadoop's local optimization of hdfs. When hdfs reads and writes data blocks, if the client and the target block are on the same node, it reads and writes data locally. If some data blocks are stored locally, hadoop reads and writes data directly through the local file system instead of through the network.

"FileSystemCounters: HDFS_BYTES_WRITTEN", it is only the hdfs write size of a copy, and the hdfs block copy can be adjusted, so the network traffic also needs "FileSystemCounters: HDFS_BYTES_WRITTEN "* Number of copies.

Both map and reduce run user-defined code, and user code may bypass the Hadoop framework and generate network communication on its own. That part of the traffic cannot be counted.
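
A minimal sketch of that estimate, under the same assumptions as the I/O sketch above (inputs supplied by the caller, names mine); the two caveats about data locality and user code still apply:

    public class NetworkTraffic {
        // Estimated network bytes = HDFS reads + HDFS writes * replication
        //                         + reduce shuffle bytes (locality ignored).
        static long estimatedNetworkBytes(long hdfsBytesRead, long hdfsBytesWritten,
                                          long reduceShuffleBytes, int replication) {
            return hdfsBytesRead
                 + hdfsBytesWritten * (long) replication
                 + reduceShuffleBytes;
        }
    }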

