Concurrent file operations in Hadoop MapReduce

These file operations can be performed on either the map or the reduce side. The following is an example business scenario.

Brief description:

Assume that the keys received by reduce are of type Text (String) and the values are BytesWritable (byte[]). There are 1 million distinct keys, each value averages about 30 KB, and each key corresponds to roughly 100 values. Two files must be created for each key: one to append the binary data from the values, and one to record the offset of each value within that data file. (A large number of small files hurts HDFS performance, so it is best to merge these small files.)

When the number of files is small, you can use MultipleOutputs to distribute key-value pairs, writing them to different files or directories based on the key. However, the number of reducers must then be 1; otherwise every reducer would generate the same directories or files, which defeats the purpose. More importantly, the operating system limits the number of files a process may open, with a default of 1024. Each datanode in the cluster may be configured with a higher limit, but the maximum is still only in the tens of thousands, so it remains a limiting factor and cannot satisfy a requirement of millions of files.
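As a rough illustration only (the class name and output layout below are assumptions, not from the original article), a reducer that routes each key to its own output file via MultipleOutputs might look like this:

    import java.io.IOException;

    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

    // Sketch of the MultipleOutputs approach: only practical with a single
    // reducer and a modest number of distinct keys, for the reasons above.
    public class KeyRoutingReducer
            extends Reducer<Text, BytesWritable, Text, BytesWritable> {

        private MultipleOutputs<Text, BytesWritable> outputs;

        @Override
        protected void setup(Context context) {
            outputs = new MultipleOutputs<Text, BytesWritable>(context);
        }

        @Override
        protected void reduce(Text key, Iterable<BytesWritable> values, Context context)
                throws IOException, InterruptedException {
            for (BytesWritable value : values) {
                // Use the key as the base output path so each key gets its own file.
                outputs.write(key, value, key.toString() + "/part");
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            outputs.close();
        }
    }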

The main purpose of reduce is to merge key-value pairs and write them to HDFS, but we can also perform other operations in reduce, such as file reads and writes. Because the default partitioner guarantees that all data for a given key reaches the same reducer, each reducer only needs two open files at a time for reading and writing (one index file and one data file). The concurrency is determined by the number of reducers: if it is set to 256, data for 256 keys can be processed at the same time (the partitioner ensures that the same key is never handled by two different reducers, so there are no file read/write conflicts). The resulting concurrency is considerable, and the requirement can be met within a short period of time.
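A minimal sketch of this per-key pattern, assuming an output directory of /output/data and a simple tab-separated index format (both are assumptions made for illustration):

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch: a reducer processes one key at a time, so at any moment only
    // two streams (data + index) are open per reducer.
    public class FileWritingReducer
            extends Reducer<Text, BytesWritable, Text, Text> {

        private static final String BASE_DIR = "/output/data"; // assumed location

        @Override
        protected void reduce(Text key, Iterable<BytesWritable> values, Context context)
                throws IOException, InterruptedException {
            FileSystem fs = FileSystem.get(context.getConfiguration());
            Path dataFile = new Path(BASE_DIR, key.toString() + ".dat");
            Path indexFile = new Path(BASE_DIR, key.toString() + ".idx");

            FSDataOutputStream data = fs.create(dataFile);
            FSDataOutputStream index = fs.create(indexFile);
            try {
                long offset = 0;
                for (BytesWritable value : values) {
                    int length = value.getLength();
                    data.write(value.getBytes(), 0, length);
                    // Record where this value starts and how long it is.
                    index.writeBytes(offset + "\t" + length + "\n");
                    offset += length;
                }
            } finally {
                data.close();
                index.close();
            }
        }
    }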

However, because of the characteristics of HDFS and Hadoop task scheduling, many problems can still occur during these file reads and writes. The following describes some common ones.

1. org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException

This is probably the most common problem. Possible reasons are as follows:

(1) File stream conflict

Generally, creating a file also opens a stream for writing to it, whereas what we want is to append, so the append API should be used. If the wrong API is used, the above problem can occur. Taking the FileSystem class as an example, calling append() after create() on the same path throws the above exception. Therefore, it is best to use the createNewFile() method, which only creates the file without opening a stream.
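A minimal sketch of this pattern (the helper name and usage are hypothetical, and the cluster must have HDFS append support enabled):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendHelper {

        // Create the file without keeping a write stream open, then append to it.
        public static void appendBytes(Configuration conf, Path file, byte[] payload)
                throws IOException {
            FileSystem fs = FileSystem.get(conf);
            if (!fs.exists(file)) {
                // Unlike create(), createNewFile() does not return an open stream,
                // so the subsequent append() does not hit AlreadyBeingCreatedException.
                fs.createNewFile(file);
            }
            FSDataOutputStream out = fs.append(file);
            try {
                out.write(payload);
            } finally {
                out.close();
            }
        }
    }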

(2) The speculative execution mechanism of MapReduce

To improve efficiency, MapReduce may launch several identical attempts of the same task at once. As soon as one attempt finishes successfully, the whole task is considered complete, its output is taken as the final result, and the slower attempts are killed. Clusters generally enable this option to optimize performance (trading space for time). However, speculation is not appropriate here: we usually want a single task to process a file, but with speculative execution several attempts operate on the same file at the same time and exceptions occur. Therefore, we recommend disabling this option: set mapred.reduce.max.attempts to 1 and set mapred.reduce.tasks.speculative.execution to false.
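A sketch of these settings in a job driver, using the old-style property names mentioned above (on newer Hadoop versions the equivalent keys are mapreduce.reduce.maxattempts and mapreduce.reduce.speculative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class Driver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Only one attempt per reduce task, and no speculative duplicates,
            // so a single task attempt owns each output file.
            conf.setInt("mapred.reduce.max.attempts", 1);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);

            Job job = new Job(conf, "concurrent-file-writer"); // job name is illustrative
            // ... set mapper, reducer, input and output paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }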

However, problems may still occur. If the single attempt of a task fails, the framework still launches another attempt after the first one is killed, and the attempt that was terminated by an exception may interfere with the file operations of the new attempt and cause further exceptions. The safest approach is therefore to borrow the idea behind speculative execution (each attempt produces its own result, and one is finally chosen): append each attempt's ID as a suffix to the files it operates on, and catch and handle every file-operation exception. This avoids file read/write conflicts. The Context object exposes runtime information, from which the attempt ID is easily obtained. Note: if speculative execution is enabled, this generates many identical files (one copy per attempt), so it is still not the best solution.
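Continuing the reducer sketch above (BASE_DIR and the file naming scheme remain assumptions), the per-attempt suffix could be derived like this:

    // Inside the reducer: name output files after the current task attempt so that a
    // killed attempt's half-written file never collides with the new attempt's file.
    String attemptId = context.getTaskAttemptID().toString();
    Path dataFile = new Path(BASE_DIR, key.toString() + ".dat." + attemptId);
    Path indexFile = new Path(BASE_DIR, key.toString() + ".idx." + attemptId);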

At the same time, we can use the reduce output to record the keys that ran "abnormally". In most of these cases attempt_0 is killed and an attempt_1 is started, so the corresponding files usually exist in two copies. We can output the keys for these situations (a file exception occurred, or the attempt ID is greater than 0) and post-process them later, for example by renaming the files or rewriting those keys. Since only a small number of keys are involved, the overall efficiency is not affected.

2. File Exception Handling

It is best to wrap every file operation in MapReduce with exception handling; otherwise an exception on a single file can cause the entire job to fail. For efficiency, when a file exception occurs it is best to emit the affected key as reduce output so that it is recorded. MapReduce will also start another task attempt to re-read and re-write the file, which ensures we still obtain the final data. All that remains afterwards is to perform some simple file-rename operations on the keys recorded as abnormal.
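A sketch of this pattern inside the reducer from the earlier example (writeDataAndIndex is a hypothetical helper holding the per-key file writing, and the error marker text is an assumption):

    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context)
            throws IOException, InterruptedException {
        try {
            writeDataAndIndex(key, values, context); // per-key data/index writing shown earlier
        } catch (IOException e) {
            // Record the problematic key as reduce output instead of failing the job;
            // these few keys are cleaned up (e.g. renamed) in a later pass.
            context.write(key, new Text("FILE_EXCEPTION: " + e.getMessage()));
        }
    }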

3. Multiple Directories and File Merging

If the number of distinct keys grows to 10 million, the above method generates too many small files and hurts HDFS performance. In addition, because all files live in the same directory, that directory ends up holding a huge number of files, which reduces access efficiency.

Create multiple subdirectories when creating the files. A useful approach is to use the reducer's task ID as the subdirectory name. This creates as many subdirectories as there are reducers, with no file conflicts, and all keys processed by the same reducer end up in the same directory.
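For example (continuing the earlier reducer sketch; the directory naming is an assumption):

    // Each reducer writes under its own subdirectory, so at most "number of
    // reducers" directories are created and no two reducers share one.
    int taskId = context.getTaskAttemptID().getTaskID().getId();
    Path reducerDir = new Path(BASE_DIR, String.format("reducer_%05d", taskId));
    Path dataFile = new Path(reducerDir, key.toString() + ".dat");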

File merging requires an index. To keep that index as simple as possible, we should try to ensure that all data for the same key ends up in the same large file. This can be achieved with the key's hashCode: if we want 1000 files in each directory, we simply take the hashCode modulo 1000 to choose the file.
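A short sketch of that choice (file naming is illustrative):

    // All values of a given key land in the same large file, because the file
    // is chosen purely from the key's hashCode.
    int filesPerDir = 1000;
    int fileIndex = (key.hashCode() & Integer.MAX_VALUE) % filesPerDir; // non-negative
    Path mergedFile = new Path(reducerDir, "data_" + fileIndex + ".dat");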
