File concurrency in Hadoop MapReduce


Reading and writing files from within a MapReduce job can be done on either the map side or the reduce side. The following is a brief description of an example from a real business scenario.

Brief description of the problem:

The reduce input key is Text (a String) and the value is BytesWritable (a byte[]). There are about one million distinct keys, each value is roughly 30 KB, and each key corresponds to about 100 values. For every key, two files must be created: one that keeps appending the binary data from the values, and one that records the position index of each value within that data file. (A large number of small files hurts HDFS performance, so it is best to splice these small files together.)

When the number of files is small, you can consider using MultipleOutputs to stream key-value pairs to different files or directories according to the key, as in the sketch below. But the number of reducers would have to be 1, otherwise every reducer would generate the same directories and files, which defeats the purpose. More importantly, the operating system limits the number of files each process can open to 1024 by default; each DataNode in the cluster may be configured with a higher value, but a ceiling of a few tens of thousands is still a limiting factor and cannot meet the need for millions of files.
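For reference, here is a minimal sketch of the MultipleOutputs approach that is being ruled out, assuming the new (org.apache.hadoop.mapreduce) API; the per-key base output path is illustrative only:

```java
// Minimal sketch: routing reduce output per key with MultipleOutputs.
// The per-key base output path below is an illustrative assumption.
import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class KeyRoutingReducer
        extends Reducer<Text, BytesWritable, Text, BytesWritable> {

    private MultipleOutputs<Text, BytesWritable> mos;

    @Override
    protected void setup(Context context) {
        mos = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context)
            throws IOException, InterruptedException {
        for (BytesWritable value : values) {
            // Third argument is the base output path: one file per key,
            // which is exactly what runs into the open-file limit at scale.
            mos.write(key, value, key.toString() + "/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        mos.close();
    }
}
```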

The main purpose of reduce is to merge key-value pairs and write the result to HDFS, but of course we can do other things in reduce, such as file reading and writing. Because the default partitioner guarantees that all data for the same key goes to the same reduce, each reduce only needs two files open for reading and writing at a time (one index file and one data file). The concurrency is determined by the number of reducers; with the reduce count set to 256, we can process the data of 256 keys at the same time (the partitioner ensures that different reducers handle different keys, so there are no file read/write conflicts). The efficiency of this level of concurrency is very considerable, and the job can be completed in a relatively short time. A minimal sketch of such a reducer follows.
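This is only a sketch under stated assumptions: the output directory /output/keyfiles and the plain-text "offset,length" index format are illustrative, not the article's actual layout.

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class FileWritingReducer
        extends Reducer<Text, BytesWritable, Text, Text> {

    private FileSystem fs;

    @Override
    protected void setup(Context context) throws IOException {
        fs = FileSystem.get(context.getConfiguration());
    }

    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context)
            throws IOException, InterruptedException {
        // Two files per key: binary data plus a position index (assumed layout).
        Path dataPath = new Path("/output/keyfiles", key.toString() + ".data");
        Path indexPath = new Path("/output/keyfiles", key.toString() + ".idx");

        try (FSDataOutputStream data = fs.create(dataPath);
             FSDataOutputStream index = fs.create(indexPath)) {
            for (BytesWritable value : values) {
                long offset = data.getPos();                // where this value starts
                data.write(value.getBytes(), 0, value.getLength());
                index.writeBytes(offset + "," + value.getLength() + "\n");
            }
        }
    }
}
```

Since all values for a key arrive in a single reduce() call, the two streams can be opened once per key and closed before the next key, so each reducer holds at most two files open at a time.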

That is the idea, but due to the characteristics of HDFS and Hadoop task scheduling, there are still many problems in the file read and write process. The following briefly describes some common problems encountered.

1. The org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException exception

This is probably the most frequently encountered problem. The possible causes are as follows:

(1) File stream conflict

When a file is created, an output stream is opened for writing, but what we want to do afterwards is append. Using the wrong API can therefore cause this problem: with the FileSystem class, the above exception is thrown if append() is invoked after create() was used (create() already holds an open stream on the file). It is therefore best to use createNewFile(), which only creates the file and does not open a stream. A sketch of this pattern follows.
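A minimal sketch of this pattern, assuming the target path is illustrative and the cluster supports HDFS append (dfs.support.append):

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void appendRecord(Configuration conf, Path file, byte[] record)
            throws IOException {
        FileSystem fs = FileSystem.get(conf);

        // createNewFile() only creates the file; it does not leave an output
        // stream open on it the way create() does.
        if (!fs.exists(file)) {
            fs.createNewFile(file);
        }

        // Appending is now safe: no earlier stream on this file is still open here.
        try (FSDataOutputStream out = fs.append(file)) {
            out.write(record);
        }
    }
}
```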

(2) MapReduce speculative execution

To improve efficiency, MapReduce may start a second, identical attempt of a task after the first one is launched; whichever attempt finishes first is taken as the result of the whole task, and the slower attempt is killed. Clusters typically turn this option on to optimize performance (trading space for time). However, speculative execution is not appropriate for this problem: we generally want one task to work on one set of files, but with speculative execution enabled, several attempts try to manipulate the same file at the same time, which throws the exception. So it is best to turn this option off by setting mapred.reduce.max.attempts to 1, or by setting mapred.reduce.tasks.speculative.execution to false, as sketched below.
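A minimal sketch of setting these options at job setup time, using the old-style mapred.* property names mentioned above; the job name and reduce count are taken from the scenario in this article:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSetup {
    public static Job createJob(Configuration conf) throws Exception {
        // Disable speculative execution for reducers so that only one attempt
        // at a time touches the files belonging to a key.
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
        // Limit the number of attempts of a reduce task.
        conf.setInt("mapred.reduce.max.attempts", 1);

        Job job = new Job(conf, "file-writing-job");
        job.setNumReduceTasks(256);   // the concurrency level discussed above
        return job;
    }
}
```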

However, problems may still occur even then. If something goes wrong with the only attempt of a task and it is killed, the task will still get a new attempt, and the previous attempt, having terminated unexpectedly, may still interfere with the new attempt's file operations and cause the exception to be thrown. The safest approach is therefore to borrow the mechanism of speculative execution (each attempt produces its own result and one of them is eventually chosen as the final result): append each attempt's ID number to the files it manipulates, and catch and handle the exceptions of all file operations, which avoids read/write conflicts on the files. The Context object provides runtime information, and it is easy to obtain the attempt's ID number from it, as shown below. Note that with this scheme it is even OK to turn speculative execution back on, although generating many copies of the same file (one per attempt) is still not the best solution.
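A minimal sketch of building per-attempt file names from the Context; the directory and naming convention are assumptions:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.TaskAttemptID;

public class AttemptAwareNaming {
    // Build the pair of per-key paths tagged with the attempt number, so two
    // attempts of the same task never write to the same file.
    public static Path[] pathsFor(Text key, TaskAttemptContext context) {
        TaskAttemptID attempt = context.getTaskAttemptID();
        int attemptNumber = attempt.getId();   // 0 for the first attempt, 1 for a retry, ...
        String base = key.toString() + "_" + attemptNumber;
        return new Path[] {
            new Path("/output/keyfiles", base + ".data"),
            new Path("/output/keyfiles", base + ".idx")
        };
    }
}
```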

At the same time, we can use the reduce output to record which keys ran "abnormally". In most such cases attempt_0 was killed and an attempt_1 was started, so the corresponding files generally exist in two copies. The keys in these situations (a file exception, or an attempt ID greater than 0) can be written to the reduce output, and a little post-processing can be done afterwards, such as renaming the files or reprocessing those keys. Because such keys are usually only a handful, this does not affect overall efficiency.

2. File Exception Handling

It is a good idea to wrap all file operations in MapReduce with exception handling; otherwise one file exception can cause the entire job to fail. For efficiency, it is best to write the key to the reduce output when a file exception occurs: MapReduce will restart a task attempt to redo the file reads and writes, so we still end up with the final data, and all that remains is some simple file renaming for those exception keys, as in the sketch below.
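A minimal sketch of this pattern inside reduce(); emitting the key with an "ERROR" marker as reduce output is an assumed convention for later post-processing:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class GuardedReducer extends Reducer<Text, BytesWritable, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<BytesWritable> values, Context context)
            throws IOException, InterruptedException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        Path dataPath = new Path("/output/keyfiles", key.toString() + ".data");
        try (FSDataOutputStream out = fs.create(dataPath)) {
            for (BytesWritable value : values) {
                out.write(value.getBytes(), 0, value.getLength());
            }
        } catch (IOException e) {
            // Do not let one bad file kill the whole job; record the key so it
            // can be renamed or reprocessed afterwards.
            context.write(key, new Text("ERROR: " + e.getMessage()));
        }
    }
}
```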

3. Multiple directories and file splicing

If the number of distinct keys rises to 10 million, the method above generates so many small files that HDFS performance suffers, and because all files sit in the same directory, the number of files in one directory becomes too large and access efficiency suffers as well.

A useful way to create multiple subdirectories is to build a subdirectory from the reduce task ID while creating the files. This lets you create as many subdirectories as needed without file conflicts, and the keys handled by the same reduce all end up in the same directory. A sketch follows.
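A minimal sketch of deriving a per-reducer subdirectory from the task ID; the base directory name is an assumption:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

public class ReducerDirectories {
    // One subdirectory per reduce task: all keys handled by the same reducer
    // land in the same directory, and different reducers never collide.
    public static Path reducerDir(TaskAttemptContext context) {
        int taskId = context.getTaskAttemptID().getTaskID().getId();
        return new Path("/output/keyfiles", "reducer_" + taskId);
    }
}
```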

File splicing also has to consider the index. To keep the index of each file as simple as possible, try to ensure that all data for the same key ends up in the same large file. This can be done with the key's hashCode: if we want to create 1000 files in each directory, just take the hashCode modulo 1000, as in the sketch below.
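A minimal sketch of selecting the spliced file by the key's hashCode; the bucket count of 1000 follows the text, and the file name format is an assumption:

```java
import org.apache.hadoop.fs.Path;

public class FileBuckets {
    private static final int BUCKETS = 1000;   // files per directory, as in the text

    // All values of the same key always map to the same large file, so the
    // per-file index stays simple.
    public static Path bucketFile(Path reducerDir, String key) {
        int bucket = (key.hashCode() & Integer.MAX_VALUE) % BUCKETS;  // non-negative
        return new Path(reducerDir, "part-" + bucket + ".data");
    }
}
```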
