Python Development MapReduce Series (II): Python Implementation of MapReduce Bucketing


First, two points to set up what follows.

(1) In the map stage, output is sorted after it is partitioned (hashed) and before it is written to disk. The sort uses two keywords: the partition number and the key.

(2) Map output is not written to disk immediately. It first goes into a circular memory buffer; once the buffer passes the default spill threshold of 80% (the buffer size is set by the property io.sort.mb, the spill threshold by io.sort.spill.percent), the buffered data overflows into a small file, and this repeats until all the data has been processed. Finally, all the small files are merged into one large file that is written to disk, together with an index file recording the offset of each reducer's slice of the data (in effect, the mapping between the map output and the reducers). The purpose is to reduce disk seek time: each map ends up producing only one output file.

1. The default behavior

In Hadoop Streaming, "\t" is the default delimiter. For each line read from standard input, the first "\t" is the dividing point: the part before it is the key, and the part after it is the value. If a line contains no "\t" character, the entire line is treated as the key.

2. The sort and partition phases of the MapReduce shuffle process

In the mapper stage, apart from the user code, the most important part is the shuffle process. The shuffle is where MapReduce spends most of its time and resources, because it involves operations such as disk writes. This article is not about optimizing the shuffle; it covers only its sort and partition phases. Why study just these two? Because sorting and partitioning are the core ideas of MapReduce: the whole framework endlessly repeats sort and partition operations. From point 1 we know that MapReduce splits out the key on "\t" by default. Can we instead obtain keys of a specific shape to suit our own needs, for example free bucket-style partitioning, or sorting by a specified column? The answer is yes: we can achieve it with the parameters introduced below.
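Before turning to those parameters, a minimal sketch of a Streaming mapper/reducer pair in Python may make the default behavior from point 1 concrete. This example is mine, not the article's (the article's run.sh later references its own map.py and red.py); it is an ordinary word count that relies on the default "\t" handling.

# In practice map.py and red.py are two separate files; they are shown
# together here for brevity.
import sys

# map.py -- emit "word \t 1"; Hadoop Streaming treats the text before the
# first "\t" as the key and the rest as the value
def mapper():
    for line in sys.stdin:
        for word in line.strip().split():
            print("%s\t%s" % (word, 1))

# red.py -- stdin arrives sorted by key, so equal keys are adjacent
def reducer():
    current, total = None, 0
    for line in sys.stdin:
        # a line with no "\t" would make the whole line the key, value empty
        key, _, value = line.strip().partition("\t")
        if key != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = key, 0
        total += int(value or 0)
    if current is not None:
        print("%s\t%d" % (current, total))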
3. The relevant parameters

3.1 Map stage

-jobconf mapred.reduce.tasks=2 (this property applies to the example below)

map.output.key.field.separator: specifies the delimiter inside the key of the map output <key, value>.

num.key.fields.for.partition: when bucketing, the number of key columns (counting from the left, after the key is cut on the separator above) used to decide the bucket.

-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner: the two parameters above must be used together with this partitioner, otherwise the job fails with an error.

Example:

map.output.key.field.separator=,
num.key.fields.for.partition=2
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

For a row of data 1,2,3,4,5 the comma between 1 and 2 is the delimiter inside the key, and all rows whose keys begin with 1,2 are placed into the same bucket.

stream.map.output.field.separator: the delimiter between the key and the value in the map output.

stream.num.map.output.key.fields: the number of columns the key occupies after the map output is cut on that delimiter; the columns before the cut are the key, the rest are the value.

Example:

map.output.key.field.separator=,
num.key.fields.for.partition=2
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
stream.map.output.field.separator=:
stream.num.map.output.key.fields=3

Input:

1,2,3,4,5
1,2,2,4,5
1,3,4,4,5
1,3,3,4,5

Output part-00000:

1,2,2:4,5
1,2,3:4,5

Output part-00001:

1,3,3:4,5
1,3,4:4,5

Here 1,2 is the bucket value, 1,2,3 is the key, and 4,5 is the value. The comma between 1 and 2 is the delimiter inside the key, so rows whose keys begin with 1,2 land in the same bucket.
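To make the interplay of these parameters concrete, here is a small Python sketch of the bucketing logic described above, applied to records shaped like those in the test further below (key columns joined by ",", key and value separated by ":"). It is an illustration, not Hadoop's actual code: in particular, Python's hash() stands in for the Java hashCode that KeyFieldBasedPartitioner really uses, so the bucket numbers it prints may differ from a real run.

# illustrative model of key-field based bucketing (not Hadoop's implementation)
def pick_bucket(line, num_reducers=2):
    # stream.map.output.field.separator=":" -> key is the part before ":"
    key = line.split(":", 1)[0]
    # map.output.key.field.separator="," + num.key.fields.for.partition=2
    # -> the first two comma columns of the key decide the bucket
    bucket_key = ",".join(key.split(",")[:2])
    # stand-in hash; real Hadoop uses the Java String hashCode
    return hash(bucket_key) % num_reducers

for line in ["1,2,3:4,5", "1,2,2:4,5", "1,3,4:4,5", "1,3,3:4,5"]:
    print("%s -> bucket %d" % (line, pick_bucket(line)))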

3.2 Reduce phase

stream.reduce.output.field.separator: the delimiter between the key and the value in the reduce output.

stream.num.reduce.output.key.fields: the number of columns the key occupies in the reduce output.
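By symmetry with the map side, a short snippet (again mine, with hypothetical setting values) showing how a reduce output line would be split into key and value under these two parameters:

# hypothetical example values: separator ":" and one key column
def split_reduce_output(line, sep=":", num_key_fields=1):
    parts = line.split(sep)
    # the first num_key_fields columns form the key, the rest the value
    key = sep.join(parts[:num_key_fields])
    value = sep.join(parts[num_key_fields:])
    return key, value

print(split_reduce_output("1,2,1:hadoop"))  # ('1,2,1', 'hadoop')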

4. Bucketing test: the run.sh script (as a lazy programmer, I wrote a script to spare myself typing a long string of commands every run)
HADOOP_CMD="/home/hadoop/hadoop/bin/hadoop"
STREAM_JAR_PATH="/home/hadoop/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar"
INPUT_PATH_A="/a.txt"
INPUT_PATH_B="/b.txt"
OUTPUT_PATH="/output"

# when a MapReduce job runs, the output directory must not already exist
# (the directory name itself is up to you)
$HADOOP_CMD fs -rmr $OUTPUT_PATH

$HADOOP_CMD jar $STREAM_JAR_PATH \
    -input $INPUT_PATH_A,$INPUT_PATH_B \
    -output $OUTPUT_PATH \
    -mapper "python map.py" \
    -reducer "python red.py" \
    -file ./map.py \
    -file ./red.py \
    -jobconf mapred.reduce.tasks=2 \
    -jobconf map.output.key.field.separator=, \
    -jobconf num.key.fields.for.partition=2 \
    -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
    -jobconf stream.map.output.field.separator=: \
    -jobconf stream.num.map.output.key.fields=3
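The article never lists map.py and red.py. Since the part files contain the input records unchanged (only bucketed and sorted), a reasonable guess is that both scripts are simple pass-throughs that let the framework do all the key splitting, partitioning and sorting; a minimal sketch under that assumption:

# map.py -- assumed identity mapper (red.py would be the same loop)
import sys

for line in sys.stdin:
    line = line.strip()
    if line:
        # pass the record through untouched; the streaming framework splits
        # it on ":" into key/value, buckets on the first two comma columns
        # of the key, and sorts each bucket by key
        print(line)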

a.txt content

1,2,1:hadoop
1,2,5:hadoop
1,3,4:hadoop
1,2,9:hadoop
1,2,11:hadoop
1,2,7:hadoop
1,3,15:hadoop
1,3,14:hadoop
1,2,19:hadoop
b.txt content

0:java
1,2,2:java
1,2,8:java
1,3,4:java
1,2,2:java
1,2,14:java
1,2,12:java
1,3,1:java
1,3,5:java
1,2,3:java

5. Result output

The "part-00000" output reads as follows:
1,2,1:hadoop
1,2,2:java
1,2,2:java
0:java
1,2,3:java
1,2,5:hadoop
1,2,7:hadoop
1,2,8:java
1,2,9:hadoop
1,2,11:hadoop
1,2,12:java
1,2,14:java
1,2,19:hadoop
The "part-00001" output reads as follows:
1,3,1:java
1,3,4:hadoop
1,3,4:java
1,3,5:java
1,3,14:hadoop
1,3,15:hadoop
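As a sanity check on these listings, here is a small local simulation (my own sketch, not from the article, and not real Hadoop: bucket assignment is grouped directly on the bucket mark rather than hashed, so the stray 0:java record ends up in its own group here instead of in part-00000 as in the real run):

# simulate the job locally: group on the first two comma columns of the key,
# then sort each group on the third key column to mirror the listed output
records = [
    "1,2,1:hadoop", "1,2,5:hadoop", "1,3,4:hadoop", "1,2,9:hadoop",
    "1,2,11:hadoop", "1,2,7:hadoop", "1,3,15:hadoop", "1,3,14:hadoop",
    "1,2,19:hadoop", "0:java", "1,2,2:java", "1,2,8:java", "1,3,4:java",
    "1,2,2:java", "1,2,14:java", "1,2,12:java", "1,3,1:java", "1,3,5:java",
    "1,2,3:java",
]

def third_column(rec):
    # numeric value of the third key column (-1 for short keys like "0")
    fields = rec.split(":", 1)[0].split(",")
    return int(fields[2]) if len(fields) > 2 else -1

buckets = {}
for rec in records:
    key = rec.split(":", 1)[0]              # key is the part before ":"
    mark = ",".join(key.split(",")[:2])     # bucket mark: first two columns
    buckets.setdefault(mark, []).append(rec)

for mark in sorted(buckets):
    print("bucket %s:" % mark)
    for rec in sorted(buckets[mark], key=third_column):
        print("  " + rec)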

6. Analysis of results

From the results we can see:

(1) The first two columns serve as the bucket mark: part-00000 holds the records whose keys begin with 1,2 and part-00001 those whose keys begin with 1,3.

(2) The first three columns form the key, and within each bucket the records are sorted on the third column.

(3) The fields inside the key are separated by ",".

(4) The key and the value are separated by ":".

Reference:
(1) Hadoop Technology Insider: In-depth Analysis of MapReduce Architecture Design and Implementation Principles
