Hadoop on Mac with intellij idea-10 Lu xiheng. hadoop (version 2nd) 6.4.1 (shuffle and sorting) map-side content sorting

Last Update:2014-10-09 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

下午对着源码看陆喜恒. Hadoop实战（第2版）6.4.1 （Shuffle和排序）Map端，发现与Hadoop 1.2.1的源码有些出入。下面作个简单的记录，方便起见，引用自书本的语句都用斜体表示。

依书本，从MapTask.java开始。这个类有多个内部类：

从书的描述可知，collect()并不在MapTask类，而在MapOutputBuffer类，其函数功能是

1、定义输出内存缓冲区为环形结构
2、定义输出内存缓冲区内容到磁盘的操作

在collect函数中将缓冲区的内容写出时会调用sortAndSpill函数。好了，从这里开始就开始糊涂了，因为collect()没调用这个函数，接触Hadoop也就几天时间，啥都不懂，一下晕了。

简单表示下当前的函数调用关系：

0 ---- MapOutputBuffer::collect()

达到写出阈值后,写了缓冲区内容，形成spill文件。即，调用了startSpill()。

0 ---- MapOutputBuffer::collect()

1 -------- startSpill()

startSpill()触发了条件：spillReady.signal()。字段spillReady在spillReady在SpillThread类中用到，SpillThread为Thread的子类，其run方法有如下内容：

SpillThread::run()
// ...
spillReady.await();
// ...
MapOutputBuffer::sortAndSpill()
// ...

那么，这里第一次看到sortAndSpill方法被调用，接上了书本的描述。现在主要函数调用关系如下

线程1

线程2（MapOutputBuffer构造函数中启动）

0 ---- MapOutputBuffer::collect()

1 -------- startSpill()

2 ------------ spillReady.signal()

SpillThread::run()
　　// ...
　　spillReady.await();
　　// ...
　　MapOutputBuffer::sortAndSpill()
　　// ...

sortAndSpill内部使用了快排：

...sorter = ReflectionUtils.newInstance(                                job.getClass(                        "map.sort.class",                         QuickSort.class,                         IndexedSorter.class),                         job);...sorter.sort();...

排序后，判断combinerRunner是否为空，为空直接写入spill，否则调用combinerRunner.combine方法，再不是combineAndSpill方法，Hadoop 1.2.1源码中没书上写的这句代码。combinerRunner在MapOutputBuffer的构造函数中定义

combinerRunner = CombinerRunner.create(job, getTaskID(),                                              combineInputCounter,                                             reporter, null);

sortAndSpill()另一个调用者是flush()，此函数是MapOutputBuffer成员。这段代码位于

MapTask::run()

　　runOldMapper()

　　　　collector = new MapOutputBuffer(umbilical, job, reporter);

　　　　...

　　　　collector.flush()

　　　　　　sortAndSpill()

　　　　　　mergeParts()

done(umbilical, reporter)

上述的调用关系才符合书第112页最后一段的描述。mergeParts()执行合并操作，这个操作的主要目的是将Map生成的众多spill文件中的数据按照划分重新组织，以便于Reduce处理。这里的划分，应该是partition之意。

待唯一的已分区且排序的Map输出文件写入最后一条记录后，Map端的shuffle阶段就结束了。从源码看，这步应该是执行done(umbilical, reporter)后才完成。

╮(╯_╰)╭ 我不是为了情怀，我就是认真。

Hadoop on Mac with IntelliJ IDEA - 10 陆喜恒. Hadoop实战（第2版）6.4.1（Shuffle和排序）Map端内容整理

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Hadoop on Mac with intellij idea-10 Lu xiheng. hadoop (version 2nd) 6.4.1 (shuffle and sorting) map-side content sorting

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Hadoop on Mac with intellij idea-10 Lu xiheng. hadoop (version 2nd) 6.4.1 (shuffle and sorting) map-side content sorting

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support