[Spark] Spark's shuffle mechanism

Source: Internet
Author: User

The Shuffle in MapReduce

In the MapReduce framework, shuffle is the bridge between map and reduce: the output of map must pass through shuffle before it reaches reduce, so the performance and throughput of shuffle directly affect those of the whole program.
Shuffle is a specific phase in the MapReduce framework, sitting between the map phase and the reduce phase. When the output of map is to be consumed by reduce, it is partitioned, typically by hashing the key, and distributed to each reducer; this process is the shuffle. Because shuffle involves disk reads and writes as well as network transfer, its performance directly affects the execution efficiency of the whole program.
The following describes the overall flow of the MapReduce algorithm, in which the shuffle phase sits between the map phase and the reduce phase:

In Hadoop, whenever a mapper's in-memory buffer is nearly full, the data in memory is divided by partition and spilled to disk as small files, so repeated buffer spills produce a lot of small files.
Everything Hadoop does from that point until reduce is essentially continuous merging: file-based multi-way merge sorting. Spill files belonging to the same partition are merged on the map side, and on the reduce side the data files copied from the mappers are merged into the final input for reduce.
This multi-way merge sorting achieves two goals:

Merge: the values of the same key are gathered into a single list (e.g. an ArrayList). Sort: the final result is ordered by key.
This approach scales very well, and big data poses no problem for it. The problem is efficiency: multiple rounds of file-based multi-way merge sorting mean repeatedly reading data from and writing data to disk.
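The merge-and-sort step above can be sketched in a few lines. This is a toy simulation, not Hadoop's actual implementation: each "spill file" is just an in-memory list of (key, value) pairs already sorted by key, and `heapq.merge` plays the role of the multi-way merge.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Hypothetical spill "files": each is a run of (key, value) pairs already
# sorted by key, as a mapper would write them to disk on each buffer spill.
spills = [
    [("apple", 1), ("cherry", 2), ("pear", 1)],
    [("apple", 3), ("banana", 1), ("pear", 2)],
    [("banana", 4), ("cherry", 5)],
]

# Multi-way merge: heapq.merge streams the sorted runs together without
# loading everything into memory at once, like the on-disk merge sort.
merged = heapq.merge(*spills, key=itemgetter(0))

# Group the values of each key into one list -- the input handed to reduce.
reduce_input = [(k, [v for _, v in grp])
                for k, grp in groupby(merged, key=itemgetter(0))]

print(reduce_input)
# [('apple', [1, 3]), ('banana', [1, 4]), ('cherry', [2, 5]), ('pear', [1, 2])]
```

The result satisfies both goals at once: values of the same key end up in one list, and the keys come out in sorted order.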

Spark's shuffle mechanism

The shuffle in Spark converts a set of irregularly arranged data into data organized, as far as possible, according to certain rules.
Spark's computing model runs in a distributed environment, which makes it impossible to hold all of a computation's data in a single process space. Instead, the data is partitioned by key, each partition is allocated its own piece of space, and the partitions are scattered across the memory of the various processes in the cluster. However, not every operator is satisfied with one fixed way of partitioning.

For example, when the data needs to be sorted for storage, it must be re-partitioned again according to some rule. Shuffle is the process of re-partitioning data between the various operators that require it.

Logically, you can also understand it this way: re-partitioning requires a partition rule, and the partition rule maps each record's key to a partition through some function (hash, range, etc.). Determining the key of each record is the map process, and the map process can do other data processing at the same time. For example, among join algorithms there is a classic one called map-side join, which decides at the logical-definition stage which partition each record should be placed in. Shuffle then collects the data and delivers it to the assigned reduce partition, and in the reduce phase the required function is applied to the corresponding partition.
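The two partition rules mentioned above can be sketched as plain functions. This is a minimal illustration of the idea behind hash and range partitioning, not Spark's actual `HashPartitioner`/`RangePartitioner` code; the function names and boundary values are made up for the example.

```python
def hash_partition(key, num_partitions):
    """Choose a partition by hashing the key (the hash-partitioning rule)."""
    return hash(key) % num_partitions

def range_partition(key, bounds):
    """Choose a partition by sorted range boundaries (the range-partitioning
    rule), so that partition i holds all keys <= bounds[i]."""
    for i, upper in enumerate(bounds):
        if key <= upper:
            return i
    return len(bounds)  # last partition holds everything above the top bound

# Hash partitioning: integer keys hash to themselves in CPython.
assert hash_partition(5, 3) == 2

# Range partitioning with boundaries at 10 and 20 -> 3 partitions.
assert range_partition(5, [10, 20]) == 0
assert range_partition(15, [10, 20]) == 1
assert range_partition(99, [10, 20]) == 2
```

Note that range partitioning additionally keeps the partitions globally ordered, which is why a sort needs it while a plain aggregation can use hashing.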

The process of shuffle in Spark

* First, each mapper creates one bucket per reducer, so the total number of buckets is M×R, where M is the number of map tasks and R is the number of reduce tasks.
* Second, the mapper's results are written into the buckets according to the configured partition algorithm.

The partition algorithm here can be user-defined, but the default hashes the key to choose a bucket.

* When a reducer starts, it fetches its corresponding buckets from the local or a remote block manager, based on its own task ID and the IDs of the mappers it depends on, and processes them.

The bucket here is an abstract concept: in the implementation, each bucket can correspond to a file, to part of a file, or to something else.
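The steps above can be modeled with a small toy simulation. This is an illustration of the M×R bucket layout, not Spark's internals; the mapper outputs and IDs are made up for the example.

```python
M, R = 3, 2  # number of map tasks and reduce tasks

# Step 1: each mapper owns R buckets, keyed by (mapper_id, reducer_id),
# so the cluster holds M x R buckets in total.
buckets = {(m, r): [] for m in range(M) for r in range(R)}
assert len(buckets) == M * R

# Made-up mapper outputs: integer keys with string values.
mapper_outputs = {
    0: [(1, "x"), (2, "y")],
    1: [(2, "z"), (3, "w")],
    2: [(1, "u"), (3, "v")],
}

def partition(key):
    # Step 2: the default rule hashes the key into one of the R buckets.
    return hash(key) % R

for m, records in mapper_outputs.items():
    for key, value in records:
        buckets[(m, partition(key))].append((key, value))

# Step 3: reducer r fetches bucket (m, r) from every mapper m.
def reducer_input(r):
    return [kv for m in range(M) for kv in buckets[(m, r)]]

print(reducer_input(0))  # all records whose key hashes to partition 0
print(reducer_input(1))  # all records whose key hashes to partition 1
```

Every record with the same key lands in the same reducer's input, regardless of which mapper produced it, which is exactly the guarantee shuffle exists to provide.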

When reprinting, please credit the author Jason Ding and the original source:
Gitcafe Blog homepage (http://jasonding1354.gitcafe.io/)
GitHub Blog homepage (http://jasonding1354.github.io/)
CSDN Blog (http://blog.csdn.net/jasonding1354)
Jianshu homepage (http://www.jianshu.com/users/2bd9b48f6ea8/latest_articles)
Search "jasonding1354" on Google to reach my blog homepage
