Spark Sort-Based Shuffle Internals: A Thorough Decryption (DT Big Data DreamWorks)


Content:

1. Why sort-based Shuffle is needed;

2. Hands-on with sort-based Shuffle;

3. Sort-based Shuffle internals;

4. Shortcomings of sort-based Shuffle.

Sort-based shuffle is the most common shuffle approach; it sits at the core of large-scale Spark development and operations issues, and holds the key to their answers. You must master this content.

This lesson is a channel for successfully upgrading from a junior Spark developer to an intermediate Spark talent.

At any reasonably large company, the content of this lesson will certainly come up in interviews.

========== Why sort-based Shuffle is needed ==========

1. Spark started with hash-based shuffle. A shuffle generally involves two phases of tasks: the phase that produces shuffle data (the map stage; as a supplement, this requires implementing getWriter in ShuffleManager, and the data can be written through BlockManager to memory, disk, Tachyon, and so on. If you want a very fast shuffle you could consider writing the data to memory, but memory is not stable, so the MEMORY_AND_DISK storage level is recommended; MEMORY_AND_DISK_2 is safer still, but its cost is also higher), and the phase that uses the shuffle data (the reduce stage; as a supplement, this requires implementing getReader in ShuffleManager, and the reader goes to the driver to find out where to fetch the shuffle data produced by the previous stage).
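For reference, here is an abridged sketch of what the ShuffleManager interface looks like in the Spark 1.6 source (org.apache.spark.shuffle.ShuffleManager; signatures simplified and some members omitted):

trait ShuffleManager {
  // Called on the driver to register a shuffle and obtain a handle for the tasks.
  def registerShuffle[K, V, C](shuffleId: Int, numMaps: Int,
      dependency: ShuffleDependency[K, V, C]): ShuffleHandle

  // Called on executors by map tasks to write shuffle output.
  def getWriter[K, V](handle: ShuffleHandle, mapId: Int,
      context: TaskContext): ShuffleWriter[K, V]

  // Called on executors by reduce tasks to read a range of partitions.
  def getReader[K, C](handle: ShuffleHandle, startPartition: Int,
      endPartition: Int, context: TaskContext): ShuffleReader[K, C]

  def unregisterShuffle(shuffleId: Int): Boolean
  def stop(): Unit
}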

2. Spark computations form a chain of expressions, so one chain can contain many mappers and many reducers.

If there is more than one stage, then apart from the last stage, which is the reducer of the whole job, and the first stage, which is the mapper of the whole job, every stage in the middle is both a mapper and a reducer: it is the reducer of the previous stage and the mapper of the next stage.

If there is only one stage, the job is equivalent to a mapper-only stage and produces no shuffle, which suits simple ETL.

3. In the beginning, Spark Shuffle only supported hash-based shuffle, and by default hash-based shuffle has each task of the mapper stage create a separate file for each task of the reducer stage to hold the data that task will use. In some cases (for example, very large data volumes) this produces a huge number of files, M*R of them (M is the number of parallel tasks on the mapper side, R the number of parallel tasks on the reducer side), with random disk I/O operations and heavy memory consumption, which very easily causes OOM. This is a fatal problem: first, it cannot handle large-scale data; second, Spark cannot run on a large-scale distributed cluster! The later remedy was to add the shuffle consolidate mechanism, which reduces the number of files produced by shuffle to C*R (C is the number of cores available on the mapper side, R the number of concurrent tasks in the reducer). But if there are too many parallel data shards (tasks) on the reducer side, C*R may still be too large, and the doom of having too many open files is still not escaped;

Before Spark 1.1, that is, before sort-based shuffle was introduced, Spark was better suited to small- and medium-scale big data processing.

4. To let Spark handle larger data with higher performance on larger clusters, sort-based shuffle was introduced (beginning with Spark 1.1). With it, Spark can handle big data at almost any scale (even PB level and beyond), especially after the introduction and optimization of the Tungsten project, which pushed Spark's ability to process ever more massive data on ever larger clusters at ever greater speed to a peak!

Recall the shuffle of Hadoop MapReduce, which is sorted: it uses a ring (circular) memory buffer, and its output carries both data and an index.

5. Spark 1.6 supports at least three types of shuffle, as this snippet from the Spark source shows:

// Let the user specify short names for shuffle managers
val shortShuffleMgrNames = Map(
  "hash" -> "org.apache.spark.shuffle.hash.HashShuffleManager",
  "sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager",
  "tungsten-sort" -> "org.apache.spark.shuffle.sort.SortShuffleManager")
val shuffleMgrName = conf.get("spark.shuffle.manager", "sort")
val shuffleMgrClass = shortShuffleMgrNames.getOrElse(shuffleMgrName.toLowerCase, shuffleMgrName)
val shuffleManager = instantiateClass[ShuffleManager](shuffleMgrClass)

By implementing the ShuffleManager interface, you can customize and optimize a shuffle implementation according to the actual needs of your business;

6. Spark 1.6 defaults to the sort-based shuffle, as the source code in point 5 shows. It also explains why spark.shuffle.manager can be configured in the spark-defaults.conf file to choose the ShuffleManager implementation.
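For example, a minimal sketch of selecting the shuffle manager programmatically, using the property key and short names from the source snippet above (the app name and master are made up for illustration):

import org.apache.spark.{SparkConf, SparkContext}

// Set spark.shuffle.manager before the SparkContext is created.
// Valid short names in Spark 1.6: "hash", "sort", "tungsten-sort".
val conf = new SparkConf()
  .setAppName("ShuffleManagerDemo")
  .setMaster("local[2]")
  .set("spark.shuffle.manager", "sort") // the default since Spark 1.2

val sc = new SparkContext(conf)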

Sort-based shuffle does not generate a separate file for each task on the reducer side; instead, it writes all of a mapper's output into a single data file. Because the data in that file is grouped by partition, sort-based shuffle uses an index file to record how the output of a specific ShuffleMapTask is laid out within the data file (the data is first partitioned, then shuffled out to the cluster). So sort-based shuffle produces two files per ShuffleMapTask on the mapper side: the data file stores the shuffle output of the current task, and the index file records, via the Partitioner's classification, where each partition's data lies within the data file. The tasks of the next stage then use the index file to locate the portion of the previous stage's ShuffleMapTask output that belongs to them.
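To make the data/index pairing concrete, here is a minimal sketch of locating one reducer's byte range in a map task's output, assuming the Spark 1.x index layout: the index file holds (numPartitions + 1) Long offsets into the data file, starting at 0. The helper and file paths are hypothetical simplifications of what IndexShuffleBlockResolver does internally:

import java.io.{DataInputStream, FileInputStream, RandomAccessFile}

// Hypothetical helper: return the bytes of reduce partition `reduceId`
// from one map task's (data, index) file pair.
def readPartition(dataPath: String, indexPath: String, reduceId: Int): Array[Byte] = {
  val in = new DataInputStream(new FileInputStream(indexPath))
  try {
    in.skipBytes(8 * reduceId)   // each offset is one 8-byte Long
    val start = in.readLong()    // where this partition begins in the data file
    val end = in.readLong()      // where the next partition begins
    val data = new RandomAccessFile(dataPath, "r")
    try {
      // Sketch assumes a partition smaller than 2 GB (fits in an Int).
      val bytes = new Array[Byte]((end - start).toInt)
      data.seek(start)
      data.readFully(bytes)
      bytes
    } finally data.close()
  } finally in.close()
}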

So the correct answer for the number of temporary files generated by sort-based shuffle is 2M (M is the total number of parallel partitions on the mapper side, which is in fact the total number of mapper-side tasks; this is not the same thing as the number of tasks actually running in parallel at any moment);

Recalling the history of the entire shuffle, the number of temporary files produced by shuffle has evolved in the following sequence (a worked example follows the list):

Basic hash shuffle: M*R;

Consolidated hash shuffle: C*R;

Sort-based shuffle: 2M.
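As a worked example (the numbers are made up for illustration): with M = 1,000 mapper tasks, R = 1,000 reducer tasks, and C = 100 cores, basic hash shuffle produces M*R = 1,000,000 files, consolidated hash shuffle produces C*R = 100,000 files, and sort-based shuffle produces only 2M = 2,000 files (1,000 data files plus 1,000 index files).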

========== Hands-on with sort-based Shuffle on the cluster ==========

Create your own directory and upload a few files:

[Email protected]:/usr/local/hadoop-2.6.0/sbin# hadoop dfs -mkdir /library

DEPRECATED: Use of this script to execute a hdfs command is deprecated.
Instead use the hdfs command for it.

16/02/13 13:18:41 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

[Email protected]:/usr/local/hadoop-2.6.0/sbin# hadoop dfs -mkdir /library/dataForSortedShuffle

DEPRECATED: Use of this script to execute a hdfs command is deprecated.
Instead use the hdfs command for it.

16/02/13 13:19:27 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

[Email protected]:/usr/local/hadoop-2.6.0# hadoop dfs -put LICENSE.txt /library/dataForSortedShuffle/

DEPRECATED: Use of this script to execute a hdfs command is deprecated.
Instead use the hdfs command for it.

16/02/13 13:21:06 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

./spark-shell --master spark://master:7077,worker1:7077,worker2:7077

sc.textFile("/library/dataForSortedShuffle").flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _).saveAsTextFile("/library/dataOutput1")


The degree of parallelism is 6.

Temporary data files and index files, 2 data files and 2 index files on each machine:


All three machines in the cluster have 2 index files and 2 data files each, for a total of 6 data files and 6 index files among the temporary files. So, just as the theory above says, with the default sort-based shuffle and a degree of parallelism of 6, 6*2 temporary files are produced: 2M files!

The naming rule is shuffle_<shuffleId>_<mapId>_<reduceId>:

shuffle_0_0_0.data

shuffle_0_1_0.data

shuffle_0_2_0.data

and so on;

shuffle_0_0_0.index

shuffle_0_1_0.index

shuffle_0_2_0.index

and so on.

After a while, the temporary files are deleted.

In sort-based shuffle, how does a reducer obtain the data it needs?

Specifically, the reducer first asks the driver for the location of each ShuffleMapTask's output in the parent stage, fetches the index file based on that location information, parses the index file, and from the parsed index obtains the portion of the data file that belongs to it.
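A minimal sketch of that control flow (the case class and helpers here are hypothetical simplifications; in real Spark 1.x the location information comes from MapOutputTracker on the driver):

// Stand-in for the per-map-task output location reported by the driver.
case class MapOutputLocation(host: String, dataPath: String, indexPath: String)

// Given the driver-reported locations and an index-aware reader (such as the
// readPartition sketch above), fetch only this reducer's slice of each output.
def fetchMyPartition(
    reduceId: Int,
    locations: Seq[MapOutputLocation],
    readPartition: (String, String, Int) => Array[Byte]): Iterator[Array[Byte]] =
  locations.iterator.map(loc => readPartition(loc.dataPath, loc.indexPath, reduceId))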

Some drawbacks of the default sort-based shuffle:

1. If the number of tasks on the mapper side is too large, a lot of small files are still produced. Then, while shuffle transfers data to the reducer side, the reducer needs to keep a large number of records open for deserialization at the same time, which causes heavy memory consumption and a large GC burden, slowing the system down or even crashing it;

2. If sorting is needed within a shard (partition), the data must be sorted twice: once on the mapper side and once on the reducer side.
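For illustration, a minimal sketch of asking Spark to sort within partitions explicitly, using the real repartitionAndSortWithinPartitions API (the pair RDD contents are made up, and sc is assumed to be an existing SparkContext):

import org.apache.spark.HashPartitioner

// A made-up pair RDD.
val pairs = sc.parallelize(Seq(("b", 2), ("a", 1), ("c", 3), ("a", 4)))

// Repartition and sort by key within each partition in one shuffle,
// instead of shuffling first and sorting as a separate step.
val sorted = pairs.repartitionAndSortWithinPartitions(new HashPartitioner(2))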

Teacher Liaoliang's card:

The first person of Spark in China

Sina Weibo: http://weibo.com/ilovepains

Public WeChat account: DT_Spark

Blog: http://blog.sina.com.cn/ilovepains

Mobile: 18610086859

QQ: 1740415547

Email: [Email protected]


This article is from the "A Flower Proud of the Cold" blog; reprinting is declined!
