A Deep Study of Spark Stragglers (2): Thoughts on the Division of Blocks and Partitions, with Reference Papers


I. The Problem of Dividing Partitions

How partitions are divided has a great impact on collecting block data. If we want to speed up task execution at the block level, what conditions should a partition satisfy?

Reference Idea 1: Range Partition

1. Source:

IBM DB2 BLU; Google PowerDrill; Shark on HDFS

2. Rules:

Range partitioning follows three principles: (1) apply fine-grained range segmentation to each column, to prevent data skew and workload skew; (2) assign a different set of columns to each partition; (3) take data correlation and filter associativity into account when dividing partitions.

For an implementation, refer to Spark's implementation in Reference Idea 3; a toy lookup sketch also follows below.
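As a minimal illustration (my own sketch, not code from any of the cited systems): once a set of sorted upper bounds exists, assigning a key to its range partition is a binary search over those bounds, similar in spirit to how Spark's RangePartitioner resolves keys after its bounds are fixed.

    // A toy sketch (assumed names): given sorted upper bounds, find the
    // range partition for a key via binary search over the bounds array.
    object RangeLookup {
      // bounds(i) is the upper bound of partition i; keys above the last
      // bound fall into the final partition.
      def getPartition(key: Int, bounds: Array[Int]): Int = {
        var lo = 0
        var hi = bounds.length // candidate partition ids are 0..bounds.length
        while (lo < hi) {
          val mid = (lo + hi) / 2
          if (key > bounds(mid)) lo = mid + 1 else hi = mid
        }
        lo
      }

      def main(args: Array[String]): Unit = {
        val bounds = Array(10, 20, 30)   // 4 partitions: <=10, <=20, <=30, >30
        println(getPartition(5, bounds))  // 0
        println(getPartition(25, bounds)) // 2
        println(getPartition(99, bounds)) // 3
      }
    }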

3. Brief Thoughts:

This way of dividing partitions requires a lot of extra work. My design does not need all of it; the only principle I need to keep is the first one: avoid data skew and workload skew.

Reference Idea 2: Fine-Grained Partitioning (a form of horizontal partitioning)

1. Source:

Fine-grained Partitioning for Aggressive Data Skipping (ACM SIGMOD '14)

2. Rules:

The purpose of fine-grained partitioning is clear: produce fine-grained, size-balanced blocks, divided according to the query workload (the partitioning method targets Shark queries), so that scans can skip as many blocks as possible. The concrete method is:

(1) From the set of previously frequent filters (the paper shows that a small set of typical filters suffices for decision-making), extract the filter predicates as features;

(2) Recompute the data according to the extracted features, generating feature vectors, and recast the problem as an optimization problem.

An example: if a partition's feature vector is 0 for some feature, no tuple in that partition satisfies the feature, so when a query on that feature is scanned, the partition can be skipped directly. A toy sketch of this mechanism follows.
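The following sketch is my own illustration of the idea (names and features are assumptions, not the paper's code): each block stores one bit per extracted filter feature, and a query carrying one of those features scans a block only when the corresponding bit is set.

    // Toy data-skipping sketch (assumed names, not the SIGMOD '14 code):
    // each block keeps one bit per extracted filter feature; a 0 means no
    // tuple in the block satisfies that feature, so the block is skipped.
    object DataSkipping {
      case class Block(id: Int, featureBits: Array[Boolean], rows: Seq[Int])

      // Scan only the blocks whose bit for `feature` is set.
      def scan(blocks: Seq[Block], feature: Int, pred: Int => Boolean): Seq[Int] =
        blocks.filter(_.featureBits(feature)).flatMap(_.rows.filter(pred))

      def main(args: Array[String]): Unit = {
        // feature 0: "value > 100"; feature 1: "value < 10"
        val blocks = Seq(
          Block(0, Array(true, false), Seq(150, 200)),
          Block(1, Array(false, true), Seq(1, 5)),   // skipped for feature 0
          Block(2, Array(true, true),  Seq(3, 400))
        )
        println(scan(blocks, 0, _ > 100))            // List(150, 200, 400)
      }
    }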

3. Brief Thoughts:

First, fine-grained partitioning has the following characteristics: (1) it is carried out by an additional daemon process, executed at data-loading time or when some recent task requires re-partitioning (for example, an explicit user repartition operation); (2) the method applies both to forming partitions and to forming blocks; (3) it extracts typical features from filters to divide the data.

Next, analyzing the idea behind the method: it uses exactly the information it needs, namely the features of filters and of block skipping, to divide blocks and partitions, which gives it high reference value. The data characteristics I need should include the importance of the data, the completeness of the data required to start a task, and the completeness of the blocks required to start a task; or, more concretely, how much of a block's data must be finished before the task can start, and how should that amount be judged?

Reference Idea 3: The Hash Partition Implemented by Spark

1. Source:

Parsing the Spark 1.3.1 source: the Partitioner implementations that ship with Spark. In fact, Spark 1.3.1 also implements RangePartitioner in this module alongside HashPartitioner, and the range-partitioning path is what the analysis below covers.

2. Rules:

First, to roughly balance the sizes of the output partitions, several parameters are defined to help decide how many samples of an RDD's result are assigned to each partition:

sampleSize, which limits the total sample size: min(20 * partitions, 1M);

sampleSizePerPartition, the number of samples per partition: ceil(3 * sampleSize / partitions), where rounding up allows the sample count to be slightly exceeded;

sketched, which describes the sampled result of an RDD, containing (partition ID, number of items, sample), where the sample is determined by the preceding parameters;

(key, weight) and imbalancedPartitions, respectively a buffer array and a mutable set, which store the weights of the balanced partitions and the set of unbalanced partitions; balance is judged from the relationship between a partition's size and the average partition size, with weight = partition size / sample size.

See how the balance judgment is implemented (cleaned up from the source):

    val fraction = math.min(sampleSize / math.max(numItems, 1L), 1.0)
    // a partition is flagged as imbalanced when its sampled share would
    // exceed the per-partition sample budget
    val imbalanced = fraction * numItems > sampleSizePerPartition

Unbalanced partitions are re-sampled, and the weight after re-sampling is 1/fraction.

Second, note that RangePartitioner takes implicit parameters hidden in its type signature, an Ordering[K] and a ClassTag[K], whose values are used when writing and reading the data stream inside the Spark framework. Through writeObject and readObject, the writing and reading of the partitioner's data can be controlled.

Finally, the partition bounds are decided by the determineBounds method in object RangePartitioner, which uses the weight values to balance how keys are placed into partitions, thereby balancing the sizes of the partitions. A simplified sketch follows.
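The sketch below is loosely modeled on RangePartitioner.determineBounds in Spark 1.3.1, with my own simplifications (for instance, it omits the check against duplicate bounds): walk the weighted sample in sorted key order and emit a bound each time the accumulated weight crosses a per-partition step.

    // Simplified sketch of weight-based bounds selection, loosely modeled
    // on RangePartitioner.determineBounds (details simplified by me).
    object DetermineBounds {
      // candidates: sampled (key, weight) pairs; weight approximates how
      // many original records each sampled key represents (1/fraction for
      // re-sampled partitions).
      def determineBounds(candidates: Seq[(Int, Double)], partitions: Int): Array[Int] = {
        val ordered = candidates.sortBy(_._1)
        val sumWeights = ordered.map(_._2).sum
        val step = sumWeights / partitions  // ideal total weight per partition
        val bounds = scala.collection.mutable.ArrayBuffer[Int]()
        var cumWeight = 0.0
        var target = step
        for ((key, weight) <- ordered if bounds.length < partitions - 1) {
          cumWeight += weight
          if (cumWeight >= target) {        // crossed a partition boundary
            bounds += key
            target += step
          }
        }
        bounds.toArray
      }

      def main(args: Array[String]): Unit = {
        val sample = (1 to 100).map(k => (k, 1.0))
        println(determineBounds(sample, 4).mkString(", "))  // 25, 50, 75
      }
    }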

3. Brief Thoughts:

Spark's built-in hash partitioning strategy obtains the partition ID from the key's hashCode, while the range partitioner balances the size of each partition through sampling weights. Neither takes the correlation of the data inside a partition into account; in other words, block-level decisions are not reflected here, and how to optimize at the block level needs further thought.
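For contrast, the hash path is tiny. The sketch below mirrors the essence of Spark's HashPartitioner (a non-negative modulus of the key's hashCode), which is exactly why it is blind to any correlation among the data that lands in a partition.

    // Mirrors the essence of Spark's HashPartitioner: the partition id is
    // the non-negative remainder of the key's hashCode by the partition count.
    class SimpleHashPartitioner(numPartitions: Int) {
      def getPartition(key: Any): Int = {
        val raw = key.hashCode % numPartitions
        if (raw < 0) raw + numPartitions else raw  // Java's % can be negative
      }
    }

    object SimpleHashPartitioner {
      def main(args: Array[String]): Unit = {
        val p = new SimpleHashPartitioner(4)
        Seq("a", "b", "c", "d").foreach(k => println(s"$k -> ${p.getPartition(k)}"))
      }
    }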

II. How to Use the Partition and Block Division Strategies. Key paper: The Power of Choice in Data-aware Cluster Scheduling, OSDI '14

The previous section described three ways of dividing partitions and blocks, but how should the division be exploited afterwards? Beyond the articles already mentioned, the opening question corresponds most closely to the OSDI '14 paper The Power of Choice in Data-aware Cluster Scheduling. The system built to implement this paper is called KMN, so below I refer to the paper as KMN.

1. Overview

In the original Spark, when a task needs resources output by the upstream stage, the scheduler pulls exactly the number of resources the task requires and hands them to the task. Under the KMN policy, the scheduler instead pulls a chosen combination from all the available resources; the number delivered is still the number the task requires. The general relationship is as follows.

The defining characteristic of KMN is therefore choice: how to hand the optimal combination of blocks to the scheduler, which then dispatches them to the task, so as to achieve the highest efficiency. The problem thereby becomes an NP-hard problem. Note that KMN makes its choice over the complete set of blocks, so it must wait until all blocks have been produced; that is, it decides only after the upstream stage completes. KMN thus has to consider the impact of upstream stragglers; since KMN targets approximate-computation problems, it can decide to drop stragglers in order to speed up.

2. Detailed Implementation

The core of KMN is data-aware choice. Its decisions cover two basic scenarios, the input stage and the intermediate stage, and weigh two aspects: memory locality and network balance.

(1) Input Stage

At the input stage, choosing a combination of blocks can guarantee high data locality under various levels of cluster utilization. Using the example of sampling K blocks out of N, the paper illustrates the data-locality probability for both natural sampling and user-defined sampling; a small numeric sketch follows.
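To make the K-out-of-N argument concrete, here is a sketch of my own, following the paper's binomial-style analysis rather than its code: if each of the N input blocks independently has a free slot on one of its replica machines with probability p, then the chance that at least K blocks can be read locally is a binomial tail, and it is far larger than the chance that all N are local.

    // Probability that at least K of N blocks get local slots, assuming
    // each block independently finds a free local slot with probability p.
    // (A binomial-tail sketch in the spirit of KMN's analysis, not its code.)
    object LocalityProbability {
      def choose(n: Int, k: Int): Double =
        (0 until k).map(i => (n - i).toDouble / (i + 1)).product

      def atLeastKLocal(n: Int, k: Int, p: Double): Double =
        (k to n).map(i => choose(n, i) * math.pow(p, i) * math.pow(1 - p, n - i)).sum

      def main(args: Array[String]): Unit = {
        // Sampling 5 of 10 blocks: locality is far easier than needing all 10.
        println(f"P(>=5 of 10 local) = ${atLeastKLocal(10, 5, 0.5)}%.3f")  // ~0.623
        println(f"P(all 10 local)    = ${atLeastKLocal(10, 10, 0.5)}%.4f") // ~0.0010
      }
    }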

(2) Intermediate Stage

The intermediate stage must consider both its upstream and downstream stages. KMN launches additional tasks for the upstream stage and has to confirm the impact of those extra tasks on the block-choice schedule; it analyzes how the ratio M/K of upstream additional tasks affects cross-rack skew, i.e. the skew caused by the model of M tasks producing the K needed blocks. Then the best blocks are selected from the upstream stage's output, and the problem again becomes NP-hard. Finally, stragglers in the upstream stage must be addressed, since their presence delays the block decision at the intermediate stage. Comparing the occurrence of stragglers against the extra time of the choice decision, the paper finds the impact of stragglers to be 20%-40%, so it adopts the following approach: as soon as K of the M upstream tasks finish executing, the downstream tasks are started. In effect, this speeds up the start time of the downstream stage by not waiting for the slowest upstream tasks; a minimal sketch of this launch rule follows.
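The sketch below is my own illustration using plain JVM concurrency primitives (Scala 2.12+), not KMN's scheduler code: fire M upstream tasks and unblock the downstream stage as soon as any K of them have completed.

    import java.util.concurrent.{CountDownLatch, Executors}

    // Sketch: launch M upstream tasks, start downstream once K have
    // finished, without waiting for the slowest (straggling) tasks.
    object KOfM {
      def main(args: Array[String]): Unit = {
        val (m, k) = (8, 5)
        val pool = Executors.newFixedThreadPool(m)
        val done = new CountDownLatch(k)   // trips after K completions

        for (i <- 1 to m) pool.execute(() => {
          Thread.sleep(100L * i)           // task i takes 100*i ms; large i = straggler
          println(s"upstream task $i done")
          done.countDown()                 // extra completions beyond K are no-ops
        })

        done.await()                       // proceed after any K of the M finish
        println(s"$k of $m upstream tasks finished; starting downstream stage")
        pool.shutdown()
      }
    }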

