Spark API: combineByKey (I)

Source: Internet
Author: User
Tags: shuffle

1 Preface

combineByKey is a method you cannot really avoid when using Spark; it gets invoked all the time, intentionally or not, directly or indirectly. The name itself tells you it performs aggregation, and for that reason it needs little introduction: reduceByKey, aggregateByKey, foldByKey, and similar functions are all implemented on top of it.

combineByKey is a highly abstract aggregation function that can be used for both aggregating and grouping data, and the shuffle it triggers is also a central topic in Spark, so let's see how it is implemented.

If anything here is incomplete or wrong, please point it out.

2 Method Signature Overview

These are the combineByKey method fragments from PairRDDFunctions, with the two overloads shown together. In short: if you call combineByKey without supplying a Partitioner, a HashPartitioner is used, and map-side combining is enabled by default (this matters for the shuffle).
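As a hedged, self-contained sketch of the relationship between the two overloads (an assumption-laden model: the real methods live in PairRDDFunctions and return RDDs; here they just report which partitioner and flag were chosen):

```scala
// Minimal model of the two combineByKey overloads. Names mirror the
// Spark API, but the bodies are simplified stand-ins.
object CombineByKeyOverloads {
  sealed trait Partitioner
  final case class HashPartitioner(numPartitions: Int) extends Partitioner

  // Full variant: caller supplies the partitioner; map-side combine
  // defaults to on.
  def combineByKey[K, V, C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      partitioner: Partitioner,
      mapSideCombine: Boolean = true): (Partitioner, Boolean) =
    (partitioner, mapSideCombine)

  // Convenience variant: only a partition count is given, so a
  // HashPartitioner is constructed and map-side combine stays on.
  def combineByKey[K, V, C](
      createCombiner: V => C,
      mergeValue: (C, V) => C,
      mergeCombiners: (C, C) => C,
      numPartitions: Int): (Partitioner, Boolean) =
    combineByKey(createCombiner, mergeValue, mergeCombiners,
      HashPartitioner(numPartitions))
}
```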

3 Method Source Code Walkthrough

Without further ado, here is the source code.

Since there is a comment, start with the comment: it says that combineByKey is a generic function that aggregates values using a set of custom aggregation functions, with the key as the aggregation condition. The rest is not worth dwelling on; look at the code below.

The first step is to check whether the key is an array; if it is, neither map-side combining nor HashPartitioner can be used. The reason:

To do a map-side merge and hash partitioning, keys must be comparable for equality by content, and their hash values must be computed from content; only then can records be merged and partitioned correctly. Arrays, however, are compared for equality and hashed by object identity (the array reference itself), not by what they contain.
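A quick self-contained illustration of the problem: on the JVM, arrays inherit `equals` and `hashCode` from `Object`, so two arrays with identical contents are neither equal nor guaranteed to hash alike.

```scala
// Demonstration: JVM arrays use identity, not content, for equals and
// hashCode, which would break hash partitioning and map-side merging.
object ArrayKeyDemo {
  val a: Array[Int] = Array(1, 2, 3)
  val b: Array[Int] = Array(1, 2, 3)

  // Distinct array objects are never "==", even with identical contents.
  val equalByReference: Boolean = a == b          // false
  // Content comparison needs an explicit helper.
  val equalByContent: Boolean = a.sameElements(b) // true
}
```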

Moving on, an Aggregator is constructed. This object can fairly be called the core of combineByKey, since all the aggregation work is delegated to it. Let's step inside and look at the Aggregator.

Above is the default constructor of Aggregator, which takes three custom functions. Now let's focus on what these three functions mean:
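Paraphrased, the constructor looks roughly like this (a sketch of `org.apache.spark.Aggregator`; the real class also carries the spill machinery discussed below):

```scala
// Sketch of Spark's Aggregator case class: three user-supplied
// functions, parameterized over key K, input value V, combined value C.
case class Aggregator[K, V, C](
    createCombiner: V => C,      // first value of a key -> initial combiner
    mergeValue: (C, V) => C,     // fold one more value into the combiner
    mergeCombiners: (C, C) => C) // merge combiners across partitions
```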

First, the three generic parameters of Aggregator. The first, K, is the key your combineByKey aggregates on, and it can be any type. The next two, V and C, are the type of the values being aggregated and the type of the aggregated result; they can be the same type or different ones. For example, in Spark's reduceByKey, if the value before aggregation is a Long, the aggregated value is still a Long; in groupByKey, if the value before aggregation is a String, the aggregated result is an Iterable[String].

Now look at the three custom functions:

    1. createCombiner

This function is executed on every partition: it runs whenever a key is encountered that has not yet been processed within that partition. Its result is the initial aggregate of type C (which, depending on how the function is defined, can be a collection or a single value) for that key within the partition.

2. mergeValue

This function is also executed on every partition. Unlike createCombiner, it runs only when a key that has already been processed within the partition is encountered again, and its result is the current value merged into that key's existing aggregate of type C.

In fact, functions 1 and 2 are best viewed together as the two branches of an if: when a key arrives, check whether it has been seen before; if not, execute function 1, otherwise execute function 2.
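That if-judgment can be sketched as a per-partition loop over a hash map (a simplified stand-in for Spark's internal append-only map; spilling is omitted here):

```scala
import scala.collection.mutable

object PartitionCombine {
  // Combine one partition's records: createCombiner on first sight of a
  // key, mergeValue on every later occurrence of that key.
  def combineValues[K, V, C](
      records: Iterator[(K, V)],
      createCombiner: V => C,
      mergeValue: (C, V) => C): mutable.Map[K, C] = {
    val combiners = mutable.Map.empty[K, C]
    for ((k, v) <- records) {
      combiners.get(k) match {
        case None    => combiners(k) = createCombiner(v) // key not seen yet
        case Some(c) => combiners(k) = mergeValue(c, v)  // key already seen
      }
    }
    combiners
  }
}
```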

3. mergeCombiners

The first two functions merge data with the same key within a partition; this third one merges data with the same key across partitions, producing the final result.
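Putting the three functions together: below is a self-contained per-key average (a hypothetical example with no Spark dependency; "partitions" are simulated as plain sequences), chosen because it shows V and C differing, with V = Int but C = (sum, count):

```scala
object AverageByKey {
  type V = Int
  type C = (Int, Int) // (running sum, running count)

  val createCombiner: V => C = v => (v, 1)
  val mergeValue: (C, V) => C = { case ((s, n), v) => (s + v, n + 1) }
  val mergeCombiners: (C, C) => C = {
    case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2)
  }

  // Combine within one simulated partition (functions 1 and 2).
  private def combinePartition(part: Seq[(String, V)]): Map[String, C] =
    part.foldLeft(Map.empty[String, C]) { case (acc, (k, v)) =>
      acc.updated(k,
        acc.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
    }

  // Merge per-partition results with mergeCombiners (function 3),
  // then turn each (sum, count) into an average.
  def averages(partitions: Seq[Seq[(String, V)]]): Map[String, Double] = {
    val merged = partitions.map(combinePartition)
      .foldLeft(Map.empty[String, C]) { (acc, m) =>
        m.foldLeft(acc) { case (a, (k, c)) =>
          a.updated(k, a.get(k).map(mergeCombiners(_, c)).getOrElse(c))
        }
      }
    merged.map { case (k, (s, n)) => k -> s.toDouble / n }
  }
}
```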

Now let's look at what Aggregator actually implements.

From its method list it has only three methods; let's look at what each one does:

    1. combineValuesByKey

From the name, and given the constructor, you can guess that this method implements the data merge within a partition. Look at its code:

Here the code splits into two paths depending on whether spilling to disk is allowed. Both paths do the same work; the difference is that one will simply OOM when memory runs out while storing data, whereas the other can spill to disk. The implementation itself is simple: iterate over one partition's data and keep inserting into or updating the map. No need to elaborate further.

2. combineCombinersByKey

This method mainly implements the data merge between partitions, i.e. it merges the results of combineValuesByKey. Let's see how it is implemented:

The code needs no explanation; it is the same as combineValuesByKey, except that it uses a different one of the custom functions (mergeCombiners instead of mergeValue).

3. updateMetrics

This method is related to spilling: it records whether the current task has spilled to disk, and how much was spilled.

That's it for Aggregator; back to combineByKey.

After instantiating the Aggregator, the next step is to determine whether repartitioning (a shuffle) is required:

    1. No repartitioning required

When self.partitioner == Some(partitioner), that is, the RDD's existing partitioner is the same as the requested one, no repartitioning is needed: equal keys are already co-located, so it suffices to run combineValuesByKey on each partition. There is no merging between partitions, and hence no shuffle.

2. Repartitioning required

When the two partitioners differ, the scattered data must be repartitioned by key so that identical keys are collected on the same partition. Because of the uncertainty of the data distribution, each output partition may end up composed of data from all input partitions (a wide dependency), so a shuffle is needed and a ShuffledRDD is built.

At this point we should realize that the key to combineByKey is the Partitioner: the choice of partitioner determines the result of combineByKey. If the given partitioner does not guarantee that equal keys land in the same partition, the final merged result may contain the same key in multiple partitions.

The purpose of the shuffle is to take data scattered across all partitions and regroup it by key, concentrating each key in one partition.
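A small demonstration of that point, with partitions again simulated as plain sequences: a deliberately bad, hypothetical "partitioner" that assigns records by arrival order rather than by key leaves the same key in more than one partition, so per-partition combining can never fully merge it.

```scala
object BadPartitionerDemo {
  // Hypothetical bad partitioner: assigns by record index, ignoring the key.
  def partitionByIndex[K, V](data: Seq[(K, V)],
                             numPartitions: Int): Seq[Seq[(K, V)]] =
    data.zipWithIndex
      .groupBy { case (_, i) => i % numPartitions }
      .toSeq.sortBy(_._1)
      .map(_._2.map(_._1))

  // A key-based (hash) partitioner keeps equal keys together.
  def partitionByKey[K, V](data: Seq[(K, V)],
                           numPartitions: Int): Seq[Seq[(K, V)]] =
    (0 until numPartitions).map { p =>
      data.filter { case (k, _) => math.abs(k.hashCode) % numPartitions == p }
    }

  // In how many partitions does a given key appear?
  def spread[K, V](parts: Seq[Seq[(K, V)]], key: K): Int =
    parts.count(_.exists(_._1 == key))
}
```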

The shuffle path will be covered in detail in the next section.
