Spark combineByKey ("Please read the official Apache Spark documentation")


This article is worth reading and is well written. But after reading it, don't forget to check the Apache Spark website, because some of this article's interpretations are inconsistent with the source code and the official documentation. There are small mistakes! (The Cnblogs code editor does not support Scala, so language keywords are not highlighted.)

In data analysis, processing key-value pair data is a very common scenario. For example, we can group or aggregate an RDD containing pairs by key, or join two such RDDs on their keys. At the abstraction level, these operations share a common trait: they transform data of type RDD[(K, V)] into RDD[(K, C)]. Here V and C can be the same type or different types. Such an operation is not simply a map over each pair's value; rather, it combines (Combine) the values belonging to each key. As a result, not only may the types differ, but the number of elements may change as well.
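As a quick illustration of this type and cardinality change, here is a minimal sketch using standard pair-RDD operations (it assumes a SparkContext named sc, as in the spark-shell):

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))   // RDD[(String, Int)], 3 elements

// groupByKey: V = Int becomes C = Iterable[Int], and 3 elements shrink to 2 (one per key)
val grouped = pairs.groupByKey()       // RDD[(String, Iterable[Int])]

// reduceByKey: V and C are the same type (Int), but values are still combined per key
val summed = pairs.reduceByKey(_ + _)  // collect() => Array(("a", 3), ("b", 3))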

Spark provides a highly abstract operation, combineByKey, for this purpose. The method is defined as follows:

/**
 * Generic function to combine the elements for each key using a custom set of aggregation
 * functions. Turns an RDD[(K, V)] into a result of type RDD[(K, C)], for a "combined type" C.
 * Note that V and C can be different -- for example, one might group an RDD of type
 * (Int, Int) into an RDD of type (Int, Seq[Int]). Users provide three functions:
 *
 *  - `createCombiner`, which turns a V into a C (e.g., creates a one-element list)
 *  - `mergeValue`, to merge a V into a C (e.g., adds it to the end of a list)
 *  - `mergeCombiners`, to combine two C's into a single one.
 *
 * In addition, users can control the partitioning of the output RDD, and whether to perform
 * map-side aggregation (if a mapper can produce multiple items with the same key).
 */
def combineByKey[C](createCombiner: V => C,
    mergeValue: (C, V) => C,
    mergeCombiners: (C, C) => C,
    partitioner: Partitioner,
    mapSideCombine: Boolean = true,
    serializer: Serializer = null): RDD[(K, C)] = {
  // implementation omitted
}

The functional style differs from the imperative style in that it shows what the code does rather than how it is done. The combineByKey function mainly accepts three functions as parameters: createCombiner, mergeValue, and mergeCombiners. These three functions are enough to express what it does; by understanding them, you can understand combineByKey well.

combineByKey combines RDD[(K, V)] into RDD[(K, C)], so first you need a function that completes the transformation from V to C, called the combiner. If V and C are the same type, this function can simply be v => v. If C is a collection, for example Iterable[V], then createCombiner is v => Iterable(v).

mergeValue merges the values of the pairs in the original RDD into the C-typed data. How this merge is implemented determines how the result is computed, so mergeValue is more like declaring a merge method that drives the result of the whole combine operation. The function's inputs are a combined value C accumulated so far and the V of a pair in the original RDD, and its output is the C of the pair in the result RDD.

Finally, mergeCombiners merges the multiple C values produced for each key (for example, on different partitions) into a single C. All three functions are illustrated together in the sketch below.
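Here is a minimal sketch (again assuming a SparkContext sc) that sums the values for each key with combineByKey; V and C happen to be the same type (Int) here, so createCombiner is just the identity:

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

val sums = pairs.combineByKey(
  (v: Int) => v,                   // createCombiner: V => C, identity since V and C are both Int
  (c: Int, v: Int) => c + v,       // mergeValue: fold one more V into the partition-local C
  (c1: Int, c2: Int) => c1 + c2    // mergeCombiners: merge the per-partition C's
)
// sums.collect() => Array(("a", 3), ("b", 3))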

Let's think of combineByKey as a super cool juicer. It can accept various fruits at the same time and intelligently squeeze out a different juice for each kind of fruit: apples become apple juice, oranges become orange juice, watermelons become watermelon juice. If we define the fruit type as Fruit and the juice as Juice, then combineByKey transforms RDD[(String, Fruit)] into RDD[(String, Juice)].

Note that before juicing there may be many fruits, and fruits of the same kind may appear as different RDD elements:

("Apple", Apple1), ("Orange", Orange1), ("Apple", Apple2)

After combining, each kind of fruit yields exactly one glass of juice (they differ only in volume):

("Apple", Applejuice), ("Orange", Orangejuice)

What components does this juicer consist of? First, it needs a component that can squeeze each kind of fruit into its juice. Second, it needs a component that can mix a fruit into juice that has already been squeezed. Finally, to avoid mixing things up, it needs a component that can merge containers of juice by fruit type. Note the difference between the second and the third function: the second mixes a new fruit into an existing container of juice, while the third combines different containers of juice into one, under the premise that juices of different fruit types have already been placed in different areas, so the juicer never confuses juices from different areas.

The juicer's behavior is similar to a groupByKey + foldByKey operation. It can be implemented by calling the combineByKey function:

case class Fruit(kind: String, weight: Int) {
  def makeJuice: Juice = Juice(weight * 100)
}

case class Juice(volumn: Int) {
  def add(j: Juice): Juice = Juice(volumn + j.volumn)
}

val apple1 = Fruit("apple", 5)
val apple2 = Fruit("apple", 8)
val orange1 = Fruit("orange", 10)

val fruit = sc.parallelize(List(("apple", apple1), ("orange", orange1), ("apple", apple2)))
val juice = fruit.combineByKey(
  f => f.makeJuice,
  (j: Juice, f) => j.add(f.makeJuice),
  (j1: Juice, j2: Juice) => j1.add(j2)
)

Executing juice.collect gives the result:

Array[(String, Juice)] = Array((orange,Juice(1000)), (apple,Juice(1300)))

Many pair RDD operations call the combineByKey function in their internal implementation. For example, groupByKey:

class PairRDDFunctions[K, V](self: RDD[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
  extends Logging with SparkHadoopMapReduceUtil with Serializable {

  def groupByKey(partitioner: Partitioner): RDD[(K, Iterable[V])] = {
    val createCombiner = (v: V) => CompactBuffer(v)
    val mergeValue = (buf: CompactBuffer[V], v: V) => buf += v
    val mergeCombiners = (c1: CompactBuffer[V], c2: CompactBuffer[V]) => c1 ++= c2
    val bufs = combineByKey[CompactBuffer[V]](
      createCombiner, mergeValue, mergeCombiners, partitioner, mapSideCombine = false)
    bufs.asInstanceOf[RDD[(K, Iterable[V])]]
  }
}

The groupByKey function of PairRDDFunctions groups the values of an RDD[(K, V)] by key. It calls combineByKey internally, and the three functions passed in assume the following responsibilities:

    • createCombiner converts a value of type V in the original RDD into an Iterable[V], implemented as a CompactBuffer.
    • mergeValue actually appends an element of the original RDD to the CompactBuffer; the append operation (+=) serves as the merge operation.
    • mergeCombiners is responsible for merging the Iterable[V] collections produced for each key, as the user-level sketch after this list illustrates.
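Since CompactBuffer is private to Spark, here is an approximation of the same pattern in user code, substituting scala.collection.mutable.ArrayBuffer for CompactBuffer (sc assumed as before):

import scala.collection.mutable.ArrayBuffer

val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))

val grouped = pairs.combineByKey(
  (v: Int) => ArrayBuffer(v),                                 // createCombiner: wrap a V in a buffer
  (buf: ArrayBuffer[Int], v: Int) => buf += v,                // mergeValue: append (+=) a V to the buffer
  (b1: ArrayBuffer[Int], b2: ArrayBuffer[Int]) => b1 ++= b2   // mergeCombiners: concatenate per-key buffers
)
// grouped.collect() => Array(("a", ArrayBuffer(1, 2)), ("b", ArrayBuffer(3)))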

Depending on the functions passed in, we can also use combineByKey to implement different operations, such as aggregate, fold, and average. This is a high level of abstraction: from a declarative point of view, you don't need to know the implementation details. This is the charm of functional programming.
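For instance, a per-key average can be written as follows (a sketch, not Spark's own implementation; the combined type C is a (sum, count) pair):

val scores = sc.parallelize(Seq(("a", 90), ("a", 70), ("b", 80)))

val avgs = scores.combineByKey(
  (v: Int) => (v, 1),                                           // createCombiner: start a (sum, count)
  (acc: (Int, Int), v: Int) => (acc._1 + v, acc._2 + 1),        // mergeValue: add the value, bump the count
  (a: (Int, Int), b: (Int, Int)) => (a._1 + b._1, a._2 + b._2)  // mergeCombiners: add sums and counts
).mapValues { case (sum, count) => sum.toDouble / count }
// avgs.collect() => Array(("a", 80.0), ("b", 80.0))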
