Why does Spark's RDD have two APIs, fold and aggregate, instead of a single foldLeft?

Welcome to my new blog address: http://cuipengfei.me/blog/2014/10/31/spark-fold-aggregate-why-not-foldleft/

As we all know, List in the Scala standard library has a foldLeft method that is used for aggregation.

For example, I define a company class:

case class Company(name: String, children: Seq[Company] = Nil)

It has a name and a list of subsidiaries. Then I define several companies:

val companies = List(Company("B"), Company("A"), Company("T"))

These are three big companies. Now suppose a super-powerful company acquires all of them:

companies.foldLeft(Company("King"))((king, company) => Company(name = king.name, king.children :+ company))

The execution result is as follows:

scala> companies.foldLeft(Company("King"))((king, company) => Company(name = king.name, king.children :+ company))
res6: Company = Company(King,List(Company(B,List()), Company(A,List()), Company(T,List())))

The result of foldLeft is a new company whose subsidiaries are the three members of BAT.

Here a List[Company] is aggregated into a single Company, so the result has the same type as the elements. I will call this a homogeneous aggregation.

foldLeft can also perform heterogeneous aggregation, where the result type differs from the element type:

companies.foldLeft("")((acc, company) => acc + company.name)

The execution result is as follows:

scala> companies.foldLeft("")((acc, company) => acc + company.name)
res7: String = BAT

Here a List[Company] is aggregated into a String.

This API is very convenient: the same method handles aggregation whether it is homogeneous or heterogeneous.

Recently I started working with Spark, where RDD is the class most commonly used for distributed computing.

RDD has an API called fold, whose signature is similar to that of foldLeft. The difference is that it can only perform homogeneous aggregation.

That is to say, if you have an RDD[X], fold can only produce an X.

What if I want to produce a Y from an RDD[X]?

You have to use the aggregate API. The signature of aggregate is as follows:

aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U

Compared with fold and foldLeft, it takes one extra parameter, combOp.
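
To make the roles of seqOp and combOp concrete, here is a minimal local sketch in plain Scala with no Spark dependency. The helper aggregateSketch and the nested Seq standing in for partitions are assumptions made for illustration; this is not how Spark implements aggregate, only the same shape: seqOp folds the elements inside each partition, and combOp merges the per-partition results.

```scala
// Redefined here so the snippet runs on its own.
case class Company(name: String, children: Seq[Company] = Nil)

// Hypothetical stand-in for RDD.aggregate: each inner Seq plays the role of one partition.
def aggregateSketch[T, U](partitions: Seq[Seq[T]], zeroValue: U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U
): U = {
  // Fold every partition on its own, as a worker node would.
  val partials: Seq[U] = partitions.map(_.foldLeft(zeroValue)(seqOp))
  // Merge the partial results into a single value.
  partials.foldLeft(zeroValue)(combOp)
}

// Two pretend partitions: (B, A) and (T).
val partitions = Seq(Seq(Company("B"), Company("A")), Seq(Company("T")))

val names = aggregateSketch(partitions, "")(
  (acc: String, company: Company) => acc + company.name, // seqOp: (String, Company) => String
  (left: String, right: String) => left + right          // combOp: (String, String) => String
)
// names == "BAT"
```

This sketch merges the partitions strictly in order, which is why it yields BAT. On a real RDD the merge order may differ, so string concatenation is used here only to mirror the earlier example; in practice the operators should be associative and commutative.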

This puzzled me. Why are homogeneous and heterogeneous aggregation split into two APIs? Why doesn't Spark follow Scala's standard library and offer a single API that looks like foldLeft?

Later I figured out that this is a consequence of Spark's need to compute in a distributed fashion.

First, how does foldLeft of Scala List work?

companies.foldLeft(Company("King"))((king, company) => Company(name = king.name, king.children :+ company))

  1. Take the initial value, the company named King, and merge it with the first company in the list to get a new company with one subsidiary.
  2. Merge the company from the previous step with the second company in the list to get a new company with two subsidiaries.
  3. Merge the company from the previous step with the third company in the list to get a new company with three subsidiaries.

This is a homogeneous process.

companies.foldLeft("")((acc, company) => acc + company.name)

  1. Take the initial value, an empty string, and append the name of the first company in the list to get B.
  2. Append the name of the second company to the B from the previous step to get BA.
  3. Append the name of the third company to the BA from the previous step to get BAT.

This is a heterogeneous process.
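
The intermediate values in the steps above can be made visible with scanLeft from the Scala standard library, which works like foldLeft but keeps every accumulator value along the way. This is a small sketch; Company and companies are redefined only so that it runs on its own.

```scala
// Redefined here so the snippet runs on its own.
case class Company(name: String, children: Seq[Company] = Nil)

val companies = List(Company("B"), Company("A"), Company("T"))

// scanLeft returns the initial value followed by each intermediate accumulator.
val steps = companies.scanLeft("")((acc, company) => acc + company.name)
// steps == List("", "B", "BA", "BAT")

// foldLeft returns only the last of those values.
val result = companies.foldLeft("")((acc, company) => acc + company.name)
// result == "BAT"
```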

Like a row of falling dominoes, the elements of the list are absorbed into the result one by one, from left to right.

Now let's assume that RDD[X] had an API with the same signature as foldLeft. If I call it and pass in an f: (Y, X) => Y, what should happen next?

  1. Because the computation is distributed, the many X values have to be split into several partitions and sent to different nodes.
  2. Each node computes one Y from the X values it received.
  3. The results of all the nodes are collected, and now I have many Y values in hand.
  4. And now I am stuck: f only knows how to combine a Y with an X, so there is no way to turn many Y values into one Y.
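
The dead end in step 4 can be seen in a small local sketch. The nested Seq standing in for the data on each node is an assumption for illustration, not Spark code. Folding each partition with f: (String, Company) => String yields several String values, and f cannot merge them because its second argument must be a Company.

```scala
// Redefined here so the snippet runs on its own.
case class Company(name: String, children: Seq[Company] = Nil)

// Each inner Seq plays the role of the data held by one node (hypothetical layout).
val partitions = Seq(Seq(Company("B"), Company("A")), Seq(Company("T")))

// f: (Y, X) => Y with Y = String and X = Company.
val f: (String, Company) => String = (acc, company) => acc + company.name

// Steps 2 and 3: every node produces its own Y, and the results are collected.
val partials: Seq[String] = partitions.map(_.foldLeft("")(f))
// partials == Seq("BA", "T")

// Step 4: partials.reduce(f) does not compile, because f expects a Company
// as its second argument, not another String.
```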

Unlike Scala's List, which knocks over a single row of dominoes, Spark's RDD has to knock over many rows at once and then combine the results of all those rows.

If the aggregation is homogeneous, I can simply apply the same f: (X, X) => X again to the partial results.

But if it is heterogeneous, I need a second function, f: (Y, Y) => Y, to merge the partial results. That second function is the combOp parameter of aggregate.
