Why does Spark's RDD have two APIs, fold and aggregate? Why not a single foldLeft?

Last Update:2014-11-12 Source: Internet

Author: User

Tags spark rdd

Welcome to my new blog address: http://cuipengfei.me/blog/2014/10/31/spark-fold-aggregate-why-not-foldleft/

As we all know, List in the Scala standard library has a foldLeft method for aggregation.

For example, I define a company class:

`case class Company(name: String, children: Seq[Company] = Nil)`

It has a name and a list of subsidiaries. Then I define a few companies:

`val companies = List(Company("B"), Company("A"), Company("T"))`

These are three big companies. Now suppose a super-powerful company acquires all of them:

`companies.foldLeft(Company("King"))((king, company) => Company(name = king.name, king.children :+ company))`

The execution result is as follows:

`scala> companies.foldLeft(Company("King"))((king, company) => Company(name = king.name, king.children :+ company))`
`res6: Company = Company(King,List(Company(B,List()), Company(A,List()), Company(T,List())))`

As you can see, the result of foldLeft is a new company whose subsidiaries are the three BAT companies.

A List[Company] is aggregated into a single Company. Since the result has the same type as the elements, this is a homogeneous aggregation.

foldLeft can also perform heterogeneous aggregation:

`companies.foldLeft("")((acc, company) => acc + company.name)`

The execution result is as follows:

`scala> companies.foldLeft("")((acc, company) => acc + company.name)`
`res7: String = BAT`

A List[Company] is aggregated into a String.

This API is very convenient: the same method handles both homogeneous and heterogeneous aggregation.
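To see why a single method covers both cases, here is a minimal sketch of a hand-written left fold. The name myFoldLeft is hypothetical and only mirrors the shape of the standard library's foldLeft; the point is that the element type A and the result type B are independent type parameters.

```scala
object FoldLeftShape {
  case class Company(name: String, children: Seq[Company] = Nil)

  // Hypothetical hand-written equivalent of List#foldLeft, shown only to expose the types:
  // A is the element type, B is the result type, and nothing forces them to be equal.
  @annotation.tailrec
  def myFoldLeft[A, B](xs: List[A], acc: B)(op: (B, A) => B): B = xs match {
    case Nil          => acc
    case head :: tail => myFoldLeft(tail, op(acc, head))(op)
  }

  val companies = List(Company("B"), Company("A"), Company("T"))

  // Homogeneous: A = Company and B = Company
  val king: Company =
    myFoldLeft(companies, Company("King"))((k, c) => Company(k.name, k.children :+ c))

  // Heterogeneous: A = Company but B = String
  val names: String =
    myFoldLeft(companies, "")((acc, c) => acc + c.name)

  def main(args: Array[String]): Unit = {
    println(king)
    println(names)
  }
}
```

Both calls go through the same function; only the type chosen for B differs.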

Recently I started working with Spark, where RDD is the most commonly used class for distributed computation.

RDD has an API called fold, whose signature is similar to that of foldLeft. The difference is that it can only perform homogeneous aggregation.

That is to say, if you have an RDD[X], fold can only produce an X.

What if I want to produce a Y from an RDD[X]?

You have to use the aggregate API. The signature of aggregate is as follows:

`aggregate[U](zeroValue: U)(seqOp: (U, T) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): U`

It takes one more parameter than fold and foldLeft: combOp.

This puzzled me. Why do homogeneous and heterogeneous aggregation have to be split into two APIs? Why can't Spark follow Scala's standard library and offer a single foldLeft-style method?

Later I figured out that this is a consequence of Spark's need to compute in a distributed way.

First, how does foldLeft on a Scala List work?

`companies.foldLeft(Company("King"))((king, company) => Company(name = king.name, king.children :+ company))`

Take the initial value, the company named King, and merge it with the first company in the list to get a new company with one subsidiary.
Merge the company from the previous step with the second company in the list to get a new company with two subsidiaries.
Merge the company from the previous step with the third company in the list to get a new company with three subsidiaries.

This is a homogeneous process.

`companies.foldLeft("")((acc, company) => acc + company.name)`

Take the initial value, an empty string, and append the name of the first company in the list to get B.
Append the name of the second company to the B from the previous step to get BA.
Append the name of the third company to the BA from the previous step to get BAT.

This is a heterogeneous process.

Like a row of dominoes, the elements of the list are absorbed into the result one by one, from left to right.
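The intermediate results described above can be observed with scanLeft, which applies the same function as foldLeft but keeps every accumulator value along the way. This is only an illustration of the left-to-right order, using the same Company class and list as the earlier examples.

```scala
object DominoTrace {
  case class Company(name: String, children: Seq[Company] = Nil)

  val companies = List(Company("B"), Company("A"), Company("T"))

  // scanLeft returns the initial value followed by the accumulator after each element.
  val steps: List[String] =
    companies.scanLeft("")((acc, company) => acc + company.name)

  def main(args: Array[String]): Unit =
    println(steps) // List(, B, BA, BAT)
}
```

The last element of steps is exactly what foldLeft returns; every earlier element is one fallen domino.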

Now suppose RDD[X] had an API with the same signature as foldLeft. I call it and pass in an f: (Y, X) => Y. What should happen next?

Because the computation is distributed, the many X values have to be split into several parts and sent to different nodes.
Each node computes one Y from the X values it received.
Collecting the results from all the nodes, I now have many Y values in hand.
Hmm... I have no way to turn many Y values into one Y.
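The dead end in the steps above can be sketched with plain Scala collections. The two inner lists are a hypothetical stand-in for the data held by two nodes; no Spark is involved, and the split into partitions is an assumption made for illustration.

```scala
object StuckWithManyResults {
  case class Company(name: String, children: Seq[Company] = Nil)

  // Hypothetical split of the data into two partitions, standing in for two nodes.
  val partitions: Seq[List[Company]] =
    Seq(List(Company("B"), Company("A")), List(Company("T")))

  // The only function the caller supplied: (Y, X) => Y with Y = String and X = Company.
  val f: (String, Company) => String = (acc, company) => acc + company.name

  // Each "node" folds its own partition independently.
  val partials: Seq[String] = partitions.map(_.foldLeft("")(f))

  // partials is a Seq[String]. f cannot merge two Strings, because its second
  // argument must be a Company, so partials.reduce(f) would not compile.

  def main(args: Array[String]): Unit =
    println(partials) // List(BA, T)
}
```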

Unlike Scala's List, which only has to knock over a single row of dominoes, Spark's RDD has to knock over many rows and then combine the results of all of them.

In the homogeneous case, Y is the same type as X, so I can simply apply the same f: (X, X) => X once more to the partial results.
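A minimal sketch of the homogeneous case, again simulating partitions with plain Scala collections. foldPartitions is a hypothetical helper, not Spark's API; it only mirrors the idea behind RDD.fold, which according to Spark's documentation expects an associative function and a neutral zero value.

```scala
object HomogeneousFoldSketch {
  // Hypothetical stand-in for the idea behind RDD.fold:
  // fold every partition, then fold the partial results with the very same function.
  def foldPartitions[T](partitions: Seq[Seq[T]], zero: T)(op: (T, T) => T): T = {
    val partials = partitions.map(_.foldLeft(zero)(op)) // one T per "node"
    partials.foldLeft(zero)(op)                         // the same op merges the partial results
  }

  // Company names spread over two hypothetical partitions.
  val namePartitions: Seq[Seq[String]] = Seq(Seq("B", "A"), Seq("T"))

  val merged: String = foldPartitions(namePartitions, "")(_ + _)

  // The same helper works for any type with an associative op and a neutral zero.
  val total: Int = foldPartitions(Seq(Seq(1, 2), Seq(3)), 0)(_ + _)

  def main(args: Array[String]): Unit = {
    println(merged) // BAT
    println(total)  // 6
  }
}
```

This local simulation always merges the partial results from left to right. Spark's documentation notes that for functions that are not commutative, the result of fold may differ from a fold over a non-distributed collection, so string concatenation is used here only because it is easy to read.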

But in the heterogeneous case I need one more function, f: (Y, Y) => Y, and that is exactly the combOp parameter of aggregate.
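The heterogeneous case can be sketched the same way. aggregatePartitions is a hypothetical helper that imitates the shape of aggregate on plain Scala collections, with the partitions chosen arbitrarily for illustration: seqOp plays the role of the foldLeft function inside each partition, and combOp merges the per-partition results.

```scala
object AggregateSketch {
  case class Company(name: String, children: Seq[Company] = Nil)

  // Hypothetical stand-in for the idea behind RDD.aggregate, without Spark.
  def aggregatePartitions[T, U](partitions: Seq[Seq[T]], zero: U)(
      seqOp: (U, T) => U,
      combOp: (U, U) => U
  ): U = {
    val partials = partitions.map(_.foldLeft(zero)(seqOp)) // (U, T) => U inside each "node"
    partials.foldLeft(zero)(combOp)                        // (U, U) => U across the "nodes"
  }

  val partitions: Seq[Seq[Company]] =
    Seq(Seq(Company("B"), Company("A")), Seq(Company("T")))

  // T = Company, U = String
  val names: String =
    aggregatePartitions(partitions, "")((acc, company) => acc + company.name, _ + _)

  // T = Company, U = Int: count the companies
  val count: Int =
    aggregatePartitions(partitions, 0)((n, _) => n + 1, _ + _)

  def main(args: Array[String]): Unit = {
    println(names) // BAT
    println(count) // 3
  }
}
```

Without combOp the helper would stop at partials, which is the situation where many Y values cannot be turned into one.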

