Suppose we have the following code.
x = sc.textFile(...)
y = x.map(...)
z = x.map(...)
Is it essential to cache x here? Does not caching x cause Spark to read the input file twice?
Solution
Not caching x does not necessarily make Spark read the input file twice. Let's go through all the possible scenarios:
Example 1: The file will not be read even once
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
In this case, Spark does nothing: transformations are lazy, and since no action is called, the file is never read.
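You can observe this laziness directly. Below is a minimal PySpark sketch (the path, app name, and lambdas are placeholders for illustration, not the code from the question): defining the RDD and its transformations does not touch the file at all.
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-demo")       # placeholder local setup
x = sc.textFile("/tmp/input.txt")                # placeholder path; the file is not opened here
y = x.map(lambda line: line.upper())             # transformation only, evaluated lazily
z = x.map(lambda line: len(line))                # transformation only, evaluated lazily
# No action has been called, so Spark never reads the file.
sc.stop()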
Example 2: File read once
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
The file is read only once, when the action forces y's map to be computed.
Example 3: File read twice
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
println(z.count()) #Action of RDD
Now that an action is called on each of the two RDDs derived from x, and x is not cached, each count recomputes its lineage from the source, so the input file is read twice.
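If you want to see why this costs two reads, you can print each RDD's lineage with toDebugString(); both y and z trace back to the text file independently, so each action goes all the way to the source. A small sketch using the names above (the exact output format varies between Spark versions):
# toDebugString() shows the chain of dependencies for an RDD
# (in PySpark it may come back as bytes, so it is printed raw here).
print(y.toDebugString())
print(z.toDebugString())
# Both lineages end at the same textFile source, so each count re-reads the input.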
Example 4: File read once
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = y.map(...) #Transformation of RDD
println(z.count()) #Action of RDD
The file is read only once: z is built on top of y, so a single lineage from the source computes both maps.
Example 5: File read twice
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = y.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
println(z.count()) #Action of RDD
Since actions are now called on two different RDDs and nothing is cached, each action recomputes the whole lineage from the source, so the file is read twice.
Example 6: File read once
x = sc.textFile(...) #creation of RDD
y = x.map(...).cache() #Transformation of RDD
z = y.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
println(z.count()) #Action of RDD
Even though two actions are called, the file is read only once: the first action computes y and stores it in memory because of cache(), and the second action computes z from the cached y instead of going back to the source.
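Once the cached RDD is no longer needed, it is worth releasing it so the executor memory can be reused. A small follow-up sketch with the names from this example:
print(y.is_cached)   # True, since y was marked with cache()
y.unpersist()        # remove y's cached partitions from memory
print(y.is_cached)   # False; further actions on z would recompute y from the source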
Edit: Additional information
This raises the question: what should be cached and what should not?
Answer: cache the RDDs that you will use repeatedly.
Example 7:
x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
In this case, x is used over and over again, so it is recommended to cache x: Spark then does not have to read x from the source for every action. If you are dealing with a large amount of data, this saves a lot of time.
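Concretely, the recommendation looks something like the sketch below (the path and the map functions are placeholders, not the asker's actual code): the first action reads the file once and fills the cache, and the second action reuses the cached x.
x = sc.textFile("/tmp/input.txt").cache()    # placeholder path; mark x to be kept in memory
y = x.map(lambda line: line.split(","))      # placeholder transformation
z = x.map(lambda line: line.strip())         # placeholder transformation
print(y.count())   # first action: reads the file once and materializes x in the cache
print(z.count())   # second action: computed from the cached x, no second read of the file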
Suppose you persist all of your RDDs in memory and/or on disk, with or without serialization. If Spark does not have enough memory for its tasks, it will start evicting old RDD partitions using an LRU (Least Recently Used) policy. Whenever an evicted RDD is needed again, Spark recomputes it by re-running all the steps from the source up to that RDD.
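For completeness, cache() is just persist() with the default storage level; if eviction and recomputation are a concern, you can choose a level that spills to disk instead. A hedged PySpark sketch (y stands for any RDD you want to keep around):
from pyspark import StorageLevel

# For RDDs, cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
y.persist(StorageLevel.MEMORY_AND_DISK)   # partitions that do not fit in memory are written to disk
                                          # and read back later, instead of being recomputed from source
# Other levels (e.g. DISK_ONLY, MEMORY_AND_DISK_2) trade memory, disk, and replication differently.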