Do I Need to Cache an RDD If I Use It Multiple Times?

Suppose we have the following code.

x = sc.textFile(...)
y = x.map(...)
z = x.map(...)
Is it essential to cache x here? Will not caching x cause Spark to read the input file twice?


Solution

Not caching x does not necessarily make Spark read the input twice; it depends on which actions you run. Let's go through the possible cases:

Example 1: The file will not be read even once

x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
In this case nothing happens at all: map is a lazy transformation, and since no action is triggered, Spark never reads the file.
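You can observe this laziness directly. The following is a minimal Scala sketch, assuming a SparkContext named sc and a hypothetical input file "data.txt"; the side effect inside map never fires because no action runs:

val x = sc.textFile("data.txt") // hypothetical path
val y = x.map { line =>
  println("mapping: " + line) // in local mode this would print to the console
  line.toUpperCase
}
// No action has been called: the file is not read and nothing is printed.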

Example 2: File read once

x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
The file is read only once, because count() on y triggers the whole lineage of y: textFile followed by map.

Example 3: File read twice

x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
println(z.count()) #Action of RDD
Now actions are run on both y and z. Each depends directly on x, and nothing is cached, so the input file is read twice: once per action.

Example 4: File read once

x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = y.map(...) #Transformation of RDD
println(z.count()) #Action of RDD
Here z depends on y, which depends on x. The single action triggers one pass over that lineage, so the file is read only once.
Example 5: File read twice

x = sc.textFile(...) #creation of RDD
y = x.map(...) #Transformation of RDD
z = y.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
println(z.count()) #Action of RDD
Since actions are now run on two different RDDs and nothing is cached, Spark recomputes the full lineage from the source for each action, so the file is read twice.

Example 6: File read once

x = sc.textFile(...) #creation of RDD
y = x.map(...).cache() #Transformation of RDD
z = y.map(...) #Transformation of RDD
println(y.count()) #Action of RDD
println(z.count()) #Action of RDD
Two different actions are still run, but this time y is cached: the first action reads the file and materializes y in memory, and the second action works on the cached y, so the file is read only once.
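One way to verify this is to count how often the map function actually runs, for example with an accumulator. A sketch, again assuming a SparkContext sc and a hypothetical file "data.txt" (note that task retries can inflate accumulators used inside transformations, so treat the count as approximate):

val passes = sc.longAccumulator("map invocations")
val x = sc.textFile("data.txt")
val y = x.map { line => passes.add(1); line.length }.cache()
val z = y.map(_ * 2)
println(y.count()) // first action: reads the file and materializes y in the cache
println(z.count()) // second action: served from the cached y
println(passes.value) // roughly the number of lines; without .cache() it would be about twice that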

Edit: Additional information

This raises the question: what should be cached and what should not?
Answer: cache any RDD that you will use repeatedly.
Example 7:

x = sc.textFile(...).cache() #creation of RDD, cached because x is reused
y = x.map(...) #Transformation of RDD
z = x.map(...) #Transformation of RDD
In this case we use x over and over again, so it is recommended to cache x: Spark then does not have to re-read x from the source for every action. If you are dealing with large amounts of data, this will save you a lot of time.

Suppose you cache RDDs in memory or on disk, serialized or not. If Spark does not have enough memory for its tasks, it starts evicting old RDD partitions using an LRU (Least Recently Used) policy. Whenever an evicted RDD is needed again, Spark recomputes it from its lineage, performing every step from the source to that RDD again.
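How and where an RDD is stored is controlled by its storage level. The following sketch (assuming, as before, a SparkContext sc and a hypothetical file "data.txt") caches serialized data in memory with spill-over to disk, then releases the storage when it is no longer needed:

import org.apache.spark.storage.StorageLevel

val x = sc.textFile("data.txt")
val y = x.map(_.length)
y.persist(StorageLevel.MEMORY_AND_DISK_SER) // serialized in memory, spilling to disk under pressure
println(y.count()) // first action materializes the cache
println(y.sum())   // second action is served from the cache
y.unpersist()      // release the storage once y is no longer needed

Note that cache() is shorthand for persist(StorageLevel.MEMORY_ONLY); use persist with an explicit level when memory is tight.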